Module 13: Detection Authoring & Reporting

Type 13 · Eval Harness: author YARA and Sigma rules for the synth loader and validate them against a held-out benign corpus with a match/no-match scorecard and a false-positive gate. Deliverable: a completed analysis report plus rules proven to fire on the sample (and a variant) while staying quiet on clean binaries. (Secondary: Tool-Build, since the corpus + scorecard runner is reusable.)

Last reviewed: 2026-06

Malware Analysis — close the loop: turn your analysis into validated detections and a report that other teams can act on.

Difficulty: Intermediate · Estimated time: ~4–6 hrs (study + lab) · Type: Eval Harness · Prerequisites: Foundations

In 60 seconds

Analysis without detection is intelligence locked in a workstation. The final deliverable isn't a narrative — it's a package of validated detections plus a report each audience can act on. The discipline that makes a rule real is the train/dev/test wall: "it matched my sample" is a demo, not validation, because you tuned it on that sample. You grade against a held-out corpus and watch both recall and precision — because coverage ≠ precision, and a high-recall, false-positiving rule is worse than no rule. The proof is a regression gate you've watched go red.

Why this matters

Analysis without detection is intelligence that stays locked in an analyst's workstation. A YARA rule that fires on the packer stub from Module 09, or a Sigma rule that matches the registry persistence key from Module 12, turns your findings into something a SOC can actually deploy. The final deliverable of any malware analysis engagement is not a narrative document; it is a package of validated detections plus a report that gives the detection engineer, the threat intel analyst, and the executive enough context to act at their own level.

Cobalt Strike detection is the canonical lesson in why precision, not just coverage, is the whole job. Its default Beacon profile fetches /jquery-3.3.1.min.js with a synthetic header set; a naive rule on that URI catches default Beacons and also floods on the legitimate jQuery requests that ordinary web browsing generates. The durable detections defenders actually deploy target the harder-to-change observables: the default SMB named pipes (_msagent_#) and the default JARM/JA3 TLS fingerprints. That is exactly the "find the observable the adversary can't cheaply rotate" discipline this module drills. (The DFIR Report, "Cobalt Strike, a Defender's Guide (Part 2)", documents the default profile, named pipes, and JARM/JA3 signatures.)

Objective

Write a YARA rule that matches the malicious pattern in the synth loader sample but stays silent on a corpus of clean binaries (false-positive discipline); write a Sigma rule for the registry persistence behaviour observed in the sandbox report; validate both rules; and produce a completed malware analysis report using the provided template.

The core idea

The mental model

A YARA rule is a structured hypothesis: "I believe samples with this combination of byte patterns are related." The discipline is specificity — too broad and false positives erode analyst trust until the rule is disabled; too narrow and it misses variants the moment the family rotates. The balance point is the observable the adversary can't cheaply change: a distinctive algorithm constant, a config-blob header, a mutex derived from the victim hostname. Packer-stub strings are cheap to change; decrypted-payload strings are durable.

Cobalt Strike makes the spectrum concrete: the jquery-3.3.1.min.js URI is the cheap observable — one line in a malleable C2 profile flips it, and meanwhile real jQuery traffic collides with it — whereas the default named pipe name and the JARM/JA3 fingerprint cost the operator more to change, so a rule built on those is both more durable and far less prone to false positives.
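
The cheap-versus-durable distinction can be sketched in a few lines of Python. This is a toy stand-in for two YARA conditions, not YARA itself: the byte values, the `cheap_rule` and `durable_rule` helpers, and the three sample blobs are all invented for illustration, and the payload bytes are shown as they might appear in an unpacked or in-memory image.

```python
# Toy stand-in for two YARA conditions using plain byte searches.
# Every byte value here is hypothetical and exists only for illustration.

PACKER_STUB = b"UPX!"                      # cheap observable: gone once the packer changes
ALGO_CONSTANT = bytes.fromhex("5a17c0de")  # durable observable: made-up cipher constant
CONFIG_HEADER = b"SLCF\x01"                # durable observable: made-up config-blob magic


def cheap_rule(data: bytes) -> bool:
    """Fire on the packer stub alone."""
    return PACKER_STUB in data


def durable_rule(data: bytes) -> bool:
    """Fire only when both payload-level observables are present."""
    return ALGO_CONSTANT in data and CONFIG_HEADER in data


payload = ALGO_CONSTANT + b"\x00" * 4 + CONFIG_HEADER
samples = {
    "original": PACKER_STUB + payload,
    "repacked_variant": b"MPRESS1" + payload,            # adversary rotated the packer
    "benign_packed": PACKER_STUB + b"ordinary program",  # clean file, same packer
}

for name, blob in samples.items():
    print(f"{name:17} cheap={cheap_rule(blob)!s:5} durable={durable_rule(blob)}")
```

In this sketch the packer-stub rule loses the repacked variant and fires on the clean packed file, while the payload-level rule does neither. A real rule would still need to be graded on files it was not written against.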

Sigma is YARA's counterpart for log events. Where YARA operates on file bytes, Sigma operates on structured fields in log events. A Sigma rule has a detection block with one or more search identifiers (typically a map of field names to values) and a condition that combines them. The design goal is portability: the same Sigma rule can be converted to Splunk SPL, Elastic EQL, Microsoft Sentinel KQL, or other SIEM query languages using the sigma-cli backends. Writing a Sigma rule is not writing a SIEM query; it is writing the abstraction layer that a query compiler targets. That distinction matters: keep backend-specific syntax out of the detection block.
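
To make "search identifiers plus a condition" concrete, here is a minimal Python model of a detection block evaluated against one event. It is a toy evaluator, not pySigma or sigma-cli. It assumes Sysmon-style `TargetObject` and `Image` fields, the `filter_installer` exclusion is hypothetical, and both events are made up.

```python
# Toy model of a Sigma-style detection block. Not pySigma or sigma-cli:
# it only shows how search identifiers and a condition fit together.

detection = {
    "selection": {
        "TargetObject|endswith": r"\Software\Microsoft\Windows\CurrentVersion\Run\SynthUpdater",
    },
    "filter_installer": {  # hypothetical exclusion, for illustration only
        "Image|endswith": r"\TrustedInstaller.exe",
    },
}


def field_matches(event: dict, key: str, expected: str) -> bool:
    """Match one 'field|modifier' entry, comparing strings case-insensitively."""
    field, _, modifier = key.partition("|")
    actual = str(event.get(field, "")).lower()
    expected = expected.lower()
    if modifier == "endswith":
        return actual.endswith(expected)
    if modifier == "contains":
        return expected in actual
    return actual == expected


def identifier_matches(event: dict, fields: dict) -> bool:
    """Every field inside one search identifier must match (logical AND)."""
    return all(field_matches(event, k, v) for k, v in fields.items())


def rule_fires(event: dict) -> bool:
    hits = {name: identifier_matches(event, f) for name, f in detection.items()}
    # Stand-in for the condition string: selection and not filter_installer
    return hits["selection"] and not hits["filter_installer"]


malicious_event = {
    "TargetObject": r"HKU\S-1-5-21-1111\Software\Microsoft\Windows\CurrentVersion\Run\SynthUpdater",
    "Image": r"C:\Users\lab\AppData\Local\Temp\loader.exe",
}
benign_event = {
    "TargetObject": r"HKU\S-1-5-21-1111\Software\Microsoft\Windows\CurrentVersion\Run\OneDrive",
    "Image": r"C:\Program Files\Microsoft OneDrive\OneDrive.exe",
}
```

Nothing in the `detection` map refers to a particular SIEM, which is the property that lets a converter target several backends from one rule.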

Rule validation is not optional, and "it matched my sample" is not validation — it is a demo. You wrote the rule against that sample, so of course it fires; that tells you nothing about the next file. A rule is a hypothesis, and a hypothesis is only as good as the data you test it on that you didn't tune it on. The discipline that turns a rule into a measured detection is the same train/dev/test wall machine-learning uses: tune against the sample in front of you, but grade against a held-out corpus — known-malicious variants the rule has never seen (the family that dropped the mutex, or rotated the packer) plus benign files chosen to be hard, including near-misses that share one superficial feature with the family (a legitimate utility that uses the same public algorithm constant; a parser that reads the same config tag; an app whose mutex name merely contains the substring). The held-out set is the only honest estimate of how the rule behaves in production.

flowchart LR
    R["rule tuned on<br/>the sample"] --> H["score on held-out corpus<br/>(unseen variants + benign near-misses)"]
    H --> M{"recall AND<br/>precision OK?"}
    M -->|"no — missed or<br/>false-positived"| G["gate exits non-zero<br/>— build fails"]
    M -->|"yes"| P["ship the rule"]
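
One way to enforce the held-out wall mechanically is to hash every file and refuse to score if a tuning sample has leaked into the held-out set. The sketch below assumes a simple in-memory layout; the sample names, byte contents, and the `check_wall` helper are hypothetical stand-ins for files you would normally read from disk.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for file contents; real corpora live on disk.
tuning_set = {"sample_a": b"synth loader build 1"}

held_out = [
    # (name, contents, label)
    ("variant_no_mutex", b"synth loader build 2", "malicious"),
    ("variant_repacked", b"synth loader build 3", "malicious"),
    ("benign_same_constant", b"legit utility sharing a constant", "benign"),
    ("benign_mutex_substring", b"app with a similar mutex name", "benign"),
]


def leaked_samples(tuning: dict, held: list) -> list:
    """Names of held-out files whose bytes also appear in the tuning set."""
    tuned_hashes = {sha256(blob) for blob in tuning.values()}
    return [name for name, blob, _ in held if sha256(blob) in tuned_hashes]


def check_wall(tuning: dict, held: list) -> list:
    """Return a list of problems; an empty list means the wall holds."""
    problems = [f"leak: {name}" for name in leaked_samples(tuning, held)]
    labels = {label for _, _, label in held}
    for required in ("malicious", "benign"):
        if required not in labels:
            problems.append(f"no {required} samples in held-out set")
    return problems


print(check_wall(tuning_set, held_out))
```

Exact-hash comparison only catches byte-identical leaks. A variant that differs by one byte passes this check, so the labelling of the held-out set still needs human review.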

The gotcha

The trap the held-out set exposes is that coverage is not precision. A rule that catches every malicious sample but also fires on the benign near-misses has perfect recall and terrible precision — and a high-recall, low-precision rule is worse than no rule at all. It doesn't fail quietly: it floods the SOC, erodes trust, gets disabled — taking its real detections out of service with it. Watch recall (of the malicious samples, how many caught?) and precision (of what fired, how much was truly malicious?) — accuracy alone hides the failure. The minimum is not "ten benign binaries, zero matches"; it is a labelled held-out set, scored, where you find the knee of the tradeoff deliberately.
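
The arithmetic behind "accuracy alone hides the failure" is short enough to work through. The sketch below uses invented counts (10 malicious and 990 benign held-out files, with an over-broad rule that fires on 40 of the benign ones); the `Scorecard` class and `score` helper are illustrative names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    tp: int  # malicious, rule fired
    fp: int  # benign, rule fired
    fn: int  # malicious, rule silent
    tn: int  # benign, rule silent

    @property
    def recall(self) -> float:
        caught_or_missed = self.tp + self.fn
        return self.tp / caught_or_missed if caught_or_missed else 0.0

    @property
    def precision(self) -> float:
        fired = self.tp + self.fp
        return self.tp / fired if fired else 0.0

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


def score(results: list) -> Scorecard:
    """Build a scorecard from (label, fired) pairs."""
    tp = sum(1 for label, fired in results if label == "malicious" and fired)
    fn = sum(1 for label, fired in results if label == "malicious" and not fired)
    fp = sum(1 for label, fired in results if label == "benign" and fired)
    tn = sum(1 for label, fired in results if label == "benign" and not fired)
    return Scorecard(tp=tp, fp=fp, fn=fn, tn=tn)


# Invented outcome for an over-broad rule on a 1,000-file held-out set.
broad_rule = (
    [("malicious", True)] * 10
    + [("benign", True)] * 40
    + [("benign", False)] * 950
)
card = score(broad_rule)
print(f"recall={card.recall:.2f} precision={card.precision:.2f} accuracy={card.accuracy:.2f}")
```

With these counts the rule scores 96% accuracy and perfect recall while four out of every five alerts are false, which is the failure that a recall-and-precision scorecard surfaces and an accuracy figure buries.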

The deliverable that makes this engineering rather than a one-off check is a regression gate: an eval that scores the rule against the held-out corpus and exits non-zero — failing the build — when the rule misses a malicious sample or fires on a benign one, exactly as a unit test fails on a broken function. The proof that the gate works is seeing it go red: ship a deliberately over-broad rule and watch the gate catch its false positives. A gate you have only ever seen pass is not a gate. This is what lets a team update a rule on a Friday without praying it didn't quietly start matching svchost.exe.
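
A minimal sketch of such a gate, assuming scan results arrive as (name, label, fired) tuples: the `gate` function, its exit codes, and both result sets are hypothetical. The point it illustrates is the fail-closed direction, where a missing rule or an empty corpus fails just as loudly as a bad score.

```python
import sys


def gate(results: list, rule_loaded: bool = True) -> int:
    """Return a process exit code; 0 only when the rule is clean on the corpus.

    results is a list of (name, label, fired) tuples from the held-out set.
    """
    # Fail closed: a rule that did not load or an empty corpus is a failure.
    if not rule_loaded or not results:
        print("GATE ERROR: nothing was evaluated", file=sys.stderr)
        return 2

    misses = [n for n, label, fired in results if label == "malicious" and not fired]
    false_positives = [n for n, label, fired in results if label == "benign" and fired]

    for name in misses:
        print(f"MISS           {name}", file=sys.stderr)
    for name in false_positives:
        print(f"FALSE POSITIVE {name}", file=sys.stderr)

    return 1 if misses or false_positives else 0


# Invented outcomes for a tight rule and a deliberately over-broad one.
tight_rule_results = [
    ("variant_no_mutex", "malicious", True),
    ("variant_repacked", "malicious", True),
    ("benign_same_constant", "benign", False),
]
broad_rule_results = [
    ("variant_no_mutex", "malicious", True),
    ("variant_repacked", "malicious", True),
    ("benign_same_constant", "benign", True),  # the near-miss trips the gate
]

print("tight rule exit code:", gate(tight_rule_results))
print("broad rule exit code:", gate(broad_rule_results))
```

In a build script the return value would be passed to `sys.exit`, so that a non-zero code stops the pipeline. Running the over-broad result set first is the quickest way to confirm the gate can go red.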

Go deeper: the report has three audiences

The report follows a standard structure because its audience is not one person. The executive summary is for leadership — one paragraph, no jargon: what happened, what changed. The technical findings are for IR engineers and analysts — hashes, timeline, capability summary, behavioural observations. The detection and remediation section is for the SOC and blue team — the validated rules, the ATT&CK layer, and the specific changes to make (block the domain, push the YARA rule, update the Sigma threshold). Each section is written for its reader, not for its author.

AI caveat

A model drafts rules and the confusion-matrix scorecard well — but it quietly gets the load-bearing parts wrong: it will score on the same sample it tuned the rule against (you enforce the held-out wall), reach for accuracy (you override to recall and precision), and may leave the gate failing-open. When it generates adversarial near-miss fixtures, label each yourself — a model grading its own test set is the contamination the held-out discipline exists to prevent.

Learn (~3 hrs)

YARA rule writing

- YARA official documentation, "Writing YARA Rules": covers strings, conditions, and modules; read the "Strings" and "Conditions" sections (~40 min).
- YARA Modules Documentation: understand how to use the PE module (imports, sections, version info) to write specific, low-false-positive rules (~20 min).

Sigma rule writing

- Sigma project, sigmahq.io rule creation guide: the canonical creation guide; read "Writing Your First Rule" through "Condition Syntax" (~30 min).
- sigma-cli documentation: how to convert and validate rules from the command line; read the "Usage" section (~15 min).

Malware analysis reporting

- SANS, "Writing a Malware Analysis Report" (whitepapers/680): a practitioner template that has been standard for 15 years; read it before writing your report (~30 min).

A real detection precision case study (~25 min)

- The DFIR Report, "Cobalt Strike, a Defender's Guide (Part 2)": a reputable DFIR walkthrough of detecting Cobalt Strike, covering the default jquery.min.js profile (the cheap, false-positive-prone observable), the default _msagent_# named pipes, and JARM/JA3 fingerprints (the durable ones). The clearest real-world illustration of why coverage ≠ precision and which observables to anchor a rule on.

Key concepts

YARA rules are hypotheses; the observable must be hard for the adversary to change.
"It matched my sample" is a demo, not validation — grade against a held-out corpus (known-malicious variants + benign near-misses) the rule was never tuned on.
Coverage ≠ precision: a high-recall, low-precision rule that false-positives is worse than no rule — it floods the SOC and gets disabled. Watch recall and precision, not accuracy.
The deliverable is a regression gate: an eval that fails the build on a missed malicious sample or a benign false positive — and you've seen it go red on a too-broad rule.
Sigma rules target log fields, not file bytes; write the abstraction, not the backend query.
sigma-cli convert validates structure and produces backend-specific queries.
A complete report has three audiences: executive, technical, and operational.
ATT&CK Navigator layers and validated rule packages are the deliverables, not the document.
Real worked framework: Cobalt Strike. Its default jquery.min.js URI (cheap, false-positive-prone) vs. its named pipes / JARM / JA3 (durable) is the textbook coverage-vs-precision tradeoff this module teaches you to navigate.

AI acceleration

Generate a first-draft YARA rule by feeding your IOC strings to a model: "Write a YARA rule that matches a PE file containing the string 'SynthLoader_Mutex_A1B2' and any of these byte sequences. Include PE import hash if possible." Then generate a Sigma rule draft: "Write a Sigma rule for Windows Event Log that detects registry key creation under HKCU\Software\Microsoft\Windows\CurrentVersion\Run with the value name 'SynthUpdater'." For both: validate the rule structure manually, run it against the test corpus, and adjust — do not deploy AI-drafted rules without testing.

A model will also happily build your eval — let it draft the confusion-matrix arithmetic and the scorecard table, which is boilerplate it writes well. But what you must own is everything a model quietly gets wrong here: the held-out wall (a model will cheerfully score on the same sample it tuned the rule against — you enforce the separation), the metric (it reaches for accuracy; you override to recall and precision and justify it), and the gate's fail-closed direction (does a missing rule or a broken eval fail the build, or silently pass?). When you ask it to generate adversarial near-miss benign fixtures — legitimate files crafted to look like the family — label each one yourself and verify it, because a model labelling its own test set is the contamination the held-out discipline exists to prevent.

Check yourself

Why is "my rule matched the sample I wrote it for" a demo rather than validation?
A rule has perfect recall but fires on benign near-misses. Why is that worse than shipping no rule?
What single observable behaviour proves your regression gate actually gates, rather than only ever passing?
