Lab 09 - Detection-as-Code: the Scored Regression Gate
Setup
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/automation/09-detection-as-code-pipelines
make up # sigma-cli + pytest in Docker
make demo # stage 1: sigma check (lint) over data/rules/ + stage 2: pytest intent table
make eval # the eval harness: scorecard over the HELD-OUT corpus + the regression gate (GREEN)
make gate # proves the gate goes RED on the planted regression (exits non-zero)
make shell
make down
data/rules/ ships five Sigma rules: four valid, one deliberately broken (a misspelled condition:
key that sigma check rejects). data/tests/ carries the conftest.py matcher (match_rule(rule, event))
and the per-rule tuning events — the set you write the pytest table against.
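As a rough picture of what an intent table looks like, here is a minimal sketch. Everything in it is an assumption made for illustration: the stand-in match_rule is a toy substring matcher (the real one lives in data/tests/conftest.py and evaluates actual Sigma rules), and the rule dict and events are invented, not taken from data/rules/.

```python
# Hypothetical stand-in for the lab's match_rule(rule, event).
# It only does case-insensitive substring checks on a made-up rule shape.
def match_rule(rule, event):
    """True if every selection field has at least one needle present in the event."""
    return all(
        any(needle.lower() in str(event.get(field, "")).lower() for needle in needles)
        for field, needles in rule["selection"].items()
    )

# Illustrative rule and events (not the lab's data).
ENCODED_PS = {"selection": {"Image": ["powershell.exe"], "CommandLine": ["-enc"]}}
MALICIOUS = {
    "Image": r"C:\Windows\System32\powershell.exe",
    "CommandLine": "powershell -enc SQBFAFgA",
}
BENIGN = {
    "Image": r"C:\Windows\System32\powershell.exe",
    "CommandLine": "powershell Get-Date",
}

# One match row and one no-match row per rule: (rule, event, expected verdict).
# Under pytest the same rows would feed
# @pytest.mark.parametrize("rule, event, expected", INTENT_TABLE).
INTENT_TABLE = [
    (ENCODED_PS, MALICIOUS, True),
    (ENCODED_PS, BENIGN, False),
]

def failing_rows(table):
    """Return the rows whose actual verdict differs from the expected one."""
    return [row for row in table if match_rule(row[0], row[1]) is not row[2]]

print(failing_rows(INTENT_TABLE))
```

The no-match row matters as much as the match row: a rule that fires on everything passes every True row.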
data/heldout/ is the eval-harness layer this module adds. data/heldout/corpus.jsonl is a held-out,
labelled event corpus the rules were never tuned on — known-malicious events (including variants like
the -enc short form and the long -EncodedCommand) and known-benign lookalikes (a signed updater
writing a Run key, a certutil -decode by an installer, an EDR sensor's allow-listed LSASS read).
data/heldout/rules-regressed/ is a deliberately degraded copy of the ruleset for the gate proof.
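One way to keep the tuned-on and graded-on sets honest is to check that no held-out event also appears in the tuning set. The sketch below is an assumption about how you might do that, not something the lab ships: it fingerprints each event by hashing its canonical JSON, and the helper names are hypothetical.

```python
import hashlib
import json

def fingerprint(event):
    """Stable hash of an event dict, independent of key order."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def leaked_events(tuning_events, heldout_events):
    """Return held-out events that are byte-for-byte present in the tuning set."""
    tuned = {fingerprint(event) for event in tuning_events}
    return [event for event in heldout_events if fingerprint(event) in tuned]

# Invented events for illustration only.
tuning = [{"Image": "powershell.exe", "CommandLine": "powershell -enc SQBFAFgA"}]
heldout = [
    {"CommandLine": "powershell -enc SQBFAFgA", "Image": "powershell.exe"},  # same event, keys reordered
    {"Image": "powershell.exe", "CommandLine": "powershell -EncodedCommand SQBFAFgA"},
]
print(leaked_events(tuning, heldout))
```

An exact-match check only catches copies. A held-out event that differs by one character is not flagged, so reviewing near-duplicates is still on you.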
eval.py scores a ruleset against the corpus, prints a scorecard, and — with --gate — exits non-zero
on a regression. It reuses the same in-process matcher the pytest suite uses, so make eval is
deterministic and offline — no SIEM, no network. In real use you point --corpus at events exported
from your own SIEM.
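The scoring step can be sketched in a few lines. This is not the lab's eval.py: the JSONL schema (a label of "malicious" or "benign" plus an event object), the sample records, and the toy detector are all assumptions chosen so the example runs on its own.

```python
import json

def load_corpus(lines):
    """Parse JSONL lines into (label, event) pairs, skipping blank lines."""
    records = (json.loads(line) for line in lines if line.strip())
    return [(record["label"], record["event"]) for record in records]

def score(corpus, fires):
    """Build a confusion matrix, then derive recall and FP-rate from it."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for label, event in corpus:
        hit = fires(event)
        if label == "malicious":
            counts["tp" if hit else "fn"] += 1
        else:
            counts["fp" if hit else "tn"] += 1
    attacks = counts["tp"] + counts["fn"]
    benign = counts["fp"] + counts["tn"]
    return {
        **counts,
        "recall": counts["tp"] / attacks if attacks else 0.0,
        "fp_rate": counts["fp"] / benign if benign else 0.0,
    }

# Invented corpus in the assumed schema.
SAMPLE = [
    '{"label": "malicious", "event": {"CommandLine": "powershell -enc SQBFAFgA"}}',
    '{"label": "malicious", "event": {"CommandLine": "powershell -EncodedCommand SQBFAFgA"}}',
    '{"label": "benign", "event": {"CommandLine": "powershell Get-Date"}}',
    '{"label": "benign", "event": {"CommandLine": "certutil -decode setup.b64 setup.exe"}}',
]

def short_flag_only(event):
    """Toy detector: knows the short flag followed by a space, so it misses the long form."""
    return "-enc " in event["CommandLine"].lower()

print(score(load_corpus(SAMPLE), short_flag_only))
```

The toy detector catches one of the two attack variants and fires on neither benign event, so the scorecard shows recall 0.5 with FP-rate 0.0: quiet, and half blind.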
Honor system: the gate is a regression guard for you, not a grader. There is no answer key being checked.
Scenario
A detection-engineering team runs a Sigma rule repo. The rules are in git and they pass review — and they still rot. A "cosmetic" refactor drops a CommandLine variant; an over-broad selection floods the queue; nobody notices until the alert that mattered never fires. Your job: gate the repo so a syntactically valid rule cannot merge broken in intent, then make that measurable — score every rule against a held-out corpus and fail CI the day the numbers regress.
These rules detect attacker techniques but attack nothing — no authorization note needed. The events are recorded/synthetic Sysmon-shaped records, replayed offline.
Do
- [ ] Run make demo and watch stage 1 fail on the broken rule. Read the sigma check error: which rule, what syntax error? Fix the condition key in data/rules/ and rerun sigma check data/rules/ to confirm exit 0.
- [ ] Read data/tests/conftest.py and understand match_rule(rule, event) and how it returns True/False. Write data/tests/test_detections.py with pytest.mark.parametrize rows for all five rules. For each rule: one (rule, malicious_event, True) and one (rule, benign_event, False). Run pytest data/tests/ -v and confirm all 10 rows are green. This is your tuning set; passing it is necessary but not sufficient.
- [ ] Write ci-gate.sh: sigma check data/rules/ && pytest data/tests/ -q. Confirm it exits 0 on clean rules and 1 when a rule is broken. (This is the Type-8 gate; steps 4–7 add the Type-13 eval harness.)
- [ ] Read the held-out corpus. Open data/heldout/corpus.jsonl. Note it is separate from your pytest events and richer: it carries the variants of each technique and the near-miss benigns your rules must NOT fire on. That wall, tuned-on vs. graded-on, is what makes the score honest.
- [ ] Score it. Run make eval. Read the scorecard: the confusion matrix, then recall (caught attacks / all attacks) and FP-rate (benign events that fired / all benign events). Note that accuracy looks high even when a variant is missed, which is exactly why you watch recall, not accuracy, on an imbalanced corpus.
- [ ] Plant and catch a regression. Run make gate (or python3 eval.py --rules data/heldout/rules-regressed --gate recall=0.90). The regressed ruleset dropped a CommandLine variant and over-broadened one selection; watch recall drop (missed attacks listed) and/or FP-rate climb, the gate go RED, and the process exit non-zero. Confirm the good ruleset (make eval) is GREEN at the same threshold.
- [ ] Tune toward the gate. Pick one rule, deliberately widen or narrow it, re-run make eval, and read the scorecard move: which events did you newly catch or newly mis-fire on? Find the knee of the recall/FP-rate tradeoff deliberately, from the numbers, not by feel.
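Two of the steps above come down to simple arithmetic, sketched here with made-up numbers. The first half shows why accuracy flatters a ruleset on an imbalanced corpus; the second shows one way to read a scorecard move as set differences. The counts, event ids, and function names are all hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, and FP-rate from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
    }

# Made-up imbalanced corpus: 10 attacks, 190 benign events.
# The ruleset misses 4 attack variants and mis-fires on 2 benign events.
lopsided = metrics(tp=6, fn=4, fp=2, tn=188)
print(lopsided)  # accuracy is 0.97 while recall is only 0.6

def diff_fired(before, after, malicious_ids):
    """Compare the event ids that fired before and after a rule change."""
    gained, lost = after - before, before - after
    return {
        "newly_caught": sorted(gained & malicious_ids),
        "newly_missed": sorted(lost & malicious_ids),
        "new_false_positives": sorted(gained - malicious_ids),
        "cleared_false_positives": sorted(lost - malicious_ids),
    }

malicious_ids = {"m1", "m2", "m3"}
before = {"m1", "b7"}        # ids that fired before the change
after = {"m1", "m2", "b9"}   # ids that fired after widening a selection
change = diff_fired(before, after, malicious_ids)
print(change)
```

In the second example the widened rule gains one attack (m2) at the cost of one new false positive (b9), which is the tradeoff the last step asks you to read off the numbers.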
Success criteria: you're done when
- [ ] All five rules pass sigma check; all 10 pytest rows (5 match + 5 no-match) pass.
- [ ] ci-gate.sh exits 1 on a broken rule and 0 when everything is clean.
- [ ] You have a scorecard over the held-out corpus: recall + FP-rate, not just "it fired in the demo."
- [ ] Your regression gate is GREEN on the good ruleset and you have seen it go RED on the planted regression (exits non-zero) — a gate you've only watched pass isn't a gate.
- [ ] You can explain why each false-positive row and each held-out near-miss proves the rule is precise, and why you grade on recall, not accuracy.
Deliverables
Fixed data/rules/<broken-rule>.yml + data/tests/test_detections.py + ci-gate.sh, plus the eval
harness: the held-out corpus, eval.py (scorecard), and the regression gate (make eval /
make gate targets) — committed so a detection cannot silently regress. Lab artifacts (raw event dumps)
stay out of commits; the curated held-out corpus is committed on purpose.
Automate & own it
Required. Don't stop at the two-stage gate — turn the ruleset into a scored regression gate so it
can't silently rot. Wire eval.py to score the rules against your held-out corpus and exit non-zero
when recall drops below your floor (or FP-rate climbs past your ceiling) — exactly as a unit test fails on
a broken function. Then add it as a third CI stage and prove it both ways: GREEN on the good rules, RED on
data/heldout/rules-regressed/. A model drafts the metric arithmetic and the scorecard table; you own the
metric choice (recall on the malicious class, not accuracy), the held-out wall, and the gate's fail-closed
direction (a broken eval must fail the build, not silently pass). Commit eval.py + the corpus + the gate
alongside the rules.
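The fail-closed direction is the part most worth getting right, so here is a minimal sketch of it. It assumes the name=threshold gate syntax used with --gate in this lab; the function names and the metrics dict are hypothetical, and this is not the lab's eval.py.

```python
import sys

# Direction per metric: recall is a floor, FP-rate is a ceiling.
HIGHER_IS_BETTER = {"recall": True, "fp_rate": False}

def parse_gates(specs):
    """Turn ['recall=0.90', 'fp_rate=0.10'] into {'recall': 0.9, 'fp_rate': 0.1}."""
    if not specs:
        raise ValueError("no gates configured")
    gates = {}
    for spec in specs:
        name, sep, value = spec.partition("=")
        if not sep:
            raise ValueError(f"expected name=threshold, got {spec!r}")
        gates[name.strip()] = float(value)
    return gates

def gate_failures(metrics, gates):
    """List every gate that is breached, unknown, or missing from the metrics."""
    failures = []
    for name, threshold in gates.items():
        value = metrics.get(name)
        if name not in HIGHER_IS_BETTER or value is None:
            failures.append(f"{name}: unknown or not measured")
        elif HIGHER_IS_BETTER[name] and value < threshold:
            failures.append(f"{name}={value:.2f} is below the floor {threshold:.2f}")
        elif not HIGHER_IS_BETTER[name] and value > threshold:
            failures.append(f"{name}={value:.2f} is above the ceiling {threshold:.2f}")
    return failures

def run_gate(metrics, specs):
    """Return an exit code: 0 is GREEN, 1 is RED. Any error in the eval counts as RED."""
    try:
        failures = gate_failures(metrics, parse_gates(specs))
    except Exception as exc:  # fail closed: a broken eval must not pass the build
        failures = [f"eval error: {exc}"]
    for failure in failures:
        print(f"RED  {failure}", file=sys.stderr)
    return 1 if failures else 0
```

A script would finish with sys.exit(run_gate(metrics, specs)). The design choice to notice is that a missing metric, a malformed threshold, and an empty gate list all land on RED rather than being skipped.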
Then make CI run it: a GitHub Actions workflow (.github/workflows/sigma-ci.yml) that runs sigma check,
pytest, and python3 eval.py --gate recall=0.90 --gate fp_rate=0.10 on every push and PR. Have a model
draft it; review — does it pin the action SHAs and the container version (tie to the repo's Actions
hardening)? Does it fail the PR when recall regresses? Commit the workflow.
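The SHA-pinning question in that review can be partly automated. The sketch below is a rough line-based check, assuming unquoted uses: references; it is not a YAML parser. The workflow text is invented and the 40-character SHA in it is a placeholder, not a real commit.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES = re.compile(r"^\s*-?\s*uses:\s*(?P<ref>\S+)", re.MULTILINE)
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")

def unpinned_actions(workflow_text):
    """Return uses: references that are not pinned to a full 40-character commit SHA."""
    refs = [match.group("ref") for match in USES.finditer(workflow_text)]
    # Local actions (./path) live in the same repo and carry no ref to pin.
    return [ref for ref in refs if not ref.startswith("./") and not FULL_SHA.search(ref)]

# Invented workflow fragment; the SHA below is a placeholder.
WORKFLOW = """
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@0123456789abcdef0123456789abcdef01234567
      - run: python3 eval.py --gate recall=0.90 --gate fp_rate=0.10
"""

print(unpinned_actions(WORKFLOW))
```

A tag such as v4 can be moved to point at different code; a full commit SHA cannot, which is why the check flags the first step and accepts the second. Quoted references would slip past this regex, so treat a clean result as a hint, not proof.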
AI acceleration
Ask a model to write the false-positive pytest rows and a batch of adversarial held-out events — benign
activity crafted to look malicious (a signed updater touching a Run key; a base64 string in a PowerShell
arg that isn't -enc). Then label each one yourself against the real technique it mimics and verify
the rule's behaviour on it: the model's "benign" events are often subtly matching, and a model labelling
its own test set is the contamination this whole module guards against. You own the labels, the metric, and
the gate.
Connects forward
This is detection-as-code, complete: rules in git, syntax-checked, intent-tested, and measured — scored against a held-out corpus with a regression gate. It is automation's worked example of the Eval Harness type (ai-augmented-ops 11 applies the identical shape to an AI triage model and a RAG). Combined with module 08 (SOAR), it is the full defensive automation stack — rules that detect, playbooks that respond, an eval that proves the rules still work.
Marketable proof
"I gate Sigma rule changes in CI with sigma check for syntax and a pytest intent table. Then I measure the rules against a held-out, labelled corpus and gate on recall and FP-rate, so a refactor that silently stops catching a variant turns the scorecard red and fails the build. A broken detection can't merge; a regressed one can't either."
Stretch
- Add sigma convert -t splunk as a stage that compiles each rule and fails on an empty query, catching rules that are syntactically valid but logically empty after compilation.
- Track the scorecard over time: append each run's recall/FP-rate to a CSV and plot the trend, so you can see a detection decaying across commits. This is the offline analog of production observability.
- Expand the held-out corpus with a genuinely novel variant (a technique phrased a way no current event uses) and watch which rules miss it — coverage is not effectiveness until the hard cases are in the set.
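For the scorecard-over-time stretch goal, a minimal sketch of the CSV side might look like this. The column names, file location, and commit labels are assumptions; plotting is left out so the example needs only the standard library.

```python
import csv
from pathlib import Path

FIELDS = ["commit", "recall", "fp_rate"]  # hypothetical column names

def append_scorecard(path, commit, recall, fp_rate):
    """Append one run to the trend file, writing the header on first use."""
    path = Path(path)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerow({"commit": commit, "recall": recall, "fp_rate": fp_rate})

def recall_drops(path):
    """Return (previous_commit, commit) pairs where recall fell between consecutive runs."""
    with Path(path).open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    return [
        (prev["commit"], cur["commit"])
        for prev, cur in zip(rows, rows[1:])
        if float(cur["recall"]) < float(prev["recall"])
    ]
```

Calling append_scorecard once per CI run builds the history, and recall_drops points at the commit pair where a detection started to decay.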