Lab 11: Eval Harness for Security Tools
Lab environment: real-data rewire, validation deferred. The labelled corpus is now anchored in real loghub OpenSSH attack traffic (see below) rather than a wholly synthetic log. The env (docker-compose.yml, Makefile, eval.py, the corpus + answer key under data/) is in place; make up && make demo && make down has not yet been re-run on a clean Linux runner against this change, so validate before marking the lab done.
Setup
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/python-for-security/11-eval-harness
make up # Python 3.12 container with pytest + coverage; no network, no model
make demo # scores a GOOD parser (gate passes, exit 0), then a REGRESSED one
# (under-detects attacks — gate fails, exit 1), and prints the verdict
make shell
make down
Requirements: Docker only. No GPU, no model, no network — the eval grades a deterministic Python detection tool over a committed labelled corpus, so the whole lab is offline and reproducible (CI included). Pure stdlib.
data/auth-corpus.jsonl is a labelled corpus of SSH-auth log lines whose attack backbone is real:
the fast brute force from 103.99.0.122 and the invalid-user dictionary sweep from 5.188.10.180 are
verbatim lines from loghub's OpenSSH_2k.log — a real
internet-facing server's auth log — including the genuine 5.36.59.76 message repeated 5 times:
meta-line. Layered on top, on the same real LabSZ host format, are the deliberate near-misses
the real capture doesn't isolate cleanly (the cron job that mistypes its password twice a night; the
truncated/Unicode-mangled line; the slow-and-low spray that never trips a per-minute threshold). Every
line carries a src field marking it loghub OpenSSH_2k.log vs. crafted ..., and each source IP is
tagged attack or benign in data/auth-labels.json with a reason you can audit — you own the
labels. scripts/parser_good.py is a tuned version of the Module-02 brute-force flagger;
scripts/parser_regressed.py is the same tool with a weakened rule that silently under-detects the
slow-and-low case (recall regression).
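Before scoring anything, it helps to confirm that every corpus line resolves to a label and to see how imbalanced the classes are. The sketch below is one way to do that, not the lab's own loader. Only the src field is named in the text above; the ip, line, label, and reason field names, the shape of the labels file, and the sample records are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical in-memory stand-ins for data/auth-corpus.jsonl and
# data/auth-labels.json. Only "src" is named in the lab text; the
# "ip", "line", "label" and "reason" keys are assumed for this sketch.
SAMPLE_CORPUS = "\n".join(json.dumps(rec) for rec in [
    {"ip": "103.99.0.122", "src": "loghub OpenSSH_2k.log", "line": "<raw log line>"},
    {"ip": "103.99.0.122", "src": "loghub OpenSSH_2k.log", "line": "<raw log line>"},
    {"ip": "192.0.2.10", "src": "crafted", "line": "<raw log line>"},
])
SAMPLE_LABELS = {
    "103.99.0.122": {"label": "attack", "reason": "fast brute force"},
    "192.0.2.10": {"label": "benign", "reason": "hypothetical cron double-failure"},
}


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_lines(records, labels):
    """Attach the ground-truth label of each record's source IP.

    Raises ValueError when a corpus line has no entry in the answer key,
    so an unlabelled line can never be scored by accident.
    """
    labelled = []
    for rec in records:
        entry = labels.get(rec["ip"])
        if entry is None:
            raise ValueError(f"no label for source IP {rec['ip']}")
        labelled.append({**rec, "label": entry["label"]})
    return labelled


labelled = label_lines(load_jsonl(SAMPLE_CORPUS), SAMPLE_LABELS)
balance = Counter(rec["label"] for rec in labelled)
provenance = Counter(rec["src"] for rec in labelled)
print("class balance:", dict(balance))
print("provenance:", dict(provenance))
```

Refusing to continue on an unlabelled IP is a deliberate choice here: a line without ground truth should stop the run rather than be quietly dropped from the denominator.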
Scenario
Your Module-02 log parser flags SSH brute-force offenders, and Module 10's tests prove the code does what you wrote. Your SOC lead asks the question the tests can't answer: is it actually catching the attacks, and how often is it crying wolf — and how will we know the day someone "improves" the regex and quietly breaks it? Your job is to stop trusting the demo log: build a held-out labelled corpus, score the parser into a precision/recall scorecard, find the operating point deliberately, and wire a CI gate that fails the build when detection quality regresses.
Everything runs locally against committed fixtures. No external targets, no live systems, no authorization needed.
Do
- [ ] Read the corpus, and understand why it's held out. Open data/auth-corpus.jsonl and data/auth-labels.json. Confirm it is separate from the Module-02 sample log the parser was tuned against. Skim three attack lines and three benign lines and find at least two near-miss pairs you could not separate with a single keyword (e.g. the benign cron double-failure vs. a real two-attempt spray). That coverage of the hard cases, not just more lines, is the point.
- [ ] Run the parser over the corpus → recorded verdicts. make classify runs scripts/parser_good.py over the corpus and writes its per-line verdicts to results/verdicts-good.json. Confirm you have a list of attack/benign calls to grade against the answer key, not a vibe.
- [ ] Build the scorecard, and pick the metric on purpose. Write (or complete) eval.py so it compares the verdicts to the labels and prints the confusion matrix (TP/FP/FN/TN) plus precision, recall, F1, FP-rate, and accuracy. Read them off. Then write down, in one sentence in eval-report.md, why recall is the load-bearing metric for this detector and why accuracy is dangerous here, and prove it: note that a do-nothing parser scores ~85% accuracy on this imbalanced corpus while catching zero attacks.
- [ ] Watch the gate pass on good and FAIL on a regression (the core lesson). make demo runs eval.py --gate recall=0.85 on verdicts-good.json (passes, exit 0) and on the output of scripts/parser_regressed.py (a weakened rule that marks real attacks as benign; fails, exit 1). Open parser_regressed.py and confirm the regression is under-detection (false "all clear"), the failure that buries the alert that mattered. The green/red contrast is what lets a teammate refactor the parser without praying.
- [ ] Find the precision/recall knee. Re-run the gate with a stricter floor: make gate RECALL_MIN=0.95. The good parser may now fail: "good" depends entirely on the bar you declared. Then loosen the parser's rule (broaden the regex) and re-score: watch recall rise and precision fall as the FP queue floods. Record the operating point you'd ship and justify it against the FP-rate cost in eval-report.md.
- [ ] Show coverage ≠ effectiveness. Run coverage run -m ... && coverage report over the parser against an easy slice of the corpus and note the high line-coverage. Then point out in your report that the same coverage number says nothing about whether the hard near-misses are caught: coverage is a foil for detection quality, never a substitute.
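The scorecard step above can be sketched roughly as follows. This is a minimal illustration, not the lab's eval.py: the scorecard function, its return keys, and the 20-line toy corpus (3 attack lines, 17 benign) are all hypothetical, chosen so the benign share matches the ~85% figure quoted in the checklist.

```python
def scorecard(verdicts, truth, positive="attack"):
    """Compare per-line verdicts to ground truth; return counts plus metrics."""
    if len(verdicts) != len(truth):
        raise ValueError("verdicts and truth must have the same length")
    pairs = list(zip(verdicts, truth))
    tp = sum(v == positive and t == positive for v, t in pairs)
    fp = sum(v == positive and t != positive for v, t in pairs)
    fn = sum(v != positive and t == positive for v, t in pairs)
    tn = sum(v != positive and t != positive for v, t in pairs)

    def ratio(num, den):
        # An undefined ratio is reported as 0.0 so a floor-style gate fails on it.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fp_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical toy answer key: 3 attack lines followed by 17 benign lines.
truth = ["attack"] * 3 + ["benign"] * 17

# A do-nothing parser that never raises an alert.
do_nothing = ["benign"] * 20
# A detector that catches 2 of 3 attacks and raises 1 false alarm.
detector = ["attack", "attack", "benign"] + ["attack"] + ["benign"] * 16

baseline = scorecard(do_nothing, truth)
scored = scorecard(detector, truth)
for name, card in (("do-nothing", baseline), ("detector", scored)):
    print(f"{name}: accuracy={card['accuracy']:.2f} recall={card['recall']:.2f} "
          f"precision={card['precision']:.2f} fp_rate={card['fp_rate']:.3f}")
```

On this toy data the do-nothing parser reaches 0.85 accuracy with 0.00 recall, while the detector's accuracy is only slightly higher at 0.90. The five-point accuracy gap hides the difference between catching most attacks and catching none, which is the argument for gating on recall.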
Success criteria: you're done when
- [ ] make demo runs offline and ends with PASS: gate is GREEN on the good parser and RED on the regression.
- [ ] make eval prints a scorecard (confusion matrix + precision/recall/F1/FP-rate/accuracy) for the parser over the held-out corpus.
- [ ] You can state, in writing, why recall is the metric, and why accuracy misleads on this imbalanced corpus (you've seen the do-nothing baseline score ~85% accuracy at 0% recall).
- [ ] You've moved the gate floor and the parser's rule and watched the precision/recall tradeoff move.
- [ ] eval-report.md is filled in: chosen metric + threshold + justification, the precision/recall knee, and the coverage-≠-effectiveness note.
Deliverables
eval.py + the labelled corpus (data/auth-corpus.jsonl, data/auth-labels.json) + eval-report.md,
all committed. The eval-as-code is the artifact: a held-out corpus, a scorecard, and a gate that
fails on regression. Do not commit generated run outputs (results/verdicts-*.json, coverage
data) — they're gitignored; the corpus and the eval regenerate them.
Automate & own it
Required. Wire the gate into CI so a detection regression cannot merge. Add a
.github/workflows/eval.yml (in your own portfolio repo) that runs, on every PR:
python3 eval.py --corpus data/auth-corpus.jsonl --labels data/auth-labels.json \
--tool scripts/parser.py --gate recall=0.90
Own three things about this gate: (1) that it fails closed: if eval.py errors or the metric is
missing, the build fails, it does not silently pass (verify by running the gate with a typo'd metric
name and confirming a non-zero exit); (2) the threshold and its direction (a recall floor, not a
"score exists" check); (3) that the corpus fed to CI is the held-out one, never the lines the parser
was tuned on. Commit the workflow and a log/screenshot of it going red on the planted regression.
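A fail-closed gate check might look like the sketch below. It is an assumption-laden illustration, not the lab's eval.py: parse_gate and gate_exit_code are hypothetical names, and the choice of exit code 2 for a broken gate spec (versus 1 for a real regression) is a convention picked here, not something the lab specifies.

```python
import sys


def parse_gate(spec):
    """Split a spec such as 'recall=0.90' into ('recall', 0.90)."""
    name, sep, raw = spec.partition("=")
    name = name.strip()
    if not sep or not name:
        raise ValueError(f"gate spec must look like metric=floor, got {spec!r}")
    return name, float(raw)  # float() raises ValueError on a non-numeric floor


def gate_exit_code(metrics, spec):
    """Return 0 only when the named metric exists and meets its floor."""
    try:
        name, floor = parse_gate(spec)
        value = metrics[name]
    except (ValueError, KeyError) as exc:
        # Fail closed: a typo'd metric name or malformed spec is an error, not a pass.
        print(f"GATE ERROR: {exc!r}", file=sys.stderr)
        return 2
    if value >= floor:
        print(f"PASS: {name}={value:.3f} >= {floor:.3f}")
        return 0
    print(f"FAIL: {name}={value:.3f} < {floor:.3f}")
    return 1


# Hypothetical metric values, for illustration only.
good = gate_exit_code({"recall": 0.93}, "recall=0.90")
regressed = gate_exit_code({"recall": 0.70}, "recall=0.90")
typo = gate_exit_code({"recall": 0.93}, "recal=0.90")
print(good, regressed, typo)
```

In a script, the returned code would be passed to sys.exit so the CI job inherits it. The comparison is written as value >= floor rather than its negation, so a NaN metric also falls through to the failing branch.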
AI acceleration
Have a model expand the corpus with adversarial near-misses — benign lines crafted to look like brute-force, and real attacks phrased to dodge the obvious regex — then label every one yourself and verify it against what the line actually represents. A model labelling its own corpus is exactly the contamination this module warns about; you generate candidates, you own the ground truth. Then ask a frontier model to critique your metric choice ("I'm gating on recall@0.90 for an SSH brute-force detector — what am I missing?") and weigh its answer against the FP-rate cost you measured.
Connects forward
This is the measurement discipline behind every tool in the track. The Module-02 parser and the Module-05 CLI tool get a scorecard instead of a demo; the Module-10 test suite gains an eval gate alongside its unit tests. It's the deterministic-tool sibling of the AI-ops track's AI Evaluation & Observability module, which applies the same held-out-corpus / scorecard / regression-gate construct to a non-deterministic model. The track capstone's tool should ship with one.
Marketable proof
"I don't just claim my detection tool works — I built it a held-out labelled corpus, a precision/recall scorecard chosen for the asymmetric cost of a missed alert, and a CI gate that fails the build the day a change drops recall. My tools have a number, not a vibe."
Stretch
- Add a precision/recall curve: sweep the rule's threshold (e.g. attempts-per-window) over the corpus and print recall vs. FP-rate, then pick the operating point deliberately instead of by feel.
- Add a malformed-input slice to the corpus (truncated lines, bad encodings, injected newlines) and gate that the parser neither crashes nor mis-classifies on it — robustness as a measured property.
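The threshold sweep in the first stretch item could be prototyped along these lines. The sketch assumes the rule reduces to a single number per source IP (peak failed attempts in a window) and scores per IP rather than per line. The sweep function, the attempt counts, and the documentation-range IPs are all hypothetical.

```python
def sweep(attempts_by_ip, labels, thresholds):
    """For each threshold, flag IPs with attempts >= threshold and score the result.

    Returns a list of (threshold, recall, fp_rate) tuples.
    """
    rows = []
    for threshold in thresholds:
        tp = fp = fn = tn = 0
        for ip, attempts in attempts_by_ip.items():
            flagged = attempts >= threshold
            is_attack = labels[ip] == "attack"
            if flagged and is_attack:
                tp += 1
            elif flagged:
                fp += 1
            elif is_attack:
                fn += 1
            else:
                tn += 1
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
        rows.append((threshold, recall, fp_rate))
    return rows


# Hypothetical peak failed attempts per window, keyed by source IP.
attempts_by_ip = {
    "198.51.100.1": 40,  # fast brute force
    "198.51.100.2": 12,  # dictionary sweep
    "198.51.100.3": 2,   # slow-and-low spray
    "192.0.2.10": 2,     # cron job that mistypes its password twice
    "192.0.2.11": 1,
    "192.0.2.12": 0,
}
labels = {
    "198.51.100.1": "attack", "198.51.100.2": "attack", "198.51.100.3": "attack",
    "192.0.2.10": "benign", "192.0.2.11": "benign", "192.0.2.12": "benign",
}

rows = sweep(attempts_by_ip, labels, thresholds=[1, 2, 3, 10, 20])
print("threshold  recall  fp_rate")
for threshold, recall, fp_rate in rows:
    print(f"{threshold:>9}  {recall:>6.2f}  {fp_rate:>7.2f}")
```

In this toy data the slow-and-low attacker and the cron job both peak at two attempts, so no threshold separates them: full recall costs at least one false positive. That is the kind of knee the sweep is meant to expose, and a hint that a second feature, not a different threshold, is what the rule needs.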