Lab 11: AI Evaluation & Observability
Setup
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/ai-augmented-ops/11-ai-evaluation
make up && make demo
Requirements: Docker. ~256 MB RAM. No GPU, no model, no network — the eval grades recorded system outputs (committed fixtures), so the whole lab is deterministic and runs offline, in CI too. Pure Python stdlib.
make demo scores a good triage system (the gate passes), then a regressed one that
under-classifies attacks (the gate fails), then does the same for a RAG retriever — and prints a
verdict confirming the gate is green on good and red on the regression. That contrast is the whole
lesson.
Scenario
You built a triage classifier in Module 07 and a RAG in Module 04. Both looked good in the demo. Now a model upgrade is queued for Friday and your lead asks the question you cannot answer with adjectives: is it still good enough to route real alerts and answer real questions — and how will we know the day it stops being? Your job is to stop trusting the vibe: build a held-out test set, choose a metric and defend it, score the system into a scorecard, and wire a CI regression gate that fails the build when quality drops.
Why this discipline exists. In the 2017 Equifax breach, a tool was watching network traffic for exactly this kind of malicious activity — but an expired certificate had silently stopped it from inspecting encrypted traffic, so the exfiltration ran undetected for ~76 days until the certificate was replaced and inspection resumed (GAO-18-559, p. 11). The detector still "ran"; it just wasn't catching anything anymore, and nothing measured that it had stopped being good enough. A held-out eval plus a regression gate is the discipline that surfaces that silent degradation before it becomes the headline.
Everything runs locally against committed fixtures. No external targets, no live model, no authorization needed.
Do
- [ ] Read the held-out corpus, and understand why it's held out.
  data/triage-heldout.jsonl is 32 realistic alerts (16 malicious, 16 benign), with the answer key in data/triage-labels.json. Note it is separate from the Module-07 demo/tuning set: it contains alerts the triage prompt was never tuned against, including deliberate near-misses (the benign certbot renewal vs. the malicious hidden DownloadString; the benign backup VSS job vs. the malicious vssadmin Delete Shadows). Skim three malicious and three benign items and confirm you could not separate them with a single keyword. That's the point: coverage of the hard cases, not just more cases.
- [ ] Run the system over the held-out set → recorded predictions.
  make classify runs the deterministic stub classifier (scripts/stub_classifier.py) over the corpus and writes results/predictions-stub.json, then scores it. The stub is an honest stand-in so the loop runs offline; in real use you delete it and drop in your Module-07 model, and the corpus, eval, and gate are unchanged. Confirm you get a scorecard, not a vibe.
- [ ] Read the metric off the scorecard and decide if it's the right one.
  make eval prints precision, recall, F1, FN-rate, FP-rate, and accuracy on the predictions-good.json fixture. Find the line marked "the metric that matters" and write down, in one sentence, why recall on the malicious class (and its FN-rate) is the load-bearing number for SOC triage, and why accuracy alone is dangerous here. Then prove it to yourself: run the gate on the regressed system below and watch accuracy stay ~69% while recall collapses.
- [ ] Watch the gate pass on good and FAIL on a regression: the core lesson.
  make demo runs eval.py ... --gate recall=0.80 on predictions-good.json (passes, exit 0) and on predictions-regressed.json (a model that started marking real attacks as "benign"; fails, exit 1). Open data/predictions-regressed.json; confirm the regression is under-classified maliciousness (false "all clear"), the failure that buries the alert that mattered. The green/red contrast is what lets a team upgrade a model without praying.
- [ ] Tune a threshold and watch the tradeoff move.
  Re-run the gate with a stricter floor: make gate TRIAGE_RECALL_MIN=0.95. The good system (recall 0.938) now fails: you've discovered that "good" depends entirely on the bar you declared. Pick a defensible floor for a SOC and justify it against the FP-rate cost in your eval-report.md (push recall up and you flood the analyst queue; the eval is how you find the knee deliberately).
- [ ] Do the same for RAG retrieval.
  make eval-rag scores retrieval-good.json against data/rag-heldout.json with retrieval@k (did a genuinely-relevant doc land in the top-k?). Then look at retrieval-regressed.json, a retriever that pulls generically-similar but wrong chunks, and confirm retrieval@3 collapses from ~92% to ~42%. Note in your report why a RAG needs a retrieval metric and not just a "the answer read well" check: confident generation on top of wrong context is the silent failure.
- [ ] Add observability.
  In eval-report.md, write the one-paragraph plan for the standing version of this eval in production: what you would log (inputs, outputs, scores, realised outcomes) and how re-scoring last month's labelled traffic catches input-distribution drift before an analyst does.
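The accuracy-versus-recall point in the checklist can be checked by hand. What follows is a minimal sketch in stdlib Python, not the lab's eval.py: the names Confusion, confusion, and scorecard, and the id-to-class dict shape, are assumptions made for illustration. The regressed counts are inferred from the figures quoted above (32 alerts, 16 malicious, 37.5% recall, roughly 69% accuracy) rather than read from the fixture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # malicious alert flagged malicious
    fn: int  # malicious alert marked benign (the missed attack)
    fp: int  # benign alert flagged malicious
    tn: int  # benign alert marked benign


def confusion(labels, predictions, positive="malicious"):
    """Count outcomes from two dicts mapping alert id -> class name (assumed shape)."""
    tp = fn = fp = tn = 0
    for alert_id, truth in labels.items():
        # An alert with no recorded prediction counts as "not flagged".
        flagged = predictions.get(alert_id) == positive
        if truth == positive:
            if flagged:
                tp += 1
            else:
                fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return Confusion(tp, fn, fp, tn)


def scorecard(c):
    """Derive the usual triage metrics; a ratio with a zero denominator reads as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(c.tp, c.tp + c.fp)
    recall = ratio(c.tp, c.tp + c.fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fn_rate": ratio(c.fn, c.tp + c.fn),
        "fp_rate": ratio(c.fp, c.fp + c.tn),
        "accuracy": ratio(c.tp + c.tn, c.tp + c.fn + c.fp + c.tn),
    }


# Inferred counts for the regressed system: 6 of 16 attacks caught, no false alarms.
regressed = Confusion(tp=6, fn=10, fp=0, tn=16)
card = scorecard(regressed)
print(f"accuracy={card['accuracy']:.2%} recall={card['recall']:.2%} fn_rate={card['fn_rate']:.2%}")
```

Under these counts accuracy still reads 68.75% because every benign alert is scored correctly, while 10 of the 16 attacks went unflagged. That gap is why the gate watches recall and FN-rate rather than accuracy.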
Success criteria: you're done when
- [ ] make demo runs offline and ends with PASS: gate is GREEN on the good system and RED on the regression.
- [ ] make eval prints a triage scorecard (confusion matrix + precision/recall/F1/FN-rate/FP-rate).
- [ ] You can state, in writing, why recall/FN-rate on the malicious class is the metric, and why accuracy misleads on this set, where a missed attack costs far more than a false alarm (you've seen the regressed run prove it: ~69% accuracy, 37.5% recall).
- [ ] make eval-rag prints retrieval@k, and you've seen the gate fail on the regressed retriever.
- [ ] eval-report.md is filled in: chosen metric + threshold + justification, the FP/recall tradeoff, and the observability plan.
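To make retrieval@k concrete, here is a small sketch of the metric as this lab describes it (did a genuinely relevant doc land in the top-k?). It is an illustration under assumptions, not the lab's implementation: the function names, the record shape with "retrieved" and "relevant" keys, and the kb-NN document ids are all hypothetical, and the four queries are made-up data.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """True when at least one relevant doc id appears in the first k results."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def retrieval_at_k(runs, k):
    """Fraction of queries whose top-k contains at least one relevant doc."""
    if not runs:
        return 0.0
    hits = sum(hit_at_k(run["retrieved"], set(run["relevant"]), k) for run in runs)
    return hits / len(runs)


# Hypothetical recorded retriever output: ranked ids plus the answer key per query.
runs = [
    {"retrieved": ["kb-12", "kb-03", "kb-40"], "relevant": ["kb-03"]},
    {"retrieved": ["kb-07", "kb-19", "kb-02"], "relevant": ["kb-07"]},
    {"retrieved": ["kb-31", "kb-30", "kb-29"], "relevant": ["kb-05"]},
    {"retrieved": ["kb-08", "kb-09", "kb-11"], "relevant": ["kb-11", "kb-14"]},
]

print(f"retrieval@1={retrieval_at_k(runs, 1):.2f}")
print(f"retrieval@3={retrieval_at_k(runs, 3):.2f}")
```

The third query is the failure the regressed retriever produces at scale: three plausible-looking chunks, none of them relevant. A fluent answer generated on top of them would not reveal the miss, but this score does.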
Deliverables
eval.py (with any metric or gate change you made) + the held-out corpus + eval-report.md, all
committed. The eval-as-code is the artifact: a held-out set, a scorecard, and a gate that fails
on regression. Do not commit generated run outputs (results/predictions-stub.json, metric
dumps) — they're gitignored; the corpus and the eval regenerate them.
Automate & own it
Required. Wire the gate into CI so a regression cannot merge. Add a .github/workflows/eval.yml
(in your own portfolio repo) that runs, on every PR:
python3 scripts/eval.py triage --predictions <your model's output> --gate recall=0.85
python3 scripts/eval.py rag --retrieval <your retriever's output> --gate recall_at_k=0.75
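CI only sees the exit status of those commands: a non-zero exit fails the job, which is what blocks the merge. The sketch below shows one way a metric=floor gate could map to that status. It assumes nothing about how eval.py actually parses --gate; parse_gate and gate_exit_code are hypothetical names, and the two recall values are the ones this lab quotes for the good and regressed triage fixtures.

```python
def parse_gate(spec):
    """Split a spec such as 'recall=0.85' into ('recall', 0.85)."""
    name, sep, floor = spec.partition("=")
    if not sep or not name.strip():
        raise ValueError(f"gate must look like metric=floor, got {spec!r}")
    return name.strip(), float(floor)


def gate_exit_code(metrics, spec):
    """Return 0 when the named metric meets its floor, 1 otherwise."""
    name, floor = parse_gate(spec)
    if name not in metrics:
        # A gate on a metric that was never computed should not pass silently.
        raise KeyError(f"metric {name!r} is not on the scorecard")
    value = metrics[name]
    passed = value >= floor
    print(f"GATE {'PASS' if passed else 'FAIL'}: {name}={value:.3f} (floor {floor:.2f})")
    return 0 if passed else 1


good = {"recall": 0.938}       # recall quoted for the good triage fixture
regressed = {"recall": 0.375}  # recall quoted for the regressed fixture

good_code = gate_exit_code(good, "recall=0.85")
regressed_code = gate_exit_code(regressed, "recall=0.85")
print(f"good exit={good_code}, regressed exit={regressed_code}")

# A script entry point would finish with sys.exit(code) so the CI job fails on 1.
```

Raising on an unknown metric name is a deliberate choice in this sketch: a typo in the workflow file then breaks the build instead of quietly disabling the gate.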
AI acceleration
Have a model expand the held-out corpus with adversarial items (benign events crafted to look malicious, and novel techniques phrased unusually), then label them yourself and verify each against the technique it mimics. A model labelling its own test set is exactly the contamination this module warns about; you generate candidates, you own the ground truth. Ask a frontier model to critique your metric choice ("I'm gating on a recall floor of 0.85 for SOC triage; what am I missing?") and weigh its answer against the FP-rate cost you measured.
Connects forward
This is the measurement layer the rest of the track plugs into. Module 04 (RAG) gets the retrieval@k gate; Module 06 (SOC copilot) gets a groundedness check on its summaries; Module 07 (triage) swaps its one-off confusion matrix for this held-out scorecard + gate. Modules 09/10 (securing/attacking AI) reuse the gate as a regression test for a fixed jailbreak: the exploit must stay blocked, proven by an eval that fails if it ever works again.
Marketable proof
"I built an eval harness for an AI security system — a held-out labelled corpus, a recall/FN-rate scorecard chosen for the asymmetric cost of a missed alert, and a CI regression gate that fails the build on a planted degradation. I can prove my model is good, not just claim it."
Stretch
- Add a threshold curve: sweep a confidence threshold over a graded-confidence prediction fixture and plot (or print) recall vs. FP-rate (an ROC-style curve; a precision/recall curve would plot precision against recall instead), then pick the operating point deliberately.
- Add a groundedness check to the RAG eval: given retrieved chunks and a generated answer, score whether the answer's claims are supported by the retrieved text (start with simple span overlap; note where it needs an LLM-grader and why that re-introduces the eval-the-evaluator problem).
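For the threshold sweep in the first stretch item, a possible starting point is sketched below. It assumes a graded-confidence fixture shaped as (confidence that the alert is malicious, true label) pairs; that shape, the function names, and the eight sample scores are hypothetical and do not come from the lab repository.

```python
def sweep(scored, thresholds):
    """For each threshold, flag alerts with confidence >= threshold and score the result.

    scored is a list of (confidence_malicious, is_malicious) pairs.
    """
    positives = sum(1 for _, is_malicious in scored if is_malicious)
    negatives = len(scored) - positives
    rows = []
    for threshold in thresholds:
        tp = sum(1 for conf, is_malicious in scored if is_malicious and conf >= threshold)
        fp = sum(1 for conf, is_malicious in scored if not is_malicious and conf >= threshold)
        rows.append({
            "threshold": threshold,
            "recall": tp / positives if positives else 0.0,
            "fp_rate": fp / negatives if negatives else 0.0,
        })
    return rows


def pick_operating_point(rows, recall_floor):
    """Among thresholds that meet the recall floor, take the lowest FP-rate."""
    eligible = [row for row in rows if row["recall"] >= recall_floor]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["fp_rate"])


# Hypothetical graded-confidence predictions: four malicious, four benign.
scored = [
    (0.95, True), (0.80, True), (0.60, True), (0.35, True),
    (0.70, False), (0.40, False), (0.20, False), (0.05, False),
]

rows = sweep(scored, [0.3, 0.5, 0.9])
for row in rows:
    print(f"threshold={row['threshold']:.1f} recall={row['recall']:.2f} fp_rate={row['fp_rate']:.2f}")
print("chosen:", pick_operating_point(rows, recall_floor=0.75))
```

Lowering the threshold buys recall at the price of more false positives in the analyst queue. Stating the recall floor first and then minimising FP-rate keeps the choice of operating point explicit and defensible in eval-report.md.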