Lab 10 — Red-team the copilot, then freeze it into a regression eval¶
Hands-on lab · ← Back to the module concept
Type 15 · Red-team-the-AI (+ Type 13 · Eval Harness). You run a systematic red-team of the SoC
copilot — broad statistical coverage with garak, expected-output assertions with promptfoo — write
a threat model of its attack surface, and ship the promptfoo suite as a CI regression gate
(plugging into module 11) that fails the build when a blocked attack becomes possible again.
Setup¶
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/ai-augmented-ops/10-attacking-ai
make up && make demo
Requirements: Docker, 4 GB RAM free. No GPU needed. make demo runs garak against the local
Ollama model with a focused probe set (injection + leakage only — the full scan takes 30–60 min).
make garak-full runs the complete probe suite; make promptfoo-eval runs the assertion suite.
Authorization — offensive tooling. garak and promptfoo generate real jailbreak, injection, and data-extraction payloads. Run them only against models and applications you own or have explicit written permission to assess. Every target in this lab is a local Docker container; never point these tools at a hosted model or a third party's deployment without sanction.
What this lab is — and isn't. Every attack class you scan for is grounded in a real, documented incident, mapped in
results/threat-model.md: the role-override / jailbreak probes are the Chevrolet "$1 Tahoe" jailbreak (2023); the injection-via-alert-data probes are the shape of EchoLeak / CVE-2025-32711 (zero-click indirect injection in M365 Copilot); the tool-surface probes are Invariant Labs' MCP tool poisoning (2025); and the reason any of it matters operationally is Moffatt v. Air Canada (2024 BCCRT 149) — the org owns what its bot says. It is not those production systems: tinyllama is a tiny local model with little safety training, so it fails probes a frontier model would pass — the garak pass rates are not a verdict on any real product. What transfers is the method (probe → interpret → threat-model → gate) and the attack classes, not the numbers against this toy target.
Scenario¶
Before the SoC copilot you built across modules 04–06 goes to production, the security team runs a systematic AI red-team — the move the three named incidents make non-optional: Air Canada owns what its bot says, a system prompt didn't stop the Chevy "$1 car" jailbreak, and EchoLeak rode in on a retrieved document with zero clicks. Your job: scan the copilot, interpret findings against those incident classes, produce a threat model, and leave behind a regression suite that re-runs in CI so the next model swap can't silently reopen a hole.
Do¶
-
[ ] Run the breadth scan.
make demo(garak,injection+leakage). For each probe class read: which probes ran, and the pass rate (higher = more attacks blocked). Treat any class below your declared threshold (start at 80% pass) as a finding. Copy the summary table intoresults/garak-findings.md. Note: re-running gives slightly different rates — that's the point, the results are statistical. -
[ ] Prove the misconception, hands-on. Open
data/attack-prompts.yaml; the copilot's system prompt tells it to act only as a SOC analyst. Craft one chat message that tries to override that role (a Chevy-style "from now on, ignore your role and..."). Send it a few times. Record inresults/garak-findings.mdwhether the system prompt held every time — and why "just tell it not to" is not a control. -
[ ] Run the assertion suite (the Type 13 half).
make promptfoo-evalrunspromptfooover the test cases indata/attack-prompts.yaml. For each: the prompt sent, the model's actual output, and whether it passed its assertion (what a safe response must contain / must not contain). Copy any failing assertions intoresults/promptfoo-findings.mdand explain why each failed. -
[ ] Write the per-finding analysis. For each garak finding and each failing promptfoo assertion, one paragraph in the matching results file: what attack class it is, what an attacker could do to a SOC copilot if it works (reclassify a critical, extract the prompt, exfil via a retrieved doc), the named incident it rhymes with (Air Canada / Chevy / EchoLeak), and the mitigation (from module 09 or the OWASP-LLM ID).
-
[ ] Write
results/threat-model.mdfor the copilot: - Adversaries — who targets it and why (include malicious alert data / retrieved documents).
- Assets — what an attacker wants (correct triage, the system prompt, tool access).
- Attack surface — prompt input, tool results, RAG context, the model API.
- Top 3 threats — each tagged with an OWASP-LLM risk ID and a MITRE ATLAS technique ID, and anchored to one named incident.
- Mitigations implemented — reference modules 05 and 09.
-
Residual risk — what's still open after all mitigations (be honest; EchoLeak shows filters get bypassed).
-
[ ] Extend the suite. Add at least one new test case to
data/attack-prompts.yaml— a SOC-specific attack the existing set doesn't cover (e.g. an injection planted in a retrieved alert body, the EchoLeak-shaped vector). Give it a real assertion. Runmake promptfoo-eval; record pass/fail. -
[ ] Ship the regression gate (the deliverable that makes it engineering). Wire
promptfoointo CI: add.github/workflows/ai-redteam.yml(or extendscripts/scan.sh) that runs the suite and fails the build if the safe-response pass rate drops below a declared threshold (e.g. 90%). Then prove the gate bites: point the suite at a deliberately weakened target (swap to a smaller model, or strip a guardrail / output filter) and confirm the run goes red; restore and confirm green. This is the same held-out-set-plus-regression-gate discipline as module 11 — AI Evaluation & Observability; reuse its scorecard shape. The contrast — green on the hardened copilot, red on the regressed one — is the proof.
Success criteria — you're done when (honor system — self-verified; no grader)¶
- [ ]
make democompletes and prints a garak probe summary (pass rates per class). - [ ]
make promptfoo-evalruns every test case and prints pass/fail per assertion. - [ ]
results/garak-findings.mdhas the summary table, one paragraph per finding, and the role-override hands-on note from step 2. - [ ]
results/promptfoo-findings.mdanalyses each failing assertion. - [ ]
results/threat-model.mdis complete with all six sections, OWASP-LLM + ATLAS IDs, and each top threat anchored to a named incident. - [ ] At least one new EchoLeak-/SOC-specific test case is in
data/attack-prompts.yamlwith a real assertion. - [ ] The CI regression gate exists, fails on the weakened target, and passes on the hardened one — you've seen it go both red and green, not just pass.
Deliverables¶
data/attack-prompts.yaml (with your new case) + results/garak-findings.md +
results/promptfoo-findings.md + results/threat-model.md + the CI regression workflow
(.github/workflows/ai-redteam.yml or the gating scripts/scan.sh). Commit all of them. Lab
artifacts — raw garak reports, full model transcripts — stay out of the commit.
Automate & own it¶
Required — and it's the regression gate above. Turn "I red-teamed it once" into a suite that
re-runs on every change. Have a model draft the workflow YAML and the grep/jq that extracts the
failing-probe count from garak's report and the pass rate from promptfoo's JSON; you review the
extraction logic (does it count all failure modes, or just lines containing "FAIL"?) and you set
the gate to fail closed — a missing or errored eval must turn the build red, never silently pass.
The artifact is a red-team you can re-prove on demand; a finding nobody re-checks reopens itself.
AI acceleration¶
Two loops. Drafting: let a model write promptfoo assertion YAML from a plain-language safety spec
and the gate's extraction logic — it knows the formats. Adversarial: never ask a model "is this
safe?" (it reassures); ask it to attack — "given this copilot is RAG + MCP + Ollama, write ten
injection payloads that would arrive via a retrieved alert, and name ten MITRE ATLAS techniques my
threat model misses." Then verify each lands against the real surface and label them yourself — a
model grading its own attack list is the contamination module 11 warns about.
Connects forward¶
This closes the track: you built the AI (04–06), measured it (07, 11), secured it (09), and now red-teamed it systematically and froze the result into a gate (10). The capstone takes one tool from the stack, demonstrates a finding against it, and ships a hardened version with this very regression suite attached — the threat model here is the capstone's starting point, and the gate is what proves the hardening holds.
Marketable proof¶
"I run systematic LLM red-teams with garak (statistical probe coverage) and promptfoo (expected-output assertions), interpret findings against real incidents (Air Canada, the Chevy '$1 car' jailbreak, EchoLeak / CVE-2025-32711), produce a structured threat model tagged to OWASP-LLM and MITRE ATLAS, and ship the red-team as a CI regression gate that fails the build when a blocked attack becomes possible again."
Stretch¶
- Run
make garak-fulland compare to the fast scan — which extra probe classes produce findings, and are any operationally significant for a SOC copilot? - Add a RAG-poisoning test: plant an injection inside a document the copilot will retrieve (the EchoLeak shape) and assert the copilot does not act on it. This exercises the attack surface a prompt-only test never reaches.
- Gate on garak too: have the CI job also fail if any probe class drops below a declared pass-rate, so both breadth (garak) and depth (promptfoo) regressions block merge.
Comments
Sign in with GitHub to comment. Choose the type: Feedback (errors or suggestions on this page) · Hints (help for fellow learners — no spoilers) · General (anything else).