Lab 02 — Running Local Models¶
Hands-on lab · Type 7 Build-&-Operate · ← Back to the module concept
Setup¶
git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/ai-augmented-ops/02-running-local-models
make up && make demo
Requirements: Docker, 4 GB RAM free minimum (8 GB recommended for phi3:mini).
No GPU required — all inference runs on CPU. First run pulls tinyllama (~637 MB);
set OLLAMA_MODEL=phi3:mini to benchmark a more capable model if you have 8 GB free.
The make demo target runs all five benchmark prompts and prints a results table with
latency and token throughput for each. It takes 2–5 minutes on CPU.
Scenario¶
A security team wants to justify local-model infrastructure to their CISO. The argument
isn't just privacy — it's that a local model processing 40 alerts per minute is operationally
useful; one that processes 4 is not. And neither number can be quoted from a model card: it
depends on your hardware and your prompts. Your job: stand up the local model as a running
service, benchmark it on the five security-domain prompts in data/benchmark-prompts.txt,
interpret the throughput vs. quality results from your own measurements, and produce a one-page
evaluation report.
Everything runs locally. No external targets, no cloud API keys required.
Do¶
-
[ ]
make demoand read the benchmark output. For each of the five prompts, record the latency (seconds) and throughput (tokens/sec) inresults/benchmark-results.md. The demo script prints these for you — copy them into the table template in that file. This is your throughput measurement, on your CPU — not a leaderboard's. -
[ ] Qualitatively evaluate each answer: is it factually correct, partially correct, or wrong? Mark each cell in
results/benchmark-results.mdas Correct / Partial / Wrong and add a one-sentence note explaining what the model got right or missed. This is your first, by-hand cut at quality eval — the discipline Module 11 generalises into a held-out scorecard and a regression gate. -
[ ]
Find: the model's parameter count, quantisation level, and context length. Write these in the "Model card" section ofmake shelland call the Ollama API directly to explore the model metadata:results/benchmark-results.md. -
[ ] Compare to a larger model (optional, requires 8 GB RAM):
If you have the hardware, run the benchmark again and add a comparison column. Does the throughput drop match the quality improvement? Is the tradeoff worth it for these prompts? -
[ ] Write a one-paragraph "Recommendation" at the bottom of
results/benchmark-results.md: which model and configuration would you recommend for this team's alert triage pipeline, and why? Cite the throughput number and at least one qualitative quality finding — both from your own run, not a published benchmark.
Success criteria — you're done when¶
- [ ]
make demoruns and the local model serves inference over its OpenAI-compatible API (the running service exists). - [ ]
results/benchmark-results.mdcontains latency and throughput data for all five prompts, measured on your hardware. - [ ] Each answer is marked Correct / Partial / Wrong with a one-sentence explanation (your by-hand quality pass).
- [ ] The model card section (parameters, quantisation, context length) is filled in.
- [ ] The Recommendation paragraph is written and cites a concrete throughput number from your own run.
Deliverables¶
results/benchmark-results.md (your completed evaluation) and benchmark.py (see
Automate & own it). Commit both — the report is the technical evidence behind the
infrastructure decision; the script is the reusable measurement. (Lab artifacts — model
weights, raw API dumps — stay out of commits.)
Automate & own it¶
Required. Write benchmark.py — a Python script that reads data/benchmark-prompts.txt,
sends each prompt to the Ollama API, records latency and token count, and writes a Markdown
table to stdout. The script should accept --model and --host arguments. Have a model
draft the HTTP client and result-formatting logic; you review the error handling (what happens
if the model isn't pulled yet? if Ollama is unreachable?) and add it. Run the script and
confirm it produces the same table make demo shows. Commit it — this is the throughput half
of "measure on your own hardware," made repeatable. (The quality half becomes the scored,
held-out eval you build in Module 11.)
AI acceleration¶
Paste your benchmark results table into a frontier model and ask it to interpret the quality/speed tradeoff for a high-volume SOC alert triage use case. Compare its analysis to your own. Where do they agree? Where does the model offer a dimension you hadn't considered? The numbers are yours — measured on your hardware on your prompts — so the recommendation is yours to defend. Document the differences in the Recommendation paragraph.
Connects forward¶
The service you stand up here is the inference engine for the RAG system in Module 04, the triage script in Module 07, and the attack surface in Module 10 — and the by-hand quality pass you do in step 2 is the seed of Module 11 (AI Evaluation & Observability), which turns "I eyeballed the answers" into a held-out test set, a scorecard, and a regression gate. Understanding its throughput and quality ceiling now prevents surprises later.
Marketable proof¶
"I can deploy and benchmark a local language model, measure its throughput and quality on domain-specific prompts on my own hardware, and produce a hardware-justified recommendation for or against local inference in a security operations context."
Stretch¶
- Build a simple HTTP load test against the Ollama API (10 concurrent requests) and measure how throughput degrades under concurrency. This models a multi-analyst SOC where several triage workers hit the same local model simultaneously.
- Explore the Modelfile format: create a custom Modelfile that wraps
tinyllamawith a security-analyst system prompt baked in. Pull it withollama createand confirm the system prompt persists across calls without being sent in each request.
Comments
Sign in with GitHub to comment. Choose the type: Feedback (errors or suggestions on this page) · Hints (help for fellow learners — no spoilers) · General (anything else).