Module 02 — Running Local Models¶

Type 7 · Build-&-Operate — stand up a local LLM and benchmark its throughput and answer-quality against your own alerts and hardware; the deliverable is the running model plus its measured baseline, not a leaderboard number. (Secondary: Decision / ADR.) Go to the hands-on lab →

Last reviewed: 2026-06

AI-Augmented Security Operations — a model you can't explain is a dependency you can't audit; running it yourself is where that audit starts.

Difficulty: Intermediate · Estimated time: ~4–6 hrs (study + lab) · Type: Build-&-Operate · Prerequisites: Foundations

In 60 seconds

Data-residency and segmentation rules often forbid sending telemetry to a hosted model, so you run it on infrastructure you control. Quantisation is what makes that practical — 4-bit weights shrink a 7B model from ~14 GB to ~4 GB, laptop-sized. Ollama serves it behind an OpenAI-compatible API, so the rest of the track swaps local↔frontier by changing one URL. The one load-bearing judgment: measure throughput and answer quality on your alerts and your hardware — never a leaderboard.

Why this matters¶

Sending sensitive telemetry to a hosted model is often a non-starter — data residency requirements, security classification policies, and network segmentation all push toward running the model on infrastructure you control. Ollama made this accessible: a single binary that pulls a model, starts an OpenAI-compatible REST API on localhost, and serves inference without a PhD in CUDA. But the moment you stand it up, the real question lands: is this thing fast enough and good enough to do the job? — and the only honest answer comes from running it on your own work. This module is where you build the running serving layer and measure it the way a leaderboard never will: against your prompts, on your hardware.

Objective¶

Stand up a quantised open-weight model locally via Ollama as a running, OpenAI-compatible service, then measure it on your own work: throughput (tokens/sec) and answer quality on a set of security-domain prompts, ending in a hardware-justified recommendation a security team can act on.

The core idea¶

This is a build-and-operate module: the point is not a breach to dissect but a working serving layer you stand up, measure, and own. A language model is, at its core, a very large file of floating-point numbers — the weights — plus the code that multiplies them together at inference time. The breakthrough that made local inference practical is quantisation: instead of storing each weight as a 32- or 16-bit float, you represent it in 4 or 8 bits, losing a small amount of numerical precision in exchange for a dramatic reduction in memory footprint. A 7-billion-parameter model in 16-bit precision needs ~14 GB of RAM; the same model in 4-bit quantisation fits in ~4 GB. That's the difference between "needs a high-end workstation" and "runs on a developer laptop." The GGUF format (used by llama.cpp and Ollama) is the container format that packages quantised weights for CPU inference.

The mental model

A model is just a big file of floating-point weights plus the code that multiplies them. Quantisation trades a little numerical precision for a large drop in memory footprint, and the OpenAI-compatible API means your integration code never cares whether the weights live on localhost:11434 or api.openai.com — local-vs-frontier becomes a config change, not a rewrite.

Ollama's key abstraction is the Modelfile — a small configuration that points at a GGUF base model and layers in a system prompt, a context size, and any sampling parameters. When you run ollama run tinyllama, you're actually pulling a Modelfile plus a quantised GGUF, starting a serving process that loads the weights into RAM, and opening an HTTP API on port 11434. The API is intentionally OpenAI-compatible: the same client code that hits api.openai.com/v1/chat/completions can hit localhost:11434/v1/chat/completions with a one-line URL change. This matters for security tool integration: you write the integration once against the local model, and swapping to a frontier model is a configuration change, not a code change — the whole rest of this track plugs into the service you stand up here.

The one load-bearing judgment: measure on YOUR alerts and YOUR hardware, not a leaderboard. A benchmark score from a model card tells you how some model did on someone else's task on someone else's GPU; it does not tell you whether this quantisation answers your triage prompts fast enough on your CPU. The practical axis is never "which model is smarter" — it's throughput vs. quality at the task you actually run. Throughput is tokens per second; quality is measured empirically against your own prompts. A model that generates 40 tokens/second and classifies correctly 92% of the time is useful for alert triage; one that generates 8 tokens/second and is right 94% is not, if your queue grows faster than it's processed. llama.cpp's llama-bench and the timer loop over this lab's data/benchmark-prompts.txt give you the throughput number directly; the quality side — labelling each answer Correct / Partial / Wrong against ground truth — is a first, by-hand cut at evaluation. Module 11 (AI Evaluation & Observability) generalises exactly this move into a held-out test set, a scorecard, and a regression gate; the qualitative pass you do here is the seed of the discipline that becomes non-negotiable once a model is making decisions unattended. Measure here so you have the instinct before module 11 makes it rigorous.

The gotcha

A model card's benchmark number measures some model on someone else's task on someone else's GPU. It does not tell you whether this quantisation answers your triage prompts fast enough on your CPU. A model that does 40 tok/s at 92% beats one that does 8 tok/s at 94% if your queue grows faster than it drains — throughput and quality, at the task you actually run.

One important operational constraint: models don't update themselves. A local model's knowledge is frozen at its training cutoff. For threat intelligence — which evolves daily — this means the model can reason about technique classes (phishing, encoded execution, lateral movement) but will not know about a CVE disclosed last month. The operational pattern is "model classifies the technique; the analyst (or a tool) queries the live threat feed." The model handles the reasoning pattern; current data comes from tools (more on that in Modules 04–06).

AI caveat

Use a model to interpret your benchmark numbers — paste the throughput/quality table and ask which tasks justify a larger model. It reasons about the tradeoff well, but it cannot know your numbers; you supply the empirical measurements, and you own the recommendation precisely because the data came from your hardware on your prompts, not its training set.

Learn (~3 hrs)¶

How models run locally (~1.5 hrs) - Ollama documentation — Models overview — browse the model library to understand the naming convention (name:size-quantisation); pay attention to the size column vs. the parameter count. - GGUF and the llama.cpp ecosystem (Hugging Face blog) — explains the GGUF format, quantisation levels (Q4_K_M, Q8_0, etc.), and how to find models. Read the "Quantization" section carefully. - llama.cpp README — skim the benchmarking section; the llama-bench tool is what the lab automates.

Practical deployment (~1 hr) - Simon Willison, "Run a model with llama.cpp" — short walkthrough that demystifies the whole stack in one read; written in 2023 but the concepts haven't changed. - Ollama API reference — the REST API you'll call directly; focus on /api/generate and /api/chat endpoints.

Hardware and throughput (~30 min) - Tim Dettmers, "Which GPU for deep learning?" — the most cited practical guide; skim for the memory bandwidth discussion, which explains why VRAM dominates inference speed.

Key concepts¶

Quantisation: how 4-bit weights make 7B models fit on a laptop
GGUF format and Ollama's Modelfile abstraction
OpenAI-compatible API: write once, swap endpoint for local-vs-frontier — the service the rest of the track plugs into
Throughput (tokens/sec) vs. quality as the practical evaluation axis — measured on your prompts and your hardware, not a leaderboard
The by-hand quality pass here is the seed of the rigorous eval harness in Module 11
Training cutoff as a hard limit for threat intel recency

AI acceleration¶

Use a model to help you analyse your benchmark results — paste in the throughput numbers and ask it to explain which tasks benefit from a larger model and which are well-served by the small one. The model can reason about the tradeoff; you supply the empirical numbers it can't know — and you own the recommendation, because the numbers came from your hardware on your prompts, not its training set.

Check yourself

Why does 4-bit quantisation let a 7B model run on a laptop, and what does it cost you?
You read that a model scores well on a public leaderboard. Why is that not enough to deploy it for your alert triage?
A local model is frozen at its training cutoff. How does the operational pattern still let it help triage a CVE disclosed last week?

Comments

Sign in with GitHub to comment. Choose the type: Feedback (errors or suggestions on this page) · Hints (help for fellow learners — no spoilers) · General (anything else).