Module 10 — Packaging, Testing & Owning AI Code¶

Type 14 · Adversarial Review — take a deliberately flawed AI-generated security script, hunt the bugs with static analysis and testing, write pytest tests that pin each one, and fix until green — the review workflow every AI-authored change must clear before production. (Secondary: Eval Harness — your test suite is the spec, not vibes.) Go to the hands-on lab →

Last reviewed: 2026-06

Python for Security — the code an AI generates is a first draft; the tests you write turn it into software.

Difficulty: Beginner · Estimated time: ~3.5–4.5 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

AI writes the script; that's no longer the skill. What ships reliable tooling is the review-and-test loop: a five-pass review (bandit/ruff, input validation, SQL/shell/path injection, network error handling, secrets) catches ~80% of the production bugs in a typical AI-generated security script. Then test-driven debugging — write the failing test first, confirm it fails, fix, confirm green — proves you fixed the bug rather than masking the symptom. pyproject.toml entry points turn the proven script into an installable command.

Why this matters¶

AI will write the script. That is no longer the differentiating skill. What separates an engineer who ships reliable security tooling from one who ships interesting demos is the review-and-test loop: does it handle bad input? Does it do what it claims? Does the regex actually match what it says? These questions are answered by tests, not by reading the code. Every security script that runs unattended in production needs a test that proves it behaves correctly on the edge cases that will eventually happen.

Objective¶

Given a deliberately flawed AI-generated security script, identify the bugs through static analysis and testing, write pytest tests that catch each one, and fix the script until all tests pass — practicing the review workflow that should precede any AI-authored code going to production.

The core idea¶

The most dangerous property of AI-generated code is that it looks right. The structure is idiomatic, the variable names are sensible, the comments are coherent — and buried inside is an off-by-one in the regex character class, a SQL query built by string concatenation (SQL injection), a missing try/except around the network call that will kill the process on the first timeout, and a hardcoded credential in the test fixture that will end up in git. None of these are obvious from a casual read. All of them are obvious from a systematic review.

The gotcha

The most dangerous property of AI-generated code is that it looks right — idiomatic structure, sensible names, coherent comments — with a string-concatenated SQL query or a missing timeout buried inside. None of it survives a systematic review; almost none of it is caught by a casual read. Reading the code is not reviewing it.

The review checklist is not long: (1) run bandit and ruff; read every finding; don't dismiss MEDIUM findings. (2) Check every function that takes user input — does it validate before using? (3) Find every SQL query, shell command, and file path that involves user-provided data — is it parameterized? (4) Find every network call — is there a timeout? Is the error handled? (5) Check every secret — is it coming from an environment variable, or is it a string literal? Five questions, five passes, and you've caught 80% of the production bugs in a typical AI-generated security script.

pytest is the standard Python testing framework and the right tool for this module. Write tests before you fix — this is the test-driven debugging workflow: write a test that captures the broken behavior, confirm it fails, fix the code, confirm it passes. This sequence ensures you actually fixed the bug and not just made the symptom disappear. A fix that makes the test pass is a fix you can trust.

flowchart LR
    B["spot bug<br/>(review / bandit)"] --> W["write test for<br/>broken behavior"]
    W --> RF{"run: fails?"}
    RF -->|"no — passes"| W2["test doesn't<br/>pin the bug; rewrite"]
    W2 --> RF
    RF -->|"yes — red"| FX["fix the code"]
    FX --> RG{"run: green?"}
    RG -->|no| FX
    RG -->|yes| D["bug pinned shut"]

The mental model

Write the failing test before the fix. If you fix first and the test passes, you've proven nothing — you can't tell a real fix from a coincidence. The test that you watched fail, then watched pass, is the one that pins the bug shut against the next refactor.

Packaging with pyproject.toml is the final step that makes a script into a distributable tool. [project] section declares the name, version, dependencies, and entry points; pip install . installs it. The entry point [project.scripts] ioc-check = "ioc_check.cli:app" is what turns python ioc_check.py into the command ioc-check. This matters for security tools that get installed on jump servers and shared via internal package mirrors — the install process is the same as any other Python package, with a pinned requirements.txt and a known version.

AI caveat

Use a model to draft the tests — it covers the happy path and stops. Push it: "test the API timeout," "test the empty-string input," "test for SQL injection." Each prompt surfaces a test it didn't write spontaneously, which usually means the code doesn't handle that case either. The gaps in the test suite are a map of the bugs.

Learn (~2.5 hrs)¶

pytest fundamentals (~1 hr) - pytest — Getting Started (official docs) — read through "How to write and run tests" and "Fixtures"; understand assert, parametrize, and tmp_path fixture. - Effective Python Testing With pytest — Real Python — deeper treatment of fixtures, marks, and mocking; the section on monkeypatch is especially useful for patching API calls.

Packaging (~1 hr) - Python Packaging User Guide — Writing your pyproject.toml — the canonical reference; read the "Declaring the project name", "Dependencies", and "Entry points" sections. - hatch — Modern Python project management — the modern build backend; skim the "Getting started" section to understand how pyproject.toml + hatch replaces setup.py.

Reviewing AI code (~30 min) - ruff — Security rule set (S rules) — the specific rules that catch security bugs in Python; skim the list so you know what to enable.

Key concepts¶

The five-pass AI code review: bandit/ruff, input validation, SQL/shell/path injection, network error handling, secrets
Test-driven debugging: write the failing test first, then fix
pytest.parametrize for testing multiple inputs with one test function
pyproject.toml entry points: turning a script into an installable command
unittest.mock.patch for patching API calls in tests without network access

AI acceleration¶

Use a model to write the initial tests — it will cover the happy path. Then push it: "Write a test for when the API call times out." "Write a test for when the input is an empty string." "Write a test for SQL injection." Each of those prompts uncovers a test the model didn't write spontaneously, which usually means the code doesn't handle it correctly either. The gaps in the test suite are a map of the bugs.

Check yourself

Name the five passes of the AI-code review and roughly what fraction of production bugs they catch.
In test-driven debugging, why write the test before the fix instead of after?
What does a [project.scripts] entry point in pyproject.toml actually do to a script?

Comments

Sign in with GitHub to comment. Choose the type: Feedback (errors or suggestions on this page) · Hints (help for fellow learners — no spoilers) · General (anything else).