Module 02: Files, Regex & Log Parsing

Type 9 · Tool-Build: build a stdlib-only log parser (re, pathlib, collections) that extracts failed-login IPs from a realistic SSH auth log, flags brute-force patterns, and reports the top offenders. (Secondary: Eval Harness, verifying the parser on positive and negative cases, not just the happy path.)

Last reviewed: 2026-06

Python for Security — logs are the ground truth; parsing them is the first craft.

Difficulty: Beginner · Estimated time: ~3.5–4.5 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

Every security tool eventually emits text (auth events, IDS alerts, firewall drops), and the first craft is pulling reliable signal out of it. The trap is a regex that matches only most lines (silently dropping the rest) or one that matches too broadly (pulling in noise). Write the pattern against the documented format with named groups, compile it once, iterate line by line to keep memory use flat, and use Counter plus a sliding-window deque to turn failed logins into a brute-force verdict. Then verify on positive and negative cases before you trust the count.

Why this matters

Every security tool eventually outputs text: syslog lines, auth events, IDS alerts, firewall drops. Before you can graph, alert, or escalate anything, you have to extract signal from that text reliably. The analyst who can write a regex that plucks IPs from a noisy SSH auth log in thirty seconds has a permanent edge over one who reaches for a GUI every time.

Objective

Write a Python script that parses a realistic SSH authentication log, extracts failed-login IPs with timestamps, identifies brute-force patterns, and outputs the top offenders — using only the standard library (re, pathlib, collections).

The core idea

Log parsing has two failure modes that look like success: the regex that matches most lines (and silently drops the rest) and the one that matches too broadly (and pulls in noise that looks like signal). The discipline is to write the regex against the actual format documented by the application, not by eyeballing a few samples, and then to verify it on both a positive case and a negative case before trusting the output.

The mental model

A parser isn't done when it works on your sample — it's done when you've proven it on a line that should match and a line that shouldn't. The format you code against is the documented one, not the three samples you happened to look at.
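
The positive/negative check can be sketched in a few lines. This is a minimal illustration, not a complete parser: the pattern assumes the common OpenSSH "Failed password" line layout, and the hostname, PIDs, and addresses in the sample lines are made up.

```python
import re

# Assumed pattern for the common OpenSSH "Failed password" line.
# Check it against the format your sshd version actually emits.
FAILED_LOGIN = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) port (?P<port>\d+)"
)

# Illustrative lines only (hypothetical host, PIDs, and addresses).
should_match = (
    "Jun 12 03:14:07 bastion sshd[2211]: Failed password for invalid user "
    "admin from 203.0.113.7 port 52144 ssh2"
)
should_not_match = (
    "Jun 12 03:15:00 bastion sshd[2230]: Accepted password for alice "
    "from 198.51.100.4 port 40022 ssh2"
)

positive = FAILED_LOGIN.search(should_match)
negative = FAILED_LOGIN.search(should_not_match)

# Positive case: the line must parse, and the fields must be the right ones.
assert positive is not None, "pattern missed a line it should match"
assert positive.group("ip") == "203.0.113.7"
assert positive.group("user") == "admin"

# Negative case: a successful login must not be counted as a failure.
assert negative is None, "pattern matched a successful login"
```

Checking the captured fields, not just whether a match occurred, catches the case where the pattern matches but grabs the wrong token.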

The gotcha

A parser that misses 5% of failed logins because of a minor sshd format variation is worse than no parser at all — because you will believe its output. The silent miss is the failure mode, not the crash.
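
One way to make the silent miss visible is to count lines that look like failures but did not parse. The sketch below assumes that the substring "Failed password" is a good enough marker for candidate lines. The helper name audit_coverage and the sample lines are hypothetical.

```python
import re

# IPv4-only pattern, deliberately narrow so the audit has something to catch.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) port \d+"
)

def audit_coverage(lines):
    """Split candidate failure lines into those the regex parsed and those it missed."""
    matched, missed = [], []
    for line in lines:
        if "Failed password" not in line:
            continue  # not a candidate failure line at all
        if FAILED_LOGIN.search(line):
            matched.append(line)
        else:
            missed.append(line)
    return matched, missed

# Illustrative lines only (hypothetical host and addresses).
sample = [
    "Jun 12 03:14:07 bastion sshd[2211]: Failed password for root from 203.0.113.7 port 52144 ssh2",
    "Jun 12 03:14:08 bastion sshd[2211]: Failed password for root from 2001:db8::7 port 52146 ssh2",
    "Jun 12 03:15:00 bastion sshd[2230]: Accepted password for alice from 198.51.100.4 port 40022 ssh2",
]

matched, missed = audit_coverage(sample)
print(f"parsed {len(matched)}, missed {len(missed)}")
for line in missed:
    print("UNPARSED:", line)
```

A non-zero missed count is the signal to revisit the pattern before trusting any totals. Here the IPv6 line is the one that slips through.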

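Python's re module implements Perl-style regular expressions on a backtracking engine, not POSIX ERE, and it supports named groups. Named groups, written (?P<name>...), are the difference between a regex that produces a tuple of strings and one whose match.groupdict() gives you a dictionary you can reason about. Always use named groups for fields you intend to act on: (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) is self-documenting; (\d+\.\d+\.\d+\.\d+) is a mystery three months later. Neither pattern validates octet ranges (both accept 999.999.999.999), so treat the capture as a candidate rather than a verified address. Compile the pattern once with re.compile() outside any loop. The module-level functions such as re.match() cache recently compiled patterns, so a string literal is not recompiled on every call, but each call still pays for a cache lookup; a precompiled pattern object avoids that and gives the pattern one named, testable home.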
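
As a small sketch of both habits, here is a pattern compiled once at module level with named groups. The name IP_PATTERN and the sample line are illustrative only.

```python
import re

# Compiled once at import time, then reused for every line.
IP_PATTERN = re.compile(
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) port (?P<port>\d+)"
)

line = "Failed password for root from 203.0.113.7 port 52144 ssh2"  # illustrative

# search() scans the whole line; match() would only try at position 0.
match = IP_PATTERN.search(line)
if match:
    fields = match.groupdict()  # named groups become dictionary keys
    print(fields)  # {'ip': '203.0.113.7', 'port': '52144'}
```

Note the use of search() rather than match(): syslog lines start with a timestamp and hostname, so a pattern for a mid-line fragment needs to be allowed to start anywhere.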

pathlib.Path is the modern replacement for os.path string manipulation. Path("/var/log") gives you an object with .name, .suffix, .read_text(), and .glob(), and the / operator joins path components, so there is no more string concatenation to build file paths. For large log files, iterate line by line (with path.open() as f: then for line in f) rather than calling .read_text().splitlines(). The latter loads the entire file into memory, which matters when sshd has been logging for a year. The with block also guarantees the file handle is closed.
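
A minimal sketch of line-by-line reading with pathlib follows. The helper name iter_lines is hypothetical, and the demo writes a throwaway file so the example runs anywhere. In practice the path would point at a real log such as Path("/var/log/auth.log").

```python
import tempfile
from pathlib import Path

def iter_lines(path: Path):
    """Yield one line at a time so memory use stays flat regardless of file size."""
    # errors="replace" is an assumption: logs can contain bytes that are not
    # valid UTF-8, and one bad byte should not abort the whole parse.
    with path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")

# Demo on a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "auth.log"   # "/" joins path components
    log_path.write_text("first line\nsecond line\n", encoding="utf-8")
    lines = list(iter_lines(log_path))
    name, suffix = log_path.name, log_path.suffix

print(name, suffix, lines)
```

Because iter_lines is a generator, a caller that feeds it straight into a loop never holds more than one line in memory. The list() call here is only for the demo.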

Brute-force detection from logs is a counting problem: group failed-login attempts by source IP, count them in a time window, and flag the ones that exceed a threshold. collections.Counter is the right tool for the aggregate; a dictionary of collections.deque (with maxlen=N) is the right tool for the sliding window. Note that maxlen bounds the number of entries, not their age: store each IP's last N failure timestamps, and flag the IP when the deque is full and its oldest and newest entries fall within the window. Neither structure requires a database. For a few thousand IPs, an in-memory structure is fast enough and simpler to reason about. The moment your window needs persistence across restarts, you add a database; don't add one before you need it.
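
The counting approach can be sketched as follows. This assumes events arrive in chronological order as (timestamp, ip) pairs that have already been parsed. THRESHOLD, WINDOW, and the function name detect are hypothetical values chosen for illustration, not recommended settings.

```python
from collections import Counter, defaultdict, deque
from datetime import datetime, timedelta

THRESHOLD = 3                    # hypothetical: failures needed to flag an IP
WINDOW = timedelta(seconds=60)   # hypothetical: span they must fall within

def detect(events):
    """events: iterable of (timestamp, ip) pairs in chronological order."""
    totals = Counter()                                      # all-time count per IP
    recent = defaultdict(lambda: deque(maxlen=THRESHOLD))   # last N timestamps per IP
    flagged = set()
    for ts, ip in events:
        totals[ip] += 1
        window = recent[ip]
        window.append(ts)  # once full, the oldest timestamp drops off automatically
        # Full deque + oldest entry still inside the window = N failures in WINDOW.
        if len(window) == THRESHOLD and ts - window[0] <= WINDOW:
            flagged.add(ip)
    return totals, flagged

base = datetime(2026, 6, 12, 3, 14, 0)  # illustrative timestamps
events = [
    (base, "203.0.113.7"),
    (base + timedelta(seconds=5), "203.0.113.7"),
    (base + timedelta(seconds=9), "203.0.113.7"),     # 3 failures in 9 seconds
    (base + timedelta(seconds=10), "198.51.100.4"),
    (base + timedelta(minutes=30), "198.51.100.4"),
    (base + timedelta(minutes=59), "198.51.100.4"),   # 3 failures, but spread out
]

totals, flagged = detect(events)
print(totals.most_common(2))
print(flagged)
```

Both addresses have the same total, but only the first is flagged, which is the difference between the aggregate count and the windowed verdict. Classic syslog timestamps omit the year, so turning them into datetime objects needs an assumption about the year that this sketch leaves out.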

flowchart LR
    L["auth.log<br/>(line by line)"] --> R["compiled regex<br/>(named groups)"]
    R -->|"failed login"| C["Counter<br/>per source IP"]
    R -->|"failed login"| W["deque(maxlen=N)<br/>sliding window"]
    C --> T{"over threshold<br/>in window?"}
    W --> T
    T -->|yes| F["flag: brute-force"]
    T -->|no| S["ignore"]

Go deeper: why named groups and re.compile aren't optional

(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) is self-documenting and, through match.groupdict(), yields a dict you can reason about; (\d+\.\d+\.\d+\.\d+) is a mystery three months later. Compile the pattern once outside the loop as well: re.match() with a string literal inside a million-line loop does not recompile each time (the re module caches recent patterns), but it does repeat the cache lookup on every call. These two habits are the difference between a script that stays readable and fast as the log grows and one that quietly wastes CPU.

AI caveat

A model writes the regex instantly — and it's usually subtly wrong on edge cases (IPv6, extra whitespace, a slightly different sshd version). Feed it the actual log lines, ask it to name the groups, then test the output on at least five lines it has never seen. The validation is the skill.

Learn (~2.5 hrs)

Regex fundamentals (~1 hr)
- Regular Expressions HOWTO (Python docs): the canonical reference; the "Grouping" and "Non-capturing and Named Groups" sections are the ones to read carefully.
- regex101.com: not a tutorial, but the fastest way to iterate on a pattern against a real log sample; paste a few lines and experiment.

File handling (~30 min)
- pathlib, Object-oriented filesystem paths (Python docs): focus on Path, read_text(), open(), glob(); this replaces os.path for everything you'll do here.

Collections for log analytics (~1 hr)
- collections, Container datatypes (Python docs): specifically Counter and defaultdict; these two handle 80% of the aggregation patterns in log analytics.

Key concepts

Named capture groups for self-documenting regex patterns
re.compile() once; match in a loop — not the reverse
pathlib.Path for portable, readable file operations
Line-by-line iteration for memory-efficient log processing
Counter for frequency analysis; sliding-window deque for time-windowed detection
Verifying parsers on both positive and negative test cases

AI acceleration

A model will write the regex for you instantly — and it will usually be subtly wrong on edge cases (IPv6, lines with extra whitespace, slightly different sshd format versions). Give it a sample of the actual log lines you're parsing, ask it to name the groups, then test its output on at least five lines it has not seen. The validation step is the skill; the regex is just the draft.
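
A small validation battery for a drafted pattern might look like the sketch below. It assumes a deliberately loose capture after "from" and lets the standard-library ipaddress module decide what counts as an address. The pattern, the helper name extract_ip, and the test lines are all illustrative.

```python
import ipaddress
import re

# Assumed loose pattern: tolerate repeated whitespace and capture whatever
# token follows "from", then validate it separately.
CANDIDATE = re.compile(
    r"Failed password for (?:invalid user )?\S+\s+from\s+(?P<ip>\S+)\s+port\s+\d+"
)

def extract_ip(line):
    """Return a validated IPv4/IPv6 address string, or None if the line does not qualify."""
    match = CANDIDATE.search(line)
    if not match:
        return None
    try:
        return str(ipaddress.ip_address(match.group("ip")))
    except ValueError:
        return None  # shaped like an address but not a valid one

# Five lines the pattern was not written against, each with the expected result.
unseen = {
    "Failed password for root from 203.0.113.7 port 22 ssh2": "203.0.113.7",
    "Failed password for root from 2001:db8::7 port 22 ssh2": "2001:db8::7",
    "Failed password for root  from  203.0.113.9  port 22 ssh2": "203.0.113.9",
    "Failed password for root from 999.1.1.1 port 22 ssh2": None,
    "Accepted password for root from 203.0.113.7 port 22 ssh2": None,
}

for line, expected in unseen.items():
    assert extract_ip(line) == expected, line
```

Separating "find the token" from "validate the token" keeps the regex short and moves address validation into code that was written for that job.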

Check yourself

What are the two ways a log-parsing regex can "succeed" while actually being broken?
Why compile the pattern with re.compile() outside the loop instead of calling re.match() with a literal inside it?
For windowed brute-force detection, why a deque(maxlen=N) rather than a growing list?
