Module 07 — Automating the Web

Type 9 · Tool-Build: build an httpx + beautifulsoup4 scraper that extracts links, finds hidden endpoints, and follows redirect chains against a bounded local target, proven by a test_scraper.py that pins endpoint discovery, the off-host scope guard, and 404 resilience. Only scrape hosts you own or are permitted to test. (Secondary: Build-&-Operate, the scope and session handling that keeps the crawl safe at scale.)

Last reviewed: 2026-06

Python for Security — an unprotected endpoint is only hidden if nobody looks; a scraper looks.

Difficulty: Beginner · Estimated time: ~3–4 hrs (study + lab) · Prerequisites: Foundations

In 60 seconds

A web scraper is reconnaissance, data collection, and automation in one: httpx sends the requests, beautifulsoup4 parses the HTML, and together they cover ~80% of practical scraping. Apps leak their structure — /admin links, /api/v2/internal comments, paths buried in JS — and that structure is the map. The line between responsible automation and unauthorized access isn't the tool; it's scope and authorization: identify yourself, respect robots.txt, and only touch targets you own or are permitted to test.

Why this matters

Web scraping is a reconnaissance skill, a data-collection skill, and an automation skill in one. Security engineers scrape for exposed configuration files, forgotten admin endpoints, and leaked credentials. Automation engineers scrape to collect threat data from sites that don't offer an API. The same techniques apply to both, and the line between "responsible automation" and "unauthorized access" is authorization and scope — not the tool.

Objective

Use httpx and beautifulsoup4 to scrape a local web application: extract all links, identify hidden or unlisted endpoints, and follow a redirect chain — and prove it with a test you wrote: a test_scraper.py that asserts the hidden endpoint is discovered, the off-host guard holds, and a 404 doesn't crash the crawl. Building the scraper and committing a test that pins those behaviours are equal halves — all within a clearly bounded local target, never against external hosts without permission.

The core idea

HTTP is a text protocol. A web scraper is a program that sends HTTP requests and parses the text responses — mechanically identical to what a browser does, minus the JavaScript execution and the display rendering. httpx handles the HTTP; beautifulsoup4 handles the HTML parsing. The combination covers about 80% of practical web-scraping needs.

The mental model

A scraper is a browser minus the JavaScript and the rendering — it sends HTTP and parses text. For security work, the payload isn't the page content, it's the structure: the unlinked endpoint, the commented-out API path, the script that embeds an internal route. You're reading the app's map, not its words.

The key insight for security work is that web applications leak information through their structure: links to /admin, comments referencing /api/v2/internal, action= attributes pointing to unlinked endpoints, JavaScript files that embed API paths. These are not vulnerabilities in themselves, but they are the map. soup.find_all("a", href=True) gets you every explicit link; soup.find_all(True) gets you every tag; a regex on the raw HTML source (re.findall(r'/[\w/.-]+', response.text)) surfaces paths embedded in scripts, comments, and attributes that a search for <a> tags never returns. The pattern is deliberately loose, so it also matches noise such as the /a in a closing </a> tag; filter the results before treating them as endpoints.
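The three extraction layers described above might look like this on a small, made-up page. The HTML and every path in it are hypothetical. The regex is the one from the text with one addition: a negative lookbehind that skips the slash in closing tags such as `</a>`, which the bare pattern would otherwise report as `/a`.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical page: two linked paths, three that are never linked.
HTML = """
<html><body>
  <a href="/home">Home</a>
  <a href="/docs/guide.html">Guide</a>
  <!-- TODO: remove /api/v2/internal before launch -->
  <form action="/admin/upload" method="post"></form>
  <script>fetch("/api/v1/users");</script>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Layer 1: explicit links.
links = {a["href"] for a in soup.find_all("a", href=True)}

# Layer 2: form targets, which are often unlinked endpoints.
actions = {form["action"] for form in soup.find_all("form", action=True)}

# Layer 3a: HTML comments, which the parse tree exposes as Comment nodes.
comments = [str(c) for c in soup.find_all(string=lambda s: isinstance(s, Comment))]

# Layer 3b: regex over the raw source. (?<!<) drops closing-tag noise.
raw_paths = set(re.findall(r"(?<!<)/[\w/.-]+", HTML))

# Anything in the raw source that no <a> tag points at is a candidate.
hidden = raw_paths - links
print(sorted(hidden))
```

The set difference is the useful part: it separates what the site advertises from what it merely mentions.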

Session handling is the next layer: cookies, authentication tokens, and CSRF tokens are what carry a web application's state. httpx.Client (the session object) automatically stores and resends cookies across requests, so you log in once and every subsequent request carries the session. This is also what makes authenticated scraping work: send a POST to the login endpoint, let the client store the session cookie from the response, then scrape pages that require authentication. CSRF tokens are not handled for you; parse them out of the form HTML and send them back with the POST.

The gotcha

Scraping a system you don't own carries the same authorization bar as a port scan — written permission, defined scope. The most common failure is a crawler that follows a link off the target domain; without a urlparse(url).netloc == target_host guard and a depth limit, your scoped recon quietly becomes unauthorized access to someone else's host. For this lab, the target is a local Flask app you run yourself — no external targets, ever.
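The guard and the depth limit can be sketched with the standard library alone. In this sketch `get_links` is a hypothetical callable that returns the hrefs found at a URL, and `SITE` is an in-memory stand-in for the local app so the crawl logic can be exercised without sending a request.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def in_scope(url: str, target_host: str) -> bool:
    """True only for http(s) URLs whose netloc exactly equals the target."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.netloc == target_host


def crawl(start_url: str, get_links, max_depth: int = 2) -> list:
    """Breadth-first crawl that never leaves the start URL's host."""
    target_host = urlparse(start_url).netloc   # note: includes the port
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit: visit the page but do not expand it
        for href in get_links(url):
            absolute = urljoin(url, href)      # resolve relative links first
            if in_scope(absolute, target_host) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# Hypothetical link graph standing in for the local lab app.
SITE = {
    "http://127.0.0.1:5000/": ["/a", "https://example.com/evil"],
    "http://127.0.0.1:5000/a": ["/b"],
    "http://127.0.0.1:5000/b": ["/c"],
    "http://127.0.0.1:5000/c": [],
}

visited = crawl("http://127.0.0.1:5000/", lambda url: SITE.get(url, []), max_depth=2)
print(visited)
```

Comparing the full `netloc` rather than using a substring or `endswith` check is deliberate: a URL such as `http://127.0.0.1:5000@evil.example/` has a different netloc and is rejected.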

Go deeper: finding what isn't linked, and session state

soup.find_all("a", href=True) gets explicit links, but a regex on the raw source (re.findall(r'/[\w/.-]+', response.text)) surfaces paths embedded in scripts, comments, and attributes that a search for <a> tags never returns, and that is often where the interesting endpoints hide. For authenticated scraping, httpx.Client stores and resends cookies across requests, so a single POST to the login endpoint carries the session through every subsequent page.

Learn (~2 hrs)

httpx sessions (~30 min)
- httpx, Client documentation: focus on the "Client Instances" and "Cookies" sections; understand how a session maintains state across requests.

BeautifulSoup4 (~1 hr)
- BeautifulSoup4, Official documentation: read the "Quick Start", "Kinds of objects", and "Searching the tree" sections; find_all() is 90% of what you'll use.
- Web Scraping with BeautifulSoup (Real Python): worked example with good coverage of common patterns; skip the "Interact with HTML Forms" section (that's for browser automation).

Responsible scraping (~30 min)
- robots.txt specification (robotstxt.org): understand what it says and what it does not say legally; short page, worth reading in full.
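To make the robots.txt reading concrete, here is a small sketch using the standard library's `urllib.robotparser`. The rules and the `module07-lab-scraper` User-Agent string are hypothetical, and the file is parsed from a string rather than fetched. Note that the file is only a statement of the operator's wishes; honouring it does not grant authorization.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the local lab target.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

USER_AGENT = "module07-lab-scraper"  # hypothetical identifier for this crawler

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory instead of fetching

allowed_home = parser.can_fetch(USER_AGENT, "http://127.0.0.1:5000/")
allowed_admin = parser.can_fetch(USER_AGENT, "http://127.0.0.1:5000/admin/users")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests

print(allowed_home, allowed_admin, delay)
```

A crawler that wants to honour these rules would call `can_fetch` before each request and sleep for `delay` seconds between requests.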

Key concepts

httpx.Client as a session: cookies and headers persist across requests
BeautifulSoup4 parse tree: find_all(), get(), .text, .attrs
Extracting links from HTML vs. extracting paths from raw text (regex on response.text)
Following redirect chains with httpx (it does not follow redirects by default; pass follow_redirects=True to follow them, then read response.history to inspect each hop)
Responsible automation: User-Agent, robots.txt, rate-limiting, scope limitation
Verify by test, not by eye: a learner-written test_scraper.py that asserts hidden-endpoint discovery, the off-host guard, and 404 resilience — the ownership half, not a diff against make demo

AI acceleration

A model will write the scraper quickly. The missing piece is usually the scope: the model will write code that follows links off the target domain if you don't specify urlparse(url).netloc == target_host as a filter. Test it against a page that links to an external host: does it follow the link? Does it have a depth limit? These are the production safety checks.

AI caveat

A model writes the scraper quickly and omits the scope guard by default: its crawl will happily walk off the target domain unless you specify the netloc filter. Test it against a page that links off-host: does it follow? Does it bound depth? Those checks are what keep the crawl inside its authorized scope.

Check yourself

What separates "responsible automation" from "unauthorized access" — and what is it not?
Why scan the raw response.text with a regex when BeautifulSoup already finds the links?
How does httpx.Client make authenticated scraping work across multiple requests?
