Module 10: VPN → ZTNA Migration

Type 12 · Migration / Brownfield: replace a legacy VPN on a flat network with identity-aware access incrementally, without an outage (strangler-fig). The deliverable is the migration runbook, a per-app cutover checklist, proof (logs) that no app went down, and a per-cohort rollback, not an essay.

Last reviewed: 2026-06 · (placement: between Module 06 and Module 07, or as the Phase-3 project — final number TBD at promotion)

Every greenfield Zero Trust Network Access (ZTNA) tutorial starts from an empty network and a clean IdP. Your job starts from a VPN that already exists, that the whole company logs into every morning, and that you are not allowed to break.

Difficulty: Intermediate–Advanced · Estimated time: ~4.5–6 hrs (study + lab) · Prerequisites: Module 01 — Zero Trust Principles (the flat-network breach), Module 06 — Identity-Aware Access (the proxy you migrate to)

In 60 seconds

Every ZTNA tutorial starts greenfield — an empty network, a clean IdP, nothing to break. The real job is brownfield: a decade-old VPN on a flat network that the whole company logs into every morning, and the one rule is no outage. The naive big-bang cutover (flip everything one Saturday) is the canonical disaster — one blast radius, no incremental rollback. The discipline is strangler-fig: run both paths, move one cohort at a time, prove each with a before/after test, keep a per-cohort rollback, and decommission the VPN last — when the un-migrated surface is provably zero.

Why this matters

Every ZTNA tutorial you have read — including Modules 05 and 06 of this track — starts from nothing. An empty network, a clean IdP, a backend with no users yet. You write the policy, stand up the proxy, publish the app with no inbound ports, and it works on the first try because there was never anything to break. That is greenfield, and almost no real ZTNA project starts there. The project you actually walk into is brownfield: an organization that has run a VPN for a decade, on a flat internal network, where the entire workforce — plus a half-dozen vendors and a fleet of CI runners — connects every morning and reaches a sprawl of internal apps that nobody has fully inventoried. Module 01 showed you exactly this shape and exactly why it is dangerous: a single legacy VPN account with no MFA was the entry point in the Colonial Pipeline 2021 breach, and once inside, the flat network let the intrusion spread. Your job is to retire that VPN and put every app behind identity-aware access instead. The one rule: no outage. The VPN cannot go dark on a Tuesday morning while 800 people are mid-task.

The naive move is the dangerous one, and it is the move almost every team is first tempted to make: pick a Saturday, stand up the new ZTNA proxy for everything at once, flip DNS, turn the VPN off, and go home. This is the big-bang cutover, and it fails for a reason that has nothing to do with whether ZTNA "works": you have changed the access path for every app and every user simultaneously. With an un-inventoried estate, something always breaks: an app that depended on a hardcoded internal IP, a service account whose token the proxy rejects, a vendor whose source range you forgot to allow. When it does, you cannot tell which of the hundred changes caused it, you cannot roll back just the broken slice, and you are debugging a total outage live while the whole company is locked out. The blast radius is the entire organization, and the rollback is "turn the VPN back on and admit the migration failed." This is why big-bang cutovers are the canonical migration disaster, and why the discipline that prevents them is the actual skill of this module.

The correct path is run both, move cohorts, prove each, then retire: the strangler-fig cutover used by teams that have done this at scale, such as Google's BeyondCorp migration. You do not flip a switch. You stand the new identity-aware path beside the running VPN and move one small cohort of users-and-apps across at a time behind a DNS or feature-flag cutover. After each move you prove with a before/after access test that the moved cohort still reaches its app, now through the proxy, while everyone still on the VPN is untouched. You keep a per-cohort rollback ready at every step, and only when the last cohort is across and proven do you close the old flat path and decommission the VPN. The un-migrated surface shrinks toward zero one provable step at a time, and at no point is more than one small cohort exposed to a change you can't reverse in minutes.

The core idea: you don't flip a switch; you run both, move cohorts, prove each, then retire the VPN

The mental model

The strangler fig (Martin Fowler, 2004): the rainforest vine grows around a host tree, gradually taking over until it can stand on its own, and the original is never cut down in a single stroke. Applied to access: you do not tear down the VPN and replace it in one cutover. You grow the identity-aware path around the running VPN, moving apps and users across one cohort at a time, so the surface still depending on the flat network shrinks toward zero while access never stops.

The mental model is the strangler fig (Martin Fowler, 2004), named for the rainforest vine that grows around a host tree, gradually taking over until it can stand on its own, with the original never cut down in a single stroke. Applied to network access: you do not tear down the VPN and replace it with ZTNA in one cutover. You grow the new identity-aware path around the running VPN, moving apps and users across one cohort at a time, and the surface still depending on the flat VPN network shrinks toward zero while access never stops. Big-bang cutover is the failure mode the pattern exists to prevent: changing every access path at once is a single blast radius with no incremental rollback, debugged live against the whole organization.

The mechanism that makes this incremental is that the two paths run side by side, and a per-cohort cutover switch decides which path a given app's traffic takes. Concretely, the legacy path is the VPN onto the flat network: a user connects, gets an internal IP, and reaches app.internal directly, so connectivity is the only check. The new path is the identity-aware proxy from Module 06: a request hits the proxy, which validates the caller's IdP-issued JWT against policy and only then forwards upstream, so identity is the check, on every request, with no inbound port on the backend. A cohort is the unit you move: a small, coherent group of apps plus the users who need them. Start with the lowest-risk one (an internal low-stakes web app used by a single team you can coordinate with), never the crown jewels first. The cutover is the switch that points that cohort's traffic at the proxy instead of the VPN route: in this lab a DNS change (point app.internal at the proxy) or a feature flag; in production a DNS record, a routing rule, or a per-group access policy. Because the two paths are independent, moving cohort 1 changes nothing for cohorts 2..N: they keep using the VPN, untouched, exactly as before.

flowchart LR
    C1(["cohort 1 (migrated)"]) -->|cutover: DNS → proxy| PX["identity-aware proxy<br/>(identity checked per request)"]
    C2(["cohorts 2..N (not yet)"]) -->|still on VPN| VPN["VPN → flat network"]
    PX --> APP["internal apps"]
    VPN --> APP
    PX -. rollback: point DNS back .-> VPN

The VPN is decommissioned only when the last cohort is across and the old flat path is provably closed.
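The per-cohort switch described above can be sketched as a small routing table. This is a minimal illustration, not the lab's implementation: `CutoverSwitch`, the two path constants, and the cohort names are hypothetical, and in the lab or in production the switch would be a DNS record, a feature flag, or an access policy rather than an in-memory dict.

```python
from dataclasses import dataclass, field
from typing import Dict

VPN = "vpn"      # legacy path: flat network, connectivity is the only check
PROXY = "proxy"  # new path: identity checked on every request


@dataclass
class CutoverSwitch:
    """Hypothetical per-cohort switch: cohort name -> current access path."""

    routes: Dict[str, str] = field(default_factory=dict)

    def register(self, cohort: str) -> None:
        # Every cohort starts on the legacy VPN path.
        self.routes[cohort] = VPN

    def cut_over(self, cohort: str) -> None:
        # Point one cohort at the proxy; no other entry is touched.
        self._require_known(cohort)
        self.routes[cohort] = PROXY

    def roll_back(self, cohort: str) -> None:
        # Rollback is the same switch flipped back; the VPN never went away.
        self._require_known(cohort)
        self.routes[cohort] = VPN

    def path_for(self, cohort: str) -> str:
        self._require_known(cohort)
        return self.routes[cohort]

    def _require_known(self, cohort: str) -> None:
        if cohort not in self.routes:
            raise KeyError(f"unknown cohort: {cohort}")


# Illustrative cohort names only.
switch = CutoverSwitch()
for name in ("wiki-team", "finance-apps", "ci-runners"):
    switch.register(name)

switch.cut_over("wiki-team")
```

After the single `cut_over` call, only `wiki-team` resolves to the proxy path; the other two cohorts still resolve to the VPN, which is the independence property the strangler-fig approach relies on.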

The discipline that makes it safe reduces to one feedback signal per cohort: the before/after access test, and the goal is "still reachable for authorized users, now via the proxy; the old flat path closing." Before you cut a cohort over, you record the baseline — the authorized user reaches the app and how (through the VPN/flat route). You cut over. Then you run the same test and prove three things: (1) the authorized user still reaches the app (no outage — the whole point); (2) they now reach it through the proxy (the request carries identity, the proxy logged an allow — the migration actually moved, it didn't just appear to); and (3) the old flat path to that app is now closed (a direct request to the backend on the flat network no longer connects — otherwise you have added a ZTNA path without removing the bypass, which is migration theater, not migration). Only when all three hold is the cohort migrated. If any fails, you roll back that cohort — flip the cutover switch back to the VPN route — and the cohort is serving on the old path again in minutes while you debug one small slice, not a company-wide outage. The rollback is cheap precisely because the VPN is still running: you never removed it, so backing out is "point this cohort's DNS back." Then you do the next cohort. Only after the last cohort is across and proven do you close the flat network and decommission the VPN — the one irreversible step, taken last, when the un-migrated surface is provably zero.
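The three assertions can be expressed as a tiny harness. The sketch below assumes the actual probes (a request as the authorized user, a lookup in the proxy's access log, a direct connection attempt to the backend on the flat network) are supplied as callables; `AccessProof`, `prove_cohort`, and the probe names are hypothetical, and how each probe is implemented depends on your lab or environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AccessProof:
    still_reachable: bool   # (1) authorized user reaches the app: no outage
    via_proxy: bool         # (2) the proxy logged an allow for that request
    flat_path_closed: bool  # (3) direct-to-backend no longer connects

    @property
    def migrated(self) -> bool:
        # All three must hold, or the cohort is not migrated.
        return self.still_reachable and self.via_proxy and self.flat_path_closed


def prove_cohort(
    user_reaches_app: Callable[[], bool],
    proxy_logged_allow: Callable[[], bool],
    backend_reachable_directly: Callable[[], bool],
) -> AccessProof:
    """Run the after-cutover probes and record the three assertions."""
    return AccessProof(
        still_reachable=user_reaches_app(),
        via_proxy=proxy_logged_allow(),
        # The third assertion is a negative: the bypass must NOT connect.
        flat_path_closed=not backend_reachable_directly(),
    )


# Migration theater: the app answers through the proxy, but the flat route
# to the backend still connects, so the cohort does not count as migrated.
theater = prove_cohort(lambda: True, lambda: True, lambda: True)

# A real migration: reachable, via the proxy, and the bypass is closed.
done = prove_cohort(lambda: True, lambda: True, lambda: False)
```

Keeping the third probe as a separate, negated input makes it hard to declare success on the strength of the first two alone.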

The honest gotcha that distinguishes a real migration from a checkbox one: a migration is only done when the old path is closed, not merely when the new path works. It is tempting to declare victory the moment the app answers through the proxy — but if the flat network still routes to the backend, every user (and every attacker who lands on the VPN) can still skip the proxy entirely, and you have spent the whole project building a front door while leaving the back door open. The before/after test's third assertion — the old flat path is closing — is the one teams skip and the one that matters most, because closing it per-cohort is what actually shrinks the breach surface Module 01 warned about. Decommissioning the VPN at the end is the final, organization-wide version of that same per-cohort close.
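One way to keep the final decommission honest is to gate it on the per-cohort results. The sketch below is an assumption about how you might record those results (a dict of cohort name to the set of assertions it has passed); `REQUIRED`, `unmigrated_surface`, and `can_decommission_vpn` are hypothetical names, not part of the lab.

```python
from typing import Dict, List, Set

# The three per-cohort assertions from the before/after test.
REQUIRED = frozenset({"still_reachable", "via_proxy", "flat_path_closed"})


def unmigrated_surface(results: Dict[str, Set[str]]) -> List[str]:
    """Cohorts that have not yet passed all three assertions, sorted by name."""
    return sorted(
        name for name, passed in results.items() if not REQUIRED <= passed
    )


def can_decommission_vpn(results: Dict[str, Set[str]]) -> bool:
    # An empty inventory proves nothing, so it never clears the gate.
    return bool(results) and not unmigrated_surface(results)


# Illustrative data: the second cohort works through the proxy but its
# flat path is still open, so the VPN must stay up.
results = {
    "wiki-team": {"still_reachable", "via_proxy", "flat_path_closed"},
    "finance-apps": {"still_reachable", "via_proxy"},
}
```

A cohort that only proves the new path works stays on the un-migrated list, so the irreversible step stays blocked until its old path is closed too.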

The gotcha

A migration is done only when the old path is closed, not when the new path works. Declare victory the moment the app answers through the proxy and — if the flat network still routes to the backend — every user and every attacker who lands on the VPN can skip the proxy entirely. The before/after test's third assertion (the old flat path is closing) is the one teams skip and the one that matters most: closing it per-cohort is what actually shrinks the Module-01 breach surface.

Go deeper: the before/after test proves three things

Before you cut a cohort, record the baseline — the authorized user reaches the app, and how (through the VPN/flat route). Cut over, then run the same test and prove all three: (1) the user still reaches the app (no outage), (2) now through the proxy (identity-checked, the proxy logged an allow), and (3) the old flat path is closed (direct-to-backend no longer connects). All three, or it isn't migrated — and if any fails, flip the cutover switch back to the VPN and debug one small slice, not a company-wide outage.

AI caveat

A model is genuinely useful on the bookkeeping — the per-app checklist, the cohort sequence, the proof table, the access-test harness. But ask it to "just migrate everything" and it cheerfully produces a big-bang plan, because that's the simplest thing to express. The judgment it can't do for you is sequencing by blast radius and, above all, verifying the third assertion — asked to "confirm the migration worked," it checks the app answers through the proxy and misses the open back door.

Learn (~2.5 hrs)

Build-first and migration-focused: read enough to understand why big-bang fails, how strangler-fig de-risks it, and what a real VPN→ZTNA cohort cutover looks like — then go to the lab.

The strangler-fig pattern: why incremental beats big-bang (~30 min)

- Martin Fowler, Strangler Fig Application (~15 min): the original 2004 essay that named the pattern. Read it for the why: gradual replacement around a running system de-risks what a big-bang rewrite cannot. The metaphor (grow around, then retire) maps one-to-one onto moving access cohorts across while the VPN keeps serving; this is the mental model the whole module rests on.
- Google BeyondCorp, Migrating to BeyondCorp: Maintaining Productivity While Improving Security (the migration paper) (~15 min): Google's own account of moving a real workforce off the VPN to identity-aware access incrementally, running access policies in a simulation/monitoring mode that proved each cohort would still have access before the cutover. Read it as the real-world proof that the cohort-by-cohort, prove-before-you-cut discipline is how this is actually done at scale, not a textbook ideal.

Why big-bang cutovers fail: the real pain (~40 min)

- CISA, Colonial Pipeline / DarkSide reporting and the legacy-VPN entry point (~20 min): re-anchor on why the legacy VPN you're retiring is the liability. An inactive VPN account with no MFA on a flat network was the entry, and the flat interior let it spread. Read the "Mitigations" section for what ZTNA fixes that the VPN couldn't; this is the why now behind the migration.
- NIST SP 800-207, Zero Trust Architecture, §7 "Migrating to a Zero Trust Architecture" (~20 min): the authoritative treatment of migration specifically. §7.2 ("Hybrid ZTA and Perimeter-Based Architecture") describes operating ZTA and perimeter-based access in parallel, which is the hybrid/coexistence period this lab reproduces, and §7.3 ("Steps to Introducing ZTA to a Perimeter-Based Architected Network") lays out the phased rollout that warns against the all-at-once swap. Read §7.2–7.3; it is the standards-body version of "run both, move cohorts."

What the new path is, concretely (~1 hr)

- Re-read your own Module 06: the Identity-Aware Access lab is the proxy you migrate to. The migration's "new path" is exactly that Pomerium-in-front-of-a-backend shape; this module's new contribution is running it beside a legacy path and cutting cohorts over. Skim it for the proxy data flow and the no-bypass / trust-only-the-proxy discipline you must preserve as you migrate.
- Tailscale, How to migrate from a legacy VPN (~20 min): a concrete vendor walkthrough of the side-by-side approach (run the new mesh alongside the old VPN, move users in groups, then turn the VPN off). Read it for the operational rhythm of a real cutover (overlap period, per-group moves, the "turn it off last" step), not as a tool endorsement; the same rhythm applies to the Pomerium/proxy path you build.

Key concepts

Brownfield is the ZTNA default, not the edge case — the VPN already exists, the whole org logs into it, and the estate is half-inventoried; "just stand up ZTNA" is greenfield advice that does not survive contact with a running network.
Big-bang cutover is the disaster the pattern prevents — changing every app's access path at once is one blast radius with no incremental rollback, debugged live against the whole organization. The before/after test and the per-cohort rollback exist because of this.
Strangler-fig: run both paths, move one cohort at a time — stand the identity-aware proxy beside the running VPN; a per-cohort DNS/feature-flag cutover decides which path an app's traffic takes. Moving cohort 1 changes nothing for the rest.
The before/after access test proves three things, not one: after each cutover, (1) the authorized user still reaches the app (no outage), (2) now through the proxy (identity-checked, proxy logged the allow), and (3) the old flat path to that app is closed (direct-to-backend no longer connects). All three, or it's not migrated.
A migration is done when the old path is CLOSED, not when the new one WORKS — leaving the flat route open is migration theater; every user and attacker can still skip the proxy. Closing it per-cohort is what shrinks the Module-01 breach surface.
Rollback is cheap because the VPN is still running — back a cohort out by flipping its cutover switch to the VPN route; it serves on the old path in minutes while you debug one slice. The VPN is decommissioned last, the one irreversible step, when the un-migrated surface is provably zero.
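Taken together, the concepts above describe a loop: cut over one cohort, prove it, and roll back only that cohort if the proof fails. The following is a minimal sketch of that loop under the assumption that cutover, proof, and rollback are supplied as callables; `migrate_in_cohorts` and the cohort names are hypothetical, and stopping at the first failure is one reasonable policy rather than the only one.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def migrate_in_cohorts(
    cohorts: Iterable[str],
    cut_over: Callable[[str], None],
    prove: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Move cohorts one at a time; stop and roll back at the first failed proof.

    Returns (cohorts migrated and proven, cohort rolled back or None).
    """
    migrated: List[str] = []
    for cohort in cohorts:
        cut_over(cohort)
        if not prove(cohort):
            roll_back(cohort)        # only this cohort returns to the VPN
            return migrated, cohort  # earlier cohorts stay migrated
        migrated.append(cohort)
    return migrated, None


# Simulated run: the second cohort fails its proof and is rolled back.
on_proxy = set()
order = ["wiki-team", "finance-apps", "ci-runners"]
migrated, rolled_back = migrate_in_cohorts(
    order,
    cut_over=on_proxy.add,
    prove=lambda cohort: cohort != "finance-apps",
    roll_back=on_proxy.discard,
)
```

In the simulated run the first cohort stays on the proxy, the failing cohort is back on the VPN, and the third cohort is never touched, so the blast radius of the failure is one cohort.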

AI acceleration

A model is genuinely useful at the planning and bookkeeping half of this migration — drafting the per-app cutover checklist, generating the cohort sequence (least-risky-first), turning your raw before/after curl output into a clean proof table, and writing the access-test harness that asserts the three conditions. That is real leverage on the tedious parts. But the posture is strict, because the dangerous instinct is the same one a stressed engineer has: AI drafts the plan → you decide the cohort order and own the cutover → you read the before/after proof yourself. Ask a model to "just migrate everything" or "write a cutover script" and it will cheerfully produce a big-bang plan — one flip, all apps — because that is the simplest thing to express and it doesn't carry the operational fear of a live outage. The judgment the model cannot do for you is sequencing by blast radius (which cohort is safe to move first, what its dependencies are, who to coordinate with) and, above all, verifying the third assertion — that the old flat path actually closed — because a model asked to "confirm the migration worked" will check that the app answers through the proxy and call it done, missing the open back door entirely. Make the model draft the checklist and the harness; you confirm every cohort has a tested rollback, that the proof shows no-outage and old-path-closed, and that the VPN comes down only after the last cohort is provably across. AI authors the runbook; you own the cutover.
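As an example of the bookkeeping a model can draft, here is a sketch of a least-risky-first cohort ordering. The `Cohort` fields and the sort key are assumptions chosen for illustration, not a scoring method from this module; the output is a starting draft that you review and reorder, since dependency and coordination judgment is exactly what the heuristic cannot see.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cohort:
    name: str
    users: int         # how many people lose access if the cutover fails
    dependencies: int  # known integrations that could break on cutover
    crown_jewel: bool  # business-critical apps are never moved first


def draft_sequence(cohorts: List[Cohort]) -> List[str]:
    """Draft a least-risky-first order; a human reviews and reorders it."""
    ranked = sorted(
        cohorts,
        # Non-critical cohorts first, then fewer users, then fewer dependencies.
        key=lambda c: (c.crown_jewel, c.users, c.dependencies, c.name),
    )
    return [c.name for c in ranked]


# Hypothetical estate for illustration.
estate = [
    Cohort("payroll", users=40, dependencies=6, crown_jewel=True),
    Cohort("team-wiki", users=12, dependencies=0, crown_jewel=False),
    Cohort("ci-runners", users=150, dependencies=9, crown_jewel=False),
]
sequence = draft_sequence(estate)
```

Here the small single-team wiki sorts first and the crown-jewel payroll cohort sorts last, regardless of its modest user count.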

Check yourself

Why does a big-bang cutover fail for reasons that have nothing to do with whether ZTNA "works"?
The before/after test asserts three things after each cohort cutover — what are they, and which one do teams most often skip?
Why is the per-cohort rollback cheap, and why is decommissioning the VPN the one irreversible step you take last?
