Lab 02 — File Triage & Identification

Hands-on lab

Setup

git clone https://github.com/plaintext-security/plaintext-labs
cd plaintext-labs/malware/02-file-triage
make up
make fetch-sample      # pulls a small real corpus (Agent Tesla, AsyncRAT, a UPX-packed PE) into the isolated container
make demo

⚠ This lab triages live malware samples. Handle them accordingly.

  • Static only. This module never executes a sample; triage is file, section entropy, and Detect-It-Easy (DIE). Do not run them.
  • Isolation. All work stays inside the isolated container (network-internal, no host mounts that copy out); never move a sample to your host.
  • Hygiene. The corpus is fetched at lab time (password-protected zips, password: infected) and is never committed; .gitignore covers samples/. make fetch-sample needs a free abuse.ch Auth-Key (set MB_AUTH_KEY).
  • Offline fallback. No key, or MalwareBazaar unreachable? Skip make fetch-sample; make demo falls back to the bundled synthetic corpus in data/samples/ (one clean PE, one richer PE, a DLL, an ELF, a PDF, a ZIP) so you can still exercise the full triage workflow.

Scenario

You're the first analyst on a SOC triage queue. Overnight, three flagged binaries and two attachments (a ZIP and a PDF) landed in the bucket, and nothing downstream moves until you say what each one is. Two of the binaries are real, named families: Agent Tesla, a .NET infostealer that keylogs and exfiltrates credentials over SMTP (MITRE S0331), and AsyncRAT, an open-source remote-access trojan (MITRE S1087). The third is a UPX-packed PE whose real code is compressed out of reach until it's unpacked. Your job is first-pass triage: identify the true type of each file regardless of extension, flag anything packed or obfuscated, and write a one-line routing decision so the deep-analysis team knows which path each sample takes.

make fetch-sample stages that real corpus in samples/ (mounted read-only at /lab/samples); the bundled synthetic set in data/samples/ is the offline fallback. Triage routes by what you find: a clean-ish PE like Agent Tesla goes to static-strings-pe, the UPX-packed PE goes to unpack-first (its high section entropy gives it away), and a RAT like AsyncRAT — built to be configured and run — goes to dynamic-behavioural once static triage has confirmed it's a PE worth detonating in a sandbox.
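The routing rule described above can be sketched as a small lookup. This is only an illustration, not the lab's reference solution: the function name route, the needs_runtime_config flag, and the manual-review fallback label are hypothetical; the classification and route names are the ones this lab uses.

```python
def route(classification: str, needs_runtime_config: bool = False) -> str:
    """Map a first-pass classification to a next-step route (illustrative only)."""
    if classification == "packed-PE":
        # Packed code has to be unpacked before strings or imports mean anything.
        return "unpack-first"
    if classification == "clean-PE":
        # A RAT-style sample that is built to be configured and run goes to the
        # sandbox; other readable PEs go to static string/import analysis.
        return "dynamic-behavioural" if needs_runtime_config else "static-strings-pe"
    # Hypothetical placeholder for anything this sketch does not cover
    # (ELF, PDF, archive); the lab leaves that decision to the analyst.
    return "manual-review"
```

Whether a sample "needs runtime configuration" is an analyst judgment grounded in the family, so the sketch takes it as an input instead of trying to infer it.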

Do

Run these against the real corpus in /lab/samples (or the synthetic set in /lab/data/samples if you're offline).

  1. [ ] Run file against every sample and record the output. Note where the extension matches the detected type and where it doesn't. (Hint: MalwareBazaar names samples by hash, so a PE may arrive with no .exe extension; file is the ground truth, not the name.)

  2. [ ] Calculate section entropy for each PE file. Use pefile (Python) to dump each PE's sections with their entropy, and flag any section above 7.0 (the scale runs from 0 to 8 bits per byte). The UPX-packed sample should light up here; Agent Tesla and AsyncRAT (uncompressed PEs) should not. (Hint: a pefile.PE object has a sections attribute, and each section has a get_entropy() method.)

  3. [ ] Run die-cli (Detect-It-Easy) against each sample. Record: detected format, compiler/linker, and any packer detection. DIE should name UPX on the packed sample and the .NET runtime on Agent Tesla. (Hint: die -j <file> gives JSON output.)

  4. [ ] Classify each file into one of: clean-PE, packed-PE, ELF, PDF, archive. Document your reasoning in one sentence per sample. (Agent Tesla / AsyncRAT → clean-PE structurally even though they're malicious; the UPX sample → packed-PE.)

  5. [ ] Write a triage report. For each file: true type, entropy assessment, packer/compiler fingerprint (if PE), and recommended next step (static-strings-pe, dynamic-behavioural, unpack-first, not-malware). Ground each routing in the family: Agent Tesla → static-strings-pe, the UPX-packed PE → unpack-first, AsyncRAT → dynamic-behavioural.
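To make the entropy number in step 2 less of a black box, here is a minimal standard-library sketch of byte-level Shannon entropy and the 7.0 threshold check. It assumes the usual definition in bits per byte and operates on raw bytes; the function names are hypothetical, and in the lab you would feed it per-section data or simply use the get_entropy() method the hint mentions.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 7.0  # from step 2: flag any section above 7.0


def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def is_high_entropy(data: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """True when entropy is strictly above the threshold."""
    return shannon_entropy(data) > threshold


uniform = bytes(range(256)) * 16  # every byte value equally likely
padding = b"\x00" * 4096          # one repeated byte value
```

Here uniform sits at the top of the scale and padding at the bottom. Compressed or encrypted sections land near the top, which is why a packed section stands out against ordinary code and data.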

Success criteria — you're done when

  • [ ] Every sample is correctly identified by true type (ignoring extension / hash-name).
  • [ ] The UPX-packed PE's high-entropy section is flagged and routed to unpack-first.
  • [ ] The triage report exists and has a routing decision for every sample, justified by its fingerprint.
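The first criterion, true type regardless of name, comes down to reading a file's leading bytes instead of trusting its extension. The sketch below checks four well-known signatures; it is a deliberately simplified stand-in for what file does (file consults a much larger signature database), and the names SIGNATURES, sniff_type, and sniff_file are hypothetical.

```python
# Well-known leading-byte signatures; far from exhaustive.
SIGNATURES = [
    (b"MZ", "MZ executable (likely PE)"),
    (b"\x7fELF", "ELF"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff_type(header: bytes) -> str:
    """Guess a file type from its first bytes, ignoring any filename."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(8))
```

An MZ prefix alone does not prove a valid PE; confirming the PE signature further into the file is what separates a real PE from a legacy DOS stub or a truncated file.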

Deliverables

  • triage-report.md: the classification table and routing decisions.
  • triage.py: the automated triage script (see Automate & own it).

Commit both.

Automate & own it

Required. Write triage.py: a Python script that takes a directory path as an argument, iterates every file, and outputs a Markdown table with columns: filename, detected-type, max-section-entropy, packer, routing. AI can draft the table-generation and pefile loop; you review the entropy threshold logic and test it against the provided samples. The script must exit 0 and produce deterministic output each run.
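One way to approach the table-generation half of triage.py is sketched below. It covers only rendering, not detection: render_table is a hypothetical helper, rows are assumed to be dicts keyed by the required column names, and sorting by filename is one possible way to keep output deterministic across runs.

```python
COLUMNS = ["filename", "detected-type", "max-section-entropy", "packer", "routing"]


def render_table(rows: list) -> str:
    """Render triage rows (dicts keyed by COLUMNS) as a Markdown table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    # Sort by filename so directory listing order cannot change the output.
    for row in sorted(rows, key=lambda r: r["filename"]):
        cells = []
        for column in COLUMNS:
            value = row.get(column)
            if value is None:
                cells.append("n/a")            # e.g. no entropy for a non-PE
            elif isinstance(value, float):
                cells.append(f"{value:.2f}")   # fixed precision for stable diffs
            else:
                cells.append(str(value))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Fixed float precision and an explicit n/a placeholder keep the report stable between runs; the entropy threshold logic that feeds the routing column is the part to review by hand.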

AI acceleration

Prompt an AI to draft triage.py, then run the script against a sample whose .text section has entropy 7.8 and verify that it flags the sample as suspicious. If the script says "low risk," find and fix the threshold comparison before committing.

Connects forward

The PE samples identified here become the subjects of Module 03 (string and import extraction) and Module 04 (capability detection with capa). The triage routing you write here is what tells the team which analysis path each sample takes.

Marketable proof

"I triage unknown files — correct type identification, entropy-based packing detection, compiler fingerprinting — and produce a routing report before any analysis tool runs."

Stretch

  • Add MIME type detection via python-magic alongside file output and note any discrepancies.
  • Add a check: if the PE's TimeDateStamp in the COFF header is in the future or more than 20 years in the past, flag it as a timestamp-manipulation indicator (T1027.005).
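The timestamp stretch goal can be prototyped with the standard library alone. In the PE layout, the 4-byte little-endian field at offset 0x3C (e_lfanew) points to the PE signature, and TimeDateStamp sits 8 bytes after the start of that signature. The function names below are hypothetical, and 20 years is approximated as 20 × 365 days.

```python
import struct
from datetime import datetime, timedelta, timezone
from typing import Optional


def pe_timestamp(data: bytes) -> datetime:
    """Read TimeDateStamp from the COFF header of a PE image held in memory."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    if len(data) < pe_offset + 12:
        raise ValueError("truncated COFF header")
    # After the signature: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


def timestamp_is_suspicious(stamp: datetime, now: Optional[datetime] = None) -> bool:
    """Flag timestamps in the future or more than ~20 years in the past."""
    now = now or datetime.now(timezone.utc)
    return stamp > now or stamp < now - timedelta(days=20 * 365)
```

Treat a hit as an indicator to note in the report, not proof of tampering: some toolchains write fixed or non-date values into this field for reproducible builds.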
