Detecting disguised malware by content, not extension
Renaming a binary doesn't change what it is - but most filters never look past the extension. A practical guide to content-based defense.
The first content-based malware filter I ever built was a 30-line Python script that ran file(1) on every email attachment and quarantined anything that didn’t match its declared extension. It caught more genuine threats in its first week than the commercial gateway it sat behind had caught all month.
The lesson generalizes: most defensive tooling stops looking too early. The extension is the easiest thing to check, so it gets checked - and a determined attacker only has to be slightly less lazy than the defender to slip through.
This post is about what content-based detection actually buys you, where it stops, and how to wire it into the systems you already run.
The threat model
The disguised-payload pattern shows up in three flavors:
- Plain rename. invoice.pdf.exe is the joke version - real attackers strip the inner extension and just use invoice.pdf. On Windows, the OS happily executes a PE binary regardless of the displayed extension; on Linux and macOS it doesn’t, but a curious user can still be socially engineered into making it executable.
- Polyglot files. A file that’s a valid PDF and a valid HTML page and a valid JAR - all simultaneously. The browser renders the HTML, the JVM runs the JAR, and your scanner sees what its parser was looking for first. These exist; they’re not theoretical.
- Container abuse. A ZIP archive that contains an inner file with a misleading extension, or a DOCX (which is ZIP under the hood) with embedded macros, or an SVG that’s actually a stored XSS payload. The outer wrapper is benign; the dangerous content is one level down.
Each of these defeats extension-only filtering. Each is reliably caught by content inspection.
What “content-based” actually means in practice
Three layers, in increasing order of cost and fidelity:
Magic-byte sniffing
The cheapest defense: check the first ~20 bytes against a database of known signatures. Tools: file(1), libmagic, the various file-type-style libraries on npm and PyPI. Cost: microseconds per file. Catches: the plain rename. Misses: polyglots, container abuse, anything where the magic bytes match but the structure is malformed.
Structural parsing
Open the file with a real parser for its declared type and confirm it’s well-formed. PDF has an xref table; if the parser can’t reach it, the file’s been tampered with. DOCX is ZIP plus a known set of XML entries; missing or extra entries are suspicious. PE binaries have section tables; section sizes and entry points should make sense.
This catches polyglots that pass magic-byte checks but fail structural validation. Cost: milliseconds to seconds depending on size. Open-source parsers exist for every common format.
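For the DOCX case, a shallow structural check is just a ZIP inspection. This is a sketch, not a full OOXML validator - the required-entry set and the macro check are simplified, and vbaProject.bin detection assumes the plain-.docx-with-macros tamper pattern:

```python
import io
import zipfile

# A DOCX is a ZIP with a known skeleton. Checking that the required
# entries exist (and that nothing macro-shaped is hiding inside) is a
# cheap structural test - a sketch, not a full OOXML validator.
REQUIRED = {"[Content_Types].xml", "word/document.xml"}

def looks_like_docx(data: bytes) -> bool:
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False          # not even a well-formed ZIP
    if not REQUIRED <= names:
        return False          # missing the DOCX skeleton
    # vbaProject.bin is where embedded VBA macros live; its presence
    # in a plain .docx (rather than .docm) is a strong tamper signal.
    return not any(n.endswith("vbaProject.bin") for n in names)
```

The same pattern applies to the other formats mentioned: for PDF you would look for a reachable xref/startxref, for PE you would sanity-check the section table - each format gets its own shallow well-formedness probe.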
ML-based content classification
Sample ~1.5 KB from the file, run a small CNN, get a probability distribution over ~200 known formats. This is what Magika does. The output isn’t just “what is this?” but “how confident are we?” - and crucially, the confidence drops on adversarial inputs. A polyglot designed to confuse a classifier will produce a low-confidence answer rather than a wrong high-confidence one.
Cost: ~10-50ms per file in the browser, less on a CPU server. Catches: all of the above, plus a wide swath of less-common formats that signature databases never bothered to encode.
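The sampling step is the easy part to sketch. This is not Magika's actual feature extractor - just an illustration of the idea of drawing fixed-size windows (512 bytes each, ~1.5 KB total) from the start, middle, and end of a file, so the classifier sees header, body, and trailer structure without reading the whole thing:

```python
def sample_windows(data: bytes, window: int = 512) -> bytes:
    """Draw fixed windows from the start, middle, and end of a file.
    Illustrative only - real feature extraction is more involved."""
    if len(data) <= 3 * window:
        return data           # short files: just use everything
    mid = len(data) // 2 - window // 2
    return data[:window] + data[mid:mid + window] + data[-window:]
```

The point of sampling head, middle, and tail rather than just the head is exactly the polyglot case: a file whose first bytes say PDF but whose tail is a ZIP central directory looks suspicious only if you read both ends.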
A practical pipeline
For an organization processing user uploads at any non-trivial scale, the right architecture is layered:
upload -> extension allowlist -> magic-byte check -> structural parse -> ML classify -> downstream processor
Each layer is cheap enough to run on every file. Each layer rejects what its successor would otherwise have to handle. The ML classifier is the most expensive step but the most discriminating, so it goes last - and only files that survived the cheaper checks reach it.
A few implementation notes from running this in anger:
- Log the disagreements, not just the rejections. If extension says PDF and Magika says PE, that’s interesting whether or not you reject the file. Those samples are gold for tuning.
- Confidence thresholds are policy, not technology. The model gives you a probability; you decide what to do at 95% vs 80% vs 60%. There’s no universally right answer; it depends on what’s downstream.
- Don’t trust container previews. A file that’s a valid ZIP isn’t necessarily safe. Recurse into the archive and apply the same pipeline to each entry.
Where this stops working
Three honest limitations:
- Encrypted payloads. A file that’s been XOR’d with a per-installation key looks like noise to any classifier. The defense at that point is upstream - block the loader, not the payload. If the user’s running arbitrary code from arbitrary archives, content classification is too late.
- Living-off-the-land formats. A weaponized PowerShell script is a PowerShell script. The classifier will identify it correctly. Whether it’s malicious is a behavioral question, not a format question - that’s the next layer up the stack.
- Adversarial evasion. Researchers have demonstrated that a sufficiently determined attacker can craft files that fool a specific model while remaining functional. This is a real problem for ML-based detection in the abstract; in practice, attackers rarely bother, because the cost of evading detection is higher than the cost of finding a target that doesn’t have detection at all.
A small experiment
Take any benign executable on your system - cmd.exe on Windows, /bin/ls on Linux. Copy it, rename the copy to report.pdf, and feed it to Magika. It will identify the file as a PE or ELF binary regardless of what you called it. The confidence will be high (>95%) because executables have highly recognizable structural signatures.
Now repeat with a small text file. Save a one-line shell script as script.txt and classify it - the model is much less confident this time, because a single line of text is genuinely ambiguous between many text-like formats. That’s the right behavior: confidence should track the ambiguity actually present in the data.
The takeaway: content-based detection isn’t perfect, but it’s strictly better than the extension check most systems still rely on. If you’re building anything that ingests files from outside your organization, it belongs in the pipeline.