How Magika AI detects mislabeled files (and why extensions lie)
Filename extensions are user-provided metadata - they can be wrong, missing, or actively deceptive. Here's how a small deep-learning model figures out the truth from a few hundred bytes.
A file’s extension is a string that anyone can change. Rename payload.exe to report.pdf and a dozen Windows tools will happily display a PDF icon. The bytes on disk haven’t moved an inch, but every shortcut your operating system takes to “know” what a file is depends on metadata that the file itself doesn’t enforce.
That’s why content-based detection exists. And it’s the gap Magika was built to close.
The way file type detection used to work
For decades, the standard answer was “look at the magic bytes” - a small, well-known signature near the start of the file. PDFs begin with %PDF-, PNGs with the eight-byte sequence 89 50 4E 47 0D 0A 1A 0A, ZIP archives with PK\x03\x04. The classic Unix file(1) command and its database (magic.mgc) compile thousands of these patterns into a fast pattern-matcher.
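To make the idea concrete, here is a minimal sketch of prefix-signature matching using the three signatures mentioned above. The table and the function name sniff_magic are illustrative only; the real file(1) database is far larger and supports offsets, masks, and nested tests.

```python
from typing import Optional

# Illustrative subset only; magic.mgc compiles thousands of patterns.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (bytes.fromhex("89504e470d0a1a0a"), "png"),  # 89 50 4E 47 0D 0A 1A 0A
    (b"PK\x03\x04", "zip"),
]


def sniff_magic(head: bytes) -> Optional[str]:
    """Return a label if the buffer starts with a known signature, else None."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return None


print(sniff_magic(b"%PDF-1.7\n"))                # pdf
print(sniff_magic(b"#!/usr/bin/env python3\n"))  # None
```

The second call returns None because a shebang line is not a signature in this table, which hints at the text-format problem described next.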
Magic-byte detection works well when:
- The format actually defines a stable signature.
- The file isn’t truncated.
- You don’t care about ambiguity between formats that share a container.
It falls over when:
- The format is text-like (Python source, YAML, INI) - any short script could match dozens of “starts with #” rules.
- The container is generic - a .docx, .xlsx, .jar, and an unrelated ZIP archive all start with PK\x03\x04.
- Someone deliberately prepended innocuous-looking bytes to a malicious payload.
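The shared-container problem is easy to reproduce. The sketch below builds three tiny archives in memory with Python's zipfile module and shows that they start with the same four bytes. The member names follow each format's convention (word/document.xml for Word documents, META-INF/MANIFEST.MF for JARs), but these are not complete, openable documents, and guess_zip_flavor is a hypothetical heuristic rather than anything Magika or file(1) does.

```python
import io
import zipfile


def make_zip(member_name: str) -> bytes:
    """Build a one-member ZIP archive in memory and return its raw bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "placeholder")
    return buffer.getvalue()


def guess_zip_flavor(blob: bytes) -> str:
    """Hypothetical heuristic: look at member names, not the leading bytes."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        names = set(archive.namelist())
    if "word/document.xml" in names:
        return "docx"
    if "xl/workbook.xml" in names:
        return "xlsx"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"


docx_like = make_zip("word/document.xml")
jar_like = make_zip("META-INF/MANIFEST.MF")
plain_zip = make_zip("notes.txt")

for blob in (docx_like, jar_like, plain_zip):
    # The first four bytes are identical; only the contents differ.
    print(blob[:4], guess_zip_flavor(blob))
```

Telling these apart means opening the container, which is exactly the kind of per-format special case that piles up in a signature database.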
A growing list of hand-tuned regular expressions doesn’t scale either. By the time you’ve added rules for the thousandth obscure scientific format, the rules contradict each other and the maintainer burns out.
What Magika does instead
Magika is a small (~1 MB) deep-learning model trained on millions of files. It samples a few short windows from the start, middle, and end of the file - typically about 1,536 bytes total - and runs them through a convolutional network that emits a probability distribution over ~200 known file types.
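As a rough illustration of the sampling step, the sketch below takes one fixed-size window from the start, middle, and end of a buffer and pads short inputs so the result always has the same length. The 512-byte window (3 × 512 = 1,536), the zero-byte padding, and the name sample_windows are assumptions made for this example; Magika's actual preprocessing may differ in its details.

```python
def sample_windows(data: bytes, window: int = 512, pad: bytes = b"\x00") -> bytes:
    """Concatenate start, middle, and end windows, each padded to `window` bytes."""
    size = len(data)
    mid_start = max(0, size // 2 - window // 2)
    parts = (
        data[:window],                        # beginning of the file
        data[mid_start:mid_start + window],   # centred on the midpoint
        data[max(0, size - window):],         # end of the file
    )
    # Short files yield short slices; pad them so the model input is fixed-size.
    return b"".join(part.ljust(window, pad) for part in parts)


features = sample_windows(bytes(range(256)) * 40)  # a 10,240-byte buffer
print(len(features))  # 1536
```

A fixed-size input is what lets the model cost the same on a 2 KB script as on a 2 GB disk image: everything outside the sampled windows is never read.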
A few things that fall out of this design:
- The model never reads the filename. Whatever the user named the file is irrelevant to the verdict. That alone defeats most extension-based deception.
- Confidence scores are calibrated. When Magika says 99%, it’s right close to 99% of the time. When it says 60%, something is genuinely ambiguous - usually a code file in a language the model’s seen less of, or a very short text fragment.
- It generalizes. A truncated PDF, a PE binary with garbage prepended, a polyglot file that’s both a valid GIF and valid JavaScript - the model handles all of those better than a brittle regex would, because its features are statistical rather than positional.
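One way to act on a confidence score is to accept the top label only above a threshold and fall back to a generic answer otherwise. The sketch below shows that pattern with a plain softmax over made-up scores. The labels, the logits, the 0.9 threshold, and the function decide are all hypothetical; this is not Magika's API or its actual decision rule.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decide(labels, logits, threshold=0.9, fallback="unknown"):
    """Return (label, probability), using `fallback` when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best] if probs[best] >= threshold else fallback
    return label, probs[best]


labels = ["pdf", "python", "markdown"]
print(decide(labels, [9.0, 1.0, 0.5]))  # confident: the label is "pdf"
print(decide(labels, [1.0, 1.2, 0.9]))  # ambiguous: the label is "unknown"
```

A threshold like this is only meaningful if the scores are calibrated, which is why that property matters more than it might first appear.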
Google’s writeup reports about 99% top-1 accuracy on its evaluation set. The exact number depends on which formats you include and how you weight rare types, but for the common cases the failure rate is roughly an order of magnitude lower than that of file(1).
When detection-by-content actually matters
Three flavors of real-world problem:
Disguised malware
Email gateways and download proxies often filter on extension. A .docx attachment is “fine”; a .exe is quarantined. Send the same payload as quarterly-report.docx and it sails through - until something downstream actually inspects the bytes. Content-based detection is what catches this on the way in, before the user double-clicks it.
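A gateway check along these lines compares the claimed extension with what the bytes say. In the sketch below, a tiny signature table stands in for the content detector (Windows executables begin with the two bytes MZ); a real deployment would plug in a content-based detector instead. The table contents and the names detect and is_mislabeled are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Optional

# Stand-in detector: a handful of leading signatures.
CONTENT_SIGNATURES = [(b"MZ", "exe"), (b"%PDF-", "pdf"), (b"PK\x03\x04", "zip")]

# Extensions treated as consistent with each detected type (illustrative).
EXPECTED_EXTENSIONS = {
    "exe": {".exe", ".dll"},
    "pdf": {".pdf"},
    "zip": {".zip", ".docx", ".xlsx", ".jar"},
}


def detect(head: bytes) -> Optional[str]:
    for signature, label in CONTENT_SIGNATURES:
        if head.startswith(signature):
            return label
    return None


def is_mislabeled(filename: str, head: bytes) -> bool:
    """True when the detected content type contradicts the file extension."""
    detected = detect(head)
    if detected is None:
        return False  # nothing to compare against; leave to other checks
    extension = PurePath(filename).suffix.lower()
    return extension not in EXPECTED_EXTENSIONS[detected]


print(is_mislabeled("quarterly-report.docx", b"MZ\x90\x00\x03"))  # True
print(is_mislabeled("quarterly-report.docx", b"PK\x03\x04\x14"))  # False
```

The filename is only ever used as the claim being tested, never as evidence for the verdict.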
Forensic recovery
Recovering files from a corrupted disk often gives you blobs without filenames or extensions. The investigator’s first task is grouping them by likely type so they can pick which to triage. Magic bytes get you halfway; ML closes the gap on the formats that don’t have them.
Build pipelines and CI
You’d be surprised how often a build breaks because a developer committed a .json file that’s actually JSON5, or a .csv that’s tab-separated, or a .yaml with mixed tabs and spaces. A content-based check at PR time saves a deploy.
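For these specific cases, even a few standard-library checks at PR time go a long way. The sketch below is one possible shape for them; the function names are hypothetical, and the delimiter check only inspects the first line. Note that Python's json module accepts NaN and Infinity by default, so it is slightly more lenient than the JSON specification.

```python
import json


def parses_as_json(text: str) -> bool:
    """True if the text parses with Python's json module (rejects JSON5 syntax)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def dominant_delimiter(text: str, candidates=(",", "\t", ";")) -> str:
    """Return the candidate that occurs most often on the first line."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    return max(candidates, key=first_line.count)


def has_mixed_indentation(text: str) -> bool:
    """True if leading whitespace uses both tabs and spaces anywhere in the file."""
    saw_tab = saw_space = False
    for line in text.splitlines():
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        saw_tab = saw_tab or "\t" in indent
        saw_space = saw_space or " " in indent
    return saw_tab and saw_space


print(parses_as_json('{"a": 1, // comment\n}'))           # False
print(dominant_delimiter("id\tname\tscore\n") == "\t")    # True
print(has_mixed_indentation("a:\n  b: 1\n\tc: 2\n"))      # True
```

Each check is tied to one claimed format, so it complements a general content detector rather than replacing it.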
Where it’s still hard
A few formats remain genuinely difficult:
- Encrypted blobs - by design, the bytes look like noise. There’s nothing to learn from.
- Heavily compressed text inside a generic container - one ZIP looks a lot like another until you decompress. Magika reports the container correctly (ZIP), not the inner content.
- Very short files - a dozen or so bytes like “Hello world” aren’t enough to distinguish English from Python from Markdown. The confidence drops accordingly, which is the right behavior.
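The "looks like noise" point can be made measurable with byte-level Shannon entropy. In the sketch below, os.urandom stands in for ciphertext, which is an assumption for illustration. High entropy is also typical of compressed data, so a score near 8 bits per byte says "no visible structure", not "encrypted".

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


random_blob = os.urandom(65536)  # stand-in for an encrypted file
text_blob = b"The quick brown fox jumps over the lazy dog. " * 1000

print(round(shannon_entropy(random_blob), 2))  # close to 8
print(round(shannon_entropy(text_blob), 2))    # much lower: few distinct bytes
```

When every byte value is about equally likely, there is no statistical feature left for a signature or a model to latch onto.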
The honest version of this story is: content-based detection isn’t a silver bullet, but it’s a strict superset of what extension-checking can do. If you’re building anything that processes files from untrusted sources, it should be the floor, not the ceiling.
You can try Magika on your own files at the home page - drop anything in the box, including renamed-on-purpose tests, and watch what comes back. The model runs entirely in your browser; your files never leave the tab.