A utility to identify and map the semantic and syntactic structure of files, including polyglots, chimeras, and schizophrenic files. It has a pure-Python implementation of libmagic and can act as a drop-in replacement for the `file` command. However, unlike `file`, PolyFile can recursively identify embedded files, like binwalk.
PolyFile can be used in conjunction with its sister tool PolyTracker for Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of collectively referring to the tools as The ALAN Parsers Project.
You can install the latest stable version of PolyFile from PyPI:
pip3 install polyfile
To install PolyFile from source, in the same directory as this README, run:
pip3 install .
Important: Before installing from source, make sure Java is installed. Java is used to run the Kaitai Struct compiler, which compiles the file format definitions.
Either method will automatically install the `polyfile` and `polymerge` executables in your path.
Running `polyfile` on a file with no arguments will mimic the behavior of `file --keep-going`:
$ polyfile png-polyglot.png
PNG image data, 256 x 144, 8-bit/color RGB, non-interlaced
Brainfu** Program
Malformed PDF
PDF document, version 1.3, 1 pages
ZIP end of central directory record Java JAR archive
To generate an interactive hex viewer for the file, use the `--html` option:
$ polyfile --html output.html png-polyglot.png
Found a file of type application/pdf at byte offset 0
Found a file of type application/x-brainfuck at byte offset 0
Found a file of type image/png at byte offset 0
Found a file of type application/zip at byte offset 0
Found a file of type application/java-archive at byte offset 0
Saved HTML output to output.html
Run `polyfile --help` for full usage instructions.
PolyFile has an interactive debugger for both its file matching and its parsing. It can be used to debug a libmagic pattern definition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser.
You can run PolyFile with the debugger enabled using the `-db` option.
PolyFile has a cleanroom, pure Python implementation of the libmagic file classifier, and supports all 263 MIME types that it can identify.
It currently has support for parsing and semantically mapping the following formats:
- PDF, using an instrumented version of Didier Stevens' public domain, permissive, forensic parser
- ZIP, including recursive identification of all ZIP contents
- JPEG/JFIF, using its Kaitai Struct grammar
- iNES
- Any other format specified in a KSY grammar
For an example that exercises all of these file formats, run:
curl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html -
Prior to version 0.3.0, PolyFile used the TrID database for file identification rather than the libmagic file definitions. This proved to be very slow (since TrID has many duplicate entries) and prone to false positives (since TrID's file definitions are much simpler than libmagic's). The original TrID matching code is still shipped with PolyFile and can be invoked programmatically, but it is not used by default.
PolyFile has several options for outputting its results, specified by its `--format` option. For computer-readable output, PolyFile has an extension of the SBuD JSON format described in the documentation. Prior to version 0.5.0 this was the default output format of PolyFile. The default is now to mimic the behavior of the `file` command. To retain the original behavior, use the `--format sbud` option.
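For scripted use, the `--format sbud` output can be captured and parsed as JSON. The following is a minimal sketch, not part of PolyFile: the helper names `sbud_command` and `load_sbud` are hypothetical, and it assumes that the `polyfile` executable is on the path and writes the JSON document to standard output.

```python
import json
import subprocess


def sbud_command(path):
    """Build the command line for SBuD JSON output (hypothetical helper)."""
    return ["polyfile", "--format", "sbud", str(path)]


def load_sbud(path, run=subprocess.run):
    """Run polyfile on `path` and parse what it prints as JSON.

    Assumption: the JSON document is written to standard output.
    The `run` parameter exists only so the call can be replaced in tests.
    """
    result = run(sbud_command(path), capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Calling `load_sbud("png-polyglot.png")` would then return the parsed document. Its layout is not assumed here; consult the documentation for the SBuD extension before relying on specific keys.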
PolyFile has a cleanroom implementation of libmagic (used in the `file` command).
It can be invoked programmatically by running:
from polyfile.magic import MagicMatcher

with open("file_to_test", "rb") as f:
    # the default instance automatically loads all file definitions
    for match in MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
        for mimetype in match.mimetypes:
            print(f"Matched MIME: {mimetype}")
        print(f"Match string: {match!s}")
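Because a polyglot matches more than one format, it can be convenient to collect every matched MIME type into a set. The sketch below wraps the loop above in a hypothetical helper, `matched_mimetypes`, relying only on the `match()` method and the `mimetypes` attribute shown above. The `matcher` parameter is an addition for illustration, and the code assumes that each MIME type is a hashable value such as a string.

```python
def matched_mimetypes(data, matcher=None):
    """Return the set of MIME types that `matcher` reports for `data`.

    Hypothetical helper, not part of PolyFile. If no matcher is given,
    the default instance (which loads all file definitions) is used.
    """
    if matcher is None:
        # Imported lazily so the helper also works with a custom matcher
        from polyfile.magic import MagicMatcher
        matcher = MagicMatcher.DEFAULT_INSTANCE
    found = set()
    for match in matcher.match(data):
        # Assumption: each MIME type is hashable (for example, a string)
        found.update(match.mimetypes)
    return found
```

A result with more than one entry is a rough hint, not proof, that the input may be a polyglot.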
To load a specific or custom file definition:
list_of_paths_to_definitions = ["def1", "def2"]
matcher = MagicMatcher.parse(*list_of_paths_to_definitions)
with open("file_to_test", "rb") as f:
    for match in matcher.match(f.read()):
        ...
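When the custom definitions live together in one directory, the list of paths can be gathered with `pathlib` before being handed to `MagicMatcher.parse`. This is a minimal sketch: `definition_paths` is a hypothetical helper, the directory name in the usage comment is a placeholder, and every regular file in the directory is assumed to be a libmagic definition.

```python
from pathlib import Path


def definition_paths(directory):
    """Return the path of every regular file in `directory`, sorted.

    Hypothetical helper, not part of PolyFile. Paths are returned as
    strings, matching the style of the example above.
    """
    return sorted(str(p) for p in Path(directory).iterdir() if p.is_file())


# Usage, assuming PolyFile is installed and "my_definitions" exists:
#     from polyfile.magic import MagicMatcher
#     matcher = MagicMatcher.parse(*definition_paths("my_definitions"))
```

Sorting keeps the load order the same from run to run, which may help when debugging definitions that interact.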
Instructions for extending PolyFile to support more file formats with new matchers and parsers are described in the documentation.
This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. © 2019, Trail of Bits.