A package for inspecting PDF files.
It is at an early stage of development.
The current aim of this package is to implement the following features:
- Parse PDF files
- Validate PDF files
- Extract metadata
- Extract text, images, tables, links, annotations, and other content
- Check for potential security vulnerabilities
References to the International Standard ISO 32000-2:2020, Portable document format – Part 2: PDF 2.0, are included in the comments and documentation. Each reference gives the section number, name, and page number(s) in square brackets, e.g. [7.3.10 Indirect objects, p33-34]. Nested square brackets indicate references to other sources, with the source in the inner brackets followed by the section, e.g. [[https://www.w3.org/TR/png/#4Concepts.EncodingScanlineAbs] 4.6.2 Scanline serialization].
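As an illustration of the reference convention, here is a minimal sketch of how such a reference might sit in a code comment. It is not taken from this package: the choice of Python, the function name `parse_indirect_header`, and the regular expression are all assumptions made for the example. The sketch reads only the header of an indirect object (object number, generation number, and the `obj` keyword) and approximates PDF whitespace with the regex `\s` class.

```python
import re

# [7.3.10 Indirect objects, p33-34]
# An indirect object definition begins with an object number, a generation
# number, and the keyword `obj`.
# Hypothetical pattern for illustration; `\s` only approximates the set of
# PDF whitespace characters.
_INDIRECT_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def parse_indirect_header(data: bytes):
    """Return (object_number, generation_number), or None if no header matches.

    Hypothetical helper, not part of this package's API.
    """
    match = _INDIRECT_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Calling `parse_indirect_header(b"12 0 obj\n<< >>\nendobj")` on this sketch yields the object and generation numbers as a tuple, and input that does not start with a header yields `None`.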
If you are interested in contributing, please check the TODO list. Test contributions that include extracts of PDF files that do not open correctly are highly appreciated, provided they can be included without a change to the LICENSE.