Add support for rebuilding the xref table for damaged PDF files #45

Open
kleuter opened this issue Oct 6, 2023 · 9 comments

Comments

@kleuter
Contributor

kleuter commented Oct 6, 2023

The pdfiototext tool fails to parse the file:
https://www.dropbox.com/scl/fi/1nhivpa3sbjejza8l53rz/NTFS.pdf?rlkey=zvphkczuy71b0vil8zvmrz95v&dl=0

System Information:

  • OS: Windows 10, Visual Studio 2019
@michaelrsweet
Owner

So for this file the "startxref" value is wrong, as are all of the xref table offsets. More than likely the original file was edited on Windows with a plain text editor (Notepad or similar), which changed the line endings from LF-only to CR LF.
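
To illustrate the offset shift described above, here is a minimal Python sketch. It is not PDFio code, and the sample bytes are a made-up fragment rather than a valid PDF. It rewrites LF line endings as CR LF and shows that an offset recorded for the original file no longer points at the "xref" keyword.

```python
# Illustration only: a tiny PDF-like byte string, not a complete or valid PDF.
original = (
    b"%PDF-1.4\n"
    b"1 0 obj\n"
    b"<< /Type /Catalog >>\n"
    b"endobj\n"
    b"xref\n"
    b"0 2\n"
)

# Offset a writer would have stored after "startxref" for the LF-only file.
recorded_startxref = original.index(b"xref\n")

# Simulate a text editor rewriting every LF line ending as CR LF.
damaged = original.replace(b"\n", b"\r\n")

# The keyword has moved by one byte for every line ending that precedes it.
actual_startxref = damaged.index(b"xref\r\n")
shift = actual_startxref - recorded_startxref

# The recorded offset now lands inside unrelated content.
found_at_recorded = damaged[recorded_startxref:recorded_startxref + 4]

print("recorded:", recorded_startxref, "actual:", actual_startxref, "shift:", shift)
print("bytes at recorded offset:", found_at_recorded)
```

The same one-byte-per-line drift applies to every object offset in the xref table, and a blind replace like this would also alter any LF bytes inside binary streams.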

Some PDF viewers will attempt to rebuild the xref table for files like this, but I have not done so for PDFio because of the chance of errors and the likelihood that the same corruption has also damaged the binary streams in the file, leaving it unreadable anyway. I will keep this issue open for now, but it will not be "fixed" any time soon.
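
For reference, the kind of rebuild some viewers attempt can be sketched as a scan for "N G obj" markers. This is a minimal Python sketch under loud assumptions: `rebuild_xref` and `OBJ_MARKER` are hypothetical names, not PDFio API; it ignores object streams and cross-reference streams; and it can be fooled by stream data that happens to look like an object header, which is the error risk mentioned above.

```python
import re

# Hypothetical pattern: an object header "N G obj" at the start of a line.
# The negative lookbehind accepts the start of the data or a preceding CR/LF.
OBJ_MARKER = re.compile(rb"(?<![^\r\n])(\d+)[ \t]+(\d+)[ \t]+obj\b")


def rebuild_xref(data):
    """Map (object number, generation) to the byte offset of its header."""
    offsets = {}
    for match in OBJ_MARKER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition replaces an earlier one, on the assumption that
        # incremental updates append newer versions of an object.
        offsets[key] = match.start(1)
    return offsets


# Made-up fragment with CR LF line endings, not a complete PDF.
sample = (
    b"%PDF-1.4\r\n"
    b"1 0 obj\r\n<< /Type /Catalog /Pages 2 0 R >>\r\nendobj\r\n"
    b"2 0 obj\r\n<< /Type /Pages /Kids [] /Count 0 >>\r\nendobj\r\n"
)

for (number, generation), offset in sorted(rebuild_xref(sample).items()):
    print(number, generation, offset)
```

A scan like this recovers offsets only. It cannot repair stream data whose bytes were changed by the same line-ending conversion.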

@michaelrsweet michaelrsweet changed the title from "NTFS.pdf: Bad xref table header '19 0 R /XYZ 115.0 340.907 null]'." to "Add support for rebuilding the xref table for damaged PDF files" Oct 6, 2023
@michaelrsweet michaelrsweet self-assigned this Oct 6, 2023
@michaelrsweet michaelrsweet added enhancement New feature or request priority-low labels Oct 6, 2023
@michaelrsweet michaelrsweet added this to the Future milestone Oct 6, 2023
@michaelrsweet
Owner

NTFS.pdf

@kleuter
Contributor Author

kleuter commented Oct 9, 2023

@kleuter
Contributor Author

kleuter commented Oct 9, 2023

Bad xref table header 'xref '.

@michaelrsweet
Owner

That file isn't damaged in the same way. The issue is trailing whitespace after the "xref" keyword, which the current parser rejects: the PDF specs all say the xref table starts with a line consisting of the single keyword "xref" and say nothing about extra whitespace.

So I will update the xref loading code to allow for this but it won't fix the problem with the first file you linked to...
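
A tolerant header check along these lines could look like the sketch below. It is written in Python for illustration and is not the actual PDFio change; `is_xref_header` and `PDF_WHITESPACE` are hypothetical names. It accepts the "xref" keyword followed only by PDF whitespace characters.

```python
# The six whitespace characters defined by the PDF specification:
# NUL, horizontal tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def is_xref_header(line):
    """Return True if the line is "xref" plus optional trailing whitespace."""
    return line.rstrip(PDF_WHITESPACE) == b"xref"


print(is_xref_header(b"xref "))       # header from the second report
print(is_xref_header(b"xref\r\n"))    # plain header with a CR LF ending
print(is_xref_header(b"19 0 R /XYZ 115.0 340.907 null]"))  # first report
```

Stripping only trailing whitespace keeps the check strict about anything else on the line, so the bad header from the first file is still rejected.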

@michaelrsweet
Owner

[master b0a66ee] Fix reading of PDF files from Crystal Reports (Issue #45)

@michaelrsweet
Owner

If you find other files with issues, please report them as separate issues; otherwise it is harder for me to track when a problem is actually fixed. Thanks!

@kleuter
Contributor Author

kleuter commented Oct 10, 2023

Will do, thanks a lot, Michael. Though the fix doesn't seem to work 😢

@michaelrsweet
Owner

PDFBOX-2250-0.pdf

@michaelrsweet michaelrsweet modified the milestones: Future, 1.5 Jan 18, 2025