Possible startxrefPreg extension #756

Open
xBambey opened this issue Jan 15, 2025 · 7 comments

xBambey commented Jan 15, 2025

Some PDFs in one of our projects could not be read by the parser.

On closer examination of the binary data, I noticed that there is often a space before the byte offset that follows startxref.

A brief search online turned up nothing on whether this space is allowed there. Perhaps someone here who is more familiar with the subject knows more.

Allowing optional whitespace at this point in the regex makes the parser recognize the PDF again.

The regex would then look like this:

'/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i'

The current code, for reference:

// Find all startxref tables from this $offset forward
$startxrefPreg = preg_match_all(
    '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
    $pdfData,
    $startxrefMatches,
    \PREG_SET_ORDER,
    $offset
);
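
A minimal sketch of the difference between the two patterns, assuming a hypothetical trailer in which the offset line starts with a space. The sample data and variable names are illustrative and are not taken from the affected PDFs:

```php
<?php
// Hypothetical trailer: note the space before the offset on its own line.
$pdfData = "trailer\n<< /Size 8 >>\nstartxref\n 116\n%%EOF\n";

// Pattern currently used by the parser.
$current = '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i';
// Proposed pattern: optional whitespace between the line break and the digits.
$proposed = '/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i';

$currentCount = preg_match_all($current, $pdfData, $currentMatches, PREG_SET_ORDER);
$proposedCount = preg_match_all($proposed, $pdfData, $proposedMatches, PREG_SET_ORDER);

var_dump($currentCount);            // int(0): the digits must follow the line break directly
var_dump($proposedCount);           // int(1)
echo $proposedMatches[0][1], "\n";  // 116
```

The current pattern fails here because `[\r\n]+` must be followed immediately by a digit, so the leading space has nowhere to go.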

Contributor

unixnut commented Jan 30, 2025

Other files can cause the Uncaught Exception: Unable to find startxref error too.

getXrefData() uses a regex that requires a newline before the offset value. This means that some PDF files (e.g. containing "startxref 1746580") won't parse.

Example PDF details seen in production:
Creator: DocuPrint CM225/228
PDF version: 1.7

A different example will be attached soon.

A combination of my working fix for the problem mentioned in this comment, plus the one above, is:
'/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i'
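
As a rough check of the combined pattern, the sketch below runs it against two hypothetical trailers: one with the offset on the same line as startxref (as in the "startxref 1746580" case) and one with a space-indented offset line. The sample strings and the helper name `findOffsets` are assumptions made for illustration:

```php
<?php
// Illustrative helper (not part of the parser): returns every captured offset.
function findOffsets(string $pattern, string $pdfData): array
{
    preg_match_all($pattern, $pdfData, $matches, PREG_SET_ORDER);
    return array_map(static fn (array $m): string => $m[1], $matches);
}

$current  = '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i';
$combined = '/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Hypothetical trailers for the two layouts discussed so far.
$sameLine = "trailer\n<< /Size 8 >>\nstartxref 1746580\n%%EOF\n";
$indented = "trailer\n<< /Size 8 >>\nstartxref\n 116\n%%EOF\n";

$currentSameLine  = findOffsets($current, $sameLine);   // no match
$combinedSameLine = findOffsets($combined, $sameLine);  // ['1746580']
$combinedIndented = findOffsets($combined, $indented);  // ['116']

print_r($currentSameLine);
print_r($combinedSameLine);
print_r($combinedIndented);
```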

Author

xBambey commented Jan 30, 2025

I have also run into other cases that caused problems.
Yesterday I had a case where everything was on one line.

My current solution to the problem is

'/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i'

I am currently not aware of any case where the final values (startxref, offset, %%EOF) are on the same line as other information.
The lookbehind at the beginning might have to be extended to allow a space instead of a newline.
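
A small sketch of both points, using hypothetical one-line trailers. The `$relaxed` pattern is only an illustration of what extending the beginning could look like; it was not proposed in this thread. Note that in PCRE `\s` already includes `\r` and `\n`, so `[\s\r\n]` behaves the same as `\s`:

```php
<?php
// Illustrative helper (not part of the parser): first captured offset or null.
function firstOffset(string $pattern, string $pdfData): ?string
{
    return preg_match($pattern, $pdfData, $m) === 1 ? $m[1] : null;
}

$general = '/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i';
// Hypothetical variant: the lookbehind accepts any whitespace, not only a line break.
$relaxed = '/(?<=\s)startxref\s+([0-9]+)\s+%%EOF/i';

// startxref, offset and %%EOF on one line, preceded by a line break.
$oneLine = "trailer\n<< /Size 8 >>\nstartxref 116 %%EOF\n";
// Same tokens, but startxref is preceded by a space instead of a line break.
$noNewline = "trailer << /Size 8 >> startxref 116 %%EOF\n";

var_dump(firstOffset($general, $oneLine));    // string(3) "116"
var_dump(firstOffset($general, $noNewline));  // NULL: the lookbehind needs \r or \n
var_dump(firstOffset($relaxed, $noNewline));  // string(3) "116"
```

A looser lookbehind also raises the chance of matching a stray "startxref" elsewhere in the data, so this is a trade-off rather than a clear improvement.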

Collaborator

k00ni commented Jan 30, 2025

@xBambey @unixnut Feel free to create a pull request and start a discussion if you think it's worth changing the related code.

k00ni added the fix label Jan 30, 2025
Contributor

unixnut commented Feb 3, 2025

> My current solution to the problem is
>
> '/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i'

This seems too general to me: for example, it allows multiple newlines between the startxref keyword and the offset. It would also allow the EOF marker on the same line as the offset.

@xBambey what did you think of my regex that allows for either newlines or spaces, plus optional leading/trailing spaces around the offset?
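
To make the comparison concrete, here is a rough sketch that runs both candidate patterns against two hypothetical layouts. The sample strings are invented for illustration. One thing it suggests is that the combined pattern also accepts several line breaks before the offset, because `[\s]*` and `[\r\n]+` can each span more than one; the patterns differ only for an EOF marker on the offset line:

```php
<?php
$combined = '/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i';
$general  = '/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i';

// Hypothetical trailer layouts.
$layouts = [
    'blank lines before offset' => "trailer\nstartxref\n\n\n116\n%%EOF\n",
    'EOF on offset line'        => "trailer\nstartxref\n116 %%EOF\n",
];

$results = [];
foreach ($layouts as $label => $pdfData) {
    $results[$label] = [
        'combined' => preg_match($combined, $pdfData) === 1,
        'general'  => preg_match($general, $pdfData) === 1,
    ];
}

foreach ($results as $label => $r) {
    printf(
        "%-26s combined=%s general=%s\n",
        $label,
        var_export($r['combined'], true),
        var_export($r['general'], true)
    );
}
```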

Author

xBambey commented Feb 4, 2025

@unixnut, your regex looks good. I kept my solution as general as possible because I don't want to keep making further adjustments.

That is also why it allows the EOF token on the same line as the offset.

I know that this doesn't normally happen, but there is the extremely rare case in which an error thrown in the last functions of the PDF generation could actually cause this to happen.

Contributor

unixnut commented Feb 6, 2025

I think the startxref statement and the eof marker are always produced together and it's safer to look for the eof marker on a new line.

Because this is a design decision that is related to a specification I haven't read, can we get @k00ni to make the call on which regex shall be used in the PR?

Collaborator

k00ni commented Feb 19, 2025

> Because this is a design decision that is related to a specification I haven't read, can we get @k00ni to make the call on which regex shall be used in the PR?

To be honest, I lack the knowledge to make this call. Could one of you briefly outline the pros and cons of each approach again? This might help us move toward a decision.
