Possible startxrefPreg extension #756

xBambey · 2025-01-15T08:49:08Z

Some PDFs in a project could not be read by the parser.

On closer examination of the binary data, I noticed that there is often a space before the byte offset that follows startxref.

A brief search online did not turn up any information on whether this space is permitted. Perhaps someone here who is more familiar with the subject knows more.

Inserting an optional space into the regex at this point lets the parser recognize these PDFs again.

The regex would then look like this:

'/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i'

pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php

Lines 884 to 891 in f44ada0

    
    // Find all startxref tables from this $offset forward
    $startxrefPreg = preg_match_all(
        '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i',
        $pdfData,
        $startxrefMatches,
        \PREG_SET_ORDER,
        $offset
    );
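To illustrate the difference, here is a minimal sketch that runs the current pattern and the proposed pattern against a hand-written trailer in which the offset line starts with a space. The sample data and the variable names other than `$pdfData` are assumptions made up for illustration; they are not taken from a real PDF or from the library.

```php
<?php
// Hypothetical trailer: note the space before the offset on its own line.
$pdfData = "trailer\n<< /Size 10 >>\nstartxref\n 12345\n%%EOF\n";

// Pattern currently used in RawDataParser.php (as quoted above).
$current = '/(?<=[\r\n])startxref[\s]*[\r\n]+([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Proposed pattern: allows optional whitespace directly before the offset.
$proposed = '/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i';

$currentCount = preg_match_all($current, $pdfData, $currentMatches, \PREG_SET_ORDER);
$proposedCount = preg_match_all($proposed, $pdfData, $proposedMatches, \PREG_SET_ORDER);

// prints: current: 0, proposed: 1
echo "current: $currentCount, proposed: $proposedCount\n";
```

With this input, only the proposed pattern finds the startxref block, and its first capture group holds the offset `12345`.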


unixnut · 2025-01-30T08:24:45Z

Other files can cause the Uncaught Exception: Unable to find startxref error too.

getXrefData() uses a regex that requires a newline before the offset value. This means that some PDF files (e.g. containing "startxref 1746580") won't parse.

Example PDF details seen in production:
Creator: DocuPrint CM225/228
PDF version: 1.7

A different example will be attached soon.

A combination of my working fix for the problem mentioned in this comment, plus the one above, is:
'/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i'
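As a quick check of the combined pattern, the sketch below tries it against three whitespace layouts: offset on the next line, offset on the same line as startxref, and offset on the next line with a leading space. The sample strings and the `$samples` / `$offsets` names are hypothetical and only stand in for real trailer data.

```php
<?php
$combined = '/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i';

// Hypothetical trailer endings; only the whitespace layout differs.
$samples = [
    'offset on next line' => "\nstartxref\n1746580\n%%EOF",
    'offset on same line' => "\nstartxref 1746580\n%%EOF",
    'space before offset' => "\nstartxref\n 1746580\n%%EOF",
];

$offsets = [];
foreach ($samples as $label => $pdfData) {
    $found = preg_match($combined, $pdfData, $m);
    // Store the captured offset, or null when the pattern does not match.
    $offsets[$label] = (1 === $found) ? $m[1] : null;
}

foreach ($offsets as $label => $offset) {
    echo $label . ': ' . var_export($offset, true) . "\n";
}
```

All three layouts should yield the offset `1746580` with this pattern.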

xBambey · 2025-01-30T12:14:33Z

I also had other cases that caused problems.
Yesterday I had a case where everything was on one line.

My current solution to the problem is

'/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i'

I am currently not aware of any case where the final values (startxref, offset, %%EOF) are on the same line as other information.
The beginning might have to be expanded to allow a space instead of a newline.

k00ni · 2025-01-30T13:19:21Z

@xBambey @unixnut Feel free to create a pull request and start a discussion if you think it's worth changing the related code.

unixnut · 2025-02-03T04:27:56Z

> My current solution to the problem is
> '/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i'

This seems too general to me: for example, it allows multiple newlines between the startxref statement and the offset. It would also allow the EOF token on the same line as the offset.

@xBambey what did you think of my regex that allows for either newlines or spaces, plus optional leading/trailing spaces around the offset?
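To make the comparison concrete, here is a small sketch that runs both patterns from this thread against the two layouts mentioned above. The inputs are invented for illustration, and the labels "combined" and "general" are just names used here for the two proposals.

```php
<?php
$combined = '/(?<=[\r\n])startxref(?:[\s]*[\r\n]+[\s]*|\s+)([0-9]+)[\s]*[\r\n]+%%EOF/i';
$general = '/(?<=[\r\n])startxref[\s\r\n]+([0-9]+)[\s\r\n]+%%EOF/i';

// Hypothetical trailer endings covering the two disputed layouts.
$samples = [
    'blank lines before offset' => "\nstartxref\n\n\n1746580\n%%EOF",
    'EOF on the offset line' => "\nstartxref\n1746580 %%EOF",
];

$results = [];
foreach ($samples as $label => $pdfData) {
    $results[$label] = [
        'combined' => 1 === preg_match($combined, $pdfData),
        'general' => 1 === preg_match($general, $pdfData),
    ];
}

var_export($results);
```

On these inputs, both patterns accept blank lines between startxref and the offset, because `[\r\n]+` and `\s+` each match several newlines. Only the general pattern accepts `%%EOF` on the same line as the offset, so the practical difference between the two proposals appears to lie in the whitespace before `%%EOF`.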

xBambey · 2025-02-04T08:17:38Z

@unixnut your regex looks good. I kept my solution as general as possible because I don't want to keep making further adjustments.

That is also why it allows the EOF token to be on the same line as the offset.

I know that this doesn't normally happen, but there is the extremely rare case in which PDF generation throws an error in one of its final functions, which could actually cause this to happen.

unixnut · 2025-02-06T03:10:49Z

I think the startxref statement and the eof marker are always produced together and it's safer to look for the eof marker on a new line.

Because this is a design decision that is related to a specification I haven't read, can we get @k00ni to make the call on which regex shall be used in the PR?

k00ni · 2025-02-19T14:41:00Z

> Because this is a design decision that is related to a specification I haven't read, can we get @k00ni to make the call on which regex shall be used in the PR?

To be honest, I lack the knowledge to make this call. Could one of you briefly outline the pros and cons of each approach again? This might help us move toward a decision.

k00ni added the fix label Jan 30, 2025
