Possible startxrefPreg extension #756
Other files can cause the same failure: getXrefData() uses a regex that requires a newline before the offset value, so some PDF files (e.g. ones containing "startxref 1746580" on a single line) won't parse. Example PDF details seen in production: A different example will be attached soon. A combination of my working fix for the problem mentioned in this comment, plus the one above, is:
I also had other cases that caused problems. My current solution to the problem is
I am currently not aware of any case where the final values (startxref checksum %%EOF) are on the same line as other information.
This seems too general to me, e.g. allows multiple newlines between
@xBambey what did you think of my regex that allows for either newlines or spaces, plus optional leading/trailing spaces around the offset?
@unixnut your regex looks good. I kept my solution as general as possible because I don't want to keep making further adjustments. It therefore also allows the %%EOF token to be on the same line as the offset. I know that this doesn't normally happen, but there is the extremely rare case in which an error thrown in the last functions of the PDF generation could actually cause this to happen.
I think the
Because this is a design decision that is related to a specification I haven't read, can we get @k00ni to make the call on which regex shall be used in the PR?
To be honest, I lack the knowledge to make this call. Could one of you briefly outline the pros and cons of each approach again? This might help us move toward a decision.
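One way to make the trade-off concrete is to run candidate patterns against a few trailer layouts and see which ones each pattern accepts. The sketch below uses Python's `re` module as a stand-in for PHP's preg functions. `NEWLINE_ONLY` is the pattern quoted in the original report; `WHITESPACE_ANY` is a hypothetical, more permissive pattern written only for this comparison and is not the regex proposed by either commenter. The sample trailers are made up, with the offset value taken from the discussion.

```python
import re

# Pattern quoted in the original report (PHP /i flag -> re.IGNORECASE).
NEWLINE_ONLY = re.compile(
    r"(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF",
    re.IGNORECASE,
)

# Hypothetical permissive variant: any whitespace run between the tokens.
WHITESPACE_ANY = re.compile(
    r"(?<=[\r\n])startxref\s+([0-9]+)\s+%%EOF",
    re.IGNORECASE,
)

# Made-up trailer layouts used only for illustration.
SAMPLES = {
    "newline separated": "\nstartxref\n1746580\n%%EOF",
    "space before offset": "\nstartxref\n 1746580\n%%EOF",
    "offset on startxref line": "\nstartxref 1746580\n%%EOF",
    "EOF on offset line": "\nstartxref\n1746580 %%EOF",
    "blank lines before offset": "\nstartxref\n\n\n1746580\n%%EOF",
}


def compare():
    """Map each layout name to (accepted by NEWLINE_ONLY, accepted by WHITESPACE_ANY)."""
    return {
        name: (
            NEWLINE_ONLY.search(text) is not None,
            WHITESPACE_ANY.search(text) is not None,
        )
        for name, text in SAMPLES.items()
    }


for name, (strict_ok, loose_ok) in compare().items():
    print(f"{name:28} newline-only={strict_ok!s:5} whitespace-any={loose_ok}")
```

In this sketch the newline-based pattern rejects only the two layouts where tokens share a line, while the permissive one accepts all five. Both accept runs of blank lines, so that concern applies to either approach.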
Some PDFs in a project could not be read by the parser.
Closer examination of the binary data showed that there is often a space before the byte offset that follows startxref.
After a brief search on the Internet, I could not find any information on whether this space is permitted. Perhaps someone here who is more familiar with the subject knows more.
Inserting an optional space into the RegEx at this point lets the PDF be recognized again.
The RegEx would then look like this:
'/(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF/i'
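As a rough check of what this pattern accepts, here is a minimal sketch that translates it to Python's `re` module (an assumption made for testability; the parser itself is PHP). The helper name `find_startxref_offset` and the trailer fragments are hypothetical and are not taken from the library.

```python
import re

# Python translation of the PHP pattern quoted above (/i -> re.IGNORECASE).
PROPOSED = re.compile(
    r"(?<=[\r\n])startxref[\s]*[\r\n]+[\s]*([0-9]+)[\s]*[\r\n]+%%EOF",
    re.IGNORECASE,
)


def find_startxref_offset(pdf_tail):
    """Return the xref byte offset as an int, or None if the tail does not match."""
    match = PROPOSED.search(pdf_tail)
    return int(match.group(1)) if match else None


# Hypothetical trailer fragments; only the offset value comes from the thread.
tail_plain = "trailer\n<< /Size 10 >>\nstartxref\n1746580\n%%EOF\n"
tail_leading_space = "trailer\n<< /Size 10 >>\nstartxref\n 1746580\n%%EOF\n"
tail_same_line = "trailer\n<< /Size 10 >>\nstartxref 1746580\n%%EOF\n"

print(find_startxref_offset(tail_plain))
print(find_startxref_offset(tail_leading_space))
print(find_startxref_offset(tail_same_line))
```

Under these assumptions the pattern handles a space before the offset, but it still requires a line break after `startxref`, so a tail with the offset on the same line as `startxref` is not matched.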
pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php
Lines 884 to 891 in f44ada0