Expected positive integer in object trailer #7

Open
GhostRock37 opened this issue Feb 2, 2017 · 3 comments

@GhostRock37

Hello,

I have a problem with a PDF.
An antivirus flags it as malformed, and I would like to know where exactly it violates the PDF structure.

I also think your tool should be able to clean it. Can you tell me how?
Thanks for your help!

./caradoc cleanup ../PDF_MALFORMED/KO/1/1.pdf --out ../PDF_MALFORMED/KO/1/2.pdf
PDF error : Expected positive integer in object trailer at entry /Prev at offset 1872031 [0x1c909f] in file !

Here is the end of the PDF:
<< /Pages 1 0 R /Type /Catalog >>
endobj
xref
1 5
0001871801 00000 n
0000000208 00000 n
0001871655 00000 n
0000000012 00000 n
0001871861 00000 n
trailer
<< /Prev 0 /Root 5 0 R /Size 6 >>
startxref
1871913
%%EOF

@gendx
Member

gendx commented Feb 6, 2017

Thank you for your report.

The /Prev field in a trailer is supposed to be a byte offset in the file pointing to the start of the previous xref section. As such, it must be a non-negative integer. However, since offset 0 is supposed to contain the PDF magic string starting with %PDF, it cannot be zero either.

In your case, /Prev 0 may be meant to say that there is no previous xref table. To clean up the file, it might be worth removing the /Prev field altogether (erase it or replace it with spaces). If this doesn't work, could you provide the first few lines of the file?
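
The "replace it with spaces" suggestion can be sketched in Python as follows. This is only an illustration, not part of Caradoc: the function name `blank_prev_zero` is hypothetical, and the sketch assumes a classic (non-stream) trailer whose `trailer` keyword is the last occurrence of that word in the file. Overwriting the entry with the same number of spaces keeps the file length, and therefore every offset, unchanged.

```python
import re

# "/Prev", whitespace, a lone 0, then a delimiter (so "/Prev 0001234" is NOT matched).
PREV_ZERO = re.compile(rb"/Prev\s+0(?=[\s/>])")

def blank_prev_zero(pdf_bytes: bytes) -> bytes:
    """Overwrite a '/Prev 0' entry in the last trailer with spaces (hypothetical helper)."""
    # Assumption: the last "trailer" keyword in the file starts the final trailer.
    start = pdf_bytes.rfind(b"trailer")
    if start == -1:
        return pdf_bytes  # no classic trailer found; leave the data untouched
    match = PREV_ZERO.search(pdf_bytes, start)
    if match is None:
        return pdf_bytes  # no '/Prev 0' entry; nothing to do
    blanks = b" " * (match.end() - match.start())
    return pdf_bytes[:match.start()] + blanks + pdf_bytes[match.end():]
```

Applied to the trailer quoted in the report, this would leave `/Root 5 0 R /Size 6` in place and turn `/Prev 0` into seven spaces.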

We might consider adding some code, or a manual option, to handle this case in relaxed mode in the future.

@GhostRock37
Author

Thank you for your reply!

I tried removing the /Prev field, and now I get another error:

pdfVersion : 1.7
Incremental updates : 0
Neither updates nor object streams nor free objects nor encryption
Object count : 5
Filter : FlateDecode -> 2 times
Type error : Unexpected entry /Type in instance of class content_stream in object 3

Below is the output of an xref dump with caradoc:

trailer
<<
/Root 5 0 R
/Size 6

obj(1, 0)
<<
/Count 1
/Type /Pages
/Kids [4 0 R]

obj(2, 0)
<<
/ColorSpace /DeviceRGB
/Filter /FlateDecode
/Type /XObject
/Width 850
/Height 1170
/BitsPerComponent 8
/Subtype /Image
/Length 1824804

stream <encoded stream of length 1824804>

obj(3, 0)
<<
/Filter /FlateDecode
/Type /Stream
/Length 55

stream <encoded stream of length 55>

obj(4, 0)
<<
/Contents 3 0 R
/Rotate 0
/CropBox [0.0 0.0 850.0 1170.0]
/Type /Page
/Resources <<
/Font <<
>>
/XObject <<
/Im1 2 0 R
>>
>>
/MediaBox [0.0 0.0 850.0 1170.0]
/Parent 1 0 R

obj(5, 0)
<<
/Pages 1 0 R
/Type /Catalog

And here are the first lines of the file:

%PDF-1.7
4 0 obj
<< /Contents 3 0 R /CropBox [0.0 0.0 850.0 1170.0] /MediaBox [0.0 0.0 850.0 1170.0] /Parent 1 0 R /Resources << /Font << >> /XObject << /Im1 2 0 R >> >> /Rotate 0 /Type /Page >>
endobj
2 0 obj
<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter /FlateDecode /Height 1170 /Length 1824804 /Subtype /Image /Type /XObject /Width 850 >>
stream

Another question: I am looking for a way to convert malformed PDF files into correct ones.
We receive a lot of malformed PDFs, so we would like to convert them into a correct PDF format. Do you know of any tool or method that could do this?

I can think of two techniques.

The first: export the malformed PDF to a correct PDF. I ran a simple test with PDFCreator: printing a malformed PDF to a file that complies with the PDF/A standard yields a correctly formatted file.
What seems interesting about this technique is that polyglot files become plain PDFs, which is very interesting from a security point of view.

The second: parse the malformed PDF file, correct the errors, then export.

What do you think? Do you know of a tool of this kind? Can it be used in a web environment (for example, converting a PDF during upload)?

From a security point of view, should files converted to PDF/A be clean and no longer pose a virus threat?

Thanks!

@gendx
Member

gendx commented Feb 16, 2017

It looks like the first explanation was correct in your case (i.e. there should not be a /Prev field because there is no previous xref table).

> I tried removing the /Prev field, and now I get another error:

You now have a type error in an object of type "content_stream". The error seems legitimate because the specification does not define a "/Type" field for this type. Also, bear in mind that Caradoc aims to be a strict validator (e.g. to avoid any ambiguities), whereas a lot of PDF-producing software is not so strict, so type errors and inaccuracies are not uncommon.
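
To illustrate the kind of strict check behind this error, here is a minimal Python sketch. It is not Caradoc's type checker: the function name `unexpected_entries` is hypothetical, and the allowed set is my reading of the entries common to all stream dictionaries in ISO 32000-1, which is assumed here to be everything a content stream may carry.

```python
# Entries common to all stream dictionaries (assumption: taken from ISO 32000-1;
# a content stream is assumed to define no further entries, in particular no /Type).
STREAM_KEYS = {"/Length", "/Filter", "/DecodeParms", "/F", "/FFilter", "/FDecodeParms", "/DL"}

def unexpected_entries(keys, allowed=STREAM_KEYS):
    """Return the dictionary keys that a strict validator would reject, sorted by name."""
    return sorted(key for key in keys if key not in allowed)

# Keys of object 3 from the dump in this thread.
object_3_keys = ["/Filter", "/Type", "/Length"]
rejected = unexpected_entries(object_3_keys)
```

For object 3, `rejected` contains only `/Type`, which matches the entry named in the error message.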

Besides, this is still a beta version, i.e. the type system does not yet implement all of the 700+ pages of the PDF specification, which requires a large amount of work: the specification describes everything in a natural language (English text) and we have to convert it into a formal language. Even though the most common types are already implemented, you will probably end up with a type error/warning if your PDF input is a bit complex.

> Another question: I am looking for a way to convert malformed PDF files into correct ones.
> We receive a lot of malformed PDFs, so we would like to convert them into a correct PDF format. Do you know of any tool or method that could do this?

Caradoc is a good start to clean up the syntax. However, we do not modify the higher-level content (at least for now), to preserve the semantics of the file and avoid inadvertently destroying legitimate features. So yes, another converter (e.g. "printing" to PDF/A) can complement it by removing all kinds of features.

One day we might implement in Caradoc a more thorough converter that only keeps the core graphical content (similarly to the "print" feature that you mention).

Also, bear in mind that some errors are ambiguous, e.g. they are interpreted differently by distinct PDF readers. In that case, the choice made by Caradoc is to reject the file as "unrecoverable".

> I can think of two techniques.
>
> The first: export the malformed PDF to a correct PDF. I ran a simple test with PDFCreator: printing a malformed PDF to a file that complies with the PDF/A standard yields a correctly formatted file.

If you trust PDFCreator to be robust against malformed files, it's also a good start.

> What seems interesting about this technique is that polyglot files become plain PDFs, which is very interesting from a security point of view.

In principle, "caradoc cleanup" gets rid of polyglot files by converting the low-level syntax. But the original polyglot needs to be close enough to a PDF file for the normalizer to work. It depends on whether you want broad coverage and accept weird polyglots, or prefer to be stricter about the inputs you accept.
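
If you go for the stricter option, one possible pre-filter is to reject anything that does not start with the PDF header and end with the end-of-file marker. The sketch below is only an assumption about what such a filter could look like (the name `looks_like_plain_pdf` is hypothetical); it is not what Caradoc does internally, and it is deliberately stricter than readers that tolerate extra bytes before the header.

```python
def looks_like_plain_pdf(data: bytes) -> bool:
    """Cheap structural pre-check: %PDF- at offset 0 and %%EOF as the last token."""
    # The magic string must sit at offset 0; leading bytes are a common polyglot trick.
    header_ok = data.startswith(b"%PDF-")
    # Allow trailing whitespace/newlines after the marker, but nothing else.
    tail = data.rstrip(b"\r\n \t")
    return header_ok and tail.endswith(b"%%EOF")
```

A file that passes this check can still be malformed or malicious; the check only narrows down what is handed to the actual normalizer.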

> The second: parse the malformed PDF file, correct the errors, then export.

I don't really understand what you mean here. Correct the errors manually?

> What do you think? Do you know of a tool of this kind? Can it be used in a web environment (for example, converting a PDF during upload)?

There's no reason why it shouldn't work in a web environment. But the converter must be robust enough to not become a threat itself.
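
As a rough sketch of what an upload hook could look like, the Python function below writes the upload to a temporary directory and runs the `caradoc cleanup <in> --out <out>` command from the original report with a timeout. Several things here are assumptions rather than documented behaviour: that `caradoc` is on the PATH, that it exits with a non-zero status on failure, and the names `cleanup_upload` and `runner` (the latter only exists so the function can be exercised without the real binary).

```python
import os
import subprocess
import tempfile

def cleanup_upload(data: bytes, runner=subprocess.run, timeout=30):
    """Return cleaned PDF bytes, or None if the converter fails, stalls, or writes nothing."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "in.pdf")
        dst = os.path.join(workdir, "out.pdf")
        with open(src, "wb") as handle:
            handle.write(data)
        # Same command shape as in the report; assumes "caradoc" is on the PATH.
        cmd = ["caradoc", "cleanup", src, "--out", dst]
        try:
            result = runner(cmd, capture_output=True, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return None  # converter hung or could not be started
        # Assumption: a non-zero exit status signals a rejected file.
        if result.returncode != 0 or not os.path.exists(dst):
            return None
        with open(dst, "rb") as handle:
            return handle.read()
```

The timeout and the throwaway directory only limit the damage of a misbehaving converter; real isolation (a sandbox, a separate user, resource limits) would have to be added around it.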

> From a security point of view, should files converted to PDF/A be clean and no longer pose a virus threat?

PDF/A is a subset of the specification that may be relevant. However, just as Caradoc's input restrictions can be a problem, PDF/A conversion may damage interesting files, depending on your use case and the features you want to support.

Also, PDF/A conversion is somewhat orthogonal to the syntax sanitisation done by Caradoc, as PDF/A cares mostly about higher-level features (e.g. embedding all fonts inside the file). I am not an expert in PDF/A though, as it is yet another quite large specification. So "PDF/A printing" and "caradoc cleanup" are complementary operations.

Thanks again for your feedback!
