
PDF Lang flag raises error #1502

Open
ChAckermannSFA opened this issue Feb 3, 2025 · 7 comments

Comments

@ChAckermannSFA

When validating the PDF/A files in our archive with VeraPDF, we have been seeing a lot of error messages for files which other validators deem valid.

The error is from the /Lang tag if there is no space between /Lang and the language tag.
Image

I was able to get the file validated correctly by adding a space
Image

The PDF/A was created by Acrobat Distiller 9.3.0 (Windows) and you can find the original and the fixed documents attached.

Is this a bug which can get fixed in VeraPDF or do we have to fix the PDFs themselves?

p0001.pdf
p0001-fixed.pdf

@mkl-public

Isn't the NUL after the DE more likely to be the culprit?

@ChAckermannSFA
Author

You're right. Here's the same file with only the NUL removed, which VeraPDF recognizes as valid.

p0001-no-null.pdf

@mkl-public

As an aside, in both p0001-fixed.pdf and p0001-no-null.pdf you actually have

Image

I.e. no space between /Lang and the opening parenthesis (, and no 0x00 byte anymore.

Thus, you effectively only tested removing the NUL ;).

Furthermore, you have changed the encoding of that string to UTF-16 while in the original file it was encoded in PDFDocEncoding. But that likely was not a relevant change.
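
To illustrate why the encoding change is unlikely to matter: a PDF text string that begins with the byte order mark FE FF is UTF-16BE, otherwise it is PDFDocEncoding. The following is a minimal Python sketch, not a full decoder. It assumes Latin-1 as a stand-in for PDFDocEncoding (the two agree for printable ASCII; mapping byte 0x00 to U+0000 is a further assumption), and the byte values are illustrative rather than copied from the attached files. `decode_pdf_text_string` is a hypothetical helper, not part of any PDF library.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string (simplified sketch)."""
    if raw.startswith(b"\xfe\xff"):
        # Byte order mark FE FF: the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding here, which is
    # only safe for printable ASCII such as language tags.
    # The UTF-8 form allowed by PDF 2.0 is ignored in this sketch.
    return raw.decode("latin-1")


# Illustrative byte values based on the description in this thread.
original = decode_pdf_text_string(b"DE\x00")              # PDFDocEncoding, trailing NUL
reencoded = decode_pdf_text_string(b"\xfe\xff\x00D\x00E")  # UTF-16BE, no NUL

print(repr(original))
print(repr(reencoded))
```

Under these assumptions both encodings decode to the same two letters, and the only remaining difference is the trailing U+0000 in the original value.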

@petervwyatt

Interestingly, PDF and PDF/A are both vague about exactly how the /Lang entry should be parsed besides stating that in 14.9.2.2 "... [is a] Language-Tag as defined in BCP 47." Does leading or trailing whitespace invalidate the entry?

RFC 5646 states, "Whitespace is not permitted in a language tag" in 2.1 Syntax, 2nd last paragraph (assuming NUL counts as whitespace). This is very buried so maybe this needs to be noted somewhere for PDF/A (and PDF/UA) devs?

Leaving for @bdoubrov to decide in case he knows of some wording somewhere I have missed or past advice/discussions...

@mkl-public

> A language identifier shall either be the empty text string, to indicate that the language is unknown, or a Language-Tag as defined in BCP 47.

"shall be" to me sounds like equality, i.e. no whitespace allowed. "shall contain" could have been argued to allow for whitespace.

My 2c ;)

@bdoubrov
Contributor

bdoubrov commented Feb 5, 2025

> RFC 5646 states, "Whitespace is not permitted in a language tag" in 2.1 Syntax, 2nd last paragraph (assuming NUL counts as whitespace). This is very buried so maybe this needs to be noted somewhere for PDF/A (and PDF/UA) devs?

This is more explicit in RFC 4647, which covers the general syntax for language tags:

  Section 6, Character Set Considerations: "Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D)."

So, I think not permitting a trailing NUL is very logical. In the end, this is most likely an unintentional implementation issue.
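
The character-set rule quoted above can be turned into a small diagnostic. This is a hedged Python sketch that only checks the allowed characters (A-Z, a-z, 0-9, HYPHEN-MINUS) and accepts the empty string as "language unknown". It does not check the BCP 47 grammar and is not veraPDF's implementation. `lang_value_problems` is a hypothetical name.

```python
def lang_value_problems(value: str) -> list[str]:
    """Report characters in a /Lang value that a language tag cannot contain."""
    if value == "":
        # The empty text string is allowed and means "language unknown".
        return []
    problems = []
    for index, ch in enumerate(value):
        allowed = ch.isascii() and (ch.isalnum() or ch == "-")
        if not allowed:
            problems.append(
                f"disallowed character U+{ord(ch):04X} at index {index}"
            )
    return problems


for candidate in ["DE", "de-CH", "DE\x00", " DE"]:
    print(repr(candidate), lang_value_problems(candidate))
```

With this check a trailing NUL and a leading space are reported the same way, since neither is in the permitted character set.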

@petervwyatt

Where can we note this for posterity?
I assume veraPDF might even now do a regex check on the contents of the Lang keys!
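
For reference, a regex check of this kind could look roughly like the following Python sketch. It is an assumption about what such a check might do, not veraPDF's actual rule. The pattern covers only a simplified subset of the RFC 5646 `langtag` production (language, optional script, optional region, variants) and leaves out extlang, extensions, private-use and grandfathered tags. `is_simple_langtag` is a hypothetical name.

```python
import re

# Simplified subset of the RFC 5646 "langtag" shape (assumption: enough
# for common tags such as "de", "de-CH" or "zh-Hant-TW").
SIMPLE_LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                                   # primary language
    r"(?:-[A-Za-z]{4})?"                               # optional script
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                  # optional region
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"  # variants
)


def is_simple_langtag(value: str) -> bool:
    """True if the whole value matches the simplified pattern."""
    # fullmatch is used on purpose: with match() and a "$" anchor,
    # a trailing newline would slip through.
    return SIMPLE_LANGTAG.fullmatch(value) is not None


for candidate in ["DE", "de-CH", "zh-Hant-TW", "DE\x00", "DE ", "DE\n"]:
    print(repr(candidate), is_simple_langtag(candidate))
```

Because the whole string must match, any leading or trailing byte outside the tag, including NUL or whitespace, makes the value fail.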
