Presenting PDF files (and other doc types) in the browser #7

Open
kohsah opened this issue Jan 5, 2018 · 1 comment

Comments

@kohsah
Contributor

kohsah commented Jan 5, 2018

PDF files (and other formats like DOCX) pose a challenge for presenting content online. PDF viewers for browsers are complex pieces of software in themselves, and there is no consistent standard for presenting PDFs across mobile and desktop browsers. Formats like DOCX can be converted to PDF and made available for presentation.

Approach 1 - present PDF directly

Large PDF files load slowly, because even viewing the first few pages requires loading the full PDF document into the browser. This is the approach we currently follow.

An alternative is to linearize the PDF with a tool such as qpdf, so that a viewer can start displaying the first pages before the whole file has arrived.

This still leaves the problem of loading a single PDF.
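
A minimal sketch of the linearization step, assuming the `qpdf` binary is on the PATH. The function names are hypothetical, and `looks_linearized` is only a cheap heuristic (it checks for the `/Linearized` marker near the start of the file), not a full validation:

```python
import subprocess
from pathlib import Path
from typing import List


def linearize_command(src: Path, dst: Path) -> List[str]:
    """Build the qpdf invocation that writes a linearized copy of src to dst."""
    return ["qpdf", "--linearize", str(src), str(dst)]


def looks_linearized(pdf_bytes: bytes) -> bool:
    """Heuristic: a linearized PDF carries a /Linearized dictionary
    within the first 1024 bytes of the file."""
    return b"/Linearized" in pdf_bytes[:1024]


def linearize(src: Path, dst: Path) -> None:
    """Run qpdf. Any non-zero exit status is treated as a failure here."""
    subprocess.run(linearize_command(src, dst), check=True)
```

For a viewer to benefit from this, the web server also has to answer HTTP range requests, otherwise the browser still fetches the file from start to end.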

Approach 2 - convert a PDF to an image at runtime

A PDF (or a specific page of a PDF) can be converted to an image at runtime and presented online. This allows pages to be requested on demand, and because the pages themselves are just images, they load across devices without a problem. It does require an intermediate service that turns each PDF page request into an image.
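
As a rough sketch of what that intermediate service would do per request: render one page and cache the result. This assumes Poppler's `pdftoppm` is installed, which is my choice of tool and not something named in this issue. The function names and the cache file naming are hypothetical:

```python
import subprocess
from pathlib import Path
from typing import List


def page_image_command(pdf: Path, page: int, out_prefix: Path, dpi: int = 100) -> List[str]:
    """Build a pdftoppm call that renders exactly one page to <out_prefix>.png."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page are the same page
        "-r", str(dpi),                    # resolution in DPI
        "-png", "-singlefile",             # one PNG, no page-number suffix
        str(pdf), str(out_prefix),
    ]


def render_page(pdf: Path, page: int, cache_dir: Path, dpi: int = 100) -> Path:
    """Return the PNG for one page, rendering it only on the first request."""
    out_prefix = cache_dir / f"{pdf.stem}-p{page}-{dpi}dpi"
    png = Path(f"{out_prefix}.png")
    if not png.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(page_image_command(pdf, page, out_prefix, dpi), check=True)
    return png
```

A web handler would call `render_page(pdf, page, cache_dir)` and stream the returned file. With the cache in place, only the first request for a page pays the rendering cost.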

Approach 3 - preprocess the PDF into images

Convert the PDF into images in advance and serve the images when the browser requests them. The complete PDF can still be made available for download. This approach is similar to Approach 2, but simpler, because there is no intermediate service that processes the PDF. The downside is that disk-space usage immediately doubles, as the images are essentially duplicates of the file.
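
A sketch of the preprocessing job, again assuming Poppler's `pdftoppm` (my assumption, not a tool named here) and hypothetical function names. `storage_ratio` is included so the disk-space overhead can be measured on real documents instead of estimated:

```python
import subprocess
from pathlib import Path


def needs_render(pdf: Path, out_dir: Path) -> bool:
    """True when no images exist yet, or any image is older than the PDF."""
    images = list(out_dir.glob("*.png"))
    if not images:
        return True
    oldest_image = min(p.stat().st_mtime for p in images)
    return pdf.stat().st_mtime > oldest_image


def render_all(pdf: Path, out_dir: Path, dpi: int = 100) -> None:
    """Render every page of the PDF into out_dir as page-<n>.png files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_dir / "page")],
        check=True,
    )


def storage_ratio(pdf: Path, out_dir: Path) -> float:
    """Total size of the rendered images relative to the source PDF
    (assumes the PDF is not an empty file)."""
    image_bytes = sum(p.stat().st_size for p in out_dir.glob("*.png"))
    return image_bytes / pdf.stat().st_size
```

The ratio will vary with the chosen resolution and with how much of each page is text versus scanned imagery, so it is worth checking on a sample of the actual collection.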

Approach 4 - using specialized tools that convert PDF to HTML "lookalikes"

See http://coolwanglu.github.io/pdf2htmlEX/

@ccsmart
Contributor

ccsmart commented Jan 8, 2018

I'm sticking to "PDF" as the example for the general class of document types that not every web browser can display.

Approach 1 will probably remain the default option for heavy clients, where we expect no problems with the availability of readers. This is also the version that is covered by digital signing. None of the other options should be designated as digitally signed, as there is a possibility that derivatives might be out of date, tampered with or otherwise incorrect due to mistakes (e.g. a restore), bugs (failing updates, mistakes in dynamic filename creation) or abuse.
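
To illustrate the "out of date" case only: a derivative set could record the hash of the PDF it was generated from, so a stale set can be detected and regenerated. This is a sketch under my own assumptions (the `manifest.json` name and the function names are hypothetical). It does not make the derivatives signed, and it does not detect tampering with the images themselves:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(pdf: Path, derivative_dir: Path) -> Path:
    """Record which source PDF the derivatives in this directory came from."""
    manifest = derivative_dir / "manifest.json"
    manifest.write_text(json.dumps({"source": pdf.name, "sha256": sha256_of(pdf)}))
    return manifest


def derivatives_are_current(pdf: Path, derivative_dir: Path) -> bool:
    """False when there is no manifest or the PDF has changed since rendering."""
    manifest = derivative_dir / "manifest.json"
    if not manifest.exists():
        return False
    recorded = json.loads(manifest.read_text())
    return recorded.get("sha256") == sha256_of(pdf)
```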

Approach 4 is to some extent close to general conversion / OCR, which we are looking at in the context of data analysis and improving content search. The risk of changing the meaning through incorrectly interpreted structure remains. So I'm thinking this approach may not add much utility over the pure text edition, but it probably adds a lot of complexity and therefore maintenance.

So of these, I'm thinking either 2 or 3 should be done, with no clear preference. Nonetheless, I'm leaning towards preprocessing (Approach 3) for its smaller benefits, such as lower latency and easier diagnostics.


2 participants