Presenting PDF files (and other doc types) in the browser #7

kohsah · 2018-01-05T04:59:36Z

PDF files (and other formats such as DOCX) are hard to present online. Browser PDF viewers are complex pieces of software in their own right, and PDFs are not rendered consistently across mobile and desktop browsers. Formats like DOCX can be converted to PDF first and then presented the same way.

Approach 1 - present pdf directly

Large PDF files load slowly, because even viewing the first few pages requires downloading the full PDF document into the browser. This is the approach we currently follow.

An alternative is to process the PDF files into linearized PDFs using something like qpdf, so that a viewer can start showing the first pages before the whole file has arrived.

This still leaves the problem that the browser loads a single, potentially large PDF.
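As a rough illustration of the linearization step, here is a minimal Python sketch that shells out to qpdf with its `--linearize` option. It assumes qpdf is installed and on the PATH; the function names are made up for this example and are not part of any existing code.

```python
import subprocess
from pathlib import Path


def build_linearize_command(src: Path, dst: Path) -> list[str]:
    """Return the qpdf invocation that writes a linearized copy of src to dst."""
    return ["qpdf", "--linearize", str(src), str(dst)]


def linearize(src: Path, dst: Path) -> None:
    """Run qpdf (assumed to be on the PATH).

    check=True raises CalledProcessError on any non-zero exit status.
    """
    subprocess.run(build_linearize_command(src, dst), check=True)
```

Note that a linearized file only helps if the server and the viewer both support byte-range requests; otherwise the browser still downloads the whole document before showing anything.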

Approach 2 - convert a pdf to an image at runtime

A PDF (or a specific page of one) can be converted to an image at runtime and presented online. This allows pages to be requested on demand, and since the pages themselves are just images they load across devices without a problem. It does imply an intermediate service that turns each PDF page request into an image.
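The core of such an intermediate service could look something like the sketch below. Everything specific here is an assumption for illustration: the `/render/<doc>/<page>.png` URL scheme, the `PDF_ROOT` directory, and the choice of Poppler's `pdftoppm` as the renderer (the thread does not name a tool). The HTTP layer itself is left out.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Hypothetical storage location and URL scheme, chosen only for this sketch.
PDF_ROOT = Path("/srv/pdfs")
PAGE_URL = re.compile(r"/render/(?P<doc>[A-Za-z0-9_-]+)/(?P<page>[1-9][0-9]*)\.png")


def parse_page_request(url_path: str):
    """Return (doc_id, page_number) for a well-formed URL path, else None.

    The strict pattern also rejects path traversal such as "../".
    """
    match = PAGE_URL.fullmatch(url_path)
    if match is None:
        return None
    return match.group("doc"), int(match.group("page"))


def build_render_command(pdf: Path, page: int, out_prefix: Path, dpi: int = 100) -> list[str]:
    """pdftoppm call that renders exactly one page to <out_prefix>.png."""
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page are the same
        "-r", str(dpi),
        "-png", "-singlefile",
        str(pdf), str(out_prefix),
    ]


def render_page(doc_id: str, page: int) -> bytes:
    """Render one page into a temporary directory and return the PNG bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        out_prefix = Path(tmp) / "page"
        pdf = PDF_ROOT / f"{doc_id}.pdf"
        subprocess.run(build_render_command(pdf, page, out_prefix), check=True)
        return out_prefix.with_suffix(".png").read_bytes()
```

A request handler would call `parse_page_request` on the incoming path, answer 404 when it returns `None`, and otherwise send back the bytes from `render_page` with an `image/png` content type. Caching the rendered pages would move this design closer to Approach 3.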

Approach 3 - preprocess the PDF into images

Convert the PDF into images in advance and serve those images when the browser requests them. The complete PDF can still be made available for download. This approach is similar to Approach 2, but simpler because there is no intermediate service that processes the PDF on request. The downside is that disk-space usage roughly doubles, as the images are essentially duplicates of the file.
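A batch job for this could be sketched as follows. As before, `pdftoppm` from Poppler is an assumed tool and the helper names are hypothetical. The `image_overhead` helper is there so the disk-space cost can be measured on real documents rather than estimated, since it depends heavily on resolution and page content.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf: Path, out_dir: Path, dpi: int = 100) -> list[str]:
    """pdftoppm call that renders every page to numbered PNGs under out_dir."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_dir / "page")]


def needs_conversion(pdf: Path, out_dir: Path) -> bool:
    """True if there are no images yet, or the PDF is newer than its oldest image."""
    if not out_dir.is_dir():
        return True
    images = list(out_dir.glob("*.png"))
    if not images:
        return True
    return pdf.stat().st_mtime > min(img.stat().st_mtime for img in images)


def image_overhead(pdf: Path, out_dir: Path) -> float:
    """Total size of the derived images divided by the size of the source PDF."""
    image_bytes = sum(img.stat().st_size for img in out_dir.glob("*.png"))
    return image_bytes / pdf.stat().st_size


def preprocess(pdf: Path, out_dir: Path) -> None:
    """Convert the PDF only when its images are missing or stale."""
    if needs_conversion(pdf, out_dir):
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_convert_command(pdf, out_dir), check=True)
```

An `image_overhead` value of about 1.0 would correspond to the "doubling" described above; scanned PDFs and high DPI settings can push it well above or below that.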

Approach 4 - using specialized tools that convert PDF to HTML "lookalikes"

See http://coolwanglu.github.io/pdf2htmlEX/

ccsmart · 2018-01-08T12:56:57Z

Sticking to "PDF" as the example for the general class of document types that not every web browser can display.

Approach 1 will probably remain the default option for heavy clients, where we expect readers to be available and no problems. This is also the version that is covered by digital signing. None of the other options should be designated as digitally signed, as there is a possibility that derivatives are out of date, tampered with or otherwise incorrect due to mistakes (e.g. a restore), bugs (failing updates, mistakes in dynamic filename creation) or abuse.
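One way to at least detect out-of-date derivatives is to record a digest of the source PDF next to the generated images and compare it before serving. The sketch below uses SHA-256 from Python's standard library; the manifest layout and function names are assumptions made up for this example. It catches accidental drift only: anyone who can rewrite the images can rewrite the manifest too, so it is not a substitute for the digital signature on the original.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(pdf: Path, images: list[Path], manifest: Path) -> None:
    """Record the digests of the source PDF and of each derived image."""
    record = {
        "source": sha256_of(pdf),
        "images": {img.name: sha256_of(img) for img in images},
    }
    manifest.write_text(json.dumps(record, indent=2))


def derivatives_are_current(pdf: Path, images: list[Path], manifest: Path) -> bool:
    """True only if the PDF and every listed image still match the manifest."""
    record = json.loads(manifest.read_text())
    if record["source"] != sha256_of(pdf):
        return False
    return all(record["images"].get(img.name) == sha256_of(img) for img in images)
```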

Approach 4 is to some extent close to the general conversion / OCR that we are looking at in the context of data analysis and improving content search. The risk remains that an incorrect interpretation of the document structure changes its meaning. So I'm thinking this approach may not add much utility over the pure text edition, but it probably adds a lot of complexity and with it more maintenance.

So of these I'm thinking either 2 or 3 should be done, with no clear preference. Nonetheless I'm leaning towards preprocessing (Approach 3), for smaller benefits such as lower latency and easier diagnostics.

kohsah mentioned this issue Jan 5, 2018

Switch to using React-PDF for loading pdfs on the page gawati/gawati-portal-ui#6

Closed

kohsah added the enhancement label Jan 12, 2018
