Feature request: Pandoc integration #2491

Closed
maggie44 opened this issue Jan 15, 2021 · 10 comments

Comments

@maggie44

maggie44 commented Jan 15, 2021

Hi @ssddanbrown,

I was thinking of Pandoc integration as an optional module. It would make the various exports more efficient by keeping the assets separate, as discussed above (and potentially resolve some other outstanding issues), and it would also provide a number of additional options, such as EPUB (#1949), Word doc, video export support (#883; #2412) and more.

Here are a few shortcuts to try it out:

  1. Here is Pandoc: https://pandoc.org
  2. It is in most package repositories, so apt-get install pandoc or brew install pandoc should do the trick (if installing in a Docker container, you may need to install build-essential and/or curl).
  3. An example Markdown I have tested with:

test.md

# Test file
Test MD File.

[![Build Status](https://cdn.vox-cdn.com/thumbor/zEZJzZFEXm23z-Iw9ESls2jYFYA=/89x0:1511x800/1600x900/cdn.vox-cdn.com/uploads/chorus_image/image/55717463/google_ai_photography_street_view_2.0.jpg)](https://travis-ci.org/joemccann/dillinger)
Dillinger is a cloud-enabled, mobile-ready, offline-storage, AngularJS powered HTML5 Markdown editor.

  - Type some Markdown
  - Convert some Markdown

![](https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4)

# New Features!

  - sdfsdf
  - sdfsdvldkvnc
 
You can also:
  - send

Execute the command:

pandoc test.md -o example2.html --extract-media ./assets

More information on this was originally discussed in #2412.
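
For anyone wanting to script the command above rather than type it, here is a minimal Python sketch. It only rebuilds the same pandoc call (test.md, example2.html and ./assets are the names from the example); the function names build_pandoc_command and convert are made up for illustration and are not part of BookStack or Pandoc.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_pandoc_command(source: Path, output: Path, media_dir: Path) -> List[str]:
    """Build the argument list for the pandoc call shown above."""
    return [
        "pandoc",
        str(source),
        "-o", str(output),
        # --extract-media writes images and other media into media_dir
        # and points the output document at those files.
        "--extract-media", str(media_dir),
    ]


def convert(source: Path, output: Path, media_dir: Path) -> None:
    """Run pandoc, failing early if the binary is not on PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(build_pandoc_command(source, output, media_dir), check=True)
```

Calling convert(Path("test.md"), Path("example2.html"), Path("assets")) should be equivalent to running the command by hand.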

@maggie44
Author

maggie44 commented Jan 15, 2021

@ssddanbrown in response to the last comment over in #2412, indeed, these ostensibly simple things often get more complex very quickly.

In terms of workflow, after giving it some thought, perhaps an integration similar to the one for WKHTMLTOPDF would work. The user installs Pandoc manually, following the Pandoc docs for their environment (apt-get install pandoc on Ubuntu, for example), and then adds a PANDOC=True variable to the .env file, so that BookStack doesn't have any responsibility for the Pandoc install.

When PANDOC=True, there could be some new entries in the export dropdown menu: EPUB and HTML Archive (or something more logically named than HTML Archive).
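
To make the idea concrete, here is a rough Python sketch of the gating logic (BookStack itself is PHP, so this is only an illustration of the behaviour, not a proposed implementation). The PANDOC variable name comes from the suggestion above; the base format list and the function names are assumptions for the example.

```python
from collections.abc import Mapping
from typing import List, Optional

# Illustrative lists only; the real export menu may differ.
BASE_FORMATS = ["PDF", "HTML"]
PANDOC_FORMATS = ["EPUB", "HTML Archive"]


def pandoc_enabled(env: Mapping) -> bool:
    """Treat PANDOC=True (case-insensitive), 1 or yes as enabled."""
    return str(env.get("PANDOC", "")).strip().lower() in {"true", "1", "yes"}


def export_formats(env: Mapping, pandoc_path: Optional[str]) -> List[str]:
    """Offer the extra formats only if the flag is set and pandoc was found."""
    formats = list(BASE_FORMATS)
    if pandoc_enabled(env) and pandoc_path is not None:
        formats += PANDOC_FORMATS
    return formats
```

In practice this could be called as export_formats(os.environ, shutil.which("pandoc")), so the extra entries disappear again if Pandoc is uninstalled.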

BookStack would then, hopefully, pass the same content it already pulls for the current export features to the locally installed Pandoc, and return the output for download.
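
As a sketch of that hand-off, assuming the page content is available as an HTML string, something like the following Python would pipe it to a local pandoc over stdin. The function names are hypothetical; the --from, --to and --output options are standard Pandoc flags. An output path is used because binary formats such as EPUB cannot be written to stdout.

```python
import subprocess
from typing import List


def pandoc_args(target_format: str, output_path: str) -> List[str]:
    """Arguments for converting HTML read from stdin into target_format."""
    return ["pandoc", "--from", "html", "--to", target_format, "--output", output_path]


def export_with_pandoc(html: str, target_format: str, output_path: str) -> None:
    """Send the page HTML to pandoc on stdin and write the result to output_path."""
    subprocess.run(
        pandoc_args(target_format, output_path),
        input=html.encode("utf-8"),
        check=True,  # raise if pandoc exits with an error
    )
```

For example, export_with_pandoc(page_html, "epub", "page.epub") would produce a file that could then be returned as the download.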

Using the same approach as WKHTMLTOPDF makes it less mission critical to maintain and allows for some dev experimentation. Similarly, I would only add EPUB and HTML Archive rather than replace the current PDF and HTML export processes, as I am certainly not confident enough in it to recommend that off the bat.

I realise a lot of this is preaching to the choir, but it seems you have plenty of tickets and things on your plate, so I figure the more thought and detail given to a feature request and its use case before making the request, the better.

Big thanks for the work on this, it is going to become quite a central part of our EdTech COVID response work.

@maggie44
Author

After further thought, how about simplifying this down to allowing the original Markdown that BookStack uses to be exported? If included in the API, this would allow us to use third-party processing of exported data (like Pandoc) without the extra support burden.

@ssddanbrown
Member

Hi @Maggie0002,
If you're using the Markdown editor to edit pages, the pages API should already provide the stored Markdown content (pages.show endpoint).
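
A minimal Python sketch of reading a page through the API follows. The /api/pages/{id} path and the "Token id:secret" Authorization header reflect my reading of the BookStack API docs and should be treated as assumptions; check the /api/docs page on your own instance. The function names and the example host are placeholders.

```python
import json
import urllib.request


def build_page_request(base_url: str, page_id: int,
                       token_id: str, token_secret: str) -> urllib.request.Request:
    """Build a GET request for a single page (assumed path: /api/pages/{id})."""
    url = f"{base_url.rstrip('/')}/api/pages/{page_id}"
    # Assumed auth scheme: "Token <token_id>:<token_secret>".
    headers = {"Authorization": f"Token {token_id}:{token_secret}"}
    return urllib.request.Request(url, headers=headers)


def fetch_page(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a valid token, fetch_page(build_page_request("https://docs.example.com", 12, token_id, token_secret)) should return the page as a dictionary, including the stored Markdown for pages written in the Markdown editor.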

@maggie44
Author

maggie44 commented Jan 24, 2021

> Hi @Maggie0002,
> If you're using the Markdown editor to edit pages, the pages API should already provide the stored Markdown content (pages.show endpoint).

Whoops, sorry, I thought it defaulted to Markdown. I meant an API endpoint to export the WYSIWYG content as is, rather than converting it first to HTML or PDF. I don't see that in the API docs.

@ssddanbrown
Member

That (pages => read) endpoint should give you the HTML that's used when viewing a page. This is pretty much the same as the HTML loaded in the WYSIWYG editor but with a pass to remove some potentially dangerous elements.

@maggie44
Author

maggie44 commented Jan 26, 2021

Helpful and interesting, thanks. My understanding, then, is that the only difference is that the export -> HTML function takes the same HTML seen in the pages -> read endpoint and passes it to a processor that converts pictures etc. into an embedded HTML file. The API output comes without headers, which presumably is what the HTML processor takes care of (among other things).

Will experiment with that endpoint and report back anything useful.

@maggie44
Author

> Helpful and interesting, thanks. My understanding, then, is that the only difference is that the export -> HTML function takes the same HTML seen in the pages -> read endpoint and passes it to a processor that converts pictures etc. into an embedded HTML file. The API output comes without headers, which presumably is what the HTML processor takes care of (among other things).
>
> Will experiment with that endpoint and report back anything useful.

Didn't get very far. It turns out the HTML the API pipes out is missing headings, CSS and all the formatting, so it would be a lot of work to go from there to something usable.

Is there a way to access the HTML used by the exporter but with the original HREF to the images and/or video rather than the embedded images? In theory it would then be fairly simple to mirror that page and get it together with its exported content. Wget, for example, has a --mirror option I could experiment with as a lightweight solution.
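
As a lighter alternative to a full mirror, the media references could be pulled out of the page HTML and downloaded separately. The sketch below uses only the Python standard library; MediaCollector and media_urls are names invented for this example, and it only looks at src attributes on img, video and source tags.

```python
from html.parser import HTMLParser
from typing import List


class MediaCollector(HTMLParser):
    """Collect the src attribute of image and video related tags."""

    MEDIA_TAGS = {"img", "video", "source"}

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.MEDIA_TAGS:
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)


def media_urls(html: str) -> List[str]:
    """Return media URLs in the order they appear in the HTML."""
    collector = MediaCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

The resulting list could then be handed to wget or any other downloader to store the assets next to the exported page.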

@ssddanbrown
Member

> Is there a way to access the HTML used by the exporter but with the original HREF to the images and/or video rather than the embedded images?

There's no way to get that directly, although the main content HTML is what you'd get out of the API; the export just wraps it up in a template with some extra styles. The export uses this template, with these export styles.
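
A rough Python sketch of that wrapping step, for anyone post-processing the API output themselves, is below. This is not BookStack's actual export template; the markup is a generic minimal HTML document, and wrap_page and its parameters are hypothetical names.

```python
from html import escape


def wrap_page(title: str, body_html: str, css: str = "") -> str:
    """Wrap an HTML fragment in a minimal standalone document.

    body_html is inserted as-is (it is assumed to be already sanitised),
    while the title is escaped.
    """
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        f"<style>{css}</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>\n"
    )
```

Because image and video references in body_html are left untouched, the result keeps the original links rather than embedded assets.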

@maggie44 maggie44 mentioned this issue Feb 1, 2021
2 tasks
@maggie44
Author

maggie44 commented May 27, 2021

Having given it some more thought, how would you feel about Pandoc as an optional exporter, similar to how wkhtmltopdf is currently integrated? This wrapper is proving useful: https://github.com/ueberdosis/pandoc

Would also help resolve some other issues that I don't think we will find a way around:

linuxserver/docker-bookstack#80
#2459

@ssddanbrown
Member

Hi @Maggie0002,
Sorry for my lack of response.

To be honest, I'd not be very keen. Supporting both of the existing PDF export options has already proved a lot more challenging than hoped and has consumed a lot of my time in the various requests & issues that have arisen from it. The range of conversion formats that pandoc would open up worries me, and I think it's optimistic to expect it to solve more issues than it creates as an alternative PDF generator, especially since I believe pandoc will use WKHTMLtoPDF by default anyway for HTML to PDF conversions.


2 participants