Table of Contents #282

Open
map7 opened this issue Feb 11, 2025 · 7 comments

Comments

@map7

map7 commented Feb 11, 2025

This is a feature request. It would be nice if we could add a table of contents using grover, with dynamic page numbers, on the second page (or a nominated page). It would have to be done after the PDF is created in order to get the page numbers; alternatively, maybe the same code that dynamically calculates the footer page numbers could be used.

Also if each item in the Table of Contents could be a link to the page that would be a great feature.

@abrom
Contributor

abrom commented Feb 12, 2025

Hey @map7, thanks for the suggestion. As far as I'm aware, though, this is already supported (in a way) by Puppeteer/Chrome. See puppeteer/puppeteer#11779

Per the documentation it is still marked as experimental, but it should just require passing through outline: true when initialising Grover (or via the global options or meta tag options). It doesn't require any change to Grover, just a version of puppeteer that includes the feature (anything after Feb 2024 should support it).
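A minimal sketch of what that might look like, assuming a Puppeteer version that ships the experimental outline option. The constant and method names here are illustrative only, not part of Grover:

```ruby
# Options handed to Grover, which passes them through to Puppeteer's page.pdf.
# `outline: true` is experimental in Puppeteer and needs a version that ships it.
OUTLINE_OPTIONS = { format: 'A4', outline: true }.freeze

# Hypothetical helper: renders HTML to a PDF whose headings show up in the
# sidebar outline of suitable PDF readers. It does not add an in-page TOC.
def render_with_outline(html)
  require 'grover' # loaded lazily; needs the grover gem plus Node/Puppeteer
  Grover.new(html, **OUTLINE_OPTIONS).to_pdf
end
```

Calling `render_with_outline('<h1>Report</h1><h2>Section</h2>')` should return the PDF as a binary string, provided Chrome and Puppeteer are available.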

The feature does not add a static TOC with a summary of the headings etc. in the way that you seem to be describing; rather, it adds the TOC to the PDF so that suitable PDF readers will display it in the sidebar. To have something render an in-page TOC, the problem is the data. It wouldn't be hard to use a script along with the execute_script option to iterate over the page, extract the heading elements and generate a TOC to insert into the page. The problem is that page numbers are generated in the guts of Chrome's rendering engine, and are not something you'd likely be able to calculate. You might be able to make some assumptions about "page height" to "guess" which page each heading would likely land on.
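A rough sketch of the execute_script idea, under loud assumptions: the page height is a guess (A4 at 96 CSS pixels per inch, ignoring margins), screen layout is assumed to match print layout, and inserting the TOC itself shifts every heading down, so the page numbers are estimates at best. The constant and method names are hypothetical:

```ruby
# JavaScript run in the page before the PDF is rendered. The single-quoted
# heredoc stops Ruby from interpolating anything inside the script.
TOC_SCRIPT = <<~'JS'
  (() => {
    const pageHeightPx = 1123; // assumption: A4 at 96px/inch, no margins
    const toc = document.createElement('ol');
    document.querySelectorAll('h1, h2, h3').forEach((heading, i) => {
      heading.id = heading.id || `toc-heading-${i}`;
      // Distance from the top of the document, used to guess the page
      const top = heading.getBoundingClientRect().top + window.scrollY;
      const page = Math.floor(top / pageHeightPx) + 1;
      const item = document.createElement('li');
      const link = document.createElement('a');
      link.href = `#${heading.id}`;
      link.textContent = `${heading.textContent.trim()} (p. ${page})`;
      item.appendChild(link);
      toc.appendChild(item);
    });
    document.body.prepend(toc);
  })();
JS

# Hypothetical helper: renders HTML with the estimated TOC prepended.
def render_with_inline_toc(html)
  require 'grover' # loaded lazily; needs the grover gem plus Node/Puppeteer
  Grover.new(html, format: 'A4', execute_script: TOC_SCRIPT).to_pdf
end
```

Because each entry is an anchor link to its heading, the entries should also be clickable in the resulting PDF, though that too depends on how Chrome renders the document.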

See puppeteer/puppeteer#1778 (comment) and puppeteer/puppeteer#1778 (comment) for some examples of what others have done.

HTH

@map7
Author

map7 commented Feb 12, 2025

The project I was working on required a page with a TOC on the second page of the PDF document.

I did manage to create a TOC, but it's a workaround and required the pdf-reader, combine_pdf and prawn gems. After the PDF is created I read each page, look for headings and record the page and heading in an array of hashes. Then I use prawn to create the TOC, and combine_pdf to put the TOC on page two and create a new PDF. It actually works very well; it's just that I didn't want to bring in all those gems and create three files every time I generate a PDF.

I used Tempfile so there is no overlap.

@abrom
Contributor

abrom commented Feb 17, 2025

Sure, seems like a reasonable solution. So are you using prawn and grover? Or were you looking for an alternative that would negate the need for the combo?

@map7
Author

map7 commented Feb 17, 2025

I'm currently using prawn and grover. It would be nice to have the functionality inbuilt into grover, rather than introducing more dependencies.

Here is how I currently do it:


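  require 'pdf-reader'
  require 'combine_pdf'
  require 'prawn'
  require 'tempfile'

  # Table of Contents
  # Find all the headings and record the (1-based) page each first appears on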
  def extract_headings_and_pages(pdf_file)
    headings = []
    reader = PDF::Reader.new(pdf_file)
    
    reader.pages.each_with_index do |page, index|
      page.text.each_line do |line|

        # Headings to look for in the PDF to create dynamic TOC
        if line =~ /^HEADING1/ ||
            line =~ /HEADING2/ ||
            line =~ /HEADING3/

          # If the heading doesn't already exist in the TOC, then add it
          title = line.strip
          if headings.none? { |h| h[:title] == title }
            headings << { title: title, page: index + 1 }
          end
        end
      end
    end
    headings
  end

  # Creating the Table of Contents page with text in the middle
  def create_toc_page(headings,period)
    toc_file = Tempfile.new('toc_file')
    
    Prawn::Document.generate(toc_file.path, page_size: "A4") do

      # Set the left margin and the right margin for a neat layout
      left_margin =  20
      right_margin = 70

      # [x(from left),y(from bottom)]
      image "public/images/logo.png", at: [482,752], height: 58

      # Add heading
      move_down 70
      draw_text "CONTENTS", size: 16, at: [left_margin, cursor]
      move_down 40

      headings.each do |heading|
        # Draw the heading title at the left margin
        draw_text heading[:title], size: 12, at: [left_margin, cursor]

        # Draw the page number on the same line, towards the right edge
        draw_text heading[:page].to_s, size: 12, at: [bounds.width - right_margin, cursor]

        move_down 20  # Space between TOC entries
      end
    end

    toc_file
  end

  def merge_toc_with_existing_pdf(toc_pdf, original_pdf, output_pdf)
    toc = CombinePDF.load(toc_pdf)
    original = CombinePDF.load(original_pdf)
    pdf = CombinePDF.new

    # Insert the TOC as the second page of the original PDF (or wherever you'd like)
    original.pages.each_with_index do |page,index|
      pdf << toc.pages[0] if index == 1 # Add table of contents on second page
      pdf << page                       # Add the rest of the pages
    end
    
    # Save the final PDF
    pdf.save(output_pdf)
  end

  def add_toc(file, period)
    headings = extract_headings_and_pages(file)
    toc_file = create_toc_page(headings, period)

    output_file = Tempfile.new('output_with_toc')
    merge_toc_with_existing_pdf(toc_file.path, file.path, output_file.path)

    # puts output_file.path
    output_file
  end
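For anyone wiring this up to Grover, a small sketch of how the PDF bytes could be handed to the methods above. add_toc expects an object that responds to path, so the bytes are written to a Tempfile first. The helper name pdf_to_tempfile is made up for illustration:

```ruby
require 'tempfile'

# Hypothetical helper: writes raw PDF bytes (e.g. from Grover#to_pdf) to a
# Tempfile with a .pdf extension and returns the file, rewound for reading.
def pdf_to_tempfile(pdf_bytes, basename: 'grover_output')
  file = Tempfile.new([basename, '.pdf'])
  file.binmode          # PDFs are binary; avoid any newline translation
  file.write(pdf_bytes)
  file.flush            # make sure readers that open by path see the data
  file.rewind
  file
end

# Usage, assuming the grover gem and the methods above are loaded:
#   source   = pdf_to_tempfile(Grover.new(html).to_pdf)
#   with_toc = add_toc(source, period)
```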

@abrom
Contributor

abrom commented Feb 18, 2025

You could potentially create the TOC page by rendering your TOC to HTML then passing that into grover?
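A minimal sketch of that idea, assuming the same array of `{ title:, page: }` hashes produced by extract_headings_and_pages. The method names and markup are illustrative only:

```ruby
require 'cgi'

# Hypothetical helper: builds a standalone HTML page listing each heading
# with its page number right-aligned.
def toc_html(headings, title: 'CONTENTS')
  rows = headings.map do |h|
    "<tr><td>#{CGI.escapeHTML(h[:title])}</td><td class=\"page\">#{h[:page].to_i}</td></tr>"
  end

  <<~HTML
    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="utf-8">
        <style>
          table { width: 100%; border-collapse: collapse; }
          td.page { text-align: right; }
        </style>
      </head>
      <body>
        <h1>#{CGI.escapeHTML(title)}</h1>
        <table>
          #{rows.join("\n")}
        </table>
      </body>
    </html>
  HTML
end

# Hypothetical helper: renders the TOC page to PDF bytes with Grover,
# which would replace the prawn step.
def toc_pdf(headings)
  require 'grover' # loaded lazily; needs the grover gem plus Node/Puppeteer
  Grover.new(toc_html(headings), format: 'A4').to_pdf
end
```

The page numbers would still need to come from a first pass over the main PDF, so this would only remove the prawn dependency, not pdf-reader or combine_pdf.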

It might be useful to create a "how to" document linked from the README that suggests how this problem can be solved. My concern with adding this as a built-in feature is that it seems like it would require domain knowledge of the document being processed, e.g. your example code has this guard:

line =~ /^HEADING1/ or
line =~ /HEADING2/ or
line =~ /HEADING3/

It could end up making things quite complicated if such cases needed to be handled within Grover.
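To illustrate, any shared helper would probably need the caller to supply the heading patterns. A sketch with a hypothetical method name, matching lines the same way the guard above does:

```ruby
# Hypothetical helper: returns the unique, stripped lines that match any of
# the caller-supplied patterns, in the order they first appear.
def matching_headings(lines, patterns)
  matcher = Regexp.union(patterns)
  lines.select { |line| line.match?(matcher) }.map(&:strip).uniq
end

# Usage with the patterns from the guard above:
#   matching_headings(page.text.lines, [/^HEADING1/, /HEADING2/, /HEADING3/])
```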

Also, if Grover were to include such a feature, I don't see that it would reduce the dependency count, as it would likely want to use existing libraries (like PDF Reader and Combine PDF) for doing the "heavy lifting" anyway.

Thank you for posting the detail on how you solved the problem 😄

@kluzny

kluzny commented Mar 8, 2025

If we lived in star trek times, I can foresee an API like this.

grover = Grover.new(url, locals: { 
  "some-class": ->(controller,  page) { controller.some_method(page)  }
})

This would let us call arbitrary methods/view helpers at runtime on the controller that map to css classes like key/value pairs

With something like this you could build arbitrary features like a TOC, or hiding page numbers on first/last pages; the sky is the limit.

@abrom
Contributor

abrom commented Mar 9, 2025

I'm struggling to see what you're getting at here. What is "controller" in this case? And what is "page"? Grover is simply rendering a URL or some HTML in a browser using Chrome. Where is it you are thinking the controller or page would be coming back through to "Ruby land" in this example? Your code makes it seem like you'd want it to somehow escape out of the Chrome render engine to do that! That might be possible, but it is waaaay outside the scope of what Grover is for and can/will do.
