inamdarzaid

Multi-Page PDF Generation Implementation

Overview

This implementation adds support for generating multi-page PDFs from table data in Superset reports, replacing screenshot-based PDFs with HTML-to-PDF conversion using WeasyPrint.

Changes Made

1. Added WeasyPrint Dependency

File: requirements/base.in

  • Added weasyprint>=61.0 as a new dependency for HTML-to-PDF conversion

2. Enhanced PDF Utility Functions

File: superset/utils/pdf.py

Added new functions for HTML-to-PDF conversion:

  • generate_table_html(): Converts pandas DataFrame to properly formatted HTML

    • Includes CSS for multi-page layout
    • Adds page headers, footers, and page numbering
    • Ensures table headers repeat on each page
    • Provides professional styling
  • build_pdf_from_html(): Converts HTML to PDF using WeasyPrint

    • Handles WeasyPrint errors gracefully
    • Returns PDF as bytes
  • build_pdf_from_dataframe(): Complete workflow for DataFrame to PDF

    • Combines HTML generation and PDF conversion
    • Accepts title and description parameters
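A minimal sketch of the HTML-generation step, under stated assumptions: it takes plain column names and rows instead of a pandas DataFrame so that it runs on the standard library alone, and the names `table_html_from_rows` and `PAGE_CSS` are illustrative, not the PR's actual code.

```python
from html import escape
from typing import Iterable, Sequence

# Print CSS mirroring the rules listed in this description.
PAGE_CSS = """
@page { size: A4; margin: 2cm 1.5cm; }
.data-table thead { display: table-header-group; }
.data-table tbody tr { page-break-inside: avoid; }
"""


def table_html_from_rows(
    columns: Sequence[str],
    rows: Iterable[Sequence[object]],
    title: str = "",
) -> str:
    """Render rows as an HTML document whose <thead> can repeat per page."""
    # Every header and cell is escaped so data cannot inject markup.
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PAGE_CSS}</style></head><body>"
        f"<h1>{escape(title)}</h1>"
        f"<table class='data-table'><thead><tr>{head}</tr></thead>"
        f"<tbody>{body}</tbody></table></body></html>"
    )


html_doc = table_html_from_rows(["name", "note"], [["a", "<b>x</b>"]], title="Demo")
```

Putting the header row in an explicit `<thead>` is what lets the `table-header-group` rule repeat it on each page.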

3. Modified Report Execution Logic

File: superset/commands/report/execute.py

Enhanced the _get_pdf() method:

  • Smart Detection: Checks if the chart is a table type (table, pivot_table, pivot_table_v2)
  • Full Data Access: Uses _get_embedded_data() to get complete dataset (not just visible data)
  • Multi-page Support: Generates PDF from full data using HTML conversion
  • Graceful Fallback: Falls back to screenshot-based PDF if data-based generation fails

Key Features

1. Complete Data Export

  • Uses embedded_data which contains the full dataset from ChartDataResultFormat.JSON
  • Not limited to currently visible/paginated data in the UI
  • Includes all rows regardless of frontend pagination

2. Professional Multi-Page Layout

  • Page Headers: Repeat table headers on every page
  • Page Breaks: Intelligent page breaking to avoid splitting rows
  • Page Numbers: Automatic page numbering in footer
  • Styling: Professional table formatting with alternating row colors

3. CSS Features for PDF

@page {
    size: A4;
    margin: 2cm 1.5cm;
    @bottom-center {
        content: "Page " counter(page) " of " counter(pages);
    }
}

/* Table headers repeat on each page */
.data-table thead {
    display: table-header-group;
}

/* Prevent row breaks across pages */
.data-table tbody tr {
    page-break-inside: avoid;
}

4. Error Handling

  • Gracefully handles missing WeasyPrint installation
  • Falls back to screenshot-based PDF generation on errors
  • Provides detailed error logging
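The error handling listed above could look roughly like the following. This is a sketch, not the PR's code: the function name `html_to_pdf_or_none` is hypothetical, and returning `None` to signal "fall back to the screenshot path" is an assumed convention.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def html_to_pdf_or_none(html: str) -> Optional[bytes]:
    """Return PDF bytes, or None so the caller can fall back to a screenshot."""
    try:
        # Imported lazily: a missing package raises ImportError, and missing
        # system libraries can surface as OSError when the module loads.
        from weasyprint import HTML
    except (ImportError, OSError):
        logger.warning("WeasyPrint unavailable; falling back", exc_info=True)
        return None
    try:
        # write_pdf() called without a target returns the document as bytes.
        return HTML(string=html).write_pdf()
    except Exception:  # rendering errors should not break report delivery
        logger.exception("HTML-to-PDF conversion failed")
        return None


pdf = html_to_pdf_or_none("<p>hello</p>")
```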

Usage Flow

  1. Report Generation Request: User requests PDF report for a table chart
  2. Chart Type Detection: System checks if chart is table-based
  3. Data Retrieval: _get_embedded_data() fetches complete dataset as DataFrame
  4. HTML Generation: DataFrame converted to HTML with multi-page CSS
  5. PDF Conversion: WeasyPrint converts HTML to multi-page PDF
  6. Fallback: If any step fails, falls back to screenshot-based PDF
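The flow above can be sketched as a small dispatcher. The two callables stand in for the data-based and screenshot-based generators; `get_pdf` and `TABLE_VIZ_TYPES` are illustrative names, and only the three chart types come from the description.

```python
from typing import Callable, Optional

TABLE_VIZ_TYPES = {"table", "pivot_table", "pivot_table_v2"}


def get_pdf(
    viz_type: str,
    data_pdf: Callable[[], Optional[bytes]],
    screenshot_pdf: Callable[[], bytes],
) -> bytes:
    """Prefer the data-based PDF for table charts; otherwise use screenshots."""
    if viz_type in TABLE_VIZ_TYPES:
        try:
            pdf = data_pdf()
            if pdf:
                return pdf
        except Exception:
            pass  # a real implementation would log the failure here
    return screenshot_pdf()


def failing() -> Optional[bytes]:
    raise RuntimeError("no data")


result_table = get_pdf("table", lambda: b"%PDF-data", lambda: b"%PDF-shot")
result_failed = get_pdf("table", failing, lambda: b"%PDF-shot")
result_chart = get_pdf("pie", lambda: b"%PDF-data", lambda: b"%PDF-shot")
```

Keeping the fallback in one place means a failure in any data-based step ends in the same screenshot path that non-table charts already use.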

Benefits

Before (Screenshot-based)

  • ❌ Limited to visible data only
  • ❌ Single page screenshots stitched together
  • ❌ Poor text quality (image-based)
  • ❌ Large file sizes
  • ❌ No searchable text

After (HTML-to-PDF)

  • ✅ Complete dataset included
  • ✅ True multi-page layout
  • ✅ High-quality text rendering
  • ✅ Smaller file sizes
  • ✅ Searchable PDF content
  • ✅ Professional page headers/footers
  • ✅ Proper page breaking

Installation Requirements

After these changes, you'll need to:

  1. Install WeasyPrint: Run pip install "weasyprint>=61.0" (quoted so the shell does not treat >= as a redirect) or use the updated requirements
  2. System Dependencies: WeasyPrint requires system-level libraries such as Pango; the exact packages vary by OS

Compatibility

  • Backward Compatible: Existing screenshot-based PDF generation remains as fallback
  • Chart Types: Currently enabled for table, pivot_table, and pivot_table_v2 charts
  • Other Charts: Non-table charts continue using screenshot-based PDF generation

Future Enhancements

Potential improvements:

  • Extend to other chart types with tabular data
  • Add configuration options for PDF styling
  • Support for custom page layouts
  • Chart embedding alongside table data

This commit introduces functionality to export multi-page PDF reports for charts of the table type.

Key changes include:

1.  **PDF Generation Library:**
    *   WeasyPrint is used for converting HTML and CSS to PDF when the `PLAYWRIGHT_REPORTS_AND_THUMBNAILS` feature flag is false (the flag was found to be false in the environment used during implementation).

2.  **EmailNotification Enhancement (`superset/reports/notifications/email.py`):**
    *   The `_get_content` method in the `EmailNotification` class now checks if the report is for a table and if the requested format is PDF.
    *   If so, it generates a PDF using WeasyPrint.
    *   The generated PDF includes:
        *   The full table data (verified to be fetched completely).
        *   Report description (typically includes chart title).
        *   Pagination for large tables.
        *   Customizable headers and footers.

3.  **Configuration Options (`superset/config.py`):**
    *   I added new configuration options to customize PDF exports:
        *   `PDF_EXPORT_HEADERS_FOOTERS_ENABLED` (boolean): To enable/disable headers/footers.
        *   `PDF_EXPORT_HEADER_TEMPLATE` (string): Template for PDF headers. Placeholders: `{report_name}`, `{page_number}`, `{total_pages}`.
        *   `PDF_EXPORT_FOOTER_TEMPLATE` (string): Template for PDF footers. Placeholders: `{generation_date}`, `{report_name}`.
        *   `PDF_EXPORT_PAGE_SIZE` (string): Default page size (e.g., "A4", "Letter").
        *   `PDF_EXPORT_ORIENTATION` (string): Default page orientation (e.g., "portrait", "landscape").
    *   These configurations are integrated into the PDF generation logic in `EmailNotification`.

4.  **Testing (`tests/unit_tests/reports/notifications/email_tests.py`):**
    *   I added comprehensive unit tests for the new PDF generation functionality.
    *   Tests cover various scenarios, including:
        *   Conditional PDF generation.
        *   Correctness of HTML and CSS passed to WeasyPrint.
        *   Header/footer rendering based on configuration templates and enabled status.
        *   Application of page size and orientation.
        *   Fallback to standard HTML email for non-PDF formats.
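Because page numbers are only known at layout time, one way to wire the header and footer templates into WeasyPrint is to translate them into CSS `content` values. The sketch below assumes that approach; this conversation does not show how the PR does it, and `template_to_css_content` and `page_rule` are hypothetical helpers.

```python
import re
from typing import Mapping

# Placeholders that only the layout engine can resolve become CSS counters.
_COUNTERS = {"page_number": "counter(page)", "total_pages": "counter(pages)"}


def _css_string(text: str) -> str:
    """Quote text as a CSS string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\A ")
    return f'"{escaped}"'


def template_to_css_content(template: str, values: Mapping[str, str]) -> str:
    """Turn 'Page {page_number}' into '"Page " counter(page)'."""
    parts = []
    for token in re.split(r"(\{\w+\})", template):
        if not token:
            continue
        name = token[1:-1] if re.fullmatch(r"\{\w+\}", token) else None
        if name in _COUNTERS:
            parts.append(_COUNTERS[name])
        elif name in values:
            parts.append(_css_string(str(values[name])))
        else:
            # Literal text, or an unknown placeholder kept verbatim.
            parts.append(_css_string(token))
    return " ".join(parts)


def page_rule(size: str, orientation: str, header: str, footer: str) -> str:
    """Build an @page rule with header and footer margin boxes."""
    return (
        f"@page {{ size: {size} {orientation}; "
        f"@top-center {{ content: {header}; }} "
        f"@bottom-center {{ content: {footer}; }} }}"
    )


header = template_to_css_content(
    "Report: {report_name} - Page {page_number} of {total_pages}",
    {"report_name": "Sales"},
)
footer = template_to_css_content(
    "Generated: {generation_date}", {"generation_date": "2025-01-01"}
)
css = page_rule("A4", "portrait", header, footer)
```

The report name and date here are example values only.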

This enhancement allows you to receive detailed, multi-page PDF versions of your table-based reports via email, complete with proper layout and metadata.

This fix addresses an issue where PDF reports for table charts were being generated as screenshots instead of multi-page PDFs based on the full dataset.

The root cause was that the report execution logic did not correctly handle data preparation for PDF chart reports. It was defaulting to a screenshot-based PDF generation for all PDF reports.

The following changes were made:
- `superset/reports/notifications/base.py`: Added a `report_format` attribute to the `NotificationContent` dataclass. This allows the report format to be passed to the notification handlers.
- `superset/commands/report/execute.py`: Modified the `_get_notification_content` method to:
  - Fetch the full dataset as a DataFrame (`embedded_data`) for chart reports with the PDF format.
  - Continue using screenshot-based PDF generation for dashboard reports.
  - Pass the `report_format` to the `NotificationContent` object.

These changes ensure that for chart reports, the `EmailNotification` handler receives the necessary data and report format to trigger the existing WeasyPrint logic, which correctly generates a multi-page PDF from the full dataset.
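To illustrate the hand-off described here, the sketch below uses a simplified stand-in for `NotificationContent`. The real dataclass has more fields, and both the `"PDF"` string value and the helper `wants_data_pdf` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class NotificationContentSketch:
    """Simplified stand-in for NotificationContent (assumed shape)."""

    name: str
    report_format: Optional[str] = None  # assumed value for PDF reports: "PDF"
    embedded_data: Optional[Any] = None  # full dataset for chart reports


def wants_data_pdf(content: NotificationContentSketch) -> bool:
    """True when the handler has both the PDF format and the data to render."""
    return content.report_format == "PDF" and content.embedded_data is not None


chart_report = NotificationContentSketch(
    name="Sales", report_format="PDF", embedded_data=[{"region": "EU", "total": 3}]
)
dashboard_report = NotificationContentSketch(name="Overview", report_format="PDF")
```

A dashboard report carries no `embedded_data`, so it stays on the screenshot path even when the format is PDF.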
@dosubot (bot) added the labels change:backend (Requires changing the backend) and viz:charts:table (Related to the Table chart) on Sep 4, 2025

@korbit-ai korbit-ai bot left a comment

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category: Security
Issue: HTML escaping disabled in DataFrame rendering
Status: ✅ Fix detected

Files scanned

  • superset/reports/notifications/base.py
  • superset/reports/notifications/email.py
  • superset/commands/report/execute.py
  • superset/config.py

<body>
<div class="report-description">{description}</div>
<br>
{df.to_html(na_rep="", index=True, escape=False)}

This comment was marked as resolved.

Member

This might be especially important with WeasyPrint; see
https://doc.courtbouillon.org/weasyprint/stable/first_steps.html#python-library

The first warning there, and the security section of their docs, both mention this.

Good reference. The WeasyPrint docs emphasize HTML injection risks even further. Both DataFrame HTML and WeasyPrint HTML parsing need to be secured. We should:

  1. Keep escape=True in DataFrame.to_html()
  2. Add additional HTML sanitization before WeasyPrint processing
  3. Consider WeasyPrint's URL fetching settings to prevent local file access
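The first and third points above can be sketched as follows. This assumes the `url_fetcher` argument of `weasyprint.HTML` accepts a plain callable, which should be checked against the WeasyPrint version in use; `deny_all_url_fetcher` and `render_untrusted` are hypothetical names.

```python
from html import escape


def deny_all_url_fetcher(url: str, *args: object, **kwargs: object) -> dict:
    """Refuse every fetch so untrusted HTML cannot pull local files or URLs."""
    raise ValueError(f"URL fetching is disabled for report PDFs: {url}")


def render_untrusted(description: str, table_html: str) -> bytes:
    """Render a PDF; table_html is assumed to come from to_html(escape=True)."""
    from weasyprint import HTML  # lazy import keeps this module importable

    # The free-text description is escaped before it is placed in the page.
    page = f"<body><div>{escape(description)}</div>{table_html}</body>"
    return HTML(string=page, url_fetcher=deny_all_url_fetcher).write_pdf()


escaped_description = escape("<img src='file:///etc/passwd'>")
```

With a fetcher that always raises, an injected `<img>` or stylesheet link fails to load instead of reading from disk or the network.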

Could you please clarify your response? We're discussing important security considerations around HTML escaping in WeasyPrint and DataFrame rendering - what specific aspect are you commenting on?

2

Here's how to fix the security issue:

Find this line in the PDF generation section:

df.to_html(na_rep="", index=True, escape=False)

Change it to:

df.to_html(na_rep="", index=True, escape=True)

you need help updating

You need to update 2 sections:

  1. In the PDF generation section (around line 190), change:
     df.to_html(na_rep="", index=True, escape=False)
     to:
     df.to_html(na_rep="", index=True, escape=True)
  2. Add HTML sanitization after that:
     html_table = nh3.clean(df_html, tags=TABLE_TAGS, attributes=ALLOWED_TABLE_ATTRIBUTES)

Let me know if you need help with this.

Y

Here are the exact steps to make the security fixes:

  1. Open superset/reports/notifications/email.py
  2. Go to line 190 (PDF section)
  3. Replace this line:
    df.to_html(na_rep="", index=True, escape=False)
    with:
    df.to_html(na_rep="", index=True, escape=True)

Need me to explain any of these steps?

Contributor

@bito-code-review bito-code-review bot left a comment

Code Review Agent Run #28dabd

Actionable Suggestions - 1
  • tests/unit_tests/reports/notifications/email_tests.py - 1
    • Test logic error in HTML escaping validation · Line 272-274
Review Details
  • Files reviewed - 5 · Commit Range: 583b903..dd59943
    • superset/commands/report/execute.py
    • superset/config.py
    • superset/reports/notifications/base.py
    • superset/reports/notifications/email.py
    • tests/unit_tests/reports/notifications/email_tests.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful

Comment on lines +272 to +274
# Check that pandas escapes HTML by default
mock_content.embedded_data = pd.DataFrame({'col1': ['<script>alert(1)</script>']})
email_content_result_escaped = notification._get_content()
Contributor

Test logic error in HTML escaping validation

Test logic error: The test modifies mock_content.embedded_data after creating the EmailNotification instance but expects the second _get_content() call to use the new data. Create a new EmailNotification instance with the modified content to properly test HTML escaping.

Code suggestion (check the AI-generated fix before applying)

Current:
# Check that pandas escapes HTML by default
mock_content.embedded_data = pd.DataFrame({'col1': ['<script>alert(1)</script>']})
email_content_result_escaped = notification._get_content()

Suggested:
# Check that pandas escapes HTML by default
mock_content.embedded_data = pd.DataFrame({'col1': ['<script>alert(1)</script>']})
notification_escaped = EmailNotification(recipient=MagicMock(), content=mock_content)
email_content_result_escaped = notification_escaped._get_content()

Code Review Run #28dabd


@waelrimas566-png waelrimas566-png left a comment

CHANGELOG.md

@waelrimas566-png waelrimas566-png left a comment

@waelrimas566-png waelrimas566-png left a comment

@rusackas rusackas requested a review from kgabryje September 4, 2025 17:24
Member

rusackas commented Sep 4, 2025

Superset uses Git pre-commit hooks courtesy of pre-commit. To install run the following:

pip3 install -r requirements/development.txt
pre-commit install

A series of checks will now run when you make a git commit.

Alternatively, you can run pre-commit manually:

pre-commit run --all-files

@rusackas rusackas requested a review from eschutho September 5, 2025 17:24
Comment on lines +148 to +153
# Retrieve PDF export configurations
pdf_headers_footers_enabled = app.config.get("PDF_EXPORT_HEADERS_FOOTERS_ENABLED", True)
pdf_header_template = app.config.get("PDF_EXPORT_HEADER_TEMPLATE", "Report: {report_name} - Page {page_number} of {total_pages}")
pdf_footer_template = app.config.get("PDF_EXPORT_FOOTER_TEMPLATE", "Generated: {generation_date}")
pdf_page_size = app.config.get("PDF_EXPORT_PAGE_SIZE", "A4")
pdf_orientation = app.config.get("PDF_EXPORT_ORIENTATION", "portrait")
Member

The config file has defaults, so there's no need to set them again here.
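A small sketch of the point being made, using a plain dict as a stand-in for the Flask config. It assumes `superset/config.py` declares the same defaults as the fallback values in the snippet under review; `DEFAULT_CONFIG` and `load_config` are illustrative.

```python
# Stand-in for the defaults declared once in superset/config.py (assumed values).
DEFAULT_CONFIG = {
    "PDF_EXPORT_HEADERS_FOOTERS_ENABLED": True,
    "PDF_EXPORT_PAGE_SIZE": "A4",
    "PDF_EXPORT_ORIENTATION": "portrait",
}


def load_config(overrides: dict) -> dict:
    """Merge deployment overrides over the single set of defaults."""
    return {**DEFAULT_CONFIG, **overrides}


config = load_config({"PDF_EXPORT_ORIENTATION": "landscape"})

# Plain key access is enough: every key is guaranteed to exist, so the
# call site does not repeat (and cannot drift from) the defaults.
pdf_page_size = config["PDF_EXPORT_PAGE_SIZE"]
pdf_orientation = config["PDF_EXPORT_ORIENTATION"]
```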

Member

eschutho commented Sep 5, 2025

Thank you for the contribution @inamdarzaid!

Labels: change:backend (Requires changing the backend), size/L, viz:charts:table (Related to the Table chart)

5 participants