Add optional SDK image asset export for rendered and embedded PDF images#174

Open
VooDisss wants to merge 4 commits into zai-org:main from VooDisss:image-asset-export
@VooDisss
Contributor

Fixes: #173

Summary

This PR builds on the branch changes introduced in PR #171. It adds optional SDK-owned image asset export for PP-DocLayoutV3 image regions.

This change introduces:

  • imgs_rendered/ as the canonical rendered image asset directory
  • optional imgs_embedded/ extraction for geometrically matched embedded PDF images
  • a configurable Markdown preference (markdown_image_preference) between embedded and rendered assets
  • stable image block metadata describing selected and available image assets

Behavior

Rendered-only mode

When enable_image_asset_export=false:

  • image regions are still exported to imgs_rendered/
  • no embedded extraction is attempted
  • image blocks expose:
    • image_path
    • rendered_image_path
    • embedded_image_path: null
    • image_asset_source: "rendered"

Embedded-enabled mode

When enable_image_asset_export=true:

  • rendered image assets are still created
  • the SDK inspects embedded PDF images via PyMuPDF
  • matched embedded assets are saved to imgs_embedded/
  • image_path follows markdown_image_preference
  • embedded_image_path is populated only when a match exists

Matching

Embedded matching is based on:

  • same-page filtering
  • bbox geometry
  • IoU threshold
  • containment threshold
  • aspect-ratio plausibility
  • one-to-one assignment

It deliberately avoids center-distance-only matching.
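
The criteria above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the function names, the threshold values, the aspect-ratio tolerance, the choice to require both the IoU and containment thresholds, and the greedy best-score-first assignment are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def containment(region: Box, embedded: Box) -> float:
    """Fraction of the embedded image's area that lies inside the layout region."""
    e = area(embedded)
    return intersection(region, embedded) / e if e > 0 else 0.0


def aspect_ok(a: Box, b: Box, tol: float = 0.25) -> bool:
    """Reject pairs whose width/height ratios differ by more than `tol` (relative)."""
    wa, ha, wb, hb = a[2] - a[0], a[3] - a[1], b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return False
    ra, rb = wa / ha, wb / hb
    return abs(ra - rb) / max(ra, rb) <= tol


def match_regions(
    regions: List[Tuple[int, Box]],   # (page index, layout region bbox)
    embedded: List[Tuple[int, Box]],  # (page index, embedded image bbox)
    min_iou: float = 0.5,             # hypothetical threshold
    min_containment: float = 0.8,     # hypothetical threshold
) -> Dict[int, int]:
    """Return {region index: embedded index}, each index used at most once."""
    candidates = []
    for i, (rpage, rbox) in enumerate(regions):
        for j, (epage, ebox) in enumerate(embedded):
            if rpage != epage or not aspect_ok(rbox, ebox):
                continue
            score = iou(rbox, ebox)
            if score >= min_iou and containment(rbox, ebox) >= min_containment:
                candidates.append((score, i, j))
    # Greedy one-to-one assignment: best-scoring pairs claim their indices first.
    matches: Dict[int, int] = {}
    used = set()
    for _, i, j in sorted(candidates, reverse=True):
        if i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    return matches
```

Because every candidate must overlap its region substantially, two images whose centers happen to be close but whose extents differ are rejected, which is the failure mode that center-distance-only matching would allow.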

Output contract

Image blocks now expose:

{
  "image_path": "...",
  "rendered_image_path": "...",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}

or, when an embedded match exists and embedded is selected:

{
  "image_path": "imgs_embedded/...",
  "rendered_image_path": "imgs_rendered/...",
  "embedded_image_path": "imgs_embedded/...",
  "image_asset_source": "embedded"
}
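
A minimal sketch of how the four fields could be derived from the two candidate paths and markdown_image_preference. The helper name `build_image_block` and the fallback to the other asset when the preferred one is missing are assumptions for illustration; only the field names and the two preference values come from this PR.

```python
from typing import Dict, Optional


def build_image_block(
    rendered_path: Optional[str],
    embedded_path: Optional[str],
    preference: str = "embedded",
) -> Dict[str, Optional[str]]:
    """Hypothetical helper: pick the selected asset and report its source."""
    if preference not in ("embedded", "rendered"):
        raise ValueError(f"unsupported preference: {preference!r}")
    available = {"rendered": rendered_path, "embedded": embedded_path}
    fallback = "rendered" if preference == "embedded" else "embedded"
    # Use the preferred asset when it exists, otherwise whichever one does.
    source = next((s for s in (preference, fallback) if available[s] is not None), None)
    return {
        "image_path": available[source] if source else None,
        "rendered_image_path": rendered_path,
        "embedded_image_path": embedded_path,
        "image_asset_source": source,
    }
```

With no embedded match, `build_image_block("imgs_rendered/a.png", None)` selects the rendered asset even though the preference is "embedded" (the file name here is made up).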

Failure behavior

The implementation explicitly avoids stale asset advertisement:

  • no nonexistent rendered asset should remain in image_path
  • no nonexistent rendered asset should remain in rendered_image_path
  • stale markdown image references are removed when no asset survives
  • if a valid embedded asset survives recovery, it remains the selected asset
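
The rules above amount to checking every advertised path against the files that actually exist. The sketch below illustrates that idea under assumptions: the helper names, the Markdown image regex, and the re-selection order are hypothetical and not taken from the SDK.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Matches Markdown image references such as ![alt](imgs_rendered/x.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def drop_stale_image_refs(markdown: str, output_dir: Path) -> str:
    """Remove image references whose target file is missing under output_dir."""
    def keep_or_drop(match: "re.Match[str]") -> str:
        return match.group(0) if (output_dir / match.group(1)).is_file() else ""
    return IMAGE_REF.sub(keep_or_drop, markdown)


def scrub_image_block(block: Dict[str, Optional[str]], output_dir: Path) -> Dict[str, Optional[str]]:
    """Null out paths to missing files, then re-select a surviving asset if needed."""
    cleaned = dict(block)
    for key in ("rendered_image_path", "embedded_image_path"):
        path = cleaned.get(key)
        if path is not None and not (output_dir / path).is_file():
            cleaned[key] = None
    selected = cleaned.get("image_path")
    if selected is None or not (output_dir / selected).is_file():
        # The selected asset is gone: fall back to whatever survived.
        if cleaned.get("embedded_image_path") is not None:
            cleaned["image_path"] = cleaned["embedded_image_path"]
            cleaned["image_asset_source"] = "embedded"
        elif cleaned.get("rendered_image_path") is not None:
            cleaned["image_path"] = cleaned["rendered_image_path"]
            cleaned["image_asset_source"] = "rendered"
        else:
            cleaned["image_path"] = None
            cleaned["image_asset_source"] = None
    return cleaned
```

A selected asset that still exists on disk is left untouched, so a valid embedded asset stays selected after recovery.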

Validation

  • focused regression tests cover:

    • rendered-only default behavior
    • embedded match preference
    • rendered preference
    • preservation-mode recovery
    • missing rendered-key recovery
    • no-render-pages behavior
    • crop-failure behavior
    • rendered-origin stale markdown cleanup
    • embedded-origin stale markdown cleanup
    • nested asset persistence in saved output
  • real validation confirmed:

    • imgs_rendered/ created in rendered-only mode
    • imgs_rendered/ + imgs_embedded/ created in embedded-enabled mode
    • markdown prefers embedded assets when configured and matched

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata.

Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.

Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').

The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.

Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets.

Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.