Add optional SDK image asset export for rendered and embedded PDF images #174
Open
VooDisss wants to merge 4 commits into zai-org:main
Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata. Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').
The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.
Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets. Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.
Fixes: #173
Summary
This PR builds on the branch changes introduced in PR #171:
Add optional SDK-owned image asset export for PP-DocLayoutV3 image regions.
This change introduces:
- `imgs_rendered/` as the canonical rendered image asset directory
- `imgs_embedded/` extraction for geometrically matched embedded PDF images
- Markdown image selection between `embedded` and `rendered` (`markdown_image_preference`)
Behavior
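Rendered-only mode
When `enable_image_asset_export=false`:
- rendered region assets are written to `imgs_rendered/`
- `image_path` and `rendered_image_path` point to the rendered asset
- `embedded_image_path: null`
- `image_asset_source: "rendered"`
Embedded-enabled mode
When `enable_image_asset_export=true`:
- matched embedded assets are additionally written to `imgs_embedded/`
- `image_path` follows `markdown_image_preference`
- `embedded_image_path` is populated only when a match exists
Matching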
Embedded matching is based on:
- same-page geometry
- IoU
- containment
- aspect-ratio plausibility
- one-to-one assignment
It deliberately avoids center-distance-only matching.
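The matching criteria above can be illustrated with a minimal sketch. This is not the PR's implementation: the function names, the thresholds (`min_iou`, `min_containment`, `tolerance`), and the greedy assignment strategy are all assumptions chosen for illustration. Boxes are assumed to be `(x0, y0, x1, y1)` tuples already restricted to a single page.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), same page, same coordinate space


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)  # area() clamps negative extents to zero


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def containment(inner: Box, outer: Box) -> float:
    """Fraction of `inner` that lies inside `outer`."""
    inner_area = area(inner)
    return intersection(inner, outer) / inner_area if inner_area > 0 else 0.0


def aspect_ratio_plausible(a: Box, b: Box, tolerance: float = 0.25) -> bool:
    wa, ha, wb, hb = a[2] - a[0], a[3] - a[1], b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return False
    ra, rb = wa / ha, wb / hb
    return abs(ra - rb) / max(ra, rb) <= tolerance


def match_regions(
    regions: Dict[str, Box],
    embedded: Dict[str, Box],
    min_iou: float = 0.5,          # hypothetical threshold
    min_containment: float = 0.9,  # hypothetical threshold
) -> Dict[str, str]:
    """Greedy one-to-one assignment of embedded images to layout regions."""
    candidates = []
    for region_id, region_box in regions.items():
        for image_id, image_box in embedded.items():
            if not aspect_ratio_plausible(region_box, image_box):
                continue
            overlap = iou(region_box, image_box)
            if overlap >= min_iou or containment(image_box, region_box) >= min_containment:
                candidates.append((overlap, region_id, image_id))

    matched: Dict[str, str] = {}
    used_images = set()
    # Best overlap first; each region and each embedded image is used at most once.
    for _, region_id, image_id in sorted(candidates, reverse=True):
        if region_id in matched or image_id in used_images:
            continue
        matched[region_id] = image_id
        used_images.add(image_id)
    return matched
```

Because candidates are ranked by overlap rather than by center distance, two regions with nearby centers cannot both claim the same embedded image; the weaker candidate is simply left unmatched and would keep only its rendered asset.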
Output contract
Image blocks now expose:
{ "image_path": "...", "rendered_image_path": "...", "embedded_image_path": null, "image_asset_source": "rendered" }
or, when an embedded match exists and embedded is selected:
{ "image_path": "imgs_embedded/...", "rendered_image_path": "imgs_rendered/...", "embedded_image_path": "imgs_embedded/...", "image_asset_source": "embedded" }
Failure behavior
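The implementation explicitly avoids stale asset advertisement. If a rendered asset cannot actually be preserved or regenerated, the final JSON and Markdown do not continue to reference it. The affected fields are:
- `image_path`
- `rendered_image_path`
Validation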
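Focused regression tests cover:
- rendered-only mode
- embedded preference
- rendered preference
- nested asset persistence
- preservation misses
- no-render-pages recovery
- crop-failure recovery
- rendered-origin and embedded-origin stale-markdown cleanup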
Real validation confirmed:
- `imgs_rendered/` created in rendered-only mode
- `imgs_rendered/` + `imgs_embedded/` created in embedded-enabled mode
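For reviewers, the output contract and failure behavior described above can be sketched roughly as follows. This is a hedged illustration, not the SDK's code: `build_image_block` and `drop_stale_image_refs` are hypothetical helpers, and two details are assumptions not stated in this PR: falling back to the other asset when the preferred one is missing, and setting `image_asset_source` to `None` when no asset exists on disk.

```python
import re
from pathlib import Path
from typing import Any, Dict, Optional

# Matches Markdown image references such as ![alt](imgs_rendered/x.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def _existing(output_dir: Path, rel_path: Optional[str]) -> Optional[str]:
    """Return rel_path only if the file is really present under output_dir."""
    if rel_path and (output_dir / rel_path).is_file():
        return rel_path
    return None


def build_image_block(
    output_dir: Path,
    rendered_rel: Optional[str],
    embedded_rel: Optional[str],
    preference: str = "embedded",
) -> Dict[str, Any]:
    """Build the four image block fields without advertising missing files."""
    by_source = {
        "rendered": _existing(output_dir, rendered_rel),
        "embedded": _existing(output_dir, embedded_rel),
    }
    # Assumption: the non-preferred asset is used as a fallback.
    order = ["embedded", "rendered"] if preference == "embedded" else ["rendered", "embedded"]
    source = next((name for name in order if by_source[name] is not None), None)
    return {
        "image_path": by_source[source] if source else None,
        "rendered_image_path": by_source["rendered"],
        "embedded_image_path": by_source["embedded"],
        "image_asset_source": source,
    }


def drop_stale_image_refs(markdown: str, output_dir: Path) -> str:
    """Remove Markdown image references whose target file does not exist."""

    def keep_if_present(match: "re.Match[str]") -> str:
        return match.group(0) if (output_dir / match.group(1)).is_file() else ""

    return IMAGE_REF.sub(keep_if_present, markdown)
```

Checking the filesystem at the last step, rather than trusting paths computed earlier in the pipeline, is one simple way to get the "never advertise a stale asset" property regardless of which upstream branch failed.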