Add optional SDK image asset export for rendered and embedded PDF images by VooDisss · Pull Request #174 · zai-org/GLM-OCR

VooDisss · 2026-03-31T03:11:53Z

Fixes: #173

Summary

This PR builds on the branch changes introduced in PR #171:

Add printed page number metadata to SDK JSON output #171

Add optional SDK-owned image asset export for PP-DocLayoutV3 image regions.

This change introduces:

- imgs_rendered/ as the canonical rendered image asset directory
- optional imgs_embedded/ extraction for geometrically matched embedded PDF images
- configurable Markdown preference between embedded and rendered
- stable image block metadata describing selected and available image assets

Behavior

Rendered-only mode

When enable_image_asset_export=false:

- image regions are still exported to imgs_rendered/
- no embedded extraction is attempted
- image blocks expose:
  - image_path
  - rendered_image_path
  - embedded_image_path: null
  - image_asset_source: "rendered"

Embedded-enabled mode

When enable_image_asset_export=true:

- rendered image assets are still created
- the SDK inspects embedded PDF images via PyMuPDF
- matched embedded assets are saved to imgs_embedded/
- image_path follows markdown_image_preference
- embedded_image_path is populated only when a match exists
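
The selection rule in the two modes above can be sketched as follows. This is a minimal illustration, not the SDK's code: `select_image_asset` and `ImageAssetFields` are hypothetical names, and the behavior when neither asset exists (all fields `None`) is an assumption.

```python
from typing import Optional, TypedDict


class ImageAssetFields(TypedDict):
    image_path: Optional[str]
    rendered_image_path: Optional[str]
    embedded_image_path: Optional[str]
    image_asset_source: Optional[str]


def select_image_asset(
    rendered_path: Optional[str],
    embedded_path: Optional[str],
    preference: str,
) -> ImageAssetFields:
    """Choose the selected asset for one image block (hypothetical helper)."""
    if preference not in ("embedded", "rendered"):
        raise ValueError(f"unknown markdown_image_preference: {preference!r}")

    # Embedded is selected when it is preferred and a match exists,
    # or when it is the only asset available.
    if embedded_path is not None and (preference == "embedded" or rendered_path is None):
        selected, source = embedded_path, "embedded"
    elif rendered_path is not None:
        selected, source = rendered_path, "rendered"
    else:
        # Assumption: nothing to advertise, so every field stays None.
        selected, source = None, None

    return {
        "image_path": selected,
        "rendered_image_path": rendered_path,
        "embedded_image_path": embedded_path,
        "image_asset_source": source,
    }
```

With no embedded match the result always points at the rendered asset, regardless of the preference, which mirrors the rendered-only field values listed above.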

Matching

Embedded matching is based on:

- same-page filtering
- bbox geometry
- IoU threshold
- containment threshold
- aspect-ratio plausibility
- one-to-one assignment

It deliberately avoids center-distance-only matching.
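
A rough sketch of how these criteria could combine is shown below. The threshold values, the requirement that both IoU and containment pass, and the greedy highest-IoU-first assignment are all assumptions made for illustration; the PR does not state the exact values or the assignment algorithm. Boxes are assumed to already share one coordinate space.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def containment(a: Box, b: Box) -> float:
    """Fraction of the smaller box that lies inside the other box."""
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0


def aspect_ok(a: Box, b: Box, tolerance: float = 0.25) -> bool:
    wa, ha, wb, hb = a[2] - a[0], a[3] - a[1], b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return False
    ra, rb = wa / ha, wb / hb
    return abs(ra - rb) / max(ra, rb) <= tolerance


def match_regions(
    regions: List[Tuple[int, Box]],   # (page_index, layout image bbox)
    embedded: List[Tuple[int, Box]],  # (page_index, embedded image bbox)
    min_iou: float = 0.5,             # assumed threshold
    min_containment: float = 0.8,     # assumed threshold
) -> Dict[int, int]:
    """Return {region_index: embedded_index}, each index used at most once."""
    candidates = []
    for ri, (rpage, rbox) in enumerate(regions):
        for ei, (epage, ebox) in enumerate(embedded):
            if rpage != epage or not aspect_ok(rbox, ebox):
                continue
            score = iou(rbox, ebox)
            if score >= min_iou and containment(rbox, ebox) >= min_containment:
                candidates.append((score, ri, ei))

    # Greedy one-to-one assignment: best-overlapping pairs claim their indices first.
    assigned: Dict[int, int] = {}
    used = set()
    for _, ri, ei in sorted(candidates, reverse=True):
        if ri not in assigned and ei not in used:
            assigned[ri] = ei
            used.add(ei)
    return assigned
```

A small image centered inside a large region shares the region's center but has a very low IoU, so it is rejected here. That is the kind of false match a center-distance-only rule would accept.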

Output contract

Image blocks now expose:

{
  "image_path": "...",
  "rendered_image_path": "...",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}

or, when an embedded match exists and embedded is selected:

{
  "image_path": "imgs_embedded/...",
  "rendered_image_path": "imgs_rendered/...",
  "embedded_image_path": "imgs_embedded/...",
  "image_asset_source": "embedded"
}
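
A consumer of the JSON output could check the contract along these lines. This is a sketch only: `check_image_block` is a hypothetical helper, and the rule that image_path equals the path named by image_asset_source is inferred from the two examples above rather than stated as a guarantee.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = (
    "image_path",
    "rendered_image_path",
    "embedded_image_path",
    "image_asset_source",
)


def check_image_block(block: Dict[str, Any]) -> List[str]:
    """Return a list of contract problems; an empty list means the block looks consistent."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in block]
    if problems:
        return problems

    source = block["image_asset_source"]
    if source == "rendered":
        # Inferred rule: the selected path is the rendered path.
        if block["image_path"] != block["rendered_image_path"]:
            problems.append("image_path does not match rendered_image_path")
    elif source == "embedded":
        if block["embedded_image_path"] is None:
            problems.append("source is embedded but embedded_image_path is null")
        elif block["image_path"] != block["embedded_image_path"]:
            problems.append("image_path does not match embedded_image_path")
    else:
        problems.append(f"unexpected image_asset_source: {source!r}")
    return problems
```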

Failure behavior

The implementation explicitly avoids stale asset advertisement:

- image_path must not reference a rendered asset that does not exist
- rendered_image_path must not reference a rendered asset that does not exist
- stale markdown image references are removed when no asset survives
- if a valid embedded asset survives recovery, it remains the selected asset
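
The rules above could be approximated by a post-processing pass like the one below. It is an illustrative sketch, not the PR's implementation: `drop_stale_assets` is a hypothetical name, "survives" is reduced to "the file exists under the output directory", and rewriting a stale Markdown reference to the surviving asset is an assumption about how recovery is reflected in Markdown.

```python
import re
from pathlib import Path
from typing import Any, Dict, Tuple

# Markdown image syntax: ![alt](target)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def drop_stale_assets(
    block: Dict[str, Any], markdown: str, output_dir: str
) -> Tuple[Dict[str, Any], str]:
    """Return (block, markdown) that no longer advertise missing asset files."""
    root = Path(output_dir)
    cleaned = dict(block)
    stale = set()
    for key in ("rendered_image_path", "embedded_image_path"):
        path = cleaned.get(key)
        if path is not None and not (root / path).is_file():
            stale.add(path)
            cleaned[key] = None

    # Keep the previously selected source if it survived, else fall back to the other.
    first = "embedded" if block.get("image_asset_source") == "embedded" else "rendered"
    second = "rendered" if first == "embedded" else "embedded"
    cleaned["image_path"] = None
    cleaned["image_asset_source"] = None
    for source in (first, second):
        path = cleaned.get(f"{source}_image_path")
        if path is not None:
            cleaned["image_path"], cleaned["image_asset_source"] = path, source
            break

    def fix_reference(match: "re.Match[str]") -> str:
        if match.group(2) not in stale:
            return match.group(0)  # not one of this block's stale paths
        if cleaned["image_path"] is None:
            return ""  # no asset survives: remove the reference
        return f"![{match.group(1)}]({cleaned['image_path']})"

    return cleaned, IMAGE_REF.sub(fix_reference, markdown)
```

Only references to paths this block itself advertised are touched, so image links belonging to other blocks pass through unchanged.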

Validation

- focused regression tests cover:
  - rendered-only default behavior
  - embedded match preference
  - rendered preference
  - preservation-mode recovery
  - missing rendered-key recovery
  - no-render-pages behavior
  - crop-failure behavior
  - rendered-origin stale markdown cleanup
  - embedded-origin stale markdown cleanup
  - nested asset persistence in saved output
- real validation confirmed:
  - imgs_rendered/ created in rendered-only mode
  - imgs_rendered/ + imgs_embedded/ created in embedded-enabled mode
  - markdown prefers embedded assets when configured and matched

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata. Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
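
For the numeric and Roman label support mentioned in this commit, a minimal parsing sketch might look like the following. `parse_page_label` and the `"arabic"`/`"roman"` kind names are hypothetical; the commit message does not describe how the formatter actually recognizes labels.

```python
import re
from typing import Optional, Tuple

# Strict Roman numeral pattern for values 1..3999.
ROMAN_RE = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_page_label(text: str) -> Optional[Tuple[str, int]]:
    """Return (kind, value) for a printed page label, or None if it is neither form."""
    label = text.strip()
    if re.fullmatch(r"[0-9]+", label):
        return ("arabic", int(label))

    upper = label.upper()
    if upper and ROMAN_RE.match(upper):
        total = 0
        # A symbol is subtracted when a larger symbol follows it (IV = 4, XC = 90).
        for ch, nxt in zip(upper, upper[1:] + " "):
            value = ROMAN_VALUES[ch]
            total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
        return ("roman", total)
    return None
```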

Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered'). The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.

Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets. Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.

VooDisss added 4 commits March 30, 2026 23:33

Apply pre-commit formatting fixes

0892588

VooDisss mentioned this pull request Apr 1, 2026

Fix PP-DocLayoutV3 head aliasing in layout loader #180

Open

VooDisss wants to merge 4 commits into zai-org:main from VooDisss:image-asset-export
