Add optional SDK image asset export for rendered and embedded PDF images #173
Description
Summary
GLM-OCR currently emits image regions as rendered crops, but there is no SDK-owned way to also export matched embedded PDF images at source fidelity when they correspond to PP-DocLayoutV3 image regions.
This issue proposes an optional SDK feature for image asset export with two provenance-aware outputs:
- imgs_rendered/ (rendered region crops)
- imgs_embedded/ (embedded PDF images matched to layout image regions)
The feature should be default-off and should keep the standard OCR flow unchanged unless explicitly enabled.
Proposed behavior
Base behavior
- Rendered image regions are the default image asset output for this feature path.
- Rendered assets are written to imgs_rendered/.
- Legacy imgs/ should not be used by this feature path.
Optional embedded export
When enable_image_asset_export=true:
- inspect embedded PDF images with PyMuPDF
- match embedded image instances to PP-DocLayoutV3 image regions using geometry
- save matched embedded assets to imgs_embedded/
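A minimal sketch of the inspection step, assuming PyMuPDF's `Page.get_images`, `Page.get_image_rects`, and `Document.extract_image`. The `EmbeddedImage` record and all function names are hypothetical, and the file-name helper only mirrors the example paths used in this issue. Placement rectangles come back in PDF points, so they are scaled by `dpi / 72` on the assumption that layout boxes are expressed in rendered-page pixels.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class EmbeddedImage:
    """One placement of an embedded image on a page (hypothetical record)."""
    page_index: int  # 0-based, as reported by PyMuPDF
    xref: int
    bbox_px: Box  # placement rectangle scaled to rendered-page pixels


def points_to_pixels(rect: Box, dpi: int = 300) -> Box:
    """Scale a rectangle from PDF points (72 per inch) to pixels at `dpi`."""
    scale = dpi / 72.0
    x0, y0, x1, y1 = rect
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)


def embedded_asset_name(page: int, region_idx: int, xref: int, ext: str) -> str:
    """Build a file name in the style of this issue's examples.

    Whether `page` is 0- or 1-based is not specified in the issue.
    """
    return f"embedded_page{page}_idx{region_idx}_xref{xref}.{ext}"


def list_embedded_images(pdf_path: str, dpi: int = 300) -> List[EmbeddedImage]:
    """Collect every embedded image placement in the PDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    found: List[EmbeddedImage] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for item in page.get_images(full=True):
                xref = item[0]
                # The same image can be placed several times on one page.
                for rect in page.get_image_rects(item):
                    box = (rect.x0, rect.y0, rect.x1, rect.y1)
                    found.append(
                        EmbeddedImage(page.number, xref, points_to_pixels(box, dpi))
                    )
    return found
```

Saving a matched asset could then use `doc.extract_image(xref)`, whose result carries the raw bytes under `"image"` and a suggested extension under `"ext"`.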
Markdown selection
Expose:
pipeline:
  result_formatter:
    enable_image_asset_export: false
    markdown_image_preference: embedded  # embedded | rendered
    image_match_iou_threshold: 0.5
    image_match_containment_threshold: 0.8
    rendered_image_dpi: 300
Rules:
- if markdown_image_preference=embedded and a match exists, Markdown should use the embedded asset
- otherwise Markdown should use the rendered asset
- if no asset actually survives a recovery path, final JSON and Markdown should not keep stale references
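The selection rules above can be sketched as one small function. `select_image_asset` is a hypothetical helper, not an existing SDK function. It returns the four image-block fields proposed in this issue, or `None` when nothing survives. Falling back to the embedded asset when the rendered one is missing is one possible reading of the recovery rule, not a settled decision.

```python
from typing import Dict, Optional


def select_image_asset(
    rendered_path: Optional[str],
    embedded_path: Optional[str],
    preference: str = "embedded",
) -> Optional[Dict[str, Optional[str]]]:
    """Return the image block fields, or None when no asset survived."""
    if preference not in ("embedded", "rendered"):
        raise ValueError(f"unknown markdown_image_preference: {preference!r}")

    # Use the embedded asset when it is preferred, or when it is all that is left.
    if embedded_path is not None and (
        preference == "embedded" or rendered_path is None
    ):
        chosen, source = embedded_path, "embedded"
    elif rendered_path is not None:
        chosen, source = rendered_path, "rendered"
    else:
        return None  # nothing to reference; the caller should drop the image

    return {
        "image_path": chosen,
        "rendered_image_path": rendered_path,
        "embedded_image_path": embedded_path,
        "image_asset_source": source,
    }
```

Returning `None` instead of a block with empty paths makes it harder for a formatter to emit a stale reference by accident.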
Matching rules
Use only geometry-based matching between:
- PP-DocLayoutV3 layout bbox_2d
- embedded image placement rectangles from the PDF
Recommended signals:
- same-page filtering
- IoU threshold
- containment threshold
- aspect-ratio plausibility
- one-to-one assignment
Avoid:
- center-distance-only matching
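A minimal sketch of the recommended signals in pure Python. All function names are hypothetical, and several choices are assumptions the issue does not settle: a pair qualifies if it passes either the IoU or the containment threshold, containment is measured as the fraction of the embedded rectangle inside the layout region, the aspect-ratio tolerance is 20%, and one-to-one assignment is done greedily by descending IoU. Both box lists are expected to come from the same page and share one coordinate system.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(a: Box, b: Box) -> float:
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def containment(inner: Box, outer: Box) -> float:
    """Fraction of `inner` that lies inside `outer`."""
    inner_area = area(inner)
    return intersection(inner, outer) / inner_area if inner_area > 0 else 0.0


def aspect_plausible(a: Box, b: Box, tolerance: float = 0.2) -> bool:
    """True when the width/height ratios differ by at most `tolerance`."""
    wa, ha, wb, hb = a[2] - a[0], a[3] - a[1], b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return False
    ra, rb = wa / ha, wb / hb
    return abs(ra - rb) / max(ra, rb) <= tolerance


def match_regions(
    regions: Sequence[Box],
    embedded: Sequence[Box],
    iou_threshold: float = 0.5,
    containment_threshold: float = 0.8,
) -> Dict[int, int]:
    """Greedy one-to-one matching; returns region index -> embedded index."""
    candidates: List[Tuple[float, int, int]] = []
    for i, region in enumerate(regions):
        for j, emb in enumerate(embedded):
            if not aspect_plausible(region, emb):
                continue
            score = iou(region, emb)
            if score >= iou_threshold or containment(emb, region) >= containment_threshold:
                candidates.append((score, i, j))

    matches: Dict[int, int] = {}
    used_embedded = set()
    for _, i, j in sorted(candidates, reverse=True):  # best IoU first
        if i not in matches and j not in used_embedded:
            matches[i] = j
            used_embedded.add(j)
    return matches
```

Greedy assignment is the simplest option. An optimal assignment, for example with the Hungarian algorithm, could replace it if pages with many overlapping candidates turn out to be common.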
Output contract
Image blocks should consistently expose:
{
  "image_path": "...",
  "rendered_image_path": "...",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}
or, when a match exists and embedded is selected:
{
  "image_path": "imgs_embedded/...",
  "rendered_image_path": "imgs_rendered/...",
  "embedded_image_path": "imgs_embedded/...",
  "image_asset_source": "embedded"
}
Cases
Case 1: Rendered-only mode
When enable_image_asset_export=false:
- image regions still export to imgs_rendered/
- embedded extraction is not attempted
- image blocks look like:
{
  "image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}
Case 2: Embedded match exists and embedded is preferred
When enable_image_asset_export=true and a geometric embedded-image match succeeds:
{
  "image_path": "imgs_embedded/embedded_page2_idx0_xref199.png",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": "imgs_embedded/embedded_page2_idx0_xref199.png",
  "image_asset_source": "embedded"
}
Case 3: No embedded match
When enable_image_asset_export=true but no embedded image matches the region:
{
  "image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}
Case 4: Recovery/failure paths
If a rendered asset cannot actually be preserved or regenerated:
- final JSON should not advertise a nonexistent rendered asset
- final Markdown should not keep stale rendered or embedded image references
- if a valid embedded asset survives, it may remain selected
- otherwise stale asset references should be removed entirely
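One way to enforce these recovery rules is a final reconciliation pass that checks which asset files actually exist before JSON and Markdown are written. The sketch below is illustrative only: `surviving`, `reconcile_block`, and `strip_stale_images` are hypothetical names, paths are assumed to be relative to the output directory, and Markdown images are assumed to use the plain `![alt](path)` form.

```python
import os
import re
from typing import Dict, Optional

# Matches Markdown image syntax: ![alt](path)
_MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def surviving(path: Optional[str], output_dir: str) -> Optional[str]:
    """Return `path` if the file exists under `output_dir`, else None."""
    if path and os.path.isfile(os.path.join(output_dir, path)):
        return path
    return None


def reconcile_block(
    block: Dict[str, Optional[str]], output_dir: str
) -> Optional[Dict[str, Optional[str]]]:
    """Drop references to missing files; return None if no asset is left."""
    rendered = surviving(block.get("rendered_image_path"), output_dir)
    embedded = surviving(block.get("embedded_image_path"), output_dir)
    selected = surviving(block.get("image_path"), output_dir)
    if selected is None:
        # The selected asset is gone; fall back to whatever still exists.
        selected = embedded or rendered
    if selected is None:
        return None
    return {
        "image_path": selected,
        "rendered_image_path": rendered,
        "embedded_image_path": embedded,
        "image_asset_source": "embedded" if selected == embedded else "rendered",
    }


def strip_stale_images(markdown: str, output_dir: str) -> str:
    """Remove Markdown image references whose target file is missing."""

    def keep_or_drop(match: "re.Match[str]") -> str:
        return match.group(0) if surviving(match.group(1), output_dir) else ""

    return _MD_IMAGE.sub(keep_or_drop, markdown)
```

Checking the filesystem at the very end keeps the check independent of which earlier step failed, at the cost of one extra `isfile` call per referenced asset.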
Why this is useful
This keeps image asset handling inside the SDK instead of relying on downstream pipeline heuristics.
It also improves fidelity:
- embedded PDF images can preserve source quality
- rendered crops remain the fallback for charts, composites, and non-embedded visuals
Scope constraints
- SDK-first feature
- default-off
- no center-distance-only matching
- no debug-only output files by default
- no downstream pipeline_skeleton redesign in this issue