Add optional SDK image asset export for rendered and embedded PDF images #173

@VooDisss

Summary

GLM-OCR currently emits image regions as rendered crops. The SDK has no way of its own to also export embedded PDF images at source fidelity when they correspond to PP-DocLayoutV3 image regions.

This issue proposes an optional SDK feature for image asset export with two provenance-aware outputs:

  • imgs_rendered/ — rendered region crops
  • imgs_embedded/ — embedded PDF images matched to layout image regions

The feature should be default-off and should keep the standard OCR flow unchanged unless explicitly enabled.


Proposed behavior

Base behavior

  • Rendered image regions are the default image asset output for this feature path.
  • Rendered assets are written to imgs_rendered/.
  • Legacy imgs/ should not be used by this feature path.

Optional embedded export

When enable_image_asset_export=true:

  • inspect embedded PDF images with PyMuPDF
  • match embedded image instances to PP-DocLayoutV3 image regions using geometry
  • save matched embedded assets to imgs_embedded/
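
Geometry matching only works if both boxes share one coordinate space. PyMuPDF reports page geometry in PDF points (72 per inch), so the placement rectangles may need scaling first. The sketch below assumes that layout bbox_2d values are pixel coordinates on a page rendered at rendered_image_dpi; that convention is an assumption and the SDK may use a different one. The function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

PDF_POINTS_PER_INCH = 72.0


def pdf_rect_to_pixels(rect: Box, dpi: float = 300.0) -> Box:
    """Scale a rectangle from PDF points to pixels at the given render DPI.

    Assumption: the layout model saw the page rendered at `dpi`, with the
    same origin (top-left) that PyMuPDF uses for page rectangles.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    x0, y0, x1, y1 = rect
    # Multiply before dividing to keep round values exact in floating point.
    return (
        x0 * dpi / PDF_POINTS_PER_INCH,
        y0 * dpi / PDF_POINTS_PER_INCH,
        x1 * dpi / PDF_POINTS_PER_INCH,
        y1 * dpi / PDF_POINTS_PER_INCH,
    )
```

If bbox_2d turns out to be normalized rather than pixel-based, the same idea applies with a different scale factor.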

Markdown selection

Expose:

pipeline:
  result_formatter:
    enable_image_asset_export: false
    markdown_image_preference: embedded   # embedded | rendered
    image_match_iou_threshold: 0.5
    image_match_containment_threshold: 0.8
    rendered_image_dpi: 300
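
As a rough illustration of how these options might be loaded and validated, here is a minimal sketch. The class name and the from_pipeline_config helper are hypothetical and not part of the existing SDK. The field names and defaults mirror the keys listed above.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class ImageAssetExportConfig:
    # Hypothetical container; defaults mirror the proposed YAML keys.
    enable_image_asset_export: bool = False
    markdown_image_preference: str = "embedded"  # embedded | rendered
    image_match_iou_threshold: float = 0.5
    image_match_containment_threshold: float = 0.8
    rendered_image_dpi: int = 300

    def __post_init__(self) -> None:
        if self.markdown_image_preference not in ("embedded", "rendered"):
            raise ValueError("markdown_image_preference must be 'embedded' or 'rendered'")
        for name in ("image_match_iou_threshold", "image_match_containment_threshold"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
        if self.rendered_image_dpi <= 0:
            raise ValueError("rendered_image_dpi must be positive")

    @classmethod
    def from_pipeline_config(cls, config: Mapping[str, Any]) -> "ImageAssetExportConfig":
        """Read pipeline.result_formatter and ignore keys this feature does not own."""
        section = config.get("pipeline", {}).get("result_formatter", {})
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in section.items() if k in known})
```

An empty or missing result_formatter section yields the defaults, which keeps the feature off unless it is explicitly enabled.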

Rules:

  • if markdown_image_preference=embedded and a match exists, Markdown should use the embedded asset
  • otherwise Markdown should use the rendered asset
  • if no asset actually survives a recovery path, final JSON and Markdown should not keep stale references
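
The selection rules above can be sketched as a small pure function. This is an illustration only: the function name is hypothetical, and returning None to mean "keep no reference" is one possible convention, not something the SDK defines.

```python
from typing import Dict, Optional


def select_image_asset(
    rendered_path: Optional[str],
    embedded_path: Optional[str],
    preference: str = "embedded",
) -> Optional[Dict[str, Optional[str]]]:
    """Pick the asset Markdown should reference, or None if nothing survives."""
    if preference not in ("embedded", "rendered"):
        raise ValueError("preference must be 'embedded' or 'rendered'")

    if preference == "embedded" and embedded_path:
        source, chosen = "embedded", embedded_path
    elif rendered_path:
        source, chosen = "rendered", rendered_path
    elif embedded_path:
        # Rendered asset is gone but a valid embedded asset survives.
        source, chosen = "embedded", embedded_path
    else:
        # No asset survives: the caller should drop the reference.
        return None

    return {
        "image_path": chosen,
        "rendered_image_path": rendered_path,
        "embedded_image_path": embedded_path,
        "image_asset_source": source,
    }
```

The returned dictionary uses the same four fields as the image blocks in this proposal, so JSON and Markdown can be driven from one decision.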

Matching rules

Use only geometry-based matching between:

  • PP-DocLayoutV3 layout bbox_2d
  • embedded image placement rectangles from the PDF

Recommended signals:

  • same-page filtering
  • IoU threshold
  • containment threshold
  • aspect-ratio plausibility
  • one-to-one assignment

Avoid:

  • center-distance-only matching
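
To make the recommended signals concrete, here is a minimal sketch of one way to combine them. Several choices are assumptions that the proposal does not fix: a pair qualifies if it passes either the IoU or the containment threshold, containment is measured on the smaller box, the aspect-ratio tolerance is 25%, and one-to-one assignment is greedy by descending IoU. The caller is assumed to pass only boxes from the same page, already in the same units. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), same page and units


def _area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def _intersection(a: Box, b: Box) -> float:
    return _area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def iou(a: Box, b: Box) -> float:
    inter = _intersection(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def containment(a: Box, b: Box) -> float:
    """Share of the smaller box that lies inside the other box."""
    smaller = min(_area(a), _area(b))
    return _intersection(a, b) / smaller if smaller > 0 else 0.0


def aspect_ratio_plausible(a: Box, b: Box, tolerance: float = 0.25) -> bool:
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    if min(wa, ha, wb, hb) <= 0:
        return False
    ratio = (wa / ha) / (wb / hb)
    return max(ratio, 1.0 / ratio) <= 1.0 + tolerance


def match_embedded_to_regions(
    regions: List[Box],
    embedded: List[Box],
    iou_threshold: float = 0.5,
    containment_threshold: float = 0.8,
) -> Dict[int, int]:
    """Return {region_index: embedded_index} with each index used at most once."""
    candidates = []
    for r_idx, region in enumerate(regions):
        for e_idx, rect in enumerate(embedded):
            if not aspect_ratio_plausible(region, rect):
                continue
            overlap, contained = iou(region, rect), containment(region, rect)
            if overlap >= iou_threshold or contained >= containment_threshold:
                candidates.append((overlap, contained, r_idx, e_idx))

    # Greedy one-to-one assignment: best overlap first, ties broken by index.
    candidates.sort(key=lambda c: (-c[0], -c[1], c[2], c[3]))
    matches: Dict[int, int] = {}
    used_embedded = set()
    for _, _, r_idx, e_idx in candidates:
        if r_idx in matches or e_idx in used_embedded:
            continue
        matches[r_idx] = e_idx
        used_embedded.add(e_idx)
    return matches
```

Greedy assignment is simple and deterministic. An optimal assignment (for example the Hungarian algorithm) could replace it if pages with many overlapping images turn out to need it.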

Output contract

Image blocks should consistently expose:

{
  "image_path": "...",
  "rendered_image_path": "...",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}

or, when a match exists and embedded is selected:

{
  "image_path": "imgs_embedded/...",
  "rendered_image_path": "imgs_rendered/...",
  "embedded_image_path": "imgs_embedded/...",
  "image_asset_source": "embedded"
}

Cases

Case 1 — Rendered-only mode

When enable_image_asset_export=false:

  • image regions still export to imgs_rendered/
  • embedded extraction is not attempted
  • image blocks look like:

{
  "image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}

Case 2 — Embedded match exists and embedded is preferred

When enable_image_asset_export=true and a geometric embedded-image match succeeds:

{
  "image_path": "imgs_embedded/embedded_page2_idx0_xref199.png",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": "imgs_embedded/embedded_page2_idx0_xref199.png",
  "image_asset_source": "embedded"
}

Case 3 — No embedded match

When enable_image_asset_export=true but no embedded image matches the region:

{
  "image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "rendered_image_path": "imgs_rendered/rendered_page2_idx0.jpg",
  "embedded_image_path": null,
  "image_asset_source": "rendered"
}

Case 4 — Recovery/failure paths

If a rendered asset cannot actually be preserved or regenerated:

  • final JSON should not advertise a nonexistent rendered asset
  • final Markdown should not keep stale rendered or embedded image references
  • if a valid embedded asset survives, it may remain selected
  • otherwise stale asset references should be removed entirely
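
The recovery rules above could be enforced by a final reconciliation pass just before JSON and Markdown are written. The sketch below is hypothetical: both function names are invented for illustration, the file check is injectable so it can be tested without touching disk, and it assumes Markdown references images with the standard ![alt](path) syntax.

```python
import os
import re
from typing import Callable, Iterable, Optional

_MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def reconcile_image_block(
    block: dict,
    output_dir: str,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[dict]:
    """Drop paths whose files are missing; return None if no asset survives."""

    def surviving(rel_path: Optional[str]) -> Optional[str]:
        if rel_path and exists(os.path.join(output_dir, rel_path)):
            return rel_path
        return None

    rendered = surviving(block.get("rendered_image_path"))
    embedded = surviving(block.get("embedded_image_path"))
    if rendered is None and embedded is None:
        return None  # caller should remove the asset reference entirely

    # Keep the embedded selection if it was chosen, or if it is all that is left.
    use_embedded = embedded is not None and (
        block.get("image_asset_source") == "embedded" or rendered is None
    )
    updated = dict(block)
    updated["rendered_image_path"] = rendered
    updated["embedded_image_path"] = embedded
    updated["image_path"] = embedded if use_embedded else rendered
    updated["image_asset_source"] = "embedded" if use_embedded else "rendered"
    return updated


def strip_stale_markdown_images(markdown: str, surviving_paths: Iterable[str]) -> str:
    """Remove image references whose target is not among the surviving paths."""
    keep = set(surviving_paths)
    return _MD_IMAGE.sub(lambda m: m.group(0) if m.group(1) in keep else "", markdown)
```

Running both helpers against the same set of surviving files keeps the JSON and Markdown outputs consistent with each other.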

Why this is useful

This keeps image asset handling inside the SDK instead of relying on downstream pipeline heuristics.

It also improves fidelity:

  • embedded PDF images can preserve source quality
  • rendered crops remain the fallback for charts, composites, and non-embedded visuals

Scope constraints

  • SDK-first feature
  • default-off
  • no center-distance-only matching
  • no debug-only output files by default
  • no downstream pipeline_skeleton redesign in this issue
