
Fix PP-DocLayoutV3 head aliasing in layout loader#180

Open
VooDisss wants to merge 7 commits into zai-org:main from VooDisss:ppdoclayout-head-remap-minimal
Conversation

@VooDisss
Contributor

@VooDisss VooDisss commented Apr 1, 2026

Summary

Fixes #179

  • fix PP-DocLayoutV3 checkpoint loading in GLM-OCR by aliasing tied enc_* detection-head weights to the decoder head names expected by PPDocLayoutV3ForObjectDetection
  • load the layout model from a prepared state dict so transformers no longer initializes missing decoder detection heads
  • replace deprecated PPDocLayoutV3ImageProcessorFast usage with PPDocLayoutV3ImageProcessor
  • add focused unit coverage for the aliasing logic and update detector startup mocks for the new load path and processor rename

Why

The published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores trained detection-head weights under:

  • model.enc_score_head.*
  • model.enc_bbox_head.layers.*

but the object-detection wrapper used by GLM-OCR expects:

  • model.decoder.class_embed.*
  • model.decoder.bbox_embed.layers.*

Without aliasing those keys before model load, the decoder detection heads are treated as missing and newly initialized, which degrades layout detection in practice.
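A minimal sketch of the key remapping, using the checkpoint key prefixes quoted above. The PR's actual helper is `_prepare_pp_doclayout_state_dict`; the function name and the plain-dict treatment here are illustrative.

```python
def alias_pp_doclayout_heads(state_dict):
    """Copy tied enc_* detection-head entries onto the decoder-head key
    names expected by PPDocLayoutV3ForObjectDetection. Illustrative sketch:
    values can be tensors or any object, since the keys are only renamed."""
    remapped = dict(state_dict)
    for key, value in state_dict.items():
        if key.startswith("model.enc_score_head."):
            alias = key.replace("model.enc_score_head.",
                                "model.decoder.class_embed.", 1)
            remapped.setdefault(alias, value)
        elif key.startswith("model.enc_bbox_head.layers."):
            alias = key.replace("model.enc_bbox_head.layers.",
                                "model.decoder.bbox_embed.layers.", 1)
            remapped.setdefault(alias, value)
    return remapped
```

Because the heads are tied, the original enc_* entries are kept alongside the aliases rather than moved.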

The runtime also still used the deprecated PPDocLayoutV3ImageProcessorFast entry point, which produced a warning on every worker startup under transformers 5.4.0.

Validation

  • pytest glmocr/tests/test_unit.py -k "detector_device_selection or detector_prepares_pp_doclayout_decoder_head_aliases"
  • real self-hosted OCR pipeline run on local PDFs with successful processing after the fix

Scope

This PR intentionally keeps the fix narrow:

  • no meta-tensor recovery logic
  • no broader layout runtime refactor
  • no broader processor refactor beyond replacing the deprecated entry point used by the detector

Those can be handled separately if needed.

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata.
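The "both numeric and Roman labels" support could look roughly like the following sketch; the helper name and exact matching rules are assumptions for illustration, not the formatter's actual implementation.

```python
import re

# Matches a well-formed lowercase Roman numeral (non-empty via the lookahead).
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")

def classify_page_label(text):
    """Classify text recognized inside a 'number' region as an arabic or
    roman printed page number; returns (kind, value) or (None, None)."""
    token = text.strip().lower()
    if token.isdigit():
        return ("arabic", int(token))
    if ROMAN_RE.match(token):
        values = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
        total, prev = 0, 0
        for ch in reversed(token):
            v = values[ch]
            total = total - v if v < prev else total + v
            prev = max(prev, v)
        return ("roman", total)
    return (None, None)
```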

Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
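One plausible precedence for the three override surfaces mentioned above (environment > constructor > config) is sketched below; the environment variable name and the default value are assumptions for the sketch, not the SDK's documented names.

```python
import os

def resolve_detect_printed_page_numbers(config_value=None, ctor_value=None):
    """Resolve the detect_printed_page_numbers flag. Hypothetical env var
    name and default; precedence is env > constructor > config > default."""
    env = os.environ.get("GLMOCR_DETECT_PRINTED_PAGE_NUMBERS")
    if env is not None:
        return env.strip().lower() in ("1", "true", "yes", "on")
    if ctor_value is not None:
        return ctor_value
    if config_value is not None:
        return config_value
    return True  # assumed default for this sketch
```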
Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').
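The IoU and containment signals used in the matching above can be sketched as plain box geometry; the SDK combines these with aspect-ratio plausibility and one-to-one assignment, which this illustration omits.

```python
def iou_and_containment(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in page coordinates. Returns
    (iou, fraction of box_b's area contained in box_a)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter
    iou = inter / union if union else 0.0
    containment = inter / area_b if area_b else 0.0
    return iou, containment
```

High containment with modest IoU typically means the embedded image sits inside a larger layout region, which is why both signals are useful.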

The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.
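Concretely, the stable fields could look like this for two selection outcomes; the keys mirror the contract above, while the paths and values are made-up examples.

```python
# A region with only a rendered asset: the rendered file is selected.
rendered_only_block = {
    "image_path": "imgs_rendered/page_3_img_0.png",       # selected asset
    "rendered_image_path": "imgs_rendered/page_3_img_0.png",
    "embedded_image_path": None,                          # no embedded match
    "image_asset_source": "rendered",
}

# The same region with an embedded match and markdown_image_preference='embedded'.
embedded_preferred_block = {
    "image_path": "imgs_embedded/page_3_img_0.png",
    "rendered_image_path": "imgs_rendered/page_3_img_0.png",
    "embedded_image_path": "imgs_embedded/page_3_img_0.png",
    "image_asset_source": "embedded",
}
```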
Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets.

Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.

GLM-OCR loaded PPDocLayoutV3ForObjectDetection directly from the published Hugging Face checkpoint, but the checkpoint stores the tied detection head weights under model.enc_score_head.* and model.enc_bbox_head.layers.* while the object-detection wrapper expects model.decoder.class_embed.* and model.decoder.bbox_embed.layers.*. In practice this caused the decoder detection heads to be treated as missing and newly initialized, which surfaced as startup warnings, unstable layout behavior, and degraded self-hosted OCR results.

The fix keeps the change narrow: load the PP-DocLayoutV3 config separately, load model.safetensors directly, alias the tied encoder-head keys onto the decoder-head names before model construction, and instantiate the model with from_pretrained(None, config=..., state_dict=...). This avoids broader runtime recovery logic and keeps the compatibility repair at the checkpoint-loading boundary where the mismatch actually occurs.
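The load flow above can be sketched with its collaborators injected, so the structure is visible without the real model classes; in the PR the collaborators are the transformers config/model classes, safetensors loading, and _prepare_pp_doclayout_state_dict.

```python
def load_layout_model(checkpoint_dir, model_cls, load_config, load_weights,
                      prepare_state_dict):
    """Sketch of the narrow load path: load the config and the raw
    model.safetensors weights separately, alias the tied enc_* keys onto
    the decoder-head names, then build the model from the prepared state
    dict. Collaborators are injected here purely so the flow is testable."""
    config = load_config(checkpoint_dir)
    state_dict = load_weights(f"{checkpoint_dir}/model.safetensors")
    state_dict = prepare_state_dict(state_dict)
    # from_pretrained(None, config=..., state_dict=...) keeps transformers
    # from re-initializing "missing" decoder heads from scratch.
    return model_cls.from_pretrained(None, config=config, state_dict=state_dict)
```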

The background investigation included local inspection of the cached safetensors checkpoint, installed transformers 5.4.0 PP-DocLayoutV3 source, Paddle inference artifacts, and upstream release context. The key finding was that the checkpoint is not headless: the trained head weights are present under enc_* names, and the local transformers implementation explicitly declares decoder.class_embed <-> enc_score_head and decoder.bbox_embed <-> enc_bbox_head as tied/shared weight groups. That made aliasing the minimal defensible fix for GLM-OCR rather than reworking the full layout runtime.

Tests were updated only as needed for the new load path. Existing detector device-selection tests now stub the config and prepared state-dict helpers, and a focused unit test verifies that _prepare_pp_doclayout_state_dict aliases encoder-head weights into the decoder-head keys expected by the object-detection wrapper. Validation also included a real self-hosted pipeline run over local PDFs, where the old missing decoder-head load report disappeared and processing completed successfully after the fix.

Follow up the PP-DocLayoutV3 checkpoint aliasing fix by switching the layout detector from PPDocLayoutV3ImageProcessorFast to PPDocLayoutV3ImageProcessor. Under the current transformers 5.4.0 runtime, the Fast-suffixed processor emits a deprecation warning on every worker startup even though the rest of the layout path is functioning correctly.

This change is intentionally narrow. It does not alter the checkpoint aliasing logic, model loading strategy, layout post-processing, or device-selection behavior. The only production change is to use the non-deprecated image processor entry point that transformers now expects. Tests were updated only where the detector startup path mocks the image processor loader.

The need for this cleanup was confirmed by a real self-hosted OCR pipeline run after the head-aliasing fix landed. That run showed successful PP-DocLayoutV3 startup and processing, but still printed the deprecation warning telling callers to use PPDocLayoutV3ImageProcessor instead of the Fast variant. Replacing the import and matching test patches removes that remaining startup warning without widening the scope of the loader fix.

Validation included re-running the focused detector test slice covering detector device selection and PP-DocLayout decoder-head aliasing, which passed after the rename. A subsequent real OCR pipeline run on local PDFs also started and processed documents without the previous deprecation warning, confirming that the cleanup behaves correctly in the actual self-hosted path.
@VooDisss
Contributor Author

VooDisss commented Apr 1, 2026

@JaredforReal note that my PRs are stacked and not based directly on main.

For #180 specifically, I spent some time ruling out user/config error vs an actual GLM-OCR integration issue.

What I found is that under the current transformers 5.x PP-DocLayoutV3 load path, the layout detector comes up with missing decoder detection-head weights and then fails during startup/device move. In practice this showed up when running the self-hosted OCR pipeline with multi-worker layout initialization.

Minimal symptom:

```
PPDocLayoutV3ForObjectDetection LOAD REPORT
model.decoder.bbox_embed.layers.{0,1,2}.bias   | MISSING
model.decoder.bbox_embed.layers.{0,1,2}.weight | MISSING
model.decoder.class_embed.weight               | MISSING
model.decoder.class_embed.bias                 | MISSING

NotImplementedError: Cannot copy out of meta tensor; no data!
Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to()
when moving module from meta to a different device.
```

After digging into the checkpoint and local transformers implementation, the issue appears to be a checkpoint/load-path mismatch rather than simple user error:

  • the published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores the tied prediction head weights under enc_* names
  • PPDocLayoutV3ForObjectDetection expects the corresponding decoder.* names at load time
  • without aliasing those keys before model construction, the decoder detection heads are treated as missing and newly initialized

That is what #180 fixes:

  • alias enc_score_head.* -> decoder.class_embed.*
  • alias enc_bbox_head.layers.* -> decoder.bbox_embed.layers.*
  • construct the model from the prepared state dict
  • also switch from the deprecated PPDocLayoutV3ImageProcessorFast entry point to PPDocLayoutV3ImageProcessor

I opened the upstream issue with the evidence here:

So from my side this does not look like just a local environment mistake; it looks like a real PP-DocLayoutV3 HF checkpoint/load-path compatibility bug that GLM-OCR needs to bridge explicitly.


Development

Successfully merging this pull request may close these issues.

PP-DocLayoutV3 checkpoint/load-path mismatch under transformers 5.4.0