
Fix PP-DocLayoutV3 head aliasing in layout loader#180

Open
VooDisss wants to merge 7 commits into zai-org:main from VooDisss:ppdoclayout-head-remap-minimal
Conversation

@VooDisss
Contributor

@VooDisss VooDisss commented Apr 1, 2026

Summary

Fixes #179

  • fix PP-DocLayoutV3 checkpoint loading in GLM-OCR by aliasing tied enc_* detection-head weights to the decoder head names expected by PPDocLayoutV3ForObjectDetection
  • load the layout model from a prepared state dict so transformers no longer initializes missing decoder detection heads
  • replace deprecated PPDocLayoutV3ImageProcessorFast usage with PPDocLayoutV3ImageProcessor
  • add focused unit coverage for the aliasing logic and update detector startup mocks for the new load path and processor rename

Why

The published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores trained detection-head weights under:

  • model.enc_score_head.*
  • model.enc_bbox_head.layers.*

but the object-detection wrapper used by GLM-OCR expects:

  • model.decoder.class_embed.*
  • model.decoder.bbox_embed.layers.*

Without aliasing those keys before model load, the decoder detection heads are treated as missing and newly initialized, which degrades layout detection in practice.
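A minimal sketch of the key remapping, using the checkpoint key prefixes quoted above. The PR's actual helper is `_prepare_pp_doclayout_state_dict`; the function name and the plain-dict treatment here are illustrative.

```python
def alias_pp_doclayout_heads(state_dict):
    """Copy tied enc_* detection-head entries onto the decoder-head key
    names expected by PPDocLayoutV3ForObjectDetection. Illustrative sketch:
    values can be tensors or any object, since the keys are only renamed."""
    remapped = dict(state_dict)
    for key, value in state_dict.items():
        if key.startswith("model.enc_score_head."):
            alias = key.replace("model.enc_score_head.",
                                "model.decoder.class_embed.", 1)
            remapped.setdefault(alias, value)
        elif key.startswith("model.enc_bbox_head.layers."):
            alias = key.replace("model.enc_bbox_head.layers.",
                                "model.decoder.bbox_embed.layers.", 1)
            remapped.setdefault(alias, value)
    return remapped
```

Because the heads are tied, the original enc_* entries are kept alongside the aliases rather than moved.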

The runtime also still used the deprecated PPDocLayoutV3ImageProcessorFast entry point, which produced a warning on every worker startup under transformers 5.4.0.

Validation

  • pytest glmocr/tests/test_unit.py -k "detector_device_selection or detector_prepares_pp_doclayout_decoder_head_aliases"
  • real self-hosted OCR pipeline run on local PDFs with successful processing after the fix

Scope

This PR intentionally keeps the fix narrow:

  • no meta-tensor recovery logic
  • no broader layout runtime refactor
  • no broader processor refactor beyond replacing the deprecated entry point used by the detector

Those can be handled separately if needed.

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata.
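The "both numeric and Roman labels" support could look roughly like the following sketch; the helper name and exact matching rules are assumptions for illustration, not the formatter's actual implementation.

```python
import re

# Matches a well-formed lowercase Roman numeral (non-empty via the lookahead).
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$")

def classify_page_label(text):
    """Classify text recognized inside a 'number' region as an arabic or
    roman printed page number; returns (kind, value) or (None, None)."""
    token = text.strip().lower()
    if token.isdigit():
        return ("arabic", int(token))
    if ROMAN_RE.match(token):
        values = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
        total, prev = 0, 0
        for ch in reversed(token):
            v = values[ch]
            total = total - v if v < prev else total + v
            prev = max(prev, v)
        return ("roman", total)
    return (None, None)
```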

Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
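One plausible precedence for the three override surfaces mentioned above (environment > constructor > config) is sketched below; the environment variable name and the default value are assumptions for the sketch, not the SDK's documented names.

```python
import os

def resolve_detect_printed_page_numbers(config_value=None, ctor_value=None):
    """Resolve the detect_printed_page_numbers flag. Hypothetical env var
    name and default; precedence is env > constructor > config > default."""
    env = os.environ.get("GLMOCR_DETECT_PRINTED_PAGE_NUMBERS")
    if env is not None:
        return env.strip().lower() in ("1", "true", "yes", "on")
    if ctor_value is not None:
        return ctor_value
    if config_value is not None:
        return config_value
    return True  # assumed default for this sketch
```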
Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').
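The IoU and containment signals used in the matching above can be sketched as plain box geometry; the SDK combines these with aspect-ratio plausibility and one-to-one assignment, which this illustration omits.

```python
def iou_and_containment(box_a, box_b):
    """Boxes are (x0, y0, x1, y1) in page coordinates. Returns
    (iou, fraction of box_b's area contained in box_a)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter
    iou = inter / union if union else 0.0
    containment = inter / area_b if area_b else 0.0
    return iou, containment
```

High containment with modest IoU typically means the embedded image sits inside a larger layout region, which is why both signals are useful.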

The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.
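Concretely, the stable fields could look like this for two selection outcomes; the keys mirror the contract above, while the paths and values are made-up examples.

```python
# A region with only a rendered asset: the rendered file is selected.
rendered_only_block = {
    "image_path": "imgs_rendered/page_3_img_0.png",       # selected asset
    "rendered_image_path": "imgs_rendered/page_3_img_0.png",
    "embedded_image_path": None,                          # no embedded match
    "image_asset_source": "rendered",
}

# The same region with an embedded match and markdown_image_preference='embedded'.
embedded_preferred_block = {
    "image_path": "imgs_embedded/page_3_img_0.png",
    "rendered_image_path": "imgs_rendered/page_3_img_0.png",
    "embedded_image_path": "imgs_embedded/page_3_img_0.png",
    "image_asset_source": "embedded",
}
```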
Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets.

Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.

GLM-OCR loaded PPDocLayoutV3ForObjectDetection directly from the published Hugging Face checkpoint, but the checkpoint stores the tied detection head weights under model.enc_score_head.* and model.enc_bbox_head.layers.* while the object-detection wrapper expects model.decoder.class_embed.* and model.decoder.bbox_embed.layers.*. In practice this caused the decoder detection heads to be treated as missing and newly initialized, which surfaced as startup warnings, unstable layout behavior, and degraded self-hosted OCR results.

The fix keeps the change narrow: load the PP-DocLayoutV3 config separately, load model.safetensors directly, alias the tied encoder-head keys onto the decoder-head names before model construction, and instantiate the model with from_pretrained(None, config=..., state_dict=...). This avoids broader runtime recovery logic and keeps the compatibility repair at the checkpoint-loading boundary where the mismatch actually occurs.
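The load flow above can be sketched with its collaborators injected, so the structure is visible without the real model classes; in the PR the collaborators are the transformers config/model classes, safetensors loading, and _prepare_pp_doclayout_state_dict.

```python
def load_layout_model(checkpoint_dir, model_cls, load_config, load_weights,
                      prepare_state_dict):
    """Sketch of the narrow load path: load the config and the raw
    model.safetensors weights separately, alias the tied enc_* keys onto
    the decoder-head names, then build the model from the prepared state
    dict. Collaborators are injected here purely so the flow is testable."""
    config = load_config(checkpoint_dir)
    state_dict = load_weights(f"{checkpoint_dir}/model.safetensors")
    state_dict = prepare_state_dict(state_dict)
    # from_pretrained(None, config=..., state_dict=...) keeps transformers
    # from re-initializing "missing" decoder heads from scratch.
    return model_cls.from_pretrained(None, config=config, state_dict=state_dict)
```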

The background investigation included local inspection of the cached safetensors checkpoint, installed transformers 5.4.0 PP-DocLayoutV3 source, Paddle inference artifacts, and upstream release context. The key finding was that the checkpoint is not headless: the trained head weights are present under enc_* names, and the local transformers implementation explicitly declares decoder.class_embed <-> enc_score_head and decoder.bbox_embed <-> enc_bbox_head as tied/shared weight groups. That made aliasing the minimal defensible fix for GLM-OCR rather than reworking the full layout runtime.

Tests were updated only as needed for the new load path. Existing detector device-selection tests now stub the config and prepared state-dict helpers, and a focused unit test verifies that _prepare_pp_doclayout_state_dict aliases encoder-head weights into the decoder-head keys expected by the object-detection wrapper. Validation also included a real self-hosted pipeline run over local PDFs, where the old missing decoder-head load report disappeared and processing completed successfully after the fix.

Follow up the PP-DocLayoutV3 checkpoint aliasing fix by switching the layout detector from PPDocLayoutV3ImageProcessorFast to PPDocLayoutV3ImageProcessor. Under the current transformers 5.4.0 runtime, the Fast-suffixed processor emits a deprecation warning on every worker startup even though the rest of the layout path is functioning correctly.

This change is intentionally narrow. It does not alter the checkpoint aliasing logic, model loading strategy, layout post-processing, or device-selection behavior. The only production change is to use the non-deprecated image processor entry point that transformers now expects. Tests were updated only where the detector startup path mocks the image processor loader.

The need for this cleanup was confirmed by a real self-hosted OCR pipeline run after the head-aliasing fix landed. That run showed successful PP-DocLayoutV3 startup and processing, but still printed the deprecation warning telling callers to use PPDocLayoutV3ImageProcessor instead of the Fast variant. Replacing the import and matching test patches removes that remaining startup warning without widening the scope of the loader fix.

Validation included re-running the focused detector test slice covering detector device selection and PP-DocLayout decoder-head aliasing, which passed after the rename. A subsequent real OCR pipeline run on local PDFs also started and processed documents without the previous deprecation warning, confirming that the cleanup behaves correctly in the actual self-hosted path.
@VooDisss
Contributor Author

VooDisss commented Apr 1, 2026

@JaredforReal note that my PRs are stacked and not based directly on main.

For #180 specifically, I spent some time ruling out user/config error vs an actual GLM-OCR integration issue.

What I found is that under the current transformers 5.x PP-DocLayoutV3 load path, the layout detector comes up with missing decoder detection-head weights and then fails during startup/device move. In practice this showed up when running the self-hosted OCR pipeline with multi-worker layout initialization.

Minimal symptom:

```
PPDocLayoutV3ForObjectDetection LOAD REPORT
model.decoder.bbox_embed.layers.{0,1,2}.bias   | MISSING
model.decoder.bbox_embed.layers.{0,1,2}.weight | MISSING
model.decoder.class_embed.weight               | MISSING
model.decoder.class_embed.bias                 | MISSING

NotImplementedError: Cannot copy out of meta tensor; no data!
Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to()
when moving module from meta to a different device.
```

After digging into the checkpoint and local transformers implementation, the issue appears to be a checkpoint/load-path mismatch rather than simple user error:

  • the published PaddlePaddle/PP-DocLayoutV3_safetensors checkpoint stores the tied prediction head weights under enc_* names
  • PPDocLayoutV3ForObjectDetection expects the corresponding decoder.* names at load time
  • without aliasing those keys before model construction, the decoder detection heads are treated as missing and newly initialized

That is what #180 fixes:

  • alias enc_score_head.* -> decoder.class_embed.*
  • alias enc_bbox_head.layers.* -> decoder.bbox_embed.layers.*
  • construct the model from the prepared state dict
  • also switch from the deprecated PPDocLayoutV3ImageProcessorFast entry point to PPDocLayoutV3ImageProcessor

I opened the upstream issue with the evidence here:

So from my side this does not look like just a local environment mistake; it looks like a real PP-DocLayoutV3 HF checkpoint/load-path compatibility bug that GLM-OCR needs to bridge explicitly.


Development

Successfully merging this pull request may close these issues.

PP-DocLayoutV3 checkpoint/load-path mismatch under transformers 5.4.0