Fix PP-DocLayoutV3 head aliasing in layout loader (#180)

VooDisss wants to merge 7 commits into zai-org:main
Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata. Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Also expose detect_printed_page_numbers through config, constructor, and environment overrides, align MaaS output with self-hosted behavior, add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
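The numeric-and-Roman label handling described above can be sketched roughly as follows. This is a hypothetical helper for illustration, not the SDK's actual code; the name `parse_page_label` and its heuristics are assumptions:

```python
import re

# Matches well-formed lowercase Roman numerals (i, iv, xiv, ...).
ROMAN_RE = re.compile(
    r"^(?=[ivxlcdm])m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)

def parse_page_label(text):
    """Classify an OCR'd 'number' block as a printed page label.

    Returns (kind, value) where kind is 'arabic' or 'roman', or None
    when the text is not a plausible page number.
    """
    token = text.strip().lower()
    if token.isdigit():
        return ("arabic", int(token))
    if ROMAN_RE.match(token):
        # Convert the Roman numeral to its integer value.
        values = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
        total, prev = 0, 0
        for ch in reversed(token):
            v = values[ch]
            total += v if v >= prev else -v
            prev = max(prev, v)
        return ("roman", total)
    return None
```

A candidate that parses this way would feed the `page_number_candidates` layer; sequencing across pages would then derive `document_page_numbering`.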
Add an SDK-owned image asset export path for PP-DocLayoutV3 image regions. Rendered region assets are now the base behavior of this feature and are written to imgs_rendered/. When enable_image_asset_export=True, the SDK additionally inspects embedded PDF images via PyMuPDF, matches them to layout image regions using same-page geometry, IoU, containment, aspect-ratio plausibility, and one-to-one assignment, and writes matched assets to imgs_embedded/. Markdown selection is controlled by markdown_image_preference ('embedded' or 'rendered').
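The geometry-based matching step might look something like this minimal sketch. The function names, the greedy one-to-one assignment strategy, and the 0.5 IoU threshold are assumptions for illustration; the real matcher also applies same-page, containment, and aspect-ratio checks:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_embedded_to_regions(embedded, regions, iou_threshold=0.5):
    """Greedy one-to-one assignment of embedded images to layout regions
    by descending IoU. Returns {embedded_index: region_index}."""
    pairs = sorted(
        ((iou(e, r), i, j)
         for i, e in enumerate(embedded)
         for j, r in enumerate(regions)),
        reverse=True,
    )
    used_e, used_r, matches = set(), set(), {}
    for score, i, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs only score lower
        if i in used_e or j in used_r:
            continue  # enforce one-to-one assignment
        matches[i] = j
        used_e.add(i)
        used_r.add(j)
    return matches
```

Greedy assignment over sorted scores is a common lightweight alternative to full bipartite matching and avoids the center-distance-only pitfall the description calls out.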
The image block contract is explicit and stable: image_path is the selected asset path, rendered_image_path reflects the rendered asset when one exists, embedded_image_path is null when no embedded match exists, and image_asset_source records whether the selected asset is rendered or embedded. The implementation avoids center-distance-only matching, preserves formatter-produced rendered assets in self-hosted mode instead of re-deriving them, and aggressively prevents stale asset advertisement: if a rendered asset cannot actually be preserved or regenerated, final JSON and Markdown do not continue to reference it. Focused regression tests cover rendered-only mode, embedded preference, rendered preference, nested asset persistence, preservation misses, no-render-pages recovery, crop-failure recovery, and both rendered-origin and embedded-origin stale-markdown cleanup.
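Under that contract, a final image block could look like the following illustrative fragment (all paths and values invented for this example):

```json
{
  "native_label": "image",
  "image_path": "imgs_embedded/page_3_img_0.png",
  "rendered_image_path": "imgs_rendered/page_3_region_2.png",
  "embedded_image_path": "imgs_embedded/page_3_img_0.png",
  "image_asset_source": "embedded"
}
```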
Close the remaining stale-asset recovery gaps in the SDK image export path so final JSON and Markdown do not advertise rendered assets unless they were actually preserved or produced. This covers both no-render-pages and explicit crop-failure branches, including cases where stale markdown originally pointed at embedded assets. Also document the image asset export feature in README.md and README_zh.md, including the exposed configuration surface, the imgs_rendered/ and imgs_embedded/ directory contract, and the stable image block fields: image_path, rendered_image_path, embedded_image_path, and image_asset_source.
GLM-OCR loaded `PPDocLayoutV3ForObjectDetection` directly from the published Hugging Face checkpoint, but the checkpoint stores the tied detection-head weights under `model.enc_score_head.*` and `model.enc_bbox_head.layers.*`, while the object-detection wrapper expects `model.decoder.class_embed.*` and `model.decoder.bbox_embed.layers.*`. In practice this caused the decoder detection heads to be treated as missing and newly initialized, which surfaced as startup warnings, unstable layout behavior, and degraded self-hosted OCR results.

The fix keeps the change narrow: load the PP-DocLayoutV3 config separately, load `model.safetensors` directly, alias the tied encoder-head keys onto the decoder-head names before model construction, and instantiate the model with `from_pretrained(None, config=..., state_dict=...)`. This avoids broader runtime recovery logic and keeps the compatibility repair at the checkpoint-loading boundary where the mismatch actually occurs.

The background investigation included local inspection of the cached safetensors checkpoint, the installed transformers 5.4.0 PP-DocLayoutV3 source, Paddle inference artifacts, and upstream release context. The key finding was that the checkpoint is not headless: the trained head weights are present under `enc_*` names, and the local transformers implementation explicitly declares `decoder.class_embed <-> enc_score_head` and `decoder.bbox_embed <-> enc_bbox_head` as tied/shared weight groups. That made aliasing the minimal defensible fix for GLM-OCR rather than reworking the full layout runtime.

Tests were updated only as needed for the new load path. Existing detector device-selection tests now stub the config and prepared state-dict helpers, and a focused unit test verifies that `_prepare_pp_doclayout_state_dict` aliases encoder-head weights into the decoder-head keys expected by the object-detection wrapper.
Validation also included a real self-hosted pipeline run over local PDFs, where the old missing decoder-head load report disappeared and processing completed successfully after the fix.
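The aliasing idea behind `_prepare_pp_doclayout_state_dict` can be sketched as a key-remapping pass over the loaded state dict. This is a simplified illustration under the assumption that the tied-weight key layout is as described above; the real helper's exact patterns may differ:

```python
import re

# Tied-weight groups declared by the transformers PP-DocLayoutV3 code:
# decoder.class_embed <-> enc_score_head, decoder.bbox_embed <-> enc_bbox_head.
_ALIAS_RULES = [
    (re.compile(r"^model\.enc_score_head\.(.*)$"),
     r"model.decoder.class_embed.\1"),
    (re.compile(r"^model\.enc_bbox_head\.layers\.(.*)$"),
     r"model.decoder.bbox_embed.layers.\1"),
]

def alias_pp_doclayout_heads(state_dict):
    """Copy tied encoder-head weights onto the decoder-head key names
    expected by PPDocLayoutV3ForObjectDetection, keeping originals intact."""
    aliased = dict(state_dict)
    for key, tensor in state_dict.items():
        for pattern, template in _ALIAS_RULES:
            if pattern.match(key):
                # setdefault: never clobber a decoder key that already exists.
                aliased.setdefault(pattern.sub(template, key), tensor)
    return aliased
```

The remapped dict would then be passed as `state_dict=` to `from_pretrained`, so the wrapper sees the decoder heads it expects instead of initializing them fresh.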
Follow up the PP-DocLayoutV3 checkpoint aliasing fix by switching the layout detector from `PPDocLayoutV3ImageProcessorFast` to `PPDocLayoutV3ImageProcessor`. Under the current transformers 5.4.0 runtime, the Fast-suffixed processor emits a deprecation warning on every worker startup even though the rest of the layout path is functioning correctly.

This change is intentionally narrow. It does not alter the checkpoint aliasing logic, model loading strategy, layout post-processing, or device-selection behavior. The only production change is to use the non-deprecated image-processor entry point that transformers now expects. Tests were updated only where the detector startup path mocks the image processor loader.

The need for this cleanup was confirmed by a real self-hosted OCR pipeline run after the head-aliasing fix landed. That run showed successful PP-DocLayoutV3 startup and processing, but still printed the deprecation warning telling callers to use `PPDocLayoutV3ImageProcessor` instead of the Fast variant. Replacing the import and the matching test patches removes that remaining startup warning without widening the scope of the loader fix.

Validation included re-running the focused detector test slice covering detector device selection and PP-DocLayout decoder-head aliasing, which passed after the rename. A subsequent real OCR pipeline run on local PDFs also started and processed documents without the previous deprecation warning, confirming that the cleanup behaves correctly in the actual self-hosted path.
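Assuming the detector imports the processor class directly (the actual loader may resolve it differently), the production change amounts to:

```diff
- from transformers import PPDocLayoutV3ImageProcessorFast
+ from transformers import PPDocLayoutV3ImageProcessor
```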
@JaredforReal note that my PRs are stacked and not based directly on
For #180 specifically, I spent some time ruling out user/config error vs an actual GLM-OCR integration issue. What I found is that under the current [...] Minimal symptom: [...] After digging into the checkpoint and local [...]
That is what #180 fixes:
I opened the upstream issue with the evidence here: [...]

So from my side this does not look like just a local environment mistake; it looks like a real PP-DocLayoutV3 HF checkpoint/load-path compatibility bug that GLM-OCR needs to bridge explicitly.
### Summary
Fixes #179
- Alias the checkpoint's `enc_*` detection-head weights to the decoder-head names expected by `PPDocLayoutV3ForObjectDetection`, so `transformers` no longer initializes missing decoder detection heads
- Replace deprecated `PPDocLayoutV3ImageProcessorFast` usage with `PPDocLayoutV3ImageProcessor`

### Why
The published `PaddlePaddle/PP-DocLayoutV3_safetensors` checkpoint stores trained detection-head weights under:

- `model.enc_score_head.*`
- `model.enc_bbox_head.layers.*`

but the object-detection wrapper used by GLM-OCR expects:

- `model.decoder.class_embed.*`
- `model.decoder.bbox_embed.layers.*`

Without aliasing those keys before model load, the decoder detection heads are treated as missing and newly initialized, which degrades layout detection in practice.
The runtime also still used the deprecated `PPDocLayoutV3ImageProcessorFast` entry point, which produced a warning on every worker startup under transformers 5.4.0.

### Validation
`pytest glmocr/tests/test_unit.py -k "detector_device_selection or detector_prepares_pp_doclayout_decoder_head_aliases"`

### Scope
This PR intentionally keeps the fix narrow:
Those can be handled separately if needed.