PP-DocLayoutV3 should reliably preserve printed page numbers as number for scientific PDFs #168
Description
@JaredforReal I investigated this further from the GLM-OCR SDK side and found that the core model/layout support is not really the main question anymore. The practical maintainer decision now seems to be the output contract.
I was able to get the SDK to preserve number regions from PP-DocLayoutV3, OCR those regions, and derive structured printed page metadata such as:
- page_number_candidates
- document_page_numbering
- page_metadata
For example, the extra metadata layer can look like:
{
"page_number_candidates": [
{
"page_index": 1,
"label": "number",
"content": "22",
"layout_index": 0,
"bbox_2d": [93, 26, 120, 41],
"layout_score": 0.77,
"numeric_like": true,
"roman_like": false
}
],
"document_page_numbering": {
"strategy": "visual_sequence",
"confidence": 1.0,
"sequence_type": "arabic",
"page_offset": 21,
"candidate_pages": 4
},
"page_metadata": [
{
"page_index": 1,
"printed_page_label": "22",
"printed_page_block_index": 0,
"printed_page_bbox_2d": [93, 26, 120, 41],
"printed_page_confidence": 0.77
}
]
}
The main open design question is now:
Is it acceptable for the saved paper.json output to become a top-level wrapped object so that it can carry this metadata, or would that be too breaking a change for downstream SDK users?
Concretely, the current output shape is effectively:
[
[...page 0 blocks...],
[...page 1 blocks...]
]
A metadata-friendly wrapped shape would be:
{
"json_result": [...], // the existing OCR/layout block output, grouped by file page index
"page_number_candidates": [...], // raw `number` region evidence found on pages, with OCRed content and bbox/score info
"document_page_numbering": {...}, // document-level inference, e.g. sequence type and inferred page offset
"page_metadata": [...] // selected per-page printed page labels derived from the candidates
}
So I would like maintainer guidance on which direction is acceptable:
- Always-wrapped paper.json
  - cleanest place to store additional metadata
  - but changes the saved output contract for downstream users
- Keep the current paper.json shape unchanged and save printed-page metadata separately
  - avoids breaking downstream consumers
  - but adds another artifact / sidecar JSON
- Keep the feature disabled by default
  - and only emit the wrapped structure when printed-page detection is explicitly enabled
  - still a contract change, but opt-in
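To make the cost of the contract change more concrete, here is a minimal sketch of how a downstream consumer could accept both shapes if the wrapped form were adopted. The function name `load_ocr_output` is hypothetical and not part of the GLM-OCR SDK; the key names are the ones proposed in this issue.

```python
from typing import Any


def load_ocr_output(raw: Any) -> dict:
    """Normalize either paper.json shape into the wrapped form.

    Hypothetical helper, not part of the GLM-OCR SDK.
    """
    if isinstance(raw, list):
        # Current shape: a bare list of per-page block lists.
        return {
            "json_result": raw,
            "page_number_candidates": [],
            "document_page_numbering": None,
            "page_metadata": [],
        }
    if isinstance(raw, dict) and "json_result" in raw:
        # Proposed wrapped shape: fill in any metadata keys that are absent.
        wrapped = dict(raw)
        wrapped.setdefault("page_number_candidates", [])
        wrapped.setdefault("document_page_numbering", None)
        wrapped.setdefault("page_metadata", [])
        return wrapped
    raise ValueError("unrecognized paper.json shape")
```

A consumer would call this on the result of `json.load(...)` and read `json_result` regardless of which shape was saved. Consumers that index the top-level list directly would still break, which is the concern behind the second and third options.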
My current feeling is that many users of scientific PDFs would benefit from having real printed page numbers available for citation-oriented workflows, especially in RAG systems. But I do not want to move forward with an output-structure change unless maintainers are comfortable with it.
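As a rough illustration of the citation use case, the sketch below shows one way a consumer might turn the proposed metadata into a printed page label for any file page. It assumes the offset is simply the printed number minus the file page index (22 - 1 = 21 in the example values) and picks the most common offset by majority vote. That is an assumption for illustration, not necessarily how the `visual_sequence` strategy works, and both function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def infer_page_offset(candidates: list[dict]) -> Optional[int]:
    """Majority vote over (printed number - file page index).

    Assumed inference rule for illustration only.
    """
    votes = Counter(
        int(c["content"]) - c["page_index"]
        for c in candidates
        if c.get("numeric_like") and c["content"].isdecimal()
    )
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


def printed_label(
    page_index: int, page_metadata: list[dict], offset: Optional[int]
) -> Optional[str]:
    """Prefer a detected label; otherwise fall back to the inferred offset."""
    for entry in page_metadata:
        if entry["page_index"] == page_index:
            return entry["printed_page_label"]
    if offset is not None:
        return str(page_index + offset)
    return None
```

With this, a RAG pipeline could attach a label such as "p. 22" to a retrieved chunk from file page 1, and still produce a plausible label for pages where no `number` region was detected.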
So the most useful feedback for me now would be:
- Is a wrapped JSON output acceptable for GLM-OCR SDK?
- If not, would a separate metadata file be preferred?
- If neither is acceptable, what output format would you prefer for exposing printed page number metadata?