PP-DocLayoutV3 should reliably preserve printed page numbers as `number` for scientific PDFs #168

@JaredforReal I investigated this further from the GLM-OCR SDK side. Core model/layout support is no longer the main question; the decision that needs maintainer input is the output contract.

I was able to get the SDK to preserve `number` regions from PP-DocLayoutV3, OCR those regions, and derive structured printed page metadata such as:

- `page_number_candidates`
- `document_page_numbering`
- `page_metadata`

For example, the extra metadata layer can look like:

```json
{
  "page_number_candidates": [
    {
      "page_index": 1,
      "label": "number",
      "content": "22",
      "layout_index": 0,
      "bbox_2d": [93, 26, 120, 41],
      "layout_score": 0.77,
      "numeric_like": true,
      "roman_like": false
    }
  ],
  "document_page_numbering": {
    "strategy": "visual_sequence",
    "confidence": 1.0,
    "sequence_type": "arabic",
    "page_offset": 21,
    "candidate_pages": 4
  },
  "page_metadata": [
    {
      "page_index": 1,
      "printed_page_label": "22",
      "printed_page_block_index": 0,
      "printed_page_bbox_2d": [93, 26, 120, 41],
      "printed_page_confidence": 0.77
    }
  ]
}
```
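
To make the relationship between `page_number_candidates` and `document_page_numbering.page_offset` concrete, here is a minimal sketch of one way the offset could be inferred. It is not the actual implementation: `infer_page_offset` is a hypothetical helper, and the majority-vote rule and the way confidence is computed are assumptions made for illustration.

```python
from collections import Counter


def infer_page_offset(candidates):
    """Guess the printed-page offset from Arabic-numeral candidates.

    Hypothetical helper: offset = printed number - file page index,
    chosen by majority vote. Returns (offset, confidence), or
    (None, 0.0) when no usable candidate exists.
    """
    offsets = [
        int(c["content"]) - c["page_index"]
        for c in candidates
        if c.get("numeric_like") and c["content"].strip().isdecimal()
    ]
    if not offsets:
        return None, 0.0
    # The most frequent offset wins; confidence is its share of the votes
    # (an assumed definition, not necessarily what the SDK would use).
    offset, votes = Counter(offsets).most_common(1)[0]
    return offset, votes / len(offsets)
```

With the single candidate shown above (`page_index` 1, `content` "22"), this yields an offset of 21, which matches the `page_offset` in the example.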

The main open design question is now:

**Is it acceptable for the saved `paper.json` to become a top-level wrapped object in order to carry this metadata, or would that be too breaking for downstream SDK users?**

Concretely, the current output shape is effectively:

```json
[
  [...page 0 blocks...],
  [...page 1 blocks...]
]
```

A metadata-friendly wrapped shape would be:

```json
{
  "json_result": [...],                // the existing OCR/layout block output, grouped by file page index
  "page_number_candidates": [...],     // raw `number` region evidence found on pages, with OCRed content and bbox/score info
  "document_page_numbering": {...},    // document-level inference, e.g. sequence type and inferred page offset
  "page_metadata": [...]               // selected per-page printed page labels derived from the candidates
}
```
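
For downstream users, the cost of the wrapped shape depends on how easily a reader can accept both forms. The sketch below shows one possible tolerant loader; `normalize_paper_json` and `load_paper_json` are hypothetical names, not part of the GLM-OCR SDK, and the defaults for missing keys are assumptions.

```python
import json


def normalize_paper_json(data):
    """Return the wrapped shape regardless of which shape was saved.

    Hypothetical helper: a bare list is treated as the current
    (legacy) output, a dict with "json_result" as the wrapped output.
    """
    if isinstance(data, list):
        data = {"json_result": data}
    elif not (isinstance(data, dict) and "json_result" in data):
        raise ValueError("unrecognized paper.json shape")
    return {
        "json_result": data["json_result"],
        "page_number_candidates": data.get("page_number_candidates", []),
        "document_page_numbering": data.get("document_page_numbering"),
        "page_metadata": data.get("page_metadata", []),
    }


def load_paper_json(path):
    """Read paper.json from disk and normalize it."""
    with open(path, encoding="utf-8") as f:
        return normalize_paper_json(json.load(f))
```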
So I would like maintainer guidance on which direction is acceptable:

1. **Always-wrapped `paper.json`**
   - cleanest place to store additional metadata
   - but changes the saved output contract for downstream users

2. **Keep the current `paper.json` shape unchanged and save printed-page metadata separately**
   - avoids breaking downstream consumers
   - but adds another artifact / sidecar JSON

3. **Keep the feature disabled by default**
   - emit the wrapped structure only when printed-page detection is explicitly enabled
   - still a contract change, but opt-in
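
To compare the three options side by side, here is a rough sketch of how a save step could support each of them. Everything in it is an assumption for discussion: the function name `save_results`, the `mode` values, and the sidecar filename `paper.page_numbers.json` do not exist in the SDK today.

```python
import json
from pathlib import Path


def save_results(out_dir, blocks, page_meta=None, mode="legacy"):
    """Write OCR output and, optionally, printed-page metadata.

    Hypothetical modes:
      "legacy"  -> paper.json stays a bare list (metadata is dropped)
      "wrapped" -> paper.json becomes {"json_result": ..., **page_meta}
      "sidecar" -> paper.json stays a bare list, metadata goes to a
                   separate file
    Returns the names of the files written.
    """
    if mode not in ("legacy", "wrapped", "sidecar"):
        raise ValueError(f"unknown mode: {mode}")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    wrap = mode == "wrapped" and page_meta is not None
    payload = {"json_result": blocks, **page_meta} if wrap else blocks
    paper = out / "paper.json"
    paper.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    written = [paper.name]

    if mode == "sidecar" and page_meta is not None:
        sidecar = out / "paper.page_numbers.json"  # assumed filename
        sidecar.write_text(json.dumps(page_meta, ensure_ascii=False, indent=2), encoding="utf-8")
        written.append(sidecar.name)
    return written
```

In this sketch, option 1 corresponds to always calling with `mode="wrapped"`, option 2 to `mode="sidecar"`, and option 3 to defaulting to `mode="legacy"` and switching to `"wrapped"` only when printed-page detection is enabled.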

My current feeling is that many users of scientific PDFs would benefit from having real printed page numbers available for citation-oriented workflows, especially in RAG systems. But I do not want to move forward with an output-structure change unless maintainers are comfortable with it.

So the most useful feedback for me now would be:

- Is a wrapped JSON output acceptable for GLM-OCR SDK?
- If not, would a separate metadata file be preferred?
- If neither is acceptable, what output format would you prefer for exposing printed page number metadata?

