PP-DocLayoutV3 should reliably preserve printed page numbers as number for scientific PDFs #168

@VooDisss

@JaredforReal I investigated this further from the GLM-OCR SDK side. Core model/layout support no longer appears to be the main question; the decision maintainers need to make now is about the output contract.

I was able to get the SDK to preserve number regions from PP-DocLayoutV3, OCR those regions, and derive structured printed page metadata such as:

  • page_number_candidates
  • document_page_numbering
  • page_metadata

For example, the extra metadata layer can look like:

{
  "page_number_candidates": [
    {
      "page_index": 1,
      "label": "number",
      "content": "22",
      "layout_index": 0,
      "bbox_2d": [93, 26, 120, 41],
      "layout_score": 0.77,
      "numeric_like": true,
      "roman_like": false
    }
  ],
  "document_page_numbering": {
    "strategy": "visual_sequence",
    "confidence": 1.0,
    "sequence_type": "arabic",
    "page_offset": 21,
    "candidate_pages": 4
  },
  "page_metadata": [
    {
      "page_index": 1,
      "printed_page_label": "22",
      "printed_page_block_index": 0,
      "printed_page_bbox_2d": [93, 26, 120, 41],
      "printed_page_confidence": 0.77
    }
  ]
}
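
As a rough illustration of how `document_page_numbering` could be derived from `page_number_candidates`, here is a minimal sketch. It assumes that `page_offset` is the printed number minus the file `page_index` (consistent with the example above, where 22 - 1 = 21), that `confidence` is the share of candidates agreeing on the winning offset, and that `candidate_pages` counts distinct pages with a usable candidate. The function name `infer_page_offset` is hypothetical and not part of the SDK, and roman numerals are skipped for brevity.

```python
from collections import Counter


def infer_page_offset(candidates):
    """Majority-vote the offset between printed page numbers and file page indices.

    Hypothetical helper: the field semantics are assumptions, not the SDK's contract.
    """
    offsets = Counter()
    pages = set()
    for cand in candidates:
        text = str(cand.get("content", "")).strip()
        # Only plain ASCII digits are handled here; roman numerals and OCR noise are skipped.
        if cand.get("label") != "number" or not (text.isascii() and text.isdigit()):
            continue
        offsets[int(text) - cand["page_index"]] += 1
        pages.add(cand["page_index"])
    if not offsets:
        return None
    offset, votes = offsets.most_common(1)[0]
    return {
        "strategy": "visual_sequence",
        "confidence": votes / sum(offsets.values()),
        "sequence_type": "arabic",
        "page_offset": offset,
        "candidate_pages": len(pages),
    }
```

With the single candidate shown above (`page_index` 1, `content` "22"), this sketch yields a `page_offset` of 21.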

The main open design question is now:

Would it be acceptable for the saved paper.json output to become a top-level wrapped object so that it can carry this metadata, or would that be considered too breaking for downstream SDK users?

Concretely, the current output shape is effectively:

[
  [...page 0 blocks...],
  [...page 1 blocks...]
]

A metadata-friendly wrapped shape would be:

{
  "json_result": [...],                // the existing OCR/layout block output, grouped by file page index
  "page_number_candidates": [...],     // raw `number` region evidence found on pages, with OCRed content and bbox/score info
  "document_page_numbering": {...},    // document-level inference, e.g. sequence type and inferred page offset
  "page_metadata": [...]               // selected per-page printed page labels derived from the candidates
}
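
To give a sense of what the change would cost downstream consumers, here is a minimal sketch of a reader that accepts both the current list shape and the wrapped shape. The key names come from the wrapped example above; the function name `normalize_paper_json` and the choice of empty defaults are assumptions for illustration only.

```python
import json


def normalize_paper_json(data):
    """Return the wrapped shape regardless of which shape was saved.

    Hypothetical helper: accepts either the current list-of-pages output
    or the proposed wrapped object.
    """
    wrapped = {
        "json_result": [],
        "page_number_candidates": [],
        "document_page_numbering": None,
        "page_metadata": [],
    }
    if isinstance(data, list):
        # Current shape: a list of per-page block lists, no metadata.
        wrapped["json_result"] = data
        return wrapped
    if isinstance(data, dict) and "json_result" in data:
        # Proposed shape: copy over whatever keys are present.
        wrapped.update(data)
        return wrapped
    raise ValueError("unrecognized paper.json shape")


def load_paper_json(path):
    """Read a saved paper.json file and normalize it."""
    with open(path, encoding="utf-8") as fh:
        return normalize_paper_json(json.load(fh))
```

A consumer that only needs the blocks can then read `normalize_paper_json(data)["json_result"]` without caring which shape was written.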

So I would like maintainer guidance on which direction is acceptable:

  1. Always-wrapped paper.json

    • cleanest place to store additional metadata
    • but changes the saved output contract for downstream users
  2. Keep the current paper.json shape unchanged and save printed-page metadata separately

    • avoids breaking downstream consumers
    • but adds another artifact / sidecar JSON
  3. Keep the feature disabled by default

    • and only emit the wrapped structure when printed-page detection is explicitly enabled
    • still a contract change, but opt-in

My current feeling is that many users of scientific PDFs would benefit from having real printed page numbers available for citation-oriented workflows, especially in RAG systems. But I do not want to move forward with an output-structure change unless maintainers are comfortable with it.

So the most useful feedback for me now would be:

  • Is a wrapped JSON output acceptable for GLM-OCR SDK?
  • If not, would a separate metadata file be preferred?
  • If neither is acceptable, what output format would you prefer for exposing printed page number metadata?
