Add printed page number metadata to SDK JSON output #171

Open
VooDisss wants to merge 2 commits into zai-org:main from VooDisss:printed-page-clean

VooDisss commented Mar 30, 2026

Fixes: #170 and #168 (was just planning)

Summary

  • route PP-DocLayoutV3 number regions through OCR instead of dropping them
  • extract printed page evidence from recognized number blocks and expose page_number_candidates, document_page_numbering, and page_metadata
  • keep final general json_result blocks lean and wrap saved paper.json only when real printed-page data exists
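The save contract in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_paper_payload` is a hypothetical helper, and the `json_result` key used inside the wrapper is an assumption. Only `page_number_candidates`, `document_page_numbering`, and `page_metadata` are names taken from the description above.

```python
from typing import Any, Dict, List, Optional, Union

Block = Dict[str, Any]


def build_paper_payload(
    blocks: List[Block],
    page_number_candidates: List[Dict[str, Any]],
    document_page_numbering: Optional[Dict[str, Any]],
    page_metadata: List[Dict[str, Any]],
    enabled: bool = True,
) -> Union[List[Block], Dict[str, Any]]:
    """Return the payload to save as paper.json (hypothetical helper)."""
    # Feature off, or feature on with no hits: keep the legacy flat output.
    if not enabled or not page_number_candidates:
        return blocks
    # Real printed-page data exists: wrap the blocks and attach the
    # three metadata layers. The "json_result" key is an assumption.
    return {
        "json_result": blocks,
        "page_number_candidates": page_number_candidates,
        "document_page_numbering": document_page_numbering,
        "page_metadata": page_metadata,
    }
```

Returning the untouched block list in the two legacy cases means existing consumers of the flat `paper.json` should see no change unless printed-page data was actually found.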

Details

  • keep extraction strict: native_label == "number" only
  • support both numeric and Roman numeral page labels
  • preserve legacy flat paper.json when the feature is disabled or when no printed-page candidates are found
  • keep MaaS and self-hosted output contracts aligned
  • expose the feature via constructor, environment variable, and YAML config
  • document the exact save contract in both README files
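To illustrate the first two bullets, here is a rough sketch of strict, number-only extraction that accepts both numeric and Roman labels. The function names, the `content` field, and the shape of each candidate are assumptions made for the example and may differ from what the PR emits. Only the `native_label == "number"` check comes from the description.

```python
import re
from typing import Any, Dict, List, Optional

# Strict Roman numeral grammar (1..3999), matched case-insensitively.
_ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> Optional[int]:
    """Convert a well-formed Roman numeral to an int, else return None."""
    s = text.strip().upper()
    if not s or not _ROMAN_RE.match(s):
        return None
    total = 0
    # Subtract a symbol when a larger one follows it (IV, IX, XL, ...).
    for ch, nxt in zip(s, s[1:] + " "):
        value = _ROMAN_VALUES[ch]
        total += -value if _ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def parse_page_label(text: str) -> Optional[Dict[str, Any]]:
    """Parse a printed page label as ASCII digits or a Roman numeral."""
    s = text.strip()
    if re.fullmatch(r"[0-9]+", s):
        return {"label": s, "value": int(s), "style": "arabic"}
    value = roman_to_int(s)
    if value is not None:
        return {"label": s, "value": value, "style": "roman"}
    return None


def extract_candidates(blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect candidates from blocks whose native_label is exactly "number"."""
    candidates = []
    for block in blocks:
        if block.get("native_label") != "number":
            continue  # strict: no other label is considered
        parsed = parse_page_label(str(block.get("content", "")))  # "content" is assumed
        if parsed is not None:
            candidates.append(parsed)
    return candidates
```

Under these assumptions, a `number` block containing `xiv` yields a Roman candidate with value 14, while a `text` block containing `12` is ignored because its label is not `number`.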

Validation

  • contract-focused unit tests cover:
    • feature off -> legacy flat paper.json
    • feature on, no hits -> legacy flat paper.json
    • feature on, hits -> wrapped paper.json
    • lean json_result output without broad layout_index / layout_score leakage
    • Roman numeral acceptance
    • MaaS parity
  • real OCR validation confirmed:
    • native_label: "number" blocks survive into final json_result
    • printed page labels are extracted into page_metadata
    • document-level numbering metadata is emitted when real data exists

Route PP-DocLayoutV3 'number' regions through OCR instead of dropping them, then extract printed page number evidence from recognized number blocks in the result formatter. Preserve the feature as number-only, support both numeric and Roman labels, and derive three explicit output layers: page_number_candidates, document_page_numbering, and page_metadata.

Keep the general json_result contract lean by stripping transient layout_index and layout_score fields from final blocks while retaining native_label, and wrap saved paper.json only when real printed-page data exists. Expose detect_printed_page_numbers through config, constructor, and environment overrides, and align MaaS output with self-hosted behavior. Add contract-focused tests for legacy-vs-wrapped save behavior and lean json_result output, and document the exact save contract in the English and Chinese READMEs.
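The "lean json_result" step described above could look roughly like this. It is a sketch only: `to_lean_blocks` is a hypothetical name, and the example assumes blocks are plain dicts. The field names `layout_index`, `layout_score`, and `native_label` are the ones mentioned in the commit message.

```python
from typing import Any, Dict, List

# Transient fields that should not leak into final json_result blocks.
TRANSIENT_FIELDS = ("layout_index", "layout_score")


def to_lean_blocks(blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the blocks without transient layout fields."""
    # Build new dicts so the caller's blocks are not mutated; every other
    # key, including native_label, is kept as-is.
    return [
        {key: value for key, value in block.items() if key not in TRANSIENT_FIELDS}
        for block in blocks
    ]
```

Copying rather than deleting keys in place keeps the transient fields available to earlier pipeline stages that may still need them.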
@VooDisss

@JaredforReal please check it out

Development

Successfully merging this pull request may close these issues.

Preserve printed page numbers from PP-DocLayoutV3 number regions in SDK output
