@schen1102
This PR refactors the sort_text_lines function in marker/util.py to improve robustness, readability, and accuracy when sorting text lines in document layouts.

✅ Changes Made:

  • Rewrote sort_text_lines to:
    • Group lines into rows based on vertical proximity using the line center Y-coordinate.
    • Sort text lines within each row from left to right.
  • Introduced helper functions for better modularity and testability:
    • _calculate_row_tolerance
    • _center_y
    • _group_lines_into_rows
    • _sort_and_flatten_rows
  • Defined DEFAULT_ROW_TOLERANCE_FACTOR, a factor applied to the median line height to derive an adaptive row-grouping tolerance.
  • Improved type annotations and added detailed docstrings.
  • Minor code style updates and formatting for consistency.

🎯 Why This Change?

The previous implementation grouped lines by rounding their y-coordinates directly, which could split or merge rows incorrectly when line heights vary. This refactor makes the function:

  • More accurate in grouping nearby lines into their true "rows"
  • Easier to understand, extend, and test
  • Adaptive to documents with variable line spacing
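
To make the rounding problem concrete, here is a small hypothetical example. The previous code is not shown in this conversation, so `rounded_row_key` is only an assumed stand-in for a "round the y-coordinate to a bucket" strategy, and the tolerance value is arbitrary.

```python
def rounded_row_key(y: float, tolerance: float = 10.0) -> float:
    """Hypothetical bucket key: snap y to the nearest multiple of tolerance."""
    return round(y / tolerance) * tolerance


# Two lines whose tops differ by only 0.2 units but straddle a bucket boundary.
top_a, top_b = 14.9, 15.1

key_a = rounded_row_key(top_a)
key_b = rounded_row_key(top_b)

# Rounding puts them in different buckets, i.e. different "rows".
split_by_rounding = key_a != key_b

# A direct distance check against a tolerance keeps them together.
same_row_by_distance = abs(top_a - top_b) <= 10.0
```

Bucket boundaries are fixed by the rounding step, so two nearly aligned lines can land on opposite sides of one. Comparing distances between line centers avoids fixed boundaries, and scaling the tolerance by the median line height lets it follow the document's line size.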

@github-actions
Contributor

github-actions bot commented Jul 17, 2025

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@schen1102
Author

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Jul 17, 2025
@schen1102
Author

recheck

…enter alignment

- Refactored `sort_text_lines` to use a more robust method for sorting text lines
  in reading order (top-to-bottom, left-to-right).
- Introduced helper functions to:
  - Compute vertical center of lines (`_center_y`)
  - Group lines into rows using a vertical tolerance derived from the median line height
  - Sort text within rows horizontally and flatten result
- Added `_calculate_row_tolerance`, which derives the row tolerance dynamically from the median line height
- Defined `DEFAULT_ROW_TOLERANCE_FACTOR` constant for configurable grouping sensitivity
- Improved overall code style, spacing, and docstrings for clarity and maintainability