Merged
- Implement dynamic registration of Tracker in pipeline for immediate results without waiting for t1 and t2 to finish
… memory release performance
…dle empty mask crops and prevent crashes in cv2.resize
…ed related methods to return visualization images directly
…erriding config values
…e methods across parser result classes and pipeline
…ve tempfile dependence
…output to log directory
…et connectivity check
…le crashed processes
…ls before checking for empty content
…l for improved logging
…a method to free per-page data after processing
…ing formulas, text blocks, and formatting bullet points
…stent output order
JaredforReal
approved these changes
Mar 27, 2026
Summary
This PR delivers a major refactor of the Pipeline framework, along with multi-GPU deployment support, performance and memory optimizations, Layout module improvements, post-processing enhancements, and broader input/output capabilities.
The main goals of this change are:
Pipeline framework refactor
Refactored the Pipeline architecture around scheduling logic, thread coordination, and runtime stability.
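As a rough illustration of the thread-coordination pattern this refactor relies on, the sketch below shows how a shared threading.Event can let a crashing worker signal every other worker to stop. It is not the PR's code: run_workers, the two-stage producer/consumer layout, and the fail_on parameter are hypothetical names chosen for the example.

```python
import queue
import threading


def run_workers(items, fail_on=None, timeout=0.05):
    """Run a producer and a consumer thread; stop both if either one raises.

    Hypothetical sketch: `fail_on` injects a failure for demonstration only.
    """
    shutdown = threading.Event()  # set by any worker that crashes
    errors = []                   # exceptions collected from workers
    results = []
    q = queue.Queue()
    done = object()               # sentinel marking the end of input

    def guarded(fn):
        # Wrap a worker so an exception is recorded and broadcast
        # through the shutdown event instead of being lost.
        def wrapper():
            try:
                fn()
            except Exception as exc:
                errors.append(exc)
                shutdown.set()
        return wrapper

    def producer():
        for item in items:
            if shutdown.is_set():
                return
            if item == fail_on:
                raise RuntimeError(f"failed on {item!r}")
            q.put(item)
        q.put(done)

    def consumer():
        # Poll with a timeout so the loop can notice the shutdown event
        # even when the queue stays empty.
        while not shutdown.is_set():
            try:
                item = q.get(timeout=timeout)
            except queue.Empty:
                continue
            if item is done:
                return
            results.append(item * 2)

    threads = [threading.Thread(target=guarded(f)) for f in (producer, consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

The key design point is that blocking calls use a timeout, so no worker can wait forever on a queue whose producer has already died.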
Highlights
Split the original ~700-line pipeline.py into 5 focused modules:
pipeline.py: top-level scheduling
_common.py: constants and utility functions
_state.py: shared state across threads
_unit_tracker.py: completion tracking
_workers.py: three worker implementations
This makes the pipeline easier to test, maintain, and evolve.
New features
shutdown event (threading.Event) so that if any worker crashes, the remaining workers and the main thread can detect the failure and exit cleanly
Changes
enable_layout); Layout analysis is now the default pipeline path
Bug fixes
image-type blocks were added to the pending list and not returned until all files were fully processed
t1/t2
Multi-GPU parallel deployment
Added examples/multi-gpu-deploy with a complete tutorial and one-click multi-GPU processing workflow.
New features
Added a full multi-GPU deployment framework:
launch.py: entry point
gpu_utils.py: GPU detection, port allocation, and file sharding
engine.py: sglang/vLLM engine startup and health checks
worker.py: single-GPU processing unit
coordinator.py: global scheduling, progress monitoring, and graceful shutdown
Automatically detects available GPUs based on free VRAM
Dynamically allocates ports while skipping occupied ones
Evenly shards files across GPUs
Adds no communication overhead between GPUs
Automatically skips failed GPU instances and redistributes files to healthy GPUs
Adds coordinator-level progress monitoring and engine process polling, with immediate worker termination when an engine crash is detected
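The sharding, port allocation, and redistribution behavior listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of gpu_utils.py or coordinator.py: shard_files, port_is_free, allocate_ports, and redistribute are hypothetical names, and the starting port 30000 is arbitrary.

```python
import socket


def shard_files(files, num_gpus):
    """Round-robin split, so shard sizes differ by at most one file."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return [files[i::num_gpus] for i in range(num_gpus)]


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound right now (inherently racy)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def allocate_ports(count, start=30000, is_free=port_is_free):
    """Scan upward from `start`, skipping ports that are already occupied."""
    ports = []
    port = start
    while len(ports) < count:
        if port > 65535:
            raise RuntimeError("ran out of ports")
        if is_free(port):
            ports.append(port)
        port += 1
    return ports


def redistribute(shards, failed):
    """Move the files of failed GPU indices onto the healthy ones, round-robin."""
    healthy = [i for i in range(len(shards)) if i not in failed]
    if not healthy:
        raise RuntimeError("no healthy GPU instances left")
    out = {i: list(shards[i]) for i in healthy}
    orphaned = [f for i in sorted(failed) for f in shards[i]]
    for k, f in enumerate(orphaned):
        out[healthy[k % len(healthy)]].append(f)
    return out
```

Because each shard is fixed before processing starts, the GPU instances never need to exchange data; only the coordinator has to know which instance owns which files.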
Performance and memory optimizations
Optimized the system for large-scale batch processing with a focus on memory pressure and unnecessary I/O.
Changes
release_unit_data() to release per-page recognition and layout results immediately after a unit is yielded
_recognition_results list to reduce memory usage
save() cropping by reusing in-memory PIL.Image objects
tempfile dependencies by passing visualization images directly in memory as PIL.Image objects
Layout module improvements
Improved Layout-related functionality and fixed several stability issues.
New features
Bug fixes
Fully fixed the empty-mask cropping crash that triggered the cv2.resize !ssize.empty() assertion
Fixed multi-page Layout visualization filenames to use page index within each file instead of a global page counter
Fixed incorrect filtering of image-type regions by the empty-content check
Post-processing improvements
Improved post-processing behavior for recognition results.
New features
Added normalize_inline_formula() to:
$...$ ($ x $ → $x$)
Added three independent post-processing switches:
enable_merge_formula_numbers
enable_merge_text_blocks
enable_format_bullet_points
These options make it possible to balance Markdown readability against evaluation metrics.
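The $ x $ → $x$ normalization mentioned above can be approximated with a single regular expression. The sketch below is not the PR's normalize_inline_formula(); strip_inline_math_padding is a hypothetical stand-in that assumes every single $ in the text delimits inline math (so currency amounts such as "$5 and $6" would be mishandled) and that $$...$$ display math should be left untouched.

```python
import re

# A single "$" that is not part of "$$", then the shortest run of
# non-"$" characters on one line, then another single "$".
_INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)([^$\n]+?)\$(?!\$)")


def strip_inline_math_padding(text):
    """Remove whitespace just inside inline math delimiters."""
    def fix(match):
        inner = match.group(1).strip()
        if not inner:
            # Leave "$ $" alone; collapsing it would produce "$$".
            return match.group(0)
        return f"${inner}$"

    return _INLINE_MATH.sub(fix, text)
```

For example, strip_inline_math_padding("a $ x + y $ b") yields "a $x + y$ b", while "$$ x $$" passes through unchanged.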
Changes
<table>...</table> blocks as-is to avoid breaking HTML structures during duplicate-content checks
Input / output improvements
Expanded input/output support and improved usability.
New features
Added bytes input support to parse(), with automatic file type detection via magic bytes for:
Supports mixed inputs of file paths and raw bytes
Added {name}_model.json to store raw Recognition model outputs before post-processing for easier debugging and comparison
Replaced glob with rglob for recursive CLI directory scanning
Added cropped image path fields to JSON output
Implemented a 5-level configuration priority system:
--set
.env
Added preserve_order (default True) to let users choose between preserving original input order or yielding results in completion order
Changes
bytes are processed fully in memory
tempfile usage
pypdfium2 to PyMuPDF for better performance and memory behavior under large batch workloads
PageLoader methods
preserve_order available to override
Bug fixes
pypdfium2 memory release issues under large-batch workloads that could lead to PDF loading failures
ocr_client where .strip() was called when output was None
Impact
This PR substantially improves: