
Reconstruct pipeline #167

Merged
JaredforReal merged 39 commits into zai-org:main from Ataraxy33:reconstruct-pipeline
Mar 27, 2026

@Ataraxy33
Contributor

Summary

This PR delivers a major refactor of the Pipeline framework, along with multi-GPU deployment support, performance and memory optimizations, Layout module improvements, post-processing enhancements, and broader input/output capabilities.

The main goals of this change are:

  • improve pipeline maintainability through modularization
  • increase batch throughput and runtime stability
  • support scalable multi-GPU deployment
  • reduce memory usage and redundant I/O
  • improve robustness across Layout, OCR, and post-processing workflows

Pipeline framework refactor

Refactored the Pipeline architecture around scheduling logic, thread coordination, and runtime stability.

Highlights

  • Split the original ~700-line pipeline.py into 5 focused modules:

    • pipeline.py: top-level scheduling
    • _common.py: constants and utility functions
    • _state.py: shared state across threads
    • _unit_tracker.py: completion tracking
    • _workers.py: three worker implementations

This makes the pipeline easier to test, maintain, and evolve.

New features

  • Added asynchronous Pipeline invocation for CLI directory processing, replacing serial execution with a unified async pipeline flow for significantly higher batch throughput
  • Added a global shutdown event (threading.Event) so that if any worker crashes, the remaining workers and the main thread can detect the failure and exit cleanly
  • Added a watchdog health-monitoring thread that probes the OCR API port every 5 seconds via socket; if the service goes down, it immediately triggers shutdown to avoid deadlocks and cascading request failures
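
The shutdown event and watchdog described above can be illustrated with a minimal sketch. This is not the code from this PR: the names `port_is_open`, `watchdog`, `run_worker`, and the `errors` list are hypothetical, and only the use of `threading.Event`, a socket probe, and the 5-second interval come from the description.

```python
import socket
import threading


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watchdog(host, port, shutdown, interval=5.0):
    """Probe the port every `interval` seconds; set `shutdown` if it is down."""
    while not shutdown.is_set():
        if not port_is_open(host, port):
            shutdown.set()
            return
        # Event.wait doubles as an interruptible sleep.
        shutdown.wait(interval)


def run_worker(shutdown, tasks, errors):
    """Run tasks until finished or shutdown; a crash signals every other thread."""
    try:
        for task in tasks:
            if shutdown.is_set():
                return
            task()
    except Exception as exc:  # hypothetical policy: record the error, then signal
        errors.append(exc)
        shutdown.set()
```

Because every loop checks the same event, a crash in one worker or a failed probe lets all other threads leave their loops instead of blocking on work that will never arrive.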

Changes

  • Removed the end-to-end OCR path (enable_layout); Layout analysis is now the default pipeline path

Bug fixes

  • Fixed a blocking issue where image-type blocks were added to the pending list and not returned until all files were fully processed
  • Fixed the issue where a single worker crash could leave the remaining workers and main thread blocked indefinitely
  • Fixed the issue where the Pipeline had to wait for t1 / t2 to finish before results could be returned

Multi-GPU parallel deployment

Added examples/multi-gpu-deploy with a complete tutorial and one-click multi-GPU processing workflow.

New features

  • Added a full multi-GPU deployment framework:

    • launch.py: entry point
    • gpu_utils.py: GPU detection, port allocation, and file sharding
    • engine.py: sglang / vLLM engine startup and health checks
    • worker.py: single-GPU processing unit
    • coordinator.py: global scheduling, progress monitoring, and graceful shutdown
  • Automatically detects available GPUs based on free VRAM

  • Dynamically allocates ports while skipping occupied ones

  • Evenly shards files across GPUs

  • Adds no communication overhead between GPUs

  • Automatically skips failed GPU instances and redistributes files to healthy GPUs

  • Adds coordinator-level progress monitoring and engine process polling, with immediate worker termination when an engine crash is detected
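
For readers who want a concrete picture of port allocation and file sharding, here is a small sketch. It does not reproduce `gpu_utils.py`: the function names, the starting port `30000`, and the round-robin strategy are assumptions chosen for illustration.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound, i.e. nothing else occupies it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def allocate_ports(count, start=30000, is_free=port_is_free):
    """Collect `count` free ports, scanning upward and skipping occupied ones."""
    ports = []
    port = start
    while len(ports) < count and port <= 65535:
        if is_free(port):
            ports.append(port)
        port += 1
    if len(ports) < count:
        raise RuntimeError("not enough free ports")
    return ports


def shard_files(files, num_shards):
    """Distribute files round-robin so shard sizes differ by at most one."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    shards = [[] for _ in range(num_shards)]
    for index, path in enumerate(files):
        shards[index % num_shards].append(path)
    return shards
```

Under these assumptions, redistributing work after a failed instance amounts to calling `shard_files` again on the unprocessed files with the number of healthy GPUs.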


Performance and memory optimizations

Optimized the system for large-scale batch processing with a focus on memory pressure and unnecessary I/O.

Changes

  • Added release_unit_data() to release per-page recognition and layout results immediately after a unit is yielded
  • Removed the ever-growing global _recognition_results list to reduce memory usage
  • Avoided repeated PDF reads during save() cropping by reusing in-memory PIL.Image objects
  • Removed unnecessary PIL format conversions inside the Pipeline
  • Removed all tempfile dependencies by passing visualization images directly in memory as PIL.Image objects
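
The idea behind releasing per-unit data right after it is yielded can be sketched as follows. The `UnitStore` class and its methods are hypothetical stand-ins, not the actual pipeline state; only the name `release_unit_data` is taken from this PR.

```python
class UnitStore:
    """Hypothetical per-unit result store keyed by unit id."""

    def __init__(self):
        self._results = {}

    def add(self, unit_id, page_result):
        """Record one page result for a unit."""
        self._results.setdefault(unit_id, []).append(page_result)

    def release_unit_data(self, unit_id):
        """Drop everything held for a unit so it can be garbage collected."""
        self._results.pop(unit_id, None)

    def pending(self):
        """Number of units whose data is still held in memory."""
        return len(self._results)

    def iter_units(self, unit_ids):
        """Yield (unit_id, pages) and release each unit once the consumer moves on."""
        for unit_id in unit_ids:
            try:
                yield unit_id, self._results.get(unit_id, [])
            finally:
                self.release_unit_data(unit_id)
```

With this shape, memory held by the store is bounded by the units still in flight rather than by the total number of files processed.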

Layout module improvements

Improved Layout-related functionality and fixed several stability issues.

New features

  • Added parameter control for PP-DocLayoutV3 polygon output behavior, since polygon output can introduce bad cases and slight metric regressions on regular documents

Bug fixes

  • Fully fixed the empty-mask cropping crash that triggered the cv2.resize !ssize.empty() assertion

    • implemented via monkey-patching the upstream method
    • validates box width/height before cropping
    • checks whether the cropped mask is empty
    • falls back to a rectangular bbox when needed
  • Fixed multi-page Layout visualization filenames to use page index within each file instead of a global page counter

  • Fixed incorrect filtering of image-type regions by the empty-content check
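
The guard logic listed for the empty-mask fix can be sketched in plain Python. This is an illustration only: the real fix monkey-patches an upstream method and works on image arrays, while `safe_mask_crop`, its return convention, and the list-of-lists mask used here are assumptions.

```python
def safe_mask_crop(mask, box):
    """Crop `mask` (rows of 0/1 values) to `box` = (x1, y1, x2, y2).

    Returns None for a degenerate box, ("bbox", box) when the cropped mask
    is empty and a rectangular fallback should be used, and
    ("mask", cropped) otherwise.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    x1, y1, x2, y2 = box
    # Clamp the box to the mask bounds before validating it.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x2 - x1 <= 0 or y2 - y1 <= 0:
        return None
    cropped = [row[x1:x2] for row in mask[y1:y2]]
    if not any(any(row) for row in cropped):
        # Nothing inside the polygon mask: fall back to the rectangle.
        return ("bbox", (x1, y1, x2, y2))
    return ("mask", cropped)
```

Checking the box size and the cropped mask before any resize call is what keeps an empty input from ever reaching `cv2.resize`.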


Post-processing improvements

Improved post-processing behavior for recognition results.

New features

  • Added normalize_inline_formula() to:

    • remove extra spaces inside $...$ ($ x $ → $x$)
    • ensure inline formulas are properly separated from surrounding text
    • improve Markdown rendering quality
  • Added three independent post-processing switches:

    • enable_merge_formula_numbers
    • enable_merge_text_blocks
    • enable_format_bullet_points

These options make it possible to balance Markdown readability against evaluation metrics.
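
To make the inline-formula normalization concrete, here is a regex-based sketch. It is an assumption about how such a function could work, not the implementation of `normalize_inline_formula()` in this PR; the name `normalize_inline_formula_sketch` is hypothetical, and the sketch does not try to tell currency dollar signs apart from math delimiters.

```python
import re

# A single "$" on each side; "$$...$$" display math is deliberately left alone.
_INLINE = re.compile(r"(?<!\$)\$(?!\$)([^$\n]+?)\$(?!\$)")


def normalize_inline_formula_sketch(text):
    """Trim spaces inside $...$ and separate formulas from adjacent words."""

    def fix(match):
        body = match.group(1).strip()
        if not body:
            return match.group(0)  # leave "$ $" untouched
        source = match.string
        before = source[match.start() - 1] if match.start() > 0 else ""
        after = source[match.end()] if match.end() < len(source) else ""
        left = " " if before.isalnum() else ""
        right = " " if after.isalnum() else ""
        return f"{left}${body}${right}"

    return _INLINE.sub(fix, text)
```

For example, `normalize_inline_formula_sketch("where$ y $is")` returns `"where $y$ is"`.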

Changes

  • Preserved complete <table>...</table> blocks as-is to avoid breaking HTML structures during duplicate-content checks
  • Added code fence closure correction for text content to avoid malformed Markdown rendering
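
One simple way to correct an unclosed code fence is to count fence lines and append a closing fence when the count is odd. The sketch below takes that approach as an assumption; the PR does not state how the correction is implemented, and `close_unterminated_fence` is a hypothetical name.

```python
FENCE = "`" * 3  # the three-backtick Markdown fence marker


def close_unterminated_fence(text):
    """Append a closing fence if the text ends inside an open code block."""
    inside_block = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            inside_block = not inside_block
    if not inside_block:
        return text
    separator = "" if text.endswith("\n") else "\n"
    return text + separator + FENCE
```

Text whose fences are already balanced is returned unchanged, so the function is safe to apply to every text block.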

Input / output improvements

Expanded input/output support and improved usability.

New features

  • Added bytes input support to parse(), with automatic file type detection via magic bytes for:

    • PDF
    • PNG
    • JPG
    • GIF
    • WEBP
    • BMP
  • Supports mixed inputs of file paths and raw bytes

  • Added {name}_model.json to store raw Recognition model outputs before post-processing for easier debugging and comparison

  • Replaced glob with rglob for recursive CLI directory scanning

    • supports nested subdirectories
    • automatically deduplicates inputs
    • preserves output directory structure
  • Added cropped image path fields to JSON output

  • Implemented a 5-level configuration priority system (highest priority first):

    • CLI --set
    • Python API kwargs
    • environment variables / .env
    • YAML
    • built-in defaults
  • Added preserve_order (default True) to let users choose between preserving original input order or yielding results in completion order
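
File type detection via magic bytes, as used for the bytes input to parse(), can be sketched like this. The function name `detect_file_type` and its return values are assumptions; the byte signatures themselves are the standard ones for each format.

```python
def detect_file_type(data):
    """Guess the file type from leading magic bytes; return None if unknown."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data.startswith(b"BM"):
        return "bmp"
    return None
```

Detecting the type from content rather than from a file name is what allows raw bytes and file paths to be mixed in one call.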

Changes

  • Unified input loading logic so bytes are processed fully in memory
  • Removed all remaining tempfile usage
  • Switched PDF rendering from pypdfium2 to PyMuPDF for better performance and memory behavior under large batch workloads
  • Simplified config surface area by removing unused options and deprecated PageLoader methods
  • Changed default result yielding behavior from completion order to original input order, with preserve_order available to override
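
The difference between input order and completion order can be shown with a short sketch built on `concurrent.futures`. This is not the pipeline's scheduler: `process_all` and `slow_identity` are hypothetical, and only the `preserve_order` flag and its default of True come from this PR.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_identity(delay):
    """Stand-in for per-file work: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def process_all(items, fn, preserve_order=True, max_workers=4):
    """Yield fn(item) in input order, or in completion order if requested."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, item) for item in items]
        ordered = futures if preserve_order else as_completed(futures)
        for future in ordered:
            yield future.result()
```

With `preserve_order=True` a slow early item delays everything queued behind it, whereas completion order returns each result as soon as it is ready, which is the trade-off the option exposes.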

Bug fixes

  • Fixed pypdfium2 memory release issues under large-batch workloads that could lead to PDF loading failures
  • Fixed the bug in ocr_client where .strip() was called when output was None

Impact

This PR substantially improves:

  • codebase maintainability
  • batch throughput
  • runtime fault tolerance
  • multi-GPU scalability
  • memory efficiency
  • I/O efficiency
  • Markdown output quality
  • flexibility of input/output and configuration handling

- Implement dynamic registration of Tracker in pipeline for immediate results without waiting for t1 and t2 to finish
…dle empty mask crops and prevent crashes in cv2.resize
…ed related methods to return visualization images directly
…e methods across parser result classes and pipeline
@JaredforReal merged commit b069c6f into zai-org:main on Mar 27, 2026
2 checks passed