Reconstruct pipeline by Ataraxy33 · Pull Request #167 · zai-org/GLM-OCR

Ataraxy33 · 2026-03-27T09:16:18Z

Summary

This PR delivers a major refactor of the Pipeline framework, along with multi-GPU deployment support, performance and memory optimizations, Layout module improvements, post-processing enhancements, and broader input/output capabilities.

The main goals of this change are:

- improve pipeline maintainability through modularization
- increase batch throughput and runtime stability
- support scalable multi-GPU deployment
- reduce memory usage and redundant I/O
- improve robustness across Layout, OCR, and post-processing workflows

Pipeline framework refactor

Refactored the Pipeline architecture around scheduling logic, thread coordination, and runtime stability.

Highlights

Split the original ~700-line pipeline.py into 5 focused modules:
- pipeline.py: top-level scheduling
- _common.py: constants and utility functions
- _state.py: shared state across threads
- _unit_tracker.py: completion tracking
- _workers.py: three worker implementations

This makes the pipeline easier to test, maintain, and evolve.

New features

Added asynchronous Pipeline invocation for CLI directory processing, replacing serial execution with a unified async pipeline flow for significantly higher batch throughput
Added a global shutdown event (threading.Event) so that if any worker crashes, the remaining workers and the main thread can detect the failure and exit cleanly
Added a watchdog health-monitoring thread that probes the OCR API port every 5 seconds via socket; if the service goes down, it immediately triggers shutdown to avoid deadlocks and cascading request failures
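
The shutdown event and watchdog described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `port_is_open` and `watchdog` and the injectable `probe` callable are assumptions made for this example. Only the use of `threading.Event`, the socket probe, and the 5-second interval come from the description.

```python
import socket
import threading
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watchdog(probe: Callable[[], bool], shutdown: threading.Event,
             interval: float = 5.0) -> None:
    """Call `probe` every `interval` seconds; set `shutdown` when it fails.

    `probe` is a hypothetical hook so the health check can be swapped out.
    """
    # Event.wait() returns True as soon as the event is set, so the loop
    # also exits promptly when a crashed worker has already set `shutdown`.
    while not shutdown.wait(interval):
        if not probe():
            shutdown.set()
```

In use, the watchdog would typically run in a daemon thread with `probe` bound to the OCR API address, for example `functools.partial(port_is_open, "127.0.0.1", 8080)` (a placeholder address), while every worker loop checks `shutdown.is_set()` before taking more work.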

Changes

Removed the end-to-end OCR path that ran without layout analysis (previously controlled by enable_layout); Layout analysis is now the default pipeline path

Bug fixes

Fixed a blocking issue where image-type blocks were added to the pending list and not returned until all files were fully processed
Fixed the issue where a single worker crash could leave the remaining workers and main thread blocked indefinitely
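Fixed the issue where the Pipeline had to wait for t1 / t2 to finish before returning results; the Tracker is now registered dynamically so results are available immediately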

Multi-GPU parallel deployment

Added examples/multi-gpu-deploy with a complete tutorial and one-click multi-GPU processing workflow.

New features

Added a full multi-GPU deployment framework:
- launch.py: entry point
- gpu_utils.py: GPU detection, port allocation, and file sharding
- engine.py: sglang / vLLM engine startup and health checks
- worker.py: single-GPU processing unit
- coordinator.py: global scheduling, progress monitoring, and graceful shutdown
Automatically detects available GPUs based on free VRAM
Dynamically allocates ports while skipping occupied ones
Evenly shards files across GPUs
Incurs no communication overhead between GPUs
Automatically skips failed GPU instances and redistributes files to healthy GPUs
Adds coordinator-level progress monitoring and engine process polling, with immediate worker termination when an engine crash is detected
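
The sharding, port allocation, and redistribution behavior listed above can be illustrated with a small sketch. The function names (`shard_evenly`, `allocate_ports`, `redistribute`) and the round-robin strategy are assumptions for illustration and are not taken from gpu_utils.py or coordinator.py.

```python
import socket
from typing import Callable, Dict, List, Sequence, Set


def shard_evenly(files: Sequence[str], num_shards: int) -> List[List[str]]:
    """Round-robin split, so shard sizes differ by at most one file."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return [list(files[i::num_shards]) for i in range(num_shards)]


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """True if the port can be bound right now (it may be taken later)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def allocate_ports(start: int, count: int,
                   is_free: Callable[[int], bool] = port_is_free) -> List[int]:
    """Collect `count` free ports scanning upward from `start`."""
    ports: List[int] = []
    port = start
    while len(ports) < count:
        if port > 65535:
            raise RuntimeError("ran out of ports")
        if is_free(port):  # occupied ports are skipped
            ports.append(port)
        port += 1
    return ports


def redistribute(shards: Dict[int, List[str]],
                 failed: Set[int]) -> Dict[int, List[str]]:
    """Move files from failed GPU ids onto the healthy ones, round-robin."""
    healthy = sorted(g for g in shards if g not in failed)
    if not healthy:
        raise RuntimeError("no healthy GPU left")
    orphaned = [f for g in sorted(failed) if g in shards for f in shards[g]]
    result = {g: list(shards[g]) for g in healthy}
    for i, f in enumerate(orphaned):
        result[healthy[i % len(healthy)]].append(f)
    return result
```

Because each shard is fixed up front and handled by its own single-GPU worker, the GPUs never need to exchange data during processing in this sketch.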

Performance and memory optimizations

Optimized the system for large-scale batch processing with a focus on memory pressure and unnecessary I/O.

Changes

Added release_unit_data() to release per-page recognition and layout results immediately after a unit is yielded
Removed the ever-growing global _recognition_results list to reduce memory usage
Avoided repeated PDF reads during save() cropping by reusing in-memory PIL.Image objects
Removed unnecessary PIL format conversions inside the Pipeline
Removed all tempfile dependencies by passing visualization images directly in memory as PIL.Image objects
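
As a rough illustration of the release-after-yield idea, the sketch below keeps per-unit results in dictionaries and drops them once the consumer has moved past a unit. The class name `UnitStore`, its fields, and `iter_units` are hypothetical. Only the method name `release_unit_data()` comes from this PR, and its real signature may differ.

```python
import threading
from typing import Any, Dict, Iterable, Iterator, Tuple


class UnitStore:
    """Hypothetical holder for per-unit recognition and layout results."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._recognition: Dict[int, Any] = {}
        self._layout: Dict[int, Any] = {}

    def put(self, unit_id: int, recognition: Any, layout: Any) -> None:
        with self._lock:
            self._recognition[unit_id] = recognition
            self._layout[unit_id] = layout

    def release_unit_data(self, unit_id: int) -> None:
        """Drop everything stored for one unit; unknown ids are ignored."""
        with self._lock:
            self._recognition.pop(unit_id, None)
            self._layout.pop(unit_id, None)

    def pending(self) -> int:
        with self._lock:
            return len(self._recognition)

    def iter_units(self, unit_ids: Iterable[int]) -> Iterator[Tuple[int, Any, Any]]:
        for uid in unit_ids:
            with self._lock:
                rec, lay = self._recognition.get(uid), self._layout.get(uid)
            yield uid, rec, lay
            # Runs when the consumer asks for the next unit, so memory for
            # the previous unit is freed instead of accumulating globally.
            self.release_unit_data(uid)
```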

Layout module improvements

Improved Layout-related functionality and fixed several stability issues.

New features

Added parameter control for PP-DocLayoutV3 polygon output behavior, since polygon output can introduce bad cases and slight metric regressions on regular documents

Bug fixes

Fully fixed the empty-mask cropping crash that triggered the cv2.resize !ssize.empty() assertion
- implemented via monkey-patching the upstream method
- validates box width/height before cropping
- checks whether the cropped mask is empty
- falls back to a rectangular bbox when needed
Fixed multi-page Layout visualization filenames to use page index within each file instead of a global page counter
Fixed incorrect filtering of image-type regions by the empty-content check
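
The empty-mask guard described above (validate the box, check the cropped mask, fall back to a rectangle) might look roughly like the sketch below. It uses plain nested lists instead of NumPy arrays to stay dependency-free. The function names, the `extract_polygon` callback, and the choice to return None for a degenerate box are assumptions for illustration, not the actual monkey-patch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2
Point = Tuple[int, int]
Mask = Sequence[Sequence[int]]


def rect_polygon(box: Box) -> List[Point]:
    """Four corners of the box, used as the fallback polygon."""
    x1, y1, x2, y2 = box
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def crop_mask(mask: Mask, box: Box) -> List[List[int]]:
    x1, y1, x2, y2 = box
    return [list(row[x1:x2]) for row in mask[y1:y2]]


def region_polygon(
    mask: Mask,
    box: Box,
    extract_polygon: Callable[[List[List[int]], Box], List[Point]],
) -> Optional[List[Point]]:
    x1, y1, x2, y2 = box
    # 1. Validate width/height before cropping.
    if x2 - x1 <= 0 or y2 - y1 <= 0:
        return None  # assumption: the caller skips degenerate boxes
    # 2. Check whether the cropped mask is empty (no rows or all zeros).
    cropped = crop_mask(mask, box)
    if not cropped or not any(any(row) for row in cropped):
        # 3. Fall back to the rectangular bbox instead of resizing nothing.
        return rect_polygon(box)
    return extract_polygon(cropped, box)
```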

Post-processing improvements

Improved post-processing behavior for recognition results.

New features

Added normalize_inline_formula() to:
- remove extra spaces inside $...$ ( $ x $ → $x$ )
- ensure inline formulas are properly separated from surrounding text
- improve Markdown rendering quality
Added three independent post-processing switches:
- enable_merge_formula_numbers
- enable_merge_text_blocks
- enable_format_bullet_points

These options make it possible to balance Markdown readability against evaluation metrics.
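
For illustration, the inline-formula normalization could be approximated with a regular expression as below. This is a sketch under assumptions and not the implementation of normalize_inline_formula(): the regex, the helper names, and the rule "insert a space when a formula touches an alphanumeric character" are choices made for this example.

```python
import re

# A single "$" (not part of "$$"), some non-"$" content on one line,
# then a closing single "$".
_INLINE = re.compile(r"(?<!\$)\$(?!\$)\s*([^$\n]+?)\s*\$(?!\$)")


def _fix(match: "re.Match[str]") -> str:
    text = match.string
    body = match.group(1).strip()
    if not body:
        return match.group(0)  # leave "$ $" alone rather than emit "$$"
    out = f"${body}$"
    # Separate the formula from directly adjacent letters or digits.
    if match.start() > 0 and text[match.start() - 1].isalnum():
        out = " " + out
    if match.end() < len(text) and text[match.end()].isalnum():
        out += " "
    return out


def normalize_inline_formula_sketch(text: str) -> str:
    return _INLINE.sub(_fix, text)
```

A simple pattern like this treats any pair of single dollar signs on a line as a formula, so text containing currency amounts would need extra handling in practice.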

Changes

Preserved complete <table>...</table> blocks as-is to avoid breaking HTML structures during duplicate-content checks
Added code fence closure correction for text content to avoid malformed Markdown rendering
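
A minimal sketch of code fence closure correction, assuming the rule is simply "if a fence is left open at the end of the text, append a closing fence". The function name is hypothetical, and the real logic may handle more cases such as tilde fences or nested fences.

```python
FENCE = "`" * 3  # built indirectly so this example does not break its own fence


def close_unterminated_fence(text: str) -> str:
    """Append a closing fence when the text ends inside an open code block."""
    inside = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            inside = not inside  # each fence line toggles open/closed
    if inside:
        if not text.endswith("\n"):
            text += "\n"
        text += FENCE
    return text
```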

Input / output improvements

Expanded input/output support and improved usability.

New features

Added bytes input support to parse(), with automatic file type detection via magic bytes for:
- PDF
- PNG
- JPG
- GIF
- WEBP
- BMP
Supports mixed inputs of file paths and raw bytes
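
Magic-byte detection for the listed formats can be sketched as follows. The signatures are the standard file headers for these formats. The function name `detect_file_type` and the returned labels are assumptions, not the names used by parse().

```python
from typing import Optional


def detect_file_type(data: bytes) -> Optional[str]:
    """Guess the file type from leading bytes; None if unrecognized."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WEBP is a RIFF container: "RIFF" + 4-byte size + "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data.startswith(b"BM"):
        return "bmp"
    return None
```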
Added {name}_model.json to store raw Recognition model outputs before post-processing for easier debugging and comparison
Replaced glob with rglob for recursive CLI directory scanning
- supports nested subdirectories
- automatically deduplicates inputs
- preserves output directory structure
Added cropped image path fields to JSON output
Implemented a 5-level configuration priority system:
- CLI --set
- Python API kwargs
- environment variables / .env
- YAML
- built-in defaults
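
One way to picture the layered lookup is with `collections.ChainMap`, where the first mapping that defines a key wins. The sketch assumes the five sources above are listed from highest to lowest priority; the function name and the config keys in the usage note are made up for the example.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_config(
    cli_set: Mapping[str, Any],
    api_kwargs: Mapping[str, Any],
    env: Mapping[str, Any],
    yaml_cfg: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge the five layers; earlier arguments override later ones."""
    return dict(ChainMap(cli_set, api_kwargs, env, yaml_cfg, defaults))
```

For example, with hypothetical keys, `resolve_config({"port": 9000}, {}, {"port": 8000}, {"batch": 4}, {"port": 80, "batch": 1})` takes `port` from the CLI layer and `batch` from YAML.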
Added preserve_order (default True) to let users choose between preserving original input order or yielding results in completion order
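
The difference between the two modes can be sketched with a small reordering buffer: results arrive tagged with their input index in completion order, and are either passed through directly or held until all earlier indices have been yielded. The function name and the `(index, result)` tuple shape are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def yield_results(
    completed: Iterable[Tuple[int, T]],
    preserve_order: bool = True,
) -> Iterator[Tuple[int, T]]:
    """Yield (index, result) pairs in input order or in completion order."""
    if not preserve_order:
        yield from completed  # completion order: pass through as they arrive
        return
    buffered: Dict[int, T] = {}
    next_index = 0
    for index, result in completed:
        buffered[index] = result
        # Flush every result whose predecessors have all been yielded.
        while next_index in buffered:
            yield next_index, buffered.pop(next_index)
            next_index += 1
```

The trade-off in this sketch is that preserving order can delay fast results behind a slow earlier input, while completion order returns each result as soon as it is ready.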

Changes

Unified input loading logic so bytes are processed fully in memory
Removed all remaining tempfile usage
Switched PDF rendering from pypdfium2 to PyMuPDF for better performance and memory behavior under large batch workloads
Simplified config surface area by removing unused options and deprecated PageLoader methods
Changed default result yielding behavior from completion order to original input order, with preserve_order available to override

Bug fixes

Fixed pypdfium2 memory release issues under large-batch workloads that could lead to PDF loading failures
Fixed the bug in ocr_client where .strip() was called when output was None

Impact

This PR substantially improves:

- codebase maintainability
- batch throughput
- runtime fault tolerance
- multi-GPU scalability
- memory efficiency
- I/O efficiency
- Markdown output quality
- flexibility of input/output and configuration handling


Ataraxy33 added 30 commits February 28, 2026 14:10

Remove support for OCR without layout analysis

000d8af

reconstruct pipeline

796519a

- Add async pipeline support for files in directory via CLI

0489c36

- Implement dynamic registration of Tracker in pipeline for immediate results without waiting for t1 and t2 to finish

Fix a blocking bug; remove redundant PIL format conversions; optimize memory release performance

6dc8d8c

Added shutdown event handling to safely stop processing and drain queues

59e6ba6

support image / PDF bytes input

56c40a8

add image path to json result for cropped images

f55d36a

Fix a layout visualization file naming bug

73ef92d

add fallback handling to page loader

eb2ebe3

change PDF renderer to PyMuPDF; harden error handling

c7104f3

simplify image region save flow to reduce IO

f6b67c4

save raw output json file from recognition model

eebf13a

add an argument to control whether to use polygon property in layout detector

5dd2778

support load image/PDF files from a directory recursively & update default config

ffa8f53

Implement safe extraction of polygon points in layout detector to handle empty mask crops and prevent crashes in cv2.resize

1d7a43a

Removed temporary directory usage for layout visualizations and updated related methods to return visualization images directly

9ce61f7

Enhance configuration flexibility by adding CLI --set option for overriding config values

961abcd

Update default output directory from './results' to './output' in save methods across parser result classes and pipeline

79404cd

Update methods to load images and PDFs from various input types, remove tempfile dependence

da003dd

Improve recognition result postprocess

83e4c71

Add multi-GPU deployment support for GLM-OCR

28f108b

Refactor multi-GPU deployment to eliminate tempfile usage and direct output to log directory

1239089

Update error handling in recognition process to log failures and set content to None

f19f404

Add health monitoring to OCR pipeline with a watchdog thread and socket connectivity check

88fc499

Add engine health checks in multi-GPU coordinator to monitor and handle crashed processes

0d51f41

Refactor content validation in ResultFormatter to skip non-image labels before checking for empty content

978564a

Add inline formula normalization in result formatting process

507c2d1

Refactor result yielding in Pipeline to maintain original input order

8f134ad

Add --no-save option to multi-GPU deployment for optional result file writing

b3874ad

Update multi-GPU deployment to pass log directory and engine log level for improved logging

cee8fa0

Ataraxy33 added 9 commits March 20, 2026 10:01

Add engine log level parameter to build_engine_cmd for enhanced logging control

33b78f7

Enhance memory management in PipelineState by adding release_unit_data method to free per-page data after processing

9b83925

Add post-processing configuration options to ResultFormatter for merging formulas, text blocks, and formatting bullet points

2ec0809

Update default configuration parameters

4252a3c

Add preserve_order parameter to GlmOcr and Pipeline classes for consistent output order

cf32315

Refactor code for improved readability and consistency across multiple files

52687c2

Merge remote-tracking branch 'upstream/main' into reconstruct-pipeline

fdb10fc

Passed the pre-commit code check

b9cc4cb

Add preserve_order argument to stream parsing test

77a7430

JaredforReal approved these changes Mar 27, 2026


JaredforReal merged commit b069c6f into zai-org:main Mar 27, 2026
2 checks passed

JaredforReal merged 39 commits into zai-org:main from Ataraxy33:reconstruct-pipeline
