Merged (39 commits)
- `000d8af` Remove support for OCR without layout analysis (Ataraxy33, Feb 28, 2026)
- `796519a` reconstruct pipeline (Ataraxy33, Feb 28, 2026)
- `0489c36` Add async pipeline support for files in directory via CLI (Ataraxy33, Mar 2, 2026)
- `6dc8d8c` Fix a blocking bug; remove redundant PIL format conversions; optimize… (Ataraxy33, Mar 3, 2026)
- `59e6ba6` Added shutdown event handling to safely stop processing and drain queues (Ataraxy33, Mar 3, 2026)
- `56c40a8` support image / PDF bytes input (Ataraxy33, Mar 3, 2026)
- `f55d36a` add image path to json result for cropped images (Ataraxy33, Mar 4, 2026)
- `73ef92d` Fix a layout visualization file naming bug (Ataraxy33, Mar 4, 2026)
- `eb2ebe3` add fallback handling to page loader (Ataraxy33, Mar 4, 2026)
- `c7104f3` change PDF renderer to PyMuPDF; harden error handling (Ataraxy33, Mar 5, 2026)
- `f6b67c4` simplify image region save flow to reduce IO (Ataraxy33, Mar 5, 2026)
- `eebf13a` save raw output json file from recognition model (Ataraxy33, Mar 6, 2026)
- `5dd2778` add an argument to control whether to use polygon property in layout … (Ataraxy33, Mar 6, 2026)
- `ffa8f53` support load image/PDF files from a directory recursively & update de… (Ataraxy33, Mar 10, 2026)
- `1d7a43a` Implement safe extraction of polygon points in layout detector to han… (Ataraxy33, Mar 10, 2026)
- `9ce61f7` Removed temporary directory usage for layout visualizations and updat… (Ataraxy33, Mar 10, 2026)
- `961abcd` Enhance configuration flexibility by adding CLI `--set` option for ov… (Ataraxy33, Mar 11, 2026)
- `79404cd` Update default output directory from './results' to './output' in sav… (Ataraxy33, Mar 11, 2026)
- `da003dd` Update methods to load images and PDFs from various input types, remo… (Ataraxy33, Mar 11, 2026)
- `83e4c71` Improve recognition result postprocess (Ataraxy33, Mar 12, 2026)
- `28f108b` Add multi-GPU deployment support for GLM-OCR (Ataraxy33, Mar 13, 2026)
- `1239089` Refactor multi-GPU deployment to eliminate tempfile usage and direct … (Ataraxy33, Mar 18, 2026)
- `f19f404` Update error handling in recognition process to log failures and set … (Ataraxy33, Mar 18, 2026)
- `88fc499` Add health monitoring to OCR pipeline with a watchdog thread and sock… (Ataraxy33, Mar 18, 2026)
- `0d51f41` Add engine health checks in multi-GPU coordinator to monitor and hand… (Ataraxy33, Mar 18, 2026)
- `978564a` Refactor content validation in ResultFormatter to skip non-image labe… (Ataraxy33, Mar 19, 2026)
- `507c2d1` Add inline formula normalization in result formatting process (Ataraxy33, Mar 20, 2026)
- `8f134ad` Refactor result yielding in Pipeline to maintain original input order (Ataraxy33, Mar 20, 2026)
- `b3874ad` Add --no-save option to multi-GPU deployment for optional result file… (Ataraxy33, Mar 20, 2026)
- `cee8fa0` Update multi-GPU deployment to pass log directory and engine log leve… (Ataraxy33, Mar 20, 2026)
- `33b78f7` Add engine log level parameter to build_engine_cmd for enhanced loggi… (Ataraxy33, Mar 20, 2026)
- `9b83925` Enhance memory management in PipelineState by adding release_unit_dat… (Ataraxy33, Mar 24, 2026)
- `2ec0809` Add post-processing configuration options to ResultFormatter for merg… (Ataraxy33, Mar 26, 2026)
- `4252a3c` Update default configuration parameters (Ataraxy33, Mar 26, 2026)
- `cf32315` Add preserve_order parameter to GlmOcr and Pipeline classes for consi… (Ataraxy33, Mar 26, 2026)
- `52687c2` Refactor code for improved readability and consistency across multipl… (Ataraxy33, Mar 26, 2026)
- `fdb10fc` Merge remote-tracking branch 'upstream/main' into reconstruct-pipeline (Ataraxy33, Mar 26, 2026)
- `b9cc4cb` Passed the pre-commit code check (Ataraxy33, Mar 27, 2026)
- `77a7430` Add preserve_order argument to stream parsing test (Ataraxy33, Mar 27, 2026)
23 changes: 16 additions & 7 deletions README.md
@@ -197,6 +197,10 @@ glmocr parse examples/source/code.png --layout-device cpu

# Run layout detection on a specific GPU
glmocr parse examples/source/code.png --layout-device cuda:1

# Override any config value via --set (dotted path, repeatable)
glmocr parse examples/source/code.png --set pipeline.ocr_api.api_port 8080
glmocr parse examples/source/ --set pipeline.layout.use_polygon true --set logging.level DEBUG
```

#### Python API
@@ -256,6 +260,14 @@ Semantics:

### Configuration

Configuration priority (highest to lowest):

1. CLI `--set` overrides
2. Python API keyword arguments
3. `GLMOCR_*` environment variables / `.env` file
4. YAML config file
5. Built-in defaults
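
As an illustrative sketch only (the actual glmocr implementation may differ), a dotted-path `--set` override like `pipeline.ocr_api.api_port 8080` can be applied to a nested config dict by walking the path and coercing simple scalars:

```python
from typing import Any

def apply_override(config: dict, dotted_key: str, raw_value: str) -> None:
    """Walk the dotted path, creating intermediate dicts, and set the leaf."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Minimal scalar coercion: bool, then int, then float, else keep the string.
    value: Any = raw_value
    if raw_value.lower() in ("true", "false"):
        value = raw_value.lower() == "true"
    else:
        try:
            value = int(raw_value)
        except ValueError:
            try:
                value = float(raw_value)
            except ValueError:
                pass
    node[keys[-1]] = value

cfg = {"pipeline": {"ocr_api": {"api_port": 5002}}}
apply_override(cfg, "pipeline.ocr_api.api_port", "8080")   # -> int 8080
apply_override(cfg, "pipeline.layout.use_polygon", "true") # -> bool True
```

Because `setdefault` creates intermediate dicts on demand, an override can target a section that the base config does not yet define.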

Full configuration in `glmocr/config.yaml`:

```yaml
@@ -276,13 +288,13 @@ pipeline:
api_host: localhost
api_port: 8080
api_key: null # or set API_KEY env var
connect_timeout: 300
request_timeout: 300
connect_timeout: 30
request_timeout: 120

# Page loader settings
page_loader:
max_tokens: 16384
temperature: 0.01
max_tokens: 8192
temperature: 0.0
image_format: JPEG
min_pixels: 12544
max_pixels: 71372800
@@ -291,9 +303,6 @@ pipeline:
result_formatter:
output_format: both # json, markdown, or both

# Layout detection (optional)
enable_layout: false

# Layout model device placement
layout:
# device: null # null=auto, "cpu", "cuda", or "cuda:N"
23 changes: 16 additions & 7 deletions README_zh.md
@@ -190,6 +190,10 @@ glmocr parse examples/source/code.png --config my_config.yaml

# Enable debug logging (includes profiling)
glmocr parse examples/source/code.png --log-level DEBUG

# Override any config value via --set (dotted path, repeatable)
glmocr parse examples/source/code.png --set pipeline.ocr_api.api_port 8080
glmocr parse examples/source/ --set pipeline.layout.use_polygon true --set logging.level DEBUG
```

#### Python API
@@ -241,6 +245,14 @@ curl -X POST http://localhost:5002/glmocr/parse \

### Configuration

Configuration loading priority (highest to lowest):

1. CLI `--set` overrides
2. Python API keyword arguments
3. `GLMOCR_*` environment variables / `.env` file
4. YAML config file
5. Built-in defaults

Full configuration in `glmocr/config.yaml`:

```yaml
@@ -261,23 +273,20 @@ pipeline:
api_host: localhost
api_port: 8080
api_key: null # or set API_KEY env var
connect_timeout: 300
request_timeout: 300
connect_timeout: 30
request_timeout: 120

# Page loader settings
page_loader:
max_tokens: 16384
temperature: 0.01
max_tokens: 8192
temperature: 0.0
image_format: JPEG
min_pixels: 12544
max_pixels: 71372800

# Result formatting
result_formatter:
output_format: both # json, markdown, or both

# Layout detection (optional)
enable_layout: false
```

For more options, see [config.yaml](glmocr/config.yaml).
2 changes: 0 additions & 2 deletions agent.md
@@ -70,7 +70,6 @@ or in a `.env` file anywhere in the working-directory ancestry.
| `GLMOCR_OCR_API_HOST` | `pipeline.ocr_api.api_host` | `localhost` |
| `GLMOCR_OCR_API_PORT` | `pipeline.ocr_api.api_port` | `5002` |
| `GLMOCR_OCR_MODEL` | `pipeline.ocr_api.model` | `glm-ocr-model` |
| `GLMOCR_ENABLE_LAYOUT` | `pipeline.enable_layout` | `true` / `false` |
| `GLMOCR_LOG_LEVEL` | `logging.level` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
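
A hypothetical sketch of how the `GLMOCR_*` variables in the table above could be collected into dotted-path overrides; the variable names come from the table, but the helper itself is illustrative, not glmocr's actual loader:

```python
import os

# Mapping from environment variable to dotted config path (from the table above).
ENV_MAP = {
    "GLMOCR_OCR_API_HOST": "pipeline.ocr_api.api_host",
    "GLMOCR_OCR_API_PORT": "pipeline.ocr_api.api_port",
    "GLMOCR_OCR_MODEL": "pipeline.ocr_api.model",
    "GLMOCR_LOG_LEVEL": "logging.level",
}

def env_overrides(environ=os.environ) -> dict:
    """Return {dotted_path: value} for every GLMOCR_* variable that is set."""
    return {path: environ[name] for name, path in ENV_MAP.items() if name in environ}
```

Passing a plain dict instead of `os.environ` makes the lookup easy to test in isolation.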

### `.env` File Auto-Loading
@@ -102,7 +101,6 @@ with **higher priority**.
| `model` | `str` | Model name. |
| `mode` | `str` | `"maas"` or `"selfhosted"`. |
| `timeout` | `int` | Request timeout in seconds. |
| `enable_layout` | `bool` | Enable layout detection. |
| `log_level` | `str` | Logging level. |

---
120 changes: 120 additions & 0 deletions examples/multi-gpu-deploy/README.md
@@ -0,0 +1,120 @@
# Multi-GPU Deployment for GLM-OCR

Automatically launch sglang/vLLM inference services across multiple GPUs, distribute image files evenly, and run the GLM-OCR pipeline in parallel for maximum throughput.

Each GPU hosts both an inference server (sglang or vLLM) and a layout detection model, forming a self-contained processing unit with zero cross-GPU communication.

## Features

- **Auto GPU detection** — discovers all available GPUs and filters by free VRAM
- **Dynamic port allocation** — automatically skips occupied ports
- **Fault tolerance** — failed GPUs are skipped, files are redistributed to healthy GPUs
- **Global progress bar** — real-time `tqdm` progress across all GPUs
- **Graceful shutdown** — `Ctrl+C` cleanly terminates all subprocesses; double `Ctrl+C` force-kills
- **Centralized logging** — all engine/worker logs saved under `logs/<timestamp>/`
- **Speculative decoding** — MTP enabled by default for both sglang and vLLM
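
The even distribution of files across GPUs can be sketched as a round-robin shard, so shard sizes never differ by more than one file (a minimal illustration; the actual logic in `gpu_utils.py` may differ):

```python
def shard_files(files: list, num_gpus: int) -> list:
    """Split `files` into `num_gpus` near-equal shards via round-robin."""
    shards = [[] for _ in range(num_gpus)]
    for i, f in enumerate(files):
        shards[i % num_gpus].append(f)
    return shards

# 10 files over 3 GPUs -> shard sizes 4, 3, 3
shards = shard_files([f"img_{i}.png" for i in range(10)], 3)
```

Round-robin keeps assignment deterministic, which helps when re-running after a partial failure.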

## Quick Start

```bash
# Use all available GPUs with sglang (default)
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output -m /path/to/GLM-OCR

# Specify GPUs and use vLLM
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output --engine vllm --gpus 0,1,2,3

# Custom model path and VRAM threshold
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output -m /path/to/GLM-OCR --min-free-mb 20000
```

## Parameters

| Parameter | Default | Description |
|---|---|---|
| `-i`, `--input` | *required* | Input image file or directory (recursive) |
| `-o`, `--output` | `./output` | Output directory for results |
| `-m`, `--model` | `zai-org/GLM-OCR` | Model name or local path |
| `--engine` | `sglang` | Inference engine: `sglang` or `vllm` |
| `--gpus` | `auto` | GPU IDs (comma-separated) or `auto` for all available |
| `--base-port` | `8080` | Base port for engine services |
| `--min-free-mb` | `16000` | Minimum free GPU memory in MB to use a GPU |
| `--timeout` | `600` | Engine startup timeout in seconds |
| `--engine-args` | *none* | Extra arguments passed to the engine |
| `-c`, `--config` | *none* | Path to a custom glmocr config YAML |
| `--log-level` | `WARNING` | Log level for worker processes |


## Examples

### Basic usage

```bash
python examples/multi-gpu-deploy/launch.py -i /data/documents -o /data/results
```

### Use vLLM with specific GPUs

```bash
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--engine vllm \
--gpus 0,2,4,6
```

### Custom engine arguments

```bash
# sglang with custom memory fraction
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--engine-args "--mem-fraction-static 0.85"
```

### Custom config YAML

```bash
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--config my_config.yaml
```

## Logs

All logs are saved under `logs/<timestamp>/`:

| File | Content |
|---|---|
| `main.log` | Coordinator stdout/stderr |
| `engine_gpu<N>_port<P>.log` | Engine service output for each GPU |
| `worker_gpu<N>.log` | Worker process output for each GPU |
| `failed_files.json` | Aggregated list of failed files (if any) |

## Troubleshooting

**Q: Some ports are occupied, will it still work?**

Yes. The launcher automatically scans for available ports starting from `--base-port` and skips any that are in use.
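
The port-scan idea can be sketched as follows; this is illustrative, and the launcher's actual availability check may differ:

```python
import socket

def find_free_ports(base_port: int, count: int, host: str = "127.0.0.1") -> list:
    """Probe ports upward from base_port; return the first `count` free ones."""
    ports, candidate = [], base_port
    while len(ports) < count:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind((host, candidate))  # succeeds only if no active listener
                ports.append(candidate)
            except OSError:
                pass  # occupied: skip and try the next port
        candidate += 1
    return ports

ports = find_free_ports(8080, 4)
```

Note the bind-probe is inherently racy (another process can grab the port between the check and the engine launch), which is why a launcher typically also verifies the engine actually came up.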

**Q: A GPU runs out of memory mid-processing. What happens?**

The worker on that GPU will fail, but other GPUs continue processing. Failed files are logged in `failed_files.json` for later re-processing.

**Q: How do I re-run only the failed files?**

Copy the failed files to a directory and run the launcher again pointing to that directory.
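
A hypothetical helper for that workflow, assuming `failed_files.json` is a flat JSON list of file paths (the real format may differ):

```python
import json
import shutil
from pathlib import Path

def stage_failed_files(failed_json: str, retry_dir: str) -> int:
    """Copy every file listed in failed_json into retry_dir; return the count."""
    retry = Path(retry_dir)
    retry.mkdir(parents=True, exist_ok=True)
    failed = json.loads(Path(failed_json).read_text())
    staged = 0
    for path in failed:
        src = Path(path)
        if src.exists():  # skip entries whose source file has since moved
            shutil.copy2(src, retry / src.name)
            staged += 1
    return staged
```

Then re-run the launcher with `-i` pointing at the retry directory.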

## File Structure

```
examples/multi-gpu-deploy/
├── launch.py # Entry point and CLI argument parser
├── coordinator.py # Orchestration: GPU detection, engine/worker lifecycle
├── engine.py # Engine service management and progress tracking
├── worker.py # Worker process: GLM-OCR pipeline execution
├── gpu_utils.py # GPU detection, port checking, file sharding
├── README.md # This file (English)
└── README_zh.md # Chinese documentation
```
120 changes: 120 additions & 0 deletions examples/multi-gpu-deploy/README_zh.md
@@ -0,0 +1,120 @@
# Multi-GPU Parallel Deployment for GLM-OCR

Automatically launch sglang/vLLM inference services across multiple GPUs, distribute image files evenly, and run the GLM-OCR pipeline in parallel for maximum throughput.

Each GPU hosts both an inference service (sglang or vLLM) and a layout detection model, forming a self-contained processing unit with zero cross-GPU communication.

## Features

- **Auto GPU detection** — discovers all available GPUs and filters them by free VRAM
- **Dynamic port allocation** — automatically skips occupied ports
- **Fault tolerance** — failed GPUs are skipped and their files redistributed to healthy GPUs
- **Global progress bar** — real-time `tqdm` progress aggregated across all GPUs
- **Graceful shutdown** — `Ctrl+C` cleans up all subprocesses; a second `Ctrl+C` force-kills
- **Centralized logging** — all engine/worker logs saved under `logs/<timestamp>/`
- **Speculative decoding** — MTP (Multi-Token Prediction) enabled by default for both sglang and vLLM

## Quick Start

```bash
# Use all available GPUs with the default sglang engine
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output -m /path/to/GLM-OCR

# Specify GPUs and use vLLM
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output --engine vllm --gpus 0,1,2,3

# Custom model path and VRAM threshold
python examples/multi-gpu-deploy/launch.py -i ./images -o ./output -m /path/to/GLM-OCR --min-free-mb 20000
```

## Parameters

| Parameter | Default | Description |
|---|---|---|
| `-i`, `--input` | *required* | Input image file or directory (scanned recursively) |
| `-o`, `--output` | `./output` | Output directory for results |
| `-m`, `--model` | `zai-org/GLM-OCR` | Model name or local path |
| `--engine` | `sglang` | Inference engine: `sglang` or `vllm` |
| `--gpus` | `auto` | GPU IDs (comma-separated) or `auto` to detect automatically |
| `--base-port` | `8080` | Starting port for engine services |
| `--min-free-mb` | `16000` | Minimum free GPU memory (MB) required to use a GPU |
| `--timeout` | `600` | Engine startup timeout in seconds |
| `--engine-args` | *none* | Extra arguments passed to the inference engine |
| `-c`, `--config` | *none* | Path to a custom glmocr config YAML |
| `--log-level` | `WARNING` | Log level for worker processes |


## Examples

### Basic usage

```bash
python examples/multi-gpu-deploy/launch.py -i /data/documents -o /data/results
```

### Use vLLM with specific GPUs

```bash
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--engine vllm \
--gpus 0,2,4,6
```

### Custom engine arguments

```bash
# sglang with a custom memory fraction
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--engine-args "--mem-fraction-static 0.85"
```

### Custom config YAML

```bash
python examples/multi-gpu-deploy/launch.py \
-i /data/documents \
-o /data/results \
--config my_config.yaml
```

## Logs

All logs are saved under `logs/<timestamp>/`:

| File | Content |
|---|---|
| `main.log` | Coordinator stdout/stderr |
| `engine_gpu<N>_port<P>.log` | Inference engine output for each GPU |
| `worker_gpu<N>.log` | Worker process output for each GPU |
| `failed_files.json` | Aggregated list of failed files (if any) |

## FAQ

**Q: Some ports are occupied. Will it still work?**

Yes. The launcher scans for available ports starting from `--base-port` and skips any that are in use.

**Q: What happens if a GPU runs out of memory mid-processing?**

The worker on that GPU fails, but the other GPUs keep processing. Failed files are recorded in `failed_files.json` for later re-processing.

**Q: How do I re-run only the failed files?**

Copy the failed files into a directory, then run the launcher again pointing at that directory.

## File Structure

```
examples/multi-gpu-deploy/
├── launch.py # Entry point and CLI argument parsing
├── coordinator.py # Orchestration: GPU detection, engine/worker lifecycle
├── engine.py # Inference engine management and progress tracking
├── worker.py # Worker process: GLM-OCR pipeline execution
├── gpu_utils.py # GPU detection, port checking, file sharding
├── README.md # English documentation
└── README_zh.md # This file (Chinese documentation)
```