
Conversation

@SageStack

This PR optimizes the benchmarking process for the marker project, improving runtime and reducing memory footprint when running on both MPS and CPU backends. It includes tests with different levels of parallelism (-P 8, -P 6, -P 1) to find optimal configurations for both devices.

Changes
• Added and tested TORCH_DEVICE=mps and TORCH_DEVICE=cpu runs (device resolution sketched below)
• Benchmarked with varying parallelism to identify optimal speed/memory trade-offs
• Collected /usr/bin/time -l metrics (wall time and peak RSS on macOS) for accurate performance profiling
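
For context, a minimal sketch of how a TORCH_DEVICE override of this kind is typically honored in PyTorch code; the function name and fallback order here are illustrative, not marker's actual implementation:

```python
import os
import torch

def resolve_device() -> torch.device:
    # Honor an explicit TORCH_DEVICE=mps / TORCH_DEVICE=cpu override first.
    forced = os.environ.get("TORCH_DEVICE")
    if forced:
        return torch.device(forced)
    # Otherwise fall back to the best available backend.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```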

Benchmark Results (summary)

| Device | Parallelism | Total time | Peak memory | Observation |
| --- | --- | --- | --- | --- |
| CPU | -P 8 | ~30.25 s | ~1806 MB | Fastest configuration; moderate memory usage |
| MPS | -P 6 | ~60.04 s | ~3154 MB | Slowest despite using the GPU; higher memory use |
| MPS | -P 1 | ~31.57 s | ~3864 MB | Roughly on par with CPU, but more memory-hungry |
| CPU | -P 1 | ~30.77 s | ~6778 MB | Very high memory usage at low parallelism |

Notes
• MPS was markedly slower at high parallelism but roughly on par with CPU at -P 1.
• CPU backend remains most efficient at high parallelism (-P 8).
• Significant memory usage spikes at lower parallelism on CPU may indicate inefficient resource reuse.

Next Steps
• Investigate memory usage spike at low parallelism on CPU.
• Explore mixed CPU+MPS execution for hybrid speed gains.
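
Purely as a sketch of the mixed-execution idea above (nothing here exists in marker yet): route MPS-supported stages to the GPU and keep CPU-only stages, such as text detection, on the CPU. Stage names are hypothetical.

```python
import torch

# Hypothetical stage-to-device routing; stage names are illustrative.
CPU_ONLY_STAGES = {"text_detection"}  # remains CPU-only under MPS today

def device_for_stage(stage: str) -> torch.device:
    # CPU-only stages, or machines without MPS, stay on the CPU.
    if stage in CPU_ONLY_STAGES or not torch.backends.mps.is_available():
        return torch.device("cpu")
    return torch.device("mps")
```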

@github-actions

github-actions bot commented Aug 15, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@SageStack

I have read the CLA Document and I hereby sign the CLA

@SageStack

recheck

github-actions bot added a commit that referenced this pull request on Aug 15, 2025.
@SageStack force-pushed the improve/mps-performance branch from f286400 to 6f03371 on August 15, 2025 at 22:09:
…ple Silicon

- Ensure device detection is applied correctly across batch-size logic.
- Add USING_CUDA/USING_MPS helpers for clearer branching.
- MODEL_DTYPE: bfloat16 (CUDA), float16 (MPS), float32 (CPU).
- Increase MPS batch sizes for layout, OCR error, recognition, equations,
  and table recognition; modest bump for detection (CPU fallback under MPS).
- Normalize/remove duplicate getter definitions.
- Fix gpu.using_cuda() equality check; add gpu.using_mps().

Benchmarks on M1 Pro (5 PDFs):
CPU P=1: 30.77s total (~0.162 files/s)
MPS P=1: 31.57s total (~0.158 files/s)
CPU P=8: 30.25s total (~0.165 files/s)
MPS P=6: 60.04s total (~0.083 files/s)

Note: text detection remains CPU-only on MPS, so CPU is faster end-to-end today; this patch still improves correctness and MPS throughput where supported.
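
A minimal sketch of the device/dtype helpers described in the commit message above, assuming PyTorch; the names mirror the commit's USING_CUDA/USING_MPS and MODEL_DTYPE, but marker's actual settings code may differ:

```python
import torch

# Helpers mirroring the commit's USING_CUDA / USING_MPS for clearer branching.
USING_CUDA = torch.cuda.is_available()
USING_MPS = (not USING_CUDA) and torch.backends.mps.is_available()

# Per-backend dtype choice from the commit message:
# bfloat16 on CUDA, float16 on MPS, float32 on CPU.
if USING_CUDA:
    MODEL_DTYPE = torch.bfloat16
elif USING_MPS:
    MODEL_DTYPE = torch.float16
else:
    MODEL_DTYPE = torch.float32
```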
@SageStack force-pushed the improve/mps-performance branch from 6f03371 to 77627a8 on August 15, 2025 at 22:12.