
feat: Add torch.compile support for ~2x faster inference#16

Merged
fspecii merged 1 commit into main from feature/torch-compile-support
Jan 25, 2026

Conversation


@fspecii (Owner) commented Jan 25, 2026

Summary

This PR adds support for torch.compile (PyTorch 2.0+) to significantly improve inference speed (~2x on supported GPUs such as the RTX 4090 and A100).

Changes

New Environment Variables

  • HEARTMULA_COMPILE: Enable torch.compile (true/false, default: false)
  • HEARTMULA_COMPILE_MODE: Compile optimization mode (default/reduce-overhead/max-autotune)
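The two variables above can be parsed with a small helper. This is a minimal sketch, not the PR's actual code: the variable names and mode values come from this PR, but the helper name `read_compile_config` and its validation logic are illustrative assumptions.

```python
import os

# Mode values listed in this PR; anything else is rejected early.
VALID_MODES = ("default", "reduce-overhead", "max-autotune")


def read_compile_config(env=os.environ):
    """Hypothetical helper: parse torch.compile settings from the environment."""
    enabled = env.get("HEARTMULA_COMPILE", "false").strip().lower() == "true"
    mode = env.get("HEARTMULA_COMPILE_MODE", "default")
    if mode not in VALID_MODES:
        raise ValueError(
            f"HEARTMULA_COMPILE_MODE must be one of {VALID_MODES}, got {mode!r}"
        )
    return enabled, mode
```

Defaulting to disabled matches the documented behaviour (`default: false`), so existing deployments are unaffected unless they opt in.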

Code Changes

  • Added apply_torch_compile() function that compiles the model backbone and decoder
  • Auto-detects Triton availability and falls back to eager backend if not found
  • Updated create_quantized_pipeline() with compile parameters
  • Applied torch.compile to both quantized and non-quantized pipelines
  • All configurations (single GPU, multi-GPU, sequential offload) now support torch.compile
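The compile step described above could look roughly like the following. This is a hedged sketch rather than the PR's implementation: `apply_torch_compile` and the Triton auto-detection are named in the PR, but the `model.backbone`/`model.decoder` attributes and the exact error handling shown here are assumptions.

```python
import importlib.util

import torch


def apply_torch_compile(model, mode="default"):
    """Illustrative sketch: compile the model's backbone and decoder.

    The inductor backend requires Triton, so when Triton is not
    importable we fall back to the eager backend, mirroring the
    auto-detection described in this PR.
    """
    has_triton = importlib.util.find_spec("triton") is not None
    backend = "inductor" if has_triton else "eager"
    kwargs = {"backend": backend}
    if backend == "inductor":
        # `mode` (default / reduce-overhead / max-autotune) only
        # applies to the inductor backend.
        kwargs["mode"] = mode
    try:
        model.backbone = torch.compile(model.backbone, **kwargs)
        model.decoder = torch.compile(model.decoder, **kwargs)
    except Exception as exc:
        # Graceful fallback: keep the uncompiled model rather than crash.
        print(f"torch.compile failed ({exc}); continuing uncompiled")
    return model
```

Because `torch.compile` wraps the modules lazily, this call is cheap; the real compilation cost is paid on the first forward pass.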

Documentation

  • Updated .env.example with new configuration options
  • Added torch.compile section to README.md with usage examples

Usage

# Enable torch.compile
HEARTMULA_COMPILE=true python -m uvicorn backend.app.main:app --host 0.0.0.0 --port 8000

# With max performance mode
HEARTMULA_COMPILE=true HEARTMULA_COMPILE_MODE=max-autotune ./start.sh

Requirements

  • PyTorch 2.0+
  • Linux/WSL2: pip install triton
  • Windows: pip install -U 'triton-windows>=3.2,<3.3'

Notes

  • First generation will be slower due to compilation
  • Subsequent generations benefit from compiled kernels
  • Falls back gracefully if compilation fails
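The warm-up behaviour in the notes above can be demonstrated with a timing sketch. `pipeline` and its call signature are hypothetical stand-ins for the compiled pipeline; only the first-call-pays-compilation pattern is the point here.

```python
import time


def warm_up(pipeline, dummy_input):
    """Time the first (compiling) call versus a subsequent call.

    With torch.compile enabled, the first call triggers kernel
    compilation and is slower; later calls reuse the compiled kernels.
    """
    start = time.perf_counter()
    pipeline(dummy_input)  # first call: pays the compilation cost
    first = time.perf_counter() - start

    start = time.perf_counter()
    pipeline(dummy_input)  # subsequent call: compiled kernels
    second = time.perf_counter() - start
    return first, second
```

Running a dummy generation once at startup hides the compilation latency from the first real request.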

Related

This implements a feature similar to heartlib PR #64.

- Add HEARTMULA_COMPILE and HEARTMULA_COMPILE_MODE environment variables
- Add apply_torch_compile() function that compiles backbone and decoder
- Auto-detect Triton availability (inductor vs eager backend)
- Support compile modes: default, reduce-overhead, max-autotune
- Update create_quantized_pipeline() with compile_model/compile_mode params
- Apply torch.compile to both quantized and non-quantized pipelines
- Update .env.example with new configuration options
- Update README.md with torch.compile documentation

Note: First generation will be slower due to compilation, but subsequent
generations benefit from the compiled kernels (~2x speedup on RTX 4090, A100).

Requirements:
- PyTorch 2.0+
- Linux/WSL2: pip install triton
- Windows: pip install -U 'triton-windows>=3.2,<3.3'
@fspecii fspecii merged commit 130674b into main Jan 25, 2026
@fspecii fspecii deleted the feature/torch-compile-support branch January 25, 2026 01:09