diff --git a/CHANGELOG.rst b/CHANGELOG.rst index d94d9ae1d..3d69ca18c 100755 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -7,6 +7,7 @@ NVIDIA Model Optimizer Changelog (Linux) **New Features** - Add support for Transformer Engine quantization for Megatron Core models. +- Add automated QDQ placement tool to search for optimal QDQ insertion points. 0.40 (2025-12-12) ^^^^^^^^^^^^^^^^^ diff --git a/docs/source/guides/9_qdq_placement.rst b/docs/source/guides/9_qdq_placement.rst new file mode 100644 index 000000000..00466c3b9 --- /dev/null +++ b/docs/source/guides/9_qdq_placement.rst @@ -0,0 +1,911 @@ +=============================================== +Automated Q/DQ Placement Optimization +=============================================== + +Overview +======== + +The ``modelopt.onnx.quantization.autotune`` module provides automated optimization of Quantize/Dequantize (Q/DQ) node placement in ONNX models. Instead of manually deciding where to insert Q/DQ nodes, the autotuner systematically explores different placement strategies and uses TensorRT performance measurements to find the optimal configuration that minimizes inference latency. + +**Key Features:** + +* **Automatic Region Discovery**: Intelligently partitions your model into optimization regions +* **Pattern-Based Optimization**: Groups structurally similar regions and optimizes them together +* **TensorRT Performance Measurement**: Uses actual inference latency (not theoretical estimates) +* **Crash Recovery**: Checkpoint/resume capability for long-running optimizations +* **Warm-Start Support**: Reuses learned patterns from previous runs +* **Multiple Quantization Types**: Supports INT8 and FP8 quantization + +**When to Use This Tool:** + +* You have an ONNX model you want to quantize for TensorRT deployment +* You want to optimize Q/DQ placement for best performance (not just accuracy) +* Your model has repeating structures (e.g., transformer blocks, ResNet layers) +* You need automated optimization without manual Q/DQ placement + +Quick Start +=========== + +Command-Line Interface +----------------------- + +The easiest way to use the autotuner is via the command-line interface: + +.. code-block:: bash + + # Basic usage - INT8 quantization + python -m modelopt.onnx.quantization.autotune --model model.onnx --output ./results + + # FP8 quantization with more exploration + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./results \ + --quant-type fp8 \ + --schemes-per-region 50 + +The command will: + +1. Discover regions in your model automatically +2. Measure baseline performance (no quantization) +3. Test different Q/DQ placement schemes for each region pattern +4. Select the best scheme based on TensorRT latency measurements +5. Export an optimized ONNX model with Q/DQ nodes + +**Output Files:** + +.. code-block:: text + + results/ + ├── autotuner_state.yaml # Checkpoint for resuming + ├── autotuner_state_pattern_cache.yaml # Pattern cache for future runs + ├── baseline.onnx # Unquantized baseline + ├── optimized_final.onnx # Final optimized model + ├── logs/ # TensorRT build logs + │ ├── baseline.log + │ ├── region_*_scheme_*.log + │ └── final.log + └── region_models/ # Best model per region + └── region_*_level_*.onnx + +Python API +---------- + +For programmatic control, use the workflow function: + +.. 
code-block:: python + + from pathlib import Path + from modelopt.onnx.quantization.autotune.workflows import ( + region_pattern_autotuning_workflow, + init_benchmark_instance + ) + + # Initialize TensorRT benchmark + init_benchmark_instance( + timing_cache_file="timing.cache", + warmup_runs=5, + timing_runs=20 + ) + + # Run autotuning workflow + autotuner = region_pattern_autotuning_workflow( + model_path="model.onnx", + output_dir=Path("./results"), + num_schemes_per_region=30, + quant_type="int8" + ) + +How It Works +============ + +The autotuner uses a pattern-based approach that makes optimization both efficient and consistent: + +1. **Region Discovery Phase** + + The model's computation graph is automatically partitioned into hierarchical regions. Each region is a subgraph containing related operations (e.g., a Conv-BatchNorm-ReLU block). + +2. **Pattern Identification Phase** + + Regions with identical structural patterns are grouped together. For example, all Convolution->BatchNormalization->ReLU blocks in your model will share the same pattern. + +3. **Scheme Generation Phase** + + For each unique pattern, multiple Q/DQ insertion schemes are generated. Each scheme specifies different locations to insert Q/DQ nodes. + +4. **Performance Measurement Phase** + + Each scheme is evaluated by: + + * Exporting the ONNX model with Q/DQ nodes applied + * Building a TensorRT engine + * Measuring actual inference latency + +5. **Best Scheme Selection** + + The scheme with the lowest latency is selected for each pattern. This scheme automatically applies to all regions matching that pattern. + +6. **Model Export** + + The final model includes the best Q/DQ scheme for each pattern, resulting in an optimized quantized model. + +**Why Pattern-Based?** + +Pattern-based optimization significantly reduces the search space. Instead of optimizing each region independently (which could require thousands of benchmarks), the autotuner optimizes each unique pattern once. The time reduction depends on pattern overlap—models with many regions sharing few patterns (like transformers with repeated blocks) see the greatest speedup, while models with mostly unique patterns see less benefit. + +Advanced Usage +============== + +Warm-Start with Pattern Cache +------------------------------ + +Pattern cache files store the best Q/DQ schemes from previous optimization runs. You can reuse these patterns on similar models or model versions: + +.. code-block:: bash + + # First optimization (cold start) + python -m modelopt.onnx.quantization.autotune \ + --model model_v1.onnx \ + --output ./run1 + + # The pattern cache is saved to ./run1/autotuner_state_pattern_cache.yaml + + # Second optimization with warm-start + python -m modelopt.onnx.quantization.autotune \ + --model model_v2.onnx \ + --output ./run2 \ + --pattern-cache ./run1/autotuner_state_pattern_cache.yaml + +By prioritizing cached schemes, the second test run has the potential to discover optimal configurations much more quickly. + +**When to use pattern cache:** + +* You're optimizing multiple versions of the same model +* You're optimizing models from the same family (e.g., different BERT variants) +* You want to transfer learned patterns across models + +Import Patterns from Existing QDQ Models +----------------------------------------- + +If you have a pre-quantized baseline model (e.g., from manual optimization or another tool), you can import its Q/DQ patterns: + +.. 
code-block:: bash + + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./results \ + --qdq-baseline manually_quantized.onnx + +The autotuner will: + +1. Extract Q/DQ insertion points from the baseline model +2. Map these points to region patterns +3. Use them as seed schemes during optimization + +This is useful for: + +* Starting from expert-tuned quantization schemes +* Comparing against reference implementations +* Fine-tuning existing quantized models + +Resume After Interruption +-------------------------- + +Long optimizations can be interrupted (Ctrl+C, cluster preemption, crashes) and automatically resumed: + +.. code-block:: bash + + # Start optimization + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./results + + # ... interrupted after 2 hours ... + + # Resume from checkpoint (just run the same command) + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./results + +The autotuner automatically: + +* Detects the state file (``autotuner_state.yaml``) +* Loads all previous measurements and best schemes +* Continues from the next unprofiled region + +Custom TensorRT Plugins +----------------------- + +If your model uses custom TensorRT operations, provide the plugin libraries: + +.. code-block:: bash + + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./results \ + --plugin-libraries /path/to/plugin1.so /path/to/plugin2.so + +Low-Level API Usage +=================== + +For maximum control, use the autotuner classes directly: + +Basic Workflow +-------------- + +.. code-block:: python + + import onnx + from modelopt.onnx.quantization.autotune import ( + QDQAutotuner, + Config, + TensorRTPyBenchmark + ) + + # Load model + model = onnx.load("model.onnx") + + # Initialize autotuner with automatic region discovery + autotuner = QDQAutotuner(model) + config = Config( + default_quant_type="int8", + verbose=True + ) + autotuner.initialize(config) + + # Setup TensorRT benchmark + benchmark = TensorRTPyBenchmark( + timing_cache_file="timing.cache", + warmup_runs=5, + timing_runs=100 + ) + + # Measure baseline (no Q/DQ) + autotuner.export_onnx("baseline.onnx", insert_qdq=False) + baseline_latency = benchmark.run("baseline.onnx") + autotuner.submit(baseline_latency) + print(f"Baseline: {baseline_latency:.2f} ms") + + # Profile each region + regions = autotuner.regions + print(f"Found {len(regions)} regions to optimize") + + for region_idx, region in enumerate(regions): + print(f"\nRegion {region_idx + 1}/{len(regions)}") + + # Set current profile region + autotuner.set_profile_region(region, commit=(region_idx > 0)) + + # Check if already profiled (for crash recovery) + if autotuner.current_profile_pattern_schemes is None: + print(" Already profiled, skipping") + continue + + # Generate and test schemes + for scheme_num in range(30): # Test 30 schemes per region + scheme_idx = autotuner.generate() + + if scheme_idx == -1: + print(f" No more unique schemes after {scheme_num}") + break + + # Export model with Q/DQ nodes + model_bytes = autotuner.export_onnx(None, insert_qdq=True) + + # Measure performance + latency = benchmark.run(model_bytes) + success = latency != float('inf') + autotuner.submit(latency, success=success) + + if success: + speedup = baseline_latency / latency + print(f" Scheme {scheme_idx}: {latency:.2f} ms ({speedup:.3f}x)") + + # Best scheme is automatically selected + ps = autotuner.current_profile_pattern_schemes + if ps and ps.best_scheme: 
+ print(f" Best: {ps.best_scheme.latency_ms:.2f} ms") + + # Commit final region + autotuner.set_profile_region(None, commit=True) + + # Export optimized model + autotuner.export_onnx("optimized_final.onnx", insert_qdq=True) + print("\nOptimization complete!") + +State Management +---------------- + +Save and load optimization state for crash recovery: + +.. code-block:: python + + # Save state after each region + autotuner.save_state("autotuner_state.yaml") + + # Load state to resume + autotuner = QDQAutotuner(model) + autotuner.initialize(config) + autotuner.load_state("autotuner_state.yaml") + + # Continue optimization from last checkpoint + # (regions already profiled will be skipped) + +Pattern Cache Management +------------------------ + +Create and use pattern caches: + +.. code-block:: python + + from modelopt.onnx.quantization.autotune import PatternCache + + # Load existing cache + cache = PatternCache.load("pattern_cache.yaml") + print(f"Loaded {cache.num_patterns} patterns") + + # Initialize autotuner with cache + autotuner = QDQAutotuner(model) + autotuner.initialize(config, pattern_cache=cache) + + # After optimization, pattern cache is automatically saved + # when you call save_state() + autotuner.save_state("autotuner_state.yaml") + # This also saves: autotuner_state_pattern_cache.yaml + +Import from QDQ Baseline +------------------------- + +Extract patterns from pre-quantized models: + +.. code-block:: python + + import onnx + from modelopt.onnx.quantization.autotune.qdq_utils import get_quantized_tensors + + # Load baseline model with Q/DQ nodes + baseline_model = onnx.load("quantized_baseline.onnx") + + # Extract quantized tensor names + quantized_tensors = get_quantized_tensors(baseline_model) + print(f"Found {len(quantized_tensors)} quantized tensors") + + # Import into autotuner + autotuner = QDQAutotuner(model) + autotuner.initialize(config) + autotuner.import_insertion_points(quantized_tensors) + + # These patterns will be tested first during optimization + +Configuration Options +===================== + +Config Class +------------ + +The ``Config`` class controls autotuner behavior: + +.. code-block:: python + + from modelopt.onnx.quantization.autotune import Config + + config = Config( + # Quantization settings + default_quant_type="int8", # "int8" or "fp8" + default_q_scale=0.1, # Default scale for Q/DQ nodes + default_q_zero_point=0, # Default zero-point (0 for int8) + + # Scheme generation settings + top_percent_to_mutate=0.1, # Top 10% schemes for mutation + minimum_schemes_to_mutate=10, # Min schemes to keep as seeds + maximum_mutations=3, # Max mutations per scheme + maximum_generation_attempts=100, # Max attempts to generate unique scheme + + # Pattern cache settings + pattern_cache_minimum_distance=4, # Min edit distance for diversity + pattern_cache_max_entries_per_pattern=32, # Max schemes per pattern + + # Region discovery settings + maximum_sequence_region_size=10, # Max nodes in sequence regions + minimum_topdown_search_size=10, # Min nodes for top-down search + + # Logging + verbose=True # Detailed logging + ) + +Command-Line Arguments +---------------------- + +Full list of CLI options: + +.. 
code-block:: text + + Model and Output: + --model, -m Path to ONNX model file + --output, -o Output directory (default: ./autotuner_output) + + Autotuning Strategy: + --schemes-per-region, -s Number of schemes per region (default: 30) + --pattern-cache Pattern cache YAML file for warm-start + --qdq-baseline QDQ baseline model to import patterns + --state-file State file path for resume capability + + Quantization: + --quant-type Quantization type: int8 or fp8 (default: int8) + + TensorRT Benchmark: + --timing-cache TensorRT timing cache file + --warmup-runs Number of warmup runs (default: 5) + --timing-runs Number of timing runs (default: 20) + --plugin-libraries TensorRT plugin .so files (optional) + + Logging: + --verbose, -v Enable debug logging + +Best Practices +============== + +Choosing Scheme Count +--------------------- + +The ``--schemes-per-region`` parameter controls exploration depth: + +* **30-50 schemes**: Fast exploration, good for quick experiments +* **50-100 schemes**: Balanced (recommended for most cases) +* **100-200+ schemes**: Thorough exploration, use with pattern cache + + +For models with many small regions, start with fewer schemes. For models with many big regions, start with more schemes. + +Managing Optimization Time +-------------------------- + +Optimization time depends on: + +* **Number of unique patterns** (not total regions) +* **Schemes per region** +* **TensorRT engine build time** (model complexity) + +**Time Estimation Formula:** + +Total time ≈ (m unique patterns) × (n schemes per region) × (t seconds per benchmark) + baseline measurement + +Where: +- **m** = number of unique region patterns in your model +- **n** = schemes per region (e.g., 30) +- **t** = average benchmark time (typically 3-10 seconds, depends on model size) + +**Example Calculations:** + +Assuming t = 5 seconds per benchmark: + +* Small model: 10 patterns × 30 schemes × 5s = **25 minutes** +* Medium model: 50 patterns × 30 schemes × 5s = **2.1 hours** +* Large model: 100 patterns × 30 schemes × 5s = **4.2 hours** + +Note: Actual benchmark times may depend on TensorRT engine build complexity and GPU hardware. + +**Strategies to reduce time:** + +1. Use pattern cache from similar models (warm-start) +2. Reduce schemes per region for initial exploration +3. Use crash recovery to split optimization across sessions + +Using Pattern Cache Effectively +-------------------------------- + +Pattern cache is most effective when: + +* Models share architectural patterns (e.g., BERT → RoBERTa) +* You're iterating on the same model (v1 → v2 → v3) +* You're optimizing a model family + +**Building a pattern library:** + +.. code-block:: bash + + # Optimize first model and save patterns + python -m modelopt.onnx.quantization.autotune \ + --model bert_base.onnx \ + --output ./bert_base_run \ + --schemes-per-region 50 + + # Use patterns for similar models + python -m modelopt.onnx.quantization.autotune \ + --model bert_large.onnx \ + --output ./bert_large_run \ + --pattern-cache ./bert_base_run/pattern_cache.yaml + + python -m modelopt.onnx.quantization.autotune \ + --model roberta_base.onnx \ + --output ./roberta_run \ + --pattern-cache ./bert_base_run/pattern_cache.yaml + +Interpreting Results +-------------------- + +The autotuner reports speedup ratios: + +.. code-block:: text + + Baseline: 12.50 ms + Final: 9.80 ms (1.276x speedup) + +**What does the speedup ratio mean:** + +The speedup ratio is the ratio of the baseline latency to the final latency. 
In this example, 12.50 ms / 9.80 ms ≈ 1.276, so the optimized model runs about 1.276 times faster than the baseline, i.e., its latency is roughly 22% lower. + +**If speedup is low (<1.1x):** + +* Model may already be memory-bound (not compute-bound) +* Q/DQ overhead dominates small operations +* TensorRT may not fully exploit quantization for this architecture +* Try FP8 instead of INT8 + +Deploying Optimized Models +=========================== + +The optimized ONNX model contains Q/DQ nodes and is ready for TensorRT deployment: + +Using trtexec +------------- + +.. code-block:: bash + + # Build TensorRT engine from optimized ONNX + trtexec --onnx=optimized_final.onnx \ + --saveEngine=model.engine \ + --stronglyTyped + + # Run inference + trtexec --loadEngine=model.engine + +Using TensorRT Python API +-------------------------- + +.. code-block:: python + + import tensorrt as trt + import numpy as np + + # Create builder and logger + logger = trt.Logger(trt.Logger.WARNING) + builder = trt.Builder(logger) + network = builder.create_network( + 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) + ) + parser = trt.OnnxParser(network, logger) + + # Parse optimized ONNX model + with open("optimized_final.onnx", "rb") as f: + if not parser.parse(f.read()): + for error in range(parser.num_errors): + print(parser.get_error(error)) + raise RuntimeError("Failed to parse ONNX") + + # Build engine + config = builder.create_builder_config() + engine = builder.build_serialized_network(network, config) + + # Save engine + with open("model.engine", "wb") as f: + f.write(engine) + + print("TensorRT engine built successfully!") + +Troubleshooting +=============== + +Common Issues +------------- + +**Issue: "Benchmark instance not initialized"** + +.. code-block:: python + + # Solution: Initialize benchmark before running workflow + from modelopt.onnx.quantization.autotune.workflows import init_benchmark_instance + init_benchmark_instance() + +**Issue: All schemes show inf latency** + +Possible causes: + +* TensorRT cannot parse the ONNX model +* Model contains unsupported operations +* Missing custom plugin libraries + +.. code-block:: bash + + # Solution: Check TensorRT logs in ./output/logs/ + # Add plugins if needed + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --plugin-libraries /path/to/plugin.so + +**Issue: Optimization is very slow** + +* Check number of unique patterns (shown at start) +* Reduce schemes per region for faster exploration +* Use pattern cache from similar model + +.. code-block:: bash + + # Faster exploration with fewer schemes + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --schemes-per-region 15 + +**Issue: Out of GPU memory during optimization** + +TensorRT engine building is GPU memory intensive: + +* Close other GPU processes +* Use smaller batch size in ONNX model if applicable +* Run optimization on a GPU with more memory + +**Issue: Final speedup is below 1.0x (slowdown)** + +The model may not benefit from quantization: + +* Try FP8 instead of INT8 +* Check if model is memory-bound (not compute-bound) +* Verify TensorRT can optimize the quantized operations + +**Issue: Resume doesn't work after interruption** + +* Ensure output directory is the same +* Check that ``autotuner_state.yaml`` exists +* If corrupted, delete state file and restart + +Debugging +--------- + +Enable verbose logging to see detailed information: + +.. code-block:: bash + + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --verbose + +Check TensorRT build logs for each scheme: + +.. 
code-block:: bash + + # Logs are saved per scheme + ls ./output/logs/ + # baseline.log + # region_0_scheme_0.log + # region_0_scheme_1.log + # ... + + # View a specific log + cat ./output/logs/region_0_scheme_0.log + +Inspect Region Discovery +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To understand how the autotuner partitions your model into regions, use the region inspection tool: + +.. code-block:: bash + + # Basic inspection - shows region hierarchy and statistics + python -m modelopt.onnx.quantization.autotune.region_search \ + --model model.onnx + + # Verbose mode for detailed debug information + python -m modelopt.onnx.quantization.autotune.region_search \ + --model model.onnx \ + --verbose + + # Custom maximum sequence size (default: 10) + python -m modelopt.onnx.quantization.autotune.region_search \ + --model model.onnx \ + --max-sequence-size 20 + + # Include all regions (even without quantizable operations) + python -m modelopt.onnx.quantization.autotune.region_search \ + --model model.onnx \ + --include-all-regions + +**What this tool shows:** + +* **Region hierarchy**: How your model is partitioned into LEAF and COMPOSITE regions +* **Region types**: Convergence patterns (divergence→branches→convergence) vs sequences +* **Node counts**: Number of operations in each region +* **Input/output tensors**: Data flow boundaries for each region +* **Coverage statistics**: Percentage of nodes in the model covered by regions +* **Size distribution**: Histogram showing region sizes + +**When to use:** + +* Before optimization: Understand how many unique patterns to expect +* Slow optimization: Check if model has too many unique patterns +* Debugging: Verify region discovery is working correctly +* Model analysis: Understand computational structure + +**Example output:** + +.. code-block:: text + + Phase 1 complete: 45 regions, 312/312 nodes (100.0%) + Phase 2 complete: refined 40 regions, skipped 5 + Summary: 85 regions (80 LEAF, 5 COMPOSITE), 312/312 nodes (100.0%) + LEAF region sizes: min=1, max=15, avg=3.9 + + ├─ Region 0 (Level 0, Type: COMPOSITE) + │ ├─ Direct nodes: 0 + │ ├─ Total nodes (recursive): 28 + │ ├─ Children: 4 + │ ├─ Inputs: 3 tensors + │ └─ Outputs: 2 tensors + │ ├─ Region 1 (Level 1, Type: LEAF) + │ │ ├─ Direct nodes: 5 + │ │ ├─ Nodes: Conv, BatchNormalization, Relu + │ ... + +This helps you understand: + +* **Number of patterns**: More regions = more unique patterns = longer optimization +* **Region sizes**: Very large regions might need adjustment via ``--max-sequence-size`` +* **Model structure**: Identifies divergent/convergent patterns (skip connections, branches) + +API Reference +============= + +For detailed API documentation, see :doc:`../reference/2_qdq_placement`. 
+ +Key Classes: + +* :class:`~modelopt.onnx.quantization.autotune.QDQAutotuner` - Main autotuner with automatic region discovery +* :class:`~modelopt.onnx.quantization.autotune.Config` - Configuration parameters +* :class:`~modelopt.onnx.quantization.autotune.PatternCache` - Pattern cache for warm-start +* :class:`~modelopt.onnx.quantization.autotune.Region` - Hierarchical subgraph representation +* :class:`~modelopt.onnx.quantization.autotune.InsertionScheme` - Q/DQ insertion point collection + +Key Functions: + +* :func:`~modelopt.onnx.quantization.autotune.workflows.region_pattern_autotuning_workflow` - Complete optimization workflow +* :func:`~modelopt.onnx.quantization.autotune.workflows.benchmark_onnx_model` - Benchmark model with TensorRT + +Frequently Asked Questions +========================== + +**Q: How long does optimization take?** + +A: Optimization time is: (unique patterns) × (schemes per region) × (benchmark time). For example, with 30 schemes/region and 5 seconds/benchmark: 10 patterns = 25 minutes, 50 patterns = 2.1 hours, 100 patterns = 4.2 hours. The number of unique patterns depends on your model's architectural diversity—models with repeated structures (like transformers) have fewer unique patterns. Use pattern cache to significantly reduce time for similar models. + +**Q: Can I stop optimization early?** + +A: Yes! Press Ctrl+C to interrupt. The progress is saved and you can resume later. + +**Q: Do I need calibration data?** + +A: No, the autotuner focuses on Q/DQ placement optimization, not calibration. Calibration scales are added when the Q/DQ nodes are inserted. For best accuracy, run calibration separately after optimization. + +**Q: Can I use this with PyTorch models?** + +A: Export your PyTorch model to ONNX first using ``torch.onnx.export()``, then run the autotuner on the ONNX model. + +**Q: What's the difference from modelopt.onnx.quantization.quantize()?** + +A: ``quantize()`` is a fast PTQ tool that uses heuristics for Q/DQ placement. The autotuner uses TensorRT measurements to optimize placement for best performance. Use ``quantize()`` for quick results, autotuner for maximum performance. + +**Q: Can I customize region discovery?** + +A: Yes, inherit from ``QDQAutotunerBase`` and provide your own regions instead of using automatic discovery: + +.. code-block:: python + + from modelopt.onnx.quantization.autotune import QDQAutotunerBase, Region + + class CustomAutotuner(QDQAutotunerBase): + def __init__(self, model, custom_regions): + super().__init__(model) + self.regions = custom_regions # Your custom regions + +**Q: Does this work with dynamic shapes?** + +A: The autotuner uses TensorRT for benchmarking, which requires fixed shapes. Set fixed input shapes in your ONNX model before optimization. + +**Q: Can I optimize for accuracy instead of latency?** + +A: Currently, the autotuner optimizes for latency. + +Examples +======== + +Example 1: Basic Optimization +------------------------------ + +.. code-block:: bash + + # Optimize a ResNet model with INT8 quantization + python -m modelopt.onnx.quantization.autotune \ + --model resnet50.onnx \ + --output ./resnet50_optimized \ + --quant-type int8 \ + --schemes-per-region 30 + +Example 2: Transfer Learning with Pattern Cache +------------------------------------------------ + +.. 
code-block:: bash + + # Optimize GPT-2 small + python -m modelopt.onnx.quantization.autotune \ + --model gpt2_small.onnx \ + --output ./gpt2_small_run \ + --quant-type fp8 \ + --schemes-per-region 50 + + # Reuse patterns for GPT-2 medium (much faster) + python -m modelopt.onnx.quantization.autotune \ + --model gpt2_medium.onnx \ + --output ./gpt2_medium_run \ + --quant-type fp8 \ + --pattern-cache ./gpt2_small_run/pattern_cache.yaml + +Example 3: Import from Manual Baseline +--------------------------------------- + +.. code-block:: bash + + # You have a manually quantized baseline + # Import its patterns as starting point + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./auto_optimized \ + --qdq-baseline manually_quantized.onnx \ + --schemes-per-region 40 + +Example 4: Full Python Workflow +-------------------------------- + +.. code-block:: python + + from pathlib import Path + from modelopt.onnx.quantization.autotune.workflows import ( + region_pattern_autotuning_workflow, + init_benchmark_instance + ) + + # Initialize TensorRT benchmark + init_benchmark_instance( + timing_cache_file="/tmp/trt_cache.cache", + warmup_runs=5, + timing_runs=20 + ) + + # Run optimization + autotuner = region_pattern_autotuning_workflow( + model_path="model.onnx", + output_dir=Path("./results"), + num_schemes_per_region=30, + quant_type="int8", + pattern_cache_file=None, # Cold start + qdq_baseline_model=None # No baseline import + ) + + # Access results + print(f"Baseline latency: {autotuner.baseline_latency_ms:.2f} ms") + print(f"Number of patterns: {len(autotuner.profiled_patterns)}") + + # Pattern cache is automatically saved during workflow + # Check the output directory for autotuner_state_pattern_cache.yaml + if autotuner.pattern_cache: + print(f"Pattern cache contains {autotuner.pattern_cache.num_patterns} patterns") + +Conclusion +========== + +The ``modelopt.onnx.quantization.autotune`` module provides a powerful automated approach to Q/DQ placement optimization. By combining automatic region discovery, pattern-based optimization, and TensorRT performance measurement, it finds optimal quantization strategies without manual tuning. + +**Next Steps:** + +* Try the quick start example on your model +* Experiment with different ``--schemes-per-region`` values +* Build a pattern cache library for your model family +* Integrate optimized models into your deployment pipeline + +For architectural details and API reference, see :doc:`../reference/2_qdq_placement`. diff --git a/docs/source/reference/2_qdq_placement.rst b/docs/source/reference/2_qdq_placement.rst new file mode 100644 index 000000000..78a48ac0f --- /dev/null +++ b/docs/source/reference/2_qdq_placement.rst @@ -0,0 +1,1092 @@ +==================================================== +Automatic Q/DQ Placement Optimizer Architecture +==================================================== + +.. contents:: Table of Contents + :local: + :depth: 3 + +Overview +======== + +The ``modelopt.onnx.quantization.autotune`` module provides an automatic optimization framework for Quantize/Dequantize (Q/DQ) node placement in ONNX models. The system partitions ONNX computation graphs into smaller regions and systematically searches for optimal Q/DQ insertion points to minimize TensorRT inference latency while maintaining model accuracy. 
+ +**Key Capabilities:** + +* **Automatic Region Discovery**: Identifies optimization regions around compute-intensive operations +* **Pattern-Based Optimization**: Groups structurally similar regions and applies learned schemes across all instances +* **Performance-Driven Search**: Uses TensorRT profiling to measure actual inference latency and guide optimization +* **Incremental State Management**: Supports crash recovery and resumption of optimization sessions +* **Pattern Cache**: Enables warm-start optimization by reusing known-good schemes from previous runs +* **Baseline Import**: Transfers quantization patterns from existing QDQ models + +Architecture Overview +===================== + +Core Design Principles +---------------------- + +1. **Hierarchical Region Partitioning**: The module decomposes ONNX graphs into a hierarchical tree of regions, enabling focused optimization at different granularity levels. + +2. **Pattern-Based Scheme Sharing**: Regions with identical topological structure share the same pattern signature. Optimization schemes are portable across all regions matching a pattern, reducing the search space significantly. + +3. **Performance-Driven Selection**: Every insertion scheme is evaluated through actual TensorRT engine compilation and profiling, ensuring real-world performance gains. + +4. **Incremental Optimization**: Regions are optimized sequentially with the best scheme committed before proceeding to the next region, allowing progressive refinement. + +Module Structure +---------------- + +.. code-block:: text + + autotune/ + ├── Core API + │ ├── autotuner.py # QDQAutotuner and QDQAutotunerBase classes + │ ├── workflows.py # High-level workflow functions + │ └── common.py # Data structures (Region, Config, PatternCache, etc.) + │ + ├── Region Management + │ ├── region_search.py # Automatic region discovery and partitioning + │ └── region_pattern.py # Structural pattern analysis and matching + │ + ├── Q/DQ Insertion + │ ├── insertion_points.py # Insertion point definitions and resolution + │ └── qdq_utils.py # Q/DQ node analysis utilities + │ + ├── Benchmarking + │ └── tensorrt_utils.py # TensorRT benchmarking and graph utilities + │ + └── Entry Points + ├── __init__.py # Public API exports + └── __main__.py # Command-line interface + + +Key Components +============== + +1. Autotuner (autotuner.py) +--------------------------- + +The autotuner is the central orchestrator of the Q/DQ optimization process. 
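+
+In practice, every optimization run reduces to the same profile-and-commit loop built from the
+workflow methods listed below. The following condensed sketch is illustrative only; a complete,
+runnable example appears later under Usage Patterns:
+
+.. code-block:: python
+
+    def profile_region(autotuner, benchmark, region, num_schemes=30):
+        """Drive one region through the generate -> export -> benchmark -> submit loop."""
+        autotuner.set_profile_region(region, commit=True)
+        for _ in range(num_schemes):
+            if autotuner.generate() == -1:  # no more unique schemes for this pattern
+                break
+            model_bytes = autotuner.export_onnx(None, insert_qdq=True)
+            latency_ms = benchmark.run(model_bytes)  # TensorRT engine build + timing
+            autotuner.submit(latency_ms, success=latency_ms != float("inf"))
+        autotuner.save_state("autotuner_state.yaml")  # checkpoint for crash recovery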
+ +QDQAutotunerBase +~~~~~~~~~~~~~~~~ + +Base class providing core optimization functionality: + +* **Scheme Generation**: Creates candidate Q/DQ insertion schemes for regions +* **Model Export**: Generates ONNX models with specified Q/DQ insertions applied +* **Performance Tracking**: Records and ranks schemes by measured latency +* **State Persistence**: Saves/loads optimization progress for crash recovery + +**Key Attributes:** + +* ``graph``: Clean ONNX GraphSurgeon representation of the model +* ``regions``: List of regions to optimize (populated by subclass) +* ``profiled_patterns``: Pattern-based scheme results +* ``current_profile_region``: Region currently being optimized +* ``config``: Configuration parameters +* ``pattern_cache``: Seed schemes from previous optimization runs + +**Workflow Methods:** + +* ``initialize(config, pattern_cache)``: Configure autotuner and prepare for profiling +* ``set_profile_region(region, commit)``: Select region to profile and commit previous results +* ``generate()``: Generate a new insertion scheme for current region +* ``export_onnx(path, insert_qdq)``: Export model with Q/DQ nodes +* ``submit(latency, success)``: Record performance measurement for current scheme +* ``save_state(path)`` / ``load_state(path)``: Persist/restore optimization state + +QDQAutotuner +~~~~~~~~~~~~ + +Concrete implementation with automatic region discovery: + +* Inherits from ``QDQAutotunerBase`` +* Automatically discovers regions during initialization using ``CombinedRegionSearch`` +* Default choice for most use cases + +**Initialization Process:** + +1. Constructs root region encompassing entire graph +2. Runs combined region search to identify optimization candidates +3. Prepares region hierarchy for sequential optimization + +2. Region Management +-------------------- + +Region Partitioning (region_search.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The region search module implements hierarchical partitioning strategies to decompose ONNX graphs into optimization regions. + +**CombinedRegionSearch** + +Multi-strategy region discovery combining: + +* **Pattern-Based Search**: Identifies common subgraph patterns (Conv+BN+Relu, etc.) +* **Operation-Centered Search**: Creates regions around major quantizable operations (Conv, MatMul, Gemm) +* **Sequence Merging**: Combines adjacent linear operations into single regions +* **Hierarchical Composition**: Builds multi-level region trees + +**Region Discovery Algorithm:** + +1. **Bottom-Up Search**: Start from individual operations +2. **Local Expansion**: Expand forward/backward from seed nodes within step limits +3. **Pattern Recognition**: Identify and merge common computational patterns +4. **Hierarchy Construction**: Build parent-child relationships between regions + +**Key Classes:** + +* ``RegionSearchBase``: Base class with graph traversal utilities +* ``CombinedRegionSearch``: Main region discovery implementation +* Supports forward-only, backward-only, and bidirectional search + +**Region Types:** + +* ``LEAF``: Atomic regions containing only direct nodes +* ``COMPOSITE``: Hierarchical regions containing child regions + +Region Pattern Analysis (region_pattern.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Provides structural pattern matching for regions, enabling scheme portability. 
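+
+The intent behind pattern matching can be illustrated with a small, self-contained sketch
+(purely illustrative; the actual ``RegionPattern`` signature described below additionally
+encodes connectivity, child regions, and symmetry normalization):
+
+.. code-block:: python
+
+    import hashlib
+
+    def toy_signature(op_types, edges):
+        """Toy structural signature: hash of node op types plus intra-region wiring."""
+        canonical = "|".join(op_types) + "||" + ",".join(f"{s}->{d}" for s, d in sorted(edges))
+        return hashlib.sha256(canonical.encode()).hexdigest()[:16]
+
+    # Two Conv->BatchNormalization->Relu blocks from different parts of a model hash to
+    # the same signature, so a scheme tuned for one region can be reused for the other.
+    a = toy_signature(["Conv", "BatchNormalization", "Relu"], [(0, 1), (1, 2)])
+    b = toy_signature(["Conv", "BatchNormalization", "Relu"], [(0, 1), (1, 2)])
+    assert a == b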
+ +**RegionPattern Class** + +Represents the topological signature of a region: + +* **Signature Generation**: Creates deterministic hash from region structure + + - Node operation types + - Connectivity patterns (inputs/outputs per node) + - Child region structures (for composite regions) + - Handles symmetric operations (Add, Mul) order-invariantly + +* **Pattern Matching**: Groups regions by structural similarity +* **Insertion Point Resolution**: Resolves pattern-relative addresses to actual tensor names + +**Signature Components:** + +.. code-block:: text + + Pattern Signature = hash( + node_types_sorted + + connectivity_structure + + child_region_patterns + + symmetry_normalization + ) + +**Key Methods:** + +* ``from_region(region, graph)``: Generate pattern from region + +3. Q/DQ Insertion Points +------------------------ + +Insertion Point Types (insertion_points.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Defines three types of Q/DQ insertion locations: + +**NodeInputInsertionPoint** + +Inserts Q/DQ at a specific node input: + +* Pattern-relative node index +* Input tensor index (0, 1, 2, ...) +* Most common insertion type for quantizing operation inputs + +**RegionOutputInsertionPoint** + +Inserts Q/DQ at region output, only used for composite regions: + +* Pattern-relative child region index +* Output tensor index from that region +* Used for composite regions with child boundaries + +**ChildRegionInputInsertionPoint** + +Inserts Q/DQ at a child region input boundary: + +* Pattern-relative child region index +* Input tensor index to that region +* Enables quantization of data flowing into subregions + +**InsertionScheme** + +Collection of insertion points with performance metadata: + +* Set of insertion points (pattern-relative) +* Measured latency (ms) +* Success/failure status +* Unique fingerprint for deduplication + +**Resolution Process:** + +1. Take pattern-relative insertion points +2. Map node/region indices to actual graph elements +3. Resolve to concrete tensor names +4. Handle merging and deduplication +5. Generate Q/DQ nodes at specified locations + +Q/DQ Utilities (qdq_utils.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Low-level utility for Q/DQ node analysis: + +* ``get_quantized_tensors(model)``: Extract tensor names with Q/DQ nodes from ONNX model + +This function is primarily used for importing Q/DQ patterns from existing quantized models +into the autotuner for warm-start optimization. + +4. Workflows (workflows.py) +--------------------------- + +High-level workflow functions orchestrating the complete optimization process. + +region_pattern_autotuning_workflow +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Main workflow for pattern-based Q/DQ optimization: + +**Workflow Steps:** + +1. **Initialization** + + * Load ONNX model + * Create autotuner with automatic region discovery + * Load pattern cache (if provided) + * Import patterns from QDQ baseline (if provided) + +2. **Baseline Measurement** + + * Export model without Q/DQ nodes + * Benchmark with TensorRT to establish baseline latency + +3. **Region Profiling Loop** + + For each discovered region: + + * Set as current profile region + * Generate N insertion schemes (default: 30) + * For each scheme: + + - Export ONNX model with Q/DQ nodes applied + - Build TensorRT engine and measure latency + - Submit result to autotuner + + * Commit best scheme for region + * Save incremental state (crash recovery) + +4. 
**Finalization** + + * Export final optimized model with all best schemes + * Measure final latency and compute speedup + * Save complete state and pattern cache + +**Key Features:** + +* **Automatic Resume**: Detects existing state file and continues from last checkpoint +* **Pattern Cache Warm-Start**: Seeds scheme generation with known-good patterns +* **Baseline Import**: Extracts quantization patterns from existing QDQ models +* **Progressive Saving**: State saved after each region for crash recovery + +Benchmarking Functions +~~~~~~~~~~~~~~~~~~~~~~ + +* ``benchmark_onnx_model(model_path)``: Benchmark ONNX model with TensorRT +* ``init_benchmark_instance(timing_cache, warmup, timing)``: Initialize global TensorRT benchmark + +5. Benchmarking (tensorrt_utils.py) +------------------------------------ + +TensorRT integration for performance measurement and graph utilities. + +Benchmark Classes +~~~~~~~~~~~~~~~~~ + +**Abstract Benchmark Interface:** + +* ``run(model_path, log_file)``: Benchmark model and return median latency (ms) + +**TensorRTPyBenchmark** (Default) + +Uses TensorRT Python API: + +* Direct Python bindings to TensorRT +* Persistent Builder/Runtime/Logger instances +* Efficient for repeated benchmarking +* Timing cache support for faster engine builds +* Custom TensorRT plugin library loading + +**TrtExecBenchmark** (Alternative) + +Uses ``trtexec`` command-line tool: + +* Spawns subprocess for each benchmark +* More isolated but slower +* Useful when Python API has issues +* Custom TensorRT plugin library loading + +**Benchmarking Process:** + +1. Parse ONNX model +2. Build TensorRT engine with optimization +3. Load timing cache (if available) +4. Run warmup iterations (default: 5) +5. Run timing iterations (default: 10-100) +6. Compute median latency +7. Update timing cache + +**Configuration:** + +* Timing cache file path (persistent across runs) +* Warmup iterations (eliminate cold-start effects) +* Timing iterations (statistical stability) +* Plugin library paths (custom TensorRT plugins) + +Graph Utilities +~~~~~~~~~~~~~~~ + +**get_tensor_consumer_node_indices(graph)** + +Builds a mapping from tensor names to the indices of nodes that consume them: + +* Input: ONNX GraphSurgeon graph +* Output: Dictionary mapping tensor names to lists of consuming node indices +* Used for efficient graph traversal and insertion point resolution + +6. Configuration (common.py) +----------------------------- + +Config Class +~~~~~~~~~~~~ + +Central configuration for autotuning behavior. Controls the autotuning process including +performance requirements, quantization parameters, region building, scheme generation, and +pattern cache behavior. + +**Logging:** + +* ``verbose`` (bool): Enable detailed logging of autotuning progress (default: False) + +**Quantization Parameters:** + +* ``default_q_scale`` (float): Default scale parameter for Q/DQ nodes. Controls quantization + granularity. Typical range: 0.01-0.1 (default: 0.1) +* ``default_q_zero_point`` (int): Default zero-point for Q/DQ nodes. Use 0 for signed int8, + 128 for unsigned uint8 (default: 0) +* ``default_quant_type`` (str): Quantization type for Q/DQ nodes. Options: "int8" (default), "fp8" + +**Region Builder Settings:** + +* ``maximum_sequence_region_size`` (int): Maximum number of nodes in a sequence region during + top-down refinement. 
Prevents overly large merged regions (default: 10) +* ``minimum_topdown_search_size`` (int): Minimum number of nodes in a region to trigger + top-down search during region building (default: 10) + +**Scheme Generation Settings:** + +* ``top_percent_to_mutate`` (float): Top percentage of best schemes to use as mutation seeds + during scheme generation. Range: 0.0-1.0 (default: 0.1 = top 10%) +* ``minimum_schemes_to_mutate`` (int): Minimum number of schemes to keep as mutation seeds, + even if top_percent_to_mutate results in fewer (default: 10) +* ``maximum_mutations`` (int): Maximum number of mutations to apply to a single scheme + during generation (default: 3) +* ``maximum_generation_attempts`` (int): Maximum attempts to generate a unique new scheme + before giving up (default: 100) + +**Pattern Cache Settings:** + +* ``pattern_cache_minimum_distance`` (int): Minimum edit distance required between schemes in cache. + When adding schemes, if a scheme is too similar (distance < minimum_distance) to an existing + scheme, only the better-performing one is kept (default: 4) +* ``pattern_cache_max_entries_per_pattern`` (int): Maximum number of schemes to keep per pattern + in pattern cache. Only the top N best-performing schemes are kept for each pattern. + Use 0 to keep all schemes (default: 32) + +**Example:** + +.. code-block:: python + + from modelopt.onnx.quantization.autotune import Config + + config = Config( + # Quantization settings + default_quant_type="fp8", + default_q_scale=0.05, + + # Scheme generation + top_percent_to_mutate=0.2, # Use top 20% schemes as seeds + maximum_mutations=5, # More aggressive mutation + + # Pattern cache + pattern_cache_minimum_distance=2, # Require more diversity + pattern_cache_max_entries_per_pattern=64, # Keep more schemes + + # Logging + verbose=True + ) + +PatternCache Class +~~~~~~~~~~~~~~~~~~ + +Stores top-performing schemes for pattern-based warm-start: + +* Maps pattern signatures to ``PatternSchemes`` +* Maintains diversity through distance-based filtering +* Limits entries per pattern to avoid bloat +* Serializable to YAML for persistence + +**Cache Operations:** + +* ``add_scheme(pattern, scheme)``: Add scheme with diversity check +* ``get_schemes(pattern)``: Retrieve schemes for pattern +* ``save(path)`` / ``load(path)``: Persist to file + +Region Class +~~~~~~~~~~~~ + +Hierarchical subgraph representation: + +**Attributes:** + +* ``id``: Unique identifier +* ``level``: Hierarchical level (0=leaf, higher=composite) +* ``type``: RegionType (LEAF/COMPOSITE) +* ``parent``: Parent region reference +* ``children``: List of child regions +* ``nodes``: Set of direct node indices +* ``inputs``: Input tensor names +* ``outputs``: Output tensor names + +**Methods:** + +* Hierarchy navigation (parent/children access) +* Node management (direct vs recursive nodes) +* Boundary computation (inputs/outputs) +* Metadata storage + + +Autotuning Workflow +=================== + +Complete Optimization Process +------------------------------ + +.. code-block:: text + + ┌─────────────────────────────────────────────────────────────┐ + │ 1. Model Loading & Initialization │ + │ • Load ONNX model │ + │ • Create QDQAutotuner instance │ + │ • Run automatic region discovery │ + │ • Load pattern cache (warm-start) │ + │ • Import patterns from QDQ baseline (optional) │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 2. 
Baseline Measurement │ + │ • Export model without Q/DQ nodes │ + │ • Build TensorRT engine │ + │ • Measure baseline latency │ + │ • Submit to autotuner │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 3. Pattern-Based Region Profiling │ + │ ┌───────────────────────────────────────────┐ │ + │ │ For each region: │ │ + │ │ • Set as current profile region │ │ + │ │ • Check if pattern already profiled │ │ + │ │ • Generate N insertion schemes │ │ + │ │ ┌─────────────────────────────┐ │ │ + │ │ │ For each scheme: │ │ │ + │ │ │ • Generate unique scheme │ │ │ + │ │ │ • Export model with Q/DQ │ │ │ + │ │ │ • Build TRT engine │ │ │ + │ │ │ • Measure latency │ │ │ + │ │ │ • Submit result │ │ │ + │ │ └─────────────────────────────┘ │ │ + │ │ • Select best scheme for pattern │ │ + │ │ • Commit scheme (applies to all │ │ + │ │ regions with this pattern) │ │ + │ │ • Save incremental state │ │ + │ └───────────────────────────────────────────┘ │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 4. Finalization │ + │ • Commit final region │ + │ • Export optimized model with all best schemes │ + │ • Measure final latency │ + │ • Compute speedup ratio │ + │ • Save complete state file │ + │ • Save pattern cache for future runs │ + └─────────────────────────────────────────────────────────────┘ + +Scheme Generation Process +-------------------------- + +For each region being profiled: + +1. **Pattern Identification**: Compute structural pattern signature +2. **Pattern Schemes Initialization**: Create or retrieve ``PatternSchemes`` for pattern +3. **Cache Seeding**: Add schemes from pattern cache (warm-start) +4. **Iterative Generation**: Generate new schemes up to configured limit + + * Random selection of insertion points + * Diversity filtering (avoid duplicates) + * Pattern-relative addressing + +5. **Evaluation**: Each scheme is exported, benchmarked, and ranked +6. **Best Selection**: Scheme with lowest latency becomes pattern's best scheme + +Pattern-Relative Addressing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Schemes are defined using pattern-relative indices: + +.. code-block:: python + + # Pattern-relative insertion point + NodeInputInsertionPoint(node_index=2, input_index=0) + + # Resolved to actual tensor for Region A + "conv1_output" # Node 2 in Region A's pattern + + # Resolved to actual tensor for Region B (same pattern) + "conv5_output" # Node 2 in Region B's pattern + +This portability enables: + +* One optimization per pattern instead of per region +* Transfer learning across similar models +* Significant reduction in search space + +State Management +---------------- + +Incremental State Saving +~~~~~~~~~~~~~~~~~~~~~~~~~ + +State is saved after each region optimization: + +**State File Contents (YAML):** + +.. code-block:: yaml + + baseline_latency_ms: 12.5 + profiled_patterns: + pattern_abc123: + schemes: + - insertion_points: [...] + latency_ms: 11.2 + success: true + - insertion_points: [...] + latency_ms: 11.8 + success: true + best_scheme_index: 0 + profiled_regions: + - region_id: 1 + scheme_index: 0 + committed: true + - region_id: 2 + scheme_index: 0 + committed: false + +**Crash Recovery:** + +If optimization is interrupted: + +1. Rerun workflow with same output directory +2. State file is automatically detected and loaded +3. Already-profiled patterns are skipped +4. 
Optimization continues from next unprofiled region + +Pattern Cache +------------- + +Warm-Start Optimization +~~~~~~~~~~~~~~~~~~~~~~~ + +Pattern cache files store top-performing schemes: + +**Cache File Structure (YAML):** + +.. code-block:: yaml + + patterns: + pattern_def456: + signature: "def456..." + schemes: + - insertion_points: [...] + latency_ms: 9.8 + distance: 5 + - insertion_points: [...] + latency_ms: 10.1 + distance: 7 + max_entries: 16 + +**Usage:** + +1. After first optimization, pattern cache saved automatically +2. For similar models, load cache at initialization +3. Cache schemes tested first before random generation +4. Enables faster convergence to optimal solutions + +**Diversity Filtering:** + +* Schemes are filtered by minimum Hamming distance +* Ensures cache contains diverse candidates +* Prevents redundant similar schemes + +Region Discovery Details +======================== + +Hierarchical Partitioning Strategy +----------------------------------- + +The region search algorithm builds a hierarchical tree of regions: + +Level 0: Leaf Regions +~~~~~~~~~~~~~~~~~~~~~~ + +* Individual operations or small operation sequences +* Conv, MatMul, Gemm, Add, etc. +* Forward/backward expansion around seed nodes +* Direct boundary computation + +Level 1+: Composite Regions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +* Merging of related leaf regions +* Pattern-based combination (Conv+BN+Relu) +* Sequence merging (Linear→Linear→Linear) +* Hierarchical boundaries (child inputs/outputs) + +Region Boundaries +----------------- + +Input Tensors +~~~~~~~~~~~~~ + +Tensors consumed by region nodes but produced outside: + +* From model inputs +* From nodes in other regions +* Used to determine Q/DQ insertion at region entry + +Output Tensors +~~~~~~~~~~~~~~ + +Tensors produced by region nodes and consumed outside: + +* By nodes in other regions +* As model outputs +* Used to determine Q/DQ insertion at region exit + +Boundary Computation Algorithm +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1. Collect all tensors consumed by region nodes +2. Filter out tensors produced within region +3. Remaining = input boundary tensors +4. Collect all tensors produced by region nodes +5. Filter out tensors only consumed within region +6. 
Remaining = output boundary tensors + +Insertion Point Selection +========================= + +Types of Insertion Points +-------------------------- + +The autotuner considers multiple insertion strategies: + +Node Input Quantization +~~~~~~~~~~~~~~~~~~~~~~~ + +Quantize data flowing into specific operations: + +* Most common strategy +* Targets compute-intensive ops (Conv, MatMul) +* Reduces precision of activations +* Can be applied to individual inputs of multi-input operations + +Region Output Quantization +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Quantize data at child region boundaries: + +* For composite regions +* Quantizes outputs of entire subgraphs +* Useful for quantizing residual connections +* Maintains precision within subregions + +Child Region Input Quantization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Quantize data entering child regions: + +* Complements region output quantization +* Controls precision at subregion boundaries +* Enables hierarchical quantization strategies + +Scheme Generation Strategies +----------------------------- + +Random Sampling +~~~~~~~~~~~~~~~ + +* Randomly select subset of available insertion points +* Probability-based selection (configurable) +* Generates diverse candidate schemes +* Default strategy for exploration + +Cache-Guided Sampling +~~~~~~~~~~~~~~~~~~~~~ + +* When pattern cache available, test cached schemes first +* Provides warm-start for faster convergence +* Falls back to random sampling after cache exhausted + +Diversity Filtering +~~~~~~~~~~~~~~~~~~~ + +* Compute Hamming distance between schemes +* Reject schemes too similar to already-tested ones +* Ensures exploration of diverse configurations +* Minimum distance threshold configurable + +Performance Evaluation +====================== + +TensorRT Engine Building +------------------------- + +For each scheme: + +1. **ONNX Export**: Generate model with Q/DQ nodes applied +2. **Parser**: TensorRT parses ONNX graph +3. **Optimization**: TensorRT layer fusion, kernel selection +4. **Timing Cache**: Reuse measured kernel timings +5. **Engine Build**: Generate optimized engine binary + +Latency Measurement +------------------- + +Benchmarking Protocol: + +1. **Engine Loading**: Load built engine to GPU +2. **Warmup Phase**: Run N iterations (default: 5) + + * Eliminate cold-start effects + * Prime GPU caches + +3. **Timing Phase**: Run M iterations (default: 10-100) + + * Measure end-to-end latency per iteration + * Synchronize GPU after each iteration + +4. **Aggregation**: Compute median latency (robust to outliers) + +Best Scheme Selection +---------------------- + +For each pattern: + +* Track all successfully-benchmarked schemes +* Rank by measured latency (lower is better) +* Select scheme with minimum latency +* Apply to all regions matching pattern + +Usage Patterns +============== + +Command-Line Interface +---------------------- + +Basic Usage +~~~~~~~~~~~ + +.. code-block:: bash + + # Default INT8 quantization + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./output + + # FP8 quantization with more exploration + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./output \ + --quant-type fp8 \ + --schemes-per-region 50 + +Advanced Usage +~~~~~~~~~~~~~~ + +.. 
code-block:: bash + + # With pattern cache warm-start + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./output \ + --pattern-cache ./previous_run/pattern_cache.yaml + + # Import patterns from existing QDQ model + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./output \ + --qdq-baseline quantized_baseline.onnx + + # Resume after interruption (automatic) + python -m modelopt.onnx.quantization.autotune \ + --model model.onnx \ + --output ./output + # Automatically detects and loads state file + +Python API +---------- + +High-Level Workflow +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from pathlib import Path + from modelopt.onnx.quantization.autotune.workflows import ( + region_pattern_autotuning_workflow + ) + + # Pattern-based optimization (recommended) + autotuner = region_pattern_autotuning_workflow( + model_path="model.onnx", + output_dir=Path("./output"), + num_schemes_per_region=30, + quant_type="int8", + pattern_cache_file="pattern_cache.yaml", + qdq_baseline_model="baseline.onnx" + ) + +Low-Level API +~~~~~~~~~~~~~ + +.. code-block:: python + + import onnx + from modelopt.onnx.quantization.autotune import ( + QDQAutotuner, Config, TensorRTPyBenchmark + ) + + # Load and initialize + model = onnx.load("model.onnx") + autotuner = QDQAutotuner(model) + config = Config(default_quant_type="fp8") + autotuner.initialize(config) + + # Setup benchmark + benchmark = TensorRTPyBenchmark( + timing_cache_file="/tmp/timing.cache" + ) + + # Measure baseline + autotuner.export_onnx("baseline.onnx", insert_qdq=False) + baseline_latency = benchmark.run("baseline.onnx") + autotuner.submit(baseline_latency) + + # Profile each region + for region in autotuner.regions: + autotuner.set_profile_region(region, commit=True) + + for _ in range(30): # Test 30 schemes + scheme_idx = autotuner.generate() + if scheme_idx == -1: + break + + model_bytes = autotuner.export_onnx(None, insert_qdq=True) + latency = benchmark.run(model_bytes) + autotuner.submit(latency, success=(latency != float('inf'))) + + # Finalize + autotuner.set_profile_region(None, commit=True) + autotuner.export_onnx("optimized.onnx", insert_qdq=True) + +Design Rationale +================ + +Pattern-Based Optimization +-------------------------- + +The autotuner uses a pattern-based optimization approach: + +**How It Works:** + +* Regions with identical structural patterns are grouped together +* Each unique pattern is optimized once with N schemes tested +* The best scheme for a pattern is automatically applied to all regions matching that pattern +* This dramatically reduces the number of benchmarks required + +**Benefits:** + +* **Efficiency**: Optimize each unique pattern once instead of every region independently +* **Consistency**: All structurally similar regions use the same quantization strategy +* **Scalability**: Time scales with number of unique patterns, not total regions +* **Transfer Learning**: Pattern cache enables warm-start on similar models + +**Trade-offs:** + +* Assumes structural similarity implies performance similarity +* May not capture performance variations due to different input/output contexts +* Models with many unique patterns see less benefit + +**Best For:** + +* Models with repeated structures (transformers, ResNets, etc.) 
+* Most production models where consistent quantization is desirable +* Scenarios where optimization time is constrained + +Forward-Only Region Search +--------------------------- + +The current implementation focuses on forward (downstream) region expansion: + +* Simpler boundary computation +* Aligns with typical dataflow (inputs → outputs) +* Sufficient for most optimization scenarios +* Backward expansion can be added if needed + +Hierarchical vs Flat Regions +----------------------------- + +Hierarchical region structure provides: + +* **Multi-Granularity Optimization**: Can optimize at different abstraction levels +* **Composability**: Child regions can be optimized independently +* **Scalability**: Handles large models by partitioning into manageable pieces +* **Pattern Reuse**: Patterns can be defined at multiple levels + +Incremental State Saving +------------------------- + +State is saved after each region instead of at the end: + +* **Crash Recovery**: Long optimizations (hours/days) can be resumed +* **Early Access**: Partial results available before completion +* **Debugging**: Can inspect intermediate state +* **Resource Management**: Can pause/resume optimization as needed + +Limitations and Future Work +============================ + +Current Limitations +------------------- + +1. **Search Space Exploration** + + * Random sampling may miss optimal configurations + * No gradient-based or learned search strategies + * Number of schemes per region is fixed + +2. **Pattern Matching** + + * Assumes structural similarity implies performance similarity + * May miss performance variations due to input data or context + +3. **Quantization Types** + + * Uniform quantization for all Q/DQ nodes in a scheme + * No mixed-precision within schemes + +4. **Benchmarking Overhead** + + * TensorRT engine build time dominates (even with timing cache) + * Each scheme requires full engine rebuild + +5. **Input Sensitivity** + + * Performance measured on default/dummy inputs + * May not generalize to all input distributions + +Future Enhancements +------------------- + +1. **Advanced Search Strategies** + + * Reinforcement learning-based exploration + * Bayesian optimization for scheme selection + * Evolutionary algorithms for population-based search + +2. **Mixed-Precision Support** + + * Different quantization types per insertion point + * Learnable precision selection + * Per-layer quantization bit-width + +3. **Accuracy Constraint** + + * Optimize for latency while maintaining accuracy threshold + * Multi-objective optimization (latency + accuracy) + * Accuracy-aware scheme selection and evaluation + * Integration with calibration and validation datasets + * Pareto frontier exploration for latency-accuracy trade-offs + +Glossary +======== + +.. glossary:: + + Q/DQ Nodes + QuantizeLinear (Q) and DequantizeLinear (DQ) nodes in ONNX that convert between + floating-point and quantized integer representations. + + Region + A hierarchical subgraph in an ONNX computation graph with well-defined input and + output boundaries. Can be LEAF (atomic), COMPOSITE (containing child regions), or + ROOT (entire graph). + + Pattern + A structural signature representing the topology and operation types in a region. + Regions with identical patterns can share insertion schemes. + + Insertion Scheme + A collection of insertion points specifying where to insert Q/DQ nodes within a + region. Schemes use pattern-relative addressing for portability. 
+ + Insertion Point + A specific location where Q/DQ nodes can be inserted: at a node input, region + output, or region boundary. + + Pattern-Relative Addressing + Addressing scheme using indices relative to pattern structure rather than absolute + graph positions, enabling scheme portability across regions with matching patterns. + + Pattern Cache + Collection of top-performing insertion schemes for multiple patterns, used to + warm-start optimization on similar models. + + Baseline Latency + Inference latency of the original model without any Q/DQ nodes inserted, used as + reference for measuring optimization improvement. + + TensorRT Timing Cache + Persistent cache of kernel performance measurements maintained by TensorRT to + accelerate engine building by reusing previously measured timings. + + Scheme Diversity + Measure of how different two insertion schemes are, typically computed as Hamming + distance between their insertion point sets. + +References +========== + +* **ONNX Specification**: https://onnx.ai/ +* **ONNX Quantization**: https://onnx.ai/onnx/technical/quantization.html +* **TensorRT Documentation**: https://docs.nvidia.com/deeplearning/tensorrt/ +* **NVIDIA ModelOpt**: https://github.com/NVIDIA/TensorRT-Model-Optimizer +* **Graph Surgery**: https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon diff --git a/examples/qdq_placement/README.md b/examples/qdq_placement/README.md new file mode 100644 index 000000000..e779ebda0 --- /dev/null +++ b/examples/qdq_placement/README.md @@ -0,0 +1,228 @@ +# QDQ Placement Optimization Example + +This example demonstrates automated Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements. + +## Prerequisites + +### Get the Model + +Download the ResNet50 model from the ONNX Model Zoo: + +```bash +# Download ResNet50 from ONNX Model Zoo +curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx +``` + +### Set Fixed Batch Size (Recommended) + +The downloaded model has a dynamic batch size. For best performance with TensorRT benchmarking, set a fixed batch size: + +```bash +# Set batch size to 128 using the provided script +python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx + +# Or for other batch sizes +python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 1 --output resnet50.bs1.onnx +``` + +This creates `resnet50.bs128.onnx` with a fixed batch size of 128, which is optimal for TensorRT performance benchmarking. + +**Note:** The script requires the `onnx` package. If you have modelopt installed, this dependency should already be available. + +### What's in This Directory + +- `set_batch_size.py` - Script to convert dynamic batch size models to fixed batch size +- `README.md` - This guide + +**Note:** ONNX model files are not included in the repository (excluded via `.gitignore`). Download and prepare them using the instructions above. 
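+
+### Verify the Fixed Batch Size (Optional)
+
+As a quick sanity check before benchmarking, you can inspect the input shape of the prepared model. This is a minimal sketch using the same `onnx` APIs as `set_batch_size.py`; the file name assumes you followed the steps above:
+
+```python
+import onnx
+
+# Load the fixed-batch-size model produced by set_batch_size.py
+model = onnx.load("resnet50.bs128.onnx")
+dims = model.graph.input[0].type.tensor_type.shape.dim
+print([d.dim_param or d.dim_value for d in dims])
+# Expected for the batch-128 ResNet50 model: [128, 3, 224, 224]
+```
+
+If the first dimension still prints as a symbolic name instead of `128`, re-run `set_batch_size.py` before starting the autotuner.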
+ +## Quick Start + +### Basic Usage + +Optimize the ResNet50 model with INT8 quantization: + +```bash +# Using the fixed batch size model (recommended) +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_results \ + --quant-type int8 \ + --schemes-per-region 30 + +# Or use the original dynamic batch size model +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50_Opset17.onnx \ + --output ./resnet50_results \ + --quant-type int8 \ + --schemes-per-region 30 +``` + +This will: +1. Automatically discover optimization regions in your model +2. Test 30 different Q/DQ placement schemes per region pattern +3. Measure TensorRT performance for each scheme +4. Export the best optimized model to `./resnet50_results/optimized_final.onnx` + +### FP8 Quantization + +For FP8 quantization (faster on modern GPUs): + +```bash +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_fp8_results \ + --quant-type fp8 \ + --schemes-per-region 50 +``` + +### Faster Exploration + +For quick experiments, reduce the number of schemes: + +```bash +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_quick \ + --schemes-per-region 15 +``` + +## Output Structure + +After running, you'll get: + +``` +resnet50_results/ +├── optimized_final.onnx # Your optimized model +├── baseline.onnx # Baseline for comparison +├── autotuner_state.yaml # Resume checkpoint +├── autotuner_state_pattern_cache.yaml # Reusable patterns +└── logs/ + ├── baseline.log # TensorRT baseline log + ├── region_*_scheme_*.log # Per-scheme logs + └── final.log # Final model log +``` + +## Using the Optimized Model + +Deploy with TensorRT: + +```bash +trtexec --onnx=resnet50_results/optimized_final.onnx \ + --saveEngine=resnet50.engine \ + --stronglyTyped +``` + +## Pattern Cache (Transfer Learning) + +Reuse learned patterns on similar models: + +```bash +# First optimization on ResNet50 +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_run + +# Download and prepare ResNet101 (or any similar model) +curl -L -o resnet101_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet101-v2-7.onnx +python3 set_batch_size.py resnet101_Opset17.onnx --batch-size 128 --output resnet101.bs128.onnx + +# Reuse patterns from ResNet50 on ResNet101 (much faster!) +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet101.bs128.onnx \ + --output ./resnet101_run \ + --pattern-cache ./resnet50_run/autotuner_state_pattern_cache.yaml +``` + +## Optimize from Existing QDQ Model + +If you already have a quantized model (e.g., from manual quantization or another tool), you can use it as a starting point to potentially find even better Q/DQ placements: + +```bash +# Use an existing QDQ model as baseline +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_improved \ + --qdq-baseline resnet50_quantized.onnx \ + --schemes-per-region 40 +``` + +This will: +1. Extract Q/DQ insertion points from the baseline model +2. Use them as seed schemes during optimization +3. Generate and test variations to find better placements +4. 
Compare against the baseline performance + +**Use cases:** +- **Improve existing quantization**: Fine-tune manually quantized models +- **Compare tools**: Test if autotuner can beat other quantization methods +- **Bootstrap optimization**: Start from expert-tuned schemes + +**Example workflow:** + +```bash +# Step 1: Create initial quantized model with any quantization tool +# For example, using modelopt's quantize function: +python3 -c " +import numpy as np +from modelopt.onnx.quantization import quantize + +# Create dummy calibration data (replace with real data for production) +dummy_input = np.random.randn(128, 3, 224, 224).astype(np.float32) +quantize( + 'resnet50.bs128.onnx', + calibration_data=dummy_input, + calibration_method='entropy', + output_path='resnet50_quantized.onnx' +) +" + +# Step 2: Use the quantized baseline for autotuning +python3 -m modelopt.onnx.quantization.autotune \ + --model resnet50.bs128.onnx \ + --output ./resnet50_autotuned \ + --qdq-baseline resnet50_quantized.onnx \ + --schemes-per-region 50 + +# The autotuner will try to find better Q/DQ placements than the initial quantization +``` + +**Note:** This example uses dummy calibration data. For production use, provide real calibration data representative of your inference workload. + +## Programmatic API Usage + +All examples above use the command-line interface. For **low-level programmatic control** in your Python code, use the Python API directly. This allows you to: +- Integrate autotuning into custom pipelines +- Implement custom evaluation functions +- Control state management and checkpointing +- Build custom optimization workflows + +**See the API Reference documentation for low-level usage:** +- [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst) + +The API docs include detailed examples of: +- Using the `Autotuner` class directly +- Customizing region discovery and scheme generation +- Managing optimization state programmatically +- Implementing custom performance evaluators + +## Documentation + +For comprehensive documentation on QDQ placement optimization, see: + +- **User Guide**: [`docs/source/guides/9_qdq_placement.rst`](../../docs/source/guides/9_qdq_placement.rst) + - Detailed explanations of how the autotuner works + - Advanced usage patterns and best practices + - Configuration options and performance tuning + - Troubleshooting common issues + +- **API Reference**: [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst) + - Complete API documentation for all classes and functions + - Low-level usage examples + - State management and pattern cache details + +For command-line help: + +```bash +python3 -m modelopt.onnx.quantization.autotune --help +``` diff --git a/examples/qdq_placement/set_batch_size.py b/examples/qdq_placement/set_batch_size.py new file mode 100644 index 000000000..205dbb551 --- /dev/null +++ b/examples/qdq_placement/set_batch_size.py @@ -0,0 +1,121 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Script to set a fixed batch size for ONNX models.
+
+This script modifies an ONNX model with a dynamic batch size to use a fixed batch size,
+which is often beneficial for TensorRT performance benchmarking.
+
+Usage:
+    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
+"""
+
+import argparse
+
+import onnx
+from onnx import shape_inference
+
+
+def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
+    """
+    Set a fixed batch size for an ONNX model.
+
+    Args:
+        model_path: Path to input ONNX model
+        batch_size: Desired batch size
+        output_path: Path to save modified model
+    """
+    # Load the model
+    print(f"Loading model from {model_path}...")
+    model = onnx.load(model_path)
+
+    # Get the input tensor
+    graph = model.graph
+    input_tensor = graph.input[0]
+
+    print(
+        f"Original input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}"
+    )
+
+    # Modify the batch dimension (first dimension)
+    if len(input_tensor.type.tensor_type.shape.dim) > 0:
+        input_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
+        # Clear any symbolic dimension parameter
+        input_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")
+
+    # Also update output shapes if needed
+    for output_tensor in graph.output:
+        if len(output_tensor.type.tensor_type.shape.dim) > 0:
+            output_tensor.type.tensor_type.shape.dim[0].dim_value = batch_size
+            output_tensor.type.tensor_type.shape.dim[0].ClearField("dim_param")
+
+    print(
+        f"Modified input shape: {[d.dim_param or d.dim_value for d in input_tensor.type.tensor_type.shape.dim]}"
+    )
+
+    # Run shape inference to propagate the batch size through the model
+    print("Running shape inference...")
+    try:
+        model = shape_inference.infer_shapes(model)
+    except Exception as e:
+        print(f"Warning: Shape inference failed: {e}")
+        print("Continuing without shape inference...")
+
+    # Save the modified model
+    print(f"Saving modified model to {output_path}...")
+    onnx.save(model, output_path)
+
+    # Verify the saved model
+    print("Verifying model...")
+    onnx.checker.check_model(output_path)
+    print("✓ Model saved and verified successfully!")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Set a fixed batch size for an ONNX model",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+    # Set batch size to 128 for ResNet50
+    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
+
+    # Set batch size to 1 for single-image inference
+    python set_batch_size.py resnet50_Opset17.onnx --batch-size 1 --output resnet50.bs1.onnx
+        """,
+    )
+
+    parser.add_argument("model", help="Path to input ONNX model")
+    parser.add_argument(
+        "--batch-size", "-b", type=int, default=128, help="Batch size to set (default: 128)"
+    )
+    parser.add_argument(
+        "--output",
+        "-o",
+        help="Path to save modified model (default: <model>.bs<batch_size>.onnx)",
+    )
+
+    args = parser.parse_args()
+
+    # Generate output path if not provided
+    if args.output is None:
+        base_name = args.model.rsplit(".", 1)[0]
+        args.output = f"{base_name}.bs{args.batch_size}.onnx"
+
+    set_batch_size(args.model, args.batch_size, args.output)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/modelopt/onnx/logging_config.py b/modelopt/onnx/logging_config.py
index fd0c306a6..47b430a53 100644
--- a/modelopt/onnx/logging_config.py
+++ b/modelopt/onnx/logging_config.py
@@ -38,7 +38,7 @@ def configure_logging(level=logging.INFO, log_file=None):
     for handler in logger.handlers[:]:
         logger.removeHandler(handler)
 
-    formatter = logging.Formatter("[modelopt][onnx] - %(levelname)s - %(message)s")
+    formatter = logging.Formatter("%(asctime)s - [modelopt][onnx] - %(levelname)s - %(message)s")
 
     # Add file handler if log_file is specified
     if log_file:
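For context on the `logging_config.py` change above: the only difference is the added `%(asctime)s` field, so every log record is now prefixed with a timestamp. Below is a minimal, self-contained illustration of the new format using only the standard library (it does not go through modelopt's `configure_logging` code path):

```python
import logging

# Same format string as the updated configure_logging(); %(asctime)s adds a timestamp.
formatter = logging.Formatter("%(asctime)s - [modelopt][onnx] - %(levelname)s - %(message)s")

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("autotuning started")
# Example output: 2025-01-01 12:00:00,000 - [modelopt][onnx] - INFO - autotuning started
```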