+<table>
+<tr>
+<th>Log Level</th>
+<th>Description</th>
+</tr>
+<tr>
+<td>
+
+`FATAL` | `ERROR` | `WARNING` (For Users)
+
+`WARNING` is the default and recommended level for users.
+</td>
+<td>
+
+- ONNX Runtime backend log level - `FATAL` | `ERROR` | `WARNING`.
+- ORTModule log level - `FATAL` | `ERROR` | `WARNING`.
+- Rank-0 log filtering is `ON` (i.e. logging on rank 0 only).
+- PyTorch exporter export logs filtering is `ON`.
+- PyTorch exporter verbose logs (including tracing graph) filtering is `ON`.
+
+</td>
+</tr>
+<tr>
+<td>
+
+`INFO` (For Users | ORT Developers)
+
+`INFO` is used for collecting experimental feature stats, or slightly more detailed error messages.
+</td>
+<td>
+
+- ONNX Runtime backend log level - `WARNING`.
+- ORTModule log level - `INFO`.
+- Rank-0 log filtering is `ON` (i.e. logging on rank 0 only).
+- PyTorch exporter export logs filtering is `ON`.
+- PyTorch exporter verbose logs (including tracing graph) filtering is `OFF`.
+
+</td>
+</tr>
+<tr>
+<td>
+
+`DEVINFO` (For ORT Developers)
+
+`DEVINFO` is the recommended level for debugging purposes.
+</td>
+<td>
+
+- ONNX Runtime backend log level - `INFO`.
+- ORTModule log level - `INFO`.
+- Rank-0 log filtering is `OFF` (i.e. logging on all ranks).
+- PyTorch exporter export logs filtering is `OFF`.
+- PyTorch exporter verbose logs (including tracing graph) filtering is `OFF`.
+
+</td>
+</tr>
+<tr>
+<td>
+
+`VERBOSE` (For ORT Developers)
+
+`VERBOSE` is the last resort for debugging hard problems.
+</td>
+<td>
+
+- ONNX Runtime backend log level - `VERBOSE`.
+- ORTModule log level - `VERBOSE`.
+- Rank-0 log filtering is `OFF` (i.e. logging on all ranks).
+- PyTorch exporter export logs filtering is `OFF`.
+- PyTorch exporter verbose logs (including tracing graph) filtering is `OFF`.
+
+</td>
+</tr>
+</table>
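+
+As a minimal sketch of how one of these levels can be selected programmatically (assuming `model` is your existing `torch.nn.Module`; the same setting is also controlled by the `ORTMODULE_LOG_LEVEL` environment variable described below):
+
+```python
+from onnxruntime.training.ortmodule import ORTModule, DebugOptions, LogLevel
+
+# Wrap the PyTorch module and raise logging to INFO (see the table above for what INFO enables).
+model = ORTModule(model, DebugOptions(log_level=LogLevel.INFO))
+```
+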
### 2.1 Environment Variables
`ORTModule` provides environment variables targeting different use cases.
@@ -62,7 +146,6 @@ Check [DebugOptions implementation](../orttraining/orttraining/python/training/o
export ORTMODULE_ONNX_OPSET_VERSION=14
```
-
#### ORTMODULE_FALLBACK_POLICY
- **Feature Area**: *ORTMODULE/FallbackToPytorch*
@@ -71,7 +154,6 @@ Check [DebugOptions implementation](../orttraining/orttraining/python/training/o
export ORTMODULE_FALLBACK_POLICY="FALLBACK_DISABLE"
```
-
#### ORTMODULE_LOG_LEVEL
- **Feature Area**: *ORTMODULE/DebugOptions*
@@ -98,7 +180,6 @@ The output directory of the onnx models by default is set to the current working
> On the other hand, if the wrapped computation graph is small, it is reasonable to allow it.
> Overall users should be aware that ORT performance boost might be trivial when they explicitly allow it.
-
#### ORTMODULE_ENABLE_CUSTOM_AUTOGRAD
- **Feature Area**: *ORTMODULE/PythonOp (torch.autograd.Function)*
@@ -115,8 +196,6 @@ The output directory of the onnx models by default is set to the current working
enable_custom_autograd_support(False)
```
-
-
#### ORTMODULE_ENABLE_COMPUTE_OPTIMIZER
- **Feature Area**: *ORTMODULE/Optimizations*
@@ -129,19 +208,6 @@ debugging).
export ORTMODULE_ENABLE_COMPUTE_OPTIMIZER=0 # Disable
```
-#### ORTMODULE_ENABLE_SPARSE_OPTIMIZER
-
-- **Feature Area**: *ORTMODULE/Optimizations*
-- **Description**: By default, this is enabled. This env var can be used for enabling or disabling the input data sparsity
-based performance optimizations, including embedding sparsity and label sparsity.
-This optimization is applicable when using optimum, which has an implementation of the ModuleWithLoss class that wraps the HuggingFace Training that allows loss computation inside ONNX Runtime (ORT).
-If you're not using optimum but want to implement a similar wrapper in your codebase to compute the loss inside ONNX Runtime (ORT), you can refer to this [Link](ORTModule_ModuleWithLoss_Wrapper.md) for detailed steps and guidelines on how to achieve this.
-
- ```bash
- export ORTMODULE_ENABLE_SPARSE_OPTIMIZER=1 # Enable
- export ORTMODULE_ENABLE_SPARSE_OPTIMIZER=0 # Disable
- ```
-
#### ORTMODULE_PRINT_INPUT_DENSITY
- **Feature Area**: *ORTMODULE/RuntimeInspector*
@@ -167,7 +233,7 @@ to standard outputs.
#### ORTMODULE_ENABLE_EMBEDDING_SPARSE_OPTIMIZER
- **Feature Area**: *ORTMODULE/Optimizations*
-- **Description**: By default, this is disabled. This env var can be used for enabling or disabling the embedding input
+- **Description**: By default, this is enabled. This env var can be used for enabling or disabling the embedding input
data sparsity based performance optimizations.
```bash
@@ -175,6 +241,17 @@ data sparsity based performance optimizations.
export ORTMODULE_ENABLE_EMBEDDING_SPARSE_OPTIMIZER=0 # Disable
```
+#### ORTMODULE_ENABLE_LABEL_SPARSE_OPTIMIZER
+
+- **Feature Area**: *ORTMODULE/Optimizations*
+- **Description**: By default, this is enabled. This env var can be used for enabling or disabling the label input
+data sparsity based performance optimizations.
+
+ ```bash
+ export ORTMODULE_ENABLE_LABEL_SPARSE_OPTIMIZER=1 # Enable
+ export ORTMODULE_ENABLE_LABEL_SPARSE_OPTIMIZER=0 # Disable
+ ```
+
#### ORTMODULE_CACHE_DIR
- **Feature Area**: *ORTMODULE/RuntimeOptions*
@@ -185,6 +262,48 @@ data sparsity based performance optimizations.
unset ORTMODULE_CACHE_DIR # Disable
```
+#### ORTMODULE_USE_EFFICIENT_ATTENTION
+
+- **Feature Area**: *ORTMODULE/Optimizations*
+- **Description**: By default, this is disabled. This env var can be used to enable attention fusion, which falls back to PyTorch's efficient_attention ATen kernel for execution. NOTE that it requires torch 2.1.1 or above. There are some built-in patterns for attention fusion; if none of them works for your model, you can add a custom one in your user script manually.
+
+ ```bash
+ export ORTMODULE_USE_EFFICIENT_ATTENTION=1
+ ```
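+
+If you are unsure whether your environment satisfies the version requirement, a quick sanity check is:
+
+```bash
+python -c "import torch; print(torch.__version__)"  # expect 2.1.1 or newer
+```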
+
+#### ORTMODULE_DEEPCOPY_BEFORE_MODEL_EXPORT
+
+- **Feature Area**: *ORTMODULE/Optimizations*
+- **Description**: By default, this is enabled. This env var can be used to enable or disable the module deep copy performed when preparing the output data used by the ONNX export.
+A typical reason to disable the deep copy: if the deep copy before module export causes the memory peak, disable it and try again.
+
+ ```bash
+ export ORTMODULE_DEEPCOPY_BEFORE_MODEL_EXPORT=1 # Enable
+ export ORTMODULE_DEEPCOPY_BEFORE_MODEL_EXPORT=0 # Disable
+ ```
+
+#### ORTMODULE_MEMORY_OPT_LEVEL
+
+- **Feature Area**: *ORTMODULE/Optimizations*
+- **Description**: By default, the level is 0. This env var can be used to enable recomputation, reducing the peak memory requirement.
+  - Setting the level to 1 recomputes all detected recomputable subgraphs (NOT including compromised recomputable graphs) within each transformer-based model layer that generates stashed activations. This is conceptually equivalent to PyTorch's gradient checkpointing.
+  - Setting the level to 2 recomputes all detected recomputable subgraphs (including compromised recomputable graphs) within each transformer-based model layer that generates stashed activations. This is conceptually equivalent to PyTorch's gradient checkpointing.
+  - When the level is 0, check [Memory Optimizer for ONNX Runtime Training](Memory_Optimizer.md) for more details.
+
+ ```bash
+ export ORTMODULE_MEMORY_OPT_LEVEL=0
+ ```
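+
+For example, to recompute all detected recomputable subgraphs (excluding compromised ones), which is conceptually similar to PyTorch's gradient checkpointing as described above:
+
+```bash
+export ORTMODULE_MEMORY_OPT_LEVEL=1
+```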
+
+#### ORTMODULE_ENABLE_MEM_EFFICIENT_GRAD_MGMT
+
+- **Feature Area**: *ORTMODULE/Optimizations*
+- **Description**: By default, memory-efficient gradient management is turned off. When enabled, each gradient, once computed in ONNX Runtime, triggers the corresponding parameter's backward function through the `PythonOpGrad` operator. This helps release the gradient buffer managed in ONNX Runtime earlier, instead of only after all backward computation finishes.
+
+ ```bash
+ export ORTMODULE_ENABLE_MEM_EFFICIENT_GRAD_MGMT=1 # Enable
+ export ORTMODULE_ENABLE_MEM_EFFICIENT_GRAD_MGMT=0 # Disable
+ ```
+
### 2.2 Memory Optimization
Q: *Want to run a bigger batch size?*
@@ -286,6 +405,30 @@ Check [FP16_Optimizer implementation](../orttraining/orttraining/python/training
export ORTMODULE_USE_TRITON=1
```
+#### ORTMODULE_TRITON_CONFIG_FILE
+
+- **Feature Area**: *ORTMODULE/TritonOp*
+- **Description**: Triton codegen currently supports a subset of Ops, such as some elementwise and reduction Ops. If Triton optimization is enabled, all supported Ops are optimized by default where possible. Users can provide a customized JSON config file to control which Ops to optimize and how to optimize them. Below is a sample config JSON. For each Op, the opset version list and domain are needed. Currently the "conditions" field can be used to constrain an axis/axes attribute or input, either by specifying the actual value, or "single" meaning it contains only one dimension, or "constant" meaning it must be a constant tensor. Save the JSON to a file and assign its path to the env variable below to enable the customized config.
+
+ ```json
+ {
+ "ops": {
+ "Add": {"versions": [13, 14]},
+ "Sub": {"versions": [13, 14]},
+      "Identity": {"versions": [13], "is_no_op": true},
+ "ReduceSum": {"versions": [13], "conditions": {"axes": "[-1]"}},
+ "Softmax": {"versions": [13]},
+ "SoftmaxGrad_13": {"domain": "com.microsoft", "versions": [1]}
+ },
+ "initializer": "scalar",
+ "min_nodes": 2
+ }
+ ```
+
+ ```bash
+ export ORTMODULE_TRITON_CONFIG_FILE=triton_config.json
+ ```
+
#### ORTMODULE_ENABLE_TUNING
- **Feature Area**: *ORTMODULE/TritonOp*
@@ -313,6 +456,15 @@ Check [FP16_Optimizer implementation](../orttraining/orttraining/python/training
export ORTMODULE_TUNING_RESULTS_PATH=/tmp/tuning_results
```
+#### ORTMODULE_USE_FLASH_ATTENTION
+
+- **Feature Area**: *ORTMODULE/TritonOp*
+- **Description**: By default, this is disabled. This env var can be used to enable attention fusion, using Flash Attention's Triton version as the kernel. NOTE that it requires ORTMODULE_USE_TRITON to be enabled and a CUDA device compute capability of 8.0 or above. There are some built-in patterns for attention fusion; if none of them works for your model, you can add a custom one in your user script manually.
+
+ ```bash
+ export ORTMODULE_USE_FLASH_ATTENTION=1
+ ```
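+
+To check whether your GPU meets the compute capability requirement (assuming a CUDA device is visible to PyTorch):
+
+```bash
+python -c "import torch; print(torch.cuda.get_device_capability())"  # expect (8, 0) or higher
+```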
+
#### ORTMODULE_TRITON_DEBUG
- **Feature Area**: *ORTMODULE/TritonOp*
@@ -341,3 +493,31 @@ for epoch in range(start_epoch, n_epochs):
```
Check [LoadBalancingDistributedBatchSampler implementation](../orttraining/orttraining/python/training/utils/data/sampler.py) for more details.
+
+## 8 Using ORTPipelineModule for DeepSpeed Pipeline Parallelism
+
+You can use `ORTPipelineModule` to support DeepSpeed pipeline parallelism. Here's how you can integrate it into your pipeline:
+
+```python
+from onnxruntime.training.ortmodule import DebugOptions, LogLevel
+from onnxruntime.training.ortmodule.experimental.pipe import ORTPipelineModule
+
+# Create a debug configuration if needed
+# Since we're exporting multiple graphs here, this will generate multiple graphs with their index added as a prefix to differentiate them.
+
+debug_options = DebugOptions(save_onnx=True, log_level=LogLevel.VERBOSE, onnx_prefix="model_name")
+
+# Keep your deepspeed script the same and use ORTPipelineModule instead of PipelineModule
+# Initialize the ORTPipelineModule
+pipeline_module = ORTPipelineModule(
+ layers,
+ num_stages=2, # Set your number of stages
+ base_seed=1234,
+ partition_method="parameters",
+ debug_options=debug_options # Pass the debug configuration if needed
+)
+
+# Keep the rest of the script as it is.
+```
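+
+A minimal sketch of the surrounding DeepSpeed setup (assuming `ds_config` and `train_loader` are already defined in your script, as in a standard DeepSpeed pipeline run):
+
+```python
+import deepspeed
+
+# Initialize the DeepSpeed engine around ORTPipelineModule, just as you would with PipelineModule.
+engine, _, _, _ = deepspeed.initialize(
+    model=pipeline_module,
+    model_parameters=[p for p in pipeline_module.parameters() if p.requires_grad],
+    config=ds_config,
+)
+
+# Pipeline engines consume an iterator over micro-batches.
+loss = engine.train_batch(data_iter=iter(train_loader))
+```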
+
+Check [ORTPipelineModule implementation](../orttraining/orttraining/python/training/ortmodule/experimental/pipe/_ort_pipeline_module.py) for more details.
diff --git a/docs/OperatorKernels.md b/docs/OperatorKernels.md
index b5c8fdb4bfd1a..4533884a51773 100644
--- a/docs/OperatorKernels.md
+++ b/docs/OperatorKernels.md
@@ -25,6 +25,7 @@ Do not modify directly.*
|||13|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|||[7, 12]|**T** = tensor(double), tensor(float), tensor(int32), tensor(int64)|
|Affine|*in* X:**T**