diff --git a/docs/hlo_passes.md b/docs/hlo_passes.md new file mode 100644 index 0000000000000..be64bd1da36e9 --- /dev/null +++ b/docs/hlo_passes.md @@ -0,0 +1,233 @@ +# HLO Passes + +This document outlines the [HLO](https://openxla.org/xla/terminology) +optimization and transformation passes in the +[XLA compiler](https://openxla.org/xla/architecture). + +## Introduction + +A single HLO pass can consist of one or many compiler optimizations and +transformations, and XLA provides several hundred such passes. HLO focuses only +on the shape (e.g. a 3x4 matrix) and the +[operation semantics](https://openxla.org/xla/operation_semantics) of the arrays +to make the optimization or transformation easier. + +For example: + +* [`AlgebraicSimplifier`:](https://github.com/openxla/xla/blob/c37fc6a383b870f43cef82280418fcefcc90b0f8/xla/hlo/transforms/simplifiers/algebraic_simplifier.h#L417) + A pass that performs a number of mostly arithmetic simplifications and + optimizations, including: + + * When dividing by a constant, the division is transformed into + multiplication by the inverse of the constant. + +* [`HloRematerialization`:](https://github.com/openxla/xla/tree/main/xla/hlo/transforms/simplifiers/hlo_rematerialization.h) + A pass that recomputes selected expressions in the computation to reduce + memory pressure caused by long live ranges of array-shaped values. + +## Developer details + +The base class for HLO passes can be found in +[`xla/hlo/pass/hlo_pass_interface.h`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h). +HLO passes should not extend this class directly, but should instead extend +[`HloModulePass`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h#L142) +or +[`HloModuleGroupPass`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h#L172). + +See also +[XLA HLO Pass Framework](https://github.com/openxla/xla/tree/main/xla/hlo/pass#readme).
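To make the divide-by-constant rewrite above concrete, here is a toy Python sketch of the idea. It is not XLA's implementation: the tuple-based expression encoding and the function names are invented purely for illustration.

```python
# Toy sketch of the divide-by-constant rewrite performed by passes like
# AlgebraicSimplifier. This is NOT XLA code; expressions are hypothetical
# nested tuples: ("const", 4.0), ("var", "x"), ("div", lhs, rhs), ("mul", ...).

def simplify_divide(expr):
    """Rewrite ('div', x, ('const', c)) into ('mul', x, ('const', 1/c))."""
    if (isinstance(expr, tuple) and expr[0] == "div"
            and isinstance(expr[2], tuple) and expr[2][0] == "const"
            and expr[2][1] != 0):
        _, x, (_, c) = expr
        return ("mul", x, ("const", 1.0 / c))
    return expr  # anything else is left unchanged

def evaluate(expr, env):
    """Evaluate the toy expression tree against a variable environment."""
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "var":
        return env[expr[1]]
    lhs, rhs = evaluate(expr[1], env), evaluate(expr[2], env)
    return lhs / rhs if op == "div" else lhs * rhs  # only div/mul in this toy
```

XLA performs this rewrite on HLO instructions rather than Python tuples, but the payoff is the same: multiplication is generally cheaper than division, and the rewritten expression computes the same value.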
+ +### Tooling and Testing + +XLA comes with multiple command line tools, including the `hlo-opt` tool. This +tool allows an individual pass to be run independently of a given platform's +compilation stages. For more information, see +[Tooling](https://openxla.org/xla/tools#hlo-opt_hlo_pass_development_and_debugging). + +For information on writing unit tests for HLO passes, see +[Testing HLO Passes](https://openxla.org/xla/test_hlo_passes). + +## Hardware-independent HLO Pass Examples + +This section describes a few examples of passes shared across XLA backends. Some +passes may be specialized for specific backends, but the high-level +functionality is similar. + +These shared, hardware-independent passes can be found in +[`xla/hlo/transforms`](https://github.com/openxla/xla/tree/main/xla/hlo/transforms). + +### Rematerialization + +See also +[`HloRematerialization`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_rematerialization.h). + +Selectively recomputes expressions within the HLO graph, trading higher compute +for lower memory usage. Can reduce memory usage by tens of percent and is +required to run many large models. + +### Algebraic Simplifier + +See also +[`AlgebraicSimplifier`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/algebraic_simplifier.h). + +A grab bag of simplifications, optimizations, and canonicalizations. Analogous +to +[LLVM’s `instcombine` pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions). + +### Constant Folding + +See also +[`HloConstantFolding`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_constant_folding.h). + +Replaces expressions which can be evaluated at compile time with their constant +equivalent. + +### Dead Code Elimination + +See also +[`HloDCE`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_dce.h).
+ +Removes operations whose results are unused (a fast implementation). + +### Call Graph Flattening + +See also +[`FlattenCallGraph`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/flatten_call_graph.h). + +A legalization pass which converts the HLO call graph into a tree by cloning +computations. Required because memory is statically assigned to HLO operations +and not based on dynamic call context. + +### Reshape Mover + +See also +[`ReshapeMover`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/reshape_mover.h). + +Reshapes and transposes can be expensive, especially on TPU. This pass moves +reshapes and transposes across elementwise operations, enabling these +operations to be merged or eliminated. + +### Zero-sized HLO Elimination + +See also +[`ZeroSizedHloElimination`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/zero_sized_hlo_elimination.h). + +HLO supports arrays of zero size (one or more dimensions have a bound of zero). +This pass simplifies the graph by replacing zero-sized operations with +zero-sized constants. + +## TPU-specific HLO Pass Examples + +Passes specific to the TPU backend. + +### Model parallelism + +The partitioning of an XLA program across multiple cores is performed at the HLO +level, and the TPU HLO pipeline includes a number of passes that support +multi-core execution. + +#### Spatial partitioning + +See also +[`ShardingPropagation`](https://github.com/openxla/xla/blob/main/xla/service/sharding_propagation.h). + +A pass that supports dividing operations across devices along non-batch +dimensions.
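As a rough illustration of what spatial partitioning means for data layout, the following toy Python sketch splits a 2-D "array" (a list of rows) along its second, non-batch dimension across several devices. This is not how `ShardingPropagation` works internally; the real pass propagates sharding annotations through HLO operations, and the function below is invented for illustration.

```python
# Toy illustration of spatial partitioning: each "device" receives a
# contiguous slice of the columns (a non-batch dimension), while every
# device keeps all rows (the batch dimension).

def shard_columns(matrix, num_devices):
    """Split a list-of-rows matrix column-wise into num_devices shards."""
    cols = len(matrix[0])
    assert cols % num_devices == 0, "toy example assumes even divisibility"
    width = cols // num_devices
    return [
        [row[d * width:(d + 1) * width] for row in matrix]
        for d in range(num_devices)
    ]
```

For example, sharding a 2x4 matrix across two devices gives each device a 2x2 shard, halving the per-device memory for that array while keeping the batch dimension intact.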
+ +### Handling of bfloat16 + +See also +[`BFloat16ConversionFolding`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/bfloat16_conversion_folding.h), +[`BFloat16MixedPrecisionRemoval`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/float_normalization.h), +and +[`BFloat16Propagation`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/bfloat16_propagation.h). + +TPUs support bfloat16 as a lower-precision, more compact floating-point +representation than 32-bit floats. Using bfloat16 reduces memory footprint and +memory bandwidth. The TPU HLO pipeline includes various passes that introduce +bfloat16 in place of 32-bit floats in the program and propagate the reduced +precision through the graph. + +### Legalization passes + +See also +[`GatherExpander`](https://github.com/openxla/xla/blob/main/xla/service/gather_expander.h), +and +[`BatchNormExpander`](https://github.com/openxla/xla/blob/main/xla/service/batchnorm_expander.h). + +Passes which transform unsupported HLO into a form which the backend can emit, +or for which the backend produces a more efficient lowering. + +## GPU-specific HLO Pass Example + +Passes specific to the GPU backend are found in +[`xla/service/gpu`](https://github.com/openxla/xla/tree/main/xla/service/gpu). +These passes can be identified as classes defined in `namespace gpu`. + +### cuDNN Rewriter + +See also +[`CudnnFusedConvRewriter`](https://github.com/openxla/xla/blob/main/xla/service/gpu/transforms/cudnn_fused_conv_rewriter.h) +and +[`CudnnNormRewriter`](https://github.com/openxla/xla/blob/main/xla/service/gpu/transforms/cudnn_norm_rewriter.h). + +Rewrites fused convolution and norm operations into their respective library +calls in cuDNN. + +## CPU-specific HLO Pass Examples + +Passes specific to the CPU backend are found in +[`xla/service/cpu`](https://github.com/openxla/xla/tree/main/xla/service/cpu). +These passes can be identified as classes defined in `namespace cpu`.
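As background for the bfloat16 handling described earlier, the following pure-Python sketch shows what the representation itself looks like: bfloat16 keeps a float32's sign bit and full 8-bit exponent but only 7 of its 23 mantissa bits, so the simplest (round-toward-zero) conversion is truncating the low 16 bits. XLA's actual conversion ops also handle rounding; these helper names are invented for illustration.

```python
import struct

# bfloat16 is the top 16 bits of an IEEE-754 float32:
#   1 sign bit | 8 exponent bits | 7 mantissa bits

def float32_to_bfloat16_bits(value):
    """Return the 16-bit bfloat16 pattern obtained by truncating a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16

def bfloat16_bits_to_float32(bits16):
    """Widen a bfloat16 bit pattern back to float32 (exact, no rounding)."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value
```

Values like 1.0 round-trip exactly, while a value such as 1.001 loses its low mantissa bits, which is the precision trade-off the bfloat16 passes manage when deciding where reduced precision is acceptable.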
+ +### Convolution Canonicalization + +See also +[`ConvCanonicalization`](https://github.com/openxla/xla/blob/main/xla/service/cpu/conv_canonicalization.h). + +Canonicalizes convolutions so that they can be lowered to a fast implementation +in Eigen. + +### Operation Parallelization + +See also +[`ParallelTaskAssigner`](https://github.com/openxla/xla/blob/main/xla/service/cpu/parallel_task_assignment.h). + +Partitions HLOs into tasks to run on separate threads. + +## Analysis passes + +Analysis passes are not considered "HLO passes" since they do not transform HLO +and may not extend `HloModulePass` or `HloModuleGroupPass`. Shared analyses are +found in +[`xla/hlo/analysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis). + +### Analysis Pass Examples + +#### Dataflow Analysis + +See also +[`HloDataflowAnalysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis/hlo_dataflow_analysis.h). + +Identifies all HLO values in the graph and their uses. + +#### Alias Analysis + +See also +[`HloAliasAnalysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis/hlo_alias_analysis.h). + +Identifies must-alias relationships between values in the program. + +#### Computation Cost Analysis + +See also +[`HloCostAnalysis`](https://github.com/openxla/xla/tree/main/xla/service/hlo_cost_analysis.h). + +Computes FLOP count and memory usage for all operations in the program. + +#### HLO Verification + +See also +[`HloVerifier`](https://github.com/openxla/xla/tree/main/xla/service/hlo_verifier.h). + +Verifies various invariants of the HLO graph. diff --git a/docs/terminology.md b/docs/terminology.md index 1644f6ece8fa4..feb8202a04a43 100644 --- a/docs/terminology.md +++ b/docs/terminology.md @@ -4,61 +4,65 @@ There are several terms that are used in the context of XLA, MLIR, LLVM, and other related technologies. Below is a partial list of these terms and their definitions. 
-- **OpenXLA** - - OpenXLA is an open ecosystem of performant, portable, and extensible machine - learning (ML) infrastructure - components that simplify ML development by defragmenting the tools between - frontend frameworks and hardware backends. It includes the XLA compiler, - StableHLO, VHLO, [PJRT](https://openxla.org/xla/pjrt) and other - components. -- **XLA** - - XLA (Accelerated Linear Algebra) is an open source compiler for machine - learning. The XLA compiler takes models from popular frameworks such as - PyTorch, TensorFlow, and JAX, and optimizes the models for high-performance - execution across different hardware platforms including GPUs, CPUs, and ML - accelerators. The XLA compiler outputs some code to LLVM, some to "standard" - MLIR, and some to [Triton MLIR](https://triton-lang.org/main/dialects/dialects.html) - that is processed by (MLIR-based) OpenAI Triton compiler. -- **PJRT** - - [PJRT](https://github.com/openxla/xla/blob/main/xla/pjrt/c/pjrt_c_api.h) is - a uniform Device API that simplifies the growing complexity of ML workload - execution across hardware and frameworks. It provides a hardware and framework - independent interface for compilers and runtimes. -- **StableHLO** - - StableHLO is the public interface to OpenXLA, it is a standardized MLIR - dialect that may be used by different frameworks and compilers in the OpenXLA - ecosystem. XLA supports StableHLO, and immediately converts it to HLO on the - input. There are some [StableHLO to StableHLO](https://openxla.org/stablehlo/generated/stablehlo_passes) - passes implemented using the MLIR framework. It is also possible to convert - StableHLO to other compilers' IR without using HLO, for example in cases where - an existing IR is more appropriate. -- **CHLO** - - CHLO is a collection of higher level operations which are optionally - decomposable to StableHLO. 
-- **VHLO** - - The [VHLO Dialect](https://openxla.org/stablehlo/vhlo) is a MLIR dialect - that is a compatibility layer on top of StableHLO. It provides a snapshot of - the StableHLO dialect at a given point in time by versioning individual - program elements, and is used for serialization and stability. -- **MHLO** - - MHLO is a standalone MLIR-based representation of XLA's HLO IR. The dialect - is being evaluated for deprecation, and new users of the dialect should prefer - to use StableHLO instead. -- **HLO** - - HLO is an internal graph representation (IR) for the XLA compiler (and also - supported input). It is **not** based on MLIR, and has its own textual syntax - and binary (protobuf based) representation. -- **MLIR** - - [MLIR](https://mlir.llvm.org) is a hybrid IR infrastructure that - allows users to define "dialects" of operations at varying degrees of - abstraction, and gradually lower between these opsets, performing - transformations at each level of granularity. StableHLO and CHLO are two - examples of MLIR dialects. -- **LLVM** - - [LLVM](https://llvm.org/) is a compiler backend, and a language that it - takes as an input. Many compilers generate LLVM code as a first step, and - then LLVM generates machine code from it. This allows developers to reuse - code that is similar in different compilers, and also makes supporting - different target platforms easier. XLA:GPU and CPU backends have - [LLVM IR emitters](https://github.com/openxla/xla/tree/main/xla/service/llvm_ir) - for targeting specific hardware. +- **OpenXLA** + - OpenXLA is an open ecosystem of performant, portable, and extensible + machine learning (ML) infrastructure components that simplify ML + development by defragmenting the tools between frontend frameworks and + hardware backends. It includes the XLA compiler, StableHLO, VHLO, + [PJRT](https://openxla.org/xla/pjrt) and other components. 
+- **XLA** + - XLA (Accelerated Linear Algebra) is an open source compiler for machine + learning. The XLA compiler takes models from popular frameworks such as + PyTorch, TensorFlow, and JAX, and optimizes the models for + high-performance execution across different hardware platforms including + GPUs, CPUs, and ML accelerators. The XLA compiler outputs some code to + LLVM, some to "standard" MLIR, and some to + [Triton MLIR](https://triton-lang.org/main/dialects/dialects.html) that + is processed by the (MLIR-based) OpenAI Triton compiler. +- **PJRT** + - [PJRT](https://github.com/openxla/xla/blob/main/xla/pjrt/c/pjrt_c_api.h) + is a uniform Device API that simplifies the growing complexity of ML + workload execution across hardware and frameworks. It provides a + hardware- and framework-independent interface for compilers and + runtimes. +- **StableHLO** + - StableHLO is the public interface to OpenXLA; it is a standardized MLIR + dialect that may be used by different frameworks and compilers in the + OpenXLA ecosystem. XLA supports StableHLO as input and immediately + converts it to HLO. There are some + [StableHLO to StableHLO](https://openxla.org/stablehlo/generated/stablehlo_passes) + passes implemented using the MLIR framework. It is also possible to + convert StableHLO to other compilers' IR without using HLO, for example + in cases where an existing IR is more appropriate. +- **CHLO** + - CHLO is a collection of higher-level operations which are optionally + decomposable to StableHLO. +- **VHLO** + - The [VHLO Dialect](https://openxla.org/stablehlo/vhlo) is an MLIR + dialect that is a compatibility layer on top of StableHLO. It provides a + snapshot of the StableHLO dialect at a given point in time by versioning + individual program elements, and is used for serialization and + stability. +- **MHLO** + - MHLO is a standalone MLIR-based representation of XLA's HLO IR.
The + dialect is being evaluated for deprecation, and new users of the dialect + should prefer to use StableHLO instead. +- **HLO** + - HLO (High Level Optimizer) is an internal graph representation (IR) for + the XLA compiler (and also a supported input format). It is **not** + based on MLIR, and has its own textual syntax and binary + (protobuf-based) representation. +- **MLIR** + - [MLIR](https://mlir.llvm.org) is a hybrid IR infrastructure that allows + users to define "dialects" of operations at varying degrees of + abstraction, and gradually lower between these opsets, performing + transformations at each level of granularity. StableHLO and CHLO are two + examples of MLIR dialects. +- **LLVM** + - [LLVM](https://llvm.org/) is a compiler backend, together with the IR + language that it takes as input. Many compilers generate LLVM IR as a + first step, and LLVM then generates machine code from it. This allows + developers to reuse similar code across different compilers, and also + makes supporting different target platforms easier. The XLA:GPU and CPU + backends have + [LLVM IR emitters](https://github.com/openxla/xla/tree/main/xla/service/llvm_ir) + for targeting specific hardware.