Commit cf9fa0f

athurdekoos authored and Google-ML-Automation committed

PR #31855: [DOC] New document - hlo_pass

Imported from GitHub PR #31855

📝 Summary of Changes

- New document hlo_pass, which gives an overview of HLO passes.
- Update to terminology.md: added "High Level Optimizer" to HLO.

🚀 Kind of Contribution

📚 Documentation

Copybara import of the project:

- c95ef8b by Amelia Thurdekoos <[email protected]>: creating hlo_pass doc
- 183cc0f by Amelia Thurdekoos <[email protected]>: hlo_pass doc
- df3cbbd by Amelia Thurdekoos <[email protected]>: resolved comments and minor update to terminology.md

Merging this change closes #31855

COPYBARA_INTEGRATE_REVIEW=#31855 from athurdekoos:hlo_passes df3cbbd
PiperOrigin-RevId: 814817223

1 parent 9813877 commit cf9fa0f

File tree

2 files changed: +295 −58 lines changed

docs/hlo_passes.md

Lines changed: 233 additions & 0 deletions
# HLO Passes

This document outlines the [HLO](https://openxla.org/xla/terminology)
optimization and transformation passes in the
[XLA compiler](https://openxla.org/xla/architecture).
## Introduction
A single HLO pass can comprise one or many compiler optimizations and
transformations, and XLA provides several hundred such passes. HLO focuses only
on the shape (e.g. a 3x4 matrix) and the
[operation semantics](https://openxla.org/xla/operation_semantics) of the arrays
to make optimizations and transformations easier.

For example:
*   [`AlgebraicSimplifier`](https://github.com/openxla/xla/blob/c37fc6a383b870f43cef82280418fcefcc90b0f8/xla/hlo/transforms/simplifiers/algebraic_simplifier.h#L417):
    A pass that performs a number of mostly arithmetic simplifications and
    optimizations, including:

    *   When dividing by a constant, the operation is transformed into
        multiplication by the reciprocal of the constant.

*   [`HloRematerialization`](https://github.com/openxla/xla/tree/main/xla/hlo/transforms/simplifiers/hlo_rematerialization.h):
    A pass that recomputes selected expressions in the computation to reduce
    memory pressure caused by long live ranges of array-shaped values.
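To make the divide-by-constant rewrite concrete, here is a toy sketch of the idea in plain Python. This is not XLA code; the tuple-based expression representation is made up purely for illustration.

```python
# Toy illustration of the divide-by-constant rewrite performed by passes like
# AlgebraicSimplifier:  x / c  ->  x * (1 / c)  for a non-zero constant c.
# Multiplication is typically cheaper than division on most hardware.

def simplify_divide(op, lhs, rhs):
    """Rewrite ("div", x, c) into ("mul", x, 1/c) when c is a non-zero constant."""
    if op == "div" and isinstance(rhs, float) and rhs != 0.0:
        return ("mul", lhs, 1.0 / rhs)   # constant divisor: use the reciprocal
    return (op, lhs, rhs)                # anything else is left unchanged

print(simplify_divide("div", "x", 4.0))   # ('mul', 'x', 0.25)
print(simplify_divide("div", "x", "y"))   # divisor not a constant: unchanged
```

Note the non-zero guard: rewriting a division by zero would change the program's semantics, so the pass leaves that case alone.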
## Developer details
The base class for HLO passes can be found in
[`xla/hlo/pass/hlo_pass_interface.h`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h).
An HLO pass should not extend this class directly but should instead extend
[`HloModulePass`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h#L142)
or
[`HloModuleGroupPass`](https://github.com/openxla/xla/blob/main/xla/hlo/pass/hlo_pass_interface.h#L172).

See also the
[XLA HLO Pass Framework](https://github.com/openxla/xla/tree/main/xla/hlo/pass#readme).
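The pass interface follows a pattern common to many compilers: each pass has a name and a run method that transforms a module and reports whether anything changed, and a driver re-runs passes until a fixed point. As a rough, language-agnostic illustration of that pattern (a toy Python sketch, not XLA's actual C++ classes):

```python
# Toy pipeline mirroring the shape of an HLO pass framework: each pass exposes
# name() and run(module), where run returns True if it changed the module.
# Illustrative only; XLA's real interface lives in hlo_pass_interface.h (C++).

class ToyPass:
    def name(self):
        raise NotImplementedError

    def run(self, module):
        """Mutate `module` in place; return True if anything changed."""
        raise NotImplementedError

class RemoveNones(ToyPass):
    """A trivial cleanup pass: drop None entries from the module's op list."""
    def name(self):
        return "remove-nones"

    def run(self, module):
        before = len(module)
        module[:] = [op for op in module if op is not None]
        return len(module) != before

def run_pipeline(passes, module):
    """Run the passes repeatedly until no pass reports a change."""
    changed = True
    while changed:
        changed = any(p.run(module) for p in passes)
    return module

module = ["add", None, "mul", None]
run_pipeline([RemoveNones()], module)
print(module)   # ['add', 'mul']
```

The changed/unchanged return value is what lets a driver run passes to a fixed point without re-running work that can no longer make progress.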
### Tooling and Testing
XLA comes with multiple command line tools, including the `hlo-opt` tool. This
tool allows an individual pass to be run independently of the given platform's
compilation stages. For more information, see
[Tooling](https://openxla.org/xla/tools#hlo-opt_hlo_pass_development_and_debugging).

For information on writing unit tests for HLO passes, see
[Testing HLO Passes](https://openxla.org/xla/test_hlo_passes).
## Hardware-independent HLO Pass Examples
This section describes a few examples of passes shared across XLA backends. Some
passes may be specialized for specific backends, but the high-level
functionality is similar.

These shared, hardware-independent passes can be found in
[`xla/hlo/transforms`](https://github.com/openxla/xla/tree/main/xla/hlo/transforms).
### Rematerialization
See also
[`HloRematerialization`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_rematerialization.h).

Selectively recomputes expressions within the HLO graph to reduce memory usage,
trading higher compute for lower memory usage. It can reduce memory usage by
tens of percent and is required to run many large models.
### Algebraic Simplifier
See also
[`AlgebraicSimplifier`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/algebraic_simplifier.h).

A grab bag of simplifications, optimizations, and canonicalizations, analogous
to
[LLVM's `instcombine` pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions).
### Constant Folding
See also
[`HloConstantFolding`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_constant_folding.h).

Replaces expressions which can be evaluated at compile time with their constant
equivalent.
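As a toy illustration (plain Python over a made-up expression-tree representation, not XLA's implementation), constant folding recursively evaluates any operation whose operands are all known at compile time:

```python
# Toy constant folding: if every operand of an op is a literal constant,
# evaluate the op now and replace it with the result; otherwise keep the op.

import operator

OPS = {"add": operator.add, "mul": operator.mul}

def fold(node):
    """node is either a number (a constant) or a tuple (op, lhs, rhs)."""
    if not isinstance(node, tuple):
        return node
    op, lhs, rhs = node
    lhs, rhs = fold(lhs), fold(rhs)      # fold operands bottom-up first
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return OPS[op](lhs, rhs)         # all operands known: evaluate now
    return (op, lhs, rhs)                # a runtime value remains: keep the op

print(fold(("mul", ("add", 2, 3), "x")))   # ('mul', 5, 'x')
print(fold(("add", 2, 3)))                 # 5
```

Folding bottom-up is what lets a constant subtree inside a larger, non-constant expression still collapse to a single literal.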
### Dead Code Elimination
See also
[`HloDCE`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/hlo_dce.h).

Removes operations with unused results (fast implementation).
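A toy sketch of the idea (Python over a hypothetical dataflow-graph dict, not XLA's `HloDCE`): starting from the graph's outputs, keep only operations whose results are transitively used, and drop the rest.

```python
# Toy dead code elimination on a tiny dataflow graph. Illustrative only.
# `graph` maps each op name to the list of op names it uses as operands.

def dce(graph, outputs):
    """Return the subset of ops reachable from `outputs` (the live ops)."""
    live, stack = set(), list(outputs)
    while stack:
        op = stack.pop()
        if op not in live:
            live.add(op)
            stack.extend(graph[op])   # operands of a live op are live too
    return {op: deps for op, deps in graph.items() if op in live}

graph = {
    "param": [],
    "mul": ["param", "param"],
    "unused_add": ["param", "param"],   # result never used: dead
    "root": ["mul"],
}
print(sorted(dce(graph, ["root"])))   # ['mul', 'param', 'root']
```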
### Call Graph Flattening
See also
[`FlattenCallGraph`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/flatten_call_graph.h).

A legalization pass which converts the HLO call graph into a tree by cloning
computations. This is required because memory is statically assigned to HLO
operations rather than based on dynamic call context.
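As a toy illustration of the flattening idea (plain Python with a made-up call-site list, not XLA's implementation): whenever a computation is called from more than one call site, it is cloned so that every computation ends up with exactly one caller and the call graph becomes a tree.

```python
# Toy call-graph flattening: clone a callee whenever it is reached from a
# second call site, so the resulting call graph is a tree. Illustrative only.

import itertools

def flatten(calls):
    """calls: list of (caller, callee) pairs, in call-site order."""
    counter = itertools.count(1)
    seen, result = set(), []
    for caller, callee in calls:
        if callee in seen:
            # A second call site for this computation: give it its own clone.
            callee = f"{callee}.clone{next(counter)}"
        else:
            seen.add(callee)
        result.append((caller, callee))
    return result

calls = [("main", "body"), ("main", "cond"), ("other", "body")]
print(flatten(calls))
# [('main', 'body'), ('main', 'cond'), ('other', 'body.clone1')]
```

The cloning costs code size but gives each computation a single static call context, which is what makes static memory assignment possible.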
### Reshape Mover

See also
[`ReshapeMover`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/reshape_mover.h).

Reshapes and transposes can be expensive, especially on TPU. This pass moves
reshapes and transposes across elementwise operations, enabling the operations
to be merged or eliminated.
### Zero-sized HLO Elimination
See also
[`ZeroSizedHloElimination`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/zero_sized_hlo_elimination.h).

HLO supports arrays of zero size (where one or more dimensions has a bound of
zero). This pass simplifies the graph by replacing zero-sized operations with
zero-sized constants.
## TPU-specific HLO Pass Examples
Passes specific to the TPU backend.

### Model parallelism

The partitioning of an XLA program across multiple cores is performed at the
HLO level, and the TPU HLO pipeline includes a number of passes to support
multi-core execution.

#### Spatial partitioning

See also
[`ShardingPropagation`](https://github.com/openxla/xla/blob/main/xla/service/sharding_propagation.h).

A pass to support dividing operations across devices along non-batch
dimensions.
### Handling of bfloat16
See also
[`BFloat16ConversionFolding`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/bfloat16_conversion_folding.h),
[`BFloat16MixedPrecisionRemoval`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/simplifiers/float_normalization.h),
and
[`BFloat16Propagation`](https://github.com/openxla/xla/blob/main/xla/hlo/transforms/bfloat16_propagation.h).

TPUs support bfloat16 as a lower-precision, more compact floating-point
representation than 32-bit floats. Using bfloat16 reduces memory footprint and
memory bandwidth. The TPU HLO pipeline includes various passes for replacing
floats with bfloat16 in the program and propagating the precision through the
graph.
152+
See also
[`GatherExpander`](https://github.com/openxla/xla/blob/main/xla/service/gather_expander.h)
and
[`BatchNormExpander`](https://github.com/openxla/xla/blob/main/xla/service/batchnorm_expander.h).

Passes which transform unsupported HLO into a form which the backend can emit,
or for which the backend produces a more efficient lowering.
## GPU-specific HLO Pass Example
Passes specific to the GPU backend are found in
[`xla/service/gpu`](https://github.com/openxla/xla/tree/main/xla/service/gpu).
These passes can be identified as classes defined in `namespace gpu`.

### cuDNN Rewriter

See also
[`CudnnFusedConvRewriter`](https://github.com/openxla/xla/blob/main/xla/service/gpu/transforms/cudnn_fused_conv_rewriter.h)
and
[`CudnnNormRewriter`](https://github.com/openxla/xla/blob/main/xla/service/gpu/transforms/cudnn_norm_rewriter.h).

Rewrites fused convolution and norm operations into their respective library
calls in cuDNN.
## CPU-specific HLO Pass Examples
Passes specific to the CPU backend are found in
[`xla/service/cpu`](https://github.com/openxla/xla/tree/main/xla/service/cpu).
These passes can be identified as classes defined in `namespace cpu`.

### Convolution Canonicalization

See also
[`ConvCanonicalization`](https://github.com/openxla/xla/blob/main/xla/service/cpu/conv_canonicalization.h).

Canonicalizes convolutions so that they can be lowered to a fast implementation
in Eigen.

### Operation Parallelization

See also
[`ParallelTaskAssigner`](https://github.com/openxla/xla/blob/main/xla/service/cpu/parallel_task_assignment.h).

Partitions HLOs into tasks to run on separate threads.
## Analysis passes
Analysis passes are not considered "HLO passes" since they do not transform HLO,
and may not extend `HloModulePass` or `HloModuleGroupPass`. Shared analyses are
found in
[`xla/hlo/analysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis).

### Analysis Pass Examples

#### Dataflow Analysis

See also
[`HloDataflowAnalysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis/hlo_dataflow_analysis.h).

Identifies all HLO values in the graph and their uses.

#### Alias Analysis

See also
[`HloAliasAnalysis`](https://github.com/openxla/xla/tree/main/xla/hlo/analysis/hlo_alias_analysis.h).

Identifies must-alias relationships between values in the program.

#### Computation Cost Analysis

See also
[`HloCostAnalysis`](https://github.com/openxla/xla/tree/main/xla/service/hlo_cost_analysis.h).

Computes FLOP count and memory usage for all operations in the program.

#### HLO Verification

See also
[`HloVerifier`](https://github.com/openxla/xla/tree/main/xla/service/hlo_verifier.h).

Verifies various invariants of the HLO graph.

docs/terminology.md

Lines changed: 62 additions & 58 deletions
There are several terms that are used in the context of XLA, MLIR, LLVM, and
other related technologies. Below is a partial list of these terms and their
definitions.

- **OpenXLA**
    - OpenXLA is an open ecosystem of performant, portable, and extensible
      machine learning (ML) infrastructure components that simplify ML
      development by defragmenting the tools between frontend frameworks and
      hardware backends. It includes the XLA compiler, StableHLO, VHLO,
      [PJRT](https://openxla.org/xla/pjrt) and other components.
- **XLA**
    - XLA (Accelerated Linear Algebra) is an open source compiler for machine
      learning. The XLA compiler takes models from popular frameworks such as
      PyTorch, TensorFlow, and JAX, and optimizes the models for
      high-performance execution across different hardware platforms including
      GPUs, CPUs, and ML accelerators. The XLA compiler outputs some code to
      LLVM, some to "standard" MLIR, and some to
      [Triton MLIR](https://triton-lang.org/main/dialects/dialects.html) that
      is processed by the (MLIR-based) OpenAI Triton compiler.
- **PJRT**
    - [PJRT](https://github.com/openxla/xla/blob/main/xla/pjrt/c/pjrt_c_api.h)
      is a uniform Device API that simplifies the growing complexity of ML
      workload execution across hardware and frameworks. It provides a
      hardware- and framework-independent interface for compilers and
      runtimes.
- **StableHLO**
    - StableHLO is the public interface to OpenXLA; it is a standardized MLIR
      dialect that may be used by different frameworks and compilers in the
      OpenXLA ecosystem. XLA supports StableHLO and immediately converts it
      to HLO on input. There are some
      [StableHLO to StableHLO](https://openxla.org/stablehlo/generated/stablehlo_passes)
      passes implemented using the MLIR framework. It is also possible to
      convert StableHLO to other compilers' IR without using HLO, for example
      in cases where an existing IR is more appropriate.
- **CHLO**
    - CHLO is a collection of higher level operations which are optionally
      decomposable to StableHLO.
- **VHLO**
    - The [VHLO Dialect](https://openxla.org/stablehlo/vhlo) is an MLIR
      dialect that is a compatibility layer on top of StableHLO. It provides
      a snapshot of the StableHLO dialect at a given point in time by
      versioning individual program elements, and is used for serialization
      and stability.
- **MHLO**
    - MHLO is a standalone MLIR-based representation of XLA's HLO IR. The
      dialect is being evaluated for deprecation, and new users of the
      dialect should prefer StableHLO instead.
- **HLO**
    - HLO (High Level Optimizer) is an internal graph representation (IR) for
      the XLA compiler (and also a supported input). It is **not** based on
      MLIR, and has its own textual syntax and binary (protobuf based)
      representation.
- **MLIR**
    - [MLIR](https://mlir.llvm.org) is a hybrid IR infrastructure that allows
      users to define "dialects" of operations at varying degrees of
      abstraction, and gradually lower between these opsets, performing
      transformations at each level of granularity. StableHLO and CHLO are
      two examples of MLIR dialects.
- **LLVM**
    - [LLVM](https://llvm.org/) is a compiler backend, and a language that it
      takes as an input. Many compilers generate LLVM code as a first step,
      and then LLVM generates machine code from it. This allows developers to
      reuse code that is similar in different compilers, and also makes
      supporting different target platforms easier. The XLA:GPU and CPU
      backends have
      [LLVM IR emitters](https://github.com/openxla/xla/tree/main/xla/service/llvm_ir)
      for targeting specific hardware.
