- Read inference_session.h to get familiar with the common interface functions; the following are important ones for training
- Load()
- Run()
- NewIOBinding()
- RegisterGraphTransformer()
- RegisterExecutionProvider()
- Advanced: Initialize()
- What happens under the hood when we call session.Initialize()?
- Understand the configs in SessionOptions (session_option.h); the following are important ones for training
- execution_order
- enable_mem_pattern
- use_deterministic_compute
- session_log_severity_level
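
A minimal sketch of how these options appear in the Python API (property and enum names as exposed by onnxruntime's Python bindings; the model path is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.execution_order = ort.ExecutionOrder.PRIORITY_BASED  # vs. the default topological order
so.enable_mem_pattern = True           # pre-plan allocations from inferred shapes
so.use_deterministic_compute = True    # trade performance for reproducibility
so.log_severity_level = 0              # verbose logging, useful while studying Initialize()/Run()

# Initialize() runs as part of constructing the session from a model file.
sess = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```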
- Prerequisite: What is an OrtValue? (ml_value.h)
- BindInput()
- How to create an OrtValue?
- BindOutput()
- With a preallocated buffer
- Without a preallocated buffer
- Who allocates the output buffer, and how is it returned back to the user?
- What should be the lifespan of an IOBinding? Can the user reuse an IOBinding across multiple Session::Run() calls?
- How are bound inputs/outputs passed into the ExecutionFrame?
- Advanced: How is IOBinding different from dlpack's approach? What are their advantages?
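
A hedged sketch of the same binding flow at the Python level (the model path and the tensor names "X"/"Y" are placeholders for whatever graph you bind against):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
io_binding = sess.io_binding()

x = np.random.rand(2, 4).astype(np.float32)
# BindInput(): bind a caller-owned CPU buffer; for GPU tensors you would pass
# device_type/device_id and a raw buffer pointer instead.
io_binding.bind_cpu_input("X", x)

# BindOutput() without a preallocated buffer: ORT allocates the output itself
# and hands it back via copy_outputs_to_cpu()/get_outputs().
io_binding.bind_output("Y")

# BindOutput() with a preallocated buffer (caller keeps ownership), roughly:
# y = np.empty((2, 4), dtype=np.float32)
# io_binding.bind_output("Y", device_type="cpu", device_id=0,
#                        element_type=np.float32, shape=y.shape,
#                        buffer_ptr=y.ctypes.data)

# The same IOBinding object can be reused across multiple runs.
sess.run_with_iobinding(io_binding)
outputs = io_binding.copy_outputs_to_cpu()
```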
- Read training_agent.h to understand the available interface functions
- Understand how ORTModule uses TrainingAgent
- Read ORTModule's forward() and backward() functions
- How does ORTModule get an ONNX graph from a torch nn.Module?
- How does ORT do auto-diff without torch's autograd engine?
- How does ORT hijack torch's forward/backward calls?
- How is the C++ InferenceSession exposed as Python's onnxruntime.InferenceSession?
- Read onnxruntime_pybind_state.cc and onnxruntime_inference_collection.py for the InferenceSession binding
- Read orttraining_pybind_state.cc for the TrainingAgent binding
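
A minimal hedged sketch of the ORTModule flow these items describe (the toy model and shapes are made up):

```python
import torch
from onnxruntime.training.ortmodule import ORTModule

# ORTModule exports the module to ONNX on the first forward call, builds the
# gradient graph in ORT, and drives both passes through TrainingAgent.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
model = ORTModule(model)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(4, 8)
loss = model(x).sum()   # forward() is intercepted via a custom torch.autograd.Function
loss.backward()         # backward() runs ORT's gradient graph instead of torch autograd
optimizer.step()
```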
- Node (graph.h/.cc)
- What's the difference between an Op and a Node?
- What are the common properties of a node?
- Can a node's name be empty?
- What's the identifier of a node in a graph? Index or Name?
- Advanced: Function Ops, nodes with a FunctionBody
- Graph (graph.h/.cc)
- How to traverse from one node to another node?
- What's the difference between a GraphInput and an Initializer?
- Look for an example using GetProducerNode() and GetConsumerNodes()
- What's the purpose of Graph::Resolve()?
- How is ShapeAndTypeInference invoked?
- NodeArg (node_arg.h)
- What's the relationship between a graph edge and a NodeArg?
- What's the unique identifier of a NodeArg in a graph?
- Action: Look for some examples using Graph::GetOrCreateNodeArg() (you will need to use this at some point)
- Understand the difference between GraphTransformer and RewriteRule
- Understand the purpose of GraphTransformerManager
- How to register a set of graph transformers into a session?
- Understand the two versions of graph_transformer_utils.cc (the onnxruntime and orttraining ones)
- Get familiar with graph_utils.cc
- Experiment with onnx.helper to compose an ONNX model from a script (see transpose_matmul_gen.py for examples, and the sketch below)
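
A small example of composing a model with onnx.helper, loosely in the spirit of transpose_matmul_gen.py (graph and tensor names are made up):

```python
import onnx
from onnx import TensorProto, helper

# Y = MatMul(Transpose(A), B) -- the kind of pattern a graph transformer could fuse.
A = helper.make_tensor_value_info("A", TensorProto.FLOAT, [3, 2])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [3, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

transpose = helper.make_node("Transpose", ["A"], ["A_t"], perm=[1, 0])
matmul = helper.make_node("MatMul", ["A_t", "B"], ["Y"])

graph = helper.make_graph([transpose, matmul], "transpose_matmul", [A, B], [Y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 14)])
onnx.checker.check_model(model)
onnx.save(model, "transpose_matmul.onnx")
```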
- Action: Implement a graph transformer to get hands-on experience
- Understand the workflow of training graph transformation
- Understand GraphAugmenter (graph_augmenter.h/.cc)
- GradientGraphBuilder
- Understand the purpose/usage of STOP_GRADIENT_EDGES
- Understand the meaning of x_node_args/y_node_args
- Advanced: Understand the back-propagation process in GradientGraphBuilder::Build() (see the sketch below)
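
A conceptual sketch of that reverse traversal (not the actual GradientGraphBuilder code): starting from y_node_args, walk producer edges backwards, skipping edges listed in STOP_GRADIENT_EDGES, to find the nodes that need gradient builders.

```python
from collections import deque

def reachable_from_outputs(producer_of, inputs_of, y_names, stop_gradient_edges=frozenset()):
    """producer_of: tensor name -> node that produces it
    inputs_of: node -> list of its input tensor names
    stop_gradient_edges: set of (node, input_name) pairs where back-propagation stops.
    Returns the nodes reachable backwards from the requested outputs; the real
    builder also intersects this with the set reachable forwards from x_node_args."""
    visited, frontier = set(), deque(y_names)
    while frontier:
        name = frontier.popleft()
        node = producer_of.get(name)
        if node is None or node in visited:
            continue  # graph input / initializer, or already expanded
        visited.add(node)
        for inp in inputs_of[node]:
            if (node, inp) not in stop_gradient_edges:
                frontier.append(inp)
    return visited
```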
- Per-op GradientBuilder
- Understand the Gradient Registry (gradient_builder_registry.cc)
- Understand the Gradient Builder Declarations (gradient_builder.h)
- Read a few examples in the Gradient Builder Implementation (gradient_builder.cc)
- Understand the shorthands I, GI, O, GO (gradient_builder_base.h)
- Understand how a gradient subgraph is composed from existing ops; the following are good examples (see the formula sketch below)
- Easy: GetDropoutGradient, GetSqrtGradient
- Medium: GetAddSubGradient, GetMulGradient
- Hard: GetMatMulGradient, GetGemmGradient
- Understand how broadcasting is handled when building the gradient graph: GradientBuilderBase::HandleBroadcasting()
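
As a sketch of the algebra these builders encode (2-D case, L is the loss; the actual builders emit MatMul/Transpose/ReduceSum nodes rather than formulas):

```latex
% GetMatMulGradient / GetGemmGradient, Y = A B:
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial Y}\, B^{\top},
\qquad
\frac{\partial L}{\partial B} = A^{\top}\, \frac{\partial L}{\partial Y}

% HandleBroadcasting, C = A + B with B broadcast along some axes:
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C},
\qquad
\frac{\partial L}{\partial B} = \mathrm{ReduceSum}\!\left(\frac{\partial L}{\partial C},\ \text{axes broadcast in } B\right)
```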
- Action: Implement a gradient definition for an op to get hands-on experience
- Read onnx.proto and onnx-ml.proto and understand the design principles behind them
- Get familiar with the ONNX Operators: https://github.com/onnx/onnx/blob/master/docs/Operators.md
- Must know: Dropout, MatMul, Gemm, Transpose, ReduceSum, Reshape
- Understand the concept and purpose of opset and domain
- When to use which? (see the sketch below)
- onnx domain
- ms domain (com.microsoft)
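
A hedged sketch of how the domain choice shows up when composing a model (Gelu is used here as an example com.microsoft contrib op; the names and shapes are placeholders):

```python
from onnx import TensorProto, helper

X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [2, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [2, 4])

# Standard op: empty ("") domain, versioned by the onnx opset.
relu = helper.make_node("Relu", ["X"], ["R"])
# Contrib op: lives in the com.microsoft domain and needs its own opset import.
gelu = helper.make_node("Gelu", ["R"], ["Y"], domain="com.microsoft")

graph = helper.make_graph([relu, gelu], "domains", [X], [Y])
model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14), helper.make_opsetid("com.microsoft", 1)],
)
```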
- Understand the C++ data structures onnx::TensorProto, onnx::AttributeProto, and onnx::TypeProto
- Understand how Shape and Type Inference works in the schema definition
- Function Ops
- Understand the difference among the following 3 sets of schemas. When to use which?
- ONNX op schemas (onnx repo: defs.cc)
- contrib ops (contrib_defs.cc)
- Good to know: LayerNorm, Gelu
- training ops (training_op_defs.cc)
- Action: Add an op or update an op's schema to get hands-on experience
- Kernel Declaration and Registry
- Understand when to use which registry for a kernel
- Inference Kernels
- ONNX Op Kernels
- cpu_execution_provider.cc
- cuda_execution_provider.cc
- Contrib Op Kernels
- cpu_contrib_kernels.cc
- cuda_contrib_kernels.cc
- Advanced: rocm_contrib_kernels.cc
- Training Kernels
- CPU (cpu_training_kernels.cc)
- CUDA (cuda_training_kernels.cc)
- Advanced: ROCm (rocm_training_kernels.cc)
- Kernel Implementation
- Tensor vs. OrtValue
- Read tensor.h and ml_value.h
- What's the difference between Tensor and OrtValue? Why do we need two classes?
- How to get a Tensor from an OrtValue?
- How to get the data's raw pointer from a Tensor?
- Kernel Definition
- When to use Alias() and VariadicAlias()?
- How to set TypeConstraint()?
- When to use InputMemoryType?
- CPU Kernel vs. CUDA Kernel
- What does it mean to have a CPU input/output for a CUDA kernel?
- Gradient Kernels
- Examples
- Easy: DropoutGrad, GeluGrad
- Medium: GatherGrad
- Hard: LayerNormalizationGrad
- Understand how to write unit tests to check a gradient's correctness
- Understand how to use OpTester in unit tests
- Action: Implement a kernel to get hands-on experience
- nvprof
- Try running with/without --print-gpu-summary
- Try --profile-child-processes
- Action: profile a training run
- Visual Profiler UI
- Use the ruler to measure a time span
- Identify the top hitters in kernels
- Compare two sets of profiling results to identify the performance gap
- Can you identify the start/end of a train_step from the timeline view?
- torch profiler
- Linux perf
- Read RunOptions and understand the options (run_option.h)
- What's the purpose of ExecutionFrame? (execution_frame.h)
- How is one node's output passed in as another node's input?
- What happens when we call context->Output() inside an op kernel?
- How are feeds and fetches stored in ExecutionFrame?
- How is the execution order determined? (graph_viewer.cc)
- The default execution order uses Graph::ReverseDFS() to generate a topological sort
- The priority-based execution order uses Graph::KahnsTopologicalSort() with per-node priorities (see the sketch below)
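
A self-contained sketch of priority-based topological sorting in the spirit of Graph::KahnsTopologicalSort() (the helper and its arguments are illustrative, not the ORT API): Kahn's algorithm where the ready set is a heap keyed by priority, so among nodes whose inputs are all available, the one with the smallest priority value runs first.

```python
import heapq

def priority_topo_sort(nodes, edges, priority):
    """nodes: node ids; edges: (src, dst) pairs; priority: node -> int (lower runs earlier)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Ready set as a heap keyed by (priority, node id).
    ready = [(priority.get(n, 0), n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                heapq.heappush(ready, (priority.get(m, 0), m))
    return order

# Giving a node a larger priority value delays it until the other ready nodes have run,
# which is how e.g. recomputed nodes can be pushed closer to their consumers.
print(priority_topo_sort(["a", "b", "c"], [("a", "b"), ("a", "c")], {"b": 1, "c": 0}))
# -> ['a', 'c', 'b']
```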
- How is each node's kernel invoked?
- How does ORT guarantee that all CUDA kernels have completed before Session::Run() returns?
- How is it decided which execution provider each node is placed on? (graph_partitioner.h)
- Why do we need an arena? What problem does it solve?
- How does ORT compute peak memory consumption?
- How does ORTModule use the PyTorch allocator? (ortmodule.py)
- Advanced: What's the difference between BFCArena and CUDACachingAllocator?
- Understand the hardware
- Architecture Generations
- P100: Pascal / sm60
- V100: Volta / sm70
- A100: Ampere / sm80
- CUDA Core vs. Tensor Core
- Programming model
- Thread
- Block
- Grid
- Stream
- Must-know functions
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
- Avoid memcpy
- Avoid unnecessary Sync
- Preprocess data in CPU
- When to use #pragma unroll?
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
- printf() works inside CUDA code
- Memcpy data back to the CPU for inspection?
- Good read: https://mpitutorial.com/tutorials/
- Understand NCCLAllReduce
- Get familiar with DDP usage/setup
- ZeRO-1
- Understand ReduceScatter/AllGather (see the sketch below)
- Understand how optimizer state is partitioned
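
A single-process numpy sketch (no real communication) of why ZeRO-1 can replace one AllReduce with ReduceScatter + AllGather: each rank reduces only the gradient shard whose optimizer state it owns, then the updated shards are gathered back.

```python
import numpy as np

world_size, shard = 4, 3
rng = np.random.default_rng(0)
# grads[r]: the local gradient on "rank" r, flattened to world_size * shard elements.
grads = [rng.standard_normal(world_size * shard) for _ in range(world_size)]

# AllReduce: every rank ends up holding the full summed gradient.
allreduce = np.sum(grads, axis=0)

# ReduceScatter: rank r receives only the r-th shard of the summed gradient ...
shards = [np.sum([g[r * shard:(r + 1) * shard] for g in grads], axis=0)
          for r in range(world_size)]
# ... applies its slice of the optimizer state to that shard, then AllGather
# reassembles the full (updated) tensor on every rank.
allgather = np.concatenate(shards)

assert np.allclose(allreduce, allgather)
```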
- ZeRO-2
- ZeRO-3
- Understand All-to-All
- Understand the meaning and implications of common configurations: batch size, sequence length, learning rate, weight decay, global norm, loss scale...
- Familiarize yourself with the common patterns in a decreasing loss curve and spot abnormal patterns
- Understand the differences between optimizers: SGD, Adam, and LAMB (see the update-rule sketch after this list)
- Advanced: Understanding Backpropagation https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- Familiarize yourself with running/monitoring AML experiments
- Familiarize yourself with setting up TensorBoard
- Action: Submit a distributed training job to an AML cluster and get familiar with its user interface/logging/available metrics
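
The update rules behind that optimizer comparison, as a sketch (bias correction and weight-decay variants omitted):

```latex
\text{SGD:}\quad w_{t+1} = w_t - \eta\, g_t

\text{Adam:}\quad
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,\quad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,\quad
w_{t+1} = w_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}

% LAMB applies an Adam-style update u_t but rescales it per layer by a trust ratio:
\text{LAMB:}\quad
w_{t+1} = w_t - \eta\, \frac{\lVert w_t \rVert}{\lVert u_t + \lambda w_t \rVert}\,\bigl(u_t + \lambda w_t\bigr)
```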
- Remove all randomness from the program
- Set Seeds
- Set Dropout Ratio to 0
- Set use_deterministic_compute=True
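
A hedged sketch of that checklist in a PyTorch + ORTModule script (the cuDNN flags are standard PyTorch practice; the dropout argument is a hypothetical model parameter):

```python
import random

import numpy as np
import torch

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Make cuDNN algorithm selection deterministic (at some performance cost).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# If your onnxruntime training build exposes it, also seed ORT itself:
# import onnxruntime; onnxruntime.set_seed(SEED)

# model = MyModel(dropout=0.0)  # hypothetical: dropout ratio 0 removes in-model randomness
```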
- Shrink the repro conditions to the very minimum, as long as it still reproduces
- Use a 1-layer model
- Use a smaller hidden_size
- Use a single GPU
- ...
- Common Tricks
- Set the learning rate to 0 to disable weight updates
- Advanced: How to do hyper-parameter tuning to get the model to converge better?
- Setting up VS Code with a remote VM
- Debugging within VS Code
- Getting the .onnx inference/training graph
- Enable I/O Dump
- Enable execution plan and memory plan dump
- Enable CPU profiling dump
- Enable CUDA memory consumption logs
- InferenceSession/ORTModule
- Graph/Node/NodeArg
- Onnx/Op/Schema/Kernel
- OrtValue/Tensor
- GraphTransformer
- Per-op Gradient Building
- Performance Investigation
- ExecutionProvider
- IOBinding/dlpack
- PyBind
- Gradient Graph Building
- CUDA Programming
- Execution Engine
- SessionState
- ExecutionFrame
- Memory
- Distributed Training
- Performance optimization for CUDA kernels
- Hyper-parameter tuning
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.