Torch-TensorRT 2.0 #1826
-
Comment on Recompilation + Dynamic Batch:
One thing to add to the caching section is refactoring the Guards/Caching implementation so that differing shapes in a pre-specified (batch) dimension do not cause recompilation. For example, if a user has specified:

```python
import torch
import torch_tensorrt
from torch import nn

class Sample(nn.Module):
    ...

    def forward(self, x):
        if x.sum() > 2:
            return torch.sum(x)
        else:
            return torch.sum(x**2)

model = torch_tensorrt.dynamo.compile(
    Sample(), ... [min_shape=(1,), opt_shape=(4,), max_shape=(8,)], ...
)

input_1 = torch.zeros(1)
input_2 = torch.zeros(4)
input_3 = torch.zeros(8)

# No recompilation should occur
model(input_1)
model(input_2)
model(input_3)
```
-
This seemed very abstract in our conversation with PyT. Do we have a clear way to represent the additional constraints of a TRT engine over Dynamo?
Most customers are very sensitive to device memory consumption, and often host memory too. FSCache is most analogous to how TRT is used today and seems reasonable, but a few thoughts / considerations:
Do we have an answer from Meta on preserving high-level ops? We don't want Torch-TRT to turn into another MHA pattern matcher.
Is there any consideration for how something like CCD would fit in here? Ideally we leverage dynamo to do much of the selection process required in CCD. Additionally, dynamic shapes, sources of dynamo overhead, the export workflow, and potentially additional workflows like extracting or loading TRT engines should be considered.
-
Unifying dynamo export and compile workflows
Both
Partitioning, conversion, etc. should be shared by both workflows.
Issues with Partitioner
The partitioner adds constants as inputs: functional programming to the extreme, no state, just inputs/outputs and a function. For a simple model like the following,
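(a minimal conv + batch-norm sketch; the exact snippet from the prototype is an assumption, but any module holding conv/bn weights exhibits the issue described next)

```python
import torch
from torch import nn

class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        return self.bn(self.conv(x))

# dynamo.export keeps batch_norm as a single node here, and the partitioner
# lifts the conv/bn weights into graph placeholders, triggering the failure.
```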
This will fail in the batch norm conversion phase because the partitioner treats constants (the weights of conv/bn) as placeholder tensors which get added to the graph: https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/utils/fuser_utils.py#L142-L147
Question here: should we handle ITensors in the converters, (or) use a transformation pass over partitioning, (or) write a custom partitioner based off ...
Note: this is not a problem with torch.compile (the backend workflow) because there is no batch norm layer in the graph, as batch_norm gets split into add/mul layers, while dynamo.export has the batch_norm layer explicitly.
Issues with TRT splitter
torch.export
torch._dynamo.export doesn't do aot_autograd yet. From Meta's discussion, ...
Naming of APIs
Current prototype:
-
Yeah
Yeah sure. We might need utils or other stuff, so a submodule is good (similar to TorchScript).
Yeah, will do that. That would be great.
So the
This sounds good. Yeah we can express
Agreed. My intention was to pick one (ideally the partitioner). Since the partitioner had problems, I tried TRTSplitter as an option in this prototype to understand what the issue is. I think we should subclass the partitioner to fix the above-mentioned issue, and maybe also in the future to implement some advanced partitioning heuristics. The API naming seems good to me. However, one thing is not clear. Based on your comment, are you saying
-
UX Goals
APIs
torch.compile
torch.compile is a JIT compiler. This should be taken in the sense that the user's workflow is to take a PyTorch Module, wrap it with torch.compile, and have compilation happen on demand when the model is called.
torch.export
torch_tensorrt.compile
Workflows
JIT Optimization
The idea of this workflow is that a user will deploy a boxed version of their model tied to Torch-TensorRT via torch.compile. When users call their model, torch.compile will compile the graph and provide the torch_tensorrt.dynamo.backend with a set of example inputs, settings, and the graph. Conceptually, either dynamo or the backend will recognize when the currently compiled module is invalid due to a change in constraints (typically input size) and can recompile the target or call on a cache to pull up a previously compiled, serialized version.
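A minimal sketch of that boxed workflow (the registered backend name "torch_tensorrt" is an assumption here; the final exposure of torch_tensorrt.dynamo.backend may differ):

```python
import torch
import torch_tensorrt  # registers the Dynamo backend on import

class MyModel(torch.nn.Module):  # stand-in model for illustration
    def forward(self, x):
        return torch.relu(x) + 1

model = MyModel().eval().cuda()

# Boxed deployment: the user ships the torch.compile-wrapped module, and
# compilation happens just-in-time on the first call with real inputs.
compiled = torch.compile(model, backend="torch_tensorrt")  # backend name assumed
out = compiled(torch.randn(4, 8, device="cuda"))  # triggers compile, then runs
```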
Compile and Deploy on the Same Machine
AOT Optimization
Compile and Deploy on Separate Machines
Internals
Backends
ATen and AOTAutograd
https://github.com/pytorch/pytorch/blob/93d75568c7070942a59337dd83194c2fd5221adb/torch/_functorch/aot_autograd.py#L2837
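A rough sketch of how the backend can hook into AOTAutograd via aot_module_simplified from the file linked above (tensorrt_backend and fw_compiler are hypothetical names; the conversion step is elided):

```python
import torch
from torch._functorch.aot_autograd import aot_module_simplified

def tensorrt_backend(gm: torch.fx.GraphModule, example_inputs):
    # AOTAutograd flattens the graph to aten ops (applying any registered
    # decompositions), then hands the result to a forward compiler where
    # partitioning and TRT conversion would happen.
    def fw_compiler(aten_gm: torch.fx.GraphModule, inputs):
        # ... partition, convert supported subgraphs to TRT engines ...
        return aten_gm.forward  # fall back to eager in this sketch

    return aot_module_simplified(gm, example_inputs, fw_compiler=fw_compiler)
```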
Engine Cache
It seems like for torch.compile there is a strong requirement for an engine cache: some sort of store of compiled TensorRT engines tied to an identifying hash calculated from the source graph and provided inputs. This cache should be able to short-circuit the torch-tensorrt backend and deserialize and return the previously created engine.
There are a couple of methods we could think about for maintaining this cache.
Implementation
FSCache
Write engines to disk in some sort of temp directory with a file system convention for locating and matching files.
Advantages
Disadvantages
Additional User Configurations required
MemCache
Hold serialized engines in host memory (i.e. a dictionary) and deserialize on demand.
Advantages
Disadvantages
Additional User Configurations required
HotCache
As an addition to either of these cache options, we can include the ability to have a "Hot Cache", i.e. a number of engines which stay live and deserialized, the cost being additional VRAM and host memory usage.
Options for HotCache Rules
Saving and reloading caches
We need to come up with a format to store and load caches so that if, in a future run, dynamo detects an identical graph, we can load in the model.
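One possible format, sketched under assumptions (SHA-256 of the graph code plus the input signature as the key, one engine file per key in an FSCache-style directory; all names here are hypothetical):

```python
import hashlib
import os
import torch

CACHE_DIR = "/tmp/torch_trt_engine_cache"  # hypothetical, would be configurable

def cache_key(gm: torch.fx.GraphModule, example_inputs) -> str:
    # Identify an engine by the source graph and provided input shapes/dtypes,
    # per the hashing scheme described above.
    sig = str([(tuple(t.shape), str(t.dtype)) for t in example_inputs])
    return hashlib.sha256((gm.code + sig).encode()).hexdigest()

def load_engine(key: str):
    path = os.path.join(CACHE_DIR, key + ".engine")
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return f.read()  # serialized engine bytes; caller deserializes
    return None  # cache miss: compile and store

def store_engine(key: str, serialized_engine: bytes) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(os.path.join(CACHE_DIR, key + ".engine"), "wb") as f:
        f.write(serialized_engine)
```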
What causes a cache miss?
There are a number of reasons why a cache might be invalid.
Guards
Guards are the mechanism to detect if a subgraph is different from the target. Some of these guards are provided by dynamo. However, we need to provide guards that are TensorRT-specific. These may include changed weights, changed inputs, etc.
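For instance, a TensorRT-specific weight guard might look like the following hypothetical sketch: a fingerprint recorded at compile time and re-checked before reusing a cached engine.

```python
import hashlib
import torch

def weight_fingerprint(mod: torch.nn.Module) -> str:
    # Hash all parameters; if a weight changes, a cached engine with
    # baked-in weights is invalid and must be rebuilt (or refitted).
    h = hashlib.sha256()
    for p in mod.parameters():
        h.update(p.detach().cpu().contiguous().numpy().tobytes())
    return h.hexdigest()

def weights_guard(mod: torch.nn.Module, recorded: str) -> bool:
    return weight_fingerprint(mod) == recorded  # False triggers recompile
```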
Lowering
There are three classes of lowering in the dynamo backend: Decompositions, Subgraph Rewriting, and Module Level Lowering.
Decompositions
Decompositions are small functions which map an operator to a lowered form, similar to unpack passes in the TorchScript frontend, and serve to handle custom cases or reduce the opset that the converters need to handle.
Decompositions are run as part of the functorch.aot_autograd step.
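For example, a decomposition can be registered so the conversion phase never sees the composite op (the choice of aten.hardswish here is illustrative, not from the prototype):

```python
import torch
from torch._decomp import register_decomposition

# Map aten.hardswish onto ops the converters already handle (mul, add,
# clamp, div), shrinking the opset the conversion phase must support.
@register_decomposition(torch.ops.aten.hardswish)
def hardswish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.clamp(x + 3, 0, 6) / 6
```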
Subgraph Rewriting
Subgraph rewriting takes small repeating patterns and replaces them with one or many operations. See the support for Linear/AddMM in the TS frontend for an example of what this looks like.
Subgraph rewriting will be run post aot_autograd
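A hedged sketch with torch.fx.subgraph_rewriter; the matmul + add to addmm rewrite mirrors the Linear/AddMM case mentioned above (TinyLinear is an illustrative stand-in):

```python
import torch
import torch.fx
from torch.fx import subgraph_rewriter

class TinyLinear(torch.nn.Module):  # illustrative module, an assumption
    def forward(self, x, w, b):
        return torch.add(torch.matmul(x, w), b)

def pattern(x, w, b):
    return torch.add(torch.matmul(x, w), b)

def replacement(x, w, b):
    # A single fused op is simpler for converters than matmul + add.
    return torch.addmm(b, x, w)

gm = torch.fx.symbolic_trace(TinyLinear())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
```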
Module Level Lowering
Module level passes identify submodules in graphs to perform aliasing or other high level operations.
Module level passes will run pre aot autograd
Ex.
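As a hypothetical illustration (this concrete pass is an assumption): a pre-tracing pass could swap nn.MultiheadAttention submodules for a TRT-friendly equivalent, so they are lowered as a unit instead of being pattern-matched after decomposition.

```python
import torch

class TRTFriendlyMHA(torch.nn.Module):
    # Hypothetical stand-in that carries the original weights and would map
    # cleanly onto a single fused attention op at conversion time.
    def __init__(self, mha: torch.nn.MultiheadAttention):
        super().__init__()
        self.mha = mha

    def forward(self, q, k, v):
        return self.mha(q, k, v)

def lower_mha_modules(model: torch.nn.Module) -> torch.nn.Module:
    # Module-level pass: runs pre-aot_autograd, rewriting submodules in place.
    for name, child in model.named_children():
        if isinstance(child, torch.nn.MultiheadAttention):
            setattr(model, name, TRTFriendlyMHA(child))
        else:
            lower_mha_modules(child)
    return model
```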
Partitioning
Dynamo has a builtin capability partitioner that we have a prototype for:
This uses the default CapabilityBasedPartitioner. We would likely need to modify this to add support for features like min_block_size, torch_supported_ops, and torch_support_modules.
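A sketch of what subclassing could look like (the CONVERTERS registry and the min_block_size filtering shown here are assumptions for illustration):

```python
import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import OperatorSupport

# Hypothetical converter registry: targets we have TRT converters for.
CONVERTERS = {torch.ops.aten.relu.default, torch.ops.aten.add.Tensor}

class TRTOperatorSupport(OperatorSupport):
    def is_node_supported(self, submodules, node: torch.fx.Node) -> bool:
        # A node is TRT-supported if a converter exists for its target.
        return node.op == "call_function" and node.target in CONVERTERS

class TRTPartitioner(CapabilityBasedPartitioner):
    def __init__(self, gm: torch.fx.GraphModule, min_block_size: int = 3):
        super().__init__(gm, TRTOperatorSupport(), allows_single_node_partition=True)
        self.min_block_size = min_block_size

    def propose_partitions(self):
        # Drop partitions smaller than min_block_size so tiny TRT engines,
        # whose launch overhead outweighs the benefit, stay in PyTorch.
        parts = super().propose_partitions()
        return [p for p in parts if len(p.nodes) >= self.min_block_size]
```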
Overview
In short, the three levels of Lowering work in conjunction with partitioning. First, we have the pre-tracing Module Level Lowering for high-level modules. Then, we have during-tracing Decompositions for fusions, in-place ops, and operator simplification. Finally, we have post-tracing subgraph rewriting, which can also assist with fusions as well as other graph simplifications. This final, post-tracing pass can help reduce segmentation in partitioning, since we can use it to replace operations with their prims equivalents, which are lower-level and easier to implement converters for.
Conversion
FX Interpreter
https://github.com/pytorch/pytorch/blob/master/torch/fx/interpreter.py
Evaluation
There are some constants which we need at compile time that are produced by intermediate operations. The FX interpreter can execute these and store that data somewhere for converters to use.
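A hedged sketch using torch.fx.Interpreter (the storage convention, keyed by node name, is an assumption):

```python
import torch
import torch.fx

class ConstantEvaluator(torch.fx.Interpreter):
    # Execute the graph eagerly and record every intermediate result so
    # converters can look up compile-time constants (e.g. shapes) by node name.
    def __init__(self, gm: torch.fx.GraphModule):
        super().__init__(gm)
        self.values = {}

    def run_node(self, n: torch.fx.Node):
        result = super().run_node(n)
        self.values[n.name] = result
        return result

# Usage: evaluator = ConstantEvaluator(gm); evaluator.run(*example_inputs)
# then converters read evaluator.values["some_node"] during conversion.
```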
Converters
Consolidate the TRT Network, IR, etc. into some context object to pass around.
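For example, a context object along these lines (field names are assumptions, not the final API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

import torch.fx

@dataclass
class ConversionContext:
    # Single object threaded through every converter call instead of
    # passing the network, value map, and settings separately.
    network: Any                                                       # TRT INetworkDefinition under construction
    value_map: Dict[torch.fx.Node, Any] = field(default_factory=dict)  # fx.Node -> ITensor/weight
    settings: Dict[str, Any] = field(default_factory=dict)             # compilation settings
```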
Dynamo Symbolic Shapes + Shape Tensors
Dynamic shape cases
Sym Shape / Shape Tensor Interop
Runtime / Callable
Torch-TensorRT Legacy Runtime
Inductor has this function: https://github.com/pytorch/pytorch/blob/d5aa4cec578f40afd43cc0f96ba0d0abaf38b1f4/torch/_inductor/compile_fx.py#L514
Python Runtime
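A minimal sketch of such a Python runtime, assuming a serialized engine, CUDA float32 tensors, the TensorRT 8.x binding API, and that input bindings precede outputs in binding order:

```python
import tensorrt as trt
import torch

class PythonTRTRuntime(torch.nn.Module):
    def __init__(self, serialized_engine: bytes):
        super().__init__()
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        self.engine = runtime.deserialize_cuda_engine(serialized_engine)
        self.context = self.engine.create_execution_context()

    def forward(self, *inputs: torch.Tensor):
        bindings = [0] * self.engine.num_bindings
        outputs, in_idx = [], 0
        for i in range(self.engine.num_bindings):
            if self.engine.binding_is_input(i):
                # Propagate the runtime shape for dynamic-shape engines.
                self.context.set_binding_shape(i, tuple(inputs[in_idx].shape))
                bindings[i] = inputs[in_idx].data_ptr()
                in_idx += 1
            else:
                shape = tuple(self.context.get_binding_shape(i))
                out = torch.empty(shape, dtype=torch.float32, device="cuda")
                outputs.append(out)
                bindings[i] = out.data_ptr()
        # Enqueue on the current torch CUDA stream to interop with PyTorch.
        self.context.execute_async_v2(
            bindings, torch.cuda.current_stream().cuda_stream
        )
        return outputs[0] if len(outputs) == 1 else tuple(outputs)
```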