
🐛 [Bug] Compiling model crashes after updating nvidia driver #2256

Closed
lennartmoritz opened this issue Aug 23, 2023 · 2 comments
Labels
bug Something isn't working

Comments


lennartmoritz commented Aug 23, 2023

Bug Description

Until recently, I had been successfully using Torch-TensorRT 1.2.0 with Nvidia Driver 510, both within the Nvidia Pytorch 22.07 Docker image and via a local pip installation (in a conda environment), to compile and use my model. An Ubuntu kernel update broke the GPU driver and I was forced to update to version 535.

Now I get this crash during compilation with my conda environment (non-docker):
Segmentation fault (core dumped)

  • No crash in the docker container (everything fine).
  • Crash when I locally compile OSNet model (needed in my project)
  • No crash compiling pretrained resnet18 similar to torch_trt_simple_example.py
  • Crash with OSNet when I recreate it similar to torch_trt_simple_example.py
  • No issues when I don't use Torch-TensorRT (loading and inference on OSNet work fine)

Please help me resolve this issue. I've been stuck for some days now trying to fix it.

What I tried

  • Using Nvidia Driver 525
  • Purging all Nvidia related packages and reinstalling Nvidia Driver 535 multiple times
  • Removing and reinstalling my conda environment multiple times
  • Creating a clean conda env with only the necessary components to reproduce the issue
  • Using cudatoolkit==11.6.2 and nvidia-tensorrt==8.4.3.1 (this used to work with driver 510)
  • Using cudatoolkit==11.6.0 and nvidia-tensorrt==8.4.1.5 (match container versions)

What I did not try

  • Updating the Torch-TensorRT version to 1.3 or 1.4. This would be my last resort: it is not guaranteed to work, it would be a lot of work, and it would break other components of my project due to the higher torch version.
  • Ignoring the issue and only using docker. I prefer solving my issues to giving up on them.

To Reproduce

Steps to reproduce the behavior:

  1. Trace the OSNet model and store it with module.save("osnet_ain_reid_mpd_traced5.pt")
  2. Create conda environment from file (see environment.yml below)
  3. Execute both functions in my minimal example.
  4. Crash (segmentation fault) during model = torch_tensorrt.compile(...) with OSNet; successful completion with resnet
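Step 1 can be sketched as follows. This is a minimal sketch: TinyNet and the filename tiny_traced.pt are hypothetical stand-ins, since the real OSNet weights are not part of this report; only the trace/save/load mechanics matter here.

```python
import torch

# TinyNet is a hypothetical stand-in for OSNet.
class TinyNet(torch.nn.Module):
    def forward(self, x):
        # collapse the spatial dims so the output shape is easy to check
        return torch.relu(x).mean(dim=(2, 3))

model = TinyNet().eval()
example = torch.ones(1, 3, 256, 128)          # OSNet-sized dummy input
traced = torch.jit.trace(model, example)      # record the forward pass
traced.save("tiny_traced.pt")                 # analogous to osnet_ain_reid_mpd_traced5.pt
reloaded = torch.jit.load("tiny_traced.pt")   # what crash_reproduction.py later loads
print(tuple(reloaded(example).shape))         # (1, 3)
```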

Code Samples

environment.yml:
name: my_env
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch==1.12.1
  - torchvision==0.13.1
  - torchaudio==0.12.1
  - cudatoolkit==11.6.0
  - pip
  - pip:
      - nvidia-pyindex
      - nvidia-tensorrt==8.4.1.5
      - --find-links https://github.com/pytorch/TensorRT/releases/expanded_assets/v1.2.0
      - torch-tensorrt==1.2.0
crash_reproduction.py:
def test_trt_osnet():
    import torch
    import tensorrt
    import torch_tensorrt

    modelfile = "/.../models/osnet_ain_reid_mpd_traced5.pt"
    model = torch.jit.load(modelfile)
    model.cuda()
    model.eval()

    print("Feature extractor: Compiling new torch-tensorrt model!")
    inputs = [
        torch_tensorrt.Input(
            shape=[64, 3, 256, 128],
            dtype=torch.float,
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.debug():
            # Segmentation fault (core dumped)
            model = torch_tensorrt.compile(
                model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = model(torch.ones((64, 3, 256, 128), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

def test_trt_resnet():
    import torch
    import torchvision
    import tensorrt
    import torch_tensorrt

    untraced_model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    inputs = [torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda"))]
    traced_model = torch.jit.trace(untraced_model, inputs)
    traced_model.save("resnet18_traced.pt")
    traced_model = None
    traced_model = torch.jit.load("resnet18_traced.pt")
    traced_model.cuda().eval()
    inputs = [
        torch_tensorrt.Input(
            shape=[32, 3, 224, 224], dtype=torch.float
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.info():
            trt_model = torch_tensorrt.compile(
                traced_model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = trt_model(torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

Expected behavior / Compile logs

I expected compilation to work in my conda environment like it did before I updated my GPU driver.

The full output log of executing test_trt_osnet() with docker is attached here:
report_debug_docker.txt

Unfortunately my terminal cut off some of the earlier output when I executed test_trt_osnet() locally, but this is the part leading up to the segmentation fault:
report_debug_conda.txt

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0): 1.2.0
  • PyTorch Version (e.g. 1.0): 1.12.1
  • CPU Architecture: x86_64
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): conda
  • Python version: 3.8.17
  • CUDA version: 11.6
  • GPU models and configuration: RTX 3080, Driver Version 535
@lennartmoritz lennartmoritz added the bug Something isn't working label Aug 23, 2023
narendasan (Collaborator) commented Aug 24, 2023

Not sure what exactly about the driver upgrade changed this behavior, but the repro script seems to work fine with the provided environment if you move the torch_tensorrt import outside of the function body. For example, this works for me:

import torch_tensorrt

def test_trt_osnet():
    import torch
    import tensorrt

    modelfile = "./osnet_ain_reid_mpd_traced5.pt"
    model = torch.jit.load(modelfile)
    model.cuda()
    model.eval()

    print("Feature extractor: Compiling new torch-tensorrt model!")
    inputs = [
        torch_tensorrt.Input(
            shape=[64, 3, 256, 128],
            dtype=torch.float,
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        # Segmentation fault (core dumped)
        model = torch_tensorrt.compile(
            model,
            inputs=inputs,
            enabled_precisions=enabled_precisions,
            truncate_long_and_double=True,
        )
        print("Feature extractor: Storing!")
        result = model(torch.ones((64, 3, 256, 128), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

def test_trt_resnet():
    import torch
    import torchvision
    import tensorrt

    untraced_model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    inputs = [torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda"))]
    traced_model = torch.jit.trace(untraced_model, inputs)
    traced_model.save("resnet18_traced.pt")
    traced_model = None
    traced_model = torch.jit.load("resnet18_traced.pt")
    traced_model.cuda().eval()
    inputs = [
        torch_tensorrt.Input(
            shape=[32, 3, 224, 224], dtype=torch.float
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.errors():
            trt_model = torch_tensorrt.compile(
                traced_model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = trt_model(torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

if __name__ == "__main__":
    test_trt_osnet()

I looked into the segfault itself and it seems to happen post graph construction, inside the tensorrt library, so it doesn't seem like something faulty in the compilation process; it is perhaps something like the initialization. I also tested with the latest main, and it seems this issue has been addressed regardless of import location.
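The import-location effect described above can be illustrated with a stdlib-only sketch. Here colorsys is a hypothetical stand-in for a library that performs global initialization at import time (as torch_tensorrt does): a module-level import runs those side effects once at program start, while an import inside a function body is deferred to the first call, changing when initialization happens relative to other libraries.

```python
import sys

MODULE = "colorsys"               # stand-in for a heavyweight library
print(MODULE in sys.modules)      # False in a fresh interpreter

def use_library():
    import colorsys               # deferred: runs on the first call only
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)

hue, _, _ = use_library()
print(MODULE in sys.modules)      # True: the import side effects ran here
```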

narendasan (Collaborator) commented

Closing
