
🐛 [Bug] Compiling model crashes after updating nvidia driver #2256

Closed
lennartmoritz opened this issue Aug 23, 2023 · 2 comments
Labels
bug Something isn't working

Comments


lennartmoritz commented Aug 23, 2023

Bug Description

Until recently, I had been successfully using Torch-TensorRT 1.2.0 with Nvidia Driver 510, both within the Nvidia Pytorch 22.07 Docker image and via a local pip installation (in a conda environment), to compile and use my model. An Ubuntu kernel update broke the GPU driver and I was forced to update to version 535.

Now I get this crash during compilation with my conda environment (non-docker):
Segmentation fault (core dumped)

  • No crash in the docker container (everything fine).
  • Crash when I locally compile OSNet model (needed in my project)
  • No crash compiling pretrained resnet18 similar to torch_trt_simple_example.py
  • Crash with OSNet when I recreate it similar to torch_trt_simple_example.py
  • No issues when I don't use Torch-TensorRT (loading and inference on OSNet work fine)

Please help me resolve this issue. I've been stuck for some days now trying to fix it.

What I tried

  • Using Nvidia Driver 525
  • Purging all Nvidia related packages and reinstalling Nvidia Driver 535 multiple times
  • Removing and reinstalling my conda environment multiple times
  • Creating a clean conda env with only the necessary components to reproduce the issue
  • Using cudatoolkit==11.6.2 and nvidia-tensorrt==8.4.3.1 (this used to work with driver 510)
  • Using cudatoolkit==11.6.0 and nvidia-tensorrt==8.4.1.5 (match container versions)

What I did not try

  • Updating the Torch-TensorRT version to 1.3 or 1.4. This would be my last resort: it is not guaranteed to work, it would be a lot of work, and it would break other components of my project due to the higher torch version.
  • Ignoring the issue and only using docker. I prefer solving my issues to giving up on them.

To Reproduce

Steps to reproduce the behavior:

  1. Trace the OSNet model and store it with module.save("osnet_ain_reid_mpd_traced5.pt")
  2. Create conda environment from file (see environment.yml below)
  3. Execute both functions in my minimal example.
  4. Crash (segmentation fault) during model = torch_tensorrt.compile(...) with OSNet; successful completion with resnet
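Step 1 can be sketched as follows. This is a minimal sketch: TinyNet and the filename tiny_traced.pt are hypothetical stand-ins, since the real OSNet weights are not part of this report; only the trace/save/load mechanics matter here.

```python
import torch

# TinyNet is a hypothetical stand-in for OSNet.
class TinyNet(torch.nn.Module):
    def forward(self, x):
        # collapse the spatial dims so the output shape is easy to check
        return torch.relu(x).mean(dim=(2, 3))

model = TinyNet().eval()
example = torch.ones(1, 3, 256, 128)          # OSNet-sized dummy input
traced = torch.jit.trace(model, example)      # record the forward pass
traced.save("tiny_traced.pt")                 # analogous to osnet_ain_reid_mpd_traced5.pt
reloaded = torch.jit.load("tiny_traced.pt")   # what crash_reproduction.py later loads
print(tuple(reloaded(example).shape))         # (1, 3)
```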

Code Samples

environment.yml:
name: my_env
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch==1.12.1
  - torchvision==0.13.1
  - torchaudio==0.12.1
  - cudatoolkit==11.6.0
  - pip
  - pip:
      - nvidia-pyindex
      - nvidia-tensorrt==8.4.1.5
      - --find-links https://github.com/pytorch/TensorRT/releases/expanded_assets/v1.2.0
      - torch-tensorrt==1.2.0
crash_reproduction.py:
def test_trt_osnet():
    import torch
    import tensorrt
    import torch_tensorrt

    modelfile = "/.../models/osnet_ain_reid_mpd_traced5.pt"
    model = torch.jit.load(modelfile)
    model.cuda()
    model.eval()

    print("Feature extractor: Compiling new torch-tensorrt model!")
    inputs = [
        torch_tensorrt.Input(
            shape=[64, 3, 256, 128],
            dtype=torch.float,
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.debug():
            # Segmentation fault (core dumped)
            model = torch_tensorrt.compile(
                model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = model(torch.ones((64, 3, 256, 128), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

def test_trt_resnet():
    import torch
    import torchvision
    import tensorrt
    import torch_tensorrt

    untraced_model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    inputs = [torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda"))]
    traced_model = torch.jit.trace(untraced_model, inputs)
    traced_model.save("resnet18_traced.pt")
    traced_model = None
    traced_model = torch.jit.load("resnet18_traced.pt")
    traced_model.cuda().eval()
    inputs = [
        torch_tensorrt.Input(
            shape=[32, 3, 224, 224], dtype=torch.float
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.info():
            trt_model = torch_tensorrt.compile(
                traced_model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = trt_model(torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

Expected behavior / Compile logs

I expected compilation to work in my conda environment like it did before I updated my GPU driver.

The full output log of executing test_trt_osnet() with docker is attached here:
report_debug_docker.txt

Unfortunately my terminal cut off some of the earlier output when I executed test_trt_osnet() locally, but this is the part leading up to the segmentation fault:
report_debug_conda.txt

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • Torch-TensorRT Version (e.g. 1.0.0): 1.2.0
  • PyTorch Version (e.g. 1.0): 1.12.1
  • CPU Architecture: x86_64
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, libtorch, source): conda
  • Python version: 3.8.17
  • CUDA version: 11.6
  • GPU models and configuration: RTX 3080, Driver Version 535
@lennartmoritz lennartmoritz added the bug Something isn't working label Aug 23, 2023
narendasan (Collaborator) commented Aug 24, 2023

Not sure what exactly about the driver upgrade changed this behavior, but the repro script seems to work fine with the provided environment if you move the torch_tensorrt import outside of the function body. For example, this works for me:

import torch_tensorrt

def test_trt_osnet():
    import torch
    import tensorrt

    modelfile = "./osnet_ain_reid_mpd_traced5.pt"
    model = torch.jit.load(modelfile)
    model.cuda()
    model.eval()

    print("Feature extractor: Compiling new torch-tensorrt model!")
    inputs = [
        torch_tensorrt.Input(
            shape=[64, 3, 256, 128],
            dtype=torch.float,
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        # Segmentation fault (core dumped)
        model = torch_tensorrt.compile(
            model,
            inputs=inputs,
            enabled_precisions=enabled_precisions,
            truncate_long_and_double=True,
        )
        print("Feature extractor: Storing!")
        result = model(torch.ones((64, 3, 256, 128), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

def test_trt_resnet():
    import torch
    import torchvision
    import tensorrt

    untraced_model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    inputs = [torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda"))]
    traced_model = torch.jit.trace(untraced_model, inputs)
    traced_model.save("resnet18_traced.pt")
    traced_model = None
    traced_model = torch.jit.load("resnet18_traced.pt")
    traced_model.cuda().eval()
    inputs = [
        torch_tensorrt.Input(
            shape=[32, 3, 224, 224], dtype=torch.float
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.errors():
            trt_model = torch_tensorrt.compile(
                traced_model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = trt_model(torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except:
        print("Error was caught as expected.")
        raise

if __name__ == "__main__":
    test_trt_osnet()

I looked into the segfault itself and it seems to happen post graph construction, inside the tensorrt library, so it doesn't seem like something faulty in the compilation process; it is perhaps something like the initialization. I also tested with the latest main, and it seems this issue has been addressed regardless of import location.
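The import-location effect described above can be illustrated with a stdlib-only sketch. Here colorsys is a hypothetical stand-in for a library that performs global initialization at import time (as torch_tensorrt does): a module-level import runs those side effects once at program start, while an import inside a function body is deferred to the first call, changing when initialization happens relative to other libraries.

```python
import sys

MODULE = "colorsys"               # stand-in for a heavyweight library
print(MODULE in sys.modules)      # False in a fresh interpreter

def use_library():
    import colorsys               # deferred: runs on the first call only
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)

hue, _, _ = use_library()
print(MODULE in sys.modules)      # True: the import side effects ran here
```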

narendasan (Collaborator) commented

Closing
