🐛 [Bug] Compiling model crashes after updating nvidia driver #2256
Not sure what exactly about the driver upgrade changed this behavior, but the repro script works fine with the provided environment if you move the `torch_tensorrt` import outside of the function body. For example, this works for me:

```python
import torch_tensorrt


def test_trt_osnet():
    import torch
    import tensorrt

    modelfile = "./osnet_ain_reid_mpd_traced5.pt"
    model = torch.jit.load(modelfile)
    model.cuda()
    model.eval()

    print("Feature extractor: Compiling new torch-tensorrt model!")
    inputs = [
        torch_tensorrt.Input(
            shape=[64, 3, 256, 128],
            dtype=torch.float,
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        # Segmentation fault (core dumped)
        model = torch_tensorrt.compile(
            model,
            inputs=inputs,
            enabled_precisions=enabled_precisions,
            truncate_long_and_double=True,
        )
        print("Feature extractor: Storing!")
        result = model(torch.ones((64, 3, 256, 128), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except Exception:
        print("Error was caught as expected.")
        raise


def test_trt_resnet():
    import torch
    import torchvision
    import tensorrt

    untraced_model = torchvision.models.resnet18(pretrained=True).cuda().eval()
    inputs = [torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda"))]
    traced_model = torch.jit.trace(untraced_model, inputs)
    traced_model.save("resnet18_traced.pt")
    traced_model = None
    traced_model = torch.jit.load("resnet18_traced.pt")
    traced_model.cuda().eval()

    inputs = [
        torch_tensorrt.Input(
            shape=[32, 3, 224, 224], dtype=torch.float
        )
    ]
    enabled_precisions = {torch.float}
    print("Feature extractor: Compiling!")
    try:
        with torch_tensorrt.logging.errors():
            trt_model = torch_tensorrt.compile(
                traced_model,
                inputs=inputs,
                enabled_precisions=enabled_precisions,
                truncate_long_and_double=True,
            )
        print("Feature extractor: Storing!")
        result = trt_model(torch.ones((32, 3, 224, 224), dtype=torch.float32, device=torch.device("cuda")))
        print(f"Calculated result with shape {result.shape}")
    except Exception:
        print("Error was caught as expected.")
        raise


if __name__ == "__main__":
    test_trt_osnet()
```

I looked into the segfault itself, and it happens after graph construction, inside the TensorRT library, so the compilation process itself doesn't seem to be at fault; it looks more like an initialization issue. I also tested with the latest main, and this issue appears to have been addressed there regardless of the import location.
Closing.
Bug Description
I've been using Torch-TensorRT 1.2.0 within the NVIDIA PyTorch 22.07 Docker image, as well as via a local pip installation in a conda environment, to successfully compile and run my model with NVIDIA driver 510 until recently. An Ubuntu kernel update broke the GPU driver, and I was forced to update to version 535.
Now I get this crash during compiling with my conda environment (non-docker):
Segmentation fault (core dumped)
Please help me resolve this issue. I've been stuck for some days now trying to fix it.
What I tried
What I did not try
To Reproduce
Steps to reproduce the behavior:
1. `module.save("osnet_ain_reid_mpd_traced5.pt")`
2. `model = torch_tensorrt.compile(...)`
3. Segmentation fault with OSNet; successful completion with ResNet.

Code Samples

- environment.yml
- crash_reproduction.py
Expected behavior / Compile logs
I expected compilation to work in my conda environment like it used to before I updated my GPU driver.
The full output log of executing test_trt_osnet() with docker is attached here:
report_debug_docker.txt
Unfortunately, my terminal cut off some of the earlier output when I executed test_trt_osnet() locally, but this is the part leading up to the segmentation fault:
report_debug_conda.txt
Environment
- How you installed PyTorch (conda, pip, libtorch, source): conda
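Since the regression tracks the driver version (510 vs. 535), it helps to record the exact driver alongside the report. A small hedged helper, invented here for illustration (not part of the issue template), that queries it via `nvidia-smi` and returns `None` when no driver is available:

```python
import shutil
import subprocess

def gpu_driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver / tooling on this machine
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    version = out.stdout.strip()
    return version or None

print(gpu_driver_version())
```

On the reporter's broken setup this would be expected to print `535.x`; on a machine without an NVIDIA driver it prints `None`.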