Description
System Info
System Information:
- OS: Ubuntu 24.04
- Python version: 3.12
- CUDA version: 13.0
- GPU model(s): NVIDIA RTX PRO 6000
- Driver version:
- TensorRT-LLM version: 1.2.0rc6
How would you like to use TensorRT-LLM:
I want to run inference of GPT-OSS models (specifically 20B and 120B variants from Hugging Face). I don't know how to integrate them with TensorRT-LLM using the TRT Flow approach (similar to EXAONE models) or optimize them for my use case.
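To make the goal concrete: after building an engine, I would expect to run something like the standard example runner shown below. This is only my guess at the end state — the paths are placeholders, and whether `examples/run.py` handles GPT-OSS at all is exactly what I'm unsure about:

```bash
# Placeholder paths; GPT-OSS support in the standard runner is the open question.
python examples/run.py \
    --engine_dir trt_engines/gpt-oss/fp16/1-gpu \
    --tokenizer_dir $HF_MODEL_DIR \
    --input_text "What is TensorRT-LLM?" \
    --max_output_len 128
```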
Specific questions:
- Model: GPT-OSS-20B, GPT-OSS-120B
What I've tried:
I'm familiar with the EXAONE example in the documentation, which uses:
- `convert_checkpoint.py` from the LLaMA example
- `trtllm-build` to create the engine
However, there's no specific guide for GPT-OSS models. I'd like to know:
- Is the GPT-OSS architecture compatible with TensorRT-LLM? Should I use the LLaMA convert script or a different one?
- What's the correct conversion flow for GPT-OSS models?
```bash
# Is this the right approach?
python examples/llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/gpt-oss/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir trt_models/gpt-oss/fp16/1-gpu \
    --output_dir trt_engines/gpt-oss/fp16/1-gpu \
    --gemm_plugin auto
```
- For the 120B model, what tensor parallelism configuration do you recommend? (e.g., MXFP4)
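If 4-way tensor parallelism turned out to be the recommendation, I'd guess the flow looks roughly like this — the `--tp_size 4` value and GPT-OSS compatibility with the LLaMA convert script are assumptions on my part, not something I've verified:

```bash
# Guess at a 4-GPU TP flow; --tp_size 4 is an assumption, not a recommendation.
python examples/llama/convert_checkpoint.py \
    --model_dir $HF_MODEL_DIR \
    --output_dir trt_models/gpt-oss/fp16/4-gpu \
    --dtype float16 \
    --tp_size 4

trtllm-build \
    --checkpoint_dir trt_models/gpt-oss/fp16/4-gpu \
    --output_dir trt_engines/gpt-oss/fp16/4-gpu \
    --gemm_plugin auto

# One MPI rank per GPU at runtime.
mpirun -n 4 python examples/run.py \
    --engine_dir trt_engines/gpt-oss/fp16/4-gpu \
    --tokenizer_dir $HF_MODEL_DIR \
    --input_text "Hello" \
    --max_output_len 64
```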