integrated vlm code for benchmark for Eagle2 #3698
Conversation
Qwen model: with the command I used, I get the following error:

```
  File "/work/TensorRT/tools/llm/run_vlm.py", line 448, in <module>
    inputs = load_inputs(args, processor, DEVICE)
  File "/work/TensorRT/tools/llm/run_vlm.py", line 188, in load_inputs
    from qwen_vl_utils import process_vision_info
ModuleNotFoundError: No module named 'qwen_vl_utils'
```
When I tried the Eagle2 model, it shows:

```
Traceback (most recent call last):
  File "/work/TensorRT/tools/llm/run_vlm.py", line 443, in <module>
    model, processor, emb_layer = load_model(args.model, DEVICE, dtype)
  File "/work/TensorRT/tools/llm/run_vlm.py", line 141, in load_model
    return _load_eagle2(device, torch_dtype)
  File "/work/TensorRT/tools/llm/run_vlm.py", line 101, in _load_eagle2
    AutoModel.from_pretrained(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
    config = cls._autoset_attn_implementation(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2
    raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
```
Please update docs and add these models to the list of supported models.
Thank you for your useful comments! I have addressed every comment!
I have added the installation instructions (for both FlashAttention2 and qwen_vl_utils) to the README and tutorial, and also included a helpful message to guide users on installation if the package is not found when running the script.
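A minimal sketch of the kind of guard described here (the exact message text and its placement in `run_vlm.py` are assumptions):

```python
# Hypothetical import guard for Eagle2, which requires flash-attn at model load time.
try:
    import flash_attn  # noqa: F401
except ImportError as exc:
    raise ImportError(
        "The 'flash-attn' package is required to load Eagle2 models. "
        "Please install it, e.g.: pip install flash-attn"
    ) from exc
```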
tools/llm/README.md (Outdated)
#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
Let's use the eagle model command here since that is fully optimized.
Can you modify this command to use eagle instead?
I missed your comments before. I have modified it! Thank you!
Installing flash-attn 2.7.1+post4 works. Let's mention this in the README under limitations, and convey that we install this version but don't actually use flash-attn; instead we modify the attention to use SDPA.
Force-pushed from ee57be9 to 5cfeafe.
@peri044 @lanluo-nvidia After rebasing onto the latest main branch, graph breaks are occurring because Eagle2 is hardcoded to force the use of flash attention, so I added a workaround at the top of run_vlm.py:

```python
# --- WORKAROUND FOR EAGLE2 SDPA COMPILATION ---
# Eagle2's language model (Qwen2) implicitly defaults to "flash_attention_2"
# due to settings in its remote code and config.json. This prevents direct
# compilation with SDPA. To work around this without modifying the library,
# we remap the "flash_attention_2" entry to the SDPA implementation.
# (ms / mq are aliases for the relevant transformers modeling modules; their
# imports are not shown in this snippet.)
ms.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = ms.ALL_ATTENTION_FUNCTIONS["sdpa"]
mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]
```

However, this workaround (WAR) no longer works after the rebase. I'm continuing to debug the issue but haven't found a solution yet. Do you have any ideas?
@peri044 @lanluo-nvidia Adding the SDPA registration fixed the issue:

```python
# register SDPA variant for the model
if register_sdpa._SDPA_MAPPING.get(args.model, None) is not None:
    register_sdpa._SDPA_MAPPING[args.model](model_config=model.config)
else:
    register_sdpa._SDPA_MAPPING["default"](model_config=model.config)
```

However, even without the fp32_acc lowering, setting use_fp32_acc=True actually increases the mean abs error. Note: even without register_sdpa, enabling use_fp32_acc alone hurts performance; I'll debug this with a toy example later.
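As a rough illustration of the kind of toy comparison mentioned (illustrative only, not the repo's benchmark code), one could measure the mean absolute error of an fp16 matmul with and without fp32 accumulation against a high-precision reference:

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(64, 256, dtype=torch.float16, device=device)
b = torch.randn(256, 64, dtype=torch.float16, device=device)

ref = a.double() @ b.double()                        # high-precision reference
plain_fp16 = (a @ b).double()                        # plain fp16 matmul
fp32_acc = (a.float() @ b.float()).half().double()   # fp32 accumulation, cast back to fp16

print("fp16 matmul      mean abs err:", (plain_fp16 - ref).abs().mean().item())
print("fp32-acc matmul  mean abs err:", (fp32_acc - ref).abs().mean().item())
```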
I've resolved the precision issue. The root cause was a redundant cast to fp32 within the matmul implementation, even when the input was already fp32, which paradoxically increased the mean absolute error. With this fix, the fp16 outputs from PyTorch and TensorRT now match perfectly.
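A pure-PyTorch sketch of the guard pattern described (the actual fix lives in the TensorRT matmul path, so the function below is illustrative only):

```python
import torch

def matmul_with_fp32_acc(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Only insert casts when the operands are not already fp32,
    # avoiding the redundant cast described above.
    out_dtype = a.dtype
    if a.dtype != torch.float32:
        a = a.to(torch.float32)
    if b.dtype != torch.float32:
        b = b.to(torch.float32)
    return (a @ b).to(out_dtype)
```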
Force-pushed from e51f346 to 13fb978.
Minor comments. Functionality looks good to me.
```
@@ -271,3 +282,35 @@ def default_sdpa_pass(
    "google/gemma-3-1b-it": register_gemma3_sdpa_pass,
```
Can you move this `_SDPA_MAPPING` to the top of the file, after the imports, for easy readability?
We can't move the map to the top because it references `register_gemma3_sdpa_pass` and `register_default_sdpa_pass`, which haven't been defined yet at that point in the file.
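For context, a minimal standalone illustration of that ordering constraint (the function bodies are placeholders, not the real passes):

```python
def register_default_sdpa_pass(model_config=None):
    print("registering default SDPA pass")

def register_gemma3_sdpa_pass(model_config=None):
    print("registering gemma3 SDPA pass")

# Moving this dict above the two `def` statements would raise NameError at
# import time, because the names it references would not exist yet.
_SDPA_MAPPING = {
    "google/gemma-3-1b-it": register_gemma3_sdpa_pass,
    "default": register_default_sdpa_pass,
}
```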
tools/llm/README.md (Outdated)

```
@@ -61,8 +75,15 @@ This codebase can be extended to

## Limitations
- We do not currently support sliding window attention (used in Gemma3 and Qwen 3 models) yet.
- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`) which require the `flash-attn` package to be installed. Without flash-attn, these models will fail to load or run properly.
- **Eagle2 FP16 Precision Output Differences**: When compiling Eagle2 models with FP16 precision, minor token-level differences may occur between PyTorch and TensorRT outputs. While the overall context and meaning remain consistent, specific word choices or phrasing may vary slightly. This is expected behavior due to numerical precision differences in FP16 computation and does not affect the model's core functionality or accuracy.
```
Please also add that we only compile the LLM for the Qwen 2.5 VLM because the image encoder has an export issue.
I have removed the precision issue entry and added that we only compile the LLM for Qwen 2.5-VL!
- Transformers v4.52.3
- For VLM models (run_vlm.py):
  - `pip install qwen-vl-utils` (for Qwen2.5-VL-3B-Instruct model)
Oh, is this needed? What are we using from this package?
It is used for input data processing for Qwen2.5-VL:

```python
if args.model == "Qwen/Qwen2.5-VL-3B-Instruct":
    try:
        from qwen_vl_utils import process_vision_info
    except ImportError:
        raise ImportError(
            "The 'qwen-vl-utils' package is required for Qwen VLM models. "
            "Please install it using: pip install qwen-vl-utils"
        )
    image_inputs, video_inputs = process_vision_info(messages)
```
tools/llm/utils.py (Outdated)

```
@@ -47,7 +47,45 @@ def export_llm(model, inputs, min_seq_len=1, max_seq_len=16):
    return ep


def export_llm_no_position_ids(model, inputs, min_seq_len=1, max_seq_len=16):
```
Why are we exporting without position_ids here? Where is it used? Consider merging this with `export_llm` by adding a `use_position_ids` argument.
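A hedged sketch of what that merge could look like (the signature and export details are assumptions, not the repo's actual `export_llm`):

```python
import torch

def export_llm(model, inputs, min_seq_len=1, max_seq_len=16, use_position_ids=True):
    # Single export helper: optionally pass position_ids instead of keeping a
    # separate export_llm_no_position_ids variant.
    seq_len = torch.export.Dim("seq_len", min=min_seq_len, max=max_seq_len)
    dynamic_shapes = {"input_ids": {1: seq_len}}
    kwargs = {}
    if use_position_ids:
        kwargs["position_ids"] = torch.arange(
            inputs.shape[1], device=inputs.device
        ).unsqueeze(0)
        dynamic_shapes["position_ids"] = {1: seq_len}
    return torch.export.export(
        model, (inputs,), kwargs, dynamic_shapes=dynamic_shapes, strict=False
    )
```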
I have removed the `export_llm_no_position_ids` function!
tools/llm/run_vlm.py (Outdated)

```
    lm_wrap, input_embeds, min_seq_len=1, max_seq_len=2560
)

with torch_tensorrt.logging.debug() if args.debug else nullcontext():
```
Modify all the debugging from `torch_tensorrt.logging.debug()` to `torch_tensorrt.dynamo.Debugger()`.
I have modified all the debugging from `torch_tensorrt.logging.debug()` to `torch_tensorrt.dynamo.Debugger()`!
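For reference, the change amounts to something like the following sketch (the no-argument `Debugger()` usage comes from this thread; the surrounding helper and compile call are assumptions):

```python
from contextlib import nullcontext

import torch_tensorrt

def compile_with_optional_debug(exported_program, example_inputs, debug=False):
    # Previously: torch_tensorrt.logging.debug() if debug else nullcontext()
    ctx = torch_tensorrt.dynamo.Debugger() if debug else nullcontext()
    with ctx:
        return torch_tensorrt.dynamo.compile(exported_program, inputs=example_inputs)
```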
tools/llm/utils.py (Outdated)

```
@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(
```
Is this function being used somewhere? If not, consider removing it
I have removed the `generate_mm_qwen2_5_vl_with_timing` function!
```
# For benchmarking, we run the generation with timing enabled.
# For regular runs, we run without timing for a single output.
if args.benchmark:
    if args.model == "Qwen/Qwen2.5-VL-3B-Instruct":
```
At some point, can we build some sort of spec file format for this instead of a long set of cases? (Can be later / for the vla tool.)
Start thinking about how we might construct a more generic tool where each model we support isn't special-cased. We might want to write an RFC for this. My thought is that we create some sort of interface that defines the different model-specific components, plus some sort of Python-based spec file to define the custom behavior (basically what HF transformers does).

Totally agree. We can define a Python-based spec covering: modality (LLM/VLM), input preprocessing (text/image), VLM image-embed handling, custom token generation, min/max sequence lengths, presence of a vision encoder and whether to compile it, etc. I'll pursue this as a separate RFC/PR.
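A hypothetical sketch of such a spec (all names and fields are placeholders, not an agreed design):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VLMSpec:
    model_id: str
    modality: str                                  # "llm" or "vlm"
    min_seq_len: int = 1
    max_seq_len: int = 2560
    has_vision_encoder: bool = False
    compile_vision_encoder: bool = False
    preprocess_inputs: Optional[Callable] = None   # text/image preprocessing hook
    generate: Optional[Callable] = None            # custom token-generation loop

# Example registration; the values below are illustrative only.
SPECS = {
    "Qwen/Qwen2.5-VL-3B-Instruct": VLMSpec(
        model_id="Qwen/Qwen2.5-VL-3B-Instruct",
        modality="vlm",
        has_vision_encoder=True,
        compile_vision_encoder=False,  # only the LLM is compiled (encoder export issue)
    ),
}
```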
Description
Closing the previous pull request (#3652) due to rebase difficulties with the main branch. This new PR resubmits the same changes for the VLM benchmark framework—now cleanly rebased on the latest main branch—and incorporates all feedback from the original review.