
Conversation

chohk88
Collaborator

@chohk88 chohk88 commented Jul 21, 2025

Description

Closing the previous pull request (#3652) due to rebase difficulties with the main branch. This new PR resubmits the same changes for the VLM benchmark framework—now cleanly rebased on the latest main branch—and incorporates all feedback from the original review.

  1. Integrated VLM benchmark framework
    • Currently supports Eagle2, Qwen 2.5-VL
    • Planned support: PaliGemma, etc.
  2. Added a custom token-generation function for multi-modal (MM) models

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@chohk88 chohk88 requested review from peri044 and zewenli98 July 21, 2025 16:27
@chohk88 chohk88 self-assigned this Jul 21, 2025
@chohk88 chohk88 added component: conversion Issues re: Conversion stage component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Jul 21, 2025
@meta-cla meta-cla bot added the cla signed label Jul 21, 2025
@github-actions github-actions bot removed component: conversion Issues re: Conversion stage component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Jul 21, 2025
@peri044
Collaborator

peri044 commented Aug 6, 2025

Qwen model: command I used: `python run_vlm.py`

Error:

File "/work/TensorRT/tools/llm/run_vlm.py", line 448, in <module>
    inputs = load_inputs(args, processor, DEVICE)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 188, in load_inputs
    from qwen_vl_utils import process_vision_info
ModuleNotFoundError: No module named 'qwen_vl_utils'

@peri044
Collaborator

peri044 commented Aug 6, 2025

When I tried the Eagle2 model, it shows:

```
Traceback (most recent call last):
  File "/work/TensorRT/tools/llm/run_vlm.py", line 443, in <module>
    model, processor, emb_layer = load_model(args.model, DEVICE, dtype)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 141, in load_model
    return _load_eagle2(device, torch_dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 101, in _load_eagle2
    AutoModel.from_pretrained(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
    config = cls._autoset_attn_implementation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2
    raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
```

Collaborator

@peri044 peri044 left a comment

Please update docs and add these models to the list of supported models.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 11, 2025
github-actions[bot]

This comment was marked as resolved.

Collaborator Author

@chohk88 chohk88 left a comment

Thank you for your useful comments! I have addressed every comment!

@chohk88
Collaborator Author

chohk88 commented Aug 11, 2025

> Qwen model: command I used: `python run_vlm.py`
>
> Error:
>
> ```
> File "/work/TensorRT/tools/llm/run_vlm.py", line 448, in <module>
>     inputs = load_inputs(args, processor, DEVICE)
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/work/TensorRT/tools/llm/run_vlm.py", line 188, in load_inputs
>     from qwen_vl_utils import process_vision_info
> ModuleNotFoundError: No module named 'qwen_vl_utils'
> ```

I have added the installation instructions (for both FlashAttention2 and qwen_vl_utils) to the README and tutorial, and also included a helpful message to guide users on installation if the package is not found when running the script.

#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
Collaborator

Let's use the Eagle model command here since that is fully optimized.

Collaborator

Can you modify this command to use Eagle instead?
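
For reference, the Eagle2 variant of this command might look like the following (the nvidia/Eagle2-2B model id is assumed here; the final command in the README may differ):

```bash
python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```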

Collaborator Author

I missed your comments before. I have modified it! Thank you!

Collaborator

@peri044 peri044 left a comment

Installing flash-attn 2.7.1+post4 works. Let's mention this in the README under limitations, and convey that although we install this version, we don't actually use flash-attn; instead we modify the model to use SDPA.

@chohk88 chohk88 force-pushed the kv_cache_eagle_rebase branch 2 times, most recently from ee57be9 to 5cfeafe Compare September 6, 2025 00:36
@chohk88
Collaborator Author

chohk88 commented Sep 6, 2025

@peri044 @lanluo-nvidia After rebasing to the latest main branch, graph breaks are occurring due to the torch.ops.aten._scaled_dot_product_flash_attention.default and _operator.getitem operations, as shown in the attached log:
eagle_lm_layer_1.log

Eagle2 is hardcoded to force the use of flash attention, so I added a workaround at the top of run_vlm.py:

```python
# --- WORKAROUND FOR EAGLE2 SDPA COMPILATION ---
# Eagle2's language model (Qwen2) implicitly defaults to "flash_attention_2"
# due to settings in its remote code and config.json. This prevents direct
# compilation with SDPA. To work around this without modifying the library,
# we remap the flash_attention_2 entry to the SDPA implementation in the
# relevant transformers modeling modules (aliased here as ms and mq).
ms.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = ms.ALL_ATTENTION_FUNCTIONS["sdpa"]
mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]
```

However, it seems like this workaround (WAR) is no longer working after the rebase. I'm continuing to debug this issue but haven't found a solution yet. Do you have any ideas?

@chohk88
Collaborator Author

chohk88 commented Sep 8, 2025

> @peri044 @lanluo-nvidia After rebasing to the latest main branch, graph breaks are occurring due to the torch.ops.aten._scaled_dot_product_flash_attention.default and _operator.getitem operations, as shown in the attached log: eagle_lm_layer_1.log
>
> Eagle2 is hardcoded to force the use of flash attention, so I added a workaround at the top of run_vlm.py:
>
> ```python
> # --- WORKAROUND FOR EAGLE2 SDPA COMPILATION ---
> # Eagle2's language model (Qwen2) implicitly defaults to "flash_attention_2"
> # due to settings in its remote code and config.json. This prevents direct
> # compilation with SDPA. To work around this without modifying the library,
> # we remap the flash_attention_2 entry to the SDPA implementation in the
> # relevant transformers modeling modules (aliased here as ms and mq).
> ms.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = ms.ALL_ATTENTION_FUNCTIONS["sdpa"]
> mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]
> ```
>
> However, it seems like this workaround (WAR) is no longer working after the rebase. I'm continuing to debug this issue but haven't found a solution yet. Do you have any ideas?

@peri044 @lanluo-nvidia Adding the SDPA registration fixed the issue.

```python
# register SDPA variant for the model
if register_sdpa._SDPA_MAPPING.get(args.model, None) is not None:
    register_sdpa._SDPA_MAPPING[args.model](model_config=model.config)
else:
    register_sdpa._SDPA_MAPPING["default"](model_config=model.config)
```

But even without fp32_acc lowering, setting use_fp32_acc=True actually increases the mean abs error.
Since outputs match better with use_fp32_acc=False, I think we should stick with that.

Note: Even without register_sdpa, enabling use_fp32_acc alone hurts performance — I’ll debug this with a toy example later.
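
For reference, a minimal way to quantify the mismatch discussed above (a generic sketch, not the benchmark's actual metric code; pyt_logits and trt_logits are placeholder names):

```python
import torch

def mean_abs_error(pyt_logits: torch.Tensor, trt_logits: torch.Tensor) -> float:
    # Compare the two outputs in fp32 so the metric itself does not add
    # extra fp16 rounding on top of the difference being measured.
    return (pyt_logits.float() - trt_logits.float()).abs().mean().item()

# e.g. mean_abs_error(pyt_model(**inputs).logits, trt_model(**inputs).logits)
```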

@github-actions github-actions bot added component: tests Issues re: Tests component: conversion Issues re: Conversion stage component: converters Issues re: Specific op converters component: api [Python] Issues re: Python API component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Sep 8, 2025
@chohk88
Collaborator Author

chohk88 commented Sep 8, 2025

> @peri044 @lanluo-nvidia After rebasing to the latest main branch, graph breaks are occurring due to the torch.ops.aten._scaled_dot_product_flash_attention.default and _operator.getitem operations, as shown in the attached log: eagle_lm_layer_1.log
>
> Eagle2 is hardcoded to force the use of flash attention, so I added a workaround at the top of run_vlm.py:
>
> ```python
> # --- WORKAROUND FOR EAGLE2 SDPA COMPILATION ---
> # Eagle2's language model (Qwen2) implicitly defaults to "flash_attention_2"
> # due to settings in its remote code and config.json. This prevents direct
> # compilation with SDPA. To work around this without modifying the library,
> # we remap the flash_attention_2 entry to the SDPA implementation in the
> # relevant transformers modeling modules (aliased here as ms and mq).
> ms.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = ms.ALL_ATTENTION_FUNCTIONS["sdpa"]
> mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]
> ```
>
> However, it seems like this workaround (WAR) is no longer working after the rebase. I'm continuing to debug this issue but haven't found a solution yet. Do you have any ideas?
>
> @peri044 @lanluo-nvidia Adding the SDPA registration fixed the issue.
>
> ```python
> # register SDPA variant for the model
> if register_sdpa._SDPA_MAPPING.get(args.model, None) is not None:
>     register_sdpa._SDPA_MAPPING[args.model](model_config=model.config)
> else:
>     register_sdpa._SDPA_MAPPING["default"](model_config=model.config)
> ```
>
> But even without fp32_acc lowering, setting use_fp32_acc=True actually increases the mean abs error. Since outputs match better with use_fp32_acc=False, I think we should stick with that.
>
> Note: Even without register_sdpa, enabling use_fp32_acc alone hurts performance — I'll debug this with a toy example later.

@peri044 @lanluo-nvidia

I've resolved the precision issue. The root cause was a redundant cast to fp32 within the matmul implementation, even when the input was already fp32, which paradoxically increased the mean absolute error. With this fix, the fp16 outputs from PyTorch and TensorRT now match perfectly.
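
For illustration, the fix amounts to guarding the cast so it only happens when the operand is not already fp32 (a minimal sketch with hypothetical names, not the actual converter code):

```python
import torch

def cast_to_fp32_if_needed(t: torch.Tensor) -> torch.Tensor:
    # Skip the cast when the tensor is already fp32; the redundant cast on
    # fp32 inputs was what inflated the mean absolute error.
    return t if t.dtype == torch.float32 else t.to(torch.float32)
```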

> ========= PyTorch =========
> PyTorch model generated text:  The image depicts a vibrant street scene in what appears to be a Chinatown district. In the foreground, there is a prominent red stop sign with the word "STOP" in white letters. The stop sign is mounted on a pole and is positioned at an intersection, indicating a traffic control point.
> 
> Behind the stop sign, there are two white stone lion statues, which are traditional guardians in Chinese culture, often found at the entrance of temples and gates. These statues are placed symmetrically on either side of the street, adding a cultural and historical element to the scene.
> 
> The background features a traditional Chinese archway with red and gold colors, which
> ===================================
> ========= TensorRT =========
> TensorRT model generated text:  The image depicts a vibrant street scene in what appears to be a Chinatown district. In the foreground, there is a prominent red stop sign with the word "STOP" in white letters. The stop sign is mounted on a pole and is positioned at an intersection, indicating a traffic control point.
> 
> Behind the stop sign, there are two white stone lion statues, which are traditional guardians in Chinese culture, often found at the entrance of temples and gates. These statues are placed symmetrically on either side of the street, adding a cultural and historical element to the scene.
> 
> The background features a traditional Chinese archway with red and gold colors, which
> ===================================
> PyTorch and TensorRT outputs match: True

@chohk88 chohk88 force-pushed the kv_cache_eagle_rebase branch from e51f346 to 13fb978 Compare September 9, 2025 01:09
Collaborator

@peri044 peri044 left a comment

Minor comments. Functionality looks good to me.

@@ -271,3 +282,35 @@ def default_sdpa_pass(
"google/gemma-3-1b-it": register_gemma3_sdpa_pass,
Collaborator

Can you move this _SDPA_MAPPING to the top of the file after imports for easy readability?

Collaborator Author

We can’t move the map to the top because it references register_gemma3_sdpa_pass and register_default_sdpa_pass, which haven’t been defined yet.
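
A toy example of the ordering constraint (illustrative only, not code from this PR):

```python
# A module-level dict is evaluated eagerly, so referencing a function that is
# only defined further down the file raises NameError at import time.
try:
    _MAPPING = {"default": register_default_pass}  # not defined yet
except NameError as err:
    print(err)  # name 'register_default_pass' is not defined

def register_default_pass():
    ...
```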

#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
Collaborator

Can you modify this command to use Eagle instead?

@@ -61,8 +75,15 @@ This codebase can be extended to

## Limitations
- We do not currently support sliding window attention (used in Gemma3 and Qwen 3 models) yet.
- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`) which require the `flash-attn` package to be installed. Without flash-attn, these models will fail to load or run properly.
- **Eagle2 FP16 Precision Output Differences**: When compiling Eagle2 models with FP16 precision, minor token-level differences may occur between PyTorch and TensorRT outputs. While the overall context and meaning remain consistent, specific word choices or phrasing may vary slightly. This is expected behavior due to numerical precision differences in FP16 computation and does not affect the model's core functionality or accuracy.
Collaborator

Please also add that we only compile the LLM for the Qwen 2.5 VLM, because the image encoder has an export issue.

Collaborator Author

I have removed the precision issue note, and also added that we only compile the LLM for Qwen 2.5-VL!

- Transformers v4.52.3
- For VLM models (run_vlm.py):
- `pip install qwen-vl-utils` (for Qwen2.5-VL-3B-Instruct model)
Collaborator

Oh, is this needed? What are we using from this package?

Collaborator Author

It is used for input data processing for Qwen2.5-VL:

```python
if args.model == "Qwen/Qwen2.5-VL-3B-Instruct":
    try:
        from qwen_vl_utils import process_vision_info
    except ImportError:
        raise ImportError(
            "The 'qwen-vl-utils' package is required for Qwen VLM models. "
            "Please install it using: pip install qwen-vl-utils"
        )

    image_inputs, video_inputs = process_vision_info(messages)
```

@@ -47,7 +47,45 @@ def export_llm(model, inputs, min_seq_len=1, max_seq_len=16):
return ep


def get_zeroed_static_cache_inputs(model: torch.fx.GraphModule):
def export_llm_no_position_ids(model, inputs, min_seq_len=1, max_seq_len=16):
Collaborator

Why are we exporting without position_ids here? Where is it used?
Consider merging this with export_llm by adding a use_position_ids argument.
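
A rough sketch of the suggested merge (illustrative only; it assumes the example inputs are passed as keyword arguments to torch.export, and the parameter names here are placeholders):

```python
import torch

def export_llm(model, inputs, min_seq_len=1, max_seq_len=16, use_position_ids=True):
    # Single export helper: position_ids are only traced when requested.
    seq_len = torch.export.Dim("seq_len", min=min_seq_len, max=max_seq_len)
    kwargs = {"input_ids": inputs}
    dynamic_shapes = {"input_ids": {1: seq_len}}
    if use_position_ids:
        kwargs["position_ids"] = torch.arange(inputs.shape[1], device=inputs.device).unsqueeze(0)
        dynamic_shapes["position_ids"] = {1: seq_len}
    return torch.export.export(model, args=(), kwargs=kwargs, dynamic_shapes=dynamic_shapes, strict=False)
```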

Collaborator Author

I have removed the export_llm_no_position_ids function!

lm_wrap, input_embeds, min_seq_len=1, max_seq_len=2560
)

with torch_tensorrt.logging.debug() if args.debug else nullcontext():
Collaborator

Modify all the debugging from torch_tensorrt.logging.debug() to torch_tensorrt.dynamo.Debugger()

Collaborator Author

I have modified all the debugging from torch_tensorrt.logging.debug() to torch_tensorrt.dynamo.Debugger()!
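
For context, the swap looks roughly like the following (a sketch; the exact arguments accepted by torch_tensorrt.dynamo.Debugger may differ between versions, and exported_program / example_inputs are placeholder names):

```python
from contextlib import nullcontext

import torch_tensorrt

# Before: with torch_tensorrt.logging.debug() if args.debug else nullcontext():
with torch_tensorrt.dynamo.Debugger("debug") if args.debug else nullcontext():
    trt_model = torch_tensorrt.dynamo.compile(exported_program, inputs=example_inputs)
```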



@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(
Collaborator

Is this function being used somewhere? If not, consider removing it

Collaborator Author

I have removed the generate_mm_qwen2_5_vl_with_timing function!

# For benchmarking, we run the generation with timing enabled.
# For regular runs, we run without timing for a single output.
if args.benchmark:
if args.model == "Qwen/Qwen2.5-VL-3B-Instruct":
Collaborator

At some point, can we build some sort of spec file format for this instead of a long set of cases? (can be later / for the VLA tool)

Collaborator

@narendasan narendasan left a comment

Start thinking about how we might construct a more generic tool where each model we support isn't cased in. We might want to write an RFC for this.

My thought is that we create some sort of interface that defines the different model-specific components and some sort of Python-based spec file to define the custom behavior (basically what HF transformers does).

chohk88

This comment was marked as duplicate.

@chohk88
Collaborator Author

chohk88 commented Sep 9, 2025

> Start thinking about how we might construct a more generic tool where each model we support isn't cased in. We might want to write an RFC for this.
>
> My thought is that we create some sort of interface that defines the different model-specific components and some sort of Python-based spec file to define the custom behavior (basically what HF transformers does).

Totally agree. We can define a Python-based spec covering: modality (LLM/VLM), input preprocessing (text/image), VLM image-embed handling, custom token generation, min/max sequence lengths, presence of a vision encoder and whether to compile it, etc. I’ll pursue this as a separate RFC/PR.
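
A rough sketch of what such a spec could look like (purely illustrative; the class name, fields, and registry below are hypothetical and would be settled in the RFC):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelSpec:
    """Per-model configuration intended to replace hard-coded branches in run_vlm.py."""

    name: str
    modality: str                        # "llm" or "vlm"
    preprocess: Callable                 # builds model inputs from raw text/images
    generate: Optional[Callable] = None  # custom token-generation loop, if any
    min_seq_len: int = 1
    max_seq_len: int = 2560
    has_vision_encoder: bool = False
    compile_vision_encoder: bool = False

# Hypothetical registry keyed by Hugging Face model id:
# SPECS = {"nvidia/Eagle2-2B": ModelSpec(name="nvidia/Eagle2-2B", modality="vlm", preprocess=..., generate=...)}
```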
