⚡️ Speed up function eager_attention_forward by 9%
#359
📄 9% (0.09x) speedup for `eager_attention_forward` in `src/transformers/models/janus/modeling_janus.py`

⏱️ Runtime: 6.63 milliseconds → 6.10 milliseconds (best of 187 runs)

📝 Explanation and details
The optimized code achieves an 8% speedup through several targeted optimizations in the attention mechanism:
Key Optimizations:
- **Conditional `repeat_kv` calls:** The optimized version avoids calling `repeat_kv` when `num_key_value_groups == 1`, a common case where no key-value repetition is needed. This eliminates unnecessary tensor operations.
- **In-place scaling:** Instead of creating a new tensor with `* scaling`, the code uses `attn_weights.mul_(scaling)` for in-place multiplication when `scaling != 1.0`, reducing memory allocation and copy overhead.
- **Conditional dropout:** The optimized version checks `if dropout:` before applying dropout, completely skipping the computation when dropout is 0.0, which is common during inference.
- **Optimized type conversion:** Rather than always calling `.to(query.dtype)` after softmax, it first checks whether the conversion is needed with `if attn_weights.dtype != query.dtype`, avoiding unnecessary type casting.
- **Smart contiguous check:** The code only calls `.contiguous()` when actually needed, by checking `if not attn_output.is_contiguous()`, eliminating redundant memory operations.
- **Improved `repeat_kv` implementation:** Uses a `reshape().repeat().reshape()` pattern instead of `expand()`, which can be more memory-efficient for certain tensor shapes. (All six patterns are sketched in the example below.)
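To make the list concrete, here is a minimal, self-contained sketch of what the optimized pattern looks like in PyTorch. This is not the exact code from the PR; the signature and the `module.num_key_value_groups` attribute follow the usual Hugging Face eager-attention convention and are assumed here.

```python
# Sketch of the optimized eager attention pattern described above (not the PR's exact code).
from typing import Optional

import torch
from torch import nn


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each KV head n_rep times using reshape().repeat().reshape() (item 6)."""
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states.reshape(batch, num_kv_heads, 1, seq_len, head_dim)
    hidden_states = hidden_states.repeat(1, 1, n_rep, 1, 1)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)


def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,          # (batch, num_heads, q_len, head_dim)
    key: torch.Tensor,            # (batch, num_kv_heads, kv_len, head_dim)
    value: torch.Tensor,          # (batch, num_kv_heads, kv_len, head_dim)
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    # 1. Only expand K/V when grouped-query attention actually needs it.
    n_rep = module.num_key_value_groups
    key_states = key if n_rep == 1 else repeat_kv(key, n_rep)
    value_states = value if n_rep == 1 else repeat_kv(value, n_rep)

    attn_weights = torch.matmul(query, key_states.transpose(2, 3))
    # 2. In-place scaling, skipped entirely when scaling == 1.0.
    if scaling != 1.0:
        attn_weights.mul_(scaling)

    if attention_mask is not None:
        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
        attn_weights = attn_weights + causal_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32)
    # 4. Cast back only if softmax actually changed the dtype.
    if attn_weights.dtype != query.dtype:
        attn_weights = attn_weights.to(query.dtype)

    # 3. Skip dropout entirely when p == 0.0 (the common inference case).
    if dropout:
        attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)

    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2)
    # 5. Call .contiguous() only when the transpose left the tensor non-contiguous.
    if not attn_output.is_contiguous():
        attn_output = attn_output.contiguous()

    return attn_output, attn_weights
```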
Performance Impact:
The test results show the optimizations are particularly effective when the conditional fast paths above are taken.
These optimizations target common patterns in transformer attention, making them valuable for inference workloads where dropout is typically disabled and scaling factors are often 1.0.
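For illustration, a rough inference-style invocation of the sketch above; the module, shapes, and names here are hypothetical and not taken from the PR's test suite.

```python
# Hypothetical inference-style call: dropout disabled and a single KV group,
# so the dropout branch and both repeat_kv expansions are skipped.
class DummyAttention(nn.Module):
    num_key_value_groups = 1  # no grouped-query expansion needed


module = DummyAttention().eval()
batch, heads, seq, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

with torch.no_grad():
    out, weights = eager_attention_forward(
        module, q, k, v, attention_mask=None,
        scaling=head_dim ** -0.5, dropout=0.0,
    )
print(out.shape)  # torch.Size([2, 128, 8, 64])
```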
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-eager_attention_forward-mi9qkfkh` and push.