
Conversation

@codeflash-ai codeflash-ai bot commented Nov 22, 2025

📄 9% (0.09x) speedup for _cleanup_spaces in src/transformers/models/kosmos2/processing_kosmos2.py

⏱️ Runtime: 518 microseconds → 475 microseconds (best of 249 runs)

📝 Explanation and details

The optimization replaces an imperative loop with a list comprehension, providing a 9% speedup by eliminating overhead from repeated variable assignments and list appends.

Key changes:

  • List comprehension instead of loop: The original code used a manual loop with intermediate variable assignments (entity_name_leading_spaces, entity_name_trailing_spaces) and list appends. The optimized version computes these values inline within a list comprehension.
  • Inline calculations: Instead of storing len(entity_name) - len(entity_name.lstrip()) and len(entity_name) - len(entity_name.rstrip()) in variables, these are calculated directly in the tuple construction.

Why this is faster:

  • Reduced function call overhead: List comprehensions are implemented in C and avoid the repeated overhead of list.append() calls
  • Eliminated intermediate variable assignments: The original code created temporary variables for each entity, adding assignment overhead
  • Better memory locality: List comprehensions can pre-allocate the result list size in some cases, reducing memory allocation overhead

Impact on workloads:
Based on the function reference, _cleanup_spaces is called from clean_text_and_extract_entities_with_bboxes, which processes text with grounding tokens for the Kosmos2 vision-language model. This suggests it's in a text processing pipeline that could be called frequently during model inference.

Test case performance:
The optimization shows the best gains (9-18% faster) on large-scale test cases with many entities (test_large_many_entities, test_large_number_of_empty_entities), where the loop overhead reduction is most significant. For small inputs with few entities, the optimization provides modest gains or is slightly slower due to list comprehension setup costs.
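The append-vs-comprehension overhead the report attributes the gains to can be demonstrated in isolation with `timeit`. This is a generic micro-benchmark of the pattern, not the Kosmos2 code itself, and absolute numbers will vary by machine and Python version:

```python
import timeit


def with_append(items):
    # Imperative pattern: temporaries plus list.append per iteration
    out = []
    for s in items:
        lead = len(s) - len(s.lstrip())
        out.append((s.strip(), lead))
    return out


def with_comprehension(items):
    # Same work expressed as a single list comprehension
    return [(s.strip(), len(s) - len(s.lstrip())) for s in items]


items = [f" word{i} " for i in range(500)]
assert with_append(items) == with_comprehension(items)

t_loop = timeit.timeit(lambda: with_append(items), number=2000)
t_comp = timeit.timeit(lambda: with_comprehension(items), number=2000)
print(f"loop: {t_loop:.3f}s  comprehension: {t_comp:.3f}s")
```

On typical CPython builds the comprehension comes out a few percent faster for lists of this size, consistent with the single-digit speedup reported above.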

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests           🔘 None Found
🌀 Generated Regression Tests    41 Passed
⏪ Replay Tests                  🔘 None Found
🔎 Concolic Coverage Tests       🔘 None Found
📊 Tests Coverage                100.0%
🌀 Generated Regression Tests and Runtime
from transformers.models.kosmos2.processing_kosmos2 import _cleanup_spaces


# unit tests

# 1. Basic Test Cases


def test_basic_no_spaces():
    # No spaces in text or entity names, indices should remain the same
    text = "Hello world"
    entities = [("Hello", (0, 5), [1, 2, 3]), ("world", (6, 11), [4, 5, 6])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.42μs -> 2.35μs (2.85% faster)
    assert new_text == "Hello world"
    assert new_entities == entities


def test_basic_text_leading_trailing_spaces():
    # Spaces around text, entity indices should be adjusted accordingly
    text = "   Hello world   "
    entities = [("Hello", (3, 8), [1]), ("world", (9, 14), [2])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.44μs -> 2.51μs (2.91% slower)
    assert new_text == "Hello world"
    assert new_entities == [("Hello", (0, 5), [1]), ("world", (6, 11), [2])]


def test_basic_entity_leading_trailing_spaces():
    # Entity names have spaces, should be stripped and indices adjusted
    text = "Hello world"
    entities = [(" Hello ", (0, 6), [1]), (" world", (6, 12), [2])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.59μs -> 2.56μs (1.21% faster)


def test_basic_both_text_and_entity_spaces():
    # Both text and entity names have spaces
    text = "  Hello world  "
    entities = [(" Hello ", (2, 8), [1]), (" world", (8, 14), [2])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.54μs -> 2.52μs (0.993% faster)


def test_basic_empty_entities():
    # No entities, should not fail
    text = "  Hello world  "
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.09μs -> 1.27μs (14.3% slower)
    assert new_text == "Hello world"
    assert new_entities == []


# 2. Edge Test Cases


def test_edge_empty_text_and_entities():
    # Empty text and empty entities
    text = ""
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 988ns -> 1.24μs (20.3% slower)
    assert new_text == ""
    assert new_entities == []


def test_edge_only_spaces_text():
    # Text is all spaces, should return empty string
    text = "     "
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.08μs -> 1.19μs (9.30% slower)
    assert new_text == ""


def test_edge_entity_all_spaces():
    # Entity name is all spaces, should be stripped to empty string
    text = "     "
    entities = [("   ", (1, 4), [1])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.15μs -> 2.09μs (3.26% faster)


def test_edge_entity_index_out_of_bounds():
    # Entity indices are out of bounds, function should not fail, just adjust as per logic
    text = "  Hello  "
    entities = [(" Hello ", (0, 8), [1])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.99μs -> 2.08μs (4.61% slower)


def test_edge_entity_empty_name():
    # Entity name is empty string
    text = "abc"
    entities = [("", (1, 1), [42])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.88μs -> 1.91μs (1.67% slower)


def test_edge_entity_indices_overlap():
    # Overlapping entity indices, should be handled independently
    text = "  abcd  "
    entities = [(" ab", (2, 5), [1]), ("bcd ", (3, 7), [2])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.63μs -> 2.72μs (3.49% slower)


def test_edge_entity_zero_length():
    # Entity with zero-length span
    text = "  abc  "
    entities = [(" ", (2, 2), [1])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.98μs -> 2.11μs (6.25% slower)


def test_edge_entity_negative_indices():
    # Entity with negative indices
    text = "  abc  "
    entities = [("abc", (-1, 2), [1])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.00μs -> 2.10μs (4.58% slower)


# 3. Large Scale Test Cases


def test_large_many_entities():
    # Many entities in a long text
    text = " " * 10 + " ".join(f"word{i}" for i in range(500)) + " " * 10
    # Synthetic positions: 10 + i*6 assumes every word is 5 chars ("word0".."word9"),
    # so the offsets drift for i >= 10; _cleanup_spaces only shifts indices, so that is fine here
    entities = []
    for i in range(500):
        word = f"word{i}"
        start = 10 + i * 6
        end = start + len(word)
        entities.append((word, (start, end), [i]))
    new_text, new_entities = _cleanup_spaces(text, entities)  # 120μs -> 109μs (9.90% faster)
    # Check entity names are stripped and bboxes preserved
    for i, (entity_name, (start, end), bboxes) in enumerate(new_entities):
        assert entity_name == f"word{i}"
        assert bboxes == [i]


def test_large_long_text_single_entity():
    # Very long text with one entity spanning the whole text
    text = " " * 5 + "x" * 900 + " " * 5
    entities = [(" " * 3 + "x" * 900 + " " * 2, (3, 903 + 3), [0])]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.59μs -> 2.89μs (10.4% slower)


def test_large_entities_with_spaces():
    # Entities with spaces in names in a large text
    text = " " * 7 + "foo bar baz qux" + " " * 7
    entities = [
        (" foo ", (7, 12), [1]),
        ("bar ", (12, 16), [2]),
        (" baz", (16, 20), [3]),
        (" qux ", (20, 25), [4]),
    ]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 3.62μs -> 3.65μs (0.686% slower)


def test_large_all_spaces_entities():
    # All entities are just spaces, large number
    text = " " * 100
    entities = [(" " * 5, (i, i + 5), [i]) for i in range(0, 100, 5)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 7.31μs -> 6.96μs (5.01% faster)
    for j, (entity_name, (start, end), bboxes) in enumerate(new_entities):
        assert entity_name == ""  # all-space names strip to empty
        assert bboxes == [5 * j]


def test_large_empty_entity_list():
    # Large text, no entities
    text = " " * 50 + "a" * 900 + " " * 50
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.52μs -> 1.71μs (11.2% slower)
    assert new_text == "a" * 900
    assert new_entities == []


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from transformers.models.kosmos2.processing_kosmos2 import _cleanup_spaces


# unit tests

# 1. Basic Test Cases


def test_basic_no_spaces():
    # Text and entity have no extra spaces
    text = "Hello world"
    entities = [("Hello", (0, 5), None), ("world", (6, 11), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.66μs -> 2.50μs (6.16% faster)


def test_basic_leading_trailing_spaces():
    # Text has spaces, entities are correct
    text = "   Hello world   "
    entities = [("Hello", (3, 8), None), ("world", (9, 14), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.45μs -> 2.51μs (2.19% slower)


def test_basic_entity_within_spaces():
    # Entity itself has spaces
    text = "   Hello world   "
    entities = [(" Hello ", (3, 9), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.05μs -> 2.06μs (0.097% slower)


def test_basic_multiple_entities_mixed_spaces():
    # Some entities have spaces, some don't
    text = "  foo bar baz  "
    entities = [
        (" foo", (2, 6), None),
        ("bar ", (7, 11), None),
        (" baz ", (11, 16), None),
    ]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.94μs -> 3.05μs (3.48% slower)


# 2. Edge Test Cases


def test_empty_text_and_entities():
    # Both text and entities are empty
    text = ""
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.01μs -> 1.19μs (14.4% slower)
    assert new_text == ""
    assert new_entities == []


def test_text_all_spaces():
    # Text is all spaces, entities empty
    text = "     "
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.03μs -> 1.30μs (20.8% slower)
    assert new_text == ""


def test_entity_all_spaces():
    # Entity name is all spaces
    text = "   foo   "
    entities = [("   ", (3, 6), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.12μs -> 2.12μs (0.283% slower)


def test_entity_start_end_overlap_text_spaces():
    # Entity starts/ends at text's leading/trailing spaces
    text = "  abc def  "
    entities = [("  abc", (0, 5), None), ("def  ", (6, 11), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.68μs -> 2.77μs (3.39% slower)


def test_entity_within_only_spaces():
    # Entity is only the spaces in text
    text = "   "
    entities = [(" ", (1, 2), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.95μs -> 2.00μs (2.25% slower)


def test_entity_with_negative_indices():
    # Entity indices are negative (invalid but should be handled)
    text = "   foo   "
    entities = [("foo", (-1, 2), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.91μs -> 2.10μs (8.82% slower)


def test_entity_indices_out_of_bounds():
    # Entity indices out of bounds of text
    text = "  bar  "
    entities = [("bar", (10, 20), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.91μs -> 1.99μs (3.87% slower)


def test_entity_with_empty_name():
    # Entity name is empty string
    text = "   foo   "
    entities = [("", (3, 3), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.04μs -> 2.07μs (1.45% slower)


def test_entity_with_only_trailing_spaces():
    # Entity has only trailing spaces
    text = "  test  "
    entities = [("test  ", (2, 8), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.02μs -> 2.12μs (4.62% slower)


def test_entity_with_only_leading_spaces():
    # Entity has only leading spaces
    text = "  test  "
    entities = [("  test", (0, 6), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.96μs -> 1.99μs (1.46% slower)


def test_entity_with_both_leading_and_trailing_spaces():
    # Entity has both leading and trailing spaces
    text = "  test  "
    entities = [("  test  ", (0, 8), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.94μs -> 2.05μs (5.37% slower)


def test_entity_with_empty_bbox():
    # Entity with empty bbox
    text = "  foo  "
    entities = [("foo", (2, 5), ())]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.93μs -> 1.84μs (5.00% faster)


def test_entity_with_none_bbox():
    # Entity with None bbox
    text = "  foo  "
    entities = [("foo", (2, 5), None)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.88μs -> 1.97μs (4.62% slower)


def test_entity_with_complex_bbox():
    # Entity with a complex bbox (tuple of ints)
    text = "  foo  "
    entities = [("foo", (2, 5), (1, 2, 3, 4))]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 1.91μs -> 1.86μs (2.57% faster)


# 3. Large Scale Test Cases


def test_large_text_and_entities():
    # Large text with many entities, all with leading/trailing spaces
    text = "  " + " ".join(f"word{i}" for i in range(500)) + "  "
    # Each entity is " wordX ", with spaces, and indices are calculated accordingly
    entities = []
    idx = 2
    for i in range(500):
        word = f"word{i}"
        entity_name = f" {word} "
        start = idx
        end = idx + len(entity_name)
        entities.append((entity_name, (start, end), (i, i + 1)))
        idx += len(entity_name) + 1  # +1 for the space between words

    new_text, new_entities = _cleanup_spaces(text, entities)  # 144μs -> 132μs (9.17% faster)
    # Check entity names are stripped and bboxes preserved
    for i, (entity_name, (start, end), bbox) in enumerate(new_entities):
        assert entity_name == f"word{i}"
        assert bbox == (i, i + 1)


def test_large_entities_with_varied_spaces():
    # Large number of entities, some with leading/trailing spaces, some not
    text = "  " + " ".join(f"word{i}" for i in range(100)) + "  "
    entities = []
    idx = 2
    for i in range(100):
        word = f"word{i}"
        # Alternate entities with spaces
        if i % 2 == 0:
            entity_name = f" {word} "
            start = idx
            end = idx + len(entity_name)
        else:
            entity_name = word
            start = idx
            end = idx + len(entity_name)
        entities.append((entity_name, (start, end), None))
        idx += len(entity_name) + 1  # +1 for space between words

    new_text, new_entities = _cleanup_spaces(text, entities)  # 26.9μs -> 24.3μs (10.7% faster)
    for i, (entity_name, (start, end), bbox) in enumerate(new_entities):
        assert entity_name == f"word{i}"
        assert bbox is None


def test_large_text_with_only_spaces():
    # Large text of only spaces
    text = " " * 999
    entities = []
    new_text, new_entities = _cleanup_spaces(text, entities)  # 2.32μs -> 2.73μs (15.1% slower)
    assert new_text == ""
    assert new_entities == []


def test_large_number_of_empty_entities():
    # Large number of entities, all empty names
    text = "   foo   "
    entities = [("", (3, 3), None) for _ in range(500)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 102μs -> 87.0μs (18.1% faster)
    for entity_name, (start, end), bbox in new_entities:
        assert entity_name == ""
        assert bbox is None


def test_large_entities_with_all_spaces_names():
    # Entities with names that are all spaces
    text = "   foo   "
    entities = [("   ", (3, 6), None) for _ in range(200)]
    new_text, new_entities = _cleanup_spaces(text, entities)  # 43.1μs -> 38.8μs (11.2% faster)
    for entity_name, (start, end), bbox in new_entities:
        assert entity_name == ""  # all-space names strip to empty
        assert bbox is None


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_cleanup_spaces-mi9m0g7b` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 22, 2025 01:27
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 22, 2025
