codeflash-ai bot commented on Nov 22, 2025

📄 26% (0.26x) speedup for BasicTokenizer._run_strip_accents in src/transformers/models/deprecated/realm/tokenization_realm.py

⏱️ Runtime : 1.19 milliseconds → 949 microseconds (best of 250 runs)

📝 Explanation and details

The optimization replaces an imperative loop-based approach with a functional generator expression, achieving a 25% speedup through several key improvements:

What was optimized:

  1. Eliminated explicit list construction: Replaced output = [] and output.append(char) with a generator expression passed directly to str.join()
  2. Reduced function call overhead: Cached unicodedata.category as a local variable to avoid repeated attribute lookups
  3. Streamlined control flow: Replaced explicit loop with continue statements with a filtering generator expression
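
As a rough sketch of the change (paraphrased from the description above, not a verbatim copy of the PR diff — the real method lives on `BasicTokenizer._run_strip_accents`):

```python
import unicodedata

# Before: imperative loop that appends kept characters to a list
def strip_accents_loop(text):
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
        if cat == "Mn":  # "Mn" = nonspacing combining mark, i.e. an accent
            continue
        output.append(char)
    return "".join(output)

# After: cache the category lookup in a local and feed a filtering
# generator expression straight into str.join()
def strip_accents_genexpr(text):
    text = unicodedata.normalize("NFD", text)
    category = unicodedata.category  # avoid repeated attribute resolution
    return "".join(char for char in text if category(char) != "Mn")
```

Both variants produce the same output (for example, either one maps "naïve café" to "naive cafe"); the saving comes purely from how the characters are iterated and collected.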

Why this is faster:

  • Memory efficiency: Generator expressions are more memory-efficient than building intermediate lists, especially for large texts
  • Reduced Python bytecode: The generator expression compiles to fewer Python operations than the explicit loop
  • Function call optimization: Caching unicodedata.category eliminates repeated attribute resolution; the line profiler records 13,019 calls to unicodedata.category in the original, each resolving the attribute anew, while the optimized version binds it to a local once per invocation and reuses it
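
To see the bytecode difference for yourself, the standard-library `dis` module makes the comparison concrete (the two function names below are only for illustration):

```python
import dis
import unicodedata

def loop_version(text):
    output = []
    for char in text:
        if unicodedata.category(char) == "Mn":  # attribute resolved on every iteration
            continue
        output.append(char)
    return "".join(output)

def genexpr_version(text):
    category = unicodedata.category  # resolved once per call
    return "".join(char for char in text if category(char) != "Mn")

dis.dis(loop_version)     # shows the unicodedata.category lookup inside the loop body
dis.dis(genexpr_version)  # shows a single lookup, reused by the generator expression
```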

Performance characteristics:
The optimization shows diminishing returns for very short strings (some small test cases are 3-40% slower due to generator setup overhead) but provides significant gains for larger inputs:

  • Large accented text: 32-35% faster
  • Mixed content: 26-30% faster
  • The break-even point appears around 20-50 characters
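
A quick way to reproduce the short-versus-long behaviour locally (a minimal sketch; absolute numbers will vary by machine and Python version):

```python
import timeit

from transformers.models.deprecated.realm.tokenization_realm import BasicTokenizer

tokenizer = BasicTokenizer()
samples = {
    "short (4 chars)": "café",
    "long (~3,600 chars)": "naïve café résumé " * 200,
}
for label, text in samples.items():
    per_call = timeit.timeit(lambda: tokenizer._run_strip_accents(text), number=1_000) / 1_000
    print(f"{label}: {per_call * 1e6:.1f} µs per call")
```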

This optimization is particularly valuable for text preprocessing pipelines in NLP models where _run_strip_accents processes batches of documents or long text sequences, making the consistent 25%+ improvement on realistic workloads highly beneficial.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 104 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import unicodedata

# imports
import pytest  # used for our unit tests

from transformers.models.deprecated.realm.tokenization_realm import BasicTokenizer


# unit tests

# ---- Basic Test Cases ----


def test_strip_accents_basic_ascii():
    # ASCII only, should remain unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("hello world")  # 3.06μs -> 3.17μs (3.60% slower)


def test_strip_accents_basic_latin_accented():
    # Basic accented Latin characters
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("café")  # 3.06μs -> 3.35μs (8.46% slower)
    codeflash_output = tokenizer._run_strip_accents("naïve")  # 1.74μs -> 1.87μs (6.65% slower)
    codeflash_output = tokenizer._run_strip_accents("résumé")  # 1.62μs -> 1.59μs (2.01% faster)
    codeflash_output = tokenizer._run_strip_accents("fiancée")  # 1.45μs -> 1.46μs (0.615% slower)


def test_strip_accents_mixed_accented_and_unaccented():
    # Mixed accented and unaccented characters
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("año 2022")  # 3.24μs -> 3.15μs (2.99% faster)
    codeflash_output = tokenizer._run_strip_accents("über cool")  # 1.91μs -> 1.98μs (3.58% slower)


def test_strip_accents_uppercase_accented():
    # Uppercase accented characters
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("ÉLÉPHANT")  # 3.38μs -> 3.51μs (3.62% slower)
    codeflash_output = tokenizer._run_strip_accents("À LA CARTE")  # 2.03μs -> 2.07μs (1.98% slower)


def test_strip_accents_multiple_accents():
    # Multiple accents on a single character
    tokenizer = BasicTokenizer()
    # 'a' + combining acute (U+0301) + combining ring above (U+030A), NFC-normalized
    input_text = unicodedata.normalize("NFC", "a\u0301\u030a")
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 1.95μs -> 2.37μs (17.9% slower)


# ---- Edge Test Cases ----


def test_strip_accents_empty_string():
    # Empty string should return empty string
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("")  # 894ns -> 1.63μs (45.3% slower)


def test_strip_accents_only_accents():
    # String of only combining marks should return empty string
    tokenizer = BasicTokenizer()
    # U+0301 is combining acute accent
    codeflash_output = tokenizer._run_strip_accents("\u0301\u0302\u0303")  # 1.94μs -> 2.43μs (20.4% slower)


def test_strip_accents_non_latin_scripts():
    # Non-Latin scripts should not be affected unless they have combining marks
    tokenizer = BasicTokenizer()
    # Chinese, Japanese, Arabic, Cyrillic
    codeflash_output = tokenizer._run_strip_accents("你好")  # 2.15μs -> 2.57μs (16.2% slower)
    codeflash_output = tokenizer._run_strip_accents("こんにちは")  # 1.51μs -> 1.52μs (0.592% slower)
    codeflash_output = tokenizer._run_strip_accents("مرحبا")  # 1.03μs -> 1.06μs (2.92% slower)
    codeflash_output = tokenizer._run_strip_accents("Привет")  # 1.16μs -> 1.15μs (1.13% faster)


def test_strip_accents_combining_marks_on_non_latin():
    # Combining marks on non-Latin scripts should be stripped
    tokenizer = BasicTokenizer()
    # Arabic letter with combining mark (e.g., shadda U+0651)
    text = "مُحَمَّد"  # with Arabic diacritics
    expected = "محمد"
    codeflash_output = tokenizer._run_strip_accents(text)  # 2.68μs -> 2.95μs (9.16% slower)


def test_strip_accents_surrogate_pairs_and_emojis():
    # Emojis and surrogate pairs should remain unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("😀😃😄")  # 2.26μs -> 2.69μs (15.9% slower)
    codeflash_output = tokenizer._run_strip_accents("hello 👋")  # 1.87μs -> 1.87μs (0.000% faster)


def test_strip_accents_precomposed_and_decomposed():
    # Compare precomposed and decomposed forms
    tokenizer = BasicTokenizer()
    # 'é' precomposed vs 'e' + combining acute
    precomposed = "é"
    decomposed = "e\u0301"
    codeflash_output = tokenizer._run_strip_accents(precomposed)  # 2.35μs -> 2.71μs (13.4% slower)
    codeflash_output = tokenizer._run_strip_accents(decomposed)  # 775ns -> 923ns (16.0% slower)


def test_strip_accents_mixed_script_with_accents():
    # Mixed scripts and accents
    tokenizer = BasicTokenizer()
    # Cyrillic with combining acute accent
    text = "дос\u0301видания"  # 'о' with acute
    expected = "досвидания"
    codeflash_output = tokenizer._run_strip_accents(text)  # 3.47μs -> 3.42μs (1.29% faster)


def test_strip_accents_control_characters():
    # Control characters should remain
    tokenizer = BasicTokenizer()
    input_text = "a\u0301\nb\u0302\tc"
    expected = "a\nb\tc"
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 2.60μs -> 2.69μs (3.45% slower)


def test_strip_accents_with_punctuation():
    # Punctuation should remain
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("café!")  # 3.07μs -> 3.26μs (5.82% slower)
    codeflash_output = tokenizer._run_strip_accents("naïve?")  # 1.86μs -> 1.85μs (0.541% faster)


def test_strip_accents_combining_marks_only_on_some_chars():
    # Only some characters have combining marks
    tokenizer = BasicTokenizer()
    input_text = "a\u0301b\u0302c"
    expected = "abc"
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 2.16μs -> 2.27μs (4.92% slower)


def test_strip_accents_non_string_input_raises():
    # Should raise TypeError if input is not string
    tokenizer = BasicTokenizer()
    with pytest.raises(TypeError):
        tokenizer._run_strip_accents(123)  # 1.73μs -> 1.82μs (5.15% slower)
    with pytest.raises(TypeError):
        tokenizer._run_strip_accents(None)  # 994ns -> 1.01μs (1.49% slower)
    with pytest.raises(TypeError):
        tokenizer._run_strip_accents(["café"])  # 688ns -> 685ns (0.438% faster)


# ---- Large Scale Test Cases ----


def test_strip_accents_long_text():
    # Large input with repeated accented characters
    tokenizer = BasicTokenizer()
    input_text = "é" * 1000
    expected = "e" * 1000
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 157μs -> 117μs (33.9% faster)


def test_strip_accents_long_mixed_text():
    # Large input with mixed accented and unaccented characters
    tokenizer = BasicTokenizer()
    input_text = ("café " * 200).strip()
    expected = ("cafe " * 200).strip()
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 101μs -> 78.2μs (29.9% faster)


def test_strip_accents_large_unicode_range():
    # Large input covering a wide range of Unicode, including combining marks
    tokenizer = BasicTokenizer()
    # Build a string with every Latin letter + combining acute accent
    input_text = "".join(chr(cp) + "\u0301" for cp in range(0x41, 0x5A + 1))  # A-Z with acute
    expected = "".join(chr(cp) for cp in range(0x41, 0x5A + 1))
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 6.12μs -> 5.22μs (17.1% faster)


def test_strip_accents_large_random_text():
    # Large input with random mixture of accented, non-accented, and non-Latin
    tokenizer = BasicTokenizer()
    # Build a string: 333 Latin accented, 333 Cyrillic, 333 emoji
    latin_accented = "é" * 333
    cyrillic = "Ж" * 333
    emoji = "😀" * 333
    input_text = latin_accented + cyrillic + emoji
    expected = "e" * 333 + cyrillic + emoji
    codeflash_output = tokenizer._run_strip_accents(input_text)  # 138μs -> 112μs (23.3% faster)


def test_strip_accents_performance_large_input():
    # Performance: Ensure function completes quickly for large input
    import time

    tokenizer = BasicTokenizer()
    input_text = ("naïve café résumé " * 50).strip()
    start = time.time()
    codeflash_output = tokenizer._run_strip_accents(input_text)
    result = codeflash_output  # 97.7μs -> 74.0μs (32.1% faster)
    end = time.time()
    # Result should be correct
    expected = ("naive cafe resume " * 50).strip()


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
from transformers.models.deprecated.realm.tokenization_realm import BasicTokenizer


# unit tests

# ---------- BASIC TEST CASES ----------


def test_strip_accents_basic_ascii():
    # No accents, should return unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("hello world")  # 2.74μs -> 2.88μs (4.76% slower)


def test_strip_accents_basic_accented():
    # Common accented letters
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("café")  # 2.67μs -> 2.90μs (7.86% slower)
    codeflash_output = tokenizer._run_strip_accents("naïve")  # 1.69μs -> 1.77μs (4.90% slower)
    codeflash_output = tokenizer._run_strip_accents("résumé")  # 1.52μs -> 1.49μs (2.29% faster)


def test_strip_accents_mixed():
    # Mixture of accented and unaccented
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("fiancée and fiancé")  # 4.18μs -> 4.14μs (1.14% faster)


def test_strip_accents_uppercase_accented():
    # Accented uppercase letters
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("ÉCOLE")  # 2.91μs -> 3.02μs (3.64% slower)
    codeflash_output = tokenizer._run_strip_accents("ÀÉÎÖÜ")  # 2.12μs -> 1.94μs (9.13% faster)


def test_strip_accents_non_latin():
    # Non-latin scripts without accents should be unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("東京")  # 2.08μs -> 2.47μs (15.7% slower)
    codeflash_output = tokenizer._run_strip_accents("Москва")  # 1.56μs -> 1.59μs (1.94% slower)


# ---------- EDGE TEST CASES ----------


def test_strip_accents_empty_string():
    # Empty string should return empty string
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("")  # 866ns -> 1.45μs (40.3% slower)


def test_strip_accents_only_accents():
    # String containing only combining marks should return empty
    tokenizer = BasicTokenizer()
    # U+0301 is combining acute accent
    codeflash_output = tokenizer._run_strip_accents("\u0301\u0302\u0303")  # 1.84μs -> 2.19μs (16.0% slower)


def test_strip_accents_combining_marks_attached():
    # Combining marks attached to base characters
    tokenizer = BasicTokenizer()
    # 'a' + combining acute accent
    codeflash_output = tokenizer._run_strip_accents("a\u0301")  # 1.72μs -> 2.21μs (22.2% slower)
    # 'e' + combining tilde
    codeflash_output = tokenizer._run_strip_accents("e\u0303")  # 711ns -> 907ns (21.6% slower)


def test_strip_accents_unicode_normalization():
    # Characters that can be decomposed in multiple ways
    tokenizer = BasicTokenizer()
    # U+00E9 (é) vs 'e' + U+0301 (combining acute)
    codeflash_output = tokenizer._run_strip_accents("\u00e9")  # 2.26μs -> 2.67μs (15.5% slower)
    codeflash_output = tokenizer._run_strip_accents("e\u0301")  # 703ns -> 983ns (28.5% slower)


def test_strip_accents_non_printable():
    # Non-printable characters should remain
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("\n\t")  # 1.67μs -> 2.01μs (17.0% slower)


def test_strip_accents_symbols_and_punctuation():
    # Symbols and punctuation should remain unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("!@#$%^&*()_+-=[]{};':\",.<>/?")  # 4.95μs -> 4.77μs (3.77% faster)


def test_strip_accents_emojis():
    # Emojis should remain unchanged
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer._run_strip_accents("😀😃😄")  # 2.51μs -> 2.75μs (8.77% slower)


def test_strip_accents_surrogate_pairs():
    # Surrogate pairs (rare, but possible in some environments)
    tokenizer = BasicTokenizer()
    # U+1F600 (grinning face)
    codeflash_output = tokenizer._run_strip_accents("\U0001f600")  # 1.42μs -> 1.91μs (25.4% slower)


def test_strip_accents_accented_non_latin():
    # Accented characters in non-latin scripts
    tokenizer = BasicTokenizer()
    # Greek: ά (alpha with tonos) -> α
    codeflash_output = tokenizer._run_strip_accents("ά")  # 2.50μs -> 2.93μs (14.8% slower)
    # Cyrillic: ё (yo) -> е
    codeflash_output = tokenizer._run_strip_accents("ё")  # 909ns -> 1.08μs (15.5% slower)


def test_strip_accents_multiple_combining_marks():
    # Characters with multiple combining marks
    tokenizer = BasicTokenizer()
    # 'a' + combining acute + combining tilde
    codeflash_output = tokenizer._run_strip_accents("a\u0301\u0303")  # 1.99μs -> 2.35μs (15.3% slower)


def test_strip_accents_precomposed_and_decomposed_equivalence():
    # Precomposed and decomposed forms should produce the same output
    tokenizer = BasicTokenizer()
    # 'á' (precomposed) and 'a' + combining acute
    codeflash_output = tokenizer._run_strip_accents("á")  # 2.28μs -> 2.57μs (11.5% slower)


# ---------- LARGE SCALE TEST CASES ----------


def test_strip_accents_large_text():
    # Large text with many accents
    tokenizer = BasicTokenizer()
    base = "áéíóúüñç" * 100  # 900 characters, all accented
    expected = "aeiouunc" * 100
    codeflash_output = tokenizer._run_strip_accents(base)  # 126μs -> 95.2μs (32.4% faster)


def test_strip_accents_large_mixed_text():
    # Large text with mixed accented and unaccented
    tokenizer = BasicTokenizer()
    base = ("hello café résumé naïve fiancée " * 50).strip()
    expected = ("hello cafe resume naive fiancee " * 50).strip()
    codeflash_output = tokenizer._run_strip_accents(base)  # 156μs -> 120μs (29.8% faster)


def test_strip_accents_large_random_unicode():
    # Large text with random unicode, including combining marks
    tokenizer = BasicTokenizer()
    # Compose a string with 500 'a' + combining acute, 500 'b'
    text = ("a\u0301" * 500) + ("b" * 500)
    expected = ("a" * 500) + ("b" * 500)
    codeflash_output = tokenizer._run_strip_accents(text)  # 112μs -> 82.9μs (35.5% faster)


def test_strip_accents_large_non_accented():
    # Large text with no accents should be unchanged
    tokenizer = BasicTokenizer()
    text = "The quick brown fox jumps over the lazy dog. " * 20  # 900 chars
    codeflash_output = tokenizer._run_strip_accents(text)  # 72.9μs -> 57.5μs (26.8% faster)


def test_strip_accents_large_emojis_and_symbols():
    # Large text with emojis and symbols should be unchanged
    tokenizer = BasicTokenizer()
    text = "😀😃😄!@#$%^&*()_+" * 80
    codeflash_output = tokenizer._run_strip_accents(text)  # 114μs -> 88.7μs (29.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer._run_strip_accents-mi9wp2wr` and push.


codeflash-ai bot requested a review from mashraf-222 on Nov 22, 2025 at 06:26
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Nov 22, 2025