⚡️ Speed up method `BasicTokenizer._run_split_on_punc` by 97% (#367)

+24 −23
📄 97% (0.97x) speedup for `BasicTokenizer._run_split_on_punc` in `src/transformers/models/deprecated/realm/tokenization_realm.py`

⏱️ Runtime: 5.20 milliseconds → 2.65 milliseconds (best of 212 runs)

📝 Explanation and details
The optimization achieves a 96% speedup through two key improvements:

1. **ASCII Punctuation Fast Path (`_is_punctuation`).** The original code used four range comparisons (`cp >= 33 and cp <= 47`, ...) for every character, plus an expensive `ord(char)` call. The optimized version precomputes all ASCII punctuation characters into a set, `_ASCII_PUNCTUATION_SET`, enabling O(1) lookups that bypass both the `ord()` call and the range comparisons for common ASCII text.

2. **Streamlined Text Processing (`_run_split_on_punc`).** The original implementation used complex indexing with `while i < len(chars)` and maintained separate state variables (`start_new_word`). The optimized version uses a simple `for char in text` loop with a straightforward `current` buffer, eliminating manual indexing, reducing list operations, and avoiding the expensive `["".join(x) for x in output]` comprehension at the end.

**Performance Impact Analysis:**
- The `never_split` parameter shows a minor regression (15-25% slower) due to additional conditional checks.

**Hot Path Optimization:** The `_is_punctuation` function is called for every character during text processing, making it extremely performance-critical. The ASCII fast path optimization directly targets this bottleneck, while the streamlined tokenization loop reduces overhead in the character-iteration process that calls `_is_punctuation` repeatedly.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer._run_split_on_punc-mi9x55ms` and push.
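As a closing illustration (not part of the PR), here is a rough micro-benchmark of the two `_is_punctuation` strategies the explanation compares; the function names, sample text, and repetition counts are all invented, and absolute timings will vary by machine:

```python
import timeit
import unicodedata

# Precomputed ASCII punctuation set (name from the PR; ranges 33-47, 58-64,
# 91-96, 123-126 match the original ord() comparisons).
_ASCII_PUNCTUATION_SET = frozenset(
    chr(cp)
    for cp in (*range(33, 48), *range(58, 65), *range(91, 97), *range(123, 127))
)

def is_punct_ranges(char):
    # Original strategy: ord() call plus four range comparisons.
    cp = ord(char)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(char).startswith("P")

def is_punct_set(char):
    # Optimized strategy: single O(1) set lookup for the ASCII fast path.
    if char in _ASCII_PUNCTUATION_SET:
        return True
    return unicodedata.category(char).startswith("P")

# Mostly-ASCII sample text, mirroring the "common ASCII text" case the PR targets.
text = "Hello, world! This is a test-string; with (punctuation) everywhere..." * 100

t_ranges = timeit.timeit(lambda: [is_punct_ranges(c) for c in text], number=50)
t_set = timeit.timeit(lambda: [is_punct_set(c) for c in text], number=50)
print(f"range checks: {t_ranges:.4f}s  set lookup: {t_set:.4f}s")
```

Both strategies classify every character identically; the difference is purely in per-character overhead on the hot path.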