⚡️ Speed up method `BasicTokenizer._run_split_on_punc` by 97% (#367)

+24 −23
📄 97% (0.97x) speedup for `BasicTokenizer._run_split_on_punc` in `src/transformers/models/deprecated/realm/tokenization_realm.py`

⏱️ Runtime: 5.20 milliseconds → 2.65 milliseconds (best of 212 runs)

📝 Explanation and details
The optimization achieves a 96% speedup through two key improvements:

1. **ASCII Punctuation Fast Path (`_is_punctuation`).** The original code used four range comparisons (`cp >= 33 and cp <= 47`, ...) for every character, plus an expensive `ord(char)` call. The optimized version precomputes all ASCII punctuation characters into a set, `_ASCII_PUNCTUATION_SET`, enabling O(1) lookups that bypass both the `ord()` call and the range comparisons for common ASCII text.

2. **Streamlined Text Processing (`_run_split_on_punc`).** The original implementation used complex indexing with `while i < len(chars)` and maintained separate state variables (`start_new_word`). The optimized version uses a simple `for char in text` loop with a straightforward `current` buffer, eliminating manual indexing, reducing list operations, and avoiding the expensive `["".join(x) for x in output]` comprehension at the end.

**Performance Impact Analysis:**
- The `never_split` parameter shows a minor regression (15-25% slower) due to additional conditional checks.

**Hot Path Optimization:** The `_is_punctuation` function is called for every character during text processing, making it extremely performance-critical. The ASCII fast path optimization directly targets this bottleneck, while the streamlined tokenization loop reduces overhead in the character-iteration process that calls `_is_punctuation` repeatedly.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer._run_split_on_punc-mi9x55ms` and push.
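As a closing illustration (not part of the PR), here is a rough micro-benchmark of the two `_is_punctuation` strategies the explanation compares; the function names, sample text, and repetition counts are all invented, and absolute timings will vary by machine:

```python
import timeit
import unicodedata

# Precomputed ASCII punctuation set (name from the PR; ranges 33-47, 58-64,
# 91-96, 123-126 match the original ord() comparisons).
_ASCII_PUNCTUATION_SET = frozenset(
    chr(cp)
    for cp in (*range(33, 48), *range(58, 65), *range(91, 97), *range(123, 127))
)

def is_punct_ranges(char):
    # Original strategy: ord() call plus four range comparisons.
    cp = ord(char)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(char).startswith("P")

def is_punct_set(char):
    # Optimized strategy: single O(1) set lookup for the ASCII fast path.
    if char in _ASCII_PUNCTUATION_SET:
        return True
    return unicodedata.category(char).startswith("P")

# Mostly-ASCII sample text, mirroring the "common ASCII text" case the PR targets.
text = "Hello, world! This is a test-string; with (punctuation) everywhere..." * 100

t_ranges = timeit.timeit(lambda: [is_punct_ranges(c) for c in text], number=50)
t_set = timeit.timeit(lambda: [is_punct_set(c) for c in text], number=50)
print(f"range checks: {t_ranges:.4f}s  set lookup: {t_set:.4f}s")
```

Both strategies classify every character identically; the difference is purely in per-character overhead on the hot path.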