⚡️ Speed up method BasicTokenizer.tokenize by 14%
#365
+56
−42
📄 14% (0.14x) speedup for `BasicTokenizer.tokenize` in `src/transformers/models/deprecated/realm/tokenization_realm.py`
⏱️ Runtime: 46.6 milliseconds → 40.8 milliseconds (best of 100 runs)
📝 Explanation and details
The optimized code achieves a 14% speedup through several key micro-optimizations that reduce Python's attribute lookup overhead and eliminate redundant operations:
Key Optimizations
1. Local Variable Caching for Attribute Lookups
The most impactful optimization caches frequently accessed instance attributes and methods as local variables in the hot `tokenize()` method, as in the sketch below. This eliminates repeated `self.<attribute>` lookups inside the main tokenization loop, which processes thousands of tokens on large inputs.
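A minimal sketch of the pattern, with the class scaffolding simplified (the class name, the placeholder `_run_split_on_punc` body, and the bare whitespace split are illustrative stand-ins, not the verbatim diff):

```python
class BasicTokenizerSketch:
    """Illustrative stand-in for BasicTokenizer; names mirror the real class,
    but the helper bodies are simplified placeholders."""

    def __init__(self, do_lower_case=True, never_split=None):
        self.do_lower_case = do_lower_case
        self.never_split = set(never_split or [])

    def _run_split_on_punc(self, token, never_split):
        return [token]  # placeholder for the real punctuation splitter

    def tokenize(self, text, never_split=None):
        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
        # Hoist hot attribute/method lookups into locals once; inside the
        # per-token loop a local read is cheaper than self.<attr> each time.
        do_lower_case = self.do_lower_case
        run_split_on_punc = self._run_split_on_punc

        split_tokens = []
        for token in text.split():
            if do_lower_case and token not in never_split:
                token = token.lower()
            split_tokens.extend(run_split_on_punc(token, never_split))
        return split_tokens
```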
2. Optimized `never_split` Logic
Changed from always creating a new set union to conditional logic, sketched below.
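A sketch of the idea as a standalone helper (the name `resolve_never_split` is made up for illustration; in the diff this is an inline expression at the top of `tokenize()`):

```python
def resolve_never_split(base, extra=None):
    # Before (roughly): always allocate a new set, even when no extra
    # tokens are passed:
    #     return base.union(set(extra or []))
    # After: only pay for the union when extras are actually supplied.
    return base.union(set(extra)) if extra else base

base = {"[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"}
assert resolve_never_split(base) is base                # common case: no copy
assert "[EOS]" in resolve_never_split(base, ["[EOS]"])  # union only when needed
```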
3. Streamlined `whitespace_tokenize()`
Removed an intermediate variable assignment; see the before/after below.
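The helper is small enough to show in full; this is a sketch of the before/after (the original version binds `text.split()` to a `tokens` variable before returning it):

```python
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    # Before: tokens = text.split(); return tokens
    return text.split()  # return directly, skipping the intermediate name
```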
4. Optimized `_run_split_on_punc()`
Completely rewrote the punctuation splitting algorithm to eliminate the complex list-of-lists approach (see the sketch after this list):
- Avoids the `list(text)` conversion and per-index access overhead
- Caches the `_is_punctuation` function lookup locally
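One way to flatten the algorithm looks like the sketch below; this is illustrative, not the verbatim diff (`is_punct` stands in for a locally cached `_is_punctuation`):

```python
import string

def split_on_punc(text, is_punct):
    """Single pass over the string, accumulating plain character lists
    instead of the original list-of-lists plus final join pass."""
    output, current = [], []
    for char in text:            # iterate the str directly: no list(text) copy
        if is_punct(char):
            if current:
                output.append("".join(current))
                current = []
            output.append(char)  # each punctuation char is its own token
        else:
            current.append(char)
    if current:
        output.append("".join(current))
    return output

print(split_on_punc("hello,world!", set(string.punctuation).__contains__))
# ['hello', ',', 'world', '!']
```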
5. Local Function Caching in Helper Methods
Added local variable caching in `_run_strip_accents()`, `_tokenize_chinese_chars()`, and `_clean_text()`, along the lines of the sketch below.
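The pattern, sketched on `_run_strip_accents()` as a standalone function (the real method also takes `self`):

```python
import unicodedata

def run_strip_accents(text):
    """Strip accents via NFD normalization, with hot lookups cached."""
    text = unicodedata.normalize("NFD", text)
    category = unicodedata.category  # cache the module-attribute lookup
    output = []
    append = output.append           # cache the bound method as well
    for char in text:
        if category(char) != "Mn":   # "Mn" = nonspacing combining mark
            append(char)
    return "".join(output)

print(run_strip_accents("héllo"))  # -> "hello"
```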
Performance Impact
The optimizations show consistent 5–20% speedups across test cases, with the larger gains on larger inputs.
The optimizations are particularly effective for transformer tokenization workloads where this function processes thousands of tokens repeatedly, making the cumulative effect of these micro-optimizations substantial.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer.tokenize-mi9wkaoo` and push.