⚡️ Speed up method BasicTokenizer._run_strip_accents by 26%
#366
+2
−7
📄 26% (0.26x) speedup for `BasicTokenizer._run_strip_accents` in `src/transformers/models/deprecated/realm/tokenization_realm.py`

⏱️ Runtime: 1.19 milliseconds → 949 microseconds (best of 250 runs)

📝 Explanation and details
The optimization replaces an imperative loop-based approach with a functional generator expression, achieving a 25% speedup through several key improvements:
What was optimized:
- Replaced `output = []` and `output.append(char)` with a generator expression passed directly to `str.join()`
- Cached `unicodedata.category` as a local variable to avoid repeated attribute lookups
- Replaced the `continue` statements with a filtering generator expression

Why this is faster:
- Caching `unicodedata.category` eliminates repeated attribute resolution (visible in the line profiler: 13,019 calls to `unicodedata.category` in the original vs. a more efficient access pattern in the optimized version); see the sketch after this list
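For concreteness, here is a minimal before/after sketch reconstructed from the explanation above (not copied from the diff; the `_original`/`_optimized` suffixes are illustrative — in the PR both versions are `BasicTokenizer._run_strip_accents`):

```python
import unicodedata

# Original imperative version: explicit list accumulation with continue.
def _run_strip_accents_original(text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
        if cat == "Mn":  # skip combining marks (accents)
            continue
        output.append(char)
    return "".join(output)

# Optimized version: a local alias for unicodedata.category plus a
# filtering generator expression fed directly to str.join().
def _run_strip_accents_optimized(text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    category = unicodedata.category  # cache attribute lookup once
    return "".join(char for char in text if category(char) != "Mn")
```

Both versions NFD-normalize first so that accented characters decompose into a base character plus a combining mark (Unicode category "Mn"), which is then filtered out.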
Performance characteristics:
The optimization shows diminishing returns for very short strings (some small test cases are 3–40% slower due to generator setup overhead) but provides significant gains for larger inputs.
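To reproduce the short-vs-long behavior, a rough `timeit` harness along the following lines could be used (the sample strings and iteration counts here are made up for illustration):

```python
import timeit
import unicodedata

def strip_accents(text):
    # Optimized variant: NFD-normalize, then drop combining marks ("Mn").
    text = unicodedata.normalize("NFD", text)
    category = unicodedata.category  # local alias avoids repeated attribute lookup
    return "".join(char for char in text if category(char) != "Mn")

# Hypothetical samples: a tiny string vs. a long accented document.
short_text = "café"
long_text = "héllo wörld, çrème brûlée! " * 500

for label, sample in [("short", short_text), ("long", long_text)]:
    elapsed = timeit.timeit(lambda: strip_accents(sample), number=1000)
    print(f"{label}: {elapsed * 1000:.2f} ms for 1000 calls")
```

On short inputs the fixed cost of setting up the generator dominates; on long inputs the saved per-character attribute lookups and list appends dominate.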
This optimization is particularly valuable for text preprocessing pipelines in NLP models, where `_run_strip_accents` processes batches of documents or long text sequences, making the consistent 25%+ improvement on realistic workloads highly beneficial.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
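The generated tests and their runtimes are not reproduced here. As an illustration only, a hand-written regression check in the same spirit might look like this (assuming the deprecated realm module path shown in the diff):

```python
from transformers.models.deprecated.realm.tokenization_realm import BasicTokenizer

def test_strip_accents_matches_expected():
    tokenizer = BasicTokenizer(do_lower_case=False)
    # Accented characters decompose under NFD; combining marks (Mn) are dropped.
    assert tokenizer._run_strip_accents("café") == "cafe"
    assert tokenizer._run_strip_accents("naïve résumé") == "naive resume"
    # ASCII-only input passes through unchanged.
    assert tokenizer._run_strip_accents("hello") == "hello"
    # Empty string is a no-op.
    assert tokenizer._run_strip_accents("") == ""
```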
To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer._run_strip_accents-mi9wp2wr` and push.