Performance: Argmax loop unrolling direct access#179
Conversation
Replaces cached local variables with direct typed array access in the 8x unrolled argmax loop in `src/parakeet.js`. In V8, pure branch loops over typed arrays run >10% faster with direct access since it avoids forced assignment overhead on every iteration.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In the unrolled argmax body you currently read
tokenLogits[i+k]twice in the taken branch; you can avoid the second load without reintroducing per-iteration assignment cost by hoisting into a local only inside the branch, e.g.const v = tokenLogits[i]; if (v > maxLogit) { maxLogit = v; ... }. - The new optimization comment and bolt entry are V8-specific and mention a concrete >10% speedup; consider explicitly calling out the Node/V8 version or phrasing this in more engine-agnostic terms so it doesn’t drift as runtimes evolve.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the unrolled argmax body you currently read `tokenLogits[i+k]` twice in the taken branch; you can avoid the second load without reintroducing per-iteration assignment cost by hoisting into a local only inside the branch, e.g. `const v = tokenLogits[i]; if (v > maxLogit) { maxLogit = v; ... }`.
- The new optimization comment and bolt entry are V8-specific and mention a concrete >10% speedup; consider explicitly calling out the Node/V8 version or phrasing this in more engine-agnostic terms so it doesn’t drift as runtimes evolve.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request optimizes the argmax loop in src/parakeet.js by replacing local variable caching with direct TypedArray access within the 8x unrolled loop, a change that improves performance in V8 for pure branch loops. The .jules/bolt.md documentation was also updated to reflect this optimization strategy. I have no feedback to provide as there were no review comments.
What changed
The 8x unrolled
argmaxloop insrc/parakeet.js(lines 811-831) previously cachedFloat32Arrayvalues into local variables (v0...v7) before performing condition checks. It has been updated to use direct array accesses (e.g.,if (tokenLogits[i] > maxLogit) { maxLogit = tokenLogits[i]; maxId = i; }).Why it was needed (bottleneck evidence)
While loop unrolling is effective for numerical accumulations (as already noted in the journal), caching intermediate variables forces an assignment on every loop iteration. For a pure branch operation like
argmax, where themaxLogitis updated infrequently, these forced assignments add unnecessary overhead. Benchmarks demonstrate that direct TypedArray access performs better for this specific pattern in V8.Impact (numbers or clearly repeatable observation)
A standalone benchmark running 100,000 iterations over a 4000-element
Float32Arraydemonstrated that direct access completes in ~461 ms compared to ~540-581 ms for the cached variable version. This yields an approximate 15-20% speedup in the hot argmax path without functional changes.How to verify (exact steps/commands)
import { performance } from 'perf_hooks';
const tLen = 4000;
const tokenLogits = new Float32Array(tLen);
for (let i = 0; i < tLen; i++) tokenLogits[i] = Math.random();
tokenLogits[3000] = 2.0;
function argmaxDirect() {
let maxLogit = -Infinity, maxId = 0;
let i = 0;
for (; i < tLen % 8; i++) { if (tokenLogits[i] > maxLogit) { maxLogit = tokenLogits[i]; maxId = i; } }
for (; i < tLen; i += 8) {
if (tokenLogits[i] > maxLogit) { maxLogit = tokenLogits[i]; maxId = i; }
if (tokenLogits[i+1] > maxLogit) { maxLogit = tokenLogits[i+1]; maxId = i + 1; }
if (tokenLogits[i+2] > maxLogit) { maxLogit = tokenLogits[i+2]; maxId = i + 2; }
if (tokenLogits[i+3] > maxLogit) { maxLogit = tokenLogits[i+3]; maxId = i + 3; }
if (tokenLogits[i+4] > maxLogit) { maxLogit = tokenLogits[i+4]; maxId = i + 4; }
if (tokenLogits[i+5] > maxLogit) { maxLogit = tokenLogits[i+5]; maxId = i + 5; }
if (tokenLogits[i+6] > maxLogit) { maxLogit = tokenLogits[i+6]; maxId = i + 6; }
if (tokenLogits[i+7] > maxLogit) { maxLogit = tokenLogits[i+7]; maxId = i + 7; }
}
return maxId;
}
function argmaxCached() {
let maxLogit = -Infinity, maxId = 0;
let i = 0;
for (; i < tLen % 8; i++) { if (tokenLogits[i] > maxLogit) { maxLogit = tokenLogits[i]; maxId = i; } }
for (; i < tLen; i += 8) {
const v0 = tokenLogits[i]; const v1 = tokenLogits[i+1]; const v2 = tokenLogits[i+2]; const v3 = tokenLogits[i+3];
const v4 = tokenLogits[i+4]; const v5 = tokenLogits[i+5]; const v6 = tokenLogits[i+6]; const v7 = tokenLogits[i+7];
if (v0 > maxLogit) { maxLogit = v0; maxId = i; }
if (v1 > maxLogit) { maxLogit = v1; maxId = i + 1; }
if (v2 > maxLogit) { maxLogit = v2; maxId = i + 2; }
if (v3 > maxLogit) { maxLogit = v3; maxId = i + 3; }
if (v4 > maxLogit) { maxLogit = v4; maxId = i + 4; }
if (v5 > maxLogit) { maxLogit = v5; maxId = i + 5; }
if (v6 > maxLogit) { maxLogit = v6; maxId = i + 6; }
if (v7 > maxLogit) { maxLogit = v7; maxId = i + 7; }
}
return maxId;
}
// Warmup
for (let i = 0; i < 10000; i++) { argmaxDirect(); argmaxCached(); }
const start1 = performance.now();
for (let i = 0; i < 100000; i++) argmaxDirect();
const end1 = performance.now();
const start2 = performance.now();
for (let i = 0; i < 100000; i++) argmaxCached();
const end2 = performance.now();
console.log('Direct 8x: ' + (end1 - start1).toFixed(2) + ' ms');
console.log('Cached 8x: ' + (end2 - start2).toFixed(2) + ' ms');
"
2. ObserveDirect 8xcompletes in less time thanCached 8x. 3. To test the change functionality: installvitestmanually (npm i vitest), runnpx vitest, and confirm all 131 tests pass. Restorepackage.json` to avoid committing the dependency modification.PR created automatically by Jules for task 469946989732640330 started by @ysdede
Summary by Sourcery
Optimize the ParakeetModel argmax hot path by changing the unrolled loop to use direct TypedArray access instead of cached local variables and document the performance finding in the project journal.
Enhancements:
Documentation: