4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -13,3 +13,7 @@ Action: Apply loop unrolling for max reductions in high-frequency typed array op
 ## 2024-11-20 - Softmax math.exp 8x unrolling with local var cache
 Learning: Unrolling the `Math.exp` accumulation loop to 8x and caching the multiplication `(tokenLogits[i] - maxLogit) * invTemp` into local variables before passing to `Math.exp` yields a measurable performance improvement (~4%) over the previous 4x unrolled implementation in the V8 engine, by reducing property access and allowing better instruction-level parallelism.
 Action: Utilize 8x loop unrolling paired with local variable caching for tight floating-point accumulation loops over TypedArrays.
+
+## 2024-11-20 - Set and Object.values overhead in hot loops
+Learning: In V8, iterating over objects with `Object.values()` creates intermediate array allocations that trigger garbage collection, and using a `Set` for tracking a very small number of unique items (e.g. 3-5 items) incurs higher initialization and hashing overhead compared to a plain array.
+Action: Replace `Object.values(obj)` with a `for...in` loop (checking `Object.hasOwn(obj, key)`) and replace `new Set()` with `[]` and `Array.includes()` for tracking small collections of items in high-frequency loops (like the inner DP decoder step) to improve performance.
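The pattern described in the Action line can be sketched in isolation as follows. This is a minimal illustration, not the code from `src/parakeet.js`: `disposeUnused` and `makeFakeTensor` are hypothetical names, and the fake tensor only counts `dispose()` calls so the deduplication is observable.

```javascript
// Hypothetical helper (not part of the PR): dispose every unique disposable
// value in `out`, except the ones listed in `keep`.
function disposeUnused(out, keep) {
  const seen = []; // small collection, so a linear scan stands in for a Set
  let disposedCount = 0;
  for (const key in out) {
    if (!Object.hasOwn(out, key)) continue; // ignore inherited properties
    const value = out[key];
    if (!value || typeof value.dispose !== 'function' || seen.includes(value)) continue;
    seen.push(value);
    if (keep.includes(value)) continue; // caller still needs this one
    value.dispose();
    disposedCount++;
  }
  return disposedCount;
}

// Stand-in for a tensor: records how many times dispose() was called.
function makeFakeTensor() {
  return { disposeCalls: 0, dispose() { this.disposeCalls++; } };
}

const logits = makeFakeTensor();
const scratch = makeFakeTensor();
// `alias` points at the same object as `scratch`; `length` is not disposable.
const out = { outputs: logits, scratch, alias: scratch, length: 3 };
const count = disposeUnused(out, [logits]);
```

`Array.prototype.includes` is a linear scan, so this trade only makes sense while the tracked collection stays at a handful of items, as the note above assumes.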
1 change: 1 addition & 0 deletions profile.cpuprofile

Large diffs are not rendered by default.

16 changes: 9 additions & 7 deletions src/parakeet.js
@@ -323,10 +323,12 @@ export class ParakeetModel {
     const logits = out['outputs'];
     const outputState1 = out['output_states_1'];
     const outputState2 = out['output_states_2'];
-    const seenOutputs = new Set();
-    for (const value of Object.values(out)) {
-      if (!value || typeof value.dispose !== 'function' || seenOutputs.has(value)) continue;
-      seenOutputs.add(value);
+    const seenOutputs = [];
+    for (const key in out) {
+      if (!Object.hasOwn(out, key)) continue;
+      const value = out[key];
+      if (!value || typeof value.dispose !== 'function' || seenOutputs.includes(value)) continue;
+      seenOutputs.push(value);
       if (value === logits || value === outputState1 || value === outputState2) continue;
       value.dispose();
     }
@@ -339,12 +341,12 @@ export class ParakeetModel {
     const failDecoderStep = (message) => {
       logits?.dispose?.();

-      const disposed = new Set();
+      const disposed = [];
       const disposeUniqueState = (state) => {
         if (!state) return;
         for (const tensor of [state.state1, state.state2]) {
-          if (!tensor || tensor === this._combState1 || tensor === this._combState2 || disposed.has(tensor)) continue;
-          disposed.add(tensor);
+          if (!tensor || tensor === this._combState1 || tensor === this._combState2 || disposed.includes(tensor)) continue;
+          disposed.push(tensor);
           tensor.dispose?.();
         }
       };
Comment on lines 341 to 352
medium

The `failDecoderStep` helper is defined as a closure inside `_runCombinedStep`, a hot function called for every token during transcription. A new function object is therefore allocated on every call, even though the helper only runs on error. Moving the definition out of `_runCombinedStep` (for example, into a private method or a static helper) would avoid these allocations and further reduce garbage-collection pressure, which is the primary goal of this PR.
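One way the suggested hoisting could look is sketched below. Everything here is an assumption made for illustration: `DecoderSketch`, `runStep`, and `disposeUniqueTensors` are hypothetical names, the real `_runCombinedStep` body is not shown in the diff, and the helper is assumed to throw after cleanup. Whether the saved allocation is measurable depends on the engine and would need profiling.

```javascript
// Hypothetical module-level helper: dispose each tensor at most once,
// skipping any tensor listed in `keep`.
function disposeUniqueTensors(tensors, keep) {
  const disposed = [];
  for (const tensor of tensors) {
    if (!tensor || keep.includes(tensor) || disposed.includes(tensor)) continue;
    disposed.push(tensor);
    tensor.dispose?.();
  }
  return disposed.length;
}

// Hypothetical stand-in for the model class; only the error path is modeled.
class DecoderSketch {
  constructor(combState1, combState2) {
    this._combState1 = combState1;
    this._combState2 = combState2;
  }

  // Lives on the prototype, so calling runStep() creates no new function object.
  _failDecoderStep(message, logits, states) {
    logits?.dispose?.();
    const tensors = [];
    for (const state of states) {
      if (state) tensors.push(state.state1, state.state2);
    }
    disposeUniqueTensors(tensors, [this._combState1, this._combState2]);
    throw new Error(message); // assumption: the original helper also throws
  }

  runStep(logits, states, ok) {
    if (!ok) this._failDecoderStep('decoder step failed', logits, states);
    return logits;
  }
}

// Usage with fake tensors that count dispose() calls.
const fake = () => ({ disposeCalls: 0, dispose() { this.disposeCalls++; } });
const comb1 = fake();
const comb2 = fake();
const shared = fake();
const stepLogits = fake();
const decoder = new DecoderSketch(comb1, comb2);
let caught = null;
try {
  decoder.runStep(
    stepLogits,
    [{ state1: shared, state2: comb2 }, { state1: shared, state2: null }],
    false
  );
} catch (err) {
  caught = err;
}
```

The state that the closure used to capture (`logits` and the step's states) becomes explicit parameters, which is the main cost of the change.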
