4 changes: 4 additions & 0 deletions .jules/bolt.md
@@ -13,3 +13,7 @@ Action: Apply loop unrolling for max reductions in high-frequency typed array op
## 2024-11-20 - Softmax math.exp 8x unrolling with local var cache
Learning: Unrolling the `Math.exp` accumulation loop to 8x and caching the multiplication `(tokenLogits[i] - maxLogit) * invTemp` into local variables before passing to `Math.exp` yields a measurable performance improvement (~4%) over the previous 4x unrolled implementation in the V8 engine, by reducing property access and allowing better instruction-level parallelism.
Action: Utilize 8x loop unrolling paired with local variable caching for tight floating-point accumulation loops over TypedArrays.
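
For illustration, the 8x unrolling described in the entry above might look like the sketch below. The function name `sumExpUnrolled8` and the scalar tail loop are assumptions made for this example; the repository's actual softmax code is not shown in this diff, and the ~4% figure is the PR author's measurement, not something this sketch demonstrates.

```javascript
// Hypothetical sketch: sums exp((logit - maxLogit) * invTemp) over a TypedArray,
// processing 8 elements per iteration and caching each scaled value in a local.
function sumExpUnrolled8(tokenLogits, maxLogit, invTemp) {
  const n = tokenLogits.length;
  const limit = n - (n % 8); // largest multiple of 8 that fits
  let sum = 0;
  let i = 0;
  for (; i < limit; i += 8) {
    // Cache the scaled differences in locals before calling Math.exp.
    const x0 = (tokenLogits[i] - maxLogit) * invTemp;
    const x1 = (tokenLogits[i + 1] - maxLogit) * invTemp;
    const x2 = (tokenLogits[i + 2] - maxLogit) * invTemp;
    const x3 = (tokenLogits[i + 3] - maxLogit) * invTemp;
    const x4 = (tokenLogits[i + 4] - maxLogit) * invTemp;
    const x5 = (tokenLogits[i + 5] - maxLogit) * invTemp;
    const x6 = (tokenLogits[i + 6] - maxLogit) * invTemp;
    const x7 = (tokenLogits[i + 7] - maxLogit) * invTemp;
    sum += Math.exp(x0) + Math.exp(x1) + Math.exp(x2) + Math.exp(x3) +
           Math.exp(x4) + Math.exp(x5) + Math.exp(x6) + Math.exp(x7);
  }
  // Scalar tail for lengths that are not a multiple of 8.
  for (; i < n; i++) {
    sum += Math.exp((tokenLogits[i] - maxLogit) * invTemp);
  }
  return sum;
}
```

Because the unrolled loop adds terms in a different order than a plain loop, the result can differ from a naive sum in the last few bits; any comparison between the two should use a tolerance.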

## 2024-11-23 - TypedArray Single Element Initialization
Learning: Replacing array literal initialization (e.g., `new Int32Array([1])`) with explicit sizing and assignment (e.g., `new Int32Array(1); arr[0] = 1;`) is an unjustified micro-optimization. In the context of ONNX model execution, this change will not yield a measurable performance improvement and directly violates the strict instruction: "Never do: Trade readability and maintainability for micro-optimizations".
Action: Never perform speculative micro-optimizations that trade readability for negligible/unmeasurable theoretical gains. Focus on clear bottlenecks (like reactive updates or DOM operations).
medium

The mention of "reactive updates or DOM operations" in the action item is irrelevant to this repository, which focuses on pure signal processing and ONNX model execution. It's better to provide examples that are relevant to the project's domain to make the guidance more actionable for contributors.

Suggested change
- Action: Never perform speculative micro-optimizations that trade readability for negligible/unmeasurable theoretical gains. Focus on clear bottlenecks (like reactive updates or DOM operations).
+ Action: Never perform speculative micro-optimizations that trade readability for negligible/unmeasurable theoretical gains. Focus on clear bottlenecks identified through profiling (like FFT twiddle lookups or large array reductions).
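
As a small illustration of the 2024-11-23 entry, the two initialization forms it compares produce the same array, so the choice between them is one of readability. The helper `sameContents` is a hypothetical name introduced only for this sketch.

```javascript
// Form 1: array-literal initialization (the more readable form).
const fromLiteral = new Int32Array([1]);

// Form 2: explicit sizing followed by assignment.
const fromSized = new Int32Array(1);
fromSized[0] = 1;

// Hypothetical helper: element-wise comparison of two typed arrays.
function sameContents(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}
```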

12 changes: 7 additions & 5 deletions src/mel.js
@@ -339,11 +339,13 @@ function fft(re, im, N, tw) {
   for (let len = 16; len <= N; len <<= 1) {
     const halfLen = len >> 1;
     const step = N / len;
-    for (let i = 0; i < N; i += len) {
-      for (let k = 0; k < halfLen; k++) {
-        const twIdx = k * step;
-        const wCos = tw.cos[twIdx];
-        const wSin = tw.sin[twIdx];
+    // Optimization: Loop interchange to hoist twiddle factor lookups.
+    // Inner loop iterates over 'i' instead of 'k'.
+    for (let k = 0; k < halfLen; k++) {
+      const twIdx = k * step;
+      const wCos = tw.cos[twIdx];
+      const wSin = tw.sin[twIdx];
+      for (let i = 0; i < N; i += len) {
Comment on lines +344 to +348
medium

To further reduce property access overhead in this hot path, you can hoist the `tw.cos` and `tw.sin` property lookups outside the `k` loop by binding both arrays to local constants once. This follows the repository's performance learning about reducing property access in tight loops (as noted in `bolt.md`).

Suggested change
-    for (let k = 0; k < halfLen; k++) {
-      const twIdx = k * step;
-      const wCos = tw.cos[twIdx];
-      const wSin = tw.sin[twIdx];
-      for (let i = 0; i < N; i += len) {
+    const { cos: twCos, sin: twSin } = tw;
+    for (let k = 0; k < halfLen; k++) {
+      const twIdx = k * step;
+      const wCos = twCos[twIdx];
+      const wSin = twSin[twIdx];
+      for (let i = 0; i < N; i += len) {

const p = i + k;
const q = p + halfLen;
const tRe = re[q] * wCos - im[q] * wSin;
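
The loop interchange in this hunk can be checked in isolation with a sketch like the one below. Everything beyond the lines visible in the diff is an assumption: `makeTwiddles`, `butterfly`, `stageBlockOuter`, and `stageTwiddleOuter` are hypothetical names, the twiddle table layout and sign convention are guessed, and the butterfly body after `tRe` follows the standard radix-2 form, not the truncated source.

```javascript
// Hypothetical twiddle table; the real layout of `tw` in src/mel.js is not shown.
function makeTwiddles(N) {
  const half = N >> 1;
  const cos = new Float64Array(half);
  const sin = new Float64Array(half);
  for (let j = 0; j < half; j++) {
    const angle = (-2 * Math.PI * j) / N;
    cos[j] = Math.cos(angle);
    sin[j] = Math.sin(angle);
  }
  return { cos, sin };
}

// Standard radix-2 butterfly (assumed; only the tRe line appears in the diff).
function butterfly(re, im, p, q, wCos, wSin) {
  const tRe = re[q] * wCos - im[q] * wSin;
  const tIm = re[q] * wSin + im[q] * wCos;
  re[q] = re[p] - tRe;
  im[q] = im[p] - tIm;
  re[p] += tRe;
  im[p] += tIm;
}

// Original order: outer loop over blocks `i`, twiddles re-read for every block.
function stageBlockOuter(re, im, N, len, tw) {
  const halfLen = len >> 1;
  const step = N / len;
  for (let i = 0; i < N; i += len) {
    for (let k = 0; k < halfLen; k++) {
      const twIdx = k * step;
      butterfly(re, im, i + k, i + k + halfLen, tw.cos[twIdx], tw.sin[twIdx]);
    }
  }
}

// Interchanged order: outer loop over `k`, each twiddle pair read once per stage.
function stageTwiddleOuter(re, im, N, len, tw) {
  const halfLen = len >> 1;
  const step = N / len;
  const { cos: twCos, sin: twSin } = tw; // hoisted property lookups
  for (let k = 0; k < halfLen; k++) {
    const twIdx = k * step;
    const wCos = twCos[twIdx];
    const wSin = twSin[twIdx];
    for (let i = 0; i < N; i += len) {
      butterfly(re, im, i + k, i + k + halfLen, wCos, wSin);
    }
  }
}
```

Within one stage every `(i, k)` pair touches a distinct `p` and `q`, so the two orders perform the same butterflies on the same inputs and should give identical output. One trade-off worth profiling: the interchanged inner loop walks memory with stride `len`, which may be less cache-friendly than the original order for large `N`.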