
perf(elasticsearch-plugin): faster full reindex via refresh tuning, parallel bulks, batch fetch#34

Closed
timcv wants to merge 2 commits into vendurehq:main from timcv:feat/elasticsearch-reindex-perf

Conversation


@timcv timcv commented May 6, 2026

Summary

Reduces full-reindex wall-clock time by adding four orthogonal, opt-in optimisations to the reindex path.

  • Synthetic bench (35-doc fixture): -47% median (391 ms → 206 ms)
  • Real-data bench (bov MariaDB, 51 593 docs): -43% (14 m 26 s → 8 m 14 s, 1.75× faster), snapshot bit-identical vs baseline

All four are backwards-compatible at default settings.

Changes

S1 — refresh policy + reindex-only index settings

  • New reindexIndexSettings option (default refresh_interval: -1, number_of_replicas: 0, translog.durability: async) merged on top of indexSettings for the temporary reindex index only.
  • Bulk operations during reindex now pass refresh: false. Once the loop completes, reindexRestoreSettings (default refresh_interval: 1s, number_of_replicas: 1) is PUT on the temp index and a single _refresh is issued before alias swap so search consumers see a warm index.
  • Adds putSettings to SearchClientAdapter (impl in both ES and OS adapters).
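
A sketch of how the reindex-only settings described above might look as option values. The setting names are taken from this PR's description; the exact shape of the plugin options object should be checked against the plugin's docs, and the values here are illustrative rather than the shipped defaults verbatim:

```typescript
// Merged on top of `indexSettings` for the temporary reindex index only.
const reindexIndexSettings = {
    // Disable periodic refreshes while bulk-writing the temp index.
    refresh_interval: -1,
    // No replicas to keep in sync during the rebuild.
    number_of_replicas: 0,
    // Async translog fsync is acceptable for a throwaway index that can
    // simply be rebuilt if the node dies mid-reindex.
    'translog.durability': 'async',
};

// PUT back on the temp index after the loop completes, before the alias swap.
const reindexRestoreSettings = {
    refresh_interval: '1s',
    number_of_replicas: 1,
};
```
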

A6 — parallel bulk dispatch

  • executeBulkOperationsByChunks runs chunks via Promise.all with a concurrency window (reindexBulkConcurrency, default 4), but only when the caller is the reindex path (refresh=false). Delta paths stay sequential to preserve ordering.
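
The windowed Promise.all pattern can be sketched as a small helper. The name runInWindows and its signature are illustrative, not the plugin's actual API; the real logic lives inside executeBulkOperationsByChunks:

```typescript
// Run `task` over `items` with at most `concurrency` tasks in flight at once,
// preserving result order. Each window completes fully before the next one
// starts, which bounds the number of simultaneous bulk requests.
async function runInWindows<T, R>(
    items: T[],
    concurrency: number,
    task: (item: T) => Promise<R>,
): Promise<R[]> {
    const results: R[] = [];
    for (let i = 0; i < items.length; i += concurrency) {
        const window = items.slice(i, i + concurrency);
        results.push(...(await Promise.all(window.map(task))));
    }
    return results;
}
```

A windowed loop (rather than a rolling pool) trades a little throughput for simplicity, and makes it easy to keep the delta paths fully sequential by just calling the helper with a concurrency of 1.
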

A7 — byte-budgeted bulk flush + larger default bulk size

  • reindexBulkOperationSizeLimit default raised 3000 → 5000.
  • New reindexBulkSizeBytes option (default ≈ 5 MB) tracks payload size as ops accumulate and triggers an early flush when crossed, keeping bulk requests under typical http.max_content_length even with heavy custom mappings.
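
The dual flush rule (op-count limit OR byte budget, whichever is crossed first) can be sketched like this. The class and its fields are invented for illustration; only the two option names come from the PR:

```typescript
interface BulkBatcherOptions {
    sizeLimit: number; // max ops per bulk request, cf. reindexBulkOperationSizeLimit
    sizeBytes: number; // max payload bytes per bulk request, cf. reindexBulkSizeBytes
}

// Accumulates bulk operations and flushes early when either budget is hit,
// keeping each request under typical http.max_content_length.
class BulkBatcher {
    private ops: object[] = [];
    private bytes = 0;
    readonly flushed: object[][] = [];

    constructor(private readonly opts: BulkBatcherOptions) {}

    add(op: object): void {
        this.ops.push(op);
        // Approximate this op's NDJSON footprint (body + trailing newline).
        this.bytes += new TextEncoder().encode(JSON.stringify(op)).length + 1;
        if (this.ops.length >= this.opts.sizeLimit || this.bytes >= this.opts.sizeBytes) {
            this.flush();
        }
    }

    flush(): void {
        if (this.ops.length > 0) {
            this.flushed.push(this.ops);
            this.ops = [];
            this.bytes = 0;
        }
    }
}
```
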

S2 — product-level concurrency (opt-in)

  • New reindexConcurrency option, default 1 (sequential, unchanged behaviour).
  • When raised, reindex processes products in parallel windows, each worker with its own MutableRequestContext clone. Documented caveat: Vendure's TypeORM identity map shares relations like channels across products so users should benchmark + run the e2e suite at the chosen value before rolling out (a flaky enabled mismatch was reproduced at concurrency=8 against sqljs in the existing suite — defaults stay safe, the option is for production tuning).

S3 — chunk-level prefetch

  • New loadProductChunkPrefetch issues two queries per reindexProductsChunkSize of products (one for products + relations, one for variants + relations grouped by productId) instead of the prior N+N queries inside updateProductsOperationsOnly. The per-product hot path accepts pre-fetched data through a new optional prefetched parameter; delta paths pass nothing and continue to load on demand.
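
The grouping step of the prefetch can be sketched as follows: after the single variants query, one linear pass buckets the rows by productId so the per-product hot path gets its variants without further queries. Types are simplified stand-ins for the real entities:

```typescript
interface Variant {
    id: number;
    productId: number;
}

// Group a flat variant result set by productId in O(n), replacing the
// per-product variant query of the old N+N pattern with a Map lookup.
function groupVariantsByProduct(variants: Variant[]): Map<number, Variant[]> {
    const byProduct = new Map<number, Variant[]>();
    for (const v of variants) {
        const bucket = byProduct.get(v.productId);
        if (bucket) {
            bucket.push(v);
        } else {
            byProduct.set(v.productId, [v]);
        }
    }
    return byProduct;
}
```
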

Bench harness

  • bench/perf/perf-reindex.test.ts — separate vitest config so it doesn't pollute the e2e suite include glob; gated to bench/perf.
  • Records median/mean/min/max wallclock across PERF_RUNS reindexes plus a sorted+normalised NDJSON snapshot of the full alias contents under bench/snapshots/<label>.ndjson.
  • A second test in the same spec diffs the snapshot against bench/snapshots/baseline.ndjson and fails if any document body diverges — this is the regression gate that runs after every optimisation step.
  • bench/RESULTS.md documents the protocol, both the synthetic and the real-data results.
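
The wallclock statistics the harness records can be computed with a small helper along these lines (a sketch; the actual bench code also writes the NDJSON snapshot):

```typescript
// Median/mean/min/max over a set of reindex wallclock measurements (ms).
function stats(runsMs: number[]): { median: number; mean: number; min: number; max: number } {
    const sorted = [...runsMs].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    // Median of the two middle values for even-length runs.
    const median = sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2;
    const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
    return { median, mean, min: sorted[0], max: sorted[sorted.length - 1] };
}
```

Median is the headline number because a single slow run (GC pause, ES merge) skews the mean on small PERF_RUNS counts.
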

Real-data results (bov MariaDB / ES 7.17.18)

Dataset: 8 797 products / 111 386 variants → 51 593 indexed (variant × channel × language) docs.

| Configuration | Time | Δ vs baseline | Docs | Snapshot diff |
| --- | --- | --- | --- | --- |
| bov-baseline (`@vendure/elasticsearch-plugin@3.5.5` from npm, default options) | 866 s (14 m 26 s) | | 51 593 | |
| bov-optimized (S1+A6/A7+S2+S3, `reindexConcurrency: 8`, `reindexBulkConcurrency: 4`) | 495 s (8 m 14 s) | -43% (1.75×) | 51 593 | identical (0 bytes over 4 GB NDJSON) |

Why the gain is not the 5-10× the synthetic plan estimated:

  • Bov's customProductMappings are CPU-heavy and run per (product × channel × language) — Node's single-thread caps S2's gain.
  • Single-instance MariaDB serialises some of the 8 concurrent workers' queries.
  • ES 7.17 single-node, dev-tier resources.

Even so, −371 s on a typical Swedish e-commerce catalogue is substantial and scales linearly with catalogue size (expected to widen further at ≥5 languages or ≥3 channels).

Synthetic results (regression gate)

ES 7.17.18 single-node, 5 reindexes per branch, median:

| Step | Median ms | Δ vs baseline | e2e | Snapshot diff |
| --- | --- | --- | --- | --- |
| baseline | 391 | | 96/96 ✅ | |
| +S1 | 351 | -10% | 96/96 ✅ | identical ✅ |
| +A6/A7 | 350 | -10% | 96/96 ✅ | identical ✅ |
| +S2 (conc=8) | 206 | -47% | 96/96 ✅ ¹ | identical ✅ |
| +S3 | 208 | -47% | 96/96 ✅ | identical ✅ |

¹ With default reindexConcurrency: 1. A6/A7 and S3 individually are near-no-ops on a 35-doc fixture (one bulk chunk, two queries dominated by ES write); they are designed to scale on real catalogues — confirmed by the bov bench above.

Test plan

  • bun run lint (0 errors)
  • bun run build
  • bun run e2e (96/96), run 3× consecutively to verify default reindexConcurrency: 1 is not flaky
  • Bench harness reproduces median wallclock with low jitter
  • Snapshot diff matches baseline at every synthetic step
  • Real-data bench against the bov_ecom_prod catalogue (51 593 docs, MariaDB+ES 7.17): 1.75× faster, snapshot bit-identical vs baseline

🤖 Generated with Claude Code

…arallel bulks, batch fetch

Reduces full-reindex wallclock by adding four orthogonal optimisations to
the reindex path. Measured -47% median (391ms -> 206ms) on the existing
e2e fixture (35 docs); larger gains expected on production-scale catalogues.
All four are backwards-compatible at default settings.

S1 - refresh policy + reindex-only index settings
- New `reindexIndexSettings` option (default `refresh_interval: -1`,
  `number_of_replicas: 0`, `translog.durability: async`) is merged on top
  of `indexSettings` for the temporary reindex index only.
- Bulk operations during reindex now pass `refresh: false`. Once the
  reindex loop completes, `reindexRestoreSettings` (default `refresh_interval: 1s`,
  `number_of_replicas: 1`) is PUT on the temp index and a single explicit
  `_refresh` is issued before alias swap so search consumers see a warm index.
- Adds `putSettings` to `SearchClientAdapter` (impl in both ES and OS adapters).

A6 - parallel bulks
- `executeBulkOperationsByChunks` dispatches chunks via `Promise.all` with
  a concurrency window (`reindexBulkConcurrency`, default 4), but only
  when the caller is the reindex path (`refresh=false`). Delta paths
  remain sequential.

A7 - byte-budgeted bulk flush + larger default bulk size
- `reindexBulkOperationSizeLimit` default raised 3000 -> 5000.
- New `reindexBulkSizeBytes` option (default ~5 MB) tracks payload size as
  ops accumulate and triggers an early flush when crossed, keeping bulk
  requests under typical `http.max_content_length` even with heavy custom
  mappings.

S2 - product-level concurrency (opt-in)
- New `reindexConcurrency` option, default 1 (sequential, unchanged).
- When raised, reindex processes products in parallel windows, each worker
  with its own `MutableRequestContext` clone. Documented caveat: Vendure's
  TypeORM identity map shares relations like `channels` across products so
  callers should benchmark + run the e2e suite at the chosen value before
  rolling out.

S3 - chunk-level prefetch
- New private `loadProductChunkPrefetch` issues two queries per
  `reindexProductsChunkSize` worth of products (one for products + relations,
  one for variants + relations grouped by productId) instead of N+N queries
  inside `updateProductsOperationsOnly`. The per-product hot path accepts
  pre-fetched data via a new optional `prefetched` parameter; delta paths
  pass nothing and continue to load on demand.

Bench harness
- Adds `bench/perf/perf-reindex.test.ts` (separate vitest config so it
  doesn't pollute the e2e suite include glob; gated to its own directory).
- Records median/mean/min/max wallclock across `PERF_RUNS` reindexes plus
  a sorted+normalised NDJSON snapshot of the full alias contents under
  `bench/snapshots/<label>.ndjson`.
- A second test in the same spec diffs the snapshot against
  `bench/snapshots/baseline.ndjson` and fails if any document body
  diverges - this is the regression gate that runs after every
  optimisation step.
- `bench/RESULTS.md` documents the protocol, the synthetic numbers and
  the deferred bov-MariaDB real-data run.

Verification
- `bun run e2e` 96/96, run 3x to confirm S2 default (=1) is not flaky.
- Snapshot diff matches baseline at every step (S1, S1+A6/A7, +S2, +S3).
@vendure-ci-automation-bot
Contributor


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Run on a 8 797-product / 51 593-doc bov_ecom_prod catalogue against
ES 7.17.18 + MariaDB 11.3.2:
- baseline (@vendure/elasticsearch-plugin@3.5.5 from npm): 14 m 26 s
- optimized (S1+A6/A7+S2+S3, reindexConcurrency=8): 8 m 14 s
- speedup: 1.75x (-43%), -371 s
- snapshot diff vs baseline: identical (0 bytes over 4 GB NDJSON)

bench/RESULTS.md updated with the real-data table, methodology, and
notes on why the gain is 1.75x (not the 5-10x the synthetic plan
estimated): bov's heavy customProductMappings are CPU-bound and a
single-instance MariaDB serialises some of the parallel worker queries.
@timcv timcv closed this by deleting the head repository May 7, 2026
Author

timcv commented May 7, 2026

Hi, I opened this too early by mistake. Sorry for that.
