feat: separate internal schema dimensions from embedding API request dimensions #377

Closed
jszhang98 wants to merge 4 commits into CortexReach:master from jszhang98:fix/separate-request-dimensions

@jszhang98

PR Title

feat: separate internal schema dimensions from embedding API request dimensions

Summary

This PR decouples two previously conflated dimension semantics in the embedding pipeline:

  • Internal dimensions used for LanceDB schema sizing and local vector validation.
  • Request dimensions sent to embedding providers that support variable output dimensions.

After this change:

  • embedding.dimensions is treated as internal schema/validation dimension.
  • embedding.requestDimensions is optional and only used for API request payload fields.
  • If requestDimensions is not configured, no dimensions field is sent to embedding APIs by default.
  • omitDimensions=true still has highest priority and suppresses dimensions fields even when requestDimensions is set.

Motivation

Some providers/models (for example non-matryoshka-compatible models) reject dimensions/output_dimension in request payloads.

Previously, embedding.dimensions was reused for both storage and API request parameters, which could cause provider-side 400 errors during startup/runtime.

This PR keeps storage behavior stable while preventing accidental request-parameter injection.

What Changed

  • Embedding config model extended with requestDimensions.
  • Embedder request payload dimension source changed from dimensions to requestDimensions.
  • Plugin config parsing and wiring updated to pass requestDimensions explicitly.
  • Plugin schema updated to declare embedding.requestDimensions.
  • Regression tests updated to reflect new default behavior and precedence rules.

Behavior Matrix

  • dimensions only:
    • Used internally for schema/validation.
    • Not forwarded to embedding API payload.
  • requestDimensions set:
    • Forwarded as provider-specific request field (dimensions/output_dimension).
  • requestDimensions + omitDimensions=true:
    • Request dimension fields are omitted.
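
The matrix above can be sketched as a small payload helper. This is only an illustration under stated assumptions: `EmbeddingConfig`, `DimensionField`, and `dimensionPayload` are hypothetical names, not the identifiers used in `src/embedder.ts`.

```typescript
// Hypothetical config shape for illustration; the PR's real type may differ.
interface EmbeddingConfig {
  dimensions?: number;        // internal schema / validation size
  requestDimensions?: number; // forwarded to the provider only when set
  omitDimensions?: boolean;   // highest priority: suppress the field entirely
}

// Provider-specific name of the request field, as listed in the matrix.
type DimensionField = "dimensions" | "output_dimension";

// Returns the dimension-related fragment of an embedding request payload.
function dimensionPayload(
  config: EmbeddingConfig,
  field: DimensionField,
): Record<string, number> {
  // omitDimensions wins even when requestDimensions is set.
  if (config.omitDimensions) return {};
  // dimensions alone is never forwarded.
  if (config.requestDimensions === undefined) return {};
  return { [field]: config.requestDimensions };
}
```

Spreading the result into the request body (`{ model, input, ...dimensionPayload(config, "dimensions") }`) keeps the field absent, rather than `undefined`, when nothing should be sent.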

Backward Compatibility

  • Existing storage behavior remains unchanged.
  • Existing users relying on implicit forwarding of embedding.dimensions to API payload should migrate to embedding.requestDimensions.
  • No change to fixed-dimension LanceDB table constraints in this PR.

Files Updated

Testing

  • Updated regression assertions for new dimension-forwarding semantics.
  • Added schema assertion for embedding.requestDimensions.
  • The full local test command was blocked in the current environment due to a missing jiti package dependency, but the modified regression logic and assertions are in place.

Collaborator

@AliceLJY AliceLJY left a comment

LGTM — clean separation of concerns.

The decoupling of dimensions (internal schema) from requestDimensions (API payload) is the right fix. Solves the problem for non-matryoshka models that reject dimensions in requests.

Tests cover all three cases (no requestDimensions → not sent, requestDimensions set → forwarded, omitDimensions trumps all). Schema updated.

@rwmjhb ready for merge.

@AliceLJY AliceLJY requested a review from rwmjhb March 28, 2026 14:59

@JiwaniZakir JiwaniZakir left a comment

The assignment change in embedder.ts at line 434 (this._requestDimensions = config.requestDimensions) is a silent breaking change for existing users. Anyone who previously set only dimensions and relied on it being forwarded to the OpenAI-compatible API (e.g., for text-embedding-3-small with a reduced dimension like 256) will now silently receive full-size embeddings from the provider. If they already have stored vectors at the reduced dimension, this will cause shape mismatches at query time with no warning surfaced to the user.

There's also no test covering the scenario where requestDimensions is omitted but dimensions is set — the test suite now only validates the happy path (requestDimensions: 4 present) and the omitDimensions override. A test asserting that omitting requestDimensions results in no dimensions field in the API request (distinct from the omitDimensions path) would make the contract explicit and catch regressions.

Finally, parsePluginConfig in index.ts gives dimensions a legacy top-level fallback (embedding.dimensions ?? cfg.dimensions), but requestDimensions has no such fallback — that asymmetry is fine given the intent, but it's worth a comment explicitly noting it's not an oversight.

@rwmjhb
Collaborator

rwmjhb commented Mar 29, 2026

This change is valuable. Separating internal schema dimensions from request payload dimensions is the right direction, and it should avoid provider-side 400s for models/endpoints that reject dimensions / output_dimension.

I found one blocking issue before merge:

  • npm test is red on this branch.
  • test/embedder-error-hints.test.mjs still asserts the old behavior that embedding.dimensions is forwarded by default, but the implementation now only forwards requestDimensions.
  • I can reproduce this with both node test/embedder-error-hints.test.mjs and the full npm test run.

Relevant spots:

  • New behavior in src/embedder.ts: _requestDimensions = config.requestDimensions and payload forwarding now depends on requestDimensions
  • Old expectations still present in test/embedder-error-hints.test.mjs (for example the Jina / generic / Voyage payload assertions)

Suggested fix:

  • Update test/embedder-error-hints.test.mjs to match the new contract:
    • embedding.dimensions => internal schema / validation only
    • embedding.requestDimensions => forwarded to the embedding API when set
    • omitDimensions=true => still suppresses request dimension fields

Non-blocking note:

  • Consider adding embedding.requestDimensions to uiHints in openclaw.plugin.json as well, so the new config path is discoverable in the manifest/UI layer.

With the test suite aligned to the new semantics, this looks like a good change to land.

@jszhang98
Author

Update: pushed a follow-up test-only commit (fe65f93) to align stale embedder assertions with the new requestDimensions behavior.

Validated locally:

  • node test/embedder-error-hints.test.mjs
  • node test/plugin-manifest-regression.mjs

For CI context:
The remaining failures are reflection-hook tests and can be reproduced on origin/master as well (not introduced by this PR):

  • node --test test/reflection-bypass-hook.test.mjs

@jszhang98
Author

Thanks for the review.

The blocking issue is fixed in follow-up commit fe65f93:

  • Updated test/embedder-error-hints.test.mjs to match the new contract:
    • embedding.dimensions => internal schema/validation only
    • embedding.requestDimensions => forwarded when set
    • omitDimensions=true => still suppresses request dimension fields

For the non-blocking suggestion: agreed.
I can add embedding.requestDimensions to uiHints in openclaw.plugin.json as a small follow-up.

@jszhang98
Author

Thanks for the review — addressed.

  • Fixed the blocking test mismatch by updating test/embedder-error-hints.test.mjs to the new requestDimensions contract (fe65f93).
  • Added uiHints for embedding.requestDimensions in openclaw.plugin.json for discoverability (82748fc).

I also re-checked locally:

  • node test/embedder-error-hints.test.mjs
  • node test/plugin-manifest-regression.mjs

Please take another look when convenient.

@rwmjhb
Collaborator

rwmjhb commented Mar 30, 2026

requestDimensions is currently request-only, but local validation still follows embedding.dimensions. In practice this means requestDimensions alone does not work for variable-dimension models: e.g. text-embedding-3-large + requestDimensions: 1024 can return a 1024-dim vector, while the plugin still expects 3072 and throws.

So the new config contract is incomplete right now. Either:

  1. make internal validation follow requestDimensions when present, or
  2. explicitly require dimensions and requestDimensions to be kept in sync, and document/test that behavior.

Author

@jszhang98 jszhang98 left a comment

Per your feedback, internal validation now prioritizes requestDimensions if present, ensuring local checks match the API output for variable-dimension models. All tests pass except for known unrelated failures in reflection-bypass-hook.
Let me know if any further changes are needed!

Collaborator

@AliceLJY AliceLJY left a comment

LGTM — changes are clean, on-topic, and well-tested. Approving.

@rwmjhb
Collaborator

rwmjhb commented Apr 3, 2026

Review: separate internal schema dimensions from embedding API request dimensions

Verdict: request-changes | Confidence: 0.90 | Value: 38%

The decoupling intent makes sense — separating internal schema sizing from the API request hint is cleaner than omitDimensions. However, the current implementation has a backward-compat break and a schema divergence issue.

Must Fix

1. Silent breaking change for reduced-dimension users (HIGH)

embedder.ts:434 changed from this._requestDimensions = config.dimensions to this._requestDimensions = config.requestDimensions. Existing configs like { dimensions: 256, model: "text-embedding-3-small" } (no requestDimensions) now:

  1. Send no dimension hint to the API → provider returns 1536-dim vectors
  2. validateEmbedding() compares against this.dimensions=256 → throws dimension-mismatch error

This breaks all users relying on reduced-dimension embeddings without warning. @JiwaniZakir flagged the same scenario in their review.

Suggestion: when dimensions is set but requestDimensions is absent, either (a) auto-backfill requestDimensions = dimensions for backward compat, or (b) emit a runtime warning explaining the migration path.

2. requestDimensions and dimensions can't actually diverge (HIGH)

Store schema uses getVectorDimensions(model, config.embedding.dimensions) for table sizing, while the embedder now uses requestDimensions for actual embedding size. A config like { dimensions: 1536, requestDimensions: 1024 } produces 1024-dim embeddings but the LanceDB table expects 1536-dim vectors → storage fails.

The advertised decoupling doesn't work unless both fields are equal, which defeats the purpose. The store sizing path needs to be aware of requestDimensions.

3. Rebase required — build is red (stale base)

reflection-bypass-hook.test.mjs failures are pre-existing on master per your comment. Rebase to confirm.
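
Suggestions (a) and (b) from item 1 could be combined roughly as follows. This is a sketch of the reviewer's proposal, not the behavior the PR ships (the PR forwards nothing by default). `resolveRequestDimensions` and the injected `warn` callback are hypothetical names.

```typescript
// Hypothetical config shape for illustration only.
interface EmbeddingConfig {
  dimensions?: number;
  requestDimensions?: number;
  omitDimensions?: boolean;
}

// Resolves the value to forward to the provider while keeping legacy
// dimensions-only configs working. `warn` is injected so callers can route
// the message to whatever logger they use.
function resolveRequestDimensions(
  config: EmbeddingConfig,
  warn: (message: string) => void,
): number | undefined {
  if (config.omitDimensions) return undefined;
  if (config.requestDimensions !== undefined) return config.requestDimensions;
  if (config.dimensions !== undefined) {
    // (b) make the implicit behavior visible, then (a) backfill.
    warn(
      "embedding.dimensions is set without embedding.requestDimensions; " +
        `forwarding ${config.dimensions} for backward compatibility. ` +
        "Set embedding.requestDimensions explicitly to silence this warning.",
    );
    return config.dimensions;
  }
  return undefined;
}
```

Under this variant, users of models that reject the field would still need `omitDimensions: true`, which is the trade-off the PR was trying to avoid.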

Nice to Have

  • Missing backward-compat test: no test covers the dimensions-only (no requestDimensions) scenario — the exact breaking path. Adding one would make the contract explicit.

Auto-reviewed by auto-pr-review-orchestrator | 6 rounds | Claude + Codex adversarial

@JiwaniZakir

The separation makes sense, but worth confirming the fallback behavior when neither requestDimensions nor omitDimensions is set — the PR says no dimensions field is sent by default, which is a behavioral change from before and could silently break providers that require the field. It might be worth logging a debug warning when dimensions is set but requestDimensions is not, so users aren't surprised when variable-dimension-capable providers return a default size they didn't expect. Also, if requestDimensions is set but dimensions is not, should schema sizing infer from requestDimensions or error out?

@jszhang98
Author

Thanks for the thoughtful feedback. I agree we should make the fallback behavior explicit and avoid surprises.

I’m going to keep the current functional behavior, but make it clearer and test-covered with a minimal follow-up:

  • Add debug logging when dimensions is set but requestDimensions is not, to make it explicit that dimensions is used for internal schema/validation only and is not forwarded to the embedding API.
  • Add debug logging when requestDimensions is set but dimensions is not, to make it explicit that internal schema sizing is inferred from requestDimensions.
  • Keep the current precedence rule (requestDimensions over dimensions for effective internal validation dimension) and document it clearly.
  • Add focused tests for both fallback paths and update docs so the behavior is unambiguous.

This should address the ambiguity you called out without introducing a broader behavior change in this PR.

@JiwaniZakir

The separation makes sense, but worth clarifying the migration path for existing users who currently rely on embedding.dimensions being forwarded to the API — they'll silently stop sending dimensions to providers that support/require it unless they explicitly set requestDimensions. A deprecation warning or migration note in the config validation would help catch that. Also, does the schema validation still fail fast if the returned vector length doesn't match embedding.dimensions, or is that check now deferred to LanceDB insert time?

@jszhang98
Author

Thanks, this is a great callout.

You’re right about migration risk: users who previously relied on embedding.dimensions being forwarded could silently stop sending a request-side dimension unless requestDimensions is explicitly set. I’ll add a migration note plus a config-time/deprecation warning to make that transition explicit.

On validation timing: we still fail fast on embedding length mismatch before persistence (not only at LanceDB insert time). The effective validation dimension follows the current precedence rule (requestDimensions when set, otherwise dimensions), and I’ll document that clearly in the same update.

@JiwaniZakir

The separation of dimensions (schema/validation) from requestDimensions (API payload) is the right call — conflating them was always a footgun for providers that treat dimensions as a hard constraint rather than a hint. One thing worth double-checking: if requestDimensions is set but the returned vector length doesn't match dimensions, does the validation layer surface a clear error, or does it silently fail at insert time when LanceDB schema enforcement kicks in? That boundary should be explicit.

@jszhang98
Author

Thanks, this is a great point.
Yes, the intent is to fail fast at the embedding validation layer, not defer failure to LanceDB insert-time schema enforcement. I’ll make that boundary explicit in code comments/docs and ensure the mismatch path returns a clear, actionable error message (including expected vs actual vector length and which effective dimension was used).
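
A fail-fast check of that kind might look like the sketch below. It assumes the precedence rule described earlier in this thread (requestDimensions when set, otherwise dimensions). `validateEmbeddingLength` is a hypothetical name and the error wording is illustrative only.

```typescript
// Hypothetical helper; the PR's actual validation function may be named
// and structured differently.
function validateEmbeddingLength(
  vector: readonly number[],
  config: { dimensions?: number; requestDimensions?: number },
): void {
  const fromRequest = config.requestDimensions !== undefined;
  const expected = config.requestDimensions ?? config.dimensions;
  if (expected === undefined) return; // nothing configured to check against
  if (vector.length !== expected) {
    const source = fromRequest
      ? "embedding.requestDimensions"
      : "embedding.dimensions";
    // Report expected vs actual length and which field drove the check.
    throw new Error(
      `Embedding dimension mismatch: expected ${expected} (from ${source}), ` +
        `got ${vector.length}.`,
    );
  }
}
```

Calling this immediately after the provider responds, before any write, is what keeps the failure at the embedding layer instead of surfacing as a LanceDB schema error.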

@JiwaniZakir

The stray } in src/smart-extractor.ts around filterNoiseByEmbedding() is a blocking issue — that needs to be patched before any further review makes sense. On the migration concern: the silent behavior change is real, and a runtime warning when dimensions is set but requestDimensions is not (during provider initialization) would catch the majority of cases without requiring users to read migration notes. If this is superseded by #482, that PR should carry both the syntax fix and the deprecation warning.

@jszhang98
Author

Agreed on priority: a syntax issue is blocking, so we’ll patch that first before further review.

Also agreed on migration safety: we’ll add a runtime warning during embedder/provider initialization when dimensions is set but requestDimensions is not, so users get an immediate signal instead of relying only on migration notes.

If this line of work is continued in #482, we’ll ensure that PR includes both:

  • the syntax fix, and
  • the deprecation/runtime warning for dimensions-only configs.

@JiwaniZakir

The separation looks correct in principle, but there's a potential issue at the validation boundary: if requestDimensions is set and differs from dimensions, the embedding response vector length will be validated against dimensions (the schema value), but the provider actually returned vectors sized to requestDimensions. That mismatch will cause a spurious validation failure unless the validation path is updated to check against requestDimensions when it's present. Worth confirming that validateEmbeddingDimensions() (or equivalent) uses the right value depending on which field drove the request.

@jszhang98
Author

Thanks, great boundary callout.

You’re right to focus on the validation/source-of-truth path.
In the current implementation, embedding-length validation already uses the effective runtime dimension (requestDimensions when present), so it is not always pinned to dimensions.

That said, there is still a real boundary risk if schema sizing and request-time sizing are derived from different fields. I’ll align the schema dimension source with the same effective dimension rule and add explicit tests for:

  • dimensions only,
  • requestDimensions only,
  • both set but different.

That should make the validation boundary deterministic and prevent insert-time surprises.

@JiwaniZakir

One edge case worth flagging: if requestDimensions differs from dimensions and the embedding provider actually returns vectors sized to requestDimensions, those vectors will fail schema validation against LanceDB since the table was created with dimensions. The validation using the effective runtime dimension doesn't help if the schema itself is fixed at write time. It might be worth adding an explicit guard that rejects configs where requestDimensions !== dimensions unless the schema is being rebuilt, or at minimum surfacing a warning at startup.
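
The guard proposed here could be sketched as a config-time check. `checkDimensionConsistency` and the `rebuildingSchema` flag are hypothetical; the thread does not say how a schema rebuild would be signalled.

```typescript
// Hypothetical config shape for illustration only.
interface EmbeddingConfig {
  dimensions?: number;
  requestDimensions?: number;
}

// Returns a list of problems; an empty list means the config is consistent.
// The caller decides whether to throw or only log a startup warning.
function checkDimensionConsistency(
  config: EmbeddingConfig,
  rebuildingSchema = false,
): string[] {
  const { dimensions, requestDimensions } = config;
  // Only a config that sets both fields can contradict itself.
  if (dimensions === undefined || requestDimensions === undefined) return [];
  if (dimensions === requestDimensions || rebuildingSchema) return [];
  return [
    `embedding.requestDimensions (${requestDimensions}) differs from ` +
      `embedding.dimensions (${dimensions}); vectors of length ` +
      `${requestDimensions} will not fit a table created for ${dimensions}.`,
  ];
}
```

Returning messages instead of throwing lets the same check back either a hard rejection or the softer startup warning mentioned above.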

app3apps commented Apr 5, 2026

Re-checking after the follow-up, I still see one blocking issue before merge.

On the current PR head (f234f95), the embedder and the store still derive their effective dimensions from different sources:

  • Embedder now validates against requestDimensions ?? dimensions
  • store initialization still sizes LanceDB with getVectorDimensions(model, config.embedding.dimensions)

So a config like text-embedding-3-large + requestDimensions: 1024 still ends up split at the storage boundary: the embedder accepts/validates 1024-dim vectors, but the table is still sized for the model/default or embedding.dimensions path. That means requestDimensions-only or mismatched configs are still not actually safe end-to-end.

The April 3 follow-up comment said schema sizing would be aligned to the same effective-dimension rule, but I don’t see that change in the branch yet.

Once store sizing and embedder validation use the same source of truth for the effective dimension, or the PR explicitly rejects mismatched configs, this looks ready from my side.

Non-blocking: the migration/deprecation warning for dimensions-only configs can be follow-up if you want to keep this PR scoped.

Collaborator

@AliceLJY AliceLJY left a comment

Revisiting after the latest review comment — there's a valid blocking issue.

Problem: The effective dimension source-of-truth is split:

  • Embedder validates against requestDimensions ?? dimensions
  • Store initialization sizes the LanceDB table via getVectorDimensions(model, dimensions) — ignoring requestDimensions

This means a config like text-embedding-3-large + requestDimensions: 1024 will pass embedder validation but hit a dimension mismatch at the storage layer.

What's needed:
Store sizing must use the same effective dimension as the embedder: requestDimensions ?? dimensions ?? modelDefault. Once both paths converge on the same source, this is ready to merge.
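
One way to get a single source of truth is a shared resolver that both the embedder and the store call. This is a sketch: `effectiveDimensions` and `MODEL_DEFAULT_DIMENSIONS` are hypothetical names, and the table lists only the two model defaults quoted in this thread.

```typescript
// Model defaults mentioned in this thread; a real table would cover more models.
const MODEL_DEFAULT_DIMENSIONS: Readonly<Record<string, number | undefined>> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// Implements requestDimensions ?? dimensions ?? modelDefault in one place so
// table sizing and embedding validation cannot drift apart.
function effectiveDimensions(
  model: string,
  config: { dimensions?: number; requestDimensions?: number },
): number {
  const resolved =
    config.requestDimensions ??
    config.dimensions ??
    MODEL_DEFAULT_DIMENSIONS[model];
  if (resolved === undefined) {
    throw new Error(
      `Cannot determine vector dimensions for model "${model}"; ` +
        "set embedding.dimensions or embedding.requestDimensions.",
    );
  }
  return resolved;
}
```

With this in place, text-embedding-3-large plus `requestDimensions: 1024` resolves to 1024 for both the LanceDB table and the validation check.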

The migration warning for dimensions-only configs can be a follow-up.

@rwmjhb
Collaborator

rwmjhb commented Apr 9, 2026

Superseded by #572, which rebases the fix onto the current base branch and closes the remaining request-dimension/schema-sizing mismatch called out in review.

@rwmjhb rwmjhb closed this Apr 9, 2026