
Conversation

Collaborator

@avirajsingh7 avirajsingh7 commented Dec 10, 2025

Summary

This change refactors the evaluation run process to utilize a stored configuration instead of a configuration dictionary. It introduces fields for config_id, config_version, and model in the evaluation run table, streamlining the evaluation process and improving data integrity.

Checklist

Before submitting a pull request, please ensure that you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Summary by CodeRabbit

  • New Features

    • Evaluations now reference stored configurations (config_id + config_version) and store the chosen model with each run for better auditability and reusability.
  • Behavior Change

    • Embedding pipeline now uses a standardized default embedding model for consistent processing.
  • Tests

    • Updated tests to create and reference persisted configs and to validate config-not-found scenarios.


@coderabbitai

coderabbitai bot commented Dec 10, 2025

Walkthrough

Refactors evaluation runs to reference persisted configs: EvaluationRun now stores config_id (UUID), config_version (int) and model instead of embedding a config dict. Adds an Alembic migration to add those columns and a foreign key to the config table. API and CRUD layers resolve stored configs before creating runs.

Changes

  • Database Migration: backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
    Adds config_id (UUID, nullable), config_version (Integer, nullable), and model (AutoString, nullable) to evaluation_run; drops the config column; creates a foreign key from evaluation_run.config_id to config.id. Includes a downgrade that restores config and removes the new columns/FK (see the sketch after this list).
  • Model Definition: backend/app/models/evaluation.py
    Replaces config: dict[str, Any] on EvaluationRun/EvaluationRunPublic with config_id (UUID), config_version (int), and model (str) fields.
  • API Route: backend/app/api/routes/evaluation.py
    evaluate(...) signature changed to accept config_id: UUID and config_version: int (removed config and assistant_id). Resolves the config via ConfigVersionCrud, validates that the provider is OPENAI, extracts the model from config.completion.params, and passes config.completion.params when starting evaluations. Error paths return HTTP 400 for resolution failures.
  • CRUD Core: backend/app/crud/evaluations/core.py
    create_evaluation_run(...) signature updated to take config_id: UUID, config_version: int, and model: str in place of the config dict.
  • Embedding & Processing: backend/app/crud/evaluations/embeddings.py, backend/app/crud/evaluations/processing.py
    Embeddings: now uses the constant embedding model "text-embedding-3-large" (removed config-based selection/validation). Processing: cost tracking/model usage now reads eval_run.model instead of a config dict.
  • Tests: backend/app/tests/api/routes/test_evaluation.py
    Tests updated to create persisted configs (via create_test_config) and send config_id/config_version (plus model) instead of inline config dicts. Adjusted assertions for config-not-found and related flows; added uuid4 usage for non-existent config IDs.
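A minimal sketch of what this migration's upgrade/downgrade could look like, assuming SQLAlchemy-2-style sa.Uuid() for config_id, sa.JSON() for the restored config column, and placeholder revision/constraint names; the generated file will differ in those details.

"""Sketch of the 7b48f23ebfdd migration described above, not the generated file."""
import sqlalchemy as sa
import sqlmodel
from alembic import op

# Revision identifiers; down_revision is a placeholder here.
revision = "7b48f23ebfdd"
down_revision = None


def upgrade() -> None:
    # New nullable columns that reference the stored config and snapshot the model.
    op.add_column("evaluation_run", sa.Column("config_id", sa.Uuid(), nullable=True))
    op.add_column("evaluation_run", sa.Column("config_version", sa.Integer(), nullable=True))
    op.add_column(
        "evaluation_run",
        sa.Column("model", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
    )
    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    # The inline config dict is no longer stored on the run.
    op.drop_column("evaluation_run", "config")


def downgrade() -> None:
    # Restore the inline config column and remove the new columns/FK.
    op.add_column("evaluation_run", sa.Column("config", sa.JSON(), nullable=True))
    op.drop_constraint(
        "evaluation_run_config_id_fkey",  # placeholder; use the real constraint name
        "evaluation_run",
        type_="foreignkey",
    )
    op.drop_column("evaluation_run", "model")
    op.drop_column("evaluation_run", "config_version")
    op.drop_column("evaluation_run", "config_id")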

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as API Route
    participant ConfigCrud as ConfigVersionCrud
    participant DB as Database
    participant EvalCrud as Eval CRUD

    Client->>API: POST /evaluate (config_id, config_version, ...)
    API->>ConfigCrud: resolve(config_id, config_version)
    ConfigCrud->>DB: SELECT config_version WHERE id = config_id AND version = config_version
    DB-->>ConfigCrud: config record / not found
    alt config not found
        ConfigCrud-->>API: raise not-found/error
        API-->>Client: HTTP 400 (config resolution error)
    else config resolved
        ConfigCrud-->>API: return config object
        API->>API: extract model from config.completion.params
        API->>API: validate provider == OPENAI
        API->>EvalCrud: create_evaluation_run(config_id, config_version, model, ...)
        EvalCrud->>DB: INSERT evaluation_run (config_id, config_version, model, ...)
        DB-->>EvalCrud: created record
        EvalCrud-->>API: evaluation_run
        API-->>Client: HTTP 200 (evaluation created)
    end
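In code, the diagram's happy path and error path correspond roughly to the sketch below. ConfigVersionCrud's constructor and resolution method (get here) and create_evaluation_run's full parameter list are assumptions; only the overall flow mirrors the PR.

from uuid import UUID

from fastapi import HTTPException
from sqlmodel import Session

from app.crud.config.version import ConfigVersionCrud
from app.crud.evaluations.core import create_evaluation_run
from app.services.llm.providers import LLMProvider


def start_evaluation(session: Session, config_id: UUID, config_version: int):
    """Sketch of the route-side flow shown in the diagram above."""
    try:
        # Resolve the persisted config; the actual ConfigVersionCrud API may differ.
        config = ConfigVersionCrud(session).get(config_id, config_version)
    except Exception as exc:
        # Config not found or resolution failure maps to HTTP 400 per the PR.
        raise HTTPException(status_code=400, detail=f"Could not resolve config: {exc}")

    # Only OpenAI-backed configs are accepted for evaluation runs.
    if config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(status_code=400, detail="Evaluation requires an OPENAI provider config")

    # Snapshot the model from the stored config for cost tracking.
    model = config.completion.params.get("model")

    # Other run parameters (dataset, run name, etc.) are omitted from this sketch.
    return create_evaluation_run(
        session=session,
        config_id=config_id,
        config_version=config_version,
        model=model,
    )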

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to the migration (FK, nullable handling, downgrade correctness).
  • Review config resolution/error handling in api/routes/evaluation.py and interactions with ConfigVersionCrud.
  • Verify model extraction from config.completion.params covers edge cases.
  • Check tests to ensure they properly create test configs and cover not-found paths.
  • Confirm embedding change to hardcoded model is intentional and documented.

Poem

A rabbit hid configs with care,
In rows of tables, safe and fair,
References now, tidy and light—
I nibble bugs away at night.
🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Evaluation to Use Config Management' directly aligns with the main change: refactoring evaluations to use stored config management instead of configuration dictionaries.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch evals/config_addition

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added the enhancement and ready-for-review labels Dec 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.
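As a rough illustration of what this looks like at the call site (the keyword arguments for create_langfuse_dataset_run are assumptions, not the actual signature):

from app.crud.evaluations.langfuse import create_langfuse_dataset_run
from app.models.evaluation import EvaluationRun


def record_run_in_langfuse(eval_run: EvaluationRun) -> None:
    # The model is now read from the stored run rather than from a config dict;
    # the keyword arguments below are assumed for this sketch.
    create_langfuse_dataset_run(
        eval_run=eval_run,
        model=eval_run.model,  # drives per-model cost tracking
    )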

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.

@codecov
Copy link

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 65.51724% with 10 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/api/routes/evaluation.py: 38.46% patch coverage, 8 lines missing ⚠️
  • backend/app/crud/evaluations/core.py: 50.00% patch coverage, 1 line missing ⚠️
  • backend/app/crud/evaluations/embeddings.py: 0.00% patch coverage, 1 line missing ⚠️

📢 Thoughts on this report? Let us know!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.
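For reference, the three public fields discussed here would look roughly like the sketch below; the real EvaluationRunPublic carries the rest of the run's public attributes and may inherit from a different base class.

from uuid import UUID

from sqlmodel import SQLModel


class EvaluationRunPublic(SQLModel):
    # Only the three fields discussed above; the actual class has more.
    config_id: UUID | None = None
    config_version: int | None = None
    model: str | None = None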

    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    op.add_column(
        "evaluation_run",
        sa.Column("model", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
Collaborator

model should be part of the config as well

Comment on lines +36 to +39
from app.services.llm.jobs import resolve_config_blob
from app.services.llm.providers import LLMProvider
from app.models.llm.request import LLMCallConfig
from app.crud.config.version import ConfigVersionCrud
Collaborator

This, along with the others, should follow alphabetical order, like:

app.crud
app.models
app.services

For example, the hunk above could be reordered as shown below.
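A possible ordering of the four imports from the hunk above (illustration only):

# Same four imports, sorted alphabetically by module path.
from app.crud.config.version import ConfigVersionCrud
from app.models.llm.request import LLMCallConfig
from app.services.llm.jobs import resolve_config_blob
from app.services.llm.providers import LLMProvider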

            detail="Config must include 'model' when assistant_id is not provided",
        )
    # Extract model from config for storage
    model = config.completion.params.get("model")
Collaborator

We can get rid of this once model moves into the config.

    )
    elif config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=400,
Collaborator

The status code should probably not be 400, since this is a validation error, not an error thrown by the server itself (i.e. JSON parsing, 422 errors, et al.).
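One possible reading of this suggestion, sketched with the same HTTPException-based handling, is to return 422 for the provider check instead of 400:

from fastapi import HTTPException

from app.services.llm.providers import LLMProvider


def ensure_openai_provider(config) -> None:
    # Sketch of one alternative: treat an unsupported provider as a request
    # validation failure (422) rather than a generic 400.
    if config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=422,
            detail="Evaluation currently supports only OPENAI provider configs",
        )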


Labels

enhancement, ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add config management in Evals

4 participants