
Conversation

Collaborator

@avirajsingh7 avirajsingh7 commented Dec 10, 2025

Summary

This change refactors the evaluation run process to utilize a stored configuration instead of a configuration dictionary. It introduces fields for config_id, config_version, and model in the evaluation run table, streamlining the evaluation process and improving data integrity.

Checklist

Before submitting a pull request, please ensure that you have completed these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Summary by CodeRabbit

  • New Features

    • Evaluations now reference stored configurations (config_id + config_version) and store the chosen model with each run for better auditability and reusability.
  • Behavior Change

    • Embedding pipeline now uses a standardized default embedding model for consistent processing.
  • Tests

    • Updated tests to create and reference persisted configs and to validate config-not-found scenarios.


@coderabbitai

coderabbitai bot commented Dec 10, 2025

Walkthrough

Refactors evaluation runs to reference persisted configs: EvaluationRun now stores config_id (UUID), config_version (int) and model instead of embedding a config dict. Adds an Alembic migration to add those columns and a foreign key to the config table. API and CRUD layers resolve stored configs before creating runs.

Changes

  • Database Migration: backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
    Adds config_id (UUID, nullable), config_version (Integer, nullable), and model (AutoString, nullable) to evaluation_run; drops the config column; creates a foreign key from evaluation_run.config_id to config.id. Includes a downgrade that restores config and removes the new columns/FK (see the sketch after this list).
  • Model Definition: backend/app/models/evaluation.py
    Replaces config: dict[str, Any] on EvaluationRun/EvaluationRunPublic with config_id (UUID), config_version (int), and model (str) fields.
  • API Route: backend/app/api/routes/evaluation.py
    evaluate(...) signature changed to accept config_id: UUID and config_version: int (removed config and assistant_id). Resolves the config via ConfigVersionCrud, validates that the provider is OPENAI, extracts the model from config.completion.params, and passes config.completion.params when starting evaluations. Error paths return HTTP 400 for resolution failures.
  • CRUD Core: backend/app/crud/evaluations/core.py
    create_evaluation_run(...) signature updated to take config_id: UUID, config_version: int, and model: str in place of the config dict.
  • Embedding & Processing: backend/app/crud/evaluations/embeddings.py, backend/app/crud/evaluations/processing.py
    Embeddings: now uses the constant embedding model "text-embedding-3-large" (removed config-based selection/validation). Processing: cost tracking/model usage now reads eval_run.model instead of a config dict.
  • Tests: backend/app/tests/api/routes/test_evaluation.py
    Tests updated to create persisted configs (via create_test_config) and send config_id/config_version (plus model) instead of inline config dicts. Adjusted assertions for config-not-found and related flows; added uuid4 usage for non-existent config IDs.
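A minimal sketch of what this migration's upgrade/downgrade could look like, assuming SQLAlchemy-2-style sa.Uuid() for config_id, sa.JSON() for the restored config column, and placeholder revision/constraint names; the generated file will differ in those details.

"""Sketch of the 7b48f23ebfdd migration described above, not the generated file."""
import sqlalchemy as sa
import sqlmodel
from alembic import op

# Revision identifiers; down_revision is a placeholder here.
revision = "7b48f23ebfdd"
down_revision = None


def upgrade() -> None:
    # New nullable columns that reference the stored config and snapshot the model.
    op.add_column("evaluation_run", sa.Column("config_id", sa.Uuid(), nullable=True))
    op.add_column("evaluation_run", sa.Column("config_version", sa.Integer(), nullable=True))
    op.add_column(
        "evaluation_run",
        sa.Column("model", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
    )
    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    # The inline config dict is no longer stored on the run.
    op.drop_column("evaluation_run", "config")


def downgrade() -> None:
    # Restore the inline config column and remove the new columns/FK.
    op.add_column("evaluation_run", sa.Column("config", sa.JSON(), nullable=True))
    op.drop_constraint(
        "evaluation_run_config_id_fkey",  # placeholder; use the real constraint name
        "evaluation_run",
        type_="foreignkey",
    )
    op.drop_column("evaluation_run", "model")
    op.drop_column("evaluation_run", "config_version")
    op.drop_column("evaluation_run", "config_id")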

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as API Route
    participant ConfigCrud as ConfigVersionCrud
    participant DB as Database
    participant EvalCrud as Eval CRUD

    Client->>API: POST /evaluate (config_id, config_version, ...)
    API->>ConfigCrud: resolve(config_id, config_version)
    ConfigCrud->>DB: SELECT config_version WHERE id = config_id AND version = config_version
    DB-->>ConfigCrud: config record / not found
    alt config not found
        ConfigCrud-->>API: raise not-found/error
        API-->>Client: HTTP 400 (config resolution error)
    else config resolved
        ConfigCrud-->>API: return config object
        API->>API: extract model from config.completion.params
        API->>API: validate provider == OPENAI
        API->>EvalCrud: create_evaluation_run(config_id, config_version, model, ...)
        EvalCrud->>DB: INSERT evaluation_run (config_id, config_version, model, ...)
        DB-->>EvalCrud: created record
        EvalCrud-->>API: evaluation_run
        API-->>Client: HTTP 200 (evaluation created)
    end
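In code, the diagram's happy path and error path correspond roughly to the sketch below. ConfigVersionCrud's constructor and resolution method (get here) and create_evaluation_run's full parameter list are assumptions; only the overall flow mirrors the PR.

from uuid import UUID

from fastapi import HTTPException
from sqlmodel import Session

from app.crud.config.version import ConfigVersionCrud
from app.crud.evaluations.core import create_evaluation_run
from app.services.llm.providers import LLMProvider


def start_evaluation(session: Session, config_id: UUID, config_version: int):
    """Sketch of the route-side flow shown in the diagram above."""
    try:
        # Resolve the persisted config; the actual ConfigVersionCrud API may differ.
        config = ConfigVersionCrud(session).get(config_id, config_version)
    except Exception as exc:
        # Config not found or resolution failure maps to HTTP 400 per the PR.
        raise HTTPException(status_code=400, detail=f"Could not resolve config: {exc}")

    # Only OpenAI-backed configs are accepted for evaluation runs.
    if config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(status_code=400, detail="Evaluation requires an OPENAI provider config")

    # Snapshot the model from the stored config for cost tracking.
    model = config.completion.params.get("model")

    # Other run parameters (dataset, run name, etc.) are omitted from this sketch.
    return create_evaluation_run(
        session=session,
        config_id=config_id,
        config_version=config_version,
        model=model,
    )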

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to the migration (FK, nullable handling, downgrade correctness).
  • Review config resolution/error handling in api/routes/evaluation.py and interactions with ConfigVersionCrud.
  • Verify model extraction from config.completion.params covers edge cases.
  • Check tests to ensure they properly create test configs and cover not-found paths.
  • Confirm embedding change to hardcoded model is intentional and documented.

Poem

A rabbit hid configs with care,
In rows of tables, safe and fair,
References now, tidy and light—
I nibble bugs away at night.
🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Evaluation to Use Config Management' directly aligns with the main change: refactoring evaluations to use stored config management instead of configuration dictionaries.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch evals/config_addition

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@avirajsingh7 avirajsingh7 linked an issue Dec 10, 2025 that may be closed by this pull request
@avirajsingh7 avirajsingh7 self-assigned this Dec 10, 2025
@avirajsingh7 avirajsingh7 added the enhancement and ready-for-review labels Dec 10, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
backend/app/crud/evaluations/embeddings.py (1)

366-367: Misleading comment - update to reflect actual behavior.

The comment says "Get embedding model from config" but the code hardcodes the value. Update the comment to accurately describe the implementation.

-        # Get embedding model from config (default: text-embedding-3-large)
-        embedding_model = "text-embedding-3-large"
+        # Use fixed embedding model (text-embedding-3-large)
+        embedding_model = "text-embedding-3-large"
backend/app/tests/api/routes/test_evaluation.py (1)

524-545: Consider renaming function to match its new purpose.

The function test_start_batch_evaluation_missing_model was repurposed to test invalid config_id scenarios. The docstring was updated but the function name still references "missing_model". Consider renaming for clarity.

-    def test_start_batch_evaluation_missing_model(self, client, user_api_key_header):
-        """Test batch evaluation fails with invalid config_id."""
+    def test_start_batch_evaluation_invalid_config_id(self, client, user_api_key_header):
+        """Test batch evaluation fails with invalid config_id."""
backend/app/api/routes/evaluation.py (1)

499-510: Consider validating that model is present in config params.

The model is extracted with .get("model") which returns None if not present. Since model is critical for cost tracking (used in create_langfuse_dataset_run), consider validating its presence and returning an error if missing.

     # Extract model from config for storage
     model = config.completion.params.get("model")
+    if not model:
+        raise HTTPException(
+            status_code=400,
+            detail="Config must specify a 'model' in completion params for evaluation",
+        )

     # Create EvaluationRun record with config references
backend/app/crud/evaluations/core.py (1)

15-69: Config-based create_evaluation_run refactor is correctly implemented; consider logging model for improved traceability.

The refactor from inline config dict to config_id: UUID and config_version: int is properly implemented throughout:

  • The sole call site in backend/app/api/routes/evaluation.py:503 correctly passes all new parameters with the right types (config_id as UUID, config_version as int, model extracted from config).
  • The EvaluationRun model in backend/app/models/evaluation.py correctly defines all three fields with appropriate types and descriptions.
  • All type hints align with Python 3.11+ guidelines.

One suggested improvement for debugging:

Include model in the creation log for better traceability when correlating evaluation runs with model versions:

logger.info(
    f"Created EvaluationRun record: id={eval_run.id}, run_name={run_name}, "
-   f"config_id={config_id}, config_version={config_version}"
+   f"config_id={config_id}, config_version={config_version}, model={model}"
)

Since the model is already extracted at the call site and passed to the function, including it in the log will provide fuller context for operational debugging without any additional cost.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30ef268 and d5f9d4d.

📒 Files selected for processing (7)
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py (1 hunks)
  • backend/app/api/routes/evaluation.py (5 hunks)
  • backend/app/crud/evaluations/core.py (5 hunks)
  • backend/app/crud/evaluations/embeddings.py (1 hunks)
  • backend/app/crud/evaluations/processing.py (1 hunks)
  • backend/app/models/evaluation.py (3 hunks)
  • backend/app/tests/api/routes/test_evaluation.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/api/routes/evaluation.py
  • backend/app/models/evaluation.py
  • backend/app/crud/evaluations/embeddings.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
  • backend/app/alembic/versions/7b48f23ebfdd_add_config_id_and_version_in_evals_run_.py
backend/app/api/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Expose FastAPI REST endpoints under backend/app/api/ organized by domain

Files:

  • backend/app/api/routes/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
backend/app/crud/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement database access operations in backend/app/crud/

Files:

  • backend/app/crud/evaluations/embeddings.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/crud/evaluations/core.py
🧬 Code graph analysis (2)
backend/app/tests/api/routes/test_evaluation.py (2)
backend/app/crud/evaluations/batch.py (1)
  • build_evaluation_jsonl (62-115)
backend/app/models/evaluation.py (2)
  • EvaluationDataset (74-130)
  • EvaluationRun (133-248)
backend/app/crud/evaluations/processing.py (1)
backend/app/crud/evaluations/langfuse.py (1)
  • create_langfuse_dataset_run (20-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (3)
backend/app/crud/evaluations/processing.py (1)

257-264: LGTM! Clean refactor to use stored model field.

The change correctly retrieves the model from eval_run.model instead of extracting it from config. This aligns with the new data model where the model is snapshotted at evaluation creation time.
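As a rough illustration of what this looks like at the call site (the keyword arguments for create_langfuse_dataset_run are assumptions, not the actual signature):

from app.crud.evaluations.langfuse import create_langfuse_dataset_run
from app.models.evaluation import EvaluationRun


def record_run_in_langfuse(eval_run: EvaluationRun) -> None:
    # The model is now read from the stored run rather than from a config dict;
    # the keyword arguments below are assumed for this sketch.
    create_langfuse_dataset_run(
        eval_run=eval_run,
        model=eval_run.model,  # drives per-model cost tracking
    )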

backend/app/models/evaluation.py (1)

148-157: LGTM! Well-structured config reference fields.

The new config_id and config_version fields properly establish the relationship to stored configs with appropriate constraints (ge=1 for version). The nullable design allows backward compatibility with existing data.

backend/app/api/routes/evaluation.py (1)

478-495: LGTM! Robust config resolution with provider validation.

The config resolution flow properly validates that the stored config exists and uses the OPENAI provider. Error handling returns appropriate HTTP 400 responses with descriptive messages.

@codecov
Copy link

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 65.51724% with 10 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/api/routes/evaluation.py: 38.46% patch coverage, 8 lines missing ⚠️
  • backend/app/crud/evaluations/core.py: 50.00% patch coverage, 1 line missing ⚠️
  • backend/app/crud/evaluations/embeddings.py: 0.00% patch coverage, 1 line missing ⚠️

📢 Thoughts on this report? Let us know!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/app/models/evaluation.py (1)

148-158: Align EvaluationRun type hints with nullable DB columns for config fields

config_id and config_version are nullable in the schema but annotated as non-optional types. This can mislead callers and type checkers into assuming they’re always present, even for legacy runs or transitional data.

Consider updating the annotations to reflect nullability:

-    config_id: UUID = SQLField(
+    config_id: UUID | None = SQLField(
         foreign_key="config.id",
         nullable=True,
         description="Reference to the stored config used for this evaluation",
     )
-    config_version: int = SQLField(
+    config_version: int | None = SQLField(
         nullable=True,
         ge=1,
         description="Version of the config used for this evaluation",
     )

This keeps the schema the same while making runtime and type expectations clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5f9d4d and eda7762.

📒 Files selected for processing (1)
  • backend/app/models/evaluation.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use type hints in Python code (Python 3.11+ project)

Files:

  • backend/app/models/evaluation.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Define SQLModel entities (database tables and domain objects) in backend/app/models/

Files:

  • backend/app/models/evaluation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/models/evaluation.py (1)

271-273: Public model nullability now matches the schema

Making config_id, config_version, and model nullable in EvaluationRunPublic correctly reflects the DB fields and avoids validation issues for existing rows. This resolves the earlier mismatch between the table and the public model.
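For reference, the three public fields discussed here would look roughly like the sketch below; the real EvaluationRunPublic carries the rest of the run's public attributes and may inherit from a different base class.

from uuid import UUID

from sqlmodel import SQLModel


class EvaluationRunPublic(SQLModel):
    # Only the three fields discussed above; the actual class has more.
    config_id: UUID | None = None
    config_version: int | None = None
    model: str | None = None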

    op.create_foreign_key(None, "evaluation_run", "config", ["config_id"], ["id"])
    op.add_column(
        "evaluation_run",
        sa.Column("model", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
Collaborator

model should be part of the config as well

Comment on lines +36 to +39
from app.services.llm.jobs import resolve_config_blob
from app.services.llm.providers import LLMProvider
from app.models.llm.request import LLMCallConfig
from app.crud.config.version import ConfigVersionCrud
Collaborator

This, along with the others, should follow alphabetical order, like:

app.crud
app.models
app.services

For example, the hunk above could be reordered as shown below.
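A possible ordering of the four imports from the hunk above (illustration only):

# Same four imports, sorted alphabetically by module path.
from app.crud.config.version import ConfigVersionCrud
from app.models.llm.request import LLMCallConfig
from app.services.llm.jobs import resolve_config_blob
from app.services.llm.providers import LLMProvider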

            detail="Config must include 'model' when assistant_id is not provided",
        )
    # Extract model from config for storage
    model = config.completion.params.get("model")
Collaborator

We can get rid of this once model moves into the config.

    )
    elif config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=400,
Collaborator

The status code should probably not be 400, since this is a validation error, not an error thrown by the server itself (i.e. JSON parsing, 422 errors, et al.).
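One possible reading of this suggestion, sketched with the same HTTPException-based handling, is to return 422 for the provider check instead of 400:

from fastapi import HTTPException

from app.services.llm.providers import LLMProvider


def ensure_openai_provider(config) -> None:
    # Sketch of one alternative: treat an unsupported provider as a request
    # validation failure (422) rather than a generic 400.
    if config.completion.provider != LLMProvider.OPENAI:
        raise HTTPException(
            status_code=422,
            detail="Evaluation currently supports only OPENAI provider configs",
        )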


Labels

enhancement, ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add config management in Evals

4 participants