feat(schema-bench): add LangSmith integration for evaluation tracking #3715
Conversation
Vercel bot: 2 skipped deployments. Updates to Preview Branch (feat/langsmith-integration). Tasks are run on every commit but only new migration files are pushed.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds LangSmith integration to schema-bench: new CLI scripts to upload datasets and run evaluations, a LangSmith evaluator and types, package.json script and dependency updates, and a refactor of the JSON-loading utilities and executor call sites to use a generalized loader (loadInputFiles renamed to loadJsonFiles).
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant CLI as evaluateWithLangsmith CLI
  participant Utils as Dataset Utils
  participant Exec as Executor (LiamDB/OpenAI)
  participant LS as LangSmith
  participant Eval as schemaEvaluator
  Dev->>CLI: Run "pnpm schema-bench langsmith --liamdb|--openai ..."
  CLI->>CLI: Parse flags (num-repetitions, max-concurrency)
  CLI->>Utils: Resolve workspace & select datasets
  Utils-->>CLI: Dataset list
  loop per dataset (concurrent)
    CLI->>Exec: Execute(case.data)
    Exec-->>CLI: Output schema
    CLI->>LS: Upload run + outputs
    LS->>Eval: Evaluate(reference, output)
    Eval-->>LS: Return metrics
    LS-->>CLI: Store run & metrics
  end
  CLI-->>Dev: Exit (success/failure)
```

```mermaid
sequenceDiagram
  autonumber
  actor Dev as Developer
  participant UCLI as uploadDatasetToLangsmith CLI
  participant FS as Filesystem
  participant LS as LangSmith
  Dev->>UCLI: Run "pnpm schema-bench langsmith:upload --datasets ..."
  UCLI->>FS: Read execution/input (loadJsonFiles)
  UCLI->>FS: Read execution/reference (loadJsonFiles)
  UCLI->>UCLI: Validate inputs & references
  UCLI->>LS: Get or create dataset
  loop For each caseId
    alt reference exists
      UCLI->>LS: Create or update example (input + reference)
      LS-->>UCLI: Example upserted
    else missing reference
      UCLI-->>UCLI: Warn and skip
    end
  end
  UCLI-->>Dev: Summary and exit code
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
frontend/internal-packages/schema-bench/src/langsmith/types.ts (1)
Lines 1-15: Import cycle risk from referencing shared LangSmith types

This types file isn’t referenced anywhere in the submitted changes; every consumer (upload CLI, evaluator, etc.) defines its own shapes inline. Keeping an unused export invites drift and can mislead API consumers. Either delete the file or wire these types into the new CLIs so the surface stays coherent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (10)

- frontend/internal-packages/schema-bench/README.md (1 hunks)
- frontend/internal-packages/schema-bench/package.json (2 hunks)
- frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts (3 hunks)
- frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts (4 hunks)
- frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/utils/index.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (2 hunks)
- frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/langsmith/types.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Name utility files in camelCase (e.g., mergeSchema.ts)
Files:
frontend/internal-packages/schema-bench/src/cli/utils/index.ts
frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts
frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts
frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts
frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts
frontend/internal-packages/schema-bench/src/langsmith/types.ts
frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts
frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use TypeScript/TSX across the codebase
**/*.{ts,tsx}:
Use runtime type validation with valibot for external data validation
Prefer early returns for readability
Write simple, direct code without backward compatibility shims; update all call sites together
Use const-assigned arrow functions instead of function declarations for small utilities (e.g., const toggle = () => {})
Follow existing import patterns and tsconfig path aliases
Files:
frontend/internal-packages/schema-bench/src/cli/utils/index.ts
frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts
frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts
frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts
frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts
frontend/internal-packages/schema-bench/src/langsmith/types.ts
frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts
frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts
frontend/internal-packages/**
📄 CodeRabbit inference engine (AGENTS.md)
Infra and tooling (e2e, configs, storybook, agent) live under frontend/internal-packages
Files:
frontend/internal-packages/schema-bench/src/cli/utils/index.ts
frontend/internal-packages/schema-bench/package.json
frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts
frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts
frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts
frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts
frontend/internal-packages/schema-bench/src/langsmith/types.ts
frontend/internal-packages/schema-bench/README.md
frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts
frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts
🧠 Learnings (2)
📚 Learning: 2025-07-30T05:52:56.270Z
Learnt from: hoshinotsuyoshi
PR: liam-hq/liam#2771
File: frontend/internal-packages/schema-bench/src/cli/executeLiamDb.ts:22-22
Timestamp: 2025-07-30T05:52:56.270Z
Learning: The schema-bench package (frontend/internal-packages/schema-bench) has been converted from ESM to CommonJS mode by removing "type": "module" from package.json, making __dirname available and correct to use in TypeScript files within this package.
Applied to files:
frontend/internal-packages/schema-bench/src/cli/utils/index.ts
frontend/internal-packages/schema-bench/package.json
📚 Learning: 2025-09-02T05:53:14.356Z
Learnt from: CR
PR: liam-hq/liam#0
File: AGENTS.md:0-0
Timestamp: 2025-09-02T05:53:14.356Z
Learning: Before pushing, run formatting, linting, and tests: pnpm fmt && pnpm lint && pnpm test
Applied to files:
frontend/internal-packages/schema-bench/package.json
🧬 Code graph analysis (5)
frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1)
- frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1): evaluate (220-284)

frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts (1)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (1): loadJsonFiles (17-71)

frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts (6)
- frontend/internal-packages/schema-bench/src/langsmith/types.ts (1): LangSmithDatasetConfig (12-15)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (1): loadJsonFiles (17-71)
- frontend/internal-packages/schema-bench/src/cli/utils/parseArgs.ts (1): parseArgs (17-37)
- frontend/internal-packages/schema-bench/src/cli/utils/workspace.ts (1): getWorkspacePath (14-16)
- frontend/internal-packages/schema-bench/src/cli/utils/selectTargetDatasets.ts (1): selectTargetDatasets (13-28)
- frontend/internal-packages/schema-bench/src/cli/utils/filterAndResolveDatasets.ts (1): filterAndResolveDatasets (10-30)

frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (6)
- frontend/internal-packages/schema-bench/src/langsmith/types.ts (2): LangSmithInput (3-6), LangSmithOutput (8-10)
- frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1): schemaEvaluator (20-61)
- frontend/internal-packages/schema-bench/src/cli/utils/parseArgs.ts (1): parseArgs (17-37)
- frontend/internal-packages/schema-bench/src/cli/utils/workspace.ts (1): getWorkspacePath (14-16)
- frontend/internal-packages/schema-bench/src/cli/utils/selectTargetDatasets.ts (1): selectTargetDatasets (13-28)
- frontend/internal-packages/schema-bench/src/cli/utils/filterAndResolveDatasets.ts (1): filterAndResolveDatasets (10-30)

frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts (1)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (1): loadJsonFiles (17-71)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: frontend-ci
- GitHub Check: frontend-lint
- GitHub Check: Supabase Preview
- GitHub Check: security-review
🔇 Additional comments (1)
frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts (1)

Lines 12-15: Earlier feedback withdrawn
...
```ts
evaluate(target, {
  data: `schema-bench-${datasetName}`,
  evaluators: [schemaEvaluator],
  experimentPrefix: `${options.executorType}-${datasetName}`,
  maxConcurrency: options.maxConcurrency,
  numRepetitions: options.numRepetitions,
}),
```
This is the main function for submitting evaluations to LangSmith.
ref: https://docs.langchain.com/langsmith/code-evaluator
Pass in the target and evaluators to execute and evaluate in one go. Use numRepetitions to run multiple executions.
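For orientation, a code evaluator in this style is just a function that receives the run's outputs plus the reference outputs and returns `{ key, score }` results. Below is a minimal, self-contained sketch; the name `tableRecallEvaluator` and the payload shapes are illustrative assumptions, not the PR's actual `schemaEvaluator`.

```typescript
// Illustrative shape of a code evaluator: compare the executor's output
// schema against the reference schema and return a keyed score.
type Tables = Record<string, unknown>
type SchemaPayload = { schema: { tables: Tables } }

const tableRecallEvaluator = (args: {
  outputs: SchemaPayload
  referenceOutputs: SchemaPayload
}): { key: string; score: number } => {
  const produced = new Set(Object.keys(args.outputs.schema.tables))
  const expected = Object.keys(args.referenceOutputs.schema.tables)
  const matched = expected.filter((name) => produced.has(name)).length
  // Recall: fraction of expected tables that the executor actually produced.
  return {
    key: 'Table Recall',
    score: expected.length ? matched / expected.length : 0,
  }
}
```

A function like this can be passed in the `evaluators` array shown above; LangSmith invokes it once per example (times `numRepetitions`).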
```ts
if (existingExample) {
  await client.updateExample({
    id: existingExample.id,
    inputs: input,
    outputs: reference,
  })
} else {
  await client.createExample({
    inputs: input,
    outputs: reference,
    dataset_id: datasetId,
    metadata: { caseId },
  })
}
```
Before running evaluations on LangSmith, we must first upload the dataset.
{ | ||
key: 'Table Count', | ||
score: countTables(outputSchema), | ||
}, | ||
{ | ||
key: 'Column Count', | ||
score: countColumns(outputSchema), | ||
}, |
To make the output easier to interpret, I've included the table and column counts as well.
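Assuming the usual schema shape (tables keyed by name, each carrying a `columns` record), the two count helpers can be sketched as:

```typescript
// Minimal sketch of the count helpers the evaluator uses. The Schema shape
// here is an assumption for illustration, not the package's exact type.
type ColumnMap = Record<string, unknown>
type Schema = { tables: Record<string, { columns: ColumnMap }> }

const countTables = (schema: Schema): number => Object.keys(schema.tables).length

const countColumns = (schema: Schema): number =>
  Object.values(schema.tables).reduce(
    (total, table) => total + Object.keys(table.columns).length,
    0,
  )
```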
Pull Request Overview
This PR adds comprehensive LangSmith integration to schema-bench for advanced evaluation tracking, visualization, and experiment comparison. The integration allows uploading local datasets to LangSmith and running evaluations with both LiamDB and OpenAI executors.
- LangSmith dataset upload functionality from local workspace files
- Evaluation execution system with configurable repetitions and concurrency
- Custom schema evaluator that tracks table/column counts and F1/recall scores
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/langsmith/types.ts | Defines TypeScript types for LangSmith integration |
| src/langsmith/schemaEvaluator.ts | Custom evaluator for schema similarity metrics |
| src/cli/utils/loadJsonFiles.ts | Refactored utility for generic JSON file loading |
| src/cli/utils/index.ts | Updated exports to reflect renamed utility |
| src/cli/uploadDatasetToLangsmith.ts | CLI tool for uploading datasets to LangSmith |
| src/cli/executeOpenaiUnified.ts | Updated to use refactored file loading utility |
| src/cli/executeLiamDbShared.ts | Updated to use refactored file loading utility |
| src/cli/evaluateWithLangsmith.ts | CLI tool for running evaluations with LangSmith |
| package.json | Added neverthrow dependency and npm scripts |
| README.md | Added documentation for LangSmith integration |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
🤖 Agent Deep Modeling Execution — Processing time: 3m 53s

[Command output elided: a Vitest v3.2.4 run under frontend/internal-packages/agent, with interleaved logs from the PM agent (saveRequirementsToArtifactTool), the DB agent (schemaDesignTool), and the QA agent (saveTestcase). The interleaved output is garbled beyond recovery; one QA test case errored: "Attempt registration with duplicate email is rejected (case-insensitive)".]
Replace specialized loadInputFiles with generic loadJsonFiles function. - Rename loadInputFiles.ts to loadJsonFiles.ts - Update executeLiamDbShared.ts to use loadJsonFiles with explicit path - Update executeOpenaiUnified.ts to use loadJsonFiles with explicit path - Change return value key from 'input' to 'data' for consistency - Remove loadInputFiles export from utils/index.ts This simplifies the codebase by having a single, flexible function for loading JSON files from any directory. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Add comprehensive LangSmith integration to schema-bench for advanced evaluation tracking, visualization, and experiment comparison. Key features: - LangSmith dataset upload from local workspace files - Evaluation execution with LiamDB and OpenAI executors - Schema metrics tracking (table/column counts, F1 scores, recall) - Configurable repetitions and concurrency - neverthrow-based error handling for type safety Implementation details: - Created evaluateWithLangsmith.ts for running evaluations - Created uploadDatasetToLangsmith.ts for dataset management - Added schemaEvaluator for computing schema similarity metrics - Integrated @liam-hq/neverthrow for functional error handling - Added loadJsonFiles utility for consistent file loading - Used ResultAsync.combine for parallel async operations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Replace manual error handler in fromPromise with @liam-hq/neverthrow's default error handler for consistency across the codebase. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Co-authored-by: Copilot <[email protected]>
…mith Add non-null assertion for apiKey after handleCliError check. Since handleCliError returns never, apiKey is guaranteed to be defined after the check, but TypeScript's control flow analysis doesn't recognize this. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
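The control-flow situation this commit describes can be sketched like so. The names handleCliError and requireApiKey are illustrative stand-ins, and the sketch throws instead of calling process.exit so it stays runnable.

```typescript
// A guard typed to return `never`: the real CLI logs and exits the process;
// throwing here keeps the sketch testable.
const handleCliError = (message: string): never => {
  throw new Error(message)
}

const requireApiKey = (apiKey: string | undefined): string => {
  if (!apiKey) {
    handleCliError('LANGSMITH_API_KEY is not set')
  }
  // TypeScript's narrowing after never-returning calls has limits (as the
  // commit notes), which is why the original code needed a non-null assertion.
  return apiKey
}
```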
Force-pushed 4d75f94 to 8852d25 (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (3)
frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (3)
Line 82: Remove redundant fallback and non-null assertion.

The apiKey is guaranteed to be truthy after the check on lines 78-80 (where handleCliError exits the process). Both the || '' fallback and the ! non-null assertion are unnecessary. Apply this diff:

```diff
- const executor = new OpenAIExecutor({ apiKey: apiKey! })
+ const executor = new OpenAIExecutor({ apiKey })
```
Line 97: Remove unreachable code.

handleCliError exits the process and never returns, making this line unreachable. Consider restructuring to make the control flow explicit or removing the return statement. If handleCliError is typed as never, consider this refactor:

```diff
- return handleCliError(`Unknown executor type: ${executorType}`)
+ handleCliError(`Unknown executor type: ${executorType}`)
```
Lines 119-125: Fix the ResultAsync return type.

The function is async but returns ResultAsync.combineWithAllErrors(results) directly without awaiting. This means await runDatasets(...) on line 151 resolves to the ResultAsync wrapper instead of a concrete Result, breaking the isErr() check on line 153. Apply this diff:

```diff
 const runDatasets = async (
   datasets: Array<{ name: string }>,
   options: ExecutorOptions,
 ) => {
   const results = datasets.map(({ name }) => runEvaluation(name, options))
-  return ResultAsync.combineWithAllErrors(results)
+  return await ResultAsync.combineWithAllErrors(results)
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (10)

- frontend/internal-packages/schema-bench/README.md (1 hunks)
- frontend/internal-packages/schema-bench/package.json (2 hunks)
- frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts (3 hunks)
- frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts (4 hunks)
- frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/utils/index.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (2 hunks)
- frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1 hunks)
- frontend/internal-packages/schema-bench/src/langsmith/types.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- frontend/internal-packages/schema-bench/src/cli/utils/index.ts
- frontend/internal-packages/schema-bench/src/cli/executeLiamDbShared.ts
- frontend/internal-packages/schema-bench/package.json
- frontend/internal-packages/schema-bench/README.md
- frontend/internal-packages/schema-bench/src/langsmith/types.ts
- frontend/internal-packages/schema-bench/src/cli/uploadDatasetToLangsmith.ts
🧬 Code graph analysis (3)
frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (6)
- frontend/internal-packages/schema-bench/src/langsmith/types.ts (2)
  - LangSmithInput (3-6)
  - LangSmithOutput (8-10)
- frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1)
  - schemaEvaluator (20-61)
- frontend/internal-packages/schema-bench/src/cli/utils/parseArgs.ts (1)
  - parseArgs (17-37)
- frontend/internal-packages/schema-bench/src/cli/utils/workspace.ts (1)
  - getWorkspacePath (14-16)
- frontend/internal-packages/schema-bench/src/cli/utils/selectTargetDatasets.ts (1)
  - selectTargetDatasets (13-28)
- frontend/internal-packages/schema-bench/src/cli/utils/filterAndResolveDatasets.ts (1)
  - filterAndResolveDatasets (10-30)

frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts (1)
- frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (1)
  - loadJsonFiles (17-71)

frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (1)
- frontend/internal-packages/schema-bench/src/evaluate/evaluate.ts (1)
  - evaluate (220-284)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Supabase Preview
- GitHub Check: frontend-ci
- GitHub Check: frontend-lint
- GitHub Check: Supabase Preview
🔇 Additional comments (13)
frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts (2)

9-18: LGTM! The helper functions `countTables` and `countColumns` are simple, correct, and efficiently compute the required metrics.

31-60: LGTM! The metric definitions are clear and comprehensive, covering all the evaluation dimensions from the evaluate function. The inclusion of both counts and scores provides good visibility into the evaluation results.
frontend/internal-packages/schema-bench/src/cli/executeOpenaiUnified.ts (3)

3-3: LGTM! The import additions correctly support the updated path construction and the refactored `loadJsonFiles` utility. Also applies to: 14-14

56-61: LGTM! The updated call to `loadJsonFiles` correctly uses the new directory structure (`join(datasetPath, 'execution', 'input')`) and maintains the same normalization logic with the updated return shape.

78-81: LGTM! The batch processing correctly uses the new `data` property from the refactored `loadJsonFiles` return type.

frontend/internal-packages/schema-bench/src/cli/evaluateWithLangsmith.ts (5)
24-34: LGTM! The schema definitions correctly validate executor types and options, with sensible defaults for `numRepetitions` and `maxConcurrency`. The use of Valibot for runtime validation aligns with the coding guidelines.

39-57: LGTM! The argument parsing correctly extracts executor options from CLI args and validates them using the schema. The logic handles both flag-based and value-based arguments appropriately.

59-74: LGTM! The target functions correctly wrap the executors and handle errors by throwing (which LangSmith will catch). The prompt extraction logic (`input.prompt || input.input || ''`) provides good fallback handling. Also applies to: 76-95

102-117: LGTM! The evaluation setup correctly configures LangSmith's `evaluate` function with the dataset name, evaluator, experiment prefix, and execution options. The use of `fromPromise` properly wraps the async evaluation in a `ResultAsync`.

127-156: LGTM! The main function correctly orchestrates the evaluation workflow: parsing options, resolving datasets, validating inputs, and handling errors. The separation of dataset and executor argument parsing is clean.
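The `--max-concurrency` option discussed above caps how many dataset cases run in parallel; in the CLI this is delegated to LangSmith's `evaluate` options, but the underlying pattern can be sketched as a small hand-rolled limiter (illustrative only, not the package's code):

```typescript
// Run `worker` over `items`, with at most `maxConcurrency` promises in flight.
// Each "lane" pulls the next unclaimed index; JS single-threading makes `next++` safe.
const runWithConcurrency = async <T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  maxConcurrency: number,
): Promise<R[]> => {
  const results: R[] = new Array(items.length)
  let next = 0
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index] as T)
    }
  }
  const laneCount = Math.max(1, Math.min(maxConcurrency, items.length))
  await Promise.all(Array.from({ length: laneCount }, () => lane()))
  return results
}
```

Results stay in input order regardless of which lane finishes first, since each lane writes into its claimed index.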
frontend/internal-packages/schema-bench/src/cli/utils/loadJsonFiles.ts (3)

13-24: LGTM! The function signature is well-designed with generic type parameters, making it reusable for different JSON file types. The updated documentation accurately reflects the generic purpose. The return type using `data` instead of `input` makes the function more versatile.

25-27: LGTM! The directory existence check with a clear error message provides early validation. The generic error message is appropriate for a reusable utility.

29-70: LGTM! The file loading, parsing, and validation logic is solid. The use of neverthrow for error handling provides type-safe error propagation throughout the function. The Valibot validation ensures runtime type safety, aligning with the coding guidelines.
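For illustration, a stripped-down sketch of what a generic, validating JSON loader of this shape could look like — hypothetical signature, using a plain callback validator instead of the real utility's Valibot schemas and neverthrow Results:

```typescript
import { readdirSync, readFileSync } from 'node:fs'
import { join } from 'node:path'

type LoadedCase<T> = { caseId: string; data: T }

// Load every *.json file in `dir`, validate each with `validate`, and skip
// files that fail validation. The case id is the file name without extension.
const loadJsonCases = <T>(
  dir: string,
  validate: (raw: unknown) => T | null,
): LoadedCase<T>[] => {
  const results: LoadedCase<T>[] = []
  for (const file of readdirSync(dir)) {
    if (!file.endsWith('.json')) continue
    const raw: unknown = JSON.parse(readFileSync(join(dir, file), 'utf8'))
    const data = validate(raw)
    if (data === null) continue // invalid shape: skip rather than throw
    results.push({ caseId: file.replace(/\.json$/, ''), data })
  }
  return results
}
```

The generic parameter is what lets one helper serve both execution inputs and reference outputs, as the review above notes.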
```ts
import * as v from 'valibot'
import { evaluate } from '../evaluate/evaluate.ts'

const validateSchema = (data: unknown): Schema => v.parse(schemaSchema, data)
```
Add error handling to `validateSchema`.

The function uses `v.parse`, which throws on validation failure and would crash the evaluator instead of returning a graceful error. LangSmith evaluators should handle validation failures without throwing.

Consider using `v.safeParse` and handling validation errors explicitly:

```diff
-const validateSchema = (data: unknown): Schema => v.parse(schemaSchema, data)
+const validateSchema = (data: unknown): Schema => {
+  const result = v.safeParse(schemaSchema, data)
+  if (!result.success) {
+    throw new Error(`Schema validation failed: ${JSON.stringify(result.issues)}`)
+  }
+  return result.output
+}
```
Alternatively, return an error result from the evaluator when validation fails instead of throwing.
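A minimal sketch of that alternative, assuming a `{ key, score, comment }` result shape; the metric key and stand-in validator here are illustrative, not the evaluator's actual API:

```typescript
// Hypothetical shape: on validation failure, return a zero-score result with a
// comment instead of throwing, so one bad output cannot crash the whole run.
type EvaluationResult = { key: string; score: number; comment?: string }

// Stand-in validator (the real code would use a Valibot schema).
const isSchemaLike = (data: unknown): data is { tables: Record<string, unknown> } =>
  typeof data === 'object' && data !== null && 'tables' in data

const evaluateSafely = (output: unknown): EvaluationResult => {
  if (!isSchemaLike(output)) {
    return { key: 'schema_valid', score: 0, comment: 'output is not a schema object' }
  }
  // Validation passed; a real evaluator would compute similarity metrics here.
  return { key: 'schema_valid', score: 1 }
}
```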
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
const validateSchema = (data: unknown): Schema => {
  const result = v.safeParse(schemaSchema, data)
  if (!result.success) {
    throw new Error(`Schema validation failed: ${JSON.stringify(result.issues)}`)
  }
  return result.output
}
```
🤖 Prompt for AI Agents
In frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts
around line 7, validateSchema currently calls v.parse which throws on invalid
input; change it to use v.safeParse and handle the result so validation failures
do not throw. Specifically, call v.safeParse(schemaSchema, data), check success;
on failure map the validation errors into the evaluator's error/result shape (or
return a graceful error result) and on success return the parsed Schema value —
ensure the evaluator returns an error result rather than allowing an exception
to bubble up.
```ts
const referenceSchema = validateSchema(args.referenceOutputs?.['schema'])
const outputSchema = validateSchema(args.outputs['schema'])
```
Add null checks before schema validation.

The code accesses `referenceOutputs?.['schema']` and `outputs['schema']` without checking if the values exist. If either is undefined, `validateSchema` will throw.

Apply this diff to add null checks:

```diff
+if (!args.referenceOutputs?.['schema']) {
+  throw new Error('Missing reference schema in referenceOutputs')
+}
+if (!args.outputs['schema']) {
+  throw new Error('Missing schema in outputs')
+}
 const referenceSchema = validateSchema(args.referenceOutputs?.['schema'])
 const outputSchema = validateSchema(args.outputs['schema'])
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
if (!args.referenceOutputs?.['schema']) {
  throw new Error('Missing reference schema in referenceOutputs')
}
if (!args.outputs['schema']) {
  throw new Error('Missing schema in outputs')
}
const referenceSchema = validateSchema(args.referenceOutputs?.['schema'])
const outputSchema = validateSchema(args.outputs['schema'])
```
🤖 Prompt for AI Agents
In frontend/internal-packages/schema-bench/src/langsmith/schemaEvaluator.ts
around lines 27-28, the code calls validateSchema on
referenceOutputs?.['schema'] and outputs['schema'] without ensuring those values
exist; add explicit null/undefined checks before calling validateSchema: verify
args.referenceOutputs and args.referenceOutputs['schema'] exist and verify
args.outputs and args.outputs['schema'] exist, and if missing either throw a
clear error (or handle fallback) instead of passing undefined into
validateSchema; then call validateSchema only with the confirmed non-null schema
values.
LGTM👍
Summary
Add comprehensive LangSmith integration to schema-bench for advanced evaluation tracking, visualization, and experiment comparison.
Key Features
Implementation Details
New Files
- `evaluateWithLangsmith.ts` - CLI for running evaluations against LangSmith datasets
- `uploadDatasetToLangsmith.ts` - CLI for uploading/updating LangSmith datasets
- `schemaEvaluator.ts` - Custom evaluator for schema similarity metrics
- `loadJsonFiles.ts` - Shared utility for consistent JSON file loading with validation

Key Technical Decisions
- Neverthrow Integration: Added `@liam-hq/neverthrow` for functional error handling
  - `ResultAsync.combine` for parallel async operations
  - `orElse` for fallback logic (get or create dataset)
- Valibot Validation: Schema-based CLI argument parsing with default values
  - `v.InferOutput` to eliminate manual type definitions
- Code Reuse: Extracted `loadJsonFiles` utility to avoid duplication

Usage
Setup
Run Evaluation
Test Plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Chores