LLMs don't have perfect knowledge of Convex, so they require some prompting to help them along. This repo contains a set of prompts for coding a Convex backend, a set of human-curated solutions, and a script for evaluating the LLM's output.
First, install dependencies:
pip install pdm
pdm install
npm install -g bun
bun install
echo "ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>" > .env
echo "OPENAI_API_KEY=<your OPENAI_API_KEY>" >> .env
Then, get a Braintrust API key from the dashboard.
You can run the eval_convex_coding.py evals with the braintrust CLI:
BRAINTRUST_API_KEY=<your BRAINTRUST_API_KEY> pdm run braintrust eval runner/eval_convex_coding.py
It'll print out a URL for viewing the report online. You can specify a test filter regex via the TEST_FILTER environment variable:
TEST_FILTER='data_modeling' pdm run braintrust eval runner/eval_convex_coding.py
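Since the filter is a regular expression, you can also match multiple tests at once. For example (the patterns below are just illustrative substrings of test names):
TEST_FILTER='data_modeling|file_storage' pdm run braintrust eval runner/eval_convex_coding.py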
The test will also print out what temporary directory it's using for storing the generated files. You can override this with the OUTPUT_TEMPDIR environment variable.
OUTPUT_TEMPDIR=/tmp/convex-codegen-evals pdm run braintrust eval runner/eval_convex_coding.py
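You can combine the two environment variables to regenerate just a single eval into a stable directory, which makes it easy to regrade later (the filter below is just an example):
TEST_FILTER='http_actions' OUTPUT_TEMPDIR=/tmp/convex-codegen-evals pdm run braintrust eval runner/eval_convex_coding.py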
After running the evals, you may want to dig into a particular test failure. You can use the run_grader.py script to grade the evaluations again without regenerating them:
pdm run python -m runner.run_grader /tmp/convex-codegen-evals
You can also pass in a path to a specific evaluation:
pdm run python -m runner.run_grader /tmp/convex-codegen-evals/output/claude-3-5-sonnet-latest/000-fundamentals/000-http_actions_file_storage
Use the create_eval.py script to create a new evaluation:
pdm run python -m runner.create_eval <category> <name>
Note that test or category names cannot contain dashes.
It will walk you through the process step by step. You can start at a particular step by passing its step number as a third argument.
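For example, to resume at step 3 (the step number here is just illustrative):
pdm run python -m runner.create_eval <category> <name> 3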
At each step, it will generate some content and have you edit it.
It will generally open the files for review automatically with cursor, so you should have that utility installed in your shell. The script:
- Generates a TASK.txt file for what the LLM should do.
- Generates an answer/ directory with the human-curated answer.
- Generates a grader.test.ts file with unit tests.
- Interactively runs the backend and tests.
- Runs the eval with braintrust. At this point, edit GAPS.txt to capture what needs improvement.
- Commits it to git.
Be sure that your answer passes tests:
pdm run python -m runner.run_grader evals/<category>/<name>/answer
To generate the guidelines, run:
pdm run python -m runner.models.guidelines <outdir>
This will generate guidelines for Anthropic and OpenAI in the specified directory.
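For example, to write the guidelines to a local directory (the path is just an example):
pdm run python -m runner.models.guidelines /tmp/convex-guidelines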