This repository contains the assets for setting up the customer support/success project for benchmarking PromptQL.
- Have Postgres running at `localhost:5432` with `postgres:postgres` as credentials. Ensure the `psql` and `createdb` commands are available.
- Install poetry:

  ```
  curl -sSL https://install.python-poetry.org | python3 -
  ```
- Set your Anthropic key in the ANTHROPIC_API_KEY env var
- Set your OpenAI key in the OPENAI_API_KEY env var
- Set your PromptQL key in the PROMPTQL_SECRET_KEY env var
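For example, with a POSIX shell (placeholder values):

```
export ANTHROPIC_API_KEY=<your-anthropic-key>
export OPENAI_API_KEY=<your-openai-key>
export PROMPTQL_SECRET_KEY=<your-promptql-key>
```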
- Control Plane data:
  - DB name: `control_plane_new`
  - Dump file: `data_dumps/control_plane_dump.sql`

  ```
  createdb -h localhost -U postgres control_plane_new
  psql -h localhost -U postgres -d control_plane_new -f data_dumps/control_plane_dump.sql
  ```
- Support tickets data:
  - DB name: `support_tickets_new`
  - Dump file: `data_dumps/support_tickets_dump_feb3.sql`

  ```
  createdb -h localhost -U postgres support_tickets_new
  psql -h localhost -U postgres -d support_tickets_new -f data_dumps/support_tickets_dump_feb3.sql
  ```
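To sanity-check the imports, you can list the tables in each database (standard `psql` usage, not specific to this repo):

```
psql -h localhost -U postgres -d control_plane_new -c '\dt'
psql -h localhost -U postgres -d support_tickets_new -c '\dt'
```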
- Build and run the local PromptQL project:

  ```
  cd my-assistant
  ddn supergraph build local
  docker compose up -d
  ddn console --local
  ```
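To confirm the services came up, you can check the container status (plain Docker Compose, nothing repo-specific):

```
docker compose ps
```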
- Ensure Python dependencies are installed:

  ```
  poetry install
  ```
- Run the model:

  ```
  poetry run python o1_eval.py --model <o1|o3-mini>
  ```
The system can be `promptql`, `tool_calling`, or `tool_calling_python`. The model can be `claude`, `o1`, or `o3-mini`. For example:

```
python bench.py --input_filepath queries/score_based_prioritization/task.yaml --output_dir score_based_prioritization --system tool_calling_python --oracle --model claude-3-7-sonnet
```
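As another illustration, a hypothetical run of the `promptql` system against `o1` with the same task file (this flag combination is an assumption, not verified against `bench.py`):

```
python bench.py --input_filepath queries/score_based_prioritization/task.yaml --output_dir score_based_prioritization --system promptql --model o1
```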
You can also compute scores automatically by comparing ground truth with evaluation runs. Provide the query file, the output directory containing the evaluation runs, and a Python module that exposes a scoring function with the signature `evaluate_score(ground_truth: str, test_result: str) -> float`:

```
python evaluation.py --input_config queries/rule_based_prioritization/complexity3.yaml --output_dir output_complexity3 --evaluator_module scoring/test_scorer.py
```
Sample scoring functions for common outputs are provided in the `scoring/` directory.
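As an illustration, here is a minimal sketch of such an evaluator module, assuming a whitespace- and case-insensitive exact-match score (the module name and scoring logic are hypothetical; see `scoring/` for the real samples):

```python
# my_scorer.py -- hypothetical evaluator module for evaluation.py.
# The only requirement assumed here is that the module expose:
#   evaluate_score(ground_truth: str, test_result: str) -> float


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(s.strip().lower().split())


def evaluate_score(ground_truth: str, test_result: str) -> float:
    """Return 1.0 on a normalized exact match, 0.0 otherwise."""
    return 1.0 if _normalize(ground_truth) == _normalize(test_result) else 0.0
```

It would then be passed as `--evaluator_module my_scorer.py` (file name hypothetical).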