CS benchmark

This repository contains the assets for setting up the customer support/success (CS) project used to benchmark PromptQL.

Setup

Prerequisites

  • Have Postgres running at localhost:5432 with postgres:postgres credentials, and ensure the psql and createdb commands are available.
  • Install Poetry:
curl -sSL https://install.python-poetry.org | python3 -
  • Set your Anthropic key in the ANTHROPIC_API_KEY environment variable
  • Set your OpenAI key in the OPENAI_API_KEY environment variable
  • Set your PromptQL key in the PROMPTQL_SECRET_KEY environment variable

Database setup

Control Plane data:

  • DB name: "control_plane_new"
  • Dump file: data_dumps/control_plane_dump.sql

createdb -h localhost -U postgres control_plane_new
psql -h localhost -U postgres -d control_plane_new -f data_dumps/control_plane_dump.sql

Support tickets data:

  • DB name: "support_tickets_new"
  • Dump file: data_dumps/support_tickets_dump_feb3.sql

createdb -h localhost -U postgres support_tickets_new
psql -h localhost -U postgres -d support_tickets_new -f data_dumps/support_tickets_dump_feb3.sql

Running PromptQL

cd my-assistant
ddn supergraph build local
docker compose up -d
ddn console --local

Running o1 and o3-mini

  • Ensure the Python dependencies are installed:
poetry install
  • Run the evaluation with the desired model:
poetry run python o1_eval.py --model <o1|o3-mini>

Automatic benchmarks

System (the --system flag) can be promptql, tool_calling, or tool_calling_python.

Model (the --model flag) can be claude, o1, or o3-mini.

python bench.py --input_filepath queries/score_based_prioritization/task.yaml --output_dir score_based_prioritization --system tool_calling_python --oracle --model claude-3-7-sonnet
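
If you want to run a task against several system/model combinations, a small driver script along the lines of the sketch below can help. It is not part of the repository: the filename run_sweep.py, the output-directory naming, and the model identifiers are placeholders (the example command above passes claude-3-7-sonnet rather than the short name claude), so adjust them to whatever bench.py actually accepts.

# run_sweep.py - hypothetical helper; only uses the bench.py flags shown above.
import itertools
import subprocess
import sys

SYSTEMS = ["promptql", "tool_calling", "tool_calling_python"]
MODELS = ["claude-3-7-sonnet", "o1", "o3-mini"]  # adjust to what bench.py accepts
TASK = "queries/score_based_prioritization/task.yaml"  # example task from above

for system, model in itertools.product(SYSTEMS, MODELS):
    # Placeholder output-directory naming scheme.
    out_dir = f"score_based_prioritization_{system}_{model}"
    cmd = [
        sys.executable, "bench.py",
        "--input_filepath", TASK,
        "--output_dir", out_dir,
        "--system", system,
        "--oracle",
        "--model", model,
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)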

Measure score

You can also compute scores automatically by comparing ground truth with evaluation runs. Provide the query file, the output directory containing the evaluation runs, and a Python module that defines a function computing a score between ground truth and a test result, with the signature evaluate_score(ground_truth: str, test_result: str) -> float.

python evaluation.py --input_config queries/rule_based_prioritization/complexity3.yaml --output_dir output_complexity3 --evaluator_module scoring/test_scorer.py

Sample scoring functions for common outputs are provided in the scoring/ directory.
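
For reference, here is a minimal sketch of an evaluator module that satisfies the interface above. The filename my_scorer.py and the exact-match comparison are purely illustrative and are not one of the repository's scorers; see the scoring/ directory for real examples.

# my_scorer.py - illustrative evaluator module (not part of the repo).
# evaluation.py passes the ground truth and a test result to evaluate_score.

def _normalize(s: str) -> str:
    # Collapse whitespace and lowercase before comparing.
    return " ".join(s.split()).lower()

def evaluate_score(ground_truth: str, test_result: str) -> float:
    """Return 1.0 for a whitespace/case-insensitive exact match, else 0.0."""
    return 1.0 if _normalize(ground_truth) == _normalize(test_result) else 0.0

To use it, pass --evaluator_module my_scorer.py in place of scoring/test_scorer.py in the command above.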
