CompassBench is the self-built benchmark behind the CompassRank LLM Leaderboard; we provide example data for it here.
Please check CompassBench for more information about the benchmark.
```
v1_3_data
├── code
│   ├── compass_bench_coding_cn_val.json
│   └── compass_bench_coding_en_val.json
├── instruct
│   ├── compass_bench_instruct_cn_val.json
│   └── compass_bench_instruct_en_val.json
├── knowledge
│   └── single_choice_cn.jsonl
├── language
│   ├── compass_bench_language_cn_val.json
│   └── compass_bench_language_en_val.json
├── math
│   └── single_choice_cn.jsonl
└── reasoning
    ├── compass_bench_reasoning_cn_val.json
    └── compass_bench_reasoning_en_val.json
```
- For subjective evaluation, please refer to CompassBench Subjective Config
- For objective evaluation, please refer to CompassBench Objective Config
Performance results on the example data will be published soon.
- Please link `v1_3_data` to `data/compassbench_v1_3` within the opencompass directory.
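For example, assuming the example data was unpacked to `/path/to/v1_3_data` (a placeholder — substitute your own location), a symlink from the repository root does the job:

```shell
# Run from the root of the opencompass repository.
mkdir -p data
ln -s /path/to/v1_3_data data/compassbench_v1_3
```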
Set the following environment variables to run Hugging Face components in offline mode:

```bash
export HUGGINGFACE_HUB_CACHE=/path-to-hf_hub/
export HF_HUB_CACHE=/path-to-hf_hub/
export HF_EVALUATE_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```
```bash
# Objective Evaluation
# We use `perf_4` as the final metric
python run.py --models hf_internlm2_chat_1_8b --datasets compassbench_v1_3_objective_gen

# Subjective Evaluation
python run.py configs/eval_compassbench_v1_3_subjective.py
```