Commit 9b677d2

Added a table format to the evaluate mode
1 parent 667b164 commit 9b677d2

12 files changed, +819 -106 lines changed

test/common/uc_eval/README.md

Lines changed: 69 additions & 28 deletions
@@ -8,7 +8,6 @@

 | Dataset | Hugging Face link |
 | ------------ | ------------------------------------------------------------ |
-| AIME2025 | [opencompass/AIME2025 · Datasets at Hugging Face](https://huggingface.co/datasets/opencompass/AIME2025) |
 | LongBench | [zai-org/LongBench · Datasets at Hugging Face](https://huggingface.co/datasets/zai-org/LongBench) |
 | LongBench v2 | [zai-org/LongBench-v2 · Datasets at Hugging Face](https://huggingface.co/datasets/zai-org/LongBench-v2) |

@@ -19,7 +18,12 @@
 | ShareGPT | [anon8231489123/ShareGPT_Vicuna_unfiltered · Datasets at Hugging Face](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) |
 | ShareGPT-Chinese-English-90K | [shareAI/ShareGPT-Chinese-English-90k · Datasets at Hugging Face](https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k) |

-- The multi-turn dialogue dataset format is as follows:
+The multi-turn dialogue dataset can use either of the following two formats:
+
+- Format 1:
+- The top-level key name (e.g. `"sharegpt"`) can be customized, but the internal structure must stay the same
+- The `"conversations"` field name must not be changed
+- Each turn must use the `"from"` / `"value"` format

```json
 {
@@ -38,11 +42,35 @@
 }]}
```

-**Note**
+- Format 2

-- The top-level key name (e.g. `"sharegpt"`) can be customized, but the internal structure must stay the same
-- The `"conversations"` field name must not be changed
-- Each turn must use the `"from"` / `"value"` format
+```json
+[
+    {
+        "id": "dsOTKpn_0",
+        "conversations": [
+            {
+                "from": "human",
+                "value": "Why does `dir` command in DOS see the \"<.<\" argument as \"\\*.\\*\"?"
+            },
+            {
+                "from": "human",
+                "value": "I said `dir \"<.<\"` , it only has one dot but it is the same as `dir \"\\*.\\*\"`"
+            }
+        ]
+    },
+    {
+        "id": "60493",
+        "conversations": [
+            {
+                "from": "human",
+                "value": "我想用TypeScript编写一个程序,提供辅助函数以生成G代码绘图(Marlin)。我已经在我的3D打印机上添加了笔座,并希望将其用作笔绘图仪。该库应提供类似使用p5.js的体验,但它不是在画布上绘制形状,而是在G代码中产生文本输出。"
+            }
+        ],
+        "lang": "en"
+    }
+]
+```
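
A reader of both layouts might normalize them into one list of records before use. The following is a minimal sketch, not the loader used by uc_eval: the function name `normalize_dialogues` is hypothetical, and it assumes that in Format 1 the custom top-level key maps to a record or a list of records, each carrying a `"conversations"` list.

```python
from typing import Any, Dict, List


def normalize_dialogues(data: Any) -> List[Dict[str, Any]]:
    """Flatten either dataset layout into a list of dialogue records.

    Hypothetical helper: assumes Format 1 is {custom_key: record or [records]}
    and Format 2 is a top-level list of records.
    """
    if isinstance(data, list):  # Format 2
        records = list(data)
    elif isinstance(data, dict):  # Format 1 (top-level key name is free-form)
        records = []
        for value in data.values():
            records.extend(value if isinstance(value, list) else [value])
    else:
        raise ValueError("dataset must be a JSON object or a JSON array")

    for record in records:
        turns = record.get("conversations") if isinstance(record, dict) else None
        if not isinstance(turns, list):
            raise ValueError('each record needs a "conversations" list')
        for turn in turns:
            # Every turn must follow the "from" / "value" convention.
            if "from" not in turn or "value" not in turn:
                raise ValueError('each turn needs "from" and "value" keys')
    return records
```

Validating up front means a malformed file fails at load time rather than in the middle of a performance run.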

 ### stopwords file

@@ -232,15 +260,15 @@ def test_multiturn_dialogue_perf(
     "demo": [
         "demo.json"
     ],
-    "sharrgpt":[
-
-    ]
+    "sharegpt": [
+        "demo.json"
+    ]
 }
```

 - Notes:
 - The key name (e.g. `"demo"`) is the dataset folder name
-- The value list contains the names of the data files in that folder
+- The value list gives the names of the data files in that folder

 ### Document QA performance test

245273
### 文档问答性能测试
246274

@@ -309,7 +337,7 @@ models:
 python -m pytest --feature=qa_eval_test
```

-- **Result location**: All performance test data is saved to `uc_eval/results/reports/evaluate/doc_qa_latency.xlsx`
+- **Result location**: All performance test data is saved to `uc_eval/results/reports/evaluate/doc_qa_latency.xlsx`. In addition, a date-named folder is created under the evaluate directory, containing the dataset, the model responses, and related information
 - **Parameter configuration**:

 | Parameter | Meaning | Example value |
@@ -339,7 +367,7 @@ doc_qa_eval_cases = [
             metrics=["accuracy", "bootstrap-accuracy", "f1-score"],
             eval_class="common.uc_eval.utils.metric:MatchPatterns",
             select_data_class={"domain": ["Single-Document QA"]},
-            test_name="longbench and no prefix cache"
+            test_name="longbench v2 and no prefix cache"
         ),
     ),
     # Reference configuration for longbench
@@ -350,9 +378,9 @@ doc_qa_eval_cases = [
             enable_prefix_cache=False,
             parallel_num=1,
             benchmark_mode="evaluate",
-            metrics=["accuracy", "bootstrap-accuracy", "f1-score"],
+            metrics=["f1-score"],
             eval_class="common.uc_eval.utils.metric:FuzzyMatch",
-            test_name="longbench v2 and no prefix cache"
+            test_name="longbench and no prefix cache"
         ),
     ),
 ]
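
The `eval_class` values above use a `"module.path:ClassName"` string. How uc_eval resolves them is not shown in this commit; the sketch below only illustrates one common way to turn such a string into a class with the standard library. The function name `resolve_eval_class` is hypothetical, and the demo uses a standard-library class so that it runs without the project installed.

```python
import importlib
from collections import OrderedDict


def resolve_eval_class(spec: str) -> type:
    """Import the module before the colon and return the attribute after it.

    Hypothetical helper; not taken from the uc_eval source.
    """
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in for a spec such as "common.uc_eval.utils.metric:FuzzyMatch".
resolved = resolve_eval_class("collections:OrderedDict")
```

Keeping the class as a string in the test case lets the configuration stay declarative, with the import deferred until the case actually runs.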
@@ -385,31 +413,44 @@ def test_doc_qa_perf(
 - **Template file**: test/common/uc_eval/utils/prompt_config.py

```python
-# Prompt template for non-multiple-choice questions
-doc_qa_prompt = ["""
-Please read the following text and answer the questions below.\n
-Text: {context}\n
-Question: {input}
-Instructions: Answer based ONLY on the information in the text above
-"""]
+# Language of the document QA dataset. It determines the tokenization method and whether the Chinese or the English prompt is used. At run time the dataset's "language" key is checked first; this setting is used only if that key is absent.
+# Three allowed values: en, zh, None
+DEFAULT_LANGUAGE = "None"
+
+# Document QA prompt templates, provided in a Chinese and an English variant.
+# When a template is used, each {} placeholder is replaced with the value of the matching key in the dataset.
+doc_qa_prompt_zh = [
+    """
+阅读以下文字并用中文简短回答:\n\n{context}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:
+"""
+]
+
+doc_qa_prompt_en = [
+    """
+Read the following text and answer briefly.\n\n{context}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:
+"""
+]

 # Prompt template for multiple-choice questions
-multi_answer_prompt = ["""
+multi_answer_prompt = [
+    """
 Please read the following text and answer the questions below.\n
 Text: {context}\n
 What is the correct answer to this question: {question}\n
 Choices: \n (A) {choice_A} \n (B) {choice_B} \n (C) {choice_C} \n (D) {choice_D} \n
 Let's think step by step. Based on the above, what is the single, most likely answer choice?\n
 Format your response as follows: "The correct answer is (insert answer here)"
-"""]
+"""
+]

 # Regular expression templates for answer extraction
 match_patterns = [
-    r'The correct answer is \(([A-D])\)',
-    r'The correct answer is ([A-D])',
-    r'The \(([A-D])\) is the correct answer',
-    r'The ([A-D]) is the correct answer'
+    r"The correct answer is \(([A-D])\)",
+    r"The correct answer is ([A-D])",
+    r"The \(([A-D])\) is the correct answer",
+    r"The ([A-D]) is the correct answer",
 ]
+
```
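
The language rule described in the new comments (dataset `language` key first, then `DEFAULT_LANGUAGE`) could be sketched as follows. This is an illustration, not the project's code: `pick_language` and `build_doc_qa_prompt` are hypothetical names, the templates are copied from the hunk above with surrounding whitespace trimmed, and the fallback to English when neither source names a language is an assumption, since the commit does not show what `"None"` triggers.

```python
from typing import Any, Dict

DEFAULT_LANGUAGE = "None"

doc_qa_prompt_zh = [
    "阅读以下文字并用中文简短回答:\n\n{context}\n\n现在请基于上面的文章回答下面的问题,"
    "只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:"
]
doc_qa_prompt_en = [
    "Read the following text and answer briefly.\n\n{context}\n\n"
    "Now, answer the following question based on the above text, only give me "
    "the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:"
]


def pick_language(record: Dict[str, Any], default: str = DEFAULT_LANGUAGE) -> str:
    """Prefer the record's own "language" key, then the configured default."""
    language = record.get("language")
    if language in ("en", "zh"):
        return language
    if default in ("en", "zh"):
        return default
    return "en"  # Assumption: English when neither source specifies a language.


def build_doc_qa_prompt(record: Dict[str, Any], default: str = DEFAULT_LANGUAGE) -> str:
    """Fill the {context} and {input} placeholders from the dataset record."""
    templates = doc_qa_prompt_zh if pick_language(record, default) == "zh" else doc_qa_prompt_en
    return templates[0].format(context=record["context"], input=record["input"])
```

Because `str.format` only parses the template, braces that happen to appear inside a record's `context` are inserted literally and do not break the substitution.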
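
 - **How the prompt_config templates are used**:
@@ -421,4 +462,4 @@ match_patterns = [
 - Build the prompt from the template in `multi_answer_prompt`
 - Send the request and get the model's response
 - Extract the answer (A/B/C/D) with the regular expressions in `match_patterns`
-- Compare with the dataset's reference answer to get the accuracy
+- Compare with the dataset's reference answer to get the accuracy or the F1-score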
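
The extraction and comparison steps above can be sketched as below. The patterns are copied from the `match_patterns` hunk; `extract_choice` and `accuracy` are hypothetical helper names, and the real `MatchPatterns` class in uc_eval may score differently (for example, it also reports bootstrap accuracy and F1, which this sketch omits).

```python
import re
from typing import List, Optional

match_patterns = [
    r"The correct answer is \(([A-D])\)",
    r"The correct answer is ([A-D])",
    r"The \(([A-D])\) is the correct answer",
    r"The ([A-D]) is the correct answer",
]


def extract_choice(response: str) -> Optional[str]:
    """Return the first choice letter matched by any pattern, else None."""
    for pattern in match_patterns:
        match = re.search(pattern, response)
        if match:
            return match.group(1)
    return None


def accuracy(responses: List[str], references: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the reference."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not responses:
        return 0.0
    correct = sum(extract_choice(r) == ref for r, ref in zip(responses, references))
    return correct / len(responses)
```

A response that matches none of the patterns yields `None` and is counted as wrong, which is why the prompt asks the model for the fixed phrase "The correct answer is (X)".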

test/common/uc_eval/datasets/doc_qa/demo_2.json

Lines changed: 44 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 3 additions & 0 deletions
@@ -1,5 +1,8 @@
 {
     "demo": [
         "demo.json"
+    ],
+    "sharegpt": [
+        "demo.json"
     ]
 }
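
Per the README notes in this commit, each key in this index is a dataset folder name and each value lists the data files inside that folder. A minimal sketch of expanding the index into file paths follows; the helper name `resolve_dataset_files` and the root directory `datasets` are hypothetical, since the commit does not show where the folders live.

```python
from pathlib import Path
from typing import Dict, List


def resolve_dataset_files(index: Dict[str, List[str]], root: str) -> List[Path]:
    """Expand {folder: [file, ...]} into root/folder/file paths, in index order."""
    paths = []
    for folder, file_names in index.items():
        for file_name in file_names:
            paths.append(Path(root) / folder / file_name)
    return paths


index = {"demo": ["demo.json"], "sharegpt": ["demo.json"]}
dataset_paths = resolve_dataset_files(index, "datasets")  # "datasets" is a placeholder root
```

With this layout the same file name, such as `demo.json`, can appear under several folders without clashing.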
