Commit ddb7b16: update README (1 parent: 3bbc570)

README.md: 48 additions, 60 deletions

@@ -14,15 +14,16 @@ To be clarified, such zero-shot QA-style evaluation might be more suitable for t
- [Supported baseline Models](#supported-baseline-models)

## Install

```bash
pip install -e .
```

## Evaluate a model with Japanese Vicuna QA Benchmark

#### Step 1. Generate model answers to Japanese Vicuna QA questions (denoted as jp-bench)

```bash
python llm_judge/gen_model_answer.py --config <CONFIG-PATH>
```

@@ -31,97 +32,86 @@ Arguments & Options:
For example:

```bash
python llm_judge/gen_model_answer.py --config configs/rinna--japanese-gpt-neox-3.6b-instruction-ppo.json
```

The answers will be saved to `data/jp_bench/model_answer`.
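
Each config bundles the settings `gen_model_answer.py` needs for one model. The exact schema is defined by this repository, so the sketch below is purely illustrative: every key in it (`model_id`, `model_path`, `generation_config`) is a hypothetical placeholder, not the repo's actual format.

```bash
# Hypothetical config sketch: treat every key as a placeholder, since the
# real schema is whatever gen_model_answer.py in this repo expects.
cat > configs/my-model.json <<'EOF'
{
  "model_id": "my-model",
  "model_path": "rinna/japanese-gpt-neox-3.6b-instruction-ppo",
  "generation_config": {"max_new_tokens": 512, "temperature": 0.7}
}
EOF
```
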
#### Step 2. Generate GPT-4 judgments

There are several ways to use GPT-4 as a judge, such as pairwise win-rate comparison and single-answer grading.

```bash
OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
  --mode {single|pairwise-baseline|pairwise-all} \
  [--baseline-model <BASELINE-MODEL-ID>] \
  [--model-list <LIST-OF-MODEL-IDS>]
```

Arguments & Options:
- `--mode {single|pairwise-baseline|pairwise-all}` is the mode of judgment.
  - `pairwise-baseline`: run pairwise comparison against a baseline model.
  - `pairwise-all`: run pairwise comparison between all model pairs.
  - `single`: run score-based single-model grading.
- `--baseline-model <BASELINE-MODEL-ID>` is the model ID of the baseline model. This option is only available in `pairwise-baseline` mode. If not specified, the baseline model is set to `text-davinci-003`.
- `--model-list <LIST-OF-MODEL-IDS>` is a list of model IDs to be evaluated. If not specified, all models in `data/jp_bench/model_answer` will be evaluated.

**Mode: `pairwise-baseline`**

This mode runs pairwise comparison against a baseline model.
By default, the baseline model is set to `text-davinci-003`.
For example:

```bash
OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
  --mode pairwise-baseline \
  --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
```

The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_pair.jsonl`.

To show the scores:

```bash
python llm_judge/show_result.py \
  --mode pairwise-baseline \
  --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
```
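
The judgment file is plain JSONL (one JSON object per line), so it can be inspected directly with standard tools; the exact field names are whatever `gen_judgment.py` emits, which this README does not specify.

```bash
# Peek at the first raw judgment record; json.tool pretty-prints one JSON line.
head -n 1 data/jp_bench/model_judgment/gpt-4_pair.jsonl | python3 -m json.tool
```
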
**Mode: `pairwise-all`**

This mode runs pairwise comparison between all model pairs.
For example:

```bash
OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
  --mode pairwise-all
```

To show the scores:

```bash
python llm_judge/show_result.py \
  --mode pairwise-all
```

Note that the `pairwise-all` mode can become very inefficient when evaluating many LLMs, since it judges every pair of models. In such cases, we recommend the `pairwise-baseline` mode, which compares each model against a fixed baseline such as ChatGPT.
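
To put numbers on that: `pairwise-all` grows quadratically with the number of models, while `pairwise-baseline` grows linearly. This is simple combinatorics, not a measurement of this repo:

```bash
# Judgment workload as a function of the number of candidate models n:
python3 -c 'n = 10; print(n * (n - 1) // 2)'  # pairwise-all: 45 model pairs
python3 -c 'n = 10; print(n)'                 # pairwise-baseline: 10, one per model
```
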
**Mode: `single`**

This mode runs score-based single-model grading.
For example:

```bash
OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
  --mode single \
  --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
```

The judgments will be saved to `data/jp_bench/model_judgment/gpt-4_single.jsonl`.

To show the scores:

```bash
python llm_judge/show_result.py \
  --mode single \
  --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
```
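
Putting the steps together, a minimal end-to-end run for one model could look like the sketch below, assuming (as in the Step 1 example) that the config file is named after the model ID:

```bash
MODEL=rinna--japanese-gpt-neox-3.6b-instruction-ppo

# Step 1: generate answers to the jp-bench questions
python llm_judge/gen_model_answer.py --config configs/${MODEL}.json

# Step 2: grade the answers with GPT-4 (single-answer grading)
OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
  --mode single \
  --model-list ${MODEL}

# Step 3: show the scores
python llm_judge/show_result.py \
  --mode single \
  --model-list ${MODEL}
```
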

## Sample Outputs
@@ -140,6 +130,7 @@ Model outputs:
| cyberagent--calm2-7b-chat | 植物性タンパク源と動物性タンパク源の違いは以下の通りです。\n\n1. 供給源: 植物性タンパク源は主に植物、特に穀物から得られます。一方、動物性タンパク源は動物、特に哺乳類から得られます。\n\n2. 生物的特性: 植物性タンパク源は、植物が有する、植物の、特に植物の、植物性タンパク源の、特に植物性タンパク源の、植物性タンパク源の、特に植物性タンパク源の、特に特に特に特に特に特に特に特に特に特に |

## Pairwise win-rate compared with GPT-3.5 (`text-davinci-003`)

| Model                                                     | Win | Loss | Tie | Win Rate | Loss Rate | Win Rate Adjusted |
|-----------------------------------------------------------|-----|------|-----|----------|-----------|-------------------|
| llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0  | 22  | 48   | 10  | 0.2750   | 0.6000    | 0.33750           |
@@ -148,19 +139,16 @@ Model outputs:
| rinna--japanese-gpt-neox-3.6b-instruction-sft-v2          | 8   | 69   | 3   | 0.1000   | 0.8625    | 0.11875           |
| cyberagent--calm2-7b-chat                                 | 5   | 67   | 8   | 0.0625   | 0.8375    | 0.11250           |
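
The `Win Rate Adjusted` column is consistent with counting each tie as half a win; checking the `llm-jp--llm-jp-13b-instruct-lora` row above (the formula is inferred from the numbers, not stated in the README):

```bash
# (wins + 0.5 * ties) / total, for 22 wins / 48 losses / 10 ties:
python3 -c 'w, l, t = 22, 48, 10; n = w + l + t; print(w / n, (w + 0.5 * t) / n)'
# -> 0.275 0.3375  (matches Win Rate 0.2750 and Win Rate Adjusted 0.33750)
```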

## Supported baseline Models

To make pairwise comparisons with existing Japanese LLMs more convenient, we provide the predictions of the following five baselines in `data/jp_bench/model_answer`:

- [llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0)
- [llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0)
- [rinna/japanese-gpt-neox-3.6b-instruction-ppo](https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo)
- [rinna/japanese-gpt-neox-3.6b-instruction-sft-v2](https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2)
- [cyberagent/calm2-7b-chat](https://huggingface.co/cyberagent/calm2-7b-chat)
## Questions

If you have any questions or feedback, please feel free to raise them in the `Issues` list.
