The answers will be saved to `data/jp_bench/model_answer`.
#### Step 2. Generate GPT-4 judgments
There are several options to use GPT-4 as a judge, such as pairwise win-rate and single-answer grading.
We show an example of the pairwise win-rate evaluation of instruction-fine-tuned models (rinna-3.6b-sft-v2, rinna-3.6b-ppo, and japanese-alpaca-lora-7b) at the bottom.
- `--mode {single|pairwise-baseline|pairwise-all}` is the mode of judgment.
  - `single`: run score-based single-model grading.
  - `pairwise-baseline`: run pairwise comparison against a baseline model.
  - `pairwise-all`: run pairwise comparison between all model pairs.
- `--baseline-model <BASELINE-MODEL-ID>` is the model ID of the baseline model. This option is only available in `pairwise-baseline` mode. If not specified, the baseline model is set to `text-davinci-003`.
- `--model-list <LIST-OF-MODEL-IDS>` is a list of model IDs to be evaluated. If not specified, all models in `data/jp_bench/model_answer` will be evaluated.
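To illustrate the documented default for `--model-list`, the sketch below collects the model IDs that would be evaluated when the flag is omitted. The helper is hypothetical (not part of the benchmark code) and assumes one `<model-id>.jsonl` answer file per model in `data/jp_bench/model_answer`:

```python
from pathlib import Path

def default_model_list(answer_dir="data/jp_bench/model_answer"):
    """Return the model IDs evaluated by default: one per `<model-id>.jsonl`
    answer file found in the answer directory (hypothetical helper)."""
    return sorted(p.stem for p in Path(answer_dir).glob("*.jsonl"))
```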
**Mode: `pairwise-baseline`**
This mode runs pairwise comparison against a baseline model.
By default, the baseline model is set to `text-davinci-003`.
The GPT-4 judgments are placed in `data/jp_bench/model_judgment/gpt-4_pair.jsonl`.
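As an illustration of how such pairwise judgments can be aggregated into win counts, here is a minimal sketch. The record schema (fields `model_1`, `model_2`, and `winner`) is an assumption for illustration, not the documented format of the judgment file:

```python
import json
from collections import Counter

def win_counts(judgment_path):
    """Tally wins per model from a pairwise-judgment JSONL file.
    Assumed schema (an assumption, not the documented format): one JSON
    object per line with `model_1`, `model_2`, and `winner`, where
    `winner` is "model_1", "model_2", or "tie"."""
    counts = Counter()
    with open(judgment_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["winner"] in ("model_1", "model_2"):
                counts[rec[rec["winner"]]] += 1  # resolve winner to a model ID
    return counts
```

Dividing each model's count by its number of non-tie comparisons would then give the pairwise win rate.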
Note that `pairwise-all` can become very inefficient as the number of LLMs grows, since it evaluates every pair of models. In such cases, we recommend the `pairwise-baseline` mode, which compares all models against a fixed baseline such as ChatGPT.
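The cost difference is easy to quantify: `pairwise-all` needs one judgment per unordered pair of models, i.e. n(n-1)/2 per question, while `pairwise-baseline` needs only n. A small sketch (illustrative, not repository code):

```python
from math import comb

def num_comparisons(n_models, mode):
    """Judgments needed per question for each pairwise mode (illustrative)."""
    if mode == "pairwise-all":
        return comb(n_models, 2)  # every unordered pair of models
    if mode == "pairwise-baseline":
        return n_models           # each model vs. the fixed baseline
    raise ValueError(mode)
```

With 10 models, for example, `pairwise-all` requires 45 judgments per question versus 10 for `pairwise-baseline`.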
## Supported Baseline Models
To make pairwise comparison against existing Japanese LLMs more convenient, we provide the predictions of the following four baselines in `data/jp_bench/model_answer`.