ku-nlp
diff --git a/README.md b/README.md
Lines changed: 23 additions & 10 deletions
diff --git a/configs/openai--text-davinci-003.json b/configs/openai--text-davinci-003.json
Lines changed: 16 additions & 0 deletions
diff --git a/data/jp_bench/judge_prompts.jsonl b/data/jp_bench/judge_prompts.jsonl
Lines changed: 4 additions & 0 deletions
diff --git a/data/jp_bench/model_answer/cyberagent--calm2-7b-chat/config.json b/data/jp_bench/model_answer/cyberagent--calm2-7b-chat/config.json
Lines changed: 13 additions & 0 deletions
diff --git a/data/jp_bench/model_answer/cyberagent--calm2-7b-chat.jsonl b/data/jp_bench/model_answer/cyberagent--calm2-7b-chat/results.jsonl
rename from data/jp_bench/model_answer/cyberagent--calm2-7b-chat.jsonl
rename to data/jp_bench/model_answer/cyberagent--calm2-7b-chat/results.jsonl
diff --git a/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0/config.json b/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0/config.json
Lines changed: 14 additions & 0 deletions
diff --git a/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0.jsonl b/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0/results.jsonl
rename from data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0.jsonl
rename to data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0/results.jsonl
diff --git a/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0/config.json b/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0/config.json
Lines changed: 14 additions & 0 deletions
diff --git a/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0.jsonl b/data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0/results.jsonl
rename from data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0.jsonl
rename to data/jp_bench/model_answer/llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0/results.jsonl
diff --git a/data/jp_bench/model_answer/openai--text-davinci-003/config.json b/data/jp_bench/model_answer/openai--text-davinci-003/config.json
Lines changed: 16 additions & 0 deletions
@@ -44,18 +44,20 @@ There are several options to use GPT-4 as a judge, such as pairwise win-rate and
 OPENAI_API_KEY=<YOUR-KEY> python llm_judge/gen_judgment.py \
     --mode {single|pairwise-baseline|pairwise-all} \
     [--baseline-model <BASELINE-MODEL-ID>] \
-    [--model-list <LIST-OF-MODEL-IDS>]
+    [--model-list <LIST-OF-MODEL-IDS>] \
+    [--wandb]
 ```
 
 Arguments & Options:
 - `--mode {single|pairwise-baseline|pairwise-all}` is the mode of judgment.
-    - `pairwise-baseline`: run pairwise comparison against a baseline model.
+    - `pairwise-baseline`: run pairwise comparison against a baseline model. This is the default mode.
     - `pairwise-all`: run pairwise comparison between all model pairs.
     - `single`: run score-based single-model grading.
 - `--baseline-model <BASELINE-MODEL-ID>` is the model ID of the baseline model. This option is only available in `pairwise-baseline` mode. If not specified, the baseline model is set to `text-davinci-003`.
 - `--model-list <LIST-OF-MODEL-IDS>` is a list of model IDs to be evaluated. If not specified, all models in `data/jp_bench/model_answer` will be evaluated.
+- `--wandb` enables logging to W&B. You can also upload the results to W&B later by running `upload_result.py`, as described in Step 3.
 
-**Mode: `pairwise-baseline`**
+**Mode: `pairwise-baseline` (Default)**
 
 This mode runs pairwise comparison against a baseline model.
 By default, the baseline model is set to `text-davinci-003`.
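
As a rough illustration of how the two pairwise modes differ, the sketch below enumerates the match-ups each one would produce. `make_matches` and the model names `model-x` and `model-y` are hypothetical and are not part of `llm_judge`; the actual script may order or filter pairs differently.

```python
from itertools import combinations


def make_matches(mode, models, baseline="text-davinci-003"):
    """Enumerate (model_a, model_b) match-ups for a pairwise judging mode.

    Hypothetical helper for illustration only.
    """
    if mode == "pairwise-baseline":
        # Every non-baseline model is compared against the single baseline.
        return [(m, baseline) for m in models if m != baseline]
    if mode == "pairwise-all":
        # Every unordered pair of distinct models is compared once.
        return list(combinations(models, 2))
    raise ValueError(f"not a pairwise mode: {mode}")


models = ["model-x", "model-y", "text-davinci-003"]
baseline_matches = make_matches("pairwise-baseline", models)
all_matches = make_matches("pairwise-all", models)
print(baseline_matches)
print(all_matches)
```

With N models, `pairwise-baseline` yields at most N - 1 comparisons per question, while `pairwise-all` yields N(N - 1)/2, so the latter costs considerably more judge calls as the model list grows.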
@@ -114,6 +116,17 @@ python llm_judge/show_result.py \
     --model-list rinna--japanese-gpt-neox-3.6b-instruction-ppo
 ```
 
+#### Step 3. Upload the results to W&B (Optional)
+
+To upload the results to W&B, run the following command:
+
+```bash
+python llm_judge/upload_result.py \
+    --mode {single|pairwise-baseline|pairwise-all} \
+    [--baseline-model <BASELINE-MODEL-ID>] \
+    [--model-list <LIST-OF-MODEL-IDS>]
+```
+
 ## Sample Outputs
 
 Question: 植物性タンパク源と動物性タンパク源の違いは何ですか？
@@ -131,13 +144,13 @@ Model outputs:
 
 ## Pairwise win-rate compared with GPT-3.5-davinci-003
 
-| Model                                                    | Win | Loss | Tie | Win Rate | Loss Rate | Win Rate Adjusted |
-|----------------------------------------------------------|-----|------|-----|----------|-----------|-------------------|
-| llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 |  22 |   48 |  10 | 0.2750   | 0.6000    | 0.33750           |
-| rinna--japanese-gpt-neox-3.6b-instruction-ppo            |  10 |   61 |   9 | 0.1250   | 0.7625    | 0.18125           |
-| llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 |   7 |   65 |   8 | 0.0875   | 0.8125    | 0.13750           |
-| rinna--japanese-gpt-neox-3.6b-instruction-sft-v2         |   8 |   69 |   3 | 0.1000   | 0.8625    | 0.11875           |
-| cyberagent--calm2-7b-chat                                |   5 |   67 |   8 | 0.0625   | 0.8375    | 0.11250           |
+| Model                                                    | Win Rate | Loss Rate | Win Rate Adjusted |
+|----------------------------------------------------------|----------|-----------|-------------------|
+| llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 | 28.7     | 62.5      | 33.1              |
+| rinna--japanese-gpt-neox-3.6b-instruction-ppo            | 13.8     | 76.2      | 18.8              |
+| rinna--japanese-gpt-neox-3.6b-instruction-sft-v2         | 8.8      | 82.5      | 13.1              |
+| cyberagent--calm2-7b-chat                                | 6.2      | 81.2      | 12.5              |
+| llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 10.0     | 87.5      | 11.2              |
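
The counts in the removed rows are consistent with an adjusted win rate in which a tie counts as half a win. The sketch below reproduces that arithmetic as percentages. The formula is inferred from the numbers in the table, not taken from `show_result.py`, and `summarize` is a hypothetical name.

```python
def summarize(win, loss, tie):
    """Return win rate, loss rate, and tie-adjusted win rate as percentages."""
    total = win + loss + tie
    win_rate = 100 * win / total
    loss_rate = 100 * loss / total
    # Assumption: a tie counts as half a win in the adjusted rate.
    adjusted = 100 * (win + 0.5 * tie) / total
    return win_rate, loss_rate, adjusted


# Counts from the first removed row: 22 wins, 48 losses, 10 ties.
rates = summarize(22, 48, 10)
print(rates)  # (27.5, 60.0, 33.75)
```

Because the three outcomes sum to the total, the tie share can be recovered from any row as twice the gap between the adjusted and raw win rates.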
 
 ## Supported baseline Models
 
 
@@ -0,0 +1,16 @@
+{
+  "model_id": "openai--text-davinci-003",
+  "model_name_or_path": null,
+  "lora_model_name_or_path": null,
+  "tokenizer_name_or_path": null,
+  "prompt_template": "{instruction}",
+  "generation_config": {
+    "engine": "text-davinci-003",
+    "temperature": 0.0,
+    "max_tokens": 2048,
+    "top_p": 1.0,
+    "frequency_penalty": 0.0,
+    "presence_penalty": 0.0
+  },
+  "special_token_map": {}
+}
@@ -0,0 +1,4 @@
+{"name": "pair", "type": "pairwise", "system_prompt": "以下に示されるユーザーの質問に対して2人のAIアシスタントが提供した回答の質を評価してください。回答の内容がユーザーの指示に従っており、ユーザーの質問によりよく答えているアシスタントを選んでください。具体的には、回答の有用性、関連性、正確性、深さ、創造性、詳細レベルなどの要素を考慮する必要があります。評価の際には、まず2つの回答を比較し、簡単な説明をしてください。立場が偏らないようにし、回答の提示順があなたの判断に影響しないようにしてください。回答の長さが評価に影響しないこと、特定のアシスタントの名前を好まないこと、できるだけ客観的であること、に気をつけてください。説明の後に、最終的な判断を以下の形式に従って出力してください：アシスタントAが優れていれば[[A]]、アシスタントBが優れていれば[[B]]、同点の場合は[[C]]", "prompt_template": "[ユーザーの質問]\n{question}\n\n[アシスタントAの答えの始まり]\n{answer_a}\n[アシスタントAの答えの終わり]\n\n[アシスタントBの答えの始まり]\n{answer_b}\n[アシスタントBの答えの終わり]", "description": "Prompt for general questions", "category": "general", "output_format": "[[A]]"}
+{"name": "pair-math", "type": "pairwise", "system_prompt": "以下に示されるユーザーの質問に対して2人のAIアシスタントが提供した回答の質を評価してください。回答の内容がユーザーの指示に従っており、ユーザーの質問によりよく答えているアシスタントを選んでください。参考解答、アシスタントAの回答、アシスタントBの回答が与えられるので、どちらのアシスタントの回答が優れているかを評価してください。評価の際には、まずそれぞれのアシスタントの回答を参考解答と比較し、回答の誤りを見つけて修正してください。立場が偏らないようにし、回答の提示順があなたの判断に影響しないようにしてください。回答の長さが評価に影響しないこと、特定のアシスタントの名前を好まないこと、できるだけ客観的であること、に気をつけてください。説明の後に、最終的な判断を以下の形式に従って出力してください：アシスタントAが優れていれば[[A]]、アシスタントBが優れていれば[[B]]、同点の場合は[[C]]", "prompt_template":"[ユーザーの質問]\n{question}\n\n[参考解答の始まり]\n{ref_answer_1}\n[参考解答の終わり]\n\n[アシスタントAの答えの始まり]\n{answer_a}\n[アシスタントAの答えの終わり]\n\n[アシスタントBの答えの始まり]\n{answer_b}\n[アシスタントBの答えの終わり]", "description": "Prompt for math questions", "category": "math", "output_format": "[[A]]"}
+{"name": "single", "type": "single", "system_prompt": "あなたは役に立つアシスタントです。", "prompt_template": "[インストラクション]\n以下に示されるユーザーの質問に対してAIアシスタントが提供した回答の質を評価してください。具体的には、回答の有用性、関連性、正確性、深さ、創造性、詳細レベルなどの要素を考慮して評価してください。評価の際には、まず回答内容を簡単に、できるだけ客観的に説明してください。説明を行った後、必ず「[[rating]]」という形式で、回答を1から10の尺度で評価してください（例：[[5]]）。\".\n\n[ユーザーの質問]\n{question}\n\n[アシスタントの答えの始まり]\n{answer}\n[アシスタントの答えの終わり]", "description": "Prompt for general questions", "category": "general", "output_format": "[[rating]]"}
+{"name": "single-math", "type": "single", "system_prompt": "あなたは役に立つアシスタントです。", "prompt_template": "[インストラクション]\n以下に示されるユーザーの質問に対してAIアシスタントが提供した回答の質を評価してください。具体的には、回答の正しさと親切さを考慮して評価してください。参考解答とアシスタントの回答が与えられます。評価の際には、まずアシスタントの回答を参考解答と比較し、回答の誤りを見つけて修正してください。評価はできるだけ客観的に行ってください。最後に、必ず「[[rating]]」という形式で、回答を1から10の尺度で評価してください（例：[[5]]）。\".\n\n[ユーザーの質問]\n{question}\n\n[参考解答の始まり]\n{ref_answer_1}\n[参考解答の終わり]\n\n[アシスタントの答えの始まり]\n{answer}\n[アシスタントの答えの終わり]", "description": "Prompt for math questions", "category": "math", "output_format": "[[rating]]"}
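
The `output_format` fields above ask the judge to end with `[[A]]`, `[[B]]`, or `[[C]]` (pairwise) or with a bracketed 1 to 10 rating (single). A minimal sketch of extracting those verdicts follows. The regular expressions and the function names `parse_pairwise` and `parse_rating` are assumptions for illustration, not the parsing code used by `gen_judgment.py`.

```python
import re

# Assumed patterns, based on the "output_format" fields in judge_prompts.jsonl.
PAIRWISE_PATTERN = re.compile(r"\[\[([ABC])\]\]")
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_pairwise(judgment):
    """Return 'A', 'B', 'tie', or None if no verdict is found."""
    match = PAIRWISE_PATTERN.search(judgment)
    if match is None:
        return None
    # [[C]] is the tie marker in the pairwise prompts.
    return {"A": "A", "B": "B", "C": "tie"}[match.group(1)]


def parse_rating(judgment):
    """Return the rating as a float, or None if missing or outside 1-10."""
    match = RATING_PATTERN.search(judgment)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 1 <= rating <= 10 else None


print(parse_pairwise("アシスタントAの回答の方が詳細です。[[A]]"))
print(parse_rating("回答は概ね正確です。[[7]]"))
```

Returning `None` rather than raising lets a caller count unparseable judgments separately instead of silently treating them as ties or low scores.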
@@ -0,0 +1,13 @@
+{
+  "model_id": "cyberagent--calm2-7b-chat",
+  "model_name_or_path": "cyberagent/calm2-7b-chat",
+  "lora_model_name_or_path": null,
+  "tokenizer_name_or_path": null,
+  "prompt_template": "USER: {instruction}\nASSISTANT: ",
+  "generation_config": {
+    "do_sample": true,
+    "max_length": 2048,
+    "temperature": 0.8
+  },
+  "special_token_map": {}
+}
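
To show how a `prompt_template` such as the one above might be applied, the sketch below substitutes a question into the `{instruction}` placeholder. Using `str.format` is an assumption about how the repository fills the template, and `build_prompt` is a hypothetical helper.

```python
import json

# Fields copied from the config above; generation settings omitted for brevity.
config = json.loads(r"""
{
  "model_id": "cyberagent--calm2-7b-chat",
  "prompt_template": "USER: {instruction}\nASSISTANT: "
}
""")


def build_prompt(config, instruction):
    """Substitute the question into the model-specific prompt template."""
    # Assumption: the template contains no braces other than {instruction}.
    return config["prompt_template"].format(instruction=instruction)


prompt = build_prompt(config, "植物性タンパク源と動物性タンパク源の違いは何ですか？")
print(prompt)
```

Keeping the template in `config.json` next to the answers records which prompt format produced a given `results.jsonl`.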
@@ -0,0 +1,14 @@
+{
+  "model_id": "llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0",
+  "model_name_or_path": "llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0",
+  "lora_model_name_or_path": null,
+  "tokenizer_name_or_path": null,
+  "prompt_template": "{instruction} ### 回答：",
+  "generation_config": {
+    "do_sample": true,
+    "max_length": 2048,
+    "temperature": 0.7,
+    "top_p": 0.95
+  },
+  "special_token_map": {}
+}
@@ -0,0 +1,14 @@
+{
+  "model_id": "llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0",
+  "model_name_or_path": "llm-jp/llm-jp-13b-v1.0",
+  "lora_model_name_or_path": "llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0",
+  "tokenizer_name_or_path": null,
+  "prompt_template": "{instruction} ### 回答：",
+  "generation_config": {
+    "do_sample": true,
+    "max_length": 2048,
+    "temperature": 0.7,
+    "top_p": 0.95
+  },
+  "special_token_map": {}
+}
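
The configs in this change set differ in which path fields are populated: the LoRA model names both a base checkpoint and an adapter, the full fine-tune names only a checkpoint, and the OpenAI baseline leaves both paths `null` and sets an `engine`. The sketch below classifies a config on that basis. `describe_config` and its labels are hypothetical; the repository's loader may branch on these fields differently.

```python
def describe_config(config):
    """Classify a model config by which fields are set (illustrative only)."""
    if config.get("lora_model_name_or_path"):
        # Base weights plus a LoRA adapter applied on top.
        return "lora"
    if config.get("model_name_or_path"):
        # A single checkpoint with no adapter.
        return "full"
    if "engine" in config.get("generation_config", {}):
        # No local weights; answers come from an API engine.
        return "api"
    return "unknown"


lora_cfg = {
    "model_name_or_path": "llm-jp/llm-jp-13b-v1.0",
    "lora_model_name_or_path": "llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0",
}
full_cfg = {
    "model_name_or_path": "llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0",
    "lora_model_name_or_path": None,
}
api_cfg = {
    "model_name_or_path": None,
    "lora_model_name_or_path": None,
    "generation_config": {"engine": "text-davinci-003"},
}

for cfg in (lora_cfg, full_cfg, api_cfg):
    print(describe_config(cfg))
```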
@@ -0,0 +1,16 @@
+{
+  "model_id": "openai--text-davinci-003",
+  "model_name_or_path": null,
+  "lora_model_name_or_path": null,
+  "tokenizer_name_or_path": null,
+  "prompt_template": "{instruction}",
+  "generation_config": {
+    "engine": "text-davinci-003",
+    "temperature": 0.0,
+    "max_tokens": 2048,
+    "top_p": 1.0,
+    "frequency_penalty": 0.0,
+    "presence_penalty": 0.0
+  },
+  "special_token_map": {}
+}
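
With this change each model's answers move from `model_answer/<model_id>.jsonl` to `model_answer/<model_id>/results.jsonl`, alongside a `config.json`. The sketch below discovers model IDs under that layout using a temporary mock directory. `list_model_ids` is a hypothetical helper and does not reflect how `gen_judgment.py` actually resolves the default model list.

```python
import tempfile
from pathlib import Path


def list_model_ids(answer_dir):
    """Return model IDs whose directory contains a results.jsonl file."""
    return sorted(
        path.parent.name for path in Path(answer_dir).glob("*/results.jsonl")
    )


# Build a throwaway directory tree that mimics the new layout.
with tempfile.TemporaryDirectory() as tmp:
    for model_id in ["cyberagent--calm2-7b-chat", "openai--text-davinci-003"]:
        model_dir = Path(tmp) / model_id
        model_dir.mkdir()
        (model_dir / "results.jsonl").touch()
        (model_dir / "config.json").write_text("{}")
    # A directory without results.jsonl is ignored.
    (Path(tmp) / "incomplete-model").mkdir()
    model_ids = list_model_ids(tmp)

print(model_ids)
```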