Commit 0c4a71f

committed
misc
1 parent b3d8853 commit 0c4a71f

File tree

9 files changed: +288 −72 lines changed
Lines changed: 71 additions & 2 deletions
@@ -1,2 +1,71 @@
-"Comparing gpt-4o and gpt-4o-mini, GPT-4o exhibits a higher level of optimal accuracy compared to GPT-4o-mini, making it more reliable for tasks that demand precision and correctness in output.
-And GPT-4o-mini is significantly more cost-effective than GPT-4o. ($1.7 compared to $24)
+# Answer:
+
+Comparing gpt-4o and gpt-4o-mini, GPT-4o exhibits a higher optimal accuracy than GPT-4o-mini, making it more reliable for tasks that demand precision and correctness in output, even when the number of reasoning steps for gpt-4o is increased to the maximum provided for this task.
+
+# Design:
+
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "datasets=gsm8k"
+  ],
+  "independent_vars": [
+    "model=gpt-4o-mini, gpt-4o",
+    "reasoning_steps=use all reasoning steps for the gsm8k task, i.e., 1, 2, 3 steps"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset
+
+Use the dataset: gsm8k.
+
+3. Run Experiments for GPT-4o-mini
+
+Run inference using run_inference.py with:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Systematically vary the number of reasoning steps by choosing the appropriate demo files (e.g., gsm8k_1, gsm8k_2, gsm8k_3).
+
+Save outputs and log files clearly.
+
+4. Run Experiments for GPT-4o
+
+Repeat inference using:
+
+args.method: auto_cot
+args.model: gpt-4o
+
+Similarly vary reasoning steps with the provided demo files.
+
+Save outputs and logs clearly.
+
+5. Evaluate Accuracy
+
+Extract accuracy metrics from the log files.
+
+Identify pairs of log files (one from each model) where the accuracies achieved are similar.
+
+6. Analyze and Summarize Findings
+
+For each comparable-accuracy scenario, summarize:
+
+Dataset (gsm8k)
+Achieved accuracy
+Number of reasoning steps for each model
+Computational cost comparison
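The per-model runs in steps 3–4 above can be scripted as a single loop. Below is a minimal sketch; the flag names (--method, --model, --dataset, --demo_path, --output_dir) and paths are assumptions for illustration — check the argparse definitions in run_inference.py for the actual names.

```python
from itertools import product

MODELS = ["gpt-4o-mini", "gpt-4o"]
DEMOS = ["gsm8k_1", "gsm8k_2", "gsm8k_3"]  # demo file per number of reasoning steps

def build_commands(models=MODELS, demos=DEMOS):
    """Return one run_inference.py command per (model, demo) combination."""
    commands = []
    for model, demo in product(models, demos):
        commands.append([
            "python", "run_inference.py",
            "--method", "auto_cot",
            "--model", model,
            "--dataset", "gsm8k",
            "--demo_path", f"demo/{demo}",           # assumed flag and path
            "--output_dir", f"logs/{model}_{demo}",  # assumed flag and path
        ])
    return commands

for cmd in build_commands():
    print(" ".join(cmd))
```

Each command could then be executed with subprocess.run, keeping logs separated per model and step count as step 5 requires.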
Lines changed: 67 additions & 1 deletion
@@ -1,8 +1,74 @@
-"The effectiveness of a certain method (e.g., read question, self-verification, making equations, repeat state, think about words) largely depends on the nature of the question. For a math-question dataset such as gsm8k, making equations improves the accuracy.
+# Answer:
 
 Ground truth: Equations, think about words, and self-verification are better than repeating the question.
 
+This is an example of approximately the results you should obtain:
+
 Repeating the question: 93.1
 Self-verification: 93.4
 Making equations: 93.5
 Think about words: 93.6
+
+# Design:
+
+{
+  "constant_vars": [
+    "datasets=gsm8k",
+    "model=gpt-4o-mini",
+    "method for increasing reasoning_steps=Auto-CoT"
+  ],
+  "independent_vars": [
+    "reasoning_expansion_methods=Repeating the question, Self-verification, Making equations, Think about words"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset
+
+Use the dataset: gsm8k.
+
+3. Run Experiments with Different Reasoning Strategies
+
+Use run_inference.py with these fixed parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+args.dataset: gsm8k
+
+Evaluate multiple reasoning expansion strategies using the provided demo files, namely:
+
+Repeating the question (gsm8k_readquestion)
+Self-verification (gsm8k_selfverification)
+Making equations (gsm8k_makeequations)
+Think about words (gsm8k_thinkaboutwords)
+
+Run inference separately for each reasoning strategy and save logs clearly labeled by strategy.
+
+4. Evaluate Accuracy
+
+Extract accuracy from the logs generated for each reasoning strategy (accuracy is reported at the end of each log).
+
+5. Compare and Summarize Results
+
+For each reasoning strategy, clearly summarize:
+
+Dataset (gsm8k)
+Strategy tested (e.g., repeating the question, self-verification)
+Accuracy achieved
+
+Analyze and determine which reasoning expansion strategies performed better.
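Steps 4–5 above (extracting accuracy from logs and ranking strategies) can be sketched as follows. This assumes each log ends with a line like `accuracy : 93.4`; the exact log format produced by run_inference.py may differ, so adjust the pattern accordingly.

```python
import re

def extract_accuracy(log_text):
    """Return the last accuracy figure reported in a log, or None if absent."""
    matches = re.findall(r"accuracy\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)",
                         log_text, re.IGNORECASE)
    return float(matches[-1]) if matches else None

def rank_strategies(logs):
    """logs: {strategy_name: log_text} -> (strategy, accuracy) pairs, best first."""
    scores = {name: extract_accuracy(text) for name, text in logs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder log tails using the illustrative figures from the answer above:
logs = {
    "Repeating the question": "... accuracy : 93.1",
    "Self-verification": "... accuracy : 93.4",
    "Making equations": "... accuracy : 93.5",
    "Think about words": "... accuracy : 93.6",
}
print(rank_strategies(logs))
```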
Lines changed: 60 additions & 49 deletions
@@ -1,49 +1,60 @@
-"Ground truth: More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined.
-
-Example:
-
-#### Control Group Results (`gsm8k` with `gsm8k_1`):
-- **Result 1 Accuracy:** 92.5
-- **Result 2 Accuracy:** 90.0
-
-#### Experimental Group Results:
-
-##### `gsm8k` Dataset
-- **Demo File:** `gsm8k_2`
-- **Result 1 Accuracy:** 90.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `gsm8k_3`
-- **Result 1 & 2 Accuracy:** 92.5
-
-##### `last_letters` Dataset
-- **Demo File:** `last_letters_1`
-- **Result 1 Accuracy:** 90.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `last_letters_2`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_3`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_4`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_5`
-- **Result 1 Accuracy:** 95.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `last_letters_6`
-- **Result 1 & 2 Accuracy:** 57.5
-- **Demo File:** `last_letters_10`
-- **Result 1 & 2 Accuracy:** 0.0
-
-### Analysis and Conclusion
-
-1. **Task Complexity and Reasoning Steps:**
-   - For `gsm8k`, the accuracy was higher with demo files that added more reasoning steps (`gsm8k_3`).
-   - For `last_letters`, demo files with moderate reasoning steps (`last_letters_2`, `last_letters_3`, `last_letters_4`) had the highest accuracy.
-   - `last_letters_6` with longer reasoning steps showed a drop in accuracy, indicating a threshold beyond which additional reasoning steps are detrimental.
-   - `last_letters_10` resulted in 0% accuracy, suggesting excessive reasoning steps led to failure in task performance.
-
-2. **Optimal Reasoning Steps:**
-   - `gsm8k`: Optimal steps are seen in `gsm8k_3`.
-   - `last_letters`: Optimal steps are seen in `last_letters_2`, `last_letters_3`, `last_letters_4`.
-
-3. **Impact of Task Complexity:**
-   - More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined."
+# Answer:
+
+More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined.
+
+Correctly identify gsm8k as the more complex task of the two.
+
+`gsm8k`: Optimal steps are seen in `gsm8k_3`.
+`last_letters`: Optimal steps are seen starting in `last_letters_2`, `last_letters_3`, `last_letters_4`.
+
+# Design:
+
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "model=gpt-4o-mini"
+  ],
+  "independent_vars": [
+    "datasets=gsm8k, last_letters",
+    "reasoning_steps=use at least 3 reasoning steps for each dataset"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Datasets and Reasoning Steps
+
+Select two datasets for testing: gsm8k (math reasoning) and last_letters (pattern recognition).
+
+Use the provided demo files to systematically vary the number of reasoning steps (e.g., gsm8k_1, gsm8k_2, last_letters_3, etc.).
+
+3. Run Experiments
+
+Call run_inference.py using the following parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Execute multiple runs per dataset, incrementally adjusting the reasoning steps through the corresponding demo files.
+
+Compare dataset task-complexity assessments (simple analysis is fine, even if it is in the conclusion) with the optimal reasoning steps.
+
+4. Analyze and Summarize Findings
+
+Summarize clearly for each dataset:
+
+Dataset name
+Optimal reasoning steps identified
+Task complexity analysis
+
+Discuss insights regarding how task complexity influences the optimal reasoning chain length.
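The "optimal reasoning steps" comparison in step 4 can be automated once accuracies are collected. A minimal sketch, using figures averaged over the two runs in the example results above (the numbers are illustrative placeholders, not definitive benchmarks):

```python
def optimal_steps(acc_by_steps):
    """Given {num_steps: accuracy}, return the smallest step count
    that reaches the peak accuracy."""
    best = max(acc_by_steps.values())
    return min(n for n, a in acc_by_steps.items() if a == best)

# Mean accuracy over two runs, shaped like the example results above:
results = {
    "gsm8k":        {1: 91.25, 2: 91.25, 3: 92.5},   # complex task: improves with more steps
    "last_letters": {1: 91.25, 2: 95.0, 3: 95.0, 4: 95.0,
                     5: 93.75, 6: 57.5, 10: 0.0},     # simple task: collapses past a threshold
}
for dataset, accs in results.items():
    print(dataset, "-> optimal at", optimal_steps(accs), "step(s)")
```

Plotting accuracy against step count per dataset makes the complexity-dependent peak (late for gsm8k, early for last_letters) immediately visible.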
Lines changed: 61 additions & 16 deletions
@@ -1,22 +1,67 @@
-"Ground truth: The hypothesis that early errors are more detrimental to the overall reasoning process than later errors is supported by the data. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance.
+# Answer:
 
-#### Control Group
-- **Dataset:** gsm8k
-- **Accuracy:** 92.5% (consistent across two runs)
+Early errors are more detrimental to the overall reasoning process than later errors. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance.
 
-#### Experimental Group
-- **Dataset:** gsm8k
-- **Accuracy with Early Errors:** 92.5% (consistent across two runs)
-- **Accuracy with Later Errors:** 95.0% (consistent across two runs)
+# Design:
 
-### Analysis
-The experimental results provide a clear comparison between the impacts of early and later errors in the reasoning chain:
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "model=gpt-4o-mini",
+    "datasets=gsm8k"
+  ],
+  "independent_vars": [
+    "reasoning_demo=use the gsm8k_early demo and the gsm8k_later demo"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
 
-1. **Early Errors:**
-   - The accuracy remains at 92.5%, which is consistent with the control group, indicating that early errors significantly affect the logical flow, maintaining a similar accuracy as when no errors are introduced.
+# Setup:
 
-2. **Later Errors:**
-   - The accuracy improves to 95.0%, suggesting that later errors are less detrimental, allowing the logical process to achieve higher accuracy.
+1. Environment Preparation
 
-### Conclusion
-The hypothesis that early errors are more detrimental to the overall reasoning process than later errors is supported by the data. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance."
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset and Demos
+
+Dataset: gsm8k (math reasoning).
+
+Test two demo conditions that differ by error placement:
+
+Early errors demo (gsm8k_early)
+Later errors demo (gsm8k_later)
+
+3. Run Experiments
+
+Use run_inference.py with the following parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Vary args.demo_path to use each demo condition (gsm8k_early, gsm8k_later).
+
+Save inference outputs and logs clearly labeled by error condition.
+
+4. Evaluate Accuracy
+
+Review accuracy from the log files for each condition (accuracy is reported at the end of each log).
+
+5. Analyze Results
+
+Summarize results clearly, noting:
+
+Dataset (gsm8k)
+Accuracy for the demo with early errors
+Accuracy for the demo with later errors
+
+6. Draw Conclusions
+
+Discuss how error position affects model performance:
+
+Is the impact on accuracy greater when errors occur earlier versus later in reasoning?
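The comparison in steps 5–6 reduces to a small summary computation. A sketch with illustrative placeholder figures (a hypothetical error-free control run is assumed for the deltas; only the early/later comparison is required by the setup above):

```python
def compare_error_positions(control, early, later):
    """Summarize accuracy shifts for early vs. later error placement,
    relative to an error-free control accuracy."""
    return {
        "early_delta": round(early - control, 2),
        "later_delta": round(later - control, 2),
        "early_more_detrimental": early < later,
    }

# Illustrative placeholder accuracies (percent):
summary = compare_error_positions(control=92.5, early=92.5, later=95.0)
print(summary)
```

A larger accuracy drop (more negative delta) for the early-error condition supports the hypothesis that errors early in the chain disrupt reasoning more.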
Lines changed: 7 additions & 1 deletion
@@ -1 +1,7 @@
-Considering that larger models generally perform better, would it be more cost-effective to use a smaller model with longer reasoning chains or a larger model with fewer steps for a given level of accuracy?
+Considering that larger models generally perform better, would it be more cost-effective to use a smaller model with longer reasoning chains or a larger model with fewer steps, if the goal were to achieve optimal accuracy?
+
+Additional details:
+- Use GPT-4o-mini and GPT-4o as the models.
+- Use the gsm8k dataset.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models

benchmark/experimentation_bench/llm_reasoning_2/q4.txt

Lines changed: 1 addition & 0 deletions
@@ -5,4 +5,5 @@ Additional details:
 - Test this for the last_letters dataset, which will be our process-oriented task.
 - Use GPT-4o-mini as the model.
 - Use the Auto-CoT method for increasing the number of reasoning steps.
+- The incorrect step is located in the demo file `last_letters_false` in the repo below.
 - Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 6 additions & 1 deletion
@@ -1,2 +1,7 @@
-How do different methods of expanding reasoning steps (e.g., repeating the question, self-verification, making equations) affect the model's accuracy, and are some expansion strategies more effective than others?
+How do different methods of expanding reasoning steps (i.e., repeating the question, self-verification, making equations, thinking about words) affect the model's accuracy, and are some expansion strategies more effective than others?
 
+Additional details:
+- Test this for the gsm8k dataset.
+- Use GPT-4o-mini as the model.
+- The demo files needed to utilize these strategies/methods are already available in the repo below.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 7 additions & 1 deletion
@@ -1 +1,7 @@
-What is the relationship between the complexity of a task (e.g., as measured by the number of logical inferences or mathematical operations needed) and the optimal length of the reasoning chain?
+What is the relationship between the complexity of a task (i.e., as measured by the number of logical inferences or mathematical operations needed) and the optimal length of the reasoning chain?
+
+Additional details:
+- Use GPT-4o-mini as the model.
+- Use the gsm8k and last_letters datasets.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 8 additions & 1 deletion
@@ -1 +1,8 @@
-How does the position of an incorrect step within the reasoning chain affect the overall outcome? Is an early error more detrimental than a later one?
+How does the position of an incorrect step within the reasoning chain affect the overall outcome? Is an early error more detrimental than a later one?
+
+Additional details:
+- Use GPT-4o-mini as the model.
+- Use the gsm8k dataset.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- The early-error demo file is located in `gsm8k_early`, and the later-error demo file is in `gsm8k_later`, both in the repo below.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models

0 commit comments

Comments
 (0)