Commit 0c4a71f

committed
misc
1 parent b3d8853 commit 0c4a71f

File tree

9 files changed: +288 −72 lines changed
Lines changed: 71 additions & 2 deletions
@@ -1,2 +1,71 @@
-"Comparing gpt-4o and gpt-4o-mini, GPT-4o exhibits a higher level of optimal accuracy compared to GPT-4o-mini, making it more reliable for tasks that demand precision and correctness in output.
-And GPT-4o-mini is significantly more cost-effective than GPT-4o. ($1.7 compared to $24)
+# Answer:
+
+Comparing gpt-4o and gpt-4o-mini, GPT-4o exhibits a higher optimal accuracy than GPT-4o-mini, making it more reliable for tasks that demand precision and correctness in output, even when the number of reasoning steps for gpt-4o is increased to the maximum provided for this task.
+
+# Design:
+
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "datasets=gsm8k"
+  ],
+  "independent_vars": [
+    "model=gpt-4o-mini, gpt-4o",
+    "reasoning_steps=use all reasoning steps for the gsm8k task, i.e., 1, 2, 3 steps"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset
+
+Use the dataset: gsm8k.
+
+3. Run Experiments for GPT-4o-mini
+
+Run inference using run_inference.py with:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Systematically vary the number of reasoning steps by choosing the appropriate demo files (e.g., gsm8k_1, gsm8k_2, gsm8k_3).
+
+Save outputs and log files clearly.
+
+4. Run Experiments for GPT-4o
+
+Repeat inference using:
+
+args.method: auto_cot
+args.model: gpt-4o
+
+Similarly vary reasoning steps with the provided demo files.
+
+Save outputs and logs clearly.
+
+5. Evaluate Accuracy
+
+Extract accuracy metrics from the log files.
+
+Identify pairs of log files (one from each model) where the accuracies achieved are similar.
+
+6. Analyze and Summarize Findings
+
+For each comparable-accuracy scenario, summarize:
+
+Dataset (gsm8k)
+Achieved accuracy
+Number of reasoning steps for each model
+Computational cost comparison
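The per-model runs in steps 3–4 above can be scripted as a single loop. Below is a minimal sketch; the flag names (--method, --model, --dataset, --demo_path, --output_dir) and paths are assumptions for illustration — check the argparse definitions in run_inference.py for the actual names.

```python
from itertools import product

MODELS = ["gpt-4o-mini", "gpt-4o"]
DEMOS = ["gsm8k_1", "gsm8k_2", "gsm8k_3"]  # demo file per number of reasoning steps

def build_commands(models=MODELS, demos=DEMOS):
    """Return one run_inference.py command per (model, demo) combination."""
    commands = []
    for model, demo in product(models, demos):
        commands.append([
            "python", "run_inference.py",
            "--method", "auto_cot",
            "--model", model,
            "--dataset", "gsm8k",
            "--demo_path", f"demo/{demo}",           # assumed flag and path
            "--output_dir", f"logs/{model}_{demo}",  # assumed flag and path
        ])
    return commands

for cmd in build_commands():
    print(" ".join(cmd))
```

Each command could then be executed with subprocess.run, keeping logs separated per model and step count as step 5 requires.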
Lines changed: 67 additions & 1 deletion
@@ -1,8 +1,74 @@
-"The effectiveness of a certain method (e.g., read question, self-verification, making equations, repeat state, think about words) largely depends on the nature of the question. For a math-question dataset such as gsm8k, making equations improves the accuracy.
+# Answer:
 
 Ground truth: Equations, think about words, and self-verification are better than repeating the question.
 
+This is an example of approximately the results you should obtain:
+
 Repeating the question: 93.1
 Self-verification: 93.4
 Making equations: 93.5
 Think about words: 93.6
+
+# Design:
+
+{
+  "constant_vars": [
+    "datasets=gsm8k",
+    "model=gpt-4o-mini",
+    "method for increasing reasoning_steps=Auto-CoT"
+  ],
+  "independent_vars": [
+    "reasoning_expansion_methods=Repeating the question, Self-verification, Making equations, Think about words"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset
+
+Use the dataset: gsm8k.
+
+3. Run Experiments with Different Reasoning Strategies
+
+Use run_inference.py with these fixed parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+args.dataset: gsm8k
+
+Evaluate multiple reasoning expansion strategies using the provided demo files, namely:
+
+Repeating the question (gsm8k_readquestion)
+Self-verification (gsm8k_selfverification)
+Making equations (gsm8k_makeequations)
+Think about words (gsm8k_thinkaboutwords)
+
+Run inference separately for each reasoning strategy and save logs clearly labeled by strategy.
+
+4. Evaluate Accuracy
+
+Extract accuracy from the logs generated for each reasoning strategy (accuracy is reported at the end of each log).
+
+5. Compare and Summarize Results
+
+For each reasoning strategy, clearly summarize:
+
+Dataset (gsm8k)
+Strategy tested (e.g., repeating the question, self-verification)
+Accuracy achieved
+
+Analyze and determine which reasoning expansion strategies performed better.
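Steps 4–5 above (extracting accuracy from logs and ranking strategies) can be sketched as follows. This assumes each log ends with a line like `accuracy : 93.4`; the exact log format produced by run_inference.py may differ, so adjust the pattern accordingly.

```python
import re

def extract_accuracy(log_text):
    """Return the last accuracy figure reported in a log, or None if absent."""
    matches = re.findall(r"accuracy\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)",
                         log_text, re.IGNORECASE)
    return float(matches[-1]) if matches else None

def rank_strategies(logs):
    """logs: {strategy_name: log_text} -> (strategy, accuracy) pairs, best first."""
    scores = {name: extract_accuracy(text) for name, text in logs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder log tails using the illustrative figures from the answer above:
logs = {
    "Repeating the question": "... accuracy : 93.1",
    "Self-verification": "... accuracy : 93.4",
    "Making equations": "... accuracy : 93.5",
    "Think about words": "... accuracy : 93.6",
}
print(rank_strategies(logs))
```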
Lines changed: 60 additions & 49 deletions
@@ -1,49 +1,60 @@
-"Ground truth: More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined.
-
-Example:
-
-#### Control Group Results (`gsm8k` with `gsm8k_1`):
-- **Result 1 Accuracy:** 92.5
-- **Result 2 Accuracy:** 90.0
-
-#### Experimental Group Results:
-
-##### `gsm8k` Dataset
-- **Demo File:** `gsm8k_2`
-- **Result 1 Accuracy:** 90.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `gsm8k_3`
-- **Result 1 & 2 Accuracy:** 92.5
-
-##### `last_letters` Dataset
-- **Demo File:** `last_letters_1`
-- **Result 1 Accuracy:** 90.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `last_letters_2`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_3`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_4`
-- **Result 1 & 2 Accuracy:** 95.0
-- **Demo File:** `last_letters_5`
-- **Result 1 Accuracy:** 95.0
-- **Result 2 Accuracy:** 92.5
-- **Demo File:** `last_letters_6`
-- **Result 1 & 2 Accuracy:** 57.5
-- **Demo File:** `last_letters_10`
-- **Result 1 & 2 Accuracy:** 0.0
-
-### Analysis and Conclusion
-
-1. **Task Complexity and Reasoning Steps:**
-   - For `gsm8k`, the accuracy was higher with demo files that added more reasoning steps (`gsm8k_3`).
-   - For `last_letters`, demo files with moderate reasoning steps (`last_letters_2`, `last_letters_3`, `last_letters_4`) had the highest accuracy.
-   - `last_letters_6` with longer reasoning steps showed a drop in accuracy, indicating a threshold beyond which additional reasoning steps are detrimental.
-   - `last_letters_10` resulted in 0% accuracy, suggesting excessive reasoning steps led to failure in task performance.
-
-2. **Optimal Reasoning Steps:**
-   - `gsm8k`: Optimal steps are seen in `gsm8k_3`.
-   - `last_letters`: Optimal steps are seen in `last_letters_2`, `last_letters_3`, `last_letters_4`.
-
-3. **Impact of Task Complexity:**
-   - More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined."
+# Answer:
+
+More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined.
+
+Correctly identify gsm8k as the more complex task of the two.
+
+`gsm8k`: Optimal steps are seen in `gsm8k_3`.
+`last_letters`: Optimal steps are seen starting in `last_letters_2`, `last_letters_3`, `last_letters_4`.
+
+# Design:
+
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "model=gpt-4o-mini"
+  ],
+  "independent_vars": [
+    "datasets=gsm8k, last_letters",
+    "reasoning_steps=use at least 3 reasoning steps for each dataset"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
+
+# Setup:
+
+1. Environment Preparation
+
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Datasets and Reasoning Steps
+
+Select two datasets for testing: gsm8k (math reasoning) and last_letters (pattern recognition).
+
+Use the provided demo files to systematically vary the number of reasoning steps (e.g., gsm8k_1, gsm8k_2, last_letters_3, etc.).
+
+3. Run Experiments
+
+Call run_inference.py using the following parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Execute multiple runs per dataset, incrementally adjusting the reasoning steps through the corresponding demo files.
+
+Compare dataset task-complexity assessments (simple analysis is fine, even if it is in the conclusion) with the optimal reasoning steps.
+
+4. Analyze and Summarize Findings
+
+Summarize clearly for each dataset:
+
+Dataset name
+Optimal reasoning steps identified
+Task complexity analysis
+
+Discuss insights regarding how task complexity influences the optimal reasoning chain length.
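The "optimal reasoning steps" comparison in step 4 can be automated once accuracies are collected. A minimal sketch, using figures averaged over the two runs in the example results above (the numbers are illustrative placeholders, not definitive benchmarks):

```python
def optimal_steps(acc_by_steps):
    """Given {num_steps: accuracy}, return the smallest step count
    that reaches the peak accuracy."""
    best = max(acc_by_steps.values())
    return min(n for n, a in acc_by_steps.items() if a == best)

# Mean accuracy over two runs, shaped like the example results above:
results = {
    "gsm8k":        {1: 91.25, 2: 91.25, 3: 92.5},   # complex task: improves with more steps
    "last_letters": {1: 91.25, 2: 95.0, 3: 95.0, 4: 95.0,
                     5: 93.75, 6: 57.5, 10: 0.0},     # simple task: collapses past a threshold
}
for dataset, accs in results.items():
    print(dataset, "-> optimal at", optimal_steps(accs), "step(s)")
```

Plotting accuracy against step count per dataset makes the complexity-dependent peak (late for gsm8k, early for last_letters) immediately visible.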
Lines changed: 61 additions & 16 deletions
@@ -1,22 +1,67 @@
-"Ground truth: The hypothesis that early errors are more detrimental to the overall reasoning process than later errors is supported by the data. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance.
+# Answer:
 
-#### Control Group
-- **Dataset:** gsm8k
-- **Accuracy:** 92.5% (consistent across two runs)
+Early errors are more detrimental to the overall reasoning process than later errors. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance.
 
-#### Experimental Group
-- **Dataset:** gsm8k
-- **Accuracy with Early Errors:** 92.5% (consistent across two runs)
-- **Accuracy with Later Errors:** 95.0% (consistent across two runs)
+# Design:
 
-### Analysis
-The experimental results provide a clear comparison between the impacts of early and later errors in the reasoning chain:
+{
+  "constant_vars": [
+    "method for increasing reasoning_steps=Auto-CoT",
+    "model=gpt-4o-mini",
+    "datasets=gsm8k"
+  ],
+  "independent_vars": [
+    "reasoning_demo=use the gsm8k_early demo and the gsm8k_later demo"
+  ],
+  "dependent_vars": [
+    "accuracy"
+  ]
+}
 
-1. **Early Errors:**
-   - The accuracy remains at 92.5%, which is consistent with the control group, indicating that early errors significantly affect the logical flow, maintaining a similar accuracy as when no errors are introduced.
+# Setup:
 
-2. **Later Errors:**
-   - The accuracy improves to 95.0%, suggesting that later errors are less detrimental, allowing the logical process to achieve higher accuracy.
+1. Environment Preparation
 
-### Conclusion
-The hypothesis that early errors are more detrimental to the overall reasoning process than later errors is supported by the data. Early errors disrupt the logical flow more significantly, impacting the model's performance. In contrast, later errors allow the model to maintain a higher accuracy, showing less impact on overall performance."
+Ensure your Python environment and dependencies are correctly configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
+
+2. Select Dataset and Demos
+
+Dataset: gsm8k (math reasoning).
+
+Test two demo conditions that differ by error placement:
+
+Early errors demo (gsm8k_early)
+Later errors demo (gsm8k_later)
+
+3. Run Experiments
+
+Use run_inference.py with the following parameters:
+
+args.method: auto_cot
+args.model: gpt-4o-mini
+
+Vary args.demo_path to use each demo condition (gsm8k_early, gsm8k_later).
+
+Save inference outputs and logs clearly labeled by error condition.
+
+4. Evaluate Accuracy
+
+Review accuracy from the log files for each condition (accuracy is reported at the end of each log).
+
+5. Analyze Results
+
+Summarize results clearly, noting:
+
+Dataset (gsm8k)
+Accuracy for the demo with early errors
+Accuracy for the demo with later errors
+
+6. Draw Conclusions
+
+Discuss how error position affects model performance:
+
+Is the impact on accuracy greater when errors occur earlier versus later in reasoning?
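The comparison in steps 5–6 reduces to a small summary computation. A sketch with illustrative placeholder figures (a hypothetical error-free control run is assumed for the deltas; only the early/later comparison is required by the setup above):

```python
def compare_error_positions(control, early, later):
    """Summarize accuracy shifts for early vs. later error placement,
    relative to an error-free control accuracy."""
    return {
        "early_delta": round(early - control, 2),
        "later_delta": round(later - control, 2),
        "early_more_detrimental": early < later,
    }

# Illustrative placeholder accuracies (percent):
summary = compare_error_positions(control=92.5, early=92.5, later=95.0)
print(summary)
```

A larger accuracy drop (more negative delta) for the early-error condition supports the hypothesis that errors early in the chain disrupt reasoning more.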
Lines changed: 7 additions & 1 deletion
@@ -1 +1,7 @@
-Considering that larger models generally perform better, would it be more cost-effective to use a smaller model with longer reasoning chains or a larger model with fewer steps for a given level of accuracy?
+Considering that larger models generally perform better, would it be more cost-effective to use a smaller model with longer reasoning chains or a larger model with fewer steps, if the goal were to achieve optimal accuracy?
+
+Additional details:
+- Use GPT-4o-mini and GPT-4o as the models.
+- Use the gsm8k dataset.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models

benchmark/experimentation_bench/llm_reasoning_2/q4.txt

Lines changed: 1 addition & 0 deletions
@@ -5,4 +5,5 @@ Additional details:
 - Test this for the last_letters dataset, which will be our process-oriented task.
 - Use GPT-4o-mini as the model.
 - Use the Auto-CoT method for increasing the number of reasoning steps.
+- The incorrect step is located in the demo file `last_letters_false` in the repo below.
 - Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 6 additions & 1 deletion
@@ -1,2 +1,7 @@
-How do different methods of expanding reasoning steps (e.g., repeating the question, self-verification, making equations) affect the model's accuracy, and are some expansion strategies more effective than others?
+How do different methods of expanding reasoning steps (i.e., repeating the question, self-verification, making equations, thinking about words) affect the model's accuracy, and are some expansion strategies more effective than others?
 
+Additional details:
+- Test this for the gsm8k dataset.
+- Use GPT-4o-mini as the model.
+- The demo files needed to utilize these strategies/methods are already available in the repo below.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 7 additions & 1 deletion
@@ -1 +1,7 @@
-What is the relationship between the complexity of a task (e.g., as measured by the number of logical inferences or mathematical operations needed) and the optimal length of the reasoning chain?
+What is the relationship between the complexity of a task (i.e., as measured by the number of logical inferences or mathematical operations needed) and the optimal length of the reasoning chain?
+
+Additional details:
+- Use GPT-4o-mini as the model.
+- Use the gsm8k and last_letters datasets.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
Lines changed: 8 additions & 1 deletion
@@ -1 +1,8 @@
-How does the position of an incorrect step within the reasoning chain affect the overall outcome? Is an early error more detrimental than a later one?
+How does the position of an incorrect step within the reasoning chain affect the overall outcome? Is an early error more detrimental than a later one?
+
+Additional details:
+- Use GPT-4o-mini as the model.
+- Use the gsm8k dataset.
+- Use the Auto-CoT method for increasing the number of reasoning steps.
+- The early-error demo file is located in `gsm8k_early`, and the later-error demo file is in `gsm8k_later`, both in the repo below.
+- Feel free to refer to the code here: https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models

0 commit comments

Comments
 (0)