| 1 | -"Ground truth: More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined. |
| 2 | - |
| 3 | -Example: |
| 4 | - |
| 5 | -#### Control Group Results (`gsm8k` with `gsm8k_1`): |
| 6 | -- **Result 1 Accuracy:** 92.5 |
| 7 | -- **Result 2 Accuracy:** 90.0 |
| 8 | - |
| 9 | -#### Experimental Group Results: |
| 10 | - |
| 11 | -##### `gsm8k` Dataset |
| 12 | -- **Demo File:** `gsm8k_2` |
| 13 | - - **Result 1 Accuracy:** 90.0 |
| 14 | - - **Result 2 Accuracy:** 92.5 |
| 15 | -- **Demo File:** `gsm8k_3` |
| 16 | - - **Result 1 & 2 Accuracy:** 92.5 |
| 17 | - |
| 18 | -##### `last_letters` Dataset |
| 19 | -- **Demo File:** `last_letters_1` |
| 20 | - - **Result 1 Accuracy:** 90.0 |
| 21 | - - **Result 2 Accuracy:** 92.5 |
| 22 | -- **Demo File:** `last_letters_2` |
| 23 | - - **Result 1 & 2 Accuracy:** 95.0 |
| 24 | -- **Demo File:** `last_letters_3` |
| 25 | - - **Result 1 & 2 Accuracy:** 95.0 |
| 26 | -- **Demo File:** `last_letters_4` |
| 27 | - - **Result 1 & 2 Accuracy:** 95.0 |
| 28 | -- **Demo File:** `last_letters_5` |
| 29 | - - **Result 1 Accuracy:** 95.0 |
| 30 | - - **Result 2 Accuracy:** 92.5 |
| 31 | -- **Demo File:** `last_letters_6` |
| 32 | - - **Result 1 & 2 Accuracy:** 57.5 |
| 33 | -- **Demo File:** `last_letters_10` |
| 34 | - - **Result 1 & 2 Accuracy:** 0.0 |
| 35 | - |
| 36 | -### Analysis and Conclusion |
| 37 | - |
| 38 | -1. **Task Complexity and Reasoning Steps:** |
| 39 | - - For `gsm8k`, the accuracy was higher with demo files that added more reasoning steps (`gsm8k_3`). |
| 40 | - - For `last_letters`, demo files with moderate reasoning steps (`last_letters_2`, `last_letters_3`, `last_letters_4`) had the highest accuracy. |
| 41 | - - `last_letters_6` with longer reasoning steps showed a drop in accuracy, indicating a threshold beyond which additional reasoning steps are detrimental. |
| 42 | - - `last_letters_10` resulted in 0% accuracy, suggesting excessive reasoning steps led to failure in task performance. |
| 43 | - |
| 44 | -2. **Optimal Reasoning Steps:** |
| 45 | - - `gsm8k`: Optimal steps are seen in `gsm8k_3`. |
| 46 | - - `last_letters`: Optimal steps are seen in `last_letters_2`, `last_letters_3`, `last_letters_4`. |
| 47 | - |
| 48 | -3. **Impact of Task Complexity:** |
| 49 | - - More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined." |
| 1 | +# Answer: |
| 2 | + |
| 3 | +More complex tasks with higher logical and mathematical operations required more reasoning steps, whereas simpler pattern recognition tasks required fewer steps before performance declined. |
| 4 | + |
| 5 | +A correct answer identifies `gsm8k` as the more complex of the two tasks. |
| 6 | + |
| 7 | +`gsm8k`: Optimal steps are seen in `gsm8k_3`. |
| 8 | +`last_letters`: Optimal steps are seen starting at `last_letters_2` and continuing through `last_letters_3` and `last_letters_4`. |
| 9 | + |
| 10 | +# Design: |
| 11 | + |
| 12 | +{ |
| 13 | + "constant_vars": [ |
| 14 | + "method for increasing reasoning_steps=Auto-CoT", |
| 15 | + "model=gpt-4o-mini" |
| 16 | + ], |
| 17 | + "independent_vars": [ |
| 18 | + "datasets=gsm8k, last_letters", |
| 19 | + "reasoning_steps=use at least 3 reasoning steps for each dataset" |
| 20 | + ], |
| 21 | + "dependent_vars": [ |
| 22 | + "accuracy" |
| 23 | + ] |
| 24 | +} |
| 25 | + |
| 26 | +# Setup: |
| 27 | + |
| 28 | +1. Environment Preparation |
| 29 | + |
| 30 | +Ensure your Python environment and dependencies are configured according to the repository documentation at https://github.com/MingyuJ666/The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models |
| 31 | + |
| 32 | +2. Datasets and Reasoning Steps |
| 33 | + |
| 34 | +Select two datasets for testing: gsm8k (math reasoning) and last_letters (pattern recognition). |
| 35 | + |
| 36 | +Use provided demo files to systematically vary the number of reasoning steps (e.g., gsm8k_1, gsm8k_2, last_letters_3, etc.). |
| 37 | + |
| 38 | +3. Run Experiments |
| 39 | + |
| 40 | +Call run_inference.py using the following parameters: |
| 41 | + |
| 42 | +args.method: auto_cot |
| 43 | + |
| 44 | +args.model: gpt-4o-mini |
| 45 | + |
| 46 | +Execute multiple runs per dataset, incrementally adjusting the reasoning steps through corresponding demo files. |
| 47 | + |
| 48 | +Compare each dataset's task complexity (a simple assessment is fine, even if it appears only in the conclusion) with its optimal number of reasoning steps. |
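
The run matrix described in step 3 can be sketched as a small command builder. This is a minimal sketch, not the repository's own tooling: the `--method` and `--model` flag spellings are inferred from `args.method` and `args.model` above, while `--dataset`, `--demo_path`, the `demos/` directory, and two runs per demo file are assumptions that should be checked against `run_inference.py` before use. The function only builds command strings and does not execute anything.

```python
import shlex

# Demo files named in this document; each one encodes a different
# reasoning-chain length for its dataset.
DEMO_FILES = {
    "gsm8k": ["gsm8k_1", "gsm8k_2", "gsm8k_3"],
    "last_letters": [
        "last_letters_1", "last_letters_2", "last_letters_3",
        "last_letters_4", "last_letters_5", "last_letters_6",
        "last_letters_10",
    ],
}


def build_commands(demo_files, method="auto_cot", model="gpt-4o-mini",
                   demo_dir="demos", runs_per_demo=2):
    """Return one shell command string per (dataset, demo file, run)."""
    commands = []
    for dataset, demos in demo_files.items():
        for demo in demos:
            for _ in range(runs_per_demo):
                argv = [
                    "python", "run_inference.py",
                    "--dataset", dataset,                # assumed flag name
                    "--demo_path", f"{demo_dir}/{demo}",  # assumed flag and path
                    "--method", method,
                    "--model", model,
                ]
                commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for cmd in build_commands(DEMO_FILES):
        print(cmd)
```

Printing the commands first makes it easy to confirm that only the dataset and demo file vary between runs while the method and model stay constant, as the design requires.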
| 49 | + |
| 50 | +4. Analyze and Summarize Findings |
| 51 | + |
| 52 | +For each dataset, clearly summarize: |
| 53 | + |
| 54 | +Dataset name |
| 55 | + |
| 56 | +Optimal reasoning steps identified |
| 57 | + |
| 58 | +Task complexity analysis |
| 59 | + |
| 60 | +Discuss insights regarding how task complexity influences the optimal reasoning chain length. |
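
The per-dataset summary in step 4 can be sketched as follows. This is a minimal sketch under stated assumptions: the accuracy figures are copied from the example results on the removed side of this diff, "Result 1 & 2" is read as both runs scoring the same value, and "optimal" is taken to mean the highest mean accuracy across runs, with ties all reported. The names `RESULTS` and `summarize` are hypothetical.

```python
from statistics import mean

# Accuracy (%) per demo file, two runs each, taken from the example results.
RESULTS = {
    "gsm8k": {
        "gsm8k_1": [92.5, 90.0],
        "gsm8k_2": [90.0, 92.5],
        "gsm8k_3": [92.5, 92.5],
    },
    "last_letters": {
        "last_letters_1": [90.0, 92.5],
        "last_letters_2": [95.0, 95.0],
        "last_letters_3": [95.0, 95.0],
        "last_letters_4": [95.0, 95.0],
        "last_letters_5": [95.0, 92.5],
        "last_letters_6": [57.5, 57.5],
        "last_letters_10": [0.0, 0.0],
    },
}


def summarize(results):
    """Average the runs per demo file and list the demo files tied for best."""
    summary = {}
    for dataset, runs in results.items():
        means = {demo: mean(acc) for demo, acc in runs.items()}
        best = max(means.values())
        summary[dataset] = {
            "mean_accuracy": means,
            "optimal_demos": [d for d, m in means.items() if m == best],
        }
    return summary


if __name__ == "__main__":
    for dataset, info in summarize(RESULTS).items():
        print(dataset, "optimal:", ", ".join(info["optimal_demos"]))
        for demo, acc in info["mean_accuracy"].items():
            print(f"  {demo}: {acc:.2f}")
```

On these figures the sketch selects `gsm8k_3` for `gsm8k` and `last_letters_2` through `last_letters_4` for `last_letters`, which agrees with the answer stated above. With only two runs per demo file and differences of 1.25 points on `gsm8k`, the ranking there should be read with caution.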