4 changes: 2 additions & 2 deletions src/content/posts/sky-t1-7b.md
@@ -39,7 +39,7 @@ To foster community progress, we open-sourced all artifacts including the traini

### Step 1: SFT
We use the QwQ model to generate the distillation data, since **the model was trained before the release of DeepSeek R1** and QwQ was the only open-weights long reasoning model available at the time. For the data mixture, we use GPT-4o-mini to classify the difficulty of the prompts according to the AoPS standard and select math problems of difficulty higher than Level 3, olympiad problems of difficulty higher than Level 8, and all AIME/AMC problems in the [NUMINA dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT). We then perform rejection sampling, accepting only the solutions that match the ground truth. In total, we curated [5K responses from QwQ](https://huggingface.co/datasets/NovaSky-AI/Sky-T1-7B-step1-sft-5k).
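A minimal sketch of this rejection-sampling step, assuming each record carries the prompt, the QwQ response, and a ground-truth answer; the field names and the `extract_final_answer` helper are illustrative, not the released curation pipeline:

```python
# Illustrative rejection-sampling filter: keep only QwQ responses whose final
# answer matches the ground truth. Record fields and helpers are assumptions.

def extract_final_answer(response: str) -> str:
    """Return the content of the last \\boxed{...} in a solution, else the last line."""
    start = response.rfind("\\boxed{")
    if start == -1:
        lines = response.strip().splitlines()
        return lines[-1].strip() if lines else ""
    i, depth, answer = start + len("\\boxed{"), 0, []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                break
            depth -= 1
        answer.append(ch)
        i += 1
    return "".join(answer).strip()

def rejection_sample(records):
    """records: iterable of dicts with 'prompt', 'qwq_response', 'ground_truth'."""
    return [
        r for r in records
        if extract_final_answer(r["qwq_response"]) == r["ground_truth"].strip()
    ]
```

Exact string matching is the simplest check; a symbolic-equivalence checker that compares parsed expressions would accept more valid solutions at the cost of a heavier verifier.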
Finally, we use the 5K responses to perform SFT on Qwen2.5-Math-7B using the [Sky-T1 system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml). We train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. After this stage, we get the [Sky-T1-7B-Step1](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step1) model.
Finally, we use the 5K responses to perform SFT on Qwen2.5-Math-7B using the [Sky-T1 system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). We train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. After this stage, we get the [Sky-T1-7B-Step1](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step1) model.
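For a concrete picture of these hyperparameters, here is a hedged sketch using Hugging Face TRL; the training framework, the dataset split and column layout, and the per-device/accumulation split that yields a global batch size of 96 are assumptions rather than the exact setup used:

```python
# Sketch of the Step-1 SFT settings: 3 epochs, learning rate 1e-5, global batch
# size 96. Framework choice, dataset columns, and the GPU/accumulation split
# are assumptions for illustration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed split/column layout; map to a "messages" or "text" column if needed.
dataset = load_dataset("NovaSky-AI/Sky-T1-7B-step1-sft-5k", split="train")

config = SFTConfig(
    output_dir="sky-t1-7b-step1",
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=4,   # 4 per GPU x 8 GPUs x 3 accumulation = 96
    gradient_accumulation_steps=3,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Math-7B",    # base model being distilled into
    train_dataset=dataset,
    args=config,
)
trainer.train()
```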

### Step 2: RL
Next, we apply the [PRIME](https://github.com/PRIME-RL/PRIME) algorithm to this model. We use the [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) dataset for RL training and run it for 127 steps with a batch size of 256 (~30K examples in total). For each prompt, we generate 4 rollouts and adopt the prompt-filtering optimization proposed in PRIME, which filters out problems for which all 4 rollouts are correct or all 4 are wrong. After this stage, we get the [Sky-T1-7B-Step2](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step2) model. This stage runs on 8xH100 for around 44 hours.
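A small sketch of this prompt-filtering rule; `generate_rollouts` and `verify` are assumed interfaces standing in for the actual sampler and rule-based verifier, not PRIME's implementation:

```python
from typing import Callable, Dict, List

def filter_prompts(
    batch: List[Dict],
    generate_rollouts: Callable[[str, int], List[str]],
    verify: Callable[[str, str], bool],
    n_rollouts: int = 4,
) -> List[Dict]:
    """Drop prompts whose rollouts are all correct or all wrong (no learning signal)."""
    kept = []
    for example in batch:
        rollouts = generate_rollouts(example["prompt"], n_rollouts)
        num_correct = sum(verify(r, example["ground_truth"]) for r in rollouts)
        if 0 < num_correct < n_rollouts:  # keep only mixed-outcome prompts
            kept.append({**example, "rollouts": rollouts})
    return kept
```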
@@ -64,7 +64,7 @@ As shown in Figure 2, Long CoT SFT significantly improves the model’s overall
**Figure 2:** Pass@K curves for models trained after each step for AIME24 and AMC23.

## Sky-T1-mini – Simple RL Boosts the Performance
Throughout our development of Sky-T1-7B (which was trained before DeepSeek R1’s release), we found that simple RL algorithms without a Process Reward Model (PRM) work well to enhance the model’s performance. Therefore, we also apply the simple RLOO algorithm with only the verifier reward to [DeepSeek-R1-Distill-Qwen-7B]((https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)), the current SOTA open-source 7B reasoning model, using the [STILL3](https://huggingface.co/datasets/RUC-AIBOX/STILL-3-Preview-RL-Data) dataset and the numina_amc_aime and numina_olympiads subsets of the [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) dataset. We run it for 119 steps (~28 hours) with a batch size of 256 (~30K examples) on 8xH100 with a cutoff length of 8K, and then for another 29 steps (~8.7 hours) with a cutoff length of 16K. The final model, Sky-T1-mini, approaches o1-mini performance across the four math benchmarks, as reported in Figure 3. **While we only trained the model for a short time with truncated contexts (and did not carefully tune the algorithms or data mixtures), the accuracy improvement is still impressive: +4% on AIME, +5.6% on OlympiadBench, and +2% on average, demonstrating the potential of RL to further enhance the model’s performance beyond distillation.**
Throughout our development of Sky-T1-7B (which was trained before DeepSeek R1’s release), we found that simple RL algorithms without a Process Reward Model (PRM) work well to enhance the model’s performance. Therefore, we also apply the simple RLOO algorithm with only the verifier reward to [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), the current SOTA open-source 7B reasoning model, using the [STILL3](https://huggingface.co/datasets/RUC-AIBOX/STILL-3-Preview-RL-Data) dataset and the numina_amc_aime and numina_olympiads subsets of the [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) dataset. We run it for 119 steps (~28 hours) with a batch size of 256 (~30K examples) on 8xH100 with a cutoff length of 8K, and then for another 29 steps (~8.7 hours) with a cutoff length of 16K. The final model, Sky-T1-mini, approaches o1-mini performance across the four math benchmarks, as reported in Figure 3. **While we only trained the model for a short time with truncated contexts (and did not carefully tune the algorithms or data mixtures), the accuracy improvement is still impressive: +4% on AIME, +5.6% on OlympiadBench, and +2% on average, demonstrating the potential of RL to further enhance the model’s performance beyond distillation.**
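To make the RLOO setup concrete, here is a minimal sketch of the leave-one-out advantage with a verifier-only reward; the 0/1 reward convention and the tensor layout are assumptions for illustration:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, k) verifier rewards (1.0 = verified correct, 0.0 = wrong)."""
    k = rewards.shape[1]
    # Leave-one-out baseline: mean reward of the other k-1 rollouts for the same prompt.
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 4 rollouts for one prompt, two verified correct.
adv = rloo_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]]))
# -> tensor([[ 0.6667, -0.6667,  0.6667, -0.6667]])
```

The policy gradient then weights each rollout’s log-probabilities by its advantage, so no learned critic or PRM is needed.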

## Complete Results
![img](https://raw.githubusercontent.com/NovaSky-AI/novasky-ai.github.io/main/assets/images/sky-t1-7b/performance_stats_avg.png)