The official implementation of the paper *Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training*.
Pre-training LLMs is notoriously expensive.
Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of much larger models.
However, the viability of these approaches in efficient pre-training for LLMs remains underexplored.
This work identifies three critical obstacles: (O1) the lack of comprehensive evaluation, (O2) the untested viability for scaling, and (O3) the lack of empirical guidelines, which we address in turn.
To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting.
Our findings reveal that a depthwise stacking operator, called G_stack, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance compared to strong baselines.
```bash
git clone https://github.com/tongxuluo/prts.git
cd prts
```
```bash
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path path/to/llama --destination_path data/slimpajama --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path path/to/llama --destination_path data/slimpajama --split train --percentage 1.0
```
We formalize a set of guidelines for effectively utilizing the G_stack operator, namely its growth timing d (in tokens) and growth factor g (a short sketch after the table below shows how g maps to a stacking list).
For the Llama families, the estimated values are:
| Model | N (params) | D (training tokens) | d (growth timing) | g (growth factor) |
|---|---|---|---|---|
| Llama3-8B | 8B | 15T | 6.58B | 4 |
| Llama2-7B | 7B | 2T | 11.11B | 4 |
| Llama2-13B | 13B | 2T | 15.84B | 4 |
| Llama2-70B | 70B | 2T | 42.48B | 4 |
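For a given growth factor g, depthwise stacking (G_stack) repeats all layers of the small base model g times to build the deeper target model; this is exactly what the `stacking_list` in the growth configuration below encodes. A minimal sketch of that mapping (`make_stacking_list` is a hypothetical helper for illustration, not part of this repo):

```python
# Hypothetical helper (not part of this repo): build the depthwise stacking list
# for a base model with `num_layers` layers and growth factor `g`.
def make_stacking_list(num_layers: int, g: int) -> list[int]:
    """Repeat the base layer indices g times, e.g. 6 layers with g = 4 -> 24 entries."""
    return [i for _ in range(g) for i in range(num_layers)]


# With g = 4 and a 6-layer base model, the target model has 24 layers:
print(make_stacking_list(6, 4))
# [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
```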
Train the small base model; we only need its first checkpoint (10B tokens):
```bash
sbatch base_model.sh
```
For example, to grow a 6-layer base model (6L2048H) into a 24-layer target model (24L2048H), i.e. growth factor g = 4, use a configuration like:
```json
{
    "src_config_name": "6L2048H",
    "trg_config_name": "24L2048H",
    "src_init_path": "/path/to/your/base_model/check_point_dir/iter-005000-ckpt.pth",
    "stacking_list": [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5],
    "embd_name": "wte",
    "ln_name": "ln_f",
    "head_name": "lm_head",
    "layer_name": "h"
}
```
```bash
sbatch g_stack.sh
```
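Conceptually, G_stack builds the target checkpoint by copying the base model's embedding (`wte`), final layer norm (`ln_f`), and head (`lm_head`) unchanged, and filling target layer i with the weights of base layer `stacking_list[i]`. The sketch below illustrates this on a plain PyTorch state dict with GPT-style parameter names matching the config above; it is an illustration only, not the repo's actual growth code (see the scripts in this repo for that):

```python
import re
import torch


def grow_state_dict(src_sd: dict, stacking_list: list[int], layer_name: str = "h") -> dict:
    """Depthwise-stack a shallow state dict into a deeper one.

    Assumes layer parameters are named 'transformer.<layer_name>.<idx>.<rest>';
    everything else (embedding, final layer norm, lm_head) is copied unchanged.
    """
    layer_pattern = re.compile(rf"^transformer\.{layer_name}\.\d+\.")
    # Copy all non-layer parameters as-is.
    trg_sd = {k: v.clone() for k, v in src_sd.items() if layer_pattern.match(k) is None}
    # Fill each target layer with the weights of the source layer given by stacking_list.
    for trg_idx, src_idx in enumerate(stacking_list):
        src_prefix = f"transformer.{layer_name}.{src_idx}."
        trg_prefix = f"transformer.{layer_name}.{trg_idx}."
        for key, value in src_sd.items():
            if key.startswith(src_prefix):
                trg_sd[trg_prefix + key[len(src_prefix):]] = value.clone()
    return trg_sd


# Example usage (the checkpoint layout is an assumption; adapt it to the actual .pth structure):
# src_sd = torch.load("/path/to/base_model/iter-005000-ckpt.pth", map_location="cpu")
# trg_sd = grow_state_dict(src_sd, [i for _ in range(4) for i in range(6)])
```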
- Open-source our code -- 2024.5.24.
- Open-source the final checkpoints of our main experiments -- 2024.5.29.
- Refactor our code to make it more concise -- 2024.7.
Our code is based on TinyLlama, licensed under the Apache-2.0 license.
```bibtex
@misc{zhang2024tinyllama,
    title={TinyLlama: An Open-Source Small Language Model},
    author={Peiyuan Zhang and Guangtao Zeng and Tianduo Wang and Wei Lu},
    year={2024},
    eprint={2401.02385},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
If you find our work helpful or inspiring and use it in your research, please cite our paper:
```bibtex
@article{du2024stacking,
    title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
    author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
    journal={arXiv preprint arXiv:2405.15319},
    year={2024}
}
```