[DOCS] Lora without regret #4181
# LoRA Without Regret
Recent research from the team at [Thinking Machines Lab](https://thinkingmachines.ai/blog/lora/) (Schulman et al., 2025) shows that **LoRA can match full fine-tuning performance** when configured correctly, while using only ~67% of the compute. These findings are exciting for TRL users because they are straightforward to implement and can improve model performance on smaller budgets.
This guide provides simple instructions to reproduce the results of the blog post in TRL.
## Benefits of LoRA over full fine-tuning
First of all, let's remind ourselves of the benefits of [LoRA over full fine-tuning](https://huggingface.co/docs/trl/en/peft_integration).
LoRA trains an adapter layer on top of the base model, which contains significantly fewer parameters than the base model itself. This allows us to train the model with less GPU memory. It has generally been accepted that this comes with a trade-off in performance. The [blog post](https://thinkingmachines.ai/blog/lora/) proposes that, with the correct configuration, LoRA can overcome this trade-off and match full fine-tuning performance.
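
As a quick reference, the sketch below shows how LoRA is typically enabled in a TRL training command through the PEFT integration. The model, dataset, and output directory are illustrative placeholders, not values from the blog post.

```bash
# Enable LoRA via TRL's PEFT integration by adding the PEFT flags to a training command.
# Model, dataset, and output directory below are placeholder choices.
trl sft \
    --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --use_peft \
    --lora_r 16 \
    --output_dir sft-lora-example
```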
## Key findings in optimizing LoRA
Let's dive into the key findings of the blog post one by one and see how we can implement them in TRL scripts. Below, we reproduce the results of the blog post using complete TRL scripts that you can run locally or on Hugging Face Jobs.
### 1. *LoRA performs better when applied to all weight matrices*
The authors recommend applying LoRA to all weight matrices rather than only to the attention layers, and this gap is not closed by increasing the rank. In TRL scripts, we can use `--lora_target_modules all-linear` to apply LoRA to all weight matrices.
|  | ||||||
Attention-only LoRA underperforms even when a higher rank is used to match the parameter count.
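
For example (a minimal sketch; the model and dataset come from the SFT table later in this guide, and the other values are placeholders), the flag is passed to the `trl sft` CLI like this:

```bash
# Apply LoRA to every linear layer instead of only the attention projections.
trl sft \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name allenai/tulu-3-sft-mixture \
    --use_peft \
    --lora_target_modules all-linear \
    --output_dir sft-lora-all-linear
```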
### 2. *We can estimate trainable parameters from dataset size to determine LoRA rank*
The blog post recommends choosing the LoRA rank based on the task and dataset size. The rank controls the number of trainable parameters in the LoRA adapter, and the post proposes that LoRA works well when this number exceeds the amount of information to be learned, which we can estimate from the dataset size.
|  | ||||||
In TRL scripts, we can use `--lora_r` to set the rank and adapt it to the task and dataset we're training on. The blog post recommends the following ranks based on the task and dataset size:
| Task Type | Dataset Size | Recommended Rank |
|-----------|--------------|------------------|
| **SFT** - Small instruction | <10K examples | 32-64 |
| **SFT** - Medium instruction | 10K-1M examples | 64-128 |
| **SFT** - Large reasoning | >1M examples | 256+ |
| **RL** - All tasks | Any size | 8-32 |
Reinforcement learning requires minimal capacity, so we can use lower ranks: policy gradient algorithms learn only ~1 bit of information per episode.
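
As a sketch of how the table translates into a command (the rank below mirrors the medium-instruction row; everything else is a placeholder):

```bash
# Medium instruction dataset (10K-1M examples): pick a rank in the 64-128 range.
trl sft \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name allenai/tulu-3-sft-mixture \
    --use_peft \
    --lora_target_modules all-linear \
    --lora_r 128 \
    --output_dir sft-lora-r128
```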
### 3. *"FullFT and high-rank LoRAs have similar learning curves"*
Counter-intuitively, the blog post recommends using a learning rate similar to the one used for full fine-tuning. In TRL scripts, we can use `--learning_rate` to set the learning rate. The 1/r scaling in LoRA makes the optimal learning rate approximately rank-independent.
|  | ||||||
### 4. *"In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning."*
The blog post recommends keeping the effective batch size below 256 because the authors found LoRA to be less tolerant of large batch sizes, and this could not be mitigated by increasing the LoRA rank. In TRL scripts, we can use `--per_device_train_batch_size` and `--gradient_accumulation_steps` to set the batch size.
|  | ||||||
## Examples with TRL
Those are the core findings of the blog post. Let's implement them in TRL scripts to train LoRA adapters.
### Supervised Fine-Tuning (SFT)
The blog post performs SFT on a range of models and datasets from the Hub, which we can reproduce in TRL.
| Model | Dataset |
|-------|---------|
| [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B) | [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B) | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B) | [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B) | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
<hfoptions id="sft">
<hfoption id="jobs">
```bash
# Medium dataset (Tulu3) - use rank 128
# TODO: add hf jobs command
```
To use Hugging Face Jobs, you will need to be logged in to the Hugging Face Hub (`hf auth login`) and have a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan. Check out the [Jobs documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs) for more details.
</hfoption>
<hfoption id="local">
```bash
# Medium dataset (Tulu3) - use rank 128
# TODO: local command
```
To run the script locally, you will need to have `uv` installed. Check out the [uv documentation](https://docs.astral.sh/uv/) for more details.
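
While the exact command above is still marked as a TODO, a rough local sketch that combines the guide's recommendations might look like the following, assuming the standard `trl sft` CLI is available (e.g. after `pip install trl peft`); the hyperparameter values are illustrative rather than the blog post's exact configuration.

```bash
# Illustrative local run on the Tulu3 mixture, combining the findings above:
# all-linear targets, rank sized to the dataset, a FullFT-like learning rate,
# and an effective batch size below 256 (4 * 32 = 128 on a single GPU).
trl sft \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --dataset_name allenai/tulu-3-sft-mixture \
    --use_peft \
    --lora_target_modules all-linear \
    --lora_r 128 \
    --learning_rate 2.0e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --output_dir Llama-3.2-1B-tulu3-lora
```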