Parameter-efficient finetuning adds a small number of new parameters to a pretrained LLM and finetunes only those newly added parameters, so that the LLM performs better on
- (a) a target dataset (for example, a domain-specific dataset such as medical or legal documents), and
- (b) a target task (for example, sentiment classification).
LoRA, or Low-Rank Adaptation, is a technique that adapts a pretrained model by adding trainable low-rank matrices to selected layers. Because only these small matrices are trained, the model can adapt to new tasks without extensive computational resources.
- Preservation of Pre-trained Knowledge: The original model retains its pre-trained weights, and only the low-rank matrices are trained, allowing for faster adaptation to new tasks.
- If your task heavily relies on understanding context, prioritize attention layers.
- Later layers tend to capture task-specific behavior, making them natural targets when adapting to a new task.
Layers Typically Targeted for LoRA Adapters
- Attention Layers:
- Query and Value Projections: These layers are crucial for the model's ability to focus on relevant parts of the input data.
- Commonly Used Modules: q_proj, k_proj, v_proj, and o_proj are often targeted as they directly influence the attention mechanism.
- Feedforward Layers:
- Linear Layers: These include any linear transformations within the model, such as those used in feedforward networks after attention layers.
- Examples: gate_proj, down_proj, and up_proj are frequently included in the adaptation process.
- Output Layers:
- Final Linear Layer: This layer is responsible for producing the final output of the model and can also benefit from fine-tuning through LoRA.
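As an illustration, here is a minimal sketch using the Hugging Face PEFT library. The module names assume a Llama-style architecture (they may differ for other models), and the base model name and hyperparameter values are placeholders, not prescriptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model is illustrative; any Llama-style causal LM works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feedforward (MLP) layers
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```

The call to print_trainable_parameters() confirms that the base weights stay frozen and only the low-rank adapter matrices are updated during finetuning.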
Fine-tune Code Generation Prompt:
Ignore previous instructions. Your role is now an AI and ML engineer; your job is to generate code for the following use case. Use Case: Finetune an open-source large language model (LLM), such as the latest Llama 3.2 chat model, for multiple tasks such as classification, chatbot, question answering, multiple-choice question answering, and more.
Below are a few finetuning guidelines:
- Use Hugging Face libraries for finetuning the LLM.
- The foundation LLM and the datasets should be downloaded from Hugging Face; downloading some models requires a Hugging Face login and giving consent.
- Employ finetuning techniques such as PEFT, QLoRA, and/or any other modern techniques you are aware of.
- Use Weights & Biases (wandb.ai) for reporting progress and metrics.
- Finetune the model for 2 epochs and use a learning rate scheduler.
- Split the data into training and validation sets, with 80% used for training and the rest for validation. Validate the finetuned model on the validation dataset.
- Save and upload the finetuned model to Hugging Face.
- Use FSDP (Fully Sharded Data Parallel) to finetune the model. Assume there are 3 machines (nodes), each with 4 A100 GPUs. The code should support distributed training across multiple nodes and GPUs.
- To avoid retraining from scratch after crashes, checkpoint frequently.
- Use model optimization techniques such as torch.compile, mixed precision, or other relevant techniques.
- To improve training speed, include techniques such as flash attention, KV cache, grouped-query attention, or sliding-window attention. Choose whichever is best for the model.
- Apply the DPO technique for finetuning.
- The finetuning is planned to run on GPU rental services such as runpod.ai. Generate code that can be executed on such services.
- Compute evaluation metrics such as GLUE, BLEU, and perplexity, as relevant to the task and dataset, and capture them through wandb.ai.
- Generate the code as a Python project, separating behaviors into different modules so that it can be parameterized and packaged.
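To make a few of these guidelines concrete, here is a rough sketch of how the 2-epoch schedule, learning rate scheduler, mixed precision, frequent checkpointing, 80/20 split, wandb.ai reporting, and Hub upload could map onto Hugging Face TrainingArguments. The variable tokenized_ds, the output directory, and the hyperparameter values are assumptions for illustration, and exact argument names can vary slightly between transformers versions:

```python
from transformers import TrainingArguments

# `tokenized_ds` is a placeholder for an already-tokenized Hugging Face Dataset.
split = tokenized_ds.train_test_split(test_size=0.2, seed=42)  # 80% train / 20% validation
train_ds, eval_ds = split["train"], split["test"]

training_args = TrainingArguments(
    output_dir="llama32-multitask",
    num_train_epochs=2,               # finetune for 2 epochs
    learning_rate=2e-4,
    lr_scheduler_type="cosine",       # learning rate scheduler
    bf16=True,                        # mixed precision on A100 GPUs
    save_strategy="steps",
    save_steps=200,                   # frequent checkpointing against crashes
    evaluation_strategy="steps",
    eval_steps=200,
    report_to="wandb",                # log progress and metrics to wandb.ai
    push_to_hub=True,                 # upload the finetuned model to the Hub
)
```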
The model should be finetuned using the datasets listed below.
Finetuning Hugging Face Datasets:
1. Dataset Name: infinite-dataset-hub/TextClaimsDataset
   - Description: The 'TextClaimsDataset' is a curated collection of insurance claim descriptions in which each text snippet is labeled according to its relevance to actual insurance claims. The dataset aims to assist machine learning practitioners in training models to classify texts as either 'Claim' or 'Not a Claim'. This classification can be pivotal for fraud detection systems in the insurance industry, helping to distinguish potentially fraudulent claims from legitimate ones.
   - Supported Tasks: Classification
2. Dataset Name: PolyAI/banking77
   - Description: A dataset of online banking queries annotated with their corresponding intents. BANKING77 provides a very fine-grained set of intents in the banking domain: 13,083 customer service queries labeled with 77 intents, focused on fine-grained single-domain intent detection.
   - Supported Tasks: Intent classification, intent detection
3. Dataset Name: tau/commonsense_qa
   - Description: CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions, each with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing splits: the "Random split" (the main evaluation split) and the "Question token split".
   - Supported Tasks: Multiple-choice question answering
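For reference, these three datasets can be pulled directly from the Hugging Face Hub with the datasets library. This is a small sketch; split names, field names, and access requirements depend on each dataset:

```python
from datasets import load_dataset

claims = load_dataset("infinite-dataset-hub/TextClaimsDataset")  # 'Claim' vs 'Not a Claim' classification
banking = load_dataset("PolyAI/banking77")                       # 77-way banking intent detection
csqa = load_dataset("tau/commonsense_qa")                        # multiple-choice commonsense QA

print(banking["train"][0])  # inspect one example to confirm the field names
```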
Model parallelism splits the model across multiple GPUs. This is especially useful for models that are too large to fit into the memory of a single GPU.
- Pipeline Parallelism: Splits the model into different layers that are processed by different GPUs in a pipelined fashion.
- Tensor Parallelism: Splits the individual layers themselves across multiple GPUs to parallelize the computation.
Fully Sharded Data Parallel (FSDP) enables efficient training of large neural networks across multiple GPUs and nodes by distributing the model's parameters across devices.
PyTorch Fully Sharded Data Parallel (FSDP) primarily utilizes data parallelism with some aspects of tensor parallelism.
Consider an example with 2 GPUs per node and a total of 2 nodes (4 GPUs total).
FSDP shards each layer's parameters into 4 parts distributed across the 4 GPUs, with each GPU holding only a portion (shard) of the parameters for every layer.
For example, Layer 1's parameters are split into 4 shards, one shard per GPU:
- GPU 1 (Node 1) holds shard 1
- GPU 2 (Node 1) holds shard 2
- GPU 3 (Node 2) holds shard 3
- GPU 4 (Node 2) holds shard 4
1. Forward Pass with All-Gather
- All-Gather for Input Processing:
  - When data is forwarded through a layer, FSDP first all-gathers that layer's sharded parameters across all GPUs, so each GPU temporarily holds a complete copy of the parameters for that layer.
  - All four shards of the layer's parameters (one from each GPU) are gathered, so that each GPU can perform the forward pass for the complete layer.
- Local Computation:
  - After the all-gather, each GPU has the full layer's parameters and processes its local portion of the input batch through the layer.
  - Each GPU computes the output for its share of the input data using the full (all-gathered) set of parameters.
- Proceeding to the Next Layer:
  - After each layer's forward pass is computed, the all-gathered parameters are discarded from memory, and only the sharded portion is kept on each GPU.
  - The process repeats for each subsequent layer.
2. Backward Pass with Reduce-Scatter
During the backward pass, gradients need to be computed and distributed across the GPUs so that the sharded parameters are updated correctly.
- Local Gradient Computation:
  - Each GPU computes gradients locally, using the outputs from the forward pass for its shard of the input data. However, since each GPU holds only part of the model's parameters, these gradients need to be synchronized across GPUs.
- Reduce-Scatter for Gradients:
  - Once local gradients are computed, FSDP performs a reduce-scatter operation.
  - The gradients are summed across all GPUs (each GPU holds gradients for the same parameters, computed from different input shards).
  - The summed gradients are then scattered back, with each GPU receiving only the gradients corresponding to the portion of the model parameters it is responsible for (i.e., its sharded parameters).
- Weight Update:
  - After the reduce-scatter, each GPU has the correct accumulated gradients for its shard of the model parameters.
  - Each GPU locally updates its shard of the model parameters using these gradients.
- Proceeding to the Next Layer:
  - The process is repeated in reverse order (from the last layer to the first) during the backward pass.
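The mechanics above can be sketched with PyTorch's FSDP wrapper. The snippet below is a minimal illustration rather than a full training script: MyModel and dataloader are placeholders, and it assumes the 2-node x 2-GPU layout described above, launched with torchrun on each node:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch (one command per node, node_rank 0 and 1):
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --rdzv_endpoint=<master_host>:29500 train_fsdp.py

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)   # MyModel is a placeholder for the actual network
model = FSDP(model)                  # each layer's parameters are sharded across all 4 GPUs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for inputs, targets in dataloader:   # each rank iterates over its own shard of the data
    outputs = model(inputs.cuda(local_rank))   # forward: parameters are all-gathered per layer
    loss = torch.nn.functional.cross_entropy(outputs, targets.cuda(local_rank))
    loss.backward()                  # backward: gradients are reduce-scattered back to shards
    optimizer.step()                 # each rank updates only its local parameter shard
    optimizer.zero_grad()

dist.destroy_process_group()
```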
To fine-tune the LLaMA model using Direct Preference Optimization (DPO), you can follow a structured approach that involves preparing your dataset, setting up the model and training environment, and implementing the DPO training process. Below is a comprehensive guide based on the information gathered.
Ensure you have the necessary libraries installed. You will need PyTorch, Hugging Face Transformers, Accelerate, and the TRL (Transformer Reinforcement Learning) library.
pip install torch transformers accelerate trl
DPO requires a dataset that includes prompts, preferred responses, and rejected responses. For TRL's DPOTrainer, this is typically a Hugging Face Dataset with "prompt", "chosen", and "rejected" columns:
from datasets import Dataset

# Toy preference pairs: each record has a prompt, a preferred ("chosen")
# response, and a dispreferred ("rejected") response.
preference_data = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France's capital is Lyon."
    },
    {
        "prompt": "What is the largest planet in our solar system?",
        "chosen": "The largest planet in our solar system is Jupiter.",
        "rejected": "Earth is the largest planet."
    },
    # Add more examples...
]
dataset = Dataset.from_list(preference_data)
Load the LLaMA model and tokenizer from Hugging Face. For this example, we will use a pre-trained LLaMA model.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Meta-Llama-3-8B"  # Replace with your specific model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
You will need to configure the DPO trainer with your model, a reference model, and the training parameters.
from transformers import TrainingArguments
from trl import DPOTrainer
# Create a reference model (typically a copy of your SFT model)
model_ref = AutoModelForCausalLM.from_pretrained(model_name)
# Define training arguments (exact argument names can vary slightly across TRL versions)
training_args = TrainingArguments(
    output_dir="llama_dpo",
    per_device_train_batch_size=4,
    learning_rate=8e-6,
    num_train_epochs=3,
)
# Initialize the DPO Trainer
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=model_ref,
    beta=0.1,  # Temperature hyperparameter for the DPO loss
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_args,
)
Now you can start the training process using the train() method provided by the DPO trainer.
dpo_trainer.train()
After training, evaluate your model's performance on a holdout set or through manual testing. Once satisfied, save your fine-tuned model for future use.
model.save_pretrained("fine_tuned_llama_dpo")
tokenizer.save_pretrained("fine_tuned_llama_dpo")
- Dataset Preparation: Create a dataset with prompts, preferred responses, and rejected responses.
- Model Initialization: Load your LLaMA model and tokenizer.
- DPO Trainer Setup: Configure the DPO trainer with necessary parameters.
- Training: Execute the training process.
- Evaluation and Saving: Assess performance and save the fine-tuned model.
This approach allows you to effectively fine-tune LLaMA models using Direct Preference Optimization, aligning their outputs more closely with human preferences based on feedback data.