From 12e89fcbe91b58e7c0af6b91dd4ce800a3212b3d Mon Sep 17 00:00:00 2001 From: mohammadreza Date: Fri, 26 Apr 2024 13:09:58 +0330 Subject: [PATCH 01/18] feat: add fine tuning on simple task using single GPU with fast inference jupyter notebook --- notebooks/en/_toctree.yml | 3 + ...sk_on_single_gpu_with_fast_inference.ipynb | 560 ++++++++++++++++++ notebooks/en/index.md | 1 + 3 files changed, 564 insertions(+) create mode 100644 notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 72d9c2da..d04b9024 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -21,6 +21,9 @@ title: RAG Evaluation - local: llm_judge title: Using LLM-as-a-judge for an automated and versatile evaluation + - local: fine_tuning_simple_task_on_single_gpu_with_fast_inference + title: Fin-tuning LLM on a simple task using single GPU with fast inference + - title: Diffusion Recipes sections: diff --git a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb new file mode 100644 index 00000000..e2d26f79 --- /dev/null +++ b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb @@ -0,0 +1,560 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Fin-tuning LLM on a simple task using single GPU with fast inference\n", + "\n", + "_Authored by: [Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_" + ], + "metadata": { + "collapsed": false + }, + "id": "a59bf2a9e5015030" + }, + { + "cell_type": "markdown", + "source": [ + "In this notebook, the attempt has been made to fine-tune an LLM in the simplest manner without adding unnecessary complexity, with a parameter count suitable for a Customer-level GPU, and then to perform inference using one of the fastest open-source inference engines, Vllm. I have tried to explain all the concepts and techniques used as far as possible; however, since there are many concepts and techniques to explain, I firstly gave a priority based on importance so that the more important concepts and techniques can be studied first. Secondly, since others have written these explanations well and in more detail in blogs, I have referred to these links. As the Iranian saying goes, \"In a house of wisdom, a few words suffice\" :)\n", + "Let's get started.\n" + ], + "metadata": { + "collapsed": false + }, + "id": "755fc90c27f1cb99" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "import torch\n", + "from datasets import load_dataset\n", + "from transformers import (\n", + " AutoModelForCausalLM,\n", + " AutoTokenizer,\n", + " BitsAndBytesConfig,\n", + " TrainingArguments,\n", + ")\n", + "from peft import LoraConfig, PeftModel\n", + "from trl import SFTTrainer, DataCollatorForCompletionOnlyLM" + ], + "metadata": { + "collapsed": false + }, + "id": "3a35eafbe37e4ad2" + }, + { + "cell_type": "markdown", + "source": [ + "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft). 
Importance level: 1" + ], + "metadata": { + "collapsed": false + }, + "id": "64671b5d61ba9d57" + }, + { + "cell_type": "markdown", + "source": [ + "After the emergence of LLMs, a task known as Alignment was created, in which we try to produce outputs from LLMs that are compatible with our preferences. We start with simple supervised fine tuning or SFT in the first stage, and in the second stage, a mechanism for receiving feedback from users is created, and with other techniques, we make the LLM more aligned with our preferences. The `trl` library has been created for such a task, and this library is used in the first stage, which is SFT. For further reading on the Alignment task, see [OpenAI Blog on Instruction Following](https://openai.com/research/instruction-following#fn1) and [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf). Importance level: 1\n" + ], + "metadata": { + "collapsed": false + }, + "id": "a1e7c704058c2373" + }, + { + "cell_type": "markdown", + "source": [ + "## Set parameters" + ], + "metadata": { + "collapsed": false + }, + "id": "261a8f52fe09202e" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# General parameters\n", + "model_name = \"NousResearch/Llama-2-7b-chat-hf\" # The model that you want to train from the Hugging Face hub\n", + "dataset_name = \"yahma/alpaca-cleaned\" # The instruction dataset to use\n", + "new_model = \"llama-2-7b-alpaca\" # The name for fine-tuned LoRA Adaptor\n", + "\n", + "# LoRA parameters\n", + "lora_r = 64\n", + "lora_alpha = lora_r * 2\n", + "lora_dropout = 0.1\n", + "target_modules = [\"q_proj\", \"v_proj\", 'k_proj']\n", + "\n", + "# QLoRA parameters\n", + "load_in_4bit = True\n", + "bnb_4bit_compute_dtype = \"float16\"\n", + "bnb_4bit_quant_type = \"nf4\"\n", + "bnb_4bit_use_double_quant = False\n", + "\n", + "# TrainingArguments parameters\n", + "num_train_epochs = 1\n", + "fp16 = False\n", + "bf16 = False\n", + "per_device_train_batch_size = 4\n", + "per_device_eval_batch_size = 4\n", + "gradient_accumulation_steps = 1\n", + "gradient_checkpointing = True\n", + "learning_rate = 0.00015\n", + "weight_decay = 0.01\n", + "optim = \"paged_adamw_32bit\"\n", + "lr_scheduler_type = \"cosine\"\n", + "max_steps = -1\n", + "warmup_ratio = 0.03\n", + "group_by_length = True\n", + "save_steps = 0\n", + "logging_steps = 25\n", + "\n", + "# SFT parameters\n", + "max_seq_length = None\n", + "packing = False\n", + "device_map = {\"\": 0}\n", + "\n", + "# Dataset parameters\n", + "use_special_template = True\n", + "response_template = ' ### Answer:'\n", + "instruction_prompt_template = '\"### Human:\"'\n", + "use_llama_like_model = True" + ], + "metadata": { + "collapsed": false + }, + "id": "96fccf9f7364bac6" + }, + { + "cell_type": "markdown", + "source": [ + "## Train Code" + ], + "metadata": { + "collapsed": false + }, + "id": "234ef91c9c1c0789" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Load dataset (you can process it here)\n", + "dataset = load_dataset(dataset_name, split=\"train\")\n", + "percent_of_train_dataset = 0.95\n", + "other_columns = [i for i in dataset.column_names if i not in ['instruction', 'output', 'text']]\n", + "dataset = dataset.remove_columns(other_columns)\n", + "split_dataset = dataset.train_test_split(train_size=int(dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)\n", + "train_dataset = split_dataset[\"train\"]\n", + "eval_dataset = split_dataset[\"test\"]\n", + "print(f\"Size of the train set: {len(train_dataset)}. 
Size of the validation set: {len(eval_dataset)}\")" + ], + "metadata": { + "collapsed": false + }, + "id": "8cc58fe0c4b229e0" + }, + { + "cell_type": "markdown", + "source": [ + "Two techniques, LoRA and QLoRA, are among the most important techniques of PEFT. In brief, LoRA aims to open only these layers for fine-tuning by constructing and adding a low-rank matrix to each of the model layers, thus neither changing the model weights nor requiring lengthy training, and the created weights are lightweight and can be produced multiple times, allowing multiple tasks to be fine-tuned with an LLM loaded into RAM. In the QLoRA technique, the weights are quantized to 4 bits, further reducing RAM consumption. Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune). Importance level: 2\n" + ], + "metadata": { + "collapsed": false + }, + "id": "382296d37668763c" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Load QLoRA configuration\n", + "compute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=load_in_4bit,\n", + " bnb_4bit_quant_type=bnb_4bit_quant_type,\n", + " bnb_4bit_compute_dtype=compute_dtype,\n", + " bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,\n", + ")" + ], + "metadata": { + "collapsed": false + }, + "id": "32d8aa11a6d47e0d" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Load LoRA configuration\n", + "peft_config = LoraConfig(\n", + " lora_alpha=lora_alpha,\n", + " lora_dropout=lora_dropout,\n", + " r=lora_r,\n", + " bias=\"none\",\n", + " task_type=\"CAUSAL_LM\",\n", + ")" + ], + "metadata": { + "collapsed": false + }, + "id": "8a5216910d0a339a" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Load base model\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " model_name,\n", + " quantization_config=bnb_config,\n", + " device_map=device_map\n", + ")\n", + "model.config.use_cache = False" + ], + "metadata": { + "collapsed": false + }, + "id": "bacbbc9ddd19504d" + }, + { + "cell_type": "markdown", + "source": [ + "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). 
Importance level: 1\n" + ], + "metadata": { + "collapsed": false + }, + "id": "56219e83015a7357" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Set training parameters\n", + "training_arguments = TrainingArguments(\n", + " output_dir=new_model,\n", + " num_train_epochs=num_train_epochs,\n", + " per_device_train_batch_size=per_device_train_batch_size,\n", + " gradient_accumulation_steps=gradient_accumulation_steps,\n", + " optim=optim,\n", + " save_steps=save_steps,\n", + " logging_steps=logging_steps,\n", + " learning_rate=learning_rate,\n", + " weight_decay=weight_decay,\n", + " fp16=fp16,\n", + " bf16=bf16,\n", + " max_steps=max_steps,\n", + " warmup_ratio=warmup_ratio,\n", + " gradient_checkpointing=gradient_checkpointing,\n", + " group_by_length=group_by_length,\n", + " lr_scheduler_type=lr_scheduler_type\n", + ")" + ], + "metadata": { + "collapsed": false + }, + "id": "a82c50bc69c3632b" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Load tokenizer\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n", + "tokenizer.pad_token = tokenizer.eos_token\n", + "tokenizer.padding_side = \"right\" # Fix weird overflow issue with fp16 training\n", + "if not tokenizer.chat_template:\n", + " tokenizer.chat_template = \"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}\"" + ], + "metadata": { + "collapsed": false + }, + "id": "c86b66f59bee28dc" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "def special_formatting_prompts(example):\n", + " output_texts = []\n", + " for i in range(len(example['instruction'])):\n", + " text = f\"{instruction_prompt_template}{example['instruction'][i]}{example['input'][i]}\\n{response_template} {example['output'][i]}\"\n", + " output_texts.append(text)\n", + " return output_texts\n", + "\n", + "\n", + "def normal_formatting_prompts(example):\n", + " output_texts = []\n", + " for i in range(len(example['instruction'])):\n", + " chat_temp = [{\"role\": \"system\", \"content\": example['instruction'][i]},\n", + " {\"role\": \"user\", \"content\": {example['input'][i]}},\n", + " {\"role\": \"assistant\", \"content\": example['output'][i]}]\n", + " text = tokenizer.apply_chat_template(chat_temp, tokenize=False)\n", + " output_texts.append(text)\n", + " return output_texts\n" + ], + "metadata": { + "collapsed": false + }, + "id": "7d3f935e03db79b8" + }, + { + "cell_type": "markdown", + "source": [ + "Regarding the chat template, let me briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response, so the model precisely understands where each message comes from and has a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates). 
Importance level: 3\n" + ], + "metadata": { + "collapsed": false + }, + "id": "ea4399c36bcdcbbd" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "if use_special_template:\n", + " formatting_func = special_formatting_prompts\n", + " if use_llama_like_model:\n", + " response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]\n", + " collator = DataCollatorForCompletionOnlyLM(response_template=response_template_ids, tokenizer=tokenizer)\n", + " else:\n", + " collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)\n", + "else:\n", + " formatting_func = normal_formatting_prompts" + ], + "metadata": { + "collapsed": false + }, + "id": "95dc3db0d6c5ddaf" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "trainer = SFTTrainer(\n", + " model=model,\n", + " train_dataset=train_dataset,\n", + " eval_dataset=eval_dataset,\n", + " peft_config=peft_config,\n", + " formatting_func=formatting_func,\n", + " data_collator=collator,\n", + " max_seq_length=max_seq_length,\n", + " tokenizer=tokenizer,\n", + " args=training_arguments,\n", + " packing=packing\n", + ")" + ], + "metadata": { + "collapsed": false + }, + "id": "48e09edab86c4212" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "# Train model\n", + "trainer.train()\n", + "\n", + "# Save fine tuned Lora Adaptor \n", + "trainer.model.save_pretrained(new_model)" + ], + "metadata": { + "collapsed": false + }, + "id": "a17a3b28010ce90e" + }, + { + "cell_type": "markdown", + "source": [ + "## Inference Code" + ], + "metadata": { + "collapsed": false + }, + "id": "39abd4f63776cc49" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "import torch\n", + "import gc\n", + "\n", + "\n", + "def clear_hardwares():\n", + " torch.clear_autocast_cache()\n", + " torch.cuda.ipc_collect()\n", + " torch.cuda.empty_cache()\n", + " gc.collect()" + ], + "metadata": { + "collapsed": false + }, + "id": "70cca01bc96d9ead" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "clear_hardwares()\n", + "clear_hardwares()" + ], + "metadata": { + "collapsed": false + }, + "id": "76760bc5f6c5c632" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "def generate(model, prompt: str, kwargs):\n", + " tokenized_prompt = tokenizer(prompt, return_tensors='pt').to(model.device)\n", + " prompt_length = len(tokenized_prompt.get('input_ids')[0])\n", + " with torch.cuda.amp.autocast():\n", + " output_tokens = model.generate(**tokenized_prompt, **kwargs) if kwargs else model.generate(**tokenized_prompt)\n", + " output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)\n", + " return output" + ], + "metadata": { + "collapsed": false + }, + "id": "dd8313238b26e95e" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "base_model = AutoModelForCausalLM.from_pretrained(new_model, return_dict=True, device_map='auto', token='')\n", + "tokenizer = AutoTokenizer.from_pretrained(new_model, max_length=max_seq_length)\n", + "model = PeftModel.from_pretrained(base_model, new_model)\n", + "del base_model" + ], + "metadata": { + "collapsed": false + }, + "id": "d3fe5a27fa40ba9" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "sample = eval_dataset[0]\n", + "if use_special_template:\n", + " prompt = f\"{instruction_prompt_template}{sample['instruction']}{sample['input']}\\n{response_template}\"\n", + "else:\n", + " chat_temp = [{\"role\": \"system\", \"content\": sample['instruction']},\n", + " 
{\"role\": \"user\", \"content\": {sample['input']}}]\n", + " prompt = tokenizer.apply_chat_template(chat_temp, tokenize=False, add_generation_prompt=True)" + ], + "metadata": { + "collapsed": false + }, + "id": "70682a07fcaaca3f" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "gen_kwargs = {\"max_new_tokens\": 1024}\n", + "generated_texts = generate(model=model, prompt=prompt, kwargs=gen_kwargs)\n", + "print(generated_texts)" + ], + "metadata": { + "collapsed": false + }, + "id": "febeb00f0a6f0b5e" + }, + { + "cell_type": "markdown", + "source": [ + "## Merge to base model" + ], + "metadata": { + "collapsed": false + }, + "id": "c18abf489437a546" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "clear_hardwares()\n", + "merged_model = model.merge_and_unload()\n", + "clear_hardwares()\n", + "del model\n", + "new_model_name = 'your_hf_account/your_desired_name'\n", + "merged_model.push_to_hub(new_model_name)" + ], + "metadata": { + "collapsed": false + }, + "id": "4f5f450001bf428f" + }, + { + "cell_type": "markdown", + "source": [ + "## Fast Inference with [Vllm](https://github.com/vllm-project/vllm)\n" + ], + "metadata": { + "collapsed": false + }, + "id": "4851ef41e4cc4f95" + }, + { + "cell_type": "markdown", + "source": [ + "The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). Importance level: 3\n" + ], + "metadata": { + "collapsed": false + }, + "id": "fe82f0a57fe86f60" + }, + { + "cell_type": "code", + "outputs": [], + "source": [ + "from vllm import LLM, SamplingParams\n", + "\n", + "gen_kwargs = {\"max_tokens\": 1024}\n", + "\n", + "llm = LLM(model=new_model_name, gpu_memory_utilization=0.9, trust_remote_code=True)\n", + "sampling_params = SamplingParams(**gen_kwargs)\n", + "outputs = llm.generate(prompt, gen_kwargs)\n", + "print(outputs)" + ], + "metadata": { + "collapsed": false + }, + "id": "88bee8960b176e87" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/en/index.md b/notebooks/en/index.md index b22cc465..f3fc8f84 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -10,6 +10,7 @@ Check out the recently added notebooks: - [Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation](llm_judge) - [Create a legal preference dataset](pipeline_notus_instructions_preferences_legal) - [Suggestions for Data Annotation with SetFit in Zero-shot Text Classification](labelling_feedback_setfit) +- [Fine-tune an LLM on simple task using single GPU with fast inference](fine_tuning_simple_task_on_single_gpu_with_fast_inference) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). 
From 55b0cede68aa4b1411a69b65c7fff756d826a1a8 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:54:35 +0330 Subject: [PATCH 02/18] Update fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb fix: add "execution_count": null --- ...sk_on_single_gpu_with_fast_inference.ipynb | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb index e2d26f79..637ec520 100644 --- a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb +++ b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb @@ -26,6 +26,7 @@ { "cell_type": "code", "outputs": [], + "execution_count": null, "source": [ "import torch\n", "from datasets import load_dataset\n", @@ -75,6 +76,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# General parameters\n", @@ -140,6 +142,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Load dataset (you can process it here)\n", @@ -169,6 +172,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Load QLoRA configuration\n", @@ -188,6 +192,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Load LoRA configuration\n", @@ -206,6 +211,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Load base model\n", @@ -233,6 +239,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Set training parameters\n", @@ -262,6 +269,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Load tokenizer\n", @@ -278,6 +286,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "def special_formatting_prompts(example):\n", @@ -315,6 +324,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "if use_special_template:\n", @@ -334,6 +344,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "trainer = SFTTrainer(\n", @@ -356,6 +367,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "# Train model\n", @@ -381,6 +393,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "import torch\n", @@ -400,6 +413,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "clear_hardwares()\n", @@ -412,6 +426,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "def generate(model, prompt: str, kwargs):\n", @@ -429,6 +444,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "base_model = AutoModelForCausalLM.from_pretrained(new_model, return_dict=True, device_map='auto', token='')\n", @@ -443,6 +459,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "sample = eval_dataset[0]\n", @@ -460,6 +477,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "gen_kwargs = {\"max_new_tokens\": 1024}\n", @@ -483,6 +501,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "clear_hardwares()\n", @@ -519,6 +538,7 @@ }, { "cell_type": "code", + "execution_count": null, "outputs": [], "source": [ "from vllm import LLM, SamplingParams\n", From 
765ba509f43459be9671f197554ce0571a4255b9 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Sat, 27 Apr 2024 14:51:32 +0330 Subject: [PATCH 03/18] Update _toctree.yml --- notebooks/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index d04b9024..6642fe25 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -22,7 +22,7 @@ - local: llm_judge title: Using LLM-as-a-judge for an automated and versatile evaluation - local: fine_tuning_simple_task_on_single_gpu_with_fast_inference - title: Fin-tuning LLM on a simple task using single GPU with fast inference + title: Fine-tuning LLM on a simple task using single GPU with fast inference - title: Diffusion Recipes From 216c47b334671160269dcc4f613bff55321b2bfb Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:48:15 +0330 Subject: [PATCH 04/18] Update fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb --- ...sk_on_single_gpu_with_fast_inference.ipynb | 21 +++++++++---------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb index 637ec520..920556e8 100644 --- a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb +++ b/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "source": [ - "# Fin-tuning LLM on a simple task using single GPU with fast inference\n", + "# Fin-tuning LLM for Generate Persian Product Catalogs in JSON Format\n", "\n", "_Authored by: [Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_" ], @@ -15,8 +15,9 @@ { "cell_type": "markdown", "source": [ - "In this notebook, the attempt has been made to fine-tune an LLM in the simplest manner without adding unnecessary complexity, with a parameter count suitable for a Customer-level GPU, and then to perform inference using one of the fastest open-source inference engines, Vllm. I have tried to explain all the concepts and techniques used as far as possible; however, since there are many concepts and techniques to explain, I firstly gave a priority based on importance so that the more important concepts and techniques can be studied first. Secondly, since others have written these explanations well and in more detail in blogs, I have referred to these links. As the Iranian saying goes, \"In a house of wisdom, a few words suffice\" :)\n", - "Let's get started.\n" + "In this notebook, an attempt has been made to fine-tune a Large Language Model (LLM) in the simplest manner possible, without adding unnecessary complexity. The model has been optimized to be suitable for a customer-level GPU, which is used to generate Persian product catalogs and produce structured output in JSON format. This model is particularly effective in creating structured outputs for the unstructured titles and descriptions of products on Iranian platforms with user-generated content. Such platforms include [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. 
You can also see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1).\n", + "Additionally, one of the fastest open-source inference engines, Vllm, is employed for inference. I have endeavored to explain all the relevant concepts and techniques as clearly as possible. However, given the vast number of concepts and techniques involved, I have prioritized them based on their importance, allowing the more critical ones to be studied first. Moreover, since others have provided detailed explanations in blogs, I have included references to these sources. As the Iranian saying goes, 'In a house of wisdom, a few words suffice.' :)\n", + "Let's get started." ], "metadata": { "collapsed": false @@ -81,8 +82,8 @@ "source": [ "# General parameters\n", "model_name = \"NousResearch/Llama-2-7b-chat-hf\" # The model that you want to train from the Hugging Face hub\n", - "dataset_name = \"yahma/alpaca-cleaned\" # The instruction dataset to use\n", - "new_model = \"llama-2-7b-alpaca\" # The name for fine-tuned LoRA Adaptor\n", + "dataset_name = \"BaSalam/entity-attribute-dataset-GPT-3.5-generated-v1\" # The instruction dataset to use\n", + "new_model = \"llama-persian-catalog-generator\" # The name for fine-tuned LoRA Adaptor\n", "\n", "# LoRA parameters\n", "lora_r = 64\n", @@ -148,7 +149,7 @@ "# Load dataset (you can process it here)\n", "dataset = load_dataset(dataset_name, split=\"train\")\n", "percent_of_train_dataset = 0.95\n", - "other_columns = [i for i in dataset.column_names if i not in ['instruction', 'output', 'text']]\n", + "other_columns = [i for i in dataset.column_names if i not in ['instruction', 'output']]\n", "dataset = dataset.remove_columns(other_columns)\n", "split_dataset = dataset.train_test_split(train_size=int(dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)\n", "train_dataset = split_dataset[\"train\"]\n", @@ -292,7 +293,7 @@ "def special_formatting_prompts(example):\n", " output_texts = []\n", " for i in range(len(example['instruction'])):\n", - " text = f\"{instruction_prompt_template}{example['instruction'][i]}{example['input'][i]}\\n{response_template} {example['output'][i]}\"\n", + " text = f\"{instruction_prompt_template}{example['instruction'][i]}\\n{response_template} {example['output'][i]}\"\n", " output_texts.append(text)\n", " return output_texts\n", "\n", @@ -301,7 +302,6 @@ " output_texts = []\n", " for i in range(len(example['instruction'])):\n", " chat_temp = [{\"role\": \"system\", \"content\": example['instruction'][i]},\n", - " {\"role\": \"user\", \"content\": {example['input'][i]}},\n", " {\"role\": \"assistant\", \"content\": example['output'][i]}]\n", " text = tokenizer.apply_chat_template(chat_temp, tokenize=False)\n", " output_texts.append(text)\n", @@ -464,10 +464,9 @@ "source": [ "sample = eval_dataset[0]\n", "if use_special_template:\n", - " prompt = f\"{instruction_prompt_template}{sample['instruction']}{sample['input']}\\n{response_template}\"\n", + " prompt = f\"{instruction_prompt_template}{sample['instruction']}\\n{response_template}\"\n", "else:\n", - " chat_temp = [{\"role\": \"system\", \"content\": sample['instruction']},\n", - " {\"role\": \"user\", \"content\": {sample['input']}}]\n", + " chat_temp = [{\"role\": \"system\", \"content\": sample['instruction']}]\n", " prompt = tokenizer.apply_chat_template(chat_temp, tokenize=False, add_generation_prompt=True)" ], "metadata": { From f4b34068cbce90dc11fcf35c35aa40eca83b69f8 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan 
<55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:48:58 +0330 Subject: [PATCH 05/18] Rename fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb to fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb --- ...llm_for_generate_persian_product_catalog_in_json_format.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notebooks/en/{fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb => fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb} (100%) diff --git a/notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb similarity index 100% rename from notebooks/en/fine_tuning_simple_task_on_single_gpu_with_fast_inference.ipynb rename to notebooks/en/fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb From e292dc8f824c0833deeab993ca7770dc3c75ff20 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:49:33 +0330 Subject: [PATCH 06/18] Rename fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb to fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb --- ...lm_for_generate_persian_product_catalogs_in_json_format.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notebooks/en/{fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb => fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb} (100%) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb similarity index 100% rename from notebooks/en/fine_tuning_llm_for_generate_persian_product_catalog_in_json_format.ipynb rename to notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb From 8457a5937440e0eda0408793f409ba5c33f07233 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:50:30 +0330 Subject: [PATCH 07/18] Update _toctree.yml --- notebooks/en/_toctree.yml | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 6642fe25..9b81fc89 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -21,9 +21,8 @@ title: RAG Evaluation - local: llm_judge title: Using LLM-as-a-judge for an automated and versatile evaluation - - local: fine_tuning_simple_task_on_single_gpu_with_fast_inference - title: Fine-tuning LLM on a simple task using single GPU with fast inference - + - local: fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format + title: Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format - title: Diffusion Recipes sections: From a1407cad50a5c7986b95765a9c7a33d08e69ea2b Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:52:35 +0330 Subject: [PATCH 08/18] Update index.md --- notebooks/en/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/index.md b/notebooks/en/index.md index f3fc8f84..6ab09c31 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -10,7 +10,7 @@ Check out the recently added notebooks: - [Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation](llm_judge) - 
[Create a legal preference dataset](pipeline_notus_instructions_preferences_legal) - [Suggestions for Data Annotation with SetFit in Zero-shot Text Classification](labelling_feedback_setfit) -- [Fine-tune an LLM on simple task using single GPU with fast inference](fine_tuning_simple_task_on_single_gpu_with_fast_inference) +- [Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format](fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). From 1e32acae9d6fd81f771d895c5012306baa3f8442 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 30 Apr 2024 23:52:54 +0330 Subject: [PATCH 09/18] Update fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb --- ...m_for_generate_persian_product_catalogs_in_json_format.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb index 920556e8..f9480296 100644 --- a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "source": [ - "# Fin-tuning LLM for Generate Persian Product Catalogs in JSON Format\n", + "# Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format\n", "\n", "_Authored by: [Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_" ], From fd96302ca5bb41ada6c06f490bad6ec6dd0a8c8d Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Mon, 13 May 2024 00:12:33 +0330 Subject: [PATCH 10/18] Update fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb feat: add an example and remove importance level --- ...sian_product_catalogs_in_json_format.ipynb | 62 ++++++++++++++----- 1 file changed, 46 insertions(+), 16 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb index f9480296..89e31a6b 100644 --- a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb @@ -16,8 +16,9 @@ "cell_type": "markdown", "source": [ "In this notebook, an attempt has been made to fine-tune a Large Language Model (LLM) in the simplest manner possible, without adding unnecessary complexity. The model has been optimized to be suitable for a customer-level GPU, which is used to generate Persian product catalogs and produce structured output in JSON format. This model is particularly effective in creating structured outputs for the unstructured titles and descriptions of products on Iranian platforms with user-generated content. Such platforms include [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. You can also see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1).\n", - "Additionally, one of the fastest open-source inference engines, Vllm, is employed for inference. 
I have endeavored to explain all the relevant concepts and techniques as clearly as possible. However, given the vast number of concepts and techniques involved, I have prioritized them based on their importance, allowing the more critical ones to be studied first. Moreover, since others have provided detailed explanations in blogs, I have included references to these sources. As the Iranian saying goes, 'In a house of wisdom, a few words suffice.' :)\n", - "Let's get started." + "Additionally, one of the fastest open-source inference engines, Vllm, is employed for inference. \n", + "I tried to mention all the important concepts and techniques in the description section. However, since others have written valuable explanations and blogs, I simply referred to these sources. As the Iranian saying goes, 'In a house of wisdom, a few words suffice.' :)\n", + "Let's get started" ], "metadata": { "collapsed": false @@ -46,19 +47,19 @@ "id": "3a35eafbe37e4ad2" }, { - "cell_type": "markdown", + "cell_type": "raw", "source": [ - "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft). Importance level: 1" + "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft)." ], "metadata": { "collapsed": false }, - "id": "64671b5d61ba9d57" + "id": "77a201b51133c2f9" }, { "cell_type": "markdown", "source": [ - "After the emergence of LLMs, a task known as Alignment was created, in which we try to produce outputs from LLMs that are compatible with our preferences. We start with simple supervised fine tuning or SFT in the first stage, and in the second stage, a mechanism for receiving feedback from users is created, and with other techniques, we make the LLM more aligned with our preferences. The `trl` library has been created for such a task, and this library is used in the first stage, which is SFT. For further reading on the Alignment task, see [OpenAI Blog on Instruction Following](https://openai.com/research/instruction-following#fn1) and [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf). Importance level: 1\n" + "After the emergence of LLMs, a task known as Alignment was created, in which we try to produce outputs from LLMs that are compatible with our preferences. We start with simple supervised fine tuning or SFT in the first stage, and in the second stage, a mechanism for receiving feedback from users is created, and with other techniques, we make the LLM more aligned with our preferences. The `trl` library has been created for such a task, and this library is used in the first stage, which is SFT. 
For further reading on the Alignment task, see [OpenAI Blog on Instruction Following](https://openai.com/research/instruction-following#fn1) and [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf). " ], "metadata": { "collapsed": false @@ -164,7 +165,7 @@ { "cell_type": "markdown", "source": [ - "Two techniques, LoRA and QLoRA, are among the most important techniques of PEFT. In brief, LoRA aims to open only these layers for fine-tuning by constructing and adding a low-rank matrix to each of the model layers, thus neither changing the model weights nor requiring lengthy training, and the created weights are lightweight and can be produced multiple times, allowing multiple tasks to be fine-tuned with an LLM loaded into RAM. In the QLoRA technique, the weights are quantized to 4 bits, further reducing RAM consumption. Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune). Importance level: 2\n" + "Two techniques, LoRA and QLoRA, are among the most important techniques of PEFT. In brief, LoRA aims to open only these layers for fine-tuning by constructing and adding a low-rank matrix to each of the model layers, thus neither changing the model weights nor requiring lengthy training, and the created weights are lightweight and can be produced multiple times, allowing multiple tasks to be fine-tuned with an LLM loaded into RAM. In the QLoRA technique, the weights are quantized to 4 bits, further reducing RAM consumption. Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune).\n" ], "metadata": { "collapsed": false @@ -231,7 +232,7 @@ { "cell_type": "markdown", "source": [ - "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). Importance level: 1\n" + "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). \n" ], "metadata": { "collapsed": false @@ -315,7 +316,7 @@ { "cell_type": "markdown", "source": [ - "Regarding the chat template, let me briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response, so the model precisely understands where each message comes from and has a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. 
However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates). Importance level: 3\n" + "Regarding the chat template, let me briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response, so the model precisely understands where each message comes from and has a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" ], "metadata": { "collapsed": false @@ -528,7 +529,8 @@ { "cell_type": "markdown", "source": [ - "The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). Importance level: 3\n" + "The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). \n", + "In this example, we are inferring version 1 of our fine-tuned model on this task." ], "metadata": { "collapsed": false @@ -538,16 +540,44 @@ { "cell_type": "code", "execution_count": null, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.69s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " {\"attributes\": {\"قد جلوی کار\": \"85 سانتی متر\", \"قد پشت کار\": \"88 سانتی متر\"}, \"product_entity\": \"مانتو اسپرت\"} }\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], "source": [ "from vllm import LLM, SamplingParams\n", "\n", - "gen_kwargs = {\"max_tokens\": 1024}\n", + "prompt = \"\"\"### Question: here is a product title from a Iranian marketplace. \\n give me the Product Entity and Attributes of this product in Persian language.\\n give the output in this json format: {'attributes': {'attribute_name' : , ...}, 'product_entity': ''}.\\n Don't make assumptions about what values to plug into json. 
Just give Json not a single word more.\\n \\nproduct title:\"\"\"\n", + "user_prompt_template = '### Question: '\n", + "response_template = ' ### Answer:'\n", + "\n", + "llm = LLM(model='BaSalam/Llama2-7b-entity-attr-v1', gpu_memory_utilization=0.9, trust_remote_code=True)\n", + "\n", + "product = 'مانتو اسپرت پانیذ قد جلوی کار حدودا 85 سانتی متر قد پشت کار حدودا 88 سانتی متر'\n", + "sampling_params = SamplingParams(temperature=0.0, max_tokens=75)\n", + "prompt = f'{user_prompt_template} {prompt}{product}\\n {response_template}'\n", + "outputs = llm.generate(prompt, sampling_params)\n", "\n", - "llm = LLM(model=new_model_name, gpu_memory_utilization=0.9, trust_remote_code=True)\n", - "sampling_params = SamplingParams(**gen_kwargs)\n", - "outputs = llm.generate(prompt, gen_kwargs)\n", - "print(outputs)" + "print(outputs[0].outputs[0].text)" ], "metadata": { "collapsed": false From 3047f51a21d5cb8dc1f08d97af1f597971886b56 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 14 May 2024 14:27:21 +0330 Subject: [PATCH 11/18] Update fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb Change example output from code cell to markdown cell --- ...sian_product_catalogs_in_json_format.ipynb | 42 ++++++++----------- 1 file changed, 18 insertions(+), 24 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb index 89e31a6b..9f3d3e97 100644 --- a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb @@ -540,29 +540,7 @@ { "cell_type": "code", "execution_count": null, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.69s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - " {\"attributes\": {\"قد جلوی کار\": \"85 سانتی متر\", \"قد پشت کار\": \"88 سانتی متر\"}, \"product_entity\": \"مانتو اسپرت\"} }\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], + "outputs": [], "source": [ "from vllm import LLM, SamplingParams\n", "\n", @@ -583,7 +561,23 @@ "collapsed": false }, "id": "88bee8960b176e87" - } + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example Output\n\n", + "```\n", + "{\n", + " \"attributes\": {\n", + " \"قد جلوی کار\": \"85 سانتی متر\",\n", + " \"قد پشت کار\": \"88 سانتی متر\"\n", + " },\n", + " \"product_entity\": \"مانتو اسپرت\"\n", + "}\n", + "```\n" + ] +} ], "metadata": { "kernelspec": { From 9a2f86ceb5fa0ccee6e4ee6937eaca02256833b8 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Wed, 22 May 2024 11:55:10 +0330 Subject: [PATCH 12/18] Update fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb resolve all merveenoyan hints and suggestions --- ...sian_product_catalogs_in_json_format.ipynb | 154 +++++++++++------- 1 file changed, 94 insertions(+), 60 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb index 9f3d3e97..feaa9775 100644 --- a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb +++ 
b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb @@ -15,10 +15,11 @@ { "cell_type": "markdown", "source": [ - "In this notebook, an attempt has been made to fine-tune a Large Language Model (LLM) in the simplest manner possible, without adding unnecessary complexity. The model has been optimized to be suitable for a customer-level GPU, which is used to generate Persian product catalogs and produce structured output in JSON format. This model is particularly effective in creating structured outputs for the unstructured titles and descriptions of products on Iranian platforms with user-generated content. Such platforms include [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. You can also see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1).\n", - "Additionally, one of the fastest open-source inference engines, Vllm, is employed for inference. \n", - "I tried to mention all the important concepts and techniques in the description section. However, since others have written valuable explanations and blogs, I simply referred to these sources. As the Iranian saying goes, 'In a house of wisdom, a few words suffice.' :)\n", - "Let's get started" + "In this notebook, we have attempted to fine-tune a large language model with no added complexity. The model has been optimized for use on a customer-level GPU to generate Persian product catalogs and produce structured output in JSON format. It is particularly effective for creating structured outputs from the unstructured titles and descriptions of products on Iranian platforms with user-generated content, such as [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. \n", + "\n", + "You can see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1). Additionally, one of the fastest open-source inference engines, [Vllm](https://github.com/vllm/vllm), is employed for inference. \n", + "\n", + "Let's get started!" ], "metadata": { "collapsed": false @@ -47,19 +48,19 @@ "id": "3a35eafbe37e4ad2" }, { - "cell_type": "raw", + "cell_type": "markdown", "source": [ "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft)." ], "metadata": { "collapsed": false }, - "id": "77a201b51133c2f9" + "id": "30caf9936156e430" }, { "cell_type": "markdown", "source": [ - "After the emergence of LLMs, a task known as Alignment was created, in which we try to produce outputs from LLMs that are compatible with our preferences. We start with simple supervised fine tuning or SFT in the first stage, and in the second stage, a mechanism for receiving feedback from users is created, and with other techniques, we make the LLM more aligned with our preferences. The `trl` library has been created for such a task, and this library is used in the first stage, which is SFT. 
For further reading on the Alignment task, see [OpenAI Blog on Instruction Following](https://openai.com/research/instruction-following#fn1) and [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf). " + "With the emergence of LLMs, the task of Alignment was developed to produce outputs compatible with our preferences. We start with simple supervised fine tuning (SFT) in the first stage. In the second stage, we implement a mechanism for receiving user feedback and apply other techniques to align the LLM with our preferences. The `trl` library supports this task and is used in the first stage, SFT. For further reading on the Alignment task, see the [OpenAI Blog on Instruction Following](https://www.openai.com/blog/instruction-following) and the [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf).\n" ], "metadata": { "collapsed": false @@ -69,7 +70,7 @@ { "cell_type": "markdown", "source": [ - "## Set parameters" + "## Set hyperparameters" ], "metadata": { "collapsed": false @@ -132,6 +133,20 @@ }, "id": "96fccf9f7364bac6" }, + { + "cell_type": "markdown", + "source": [ + "Two key techniques in PEFT are LoRA and QLoRA. LoRA (Low-Rank Adaptation) stores changes in weights by constructing and adding a low-rank matrix to each model layer. This method opens only these layers for fine-tuning, without changing the original model weights or requiring lengthy training. The resulting weights are lightweight and can be produced multiple times, allowing for the fine-tuning of multiple tasks with an LLM loaded into RAM. \n", + "\n", + "QLoRA quantizes the weights to 4 bits, further reducing RAM consumption. \n", + "\n", + "Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune).\n" + ], + "metadata": { + "collapsed": false + }, + "id": "382296d37668763c" + }, { "cell_type": "markdown", "source": [ @@ -144,7 +159,6 @@ }, { "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ "# Load dataset (you can process it here)\n", @@ -160,17 +174,8 @@ "metadata": { "collapsed": false }, - "id": "8cc58fe0c4b229e0" - }, - { - "cell_type": "markdown", - "source": [ - "Two techniques, LoRA and QLoRA, are among the most important techniques of PEFT. In brief, LoRA aims to open only these layers for fine-tuning by constructing and adding a low-rank matrix to each of the model layers, thus neither changing the model weights nor requiring lengthy training, and the created weights are lightweight and can be produced multiple times, allowing multiple tasks to be fine-tuned with an LLM loaded into RAM. In the QLoRA technique, the weights are quantized to 4 bits, further reducing RAM consumption. Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). 
For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune).\n" - ], - "metadata": { - "collapsed": false - }, - "id": "382296d37668763c" + "id": "8cc58fe0c4b229e0", + "execution_count": 0 }, { "cell_type": "code", @@ -229,16 +234,6 @@ }, "id": "bacbbc9ddd19504d" }, - { - "cell_type": "markdown", - "source": [ - "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). \n" - ], - "metadata": { - "collapsed": false - }, - "id": "56219e83015a7357" - }, { "cell_type": "code", "execution_count": null, @@ -269,6 +264,16 @@ }, "id": "a82c50bc69c3632b" }, + { + "cell_type": "markdown", + "source": [ + "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). \n" + ], + "metadata": { + "collapsed": false + }, + "id": "56219e83015a7357" + }, { "cell_type": "code", "execution_count": null, @@ -286,9 +291,18 @@ }, "id": "c86b66f59bee28dc" }, + { + "cell_type": "markdown", + "source": [ + "Regarding the chat template, we will briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response. This ensures that the model precisely understands where each message comes from and maintains a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be even more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" + ], + "metadata": { + "collapsed": false + }, + "id": "ea4399c36bcdcbbd" + }, { "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ "def special_formatting_prompts(example):\n", @@ -311,17 +325,8 @@ "metadata": { "collapsed": false }, - "id": "7d3f935e03db79b8" - }, - { - "cell_type": "markdown", - "source": [ - "Regarding the chat template, let me briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response, so the model precisely understands where each message comes from and has a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be more helpful. 
For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" - ], - "metadata": { - "collapsed": false - }, - "id": "ea4399c36bcdcbbd" + "id": "7d3f935e03db79b8", + "execution_count": 0 }, { "cell_type": "code", @@ -366,6 +371,18 @@ }, "id": "48e09edab86c4212" }, + { + "cell_type": "markdown", + "source": [ + "The `SFTTrainer` is then instantiated to handle supervised fine-tuning (SFT) of the model. This trainer is specifically designed for SFT and includes additional parameters such as `formatting_func` and `packing` which are not typically found in standard trainers.\n", + "`formatting_func`: A custom function to format training examples by combining instruction and response templates.\n", + "`packing`: Disables packing multiple samples into one sequence, which is not a standard parameter in the typical Trainer class.\n" + ], + "metadata": { + "collapsed": false + }, + "id": "38fb6fddbca5567e" + }, { "cell_type": "code", "execution_count": null, @@ -405,25 +422,15 @@ " torch.clear_autocast_cache()\n", " torch.cuda.ipc_collect()\n", " torch.cuda.empty_cache()\n", - " gc.collect()" - ], - "metadata": { - "collapsed": false - }, - "id": "70cca01bc96d9ead" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ + " gc.collect()\n", + "\n", "clear_hardwares()\n", "clear_hardwares()" ], "metadata": { "collapsed": false }, - "id": "76760bc5f6c5c632" + "id": "70cca01bc96d9ead" }, { "cell_type": "code", @@ -432,10 +439,13 @@ "source": [ "def generate(model, prompt: str, kwargs):\n", " tokenized_prompt = tokenizer(prompt, return_tensors='pt').to(model.device)\n", + " \n", " prompt_length = len(tokenized_prompt.get('input_ids')[0])\n", + " \n", " with torch.cuda.amp.autocast():\n", " output_tokens = model.generate(**tokenized_prompt, **kwargs) if kwargs else model.generate(**tokenized_prompt)\n", " output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)\n", + " \n", " return output" ], "metadata": { @@ -508,14 +518,36 @@ "merged_model = model.merge_and_unload()\n", "clear_hardwares()\n", "del model\n", - "new_model_name = 'your_hf_account/your_desired_name'\n", - "merged_model.push_to_hub(new_model_name)" + "adapter_model_name = 'your_hf_account/your_desired_name'\n", + "merged_model.push_to_hub(adapter_model_name)" ], "metadata": { "collapsed": false }, "id": "4f5f450001bf428f" }, + { + "cell_type": "markdown", + "source": [ + "Here, we merged the adapter with the base model and push the merged model on the hub. 
You can just push the adapter in the hub and avoid pushing the heavy base model file in this way:\n", + "```\n", + "model.push_to_hub(adapter_model_name)\n", + "```\n", + "And then you load the model in this way:\n", + "```\n", + "config = PeftConfig.from_pretrained(adapter_model_name)\n", + "model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')\n", + "tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)\n", + "\n", + "# Load the Lora model\n", + "model = PeftModel.from_pretrained(model, adapter_model_name)\n", + "```" + ], + "metadata": { + "collapsed": false + }, + "id": "16775c2ed49bfe11" + }, { "cell_type": "markdown", "source": [ @@ -566,7 +598,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Example Output\n\n", + "### Example Output\n", + "\n", "```\n", "{\n", " \"attributes\": {\n", @@ -576,8 +609,9 @@ " \"product_entity\": \"مانتو اسپرت\"\n", "}\n", "```\n" - ] -} + ], + "id": "dc007ced7ca34bbb" + } ], "metadata": { "kernelspec": { From 630216e95897a89d4423baa95fef6f963a9d5f52 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Sun, 16 Jun 2024 23:05:53 +0330 Subject: [PATCH 13/18] Update fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb --- ...sian_product_catalogs_in_json_format.ipynb | 542 +++++++++++------- 1 file changed, 342 insertions(+), 200 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb index feaa9775..6b045d57 100644 --- a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb @@ -2,34 +2,47 @@ "cells": [ { "cell_type": "markdown", + "id": "a59bf2a9e5015030", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "source": [ - "# Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format\n", + "# Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format\n", "\n", "_Authored by: [Mohammadreza Esmaeiliyan](https://github.com/MrzEsma)_" - ], - "metadata": { - "collapsed": false - }, - "id": "a59bf2a9e5015030" + ] }, { "cell_type": "markdown", + "id": "755fc90c27f1cb99", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "source": [ "In this notebook, we have attempted to fine-tune a large language model with no added complexity. The model has been optimized for use on a customer-level GPU to generate Persian product catalogs and produce structured output in JSON format. It is particularly effective for creating structured outputs from the unstructured titles and descriptions of products on Iranian platforms with user-generated content, such as [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. \n", "\n", "You can see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1). Additionally, one of the fastest open-source inference engines, [Vllm](https://github.com/vllm/vllm), is employed for inference. \n", "\n", "Let's get started!" 
- ], - "metadata": { - "collapsed": false - }, - "id": "755fc90c27f1cb99" + ] }, { "cell_type": "code", - "outputs": [], "execution_count": null, + "id": "3a35eafbe37e4ad2", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], "source": [ "import torch\n", "from datasets import load_dataset\n", @@ -41,64 +54,133 @@ ")\n", "from peft import LoraConfig, PeftModel\n", "from trl import SFTTrainer, DataCollatorForCompletionOnlyLM" - ], - "metadata": { - "collapsed": false - }, - "id": "3a35eafbe37e4ad2" + ] }, { "cell_type": "markdown", - "source": [ - "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft)." - ], + "id": "30caf9936156e430", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "30caf9936156e430" + "source": [ + "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft)." + ] }, { "cell_type": "markdown", - "source": [ - "With the emergence of LLMs, the task of Alignment was developed to produce outputs compatible with our preferences. We start with simple supervised fine tuning (SFT) in the first stage. In the second stage, we implement a mechanism for receiving user feedback and apply other techniques to align the LLM with our preferences. The `trl` library supports this task and is used in the first stage, SFT. 
For further reading on the Alignment task, see the [OpenAI Blog on Instruction Following](https://www.openai.com/blog/instruction-following) and the [Hugging Face Blog on RLHF](https://huggingface.co/blog/rlhf).\n" - ], + "id": "261a8f52fe09202e", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "a1e7c704058c2373" - }, - { - "cell_type": "markdown", "source": [ "## Set hyperparameters" - ], - "metadata": { - "collapsed": false - }, - "id": "261a8f52fe09202e" + ] }, { "cell_type": "code", "execution_count": null, + "id": "96fccf9f7364bac6", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# General parameters\n", "model_name = \"NousResearch/Llama-2-7b-chat-hf\" # The model that you want to train from the Hugging Face hub\n", "dataset_name = \"BaSalam/entity-attribute-dataset-GPT-3.5-generated-v1\" # The instruction dataset to use\n", - "new_model = \"llama-persian-catalog-generator\" # The name for fine-tuned LoRA Adaptor\n", - "\n", + "new_model = \"llama-persian-catalog-generator\" # The name for fine-tuned LoRA Adaptor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f69a97083bf19d9", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ "# LoRA parameters\n", "lora_r = 64\n", "lora_alpha = lora_r * 2\n", "lora_dropout = 0.1\n", - "target_modules = [\"q_proj\", \"v_proj\", 'k_proj']\n", + "target_modules = [\"q_proj\", \"v_proj\", 'k_proj']\n" + ] + }, + { + "cell_type": "markdown", + "id": "382296d37668763c", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "LoRA (Low-Rank Adaptation) stores changes in weights by constructing and adding a low-rank matrix to each model layer. This method opens only these layers for fine-tuning, without changing the original model weights or requiring lengthy training. The resulting weights are lightweight and can be produced multiple times, allowing for the fine-tuning of multiple tasks with an LLM loaded into RAM. \n", + "\n", "\n", + "Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "501beb388b6749ea", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ "# QLoRA parameters\n", "load_in_4bit = True\n", "bnb_4bit_compute_dtype = \"float16\"\n", "bnb_4bit_quant_type = \"nf4\"\n", - "bnb_4bit_use_double_quant = False\n", + "bnb_4bit_use_double_quant = False\n" + ] + }, + { + "cell_type": "markdown", + "id": "39149616eb21ec5b", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "QLoRA (Quantized Low-Rank Adaptation) is an efficient fine-tuning approach that enables large language models to run on smaller GPUs by using 4-bit quantization. This method preserves the full performance of 16-bit fine-tuning while reducing memory usage, making it possible to fine-tune models with up to 65 billion parameters on a single 48GB GPU. 
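To make these flags concrete, here is a minimal sketch of how they would typically be wired into a `BitsAndBytesConfig` and handed to `from_pretrained`. The notebook assembles essentially the same configuration in its training section, so treat this only as a preview of how the parameters defined above fit together; `device_map="auto"` is used here for brevity instead of the explicit map defined later.

```python
# Sketch only: the notebook builds this same configuration further down.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,                     # store base weights in 4-bit
    bnb_4bit_quant_type=bnb_4bit_quant_type,       # "nf4" NormalFloat quantization
    bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),  # e.g. torch.float16
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # the notebook later uses the explicit {"": 0} map
)
```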
QLoRA combines 4-bit NormalFloat data types, double quantization, and paged optimizers to manage memory efficiently. It allows fine-tuning of models with low-rank adapters, significantly enhancing accessibility for AI model development.\n", "\n", + "Read about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83f51e63e67aa87b", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ "# TrainingArguments parameters\n", "num_train_epochs = 1\n", "fp16 = False\n", @@ -126,39 +208,32 @@ "use_special_template = True\n", "response_template = ' ### Answer:'\n", "instruction_prompt_template = '\"### Human:\"'\n", - "use_llama_like_model = True" - ], - "metadata": { - "collapsed": false - }, - "id": "96fccf9f7364bac6" + "use_llama_like_model = True\n" + ] }, { "cell_type": "markdown", - "source": [ - "Two key techniques in PEFT are LoRA and QLoRA. LoRA (Low-Rank Adaptation) stores changes in weights by constructing and adding a low-rank matrix to each model layer. This method opens only these layers for fine-tuning, without changing the original model weights or requiring lengthy training. The resulting weights are lightweight and can be produced multiple times, allowing for the fine-tuning of multiple tasks with an LLM loaded into RAM. \n", - "\n", - "QLoRA quantizes the weights to 4 bits, further reducing RAM consumption. \n", - "\n", - "Read about LoRA [here at Lightning AI](https://lightning.ai/pages/community/tutorial/lora-llm/) and about QLoRA [here at Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes). For other efficient training methods, see [Hugging Face Docs on Performance Training](https://huggingface.co/docs/transformers/perf_train_gpu_one) and [SFT Trainer Enhancement](https://huggingface.co/docs/trl/main/en/sft_trainer#enhance-models-performances-using-neftune).\n" - ], + "id": "234ef91c9c1c0789", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "382296d37668763c" - }, - { - "cell_type": "markdown", "source": [ "## Train Code" - ], - "metadata": { - "collapsed": false - }, - "id": "234ef91c9c1c0789" + ] }, { "cell_type": "code", + "execution_count": null, + "id": "8cc58fe0c4b229e0", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Load dataset (you can process it here)\n", @@ -170,16 +245,18 @@ "train_dataset = split_dataset[\"train\"]\n", "eval_dataset = split_dataset[\"test\"]\n", "print(f\"Size of the train set: {len(train_dataset)}. 
Size of the validation set: {len(eval_dataset)}\")" - ], - "metadata": { - "collapsed": false - }, - "id": "8cc58fe0c4b229e0", - "execution_count": 0 + ] }, { "cell_type": "code", "execution_count": null, + "id": "32d8aa11a6d47e0d", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Load QLoRA configuration\n", @@ -191,15 +268,18 @@ " bnb_4bit_compute_dtype=compute_dtype,\n", " bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,\n", ")" - ], - "metadata": { - "collapsed": false - }, - "id": "32d8aa11a6d47e0d" + ] }, { "cell_type": "code", "execution_count": null, + "id": "8a5216910d0a339a", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Load LoRA configuration\n", @@ -210,15 +290,18 @@ " bias=\"none\",\n", " task_type=\"CAUSAL_LM\",\n", ")" - ], - "metadata": { - "collapsed": false - }, - "id": "8a5216910d0a339a" + ] }, { "cell_type": "code", "execution_count": null, + "id": "bacbbc9ddd19504d", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Load base model\n", @@ -228,15 +311,18 @@ " device_map=device_map\n", ")\n", "model.config.use_cache = False" - ], - "metadata": { - "collapsed": false - }, - "id": "bacbbc9ddd19504d" + ] }, { "cell_type": "code", "execution_count": null, + "id": "a82c50bc69c3632b", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Set training parameters\n", @@ -258,25 +344,18 @@ " group_by_length=group_by_length,\n", " lr_scheduler_type=lr_scheduler_type\n", ")" - ], - "metadata": { - "collapsed": false - }, - "id": "a82c50bc69c3632b" - }, - { - "cell_type": "markdown", - "source": [ - "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). \n" - ], - "metadata": { - "collapsed": false - }, - "id": "56219e83015a7357" + ] }, { "cell_type": "code", "execution_count": null, + "id": "c86b66f59bee28dc", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Load tokenizer\n", @@ -285,24 +364,31 @@ "tokenizer.padding_side = \"right\" # Fix weird overflow issue with fp16 training\n", "if not tokenizer.chat_template:\n", " tokenizer.chat_template = \"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}\"" - ], - "metadata": { - "collapsed": false - }, - "id": "c86b66f59bee28dc" + ] }, { "cell_type": "markdown", - "source": [ - "Regarding the chat template, we will briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response. This ensures that the model precisely understands where each message comes from and maintains a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be even more helpful. 
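As a quick illustration, the snippet below renders a toy conversation with the simple ChatML-style template assigned to the tokenizer above; the message contents are invented for the example, and the exact output depends on which template the tokenizer actually ends up with.

```python
# Illustration only: see what the chat template turns a message list into.
messages = [
    {"role": "system", "content": "Extract product attributes as JSON."},
    {"role": "user", "content": "مانتو اسپرت زنانه، جنس نخی، رنگ مشکی"},
]

rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
# With the template set above, this prints something like:
# <|im_start|>system
# Extract product attributes as JSON.<|im_end|>
# <|im_start|>user
# مانتو اسپرت زنانه، جنس نخی، رنگ مشکی<|im_end|>
```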
For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" - ], + "id": "ea4399c36bcdcbbd", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "ea4399c36bcdcbbd" + "source": [ + "Regarding the chat template, we will briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response. This ensures that the model precisely understands where each message comes from and maintains a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be even more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" + ] }, { "cell_type": "code", + "execution_count": null, + "id": "7d3f935e03db79b8", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "def special_formatting_prompts(example):\n", @@ -321,16 +407,18 @@ " text = tokenizer.apply_chat_template(chat_temp, tokenize=False)\n", " output_texts.append(text)\n", " return output_texts\n" - ], - "metadata": { - "collapsed": false - }, - "id": "7d3f935e03db79b8", - "execution_count": 0 + ] }, { "cell_type": "code", "execution_count": null, + "id": "95dc3db0d6c5ddaf", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "if use_special_template:\n", @@ -342,15 +430,18 @@ " collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)\n", "else:\n", " formatting_func = normal_formatting_prompts" - ], - "metadata": { - "collapsed": false - }, - "id": "95dc3db0d6c5ddaf" + ] }, { "cell_type": "code", "execution_count": null, + "id": "48e09edab86c4212", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "trainer = SFTTrainer(\n", @@ -365,27 +456,33 @@ " args=training_arguments,\n", " packing=packing\n", ")" - ], - "metadata": { - "collapsed": false - }, - "id": "48e09edab86c4212" + ] }, { "cell_type": "markdown", + "id": "38fb6fddbca5567e", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "source": [ "The `SFTTrainer` is then instantiated to handle supervised fine-tuning (SFT) of the model. 
This trainer is specifically designed for SFT and includes additional parameters such as `formatting_func` and `packing` which are not typically found in standard trainers.\n", "`formatting_func`: A custom function to format training examples by combining instruction and response templates.\n", "`packing`: Disables packing multiple samples into one sequence, which is not a standard parameter in the typical Trainer class.\n" - ], - "metadata": { - "collapsed": false - }, - "id": "38fb6fddbca5567e" + ] }, { "cell_type": "code", "execution_count": null, + "id": "a17a3b28010ce90e", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "# Train model\n", @@ -393,25 +490,31 @@ "\n", "# Save fine tuned Lora Adaptor \n", "trainer.model.save_pretrained(new_model)" - ], - "metadata": { - "collapsed": false - }, - "id": "a17a3b28010ce90e" + ] }, { "cell_type": "markdown", - "source": [ - "## Inference Code" - ], + "id": "39abd4f63776cc49", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "39abd4f63776cc49" + "source": [ + "## Inference Code" + ] }, { "cell_type": "code", "execution_count": null, + "id": "70cca01bc96d9ead", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "import torch\n", @@ -426,15 +529,18 @@ "\n", "clear_hardwares()\n", "clear_hardwares()" - ], - "metadata": { - "collapsed": false - }, - "id": "70cca01bc96d9ead" + ] }, { "cell_type": "code", "execution_count": null, + "id": "dd8313238b26e95e", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "def generate(model, prompt: str, kwargs):\n", @@ -447,30 +553,36 @@ " output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)\n", " \n", " return output" - ], - "metadata": { - "collapsed": false - }, - "id": "dd8313238b26e95e" + ] }, { "cell_type": "code", "execution_count": null, + "id": "d3fe5a27fa40ba9", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "base_model = AutoModelForCausalLM.from_pretrained(new_model, return_dict=True, device_map='auto', token='')\n", "tokenizer = AutoTokenizer.from_pretrained(new_model, max_length=max_seq_length)\n", "model = PeftModel.from_pretrained(base_model, new_model)\n", "del base_model" - ], - "metadata": { - "collapsed": false - }, - "id": "d3fe5a27fa40ba9" + ] }, { "cell_type": "code", "execution_count": null, + "id": "70682a07fcaaca3f", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "sample = eval_dataset[0]\n", @@ -479,39 +591,48 @@ "else:\n", " chat_temp = [{\"role\": \"system\", \"content\": sample['instruction']}]\n", " prompt = tokenizer.apply_chat_template(chat_temp, tokenize=False, add_generation_prompt=True)" - ], - "metadata": { - "collapsed": false - }, - "id": "70682a07fcaaca3f" + ] }, { "cell_type": "code", "execution_count": null, + "id": "febeb00f0a6f0b5e", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "gen_kwargs = {\"max_new_tokens\": 1024}\n", "generated_texts = generate(model=model, prompt=prompt, kwargs=gen_kwargs)\n", "print(generated_texts)" - ], - "metadata": { - "collapsed": false - }, - "id": "febeb00f0a6f0b5e" + ] }, { "cell_type": "markdown", - "source": [ - "## Merge to base model" - ], + 
"id": "c18abf489437a546", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "c18abf489437a546" + "source": [ + "## Merge to base model" + ] }, { "cell_type": "code", "execution_count": null, + "id": "4f5f450001bf428f", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "clear_hardwares()\n", @@ -520,14 +641,17 @@ "del model\n", "adapter_model_name = 'your_hf_account/your_desired_name'\n", "merged_model.push_to_hub(adapter_model_name)" - ], - "metadata": { - "collapsed": false - }, - "id": "4f5f450001bf428f" + ] }, { "cell_type": "markdown", + "id": "16775c2ed49bfe11", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "source": [ "Here, we merged the adapter with the base model and push the merged model on the hub. You can just push the adapter in the hub and avoid pushing the heavy base model file in this way:\n", "```\n", @@ -542,36 +666,45 @@ "# Load the Lora model\n", "model = PeftModel.from_pretrained(model, adapter_model_name)\n", "```" - ], - "metadata": { - "collapsed": false - }, - "id": "16775c2ed49bfe11" + ] }, { "cell_type": "markdown", - "source": [ - "## Fast Inference with [Vllm](https://github.com/vllm-project/vllm)\n" - ], + "id": "4851ef41e4cc4f95", "metadata": { - "collapsed": false + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } }, - "id": "4851ef41e4cc4f95" + "source": [ + "## Fast Inference with [Vllm](https://github.com/vllm-project/vllm)\n" + ] }, { "cell_type": "markdown", + "id": "fe82f0a57fe86f60", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "source": [ "The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). \n", "In this example, we are inferring version 1 of our fine-tuned model on this task." - ], - "metadata": { - "collapsed": false - }, - "id": "fe82f0a57fe86f60" + ] }, { "cell_type": "code", "execution_count": null, + "id": "88bee8960b176e87", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, "outputs": [], "source": [ "from vllm import LLM, SamplingParams\n", @@ -588,14 +721,11 @@ "outputs = llm.generate(prompt, sampling_params)\n", "\n", "print(outputs[0].outputs[0].text)" - ], - "metadata": { - "collapsed": false - }, - "id": "88bee8960b176e87" + ] }, { "cell_type": "markdown", + "id": "dc007ced7ca34bbb", "metadata": {}, "source": [ "### Example Output\n", @@ -609,27 +739,39 @@ " \"product_entity\": \"مانتو اسپرت\"\n", "}\n", "```\n" - ], - "id": "dc007ced7ca34bbb" + ] + }, + { + "cell_type": "markdown", + "id": "2bfe00769699bbd2", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). 
\n" + ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.9.19" } }, "nbformat": 4, From 690f8962e8fbd78623eb55cb555a7877cc95c83d Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Sun, 16 Jun 2024 23:08:34 +0330 Subject: [PATCH 14/18] Rename fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb to fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb --- ...llm_to_generate_persian_product_catalogs_in_json_format.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notebooks/en/{fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb => fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb} (100%) diff --git a/notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb similarity index 100% rename from notebooks/en/fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format.ipynb rename to notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb From 438596011e315124fb051e9be12e159693bec7af Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Sun, 16 Jun 2024 23:11:25 +0330 Subject: [PATCH 15/18] Update _toctree.yml --- notebooks/en/_toctree.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index b8e9641e..c065ffb1 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -42,8 +42,8 @@ title: RAG with source highlighting using Structured generation - local: rag_with_unstructured_data title: Building RAG with Custom Unstructured Data - - local: fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format - title: Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format + - local: fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format + title: Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format - local: llm_gateway_pii_detection title: LLM Gateway for PII Detection From 67b05c7b878936f421445380fbff0d8fa25eadef Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Sun, 16 Jun 2024 23:11:51 +0330 Subject: [PATCH 16/18] Update index.md --- notebooks/en/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 5271ea9b..89b5616e 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -12,7 +12,7 @@ Check out the recently added notebooks: - [Create a legal preference dataset](pipeline_notus_instructions_preferences_legal) - [Suggestions for Data Annotation with SetFit in Zero-shot Text Classification](labelling_feedback_setfit) - [Building RAG with Custom Unstructured Data](rag_with_unstructured_data) -- [Fine-tuning LLM for Generate Persian Product Catalogs in JSON Format](fine_tuning_llm_for_generate_persian_product_catalogs_in_json_format) +- [Fine-tuning LLM to Generate Persian 
Product Catalogs in JSON Format](fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). @@ -20,4 +20,4 @@ You can also check out the notebooks in the cookbook's [GitHub repo](https://git The Open-Source AI Cookbook is a community effort, and we welcome contributions from everyone! Check out the cookbook's [Contribution guide](https://github.com/huggingface/cookbook/blob/main/README.md) to learn -how you can add your "recipe". \ No newline at end of file +how you can add your "recipe". From 9b2cabc5e05ab3805c3626754ee4418eb4fa8e5a Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Tue, 18 Jun 2024 00:05:54 +0330 Subject: [PATCH 17/18] Update fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb --- ...sian_product_catalogs_in_json_format.ipynb | 77 ++++++++++++++----- 1 file changed, 59 insertions(+), 18 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb index 6b045d57..7406caae 100644 --- a/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb @@ -186,7 +186,6 @@ "fp16 = False\n", "bf16 = False\n", "per_device_train_batch_size = 4\n", - "per_device_eval_batch_size = 4\n", "gradient_accumulation_steps = 1\n", "gradient_checkpointing = True\n", "learning_rate = 0.00015\n", @@ -221,7 +220,7 @@ } }, "source": [ - "## Train Code" + "## Model Training" ] }, { @@ -247,6 +246,48 @@ "print(f\"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(eval_dataset)}\")" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a5216910d0a339a", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [], + "source": [ + "# Load LoRA configuration\n", + "peft_config = LoraConfig(\n", + " r=lora_r,\n", + " lora_alpha=lora_alpha,\n", + " lora_dropout=lora_dropout,\n", + " bias=\"none\",\n", + " task_type=\"CAUSAL_LM\",\n", + " target_modules=target_modules\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "230bfceb895c6738", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "The LoraConfig object is used to configure the LoRA (Low-Rank Adaptation) settings for the model when using the Peft library. This can help to reduce the number of parameters that need to be fine-tuned, which can lead to faster training and lower memory usage. Here's a breakdown of the parameters:\n", + "- `r`: The rank of the low-rank matrices used in LoRA. This parameter controls the dimensionality of the low-rank adaptation and directly impacts the model's capacity to adapt and the computational cost.\n", + "- `lora_alpha`: This parameter controls the scaling factor for the low-rank adaptation matrices. A higher alpha value can increase the model's capacity to learn new tasks.\n", + "- `lora_dropout`: The dropout rate for LoRA. This can help to prevent overfitting during fine-tuning. In this case, it's set to 0.1.\n", + "- `bias`: Specifies whether to add a bias term to the low-rank matrices. 
In this case, it's set to \"none\", which means that no bias term will be added.\n", + "- `task_type`: Defines the type of task for which the model is being fine-tuned. Here, \"CAUSAL_LM\" indicates that the task is a causal language modeling task, which predicts the next word in a sequence.\n", + "- `target_modules`: Specifies the modules in the model to which LoRA will be applied. In this case, it's set to `[\"q_proj\", \"v_proj\", 'k_proj']`, which are the query, value, and key projection layers in the model's attention mechanism." + ] + }, { "cell_type": "code", "execution_count": null, @@ -271,25 +312,24 @@ ] }, { - "cell_type": "code", - "execution_count": null, - "id": "8a5216910d0a339a", + "cell_type": "markdown", + "id": "535275d96f478839", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [], "source": [ - "# Load LoRA configuration\n", - "peft_config = LoraConfig(\n", - " lora_alpha=lora_alpha,\n", - " lora_dropout=lora_dropout,\n", - " r=lora_r,\n", - " bias=\"none\",\n", - " task_type=\"CAUSAL_LM\",\n", - ")" + "This block configures the settings for using BitsAndBytes (bnb), a library that provides efficient memory management and compression techniques for PyTorch models. Specifically, it defines how the model weights will be loaded and quantized in 4-bit precision, which is useful for reducing memory usage and potentially speeding up inference.\n", + "\n", + "- `load_in_4bit`: A boolean that determines whether to load the model in 4-bit precision.\n", + "- `bnb_4bit_quant_type`: Specifies the type of 4-bit quantization to use. Here, it's set to 4-bit NormalFloat (NF4) quantization type, which is a new data type introduced in QLoRA. This type is information-theoretically optimal for normally distributed weights, providing an efficient way to quantize the model for fine-tuning.\n", + "- `bnb_4bit_compute_dtype`: Sets the data type used for computations involving the quantized model. In QLoRA, it's set to \"float16\", which is commonly used for mixed-precision training to balance performance and precision.\n", + "- `bnb_4bit_use_double_quant`: This boolean parameter indicates whether to use double quantization. Setting it to False means that only single quantization will be used, which is typically faster but might be slightly less accurate.\n", + "\n", + "Why we have two data type (quant_type and compute_type)? \n", + "QLoRA employs two distinct data types: one for storing base model weights (in here 4-bit NormalFloat) and another for computational operations (16-bit). During the forward and backward passes, QLoRA dequantizes the weights from the storage format to the computational format. However, it only calculates gradients for the LoRA parameters, which utilize 16-bit bfloat. 
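Once the base model has been loaded with this quantization config (a couple of cells below), one rough way to see this split for yourself is to wrap the model with the LoRA config and list parameter dtypes: the frozen base weights show up in a compact quantized storage type, while only the adapter matrices are trainable in a floating-point dtype. This is an inspection sketch only, not part of the training flow, since `SFTTrainer` applies `peft_config` to the model itself; if you try it, reload the base model afterwards so training starts from a clean copy.

```python
# Inspection sketch: run after the "Load base model" cell. Not needed for training,
# because SFTTrainer wraps the model with peft_config on its own.
from collections import Counter
from peft import get_peft_model

inspect_model = get_peft_model(model, peft_config)
inspect_model.print_trainable_parameters()  # only the LoRA matrices are trainable

dtype_counts = Counter(
    (p.dtype, p.requires_grad) for _, p in inspect_model.named_parameters()
)
for (dtype, trainable), n in dtype_counts.items():
    print(f"dtype={dtype}, trainable={trainable}, tensors={n}")
```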
This approach ensures that weights are decompressed only when necessary, maintaining low memory usage throughout both training and inference phases.\n" ] }, { @@ -502,7 +542,7 @@ } }, "source": [ - "## Inference Code" + "## Inference" ] }, { @@ -527,6 +567,7 @@ " torch.cuda.empty_cache()\n", " gc.collect()\n", "\n", + "\n", "clear_hardwares()\n", "clear_hardwares()" ] @@ -545,13 +586,13 @@ "source": [ "def generate(model, prompt: str, kwargs):\n", " tokenized_prompt = tokenizer(prompt, return_tensors='pt').to(model.device)\n", - " \n", + "\n", " prompt_length = len(tokenized_prompt.get('input_ids')[0])\n", - " \n", + "\n", " with torch.cuda.amp.autocast():\n", " output_tokens = model.generate(**tokenized_prompt, **kwargs) if kwargs else model.generate(**tokenized_prompt)\n", " output = tokenizer.decode(output_tokens[0][prompt_length:], skip_special_tokens=True)\n", - " \n", + "\n", " return output" ] }, From b1a8b407babccd12ca240b7c95d77284d6fb29b8 Mon Sep 17 00:00:00 2001 From: Mohammadreza Esmaeiliyan <55921249+MrzEsma@users.noreply.github.com> Date: Wed, 19 Jun 2024 12:52:32 +0330 Subject: [PATCH 18/18] Update fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb edit our HF link. --- ...sian_product_catalogs_in_json_format.ipynb | 197 ++++-------------- 1 file changed, 40 insertions(+), 157 deletions(-) diff --git a/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb b/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb index 7406caae..a4caf348 100644 --- a/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb +++ b/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb @@ -4,10 +4,7 @@ "cell_type": "markdown", "id": "a59bf2a9e5015030", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "# Fine-tuning LLM to Generate Persian Product Catalogs in JSON Format\n", @@ -19,15 +16,12 @@ "cell_type": "markdown", "id": "755fc90c27f1cb99", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "In this notebook, we have attempted to fine-tune a large language model with no added complexity. The model has been optimized for use on a customer-level GPU to generate Persian product catalogs and produce structured output in JSON format. It is particularly effective for creating structured outputs from the unstructured titles and descriptions of products on Iranian platforms with user-generated content, such as [Basalam](https://basalam.com), [Divar](https://divar.ir/), [Digikala](https://www.digikala.com/), and others. \n", "\n", - "You can see a fine-tuned LLM with this code on [our HF account](BaSalam/Llama2-7b-entity-attr-v1). Additionally, one of the fastest open-source inference engines, [Vllm](https://github.com/vllm/vllm), is employed for inference. \n", + "You can see a fine-tuned LLM with this code on [our HF account](https://huggingface.co/BaSalam/Llama2-7b-entity-attr-v1). Additionally, one of the fastest open-source inference engines, [Vllm](https://github.com/vllm-project/vllm), is employed for inference. \n", "\n", "Let's get started!" 
] @@ -37,10 +31,7 @@ "execution_count": null, "id": "3a35eafbe37e4ad2", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -60,10 +51,7 @@ "cell_type": "markdown", "id": "30caf9936156e430", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "The `peft` library, or parameter efficient fine tuning, has been created to fine-tune LLMs more efficiently. If we were to open and fine-tune the upper layers of the network traditionally like all neural networks, it would require a lot of processing and also a significant amount of VRAM. With the methods developed in recent papers, this library has been implemented for efficient fine-tuning of LLMs. Read more about peft here: [Hugging Face PEFT](https://huggingface.co/blog/peft)." @@ -73,10 +61,7 @@ "cell_type": "markdown", "id": "261a8f52fe09202e", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "## Set hyperparameters" @@ -87,10 +72,7 @@ "execution_count": null, "id": "96fccf9f7364bac6", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -105,10 +87,7 @@ "execution_count": null, "id": "7f69a97083bf19d9", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -123,10 +102,7 @@ "cell_type": "markdown", "id": "382296d37668763c", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "LoRA (Low-Rank Adaptation) stores changes in weights by constructing and adding a low-rank matrix to each model layer. This method opens only these layers for fine-tuning, without changing the original model weights or requiring lengthy training. The resulting weights are lightweight and can be produced multiple times, allowing for the fine-tuning of multiple tasks with an LLM loaded into RAM. \n", @@ -140,10 +116,7 @@ "execution_count": null, "id": "501beb388b6749ea", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -158,10 +131,7 @@ "cell_type": "markdown", "id": "39149616eb21ec5b", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "QLoRA (Quantized Low-Rank Adaptation) is an efficient fine-tuning approach that enables large language models to run on smaller GPUs by using 4-bit quantization. This method preserves the full performance of 16-bit fine-tuning while reducing memory usage, making it possible to fine-tune models with up to 65 billion parameters on a single 48GB GPU. QLoRA combines 4-bit NormalFloat data types, double quantization, and paged optimizers to manage memory efficiently. 
It allows fine-tuning of models with low-rank adapters, significantly enhancing accessibility for AI model development.\n", @@ -174,10 +144,7 @@ "execution_count": null, "id": "83f51e63e67aa87b", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -214,10 +181,7 @@ "cell_type": "markdown", "id": "234ef91c9c1c0789", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "## Model Training" @@ -228,10 +192,7 @@ "execution_count": null, "id": "8cc58fe0c4b229e0", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -251,10 +212,7 @@ "execution_count": null, "id": "8a5216910d0a339a", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -273,10 +231,7 @@ "cell_type": "markdown", "id": "230bfceb895c6738", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "The LoraConfig object is used to configure the LoRA (Low-Rank Adaptation) settings for the model when using the Peft library. This can help to reduce the number of parameters that need to be fine-tuned, which can lead to faster training and lower memory usage. Here's a breakdown of the parameters:\n", @@ -293,10 +248,7 @@ "execution_count": null, "id": "32d8aa11a6d47e0d", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -315,10 +267,7 @@ "cell_type": "markdown", "id": "535275d96f478839", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "This block configures the settings for using BitsAndBytes (bnb), a library that provides efficient memory management and compression techniques for PyTorch models. Specifically, it defines how the model weights will be loaded and quantized in 4-bit precision, which is useful for reducing memory usage and potentially speeding up inference.\n", @@ -337,10 +286,7 @@ "execution_count": null, "id": "bacbbc9ddd19504d", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -358,10 +304,7 @@ "execution_count": null, "id": "a82c50bc69c3632b", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -391,10 +334,7 @@ "execution_count": null, "id": "c86b66f59bee28dc", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -410,10 +350,7 @@ "cell_type": "markdown", "id": "ea4399c36bcdcbbd", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "Regarding the chat template, we will briefly explain that to understand the structure of the conversation between the user and the model during model training, a series of reserved phrases are created to separate the user's message and the model's response. This ensures that the model precisely understands where each message comes from and maintains a sense of the conversational structure. Typically, adhering to a chat template helps increase accuracy in the intended task. 
However, when there is a distribution shift between the fine-tuning dataset and the model, using a specific chat template can be even more helpful. For further reading, visit [Hugging Face Blog on Chat Templates](https://huggingface.co/blog/chat-templates).\n" @@ -424,10 +361,7 @@ "execution_count": null, "id": "7d3f935e03db79b8", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -454,10 +388,7 @@ "execution_count": null, "id": "95dc3db0d6c5ddaf", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -477,10 +408,7 @@ "execution_count": null, "id": "48e09edab86c4212", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -502,10 +430,7 @@ "cell_type": "markdown", "id": "38fb6fddbca5567e", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "The `SFTTrainer` is then instantiated to handle supervised fine-tuning (SFT) of the model. This trainer is specifically designed for SFT and includes additional parameters such as `formatting_func` and `packing` which are not typically found in standard trainers.\n", @@ -518,10 +443,7 @@ "execution_count": null, "id": "a17a3b28010ce90e", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -536,10 +458,7 @@ "cell_type": "markdown", "id": "39abd4f63776cc49", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "## Inference" @@ -550,10 +469,7 @@ "execution_count": null, "id": "70cca01bc96d9ead", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -577,10 +493,7 @@ "execution_count": null, "id": "dd8313238b26e95e", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -601,10 +514,7 @@ "execution_count": null, "id": "d3fe5a27fa40ba9", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -619,10 +529,7 @@ "execution_count": null, "id": "70682a07fcaaca3f", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -639,10 +546,7 @@ "execution_count": null, "id": "febeb00f0a6f0b5e", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -655,10 +559,7 @@ "cell_type": "markdown", "id": "c18abf489437a546", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "## Merge to base model" @@ -669,10 +570,7 @@ "execution_count": null, "id": "4f5f450001bf428f", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -688,10 +586,7 @@ "cell_type": "markdown", "id": "16775c2ed49bfe11", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "Here, we merged the adapter with the base model and push the merged model on the hub. 
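If you do push the merged weights, it is usually worth uploading the tokenizer to the same repo as well, so the checkpoint on the Hub is self-contained; a small sketch using the `adapter_model_name` repo id from the cell above:

```python
# Optional: upload the tokenizer next to the merged weights so the Hub repo
# can be loaded directly with AutoModelForCausalLM / AutoTokenizer.
tokenizer.push_to_hub(adapter_model_name)
```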
You can just push the adapter in the hub and avoid pushing the heavy base model file in this way:\n", @@ -713,10 +608,7 @@ "cell_type": "markdown", "id": "4851ef41e4cc4f95", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "## Fast Inference with [Vllm](https://github.com/vllm-project/vllm)\n" @@ -726,10 +618,7 @@ "cell_type": "markdown", "id": "fe82f0a57fe86f60", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "The `vllm` library is one of the fastest inference engines for LLMs. For a comparative overview of available options, you can use this blog: [7 Frameworks for Serving LLMs](https://medium.com/@gsuresh957/7-frameworks-for-serving-llms-5044b533ee88). \n", @@ -741,10 +630,7 @@ "execution_count": null, "id": "88bee8960b176e87", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "outputs": [], "source": [ @@ -786,10 +672,7 @@ "cell_type": "markdown", "id": "2bfe00769699bbd2", "metadata": { - "collapsed": false, - "jupyter": { - "outputs_hidden": false - } + "collapsed": false }, "source": [ "In this blog, you can read about the best practices for fine-tuning LLMs [Sebastian Raschka's Magazine](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?r=1h0eu9&utm_campaign=post&utm_medium=web). \n"