TAMIL-LLAMA: A NEW TAMIL LANGUAGE MODEL BASED ON LLAMA 2

Abhinand Balachandran
abhinandb.ml@gmail.com

ABSTRACT

Language modeling has witnessed remarkable advancements in recent years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like text generation. However, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this lacuna by enhancing the open-source LLaMA model with an addition of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil language. We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of LLMs in Indian languages. We further underscore our commitment to open research by making our models, datasets, and code[1] publicly accessible, fostering further innovations in language modeling.
+ "1 Introduction\n",
+ "The past few years have been transformative for language modeling, with groundbreaking advances and monumental\n",
+ "achievements. At the forefront of this revolution was OpenAI’s ChatGPT (OpenAI, 2022), which not only raised the\n",
+ "bar in language modeling performance but also underscored the immense societal implications of such technologies.\n",
+ "Alongside ChatGPT, various Large Language Models (LLMs) have consistently demonstrated exceptional prowess in\n",
+ "natural language understanding and generation, heralding a new era in computational linguistics.\n",
+ "Central to the functionality of these modern LLMs is the Transformer architecture, a cornerstone concept brought to\n",
+ "the limelight by \"Attention is All You Need\"(Vaswani et al., 2017). This innovation transformed our approach to\n",
+ "sequence-based tasks, catalyzing pivotal models like BERT (Devlin et al., 2019) and redefining best practices in \n",
+ "Natural\n",
+ "Language Processing (NLP).\n",
+ "Subsequent developments, particularly the Generative Pre-trained Transformer (GPT)(Radford et al., 2018), \n",
+ "showcased\n",
+ "the profound potential of unsupervised pre-training on vast datasets. Models like GPT-3 and its successor, GPT-4\n",
+ "(OpenAI, 2023), have redefined benchmarks and fueled a renaissance in natural language understanding and \n",
+ "generation.\n",
+ "Beyond their technical prowess, they have prompted a renewed vigor in exploring the limits of Artificial General\n",
+ "Intelligence (AGI). These advancements, paired with exemplary performance in numerous applications, have galvanized\n",
+ "the NLP community, sparking widespread application and research from sentiment analysis to machine translation.\n",
+ "However, progress is not without its pitfalls. The elite LLMs, despite their remarkable capabilities, grapple with\n",
+ "challenges—primarily, their proprietary nature, which constricts open research. Furthermore, an English-centric\n",
+ "bias and the enormous computational requirements for training such behemoths further accentuate the call for more\n",
+ "accessible and diverse solutions.\n",
+ "In response, the open-source community has championed the creation of models like LLaMA (Touvron et al., 2023a)\n",
+ "and Mistral (Jiang et al., 2023). Such models, despite their compact nature, challenge the hegemony of giants like\n",
+ "ChatGPT in select benchmarks, heralding a promising direction for future research.\n",
+ "1GitHub Repository: https://github.com/abhinand5/tamil-llamaarXiv:2311.05845v110 Nov 2023However, as robust as \n",
+ "these models, like LLaMA and Mistral, might be, their proficiency in generating coherent text in\n",
+ "Tamil and several other Indian languages remains noticeably deficient. A fundamental limitation lies in their \n",
+ "minimal\n",
+ "vocabulary of Tamil characters, which is essential for effective text encoding and generation. This paper aims to \n",
+ "bridge\n",
+ "this gap by augmenting the existing LLaMA models’ vocabulary with an additional 16,000 Tamil tokens, markedly\n",
+ "enhancing their capability in processing and producing Tamil content. This method draws inspiration from a parallel\n",
+ "endeavor in the Chinese adaptation of LLaMA, as documented in Cui et al. (2023). To ensure efficient pre-training\n",
+ "and fine-tuning while maintaining computational feasibility, we leverage the LoRA (Hu et al., 2021) methodology. We\n",
+ "aspire that this initiative catalyzes further research endeavors, refining LLaMA and other open-source models \n",
+ "tailored\n",
+ "for Indian languages. A succinct overview of the principal contributions of this paper is as follows:\n",
+ "•We bolster the LLaMA model’s encoding and decoding proficiencies for Tamil by incorporating an additional\n",
+ "16,000 Tamil tokens, thereby expanding its vocabulary.\n",
+ "•Through the LoRA methodology, the augmented model undergoes training on an extensive Tamil corpus,\n",
+ "resulting in a marked enhancement of its text generation capabilities relative to its predecessor models.\n",
+ "•We present a Tamil-translated version of the original Alpaca dataset (Taori et al., 2023), paired with a subset of\n",
+ "the OpenOrca (Lian et al., 2023) dataset, both curated for instruction fine-tuning in Tamil.\n",
+ "•Our newly trained instruction and chat models, built upon the Alpaca and OpenOrca datasets, demonstrate\n",
+ "notable advancements in performance for the Tamil language compared to other open-source language models.\n",
+ "•To stimulate continuous innovation and broader adaptability, we grant public access to the models, datasets,\n",
+ "and associated code, inviting further exploration and encouraging the refinement of LLaMA models for diverse\n",
+ "languages.\n",
+ "2 Related Work\n",
+ "Within the broad field of Natural Language Processing (NLP), the advent of Large Language Models (LLMs) marks a\n",
+ "transformative moment. These models have heralded new capabilities in understanding, generating, and processing\n",
+ "various human languages, underpinning innovations from automated content creation to nuanced sentiment analysis.\n",
+ "While their proficiency in mainstream languages like English is widely recognized and leveraged, a disparity exists\n",
+ "in\n",
+ "their performance and availability for numerous non-European languages.\n",
+ "Tamil, a language with ancient roots and spoken by a substantial global population, epitomizes this disparity. \n",
+ "Despite\n",
+ "its linguistic depth and cultural significance, dedicated pre-trained LLMs for Tamil are conspicuously \n",
+ "underrepresented.\n",
+ "Most current offerings are generic, multipurpose LLMs, which do not cater specifically to the unique attributes of \n",
+ "the\n",
+ "Tamil language.\n",
+ "A survey of the existing literature reveals that many attempts to cater to the Tamil language through LLMs rely \n",
+ "heavily\n",
+ "on multilingual models. Works such as Scao et al. (2022), Shliazhko et al. (2022), and Lin et al. (2022) have all \n",
+ "ventured\n",
+ "into this domain. However, it is crucial to note that, except \"GPT-2 Tamil\" by Mahendiran (2021), all these models\n",
+ "are not exclusive to Tamil. While they can process Tamil to a certain extent, their capabilities are inherently \n",
+ "limited.\n",
+ "This limitation arises because the training data for these models often comprise a low fraction of Tamil content \n",
+ "relative\n",
+ "to other languages. Consequently, the nuances and intricacies specific to Tamil are often lost, leading to \n",
+ "suboptimal\n",
+ "performance.\n",
+ "The effort by Mahendiran (2021) represents a notable deviation from this trend. Here, the GPT-2 base model, \n",
+ "equipped\n",
+ "with 117 million parameters as outlined in Radford et al. (2019), was fine-tuned with a focus on Tamil, using both \n",
+ "the\n",
+ "Oscar dataset (Caswell et al., 2020) and The IndicNLP (Kunchukuttan, 2020) dataset. This approach signifies a \n",
+ "targeted\n",
+ "attempt to adapt LLM capabilities for the Tamil language specifically.\n",
+ "However, the broader landscape of Tamil-specific LLM research remains relatively uncharted. This context \n",
+ "underscores\n",
+ "the motivation for our present research. We endeavor to delve deeper into this space, addressing existing \n",
+ "shortcomings\n",
+ "and advancing the capabilities of LLMs tailored for Tamil.\n",
+ "3 Tamil LLaMA\n",
+ "3.1 Datasets Used\n",
+ "The development of Tamil-LLaMA involved using several different datasets, each chosen for specific parts of the\n",
+ "training and fine-tuning process. This approach was vital to ensure the model’s effectiveness across various tasks.\n",
+ "23.1.1 Datasets used for Pre-Training\n",
+ "For the initial pre-training phase of LLaMA 2(Touvron et al., 2023a), we mainly used the CulturaX dataset (Nguyen\n",
+ "et al., 2023). This dataset is a combination of many popular datasets, including the Oscar dataset (Caswell et al.,\n",
+ "2020).\n",
+ "Out of the 4.72 million documents in CulturaX, we selected 600k documents (12 GB) for training. This choice was\n",
+ "made to manage training costs while aiming for high performance. Our approach was successful, as the model showed\n",
+ "strong results in text completion tasks even with this smaller dataset.\n",
+ "3.1.2 Datasets used for Instruction Tuning\n",
+ "The \"Instruction Tuning\" phase was a pivotal stage in refining LLaMA’s proficiency in precisely adhering to textual\n",
+ "instructions. For this enhancement, we incorporated a translated version of the Stanford Alpaca dataset (Taori et \n",
+ "al.,\n",
+ "2023), comprising 52,000 instructions. Concurrently, we integrated a specialized no-code section from the OpenOrca\n",
+ "dataset (Lian et al., 2023), which consists of around 93,000 instructions. The deliberate focus on no-code \n",
+ "instructions\n",
+ "was to streamline the training process, eliminating the intricacies presented by coding instructions during \n",
+ "translation.\n",
+ "To ensure translation uniformity and accuracy across the datasets, the Google Translation API service was our tool \n",
+ "of\n",
+ "choice. We meticulously translated the entirety of the Alpaca dataset while also applying a similar methodology to \n",
+ "the\n",
+ "OpenOrca subset.\n",
+ "We believe that leveraging diverse datasets has bolstered LLaMA’s enhanced capability to discern and generate\n",
+ "contextually pertinent responses across a spectrum of prompts.\n",
+ "3.2 Background on the LLaMA Models\n",
+ "Introduced by Touvron et al. (2023a), LLaMA has emerged as an essential milestone in the world of open-source\n",
+ "large language models (LLMs), with the renowned Transformer architecture (Vaswani et al., 2017) as its foundation.\n",
+ "While it draws inspiration from models like GPT for its basic structure—comprising an embedding layer and multiple\n",
+ "transformer blocks—LLaMA has its unique features. LLaMA has brought forward several innovative techniques such\n",
+ "as pre-normalization (Zhang and Sennrich, 2019), SwiGLU activation (Shazeer, 2020), and rotary embeddings (Su\n",
+ "et al., 2022). Offered in sizes ranging from 7B (7 Billion) to 65B (65 Billion) parameters, LLaMA has been trained\n",
+ "on a rich mixture of content sources, including web pages, books, and academic papers. Its strong performance on\n",
+ "benchmarks, especially given its relatively compact size compared to other models, has made it a noteworthy \n",
+ "contender\n",
+ "in the LLM landscape, drawing considerable attention in the AI research community.\n",
+ "Building upon its predecessor’s foundation, LLaMA 2(Touvron et al., 2023b) introduces monumental enhancements to\n",
+ "the LLaMA lineage. With a dataset expanded by 40% relative to LLaMA 1, the models under LLaMA 2 exhibit an\n",
+ "enriched comprehension of diverse content, leading to improved text generation. An extended context length of 4,096\n",
+ "tokens empowers LLaMA 2 to process and understand more extensive textual segments, significantly benefiting tasks\n",
+ "such as translation and intricate question answering. Another pivotal innovation in LLaMA 2 is adopting the \n",
+ "grouped-\n",
+ "query attention mechanism (Ainslie et al., 2023), facilitating faster inference despite its expanded size compared \n",
+ "to\n",
+ "LLaMA 1.\n",
+ "In the course of our research, we made a conscious choice to employ LLaMA 2 as our primary language model. Several\n",
+ "factors influenced this decision. Firstly, LLaMA 2 is a recent addition to the lineage of Large Language Models, \n",
+ "which\n",
+ "implies that it benefits from the latest advancements in model training and architectural innovations. This recent \n",
+ "launch\n",
+ "incorporates the most up-to-date techniques and methodologies. Secondly, compared with its predecessor, LLaMA\n",
+ "1, the enhancements in LLaMA 2 are undeniably compelling. These improvements are not just incremental; they\n",
+ "represent substantial strides in areas such as data exposure, context length, and attention mechanisms. The \n",
+ "evolution\n",
+ "from LLaMA 1 to LLaMA 2 is emblematic of the rapid advancements in the field, and by leveraging the latter, we\n",
+ "aimed to ensure our research was grounded in the most cutting-edge tools available.\n",
+ "3.3 Expansion of Tamil Vocabulary\n",
+ "LLaMA 2, as outlined in the seminal work of Touvron et al. (2023b), is backed by an expansive pre-training corpus \n",
+ "of 2\n",
+ "Trillion tokens. A detailed linguistic analysis of this vast corpus reveals a striking imbalance in language \n",
+ "representation.\n",
+ "An overwhelming 89.7% of the tokens are sourced from English, with other European languages collectively \n",
+ "contributing\n",
+ "to nearly 10% of the dataset. In stark contrast, diverse languages such as Tamil and Hindi represent a meager \n",
+ "presence,\n",
+ "with their combined token count along with other under-represented languages accounting for less than 0.21%.\n",
+ "This skewed distribution raises concerns about the genuine multilingual and cross-lingual capabilities of LLaMA 2.\n",
+ "While it is evident that the model is proficient in several European languages, its ability to comprehend and \n",
+ "generate\n",
+ "3content in languages like Tamil needs to be improved substantially. Our preliminary experiments further \n",
+ "underscored\n",
+ "this limitation. When presented with tasks in Tamil, LLaMA 2 exhibited a remarkable lack of coherence in its \n",
+ "responses.\n",
+ "In fact, its performance was notably inferior to smaller models, underscoring a noticeable shortcoming in LLaMA 2’s\n",
+ "coverage of worldwide languages. There is a clear need for the open-source community to focus on languages like\n",
+ "Tamil, spoken by millions globally across multiple countries.\n",
+ "To bolster the text generation and understanding abilities of LLaMA 2 in Tamil, we advocate extending its \n",
+ "pre-training\n",
+ "phase with an expansive Tamil corpus, as recommended by Cui et al. (2023). However, this alone is not sufficient. A\n",
+ "limitation arises from LLaMA’s existing vocabulary, which has a tiny number of Tamil characters. Although LLaMA\n",
+ "can bypass this by encoding unknown tokens, this process considerably lengthens the sequences, leading to \n",
+ "substantial\n",
+ "delays during encoding and decoding. Typically, a single Tamil character is translated into 3-4 byte tokens. \n",
+ "Moreover,\n",
+ "these byte tokens are not uniquely purposed for Tamil characters but represent UTF-8 tokens from various languages.\n",
+ "This dual role complicates the task for transformer encoders and byte-tokens to understand and capture the nuanced\n",
+ "semantics of Tamil characters proficiently.\n",
+ "To overcome these problems and to enhance the text generation capabilities in Tamil, we propose the incorporation \n",
+ "of\n",
+ "an additional 16,000 Tamil tokens to the pre-existing vocabulary of the LLAMA 2 model. This methodology echoes the\n",
+ "strategies employed in developing Chinese LLaMA (Cui et al., 2023). The subsequent steps explain the process of\n",
+ "vocabulary extension:\n",
+ "1.Employ SentencePiece (Kudo and Richardson, 2018) to train a Tamil Tokenizer on an extensive corpus\n",
+ "of contemporary Tamil text, capturing the essence of modern linguistic nuances necessary for coherent\n",
+ "communication.\n",
+ "2.Integrate the original tokenizer of the LLaMA 2 model with the vocabulary derived from the newly trained\n",
+ "SentencePiece tokenizer. This amalgamation culminates in an augmented tokenizer encompassing an additional\n",
+ "16,000 Tamil tokens, leading to an aggregated vocabulary size of 48,000(32,000 original + 16,000 new).\n",
+ "3.Drawing parallels from Cui et al. (2023), the LLaMA model is then tailored to accommodate the Tamil LLaMA\n",
+ "tokenizer. This modification necessitates resizing the word embeddings and the language model head from\n",
+ "a matrix shape V ×H to V’ ×H. Herein, V represents the original vocabulary size of 32,000, whereas V’\n",
+ "signifies the extended size of 48,000. Importantly, this adjustment ensures the preservation of the embeddings\n",
+ "associated with the original vocabulary by appending the new rows to the concluding segments of the initial\n",
+ "embedding matrices.\n",
+ "In Figure 1, we can see that the Tamil LLaMA tokenizer needs only 20% to 25% of the tokens that the original LLaMA\n",
+ "model uses to encode Tamil text. This makes the Tamil LLaMA much more efficient. With this crucial update, the\n",
+ "model can handle over three times more information and works three times faster. In conclusion, our modifications \n",
+ "to\n",
+ "LLaMA 2 significantly bolster its capabilities in understanding and generating Tamil content. By adding 16,000 \n",
+ "Tamil\n",
+ "tokens, we ensure a more efficient and nuanced representation. The new Tamil LLaMA tokenizer drastically reduces\n",
+ "the required tokens, making encoding more efficient.\n",
+ "Figure 1: Tokenizer comparisons between original LLaMA and Tamil LLaMA.\n",
+ "43.4 Pre-Training Phase\n",
+ "In order to harness the full potential of the expanded vocabulary of Tamil LLaMA, a robust pre-training phase is\n",
+ "implemented using a comprehensive Tamil text corpus. The datasets utilized during this training phase are detailed \n",
+ "in\n",
+ "3.1.1.\n",
+ "Causal Language Modelling Approach The central mechanism for this pre-training is Causal Language Modelling\n",
+ "(CLM). This method specializes in predicting a given token xtrelying entirely on its preceding tokens. Formally, \n",
+ "the\n",
+ "objective during this training phase is to maximize the likelihood of the entire sequence, as represented by:\n",
+ "P(x1, x2, . . . , x T) =TY\n",
+ "t=1P(xt|x1, x2, . . . , x t−1)(1)\n",
+ "Breaking down the elements of this equation:\n",
+ "•x1, x2, . . . , x T: The individual tokens that constitute the sequence.\n",
+ "•P(xt|x1, x2, . . . , x t−1): Represents the conditional probability of the token xt, which depends on the preced-\n",
+ "ing tokens in the sequence.\n",
+ "Significance of the CLM in Language Adaptation The CLM stage is integral to enhancing LLaMA’s capability in\n",
+ "Tamil and other languages. It facilitates the model in learning the intricate syntactic patterns, semantic \n",
+ "subtleties, and\n",
+ "unique linguistic features of Tamil. Due to its autoregressive characteristics, the CLM mimics the human approach \n",
+ "to\n",
+ "comprehending and generating language, which is primarily shaped by the previous context. Hence, at the end of this\n",
+ "initial training period, LLaMA becomes capable of interpreting and creating Tamil text that is pertinent to the \n",
+ "given\n",
+ "context. This sets a strong foundation for further fine-tuning and specific task-based training sessions.\n",
+ "3.5 Fine-Tuning Phase\n",
+ "Following the foundational pre-training phase, the fine-tuning phase emerges as a crucial step, especially for \n",
+ "modern\n",
+ "Large Language Models (LLMs) deployed in real-world scenarios. A broad understanding of language structure and\n",
+ "semantics, while essential, does not suffice for such applications. This gap is addressed by instruction \n",
+ "fine-tuning, a\n",
+ "tailored process enabling LLMs to interpret and execute task-oriented instructions conveyed in natural language. \n",
+ "Rather\n",
+ "than the traditional approach of adapting to specific datasets, instruction fine-tuning focuses on a wide array of \n",
+ "tasks\n",
+ "articulated through language, ensuring the LLM’s adaptability without task-specific alterations. The datasets \n",
+ "employed\n",
+ "in this phase are elaborated in Section 3.1.2.\n",
+ "Instruction fine-tuning’s transformative essence lies in its ability to enhance an LLM’s dynamism and \n",
+ "responsiveness.\n",
+ "While pre-training equips the model with general linguistic proficiency, instruction fine-tuning refines it to \n",
+ "interact\n",
+ "seamlessly with users through natural language, bridging the gap between overarching language mastery and nuanced,\n",
+ "task-specific agility.\n",
+ "The instruction format employed closely resembles the one described in the original Alpaca dataset (Taori et al., \n",
+ "2023).\n",
+ "Both prompt templates suggested by Alpaca have been utilized: one that includes an input field within the \n",
+ "instruction\n",
+ "and another that does not. The prompt templates used during training are given in Figure 2.\n",
+ "It is essential to clarify that in both templates, the first line signifies the system prompts. For the Alpaca \n",
+ "dataset (Taori\n",
+ "et al., 2023), we utilize the two system prompts as mentioned in Figure 2. However, for the OpenOrca subset (Lian\n",
+ "et al., 2023), a distinct approach is taken: given that this subset already includes a dedicated field for the \n",
+ "system prompt\n",
+ "within its dataset, we utilize that specific prompt.\n",
+ "3.6 Experimental Setup and Training Details\n",
+ "3.6.1 LoRA Approach for Pre-Training and Fine-Tuning\n",
+ "LoRA (Low-Rank Adapters) is a technique that offers an efficient pathway to fine-tuning large language models, as\n",
+ "introduced by Hu et al. (2021). This approach is especially beneficial for its computational efficiency, enabling \n",
+ "the\n",
+ "fine-tuning of language models without the need for extensive GPU resources. We employed the LoRA method to\n",
+ "moderate training expenses while also accelerating the training timeline. Training the complete set of parameters\n",
+ "for models like LLaMA can be exceedingly expensive and resource-intensive, which is often beyond the budget of\n",
+ "individual research teams or small organizations.\n",
+ "5Figure 2: Prompt Template for Instruction Tasks\n",
+ "1. Prompt T emplate Without Input\n",
+ "ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என ் று கூறும ் அறB-\n",
+ "வுைரகீேழஉள ் ளது. ேவண ் டுேகாைளப ் ெபாருத ் தமாகநிைறவுெசய ் -\n",
+ "கின ் ற பதில ் ஒன ் ைற எழுதுக.\n",
+ "### Instruction:\n",
+ "{instruction}\n",
+ "### Response:\n",
+ "{output}\n",
+ "2. Prompt T emplate With Input\n",
+ "ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என ் று கூறும ் அறB-\n",
+ "வுைர கீேழ உள ் ளது. ேமலும ் விரிவான பின ் னணிைய வழங ் கும ் ஓர ்\n",
+ "உள ் ளீடும ் ெகாடுக ் கப ் பட ் டுள ் ளது. ேவண ் டுேகாைளப ் ெபாருத ் தமாக\n",
+ "நிைறவு ெசய ் கின ் ற பதில ் ஒன ் ைற எழுதுக.\n",
+ "### Instruction:\n",
+ "{instruction}\n",
+ "### Input:\n",
+ "{input}\n",
+ "### Response:\n",
+ "{output}\n",
+ "3.6.2 Experimental Setups for Pre-Training\n",
+ "The foundational models of Tamil LLaMA are initiated with the original LLaMA weights and undergo pre-training\n",
+ "using the fp16precision setting for both the 7B2and 13B3parameter versions. We utilize 12GB of Tamil text sourced\n",
+ "from Nguyen et al. (2023) during this pre-training phase. Further insights on the dataset can be found in section \n",
+ "3.1.1.\n",
+ "Our pre-training strategy incorporates the LoRA method Hu et al. (2021), where we integrate LoRA adapters into the\n",
+ "attention vectors and subsequently train the embeddings, LM heads, and the newly incorporated LoRA parameters. A\n",
+ "noteworthy deviation from the methodology of the Chinese LLaMA (Cui et al., 2023) in our approach is the \n",
+ "elimination\n",
+ "of the initial exclusive training of embeddings. Instead of following it with a two-stage LoRA training of \n",
+ "attention\n",
+ "blocks, embeddings, and LM heads, we’ve opted for a streamlined approach to curb costs.\n",
+ "For the training infrastructure, we harnessed an Nvidia A100 GPU with 80GB of VRAM. The models were trained for\n",
+ "1 epoch on the entire dataset, and the training time spanned 48 hours for 7B model and 60 hours for the 13B model \n",
+ "on\n",
+ "Microsoft Azure’s Standard NC24adsA 100v4instance.\n",
+ "The detailed hyperparameters used for training are listed in Table 1.\n",
+ "3.6.3 Experimental Setups for Instruction Fine-Tuning\n",
+ "The 7B4and 13B5models, once pre-trained, undergo fine-tuning in alignment with the procedures outlined in Section\n",
+ "3.5. The datasets employed for this phase are elaborated upon in Section 3.1.2. We persist with the LoRA \n",
+ "methodology\n",
+ "for fine-tuning, executing it under the fp16precision setting for both models. Our datasets comprise translated \n",
+ "variants\n",
+ "of Alpaca (Taori et al., 2023) and a select subset from OpenOrca (Lian et al., 2023).\n",
+ "2Tamil LLaMA 7B Pretrained: https://huggingface.co/abhinand/tamil-llama-7b-base-v0.1\n",
+ "3Tamil LLaMA 13B Pretrained: https://huggingface.co/abhinand/tamil-llama-13b-base-v0.1\n",
+ "4Tamil LLaMA 7B Instruct: https://huggingface.co/abhinand/tamil-llama-7b-instruct-v0.1\n",
+ "5Tamil LLaMA 13B Instruct: https://huggingface.co/abhinand/tamil-llama-13b-instruct-v0.1\n",
+ "6Table 1: Pre-Training Hyperparameters\n",
+ "Configurations 7B 13B\n",
+ "Training Data 12GB 4GB\n",
+ "Epochs 11\n",
+ "Batch Size 6464\n",
+ "Initial Learning Rate 2e-42e-4\n",
+ "Max Sequence Length 512512\n",
+ "LoRA Rank 6464\n",
+ "LoRA Alpha 128128\n",
+ "LoRA Target Modules QKVO, MLP QKVO, MLP\n",
+ "Training Precision FP16 FP16\n",
+ "In a bid to augment the models’ proficiency with Tamil-centric literature, cultural nuances, and historical \n",
+ "contexts, we\n",
+ "leverage a tailored dataset sourced from Wikipedia. Additionally, to extract instructions from this text, we \n",
+ "utilize the\n",
+ "Self-Instruct method, as highlighted in Wang et al. (2023). This approach involves the GPT-4(OpenAI, 2023) APIs\n",
+ "from OpenAI to generate the new instruction dataset. It is crucial to note that the system prompts, referenced in \n",
+ "Section\n",
+ "3.1.2, remain consistent during this supplemental fine-tuning phase. For the hardware, the same A100 GPU with 80GB\n",
+ "of VRAM was utilized.\n",
+ "In summary, our fine-tuning approach employs a new translated dataset consisting of roughly 145,000 instructions. A\n",
+ "detailed account of the hyperparameters used for fine-tuning can be found in the Table 2.\n",
+ "Table 2: Fine-tuning Hyperparameters\n",
+ "Configurations 7B 13B\n",
+ "Training Data 145k 145k\n",
+ "Epochs 21\n",
+ "Batch Size 6464\n",
+ "Dropout Rate 0.10.1\n",
+ "Initial Learning Rate 2e-42e-4\n",
+ "Max Sequence Length 512512\n",
+ "LoRA Rank 6464\n",
+ "LoRA Alpha 128128\n",
+ "LoRA Target Modules QKVO, MLP QKVO, MLP\n",
+ "Training Precision FP16 FP16\n",
+ "4 Results on Instruction Following Tasks\n",
+ "4.1 Task Design and Evaluation Method\n",
+ "Evaluating the outcomes of text generation tasks is intricate due to their multifaceted formats, distinguishing \n",
+ "them\n",
+ "from typical Natural Language Understanding (NLU) tasks. Drawing inspiration from previous studies that employed\n",
+ "GPT-4(OpenAI, 2023) for scoring, we similarly engage GPT-4 to assign a grade on a 10-point scale to each instance.\n",
+ "This approach is more efficient than human evaluations. However, understanding the potential inaccuracies of \n",
+ "GPT-4’s\n",
+ "evaluations, we supplement its scores with manual reviews, adjusting them as necessary. Such hands-on inspections\n",
+ "affirm the consistency and authenticity of the scores, ensuring they genuinely mirror the efficacy of the models \n",
+ "under\n",
+ "review.\n",
+ "With the GPT-4-based scoring and manual verifications, we have established a robust evaluation framework for our\n",
+ "Tamil LLaMA. Our assessment suite is diligently designed to provide a basic evaluation of Tamil LLaMA. This suite\n",
+ "comprises over 120 diverse examples, covering areas such as Question Answering, Reasoning, Literature, \n",
+ "Entertainment,\n",
+ "Translation, Programming, and Ethics, among others. The overall score for a specific task is computed by summing\n",
+ "the scores from its constituent samples and normalizing it to a 100-point scale. Such an approach ensures a \n",
+ "holistic\n",
+ "reflection of the models’ capabilities across varying tasks, yielding a well-rounded measure of their overall \n",
+ "performance.\n",
+ "74.2 Generation Parameters\n",
+ "The choice of generation parameters during inference greatly affects the caliber of the results in tasks involving \n",
+ "text\n",
+ "generation. Additionally, the degree of quantization can also affect performance. Below are the generation \n",
+ "parameters\n",
+ "we adopted for model evaluations:\n",
+ "•Quantization Config : The model is loaded in 8−bit, with the torch data type specified as bfloat 16.\n",
+ "•Context Size: The context size is maintained at the model’s default of 4096 tokens.\n",
+ "•Temperature: We assign a temperature value of 0.2 to guide the randomness during sampling. A lower\n",
+ "temperature prompts the model to produce more deterministic outputs, whereas a higher value boosts diversity,\n",
+ "potentially compromising coherence. For creative instructions, we adjust the temperature to 0.7 to encourage\n",
+ "varied outputs.\n",
+ "•Top-k Sampling : With k set to 50, the model selects its succeeding token from the 50 most probable candidates,\n",
+ "introducing a level of unpredictability and variety to the resulting text.\n",
+ "•Top-p Sampling : Complementing Top-k sampling, we employ Top-p sampling with a threshold of 0.90. This\n",
+ "ensures the model weighs a fluid set of tokens, which, combined, represent 90\n",
+ "•Maximum Sequence Length : To keep the output concise and pertinent, we cap the generated sequence at 512\n",
+ "tokens.\n",
+ "•Repetition Penalty : A repetition penalty of 1.1 is applied to deter the model from producing redundant text,\n",
+ "disincentivizing previously chosen tokens.\n",
+ "For these evaluations, we utilized a Google Colab notebook powered by a T4 GPU.\n",
+ "4.3 Results from Instruction Tasks\n",
+ "The evaluation scores of the Tamil LLaMA models, as rated by GPT-4, are presented in Table 3. A noteworthy\n",
+ "observation during our evaluation is the superior performance of our models compared to gpt-3.5-turbo in manual\n",
+ "assessments, which is further reinforced by the commendable scores in GPT-4’s evaluations. However, it is essential\n",
+ "to\n",
+ "consider that GPT-4 might inherently favor responses from other GPT model lineages. Even though our model excels in\n",
+ "numerous tasks, there are areas of exception, such as ethics, and this was anticipated, given that we did not \n",
+ "undertake\n",
+ "any alignment efforts. Challenges in literature/entertainment and other areas can be attributed to data limitations\n",
+ "during\n",
+ "the pre-training phase, primarily due to cost constraints. Despite these nuances, our models establish a robust \n",
+ "foundation\n",
+ "for subsequent enhancements and progress in large language models tailored to Tamil.\n",
+ "Table 3: GPT-4 rated performance scores for different models on Tamil instructions\n",
+ "Task Type Tamil-LLaMA-7B Tamil-LLaMA-13B gpt-3.5-turbo\n",
+ "Question Answering 77.0075.3354.33\n",
+ "Open-ended QA 84.4785.2658.68\n",
+ "Reasoning 47.5064.2563.50\n",
+ "Literature 45.5040.0071.00\n",
+ "Entertainment 43.3350.0060.00\n",
+ "Creative Writing 92.5095.6259.69\n",
+ "Translation 60.5666.6792.78\n",
+ "Coding 63.5776.0757.14\n",
+ "Ethics 23.7557.5040.00\n",
+ "Overall 63.8371.1761.33\n",
+ "By observing Table 3, several intriguing outcomes emerge. Notably, the gpt-3.5-turbo , despite its prowess in \n",
+ "numerous\n",
+ "languages, appears to be eclipsed by the Tamil LLaMA models in multiple domains. A standout observation was\n",
+ "the Ethics category, where the gpt-3.5-turbo model demonstrated a propensity to respond to potentially dangerous\n",
+ "queries in Tamil. Additionally, in the Coding section, the gpt-3.5-turbo ’s responses either seemed to exhibit a \n",
+ "lack of\n",
+ "comprehension or overlooked critical details, leading to a subdued score. While gpt-3.5-turbo excels in tasks \n",
+ "related to\n",
+ "English and other languages, its performance in the context of Tamil reveals areas for weaknesses.\n",
+ "84.3.1 Reasoning:\n",
+ "In reasoning tasks, the models demonstrate commendable performance. While minor discrepancies occasionally arise in\n",
+ "areas such as dates, quantities, and formulas, they predominantly excel in reasoning exercises. According to our \n",
+ "manual\n",
+ "evaluations, even our smaller Tamil-LLaMA 7B model surpasses the performance of the much larger LLaMA 2 70B in\n",
+ "Tamil text generation. In comparison, even gpt-3.5-turbo (OpenAI, 2022) often falters in several reasoning \n",
+ "instructions,\n",
+ "producing outputs that miss the mark in relevance, clarity, fluency, and accuracy. This inadequacy in performance \n",
+ "is\n",
+ "also observed in LLaMA 2 70B, rendering their generated Tamil text less beneficial. Examples of responses related \n",
+ "to\n",
+ "reasoning tasks are given in the Figure 5.\n",
+ "We conducted our comparisons with LLaMA 2 70B using the model hosted by Perplexity Labs.\n",
+ "4.3.2 Translation:\n",
+ "For translation tasks, our models exhibit satisfactory performance, particularly when translating from a foreign \n",
+ "language\n",
+ "to Tamil. However, the accuracy diminishes when translating from Tamil to other languages—a shortcoming we aim to\n",
+ "address in future iterations. Based on our manual evaluations, our models outperform the original LLaMA 2 70B in\n",
+ "Tamil text translations. However, their efficacy is roughly on par with gpt-3.5-turbo . Examples of outputs for \n",
+ "translation\n",
+ "tasks are given in Figure 6.\n",
+ "4.3.3 Code Generation:\n",
+ "Our models exhibit impressive performance in code generation tasks despite the limited code instructions present\n",
+ "in the training dataset. They capably provide coherent explanations in Tamil for the generated code. Based on our\n",
+ "hands-on evaluations, our models markedly surpass the performance of the more sizable LLaMA 2 70B model, which\n",
+ "when instructed in Tamil, often either misconstrues the task or produces erroneous answers in English. However, it \n",
+ "is\n",
+ "important to highlight that our model is not tailored for coding tasks. While it handles more straightforward \n",
+ "problems\n",
+ "adeptly, it encounters challenges with more intricate ones. Example responses from our models for Code Generation\n",
+ "tasks can be found in Figure 7.\n",
+ "4.3.4 Open Question Answering\n",
+ "In open question answering tasks, much like in reasoning, the model displays a commendable performance. Despite\n",
+ "occasional inaccuracies in areas like dates and other factual information, its proficiency often exceeded our \n",
+ "expectations,\n",
+ "delivering surprising results on multiple instances. Example responses from our models for Open Question Answering\n",
+ "tasks can be found in Figure 8.\n",
+ "4.3.5 Creative Writing / Text Generation\n",
+ "Text generation is a foundational capability for Large Language Models (LLMs), with creative text generation—such \n",
+ "as\n",
+ "crafting letters or applications—being a particularly notable use case. In general, larger models have an edge in \n",
+ "this\n",
+ "domain, often outshining their smaller counterparts. The quality and quantity of training data play pivotal roles \n",
+ "in this\n",
+ "context. While the sheer volume of data can improve performance, the richness and quality of the data are equally \n",
+ "vital.\n",
+ "With abundant high-quality training data, even smaller models can sometimes surpass the performance of larger ones.\n",
+ "In our experiments, our models showed decent performance in standard tasks. However, they faced challenges when\n",
+ "assigned with more complicated tasks. Example responses from our models for Creative Writing tasks can be found in\n",
+ "Figure 9.\n",
+ "4.3.6 Mathematical reasoning\n",
+ "Mathematical reasoning presents a significant challenge for our models. Like many Large Language Models (LLMs),\n",
+ "they don’t excel in handling mathematical tasks. From our hands-on experiments, we observed that the performance of\n",
+ "our models, mainly when dealing with Tamil, lagged behind that of the original English LLaMA models. Recognizing\n",
+ "this as an area of improvement, we intend to prioritize and enhance the model’s capabilities in subsequent \n",
+ "iterations.\n",
+ "Examples of outputs for mathematical reasoning tasks are given in Figure 10.\n",
+ "4.4 Results from Natural Language Understanding (NLU) tasks\n",
+ "Understanding natural language (NLU) is a vital element within the field of natural language processing (NLP) that\n",
+ "enables computers to comprehend and interpret human language. NLU focuses on comprehending and extracting\n",
+ "9meaning from text, whereas text generation is concerned with generating human-like text based on a given input, \n",
+ "often\n",
+ "without any specific understanding of the text’s meaning.\n",
+ "To ascertain the prowess of a model, its performance in Natural Language Understanding (NLU) tasks is paramount.\n",
+ "However, the availability of standard benchmarks for Tamil in this domain remains sparse. Notable exceptions \n",
+ "include\n",
+ "the IndicNLP (Kunchukuttan, 2020), IndicNLP Corpus (Kunchukuttan et al., 2020), and IndicSentiment (AI4Bharat,\n",
+ "2023) datasets. We opted to assess our models utilizing the test set from the IndicSentiment dataset (AI4Bharat, \n",
+ "2023),\n",
+ "and a text classification dataset sourced from the IndicNLP Corpus (Kunchukuttan et al., 2020).\n",
+ "The test set of the IndicSentiment dataset encompasses 1,000 sentiment samples in Tamil. It is important to note \n",
+ "that\n",
+ "our evaluation was concentrated solely on this Tamil subset.\n",
+ "Figure 3: Performance comparison on the IndicSentiment-7B dataset\n",
+ "From Figure 3, it is evident that our Tamil LLaMA model remarkably surpasses the original LLaMA in this specific\n",
+ "NLU task. The latter’s performance mirrors that of random guessing, registering an accuracy of 50.5%. In stark \n",
+ "contrast,\n",
+ "our model impressively scores an accuracy of 81.3%. This enhanced NLU capability underscores the efficacy of our\n",
+ "methodologies—such as vocabulary expansion and retraining in facilitating the model to comprehend a new language\n",
+ "like Tamil with heightened proficiency.\n",
+ "We further extended our evaluation to the iNLTK Headline Classification subset within the IndicNLP suite (Kakwani\n",
+ "et al., 2020). It is essential to highlight that our analysis was focused strictly on the Tamil language subset of \n",
+ "this dataset.\n",
+ "The outcomes of this evaluation are graphically depicted in Figure 4.\n",
+ "Insight from Figure 4 reveals that the original LLaMA model’s performance aligns closely with random predictions.\n",
+ "In contrast, our Tamil LLaMA model showcases a compelling lead, achieving an accuracy rate of 80.12%, further\n",
+ "affirming its superior capability in natural language understanding.\n",
+ "5 Limitations\n",
+ "The Tamil LLaMA suite of models we introduce in this paper heralds several advancements in Tamil language \n",
+ "processing.\n",
+ "However, in the spirit of rigorous research, it is imperative to discuss the inherent limitations accompanying \n",
+ "these\n",
+ "models.\n",
+ "10Figure 4: Performance comparison on the IndicGLUE Text Classification dataset\n",
+ "•Constrained Knowledge Base : Due to computational and cost constraints, our models were trained on a\n",
+ "relatively limited Tamil dataset. This translates to gaps in the models’ knowledge, especially regarding nuances\n",
+ "and specifics native to Tamil culture and literature. While the current version lays the foundation, the true\n",
+ "potential can be unlocked with access to a broader data spectrum, enriching its contextual understanding.\n",
+ "•Ethical Concerns : Detoxification procedures were not implemented in our training process, making these\n",
+ "models prone to generating potentially harmful or offensive content. Their uncensored nature necessitates\n",
+ "caution during deployment.\n",
+ "•Lack of Robustness : Our models may, at times, produce outputs that veer off-topic or deviate substantially\n",
+ "from anticipated responses. This vulnerability is more pronounced under adversarial conditions or tricky\n",
+ "prompts.\n",
+ "•Reasoning and Mathematical Challenges : While our models showcase competence in specific reasoning\n",
+ "scenarios, they falter in many others, underscoring the repercussions of not having a comprehensive training\n",
+ "set.\n",
+ "•Over-Generation Tendencies : On occasions, the models tend to generate verbose content, extending beyond\n",
+ "logical termination points, leading to potential redundancy.\n",
+ "•Evaluation Hurdles : Assessment of LLMs is a crucial yet challenging endeavor. The scarcity of standardized\n",
+ "benchmarks, particularly for languages like Tamil, which are outside the European linguistic group, complicates\n",
+ "comparative evaluations. Although we propose an evaluative approach tailored for Tamil within this paper, it\n",
+ "is not exhaustive enough to gauge models’ efficacy across diverse domains.\n",
+ "•Translation Loss : Given that the instructional prompts used for fine-tuning the Tamil LLaMA base models are\n",
+ "derived from English datasets translated into Tamil, there is a potential for nuanced inaccuracies—commonly\n",
+ "referred to as translation loss. This can potentially affect the models’ abilities in both text generation and\n",
+ "comprehension due to subtle shifts in meaning that can occur during the translation process.\n",
+ "While some of these challenges are addressable in subsequent iterations, we envision this work serving as an \n",
+ "anchor,\n",
+ "inspiring the research community to propel advancements in LLMs for Indian languages.\n",
+ "116 Conclusion\n",
+ "In this research endeavor, we have not only filled a critical void in the domain of Tamil text generation but have \n",
+ "also\n",
+ "elevated the status of this venerable language within the realm of large language models with the advent of our \n",
+ "Tamil\n",
+ "LLaMA.To assess the performance of our models, we curated an evaluation dataset consisting of 120 Tamil \n",
+ "instructions\n",
+ "covering a wide range of topics. We then employed GPT-4 to assess and rate the responses generated by our model. \n",
+ "The\n",
+ "7B variant of our model has surpassed the performance of OpenAI’s gpt-3.5-turbo in tasks involving Tamil \n",
+ "instructions\n",
+ "within our evaluation methodology. Even more impressively, the 13B iteration has outperformed its counterparts,\n",
+ "demonstrating an almost 10% higher proficiency in these tasks.\n",
+ "The significance of our findings is accentuated by the efficiency of our models in generating Tamil text. Equipped \n",
+ "with\n",
+ "a refined tokenizer, the 7B and 13B variants demonstrate exceptional proficiency, eclipsing the original LLaMA \n",
+ "models\n",
+ "in processing speed without sacrificing textual quality. This stride is not just a modest step forward but a major \n",
+ "leap in\n",
+ "the models’ ability to process and generate Tamil language content, thus forging a new avenue for practical \n",
+ "applications\n",
+ "that necessitate precision and promptness.\n",
+ "Nevertheless, our exploration in this field continues. We acknowledge the triumphs and yet remain mindful of the\n",
+ "impending obstacles. The possibility of generating unintended or adverse content, the financial limitations \n",
+ "associated\n",
+ "with model training, and the need for comprehensive evaluation benchmarks tailored for Tamil are challenges we are\n",
+ "committed to surmounting.\n",
+ "As we chart our path forward, we acknowledge the value of continuous learning and adaptation. We aspire to delve\n",
+ "deeper into strategies to mitigate the limitations observed, perhaps even leveraging advancements like \n",
+ "Reinforcement\n",
+ "Learning With Human Feedback (RLHF) and more advanced quantization techniques. The horizon looks promising,\n",
+ "and we remain committed to elevating the capabilities of LLaMA for Tamil and other Indian languages.\n",
+ "Appendix\n",
+ "Example Outputs\n",
+ "Figure 5: Example Output for Reasoning Tasks\n",
+ "Prompt ேமைஜயில ் 4 ஆப ் பிள ் கள ் இருந ் தன. குமார ் ஒன ் ைறச ்\n",
+ "சாப ் பிட ் டான ் , ராம ் எடுத ் துச ் ெசன ் றான ் 2. மீதி எத ் -\n",
+ "தைன ஆப ் பிள ் கள ் உள ் ளன?\n",
+ "7B Model முதலில ் , நாம ் 4 - 1 = 3 ஆப ் பிள ் கைள எடுத ் துக ் ெகாள ் -\n",
+ "ளலாம ் , ஏெனனில ் குமார ் ஒன ் று சாப ் பிட ் டார ் . அடுத ் து,\n",
+ "ராம ் 2 ஆப ் பிள ் கைள எடுத ் துச ் ெசன ் றார ் , எனேவ மீத-\n",
+ "முள ் ள ஆப ் பிள ் களின ் எண ் ணிக ் ைக 3 - 2 = 1 ஆகும ் .\n",
+ "எனேவ, மீதமுள ் ள ஆப ் பிள ் 1 உள ் ளது.\n",
+ "13B Model இப ் ேபாது, இந ் தக ் ேகள ் விக ் கு பதிலளிக ் க, நாம ் சூத ் தி-\n",
+ "ரத ் ைதப ் பயன ் படுத ் தலாம ் :\n",
+ "ெமாத ் த ஆப ் பிள ் கள ் = ஆரம ் ப ஆப ் பிள ் கள ் - சாப ் பிட ் ட\n",
+ "ஆப ் பிள ் கள ் - எடுத ் துச ் ெசன ் ற ஆப ் பிள ் கள ்\n",
+ "ஆரம ் ப ஆப ் பிள ் கள ் = 4\n",
+ "சாப ் பிட ் ட ஆப ் பிள ் கள ் = 1\n",
+ "எடுக ் கப ் பட ் ட ஆப ் பிள ் கள ் = 2\n",
+ "இப ் ேபாது இந ் த மதிப ் புகைள சூத ் திரத ் தில ் ெசருகலாம ் :\n",
+ "ெமாத ் த ஆப ் பிள ் கள ் = 4 - 1 - 2 = 1\n",
+ "எனேவ, ேமைஜயில ் 1 ஆப ் பிள ் உள ் ளது.\n",
+ "12Figure 6: Example Outputs for Translation Tasks\n",
+ "Example 1\n",
+ "Prompt தயவுெசய ் துபின ் வரும் வாக்கியத்ைததமBழBல ் ெமாழB-\n",
+ "ெபயர்க்கவும்:\n",
+ "”W e’ve trained a model called ChatGPT which interacts in\n",
+ "a conversational way . The dialogue format makes it possible\n",
+ "for ChatGPT to answer followup questions, admit its mis-\n",
+ "takes, challenge incorrect premises, and reject inappropriate\n",
+ "requests. ”\n",
+ "7B Model நாங ் கள ் சாட்ேபாட் என ் ற ெபயரில ் பயிற ் சியளிக்கப ் -\n",
+ "பட்ட மாதிரிையக் ெகாண ் டுள ் ேளாம், இது உைரயாட-\n",
+ "லில ் ேபசுவதுேபால ் ெசயல ் படுகிறது.உைரயாடல ் வடி-\n",
+ "வைமப ் பின ் மூலம், ேகள ் விகளுக்கு பதிலளிக்கும் தி-\n",
+ "றன ் மற ் றும் தவறானஅறBக்ைககைளமறுக்கும் திறன ்\n",
+ "ேபான ் றபலவிஷயங ் கைளச் ெசய ் யசாட்ேபாட் அனு-\n",
+ "மதிக்கிறது. கூடுதலாக, சாட்ேபாட் சில ேநரங ் களில ்\n",
+ "அதன ் தவறுகளில ் ஒப ் புக்ெகாள ் கிறார் மற ் றும் ெபா-\n",
+ "ருத்தமற ் றேகாரிக்ைககைளக் கூடநிராகரிக்கிறார்.\n",
+ "13B Model நாங ் கள ் ”ChatGPT”என ் றமாதிரிையபயிற ் றுவித்துள ் -\n",
+ "ேளாம், இது உைரயாடல ் முைறயில ் ெதாடர்பு ெகாள ் -\n",
+ "கிறது. இந்தேபச்சுவழக்குவடிவமானது,பின ் ெதாடர்-\n",
+ "தல ் ேகள ் விகளுக்குபதிலளிக்கவும்,தவறுகைளச் சரி-\n",
+ "பார்க்கவும்,தவறானமுன ் ெமாழBவுகைளசவால ் ெசய ் -\n",
+ "யவும் மற ் றும் ெபாருத்தமற ் றேகாரிக்ைககள ் மீதுநிரா-\n",
+ "கரிக்கவும் ChatGPTஐஅனுமதிக்கிறது.\n",
+ "Example 2\n",
+ "Prompt தயவுெசய ் துபின ் வரும் வாக்கியத்ைததமBழBல ் ெமாழB-\n",
+ "ெபயர்க்கவும்:\n",
+ "”Estoy pensando en viajar a Inglaterra para ver el partido\n",
+ "entre Inglaterra y España de este fin de semana. ¡V a a ser\n",
+ "emocionante!”\n",
+ "7B Model நான ் இந்த வார இறுதியில ் இங ் கிலாந்தில ் கால ் பந்து\n",
+ "விைளயாட்ைடப ் பார்க்கவிரும்புகிேறன ் ,இதுமBகவும்\n",
+ "சுவாரஸ ் யமாகஇருக்கும்!\n",
+ "13B Model நான ் இந்த வார இறுதியில ் இங ் கிலாந்து மற ் றும் ஸ ் -\n",
+ "ெபயின ் இைடேயயான கால ் பந்து விைளயாட்ைடப ்\n",
+ "பார்க்க இங ் கிலாந்ைதச் ெசல ் லத் திட்டமBடுகிேறன ் !\n",
+ "இதுமBகவும் உற ் சாகமாகஇருக்கும்!\n",
+ "13Figure 7: Example Output for Coding Tasks\n",
+ "Prompt :ைபத ் தானில ் ”Quick Sort” அல ் காரிதத ் ைத ெசயல ் ப-\n",
+ "டுத ் தவும ் .\n",
+ "7B Model :\n",
+ "நிச ் சயம ் ! ைபதான ் குறBயீடுதுணுக ் குஇேதா,இது QuickSort அல ் -\n",
+ "காரிதத ் ைதப ் பயன ் படுத ் திவரிைசப ் படுத ் தப ் பட ் டபட ் டியைல\n",
+ "உருவாக ் கப ் பயன ் படும ் :\n",
+ "def quick_sort ( arr ) :\n",
+ "if len ( arr )<= 1:\n",
+ "return arr\n",
+ "pivot = arr [0]\n",
+ "less = [ x for x in arr [1:] if x < pivot ]\n",
+ "greater = [ x for x in arr [1:] if x > pivot ]\n",
+ "return quick_sort ( less ) + [ pivot ] + quick_sort\n",
+ "( greater )\n",
+ "my_list = [5 , 3 , 7 , 2 , 9 , 4]\n",
+ "result = quick_sort ( my_list )\n",
+ "print ( result ) # Output : [2 , 3 , 5 , 4 , 7 , 9]\n",
+ "குறBயீட ் டின ் ெவளியீடு,வரிைசப ் படுத ் தப ் பட ் டபட ் டியலில ் உள ் ள\n",
+ "உறுப ் புகளின ் மறுசீரைமக ் கப ் பட ் டவரிைசையவழங ் குகிறது.\n",
+ "இது உதவும ் என ் று நான ் நம ் புகிேறன ் ! உங ் களுக ் கு ஏேதனும ்\n",
+ "ேகள ் விகள ் இருந ் தால ் எனக ் கு ெதரியப ் படுத ் தவும ் .\n",
+ "13B Model :\n",
+ "ைபத ் தானில ் QuickSort அல ் காரிதம ் ெசயல ் படுத ் துவதற ் கான\n",
+ "எடுத ் துக ் காட ் டு இங ் ேக:\n",
+ "def quick_sort ( arr ) :\n",
+ "if len ( arr ) <= 1:\n",
+ "return arr\n",
+ "pivot = arr [0]\n",
+ "left , right = [] , []\n",
+ "for i in range (1 , len ( arr )) :\n",
+ "if arr [ i ] < pivot :\n",
+ "left . append ( arr [ i ])\n",
+ "else :\n",
+ "right . append ( arr [ i ])\n",
+ "return quick_sort ( left ) + [ pivot ] + quick_sort\n",
+ "( right )\n",
+ "arr = [2 , 3 , 5 , 7 , 9 , 10 , 11]\n",
+ "result = quick_sort ( arr )\n",
+ "print (\"The sorted array is :\", result )\n",
+ "14Figure 8: Example Output for Open Question Answering Tasks\n",
+ "Prompt :நான ் ெபரியபணக ் காரன ் இல ் ைல,சமீபத ் திய iPhone\n",
+ "15'க ் கு நிைறய பணம ் ெசலவழBக ் க ேவண ் டுமா?\n",
+ "7B Model :\n",
+ "AIஆக,எனக ் குதனிப ் பட ் டவிருப ் பங ் கள ் அல ் லதுநிதிஇலக ் -\n",
+ "குகள ் இல ் ைல. இருப ் பினும ் ,உங ் கள ் ேதைவகளுக ் குஏற ் றெதா-\n",
+ "ைலேபசிையத ் ேதர ் ந ் ெதடுப ் பதுமற ் றும ் உங ் கள ் பட ் ெஜட ் ைடக ்\n",
+ "கருத ் தில ் ெகாள ் வது முக ் கியம ் . சமீபத ் திய ஐேபான ் களில ் வி-\n",
+ "ைலகள ் அதிகமாக இருக ் கலாம ் , ஆனால ் நீங ் கள ் விரும ் பிய\n",
+ "அம ் சங ் கைளவழங ் குவதில ் அைவஇன ் னும ் சிறந ் தைவ. உங ் -\n",
+ "கள ் ேதைவகைளப ் பூர ் த ் தி ெசய ் யும ் ஒன ் ைறத ் ேதர ் வுெசய ் ய-\n",
+ "வும ் ,ேமலும ் கூடுதல ் ெசலவுகைளச ் ேசமBக ் கஉதவும ் வைகயில ் ,\n",
+ "தள ் ளுபடிகள ் அல ் லது ஒப ் பந ் தங ் கைளப ் பார ் க ் கவும ் .\n",
+ "உங ் கள ் பட ் ெஜட ் டுக ் குள ் இருக ் கமுயற ் சிப ் பதும ் முக ் கியம ் . உங ் -\n",
+ "கள ் ெசலவினங ் கைளக ் கண ் காணித ் து, உங ் கள ் ெசலவினங ் -\n",
+ "களுக ் கு முன ் னுரிைம அளிக ் கவும ் . இது ேதைவயற ் ற ெசலவு-\n",
+ "கைளத ் தவிர ் க ் கவும ் ,உங ் கள ் பணத ் ைதச ் ேசமBப ் பதற ் கானவழB-\n",
+ "கைளக ் கண ் டறBயவும ் உதவும ் .\n",
+ "இறுதியில ் ,உங ் கள ் ேதைவகள ் மற ் றும ் பட ் ெஜட ் டிற ் குஏற ் றெதா-\n",
+ "ைலேபசிையத ் ேதர ் ந ் ெதடுப ் பது முக ் கியம ் . உங ் கள ் பட ் ெஜட ் -\n",
+ "டில ் ஒட ் டிக ் ெகாண ் டு,உங ் களுக ் குத ் ேதைவயானஅம ் சங ் கைளத ்\n",
+ "ேதடுங ் கள ் .\n",
+ "இருப ் பினும ் ,சமீபத ் தியஐேபான ் களில ் ஒன ் றுநிச ் சயமாகஅதிக\n",
+ "விைலயில ் இருக ் கும ் என ் பைத நிைனவில ் ெகாள ் ளுங ் கள ் .\n",
+ "15Figure 9: Example Output for Creative Writing Tasks\n",
+ "Example:\n",
+ "Prompt :ஐபிஎல ் ெதாடைரெசன ் ைனசூப ் பர் கிங ் ஸ ் (சிஎஸ ் -\n",
+ "ேக)ெவன ் றதுஎன ் றதைலப ் பில ் ஒருசிறுெசய ் திக் கட்டுைர-\n",
+ "ையஎழுதுங ் கள ் .\n",
+ "7B Model :\n",
+ "ெசன ் ைனசூப ் பர் கிங ் ஸ ் (சிஎஸ ் ேக)2021இந்தியன ் பிரீமBயர்\n",
+ "லீக்கில ் (ஐபிஎல ் )இறுதிப ் ேபாட்டிையெவன ் றது,அவர்களின ்\n",
+ "மூன ் றாவதுஐபிஎல ் பட்டத்ைதஉறுதிெசய ் தது. இந்தெவற ் -\n",
+ "றBயின ் மூலம், சிஎஸ ் ேகஅணிக்காகவிைளயாடியேகப ் டன ்\n",
+ "ேதானிதைலைமயிலானஇளம் மற ் றும் திறைமயானவீரர்க-\n",
+ "ளின ் குழுவிற ் குஇதுஒருகுறBப ் பிடத்தக்கசாதைனயாகும்.\n",
+ "சமீபத்தியசீசன ் முழுவதும்,சிஎஸ ் ேகஅணிபலசவால ் கைள\n",
+ "எதிர்ெகாண ் டது,ஆனால ் அவர்கள ் ெதாடர்ந்துதங ் கள ் திற-\n",
+ "ைமகைளயும் நம்பிக்ைகையயும் ெவளிப ் படுத்தினர். அணி-\n",
+ "யின ் ேவகப ் பந்துவீச்சு,ேபட்டிங ் மற ் றும் ஃபீல ் டிங ் ஆகியவற ் -\n",
+ "றBன ் வலுவானகலைவயானதுஅவர்களின ் ேபாட்டியாளர்-\n",
+ "களின ் பலேமாசமானஆட்டங ் களுக்குவழBவகுத்தது,ேமலும்\n",
+ "அவர்கள ் இறுதிப ் ேபாட்டியில ் அவர்களின ் ெசாந்தவிைளயாட்-\n",
+ "டுகளில ் ஆதிக்கம் ெசலுத்தமுடிந்தது.\n",
+ "இந்தெவற ் றBயின ் மூலம்,சிஎஸ ் ேகஅணிமூன ் றுஐபிஎல ் சாம்-\n",
+ "பியன ் ஷCப ் கைளெவன ் றுள ் ளது,ேமலும் அவர்களின ் ரசிகர்-\n",
+ "கள ் மகிழ்ச்சியில ் மூழ்கிஉள ் ளனர்.ேதானியும் அவரதுஅணி-\n",
+ "யும் பலஆண ் டுகளாகஇந்தியகிரிக்ெகட் அரங ் கில ் முக்கிய\n",
+ "நபராகஇருந்துவருகின ் றனர்,ேமலும் இந்தெவற ் றBஅவர்க-\n",
+ "ளின ் பாரம்பரியத்ைதேமலும் உயர்த்துகிறது.\n",
+ "ஒட்டுெமாத்தமாக,சிஎஸ ் ேகஅணிஇந்தசீசனின ் ஐபிஎல ் ெதா-\n",
+ "டர் இறுதிப ் ேபாட்டிக்குமுன ் ேனறBயதில ் ஒருெபரியசாதைன-\n",
+ "ையக் ெகாண ் டிருந்தது,ேமலும் அவர்களின ் ரசிகர்கள ் இப ் ேபா-\n",
+ "துஒருவருடம் கழBத்துதங ் கள ் அணிக்குஉற ் சாகத்ைதயும் ஆத-\n",
+ "ரைவயும் காட்டஎதிர்பார்க்கிறார்கள ் .\n",
+ "16Figure 10: Example Outputs for Mathematical Reasoning Tasks\n",
+ "Example 1\n",
+ "Prompt 7! என ் றால ் என ் ன என ் பைதக ் கணக ் கிடுங ் கள ்\n",
+ "7B Model 7! என ் பது 7 x 6 x 5 x 4 x 3 x 2 x 1 என ் ற ெபருக ் கல ்\n",
+ "ெவளிப ் பாட ் ைடக ் குறBக ் கிறது, இது ேதாராயமாக 870\n",
+ "ஆகும ் .\n",
+ "எனேவ, 7! = 870\n",
+ "13B Model 7! என ் பது 7 இன ் காரணிகளின ் எண ் ணிக ் ைக, அதா-\n",
+ "வது 1 x 2 x 3 x 4 x 5 x 6 x 7 ஆகும ் . எனேவ, 7! = 5040\n",
+ "ஆகும ் .\n",
+ "Example 2\n",
+ "Prompt 3x+1=10 என ் றால ் , xஎதற ் கு சமம ் ?\n",
+ "7B Model ெகாடுக ் கப ் பட ் ட சமன ் பாடு 3x+ 1 = 10 எனில ் , xக ் கு தீர ் -\n",
+ "வு காண, நாம ் பின ் வரும ் படிகைளப ் பின ் பற ் றலாம ் :\n",
+ "படி 1: சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ் தும ் 1 ஐக ்\n",
+ "கழBக ் கவும ் . இது நமக ் கு அளிக ் கிறது:\n",
+ "3x+ 1−1 = 10 −1\n",
+ "3x= 9\n",
+ "படி 2: இப ் ேபாது எங ் களிடம ் 4xஇன ் மதிப ் பு இருப ் ப-\n",
+ "தால ் , அைத அசல ் சமன ் பாட ் டில ் மாற ் றலாம ் : 3x+1 = 10\n",
+ "3(9) + 1 = 10\n",
+ "29 + 1 = 10\n",
+ "30 = 10\n",
+ "எனேவ, xக ் கான தீர ் வு 30 ஆகும ் .\n",
+ "இருப ் பினும ் , ெகாடுக ் கப ் பட ் ட சமன ் பாட ் டிற ் கான சரி-\n",
+ "யான தீர ் ைவ நீங ் கள ் ேதடுகிறீர ் களானால ் , xஇன ் மதிப ் -\n",
+ "ைபப ் ெபற, இரு பக ் கங ் கைளயும ் 3 ஆல ் வகுக ் க ேவண ் -\n",
+ "டும ் .\n",
+ "x= (30/3)\n",
+ "x= 10\n",
+ "எனேவ, xஎன ் பது 10க ் கு சமம ் .\n",
+ "13B Model 3x+ 1 = 10 சமன ் பாட ் டில ் xக ் கு தீர ் வு காணும ் ேபா-\n",
+ "து, சமன ் பாட ் டின ் ஒரு பக ் கத ் தில ் மாறBைய தனிைமப ் -\n",
+ "படுத ் த ேவண ் டும ் .\n",
+ "முதலில ் , சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ் தும ் 1 ஐக ்\n",
+ "கழBப ் பதன ் மூலம ் ெதாடங ் கலாம ் :\n",
+ "3x+ 1−1 = 10 −1\n",
+ "இது எளிதாக ் குகிறது:\n",
+ "3x= 9\n",
+ "அடுத ் து, சமன ் பாட ் டின ் இரு பக ் கங ் கைளயும ் 3 ஆல ்\n",
+ "வகுப ் பதன ் மூலம ் xஐ தனிைமப ் படுத ் துேவாம ் :\n",
+ "(3x)/3 = (9)/3\n",
+ "இது எளிதாக ் குகிறது:\n",
+ "x= 3\n",
+ "எனேவ, 3x+ 1 = 10 சமன ் பாட ் டிற ் கான தீர ் வு x= 3\n",
+ "ஆகும ் .\n",
+ "17Acknowledgments\n",
+ "We gratefully acknowledge the assistance of OpenAI’s GPT-4 in the preparation of this manuscript. The AI’s advanced\n",
+ "language understanding and generation capabilities were invaluable in refining the structure, clarity, and overall\n",
+ "coherence of the original draft.\n",
+ "References\n",
+ "AI4Bharat. Indic sentiment dataset by ai4bharat. https://huggingface.co/datasets/ai4bharat/\n",
+ "IndicSentiment , 2023.\n",
+ "J. Ainslie, J. Lee-Thorp, M. de Jong, Y . Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized \n",
+ "multi-query\n",
+ "transformer models from multi-head checkpoints, 2023.\n",
+ "I. Caswell, T. Breiner, D. van Esch, and A. Bapna. Language id in the wild: Unexpected challenges on the path to a\n",
+ "thousand-language web text corpus, 2020.\n",
+ "Y . Cui, Z. Yang, and X. Yao. Efficient and effective text encoding for chinese llama and alpaca, 2023.\n",
+ "J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for \n",
+ "language\n",
+ "understanding, 2019.\n",
+ "E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of \n",
+ "large\n",
+ "language models, 2021.\n",
+ "A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel,\n",
+ "G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. \n",
+ "E.\n",
+ "Sayed. Mistral 7b, 2023.\n",
+ "D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar. IndicNLPSuite:\n",
+ "Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages.\n",
+ "InFindings of the Association for Computational Linguistics: EMNLP 2020 , pages 4948–4961, Online, Nov.\n",
+ "2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.445. URL https://\n",
+ "aclanthology.org/2020.findings-emnlp.445 .\n",
+ "T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for\n",
+ "neural text processing, 2018.\n",
+ "A. Kunchukuttan. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/\n",
+ "blob/master/docs/indicnlp.pdf , 2020.\n",
+ "A. Kunchukuttan, D. Kakwani, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar. Ai4bharat-indicnlp\n",
+ "corpus: Monolingual corpora and word embeddings for indic languages. arXiv preprint arXiv:2005.00085 , 2020.\n",
+ "W. Lian, B. Goodson, E. Pentland, A. Cook, C. V ong, and \"Teknium\". Openorca: An open dataset of gpt augmented\n",
+ "flan reasoning traces. https://https://huggingface.co/Open-Orca/OpenOrca , 2023.\n",
+ "X. V . Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru,\n",
+ "S. Shleifer, P. S. Koura, V . Chaudhary, B. O’Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. Diab, V . Stoyanov, \n",
+ "and\n",
+ "X. Li. Few-shot learning with multilingual language models, 2022.\n",
+ "A. Mahendiran. abinayam/gpt-2-tamil. https://huggingface.co/abinayam/gpt-2-tamil , 2021.\n",
+ "T. Nguyen, C. V . Nguyen, V . D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Culturax: A\n",
+ "cleaned, enormous, and multilingual dataset for large language models in 167 languages, 2023.\n",
+ "OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt , 2022.\n",
+ "OpenAI. Gpt-4 technical report, 2023.\n",
+ "A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by\n",
+ "generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/\n",
+ "language-unsupervised/language_understanding_paper.pdf , 2018.\n",
+ "A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised mul-\n",
+ "titask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_\n",
+ "are_unsupervised_multitask_learners.pdf , 2019.\n",
+ "T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et\n",
+ "al.\n",
+ "Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 , 2022.\n",
+ "N. Shazeer. Glu variants improve transformer, 2020.\n",
+ "18O. Shliazhko, A. Fenogenova, M. Tikhonova, V . Mikhailov, A. Kozlova, and T. Shavrina. mgpt: Few-shot learners go\n",
+ "multilingual, 2022. URL https://arxiv.org/abs/2204.07580 .\n",
+ "J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with rotary position \n",
+ "embedding,\n",
+ "2022.\n",
+ "R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: \n",
+ "An\n",
+ "instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca , 2023.\n",
+ "H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. \n",
+ "Azhar,\n",
+ "A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023a.\n",
+ "H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. \n",
+ "Bhosale,\n",
+ "D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao,\n",
+ "V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann,\n",
+ "A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y . Lu, Y . Mao, X. Martinet, T. Mihaylov,\n",
+ "P. Mishra, I. Molybog, Y . Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. \n",
+ "Smith,\n",
+ "R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y . Zhang, A. Fan,\n",
+ "M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and\n",
+ "fine-tuned chat models, 2023b.\n",
+ "A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is \n",
+ "all\n",
+ "you need. Advances in neural information processing systems , 30, 2017.\n",
+ "Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning \n",
+ "language\n",
+ "models with self-generated instructions, 2023.\n",
+ "B. Zhang and R. Sennrich. Root mean square layer normalization, 2019.\n",
+ "19\n",
+ "
\n"
+ ],
+ "text/plain": [
+ "TAMIL -LLAMA : A N EWTAMIL LANGUAGE MODEL BASED ON\n",
+ "LLAMA \u001b[1;36m2\u001b[0m\n",
+ "Abhinand Balachandran\n",
+ "abhinandb.ml@gmail.com\n",
+ "ABSTRACT\n",
+ "Language modeling has witnessed remarkable advancements in recent years, with Large Language\n",
+ "Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m like ChatGPT setting unparalleled benchmarks in human-like text generation. How-\n",
+ "ever, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge\n",
+ "models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this\n",
+ "lacuna, enhancing the open-source LLaMA model with an addition of \u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m Tamil tokens, aiming to\n",
+ "achieve superior text generation and comprehension in the Tamil language. We strategically employ\n",
+ "the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring com-\n",
+ "putational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the\n",
+ "Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results\n",
+ "showcase significant performance improvements in Tamil text generation, with potential implications\n",
+ "for the broader landscape of LLMs in Indian languages. We further underscore our commitment\n",
+ "to open research by making our models, datasets, and code1publicly accessible, fostering further\n",
+ "innovations in language modeling.\n",
+ "\u001b[1;36m1\u001b[0m Introduction\n",
+ "The past few years have been transformative for language modeling, with groundbreaking advances and monumental\n",
+ "achievements. At the forefront of this revolution was OpenAI’s ChatGPT \u001b[1m(\u001b[0mOpenAI, \u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m, which not only raised the\n",
+ "bar in language modeling performance but also underscored the immense societal implications of such technologies.\n",
+ "Alongside ChatGPT, various Large Language Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m have consistently demonstrated exceptional prowess in\n",
+ "natural language understanding and generation, heralding a new era in computational linguistics.\n",
+ "Central to the functionality of these modern LLMs is the Transformer architecture, a cornerstone concept brought to\n",
+ "the limelight by \u001b[32m\"Attention is All You Need\"\u001b[0m \u001b[1m(\u001b[0mVaswani et al., \u001b[1;36m2017\u001b[0m\u001b[1m)\u001b[0m. This innovation transformed our approach to\n",
+ "sequence-based tasks, catalyzing pivotal models like BERT \u001b[1m(\u001b[0mDevlin et al., \u001b[1;36m2019\u001b[0m\u001b[1m)\u001b[0m and redefining best practices in \n",
+ "Natural\n",
+ "Language Processing \u001b[1m(\u001b[0mNLP\u001b[1m)\u001b[0m.\n",
+ "Subsequent developments, particularly the Generative Pre-trained Transformer \u001b[1m(\u001b[0mGPT\u001b[1m)\u001b[0m \u001b[1m(\u001b[0mRadford et al., \u001b[1;36m2018\u001b[0m\u001b[1m)\u001b[0m, \n",
+ "showcased\n",
+ "the profound potential of unsupervised pre-training on vast datasets. Models like GPT-\u001b[1;36m3\u001b[0m and its successor, GPT-\u001b[1;36m4\u001b[0m\n",
+ "\u001b[1m(\u001b[0mOpenAI, \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, have redefined benchmarks and fueled a renaissance in natural language understanding and \n",
+ "generation.\n",
+ "Beyond their technical prowess, they have prompted a renewed vigor in exploring the limits of Artificial General\n",
+ "Intelligence \u001b[1m(\u001b[0mAGI\u001b[1m)\u001b[0m. These advancements, paired with exemplary performance in numerous applications, have galvanized\n",
+ "the NLP community, sparking widespread application and research from sentiment analysis to machine translation.\n",
+ "However, progress is not without its pitfalls. The elite LLMs, despite their remarkable capabilities, grapple with\n",
+ "challenges—primarily, their proprietary nature, which constricts open research. Furthermore, an English-centric\n",
+ "bias and the enormous computational requirements for training such behemoths further accentuate the call for more\n",
+ "accessible and diverse solutions.\n",
+ "In response, the open-source community has championed the creation of models like LLaMA \u001b[1m(\u001b[0mTouvron et al., 2023a\u001b[1m)\u001b[0m\n",
+ "and Mistral \u001b[1m(\u001b[0mJiang et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. Such models, despite their compact nature, challenge the hegemony of giants like\n",
+ "ChatGPT in select benchmarks, heralding a promising direction for future research.\n",
+ "1GitHub Repository: \u001b[4;94mhttps://github.com/abhinand5/tamil-llamaarXiv:2311.05845v1\u001b[0m \u001b[1;36m10\u001b[0m Nov 2023However, as robust as \n",
+ "these models, like LLaMA and Mistral, might be, their proficiency in generating coherent text in\n",
+ "Tamil and several other Indian languages remains noticeably deficient. A fundamental limitation lies in their \n",
+ "minimal\n",
+ "vocabulary of Tamil characters, which is essential for effective text encoding and generation. This paper aims to \n",
+ "bridge\n",
+ "this gap by augmenting the existing LLaMA models’ vocabulary with an additional \u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m Tamil tokens, markedly\n",
+ "enhancing their capability in processing and producing Tamil content. This method draws inspiration from a parallel\n",
+ "endeavor in the Chinese adaptation of LLaMA, as documented in Cui et al. \u001b[1m(\u001b[0m\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. To ensure efficient pre-training\n",
+ "and fine-tuning while maintaining computational feasibility, we leverage the LoRA \u001b[1m(\u001b[0mHu et al., \u001b[1;36m2021\u001b[0m\u001b[1m)\u001b[0m methodology. We\n",
+ "aspire that this initiative catalyzes further research endeavors, refining LLaMA and other open-source models \n",
+ "tailored\n",
+ "for Indian languages. A succinct overview of the principal contributions of this paper is as follows:\n",
+ "•We bolster the LLaMA model’s encoding and decoding proficiencies for Tamil by incorporating an additional\n",
+ "\u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m Tamil tokens, thereby expanding its vocabulary.\n",
+ "•Through the LoRA methodology, the augmented model undergoes training on an extensive Tamil corpus,\n",
+ "resulting in a marked enhancement of its text generation capabilities relative to its predecessor models.\n",
+ "•We present a Tamil-translated version of the original Alpaca dataset \u001b[1m(\u001b[0mTaori et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, paired with a subset of\n",
+ "the OpenOrca \u001b[1m(\u001b[0mLian et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m dataset, both curated for instruction fine-tuning in Tamil.\n",
+ "•Our newly trained instruction and chat models, built upon the Alpaca and OpenOrca datasets, demonstrate\n",
+ "notable advancements in performance for the Tamil language compared to other open-source language models.\n",
+ "•To stimulate continuous innovation and broader adaptability, we grant public access to the models, datasets,\n",
+ "and associated code, inviting further exploration and encouraging the refinement of LLaMA models for diverse\n",
+ "languages.\n",
+ "\u001b[1;36m2\u001b[0m Related Work\n",
+ "Within the broad field of Natural Language Processing \u001b[1m(\u001b[0mNLP\u001b[1m)\u001b[0m, the advent of Large Language Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m marks a\n",
+ "transformative moment. These models have heralded new capabilities in understanding, generating, and processing\n",
+ "various human languages, underpinning innovations from automated content creation to nuanced sentiment analysis.\n",
+ "While their proficiency in mainstream languages like English is widely recognized and leveraged, a disparity exists\n",
+ "in\n",
+ "their performance and availability for numerous non-European languages.\n",
+ "Tamil, a language with ancient roots and spoken by a substantial global population, epitomizes this disparity. \n",
+ "Despite\n",
+ "its linguistic depth and cultural significance, dedicated pre-trained LLMs for Tamil are conspicuously \n",
+ "underrepresented.\n",
+ "Most current offerings are generic, multipurpose LLMs, which do not cater specifically to the unique attributes of \n",
+ "the\n",
+ "Tamil language.\n",
+ "A survey of the existing literature reveals that many attempts to cater to the Tamil language through LLMs rely \n",
+ "heavily\n",
+ "on multilingual models. Works such as Scao et al. \u001b[1m(\u001b[0m\u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m, Shliazhko et al. \u001b[1m(\u001b[0m\u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m, and Lin et al. \u001b[1m(\u001b[0m\u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m have all \n",
+ "ventured\n",
+ "into this domain. However, it is crucial to note that, except \u001b[32m\"GPT-2 Tamil\"\u001b[0m by Mahendiran \u001b[1m(\u001b[0m\u001b[1;36m2021\u001b[0m\u001b[1m)\u001b[0m, all these models\n",
+ "are not exclusive to Tamil. While they can process Tamil to a certain extent, their capabilities are inherently \n",
+ "limited.\n",
+ "This limitation arises because the training data for these models often comprise a low fraction of Tamil content \n",
+ "relative\n",
+ "to other languages. Consequently, the nuances and intricacies specific to Tamil are often lost, leading to \n",
+ "suboptimal\n",
+ "performance.\n",
+ "The effort by Mahendiran \u001b[1m(\u001b[0m\u001b[1;36m2021\u001b[0m\u001b[1m)\u001b[0m represents a notable deviation from this trend. Here, the GPT-\u001b[1;36m2\u001b[0m base model, \n",
+ "equipped\n",
+ "with \u001b[1;36m117\u001b[0m million parameters as outlined in Radford et al. \u001b[1m(\u001b[0m\u001b[1;36m2019\u001b[0m\u001b[1m)\u001b[0m, was fine-tuned with a focus on Tamil, using both \n",
+ "the\n",
+ "Oscar dataset \u001b[1m(\u001b[0mCaswell et al., \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m and The IndicNLP \u001b[1m(\u001b[0mKunchukuttan, \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m dataset. This approach signifies a \n",
+ "targeted\n",
+ "attempt to adapt LLM capabilities for the Tamil language specifically.\n",
+ "However, the broader landscape of Tamil-specific LLM research remains relatively uncharted. This context \n",
+ "underscores\n",
+ "the motivation for our present research. We endeavor to delve deeper into this space, addressing existing \n",
+ "shortcomings\n",
+ "and advancing the capabilities of LLMs tailored for Tamil.\n",
+ "\u001b[1;36m3\u001b[0m Tamil LLaMA\n",
+ "\u001b[1;36m3.1\u001b[0m Datasets Used\n",
+ "The development of Tamil-LLaMA involved using several different datasets, each chosen for specific parts of the\n",
+ "training and fine-tuning process. This approach was vital to ensure the model’s effectiveness across various tasks.\n",
+ "\u001b[1;36m23.1\u001b[0m.\u001b[1;36m1\u001b[0m Datasets used for Pre-Training\n",
+ "For the initial pre-training phase of LLaMA \u001b[1;36m2\u001b[0m \u001b[1m(\u001b[0mTouvron et al., 2023a\u001b[1m)\u001b[0m, we mainly used the CulturaX dataset \u001b[1m(\u001b[0mNguyen\n",
+ "et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. This dataset is a combination of many popular datasets, including the Oscar dataset \u001b[1m(\u001b[0mCaswell et al.,\n",
+ "\u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m.\n",
+ "Out of the \u001b[1;36m4.72\u001b[0m million documents in CulturaX, we selected 600k documents \u001b[1m(\u001b[0m\u001b[1;36m12\u001b[0m GB\u001b[1m)\u001b[0m for training. This choice was\n",
+ "made to manage training costs while aiming for high performance. Our approach was successful, as the model showed\n",
+ "strong results in text completion tasks even with this smaller dataset.\n",
+ "\u001b[1;36m3.1\u001b[0m.\u001b[1;36m2\u001b[0m Datasets used for Instruction Tuning\n",
+ "The \u001b[32m\"Instruction Tuning\"\u001b[0m phase was a pivotal stage in refining LLaMA’s proficiency in precisely adhering to textual\n",
+ "instructions. For this enhancement, we incorporated a translated version of the Stanford Alpaca dataset \u001b[1m(\u001b[0mTaori et \n",
+ "al.,\n",
+ "\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, comprising \u001b[1;36m52\u001b[0m,\u001b[1;36m000\u001b[0m instructions. Concurrently, we integrated a specialized no-code section from the OpenOrca\n",
+ "dataset \u001b[1m(\u001b[0mLian et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, which consists of around \u001b[1;36m93\u001b[0m,\u001b[1;36m000\u001b[0m instructions. The deliberate focus on no-code \n",
+ "instructions\n",
+ "was to streamline the training process, eliminating the intricacies presented by coding instructions during \n",
+ "translation.\n",
+ "To ensure translation uniformity and accuracy across the datasets, the Google Translation API service was our tool \n",
+ "of\n",
+ "choice. We meticulously translated the entirety of the Alpaca dataset while also applying a similar methodology to \n",
+ "the\n",
+ "OpenOrca subset.\n",
+ "We believe that leveraging diverse datasets has bolstered LLaMA’s enhanced capability to discern and generate\n",
+ "contextually pertinent responses across a spectrum of prompts.\n",
+ "\u001b[1;36m3.2\u001b[0m Background on the LLaMA Models\n",
+ "Introduced by Touvron et al. \u001b[1m(\u001b[0m2023a\u001b[1m)\u001b[0m, LLaMA has emerged as an essential milestone in the world of open-source\n",
+ "large language models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m, with the renowned Transformer architecture \u001b[1m(\u001b[0mVaswani et al., \u001b[1;36m2017\u001b[0m\u001b[1m)\u001b[0m as its foundation.\n",
+ "While it draws inspiration from models like GPT for its basic structure—comprising an embedding layer and multiple\n",
+ "transformer blocks—LLaMA has its unique features. LLaMA has brought forward several innovative techniques such\n",
+ "as pre-normalization \u001b[1m(\u001b[0mZhang and Sennrich, \u001b[1;36m2019\u001b[0m\u001b[1m)\u001b[0m, SwiGLU activation \u001b[1m(\u001b[0mShazeer, \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m, and rotary embeddings \u001b[1m(\u001b[0mSu\n",
+ "et al., \u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m. Offered in sizes ranging from 7B \u001b[1m(\u001b[0m\u001b[1;36m7\u001b[0m Billion\u001b[1m)\u001b[0m to 65B \u001b[1m(\u001b[0m\u001b[1;36m65\u001b[0m Billion\u001b[1m)\u001b[0m parameters, LLaMA has been trained\n",
+ "on a rich mixture of content sources, including web pages, books, and academic papers. Its strong performance on\n",
+ "benchmarks, especially given its relatively compact size compared to other models, has made it a noteworthy \n",
+ "contender\n",
+ "in the LLM landscape, drawing considerable attention in the AI research community.\n",
+ "Building upon its predecessor’s foundation, LLaMA \u001b[1;36m2\u001b[0m \u001b[1m(\u001b[0mTouvron et al., 2023b\u001b[1m)\u001b[0m introduces monumental enhancements to\n",
+ "the LLaMA lineage. With a dataset expanded by \u001b[1;36m40\u001b[0m% relative to LLaMA \u001b[1;36m1\u001b[0m, the models under LLaMA \u001b[1;36m2\u001b[0m exhibit an\n",
+ "enriched comprehension of diverse content, leading to improved text generation. An extended context length of \u001b[1;36m4\u001b[0m,\u001b[1;36m096\u001b[0m\n",
+ "tokens empowers LLaMA \u001b[1;36m2\u001b[0m to process and understand more extensive textual segments, significantly benefiting tasks\n",
+ "such as translation and intricate question answering. Another pivotal innovation in LLaMA \u001b[1;36m2\u001b[0m is adopting the \n",
+ "grouped-\n",
+ "query attention mechanism \u001b[1m(\u001b[0mAinslie et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, facilitating faster inference despite its expanded size compared \n",
+ "to\n",
+ "LLaMA \u001b[1;36m1\u001b[0m.\n",
+ "In the course of our research, we made a conscious choice to employ LLaMA \u001b[1;36m2\u001b[0m as our primary language model. Several\n",
+ "factors influenced this decision. Firstly, LLaMA \u001b[1;36m2\u001b[0m is a recent addition to the lineage of Large Language Models, \n",
+ "which\n",
+ "implies that it benefits from the latest advancements in model training and architectural innovations. This recent \n",
+ "launch\n",
+ "incorporates the most up-to-date techniques and methodologies. Secondly, compared with its predecessor, LLaMA\n",
+ "\u001b[1;36m1\u001b[0m, the enhancements in LLaMA \u001b[1;36m2\u001b[0m are undeniably compelling. These improvements are not just incremental; they\n",
+ "represent substantial strides in areas such as data exposure, context length, and attention mechanisms. The \n",
+ "evolution\n",
+ "from LLaMA \u001b[1;36m1\u001b[0m to LLaMA \u001b[1;36m2\u001b[0m is emblematic of the rapid advancements in the field, and by leveraging the latter, we\n",
+ "aimed to ensure our research was grounded in the most cutting-edge tools available.\n",
+ "\u001b[1;36m3.3\u001b[0m Expansion of Tamil Vocabulary\n",
+ "LLaMA \u001b[1;36m2\u001b[0m, as outlined in the seminal work of Touvron et al. \u001b[1m(\u001b[0m2023b\u001b[1m)\u001b[0m, is backed by an expansive pre-training corpus \n",
+ "of \u001b[1;36m2\u001b[0m\n",
+ "Trillion tokens. A detailed linguistic analysis of this vast corpus reveals a striking imbalance in language \n",
+ "representation.\n",
+ "An overwhelming \u001b[1;36m89.7\u001b[0m% of the tokens are sourced from English, with other European languages collectively \n",
+ "contributing\n",
+ "to nearly \u001b[1;36m10\u001b[0m% of the dataset. In stark contrast, diverse languages such as Tamil and Hindi represent a meager \n",
+ "presence,\n",
+ "with their combined token count along with other under-represented languages accounting for less than \u001b[1;36m0.21\u001b[0m%.\n",
+ "This skewed distribution raises concerns about the genuine multilingual and cross-lingual capabilities of LLaMA \u001b[1;36m2\u001b[0m.\n",
+ "While it is evident that the model is proficient in several European languages, its ability to comprehend and \n",
+ "generate\n",
+ "3content in languages like Tamil needs to be improved substantially. Our preliminary experiments further \n",
+ "underscored\n",
+ "this limitation. When presented with tasks in Tamil, LLaMA \u001b[1;36m2\u001b[0m exhibited a remarkable lack of coherence in its \n",
+ "responses.\n",
+ "In fact, its performance was notably inferior to smaller models, underscoring a noticeable shortcoming in LLaMA \u001b[1;36m2\u001b[0m’s\n",
+ "coverage of worldwide languages. There is a clear need for the open-source community to focus on languages like\n",
+ "Tamil, spoken by millions globally across multiple countries.\n",
+ "To bolster the text generation and understanding abilities of LLaMA \u001b[1;36m2\u001b[0m in Tamil, we advocate extending its \n",
+ "pre-training\n",
+ "phase with an expansive Tamil corpus, as recommended by Cui et al. \u001b[1m(\u001b[0m\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. However, this alone is not sufficient. A\n",
+ "limitation arises from LLaMA’s existing vocabulary, which has a tiny number of Tamil characters. Although LLaMA\n",
+ "can bypass this by encoding unknown tokens, this process considerably lengthens the sequences, leading to \n",
+ "substantial\n",
+ "delays during encoding and decoding. Typically, a single Tamil character is translated into \u001b[1;36m3\u001b[0m-\u001b[1;36m4\u001b[0m byte tokens. \n",
+ "Moreover,\n",
+ "these byte tokens are not uniquely purposed for Tamil characters but represent UTF-\u001b[1;36m8\u001b[0m tokens from various languages.\n",
+ "This dual role complicates the task for transformer encoders and byte-tokens to understand and capture the nuanced\n",
+ "semantics of Tamil characters proficiently.\n",
+ "To overcome these problems and to enhance the text generation capabilities in Tamil, we propose the incorporation \n",
+ "of\n",
+ "an additional \u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m Tamil tokens to the pre-existing vocabulary of the LLAMA \u001b[1;36m2\u001b[0m model. This methodology echoes the\n",
+ "strategies employed in developing Chinese LLaMA \u001b[1m(\u001b[0mCui et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. The subsequent steps explain the process of\n",
+ "vocabulary extension:\n",
+ "\u001b[1;36m1.\u001b[0mEmploy SentencePiece \u001b[1m(\u001b[0mKudo and Richardson, \u001b[1;36m2018\u001b[0m\u001b[1m)\u001b[0m to train a Tamil Tokenizer on an extensive corpus\n",
+ "of contemporary Tamil text, capturing the essence of modern linguistic nuances necessary for coherent\n",
+ "communication.\n",
+ "\u001b[1;36m2.\u001b[0mIntegrate the original tokenizer of the LLaMA \u001b[1;36m2\u001b[0m model with the vocabulary derived from the newly trained\n",
+ "SentencePiece tokenizer. This amalgamation culminates in an augmented tokenizer encompassing an additional\n",
+ "\u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m Tamil tokens, leading to an aggregated vocabulary size of \u001b[1;36m48\u001b[0m,\u001b[1;36m000\u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m32\u001b[0m,\u001b[1;36m000\u001b[0m original + \u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m new\u001b[1m)\u001b[0m.\n",
+ "\u001b[1;36m3.\u001b[0mDrawing parallels from Cui et al. \u001b[1m(\u001b[0m\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, the LLaMA model is then tailored to accommodate the Tamil LLaMA\n",
+ "tokenizer. This modification necessitates resizing the word embeddings and the language model head from\n",
+ "a matrix shape V ×H to V’ ×H. Herein, V represents the original vocabulary size of \u001b[1;36m32\u001b[0m,\u001b[1;36m000\u001b[0m, whereas V’\n",
+ "signifies the extended size of \u001b[1;36m48\u001b[0m,\u001b[1;36m000\u001b[0m. Importantly, this adjustment ensures the preservation of the embeddings\n",
+ "associated with the original vocabulary by appending the new rows to the concluding segments of the initial\n",
+ "embedding matrices.\n",
+ "In Figure \u001b[1;36m1\u001b[0m, we can see that the Tamil LLaMA tokenizer needs only \u001b[1;36m20\u001b[0m% to \u001b[1;36m25\u001b[0m% of the tokens that the original LLaMA\n",
+ "model uses to encode Tamil text. This makes the Tamil LLaMA much more efficient. With this crucial update, the\n",
+ "model can handle over three times more information and works three times faster. In conclusion, our modifications \n",
+ "to\n",
+ "LLaMA \u001b[1;36m2\u001b[0m significantly bolster its capabilities in understanding and generating Tamil content. By adding \u001b[1;36m16\u001b[0m,\u001b[1;36m000\u001b[0m \n",
+ "Tamil\n",
+ "tokens, we ensure a more efficient and nuanced representation. The new Tamil LLaMA tokenizer drastically reduces\n",
+ "the required tokens, making encoding more efficient.\n",
+ "Figure \u001b[1;36m1\u001b[0m: Tokenizer comparisons between original LLaMA and Tamil LLaMA.\n",
+ "\u001b[1;36m43.4\u001b[0m Pre-Training Phase\n",
+ "In order to harness the full potential of the expanded vocabulary of Tamil LLaMA, a robust pre-training phase is\n",
+ "implemented using a comprehensive Tamil text corpus. The datasets utilized during this training phase are detailed \n",
+ "in\n",
+ "\u001b[1;36m3.1\u001b[0m.\u001b[1;36m1\u001b[0m.\n",
+ "Causal Language Modelling Approach The central mechanism for this pre-training is Causal Language Modelling\n",
+ "\u001b[1m(\u001b[0mCLM\u001b[1m)\u001b[0m. This method specializes in predicting a given token xtrelying entirely on its preceding tokens. Formally, \n",
+ "the\n",
+ "objective during this training phase is to maximize the likelihood of the entire sequence, as represented by:\n",
+ "\u001b[1;35mP\u001b[0m\u001b[1m(\u001b[0mx1, x2, . . . , x T\u001b[1m)\u001b[0m =TY\n",
+ "\u001b[33mt\u001b[0m=\u001b[1;35m1P\u001b[0m\u001b[1m(\u001b[0mxt|x1, x2, . . . , x t−\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m\n",
+ "Breaking down the elements of this equation:\n",
+ "•x1, x2, . . . , x T: The individual tokens that constitute the sequence.\n",
+ "•\u001b[1;35mP\u001b[0m\u001b[1m(\u001b[0mxt|x1, x2, . . . , x t−\u001b[1;36m1\u001b[0m\u001b[1m)\u001b[0m: Represents the conditional probability of the token xt, which depends on the preced-\n",
+ "ing tokens in the sequence.\n",
+ "Significance of the CLM in Language Adaptation The CLM stage is integral to enhancing LLaMA’s capability in\n",
+ "Tamil and other languages. It facilitates the model in learning the intricate syntactic patterns, semantic \n",
+ "subtleties, and\n",
+ "unique linguistic features of Tamil. Due to its autoregressive characteristics, the CLM mimics the human approach \n",
+ "to\n",
+ "comprehending and generating language, which is primarily shaped by the previous context. Hence, at the end of this\n",
+ "initial training period, LLaMA becomes capable of interpreting and creating Tamil text that is pertinent to the \n",
+ "given\n",
+ "context. This sets a strong foundation for further fine-tuning and specific task-based training sessions.\n",
+ "\u001b[1;36m3.5\u001b[0m Fine-Tuning Phase\n",
+ "Following the foundational pre-training phase, the fine-tuning phase emerges as a crucial step, especially for \n",
+ "modern\n",
+ "Large Language Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m deployed in real-world scenarios. A broad understanding of language structure and\n",
+ "semantics, while essential, does not suffice for such applications. This gap is addressed by instruction \n",
+ "fine-tuning, a\n",
+ "tailored process enabling LLMs to interpret and execute task-oriented instructions conveyed in natural language. \n",
+ "Rather\n",
+ "than the traditional approach of adapting to specific datasets, instruction fine-tuning focuses on a wide array of \n",
+ "tasks\n",
+ "articulated through language, ensuring the LLM’s adaptability without task-specific alterations. The datasets \n",
+ "employed\n",
+ "in this phase are elaborated in Section \u001b[1;36m3.1\u001b[0m.\u001b[1;36m2\u001b[0m.\n",
+ "Instruction fine-tuning’s transformative essence lies in its ability to enhance an LLM’s dynamism and \n",
+ "responsiveness.\n",
+ "While pre-training equips the model with general linguistic proficiency, instruction fine-tuning refines it to \n",
+ "interact\n",
+ "seamlessly with users through natural language, bridging the gap between overarching language mastery and nuanced,\n",
+ "task-specific agility.\n",
+ "The instruction format employed closely resembles the one described in the original Alpaca dataset \u001b[1m(\u001b[0mTaori et al., \n",
+ "\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m.\n",
+ "Both prompt templates suggested by Alpaca have been utilized: one that includes an input field within the \n",
+ "instruction\n",
+ "and another that does not. The prompt templates used during training are given in Figure \u001b[1;36m2\u001b[0m.\n",
+ "It is essential to clarify that in both templates, the first line signifies the system prompts. For the Alpaca \n",
+ "dataset \u001b[1m(\u001b[0mTaori\n",
+ "et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, we utilize the two system prompts as mentioned in Figure \u001b[1;36m2\u001b[0m. However, for the OpenOrca subset \u001b[1m(\u001b[0mLian\n",
+ "et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m, a distinct approach is taken: given that this subset already includes a dedicated field for the \n",
+ "system prompt\n",
+ "within its dataset, we utilize that specific prompt.\n",
+ "\u001b[1;36m3.6\u001b[0m Experimental Setup and Training Details\n",
+ "\u001b[1;36m3.6\u001b[0m.\u001b[1;36m1\u001b[0m LoRA Approach for Pre-Training and Fine-Tuning\n",
+ "LoRA \u001b[1m(\u001b[0mLow-Rank Adapters\u001b[1m)\u001b[0m is a technique that offers an efficient pathway to fine-tuning large language models, as\n",
+ "introduced by Hu et al. \u001b[1m(\u001b[0m\u001b[1;36m2021\u001b[0m\u001b[1m)\u001b[0m. This approach is especially beneficial for its computational efficiency, enabling \n",
+ "the\n",
+ "fine-tuning of language models without the need for extensive GPU resources. We employed the LoRA method to\n",
+ "moderate training expenses while also accelerating the training timeline. Training the complete set of parameters\n",
+ "for models like LLaMA can be exceedingly expensive and resource-intensive, which is often beyond the budget of\n",
+ "individual research teams or small organizations.\n",
+ "5Figure \u001b[1;36m2\u001b[0m: Prompt Template for Instruction Tasks\n",
+ "\u001b[1;36m1\u001b[0m. Prompt T emplate Without Input\n",
+ "ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என ் று கூறும ் அறB-\n",
+ "வுைரகீேழஉள ் ளது. ேவண ் டுேகாைளப ் ெபாருத ் தமாகநிைறவுெசய ் -\n",
+ "கின ் ற பதில ் ஒன ் ைற எழுதுக.\n",
+ "### Instruction:\n",
+ "\u001b[1m{\u001b[0minstruction\u001b[1m}\u001b[0m\n",
+ "### Response:\n",
+ "\u001b[1m{\u001b[0moutput\u001b[1m}\u001b[0m\n",
+ "\u001b[1;36m2\u001b[0m. Prompt T emplate With Input\n",
+ "ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என ் று கூறும ் அறB-\n",
+ "வுைர கீேழ உள ் ளது. ேமலும ் விரிவான பின ் னணிைய வழங ் கும ் ஓர ்\n",
+ "உள ் ளீடும ் ெகாடுக ் கப ் பட ் டுள ் ளது. ேவண ் டுேகாைளப ் ெபாருத ் தமாக\n",
+ "நிைறவு ெசய ் கின ் ற பதில ் ஒன ் ைற எழுதுக.\n",
+ "### Instruction:\n",
+ "\u001b[1m{\u001b[0minstruction\u001b[1m}\u001b[0m\n",
+ "### Input:\n",
+ "\u001b[1m{\u001b[0minput\u001b[1m}\u001b[0m\n",
+ "### Response:\n",
+ "\u001b[1m{\u001b[0moutput\u001b[1m}\u001b[0m\n",
+ "\u001b[1;36m3.6\u001b[0m.\u001b[1;36m2\u001b[0m Experimental Setups for Pre-Training\n",
+ "The foundational models of Tamil LLaMA are initiated with the original LLaMA weights and undergo pre-training\n",
+ "using the fp16precision setting for both the 7B2and 13B3parameter versions. We utilize 12GB of Tamil text sourced\n",
+ "from Nguyen et al. \u001b[1m(\u001b[0m\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m during this pre-training phase. Further insights on the dataset can be found in section \n",
+ "\u001b[1;36m3.1\u001b[0m.\u001b[1;36m1\u001b[0m.\n",
+ "Our pre-training strategy incorporates the LoRA method Hu et al. \u001b[1m(\u001b[0m\u001b[1;36m2021\u001b[0m\u001b[1m)\u001b[0m, where we integrate LoRA adapters into the\n",
+ "attention vectors and subsequently train the embeddings, LM heads, and the newly incorporated LoRA parameters. A\n",
+ "noteworthy deviation from the methodology of the Chinese LLaMA \u001b[1m(\u001b[0mCui et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m in our approach is the \n",
+ "elimination\n",
+ "of the initial exclusive training of embeddings. Instead of following it with a two-stage LoRA training of \n",
+ "attention\n",
+ "blocks, embeddings, and LM heads, we’ve opted for a streamlined approach to curb costs.\n",
+ "For the training infrastructure, we harnessed an Nvidia A100 GPU with 80GB of VRAM. The models were trained for\n",
+ "\u001b[1;36m1\u001b[0m epoch on the entire dataset, and the training time spanned \u001b[1;36m48\u001b[0m hours for 7B model and \u001b[1;36m60\u001b[0m hours for the 13B model \n",
+ "on\n",
+ "Microsoft Azure’s Standard NC24adsA 100v4instance.\n",
+ "The detailed hyperparameters used for training are listed in Table \u001b[1;36m1\u001b[0m.\n",
+ "\u001b[1;36m3.6\u001b[0m.\u001b[1;36m3\u001b[0m Experimental Setups for Instruction Fine-Tuning\n",
+ "The 7B4and 13B5models, once pre-trained, undergo fine-tuning in alignment with the procedures outlined in Section\n",
+ "\u001b[1;36m3.5\u001b[0m. The datasets employed for this phase are elaborated upon in Section \u001b[1;36m3.1\u001b[0m.\u001b[1;36m2\u001b[0m. We persist with the LoRA \n",
+ "methodology\n",
+ "for fine-tuning, executing it under the fp16precision setting for both models. Our datasets comprise translated \n",
+ "variants\n",
+ "of Alpaca \u001b[1m(\u001b[0mTaori et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m and a select subset from OpenOrca \u001b[1m(\u001b[0mLian et al., \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m.\n",
+ "2Tamil LLaMA 7B Pretrained: \u001b[4;94mhttps://huggingface.co/abhinand/tamil-llama-7b-base-v0.1\u001b[0m\n",
+ "3Tamil LLaMA 13B Pretrained: \u001b[4;94mhttps://huggingface.co/abhinand/tamil-llama-13b-base-v0.1\u001b[0m\n",
+ "4Tamil LLaMA 7B Instruct: \u001b[4;94mhttps://huggingface.co/abhinand/tamil-llama-7b-instruct-v0.1\u001b[0m\n",
+ "5Tamil LLaMA 13B Instruct: \u001b[4;94mhttps://huggingface.co/abhinand/tamil-llama-13b-instruct-v0.1\u001b[0m\n",
+ "6Table \u001b[1;36m1\u001b[0m: Pre-Training Hyperparameters\n",
+ "Configurations 7B 13B\n",
+ "Training Data 12GB 4GB\n",
+ "Epochs \u001b[1;36m1\u001b[0m \u001b[1;36m1\u001b[0m\n",
+ "Batch Size \u001b[1;36m64\u001b[0m \u001b[1;36m64\u001b[0m\n",
+ "Initial Learning Rate \u001b[1;36m2e-4\u001b[0m \u001b[1;36m2e-4\u001b[0m\n",
+ "Max Sequence Length \u001b[1;36m512\u001b[0m \u001b[1;36m512\u001b[0m\n",
+ "LoRA Rank \u001b[1;36m64\u001b[0m \u001b[1;36m64\u001b[0m\n",
+ "LoRA Alpha \u001b[1;36m128\u001b[0m \u001b[1;36m128\u001b[0m\n",
+ "LoRA Target Modules QKVO, MLP QKVO, MLP\n",
+ "Training Precision FP16 FP16\n",
+ "In a bid to augment the models’ proficiency with Tamil-centric literature, cultural nuances, and historical \n",
+ "contexts, we\n",
+ "leverage a tailored dataset sourced from Wikipedia. Additionally, to extract instructions from this text, we \n",
+ "utilize the\n",
+ "Self-Instruct method, as highlighted in Wang et al. \u001b[1m(\u001b[0m\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m. This approach involves the GPT-\u001b[1;36m4\u001b[0m \u001b[1m(\u001b[0mOpenAI, \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m APIs\n",
+ "from OpenAI to generate the new instruction dataset. It is crucial to note that the system prompts, referenced in \n",
+ "Section\n",
+ "\u001b[1;36m3.1\u001b[0m.\u001b[1;36m2\u001b[0m, remain consistent during this supplemental fine-tuning phase. For the hardware, the same A100 GPU with 80GB\n",
+ "of VRAM was utilized.\n",
+ "In summary, our fine-tuning approach employs a new translated dataset consisting of roughly \u001b[1;36m145\u001b[0m,\u001b[1;36m000\u001b[0m instructions. A\n",
+ "detailed account of the hyperparameters used for fine-tuning can be found in the Table \u001b[1;36m2\u001b[0m.\n",
+ "Table \u001b[1;36m2\u001b[0m: Fine-tuning Hyperparameters\n",
+ "Configurations 7B 13B\n",
+ "Training Data 145k 145k\n",
+ "Epochs \u001b[1;36m2\u001b[0m \u001b[1;36m1\u001b[0m\n",
+ "Batch Size \u001b[1;36m64\u001b[0m \u001b[1;36m64\u001b[0m\n",
+ "Dropout Rate \u001b[1;36m0.1\u001b[0m \u001b[1;36m0.1\u001b[0m\n",
+ "Initial Learning Rate \u001b[1;36m2e-4\u001b[0m \u001b[1;36m2e-4\u001b[0m\n",
+ "Max Sequence Length \u001b[1;36m512\u001b[0m \u001b[1;36m512\u001b[0m\n",
+ "LoRA Rank \u001b[1;36m64\u001b[0m \u001b[1;36m64\u001b[0m\n",
+ "LoRA Alpha \u001b[1;36m128\u001b[0m \u001b[1;36m128\u001b[0m\n",
+ "LoRA Target Modules QKVO, MLP QKVO, MLP\n",
+ "Training Precision FP16 FP16\n",
+ "\u001b[1;36m4\u001b[0m Results on Instruction Following Tasks\n",
+ "\u001b[1;36m4.1\u001b[0m Task Design and Evaluation Method\n",
+ "Evaluating the outcomes of text generation tasks is intricate due to their multifaceted formats, distinguishing \n",
+ "them\n",
+ "from typical Natural Language Understanding \u001b[1m(\u001b[0mNLU\u001b[1m)\u001b[0m tasks. Drawing inspiration from previous studies that employed\n",
+ "GPT-\u001b[1;36m4\u001b[0m \u001b[1m(\u001b[0mOpenAI, \u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m for scoring, we similarly engage GPT-\u001b[1;36m4\u001b[0m to assign a grade on a \u001b[1;36m10\u001b[0m-point scale to each instance.\n",
+ "This approach is more efficient than human evaluations. However, understanding the potential inaccuracies of \n",
+ "GPT-\u001b[1;36m4\u001b[0m’s\n",
+ "evaluations, we supplement its scores with manual reviews, adjusting them as necessary. Such hands-on inspections\n",
+ "affirm the consistency and authenticity of the scores, ensuring they genuinely mirror the efficacy of the models \n",
+ "under\n",
+ "review.\n",
+ "With the GPT-\u001b[1;36m4\u001b[0m-based scoring and manual verifications, we have established a robust evaluation framework for our\n",
+ "Tamil LLaMA. Our assessment suite is diligently designed to provide a basic evaluation of Tamil LLaMA. This suite\n",
+ "comprises over \u001b[1;36m120\u001b[0m diverse examples, covering areas such as Question Answering, Reasoning, Literature, \n",
+ "Entertainment,\n",
+ "Translation, Programming, and Ethics, among others. The overall score for a specific task is computed by summing\n",
+ "the scores from its constituent samples and normalizing it to a \u001b[1;36m100\u001b[0m-point scale. Such an approach ensures a \n",
+ "holistic\n",
+ "reflection of the models’ capabilities across varying tasks, yielding a well-rounded measure of their overall \n",
+ "performance.\n",
+ "\u001b[1;36m74.2\u001b[0m Generation Parameters\n",
+ "The choice of generation parameters during inference greatly affects the caliber of the results in tasks involving \n",
+ "text\n",
+ "generation. Additionally, the degree of quantization can also affect performance. Below are the generation \n",
+ "parameters\n",
+ "we adopted for model evaluations:\n",
+ "•Quantization Config : The model is loaded in \u001b[1;36m8\u001b[0m−bit, with the torch data type specified as bfloat \u001b[1;36m16\u001b[0m.\n",
+ "•Context Size: The context size is maintained at the model’s default of \u001b[1;36m4096\u001b[0m tokens.\n",
+ "•Temperature: We assign a temperature value of \u001b[1;36m0.2\u001b[0m to guide the randomness during sampling. A lower\n",
+ "temperature prompts the model to produce more deterministic outputs, whereas a higher value boosts diversity,\n",
+ "potentially compromising coherence. For creative instructions, we adjust the temperature to \u001b[1;36m0.7\u001b[0m to encourage\n",
+ "varied outputs.\n",
+ "•Top-k Sampling : With k set to \u001b[1;36m50\u001b[0m, the model selects its succeeding token from the \u001b[1;36m50\u001b[0m most probable candidates,\n",
+ "introducing a level of unpredictability and variety to the resulting text.\n",
+ "•Top-p Sampling : Complementing Top-k sampling, we employ Top-p sampling with a threshold of \u001b[1;36m0.90\u001b[0m. This\n",
+ "ensures the model weighs a fluid set of tokens, which, combined, represent \u001b[1;36m90\u001b[0m\n",
+ "•Maximum Sequence Length : To keep the output concise and pertinent, we cap the generated sequence at \u001b[1;36m512\u001b[0m\n",
+ "tokens.\n",
+ "•Repetition Penalty : A repetition penalty of \u001b[1;36m1.1\u001b[0m is applied to deter the model from producing redundant text,\n",
+ "disincentivizing previously chosen tokens.\n",
+ "For these evaluations, we utilized a Google Colab notebook powered by a T4 GPU.\n",
+ "\u001b[1;36m4.3\u001b[0m Results from Instruction Tasks\n",
+ "The evaluation scores of the Tamil LLaMA models, as rated by GPT-\u001b[1;36m4\u001b[0m, are presented in Table \u001b[1;36m3\u001b[0m. A noteworthy\n",
+ "observation during our evaluation is the superior performance of our models compared to gpt-\u001b[1;36m3.5\u001b[0m-turbo in manual\n",
+ "assessments, which is further reinforced by the commendable scores in GPT-\u001b[1;36m4\u001b[0m’s evaluations. However, it is essential\n",
+ "to\n",
+ "consider that GPT-\u001b[1;36m4\u001b[0m might inherently favor responses from other GPT model lineages. Even though our model excels in\n",
+ "numerous tasks, there are areas of exception, such as ethics, and this was anticipated, given that we did not \n",
+ "undertake\n",
+ "any alignment efforts. Challenges in literature/entertainment and other areas can be attributed to data limitations\n",
+ "during\n",
+ "the pre-training phase, primarily due to cost constraints. Despite these nuances, our models establish a robust \n",
+ "foundation\n",
+ "for subsequent enhancements and progress in large language models tailored to Tamil.\n",
+ "Table \u001b[1;36m3\u001b[0m: GPT-\u001b[1;36m4\u001b[0m rated performance scores for different models on Tamil instructions\n",
+ "Task Type Tamil-LLaMA-7B Tamil-LLaMA-13B gpt-\u001b[1;36m3.5\u001b[0m-turbo\n",
+ "Question Answering \u001b[1;36m77.00\u001b[0m \u001b[1;36m75.33\u001b[0m \u001b[1;36m54.33\u001b[0m\n",
+ "Open-ended QA \u001b[1;36m84.47\u001b[0m \u001b[1;36m85.26\u001b[0m \u001b[1;36m58.68\u001b[0m\n",
+ "Reasoning \u001b[1;36m47.50\u001b[0m \u001b[1;36m64.25\u001b[0m \u001b[1;36m63.50\u001b[0m\n",
+ "Literature \u001b[1;36m45.50\u001b[0m \u001b[1;36m40.00\u001b[0m \u001b[1;36m71.00\u001b[0m\n",
+ "Entertainment \u001b[1;36m43.33\u001b[0m \u001b[1;36m50.00\u001b[0m \u001b[1;36m60.00\u001b[0m\n",
+ "Creative Writing \u001b[1;36m92.50\u001b[0m \u001b[1;36m95.62\u001b[0m \u001b[1;36m59.69\u001b[0m\n",
+ "Translation \u001b[1;36m60.56\u001b[0m \u001b[1;36m66.67\u001b[0m \u001b[1;36m92.78\u001b[0m\n",
+ "Coding \u001b[1;36m63.57\u001b[0m \u001b[1;36m76.07\u001b[0m \u001b[1;36m57.14\u001b[0m\n",
+ "Ethics \u001b[1;36m23.75\u001b[0m \u001b[1;36m57.50\u001b[0m \u001b[1;36m40.00\u001b[0m\n",
+ "Overall \u001b[1;36m63.83\u001b[0m \u001b[1;36m71.17\u001b[0m \u001b[1;36m61.33\u001b[0m\n",
+ "By observing Table \u001b[1;36m3\u001b[0m, several intriguing outcomes emerge. Notably, the gpt-\u001b[1;36m3.5\u001b[0m-turbo , despite its prowess in \n",
+ "numerous\n",
+ "languages, appears to be eclipsed by the Tamil LLaMA models in multiple domains. A standout observation was\n",
+ "the Ethics category, where the gpt-\u001b[1;36m3.5\u001b[0m-turbo model demonstrated a propensity to respond to potentially dangerous\n",
+ "queries in Tamil. Additionally, in the Coding section, the gpt-\u001b[1;36m3.5\u001b[0m-turbo ’s responses either seemed to exhibit a \n",
+ "lack of\n",
+ "comprehension or overlooked critical details, leading to a subdued score. While gpt-\u001b[1;36m3.5\u001b[0m-turbo excels in tasks \n",
+ "related to\n",
+ "English and other languages, its performance in the context of Tamil reveals areas for weaknesses.\n",
+ "\u001b[1;36m84.3\u001b[0m.\u001b[1;36m1\u001b[0m Reasoning:\n",
+ "In reasoning tasks, the models demonstrate commendable performance. While minor discrepancies occasionally arise in\n",
+ "areas such as dates, quantities, and formulas, they predominantly excel in reasoning exercises. According to our \n",
+ "manual\n",
+ "evaluations, even our smaller Tamil-LLaMA 7B model surpasses the performance of the much larger LLaMA \u001b[1;36m2\u001b[0m 70B in\n",
+ "Tamil text generation. In comparison, even gpt-\u001b[1;36m3.5\u001b[0m-turbo \u001b[1m(\u001b[0mOpenAI, \u001b[1;36m2022\u001b[0m\u001b[1m)\u001b[0m often falters in several reasoning \n",
+ "instructions,\n",
+ "producing outputs that miss the mark in relevance, clarity, fluency, and accuracy. This inadequacy in performance \n",
+ "is\n",
+ "also observed in LLaMA \u001b[1;36m2\u001b[0m 70B, rendering their generated Tamil text less beneficial. Examples of responses related \n",
+ "to\n",
+ "reasoning tasks are given in the Figure \u001b[1;36m5\u001b[0m.\n",
+ "We conducted our comparisons with LLaMA \u001b[1;36m2\u001b[0m 70B using the model hosted by Perplexity Labs.\n",
+ "\u001b[1;36m4.3\u001b[0m.\u001b[1;36m2\u001b[0m Translation:\n",
+ "For translation tasks, our models exhibit satisfactory performance, particularly when translating from a foreign \n",
+ "language\n",
+ "to Tamil. However, the accuracy diminishes when translating from Tamil to other languages—a shortcoming we aim to\n",
+ "address in future iterations. Based on our manual evaluations, our models outperform the original LLaMA \u001b[1;36m2\u001b[0m 70B in\n",
+ "Tamil text translations. However, their efficacy is roughly on par with gpt-\u001b[1;36m3.5\u001b[0m-turbo . Examples of outputs for \n",
+ "translation\n",
+ "tasks are given in Figure \u001b[1;36m6\u001b[0m.\n",
+ "\u001b[1;36m4.3\u001b[0m.\u001b[1;36m3\u001b[0m Code Generation:\n",
+ "Our models exhibit impressive performance in code generation tasks despite the limited code instructions present\n",
+ "in the training dataset. They capably provide coherent explanations in Tamil for the generated code. Based on our\n",
+ "hands-on evaluations, our models markedly surpass the performance of the more sizable LLaMA \u001b[1;36m2\u001b[0m 70B model, which\n",
+ "when instructed in Tamil, often either misconstrues the task or produces erroneous answers in English. However, it \n",
+ "is\n",
+ "important to highlight that our model is not tailored for coding tasks. While it handles more straightforward \n",
+ "problems\n",
+ "adeptly, it encounters challenges with more intricate ones. Example responses from our models for Code Generation\n",
+ "tasks can be found in Figure \u001b[1;36m7\u001b[0m.\n",
+ "\u001b[1;36m4.3\u001b[0m.\u001b[1;36m4\u001b[0m Open Question Answering\n",
+ "In open question answering tasks, much like in reasoning, the model displays a commendable performance. Despite\n",
+ "occasional inaccuracies in areas like dates and other factual information, its proficiency often exceeded our \n",
+ "expectations,\n",
+ "delivering surprising results on multiple instances. Example responses from our models for Open Question Answering\n",
+ "tasks can be found in Figure \u001b[1;36m8\u001b[0m.\n",
+ "\u001b[1;36m4.3\u001b[0m.\u001b[1;36m5\u001b[0m Creative Writing \u001b[35m/\u001b[0m Text Generation\n",
+ "Text generation is a foundational capability for Large Language Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m, with creative text generation—such \n",
+ "as\n",
+ "crafting letters or applications—being a particularly notable use case. In general, larger models have an edge in \n",
+ "this\n",
+ "domain, often outshining their smaller counterparts. The quality and quantity of training data play pivotal roles \n",
+ "in this\n",
+ "context. While the sheer volume of data can improve performance, the richness and quality of the data are equally \n",
+ "vital.\n",
+ "With abundant high-quality training data, even smaller models can sometimes surpass the performance of larger ones.\n",
+ "In our experiments, our models showed decent performance in standard tasks. However, they faced challenges when\n",
+ "assigned with more complicated tasks. Example responses from our models for Creative Writing tasks can be found in\n",
+ "Figure \u001b[1;36m9\u001b[0m.\n",
+ "\u001b[1;36m4.3\u001b[0m.\u001b[1;36m6\u001b[0m Mathematical reasoning\n",
+ "Mathematical reasoning presents a significant challenge for our models. Like many Large Language Models \u001b[1m(\u001b[0mLLMs\u001b[1m)\u001b[0m,\n",
+ "they don’t excel in handling mathematical tasks. From our hands-on experiments, we observed that the performance of\n",
+ "our models, mainly when dealing with Tamil, lagged behind that of the original English LLaMA models. Recognizing\n",
+ "this as an area of improvement, we intend to prioritize and enhance the model’s capabilities in subsequent \n",
+ "iterations.\n",
+ "Examples of outputs for mathematical reasoning tasks are given in Figure \u001b[1;36m10\u001b[0m.\n",
+ "\u001b[1;36m4.4\u001b[0m Results from Natural Language Understanding \u001b[1m(\u001b[0mNLU\u001b[1m)\u001b[0m tasks\n",
+ "Understanding natural language \u001b[1m(\u001b[0mNLU\u001b[1m)\u001b[0m is a vital element within the field of natural language processing \u001b[1m(\u001b[0mNLP\u001b[1m)\u001b[0m that\n",
+ "enables computers to comprehend and interpret human language. NLU focuses on comprehending and extracting\n",
+ "9meaning from text, whereas text generation is concerned with generating human-like text based on a given input, \n",
+ "often\n",
+ "without any specific understanding of the text’s meaning.\n",
+ "To ascertain the prowess of a model, its performance in Natural Language Understanding \u001b[1m(\u001b[0mNLU\u001b[1m)\u001b[0m tasks is paramount.\n",
+ "However, the availability of standard benchmarks for Tamil in this domain remains sparse. Notable exceptions \n",
+ "include\n",
+ "the IndicNLP \u001b[1m(\u001b[0mKunchukuttan, \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m, IndicNLP Corpus \u001b[1m(\u001b[0mKunchukuttan et al., \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m, and IndicSentiment \u001b[1m(\u001b[0mAI4Bharat,\n",
+ "\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m datasets. We opted to assess our models utilizing the test set from the IndicSentiment dataset \u001b[1m(\u001b[0mAI4Bharat, \n",
+ "\u001b[1;36m2023\u001b[0m\u001b[1m)\u001b[0m,\n",
+ "and a text classification dataset sourced from the IndicNLP Corpus \u001b[1m(\u001b[0mKunchukuttan et al., \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m.\n",
+ "The test set of the IndicSentiment dataset encompasses \u001b[1;36m1\u001b[0m,\u001b[1;36m000\u001b[0m sentiment samples in Tamil. It is important to note \n",
+ "that\n",
+ "our evaluation was concentrated solely on this Tamil subset.\n",
+ "Figure \u001b[1;36m3\u001b[0m: Performance comparison on the IndicSentiment-7B dataset\n",
+ "From Figure \u001b[1;36m3\u001b[0m, it is evident that our Tamil LLaMA model remarkably surpasses the original LLaMA in this specific\n",
+ "NLU task. The latter’s performance mirrors that of random guessing, registering an accuracy of \u001b[1;36m50.5\u001b[0m%. In stark \n",
+ "contrast,\n",
+ "our model impressively scores an accuracy of \u001b[1;36m81.3\u001b[0m%. This enhanced NLU capability underscores the efficacy of our\n",
+ "methodologies—such as vocabulary expansion and retraining in facilitating the model to comprehend a new language\n",
+ "like Tamil with heightened proficiency.\n",
+ "We further extended our evaluation to the iNLTK Headline Classification subset within the IndicNLP suite \u001b[1m(\u001b[0mKakwani\n",
+ "et al., \u001b[1;36m2020\u001b[0m\u001b[1m)\u001b[0m. It is essential to highlight that our analysis was focused strictly on the Tamil language subset of \n",
+ "this dataset.\n",
+ "The outcomes of this evaluation are graphically depicted in Figure \u001b[1;36m4\u001b[0m.\n",
+ "Insight from Figure \u001b[1;36m4\u001b[0m reveals that the original LLaMA model’s performance aligns closely with random predictions.\n",
+ "In contrast, our Tamil LLaMA model showcases a compelling lead, achieving an accuracy rate of \u001b[1;36m80.12\u001b[0m%, further\n",
+ "affirming its superior capability in natural language understanding.\n",
+ "\u001b[1;36m5\u001b[0m Limitations\n",
+ "The Tamil LLaMA suite of models we introduce in this paper heralds several advancements in Tamil language \n",
+ "processing.\n",
+ "However, in the spirit of rigorous research, it is imperative to discuss the inherent limitations accompanying \n",
+ "these\n",
+ "models.\n",
+ "10Figure \u001b[1;36m4\u001b[0m: Performance comparison on the IndicGLUE Text Classification dataset\n",
+ "•Constrained Knowledge Base : Due to computational and cost constraints, our models were trained on a\n",
+ "relatively limited Tamil dataset. This translates to gaps in the models’ knowledge, especially regarding nuances\n",
+ "and specifics native to Tamil culture and literature. While the current version lays the foundation, the true\n",
+ "potential can be unlocked with access to a broader data spectrum, enriching its contextual understanding.\n",
+ "•Ethical Concerns : Detoxification procedures were not implemented in our training process, making these\n",
+ "models prone to generating potentially harmful or offensive content. Their uncensored nature necessitates\n",
+ "caution during deployment.\n",
+ "•Lack of Robustness : Our models may, at times, produce outputs that veer off-topic or deviate substantially\n",
+ "from anticipated responses. This vulnerability is more pronounced under adversarial conditions or tricky\n",
+ "prompts.\n",
+ "•Reasoning and Mathematical Challenges : While our models showcase competence in specific reasoning\n",
+ "scenarios, they falter in many others, underscoring the repercussions of not having a comprehensive training\n",
+ "set.\n",
+ "•Over-Generation Tendencies : On occasions, the models tend to generate verbose content, extending beyond\n",
+ "logical termination points, leading to potential redundancy.\n",
+ "•Evaluation Hurdles : Assessment of LLMs is a crucial yet challenging endeavor. The scarcity of standardized\n",
+ "benchmarks, particularly for languages like Tamil, which are outside the European linguistic group, complicates\n",
+ "comparative evaluations. Although we propose an evaluative approach tailored for Tamil within this paper, it\n",
+ "is not exhaustive enough to gauge models’ efficacy across diverse domains.\n",
+ "•Translation Loss : Given that the instructional prompts used for fine-tuning the Tamil LLaMA base models are\n",
+ "derived from English datasets translated into Tamil, there is a potential for nuanced inaccuracies—commonly\n",
+ "referred to as translation loss. This can potentially affect the models’ abilities in both text generation and\n",
+ "comprehension due to subtle shifts in meaning that can occur during the translation process.\n",
+ "While some of these challenges are addressable in subsequent iterations, we envision this work serving as an \n",
+ "anchor,\n",
+ "inspiring the research community to propel advancements in LLMs for Indian languages.\n",
+ "\u001b[1;36m116\u001b[0m Conclusion\n",
+ "In this research endeavor, we have not only filled a critical void in the domain of Tamil text generation but have \n",
+ "also\n",
+ "elevated the status of this venerable language within the realm of large language models with the advent of our \n",
+ "Tamil\n",
+ "LLaMA.To assess the performance of our models, we curated an evaluation dataset consisting of \u001b[1;36m120\u001b[0m Tamil \n",
+ "instructions\n",
+ "covering a wide range of topics. We then employed GPT-\u001b[1;36m4\u001b[0m to assess and rate the responses generated by our model. \n",
+ "The\n",
+ "7B variant of our model has surpassed the performance of OpenAI’s gpt-\u001b[1;36m3.5\u001b[0m-turbo in tasks involving Tamil \n",
+ "instructions\n",
+ "within our evaluation methodology. Even more impressively, the 13B iteration has outperformed its counterparts,\n",
+ "demonstrating an almost \u001b[1;36m10\u001b[0m% higher proficiency in these tasks.\n",
+ "The significance of our findings is accentuated by the efficiency of our models in generating Tamil text. Equipped \n",
+ "with\n",
+ "a refined tokenizer, the 7B and 13B variants demonstrate exceptional proficiency, eclipsing the original LLaMA \n",
+ "models\n",
+ "in processing speed without sacrificing textual quality. This stride is not just a modest step forward but a major \n",
+ "leap in\n",
+ "the models’ ability to process and generate Tamil language content, thus forging a new avenue for practical \n",
+ "applications\n",
+ "that necessitate precision and promptness.\n",
+ "Nevertheless, our exploration in this field continues. We acknowledge the triumphs and yet remain mindful of the\n",
+ "impending obstacles. The possibility of generating unintended or adverse content, the financial limitations \n",
+ "associated\n",
+ "with model training, and the need for comprehensive evaluation benchmarks tailored for Tamil are challenges we are\n",
+ "committed to surmounting.\n",
+ "As we chart our path forward, we acknowledge the value of continuous learning and adaptation. We aspire to delve\n",
+ "deeper into strategies to mitigate the limitations observed, perhaps even leveraging advancements like \n",
+ "Reinforcement\n",
+ "Learning With Human Feedback \u001b[1m(\u001b[0mRLHF\u001b[1m)\u001b[0m and more advanced quantization techniques. The horizon looks promising,\n",
+ "and we remain committed to elevating the capabilities of LLaMA for Tamil and other Indian languages.\n",
+ "Appendix\n",
+ "Example Outputs\n",
+ "Figure \u001b[1;36m5\u001b[0m: Example Output for Reasoning Tasks\n",
+ "Prompt ேமைஜயில ் \u001b[1;36m4\u001b[0m ஆப ் பிள ் கள ் இருந ் தன. குமார ் ஒன ் ைறச ்\n",
+ "சாப ் பிட ் டான ் , ராம ் எடுத ் துச ் ெசன ் றான ் \u001b[1;36m2\u001b[0m. மீதி எத ் -\n",
+ "தைன ஆப ் பிள ் கள ் உள ் ளன?\n",
+ "7B Model முதலில ் , நாம ் \u001b[1;36m4\u001b[0m - \u001b[1;36m1\u001b[0m = \u001b[1;36m3\u001b[0m ஆப ் பிள ் கைள எடுத ் துக ் ெகாள ் -\n",
+ "ளலாம ் , ஏெனனில ் குமார ் ஒன ் று சாப ் பிட ் டார ் . அடுத ் து,\n",
+ "ராம ் \u001b[1;36m2\u001b[0m ஆப ் பிள ் கைள எடுத ் துச ் ெசன ் றார ் , எனேவ மீத-\n",
+ "முள ் ள ஆப ் பிள ் களின ் எண ் ணிக ் ைக \u001b[1;36m3\u001b[0m - \u001b[1;36m2\u001b[0m = \u001b[1;36m1\u001b[0m ஆகும ் .\n",
+ "எனேவ, மீதமுள ் ள ஆப ் பிள ் \u001b[1;36m1\u001b[0m உள ் ளது.\n",
+ "13B Model இப ் ேபாது, இந ் தக ் ேகள ் விக ் கு பதிலளிக ் க, நாம ் சூத ் தி-\n",
+ "ரத ் ைதப ் பயன ் படுத ் தலாம ் :\n",
+ "ெமாத ் த ஆப ் பிள ் கள ் = ஆரம ் ப ஆப ் பிள ் கள ் - சாப ் பிட ் ட\n",
+ "ஆப ் பிள ் கள ் - எடுத ் துச ் ெசன ் ற ஆப ் பிள ் கள ்\n",
+ "ஆரம ் ப ஆப ் பிள ் கள ் = \u001b[1;36m4\u001b[0m\n",
+ "சாப ் பிட ் ட ஆப ் பிள ் கள ் = \u001b[1;36m1\u001b[0m\n",
+ "எடுக ் கப ் பட ் ட ஆப ் பிள ் கள ் = \u001b[1;36m2\u001b[0m\n",
+ "இப ் ேபாது இந ் த மதிப ் புகைள சூத ் திரத ் தில ் ெசருகலாம ் :\n",
+ "ெமாத ் த ஆப ் பிள ் கள ் = \u001b[1;36m4\u001b[0m - \u001b[1;36m1\u001b[0m - \u001b[1;36m2\u001b[0m = \u001b[1;36m1\u001b[0m\n",
+ "எனேவ, ேமைஜயில ் \u001b[1;36m1\u001b[0m ஆப ் பிள ் உள ் ளது.\n",
+ "12Figure \u001b[1;36m6\u001b[0m: Example Outputs for Translation Tasks\n",
+ "Example \u001b[1;36m1\u001b[0m\n",
+ "Prompt தயவுெசய ் துபின ் வரும் வாக்கியத்ைததமBழBல ் ெமாழB-\n",
+ "ெபயர்க்கவும்:\n",
+ "”W e’ve trained a model called ChatGPT which interacts in\n",
+ "a conversational way . The dialogue format makes it possible\n",
+ "for ChatGPT to answer followup questions, admit its mis-\n",
+ "takes, challenge incorrect premises, and reject inappropriate\n",
+ "requests. ”\n",
+ "7B Model நாங ் கள ் சாட்ேபாட் என ் ற ெபயரில ் பயிற ் சியளிக்கப ் -\n",
+ "பட்ட மாதிரிையக் ெகாண ் டுள ் ேளாம், இது உைரயாட-\n",
+ "லில ் ேபசுவதுேபால ் ெசயல ் படுகிறது.உைரயாடல ் வடி-\n",
+ "வைமப ் பின ் மூலம், ேகள ் விகளுக்கு பதிலளிக்கும் தி-\n",
+ "றன ் மற ் றும் தவறானஅறBக்ைககைளமறுக்கும் திறன ்\n",
+ "ேபான ் றபலவிஷயங ் கைளச் ெசய ் யசாட்ேபாட் அனு-\n",
+ "மதிக்கிறது. கூடுதலாக, சாட்ேபாட் சில ேநரங ் களில ்\n",
+ "அதன ் தவறுகளில ் ஒப ் புக்ெகாள ் கிறார் மற ் றும் ெபா-\n",
+ "ருத்தமற ் றேகாரிக்ைககைளக் கூடநிராகரிக்கிறார்.\n",
+ "13B Model நாங ் கள ் ”ChatGPT”என ் றமாதிரிையபயிற ் றுவித்துள ் -\n",
+ "ேளாம், இது உைரயாடல ் முைறயில ் ெதாடர்பு ெகாள ் -\n",
+ "கிறது. இந்தேபச்சுவழக்குவடிவமானது,பின ் ெதாடர்-\n",
+ "தல ் ேகள ் விகளுக்குபதிலளிக்கவும்,தவறுகைளச் சரி-\n",
+ "பார்க்கவும்,தவறானமுன ் ெமாழBவுகைளசவால ் ெசய ் -\n",
+ "யவும் மற ் றும் ெபாருத்தமற ் றேகாரிக்ைககள ் மீதுநிரா-\n",
+ "கரிக்கவும் ChatGPTஐஅனுமதிக்கிறது.\n",
+ "Example \u001b[1;36m2\u001b[0m\n",
+ "Prompt தயவுெசய ் துபின ் வரும் வாக்கியத்ைததமBழBல ் ெமாழB-\n",
+ "ெபயர்க்கவும்:\n",
+ "”Estoy pensando en viajar a Inglaterra para ver el partido\n",
+ "entre Inglaterra y España de este fin de semana. ¡V a a ser\n",
+ "emocionante!”\n",
+ "7B Model நான ் இந்த வார இறுதியில ் இங ் கிலாந்தில ் கால ் பந்து\n",
+ "விைளயாட்ைடப ் பார்க்கவிரும்புகிேறன ் ,இதுமBகவும்\n",
+ "சுவாரஸ ் யமாகஇருக்கும்!\n",
+ "13B Model நான ் இந்த வார இறுதியில ் இங ் கிலாந்து மற ் றும் ஸ ் -\n",
+ "ெபயின ் இைடேயயான கால ் பந்து விைளயாட்ைடப ்\n",
+ "பார்க்க இங ் கிலாந்ைதச் ெசல ் லத் திட்டமBடுகிேறன ் !\n",
+ "இதுமBகவும் உற ் சாகமாகஇருக்கும்!\n",
+ "13Figure \u001b[1;36m7\u001b[0m: Example Output for Coding Tasks\n",
+ "Prompt :ைபத ் தானில ் ”Quick Sort” அல ் காரிதத ் ைத ெசயல ் ப-\n",
+ "டுத ் தவும ் .\n",
+ "7B Model :\n",
+ "நிச ் சயம ் ! ைபதான ் குறBயீடுதுணுக ் குஇேதா,இது QuickSort அல ் -\n",
+ "காரிதத ் ைதப ் பயன ் படுத ் திவரிைசப ் படுத ் தப ் பட ் டபட ் டியைல\n",
+ "உருவாக ் கப ் பயன ் படும ் :\n",
+ "def quick_sort \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m :\n",
+ "if len \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m \u001b[1m<\u001b[0m\u001b[39m= \u001b[0m\u001b[1;36m1\u001b[0m\u001b[39m:\u001b[0m\n",
+ "\u001b[39mreturn arr\u001b[0m\n",
+ "\u001b[39mpivot = arr \u001b[0m\u001b[1;39m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;39m]\u001b[0m\n",
+ "\u001b[39mless = \u001b[0m\u001b[1;39m[\u001b[0m\u001b[39m x for x in arr \u001b[0m\u001b[1;39m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[39m:\u001b[0m\u001b[1;39m]\u001b[0m\u001b[39m if x < pivot \u001b[0m\u001b[1;39m]\u001b[0m\n",
+ "\u001b[39mgreater = \u001b[0m\u001b[1;39m[\u001b[0m\u001b[39m x for x in arr \u001b[0m\u001b[1;39m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[39m:\u001b[0m\u001b[1;39m]\u001b[0m\u001b[39m if x \u001b[0m\u001b[1m>\u001b[0m pivot \u001b[1m]\u001b[0m\n",
+ "return quick_sort \u001b[1m(\u001b[0m less \u001b[1m)\u001b[0m + \u001b[1m[\u001b[0m pivot \u001b[1m]\u001b[0m + quick_sort\n",
+ "\u001b[1m(\u001b[0m greater \u001b[1m)\u001b[0m\n",
+ "my_list = \u001b[1m[\u001b[0m\u001b[1;36m5\u001b[0m , \u001b[1;36m3\u001b[0m , \u001b[1;36m7\u001b[0m , \u001b[1;36m2\u001b[0m , \u001b[1;36m9\u001b[0m , \u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m\n",
+ "result = quick_sort \u001b[1m(\u001b[0m my_list \u001b[1m)\u001b[0m\n",
+ "print \u001b[1m(\u001b[0m result \u001b[1m)\u001b[0m # Output : \u001b[1m[\u001b[0m\u001b[1;36m2\u001b[0m , \u001b[1;36m3\u001b[0m , \u001b[1;36m5\u001b[0m , \u001b[1;36m4\u001b[0m , \u001b[1;36m7\u001b[0m , \u001b[1;36m9\u001b[0m\u001b[1m]\u001b[0m\n",
+ "குறBயீட ் டின ் ெவளியீடு,வரிைசப ் படுத ் தப ் பட ் டபட ் டியலில ் உள ் ள\n",
+ "உறுப ் புகளின ் மறுசீரைமக ் கப ் பட ் டவரிைசையவழங ் குகிறது.\n",
+ "இது உதவும ் என ் று நான ் நம ் புகிேறன ் ! உங ் களுக ் கு ஏேதனும ்\n",
+ "ேகள ் விகள ் இருந ் தால ் எனக ் கு ெதரியப ் படுத ் தவும ் .\n",
+ "13B Model :\n",
+ "ைபத ் தானில ் QuickSort அல ் காரிதம ் ெசயல ் படுத ் துவதற ் கான\n",
+ "எடுத ் துக ் காட ் டு இங ் ேக:\n",
+ "def quick_sort \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m :\n",
+ "if len \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m <= \u001b[1;36m1\u001b[0m:\n",
+ "return arr\n",
+ "pivot = arr \u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1m]\u001b[0m\n",
+ "left , right = \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m , \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
+ "for i in range \u001b[1m(\u001b[0m\u001b[1;36m1\u001b[0m , len \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m \u001b[1m)\u001b[0m :\n",
+ "if arr \u001b[1m[\u001b[0m i \u001b[1m]\u001b[0m < pivot :\n",
+ "left . append \u001b[1m(\u001b[0m arr \u001b[1m[\u001b[0m i \u001b[1m]\u001b[0m\u001b[1m)\u001b[0m\n",
+ "else :\n",
+ "right . append \u001b[1m(\u001b[0m arr \u001b[1m[\u001b[0m i \u001b[1m]\u001b[0m\u001b[1m)\u001b[0m\n",
+ "return quick_sort \u001b[1m(\u001b[0m left \u001b[1m)\u001b[0m + \u001b[1m[\u001b[0m pivot \u001b[1m]\u001b[0m + quick_sort\n",
+ "\u001b[1m(\u001b[0m right \u001b[1m)\u001b[0m\n",
+ "arr = \u001b[1m[\u001b[0m\u001b[1;36m2\u001b[0m , \u001b[1;36m3\u001b[0m , \u001b[1;36m5\u001b[0m , \u001b[1;36m7\u001b[0m , \u001b[1;36m9\u001b[0m , \u001b[1;36m10\u001b[0m , \u001b[1;36m11\u001b[0m\u001b[1m]\u001b[0m\n",
+ "result = quick_sort \u001b[1m(\u001b[0m arr \u001b[1m)\u001b[0m\n",
+ "print \u001b[1m(\u001b[0m\u001b[32m\"The sorted array is :\"\u001b[0m, result \u001b[1m)\u001b[0m\n",
+ "14Figure \u001b[1;36m8\u001b[0m: Example Output for Open Question Answering Tasks\n",
+ "Prompt :நான ் ெபரியபணக ் காரன ் இல ் ைல,சமீபத ் திய iPhone\n",
+ "\u001b[1;36m15\u001b[0m'க ் கு நிைறய பணம ் ெசலவழBக ் க ேவண ் டுமா?\n",
+ "7B Model :\n",
+ "AIஆக,எனக ் குதனிப ் பட ் டவிருப ் பங ் கள ் அல ் லதுநிதிஇலக ் -\n",
+ "குகள ் இல ் ைல. இருப ் பினும ் ,உங ் கள ் ேதைவகளுக ் குஏற ் றெதா-\n",
+ "ைலேபசிையத ் ேதர ் ந ் ெதடுப ் பதுமற ் றும ் உங ் கள ் பட ் ெஜட ் ைடக ்\n",
+ "கருத ் தில ் ெகாள ் வது முக ் கியம ் . சமீபத ் திய ஐேபான ் களில ் வி-\n",
+ "ைலகள ் அதிகமாக இருக ் கலாம ் , ஆனால ் நீங ் கள ் விரும ் பிய\n",
+ "அம ் சங ் கைளவழங ் குவதில ் அைவஇன ் னும ் சிறந ் தைவ. உங ் -\n",
+ "கள ் ேதைவகைளப ் பூர ் த ் தி ெசய ் யும ் ஒன ் ைறத ் ேதர ் வுெசய ் ய-\n",
+ "வும ் ,ேமலும ் கூடுதல ் ெசலவுகைளச ் ேசமBக ் கஉதவும ் வைகயில ் ,\n",
+ "தள ் ளுபடிகள ் அல ் லது ஒப ் பந ் தங ் கைளப ் பார ் க ் கவும ் .\n",
+ "உங ் கள ் பட ் ெஜட ் டுக ் குள ் இருக ் கமுயற ் சிப ் பதும ் முக ் கியம ் . உங ் -\n",
+ "கள ் ெசலவினங ் கைளக ் கண ் காணித ் து, உங ் கள ் ெசலவினங ் -\n",
+ "களுக ் கு முன ் னுரிைம அளிக ் கவும ் . இது ேதைவயற ் ற ெசலவு-\n",
+ "கைளத ் தவிர ் க ் கவும ் ,உங ் கள ் பணத ் ைதச ் ேசமBப ் பதற ் கானவழB-\n",
+ "கைளக ் கண ் டறBயவும ் உதவும ் .\n",
+ "இறுதியில ் ,உங ் கள ் ேதைவகள ் மற ் றும ் பட ் ெஜட ் டிற ் குஏற ் றெதா-\n",
+ "ைலேபசிையத ் ேதர ் ந ் ெதடுப ் பது முக ் கியம ் . உங ் கள ் பட ் ெஜட ் -\n",
+ "டில ் ஒட ் டிக ் ெகாண ் டு,உங ் களுக ் குத ் ேதைவயானஅம ் சங ் கைளத ்\n",
+ "ேதடுங ் கள ் .\n",
+ "இருப ் பினும ் ,சமீபத ் தியஐேபான ் களில ் ஒன ் றுநிச ் சயமாகஅதிக\n",
+ "விைலயில ் இருக ் கும ் என ் பைத நிைனவில ் ெகாள ் ளுங ் கள ் .\n",
+ "15Figure \u001b[1;36m9\u001b[0m: Example Output for Creative Writing Tasks\n",
+ "Example:\n",
+ "Prompt :ஐபிஎல ் ெதாடைரெசன ் ைனசூப ் பர் கிங ் ஸ ் \u001b[1m(\u001b[0mசிஎஸ ் -\n",
+ "ேக\u001b[1m)\u001b[0mெவன ் றதுஎன ் றதைலப ் பில ் ஒருசிறுெசய ் திக் கட்டுைர-\n",
+ "ையஎழுதுங ் கள ் .\n",
+ "7B Model :\n",
+ "ெசன ் ைனசூப ் பர் கிங ் ஸ ் \u001b[1m(\u001b[0mசிஎஸ ் ேக\u001b[1m)\u001b[0m2021இந்தியன ் பிரீமBயர்\n",
+ "லீக்கில ் \u001b[1m(\u001b[0mஐபிஎல ் \u001b[1m)\u001b[0mஇறுதிப ் ேபாட்டிையெவன ் றது,அவர்களின ்\n",
+ "மூன ் றாவதுஐபிஎல ் பட்டத்ைதஉறுதிெசய ் தது. இந்தெவற ் -\n",
+ "றBயின ் மூலம், சிஎஸ ் ேகஅணிக்காகவிைளயாடியேகப ் டன ்\n",
+ "ேதானிதைலைமயிலானஇளம் மற ் றும் திறைமயானவீரர்க-\n",
+ "ளின ் குழுவிற ் குஇதுஒருகுறBப ் பிடத்தக்கசாதைனயாகும்.\n",
+ "சமீபத்தியசீசன ் முழுவதும்,சிஎஸ ் ேகஅணிபலசவால ் கைள\n",
+ "எதிர்ெகாண ் டது,ஆனால ் அவர்கள ் ெதாடர்ந்துதங ் கள ் திற-\n",
+ "ைமகைளயும் நம்பிக்ைகையயும் ெவளிப ் படுத்தினர். அணி-\n",
+ "யின ் ேவகப ் பந்துவீச்சு,ேபட்டிங ் மற ் றும் ஃபீல ் டிங ் ஆகியவற ் -\n",
+ "றBன ் வலுவானகலைவயானதுஅவர்களின ் ேபாட்டியாளர்-\n",
+ "களின ் பலேமாசமானஆட்டங ் களுக்குவழBவகுத்தது,ேமலும்\n",
+ "அவர்கள ் இறுதிப ் ேபாட்டியில ் அவர்களின ் ெசாந்தவிைளயாட்-\n",
+ "டுகளில ் ஆதிக்கம் ெசலுத்தமுடிந்தது.\n",
+ "இந்தெவற ் றBயின ் மூலம்,சிஎஸ ் ேகஅணிமூன ் றுஐபிஎல ் சாம்-\n",
+ "பியன ் ஷCப ் கைளெவன ் றுள ் ளது,ேமலும் அவர்களின ் ரசிகர்-\n",
+ "கள ் மகிழ்ச்சியில ் மூழ்கிஉள ் ளனர்.ேதானியும் அவரதுஅணி-\n",
+ "யும் பலஆண ் டுகளாகஇந்தியகிரிக்ெகட் அரங ் கில ் முக்கிய\n",
+ "நபராகஇருந்துவருகின ் றனர்,ேமலும் இந்தெவற ் றBஅவர்க-\n",
+ "ளின ் பாரம்பரியத்ைதேமலும் உயர்த்துகிறது.\n",
+ "ஒட்டுெமாத்தமாக,சிஎஸ ் ேகஅணிஇந்தசீசனின ் ஐபிஎல ் ெதா-\n",
+ "டர் இறுதிப ் ேபாட்டிக்குமுன ் ேனறBயதில ் ஒருெபரியசாதைன-\n",
+ "ையக் ெகாண ் டிருந்தது,ேமலும் அவர்களின ் ரசிகர்கள ் இப ் ேபா-\n",
+ "துஒருவருடம் கழBத்துதங ் கள ் அணிக்குஉற ் சாகத்ைதயும் ஆத-\n",
+ "ரைவயும் காட்டஎதிர்பார்க்கிறார்கள ் .\n",
+ "16Figure \u001b[1;36m10\u001b[0m: Example Outputs for Mathematical Reasoning Tasks\n",
+ "Example \u001b[1;36m1\u001b[0m\n",
+ "Prompt \u001b[1;36m7\u001b[0m! என ் றால ் என ் ன என ் பைதக ் கணக ் கிடுங ் கள ்\n",
+ "7B Model \u001b[1;36m7\u001b[0m! என ் பது \u001b[1;36m7\u001b[0m x \u001b[1;36m6\u001b[0m x \u001b[1;36m5\u001b[0m x \u001b[1;36m4\u001b[0m x \u001b[1;36m3\u001b[0m x \u001b[1;36m2\u001b[0m x \u001b[1;36m1\u001b[0m என ் ற ெபருக ் கல ்\n",
+ "ெவளிப ் பாட ் ைடக ் குறBக ் கிறது, இது ேதாராயமாக \u001b[1;36m870\u001b[0m\n",
+ "ஆகும ் .\n",
+ "எனேவ, \u001b[1;36m7\u001b[0m! = \u001b[1;36m870\u001b[0m\n",
+ "13B Model \u001b[1;36m7\u001b[0m! என ் பது \u001b[1;36m7\u001b[0m இன ் காரணிகளின ் எண ் ணிக ் ைக, அதா-\n",
+ "வது \u001b[1;36m1\u001b[0m x \u001b[1;36m2\u001b[0m x \u001b[1;36m3\u001b[0m x \u001b[1;36m4\u001b[0m x \u001b[1;36m5\u001b[0m x \u001b[1;36m6\u001b[0m x \u001b[1;36m7\u001b[0m ஆகும ் . எனேவ, \u001b[1;36m7\u001b[0m! = \u001b[1;36m5040\u001b[0m\n",
+ "ஆகும ் .\n",
+ "Example \u001b[1;36m2\u001b[0m\n",
+ "Prompt 3x+\u001b[1;36m1\u001b[0m=\u001b[1;36m10\u001b[0m என ் றால ் , xஎதற ் கு சமம ் ?\n",
+ "7B Model ெகாடுக ் கப ் பட ் ட சமன ் பாடு 3x+ \u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m எனில ் , xக ் கு தீர ் -\n",
+ "வு காண, நாம ் பின ் வரும ் படிகைளப ் பின ் பற ் றலாம ் :\n",
+ "படி \u001b[1;36m1\u001b[0m: சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ் தும ் \u001b[1;36m1\u001b[0m ஐக ்\n",
+ "கழBக ் கவும ் . இது நமக ் கு அளிக ் கிறது:\n",
+ "3x+ \u001b[1;36m1\u001b[0m−\u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m −\u001b[1;36m1\u001b[0m\n",
+ "\u001b[33m3x\u001b[0m= \u001b[1;36m9\u001b[0m\n",
+ "படி \u001b[1;36m2\u001b[0m: இப ் ேபாது எங ் களிடம ் 4xஇன ் மதிப ் பு இருப ் ப-\n",
+ "தால ் , அைத அசல ் சமன ் பாட ் டில ் மாற ் றலாம ் : 3x+\u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m\n",
+ "\u001b[1;35m3\u001b[0m\u001b[1m(\u001b[0m\u001b[1;36m9\u001b[0m\u001b[1m)\u001b[0m + \u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m\n",
+ "\u001b[1;36m29\u001b[0m + \u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m\n",
+ "\u001b[1;36m30\u001b[0m = \u001b[1;36m10\u001b[0m\n",
+ "எனேவ, xக ் கான தீர ் வு \u001b[1;36m30\u001b[0m ஆகும ் .\n",
+ "இருப ் பினும ் , ெகாடுக ் கப ் பட ் ட சமன ் பாட ் டிற ் கான சரி-\n",
+ "யான தீர ் ைவ நீங ் கள ் ேதடுகிறீர ் களானால ் , xஇன ் மதிப ் -\n",
+ "ைபப ் ெபற, இரு பக ் கங ் கைளயும ் \u001b[1;36m3\u001b[0m ஆல ் வகுக ் க ேவண ் -\n",
+ "டும ் .\n",
+ "\u001b[33mx\u001b[0m= \u001b[1m(\u001b[0m\u001b[1;36m30\u001b[0m/\u001b[1;36m3\u001b[0m\u001b[1m)\u001b[0m\n",
+ "\u001b[33mx\u001b[0m= \u001b[1;36m10\u001b[0m\n",
+ "எனேவ, xஎன ் பது 10க ் கு சமம ் .\n",
+ "13B Model 3x+ \u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m சமன ் பாட ் டில ் xக ் கு தீர ் வு காணும ் ேபா-\n",
+ "து, சமன ் பாட ் டின ் ஒரு பக ் கத ் தில ் மாறBைய தனிைமப ் -\n",
+ "படுத ் த ேவண ் டும ் .\n",
+ "முதலில ் , சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ் தும ் \u001b[1;36m1\u001b[0m ஐக ்\n",
+ "கழBப ் பதன ் மூலம ் ெதாடங ் கலாம ் :\n",
+ "3x+ \u001b[1;36m1\u001b[0m−\u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m −\u001b[1;36m1\u001b[0m\n",
+ "இது எளிதாக ் குகிறது:\n",
+ "\u001b[33m3x\u001b[0m= \u001b[1;36m9\u001b[0m\n",
+ "அடுத ் து, சமன ் பாட ் டின ் இரு பக ் கங ் கைளயும ் \u001b[1;36m3\u001b[0m ஆல ்\n",
+ "வகுப ் பதன ் மூலம ் xஐ தனிைமப ் படுத ் துேவாம ் :\n",
+ "\u001b[1m(\u001b[0m3x\u001b[1m)\u001b[0m\u001b[35m/\u001b[0m\u001b[95m3\u001b[0m = \u001b[1m(\u001b[0m\u001b[1;36m9\u001b[0m\u001b[1m)\u001b[0m\u001b[35m/\u001b[0m\u001b[95m3\u001b[0m\n",
+ "இது எளிதாக ் குகிறது:\n",
+ "\u001b[33mx\u001b[0m= \u001b[1;36m3\u001b[0m\n",
+ "எனேவ, 3x+ \u001b[1;36m1\u001b[0m = \u001b[1;36m10\u001b[0m சமன ் பாட ் டிற ் கான தீர ் வு \u001b[33mx\u001b[0m= \u001b[1;36m3\u001b[0m\n",
+ "ஆகும ் .\n",
+ "17Acknowledgments\n",
+ "We gratefully acknowledge the assistance of OpenAI’s GPT-\u001b[1;36m4\u001b[0m in the preparation of this manuscript. The AI’s advanced\n",
+ "language understanding and generation capabilities were invaluable in refining the structure, clarity, and overall\n",
+ "coherence of the original draft.\n",
+ "References\n",
+ "AI4Bharat. Indic sentiment dataset by ai4bharat. \u001b[4;94mhttps://huggingface.co/datasets/ai4bharat/\u001b[0m\n",
+ "IndicSentiment , \u001b[1;36m2023\u001b[0m.\n",
+ "J. Ainslie, J. Lee-Thorp, M. de Jong, Y . Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized \n",
+ "multi-query\n",
+ "transformer models from multi-head checkpoints, \u001b[1;36m2023\u001b[0m.\n",
+ "I. Caswell, T. Breiner, D. van Esch, and A. Bapna. Language id in the wild: Unexpected challenges on the path to a\n",
+ "thousand-language web text corpus, \u001b[1;36m2020\u001b[0m.\n",
+ "Y . Cui, Z. Yang, and X. Yao. Efficient and effective text encoding for chinese llama and alpaca, \u001b[1;36m2023\u001b[0m.\n",
+ "J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for \n",
+ "language\n",
+ "understanding, \u001b[1;36m2019\u001b[0m.\n",
+ "E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of \n",
+ "large\n",
+ "language models, \u001b[1;36m2021\u001b[0m.\n",
+ "A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel,\n",
+ "G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. \n",
+ "E.\n",
+ "Sayed. Mistral 7b, \u001b[1;36m2023\u001b[0m.\n",
+ "D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar. IndicNLPSuite:\n",
+ "Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages.\n",
+ "InFindings of the Association for Computational Linguistics: EMNLP \u001b[1;36m2020\u001b[0m , pages \u001b[1;36m4948\u001b[0m–\u001b[1;36m4961\u001b[0m, Online, Nov.\n",
+ "\u001b[1;36m2020\u001b[0m. Association for Computational Linguistics. doi: \u001b[1;36m10.18653\u001b[0m/v1/\u001b[1;36m2020.\u001b[0mfindings-emnlp.\u001b[1;36m445\u001b[0m. URL \u001b[4;94mhttps://\u001b[0m\n",
+ "aclanthology.org/\u001b[1;36m2020.\u001b[0mfindings-emnlp.\u001b[1;36m445\u001b[0m .\n",
+ "T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for\n",
+ "neural text processing, \u001b[1;36m2018\u001b[0m.\n",
+ "A. Kunchukuttan. The IndicNLP Library. \u001b[4;94mhttps://github.com/anoopkunchukuttan/indic_nlp_library/\u001b[0m\n",
+ "blob/master/docs/indicnlp.pdf , \u001b[1;36m2020\u001b[0m.\n",
+ "A. Kunchukuttan, D. Kakwani, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar. Ai4bharat-indicnlp\n",
+ "corpus: Monolingual corpora and word embeddings for indic languages. arXiv preprint arXiv:\u001b[1;36m2005.00085\u001b[0m , \u001b[1;36m2020\u001b[0m.\n",
+ "W. Lian, B. Goodson, E. Pentland, A. Cook, C. V ong, and \u001b[32m\"Teknium\"\u001b[0m. Openorca: An open dataset of gpt augmented\n",
+ "flan reasoning traces. \u001b[4;94mhttps://https://huggingface.co/Open-Orca/OpenOrca\u001b[0m , \u001b[1;36m2023\u001b[0m.\n",
+ "X. V . Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru,\n",
+ "S. Shleifer, P. S. Koura, V . Chaudhary, B. O’Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. Diab, V . Stoyanov, \n",
+ "and\n",
+ "X. Li. Few-shot learning with multilingual language models, \u001b[1;36m2022\u001b[0m.\n",
+ "A. Mahendiran. abinayam/gpt-\u001b[1;36m2\u001b[0m-tamil. \u001b[4;94mhttps://huggingface.co/abinayam/gpt-2-tamil\u001b[0m , \u001b[1;36m2021\u001b[0m.\n",
+ "T. Nguyen, C. V . Nguyen, V . D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Culturax: A\n",
+ "cleaned, enormous, and multilingual dataset for large language models in \u001b[1;36m167\u001b[0m languages, \u001b[1;36m2023\u001b[0m.\n",
+ "OpenAI. Introducing chatgpt. \u001b[4;94mhttps://openai.com/blog/chatgpt\u001b[0m , \u001b[1;36m2022\u001b[0m.\n",
+ "OpenAI. Gpt-\u001b[1;36m4\u001b[0m technical report, \u001b[1;36m2023\u001b[0m.\n",
+ "A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by\n",
+ "generative pre-training. \u001b[4;94mhttps://s3-us-west-2.amazonaws.com/openai-assets/research-covers/\u001b[0m\n",
+ "language-unsupervised/language_understanding_paper.pdf , \u001b[1;36m2018\u001b[0m.\n",
+ "A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised mul-\n",
+ "titask learners. \u001b[4;94mhttps://d4mucfpksywv.cloudfront.net/better-language-models/language_models_\u001b[0m\n",
+ "are_unsupervised_multitask_learners.pdf , \u001b[1;36m2019\u001b[0m.\n",
+ "T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et\n",
+ "al.\n",
+ "Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:\u001b[1;36m2211.05100\u001b[0m , \u001b[1;36m2022\u001b[0m.\n",
+ "N. Shazeer. Glu variants improve transformer, \u001b[1;36m2020\u001b[0m.\n",
+ "18O. Shliazhko, A. Fenogenova, M. Tikhonova, V . Mikhailov, A. Kozlova, and T. Shavrina. mgpt: Few-shot learners go\n",
+ "multilingual, \u001b[1;36m2022\u001b[0m. URL \u001b[4;94mhttps://arxiv.org/abs/2204.07580\u001b[0m .\n",
+ "J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with rotary position \n",
+ "embedding,\n",
+ "\u001b[1;36m2022\u001b[0m.\n",
+ "R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: \n",
+ "An\n",
+ "instruction-following llama model. \u001b[4;94mhttps://github.com/tatsu-lab/stanford_alpaca\u001b[0m , \u001b[1;36m2023\u001b[0m.\n",
+ "H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. \n",
+ "Azhar,\n",
+ "A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023a.\n",
+ "H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. \n",
+ "Bhosale,\n",
+ "D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao,\n",
+ "V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann,\n",
+ "A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y . Lu, Y . Mao, X. Martinet, T. Mihaylov,\n",
+ "P. Mishra, I. Molybog, Y . Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. \n",
+ "Smith,\n",
+ "R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y . Zhang, A. Fan,\n",
+ "M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama \u001b[1;36m2\u001b[0m: Open foundation and\n",
+ "fine-tuned chat models, 2023b.\n",
+ "A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is \n",
+ "all\n",
+ "you need. Advances in neural information processing systems , \u001b[1;36m30\u001b[0m, \u001b[1;36m2017\u001b[0m.\n",
+ "Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning \n",
+ "language\n",
+ "models with self-generated instructions, \u001b[1;36m2023\u001b[0m.\n",
+ "B. Zhang and R. Sennrich. Root mean square layer normalization, \u001b[1;36m2019\u001b[0m.\n",
+ "\u001b[1;36m19\u001b[0m\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from wikipediaapi import Wikipedia\n",
+ "wiki = Wikipedia('RAGBot/0.0', 'en')\n",
+ "data = wiki.page('Hayao_Miyazaki').text\n",
+ "\n",
+ "## After Uploading a pdf\n",
+ "# data = load_document(\"/content/R_Tamil_LLama.pdf\")\n",
+ "\n",
+ "print(data)"
+ ]
+ },
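+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The commented-out `load_document(...)` call above refers to a PDF-loading helper that is not shown in this cell. As a rough illustration only, a minimal version of such a helper could be written with `pypdf` as sketched below; the notebook's actual `load_document` implementation may differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative sketch (assumption): a minimal PDF-to-text loader using pypdf.\n",
+ "# The notebook's real `load_document` helper may be implemented differently.\n",
+ "from pypdf import PdfReader\n",
+ "\n",
+ "def load_document(path):\n",
+ "    \"\"\"Extract plain text from every page of a PDF and join it into one string.\"\"\"\n",
+ "    reader = PdfReader(path)\n",
+ "    return \"\\n\".join(page.extract_text() or \"\" for page in reader.pages)\n",
+ "\n",
+ "# Example usage (path as in the cell above):\n",
+ "# data = load_document('/content/R_Tamil_LLama.pdf')"
+ ]
+ },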
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RcN_opcCgTUY"
+ },
+ "source": [
+ "Perform Chunking"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "id": "NKkGc9edgTUZ",
+ "outputId": "6efe0b8a-aa1d-4e7c-ee84-dfc0b10ebe6a"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
Total number of chunks 10\n",
+ "
\n"
+ ],
+ "text/plain": [
+ "Total number of chunks \u001b[1;36m10\u001b[0m\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def chunk_text(text, chunk_size=1000, overlap=20):\n",
+ " \"\"\"\n",
+ " Split the text into chunks based on the number of words and word overlap.\n",
+ " \"\"\"\n",
+ " words = text.split()\n",
+ " chunks = []\n",
+ " for i in range(0, len(words), chunk_size - overlap):\n",
+ " chunk = ' '.join(words[i:i + chunk_size])\n",
+ " chunks.append(chunk)\n",
+ " return chunks\n",
+ "\n",
+ "chunked_data = chunk_text(data)\n",
+ "\n",
+ "print(\"Total number of chunks\", len(chunked_data))"
+ ]
+ },
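+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optional sanity check (not part of the original flow): since `chunk_text` steps through the word list in strides of `chunk_size - overlap`, every pair of consecutive chunks should share exactly `overlap` words, except where the final chunk is shorter than the overlap itself. The short snippet below verifies this, which can be handy when tuning `chunk_size` and `overlap` for retrieval."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Illustrative check: consecutive chunks share the configured 20-word overlap,\n",
+ "# except where the final chunk has fewer words than the overlap.\n",
+ "overlap = 20\n",
+ "for prev, nxt in zip(chunked_data, chunked_data[1:]):\n",
+ "    prev_words, nxt_words = prev.split(), nxt.split()\n",
+ "    assert nxt_words[:overlap] == prev_words[-overlap:] or len(nxt_words) < overlap\n",
+ "\n",
+ "print('Overlap verified across', len(chunked_data) - 1, 'chunk boundaries')"
+ ]
+ },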
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6dVGhlpNgTUZ"
+ },
+ "source": [
+ "Visualise Chunking"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ },
+ "id": "Ztv2QtXRgTUa",
+ "outputId": "ac09962a-ad1d-403a-8f2e-e6e7fda0d07b"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ " \n",
+ "
\n",
+ "
Chunk 1
\n",
+ "
TAMIL -LLAMA : A N EWTAMIL LANGUAGE MODEL BASED ON LLAMA 2 Abhinand Balachandran\n",
+ "abhinandb.ml@gmail.com ABSTRACT Language modeling has witnessed remarkable advancements in recent\n",
+ "years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like\n",
+ "text generation. How- ever, a prevailing limitation is the underrepresentation of languages like\n",
+ "Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic\n",
+ "contexts. This paper addresses this lacuna, enhancing the open-source LLaMA model with an addition\n",
+ "of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil\n",
+ "language. We strategically employ the LoRA methodology for efficient model training on a\n",
+ "comprehensive Tamil corpus, ensuring com- putational feasibility and model robustness. Moreover, we\n",
+ "introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset\n",
+ "tailored for instruction fine-tuning. Our results showcase significant performance improvements in\n",
+ "Tamil text generation, with potential implications for the broader landscape of LLMs in Indian\n",
+ "languages. We further underscore our commitment to open research by making our models, datasets, and\n",
+ "code1publicly accessible, fostering further innovations in language modeling. 1 Introduction The\n",
+ "past few years have been transformative for language modeling, with groundbreaking advances and\n",
+ "monumental achievements. At the forefront of this revolution was OpenAI’s ChatGPT (OpenAI, 2022),\n",
+ "which not only raised the bar in language modeling performance but also underscored the immense\n",
+ "societal implications of such technologies. Alongside ChatGPT, various Large Language Models (LLMs)\n",
+ "have consistently demonstrated exceptional prowess in natural language understanding and generation,\n",
+ "heralding a new era in computational linguistics. Central to the functionality of these modern LLMs\n",
+ "is the Transformer architecture, a cornerstone concept brought to the limelight by \"Attention is All\n",
+ "You Need\" (Vaswani et al., 2017). This innovation transformed our approach to sequence-based tasks,\n",
+ "catalyzing pivotal models like BERT (Devlin et al., 2019) and redefining best practices in Natural\n",
+ "Language Processing (NLP). Subsequent developments, particularly the Generative Pre-trained\n",
+ "Transformer (GPT) (Radford et al., 2018), showcased the profound potential of unsupervised pre-\n",
+ "training on vast datasets. Models like GPT-3 and its successor, GPT-4 (OpenAI, 2023), have redefined\n",
+ "benchmarks and fueled a renaissance in natural language understanding and generation. Beyond their\n",
+ "technical prowess, they have prompted a renewed vigor in exploring the limits of Artificial General\n",
+ "Intelligence (AGI). These advancements, paired with exemplary performance in numerous applications,\n",
+ "have galvanized the NLP community, sparking widespread application and research from sentiment\n",
+ "analysis to machine translation. However, progress is not without its pitfalls. The elite LLMs,\n",
+ "despite their remarkable capabilities, grapple with challenges—primarily, their proprietary nature,\n",
+ "which constricts open research. Furthermore, an English-centric bias and the enormous computational\n",
+ "requirements for training such behemoths further accentuate the call for more accessible and diverse\n",
+ "solutions. In response, the open-source community has championed the creation of models like LLaMA\n",
+ "(Touvron et al., 2023a) and Mistral (Jiang et al., 2023). Such models, despite their compact nature,\n",
+ "challenge the hegemony of giants like ChatGPT in select benchmarks, heralding a promising direction\n",
+ "for future research. 1GitHub Repository: https://github.com/abhinand5/tamil-llamaarXiv:2311.05845v1\n",
+ "[cs.CL] 10 Nov 2023However, as robust as these models, like LLaMA and Mistral, might be, their\n",
+ "proficiency in generating coherent text in Tamil and several other Indian languages remains\n",
+ "noticeably deficient. A fundamental limitation lies in their minimal vocabulary of Tamil characters,\n",
+ "which is essential for effective text encoding and generation. This paper aims to bridge this gap by\n",
+ "augmenting the existing LLaMA models’ vocabulary with an additional 16,000 Tamil tokens, markedly\n",
+ "enhancing their capability in processing and producing Tamil content. This method draws inspiration\n",
+ "from a parallel endeavor in the Chinese adaptation of LLaMA, as documented in Cui et al. (2023). To\n",
+ "ensure efficient pre-training and fine-tuning while maintaining computational feasibility, we\n",
+ "leverage the LoRA (Hu et al., 2021) methodology. We aspire that this initiative catalyzes further\n",
+ "research endeavors, refining LLaMA and other open-source models tailored for Indian languages. A\n",
+ "succinct overview of the principal contributions of this paper is as follows: •We bolster the LLaMA\n",
+ "model’s encoding and decoding proficiencies for Tamil by incorporating an additional 16,000 Tamil\n",
+ "tokens, thereby expanding its vocabulary. •Through the LoRA methodology, the augmented model\n",
+ "undergoes training on an extensive Tamil corpus, resulting in a marked enhancement of its text\n",
+ "generation capabilities relative to its predecessor models. •We present a Tamil-translated version\n",
+ "of the original Alpaca dataset (Taori et al., 2023), paired with a subset of the OpenOrca (Lian et\n",
+ "al., 2023) dataset, both curated for instruction fine-tuning in Tamil. •Our newly trained\n",
+ "instruction and chat models, built upon the Alpaca and OpenOrca datasets, demonstrate notable\n",
+ "advancements in performance for the Tamil language compared to other open-source language models.\n",
+ "•To stimulate continuous innovation and broader adaptability, we grant public access to the models,\n",
+ "datasets, and associated code, inviting further exploration and encouraging the refinement of LLaMA\n",
+ "models for diverse languages. 2 Related Work Within the broad field of Natural Language Processing\n",
+ "(NLP), the advent of Large Language Models (LLMs) marks a transformative moment. These models have\n",
+ "heralded new capabilities in understanding, generating, and processing various human languages,\n",
+ "underpinning innovations from automated content creation to nuanced sentiment analysis. While their\n",
+ "proficiency in mainstream languages like English is widely recognized and leveraged, a disparity\n",
+ "exists in their performance and availability for numerous non-European languages. Tamil, a language\n",
+ "with ancient roots and spoken by a substantial global population, epitomizes this disparity. Despite\n",
+ "its linguistic depth and cultural significance, dedicated pre-trained LLMs for Tamil are\n",
+ "conspicuously underrepresented. Most current offerings are generic, multipurpose LLMs, which do not\n",
+ "cater specifically to the unique attributes of the Tamil language. A survey of the existing\n",
+ "literature reveals that many attempts to cater to the Tamil language through LLMs rely heavily on\n",
+ "multilingual models. Works such as Scao et al. (2022), Shliazhko et al. (2022), and Lin et al.\n",
+ "(2022) have all ventured into this domain. However, it is crucial to note that, except \"GPT-2 Tamil\"\n",
+ "by Mahendiran (2021), all these models are not exclusive to Tamil. While they can process Tamil to a\n",
+ "certain extent, their capabilities are inherently limited. This limitation arises because the\n",
+ "training data for
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 2
\n",
+ "
can process Tamil to a certain extent, their capabilities are inherently limited. This limitation\n",
+ "arises because the training data for these models often comprise a low fraction of Tamil content\n",
+ "relative to other languages. Consequently, the nuances and intricacies specific to Tamil are often\n",
+ "lost, leading to suboptimal performance. The effort by Mahendiran (2021) represents a notable\n",
+ "deviation from this trend. Here, the GPT-2 base model, equipped with 117 million parameters as\n",
+ "outlined in Radford et al. (2019), was fine-tuned with a focus on Tamil, using both the Oscar\n",
+ "dataset (Caswell et al., 2020) and The IndicNLP (Kunchukuttan, 2020) dataset. This approach\n",
+ "signifies a targeted attempt to adapt LLM capabilities for the Tamil language specifically. However,\n",
+ "the broader landscape of Tamil-specific LLM research remains relatively uncharted. This context\n",
+ "underscores the motivation for our present research. We endeavor to delve deeper into this space,\n",
+ "addressing existing shortcomings and advancing the capabilities of LLMs tailored for Tamil. 3 Tamil\n",
+ "LLaMA 3.1 Datasets Used The development of Tamil-LLaMA involved using several different datasets,\n",
+ "each chosen for specific parts of the training and fine-tuning process. This approach was vital to\n",
+ "ensure the model’s effectiveness across various tasks. 23.1.1 Datasets used for Pre-Training For the\n",
+ "initial pre-training phase of LLaMA 2 (Touvron et al., 2023a), we mainly used the CulturaX dataset\n",
+ "(Nguyen et al., 2023). This dataset is a combination of many popular datasets, including the Oscar\n",
+ "dataset (Caswell et al., 2020). Out of the 4.72 million documents in CulturaX, we selected 600k\n",
+ "documents (12 GB) for training. This choice was made to manage training costs while aiming for high\n",
+ "performance. Our approach was successful, as the model showed strong results in text completion\n",
+ "tasks even with this smaller dataset. 3.1.2 Datasets used for Instruction Tuning The \"Instruction\n",
+ "Tuning\" phase was a pivotal stage in refining LLaMA’s proficiency in precisely adhering to textual\n",
+ "instructions. For this enhancement, we incorporated a translated version of the Stanford Alpaca\n",
+ "dataset (Taori et al., 2023), comprising 52,000 instructions. Concurrently, we integrated a\n",
+ "specialized no-code section from the OpenOrca dataset (Lian et al., 2023), which consists of around\n",
+ "93,000 instructions. The deliberate focus on no-code instructions was to streamline the training\n",
+ "process, eliminating the intricacies presented by coding instructions during translation. To ensure\n",
+ "translation uniformity and accuracy across the datasets, the Google Translation API service was our\n",
+ "tool of choice. We meticulously translated the entirety of the Alpaca dataset while also applying a\n",
+ "similar methodology to the OpenOrca subset. We believe that leveraging diverse datasets has\n",
+ "bolstered LLaMA’s enhanced capability to discern and generate contextually pertinent responses\n",
+ "across a spectrum of prompts. 3.2 Background on the LLaMA Models Introduced by Touvron et al.\n",
+ "(2023a), LLaMA has emerged as an essential milestone in the world of open-source large language\n",
+ "models (LLMs), with the renowned Transformer architecture (Vaswani et al., 2017) as its foundation.\n",
+ "While it draws inspiration from models like GPT for its basic structure—comprising an embedding\n",
+ "layer and multiple transformer blocks—LLaMA has its unique features. LLaMA has brought forward\n",
+ "several innovative techniques such as pre-normalization (Zhang and Sennrich, 2019), SwiGLU\n",
+ "activation (Shazeer, 2020), and rotary embeddings (Su et al., 2022). Offered in sizes ranging from\n",
+ "7B (7 Billion) to 65B (65 Billion) parameters, LLaMA has been trained on a rich mixture of content\n",
+ "sources, including web pages, books, and academic papers. Its strong performance on benchmarks,\n",
+ "especially given its relatively compact size compared to other models, has made it a noteworthy\n",
+ "contender in the LLM landscape, drawing considerable attention in the AI research community.\n",
+ "Building upon its predecessor’s foundation, LLaMA 2 (Touvron et al., 2023b) introduces monumental\n",
+ "enhancements to the LLaMA lineage. With a dataset expanded by 40% relative to LLaMA 1, the models\n",
+ "under LLaMA 2 exhibit an enriched comprehension of diverse content, leading to improved text\n",
+ "generation. An extended context length of 4,096 tokens empowers LLaMA 2 to process and understand\n",
+ "more extensive textual segments, significantly benefiting tasks such as translation and intricate\n",
+ "question answering. Another pivotal innovation in LLaMA 2 is adopting the grouped- query attention\n",
+ "mechanism (Ainslie et al., 2023), facilitating faster inference despite its expanded size compared\n",
+ "to LLaMA 1. In the course of our research, we made a conscious choice to employ LLaMA 2 as our\n",
+ "primary language model. Several factors influenced this decision. Firstly, LLaMA 2 is a recent\n",
+ "addition to the lineage of Large Language Models, which implies that it benefits from the latest\n",
+ "advancements in model training and architectural innovations. This recent launch incorporates the\n",
+ "most up-to-date techniques and methodologies. Secondly, compared with its predecessor, LLaMA 1, the\n",
+ "enhancements in LLaMA 2 are undeniably compelling. These improvements are not just incremental; they\n",
+ "represent substantial strides in areas such as data exposure, context length, and attention\n",
+ "mechanisms. The evolution from LLaMA 1 to LLaMA 2 is emblematic of the rapid advancements in the\n",
+ "field, and by leveraging the latter, we aimed to ensure our research was grounded in the most\n",
+ "cutting-edge tools available. 3.3 Expansion of Tamil Vocabulary LLaMA 2, as outlined in the seminal\n",
+ "work of Touvron et al. (2023b), is backed by an expansive pre-training corpus of 2 Trillion tokens.\n",
+ "A detailed linguistic analysis of this vast corpus reveals a striking imbalance in language\n",
+ "representation. An overwhelming 89.7% of the tokens are sourced from English, with other European\n",
+ "languages collectively contributing to nearly 10% of the dataset. In stark contrast, diverse\n",
+ "languages such as Tamil and Hindi represent a meager presence, with their combined token count along\n",
+ "with other under-represented languages accounting for less than 0.21%. This skewed distribution\n",
+ "raises concerns about the genuine multilingual and cross-lingual capabilities of LLaMA 2. While it\n",
+ "is evident that the model is proficient in several European languages, its ability to comprehend and\n",
+ "generate 3content in languages like Tamil needs to be improved substantially. Our preliminary\n",
+ "experiments further underscored this limitation. When presented with tasks in Tamil, LLaMA 2\n",
+ "exhibited a remarkable lack of coherence in its responses. In fact, its performance was notably\n",
+ "inferior to smaller models, underscoring a noticeable shortcoming in LLaMA
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 3
\n",
+ "
coherence in its responses. In fact, its performance was notably inferior to smaller models,\n",
+ "underscoring a noticeable shortcoming in LLaMA 2’s coverage of worldwide languages. There is a clear\n",
+ "need for the open-source community to focus on languages like Tamil, spoken by millions globally\n",
+ "across multiple countries. To bolster the text generation and understanding abilities of LLaMA 2 in\n",
+ "Tamil, we advocate extending its pre-training phase with an expansive Tamil corpus, as recommended\n",
+ "by Cui et al. (2023). However, this alone is not sufficient. A limitation arises from LLaMA’s\n",
+ "existing vocabulary, which has a tiny number of Tamil characters. Although LLaMA can bypass this by\n",
+ "encoding unknown tokens, this process considerably lengthens the sequences, leading to substantial\n",
+ "delays during encoding and decoding. Typically, a single Tamil character is translated into 3-4 byte\n",
+ "tokens. Moreover, these byte tokens are not uniquely purposed for Tamil characters but represent\n",
+ "UTF-8 tokens from various languages. This dual role complicates the task for transformer encoders\n",
+ "and byte-tokens to understand and capture the nuanced semantics of Tamil characters proficiently. To\n",
+ "overcome these problems and to enhance the text generation capabilities in Tamil, we propose the\n",
+ "incorporation of an additional 16,000 Tamil tokens to the pre-existing vocabulary of the LLAMA 2\n",
+ "model. This methodology echoes the strategies employed in developing Chinese LLaMA (Cui et al.,\n",
+ "2023). The subsequent steps explain the process of vocabulary extension: 1.Employ SentencePiece\n",
+ "(Kudo and Richardson, 2018) to train a Tamil Tokenizer on an extensive corpus of contemporary Tamil\n",
+ "text, capturing the essence of modern linguistic nuances necessary for coherent communication.\n",
+ "2.Integrate the original tokenizer of the LLaMA 2 model with the vocabulary derived from the newly\n",
+ "trained SentencePiece tokenizer. This amalgamation culminates in an augmented tokenizer encompassing\n",
+ "an additional 16,000 Tamil tokens, leading to an aggregated vocabulary size of 48,000 (32,000\n",
+ "original + 16,000 new). 3.Drawing parallels from Cui et al. (2023), the LLaMA model is then tailored\n",
+ "to accommodate the Tamil LLaMA tokenizer. This modification necessitates resizing the word\n",
+ "embeddings and the language model head from a matrix shape V ×H to V’ ×H. Herein, V represents the\n",
+ "original vocabulary size of 32,000, whereas V’ signifies the extended size of 48,000. Importantly,\n",
+ "this adjustment ensures the preservation of the embeddings associated with the original vocabulary\n",
+ "by appending the new rows to the concluding segments of the initial embedding matrices. In Figure 1,\n",
+ "we can see that the Tamil LLaMA tokenizer needs only 20% to 25% of the tokens that the original\n",
+ "LLaMA model uses to encode Tamil text. This makes the Tamil LLaMA much more efficient. With this\n",
+ "crucial update, the model can handle over three times more information and works three times faster.\n",
+ "In conclusion, our modifications to LLaMA 2 significantly bolster its capabilities in understanding\n",
+ "and generating Tamil content. By adding 16,000 Tamil tokens, we ensure a more efficient and nuanced\n",
+ "representation. The new Tamil LLaMA tokenizer drastically reduces the required tokens, making\n",
+ "encoding more efficient. Figure 1: Tokenizer comparisons between original LLaMA and Tamil LLaMA.\n",
+ "43.4 Pre-Training Phase In order to harness the full potential of the expanded vocabulary of Tamil\n",
+ "LLaMA, a robust pre-training phase is implemented using a comprehensive Tamil text corpus. The\n",
+ "datasets utilized during this training phase are detailed in 3.1.1. Causal Language Modelling\n",
+ "Approach The central mechanism for this pre-training is Causal Language Modelling (CLM). This method\n",
+ "specializes in predicting a given token xtrelying entirely on its preceding tokens. Formally, the\n",
+ "objective during this training phase is to maximize the likelihood of the entire sequence, as\n",
+ "represented by: P(x1, x2, . . . , x T) =TY t=1P(xt|x1, x2, . . . , x t−1) (1) Breaking down the\n",
+ "elements of this equation: •x1, x2, . . . , x T: The individual tokens that constitute the sequence.\n",
+ "•P(xt|x1, x2, . . . , x t−1): Represents the conditional probability of the token xt, which depends\n",
+ "on the preced- ing tokens in the sequence. Significance of the CLM in Language Adaptation The CLM\n",
+ "stage is integral to enhancing LLaMA’s capability in Tamil and other languages. It facilitates the\n",
+ "model in learning the intricate syntactic patterns, semantic subtleties, and unique linguistic\n",
+ "features of Tamil. Due to its autoregressive characteristics, the CLM mimics the human approach to\n",
+ "comprehending and generating language, which is primarily shaped by the previous context. Hence, at\n",
+ "the end of this initial training period, LLaMA becomes capable of interpreting and creating Tamil\n",
+ "text that is pertinent to the given context. This sets a strong foundation for further fine-tuning\n",
+ "and specific task-based training sessions. 3.5 Fine-Tuning Phase Following the foundational pre-\n",
+ "training phase, the fine-tuning phase emerges as a crucial step, especially for modern Large\n",
+ "Language Models (LLMs) deployed in real-world scenarios. A broad understanding of language structure\n",
+ "and semantics, while essential, does not suffice for such applications. This gap is addressed by\n",
+ "instruction fine-tuning, a tailored process enabling LLMs to interpret and execute task-oriented\n",
+ "instructions conveyed in natural language. Rather than the traditional approach of adapting to\n",
+ "specific datasets, instruction fine-tuning focuses on a wide array of tasks articulated through\n",
+ "language, ensuring the LLM’s adaptability without task-specific alterations. The datasets employed\n",
+ "in this phase are elaborated in Section 3.1.2. Instruction fine-tuning’s transformative essence lies\n",
+ "in its ability to enhance an LLM’s dynamism and responsiveness. While pre-training equips the model\n",
+ "with general linguistic proficiency, instruction fine-tuning refines it to interact seamlessly with\n",
+ "users through natural language, bridging the gap between overarching language mastery and nuanced,\n",
+ "task-specific agility. The instruction format employed closely resembles the one described in the\n",
+ "original Alpaca dataset (Taori et al., 2023). Both prompt templates suggested by Alpaca have been\n",
+ "utilized: one that includes an input field within the instruction and another that does not. The\n",
+ "prompt templates used during training are given in Figure 2. It is essential to clarify that in both\n",
+ "templates, the first line signifies the system prompts. For the Alpaca dataset (Taori et al., 2023),\n",
+ "we utilize the two system prompts as mentioned in Figure 2. However, for the OpenOrca subset (Lian\n",
+ "et al., 2023),
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 4
\n",
+ "
we utilize the two system prompts as mentioned in Figure 2. However, for the OpenOrca subset (Lian\n",
+ "et al., 2023), a distinct approach is taken: given that this subset already includes a dedicated\n",
+ "field for the system prompt within its dataset, we utilize that specific prompt. 3.6 Experimental\n",
+ "Setup and Training Details 3.6.1 LoRA Approach for Pre-Training and Fine-Tuning LoRA (Low-Rank\n",
+ "Adapters) is a technique that offers an efficient pathway to fine-tuning large language models, as\n",
+ "introduced by Hu et al. (2021). This approach is especially beneficial for its computational\n",
+ "efficiency, enabling the fine-tuning of language models without the need for extensive GPU\n",
+ "resources. We employed the LoRA method to moderate training expenses while also accelerating the\n",
+ "training timeline. Training the complete set of parameters for models like LLaMA can be exceedingly\n",
+ "expensive and resource-intensive, which is often beyond the budget of individual research teams or\n",
+ "small organizations. 5Figure 2: Prompt Template for Instruction Tasks 1. Prompt T emplate Without\n",
+ "Input ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என ் று கூறும ் அறB- வுைரகீேழஉள ் ளது. ேவண ்\n",
+ "டுேகாைளப ் ெபாருத ் தமாகநிைறவுெசய ் - கின ் ற பதில ் ஒன ் ைற எழுதுக. ### Instruction: {instruction}\n",
+ "### Response: {output} 2. Prompt T emplate With Input ஒரு பணிைய எவ ் வாறு நிைறேவற ் ற ேவண ் டும ் என\n",
+ "் று கூறும ் அறB- வுைர கீேழ உள ் ளது. ேமலும ் விரிவான பின ் னணிைய வழங ் கும ் ஓர ் உள ் ளீடும ்\n",
+ "ெகாடுக ் கப ் பட ் டுள ் ளது. ேவண ் டுேகாைளப ் ெபாருத ் தமாக நிைறவு ெசய ் கின ் ற பதில ் ஒன ் ைற\n",
+ "எழுதுக. ### Instruction: {instruction} ### Input: {input} ### Response: {output} 3.6.2 Experimental\n",
+ "Setups for Pre-Training The foundational models of Tamil LLaMA are initiated with the original LLaMA\n",
+ "weights and undergo pre-training using the fp16precision setting for both the 7B2and 13B3parameter\n",
+ "versions. We utilize 12GB of Tamil text sourced from Nguyen et al. (2023) during this pre-training\n",
+ "phase. Further insights on the dataset can be found in section 3.1.1. Our pre-training strategy\n",
+ "incorporates the LoRA method Hu et al. (2021), where we integrate LoRA adapters into the attention\n",
+ "vectors and subsequently train the embeddings, LM heads, and the newly incorporated LoRA parameters.\n",
+ "A noteworthy deviation from the methodology of the Chinese LLaMA (Cui et al., 2023) in our approach\n",
+ "is the elimination of the initial exclusive training of embeddings. Instead of following it with a\n",
+ "two-stage LoRA training of attention blocks, embeddings, and LM heads, we’ve opted for a streamlined\n",
+ "approach to curb costs. For the training infrastructure, we harnessed an Nvidia A100 GPU with 80GB\n",
+ "of VRAM. The models were trained for 1 epoch on the entire dataset, and the training time spanned 48\n",
+ "hours for 7B model and 60 hours for the 13B model on Microsoft Azure’s Standard NC24adsA\n",
+ "100v4instance. The detailed hyperparameters used for training are listed in Table 1. 3.6.3\n",
+ "Experimental Setups for Instruction Fine-Tuning The 7B4and 13B5models, once pre-trained, undergo\n",
+ "fine-tuning in alignment with the procedures outlined in Section 3.5. The datasets employed for this\n",
+ "phase are elaborated upon in Section 3.1.2. We persist with the LoRA methodology for fine-tuning,\n",
+ "executing it under the fp16precision setting for both models. Our datasets comprise translated\n",
+ "variants of Alpaca (Taori et al., 2023) and a select subset from OpenOrca (Lian et al., 2023).\n",
+ "2Tamil LLaMA 7B Pretrained: https://huggingface.co/abhinand/tamil-llama-7b-base-v0.1 3Tamil LLaMA\n",
+ "13B Pretrained: https://huggingface.co/abhinand/tamil-llama-13b-base-v0.1 4Tamil LLaMA 7B Instruct:\n",
+ "https://huggingface.co/abhinand/tamil-llama-7b-instruct-v0.1 5Tamil LLaMA 13B Instruct:\n",
+ "https://huggingface.co/abhinand/tamil-llama-13b-instruct-v0.1 6Table 1: Pre-Training Hyperparameters\n",
+ "Configurations 7B 13B Training Data 12GB 4GB Epochs 1 1 Batch Size 64 64 Initial Learning Rate 2e-4\n",
+ "2e-4 Max Sequence Length 512 512 LoRA Rank 64 64 LoRA Alpha 128 128 LoRA Target Modules QKVO, MLP\n",
+ "QKVO, MLP Training Precision FP16 FP16 In a bid to augment the models’ proficiency with Tamil-\n",
+ "centric literature, cultural nuances, and historical contexts, we leverage a tailored dataset\n",
+ "sourced from Wikipedia. Additionally, to extract instructions from this text, we utilize the Self-\n",
+ "Instruct method, as highlighted in Wang et al. (2023). This approach involves the GPT-4 (OpenAI,\n",
+ "2023) APIs from OpenAI to generate the new instruction dataset. It is crucial to note that the\n",
+ "system prompts, referenced in Section 3.1.2, remain consistent during this supplemental fine-tuning\n",
+ "phase. For the hardware, the same A100 GPU with 80GB of VRAM was utilized. In summary, our fine-\n",
+ "tuning approach employs a new translated dataset consisting of roughly 145,000 instructions. A\n",
+ "detailed account of the hyperparameters used for fine-tuning can be found in the Table 2. Table 2:\n",
+ "Fine-tuning Hyperparameters Configurations 7B 13B Training Data 145k 145k Epochs 2 1 Batch Size 64\n",
+ "64 Dropout Rate 0.1 0.1 Initial Learning Rate 2e-4 2e-4 Max Sequence Length 512 512 LoRA Rank 64 64\n",
+ "LoRA Alpha 128 128 LoRA Target Modules QKVO, MLP QKVO, MLP Training Precision FP16 FP16 4 Results on\n",
+ "Instruction Following Tasks 4.1 Task Design and Evaluation Method Evaluating the outcomes of text\n",
+ "generation tasks is intricate due to their multifaceted formats, distinguishing them from typical\n",
+ "Natural Language Understanding (NLU) tasks. Drawing inspiration from previous studies that employed\n",
+ "GPT-4 (OpenAI, 2023) for scoring, we similarly engage GPT-4 to assign a grade on a 10-point scale to\n",
+ "each instance. This approach is more efficient than human evaluations. However, understanding the\n",
+ "potential inaccuracies of GPT-4’s evaluations, we supplement its scores with manual reviews,\n",
+ "adjusting them as necessary. Such hands-on inspections affirm the consistency and authenticity of\n",
+ "the scores, ensuring they genuinely mirror the efficacy of the models under review. With the\n",
+ "GPT-4-based scoring and manual verifications, we have established a robust evaluation framework for\n",
+ "our Tamil LLaMA. Our assessment suite is diligently designed to provide a basic evaluation of Tamil\n",
+ "LLaMA. This suite comprises over 120 diverse examples, covering areas such as Question Answering,\n",
+ "Reasoning, Literature, Entertainment, Translation, Programming, and Ethics, among others. The\n",
+ "overall score for a specific task is computed by summing the scores from its constituent samples and\n",
+ "normalizing it to a 100-point scale. Such an approach ensures a holistic reflection of
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 5
\n",
+ "
scores from its constituent samples and normalizing it to a 100-point scale. Such an approach\n",
+ "ensures a holistic reflection of the models’ capabilities across varying tasks, yielding a well-\n",
+ "rounded measure of their overall performance. 74.2 Generation Parameters The choice of generation\n",
+ "parameters during inference greatly affects the caliber of the results in tasks involving text\n",
+ "generation. Additionally, the degree of quantization can also affect performance. Below are the\n",
+ "generation parameters we adopted for model evaluations: •Quantization Config : The model is loaded\n",
+ "in 8−bit, with the torch data type specified as bfloat 16. •Context Size: The context size is\n",
+ "maintained at the model’s default of 4096 tokens. •Temperature: We assign a temperature value of 0.2\n",
+ "to guide the randomness during sampling. A lower temperature prompts the model to produce more\n",
+ "deterministic outputs, whereas a higher value boosts diversity, potentially compromising coherence.\n",
+ "For creative instructions, we adjust the temperature to 0.7 to encourage varied outputs. •Top-k\n",
+ "Sampling : With k set to 50, the model selects its succeeding token from the 50 most probable\n",
+ "candidates, introducing a level of unpredictability and variety to the resulting text. •Top-p\n",
+ "Sampling : Complementing Top-k sampling, we employ Top-p sampling with a threshold of 0.90. This\n",
+ "ensures the model weighs a fluid set of tokens, which, combined, represent 90 •Maximum Sequence\n",
+ "Length : To keep the output concise and pertinent, we cap the generated sequence at 512 tokens.\n",
+ "•Repetition Penalty : A repetition penalty of 1.1 is applied to deter the model from producing\n",
+ "redundant text, disincentivizing previously chosen tokens. For these evaluations, we utilized a\n",
+ "Google Colab notebook powered by a T4 GPU. 4.3 Results from Instruction Tasks The evaluation scores\n",
+ "of the Tamil LLaMA models, as rated by GPT-4, are presented in Table 3. A noteworthy observation\n",
+ "during our evaluation is the superior performance of our models compared to gpt-3.5-turbo in manual\n",
+ "assessments, which is further reinforced by the commendable scores in GPT-4’s evaluations. However,\n",
+ "it is essential to consider that GPT-4 might inherently favor responses from other GPT model\n",
+ "lineages. Even though our model excels in numerous tasks, there are areas of exception, such as\n",
+ "ethics, and this was anticipated, given that we did not undertake any alignment efforts. Challenges\n",
+ "in literature/entertainment and other areas can be attributed to data limitations during the pre-\n",
+ "training phase, primarily due to cost constraints. Despite these nuances, our models establish a\n",
+ "robust foundation for subsequent enhancements and progress in large language models tailored to\n",
+ "Tamil. Table 3: GPT-4 rated performance scores for different models on Tamil instructions Task Type\n",
+ "Tamil-LLaMA-7B Tamil-LLaMA-13B gpt-3.5-turbo Question Answering 77.00 75.33 54.33 Open-ended QA\n",
+ "84.47 85.26 58.68 Reasoning 47.50 64.25 63.50 Literature 45.50 40.00 71.00 Entertainment 43.33 50.00\n",
+ "60.00 Creative Writing 92.50 95.62 59.69 Translation 60.56 66.67 92.78 Coding 63.57 76.07 57.14\n",
+ "Ethics 23.75 57.50 40.00 Overall 63.83 71.17 61.33 By observing Table 3, several intriguing outcomes\n",
+ "emerge. Notably, the gpt-3.5-turbo , despite its prowess in numerous languages, appears to be\n",
+ "eclipsed by the Tamil LLaMA models in multiple domains. A standout observation was the Ethics\n",
+ "category, where the gpt-3.5-turbo model demonstrated a propensity to respond to potentially\n",
+ "dangerous queries in Tamil. Additionally, in the Coding section, the gpt-3.5-turbo ’s responses\n",
+ "either seemed to exhibit a lack of comprehension or overlooked critical details, leading to a\n",
+ "subdued score. While gpt-3.5-turbo excels in tasks related to English and other languages, its\n",
+ "performance in the context of Tamil reveals areas for weaknesses. 84.3.1 Reasoning: In reasoning\n",
+ "tasks, the models demonstrate commendable performance. While minor discrepancies occasionally arise\n",
+ "in areas such as dates, quantities, and formulas, they predominantly excel in reasoning exercises.\n",
+ "According to our manual evaluations, even our smaller Tamil-LLaMA 7B model surpasses the performance\n",
+ "of the much larger LLaMA 2 70B in Tamil text generation. In comparison, even gpt-3.5-turbo (OpenAI,\n",
+ "2022) often falters in several reasoning instructions, producing outputs that miss the mark in\n",
+ "relevance, clarity, fluency, and accuracy. This inadequacy in performance is also observed in LLaMA\n",
+ "2 70B, rendering their generated Tamil text less beneficial. Examples of responses related to\n",
+ "reasoning tasks are given in the Figure 5. We conducted our comparisons with LLaMA 2 70B using the\n",
+ "model hosted by Perplexity Labs. 4.3.2 Translation: For translation tasks, our models exhibit\n",
+ "satisfactory performance, particularly when translating from a foreign language to Tamil. However,\n",
+ "the accuracy diminishes when translating from Tamil to other languages—a shortcoming we aim to\n",
+ "address in future iterations. Based on our manual evaluations, our models outperform the original\n",
+ "LLaMA 2 70B in Tamil text translations. However, their efficacy is roughly on par with gpt-3.5-turbo\n",
+ ". Examples of outputs for translation tasks are given in Figure 6. 4.3.3 Code Generation: Our models\n",
+ "exhibit impressive performance in code generation tasks despite the limited code instructions\n",
+ "present in the training dataset. They capably provide coherent explanations in Tamil for the\n",
+ "generated code. Based on our hands-on evaluations, our models markedly surpass the performance of\n",
+ "the more sizable LLaMA 2 70B model, which when instructed in Tamil, often either misconstrues the\n",
+ "task or produces erroneous answers in English. However, it is important to highlight that our model\n",
+ "is not tailored for coding tasks. While it handles more straightforward problems adeptly, it\n",
+ "encounters challenges with more intricate ones. Example responses from our models for Code\n",
+ "Generation tasks can be found in Figure 7. 4.3.4 Open Question Answering In open question answering\n",
+ "tasks, much like in reasoning, the model displays a commendable performance. Despite occasional\n",
+ "inaccuracies in areas like dates and other factual information, its proficiency often exceeded our\n",
+ "expectations, delivering surprising results on multiple instances. Example responses from our models\n",
+ "for Open Question Answering tasks can be found in Figure 8. 4.3.5 Creative Writing / Text Generation\n",
+ "Text generation is a foundational capability for Large Language Models (LLMs), with creative text\n",
+ "generation—such as crafting letters or applications—being a particularly notable use case. In\n",
+ "general, larger models have an edge in this domain, often outshining their smaller counterparts. The\n",
+ "quality and quantity of training data play pivotal roles in this context. While the
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 6
\n",
+ "
often outshining their smaller counterparts. The quality and quantity of training data play pivotal\n",
+ "roles in this context. While the sheer volume of data can improve performance, the richness and\n",
+ "quality of the data are equally vital. With abundant high-quality training data, even smaller models\n",
+ "can sometimes surpass the performance of larger ones. In our experiments, our models showed decent\n",
+ "performance in standard tasks. However, they faced challenges when assigned with more complicated\n",
+ "tasks. Example responses from our models for Creative Writing tasks can be found in Figure 9. 4.3.6\n",
+ "Mathematical reasoning Mathematical reasoning presents a significant challenge for our models. Like\n",
+ "many Large Language Models (LLMs), they don’t excel in handling mathematical tasks. From our hands-\n",
+ "on experiments, we observed that the performance of our models, mainly when dealing with Tamil,\n",
+ "lagged behind that of the original English LLaMA models. Recognizing this as an area of improvement,\n",
+ "we intend to prioritize and enhance the model’s capabilities in subsequent iterations. Examples of\n",
+ "outputs for mathematical reasoning tasks are given in Figure 10. 4.4 Results from Natural Language\n",
+ "Understanding (NLU) tasks Understanding natural language (NLU) is a vital element within the field\n",
+ "of natural language processing (NLP) that enables computers to comprehend and interpret human\n",
+ "language. NLU focuses on comprehending and extracting 9meaning from text, whereas text generation is\n",
+ "concerned with generating human-like text based on a given input, often without any specific\n",
+ "understanding of the text’s meaning. To ascertain the prowess of a model, its performance in Natural\n",
+ "Language Understanding (NLU) tasks is paramount. However, the availability of standard benchmarks\n",
+ "for Tamil in this domain remains sparse. Notable exceptions include the IndicNLP (Kunchukuttan,\n",
+ "2020), IndicNLP Corpus (Kunchukuttan et al., 2020), and IndicSentiment (AI4Bharat, 2023) datasets.\n",
+ "We opted to assess our models utilizing the test set from the IndicSentiment dataset (AI4Bharat,\n",
+ "2023), and a text classification dataset sourced from the IndicNLP Corpus (Kunchukuttan et al.,\n",
+ "2020). The test set of the IndicSentiment dataset encompasses 1,000 sentiment samples in Tamil. It\n",
+ "is important to note that our evaluation was concentrated solely on this Tamil subset. Figure 3:\n",
+ "Performance comparison on the IndicSentiment-7B dataset From Figure 3, it is evident that our Tamil\n",
+ "LLaMA model remarkably surpasses the original LLaMA in this specific NLU task. The latter’s\n",
+ "performance mirrors that of random guessing, registering an accuracy of 50.5%. In stark contrast,\n",
+ "our model impressively scores an accuracy of 81.3%. This enhanced NLU capability underscores the\n",
+ "efficacy of our methodologies—such as vocabulary expansion and retraining in facilitating the model\n",
+ "to comprehend a new language like Tamil with heightened proficiency. We further extended our\n",
+ "evaluation to the iNLTK Headline Classification subset within the IndicNLP suite (Kakwani et al.,\n",
+ "2020). It is essential to highlight that our analysis was focused strictly on the Tamil language\n",
+ "subset of this dataset. The outcomes of this evaluation are graphically depicted in Figure 4.\n",
+ "Insight from Figure 4 reveals that the original LLaMA model’s performance aligns closely with random\n",
+ "predictions. In contrast, our Tamil LLaMA model showcases a compelling lead, achieving an accuracy\n",
+ "rate of 80.12%, further affirming its superior capability in natural language understanding. 5\n",
+ "Limitations The Tamil LLaMA suite of models we introduce in this paper heralds several advancements\n",
+ "in Tamil language processing. However, in the spirit of rigorous research, it is imperative to\n",
+ "discuss the inherent limitations accompanying these models. 10Figure 4: Performance comparison on\n",
+ "the IndicGLUE Text Classification dataset •Constrained Knowledge Base : Due to computational and\n",
+ "cost constraints, our models were trained on a relatively limited Tamil dataset. This translates to\n",
+ "gaps in the models’ knowledge, especially regarding nuances and specifics native to Tamil culture\n",
+ "and literature. While the current version lays the foundation, the true potential can be unlocked\n",
+ "with access to a broader data spectrum, enriching its contextual understanding. •Ethical Concerns :\n",
+ "Detoxification procedures were not implemented in our training process, making these models prone to\n",
+ "generating potentially harmful or offensive content. Their uncensored nature necessitates caution\n",
+ "during deployment. •Lack of Robustness : Our models may, at times, produce outputs that veer off-\n",
+ "topic or deviate substantially from anticipated responses. This vulnerability is more pronounced\n",
+ "under adversarial conditions or tricky prompts. •Reasoning and Mathematical Challenges : While our\n",
+ "models showcase competence in specific reasoning scenarios, they falter in many others, underscoring\n",
+ "the repercussions of not having a comprehensive training set. •Over-Generation Tendencies : On\n",
+ "occasions, the models tend to generate verbose content, extending beyond logical termination points,\n",
+ "leading to potential redundancy. •Evaluation Hurdles : Assessment of LLMs is a crucial yet\n",
+ "challenging endeavor. The scarcity of standardized benchmarks, particularly for languages like\n",
+ "Tamil, which are outside the European linguistic group, complicates comparative evaluations.\n",
+ "Although we propose an evaluative approach tailored for Tamil within this paper, it is not\n",
+ "exhaustive enough to gauge models’ efficacy across diverse domains. •Translation Loss : Given that\n",
+ "the instructional prompts used for fine-tuning the Tamil LLaMA base models are derived from English\n",
+ "datasets translated into Tamil, there is a potential for nuanced inaccuracies—commonly referred to\n",
+ "as translation loss. This can potentially affect the models’ abilities in both text generation and\n",
+ "comprehension due to subtle shifts in meaning that can occur during the translation process. While\n",
+ "some of these challenges are addressable in subsequent iterations, we envision this work serving as\n",
+ "an anchor, inspiring the research community to propel advancements in LLMs for Indian languages. 116\n",
+ "Conclusion In this research endeavor, we have not only filled a critical void in the domain of Tamil\n",
+ "text generation but have also elevated the status of this venerable language within the realm of\n",
+ "large language models with the advent of our Tamil LLaMA.To assess the performance of our models, we\n",
+ "curated an evaluation dataset consisting of 120 Tamil instructions covering a wide range of topics.\n",
+ "We then employed GPT-4 to assess and rate the responses generated by our model. The 7B variant of\n",
+ "our model has surpassed the performance of OpenAI’s gpt-3.5-turbo in tasks involving Tamil\n",
+ "instructions within our evaluation methodology. Even more impressively, the 13B iteration has\n",
+ "outperformed its counterparts, demonstrating an almost
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 7
\n",
+ "
involving Tamil instructions within our evaluation methodology. Even more impressively, the 13B\n",
+ "iteration has outperformed its counterparts, demonstrating an almost 10% higher proficiency in these\n",
+ "tasks. The significance of our findings is accentuated by the efficiency of our models in generating\n",
+ "Tamil text. Equipped with a refined tokenizer, the 7B and 13B variants demonstrate exceptional\n",
+ "proficiency, eclipsing the original LLaMA models in processing speed without sacrificing textual\n",
+ "quality. This stride is not just a modest step forward but a major leap in the models’ ability to\n",
+ "process and generate Tamil language content, thus forging a new avenue for practical applications\n",
+ "that necessitate precision and promptness. Nevertheless, our exploration in this field continues. We\n",
+ "acknowledge the triumphs and yet remain mindful of the impending obstacles. The possibility of\n",
+ "generating unintended or adverse content, the financial limitations associated with model training,\n",
+ "and the need for comprehensive evaluation benchmarks tailored for Tamil are challenges we are\n",
+ "committed to surmounting. As we chart our path forward, we acknowledge the value of continuous\n",
+ "learning and adaptation. We aspire to delve deeper into strategies to mitigate the limitations\n",
+ "observed, perhaps even leveraging advancements like Reinforcement Learning With Human Feedback\n",
+ "(RLHF) and more advanced quantization techniques. The horizon looks promising, and we remain\n",
+ "committed to elevating the capabilities of LLaMA for Tamil and other Indian languages. Appendix\n",
+ "Example Outputs Figure 5: Example Output for Reasoning Tasks Prompt ேமைஜயில ் 4 ஆப ் பிள ் கள ் இருந\n",
+ "் தன. குமார ் ஒன ் ைறச ் சாப ் பிட ் டான ் , ராம ் எடுத ் துச ் ெசன ் றான ் 2. மீதி எத ் - தைன ஆப ்\n",
+ "பிள ் கள ் உள ் ளன? 7B Model முதலில ் , நாம ் 4 - 1 = 3 ஆப ் பிள ் கைள எடுத ் துக ் ெகாள ் - ளலாம ்\n",
+ ", ஏெனனில ் குமார ் ஒன ் று சாப ் பிட ் டார ் . அடுத ் து, ராம ் 2 ஆப ் பிள ் கைள எடுத ் துச ் ெசன ்\n",
+ "றார ் , எனேவ மீத- முள ் ள ஆப ் பிள ் களின ் எண ் ணிக ் ைக 3 - 2 = 1 ஆகும ் . எனேவ, மீதமுள ் ள ஆப ்\n",
+ "பிள ் 1 உள ் ளது. 13B Model இப ் ேபாது, இந ் தக ் ேகள ் விக ் கு பதிலளிக ் க, நாம ் சூத ் தி- ரத ்\n",
+ "ைதப ் பயன ் படுத ் தலாம ் : ெமாத ் த ஆப ் பிள ் கள ் = ஆரம ் ப ஆப ் பிள ் கள ் - சாப ் பிட ் ட ஆப ்\n",
+ "பிள ் கள ் - எடுத ் துச ் ெசன ் ற ஆப ் பிள ் கள ் ஆரம ் ப ஆப ் பிள ் கள ் = 4 சாப ் பிட ் ட ஆப ் பிள\n",
+ "் கள ் = 1 எடுக ் கப ் பட ் ட ஆப ் பிள ் கள ் = 2 இப ் ேபாது இந ் த மதிப ் புகைள சூத ் திரத ் தில ்\n",
+ "ெசருகலாம ் : ெமாத ் த ஆப ் பிள ் கள ் = 4 - 1 - 2 = 1 எனேவ, ேமைஜயில ் 1 ஆப ் பிள ் உள ் ளது.\n",
+ "12Figure 6: Example Outputs for Translation Tasks Example 1 Prompt தயவுெசய ் துபின ் வரும்\n",
+ "வாக்கியத்ைததமBழBல ் ெமாழB- ெபயர்க்கவும்: ”W e’ve trained a model called ChatGPT which interacts in a\n",
+ "conversational way . The dialogue format makes it possible for ChatGPT to answer followup questions,\n",
+ "admit its mis- takes, challenge incorrect premises, and reject inappropriate requests. ” 7B Model\n",
+ "நாங ் கள ் சாட்ேபாட் என ் ற ெபயரில ் பயிற ் சியளிக்கப ் - பட்ட மாதிரிையக் ெகாண ் டுள ் ேளாம், இது\n",
+ "உைரயாட- லில ் ேபசுவதுேபால ் ெசயல ் படுகிறது.உைரயாடல ் வடி- வைமப ் பின ் மூலம், ேகள ் விகளுக்கு\n",
+ "பதிலளிக்கும் தி- றன ் மற ் றும் தவறானஅறBக்ைககைளமறுக்கும் திறன ் ேபான ் றபலவிஷயங ் கைளச் ெசய ்\n",
+ "யசாட்ேபாட் அனு- மதிக்கிறது. கூடுதலாக, சாட்ேபாட் சில ேநரங ் களில ் அதன ் தவறுகளில ் ஒப ் புக்ெகாள ்\n",
+ "கிறார் மற ் றும் ெபா- ருத்தமற ் றேகாரிக்ைககைளக் கூடநிராகரிக்கிறார். 13B Model நாங ் கள ் ”ChatGPT”என\n",
+ "் றமாதிரிையபயிற ் றுவித்துள ் - ேளாம், இது உைரயாடல ் முைறயில ் ெதாடர்பு ெகாள ் - கிறது.\n",
+ "இந்தேபச்சுவழக்குவடிவமானது,பின ் ெதாடர்- தல ் ேகள ் விகளுக்குபதிலளிக்கவும்,தவறுகைளச் சரி-\n",
+ "பார்க்கவும்,தவறானமுன ் ெமாழBவுகைளசவால ் ெசய ் - யவும் மற ் றும் ெபாருத்தமற ் றேகாரிக்ைககள ்\n",
+ "மீதுநிரா- கரிக்கவும் ChatGPTஐஅனுமதிக்கிறது. Example 2 Prompt தயவுெசய ் துபின ் வரும்\n",
+ "வாக்கியத்ைததமBழBல ் ெமாழB- ெபயர்க்கவும்: ”Estoy pensando en viajar a Inglaterra para ver el partido\n",
+ "entre Inglaterra y España de este fin de semana. ¡V a a ser emocionante!” 7B Model நான ் இந்த வார\n",
+ "இறுதியில ் இங ் கிலாந்தில ் கால ் பந்து விைளயாட்ைடப ் பார்க்கவிரும்புகிேறன ் ,இதுமBகவும் சுவாரஸ ்\n",
+ "யமாகஇருக்கும்! 13B Model நான ் இந்த வார இறுதியில ் இங ் கிலாந்து மற ் றும் ஸ ் - ெபயின ் இைடேயயான\n",
+ "கால ் பந்து விைளயாட்ைடப ் பார்க்க இங ் கிலாந்ைதச் ெசல ் லத் திட்டமBடுகிேறன ் ! இதுமBகவும் உற ்\n",
+ "சாகமாகஇருக்கும்! 13Figure 7: Example Output for Coding Tasks Prompt :ைபத ் தானில ் ”Quick Sort” அல ்\n",
+ "காரிதத ் ைத ெசயல ் ப- டுத ் தவும ் . 7B Model : நிச ் சயம ் ! ைபதான ் குறBயீடுதுணுக ் குஇேதா,இது\n",
+ "QuickSort அல ் - காரிதத ் ைதப ் பயன ் படுத ் திவரிைசப ் படுத ் தப ் பட ் டபட ் டியைல உருவாக ் கப ்\n",
+ "பயன ் படும ் : def quick_sort ( arr ) : if len ( arr ) <= 1: return arr pivot = arr [0] less = [ x\n",
+ "for x in arr [1:] if x < pivot ] greater = [ x for x in arr [1:] if x > pivot ] return quick_sort (\n",
+ "less ) + [ pivot ] + quick_sort ( greater ) my_list = [5 , 3 , 7 , 2 , 9 , 4] result = quick_sort (\n",
+ "my_list ) print ( result ) # Output : [2 , 3 , 5 , 4 ,
4 x 5 x 6 x 7 ஆகும ் . எனேவ, 7! = 5040 ஆகும ் . Example 2 Prompt 3x+1=10 என ் றால ் , xஎதற ் கு சமம\n",
+ "் ? 7B Model ெகாடுக ் கப ் பட ் ட சமன ் பாடு 3x+ 1 = 10 எனில ் , xக ் கு தீர ் - வு காண, நாம ் பின ்\n",
+ "வரும ் படிகைளப ் பின ் பற ் றலாம ் : படி 1: சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ் தும ் 1 ஐக ்\n",
+ "கழBக ் கவும ் . இது நமக ் கு அளிக ் கிறது: 3x+ 1−1 = 10 −1 3x= 9 படி 2: இப ் ேபாது எங ் களிடம ் 4xஇன\n",
+ "் மதிப ் பு இருப ் ப- தால ் , அைத அசல ் சமன ் பாட ் டில ் மாற ் றலாம ் : 3x+1 = 10 3(9) + 1 = 10 29\n",
+ "+ 1 = 10 30 = 10 எனேவ, xக ் கான தீர ் வு 30 ஆகும ் . இருப ் பினும ் , ெகாடுக ் கப ் பட ் ட சமன ் பாட\n",
+ "் டிற ் கான சரி- யான தீர ் ைவ நீங ் கள ் ேதடுகிறீர ் களானால ் , xஇன ் மதிப ் - ைபப ் ெபற, இரு பக ்\n",
+ "கங ் கைளயும ் 3 ஆல ் வகுக ் க ேவண ் - டும ் . x= (30/3) x= 10 எனேவ, xஎன ் பது 10க ் கு சமம ் . 13B\n",
+ "Model 3x+ 1 = 10 சமன ் பாட ் டில ் xக ் கு தீர ் வு காணும ் ேபா- து, சமன ் பாட ் டின ் ஒரு பக ் கத ்\n",
+ "தில ் மாறBைய தனிைமப ் - படுத ் த ேவண ் டும ் . முதலில ் , சமன ் பாட ் டின ் இரு பக ் கங ் களிலிருந ்\n",
+ "தும ் 1 ஐக ் கழBப ் பதன ் மூலம ் ெதாடங ் கலாம ் : 3x+ 1−1 = 10 −1 இது எளிதாக ் குகிறது: 3x= 9 அடுத ்\n",
+ "து, சமன ் பாட ் டின ் இரு பக ் கங ் கைளயும ் 3 ஆல ் வகுப ் பதன ் மூலம ் xஐ தனிைமப ் படுத ் துேவாம ்\n",
+ ": (3x)/3 = (9)/3 இது எளிதாக ் குகிறது: x= 3 எனேவ, 3x+ 1 = 10 சமன ் பாட ் டிற ் கான தீர ் வு x= 3\n",
+ "ஆகும ் . 17Acknowledgments We gratefully acknowledge the assistance of OpenAI’s GPT-4 in the\n",
+ "preparation of this manuscript. The AI’s advanced language understanding and generation capabilities\n",
+ "were invaluable in refining the structure, clarity, and overall coherence of the original draft.\n",
+ "References AI4Bharat. Indic sentiment dataset by ai4bharat.\n",
+ "https://huggingface.co/datasets/ai4bharat/ IndicSentiment , 2023. J. Ainslie, J. Lee-Thorp, M. de\n",
+ "Jong, Y . Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer\n",
+ "models from multi-head checkpoints, 2023. I. Caswell, T. Breiner, D. van Esch, and A. Bapna.\n",
+ "Language id in the wild: Unexpected challenges on the path to a thousand-language web text corpus,\n",
+ "2020. Y . Cui, Z. Yang, and X. Yao. Efficient and effective text encoding for chinese llama and\n",
+ "alpaca, 2023. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep\n",
+ "bidirectional transformers for language understanding, 2019. E. J. Hu, Y . Shen, P. Wallis, Z.\n",
+ "Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language\n",
+ "models, 2021. A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas,\n",
+ "F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao,\n",
+ "T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. D. Kakwani, A. Kunchukuttan, S.\n",
+ "Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar. IndicNLPSuite: Monolingual corpora,\n",
+ "evaluation benchmarks and pre-trained multilingual language models for Indian languages. InFindings\n",
+ "of the Association for Computational Linguistics: EMNLP 2020 , pages 4948–4961, Online, Nov. 2020.\n",
+ "Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.445. URL https://\n",
+ "aclanthology.org/2020.findings-emnlp.445 . T. Kudo and J. Richardson. Sentencepiece: A simple and\n",
+ "language independent subword tokenizer and detokenizer for neural text processing, 2018. A.\n",
+ "Kunchukuttan. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/\n",
+ "blob/master/docs/indicnlp.pdf , 2020. A. Kunchukuttan, D. Kakwani, S. Golla, G. N.C., A.\n",
+ "Bhattacharyya, M. M. Khapra, and P. Kumar. Ai4bharat-indicnlp corpus: Monolingual corpora and word\n",
+ "embeddings for indic languages. arXiv preprint arXiv:2005.00085 , 2020. W. Lian, B. Goodson, E.\n",
+ "Pentland, A. Cook, C. V ong, and \"Teknium\". Openorca: An open dataset of gpt augmented flan\n",
+ "reasoning traces. https://https://huggingface.co/Open-Orca/OpenOrca , 2023. X. V . Lin, T. Mihaylov,\n",
+ "M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S.\n",
+ "Shleifer, P. S. Koura, V . Chaudhary, B. O’Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. Diab, V .\n",
+ "Stoyanov, and X. Li. Few-shot learning with multilingual language models, 2022. A. Mahendiran.\n",
+ "abinayam/gpt-2-tamil. https://huggingface.co/abinayam/gpt-2-tamil , 2021. T. Nguyen, C. V . Nguyen,\n",
+ "V . D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, and T. H. Nguyen. Culturax: A cleaned,\n",
+ "enormous, and multilingual dataset for large language models in 167 languages, 2023. OpenAI.\n",
+ "Introducing chatgpt. https://openai.com/blog/chatgpt , 2022. OpenAI. Gpt-4 technical report, 2023.\n",
+ "A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by\n",
+ "generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/ language-\n",
+ "unsupervised/language_understanding_paper.pdf , 2018. A. Radford, J. Wu, R. Child, D. Luan, D.\n",
+ "Amodei, and I. Sutskever. Language models are unsupervised mul- titask learners.\n",
+ "https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_\n",
+ "are_unsupervised_multitask_learners.pdf , 2019. T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c,\n",
+ "D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-\n",
+ "access multilingual language model. arXiv preprint arXiv:2211.05100 , 2022. N. Shazeer. Glu variants\n",
+ "improve transformer, 2020. 18O. Shliazhko, A. Fenogenova, M. Tikhonova, V . Mikhailov, A. Kozlova,\n",
+ "and T. Shavrina. mgpt: Few-shot learners go multilingual, 2022. URL https://arxiv.org/abs/2204.07580\n",
+ ". J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with\n",
+ "rotary position embedding, 2022. R. Taori, I.
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
Chunk 10
\n",
+ "
Pan, A. Murtadha, B. Wen, and Y . Liu. Roformer: Enhanced transformer with rotary position\n",
+ "embedding, 2022. R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T.\n",
+ "B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-\n",
+ "lab/stanford_alpaca , 2023. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T.\n",
+ "Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G.\n",
+ "Lample. Llama: Open and efficient foundation language models, 2023a. H. Touvron, L. Martin, K.\n",
+ "Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D.\n",
+ "Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B.\n",
+ "Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V .\n",
+ "Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D.\n",
+ "Liskovich, Y . Lu, Y . Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y . Nie, A. Poulton, J.\n",
+ "Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B.\n",
+ "Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y . Zhang, A. Fan, M. Kambadur,\n",
+ "S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-\n",
+ "tuned chat models, 2023b. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł.\n",
+ "Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing\n",
+ "systems , 30, 2017. Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H.\n",
+ "Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. B. Zhang\n",
+ "and R. Sennrich. Root mean square layer normalization, 2019. 19
\n",
+ "
\n",
+ " "
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def print_chunks(chunks):\n",
+ " \"\"\"\n",
+ " Display text chunks in a clean, readable format using HTML styling.\n",
+ "\n",
+ " Args:\n",
+ " chunks (list): List of text chunks to display\n",
+ " \"\"\"\n",
+ " # Create the HTML for the chunks display\n",
+ " html_content = \"\"\"\n",
+ " \n",
+ " \"\"\"\n",
+ "\n",
+ " # Add each chunk to the HTML content\n",
+ " for i, chunk in enumerate(chunks, 1):\n",
+ " # Wrap text for better readability\n",
+ " wrapped_text = textwrap.fill(chunk, width=100)\n",
+ "\n",
+ " html_content += f\"\"\"\n",
+ "