Llama-2-7b-hf.txt
Vaswani et al. (2017) introduced the Transformer model for sequence-to-sequence (seq2seq) tasks. It achieved state-of-the-art performance on machine translation by replacing recurrent neural networks (RNNs, e.g., LSTMs) with attention, which captures long-range dependencies in the input sequence directly.
Transformers are composed of self-attention and feed-forward layers. The self-attention layer computes, for each position, a weighted sum of the input representations, with weights determined by how strongly that position attends to every other position; the feed-forward layer then applies a position-wise transformation (a linear map, a nonlinearity, and another linear map) to each output of the self-attention layer. In this post, we’ll explore Transformer architectures and their training.
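To make this concrete, here is a minimal sketch of single-head scaled dot-product self-attention. It assumes PyTorch and illustrative tensor shapes; the function and variable names are placeholders, not code from any particular library.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (T, D). Returns a (T, D) tensor."""
    d = q.size(-1)
    # Similarity between every query position and every key position: (T, T)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    # Softmax over the key dimension, so each row of weights sums to 1
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted sum of the value vectors
    return weights @ v

T, D = 10, 64                       # sequence length and model dimension (arbitrary)
x = torch.randn(T, D)               # stand-in for the layer's input sequence
out = scaled_dot_product_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)                    # torch.Size([10, 64])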
The architecture of a vanilla Transformer is shown in Figure 1. It consists of an encoder and a decoder. The encoder takes an input sequence $x \in \mathbb{R}^{T \times D}$ and outputs a sequence of hidden states $h_1, h_2, \dots, h_T \in \mathbb{R}^{D}$. The decoder takes the encoder's hidden states together with the previously generated output tokens and produces its own sequence of hidden states $d_1, d_2, \dots, d_T \in \mathbb{R}^{D}$.
Figure 1: Architecture of a vanilla Transformer.
The encoder is a stack of $N$ identical layers, each consisting of a multi-head self-attention layer and a feed-forward layer. The multi-head self-attention layer has $K$ heads, each of which computes a weighted sum of the input sequence using its own learned projections of the queries, keys, and values. The feed-forward layer applies a position-wise transformation to the output of the self-attention layer.
The decoder is likewise a stack of $N$ identical layers. Each decoder layer contains a masked multi-head self-attention layer (masking prevents a position from attending to future positions), a multi-head cross-attention layer that attends to the encoder's hidden states, and a feed-forward layer. As in the encoder, each of the $K$ heads uses its own learned projections of the queries, keys, and values; a sketch of a multi-head self-attention module is shown below.
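The following PyTorch module is a rough sketch of how the $K$ heads fit together: the model dimension is split across heads, each head attends independently, and the heads are concatenated and projected back. The class and parameter names are illustrative assumptions, not code from a specific implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learned projection per role; each head sees a d_head-sized slice
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, T, d_model)
        B, T, _ = x.shape
        def split(t):                          # (B, T, d_model) -> (B, heads, T, d_head)
            return t.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (B, heads, T, T)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                                     # (B, heads, T, d_head)
        heads = heads.transpose(1, 2).reshape(B, T, -1)         # concatenate the heads
        return self.out_proj(heads)

mha = MultiHeadSelfAttention(d_model=512, num_heads=8)
y = mha(torch.randn(2, 10, 512))               # (batch=2, T=10, d_model=512)

Note that because each head works on a $D/K$-dimensional slice, using $K$ heads costs roughly the same as a single full-width head.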
The input sequence is fed into the encoder, and the output sequence (shifted right during training) is fed into the decoder, which attends to the encoder's hidden states through cross-attention. Each decoder hidden state $d_t$ is passed through a final linear layer that produces one score per token in the vocabulary $V$, and a softmax turns these scores into a probability distribution over the next token.
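A minimal sketch of this output step, with an arbitrary hidden size and vocabulary size chosen purely for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

D, V = 512, 32000                    # hidden size and vocabulary size (arbitrary)
output_proj = nn.Linear(D, V)        # maps a hidden state to one score per token

d_t = torch.randn(1, D)              # stand-in for the decoder state at position t
logits = output_proj(d_t)            # (1, V) unnormalized scores
probs = F.softmax(logits, dim=-1)    # probability distribution over the vocabulary
print(probs.sum())                   # ~1.0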
The training objective is to minimize the cross-entropy loss between the ground-truth tokens and the predicted probabilities. During training, the model is optimized with stochastic gradient descent (SGD) or a related optimizer such as Adam: an iterative procedure that updates the model parameters using gradients of the loss with respect to those parameters.
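A hedged sketch of a single training step under these assumptions; the model here is a stand-in linear layer rather than a full encoder-decoder Transformer, and the shapes are placeholders.

import torch
import torch.nn as nn

V = 32000
model = nn.Linear(512, V)                       # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

hidden = torch.randn(8, 512)                    # stand-in decoder states for 8 positions
targets = torch.randint(0, V, (8,))             # ground-truth next-token ids

logits = model(hidden)                          # (8, V) predicted scores
loss = loss_fn(logits, targets)                 # cross-entropy vs. the ground truth
loss.backward()                                 # gradients of the loss w.r.t. parameters
optimizer.step()                                # SGD parameter update
optimizer.zero_grad()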
During training, the model learns to attend to different parts of the input sequence in order to generate the output sequence. This is achieved by computing attention weights for each input token and using them to form a weighted sum of the input representations. The attention weights are produced by a softmax, which ensures that the weights for each query position are non-negative and sum to 1.
The vanilla Transformer has been shown to be effective for a wide range of NLP tasks, including machine translation, question answering, and text summarization. However, it has two drawbacks: (1) it is computationally expensive, because attention weights are computed for every pair of positions, so the cost grows quadratically with sequence length; and (2) it can be prone to overfitting.
"Attention Is All You Need" is in fact the title of Vaswani et al. (2017)'s paper itself: the model dispenses with recurrence and convolutions entirely and relies on attention alone. Splitting the attention into $K$ heads, each operating on a $D/K$-dimensional slice, keeps the parameter count roughly the same as a single full-width head while letting the model attend to different representation subspaces, which the authors found improved quality at similar computational cost.
Devlin et al. (2019) later proposed Bidirectional Encoder Representations from Transformers (BERT), which reuses the Transformer encoder. BERT is a pre-trained language model that can be fine-tuned for a specific task. Because its encoder is bidirectional, every token can attend to both its left and right context, which improves performance on downstream tasks such as natural language inference (NLI) and sentiment analysis.
In this post, we explored the basics of Transformers and their training. We saw that Transformers are composed of self-attention and feed-forward layers, and that the training objective is to minimize the cross-entropy loss between the ground truth and the predicted probabilities. We also saw that attention's cost grows quadratically with sequence length, and that multi-head attention and pre-trained models such as BERT build on the same core machinery.
If you are interested in learning more about Transformers, I recommend the following resources:
Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. https://arxiv.org/abs/1706.03762v1
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. https://arxiv.org/abs/1810.04805