# Building Transformers from Scratch

* My second attempt at building transformers from scratch using the [Attention paper](https://arxiv.org/abs/1706.03762) as a guide.
* Special thanks to [Joris Baan](https://github.com/jsbaan/transformer-from-scratch) for the original code and the inspiration to build this project.

## Introduction

* Transformers have become the go-to model for many natural language processing tasks and have been shown to outperform RNNs and LSTMs on many of them. The architecture was introduced in [Attention is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. It is built around the self-attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions, and consists of an encoder and a decoder, each composed of multiple layers of self-attention and position-wise feed-forward networks. Transformers have achieved state-of-the-art performance on tasks including machine translation, text summarization, and question answering.
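
As a reference for the self-attention mechanism mentioned above, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in PyTorch. The function and argument names are my own illustration, not code taken from the paper or from this repository:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched inputs.

    q, k, v: tensors of shape (batch, seq_len, d_k)
    mask:    optional boolean tensor broadcastable to (batch, seq_q, seq_k);
             positions that are False are excluded from attention.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block masked positions
    weights = F.softmax(scores, dim=-1)                    # attention distribution
    return weights @ v                                     # weighted sum of the values

# Example: q = k = v = torch.randn(2, 5, 64) gives an output of shape (2, 5, 64).
```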

* In this project, I will build a transformer model from scratch using PyTorch, train it on a simple dataset, and evaluate it on a held-out test set. The goal is to gain a better understanding of how the transformer works and how it can be implemented in code.

* The model will be built from the following components (see the sketch after this list for how they fit together):
  - Multi-Head Attention - lets the model focus on different parts of the input sequence when making predictions.
  - Position-wise Feed-Forward Networks - process the output of the multi-head attention sub-layer independently at each position.
  - Layer Normalization - normalizes the output of the attention and feed-forward sub-layers.
  - Residual Connections - let each sub-layer fall back on the identity function, which makes the deep stack easier to train.
  - Positional Encoding - encodes the position of each token in the input sequence, since attention itself is order-agnostic.
  - Masking - prevents the model from attending to future tokens during training.
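
To give a concrete picture of how the components above fit together, here is a minimal sketch of a single encoder layer, plus the sinusoidal positional encoding and a causal mask. It leans on PyTorch's built-in `nn.MultiheadAttention` for brevity; the class names, default hyperparameters, and exact wiring are illustrative assumptions, not the implementation used in this repository:

```python
import math
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> add & norm -> feed-forward -> add & norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention (PyTorch's built-in module, batch dimension first).
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise feed-forward network, applied independently at every position.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Residual connection around self-attention, followed by layer normalization.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection around the feed-forward network, followed by layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sin/cos positional encodings from the paper, shape (max_len, d_model)."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

def causal_mask(size):
    """Boolean mask where True marks positions a query may attend to (no future tokens)."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))
```

A decoder layer adds a second, masked attention sub-layer over the encoder output, but it is assembled from these same building blocks.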

* The model will be trained using the Adam optimizer and the learning rate will be scheduled using the Noam learning rate scheduler. The model will be evaluated using the [BLEU score metric](https://en.wikipedia.org/wiki/BLEU).
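
The Noam schedule increases the learning rate linearly over a warm-up period and then decays it with the inverse square root of the step number: lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5)). One possible way to wire it up with `torch.optim.lr_scheduler.LambdaLR` is sketched below; the hyperparameter values are the paper's defaults, not necessarily the ones used in this project:

```python
import torch

def noam_lambda(d_model=512, warmup=4000):
    """Return a function mapping step -> d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    def schedule(step):
        step = max(step, 1)  # LambdaLR starts at step 0; avoid division by zero
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)
    return schedule

# Hypothetical usage: with the base lr set to 1.0, the factor returned by the lambda
# is the effective learning rate. `model` is assumed to be the transformer built above.
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda())
# After each training step: optimizer.step(); scheduler.step()
```

For the BLEU evaluation, an off-the-shelf implementation such as `sacrebleu` or NLTK's `corpus_bleu` can be used rather than reimplementing the metric by hand.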

## The project will be divided into the following sections:
1. Data Preprocessing
4. Evaluation
5. Conclusion

* Side note: while building this, I was listening to the YouTube video [Consciousness of Artificial Intelligence](https://www.youtube.com/watch?v=sISkAb7suqo) on the theory of consciousness. It's a very interesting video and I highly recommend it.