This project is a complete, from-scratch implementation of the Transformer architecture as described in the "Attention is All You Need" paper. It uses the OPUS Books translation dataset to perform English to Spanish translation.
The goal of this project was to deeply understand the inner workings of the Transformer architecture by implementing it without relying on prebuilt libraries like Hugging Face Transformers. This hands-on approach helped build confidence in core NLP concepts and the foundational technologies behind Large Language Models (LLMs).
- Custom tokenizer using Hugging Face Tokenizers (WordLevel)
- Positional Encoding
- Multi-Head Attention
- Encoder and Decoder stacks
- Greedy Decoding (see the sketch after this list)
- Training loop with TensorBoard support
- Configurable hyperparameters
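Greedy decoding simply reruns the decoder, appending the highest-probability token at each step until [EOS] or the maximum length is reached. The sketch below illustrates the idea; the `encode`/`decode`/`project` method names and the token-ID arguments are assumptions for illustration, not a verbatim excerpt from train.py.

```python
import torch

def greedy_decode(model, source, source_mask, sos_id, eos_id, max_len, device):
    # Encode the source sentence once and reuse the encoder output every step.
    # (encode/decode/project are assumed method names on the Transformer model.)
    encoder_output = model.encode(source, source_mask)
    # Start the target sequence with the [SOS] token.
    decoder_input = torch.empty(1, 1, dtype=torch.long, device=device).fill_(sos_id)
    while decoder_input.size(1) < max_len:
        size = decoder_input.size(1)
        # Causal mask: each position may only attend to itself and earlier positions.
        decoder_mask = torch.tril(torch.ones(1, size, size, device=device)).bool()
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)
        # Project the last position to vocabulary logits and take the argmax.
        logits = model.project(out[:, -1])
        next_token = logits.argmax(dim=-1, keepdim=True)
        decoder_input = torch.cat([decoder_input, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return decoder_input.squeeze(0)
```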
The implementation follows the original Transformer architecture:
- Input Embeddings: Learnable embeddings scaled by $\sqrt{d_{model}}$
- Positional Encoding: Added to the input embeddings
- Multi-Head Attention: Implemented from scratch (see the attention sketch after this list)
- Feedforward Layers: Two-layer MLP
- Layer Normalization and Residual Connections
- Stacked Encoder and Decoder Blocks (configurable depth)
- Final Linear + Softmax projection layer
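At the heart of each attention head is the scaled dot-product $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$. The snippet below is an illustrative PyTorch version of that step, not the exact code from model.py:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # Similarity scores, scaled by sqrt(d_k) so the softmax stays well-behaved.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (padding or future tokens) get -inf before the softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    # Weighted sum of the values; the weights can also be kept for visualization.
    return weights @ v, weights
```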
- Source: OPUS Books (via Hugging Face Datasets)
- Languages: English to Spanish
- Tokenizer: Custom-trained WordLevel tokenizer with the special tokens [SOS], [EOS], [PAD], and [UNK]
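Training a WordLevel tokenizer with the Hugging Face Tokenizers library roughly follows the pattern below; the sentence iterator and output path are placeholders rather than the exact code in this repo.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

def build_tokenizer(sentences, path):
    # WordLevel model with [UNK] as the fallback for out-of-vocabulary words.
    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordLevelTrainer(
        special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"],
        min_frequency=2,
    )
    # `sentences` is any iterator over raw text strings from the OPUS Books split.
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    tokenizer.save(path)  # e.g. tokenizer_en.json / tokenizer_es.json
    return tokenizer
```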
Defined in config.py:
{
"batch_size": 32,
"num_epochs": 3,
"lr": 1e-4,
"seq_len": 128,
"d_model": 256,
"lang_src": "en",
"lang_tgt": "es",
"model_folder": "weights",
"model_filename": "tmodel_",
"preload": None,
"tokenizer_file": "tokenizer_{lang}.json",
"experiment_name": "runs/tmodle"
}
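The model_folder, model_filename, and preload entries control checkpointing: weights are written under weights/ with the tmodel_ prefix, and setting "preload" to an epoch string presumably resumes training from that checkpoint. A small path helper along these lines is typical; the function name and the .pt extension are assumptions, not taken from config.py.

```python
from pathlib import Path

def get_weights_file_path(config, epoch: str) -> str:
    # Hypothetical helper: e.g. weights/tmodel_02.pt for epoch "02".
    filename = f"{config['model_filename']}{epoch}.pt"
    return str(Path(config["model_folder"]) / filename)
```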
├── train.py # Training script
├── model.py # Transformer architecture
├── dataset.py # Dataset class and causal masking
├── config.py # Config and utility functions
├── weights/ # Model checkpoints
├── tokenizer_en.json # Tokenizer for English
├── tokenizer_es.json # Tokenizer for Spanish
└── runs/tmodle/ # TensorBoard logs
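dataset.py pairs each target sequence with a causal (look-ahead) mask so that position i can only attend to positions up to i, combined with a padding mask. A minimal, illustrative version is shown below (the [PAD] id of 0 is an assumption):

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: True where attention is allowed.
    return torch.tril(torch.ones(1, size, size, dtype=torch.bool))

# Example: a length-5 target whose last token is [PAD] (id 0 assumed here).
tokens = torch.tensor([[3, 7, 9, 4, 0]])
pad_mask = (tokens != 0).unsqueeze(0)       # (1, 1, 5): hides padding
decoder_mask = pad_mask & causal_mask(5)    # (1, 5, 5): hides padding and future tokens
```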
Install the dependencies, run training, and launch TensorBoard:

pip install torch datasets tokenizers tensorboard tqdm

python train.py

tensorboard --logdir=runs/tmodle
SOURCE: I saw the cat sleeping on the sofa.
TARGET: Vi al gato durmiendo en el sofá.
PREDICTED: Vi al gato durmiendo en el sofá.
(Note: prediction quality depends on training duration and tokenizer vocabulary.)
- Understood how multi-head attention and masking work
- Gained confidence in working with positional encoding and residual layers
- Learned how to process datasets and build tokenizers
- Developed deeper insights into how LLMs build on the Transformer foundation
- Add support for beam search decoding
- Implement learning rate scheduler with warm-up (see the sketch after this list)
- Add BLEU score evaluation
- Add inference script for custom input
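For the planned warm-up scheduler, the original paper (Section 5.3) uses $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$. One possible implementation with PyTorch's LambdaLR is sketched below; it is not yet part of this repo, and the warm-up value of 4000 is the paper's default rather than a tested setting here.

```python
import torch

def noam_lambda(d_model: int, warmup_steps: int):
    # Schedule from "Attention Is All You Need", Section 5.3:
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def lr_lambda(step: int) -> float:
        step = max(step, 1)  # avoid division by zero on the first call
        return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
    return lr_lambda

# Illustrative usage: base lr of 1.0 so LambdaLR applies the schedule directly.
model = torch.nn.Linear(256, 256)  # stand-in for the Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lambda(d_model=256, warmup_steps=4000))
# Call scheduler.step() once per training step, after optimizer.step().
```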
- "Attention Is All You Need" (Vaswani et al., 2017)
- The Annotated Transformer (Harvard NLP)
- Hugging Face Datasets and Tokenizers documentation