This repository contains an implementation of a GPT (Generative Pre-trained Transformer) model trained on Azerbaijani Wikipedia data. The model is designed to understand and generate Azerbaijani text.
.
├── README.md
├── az_tokenizer.json # Trained tokenizer for Azerbaijani text
├── az_wiki_data.json # Collected Wikipedia data
├── best_model.pt # Saved state of the best trained model
├── collect_data.py # Script for collecting Wikipedia articles
├── generate.py # Text generation script using the trained model
├── prepare_data.py # Data preprocessing and tokenizer training
├── push_to_hf.py # Script to upload the trained model to Hugging Face Model Hub
├── requirements.txt # Project dependencies
└── train.py # GPT model training script
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies based on your system:
For Mac with Apple Silicon (M1/M2):
# Install PyTorch for Apple Silicon
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Install other required packages
pip install transformers wikipedia-api beautifulsoup4 requests huggingface_hub
For other systems:
pip install -r requirements.txt
Apple Silicon (M1/M2):
- Uses MPS (Metal Performance Shaders) for acceleration
- Optimized memory management for Apple Silicon
- May require a specific PyTorch nightly build
NVIDIA GPUs:
- Automatically utilizes CUDA if available
- Implements mixed precision training
- Optimizes memory through gradient accumulation
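For reference, a minimal device-selection sketch that falls back from CUDA to MPS to CPU; this is a generic illustration, not necessarily the exact logic in train.py:

```python
import torch

def get_device() -> torch.device:
    """Pick the best available accelerator: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = get_device()
print(f"Training on: {device}")
```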
- Collect Azerbaijani Wikipedia articles:
python collect_data.py
This will save articles to az_wiki_data.json
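For orientation, a minimal sketch of this step using the wikipedia-api package (it assumes a recent version of the package that accepts a user_agent; the category names and output layout are illustrative, not the exact ones collect_data.py uses):

```python
import json
import wikipediaapi

wiki = wikipediaapi.Wikipedia(user_agent="az-gpt-data-collector", language="az")

# Hypothetical category list; the real script covers topics such as history,
# culture, literature, and geography.
categories = ["Kateqoriya:Azərbaycan tarixi", "Kateqoriya:Azərbaycan mədəniyyəti"]

articles = {}
for cat_name in categories:
    category = wiki.page(cat_name)
    for title, member in category.categorymembers.items():
        # Keep only main-namespace pages that actually have text.
        if member.ns == wikipediaapi.Namespace.MAIN and member.text:
            articles[title] = member.text

with open("az_wiki_data.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False)
```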
- Prepare data and train tokenizer:
python prepare_data.py
This will create az_tokenizer.json
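A rough sketch of BPE tokenizer training with the tokenizers library; the vocabulary size and special tokens below are assumptions, not necessarily the settings used in prepare_data.py:

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

with open("az_wiki_data.json", encoding="utf-8") as f:
    articles = json.load(f)

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])

# Train directly from the article texts and save the result.
tokenizer.train_from_iterator(articles.values(), trainer=trainer)
tokenizer.save("az_tokenizer.json")
```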
Train the GPT model:
python train.py
The training script:
- Uses mixed precision training
- Implements gradient accumulation
- Saves model checkpoints every 5 epochs
- Saves the best model based on validation loss
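The sketch below condenses these techniques (mixed precision, gradient accumulation, periodic and best-model checkpoints) into a single loop. Names such as train_model and accum_steps, and the assumption that the model's forward pass returns vocabulary logits, are placeholders rather than the exact code in train.py:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_model(model: nn.Module, train_loader: DataLoader, val_loader: DataLoader,
                optimizer: torch.optim.Optimizer, device: torch.device,
                num_epochs: int = 20, accum_steps: int = 8) -> None:
    """Mixed-precision training with gradient accumulation and checkpointing."""
    use_cuda = device.type == "cuda"
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
    best_val_loss = float("inf")

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad(set_to_none=True)
        for step, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)
            # Autocast only on CUDA; MPS/CPU run this in full precision.
            with torch.autocast(device_type="cuda", enabled=use_cuda):
                logits = model(x)
                loss = nn.functional.cross_entropy(
                    logits.view(-1, logits.size(-1)), y.view(-1))
            # Scale the loss so gradients average over the accumulation window.
            scaler.scale(loss / accum_steps).backward()
            if (step + 1) % accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)

        # Simple validation pass, no gradients.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                val_loss += nn.functional.cross_entropy(
                    logits.view(-1, logits.size(-1)), y.view(-1)).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        # Keep the best model by validation loss, plus a checkpoint every 5 epochs.
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pt")
        if (epoch + 1) % 5 == 0:
            torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pt")
```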
- Transformer-based architecture
- Configuration adjustable in train.py:
  - Embedding dimension: 512
  - Attention heads: 8
  - Layers: 6
  - Block size: 128
  - Batch size: 4
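The same hyperparameters, expressed as a simple configuration dictionary (the actual structure inside train.py may differ):

```python
config = {
    "n_embd": 512,      # embedding dimension
    "n_head": 8,        # attention heads
    "n_layer": 6,       # transformer blocks
    "block_size": 128,  # context length in tokens
    "batch_size": 4,    # raise this if you have more GPU memory
}
```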
Generate text using the trained model:
python generate.py
The generate.py script:
- Loads the trained model and tokenizer
- Generates text based on a user-provided prompt
- Implements sampling strategies such as nucleus sampling and temperature scaling
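A minimal sketch of temperature scaling combined with nucleus (top-p) sampling for a single decoding step; generate.py's exact implementation and default values may differ:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8,
                      top_p: float = 0.9) -> int:
    """Sample one token id from a [vocab_size] vector of logits."""
    logits = logits / temperature                     # temperature scaling
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()
    cutoff[0] = False
    sorted_probs[cutoff] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```

Lower temperatures make the distribution sharper; lower top_p values restrict sampling to fewer high-probability tokens.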
Upload your trained model to the Hugging Face Model Hub:
python push_to_hf.py
The push_to_hf.py script:
- Authenticates with your Hugging Face account
- Creates a new repository for your model (if needed)
- Uploads the trained model, tokenizer, and any other relevant files
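A rough sketch of that flow with huggingface_hub; the repository id and file list below are placeholders, not necessarily what push_to_hf.py uses, and it assumes you have already authenticated (e.g. via huggingface-cli login):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/azerbaijani-gpt"  # hypothetical repository id

# Create the repo if it does not exist, then upload the artifacts.
api.create_repo(repo_id=repo_id, exist_ok=True)
for filename in ["best_model.pt", "az_tokenizer.json"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id=repo_id,
    )
```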
- collect_data.py: Collects articles from Azerbaijani Wikipedia using categories like history, culture, literature, and geography
- prepare_data.py: Preprocesses text and trains a BPE tokenizer
- train.py: Contains the GPT model implementation and training loop
- generate.py: Generates text using the trained model and sampling strategies
- push_to_hf.py: Uploads the trained model to Hugging Face's Model Hub
- az_wiki_data.json: Collected and preprocessed Wikipedia articles
- az_tokenizer.json: Trained BPE tokenizer for Azerbaijani text
- best_model.pt: Saved state of the best model during training
The model saves:
- Best model state as best_model.pt
- Regular checkpoints as checkpoint_epoch_N.pt
- Interrupted training state as interrupt_checkpoint.pt
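To resume from one of these files, a minimal sketch, assuming the checkpoint stores a plain state_dict and that the model has already been rebuilt with the same configuration:

```python
import torch

state = torch.load("best_model.pt", map_location="cpu")
model.load_state_dict(state)  # `model` is assumed to be an already-constructed GPT instance
```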
- Recommended: a GPU with at least 8 GB of memory
- For larger models: use gradient accumulation steps
- Adjust batch size and model size based on available memory
Common Issues:
- Memory Errors:
  - Reduce the batch size
  - Enable gradient accumulation
  - Reduce the model size
  - Clear the GPU cache regularly
- PyTorch Installation:
  - For Apple Silicon: use the nightly build command above
  - For CUDA: install the matching CUDA version of PyTorch
- Data Loading:
  - Reduce the number of workers if you hit process errors
  - Enable pin memory for faster host-to-GPU transfer (see the sketch below)
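For the data-loading issues, an illustrative DataLoader configuration; the dataset here is a dummy placeholder just so the snippet runs on its own:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the tokenized Wikipedia data.
dataset = TensorDataset(torch.zeros(16, 128, dtype=torch.long),
                        torch.zeros(16, 128, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,    # drop to 0 if worker processes crash
    pin_memory=True,  # speeds up host-to-GPU transfer on CUDA
)

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # free cached GPU memory between runs
```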
- Implement model evaluation metrics
- Add data augmentation techniques
- Implement distributed training
- Add model compression techniques