[AI] Notes

Deep Learning notes

How to Train a Large Language Model

  1. Data Collection: Collect a large dataset of text.
  2. Preprocessing: Tokenize the text into subwords, typically with the BPE algorithm (see the tokenization sketch after this list).
  3. Model Architecture: Choose a model architecture; a decoder-only Transformer is built from the components below (see the block sketch after this list).
    1. Embedding
    2. Layer Norm
    3. Self-attention
    4. Projection
    5. MLP
    6. Transformer block (items 2-5, stacked N times)
    7. Softmax
    8. Output
  4. Training: Train the model on the dataset, with appropriate data splitting and training strategies.
    1. Pretraining on human-written text (next-token prediction)
    2. (SFT) Supervised Fine-Tuning
  5. (RL) Reinforcement Learning
    1. (RLHF) Reinforcement Learning with Human Feedback
    2. Practice problems
    3. Fine-tune the model on a smaller dataset.
  6. Output: Run inference with the trained model to generate text.
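
A minimal sketch of the BPE merge loop mentioned in the preprocessing step, assuming a toy whitespace-split corpus; production tokenizers (byte-level BPE, SentencePiece, etc.) are far more optimized.

```python
# Toy byte-pair-encoding (BPE) trainer: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def train_bpe(words, num_merges):
    """words: list of strings; returns the list of learned merge pairs."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of single-character symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():   # count adjacent symbol pairs across the corpus
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair becomes a new token
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():   # apply the merge everywhere it occurs
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(train_bpe(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
```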

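A minimal decoder block sketch matching the architecture components above (embedding, layer norm, self-attention, projection, MLP, softmax over output logits). The dimensions, pre-norm ordering, and learned positional embeddings are illustrative assumptions, not any specific model's design.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: LayerNorm -> self-attention (+ projection) -> LayerNorm -> MLP."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True above the diagonal blocks attention to future tokens.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                # self-attention (output projection is inside MultiheadAttention)
        x = x + self.mlp(self.ln2(x))   # position-wise MLP
        return x

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)      # token embedding
        self.pos = nn.Embedding(max_len, d_model)          # learned positional embedding
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)          # output projection to vocabulary logits

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # logits; softmax gives next-token probabilities

logits = TinyLM()(torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # (1, 16, vocab_size)
```
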
Training Process (example with DeepSeek)

  1. Train V3-Base with RL for reasoning -> R1-Zero.
  2. Create SFT data from R1-Zero using rejection sampling (see the sketch after this list) + synthetic data from V3.
  3. Fine-tune V3-Base on the SFT data, followed by RL (reasoning + human preferences) -> R1.
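
A minimal sketch of the rejection-sampling idea used in step 2: sample many candidate responses per prompt and keep only those that pass a check. `generate` and `is_correct` are hypothetical stand-ins for a real sampler and verifier, not DeepSeek's actual pipeline.

```python
# Hypothetical stand-ins: a real pipeline would call the model and a task-specific verifier / reward model.
def generate(model, prompt):
    return f"<sampled response to: {prompt}>"

def is_correct(prompt, response):
    return True  # placeholder verifier (e.g., answer checker or reward-model threshold)

def build_sft_data(model, prompts, samples_per_prompt=16):
    """Rejection sampling: keep one accepted response per prompt as an SFT training example."""
    sft_examples = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(samples_per_prompt)]
        accepted = [c for c in candidates if is_correct(prompt, c)]
        if accepted:
            sft_examples.append({"prompt": prompt, "response": accepted[0]})
    return sft_examples

print(build_sft_data(model=None, prompts=["What is 2 + 2?"]))
```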

GRPO: Group Relative Policy Optimization

  • For each input question, sample multiple responses
  • Compute a reward for each, and calculate its group-normalized advantage (sketched below)
  • No need for a critic model (answers are rewarded relative to the group)
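
A minimal sketch of the group-normalized advantage at the heart of GRPO, assuming rewards have already been computed for one question's sampled group; the clipped policy-ratio objective and KL penalty of the full method are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each response's reward within its sampled group.

    rewards: shape (num_responses,) for the responses sampled for one question.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)   # positive = better than the group average

# One question, 4 sampled responses scored by a reward function (e.g., correctness checks).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```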

DeepSeek Evolution

Foundational Models

| Model | Parameters | Methods | Notes |
| --- | --- | --- | --- |
| V1 | 67B | Traditional Transformer | |
| V2 | 236B | (MHLA) Multi-Head Latent Attention; (MoE) Mixture of Experts | MoE made the model fast |
| V3 | 671B | (RL) Reinforcement Learning; (MHLA) Multi-Head Latent Attention; (MoE) Mixture of Experts | Balances load amongst many GPUs |
| R1-Zero | 671B | (RL) Reinforcement Learning; (MHLA) Multi-Head Latent Attention; (MoE) Mixture of Experts; (CoT) Chain of Thought | Reasoning model |
| R1 | 671B | (RL) Reinforcement Learning; (SFT) Supervised Fine-Tuning; (MHLA) Multi-Head Latent Attention; (MoE) Mixture of Experts; (CoT) Chain of Thought | Reasoning model; human preferences |

Transformer-Based Language Models

| Model | Full Name | Organization |
| --- | --- | --- |
| BERT | Bidirectional Encoder Representations from Transformers | Google |
| XLNet | Generalized Autoregressive Pretraining for Language Understanding | Google/CMU |
| RoBERTa | A Robustly Optimized BERT Approach | Facebook |
| DistilBERT | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | Hugging Face |
| CTRL | Conditional Transformer Language Model for Controllable Generation | Salesforce |
| GPT-2 | Generative Pre-trained Transformer | OpenAI |
| ALBERT | A Lite BERT for Self-supervised Learning of Language Representations | Google |
| Megatron | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | NVIDIA |