Valerie is a Large Language Model written completely from scratch in pure C.
- UTF-8 grapheme support
- Byte-Pair Encoding (BPE) tokenizer
- Model weights: Q8 (inference), BF16 (training); see the sketch after this list
- File serialization & validation
- Completions engine
- Chat completions engine
- Training engine
- Fine-tuning engine
- Headless CPU support (OpenMP)
- Headless GPU support (Vulkan)
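The two weight formats above are standard enough to sketch. As a rough illustration only (the per-block Q8 layout here is an assumption, not Valerie's serialized format), BF16 is a float32 truncated to its top 16 bits, and Q8 stores int8 values alongside a scale:

```c
/* Hedged sketch of the two formats; Valerie's on-disk layout may differ. */
#include <math.h>
#include <stdint.h>
#include <string.h>

/* BF16: keep the sign, exponent, and top 7 mantissa bits of a float32. */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16); /* truncation; production code may round */
}

static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* Q8: int8 values plus one scale per block (block layout is assumed). */
static float q8_quantize(const float* x, int8_t* q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t)lroundf(scale > 0.0f ? x[i] / scale : 0.0f);
    }
    return scale; /* dequantize with x[i] ≈ q[i] * scale */
}
```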
```sh
git clone https://github.com/teleprint-me/valerie.c valerie
cd valerie
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j $(nproc)
```

Valerie includes an ASCII-only Byte-Pair Encoding (BPE) tokenizer designed for transparency and ease of extension. Unicode (UTF-8 grapheme) support is planned.
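The core idea is small enough to sketch. The toy C example below shows a single merge step of BPE training, counting adjacent token-id pairs and replacing the most frequent pair with a new id; it is a conceptual illustration, not Valerie's implementation:

```c
/* Toy single merge step of BPE training. Conceptual only. */
#include <stdio.h>
#include <string.h>

#define MAX_ID 512 /* toy bound on the token-id space */

static int bpe_merge_step(int* toks, int n, int next_id) {
    static int counts[MAX_ID][MAX_ID];
    memset(counts, 0, sizeof(counts));
    for (int i = 0; i + 1 < n; i++) {
        counts[toks[i]][toks[i + 1]]++;
    }
    /* pick the most frequent pair; a pair seen once is not worth merging */
    int best_a = -1, best_b = -1, best = 1;
    for (int a = 0; a < MAX_ID; a++) {
        for (int b = 0; b < MAX_ID; b++) {
            if (counts[a][b] > best) {
                best = counts[a][b];
                best_a = a;
                best_b = b;
            }
        }
    }
    if (best_a < 0) {
        return n; /* nothing left to merge */
    }
    /* rewrite the sequence in place, replacing each pair with next_id */
    int w = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && toks[i] == best_a && toks[i + 1] == best_b) {
            toks[w++] = next_id;
            i++; /* consume both halves of the pair */
        } else {
            toks[w++] = toks[i];
        }
    }
    return w; /* new sequence length */
}

int main(void) {
    int toks[] = { 'a', 'b', 'a', 'b' }; /* "abab" as byte ids */
    int n = bpe_merge_step(toks, 4, 256);
    printf("length after one merge: %d\n", n); /* prints 2 */
    return 0;
}
```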
- Train: Build and serialize a BPE tokenizer model from a plaintext corpus.
- Predict: Encode and decode text using a trained model.
Build and save a tokenizer model:
```sh
./build/examples/tokenizer/train --input S --output S [--merges N] [--verbose]
```

- `--input`, `-i`: Path to input plaintext corpus (required)
- `--output`, `-o`: Directory to save the tokenizer model (required)
- `--merges`, `-m`: Number of BPE merge steps (default: 10)
- `--verbose`, `-v`: Enable debug output
Encode and decode text with a trained model:
```sh
./build/examples/tokenizer/predict --model S --prompt S [options]
```

- `--model`, `-m`: Path to tokenizer model file (required)
- `--prompt`, `-p`: Input text to encode and decode (required)
- `--add-bos`, `-b`: Add BOS marker
- `--add-eos`, `-e`: Add EOS marker
- `--verbose`, `-v`: Enable debug output
Train:
```sh
./build/examples/tokenizer/train -i samples/simple.txt -o models -m 10
```

Predict:

```sh
./build/examples/tokenizer/predict -m models/tokenizer.model -p 'Hello, world!'
```

Typical output:
- Prints tokens, frequencies, and merge steps when training.
- Lists vocabulary and encodings when predicting (a conceptual encoding sketch follows this list).
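For intuition about the encoding pass, here is a toy greedy longest-match encoder over a hypothetical in-memory vocabulary; Valerie's real encoder reads the trained model file and handles BOS/EOS markers, which this sketch omits:

```c
/* Toy greedy longest-match encoder; not Valerie's implementation. */
#include <stdio.h>
#include <string.h>

/* hypothetical vocabulary for illustration only */
static const char* vocab[] = { "Hello", "world", ", ", "!" };
static const int vocab_size = sizeof(vocab) / sizeof(vocab[0]);

static void encode(const char* text) {
    size_t i = 0, len = strlen(text);
    while (i < len) {
        int best = -1;
        size_t best_len = 0;
        /* find the longest vocabulary entry matching at position i */
        for (int t = 0; t < vocab_size; t++) {
            size_t tl = strlen(vocab[t]);
            if (tl > best_len && strncmp(text + i, vocab[t], tl) == 0) {
                best = t;
                best_len = tl;
            }
        }
        if (best < 0) { /* byte not in this toy vocabulary: skip it */
            i++;
            continue;
        }
        printf("%d:'%s' ", best, vocab[best]);
        i += best_len;
    }
    printf("\n");
}

int main(void) {
    encode("Hello, world!"); /* prints 0:'Hello' 2:', ' 1:'world' 3:'!' */
    return 0;
}
```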
Planned:
- Unicode grapheme support
- Model extensibility and validation
Licensed under the AGPL to ensure end-user freedom.