Valerie is a Large Language Model written completely from scratch in pure C.
- UTF-8 grapheme support
- Byte-Pair Encoding (BPE) tokenizer
- Model weights: Q8 (inference), BF16 (training); see the sketch after this list
- File serialization & validation
- Completions engine
- Chat completions engine
- Training engine
- Fine-tuning engine
- Headless CPU support (OpenMP)
- Headless GPU support (Vulkan)
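The two weight formats above are standard enough to sketch. As a rough illustration only (the per-block Q8 layout here is an assumption, not Valerie's serialized format), BF16 is a float32 truncated to its top 16 bits, and Q8 stores int8 values alongside a scale:

```c
/* Hedged sketch of the two formats; Valerie's on-disk layout may differ. */
#include <math.h>
#include <stdint.h>
#include <string.h>

/* BF16: keep the sign, exponent, and top 7 mantissa bits of a float32. */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16); /* truncation; production code may round */
}

static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* Q8: int8 values plus one scale per block (block layout is assumed). */
static float q8_quantize(const float* x, int8_t* q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t)lroundf(scale > 0.0f ? x[i] / scale : 0.0f);
    }
    return scale; /* dequantize with x[i] ≈ q[i] * scale */
}
```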
```sh
git clone https://github.com/teleprint-me/valerie.c valerie
cd valerie
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build -j $(nproc)
```

Valerie includes an ASCII-only Byte-Pair Encoding (BPE) tokenizer designed for transparency and ease of extension. Unicode (UTF-8 grapheme) support is planned.
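The core idea is small enough to sketch. The toy C example below shows a single merge step of BPE training, counting adjacent token-id pairs and replacing the most frequent pair with a new id; it is a conceptual illustration, not Valerie's implementation:

```c
/* Toy single merge step of BPE training. Conceptual only. */
#include <stdio.h>
#include <string.h>

#define MAX_ID 512 /* toy bound on the token-id space */

static int bpe_merge_step(int* toks, int n, int next_id) {
    static int counts[MAX_ID][MAX_ID];
    memset(counts, 0, sizeof(counts));
    for (int i = 0; i + 1 < n; i++) {
        counts[toks[i]][toks[i + 1]]++;
    }
    /* pick the most frequent pair; a pair seen once is not worth merging */
    int best_a = -1, best_b = -1, best = 1;
    for (int a = 0; a < MAX_ID; a++) {
        for (int b = 0; b < MAX_ID; b++) {
            if (counts[a][b] > best) {
                best = counts[a][b];
                best_a = a;
                best_b = b;
            }
        }
    }
    if (best_a < 0) {
        return n; /* nothing left to merge */
    }
    /* rewrite the sequence in place, replacing each pair with next_id */
    int w = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && toks[i] == best_a && toks[i + 1] == best_b) {
            toks[w++] = next_id;
            i++; /* consume both halves of the pair */
        } else {
            toks[w++] = toks[i];
        }
    }
    return w; /* new sequence length */
}

int main(void) {
    int toks[] = { 'a', 'b', 'a', 'b' }; /* "abab" as byte ids */
    int n = bpe_merge_step(toks, 4, 256);
    printf("length after one merge: %d\n", n); /* prints 2 */
    return 0;
}
```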
- Train: Build and serialize a BPE tokenizer model from a plaintext corpus.
- Predict: Encode and decode text using a trained model.
Build and save a tokenizer model:
```sh
./build/examples/tokenizer/train --input S --output S [--merges N] [--verbose]
```

- `--input`, `-i`: Path to input plaintext corpus (required)
- `--output`, `-o`: Directory to save the tokenizer model (required)
- `--merges`, `-m`: Number of BPE merge steps (default: 10)
- `--verbose`, `-v`: Enable debug output
Encode and decode text with a trained model:
```sh
./build/examples/tokenizer/predict --model S --prompt S [options]
```

- `--model`, `-m`: Path to tokenizer model file (required)
- `--prompt`, `-p`: Input text to encode and decode (required)
- `--add-bos`, `-b`: Add BOS marker
- `--add-eos`, `-e`: Add EOS marker
- `--verbose`, `-v`: Enable debug output
Train:
```sh
./build/examples/tokenizer/train -i samples/simple.txt -o models -m 10
```

Predict:

```sh
./build/examples/tokenizer/predict -m models/tokenizer.model -p 'Hello, world!'
```

Typical output:
- Prints tokens, frequencies, and merge steps when training.
- Lists vocabulary and encodings when predicting (a conceptual encoding sketch follows this list).
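For intuition about the encoding pass, here is a toy greedy longest-match encoder over a hypothetical in-memory vocabulary; Valerie's real encoder reads the trained model file and handles BOS/EOS markers, which this sketch omits:

```c
/* Toy greedy longest-match encoder; not Valerie's implementation. */
#include <stdio.h>
#include <string.h>

/* hypothetical vocabulary for illustration only */
static const char* vocab[] = { "Hello", "world", ", ", "!" };
static const int vocab_size = sizeof(vocab) / sizeof(vocab[0]);

static void encode(const char* text) {
    size_t i = 0, len = strlen(text);
    while (i < len) {
        int best = -1;
        size_t best_len = 0;
        /* find the longest vocabulary entry matching at position i */
        for (int t = 0; t < vocab_size; t++) {
            size_t tl = strlen(vocab[t]);
            if (tl > best_len && strncmp(text + i, vocab[t], tl) == 0) {
                best = t;
                best_len = tl;
            }
        }
        if (best < 0) { /* byte not in this toy vocabulary: skip it */
            i++;
            continue;
        }
        printf("%d:'%s' ", best, vocab[best]);
        i += best_len;
    }
    printf("\n");
}

int main(void) {
    encode("Hello, world!"); /* prints 0:'Hello' 2:', ' 1:'world' 3:'!' */
    return 0;
}
```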
Planned:
- Unicode grapheme support
- Model extensibility and validation
Licensed under the AGPL to ensure end-user freedom.