High-performance, header-only tiktoken implementation in C++ with a small wrapper API and examples. It mirrors the behavior of OpenAI's tiktoken for common operations and supports custom encodings via local .tiktoken BPE files.
Highlights
- Core BPE engine in src/_tiktoken.hppand friendly API insrc/tiktoken.hpp
- PCRE2 regex splitting with JIT acceleration when available
- Zero-allocation hot paths for faster encoding/decoding
- Examples and a simple benchmark CLI
- C++17 toolchain (AppleClang, clang, or GCC)
- Conan 2.x for dependencies (PCRE2 and utfcpp)
- CMake ≥ 3.20
macOS quick tips
- Install Conan: pipx install conanorpip install --user conan
- Ensure developer tools: xcode-select --install
This project uses Conan to fetch PCRE2 and utfcpp and generates a CMake preset that wires everything for you.
- 
Detect a profile (first time only): - conan profile detect -f
 
- 
Install deps and generate presets into the project root (toolchain/presets go under build/Release/):- conan install . --output-folder=. --build=missing -s build_type=Release
 
- 
Configure and build (flat preset writes build files to build/):- cmake --preset conan-release-flat
- cmake --build --preset conan-release-flat -j
 
Alternative: plain CMake (if you already have PCRE2 + utfcpp)
- Set CMAKE_PREFIX_PATHto point to your PCRE2/utfcpp install roots sofind_package(PCRE2)andfind_package(utf8cpp)succeed.
- Library interface target: tiktoken(header-only)
- Example app: cpp_basic
- Benchmark app: tiktoken_benchmark
Defaults can be toggled in CMakeLists.txt:
- TIKTOKEN_BUILD_EXAMPLES(ON)
- TIKTOKEN_BUILD_BENCHMARKS(ON)
The public API is in src/tiktoken.hpp. Minimal example:
#include "tiktoken.hpp"
using namespace tiktoken;
int main() {
	// Build a tiny demo encoding (bytes 0..255 map to ids 0..255; a single special token)
	EncodingDefinition def;
	def.name = "minimal_demo";
	def.pat_str = R"('(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s)";
	for (int b = 0; b < 256; ++b) def.mergeable_ranks.push_back({U8Vec{static_cast<U8>(b)}, static_cast<Rank>(b)});
	def.special_tokens = {{"<|endoftext|>", 256}};
	def.explicit_n_vocab = 257;
	register_encoding(def);
	auto enc = get_encoding(def.name);
	auto tokens = enc->encode("hello world");
	auto bytes = enc->decode_bytes(tokens);
	auto text = std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}Or load a real BPE file (.tiktoken) and run cl100k-style encoding:
auto ranks = tiktoken::load_tiktoken_bpe_from_file("cl100k_base.tiktoken");
tiktoken::EncodingDefinition def;
def.name = "cl100k_base";
def.pat_str = R"('(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s)";
def.mergeable_ranks = std::move(ranks);
def.special_tokens = {{"<|endoftext|>", 100257}, {"<|fim_prefix|>", 100258}, {"<|fim_middle|>", 100259}, {"<|fim_suffix|>", 100260}, {"<|endofprompt|>", 100276}};
tiktoken::register_encoding(def);
auto enc = tiktoken::get_encoding(def.name);
auto toks = enc->encode("hello world");cpp_basic demonstrates loading a .tiktoken file or using a minimal demo:
- With BPE: ./build/cpp_basic build/cl100k_base.tiktoken "hello world"
- Demo fallback (no args): ./build/cpp_basic
Simple CLI to loop on encode(text) and report throughput:
- 
Demo mode (no BPE): - ./build/tiktoken_benchmark --iters 200 --progress
 
- 
With a BPE and input file: - ./build/tiktoken_benchmark --bpe build/cl100k_base.tiktoken --file build/bench.txt --iters 1000 --progress
 
Flags
- --bpe <path>:- .tiktokenfile (base64 bytes + rank per line)
- --file <path>: text to encode (raw text)
- --iters N(default 1000)
- --progress: prints progress dots to stderr
Notes on performance
- PCRE2 JIT is enabled automatically when available.
- Hot-path lookups avoid per-match allocations.
- Throughput depends strongly on the regex pattern and text; try Release builds.
- 
CMake can’t find PCRE2/utf8cpp - Use Conan flow shown above, or set CMAKE_PREFIX_PATHto where these packages are installed.
- On macOS with Homebrew, you might set: -DCMAKE_PREFIX_PATH="/opt/homebrew/opt/pcre2;/opt/homebrew/opt/utf8cpp"
 
- Use Conan flow shown above, or set 
- 
Preset not found / toolchain warnings - Ensure you ran: conan install . --output-folder=build --build=missing -s build_type=Release
- Then run either:
- cmake --preset conan-release && cmake --build --preset conan-release -j(nested build dirs)
- or cmake --preset conan-release-flat && cmake --build --preset conan-release-flat -j(flat build dir)
 
 
- Ensure you ran: 
- 
Xcode/Clang issues - Make sure command-line tools are installed: xcode-select --install
 
- Make sure command-line tools are installed: 
- 
Slow debug builds - Use -DCMAKE_BUILD_TYPE=Releaseor theconan-releasepreset. Regex JIT and hot paths shine in Release.
 
- Use 
MIT. See LICENSE.
Inspired by OpenAI’s tiktoken and Rust implementations of BPE tokenizers.