bytepiece

Implementation of Su's bytepiece.

Bytepiece is a new tokenize method, which uses UTF-8 Byte as unigram to process text. It needs little preprocessing, more pure and language independent.

Bindings

Rust
Python

Quick Example using Python

from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

bytepiece

Bindings

Quick Example using Python

Files

README.md

Latest commit

History

README.md

File metadata and controls

bytepiece

Bindings

Quick Example using Python