Skip to content

Latest commit

 

History

History
23 lines (14 loc) · 603 Bytes

README.md

File metadata and controls

23 lines (14 loc) · 603 Bytes

bytepiece

Implementation of Su's bytepiece.

Bytepiece is a new tokenize method, which uses UTF-8 Byte as unigram to process text. It needs little preprocessing, more pure and language independent.

Bindings

Quick Example using Python

from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]