docs: move tokenization links to the top of README
danbev committed Aug 27, 2024
1 parent cad2af0 commit e5c2aaa
Showing 1 changed file with 9 additions and 8 deletions.
17 changes: 9 additions & 8 deletions notes/tokenization/README.md
@@ -7,6 +7,15 @@ model. There might be a configuration file in addition to this that specifies
the type of tokenizing that the model uses, like Byte-Pair Encoding (BPE),
WordPiece, SentencePiece, or Unigram, etc.

#### Tokenization notes
The following notes are individual walkthroughs of the tokenization process for
different tokenization types in llama.cpp:

* [Byte Pair Encoding (BPE)](./bpe.md)
* [WordPiece](./wordpiece.md) TODO
* [SentencePiece](./sentencepiece.md)
* [Unigram](./unigram.md) TODO

### Tokenization in llama.cpp
Llama.cpp supports the following types of tokenization:
```c
@@ -295,11 +304,3 @@ $56 = std::forward_list = {
[1] = {type = FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT, token = -1, _dummy = "", raw_text = "<s>What is LoRA?</s>", offset = 3, length = 13},
[2] = {type = FRAGMENT_BUFFER_VARIANT_TYPE_TOKEN, token = 2, _dummy = "", raw_text = "", offset = 0, length = 0}}
```
#### Tokenization notes
The following notes are individual walkthroughs of the tokenization process for
different tokenization types in llama.cpp:

* [Byte Pair Encoding (BPE)](./bpe.md)
* [WordPiece](./wordpiece.md) TODO
* [SentencePiece](./sentencepiece.md)
* [Unigram](./unigram.md) TODO
