Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eval bug: PR#10864 tokenization regression #10875

Closed
dranger003 opened this issue Dec 17, 2024 · 0 comments · Fixed by #10876
Closed

Eval bug: PR#10864 tokenization regression #10875

dranger003 opened this issue Dec 17, 2024 · 0 comments · Fixed by #10876

Comments

@dranger003
Copy link
Contributor

Name and Version

$ ./build/bin/llama-cli --version
version: 4349 (081b29bd)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux, Windows

GGML backends

CPU, CUDA

Hardware

Xeon w5-3435X + RTX 5000 Ada

Models

meta-llama/Llama-3.1-8B-Instruct

Problem description & steps to reproduce

Comment this line and convert the model again to resolve the issue:

token = tokenizer.decode(tokenizer.encode(token))

Tokenization before PR #10864:

| Index | Token ID | Token Text                |
| ----- | -------- | ------------------------- |
| 0     | 128000   | `<\|begin_of_text\|>`     |
| 1     | 128006   | `<\|start_header_id\|>`   |
| 2     | 9125     | `system`                  |
| 3     | 128007   | `<\|end_header_id\|>`     |
| 4     | 271      | `\n\n`                    |
| 5     | 2675     | `You`                     |
| 6     | 527      | ` are`                    |
| 7     | 264      | ` a`                      |
| 8     | 11190    | ` helpful`                |
| 9     | 18328    | ` assistant`              |
| 10    | 13       | `.`                       |
| 11    | 128009   | `<\|eot_id\|>`            |
| 12    | 128006   | `<\|start_header_id\|>`   |
| 13    | 882      | `user`                    |
| 14    | 128007   | `<\|end_header_id\|>`     |
| 15    | 271      | `\n\n`                    |
| 16    | 13347    | `Hi`                      |
| 17    | 1070     | ` there`                  |
| 18    | 0        | `!`                       |
| 19    | 128009   | `<\|eot_id\|>`            |
| 20    | 128006   | `<\|start_header_id\|>`   |
| 21    | 78191    | `assistant`               |
| 22    | 128007   | `<\|end_header_id\|>`     |
| 23    | 271      | `\n\n`                    |

Tokenization after PR #10864:

| Index | Token ID | Token Text                                 |
| ----- | -------- | ------------------------------------------ |
| 0     | 128000   | `<\|begin_of_text\|><\|begin_of_text\|>`   |
| 1     | 27       | `<`                                        |
| 2     | 91       | `\|`                                       |
| 3     | 2527     | `start`                                    |
| 4     | 8932     | `_header`                                  |
| 5     | 851      | `_id`                                      |
| 6     | 91       | `\|`                                       |
| 7     | 29       | `>`                                        |
| 8     | 9125     | `system`                                   |
| 9     | 27       | `<`                                        |
| 10    | 91       | `\|`                                       |
| 11    | 408      | `end`                                      |
| 12    | 8932     | `_header`                                  |
| 13    | 851      | `_id`                                      |
| 14    | 91       | `\|`                                       |
| 15    | 1363     | `>\n\n`                                    |
| 16    | 2675     | `You`                                      |
| 17    | 527      | ` are`                                     |
| 18    | 264      | ` a`                                       |
| 19    | 11190    | ` helpful`                                 |
| 20    | 18328    | ` assistant`                               |
| 21    | 16134    | `.<`                                       |
| 22    | 91       | `\|`                                       |
| 23    | 68       | `e`                                        |
| 24    | 354      | `ot`                                       |
| 25    | 851      | `_id`                                      |
| 26    | 91       | `\|`                                       |
| 27    | 1822     | `><`                                       |
| 28    | 91       | `\|`                                       |
| 29    | 2527     | `start`                                    |
| 30    | 8932     | `_header`                                  |
| 31    | 851      | `_id`                                      |
| 32    | 91       | `\|`                                       |
| 33    | 29       | `>`                                        |
| 34    | 882      | `user`                                     |
| 35    | 27       | `<`                                        |
| 36    | 91       | `\|`                                       |
| 37    | 408      | `end`                                      |
| 38    | 8932     | `_header`                                  |
| 39    | 851      | `_id`                                      |
| 40    | 91       | `\|`                                       |
| 41    | 1363     | `>\n\n`                                    |
| 42    | 13347    | `Hi`                                       |
| 43    | 1070     | ` there`                                   |
| 44    | 88032    | `!<`                                       |
| 45    | 91       | `\|`                                       |
| 46    | 68       | `e`                                        |
| 47    | 354      | `ot`                                       |
| 48    | 851      | `_id`                                      |
| 49    | 91       | `\|`                                       |
| 50    | 1822     | `><`                                       |
| 51    | 91       | `\|`                                       |
| 52    | 2527     | `start`                                    |
| 53    | 8932     | `_header`                                  |
| 54    | 851      | `_id`                                      |
| 55    | 91       | `\|`                                       |
| 56    | 29       | `>`                                        |
| 57    | 78191    | `assistant`                                |
| 58    | 27       | `<`                                        |
| 59    | 91       | `\|`                                       |
| 60    | 408      | `end`                                      |
| 61    | 8932     | `_header`                                  |
| 62    | 851      | `_id`                                      |
| 63    | 91       | `\|`                                       |
| 64    | 1363     | `>\n\n`                                    |

First Bad Commit

PR #10864
Commit 382bc7f

Relevant log output

See 'Problem description & steps to reproduce'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant