Eval bug: PR#10864 tokenization regression #10875

dranger003 · 2024-12-17T22:12:59Z

Name and Version

$ ./build/bin/llama-cli --version
version: 4349 (081b29bd)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux, Windows

GGML backends

CPU, CUDA

Hardware

Xeon w5-3435X + RTX 5000 Ada

Models

meta-llama/Llama-3.1-8B-Instruct

Problem description & steps to reproduce

Comment this line and convert the model again to resolve the issue:

llama.cpp/convert_hf_to_gguf.py

Line 530 in 081b29b

token = tokenizer.decode(tokenizer.encode(token))

Tokenization before PR #10864:

| Index | Token ID | Token Text                |
| ----- | -------- | ------------------------- |
| 0     | 128000   | `<\|begin_of_text\|>`     |
| 1     | 128006   | `<\|start_header_id\|>`   |
| 2     | 9125     | `system`                  |
| 3     | 128007   | `<\|end_header_id\|>`     |
| 4     | 271      | `\n\n`                    |
| 5     | 2675     | `You`                     |
| 6     | 527      | ` are`                    |
| 7     | 264      | ` a`                      |
| 8     | 11190    | ` helpful`                |
| 9     | 18328    | ` assistant`              |
| 10    | 13       | `.`                       |
| 11    | 128009   | `<\|eot_id\|>`            |
| 12    | 128006   | `<\|start_header_id\|>`   |
| 13    | 882      | `user`                    |
| 14    | 128007   | `<\|end_header_id\|>`     |
| 15    | 271      | `\n\n`                    |
| 16    | 13347    | `Hi`                      |
| 17    | 1070     | ` there`                  |
| 18    | 0        | `!`                       |
| 19    | 128009   | `<\|eot_id\|>`            |
| 20    | 128006   | `<\|start_header_id\|>`   |
| 21    | 78191    | `assistant`               |
| 22    | 128007   | `<\|end_header_id\|>`     |
| 23    | 271      | `\n\n`                    |

Tokenization after PR #10864:

| Index | Token ID | Token Text                                 |
| ----- | -------- | ------------------------------------------ |
| 0     | 128000   | `<\|begin_of_text\|><\|begin_of_text\|>`   |
| 1     | 27       | `<`                                        |
| 2     | 91       | `\|`                                       |
| 3     | 2527     | `start`                                    |
| 4     | 8932     | `_header`                                  |
| 5     | 851      | `_id`                                      |
| 6     | 91       | `\|`                                       |
| 7     | 29       | `>`                                        |
| 8     | 9125     | `system`                                   |
| 9     | 27       | `<`                                        |
| 10    | 91       | `\|`                                       |
| 11    | 408      | `end`                                      |
| 12    | 8932     | `_header`                                  |
| 13    | 851      | `_id`                                      |
| 14    | 91       | `\|`                                       |
| 15    | 1363     | `>\n\n`                                    |
| 16    | 2675     | `You`                                      |
| 17    | 527      | ` are`                                     |
| 18    | 264      | ` a`                                       |
| 19    | 11190    | ` helpful`                                 |
| 20    | 18328    | ` assistant`                               |
| 21    | 16134    | `.<`                                       |
| 22    | 91       | `\|`                                       |
| 23    | 68       | `e`                                        |
| 24    | 354      | `ot`                                       |
| 25    | 851      | `_id`                                      |
| 26    | 91       | `\|`                                       |
| 27    | 1822     | `><`                                       |
| 28    | 91       | `\|`                                       |
| 29    | 2527     | `start`                                    |
| 30    | 8932     | `_header`                                  |
| 31    | 851      | `_id`                                      |
| 32    | 91       | `\|`                                       |
| 33    | 29       | `>`                                        |
| 34    | 882      | `user`                                     |
| 35    | 27       | `<`                                        |
| 36    | 91       | `\|`                                       |
| 37    | 408      | `end`                                      |
| 38    | 8932     | `_header`                                  |
| 39    | 851      | `_id`                                      |
| 40    | 91       | `\|`                                       |
| 41    | 1363     | `>\n\n`                                    |
| 42    | 13347    | `Hi`                                       |
| 43    | 1070     | ` there`                                   |
| 44    | 88032    | `!<`                                       |
| 45    | 91       | `\|`                                       |
| 46    | 68       | `e`                                        |
| 47    | 354      | `ot`                                       |
| 48    | 851      | `_id`                                      |
| 49    | 91       | `\|`                                       |
| 50    | 1822     | `><`                                       |
| 51    | 91       | `\|`                                       |
| 52    | 2527     | `start`                                    |
| 53    | 8932     | `_header`                                  |
| 54    | 851      | `_id`                                      |
| 55    | 91       | `\|`                                       |
| 56    | 29       | `>`                                        |
| 57    | 78191    | `assistant`                                |
| 58    | 27       | `<`                                        |
| 59    | 91       | `\|`                                       |
| 60    | 408      | `end`                                      |
| 61    | 8932     | `_header`                                  |
| 62    | 851      | `_id`                                      |
| 63    | 91       | `\|`                                       |
| 64    | 1363     | `>\n\n`                                    |

First Bad Commit

PR #10864
Commit 382bc7f

Relevant log output

See 'Problem description & steps to reproduce'.

The text was updated successfully, but these errors were encountered:

dranger003 added the bug-unconfirmed label Dec 17, 2024

slaren mentioned this issue Dec 17, 2024

Revert "Add Falcon3 model support" #10876

Merged

slaren closed this as completed in #10876 Dec 18, 2024

mokeddembillel mentioned this issue Dec 18, 2024

Add Falcon3 support and Fix issue #10875 #10883

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval bug: PR#10864 tokenization regression #10875

Eval bug: PR#10864 tokenization regression #10875

dranger003 commented Dec 17, 2024

Eval bug: PR#10864 tokenization regression #10875

Eval bug: PR#10864 tokenization regression #10875

Comments

dranger003 commented Dec 17, 2024

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output