diff --git a/notes/llama-main.md b/notes/llama-main.md index dd10d2fa..b01fae4f 100644 --- a/notes/llama-main.md +++ b/notes/llama-main.md @@ -63,10 +63,10 @@ context length, and the relative positions it learned where within this range. During inference if the relative positions are larger than this range the LLM will have out-of-distribution (O.O.D) problem. -Self-extend is a way map these relative positions that are larger than the +Self-extend is a way to map these relative positions that are larger than the context length that the model was trained on at _inference_ time. -These are options related to Grouped Query Attention: +These are the options related to Grouped Query Attention (gqa): ```console $ ./llama-cli --help | grep group -gan, --grp-attn-n N group-attention factor (default: 1) @@ -76,7 +76,7 @@ $ ./llama-cli --help | grep group to divide the relative positions into groups. The default is 1 which means that the relative positions are not divided into groups. -`grp-attn-w` specifies the total width of tokens used in group attention. So +`grp-attn-w` specifies the total width of tokens used in self-extend. So normally in the attention relative position "encoding/calculation" the positions that are outside of the context length that the model was trained can cause issues for the model because it was not trained on these positions. The limit @@ -88,12 +88,269 @@ relative positions that are larger than the context length that the model was trained on. And this is one during inference as opposed to other methods like LongRoPE which are done during fine-tuning. +Lets try this out with the following input prompt file: +```console +$ ./run-tokenize.sh self-extend.txt +Total number of tokens: 7038 +``` +And check the module context length which is the length that the model was +trained on: +```console +$ ./inspect-model.sh models/llama-2-7b.Q4_0.gguf +INFO:gguf-dump:* Loading: models/llama-2-7b.Q4_0.gguf +* File is LITTLE endian, script is running on a LITTLE endian host. +* Dumping 22 key/value pair(s) + 1: UINT32 | 1 | GGUF.version = 2 + 2: UINT64 | 1 | GGUF.tensor_count = 291 + 3: UINT64 | 1 | GGUF.kv_count = 19 + 4: STRING | 1 | general.architecture = 'llama' + 5: STRING | 1 | general.name = 'LLaMA v2' + 6: UINT32 | 1 | llama.context_length = 4096 + 7: UINT32 | 1 | llama.embedding_length = 4096 + 8: UINT32 | 1 | llama.block_count = 32 + 9: UINT32 | 1 | llama.feed_forward_length = 11008 + 10: UINT32 | 1 | llama.rope.dimension_count = 128 + 11: UINT32 | 1 | llama.attention.head_count = 32 + 12: UINT32 | 1 | llama.attention.head_count_kv = 32 + 13: FLOAT32 | 1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06 + 14: UINT32 | 1 | general.file_type = 2 + 15: STRING | 1 | tokenizer.ggml.model = 'llama' + 16: [STRING] | 32000 | tokenizer.ggml.tokens + 17: [FLOAT32] | 32000 | tokenizer.ggml.scores + 18: [INT32] | 32000 | tokenizer.ggml.token_type + 19: UINT32 | 1 | tokenizer.ggml.bos_token_id = 1 + 20: UINT32 | 1 | tokenizer.ggml.eos_token_id = 2 + 21: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 0 + 22: UINT32 | 1 | general.quantization_version = 2 +``` +So the model was trained on a context length of `4096` tokens, and we are going +to use an input prompt of size `7038` tokens. + +With self-extend we then have to decide what values to set for `grp-attn-n` and +`grp-attn-w`. 
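+
+The following is my own standalone sketch (not code from llama.cpp) of what
+these two options do to the token positions, assuming `ga_w` is a multiple of
+`ga_n`: every run of `ga_n` consecutive tokens ends up sharing one position,
+so a window of `ga_w` tokens is compressed into `ga_w / ga_n` distinct
+positions.
+```c++
+#include <cstdio>
+
+int main() {
+    const int ga_n = 4;   // group-attention factor (--grp-attn-n)
+    const int ga_w = 128; // group-attention width  (--grp-attn-w)
+
+    for (int pos = 0; pos < 2 * ga_w; ++pos) {
+        const int grouped = pos / ga_n; // floor(pos / ga_n)
+        if (pos % 32 == 0) {
+            printf("position %3d -> grouped position %2d (window %d of width %d)\n",
+                   pos, grouped, pos / ga_w, ga_w);
+        }
+    }
+    return 0;
+}
+```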
+
+```console
+./llama-cli -m models/llama-2-7b.Q4_0.gguf -ngl 10 -f self-extend.txt -c 8000 --temp 1 -n 200 --grp-attn-n 4 --grp-attn-w 128
+```
+So this means that we are going to take each window of 128 tokens and divide
+the positions in it by 4, so that every 4 consecutive tokens share a position
+and the 128 tokens are compressed into 32 positions.
+```
+  0                           127                          255
+  +---------------------------+----------------------------+
+  [0-3][4-7][8-11] ... [124-127][128-131][132-135] ... [252-255]
+  [grp0][grp1][grp2] ... [grp31 ][grp32  ][grp33  ] ... [grp63  ]
+
+grp0 = all 4 tokens will have position 0
+grp1 = all 4 tokens will have position 1
+grp2 = all 4 tokens will have position 2
+...
+grp31 = all 4 tokens will have position 31
+```
+In this case we will be processing a batch of 2048 tokens, which is also the
+context length of the model we are using. So additional tokens would have a
+position that is larger than the context length that the model was trained on.
+So we want to map these positions into the range the model has seen during
+training so that the attention mechanism can work as intended.
+
+If we use a `grp-attn-n` of 2 and a `grp-attn-w` of 2048 we have a window
+width of 2048:
+```
+[1720595163] n_past = 2048
+[1720595163] embd_inp.size(): 7037, n_consumed: 2048
+[1720595163]
+[1720595163] shift: [ 0, 2048] + 0 -> [ 0, 2048]
+[1720595163] div: [ 0, 2048] / 2 -> [ 0, 1024]
+[1720595163] shift: [ 2048, 2048] + -1024 -> [ 1024, 1024]
+```
+Each position in this range will be mapped using a division by 2. For example:
+```
+(gdb) p ctx.kv_self.cells[0]
+$23 = {pos = 0, delta = 0, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1]
+$24 = {pos = 0, delta = -1, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[2]
+$25 = {pos = 1, delta = -1, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[3]
+$26 = {pos = 1, delta = -2, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[4]
+$27 = {pos = 2, delta = -2, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[5]
+$28 = {pos = 2, delta = -3, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[2044]
+$32 = {pos = 1022, delta = -1022, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[2045]
+$31 = {pos = 1022, delta = -1023, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[2046]
+$30 = {pos = 1023, delta = -1023, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[2047]
+$29 = {pos = 1023, delta = -1024, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+```
+If we use a `grp-attn-n` of 4 we would get:
+```console
+[1720596142] n_past = 2048
+[1720596142] embd_inp.size(): 7037, n_consumed: 2048
+[1720596142]
+[1720596142] shift: [ 0, 2048] + 0 -> [ 0, 2048]
+[1720596142] div: [ 0, 2048] / 4 -> [ 0, 512]
+[1720596142] shift: [ 2048, 2048] + -1536 -> [ 512, 512]
+
+(gdb) p ctx.kv_self.cells[0]
+$2 = {pos = 0, delta = 0, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1]
+$3 = {pos = 0, delta = -1, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[2]
+$4 = {pos = 0, delta = -2, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[3]
+$5 = {pos = 0, delta = -3, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[4]
+$6 = {pos = 1, delta = -3, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[5]
+$7 = {pos = 1, delta = -4, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[6]
+$8 = {pos = 1, delta = -5, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[7]
+$9 = {pos = 1, delta = -6, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+```
+So `ga_n` gives us control over how many tokens will have the same position:
+with a `grp-attn-n` of 2 two tokens share a position, and with a `grp-attn-n`
+of 4 four tokens do.
+
+Can we accomplish the same thing with a smaller window size?
+For example, setting the group window length/width to 1024:
+```console
+[1720601833] n_past = 2048
+[1720601833] embd_inp.size(): 7037, n_consumed: 2048
+[1720601833]
+[1720601833] shift: [ 0, 2048] + 0 -> [ 0, 2048]
+[1720601833] div: [ 0, 1024] / 2 -> [ 0, 512]
+[1720601833] shift: [ 1024, 2048] + -512 -> [ 512, 1536]
+
+(gdb) p ctx.kv_self.cells[0]
+$4 = {pos = 0, delta = 0, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1]
+$5 = {pos = 0, delta = -1, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[2]
+$6 = {pos = 1, delta = -1, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[3]
+$7 = {pos = 1, delta = -2, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[4]
+$8 = {pos = 2, delta = -2, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[5]
+$9 = {pos = 2, delta = -3, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[1022]
+$15 = {pos = 511, delta = -511, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1023]
+$12 = {pos = 511, delta = -512, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[1024]
+$13 = {pos = 1024, delta = 0, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1025]
+$14 = {pos = 1025, delta = 0, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+```
+Notice that this time only the positions up to 1024 are mapped, but we will
+go through the loop once more:
+```console
+n_past_old = 2048, n_past = 1536, ga_i = 512
+
+[1720602118]
+[1720602119] shift: [ 512, 1536] + 512 -> [ 1024, 2048]
+[1720602120] div: [ 1024, 2048] / 2 -> [ 512, 1024]
+[1720602120] shift: [ 2048, 2048] + -1024 -> [ 1024, 1024]
+
+(gdb) p ctx.kv_self.cells[1024]
+$21 = {pos = 512, delta = -512, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1025]
+$22 = {pos = 512, delta = -513, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[1026]
+$23 = {pos = 513, delta = -513, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1027]
+$24 = {pos = 513, delta = -514, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+
+(gdb) p ctx.kv_self.cells[1028]
+$25 = {pos = 514, delta = -514, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+(gdb) p ctx.kv_self.cells[1029]
+$26 = {pos = 514, delta = -515, src = 0, seq_id = std::set with 1 element = {[0] = 0}}
+```
+So we've accomplished the exact same result with regards to the grouping. I'm
+finding it difficult to understand why one would want to have a smaller window
+size than the context length that the model was trained on.
+
+One thing to note is that we only enter the self-extend block if `n_past` is
+greater than or equal to `ga_i + ga_w`:
+```c++
+    while (n_past >= ga_i + ga_w) {
+```
+
+So `ga_n`, which is the grouping factor, is what determines how many times the
+context length gets extended.
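+
+As a sanity check of the claim above that the smaller window produces the same
+grouping, the following is a standalone simulation (my own sketch, not
+llama.cpp code). The `seq_add`/`seq_div` helpers are stand-ins for
+`llama_kv_cache_seq_add`/`llama_kv_cache_seq_div`, and the loop mirrors the
+shift/div/shift pattern visible in the log output above:
+```c++
+#include <cstdio>
+#include <vector>
+
+// Shift/divide every position that falls in the range [p0, p1).
+static void seq_add(std::vector<int> & pos, int p0, int p1, int delta) {
+    for (int & p : pos) if (p >= p0 && p < p1) p += delta;
+}
+static void seq_div(std::vector<int> & pos, int p0, int p1, int d) {
+    for (int & p : pos) if (p >= p0 && p < p1) p /= d;
+}
+
+static void apply_self_extend(std::vector<int> & pos, int n_past, int ga_n, int ga_w) {
+    int ga_i = 0;
+    while (n_past >= ga_i + ga_w) {
+        const int ib = (ga_n * ga_i) / ga_w;
+        const int bd = (ga_w / ga_n) * (ga_n - 1);
+        const int dd = (ga_w / ga_n) - ib * bd - ga_w;
+
+        seq_add(pos, ga_i, n_past, ib * bd);                        // first shift
+        seq_div(pos, ga_i + ib * bd, ga_i + ib * bd + ga_w, ga_n);  // div
+        seq_add(pos, ga_i + ib * bd + ga_w, n_past + ib * bd, dd);  // second shift
+
+        n_past -= bd;
+        ga_i   += ga_w / ga_n;
+    }
+}
+
+int main() {
+    std::vector<int> a(2048), b(2048);
+    for (int i = 0; i < 2048; ++i) { a[i] = i; b[i] = i; }
+
+    apply_self_extend(a, 2048, 2, 2048); // one pass with the full window
+    apply_self_extend(b, 2048, 2, 1024); // two passes with half the window
+
+    printf("same final positions: %s\n", a == b ? "yes" : "no"); // prints yes
+    return 0;
+}
+```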
+
+So the following would enable a context length of 4096 (2x2048):
+```
+--grp-attn-n 2 --grp-attn-w 2048
+```
+Notice that for positions up to 4095 we are alright, but above that the
+positions will be:
+```
+4096 / 2 = 2048
+4097 / 2 = 2048
+4098 / 2 = 2049
+4099 / 2 = 2049
+```
+And these are outside of the context length that this model was trained on.
+
+So to be able to support larger context lengths the `grp-attn-n` and
+`grp-attn-w` options can be used. A `gan` value of 2 and a `gaw` value of
+2048 would mean that we can handle a context length of 4096 (doubling the
+context). If we need larger we can increase one or both (depending on the model
+used and the context length it was trained on) of these values.
+
+In this case the context length the model was trained on was 2048 so that is
+the max value we can specify for `grp-attn-w`, since the positions get divided
+by the grouping factor:
+```
+4096 / 3 = 1365
+4097 / 3 = 1365
+4098 / 3 = 1366
+4099 / 3 = 1366
+4100 / 3 = 1366
+
+4101 / 3 = 1367
+4102 / 3 = 1367
+4103 / 3 = 1367
+
+6144 / 3 = 2048
+```
+So this would enable a context of 3x2048=6144 tokens, and we can increase
+`gan` further to allow an even longer context.
+
+If we configure self-extend to have many groups, more positions are mapped to
+the same position, and the attention mechanism might not be able to handle
+this (it can mess up the attention scores depending on the model used).
+
+I think a good default would be to set the width to the context length that
+the model was trained on, and the number of groups to 2.
+
+The self attention with the `floor` operation is called "grouped attention".
+
+The positions get mapped using the following operation:
 ```
 floor(pos / ga_n)
 ```
-`ga_n` is the number group that is used in the floor operation and determines
-how many groups the relative positions are divided into. The default is 1 which
-means that the relative positions are not divided into groups.
+`ga_n` is the number that is used in the floor operation and determines how many
+groups the relative positions are divided into. The default is 1 which means
+that the relative positions are not divided into groups.
 
 In the following we will have decoded the prompt, which will have populated the
 `kv_cache`:
@@ -155,6 +412,8 @@ while (n_past >= ga_i + ga_w) {
         LOG("\nn_past_old = %d, n_past = %d, ga_i = %d\n\n", n_past + bd, n_past, ga_i);
     }
 ```
+So we have specified that `ga_w` is 4 and `ga_n` is 2. To get the number
+of groups we divide `ga_w` by `ga_n` which is 2. So we will have 2 groups.
 
 ```c+
     const int ib = (ga_n * ga_i) / ga_w;
@@ -228,6 +487,7 @@ than 512 can be dealt with within the confines of this trained range.
    llama_kv_cache_seq_div(ctx, 0, ga_i + ib*bd, ga_i + ib*bd + ga_w, ga_n);
 ```
+The first thing that happens is that `llama_kv_cache_seq_add` is called:
 ```c++
     llama_kv_cache_seq_add(ctx, 0, ga_i, n_past, ib*bd);
     llama_kv_cache_seq_add(ctx, 0, 0,    5,      0);
@@ -246,14 +506,13 @@ void llama_kv_cache_seq_add(struct llama_context * ctx,
 }
 ```
 Notice that for this first case delta will be 0 so it will just return as there
-is nothing to add.
+is nothing to add. I'll return to this function later and explain it.
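+
+As a quick preview (a simplified standalone sketch, not the real function),
+the effect I would expect `llama_kv_cache_seq_add` to have on the cache is:
+every cell whose position lies in `[p0, p1)` gets shifted by `delta`, and the
+shift is also accumulated in the cell's `delta` field so that the cached K
+values can be re-roped later:
+```c++
+#include <cstdio>
+#include <vector>
+
+struct cell { int pos; int delta; };
+
+// Stand-in for llama_kv_cache_seq_add: shift cells whose position is in [p0, p1).
+static void seq_add(std::vector<cell> & cells, int p0, int p1, int delta) {
+    if (delta == 0) {
+        return; // nothing to add, which is the case for the first call above
+    }
+    for (auto & c : cells) {
+        if (c.pos >= p0 && c.pos < p1) {
+            c.pos   += delta;
+            c.delta += delta;
+        }
+    }
+}
+
+int main() {
+    std::vector<cell> cells(8);
+    for (int i = 0; i < 8; ++i) cells[i] = {i, 0};
+
+    seq_add(cells, 0, 5, 0); // the first call above: delta is 0, nothing happens
+    seq_add(cells, 4, 8, 2); // a later call with a non-zero delta
+
+    for (const auto & c : cells) printf("pos = %d, delta = %d\n", c.pos, c.delta);
+    return 0;
+}
+```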
Next we have the division (the floor operation): ```c++ seq_id, p0 p1 d llama_kv_cache_seq_div(ctx, 0 , ga_i + ib*bd, ga_i + ib*bd + ga_w, ga_n); llama_kv_cache_seq_div(ctx, 0 , 0 + 0 , 0 + 0 + 4 , 2); - ``` Now notice how we are using `ga_i` + `ib*bd` ``` @@ -336,7 +595,6 @@ Note that this is /= so we are performing the division and then assigning the result back to `cache.cells[i].pos`. So this is where we adjust the position and we are using `ga_n`: ```c++ - ↓ llama_kv_cache_seq_div(ctx, 0, ga_i + ib*bd, ga_i + ib*bd + ga_w, ga_n); ``` For the next iteration we will have: @@ -425,8 +683,8 @@ $32 = 4 (gdb) p p1 $33 = 5 ``` -So the first 4 cells will not be in the range but the 4 will. This will then -set he `has_shift` to true and adjust. +So the first 4 cells will not be in the range but the 4th will. This will then +set the `has_shift` to true and adjust. The current position of this cell is 4: ```console (gdb) p cache.cells[i].pos @@ -485,7 +743,8 @@ Back in the main while look we then have: ``` Now, this is interesting `n_past` is currently 5 and `bd` is 2 so `n_past` will be updated to 3. So instead of having the position of the next token be 5 it -has become 3? +has become 3. + And `ga_i` will be updated to become 2 (the group size) So the next time we call `llama_decode` `n_past` will be 3: @@ -495,7 +754,6 @@ if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) { So even though after the prompt was decoded, after which `n_past` was 5 it is now 3. - After a few decodes `n_past` will again be greater than or equal to `ga_i + ga_w` and this time the first add will be entered. @@ -557,7 +815,7 @@ Is this done because the positions were added using `n_past` which was 3 and then incremented, to the new cells to the positions 3, 4, 5. This add is adjusting them to be in the order prior to the adjustment of `n_past`. Notice that they are now incremental from the first position, But not in the grouping -but that Is think will be handled by the next division operation: +but that will be handled by the next division operation: ```console (gdb) s llama_kv_cache_seq_div (ctx=0x5555561fa4f0, seq_id=0, p0=4, p1=8, d=2) at src/llama.cpp:18218 @@ -904,14 +1162,12 @@ INFO:gguf-dump:* Loading: /home/danbev/.cache/lm-studio/models/TheBloke/TinyLlam #### Testing: -I created a text files with a context length by downloading a book from the -Gutenberg project: +I created a text file by downloading a book from the Gutenberg project: ```console $ wget https://www.gutenberg.org/cache/epub/1184/pg1184.txt ``` I then just kept the first ~600 lines and saved it in a file named -`self-extend.txt. -We can inspect the number of tokens in this files using: +`self-extend.txt`. We can inspect the number of tokens in this files using: ```console ./llama-tokenize -m models/llama-2-7b.Q4_0.gguf -f self-extend.txt --show-count Total number of tokens: 7038 @@ -939,15 +1195,16 @@ INFO:gguf-dump:* Loading: models/llama-2-7b.Q4_0.gguf 15: STRING | 1 | tokenizer.ggml.model = 'llama' ... ``` +And notice that this model was trained using a context length of 4096 tokens. -First, lsts run this without any self-extend options: +First, lets run this without any self-extend options: ```console $ ./llama-cli -m models/llama-2-7b.Q4_0.gguf -ngl 10 -f self-extend.txt -c 8192 --temp 0 -n 256 ``` This will load the prompt without any issues and initially generation looks -somewhat ok, but then it starts to generate gibberish just a bunch or new lines. 
+somewhat ok, but then it starts to generate gibberish (just a bunch or new lines). -Next we try with the self-extend options: +Next, lets try with the self-extend options: ```console $ /llama-cli -m models/llama-2-7b.Q4_0.gguf -ngl 10 -f self-extend.txt -c 8192 --temp 0 -n 256 --grp-attn-n 4 --grp-attn-w 256 ``` @@ -1056,7 +1313,6 @@ but this is done as a fine tuning step after the model has been trained. Self-Extend does not require fine tuning. - ### ... The kv-cache is updated by `llama_decode_internal`: ```c++ @@ -1085,7 +1341,7 @@ The kv-cache is updated by `llama_decode_internal`: } ``` Now, if there has been some update to the kv-cache, like setting the `has_shift` -flag or the `do_copy`the `llama_kv_cache_update` will performs updates. For the +flag or the `do_copy`the `llama_kv_cache_update` will perform updates. For the initial prompt this will not be the case. So this would not do anything. The `kv_self.head` and `kv_self.used` will also be 0 at this point. Next we have `llama_kv_cache_find_slot` which will find a slot for the tokens @@ -1137,6 +1393,10 @@ static bool llama_kv_cache_find_slot( return true; ``` +Notice that the if statement in the while true loop checking that the number +of tokens will fit in the cache. If not, the head will be reset to 0 and the +loop will continue. And notice that `n_tested` is updated to the size of the +cache minus the head. ```console (gdb) p cache.head @@ -1158,7 +1418,9 @@ $11 = 8192 ``` So the if statement looping over the number of tokens in the batch and checking if the position in that cell is greater than or equal to 0 which means that is -not empty (-1). This is not the case so we will break out of the loop. +not empty (-1). + +This is not the case so we will break out of the loop. So this is really checking that the cells at the current head are empty. So `found` will still be true in this case and we will break out of the loop. @@ -1172,11 +1434,15 @@ Next, we will iterate over all the tokens in the batch. } } ``` +Also, notice that the positions are the positions as they are in the batch, so +there is nothing related to self-extend here! + And update the position and the sequence id for each token in the batch. After that cache.used will be updated and then we return true: ```c++ cache.used += n_tokens; ``` + Back in `llama_decode_internal` we have: ```c++ @@ -1204,8 +1470,11 @@ $20 = 8192 (gdb) p kv_self.n $21 = 32 ``` + After the ggml compute graph has been built and computed we end up in: ```c++ + ggml_cgraph * gf = llama_build_graph(lctx, u_batch, false); + llama_graph_compute(lctx, gf, n_threads); // update the kv ring buffer @@ -1218,9 +1487,77 @@ After the ggml compute graph has been built and computed we end up in: } } ``` -Here the cache is updated by the number of tokens in the batch. -Notice that if the head becomes greater that the size of the cache it will be -reset to 0. 
+`llama_graph_compute` will build the computation graph +Both the Query and the Key cached will be roped: +```c++ + Qcur = ggml_rope_ext( + ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr, + n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, + ext_factor, attn_factor, beta_fast, beta_slow + ); + cb(Qcur, "Qcur", il); + + Kcur = ggml_rope_ext( + ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos, nullptr, + n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, + ext_factor, attn_factor, beta_fast, beta_slow + ); +``` + +```console +(gdb) p *Qcur +$11 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {2048, 2, 1, 1}, nb = {4, 8192, + 16384, 16384}, op = GGML_OP_MUL_MAT, op_params = {0 }, flags = 0, grad = 0x0, src = { + 0x55555bd630e0, 0x7fffcf51e8a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, + data = 0x0, name = "Qcur-0", '\000' , extra = 0x0} +``` +In this case there batch only contains two tokens as this is the warmup decode +but that does not matter. We can see that we have something like: +``` + 0 2047 +0 [...........................................] +1 [...........................................] +``` +```console +(gdb) p n_embd_head +$12 = 64 +(gdb) p n_head +$13 = 32 +(gdb) p n_tokens +$14 = 2 + +(gdb) p *ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens) +$16 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {64, 32, 2, 1}, nb = {4, 256, + 8192, 16384}, op = GGML_OP_RESHAPE, op_params = {0 }, flags = 0, grad = 0x0, src = { + 0x7fffcf51ea10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x7fffcf51ea10, view_offs = 0, + data = 0x0, name = "Qcur-0 (reshaped)", '\000' , extra = 0x0} +``` +``` + 0 64 +0 [...........] + . + . / + . / +31 [...........]/ + 0 +32*64 = 2048 +``` +So this is setting up the computation graph and the above reshaped tensor will +later be updated with values./ + +```console +(gdb) p *inp_pos +$18 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {2, 1, 1, 1}, nb = {4, 8, 8, + 8}, op = GGML_OP_NONE, op_params = {0 }, flags = 1, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, + 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x0, + name = "inp_pos", '\000' , extra = 0x0} +``` + +```c++ + cur = llm_build_kv(ctx0, model, hparams, cparams, kv_self, gf, + model.layers[il].wo, model.layers[il].bo, + Kcur, Vcur, Qcur, KQ_mask, n_tokens, kv_head, n_kv, 1.0f/sqrtf(float(n_embd_head)), cb, il); +``` From `build_gemma2`: @@ -1373,8 +1710,11 @@ GGML_CALL static void ggml_backend_cpu_buffer_set_tensor( GGML_UNUSED(buffer); } ``` -So that took care of the token values, that is the ids of the tokens. Next we -will do something similar for the positions of the tokens in the batch: +So that took care of the token values, that is the ids (token ids) of the +tokens. + +Next we will do something similar for the positions of the tokens in the batch: +(still in `llama_set_inputs`); ```c if (batch.pos && lctx.inp_pos) { const int64_t n_tokens = batch.n_tokens; @@ -1399,6 +1739,109 @@ $83 = 6648929 $84 = 4 ``` +Next we have (note that `is_encoding` would be true is the model was and +encoder-decoder model like T5): +```c++ + if (lctx.inp_KQ_mask) { + // NOTE: hparams.causal_attn indicates the model is capable of generation and uses the kv cache. 
+ if (cparams.causal_attn && !lctx.is_encoding) { + const int64_t n_kv = kv_self.n; + const int64_t n_tokens = batch.n_tokens; + + GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_KQ_mask->buffer)); + + float * data = (float *) lctx.inp_KQ_mask->data; + float * data_swa = nullptr; + + if (lctx.inp_KQ_mask_swa) { + data_swa = (float *) lctx.inp_KQ_mask_swa->data; + } + + // For causal attention, use only the previous KV cells + // of the correct sequence for each token of the batch. + // It's assumed that if a token in the batch has multiple sequences, they are equivalent. + for (int h = 0; h < 1; ++h) { + for (int j = 0; j < n_tokens; ++j) { + const llama_pos pos = batch.pos[j]; + const llama_seq_id seq_id = batch.seq_id[j][0]; + + for (int i = 0; i < n_kv; ++i) { + float f; + if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) { + f = -INFINITY; + } else { + if (hparams.use_alibi) { + f = -fabs(lctx.kv_self.cells[i].pos - pos); + } else { + f = 0.0f; + } + } + data[h*(n_kv*n_tokens) + j*n_kv + i] = f; + + // may need to cut off old tokens for sliding window + if (data_swa) { + if (pos - lctx.kv_self.cells[i].pos >= (int32_t)hparams.n_swa) { + f = -INFINITY; + } + data_swa[h*(n_kv*n_tokens) + j*n_kv + i] = f; + } + } + } + + for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) { + for (int j = 0; j < n_kv; ++j) { + data[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY; + } + } + } +``` +The for loop with the `h` index looks a little odd to me. This index will be +inialized to 0 and then the loop will run once. This value, 0, is also used +in a few calculations in the code which could be remove as they will always be +zero. But lets think about what is happening here. The inner for loop is going +to iterate over all the tokens in the batch and then for each token it will +iterate over the number of kv_self.n which in this case is 32. 'f' will be 0.0f +in our case and then the inp_KQ_mask will be updated with that value: +```c++ + data[h*(n_kv*n_tokens) + j*n_kv + i] = f; +``` +But notice that `h*(n_kv*n_tokens)` will always be 0 and could possibly be +removed. +The next time through the loop i will be 1 and this will cause and the current +pos is 0, so the first if statement will be entered and f set to -INFINITY. And +this makes sense if we think about it. For the first token is should not attend +to any tokens ahead of it. So the next value in inp_KQ_mask will be -INFINITY. +And this will happen for all values up to n_kv (32). +This will build up a mask tensor matrix that looks likes something like this: +``` + 0 31 + +----+-----+-----+---------------------------+ + | 0 |~inf |~inf | ... | + | 0 | 0 |~inf | | + | | + | | + | | + | | + | | + | | + | | + +--------------------------------------------+ + 31 +``` +After that and having gone through and creating the mask for the tokens in the +batch there might be more slots in the mask matrix that need to be filled which +is why the following will start at n_tokens and for each them set the values +to ~inf: +```++ + for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i) { + for (int j = 0; j < n_kv; ++j) { + data[h*(n_kv*n_tokens) + i*n_kv + j] = -INFINITY; + } + } +``` +And in our case that is what llama_set_inputs does. + + ```console (gdb) p n_kv $95 = 32 @@ -1473,10 +1916,325 @@ it. After that all the tokens from `n_tokens` to the end will be set to -INFINITY and therefor masked out. 
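+
+To make the layout of the mask a bit more concrete, here is a small standalone
+toy version (my own sketch, not the llama.cpp code itself). It assumes the
+simplest possible situation: a single sequence where kv cell `i` holds
+position `i`, so a batch token at position `pos` may attend to cells 0..pos
+and everything after it is masked with `-INFINITY`:
+```c++
+#include <cmath>
+#include <cstdio>
+#include <vector>
+
+int main() {
+    const int n_tokens = 4; // tokens in the current batch
+    const int n_kv     = 6; // kv cells considered (kv_self.n)
+
+    // one row per batch token, one column per kv cell, same layout as
+    // data[j*n_kv + i] in the code above (the h term is always 0)
+    std::vector<float> data(n_tokens * n_kv);
+    for (int j = 0; j < n_tokens; ++j) {
+        const int pos = j;               // position of batch token j
+        for (int i = 0; i < n_kv; ++i) { // position of kv cell i is i
+            data[j*n_kv + i] = (i > pos) ? -INFINITY : 0.0f;
+        }
+    }
+
+    for (int j = 0; j < n_tokens; ++j) {
+        for (int i = 0; i < n_kv; ++i) {
+            printf("%5s ", data[j*n_kv + i] == 0.0f ? "0" : "-inf");
+        }
+        printf("\n");
+    }
+    return 0;
+}
+```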
-Back in `llama_decode_internal` we have and ready to compute the graph: +Back in `llama_decode_internal` we are now ready to compute the graph: ```c llama_set_inputs(lctx, u_batch); llama_graph_compute(lctx, gf, n_threads); ``` + +In llama_decode_internal we have the following function which comes before +llama_kv_cache_find_slot: +```c++ + + // non-causal masks do not use the KV cache + if (hparams.causal_attn) { + llama_kv_cache_update(&lctx); +``` +```c++ +void llama_kv_cache_update(struct llama_context * ctx) { + llama_kv_cache_update_internal(*ctx); +} + +static void llama_kv_cache_update_internal(struct llama_context & lctx) { + bool need_reserve = false; + + // apply K-shift if needed + if (lctx.model.hparams.rope_type != LLAMA_ROPE_TYPE_NONE && lctx.kv_self.has_shift) { + { + ggml_backend_sched_reset(lctx.sched); + + ggml_cgraph * gf = llama_build_graph_k_shift(lctx); + + ggml_backend_sched_alloc_graph(lctx.sched, gf); + + llama_set_k_shift(lctx); + + llama_graph_compute(lctx, gf, lctx.cparams.n_threads); + + need_reserve = true; + } + + { + auto & kv_self = lctx.kv_self; + + kv_self.has_shift = false; + + for (uint32_t i = 0; i < kv_self.size; ++i) { + kv_self.cells[i].delta = 0; + } + } + } +``` +```c++ +static struct ggml_cgraph * llama_build_graph_k_shift(llama_context & lctx) { + llama_batch dummy; + dummy.n_tokens = 0; + + llm_build_cb cb = [&](struct ggml_tensor * , const char * , int ) { }; + + struct llm_build_context llm(lctx, dummy, cb, false); + + llm.init(); + + struct ggml_cgraph * result = llm.build_k_shift(); + + llm.free(); + + return result; +} + + struct ggml_cgraph * build_k_shift() { + struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_MAX_NODES, false); + + GGML_ASSERT(kv_self.size == n_ctx); + + lctx.inp_K_shift = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_ctx); + cb(lctx.inp_K_shift, "K_shift", -1); + ggml_set_input(lctx.inp_K_shift); + + for (int il = 0; il < n_layer; ++il) { + const int64_t n_head_kv = hparams.n_head_kv(il); + const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa(il); + struct ggml_tensor * rope_factors = build_rope_factors(il); + struct ggml_tensor * tmp = + // we rotate only the first n_rot dimensions + ggml_rope_ext_inplace(ctx0, + ggml_view_3d(ctx0, kv_self.k_l[il], + n_embd_head_k, n_head_kv, n_ctx, + ggml_row_size(kv_self.k_l[il]->type, n_embd_head_k), + ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa), + 0), + lctx.inp_K_shift, rope_factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, + ext_factor, attn_factor, beta_fast, beta_slow); + + cb(tmp, "K_shifted", il); + ggml_build_forward_expand(gf, tmp); + } + + return gf; + } + + struct ggml_tensor * build_rope_factors(int il) { + // choose long/short freq factors based on the context size + const auto n_ctx_pre_seq = cparams.n_ctx / cparams.n_seq_max; + + if (n_ctx_pre_seq > hparams.n_ctx_orig_yarn) { + return model.layers[il].rope_long; + } + + return model.layers[il].rope_short; + } +``` +Is this `build_rope_factors` an impl. of LongRope? 
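+
+The `build_k_shift` graph above re-ropes the cached K tensors using the
+positions in `inp_K_shift`, which (as shown below) are filled with the
+per-cell `delta` values. The following is a standalone sketch (my own, not
+llama.cpp code) of why rotating an already-roped key by `delta` is equivalent
+to roping the original key at its new position: RoPE rotates a pair of
+dimensions by `pos * theta`, and rotations compose additively:
+```c++
+#include <cmath>
+#include <cstdio>
+
+// Rotate the 2-d pair (x, y) by the given angle, as RoPE does per dimension pair.
+static void rope2(float & x, float & y, float angle) {
+    const float x0 = x, y0 = y;
+    x = x0 * cosf(angle) - y0 * sinf(angle);
+    y = x0 * sinf(angle) + y0 * cosf(angle);
+}
+
+int main() {
+    const float theta   = 0.1f; // frequency for this dimension pair
+    const int   pos_old = 7;    // position the key was cached with
+    const int   delta   = -3;   // shift stored in kv_self.cells[i].delta
+    const int   pos_new = pos_old + delta;
+
+    float ax = 1.0f, ay = 2.0f;     // cached key, already roped at pos_old
+    rope2(ax, ay, pos_old * theta);
+    rope2(ax, ay, delta * theta);   // what the K-shift graph effectively does
+
+    float bx = 1.0f, by = 2.0f;     // the same key roped directly at pos_new
+    rope2(bx, by, pos_new * theta);
+
+    printf("shifted: (%f, %f)  direct: (%f, %f)\n", ax, ay, bx, by); // identical
+    return 0;
+}
+```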
+
+```console
+(gdb) p *ggml_view_3d(ctx0, kv_self.k_l[il],n_embd_head_k, n_head_kv, n_ctx, ggml_row_size(kv_self.k_l[il]->type, n_embd_head_k), ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa), 0)
+$79 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {64, 4, 8000, 1}, nb = {2, 128,
+    512, 4096000}, op = GGML_OP_VIEW, op_params = {0 }, flags = 0, grad = 0x0, src = {
+    0x55555bd47850, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x55555bd47850, view_offs = 0,
+    data = 0x7fffa08d2020, name = "cache_k_l0 (view)", '\000' , extra = 0x0}
+(gdb) p kv_self.k_l[il]
+$80 = (ggml_tensor *) 0x55555bd47850
+(gdb) p *kv_self.k_l[il]
+$81 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x55555bd39490, ne = {2048000, 1, 1, 1},
+    nb = {2, 4096000, 4096000, 4096000}, op = GGML_OP_NONE, op_params = {0 }, flags = 0,
+    grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0,
+    data = 0x7fffa08d2020, name = "cache_k_l0", '\000' , extra = 0x0}
+```
+
+After that we will have a call to `llama_set_k_shift`:
+```c++
+    ggml_cgraph * gf = llama_build_graph_k_shift(lctx);
+
+    ggml_backend_sched_alloc_graph(lctx.sched, gf);
+
+    llama_set_k_shift(lctx);
+```
+
+```c++
+static void llama_set_k_shift(llama_context & lctx) {
+    const int64_t kv_size = lctx.kv_self.size;
+
+    assert(ggml_backend_buffer_is_host(lctx.inp_K_shift->buffer));
+
+    int32_t * data = (int32_t *) lctx.inp_K_shift->data;
+
+    for (int i = 0; i < kv_size; ++i) {
+        data[i] = lctx.kv_self.cells[i].delta;
+    }
+}
+```
+Notice that this is getting the data member from the `inp_K_shift` tensor and
+then iterating through the number of cache elements. And it is using the delta
+that we updated earlier in the `ga_n` block! So I think this is how the deltas
+are used.
+TODO: take a closer look at how `inp_K_shift` is used in the computation
+graph. I actually missed this when going through the code above, but this
+tensor is used here:
+```c++
+        struct ggml_tensor * tmp =
+        // we rotate only the first n_rot dimensions
+            ggml_rope_ext_inplace(ctx0,
+                    ggml_view_3d(ctx0, kv_self.k_l[il],
+                        n_embd_head_k, n_head_kv, n_ctx,
+                        ggml_row_size(kv_self.k_l[il]->type, n_embd_head_k),
+                        ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa),
+                        0),
+                    lctx.inp_K_shift, rope_factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
+                    ext_factor, attn_factor, beta_fast, beta_slow);
+```
+The first tensor passed to `ggml_rope_ext_inplace` is the tensor to be rotated,
+and the second is the tensor containing the positions. This will be set as src1
+for this operation (remember that this is only setting up the computation graph
+and that the actual operation is performed later during the forward pass).
+
+Lets set a break point in `ggml_compute_forward_rope_f32` to see how the `b`
+tensor above is used:
+```console
+    const struct ggml_tensor * src0 = dst->src[0];
+    const struct ggml_tensor * src1 = dst->src[1];
+    ...
+
+    const int32_t * pos = (const int32_t *) src1->data;
+
+    for (int64_t i3 = 0; i3 < ne3; i3++) {
+        for (int64_t i2 = 0; i2 < ne2; i2++) {
+            const int64_t p = pos[i2];
+```
+So the above is looping over `src0.ne[3]`:
+```console
+(gdb) p src0.ne[3]
+$109 = 1
+```
+And then looping over `src0.ne[2]` which is 512.
+```console +(gdb) p *src0 +$105 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x55555bd3a0a0, ne = {64, 32, 512, 1}, + nb = {4, 256, 8192, 4194304}, op = GGML_OP_RESHAPE, op_params = {0 }, flags = 0, grad = 0x0, + src = {0x7fffcf51ea10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x7fffcf51ea10, view_offs = 0, + data = 0x7fff80cd1820, name = "Qcur-0 (reshaped)", '\000' , extra = 0x0} +``` + +```console +(gdb) p *src1 +$115 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x55555bd3a0a0, ne = {512, 1, 1, 1}, + nb = {4, 2048, 2048, 2048}, op = GGML_OP_NONE, op_params = {0 }, flags = 1, grad = 0x0, + src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, + data = 0x7fff7f530820, name = "inp_pos", '\000' , extra = 0x0} +``` + + + +### `ggml_rope_ext` +```c + Qcur = ggml_rope_ext( + ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos, nullptr, + n_rot, rope_type, n_ctx_orig, freq_base, freq_scale, + ext_factor, attn_factor, beta_fast, beta_slow + ); + cb(Qcur, "Qcur", il); +``` +Lets start by focusing on the second argument which is `a` and this would be +the tensor that the rotation should be applied to. This tensor is first +reshaped to a 3D tensor: +```console +(gdb) p *Qcur +$2 = {type = GGML_TYPE_F32, +backend = GGML_BACKEND_TYPE_CPU, +buffer = 0x0, +ne = {4096, 512, 1, 1}, +nb = {4, 16384, 8388608, 8388608}, +op = GGML_OP_MUL_MAT, op_params = {0 }, flags = 0, grad = 0x0, +src = {0x555558423910, 0x7fffcf51e8a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, +data = 0x0, name = "Qcur-0", '\000' , extra = 0x0} +``` +``` + QCur + 0 4095 + 0 +-------------------------------------------+ + | | + | | + | | + | | + | | + | | + | | + | | + | | + 511 +-------------------------------------------+ +``` +Lets look at the reshaping: +```c + ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens) +``` +```console +(gdb) p n_embd_head +$8 = 128 +(gdb) p n_head +$9 = 32 +(gdb) p n_tokens +$10 = 512 +``` +So that becomes: +```c + ggml_reshape_3d(ctx0, Qcur, 128, 32, 512) +``` +And notice what we have split the dimensions which were 4096 into 128x32 (4096) +``` + /--------------------------+ 0 + / / + / / + 0/ 127 / + 0 +---------------------+ / + | | / + | | / + | | / + | |/ + 32 +---------------------+ 511 +``` +So we are reshaping Qcur to the above dimensions before calling rope. +The signagure for `ggml_rope_ext` is: +```c +struct ggml_tensor * ggml_rope_ext( + struct ggml_context * ctx, + struct ggml_tensor * a, + struct ggml_tensor * b, + struct ggml_tensor * c, + int n_dims, + int mode, + int n_ctx_orig, + float freq_base, + float freq_scale, + float ext_factor, + float attn_factor, + float beta_fast, + float beta_slow) { + return ggml_rope_impl( + ctx, a, b, c, n_dims, mode, n_ctx_orig, freq_base, freq_scale, + ext_factor, attn_factor, beta_fast, beta_slow, false + ); +} +``` +So Qcur is a, b is `inp_pos`. `c` is null. + + +```c +static void ggml_compute_forward_rope_f32( + const struct ggml_compute_params * params, + struct ggml_tensor * dst, + const bool forward) { + ... + const float theta_scale = powf(freq_base, -2.0f/n_dims); +``` +So, we can see here that the `freq_base` is used to calculate the `theta_scale` +and notice that this the same as specified in the vanilla RoPE paper where +we take 10000^(-2/d). And we can see what `n_dims` is used for. 
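+
+As a small worked example (a sketch, not the llama.cpp loop itself), this is
+how `theta_scale` generates the per-dimension-pair frequencies: starting from
+1 and repeatedly multiplying by `freq_base^(-2/n_dims)` gives
+`freq_base^(-2*i/n_dims)` for dimension pair `i`, just like in the RoPE paper:
+```c++
+#include <cmath>
+#include <cstdio>
+
+int main() {
+    const float freq_base   = 10000.0f;
+    const int   n_dims      = 128;
+    const float theta_scale = powf(freq_base, -2.0f/n_dims);
+
+    float theta = 1.0f;
+    for (int i = 0; i < n_dims/2; ++i) {
+        if (i < 4 || i == n_dims/2 - 1) {
+            printf("dimension pair %3d: theta = %g\n", i, theta);
+        }
+        theta *= theta_scale; // theta for the next pair
+    }
+    return 0;
+}
+```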
+Continuing in `ggml_compute_forward_rope_f32` we then have:
+```c
+    const int32_t * pos = (const int32_t *) src1->data;
+```
+And here we can see that the tensor `b` is the position tensor, which makes
+sense as its dimension matches the number of tokens in the batch (512 in this
+case).
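+
+To wrap up, the following is a standalone sketch (not the real
+`ggml_compute_forward_rope_f32`) of how those positions end up being used:
+for every token we read its position from the `pos` array and rotate each
+pair of dimensions of that token's vector by `pos * theta`, with `theta`
+shrinking per pair via `theta_scale`:
+```c++
+#include <cmath>
+#include <cstdio>
+#include <vector>
+
+int main() {
+    const int   n_tokens    = 4;
+    const int   n_dims      = 8;
+    const float freq_base   = 10000.0f;
+    const float theta_scale = powf(freq_base, -2.0f/n_dims);
+
+    std::vector<int>   pos(n_tokens);
+    std::vector<float> x(n_tokens * n_dims, 1.0f); // dummy embeddings
+    for (int i = 0; i < n_tokens; ++i) pos[i] = i; // what inp_pos would hold
+
+    for (int i2 = 0; i2 < n_tokens; ++i2) {
+        float theta = (float) pos[i2];
+        for (int i0 = 0; i0 < n_dims; i0 += 2) {
+            const float c = cosf(theta);
+            const float s = sinf(theta);
+            float * v = &x[i2*n_dims + i0];
+            const float v0 = v[0], v1 = v[1];
+            v[0] = v0*c - v1*s; // rotate the (v0, v1) pair by theta
+            v[1] = v0*s + v1*c;
+            theta *= theta_scale;
+        }
+    }
+    printf("first dim of last token after rope: %f\n", x[(n_tokens - 1)*n_dims]);
+    return 0;
+}
+```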