From bff4287393ef44aaf0c3deac5823662f20f0cde4 Mon Sep 17 00:00:00 2001
From: Daniel Bevenius
Date: Wed, 30 Oct 2024 13:27:36 +0100
Subject: [PATCH] docs: add table of contents to llama-kv-cache.md

---
 notes/llama-kv-cache.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/notes/llama-kv-cache.md b/notes/llama-kv-cache.md
index 1ddbee6..9bf606b 100644
--- a/notes/llama-kv-cache.md
+++ b/notes/llama-kv-cache.md
@@ -3,6 +3,10 @@ I've gone through the theory of Key-Value caching in the transformer architectur
 in [llama.md](llama.md). This document is a more detailed look at the
 implementation of the key-value cache in the llama.cpp codebase.
 
+## Table of Contents
+- [KV-Cache at inference time](#inference-with-kv-cache)
+- [llama_kv_cache details](#llama_kv_cache)
+
 ### Inference with KV-Cache
 Lets set a break point before `llama_decode` and see how this interacts with
 the kv-cache.
@@ -1086,7 +1090,7 @@
 the key and values cache will use 6 when calculating the offset to store the
 roped k and value cache entried for the next token.
 
-### `kv_self`
+### llama_kv_cache
 A `llama_context` contains a member named `kv_self` (self as in self attention)
 which is of type `llama_kv_cache`. This struct is defined in `llama.cpp`:
 ```c++
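
The second hunk's context mentions the cache using position 6 when calculating the offset at which the roped key and value entries for the next token are stored. The sketch below is not part of the patch; it is a minimal, self-contained illustration of that offset arithmetic under the assumption of a flat per-layer buffer with one row per cache slot. The names (`kv_head`, `n_embd_head`, `n_head_kv`) are illustrative, not llama.cpp's actual identifiers.

```c++
// Minimal sketch of KV-cache offset arithmetic (not llama.cpp code).
// Assumption: each layer's K cache is a flat buffer holding n_ctx rows,
// one row per cached token, so the write offset for the next token is
// simply (slot index) * (elements per token).
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const size_t n_embd_head = 128; // dimension of one attention head (illustrative)
    const size_t n_head_kv   = 8;   // number of key/value heads (illustrative)
    const size_t n_ctx       = 512; // maximum number of cache slots (illustrative)

    // Elements stored per token for one layer's keys.
    const size_t n_embd_kv = n_embd_head * n_head_kv;

    // One layer's K cache: a flat buffer with one row per cache slot.
    std::vector<float> k_cache(n_ctx * n_embd_kv, 0.0f);

    // After 6 tokens have been decoded, the head points at slot 6, so the
    // roped key for the next token starts at this element offset.
    const size_t kv_head = 6;
    const size_t offset  = kv_head * n_embd_kv;

    k_cache[offset] = 0.0f; // the next token's key row would be written here

    printf("next K entry written at element offset %zu\n", offset);
    return 0;
}
```

The idea the sketch models is that each cache slot holds one token's keys, so the slot index times the row size yields the write offset; the value cache follows the same pattern.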