docs: update trie DAT notes

danbev · Aug 25, 2024 · 254f2e7
1 parent eaa5582
commit 254f2e7
Showing 1 changed file with 213 additions and 45 deletions.
diff --git a/notes/trie.md b/notes/trie.md
@@ -118,71 +118,239 @@ and "apron").
 Hash tables are not suited for range queries.
 
 
-### XOR-compressed double arrays (XCDA)
-Lets start with double array tries.
-A double array trie uses two arrays, BASE and CHECK, to represent a trie structure compactly.
+### Double Array Trie (DAT)
+A double array trie uses two arrays, BASE and CHECK, to represent a trie
+structure compactly.
 
-So first we start with a normal trie for the words "cat" and "cow":
+Let's start by inserting the word 'cow' into a DAT.
+```c
+    trie_dat* trie = create_trie();
 
-Trie Structure with Node Numbers:
+    insert(trie, "cow");
 ```
-    0(root)
-       |
-    1(c)
-   /     \
-2(a)     3(o)
- |         |
-4(t*)     5(w*)
 
-* = end of word
+```c
+void insert(trie_dat* trie, const char* word) {
+    int cur_node = ROOT_NODE; // this is the current node, and we start at the ROOT node.
+```
+And ROOT_NODE is defined as 1, and the trie_dat structure is defined as:
+```console
+(gdb) p cur_node
+$2 = 1
+
+(gdb) ptype trie
+type = struct {
+    int *base;
+    int *check;
+    _Bool *terminal;
+    int size;
+    int capacity;
+} *
+```
+And a trie is created using the create_trie() function:
+```c
+trie_dat* create_trie() {
+    trie_dat* trie = malloc(sizeof(trie_dat));
+    trie->size = ROOT_NODE;
+    trie->capacity = INITIAL_SIZE;
+    trie->base = calloc(trie->capacity, sizeof(int));
+    trie->check = calloc(trie->capacity, sizeof(int));
+    trie->terminal = calloc(trie->capacity, sizeof(bool));
+    return trie;
+}
+```
 
+We will be focusing on the base and check arrays in this section.
+
+First, we will iterate over the individual characters in the word 'cow':
+```c
+    for (int i = 0; word[i] != '\0'; i++) {
+        int char_offset = word[i] - 'a';
+```
+The above is a common way to convert a character to an index. In this case, we
+are converting the character 'a' to 0, 'b' to 1, 'c' to 2, etc. So the offset
+for 'c' is 2.
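The subtraction above only yields a valid index for lowercase letters. As a
minimal sketch, assuming an ASCII-compatible character set and a 26-letter
alphabet (the helper name `char_to_offset` and the constant `ALPHABET_SIZE` are
hypothetical and not part of the notes' code), the mapping could be guarded
like this:

```c
#include <assert.h>

#define ALPHABET_SIZE 26 /* assumption: only 'a'..'z' are supported */

/* Maps 'a'..'z' to 0..25 and returns -1 for any other character.
 * Hypothetical helper: the code in the notes uses word[i] - 'a' directly. */
int char_to_offset(char c) {
    if (c < 'a' || c > 'z') {
        return -1;
    }
    int offset = c - 'a';
    /* Holds on ASCII-compatible character sets, where letters are contiguous. */
    assert(offset >= 0 && offset < ALPHABET_SIZE);
    return offset;
}
```

With this helper, `char_to_offset('c')` is 2, `char_to_offset('o')` is 14 and
`char_to_offset('w')` is 22, which are the offsets used for 'cow' in this
section, while an uppercase letter or a digit is rejected instead of producing
an out-of-range index.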
+
+Recall that `cur_node` is the root which is 1 for our trie:
+```c
+        if (trie->base[cur_node] == 0) {
+            trie->base[cur_node] = trie->size;
+        }
+```
+And since this is the first call to insert, there is nothing in the trie at
+the moment. So we are setting base[1] = 1 (the initial size of the trie).
+
+```
+BASE[1] = 1
+```
 
+Next we are going to calculate the transition, or offset, from base[1] to
+the node of the character 'c':
+```c
+        // Calculate the transition index which uses base[s] + c.
+        int t = trie->base[cur_node] + char_offset;
+```
+
+```console
+(gdb) p trie->base[cur_node] + char_offset
+$11 = 3
+```
+This value is then used as the index into check:
+```c
+        if (trie->check[t] == 0) {
+            trie->check[t] = cur_node;
+            trie->size++;
+        } else if (trie->check[t] != cur_node) {
+            // Handle conflicts in base/check
+            fprintf(stderr, "Error: Conflict detected while inserting '%s'.\n", word);
+            return;
+        }
+```
+And the value of this element will be cur_node, which is 1 in this case:
+```
+CHECK[3] = 1
+```
+And the size of the trie is incremented by 1 and will become 2.
+
+The last thing in this loop is:
+```c
+        cur_node = t;
+```
+This sets the current node to t, which is 3.
+
+So this is what the arrays look like after we have iterated over the first
+character in 'cow':
+```
+        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
+BASE  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
 ```
+When we insert a character we calculate an index for it, and that index is
+where the node is stored. The character itself is never stored. When we search
+we calculate the same index from the parent's base value and the character, and
+if the "path" from the parent is valid we know it was inserted previously. This
+is what the check array is for: check[t] records the parent of slot t. The same
+calculation is used in both inserts and searches.
+
+So if we wanted to search for the word 'cow' now we would start at the root:
+```
+char mapping 'c' - 'a' = 2;
+base[1] + 2 = 3,
+      1 + 2 = 3,
 
-Double Array Representation:
+check[3] == 1? Yes, continue
 ```
-BASE[0] = Has one child which is 'c' (decimal 99).
-Lets say that BASE[0] = 1 then we have 1 + 99 = 100. This would mean that then
-node for 'c' would be placed at index 100 in the trie. This is far from the
-root wasting a lot of space.
-If we want the node for 'c' to be at index 1, then we would have to set BASE[0]
-to 1 - 99 = -98. This would mean that the node for 'c' would be placed at index
-1 in the trie.
-So that gives us BASE[0] = -98.
+So we can see that the check verifies that we have indeed inserted this "path"
+before and we can continue to the next character. We currently only have one
+node but we will insert more below.
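The lookup step above can be sketched as a small function. This is only an
illustration under stated assumptions: the name `has_path` and the explicit
`capacity` parameter are hypothetical, and the function only checks that a
path exists; it does not look at the terminal flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ROOT_NODE 1

/* Returns true if every character in `path` follows a valid transition,
 * that is check[base[s] + c] == s for each parent node s and character c.
 * Hypothetical helper, not taken from the notes' dat.c. */
bool has_path(const int *base, const int *check, int capacity,
              const char *path) {
    assert(base != NULL && check != NULL && path != NULL);
    int cur_node = ROOT_NODE;
    for (int i = 0; path[i] != '\0'; i++) {
        int char_offset = path[i] - 'a';
        if (char_offset < 0 || char_offset > 25) {
            return false; /* only 'a'..'z' are mapped */
        }
        int t = base[cur_node] + char_offset;
        if (t >= capacity || check[t] != cur_node) {
            return false; /* slot t does not belong to cur_node */
        }
        cur_node = t;
    }
    return true;
}
```

With the arrays in the state shown above (BASE[1] = 1 and CHECK[3] = 1, all
other entries 0), `has_path` accepts "c" and rejects "co": for 'o' the index is
base[3] + 14 = 14, and check[14] is still 0 rather than 3.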
+
+Next we have the character 'o':
+```console
+(gdb) p char_offset
+$18 = 14
 ```
-BASE[0] + 99 = -98 + 99 = 1
+And recall that `cur_node` is 3:
+```c
+        if (trie->base[cur_node] == 0) {
+            trie->base[cur_node] = trie->size;
+        }
 ```
-In the above trie node 'c' has two children 'a' and 'o'. The value in BASE[1]
-should be chosen such that when we add the ASCII value of 'a' to BASE[1] we
-get the correct index for 'a'. And likewise for 'o'. 
 
-Let's assume we want the node for 'a' to be at index 2. 'a' = 97.
+```console
+(gdb) p trie->size
+$21 = 2
 ```
-'a' = 97
-BASE[1] + 97 = 2
-BASE[1] = 2 - 97 = -95
+We currently have 2 nodes: 'ROOT' and 'c'.
 
-BASE[0] = -98
-BASE[1] = -95
+The last node we inserted was 'c', at index 3:
+```console
+(gdb) p cur_node
+$22 = 3
+```
+Then we will set trie->base[3] = 2:
+```c
+trie->base[cur_node] = trie->size;
+```
+And we calculate the transition index:
+```c
+int t = trie->base[cur_node] + char_offset;
+```
+```console
+(gdb) p char_offset
+$123 = 14
+(gdb) p trie->base[cur_node]
+$124 = 2
+(gdb) p trie->base[cur_node] + char_offset
+$125 = 16
+```
+And we will set check[16] = 3:
+```c
+trie->check[t] = cur_node;
 ```
 
-The base array is used for navigating from one node to another node based on the
-input characters.
+This gives us arrays that look like this:
+```
+        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0...]
+```
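+Next we have the character 'w' and cur_node is now 16. base[16] is set to the
+current size, which is 3, and the transition index becomes 3 + 22 = 25, so
+check[25] is set to 16:
+```
+        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+```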
 
-base[current_node] gives us the base index/offset for navigating from the current
-node to its children. This is then used with an offset, another character in the
-input. The offset is the ASCII value of the character.
+```
+        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
+```
+This is the last character in the word 'cow', so after the loop we set the
+terminal flag for cur_node, which is now 25:
+```c
+trie->terminal[cur_node] = true;
+```
+So the arrays look like this after 'cow' has been inserted:
+```
+        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
+```
 
-So base[current_node] represents the offset in the array where the transistions
-from the current node start.
+Trie Structure with Node Numbers:
 ```
-    0(root)
+    1(root)
        |
-    1(c)
-   /     \
-2(a)     3(o)
- |         |
-4(t*)     5(w*)
+    3(c)
+   /
+16(o)
+ |
+25(w*)
+
+* = end of word
+```
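To reproduce the walkthrough in one place, the fragments above can be assembled
into a small self-contained sketch. Several things here are assumptions that the
notes do not show: the value of `INITIAL_SIZE` (64), the `bool` return value of
`insert`, the bounds check on `t`, and the helpers `free_trie` and `dump_trie`.
Growing the arrays and resolving conflicts are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ROOT_NODE 1
#define INITIAL_SIZE 64 /* assumed value; the notes do not show it */

typedef struct {
    int  *base;
    int  *check;
    bool *terminal;
    int   size;
    int   capacity;
} trie_dat;

trie_dat *create_trie(void) {
    trie_dat *trie = malloc(sizeof(trie_dat));
    assert(trie != NULL);
    trie->size = ROOT_NODE;
    trie->capacity = INITIAL_SIZE;
    trie->base = calloc(trie->capacity, sizeof(int));
    trie->check = calloc(trie->capacity, sizeof(int));
    trie->terminal = calloc(trie->capacity, sizeof(bool));
    assert(trie->base != NULL && trie->check != NULL && trie->terminal != NULL);
    return trie;
}

/* Hypothetical helper: releases everything allocated by create_trie(). */
void free_trie(trie_dat *trie) {
    free(trie->base);
    free(trie->check);
    free(trie->terminal);
    free(trie);
}

/* Returns true on success, false on a conflict or an out-of-range index.
 * The notes' version returns void and prints an error on conflict. */
bool insert(trie_dat *trie, const char *word) {
    int cur_node = ROOT_NODE;
    for (int i = 0; word[i] != '\0'; i++) {
        int char_offset = word[i] - 'a';
        if (char_offset < 0 || char_offset > 25) {
            return false; /* only 'a'..'z' are supported */
        }
        if (trie->base[cur_node] == 0) {
            trie->base[cur_node] = trie->size;
        }
        /* Transition index: base[s] + c. */
        int t = trie->base[cur_node] + char_offset;
        if (t >= trie->capacity) {
            return false; /* assumption: no resizing in this sketch */
        }
        if (trie->check[t] == 0) {
            trie->check[t] = cur_node;
            trie->size++;
        } else if (trie->check[t] != cur_node) {
            return false; /* slot already owned by another parent */
        }
        cur_node = t;
    }
    trie->terminal[cur_node] = true;
    return true;
}

/* Hypothetical helper: prints every slot with a non-zero BASE or CHECK. */
void dump_trie(const trie_dat *trie) {
    for (int i = 0; i < trie->capacity; i++) {
        if (trie->base[i] != 0 || trie->check[i] != 0) {
            printf("index %2d: BASE=%d CHECK=%d%s\n", i, trie->base[i],
                   trie->check[i], trie->terminal[i] ? " *" : "");
        }
    }
}
```

Calling `insert(trie, "cow")` on a fresh trie and then `dump_trie(trie)` should
list exactly the four slots from the diagram above: index 1 (root), 3 ('c'),
16 ('o') and 25 ('w', marked as terminal).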
+
+### Packed Double Array Trie (PDAT)
+When implementing the double array trie and stepping through the code I noticed
+that the arrays, BASE and CHECK, are quite sparsely populated. In the example
+above only 4 of the first 26 slots are used after inserting 'cow'. This is
+because the arrays are allocated based on the number of nodes in the trie, and
+the number of nodes is based on the number of characters in the input strings.
+This can be quite large and wasteful. Packing the arrays can reduce the memory
+footprint of the trie implementation. There seem to be many different ways to
+pack the arrays but I'll focus on the method used in llama.cpp.
+
+There is an example of a Double Array Trie in
+[dat](../fundamentals/datastructures/sr/dat.c) which might be helpful to look
+at to get an understanding of how a DAT works.
+
+So, instead of using two arrays for BASE and CHECK we will now have a
+single uint32_t array in which each element contains all the information for
+one node in the trie.
+```
+Bits    0-7: LCHECK value (8 bits)
+Bit       8: LEAF flag (1 bit)
+Bit       9: BASE extension flag (1 bit)
+Bits  10-30: BASE value or Value (21 bits)
+Bit      31: Sign bit for LCHECK or additional VALUE bit
 ```
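As a rough illustration of the layout above, the fields can be packed into and
extracted from a `uint32_t` with shifts and masks. This is a sketch under
assumptions, not the code used in llama.cpp: all names (`pack_unit`,
`unit_lcheck`, and so on) are hypothetical, and bit 31 is simply left at zero
because its exact use is not spelled out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LCHECK_MASK  0xFFu      /* bits 0-7 */
#define LEAF_BIT     (1u << 8)  /* bit 8 */
#define BASE_EXT_BIT (1u << 9)  /* bit 9 */
#define BASE_SHIFT   10         /* BASE/value starts at bit 10 */
#define BASE_MASK    0x1FFFFFu  /* 21 bits, bits 10-30 after shifting */

/* Packs the fields into one 32-bit unit. Bit 31 is left as 0. */
uint32_t pack_unit(uint8_t lcheck, bool leaf, bool base_ext, uint32_t base) {
    assert(base <= BASE_MASK); /* must fit in 21 bits */
    return (uint32_t)lcheck
         | (leaf ? LEAF_BIT : 0u)
         | (base_ext ? BASE_EXT_BIT : 0u)
         | (base << BASE_SHIFT);
}

/* Field accessors: mask and shift each field back out of the unit. */
uint8_t unit_lcheck(uint32_t unit) {
    return (uint8_t)(unit & LCHECK_MASK);
}

bool unit_is_leaf(uint32_t unit) {
    return (unit & LEAF_BIT) != 0;
}

bool unit_has_base_ext(uint32_t unit) {
    return (unit & BASE_EXT_BIT) != 0;
}

uint32_t unit_base(uint32_t unit) {
    return (unit >> BASE_SHIFT) & BASE_MASK;
}
```

For example, `pack_unit(0x63, true, false, 5)` gives `0x1563`: 5 shifted left
by 10 is `0x1400`, the LEAF flag adds `0x100`, and the low byte holds `0x63`.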
 
 _wip_
+