diff --git a/notes/trie.md b/notes/trie.md
index 34d3e5b2..4cca15d7 100644
--- a/notes/trie.md
+++ b/notes/trie.md
@@ -118,71 +118,239 @@ and "apron").
 Hash tables are not suited for range queries.
-### XOR-compressed double arrays (XCDA)
-Lets start with double array tries.
-A double array trie uses two arrays, BASE and CHECK, to represent a trie structure compactly.
+### Double Array Trie (DAT)
+A double array trie uses two arrays, BASE and CHECK, to represent a trie
+structure compactly.
-So first we start with a normal trie for the words "cat" and "cow":
+Let's start by inserting the word 'cow' into a DAT.
+```c
+    trie_dat* trie = create_trie();
-Trie Structure with Node Numbers:
+    insert(trie, "cow");
 ```
-     0(root)
-       |
-     1(c)
-    /   \
-2(a)    3(o)
-  |       |
-4(t*)   5(w*)
-* = end of word
+```c
+void insert(trie_dat* trie, const char* word) {
+    int cur_node = ROOT_NODE; // this is the current node, and we start at the ROOT node.
+```
+ROOT_NODE is defined as 1, and the trie_dat structure is defined as:
+```console
+(gdb) p cur_node
+$2 = 1
+
+(gdb) ptype trie
+type = struct {
+    int *base;
+    int *check;
+    _Bool *terminal;
+    int size;
+    int capacity;
+} *
+```
+A trie is created using the create_trie() function:
+```c
+trie_dat* create_trie() {
+    trie_dat* trie = malloc(sizeof(trie_dat));
+    trie->size = ROOT_NODE;
+    trie->capacity = INITIAL_SIZE;
+    trie->base = calloc(trie->capacity, sizeof(int));
+    trie->check = calloc(trie->capacity, sizeof(int));
+    trie->terminal = calloc(trie->capacity, sizeof(bool));
+    return trie;
+}
+```
+We will be focusing on the base and check arrays in this section.
+
+First, we will iterate over all the individual characters in the word 'cow':
+```c
+    for (int i = 0; word[i] != '\0'; i++) {
+        int char_offset = word[i] - 'a';
+```
+The above is a common way to convert a character to an index. In this case, we
+are converting the character 'a' to 0, 'b' to 1, 'c' to 2, etc. So the offset
+for 'c' is 2.
+
+Recall that `cur_node` is the root, which is 1 for our trie:
+```c
+        if (trie->base[cur_node] == 0) {
+            trie->base[cur_node] = trie->size;
+        }
+```
+Since this is the first time insert is called, there is nothing in the trie at
+the moment. So we are setting base[1] = 1 (the initial size of the trie).
+
+```
+BASE[1] = 1
+```
+Next we are going to calculate the transition, or offset, from base[1] to
+the node of the character 'c':
+```c
+        // Calculate the transition index which uses base[s] + c.
+        int t = trie->base[cur_node] + char_offset;
+```
+
+```console
+(gdb) p trie->base[cur_node] + char_offset
+$11 = 3
+```
+This value is then used as the index into check:
+```
+        if (trie->check[t] == 0) {
+            trie->check[t] = cur_node;
+            trie->size++;
+        } else if (trie->check[t] != cur_node) {
+            // Handle conflicts in base/check
+            fprintf(stderr, "Error: Conflict detected while inserting '%s'.\n", word);
+            return;
+        }
+```
+The value of this element will be cur_node, which is 1 in this case:
+```
+CHECK[3] = 1
+```
+And the size of the trie is incremented by 1 and becomes 2.
+
+The last thing in this loop is:
+```c
+        cur_node = t;
+```
+This sets the current node to t, which is 3.
+
+So this is what the arrays look like after we have iterated over the first
+character in 'cow':
+```
+        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
+BASE  [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
 ```
+When we insert a character we calculate an index for it, which is where we
+insert it. The character itself is never stored. When we search we
+calculate the same index from the character, and if the “path” to/from the
+parent is valid we know it was inserted previously. This is what the check array
+is for. So each (parent node, character) transition gets its own index, and the
+same index is computed in both inserts and searches.
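A minimal sketch of how the insert fragments in this section could be assembled and paired with a lookup that repeats the same base/check arithmetic. It is not the actual dat.c. Assumptions: `INITIAL_SIZE` is 64 here because the notes do not show its value, the range and capacity checks are additions, `search()` is a hypothetical helper, and only the letters 'a' to 'z' are handled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ROOT_NODE 1
#define INITIAL_SIZE 64 /* assumption: the real value is not shown in the notes */

typedef struct {
    int  *base;
    int  *check;
    bool *terminal;
    int   size;
    int   capacity;
} trie_dat;

trie_dat *create_trie(void) {
    trie_dat *trie = malloc(sizeof(trie_dat));
    assert(trie != NULL);
    trie->size = ROOT_NODE;
    trie->capacity = INITIAL_SIZE;
    trie->base = calloc(trie->capacity, sizeof(int));
    trie->check = calloc(trie->capacity, sizeof(int));
    trie->terminal = calloc(trie->capacity, sizeof(bool));
    assert(trie->base && trie->check && trie->terminal);
    return trie;
}

void insert(trie_dat *trie, const char *word) {
    int cur_node = ROOT_NODE;
    for (int i = 0; word[i] != '\0'; i++) {
        int char_offset = word[i] - 'a';
        if (char_offset < 0 || char_offset > 25) { /* added: 'a'-'z' only */
            fprintf(stderr, "Error: unsupported character in '%s'.\n", word);
            return;
        }
        if (trie->base[cur_node] == 0) {
            trie->base[cur_node] = trie->size;
        }
        int t = trie->base[cur_node] + char_offset;
        if (t >= trie->capacity) { /* added: this sketch never resizes */
            fprintf(stderr, "Error: capacity exceeded inserting '%s'.\n", word);
            return;
        }
        if (trie->check[t] == 0) {
            trie->check[t] = cur_node; /* record the parent of slot t */
            trie->size++;
        } else if (trie->check[t] != cur_node) {
            fprintf(stderr, "Error: Conflict detected while inserting '%s'.\n", word);
            return;
        }
        cur_node = t;
    }
    trie->terminal[cur_node] = true;
}

/* Hypothetical lookup: follow base + offset and let check confirm the parent. */
bool search(const trie_dat *trie, const char *word) {
    int cur_node = ROOT_NODE;
    for (int i = 0; word[i] != '\0'; i++) {
        int char_offset = word[i] - 'a';
        if (char_offset < 0 || char_offset > 25) return false;
        if (trie->base[cur_node] == 0) return false; /* node has no children */
        int t = trie->base[cur_node] + char_offset;
        if (t >= trie->capacity || trie->check[t] != cur_node) return false;
        cur_node = t;
    }
    return trie->terminal[cur_node];
}
```

With these definitions, inserting "cow" into a fresh trie and then calling `search(trie, "cow")` returns true, while `search(trie, "co")` returns false because the node reached after 'o' is not marked terminal.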
+
+So if we wanted to search for the word 'cow' now we would start at the root:
+```
+char mapping: 'c' - 'a' = 2
+base[1] + 2 = 3,
+    1   + 2 = 3,
-Double Array Representation:
+check[3] == 1? Yes, continue
 ```
-BASE[0] = Has one child which is 'c' (decimal 99).
-Lets say that BASE[0] = 1 then we have 1 + 99 = 100. This would mean that then
-node for 'c' would be placed at index 100 in the trie. This is far from the
-root wasting a lot of space.
-If we want the node for 'c' to be at index 1, then we would have to set BASE[0]
-to 1 - 99 = -98. This would mean that the node for 'c' would be placed at index
-1 in the trie.
-So that gives us BASE[0] = -98.
+So we can see that the check verifies that we have indeed inserted this "path"
+before and we can continue to the next character. We currently only have one
+node besides the root, but we will insert more below.
+
+Next we have the character 'o':
+```console
+(gdb) p char_offset
+$18 = 14
 ```
-BASE[0] + 99 = -98 + 99 = 1
+And recall that `cur_node` is 3:
+```c
+        if (trie->base[cur_node] == 0) {
+
 ```
-In the above trie node 'c' has two children 'a' and 'o'. The value in BASE[1]
-should be chosen such that when we add the ASCII value of 'a' to BASE[1] we
-get the correct index for 'a'. And likewise for 'o'.
-Let's assume we want the node for 'a' to be at index 2. 'a' = 97.
+```console
+(gdb) p trie->size
+$21 = 2
 ```
-'a' = 97
-BASE[1] + 97 = 2
-BASE[1] = 2 - 97 = -95
+We currently have 2 nodes, 'ROOT' and 'c'.
-BASE[0] = -98
-BASE[1] = -95
+The last node we inserted was 'c' and its index was 3:
+```
+(gdb) p cur_node
+$22 = 3
+```
+Then we will set trie->base[3] = 2:
+```c
+trie->base[cur_node] = trie->size;
+```
+And we calculate the transition index:
+```c
+int t = trie->base[cur_node] + char_offset;
+```
+```console
+(gdb) p char_offset
+$123 = 14
+(gdb) p trie->base[cur_node]
+$124 = 2
+(gdb) p trie->base[cur_node] + char_offset
+$125 = 16
+```
+And we will set check[16] = 3:
+```c
+trie->check[t] = cur_node;
 ```
-The base array is used for navigating from one node to another node based on the
-input characters.
+This gives us arrays that look like this:
+```
+        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0...]
+```
+Next we have the character 'w' and cur_node is now 16. The offset for 'w' is 22,
+so we set base[16] = 3 (the current size) and the transition index is 3 + 22 = 25.
+```
+        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+```
-base[current_node] gives us the base index/offset for navigating from the current
-node to its children. This is then used with an offset, another character in the
-input. The offset is the ASCII value of the character.
+```
+        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
+```
+This is the last character in the word 'cow' and we will set the terminal flag:
+```
+trie->terminal[cur_node] = true;
+```
+So the arrays look like this after 'cow' has been inserted:
+```
+        0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
+BASE  [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
+CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
+```
-So base[current_node] represents the offset in the array where the transistions
-from the current node start.
+Trie Structure with Node Numbers:
 ```
-     0(root)
+     1(root)
        |
-     1(c)
-    /   \
-2(a)    3(o)
-  |       |
-4(t*)   5(w*)
+     3(c)
+    /
+16(o)
+  |
+25(w*)
+
+* = end of word
+```
+
+### Packed Double Array Trie (PDAT)
+When implementing the double array trie and stepping through the code I noticed
+that the arrays, BASE and CHECK, are quite sparsely populated. This is because
+a child is placed at base + char_offset, so the slots for characters that are
+never used stay empty. After inserting 'cow' only 3 of the first 26 CHECK
+entries are in use. This can be quite wasteful. Packing the arrays can reduce
+the memory footprint of the trie.
+There seem to be many different ways to pack the arrays but
+I'll focus on the method used in llama.cpp.
+
+There is an example of a Double Array Trie in
+[dat](../fundamentals/datastructures/sr/dat.c) which might be helpful to look
+at to get an understanding of how a DAT works.
+
+So, instead of using two arrays for BASE and CHECK we will now have a
+single uint32_t array which will contain all the information for a node in the
+trie.
+```
+Bits 0-7:   LCHECK value (8 bits)
+Bit 8:      LEAF flag (1 bit)
+Bit 9:      BASE extension flag (1 bit)
+Bits 10-30: BASE value or Value (21 bits)
+Bit 31:     Sign bit for LCHECK or additional VALUE bit
 ```
 _wip_
+
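A minimal sketch of how the fields in the bit table above could be read from and written to one packed uint32_t, assuming exactly the bit positions listed there. The names `unit_lcheck`, `unit_is_leaf`, `unit_base_ext`, `unit_base`, `unit_bit31` and `unit_pack` are hypothetical and are not llama.cpp's accessors. How the extension flag and bit 31 are interpreted is left out because these notes do not describe it yet:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits 0-7: LCHECK value. */
static inline uint32_t unit_lcheck(uint32_t u) { return u & 0xFFu; }

/* Bit 8: LEAF flag. */
static inline bool unit_is_leaf(uint32_t u) { return (u >> 8) & 1u; }

/* Bit 9: BASE extension flag (raw bit only, no interpretation). */
static inline bool unit_base_ext(uint32_t u) { return (u >> 9) & 1u; }

/* Bits 10-30: BASE value or Value (21 bits). */
static inline uint32_t unit_base(uint32_t u) { return (u >> 10) & 0x1FFFFFu; }

/* Bit 31: sign bit for LCHECK or additional VALUE bit (raw bit only). */
static inline bool unit_bit31(uint32_t u) { return (u >> 31) & 1u; }

/* Build a unit from its fields; the inverse of the accessors above. */
static inline uint32_t unit_pack(uint32_t lcheck, bool leaf, bool base_ext,
                                 uint32_t base, bool bit31) {
    assert(lcheck <= 0xFFu);     /* must fit in 8 bits */
    assert(base <= 0x1FFFFFu);   /* must fit in 21 bits */
    return lcheck
         | ((uint32_t)leaf << 8)
         | ((uint32_t)base_ext << 9)
         | (base << 10)
         | ((uint32_t)bit31 << 31);
}
```

Packing a node with LCHECK 0x63 ('c'), the leaf flag set and a 21-bit value of 12345 and then reading the fields back returns the same three values, which is a quick way to check that the shifts and masks agree with the table.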