docs: update trie DAT notes
danbev committed Aug 25, 2024
1 parent eaa5582 commit 254f2e7
Showing 1 changed file with 213 additions and 45 deletions.
258 changes: 213 additions & 45 deletions notes/trie.md
@@ -118,71 +118,239 @@ and "apron").
Hash tables are not suited for range queries.


### XOR-compressed double arrays (XCDA)
Let's start with double array tries.
A double array trie uses two arrays, BASE and CHECK, to represent a trie structure compactly.
### Double Array Trie (DAT)
A double array trie uses two arrays, BASE and CHECK, to represent a trie
structure compactly.

So first we start with a normal trie for the words "cat" and "cow":
Let's start by inserting the word 'cow' into a DAT.
```c
trie_dat* trie = create_trie();

Trie Structure with Node Numbers:
insert(trie, "cow");
```
0(root)
|
1(c)
/ \
2(a) 3(o)
| |
4(t*) 5(w*)
* = end of word
```c
void insert(trie_dat* trie, const char* word) {
int cur_node = ROOT_NODE; // this is the current node, and we start at the ROOT node.
```
And ROOT_NODE is defined as 1, and the trie_dat structure is defined as:
```console
(gdb) p cur_node
$2 = 1

(gdb) ptype trie
type = struct {
int *base;
int *check;
_Bool *terminal;
int size;
int capacity;
} *
```
And a trie is created using the create_trie() function:
```c
trie_dat* create_trie() {
trie_dat* trie = malloc(sizeof(trie_dat));
trie->size = ROOT_NODE;
trie->capacity = INITIAL_SIZE;
trie->base = calloc(trie->capacity, sizeof(int));
trie->check = calloc(trie->capacity, sizeof(int));
trie->terminal = calloc(trie->capacity, sizeof(bool));
return trie;
}
```

We will be focusing on the base and check array in this section.

First, we will iterate over all the individual characters in the word 'cow':
```c
for (int i = 0; word[i] != '\0'; i++) {
int char_offset = word[i] - 'a';
```
The above is a common way to convert a character to an index. In this case, we
are converting the character 'a' to 0, 'b' to 1, 'c' to 2, etc. So the offset
for 'c' is 2.
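
As a small illustration of this mapping, the sketch below wraps the subtraction in a helper. The name `char_to_offset` is hypothetical and does not appear in the original code, and the range check is an added assumption that only 'a' to 'z' are valid input. For the word 'cow' it yields 2, 14 and 22.

```c
// Hypothetical helper (not part of the original code): map 'a'..'z' to
// 0..25 and return -1 for any character outside that range.
int char_to_offset(char ch) {
    if (ch < 'a' || ch > 'z') {
        return -1;
    }
    return ch - 'a';
}
```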
Recall that `cur_node` is the root which is 1 for our trie:
```c
if (trie->base[cur_node] == 0) {
trie->base[cur_node] = trie->size;
}
```
And since this is the first time insert is called there is nothing in the trie
at the moment. So we set base[1] = 1 (the initial size of the trie).

```
BASE[1] = 1
```

Next we are going to calculate the transition, or offset, from base[1] to
the node of the character 'c':
```c
// Calculate the transition index which uses base[s] + c.
int t = trie->base[cur_node] + char_offset;
```

```console
(gdb) p trie->base[cur_node] + char_offset
$11 = 3
```
This value is then used as the index into check:
```c
if (trie->check[t] == 0) {
trie->check[t] = cur_node;
trie->size++;
} else if (trie->check[t] != cur_node) {
// Handle conflicts in base/check
fprintf(stderr, "Error: Conflict detected while inserting '%s'.\n", word);
return;
}
```
And the value of this element will be cur_node, which is 1 in this case:
```
CHECK[3] = 1
```
And the size of the trie is incremented by 1 and will become 2.

The last thing in this loop is:
```c
cur_node = t;
```
This makes the node we just transitioned to, which is 3, the current node.

So this is what the arrays look like after we have iterated over the first
character in 'cow':
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BASE [ 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
```
When we insert a character we calculate some number for it which we use to
insert it. The character itself is never stored. And when we search we
calculate the same number using the character and if the “path” to/from the
parent is valid we have inserted it previously. This is what the check array
is for. So we assign a unique number to each character and this number is used
in both inserts and searches.

So if we wanted to search for the word 'cow' now we would start at the root:
```
char mapping: 'c' - 'a' = 2;
base[1] + 2 = 3,
1 + 2 = 3,
Double Array Representation:
check[3] == 1? Yes, continue
```
BASE[0] = Has one child which is 'c' (decimal 99).
Let's say that BASE[0] = 1 then we have 1 + 99 = 100. This would mean that the
node for 'c' would be placed at index 100 in the trie. This is far from the
root, wasting a lot of space.
If we want the node for 'c' to be at index 1, then we would have to set BASE[0]
to 1 - 99 = -98. This would mean that the node for 'c' would be placed at index
1 in the trie.
So that gives us BASE[0] = -98.
So we can see that the check verifies that we have indeed inserted this "path"
before and we can continue to the next character. We currently only have one
node but we will insert more below.
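
The lookup steps above can be sketched as a function. This is only a sketch under stated assumptions: the notes do not show a search function, so the name `search` and its signature are hypothetical, the struct simply mirrors the `ptype` output shown earlier, and input is assumed to be limited to 'a' to 'z'.

```c
#include <stdbool.h>

#define ROOT_NODE 1

// Mirrors the struct printed by gdb earlier in these notes.
typedef struct {
    int *base;
    int *check;
    bool *terminal;
    int size;
    int capacity;
} trie_dat;

// Hypothetical lookup: follow base[s] + c for every character and use
// check[t] to verify that the transition really belongs to the current node.
bool search(const trie_dat* trie, const char* word) {
    int cur_node = ROOT_NODE;
    for (int i = 0; word[i] != '\0'; i++) {
        if (word[i] < 'a' || word[i] > 'z') {
            return false;  // assumption: only lowercase letters are stored
        }
        int char_offset = word[i] - 'a';
        if (trie->base[cur_node] == 0) {
            return false;  // the current node has no children
        }
        int t = trie->base[cur_node] + char_offset;
        if (t >= trie->capacity || trie->check[t] != cur_node) {
            return false;  // this path was never inserted
        }
        cur_node = t;
    }
    // Every transition was valid, the word exists only if it ends here.
    return trie->terminal[cur_node];
}
```

With the arrays that result from inserting 'cow', `search(trie, "cow")` follows 1 -> 3 -> 16 -> 25 and returns true, while 'co' stops at node 16, which is not marked as terminal.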

Next we have the character 'o':
```console
(gdb) p char_offset
$18 = 14
```
BASE[0] + 99 = -98 + 99 = 1
And recall that `cur_node` is 3
```c
if (trie->base[cur_node] == 0) {

```
In the above trie node 'c' has two children 'a' and 'o'. The value in BASE[1]
should be chosen such that when we add the ASCII value of 'a' to BASE[1] we
get the correct index for 'a'. And likewise for 'o'.
Let's assume we want the node for 'a' to be at index 2. 'a' = 97.
```console
(gdb) p trie->size
$21 = 2
```
'a' = 97
BASE[1] + 97 = 2
BASE[1] = 2 - 97 = -95
We currently have 2 nodes 'ROOT' and 'c'.

BASE[0] = -98
BASE[1] = -95
And the last node we inserted was 'c' and the index was 3:
```
(gdb) p cur_node
$22 = 3
```
Then we will set trie->base[3] = 2:
```c
trie->base[cur_node] = trie->size;
```
And we calculate the transition index:
```c
int t = trie->base[cur_node] + char_offset;
```
```console
(gdb) p char_offset
$123 = 14
(gdb) p trie->base[cur_node]
$124 = 2
(gdb) p trie->base[cur_node] + char_offset
$125 = 16
```
And we will set check[16] = 3:
```c
trie->check[t] = cur_node;
```

The base array is used for navigating from one node to another node based on the
input characters.
This gives us arrays that look like this:
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BASE [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...]
CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0...]
```
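Next we have the character 'w' and cur_node is now 16. Since base[16] is still 0
it is set to the current size, which is 3, and with the offset for 'w' being 22
the transition index becomes 3 + 22 = 25: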
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
BASE [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

base[current_node] gives us the base index/offset for navigating from the current
node to its children. This is then used with an offset, another character in the
input. The offset is the ASCII value of the character.
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
```
This is the last character in the word 'cow' and we will set the terminal flag:
```c
trie->terminal[cur_node] = true;
```
So the arrays look like this after 'cow' has been inserted:
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
BASE [ 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
CHECK [ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 16]
```
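
To check the walk-through end to end, the fragments shown above can be stitched into one compilable sketch. Some parts are assumptions rather than the original code: the value of `INITIAL_SIZE` is not shown in these notes (64 is picked arbitrarily), and the range check, the capacity check and `free_trie` are additions so that the sketch stays within its allocated arrays.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ROOT_NODE 1
#define INITIAL_SIZE 64  // assumption: the real value is not shown in the notes

typedef struct {
    int *base;
    int *check;
    bool *terminal;
    int size;
    int capacity;
} trie_dat;

trie_dat* create_trie(void) {
    trie_dat* trie = malloc(sizeof(trie_dat));
    if (trie == NULL) {
        return NULL;
    }
    trie->size = ROOT_NODE;
    trie->capacity = INITIAL_SIZE;
    trie->base = calloc(trie->capacity, sizeof(int));
    trie->check = calloc(trie->capacity, sizeof(int));
    trie->terminal = calloc(trie->capacity, sizeof(bool));
    return trie;
}

// Added for the sketch so that the allocations can be released.
void free_trie(trie_dat* trie) {
    free(trie->base);
    free(trie->check);
    free(trie->terminal);
    free(trie);
}

void insert(trie_dat* trie, const char* word) {
    int cur_node = ROOT_NODE;
    for (int i = 0; word[i] != '\0'; i++) {
        if (word[i] < 'a' || word[i] > 'z') {
            return;  // added: only lowercase letters are handled
        }
        int char_offset = word[i] - 'a';
        // First child of this node: start its transitions at the current size.
        if (trie->base[cur_node] == 0) {
            trie->base[cur_node] = trie->size;
        }
        // Calculate the transition index which uses base[s] + c.
        int t = trie->base[cur_node] + char_offset;
        if (t >= trie->capacity) {
            fprintf(stderr, "Error: index %d exceeds capacity.\n", t);
            return;  // added: the sketch does not grow the arrays
        }
        if (trie->check[t] == 0) {
            trie->check[t] = cur_node;  // record the parent of the new node
            trie->size++;
        } else if (trie->check[t] != cur_node) {
            // Handle conflicts in base/check
            fprintf(stderr, "Error: Conflict detected while inserting '%s'.\n", word);
            return;
        }
        cur_node = t;
    }
    trie->terminal[cur_node] = true;
}
```

After `insert(trie, "cow")` on a fresh trie the non-zero entries are BASE[1] = 1, BASE[3] = 2, BASE[16] = 3, CHECK[3] = 1, CHECK[16] = 3 and CHECK[25] = 16, which matches the arrays above.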

So base[current_node] represents the offset in the array where the transitions
from the current node start.
Trie Structure with Node Numbers:
```
0(root)
1(root)
|
1(c)
/ \
2(a) 3(o)
| |
4(t*) 5(w*)
3(c)
/
16(o)
|
25(w*)
* = end of word
```

### Packed Double Array Trie (PDAT)
When implementing the double array trie and stepping through the code I noticed
that the arrays, BASE and CHECK, are quite sparsely populated. This is because
the arrays are allocated based on the number of nodes in the trie. The number of
nodes in the trie is based on the number of characters in the input strings.
This can be quite large and wasteful. Packing the arrays can reduce the memory
footprint of the trie implementation. There seem to be many different ways to
pack the arrays but I'll focus on the method used in llama.cpp.
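
To put a number on the sparsity, a small helper can count how many CHECK slots actually hold a parent index. The helper name `count_used_slots` is hypothetical and is only meant as an illustration. For the 'cow' walk-through above it reports 3 used slots out of the 26 shown.

```c
#include <stddef.h>

// Hypothetical helper: count the CHECK slots that hold a parent index
// (non-zero), which are the slots that represent a non-root node.
size_t count_used_slots(const int* check, size_t len) {
    size_t used = 0;
    for (size_t i = 0; i < len; i++) {
        if (check[i] != 0) {
            used++;
        }
    }
    return used;
}
```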

There is an example of a Double Array Trie in
[dat](../fundamentals/datastructures/sr/dat.c) which might be helpful to look
at to get an understanding of how a DAT works.

So, instead of using two arrays for BASE and CHECK we will now have a
single uint32_t array which will contain all the information for a node in the
trie:
```
Bits 0-7: LCHECK value (8 bits)
Bit 8: LEAF flag (1 bit)
Bit 9: BASE extension flag (1 bit)
Bits 10-30: BASE value or Value (21 bits)
Bit 31: Sign bit for LCHECK or additional VALUE bit
```
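
As a rough sketch of what such a packed unit could look like in code, the helpers below only follow the bit positions listed above. All names are hypothetical, the semantics of the BASE extension flag are not modelled, and bit 31 is simply left as zero, so this is not the llama.cpp implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define LCHECK_MASK 0xFFu       // bits 0-7
#define LEAF_BIT    (1u << 8)   // bit 8
#define EXT_BIT     (1u << 9)   // bit 9
#define BASE_SHIFT  10          // BASE or value starts at bit 10
#define BASE_MASK   0x1FFFFFu   // 21 bits, stored in bits 10-30

// Pack the fields into one 32-bit unit. Bit 31 is not used by this sketch.
uint32_t pack_unit(uint8_t lcheck, bool leaf, bool ext, uint32_t base) {
    return (uint32_t)lcheck
         | (leaf ? LEAF_BIT : 0u)
         | (ext ? EXT_BIT : 0u)
         | ((base & BASE_MASK) << BASE_SHIFT);
}

uint8_t unit_lcheck(uint32_t unit) { return (uint8_t)(unit & LCHECK_MASK); }
bool unit_is_leaf(uint32_t unit)   { return (unit & LEAF_BIT) != 0; }
bool unit_has_ext(uint32_t unit)   { return (unit & EXT_BIT) != 0; }
uint32_t unit_base(uint32_t unit)  { return (unit >> BASE_SHIFT) & BASE_MASK; }
```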
_wip_
