## GGML RoPE implementation notes

This document contains a walkthrough of the RoPE function in GGML.

The code for this can be found in [rope.c](../fundamentals/ggml/src/rope.c).

```console
$ gdb --args bin/rope
```
The first thing we do is set up the tensors and operations in the context,
which is done by calling `ggml_rope_ext`:
```c
struct ggml_tensor * ggml_rope_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        struct ggml_tensor  * c,
        int                   n_dims,
        int                   mode,
        int                   n_ctx_orig,
        float                 freq_base,
        float                 freq_scale,
        float                 ext_factor,
        float                 attn_factor,
        float                 beta_fast,
        float                 beta_slow) {
    return ggml_rope_impl(
        ctx, a, b, c, n_dims, mode, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow, false
    );
}
```
The parameters are:
* `a` is the tensor that is to be rotated.
* `b` is a one dimensional tensor that contains the positions.
* `c` is, I think, a tensor that contains scaling factors, but I have not
  gotten that far in my understanding yet; in the example NULL is passed in.
* `n_dims` is `d` in RoPE. TODO: link with the rope.md document.
* `mode` is ? (I believe it selects the RoPE variant, for example normal vs.
  NeoX style).
* `n_ctx_orig` is the model's original training context length; this might be
  used by PI to calculate `s` (s = L'/L), I think.
* `freq_base` is the base frequency, which is 10000 by default.
* `freq_scale` is a scaling factor which I thought might be -2 because that is
  what the paper uses, but the default is 1.0f.
* `ext_factor` might be the extrapolation factor, but I'm not sure.
* `attn_factor` is ? (possibly a magnitude scaling applied to the rotated
  values).
* `beta_fast` might be the YaRN parameter used for scaling the higher
  frequencies.
* `beta_slow` is also related to YaRN and might be used to scale the lower
  frequencies.
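To make the parameters concrete, here is a minimal sketch of how such a call
could be set up. This is my own illustration, not a copy of
[rope.c](../fundamentals/ggml/src/rope.c); the shapes and values roughly mirror
what shows up in the debug session later (a 128x32x512 `a` tensor and 512
positions), but treat every name and number here as an assumption:
```c
// Sketch only: names and values are assumptions, not copied from rope.c.
struct ggml_init_params init_params = {
    .mem_size   = 64*1024*1024,   // enough scratch memory for this example
    .mem_buffer = NULL,
    .no_alloc   = false,
};
struct ggml_context * ctx = ggml_init(init_params);

// a: the tensor to rotate, here [128, 32, 512].
struct ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 128, 32, 512);

// b: one position per token.
struct ggml_tensor * pos = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 512);
for (int i = 0; i < 512; i++) {
    ggml_set_i32_1d(pos, i, i);
}

// c = NULL: no per-dimension frequency factors in this example.
struct ggml_tensor * s = ggml_rope_ext(ctx, a, pos, NULL,
    128,        /* n_dims      */
    0,          /* mode        */
    4096,       /* n_ctx_orig  */
    10000.0f,   /* freq_base   */
    1.0f,       /* freq_scale  */
    0.0f,       /* ext_factor  */
    1.0f,       /* attn_factor */
    32.0f,      /* beta_fast   */
    0.0f        /* beta_slow   */);
```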
So that will set up the context and the tensor operations for RoPE. Next we
want to create a computation graph and run the forward pass:
```c
struct ggml_cgraph* c_graph = ggml_new_graph(ctx);
ggml_build_forward_expand(c_graph, s);
ggml_graph_compute_with_ctx(ctx, c_graph, 4);
```

Now, we can set a breakpoint in `ggml_compute_forward_rope`:
```console
(gdb) br ggml_compute_forward_rope
Breakpoint 5 at 0x55555558acf8: file /home/danbev/work/ai/learning-ai/fundamentals/ggml/ggml/src/ggml.c, line 14061.
(gdb) r
```
Keep in mind that execution in GGML is multithreaded, so multiple threads will
be running when our breakpoint is hit. To avoid switching threads while
stepping, lock the scheduler to the current thread:
```console
(gdb) set scheduler-locking on
```
And now we can step only the current thread.

```c
static void ggml_compute_forward_rope(
        const struct ggml_compute_params * params,
        struct ggml_tensor * dst) {

    const struct ggml_tensor * src0 = dst->src[0];

    switch (src0->type) {
        case GGML_TYPE_F16:
            {
                ggml_compute_forward_rope_f16(params, dst, true);
            } break;
        case GGML_TYPE_F32:
            {
                ggml_compute_forward_rope_f32(params, dst, true);
            } break;
        default:
            {
                GGML_ASSERT(false);
            } break;
    }
}
```
I need to read up on how this works, but with some handwaving:
`ggml_compute_params` is part of GGML's multithreading and contains information
about which thread this is and what part of the tensor it is working on.
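Very roughly, the pattern in ggml's forward functions (a sketch of the idea,
not a verbatim quote of the RoPE code) is that each thread knows its index
`ith` out of `nth` threads and only processes its own slice of the rows:
```c
const int ith = params->ith;    // index of this thread
const int nth = params->nth;    // total number of threads

const int64_t nr  = ggml_nrows(src0);      // total number of rows to process
const int64_t dr  = (nr + nth - 1) / nth;  // rows per thread, rounded up
const int64_t ir0 = dr * ith;              // first row for this thread
const int64_t ir1 = MIN(ir0 + dr, nr);     // one past the last row for this thread

for (int64_t ir = ir0; ir < ir1; ++ir) {
    // ... this thread rotates the elements of row ir ...
}
```
Back in the debugger, the source tensor looks like this: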
```console
(gdb) p *dst->src[0]
$9 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {128, 32, 512, 1}, nb = {4, 512, 16384, 8388608}, op = GGML_OP_RESHAPE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x7ffff68ed030, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
view_src = 0x7ffff68ed030, view_offs = 0, data = 0x7ffff68ed180, name = "a_reshaped", '\000' <repeats 53 times>, extra = 0x0}
```
In our case the type of the src tensor is F32:
```c
    case GGML_TYPE_F32:
        {
            ggml_compute_forward_rope_f32(params, dst, true);
        } break;
```
```c
static void ggml_compute_forward_rope_f32(
        const struct ggml_compute_params * params,
        struct ggml_tensor * dst,
        const bool forward) {

    const struct ggml_tensor * src0 = dst->src[0];
    const struct ggml_tensor * src1 = dst->src[1];
    const struct ggml_tensor * src2 = dst->src[2];
```
```console
(gdb) p *src0
$12 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {128, 32, 512, 1}, nb = {4, 512, 16384, 8388608}, op = GGML_OP_RESHAPE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x7ffff68ed030, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
view_src = 0x7ffff68ed030, view_offs = 0, data = 0x7ffff68ed180, name = "a_reshaped", '\000' <repeats 53 times>, extra = 0x0}

(gdb) p *src1
$13 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {512, 1, 1, 1}, nb = {4, 2048, 2048, 2048}, op = GGML_OP_NONE,
op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0,
data = 0x7ffff70ed460, name = "pos", '\000' <repeats 60 times>, extra = 0x0}
```
In this case we did not pass a `c` tensor, so `src2` is NULL.
Next come the parameters, that is, the parameters that were stored in the
operation's `op_params` (not the computation params):
```console
(gdb) p dst.op_params
$16 = {0, 128, 0, 0, 4096, 1176256512, 1065353216, 0, 1065353216, 1107296256, 0, 0, 0, 0, 0, 0}
```
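Those raw integers are just the bit patterns of the values that were passed to
`ggml_rope_ext`: the integer entries (128, 0, 4096) are `n_dims`, `mode` and
`n_ctx_orig` directly, while the float parameters were stored by copying their
bits into the `int32_t` array. As a sanity check (my own snippet, with the
values typed in from the dump above), the bits decode back as expected:
```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Reinterpret an int32_t bit pattern as a float, the inverse of how the float
// parameters end up in op_params.
static float bits_to_float(int32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof(float));
    return f;
}

int main(void) {
    printf("freq_base   = %f\n", bits_to_float(1176256512)); // 10000.0
    printf("freq_scale  = %f\n", bits_to_float(1065353216)); // 1.0
    printf("ext_factor  = %f\n", bits_to_float(0));          // 0.0
    printf("attn_factor = %f\n", bits_to_float(1065353216)); // 1.0
    printf("beta_fast   = %f\n", bits_to_float(1107296256)); // 32.0
    printf("beta_slow   = %f\n", bits_to_float(0));          // 0.0
    return 0;
}
```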
So these parameters are extracted into local variables:
```c
const int n_dims = ((int32_t *) dst->op_params)[1];
const int mode = ((int32_t *) dst->op_params)[2];
const int n_ctx_orig = ((int32_t *) dst->op_params)[4];

float freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow;

memcpy(&freq_base, (int32_t *) dst->op_params + 5, sizeof(float));
memcpy(&freq_scale, (int32_t *) dst->op_params + 6, sizeof(float));
memcpy(&ext_factor, (int32_t *) dst->op_params + 7, sizeof(float));
memcpy(&attn_factor, (int32_t *) dst->op_params + 8, sizeof(float));
memcpy(&beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
memcpy(&beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));


GGML_TENSOR_UNARY_OP_LOCALS
```
The macro expands to local variables like the following (the expanded source
can be generated with the make target `pre-ggml.c`):
```c
const int64_t ne00 = (src0)->ne[0]; (void)(ne00);
const int64_t ne01 = (src0)->ne[1]; (void)(ne01);
const int64_t ne02 = (src0)->ne[2]; (void)(ne02);
const int64_t ne03 = (src0)->ne[3]; (void)(ne03);
const size_t nb00 = (src0)->nb[0]; (void)(nb00);
const size_t nb01 = (src0)->nb[1]; (void)(nb01);
const size_t nb02 = (src0)->nb[2]; (void)(nb02);
const size_t nb03 = (src0)->nb[3]; (void)(nb03);
const int64_t ne0 = (dst)->ne[0]; (void)(ne0);
const int64_t ne1 = (dst)->ne[1]; (void)(ne1);
const int64_t ne2 = (dst)->ne[2]; (void)(ne2);
const int64_t ne3 = (dst)->ne[3]; (void)(ne3);
const size_t nb0 = (dst)->nb[0]; (void)(nb0);
const size_t nb1 = (dst)->nb[1]; (void)(nb1);
const size_t nb2 = (dst)->nb[2]; (void)(nb2);
const size_t nb3 = (dst)->nb[3]; (void)(nb3);
```
So this simply creates local variables for the dimensions (`ne`) and byte
strides (`nb`) of `src0` and `dst`, and the `(void)` casts are there to avoid
warnings about unused variables.
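These values are what the pointer arithmetic later in the function is built on.
As an illustration (my own example, not a line from the RoPE code), the element
at indices `(i0, i1, i2, i3)` of `src0` is found by offsetting the data pointer
with the byte strides:
```c
// Sketch: addressing one element of src0 via the nb byte strides.
// With ne00 = 128 and nb = {4, 512, 16384, 8388608} this moves 4 bytes per
// element along dim 0, 512 bytes along dim 1 and 16384 bytes along dim 2.
const float * x = (const float *)((const char *) src0->data
        + i0*nb00 + i1*nb01 + i2*nb02 + i3*nb03);
```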


A little further down we have the following:
```c
const float theta_scale = powf(freq_base, -2.0f/n_dims);
```
Now, this looks familiar (at least after reading [rope.md](rope.md)), and
notice that this also clarifies that `freq_scale` is not the per-dimension
frequency scaling factor, as I originally thought; that role belongs to
`freq_base`.
```console
(gdb) p theta_scale
$27 = 0.865964353
```
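Just to check that value: with `freq_base = 10000` and `n_dims = 128` we get
`10000^(-2/128) ≈ 0.86596`, which matches. A tiny standalone check (my own, not
part of rope.c):
```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const float freq_base = 10000.0f;
    const int   n_dims    = 128;
    // Same expression as in ggml_compute_forward_rope_f32.
    const float theta_scale = powf(freq_base, -2.0f/n_dims);
    printf("theta_scale = %.9f\n", theta_scale); // ~0.865964353
    return 0;
}
```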
Following that we have this function:
```c
float corr_dims[2];
ggml_rope_yarn_corr_dims(n_dims, n_ctx_orig, freq_base, beta_fast, beta_slow, corr_dims);
```
Notice that this is where we use (or at least pass in) the original context
length, `freq_base`, `beta_fast` and `beta_slow`, and that `corr_dims` is an
array of two floats that will be populated by the function.
```c
GGML_CALL void ggml_rope_yarn_corr_dims(
    int n_dims, int n_ctx_orig, float freq_base, float beta_fast, float beta_slow, float dims[2]
) {
    // start and end correction dims
    float start = floorf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_fast, freq_base));
    float end   =  ceilf(ggml_rope_yarn_corr_dim(n_dims, n_ctx_orig, beta_slow, freq_base));
    dims[0] = MAX(0, start);
    dims[1] = MIN(n_dims - 1, end);
}
```
Let's break this down a little and start by looking at the function
`ggml_rope_yarn_corr_dim`:
```c
// Apparently solving `n_rot = 2pi * x * base^((2 * max_pos_emb) / n_dims)` for x, we get
// `corr_dim(n_rot) = n_dims * log(max_pos_emb / (n_rot * 2pi)) / (2 * log(base))`
static float ggml_rope_yarn_corr_dim(int n_dims, int n_ctx_orig, float n_rot, float base) {
    return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
}
```
Notice that we are passing in `beta_fast` and `beta_slow` as `n_rot`.
What is going on here is that this function calculates the start and end
dimensions between which the rotation corrections should be applied. In our
case the start works out to about 20 (the calculation is worked through further
down); the end is computed the same way from `beta_slow` and then clamped to at
most `n_dims - 1`.

From the paper we see:

```
          2 PI
λ_d  =  -------  =  2PI * b^(2d/|D|)
        theta_d

d   = index of the hidden dimension.
b   = base frequency.
|D| = number of hidden dimensions.
```
λ_d is the wavelength at the d-th hidden dimension: the number of tokens needed
for the positional embedding at that dimension to cycle through a complete
period, that is, a full rotation of 2PI.

```
            L                L
r(d)  =  --------  =  ----------------
         lambda_d     2PI * b^(2d/|D|)
```
r(d) is then the number of full rotations that the d-th dimension goes through
over a context of length L, and the correction dimensions computed above are
(roughly) the dimensions where r(d) equals `beta_fast` and `beta_slow`.

```console
(gdb) s
ggml_rope_yarn_corr_dim (n_dims=128, n_ctx_orig=4096, n_rot=32, base=10000) at /home/danbev/work/ai/learning-ai/fundamentals/ggml/ggml/src/ggml.c:13778
13778 return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
```
So this becomes `128 * ln(4096 / (32 * 2 * pi)) / (2 * ln(10000))` which is
`128 * ln(4096 / 201.06193) / (2 * ln(10000))` which is
`128 * ln(20.372) / 18.42` which is `128 * 3.014 / 18.42`, about `20.9`. After
`floorf` and the `MAX(0, ...)` clamp the start correction dimension is 20. The
end is computed the same way from `beta_slow` and clamped to at most
`n_dims - 1`. These are the start and end of the correction dimensions.
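A small standalone check of that arithmetic (my own snippet, using the same
expression as `ggml_rope_yarn_corr_dim`):
```c
#include <stdio.h>
#include <math.h>

// Same expression as ggml_rope_yarn_corr_dim in ggml.c.
static float corr_dim(int n_dims, int n_ctx_orig, float n_rot, float base) {
    return n_dims * logf(n_ctx_orig / (n_rot * 2 * (float)M_PI)) / (2 * logf(base));
}

int main(void) {
    // Values from the gdb session above: n_dims=128, n_ctx_orig=4096,
    // beta_fast=32, base=10000.
    float start = corr_dim(128, 4096, 32.0f, 10000.0f);
    printf("corr dim for beta_fast=32: %f (floor -> %f)\n", start, floorf(start)); // ~20.9 -> 20
    return 0;
}
```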
