docs: fix typos in mamba notes
danbev committed Aug 6, 2024
1 parent c05fd1e commit a229831
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions notes/mamba.md
@@ -15,9 +15,9 @@ training as they can be parallelized, in contrast to RNNs which are sequential.

But, the issue with transformers is that they don't scale to long sequences
which is because the self attention mechanism is quadratic in the sequence
length. Every token has to attend to every other token in a sequenc (n²). So if
length. Every token has to attend to every other token in a sequence (n²). So if
we have 40 tokens that means 1600 attention operations, which means more
computation and this just increases the longer the input sequence it.
computation and this just increases the longer the input sequence is.
In this respect RNNs are more performant as they don't have the quadratic
scaling issue that the self attention mechanism has (but they do have other
issues, such as slower training).
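
To make the scaling concrete, here is a rough numpy sketch (my own illustration, not from the notes) that builds the n×n attention score matrix for 40 tokens versus a single linear pass over the same tokens; the embedding size and all values are made up:
```
import numpy as np

n, d = 40, 8                              # 40 tokens, toy embedding size (made up)
x = np.random.randn(n, d)

# Self attention: every token attends to every other token, so the score
# matrix is n x n. For 40 tokens that is 1600 entries (and it grows as n²).
scores = x @ x.T                          # (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x                    # (n, d)

# A recurrent model only does one update per token, so the work grows
# linearly with n (but the steps have to run one after the other).
h = np.zeros(d)
for t in range(n):
    h = 0.9 * h + 0.1 * x[t]              # toy recurrence, not a trained model

print(scores.size, n)                     # 1600 vs 40
```
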
@@ -26,7 +26,7 @@ The core of Mamba is state space models (SSMs). Before we go further it might
make sense to review [RNNs](./rnn.md) and [SSMs](./state-space-models.md).

Selective state space models, which Mamba is a type of, give us a linear
recurrent network simliar to RRNs, but also have the fast training that we gets
recurrent network similar to RNNs, but also have the fast training that we get
from transformers. So we get the best of both worlds.
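
A minimal sketch of that claim (my own toy example, with made-up scalar parameters): for a time-invariant SSM the same outputs can be computed either with an RNN-style recurrence, or by unrolling the recurrence into a convolution, which is what allows parallel training. Selective SSMs like Mamba make the parameters input-dependent, so they use a parallel scan instead of a convolution, but the recurrent view is the same.
```
import numpy as np

# Toy, already-discretized scalar SSM parameters (made up, not from the notes).
A_bar, B_bar, C = 0.9, 0.5, 1.2
x = np.array([1.0, 0.0, -1.0, 2.0, 0.5])  # a short input sequence

# Recurrent view: O(n) sequential steps, like an RNN at inference time.
h, ys_rec = 0.0, []
for x_t in x:
    h = A_bar * h + B_bar * x_t
    ys_rec.append(C * h)

# Convolutional view: unroll the recurrence into the kernel
# K = (CB, CAB, CA²B, ...) and convolve it with the input.
K = np.array([C * A_bar**k * B_bar for k in range(len(x))])
ys_conv = [np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(len(x))]

print(np.allclose(ys_rec, ys_conv))       # True: both views agree
```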

One major difference with state space models is that they have state which is
@@ -37,7 +37,7 @@ like RNNs do have state, but recall that they process the input sequentially.
To understand how Mamba fits in, I found it useful to compare it to how
transformers look in a neural network:
```
Residul
Residual
+---------> |
| |
| +-------------------+
@@ -62,7 +62,7 @@ Residul ↑
```
And then we have Mamba:
```
Residul
Residual
+---------> |
| |
| +-------------------+
@@ -132,22 +132,22 @@ definition of a state space model. This makes sense if we think about it as
this is not specific to neural networks or even computers. Think about an analog
system, for example an IoT device that reads the temperature from a sensor
connected to it. To process this signal it needs to be converted into digital
form. A simliar thing needs to be done in this case as we can't use continous
form. A similar thing needs to be done in this case, as we can't use continuous
signals with computers, just like an IoT device can't process an analog signal
directly. So we need to convert it into discrete time steps, similar to how an
Analog-to-Digital Converter ([ADC]) would convert the signal into quantized
values. This step is called discretization in the state space model.

[ADC]: https://github.com/danbev/learning-iot/tree/master?tab=readme-ov-file#analog-to-digital-converter-adc

So instead of the using functions as shown above we concrete values we will
So instead of using the functions as shown above with concrete values, we will
transform A and B into discrete values and the equations become:
```
     _       _
hₜ = Ahₜ₋₁ + Bxₜ
yₜ = Chₜ + Dxₜ
```
To get the A_hat and B_hat values a process called discretization is used.
To get the `A_hat` and `B_hat` values a process called discretization is used.
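
As a small sketch of what stepping these discrete equations looks like (all matrix values below are arbitrary toy numbers, not from the notes):
```
import numpy as np

# Toy discretized parameters A_hat, B_hat plus C and D (values made up).
A_hat = np.array([[0.9, 0.0],
                  [0.1, 0.8]])
B_hat = np.array([[0.10],
                  [0.05]])
C = np.array([[1.0, -1.0]])
D = np.array([[0.0]])

h = np.zeros((2, 1))                      # hidden state h₀
for value in [1.0, 0.5, -0.2]:            # a short input sequence
    x_t = np.array([[value]])
    h = A_hat @ h + B_hat @ x_t           # hₜ = A_hat hₜ₋₁ + B_hat xₜ
    y_t = C @ h + D @ x_t                 # yₜ = C hₜ + D xₜ
    print(y_t.item())
```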

### Discretization
So we will first discretize the parameters A and B of the state space model,
@@ -339,7 +339,7 @@ transform A and B into discrete values and the equations become:
hₜ = Ahₜ₋₁ + Bxₜ
yₜ = Chₜ + Dxₜ
```
Where A_hat and B_hat are:
Where `A_hat` and `B_hat` are:
```
A_hat = (I - Δ/2 A)⁻¹ (I + Δ/2 A)    (bilinear transform; ⁻¹ is the matrix inverse)
B_hat = (I - Δ/2 A)⁻¹ ΔB
```
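
A short numpy sketch of that bilinear discretization (the continuous A, B and the step size Δ below are made-up toy values, not from the notes):
```
import numpy as np

# Toy continuous-time parameters; in a real model these are learned (values made up).
A = np.array([[-1.0,  0.0],
              [ 0.5, -2.0]])
B = np.array([[1.0],
              [0.5]])
delta = 0.1                               # step size Δ

I = np.eye(2)
inv = np.linalg.inv(I - delta / 2 * A)
A_hat = inv @ (I + delta / 2 * A)         # (I - Δ/2 A)⁻¹ (I + Δ/2 A)
B_hat = inv @ (delta * B)                 # (I - Δ/2 A)⁻¹ ΔB

print(A_hat)
print(B_hat)
```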
