diff --git a/notes/architectures/mamba.md b/notes/architectures/mamba.md
index 43c04e10..a9de296b 100644
--- a/notes/architectures/mamba.md
+++ b/notes/architectures/mamba.md
@@ -9,7 +9,7 @@ One of the authors is Tri Dao, was also involved in the developement of
 [Flash Attention](./flash-attention.md) and Mamba takes advantage of the GPU
 hardware.
 
-Transformers are effecient at training as they can be parallelized, incontrast
+Transformers are efficient at training as they can be parallelized, in contrast
 to RNNs which are sequential which makes training large models a slow process.
 
 But, the issue with transformers is that they don't scale to long sequences
@@ -27,7 +27,7 @@ make sense to review [RNNs](./rnn.md) and [SSMs](./state-space-models.md).
 
 Paper: [Mamba: Linear-Time Sequence Modeling with Selective State Space](https://arxiv.org/pdf/2312.00752)
 
-Selective state space models, which Mamaba is a type of, give us a linear
+Selective state space models, which Mamba is a type of, give us a linear
 recurrent network simliar to RRNs, but also have the fast training that we get
 from transformers. So we get the best of both worlds.
 