-
Twas inevitable @turboderp
-
Landmark Attention: Random-Access Infinite Context Length for Transformers
-
I've seen the paper, yeah. Haven't read it in detail yet, but it looks very much like I'd have to see the code to have any idea what they're actually doing and whether it's something that can be applied to a pretrained model. There are lots and lots of "infinite context" techniques that all make big promises, but what we actually need is either a base model trained with those techniques or a way to adapt them to a pretrained model, 'cause I don't have $200k lying around to train one myself. :) If that's even enough...

I think most of the reason all these pretrained models have a limited context size isn't that attention is so impossibly complex to evaluate; I can do inference at almost 10k tokens with just Llama and a single A100. Getting suitable training data for long-range training, though, that's harder. And it's inevitably going to take longer, since long-range relationships in a text always represent less of the total information content than short-range relationships.

And just finding a single bit of information in a long sequence is not the same as actually reasoning about the whole sequence. What if there's more than one item to consider? What if they're interdependent? "My car is blue. My house is the same color as my car. My shirt is the same color as my house. What color is my shirt?" That kind of thing requires full attention over the entire sequence.

So we'll see when they release the code.
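To make that last point concrete, here's a minimal, hypothetical sketch (the items, colors, and filler text are just placeholders, not anything from the paper or from this repo) of generating that kind of chained-attribute test, where answering requires following every link in the chain rather than retrieving a single fact:

```python
# Hypothetical sketch: build a multi-hop "same color as" prompt, in the spirit
# of the car/house/shirt example. Answering it means chaining every link in
# the context, not just finding one isolated fact.
import random

def build_chain_prompt(num_links: int = 3, filler_sentences: int = 20, seed: int = 0):
    """Return (prompt, expected_answer) for a chained-attribute question."""
    rng = random.Random(seed)
    items = ["car", "house", "shirt", "bike", "hat", "door", "boat", "desk"][: num_links + 1]
    color = rng.choice(["blue", "red", "green", "yellow"])

    # The anchor fact plus one "same color as" link per remaining item.
    facts = [f"My {items[0]} is {color}."]
    for prev, cur in zip(items, items[1:]):
        facts.append(f"My {cur} is the same color as my {prev}.")

    # Pad with unrelated sentences and shuffle, so the links end up scattered
    # across a long context instead of sitting next to each other.
    filler = [f"Note {i}: nothing important happened today." for i in range(filler_sentences)]
    body = facts + filler
    rng.shuffle(body)

    question = f"What color is my {items[-1]}?"
    return " ".join(body) + " " + question, color

if __name__ == "__main__":
    prompt, answer = build_chain_prompt(num_links=3, filler_sentences=20)
    print(prompt)
    print("Expected answer:", answer)
```

Feeding prompts like that to a model at increasing context lengths and link counts would at least separate "can find one needle in a haystack" from "can actually reason over interdependent facts", which is where a lot of these long-context methods seem most likely to fall down.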
-
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
https://arxiv.org/abs/2305.07185