-
Twas inevitable @turboderp
-
Landmark Attention: Random-Access Infinite Context Length for Transformers
-
I've seen the paper, yeah. Haven't read it in detail yet, but it looks very much like I'd have to see the code to have any idea what they're actually doing and whether it's something that can be applied to a pretrained model. There are lots and lots of "infinite context" techniques that all make big promises, but what we actually need is either a base model trained with those techniques or a way to adapt them to a pretrained model, 'cause I don't have $200k lying around to train one myself. :) If that's even enough...

I think most of the reason all these pretrained models have a limited context size isn't that attention is so impossibly complex to evaluate; I can do inference at almost 10k tokens with just Llama and a single A100. Getting suitable training data for long-range training, though, that's harder. And it's inevitably going to take longer, since long-range relationships in a text always represent less of the total information content than short-range relationships.

And just finding a single bit of information in a long sequence is not the same as actually reasoning about the whole sequence. What if there's more than one item to consider? What if they're interdependent? "My car is blue. My house is the same color as my car. My shirt is the same color as my house. What color is my shirt?" That kind of thing requires full attention over the entire sequence.

So we'll see when they release the code.
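To make that last point concrete, here's a minimal, hypothetical sketch (the items, colors, and filler text are just placeholders, not anything from the paper or from this repo) of generating that kind of chained-attribute test, where answering requires following every link in the chain rather than retrieving a single fact:

```python
# Hypothetical sketch: build a multi-hop "same color as" prompt, in the spirit
# of the car/house/shirt example. Answering it means chaining every link in
# the context, not just finding one isolated fact.
import random

def build_chain_prompt(num_links: int = 3, filler_sentences: int = 20, seed: int = 0):
    """Return (prompt, expected_answer) for a chained-attribute question."""
    rng = random.Random(seed)
    items = ["car", "house", "shirt", "bike", "hat", "door", "boat", "desk"][: num_links + 1]
    color = rng.choice(["blue", "red", "green", "yellow"])

    # The anchor fact plus one "same color as" link per remaining item.
    facts = [f"My {items[0]} is {color}."]
    for prev, cur in zip(items, items[1:]):
        facts.append(f"My {cur} is the same color as my {prev}.")

    # Pad with unrelated sentences and shuffle, so the links end up scattered
    # across a long context instead of sitting next to each other.
    filler = [f"Note {i}: nothing important happened today." for i in range(filler_sentences)]
    body = facts + filler
    rng.shuffle(body)

    question = f"What color is my {items[-1]}?"
    return " ".join(body) + " " + question, color

if __name__ == "__main__":
    prompt, answer = build_chain_prompt(num_links=3, filler_sentences=20)
    print(prompt)
    print("Expected answer:", answer)
```

Feeding prompts like that to a model at increasing context lengths and link counts would at least separate "can find one needle in a haystack" from "can actually reason over interdependent facts", which is where a lot of these long-context methods seem most likely to fall down.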
-
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
https://arxiv.org/abs/2305.07185