Various Scale Things #22
Conversation
Also random, but did you work on Bliss at Uber? I think I might have been an intern on your team lol.
@mitchellgordon95 ohhh my god, yes, I worked on Bliss ... I think I remember you now! Lol 🦦 🦦 🦦 will never forget the otter branding
@mitchellgordon95 this looks good, but i hesitate to merge it because of so many changes. we could always keep it open and i can gradually incorporate some of the ideas (parallelizing chunks to embedding is a great idea!) thank you for sharing regardless!
@mitchellgordon95 how much of a speed up are you seeing removing autofaiss in favor of something closer to scann? some benchmarks would definitely help sway me towards more complicated code :)
Sure! I'm planning on testing out the indexing as soon as the embedding is finished, so I can run some benchmarks on auto-faiss in parallel. It's actually been so many weekends since I switched away from auto-faiss that I forget exactly why I did it 😅, but I know I must have had a good reason because it was a PITA to set up. I think one reason was that auto-faiss doesn't support index training on GPUs (which can be very slow for 64M training examples). Another reason was I wanted to make sure that we fully optimized memory usage, since The Pile is around 5.8B chunks. Even with the full PQ compression etc., it still ends up being 8 bytes per embedding ~= 43 GB of RAM to store the index.
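For context, the GPU-training gap matters because raw faiss can train the coarse quantizer and PQ codebooks on a device. The sketch below is a minimal illustration of that pattern only; the dimension, cluster count, and training-set size are made up for the example and are not this PR's actual settings:

```python
import faiss
import numpy as np

dim = 768     # illustrative; assumes BERT-sized embeddings
nlist = 4096  # illustrative number of IVF clusters

# 8 subquantizers x 8 bits = 8 bytes per stored vector,
# which matches the ~43 GB estimate for 5.8B chunks above.
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, 8, 8)

# Stand-in for the tens of millions of real training vectors.
train_vecs = np.random.rand(1_000_000, dim).astype("float32")

# Train on GPU (the slow step), then move the trained index
# back to CPU for adding vectors and serving.
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
gpu_index.train(train_vecs)
index = faiss.index_gpu_to_cpu(gpu_index)
```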
@mitchellgordon95 ok cool! i'll take a look at the new faiss indexing method this coming week. benchmarks would definitely help! i definitely like your other changes, so if you are willing to break them up into separate PRs, we could merge them in before bringing in https://github.com/lucidrains/RETRO-pytorch/pull/22/files#diff-91e76e2663878e2a72d63398db46c8fa835e402b4f44c0c87010b48f790fc021R320 thank you for all this!
@mitchellgordon95 who are you training RETRO for btw? are you working at latitude games? not still at Uber I hope lol
Yeah I'm at Latitude. It's not a priority project, but I've used my last 3 hackathons to work on it lol
@lucidrains @mitchellgordon95 hi gang! I have been working on a training loop for this code, which you can see here: We are doing something atypical, and may need to fork and reimplement both of your implementations. I am aware of the minimum requirements of the license, but I wanted to ask: would you take offense? We will reference the upstreams that generated our ideas (i.e. this repo and this PR). We may not submit the work upstream, as it would be a fresh start, although our code is licensed with ASL2. Also, do either of you have any examples of inferencing, or use cases where this type of retroformer is of use? Thank you,
Hello, I have done a basic benchmark of training. If you can provide other benchmark suggestions, happy to provide A/B/A comparisons and report in this PR.

Overview: Artificial Wisdom™ Retrieval Transformer benchmark results

System under test:
Baseline wallclock:
Observation:
PR rework wallclock:
Observation:
My general observation is that faiss does not appear to use compute on the GPUs, only memory. There may be a defect in our build, or in the implementation. cc @rstarmer @MostAwesomeDude. Thank you,
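One way to narrow that down is to check whether the faiss build is GPU-enabled and whether the index object was actually moved onto a device, since a CPU index allocates host memory but never runs device kernels. This is a generic diagnostic sketch, not code from this PR:

```python
import faiss

# A CPU-only faiss build reports 0 GPUs (or lacks the GPU symbols entirely).
print("visible GPUs:", faiss.get_num_gpus())

# Only indexes explicitly copied to a device do GPU compute; searching
# the original CPU index uses host cores regardless of what's installed.
cpu_index = faiss.IndexFlatL2(768)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)
print(type(cpu_index).__name__, "->", type(gpu_index).__name__)
```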
yes absolutely! the giving is unconditional, thus MIT
have you seen Nvidia's follow-up Retro2 paper yet?
@lucidrains I haven't, if you can share or have the title, I would love to see it! Robert found [Megatron RETRO](https://github.com/NVIDIA/Megatron-LM#retro), although I don't know if this is what you were referencing. We are building a library that composes the following things:
You pick 1 from each of the three categories, and can use them in composition. Unlike langchain, our work is designed around using shared memory for API communication instead of HTTPS, i.e. more like a monolithic kernel. Would love to have further suggestions for the idea proposed here.

I understand that if the MIT license requirements are met, then the software is licensed with those terms. As an open source dev, as long as someone using software I wrote complied with the terms, I was always good with however they used it. What I am asking is a little different: if I were to use your code as a reference for A/B/A testing, and also to learn from, but didn't integrate your library, would you take offense? I never did, but I was, and am, all in on open source. Many don't understand the finer mechanics of ASL/MIT/BSD-revised, and do take offense. It sounds like you are very experienced in this area, which is awesome! Your code is an invaluable resource to the engineering community. Thank you for your gifts. Thank you,
I did notice, after switching to a larger dataset, specifically https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample, that the act of building the embedding indexes is "different", and possibly faster. I will publish an updated comparison when I have one to give.
thanks for the kind words! you are free to use this repository however you wish, no conditions. https://arxiv.org/abs/2304.06762 is the paper. I haven't gone through it, but the author apparently found some further simplifications. what they have in the Megatron repo should be this Retro++
Hey Lucid,
I've been working on scaling the DB up to contain the whole Pile in my free time. En route to this, I've made a few changes that you might be interested in merging:
- Parallelized `chunks_to_embeddings_` by adding a "worker_id" param
- `chunks_to_precalculated_knn_` should be able to reuse pre-computed embeds etc.
- A `BertEmbeds` class that supports memmap'd files > 16 TB (the max file size on ext4)

Index creation at scale isn't really tested (but I have tested it at smaller scales). I'm running embedding at scale right now and I think it works. Anyway, I don't really expect you to just merge this but figured I'd mention it before I get too far off master.
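As a rough illustration of the >16 TB workaround: split one logical embedding matrix across several memmap'd files so no single file hits the ext4 limit. The class below is a hypothetical sketch of that sharding idea, not the PR's actual `BertEmbeds` implementation:

```python
import numpy as np

class ShardedEmbeds:
    """Hypothetical sketch: one logical (num_rows, dim) float32 array
    split across several memmap files, each kept under a size cap."""

    def __init__(self, path_prefix, num_rows, dim, rows_per_shard):
        self.dim = dim
        self.rows_per_shard = rows_per_shard
        self.shards = []
        for i, start in enumerate(range(0, num_rows, rows_per_shard)):
            rows = min(rows_per_shard, num_rows - start)
            # mode="w+" creates (or overwrites) each shard file on disk.
            self.shards.append(np.memmap(
                f"{path_prefix}.{i}.dat", dtype=np.float32,
                mode="w+", shape=(rows, dim)))

    def __getitem__(self, row):
        # Map a global row index to (shard, offset within shard).
        shard, offset = divmod(row, self.rows_per_shard)
        return self.shards[shard][offset]

    def __setitem__(self, row, vec):
        shard, offset = divmod(row, self.rows_per_shard)
        self.shards[shard][offset] = vec
```

Reads and writes go through the OS page cache, so the working set stays bounded even when the logical array is far larger than RAM.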