
Various Scale Things #22

Closed · wants to merge 26 commits


Conversation

mitchellgordon95
Contributor

Hey Lucid,

I've been working on scaling the DB up to contain the whole Pile in my free time. En route to this, I've made a few changes that you might be interested in merging:

  • Ditch autofaiss in favor of manually constructing the FAISS index to be as close to SCANN as possible (rough sketch at the end of this comment)
  • Add support for training the index on GPUs
  • Parallelize chunks_to_embeddings_ by adding a "worker_id" param
  • Let chunks_to_precalculated_knn_ reuse pre-computed embeddings and other intermediate artifacts instead of recomputing them
  • Add a BertEmbeds class that supports memmap'd files > 16 TB (max file size on ext4)
  • Add support for going from jsonl->chunks (the format the Pile ships in) in addition to txt->chunks (just for convenience)

Index creation at scale isn't really tested (but I have tested it at smaller scales). I'm running embedding at scale right now and I think it works. Anyway, I don't really expect you to just merge this but figured I'd mention it before I get too far off master.
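
For a rough idea of the index construction in the first bullet, here is a minimal sketch. It assumes faiss-gpu is installed, and the factory string, nlist, and sample sizes are illustrative stand-ins rather than the exact values in this PR:

```python
import faiss
import numpy as np

dim = 768      # BERT embedding size
nlist = 4096   # number of IVF cells (illustrative; much larger at Pile scale)

# OPQ down to 64 dims, an IVF coarse quantizer, and 8-byte PQ codes per vector,
# roughly in the spirit of a SCANN-style index.
index = faiss.index_factory(dim, f"OPQ8_64,IVF{nlist},PQ8", faiss.METRIC_L2)

# Run the coarse-quantizer k-means on GPU (the slow part with tens of millions
# of training vectors) while the rest of the index stays on the CPU.
if faiss.get_num_gpus() > 0:
    index_ivf = faiss.extract_index_ivf(index)
    index_ivf.clustering_index = faiss.index_cpu_to_all_gpus(
        faiss.IndexFlatL2(index_ivf.d)
    )

train_sample = np.random.rand(200_000, dim).astype("float32")  # stand-in for real embeddings
index.train(train_sample)
index.add(train_sample)  # in practice, add the full corpus in batches
faiss.write_index(index, "retro.index")
```

Setting clustering_index pushes just the IVF k-means onto the GPU; the OPQ/PQ training and the stored codes stay on the CPU side.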

@mitchellgordon95
Contributor Author

Also random, but did you work on Bliss at Uber? I think I might have been an intern on your team lol.

@lucidrains
Owner

lucidrains commented May 8, 2022

@mitchellgordon95 ohhh my god, yes, I worked on Bliss ... I think I remember you now! Lol 🦦 🦦 🦦 will never forget the otter branding

@lucidrains
Owner

@mitchellgordon95 this looks good, but i hesitate to merge it because of so many changes. we could always keep it open and i can gradually incorporate some of the ideas (parallelizing chunks to embedding is a great idea!)

thank you for sharing regardless!

@lucidrains
Owner

lucidrains commented May 8, 2022

@mitchellgordon95 how much of a speed up are you seeing removing autofaiss in favor of something closer to scann? some benchmarks would definitely help sway me towards more complicated code :)

@mitchellgordon95
Contributor Author

mitchellgordon95 commented May 8, 2022

Sure! I'm planning on testing out the indexing as soon as the embedding is finished, so I can run some benchmarks on auto-faiss in parallel.

It's actually been so many weekends since I switched away from auto-faiss that I forget exactly why I did it 😅, but I know I must have had a good reason because it was a PITA to set up.

I think one reason was that auto-faiss doesn't support index training on GPUs (which can be very slow for 64M training examples). Another reason was I wanted to make sure that we fully optimized memory usage since The Pile is around 5.8B chunks. Even with the full PQ compression etc., it still ends up being 8 bytes per embedding ~= 43 GB of RAM to store the index.
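
The back-of-the-envelope behind that figure, assuming 8 bytes of PQ code per chunk and ignoring IVF list and ID overhead:

```python
n_chunks = 5.8e9       # approximate number of chunks in The Pile
bytes_per_code = 8     # 8-byte PQ code per embedding
total_bytes = n_chunks * bytes_per_code
print(f"{total_bytes / 2**30:.1f} GiB")  # ~43.2 GiB just for the codes
```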

@lucidrains
Owner

@mitchellgordon95 ok cool! i'll take a look at the new faiss indexing method this coming week. benchmarks would definitely help! i definitely like your other changes, so if you are willing to break them up into separate PRs, we could merge them in before bringing in https://github.com/lucidrains/RETRO-pytorch/pull/22/files#diff-91e76e2663878e2a72d63398db46c8fa835e402b4f44c0c87010b48f790fc021R320

thank you for all this!

@lucidrains
Owner

@mitchellgordon95 who are you training RETRO for btw? are you working at latitude games? not still at Uber I hope lol

@mitchellgordon95
Contributor Author

Yeah I'm at Latitude. It's not a priority project, but I've used my last 3 hackathons to work on it lol

@sdake

sdake commented Jul 4, 2023

@lucidrains @mitchellgordon95 hi gang! I have been working on a training loop for this code, which you can see here:
artificialwisdomai/origin#50

We are doing something atypical, and may need to fork and reimplement both of your implementations. I am aware of the minimum requirements of the license, but I wanted to ask: would you take offense? We will reference the upstreams that generated our ideas (i.e. this repo and this PR).

We may not submit the work upstream as it would be a fresh start, although our code is licensed with ASL2.

Also, do either of you have any examples of inferencing or use cases where this type of retroformer is of use?

Thank you,
-steve

@sdake

sdake commented Jul 4, 2023

Hello,

I have done a basic benchmark of training. If you can provide other benchmark suggestions, happy to provide A/B/A comparisons and report in this PR.

Overview

Artificial Wisdom™ Retrieval Transformer benchmark results

System under test

Baseline

Wallclock

(baseline) sdake@beast-06:~/repos/origin/retrieval$ REPROCESS=1 python train.py
Artificial Wisdom™ Retreival Transformer Training
• retrieval_model=artificialwisdomai/retroformer • foundation_model=mosaicml/mpt30b •
Epoch 0 100%   ━━━━━━━━━━━━━━━━━━━━━━ • retrieved=65568 • loss=3.89 • 0:14:15 • 0:00:00

Observation

  • GPU consumed memory: 33.2 GB @ 1 A40
  • GPU utilization: 50-70% @ 1 A40

PR rework

Wallclock

Artificial Wisdom™ Retreival Transformer Training
• retrieval_model=artificialwisdomai/retroformer • foundation_model=mosaicml/mpt30b •
Epoch 0 100%   ━━━━━━━━━━━━━━━━━━━━━━ • retrieved=65568 • loss=3.82 • 0:14:25 • 0:00:00

Observation

  • GPU consumed memory: 33.2 GB @ 1 A40
  • GPU utilization: 50-70% @ 1 A40
  • Three other GPUs: 2 GB @ 3 A30
  • GPU utilization on three A30s = 0%.
  • A system without autofaiss is valuable for many reasons.

My general observation is that faiss does not appear to use compute on the GPUs, only memory. There may be a defect in our build, or in the implementation.
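
For what it's worth, a standalone probe along these lines (a sketch, assuming a working faiss-gpu build) can distinguish a broken build from an index that simply was never moved to the GPU:

```python
import faiss
import numpy as np

d = 768
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(1_000, d).astype("float32")

print("GPUs visible to faiss:", faiss.get_num_gpus())

# A flat index copied onto all visible GPUs; search on it must run on the GPU.
gpu_index = faiss.index_cpu_to_all_gpus(faiss.IndexFlatL2(d))
gpu_index.add(xb)

# If nvidia-smi shows compute activity while this loop runs, the build is fine,
# and the 0% utilization above likely means the retrieval index never left the CPU.
for _ in range(100):
    distances, ids = gpu_index.search(xq, 10)
```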

cc @rstarmer @MostAwesomeDude.

Thank you,
-steve

@lucidrains
Owner

yes absolutely! the giving is unconditional, thus MIT

@lucidrains
Owner

have you seen Nvidia's follow-up Retro2 paper yet?

@sdake

sdake commented Jul 5, 2023

@lucidrains I haven't, but if you can share it or have the title, I would love to see it! Robert found [Megatron RETRO](https://github.com/NVIDIA/Megatron-LM#retro), although I don't know if this is what you were referencing.

We are building a library that composes the following things:

  • Three retrieval transformer architectures
  • Three large language models
  • Three vector stores

You pick one from each of the three categories and use them in composition. Unlike langchain, our work is designed around shared memory for API communication instead of HTTPS, i.e. more like the monolithic Linux kernel (kernel.org) and less like the microkernel design of Windows NT.

I would love further suggestions on the idea proposed here. I understand that if the MIT license requirements are met, the software may be used under those terms. As an open-source dev, as long as someone using software I wrote complied with the terms, I was always fine with however they used it.

What I am asking is a little different. If I were to use your code as a reference for A/B/A testing, and also to learn from, but didn't integrate your library, would you take offense? I never did, but I was, and am, all in on open source. Many don't understand the finer mechanics of ASL/MIT/BSD-revised and do take offense. It sounds like you are very experienced in this area, which is awesome!

Your code is an invaluable resource to the engineering community. Thank you for your gifts.

Thank you,
-steve

@sdake

sdake commented Jul 5, 2023

I did notice, after switching to a larger dataset (specifically https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample), that building the embedding indexes is "different", and possibly faster. I will publish an updated comparison when I have one to give.

@lucidrains
Owner

thanks for the kind words

you are free to use this repository however you wish, no conditions

https://arxiv.org/abs/2304.06762 this is the paper. I haven't gone through it but the authors apparently found some further simplifications. what they have in the Megatron repo should be this Retro++

@latitudegames closed this by deleting the head repository on Oct 16, 2023