Various Scale Things #22
Conversation
Also random, but did you work on Bliss at Uber? I think I might have been an intern on your team lol.
@mitchellgordon95 ohhh my god, yes, I worked on Bliss ... I think I remember you now! Lol 🦦 🦦 🦦 will never forget the otter branding
@mitchellgordon95 this looks good, but i hesitate to merge it because of so many changes. we could always keep it open and i can gradually incorporate some of the ideas (parallelizing chunks to embedding is a great idea!) thank you for sharing regardless!
@mitchellgordon95 how much of a speed up are you seeing removing autofaiss in favor of something closer to scann? some benchmarks would definitely help sway me towards more complicated code :)
Sure! I'm planning on testing out the indexing as soon as the embedding is finished, so I can run some benchmarks on auto-faiss in parallel. It's actually been so many weekends since I switched away from auto-faiss that I forget exactly why I did it 😅, but I know I must have had a good reason because it was a PITA to set up. I think one reason was that auto-faiss doesn't support index training on GPUs (which can be very slow for 64M training examples). Another reason was I wanted to make sure that we fully optimized memory usage, since The Pile is around 5.8B chunks. Even with the full PQ compression etc., it still ends up being 8 bytes per embedding ~= 43 GB of RAM to store the index.
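For context, the GPU-training gap matters because raw faiss can train the coarse quantizer and PQ codebooks on a device. The sketch below is a minimal illustration of that pattern only; the dimension, cluster count, and training-set size are made up for the example and are not this PR's actual settings:

```python
import faiss
import numpy as np

dim = 768     # illustrative; assumes BERT-sized embeddings
nlist = 4096  # illustrative number of IVF clusters

# 8 subquantizers x 8 bits = 8 bytes per stored vector,
# which matches the ~43 GB estimate for 5.8B chunks above.
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, 8, 8)

# Stand-in for the tens of millions of real training vectors.
train_vecs = np.random.rand(1_000_000, dim).astype("float32")

# Train on GPU (the slow step), then move the trained index
# back to CPU for adding vectors and serving.
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
gpu_index.train(train_vecs)
index = faiss.index_gpu_to_cpu(gpu_index)
```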
@mitchellgordon95 ok cool! i'll take a look at the new faiss indexing method this coming week. benchmarks would definitely help! i definitely like your other changes, so if you are willing to break them up into separate PRs, we could merge them in before bringing in https://github.com/lucidrains/RETRO-pytorch/pull/22/files#diff-91e76e2663878e2a72d63398db46c8fa835e402b4f44c0c87010b48f790fc021R320 thank you for all this!
@mitchellgordon95 who are you training RETRO for btw? are you working at latitude games? not still at Uber I hope lol
Yeah I'm at Latitude. It's not a priority project, but I've used my last 3 hackathons to work on it lol
@lucidrains @mitchellgordon95 hi gang! I have been working on a training loop for this code, which you can see here: We are doing something atypical, and may need to fork and reimplement both of your implementations. I am aware of the minimum requirements of the license, but I wanted to ask: would you take offense? We will reference the upstreams that generated our ideas (i.e. this repo and this PR). We may not submit the work upstream, as it would be a fresh start, although our code is licensed with ASL2. Also, do either of you have any examples of inferencing, or use cases where this type of retroformer is of use? Thank you,
Hello, I have done a basic benchmark of training. If you can provide other benchmark suggestions, happy to provide A/B/A comparisons and report in this PR.

Overview: Artificial Wisdom™ Retrieval Transformer benchmark results

System under test:
Baseline wallclock:
Observation:
PR rework wallclock:
Observation:
My general observation is that faiss does not appear to use compute on the GPUs, only memory. There may be a defect in our build, or in the implementation. cc @rstarmer @MostAwesomeDude. Thank you,
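One way to narrow that down is to check whether the faiss build is GPU-enabled and whether the index object was actually moved onto a device, since a CPU index allocates host memory but never runs device kernels. This is a generic diagnostic sketch, not code from this PR:

```python
import faiss

# A CPU-only faiss build reports 0 GPUs (or lacks the GPU symbols entirely).
print("visible GPUs:", faiss.get_num_gpus())

# Only indexes explicitly copied to a device do GPU compute; searching
# the original CPU index uses host cores regardless of what's installed.
cpu_index = faiss.IndexFlatL2(768)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)
print(type(cpu_index).__name__, "->", type(gpu_index).__name__)
```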
yes absolutely! the giving is unconditional, thus MIT
have you seen Nvidia's follow-up Retro2 paper yet?
@lucidrains I haven't, if you can share or have the title, I would love to see it! Robert found [Megatron RETRO](https://github.com/NVIDIA/Megatron-LM#retro), although I don't know if this is what you were referencing. We are building a library that composes the following things:
You pick 1 from each of the three categories, and can use them in composition. Unlike langchain, our work is designed around using shared memory for API communication instead of HTTPS, i.e. more like a monolithic kernel. Would love to have further suggestions for the idea proposed here.

I understand that if the MIT license requirements are met, then the software is licensed with those terms. As an open source dev, as long as someone using software I wrote complied with the terms, I was always good with however they used it. What I am asking is a little different: if I were to use your code as a reference for A/B/A testing, and also to learn from, but didn't integrate your library, would you take offense? I never did, but I was, and am, all in on open source. Many don't understand the finer mechanics of ASL/MIT/BSD-revised, and do take offense. It sounds like you are very experienced in this area, which is awesome! Your code is an invaluable resource to the engineering community. Thank you for your gifts. Thank you,
I did notice, after switching to a larger dataset, specifically https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample, that the act of building the embedding indexes is "different", and possibly faster. I will publish an updated comparison when I have one to give.
thanks for the kind words! you are free to use this repository however you wish, no conditions. https://arxiv.org/abs/2304.06762 is the paper. I haven't gone through it, but the author apparently found some further simplifications. what they have in the Megatron repo should be this Retro++
Hey Lucid,
I've been working on scaling the DB up to contain the whole Pile in my free time. En route to this, I've made a few changes that you might be interested in merging:
- Parallelized `chunks_to_embeddings_` by adding a "worker_id" param
- `chunks_to_precalculated_knn_` should be able to reuse pre-computed embeds etc.
- A `BertEmbeds` class that supports memmap'd files > 16 TB (the max file size on ext4)

Index creation at scale isn't really tested (but I have tested it at smaller scales). I'm running embedding at scale right now and I think it works. Anyway, I don't really expect you to just merge this but figured I'd mention it before I get too far off master.
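As a rough illustration of the >16 TB workaround: split one logical embedding matrix across several memmap'd files so no single file hits the ext4 limit. The class below is a hypothetical sketch of that sharding idea, not the PR's actual `BertEmbeds` implementation:

```python
import numpy as np

class ShardedEmbeds:
    """Hypothetical sketch: one logical (num_rows, dim) float32 array
    split across several memmap files, each kept under a size cap."""

    def __init__(self, path_prefix, num_rows, dim, rows_per_shard):
        self.dim = dim
        self.rows_per_shard = rows_per_shard
        self.shards = []
        for i, start in enumerate(range(0, num_rows, rows_per_shard)):
            rows = min(rows_per_shard, num_rows - start)
            # mode="w+" creates (or overwrites) each shard file on disk.
            self.shards.append(np.memmap(
                f"{path_prefix}.{i}.dat", dtype=np.float32,
                mode="w+", shape=(rows, dim)))

    def __getitem__(self, row):
        # Map a global row index to (shard, offset within shard).
        shard, offset = divmod(row, self.rows_per_shard)
        return self.shards[shard][offset]

    def __setitem__(self, row, vec):
        shard, offset = divmod(row, self.rows_per_shard)
        self.shards[shard][offset] = vec
```

Reads and writes go through the OS page cache, so the working set stays bounded even when the logical array is far larger than RAM.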