
historical_indexer runs out of memory and must start from beginning #2

azigler opened this issue Jul 6, 2023 · 8 comments

@azigler

azigler commented Jul 6, 2023

Hi @redsolver -- making a new issue so we can stay organized. 🚀

My machine:

2 GB Memory
1 vCPU
25 GB Disk + 30 GB mounted
Ubuntu 22.04 (LTS) x64

If I run this script, CPU immediately hits 100% (understandable, since this is a very weak machine) and memory slowly climbs to 100% over the course of ~1 hour before hitting the ceiling and the machine killing the PID. It does manage to count all the repos and then start downloading them, and the script works: I can confirm SurrealDB stores the blocks. When the machine kills it due to lack of RAM, I get a Process killed message in my terminal and the memory is released.

If I start again, it starts over from the very beginning, not where it left off. This means that unless I have sufficient RAM, I can't get the whole historical index. Again, that's understandable; this is a super weak machine just for testing. But do you have a recommended spec to run this on, so I can use the script?

@redsolver
Contributor

I just added a progress cursor to the historical indexer, so if you restart it, it should remember where it was and continue from there instead of starting over from the beginning. It's still experimental: it needs to run for at least 5 minutes until the cursor is saved, to prevent skipping fields because of concurrency.
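
For readers who want to see the general idea, here is a minimal Rust sketch of that kind of progress cursor: the last processed position is written to a small file at an interval and read back on startup so indexing resumes from there. The file name, interval, and cursor format are assumptions for illustration, not the indexer's actual implementation.

```rust
use std::fs;
use std::io::Write;
use std::time::{Duration, Instant};

const CURSOR_FILE: &str = "indexer_cursor.txt"; // hypothetical file name
const SAVE_INTERVAL: Duration = Duration::from_secs(300); // save at most every 5 minutes

/// Load the last saved cursor (here: an index into the repo list), or 0 if none exists.
fn load_cursor() -> usize {
    fs::read_to_string(CURSOR_FILE)
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

/// Persist the cursor atomically: write to a temp file, then rename over the old one,
/// so a crash mid-write never leaves a corrupt cursor file.
fn save_cursor(position: usize) -> std::io::Result<()> {
    let tmp = format!("{CURSOR_FILE}.tmp");
    let mut f = fs::File::create(&tmp)?;
    writeln!(f, "{position}")?;
    f.sync_all()?;
    fs::rename(&tmp, CURSOR_FILE)
}

fn main() -> std::io::Result<()> {
    // Placeholder repo list; a real indexer would fetch this from the network.
    let repos: Vec<String> = (0..1000).map(|i| format!("did:plc:example{i}")).collect();
    let mut last_save = Instant::now();

    for position in load_cursor()..repos.len() {
        // ... download and index repos[position] here ...
        if last_save.elapsed() >= SAVE_INTERVAL {
            save_cursor(position)?;
            last_save = Instant::now();
        }
    }
    save_cursor(repos.len())?; // mark the run as complete
    Ok(())
}
```

On a restart the loop begins at the saved position rather than zero, which matches the resume behaviour described above.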

@azigler
Author

azigler commented Jul 18, 2023

Thanks @redsolver! I tried this out and it does seem to pick back up from the cursor, very cool. I see you're writing the cursor to a file, so I could hypothetically update it by hand to whatever was the last thing it saw before getting killed. It consistently crashes on a 60 MB repository download, though, so I think I'll ultimately have to try your historical indexer on a different machine.

@redsolver
Contributor

I could add a max repo size option that skips repos above a specific size, but that would of course cause an incomplete index.
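
As a rough illustration of what such an option could look like (the limit, field names, and repo type here are assumptions, not the indexer's actual code; it also assumes the repo size is known before download, e.g. from a Content-Length header, whereas in practice the check might have to happen while streaming):

```rust
/// Hypothetical description of a repo as reported before download.
struct RepoInfo {
    did: String,
    size_bytes: u64,
}

/// Skip repos larger than `max_repo_bytes`; return the DIDs that were skipped
/// so the operator can index them later on a bigger machine.
fn filter_by_size(repos: &[RepoInfo], max_repo_bytes: u64) -> (Vec<&RepoInfo>, Vec<&str>) {
    let mut keep = Vec::new();
    let mut skipped = Vec::new();
    for repo in repos {
        if repo.size_bytes <= max_repo_bytes {
            keep.push(repo);
        } else {
            skipped.push(repo.did.as_str());
        }
    }
    (keep, skipped)
}

fn main() {
    let repos = vec![
        RepoInfo { did: "did:plc:small".into(), size_bytes: 2 * 1024 * 1024 },
        RepoInfo { did: "did:plc:huge".into(), size_bytes: 60 * 1024 * 1024 },
    ];
    // Example limit of 32 MiB; anything above it is recorded instead of downloaded.
    let (keep, skipped) = filter_by_size(&repos, 32 * 1024 * 1024);
    println!("indexing {} repos, skipped {:?}", keep.len(), skipped);
}
```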

@azigler
Author

azigler commented Jul 18, 2023

I could add a max repo size option that skips repos above a specific size, but that would of course cause an incomplete index.

I don't think that would ultimately be useful here, and the bottleneck seems to be the RAM on this particular machine. I'd be interested to know at what RAM threshold you/others have success with the script, so I can try to replicate it.

@redsolver
Contributor

I recently re-indexed the entire historical repo data on a new server (128 GB RAM), and it's almost impossible to do because there seem to be some memory leaks. I did a lot of manual workarounds and small changes during indexing to get it to work, but the current implementation is pretty broken. So the historical indexer will likely need a rewrite in Rust to work correctly again without constant manual intervention.
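
Not redsolver's plan, but as a sketch of why a rewrite can help with memory behaviour: processing each downloaded repo as a bounded stream, instead of holding the whole thing in RAM, keeps peak memory at roughly the buffer size regardless of repo size. Everything here (file path, buffer size, the `index_chunk` stub) is a hypothetical illustration.

```rust
use std::fs::File;
use std::io::{BufReader, Read};

/// Hypothetical per-chunk handler; a real indexer would parse blocks here
/// and write them to the database as they arrive.
fn index_chunk(chunk: &[u8]) {
    let _ = chunk.len();
}

/// Stream a (possibly very large) repo file through a fixed 1 MiB buffer,
/// so peak memory stays near 1 MiB even for a 60 MB+ repo.
fn index_repo_streaming(path: &str) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = vec![0u8; 1024 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        index_chunk(&buf[..n]);
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical path to a downloaded repo export.
    index_repo_streaming("repo_example.car")
}
```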

@redsolver
Contributor

The best short-term solution would be to share DB dumps of the entire historical data so not all users need to index everything again. At the moment it's 50 GB for all historical data.

@azigler
Author

azigler commented Aug 24, 2023

The best short-term solution would be to share DB dumps of the entire historical data so not all users need to index everything again. At the moment it's 50 GB for all historical data.

I agree; I think sharing checkpoints might work well. Does today's atproto blog post impact how this would work?

@redsolver
Contributor

The changes in repository structure might make it easier to sync the historical data, because there's likely less of it. But for now I'll focus on a robust backup/snapshot solution for my database format, which can then be used to bootstrap new third-party instances quickly.
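
To make the bootstrap idea concrete, here is a rough Rust sketch of the consumer side: download a published snapshot, verify its checksum, and only then load it into the local database. The URL, file names, and expected hash are placeholders, this is not redsolver's snapshot format, and it assumes the `reqwest` crate (with the "blocking" feature) and the `sha2` crate.

```rust
use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::{BufReader, Read};

/// Compute the SHA-256 of a file without loading it fully into memory.
fn sha256_of_file(path: &str) -> std::io::Result<String> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut hasher = Sha256::new();
    let mut buf = vec![0u8; 1024 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().iter().map(|b| format!("{b:02x}")).collect())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder values; a real deployment would publish these alongside the snapshot.
    let snapshot_url = "https://example.com/historical-snapshot.db";
    let expected_sha256 = "0000000000000000000000000000000000000000000000000000000000000000";
    let local_path = "historical-snapshot.db";

    // Download the snapshot to disk (blocking client for simplicity).
    let mut response = reqwest::blocking::get(snapshot_url)?;
    let mut out = File::create(local_path)?;
    response.copy_to(&mut out)?;

    // Refuse to use a snapshot whose checksum doesn't match the published one.
    let actual = sha256_of_file(local_path)?;
    if actual != expected_sha256 {
        return Err(format!("checksum mismatch: got {actual}").into());
    }

    // At this point the snapshot could be imported into the local database
    // (e.g. with SurrealDB's import tooling) instead of re-indexing everything.
    println!("snapshot verified, ready to import");
    Ok(())
}
```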
