
historical_indexer runs out of memory and must start from beginning #2

azigler opened this issue Jul 6, 2023 · 8 comments

@azigler

azigler commented Jul 6, 2023

Hi @redsolver -- making a new issue so we can stay organized. 🚀

My machine:

2 GB Memory
1 vCPU
25 GB Disk + 30 GB mounted
Ubuntu 22.04 (LTS) x64

If I run this script, CPU immediately hits 100% (understandable, since this is a very weak machine) and memory slowly climbs to 100% over the course of ~1 hour before hitting the ceiling and the machine killing the PID. It does manage to count all the repos and then start downloading them, and the script works: I can confirm SurrealDB stores the blocks. When the machine kills it due to lack of RAM, I get a Process killed message in my terminal and the memory is released.

If I start again, it starts over from the very beginning, not where it left off. This means that unless I have sufficient RAM, I can't get the whole historical index. Again, that's understandable; this is a super weak machine just for testing. But do you have a recommended spec to run this on, so I can use the script?

@redsolver
Contributor

I just added a progress cursor to the historical indexer, so if you restart it, it should remember where it was and continue from there instead of starting over from the beginning. It's still experimental: it needs to run for at least 5 minutes until the cursor is saved, to prevent skipping fields because of concurrency.
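
For readers who want to see the general idea, here is a minimal Rust sketch of that kind of progress cursor: the last processed position is written to a small file at an interval and read back on startup so indexing resumes from there. The file name, interval, and cursor format are assumptions for illustration, not the indexer's actual implementation.

```rust
use std::fs;
use std::io::Write;
use std::time::{Duration, Instant};

const CURSOR_FILE: &str = "indexer_cursor.txt"; // hypothetical file name
const SAVE_INTERVAL: Duration = Duration::from_secs(300); // save at most every 5 minutes

/// Load the last saved cursor (here: an index into the repo list), or 0 if none exists.
fn load_cursor() -> usize {
    fs::read_to_string(CURSOR_FILE)
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

/// Persist the cursor atomically: write to a temp file, then rename over the old one,
/// so a crash mid-write never leaves a corrupt cursor file.
fn save_cursor(position: usize) -> std::io::Result<()> {
    let tmp = format!("{CURSOR_FILE}.tmp");
    let mut f = fs::File::create(&tmp)?;
    writeln!(f, "{position}")?;
    f.sync_all()?;
    fs::rename(&tmp, CURSOR_FILE)
}

fn main() -> std::io::Result<()> {
    // Placeholder repo list; a real indexer would fetch this from the network.
    let repos: Vec<String> = (0..1000).map(|i| format!("did:plc:example{i}")).collect();
    let mut last_save = Instant::now();

    for position in load_cursor()..repos.len() {
        // ... download and index repos[position] here ...
        if last_save.elapsed() >= SAVE_INTERVAL {
            save_cursor(position)?;
            last_save = Instant::now();
        }
    }
    save_cursor(repos.len())?; // mark the run as complete
    Ok(())
}
```

On a restart the loop begins at the saved position rather than zero, which matches the resume behaviour described above.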

@azigler
Author

azigler commented Jul 18, 2023

Thanks @redsolver! I tried this out and it does seem to pick back up from the cursor, very cool. I see you're writing the cursor to a file, so I could hypothetically update it by hand to whatever was the last thing it saw before getting killed. It consistently crashes on a 60 MB repository download, though, so I think I'll ultimately have to try your historical indexer on a different machine.

@redsolver
Contributor

I could add a max repo size option that skips repos above a specific size, but that would of course cause an incomplete index.
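
As a rough illustration of what such an option could look like (the limit, field names, and repo type here are assumptions, not the indexer's actual code; it also assumes the repo size is known before download, e.g. from a Content-Length header, whereas in practice the check might have to happen while streaming):

```rust
/// Hypothetical description of a repo as reported before download.
struct RepoInfo {
    did: String,
    size_bytes: u64,
}

/// Skip repos larger than `max_repo_bytes`; return the DIDs that were skipped
/// so the operator can index them later on a bigger machine.
fn filter_by_size(repos: &[RepoInfo], max_repo_bytes: u64) -> (Vec<&RepoInfo>, Vec<&str>) {
    let mut keep = Vec::new();
    let mut skipped = Vec::new();
    for repo in repos {
        if repo.size_bytes <= max_repo_bytes {
            keep.push(repo);
        } else {
            skipped.push(repo.did.as_str());
        }
    }
    (keep, skipped)
}

fn main() {
    let repos = vec![
        RepoInfo { did: "did:plc:small".into(), size_bytes: 2 * 1024 * 1024 },
        RepoInfo { did: "did:plc:huge".into(), size_bytes: 60 * 1024 * 1024 },
    ];
    // Example limit of 32 MiB; anything above it is recorded instead of downloaded.
    let (keep, skipped) = filter_by_size(&repos, 32 * 1024 * 1024);
    println!("indexing {} repos, skipped {:?}", keep.len(), skipped);
}
```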

@azigler
Author

azigler commented Jul 18, 2023

I could add a max repo size option that skips repos above a specific size, but that would of course cause an incomplete index.

I don't think that would ultimately be useful here, and the bottleneck seems to be the RAM on this particular machine. I'd be interested to know at what RAM threshold you/others have success with the script, so I can try to replicate it.

@redsolver
Contributor

I recently re-indexed the entire historical repo data on a new server (128 GB RAM), and it's almost impossible to do because there seem to be some memory leaks. I did a lot of manual workarounds and small changes during indexing to get it to work, but the current implementation is pretty broken. So the historical indexer will likely need a rewrite in Rust to work correctly again without constant manual intervention.
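
Not redsolver's plan, but as a sketch of why a rewrite can help with memory behaviour: processing each downloaded repo as a bounded stream, instead of holding the whole thing in RAM, keeps peak memory at roughly the buffer size regardless of repo size. Everything here (file path, buffer size, the `index_chunk` stub) is a hypothetical illustration.

```rust
use std::fs::File;
use std::io::{BufReader, Read};

/// Hypothetical per-chunk handler; a real indexer would parse blocks here
/// and write them to the database as they arrive.
fn index_chunk(chunk: &[u8]) {
    let _ = chunk.len();
}

/// Stream a (possibly very large) repo file through a fixed 1 MiB buffer,
/// so peak memory stays near 1 MiB even for a 60 MB+ repo.
fn index_repo_streaming(path: &str) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = vec![0u8; 1024 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        index_chunk(&buf[..n]);
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical path to a downloaded repo export.
    index_repo_streaming("repo_example.car")
}
```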

@redsolver
Contributor

The best short-term solution would be to share DB dumps of the entire historical data so not all users need to index everything again. At the moment it's 50 GB for all historical data.

@azigler
Author

azigler commented Aug 24, 2023

The best short-term solution would be to share DB dumps of the entire historical data so not all users need to index everything again. At the moment it's 50 GB for all historical data.

I agree; I think sharing checkpoints might work well. Does today's atproto blog post impact how this would work?

@redsolver
Contributor

The changes in repository structure might make it easier to sync the historical data, because there's likely less of it. But for now I'll focus on a robust backup/snapshot solution for my database format, which can then be used to bootstrap new third-party instances quickly.
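
To make the bootstrap idea concrete, here is a rough Rust sketch of the consumer side: download a published snapshot, verify its checksum, and only then load it into the local database. The URL, file names, and expected hash are placeholders, this is not redsolver's snapshot format, and it assumes the `reqwest` crate (with the "blocking" feature) and the `sha2` crate.

```rust
use sha2::{Digest, Sha256};
use std::fs::File;
use std::io::{BufReader, Read};

/// Compute the SHA-256 of a file without loading it fully into memory.
fn sha256_of_file(path: &str) -> std::io::Result<String> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut hasher = Sha256::new();
    let mut buf = vec![0u8; 1024 * 1024];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().iter().map(|b| format!("{b:02x}")).collect())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder values; a real deployment would publish these alongside the snapshot.
    let snapshot_url = "https://example.com/historical-snapshot.db";
    let expected_sha256 = "0000000000000000000000000000000000000000000000000000000000000000";
    let local_path = "historical-snapshot.db";

    // Download the snapshot to disk (blocking client for simplicity).
    let mut response = reqwest::blocking::get(snapshot_url)?;
    let mut out = File::create(local_path)?;
    response.copy_to(&mut out)?;

    // Refuse to use a snapshot whose checksum doesn't match the published one.
    let actual = sha256_of_file(local_path)?;
    if actual != expected_sha256 {
        return Err(format!("checksum mismatch: got {actual}").into());
    }

    // At this point the snapshot could be imported into the local database
    // (e.g. with SurrealDB's import tooling) instead of re-indexing everything.
    println!("snapshot verified, ready to import");
    Ok(())
}
```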
