Data format v3 #20

Open

andnp opened this issue Dec 1, 2023 · 0 comments

andnp (Owner) commented Dec 1, 2023
It's time to allow partial data to be stored in the results folder. Currently, we hold all data in memory (in the Collector object) until the end of the run, then dump it all into the results database at once. Doing this lets us avoid incomplete/partial states, but there are several drawbacks (the current pattern is sketched just after this list):

  • This costs a lot of memory, particularly for long-running experiments (e.g. continual learning experiments)
  • This puts a large, sudden burst of write pressure on the database at the end of a run. If many parallel processes do this at once, the chance of long lockups increases dramatically
  • The full Collector object currently needs to be checkpointed, which becomes very expensive as it fills up
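For reference, the current pattern looks roughly like this. It is a sketch only: the class and method names mirror the idea, not the actual Collector API, and the schema is a placeholder.

```python
import sqlite3

class Collector:
    '''Sketch of the current all-in-memory pattern (illustrative names/schema).'''
    def __init__(self):
        # every data point lives here until the run finishes
        self._data: list[tuple[str, int, float]] = []

    def collect(self, metric: str, step: int, value: float):
        self._data.append((metric, step, value))

    def dump(self, db_path: str):
        # one large write at the very end of the run: this is the sudden
        # write pressure described above
        con = sqlite3.connect(db_path)
        con.execute('CREATE TABLE IF NOT EXISTS results (metric TEXT, step INTEGER, value REAL)')
        con.executemany('INSERT INTO results VALUES (?, ?, ?)', self._data)
        con.commit()
        con.close()
```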

There are several challenges for the format. Some of these already exist today, and we have simply been ignoring them because the probability of hitting them is small; that is no longer tenable:

  • Data consistency. If a run is preempted, we need to ensure that restarting the run doesn't invalidate the data. This may mean deleting overlapping rows or adding other safeguards. Preferably this happens only once, for instance when the checkpoint loads (see the first sketch after this list).
  • Buffered writes. We don't want to hit the database frequently; that creates too much potential for lockups and too much disk activity (see the second sketch after this list).
  • Database lock handling. Concurrent writers need a sensible wait-and-retry strategy when the database is locked.
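For the consistency point, one possible safeguard is to prune any rows written after the checkpointed step, once, when the checkpoint loads. The `results` schema here is an assumption:

```python
import sqlite3

def restore_consistency(db_path: str, run_id: int, checkpoint_step: int):
    # Rows written after the checkpointed step belong to the preempted portion
    # of the run and would be duplicated when the run resumes, so delete them
    # once, at checkpoint-load time.
    con = sqlite3.connect(db_path)
    con.execute(
        'DELETE FROM results WHERE run_id = ? AND step > ?',
        (run_id, checkpoint_step),
    )
    con.commit()
    con.close()
```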
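And for buffered writes plus lock handling, a minimal sketch assuming a sqlite backend. The schema, flush threshold, and retry policy are all placeholders, not decisions made in this issue:

```python
import sqlite3
import time

class BufferedWriter:
    def __init__(self, db_path: str, flush_every: int = 10_000):
        self._db_path = db_path
        self._flush_every = flush_every
        self._buffer: list[tuple] = []

    def write(self, row: tuple):
        self._buffer.append(row)
        # only touch the database once the buffer is large enough
        if len(self._buffer) >= self._flush_every:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # the timeout makes sqlite wait on a locked database instead of
        # failing immediately
        con = sqlite3.connect(self._db_path, timeout=30.0)
        try:
            for attempt in range(5):
                try:
                    con.executemany('INSERT INTO results VALUES (?, ?, ?)', self._buffer)
                    con.commit()
                    self._buffer.clear()
                    return
                except sqlite3.OperationalError:
                    # database is locked; back off and retry
                    time.sleep(2 ** attempt)
        finally:
            con.close()
```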

Some concrete implementation notes:

  • Experiment description -> experiment metadata -> experiment. The description should be used exactly once, to construct the metadata in a db. From then on, the experiment runs from that db only. This allows us to synchronize states and retain consistency. It also makes the path for computed .py experiment descriptions much simpler (see the first sketch after this list).
    • Need to handle synchronizing the metadata db across local and server machines. Possibly embedding it into the results database is sufficient.
  • Use some tmp location (e.g. SLURM_TMPDIR) for the partial results db, then synchronize that with the final results db. This should alleviate write pressure and make buffering less important (second sketch below).
  • Have db writes occur asynchronously in a background thread (third sketch below).
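A sketch of the description -> metadata -> experiment flow. The schema and the sweep expansion are stand-ins for whatever the real description format produces:

```python
import json
import sqlite3
from itertools import product

def sweep(description: dict):
    # stand-in for the real expansion of an experiment description:
    # a cartesian product over any list-valued fields
    keys = list(description)
    values = [v if isinstance(v, list) else [v] for v in description.values()]
    for combo in product(*values):
        yield dict(zip(keys, combo))

def hydrate_metadata(description: dict, db_path: str):
    # done exactly once; afterwards the experiment reads only from the db
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS metadata (config_id INTEGER PRIMARY KEY, config TEXT)')
    con.executemany(
        'INSERT INTO metadata (config) VALUES (?)',
        ((json.dumps(c),) for c in sweep(description)),
    )
    con.commit()
    con.close()

def load_config(db_path: str, config_id: int) -> dict:
    con = sqlite3.connect(db_path)
    row = con.execute('SELECT config FROM metadata WHERE config_id = ?', (config_id,)).fetchone()
    con.close()
    return json.loads(row[0])
```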
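For the tmp-location idea, sqlite's ATTACH makes the final synchronization a single bulk copy. SLURM_TMPDIR comes from the issue text; the table names are assumptions:

```python
import os
import sqlite3

def merge_partial_results(final_db: str):
    # during the run, all writes go to a node-local partial db; at the end
    # of the run, merge it into the shared results db in one transaction
    tmp_db = os.path.join(os.environ['SLURM_TMPDIR'], 'partial.db')
    con = sqlite3.connect(final_db, timeout=300.0)
    con.execute('ATTACH DATABASE ? AS partial', (tmp_db,))
    con.execute('INSERT INTO results SELECT * FROM partial.results')
    con.commit()
    con.execute('DETACH DATABASE partial')
    con.close()
```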
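And a sketch of the background-thread writer, which would sit in front of something like the BufferedWriter above so the main loop never blocks on the db:

```python
import queue
import threading

class AsyncWriter:
    def __init__(self, writer):
        # writer is anything with write(row) and flush(), e.g. the
        # BufferedWriter sketched earlier
        self._writer = writer
        self._q: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, row: tuple):
        # called from the main loop; never blocks on the database
        self._q.put(row)

    def close(self):
        self._q.put(None)  # sentinel: no more data
        self._thread.join()

    def _drain(self):
        while True:
            row = self._q.get()
            if row is None:
                break
            self._writer.write(row)
        self._writer.flush()
```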