Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a frequency parameter to TriggerWandbSyncLightningCallback #101

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

CedricLeon
Copy link
Contributor

In order to synchronize every N epoch instead of every epoch (by default), I added a frequency parameter to the Lightning callback.

I did not write any test for that feature, mainly because I am not sure how you proceed (but also because of lazyness 😅)
Tell me if you want me to add a test for that.

Copy link

semanticdiff-com bot commented Mar 22, 2024

Review changes with SemanticDiff.

Analyzed 3 of 3 files.

Overall, the semantic diff is 51% smaller than the GitHub diff.

Filename Status
✔️ src/wandb_osh/hooks.py 58.82% smaller
✔️ src/wandb_osh/lightning_hooks.py 26.87% smaller
✔️ src/wandb_osh/ray_hooks.py 74.45% smaller

@@ -41,6 +43,7 @@ def __init__(
"""
super().__init__()
self._hook = TriggerWandbSyncHook(communication_dir=communication_dir)
self.frequency = frequency
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you make this private? i.e., self._frequency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea

@klieret
Copy link
Owner

klieret commented Mar 22, 2024

I like this, this would be an alternate way of solving #83 and much simpler than implementing multiprocessing.

As for tests, I think this is fine, I'd say it's simple enough :)

@@ -24,11 +24,13 @@ class TriggerWandbSyncLightningCallback(pl.Callback):
def __init__(
self,
communication_dir: PathLike = _comm_default_dir,
frequency: int = 1,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a better name for this. Because this setting is basically 1/frequency, right? (the larger this value is, the less often we sync). So perhaps something like sync_every_n_epochs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with such a name but thought that frequency was shorter and still self-explanatory.
I will make the change to sync_every_n_epochs, it is more accurate

src/wandb_osh/lightning_hooks.py Outdated Show resolved Hide resolved
@klieret klieret linked an issue Mar 22, 2024 that may be closed by this pull request
@CedricLeon
Copy link
Contributor Author

CedricLeon commented Mar 25, 2024

I made the changes. Still without testing 😇
I am 100% confident it works for the Lightning hook (TriggerWandbSyncLightningCallback), because I use such a setup in my work. However, I never used Ray before and can't think of an easy way to test it. Can you have a look and tell me if I implemented something stupid?
Also, about #83, I think it will definitely help but won't solve the problem (I can see that, when I run big sweeps, with 20 or 40 runs in parallel 😅).

@CedricLeon
Copy link
Contributor Author

CedricLeon commented Mar 25, 2024

I just realized that the hook() should also be called after the end of the last epoch, in case people use a weird sync_every_n_epochs. I don't think this is implemented yet.
I say this in case such thing happen:

  • An run running for, say, 100, epochs but sync_every_n_epochs set to 3 will miss to synchronize the last epoch.
  • Or (maybe more classic), a scheduler killing the run epoch 51 when sync_every_n_epochs is set to 10 or even 5 or 2. In this case a few epoch won't be uploaded.

I see an easy fix for the Lightning callback; adding the on_test_epoch_end() hook with a call to TriggerWandbSyncHook that will happen 100% of the time (e.g., by setting current-epoch = sync_every_n_epochs, that's cheesy but that will work).
But, again, I have no clue about Ray 😅

@klieret
Copy link
Owner

klieret commented Mar 25, 2024

Oh you're 100% right, I should have thought about that as well. A totally different avenue of solution (that might also help with syncing hyperparameters of runs that get killed because of #100) would be that the wandb-osh syncer tool remembers which runs were syncing and if it doesn't get triggered for an hour, it runs one final sync call (and we can catch KeyboardInterrupt to make sure we also do the final round if people kill wandb-osh. Of course that's also just another silly workaround.

Let me think about that. We might in fact do both (and perhaps add a note to the ray one that it has this disadvantage).

@CedricLeon
Copy link
Contributor Author

That would be a pretty robust workaround indeed.
Should I add the on_test_epoch_end() fix I mentionned and we close this PR?
Or should I just add a Warning about this problem in the doc, and we fix it in another PR?

I dug into the problem I raised in #100 and this is deeper that just not syncing the config. i will detail what I found so far in the Issue.
Anyway, it would be handy for me to have the possibility to reduce the number of Triggers to investigate when the problem happens.

@CedricLeon
Copy link
Contributor Author

@klieret should I push this temporary fix?
Or do you want to solve it in this PR?

@klieret
Copy link
Owner

klieret commented Apr 1, 2024

Sorry for my late replies! Let me take a look now :)

@klieret
Copy link
Owner

klieret commented Apr 1, 2024

Could you give me push permission on the branch (I made some fixes and added some tests to this). You only have to click a checkbox on this PR, I think

@CedricLeon
Copy link
Contributor Author

Could you give me push permission on the branch (I made some fixes and added some tests to this). You only have to click a checkbox on this PR, I think

The box is checked, it seems to be checked by default btw

@CedricLeon
Copy link
Contributor Author

Hi @klieret, any update about this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wandb-osh cannot handle many runs at once
2 participants