
distributed training #12

Open · wants to merge 47 commits into main

Conversation

@kolia commented on Jul 20, 2020

This is the companion PR to a beacon-internal project.

It defines a DistributedClassifier with an implementation of train! that sends batch specs (whatever information a worker needs to determine its next batch) to workers, where losses and gradients are computed and sent back. The driver node sums the gradients from all workers and then performs the parameter update. This is a purely synchronous distributed training loop: workers always run against the latest model parameters, so the model converges and performs the same as it would if trained locally (asynchronous schemes, where workers often run with stale parameters, offer no such guarantee).
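Below is a minimal sketch of what one synchronous step of such a loop can look like, using Julia's Distributed standard library and Flux. The names (`synchronous_step!`, `materialize_batch`, `lossfn`) are hypothetical and this is not the PR's actual code; it only illustrates the compute-on-workers, sum-and-update-on-driver pattern described above.

```julia
using Distributed, Flux

@everywhere using Flux

# Hypothetical stand-in: in practice a worker would turn a batch spec into
# actual data itself; here we pretend the spec already is the (x, y) batch.
@everywhere materialize_batch(spec) = spec

function synchronous_step!(model, opt, lossfn, batch_specs)
    ps = Flux.params(model)
    # Each worker gets the current model (captured by the closure), computes
    # its local gradient, and returns it ordered like its own params.
    worker_grads = pmap(batch_specs) do spec
        x, y = materialize_batch(spec)
        local_ps = Flux.params(model)
        gs = gradient(() -> lossfn(model(x), y), local_ps)
        [gs[p] for p in local_ps]
    end
    # The driver sums gradients across workers and applies a single update,
    # so every worker starts the next step from identical, up-to-date
    # parameters (assuming every parameter receives a gradient).
    summed = reduce((a, b) -> [ga .+ gb for (ga, gb) in zip(a, b)], worker_grads)
    for (p, g) in zip(ps, summed)
        Flux.Optimise.update!(opt, p, g)
    end
    return model
end
```

Because the update happens only after every worker has reported back, a slow worker stalls the whole step; that is the price of the convergence guarantee mentioned above.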

It also defines a DistributedLogger that lets workers send logs back to the driver node. This hasn't been tested beyond checking that it doesn't barf.
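For context, here is one way such worker-to-driver logging can be wired up with a RemoteChannel; the actual DistributedLogger in this PR may work differently, and `LOG_CHANNEL`, `worker_log`, and `drain_worker_logs` are made-up names for illustration.

```julia
using Distributed, Logging

@everywhere using Distributed

# Channel owned by the driver (process 1); workers put log records into it.
const LOG_CHANNEL = RemoteChannel(() -> Channel{Tuple{Int,String}}(1024), 1)

# Called on workers: tag the message with the worker id and ship it home.
@everywhere worker_log(chan, msg::AbstractString) = put!(chan, (myid(), msg))

# Driver-side task that drains the channel into the local logger.
function drain_worker_logs(chan)
    @async while true
        id, msg = take!(chan)
        @info "worker $id: $msg"
    end
end
```

A worker in the training loop above could then call `worker_log(LOG_CHANNEL, "finished batch")` and the message would surface in the driver's logs.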

It also defines some utilities in distributed/ that are not specific to Lighthouse or Flux and should eventually be moved elsewhere.

@jrevels self-requested a review on July 20, 2020