distributed training #12

kolia · 2020-07-20T21:01:53Z

This is the companion PR to a beacon-internal project.

It defines a DistributedClassifier with an implementation of train! that sends batch specs (whatever information workers need to determine their next batch) to the workers, which compute losses and gradients and send them back. The driver node sums the gradients from all workers and then performs the parameter update. This is a purely synchronous distributed training loop: the workers always run with the latest version of the model, which ensures that the model converges and performs the same as it would if it were trained locally. (There are no such guarantees for asynchronous training schemes, where workers often run with stale model parameters.)
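To make the synchronous scheme concrete, here is a minimal sketch in Python rather than the PR's Julia, so it is not the implementation in this PR. It assumes a toy one-parameter model `y = w * x`, a hypothetical batch spec of the form `(start, stop)`, and threads standing in for remote workers; the names `make_batch`, `worker_step`, and `train` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def loss_and_gradient(w, batch):
    """Sum-of-squares loss and its gradient w.r.t. w for the model y = w * x."""
    loss = sum((w * x - y) ** 2 for x, y in batch)
    grad = sum(2 * (w * x - y) * x for x, y in batch)
    return loss, grad


def make_batch(spec):
    """Expand a hypothetical (start, stop) batch spec into data (true weight is 3.0)."""
    start, stop = spec
    return [(float(x), 3.0 * x) for x in range(start, stop)]


def worker_step(w, spec):
    # A worker receives the current parameters and a batch spec, not the data itself.
    return loss_and_gradient(w, make_batch(spec))


def train(specs_per_step, w=0.0, lr=0.001, n_workers=2):
    """Driver loop: one synchronous parameter update per list of batch specs."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for specs in specs_per_step:
            # Every worker gets the same, latest w; the driver waits for all results.
            results = list(pool.map(worker_step, [w] * len(specs), specs))
            # The driver sums the per-worker gradients, then updates the parameters.
            total_grad = sum(grad for _, grad in results)
            w -= lr * total_grad
    return w
```

Because the gradient of a summed loss is the sum of the per-shard gradients, `train([[(0, 4), (4, 8)]] * 50)` with two workers should agree, up to floating-point rounding, with `train([[(0, 8)]] * 50, n_workers=1)` on a single worker. That is the "same as local training" property described above.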

It also defines a DistributedLogger allowing workers to send back logs to the driver node. This hasn't been tested, beyond it not barfing.

It also defines some utilities in distributed/ that are not specific to Lighthouse or Flux, and which should be moved somewhere else eventually.

kolia added 4 commits July 19, 2020 01:50

Remove Zygote.Params arg from loss_and_gradient

d609538

Trying to see if this is the cause of serialization errors

Distributed train!

f92fdee

Bugfixes

88a14f9

Remove duplicate method defs.

ded6689

jrevels self-requested a review July 20, 2020 21:21

kolia added 25 commits July 21, 2020 21:27

Differentiate calls to loss_and_gradient better.

b403f9c

bugfix attempt

Bugfix attempt 2

3839b7a

Bugfix attempt 3

d19379e

Bugfix 4

c177e81

Remove duplicate method defs.

16877fc

Adding straggler method back.

28d02e0

buffered_batch_loader

954ace2

Align grads and params on their params orders.

2ee1666

Bugfix

4942e32

Bugfix

702852b

Bugfixes; occasional "CUDNNError: CUDNN_STATUS_NOT_INITIALIZED"

426a1c9

Updated Flux/Zygote/CuArrays->CUDA

81a590b

Much better. Fewer bugs/workarounds required with the newer version :)

export buffered_batch_loader

b78825b

add CUDA

11d7114

forgot to use CUDA

85db31d

Add Renormalizer optimiser

3cd605b

Send less over wire to avoid trouble.

b618117

Log __plot_data__

1ee66a2

Bugfixes to robust controller-based training loop.

baed8e1

Don't crash if cannot kill stalled worker.

2b7bb5a

Bump up timeout for worker returning a gradient.

ae527d9

Try not to get stuck in rmprocs()

59a4121

Merge remote-tracking branch 'origin' into ks/big_train

3ef496c

Merge remote-tracking branch 'origin/ks/big_train' into ks/big_train

c602ec1

Handle case where no worker returns a gradient.

5e68d56

kolia added 18 commits July 28, 2020 19:06

Typo bugfix

0c6c5d5

Whoups, nuked wrong one.

a3bf554

Remove probable blocker.

ee2fe7a

`rmprocs()` can hang beyond timeout.

Remove unresponsive workers

069daa4

Widen type of logger in loss_and_gradient

5fb0e53

Merge remote-tracking branch 'origin/ks/big_train' into ks/big_train

cf794a7

@show train_loss

5cf2d38

Tighten up memory consumption on master.

797767b

Merge branch 'ks/big_train' of github.com:beacon-biosignals/Lighthous…

ccfabd2

…eFlux.jl into ks/big_train

gpu_free_memory() util

3b10320

typo

810ea29

Preserve order / alignment.

44bdfd1

workers keep model between passes

82c2ee1

this matters because BatchNorm and other similar layers need to maintain state that is not learned by gradient descent

can't assign to _model?

7143b82

Checkpoint before picking subset for PR.

cc482b9

Bugfix: apply! needs to be in module Flux.Optimise

e35b553

Cleanup logger; it now lives in Lighthouse#general_logger

0ca38f6

avoid memory leak by calling GC on remote workers

6346ca6

kolia commented Jul 20, 2020 • edited by ericphanson
