fix diloco 2 #13
Conversation
Force-pushed from e5fc6a0 to 294c36d
src/zeroband/diloco.py
Outdated
""" | ||
|
||
self._logger.debug("sync inner model") | ||
for param_offloaded, param in zip(self.param_list_cpu, model.parameters()): | ||
param.data = param_offloaded.data.to("cuda") # todo: use copy_ here | ||
param.data.copy_(param_offloaded.data.to(param.device)) # todo: use copy_ here |
The `.to(device)` isn't needed here. You should be able to copy the data directly.
Interesting, so it will handle the move to device by itself?
My understanding is that `a.copy_(b)` copies the contents of `b` into `a` in place, so it works across devices and doesn't allocate a new intermediate storage.
Meanwhile, `param.copy_(param_offloaded.data.to(param.device))` will first evaluate the `.to`, which creates a new intermediate tensor, and then copies the contents. Might be wrong though, not sure how smart the compilers/interpreters are.
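A small CPU-only sketch of the in-place semantics being discussed (hypothetical tensors, not the PR's code; with CUDA available, the same `copy_` call also handles the host-to-device transfer):

```python
import torch

# copy_ writes into the destination's existing storage; a separate
# .to(other_device) would instead materialize a new intermediate tensor
# that copy_ then reads from.
src = torch.ones(4)
dst = torch.zeros(4)

ptr_before = dst.data_ptr()
dst.copy_(src)  # in-place: dst keeps its storage, contents now match src
assert dst.data_ptr() == ptr_before
assert torch.equal(dst, src)
```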
done
Force-pushed from 330bdd3 to 9d801d3
LFGTM!
FIX DILOCO LFG
This PR fixes the DiLoCo integration. The main bug was that the models were not initialized identically across DiLoCo ranks. We used to fix this by pulling the weight init from HF; we don't do this anymore because we no longer use transformers for the modeling.
This PR introduces a common seed. I checked, and the weights are now correctly initialized.
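A minimal sketch of the common-seed idea (function name and seed value are hypothetical, not the PR's actual code): seeding every rank with the same value before building the model makes the initial weights identical across DiLoCo ranks.

```python
import torch

COMMON_SEED = 42  # hypothetical value shared by all ranks

def build_model(seed: int = COMMON_SEED) -> torch.nn.Linear:
    torch.manual_seed(seed)  # same seed on every rank -> same init
    return torch.nn.Linear(8, 8)

# Two "ranks" building the model independently end up with equal weights.
m1, m2 = build_model(), build_model()
assert all(torch.equal(p, q) for p, q in zip(m1.parameters(), m2.parameters()))
```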
The loss is looking better now.
Additionally, the PR adds tests for the DiLoCo all-reduce, as well as some small fixes and enhancements.