In #161, the weight initialization was restructured and extended, mainly for the GPT2 model.
For CoCa, there are the following limitations:
- Only `plain` and `scaled` initializations can be selected, not `scaled_embed`.
- For the `scaled` initialization, the standard deviation has to be explicitly specified (e.g. `0.02`); the `auto` option is available only for GPT2. Note that implementing the `auto` option might be a bit more involved for CoCa, since the model has two decoders (text and multimodal) that can have different hidden dimensions, by which the standard deviation of the weight initialization is scaled in `auto` mode.
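For illustration, an `auto` mode for CoCa would have to derive a separate std per decoder from its hidden dimension. A minimal sketch, assuming a common dimension-dependent formula (`sqrt(2 / (5 * d_model))`, one popular choice for small init; the actual formula used for GPT2 in #161 may differ, and `auto_std` is a hypothetical helper name):

```python
import math

def auto_std(hidden_dim: int) -> float:
    # One common dimension-scaled choice: std = sqrt(2 / (5 * d_model)).
    # Hypothetical helper; the exact formula from #161 may differ.
    return math.sqrt(2.0 / (5.0 * hidden_dim))

# CoCa's two decoders can have different hidden dimensions, so "auto"
# must yield a separate std for each of them:
text_std = auto_std(768)         # text decoder
multimodal_std = auto_std(1024)  # multimodal decoder
```

The point is only that, unlike GPT2, a single `auto`-derived std would not suffice: the wider decoder needs the smaller std.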
Potential issues:
- The `scaled` initialization might not be implemented as intended. In GPT2, we follow the common practice (see e.g. nanoGPT or https://arxiv.org/abs/2312.16903) of scaling both the projection matrices W0 (attention) and W2 (FFN) by the number of layers. However, in our current CoCa implementation, only W0 is scaled this way.
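Concretely, the GPT2-style convention (as in nanoGPT) divides the base std by `sqrt(2 * n_layers)` for every matrix that writes to the residual stream, i.e. for both W0 and W2. A minimal sketch of that rule (the parameter-name suffixes are illustrative, not the actual module names in this repo):

```python
import math

def init_std(param_name: str, base_std: float, n_layers: int) -> float:
    """Return the init std for `param_name` under GPT2-style scaled init.

    Both the attention output projection (W0) and the second FFN matrix
    (W2) feed the residual stream, so each block contributes two residual
    writes and both matrices get base_std / sqrt(2 * n_layers). Scaling
    only W0 (the current CoCa behavior) misses the W2 half of this rule.
    The name suffixes below are hypothetical placeholders.
    """
    residual_projs = ("attn.out_proj.weight", "mlp.w2.weight")
    if param_name.endswith(residual_projs):
        return base_std / math.sqrt(2 * n_layers)
    return base_std
```

For example, with `base_std=0.02` and 12 layers, both residual projections would be initialized with std `0.02 / sqrt(24)` (about `0.0041`), while all other weights keep `0.02`.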
Note: This issue should be addressed after #161 is merged.
Update: Only `plain` initialization with an explicitly specified standard deviation (e.g. `0.02`, not `auto`) is now allowed/implemented for CoCa; see #161.