In #161, the weight initialization was restructured and extended, mainly for the GPT2 model.
For CoCa, there are the following limitations:
- Only `plain` and `scaled` initializations can be selected, not `scaled_embed`.
- For the `scaled` initialization, the standard deviation has to be explicitly specified (e.g. `0.02`); the `auto` option is available only for GPT2. Note that implementing the `auto` option might be a bit more involved for CoCa, since the model has two decoders (text and multimodal) that can have different hidden dimensions, by which the standard deviation of the weight initialization is scaled in `auto` mode.
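For illustration, an `auto` mode for CoCa would have to derive a separate std per decoder from its hidden dimension. A minimal sketch, assuming a common dimension-dependent formula (`sqrt(2 / (5 * d_model))`, one popular choice for small init; the actual formula used for GPT2 in #161 may differ, and `auto_std` is a hypothetical helper name):

```python
import math

def auto_std(hidden_dim: int) -> float:
    # One common dimension-scaled choice: std = sqrt(2 / (5 * d_model)).
    # Hypothetical helper; the exact formula from #161 may differ.
    return math.sqrt(2.0 / (5.0 * hidden_dim))

# CoCa's two decoders can have different hidden dimensions, so "auto"
# must yield a separate std for each of them:
text_std = auto_std(768)         # text decoder
multimodal_std = auto_std(1024)  # multimodal decoder
```

The point is only that, unlike GPT2, a single `auto`-derived std would not suffice: the wider decoder needs the smaller std.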
Potential issues:
- The `scaled` initialization might not be implemented as intended. In GPT2, we follow the common practice (see e.g. nanoGPT or https://arxiv.org/abs/2312.16903) of scaling both the projection matrices W0 (attention) and W2 (FFN) by the number of layers. However, in our current CoCa implementation, only W0 is scaled this way.
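Concretely, the GPT2-style convention (as in nanoGPT) divides the base std by `sqrt(2 * n_layers)` for every matrix that writes to the residual stream, i.e. for both W0 and W2. A minimal sketch of that rule (the parameter-name suffixes are illustrative, not the actual module names in this repo):

```python
import math

def init_std(param_name: str, base_std: float, n_layers: int) -> float:
    """Return the init std for `param_name` under GPT2-style scaled init.

    Both the attention output projection (W0) and the second FFN matrix
    (W2) feed the residual stream, so each block contributes two residual
    writes and both matrices get base_std / sqrt(2 * n_layers). Scaling
    only W0 (the current CoCa behavior) misses the W2 half of this rule.
    The name suffixes below are hypothetical placeholders.
    """
    residual_projs = ("attn.out_proj.weight", "mlp.w2.weight")
    if param_name.endswith(residual_projs):
        return base_std / math.sqrt(2 * n_layers)
    return base_std
```

For example, with `base_std=0.02` and 12 layers, both residual projections would be initialized with std `0.02 / sqrt(24)` (about `0.0041`), while all other weights keep `0.02`.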
Note: This issue should be addressed after #161 is merged.
Update: Only `plain` initialization with an explicitly specified standard deviation (e.g. `0.02`, not `auto`) is now allowed/implemented for CoCa; see #161.