
Limited and potentially incorrect weight initialization for CoCa model #165

Closed
flxst opened this issue Jun 26, 2024 · 1 comment
flxst commented Jun 26, 2024

In #161, the weight initialization was restructured and extended, mainly for the GPT2 model.

For CoCa, there are

  • limitations:

    • only plain and scaled initializations can be selected, not scaled_embed.
    • for the scaled initialization, the standard deviation has to be explicitly specified (e.g. 0.02); the auto option is available only for GPT2.
      Note that implementing the auto option might be a bit more involved for CoCa, as the model has two decoders (text and multimodal) which can have different hidden dimensions, by which the standard deviation of the weight initialization is scaled in auto mode.
  • potential issues:

    • The scaled initialization might not be implemented as intended. In GPT2, we follow the common practice (see e.g. nanoGPT or https://arxiv.org/abs/2312.16903) of scaling both the projection matrices W0 (attention) and W2 (ffn) with the number of layers. However, in our current CoCa implementation, only W0 is scaled like this.

Note: This issue should be addressed once #161 has been merged.
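For reference, the intended scaled initialization can be sketched roughly as below. This is a simplified sketch, not the actual modalities code; the parameter names `attn_out_proj` and `ffn_out_proj` (standing in for W0 and W2) are hypothetical and would need to be matched to the real CoCa module names:

```python
import math

import torch.nn as nn


def init_scaled(module: nn.Module, std: float, num_layers: int) -> None:
    """Sketch of the scaled initialization: weight matrices are drawn from
    N(0, std^2), except the residual projections W0 (attention) and W2 (ffn),
    whose std is divided by sqrt(2 * num_layers) as in nanoGPT."""
    scaled_std = std / math.sqrt(2 * num_layers)
    for name, param in module.named_parameters():
        if name.endswith("bias"):
            nn.init.zeros_(param)
        elif name.endswith(("attn_out_proj.weight", "ffn_out_proj.weight")):
            # Both W0 and W2 are scaled with the number of layers
            # (hypothetical names; in the current CoCa code only W0 gets this).
            nn.init.normal_(param, mean=0.0, std=scaled_std)
        elif param.dim() >= 2:
            nn.init.normal_(param, mean=0.0, std=std)
```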

flxst commented Jul 3, 2024

Update: Only the plain initialization with an explicitly specified standard deviation (e.g. 0.02, not auto) is now allowed/implemented for CoCa; see #161.
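The plain initialization that remains for CoCa amounts to roughly the following (a simplified sketch under stated assumptions, not the exact modalities code):

```python
import torch.nn as nn


def init_plain(module: nn.Module, std: float) -> None:
    """Sketch of the plain initialization: every weight matrix is drawn from
    N(0, std^2) with an explicitly given std (no auto option); biases are
    zeroed. All layers are treated the same, i.e. no per-layer scaling."""
    for name, param in module.named_parameters():
        if param.dim() >= 2:
            nn.init.normal_(param, mean=0.0, std=std)
        elif name.endswith("bias"):
            nn.init.zeros_(param)
```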
