
For SMoE, why does setting the number of experts N to 3 and the number of selected experts K to 2 ensure the computational cost is similar? #14

Open
zenghao-zh opened this issue Nov 22, 2024 · 0 comments


In the paper:

> For the MoE layers, we set the number of experts N to 32 for MoE-Dropout and SSD. MoE-Dropout linearly increases the number of selected experts K from 6 to 32 during the pre-training. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps. In the sparse mode, we also select 6 experts for each layer. The ratio of the sparse mode r is set to 0.5. The ratio of the final dense training l is set to 0.1. For SMoE, we set the number of experts N to 3 and the number of selected experts K to 2 to ensure the computational cost is similar to that of other methods.

Why does SSD select 6 experts out of 32, while SMoE selects only 2 out of 3? How does that keep the computational cost comparable?
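For context on why I am confused, here is a minimal back-of-the-envelope sketch (my own assumption, not from the paper) of how per-token MoE FLOPs are usually estimated: only the K selected experts run, so the cost scales with K times the cost of one expert FFN plus a small routing term. The sizes `d_model`, `d_ff`, and the 2-layer FFN expert structure below are hypothetical.

```python
def moe_flops_per_token(d_model: int, d_ff: int, num_experts: int, top_k: int) -> int:
    """Approximate forward FLOPs per token for one MoE layer.

    Assumes each expert is a standard 2-layer FFN (d_model -> d_ff -> d_model)
    and the router is a single d_model x num_experts projection (my assumption).
    """
    expert_flops = 2 * (d_model * d_ff) + 2 * (d_ff * d_model)  # two matmuls per expert
    router_flops = 2 * d_model * num_experts                    # routing projection
    return top_k * expert_flops + router_flops

# Hypothetical sizes, just to compare the two configurations from the quote.
d_model, d_ff = 768, 3072
ssd_sparse = moe_flops_per_token(d_model, d_ff, num_experts=32, top_k=6)  # SSD sparse mode: 6 of 32
smoe = moe_flops_per_token(d_model, d_ff, num_experts=3, top_k=2)         # SMoE: 2 of 3

print(f"SSD sparse mode (6 of 32): {ssd_sparse / 1e6:.1f} MFLOPs/token")
print(f"SMoE (2 of 3):            {smoe / 1e6:.1f} MFLOPs/token")
```

Under this estimate, with identical expert sizes the two settings differ by roughly 3x (K = 6 vs K = 2), so presumably the costs only line up if the expert width or the total parameter budget differs between the setups. Could you clarify what makes them comparable?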
