In the paper:

``For the MoE layers, we set the number of experts N to 32 for MoE-Dropout and SSD. MoE-Dropout linearly increases the number of selected experts K from 6 to 32 during the pre-training. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps. In the sparse mode, we also select 6 experts for each layer. The ratio of the sparse mode r is set to 0.5. The ratio of the final dense training l is set to 0.1. For SMoE, we set the number of experts N to 3 and the number of selected experts K to 2 to ensure the computational cost is similar to that of other methods.''
Why does SSD select 6 experts out of 32, while SMoE selects only 2 out of 3?
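For context on what "selecting K of N experts" costs, here is a minimal sketch of top-k expert routing; the class, layer shapes, and FFN expert structure are my own assumptions for illustration and are not taken from the paper. The point of comparison is that per-token compute scales with the number of selected experts K, not with the pool size N.

```python
# Minimal top-k MoE routing sketch (illustrative only; names and shapes are assumptions).
# Per token, only the K routed experts run, regardless of how many experts N exist.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, k):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network over N experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        logits = self.router(x)                             # (tokens, N)
        weights, idx = torch.topk(logits, self.k, dim=-1)   # keep K of N experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # only the K selected experts run
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

# The two settings being compared in the question (hidden sizes are placeholders):
ssd_sparse = TopKMoE(d_model=512, d_ff=2048, num_experts=32, k=6)  # SSD sparse mode: 6 of 32
smoe       = TopKMoE(d_model=512, d_ff=2048, num_experts=3,  k=2)  # SMoE: 2 of 3
```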