Zipformer explanation #837
@danpovey @csukuangfj I was wondering if there is some documentation somewhere (similar to this one about the reworked Conformer) that explains the key differences between the Conformer and the Zipformer?
Sorry, I have been waiting until I can commit the more recent version of the zipformer, which has a lot more changes.

There is a new optimizer called ScaledAdam that makes the learning rate proportional to the parameter magnitude on a per-tensor basis, and learns the whole tensor magnitude via a modification to the learning procedure (as if we were learning a multiplicative scalar factor); this makes it possible to remove the log-scale that we previously had in each parameter tensor. We use a very large learning rate with a max value of 0.05, and to make the training stable there is a Whitening module that modifies the grad in a way that prevents situations where the activations are dominated by a small number of directions in parameter space. The optimizer also has some new features for stability and debugging: it detects when unusually large parameter changes are happening on a particular minibatch and limits them, in an automatically adjusted way, and prints information when this happens. There are also new options to detect infinite values and print them out, to help in debugging diverged networks. All these changes together give us quite a bit more freedom to change the network topology without having to worry too much about the network diverging; at least, the probability is significantly reduced.

Each individual layer also has more sub-modules: instead of ff1 + self_attn + conv_module + ff2, there is now ff1 + self_attn + conv_module1 + ff2 + self_attn2 + conv_module2 + ff3. This might seem pointless, since it's quite like doubling the layer, but the difference is that the attention weights are shared between the two self-attention modules (but not the feature projections), which saves computation because we re-use the weights. So it's more efficient, and we also avoid another BasicNorm.

We introduce a trainable bypass for each dim on each layer, subject to some limits, so the activations can bypass each layer in a soft way. We change how dropout is done: we remove dropout from the output of individual sub-modules of each layer, keeping only sub-module-level and layer-level dropout. And we introduce a dropout mask that masks out all dimensions above, say, 256, on 15% of frames; this mask is shared all through the stack of layers so the same frames are masked all the way through, and it is applied to the output of each layer. These types of things actually change on a schedule during training; we have a class called ScheduledFloat that makes this easy to set up.
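To make the "learning rate proportional to parameter magnitude" idea concrete, here is a toy sketch. It is not icefall's actual ScaledAdam: the Adam-style moments, the learned tensor magnitude, and the stability/debugging machinery described above are all omitted, and the function name `scaled_sgd_step` and the constants are purely illustrative.

```python
# Toy sketch of the "per-tensor step size proportional to parameter magnitude"
# idea behind ScaledAdam. NOT the actual icefall implementation: Adam-style
# moments, the learned tensor magnitude, and the stability checks are omitted.
# The name `scaled_sgd_step` and all constants here are illustrative only.
import torch


def scaled_sgd_step(params, lr=0.05, eps=1e-8, min_rms=1e-5):
    """One SGD-like step whose per-tensor step size scales with the RMS
    magnitude of that parameter tensor, so every tensor changes by roughly
    the same *relative* amount per step."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            g_rms = g.pow(2).mean().sqrt().clamp_min(eps)      # gradient scale
            p_rms = p.pow(2).mean().sqrt().clamp_min(min_rms)  # parameter scale
            # Normalized gradient direction, scaled by the parameter's own RMS.
            p.add_(g * (-lr * p_rms / g_rms))


# Usage on a tiny model:
model = torch.nn.Linear(10, 10)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
scaled_sgd_step(model.parameters())
```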
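And here is a minimal PyTorch sketch of two of the architectural ideas above: attention weights computed once per layer and re-used by two self-attention modules (with separate value/output projections), and a feature-dimension dropout mask that is shared by every layer in the stack. The class names `SharedAttnLayer` and `TinyStack` are made up for illustration; the real Zipformer layers also include the trainable per-dimension bypass, BasicNorm, and ScheduledFloat schedules mentioned above, which are left out here for brevity.

```python
# Minimal sketch (not the icefall Zipformer code) of (1) attention weights shared
# by two self-attention modules within a layer and (2) a stack-wide feature
# dropout mask applied to each layer's output. All names are illustrative.
import torch
import torch.nn as nn


class SharedAttnLayer(nn.Module):
    """ff1 + self_attn1 + conv1 + ff2 + self_attn2 + conv2 + ff3, where the
    attention weights are computed once and re-used by both attention modules
    (only the value/output projections differ)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Projections used to compute the shared attention weights.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Separate value projections for the two attention applications.
        self.v_proj1 = nn.Linear(d_model, d_model)
        self.v_proj2 = nn.Linear(d_model, d_model)
        ff = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ff1, self.ff2, self.ff3 = ff(), ff(), ff()
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)

    def _attn_weights(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        return torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)

    def _apply_attn(self, weights, x, v_proj):
        b, t, d = x.shape
        v = v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        return (weights @ v).transpose(1, 2).reshape(b, t, d)

    def forward(self, x, feat_mask):
        x = x + self.ff1(x)
        weights = self._attn_weights(x)                          # computed once ...
        x = x + self._apply_attn(weights, x, self.v_proj1)
        x = x + self.conv1(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.ff2(x)
        x = x + self._apply_attn(weights, x, self.v_proj2)       # ... re-used here
        x = x + self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.ff3(x)
        return x * feat_mask  # stack-wide feature dropout mask on the layer output


class TinyStack(nn.Module):
    def __init__(self, num_layers=4, d_model=384, n_heads=4, d_ff=1024,
                 keep_dims=256, frame_drop_prob=0.15):
        super().__init__()
        self.keep_dims = keep_dims
        self.frame_drop_prob = frame_drop_prob
        self.layers = nn.ModuleList(
            [SharedAttnLayer(d_model, n_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x):
        b, t, d = x.shape
        mask = torch.ones(b, t, d, device=x.device)
        if self.training:
            # On ~15% of frames, zero out all feature dims above `keep_dims`;
            # the same mask is shared by every layer in the stack.
            dropped = torch.rand(b, t, 1, device=x.device) < self.frame_drop_prob
            mask[..., self.keep_dims:] = (~dropped).float()
        for layer in self.layers:
            x = layer(x, mask)
        return x


if __name__ == "__main__":
    out = TinyStack()(torch.randn(2, 50, 384))
    print(out.shape)  # torch.Size([2, 50, 384])
```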
Thanks @danpovey! Any tentative timeline on when we can expect the updated Zipformer in icefall?
I'm hoping in a month or two. |
@danpovey Hi Dan, is there any progress on the new updated Zipformer?
We are doing final testing. Unfortunately, on larger datasets it seems no better in WER than the current zipformer, but it is faster to train and uses less memory.
@danpovey Is there any preprint, paper, or report about the Zipformer?
Full pre-print available here: #1328 |
Hello authors! I had one doubt regarding Zipformer. I am trying a simple implementation of Zipformer using PyTorch. Edit: If possible, please point me to the relevant code segment in this codebase. Thank you!
Hello guys. Could you explain in a few words what the Zipformer is? Thanks a lot.