
Zipformer explanation #837

Closed
AlexandderGorodetski opened this issue Jan 12, 2023 · 11 comments

Comments

@AlexandderGorodetski

Hello guys. Could you explain in a few words what the Zipformer is? Thanks a lot.

@AlexandderGorodetski
Author

The explanation is here
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7/zipformer.py

@desh2608
Collaborator

@danpovey @csukuangfj I was wondering if there is some documentation somewhere (similar to this one about the reworked Conformer) that explains the key differences between conformer and zipformer?

@desh2608 desh2608 reopened this Jan 31, 2023
@danpovey
Collaborator

danpovey commented Feb 1, 2023

Sorry, I have been waiting until I can commit the more recent version of the Zipformer, which has a lot more changes.
I think the most significant difference is that there are multiple stacks of encoders, and they run at different frame rates: the middle ones are more strongly downsampled, by up to a factor of 4, relative to the eventual output of the network. This is more efficient because there are fewer frames to evaluate.
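
To make the idea concrete, here is a toy PyTorch sketch of the multi-rate structure. This is not the icefall code: the name ToyMultiRateEncoder is made up, and a plain TransformerEncoder plus average pooling stand in for the real Zipformer stacks and its learned downsampling/upsampling modules.

```python
# Toy sketch of the multi-rate encoder-stack idea (not the icefall code):
# each stack runs at its own frame rate, with the middle stacks downsampled
# more heavily, then brought back to the full rate and added as a residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiRateEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, factors=(1, 2, 4, 2, 1)):
        super().__init__()
        self.factors = factors
        self.stacks = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
                num_layers=2,
            )
            for _ in factors
        )

    def forward(self, x):  # x: (batch, time, d_model)
        for factor, stack in zip(self.factors, self.stacks):
            if factor > 1:
                # downsample in time: fewer frames -> cheaper attention
                y = F.avg_pool1d(x.transpose(1, 2), factor, ceil_mode=True).transpose(1, 2)
            else:
                y = x
            y = stack(y)
            if factor > 1:
                # upsample back and add as a residual to the full-rate stream
                y = y.repeat_interleave(factor, dim=1)[:, : x.size(1)]
            x = x + y
        return x

x = torch.randn(2, 100, 256)
print(ToyMultiRateEncoder()(x).shape)  # torch.Size([2, 100, 256])
```

The real encoder layers and the downsampling/upsampling are more involved, but the shape bookkeeping is the same: fewer frames in the middle stacks means less attention and convolution compute.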

There is a new optimizer called ScaledAdam that makes the learning rate proportional to the parameter magnitude on a per-tensor basis, and learns the overall tensor magnitude via a modification to the update procedure (as if we were learning a multiplicative scalar factor); this makes it possible to remove the log-scale that we previously had in each parameter tensor. We use a very large learning rate with a maximum value of 0.05, and to keep training stable there is a Whitening module that modifies the gradient in a way that prevents the activations from being dominated by a small number of directions in parameter space. The optimizer also has some new features for stability and debugging: it detects when unusually large parameter changes are happening on a particular minibatch and limits them in an automatically adjusted way, printing information when this happens; and there are new options to detect infinite values and print them out, to help in debugging diverged networks. All of these changes together give us quite a bit more freedom to change the network topology without having to worry too much about the network diverging; at least, the probability is significantly reduced.
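
Roughly, the key difference from plain Adam can be sketched in a single toy update step like the one below. This is not the icefall implementation, and it omits the part that learns the overall tensor scale; it only shows the "step size proportional to parameter magnitude" idea.

```python
# Toy illustration of the ScaledAdam idea (not the icefall implementation):
# the effective step size for each parameter tensor is scaled by that
# tensor's RMS, so the relative change per step is roughly constant
# regardless of the parameter's magnitude.
import torch

def toy_scaled_adam_step(param, grad, exp_avg, exp_avg_sq, step,
                         lr=0.05, betas=(0.9, 0.98), eps=1e-8):
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    denom = (exp_avg_sq / (1 - beta2 ** step)).sqrt().add_(eps)
    adam_step = (exp_avg / (1 - beta1 ** step)) / denom
    # Key difference from plain Adam: scale the update by the tensor's RMS,
    # so the learning rate is proportional to the parameter magnitude.
    param_rms = param.pow(2).mean().sqrt().clamp(min=1e-5)
    param.add_(adam_step, alpha=-lr * param_rms.item())

p = torch.randn(256, 256) * 0.1
g = torch.randn_like(p)
m, v = torch.zeros_like(p), torch.zeros_like(p)
toy_scaled_adam_step(p, g, m, v, step=1)
```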

Each individual layer also has more sub-modules: instead of ff1 + self_attn + conv_module + ff2, there is now ff1 + self_attn + conv_module1 + ff2 + self_attn2 + conv_module2 + ff3. This might seem pointless, as it is quite like doubling the layer, but the difference is that the attention weights are shared between the two self-attention modules (though not the feature projections), which saves computation because we re-use the weights. So it is more efficient, and we also avoid another BasicNorm.
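
A sketch of the weight-sharing, with made-up module names (the real code computes the attention weights somewhat differently and has many more details): the softmaxed attention weights are computed once per layer, and each of the two self-attention modules only supplies its own value and output projections.

```python
# Sketch of re-using attention weights across two self-attention modules
# in the same layer (illustrative only, not the icefall code).
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    """Computes (softmaxed) attention weights from queries and keys only."""
    def __init__(self, d_model, nhead):
        super().__init__()
        self.nhead, self.d_head = nhead, d_model // nhead
        self.in_proj = nn.Linear(d_model, 2 * d_model)  # queries and keys only

    def forward(self, x):  # x: (batch, time, d_model)
        b, t, _ = x.shape
        q, k = self.in_proj(x).chunk(2, dim=-1)
        q = q.view(b, t, self.nhead, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.nhead, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        return scores.softmax(dim=-1)  # (batch, nhead, time, time)

class SelfAttentionValuesOnly(nn.Module):
    """Applies externally supplied attention weights to its own value projection."""
    def __init__(self, d_model, nhead):
        super().__init__()
        self.nhead, self.d_head = nhead, d_model // nhead
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_weights):
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.nhead, self.d_head).transpose(1, 2)
        out = attn_weights @ v  # re-use the shared weights
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

d_model, nhead = 256, 4
weights_mod = AttentionWeights(d_model, nhead)
attn1 = SelfAttentionValuesOnly(d_model, nhead)
attn2 = SelfAttentionValuesOnly(d_model, nhead)

x = torch.randn(2, 50, d_model)
w = weights_mod(x)       # attention weights computed once per layer
x = x + attn1(x, w)      # first self-attention module
# ... conv_module1 and ff2 would go here ...
x = x + attn2(x, w)      # second module re-uses the same weights
```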

We introduce a trainable bypass for each dim on each layer, subject to some limits, so the activations can bypass each layer in a soft way.
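
For illustration, a minimal version of such a bypass might look like the following (hypothetical names, not the actual icefall module): each channel gets a learnable scale, clamped to some limits, that mixes the layer input with the layer output.

```python
# Sketch of a per-channel trainable bypass (not the icefall code): each
# channel has a learnable scale c in [scale_min, scale_max] that mixes the
# layer input with the layer output, so the layer can be softly bypassed.
import torch
import torch.nn as nn

class ToyBypass(nn.Module):
    def __init__(self, d_model, scale_min=0.1, scale_max=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((d_model,), 0.5))
        self.scale_min, self.scale_max = scale_min, scale_max

    def forward(self, x_in, x_out):
        c = self.scale.clamp(self.scale_min, self.scale_max)
        return x_in + c * (x_out - x_in)  # c near 0: bypass; c = 1: full layer output

bypass = ToyBypass(256)
x_in = torch.randn(2, 100, 256)
x_out = torch.randn(2, 100, 256)   # pretend this is the layer's output
y = bypass(x_in, x_out)
```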

We change how dropout is done: we remove dropout from the output of individual sub-modules of each layer, keeping only sub-module-level and layer-level dropout. And we introduce a dropout mask that masks out all dimensions above, say, 256, on 15% of frames; this mask is shared all through the stack of layers, so the same frames are masked all the way through, and it is applied to the output of each layer. These values actually follow schedules, so they change during training; we have a class called ScheduledFloat that makes this easy to set up.
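
A sketch of the shared feature mask (the 256 cutoff and 15% frame probability are the example numbers from the description above; in the real code these would be ScheduledFloat values that change over training, not fixed constants):

```python
# Sketch of the shared feature-dimension mask (not the icefall code): on
# ~frame_prob of frames, zero out all feature dims >= cutoff, and apply the
# SAME mask to the output of every layer in the stack.
import torch

def make_feature_mask(batch, time, d_model, cutoff=256, frame_prob=0.15):
    mask = torch.ones(batch, time, d_model)
    keep = (torch.rand(batch, time, 1) >= frame_prob).float()  # 0 on dropped frames
    mask[:, :, cutoff:] = keep    # broadcast over the high feature dims
    return mask

d_model, cutoff = 384, 256
x = torch.randn(2, 100, d_model)
mask = make_feature_mask(2, 100, d_model, cutoff)  # built once per sequence
for _ in range(3):              # stand-in for the stack of encoder layers
    # x = layer(x) would go here
    x = x * mask                # the same mask is applied after every layer
```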

@desh2608
Collaborator

desh2608 commented Feb 1, 2023

Thanks @danpovey! Any tentative timeline on when we can expect the updated Zipformer in icefall?

@danpovey
Collaborator

danpovey commented Feb 2, 2023

I'm hoping in a month or two.

@brainbpe

brainbpe commented Apr 4, 2023

> I'm hoping in a month or two.

@danpovey Hi Dan, is there any progress on the new updated Zipformer?

@danpovey
Collaborator

danpovey commented Apr 4, 2023

We are doing final testing. Unfortunately, on larger datasets, it seems not better in WER than the current zipformer, but it is faster to train and uses less memory.

@seastar105

@danpovey is there any preprint or paper or report about zipformer?

@csukuangfj
Collaborator

> @danpovey is there any preprint or paper or report about zipformer?

@seastar105 Please see #1230

@JinZr
Collaborator

JinZr commented Oct 24, 2023

Full pre-print available here: #1328

@JinZr JinZr closed this as completed Oct 24, 2023
@MysticShadow427

MysticShadow427 commented May 17, 2024

Hello authors! I have one doubt regarding Zipformer. I am trying a simplistic implementation of Zipformer using PyTorch.
As per the paper, the input to the model is a 3-D tensor [batch_size, num_time_steps, mel_features], but the first layer of the model, ConvEmbed, has 2-D convolution layers with the number of channels given in the paper. nn.Conv2d requires a 4-D tensor [batch_size, num_channels, num_time_steps, mel_features], so how should this be handled? By unsqueezing a dimension and giving the model an input of shape [batch_size, 1, num_time_steps, mel_features]? If we pass that through, it will work fine up to the multi-head attention layer, so how do we implement the usual attention on a 4-D tensor?
Please correct me if I am missing something. Thank you!

Edit: If possible, please point me to the specific code segment in this codebase. Thank you!
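
For what it's worth, the usual pattern in speech front ends is to unsqueeze a channel dimension for the 2-D convolutions and then fold the (channels, freq) dimensions back into a single feature dimension before the attention layers, so attention always sees a 3-D tensor. A minimal sketch follows (the module name ToyConvEmbed and the exact conv configuration are made up for illustration; I believe the corresponding icefall code is the Conv2dSubsampling-style front end in the recipe, not this exact module):

```python
# Sketch of a Conv2d-based front end that returns a 3-D, attention-ready tensor.
import torch
import torch.nn as nn

class ToyConvEmbed(nn.Module):
    def __init__(self, num_mel=80, d_model=256, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        freq_out = num_mel // 4  # two stride-2 convs with padding=1
        self.proj = nn.Linear(channels * freq_out, d_model)

    def forward(self, x):                      # (batch, time, mel)
        x = x.unsqueeze(1)                     # (batch, 1, time, mel) for Conv2d
        x = self.conv(x)                       # (batch, channels, time', mel')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # fold (channels, freq) back
        return self.proj(x)                    # (batch, time', d_model), 3-D again

embed = ToyConvEmbed()
feats = torch.randn(2, 100, 80)
out = embed(feats)
print(out.shape)                               # torch.Size([2, 25, 256])
```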
