Zipformer explanation #837
@danpovey @csukuangfj I was wondering if there is some documentation somewhere (similar to this one about the reworked Conformer) that explains the key differences between the Conformer and the Zipformer?
Sorry, I have been waiting until I can commit the more recent version of the zipformer, which has a lot more changes.

There is a new optimizer called ScaledAdam that makes the learning rate proportional to the parameter magnitude on a per-tensor basis, and learns the whole tensor magnitude via a modification to the learning procedure (as if we were learning a multiplicative scalar factor); this makes it possible to remove the log-scale that we previously had in each parameter tensor. We use a very large learning rate with a max value of 0.05, and to make the training stable there is a Whitening module that modifies the grad in a way that prevents situations where the activations are dominated by a small number of directions in parameter space. The optimizer also has some new features for stability and debugging: it detects when unusually large parameter changes are happening on a particular minibatch and limits them, in an automatically adjusted way, and prints information when this happens. There are also new options to detect infinite values and print them out, to help in debugging diverged networks. All these changes together give us quite a bit more freedom to change the network topology without having to worry too much about the network diverging; at least, the probability is significantly reduced.

Each individual layer also has more sub-modules: instead of ff1 + self_attn + conv_module + ff2, there is now ff1 + self_attn + conv_module1 + ff2 + self_attn2 + conv_module2 + ff3. This might seem pointless, since it's quite like doubling the layer, but the difference is that the attention weights are shared between the two self-attention modules (but not the feature projections), which saves computation because we re-use the weights. So it's more efficient, and we also avoid another BasicNorm.

We introduce a trainable bypass for each dim on each layer, subject to some limits, so the activations can bypass each layer in a soft way. We change how dropout is done: we remove dropout from the output of individual sub-modules of each layer, keeping only sub-module-level and layer-level dropout. And we introduce a dropout mask that masks out all dimensions above, say, 256, on 15% of frames; this mask is shared all through the stack of layers so the same frames are masked all the way through, and it is applied to the output of each layer. These types of things actually change on a schedule during training; we have a class called ScheduledFloat that makes this easy to set up.
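To make the "learning rate proportional to parameter magnitude" idea concrete, here is a toy sketch. It is not icefall's actual ScaledAdam: the Adam-style moments, the learned tensor magnitude, and the stability/debugging machinery described above are all omitted, and the function name `scaled_sgd_step` and the constants are purely illustrative.

```python
# Toy sketch of the "per-tensor step size proportional to parameter magnitude"
# idea behind ScaledAdam. NOT the actual icefall implementation: Adam-style
# moments, the learned tensor magnitude, and the stability checks are omitted.
# The name `scaled_sgd_step` and all constants here are illustrative only.
import torch


def scaled_sgd_step(params, lr=0.05, eps=1e-8, min_rms=1e-5):
    """One SGD-like step whose per-tensor step size scales with the RMS
    magnitude of that parameter tensor, so every tensor changes by roughly
    the same *relative* amount per step."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad
            g_rms = g.pow(2).mean().sqrt().clamp_min(eps)      # gradient scale
            p_rms = p.pow(2).mean().sqrt().clamp_min(min_rms)  # parameter scale
            # Normalized gradient direction, scaled by the parameter's own RMS.
            p.add_(g * (-lr * p_rms / g_rms))


# Usage on a tiny model:
model = torch.nn.Linear(10, 10)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
scaled_sgd_step(model.parameters())
```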
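And here is a minimal PyTorch sketch of two of the architectural ideas above: attention weights computed once per layer and re-used by two self-attention modules (with separate value/output projections), and a feature-dimension dropout mask that is shared by every layer in the stack. The class names `SharedAttnLayer` and `TinyStack` are made up for illustration; the real Zipformer layers also include the trainable per-dimension bypass, BasicNorm, and ScheduledFloat schedules mentioned above, which are left out here for brevity.

```python
# Minimal sketch (not the icefall Zipformer code) of (1) attention weights shared
# by two self-attention modules within a layer and (2) a stack-wide feature
# dropout mask applied to each layer's output. All names are illustrative.
import torch
import torch.nn as nn


class SharedAttnLayer(nn.Module):
    """ff1 + self_attn1 + conv1 + ff2 + self_attn2 + conv2 + ff3, where the
    attention weights are computed once and re-used by both attention modules
    (only the value/output projections differ)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        # Projections used to compute the shared attention weights.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Separate value projections for the two attention applications.
        self.v_proj1 = nn.Linear(d_model, d_model)
        self.v_proj2 = nn.Linear(d_model, d_model)
        ff = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ff1, self.ff2, self.ff3 = ff(), ff(), ff()
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1, groups=d_model)

    def _attn_weights(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        return torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)

    def _apply_attn(self, weights, x, v_proj):
        b, t, d = x.shape
        v = v_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        return (weights @ v).transpose(1, 2).reshape(b, t, d)

    def forward(self, x, feat_mask):
        x = x + self.ff1(x)
        weights = self._attn_weights(x)                          # computed once ...
        x = x + self._apply_attn(weights, x, self.v_proj1)
        x = x + self.conv1(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.ff2(x)
        x = x + self._apply_attn(weights, x, self.v_proj2)       # ... re-used here
        x = x + self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = x + self.ff3(x)
        return x * feat_mask  # stack-wide feature dropout mask on the layer output


class TinyStack(nn.Module):
    def __init__(self, num_layers=4, d_model=384, n_heads=4, d_ff=1024,
                 keep_dims=256, frame_drop_prob=0.15):
        super().__init__()
        self.keep_dims = keep_dims
        self.frame_drop_prob = frame_drop_prob
        self.layers = nn.ModuleList(
            [SharedAttnLayer(d_model, n_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x):
        b, t, d = x.shape
        mask = torch.ones(b, t, d, device=x.device)
        if self.training:
            # On ~15% of frames, zero out all feature dims above `keep_dims`;
            # the same mask is shared by every layer in the stack.
            dropped = torch.rand(b, t, 1, device=x.device) < self.frame_drop_prob
            mask[..., self.keep_dims:] = (~dropped).float()
        for layer in self.layers:
            x = layer(x, mask)
        return x


if __name__ == "__main__":
    out = TinyStack()(torch.randn(2, 50, 384))
    print(out.shape)  # torch.Size([2, 50, 384])
```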
Thanks @danpovey! Any tentative timeline on when we can expect the updated Zipformer in icefall?
I'm hoping in a month or two. |
@danpovey Hi Dan, is there any progress on the new updated Zipformer?
We are doing final testing. Unfortunately, on larger datasets it seems no better in WER than the current zipformer, but it is faster to train and uses less memory.
@danpovey Is there any preprint, paper, or report about the Zipformer?
Full pre-print available here: #1328 |
Hello authors! I had one doubt regarding Zipformer. I am trying a simple implementation of Zipformer using PyTorch. Edit: If possible, please point me to the relevant code segment in this codebase. Thank you!
Hello guys. Could you explain in a few words what the Zipformer is? Thanks a lot.