Implementation of the U-Net-like Zipformer from the Zipformer paper, which improves on the Conformer by processing the sequence at multiple temporal resolutions.
### 1. Zipformer Block

```python
import torch
from zipformer import ZipformerBlock

block = ZipformerBlock(
    dim = 512,
    dim_head = 64,
    heads = 8,
    mult = 4
)

x = torch.randn(32, 100, 512)  # (batch_size, num_time_steps, feature_dim)
block(x)  # (32, 100, 512)
```
### 2. Zipformer

`Zipformer` is simply a stack of the `ZipformerBlock` modules from above.

```python
import torch
from zipformer import Zipformer

zipformer = Zipformer(
    dim = 512,
    depth = 12,  # 12 blocks
    dim_head = 64,
    heads = 8,
    mult = 4,
)

x = torch.randn(32, 100, 512)  # (batch_size, num_time_steps, feature_dim)
zipformer(x)  # (32, 100, 512)
```
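To illustrate the U-Net-style temporal structure the paper describes (middle blocks running at a lower frame rate, then restoring the original resolution), here is a minimal, hypothetical sketch in plain PyTorch. The `DownUpSketch` module and its pooling/repeat scheme are illustrative assumptions, not the repo's or the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class DownUpSketch(nn.Module):
    """Hypothetical sketch: average-pool the time axis by `factor`,
    run an inner module at the lower frame rate, then upsample
    back to the original sequence length."""
    def __init__(self, inner: nn.Module, factor: int = 2):
        super().__init__()
        self.inner = inner
        self.factor = factor

    def forward(self, x):  # x: (batch, time, dim)
        b, t, d = x.shape
        pad = (-t) % self.factor
        if pad:  # right-pad so time is divisible by the factor
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        # downsample: mean over non-overlapping windows of `factor` frames
        down = x.view(b, -1, self.factor, d).mean(dim=2)
        out = self.inner(down)
        # upsample by frame repetition, then crop to the original length
        return out.repeat_interleave(self.factor, dim=1)[:, :t]

# a stand-in inner module at half the frame rate
inner = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
module = DownUpSketch(inner, factor=2)

x = torch.randn(32, 100, 512)
y = module(x)
print(y.shape)  # torch.Size([32, 100, 512])
```

The repetition-based upsampling here is the simplest choice; the paper uses more careful downsampling/upsampling modules.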
TODO:

- Switch to a better relative positional encoding.
- Add the whitener and balancer activation modifications.
- Add a training and evaluation script.
Pull requests are welcome; I would be happy to improve this naive implementation.
```bibtex
@article{yao2023zipformer,
  title={Zipformer: A faster and better encoder for automatic speech recognition},
  author={Yao, Zengwei and Guo, Liyong and Yang, Xiaoyu and Kang, Wei and Kuang, Fangjun and Yang, Yifan and Jin, Zengrui and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2310.11230},
  year={2023}
}
```