Naoya Takahashi¹, Nabarun Goswami², Yuki Mitsufuji¹
¹Sony Corporation, Audio Technology Development Department, Tokyo, Japan
²Sony India Software Center, Bangalore, India
Naoya.Takahashi [at] sony.com
- is_blind: no
- additional_training_data: no
- Code: not available
- Demos: not available
This submission uses a multi-scale multi-band DenseLSTM (MMDenseLSTM), which is an extension of MMDenseNet [1]. In MMDenseLSTM, some of the dense blocks, typically at low scales, are equipped with an LSTM layer, as shown in Figure 1. Each LSTM layer is a single bi-directional LSTM layer with at most 128 cells.
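A rough sketch of such a block is given below (assuming PyTorch; the growth rate, the number of composite layers, the 1x1 channel squeeze before the LSTM, and concatenating the LSTM output back onto the feature map are illustrative assumptions, not the exact MMDenseLSTM wiring):

```python
import torch
import torch.nn as nn

class DenseLSTMBlock(nn.Module):
    """Dense block whose output is augmented with a bi-directional LSTM over time.

    Hypothetical sketch: growth rate, number of composite layers, and the way
    the LSTM output is merged back (channel concatenation) are assumptions.
    """

    def __init__(self, in_channels, n_freq, growth_rate=16, n_layers=4,
                 lstm_cells=128, use_lstm=True):
        super().__init__()
        self.conv_layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.conv_layers.append(nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1)))
            channels += growth_rate  # dense connectivity: features are concatenated

        self.use_lstm = use_lstm
        if use_lstm:
            self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)  # collapse channels
            self.blstm = nn.LSTM(input_size=n_freq, hidden_size=lstm_cells,
                                 bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * lstm_cells, n_freq)         # back to n_freq bins

    def forward(self, x):
        # x: (batch, channels, freq, time)
        for layer in self.conv_layers:
            x = torch.cat([x, layer(x)], dim=1)
        if self.use_lstm:
            seq = self.squeeze(x).squeeze(1).transpose(1, 2)      # (batch, time, freq)
            out, _ = self.blstm(seq)                              # (batch, time, 2*cells)
            out = self.proj(out).transpose(1, 2).unsqueeze(1)     # (batch, 1, freq, time)
            x = torch.cat([x, out], dim=1)
        return x
```

In this sketch, a block at a low scale would be instantiated with use_lstm=True, while blocks at higher scales would keep use_lstm=False, following the description above.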
For each instrument, an MMDenseLSTM is trained to predict the target instrument amplitude from the mixture amplitude in the STFT domain (frame size: 4096, hop size: 1024). The raw outputs of the networks are then combined by a multichannel Wiener filter as described in [2], where we estimate the power spectral densities and spatial covariance matrices from the DNN outputs.
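A rough sketch of this combination step is shown below, assuming a single non-iterative pass in NumPy: the power spectral densities are taken as the channel-averaged squared DNN magnitudes, and the spatial covariance matrices are estimated from magnitude-plus-mixture-phase source estimates. The function name and array layout are illustrative; [2] additionally refines these statistics with EM iterations.

```python
import numpy as np

def multichannel_wiener_filter(mix_stft, mag_estimates, eps=1e-10):
    """Combine per-source magnitude estimates with a multichannel Wiener filter.

    mix_stft      : complex mixture STFT, shape (frames, bins, channels)
    mag_estimates : DNN magnitude outputs, shape (sources, frames, bins, channels)
    returns       : complex source estimates, shape (sources, frames, bins, channels)
    """
    n_src, n_frames, n_bins, n_chan = mag_estimates.shape

    # Power spectral densities: channel-averaged squared magnitudes (assumption).
    psd = np.mean(mag_estimates ** 2, axis=-1)                   # (sources, frames, bins)

    # Spatial covariance matrices per source and frequency bin,
    # normalised by the summed PSD as in the weighted estimate of [2].
    scm = np.zeros((n_src, n_bins, n_chan, n_chan), dtype=complex)
    for j in range(n_src):
        # Rough source estimate: estimated magnitude with the mixture phase.
        s_hat = mag_estimates[j] * np.exp(1j * np.angle(mix_stft))
        outer = np.einsum('tfc,tfd->fcd', s_hat, s_hat.conj())   # sum over frames
        norm = np.sum(psd[j], axis=0)[:, None, None] + eps       # (bins, 1, 1)
        scm[j] = outer / norm

    # Wiener filtering: s_j = v_j R_j (sum_k v_k R_k)^(-1) x
    estimates = np.zeros((n_src,) + mix_stft.shape, dtype=complex)
    mix_cov = np.einsum('jtf,jfcd->tfcd', psd, scm)              # (frames, bins, ch, ch)
    inv_mix = np.linalg.inv(mix_cov + eps * np.eye(n_chan))
    for j in range(n_src):
        gain = np.einsum('tf,fcd,tfde->tfce', psd[j], scm[j], inv_mix)
        estimates[j] = np.einsum('tfce,tfe->tfc', gain, mix_stft)
    return estimates
```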
More details, experiments, and analysis are described in a paper submitted elsewhere [3].
The networks are trained on the train part of MUSDB for 60 epochs, at which point the training curve saturates.
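For reference, a minimal data-loading sketch using the musdb Python package (assumed here; any MUSDB loader works, and the dataset location is assumed to be configured for the loader):

```python
import musdb  # assumption: the 'musdb' Python package is used to read the train split

mus = musdb.DB(subsets="train")  # 100 training tracks

for track in mus:
    mixture = track.audio                    # (samples, channels) mixture waveform
    vocals = track.targets["vocals"].audio   # target stem for the vocals network
    # ... compute STFT magnitudes (frame 4096, hop 1024) and run an optimisation step ...
```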
- [1] N. Takahashi and Y. Mitsufuji, "Multi-scale multi-band DenseNets for audio source separation," Proc. WASPAA, 2017.
- [2] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel music separation with deep neural networks," Proc. EUSIPCO, 2016.
- [3] N. Takahashi, N. Goswami, and Y. Mitsufuji, "MMDenseLSTM: an efficient two way modeling for audio source separation," arXiv preprint, 2018.