MDLT

Stylianos Ioannis Mimilakis¹ and Konstantinos Drossos²

¹Fraunhofer-IDMT, Ilmenau, Germany

²Tampere University of Technology, Tampere, Finland

Contact: mis [at] idmt.fraunhofer.de

Additional Info

  • is_blind: no
  • additional_training_data: no

Supplemental Material

Method

Task: Singing voice separation.

We used the Masker and Denoiser (MaD) architecture presented in the references below. Our method operates on single-channel mixture magnitude spectrograms and yields single-channel estimates of the singing voice. The main difference from MDL1 is that a thresholding operation is applied to the latent representation that controls the time-frequency mask generation (denoted as "H-j-dec" in our paper): values whose absolute magnitude is less than or equal to $0.2$ are set to zero. This thresholding is applied only at test time. The accompaniment source is estimated by time-domain subtraction.

To avoid the computational complexity of the recurrent inference, we added to the overall cost an $\ell_1$ matrix norm penalty on the latent representation of the target-source time-frequency mask ("H-j-dec"). In MDLT, this matrix norm is weighted by a scalar of $2 \times 10^{-7}$. For training, we used only the training subset of MUSDB18, without any augmentation, normalisation, or dropout. At test time, we applied our method to each available mixture channel independently.
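The sketch below is a minimal, hypothetical PyTorch illustration of the three operations described above: the test-time thresholding of "H-j-dec", the weighted $\ell_1$ penalty on it, and the time-domain subtraction. All function and variable names here are assumptions for illustration, not the actual implementation, which is available at the repository linked below.

```python
import torch

# Constants taken from the description above.
LATENT_THRESHOLD = 0.2  # entries of |H-j-dec| at or below this are zeroed (test time only)
L1_WEIGHT = 2e-7        # scalar applied to the l1 matrix-norm penalty (MDLT)


def threshold_latent(h_j_dec: torch.Tensor) -> torch.Tensor:
    """Test-time thresholding: zero the entries of the latent
    representation whose absolute value is <= 0.2."""
    keep = (h_j_dec.abs() > LATENT_THRESHOLD).to(h_j_dec.dtype)
    return h_j_dec * keep


def overall_cost(reconstruction_loss: torch.Tensor,
                 h_j_dec: torch.Tensor) -> torch.Tensor:
    """Training cost: reconstruction term plus the weighted l1 penalty
    on the latent representation of the target-source mask."""
    return reconstruction_loss + L1_WEIGHT * h_j_dec.abs().sum()


def accompaniment_estimate(mixture: torch.Tensor,
                           vocals_estimate: torch.Tensor) -> torch.Tensor:
    """Time-domain subtraction, applied to each mixture channel independently."""
    return mixture - vocals_estimate
```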

More details can be found here: https://js-mim.github.io/mss_pytorch/

References

[1] S.I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), September 2017.

[2] S.I. Mimilakis, K. Drossos, J.F. Santos, G. Schuller, T. Virtanen, and Y. Bengio, "Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.