Dimensional Speech Emotion Recognition by Using Acoustic Features and Word Embeddings using Multitask Learning
by Bagus Tris Atmaja, Masato Akagi
This paper has been published in APSIPA Transactions on Signal and Information Processing.
The majority of research in speech emotion recognition (SER) is conducted to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and it has several advantages over categorical emotion. For this research, we investigate dimensional SER using both speech features and word embeddings. A concatenation network joins the acoustic and text networks built from these bimodal features. We demonstrate that these bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings alone. The addition of word embeddings to the SER system contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions are also improved. We propose a multitask learning (MTL) approach for the prediction of all emotional attributes. This MTL maximizes the concordance correlation between predicted emotion degrees and true emotion labels for all attributes simultaneously. The findings suggest that MTL with two parameters represents the interrelation of emotional attributes better than the other evaluated methods. In unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient (CCC) score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
The algorithm proposed in the paper was implemented using NumPy, Keras (v2.3), and TensorFlow (v1.15).
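To give a sense of how a CCC-based multitask objective can be expressed in this stack, here is a minimal sketch (not the repository code): the CCC follows its standard definition, while the toy three-head network, the 100-dimensional input, the output names 'val'/'aro'/'dom', and the placeholder values of the two weighting parameters alpha and beta are assumptions for illustration only and may differ from the weighting scheme used in the paper.

```python
# Minimal sketch of a CCC-based multitask loss in Keras (illustrative only).
from tensorflow.keras import layers, Model, backend as K

def ccc(y_true, y_pred):
    """Concordance correlation coefficient computed over a batch."""
    mean_true, mean_pred = K.mean(y_true), K.mean(y_pred)
    var_true, var_pred = K.var(y_true), K.var(y_pred)
    cov = K.mean((y_true - mean_true) * (y_pred - mean_pred))
    return (2.0 * cov) / (var_true + var_pred
                          + K.square(mean_true - mean_pred) + K.epsilon())

def ccc_loss(y_true, y_pred):
    """Loss that is minimized when the CCC is maximized."""
    return 1.0 - ccc(y_true, y_pred)

# Hypothetical toy network: one shared layer and three attribute heads.
inputs = layers.Input(shape=(100,))
hidden = layers.Dense(64, activation='relu')(inputs)
outputs = [layers.Dense(1, name=name)(hidden) for name in ('val', 'aro', 'dom')]
model = Model(inputs, outputs)

# Two MTL parameters weight the attribute losses; the values are placeholders.
alpha, beta = 0.1, 0.5
model.compile(optimizer='adam',
              loss={'val': ccc_loss, 'aro': ccc_loss, 'dom': ccc_loss},
              loss_weights={'val': alpha, 'aro': beta, 'dom': 1.0 - alpha - beta})
```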
All source code used to generate the results and figures in the paper is in the code folder. The calculations and figure generation are all run inside Jupyter notebooks.
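To run them yourself, starting Jupyter from the repository root after installing the dependencies (see below) should be enough; pointing it at the code folder is an assumption about where the notebooks live:

```
jupyter notebook code/
```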
The data used in this study are provided in data, and the sources for the manuscript text and figures are in latex. Results generated by the code are saved in results. See the README.md files in each directory for a full description.
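For orientation, the repository layout described above is roughly the following (only the items named in this README are shown):

```
dimensional-ser/
├── code/      source code and Jupyter notebooks
├── data/      CSV data used for the plots
├── latex/     manuscript text and figures
├── results/   results generated by the code
└── README.md
```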
Figure: Architecture of the proposed dimensional SER with the main results.
You can download a copy of all the files in this repository by cloning the git repository:
git clone https://github.com/bagustris/dimensional-ser.git
A copy of the paper is also archived at https://doi.org/10.1017/ATSIP.2020.14
You'll need a working Python environment to run the code.
One convenient way to set up your environment is through the
Anaconda Python distribution, which
provides the conda
package manager.
Anaconda can be installed in your user directory and does not interfere with
the system Python installation.
The required dependencies are specified in the file requirements.txt
.
We use pip with Python
virtual environments (venv) to manage the project dependencies in
isolation.
Thus, you can install our dependencies without causing conflicts with your
setup (even with different Python versions).
Run the following commands in the repository folder (where requirements.txt
is located) to create a separate environment and install all required
dependencies in it:
python3.6 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
Since the dataset is not included, it is difficult to reproduce the full results. However, the plot in the paper can be reproduced from the CSV file in the data directory.
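As a rough illustration, a plot can be regenerated along the following lines, assuming pandas and matplotlib are available; the file name example.csv and the choice of the first column as the x-axis are hypothetical placeholders, so check the README.md in data for the actual file and column names.

```python
# Hedged sketch: regenerate a figure from a CSV stored in data/.
# "example.csv" is a hypothetical placeholder file name.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/example.csv")
print(df.head())                                        # inspect available columns
ax = df.plot(x=df.columns[0], y=list(df.columns[1:]))   # plot the remaining columns
ax.figure.savefig("results/example.png", dpi=150)
plt.show()
```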
All source code is made available under a BSD 3-clause license. You can freely
use and modify the code, without warranty, so long as you provide attribution
to the authors. See LICENSE.md
for the full license text.
The manuscript text is not open source. The authors reserve the rights to the article content, which is published in APSIPA Transactions on Signal and Information Processing.
B. T. Atmaja and M. Akagi, “Dimensional speech emotion recognition from speech
features and word embeddings by using multitask learning,” APSIPA Transactions
on Signal and Information Processing, vol. 9, p. e17, 2020.
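For convenience, the same reference as a BibTeX entry (the citation key is arbitrary; the field values are taken from the citation and DOI above):

```bibtex
@article{atmaja2020dimensional,
  author  = {Atmaja, Bagus Tris and Akagi, Masato},
  title   = {Dimensional Speech Emotion Recognition from Speech Features and Word Embeddings by Using Multitask Learning},
  journal = {APSIPA Transactions on Signal and Information Processing},
  volume  = {9},
  pages   = {e17},
  year    = {2020},
  doi     = {10.1017/ATSIP.2020.14}
}
```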