This recipe generates high-quality, human-like speech emotion descriptions. The model combines a Q-Former projector with the vicuna-7b-v1.5 LLM and is trained on an unpublished large-scale dataset for speech emotion captioning.
We train only the Q-Former projector in this recipe.
Encoder | Projector | LLM | Similarity Score |
---|---|---|---|
emotion2vec_base | Q-Former | vicuna-7b-v1.5 | 71.10 |
Note: The baseline model SECap was tested in our environment and achieved a similarity score of 71.52. Our model's score is slightly lower.
You need to prepare your data as a JSONL file in the following format:
{"key": "key_name", "source": "path_to_wav_file", "target": "corresponding_caption"}
...
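If your metadata lives elsewhere, a minimal sketch like the one below could produce this file, assuming a hypothetical tab-separated manifest `captions.tsv` with columns key, wav path, and caption (the manifest name and layout are illustrative, not part of the recipe):

```bash
# Sketch: convert a hypothetical tab-separated manifest (key, wav path,
# caption) into the JSONL format above. Assumes captions contain no
# double quotes or backslashes; use jq or Python for robust JSON escaping.
while IFS=$'\t' read -r key wav caption; do
  printf '{"key": "%s", "source": "%s", "target": "%s"}\n' "$key" "$wav" "$caption"
done < captions.tsv > data.jsonl
```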
To decode with a trained checkpoint, run:

```bash
bash decode_emotion2vec_qformer_vicuna_7b.sh
```
Modify the paths in the script, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, `val_data_path`, and `decode_log`, before you run it.
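For reference, the variables inside the script might be set as follows; every value below is a placeholder, so point each one at your own files:

```bash
# Placeholder values for the variables named above; replace all paths.
speech_encoder_path=/path/to/emotion2vec_base.pt
llm_path=/path/to/vicuna-7b-v1.5
output_dir=/path/to/decode_output
ckpt_path=/path/to/trained_qformer_projector_ckpt
val_data_path=/path/to/val.jsonl
decode_log=$output_dir/decode_log
```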
If you have sufficient relevant data, you can train the model yourself:

```bash
bash finetune_emotion2vec_qformer_vicuna_7b.sh
```
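Putting the two scripts together, a typical workflow looks like the sketch below, assuming the training script saves checkpoints under an output directory analogous to the decode script's `output_dir` (edit the paths inside each script first, as described above):

```bash
# End-to-end sketch: train the Q-Former projector, then decode with it.
bash finetune_emotion2vec_qformer_vicuna_7b.sh   # saves projector checkpoints
bash decode_emotion2vec_qformer_vicuna_7b.sh     # set ckpt_path to the trained checkpoint
```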