This recipe generates high-quality, human-like speech emotion descriptions. The model combines a Q-Former projector with the vicuna-7b-v1.5 LLM and is trained on an unpublished large-scale dataset for speech emotion captioning.
We train only the Q-Former projector in this recipe.
Encoder | Projector | LLM | Similarity Score |
---|---|---|---|
emotion2vec_base | Q-Former | vicuna-7b-v1.5 | 71.10 |
Note: The baseline model SECap was tested in our environment and achieved a similarity score of 71.52. Our model's score is slightly lower.
You need to prepare your data as a JSONL file in the following format:
{"key": "key_name", "source": "path_to_wav_file", "target": "corresponding_caption"}
...
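If your metadata lives elsewhere, a minimal sketch like the one below could produce this file, assuming a hypothetical tab-separated manifest `captions.tsv` with columns key, wav path, and caption (the manifest name and layout are illustrative, not part of the recipe):

```bash
# Sketch: convert a hypothetical tab-separated manifest (key, wav path,
# caption) into the JSONL format above. Assumes captions contain no
# double quotes or backslashes; use jq or Python for robust JSON escaping.
while IFS=$'\t' read -r key wav caption; do
  printf '{"key": "%s", "source": "%s", "target": "%s"}\n' "$key" "$wav" "$caption"
done < captions.tsv > data.jsonl
```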
To decode with a trained checkpoint, run:

```bash
bash decode_emotion2vec_qformer_vicuna_7b.sh
```
Modify the paths in the script, including `speech_encoder_path`, `llm_path`, `output_dir`, `ckpt_path`, `val_data_path`, and `decode_log`, before you run it.
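For reference, the variables inside the script might be set as follows; every value below is a placeholder, so point each one at your own files:

```bash
# Placeholder values for the variables named above; replace all paths.
speech_encoder_path=/path/to/emotion2vec_base.pt
llm_path=/path/to/vicuna-7b-v1.5
output_dir=/path/to/decode_output
ckpt_path=/path/to/trained_qformer_projector_ckpt
val_data_path=/path/to/val.jsonl
decode_log=$output_dir/decode_log
```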
If you have sufficient relevant data, you can train the model yourself:

```bash
bash finetune_emotion2vec_qformer_vicuna_7b.sh
```
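Putting the two scripts together, a typical workflow looks like the sketch below, assuming the training script saves checkpoints under an output directory analogous to the decode script's `output_dir` (edit the paths inside each script first, as described above):

```bash
# End-to-end sketch: train the Q-Former projector, then decode with it.
bash finetune_emotion2vec_qformer_vicuna_7b.sh   # saves projector checkpoints
bash decode_emotion2vec_qformer_vicuna_7b.sh     # set ckpt_path to the trained checkpoint
```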