Skip to content

Latest commit

 

History

History
104 lines (76 loc) · 11.1 KB

README.md

File metadata and controls

104 lines (76 loc) · 11.1 KB

S2S-Bench

中文版本

Welcome to our project repository. This project primarily focuses on the reproduction and evaluation of Speech Large Models (SLMs), with a particular emphasis on speech-to-speech (S2S) models that support both speech input and output. The repository includes:

    1. Reproduction code for various models;
    1. Arena web code for testing and demonstration;
    1. Dataset for testing.

In existing research, benchmarks for evaluating models’ instruction-following abilities often overlook paralinguistic information in both input and output, and lack direct comparison of speech output across models. To address these issues, we introduce a novel arena-style S2S benchmark that covers multiple real-world task scenarios, using the ELO rating system for performance analysis. Preliminary experiments show that although some models excel in knowledge-intensive tasks, they still face challenges in generating expressive speech. This study provides critical insights for the further development of S2S models and establishes a robust framework for evaluating model performance in both semantic and paralinguistic dimensions.

We invite researchers to include their models in our evaluation system. For inquiries, please contact us via issue submission or email: [email protected], [email protected].

If you have any additional interesting tests, feel free to contact us.

Model Reproduction Guide

Cascade Model

The Cascade Model consists of three components: ASR, LLMs, and TTS.

  • For ASR, we use whisper-large-v3;
  • For LLMs, we use gpt-4o-2024-08-06 (text version);
  • For TTS, we use CosyVoice-300M-Instruct.

The related code is located in ./CascadeModel.

Environment setup for each component is as follows:

Whisper-ASR

Refer to the Whisper model page on Hugging Face for ASR environment configuration.

GPT-4o-LLMs

To set up the environment for LLMs, use the following command:

pip install openai==0.28.0

CosyVoice-TTS

For TTS environment configuration, refer to FunAudio’s official GitHub.

Make sure to set PATH_TO_COSYVOICE to the path of your CosyVoice code directory to import the required packages. If you encounter a module-not-found error for Matcha-TTS, you can resolve it by running:

export PYTHONPATH=third_party/Matcha-TTS

GPT-4o

We reproduced the gpt-4o-realtime-preview-2024-10-01 version by calling the API directly. Refer to ./GPT-4o for related code and environment configuration. If you experience issues with model responses or speech format, try using the conversion code.

SpeechGPT

We used the SpeechGPT open-source code without modifications.

FunaudioLLMs-Qwen72B

We reproduced two versions of this model: the official version (using Qwen-72B as the LLMs) and a version using GPT-4o. This section introduces the former; the next section introduces the latter.

The code using Qwen-72B as the LLMs is in ./Funaudio_qwen. To run this code, you will need the following configurations:

SenseVoice

Refer to SenseVoice’s official GitHub for environment setup, and set 'PATH_TO_SENSEVOICE' in sensevoice.py.

Qwen72B

Environment installation and obtaining API Key:

  1. Install the environment:
    pip install dashscope
  2. Obtain the API Key by:
    • Visiting Aliyun's website and logging in.
    • Accessing the console, locating "Machine Learning" under "Products" or "Services," or directly searching for "DashScope."
    • Navigating to the DashScope service page and following instructions to activate the service.
    • After activation, create a project or application if required.
    • Once created, locate API Key or access key settings in the project or application management interface.
    • Follow the prompts to generate or view your API Key.

CosyVoice

Environment setup is the same as in the Cascade Model section.

LLaMA-omni

We modified the invocation method based on the LLaMA-Omni open-source project. Follow these steps for setup:

  1. Download and configure the original project.
  2. Place run_arena.sh in the ./omni_speech/infer folder, modifying the script with the correct model and dataset paths.
  3. Run run_model.py.
  4. Run change_filename.py.

Mini-Omni

We modified the invocation method based on the Mini-Omni open-source project. Follow these steps for setup:

Our modified code can be found in the ./Mini-Omni directory.

  1. Download and configure the original project;
  2. Place inference_arena.py in the folder containing the downloaded data;
  3. Configure the file paths in inference_arena.py;
  4. Run inference_arena.py.

Website Code

coming soon

Dataset

You can refer to: Hugging Face Dataset

BIb

Since our project is still ongoing and requires the involvement of many friends and partners in the evaluation process, we have decided to publicly release the first draft of our paper to help everyone quickly understand our research progress and encourage participation. However, it is important to note that this is only our preliminary version, not the final draft. Our Paper