Commit 36b0b0e

Update README.md
1 parent e6b8f06 commit 36b0b0e

File tree

1 file changed: +3 −3 lines changed

models/tts/metis/README.md

Lines changed: 3 additions & 3 deletions
@@ -15,7 +15,7 @@ Unlike previous task-specific or multi-task models, Metis follows a pre-training
Specifically, (1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. (2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. (3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters.
Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems
across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data.
-Audio samples are are available at [demo page](https://metis-demo.github.io/).
+Audio samples are available at [demo page](https://metis-demo.github.io/).


<div align="center">
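
The hunk above describes Metis's recipe: unconditional masked generative pre-training on discrete SSL tokens, followed by task-specific fine-tuning with limited data and trainable parameters. As a rough illustration of that pre-training objective only, here is a minimal, hypothetical PyTorch sketch of masked token prediction; the names and hyperparameters (MaskedTokenLM, VOCAB_SIZE, MASK_ID, the mask-ratio range) are assumptions for illustration and are not Amphion's actual implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1024        # assumed size of the discrete SSL-token codebook
MASK_ID = VOCAB_SIZE     # extra id reserved for the [MASK] token


class MaskedTokenLM(nn.Module):
    """Bidirectional Transformer that predicts token ids at masked positions."""

    def __init__(self, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):                 # tokens: (batch, time) int64
        return self.head(self.encoder(self.embed(tokens)))


def pretrain_step(model, ssl_tokens, optimizer):
    """One unconditional masked-prediction step on a batch of SSL tokens."""
    mask_ratio = torch.rand(()).clamp(0.15, 0.95)       # random ratio per step
    mask = torch.rand(ssl_tokens.shape) < mask_ratio     # positions to hide
    inputs = ssl_tokens.masked_fill(mask, MASK_ID)

    logits = model(inputs)                                # (batch, time, vocab)
    loss = nn.functional.cross_entropy(logits[mask], ssl_tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = MaskedTokenLM()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    fake_tokens = torch.randint(0, VOCAB_SIZE, (2, 128))  # stand-in batch
    print("loss:", pretrain_step(model, fake_tokens, opt))
```

In the fine-tuning stage described above, the same backbone would additionally receive task-specific conditions (e.g. degraded-speech tokens or lip features) rather than training unconditionally; that conditioning path is omitted from this sketch.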
@@ -45,7 +45,7 @@ Metis is fully compatible with MaskGCT and shares several key model components w
| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
-| [Semantic2Acoustic](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens.
+| [Semantic2Acoustic](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |
<!-- | [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. | -->

We open-source the pretrained model checkpoint of the first stage of Metis (with masked generative pre-training), as well as the fine-tuned models for speech enhancement (SE), target speaker extraction (TSE), voice conversion (VC), lip-to-speech (L2S), and the unified multi-task (Omni) model.
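
The table in the hunk above lists component checkpoints hosted on the Hugging Face Hub. Below is a hedged sketch of fetching them with huggingface_hub: the repo id and subfolder names are taken from the linked URLs, while the download_metis_components helper and local_dir layout are hypothetical, and the code that actually builds the models from these files lives in Amphion and is not reproduced here.

```python
from huggingface_hub import snapshot_download


def download_metis_components(local_dir="./ckpt/maskgct"):
    """Fetch the shared MaskGCT component checkpoints used by Metis."""
    return snapshot_download(
        repo_id="amphion/MaskGCT",
        allow_patterns=[
            "semantic_codec/*",   # speech -> semantic tokens
            "acoustic_codec/*",   # speech <-> acoustic tokens / waveform
            "s2a_model/*",        # semantic tokens -> acoustic tokens
        ],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print("Checkpoints downloaded to:", download_metis_components())
```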
@@ -237,4 +237,4 @@ If you use Metis in your research, please cite the following paper:
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
-```
+```
