tts : add OuteTTS support #10784

Merged: 45 commits, Dec 18, 2024

@ggerganov (Owner) commented Dec 11, 2024

close #10173

Overview

This PR adds inference support for the OuteTTS vocoder (i.e. WavTokenizer) directly into libllama. This enables full text-to-speech generation using llama.cpp.

# generate output.wav
llama-tts \
    --hf-repo OuteAI/OuteTTS-0.2-500M-GGUF \
    --hf-file OuteTTS-0.2-500M-Q8_0.gguf \
    --hf-repo-v ggml-org/WavTokenizer \
    --hf-file-v WavTokenizer-Large-75-F16.gguf \
    -p "I am sorry Dave, I'm afraid I can't do that."

# play the generated audio
ffplay output.wav
sorry.mp4

TTS requires 2 models to be provided: an LLM and a voice decoder. The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.

Usage

# this will produce F16 LLM model (~1 GB)
mkdir models/outetts-0.2-0.5B-llm
python convert_hf_to_gguf.py OuteAI/OuteTTS-0.2-500M/ --outfile models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf --outtype f16

# this will produce Q8_0 LLM model (~500 MB)
llama-quantize models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf q8_0
# convert PT -> HF
python examples/tts/convert_pt_to_hf.py ./WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt

# convert HF -> GGUF (~250 MB)
mkdir models/wavtokenizer-large-75
python convert_hf_to_gguf.py WavTokenizer-large-speech-75token/ --outfile models/wavtokenizer-large-75/ggml-model-f16.gguf --outtype f16

- Generate speech from text using the `llama-tts` example:

```bash
llama-tts \
    -m  ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Hello world"
```

Note that the sampling settings of the LLM might need some adjustments.
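
The approximate file sizes in the conversion comments above (~1 GB for F16, ~500 MB for Q8_0) can be sanity-checked with a little arithmetic. This is a rough sketch, not a measurement: `N_PARAMS` is an assumption read off the "500M" in the model name, and `approx_size_mb` is a hypothetical helper that ignores metadata and any tensors kept at higher precision.

```python
# Rough GGUF size estimate from parameter count and bits per weight.
# N_PARAMS is an assumption taken from the "500M" in the model name.
N_PARAMS = 500e6

# F16 stores 16 bits per weight. Q8_0 stores blocks of 32 int8 weights
# plus one 16-bit scale, i.e. (32*8 + 16) / 32 = 8.5 bits per weight.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": (32 * 8 + 16) / 32}


def approx_size_mb(n_params, bits_per_weight):
    """Approximate tensor data size in MB (10^6 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e6


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{approx_size_mb(N_PARAMS, bpw):.0f} MB")
```

Under these assumptions the estimate comes out at about 1000 MB for F16 and about 531 MB for Q8_0, which is in line with the figures quoted in the comments.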

Server usage

Initial server support is available using the examples/tts/tts-outetts.py script. It requires starting two servers: one with the LLM and one with WavTokenizer:

# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf --port 8020

# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf --port 8021 --embeddings --pooling none

# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"

The Python script is currently missing the spectrogram -> audio conversion. For a reference implementation of this post-processing, see:

  • The original Python code: https://github.com/edwko/OuteTTS/blob/f43afd9fcc61baf18da0664ebe0e3ac0ebbb3814/outetts/wav_tokenizer/decoder/heads.py#L24-L67
  • Or the embd_to_audio() function in tts.cpp:
    // TODO: not optimized at all
    static std::vector<float> embd_to_audio(
            const float * embd,
            const int n_codes,
            const int n_embd,
            const int n_thread) {
        const int n_fft = 1280;
        const int n_hop = 320;
        const int n_win = 1280;
        const int n_pad = (n_win - n_hop)/2;
        const int n_out = (n_codes - 1)*n_hop + n_win;

        std::vector<float> hann(n_fft);
        fill_hann_window(hann.size(), true, hann.data());

        int n_spec = n_embd*n_codes;

        std::vector<float> E (n_spec);
        std::vector<float> S (n_spec);
        std::vector<float> ST(n_spec);

        for (int l = 0; l < n_codes; ++l) {
            for (int k = 0; k < n_embd; ++k) {
                E[k*n_codes + l] = embd[l*n_embd + k];
            }
        }

        for (int k = 0; k < n_embd/2; ++k) {
            for (int l = 0; l < n_codes; ++l) {
                float mag = E[(k           )*n_codes + l];
                float phi = E[(k + n_embd/2)*n_codes + l];

                mag = exp(mag);

                if (mag > 1e2) {
                    mag = 1e2;
                }

                S[2*(k*n_codes + l) + 0] = mag*cosf(phi);
                S[2*(k*n_codes + l) + 1] = mag*sinf(phi);
            }
        }

        for (int l = 0; l < n_codes; ++l) {
            for (int k = 0; k < n_embd/2; ++k) {
                ST[l*n_embd + 2*k + 0] = S[2*(k*n_codes + l) + 0];
                ST[l*n_embd + 2*k + 1] = S[2*(k*n_codes + l) + 1];
            }
        }

        std::vector<float> res  (n_codes*n_fft);
        std::vector<float> hann2(n_codes*n_fft);

        std::vector<std::thread> workers(n_thread);
        for (int i = 0; i < n_thread; ++i) {
            workers[i] = std::thread([&, i]() {
                for (int l = i; l < n_codes; l += n_thread) {
                    irfft(n_fft, ST.data() + l*n_embd, res.data() + l*n_fft);
                    for (int j = 0; j < n_fft; ++j) {
                        res  [l*n_fft + j] *= hann[j];
                        hann2[l*n_fft + j]  = hann[j] * hann[j];
                    }
                }
            });
        }
        for (int i = 0; i < n_thread; ++i) {
            workers[i].join();
        }

        std::vector<float> audio;
        std::vector<float> env;

        fold(res,   n_out, n_win, n_hop, n_pad, audio);
        fold(hann2, n_out, n_win, n_hop, n_pad, env); // TODO: can be done once

        for (size_t i = 0; i < audio.size(); ++i) {
            audio[i] /= env[i];
        }

        return audio;
    }

I don't know what the best way to implement this in a Python script is, and importing PyTorch for that seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.
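
One possible direction is a dependency-free port of `embd_to_audio()`. The following is only a sketch under stated assumptions, not the script's actual implementation. It assumes that each embedding row holds the log-magnitudes in its first half and the phases in its second half (as the C++ indexing suggests), that `fold()` amounts to an overlap-add followed by trimming `n_pad` samples from each end (its body is not shown here), and that `n_hop < n_fft`. The helper names `hann_window` and `irfft` are local to this sketch. The naive O(n²) inverse DFT is far too slow for real use; a library routine such as NumPy's `np.fft.irfft` would be the obvious replacement.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def irfft(spec, n):
    """Naive O(n^2) inverse real DFT; spec holds n//2 + 1 complex bins, n even."""
    half = n // 2
    out = []
    for t in range(n):
        # DC and Nyquist bins contribute only their real parts
        acc = spec[0].real + spec[half].real * (-1.0 if t % 2 else 1.0)
        for k in range(1, half):
            acc += 2.0 * (spec[k] * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(acc / n)
    return out


def embd_to_audio(embd, n_fft=1280, n_hop=320):
    """embd: list of n_codes rows, each [log-magnitudes..., phases...]."""
    n_codes = len(embd)
    n_bins = n_fft // 2 + 1
    n_pad = (n_fft - n_hop) // 2
    n_out = (n_codes - 1) * n_hop + n_fft
    max_log_mag = math.log(1e2)  # same clamp as `if (mag > 1e2)` in the C++ code

    hann = hann_window(n_fft)
    audio = [0.0] * n_out
    env = [0.0] * n_out

    for l, row in enumerate(embd):
        # polar (magnitude, phase) -> complex spectrum for this frame
        spec = [
            cmath.rect(math.exp(min(row[k], max_log_mag)), row[n_bins + k])
            for k in range(n_bins)
        ]
        frame = irfft(spec, n_fft)
        # windowed overlap-add, plus the squared-window envelope
        for j in range(n_fft):
            audio[l * n_hop + j] += frame[j] * hann[j]
            env[l * n_hop + j] += hann[j] * hann[j]

    # trim the padding (assumed equivalent to fold()) and normalize
    lo, hi = n_pad, n_out - n_pad
    return [a / e for a, e in zip(audio[lo:hi], env[lo:hi])]
```

With the default sizes the function returns `n_codes * n_hop` samples, which can then be written out as mono PCM with the standard `wave` module.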

TODO:

@github-actions bot added the examples, python, and ggml labels Dec 11, 2024

@mirek190 commented Dec 11, 2024

wow ...nice ;)

and implementation of multimodal models like a vision yet and we done ;-D

@ggerganov (Owner, Author)

> wow ...nice ;)
>
> and implementation multimodal models live vision yet and we done ;-D

and-we-are-done.mp4

@edwko commented Dec 11, 2024

Awesome! Really excited to see it running natively 😊

@ggerganov (Owner, Author)

> Awesome! Really excited to see it running natively

natively.mp4

@ggerganov (Owner, Author)

Here is a longer generation:

TTS requires 2 models to be provided: an LLM and a Vocoder(?). The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.

longer.mp4

Not sure how to pass punctuation yet. Or even if this model supports it.

punctuation.mp4

@jadams777

This is great. Would love to see a video tutorial on how to set up Ollama with this.

@ggerganov (Owner, Author)

> This is great. Would love to see a video tutorial on how to set up Ollama with this.

ollama.mp4

@ngxson (Collaborator) commented Dec 11, 2024

Out of curiosity, does it make sense to combine both llm+voc into one gguf? I'm thinking about the idea of having llama-voice-to-voice -m llama-3.1.gguf -mtts oute-tts.gguf -masr whisper.gguf, but maybe it's too early to think about that?

@ggerganov (Owner, Author)

Maybe we can add support to pack multiple models in a single GGUF.

@edwko commented Dec 11, 2024

> Not sure how to pass punctuation yet. Or even if this model supports it.
> punctuation.mp4

The current models don't support special characters yet. I plan to add support for this in the next release. For now, the interface strips them.

@ggerganov (Owner, Author)

Great, looking forward to this. And many thanks and admirations for this work 👍

#include <vector>
#include <fstream>
#include <thread>

@edwko commented Dec 12, 2024

Here's a suggestion for the text preprocessing implementation, based on how it's currently done in the library.

#include <string>
#include <vector>
#include <regex>
#include <stdexcept>
#include <sstream>
#include <map>
#include <iostream>
#include <algorithm> // std::transform
#include <cctype>    // ::tolower

const std::map<int, std::string> ones = {
    {0, "zero"}, {1, "one"}, {2, "two"}, {3, "three"}, {4, "four"},
    {5, "five"}, {6, "six"}, {7, "seven"}, {8, "eight"}, {9, "nine"},
    {10, "ten"}, {11, "eleven"}, {12, "twelve"}, {13, "thirteen"}, {14, "fourteen"},
    {15, "fifteen"}, {16, "sixteen"}, {17, "seventeen"}, {18, "eighteen"}, {19, "nineteen"}
};

const std::map<int, std::string> tens = {
    {2, "twenty"}, {3, "thirty"}, {4, "forty"}, {5, "fifty"},
    {6, "sixty"}, {7, "seventy"}, {8, "eighty"}, {9, "ninety"}
};

// Convert a number less than 1000 to words
std::string convert_less_than_thousand(int num) {
    std::string result;
    
    if (num >= 100) {
        result += ones.at(num / 100) + " hundred ";
        num %= 100;
    }
    
    if (num >= 20) {
        result += tens.at(num / 10);
        if (num % 10 > 0) {
            result += "-" + ones.at(num % 10);
        }
    } else if (num > 0) {
        result += ones.at(num);
    }
    
    return result;
}

std::string number_to_words(const std::string& number_str) {
    try {
        size_t decimal_pos = number_str.find('.');
        std::string integer_part = number_str.substr(0, decimal_pos);
        
        int int_number = std::stoi(integer_part);
        std::string result;
        
        if (int_number == 0) {
            result = "zero";
        } else {
            if (int_number >= 1000000000) {
                int billions = int_number / 1000000000;
                result += convert_less_than_thousand(billions) + " billion ";
                int_number %= 1000000000;
            }
            
            if (int_number >= 1000000) {
                int millions = int_number / 1000000;
                result += convert_less_than_thousand(millions) + " million ";
                int_number %= 1000000;
            }
            
            if (int_number >= 1000) {
                int thousands = int_number / 1000;
                result += convert_less_than_thousand(thousands) + " thousand ";
                int_number %= 1000;
            }
            
            if (int_number > 0) {
                result += convert_less_than_thousand(int_number);
            }
        }
        
        // Handle decimal part
        if (decimal_pos != std::string::npos) {
            result += " point";
            std::string decimal_part = number_str.substr(decimal_pos + 1);
            for (char digit : decimal_part) {
                result += " " + ones.at(digit - '0');
            }
        }
        
        return result;
    } catch (const std::exception& e) {
        // Skip if fails
        return " "; 
    }
}

std::string replace_numbers_with_words(const std::string& input_text) {
    std::regex number_pattern(R"(\d+(\.\d+)?)");
    std::string result;
    auto it = std::sregex_iterator(input_text.begin(), input_text.end(), number_pattern);
    auto end = std::sregex_iterator();

    size_t last_pos = 0;
    for (std::sregex_iterator i = it; i != end; ++i) {
        const std::smatch& match = *i;
        result.append(input_text, last_pos, match.position() - last_pos);
        result.append(number_to_words(match.str()));
        last_pos = match.position() + match.length();
    }
    result.append(input_text, last_pos);
    
    return result;
}

// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
std::string process_text(const std::string& text) {
    
    // For now I skipped text romanization as I am unsure how to handle
    // uroman and MeCab implementations in C++
    // maybe something like https://github.com/anyascii/anyascii/ could work.
    // currently only English would be supported in this function

    std::string processed_text = replace_numbers_with_words(text);

    std::transform(processed_text.begin(), processed_text.end(), 
                  processed_text.begin(), ::tolower);

    std::regex special_chars(R"([-_/,\.\\])");
    processed_text = std::regex_replace(processed_text, special_chars, " ");
    
    std::regex non_alpha(R"([^a-z\s])");
    processed_text = std::regex_replace(processed_text, non_alpha, "");
    
    std::regex multiple_spaces(R"(\s+)");
    processed_text = std::regex_replace(processed_text, multiple_spaces, " ");
    
    processed_text = std::regex_replace(processed_text, std::regex(R"(^\s+|\s+$)"), "");

    /*
        Replace spaces with the separator token same as in line 365

        for (auto & c : prompt_user) {
        if (c == ' ') {
            prompt_clean += "<|text_sep|>";
    */
    processed_text = std::regex_replace(processed_text, std::regex(R"(\s)"), "<|text_sep|>");

    return processed_text;
}
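
To make the effect of the cleanup stage concrete, here is a compact Python sketch of the same regex steps. It is an illustration only: `clean_text` and `TEXT_SEP` are names chosen for this sketch, and the number-to-words expansion is deliberately left out, so digits are simply dropped here rather than spelled out.

```python
import re

TEXT_SEP = "<|text_sep|>"


def clean_text(text):
    """Cleanup stage only; number-to-words expansion is deliberately omitted."""
    text = text.lower()
    text = re.sub(r"[-_/,.\\]", " ", text)    # separators become spaces
    text = re.sub(r"[^a-z\s]", "", text)      # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    return text.replace(" ", TEXT_SEP)        # words joined by the separator token


print(clean_text("Hello, World - it's TTS!"))
# hello<|text_sep|>world<|text_sep|>its<|text_sep|>tts
```

Note that the apostrophe is removed rather than replaced by a space, so "it's" becomes the single word "its".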

@edwko commented Dec 14, 2024

I've consolidated WavTokenizer into a single model.py file and split the base model (1.75GB) into two components:

https://huggingface.co/OuteAI/wavtokenizer-large-75token-interface/tree/main
encoder (82MB)
decoder (248MB)

Might help with the convert_pt_to_hf.py script.

Here's the splitting code:

# model.py code...

def split_wav_tokenizer(model, save_directory):
    """Split WavTokenizer model and save components"""
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")
    
    encoder = WavEncoder(model.feature_extractor)
    encoder.save_pretrained(encoder_dir)
    
    codebook_weights = torch.cat(
        [vq.codebook for vq in model.feature_extractor.encodec.quantizer.vq.layers],
        dim=0
    )
    decoder = WavDecoder(model.backbone, model.head, codebook_weights)
    decoder.save_pretrained(decoder_dir)

@ggerganov (Owner, Author) commented Dec 17, 2024

Initial server support is now available using the examples/tts/tts-outetts.py script. It requires starting two servers: one with the LLM and one with WavTokenizer:

# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf --port 8020

# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf --port 8021 --embeddings --pooling none

# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"

The Python script is currently missing the spectrogram -> audio conversion. I don't know what the best way to implement this is, and importing PyTorch for that seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.

This is still WIP as we'll refactor the endpoints to improve support for this, before merging.

@ggerganov ggerganov changed the base branch from master to gg/server-embeddings-all December 17, 2024 14:36
@ggerganov ggerganov force-pushed the gg/server-embeddings-all branch from 2230786 to 2a5510e Compare December 18, 2024 09:34
@ggerganov (Owner, Author)

Planning to merge this later today. There is a lot that can be improved in the following aspects:

  • Better conversion script for the WavTokenizer model
  • A more general TTS example (currently hacked just for OuteTTS)
  • Improve spectrogram post-processing implementation
  • Better server support + voice loading

The primary goal of this PR was to see how viable it is to support TTS in libllama and lay down some initial steps. The OuteTTS implementation did not require any major modifications to the API, so I think this is a good indication for integrating more TTS models in the future.

After merging this, I will focus on refactoring src/llama.cpp to make the code more modular and figure out how to improve the KV cache implementation.

@ggerganov ggerganov merged commit 0bf2d10 into master Dec 18, 2024
50 checks passed
@ggerganov ggerganov deleted the gg/tts-add-outetts branch December 18, 2024 17:27
@bachittle (Contributor)

Awesome work! Would love to see more models like these supported in the future. This one comes to mind as a potential next candidate:
https://huggingface.co/fishaudio/fish-speech-1.5
https://arxiv.org/abs/2411.01156

@mirek190

Finally llamacpp is getting multimodal 😁

@jadams777

> Awesome work! Would love to see more models like these supported in the future. This one comes to mind as a potential next candidate: https://huggingface.co/fishaudio/fish-speech-1.5 https://arxiv.org/abs/2411.01156

+1 for Fish Speech

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* server : add "tokens" output

ggml-ci

* server : output embeddings for all tokens when pooling = none

ggml-ci

* server : be explicit about the pooling type in the tests

ggml-ci

* server : do not normalize embeddings when there is no pooling

ggml-ci

* llama : add OuteTTS support (wip)

* wip

* extract features

* first conv

* group norm

* resnet conv

* resnet

* attn

* pos net

* layer norm

* convnext

* head

* hann window

* fix n_embd + remove llama.cpp hacks

* compute hann window

* fft

* spectrum processing

* clean-up

* tts : receive input text and generate codes

* clip : fix new conv name

* tts : minor fix

* tts : add header + minor fixes

ggml-ci

* tts : add matchematical constant

ggml-ci

* tts : fix sampling + cut initial noise

* tts : fixes

* tts : update default samplers

ggml-ci

* tts : text pre-processing

* tts : outetts-voc -> wavtokenizer-dec

* tts : remove hardcoded constants

ggml-ci

* tts : fix tensor shapes

* llama : refactor wavtokenizer tensors

ggml-ci

* cont

ggml-ci

* cont [no ci]

* llama : update WavTokenizer to non-causal attn

* llama : handle no-vocab detokenization

* tts : add Python example for OuteTTS (wip)

* tts : extend python example to generate spectrogram

ggml-ci

* server : fix rebase artifacts

* tts : enable "return_tokens" in Python example

ggml-ci

* tts : minor fixes

* common : support HF download for vocoder

@ylsdamxssjxxdd (Contributor)

Great work! Can it support other languages?

@Green-Sky (Collaborator)

@ylsdamxssjxxdd check out #10894 for a related discussion with code.

@LostRuins (Collaborator) commented Jan 10, 2025

Something seems wrong when the input prompt is longer: I get word salad in response. The longer the prompt gets, the worse the output.

Here's the prompt I used, the first 2 paragraphs from A Tale of Two Cities.

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever.

Produced a 30s audio, seems to be under 2000 tokens.

Untitled.mp4

I was reading the OuteTTS model card and noticed the line

The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total.

Also, it seems like n_ctx is set to 8192 here, but I'm not sure that has any impact. Has anyone else effectively tested up to 4096 ctx?
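
As a sanity check on the quoted figure, the "around 54 seconds" is consistent with a rate of 75 audio codes per second. The sketch below is hedged arithmetic: the 24 kHz sample rate is an assumption read off the "24k" in the WavTokenizer checkpoint file name, the hop of 320 samples is the `n_hop` constant in `embd_to_audio()`, and `max_audio_seconds` is a hypothetical helper. It gives an upper bound only, since text and separator tokens also occupy context.

```python
SAMPLE_RATE_HZ = 24_000  # assumption: the "24k" in the checkpoint file name
HOP_SAMPLES = 320        # n_hop in embd_to_audio()

# one audio code per hop -> codes per second of audio
CODES_PER_SECOND = SAMPLE_RATE_HZ / HOP_SAMPLES  # 75.0


def max_audio_seconds(n_ctx, n_non_audio_tokens=0):
    """Upper bound on audio length if all remaining context held audio codes."""
    return (n_ctx - n_non_audio_tokens) / CODES_PER_SECOND


print(round(max_audio_seconds(4096), 1))  # 54.6
```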

Another awful result, same prompt:

Untitled2.mp4
Untitled3.mp4

Thoughts on what's going on? I tried tweaking the gen params, but still got similar results. Adjusted batch sizes and ctx size, no effect. Tried switching to greedy sampling and it was worse. Is it an implementation issue, or is the model just like that?

My gut feel, assuming the implementation is correct, is that the model's recall is too poor and it simply predicts the wrong token every so often, which cascades down into incoherence the longer the output gets. Does official OuteTTS use any grammar to constrain the available outputs? We could perhaps cache the input and then force a recital of the same token after every <|code_end|>... Thoughts?
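
The "forced recital" idea can be sketched as a small decoding loop. This is only an illustration of the suggestion, not the fix that was eventually proposed. It assumes that each word's audio codes end with `<|code_end|>` and that the next item in the stream is the following word; `sample_next` is a hypothetical callback standing in for the LLM sampler, and tokens are plain strings for simplicity.

```python
CODE_END = "<|code_end|>"


def generate_with_recital(words, sample_next, max_tokens=4096):
    """Interleave forced input words with sampled audio-code tokens.

    words:       the cached input words, in order
    sample_next: hypothetical callback, token history -> next token
    """
    remaining = list(words)
    out = []
    if not remaining:
        return out

    out.append(remaining.pop(0))  # the first word is forced as well
    while len(out) < max_tokens:
        tok = sample_next(out)
        out.append(tok)
        if tok == CODE_END:
            if not remaining:
                break  # all words have been spoken
            # force the next word instead of letting the model recall it
            out.append(remaining.pop(0))
    return out


def toy_sampler(history):
    """Stand-in sampler: emits one audio code per word, then <|code_end|>."""
    return CODE_END if history[-1] == "<|1|>" else "<|1|>"


print(generate_with_recital(["hello", "world"], toy_sampler))
```

With the toy sampler this prints `['hello', '<|1|>', '<|code_end|>', 'world', '<|1|>', '<|code_end|>']`. The point of the design is that the model only ever samples audio codes, so a wrong word can no longer cascade into the rest of the output.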

@LostRuins (Collaborator)

Indeed it is a recall problem. I have found a solution and will propose a PR to fix it shortly.

Successfully merging this pull request may close these issues.

tts : add basic example for text-to-speech