tts : add OuteTTS support #10784
Wow... nice ;) Now only the implementation of multimodal models like vision is left, and we're done ;-D
and-we-are-done.mp4
Awesome! Really excited to see it running natively 😊
natively.mp4
Here is a longer generation:
longer.mp4
Not sure how to pass punctuation yet, or even whether this model supports it.
punctuation.mp4
This is great. Would love to see a video tutorial on how to set up Ollama with this.
ollama.mp4
Out of curiosity, does it make sense to combine both llm+voc into one gguf? I'm thinking about the idea of having
Maybe we can add support to pack multiple models in a single GGUF.
The current models don't support special characters yet. I plan to add support for this in the next release. For now, the interface strips them.
Great, looking forward to this. And many thanks and admiration for this work 👍
#include <vector>
#include <fstream>
#include <thread>
Here's a suggestion for the text preprocessing implementation, based on how it's currently done in the library.
#include <algorithm> // std::transform
#include <cctype>    // tolower
#include <string>
#include <vector>
#include <regex>
#include <stdexcept>
#include <sstream>
#include <map>
#include <iostream>
const std::map<int, std::string> ones = {
    {0, "zero"}, {1, "one"}, {2, "two"}, {3, "three"}, {4, "four"},
    {5, "five"}, {6, "six"}, {7, "seven"}, {8, "eight"}, {9, "nine"},
    {10, "ten"}, {11, "eleven"}, {12, "twelve"}, {13, "thirteen"}, {14, "fourteen"},
    {15, "fifteen"}, {16, "sixteen"}, {17, "seventeen"}, {18, "eighteen"}, {19, "nineteen"}
};
const std::map<int, std::string> tens = {
    {2, "twenty"}, {3, "thirty"}, {4, "forty"}, {5, "fifty"},
    {6, "sixty"}, {7, "seventy"}, {8, "eighty"}, {9, "ninety"}
};
// Convert a number less than 1000 to words
std::string convert_less_than_thousand(int num) {
    std::string result;
    if (num >= 100) {
        result += ones.at(num / 100) + " hundred ";
        num %= 100;
    }
    if (num >= 20) {
        result += tens.at(num / 10);
        if (num % 10 > 0) {
            result += "-" + ones.at(num % 10);
        }
    } else if (num > 0) {
        result += ones.at(num);
    }
    return result;
}
std::string number_to_words(const std::string& number_str) {
    try {
        size_t decimal_pos = number_str.find('.');
        std::string integer_part = number_str.substr(0, decimal_pos);
        int int_number = std::stoi(integer_part);
        std::string result;
        if (int_number == 0) {
            result = "zero";
        } else {
            if (int_number >= 1000000000) {
                int billions = int_number / 1000000000;
                result += convert_less_than_thousand(billions) + " billion ";
                int_number %= 1000000000;
            }
            if (int_number >= 1000000) {
                int millions = int_number / 1000000;
                result += convert_less_than_thousand(millions) + " million ";
                int_number %= 1000000;
            }
            if (int_number >= 1000) {
                int thousands = int_number / 1000;
                result += convert_less_than_thousand(thousands) + " thousand ";
                int_number %= 1000;
            }
            if (int_number > 0) {
                result += convert_less_than_thousand(int_number);
            }
        }
        // Handle decimal part
        if (decimal_pos != std::string::npos) {
            result += " point";
            std::string decimal_part = number_str.substr(decimal_pos + 1);
            for (char digit : decimal_part) {
                result += " " + ones.at(digit - '0');
            }
        }
        return result;
    } catch (const std::exception& e) {
        // Skip if fails
        return " ";
    }
}
std::string replace_numbers_with_words(const std::string& input_text) {
    std::regex number_pattern(R"(\d+(\.\d+)?)");
    std::string result;
    auto it = std::sregex_iterator(input_text.begin(), input_text.end(), number_pattern);
    auto end = std::sregex_iterator();
    size_t last_pos = 0;
    for (std::sregex_iterator i = it; i != end; ++i) {
        const std::smatch& match = *i;
        result.append(input_text, last_pos, match.position() - last_pos);
        result.append(number_to_words(match.str()));
        last_pos = match.position() + match.length();
    }
    result.append(input_text, last_pos);
    return result;
}
// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
std::string process_text(const std::string& text) {
    // For now I skipped text romanization as I am unsure how to handle
    // uroman and MeCab implementations in C++
    // maybe something like https://github.com/anyascii/anyascii/ could work.
    // currently only English would be supported in this function
    std::string processed_text = replace_numbers_with_words(text);
    std::transform(processed_text.begin(), processed_text.end(),
                   processed_text.begin(), ::tolower);
    std::regex special_chars(R"([-_/,\.\\])");
    processed_text = std::regex_replace(processed_text, special_chars, " ");
    std::regex non_alpha(R"([^a-z\s])");
    processed_text = std::regex_replace(processed_text, non_alpha, "");
    std::regex multiple_spaces(R"(\s+)");
    processed_text = std::regex_replace(processed_text, multiple_spaces, " ");
    processed_text = std::regex_replace(processed_text, std::regex(R"(^\s+|\s+$)"), "");
    /*
        Replace spaces with the separator token same as in line 365
        for (auto & c : prompt_user) {
        if (c == ' ') {
            prompt_clean += "<|text_sep|>";
    */
    processed_text = std::regex_replace(processed_text, std::regex(R"(\s)"), "<|text_sep|>");
    return processed_text;
}
I've consolidated WavTokenizer into a single model.py file and split the base model (1.75GB) into two components: https://huggingface.co/OuteAI/wavtokenizer-large-75token-interface/tree/main
This might help with the convert_pt_to_hf.py script. Here's the splitting code:
# model.py code...
def split_wav_tokenizer(model, save_directory):
    """Split WavTokenizer model and save components"""
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")
    encoder = WavEncoder(model.feature_extractor)
    encoder.save_pretrained(encoder_dir)
    # Stack the codebooks of all VQ layers into one tensor for the decoder
    codebook_weights = torch.cat(
        [vq.codebook for vq in model.feature_extractor.encodec.quantizer.vq.layers],
        dim=0
    )
    decoder = WavDecoder(model.backbone, model.head, codebook_weights)
    decoder.save_pretrained(decoder_dir)
Initial server support is now available using the examples/tts/tts-outetts.py script:
# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf --port 8020
# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf --port 8021 --embeddings --pooling none
# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"
The Python script is currently missing the spectrogram -> audio conversion. I don't know the best way to implement this, and importing PyTorch for it seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.
This is still WIP, as we'll refactor the endpoints to improve support for this before merging.
Planning to merge this later today. There is a lot that can be improved in the following aspects:
The primary goal of this PR was to see how viable it is to support TTS in
After merging this, I will focus on refactoring the
Awesome work! Would love to see more models like these supported in the future. This one comes to mind as a potential next candidate:
Finally llama.cpp is getting multimodal 😁
+1 for Fish Speech
* server : add "tokens" output ggml-ci
* server : output embeddings for all tokens when pooling = none ggml-ci
* server : be explicit about the pooling type in the tests ggml-ci
* server : do not normalize embeddings when there is no pooling ggml-ci
* llama : add OuteTTS support (wip)
* wip
* extract features
* first conv
* group norm
* resnet conv
* resnet
* attn
* pos net
* layer norm
* convnext
* head
* hann window
* fix n_embd + remove llama.cpp hacks
* compute hann window
* fft
* spectrum processing
* clean-up
* tts : receive input text and generate codes
* clip : fix new conv name
* tts : minor fix
* tts : add header + minor fixes ggml-ci
* tts : add matchematical constant ggml-ci
* tts : fix sampling + cut initial noise
* tts : fixes
* tts : update default samplers ggml-ci
* tts : text pre-processing
* tts : outetts-voc -> wavtokenizer-dec
* tts : remove hardcoded constants ggml-ci
* tts : fix tensor shapes
* llama : refactor wavtokenizer tensors ggml-ci
* cont ggml-ci
* cont [no ci]
* llama : update WavTokenizer to non-causal attn
* llama : handle no-vocab detokenization
* tts : add Python example for OuteTTS (wip)
* tts : extend python example to generate spectrogram ggml-ci
* server : fix rebase artifacts
* tts : enable "return_tokens" in Python example ggml-ci
* tts : minor fixes
* common : support HF download for vocoder
Great work! Can it support other languages?
@ylsdamxssjxxdd check out #10894 for a related discussion with code.
Something seems wrong when the input prompt is longer: I get word salad in response. The longer the prompt gets, the worse the output. Here's the prompt I used, the first 2 paragraphs from A Tale of Two Cities.
It produced a 30s audio clip, which seems to be under 2000 tokens.
Untitled.mp4
I was reading the OuteTTS model card and noticed the line
Also it seems like the
Another awful result, same prompt:
Untitled2.mp4
Untitled3.mp4
Thoughts on what's going on? I tried tweaking the generation params, but still got similar results. Adjusting the batch sizes and ctx size had no effect. Switching to greedy sampling made it worse. Is it an implementation issue, or is the model just like that?
My gut feeling, assuming the implementation is correct, is that the model's recall is too poor and it simply predicts the wrong token every so often, which cascades into incoherence the longer the output gets. Does official OuteTTS use any grammar to constrain the available outputs? We could perhaps cache the input and then force a recital of the same token after every
Indeed it is a recall problem. I have found a solution and will propose a PR to fix it shortly.
close #10173
Overview
This PR adds inference support for the OuteTTS vocoder (i.e. WavTokenizer) directly into libllama. This enables full text-to-speech generation using llama.cpp.
sorry.mp4
TTS requires 2 models to be provided: an LLM and a voice decoder. The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.
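The last step of that pipeline (spectrogram frame to time-domain samples) can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: it assumes each frame is available as per-bin magnitude and phase, and it uses a naive O(n^2) inverse real DFT instead of a fast FFT. The names `polar_frame` and `irdft` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: build the complex bins of one frame from
// per-bin magnitude and phase (assumed input layout, not taken from the PR).
std::vector<std::complex<float>> polar_frame(const std::vector<float> & mag,
                                             const std::vector<float> & phase) {
    assert(mag.size() == phase.size());
    std::vector<std::complex<float>> spec(mag.size());
    for (std::size_t k = 0; k < mag.size(); ++k) {
        spec[k] = std::polar(mag[k], phase[k]);
    }
    return spec;
}

// Hypothetical helper: naive inverse real DFT. `spec` holds the n/2 + 1
// non-negative-frequency bins of a real signal of even length n; the other
// bins are implied by conjugate symmetry.
std::vector<float> irdft(const std::vector<std::complex<float>> & spec, int n) {
    assert(n > 0 && n % 2 == 0);
    assert(spec.size() == static_cast<std::size_t>(n / 2 + 1));
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<float> out(n);
    for (int t = 0; t < n; ++t) {
        // The DC and Nyquist bins appear once, every other bin counts twice.
        double acc = spec[0].real();
        acc += spec[n / 2].real() * ((t % 2 == 0) ? 1.0 : -1.0);
        for (int k = 1; k < n / 2; ++k) {
            const double ang = two_pi * k * t / n;
            acc += 2.0 * (spec[k].real() * std::cos(ang) - spec[k].imag() * std::sin(ang));
        }
        out[t] = static_cast<float>(acc / n);
    }
    return out;
}
```

A frame whose only non-zero bin is bin 1 comes back as a single cosine period, which is a convenient sanity check for the scaling.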
Usage
Note that the sampling settings of the LLM might need some adjustments.
Server usage
Initial server support is available using the examples/tts/tts-outetts.py script. It requires starting 2 servers: one with the LLM and one with WavTokenizer.
The Python script is currently missing the spectrogram -> audio conversion. For a reference implementation of this post-processing, see the embd_to_audio() function in tts.cpp:
llama.cpp/examples/tts/tts.cpp, lines 190 to 270 in 29df666
I don't know the best way to implement this in a Python script, and importing PyTorch for it seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.
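For readers wondering what such post-processing involves without PyTorch, here is a dependency-free sketch of the windowed overlap-add step that typically follows the per-frame inverse FFT. It is an assumption-laden illustration, not a transcription of embd_to_audio(): it assumes the frames are already in the time domain, were windowed with a periodic Hann window at analysis time, and are spaced `n_hop` samples apart. `hann_window` and `overlap_add` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Periodic Hann window of length n.
std::vector<float> hann_window(int n) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<float> w(n);
    for (int i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(two_pi * i / n)));
    }
    return w;
}

// Overlap-add time-domain frames (each n_win samples, spaced n_hop apart).
// Every frame is multiplied by the window again, and the sum is divided by
// the accumulated squared window so overlapping regions keep unit gain.
std::vector<float> overlap_add(const std::vector<std::vector<float>> & frames,
                               int n_win, int n_hop) {
    assert(n_win > 0 && n_hop > 0);
    if (frames.empty()) {
        return {};
    }
    const std::vector<float> w = hann_window(n_win);
    const std::size_t n_out = (frames.size() - 1) * n_hop + n_win;
    std::vector<float> audio(n_out, 0.0f);
    std::vector<float> env(n_out, 0.0f);
    for (std::size_t f = 0; f < frames.size(); ++f) {
        assert(frames[f].size() == static_cast<std::size_t>(n_win));
        for (int i = 0; i < n_win; ++i) {
            audio[f * n_hop + i] += frames[f][i] * w[i];
            env[f * n_hop + i]   += w[i] * w[i];
        }
    }
    for (std::size_t i = 0; i < n_out; ++i) {
        if (env[i] > 1e-8f) {
            audio[i] /= env[i]; // skip samples where the window sum is ~0
        }
    }
    return audio;
}
```

The same two loops translate almost line for line to NumPy, which may be one way to avoid the PyTorch dependency in the Python example.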
TODO:
outetts-voc arch to wav-tokenizer