Allow downloading extra formats in the demo #617
@@ -7,7 +7,7 @@
import gradio as gr
import torch
from gradio.processing_utils import convert_to_16_bit_wav
import torchaudio
from loguru import logger

from everyvoice.config.type_definitions import TargetTrainingTextRepresentationLevel
@@ -18,11 +18,23 @@
    FastSpeech2,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.prediction_writing_callback import (
    PredictionWritingOfflineRASCallback,
    PredictionWritingReadAlongCallback,
    PredictionWritingSpecCallback,
    PredictionWritingTextGridCallback,
    PredictionWritingWavCallback,
    get_tokens_from_duration_and_labels,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
    SynthesizeOutputFormats,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.utils import (
    truncate_basename,
)
from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
    load_hifigan_from_checkpoint,
)
from everyvoice.utils import slugify
from everyvoice.utils.heavy import get_device_from_accelerator

os.environ["no_proxy"] = "localhost,127.0.0.1,::1"
@@ -33,6 +45,7 @@
    duration_control,
    language,
    speaker,
    output_format,
    text_to_spec_model,
    vocoder_model,
    vocoder_config,
@@ -47,6 +60,7 @@
            "Text for synthesis was not provided. Please type the text you want to be synthesized into the textfield."
        )
    norm_text = normalize_text(text)
    basename = truncate_basename(slugify(text))
    if allowlist and norm_text not in allowlist:
        raise gr.Error(
            f"Oops, the word {text} is not allowed to be synthesized by this model. Please contact the model owner."
@@ -62,6 +76,8 @@
        raise gr.Error("Language is not selected. Please select a language.")
    if speaker is None:
        raise gr.Error("Speaker is not selected. Please select a speaker.")
    if output_format is None:
        raise gr.Error("Output format is not selected. Please select an output format.")
    config, device, predictions = synthesize_helper(
        model=text_to_spec_model,
        vocoder_model=vocoder_model,
@@ -71,9 +87,9 @@
        accelerator=accelerator,
        devices="1",
        device=device,
        global_step=1,
        vocoder_global_step=1,  # dummy value since the vocoder step is not used
        output_type=[],
        global_step=text_to_spec_model.config.training.max_steps,
        vocoder_global_step=vocoder_model.config.training.max_steps,
        output_type=[output_format],
        text_representation=TargetTrainingTextRepresentationLevel.characters,
        output_dir=output_dir,
        speaker=speaker,
@@ -91,16 +107,78 @@
        config=config,
        output_key=output_key,
        device=device,
        global_step=1,
        vocoder_global_step=1,  # dummy value since the vocoder step is not used
        global_step=text_to_spec_model.config.training.max_steps,
        vocoder_global_step=vocoder_model.config.training.max_steps,
        vocoder_model=vocoder_model,
        vocoder_config=vocoder_config,
    )
    # move to device because lightning accumulates predictions on cpu
    predictions[0][output_key] = predictions[0][output_key].to(device)
    wav, sr = wav_writer.synthesize_audio(predictions[0])
    torchaudio.save(
        wav_writer.get_filename(basename, speaker, language),
        # the vocoder output includes padding so we have to remove that
        wav[0],
        sr,
        format="wav",
        encoding="PCM_S",
        bits_per_sample=16,
    )
    wav_output = wav_writer.get_filename(basename, speaker, language)
    file_writer = None
    file_output = None
    if output_format == SynthesizeOutputFormats.readalong_html.name:
This and all the other callback constructor calls really ought to be able to be replaced by a single call to [...]. I realize you need the wav writer to create the RAS_html writer, but that's already done too. Refactoring suggestion: have [...]

Yea, this is a good idea. I would definitely be in favour of this refactor, which I think would clean up some of the acrobatics we're currently doing in order to get the wav writer separately.
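A rough sketch of what that refactor could look like, purely illustrative: the helper name `get_output_writers` and its exact signature are hypothetical (they are not the current EveryVoice API), but the constructor arguments mirror the ones already used in this diff. The demo would then ask a single function for every writer it needs, including the wav writer that the RAS/HTML writer wraps:

```python
# Hypothetical sketch only: get_output_writers is not an existing EveryVoice
# function; its name and signature are illustrative. It builds the wav writer
# once, reuses it for the writers that need it, and returns everything keyed
# by output format so callers stop duplicating constructor calls.
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.prediction_writing_callback import (
    PredictionWritingOfflineRASCallback,
    PredictionWritingWavCallback,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
    SynthesizeOutputFormats,
)


def get_output_writers(
    output_formats,
    *,
    config,
    output_dir,
    output_key,
    device,
    global_step,
    vocoder_model,
    vocoder_config,
    vocoder_global_step,
):
    # Always build the wav writer: it is needed for audio output and by the
    # offline RAS writer below.
    wav_writer = PredictionWritingWavCallback(
        config=config,
        output_dir=output_dir,
        output_key=output_key,
        device=device,
        global_step=global_step,
        vocoder_model=vocoder_model,
        vocoder_config=vocoder_config,
        vocoder_global_step=vocoder_global_step,
    )
    writers = {SynthesizeOutputFormats.wav: wav_writer}
    if SynthesizeOutputFormats.readalong_html in output_formats:
        writers[SynthesizeOutputFormats.readalong_html] = PredictionWritingOfflineRASCallback(
            config=config,
            global_step=global_step,
            output_dir=output_dir,
            output_key=output_key,
            wav_callback=wav_writer,  # reuse instead of rebuilding
        )
    # ...and the same pattern for readalong_xml, spec, and textgrid...
    return writers
```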
        file_writer = PredictionWritingOfflineRASCallback(
            config=config,
            global_step=text_to_spec_model.config.training.max_steps,
            output_dir=output_dir,
            output_key=output_key,
            wav_callback=wav_writer,
        )

    if output_format == SynthesizeOutputFormats.readalong_xml.name:
        file_writer = PredictionWritingReadAlongCallback(
            config=config,
            global_step=text_to_spec_model.config.training.max_steps,
            output_dir=output_dir,
            output_key=output_key,
        )

    return sr, convert_to_16_bit_wav(wav.numpy())
    if output_format == SynthesizeOutputFormats.spec.name:
        file_writer = PredictionWritingSpecCallback(
            config=config,
            global_step=text_to_spec_model.config.training.max_steps,
            output_dir=output_dir,
            output_key=output_key,
        )

    if output_format == SynthesizeOutputFormats.textgrid.name:
        file_writer = PredictionWritingTextGridCallback(
            config=config,
            global_step=text_to_spec_model.config.training.max_steps,
            output_dir=output_dir,
            output_key=output_key,
        )
    if file_writer is not None:
        max_seconds, phones, words = get_tokens_from_duration_and_labels(
            predictions[0]["duration_prediction"][0],
            predictions[0]["text_input"][0],
            text,
            text_to_spec_model.text_processor,
            text_to_spec_model.config,
        )

        file_writer.save_aligned_text_to_file(
            basename=basename,
            speaker=speaker,
            language=language,
            max_seconds=max_seconds,
            phones=phones,
            words=words,
        )
        file_output = file_writer.get_filename(basename, speaker, language)

    return wav_output, file_output


def require_ffmpeg():
@@ -162,6 +240,7 @@
    spec_to_wav_model_path,
    languages,
    speakers,
    outputs,
    output_dir,
    accelerator,
    allowlist: list[str] = [],

@@ -193,8 +272,10 @@
    )
    model_languages = list(model.lang2id.keys())
    model_speakers = list(model.speaker2id.keys())
    possible_outputs = [x.name for x in SynthesizeOutputFormats]
    lang_list = []
    speak_list = []
    output_list = []
    if languages == ["all"]:
        lang_list = model_languages
    else:
@@ -215,6 +296,16 @@
            print(
                f"Attention: The model has not been trained for speech synthesis with '{speaker}' speaker. The '{speaker}' speaker option will not be available for selection."
            )
    if outputs == ["all"]:
        output_list = possible_outputs
    else:
        for output in outputs:
            if output in possible_outputs:
                output_list.append(output)
            else:
                print(
                    f"Attention: This model is not able to produce '{output}' as an output. The '{output}' option will not be available for selection. Please choose from the following possible outputs: {', '.join(possible_outputs)}"
                )
This needs to be a fatal error with an immediate exit, and the message is misleading: it's not that the model can't produce the requested output, it's that the software has no implementation for it. Right now, [...]

This is really CLI error checking; it should happen much earlier in this function, in particular before we load any checkpoint, so the error is dumped right away without having to wait 20 seconds or more for models to load first. You might get all this for nearly free if you define the list of valid values for [...]

BTW, the RAS output specifiers are [...]

agreed!
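To make the "nearly free" suggestion concrete, here is a minimal sketch, assuming the demo command is defined with typer (the parameter names and defaults are illustrative, not the actual everyvoice CLI definition). Typing the option as the SynthesizeOutputFormats enum makes click/typer reject invalid values and enumerate the valid choices in the help text, before any checkpoint is loaded:

```python
# Sketch only: option and parameter names are illustrative, not the real
# everyvoice demo signature. An Enum-typed option is validated at parse time,
# so a bad value fails immediately instead of after models have loaded.
import typer

from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
    SynthesizeOutputFormats,
)

app = typer.Typer()


@app.command()
def demo(
    output_format: SynthesizeOutputFormats = typer.Option(
        ...,  # required; click validates it against the enum and exits on error
        "--output-format",
        help="Output format to expose in the demo.",
    ),
):
    # By the time we get here the value is guaranteed to be a valid
    # SynthesizeOutputFormats member, so no long wait just to learn that
    # the spelling was wrong.
    typer.echo(f"Selected output format: {output_format.name}")


if __name__ == "__main__":
    app()
```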
    if lang_list == []:
        raise ValueError(
            f"Language option has been activated, but valid languages have not been provided. The model has been trained in {model_languages} languages. Please select either 'all' or at least some of them."

@@ -227,6 +318,8 @@
    interactive_lang = len(lang_list) > 1
    default_speak = speak_list[0]
    interactive_speak = len(speak_list) > 1
    default_output = output_list[0]
    interactive_output = len(output_list) > 1
    with gr.Blocks() as demo:
        gr.Markdown(
            """

@@ -255,12 +348,20 @@
                    interactive=interactive_speak,
                    label="Speaker",
                )
                with gr.Row():
                    output_format = gr.Dropdown(
                        choices=output_list,
                        value=default_output,
                        interactive=interactive_output,
                        label="Output Format",
                    )
                btn = gr.Button("Synthesize")
            with gr.Column():
                out_audio = gr.Audio(format="mp3")
                out_audio = gr.Audio(format="wav")
                out_file = gr.File(label="File Output")
        btn.click(
            synthesize_audio_preset,
            inputs=[inp_text, inp_slider, inp_lang, inp_speak],
            outputs=[out_audio],
            inputs=[inp_text, inp_slider, inp_lang, inp_speak, output_format],
            outputs=[out_audio, out_file],
        )
    return demo
Other files changed in this pull request:
+1 −0 fs2/model.py
+76 −56 fs2/prediction_writing_callback.py
It would be helpful to enumerate the valid options in the help message.

Ditto for `--accelerator`, while I'm thinking about it... And for `--language` and `--speaker` we should state that they have to be language(s) and speaker(s) known to the model.

For this PR, please address listing valid values for `--output-format`; fixing the other help messages is gravy and could go into a separate PR or issue.

I think they do get listed, don't they? Like if you type a speaker that doesn't exist, I thought the error message listed out all the possible speakers. The output formats are dependent on the version of everyvoice installed, so we could include that in the help message, but the language and speaker are model-dependent, so we wouldn't be able to include the lists of those in the help message, just in the error message.

What I mean is that the `everyvoice demo -h` message should say something like "valid values are the language(s) and speaker(s) the model was trained on", or something to that effect, maybe more concisely. As the documentation stands, if you're not familiar with things yet, it's a bit mysterious how you're supposed to know what values you can use there. And I know if you've just trained things, it's going to be obvious, but the point of the help message is to support you when the information is not already obvious to you.
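For concreteness, one possible shape for those help strings, again sketched with typer; the option names and wording are purely illustrative and may differ from the actual everyvoice demo signature:

```python
# Illustrative only: a sketch of the help-text wording suggested above,
# not the actual everyvoice demo CLI definition.
from typing import List

import typer

app = typer.Typer()


@app.command()
def demo(
    language: List[str] = typer.Option(
        ["all"],
        "--language",
        help="Language(s) to expose in the demo. Valid values are the "
        "language(s) the model was trained on, or 'all'.",
    ),
    speaker: List[str] = typer.Option(
        ["all"],
        "--speaker",
        help="Speaker(s) to expose in the demo. Valid values are the "
        "speaker(s) the model was trained on, or 'all'.",
    ),
):
    # The help strings above show up verbatim in `demo -h`, which is all the
    # reviewer is asking for: a pointer to where the valid values come from.
    typer.echo(f"languages={language} speakers={speaker}")


if __name__ == "__main__":
    app()
```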