
Allow downloading extra formats in the demo #617

Merged
merged 8 commits into from
Jan 14, 2025
7 changes: 7 additions & 0 deletions everyvoice/cli.py
@@ -616,6 +616,12 @@ def demo(
"-s",
help="Specify speakers to be included in the demo. Example: everyvoice demo <path_to_text_to_spec_model> <path_to_spec_to_wav_model> --speaker speaker_1 --speaker Sue",
),
outputs: List[str] = typer.Option(
Member:
It would be helpful to enumerate the valid options in the help message.
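One way to do this is to build the help string from the formats enum itself, so the list can never drift out of sync with the code. A minimal sketch, assuming a stand-in enum (the real `SynthesizeOutputFormats` lives in `fs2.type_definitions` and its members may differ):

```python
from enum import Enum


# Stand-in for SynthesizeOutputFormats; member names are illustrative.
class SynthesizeOutputFormats(Enum):
    wav = "wav"
    spec = "spec"
    textgrid = "textgrid"
    readalong_xml = "readalong-xml"
    readalong_html = "readalong-html"


# Derive the valid-values list from the enum so the help text stays current.
valid = ", ".join(fmt.name for fmt in SynthesizeOutputFormats)
help_text = (
    "Specify output formats to be included in the demo. "
    f"Valid values: {valid}. "
    "Example: everyvoice demo <path_to_text_to_spec_model> "
    "<path_to_spec_to_wav_model> --output-format wav"
)
print(help_text)
```

The same pattern works for any option whose valid values are known at import time; model-dependent values (languages, speakers) can only be listed in the runtime error message.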

Member:

Ditto for --accelerator, while I'm thinking about it... And for --language and --speaker we should state that they have to be language(s) and speaker(s) known to the model.

Member:

For this PR, please address listing valid values for --output-format, fixing the other help messages is gravy and could go into a separate PR or issue.

Member Author:

> Ditto for --accelerator, while I'm thinking about it... And for --language and --speaker we should state that they have to be language(s) and speaker(s) known to the model.

I think they do get listed, don't they? Like if you type a speaker that doesn't exist, I thought the error message listed out all the possible speakers. The output formats are dependent on the version of everyvoice installed, so we could include that in the help message, but the language and speaker are model-dependent, so we wouldn't be able to include the lists of those in the help message, just in the error message.

Member:

What I mean is that the everyvoice demo -h message should say something like "valid values are the language(s) and speaker(s) the model was trained on", or something to that effect, maybe more concisely. As the documentation stands, if you're not familiar with things yet, it's a bit mysterious how you're supposed to know what values you can use there.
And I know that if you've just trained things, it's going to be obvious, but the point of the help message is to support you when the information is not already obvious to you.

["all"],
"--output-format",
"-O",
help="Specify output formats to be included in the demo. Example: everyvoice demo <path_to_text_to_spec_model> <path_to_spec_to_wav_model> --output-format wav",
),
output_dir: Path = typer.Option(
"synthesis_output",
"--output-dir",
@@ -652,6 +658,7 @@ def demo(
spec_to_wav_model_path=spec_to_wav_model,
languages=languages,
speakers=speakers,
outputs=outputs,
output_dir=output_dir,
accelerator=accelerator,
allowlist=allowlist_data,
121 changes: 111 additions & 10 deletions everyvoice/demo/app.py
@@ -7,7 +7,7 @@

import gradio as gr
import torch
from gradio.processing_utils import convert_to_16_bit_wav
import torchaudio
from loguru import logger

from everyvoice.config.type_definitions import TargetTrainingTextRepresentationLevel
@@ -18,11 +18,23 @@
FastSpeech2,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.prediction_writing_callback import (
PredictionWritingOfflineRASCallback,
PredictionWritingReadAlongCallback,
PredictionWritingSpecCallback,
PredictionWritingTextGridCallback,
PredictionWritingWavCallback,
get_tokens_from_duration_and_labels,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.type_definitions import (
SynthesizeOutputFormats,
)
from everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.utils import (
truncate_basename,
)
from everyvoice.model.vocoder.HiFiGAN_iSTFT_lightning.hfgl.utils import (
load_hifigan_from_checkpoint,
)
from everyvoice.utils import slugify
from everyvoice.utils.heavy import get_device_from_accelerator

os.environ["no_proxy"] = "localhost,127.0.0.1,::1"
@@ -33,6 +45,7 @@
duration_control,
language,
speaker,
output_format,
text_to_spec_model,
vocoder_model,
vocoder_config,
@@ -47,6 +60,7 @@
"Text for synthesis was not provided. Please type the text you want to be synthesized into the textfield."
)
norm_text = normalize_text(text)
basename = truncate_basename(slugify(text))

if allowlist and norm_text not in allowlist:
raise gr.Error(
f"Oops, the word {text} is not allowed to be synthesized by this model. Please contact the model owner."
@@ -62,6 +76,8 @@
raise gr.Error("Language is not selected. Please select a language.")
if speaker is None:
raise gr.Error("Speaker is not selected. Please select a speaker.")
if output_format is None:
        raise gr.Error("Output format is not selected. Please select an output format.")

config, device, predictions = synthesize_helper(
model=text_to_spec_model,
vocoder_model=vocoder_model,
@@ -71,9 +87,9 @@
accelerator=accelerator,
devices="1",
device=device,
global_step=1,
vocoder_global_step=1, # dummy value since the vocoder step is not used
output_type=[],
global_step=text_to_spec_model.config.training.max_steps,
vocoder_global_step=vocoder_model.config.training.max_steps,
output_type=[output_format],
text_representation=TargetTrainingTextRepresentationLevel.characters,
output_dir=output_dir,
speaker=speaker,
@@ -91,16 +107,78 @@
config=config,
output_key=output_key,
device=device,
global_step=1,
vocoder_global_step=1, # dummy value since the vocoder step is not used
global_step=text_to_spec_model.config.training.max_steps,
vocoder_global_step=vocoder_model.config.training.max_steps,
vocoder_model=vocoder_model,
vocoder_config=vocoder_config,
)
# move to device because lightning accumulates predictions on cpu
predictions[0][output_key] = predictions[0][output_key].to(device)
wav, sr = wav_writer.synthesize_audio(predictions[0])
torchaudio.save(

wav_writer.get_filename(basename, speaker, language),
# the vocoder output includes padding so we have to remove that
wav[0],
sr,
format="wav",
encoding="PCM_S",
bits_per_sample=16,
)
wav_output = wav_writer.get_filename(basename, speaker, language)
file_writer = None
file_output = None

if output_format == SynthesizeOutputFormats.readalong_html.name:
Member:
This and all the other callback constructor calls really ought to be replaceable by a single call to get_synthesis_output_callbacks, no?

I realize you need the wav writer to create the RAS_html writer, but that's already done too.

Refactoring suggestion: have get_synthesis_output_callbacks return a dict where the key are the output types, and the values are the writers. Then here you can use writers["wav"] and writers[output_format] to access the two writers you need, having passed ["wav", output_format] (or the proper Enum form if need be) as the output_type argument to get_synthesis_output_callbacks.

Member Author:
yea, this is a good idea. I would definitely be in favour of this refactor, which I think would clean up some of the acrobatics we're currently doing in order to get the wav writer separately.
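The suggested refactor might look something like the sketch below. The names and class bodies are illustrative stand-ins, not the actual EveryVoice API; the point is the shape: `get_synthesis_output_callbacks` returns a dict keyed by output format, and callers index into it instead of constructing writers themselves.

```python
from enum import Enum


class OutputFormat(Enum):
    wav = "wav"
    readalong_html = "readalong-html"


class WavWriter:
    """Stand-in for PredictionWritingWavCallback."""


class RasHtmlWriter:
    """Stand-in for PredictionWritingOfflineRASCallback, which needs the wav writer."""

    def __init__(self, wav_callback: WavWriter):
        self.wav_callback = wav_callback


def get_synthesis_output_callbacks(output_types: list[OutputFormat]) -> dict[OutputFormat, object]:
    writers: dict[OutputFormat, object] = {}
    # A wav writer is needed both for direct wav output and as a dependency
    # of the read-along HTML writer.
    if OutputFormat.wav in output_types or OutputFormat.readalong_html in output_types:
        writers[OutputFormat.wav] = WavWriter()
    if OutputFormat.readalong_html in output_types:
        writers[OutputFormat.readalong_html] = RasHtmlWriter(
            wav_callback=writers[OutputFormat.wav]
        )
    return writers


# The demo would then pass ["wav", output_format] and pick out both writers:
writers = get_synthesis_output_callbacks([OutputFormat.wav, OutputFormat.readalong_html])
wav_writer = writers[OutputFormat.wav]
file_writer = writers[OutputFormat.readalong_html]
```

This removes the per-format `if` chains from the demo and keeps the wav-writer wiring in one place.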

file_writer = PredictionWritingOfflineRASCallback(

config=config,
global_step=text_to_spec_model.config.training.max_steps,
output_dir=output_dir,
output_key=output_key,
wav_callback=wav_writer,
)

if output_format == SynthesizeOutputFormats.readalong_xml.name:
file_writer = PredictionWritingReadAlongCallback(

config=config,
global_step=text_to_spec_model.config.training.max_steps,
output_dir=output_dir,
output_key=output_key,
)

return sr, convert_to_16_bit_wav(wav.numpy())
if output_format == SynthesizeOutputFormats.spec.name:
file_writer = PredictionWritingSpecCallback(

config=config,
global_step=text_to_spec_model.config.training.max_steps,
output_dir=output_dir,
output_key=output_key,
)

if output_format == SynthesizeOutputFormats.textgrid.name:
file_writer = PredictionWritingTextGridCallback(

config=config,
global_step=text_to_spec_model.config.training.max_steps,
output_dir=output_dir,
output_key=output_key,
)
if file_writer is not None:
max_seconds, phones, words = get_tokens_from_duration_and_labels(

predictions[0]["duration_prediction"][0],
predictions[0]["text_input"][0],
text,
text_to_spec_model.text_processor,
text_to_spec_model.config,
)

file_writer.save_aligned_text_to_file(

basename=basename,
speaker=speaker,
language=language,
max_seconds=max_seconds,
phones=phones,
words=words,
)
file_output = file_writer.get_filename(basename, speaker, language)


return wav_output, file_output



def require_ffmpeg():
@@ -162,6 +240,7 @@
spec_to_wav_model_path,
languages,
speakers,
outputs,
output_dir,
accelerator,
allowlist: list[str] = [],
@@ -193,8 +272,10 @@
)
model_languages = list(model.lang2id.keys())
model_speakers = list(model.speaker2id.keys())
possible_outputs = [x.name for x in SynthesizeOutputFormats]

lang_list = []
speak_list = []
output_list = []

if languages == ["all"]:
lang_list = model_languages
else:
@@ -215,6 +296,16 @@
print(
                f"Attention: The model has not been trained for speech synthesis with the '{speaker}' speaker. The '{speaker}' speaker option will not be available for selection."
)
if outputs == ["all"]:
output_list = possible_outputs

else:
for output in outputs:
if output in possible_outputs:
output_list.append(output)

else:
print(

f"Attention: This model is not able to produce '{output}' as an output. The '{output}' option will not be available for selection. Please choose from the following possible outputs: {', '.join(possible_outputs)}"
)
Member:
This needs to be a fatal error with an immediate exit, and the message is misleading: it's not that the model can't produce the requested output, it's that the software has no implementation for it. Right now, everyvoice demo -O foo fs2.ckpt voc.ckpt prints this message about foo and then continues anyway and crashes with an exception a few lines later.

This is really CLI error checking, it should happen much earlier in this function, in particular before we load any checkpoint, so the error is dumped right away without having to wait 20 seconds or more for models to load first. You might get all this for nearly free if you define the list of valid values for outputs in cli.py's demo() function as I already suggested elsewhere.
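The fail-fast validation being asked for could be sketched as below. This is an assumption about the eventual implementation, with a stand-in enum: the real `SynthesizeOutputFormats` lives in `fs2.type_definitions`, and in practice typer can do this for free if the option is typed with the enum itself.

```python
from enum import Enum


class SynthesizeOutputFormats(Enum):  # stand-in for the real enum
    wav = "wav"
    spec = "spec"
    textgrid = "textgrid"


def validate_outputs(outputs: list[str]) -> list[str]:
    """Reject unknown formats immediately, before any checkpoint is loaded."""
    valid = {fmt.name for fmt in SynthesizeOutputFormats}
    if outputs == ["all"]:
        return sorted(valid)
    bad = [o for o in outputs if o not in valid]
    if bad:
        # Exit right away instead of printing a warning and crashing later.
        raise SystemExit(
            f"Unknown --output-format value(s): {', '.join(bad)}. "
            f"Valid values are: all, {', '.join(sorted(valid))}"
        )
    return outputs
```

Calling this at the top of the demo entry point means `everyvoice demo -O foo ...` fails in milliseconds rather than after the models load.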

Member:
BTW, the RAS output specifiers are readalong_xml and readalong_html with an underscore instead of a hyphen like in everyvoice synthesize from-text. They should be unified, using hyphens here too.

Member Author:
agreed!
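One complication with unifying on hyphens: Python identifiers can't contain them, so the enum member *name* has to keep the underscore while the user-facing *value* carries the hyphen. A hedged sketch of one way to reconcile the two spellings (the real enum and CLI wiring may differ):

```python
from enum import Enum


class SynthesizeOutputFormats(str, Enum):
    # Member names keep the underscore (required by Python syntax); the
    # values use the hyphen, matching `everyvoice synthesize from-text`.
    readalong_xml = "readalong-xml"
    readalong_html = "readalong-html"


def normalize_format(fmt: str) -> str:
    """Accept either spelling on the CLI and normalize to the hyphenated form."""
    return SynthesizeOutputFormats(fmt.replace("_", "-")).value
```

With `str, Enum`, the value compares equal to the plain string, so the hyphenated form can flow through existing string-based code paths unchanged.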

if lang_list == []:
raise ValueError(
f"Language option has been activated, but valid languages have not been provided. The model has been trained in {model_languages} languages. Please select either 'all' or at least some of them."
@@ -227,6 +318,8 @@
interactive_lang = len(lang_list) > 1
default_speak = speak_list[0]
interactive_speak = len(speak_list) > 1
default_output = output_list[0]
interactive_output = len(output_list) > 1

with gr.Blocks() as demo:
gr.Markdown(
"""
@@ -255,12 +348,20 @@
interactive=interactive_speak,
label="Speaker",
)
with gr.Row():
output_format = gr.Dropdown(

choices=output_list,
value=default_output,
interactive=interactive_output,
label="Output Format",
)
btn = gr.Button("Synthesize")
with gr.Column():
out_audio = gr.Audio(format="mp3")
out_audio = gr.Audio(format="wav")
out_file = gr.File(label="File Output")

btn.click(
synthesize_audio_preset,
inputs=[inp_text, inp_slider, inp_lang, inp_speak],
outputs=[out_audio],
inputs=[inp_text, inp_slider, inp_lang, inp_speak, output_format],
outputs=[out_audio, out_file],
)
return demo