Upper bound transformers for 1.2 (#2584)
* upper bound transformers and name change jarvis to riva

Signed-off-by: ericharper <[email protected]>

* upper bound transformers and name change jarvis to riva

Signed-off-by: ericharper <[email protected]>
ericharper authored Jul 30, 2021
1 parent f8e4b06 commit 9b36aae
Showing 3 changed files with 85 additions and 85 deletions.
requirements/requirements.txt (2 changes: 1 addition & 1 deletion)
@@ -10,7 +10,7 @@ ruamel.yaml
 scikit-learn
 omegaconf>=2.1.0
 hydra-core>=1.1.0
-transformers>=4.0.1
+transformers>=4.0.1,<=4.8.1
 sentencepiece<1.0.0
 webdataset>=0.1.48,<=0.1.62
 tqdm>=4.41.0
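The practical effect of the new specifier is that pip will now refuse any transformers release newer than 4.8.1. As a quick sanity check, here is a minimal sketch (not part of the commit) that tests an installed version against the new bound; it assumes the packaging library, which ships as a pip dependency, is available:

# Minimal sketch: check the installed transformers version against the
# new bound from requirements.txt. Not part of the commit.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

import transformers

bound = SpecifierSet(">=4.0.1,<=4.8.1")
installed = Version(transformers.__version__)
print(f"transformers {installed} satisfies '{bound}': {installed in bound}")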
tutorials/AudioTranslationSample.ipynb (166 changes: 83 additions & 83 deletions)
@@ -2,9 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "RYGnI-EZp_nK"
-   },
    "source": [
     "# Getting Started: Sample Conversational AI application\n",
     "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to construct a toy demo that translates a Mandarin audio file into an English one.\n",
@@ -15,49 +12,48 @@
     "* Transcribe audio with a (Mandarin) speech recognition model.\n",
     "* Translate text with a machine translation model.\n",
     "* Generate audio with text-to-speech models."
-   ]
+   ],
+   "metadata": {
+    "id": "RYGnI-EZp_nK"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "V72HXYuQ_p9a"
-   },
    "source": [
     "## Installation\n",
     "NeMo can be installed via a simple pip command.\n",
     "This will take about 4 minutes.\n",
     "\n",
     "(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)"
-   ]
+   ],
+   "metadata": {
+    "id": "V72HXYuQ_p9a"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "efDmTWf1_iYK"
-   },
-   "outputs": [],
    "source": [
     "BRANCH = 'r1.2.0'\n",
     "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "efDmTWf1_iYK"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "EyJ5HiiPrPKA"
-   },
    "source": [
     "## Import all necessary packages"
-   ]
+   ],
+   "metadata": {
+    "id": "EyJ5HiiPrPKA"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "tdUqxeUEA8nw"
-   },
-   "outputs": [],
    "source": [
     "# Import NeMo and its ASR, NLP and TTS collections\n",
     "import nemo\n",
@@ -69,13 +65,14 @@
     "import nemo.collections.tts as nemo_tts\n",
     "# We'll use this to listen to audio\n",
     "import IPython"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "tdUqxeUEA8nw"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "bt2EZyU3A1aq"
-   },
    "source": [
     "## Instantiate pre-trained NeMo models\n",
     "\n",
@@ -84,30 +81,28 @@
     "* ``list_available_models()`` - lists all models currently available on NGC and their names.\n",
     "\n",
     "* ``from_pretrained(...)`` - downloads and initializes a model directly from NGC using its name.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "bt2EZyU3A1aq"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "YNNHs5Xjr8ox",
-    "scrolled": true
-   },
-   "outputs": [],
    "source": [
     "# Here is an example of all CTC-based models:\n",
     "nemo_asr.models.EncDecCTCModel.list_available_models()\n",
     "# More ASR models are available - see: nemo_asr.models.ASRModel.list_available_models()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "YNNHs5Xjr8ox",
+    "scrolled": true
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "1h9nhICjA5Dk",
-    "scrolled": true
-   },
-   "outputs": [],
    "source": [
     "# Speech recognition model - Citrinet, initially trained on the Multilingual LibriSpeech English corpus and fine-tuned on the open-source Aishell-2 corpus\n",
     "asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"stt_zh_citrinet_1024_gamma_0_25\").cuda()\n",
@@ -117,24 +112,25 @@
     "spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name=\"tts_en_fastpitch\").cuda()\n",
     "# Vocoder model which takes the spectrogram and produces actual audio\n",
     "vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name=\"tts_hifigan\").cuda()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "1h9nhICjA5Dk",
+    "scrolled": true
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "KPota-JtsqSY"
-   },
    "source": [
     "## Get an audio sample in Mandarin"
-   ]
+   ],
+   "metadata": {
+    "id": "KPota-JtsqSY"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "7cGCEKkcLr52"
-   },
-   "outputs": [],
    "source": [
     "# Download an audio sample to try\n",
     "# This is a sample from the MCV 6.1 dev dataset - the model hasn't seen it before\n",
@@ -143,71 +139,71 @@
     "!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'\n",
     "# To listen to it, click on the play button below\n",
     "IPython.display.Audio(audio_sample)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "7cGCEKkcLr52"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "BaCdNJhhtBfM"
-   },
    "source": [
     "## Transcribe the audio file\n",
     "We will use the speech recognition model to convert audio into text.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "BaCdNJhhtBfM"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "KTA7jM6sL6yC"
-   },
-   "outputs": [],
    "source": [
     "transcribed_text = asr_model.transcribe([audio_sample])\n",
     "print(transcribed_text)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "KTA7jM6sL6yC"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "BjYb2TMtttCc"
-   },
    "source": [
     "## Translate Chinese text into English\n",
     "NeMo's NMT models have a handy ``.translate()`` method."
-   ]
+   ],
+   "metadata": {
+    "id": "BjYb2TMtttCc"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "kQTdE4b9Nm9O"
-   },
-   "outputs": [],
    "source": [
     "english_text = nmt_model.translate(transcribed_text)\n",
     "print(english_text)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "kQTdE4b9Nm9O"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "9Rppc59Ut7uy"
-   },
    "source": [
     "## Generate English audio from text\n",
     "Speech generation from text typically has two steps:\n",
     "* Generate a spectrogram from the text. In this example we will use the FastPitch model.\n",
     "* Generate actual audio from the spectrogram. In this example we will use the HifiGan model.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "9Rppc59Ut7uy"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "wpMYfufgNt15"
-   },
-   "outputs": [],
    "source": [
     "# A helper function which combines FastPitch and HifiGan to go directly from\n",
     "# text to audio\n",
@@ -216,26 +212,27 @@
     " spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)\n",
     " audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)\n",
     " return audio.to('cpu').detach().numpy()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "wpMYfufgNt15"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
-   "outputs": [],
    "source": [
     "# Listen to the generated audio in English\n",
     "IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {}
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "LiQ_GQpcBYUs"
-   },
    "source": [
     "## Next steps\n",
-    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Jarvis](https://developer.nvidia.com/nvidia-jarvis).\n",
+    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n",
     "\n",
     "**NeMo is built for training.** You can fine-tune, or train from scratch on your data, all of the models used in this example. We recommend you check out the following, more in-depth tutorials next:\n",
     "\n",
@@ -247,7 +244,10 @@
     "\n",
     "\n",
     "You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples)."
-   ]
+   ],
+   "metadata": {
+    "id": "LiQ_GQpcBYUs"
+   }
   }
  ],
  "metadata": {
@@ -277,4 +277,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 1
-}
+}
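Almost all of the 83 additions and 83 deletions above are a mechanical reordering of each cell's keys: "metadata" moves from before "source" to the end of the cell (after "outputs" in code cells). The only substantive edit is the Jarvis-to-Riva link update in the Next steps cell. Below is a minimal sketch of a script that would produce the same key reordering; this is an illustration only, as the commit does not say which tool was actually used:

import json

# Rewrite each cell so its keys follow the order seen in the new file:
# cell_type, execution_count, source, outputs, metadata.
path = "tutorials/AudioTranslationSample.ipynb"
with open(path, encoding="utf-8") as f:
    nb = json.load(f)

order = ["cell_type", "execution_count", "source", "outputs", "metadata"]
nb["cells"] = [{k: c[k] for k in order if k in c} for c in nb["cells"]]

with open(path, "w", encoding="utf-8") as f:
    json.dump(nb, f, indent=1, ensure_ascii=False)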
tutorials/VoiceSwapSample.ipynb (2 changes: 1 addition & 1 deletion)
@@ -269,7 +269,7 @@
    },
    "source": [
     "## Next steps\n",
-    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Jarvis](https://developer.nvidia.com/nvidia-jarvis).\n",
+    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n",
     "\n",
     "**NeMo is built for training.** You can fine-tune, or train from scratch on your data, all of the models used in this example. We recommend you check out the following, more in-depth tutorials next:\n",
     "\n",
