Upper bound transformers for 1.2 (#2584)
* upper bound transformers and name change jarvis to riva

Signed-off-by: ericharper <[email protected]>

* upper bound transformers and name change jarvis to riva

Signed-off-by: ericharper <[email protected]>
ericharper authored Jul 30, 2021
1 parent f8e4b06 commit 9b36aae
Showing 3 changed files with 85 additions and 85 deletions.
requirements/requirements.txt (2 changes: 1 addition & 1 deletion)
@@ -10,7 +10,7 @@ ruamel.yaml
 scikit-learn
 omegaconf>=2.1.0
 hydra-core>=1.1.0
-transformers>=4.0.1
+transformers>=4.0.1,<=4.8.1
 sentencepiece<1.0.0
 webdataset>=0.1.48,<=0.1.62
 tqdm>=4.41.0
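The practical effect of the new specifier is that pip will now refuse any transformers release newer than 4.8.1. As a quick sanity check, here is a minimal sketch (not part of the commit) that tests an installed version against the new bound; it assumes the packaging library, which ships as a pip dependency, is available:

# Minimal sketch: check the installed transformers version against the
# new bound from requirements.txt. Not part of the commit.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

import transformers

bound = SpecifierSet(">=4.0.1,<=4.8.1")
installed = Version(transformers.__version__)
print(f"transformers {installed} satisfies '{bound}': {installed in bound}")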
tutorials/AudioTranslationSample.ipynb (166 changes: 83 additions & 83 deletions)
@@ -2,9 +2,6 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "RYGnI-EZp_nK"
-   },
    "source": [
     "# Getting Started: Sample Conversational AI application\n",
     "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to construct a toy demo that translates a Mandarin audio file into an English one.\n",
@@ -15,49 +12,48 @@
     "* Transcribe audio with a (Mandarin) speech recognition model.\n",
     "* Translate text with a machine translation model.\n",
     "* Generate audio with text-to-speech models."
-   ]
+   ],
+   "metadata": {
+    "id": "RYGnI-EZp_nK"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "V72HXYuQ_p9a"
-   },
    "source": [
     "## Installation\n",
     "NeMo can be installed via a simple pip command.\n",
     "This will take about 4 minutes.\n",
     "\n",
     "(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)"
-   ]
+   ],
+   "metadata": {
+    "id": "V72HXYuQ_p9a"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "efDmTWf1_iYK"
-   },
-   "outputs": [],
    "source": [
     "BRANCH = 'r1.2.0'\n",
     "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "efDmTWf1_iYK"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "EyJ5HiiPrPKA"
-   },
    "source": [
     "## Import all necessary packages"
-   ]
+   ],
+   "metadata": {
+    "id": "EyJ5HiiPrPKA"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "tdUqxeUEA8nw"
-   },
-   "outputs": [],
    "source": [
     "# Import NeMo and its ASR, NLP and TTS collections\n",
     "import nemo\n",
@@ -69,13 +65,14 @@
     "import nemo.collections.tts as nemo_tts\n",
     "# We'll use this to listen to audio\n",
     "import IPython"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "tdUqxeUEA8nw"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "bt2EZyU3A1aq"
-   },
    "source": [
     "## Instantiate pre-trained NeMo models\n",
     "\n",
@@ -84,30 +81,28 @@
     "* ``list_available_models()`` - lists all models currently available on NGC and their names.\n",
     "\n",
     "* ``from_pretrained(...)`` - downloads and initializes a model directly from NGC using its name.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "bt2EZyU3A1aq"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "YNNHs5Xjr8ox",
-    "scrolled": true
-   },
-   "outputs": [],
    "source": [
     "# Here is an example of all CTC-based models:\n",
     "nemo_asr.models.EncDecCTCModel.list_available_models()\n",
     "# More ASR models are available - see: nemo_asr.models.ASRModel.list_available_models()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "YNNHs5Xjr8ox",
+    "scrolled": true
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "1h9nhICjA5Dk",
-    "scrolled": true
-   },
-   "outputs": [],
    "source": [
     "# Speech recognition model - Citrinet, initially trained on the Multilingual LibriSpeech English corpus and fine-tuned on the open-source Aishell-2 corpus\n",
     "asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"stt_zh_citrinet_1024_gamma_0_25\").cuda()\n",
@@ -117,24 +112,25 @@
     "spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name=\"tts_en_fastpitch\").cuda()\n",
     "# Vocoder model which takes the spectrogram and produces actual audio\n",
     "vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name=\"tts_hifigan\").cuda()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "1h9nhICjA5Dk",
+    "scrolled": true
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "KPota-JtsqSY"
-   },
    "source": [
     "## Get an audio sample in Mandarin"
-   ]
+   ],
+   "metadata": {
+    "id": "KPota-JtsqSY"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "7cGCEKkcLr52"
-   },
-   "outputs": [],
    "source": [
     "# Download an audio sample to try\n",
     "# This is a sample from the MCV 6.1 dev dataset - the model hasn't seen it before\n",
@@ -143,71 +139,71 @@
     "!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'\n",
     "# To listen to it, click on the play button below\n",
     "IPython.display.Audio(audio_sample)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "7cGCEKkcLr52"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "BaCdNJhhtBfM"
-   },
    "source": [
     "## Transcribe the audio file\n",
     "We will use the speech recognition model to convert audio into text.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "BaCdNJhhtBfM"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "KTA7jM6sL6yC"
-   },
-   "outputs": [],
    "source": [
     "transcribed_text = asr_model.transcribe([audio_sample])\n",
     "print(transcribed_text)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "KTA7jM6sL6yC"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "BjYb2TMtttCc"
-   },
    "source": [
     "## Translate Chinese text into English\n",
     "NeMo's NMT models have a handy ``.translate()`` method."
-   ]
+   ],
+   "metadata": {
+    "id": "BjYb2TMtttCc"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "kQTdE4b9Nm9O"
-   },
-   "outputs": [],
    "source": [
     "english_text = nmt_model.translate(transcribed_text)\n",
     "print(english_text)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "kQTdE4b9Nm9O"
+   }
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "9Rppc59Ut7uy"
-   },
    "source": [
     "## Generate English audio from text\n",
     "Speech generation from text typically has two steps:\n",
     "* Generate a spectrogram from the text. In this example we will use the FastPitch model.\n",
     "* Generate actual audio from the spectrogram. In this example we will use the HifiGan model.\n"
-   ]
+   ],
+   "metadata": {
+    "id": "9Rppc59Ut7uy"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "id": "wpMYfufgNt15"
-   },
-   "outputs": [],
    "source": [
     "# A helper function which combines FastPitch and HifiGan to go directly from\n",
     "# text to audio\n",
@@ -216,26 +212,27 @@
     " spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)\n",
     " audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)\n",
     " return audio.to('cpu').detach().numpy()"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {
+    "id": "wpMYfufgNt15"
+   }
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
-   "outputs": [],
    "source": [
     "# Listen to the generated audio in English\n",
     "IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)"
-   ]
+   ],
+   "outputs": [],
+   "metadata": {}
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "id": "LiQ_GQpcBYUs"
-   },
    "source": [
     "## Next steps\n",
-    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Jarvis](https://developer.nvidia.com/nvidia-jarvis).\n",
+    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n",
     "\n",
     "**NeMo is built for training.** You can fine-tune, or train from scratch on your data, all of the models used in this example. We recommend you check out the following, more in-depth tutorials next:\n",
     "\n",
@@ -247,7 +244,10 @@
     "\n",
     "\n",
     "You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples)."
-   ]
+   ],
+   "metadata": {
+    "id": "LiQ_GQpcBYUs"
+   }
   }
  ],
  "metadata": {
@@ -277,4 +277,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 1
-}
+}
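Almost all of the 83 additions and 83 deletions above are a mechanical reordering of each cell's keys: "metadata" moves from before "source" to the end of the cell (after "outputs" in code cells). The only substantive edit is the Jarvis-to-Riva link update in the Next steps cell. Below is a minimal sketch of a script that would produce the same key reordering; this is an illustration only, as the commit does not say which tool was actually used:

import json

# Rewrite each cell so its keys follow the order seen in the new file:
# cell_type, execution_count, source, outputs, metadata.
path = "tutorials/AudioTranslationSample.ipynb"
with open(path, encoding="utf-8") as f:
    nb = json.load(f)

order = ["cell_type", "execution_count", "source", "outputs", "metadata"]
nb["cells"] = [{k: c[k] for k in order if k in c} for c in nb["cells"]]

with open(path, "w", encoding="utf-8") as f:
    json.dump(nb, f, indent=1, ensure_ascii=False)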
tutorials/VoiceSwapSample.ipynb (2 changes: 1 addition & 1 deletion)
@@ -269,7 +269,7 @@
    },
    "source": [
     "## Next steps\n",
-    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Jarvis](https://developer.nvidia.com/nvidia-jarvis).\n",
+    "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n",
     "\n",
     "**NeMo is built for training.** You can fine-tune, or train from scratch on your data, all of the models used in this example. We recommend you check out the following, more in-depth tutorials next:\n",
     "\n",
