This project demonstrates the use of WhisperTiny models within the Unity Inference Engine for Speech-to-Text conversion, and of Piper models for Text-to-Speech synthesis.
Whisper is a model trained on labelled data for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. at OpenAI.
Piper language models are efficient, fully local neural Text-to-Speech (TTS) models designed for fast and high-quality voice generation.
The Piper engine uses the espeak-ng synthesizer for phonemization, which converts text into phonemes before they are processed by the neural model. This approach ensures accurate pronunciation across multiple languages while maintaining low latency, making Piper suitable for real-time applications and offline use.
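To make that pipeline concrete, the sketch below illustrates the phoneme-to-id step that sits between espeak-ng and the neural model. The class name, method, and id values are invented for illustration only; the real mapping comes from the phonemization list shipped with the project, and the conversion is handled by the project's own components.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustration only: espeak-ng turns text into a phoneme string, and each phoneme
// is then looked up in a phoneme-to-id table before being fed to the Piper model.
// The ids below are made up; the real table is the phonemization list in /Assets/.
static class PhonemeIdSketch
{
    static readonly Dictionary<char, int> PhonemeIds = new Dictionary<char, int>
    {
        ['h'] = 20, ['ə'] = 59, ['l'] = 24, ['o'] = 27, ['ʊ'] = 100
    };

    // Converts a phoneme string (as produced by espeak-ng) into model input ids.
    public static int[] ToIds(string phonemes) =>
        phonemes.Where(PhonemeIds.ContainsKey).Select(p => PhonemeIds[p]).ToArray();
}
```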
- Speech Input: Perform local speech-to-text using neural inference
- Multilingual Support: Supports English, German, and French voice input
- Speech Output: Perform local speech synthesis using neural inference with any of the language models
- Unity: 6000.2.6f1
- Inference Engine: 2.3.0
You can download the WhisperTiny models from the Unity repository on Hugging Face.
You can download the Piper voice models from the repository on Hugging Face and browse language models on the Samples page.
- Clone or download this repository.
- Download the model assets and the espeak-ng plugin from here. Extract the contents of CopyContentToAssetsFolder.zip. Inside, you’ll find three items:
  - the data folder containing the ONNX models
  - the espeak-ng synthesizer plugin
  - the phonemization list

  Copy all three into your project’s /Assets/ directory.
Steps 3 and 4 are only required if the models are not automatically linked to the Neural Text Generation prefab or the Neural Speech Generation prefab.
- Add the model assets to the RunWhisper component of the SpeechToText prefab.
- Add the model assets to the RunPiper component of the TextToSpeech prefab.
- Open the /Assets/Scenes/Runtime AI Sample Scene.unity scene in the Unity Editor.
- Run the scene to test the Speech-to-Text conversion and Text-to-Speech synthesis.
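For reference, the model wiring from steps 3 and 4 corresponds roughly to a component like the sketch below, assuming the Inference Engine 2.x API (Unity.InferenceEngine namespace, ModelLoader, Worker). The class and field names are illustrative; the actual RunWhisper and RunPiper components may be structured differently.

```csharp
using UnityEngine;
using Unity.InferenceEngine;

// Example component: exposes a ModelAsset in the Inspector and creates a Worker
// for it at startup. Names are illustrative, not the project's actual components.
public class ModelRunnerExample : MonoBehaviour
{
    [SerializeField] ModelAsset modelAsset;   // drag the .onnx model asset here

    Worker worker;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        worker = new Worker(model, BackendType.GPUCompute);
    }

    void OnDestroy() => worker?.Dispose();
}
```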
When you press the Record button, the microphone activates and audio is captured into an AudioClip. Pressing the button again stops the recording.
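Under the hood this uses Unity's built-in Microphone API. A minimal sketch of such a record/stop toggle (the project's own recorder component may differ):

```csharp
using UnityEngine;

// Minimal record/stop toggle using Unity's Microphone API.
// The project's own recorder may handle devices, lengths, and UI differently.
public class MicRecorderExample : MonoBehaviour
{
    AudioClip recording;
    bool isRecording;

    public void ToggleRecording()
    {
        if (!isRecording)
        {
            // Record up to 30 seconds of audio at 16 kHz (the sample rate Whisper expects).
            recording = Microphone.Start(null, false, 30, 16000);
            isRecording = true;
        }
        else
        {
            Microphone.End(null);
            isRecording = false;
        }
    }
}
```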
The recorded audio is then processed by WhisperTiny for speech recognition.
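As a rough sketch of what that processing involves with the Inference Engine 2.x API, the clip's samples are copied into a tensor and scheduled on a worker. The real RunWhisper component additionally computes the log-mel spectrogram and decodes the output tokens, which is omitted here; tensor disposal is also left out for brevity.

```csharp
using UnityEngine;
using Unity.InferenceEngine;

// Sketch: turn a recorded AudioClip into a tensor and run it through a worker.
// Whisper's real preprocessing (log-mel spectrogram) and token decoding are omitted.
public static class WhisperSketch
{
    public static Tensor RunOnClip(Worker worker, AudioClip clip)
    {
        var samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        // Shape (1, N): a single batch of raw audio samples.
        var input = new Tensor<float>(new TensorShape(1, samples.Length), samples);
        worker.Schedule(input);

        // PeekOutput returns the model's output tensor; turning it into text
        // is handled by the project's RunWhisper component.
        return worker.PeekOutput();
    }
}
```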
Use the dropdown menu to select the desired input and output languages. Once processing is complete, the recognized text will appear in the text field.
You can also type text directly into the field to test speech synthesis with the Piper model.
Speech-to-Text and Text-to-Speech function independently, but in this demo they are combined to showcase the full round trip.
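On the output side, the synthesized samples produced by Piper can be played back through a standard AudioSource. A minimal sketch, assuming mono output at Piper's typical 22050 Hz sample rate (the project's actual playback code may differ):

```csharp
using UnityEngine;

// Sketch: wrap raw synthesized samples in an AudioClip and play them.
// Piper voices typically output mono audio at 22050 Hz; adjust if your voice differs.
public static class PlaybackSketch
{
    public static void Play(AudioSource source, float[] samples, int sampleRate = 22050)
    {
        var clip = AudioClip.Create("piper-output", samples.Length, 1, sampleRate, false);
        clip.SetData(samples, 0);
        source.PlayOneShot(clip);
    }
}
```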
The espeak-ng synthesizer plugin supports Android, Windows, and macOS (x64) out of the box. For other platforms, please refer to the espeak-ng repository.
Try it yourself:
This project depends on third-party neural networks. Please refer to the original WhisperTiny, Piper, and eSpeak-ng repositories for detailed license information.




