This project demonstrates the use of WhisperTiny models within the Unity Inference Engine for Speech-to-Text conversion, and of Piper models for Text-to-Speech synthesis.
Whisper is a model trained on labelled data for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. at OpenAI.
Piper language models are efficient, fully local neural Text-to-Speech (TTS) models designed for fast and high-quality voice generation.
The Piper engine uses the espeak-ng synthesizer for phonemization, which converts text into phonemes before they are processed by the neural model. This approach ensures accurate pronunciation across multiple languages while maintaining low latency, making Piper suitable for real-time applications and offline use.
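To make that pipeline concrete, the sketch below illustrates the phoneme-to-id step that sits between espeak-ng and the neural model. The class name, method, and id values are invented for illustration only; the real mapping comes from the phonemization list shipped with the project, and the conversion is handled by the project's own components.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustration only: espeak-ng turns text into a phoneme string, and each phoneme
// is then looked up in a phoneme-to-id table before being fed to the Piper model.
// The ids below are made up; the real table is the phonemization list in /Assets/.
static class PhonemeIdSketch
{
    static readonly Dictionary<char, int> PhonemeIds = new Dictionary<char, int>
    {
        ['h'] = 20, ['ə'] = 59, ['l'] = 24, ['o'] = 27, ['ʊ'] = 100
    };

    // Converts a phoneme string (as produced by espeak-ng) into model input ids.
    public static int[] ToIds(string phonemes) =>
        phonemes.Where(PhonemeIds.ContainsKey).Select(p => PhonemeIds[p]).ToArray();
}
```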
- Speech Input: Perform local speech-to-text using neural inference
- Multilingual Support: Supports English, German, and French voice input
- Speech Output: Perform local speech synthesis using neural inference with any of the language models
- Unity: 6000.2.6f1
- Inference Engine: 2.3.0
You can download the WhisperTiny models from the Unity repository on Hugging Face.
You can download the Piper voice models from the repository on Hugging Face and browse language models on the Samples page.
- Clone or download this repository.
- Download the model assets and the espeak-ng plugin from here. Extract the contents of CopyContentToAssetsFolder.zip. Inside, you’ll find three items:
  - the data folder containing the ONNX models
  - the espeak-ng synthesizer plugin
  - the phonemization list

  Copy all three into your project’s /Assets/ directory.
Steps 3 and 4 are only required if the models are not automatically linked to the Neural Text Generation prefab or the Neural Speech Generation prefab.
- Add the model assets to the RunWhisper component of the SpeechToText prefab.
- Add the model assets to the RunPiper component of the TextToSpeech prefab.
- Open the /Assets/Scenes/Runtime AI Sample Scene.unity scene in the Unity Editor.
- Run the scene to test the Speech-to-Text conversion and Text-to-Speech synthesis.
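For reference, the model wiring from steps 3 and 4 corresponds roughly to a component like the sketch below, assuming the Inference Engine 2.x API (Unity.InferenceEngine namespace, ModelLoader, Worker). The class and field names are illustrative; the actual RunWhisper and RunPiper components may be structured differently.

```csharp
using UnityEngine;
using Unity.InferenceEngine;

// Example component: exposes a ModelAsset in the Inspector and creates a Worker
// for it at startup. Names are illustrative, not the project's actual components.
public class ModelRunnerExample : MonoBehaviour
{
    [SerializeField] ModelAsset modelAsset;   // drag the .onnx model asset here

    Worker worker;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        worker = new Worker(model, BackendType.GPUCompute);
    }

    void OnDestroy() => worker?.Dispose();
}
```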
When you press the Record button, the microphone activates and audio is captured into an AudioClip. Pressing the button again stops the recording.
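Under the hood this uses Unity's built-in Microphone API. A minimal sketch of such a record/stop toggle (the project's own recorder component may differ):

```csharp
using UnityEngine;

// Minimal record/stop toggle using Unity's Microphone API.
// The project's own recorder may handle devices, lengths, and UI differently.
public class MicRecorderExample : MonoBehaviour
{
    AudioClip recording;
    bool isRecording;

    public void ToggleRecording()
    {
        if (!isRecording)
        {
            // Record up to 30 seconds of audio at 16 kHz (the sample rate Whisper expects).
            recording = Microphone.Start(null, false, 30, 16000);
            isRecording = true;
        }
        else
        {
            Microphone.End(null);
            isRecording = false;
        }
    }
}
```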
The recorded audio is then processed by WhisperTiny for speech recognition.
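As a rough sketch of what that processing involves with the Inference Engine 2.x API, the clip's samples are copied into a tensor and scheduled on a worker. The real RunWhisper component additionally computes the log-mel spectrogram and decodes the output tokens, which is omitted here; tensor disposal is also left out for brevity.

```csharp
using UnityEngine;
using Unity.InferenceEngine;

// Sketch: turn a recorded AudioClip into a tensor and run it through a worker.
// Whisper's real preprocessing (log-mel spectrogram) and token decoding are omitted.
public static class WhisperSketch
{
    public static Tensor RunOnClip(Worker worker, AudioClip clip)
    {
        var samples = new float[clip.samples * clip.channels];
        clip.GetData(samples, 0);

        // Shape (1, N): a single batch of raw audio samples.
        var input = new Tensor<float>(new TensorShape(1, samples.Length), samples);
        worker.Schedule(input);

        // PeekOutput returns the model's output tensor; turning it into text
        // is handled by the project's RunWhisper component.
        return worker.PeekOutput();
    }
}
```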
Use the dropdown menu to select the desired input and output languages. Once processing is complete, the recognized text will appear in the text field.
You can also type text directly into the field to test speech synthesis with the Piper model.
Speech-to-Text and Text-to-Speech function independently, but in this demo they are combined to showcase the full round trip.
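On the output side, the synthesized samples produced by Piper can be played back through a standard AudioSource. A minimal sketch, assuming mono output at Piper's typical 22050 Hz sample rate (the project's actual playback code may differ):

```csharp
using UnityEngine;

// Sketch: wrap raw synthesized samples in an AudioClip and play them.
// Piper voices typically output mono audio at 22050 Hz; adjust if your voice differs.
public static class PlaybackSketch
{
    public static void Play(AudioSource source, float[] samples, int sampleRate = 22050)
    {
        var clip = AudioClip.Create("piper-output", samples.Length, 1, sampleRate, false);
        clip.SetData(samples, 0);
        source.PlayOneShot(clip);
    }
}
```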
The espeak-ng synthesizer plugin supports Android, Windows, and macOS (x64) out of the box. For other platforms, please refer to the espeak-ng repository.
Try it yourself:
This project depends on third-party neural networks. Please refer to the original WhisperTiny, Piper, and eSpeak-ng repositories for detailed license information.




