
WhisperTiny Speech-To-Text for Unity

This project demonstrates the use of WhisperTiny models with the Unity Inference Engine for local speech-to-text conversion.

Whisper is a model trained on labelled data for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI.

Demo Scene

Key Features

  • Speech Input: Perform local speech-to-text conversion using neural inference.
  • Multilingual Support: Supports English, German, and French voice input.

Requirements

  • Unity: 6000.1.11f1
  • Inference Engine: 2.3.0

Models (ONNX)

You can download the WhisperTiny models from the Unity repository on Hugging Face.

Model Name                  Hugging Face Link
decoder_model               models/decoder_model.onnx
decoder_with_past_model     models/decoder_with_past_model.onnx
encoder_model               models/encoder_model.onnx
logmel_spectrogram          models/logmel_spectrogram.onnx

Vocab JSON                  data/vocab.json

Getting Started

Project Setup

  1. Clone or download this repository.
  2. Download the WhisperTiny ONNX models and the vocab.json file from the Unity Hugging Face repository and place the contents into the /Assets/Data directory in your project.
  3. Add the model assets to the RunWhisper component of the MicrophoneManager GameObject (see the field sketch below).

Add the ONNX models
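
For orientation, the snippet below sketches the kind of serialized fields a RunWhisper-style component exposes for these assets in the Inspector. The field names, types, and the Unity.InferenceEngine namespace are assumptions for illustration, not the repository's actual implementation.

```csharp
// Hypothetical sketch only: field names and layout are assumptions, not the
// repository's actual RunWhisper component.
using Unity.InferenceEngine;
using UnityEngine;

public class RunWhisper : MonoBehaviour
{
    [SerializeField] ModelAsset logmelSpectrogram;     // logmel_spectrogram.onnx
    [SerializeField] ModelAsset encoderModel;          // encoder_model.onnx
    [SerializeField] ModelAsset decoderModel;          // decoder_model.onnx
    [SerializeField] ModelAsset decoderWithPastModel;  // decoder_with_past_model.onnx
    [SerializeField] TextAsset vocabJson;              // vocab.json
}
```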

Run the Demo Scene

  1. Open the /Assets/Scenes/Runtime AI Sample Scene.unity scene in the Unity Editor.
  2. Run the scene to test the speech-to-text conversion.

Try it yourself:

Try the demo

How to Use

When the record button is pressed, the microphone activates and audio is captured into an AudioClip. Pressing the button again will stop the recording.
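
A minimal sketch of this record/stop toggle using Unity's built-in Microphone API is shown below. The class and method names are hypothetical; the 16 kHz mono sample rate matches what Whisper models expect.

```csharp
// Minimal sketch of the record/stop toggle; class and member names are hypothetical.
using UnityEngine;

public class MicrophoneRecorder : MonoBehaviour
{
    const int SampleRate = 16000;   // Whisper models expect 16 kHz mono audio
    const int MaxSeconds = 30;      // Whisper processes audio in windows of up to 30 s

    AudioClip recordedClip;
    bool isRecording;

    // Called by the record button; toggles between starting and stopping the capture.
    public void OnRecordButtonPressed()
    {
        if (!isRecording)
        {
            // null selects the default microphone device
            recordedClip = Microphone.Start(null, false, MaxSeconds, SampleRate);
            isRecording = true;
        }
        else
        {
            Microphone.End(null);
            isRecording = false;
            // recordedClip now holds the captured audio and can be handed to the
            // speech-recognition workflow described below.
        }
    }
}
```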

The recorded audio clip is then used for speech recognition. The workflow, simplified (a code sketch of these steps follows below):

Step 1: Audio preprocessing converts the audio into a time-frequency representation (log-Mel spectrogram).

Step 2: The encoder processes the spectrogram to extract meaningful features from the audio.

Step 3: The decoder takes the encoded features and generates the text output, predicting one token (sub-word unit) at a time.
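
The sketch below illustrates these three steps with the Inference Engine API (ModelAsset, Worker, Tensor). Tensor shapes, decoder input ordering, the special token ids, and the use of decoder_model alone (rather than decoder_with_past_model with cached key/values) are simplifying assumptions for illustration; the RunWhisper component in this repository is the authoritative implementation.

```csharp
// Simplified, hypothetical sketch of the three steps above.
using System.Collections.Generic;
using Unity.InferenceEngine;
using UnityEngine;

public class WhisperPipelineSketch : MonoBehaviour
{
    public ModelAsset spectrogramAsset;  // logmel_spectrogram.onnx
    public ModelAsset encoderAsset;      // encoder_model.onnx
    public ModelAsset decoderAsset;      // decoder_model.onnx

    Worker spectrogramWorker, encoderWorker, decoderWorker;

    void Start()
    {
        spectrogramWorker = new Worker(ModelLoader.Load(spectrogramAsset), BackendType.GPUCompute);
        encoderWorker = new Worker(ModelLoader.Load(encoderAsset), BackendType.GPUCompute);
        decoderWorker = new Worker(ModelLoader.Load(decoderAsset), BackendType.GPUCompute);
    }

    // Greedy decoding over 16 kHz mono samples; returns the generated token ids.
    public List<int> Transcribe(float[] samples, int startTokenId, int endTokenId, int maxTokens = 64)
    {
        // Step 1: raw audio -> log-Mel spectrogram (assumed input shape: 1 x sampleCount).
        using var audio = new Tensor<float>(new TensorShape(1, samples.Length), samples);
        spectrogramWorker.Schedule(audio);
        var spectrogram = spectrogramWorker.PeekOutput() as Tensor<float>;

        // Step 2: spectrogram -> encoder hidden states.
        encoderWorker.Schedule(spectrogram);
        var encoded = encoderWorker.PeekOutput() as Tensor<float>;

        // Step 3: autoregressive decoding, one token per iteration.
        var tokens = new List<int> { startTokenId };
        for (int i = 0; i < maxTokens; i++)
        {
            using var tokenIds = new Tensor<int>(new TensorShape(1, tokens.Count), tokens.ToArray());
            decoderWorker.Schedule(tokenIds, encoded);  // assumed input order: ids, encoder output
            using var logits = (decoderWorker.PeekOutput() as Tensor<float>).ReadbackAndClone();

            int next = ArgMaxLastStep(logits);          // greedy: most likely next token
            tokens.Add(next);
            if (next == endTokenId) break;              // end-of-transcript token
        }
        return tokens;                                  // map ids to text via vocab.json
    }

    static int ArgMaxLastStep(Tensor<float> logits)
    {
        // logits shape assumed (1, sequenceLength, vocabSize); argmax over the final step.
        int seq = logits.shape[1];
        int vocab = logits.shape[2];
        float[] data = logits.DownloadToArray();
        int best = 0;
        for (int v = 1; v < vocab; v++)
            if (data[(seq - 1) * vocab + v] > data[(seq - 1) * vocab + best]) best = v;
        return best;
    }

    void OnDestroy()
    {
        spectrogramWorker?.Dispose();
        encoderWorker?.Dispose();
        decoderWorker?.Dispose();
    }
}
```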

The dropdown menu allows you to select the desired input language. Once processing is complete, the detected text will be displayed in the text field.
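
Under the hood, Whisper selects the transcription language with a special language token (for example <|en|>, <|de|>, or <|fr|>) placed in the decoder prompt. A small, hypothetical sketch of mapping the dropdown selection to such a token is shown below; the UI types and field names are assumptions.

```csharp
// Hypothetical sketch: map the dropdown selection to Whisper's language token.
// The token string is resolved to a token id via vocab.json before decoding starts.
using UnityEngine;
using UnityEngine.UI;

public class LanguageSelector : MonoBehaviour
{
    public Dropdown languageDropdown;  // options assumed in this order: English, German, French

    static readonly string[] LanguageTokens = { "<|en|>", "<|de|>", "<|fr|>" };

    public string SelectedLanguageToken()
    {
        return LanguageTokens[languageDropdown.value];
    }
}
```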

License

This project depends on third-party neural networks. Please refer to the original WhisperTiny repositories for detailed license information.
