
WhisperTiny Speech-To-Text for Unity

This project demonstrates the use of WhisperTiny models with the Unity Inference Engine for local speech-to-text conversion.

Whisper is a model trained on labelled data for automatic speech recognition (ASR) and speech translation. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI.

Demo Scene

Key Features

  • Speech Input: Perform local speech-to-text conversion using neural inference.
  • Multilingual Support: Supports English, German, and French voice input.

Requirements

  • Unity: 6000.1.11f1
  • Inference Engine: 2.3.0

Models (ONNX)

You can download the WhisperTiny models from the Unity repository on Hugging Face.

Model Name                  Hugging Face Link
decoder_model               models/decoder_model.onnx
decoder_with_past_model     models/decoder_with_past_model.onnx
encoder_model               models/encoder_model.onnx
logmel_spectrogram          models/logmel_spectrogram.onnx

Vocab JSON                  data/vocab.json

Getting Started

Project Setup

  1. Clone or download this repository.
  2. Download the WhisperTiny ONNX models and the vocab.json file from the Unity Hugging Face repository and place the contents into the /Assets/Data directory in your project.
  3. Add the model assets to the RunWhisper component of the MicrophoneManager GameObject (see the field sketch below).

Add the ONNX models
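
For orientation, the snippet below sketches the kind of serialized fields a RunWhisper-style component exposes for these assets in the Inspector. The field names, types, and the Unity.InferenceEngine namespace are assumptions for illustration, not the repository's actual implementation.

```csharp
// Hypothetical sketch only: field names and layout are assumptions, not the
// repository's actual RunWhisper component.
using Unity.InferenceEngine;
using UnityEngine;

public class RunWhisper : MonoBehaviour
{
    [SerializeField] ModelAsset logmelSpectrogram;     // logmel_spectrogram.onnx
    [SerializeField] ModelAsset encoderModel;          // encoder_model.onnx
    [SerializeField] ModelAsset decoderModel;          // decoder_model.onnx
    [SerializeField] ModelAsset decoderWithPastModel;  // decoder_with_past_model.onnx
    [SerializeField] TextAsset vocabJson;              // vocab.json
}
```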

Run the Demo Scene

  1. Open the /Assets/Scenes/Runtime AI Sample Scene.unity scene in the Unity Editor.
  2. Run the scene to test the speech-to-text conversion.

Try it yourself:

Try the demo

How to Use

When the record button is pressed, the microphone activates and audio is captured into an AudioClip. Pressing the button again will stop the recording.
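
A minimal sketch of this record/stop toggle using Unity's built-in Microphone API is shown below. The class and method names are hypothetical; the 16 kHz mono sample rate matches what Whisper models expect.

```csharp
// Minimal sketch of the record/stop toggle; class and member names are hypothetical.
using UnityEngine;

public class MicrophoneRecorder : MonoBehaviour
{
    const int SampleRate = 16000;   // Whisper models expect 16 kHz mono audio
    const int MaxSeconds = 30;      // Whisper processes audio in windows of up to 30 s

    AudioClip recordedClip;
    bool isRecording;

    // Called by the record button; toggles between starting and stopping the capture.
    public void OnRecordButtonPressed()
    {
        if (!isRecording)
        {
            // null selects the default microphone device
            recordedClip = Microphone.Start(null, false, MaxSeconds, SampleRate);
            isRecording = true;
        }
        else
        {
            Microphone.End(null);
            isRecording = false;
            // recordedClip now holds the captured audio and can be handed to the
            // speech-recognition workflow described below.
        }
    }
}
```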

The recorded audio clip is then used for speech recognition. The workflow, simplified (a code sketch of these steps follows below):

Step 1: Audio preprocessing converts the audio into a time-frequency representation (log-Mel spectrogram).

Step 2: The encoder processes the spectrogram to extract meaningful features from the audio.

Step 3: The decoder takes the encoded features and generates the text output, predicting one token (sub-word unit) at a time.
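
The sketch below illustrates these three steps with the Inference Engine API (ModelAsset, Worker, Tensor). Tensor shapes, decoder input ordering, the special token ids, and the use of decoder_model alone (rather than decoder_with_past_model with cached key/values) are simplifying assumptions for illustration; the RunWhisper component in this repository is the authoritative implementation.

```csharp
// Simplified, hypothetical sketch of the three steps above.
using System.Collections.Generic;
using Unity.InferenceEngine;
using UnityEngine;

public class WhisperPipelineSketch : MonoBehaviour
{
    public ModelAsset spectrogramAsset;  // logmel_spectrogram.onnx
    public ModelAsset encoderAsset;      // encoder_model.onnx
    public ModelAsset decoderAsset;      // decoder_model.onnx

    Worker spectrogramWorker, encoderWorker, decoderWorker;

    void Start()
    {
        spectrogramWorker = new Worker(ModelLoader.Load(spectrogramAsset), BackendType.GPUCompute);
        encoderWorker = new Worker(ModelLoader.Load(encoderAsset), BackendType.GPUCompute);
        decoderWorker = new Worker(ModelLoader.Load(decoderAsset), BackendType.GPUCompute);
    }

    // Greedy decoding over 16 kHz mono samples; returns the generated token ids.
    public List<int> Transcribe(float[] samples, int startTokenId, int endTokenId, int maxTokens = 64)
    {
        // Step 1: raw audio -> log-Mel spectrogram (assumed input shape: 1 x sampleCount).
        using var audio = new Tensor<float>(new TensorShape(1, samples.Length), samples);
        spectrogramWorker.Schedule(audio);
        var spectrogram = spectrogramWorker.PeekOutput() as Tensor<float>;

        // Step 2: spectrogram -> encoder hidden states.
        encoderWorker.Schedule(spectrogram);
        var encoded = encoderWorker.PeekOutput() as Tensor<float>;

        // Step 3: autoregressive decoding, one token per iteration.
        var tokens = new List<int> { startTokenId };
        for (int i = 0; i < maxTokens; i++)
        {
            using var tokenIds = new Tensor<int>(new TensorShape(1, tokens.Count), tokens.ToArray());
            decoderWorker.Schedule(tokenIds, encoded);  // assumed input order: ids, encoder output
            using var logits = (decoderWorker.PeekOutput() as Tensor<float>).ReadbackAndClone();

            int next = ArgMaxLastStep(logits);          // greedy: most likely next token
            tokens.Add(next);
            if (next == endTokenId) break;              // end-of-transcript token
        }
        return tokens;                                  // map ids to text via vocab.json
    }

    static int ArgMaxLastStep(Tensor<float> logits)
    {
        // logits shape assumed (1, sequenceLength, vocabSize); argmax over the final step.
        int seq = logits.shape[1];
        int vocab = logits.shape[2];
        float[] data = logits.DownloadToArray();
        int best = 0;
        for (int v = 1; v < vocab; v++)
            if (data[(seq - 1) * vocab + v] > data[(seq - 1) * vocab + best]) best = v;
        return best;
    }

    void OnDestroy()
    {
        spectrogramWorker?.Dispose();
        encoderWorker?.Dispose();
        decoderWorker?.Dispose();
    }
}
```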

The dropdown menu allows you to select the desired input language. Once processing is complete, the detected text will be displayed in the text field.
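
Under the hood, Whisper selects the transcription language with a special language token (for example <|en|>, <|de|>, or <|fr|>) placed in the decoder prompt. A small, hypothetical sketch of mapping the dropdown selection to such a token is shown below; the UI types and field names are assumptions.

```csharp
// Hypothetical sketch: map the dropdown selection to Whisper's language token.
// The token string is resolved to a token id via vocab.json before decoding starts.
using UnityEngine;
using UnityEngine.UI;

public class LanguageSelector : MonoBehaviour
{
    public Dropdown languageDropdown;  // options assumed in this order: English, German, French

    static readonly string[] LanguageTokens = { "<|en|>", "<|de|>", "<|fr|>" };

    public string SelectedLanguageToken()
    {
        return LanguageTokens[languageDropdown.value];
    }
}
```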

License

This project depends on third-party neural networks. Please refer to the original WhisperTiny repositories for detailed license information.
