This project makes it easy to fine-tune (train) Whisper speech-to-text models.
- Clone the project.
- Install system dependencies, notably `libmagic-dev` and `ffmpeg`. Alternatively, use the development Docker image in `docker/dev/Dockerfile`, which contains everything you may need.
- Run `poetry install`, or install the project locally using pip (the dev Docker image already comes with Poetry).
If you happen to use VS Code (and the Dev Containers extension), it should open the project in the dev Docker container. For more info, see the VS Code documentation.
The dataset is expected to have the following structure:
```
<dataset_root>
|-- test
|   |-- <filename_01>.wav
|   |-- <filename_02>.wav
|   |-- ...
|   `-- metadata.jsonl
|-- train
|   |-- <filename_01>.wav
|   |-- <filename_02>.wav
|   |-- ...
|   `-- metadata.jsonl
`-- validation
    |-- <filename_01>.wav
    |-- <filename_02>.wav
    |-- ...
    `-- metadata.jsonl
```
Other audio formats, like `mp3` or `ogg`, are also supported, as are other metadata formats, like `csv`. (For both, see the Hugging Face datasets `load_dataset` function.)
Metadata must contain at least two fields:
- `file_name` - the name of the related audio file, including the file extension
- `transcription` - the ground-truth transcription
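For illustration, a minimal `metadata.jsonl` might look like this (the file names and transcripts are placeholders):

```
{"file_name": "<filename_01>.wav", "transcription": "first ground-truth transcript"}
{"file_name": "<filename_02>.wav", "transcription": "second ground-truth transcript"}
```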
The name of the transcription column can differ, but in that case it must be passed to the training script:

```sh
whisper-finetune train ... --transcript-col-name=<your_column_name>
```
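This layout matches what the Hugging Face `audiofolder` loader expects, so you can sanity-check a dataset independently of the training script. A minimal sketch (the `data_dir` path is just an example, and the training script may load data differently under the hood):

```python
from datasets import load_dataset

# Loads the train/validation/test splits from the folder structure above;
# each example gets an "audio" column plus the columns from metadata.jsonl.
dataset = load_dataset("audiofolder", data_dir="./data/my_dataset")
print(dataset["train"][0]["transcription"])
```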
In case you don't want to use your own dataset, you can use Mozilla Common Voice. The package comes with a CLI command to download it in the expected format.
```sh
whisper-finetune download-common-voice \
    --dataset-dir ./data/common_voice \
    --cache-dir ./data \
    --lang cs \
    --shrink-test-split 4000 \
    --shrink-valid-split 2000
```
`whisper-finetune` comes with a prepared audio augmentation pipeline. It consists of several audio effects provided by the [AugLy](https://github.com/facebookresearch/AugLy) library, and it also mixes songs and environmental noise into the audio. Unfortunately, some of the noise audio files need to be downloaded manually. Here are the scripts I use for preparing the augmentation data.
FMA small: 8000 tracks of 30 s each, 8 balanced genres.

```sh
mkdir -p ./data
wget -O ./data/FMA-small.zip https://os.unil.cloud.switch.ch/fma/fma_small.zip
unzip ./data/FMA-small.zip -d ./data/
```
ESC-50: a dataset for classification of environmental noise (50 classes divided into categories: animals, natural soundscapes, human non-speech sounds, domestic sounds, urban noises). The dataset is not large, but it is balanced.

```sh
# Place ESC-50 under ./data/ESC so that it matches the layout below.
mkdir -p ./data/ESC
wget -O ./data/esc_50.zip https://github.com/karoldvl/ESC-50/archive/master.zip
unzip ./data/esc_50.zip -d ./data/
mv ./data/ESC-50-master/audio ./data/ESC/ESC-50
rm -r ./data/ESC-50-master
rm ./data/esc_50.zip
```
ESC-US: a larger and more diverse dataset of noises, but it is not guaranteed to be balanced (or representative). Unfortunately, it needs to be downloaded manually (because of a license agreement). Two parts of it should be enough; more than that would defeat the purpose of downloading the smaller, balanced ESC-50 anyway.
Download the parts manually from the website, place them in the data folder, and run:
```sh
# tar -C requires the target directory to exist.
mkdir -p ./data/ESC/ESC-US
find ./data/ -name 'ESC-US-*.tar.gz' -print0 | parallel -0 tar -xvzf {} -C ./data/ESC/ESC-US
find ./data/ -name 'ESC-US-*.tar.gz' -print0 | parallel -0 rm {}
```
Now, your data folder should look like this:
```
data
|-- ESC
|   |-- ESC-50
|   `-- ESC-US
|-- FMA-small
...
```
Some of these contain subdirectories, but that does not matter; the training script will find the audio files recursively.
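For intuition, the core of the noise part of the augmentation, mixing a noise clip into speech at a target signal-to-noise ratio, can be sketched in a few lines of numpy. This is an illustrative stand-in, not the project's actual AugLy-based implementation:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` so the result has roughly the given SNR (in dB)."""
    # Loop/trim the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```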
For training, use `whisper-finetune train`. Example usage:
```sh
whisper-finetune train \
    --dataset-dir ./data/common_voice/cs \
    --dataset-name common-voice-cs \
    --noise-songs-dir ./data/FMA-small \
    --noise-other-dir ./data/ESC \
    --cache-dir-models ./models \
    --lang cs \
    --lang-long czech \
    --model-size tiny \
    --output-root-dir ./models \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --eval_accumulation_steps 2 \
    --warmup_steps 1000 \
    --evaluation_strategy "steps" \
    --eval_steps 100 \
    --num_train_epochs 15 \
    --logging_steps 20 \
    --bf16
```
You may have noticed that some parameter names contain '-' and others '_' as a word separator. Parameters containing '-' are the project's own training-script parameters; the remaining ones are passed through to the `Seq2SeqTrainingArguments` object from Hugging Face transformers. See the transformers documentation to learn more about those parameters.
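In other words, the underscore-separated flags in the example above map directly onto `Seq2SeqTrainingArguments` fields, roughly like this (`output_dir` here is a placeholder; how the script actually constructs it is not shown):

```python
from transformers import Seq2SeqTrainingArguments

# Equivalent of the underscore-separated CLI flags from the example above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./models/whisper-tiny-cs",  # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_accumulation_steps=2,
    warmup_steps=1000,
    evaluation_strategy="steps",
    eval_steps=100,
    num_train_epochs=15,
    logging_steps=20,
    bf16=True,
)
```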
The script will automatically log to Weights & Biases:
- training loss
- validation metrics:
  - word error rate (WER)
  - character error rate (CER)
  - exact string match
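The validation metrics themselves are standard; for reference, they can be computed with the Hugging Face `evaluate` library (whether the training script uses `evaluate` or another implementation internally is not shown here):

```python
import evaluate

predictions = ["hello world"]
references = ["hello word"]

# Word and character error rates via the standard evaluate metrics.
wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
# Exact string match: fraction of predictions identical to their reference.
exact = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(wer, cer, exact)
```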
TODO