Automatic Speech Recognition (ASR) System for Egyptian Arabic
The dataset comprises 50,715 audio files in the `/train/` folder and 2,199 audio files in the `/adapt/` folder. Each audio file is a single-channel, 16-bit PCM WAV with a sample rate of 16,000 Hz. The corresponding transcripts are provided in the `train.csv` file with the following fields (a brief loading sketch follows the list):
- audio: The name of the corresponding `.wav` file.
- transcription: The words spoken by the reader.
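As a rough illustration, the snippet below loads the metadata and builds the paths to the WAV files. The `dataset/` root directory and the exact column names are assumptions based on the description above, not taken from the original code.

```python
import os
import pandas as pd

# Root directory and column names are assumptions based on the description above.
DATA_ROOT = "dataset"
metadata = pd.read_csv(os.path.join(DATA_ROOT, "train.csv"))

# Build the full path to each 16 kHz, 16-bit PCM WAV file in /train/.
metadata["wav_path"] = metadata["audio"].apply(
    lambda name: os.path.join(DATA_ROOT, "train", name)
)

wav_paths = metadata["wav_path"].tolist()
transcriptions = metadata["transcription"].tolist()
```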
We prepared a vocabulary of 38 characters, including the space and the OOV (out-of-vocabulary) token. The following transformations were applied to the data (sketched in the code after the list):
- Spectrogram Generation: Spectrograms were obtained using the Short-Time Fourier Transform (STFT) with a frame length of 240 samples (15 ms at 16 kHz), a frame step of 120, and an FFT length of 256.
- Normalization: Each spectrogram was normalized before being fed to the model.
- Label Encoding: Transcriptions were split into characters and mapped to integer indices using the vocabulary.
- Batching: A batch size of 32 was used.
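A minimal sketch of this pipeline in TensorFlow is shown below. The STFT parameters match those listed above; the illustrative character set, the per-utterance mean/variance normalization, and the use of `tf.data` with padded batches are assumptions rather than the original implementation.

```python
import tensorflow as tf

# Illustrative character set; the real vocabulary has 38 symbols
# (Arabic characters plus space), with the OOV token handled by StringLookup.
characters = list("ابتثجحخدذرزسشصضطظعغفقكلمنهويءآأؤإئةى ")
char_to_num = tf.keras.layers.StringLookup(vocabulary=characters, oov_token="")

frame_length = 240  # 15 ms at 16 kHz
frame_step = 120
fft_length = 256

def encode_single_sample(wav_path, transcription):
    """Load a WAV file, compute its normalized spectrogram, and encode the label."""
    audio, _ = tf.audio.decode_wav(tf.io.read_file(wav_path))  # 16-bit PCM, 16 kHz
    audio = tf.squeeze(audio, axis=-1)

    # Short-Time Fourier Transform -> magnitude spectrogram
    spectrogram = tf.signal.stft(
        audio, frame_length=frame_length, frame_step=frame_step, fft_length=fft_length
    )
    spectrogram = tf.abs(spectrogram)

    # Normalization (assumed here: zero mean, unit variance per frequency bin)
    means = tf.math.reduce_mean(spectrogram, axis=0, keepdims=True)
    stddevs = tf.math.reduce_std(spectrogram, axis=0, keepdims=True)
    spectrogram = (spectrogram - means) / (stddevs + 1e-10)

    # Split the transcription into characters and map them to integer ids
    label = char_to_num(tf.strings.unicode_split(transcription, input_encoding="UTF-8"))
    return spectrogram, label

# Padded batches of 32 (wav_paths / transcriptions come from the loading sketch above).
train_ds = (
    tf.data.Dataset.from_tensor_slices((wav_paths, transcriptions))
    .map(encode_single_sample, num_parallel_calls=tf.data.AUTOTUNE)
    .padded_batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```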
The model architecture is as follows (a Keras sketch is provided after the list):
- Accepts spectrogram inputs of shape `(None, input_dim)`.
- Conv1: 96 filters, kernel size `[11, 41]`, strides `[2, 2]`, followed by ReLU activation.
- Conv2: 128 filters, kernel size `[11, 21]`, strides `[1, 2]`, followed by ReLU activation.
- Reshapes the output of the convolutional layers into a sequence of feature vectors (time steps × features) for the RNN layers.
- Bidirectional GRU Layers: Five layers, each with 768 units, to capture temporal dependencies. Each GRU layer uses `tanh` activation and `sigmoid` recurrent activation. The outputs of the forward and backward GRU cells are concatenated.
- A fully connected layer with `2 * rnn_units` units, followed by ReLU activation.
- A dense layer with `output_dim + 1` units (the extra unit is the CTC blank token) and a softmax activation function to predict character probabilities.
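Below is a sketch of how this architecture could be assembled in Keras. The layer counts, kernel sizes, strides, and unit sizes follow the list above; the `padding="same"` setting, the reshape details, and the model name are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_dim, output_dim, rnn_units=768):
    """DeepSpeech2-style sketch of the architecture described above."""
    inputs = layers.Input(shape=(None, input_dim), name="spectrogram")
    # Add a channel dimension so 2D convolutions can run over (time, frequency).
    x = layers.Reshape((-1, input_dim, 1))(inputs)

    # Conv1: 96 filters, kernel [11, 41], strides [2, 2], ReLU
    x = layers.Conv2D(96, [11, 41], strides=[2, 2], padding="same", activation="relu")(x)
    # Conv2: 128 filters, kernel [11, 21], strides [1, 2], ReLU
    x = layers.Conv2D(128, [11, 21], strides=[1, 2], padding="same", activation="relu")(x)

    # Collapse (frequency, channels) into one feature axis for the RNN layers.
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Five bidirectional GRU layers, 768 units each, tanh / sigmoid activations,
    # with forward and backward outputs concatenated.
    for _ in range(5):
        x = layers.Bidirectional(
            layers.GRU(rnn_units, activation="tanh", recurrent_activation="sigmoid",
                       return_sequences=True),
            merge_mode="concat",
        )(x)

    # Fully connected layer with 2 * rnn_units units and ReLU.
    x = layers.Dense(2 * rnn_units, activation="relu")(x)
    # Output layer: output_dim + 1 units (the extra unit is the CTC blank), softmax.
    outputs = layers.Dense(output_dim + 1, activation="softmax")(x)

    return keras.Model(inputs, outputs, name="egyptian_arabic_asr")
```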
- Optimizer: Adam.
- Loss Function: Connectionist Temporal Classification (CTC) loss, wired up as sketched below.
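Keras does not provide CTC as a built-in loss string, so a common pattern is a small wrapper around `keras.backend.ctc_batch_cost`. The wrapper below and the learning rate are assumptions, not taken from the original code; `build_model`, `fft_length`, and `char_to_num` come from the sketches above.

```python
import tensorflow as tf
from tensorflow import keras

def ctc_loss(y_true, y_pred):
    """CTC loss over full-length model outputs and padded label batches."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

    return keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

# input_dim is the number of STFT bins (fft_length // 2 + 1);
# output_dim is the vocabulary size, so the output layer adds one unit for the blank.
model = build_model(
    input_dim=fft_length // 2 + 1,
    output_dim=char_to_num.vocabulary_size(),
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss=ctc_loss)
```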
This model combines Convolutional Neural Networks (CNNs) for feature extraction with Recurrent Neural Networks (RNNs) for sequence modeling, making it well-suited for end-to-end speech recognition. The convolutional layers capture local spectral patterns in the spectrogram, while the bidirectional GRU layers model longer-range temporal dependencies.
You can download the model weights from here.