train new languages #105

pantau000 · 2023-06-02T12:23:48Z

Is there any possibility a user could train a new language?

octimot · 2023-06-07T12:10:51Z

Hello!

Not directly from the tool, but it could be possible, although it requires some coding skills and a good dataset (audio + transcription pairs).

What language are you trying to add?

Cheers

pantau000 · 2023-06-07T12:15:21Z

Cape Verdean Creole (kea).

Unfortunately, there is no data set at all, as it is mainly a spoken language. It would be possible, however, to find a lot of audio recordings.

I wonder if AI can achieve that at all, learning to transcribe audio without having transcription examples? Probably not...

octimot · 2023-06-07T12:36:25Z

> I wonder if AI can achieve that at all, learning to transcribe audio without having transcription examples? Probably not...

Languages share a lot of similarities with each other, and I wouldn't be surprised if AI were able to do that soon, tbh...

> Unfortunately, there is no data set at all, as it is mainly a spoken language. It would be possible, however, to find a lot of audio recordings.

I'm not sure how large the dataset needs to be, but you definitely need audio+transcription pairs - so someone has to prepare the transcriptions manually for training. It also depends on how similar this Creole language is to others that the Whisper model might already recognize.
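To make "audio+transcription pairs" a bit more concrete, here is a minimal sketch of how such a dataset could be collected into a manifest. The folder layout (each recording next to a same-named `.txt` transcript), the JSONL output, the list of audio extensions and the function name `build_manifest` are all assumptions for illustration, not something the tool or Whisper requires.

```python
import json
from pathlib import Path

# Assumed set of audio formats; adjust to whatever the recordings use.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def build_manifest(data_dir, manifest_path):
    """Pair each audio file with a same-named .txt transcript.

    Writes one JSON object per line to manifest_path and returns
    (pairs, missing), where missing lists audio files that have no
    usable transcript yet.
    """
    data_dir = Path(data_dir)
    pairs, missing = [], []
    for audio in sorted(data_dir.iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # skip transcripts, notes and other files
        transcript = audio.with_suffix(".txt")
        text = ""
        if transcript.exists():
            text = transcript.read_text(encoding="utf-8").strip()
        if text:
            pairs.append({"audio": audio.name, "text": text})
        else:
            missing.append(audio.name)  # still needs manual transcription
    with open(manifest_path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return pairs, missing
```

The `missing` list is the part that matters here: it shows how many recordings still need someone to write a transcript before they can be used for training.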

BTW, have you tried not selecting any language and simply running the large-v2 model on the audio you have to transcribe and translate?
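Outside the tool, the same experiment could look roughly like the sketch below. It assumes the open-source `openai-whisper` Python package is installed; the helper names `transcribe_options` and `run` are made up for this example. Leaving `language` as `None` lets the model detect the language itself, and `task="translate"` asks for an English translation instead of a transcript.

```python
def transcribe_options(language=None, task="transcribe"):
    """Build keyword arguments for Whisper's transcribe() call.

    language=None means no language is passed, so the model
    auto-detects it from the audio.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    options = {"task": task}
    if language is not None:
        options["language"] = language
    return options


def run(audio_path, language=None, task="transcribe", model_name="large-v2"):
    """Run Whisper on one file and return (detected_language, text)."""
    # Imported lazily so the helper above works without the package.
    import whisper  # assumes: pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **transcribe_options(language, task))
    return result["language"], result["text"]
```

Calling `run("recording.wav")` (a placeholder file name) returns the language code the model picked along with the text, which makes it easy to see what gets detected when nothing is selected.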

There's also a huge library of models on HuggingFace, BTW: https://huggingface.co/models?library=whisper

If adding custom models to the tool would help, we could find a way to add it to our backlog...

pantau000 · 2023-06-07T12:52:58Z

> I'm not sure how large the dataset needs to be, but you definitely need audio+transcription pairs - so someone has to prepare the transcriptions manually for training. It also depends how similar this Creole language is to others that the Whisper model might already recognize.

There is definitely no transcription data set available, probably because of the lack of written accounts/transcriptions. I checked a couple of sites.

> BTW, have you tried not selecting any language and simply running the large-v2 model on the audio you have to transcribe and translate?

Cape Verdean Creole is based on Portuguese. If I don't select a language, Portuguese gets selected, but the transcriptions are obviously quite erroneous.

> If adding custom models to the tool would help, we could find a way to add it to our backlog...

I think it would be worth trying to add a new language that could be identified by the model, so that the model would somehow (how?) be able to learn from the user's feedback (the user could correct the transcript), much like OCR can learn from the user's feedback to better recognize individual images/texts. But this of course goes far beyond your current approach.
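One possible answer to the "how?" is that user corrections could simply be collected as new audio+transcription pairs for a later fine-tuning run. The sketch below only shows that collection step, under the assumption that the machine transcript and the corrected transcript keep the same segments in the same order; the `Segment` class and `corrections_to_pairs` are hypothetical names, not part of the tool or of Whisper, and nothing here retrains a model by itself.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def corrections_to_pairs(audio_file, machine_segments, corrected_segments):
    """Return one training example per segment the user actually changed.

    Assumes both lists describe the same segments in the same order.
    """
    pairs = []
    for machine, corrected in zip(machine_segments, corrected_segments):
        fixed = corrected.text.strip()
        if fixed and fixed != machine.text.strip():
            pairs.append({
                "audio": audio_file,
                "start": machine.start,
                "end": machine.end,
                "text": fixed,  # the user's version becomes the target text
            })
    return pairs
```

Each returned entry points at a time range in the recording plus the corrected text, so over time a user's edits would build up the kind of dataset that does not exist yet for this language.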

pantau000 added the enhancement (New feature or request) label Jun 2, 2023

octimot closed this as completed Jun 7, 2023

octimot reopened this Jun 7, 2023
