train new languages #105

Open
pantau000 opened this issue Jun 2, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@pantau000

Is there any possibility a user could train a new language?

@pantau000 pantau000 added the enhancement New feature or request label Jun 2, 2023
@octimot
Owner

octimot commented Jun 7, 2023

Hello!

Not directly from the tool, but it could be possible, although it requires some coding skills and a good dataset (audio + transcription pairs).

What language are you trying to add?

Cheers

@octimot octimot closed this as completed Jun 7, 2023
@octimot octimot reopened this Jun 7, 2023
@pantau000
Author

Cape Verdean Creole (kea).

Unfortunately, there is no data set at all, as it is mainly a spoken language. It would be possible, however, to find a lot of audio recordings.

I wonder if AI can achieve that at all, learning to transcribe audio without having transcription examples? Probably not...

@octimot
Owner

octimot commented Jun 7, 2023

> I wonder if AI can achieve that at all, learning to transcribe audio without having transcription examples? Probably not...

Languages share a lot of similarities with each other, and I wouldn't be surprised if AI were able to do that soon tbh...

> Unfortunately, there is no data set at all, as it is mainly a spoken language. It would be possible, however, to find a lot of audio recordings.

I'm not sure how large the dataset needs to be, but you definitely need audio+transcription pairs, so someone has to prepare the transcriptions manually for training. It also depends on how similar this Creole language is to others that the Whisper model might already recognize.
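As a rough illustration of what "audio+transcription pairs" could look like on disk, here is a minimal sketch that matches each audio clip to a same-named text file and writes the pairs to a JSONL manifest. The folder layout, the `.wav`/`.txt` naming, the `build_manifest` helper, and the manifest fields are all assumptions made for this example, not a format required by the tool or by Whisper.

```python
import json
from pathlib import Path


def build_manifest(audio_dir, transcript_dir, manifest_path, audio_ext=".wav"):
    """Pair each audio clip with a same-named .txt transcript and write a JSONL manifest.

    Returns (pairs, missing): the paired records and the clips that still
    need a manual transcription.
    """
    audio_dir, transcript_dir = Path(audio_dir), Path(transcript_dir)
    pairs, missing = [], []
    for audio in sorted(audio_dir.glob(f"*{audio_ext}")):
        transcript = transcript_dir / (audio.stem + ".txt")
        text = transcript.read_text(encoding="utf-8").strip() if transcript.exists() else ""
        if text:
            pairs.append({"audio": str(audio), "text": text})
        else:
            # No transcript file, or an empty one: someone still has to write it.
            missing.append(audio.name)
    with open(manifest_path, "w", encoding="utf-8") as fh:
        for record in pairs:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return pairs, missing
```

The `missing` list doubles as a to-do list of recordings that still need a human transcription before they can be used for training.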

BTW, have you tried not selecting any language and simply running the large-v2 model on the audio you have to transcribe and translate?
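Outside the tool, the same experiment could be sketched with the `openai-whisper` Python package, assuming it is installed. Passing `language=None` leaves language detection to the model, and `task="translate"` asks for an English translation instead of a same-language transcript. The helper names and the audio file name are placeholders for this example; how the tool itself calls Whisper is not shown in this thread.

```python
def transcription_options(language=None, translate=False):
    """Build keyword arguments for Whisper's transcribe() call.

    language=None means "do not force a language; let the model detect it".
    """
    return {
        "language": language,
        "task": "translate" if translate else "transcribe",
    }


def run_whisper(audio_path, model_name="large-v2", **kwargs):
    """Transcribe one file and return (detected_language, text)."""
    # Imported lazily so this file can be loaded without openai-whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **transcription_options(**kwargs))
    return result["language"], result["text"]


# Example call (needs openai-whisper, ffmpeg, and a real audio file):
# detected, text = run_whisper("recording.wav", translate=True)
```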

There's also a huge library of models on HuggingFace, BTW: https://huggingface.co/models?library=whisper

If adding custom models to the tool would help, we could find a way to add it to our backlog...

@pantau000
Author

pantau000 commented Jun 7, 2023

> I'm not sure how large the dataset needs to be, but you definitely need audio+transcription pairs - so someone has to prepare the transcriptions manually for training. It also depends how similar this Creole language is to others that the Whisper model might already recognize.

There is definitely no transcription data set available, probably because of the lack of written accounts/transcriptions. I checked a couple of sites.

> BTW, have you tried not selecting any language and simply running the large-v2 model on the audio you have to transcribe and translate?

Cape Verdean Creole is based on Portuguese. If I don't select a language, Portuguese gets selected, but the transcriptions are obviously quite erroneous.
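To put a number on "quite erroneous", one option is to hand-correct a short sample and compute the word error rate (WER) of the model output against it. The sketch below is a plain word-level edit distance in standard-library Python; the `word_error_rate` name is made up for this example, and the lowercasing and whitespace splitting are simplifying assumptions (punctuation is not normalized).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0.0 means the model output matches the corrected sample word for word; values can exceed 1.0 when the output contains many extra words.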

> If adding custom models to the tool would help, we could find a way to add it to our backlog...

I think it would be worth trying to add a new language that the model could identify, so that the model could somehow (how?) learn from the user's feedback (the user could correct the transcript), much like OCR can learn from the user's feedback to better recognize individual images/texts. But this of course goes far beyond your current approach.
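One way this feedback idea could be approached, sketched under assumptions: the model would not learn while transcribing, but each correction a user makes could be logged as an audio+text pair and used later in a separate fine-tuning run. The `record_correction` helper, the log file, and its fields are hypothetical and not part of the tool.

```python
import json


def record_correction(log_path, audio_path, start, end, model_text, corrected_text):
    """Append one user-corrected segment to a JSONL log; return True if written."""
    corrected_text = corrected_text.strip()
    if not corrected_text or corrected_text == model_text.strip():
        # This log holds only actual corrections, so unchanged or empty segments are skipped.
        return False
    record = {
        "audio": audio_path,
        "start": start,  # segment start, in seconds
        "end": end,      # segment end, in seconds
        "model_text": model_text.strip(),
        "text": corrected_text,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Over time such a log would grow into the kind of audio+transcription dataset discussed earlier in the thread, without anyone having to transcribe recordings from scratch.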
