How to add support for a new language that is in endangered language category #1162
khera-shanu started this conversation in Ideas
-
To fine-tune Whisper, you would need a few hours of labeled (transcribed) recordings, split into segments shorter than 30 seconds.
I think the use of the Latin alphabet would make fine-tuning marginally easier, but a lot of research effort will still be needed for the transcription to work reliably. A somewhat related effort was done for Hindi recently.
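As a rough illustration of the "segments shorter than 30 seconds" requirement, here is a minimal sketch that greedily groups consecutive transcribed utterances into segments that stay under the limit. It assumes you already have utterance-level timestamps and transcripts from whatever annotation tool you use; the `Utterance` type and `group_utterances` function are hypothetical names for this example, not part of Whisper.

```python
from dataclasses import dataclass

# Segment length limit mentioned above (Whisper works on 30-second windows).
MAX_SEGMENT_SECONDS = 30.0


@dataclass
class Utterance:
    """Hypothetical record for one transcribed utterance."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str     # transcript in the Latin alphabet


def group_utterances(utterances, max_seconds=MAX_SEGMENT_SECONDS):
    """Greedily merge consecutive utterances into segments shorter than max_seconds.

    Returns a list of (start, end, text) tuples.
    """
    groups = []
    current = []
    for utt in utterances:
        # A single utterance that is already too long cannot be fixed by grouping.
        if utt.end - utt.start >= max_seconds:
            raise ValueError(
                f"utterance starting at {utt.start}s is too long; split it manually"
            )
        # Close the current group if adding this utterance would reach the limit.
        if current and utt.end - current[0].start >= max_seconds:
            groups.append(current)
            current = []
        current.append(utt)
    if current:
        groups.append(current)
    return [
        (group[0].start, group[-1].end, " ".join(u.text for u in group))
        for group in groups
    ]
```

Each returned tuple gives the time span to cut from the audio file and the transcript that goes with it. The actual audio cutting and the fine-tuning itself are left out here, since they depend on the tools you pick.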
-
I want to do this for the Tangkhul language - Wiki Page
It doesn't have its own writing system and uses the Latin alphabet. There is also not much written content: 2-3 books, a few dictionaries, and one book explaining its grammar that came out of someone's Ph.D. research.
What is the minimum I will need to train Whisper to recognize a language like that? I am new to AI but not to programming, and I am passionate about solving this problem just for the sake of solving it.
It's fine if someone just points me in the right direction; I'm not looking for spoon-feeding. Thanks for reading.
P.S. More than 200k people use this language, and you can assume 10% of them have access to an internet connection and a smartphone. I'm open to ideas for crowdsourcing the data collection.