In-progress project on forced alignment with the Montreal Forced Aligner for Armenian

jhdeov/armenianMFA

Armenian MFA

In-progress project on forced alignment of Armenian using the Montreal Forced Aligner.

We trained an acoustic model on the Armenian data from the FLEURS dataset. The dataset contains around 14 hours of Eastern Armenian speech (n=4380 sound files). We normalized the transcripts in the following ways:

  • removed word-internal punctuation
  • removed word-external punctuation
  • converted digits into number lemmas
  • identified errors in the transcripts
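
The punctuation and digit steps above can be sketched roughly as follows. This is only an illustration, not the script used for this project: the function name `normalize_transcript` and the `digit_words` lookup table are hypothetical, and the sketch assumes that every character in a Unicode punctuation category should be dropped and that digits appear as standalone tokens.

```python
import unicodedata


def normalize_transcript(text, digit_words=None):
    """Strip punctuation and replace digit tokens (hypothetical helper).

    digit_words is an assumed lookup table from digit strings to
    spelled-out number words; it has to be supplied by the caller.
    """
    digit_words = digit_words or {}
    tokens = []
    for token in text.split():
        # Drop every Unicode punctuation character (categories P*),
        # whether it sits inside the word or at its edges.
        cleaned = "".join(
            ch for ch in token
            if not unicodedata.category(ch).startswith("P")
        )
        if not cleaned:
            # The token consisted of punctuation only.
            continue
        # Replace a digit token when the lookup table knows it.
        tokens.append(digit_words.get(cleaned, cleaned))
    return " ".join(tokens)
```

Note that this sketch also deletes hyphens, which joins the two halves of a hyphenated word; whether that is the right choice depends on the pronunciation dictionary.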

We manually created a pronunciation dictionary by checking the tokens in FLEURS against the Armenian Wiktionary entries in WikiPron.
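
One way to support this kind of manual check is to list the transcript tokens that have no entry in a pronunciation list. The sketch below is an assumption about how that could be done, not the procedure we followed: the helper names are hypothetical, and it assumes each lexicon line holds a word, a tab, and a space-separated phone sequence.

```python
def load_pronunciations(lines):
    """Build a word -> list of pronunciations mapping (hypothetical helper).

    Assumes each line looks like "word<TAB>phone phone phone".
    """
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, _, pron = line.partition("\t")
        # A word may have several pronunciations, so collect them all.
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon


def missing_tokens(transcripts, lexicon):
    """Return the sorted tokens that have no entry in the lexicon."""
    seen = set()
    for transcript in transcripts:
        seen.update(transcript.split())
    return sorted(seen.difference(lexicon))
```

The tokens returned by `missing_tokens` are the ones that would still need a manually written pronunciation.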

We first trained the model with a beam of 100, which produced TextGrids with word-level and phone-level alignments for 4324 of the 4380 sound files. We then re-ran the model on the data with a beam of 1000, which produced TextGrids for 4379 sound files. The one remaining file seems to be broken.
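
To see which sound files did not receive a TextGrid after a run, the two sets of file names can be compared by stem. The following is a minimal sketch under the assumption that each TextGrid shares its base name with its sound file; `unaligned_files` is a hypothetical helper, not part of MFA.

```python
from pathlib import PurePath


def unaligned_files(audio_names, textgrid_names):
    """List audio files without a TextGrid of the same base name."""
    aligned = {PurePath(name).stem for name in textgrid_names}
    return sorted(
        name for name in audio_names
        if PurePath(name).stem not in aligned
    )
```

For example, `unaligned_files(["a.wav", "b.wav"], ["a.TextGrid"])` returns `["b.wav"]`.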

Each TextGrid has the following structure:

  • words tier, generated by MFA.
  • phones tier, generated by MFA.
  • sentenceOriginal tier, manually generated. Lists the original transcript from FLEURS.
  • sentenceNormalized tier, manually generated. Lists the transcript that we created by normalizing the sentenceOriginal tier. The model was run over this tier.
  • notes tier, manually generated.
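
A quick check that a TextGrid contains the five tiers listed above could look like the sketch below. It assumes the file is stored in Praat's long text format, where each tier is introduced by a line of the form `name = "..."`; the helper names are hypothetical, and a dedicated TextGrid library would be more robust.

```python
import re

# Tier names as listed in this README.
EXPECTED_TIERS = [
    "words",
    "phones",
    "sentenceOriginal",
    "sentenceNormalized",
    "notes",
]


def tier_names(textgrid_text):
    """Extract tier names from a long-format TextGrid string."""
    return re.findall(r'^\s*name = "([^"]*)"', textgrid_text, flags=re.MULTILINE)


def missing_tiers(textgrid_text):
    """Return the expected tiers that are absent from the TextGrid."""
    present = set(tier_names(textgrid_text))
    return [name for name in EXPECTED_TIERS if name not in present]
```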
