Conversation

@jerinphilip (Owner) commented on Jan 22, 2024

Work in progress

The eventual goal is to implement beam search for transliteration, so that the decoder generates multiple candidates (unlike the greedy decoding used after forced overfitting in the translation case).

seq2seq is overkill for transliteration; the overkill is mostly for an expert user of a deterministic (rather than statistical) IME. The hope behind this effort is that neural networks, being powerful enough function approximators, can make life easier for a non-expert user. The following benefits come to mind:

  1. No need to switch between case alterations and symbols (~); simply type in all lowercase letters and get the most likely associated outputs.
  2. Use beam search to generate the most likely target candidates for a given source variation (see the sketch after this list).
  3. Robustness to typing errors and noise. Some masked-character training should allow the network to guess the most suitable character/subword from context.
  4. Long-context selection. GBoard (WFSTs?) fails on really long agglutinated sequences, while seq2seq with transformers appears to do better on a cursory try-out (this claim will have to be validated).
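
For concreteness, here is a minimal beam-search sketch in Python. It is only a toy stand-in for the real seq2seq decoder: `fake_step`, its toy distribution, and all names here are illustrative and not part of this PR.

```python
# Minimal beam-search sketch (hypothetical). In practice, fake_step would be
# replaced by the seq2seq model's next-character log-probabilities.
import heapq
import math

EOS = "</s>"

def fake_step(source, prefix):
    """Stand-in for the model: map (source, decoded prefix) to {next_char: log_prob}."""
    if len(prefix) >= len(source):
        return {EOS: math.log(0.9), source[-1]: math.log(0.1)}
    ch = source[len(prefix)]
    return {ch: math.log(0.7), ch.upper(): math.log(0.2), EOS: math.log(0.1)}

def beam_search(source, beam_size=5, max_len=32):
    """Return up to beam_size (candidate, cumulative log-prob) pairs, best first."""
    beams = [(0.0, "")]          # (cumulative log-prob, decoded prefix)
    finished = []
    for _ in range(max_len):
        expansions = []
        for score, prefix in beams:
            for ch, logp in fake_step(source, prefix).items():
                if ch == EOS:
                    finished.append((score + logp, prefix))
                else:
                    expansions.append((score + logp, prefix + ch))
        if not expansions:
            break
        # Keep only the beam_size best partial hypotheses.
        beams = heapq.nlargest(beam_size, expansions)
    # Include unfinished hypotheses as a fallback.
    finished.extend(beams)
    return [(prefix, score) for score, prefix in heapq.nlargest(beam_size, finished)]

if __name__ == "__main__":
    for candidate, score in beam_search("naal"):
        print(f"{candidate}\t{score:.4f}")
```

The point of the sketch is the n-best behaviour: instead of committing to the single argmax continuation at each step, the decoder keeps several hypotheses alive and returns all of them, which is what the transliteration UI needs.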

The model trained for a first exploration already provides reasonable variation among the candidates:

naal
0 ||| നാൽ ||| F0= -0.247635 ||| -0.247635
0 ||| നാൾ ||| F0= -2.30577 ||| -2.30577
0 ||| നാല് ||| F0= -2.57854 ||| -2.57854
0 ||| നാള് ||| F0= -4.42439 ||| -4.42439
0 ||| നാല ||| F0= -4.5098 ||| -4.5098
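
The output above appears to be a standard n-best list in the `id ||| hypothesis ||| feature scores ||| total score` layout. Assuming that layout, a small parsing sketch follows; the `Candidate` type and field names are illustrative, not part of this PR.

```python
# Sketch for consuming an n-best list like the one above (layout assumed).
from typing import List, NamedTuple

class Candidate(NamedTuple):
    sentence_id: int
    text: str
    features: str
    total: float  # total log-probability; higher is better

def parse_nbest(lines: List[str]) -> List[Candidate]:
    candidates = []
    for line in lines:
        sid, text, features, total = [f.strip() for f in line.split("|||")]
        candidates.append(Candidate(int(sid), text, features, float(total)))
    # Sort best (highest log-probability) first.
    return sorted(candidates, key=lambda c: c.total, reverse=True)

if __name__ == "__main__":
    sample = [
        "0 ||| നാൽ ||| F0= -0.247635 ||| -0.247635",
        "0 ||| നാൾ ||| F0= -2.30577 ||| -2.30577",
    ]
    for c in parse_nbest(sample):
        print(c.text, c.total)
```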

@jerinphilip changed the title from "Add transliteration with beam-search (multiple candidates)" to "Add transliteration with beam-search" on Jan 22, 2024