Building lexicons in Python #191

pzelasko · 2021-05-11T16:07:25Z

The current setup inherits building lexicon FSTs from Kaldi. I think it makes sense to have the ability to build it directly in Python, which should make building new recipes easier, as well as (eventually) allow for some things like dynamic expansion of the lexicon without leaving Python.

The data structure would basically resemble that of Kaldi, e.g.:

class Dict:
  # a list of words and their phone transcripts, possibly with scores to resemble lexiconp.txt
  lexicon: List[Tuple[str, List[str]]]

  # OOV word symbol
  oov: str

  # optional silence phone symbol
  optional_silence: str

  # a list of silence phone symbols (maybe we should call them special symbols? spoken noise is not really silence)
  silence_phones: List[str]

  # a list of nonsilence phone symbols
  nonsilence_phones: List[str]

  @property
  def words(self) -> List[str]:
    """A sorted list of unique words in Dict. Includes <eps>, #0, <s> and </s>"""
  
  @property
  def phones(self) -> List[str]:
    """A sorted list of unique phones in Dict."""

and methods:

def save(self, path):
  """Save into a file or a directory (maybe same as Kaldi's data dir)"""

@classmethod
def load(cls, path) -> 'Dict':
  """Read all the information from a path"""

def compile_lexicon_fst(self) -> k2.Fsa:
  """Adds disambiguation symbols and compiles L.fst"""

def extend(self, lexicon: List[Tuple[str, List[str]]]) -> k2.Fsa:
  """Adds new words and their corresponding phone transcripts into Dict. Checks for compatibility with the phone set."""

Kaldi's prepare_lang.sh has accumulated a lot of options, so I'd like some feedback on which of them are worth keeping:

- num sil/nonsil states and share_silence_phones are currently unused and probably not needed anymore?
- position-dependent phones seem superfluous in our current setups; not sure whether they'll be useful?
- could unk-fst still be useful?
- silprob/sil_prob: is it worth supporting?

We can of course start from something minimal and extend it... It does seem like a substantial amount of work but I think it's worth it and I can give it a shot, or at least lay some groundwork. What do you guys think? Also, I want to make sure I wouldn't be duplicating anybody's effort.
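As one possible piece of "something minimal", the disambiguation-symbol step that compile_lexicon_fst would need can be sketched in plain Python. This is only a sketch, loosely modeled on the idea behind Kaldi's add_lex_disambig.pl: a pronunciation gets a `#k` symbol appended if its phone sequence is shared by several entries or is a proper prefix of another pronunciation. The function name `add_disambig_symbols` and the `(word, phones)` tuple layout are assumptions made for illustration, not an existing k2 or Kaldi API.

```python
from collections import Counter
from typing import Dict, List, Tuple

Lexicon = List[Tuple[str, List[str]]]


def add_disambig_symbols(lexicon: Lexicon) -> Tuple[Lexicon, int]:
    """Append #1, #2, ... to ambiguous pronunciations (hypothetical helper).

    Returns the new lexicon and the largest disambiguation index used.
    """
    # How many entries share each exact phone sequence.
    counts = Counter(tuple(phones) for _, phones in lexicon)
    # Every proper, non-empty prefix of every pronunciation.
    prefixes = set()
    for _, phones in lexicon:
        for n in range(1, len(phones)):
            prefixes.add(tuple(phones[:n]))

    last_used: Dict[Tuple[str, ...], int] = {}
    max_index = 0
    result: Lexicon = []
    for word, phones in lexicon:
        key = tuple(phones)
        if counts[key] > 1 or key in prefixes:
            # Each repeated sequence gets its own running counter.
            index = last_used.get(key, 0) + 1
            last_used[key] = index
            max_index = max(max_index, index)
            result.append((word, list(phones) + [f"#{index}"]))
        else:
            result.append((word, list(phones)))
    return result, max_index


lexicon = [
    ("red", ["r", "eh", "d"]),
    ("read", ["r", "eh", "d"]),
    ("re", ["r", "eh"]),
    ("dog", ["d", "ao", "g"]),
]
disambiguated, max_index = add_disambig_symbols(lexicon)
for word, phones in disambiguated:
    print(word, " ".join(phones))
```

Numbering starts at `#1` here on the assumption that `#0` stays reserved for the language model, as the `words` docstring above suggests.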

danpovey · 2021-05-12T04:55:17Z

num sil/nonsil states relates to the topo, so probably doesn't belong in the dict.

If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5.
I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.
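A rough sketch of such a transformation on the dict is below. The thread does not say which two classes would be used, so the split into word-initial (`_B`) and all other positions (`_I`) is purely an assumption for illustration, as are the function name and the suffixes.

```python
from typing import List, Tuple

Lexicon = List[Tuple[str, List[str]]]


def to_position_dependent(lexicon: Lexicon) -> Lexicon:
    """Tag each phone as word-initial (_B) or not (_I).

    Hypothetical two-class scheme; it doubles the phone-set size at most.
    """
    result: Lexicon = []
    for word, phones in lexicon:
        tagged = [
            f"{p}_B" if i == 0 else f"{p}_I" for i, p in enumerate(phones)
        ]
        result.append((word, tagged))
    return result


lexicon = [("cat", ["k", "ae", "t"]), ("a", ["ah"])]
tagged = to_position_dependent(lexicon)
phone_set = sorted({p for _, pron in tagged for p in pron})
print(tagged)
print(phone_set)
```

Because the input and output are both plain lexicons, the Dict itself would not need to know about word positions.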

unk-fst may still be useful, I guess, but I think we can leave it separate from Dict, for now at least.

silprob: my feeling is we may not need it since it can just be absorbed into the probability of silence in the acoustic model
(if we're training with LF-MMI and other sequence criteria, removing it shouldn't remove any modeling power).

There is even a question whether silence_phones / nonsilence_phones belong in the Dict. It's not clear what uses we have for them right now. We do need the opt_sil, though, so we can turn the Dict into an Fst (note: None should be allowable).

Also: turning the Dict into an Fsa may not be the most efficient method of graph-building (at least for supervisions). One possibility is to turn the Dict into an FsaVec and introduce a new indexing operation whereby an FsaVec can be indexed by an Fsa or FsaVec. The idea is that an expression a[b] gives you something with the top-level structure of b, but where each arc in b with a label x is replaced by the Fsa a[x], with the start-state and final-state of a[x] being identified with the source-state and destination-state of the arc in b, and any additional states in a[x] being inserted somewhere in the result (e.g. just after the source-state of the arc).

I would propose that epsilon be treated as a normal symbol, with element 0 of a being what we replace epsilon arcs with (likely just a single arc from start-state to final-state); that the last element of a be used when the symbol in b is -1; and that -1 arcs in a be replaced with 0 if their destination-state in a[b] is not a final-state. This way, we could put the optional silence at the start of all the individual FSAs, and the final FSA in a would also have the optional silence that may be present at end-of-sentence.
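To make the "optional silence at the start of each individual FSA" idea concrete, here is a small sketch that builds the arc list of one element of `a` in plain Python. It deliberately does not call k2; the `(src_state, dst_state, label, score)` tuple layout, the function name `word_fsa_arcs`, and the zero scores are assumptions for illustration only. Passing `opt_sil=None` gives a pronunciation with no silence, and an empty phone list with `opt_sil` set gives the end-of-sentence element.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, int, float]  # (src_state, dst_state, label, score)


def word_fsa_arcs(phone_ids: List[int],
                  opt_sil: Optional[int] = None) -> List[Arc]:
    """Linear FSA for one pronunciation, with optional leading silence.

    Label 0 is epsilon and label -1 marks the arc into the final state.
    No arc enters state 0, so the FSA can safely be spliced into a graph.
    """
    arcs: List[Arc] = []
    state = 0
    if opt_sil is not None:
        arcs.append((0, 1, 0, 0.0))        # skip the silence
        arcs.append((0, 1, opt_sil, 0.0))  # or consume one silence phone
        state = 1
    for p in phone_ids:
        arcs.append((state, state + 1, p, 0.0))
        state += 1
    arcs.append((state, state + 1, -1, 0.0))  # final arc
    return arcs


SIL = 1  # hypothetical id of the optional-silence phone
with_sil = word_fsa_arcs([5, 6], opt_sil=SIL)
without_sil = word_fsa_arcs([5, 6], opt_sil=None)
end_of_sentence = word_fsa_arcs([], opt_sil=SIL)
print(with_sil)
print(without_sil)
print(end_of_sentence)
```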

danpovey · 2021-05-12T04:56:56Z

@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

csukuangfj · 2021-05-12T06:10:57Z

> @csukuangfj I'll talk to Kangwei about doing this,

Cool!

francisr · 2021-05-12T13:48:02Z

> If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5.
> I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.

Even if there's no WER gain with position dependent phones, it's useful for fast lattice alignment.

pzelasko · 2021-05-12T13:51:55Z

> @csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

Cool! In that case, I won't start working on it to avoid duplicated effort.

danpovey · 2021-05-12T15:12:39Z

Clarification: for Kangwei, I was just talking about implementing that function involving indexing an FsaVec with Fsas or FsaVecs, in k2. This shouldn't be necessary for this lexicon-building Python code, since we'll need a way to turn the whole thing into an Fsa anyway. If you feel like implementing that indexing thing, I won't say no, since Kangwei won't join us for a couple of weeks.

pzelasko · 2021-05-12T15:38:28Z

Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and let's see then.

jtrmal · 2021-05-12T15:42:27Z

Ping me next week if you won't be interested in working on it. I think I would be, but this week I'm kinda overwhelmed by work. Y.

jtrmal · 2021-05-12T15:43:19Z

I can handle the C++. Up to you. I would like to pick up some task on k2, though. Y.

danpovey · 2021-05-13T03:58:22Z

Perhaps both of you could collaborate on it, e.g. someone writes the structure and someone fills it in? I suggest the following C++ interface (a little different from what I said before):

```
/*
  Replace, in `index`, labels
  symbol_range_begin <= label < symbol_range_begin + src.Dim0()
  with the Fsa indexed `label - symbol_range_begin` in `src`, identifying
  the source and destination states of the arc in `index` with the initial
  and final states in `src[label - symbol_range_begin]`. Arcs with labels
  outside this range are just copied.

  Caution: the result may not be a valid Fsa because labels on final-arcs
  in `src` (which will be -1) may end up on non-final arcs in the result;
  you can use FixFinalLabels() to fix this.

    @param [in] src  FsaVec containing individual Fsas that we'll be
                 inserting into the result. No Fsa in `src` may have arcs
                 entering its initial state; this function will crash if
                 this requirement is violated.
    @param [in] index  Fsa or FsaVec (2 or 3 axes) that dictates the
                 overall structure of the result (the result will have
                 the same number of axes as `index`).
    @param [in] symbol_range_begin  Beginning of the range (interval) of
                 symbols that are to be replaced with Fsas. Symbols numbered
                 symbol_range_begin <= i < symbol_range_begin + src.Dim0()
                 will be replaced with the Fsa in
                 `src[i - symbol_range_begin]`.
    @param [out,optional] arc_map_src  If not nullptr, will be set to a
                 new array that maps from arc-indexes in the result to the
                 corresponding arc in `src`, or -1 if there was no such arc
                 (for out-of-range symbols in `index`).
    @param [out,optional] arc_map_index  If not nullptr, will be set to a
                 new array that maps from arc-indexes in the result to the
                 arc in `index` that it originates from, only if it includes
                 the weight from that arc in `index`, and -1 otherwise.
                 Arcs that result from inserting an Fsa in `src` (say,
                 src[i]) include the weight from the arc in `index` if
                 the arc was from the initial state in src[i].
*/
FsaOrVec ReplaceFsa(FsaVec src, FsaOrVec index, int32_t symbol_range_begin,
                    Array1<int32_t> *arc_map_src = nullptr,
                    Array1<int32_t> *arc_map_index = nullptr);
```

Since we have an extra option symbol_range_begin, I suppose it might make sense to just make this a separate function/op at the Python level, like replace_fsa(), rather than trying to make it part of a generic indexing function.

Note on edits I just made: I removed RepairFinalSymbols() from the draft because there is now a FixFinalLabels() function that does the same thing; I added the requirement that FSAs in `src` may not have arcs entering their initial state; and I simplified a comment about `arc_map_index`.
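The semantics described in that comment can be prototyped in plain Python for the single-Fsa case. This is only a reference sketch under stated assumptions, not the k2 implementation: an Fsa is modeled as a list of `(src_state, dst_state, label, score)` tuples with state 0 initial and the highest-numbered state final, `arc_map_src` indexes the arcs of `src` as if they were flattened into one list, new states are placed right after the source state of the arc they replace, and final labels are not repaired afterwards.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, int, float]  # (src_state, dst_state, label, score)
Fsa = List[Arc]  # state 0 is initial, the highest-numbered state is final


def num_states(fsa: Fsa) -> int:
    return max(max(a[0], a[1]) for a in fsa) + 1


def replace_fsa(src: List[Fsa], index: Fsa, symbol_range_begin: int):
    """Pure-Python sketch; returns (result, arc_map_src, arc_map_index)."""
    sizes = [num_states(f) for f in src]
    arc_offsets = [0]  # first arc of each src Fsa in the flattened numbering
    for f in src:
        arc_offsets.append(arc_offsets[-1] + len(f))

    def lookup(label: int) -> Optional[int]:
        k = label - symbol_range_begin
        return k if 0 <= k < len(src) else None

    # Count the inner states each index state gains, then renumber.
    n_index = num_states(index)
    extra = [0] * n_index
    for s, _, label, _ in index:
        k = lookup(label)
        if k is not None:
            extra[s] += sizes[k] - 2
    new_id, shift = [], 0
    for s in range(n_index):
        new_id.append(s + shift)
        shift += extra[s]
    next_free = [i + 1 for i in new_id]

    rows = []  # (arc, arc_map_src entry, arc_map_index entry)
    for i, (s, d, label, score) in enumerate(index):
        k = lookup(label)
        if k is None:  # out-of-range label: copy the arc unchanged
            rows.append(((new_id[s], new_id[d], label, score), -1, i))
            continue
        base, final = next_free[s], sizes[k] - 1
        next_free[s] += sizes[k] - 2

        def remap(t: int) -> int:
            if t == 0:
                return new_id[s]  # initial state merges with arc source
            if t == final:
                return new_id[d]  # final state merges with arc destination
            return base + t - 1   # inner states follow the source state

        for j, (a, b, sub_label, sub_score) in enumerate(src[k]):
            from_initial = a == 0  # only these arcs carry the index weight
            w = sub_score + (score if from_initial else 0.0)
            rows.append(((remap(a), remap(b), sub_label, w),
                         arc_offsets[k] + j, i if from_initial else -1))

    rows.sort(key=lambda r: r[0][0])  # stable sort by source state
    return ([r[0] for r in rows], [r[1] for r in rows], [r[2] for r in rows])


src = [
    [(0, 1, 10, 0.0), (1, 2, 11, 0.0), (2, 3, -1, 0.0)],  # replaces label 100
    [(0, 1, 12, 0.0), (1, 2, -1, 0.0)],                   # replaces label 101
]
index = [(0, 1, 100, -0.5), (1, 2, 101, -0.25), (2, 3, -1, 0.0)]
result, arc_map_src, arc_map_index = replace_fsa(src, index, 100)
print(result)
print(arc_map_src, arc_map_index)
```

In this toy run the -1 labels from `src` land on non-final arcs of the result, which is exactly the case the caution in the interface comment describes.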

danpovey · 2021-05-13T04:03:24Z

And I haven't given any thought to the right way to handle auxiliary labels here. The easiest way is probably to have them inherited from `index` in the same way as the weights (via arc_map_index), and to say they are disallowed in `src` for now.
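Inheriting auxiliary labels through arc_map_index could look roughly like the sketch below. The helper name `inherit_aux_labels`, the choice of 0 (epsilon) for arcs whose map entry is -1, and the example values are all assumptions for illustration; the thread does not specify them.

```python
from typing import List


def inherit_aux_labels(index_aux_labels: List[int],
                       arc_map_index: List[int],
                       default: int = 0) -> List[int]:
    """Copy aux labels from `index` onto result arcs (hypothetical helper).

    Result arcs that did not inherit the weight of an `index` arc
    (map entry -1) get `default`, assumed here to be epsilon.
    """
    return [
        index_aux_labels[i] if i >= 0 else default for i in arc_map_index
    ]


# Hypothetical values: three arcs in `index` with word ids 7 and 8 and a
# final arc, expanded into six result arcs.
index_aux_labels = [7, 8, -1]
arc_map_index = [0, -1, -1, 1, -1, 2]
result_aux_labels = inherit_aux_labels(index_aux_labels, arc_map_index)
print(result_aux_labels)
```

With this convention each word label is emitted once, on the first arc of the inserted Fsa, which mirrors how the weight from `index` is placed.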

pzelasko · 2021-05-17T21:12:10Z

@jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do).

jtrmal · 2021-05-17T21:15:36Z

OK, I'm trying to get started with the C++ -- you can catch up from Python direction in a week or two, I don't think I will be faster than that. y.

danpovey · 2021-05-28T04:15:35Z

@jtrmal did you make any progress?

jtrmal · 2021-05-28T11:49:06Z

I have something in C++, will try to make a PR next week -- I will be traveling over the weekend. y.

danpovey · 2021-05-28T12:10:39Z

Great!

danpovey mentioned this issue Jun 7, 2021

FsaReplace operation k2-fsa/k2#755

Closed

pzelasko mentioned this issue Jul 1, 2021

Design thoughts k2-fsa/icefall#1

Open
