Building lexicons in Python #191
Comments
num sil/nonsil states relates to the topo, so it probably doesn't belong in the dict.

If we do use word-position-dependent phones, we'd probably want to simplify them into 2 classes instead of 5.

unk-fst may still be useful, I guess, but I think we can leave it separate from Dict, for now at least.

silprob: my feeling is we may not need it, since it can just be absorbed into the probability of silence in the acoustic model.

There is even a question whether the silence_phones / nonsilence_phones belongs in the Dict. It's not clear what uses we

Also: turning the Dict into an Fsa may not be the most efficient method of graph-building (at least for supervisions). One possibility is to turn the Dict into an FsaVec and introduce a new indexing operation whereby an FsaVec can be indexed by an Fsa or FsaVec. The idea is this: an expression a[b] gives you something with the top-level structure of b, but where each arc in b with a label x is replaced by the Fsa a[x], with the start-state and final-state of a[x] being identified with the source-state and destination-state of the arc in b, and any additional states in a[x] being inserted somewhere in the result (e.g. just after the source-state of the arc).

I would propose to have epsilon be treated as a normal symbol, with element 0 of a being what we replace epsilon arcs with (it would likely be just a single arc from start-state to final-state); the last element of a being used when the symbol in b is -1; and -1 arcs in a being replaced with 0 if their destination-state in a[b] is not a final-state. This way, we could put the optional silence at the start of all the individual FSAs, and the final FSA in a would also have the optional silence which may be present at end-of-sentence. |
@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming. |
Cool! |
Even if there's no WER gain with position dependent phones, it's useful for fast lattice alignment. |
Cool! In that case, I won't start working on it to avoid duplicated effort. |
Clarification: For Kangwei, I was just talking about implementing that
function involving indexing an FsaVec with Fsas or FsaVecs, in k2.
This shouldn't be necessary for this lexicon-building Python code since
we'll anyway need a way to turn the whole thing into an Fsa.
If you feel like implementing that indexing thing, I won't say no, since
Kangwei won't join us for a couple of weeks.
|
Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and let's see then. |
Ping me next week if you won't be interested in working on it. I think I
would be, but this week I'm kinda overwhelmed by work.
Y.
|
I can handle the C++. Up to you. Would like to pick up some task on k2 though.
Y.
|
Perhaps both of you could collaborate on it, e.g. one of you writes the
structure and the other fills it in?
I suggest the following C++ interface (a little different from what I said
before):
```
/*
  Replace, in `index`, labels in the range
  symbol_range_begin <= label < symbol_range_begin + src.Dim0()
  with the Fsa indexed `label - symbol_range_begin` in `src`, identifying the
  source and destination states of the arc in `index` with the initial and
  final states in `src[label - symbol_range_begin]`.
  Arcs with labels outside this range are just copied.  Caution: the result
  may not be a valid Fsa, because labels on final-arcs in `src` (which will
  be -1) may end up on non-final arcs in the result; you can use
  FixFinalLabels() to fix this.

    @param [in] src  FsaVec containing the individual Fsas that we'll be
                 inserting into the result.  No Fsa in `src` may have arcs
                 entering its initial state; this function will crash if
                 this requirement is violated.
    @param [in] index  Fsa or FsaVec (2 or 3 axes) that dictates the overall
                 structure of the result (the result will have the same
                 number of axes as `index`).
    @param [in] symbol_range_begin  Beginning of the range (interval) of
                 symbols that are to be replaced with Fsas.  Symbols numbered
                 symbol_range_begin <= i < symbol_range_begin + src.Dim0()
                 will be replaced with the Fsa in
                 `src[i - symbol_range_begin]`.
    @param [out,optional] arc_map_src  If not nullptr, will be set to a new
                 array that maps from arc-indexes in the result to the
                 corresponding arc in `src`, or -1 if there was no such arc
                 (for out-of-range symbols in `index`).
    @param [out,optional] arc_map_index  If not nullptr, will be set to a new
                 array that maps from arc-indexes in the result to the arc
                 in `index` that it originates from (only if it includes the
                 weight from that arc in `index`; and -1 otherwise).  For
                 arcs that result from inserting an Fsa in `src` (say,
                 src[i]), they include the weight from the arc in `index`
                 if the arc was from the initial state in src[i].
*/
FsaOrVec ReplaceFsa(FsaVec src, FsaOrVec index, int32_t symbol_range_begin,
                    Array1<int32_t> *arc_map_src = nullptr,
                    Array1<int32_t> *arc_map_index = nullptr);
```
Since we have an extra option symbol_range_begin, I suppose it might make
sense to just make
this a separate function/op at the Python level, like replace_fsa(), rather
than trying to make it part of a
generic indexing function.
Note on edits I just made: I removed RepairFinalSymbols() from the draft because there is now a FixFinalLabels() function
that does the same thing; and I added the requirement that FSAs in `src` may not have arcs entering their initial state; and I simplified a comment about `arc_map_index`.
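To make the semantics described above concrete, here is a toy pure-Python sketch for a single `index` Fsa. It is not k2 code and not the proposed implementation: `Arc`, `Fsa`, `replace_fsa` and `fix_final_labels` are hypothetical stand-ins, the output arcs are not sorted by source state, and the label repair assumes that "fixing" means forcing -1 on arcs entering the final state and replacing -1 elsewhere with 0.

```python
from typing import List, NamedTuple, Tuple


class Arc(NamedTuple):
    src: int
    dest: int
    label: int
    score: float


class Fsa(NamedTuple):
    num_states: int  # state 0 is the start state, the last state is final
    arcs: List[Arc]


def replace_fsa(src: List[Fsa], index: Fsa,
                symbol_range_begin: int) -> Tuple[Fsa, List[int], List[int]]:
    """Toy version of the proposed ReplaceFsa for one `index` Fsa."""
    for fsa in src:
        # Requirement from the interface: nothing enters the initial state.
        assert all(a.dest != 0 for a in fsa.arcs)
    # Offset of each Fsa's first arc if all arcs in `src` were concatenated.
    offsets = [0]
    for fsa in src:
        offsets.append(offsets[-1] + len(fsa.arcs))

    def in_range(label: int) -> bool:
        return 0 <= label - symbol_range_begin < len(src)

    # Pass 1: renumber states. Interior states of an inserted Fsa go right
    # after the source state of the arc they replace, so the final state of
    # `index` stays the last state of the result.
    new_id, interior_base, next_id = {}, {}, 0
    for s in range(index.num_states):
        new_id[s] = next_id
        next_id += 1
        for pos, arc in enumerate(index.arcs):
            if arc.src == s and in_range(arc.label):
                interior_base[pos] = next_id
                next_id += src[arc.label - symbol_range_begin].num_states - 2

    # Pass 2: copy out-of-range arcs, expand in-range arcs.
    out, arc_map_src, arc_map_index = [], [], []
    for pos, arc in enumerate(index.arcs):
        if pos not in interior_base:
            out.append(Arc(new_id[arc.src], new_id[arc.dest],
                           arc.label, arc.score))
            arc_map_src.append(-1)
            arc_map_index.append(pos)
            continue
        i = arc.label - symbol_range_begin
        sub = src[i]
        # Start/final of `sub` are identified with the ends of `arc`.
        state_map = ([new_id[arc.src]]
                     + [interior_base[pos] + j
                        for j in range(sub.num_states - 2)]
                     + [new_id[arc.dest]])
        for k, a in enumerate(sub.arcs):
            from_start = a.src == 0
            # Only arcs leaving the initial state absorb the weight of `arc`.
            score = a.score + (arc.score if from_start else 0.0)
            out.append(Arc(state_map[a.src], state_map[a.dest],
                           a.label, score))
            arc_map_src.append(offsets[i] + k)
            arc_map_index.append(pos if from_start else -1)
    return Fsa(next_id, out), arc_map_src, arc_map_index


def fix_final_labels(fsa: Fsa, nonfinal_label: int = 0) -> Fsa:
    """Assumed repair step: -1 only on arcs entering the final state."""
    final = fsa.num_states - 1
    fixed = []
    for a in fsa.arcs:
        if a.dest == final:
            fixed.append(a._replace(label=-1))
        elif a.label == -1:
            fixed.append(a._replace(label=nonfinal_label))
        else:
            fixed.append(a)
    return Fsa(fsa.num_states, fixed)
```

Calling `replace_fsa(src, index, symbol_range_begin)` and then `fix_final_labels` on the result shows the caution in the comment: a -1 label coming from a final-arc in `src` lands on a non-final arc whenever the replaced arc in `index` does not enter the final state.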
|
And I haven't given a thought to the right way to handle auxiliary labels
here. The easiest way is probably to have them inherited from `index` in the
same way as the weights (via arc_map_index) and say they are disallowed in
`src` for now.
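A minimal sketch of that inheritance, assuming the aux labels are a plain list parallel to the arcs of `index`, and assuming that result arcs with no entry in `arc_map_index` (value -1) get 0, i.e. epsilon. The name `inherit_aux_labels` and the default are hypothetical, not part of k2:

```python
from typing import List


def inherit_aux_labels(index_aux_labels: List[int],
                       arc_map_index: List[int],
                       default: int = 0) -> List[int]:
    """Give each result arc the aux label of the `index` arc it maps to.

    arc_map_index[k] is the arc in `index` that result arc k took its
    weight from, or -1 if there is none; those arcs get `default`.
    """
    return [index_aux_labels[i] if i >= 0 else default
            for i in arc_map_index]
```

With this convention each aux label from `index` appears on one arc leaving the start of every inserted Fsa, the same arcs that carry the weight.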
|
@jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do). |
OK, I'm trying to get started with the C++ -- you can catch up from Python
direction in a week or two, I don't think I will be faster than that.
y.
|
@jtrmal did you make any progress?
|
I have something in C++, will try to make a PR next week -- I will be
traveling over the weekend.
y.
|
Great!
|
The current setup inherits building lexicon FSTs from Kaldi. I think it makes sense to have the ability to build it directly in Python, which should make building new recipes easier, as well as (eventually) allow for some things like dynamic expansion of the lexicon without leaving Python.
The data structure would basically resemble that of Kaldi, e.g.:
and methods:
Kaldi's `prepare_lang.sh` has accumulated a lot of options, so I'd like to get some feedback on which of them are useful to keep and which are not:
- `num sil/nonsil states` and `share_silence_phones` are currently unused and probably not needed anymore?
- `position dependent phones` seems superficial in our current setups, not sure if it'll be useful?
- would `unk-fst` still be useful?
- `silprob/sil_prob` - is it worth supporting it?

We can of course start from something minimal and extend it... It does seem like a substantial amount of work but I think it's worth it and I can give it a shot, or at least lay some groundwork. What do you guys think? Also, I want to make sure I wouldn't be duplicating anybody's effort.
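As a starting point for the "something minimal" option, here is a rough sketch of a lexicon container that can emit the arcs of a simple looping lexicon transducer. This is not the data structure from the original post: the `Lexicon` class, its fields and `to_arcs` are all hypothetical, and silence handling, scores and final states are left out on purpose.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

EPS = "<eps>"  # assumed spelling of the epsilon symbol


@dataclass
class Lexicon:
    """Hypothetical container; names are illustrative only."""
    # word -> list of pronunciations, each a list of phone symbols
    prons: Dict[str, List[List[str]]] = field(default_factory=dict)

    def add(self, word: str, phones: List[str]) -> None:
        self.prons.setdefault(word, []).append(list(phones))

    def to_arcs(self) -> Tuple[List[Tuple[int, int, str, str]], int]:
        """Return (arcs, num_states) for a looping lexicon transducer.

        Each arc is (src_state, dest_state, phone, word). State 0 is where
        every pronunciation starts and where it returns; the word is
        emitted on the first phone and epsilon on the rest.
        """
        arcs = []
        next_state = 1
        for word in sorted(self.prons):
            for phones in self.prons[word]:
                cur = 0
                for n, phone in enumerate(phones):
                    last = n == len(phones) - 1
                    if last:
                        dest = 0  # loop back so another word can follow
                    else:
                        dest = next_state
                        next_state += 1
                    arcs.append((cur, dest, phone, word if n == 0 else EPS))
                    cur = dest
        return arcs, next_state
```

The arc tuples could then be mapped through symbol tables and turned into an Fsa; options such as optional silence or position-dependent phones would be layered on top of this only if they turn out to be needed.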