Improve training speed by pre-computing `compose(ctc_topo, P, L)` #172

csukuangfj · 2021-04-22T03:06:00Z

Seems to be working, but it needs more tests.

Will continue with it after fixing #169

Relates to #165 and depends on k2-fsa/k2#726

pzelasko · 2021-04-22T03:29:07Z

I wonder if it makes sense to retain a copy of the un-optimized version of the LFMMI loss, maybe something like “SimpleLFMMI”, as a reference for people who just want to understand how it works.

csukuangfj · 2021-04-22T08:25:10Z

snowfall/training/mmi_graph.py

+
+        # TODO(fangjun): k2.connect supports only CPU.
+        # Add CUDA support.
+        num_graphs = k2.connect(num_graphs.to('cpu')).to(P.device)


@danpovey

I think we probably need a CUDA version of k2.connect(),
though I have not profiled this pull-request yet. It is currently
slower than before. I am not sure whether the cause is TaskRedirect or this statement.

danpovey · Apr 22, 2021

Is .connect() really necessary?

csukuangfj · Apr 22, 2021

> Is .connect() really necessary?

If I don't invoke k2.connect, then the resulting get_tot_scores for the num_lats returns all -inf.
If k2.connect is used, then get_tot_scores returns no -inf values.
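For readers unfamiliar with the operation being debated: connecting (trimming) an FSA keeps only the states that are both reachable from the start state and able to reach the final state. The sketch below is plain Python on a toy arc list, not k2's implementation. The function names and the `(src, dst, label, score)` tuple layout are illustrative assumptions.

```python
from collections import defaultdict, deque


def reachable(start, edges):
    """Return the set of states reachable from `start` via the adjacency dict `edges`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in edges[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def connect(arcs, start, final):
    """Keep only arcs whose endpoints are both accessible and coaccessible.

    arcs: list of (src, dst, label, score) tuples (toy representation).
    """
    forward = defaultdict(list)
    backward = defaultdict(list)
    for src, dst, _, _ in arcs:
        forward[src].append(dst)
        backward[dst].append(src)
    accessible = reachable(start, forward)      # reachable from the start state
    coaccessible = reachable(final, backward)   # can reach the final state
    keep = accessible & coaccessible
    return [arc for arc in arcs if arc[0] in keep and arc[1] in keep]


arcs = [
    (0, 1, 'a', -0.5),
    (1, 3, 'b', -0.2),
    (0, 2, 'c', -0.1),  # state 2 is a dead end (not coaccessible)
    (4, 3, 'd', -0.3),  # state 4 is unreachable (not accessible)
]
trimmed = connect(arcs, start=0, final=3)
```

In this toy case only the path 0 → 1 → 3 survives. Trimming of this kind removes useless states but does not change the set of successful paths.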

danpovey · Apr 23, 2021

Something is not right here.
I think it may be a mistake to compose ctc_topo unless it's right at the end. I believe ctc_topo expects to be composed with something that was epsilon-free and which then had epsilon self-loops added. Because we are interpreting the epsilons on one side as "blank", which is in a sense a real symbol, things are a little subtle there.

danpovey · Apr 23, 2021

... so I think it may be OK to compose L and P, and to compose that with the transcripts, but I'd leave ctc_topo until the end.
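To make "epsilon-free and which then had epsilon self-loops added" concrete, here is a toy sketch in plain Python. It only mimics the idea behind `k2.add_epsilon_self_loops` (the real call used in this PR's diff) and is not k2's code. The arc tuple layout and the choice to skip the final state are assumptions made for the illustration.

```python
def add_epsilon_self_loops(num_states, arcs, final):
    """Add a zero-score self-loop with label 0 (epsilon) on every non-final state.

    arcs: list of (src, dst, label, score) tuples (toy representation).
    """
    loops = [(s, s, 0, 0.0) for s in range(num_states) if s != final]
    return sorted(arcs + loops)


# An epsilon-free linear FSA: no arc carries label 0.
linear = [(0, 1, 5, -0.1), (1, 2, 7, -0.2)]
assert all(label != 0 for _, _, label, _ in linear)

with_loops = add_epsilon_self_loops(num_states=3, arcs=linear, final=2)
```

The self-loops let the other operand consume label 0 (interpreted as blank on the ctc_topo side) without advancing in this FSA, which is why the input is expected to contain no other label-0 arcs.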

csukuangfj · Apr 24, 2021

> I think it may be a mistake to compose ctc_topo unless it's right at the end. I believe ctc_topo expects to be composed with something that was epsilon-free and which then had epsilon self-loops added.

I am using intersect_device(ctc_topo_inv, P_with_self_loops).invert() (equivalent to compose(ctc_topo, P, treat_epsilons_specially=True)).

There is no 0 (neither blank nor epsilon) in P, so I think it is correct.

danpovey · 2021-04-22T09:28:35Z

Can you find out how the num-states changes? It could be a state-sorting issue, although we should be detecting that from the property flags.

On Thu, Apr 22, 2021 at 5:17 PM Fangjun Kuang commented on snowfall/training/mmi_graph.py:

> +        linear_fsas = self.build_linear_fsas(texts)
> +        linear_fsas_with_self_loops = k2.add_epsilon_self_loops(linear_fsas)
> +
> +        b_to_a_map = torch.zeros(len(texts),
> +                                 dtype=torch.int32,
> +                                 device=self.device)
> +
> +        num_graphs = k2.intersect_device(self.HPL_inv_sorted,
> +                                         linear_fsas_with_self_loops,
> +                                         b_to_a_map,
> +                                         sorted_match_a=True)
> +        num_graphs = k2.invert(num_graphs)
> +
> +        # TODO(fangjun): k2.connect supports only CPU.
> +        # Add CUDA support.
> +        num_graphs = k2.connect(num_graphs.to('cpu')).to(P.device)

danpovey · 2021-04-22T09:29:10Z

Also, while non-connected input could cause search errors, I wouldn't expect the result to be *all* infinity, unless there was a problem such as the FSA not being state-sorted.
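As a hedged illustration of the kind of state-sorting failure mentioned here: a forward pass that visits each state once, in index order, computes the correct total score only when every arc goes from a lower to a higher state index. The code below is a generic plain-Python sketch of that effect, not k2's `get_tot_scores`. All names and the `(src, dst, score)` arc layout are assumptions.

```python
import math

NEG_INF = float('-inf')


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def tot_score_single_pass(num_states, arcs, start, final):
    """Log-semiring forward score, visiting states once in index order.

    Only correct if the states are topologically sorted by index.
    """
    alpha = [NEG_INF] * num_states
    alpha[start] = 0.0
    outgoing = {}
    for src, dst, score in arcs:
        outgoing.setdefault(src, []).append((dst, score))
    for state in range(num_states):
        if alpha[state] == NEG_INF:
            continue  # nothing has reached this state (yet)
        for dst, score in outgoing.get(state, []):
            alpha[dst] = log_add(alpha[dst], alpha[state] + score)
    return alpha[final]


# Same single path, two different state numberings.
sorted_arcs = [(0, 1, -1.0), (1, 2, -2.0), (2, 3, -0.5)]
unsorted_arcs = [(0, 2, -1.0), (2, 1, -2.0), (1, 3, -0.5)]

score_sorted = tot_score_single_pass(4, sorted_arcs, start=0, final=3)
score_unsorted = tot_score_single_pass(4, unsorted_arcs, start=0, final=3)
```

With the sorted numbering the path score is recovered. With the unsorted numbering, state 1 is visited before any score has reached it, so the final state is never updated and the total stays at -inf.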


csukuangfj · 2021-04-22T09:40:39Z

> Also, while non-connected input could cause search errors, I wouldn't
> expect the result to be all infinity, unless there was a problem like
> it was not state-sorted.

Thanks, will check that.

csukuangfj · 2021-04-23T12:43:12Z

@danpovey

> Also, while non-connected input could cause search errors, I wouldn't
> expect the result to be all infinity, unless there was a problem like
> it was not state-sorted.

I confirm that the tot_scores of num_lats are all -inf even if it is top sorted.
(The number of Fsas in the FsaVec is 8).

Some information about the FsaVec before and after calling k2.connect:

Before

num_fsas: 8
num_states: 3309424
num_arcs: 3315379
properties: "Valid|Nonempty|MaybeAccessible"

After

num_fsas: 8
num_states: 3228
num_arcs: 9183
properties: "Valid|Nonempty|TopSorted|MaybeAccessible|MaybeCoaccessible"

(NOTE: k2.arc_sort is called later for both cases)

The GetCounts() issue is fixed in k2-fsa/k2@d43f77e (from k2-fsa/k2#726);
it was this num_graphs that triggered the errors in GetCounts(). After that fix, k2.top_sort is applied to
this num_graphs to make it top-sorted (but not connected).

csukuangfj · 2021-04-25T11:12:39Z

Here are the profiling results of this pull-request.

[Figure: profiling results for this pull-request]

[Figure: profiling results for the master branch]

Even though this pull-request requires one fewer call to k2.intersect_device, it is actually slower. I compared the sizes of the resulting num_graphs, listed below. The num_graphs produced by this pull-request is larger, so it takes more time.

this pull-request

num_fsas: 42
num_states: 19626
num_arcs: 51152

master branch

num_fsas: 42
num_states: 8051
num_arcs: 11051
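To put the two listings side by side, a quick calculation of the size ratios follows. The numbers are copied from the listings above; the variable names are illustrative.

```python
# Sizes of num_graphs for a batch of 42 FSAs, as reported above.
pr = {"num_states": 19626, "num_arcs": 51152}
master = {"num_states": 8051, "num_arcs": 11051}

state_ratio = pr["num_states"] / master["num_states"]
arc_ratio = pr["num_arcs"] / master["num_arcs"]

print(f"states: {state_ratio:.2f}x, arcs: {arc_ratio:.2f}x")
```

The pre-composed graph has roughly 2.4 times as many states and 4.6 times as many arcs as the one built on the master branch.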

Closing.

csukuangfj added 2 commits April 22, 2021 11:02

Improve training speed by pre-computing compose(ctc_topo, P, L)

29d7d12

minor fixes.

b2acad7

Fix an error.

91fc151

Fix errors.

fa82e64

csukuangfj closed this Apr 25, 2021
