Speeding up training #165

danpovey · 2021-04-16T04:16:12Z

After looking at the nsys output, I think we are largely limited by the latency of sequential operations in IntersectDevice, IntersectDense, GetForwardScores and GetBackwardScores (and by the memory-transfer latency when we invoke Array1::Back()).
I think there are two ways we can significantly reduce the time taken:

1. We can let the num and den FSAs be processed together by concatenating the FsaVecs and calling IntersectDevice() just once, getting the tot_scores just once, and then post-processing ranges of the tot_scores.
2. IntersectDevice() is called when forming minibatches (intersecting with L and then with ctc_topo). If we can somehow arrange to batch these calls up, it would be more efficient. It might not be super convenient code-wise, though.
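
A minimal sketch of the first idea, in plain Python rather than k2: concatenate the num and den batches, make one call to the expensive batched operation, then slice the resulting scores back into num and den ranges. The names `batched_tot_scores` and `fake_score_fn` are hypothetical and only stand in for the intersect + tot_scores step; this is not the k2 API.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: one "graph" is just a list of arc scores here.
Graph = Sequence[float]


def batched_tot_scores(
    num: Sequence[Graph],
    den: Sequence[Graph],
    score_fn: Callable[[List[Graph]], List[float]],
) -> Tuple[List[float], List[float]]:
    """Score num and den with ONE call to score_fn, then split the result."""
    # Analogous to concatenating the num and den FsaVecs.
    combined = list(num) + list(den)
    # One launch of the expensive operation instead of two.
    tot_scores = score_fn(combined)
    # Post-process ranges: the first len(num) entries belong to num.
    n = len(num)
    return tot_scores[:n], tot_scores[n:]


# Stand-in for the expensive batched op (an assumption, not k2);
# it counts how often it is invoked.
call_count = 0


def fake_score_fn(batch: List[Graph]) -> List[float]:
    global call_count
    call_count += 1
    return [float(sum(g)) for g in batch]


num = [[1.0, 2.0], [3.0]]
den = [[4.0, 5.0, 6.0], [7.0]]
num_scores, den_scores = batched_tot_scores(num, den, fake_score_fn)
```

The point of the sketch is only the call pattern: the per-call latency is paid once for the combined batch, and the split afterwards is cheap index arithmetic.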

csukuangfj mentioned this issue Apr 16, 2021

Invoke k2.intersect_dense and get_tot_scores only once in LFMMILoss #166

Merged

csukuangfj self-assigned this Apr 20, 2021

This was referenced Apr 21, 2021

Implement ComposeArcMaps for 1-D arrays. k2-fsa/k2#726

Merged

Improve training speed by pre-computing compose(ctc_topo, P, L) #172

Closed
