Plan for multi pass n-best rescoring #232
Comments
OK, the next step is to determine the subset of paths in the Nbest object to rescore. The input to this process is the ragged array of total_scores that we obtained from composing with the lattices, and the immediate output would be a RaggedInt/Ragged<int32_t> containing the subset of idx01's into the Nbest object that we want to retain. [This will be regular, i.e. we keep the same number from each supervision, even if this means having to use repeats. We'll have to figure out later what to do in case no paths survived in one of the supervisions.] We can use the shape of this to create the new Nbest object, indexing the Fsa of the original Nbest object with the idx01's to get the correct subset.

For the very first iteration of our code we can just have this take the most likely n paths, although this is likely not optimal (it might not have enough diversity); we can figure this out later. So at this point we still have an Nbest object, but it has a regular structure so it will be easier to do rescoring with.

Note: it is important that we keep the original acoustic and LM scores per token (as the 'scores' in the FSAs), because we will later have a prediction scheme that makes use of these.
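As a sketch of that very first iteration (keep the most likely n per supervision, repeating when a supervision has fewer than n surviving paths), here is the selection logic in plain PyTorch, using a list of per-supervision score tensors as a stand-in for the ragged total_scores array; the function name is just a placeholder:

```python
import torch

def select_topn_idx01(tot_scores, n):
    """Pick, for every supervision, the idx01's (indexes into the flattened
    list of paths of the Nbest object) of its n highest-scoring paths,
    repeating the best path if fewer than n paths survived, so the result
    is regular (exactly n entries per supervision).

    tot_scores: list with one 1-D tensor of path total-scores per supervision
                (a stand-in for the ragged total_scores array).
    """
    idx01s = []
    offset = 0  # where this supervision's paths start in the flattened list
    for scores in tot_scores:
        k = min(n, scores.numel())
        best = torch.topk(scores, k).indices + offset
        if k < n:
            # keep the structure regular by repeating the best path
            best = torch.cat([best, best[:1].expand(n - k)])
        idx01s.append(best)
        offset += scores.numel()
    return torch.cat(idx01s)

# e.g. two supervisions with 4 and 2 surviving paths, keep n=3 from each:
# prints tensor([1, 2, 0, 4, 5, 4])
print(select_topn_idx01([torch.tensor([0.5, 2.0, 1.0, -1.0]),
                         torch.tensor([3.0, 1.0])], n=3))
```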
Any rescoring processes we have (e.g. LM rescoring, transformer decoding) should produce an Nbest object with the exact same structure as the one produced in the comment above, i.e. with a regular number of paths per supervision, like 10. We'll need this exact same structure to be preserved so that our process for finding the n-best paths to rescore will work. This will require us to train a simple model to predict the total-score of a path.

For each word-position in each of the remaining paths (i.e. those that were not selected in the 1st pass), we want to predict the score for that position after rescoring, as a Gaussian. Let an "initial-score" be an element of the .scores of the n-best lists before neural rescoring, and a "final-score" be an element of the .scores of the n-best lists after neural rescoring. The inputs to this model include the mean and variance of the best-matching positions, and an n-gram order. What I mean here is: for a particular position in a path, we find the longest-matching sequence (i.e. up to and including this word) in any of the n-best lists that we actually rescored; and if there are multiple matches with the same longest length, we treat them as a set (if there is just one, the variance would be 0). We can also provide this count to the model. The mean and variance refer to the mean and variance of the scores at those longest-matching positions.

Now, it might look like this process of finding the set of longest-matching words and computing the mean and variance of the scores would be very time-consuming. Actually it can be done very efficiently (linear time in the total number of words we are processing, including words in paths that we selected in the 1st pass and those we did not, i.e. queries and keys), although the algorithms will need to be done on CPU for now because they are too complex to implement on GPU in a short timeframe. I'll describe these algorithms in the next comment on this issue.
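Before getting to the efficient algorithms, a brute-force (quadratic-time) reference in plain Python may help pin down the definition of these statistics; the function and argument names here are just placeholders:

```python
def best_matching_stats(query, keys, key_scores):
    """For each position i of `query` (a token sequence that was NOT
    rescored), find the longest sequence ending at i that also ends at some
    position of a key (a rescored token sequence); return, per position:
    (mean, var, ngram_order, count) of the key scores at those
    best-matching positions."""
    stats = []
    for i in range(len(query)):
        best_len, matched_scores = 0, []
        for k, key in enumerate(keys):
            for j in range(len(key)):
                # length of the longest common suffix of query[:i+1], key[:j+1]
                length = 0
                while length <= min(i, j) and query[i - length] == key[j - length]:
                    length += 1
                if length == 0:
                    continue
                if length > best_len:
                    best_len, matched_scores = length, [key_scores[k][j]]
                elif length == best_len:
                    matched_scores.append(key_scores[k][j])
        if matched_scores:
            mean = sum(matched_scores) / len(matched_scores)
            var = sum((s - mean) ** 2 for s in matched_scores) / len(matched_scores)
        else:
            mean, var = 0.0, 0.0  # no match at all; needs some backoff in practice
        stats.append((mean, var, best_len, len(matched_scores)))
    return stats

# toy example: one query path against two rescored key paths with per-token
# final-scores; prints [(0.0, 0.0, 0, 0), (-2.0, 1.0, 1, 2), (-2.0, 0.0, 2, 1)]
print(best_matching_stats(query=[7, 3, 9],
                          keys=[[2, 3, 9], [1, 3, 5]],
                          key_scores=[[-0.5, -1.0, -2.0], [-0.2, -3.0, -0.8]]))
```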
Let me first describe an internal interface for the code that gets the (mean, variance, ngram_order) of the best-matching positions that were rescored in the 1st round. I'm choosing a level of interface that will let you know the basic picture, but there will be other interfaces above and below. Something like this, assuming it's in Python:
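(The following signature is only a placeholder sketch; the argument names and types are assumptions rather than a settled API.)

```python
def get_best_matching_stats(tokens, scores, query_mask, max_order: int):
    """Placeholder interface; all names here are assumptions, not a final API.

    tokens:     ragged array of token sequences of ALL paths of one
                utterance -- both the paths rescored in the 1st round
                (keys) and the remaining paths (queries).
    scores:     one float per token; only entries belonging to keys are
                meaningful (the final-scores after rescoring).
    query_mask: one bool per token; True for positions belonging to queries.
    max_order:  cap on the reported n-gram order of a match.

    Returns (mean, var, counts, ngram_order): one entry per query token
    position, with the statistics described in the comment above.
    """
    raise NotImplementedError("to be implemented with suffix arrays on CPU")
```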
The implementation of this function will use suffix arrays. For now everything will be done on the CPU. The basic plan is as follows; let's say we do it separately for each utterance.
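Purely as an illustration of the data structure involved (not the plan itself, which is described in follow-up comments): suffixes that share a long common prefix become neighbours in a suffix array, which is what makes longest-match queries cheap; since we match sequences ending at a position, the array would likely be built over reversed sequences. A naive construction:

```python
def suffix_array(seq):
    """Naive O(n^2 log n) suffix-array construction, fine for a sketch; a
    real implementation would use a linear-time algorithm on the CPU.
    Returns the start positions of all suffixes of `seq` in sorted order."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

# suffixes sharing a long common prefix end up adjacent, so the best match
# for any position can be found by inspecting its neighbours in suffix order
tokens = [3, 1, 4, 1, 5, 1, 4]
for start in suffix_array(tokens):
    print(start, tokens[start:])
```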
I will first implement the ideas in the first comment, i.e., the
Incidentally, regarding padding, speechbrain has something called undo_padding.
So, these n paths (after rescoring) will be the keys used to calculate the mean and variance, and the other paths not selected will be the queries. Is that right?
Yes. |
[Guys, I have gym now so I'll submit this and write the rest of this later today. ]
I am creating an issue to describe a plan for multi-pass n-best-list rescoring. This will also require
new code in k2; I'll create a separate issue for that.
The scenario is that we have a CTC or LF-MMI model and we do the 1st decoding pass from that.
Anything that we can do with lattices, we do first (e.g. any FST-based LM rescoring).
Let the possibly-LM-rescored lattice be the starting point for the n-best rescoring process.
The first step is to generate a long n-best list for each lattice by calling RandomPaths() with a largish number,
like 1000. We then choose unique paths based on token sequences, where 'token' is whatever type of token
we are using in the transformer and RNNLM-- probably word pieces. That is, we use
inner_labels='tokens'
when doing the composition with the CTC topo when making the decoding graph, and these get propagated
to the lattices, so we can use lats.tokens and remove epsilons and pick the unique paths.
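The dedup-by-token-sequence step, shown here in plain Python on already-extracted token lists (the ragged-array version would use k2's utilities; epsilons are assumed to have been removed already, and the function name is just a placeholder):

```python
def unique_path_ids(token_seqs):
    """Keep, within one utterance, only the first of the sampled paths for
    every distinct token sequence; returns the indexes of the paths to keep.

    token_seqs: one list of token ids per sampled path (epsilons removed).
    """
    seen, keep = set(), []
    for i, seq in enumerate(token_seqs):
        key = tuple(seq)
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep

# four sampled paths collapsing to three unique token sequences -> [0, 2, 3]
print(unique_path_ids([[5, 7], [5, 7], [5, 9], [2]]))
```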
I think we could have a data structure called Nbest-- we could draft this in snowfall for now and later move
to k2-- that contains an Fsa and also a _k2.RaggedShape that dictates how each of the paths relates to the
original supervision segments. But I guess we could draft this pipeline without the data structure.
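A minimal sketch of what such an Nbest could look like (attribute names are tentative):

```python
from dataclasses import dataclass
import k2

@dataclass
class Nbest:
    """Sketch of the proposed data structure; details to be worked out.

    fsa:   an FsaVec containing one linear FSA per retained path.
    shape: a RaggedShape with axes [utt][path], saying which paths belong
           to which of the original supervision segments.
    """
    fsa: k2.Fsa
    shape: k2.RaggedShape
```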
Supposing we have the Nbest with ragged numbers of paths, we can then add epsilon self-loops and
intersect it with the lattices, after moving the 'tokens' to the 'labels' of the lattices; we'd then
get the 1-best path and remove epsilons so that we get an Nbest that has just the best path's
tokens and no epsilons.
(We could define, in class Nbest, a form of intersect() that does the right thing when composing with an Fsa
representing an FsaVec; we might also define wrappers for some Fsa operations so they work also on Nbest).
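A rough sketch of what such an intersect() could do, written as a free function against the Nbest sketch above. The k2 calls are used loosely here (exact names and arguments may differ between k2 versions), and it assumes `lats` already has its 'tokens' copied onto 'labels' and is arc-sorted:

```python
import k2

def intersect_nbest_with_lats(nbest: "Nbest", lats: k2.Fsa) -> "Nbest":
    """Compose each path of `nbest` with the lattice of the supervision it
    came from, then keep the 1-best path and drop epsilons, returning an
    Nbest with the same ragged shape."""
    # epsilon self-loops let the linear paths align with epsilon arcs that
    # remain in the lattices
    paths = k2.add_epsilon_self_loops(nbest.fsa)

    # row_ids(1) of the [utt][path] shape maps every path to its utterance
    b_to_a_map = nbest.shape.row_ids(1)
    composed = k2.intersect_device(lats, paths, b_to_a_map=b_to_a_map,
                                   sorted_match_a=True)
    composed = k2.top_sort(k2.connect(composed))

    # 1-best path per composed FSA, with epsilons removed, gives linear FSAs
    # carrying the tokens plus the per-token scores inherited from the lattice
    best = k2.remove_epsilon(k2.shortest_path(composed, use_double_scores=True))
    return Nbest(fsa=best, shape=nbest.shape)
```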
So at this point we have an Nbest with ragged numbers of paths up to 1000 (depending on how many unique
paths we got) and that is just a linear sequence of arcs, one per token; and it has costs defined per
token. (It may also have other types of label and cost that were passively inherited). The way we allocate
these costs, e.g. of epsilons and token-repeats, to each token will of course be a little arbitrary-- it's a function
of how the epsilon removal algorithm works-- and we can try to figure out later on whether it needs to be changed
somehow.
We get the total_scores of this Nbest object; they will be used in determining which ones to use in the first
n-best list that we rescore. We can define its total_scores() function so that it returns them as a ragged array,
which they logically are.
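For instance, total_scores() could be little more than the following (the exact ragged wrapper differs across k2 versions, so treat this as a sketch against the Nbest class sketched above):

```python
import torch
import k2

def nbest_total_scores(nbest: "Nbest") -> k2.RaggedTensor:
    """One total score per path, paired with the [utt][path] shape so that
    the result is the ragged array described above."""
    # each path FSA is linear, so in the tropical semiring this is just the
    # sum of its per-arc scores
    scores = nbest.fsa.get_tot_scores(use_double_scores=True,
                                      log_semiring=False)
    return k2.RaggedTensor(nbest.shape, scores.to(torch.float32))
```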