This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Initialization of last layer to zero #161

Open
danpovey opened this issue Apr 15, 2021 · 4 comments

Comments

danpovey commented Apr 15, 2021

Guys,
I just remembered a trick we used in Kaldi to help models converge early on. I tried it on a setup that wasn't converging well, and it had a huge effect, so I want to remind you of it (I don't have time to try it on one of our standard setups just now).
The trick is simply to initialize the last layer's parameters to zero:

    def __init__(self):
        <snip>
        self.final_conv1d = nn.Conv1d(dim, num_classes, stride=1,
                                      kernel_size=1, bias=True)
        self.reset_parameters()

    def reset_parameters(self):
        torch.nn.init.constant_(self.final_conv1d.weight, 0.)
        torch.nn.init.constant_(self.final_conv1d.bias, 0.)
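As a sanity check on what this does at initialization, here is a minimal, self-contained sketch (a toy stand-in for the real model; the `TinyModel` class and the `dim`/`num_classes` values are made up for illustration). With the final conv zeroed, the network's initial logits are exactly zero, i.e. a uniform posterior after softmax:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    # Toy model: one hidden conv, then the zero-initialized
    # final 1x1 conv described in the comment above.
    def __init__(self, dim=8, num_classes=5):
        super().__init__()
        self.hidden = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.final_conv1d = nn.Conv1d(dim, num_classes, stride=1,
                                      kernel_size=1, bias=True)
        self.reset_parameters()

    def reset_parameters(self):
        torch.nn.init.constant_(self.final_conv1d.weight, 0.)
        torch.nn.init.constant_(self.final_conv1d.bias, 0.)

    def forward(self, x):
        return self.final_conv1d(torch.relu(self.hidden(x)))


model = TinyModel()
x = torch.randn(2, 8, 10)             # (batch, dim, time)
logits = model(x)
print(torch.all(logits == 0).item())  # True: uniform posterior at init
```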
@danpovey (Contributor, Author)

Mm, on the master branch with the transformer, this gives an OOM error. We need some code in LFMmiLoss to conditionally prune the lattices further if they are too large. @csukuangfj can you point me to any code that does this?

@csukuangfj (Collaborator)

@danpovey
Please see

try:
    rescoring_lats = k2.intersect_device(G_with_epsilon_loops,
                                         inverted_lats_with_epsilon_loops,
                                         b_to_a_map,
                                         sorted_match_a=True)
except RuntimeError as e:
    print(f'Caught exception:\n{e}\n')
    print(f'Number of FSAs: {inverted_lats.shape[0]}')
    print('num_arcs before pruning: ',
          inverted_lats_with_epsilon_loops.arcs.num_elements())
    # NOTE(fangjun): The choice of the threshold 0.001 is arbitrary here
    # to avoid OOM. We may need to fine tune it.
    inverted_lats = k2.prune_on_arc_post(inverted_lats, 0.001, True)
    inverted_lats_with_epsilon_loops = k2.add_epsilon_self_loops(
        inverted_lats)
    print('num_arcs after pruning: ',
          inverted_lats_with_epsilon_loops.arcs.num_elements())
    rescoring_lats = k2.intersect_device(G_with_epsilon_loops,
                                         inverted_lats_with_epsilon_loops,
                                         b_to_a_map,
                                         sorted_match_a=True)

It is from #147
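The snippet above is an instance of a catch-and-retry pattern: attempt the expensive intersection, and if it throws (e.g. OOM), prune the lattices and try once more. Abstracted away from k2 (the `compute` and `prune` callables below are placeholders, not k2 APIs), the shape of it is:

```python
def intersect_with_fallback(compute, prune, lats):
    """Try the expensive operation; on RuntimeError (e.g. OOM),
    prune the input with some threshold and retry once.
    `compute` and `prune` are placeholder callables, not k2 APIs."""
    try:
        return compute(lats)
    except RuntimeError as e:
        print(f'Caught exception:\n{e}\n')
        pruned = prune(lats)
        return compute(pruned)


# Toy usage: "lattices" are plain lists; compute fails if too large.
def compute(lats):
    if len(lats) > 3:
        raise RuntimeError('toy OOM')
    return sum(lats)


result = intersect_with_fallback(compute, lambda lats: lats[:3], [1, 2, 3, 4])
print(result)  # 6
```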

@pzelasko (Collaborator)

That's a cool trick. Why does it work?

@danpovey (Contributor, Author)

Mm, actually in snowfall, now that I test properly, it's not clear that it's working.
It's OK to leave the last layer uninitialized; the derivatives will still be nonzero.
