This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Initialization of last layer to zero #161

Open
danpovey opened this issue Apr 15, 2021 · 4 comments

Comments

danpovey commented Apr 15, 2021

Guys,
I just remembered a trick we used in Kaldi to help models converge early on. I tried it on a setup that wasn't converging well, and it had a huge effect, so I want to remind you of it (I don't have time to try it on one of our standard setups just now).
The trick is simply to initialize the last layer's parameters to zero:

    def __init__(self):
        <snip>
        self.final_conv1d = nn.Conv1d(dim, num_classes, stride=1,
                                      kernel_size=1, bias=True)
        self.reset_parameters()

    def reset_parameters(self):
        torch.nn.init.constant_(self.final_conv1d.weight, 0.)
        torch.nn.init.constant_(self.final_conv1d.bias, 0.)
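As a sanity check on what this does at initialization, here is a minimal, self-contained sketch (a toy stand-in for the real model; the `TinyModel` class and the `dim`/`num_classes` values are made up for illustration). With the final conv zeroed, the network's initial logits are exactly zero, i.e. a uniform posterior after softmax:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    # Toy model: one hidden conv, then the zero-initialized
    # final 1x1 conv described in the comment above.
    def __init__(self, dim=8, num_classes=5):
        super().__init__()
        self.hidden = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.final_conv1d = nn.Conv1d(dim, num_classes, stride=1,
                                      kernel_size=1, bias=True)
        self.reset_parameters()

    def reset_parameters(self):
        torch.nn.init.constant_(self.final_conv1d.weight, 0.)
        torch.nn.init.constant_(self.final_conv1d.bias, 0.)

    def forward(self, x):
        return self.final_conv1d(torch.relu(self.hidden(x)))


model = TinyModel()
x = torch.randn(2, 8, 10)             # (batch, dim, time)
logits = model(x)
print(torch.all(logits == 0).item())  # True: uniform posterior at init
```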
@danpovey (Contributor, Author)

Mm, on the master branch with the transformer, this gives an OOM error. We need some code in LFMmiLoss to conditionally prune the lattices further if they are too large. @csukuangfj can you point me to any code that does this?

@csukuangfj (Collaborator)

@danpovey
Please see

try:
    rescoring_lats = k2.intersect_device(G_with_epsilon_loops,
                                         inverted_lats_with_epsilon_loops,
                                         b_to_a_map,
                                         sorted_match_a=True)
except RuntimeError as e:
    print(f'Caught exception:\n{e}\n')
    print(f'Number of FSAs: {inverted_lats.shape[0]}')
    print('num_arcs before pruning: ',
          inverted_lats_with_epsilon_loops.arcs.num_elements())
    # NOTE(fangjun): The choice of the threshold 0.001 is arbitrary here
    # to avoid OOM. We may need to fine tune it.
    inverted_lats = k2.prune_on_arc_post(inverted_lats, 0.001, True)
    inverted_lats_with_epsilon_loops = k2.add_epsilon_self_loops(
        inverted_lats)
    print('num_arcs after pruning: ',
          inverted_lats_with_epsilon_loops.arcs.num_elements())
    rescoring_lats = k2.intersect_device(G_with_epsilon_loops,
                                         inverted_lats_with_epsilon_loops,
                                         b_to_a_map,
                                         sorted_match_a=True)

It is from #147
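The snippet above is an instance of a catch-and-retry pattern: attempt the expensive intersection, and if it throws (e.g. OOM), prune the lattices and try once more. Abstracted away from k2 (the `compute` and `prune` callables below are placeholders, not k2 APIs), the shape of it is:

```python
def intersect_with_fallback(compute, prune, lats):
    """Try the expensive operation; on RuntimeError (e.g. OOM),
    prune the input with some threshold and retry once.
    `compute` and `prune` are placeholder callables, not k2 APIs."""
    try:
        return compute(lats)
    except RuntimeError as e:
        print(f'Caught exception:\n{e}\n')
        pruned = prune(lats)
        return compute(pruned)


# Toy usage: "lattices" are plain lists; compute fails if too large.
def compute(lats):
    if len(lats) > 3:
        raise RuntimeError('toy OOM')
    return sum(lats)


result = intersect_with_fallback(compute, lambda lats: lats[:3], [1, 2, 3, 4])
print(result)  # 6
```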

@pzelasko (Collaborator)

That's a cool trick. Why does it work?

@danpovey (Contributor, Author)

Mm, actually in snowfall, now that I test properly, it's not clear that it's working.
It's OK to leave the last layer uninitialized; the derivatives will still be nonzero.
