In the `WeightDecay` regularization class, the code replaces the parameter's gradient with the gradient of the regularization term:
```python
param.grad = self.regularize(param)
```
Should it instead add the regularization gradient to the existing parameter gradient (the one accumulated from the loss), i.e.:
```python
param.grad.add_(self.regularize(param))
```
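For context, here is a minimal sketch of the difference, not the library's actual class. The `WeightDecaySketch` name, the `weight_decay` coefficient, and the `step` method are illustrative assumptions; the point is that overwriting `param.grad` would discard whatever `loss.backward()` already accumulated there, while `add_` keeps both terms:

```python
import torch
import torch.nn as nn


class WeightDecaySketch:
    """Illustrative only: applies an L2 weight-decay gradient to a parameter."""

    def __init__(self, weight_decay: float = 1e-2):
        self.weight_decay = weight_decay

    def regularize(self, param: nn.Parameter) -> torch.Tensor:
        # Gradient of (weight_decay / 2) * ||param||^2 with respect to param
        return self.weight_decay * param.data

    def step(self, param: nn.Parameter) -> None:
        if param.grad is None:
            return
        # Overwriting would drop the loss gradient already in param.grad:
        #     param.grad = self.regularize(param)
        # Accumulating keeps the loss gradient and adds the decay term:
        param.grad.add_(self.regularize(param))


param = nn.Parameter(torch.ones(3))
param.grad = torch.full((3,), 0.5)  # pretend this came from loss.backward()
WeightDecaySketch(weight_decay=0.1).step(param)
print(param.grad)  # tensor([0.6, 0.6, 0.6]) = 0.5 (loss) + 0.1 * 1.0 (decay)
```

Under this reading, replacing the gradient would turn the update into pure weight decay and ignore the loss entirely, which is why accumulation with `add_` seems like the intended behavior.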