gradient of `Flux.normalise` returns NaN when `std` is zero #2096
Comments
Xref JuliaML/MLUtils.jl#123 (about moving & renaming) and #1992 (about NaN from batch of 1).
Other frameworks implement this as `sqrt(var + eps)` instead of using `std` directly.
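To make the difference concrete, here is a minimal sketch that compares the hand-derived gradient of `std` with the gradient of `sqrt(var + ϵ)` on a constant input. The helper names `std_grad` and `stable_std_grad` and the default `ϵ` are hypothetical and chosen for illustration; this is not Flux's implementation, and it uses the uncorrected (1/n) variance as an assumption.

```julia
using Statistics

# Hypothetical helpers for illustration only; not Flux's implementation.
# Hand-derived gradient of std(x; corrected=false) with respect to x:
#   ∂σ/∂xᵢ = (xᵢ - μ) / (n * σ)
function std_grad(x)
    μ, σ, n = mean(x), std(x; corrected=false), length(x)
    return (x .- μ) ./ (n * σ)
end

# Gradient of sqrt(var(x) + ϵ), with ϵ inside the square root:
#   ∂/∂xᵢ = (xᵢ - μ) / (n * sqrt(var + ϵ))
function stable_std_grad(x; ϵ=1f-5)
    μ, v, n = mean(x), var(x; corrected=false), length(x)
    return (x .- μ) ./ (n * sqrt(v + ϵ))
end

x = ones(Float32, 4)            # constant input, so std(x) == 0

g_naive  = std_grad(x)          # 0 / 0 in every entry
g_stable = stable_std_grad(x)   # 0 / (n * sqrt(ϵ)) in every entry

println(all(isnan, g_naive))    # true
println(all(iszero, g_stable))  # true
```

Placing `ϵ` inside the square root keeps the denominator of the derivative away from zero, whereas adding `ϵ` after taking `std` only protects the forward division.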
Not sure if #1992 is related. Batchnorm doesn't use …
FWIW, PyTorch's layernorm adds the eps to var, stores the var with eps, and uses it directly in the pullback. This also raises the question of the error between the real value and the value with eps. Since we are dividing by …
I think we'd have to go with …
You can replace …
For sure, but I'm loath to create an rrule just for this. I actually have a WIP PR bringing the norm functions to NNlib, so @chengchingwen if you want to continue this design discussion I can publish it.
I would be interested. I actually have a function for computing the gradient of a layer norm directly in NAlib. This is the best I can get (in terms of both performance and memory efficiency) without writing a CUDA kernel. The gradient of …
There is a problem in `normalise`: if `std(x) ≈ 0`, the chain rule evaluates to NaN. See e.g. FluxML/Flux.jl#2096. We tried to fix this here by adding some noise to `x`, although that might not be the best solution. A later commit also ensures that all images actually have some noise in the background.
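For reference, a short derivation of where the NaN comes from, assuming the uncorrected variance $v = \frac{1}{n}\sum_j (x_j - \mu)^2$ and $\sigma = \sqrt{v}$ (the symbols $v$, $n$, and $\epsilon$ are notation introduced here, not taken from the Flux source):

$$
\frac{\partial \sigma}{\partial x_i}
= \frac{1}{2\sqrt{v}} \cdot \frac{\partial v}{\partial x_i}
= \frac{1}{2\sqrt{v}} \cdot \frac{2\,(x_i - \mu)}{n}
$$

For a constant input, $v = 0$ and $x_i = \mu$, so the first factor is infinite and the second is zero; in floating point their product is NaN. If the square root is instead taken of $v + \epsilon$, the first factor becomes $\frac{1}{2\sqrt{\epsilon}}$, which is finite, and the product is exactly zero.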
#2421 has been merged now.
`Flux.normalise` only prevents the forward value from division by zero, but there is also a division by … `Flux.normalise`.