tanh 150 constant - why? #186

Open
gordicaleksa opened this issue May 26, 2020 · 6 comments

@gordicaleksa

Why do you use 150 as the constant to multiply the tanh output of your model?
Wouldn't that give it a [-150, 150] output range (since tanh is in the [-1, 1] range)?

I also see that in your inference script (fast_neural_style.lua) you're using preprocess.deprocess(img_out)[1], which expects the output to be in ImageNet's normalized range (mean 0 and std 1), not in [-150, 150].

I'm reconstructing your work in PyTorch; I don't know anything about Lua, but I could infer some of the important details.
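
For context, here is a minimal PyTorch sketch (my own, not code from this repo) of what a 150-scaled tanh output layer does; the module name and tensor shapes are made up for illustration:

```python
import torch
import torch.nn as nn

class ScaledTanh(nn.Module):
    """Squashes unbounded activations into [-scale, scale] via scale * tanh(x)."""
    def __init__(self, scale=150.0):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return self.scale * torch.tanh(x)

# Arbitrary pre-activations; the output always lands inside [-150, 150].
x = torch.randn(1, 3, 8, 8) * 10
y = ScaledTanh()(x)
print(y.min().item(), y.max().item())
```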

@htoyryla
Contributor

I would think it is connected to the practice back then of using the 0..255 pixel range directly in models. E.g. the original neural-style used a VGG model like that.

This issue seems to support my guess:
#112

@gordicaleksa
Author

I'm aware that Caffe used BGR, 0..255 range images normalized by subtracting ImageNet's mean ([123.675, 116.28, 103.53]) to train VGG models.
PyTorch, which I'm using, works with RGB, 0..1 range images normalized with both ImageNet's mean ([0.485, 0.456, 0.406]) and std ([0.229, 0.224, 0.225]) to train VGG nets. Those are the same numbers, just scaled depending on whether you use the 0..1 or the 0..255 range.
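
A quick sanity check (my own, purely illustrative) that these really are the same ImageNet statistics at two different scales:

```python
mean_01 = [0.485, 0.456, 0.406]
std_01 = [0.229, 0.224, 0.225]

# Rescale the 0..1 statistics to the 0..255 range.
mean_255 = [255 * m for m in mean_01]
std_255 = [255 * s for s in std_01]

print(mean_255)  # ~[123.675, 116.28, 103.53] -> the Caffe mean (channel order aside)
print(std_255)   # ~[58.395, 57.12, 57.375]
```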

But that still doesn't explain the magic number 150; 127.5 would make some sense, since then you could simply shift the model's [-127.5, 127.5] output to [0, 255].

What I think happened here is that Johnson tried to emulate (using this 150*tanh output activation) what you would get by taking a [0, 255] image and subtracting ImageNet's mean from it. That way you can directly feed the output of the transformer net into the perceptual net (VGG).

Namely, if you take a 0..255 image and subtract [123.675, 116.28, 103.53], the biggest value you can get is 151.47 and the smallest is -123.675 (assuming 255s and 0s land in the right channels). So that's roughly [-150, 150]...
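
Again a tiny illustrative check (mine, not from the repo) of those extremes:

```python
caffe_mean = [123.675, 116.28, 103.53]

lows = [0 - m for m in caffe_mean]     # [-123.675, -116.28, -103.53]
highs = [255 - m for m in caffe_mean]  # [131.325, 138.72, 151.47]

print(min(lows), max(highs))  # -123.675 151.47 -> roughly the [-150, 150] band
```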

@htoyryla
Contributor

htoyryla commented May 28, 2020

That way you can directly feed the output of the transformer net into the perceptual net (VGG).

That's about what I was thinking too. But also that 150 may just have been a value that appeared to work. Note that it is in fact a command line option.

@gordicaleksa
Author

Do we have a way of pinging Johnson about this? It'd be nice to learn the meaning of that magic number and whether our hypothesis is true. @jcjohnson

@htoyryla
Contributor

Seems to me that he's not been active here for a long time.

The logical place to start is to consult the paper: https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf

From section 3.1

"All nonresidual convolutional layers are followed by batch normalization [50] and ReLU
nonlinearities with the exception of the output layer, which instead uses a scaled
tanh to ensure that the output has pixels in the range [0, 255]."
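
For comparison, two sketches (mine, not the repo's actual code) of how that sentence can be read next to what the 150*tanh option does:

```python
import torch

def scaled_tanh_paper(x):
    # Literal reading of the paper: map tanh's [-1, 1] onto [0, 255].
    return 255.0 * (torch.tanh(x) + 1.0) / 2.0

def scaled_tanh_repo(x, scale=150.0):
    # What the command-line option suggests: a [-150, 150] range, roughly the
    # span of a 0..255 image after ImageNet mean subtraction (see above).
    return scale * torch.tanh(x)
```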

@gordicaleksa
Copy link
Author

I've implemented the paper without any tanh activation and it works like a charm: https://github.com/gordicaleksa/pytorch-nst-feedforward

Could you maybe link that repo as a PyTorch implementation of Johnson's original paper? I've documented the differences from the original paper in the transformer_net.py file.
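
For reference, a minimal sketch of that design choice (illustrative only; the class name and layer hyper-parameters here are made up, the real layer lives in transformer_net.py in the linked repo): the final convolution is simply left unbounded, with no tanh or scaling on top.

```python
import torch.nn as nn

class UnboundedOutputConv(nn.Module):
    """Hypothetical final layer: a plain conv with no tanh/scaling afterwards."""
    def __init__(self, in_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 3, kernel_size=9, stride=1, padding=4)

    def forward(self, x):
        return self.conv(x)  # the loss / deprocessing handles the range downstream
```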
