Is the computed BPD wrong? #3

Open
fhvilshoj opened this issue Feb 2, 2021 · 1 comment

@fhvilshoj

Hi,

There may be a mistake in the reported bits per dimension for your CIFAR10 experiments.
In particular, it seems like they are computed by this code, which accounts for either normalization (dividing by 256) or the added "sigma noise", but never for both.

Code copy-pasted here:

import numpy as np

bpd = output['L_x_val'].item()
if eval(args['data']['dequantize_uniform']):
    # offset for uniform dequantization (each discrete value spread over a width-1/256 bin)
    bpd += np.log(256)
else:
    # offset for the Gaussian "sigma noise" added to the inputs
    bpd += np.log(data.sigma) + 0.5 * np.log(2 * np.pi)
bpd -= 0.5 * np.log(2 * np.pi)
bpd /= np.log(2)  # convert nats to bits

However, it seems from the config files that both cfg['data']['dequantize_uniform'] and cfg['data']['sigma_noise'] are turned on at once.

Please correct me if I am wrong.

Once again, thanks for a very well-done code base.

@ardizzone
Member

Hi!

Very good point, thank you! I don't think we originally intended to use both at the same time (ablations showed no difference whether de-quantization was on or off). The situation is not very clear-cut, but it seems that the if/else is correct (I will, however, add an explanatory comment to the code).

I should also say that there are no real implications for the BPD numbers in the paper, as all the models are trained in the same way (so they will all be offset by the same constant), and we do not compare to numbers from other works.
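
As a minimal sketch of why this is the case (the sigma below is a hypothetical value, not the one from our configs): whichever branch of the BPD code is taken, every model's number is shifted by one and the same constant, so comparisons between our models are unaffected.

import numpy as np

sigma = 0.1  # hypothetical noise scale, not the value from the actual configs

# Net per-dimension offset of each branch (after the common -0.5*log(2*pi)
# term), converted from nats to bits:
offset_dequant = (np.log(256) - 0.5 * np.log(2 * np.pi)) / np.log(2)
offset_sigma = np.log(sigma) / np.log(2)

# The gap between the two conventions is a single constant, identical for
# every model trained the same way:
print(offset_dequant - offset_sigma)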

Dequantization here does not mean to divide by 256 (that has no effect on BPD of discrete values), but to change every discrete delta-peak into a uniform distribution with width 1/256.
From the theory standpoint, the uniform dequantization affects the original data distribution p(X),
while the "sigma noise" determines the noise distribution p(Epsilon), later leading to p(X + Epsilon).
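
As a small sketch of this distinction (assuming pixel values normalized to a 1/256 grid; the sigma value is hypothetical):

import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=8) / 256.0  # discrete values: delta peaks on a 1/256 grid

# uniform dequantization: turn each delta peak into a uniform bin of
# width 1/256, i.e. reshape the data distribution p(X) itself
x_dequant = x + rng.uniform(0.0, 1.0 / 256.0, size=x.shape)

# "sigma noise": add Gaussian Epsilon on top, so the model sees
# p(X + Epsilon), i.e. p(X) convolved with p(Epsilon)
sigma = 0.1  # hypothetical value
x_noisy = x_dequant + sigma * rng.standard_normal(x.shape)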

One could argue (see Appendix A.4) that we should only include the Epsilon noise in the BPD, regardless of dequantization.
On the other hand, this gives unusual values that do not follow the convention, used in most other papers, of including the dequantization.
If we want to include both the Epsilon noise and the dequantization, computing the constant is not trivial, because the two effects are convolved. In that case, by including only one of them, we get a bound on the true BPD (as argued by Theis et al., arxiv.org/abs/1511.01844, although they miss the fact that the difference between the bound and the true value can be computed if the dequantization is non-overlapping, as most other papers do when computing BPD).
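
For reference, a sketch of that standard argument for non-overlapping uniform dequantization (bins of width 1/256, continuous model density p, discrete probability P):

\[
P(x) = \int_0^{1/256} p(x+u)\,du,
\qquad
\mathbb{E}_{u \sim \mathcal{U}[0,\,1/256)}\big[\log p(x+u)\big]
\;\le\; \log \mathbb{E}_u\big[p(x+u)\big]
\;=\; \log P(x) + \log 256,
\]

so \(\log P(x) \ge \mathbb{E}_u[\log p(x+u)] - \log 256\), with equality exactly when p is constant on each bin; that is what makes the gap between the bound and the true value computable in the non-overlapping case.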

So in conclusion, it's not easy to say, but I think the current code is the best solution (also for people who run the code in the future to reproduce the numbers from the paper).
