Thanks for releasing the code! I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:
1. How exactly does the model take in the camera viewpoint? Is it the same as the Zero123 conditional latent diffusion architecture, where the input view (x) and a relative viewpoint transformation (R, T) are used as conditioning information? If so, are you using the same conditioning-info encoder as Zero123?
2. The report says Zero123++ uses a fixed set of six poses (relative azimuth and absolute elevation angles) as the prediction target (a sketch of this layout follows these questions).
a. Zero123 trains on a dataset of paired images and their relative camera extrinsics, {(x, x_{(R,T)}, R, T)}. Is the equivalent notation for Zero123++ {(x, x_{tiled 6 images}, R_{1..6}, T_{1..6})}?
b. Tying back to Q1, does this mean that instead of taking (x) and (R, T) as conditional input, Zero123++ takes (x_{1..6}) and (R_{1..6}, T_{1..6}) as conditional input?
3. I would like to explicitly pass in a randomly sampled camera viewpoint at inference time; is that possible? I couldn't find the part of the code that would allow this.
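For concreteness, here is a minimal sketch of the fixed six-view layout asked about in question 2. The numeric values (relative azimuths of 30° + 60°·k and absolute elevations alternating between 20° and −10°), the y-up convention, and the camera radius are assumptions commonly used with Zero123++ downstream code rather than details confirmed in this thread, so verify them against the report.

```python
import numpy as np

# Assumed fixed six-view layout for Zero123++ (verify against the report):
# relative azimuths of 30, 90, ..., 330 degrees and absolute elevations
# alternating between 20 and -10 degrees.
AZIMUTHS_DEG = [30.0, 90.0, 150.0, 210.0, 270.0, 330.0]   # relative to the input view
ELEVATIONS_DEG = [20.0, -10.0, 20.0, -10.0, 20.0, -10.0]  # absolute elevations

def camera_position(azimuth_deg: float, elevation_deg: float, radius: float = 1.5) -> np.ndarray:
    """Camera center on a sphere of the given radius (y-up convention assumed)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return radius * np.array([
        np.cos(el) * np.sin(az),
        np.sin(el),
        np.cos(el) * np.cos(az),
    ])

if __name__ == "__main__":
    for i, (az, el) in enumerate(zip(AZIMUTHS_DEG, ELEVATIONS_DEG)):
        pos = camera_position(az, el)
        print(f"view {i}: azimuth={az:+.0f} deg, elevation={el:+.0f} deg, position={pos}")
```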
We do not explicitly use any camera pose input during training or inference. The designed output views have no ambiguity given the input image, so no pose conditioning is needed. Sampling twice with different camera parameters as input would not give consistent results, which is why we previously did not think it would be helpful.
See #10 for comments on training code and camera pose conditioning.
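For context on why no pose is passed: with the released pipeline, inference is conditioned on the input image alone. The sketch below follows the usage shown in the repository README (the model ID sudo-ai/zero123plus-v1.1, the custom pipeline sudo-ai/zero123plus-pipeline, and the scheduler swap are taken from there; the input path and step count are placeholders).

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

# Load the Zero123++ pipeline (custom pipeline hosted on Hugging Face).
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
)
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing="trailing"
)
pipeline.to("cuda")

# The only conditioning is the input image: there is no (R, T) argument anywhere.
cond = Image.open("input.png")
grid = pipeline(cond, num_inference_steps=75).images[0]  # one tiled image containing the six fixed views
grid.save("output_grid.png")
```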
We do not explicitly use any camera pose input during training or inference. The designed output views have no ambiguity given the input image, so no pose conditioning is needed.
=> By "not explicitly use any camera pose input during training", do you mean that, as a training pair, you use (cond_image_i, target_grid), i = 1, ..., 12, for each mesh? Here target_grid consists of 6 images obtained by rendering a given mesh from 6 camera positions with fixed absolute elevation angles and relative azimuth angles, and cond_image_i is the image obtained by rendering the mesh from the i-th randomly chosen camera position. The number of cond_images, 12, is arbitrary.
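To make the question concrete, here is a hypothetical sketch of the pairing described above. render_view is a placeholder renderer, the pose values repeat the assumptions from the earlier sketch, and the 3×2 tiling and random-elevation range are guesses; also, because the target azimuths are relative to the input view, each conditioning view gets its own target grid in this reading, rather than one shared grid per mesh.

```python
import random
from typing import Callable, List, Tuple
from PIL import Image

# Assumed fixed target layout (same guesses as in the earlier sketch).
RELATIVE_AZIMUTHS = [30.0, 90.0, 150.0, 210.0, 270.0, 330.0]   # offsets from the conditioning view
ABSOLUTE_ELEVATIONS = [20.0, -10.0, 20.0, -10.0, 20.0, -10.0]  # absolute elevations

def tile_grid(views: List[Image.Image], rows: int = 3, cols: int = 2) -> Image.Image:
    """Tile the six target renders into one image (3 rows x 2 columns assumed)."""
    w, h = views[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, view in enumerate(views):
        grid.paste(view, ((i % cols) * w, (i // cols) * h))
    return grid

def make_training_pairs(
    render_view: Callable[[float, float], Image.Image],  # hypothetical renderer: (azimuth_deg, elevation_deg) -> image
    num_cond_views: int = 12,                             # arbitrary count, as noted in the question
) -> List[Tuple[Image.Image, Image.Image]]:
    """Return (cond_image_i, target_grid_i) pairs for one mesh, per the reading in the question above."""
    pairs = []
    for _ in range(num_cond_views):
        cond_azimuth = random.uniform(0.0, 360.0)
        cond_elevation = random.uniform(-30.0, 60.0)      # placeholder sampling range for the input viewpoint
        cond_image = render_view(cond_azimuth, cond_elevation)
        # Target azimuths are offsets from the conditioning azimuth; elevations are absolute.
        targets = [
            render_view((cond_azimuth + d_az) % 360.0, elev)
            for d_az, elev in zip(RELATIVE_AZIMUTHS, ABSOLUTE_ELEVATIONS)
        ]
        pairs.append((cond_image, tile_grid(targets)))
    return pairs
```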