Thanks for releasing the code! I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:
1. How exactly does the model take in the camera viewpoint? Is it the same as the Zero123 conditional latent diffusion architecture, where the input view (x) and a relative viewpoint transformation (R, T) are used as conditioning information? If so, are you using the same conditioning-info encoder as Zero123?
2. The report says Zero123++ uses a fixed set of six poses (relative azimuth and absolute elevation angles) as the prediction target (a sketch of this layout follows these questions).
a. Zero123 trains on a dataset of paired images and their relative camera extrinsics, {(x, x_{(R,T)}, R, T)}. Is the equivalent notation for Zero123++ {(x, x_{tiled 6 images}, R_{1..6}, T_{1..6})}?
b. Tying back to Q1, does this mean that instead of taking (x) and (R, T) as conditional input, Zero123++ takes (x_{1..6}) and (R_{1..6}, T_{1..6}) as conditional input?
3. I would like to explicitly pass in a randomly sampled camera viewpoint at inference time; is that possible? I couldn't find the part of the code that would allow this.
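For concreteness, here is a minimal sketch of the fixed six-view layout asked about in question 2. The numeric values (relative azimuths of 30° + 60°·k and absolute elevations alternating between 20° and −10°), the y-up convention, and the camera radius are assumptions commonly used with Zero123++ downstream code rather than details confirmed in this thread, so verify them against the report.

```python
import numpy as np

# Assumed fixed six-view layout for Zero123++ (verify against the report):
# relative azimuths of 30, 90, ..., 330 degrees and absolute elevations
# alternating between 20 and -10 degrees.
AZIMUTHS_DEG = [30.0, 90.0, 150.0, 210.0, 270.0, 330.0]   # relative to the input view
ELEVATIONS_DEG = [20.0, -10.0, 20.0, -10.0, 20.0, -10.0]  # absolute elevations

def camera_position(azimuth_deg: float, elevation_deg: float, radius: float = 1.5) -> np.ndarray:
    """Camera center on a sphere of the given radius (y-up convention assumed)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return radius * np.array([
        np.cos(el) * np.sin(az),
        np.sin(el),
        np.cos(el) * np.cos(az),
    ])

if __name__ == "__main__":
    for i, (az, el) in enumerate(zip(AZIMUTHS_DEG, ELEVATIONS_DEG)):
        pos = camera_position(az, el)
        print(f"view {i}: azimuth={az:+.0f} deg, elevation={el:+.0f} deg, position={pos}")
```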
We do not explicitly use any camera pose input during training or inference. The designed output views have no ambiguity given the input image, so no pose conditioning is needed. Sampling twice with different camera parameters as input would not give consistent results, which is why we previously did not think it would be helpful.
See #10 for comments on training code and camera pose conditioning.
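For context on why no pose is passed: with the released pipeline, inference is conditioned on the input image alone. The sketch below follows the usage shown in the repository README (the model ID sudo-ai/zero123plus-v1.1, the custom pipeline sudo-ai/zero123plus-pipeline, and the scheduler swap are taken from there; the input path and step count are placeholders).

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

# Load the Zero123++ pipeline (custom pipeline hosted on Hugging Face).
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
)
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing="trailing"
)
pipeline.to("cuda")

# The only conditioning is the input image: there is no (R, T) argument anywhere.
cond = Image.open("input.png")
grid = pipeline(cond, num_inference_steps=75).images[0]  # one tiled image containing the six fixed views
grid.save("output_grid.png")
```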
We do not explicitly use any camera pose input during training or inference. The designed output views have no ambiguity given the input image, so no pose conditioning is needed.
=> By "not explicitly use any camera pose input during training", do you mean that, as a training pair, you use (cond_image_i, target_grid), i = 1, ..., 12, for each mesh? Here target_grid consists of 6 images obtained by rendering a given mesh from 6 camera positions with fixed absolute elevation angles and relative azimuth angles, and cond_image_i is the image obtained by rendering the mesh from the i-th randomly chosen camera position. The number of cond_images, 12, is arbitrary.
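To make the question concrete, here is a hypothetical sketch of the pairing described above. render_view is a placeholder renderer, the pose values repeat the assumptions from the earlier sketch, and the 3×2 tiling and random-elevation range are guesses; also, because the target azimuths are relative to the input view, each conditioning view gets its own target grid in this reading, rather than one shared grid per mesh.

```python
import random
from typing import Callable, List, Tuple
from PIL import Image

# Assumed fixed target layout (same guesses as in the earlier sketch).
RELATIVE_AZIMUTHS = [30.0, 90.0, 150.0, 210.0, 270.0, 330.0]   # offsets from the conditioning view
ABSOLUTE_ELEVATIONS = [20.0, -10.0, 20.0, -10.0, 20.0, -10.0]  # absolute elevations

def tile_grid(views: List[Image.Image], rows: int = 3, cols: int = 2) -> Image.Image:
    """Tile the six target renders into one image (3 rows x 2 columns assumed)."""
    w, h = views[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, view in enumerate(views):
        grid.paste(view, ((i % cols) * w, (i // cols) * h))
    return grid

def make_training_pairs(
    render_view: Callable[[float, float], Image.Image],  # hypothetical renderer: (azimuth_deg, elevation_deg) -> image
    num_cond_views: int = 12,                             # arbitrary count, as noted in the question
) -> List[Tuple[Image.Image, Image.Image]]:
    """Return (cond_image_i, target_grid_i) pairs for one mesh, per the reading in the question above."""
    pairs = []
    for _ in range(num_cond_views):
        cond_azimuth = random.uniform(0.0, 360.0)
        cond_elevation = random.uniform(-30.0, 60.0)      # placeholder sampling range for the input viewpoint
        cond_image = render_view(cond_azimuth, cond_elevation)
        # Target azimuths are offsets from the conditioning azimuth; elevations are absolute.
        targets = [
            render_view((cond_azimuth + d_az) % 360.0, elev)
            for d_az, elev in zip(RELATIVE_AZIMUTHS, ABSOLUTE_ELEVATIONS)
        ]
        pairs.append((cond_image, tile_grid(targets)))
    return pairs
```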