Hi, thanks for sharing this fantastic work.
I am trying to understand the cross-attention used in the model. Here, the conditional context has only one token, i.e., the CLIP embedding concatenated with the pose. As a result, the cross-attention matrix has size [num_spatial_tokens, 1], so after the softmax every attention weight is exactly one. The output simply copies the value vector to each spatial location (or adds it, if we account for the residual connection). It seems that K and Q are redundant in this case. Is this the expected behavior?
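To make the point concrete, here is a minimal NumPy sketch (hypothetical shapes and names, not the repository's code) showing that with a single context token the softmax weights are identically 1, so the attention output is just the value vector broadcast to every spatial location:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# illustrative sizes, not taken from the model config
num_spatial_tokens, d_model = 16, 8
rng = np.random.default_rng(0)

q = rng.standard_normal((num_spatial_tokens, d_model))  # queries: spatial tokens
k = rng.standard_normal((1, d_model))                   # key: single context token
v = rng.standard_normal((1, d_model))                   # value: single context token

# attention matrix has shape [num_spatial_tokens, 1]; a softmax over a
# single key is always exactly 1, regardless of Q and K
attn = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
assert np.allclose(attn, 1.0)

# the output just broadcasts the value vector to every spatial location
out = attn @ v
assert np.allclose(out, np.broadcast_to(v, out.shape))
```

So with one context token, Q and K can change the (pre-softmax) logits but never the attention weights; the only learnable effect that survives is the value (and output) projection of that single token.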