Hi, thanks for sharing this fantastic work.
I am trying to understand the cross-attention used in the model. Here, the conditional context has only one token, i.e., the CLIP embedding concatenated with the pose. As a result, the cross-attention matrix has size [num_spatial_tokens, 1], so after the softmax every attention weight is exactly one. The output simply copies the value vector to each spatial location (or adds it, if we account for the residual connection). It seems that K and Q are redundant in this case. Is this the expected behavior?
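To make the point concrete, here is a minimal NumPy sketch (hypothetical shapes and names, not the repository's code) showing that with a single context token the softmax weights are identically 1, so the attention output is just the value vector broadcast to every spatial location:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# illustrative sizes, not taken from the model config
num_spatial_tokens, d_model = 16, 8
rng = np.random.default_rng(0)

q = rng.standard_normal((num_spatial_tokens, d_model))  # queries: spatial tokens
k = rng.standard_normal((1, d_model))                   # key: single context token
v = rng.standard_normal((1, d_model))                   # value: single context token

# attention matrix has shape [num_spatial_tokens, 1]; a softmax over a
# single key is always exactly 1, regardless of Q and K
attn = softmax(q @ k.T / np.sqrt(d_model), axis=-1)
assert np.allclose(attn, 1.0)

# the output just broadcasts the value vector to every spatial location
out = attn @ v
assert np.allclose(out, np.broadcast_to(v, out.shape))
```

So with one context token, Q and K can change the (pre-softmax) logits but never the attention weights; the only learnable effect that survives is the value (and output) projection of that single token.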