Thanks so much for the great article; I'm discussing it as background material for a journal club!
One point of confusion was the images and description for the decoder side, which suggest that the Keys and Values from the encoder are passed to the decoder:
> The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer which helps the decoder focus on appropriate places in the input sequence.
My understanding is that this isn't quite the case: it's the encoder outputs themselves ("Z") that are passed, not these intermediate K and V. As "Attention Is All You Need" puts it:
> In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
In addition, here's a popular implementation of the Transformer showing that only the outputs, not the keys and values, are passed into the decoder:
I just wanted to confirm that this is correct. Please let me know if I've gotten this wrong or missed something; if not, I'd suggest updating the language and image.
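To make the distinction concrete, here's a minimal single-head sketch (just an illustration, not any particular library's code; class and parameter names are made up) of how the decoder's encoder-decoder attention layer projects the encoder output Z into its own keys and values:

```python
import torch
import torch.nn as nn

class EncoderDecoderAttention(nn.Module):
    """Sketch of a decoder's cross-attention sub-layer (single head)."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # queries come from the decoder state
        self.w_k = nn.Linear(d_model, d_model)  # keys are projected from Z *here*
        self.w_v = nn.Linear(d_model, d_model)  # values are projected from Z *here*

    def forward(self, decoder_state, encoder_output):
        # Only the encoder output Z crosses the encoder/decoder boundary;
        # K and V are computed inside this layer from Z, not handed over
        # by the encoder.
        q = self.w_q(decoder_state)
        k = self.w_k(encoder_output)
        v = self.w_v(encoder_output)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ v
```

A real multi-head layer would split the projections into heads, but the point stands either way: K and V never leave the encoder; each decoder layer recreates them from Z with its own learned weights.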