Thanks so much for the great article; I'm discussing it as background material for a journal club!
One point of confusion was the images and description for the decoder side, which suggest that the Keys and Values from the encoder are passed to the decoder:
> The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer which helps the decoder focus on appropriate places in the input sequence.
My understanding is that this isn't quite the case: it's the encoder outputs themselves ("Z") that are passed, not these intermediate K and V. As "Attention Is All You Need" puts it:
> In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
In addition, here's a popular implementation of the Transformer showing that only the outputs, not the keys and values, are passed into the decoder:
I just wanted to confirm that this is correct. Please let me know if I've gotten this wrong or missed something; if not, I'd suggest updating the language and image.
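To make the distinction concrete, here's a minimal single-head sketch (just an illustration, not any particular library's code; class and parameter names are made up) of how the decoder's encoder-decoder attention layer projects the encoder output Z into its own keys and values:

```python
import torch
import torch.nn as nn

class EncoderDecoderAttention(nn.Module):
    """Sketch of a decoder's cross-attention sub-layer (single head)."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # queries come from the decoder state
        self.w_k = nn.Linear(d_model, d_model)  # keys are projected from Z *here*
        self.w_v = nn.Linear(d_model, d_model)  # values are projected from Z *here*

    def forward(self, decoder_state, encoder_output):
        # Only the encoder output Z crosses the encoder/decoder boundary;
        # K and V are computed inside this layer from Z, not handed over
        # by the encoder.
        q = self.w_q(decoder_state)
        k = self.w_k(encoder_output)
        v = self.w_v(encoder_output)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ v
```

A real multi-head layer would split the projections into heads, but the point stands either way: K and V never leave the encoder; each decoder layer recreates them from Z with its own learned weights.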