```python
ass_mask = torch.ones(q_size2*q_size1, 1, 1, q_size0).cuda()  # [31*128, 1, 1, 11]
x, self.attn_asset = attention(ass_query, ass_key, ass_value, mask=None,
                               dropout=self.dropout)
```
Within `MultiHeadedAttention`, `ass_mask` is constructed but never passed into the `attention` call above, so it appears to be unused. If I understand correctly, an attention mask is necessary to prevent look-ahead bias: it should mask off future positions when the attention weights are computed.

If this mask is unused, what was its intent? Where is attention actually being masked, and how should the mask be applied?
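For reference, here is a minimal sketch of how a causal mask is typically applied in a scaled dot-product `attention` function of this style (the repo's actual implementation isn't shown in this issue, so the function bodies and the `subsequent_mask` helper below are illustrative assumptions, not the project's code):

```python
import torch

def attention(query, key, value, mask=None, dropout=None):
    # Scaled dot-product attention: scores are masked *before* the softmax,
    # so masked positions receive ~zero attention weight.
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

def subsequent_mask(size):
    # Lower-triangular causal mask: position i may attend only to
    # positions <= i, which is what prevents look-ahead bias.
    return torch.tril(torch.ones(1, size, size)).bool()

# Hypothetical shapes for illustration: [batch, heads, seq_len, d_k]
q = k = v = torch.randn(2, 4, 11, 8)
mask = subsequent_mask(11)  # broadcasts over batch and head dims
out, attn = attention(q, k, v, mask=mask)
```

With this, `attn[..., i, j]` is (numerically) zero for every future position `j > i`. A mask of all ones, like the `ass_mask` above, would mask nothing even if it were passed in, which makes its intent here unclear.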