Inter-Attention
- It is basically the same as an encoder-decoder built from LSTMs/RNNs, but each output state s[i] is formed as a function of all the input hidden states.
- Something like: s[i] = sum_over_j(w[i][j] * h[j]). The weights w come from a scoring (alignment) function, and back-prop learns that function's parameters. A toy sketch is below.
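
A minimal NumPy sketch of the bullet above, just to make the weighted sum concrete. The dot-product scoring, the shapes, and the function names are my own assumptions, not from these notes; dot-product scoring has no learnable parameters of its own, while an additive (Bahdanau-style) scorer would be where back-prop actually learns weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def inter_attention(decoder_state, encoder_states):
    """Toy inter-attention with dot-product scoring:
    the weights are computed from the decoder state and each
    encoder hidden state, then normalised to sum to 1."""
    scores = encoder_states @ decoder_state   # score[j] = s_dec . h[j]
    w = softmax(scores)                       # attention weights over j
    context = w @ encoder_states              # s[i] = sum_over_j(w[j] * h[j])
    return context, w

# hypothetical sizes: 5 encoder steps, hidden size 8
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))      # encoder hidden states h[j]
s_dec = rng.normal(size=(8,))    # current decoder state
context, weights = inter_attention(s_dec, h)
```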
Self-Attention
- The vanishing-gradient problem is only partly addressed: the path length through the recurrent encoder is still O(n), where n is the seq_len.
- Inter-attention also can't be parallelised, since the underlying RNN still processes the sequence one token at a time.
- The next idea is self-attention: every word in the input sequence attends to the other words in the input sequence itself, as sketched below.
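
A minimal single-head scaled dot-product sketch of self-attention in NumPy. The projection matrices Wq/Wk/Wv and all sizes here are hypothetical, chosen only to show that every token's output is a weighted mix over the whole sequence; because the scores come from one matrix multiply, all positions are processed in parallel, unlike a recurrent encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention:
    every position in X attends to every position in X itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): token i scored against token j
    A = softmax(scores, axis=-1)      # attention weights per query token
    return A @ V                      # each output is a mix over all positions

# hypothetical sizes: n=4 tokens, model dim 8, head dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # token representations
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                    # shape (4, 8)
```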
Blogs/Resources