We have implemented the following variants of the multi-head attention mechanism:
Causal Self-Attention is the vanilla multi-head masked self-attention layer with a projection at the end. It employs the scaled dot-product as the scoring function:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:
- $Q$, $K$, and $V$ are the query, key, and value matrices.
- $d_k$ is the dimensionality of the key vectors.
This mechanism computes a block_size × block_size attention matrix, which makes the computation quadratic in the sequence length.
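A minimal PyTorch sketch of such a layer is shown below. The hyperparameter names (`n_embd`, `n_head`, `block_size`) and the module/attribute names are illustrative assumptions, not necessarily the exact implementation in this repository:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Vanilla multi-head masked self-attention with an output projection (sketch)."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        # query, key, value projections for all heads at once
        self.query = nn.Linear(n_embd, n_embd)
        self.key = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        # output projection
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask: ones on and below the diagonal
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        hs = C // self.n_head
        # project and split into heads: (B, n_head, T, hs)
        q = self.query(x).view(B, T, self.n_head, hs).transpose(1, 2)
        k = self.key(x).view(B, T, self.n_head, hs).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_head, hs).transpose(1, 2)
        # scaled dot-product scores: (B, n_head, T, T) -- quadratic in T
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                   # (B, n_head, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```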
Synthesizer Self-Attention is a recent alternative to causal self-attention that removes the need for pairwise dot-product operations. Instead, it directly computes the block_size × block_size matrix of attention scores:
$$B = \sigma(X W_1 + b_1)\, W_2 + b_2$$

Where:
- $W_1$ and $W_2$ are learnable weight matrices.
- $b_1$ and $b_2$ are biases.
- $\sigma$ is a non-linear activation function.
Synthesizer Self-Attention thus removes the pairwise scaled dot-product from the scoring step, reducing its computational cost and offering an efficient alternative for long sequences.
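Below is a minimal sketch of this dense-synthesizer variant, assuming ReLU for $\sigma$, a per-head application of $W_2$, and the same causal mask as above; the class and parameter names are illustrative, not necessarily the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesizerSelfAttention(nn.Module):
    """Dense Synthesizer sketch: scores are predicted from each token alone,
    with no query-key dot products."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        hs = n_embd // n_head
        # B = sigma(X W_1 + b_1) W_2 + b_2, with W_2 applied per head
        self.w1 = nn.Linear(n_embd, n_embd)        # X W_1 + b_1 (all heads at once)
        self.w2 = nn.Linear(hs, block_size)        # ... W_2 + b_2 -> one score per position
        self.value = nn.Linear(n_embd, n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        hs = C // self.n_head
        # per-token hidden representation, split into heads: (B, n_head, T, hs)
        h = F.relu(self.w1(x)).view(B, T, self.n_head, hs).transpose(1, 2)
        # synthesized scores: (B, n_head, T, block_size), cropped to the current length T
        scores = self.w2(h)[:, :, :, :T]
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(scores, dim=-1)
        v = self.value(x).view(B, T, self.n_head, hs).transpose(1, 2)
        y = att @ v                                 # (B, n_head, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```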
- Synthesizer: Rethinking Self-Attention in Transformer Models