Causal and Synthesizer Multihead Attention

We have implemented two variants of the multi-head attention mechanism:

1. Causal Self-Attention

Causal Self-Attention is the vanilla multi-head masked self-attention layer with a projection at the end. It employs the scaled dot-product as the scoring function:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • Q, K, and V are the query, key, and value matrices.
  • d_k is the dimensionality of the key vectors.

This mechanism computes a block_size × block_size attention matrix, which makes the computation quadratic in the sequence length.
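Below is a minimal PyTorch sketch of this layer, following the formula above. The module name, constructor arguments, and the omission of dropout are illustrative assumptions, not necessarily the exact code in this repository.

```python
# Minimal sketch of causal multi-head self-attention (assumed names/signatures).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Vanilla multi-head masked self-attention with an output projection."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        # key, query, value projections for all heads
        self.key = nn.Linear(n_embd, n_embd)
        self.query = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        # output projection
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask: position i may only attend to positions <= i
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()          # batch, sequence length, embedding dim
        hs = C // self.n_head       # per-head dimension d_k

        # (B, T, C) -> (B, n_head, T, hs)
        k = self.key(x).view(B, T, self.n_head, hs).transpose(1, 2)
        q = self.query(x).view(B, T, self.n_head, hs).transpose(1, 2)
        v = self.value(x).view(B, T, self.n_head, hs).transpose(1, 2)

        # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)                 # (B, nh, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                                     # (B, nh, T, hs)

        # re-assemble heads and apply the final projection
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

For example, `CausalSelfAttention(n_embd=64, n_head=8, block_size=16)` maps an input of shape `(batch, 16, 64)` to an output of the same shape, with each position attending only to itself and earlier positions.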


2. Synthesizer Self-Attention

Synthesizer Self-Attention is a recent alternative to causal self-attention that removes the need for pairwise dot-product operations. Instead, it synthesizes the block_size × block_size matrix of attention scores directly from the input:

$$A = W_2 \sigma(W_1X + b_1) + b_2$$

Where:

  • W_1, W_2 are learnable weight matrices.
  • b_1, b_2 are biases.
  • σ is a non-linear activation function.

By synthesizing the scores directly, Synthesizer Self-Attention avoids the pairwise query–key dot products of scaled dot-product attention and offers a potentially more efficient alternative for long sequences.
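
The sketch below is a minimal PyTorch version of a dense synthesizer head built from the formula above. The per-head parameter shapes, the choice of ReLU for σ, and all module and parameter names are assumptions made for illustration, not necessarily how this repository implements it.

```python
# Minimal sketch of dense Synthesizer self-attention: A = W2 * relu(W1 x + b1) + b2
# (names, shapes, and the ReLU non-linearity are assumptions for illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SynthesizerSelfAttention(nn.Module):
    """Multi-head attention whose scores are synthesized from X, not from Q K^T."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        hs = n_embd // n_head
        # W1/b1: per-token hidden transform; W2/b2: map each token to one score per position
        self.w1 = nn.Linear(n_embd, n_embd)
        self.w2 = nn.Parameter(torch.empty(n_head, hs, block_size))
        self.b2 = nn.Parameter(torch.zeros(block_size))
        nn.init.normal_(self.w2, std=0.02)
        self.value = nn.Linear(n_embd, n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask, as in the dot-product variant
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        hs = C // self.n_head

        # hidden = relu(W1 x + b1), split into heads: (B, nh, T, hs)
        hidden = F.relu(self.w1(x)).view(B, T, self.n_head, hs).transpose(1, 2)
        # synthesized scores: (B, nh, T, block_size), truncated to the current length T
        scores = hidden @ self.w2 + self.b2
        scores = scores[:, :, :, :T]

        # causal mask + softmax, then weight the values as usual
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(scores, dim=-1)
        v = self.value(x).view(B, T, self.n_head, hs).transpose(1, 2)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

Note that no query or key projections appear: each row of attention scores is produced from a single token's representation, so no pairwise dot products between positions are computed.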
