By - Rishabh

Why?

The self-attention mechanism in transformers exists to capture the relationships between tokens in a sequence. Because self-attention is permutation invariant, if we do not enrich it with positional information, any relationship that depends on word order is lost.

The self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.

$$ Attn(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
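
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the sequence length (4) and head dimension (8) are arbitrary toy values. The last lines also demonstrate the permutation invariance mentioned above: shuffling the keys and values together leaves the output unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)        # attention weights over the keys
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))   # toy queries, keys, values

out = attention(Q, K, V)

# Permutation invariance: shuffling the key/value rows together leaves each
# query's output unchanged, so token order carries no information by itself.
perm = rng.permutation(4)
print(np.allclose(out, attention(Q, K[perm], V[perm])))  # True
```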

In other positional embedding methods (integer-based, BPE, sinusoidal), we generated a position encoding vector and added it to the token embedding before passing it to the attention layer.
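
As a sketch of this additive approach, here is a minimal NumPy implementation of sinusoidal position encodings being added to toy token embeddings; the 10000 base and sin/cos layout follow the standard sinusoidal formulation, while the shapes and random embeddings are purely illustrative.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]        # even embedding dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # sin on even dims
    pe[:, 1::2] = np.cos(angles)                 # cos on odd dims
    return pe

token_emb = np.random.default_rng(1).normal(size=(4, 8))   # toy token embeddings
x = token_emb + sinusoidal_positions(4, 8)   # position is *added* before attention
```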

So, what's the problem?

By adding, we lose the pure meaning of the token (we pollute the embedding). We should instead aim to encode positional information without polluting the embeddings, and multiplication is a natural way to do this.

The influence of one token over another is determined by the QKᵀ dot product between the query and key vectors.

$$ \vec{a} \cdot \vec{b} = |\vec{a}||\vec{b}|\cos\theta $$

The geometric interpretation of the dot product gives us an interesting insight: we can modulate the result of the dot product simply by changing the angle between the two vectors, and this does not pollute the embeddings, since their magnitudes are left untouched.
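
Here is a toy 2-D sketch of that insight, assuming an arbitrary base angle `theta`: if we rotate the query by an angle proportional to its position m and the key by an angle proportional to its position n, the dot product ends up depending only on the relative offset m − n.

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2-D vector by `angle` radians (its magnitude is preserved)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ v

q = np.array([1.0, 2.0])     # toy query
k = np.array([0.5, -1.0])    # toy key
theta = 0.3                  # arbitrary base angle

# Same relative offset (m - n = 2) at different absolute positions
for m, n in [(2, 0), (5, 3), (10, 8)]:
    score = rotate(q, m * theta) @ rotate(k, n * theta)
    print(m, n, round(float(score), 6))   # identical score every time
```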

The analogy above might be a bit tricky to digest, so here's an intuition for it:

Let's start with how we represent a word in a transformer. Each word carries two types of information: its semantic meaning (the token embedding) and its position in the sequence.

With additive positional encoding, it's like taking two different pieces of paper (the word's meaning and its positional information) and gluing them together, so the original word meaning gets mixed up with the positional information in a way that can distort both.

With multiplicative positional encoding, it's like bending the semantic information based on the position of the word, without changing its fundamental nature.
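
To make the "gluing vs. bending" contrast concrete, here is a tiny sketch with made-up numbers: adding a positional vector changes the embedding itself (its norm and direction), while rotating it preserves the norm exactly.

```python
import numpy as np

emb = np.array([3.0, 4.0])        # toy word embedding, |emb| = 5
pos_vec = np.array([0.9, -0.2])   # a made-up additive position vector
angle = 0.7                       # a made-up positional rotation angle

added = emb + pos_vec
c, s = np.cos(angle), np.sin(angle)
rotated = np.array([[c, -s], [s, c]]) @ emb

print(np.linalg.norm(emb))       # 5.0
print(np.linalg.norm(added))     # ~5.44 -> addition distorts the embedding
print(np.linalg.norm(rotated))   # 5.0   -> rotation keeps its "fundamental nature"
```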

Example (More Intuition):