Self-Attention: key mechanism that allows Transformers to process input sequences without recurrence.
Instead of processing one word at a time (like RNNs),
Self-Attention lets each word attend to all other words.
Each word is transformed into three vectors:
Query (Q): “What am I looking for?”
Key (K): “What do I have?”
Value (V): “What information do I pass forward?”
Each word compares itself to every other word in the sequence.
Words that are relevant to each other get higher attention scores.
The model weighs the words based on these scores before making a decision.
Input sequence:
E.g., “The cat sat on the mat”.
Tokenized and embedded into a matrix $X$ with one row per token (as seen for RNNs).
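A minimal sketch of this step, using random embeddings as stand-ins for learned ones (the vocabulary, embedding size, and lookup table here are illustrative, not from a real model):

```python
import numpy as np

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

d_model = 8                         # assumed embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# X has one row per token: shape (sequence_length, d_model)
X = embedding_table[[vocab[tok] for tok in tokens]]
print(X.shape)                      # (6, 8)
```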
Each word in $X$ is multiplied by three weight matrices $\left(W_{Q}, W_{K}, W_{V}\right)$, creating three new matrices: $Q=X W_{Q}, \; K=X W_{K}, \; V=X W_{V}$
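The three projections can be sketched in NumPy; the dimensions and random weight matrices below are illustrative (in a trained Transformer, $W_Q$, $W_K$, $W_V$ are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4     # illustrative sizes

X = rng.normal(size=(seq_len, d_model))   # embedded input sequence

# Learned projection matrices (random stand-ins here).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: "what am I looking for?"
K = X @ W_K   # keys:    "what do I have?"
V = X @ W_V   # values:  "what do I pass forward?"
print(Q.shape, K.shape, V.shape)    # (6, 4) each
```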
Compute attention scores:
Use the scaled dot-product attention: $\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
$Q K^{T}$ : Computes the similarity between each query and each key.
$\sqrt{d_{k}}$ : Scaling factor that keeps the dot products from growing large with dimension, which would push the softmax into regions with near-zero gradients.
Softmax: normalizes each row of scores into attention weights that sum to 1.
Multiplying by $V$: produces a weighted sum of value vectors, so each output position blends information from the words it attends to.
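The formula above can be implemented directly; this is a minimal NumPy sketch (the random $Q$, $K$, $V$ inputs are placeholders for the projections computed in the previous step):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V, row by row."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                      # (6, 4)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.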
Compute final attention output:
Words that are more relevant get a higher weight.
The attention mechanism highlights important words while downplaying irrelevant ones. E.g., when encoding "cat", the model will typically assign more weight to "sat" than to "mat".