LLM 25-Day Course - Day 5: Deep Dive into Attention Mechanisms

The heart of the Transformer is attention. Today we'll build a thorough understanding of everything from the meaning of Query/Key/Value to Multi-Head Attention and positional encoding by implementing each piece from scratch with NumPy.

Intuitive Understanding of Query, Key, Value

Using a library analogy:

  • Query (Q): What I’m looking for (the search query)
  • Key (K): The title/tags of each book (the index)
  • Value (V): The actual content of the book (the content)

We compute the similarity between Q and each K, convert those similarities into weights with softmax, and take a weighted sum of the Values in proportion to those weights.
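To make this concrete, here is a toy sketch with made-up 2-dimensional vectors (not from any real model): a query that matches the first key more strongly pulls most of its output from the first value.

```python
import numpy as np

q = np.array([1.0, 0.0])                       # what we're looking for
keys = np.array([[1.0, 0.0], [0.0, 1.0]])      # "book tags"
values = np.array([[10.0, 0.0], [0.0, 10.0]])  # "book contents"

scores = keys @ q                                # similarity of q to each key: [1.0, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax ~ [0.73, 0.27]
output = weights @ values                        # mostly the first value: ~ [7.31, 2.69]
print(output)
```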

Scaled Dot-Product Attention Implementation

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k) - Query
    K: (seq_len, d_k) - Key
    V: (seq_len, d_v) - Value
    mask: used in the decoder to hide future tokens
    """
    d_k = K.shape[-1]

    # Dot product then scale (large d_k leads to large dot products, making softmax extreme)
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)

    # Masking: prevent the decoder from seeing future tokens
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    # Generate probability distribution with Softmax
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    return np.matmul(weights, V), weights

# 4 tokens, 8 dimensions
seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

# Causal mask (GPT-style: only attend to previous tokens)
causal_mask = np.tril(np.ones((seq_len, seq_len)))
print(f"Causal mask:\n{causal_mask.astype(int)}")

output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Attention weights:\n{weights.round(3)}")
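As a sanity check, we can verify two properties of the causal mask: every above-diagonal (future-token) weight is zero, and each row still sums to 1. This standalone sketch repeats the same steps as `scaled_dot_product_attention` so it runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

# Same steps as scaled_dot_product_attention above
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(np.tril(np.ones((seq_len, seq_len))) == 0, -1e9, scores)
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Above-diagonal entries (future tokens) are numerically zero
assert np.allclose(np.triu(weights, k=1), 0.0)
# Each row is still a valid probability distribution
assert np.allclose(weights.sum(axis=-1), 1.0)
```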

Multi-Head Attention Implementation

def multi_head_attention(x, num_heads, d_model):
    """
    Run multiple attention heads in parallel.
    Each head learns different relationship patterns.
    """
    d_k = d_model // num_heads  # assumes d_model is divisible by num_heads
    seq_len = x.shape[0]
    outputs = []

    for head in range(num_heads):
        # Separate Q, K, V projection weights for each head
        # (random here for illustration; in a real model these are learned parameters)
        W_q = np.random.randn(d_model, d_k) * 0.1
        W_k = np.random.randn(d_model, d_k) * 0.1
        W_v = np.random.randn(d_model, d_k) * 0.1

        Q = np.matmul(x, W_q)
        K = np.matmul(x, W_k)
        V = np.matmul(x, W_v)

        head_output, _ = scaled_dot_product_attention(Q, K, V)
        outputs.append(head_output)

    # Concatenate outputs from all heads
    concatenated = np.concatenate(outputs, axis=-1)

    # Final linear projection
    W_o = np.random.randn(d_model, d_model) * 0.1
    return np.matmul(concatenated, W_o)

d_model = 64
num_heads = 8  # 64 / 8 = 8 dimensions per head
x = np.random.randn(4, d_model)

output = multi_head_attention(x, num_heads, d_model)
print(f"Multi-Head Attention output: {output.shape}")
# Head 1: subject-verb relationships, Head 2: adjective-noun relationships, ...
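The dimension bookkeeping in the loop above can be checked in isolation. This sketch uses random projections, just like the demo, purely to confirm the shapes: each head projects d_model down to d_k, so concatenating num_heads head outputs restores the model dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 64, 8
d_k = d_model // num_heads  # 8 dims per head
x = rng.standard_normal((seq_len, d_model))

# One (d_model, d_k) projection per head, then concatenate along the feature axis
heads = [x @ rng.standard_normal((d_model, d_k)) for _ in range(num_heads)]
concatenated = np.concatenate(heads, axis=-1)
assert concatenated.shape == (seq_len, num_heads * d_k) == (seq_len, d_model)
```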

Positional Encoding: Sinusoidal vs RoPE

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Original Transformer positional encoding"""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions: sin
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions: cos
    return pe

def apply_rope(x, position):
    """RoPE (Rotary Position Embedding) - used in Llama, GPT-NeoX, etc."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs

    # Rotate even/odd dimensions
    cos_vals = np.cos(angles)
    sin_vals = np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated_even = x_even * cos_vals - x_odd * sin_vals
    rotated_odd = x_even * sin_vals + x_odd * cos_vals

    result = np.zeros_like(x)
    result[..., 0::2] = rotated_even
    result[..., 1::2] = rotated_odd
    return result

# Verify sinusoidal positional encoding
pe = sinusoidal_position_encoding(max_len=10, d_model=16)
print(f"Positional encoding shape: {pe.shape}")
print(f"Distance between position 0 and 1: {np.linalg.norm(pe[0] - pe[1]):.3f}")
print(f"Distance between position 0 and 9: {np.linalg.norm(pe[0] - pe[9]):.3f}")
# Closer positions have more similar vectors
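One property worth verifying is what makes RoPE "relative": the dot product between a rotated query and a rotated key depends only on the position offset, not the absolute positions. This standalone sketch repeats `apply_rope` so it runs on its own:

```python
import numpy as np

def apply_rope(x, position):  # same as apply_rope above
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = position * freqs
    cos_vals, sin_vals = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    result = np.zeros_like(x)
    result[..., 0::2] = x_even * cos_vals - x_odd * sin_vals
    result[..., 1::2] = x_even * sin_vals + x_odd * cos_vals
    return result

rng = np.random.default_rng(0)
q, k = rng.standard_normal(16), rng.standard_normal(16)

# Positions (2, 5) and (12, 15) have the same offset of 3,
# so the attention scores come out identical.
score_a = apply_rope(q, 2) @ apply_rope(k, 5)
score_b = apply_rope(q, 12) @ apply_rope(k, 15)
print(np.isclose(score_a, score_b))  # True
```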

Attention learns “which token should attend to which other token.” Multi-Head performs this simultaneously from multiple perspectives, creating richer representations.

Today’s Exercises

  1. Explain mathematically why we scale by np.sqrt(d_k). Experiment with d_k=64 and compare the softmax output with and without scaling.
  2. Change the number of heads in Multi-Head Attention to 1, 4, 8, and 16, and observe how d_k changes. What problems arise when there are too many heads?
  3. Summarize the key differences between sinusoidal positional encoding and RoPE, and research why modern models prefer RoPE.
