Implement a single-head Transformer self-attention block (forward pass only) that takes a sequence of token embeddings and returns the attention-mixed outputs. This question tests whether you understand scaled dot-product attention and can implement it cleanly using basic NumPy ops.
The attention is defined as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V

Implement the function with the interface given in the tables below; a signature sketch follows them.

Rules:

- Use Q = K = V = X (no learned projection matrices in this question).
- Compute the score matrix S = (Q @ K.T) / sqrt(d).
- Apply a row-wise softmax to S to get the attention weights.
- Output: A @ V, where A is the attention weight matrix.
| Argument | Type |
|---|---|
| X | np.ndarray |
| Return Name | Type |
|---|---|
| value | np.ndarray |
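A minimal signature sketch consistent with the argument and return tables above; the function name `self_attention` is an assumption, since the prompt does not fix a name:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over X (forward pass only).

    X has shape (n_tokens, d); the return value has the same shape.
    """
    raise NotImplementedError  # filled in by the solution sketch further below
```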
- Use NumPy matrix multiplication only.
- Apply a numerically stable softmax per row of the scores (see the sketch after this list).
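As a reference for the stability requirement, here is a minimal sketch of a row-wise numerically stable softmax; the helper name `row_softmax` is illustrative, not part of the prompt:

```python
import numpy as np

def row_softmax(S: np.ndarray) -> np.ndarray:
    """Numerically stable softmax applied independently to each row of S."""
    # Subtracting the per-row max keeps np.exp from overflowing; it does not
    # change the result because softmax is shift-invariant within each row.
    shifted = S - S.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```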
1. Get d = X.shape[1] and set Q = K = V = X.
2. Compute the attention scores with a matrix multiply: scores = (X @ X.T) / np.sqrt(d). The softmax must be applied row-wise, so that each token attends over all tokens.
3. Implement a numerically stable softmax: subtract the per-row maximum (scores.max(axis=1, keepdims=True)) before np.exp, then attn = exp_scores / exp_scores.sum(axis=1, keepdims=True). The output is attn @ X (see the full sketch below).
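Putting the steps together, a minimal forward-pass sketch under the rules above (Q = K = V = X, no learned projections; the function name is an assumption):

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention, forward pass only."""
    d = X.shape[1]   # embedding dimension
    Q = K = V = X    # no learned projection matrices in this question

    # Scaled dot-product scores, shape (n_tokens, n_tokens)
    scores = (Q @ K.T) / np.sqrt(d)

    # Numerically stable row-wise softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    attn = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Attention-weighted mix of the value vectors
    return attn @ V

if __name__ == "__main__":
    # Example usage on random embeddings: output shape matches input shape.
    X = np.random.default_rng(0).normal(size=(4, 8))
    out = self_attention(X)
    print(out.shape)  # (4, 8): one attention-mixed vector per input token
```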