

35. Transformer self-attention block

easy
General
senior

Implement a single-head Transformer self-attention block (forward pass only) that takes a sequence of token embeddings and returns the attention-mixed outputs. This question tests whether you understand scaled dot-product attention and can implement it cleanly using basic NumPy ops.

The attention is defined as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V

Requirements

Implement the function

python
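# A sketch of the expected signature, based on the input/output tables below.
# The function name `self_attention` is an assumption; the page only states
# that the function takes X (np.ndarray) and returns an np.ndarray.
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Forward pass of single-head self-attention with Q = K = V = X."""
    ...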

Rules:

  • Treat Q = K = V = X (no learned projection matrices in this question).
  • Compute scaled dot-product attention scores: S = (Q @ K.T) / sqrt(d).
  • Apply a numerically stable softmax across each row of S to get attention weights (see the helper sketch after this list).
  • Compute the output as A @ V, where A is the attention weight matrix.
  • Return the result as a NumPy array.
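
The stable-softmax rule above is the step most often gotten wrong. A minimal helper sketch, assuming a standalone helper is acceptable (the name row_softmax is illustrative, not part of the required interface):

python

import numpy as np

def row_softmax(scores: np.ndarray) -> np.ndarray:
    # Subtract each row's maximum so np.exp never overflows;
    # softmax is invariant to this per-row shift.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    # Normalize each row so the attention weights sum to 1.
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)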

Example

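A worked mini-example under the rules above. The 3×2 input matrix is an illustrative assumption, not the original sample input, so instead of asserting specific numbers it prints the properties every correct output must have: row-stochastic attention weights and an output with the same shape as X.

python

import numpy as np

# Illustrative input: 3 tokens with embedding dimension d = 2.
# (These values are an assumption, not the original sample input.)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

d = X.shape[1]
scores = (X @ X.T) / np.sqrt(d)              # (3, 3) pairwise similarity scores
scores -= scores.max(axis=1, keepdims=True)  # shift for a stable softmax
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
output = attn @ X                            # attention-mixed embeddings

print(attn.sum(axis=1))   # each row of the attention matrix sums to 1
print(output.shape)       # (3, 2): same shape as the input X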
Input Signature

  • X: np.ndarray

Output Signature

  • value: np.ndarray

Constraints

  • Use NumPy matrix multiplication only

  • Stable softmax per row of scores

Hint 1

Get d = X.shape[1]. Use Q = K = V = X.

Hint 2

Compute attention scores with matrix multiply: scores = (X @ X.T) / np.sqrt(d). Softmax must be applied row-wise (each token attends over all tokens).

Hint 3

Implement a numerically stable softmax: subtract the row maximum with scores -= scores.max(axis=1, keepdims=True) before calling np.exp. Then compute attn = exp_scores / exp_scores.sum(axis=1, keepdims=True) and return the output attn @ X.
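
Putting the three hints together, a minimal forward-pass sketch. The function name self_attention and the type hints are assumptions, since the required signature is not shown here:

python

import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention with Q = K = V = X."""
    d = X.shape[1]                       # embedding dimension

    # Hint 2: pairwise dot-product scores, scaled by sqrt(d).
    scores = (X @ X.T) / np.sqrt(d)

    # Hint 3: numerically stable softmax over each row.
    scores -= scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    attn = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Mix the value vectors (V = X) with the attention weights.
    return attn @ X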

Roles
ML Engineer
AI Engineer
Companies
General
Levels
senior
entry
Tags
scaled-dot-product-attention
numerically-stable-softmax
numpy-linear-algebra
transformer-self-attention