Implement a single-head Transformer self-attention block (forward pass only) that takes a sequence of token embeddings and returns the attention-mixed outputs. This question tests whether you understand scaled dot-product attention and can implement it cleanly using basic NumPy ops.
The attention is defined as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V

Implement the function with the interface given in the tables below; a signature sketch follows them.

Rules:

- Use Q = K = V = X (no learned projection matrices in this question).
- Compute the score matrix S = (Q @ K.T) / sqrt(d).
- Apply a row-wise softmax to S to get the attention weights.
- Output: A @ V, where A is the attention weight matrix.
| Argument | Type |
|---|---|
| X | np.ndarray |
| Return Name | Type |
|---|---|
| value | np.ndarray |
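A minimal signature sketch consistent with the argument and return tables above; the function name `self_attention` is an assumption, since the prompt does not fix a name:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over X (forward pass only).

    X has shape (n_tokens, d); the return value has the same shape.
    """
    raise NotImplementedError  # filled in by the solution sketch further below
```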
- Use NumPy matrix multiplication only.
- Apply a numerically stable softmax per row of the scores (see the sketch after this list).
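As a reference for the stability requirement, here is a minimal sketch of a row-wise numerically stable softmax; the helper name `row_softmax` is illustrative, not part of the prompt:

```python
import numpy as np

def row_softmax(S: np.ndarray) -> np.ndarray:
    """Numerically stable softmax applied independently to each row of S."""
    # Subtracting the per-row max keeps np.exp from overflowing; it does not
    # change the result because softmax is shift-invariant within each row.
    shifted = S - S.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```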
1. Get d = X.shape[1] and set Q = K = V = X.
2. Compute the attention scores with a matrix multiply: scores = (X @ X.T) / np.sqrt(d). The softmax must be applied row-wise, so that each token attends over all tokens.
3. Implement a numerically stable softmax: subtract the per-row maximum (scores.max(axis=1, keepdims=True)) before np.exp, then attn = exp_scores / exp_scores.sum(axis=1, keepdims=True). The output is attn @ X (see the full sketch below).
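Putting the steps together, a minimal forward-pass sketch under the rules above (Q = K = V = X, no learned projections; the function name is an assumption):

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention, forward pass only."""
    d = X.shape[1]   # embedding dimension
    Q = K = V = X    # no learned projection matrices in this question

    # Scaled dot-product scores, shape (n_tokens, n_tokens)
    scores = (Q @ K.T) / np.sqrt(d)

    # Numerically stable row-wise softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    attn = exp_scores / exp_scores.sum(axis=1, keepdims=True)

    # Attention-weighted mix of the value vectors
    return attn @ V

if __name__ == "__main__":
    # Example usage on random embeddings: output shape matches input shape.
    X = np.random.default_rng(0).normal(size=(4, 8))
    out = self_attention(X)
    print(out.shape)  # (4, 8): one attention-mixed vector per input token
```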