Implement masked attention scores for a Transformer, where certain key positions are blocked so they contribute zero probability to the final attention. You’ll compute the scaled dot-product scores and apply a boolean mask before the softmax.
The attention scores are defined as:

$$
\text{Attention}(Q, K) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)
$$

where $M_{ij} = 0$ if key $j$ is allowed for query $i$, and $M_{ij} = -\infty$ (in practice, a large negative number such as $-10^9$) if it is masked.
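As a quick sanity check with toy numbers (not part of the problem statement), adding a large negative bias to a masked score drives its softmax weight to roughly zero:

```python
import numpy as np

# One query, two keys; the second key is masked.
scores = np.array([2.0, 3.0])
bias = np.array([0.0, -1e9])          # 0 where allowed, -1e9 where masked
masked = scores + bias

# Numerically stable softmax over the single row.
weights = np.exp(masked - masked.max())
weights /= weights.sum()
print(weights)                        # ~[1.0, 0.0] -- the masked key gets no mass
```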
Implement a function that takes Q, K, and mask and returns the attention weight matrix.
Rules:
- Add -1e9 to masked score entries (where mask[i][j] == 0).
- Use NumPy only (no torch, no scipy, etc.).

Input:

| Argument | Type |
|---|---|
| K | np.ndarray |
| Q | np.ndarray |
| mask | np.ndarray |

Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
- Use NumPy only; no torch or scipy.
- Return an np.ndarray.
- The softmax must be numerically stable across each row (see the illustration below).
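To see why the stability requirement matters, compare a naive softmax with the max-subtracted version on large (made-up) scores; the naive one overflows:

```python
import numpy as np

row = np.array([1000.0, 1001.0, 1002.0])   # illustrative large scores

naive = np.exp(row)                        # overflows to inf
stable = np.exp(row - row.max())           # exponents <= 0, no overflow

print(naive / naive.sum())                 # [nan nan nan]
print(stable / stable.sum())               # ~[0.090, 0.245, 0.665]
```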
1. Compute the raw score matrix with a scaled dot product: scores = (Q @ K.T) / sqrt(d), where d is the embedding dimension.
2. Turn the 0/1 mask into an additive bias: add -1e9 wherever mask[i][j] == 0, so those entries become ~0 after the softmax.
3. Apply a numerically stable row-wise softmax: subtract each row's max before exponentiating, then divide by the row sum. (A full sketch combining these steps follows below.)
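Putting the three steps together, a minimal NumPy sketch might look like the following. The function name masked_attention_weights and the example shapes are illustrative assumptions; only the steps themselves come from the hints above.

```python
import numpy as np

def masked_attention_weights(Q: np.ndarray, K: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sketch of masked scaled dot-product attention (hypothetical function name).

    Q:    (n_q, d) query matrix
    K:    (n_k, d) key matrix
    mask: (n_q, n_k) array of 0/1, where 0 marks a blocked key position
    Returns the (n_q, n_k) attention weight matrix (each row sums to 1).
    """
    d = Q.shape[-1]
    # Step 1: scaled dot-product scores.
    scores = (Q @ K.T) / np.sqrt(d)
    # Step 2: additive bias -- add -1e9 where the mask is 0.
    scores = np.where(mask == 0, scores - 1e9, scores)
    # Step 3: numerically stable row-wise softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Example usage with made-up shapes and values:
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mask = np.array([[1, 1, 0],
                 [1, 0, 1]])                 # 0 = blocked key position
print(masked_attention_weights(Q, K, mask))  # masked columns get ~0 weight
```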