Implement masked attention scores for a Transformer, where certain key positions are blocked so they contribute zero probability to the final attention. You’ll compute the scaled dot-product scores and apply a boolean mask before the softmax.
The attention scores are defined as:

$$
\text{Attention}(Q, K) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)
$$

where $M_{ij} = 0$ if key $j$ is allowed for query $i$, and $M_{ij} = -\infty$ (in practice, a large negative number such as $-10^9$) if it is masked.
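As a quick sanity check with toy numbers (not part of the problem statement), adding a large negative bias to a masked score drives its softmax weight to roughly zero:

```python
import numpy as np

# One query, two keys; the second key is masked.
scores = np.array([2.0, 3.0])
bias = np.array([0.0, -1e9])          # 0 where allowed, -1e9 where masked
masked = scores + bias

# Numerically stable softmax over the single row.
weights = np.exp(masked - masked.max())
weights /= weights.sum()
print(weights)                        # ~[1.0, 0.0] -- the masked key gets no mass
```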
Implement a function that takes Q, K, and mask and returns the attention weight matrix.
Rules:
- Add -1e9 to masked score entries (where mask[i][j] == 0).
- Use NumPy only (no torch, no scipy, etc.).

Input:

| Argument | Type |
|---|---|
| K | np.ndarray |
| Q | np.ndarray |
| mask | np.ndarray |

Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
- Use NumPy only; no torch or scipy.
- Return an np.ndarray.
- The softmax must be numerically stable across each row (see the illustration below).
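To see why the stability requirement matters, compare a naive softmax with the max-subtracted version on large (made-up) scores; the naive one overflows:

```python
import numpy as np

row = np.array([1000.0, 1001.0, 1002.0])   # illustrative large scores

naive = np.exp(row)                        # overflows to inf
stable = np.exp(row - row.max())           # exponents <= 0, no overflow

print(naive / naive.sum())                 # [nan nan nan]
print(stable / stable.sum())               # ~[0.090, 0.245, 0.665]
```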
1. Compute the raw score matrix with a scaled dot product: scores = (Q @ K.T) / sqrt(d), where d is the embedding dimension.
2. Turn the 0/1 mask into an additive bias: add -1e9 wherever mask[i][j] == 0, so those entries become ~0 after the softmax.
3. Apply a numerically stable row-wise softmax: subtract each row's max before exponentiating, then divide by the row sum. (A full sketch combining these steps follows below.)
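Putting the three steps together, a minimal NumPy sketch might look like the following. The function name masked_attention_weights and the example shapes are illustrative assumptions; only the steps themselves come from the hints above.

```python
import numpy as np

def masked_attention_weights(Q: np.ndarray, K: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Sketch of masked scaled dot-product attention (hypothetical function name).

    Q:    (n_q, d) query matrix
    K:    (n_k, d) key matrix
    mask: (n_q, n_k) array of 0/1, where 0 marks a blocked key position
    Returns the (n_q, n_k) attention weight matrix (each row sums to 1).
    """
    d = Q.shape[-1]
    # Step 1: scaled dot-product scores.
    scores = (Q @ K.T) / np.sqrt(d)
    # Step 2: additive bias -- add -1e9 where the mask is 0.
    scores = np.where(mask == 0, scores - 1e9, scores)
    # Step 3: numerically stable row-wise softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Example usage with made-up shapes and values:
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mask = np.array([[1, 1, 0],
                 [1, 0, 1]])                 # 0 = blocked key position
print(masked_attention_weights(Q, K, mask))  # masked columns get ~0 weight
```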