Implement causal attention masking for a self-attention score matrix in NLP, so each token can only attend to itself and earlier tokens. You’ll take raw attention scores and apply a mask that forces “future” positions to have zero probability after softmax.
The masked attention is:

$$\text{MaskedAttention}(S) = \operatorname{softmax}(S + M)$$

where $S$ is the score matrix and $M_{ij} = 0$ if $j \le i$, otherwise $M_{ij} = -10^9$.
Implement a function that takes the score matrix `scores` and returns the masked attention weights.
Rules:
- Positions with j > i are masked out (future tokens).
- Add -1e9 (or another very large negative number) to masked positions before the softmax.
- Apply the softmax row-wise after masking, in a numerically stable way (subtract the row max before exp).

Input:

| Argument | Type |
|---|---|
| scores | np.ndarray |

Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
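To make the spec concrete, here is what the additive mask and the expected result look like for a toy 3×3 case (the all-zero scores input is an assumed example, not part of the original statement):

```python
import numpy as np

n = 3
# Additive causal mask: 0 on and below the diagonal (j <= i), -1e9 above it (j > i).
M = np.triu(np.full((n, n), -1e9), k=1)
# M looks like (0 where attention is allowed, -1e9 where it is blocked):
#   [[ 0, -1e9, -1e9],
#    [ 0,    0, -1e9],
#    [ 0,    0,    0]]

# For all-zero scores, softmax(0 + M) row-wise gives a uniform distribution
# over the non-masked (non-future) positions:
#   row 0 -> [1, 0, 0], row 1 -> [0.5, 0.5, 0], row 2 -> [1/3, 1/3, 1/3]
print(M)
```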
Use NumPy only; no other libraries.
Return a NumPy array.
Apply a row-wise, numerically stable softmax after masking.
Create a lower-triangular allow-mask: positions with j > i should be suppressed so they become ~0 after softmax.
Before softmax, add a mask matrix M where M[i][j]=0 if j<=i else -1e9 (e.g., with np.triu(..., k=1)).
For numerical stability, compute softmax row-wise as exp(x - max(x)) / sum(exp(x - max(x))) after masking.
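Putting the hints together, here is a minimal sketch of a compliant solution, assuming a function name like `causal_masked_attention` (the exact required signature isn't shown above):

```python
import numpy as np

def causal_masked_attention(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to a square score matrix and return row-wise softmax weights.

    Hypothetical name/signature; the task above only specifies a single
    `scores` argument and an np.ndarray return value.
    """
    n = scores.shape[-1]
    # Additive mask: 0 where j <= i (self and past), -1e9 where j > i (future).
    mask = np.triu(np.full((n, n), -1e9), k=1)
    masked = scores + mask
    # Numerically stable row-wise softmax: subtract each row's max before exp.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Example check with all-zero scores (assumed toy input):
weights = causal_masked_attention(np.zeros((3, 3)))
print(np.round(weights, 3))  # rows: [1, 0, 0], [0.5, 0.5, 0], [1/3, 1/3, 1/3]
```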