122. Attention masking

Difficulty: medium · Level: senior · Company: General

Implement causal attention masking for a self-attention score matrix in NLP, so that each token can only attend to itself and earlier tokens. You’ll take raw attention scores and apply a mask that forces “future” positions to have effectively zero probability after softmax.

The masked attention is:

A = \text{softmax}(S + M)

where \(S\) is the score matrix and \(M_{ij} = 0\) if \(j \le i\), otherwise \(M_{ij} = -10^9\).
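
For a three-token sequence, for example, the mask is:

M = \begin{pmatrix} 0 & -10^{9} & -10^{9} \\ 0 & 0 & -10^{9} \\ 0 & 0 & 0 \end{pmatrix}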

Requirements

Implement the function.

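A starter signature consistent with the Input and Output Signature sections below; the name masked_attention is assumed for illustration and is not fixed by the problem:

```python
import numpy as np


def masked_attention(scores: np.ndarray) -> np.ndarray:
    """Return causal-masked attention probabilities for an (n, n) score matrix."""
    ...
```
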
Rules (a sketch that follows these rules is shown after the list):

  • Apply a causal mask so positions where j > i are masked out (future tokens).
  • Use a numerically stable softmax per row (subtract the row max before exp).
  • Use -1e9 (or another very large negative number) for masked positions before softmax.
  • Return the result as a NumPy array.
  • Use only NumPy (and built-in Python libraries if needed).
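
A minimal sketch that satisfies these rules, reusing the assumed masked_attention name from the stub above:

```python
import numpy as np


def masked_attention(scores: np.ndarray) -> np.ndarray:
    """Causal attention weights from a square (n, n) raw score matrix."""
    n = scores.shape[0]

    # Additive causal mask: 0 where j <= i, -1e9 where j > i (future positions).
    mask = np.triu(np.full((n, n), -1e9), k=1)
    masked = scores + mask

    # Numerically stable softmax, row by row: subtract each row's max before exp.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)
```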

Example
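
An illustration with scores chosen purely for demonstration (not the original example values), using the assumed masked_attention name from above:

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])

print(np.round(masked_attention(scores), 5))
# Approximately:
# [[1.      0.      0.     ]
#  [0.26894 0.73106 0.     ]
#  [0.09003 0.24473 0.66524]]
```
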
Input Signature

  • scores: np.ndarray

Output Signature

  • value: np.ndarray

Constraints

  • Use NumPy only; no other libraries.

  • Return NumPy array.

  • Row-wise stable softmax after masking.

Hint 1

Create a lower-triangular allow-mask: positions with j > i should be suppressed so they become ~0 after softmax.
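
For instance, a boolean allow-mask (True where attention is permitted) can be built with np.tril:

```python
import numpy as np

n = 4
# True where j <= i (self and earlier tokens), False for future positions.
allow = np.tril(np.ones((n, n), dtype=bool))
# It could then be applied as, e.g., masked = np.where(allow, scores, -1e9).
```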

Hint 2

Before softmax, add a mask matrix M where M[i][j]=0 if j<=i else -1e9 (e.g., with np.triu(..., k=1)).
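
Concretely, one way to build and apply that mask (placeholder scores used here for illustration):

```python
import numpy as np

scores = np.zeros((4, 4))                     # placeholder score matrix for illustration
M = np.triu(np.full_like(scores, -1e9), k=1)  # 0 where j <= i, -1e9 where j > i
masked = scores + M
```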

Hint 3

For numerical stability, compute softmax row-wise as exp(x - max(x)) / sum(exp(x - max(x))) after masking.
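
For instance (with placeholder values standing in for the masked scores):

```python
import numpy as np

# Zero scores plus the causal mask, standing in for `masked` from the previous hint.
masked = np.triu(np.full((4, 4), -1e9), k=1)
shifted = masked - masked.max(axis=-1, keepdims=True)  # subtract each row's max
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
```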

Roles
ML Engineer
AI Engineer
Data Scientist
Quantitative Analyst
Companies
General
Levels
senior
entry
Tags
causal-mask
softmax
numpy
self-attention