Implement causal attention masking for a self-attention score matrix in NLP, so each token can only attend to itself and earlier tokens. You’ll take raw attention scores and apply a mask that forces “future” positions to have zero probability after softmax.
The masked attention is:

$$\text{MaskedAttention}(S) = \operatorname{softmax}(S + M)$$

where $S$ is the score matrix and $M_{ij} = 0$ if $j \le i$, otherwise $M_{ij} = -10^9$.
Implement a function that takes the score matrix `scores` and returns the masked attention weights.
Rules:
- Positions with j > i are masked out (future tokens).
- Add -1e9 (or another very large negative number) to masked positions before the softmax.
- Apply the softmax row-wise after masking, in a numerically stable way (subtract the row max before exp).

Input:

| Argument | Type |
|---|---|
| scores | np.ndarray |

Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
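To make the spec concrete, here is what the additive mask and the expected result look like for a toy 3×3 case (the all-zero scores input is an assumed example, not part of the original statement):

```python
import numpy as np

n = 3
# Additive causal mask: 0 on and below the diagonal (j <= i), -1e9 above it (j > i).
M = np.triu(np.full((n, n), -1e9), k=1)
# M looks like (0 where attention is allowed, -1e9 where it is blocked):
#   [[ 0, -1e9, -1e9],
#    [ 0,    0, -1e9],
#    [ 0,    0,    0]]

# For all-zero scores, softmax(0 + M) row-wise gives a uniform distribution
# over the non-masked (non-future) positions:
#   row 0 -> [1, 0, 0], row 1 -> [0.5, 0.5, 0], row 2 -> [1/3, 1/3, 1/3]
print(M)
```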
Use NumPy only; no other libraries.
Return a NumPy array.
Apply a row-wise, numerically stable softmax after masking.
Create a lower-triangular allow-mask: positions with j > i should be suppressed so they become ~0 after softmax.
Before softmax, add a mask matrix M where M[i][j]=0 if j<=i else -1e9 (e.g., with np.triu(..., k=1)).
For numerical stability, compute softmax row-wise as exp(x - max(x)) / sum(exp(x - max(x))) after masking.
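Putting the hints together, here is a minimal sketch of a compliant solution, assuming a function name like `causal_masked_attention` (the exact required signature isn't shown above):

```python
import numpy as np

def causal_masked_attention(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to a square score matrix and return row-wise softmax weights.

    Hypothetical name/signature; the task above only specifies a single
    `scores` argument and an np.ndarray return value.
    """
    n = scores.shape[-1]
    # Additive mask: 0 where j <= i (self and past), -1e9 where j > i (future).
    mask = np.triu(np.full((n, n), -1e9), k=1)
    masked = scores + mask
    # Numerically stable row-wise softmax: subtract each row's max before exp.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Example check with all-zero scores (assumed toy input):
weights = causal_masked_attention(np.zeros((3, 3)))
print(np.round(weights, 3))  # rows: [1, 0, 0], [0.5, 0.5, 0], [1/3, 1/3, 1/3]
```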