Residual connections are a simple but crucial part of Transformers that help gradients flow and keep representations stable as depth increases. In this task, you’ll implement the residual “Add & Norm” step used around Transformer sublayers.
The operations are:

$$z = x + \text{sublayer\_out}$$

$$\text{LayerNorm}(z) = \gamma \odot \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $z$ taken across the feature dimension.
Implement a function that performs this Add & Norm step.
Rules:
- Compute the residual z = x + sublayer_out elementwise.
- Apply layer normalization across the feature dimension (d_model) for each token independently.
- Use gamma and beta to scale and shift the normalized values.
- Do not use a prebuilt implementation (e.g., torch.nn.LayerNorm).
| Argument | Type |
|---|---|
| x | np.ndarray |
| eps | float |
| beta | np.ndarray |
| gamma | np.ndarray |
| sublayer_out | np.ndarray |
| Return Name | Type |
|---|---|
| value | np.ndarray |
- Normalize across d_model (axis=1), per token.
- No prebuilt LayerNorm; use NumPy ops only.
- Return a NumPy array.
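Putting the tables and constraints together, the interface might look like the sketch below; the function name add_and_norm and the parameter order are placeholders, since the problem statement does not fix them:

```python
import numpy as np

# Hypothetical signature assembled from the argument/return tables above;
# the name and parameter order are assumptions, not given by the problem.
def add_and_norm(x: np.ndarray,
                 sublayer_out: np.ndarray,
                 gamma: np.ndarray,
                 beta: np.ndarray,
                 eps: float) -> np.ndarray:
    """Return LayerNorm(x + sublayer_out) with affine parameters gamma and beta."""
    ...
```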
Start by computing the residual: z = x + sublayer_out (broadcasting works naturally).
LayerNorm here is per token: for each row z[i], compute mean and var across the feature dimension (axis=1). Use eps inside the square root.
After normalization norm = (z - mean) / sqrt(var + eps), apply the affine transform out = gamma * norm + beta and return out.
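Following those three hints, a minimal NumPy sketch might look like this (assuming x and sublayer_out are (seq_len, d_model) arrays and gamma and beta are (d_model,) vectors; the function name is the same placeholder as above):

```python
import numpy as np

def add_and_norm(x: np.ndarray,
                 sublayer_out: np.ndarray,
                 gamma: np.ndarray,
                 beta: np.ndarray,
                 eps: float = 1e-5) -> np.ndarray:
    # 1. Residual connection: elementwise sum (broadcasting handles matching shapes).
    z = x + sublayer_out

    # 2. Per-token statistics across the feature dimension (axis=1, i.e. d_model).
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)

    # 3. Normalize, then scale and shift with the affine parameters.
    norm = (z - mean) / np.sqrt(var + eps)
    return gamma * norm + beta
```

A quick sanity check with random data (shapes here are illustrative): with gamma = 1 and beta = 0, each row of the output should have mean roughly 0 and variance roughly 1.

```python
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, d_model = 8
sub = rng.normal(size=(4, 8))
out = add_and_norm(x, sub, gamma=np.ones(8), beta=np.zeros(8), eps=1e-5)
print(out.shape)            # (4, 8)
print(out.mean(axis=1))     # each entry close to 0
print(out.var(axis=1))      # each entry close to 1
```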