Positional embeddings help a vision model keep track of where each patch/token came from in an image, so spatial layout isn’t lost when you flatten patches into a sequence. In this task, you’ll generate 2D sinusoidal positional embeddings for an image grid, similar to what’s used in Vision Transformers.
The 2D sinusoidal embedding is formed by concatenating the 1D embeddings of the row and column positions:

$$\mathrm{emb}_{2D}(y, x) = \big[\,\mathrm{emb}_{1D}(y)\;;\;\mathrm{emb}_{1D}(x)\,\big]$$

where each 1D embedding of size $h = d_{model}/2$ uses

$$\mathrm{emb}_{1D}(pos)_{2i} = \sin\!\left(\frac{pos}{10000^{2i/h}}\right),\qquad \mathrm{emb}_{1D}(pos)_{2i+1} = \cos\!\left(\frac{pos}{10000^{2i/h}}\right)$$
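As a quick sanity check of the formula, here is a minimal sketch (the variable names are illustrative) computing a single sin/cos pair for one position:

```python
import numpy as np

# Sanity-check the sinusoidal formula for a single position.
half = 8  # size of one 1D embedding, i.e. d_model // 2
pos = 3   # a row or column index
i = 2     # pair index: fills dimensions 2*i and 2*i + 1

denom = 10000 ** (2 * i / half)  # = 100 here
even_val = np.sin(pos / denom)   # goes to dimension 2*i
odd_val = np.cos(pos / denom)    # goes to dimension 2*i + 1
print(even_val, odd_val)
```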
Implement a function that builds these embeddings for a `height × width` grid.
Rules:
- Each 1D embedding (row and column) has size `d_model / 2`.
- The output has shape `(height * width, d_model)`.
- The grid is flattened row-major: index `idx = y * width + x` corresponds to position `(y, x)`.

Input:
| Argument | Type |
|---|---|
| width | int |
| height | int |
| d_model | int |
Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
- Return a NumPy array of shape `(height * width, d_model)`.
- `d_model` must be even; each half is filled with alternating sin/cos values.
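The row-major flattening rule above can be sanity-checked in a few lines:

```python
import numpy as np

width, height = 4, 3
idx = np.arange(height * width)   # flattened token indices, row-major
y, x = idx // width, idx % width  # recover (row, col) from each idx
# Check that idx = y * width + x holds for every token
print(np.array_equal(idx, y * width + x))  # True
```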
Split d_model into two halves (half = d_model // 2): one half encodes the row position and the other the column position. Check first that d_model is even.
Implement a reusable 1D sinusoidal embedding builder: precompute denom = 10000 ** (2*i/half) for i=0..half/2-1, then fill even indices with sin(pos/denom) and odd with cos(pos/denom).
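The 1D builder described in this hint could be sketched as follows (the name `sincos_1d` is illustrative; this sketch assumes `half` itself is even so the sin/cos pairs fill it exactly):

```python
import numpy as np

def sincos_1d(positions: np.ndarray, half: int) -> np.ndarray:
    """1D sinusoidal embedding of size `half` for each position.

    Even indices 2*i get sin, odd indices 2*i + 1 get cos.
    Assumes `half` is even.
    """
    assert half % 2 == 0, "half must be even"
    i = np.arange(half // 2)                      # pair index i = 0 .. half//2 - 1
    denom = 10000 ** (2 * i / half)               # shape (half//2,)
    angles = positions[:, None] / denom[None, :]  # shape (len(positions), half//2)
    emb = np.empty((len(positions), half))
    emb[:, 0::2] = np.sin(angles)                 # even dimensions
    emb[:, 1::2] = np.cos(angles)                 # odd dimensions
    return emb

# Example: embeddings for 3 positions with half = 4
print(sincos_1d(np.arange(3), 4).shape)  # (3, 4)
```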
Precompute row_emb for all y and col_emb for all x. Use np.repeat and np.tile to expand them across the grid, then concatenate the two halves along the feature axis.
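Putting the hints together, one possible assembly looks like this. The function name `get_2d_sincos_embedding` is an assumption, and this sketch requires `d_model` divisible by 4 (stricter than "even") so that each half splits cleanly into sin/cos pairs:

```python
import numpy as np

def get_2d_sincos_embedding(width: int, height: int, d_model: int) -> np.ndarray:
    """2D sinusoidal embedding of shape (height * width, d_model).

    Row idx = y * width + x holds [row_emb(y) ; col_emb(x)],
    each part of size d_model // 2.
    """
    # Assumption: d_model % 4 == 0 so each half has whole sin/cos pairs.
    assert d_model % 4 == 0, "d_model must be divisible by 4 in this sketch"
    half = d_model // 2

    def sincos_1d(positions, dim):
        i = np.arange(dim // 2)
        denom = 10000 ** (2 * i / dim)
        angles = positions[:, None] / denom
        emb = np.empty((len(positions), dim))
        emb[:, 0::2] = np.sin(angles)
        emb[:, 1::2] = np.cos(angles)
        return emb

    row_emb = sincos_1d(np.arange(height), half)  # (height, half)
    col_emb = sincos_1d(np.arange(width), half)   # (width, half)

    # Expand over the grid in row-major order, then concatenate halves.
    rows = np.repeat(row_emb, width, axis=0)      # (height*width, half)
    cols = np.tile(col_emb, (height, 1))          # (height*width, half)
    return np.concatenate([rows, cols], axis=1)   # (height*width, d_model)

print(get_2d_sincos_embedding(4, 3, 8).shape)  # (12, 8)
```

`np.repeat` duplicates each row embedding `width` times consecutively, while `np.tile` cycles the column embeddings, which matches the row-major rule `idx = y * width + x`.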