42. Positional embeddings for images

medium

Positional embeddings help a vision model keep track of where each patch/token came from in an image, so spatial layout isn’t lost when you flatten patches into a sequence. In this task, you’ll generate 2D sinusoidal positional embeddings for an image grid, similar to what’s used in Vision Transformers.

The 2D sinusoidal embedding is formed by concatenating 1D embeddings for row and column positions:

$$PE(x, y) = [\,PE_{row}(y)\ \Vert\ PE_{col}(x)\,]$$

where each 1D embedding has dimension d = d_model/2 and uses

$$PE(pos, 2i)=\sin\left(\frac{pos}{10000^{2i/d}}\right),\quad PE(pos, 2i+1)=\cos\left(\frac{pos}{10000^{2i/d}}\right).$$
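As a quick check of the formula, consider a 1D embedding of dimension d = 4 at position pos = 1 (both values are illustrative choices, not part of the original problem). The frequency indices i = 0 and i = 1 give denominators 10000^0 = 1 and 10000^{2/4} = 100, so:

$$PE(1) = [\,\sin(1),\ \cos(1),\ \sin(0.01),\ \cos(0.01)\,] \approx [\,0.84147,\ 0.54030,\ 0.01000,\ 0.99995\,]$$

Low indices oscillate quickly with position, while higher indices change slowly.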

Requirements

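Implement a Python function that takes width, height, and d_model (see Input Signature) and returns the positional embeddings for every grid position.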

Rules:

  • Use 2D embeddings by concatenating row and column sinusoidal embeddings, each of size d_model/2.
  • Return a NumPy array of shape (height * width, d_model).
  • Use row-major order: index idx = y * width + x corresponds to position (y, x).
  • Do not call any prebuilt positional embedding utilities (e.g., from transformers).
  • Use NumPy for vectorized computation.
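The row-major rule can be checked with a few lines of NumPy. This is only a sketch of the index convention; the grid size (height 2, width 3) is an arbitrary choice for illustration.

```python
import numpy as np

height, width = 2, 3  # illustrative grid size, not from the problem statement

# Flat indices 0 .. height*width - 1 in row-major order.
idx = np.arange(height * width)

# Recover (y, x) from idx = y * width + x.
y, x = np.divmod(idx, width)

# The same coordinates built directly: each row index is repeated
# `width` times, and the column indices cycle once per row.
ys = np.repeat(np.arange(height), width)
xs = np.tile(np.arange(width), height)

print(y.tolist())  # [0, 0, 0, 1, 1, 1]
print(x.tolist())  # [0, 1, 2, 0, 1, 2]
```

The x coordinate varies fastest, so consecutive output rows walk across one image row before moving down to the next.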

Input Signature

Argument   Type
width      int
height     int
d_model    int

Output Signature

Return Name   Type
value         np.ndarray

Constraints

  • Return a NumPy array.

  • The output shape is (height * width, d_model).

  • d_model must be even; each half (rows, columns) interleaves sin and cos values.

Hint 1

Split d_model into two halves, half = d_model // 2, one for the row embedding and one for the column embedding, and check that d_model is even.

Hint 2

Implement a reusable 1D sinusoidal embedding builder: precompute denom = 10000 ** (2*i/half) for i=0..half/2-1, then fill even indices with sin(pos/denom) and odd with cos(pos/denom).
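One possible shape for such a builder is sketched below. The name `sinusoidal_1d` and its parameters are hypothetical, and the sketch assumes `dim` is even so that every sin value has a matching cos value.

```python
import numpy as np

def sinusoidal_1d(num_positions: int, dim: int) -> np.ndarray:
    """Hypothetical helper: (num_positions, dim) table of 1D sinusoidal embeddings."""
    # Assumption: dim is even, so sin/cos values come in pairs.
    if dim % 2 != 0:
        raise ValueError("dim must be even")

    pos = np.arange(num_positions, dtype=np.float64)[:, None]  # (P, 1)
    i = np.arange(dim // 2, dtype=np.float64)[None, :]         # (1, dim/2)
    denom = 10000.0 ** (2.0 * i / dim)                         # (1, dim/2)
    angles = pos / denom                                       # (P, dim/2) via broadcasting

    emb = np.empty((num_positions, dim), dtype=np.float64)
    emb[:, 0::2] = np.sin(angles)  # even indices: sin
    emb[:, 1::2] = np.cos(angles)  # odd indices: cos
    return emb

table = sinusoidal_1d(3, 4)
print(table.shape)  # (3, 4)
```

Position 0 always maps to alternating 0s and 1s, since sin(0) = 0 and cos(0) = 1, which makes a convenient sanity check.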

Hint 3

Precompute row_emb for all y and col_emb for all x. Use np.repeat or np.tile to expand both to the full height x width grid, then concatenate them along the last axis and reshape to (height * width, d_model).

Roles: ML Engineer, AI Engineer
Companies: General
Levels: manager, staff, senior
Tags: sinusoidal-embeddings, vision-transformers, broadcasting, positional-encoding