Implement Vision Transformer (ViT) patch embedding, which turns an image into a sequence of patch vectors for transformer input. You’ll split the image into non-overlapping patches, flatten each patch, then apply a learned linear projection to get embeddings.
The patch embedding can be written as:

\[
E = XW + b
\]

where \(X \in \mathbb{R}^{N \times (P^2 C)}\) is the matrix of flattened patches, \(W \in \mathbb{R}^{(P^2 C) \times D}\) is the projection matrix, \(b \in \mathbb{R}^{D}\) is the bias (broadcast across rows), \(N\) is the number of patches, \(P\) is the patch size, \(C\) is the number of channels, and \(D\) is the embedding dimension, so \(E \in \mathbb{R}^{N \times D}\).
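As a quick shape check (hypothetical numbers, not taken from the problem's test cases): a 32×32 RGB image with patch_size = 8 gives N = (32/8)·(32/8) = 16 patches, each flattened to P·P·C = 8·8·3 = 192 values, so with D = 64 the shapes are X: (16, 192), W: (192, 64), b: (64,), and E: (16, 64).

```python
import numpy as np

# Hypothetical shapes for illustration: a 32x32 RGB image, 8x8 patches, D = 64.
N, P, C, D = 16, 8, 3, 64
X = np.zeros((N, P * P * C))  # flattened patches: (16, 192)
W = np.zeros((P * P * C, D))  # learned projection: (192, 64)
b = np.zeros(D)               # bias, broadcast across all N rows
E = X @ W + b
assert E.shape == (N, D)      # (16, 64)
```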
Implement the function described below.

Rules:

- Split the image into non-overlapping patches of size patch_size x patch_size.
- Flatten each patch and stack the results into X with shape (N, P*P*C) in top-to-bottom, left-to-right patch order.
- Compute E = X @ W + b (use NumPy for matrix math).
- Return E as a NumPy array.

Input:
| Argument | Type |
|---|---|
| W | np.ndarray |
| b | np.ndarray |
| image | np.ndarray |
| patch_size | int |
Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
Hints:

- Use NumPy for matrix multiplication.
- Return a NumPy array.
- Flatten each patch in H, W, then C order (row-major, channels last).
- Get H, W, and C from image.shape.
- Loop over patches in top-to-bottom, left-to-right order: for i in range(H // P) and for j in range(W // P), slice image[i*P:(i+1)*P, j*P:(j+1)*P, :].
- Flatten each patch with patch.reshape(-1), stack the flattened patches into X with shape (N, P*P*C), then compute E = X @ W + b, as in the sketch below.
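Putting the hints together, here is a minimal sketch of one possible solution (the function name patch_embedding is an assumption; use whatever name the grader expects):

```python
import numpy as np

def patch_embedding(W: np.ndarray, b: np.ndarray, image: np.ndarray, patch_size: int) -> np.ndarray:
    # Hypothetical function name; arguments follow the table above.
    H, width, C = image.shape      # 'width' avoids shadowing the projection matrix W
    P = patch_size
    patches = []
    # Top-to-bottom, left-to-right patch order.
    for i in range(H // P):
        for j in range(width // P):
            patch = image[i * P:(i + 1) * P, j * P:(j + 1) * P, :]
            patches.append(patch.reshape(-1))  # row-major flatten: H, W, then C
    X = np.stack(patches)          # shape (N, P*P*C)
    return X @ W + b               # shape (N, D); b broadcasts across rows
```

For larger images, the Python loop can be replaced by a fully vectorized reshape: image.reshape(H // P, P, width // P, P, C).transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C) yields the same X under the same patch ordering and flatten order.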