Detect embedding collapse in an embeddings-and-retrieval pipeline, where many items unintentionally map to nearly the same vector and retrieval quality degrades. You'll compute a simple "collapse score" based on the average cosine similarity across embedding pairs.
The cosine similarity between two vectors $a$ and $b$ is:

$$\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}$$
Implement a function that computes the collapse score and a collapse flag.
Rules:
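- Compute cosine similarity over all unique pairs (i, j) where i < j.
- Set is_collapsed based on threshold.
- Use NumPy only (no sklearn.metrics.pairwise).

Output: a tuple with the collapse score and the is_collapsed flag.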
| Argument | Type |
|---|---|
| threshold | float |
| embeddings | np.ndarray |
| Return Name | Type |
|---|---|
| value | tuple |
Use NumPy only; no sklearn similarities
L2-normalize each embedding row
Average i<j cosine pairs only
Input is np.ndarray
Normalize first. Ensure embeddings is a 2D NumPy array and L2-normalize each row so cosine similarity becomes a dot product.
Vectorize pairwise cosine. After normalization, compute S = X @ X.T to get all pair cosine similarities at once.
Average only unique pairs. Use np.triu_indices(n, k=1) to select the pairs with i < j, then take the mean; handle n < 2 (no pairs) explicitly.
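The three hints above can be combined into one short function. The following is a minimal sketch, not a reference solution. The name `collapse_score`, the argument order, the `(score, is_collapsed)` tuple order, the `score >= threshold` comparison, the `(0.0, False)` result for fewer than two rows, and the small epsilon guarding zero-norm rows are all assumptions that the problem statement does not specify.

```python
import numpy as np


def collapse_score(embeddings: np.ndarray, threshold: float) -> tuple:
    """Return (score, is_collapsed) for a 2D array of embeddings.

    Hypothetical name and return order; adjust to the required signature.
    """
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2:
        raise ValueError("embeddings must be a 2D array of shape (n, d)")

    n = X.shape[0]
    if n < 2:
        # Assumption: with no pairs to average, report a zero score.
        return 0.0, False

    # L2-normalize each row so that cosine similarity becomes a dot product.
    # The epsilon (an assumption) avoids division by zero for all-zero rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)

    # All pairwise cosine similarities at once.
    S = X @ X.T

    # Keep only the strict upper triangle, i.e. pairs with i < j.
    rows, cols = np.triu_indices(n, k=1)
    score = float(S[rows, cols].mean())

    # Assumption: the pipeline counts as collapsed when score >= threshold.
    return score, bool(score >= threshold)
```

For mutually orthogonal rows such as `np.eye(3)`, every off-diagonal similarity is zero, so the score is 0.0 and the flag is False for any positive threshold. Rows that are all copies of the same vector give a score of approximately 1.0.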