Implement negative sampling for embedding-based retrieval, where you pick “hard negatives” that are most similar to a query but are not its positive match. You’ll compute cosine similarities and return the top-k negative candidate IDs for each query.
The cosine similarity between a query q and a document d is:

cos(q, d) = (q · d) / (||q|| ||d||)
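As a quick sanity check of this formula in NumPy (the vectors here are made up for illustration):

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0])
d = np.array([2.0, 4.0, 6.0])  # same direction as q, twice the magnitude

# Cosine similarity: dot product divided by the product of the norms.
cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
# Parallel vectors score 1.0 regardless of magnitude.
```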
Implement the function with the inputs and output specified below.
Rules:
- Compute the cosine similarity for each (query, doc) pair.
- For each query i, exclude positives[i] from being selected as a negative.
- Return the top k doc indices with highest similarity (hard negatives) for each query.

Input:
| Argument | Type |
|---|---|
| k | int |
| docs | np.ndarray |
| queries | np.ndarray |
| positives | np.ndarray |
Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
Use NumPy vectorized ops; no ANN libraries.
Return NumPy array of doc indices.
Exclude positives; rank by descending cosine similarity.
Hints:
1. Start by L2-normalizing each query/doc vector so cosine becomes a dot product.
2. Compute all similarities at once with a matrix multiply: `sim = q_norm @ d_norm.T` (shape `n_queries x n_docs`).
3. Exclude the positive per query by setting `sim[i, positives[i]] = -inf`, then take the top-k indices via `argpartition` and sort those k by score.
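The hinted approach can be sketched as follows. The function name `hard_negatives` and the argument order are assumptions based on the tables above, not a confirmed signature:

```python
import numpy as np

def hard_negatives(queries: np.ndarray, docs: np.ndarray,
                   positives: np.ndarray, k: int) -> np.ndarray:
    """Return the k most similar non-positive doc indices per query."""
    # L2-normalize rows so cosine similarity reduces to a dot product.
    q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    # All pairwise similarities at once: shape (n_queries, n_docs).
    sim = q_norm @ d_norm.T

    # Mask out each query's positive doc so it can never be selected.
    sim[np.arange(len(queries)), positives] = -np.inf

    # argpartition finds the k largest per row (unordered), then we
    # sort just those k columns by descending similarity.
    top_k = np.argpartition(sim, -k, axis=1)[:, -k:]
    rows = np.arange(len(queries))[:, None]
    order = np.argsort(-sim[rows, top_k], axis=1)
    return top_k[rows, order]
```

`argpartition` keeps the per-row cost at O(n_docs + k log k) instead of fully sorting every row, which matters when the doc collection is large.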