

192. Negative sampling

medium
General
manager

Implement negative sampling for embedding-based retrieval, where you pick “hard negatives” that are most similar to a query but are not its positive match. You’ll compute cosine similarities and return the top-k negative candidate IDs for each query.

The cosine similarity is:

\cos(q, d) = \frac{q \cdot d}{\|q\|_2 \, \|d\|_2}

Requirements

Implement the function below.

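A plausible signature, consistent with the input/output tables further down (the function name `sample_hard_negatives` is an assumption, not the official stub name):

```python
import numpy as np

def sample_hard_negatives(
    k: int,                 # number of hard negatives per query
    docs: np.ndarray,       # (n_docs, dim) document embeddings
    queries: np.ndarray,    # (n_queries, dim) query embeddings
    positives: np.ndarray,  # (n_queries,) positive doc index per query
) -> np.ndarray:            # (n_queries, k) hard-negative doc indices
    ...
```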
Rules:

  • Use cosine similarity to score each (query, doc) pair.
  • For each query i, exclude positives[i] from being selected as a negative.
  • Return the k doc indices with highest similarity (hard negatives) for each query.
  • Do not use any prebuilt nearest-neighbor or retrieval libraries (e.g., FAISS, sklearn neighbors).
  • Key considerations: normalize embeddings for cosine similarity; use vectorized NumPy ops where possible; rank by descending similarity (a full sketch follows this list).
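
A minimal vectorized sketch that follows these rules (same assumed function name as above):

```python
import numpy as np

def sample_hard_negatives(k, docs, queries, positives):
    # L2-normalize rows so cosine similarity reduces to a dot product.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    # All pairwise cosine similarities at once: shape (n_queries, n_docs).
    sim = q @ d.T

    # A query's positive can never be selected as a negative.
    sim[np.arange(len(queries)), positives] = -np.inf

    # argpartition finds the top k (unordered) without a full sort;
    # then sort just those k by descending score.
    top_k = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    rows = np.arange(len(queries))[:, None]
    order = np.argsort(-sim[rows, top_k], axis=1)
    return top_k[rows, order]
```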

Example

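The values below are illustrative (the embeddings and expected output are made up for this sketch, using the assumed function name from above):

```python
import numpy as np

docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
queries = np.array([[1.0, 0.1]])
positives = np.array([0])  # doc 0 is query 0's true match

print(sample_hard_negatives(1, docs, queries, positives))
# [[2]] -- after excluding doc 0, doc 2 scores highest (cos ~= 0.77 vs ~= 0.10 for doc 1)
```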
Input Signature

  Argument    Type
  k           int
  docs        np.ndarray
  queries     np.ndarray
  positives   np.ndarray

Output Signature

  Return Name  Type
  value        np.ndarray

Constraints

  • Use NumPy vectorized ops; no ANN libraries.
  • Return a NumPy array of doc indices.
  • Exclude positives; rank by descending cosine similarity.

Hint 1

Start by L2-normalizing each query/doc vector so cosine becomes a dot product.
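In NumPy, assuming 2-D arrays of row vectors:

```python
import numpy as np

q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
d_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
```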

Hint 2

Compute all similarities at once with a matrix multiply: sim = q_norm @ d_norm.T (shape n_queries x n_docs).
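Continuing from the Hint 1 snippet:

```python
# (n_queries, dim) @ (dim, n_docs) -> (n_queries, n_docs)
sim = q_norm @ d_norm.T
```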

Hint 3

Exclude the positive per query by setting sim[i, positives[i]] = -inf, then take top-k indices via argpartition and sort those k by score.
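Putting the last step together, continuing from the previous snippets:

```python
sim[np.arange(sim.shape[0]), positives] = -np.inf     # mask each query's positive

top_k = np.argpartition(-sim, k - 1, axis=1)[:, :k]   # top k, in no particular order
rows = np.arange(sim.shape[0])[:, None]
order = np.argsort(-sim[rows, top_k], axis=1)         # rank those k by descending score
result = top_k[rows, order]
```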

Roles
ML Engineer
AI Engineer
Companies
General
Levels
manager
staff
senior
Tags
cosine-similarity
hard-negative-mining
top-k-selection
numpy-vectorization