Implement negative sampling for embedding-based retrieval, where you pick “hard negatives” that are most similar to a query but are not its positive match. You’ll compute cosine similarities and return the top-k negative candidate IDs for each query.
The cosine similarity between a query q and a document d is:

cos(q, d) = (q · d) / (||q|| ||d||)
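As a quick sanity check of this formula in NumPy (the vectors here are made up for illustration):

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0])
d = np.array([2.0, 4.0, 6.0])  # same direction as q, twice the magnitude

# Cosine similarity: dot product divided by the product of the norms.
cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
# Parallel vectors score 1.0 regardless of magnitude.
```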
Implement the function with the inputs and output specified below.
Rules:
- Compute the cosine similarity for each (query, doc) pair.
- For each query i, exclude positives[i] from being selected as a negative.
- Return the top k doc indices with highest similarity (hard negatives) for each query.

Input:
| Argument | Type |
|---|---|
| k | int |
| docs | np.ndarray |
| queries | np.ndarray |
| positives | np.ndarray |
Output:

| Return Name | Type |
|---|---|
| value | np.ndarray |
Use NumPy vectorized ops; no ANN libraries.
Return NumPy array of doc indices.
Exclude positives; rank by descending cosine similarity.
Hints:
1. Start by L2-normalizing each query/doc vector so cosine becomes a dot product.
2. Compute all similarities at once with a matrix multiply: `sim = q_norm @ d_norm.T` (shape `n_queries x n_docs`).
3. Exclude the positive per query by setting `sim[i, positives[i]] = -inf`, then take the top-k indices via `argpartition` and sort those k by score.
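The hinted approach can be sketched as follows. The function name `hard_negatives` and the argument order are assumptions based on the tables above, not a confirmed signature:

```python
import numpy as np

def hard_negatives(queries: np.ndarray, docs: np.ndarray,
                   positives: np.ndarray, k: int) -> np.ndarray:
    """Return the k most similar non-positive doc indices per query."""
    # L2-normalize rows so cosine similarity reduces to a dot product.
    q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    # All pairwise similarities at once: shape (n_queries, n_docs).
    sim = q_norm @ d_norm.T

    # Mask out each query's positive doc so it can never be selected.
    sim[np.arange(len(queries)), positives] = -np.inf

    # argpartition finds the k largest per row (unordered), then we
    # sort just those k columns by descending similarity.
    top_k = np.argpartition(sim, -k, axis=1)[:, -k:]
    rows = np.arange(len(queries))[:, None]
    order = np.argsort(-sim[rows, top_k], axis=1)
    return top_k[rows, order]
```

`argpartition` keeps the per-row cost at O(n_docs + k log k) instead of fully sorting every row, which matters when the doc collection is large.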