Embedding collapse detection

225. Embedding collapse detection

medium

General

senior

Detect embedding collapse in an embeddings-and-retrieval pipeline, where many items unintentionally map to nearly the same vector and retrieval quality degrades. You’ll compute a simple “collapse score” based on average cosine similarity across embedding pairs.

The cosine similarity between two vectors $u$ and $v$ is:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$$
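As a quick check of the formula, the sketch below evaluates it for two small vectors with plain NumPy. The helper name `cosine` and the sample vectors are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def cosine(u, v):
    # Hypothetical helper: dot product divided by the product of the L2 norms.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = [1.0, 0.0]
v = [1.0, 1.0]
sim = cosine(u, v)  # 1 / (1 * sqrt(2)), roughly 0.7071
```

Scaling either vector leaves the result unchanged, which is why normalizing the embeddings first loses nothing.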

Requirements

Implement the function

python

Rules:

Convert the input to a NumPy array if necessary and L2-normalize each embedding vector.
Compute the average cosine similarity over all unique pairs (i, j) where i < j.
Return the average similarity and a boolean is_collapsed based on threshold.
Don’t use any prebuilt similarity utilities (e.g., sklearn.metrics.pairwise).
Keep it in a single Python function using only NumPy (and Python built-ins if needed).
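One way these rules could fit together is sketched below. The original function signature is not shown here, so the name `detect_embedding_collapse` and the argument order are hypothetical. The handling of fewer than two rows, the handling of zero vectors, and the use of `>=` for the threshold comparison are also assumptions that the rules do not settle.

```python
import numpy as np

def detect_embedding_collapse(embeddings, threshold):
    # Hypothetical name and argument order; the original signature is not shown.
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2:
        raise ValueError("embeddings must be a 2D array of shape (n, d)")
    n = X.shape[0]
    if n < 2:
        # Assumption: with no pairs to compare, report 0.0 and "not collapsed".
        return 0.0, False
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: zero vectors stay zero instead of triggering a division by zero.
    X = X / np.where(norms == 0.0, 1.0, norms)
    S = X @ X.T                      # all pairwise cosine similarities
    i, j = np.triu_indices(n, k=1)   # unique pairs with i < j
    avg = float(S[i, j].mean())
    # Assumption: "collapsed" means the average is at or above the threshold.
    return avg, bool(avg >= threshold)

# Three nearly parallel vectors (illustrative data only).
avg, is_collapsed = detect_embedding_collapse(
    [[1.0, 0.0], [0.99, 0.1], [1.0, 0.05]], threshold=0.9
)
```

If the intended comparison is strict, swapping `>=` for `>` is the only change needed.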

Example

python

Output:

python

Input Signature

Argument	Type
threshold	float
embeddings	np.ndarray

Output Signature

Return Name	Type
value	tuple

Constraints

Use NumPy only; no sklearn similarities
L2-normalize each embedding row
Average i<j cosine pairs only
Input is np.ndarray

Hint 1

Normalize first. Ensure embeddings is a 2D NumPy array and L2-normalize each row so cosine similarity becomes a dot product.
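A minimal sketch of the normalization step follows. The helper name `l2_normalize_rows` and the `eps` guard against zero-norm rows are assumptions added for illustration; the hint does not say how zero vectors should be treated.

```python
import numpy as np

def l2_normalize_rows(X, eps=1e-12):
    # Hypothetical helper; eps is an assumed guard against zero-norm rows.
    X = np.atleast_2d(np.asarray(X, dtype=float))
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

X = l2_normalize_rows([[3.0, 4.0], [0.0, 2.0]])
# Each row now has unit length: [[0.6, 0.8], [0.0, 1.0]]
```

Passing `keepdims=True` keeps the norms as a column of shape (n, 1), so the division broadcasts row by row.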

Hint 2

Vectorize pairwise cosine. After normalization, compute S = X @ X.T to get all pairwise cosine similarities at once.
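To illustrate what the matrix product yields, the snippet below builds S for three rows that are already unit length. The values are made up for the example.

```python
import numpy as np

# Rows are already unit length (illustrative values).
X = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, 1.0]])

S = X @ X.T  # S[i, j] is the cosine similarity between rows i and j

# S is symmetric with ones on the diagonal (each row compared with itself),
# which is why only the entries with i < j need to be averaged.
```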

Hint 3

Average only unique pairs. Use np.triu_indices(n, k=1) to select i < j, then take the mean; handle n < 2 (no pairs) explicitly.
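The averaging step might look like the sketch below. The helper name `mean_upper_triangle` is hypothetical, and returning `None` when there are no pairs is an assumption; the hint only asks that the case be handled explicitly. The closed form at the end is an equivalent cross-check for a symmetric matrix, not something the problem requires.

```python
import numpy as np

def mean_upper_triangle(S):
    # Hypothetical helper: mean of S[i, j] over i < j, or None when there are no pairs.
    n = S.shape[0]
    if n < 2:
        return None  # assumption: the caller decides what "no pairs" means
    i, j = np.triu_indices(n, k=1)
    return float(S[i, j].mean())

S = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.8],
              [0.0, 0.8, 1.0]])

avg = mean_upper_triangle(S)  # (0.6 + 0.0 + 0.8) / 3

# Equivalent closed form for a symmetric S: drop the diagonal, then divide by
# the n * (n - 1) off-diagonal entries (each pair is counted twice in the sum).
n = S.shape[0]
alt = (S.sum() - np.trace(S)) / (n * (n - 1))
```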

Roles

ML Engineer

AI Engineer

Companies

General

Levels

senior

entry
