Compute cosine similarity between two text strings using a simple bag-of-words representation, a common baseline in NLP. You’ll tokenize text, build aligned term-frequency vectors, then return their cosine similarity score.
Cosine similarity is defined as:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where $A$ and $B$ are the two term-frequency vectors.
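As a quick worked example (illustrative strings, not from the problem statement): for "the cat" and "the dog", the shared vocabulary is {the, cat, dog}, giving term-frequency vectors [1, 1, 0] and [1, 0, 1]. Their dot product is 1 and each norm is $\sqrt{2}$, so the cosine similarity is $1/2 = 0.5$.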
Implement the function (called `cosine_similarity` here; the argument names and return type come from the tables below): it accepts two strings, `text1` and `text2`, and returns their cosine similarity as a float.
Input:

| Argument | Type |
|---|---|
| text1 | str |
| text2 | str |

Output:

| Return Name | Type |
|---|---|
| value | float |

Rules:

- Tokenize with `lower()` and `split()` only.
- Use NumPy `dot` and `linalg.norm` only.
- No sklearn/vectorizer utilities allowed.
1. Start by normalizing both texts: `lower()` then `split()` on whitespace to get token lists.
2. Create a shared vocabulary across both texts (e.g., `{word: index}`), then build two same-length term-frequency vectors by counting occurrences.
3. Use NumPy for the dot product and L2 norms: `sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))`; return 0.0 if either norm is zero to avoid division by zero (see the sketch below).
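Putting these steps together, here is a minimal sketch that stays within the rules above. The function name `cosine_similarity` is an assumption; only the argument names and return type are given by the tables.

```python
import numpy as np

def cosine_similarity(text1: str, text2: str) -> float:
    # Tokenize: lowercase and split on whitespace, per the rules.
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()

    # Shared vocabulary mapping each distinct word to a vector index.
    vocab = {word: i for i, word in enumerate(dict.fromkeys(tokens1 + tokens2))}

    # Count term frequencies into two aligned, same-length vectors.
    v1 = np.zeros(len(vocab))
    v2 = np.zeros(len(vocab))
    for word in tokens1:
        v1[vocab[word]] += 1
    for word in tokens2:
        v2[vocab[word]] += 1

    # Guard against zero-length vectors (e.g., an empty input string).
    norm1 = np.linalg.norm(v1)
    norm2 = np.linalg.norm(v2)
    if norm1 == 0 or norm2 == 0:
        return 0.0

    return float(np.dot(v1, v2) / (norm1 * norm2))
```

For example, `cosine_similarity("the cat sat", "the dog sat")` returns 2/3 ≈ 0.667: the vectors are [1, 1, 1, 0] and [1, 0, 1, 1] over the vocabulary {the, cat, sat, dog}, with dot product 2 and norms $\sqrt{3}$ each.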