46. Cosine similarity for text

easy
General
senior

Compute cosine similarity between two text strings using a simple bag-of-words representation, a common baseline in NLP. You’ll tokenize text, build aligned term-frequency vectors, then return their cosine similarity score.

Cosine similarity is defined as:

$$\cos(\theta) = \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\;\|\mathbf{y}\|}$$
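
As a small worked illustration (the two texts here are made up for this sketch, not taken from the problem's own example), take "a b" and "a c" with the shared vocabulary ordered as (a, b, c). The term-frequency vectors are then x = (1, 1, 0) and y = (1, 0, 1), and the formula gives:

$$\mathbf{x}\cdot\mathbf{y} = 1\cdot 1 + 1\cdot 0 + 0\cdot 1 = 1, \qquad \|\mathbf{x}\| = \|\mathbf{y}\| = \sqrt{2}, \qquad \cos(\theta) = \frac{1}{\sqrt{2}\,\sqrt{2}} = \frac{1}{2}$$

The one shared word contributes to the dot product, while the unshared words only increase the norms, which pulls the score below 1.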

Requirements

Implement a function that takes two strings, text1 and text2, and returns their cosine similarity as a float.

Rules:

  • Tokenize by lowercasing and splitting on whitespace (no punctuation handling needed).
  • Build a shared vocabulary across both texts, then create aligned term-frequency vectors.
  • Compute cosine similarity using only NumPy operations for dot product and norms.
  • If a word appears multiple times, it should increase its term frequency accordingly.
  • Do not use any prebuilt NLP/vectorizer utilities (e.g., sklearn’s CountVectorizer/TfidfVectorizer).

Example

python

Output:

python

Input Signature

Argument    Type
text1       str
text2       str

Output Signature

Return Name    Type
value          float

Constraints

  • Tokenize with lower() and split() only.
  • Use NumPy dot and linalg.norm only.
  • No sklearn/vectorizer utilities allowed.

Hint 1

Start by normalizing both texts: lower() then split() on whitespace to get token lists.
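
A minimal sketch of this tokenization step follows. The helper name `tokenize` and the sample sentence are assumptions made for illustration; the problem does not name them.

```python
def tokenize(text):
    # Lowercase first, then split on any run of whitespace.
    # Punctuation is left attached to words, as the rules allow.
    return text.lower().split()


# Hypothetical sample input, not from the problem statement.
tokens = tokenize("The cat  sat on THE mat.")
print(tokens)
```

Calling split() with no argument collapses repeated spaces and returns an empty list for an empty string, so no extra cleanup should be needed.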

Hint 2

Create a shared vocabulary across both texts (e.g., {word: index}), then build two same-length term-frequency vectors by counting occurrences.
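
One way this could look is sketched below. The helper name `build_vectors` and the first-seen index ordering are assumptions; any consistent ordering works, as long as both vectors use the same one.

```python
import numpy as np


def build_vectors(tokens1, tokens2):
    # Assign each distinct word a column index in first-seen order.
    vocab = {}
    for word in tokens1 + tokens2:
        if word not in vocab:
            vocab[word] = len(vocab)

    # Two vectors of the same length, one slot per vocabulary word.
    v1 = np.zeros(len(vocab))
    v2 = np.zeros(len(vocab))

    # Repeated words add to the same slot, giving term frequencies.
    for word in tokens1:
        v1[vocab[word]] += 1
    for word in tokens2:
        v2[vocab[word]] += 1
    return vocab, v1, v2


# Hypothetical token lists for illustration.
vocab, v1, v2 = build_vectors(["a", "b", "a"], ["b", "c"])
print(vocab, v1, v2)
```

Because both vectors are indexed through the same dictionary, position i refers to the same word in each, which is what makes the later dot product meaningful.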

Hint 3

Use NumPy for dot and L2 norms: sim = np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2)); return 0.0 if either norm is zero to avoid division by zero.
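
Putting the three hints together, a possible end-to-end sketch is shown below. The function name `cosine_similarity_text` is hypothetical, since the required name is not given here, and sorting the vocabulary is just one arbitrary but consistent ordering.

```python
import numpy as np


def cosine_similarity_text(text1, text2):
    # Tokenize with lower() and split() only.
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()

    # Shared vocabulary across both texts, mapped to column indices.
    words = sorted(set(tokens1) | set(tokens2))
    vocab = {word: i for i, word in enumerate(words)}

    # Aligned term-frequency vectors.
    v1 = np.zeros(len(vocab))
    v2 = np.zeros(len(vocab))
    for word in tokens1:
        v1[vocab[word]] += 1
    for word in tokens2:
        v2[vocab[word]] += 1

    # Guard against division by zero when either text has no tokens.
    norm1 = np.linalg.norm(v1)
    norm2 = np.linalg.norm(v2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return float(np.dot(v1, v2) / (norm1 * norm2))


# Hypothetical inputs: one shared word out of two in each text.
print(cosine_similarity_text("a b", "a c"))  # approximately 0.5
```

Results are subject to floating-point rounding, so comparisons against expected values are safer with a tolerance than with exact equality.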

Roles
ML Engineer
AI Engineer
Data Scientist
Quantitative Analyst
Companies
General
Levels
senior
entry
Tags
cosine-similarity
bag-of-words
numpy
text-tokenization