Compute cosine similarity between two text strings using a simple bag-of-words representation, a common baseline in NLP. You’ll tokenize text, build aligned term-frequency vectors, then return their cosine similarity score.
Cosine similarity is defined as:

$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

where $A$ and $B$ are the two term-frequency vectors.
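As a quick worked example (illustrative strings, not from the problem statement): for "the cat" and "the dog", the shared vocabulary is {the, cat, dog}, giving term-frequency vectors [1, 1, 0] and [1, 0, 1]. Their dot product is 1 and each norm is $\sqrt{2}$, so the cosine similarity is $1/2 = 0.5$.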
Implement the function (called `cosine_similarity` here; the argument names and return type come from the tables below): it accepts two strings, `text1` and `text2`, and returns their cosine similarity as a float.
Input:

| Argument | Type |
|---|---|
| text1 | str |
| text2 | str |

Output:

| Return Name | Type |
|---|---|
| value | float |

Rules:

- Tokenize with `lower()` and `split()` only.
- Use NumPy `dot` and `linalg.norm` only.
- No sklearn/vectorizer utilities allowed.
1. Start by normalizing both texts: `lower()` then `split()` on whitespace to get token lists.
2. Create a shared vocabulary across both texts (e.g., `{word: index}`), then build two same-length term-frequency vectors by counting occurrences.
3. Use NumPy for the dot product and L2 norms: `sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))`; return 0.0 if either norm is zero to avoid division by zero (see the sketch below).
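Putting these steps together, here is a minimal sketch that stays within the rules above. The function name `cosine_similarity` is an assumption; only the argument names and return type are given by the tables.

```python
import numpy as np

def cosine_similarity(text1: str, text2: str) -> float:
    # Tokenize: lowercase and split on whitespace, per the rules.
    tokens1 = text1.lower().split()
    tokens2 = text2.lower().split()

    # Shared vocabulary mapping each distinct word to a vector index.
    vocab = {word: i for i, word in enumerate(dict.fromkeys(tokens1 + tokens2))}

    # Count term frequencies into two aligned, same-length vectors.
    v1 = np.zeros(len(vocab))
    v2 = np.zeros(len(vocab))
    for word in tokens1:
        v1[vocab[word]] += 1
    for word in tokens2:
        v2[vocab[word]] += 1

    # Guard against zero-length vectors (e.g., an empty input string).
    norm1 = np.linalg.norm(v1)
    norm2 = np.linalg.norm(v2)
    if norm1 == 0 or norm2 == 0:
        return 0.0

    return float(np.dot(v1, v2) / (norm1 * norm2))
```

For example, `cosine_similarity("the cat sat", "the dog sat")` returns 2/3 ≈ 0.667: the vectors are [1, 1, 1, 0] and [1, 0, 1, 1] over the vocabulary {the, cat, sat, dog}, with dot product 2 and norms $\sqrt{3}$ each.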