Turn raw text into numbers using Bag-of-Words (BoW) vectorization, a simple NLP preprocessing step that counts how often each word appears. Given a fixed vocabulary, represent each sentence as a count vector where each entry corresponds to a vocabulary word.
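As a small illustration of what a count vector looks like, the sketch below vectorizes a single sentence with plain Python. The vocabulary and sentence are made up for this example and are not part of the problem statement.

```python
# Hypothetical vocabulary and sentence, chosen only for illustration.
vocabulary = ["cat", "dog", "the"]
sentence = "The cat saw the dog"

# Lowercase, then split on whitespace.
tokens = sentence.lower().split()  # ['the', 'cat', 'saw', 'the', 'dog']

# One entry per vocabulary word, in vocabulary order.
# "saw" is not in the vocabulary, so it contributes nothing.
count_vector = [tokens.count(word) for word in vocabulary]
print(count_vector)  # [1, 1, 2]
```

Each position lines up with a vocabulary word: "cat" appears once, "dog" once, and "the" twice.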
Implement the function
Rules:
Tokens not in the vocabulary are ignored.
Output:
| Argument | Type |
|---|---|
| texts | np.ndarray |
| vocabulary | np.ndarray |
| Return Name | Type |
|---|---|
| value | np.ndarray |
No sklearn or prebuilt vectorizers
Whitespace split; lowercase tokens only
Return the count matrix as a NumPy array
Lowercase each text and split on whitespace to get tokens; ignore everything else.
Precompute a dictionary {vocab_token: column_index} so you can update counts in O(1) per token.
Initialize a zero NumPy matrix of shape (len(texts), len(vocabulary)), then for each token found in the vocab, increment result[row, vocab_index[token]].
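The three hints above can be combined into a minimal sketch like the following. The function name `bow_vectorize` is hypothetical, since the problem statement does not show the required name. The integer dtype is also an assumption, as is treating the vocabulary as already lowercase and free of duplicates.

```python
import numpy as np


def bow_vectorize(texts, vocabulary):
    """Return a (len(texts), len(vocabulary)) matrix of token counts.

    `bow_vectorize` is a hypothetical name; the original signature is not shown.
    """
    # Map each vocabulary token to its column for O(1) lookups.
    # Assumption: vocabulary entries are lowercase and unique.
    vocab_index = {str(token): col for col, token in enumerate(vocabulary)}

    # Assumption: integer counts; the problem does not state a dtype.
    result = np.zeros((len(texts), len(vocabulary)), dtype=int)

    for row, text in enumerate(texts):
        # Lowercase, then split on whitespace; no other normalization.
        for token in str(text).lower().split():
            col = vocab_index.get(token)
            if col is not None:  # tokens outside the vocabulary are skipped
                result[row, col] += 1

    return result


texts = np.array(["the cat sat", "The dog chased the cat"])
vocabulary = np.array(["the", "cat", "dog"])
print(bow_vectorize(texts, vocabulary))
# [[1 1 0]
#  [2 1 1]]
```

Because splitting is purely on whitespace, a token such as `cat.` would not match `cat`; that follows from the stated rules rather than being an oversight in the sketch.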