

44. Bag of words vectorization

easy
General
senior

Turn raw text into numbers using Bag-of-Words (BoW) vectorization, a simple NLP preprocessing step that counts how often each word appears. Given a fixed vocabulary, represent each sentence as a count vector where each entry corresponds to a vocabulary word.

Requirements

Implement the function described by the input and output signatures below.
Rules:

  • Tokenize each text by lowercasing and splitting on whitespace.
  • For each text, count occurrences of each vocabulary token and output counts in the exact vocabulary order.
  • Words not in vocabulary are ignored.
  • Return the result as a NumPy array.
  • Do not use any prebuilt vectorizers (e.g., scikit-learn).
  • Use only NumPy and built-in Python libraries.
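A minimal sketch that follows these rules (the function name `bag_of_words` and the type hints are assumptions; the exact required signature is not shown here):

```python
import numpy as np

def bag_of_words(texts: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    # Map each vocabulary token to its column index for O(1) lookups.
    vocab_index = {token: i for i, token in enumerate(vocabulary)}
    # One row per text, one column per vocabulary token, all zeros to start.
    counts = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for row, text in enumerate(texts):
        # Tokenize: lowercase, then split on whitespace.
        for token in str(text).lower().split():
            col = vocab_index.get(token)
            if col is not None:  # out-of-vocabulary tokens are ignored
                counts[row, col] += 1
    return counts
```

The dictionary built up front keeps the inner loop O(1) per token, so the whole pass is linear in the total number of tokens rather than scanning the vocabulary array for each one.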

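As an illustration (the sentences and vocabulary below are sample data, not from the original example), counting one text's tokens and reading the counts out in vocabulary order looks like this:

```python
import numpy as np
from collections import Counter

texts = np.array(["The cat sat on the mat", "the dog"])
vocabulary = np.array(["the", "cat", "dog", "mat"])

# Count the first text's tokens, then read counts out in vocabulary order.
tokens = texts[0].lower().split()  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
row = [Counter(tokens)[tok] for tok in vocabulary]
print(row)  # [2, 1, 0, 1] -- 'sat' and 'on' are not in the vocabulary, so they are ignored
```

Doing this for every text and stacking the rows gives the final count matrix, with one row per text.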
Input Signature

  • texts: np.ndarray
  • vocabulary: np.ndarray

Output Signature

  • value: np.ndarray

Constraints

  • No sklearn or prebuilt vectorizers
  • Whitespace split; lowercase tokens only
  • Return the count matrix as a NumPy array

Hint 1

Lowercase each text and split on whitespace to get tokens; ignore everything else.

Hint 2

Precompute a dictionary {vocab_token: column_index} so you can update counts in O(1) per token.
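For instance (with an illustrative vocabulary), the precomputed dictionary replaces a per-token linear scan of the array:

```python
import numpy as np

vocabulary = np.array(["the", "cat", "dog"])
# Built once: token -> column index, so each lookup is O(1)
# instead of an O(V) np.where scan for every token.
vocab_index = {token: i for i, token in enumerate(vocabulary)}
print(vocab_index["dog"])  # 2
```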

Hint 3

Initialize a zero NumPy matrix of shape (len(texts), len(vocabulary)), then for each token found in the vocab, increment result[row, vocab_index[token]].

Roles: ML Engineer, AI Engineer, Data Scientist, Quantitative Analyst
Companies: General
Levels: senior, entry
Tags: bag-of-words, tokenization, hashmap-counting, nlp-preprocessing