

44. Bag of words vectorization

easy
General
senior

Turn raw text into numbers using Bag-of-Words (BoW) vectorization, a simple NLP preprocessing step that counts how often each word appears. Given a fixed vocabulary, represent each sentence as a count vector where each entry corresponds to a vocabulary word.

Requirements

Implement the function described by the input and output signatures below.
Rules:

  • Tokenize each text by lowercasing and splitting on whitespace.
  • For each text, count occurrences of each vocabulary token and output counts in the exact vocabulary order.
  • Words not in vocabulary are ignored.
  • Return the result as a NumPy array.
  • Do not use any prebuilt vectorizers (e.g., scikit-learn).
  • Use only NumPy and built-in Python libraries.
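A minimal sketch that follows these rules (the function name `bag_of_words` and the type hints are assumptions; the exact required signature is not shown here):

```python
import numpy as np

def bag_of_words(texts: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    # Map each vocabulary token to its column index for O(1) lookups.
    vocab_index = {token: i for i, token in enumerate(vocabulary)}
    # One row per text, one column per vocabulary token, all zeros to start.
    counts = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for row, text in enumerate(texts):
        # Tokenize: lowercase, then split on whitespace.
        for token in str(text).lower().split():
            col = vocab_index.get(token)
            if col is not None:  # out-of-vocabulary tokens are ignored
                counts[row, col] += 1
    return counts
```

The dictionary built up front keeps the inner loop O(1) per token, so the whole pass is linear in the total number of tokens rather than scanning the vocabulary array for each one.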

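As an illustration (the sentences and vocabulary below are sample data, not from the original example), counting one text's tokens and reading the counts out in vocabulary order looks like this:

```python
import numpy as np
from collections import Counter

texts = np.array(["The cat sat on the mat", "the dog"])
vocabulary = np.array(["the", "cat", "dog", "mat"])

# Count the first text's tokens, then read counts out in vocabulary order.
tokens = texts[0].lower().split()  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
row = [Counter(tokens)[tok] for tok in vocabulary]
print(row)  # [2, 1, 0, 1] -- 'sat' and 'on' are not in the vocabulary, so they are ignored
```

Doing this for every text and stacking the rows gives the final count matrix, with one row per text.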
Input Signature

  • texts: np.ndarray
  • vocabulary: np.ndarray

Output Signature

  • value: np.ndarray

Constraints

  • No sklearn or prebuilt vectorizers
  • Whitespace split; lowercase tokens only
  • Return the count matrix as a NumPy array

Hint 1

Lowercase each text and split on whitespace to get tokens; ignore everything else.

Hint 2

Precompute a dictionary {vocab_token: column_index} so you can update counts in O(1) per token.
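For instance (with an illustrative vocabulary), the precomputed dictionary replaces a per-token linear scan of the array:

```python
import numpy as np

vocabulary = np.array(["the", "cat", "dog"])
# Built once: token -> column index, so each lookup is O(1)
# instead of an O(V) np.where scan for every token.
vocab_index = {token: i for i, token in enumerate(vocabulary)}
print(vocab_index["dog"])  # 2
```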

Hint 3

Initialize a zero NumPy matrix of shape (len(texts), len(vocabulary)), then for each token found in the vocab, increment result[row, vocab_index[token]].

Roles: ML Engineer, AI Engineer, Data Scientist, Quantitative Analyst
Companies: General
Levels: senior, entry
Tags: bag-of-words, tokenization, hashmap-counting, nlp-preprocessing