Implement whitespace tokenization for a simple NLP preprocessing step, where you split text into tokens by runs of whitespace and discard empty tokens. Return the tokens in order so they can be fed into later steps like vocabulary building.
Implement the function
Rules:
Punctuation stays attached to its token (e.g., "hello," stays as "hello,").
Output:
| Argument | Type |
|---|---|
| text | str |
| Return Name | Type |
|---|---|
| value | np.ndarray |
Input is a Python str.
Split on runs of any whitespace characters.
Return an np.ndarray of strings with no empty tokens.
Python already has a built-in way to split on any whitespace (spaces, tabs, newlines).
If you call split() without an explicit delimiter, it treats runs of whitespace as one separator and drops empty strings automatically.
Use np.array(text.split()) to convert the list of tokens to a NumPy array.
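The hints above can be combined into a short sketch. The function name `whitespace_tokenize` is hypothetical, since the required signature is not shown here. Passing `dtype=str` is also an assumption beyond the hint: it keeps an empty input from producing a float array, which plain `np.array([])` would do.

```python
import numpy as np


def whitespace_tokenize(text: str) -> np.ndarray:
    """Split text on runs of whitespace and return the tokens as a NumPy array.

    The name `whitespace_tokenize` is hypothetical; use the signature the
    problem actually requires.
    """
    # str.split() with no arguments treats any run of whitespace (spaces,
    # tabs, newlines) as one separator and never yields empty strings.
    tokens = text.split()
    # dtype=str is an assumption: it keeps the result a string array even
    # when there are no tokens at all.
    return np.array(tokens, dtype=str)


# Leading, trailing, and mixed whitespace are ignored; punctuation stays put.
print(whitespace_tokenize("  hello,\tworld\n\nfoo  ").tolist())
# ['hello,', 'world', 'foo']
```

An input that is empty or contains only whitespace yields an array of shape `(0,)`, so later steps such as vocabulary building can iterate over the result without special-casing it.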