Join Our 5-Week ML/AI Engineer Interview Bootcamp đ led by ML Tech Leads at FAANGs
Handle high-cardinality categorical features during preprocessing, where categories can have many unique values and naive one-hot encoding explodes feature size. Youâll implement a simple top-K + âOTHERâ encoder to keep the representation compact.
The encoding is defined as:
Implement the function:
Rules:
values and identify the k most frequent categories."OTHER".Output:
| Argument | Type |
|---|---|
| k | int |
| values | np.ndarray |
| Return Name | Type |
|---|---|
| value | np.ndarray |
Use only NumPy.
Return np.ndarray in same input order.
Tie-break: lexicographic ascending on equal counts.
Use np.unique(values, return_counts=True) to get unique categories and their frequencies.
Use np.lexsort to sort indices based on multiple keys. Remember it sorts by the last key passed as primary. To sort by frequency descending and then value ascending, pass keys corresponding to value and negative frequency.
Use np.isin to create a boolean mask for the original array indicating which elements are in the top-k set, then use np.where to replace non-top-k elements with "OTHER".