32. High cardinality categoricals

easy
General
senior

High-cardinality categorical features have many unique values, so naive one-hot encoding explodes the feature size. In this problem you'll handle them during preprocessing by implementing a simple top-K + "OTHER" encoder that keeps the representation compact.

The encoding is defined as:

$$
\text{enc}(x) = \begin{cases} x & \text{if } x \in \text{TopK}(X, K) \\ \text{"OTHER"} & \text{otherwise} \end{cases}
$$

Requirements

Implement the function:

python

Rules:

  • Count category frequencies in values and identify the k most frequent categories.
  • If there are ties in frequency, break ties by lexicographic order (ascending).
  • Replace any category not in the top-k set with the string "OTHER".
  • Return the encoded array in the same order as input.
  • Use only NumPy.
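
The rules above can be sketched end to end as follows. The original function signature is not shown here, so the name `top_k_encode` and the argument order `(values, k)` are assumptions for illustration. Treating `k <= 0` as "keep nothing" is also an assumption, not something the problem states.

```python
import numpy as np


def top_k_encode(values, k):
    """Sketch: keep the k most frequent categories, map the rest to "OTHER".

    The name and argument order are hypothetical; the problem's own
    signature is not available.
    """
    values = np.asarray(values)

    # Unique categories and how often each one occurs.
    uniques, counts = np.unique(values, return_counts=True)

    # lexsort treats the LAST key as primary: count descending (negated),
    # then category ascending to break ties lexicographically.
    order = np.lexsort((uniques, -counts))

    # Assumption: a non-positive k keeps no categories at all.
    top = uniques[order[:max(k, 0)]]

    # Keep top-k categories, replace everything else, preserve input order.
    return np.where(np.isin(values, top), values, "OTHER")


sample = np.array(["a", "b", "a", "c", "b", "d"])
print(top_k_encode(sample, 2))  # ['a' 'b' 'a' 'OTHER' 'b' 'OTHER']
```

Here "a" and "b" each appear twice and fill both top-2 slots, so "c" and "d" are replaced. Note that a category literally named "OTHER" in the input would be indistinguishable from the replacement value.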

Example

python

Output:

python

Input Signature

  Argument   Type
  k          int
  values     np.ndarray

Output Signature

  Return Name   Type
  value         np.ndarray

Constraints

  • Use only NumPy.

  • Return np.ndarray in same input order.

  • Tie-break: lexicographic ascending on equal counts.

Hint 1

Use np.unique(values, return_counts=True) to get unique categories and their frequencies.
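
As a small illustration of this hint, with made-up sample data: `np.unique` returns the categories in sorted order, with a parallel array of counts.

```python
import numpy as np

# Hypothetical sample data, for illustration only.
values = np.array(["red", "blue", "red", "green", "blue", "red"])

# uniques comes back sorted; counts[i] is the frequency of uniques[i].
uniques, counts = np.unique(values, return_counts=True)

print(uniques)  # ['blue' 'green' 'red']
print(counts)   # [2 1 3]
```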

Hint 2

Use np.lexsort to sort indices by multiple keys. Remember that the last key passed is the primary sort key. To sort by frequency descending and then by value ascending, pass the values first and the negated frequencies last.
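
A minimal sketch of that key ordering, using made-up categories and counts. The categories are deliberately left unsorted so the effect of the secondary key is visible.

```python
import numpy as np

# Hypothetical categories and frequencies; "mango" and "apple" tie at 2.
uniques = np.array(["mango", "kiwi", "apple", "pear"])
counts = np.array([2, 5, 2, 1])

# Last key (-counts) is primary: highest count first.
# First key (uniques) is secondary: ties resolved in ascending order.
order = np.lexsort((uniques, -counts))
ranked = uniques[order]

print(order)   # [1 2 0 3]
print(ranked)  # ['kiwi' 'apple' 'mango' 'pear']
```

Slicing `ranked[:k]` then gives the top-k categories with ties already broken lexicographically.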

Hint 3

Use np.isin to create a boolean mask for the original array indicating which elements are in the top-k set, then use np.where to replace non-top-k elements with "OTHER".
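
A short sketch of the masking step, with a made-up input and top-k set. It also shows one reason to prefer `np.where` over in-place assignment: a fixed-width NumPy string array silently truncates values longer than its item size.

```python
import numpy as np

# Hypothetical input and top-k set, for illustration only.
values = np.array(["a", "b", "c", "a", "d"])  # dtype <U1
top = np.array(["a", "b"])

# True where the element belongs to the top-k set.
mask = np.isin(values, top)

# np.where builds a new array wide enough to hold "OTHER".
encoded = np.where(mask, values, "OTHER")
print(encoded)  # ['a' 'b' 'OTHER' 'a' 'OTHER']

# In-place assignment keeps the <U1 dtype, so "OTHER" is cut to "O".
truncated = values.copy()
truncated[~mask] = "OTHER"
print(truncated)  # ['a' 'b' 'O' 'a' 'O']
```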

Roles
ML Engineer
AI Engineer
Data Scientist
Quantitative Analyst
Companies
General
Levels
senior
entry
Tags
categorical-encoding
frequency-counting
sorting-tie-breaks
preprocessing