Preprocessing text for NLP models often requires every token sequence to have the same length. Write a function that pads shorter sequences and truncates longer ones to a fixed max_len.
For a sequence of length n and target length max_len, define k = min(n, max_len) and produce an output of length max_len by copying k tokens (from the start or end) and filling the rest with pad_value.
Implement a function that accepts the arguments below and returns the padded array.
Rules:

- seqs is a standard Python list of lists.
- If a sequence is longer than max_len, truncate it according to truncating ("pre" keeps the last max_len tokens, "post" keeps the first max_len).
- If a sequence is shorter than max_len, pad it with pad_value according to padding ("pre" pads on the left, "post" pads on the right).
- Do not modify seqs in-place.
- The output has shape (batch_size, max_len).

Input:
| Argument | Type |
|---|---|
| seqs | list |
| max_len | int |
| padding | str |
| pad_value | int |
| truncating | str |

Output:
| Return Name | Type |
|---|---|
| value | np.ndarray |
Constraints:

- Use only NumPy and Python built-ins.
- Do not modify the input seqs in-place.
- Each output sequence length equals max_len.
Start by deciding how to handle a single sequence: if len(seq) > max_len, slice it based on truncating (seq[-max_len:] for "pre", seq[:max_len] for "post").
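The truncation step above can be sketched as a small helper (the name `truncate` is my own, not part of the required interface):

```python
def truncate(seq, max_len, truncating="post"):
    """Keep at most max_len tokens from one sequence.

    "pre" keeps the last max_len tokens; "post" keeps the first max_len.
    """
    if len(seq) <= max_len:
        return list(seq)  # copy, so the caller's list is never mutated
    return seq[-max_len:] if truncating == "pre" else seq[:max_len]
```

Note that slicing already returns a new list, which helps satisfy the no-in-place-modification rule.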
Once you have the tokens to keep, compute how many pads you need: pad_count = max_len - len(tokens). Then build the final sequence by adding [pad_value] * pad_count either before or after depending on padding.
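The padding step can be sketched the same way (the helper name `pad` is illustrative):

```python
def pad(tokens, max_len, padding="post", pad_value=0):
    """Extend tokens to exactly max_len by adding pad_value on one side."""
    pad_count = max_len - len(tokens)
    fill = [pad_value] * pad_count
    # "pre" pads on the left, "post" pads on the right
    return fill + list(tokens) if padding == "pre" else list(tokens) + fill
```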
Using NumPy can simplify this: pre-allocate np.full((len(seqs), max_len), pad_value) and copy tokens into either padded[i, :n] (post) or padded[i, -n:] (pre). Return the NumPy array.