# [2023] Data Science Coding Interview Guide (+ Questions)

Got a data science coding interview lined up? Chances are that you are interviewing for an ML engineering and/or data scientist position. Companies that hold data science coding interviews include **Google**, **Meta**, **Stripe**, and startups. Coding questions are peppered throughout the technical screen and on-site rounds. We will cover the following areas of the data science coding interview so you are well prepared for your upcoming interview:

**What is the Data Science Coding Interview?**

**Areas Covered in Data Science Coding**

**Sample Questions and Solutions**

**Prep Tips**

## What is the Data Science Coding Interview?

Let's start with how the interview is conducted. You will most likely hop onto a virtual call with a code or text editor. The interviewer will most likely be a senior/staff MLE or data scientist who will evaluate you on code proficiency, accuracy, and interpretability. Your communication skills, meaning the ability to understand the question and explain your thoughts clearly, will be assessed as well.

## Areas Covered in Data Science Coding

There are four major areas often assessed in data science coding interviews: *data structures & algorithms, data manipulation, statistical coding, and machine learning functions*. The areas covered tend to be role-specific.

**MLE / Full-Stack Data Scientist**: If the role requires you to deploy models to production, expect data structures & algorithms questions. This means that you should brush up on strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs.

**Product / Generalist Data Scientist**: Expect more coverage of data manipulation questions. These are "Pandas" SQL problems that use Pandas to solve SQL-like table manipulation tasks. In some cases, you may be asked statistical coding problems, which tend to come up in quant roles and Google DS interviews.

**Data Analyst**: Like product and generalist data scientist roles, expect problems that use Pandas to solve SQL-like table manipulation tasks. You don't need to worry too much about the other areas.

Now, let's do a deep dive into each of the four areas.

**Data Structures & Algorithms**

These are the classic SWE questions posed in data science interviews: strings, math, arrays, sorting, searching, dynamic programming, queues & stacks, and, in some cases, trees and graphs. You should have a firm grasp of runtime and space complexity, and aim to write the most efficient solution. A great place to learn about data structures & algorithms are

```
# Sample Questions
1. [Microsoft] Function to check whether a word is a palindrome
2. [Adobe] Program to find one number from each of two sorted arrays such that the sum of the two numbers is closest to a given integer
3. [Amazon] Find the shortest paths between two coordinates.
```
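To make the first two sample questions concrete, here is a minimal sketch of how they might be approached. The function names and the exact reading of each question (case-insensitive palindrome check; one number taken from each array) are assumptions for illustration, not the interviewers' reference solutions.

```python
def is_palindrome(word):
    """Return True if word reads the same forwards and backwards (case-insensitive)."""
    word = word.lower()
    left, right = 0, len(word) - 1
    while left < right:
        if word[left] != word[right]:
            return False
        left += 1
        right -= 1
    return True


def closest_pair_sum(a, b, target):
    """Pick one number from each sorted list so that their sum is closest to target.

    Two-pointer scan: O(len(a) + len(b)) time, O(1) extra space.
    Returns None if either list is empty.
    """
    i, j = 0, len(b) - 1
    best = None
    while i < len(a) and j >= 0:
        total = a[i] + b[j]
        if best is None or abs(total - target) < abs(best[0] + best[1] - target):
            best = (a[i], b[j])
        if total < target:
            i += 1  # sum too small: move to a larger value in a
        elif total > target:
            j -= 1  # sum too large: move to a smaller value in b
        else:
            break  # exact match, cannot do better
    return best


print(is_palindrome("Racecar"))
print(closest_pair_sum([1, 4, 5, 7], [10, 20, 30, 40], 32))
```

The two-pointer version is worth contrasting with the brute-force check of every pair, which takes O(len(a) * len(b)) time; being able to state that difference is the kind of complexity reasoning interviewers look for.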

**Data Manipulation**

These are SQL-like table manipulation questions. Familiarity with Pandas or R DataFrames is essential in tackling them. The common operations you should be familiar with are selection, aggregation, lags, group by, partition by, filtering, JOINs, sorting, and ranking.
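As a rough illustration of how several of these operations map from SQL to Pandas, the sketch below uses a small made-up table (the column names and values are hypothetical, not taken from any interview question). JOINs are not shown; in Pandas they correspond to `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "day": pd.to_datetime(
        ["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-01", "2023-07-02"]
    ),
    "likes": [10, 30, 20, 5, 15],
})

# Sorting, then LAG(likes) OVER (PARTITION BY user_id ORDER BY day)
df = df.sort_values(["user_id", "day"])
df["prev_likes"] = df.groupby("user_id")["likes"].shift(1)

# RANK() OVER (PARTITION BY user_id ORDER BY likes DESC)
df["likes_rank"] = df.groupby("user_id")["likes"].rank(method="min", ascending=False)

# Filtering (WHERE likes >= 15) combined with column selection
popular = df.loc[df["likes"] >= 15, ["user_id", "day", "likes"]]

# GROUP BY user_id with an aggregation
totals = df.groupby("user_id", as_index=False).agg(total_likes=("likes", "sum"))

print(df)
print(popular)
print(totals)
```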

```
# Sample Questions
| post_id | user_id | post_text | post_date | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1 | 101 | "Enjoying a day at the beach!"| 2023-07-25 | 217 | 30 | Photo |
| 2 | 102 | "Just finished a great book!" | 2023-07-24 | 120 | 18 | Status |
| 3 | 103 | "Check out this cool video!" | 2023-07-23 | 345 | 47 | Video |
| 4 | 101 | "That's awesome?" | 2023-07-22 | 52 | 70 | Status |
1. [Meta] Using the following dataset, find users who never posted a photo
2. [Meta] Retrieve users who posted more than three times but received less than 100 total likes
3. [Meta] Find the user with the highest average comments per post
```
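Here is one possible way to approach questions 1 and 3 with Pandas, using the four rows from the sample table above. The DataFrame construction and variable names are assumptions for illustration, and the text and date columns are omitted because these two questions do not need them.

```python
import pandas as pd

# The four rows from the sample table (post_text and post_date omitted)
df = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "user_id": [101, 102, 103, 101],
    "likes_count": [217, 120, 345, 52],
    "comments_count": [30, 18, 47, 70],
    "post_type": ["Photo", "Status", "Video", "Status"],
})

# Question 1: users who never posted a photo.
# Collect users with at least one photo, then keep everyone else.
photo_users = df.loc[df["post_type"] == "Photo", "user_id"].unique()
never_photo = df.loc[~df["user_id"].isin(photo_users), "user_id"].unique()

# Question 3: user with the highest average comments per post
avg_comments = df.groupby("user_id")["comments_count"].mean()
top_user = avg_comments.idxmax()

print(never_photo)
print(top_user)
```

Filtering with `isin` on the set of photo posters avoids the common mistake of simply dropping photo rows, which would wrongly keep a user who posted both a photo and a status.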

**Statistical Coding**

These are Google-style questions that involve statistical simulation or writing functions that compute statistical values such as the Pearson correlation coefficient. Such questions can appear in any interview, but they are most common in quant, Google, and startup interviews. Depending on the interview, you may be allowed to load third-party libraries like NumPy and SciPy, but you will need to ask the interviewer for the specifics.

```
# Sample Questions
1. [Google] In a World Series, suppose that the probability of team A winning a match is 0.60. What is the probability that team A wins the World Series in each of the 7 matches? Use NumPy should you need to.
2. [Google] Demonstrate the confidence interval. Use NumPy and SciPy should you need to.
3. [Microsoft] Write a function that computes the inverse matrix. Use NumPy should you need to.
```
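The World Series question can be read in more than one way. The sketch below assumes a best-of-seven format (first team to four wins) with independent games, and computes both the exact probability that team A clinches in exactly n games and a NumPy simulation of the overall series win rate. The function names and that interpretation are assumptions; in an interview you would confirm the intended reading first.

```python
from math import comb

import numpy as np


def prob_win_in_n_games(p, n, wins_needed=4):
    """Exact probability that team A clinches the series in exactly n games."""
    if n < wins_needed or n > 2 * wins_needed - 1:
        return 0.0
    # A must win game n and exactly wins_needed - 1 of the first n - 1 games
    return comb(n - 1, wins_needed - 1) * p**wins_needed * (1 - p) ** (n - wins_needed)


def simulate_series_win_rate(p, num_series=100_000, wins_needed=4, seed=0):
    """Monte Carlo estimate of the probability that team A wins the series."""
    rng = np.random.default_rng(seed)
    max_games = 2 * wins_needed - 1
    # Simulate all 7 games for every series; playing out the unused games
    # does not change which team reaches 4 wins first.
    a_wins = rng.random((num_series, max_games)) < p
    return float(np.mean(a_wins.sum(axis=1) >= wins_needed))


p = 0.60
exact_by_length = {n: prob_win_in_n_games(p, n) for n in range(4, 8)}
exact_total = sum(exact_by_length.values())
estimate = simulate_series_win_rate(p)

print(exact_by_length)
print(f"exact: {exact_total:.4f}, simulated: {estimate:.4f}")
```

Presenting the closed-form answer alongside the simulation is a useful habit: the simulation checks the algebra, and the exact value checks the simulation.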

**Machine Learning Functions**

ML coding is similar to LeetCode-style coding, but the main difference is that you apply machine learning concepts in code. Expect to write ML functions from scratch. In some cases, you will not be allowed to import third-party libraries like scikit-learn, as the questions are designed to assess your conceptual understanding and coding ability.

```
# Sample Questions
1. [Uber] Write an AUC function from scratch using vanilla Python
2. [Google] Write the K-Means algorithm using NumPy only
```
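For the AUC question, one way to answer without any libraries is to use the pairwise interpretation of ROC AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below assumes binary labels encoded as 0 and 1; the function name is hypothetical.

```python
def auc_score(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This version runs in time proportional to the number of positives times the number of negatives. A sorting-based approach brings that down to O(n log n), which is a natural follow-up to discuss with the interviewer.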

## Sample Questions and Solutions

**Sample Question 1: Data Manipulation**

```
# Sample Questions
[Meta] Retrieve users who posted more than three times but received less than 100 total likes
| post_id | user_id | post_text | post_date | likes_count | comments_count | post_type |
|---------|---------|-------------------------------|------------|-------------|----------------|-----------|
| 1 | 101 | "Enjoying a day at the beach!"| 2023-07-25 | 217 | 30 | Photo |
| 2 | 102 | "Just finished a great book!" | 2023-07-24 | 120 | 18 | Status |
| 3 | 103 | "Check out this cool video!" | 2023-07-23 | 345 | 47 | Video |
| 4 | 101 | "That's awesome?" | 2023-07-22 | 52 | 70 | Status |
```

**Solution**

```
# Logic
# 1. Group the original DataFrame by user_id.
# 2. Calculate the sum of the likes_count column and the count of posts for each user.
# 3. Filter the grouped data for users who posted more than three times but received less than 100 total likes.
import pandas as pd

# df is assumed to be the posts table above, already loaded as a pandas DataFrame

# Group by user_id and calculate sum of likes_count and count of posts
grouped_users = df.groupby('user_id').agg({'likes_count': 'sum', 'post_id': 'count'})

# Filter users who posted more than three times but received less than 100 total likes
filtered_users_optimal_approach = grouped_users[(grouped_users['post_id'] > 3) & (grouped_users['likes_count'] < 100)]
filtered_users_optimal_approach
```

**Sample Question 2: Statistical Coding**

```
[Google] Demonstrate the confidence interval. Use Numpy and Scipy should you need to.
```

**Solution**

```
# Import libraries
import numpy as np
import scipy.stats as sci

# Set the random seed
np.random.seed(111)

# Set the simulation parameters
pop_mean = 100  # Population mean
pop_std = 10  # Population standard deviation
sample_size = 100  # Sample size
num_samples = 1000  # Number of samples in the simulation
alpha = 0.05  # Significance level (1 - alpha is the confidence level)

# Run simulation
mean_in_interval = 0  # Count the number of times the pop. mean is in the CI
for i in range(num_samples):
    # Sample observations from a normal distribution
    obs = np.random.normal(loc=pop_mean, scale=pop_std, size=sample_size)
    # Get the mean and standard error
    sample_mean = np.mean(obs)
    standard_error = sci.sem(obs)
    # Generate the 95% confidence interval of the mean
    # (the keyword is `confidence` in SciPy 1.9+; older releases call it `alpha`)
    lower, upper = sci.t.interval(confidence=(1-alpha), df=sample_size-1, loc=sample_mean, scale=standard_error)
    # Count the instances where the interval contains the population mean
    if lower < pop_mean < upper:
        mean_in_interval += 1

# Generate the proportion of the times that the pop. mean is in the CI
proportion = mean_in_interval / num_samples
print(f'Based on a simulation of {num_samples} trials, the true population mean,\n'
      f'{pop_mean}, is found in the {1-alpha} confidence interval about {proportion*100:.1f}% of the time.')
```

**Sample Question 3: Machine Learning Functions**

```
[Google] Write the K-Means algorithm using Numpy only
```

**Solution**

```
import numpy as np

class KMeans:
    def __init__(self, k=2, max_iterations=500):
        self.k = k
        self.max_iterations = max_iterations

    def fit(self, X):
        # Initialize centroids by picking k distinct data points at random
        self.centroids = X[np.random.choice(range(len(X)), self.k, replace=False)]

        for i in range(self.max_iterations):
            # Assign each data point to the nearest centroid
            clusters = [[] for _ in range(self.k)]
            for x in X:
                distances = [np.linalg.norm(x - c) for c in self.centroids]
                cluster = np.argmin(distances)
                clusters[cluster].append(x)

            # Recalculate centroids
            prev_centroids = self.centroids
            self.centroids = []
            for cluster in clusters:
                if cluster:
                    self.centroids.append(np.mean(cluster, axis=0))
                else:
                    # Empty cluster: reuse a randomly chosen previous centroid
                    self.centroids.append(prev_centroids[np.random.choice(range(self.k))])
            # Keep centroids as an array so fit() and predict() see the same type
            self.centroids = np.array(self.centroids)

            # Check for convergence
            if np.allclose(prev_centroids, self.centroids):
                break

    def predict(self, X):
        # Distance from every point to each centroid, then pick the closest
        distances = [np.linalg.norm(X - c, axis=1) for c in self.centroids]
        return np.argmin(distances, axis=0)
```
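A common follow-up is whether the per-point loop in the assignment step can be vectorized. One possible sketch, using NumPy broadcasting, is shown below; `assign_clusters` is a hypothetical standalone helper rather than part of the class above, and the sample points are made up.

```python
import numpy as np


def assign_clusters(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    # X[:, None, :] has shape (n, 1, d); centroids[None, :, :] has shape (1, k, d).
    # Broadcasting the difference gives an (n, k, d) array of offsets.
    diffs = X[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # shape (n, k)
    return np.argmin(distances, axis=1)


# Hypothetical toy points for illustration
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_clusters(X, centroids)
print(labels)
```

The trade-off is memory: the intermediate array holds n × k × d values, so for very large inputs the looped or chunked version may be preferable.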

## Prep Tips

**Tip 1: Front-Load Python Problem Sets**

Those who succeed in passing coding problems are often "primed" for interviews. Given that coding is usually assessed first, in the technical screen, it is vital that you front-load coding practice as part of your daily/weekly prep. Go through about 2 to 3 problems per day leading up to the interview. For more resources, visit datainterview.com.

**Tip 2: Practice Explaining Verbally**

Interviewing is not a written exercise; it's a verbal one. Whenever the interviewer asks you a coding question, you will be expected to explain your solution clearly and in detail. As you practice interview questions, practice explaining them out loud.

**Tip 3: Join the Ultimate Prep**

Get access to ML questions, cases, and machine learning mock interview recordings when you join the interview program on datainterview.com.