# 100 Data Science Interview Questions

Getting ready for data science interviews? Here's a compilation of 100 data science interview questions asked at FAANG companies, startups, and consulting firms. The core areas covered in data science interviews are:

- Applied Statistics
- Machine Learning
- Business/Product Case
- SQL
- Coding
- Leadership & Behavioral

Use this question list as a starting point for your prep!

## Applied Statistics

### 1. Can you explain the Central Limit Theorem and its importance in statistics?

The Central Limit Theorem is a fundamental theorem in the field of statistics. It provides insight into the distributional properties of sample averages and sums.

#### Definition:

The CLT states that, for a sufficiently large sample size from any population with a finite standard deviation, the distribution of the sample means will approximate a normal distribution, no matter the shape of the population's distribution.

#### Mathematically:

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$. If $n$ is sufficiently large (often cited as $n \geq 30$), the sampling distribution of the sample mean $\bar{X}$ will be approximately normal, with:

- Mean: $\mu$
- Standard Error (SE): $\frac{\sigma}{\sqrt{n}}$

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

#### Importance:

- **Predictability**: The CLT provides a predictable shape for the distribution of sample means, which is the well-known bell curve or normal distribution.
- **Inference**: Many statistical tests and procedures are based on the assumption of normality. The CLT allows us to use these methods with large samples even if the original data isn't normally distributed.
- **Practicality**: In real-world applications, actual distributions are often unknown. The CLT provides a way to work with the familiar normal distribution, sidestepping challenges posed by unknown distributions.
- **Simplicity**: The normal distribution has well-understood properties, making it easier to work with in statistical computations.
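The CLT can be checked empirically with a quick simulation. The sketch below (Python standard library only; the exponential population, sample size, and trial count are illustrative choices) draws repeated samples from a strongly skewed distribution and confirms that the sample means center on $\mu$ with spread close to $\sigma/\sqrt{n}$:

```python
import random
import statistics

def sample_mean_distribution(n, trials, seed=42):
    """Draw `trials` samples of size n from a skewed (exponential)
    population and return the resulting sample means."""
    rng = random.Random(seed)
    # Exponential with rate 1: population mean = 1, population sd = 1.
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

means = sample_mean_distribution(n=50, trials=5000)
print(round(statistics.fmean(means), 3))  # close to the population mean, 1
print(round(statistics.stdev(means), 3))  # close to sigma / sqrt(n) ~ 0.141
```

Even though the exponential distribution is heavily right-skewed, the distribution of the 5,000 sample means is nearly symmetric around the population mean.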

### 2. What is the difference between Type I and Type II errors in hypothesis testing?

#### Type I and Type II Errors in Hypothesis Testing

When performing hypothesis testing in statistics, there are two main types of errors that can occur, known as Type I and Type II errors.

#### Type I Error (False Positive):

This error occurs when the null hypothesis is true, but we incorrectly reject it.

**Formula**: $\alpha = P(\text{Reject } H_0 \mid H_0 \text{ is true})$, where $\alpha$ is the significance level of the test, often set at levels like 0.05 or 0.01.

#### Type II Error (False Negative):

This error occurs when the alternative hypothesis is true, but we fail to reject the null hypothesis.

**Formula**: $\beta = P(\text{Fail to reject } H_0 \mid H_a \text{ is true})$. The power of a test, denoted as $1 - \beta$, represents the probability of correctly rejecting $H_0$ when $H_a$ is true.

#### In Summary:

- **Type I Error**: Proclaiming an effect when there isn't one.
- **Type II Error**: Failing to proclaim an effect when there is one.

The balance between these errors is critical in research. Researchers must decide the acceptable levels for these errors, considering the consequences of each type of mistake.
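One way to internalize the definition of $\alpha$ is to simulate it: generate many datasets under a true null hypothesis, test each one, and count how often the null is wrongly rejected. A minimal sketch, assuming a two-sided z-test with known variance (sample size, trial count, and seed are arbitrary choices):

```python
import math
import random

def z_test_p_value(sample, mu0, sigma):
    """Two-sided z-test p-value for H0: population mean == mu0,
    assuming the population standard deviation sigma is known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(0)
alpha, trials, rejections = 0.05, 2000, 0
for _ in range(trials):
    sample = [rng.gauss(0, 1) for _ in range(30)]  # H0 is true: mu = 0
    if z_test_p_value(sample, mu0=0, sigma=1) <= alpha:
        rejections += 1  # a Type I error: H0 rejected although true
print(rejections / trials)  # close to alpha = 0.05 by construction
```

The observed false-positive rate hovers around the chosen significance level, which is exactly what $\alpha$ promises.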

### 3. Can you explain what is meant by p-value?

In the context of hypothesis testing, the p-value is a fundamental concept that helps researchers determine the strength of evidence against the null hypothesis.

#### Definition:

The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.

#### Mathematically:

Given a test statistic $T$ with observed value $t$, the p-value can be defined as:

- For a two-tailed test: $p\text{-value} = P(T \geq |t| \mid H_0 \text{ is true}) + P(T \leq -|t| \mid H_0 \text{ is true})$
- For a right-tailed test: $p\text{-value} = P(T \geq t \mid H_0 \text{ is true})$
- For a left-tailed test: $p\text{-value} = P(T \leq t \mid H_0 \text{ is true})$

#### Interpretation:

A smaller p-value indicates stronger evidence against the null hypothesis. Common thresholds for the p-value are 0.05, 0.01, and 0.10, though these can vary by field and specific research. If the p-value is less than or equal to a pre-defined significance level (often denoted $\alpha$), one might reject the null hypothesis in favor of the alternative.
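For a z-statistic, the three cases above can be computed directly from the standard normal CDF, which `math.erf` provides without external libraries. A small sketch (the statistic values are arbitrary examples):

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_value(z, tail="two"):
    """p-value of an observed z-statistic for the given tail."""
    if tail == "two":
        return 2 * (1 - normal_cdf(abs(z)))
    if tail == "right":
        return 1 - normal_cdf(z)
    return normal_cdf(z)  # left-tailed

print(round(p_value(1.96), 3))           # two-tailed, ~0.05
print(round(p_value(1.64, "right"), 3))  # right-tailed, ~0.05
```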

### 4. How do you assess the normality of a dataset?

When working with statistical analyses, it's often essential to determine whether a dataset follows a normal (or Gaussian) distribution. There are several methods to assess normality:

#### 1.Β **Histograms and Q-Q Plots**:

- **Histogram**: A visual representation where data is grouped into bins, and the frequency of data points in each bin is shown using bars. A bell-shaped histogram suggests a normal distribution.
- **Q-Q Plot**: A quantile-quantile plot compares the quantiles of a dataset to the quantiles of a normal distribution. If the data is normally distributed, the points should roughly lie on a straight line.

#### 2.Β **Statistical Tests**:

- **Shapiro-Wilk Test**: Null hypothesis ($H_0$): the data is normally distributed. The test statistic is
  $$W = \frac{\left(\sum_i a_i x_{(i)}\right)^2}{\sum_i (x_i - \bar{x})^2}$$
  where $x_{(i)}$ are the ordered sample values and $a_i$ are constants derived from the means, variances, and covariances of the order statistics of a sample from a normal distribution. A low p-value suggests deviation from normality.
- **Kolmogorov-Smirnov Test**: This test compares the cumulative distribution function of the sample data to that of a normal distribution. A significant difference suggests non-normality.

#### 3.Β **Normal Probability Plots**:

A plot that pairs observed values with the corresponding percentiles of a normal distribution. If the dataset is normal, this plot should be linear.

#### 4.Β **Skewness and Kurtosis**:

- **Skewness**: Measures the asymmetry of the probability distribution. For a normal distribution, skewness should be about 0.
  $$\text{Skewness} = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$
- **Kurtosis**: Measures the "tailedness" of the distribution. A normal distribution has a kurtosis of 3.
  $$\text{Kurtosis} = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$$
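Both moments are straightforward to estimate from data. The sketch below (standard library only; the simulated normal sample is an illustrative assumption) computes sample skewness and kurtosis as standardized moments:

```python
import random
import statistics

def standardized_moment(data, k):
    """Sample estimate of E[((X - mu) / sigma)^k]."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean(((x - mu) / sigma) ** k for x in data)

rng = random.Random(1)
normal_sample = [rng.gauss(0, 1) for _ in range(20000)]
skew = standardized_moment(normal_sample, 3)  # near 0 for normal data
kurt = standardized_moment(normal_sample, 4)  # near 3 for normal data
print(round(skew, 2), round(kurt, 2))
```

A clearly nonzero skewness or a kurtosis far from 3 is quick evidence against normality, though a formal test like Shapiro-Wilk is more rigorous.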

### 5. What is the difference between correlation and causation?

### Correlation:

Correlation refers to a statistical measure that describes the extent to which two variables change together. If one variable tends to go up when the other goes up, there is a positive correlation between them. Conversely, if one variable tends to go down when the other goes up, there's a negative correlation.

The most common measure of correlation is the Pearson correlation coefficient, denoted as $r$. It quantifies the strength and direction of a linear relationship.

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Where:

- $x_i$ and $y_i$ are data points.
- $\bar{x}$ and $\bar{y}$ are the means of the datasets $x$ and $y$ respectively.
- $r$ ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear correlation.
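The formula translates directly into code. A minimal sketch of the Pearson coefficient, applied to two hypothetical datasets:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear: ~1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly inverse: ~-1.0
```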

While correlation can be quantified and easily computed, causation requires deeper investigation, often through controlled experiments. The oft-repeated maxim, "correlation does not imply causation," warns us that just because two variables move together doesn't mean one causes the other.

### 6. How do you handle missing data in a dataset?

Dealing with missing or corrupted data is vital for maintaining data integrity and ensuring valid outcomes from analyses. Here's how you can address these issues:

### Imputation:

#### a.Β **Mean/Median/Mode Imputation**:

Replace missing values with the mean (for continuous data), median (when data has outliers), or mode (for categorical data). For a dataset $X$ with missing values, each missing entry is replaced with $\bar{X}$, the mean of the observed values.

#### b.Β **Linear Interpolation**:

Especially useful for time series data. Given two known points $(x_1, y_1)$ and $(x_2, y_2)$, the value at a missing point $x$ can be found as:

$$y = y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1)$$

#### c.Β **Removal**:

Sometimes, it's best to simply remove the data entries that have missing or corrupted values, especially if they are not a significant portion of the dataset.

#### d.Β **Predictive Filling**:

Use machine learning models, such as decision trees or k-nearest neighbors, to estimate and fill in missing values based on other available data.
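Hypothetical helpers for two of the approaches above, mean imputation and linear interpolation (missing values are represented as `None` purely for illustration):

```python
def mean_impute(values):
    """Fill None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def interpolate(x1, y1, x2, y2, x):
    """Linear interpolation between known points (x1, y1) and (x2, y2)."""
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)

print(mean_impute([1.0, None, 3.0, None, 5.0]))  # Nones become 3.0
print(interpolate(0, 10.0, 10, 20.0, 5))         # midpoint -> 15.0
```

In practice, libraries such as pandas wrap these patterns (e.g. filling and interpolating columns), but the underlying arithmetic is the same.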

### 7. What are the assumptions required for linear regression?

The assumptions of a linear regression model are summarized as follows:

- **Linearity**: The relationship between the independent and dependent variables should be linear. This can often be verified with scatter plots.
- **Independence**: Observations should be independent of each other. This is particularly crucial for time series data, where this assumption might be violated.
- **Homoscedasticity**: The residuals (or errors) should have constant variance across all levels of the independent variables. This can be checked using a residuals vs. fitted values plot.
- **Normality of Residuals**: The residuals should be normally distributed. This can be tested using plots or tests like the Shapiro-Wilk test.
- **No Multicollinearity**: Independent variables should not be too highly correlated with each other. The Variance Inflation Factor (VIF) is commonly used:

$$\text{VIF} = \frac{1}{1 - R^2}$$

Where $R^2$ is the coefficient of determination obtained by regressing that variable on the other independent variables.
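As a toy illustration: when a predictor is regressed on a single other predictor, $R^2$ reduces to the squared Pearson correlation, so VIF can be sketched without matrix algebra (the two-predictor setup below is an assumption for demonstration only; real VIF computations regress each predictor on all the others):

```python
def vif_two_predictors(xs, ys):
    """VIF of a predictor regressed on one other predictor:
    in this special case R^2 is the squared Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    r_squared = cov * cov / (vx * vy)
    return 1 / (1 - r_squared)

# Nearly collinear predictors -> large VIF; unrelated ones -> VIF near 1.
print(vif_two_predictors([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]))
print(vif_two_predictors([1, 2, 3, 4], [5, 1, 4, 2]))
```

A common rule of thumb flags VIF values above 5 or 10 as a sign of problematic multicollinearity.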

### Additional Questions

- Explain the concept of power in statistical tests
- How would you explain to a non-technical team member what a confidence interval is?
- What is multiple regression and when do we use it?
- Can you describe the difference between cross-sectional and longitudinal data?
- What is the role of data cleaning in data analysis?
- Can you explain the difference between ANOVA and t-test?
- How do you interpret the coefficients of a logistic regression model?
- What is multicollinearity and how do you detect and deal with it?
- Can you describe the difference between a parametric and a non-parametric test?
- What are some of the methods you would use to detect outliers?
- What is survival analysis and when can it be useful?
- Can you explain what is meant by the term βresidualβ in regression analysis?
- Describe a situation where you used statistical analysis to make a decision or solve a problem

## Machine Learning

### 1. What is the difference between supervised and unsupervised learning?

**Supervised Learning**: In supervised learning, an algorithm learns from a labeled dataset, which means each example in the dataset is paired with the correct output. The aim is to learn a mapping from inputs ($X$) to outputs ($y$). Once the algorithm is trained on this data, it can start making predictions on new, unseen data. Common algorithms in this category include linear regression:

$$y = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \epsilon$$

and classification algorithms like logistic regression.

**Unsupervised Learning**: This involves training an algorithm on data where the output labels are not provided. The goal is often to find patterns or structures in the data. Common techniques include clustering, where data is grouped based on similarities, and dimensionality reduction, which reduces the number of random variables under consideration and can be achieved through methods like Principal Component Analysis (PCA):

$$Z = XV$$

Where $X$ is the data matrix and $V$ contains the principal components.

While supervised learning is more directed in its approach, requiring explicit supervision with correct answers, unsupervised learning explores the data on its own to find hidden patterns or groupings.

### 2. Can you explain the concept of overfitting and underfitting in machine learning models?

**Overfitting**Β occurs when a machine learning model learns not only the underlying pattern of the data but also its noise. As a result, it performs exceptionally well on the training data but poorly on new, unseen data. In a sense, the model is too complex. If we consider polynomial regression, an overfitted model might have a very high-degree polynomial:

$$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_n X^n$$

where $n$ is large, making the model fit even tiny fluctuations in the training data.

**Underfitting**Β is the opposite. The model is too simple to capture the underlying structure of the data. It performs poorly on both the training data and the unseen data. In the polynomial regression context, an underfitted model might be a straight line (linear) when the actual relation is quadratic or cubic.

Visualizing the model's fit can help identify these issues: overfitting typically shows a very wiggly curve fitting all data points, whereas underfitting might show a too straight or simplistic curve not capturing evident trends.

Regularization techniques, model selection, and cross-validation are common strategies to counteract overfitting and underfitting.

### 3. What is cross-validation? Why is it important?

**Cross-validation**Β is a technique used to assess the performance of machine learning models by splitting the dataset into two segments: one used to train the model and the other used to validate the model. By repeating this process multiple times and averaging out the performance metrics, we obtain a better estimation of how the model will perform on unseen data.

One popular form of cross-validation is **k-fold cross-validation**. Here, the data is divided into $k$ subsets (or "folds"). The model is trained on $k-1$ of these folds and tested on the remaining fold. This process is repeated $k$ times, with each fold serving as the test set exactly once. The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop:

$$CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i$$

Where $MSE_i$ is the Mean Squared Error of the $i$-th iteration.

Cross-validation is crucial as it provides a more generalized performance metric for the model. By ensuring that every data point has been part of both training and testing, it reduces the chances of overfitting and provides a more robust measure of a model's predictive power.
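The fold bookkeeping described above can be sketched in a few lines (index-based splitting only; shuffling and model fitting are omitted for brevity, and libraries like scikit-learn provide this as `KFold`):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for i in range(k):
        # Early folds absorb any remainder so every sample is used once.
        stop = start + fold_size + (1 if i < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

for train, test in k_fold_splits(6, 3):
    print(train, test)  # each index appears in exactly one test fold
```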

### 4. Describe how a decision tree works. When would you use it over other algorithms?

AΒ **decision tree**Β is a flowchart-like structure wherein each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is called the root. The main idea is to split the data into subsets based on the value of input features, aiming to have subsets that are as pure (homogeneous) as possible concerning the target variable.

The splits are chosen based on metrics like:

**Gini impurity**:

$$Gini(p) = 1 - \sum_{i=1}^{n} p_i^2$$

**Entropy**:

$$Entropy(p) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Decision trees are favored because they're interpretable, handle both numerical and categorical data, and require little data preprocessing. They're especially useful when domain knowledge suggests the existence of clear decision rules or when interpretability is paramount. However, they can be prone to overfitting, especially when deep. In such cases, algorithms like Random Forest or Gradient Boosting Machines, which ensemble multiple trees, can be preferred for better generalization.
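Both impurity measures are one-liners over the class proportions of a node. A sketch with hypothetical label lists:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))     # 0.5: maximally impure 2-class node
print(entropy(["yes", "yes", "no", "no"]))  # 1.0 bit
print(gini(["yes", "yes", "yes"]))          # 0.0: pure node
```

A split is chosen to maximize the impurity reduction (information gain) between the parent node and the weighted average of its children.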

### 5. What is the bias-variance tradeoff?

TheΒ **bias-variance tradeoff**Β is a fundamental concept in machine learning that describes the balance between two sources of errors:

- **Bias**: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause the model to miss relevant relations between input and output variables (underfitting).
- **Variance**: Error due to excessive complexity in the learning algorithm. High variance can cause the model to fit the random noise in the training data (overfitting).

Mathematically, the expected test mean squared error (MSE) of a model can be decomposed as:

$$MSE = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

- $\text{Bias}^2$ represents the squared difference between the expected predictions of our model and the correct values.
- Variance measures the variability of a model's prediction for a given data point.
- Irreducible Error is the noise inherent to any problem.

In essence, as we increase model complexity (like adding more parameters or features), bias decreases but variance increases. Conversely, reducing model complexity increases bias and decreases variance. The tradeoff is achieving the right balance to minimize the total error, which often means finding a middle ground between a model that's too simple (high bias) and one that's too complex (high variance).
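The decomposition can be illustrated by comparing two extreme estimators on simulated data (the constant and identity predictors below are deliberately artificial; the true value and noise level are arbitrary assumptions):

```python
import random

def mse_components(predict, true_value, noise_sd, trials=10000, seed=7):
    """Empirically estimate bias^2 and variance for an estimator that
    predicts from a single noisy observation of true_value."""
    rng = random.Random(seed)
    preds = [predict(true_value + rng.gauss(0, noise_sd)) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_value) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance

# High-bias, zero-variance estimator: always predict 0.
print(mse_components(lambda obs: 0.0, true_value=2.0, noise_sd=1.0))
# Unbiased, high-variance estimator: predict the raw noisy observation.
print(mse_components(lambda obs: obs, true_value=2.0, noise_sd=1.0))
```

The first estimator has all of its error in the bias term, the second in the variance term; a good model trades between the two extremes.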

### Additional Questions

- What is the difference between stochastic gradient descent (SGD) and batch gradient descent?
- What is the difference between bagging and boosting?
- How would you validate a model you created to generate a predictive analysis?
- What are some of the advantages and disadvantages of a neural network?
- How does the k-means algorithm work?
- Can you explain the difference between L1 and L2 regularization methods?
- What is principal component analysis (PCA) and when is it used?
- Can you describe what an activation function is and why it is used in an artificial neural network?
- How would you handle an imbalanced dataset?
- Can you explain the concept of "feature selection" in machine learning?

## Product Case

### 1. Can you explain the concept of A/B testing?

A/B testing, also known as split testing, is a method of comparing two versions of a webpage, app, or other product (A and B) to determine which one performs better. The goal is to improve upon key performance indicators (KPIs) such as click-through rates, user engagement, or conversion rates.

The procedure starts by randomly splitting the users into two groups. Group 1 interacts with version A, while Group 2 interacts with version B. We then measure the effect on user behavior.

The key principle behind A/B testing is hypothesis testing from statistics. We start with a null hypothesis, $H_0$, typically stating there is no difference in behavior between the groups. The alternative hypothesis, $H_1$, states there is a difference.

After running the experiment and collecting the data, we perform a statistical test (like the two-sample t-test) to decide if the observed differences are significant enough to reject $H_0$.

Ultimately, A/B testing provides an empirical basis to make decisions enhancing user experience and product effectiveness.
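For conversion-rate experiments, the comparison is often a two-proportion z-test rather than a t-test. A sketch with hypothetical conversion counts (120/1000 on version A vs. 150/1000 on version B):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates,
    using the pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p = two_proportion_z(120, 1000, 150, 1000)
print(round(p, 4))  # compare against the chosen significance level
```

With these hypothetical counts the p-value lands just under 0.05, so at $\alpha = 0.05$ we would (narrowly) reject $H_0$.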

### 2. How would you measure the success of a new feature launch?

Measuring the success of a new feature launch involves evaluating both quantitative and qualitative metrics to ascertain its impact and reception.

**Quantitative Metrics**:

- **Engagement Metrics**: Look at the number of users interacting with the feature and the frequency of use. If $N_{\text{after}}$ represents the engagement after the feature launch and $N_{\text{before}}$ represents engagement before, the relative increase is given by $\frac{N_{\text{after}} - N_{\text{before}}}{N_{\text{before}}}$.
- **Retention Rate**: Examine if the feature positively impacts user retention. If $R_{\text{new}}$ is the retention rate with the new feature and $R_{\text{old}}$ without it, a success is when $R_{\text{new}} > R_{\text{old}}$.
- **Conversion Rate**: For features aiming to drive user actions, check if the conversion rate, represented as $\frac{\text{number of conversions}}{\text{total visitors}}$, has increased.

**Qualitative Metrics**:

- **User Feedback**: Collect feedback through surveys, feedback forms, or in-app prompts. Positive sentiment indicates a well-received feature.
- **Support Tickets**: Monitor any increase in support tickets related to the new feature, indicating possible confusion or issues.

### 3. How would you determine the optimal sample size for an A/B test?

The optimal sample size for an A/B test is influenced by several parameters: the minimum detectable effect (the smallest difference that's meaningful to detect), the statistical power (typically set at 0.8 or 80%), and the significance level (often chosen as 0.05 or 5%).

The formula for computing the sample size for each group in a two-sample comparison (assuming equal variance) is:

$$n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (\sigma_A^2 + \sigma_B^2)}{\delta^2}$$

Where:

- $n$ is the sample size per group.
- $Z_{\alpha/2}$ is the critical value for a two-tailed test (for $\alpha = 0.05$, it's approximately 1.96).
- $Z_{\beta}$ is the critical value associated with the desired power (for $\beta = 0.2$, it's about 0.84).
- $\sigma_A^2$ and $\sigma_B^2$ are the variances for groups A and B.
- $\delta$ is the minimum detectable effect.

Estimating variances can be tricky, so pilot tests or past data can be beneficial. Tools and calculators are available online to simplify this computation.
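Plugging the conventional critical values into the formula gives a quick calculator. A sketch assuming $\alpha = 0.05$ (two-tailed), power = 0.8, and hypothetical variances:

```python
def ab_sample_size(sigma_a, sigma_b, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for a two-sample comparison with equal
    allocation; defaults assume alpha = 0.05 (two-tailed) and power = 0.8."""
    return ((z_alpha + z_beta) ** 2 * (sigma_a ** 2 + sigma_b ** 2)) / delta ** 2

# Hypothetical: both groups have variance 1, minimum detectable effect 0.2.
n = ab_sample_size(sigma_a=1.0, sigma_b=1.0, delta=0.2)
print(round(n))  # 392 users per group
```

Note the quadratic sensitivity to $\delta$: halving the minimum detectable effect quadruples the required sample size.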

### 4. How would you interpret the results of an A/B test that showed no significant difference between the control and treatment groups?

When an A/B test yields no significant difference between the control (A) and treatment (B) groups, it means that, statistically, we cannot assert that the change in the treatment group had a detectable impact. In other words, the observed differences (if any) might have occurred by chance.

Here's how to interpret it:

- **Inherent Variability**: There's always some variability in experiments. Given the sample size and the magnitude of the effect, the test might not have had enough power to detect a difference. Ensure your sample size was correctly calculated, e.g. via $n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (\sigma_A^2 + \sigma_B^2)}{\delta^2}$.
- **Practical vs. Statistical Significance**: Even if the result isn't statistically significant, consider whether there's any practical significance. For instance, a small increase in conversion rate might not be statistically significant but can be materially impactful for a large business.
- **External Factors**: Ensure no external events (like a major holiday or website outage) influenced the results.
- **Feature Evaluation**: Maybe the feature or change being tested genuinely doesn't influence the behavior you're measuring. It might be worth re-evaluating the feature's objectives and design.

### 5. Can you explain the concept of statistical power in the context of A/B testing?

Statistical power, often denoted as $1 - \beta$, is the probability that a test correctly rejects the null hypothesis $H_0$ when the alternative hypothesis $H_1$ is true. In A/B testing terms, it's the likelihood of detecting a difference (if one genuinely exists) between the control and treatment groups.

Key points:

- **Relationship with Type II Error**: Power is intrinsically linked to the Type II error rate ($\beta$). If $\beta$ is the risk of not detecting an effect that is there (a false negative), then $1 - \beta$ (power) is the likelihood of detecting it.
- **Desired Power**: In many fields, a power of 0.8 (or 80%) is considered adequate, meaning there's an 80% chance of detecting a true effect.
- **Factors Influencing Power**:
  - **Effect Size**: Larger effect sizes are easier to detect, increasing power.
  - **Sample Size**: Larger sample sizes increase power. This is why calculating the required sample size beforehand, using the formula $n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (\sigma_A^2 + \sigma_B^2)}{\delta^2}$, is crucial.
  - **Significance Level ($\alpha$)**: A lower significance level (e.g., 0.01 vs. 0.05) reduces power, as it sets a stricter criterion for significance.

### Additional Questions

- What is a p-value? How would you explain it to a non-technical stakeholder?
- Describe a situation where A/B testing would not be appropriate.
- What metrics would you look at to understand user engagement on Instagram?
- How do you deal with seasonality in A/B testing?
- How would you design an A/B test for the News Feed algorithm on Facebook?

## SQL

### Airbnb SQL Questions

```
| property_id | owner_id | property_type | property_quality | no_bedrooms | no_bathrooms | total_sqft | nightly_price_range | parking_spots | location_town | location_country | first_listed_on |
|-------------|----------|---------------|------------------|-------------|--------------|------------|---------------------|---------------|--------------------|------------------|-----------------|
| 1 | 9 | Luxury | Acceptable | 2 | 0.5 | 2001-2500 | 151-200 | 2 | Troutdale | US | 2021-12-06 |
| 2 | 18 | Mid-Budget | Satisfactory | 8 | 4 | 3000+ | 0-100 | 1 | Watsonville | US | 2020-12-28 |
| 3 | 3 | Luxury | Acceptable | 1 | 4 | 3000+ | 101-150 | 2 | Sun City Center | US | 2020-06-15 |
| 4 | 3 | Mid-Budget | Acceptable | 0 | 4 | 3000+ | 251-300 | 2 | Sunnyside | US | 2021-12-13 |
| 5 | 29 | Mid-Budget | Satisfactory | 2 | 2.5 | 2001-2500 | 151-200 | 1 | Granger | US | 2021-06-21 |
```

### 1. Retrieve all properties that are located in the town of 'Troutdale'.

```
SELECT * FROM properties
WHERE location_town = 'Troutdale';
```

### 2. Find the average number of bedrooms for all 'Luxury' properties.

```
SELECT AVG(no_bedrooms) AS average_bedrooms
FROM properties
WHERE property_type = 'Luxury';
```

### 3. Identify owner_ids of those owners who have more than one property listed.

```
SELECT owner_id, COUNT(property_id) AS number_of_properties
FROM properties
GROUP BY owner_id
HAVING COUNT(property_id) > 1;
```

### 4. Find the properties in the 'Mid-Budget' category that have the highest number of bathrooms.

```
SELECT property_id, no_bathrooms
FROM properties
WHERE property_type = 'Mid-Budget'
ORDER BY no_bathrooms DESC
LIMIT 1;
```

### 5. Determine how many properties were first listed in each month of 2021.

```
SELECT MONTH(first_listed_on) AS month, COUNT(property_id) AS number_of_properties
FROM properties
WHERE YEAR(first_listed_on) = 2021
GROUP BY MONTH(first_listed_on)
ORDER BY month ASC;
```

### 6. Identify Owners with a Large Variety of Property Types:

Use a CTE to find owners who have more than one type of property and then order the results by those with the most variety of property types.

```
WITH OwnerVariety AS (
SELECT owner_id, COUNT(DISTINCT property_type) as type_count
FROM properties
GROUP BY owner_id
HAVING COUNT(DISTINCT property_type) > 1
)
SELECT owner_id, type_count
FROM OwnerVariety
ORDER BY type_count DESC;
```

### 7. Find the Average Size of Luxury and Mid-Budget Properties and Compare Them:

Use a CTE to find the average size of both 'Luxury' and 'Mid-Budget' properties. Afterwards, identify whether Luxury properties are on average larger than Mid-Budget properties.

```
WITH AverageSizes AS (
SELECT property_type,
CASE
WHEN total_sqft = '2001-2500' THEN 2250.5
WHEN total_sqft = '3000+' THEN 3250 -- Assuming an average value for demonstration
ELSE 0
END as avg_size
FROM properties
WHERE property_type IN ('Luxury', 'Mid-Budget')
)
SELECT property_type, AVG(avg_size) as average_size
FROM AverageSizes
GROUP BY property_type
HAVING AVG(avg_size) > (
SELECT AVG(avg_size)
FROM AverageSizes
WHERE property_type = 'Mid-Budget'
);
```

### 8. Determine the Property with the Best Value in Each Town:

Use a CTE to calculate a value score for each property based on the number of bedrooms, bathrooms, and price range. Then, for each town, identify the property with the highest value score.

```
WITH ValueScores AS (
SELECT property_id, location_town,
(no_bedrooms + no_bathrooms) /
CASE
WHEN nightly_price_range = '0-100' THEN 50
WHEN nightly_price_range = '101-150' THEN 125.5
WHEN nightly_price_range = '151-200' THEN 175.5
WHEN nightly_price_range = '251-300' THEN 275.5
ELSE 1
END as value_score
FROM properties
)
SELECT location_town, MAX(value_score) as top_value_score
FROM ValueScores
GROUP BY location_town;
```

### Facebook SQL Questions

```
| post_id | user_id | post_text                      | post_date  | likes_count | comments_count | post_type |
|---------|---------|--------------------------------|------------|-------------|----------------|-----------|
| 1       | 101     | "Enjoying a day at the beach!" | 2023-07-25 | 217         | 30             | Photo     |
| 2       | 102     | "Just finished a great book!"  | 2023-07-24 | 120         | 18             | Status    |
| 3       | 103     | "Check out this cool video!"   | 2023-07-23 | 345         | 47             | Video     |
| 4       | 101     | "That's awesome?"              | 2023-07-22 | 52          | 70             | Status    |
```

### 1. Find Users Who Have Never Posted a Photo:

```
SELECT user_id
FROM UserPosts
GROUP BY user_id
HAVING SUM(CASE WHEN post_type = 'Photo' THEN 1 ELSE 0 END) = 0;
```

### 2. Count the Number of Posts Made by Each User in July 2023:

```
SELECT user_id, COUNT(post_id) as number_of_posts
FROM UserPosts
WHERE post_date BETWEEN '2023-07-01' AND '2023-07-31'
GROUP BY user_id;
```

### 3. Find the User with the Highest Average Comments per Post:

```
SELECT user_id, AVG(comments_count) as avg_comments
FROM UserPosts
GROUP BY user_id
ORDER BY avg_comments DESC
LIMIT 1;
```

### 4. Identify the Percentage of Posts that are Videos:

```
SELECT (COUNT(post_id) * 100.0 / (SELECT COUNT(*) FROM UserPosts)) as video_percentage
FROM UserPosts
WHERE post_type = 'Video';
```

### 5. Retrieve Users Who Posted More than Three Times but Received Less than 100 Total Likes:

```
WITH UserTotals AS (
SELECT user_id,
COUNT(post_id) as total_posts,
SUM(likes_count) as total_likes
FROM UserPosts
GROUP BY user_id
)
SELECT user_id, total_posts, total_likes
FROM UserTotals
WHERE total_posts > 3 AND total_likes < 100;
```

### 6. Identify Users Who Had the Biggest Increase in Likes from Their First to Their Last Post:

```
WITH FirstPost AS (
SELECT user_id, MIN(post_date) as first_post_date
FROM UserPosts
GROUP BY user_id
),
LastPost AS (
SELECT user_id, MAX(post_date) as last_post_date
FROM UserPosts
GROUP BY user_id
),
FirstPostLikes AS (
SELECT a.user_id, b.likes_count as first_post_likes
FROM FirstPost a
JOIN UserPosts b ON a.user_id = b.user_id AND a.first_post_date = b.post_date
),
LastPostLikes AS (
SELECT a.user_id, b.likes_count as last_post_likes
FROM LastPost a
JOIN UserPosts b ON a.user_id = b.user_id AND a.last_post_date = b.post_date
)
SELECT f.user_id,
f.first_post_likes,
l.last_post_likes,
(l.last_post_likes - f.first_post_likes) as likes_difference
FROM FirstPostLikes f
JOIN LastPostLikes l ON f.user_id = l.user_id
ORDER BY likes_difference DESC
LIMIT 1;
```

### 7. Find Users Who Had the Most Consistent Engagement (least variance in likes):

```
WITH UserAverageLikes AS (
SELECT user_id, AVG(likes_count) as average_likes
FROM UserPosts
GROUP BY user_id
),
UserLikesVariance AS (
SELECT a.user_id,
SUM(POWER(b.likes_count - a.average_likes, 2)) / COUNT(b.post_id) as variance
FROM UserAverageLikes a
JOIN UserPosts b ON a.user_id = b.user_id
GROUP BY a.user_id, a.average_likes
)
SELECT user_id, variance as least_variance
FROM UserLikesVariance
ORDER BY variance ASC
LIMIT 1;
```

## Additional Questions

### Pinterest SQL Questions

| pin_id | user_id | board_id | image_url | pin_description | pin_date | repins_count | pin_category |
|--------|---------|----------|-----------|-----------------|----------|--------------|--------------|
| 1 | 12 | 101 | "http://image101.com" | "Beautiful sunset at the beach" | 2023-01-05 | 230 | Travel |
| 2 | 13 | 101 | "http://image223.com" | "DIY candle making tutorial" | 2023-02-10 | 15 | DIY |
| 3 | 14 | 102 | "http://image443.com" | "Homemade chocolate chip cookies" | 2023-03-07 | 75 | Food |
| 4 | 12 | 103 | "http://image554.com" | "Stunning mountain views" | 2023-03-20 | 105 | Travel |
| 5 | 15 | 104 | "http://image555.com" | "Fashion trends for summer" | 2023-04-08 | 320 | Fashion |

- How many pins have been added in each category?
- Identify the top 5 users with the highest number of pins.
- How many pins have been repinned more than 50 times?
- What's the average number of repins for pins in the "Fashion" category?
- Find users who have pinned images but have never repinned someone else's image.
- For each board, how many pins does it have, and what's the total repin count?
- Identify the month in the past year with the highest number of pins added.
- Which categories have an average repin count above 100?
- Find the top 3 most popular pins (based on repins) in the "Travel" category from the last 6 months.
- Which user has the widest variety of pin categories?
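
For instance, the first question above reduces to a simple `GROUP BY`. Here is a minimal sketch, assuming the sample rows live in a table named `pins` (the table name is not given in the prompt), run against an in-memory SQLite database:

```python
import sqlite3

# Build an in-memory table mirroring the sample rows above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pins (
        pin_id INTEGER, user_id INTEGER, board_id INTEGER,
        image_url TEXT, pin_description TEXT, pin_date TEXT,
        repins_count INTEGER, pin_category TEXT
    )
""")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 12, 101, "http://image101.com", "Beautiful sunset at the beach", "2023-01-05", 230, "Travel"),
        (2, 13, 101, "http://image223.com", "DIY candle making tutorial", "2023-02-10", 15, "DIY"),
        (3, 14, 102, "http://image443.com", "Homemade chocolate chip cookies", "2023-03-07", 75, "Food"),
        (4, 12, 103, "http://image554.com", "Stunning mountain views", "2023-03-20", 105, "Travel"),
        (5, 15, 104, "http://image555.com", "Fashion trends for summer", "2023-04-08", 320, "Fashion"),
    ],
)

# Q1: number of pins added in each category.
rows = conn.execute("""
    SELECT pin_category, COUNT(*) AS pin_count
    FROM pins
    GROUP BY pin_category
    ORDER BY pin_count DESC, pin_category
""").fetchall()
print(rows)  # [('Travel', 2), ('DIY', 1), ('Fashion', 1), ('Food', 1)]
```

The secondary sort key makes tie order deterministic; in an interview, mentioning how ties are broken is an easy way to show attention to detail.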

## Coding

### 1. Given two sorted arrays, find the median of the combined arrays.

```python
def findMedianSortedArrays(nums1, nums2):
    merged = sorted(nums1 + nums2)
    n = len(merged)
    if n % 2 == 1:
        return merged[n // 2]
    else:
        return (merged[n // 2 - 1] + merged[n // 2]) / 2.0
```

### 2. Implement a moving average from a stream.

```python
class MovingAverage:
    def __init__(self, size: int):
        self.size = size
        self.queue = []
        self.sum = 0

    def next(self, val: int) -> float:
        # Evict the oldest value once the window is full.
        if len(self.queue) == self.size:
            self.sum -= self.queue.pop(0)
        self.queue.append(val)
        self.sum += val
        return self.sum / len(self.queue)
```

### 3. Given a dataset with timestamps and values, find sections where the values have been steadily increasing for a given period.

```python
def steady_increase(data, period):
    increasing_sections = []
    start = None
    for i in range(1, len(data)):
        if data[i]['value'] > data[i - 1]['value']:
            if start is None:
                start = data[i - 1]['timestamp']
        else:
            if start is not None and data[i - 1]['timestamp'] - start >= period:
                increasing_sections.append((start, data[i - 1]['timestamp']))
            start = None
    # A run that extends to the end of the data would otherwise be dropped.
    if start is not None and data[-1]['timestamp'] - start >= period:
        increasing_sections.append((start, data[-1]['timestamp']))
    return increasing_sections
```

### 4. Implement an algorithm for stratified sampling

```python
import random

def stratified_sample(data, strat_col, sample_size):
    # Collect the distinct stratum values across all rows.
    unique_strats = {row[strat_col] for row in data}
    strata_samples = []
    for strat in unique_strats:
        strat_data = [row for row in data if row[strat_col] == strat]
        strat_sample = random.sample(strat_data, min(sample_size, len(strat_data)))
        strata_samples.extend(strat_sample)
    return strata_samples
```

### 5. Given a time series of stock prices, implement a function to determine the best time to buy and sell to maximize profit.

```python
def max_profit(prices):
    if not prices:
        return 0
    min_price = prices[0]
    max_profit = 0
    for price in prices:
        if price < min_price:
            min_price = price
        else:
            profit = price - min_price
            max_profit = max(max_profit, profit)
    return max_profit
```

### 6. Implement a function to calculate the Root Mean Squared Error (RMSE) of a model's predictions.

```python
def rmse(predictions, targets):
    differences = [(a - b) ** 2 for a, b in zip(predictions, targets)]
    return (sum(differences) / len(predictions)) ** 0.5
```

### 7. You have a dataset that contains user activities. Implement a function to find users who have logged in for 5 consecutive days.

```python
from collections import defaultdict
from datetime import timedelta

def five_consecutive_logins(data):
    # Group each user's distinct login dates; a single sorted pass over the
    # mixed stream would compare records from different users.
    logins = defaultdict(set)
    for record in data:
        logins[record['user_id']].add(record['date'])
    consecutive_users = set()
    for user_id, dates in logins.items():
        for d in dates:
            # Check whether the four following days were also login days.
            if all(d + timedelta(days=k) in dates for k in range(1, 5)):
                consecutive_users.add(user_id)
                break
    return consecutive_users
```

### 8. Given a list of strings, implement a function to cluster them based on string similarity (e.g., using Jaccard similarity or Levenshtein distance).

```python
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                new_distances.append(distances[index1])
            else:
                new_distances.append(1 + min((distances[index1], distances[index1 + 1], new_distances[-1])))
        distances = new_distances
    return distances[-1]

def cluster_strings(strings, threshold):
    clusters = []
    for string in strings:
        # Assign to the first cluster whose representative is close enough.
        for cluster in clusters:
            if levenshtein(string, cluster[0]) <= threshold:
                cluster.append(string)
                break
        else:
            clusters.append([string])
    return clusters
```

### 9. Implement a function to generate n bootstrap samples from a given dataset.

```python
import random

def bootstrap_samples(data, n):
    samples = []
    for _ in range(n):
        # Sample with replacement, same size as the original dataset.
        sample = [random.choice(data) for _ in data]
        samples.append(sample)
    return samples
```

### 10. Given an array of user reviews, implement a function to determine the top k frequently mentioned words, ignoring common stop words.

```python
from collections import Counter
import re

def top_k_words(reviews, k, stop_words):
    """
    Find the top k frequently mentioned words from an array of user reviews,
    ignoring common stop words.
    :param reviews: List of user reviews.
    :param k: Number of top words to retrieve.
    :param stop_words: Set of words to ignore.
    :return: List of top k words.
    """
    # Convert all reviews to lowercase and tokenize by words.
    # Note: \w+ splits contractions, so "won't" becomes "won" and "t";
    # stop words containing apostrophes never match these tokens.
    words = [word for review in reviews for word in re.findall(r'\w+', review.lower())]
    # Filter out stop words
    words = [word for word in words if word not in stop_words]
    # Count word frequencies and retrieve top k words
    count = Counter(words)
    return [item[0] for item in count.most_common(k)]

# Example
reviews = ["I love this product!", "This is the best thing ever.", "I won't buy this again.", "Ever tried this?"]
stop_words = set(["this", "is", "the", "i", "won't", "ever", "buy"])
print(top_k_words(reviews, 2, stop_words))  # ['love', 'product']
```

## Additional Questions

- Implement the K-Nearest Neighbors (KNN) algorithm to classify new data points in a multi-dimensional space.
- You have a dataset with timestamps of user logins. Implement a function to find peak login times, i.e., times when the maximum number of users are logged in simultaneously.
- Given a set of coordinates (latitude, longitude), cluster them into `n` groups where each group has nearby coordinates (use any clustering algorithm of your choice, such as DBSCAN or hierarchical clustering).
- Implement the gradient descent algorithm for a simple linear regression problem.
- Given a set of features and target values, implement a basic decision tree algorithm to predict the target based on feature values.
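
As a sketch of the gradient descent question above: a minimal batch version for a single feature, assuming squared-error loss and a fixed learning rate (the function name and defaults are illustrative):

```python
def gradient_descent_linreg(xs, ys, lr=0.01, epochs=1000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Example: points lying exactly on y = 2x + 1 recover w ~ 2, b ~ 1.
w, b = gradient_descent_linreg([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], lr=0.05, epochs=5000)
```

In an interview, be ready to discuss the choice of learning rate (too large diverges, too small converges slowly) and when stochastic or mini-batch updates are preferable to full-batch.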

## Leadership & Behavioral Questions

Behavioral and leadership interview questions aim to assess a candidate's interpersonal skills, decision-making capabilities, and leadership potential.

**Collaboration & Teamwork**

- "Can you describe a situation where you had to work with someone whose background or perspective was very different from yours? How did you handle it?"
- "Describe a time when you received constructive criticism on your work. How did you respond, and what changes did you make based on the feedback?"
- "Can you give an example of a time when there was a disagreement in your team about the interpretation of data or the approach to a problem? How did you handle it?"
- "Tell me about a time when you had to collaborate with a non-technical team. How did you ensure effective communication and mutual understanding?"

**Problem-Solving & Continuous Learning**

- "Describe a time when you faced an unexpected challenge in a data project. How did you handle it, and what did you learn from it?"
- "Can you tell me about a time when you had to quickly learn a new tool or technique for a project? How did you approach it, and what was the outcome?"
- "With the rapidly evolving field of data science, how do you keep yourself updated with the latest tools and techniques?"
- "Describe an instance where your initial analysis didn't produce the expected results. How did you adapt and what alternative methods did you employ?"

**Influence & Communication**

- "Have you ever had to persuade a stakeholder to accept an analysis or approach that they initially disagreed with? How did you go about it?"
- "How have you handled situations where your data contradicted the existing beliefs or opinions of senior leadership? Walk me through your communication strategy."

**Decision-Making**

- "Have you ever encountered a situation where the data suggested one course of action, but ethically, another course was the right thing to do? How did you navigate this?"

**Leadership & Initiative**

- "Tell me about a time when you took the lead on a project without being asked. What prompted you to take the initiative, and what was the outcome?"
- "Describe a time when you proposed a new approach or strategy based on your data analysis that had a significant impact on the direction of a project or the business."
- "Have you ever had to abandon a particular approach or method in the middle of a project due to unforeseen challenges? How did you pivot and ensure the project remained on track?"