on my own Data Science job search journey and have been lucky enough to interview with many companies.

The interviews with real people have been a mix of technical and behavioral, and I’ve also gotten my fair share of take-home assessments to complete on my own.

Going through this process, I’ve done a lot of research into the kinds of questions that are commonly asked during data science interviews. These are concepts you should not only be familiar with but also know how to explain.

1. P value


When you run a statistical test, typically you are going to have a null hypothesis H0 and an alternative hypothesis H1. 

Let’s say you are running an experiment to determine the effectiveness of a weight-loss medication. Group A took a placebo and Group B took the medication. You then calculate the mean number of pounds lost over six months for each group and want to see whether the amount of weight lost by Group B is statistically significantly higher than for Group A. In this case, the null hypothesis H0 would be that there is no statistically significant difference in the mean number of pounds lost between the groups, meaning that the medication had no real effect on weight loss. H1 would be that there is a significant difference and Group B lost more weight due to the medication.

To recap:

  • H0: Mean lbs lost Group A = Mean lbs lost Group B
  • H1: Mean lbs lost Group A < Mean lbs lost Group B

You would then conduct a t-test to compare the means and get a p-value. This can be done in Python or other statistical software. However, prior to getting a p-value, you would first choose an alpha (α) value (aka significance level) to compare the p-value to.

The typical alpha value chosen is 0.05, which means that the probability of a Type I error (rejecting the null hypothesis when it is actually true, i.e. saying there is a difference in means when there isn’t) is 0.05, or 5%.

If your p-value is less than alpha, you can reject the null hypothesis. Otherwise, if p ≥ alpha, you fail to reject the null hypothesis.
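For illustration, here’s a minimal sketch of this workflow in Python using scipy. The group values are randomly generated, made-up numbers purely for demonstration:

```python
# Hypothetical weight-loss example: one-sided two-sample t-test with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=4, scale=3, size=50)  # placebo: ~4 lbs lost on average
group_b = rng.normal(loc=7, scale=3, size=50)  # medication: ~7 lbs lost on average

alpha = 0.05  # significance level chosen before running the test

# H0: mean lbs lost (Group B) = mean lbs lost (Group A)
# H1: mean lbs lost (Group B) > mean lbs lost (Group A)
t_stat, p_value = stats.ttest_ind(group_b, group_a, alternative="greater")

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```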

2. Z-score (and other outlier detection methods)

Z-score measures how many standard deviations a data point lies from the mean and is one of the most common outlier detection methods.

In order to understand the z score you need to understand basic statistical concepts such as:

  • Mean — the average of a set of values
  • Standard deviation — a measure of spread between values in a dataset in relation to the mean (also the square root of variance). In other words, it shows how far apart values in the dataset are from the mean.

A z-score value of 2 for a given data point indicates that that value is 2 standard deviations above the mean. A z-score of -1.5 indicates that the value is 1.5 standard deviations below the mean.

Typically, a data point with a z-score of >3 or <-3 is considered an outlier. 
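Here’s a quick sketch of z-score outlier detection in Python; the data is randomly generated with two extreme values injected, just to show the mechanics:

```python
# Made-up data: roughly normal values plus two injected extreme points.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=200)
data = np.append(data, [90, 15])  # inject two obvious outliers

# z = (x - mean) / standard deviation
z_scores = (data - data.mean()) / data.std()

# Flag anything more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]
print("Outliers found:", np.round(outliers, 1))
```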

Outliers are a common problem within data science so it’s important to know how to identify them and deal with them.

To learn more about some other simple outlier detection methods, check out my article on the z-score, IQR, and the modified z-score.

3. Linear Regression


Linear regression is one of the most fundamental ML and statistical models and understanding it is crucial to being successful in any data science role.

At a high level, linear regression models the relationship between one or more independent variables and a dependent variable, and uses the independent variable(s) to predict the value of the dependent variable. It does so by fitting a “line of best fit” to the dataset: a line that minimizes the sum of squared differences between the actual values and the predicted values.

An example is modeling the relationship between temperature and electric energy consumption. When measuring the electric consumption of a building, temperature often affects usage: since electricity is commonly used for cooling, as the temperature goes up, the building uses more energy to cool its spaces.

So we can use a regression model to model this relationship where the independent variable is temperature and the dependent variable is the consumption (since the usage is dependent on the temperature and not vice versa).

Linear regression will output an equation in the format y=mx+b, where m is the slope of the line and b is the y intercept. To make a prediction for y, you would plug your x value into the equation.
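As a rough sketch, here’s how you might fit that temperature/consumption example with scikit-learn. The numbers are invented for illustration:

```python
# Hypothetical building data: temperature (°C) vs electricity consumption (kWh).
import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([18, 21, 24, 27, 30, 33, 36]).reshape(-1, 1)
consumption = np.array([120, 135, 160, 190, 225, 260, 300])

model = LinearRegression()
model.fit(temperature, consumption)

print("slope (m):", model.coef_[0])
print("intercept (b):", model.intercept_)

# Predict y for a new x by plugging it into y = mx + b
print("predicted consumption at 25 °C:", model.predict([[25]])[0])
```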

Linear regression makes four assumptions about the underlying data, which can be remembered with the acronym LINE:

L: Linear relationship between the independent variable x and the dependent variable y.

I: Independence of the residuals. Residuals don’t influence each other. (A residual is the difference between the value predicted by the line and the actual value).

N: Normal distribution of the residuals. The residuals follow a normal distribution.

E: Equal variance of residuals across different x values.

The most common performance metric for linear regression is R², which tells you the proportion of variance in the dependent variable that can be explained by the independent variable(s). An R² of 1 indicates a perfect linear fit, whereas an R² of 0 means the model explains none of the variance in the dependent variable. A good R² tends to be 0.75 or above, but this also varies depending on the type of problem you’re solving.
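For example, here’s a small sketch of computing R² with scikit-learn on made-up actual vs. predicted values (with a fitted scikit-learn model you can also call model.score(X, y)):

```python
# Made-up actual vs. predicted consumption values, just to show the metric.
from sklearn.metrics import r2_score

actual = [120, 135, 160, 190, 225]
predicted = [118, 140, 158, 195, 220]

print("R²:", r2_score(actual, predicted))  # close to 1 => most variance explained
```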

Linear regression is different from correlation. Correlation between two variables gives you a numeric value between -1 and 1 which tells you the strength and direction of the relationship between two variables. Regression gives you an equation which can be used to predict future values based on the line of best fit for past values.
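A quick way to see the difference, using the same kind of made-up temperature/consumption numbers as above:

```python
# Correlation gives a single number; regression gives a predictive equation.
import numpy as np

x = np.array([18, 21, 24, 27, 30, 33, 36], dtype=float)        # temperature
y = np.array([120, 135, 160, 190, 225, 260, 300], dtype=float)  # consumption

r = np.corrcoef(x, y)[0, 1]     # strength and direction, between -1 and 1
m, b = np.polyfit(x, y, deg=1)  # slope and intercept of the line of best fit

print(f"correlation r = {r:.3f}")
print(f"regression line: y = {m:.2f}x + {b:.2f}")
```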

4. Central limit theorem 

The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that the distribution of the sample mean will approach a normal distribution as the sample size becomes larger, regardless of the original distribution of the data.

A normal distribution, also known as the bell curve, is a symmetric, bell-shaped distribution centered on its mean. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.

CLT is based on these assumptions: 

  • Data are independent
  • Population of data has a finite level of variance
  • Sampling is random

A sample size of ≥ 30 is typically seen as the minimum acceptable value for the CLT to hold true. However, as you increase the sample size the distribution will look more and more like a bell curve. 

CLT allows statisticians to make inferences about population parameters using the normal distribution, even when the underlying population is not normally distributed. It forms the basis for many statistical methods, including confidence intervals and hypothesis testing.
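Here’s a small simulation sketch of the CLT: even though the underlying (exponential) population is clearly skewed, the distribution of sample means comes out roughly bell-shaped. All numbers are simulated:

```python
# Simulated skewed population; sample means still behave approximately normally.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # mean ≈ 2, clearly non-normal

# Draw many samples of size 30 and record each sample's mean
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(f"population mean: {population.mean():.3f}")
print(f"mean of sample means: {np.mean(sample_means):.3f}")
print(f"std of sample means: {np.std(sample_means):.3f}")  # ≈ sigma / sqrt(30) ≈ 0.365
```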

5. Overfitting and underfitting


When a model underfits, it has not been able to capture patterns in the training data properly. Because of this, not only does it perform poorly on the training dataset, it performs poorly on unseen data as well.

How to know if a model is underfitting:

  • The model has a high error on the train, cross-validation and test sets

When a model overfits, this means that it has learned the training data too closely. Essentially it has memorized the training data and is great at predicting it, but it cannot generalize to unseen data when it comes time to predict new values.

How to know if a model is overfitting:

  • The model has a low error on the entire train set, but a high error on the test and cross-validation sets

Additionally:

A model that underfits has high bias.

A model that overfits has high variance.

Finding a good balance between the two is called the bias-variance tradeoff. 
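As an illustration, here’s a rough sketch comparing train and test error for polynomial models of increasing complexity on made-up data; the low-degree model underfits (high error everywhere) while the high-degree model overfits (low train error, high test error):

```python
# Made-up noisy nonlinear data; compare train vs test MSE across model complexity.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```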

Conclusion

This is by no means a comprehensive list. Other important topics to review include:

  • Decision Trees
  • Type I and Type II Errors
  • Confusion Matrices
  • Regression vs Classification
  • Random Forests
  • Train/test split
  • Cross validation
  • The ML Life Cycle

I have written other articles covering many of these basic ML and statistics concepts.

It’s normal to feel overwhelmed when reviewing these concepts, especially if you haven’t seen many of them since your data science courses in school. But what’s more important is ensuring that you’re up to date with what’s most relevant to your own experience (e.g. the basics of time series modeling if that’s your speciality), and simply having a basic understanding of these other concepts. 

Also, remember that the best way to explain these concepts in an interview is to use an example and walk the interviewers through the relevant definitions as you talk through your scenario. This will help you remember everything better too.

Thanks for reading

  • Connect with me on LinkedIn
  • Buy me a coffee to support my work!
  • I’m now offering 1:1 data science tutoring, career coaching/mentoring, writing advice, resume reviews & more on Topmate!