on my own Data Science job search journey and have been lucky enough to interview with many companies.

The interviews with real people have been a mix of technical and behavioral, and I’ve also gotten my fair share of take-home assessments to complete on my own.

Going through this process, I’ve done a lot of research into the kinds of questions that are commonly asked during data science interviews. These are concepts you should not only be familiar with but also know how to explain.

1. P value


When you run a statistical test, typically you are going to have a null hypothesis H0 and an alternative hypothesis H1. 

Let’s say you are running an experiment to determine the effectiveness of a weight-loss medication. Group A took a placebo and Group B took the medication. You then calculate the mean number of pounds lost over six months for each group and want to see whether the amount of weight lost by Group B is statistically significantly higher than for Group A. In this case, the null hypothesis H0 would be that there is no statistically significant difference in the mean number of pounds lost between the groups, meaning that the medication had no real effect on weight loss. H1 would be that there is a significant difference and Group B lost more weight due to the medication.

To recap:

  • H0: Mean lbs lost Group A = Mean lbs lost Group B
  • H1: Mean lbs lost Group A < Mean lbs lost Group B

You would then conduct a t-test to compare the means and get a p-value. This can be done in Python or other statistical software. However, prior to getting a p-value, you would first choose an alpha (α) value (aka significance level) to compare the p-value to.

The typical alpha value chosen is 0.05, which means that the probability of a Type I error (rejecting the null hypothesis when it is actually true, i.e. saying there is a difference in means when there isn’t) is 0.05, or 5%.

If your p-value is less than alpha, you can reject the null hypothesis. Otherwise, if p ≥ alpha, you fail to reject the null hypothesis.
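For illustration, here’s a minimal sketch of this workflow in Python using scipy. The group values are randomly generated, made-up numbers purely for demonstration:

```python
# Hypothetical weight-loss example: one-sided two-sample t-test with scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=4, scale=3, size=50)  # placebo: ~4 lbs lost on average
group_b = rng.normal(loc=7, scale=3, size=50)  # medication: ~7 lbs lost on average

alpha = 0.05  # significance level chosen before running the test

# H0: mean lbs lost (Group B) = mean lbs lost (Group A)
# H1: mean lbs lost (Group B) > mean lbs lost (Group A)
t_stat, p_value = stats.ttest_ind(group_b, group_a, alternative="greater")

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```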

2. Z-score (and other outlier detection methods)

Z-score measures how many standard deviations a data point lies from the mean and is one of the most common outlier detection methods.

In order to understand the z score you need to understand basic statistical concepts such as:

  • Mean — the average of a set of values
  • Standard deviation — a measure of spread between values in a dataset in relation to the mean (also the square root of variance). In other words, it shows how far apart values in the dataset are from the mean.

A z-score value of 2 for a given data point indicates that that value is 2 standard deviations above the mean. A z-score of -1.5 indicates that the value is 1.5 standard deviations below the mean.

Typically, a data point with a z-score of >3 or <-3 is considered an outlier. 
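Here’s a quick sketch of z-score outlier detection in Python; the data is randomly generated with two extreme values injected, just to show the mechanics:

```python
# Made-up data: roughly normal values plus two injected extreme points.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=200)
data = np.append(data, [90, 15])  # inject two obvious outliers

# z = (x - mean) / standard deviation
z_scores = (data - data.mean()) / data.std()

# Flag anything more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]
print("Outliers found:", np.round(outliers, 1))
```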

Outliers are a common problem within data science so it’s important to know how to identify them and deal with them.

To learn more about some other simple outlier detection methods, check out my article on the z-score, IQR, and the modified z-score.

3. Linear Regression


Linear regression is one of the most fundamental ML and statistical models and understanding it is crucial to being successful in any data science role.

At a high level, linear regression models the relationship between one or more independent variables and a dependent variable, and uses the independent variable(s) to predict the value of the dependent variable. It does so by fitting a “line of best fit” to the dataset: a line that minimizes the sum of squared differences between the actual values and the predicted values.

An example is modeling the relationship between temperature and electric energy consumption. When measuring the electric consumption of a building, temperature often affects usage: since electricity is commonly used for cooling, as the temperature goes up, the building uses more energy to cool its spaces.

So we can use a regression model to model this relationship where the independent variable is temperature and the dependent variable is the consumption (since the usage is dependent on the temperature and not vice versa).

Linear regression will output an equation in the format y=mx+b, where m is the slope of the line and b is the y intercept. To make a prediction for y, you would plug your x value into the equation.
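As a rough sketch, here’s how you might fit that temperature/consumption example with scikit-learn. The numbers are invented for illustration:

```python
# Hypothetical building data: temperature (°C) vs electricity consumption (kWh).
import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([18, 21, 24, 27, 30, 33, 36]).reshape(-1, 1)
consumption = np.array([120, 135, 160, 190, 225, 260, 300])

model = LinearRegression()
model.fit(temperature, consumption)

print("slope (m):", model.coef_[0])
print("intercept (b):", model.intercept_)

# Predict y for a new x by plugging it into y = mx + b
print("predicted consumption at 25 °C:", model.predict([[25]])[0])
```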

Linear regression makes four assumptions about the underlying data, which can be remembered with the acronym LINE:

L: Linear relationship between the independent variable x and the dependent variable y.

I: Independence of the residuals. Residuals don’t influence each other. (A residual is the difference between the value predicted by the line and the actual value).

N: Normal distribution of the residuals. The residuals follow a normal distribution.

E: Equal variance of residuals across different x values.

The most common performance metric for linear regression is R², which tells you the proportion of variance in the dependent variable that can be explained by the independent variable(s). An R² of 1 indicates a perfect linear fit, whereas an R² of 0 means the model explains none of the variance in the dependent variable. A good R² tends to be 0.75 or above, but this also varies depending on the type of problem you’re solving.
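For example, here’s a small sketch of computing R² with scikit-learn on made-up actual vs. predicted values (with a fitted scikit-learn model you can also call model.score(X, y)):

```python
# Made-up actual vs. predicted consumption values, just to show the metric.
from sklearn.metrics import r2_score

actual = [120, 135, 160, 190, 225]
predicted = [118, 140, 158, 195, 220]

print("R²:", r2_score(actual, predicted))  # close to 1 => most variance explained
```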

Linear regression is different from correlation. Correlation between two variables gives you a numeric value between -1 and 1 which tells you the strength and direction of the relationship between two variables. Regression gives you an equation which can be used to predict future values based on the line of best fit for past values.
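A quick way to see the difference, using the same kind of made-up temperature/consumption numbers as above:

```python
# Correlation gives a single number; regression gives a predictive equation.
import numpy as np

x = np.array([18, 21, 24, 27, 30, 33, 36], dtype=float)        # temperature
y = np.array([120, 135, 160, 190, 225, 260, 300], dtype=float)  # consumption

r = np.corrcoef(x, y)[0, 1]     # strength and direction, between -1 and 1
m, b = np.polyfit(x, y, deg=1)  # slope and intercept of the line of best fit

print(f"correlation r = {r:.3f}")
print(f"regression line: y = {m:.2f}x + {b:.2f}")
```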

4. Central limit theorem 

The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that the distribution of the sample mean will approach a normal distribution as the sample size becomes larger, regardless of the original distribution of the data.

A normal distribution, also known as the bell curve, is a symmetric, bell-shaped distribution centered on its mean. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1.

CLT is based on these assumptions: 

  • Data are independent
  • Population of data has a finite level of variance
  • Sampling is random

A sample size of ≥ 30 is typically seen as the minimum acceptable value for the CLT to hold true. However, as you increase the sample size the distribution will look more and more like a bell curve. 

CLT allows statisticians to make inferences about population parameters using the normal distribution, even when the underlying population is not normally distributed. It forms the basis for many statistical methods, including confidence intervals and hypothesis testing.
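Here’s a small simulation sketch of the CLT: even though the underlying (exponential) population is clearly skewed, the distribution of sample means comes out roughly bell-shaped. All numbers are simulated:

```python
# Simulated skewed population; sample means still behave approximately normally.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # mean ≈ 2, clearly non-normal

# Draw many samples of size 30 and record each sample's mean
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(f"population mean: {population.mean():.3f}")
print(f"mean of sample means: {np.mean(sample_means):.3f}")
print(f"std of sample means: {np.std(sample_means):.3f}")  # ≈ sigma / sqrt(30) ≈ 0.365
```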

5. Overfitting and underfitting


When a model underfits, it has not been able to capture patterns in the training data properly. Because of this, not only does it perform poorly on the training dataset, it performs poorly on unseen data as well.

How to know if a model is underfitting:

  • The model has a high error on the train, cross-validation and test sets

When a model overfits, this means that it has learned the training data too closely. Essentially it has memorized the training data and is great at predicting it, but it cannot generalize to unseen data when it comes time to predict new values.

How to know if a model is overfitting:

  • The model has a low error on the entire train set, but a high error on the test and cross-validation sets

Additionally:

A model that underfits has high bias.

A model that overfits has high variance.

Finding a good balance between the two is called the bias-variance tradeoff. 
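As an illustration, here’s a rough sketch comparing train and test error for polynomial models of increasing complexity on made-up data; the low-degree model underfits (high error everywhere) while the high-degree model overfits (low train error, high test error):

```python
# Made-up noisy nonlinear data; compare train vs test MSE across model complexity.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```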

Conclusion

This is by no means a comprehensive list. Other important topics to review include:

  • Decision Trees
  • Type I and Type II Errors
  • Confusion Matrices
  • Regression vs Classification
  • Random Forests
  • Train/test split
  • Cross validation
  • The ML Life Cycle

I have written other articles covering many of these basic ML and statistics concepts.

It’s normal to feel overwhelmed when reviewing these concepts, especially if you haven’t seen many of them since your data science courses in school. But what’s more important is ensuring that you’re up to date with what’s most relevant to your own experience (e.g. the basics of time series modeling if that’s your speciality), and simply having a basic understanding of these other concepts. 

Also, remember that the best way to explain these concepts in an interview is to use an example and walk the interviewers through the relevant definitions as you talk through your scenario. This will help you remember everything better too.

Thanks for reading

  • Connect with me on LinkedIn
  • Buy me a coffee to support my work!
  • I’m now offering 1:1 data science tutoring, career coaching/mentoring, writing advice, resume reviews & more on Topmate!