This article is republished with permission from the author from Medium's Towards Data Science blog. View the original here.
For someone working or trying to work in data science, statistics is probably the biggest and most intimidating area of knowledge you need to develop. The goal of this post is to reduce what you need to know to a finite number of concrete ideas, techniques, and equations.
Of course, that’s an ambitious goal. If you plan to be in data science for the long term, you should still expect to keep learning statistical concepts and techniques throughout your career. But what I’m aiming to do is provide a baseline so that you can get through your interviews and start practicing data science as quickly and painlessly as possible. I’ll also end each section with key terms and resources for further reading. Let’s dive in.
Probability underpins statistics and often comes up in interviews. It’s worth learning the basics, not just so you can make it past the typical probability brain teasers that interviewers like to ask, but also because it’ll enhance and solidify your understanding of statistics.
Probability is about random processes. The classic examples are things like flipping coins and rolling dice: probability gives you a framework for determining things like the number of 6s you’d expect to roll over a certain number of throws, or the likelihood of flipping a fair coin 10 times without a single head coming up. While these examples might seem pretty abstract, they are actually important ideas for analyzing human behavior and other domains that deal with non-deterministic processes, and are crucial to data scientists.
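As a rough illustration of the two examples above, here is a minimal Python sketch that computes the exact answers and checks one of them by simulation. The function names, the choice of 60 rolls, and the number of simulated trials are all illustrative assumptions, not part of the original article.

```python
import random
from fractions import Fraction

def expected_sixes(n_rolls):
    """Expected number of 6s in n_rolls of a fair die: n * 1/6."""
    return Fraction(n_rolls, 6)

def prob_no_heads(n_flips):
    """Probability that n_flips fair coin flips all come up tails: (1/2)^n."""
    return Fraction(1, 2) ** n_flips

def simulate_sixes(n_rolls, n_trials, seed=0):
    """Average number of 6s per trial, estimated by simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        total += sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return total / n_trials

print(expected_sixes(60))         # exact expectation for 60 rolls
print(prob_no_heads(10))          # exact probability of 10 tails in a row
print(simulate_sixes(60, 2_000))  # should land close to the exact expectation
```

The simulation is not needed to get the answer here, but comparing a closed-form result with a quick simulation is a useful habit for checking your reasoning on harder brain teasers.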
The approach I favor for learning or re-learning probability is to start with combinatorics, which provides some intuition on how random processes behave, and then move on to how we derive the rules of expectation and variance from those processes. Being comfortable with these topics should let you pass a typical data science interview.
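To make the path from counting to expectation and variance concrete, the sketch below uses only Python's standard library. The helper names (`expected_value`, `variance`) and the specific examples (picking 2 of 5 items, a fair die, 3 heads in 5 flips) are assumptions chosen for illustration.

```python
import math
from fractions import Fraction

# Counting: ordered arrangements (permutations) vs. unordered selections (combinations).
n_perms = math.perm(5, 2)  # ways to pick 2 of 5 items when order matters
n_combs = math.comb(5, 2)  # ways to pick 2 of 5 items when order does not matter

def expected_value(outcomes):
    """E[X] = sum of x * P(x) over a dict mapping value -> probability."""
    return sum(x * p for x, p in outcomes.items())

def variance(outcomes):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expected_value(outcomes)
    return sum((x - mu) ** 2 * p for x, p in outcomes.items())

# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Probability of exactly 3 heads in 5 fair flips, by counting outcomes:
# favorable sequences divided by all 2^5 equally likely sequences.
p_three_heads = Fraction(math.comb(5, 3), 2 ** 5)

print(n_perms, n_combs)
print(expected_value(die), variance(die))
print(p_three_heads)
```

Using `Fraction` keeps the arithmetic exact, which makes it easier to compare against a hand calculation on a whiteboard.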
To prepare specifically for the type of probability questions you’re likely to get asked, I’d find some example questions (this is a reasonable list but there are many others too) and work through them on a whiteboard. Practice making probability trees to help visualize and think through the problems.
Key Ideas: Random variables, continuous versus discrete variables, permutations, combinations, expected value, variance
Intimately related to the topics above are probability distributions. A probability distribution is just a function that describes how likely it is that a single observation of a random variable is equal to a particular value or falls within a range of values. In other words, for any given random process there is both a range of values that are possible and a likelihood that a single draw from the random process will take on one of those values. Probability distributions provide the likelihood for all possible values of a given process.
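As one concrete case, the number of heads in 10 fair coin flips follows a binomial distribution. The sketch below writes out its probability mass function and cumulative distribution function by hand. The function names and the choice of n = 10, p = 0.5 are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k), the running total of the PMF."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))               # probabilities over all possible values sum to 1
print(binomial_pmf(5, n, p))  # likelihood of exactly 5 heads
print(binomial_cdf(2, n, p))  # likelihood of 2 or fewer heads
```

The PMF answers "how likely is this exact value?" and the CDF answers "how likely is this value or anything below it?", which are the two views of a distribution that come up most often.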
As with probability, knowing distributions is a prerequisite to understanding inferential and predictive statistics, but you might also get interview questions specifically about them. The most typical example is: you have a process that behaves like X— what distribution would you use to model that process? Answering these types of questions is just a matter of mapping random processes to a sensible distribution, and this blog post does a great job of explaining just how to do that.
Key Ideas: Probability density function, cumulative distribution function, skew, kurtosis, uniform distribution, normal (gaussian) distribution, other distributions described in the blog post
The Central Limit Theorem and Hypothesis Testing
Once you have a grasp on probability and distributions, you can focus on how scientists do inference. The key insight is that once you have the tools to describe how probability distributions behave, the descriptive statistics we use to summarize data (often just the mean) can be modeled as aggregates of random variables.
Conveniently, there is a theorem that tells us that, given a large enough sample of independent draws, the sample mean of a random variable is approximately normally distributed, regardless of the variable’s own distribution (as long as its variance is finite). This is called the Central Limit Theorem (CLT), and I’ve written about it in some detail here. That article is a good place to go for an introduction or a refresher if you’ve already studied this stuff.
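A quick simulation can make the CLT less abstract. The sketch below repeatedly samples from a skewed exponential distribution and looks at how the sample means behave. The distribution, the sample size of 50, and the number of repetitions are arbitrary choices made for illustration.

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of repeated samples drawn from a skewed Exponential(1) distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=5_000, sample_size=50)

# Exponential(1) has mean 1 and standard deviation 1, so the CLT suggests the
# sample means are roughly Normal(1, 1 / sqrt(50)).
center = statistics.fmean(means)
spread = statistics.stdev(means)
predicted_spread = 1 / 50 ** 0.5

# For a normal distribution, about 95% of values fall within 1.96 standard deviations.
within = sum(abs(m - 1) <= 1.96 * predicted_spread for m in means) / len(means)

print(center, spread, predicted_spread, within)
```

The spread of the sample means is the standard error, and it shrinks with the square root of the sample size. Even though the underlying data are heavily skewed, the share of means inside the 1.96 band should come out close to 95%.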
Using the CLT, we can assess the likelihood that a given mean came from a particular distribution, an idea that allows us to test hypotheses. For example we might have an average of a group of people’s heights and want to test the hypothesis that it came from a random process whose average is greater than 6 feet. Knowing that the means are normally distributed allows us to assess this proposition and reject, or fail to reject, our hypothesis.
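The height example can be sketched as a one-sided z-test. This is a simplified version that leans on the normal approximation. In practice a t-test is the usual choice for small samples. The heights are simulated with made-up parameters, and the function name and the framing of the null hypothesis (mean at most 6 feet) are assumptions for illustration.

```python
import random
import statistics
from statistics import NormalDist

def one_sided_z_test(sample, null_mean):
    """Test H0: mean <= null_mean against H1: mean > null_mean.

    Uses the normal approximation that the CLT justifies for large samples.
    Returns the z statistic and the one-tailed p-value.
    """
    n = len(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    z = (statistics.fmean(sample) - null_mean) / standard_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Simulated heights in inches (made-up parameters, not real measurements).
rng = random.Random(42)
heights = [rng.gauss(73, 3) for _ in range(200)]

z, p_value = one_sided_z_test(heights, null_mean=72)  # 72 inches = 6 feet
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

The p-value is the probability of seeing a sample mean at least this far above 72 inches if the null hypothesis were true. The 0.05 cutoff is a convention rather than a law, and you should be ready to explain that in an interview.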
The interview questions on hypothesis testing will either be about mapping certain scenarios to an appropriate test, or about elaborating on some of the key ideas of hypothesis testing: p-values, standard errors, etc. The readings below will cover the latter type of question. However, for the former, practice is the best approach. Grab some sample data sets and try coming up with practical questions, then articulate hypotheses and pick tests which will allow you to assess them. Assume you’ll have to justify these decisions to your interviewer, and practice those explanations accordingly.
Key Ideas: Central limit theorem, distribution of sample statistics, standard error, p-value, one-tailed versus two-tailed test, type-one and type-two error, T-test, other hypothesis tests
Randomization and Inference
To continue with the idea above, testing the hypothesis that the average height among a population is equal to 6 feet is reasonable. More often, though, as a data scientist you’re interested in causal questions. That is, you want to know whether performing X will lead to Y.
For instance, a question along the lines of “Does living in California make you taller?” is more the type of question a scientist would want to answer. The naive approach is to measure the height of people in California and test the hypothesis that their average height is greater than the average height among non-Californians. Unfortunately, simply measuring and comparing observed data will generally yield a biased estimate of the true causal effect. In this example, there are numerous things correlated with living in California that may also affect someone’s height, so we don’t actually know whether it’s living in California that makes people taller, or other variables that may be responsible.
The fix for this is randomization. We can randomly assign people to live in California or not live in California and then measure the heights of those individuals. This ensures that the treatment is the only thing that is systematically different between the two groups, and so any difference in average height beyond what chance would produce can be attributed to living in California.
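The difference between observational and randomized comparisons can be seen in a small simulation. In the sketch below the true effect of the treatment is zero by construction, and a hidden confounder both raises the outcome and (in the observational case) makes treatment more likely. Every number in it is made up for illustration.

```python
import random
import statistics

def simulate(n, randomized, seed=0):
    """Return the difference in mean outcome, treated minus untreated.

    The true causal effect of treatment is zero by construction. A hidden
    confounder raises the outcome and, in the observational case, also
    makes treatment more likely.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        confounder = rng.random() < 0.5
        if randomized:
            gets_treatment = rng.random() < 0.5  # coin flip, ignores the confounder
        else:
            gets_treatment = rng.random() < (0.8 if confounder else 0.2)
        outcome = 68 + (2.0 if confounder else 0.0) + rng.gauss(0, 1)
        (treated if gets_treatment else untreated).append(outcome)
    return statistics.fmean(treated) - statistics.fmean(untreated)

observational_diff = simulate(20_000, randomized=False)
randomized_diff = simulate(20_000, randomized=True)
print(observational_diff)  # noticeably above zero despite no true effect
print(randomized_diff)     # close to zero
```

In the observational run the treated group simply contains more people with the confounder, so the naive comparison reports an effect that does not exist. Random assignment balances the confounder across groups, and the spurious difference disappears.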
This is why businesses conduct experiments or, as they’re referred to in industry terms, A/B tests. When you want to understand the true causal effect of decisions or products on business metrics, a randomized experiment is the most reliable way to be confident in the results.
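One common way to read out an A/B test on a conversion metric is a two-proportion z-test. The sketch below is a minimal version under the usual large-sample assumptions. The function name and the conversion counts are hypothetical.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed z-test for a difference in conversion rates between arms A and B."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both arms share one rate, so pool the data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 10,000 users per arm, 10% vs. 11% conversion.
z, p_value = two_proportion_z_test(conv_a=1_000, n_a=10_000, conv_b=1_100, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Because users were assigned to arms at random, a small p-value here can be read as evidence about the causal effect of the change, which is not true of the same test run on observational data.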
Unlike probability or distributions, outside of very specialized roles it’s unlikely that any single part of your interview will focus on causality. That said, understanding why correlation does not imply causation and when it’s necessary to run true randomized tests versus using observational data is very important and guaranteed to be a topic that comes up during the course of a data science interview. Your preparation in this area will probably be limited to reading, rather than whiteboarding or problem-solving, but it’s still incredibly important.
If you want to go a little deeper, Columbia hosts some great material on causal statistical inference. Among other things, it introduces the potential outcomes framework and the Rubin Causal Model, which I’d highly recommend for anyone interested in experiments.
Lastly, we come to prediction. This is the stuff that a lot of people are most excited about — it includes topics as diverse as image recognition, video recommendations, web search, and text translation. Obviously this is a huge area, but I’m assuming you’re interviewing for a more generalist position, in which case expertise in any of the areas will not be assumed.
Instead you want to be able to take any particular prediction problem an interviewer throws at you and provide a reasonable approach to start solving it. Mostly, this means being ready to discuss how you’d pick a model, assess that model’s effectiveness, and then improve on the model. When interviewing, I’d break down the problem into those three steps.
When choosing a model, you mostly want to base your decision on the following: the type and distribution of the outcome variable, the nature of the relationship between dependent and independent variables, the amount of data you have, and the desired level of interpretability. Again, there are no right answers here (though there are often wrong ones), so you just want to be able to have an intelligent discussion about the decisions you’d make and the tradeoffs they imply.
You might also get asked about what kind of features you’d want to include as independent variables (predictors) in your model. This is primarily an exercise in domain knowledge: it’s more about understanding the industry and which pieces of data are likely to predict the outcome of interest than about statistics. The discussion might also drift into feature engineering, which would involve having some intuition on how and when to transform your variables, and data-driven ways of selecting your predictors (e.g. regularization, dimensionality reduction, and automated feature selection).
Assessing a model is a relatively straightforward art that involves holdout data sets used to validate your model and mitigate any overfitting issues. The wiki on this topic is probably sufficient for a baseline. Additionally you want to be familiar with the different evaluation metrics: accuracy, ROC curves, confusion matrices, etc. This stuff is much less open-ended and I wouldn’t expect to go into microscopic detail about it. A cursory understanding of why holdout sets are necessary, and the pros and cons of different evaluation metrics should suffice.
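Here is a minimal, standard-library sketch of the holdout idea together with a few of the evaluation metrics mentioned above. The toy data, the one-parameter "model" (a cutoff on a single feature), and all function names are assumptions made for illustration. In practice you would more likely reach for a library such as scikit-learn.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives/negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts

def metrics(cm):
    """Accuracy, precision, and recall from a confusion matrix."""
    total = sum(cm.values())
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }

def accuracy_at(cutoff, data):
    """Share of rows where the rule 'predict 1 if x > cutoff' matches the label."""
    return sum(int(x > cutoff) == y for x, y in data) / len(data)

# Toy data: one feature x, label is 1 when x plus noise exceeds 0.5.
rng = random.Random(1)
rows = []
for _ in range(400):
    x = rng.random()
    rows.append((x, int(x + rng.gauss(0, 0.1) > 0.5)))
train, test = train_test_split(rows)

# "Fit" on the training rows only: choose the cutoff that maximizes training
# accuracy, then score that cutoff on the held-out rows.
best_cutoff = max((c / 100 for c in range(101)), key=lambda c: accuracy_at(c, train))
cm = confusion_matrix([y for _, y in test], [int(x > best_cutoff) for x, _ in test])
print(best_cutoff, metrics(cm))
```

The key point is that the cutoff never sees the test rows while it is being chosen, so the reported metrics are an honest estimate of how the model would do on new data.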
The third step would be improvement. Mostly this is just a rehash of the feature engineering topics, and the decision about whether it’s necessary to collect more data. When interviewing, make sure your first stab at a model leaves room for you to make improvements — otherwise you’ll have a hard time answering the inevitable follow-up on how you could make it better.
Key Ideas: Regression versus classification, linear versus non-linear, supervised versus unsupervised, feature engineering, cross validation, ROC curve, precision-recall, bias-variance trade-off, boosting, bagging
Please Don’t Memorize Models
I’ve outlined an approach to learning statistics for data science interviews that starts at the fundamentals and builds up to more advanced techniques. This is not arbitrary — it’s because understanding the mathematical building blocks will allow you to reason effectively about different models, make good decisions, and speak intelligently about topics or techniques you’ve never thought about before.
The opposite approach, and unfortunately one that I myself and others have tried, is to start instead at the top of the pyramid and just memorize different techniques. This is incredibly ineffective as you end up trying to understand a bunch of isolated ideas out of context, since you’re lacking the basics that glue everything together and help you reason with new ideas. Please don’t do this. Start with probability, move to distributions, and then tackle inference and prediction. You’ll have an easier time, I promise.
Extra Credit: Time Series and Bayes
At this point, I’ve talked about fairly traditional approaches to inference and prediction, but I haven’t covered two large areas of statistics that treat these problems very differently.
One is analysis of time series data, the study of data over time and the special techniques you need to apply when the data-generating processes are not static. The second is Bayesian statistics, which takes an entirely different approach to statistics by making the decision to incorporate prior knowledge about a domain into assessments of probability. Both areas are important and worth knowing, but for a typical generalist interview, it would be unusual to go very deep in either of these areas.
However, if you’re interested in learning Bayesian stats now, I’d recommend this DataScience.com post as a beginner-friendly introduction. Time series is introduced sensibly in this post.
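For a small taste of how Bayesian statistics folds prior knowledge into an estimate, here is a minimal sketch of the standard Beta-Binomial update for a coin's probability of heads. The prior, the observed flips, and the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def update_beta(prior_alpha, prior_beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, the posterior
    is Beta(alpha + successes, beta + failures).
    """
    return prior_alpha + successes, prior_beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

# Hypothetical prior belief: the coin is probably close to fair.
prior = (10, 10)
# Hypothetical data: 8 heads in 10 flips.
posterior = update_beta(*prior, successes=8, failures=2)

print(beta_mean(*prior))      # prior estimate of P(heads)
print(Fraction(8, 10))        # estimate from the data alone
print(beta_mean(*posterior))  # posterior estimate sits between the two
```

The posterior estimate is pulled away from the raw data toward the prior, and the pull weakens as more data arrive. That trade-off between prior belief and evidence is the core idea to be able to explain.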