Missing values (NA or NaN) are one of the most common problems in data science. NA values are problematic because they can lead to biased results. For example, if you want to know the average revenue of a population and you have missing values, your estimate of that average may be biased.
A naive solution to the problem of missing values is to just keep the NA values. Even though NA values generally cause problems in statistical analysis, some tools can handle them directly. For example, a histogram-based gradient boosting model can deal with NA values natively, and it works with both numerical and categorical variables. However, this solution is not always the best one, and it does not let you explore your data as much as you would like.
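As an illustration, here is a minimal sketch using scikit-learn's `HistGradientBoostingRegressor`, which accepts NaN in the features natively (the data and the `min_samples_leaf` setting are hypothetical, chosen only to make this toy example run):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy data: the NaN values are left in place on purpose
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# The model handles NaN natively: at each split, samples with a
# missing value are routed to the child that most improves the loss.
model = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
print(model.predict(np.array([[np.nan, 2.5]])))
```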
There are different types of missing values:

- MCAR (Missing Completely At Random): the probability of a value being missing is the same for all observations.
- MAR (Missing At Random): the probability of a value being missing depends only on other, observed variables.
- MNAR (Missing Not At Random): the probability of a value being missing depends on the missing value itself.
Why does it matter? Because the way you deal with NA values depends on the type of missingness. For example, if your values are MCAR and your sample size is high enough, you can simply drop the rows with NA. If they are MNAR, your estimates will be biased if you do not take the missingness mechanism into account.
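For instance, under MCAR, dropping incomplete rows (listwise deletion) is a reasonable option. A minimal pandas sketch, with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, np.nan, 120.0],
                   "age": [25.0, 30.0, np.nan]})

# Under MCAR (and a large enough sample), listwise deletion is safe:
df_complete = df.dropna()  # keep only fully observed rows
print(df_complete)
```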
It's a very common practice to replace NA values with the mean or the mode. However, this practice can lead to biased results. For example, if you have missing values in the Age variable and you replace them with the mean, you artificially lower the variance of Age. The good news is that maybe you don't even need to replace the NA values with an estimation!
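Here is a small simulation (entirely hypothetical data) that makes the variance shrinkage visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.normal(40, 12, size=1000))

# Remove ~30% of the values completely at random
age_with_na = age.copy()
age_with_na[rng.random(1000) < 0.3] = np.nan

# Mean imputation: every missing value collapses onto the mean
imputed = age_with_na.fillna(age_with_na.mean())

print(age.var())      # variance of the true data
print(imputed.var())  # noticeably smaller after mean imputation
```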
After each of the three following steps, you should check whether there are still NA values, because there may be none left.
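A quick way to run this check with pandas, assuming your data lives in a DataFrame (toy data below for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "city": ["Paris", None, "Lyon"]})

# Count the remaining NA values per variable
print(df.isna().sum())
```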
The very first thing you should do, before anything else, is to drop observations/variables that are 100% NA. By doing this, you don't take any risk (these observations/variables couldn't be used anyway) and you reduce the number of NA values.
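In pandas, this is what `dropna(how="all")` does; a minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan],  # variable that is 100% NA
})
# Row 3 is also 100% NA (NaN in every column)

df = df.dropna(axis=0, how="all")  # drop observations that are 100% NA
df = df.dropna(axis=1, how="all")  # drop variables that are 100% NA
print(df)
```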
Then, you should check whether some values are structurally missing. For example, if a variable about the number of cigarettes consumed per day contains NA values, check whether those NA values correspond to non-smokers. If that's the case, you can replace them with 0. This can be a lot of work if you have many variables, but it's the best way to deal with structurally missing values. It also means you need a good understanding of the data you are working with.
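Continuing with the cigarettes example, here is a minimal pandas sketch (the `smoker` and `cigarettes_per_day` columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smoker": [True, False, True, False],
    "cigarettes_per_day": [10.0, np.nan, np.nan, np.nan],
})

# Structurally missing: a non-smoker consumes 0 cigarettes per day.
# Note that the NA for the smoker in row 2 is left untouched.
mask = (~df["smoker"]) & df["cigarettes_per_day"].isna()
df.loc[mask, "cigarettes_per_day"] = 0.0
print(df)
```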
If you have categorical variables, you can replace NA values with a new category. For example, if a marital status variable has NA values, you can replace them with "unknown", also known as an explicit NA. By doing this, you don't take any risk and you don't lose information. The one drawback is that you increase the number of categories, and some people who are actually married may end up split between "married" and "unknown", which is not ideal.
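With pandas categoricals, this takes two steps, since the new category has to exist before you can fill with it (hypothetical data):

```python
import pandas as pd

marital = pd.Series(["married", None, "single", None], dtype="category")

# Make the missingness explicit with a dedicated category
marital = marital.cat.add_categories("unknown").fillna("unknown")
print(marital.value_counts())
```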
With these three steps, you have made a good start at reducing the number of NA values without using any estimation, and you have made your data more understandable. However, you might still have NA values, especially in numerical variables.
If you still have NA values, you can turn to more complex solutions. These solutions are more complex because they rely on estimation, modelling, etc. However, they are still easy to implement and can bring a lot of value to your analysis.
We'll go over three solutions, with references on how to implement them in Python and R:
This section is a work in progress.
In this post, we saw that missing values can bias your results, and that the right way to handle them depends on the missingness mechanism (MCAR, MAR, MNAR). We also saw that you can reduce the number of NA values without any estimation: drop observations/variables that are 100% NA, fill structurally missing values, and make NA explicit in categorical variables. Only then, if NA values remain, should you turn to estimation-based solutions.
If you want to go further, check:
This post is the work of Joseph Barbier and Thomas Salanova. If you have any questions, feel free to contact us!