Missing values (NA or NaN) are one of the most common problems in data science. NA values are problematic because they can lead to biased results. For example, if you want to know the average revenue of a population and you have missing values, your estimate of that average may be biased.
A naive solution to the problem of missing values is to just keep the NA values. Even though NA values generally cause problems in statistical analysis, some tools can handle them directly. For example, a histogram-based gradient boosting model can deal with NA values natively, and it works with both numerical and categorical variables. However, this solution is not always the best one, and it does not let you explore your data as much as you would like.
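As an illustration, here is a minimal sketch using scikit-learn's `HistGradientBoostingRegressor`, which accepts NaN in the features natively (the data and the `min_samples_leaf` setting are hypothetical, chosen only to make this toy example run):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy data: the NaN values are left in place on purpose
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# The model handles NaN natively: at each split, samples with a
# missing value are routed to the child that most improves the loss.
model = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
print(model.predict(np.array([[np.nan, 2.5]])))
```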
There are different types of missing values:

- MCAR (Missing Completely At Random): the probability of a value being missing is the same for all observations.
- MAR (Missing At Random): the probability of a value being missing depends only on other, observed variables.
- MNAR (Missing Not At Random): the probability of a value being missing depends on the missing value itself.
Why does it matter? Because the way you deal with NA values depends on the type of missingness. For example, if your values are MCAR and your sample size is high enough, you can simply drop the rows with NA. If they are MNAR, your estimates will be biased if you do not take the missingness mechanism into account.
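For instance, under MCAR, dropping incomplete rows (listwise deletion) is a reasonable option. A minimal pandas sketch, with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, np.nan, 120.0],
                   "age": [25.0, 30.0, np.nan]})

# Under MCAR (and a large enough sample), listwise deletion is safe:
df_complete = df.dropna()  # keep only fully observed rows
print(df_complete)
```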
It's a very common practice to replace NA values with the mean or the mode. However, this practice can lead to biased results. For example, if you have missing values in the Age variable and you replace them with the mean, you artificially lower the variance of Age. The good news is that maybe you don't even need to replace the NA values with an estimation!
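Here is a small simulation (entirely hypothetical data) that makes the variance shrinkage visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.normal(40, 12, size=1000))

# Remove ~30% of the values completely at random
age_with_na = age.copy()
age_with_na[rng.random(1000) < 0.3] = np.nan

# Mean imputation: every missing value collapses onto the mean
imputed = age_with_na.fillna(age_with_na.mean())

print(age.var())      # variance of the true data
print(imputed.var())  # noticeably smaller after mean imputation
```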
After each of the three following steps, you should check whether there are still NA values, because there may be none left.
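A quick way to run this check with pandas, assuming your data lives in a DataFrame (toy data below for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "city": ["Paris", None, "Lyon"]})

# Count the remaining NA values per variable
print(df.isna().sum())
```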
The very first thing you should do, before anything else, is to drop observations/variables that are 100% NA. By doing this, you don't take any risk (these observations/variables couldn't be used anyway) and you reduce the number of NA values.
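In pandas, this is what `dropna(how="all")` does; a minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, np.nan],  # variable that is 100% NA
})
# Row 3 is also 100% NA (NaN in every column)

df = df.dropna(axis=0, how="all")  # drop observations that are 100% NA
df = df.dropna(axis=1, how="all")  # drop variables that are 100% NA
print(df)
```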
Then, you should check whether some values are structurally missing. For example, if a variable about the number of cigarettes consumed per day contains NA values, check whether those NA values correspond to non-smokers. If that's the case, you can replace them with 0. This can be a lot of work if you have many variables, but it's the best way to deal with structurally missing values. It also means you need a good understanding of the data you are working with.
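Continuing with the cigarettes example, here is a minimal pandas sketch (the `smoker` and `cigarettes_per_day` columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "smoker": [True, False, True, False],
    "cigarettes_per_day": [10.0, np.nan, np.nan, np.nan],
})

# Structurally missing: a non-smoker consumes 0 cigarettes per day.
# Note that the NA for the smoker in row 2 is left untouched.
mask = (~df["smoker"]) & df["cigarettes_per_day"].isna()
df.loc[mask, "cigarettes_per_day"] = 0.0
print(df)
```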
If you have categorical variables, you can replace NA values with a new category. For example, if a marital status variable has NA values, you can replace them with "unknown", also known as an explicit NA. By doing this, you don't take any risk and you don't lose information. The one drawback is that you increase the number of categories, and some people who are actually married may end up split between "married" and "unknown", which is not ideal.
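With pandas categoricals, this takes two steps, since the new category has to exist before you can fill with it (hypothetical data):

```python
import pandas as pd

marital = pd.Series(["married", None, "single", None], dtype="category")

# Make the missingness explicit with a dedicated category
marital = marital.cat.add_categories("unknown").fillna("unknown")
print(marital.value_counts())
```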
With these three steps, you have made a good start at reducing the number of NA values without using any estimation, and you have made your data more understandable. However, you might still have NA values, especially in numerical variables.
If you still have NA values, you can turn to more complex solutions. These solutions are more complex because they rely on estimation, modelling, etc. However, they are still easy to implement and can bring a lot of value to your analysis.
We'll go over three solutions, with references on how to implement them in Python and R:
This section is a work in progress.
In this post, we saw that missing values can bias your results, and that the right way to handle them depends on the missingness mechanism (MCAR, MAR, MNAR). We also saw that you can reduce the number of NA values without any estimation: drop observations/variables that are 100% NA, fill structurally missing values, and make NA explicit in categorical variables. Only then, if NA values remain, should you turn to estimation-based solutions.
If you want to go further, check:
This post is the work of Joseph Barbier and Thomas Salanova. If you have any questions, feel free to contact us!