Dealing with missing values in the data

Oded Bizan
Jan 18, 2017
5 min read

A common issue data analysts need to deal with is missing values. There are really two issues to deal with:

To treat or not to treat the missing values?
If the decision is to treat the missing values, what is the most appropriate way?

To treat or not to treat, that is the question…

The answer, not surprisingly, is “it depends“.

It depends on the purpose of the analysis, on the percentage of missing values in the data, and on the variables for which values are missing: a dependent variable or an independent variable.

The key question when deciding whether to treat or not is: Does the remaining data representative of the entire data, given the purpose of the analysis? To illustrate this, consider the following example.

A regional U.S. bank is developing a model to predict the 12-month probability of default of business loans. The region is a major touristic magnet and therefore about 75% of the loans are related to tourism-related businesses including, for example, resort development, restaurants, transportation, or other services. The other 25% of the loans in the bank’s portfolio are made to small local businesses. Such businesses do not typically collect systematic data on their customers. By contrast, tourist-centered businesses collect this type of data obsessively. As a result, when model developers in the bank set up the data for their failure prediction model, they note that there are many missing values for some of the key predictors. More importantly, they note that missing values are more common for smaller loans typically given to local businesses.

Let’s, for simplicity, assume that there is only one factor that affects the probability that a loan will default within 12 months and that information about this factor (i.e., its values) is missing for small businesses more frequently than for large businesses. Then, if we were to characterize default behavior based on the information that is available for this factor, we would characterize default mostly based on the behavior of large businesses. In fact, our characterization will not be very useful because we would not be able to use it for small businesses (because our characterization is mostly applicable for large businesses) or for large businesses (because the little information we used about small businesses distorted our characterization of large businesses).

If we were interested only in large businesses we could simply focus on them and ignore small businesses. If we were interested only in small businesses then we would need information on enough of them to be able to make any characterization. But since we know that for many of the small businesses information about the key factor is missing we would have to check if we have information about enough of the small businesses. More importantly, we would have to make sure that there is no underlying reason for missing information (that is, it is missing at random – more on that, later in this post). If these two conditions (enough information and random missingness) hold we would be able to characterized default behavior for small businesses.

But what if the conditions do not hold? In this case we would have to “treat” the data to make it more “complete” but without distorting the overall relationship between our key factor and default behavior.

Alternative ways to treat the data

The best approach for treating missing values depends on the nature of missingness, so before deciding how to treat missing values we need to understand the “missingness mechanism”. That is, we need to understand why values are missing. Generally, there are 4 types of missingness:

Missing completely at random – each value has the same probability of being missing. This is the only case where complete-case analysis would not distort our results.
Missing at random – the probability of a value being missing depends on available information (i.e., it depends on other variable for which we have complete information).
Missingness that depends on unobserved information (e.g., it depends on other variable for which we do not have complete information)
Missingness that depends on the missing value itself – the probability of missingness is higher for certain value ranges. The extreme case, where all values are missing for a certain range (e.g., values higher than a certain threshold), is called censored data.

When missingness is not at random (types 3 and 4), it should be modeled explicitly. Otherwise, the analysis will have some degree of bias.

Ultimately, there are 3 ways to treat missing values:

Deleting the observations (a.k.a., complete-case analysis)
Deleting the variable
Imputing values for those that are missing

a. Mean/median/mode imputation

b. Prediction

i. Multiple imputation

ii. Single imputation

Complete-case analysis - Unless the nature of missingness is completely at random, this is the worst alternative for dealing with missing data. The only advantage for this approach is that it is easy to implement. There are however multiple problems that arise when adopting this alternative. First, deleting observations for which some information is missing results in deleting all the information for that observation. The result is that the remaining sample is no longer representative and that the power of the results is diminished because it relies on less information. Second, if observations with missing values differ systematically from observations without missing values the complete-case analysis would be biased. Third, because the sample we analyze depends on missingness that is different across variables, it is not possible to compare analyses because the sample is different every time. Fourth, if many variables are used in a model there may be very few observations left with full information.

Deleting the variable(s) - If one or few variables have most of the missing values then removing (or not using) the variable(s) can leave the modeler with many observations. The problem, however, is that the variable(s) may be an important factor for the model. In this case, the model would be substantially weakened or, even worst, completely misspecified.

Imputation of values - There are two broadly-defined approaches that are used to infer missing values: (a) substituting missing values with the mean/median/mode of the entire sample or based on subgroups, and (b) imputation of missing values based on predictive models. Either approach is preferable to deleting observations or deleting the variable(s). Inferring missing values based on sample mean/median/mode, which is simpler to implement, tend to reduce variability within the data. As a result, the outputs obtained from models using the data appear more precise than they actually are. Imputation based on predictive models (regression techniques, machine learning methods, or data mining techniques), is harder to implement (although most statistical packages have built-in modules designed for this purpose) and requires substantially more computing power. Furthermore, when the proportion of missing values is large the preferred approach is multiple imputation where several complete data sets are generated and are used to create separate models which are then integrated into a single model. In this case, computing power may become prohibitive.