During my experience as a data scientist, one of the most common problems I have faced during the process of data cleaning/exploratory analysis has been **handling missing data.** In the ideal case, all attributes of all objects in the data table have well-defined values. However, in real data sets, it is not unusual for an attribute to contain missing data.

When dealing with prediction tasks in supervised learning, I quickly came to the realization that for a lot of machine learning algorithms available in Python, the task of handling missing data cannot be done naturally. I.e. the omitted instances have to somehow be filled with a placeholder (most likely a number) for them to run smoothly.

Perhaps you as a reader may have come up with a simple solution already:

Why not just ignore those instances in the pre-processing stage and conclude it that way?

After all, it is not our fault that the data is missing and moreover, we should not make any assumptions about the nature of missing data since their true value is *unknowable* in principle. While this may certainly be tempting (if not advisable) in some situations, I will attempt to make the case that imputing (rather than ignoring) missing values can be a better practice that in the end leads to more reliable and unbiased results for our machine learning models.

## Types of missing data

There may be various reasons responsible for why the data is missing. Depending on those reasons, it can be classified into three main types:

1) **Missing completely at random (MCAR)** – Imagine that you print out the data table on a sheet of paper with no missing values and then someone accidentally spills a cup of coffee on it. In this case, the conclusion is that the unknown values of an attribute follow the same distribution as known ones. This is the best case for missing values [1].

2) **Missing at random (MAR)** – In this case, whether a value of an attribute **X** is missing depends on other attributes but is independent of the true value of **X**. For example, if an outdoor air temperature sensor runs out of batteries and the staff forgets to change them because it was raining, we can conclude that temperature values are more likely to be missing when it is raining, so they are dependent on the *rain* attribute. If we compute the average temperature based only on the present values, we would probably overestimate it, since the temperature may be lower when it is raining compared to when it’s not.

3) **Missing not at random (MNAR)** – This usually occurs when the lack of data is directly dependent on its value. For example, a temperature sensor may fail whenever temperatures drop below 0°C. Another example is when people with a certain level of income choose not to disclose that information to a census taker. In this case, it is more difficult to replace the missing values with a reasonable estimate.

It is important to identify these types of missing data since it can help us make certain assumptions about their distribution and therefore improve our chances of making good estimations.
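To make the three mechanisms more tangible, here is a minimal simulation sketch based on the temperature example. All numbers (rain probability, temperature levels, missingness rates) are assumptions chosen purely for illustration. It compares the true mean temperature with the mean of the values that remain observed under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic weather data (assumed values): it is colder on rainy days.
rain = rng.random(n) < 0.3
temperature = np.where(rain, 8.0, 15.0) + rng.normal(0.0, 5.0, n)

# MCAR: every reading has the same 20% chance of being lost.
mcar_missing = rng.random(n) < 0.2
# MAR: readings are lost more often when it rains (depends on `rain` only).
mar_missing = rng.random(n) < np.where(rain, 0.6, 0.05)
# MNAR: the sensor fails whenever the true temperature is below 0 degrees.
mnar_missing = temperature < 0.0

true_mean = temperature.mean()
mcar_mean = temperature[~mcar_missing].mean()
mar_mean = temperature[~mar_missing].mean()
mnar_mean = temperature[~mnar_missing].mean()

print(f"true mean:            {true_mean:.2f}")
print(f"observed mean (MCAR): {mcar_mean:.2f}")
print(f"observed mean (MAR):  {mar_mean:.2f}")
print(f"observed mean (MNAR): {mnar_mean:.2f}")
```

Under MCAR the observed mean stays close to the true one, while under MAR and MNAR the observed values alone overestimate the average temperature in this setup.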

## Ways of handling missing data

First of all, we need to identify which attributes exactly contain missing values, as well as get an idea of their frequency, as shown in the table below:

The sorting of the attributes is in descending order based on the number of instances with unknown values.
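A summary like this can be produced with a few lines of `pandas`. The sketch below uses a small made-up data frame (the column names and values are hypothetical) and counts the missing values per attribute, sorted in descending order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "col1": [2.0, np.nan, 6.0, 1.0],
    "col2": [5.0, 4.0, np.nan, np.nan],
    "col3": [3.0, np.nan, np.nan, np.nan],
    "col4": [6.0, 7.0, 8.0, 9.0],
})

# Count and percentage of missing values per attribute, most affected first.
missing_summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```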

### 2.1 Deleting missing data

In my opinion, if the missing value percentage is above a certain threshold (say, 60%), it does not make much sense to try and impute them because it would likely influence our predictions due to the biased estimations. Deletion of the rows or columns with unknown values would be better suited. For illustrative purposes, suppose the data set looks like this (missing instances are denoted with the **NaN** notation):

The Python `pandas` library allows us to drop the missing values based on the rows that contain them (i.e. drop rows that have at least one `NaN` value):

`import pandas as pd`

`df = pd.read_csv('data.csv')`

`df.dropna(axis=0)`

The output is as follows:

| id | col1 | col2 | col3 | col4 | col5 |
|----|------|------|------|------|------|
| 0  | 2.0  | 5.0  | 3.0  | 6.0  | 4.0  |

Similarly, we can drop columns that have at least one `NaN` in any row:

`df.dropna(axis=1)`

The above code produces:

However, I think that in most scenarios it is better to keep data than discard it. One obvious reason is that removing rows or columns that contain unknown values can throw away a lot of valuable information, especially if we don’t have much data to begin with.
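A middle ground between dropping everything and dropping nothing is to remove only the attributes whose share of missing values exceeds the chosen threshold. The following sketch assumes the 60% threshold mentioned earlier and a hypothetical toy data frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% missing
    "complete": [1, 2, 3, 4, 5],
})

threshold = 0.6  # assumed cut-off: drop columns with more than 60% missing

missing_fraction = df.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index
df_reduced = df.drop(columns=cols_to_drop)

print(df_reduced.columns.tolist())
```

The remaining columns can then be handed over to one of the imputation techniques described in the following sections.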

### 2.2 Simple imputation of missing data

We could use simple interpolation techniques to estimate unknown data. One of the most common interpolation techniques is **mean imputation** [2]. Here, we simply replace the missing values in each column with the mean value of the corresponding feature column.

The `scikit-learn` library offers us a convenient way to achieve this by instantiating the `SimpleImputer` class and then calling its `fit_transform()` method:

`from sklearn.impute import SimpleImputer`

`import numpy as np`

`sim = SimpleImputer(missing_values=np.nan, strategy='mean')`

`imputed_data = sim.fit_transform(df.values)`

After running the code, we get the imputed dataset:

Other imputation strategies are available with this class, for example `'median'`, or `'most_frequent'` in the case of categorical data, which replaces the missing data with the most common category.

This simplistic approach does have its drawbacks, however. For example, by using the mean as an imputation strategy we **do not**:

1) Account for the variability of the missing values, since these values are replaced by a constant.

2) Take into account the potential dependency of the missing data on the other attributes which are present in the data set.
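The first drawback is easy to observe numerically. The sketch below uses synthetic data (the distribution parameters and the 40% missing rate are assumptions for illustration) and shows that the standard deviation of a feature shrinks after mean imputation, because every gap is filled with the same constant.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
x = rng.normal(loc=50.0, scale=10.0, size=1_000)

# Hide roughly 40% of the values completely at random.
x_missing = x.copy()
mask = rng.random(x.size) < 0.4
x_missing[mask] = np.nan

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_imputed = imputer.fit_transform(x_missing.reshape(-1, 1)).ravel()

print(f"std of the original data:  {x.std():.2f}")
print(f"std after mean imputation: {x_imputed.std():.2f}")
```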

That’s why I decided to focus my attention on a few more sophisticated approaches.

### 2.3 Imputation of missing data using machine learning

A more advanced method of imputation is to model an attribute containing unknown values as a target variable which is dependent on the other variables present in the data set and then apply traditional regression or machine learning algorithms to predict its missing instances. A rough mathematical representation could be formulated as follows:

$$y = f(X)$$

where **y** represents the attribute for which we want to predict the missing values and **X** is the set of predictor variables, i.e. the *other* variables. This relationship is most clearly visible in the case of simple linear regression where we have:

$$y = c + bX$$

After we build our simple model we can then use it to predict the unknown values of **y** for which the corresponding **X** values are available. The exact same principle applies to ML algorithms as well, although the relationship between target and predictors cannot be written down so neatly.

Relying on linear regression (or logistic regression for categorical data) to fill the gaps has, of course, its drawbacks as well. Most importantly, *this approach assumes that the target variable (or, in logistic regression, the log odds of the target) depends linearly on the predictors*, even though this may not be the case at all.
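As a minimal sketch of regression-based imputation, the example below generates synthetic data that really does follow a linear relationship (the coefficients, noise level, and column names are assumptions for illustration), hides part of **y**, and fills the gaps with the predictions of scikit-learn's `LinearRegression`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data following y = c + b*X + noise, with c = 3 and b = 2 (assumed).
df = pd.DataFrame({"X": rng.uniform(0, 10, 200)})
df["y"] = 3.0 + 2.0 * df["X"] + rng.normal(0, 0.5, 200)
df.loc[df.index[:40], "y"] = np.nan  # hide the first 40 values of y

# Fit on the rows where y is known, then predict y where it is missing.
known = df["y"].notnull()
model = LinearRegression().fit(df.loc[known, ["X"]], df.loc[known, "y"])
df.loc[~known, "y"] = model.predict(df.loc[~known, ["X"]])

print(f"c = {model.intercept_:.2f}, b = {model.coef_[0]:.2f}")
```

Because the data were generated from a linear model, the fit works well here; with a non-linear relationship the same code would still run but produce systematically biased imputations.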

For this reason, I have chosen to perform imputation using ML algorithms that are able to also capture non-linear relationships. The modus operandi can be summarized in the following pseudocode:

#### For each attribute containing missing values do:

1) *Substitute missing values in the other variables with temporary placeholder values derived solely from the non-missing values using a simple imputation technique*

2) *Drop all rows where the values are missing for the current variable in the loop*

3) *Train an ML model on the remaining data set to predict the current variable*

4) *Predict the missing values of the current variable with the trained model (when the current variable will be subsequently used as an independent predictor in the models for other variables, both the observed and predicted values in this step will be used).*

**Firstly**, as you probably noticed, I have performed a simple form of imputation (median) already in the first step. This is necessary because there may be multiple features with missing data present, and in order for them to be used as predictors for other features, their gaps need to be temporarily filled somehow.

Secondly, the prediction of missing data is done in a “progressive” manner in the sense that variables which were imputed in the previous iteration are used as predictors along with those imputed values. So at each iteration except the first, we are relying on the predictive power of our model to fill the remaining gaps.

Thirdly, given that the data set provided in this case contained a mix of data types, I have employed ML regressors (for continuous attributes) as well as classifiers (for categorical attributes) to cover all possible scenarios.

In the subsequent sections, I have listed all the ML models used in this study, along with small snippets of code that demonstrate their implementation in Python.

#### 2.3.1 Imputation of missing data using Random Forests

###### Quick data preprocessing tips

Before training a model on the data, it is necessary to perform a few preprocessing steps first:

**Scale the numeric attributes** (apart from our target) to make the algorithm find a better solution quicker.

This can be achieved using `scikit-learn`’s `StandardScaler()` class:

`from sklearn.preprocessing import StandardScaler`

`X = df.values`

`standard_scaler = StandardScaler()`

`x_scaled = standard_scaler.fit_transform(X)`

**Encode the categorical data** so that the representation of each category of an attribute is in a binary 1 (present) – 0 (not present) fashion. This is necessary because most models cannot handle non-numerical features naturally.

We can do this by using the `pandas` `get_dummies()` function:

`import pandas as pd`

`encoded_country = pd.get_dummies(df['Country'])`

`# join() returns a new DataFrame, so the result has to be assigned back`

`df = df.join(encoded_country)`

`del df['Country']`

The first ML model used was scikit-learn’s `RandomForestRegressor`. Random forests are an ensemble of individual decision trees, each trained on a bootstrap sample of the data (bagging), that make predictions by averaging the predictions of all the individual trees. They tend to be resistant to overfitting because the errors of the individual trees partly cancel each other out. If you want to learn more, refer to [3].

##### Below is a small snippet that translates the above pseudocode into actual Python code:

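`import numpy as np`

`import pandas as pd`

`from sklearn.ensemble import RandomForestRegressor`

`from sklearn.impute import SimpleImputer`

`from sklearn.preprocessing import StandardScaler`

`for numeric_feature in num_features:`

`    # step 1: temporarily fill all gaps with the column medians`

`    sim = SimpleImputer(missing_values=np.nan, strategy='median')`

`    df_temp = pd.DataFrame(sim.fit_transform(df), index=df.index, columns=df.columns)`

`    # restore the gaps of the current target variable`

`    df_temp[numeric_feature] = df[numeric_feature]`

`    # step 2: keep only the rows with a known target for training`

`    df_train = df_temp[~df_temp[numeric_feature].isnull()]`

`    y = df_train[numeric_feature].values`

`    df_train = df_train.drop(columns=numeric_feature)`

`    df_test = df_temp[df_temp[numeric_feature].isnull()]`

`    df_test = df_test.drop(columns=numeric_feature)`

`    # fit the scaler on the training rows only, then reuse it for the test rows`

`    standard_scaler = StandardScaler()`

`    x_scaled = standard_scaler.fit_transform(df_train.values)`

`    test_scaled = standard_scaler.transform(df_test.values)`

`    # steps 3 and 4: train the model and predict the missing values`

`    rf_regressor = RandomForestRegressor()`

`    rf_regressor = rf_regressor.fit(x_scaled, y)`

`    pred_values = rf_regressor.predict(test_scaled)`

`    df.loc[df[numeric_feature].isnull(), numeric_feature] = pred_values`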

Categorical feature imputation is done in a similar way. In this case, we are dealing with a classification task and should use the `RandomForestClassifier` class.
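A minimal sketch of the categorical case could look as follows. The data are synthetic and the column names (`num1`, `num2`, `category`) are hypothetical; for brevity, the predictors here are complete, so the temporary imputation step is not needed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
n = 300

# Synthetic data: the category depends on num1, so the predictors are informative.
df = pd.DataFrame({"num1": rng.normal(size=n), "num2": rng.normal(size=n)})
df["category"] = pd.Series(np.where(df["num1"] > 0, "A", "B"), dtype=object)
true_labels = df["category"].copy()

# Hide 30 labels to simulate missing values.
missing_idx = rng.choice(n, size=30, replace=False)
df.loc[missing_idx, "category"] = np.nan

categorical_feature = "category"
is_missing = df[categorical_feature].isnull()
predictors = [c for c in df.columns if c != categorical_feature]

# Train on the rows with a known category, predict the rest.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
rf_classifier.fit(df.loc[~is_missing, predictors],
                  df.loc[~is_missing, categorical_feature])
df.loc[is_missing, categorical_feature] = rf_classifier.predict(
    df.loc[is_missing, predictors])

accuracy = (df.loc[is_missing, categorical_feature] == true_labels[is_missing]).mean()
print(f"share of correctly imputed categories: {accuracy:.2f}")
```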

Important note on using categorical features as predictors:

In my opinion, it is correct to perform temporary imputation of categorical features before encoding them.

Consider the below example where the *Country* feature has already been encoded before beginning the imputation procedure:

| id | Austria | Italy | Germany |
|----|---------|-------|---------|
| 0  | 0       | 1     | 0       |
| 1  | 1       | 0     | 0       |
| 2  | 0       | 0     | 1       |
| 3  | NaN     | NaN   | NaN     |

If we apply a simple imputation using the most frequent value for example, we would get the following result on the last row:

| id | Austria | Italy | Germany |
|----|---------|-------|---------|
| 0  | 0       | 0     | 0       |

This is a logical mistake in the representation since each row should contain exactly one column that takes 1 as value to denote the presence of a particular country. We can avoid this mistake by imputing before encoding since we are guaranteed to fill the missing values with a certain country value.
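The difference between the two orders can be checked with a short sketch. The toy *Country* column below is hypothetical (one extra "Italy" row was added so that the most frequent value is unambiguous), and `pd.get_dummies` is used for the encoding as before.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing country.
df = pd.DataFrame({
    "Country": pd.Series(["Italy", "Austria", "Germany", "Italy", np.nan],
                         dtype=object)
})

# Encoding first: the row with the missing value ends up with no country at all.
encoded_first = pd.get_dummies(df["Country"])
rows_without_country = int((encoded_first.sum(axis=1) == 0).sum())

# Imputing first: fill the gap with the most frequent country, then encode.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputed = pd.Series(imputer.fit_transform(df[["Country"]]).ravel(),
                    index=df.index, name="Country")
encoded_after = pd.get_dummies(imputed)

print("rows without any country when encoding first:", rows_without_country)
print("every row has exactly one country when imputing first:",
      bool((encoded_after.sum(axis=1) == 1).all()))
```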

#### 2.3.2 Imputation of missing data using XGBoost

The XGBoost algorithm is an optimized implementation of gradient boosting. Similar to random forests, XGBoost is a tree-based ensemble, but its trees are built sequentially, each one trying to correct the errors of the previous ones, rather than independently of each other. For more information, check out the official documentation. **The XGB model can actually handle missing values on its own**, so it is not necessary to perform temporary simple imputation on predictor variables, i.e. we could skip the first step in the pseudocode.

Training and prediction of missing values are done in a similar fashion to the random forest approach:

`import xgboost as xgb`

`.`

`.`

`.`

`xgbr = xgb.XGBRegressor()`

`xgbr = xgbr.fit(x_scaled, y)`

`pred_values = xgbr.predict(test_scaled)`

`.`

`.`

`.`

#### 2.3.3 Imputation of missing data using Keras Deep Neural Networks

Neural networks follow a fundamentally different approach during training compared to tree-based estimators. In my work, I have used the neural network implementation offered by the Keras library. Below I wrote an example demonstrating its application in Python:

`import tensorflow as tf`

`from keras.models import Sequential`

`from keras.layers import Dense`

`.`

`.`

`.`

`model = Sequential()`

`model.add(Dense(30, input_dim=input_layer_size, activation='relu'))`

`model.add(Dense(30, activation='relu'))`

`# identity activation in the output layer for regression`

`model.add(Dense(1))`

`# in case of multi-class classification (n_classes = number of categories):`

`# model.add(Dense(n_classes, activation='softmax'))`

`model.compile(loss='mean_squared_error', optimizer='adam')`

`# in case of multi-class classification:`

`# model.compile(loss='categorical_crossentropy', optimizer='adam')`

`model.fit(x_scaled, y)`

`pred_values = model.predict(test_scaled)[:, 0]`

`.`

`.`

`.`

#### 2.3.4 Imputation of missing data using Datawig

Datawig is another deep learning model I employed. It is designed specifically for missing value imputation and relies on deep neural networks built with MXNet to make predictions. It can work with missing data during training and it automatically handles categorical data with its `CategoricalEncoder` class, so we don’t need to pre-encode them. A possible way of implementation is the following:

`import datawig`

`.`

`.`

`.`

`imputer = datawig.SimpleImputer(input_columns=list(df_test.columns),`

`                                output_column=numeric_feature,  # the column to impute`

`                                output_path='imputer_model')  # stores model data and metrics`

`# Fit the imputer model on the train data:`

`imputer.fit(train_df = scaled_df_train)`

`# Alternatively, we could use the fit_hpo() method to find`

`# the best hyperparameters:`

`# imputer.fit_hpo(train_df = scaled_df_train)`

`# Impute missing values, return original dataframe with predictions`

`pred_vals = imputer.predict(scaled_df_test).iloc[:, -1:].values[:, 0]`

`.`

`.`

`.`

Datawig is optimized for `pandas` `DataFrames`, meaning that it takes dataframe objects directly as input for training and prediction, so we do not need to transform them into NumPy arrays.

Moreover, we should not drop the target variable column from the training set and input it as a separate argument as we did previously when fitting a model. Datawig handles this automatically.

#### 2.3.5 Imputation of missing data using IterativeImputer

The `scikit-learn` package also offers a more sophisticated approach to data imputation with the `IterativeImputer()` class. So where does this approach differ from the ones we saw before? The name gives us a hint.

Iterative means that each feature is imputed multiple times. Each iteration can be called a *cycle*. The reason behind running multiple cycles is to achieve some sort of ‘convergence’. Looking at the `scikit-learn` documentation, it is not entirely clear what exactly this means, but you can think of convergence in terms of *stabilization* of the predicted values:

`from sklearn.experimental import enable_iterative_imputer # noqa`

`from sklearn.impute import IterativeImputer`

`iim=IterativeImputer(estimator=xgb.XGBRegressor(),`

`initial_strategy='median',`

`max_iter=10,`

`missing_values=np.nan,`

`skip_complete=True)`

`# impute all the numeric columns containing missing values`

`# with just one line of code:`

`imputed_df = pd.DataFrame(iim.fit_transform(df))`

`imputed_df.columns = df.columns`

From the code above, we can see that each feature is imputed up to 10 times (`max_iter=10`; the imputer may stop earlier if its tolerance criterion `tol` is met) and in the end, we get the imputed values of the last cycle.

Notice how I used an XGBoost regressor as model input. This shows that `IterativeImputer` also accepts ML models that are not native to the `scikit-learn` library, as long as they follow its estimator interface.

Despite being easy to implement, it takes a very large amount of time to calculate compared to the other approaches.

In addition, *I would advise using this class with care since it is still in its experimental stages*.

## Comparing Performances of ML Algorithms

Up until this point, we have seen various techniques that can be employed to impute missing data, as well as the actual process of imputation. However, I have not explained how we can compare the quality of the predictions provided by these approaches.

This is not immediately obvious, because well, we do not possess the missing data to compare them to the predictions. Instead what I decided to do is keep a holdout or validation set from training data, and then use it for model performance evaluation.

So, we are pretending that some data is missing and inferring the actual accuracy of the imputed values based on the accuracy of the imputations on these fake missing values. The snippet below enables us to do this:

`from sklearn.metrics import r2_score`

`import random`

`.`

`.`

`.`

`train_copy = df_train.copy()`

`random.seed(23)`

`current_feat = train_copy[numeric_feature]`

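`# number of rows (20%) whose target value we hide`

`n_missing = int(current_feat.size * 0.2)`

`i = sorted(random.sample(range(current_feat.shape[0]), n_missing))`

`# assign on the data frame itself rather than on a column view`

`train_copy.iloc[i, train_copy.columns.get_loc(numeric_feature)] = np.nan`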

`y_fake_test = df_train.iloc[i, :][numeric_feature].values`

`new_train_df = train_copy[~train_copy[numeric_feature].isnull()]`

`fake_test_df = train_copy[train_copy[numeric_feature].isnull()]`

`train_y = new_train_df[numeric_feature].values`

`del new_train_df[numeric_feature]`

`del fake_test_df[numeric_feature]`

`rf_regressor = rf_regressor.fit(new_train_df.values, train_y)`

`train_pred = rf_regressor.predict(new_train_df.values)`

`test_pred = rf_regressor.predict(fake_test_df.values)`

`print("R2 train:{} | R2 test:{}".format(r2_score(train_y, train_pred), r2_score(y_fake_test, test_pred)))`

The prediction quality, or goodness of fit, can be measured by the coefficient of determination, which is expressed as:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where **RSS** is the sum of squared residuals, and **TSS** represents the total sum of squares. Below I have plotted a visual comparison of the model performances for several attributes. Visualization was done utilizing the `seaborn` library.
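As a small worked example of the coefficient of determination, the sketch below computes RSS and TSS by hand for four made-up values (the numbers are purely illustrative) and checks the result against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals = 1.5
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20.0
r2_manual = 1.0 - rss / tss

print(f"{r2_manual:.3f}")                     # 0.925
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```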

#### Comparing model prediction accuracy on various attributes

We can see that the random forest model consistently ranks among the best.

To get another hint at the consistency of RF, I have plotted the actual values against the predicted values in the test set for the VAR_1 variable:

#### Plotting actual values against predicted ones for Var_1

Ideally, the line in any graph should be a straight, diagonal one. The model which comes closest to this is the random forest, which was ultimately my choice for imputation.

## Potential Future Steps

Another interesting technique for imputation, which you can employ in the future, is the Multiple Imputation by Chained Equations (MICE) method. This takes iterative imputation up a notch. The core idea behind it is to create multiple copies of the original data set (usually 5 to 10 are enough) and perform iterative imputations on each of them. The results obtained from each data set can later be pooled together according to some metrics that we define.

Ultimately, the goal is to somehow account for the variability of the missing data and study the effects of different permutations on the prediction results. The scheme below illustrates this:

In Python, MICE is offered by a few libraries like `impyute` or `statsmodels`. However, they are limited to linear regression estimators.

Another way to mimic the MICE approach would be to run `scikit-learn`’s `IterativeImputer` many times on the same dataset using different random seeds each time.
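A rough sketch of this idea is shown below. It is not a full MICE implementation: it simply runs `IterativeImputer` with `sample_posterior=True` (so that each run draws different imputations) under several random seeds and pools the results by averaging. The synthetic data, the number of imputations, and the pooling by a plain mean are all assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)

# Synthetic data: the third column depends on the first two.
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(0, 0.3, 200)

# Hide roughly 25% of the third column.
X = X_full.copy()
mask = rng.random(200) < 0.25
X[mask, 2] = np.nan

# Create several imputed copies of the data set, one per random seed.
n_imputations = 5
imputed_sets = []
for seed in range(n_imputations):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10,
                               random_state=seed)
    imputed_sets.append(imputer.fit_transform(X))

stacked = np.stack(imputed_sets)      # shape: (n_imputations, 200, 3)
pooled = stacked.mean(axis=0)         # pooled point estimate per cell
between_std = stacked.std(axis=0)     # spread across the imputed copies

print("mean spread of the imputed cells:", between_std[mask, 2].mean())
```

The spread across the copies gives a rough indication of how uncertain each imputed value is, which is exactly the kind of variability that a single imputation hides.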

Yet another take on the imputation problem is to apply a technique called **maximum likelihood estimation**. It derives missing values from a user-defined distribution function whose parameters are chosen in a way that maximizes the likelihood of the imputed values actually occurring.

## Conclusion

We got a glimpse of what the potential approaches for handling missing values are, from the simplest techniques like deletion to more complex ones like iterative imputation.

In general, there is no best way to solve imputation problems and solutions vary according to the nature of the problem, size of the data set, etc. However, I hope to have convinced you that an ML-based approach has inherent value because it offers us a ‘universal’ way out. While missing data may be truly unknowable, we can at least try to come up with an educated guess based on the hidden relationships with the already existing attributes, captured and exposed to us by the power of machine learning.

## References

[1] Berthold M.R., and others, Data understanding, in: *Guide to Intelligent Data Analysis*, Springer, London, pp. 37-40, 42-44.

[2] Raschka S., Data preprocessing, in: *Python Machine Learning*, Packt, Birmingham, pp. 82-83, 90-91.

[3] Tan P., and others, Data preprocessing, Classification, Ensemble methods, in: *Introduction to Data Mining*, Addison Wesley, Boston, pp. 187-188, 289-292.