

Financial Data Extraction from Statements with Machine Learning

Data is the foundation that drives the whole decision-making process in the finance ecosystem. With the growth of fintech services, collecting this data has become far more accessible, and it becomes necessary for a data scientist to develop a set of information extraction tools that automatically fetch and store the relevant data. Doing so facilitates financial data extraction, which before the development of these tools was done manually, a tedious and time-consuming task.

One of the main sources of this “key knowledge” in finance is financial statements, which offer important insight into a company through its performance, operations, cash flows, and balance sheets. While this information is usually provided in text-based formats or other data structures like spreadsheets or data frames (which can easily be consumed using parsers or converters), it also arrives as other document formats or images in a semi-structured fashion, which in addition varies between sources. In this post we will go through the different approaches we used to automate data extraction from these financial statements, which were provided by different external sources.

Financial Data Extraction: Problem introduction

In our particular case, the data consisted of different semi-structured financial statements provided as PDFs and images, each one following a particular template layout. These financial statements contain relevant company-related information (company name, industry sector, address), different financial metrics, and balance sheets. For us, the extraction task consists of retrieving all the relevant information of each document for every entity class and storing it as a key-value pair (e.g. company_name -> CARDO AI). Since the composition of information differs between documents, we end up with different clusters of them. Going even further, the information inside a document is expressed through different typologies of data (text, numerical, tables, etc.).

Sample of a financial statement containing relevant entity classes

Two main problems emerge here: first, we have to find an approach that solves the task for one type of document; second, by inductive reasoning, we have to form a broader approach for the general problem, one that applies to the whole data set. Note that we are trying to find a single solution that works in the same way for all of this diverse data; treating every single document with a separate approach would miss the whole point of the task at hand.

“An automated system won’t solve the problem. You have to solve a problem, then automate the solution”

Methodology and solution

Text extraction tools

We started with text extraction tools: Tabula for tabular data, and PDFMiner and Tesseract for text data. Tabula scrapes tabular data from PDF files, while PDFMiner and Tesseract are text extraction tools that gather text data from PDFs and images respectively. These tools work by converting the text in visual representations (PDFs and images) into textual data (document text). The issue with Tabula was that it only works on tabular data, whereas the most relevant information in our financial documents is not always represented in tabular format.
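
For illustration, here is a minimal sketch of how these tools are typically invoked from Python (the file names are hypothetical and the exact APIs may differ slightly between package versions):

import tabula                                  # tabula-py wrapper around Tabula
import pytesseract                             # wrapper around the Tesseract OCR engine
from PIL import Image
from pdfminer.high_level import extract_text   # pdfminer.six

# Tabular data: returns a list of DataFrames, one per detected table
tables = tabula.read_pdf("statement.pdf", pages="all")

# Raw text from a text-based PDF
pdf_text = extract_text("statement.pdf")

# Raw text from a scanned page or image
image_text = pytesseract.image_to_string(Image.open("statement_page1.png"))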

Meanwhile, when we applied the other tools, PDFMiner and Tesseract, the output raw text was completely unstructured and barely human-readable (with unnecessary white spaces or confusing words containing special characters). This text was hard to break down into the meaningful entity classes we want to extract. This was clearly not enough, so we had to explore other approaches.

GPT-2

Before moving on, we made an effort to pre-process the text produced by the above-mentioned extraction tools, and for that we tried GPT-2 [1]. GPT-2 is a large transformer-based language model with 1.5 billion parameters developed by OpenAI, and it was considered a great breakthrough in the field of NLP. This model, and also its successor GPT-3, has achieved strong performance on many NLP tasks, including text generation and translation, as well as several tasks that require on-the-fly reasoning or domain adaptation. In our case, we tried to exploit one of its capabilities: text summarization. After getting a considerable amount of text from the previous extraction tools, we tried to summarize this information with the GPT-2 model and strip out irrelevant content, taking advantage of the attention mechanism of the transformer model. This approach did not work well, given how hard it is to summarize unstructured text. Apart from that, there is always the possibility of the model removing important information from the text, and we cannot give it the benefit of the doubt in this regard.
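
For reference, a rough sketch of the kind of “TL;DR” prompting we experimented with, shown here with the Hugging Face transformers pipeline (the model choice and generation parameters are illustrative, not our exact configuration):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

raw_text = "...text extracted from the financial statement..."
# GPT-2 has no dedicated summarization head; a common trick is to append
# "TL;DR:" and let the model continue the prompt with a short summary.
prompt = raw_text + "\nTL;DR:"
output = generator(prompt, max_new_tokens=60, do_sample=False)
summary = output[0]["generated_text"][len(prompt):]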

Bounding boxes relationship – OpenCV

The unpromising results of the above approaches made us entertain the idea of treating this as an object detection task using computer vision. Object detection outputs a bounding box around each object of interest along with a class label; a relationship graph can then be constructed between these “boxed” entities [2] (see image below). Following this method, we tried to do the same with our documents, but instead draw boxes around text that represents an identifiable entity and label each box with the name of the entity it contains. The next step would have been to develop an algorithm that computes a metric representing the relationship between these boxes based on their spatial positions. We could then train a machine learning model that learns from these relationship values and sequentially decides the position of the next entity given the document locations of the previous ones.

The model creates a relationship graph between entities
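
As a simplified illustration of the boxing step, word-level bounding boxes can be obtained with Tesseract and drawn with OpenCV (the file name is hypothetical, and the relationship metric itself is not shown):

import cv2
import pytesseract

image = cv2.imread("statement_page1.png")
# image_to_data returns word-level boxes: left, top, width, height, text, conf, ...
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

boxes = []
for left, top, w, h, text in zip(data["left"], data["top"],
                                 data["width"], data["height"], data["text"]):
    if text.strip():
        boxes.append((left, top, w, h, text))
        cv2.rectangle(image, (left, top), (left + w, top + h), (0, 255, 0), 2)

cv2.imwrite("boxed_statement.png", image)
# The next step would be to score pairwise spatial relationships between the boxes.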

However, that was not an easy task: it is very hard to determine the right box that represents a distinct, meaningful component in the report, and, as mentioned above, different documents follow different layouts, so the position of the information we want to extract is arbitrary within the document. Hence, the algorithm described above might be inaccurate in determining the position of every box. We moved on to seek a better plan.

Named Entity Recognition

An example of NER annotation

Named-entity recognition (NER) is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

During our research into new approaches for this task, we came upon the expression “named entity”, which generally refers to entities for which one or many strings, such as words or phrases, stand consistently for some referent. We then discovered Named Entity Recognition: the task of locating and classifying words from an unannotated block of text into predefined categories. This task is typically solved either with linguistic grammar-based techniques or with statistical and deep learning NLP models; we opted for the latter. Conceptually the task is divided into two distinct problems: detecting the names, and classifying them into the categories they fall into. Hence we started to look at different implementations to design our language model.

NLP model – spaCy

At this point, our path was fairly straightforward thanks to the ease of use that modern NLP libraries offer. For this job we decided to go with spaCy [3], which offers a simple and flexible API for many NLP tasks, Named Entity Recognition being one of them. The design pipeline can be conceptualized with the diagram below:

Solution pipeline

Before designing the model we first have to construct the training data set. It essentially consists of annotated blocks of text, or “sentences”, that contain the value of each entity together with the entity name itself. First, we extract the text of the paragraphs where the desired information is present, using the previously mentioned extraction tools. Then we annotate this text with the found categories by providing the start and end character positions of each entity value in the text. In doing so, we also provide some context to the model by keeping the nearby words around the annotated span. All the information retrieved from the financial statements can then easily be stored in a CSV file. In spaCy this is represented with the structure below:

TRAIN_DATA = [
    ("Cardo AI SRL is a fintech company", {"entities": [(0, 12, "Company")]}),
    ("Company is based in Italy", {"entities": [(20, 25, "LOC")]}),
]

After preparing the dataset, we designed the NLP model by choosing between the alternatives that spaCy provides. We started from a blank, untrained model and outlined the input and output of the model. We split the data into train and test sets and then trained the model following this pipeline: the training text is first tokenized using the Doc module, which means breaking the text down into individual linguistic units, and the annotated text in the supported format is then parsed with the GoldParse module before being fed into the model's training pipeline.

Training pipeline
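
For reference, a minimal sketch of what this looks like with spaCy v2's API (the iteration count and dropout are illustrative, and newer spaCy versions replace this loop with a config-driven training workflow):

import random
import spacy

nlp = spacy.blank("en")              # start from a blank English model
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner, last=True)

# Register every entity label present in the annotations (TRAIN_DATA as defined above)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(30):              # illustrative number of passes
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
    print(epoch, losses)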

Results and constraints

After training the model on about 800 input rows and testing on 200, we got these evaluations:

Evaluations

The evaluation results seemed promising, but that may also have been because the model was over-fitting or because there was not much variability in our data. After the model was trained, all we had to do was feed it text taken from the input reports, after they had been divided into boxed paragraphs, and collect the output represented as key-value pairs.
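
At inference time, turning the model's predictions into key-value pairs is then roughly a one-liner:

doc = nlp(paragraph_text)  # paragraph_text: a boxed paragraph taken from the report
extracted = {ent.label_: ent.text for ent in doc.ents}
# e.g. {"Company": "Cardo AI SRL", "LOC": "Italy"}
# note: if several entities share a label, this simple dict keeps only the last one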

Constraints

Lack of data
– to avoid bias or over-fitting, the model should be trained on a relatively large amount of data.
– acquiring all this data is not an easy process, not to mention the data pre-processing step.
Ambiguous output
– the model may output more than one value per entity, which leads to inconsistency when interpreting the results.
Unrecognizable text
– some financial statements contain poorly rendered text that the extraction tools cannot correctly recognize.
Numerical values
– with the many numerical values in the reports, it is hard to distinguish which labels they actually represent.

Potential future steps toward better Financial Data Extraction

In recent years, convolutional neural networks have shown great success in various computer vision tasks such as classification and object detection. Seeing the problem from a computer vision perspective as document segmentation (creating bounding boxes around the text contained in the document and classifying them into categories) is a promising way to proceed. For that, the magic formula might be called “R-CNN” [4].

By following this path, we might be able to resolve the above-mentioned issues we encountered with our solution. Integrating several different approaches together may also improve the overall accuracy of the labeling process.

Once the solution is stable and the model's accuracy is satisfactory, we need to streamline the whole workflow. For a machine learning model, it is important to be fed an abundant amount of new data, which improves overall performance and makes predictions of future observations more reliable. To achieve that, it becomes necessary to build an automated retraining pipeline for the model, with the workflow displayed in the following diagram:

Workflow diagram

Conclusion

We went through and reviewed a couple of different approaches we attempted for solving the Named Entity Recognition task on financial statements. From this trial-and-error journey, the best method turned out to be Natural Language Processing models trained on our own data and labels.

But despite obtaining seemingly satisfactory results in our case study on financial data extraction, there is still room for improvement. The one thing we know for sure is that the machine learning approach above provided us the best results, and following the same path for this task is the way to go. With the right framing of the problems in each domain, machine learning is becoming an ever more powerful tool to make use of.

References

[1] Better Language Models and Their Implications

[2] Object Detection with Deep Learning: A Review

[3] spaCy Linguistic Features, Named Entity

[4] Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Introducing Marti – CardoAI’s chatbot!

The chatbot trend has taken the commercial industry by storm: having bots and digital assistants is becoming not only the norm but the primary channel of communication for providing services to prospective clients.

With increasing interest and research in Artificial Intelligence (AI), Natural Language Processing (NLP) and machine learning, bots are becoming progressively more efficient. Nowadays, the usage of chatbots is ubiquitous, synonymous with booking a holiday or ordering in a restaurant.

In fintech, however, chatbots serve a more tailored purpose, supporting users along their journey through an application: from setting up accounts in digital banking solutions and assisting in the lending process via online forms, to the lending robots that carry out investment strategies for retail or institutional investors.

Chatbots have become an intelligent solution for the financial and banking industry. They have eliminated long queues at branches, saving time and energy and giving customers the liberty to get their work done from anywhere without compromising safety. [1]

For this reason, CardoAI's next project is developing and improving Marti – the chatbot integrated into the digital lending product to guide our users and help them navigate the platform like a personal assistant.

Say hello to Marti!

Upon logging in to the digital lending product, Marti pops up, ready to accommodate the user's needs through an interactive chat on the side that is available at all times.

Marti greeting users upon sign in

Thanks to Marti, all our users are now able to check their portfolio's balance, the status of trades in different countries and currencies, and several other statistics that help them grasp the portfolio's position and performance, all through a simple request by chat.

A sample interaction with Marti

Marti is also clever enough to suggest intents: possible actions that the user might request and information they may want to retrieve within a couple of seconds.

Not only can the bot display real-time financial information, but it also processes this data to provide insights and graphs based on it.

Marti displaying in-chat graphs

The chatbot is capable of depicting numerous types of charts, such as line charts and doughnut charts, allows the user to select and download files within it, and can even change the theme of the whole platform!

Our data science team's goal is, through Marti, to automate and facilitate all the operations that are already available within the platform, enabling the end user to retrieve data and navigate the platform in the easiest way possible.

Disclaimer: The statistics shown on this post are obfuscated to preserve our clients’ data.

References:

[1] How AI Chatbots Play A Role In The Fintech Industry

This was a collaborative post between myself and Elitjon Metaliaj.

Handling missing values & data in Machine Learning Modelling with Python


During my experience as a data scientist, one of the most common problems I have faced during the process of data cleaning/exploratory analysis has been handling missing data. In the ideal case, all attributes of all objects in the data table have well-defined values. However, in real data sets, it is not unusual for an attribute to contain missing data.

When dealing with prediction tasks in supervised learning, I quickly came to the realization that many of the machine learning algorithms available in Python cannot handle missing data natively; that is, the missing entries have to somehow be filled with a placeholder (most likely a number) for them to run smoothly.

Perhaps you as a reader may have come up with a simple solution already:

Why not just ignore those instances in the pre-processing stage and be done with it?

After all, it is not our fault that the data is missing and moreover, we should not make any assumptions about the nature of missing data since their true value is unknowable in principle. While this may certainly be tempting (if not advisable) in some situations, I will attempt to make the case that imputing (rather than ignoring) missing values can be a better practice that in the end leads to more reliable and unbiased results for our machine learning models.

Types of missing data

There may be various reasons responsible for why the data is missing. Depending on those reasons, it can be classified into three main types:

1) Missing completely at random (MCAR) – Imagine that you print out the data table on a sheet of paper with no missing values and then someone accidentally spills a cup of coffee on it. In this case, the conclusion is that the unknown values of an attribute follow the same distribution as known ones. This is the best case for missing values [1].

2) Missing at random (MAR) – In this case, the missing value from an attribute X is dependent on other attributes but is independent of the true value of X. For example, if an outdoor air temperature sensor runs out of batteries and the staff forgets to change them because it was raining, we can conclude that temperature values are more likely to be missing when it is raining, so they are dependent on the rain attribute. If we compute the temperature based only on the present values, we would probably overestimate the average value, since the temperature may be lower when it is raining compared to when it’s not.

3) Missing not at random (MNAR) – This usually occurs when the lack of data is directly dependent on its value. For example when a temperature sensor fails if temperatures drop below 0°C. Another example is when people with a certain level of income choose not to disclose that information to a census taker. In this case, it is more difficult to replace the missing values with a reasonable estimate.

It is important to identify these types of missing data since it can help us make certain assumptions about their distribution and therefore improve our chances of making good estimations.

Ways of handling missing data

First of all, we need to identify which attributes exactly contain missing values, as well as get an idea of their frequency, as shown in the table below:

The sorting of the attributes is in descending order based on the number of instances with unknown values.
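
With pandas, such a summary can be produced in a couple of lines (a minimal sketch, using the same data.csv file as in the examples below):

import pandas as pd

df = pd.read_csv('data.csv')
# Count missing values per column and sort in descending order
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0])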

2.1 Deleting missing data

In my opinion, if the missing value percentage is above a certain threshold (say, 60%), it does not make much sense to try and impute them because it would likely influence our predictions due to the biased estimations. Deletion of the rows or columns with unknown values would be better suited. For illustrative purposes, suppose the data set looks like this (missing instances are denoted with the NaN notation):

The Python pandas library allows us to drop the missing values based on the rows that contain them (i.e. drop rows that have at least one NaN value):

import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(axis=0)

The output is as follows:

id    col1     col2      col3     col4     col5
0      2.0       5.0       3.0       6.0       4.0

Similarly, we can drop columns that contain at least one NaN:
df.dropna(axis=1)

The above code produces:

However, I think that in most scenarios it is better to keep data than to discard it. One obvious reason is that removing rows or columns that contain unknown values can mean losing too much valuable information, especially if we don't have much data to begin with.

2.2 Simple imputation of missing data

We could use simple interpolation techniques to estimate unknown data. One of the most common interpolation techniques is mean imputation [2]. Here, we simply replace the missing values in each column with the mean value of the corresponding feature column.

The scikit-learn library offers us a convenient way to achieve this by calling the SimpleImputer class and then applying the fit_transform() function:

from sklearn.impute import SimpleImputer
import numpy as np

sim = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = sim.fit_transform(df.values)

After running the code, we get the imputed dataset:

Other imputation strategies are available with this class, for example "median", or "most_frequent" in the case of categorical data, which replaces the missing entries with the most common category.
This simplistic approach does have its drawbacks, however. For example, by using the mean as an imputation strategy we do not:
1) account for the variability of the missing values, since these values are all replaced by a constant;
2) take into account the potential dependency of the missing data on the other attributes present in the data set.
That's why I decided to focus my attention on a few more sophisticated approaches.

2.3 Imputation of missing data using machine learning

A more advanced method of imputation is to model an attribute containing unknown values as a target variable which is dependent on the other variables present in the data set and then apply traditional regression or machine learning algorithms to predict its missing instances. A rough mathematical representation could be formulated as follows:
y=f(X)

where y represents the attribute for which we want to predict the missing values and X is the set of predictor variables, i.e. the other variables. This relationship is most clearly visible in the case of simple linear regression where we have:
y=c+b*X

After we build our simple model, we can then use it to predict the unknown values of y for the rows where the corresponding X values are available. The exact same principle applies to ML algorithms as well, albeit the relationship between target and predictors cannot be written down so neatly.
Relying on linear regression (or logistic regression for categorical data) to fill the gaps has, of course, its drawbacks as well. Most importantly, this approach assumes that the relationship between the predictors and the target variable (or the log odds of the target, in the case of logistic regression) is linear, even though this may not be the case at all.
For this reason, I have chosen to perform imputation using ML algorithms that are able to also capture non-linear relationships. The modus operandi can be summarized in the following pseudocode:

For each attribute containing missing values do:

  1. Substitute missing values in the other variables with temporary placeholder values derived solely from the non-missing values using a simple imputation technique
  2. Drop all rows where the values are missing for the current variable in the loop
  3. Train an ML model on the remaining data set to predict the current variable
  4. Predict the missing values of the current variable with the trained model (when the current variable will be subsequently used as an independent predictor in the models for other variables, both the observed and predicted values in this step will be used).

Firstly, as you probably noticed, I have performed a simple form of imputation (median) already in the first step. This is necessary because there may be multiple features with missing data present, and in order for them to be used as predictors for other features, their gaps need to be temporarily filled somehow.
Secondly, the prediction of missing data is done in a “progressive” manner in the sense that variables which were imputed in the previous iteration are used as predictors along with those imputed values. So at each iteration except the first, we are relying on the predictive power of our model to fill the remaining gaps.
Thirdly, given that the data set provided in this case contained a mix of data types, I have employed ML regressors (for continuous attributes) as well as classifiers (for categorical attributes) to cover all possible scenarios.

In the subsequent sections, I have listed all the ML models used in this study, along with small snippets of code that demonstrate their implementation in Python.

2.3.1 Imputation of missing data using Random Forests

Quick data preprocessing tips

Before training a model on the data, it is necessary to perform a few preprocessing steps first:

  • Scale the numeric attributes (apart from our target) to help the algorithm find a better solution more quickly.
    This can be achieved using scikit-learn's StandardScaler() class:
    from sklearn.preprocessing import StandardScaler
    X = df.values
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)
  • Encode the categorical data so that each category of an attribute is represented in a binary 1 (present) – 0 (not present) fashion. This is needed because most models cannot handle non-numerical features natively.
    We can do this by using the pandas get_dummies() method:
    import pandas as pd
    encoded_country = pd.get_dummies(df['Country'])
    df = df.join(encoded_country)
    del df['Country']

The first ML model used was scikit-learn's RandomForestRegressor. Random forests are a collection of individual decision trees trained on bootstrapped samples of the data (bagging) that make decisions by averaging out the predictions of the individual estimators. They tend to be resistant to overfitting because the errors of the single trees tend to cancel each other out. If you want to learn more, refer to [3].

Below is a small snippet that translates the above pseudocode into actual Python code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

for numeric_feature in num_features:
    # Step 1: temporarily fill every column with its median...
    df_temp = df.copy()
    sim = SimpleImputer(missing_values=np.nan, strategy='median')
    df_temp = pd.DataFrame(sim.fit_transform(df_temp))
    df_temp.columns = df.columns
    # ...but keep the original (still missing) values of the current target column
    df_temp[numeric_feature] = df[numeric_feature]
    # Step 2: rows where the current feature is known form the training set,
    # rows where it is missing form the set to be predicted
    df_train = df_temp[~df_temp[numeric_feature].isnull()]
    y = df_train[numeric_feature].values
    del df_train[numeric_feature]
    df_test = df_temp[df_temp[numeric_feature].isnull()]
    del df_test[numeric_feature]
    # Scale predictors; the rows to be predicted are transformed with the
    # statistics learned on the training rows
    X = df_train.values
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)
    test_scaled = standard_scaler.transform(df_test.values)
    # Steps 3 and 4: train the model and predict the missing values
    rf_regressor = RandomForestRegressor()
    rf_regressor = rf_regressor.fit(x_scaled, y)
    pred_values = rf_regressor.predict(test_scaled)
    df.loc[df[numeric_feature].isnull(), numeric_feature] = pred_values

Categorical feature imputation is done in a similar way. In this case, we are dealing with a classification task and should use the RandomForestClassifier class.
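
Schematically, only the estimator and the list of columns being looped over change with respect to the regression loop above (cat_features is a hypothetical list of categorical column names; the elided preparation steps are identical):

from sklearn.ensemble import RandomForestClassifier

for cat_feature in cat_features:
    # ... same temporary imputation, train/test split and scaling as above ...
    rf_classifier = RandomForestClassifier()
    rf_classifier = rf_classifier.fit(x_scaled, y)   # y now holds category labels
    pred_values = rf_classifier.predict(test_scaled)
    df.loc[df[cat_feature].isnull(), cat_feature] = pred_values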

Important note on using categorical features as predictors:
In my opinion, it is correct to perform temporary imputation of categorical features before encoding them.

Consider the below example where the Country feature has already been encoded before beginning the imputation procedure:

id    Austria    Italy    Germany
0     0          1        0
1     1          0        0
2     0          0        1
3     NaN        NaN      NaN

If we apply a simple imputation using the most frequent value for example, we would get the following result on the last row:

id    Austria    Italy    Germany
3     0          0        0

This is a logical mistake in the representation, since each row should contain exactly one column with the value 1 to denote the presence of a particular country. We can avoid this mistake by imputing before encoding, since we are then guaranteed to fill the missing values with an actual country value.
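
Concretely, with the hypothetical Country column this could look as follows (impute first, encode afterwards):

import pandas as pd
from sklearn.impute import SimpleImputer

# 1) Temporarily fill the missing countries with the most frequent one
cat_imputer = SimpleImputer(strategy='most_frequent')
df['Country'] = cat_imputer.fit_transform(df[['Country']]).ravel()

# 2) Only now one-hot encode the column: every row gets exactly one 1
encoded_country = pd.get_dummies(df['Country'])
df = df.join(encoded_country)
del df['Country']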

2.3.2 Imputation of missing data using XGBoost

The XGBoost algorithm is an improved version of Gradient Boosting. Similar to Random Forests, XGBoost is a tree-based estimator, but its trees are built sequentially rather than in parallel. For more information, check out the official documentation.
The XGB model can actually handle missing values on its own, so it is not necessary to perform the temporary simple imputation on the predictor variables, i.e. we could skip the first step of the pseudocode.
Training and prediction of missing values are done in a similar fashion to the random forest approach:

import xgboost as xgb
.
.
.
xgbr = xgb.XGBRegressor()
xgbr = xgbr.fit(x_scaled, y)
pred_values = xgbr.predict(test_scaled)
.
.
.

2.3.3 Imputation of missing data using Keras Deep Neural Networks

Neural networks follow a fundamentally different approach during training compared to tree-based estimators. In my work, I have used the neural network implementation offered by the Keras library. Below I wrote an example demonstrating its application in Python:

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
.
.
.
model = Sequential()
model.add(Dense(30, input_dim=input_layer_size, activation='relu'))
model.add(Dense(30, activation='relu'))
# identity activation in the output layer for regression
model.add(Dense(1))
# in case of multi-class classification, use one unit per class instead:
# model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='mean_squared_error')
# in case of multi-class classification:
# model.compile(loss='categorical_crossentropy')
model.fit(x_scaled, y)
pred_values = model.predict(test_scaled)[:, 0]
.
.
.

2.3.4 Imputation of missing data using Datawig

Datawig is another deep learning model I employed. It is designed specifically for missing value imputation and uses deep neural networks built on MXNet to make its predictions. It can work with missing data during training, and it automatically handles categorical data with its CategoricalEncoder class, so we don't need to pre-encode it. A possible implementation is the following:

import datawig
.
.
.
imputer = datawig.SimpleImputer(
    input_columns=list(df_test.columns),
    output_column=numeric_feature,   # the column to impute
    output_path='imputer_model'      # stores model data and metrics
)
# Fit the imputer model on the train data:
imputer.fit(train_df=scaled_df_train)

# Alternatively, we could use the fit_hpo() method to find
# the best hyperparameters:
# imputer.fit_hpo(train_df = scaled_df_train)

# Impute missing values, return original dataframe with predictions
pred_vals = imputer.predict(scaled_df_test).iloc[:, -1:].values[:, 0]
.
.
.

Datawig is optimized for pandas DataFrames, meaning that it takes dataframe objects directly as input for training and prediction, so we do not need to transform them into NumPy arrays.
Moreover, we should not drop the target variable column from the training set and input it as a separate argument as we did previously when fitting a model. Datawig handles this automatically.

2.3.5 Imputation of missing data using IterativeImputer

The scikit-learn package also offers a more sophisticated approach to data imputation with the IterativeImputer() class. So where does this approach differ from the ones we saw before? The name gives us a hint.

Iterative means that each feature is imputed multiple times; each iteration is called a "cycle". The reason behind running multiple cycles is to achieve some sort of "convergence", although the scikit-learn documentation does not define precisely what convergence means here. You can, however, think of it in terms of the predicted values stabilizing:

from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer

iim = IterativeImputer(estimator=xgb.XGBRegressor(),
                       initial_strategy='median',
                       max_iter=10,
                       missing_values=np.nan,
                       skip_complete=True)

# impute all the numeric columns containing missing values
# with just one line of code:
imputed_df = pd.DataFrame(iim.fit_transform(df))

imputed_df.columns = df.columns

From the code above, we can see that each feature is imputed for up to 10 cycles (max_iter=10), and in the end we keep the imputed values of the last cycle.

Notice how I used an XGBoost regressor as the model input. This shows that IterativeImputer also accepts some ML models that are not native to the scikit-learn library.
Despite being easy to implement, it takes considerably longer to run than the other approaches.

In addition, I would advise using this class with care since it is still in its experimental stages.

Comparing Performances of ML Algorithms

Up until this point, we have seen various techniques we can employ to impute missing data, as well as the actual imputation process. However, I have not yet explained how we can compare the quality of the predictions provided by these approaches.

This is not immediately obvious because, well, we do not possess the missing data to compare the predictions against. What I decided to do instead is keep a holdout (validation) set from the training data and use it for model performance evaluation.

So, we pretend that some data is missing and infer the actual accuracy of the imputed values from the accuracy of the imputations on these artificially removed values. The snippet below enables us to do this:

from sklearn.metrics import r2_score
import random
.
.
.
train_copy = df_train.copy()
random.seed(23)
# Randomly blank out 20% of the known values of the current feature
current_feat = train_copy[numeric_feature]
missing_pct = int(current_feat.size * 0.2)  # number of entries to hide (20% of the column)
i = sorted(random.sample(range(current_feat.shape[0]), missing_pct))
current_feat.iloc[i] = np.nan

# Keep the true values of the hidden entries for evaluation
y_fake_test = df_train.iloc[i, :][numeric_feature].values
new_train_df = train_copy[~train_copy[numeric_feature].isnull()]
fake_test_df = train_copy[train_copy[numeric_feature].isnull()]
train_y = new_train_df[numeric_feature].values
del new_train_df[numeric_feature]
del fake_test_df[numeric_feature]

# Fit on the remaining rows and predict the artificially hidden values
rf_regressor = rf_regressor.fit(new_train_df.values, train_y)
train_pred = rf_regressor.predict(new_train_df.values)
test_pred = rf_regressor.predict(fake_test_df.values)

print("R2 train:{} | R2 test:{}".format(r2_score(train_y, train_pred), r2_score(y_fake_test, test_pred)))

The prediction quality, or goodness of fit, can be measured by the coefficient of determination, which is expressed as:

R² = 1 − RSS / TSS

where RSS is the sum of squared residuals, and TSS represents the total sum of squares. Below I have plotted a visual comparison of the model performances for several attributes. Visualization was done utilizing the seaborn library.

Comparing model prediction accuracy on various attributes

We can see that the random forest model consistently ranks among the best.
To get another hint at the consistency of RF, I have plotted the actual values against the predicted values in the test set for the VAR_1 variable:

Plotting actual values against predicted ones for Var_1

Ideally, the line in any graph should be a straight, diagonal one. The model which comes closest to this is the random forest, which was ultimately my choice for imputation.

Potential Future Steps

Another interesting technique for imputation, which could be employed in the future, is Multiple Imputation by Chained Equations (MICE). It takes iterative imputation up a notch. The core idea behind it is to create multiple copies of the original data set (usually 5 to 10 are enough) and perform iterative imputations on each of them. The results obtained from each data set can later be pooled together according to some metric that we define.

Ultimately, the goal is to somehow account for the variability of the missing data and study the effects of different permutations on the prediction results. The scheme below illustrates this:

In Python, MICE is offered by a few libraries such as impyute or statsmodels; however, they are limited to linear regression estimators.
Another way to mimic the MICE approach is to run scikit-learn's IterativeImputer several times on the same dataset, using a different random seed each time, as sketched below.
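
A rough sketch of that idea, drawing each imputed copy from the posterior with a different seed and then pooling the numeric columns by simple averaging (a simplified pooling rule, not the full MICE procedure):

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

n_imputations = 5
imputed_copies = []
for seed in range(n_imputations):
    imp = IterativeImputer(sample_posterior=True,  # sample imputations, MICE-style
                           max_iter=10,
                           random_state=seed)
    imputed_copies.append(pd.DataFrame(imp.fit_transform(df), columns=df.columns))

# Pool the copies: here we simply average the imputed values across the copies
pooled_df = pd.concat(imputed_copies).groupby(level=0).mean()
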
Yet another take on the imputation problem is a technique called maximum likelihood estimation. It derives missing values from a user-defined distribution function whose parameters are chosen in a way that maximizes the likelihood of the imputed values actually occurring.

Conclusion

We got a glimpse of what the potential approaches for handling missing values are, from the simplest techniques like deletion to more complex ones like iterative imputation.

In general, there is no best way to solve imputation problems and solutions vary according to the nature of the problem, size of the data set, etc. However, I hope to have convinced you that an ML-based approach has inherent value because it offers us a ‘universal’ way out. While missing data may be truly unknowable, we can at least try to come up with an educated guess based on the hidden relationships with the already existing attributes, captured and exposed to us by the power of machine learning.


References

[1] Berthold M.R., and others, Data understanding, in: Guide to Intelligent Data Analysis, Springer, London, pp. 37-40, 42-44.

[2] Raschka S., Data preprocessing, in: Python Machine Learning, Packt, Birmingham, pp. 82-83, 90-91.

[3] Tan P., and others, Data preprocessing, Classification, Ensemble methods, in: Introduction to Data Mining, Addison Wesley, Boston, pp. 187-188, 289-292.
