Tag: private debt

Office employee doing Financial Data Extraction from Statements with Machine Learning

Financial Data Extraction from Statements with Machine Learning

Data is the foundation that drives the whole decision-making process in the finance ecosystem. With the growth of fin-tech services the process of collecting this data is more easy accessible, and for a data scientist becomes necessary to develop a set of information extraction tools that would automatically fetch and store this relevant data. Doing so, we facilitate the process of financial data extraction, which before the development of this tools was done manually, a very tedious and not very time-efficient task.

One of the main providers of this “key knowledge” in finance is the Financial Statements, which offer important insight for a company through its performance, operations, cash flows, or balance sheets. While this information is usually provided in text-based formats or other data structures like spreadsheets or data-frames (which can easily be utilized using parser-s or converters), there is also the case when this data comes as other document formats or images in a semi-structured fashion, which also varies between different sources. In this post we will go through different approaches we used to automate the data extraction from these Financial Statements that were formerly provided from different external sources.

Financial Data Extraction: Problem introduction

In our particular case, our data consisted of different semi-structured financial statements provided in PDFs and images, each one following a particular template layout. These financial statements consist of relevant company-related information (company name, industry sector, address), different financial metrics, and balance sheets. For us, the extraction process task consists of retrieving all the relevant information of each document for every different entity class and storing them as a key-value pair (e.g. company_name -> CARDO AI). Since the composition of information differs between documents we end up having different clusters of them. Going even further, we observe that the information inside the document itself represents itself through different typologies of data (text, numerical, tables, etc.).

Sample of a financial statement containing relevant entity classes

In this case, two main problems emerge: we have to find an approach to solve the task for one type of document, and secondly by inductive reasoning, form a broader approach for the general problem, which applies in the whole data set. We have to note here that we are trying to find a single solution that works in the same way for all this diverse data. Treating separately every single document with a different approach denotes missing the whole point of the present task at hand.

“An automated system won’t solve the problem. You have to solve a problem, then automate the solution”

Methodology and solution

Text extraction tools

Firstly we started with text extraction tools like Tabula for tabular data, and PDFMiner and Tesseract for text data. Tabula scrapes tabular data from PDF files, meanwhile, PDFMiner and Tesseract are text extraction tools that gather the text data respectively from PDF and images. The way these tools work is by recognizing pieces of text on visual representations (PDFs and images) into textual data (document text). The issue with Tabula was that it worked only on tabular data, however, the most relevant information in the financial documents that we have is not always represented in tabular format.

Meanwhile, when we applied the other tools, PDFMiner and Tesseract, the output raw text was completely unstructured and non-human-unreadable (adding here unnecessary white-spaces or confusing words that contained special characters). This text was hard to break down into the meaningful entity classes that we want to extract from there. This was clearly not enough so we had to discover other approaches.

GPT-2

Before moving on, we made an effort to pre-process the outputted text from the above-mentioned extraction tools, and for that, we tried GPT-2 [1]. GPT-2 is a large transformer-based language model with 1.5 billion parameters developed from OpenAI and was considered a great innovative breakthrough in the field of NLP. This model, and also its successor – GPT-3, has achieved strong performance on many NLP tasks, including text generation, translation, as well as several tasks that require on-the-fly reasoning or domain adaptation. In our case, we tried to exploit one of its capabilities which was text summation. After getting a considerable amount of text from the previous text extraction tool, we tried to summarize all this information using the GPT-2 model and take out non-relevant information, taking advantage of the attention mechanism of the transformer model. But this approach did not seem to work quite well considering the non-structured text which is very hard to summarize. Apart from that, there would always be the possibility of the model removing the important information from the text and we cannot give it the benefit of doubt in this regard.

Bounding boxes relationship – OpenCV

The unpromising results of the above approaches made us entertain the idea of treating it as an object detection task using computer vision. Object detection is done by means of outputting a bounding box around the object of interest along with a class label. Then we could construct a relationship graph between these “boxed” entities [2] (see image above). Going forward with this method we tried to do the same with our documents, but instead draw boxes around text that represents an identifiable entity and label each box with the entity name it contained. The next step would have been to develop an algorithm that calculates a metric that represents the relationship values between these boxes based on their spatial position. We could then train a machine learning model that would learn from these relationship values and sequentially decide the position of the next entity by knowing the document locations of the previous ones.

The model creates a relationship graph between entities

However, that was not an easy task, due to the fact that it is very hard to determine the right box which represents a distinct meaningful component in the report, and also as mentioned above different documents follow different document layouts and the position of the information we want to extract is arbitrarily positioned in the document. Henceforth, the previously mentioned algorithm might be inaccurate in determining the position of every box. We moved on to seek a better plan.

Named Entity Recognition

An example of NER annotation

Named-entity recognition (NER) is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

During our research on the quest to explore new approaches for this task, we came upon the expression “named entity” which generally refers to those entities for which one or many strings, such as words or phrases, stands consistently for some referent. Then we discovered Named Entity Recognition, the task of locating and classifying words from a not annotated block of text into predefined categories. The desired approach to solving this task is by using deep learning NLP models which use linguistic grammar-based techniques. Conceptually this task is divided into distinct problems: detection of names and classifying them into the category they fall into. Hence we started to look for different implementations to design our language model.

NLP model – spaCy

At this point, our process path was pretty straightforward to follow due to the ease that the NLP libraries offer. And for this job, we decided to go with spaCy [3], which offers a very simple and flexible API to develop many NLP tasks, and one of them being Named Entity Recognition. The design pipeline could be conceptualized with the below diagram:

Solution pipeline

Before we start the design of our model we have to first construct the training data set. It will essentially consist of annotated blocks of texts or “sentences” that contain the value of the entity it represents and the entity name itself. For that, firstly we extract the text from the paragraphs where the desired information is present by making use of the previously used extraction tools. Then we annotate this text with the found categories, by providing the starting position and length of the word in the text. Doing so, we also provide some context to the model by keeping the nearby words around the annotated word. This whole information retrieved from the financial statements can then be easily stored in a CSV file. In SpaCy this would be represented with the below structure:

TRAIN_DATA = [    ("Cardo AI SRL is a fintech company", {"entities": [(0, 12, "Company")]}),    ("Company is based in Italy", {"entities": [(20, 25, "LOC")]})]

After we prepared our dataset, we then decided to design the NLP model by choosing between the alternatives that spaCy provided. We started from a blank non-trained model and then made an outline of the input and output of the model. We split the data into train and test sets and then started training the model, following this pipeline. From the training data, the text is firstly tokenized using the Doc module, which basically means breaking down the text into individual linguistic units, and then the annotated text inputted with the supported format is parsed with the GoldParse module to then be fed into the training pipeline of the model.

Training pipeline

Results and constraints

After training the model on about 800 input rows and testing on 200, we got these evaluations:

Evaluations

The evaluation results seemed promising, but that may have come also from the fact that the model was over-fitting or there was not a lot of variability in our data. After our model was trained, all we had to do was feed it with text data taken from the input reports after they had been divided into boxed paragraphs and expect the output represented as a key-value pair.

Constraints

Lack of data
– in order to avoid bias or over-fitting the model should be trained on a relatively large amount of data.
– acquiring all this data is not an easy process, adding here the data pre-processing step.
Ambiguous output
– the model may output more than one value per entity, which leads to inconsistency in interpreting the results.
Unrecognizable text
– the financial statements have poorly written text not correctly identifiable during the text data extraction tools recognition.
Numerical values
– having lots of numerical values in the reports it is hard to distinguish the real labels they represent.

Potential future steps toward better Financial Data Extraction

In recent years, convolution neural networks have shown great success in various computer vision tasks such as classification and object detection. Seeing the problem from a computer vision perspective as a document segmentation (creating bounding boxes around the text contained in the document and classifying it into categories) is a good approach to proceed on with. And for that, the magic formula might be called “RCNN” [4].

By following this path, we might be able to resolve the above-mentioned issues we ended up with our solution. Integrating many different approaches together may also improve the overall accuracy of the labeling process.

After the solution process is stable and the model’s accuracy is satisfactory we need to streamline this whole workflow. For a machine learning model, it is important to be fed with an abundant amount of new data which improves the overall performance and predicts future observations more reliably. In order to achieve that, it comes necessary to build an Automated Retraining Pipeline for the model, with a workflow displayed as the following diagram:

Workflow diagram

Conclusion

We went through and reviewed a couple of different approaches we attempted on solving the Named-Entity Recognition task on Financial Statements. And from this trial and error journey, it seemed that the best method was solving it using Natural Language Processing models trained with our own data and labels.

But despite seemingly obtaining satisfactory results in our case study regarding financial data extraction, there is still room for improvement. The one thing we know for sure is that the above Machine Learning approach provided us the best results and following the same path on solving this task is the way to go. Machine Learning is very close to reaching super-intelligence and with the right approach to present problems in every domain, it is becoming a powerful tool to make use of.

References

[1] Better Language Models and Their Implications

[2] Object Detection with Deep Learning: A Review

[3] spaCy Linguistic Features, Named Entity

[4] Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Handling missing values & data in Machine Learning Modelling with Python

Handling Missing Data in ML Modelling (with Python)

During my experience as a data scientist, one of the most common problems I have faced during the process of data cleaning/exploratory analysis has been handling missing data. In the ideal case, all attributes of all objects in the data table have well-defined values. However, in real data sets, it is not unusual for an attribute to contain missing data.

When dealing with prediction tasks in supervised learning, I quickly came to the realization that for a lot of machine learning algorithms available in Python, the task of handling missing data cannot be done naturally. I.e. the omitted instances have to somehow be filled with a placeholder (most likely a number) for them to run smoothly.

Perhaps you as a reader may have come up with a simple solution already:

Why not just ignore those instances in the pre-processing stage and conclude it that way?

After all, it is not our fault that the data is missing and moreover, we should not make any assumptions about the nature of missing data since their true value is unknowable in principle. While this may certainly be tempting (if not advisable) in some situations, I will attempt to make the case that imputing (rather than ignoring) missing values can be a better practice that in the end leads to more reliable and unbiased results for our machine learning models.

Types of missing data

There may be various reasons responsible for why the data is missing. Depending on those reasons, it can be classified into three main types:

1) Missing completely at random (MCAR) – Imagine that you print out the data table on a sheet of paper with no missing values and then someone accidentally spills a cup of coffee on it. In this case, the conclusion is that the unknown values of an attribute follow the same distribution as known ones. This is the best case for missing values [1].

2) Missing at random (MAR) – In this case, the missing value from an attribute X is dependent on other attributes but is independent of the true value of X. For example, if an outdoor air temperature sensor runs out of batteries and the staff forgets to change them because it was raining, we can conclude that temperature values are more likely to be missing when it is raining, so they are dependent on the rain attribute. If we compute the temperature based only on the present values, we would probably overestimate the average value, since the temperature may be lower when it is raining compared to when it’s not.

3) Missing not at random (MNAR) – This usually occurs when the lack of data is directly dependent on its value. For example when a temperature sensor fails if temperatures drop below 0°C. Another example is when people with a certain level of income choose not to disclose that information to a census taker. In this case, it is more difficult to replace the missing values with a reasonable estimate.

It is important to identify these types of missing data since it can help us make certain assumptions about their distribution and therefore improve our chances of making good estimations.

Ways of handling missing data

First of all, we need to identify which attributes exactly contain missing values, as well as get an idea of their frequency, as shown in the table below:

The sorting of the attributes is in descending order based on the number of instances with unknown values.

2.1 Deleting missing data

In my opinion, if the missing value percentage is above a certain threshold (say, 60%), it does not make much sense to try and impute them because it would likely influence our predictions due to the biased estimations. Deletion of the rows or columns with unknown values would be better suited. For illustrative purposes, suppose the data set looks like this (missing instances are denoted with the NaN notation):

The Python pandas library allows us to drop the missing values based on the rows that contain them (i.e. drop rows that have at least one NaN value):

import pandas as pd

df = pd.read_csv('data.csv')
df.dropna(axis=0)

The output is as follows:

id    col1     col2      col3     col4     col5
0      2.0       5.0       3.0       6.0       4.0

Similarly, we can drop columns that have at least one NaN in any row:
df.dropna(axis=1)

The above code produces:

However, I think that in most scenarios it is better to keep data than discard it. One obvious reason is that removing rows or columns that contain unknown values will result in losing too much valuable information, especially if we don’t have much data, to begin with.

2.2 Simple imputation of missing data

We could use simple interpolation techniques to estimate unknown data. One of the most common interpolation techniques is mean imputation [2]. Here, we simply replace the missing values in each column with the mean value of the corresponding feature column.

The sciki-learn library offers us a convenient way to achieve this by calling the SimpleImputer class and then applying the fit_transform() function:

from sklearn.impute import SimpleImputer
import numpy as np

sim = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = sim.fit_transform(df.values)

After running the code, we get the imputed dataset:

Other imputation strategies are available with this class, for example “median” or “most frequent” in the case of categorical data, which replaces the missing data with the most common category.
This simplistic approach does have its drawbacks however. For example, by using the mean as an imputation strategy we do not:
1) Account for the variability of the missing values, since these values are replaced by a constant.
2) Take into account the potential dependency of the missing data from the other attributes which are present in the data set.
That’s why I decided to focus my attention on a few more sophisticated approaches.

2.3 Imputation of missing data using machine learning

A more advanced method of imputation is to model an attribute containing unknown values as a target variable which is dependent on the other variables present in the data set and then apply traditional regression or machine learning algorithms to predict its missing instances. A rough mathematical representation could be formulated as follows:
y=f(X)

where y represents the attribute for which we want to predict the missing values and X is the set of predictor variables, i.e. the other variables. This relationship is most clearly visible in the case of simple linear regression where we have:
y=c+b*X

After we build our simple model we can then use it to predict the unknown values of y for which the corresponding X values will be available. The exact same principle applies to ML algorithms as well, albeit the relationship representation between target and predictor cannot be done so neatly.
Relying on linear regression (or logistic regression for categorical data) to fill the gaps has of course its drawbacks as well. Most importantly, this approach assumes that the relationship between its predictors (or the log odds of its predictors in logistic regression) and the target variable is linear, even though this may not be the case at all.
For this reason, I have chosen to perform imputation using ML algorithms that are able to also capture non-linear relationships. The modus operandi can be summarized in the following pseudocode:

For each attribute containing missing values do:

  1. Substitute missing values in the other variables with temporary placeholder values derived solely from the non-missing values using a simple imputation technique
  2. Drop all rows where the values are missing for the current variable in the loop
  3. Train an ML model on the remaining data set to predict the current variable
  4. Predict the missing values of the current variable with the trained model (when the current variable will be subsequently used as an independent predictor in the models for other variables, both the observed and predicted values in this step will be used).

Firstly, as you probably noticed, I have performed a simple form of imputation (median) already in the first step. This is necessary because there may be multiple features with missing data present, and in order for them to be used as predictors for other features, their gaps need to be temporarily filled somehow.
Secondly, the prediction of missing data is done in a “progressive” manner in the sense that variables which were imputed in the previous iteration are used as predictors along with those imputed values. So at each iteration except the first, we are relying on the predictive power of our model to fill the remaining gaps.
Thirdly, given that the data set provided in this case contained a mix of data types, I have employed ML regressors (for continuous attributes) as well as classifiers (for categorical attributes) to cover all possible scenarios.

In the subsequent sections, I have listed all the ML models used in this study, along with small snippets of code that demonstrate their implementation in Python.

2.3.1 Imputation of missing data using Random Forests

Quick data preprocesing tips

Before training a model on the data, it is necessary to perform a few preprocessing steps first:

  • Scale the numeric attributes (apart from our target) to make the algorithm find a better solution quicker.
    This can be achieved using scikit-learns‘s StandardScaler() class:
    from sklearn.preprocessing import StandardScaler
    X = df.values
    standard_scaler = preprocessing.StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)
  • Encode the categorical data so that the representation of each category of an attribute is in a binary 1 (present) – 0 (not present) fashion. This happens because most models cannot handle non-numerical features naturally.
    We can do this by using the pandas get_dummies() method:
    import pandas as pd
    encoded_country = pd.get_dummies(df['Country'])
    df.join([encoded_country])
    del df['Country']

The first ML model used was scikit-learn‘s RandomForestRegressor. Random forests are a collection of individual decision trees (bagging) that make decisions by averaging out the prediction of every single estimator. They tend to be resistant to overfitting because tree predictions cancel each other out. If you want to learn more, refer to [3].

Below is a small snippet that translates the above pseudocode into actual Python code:

from sklearn.ensemble import RandomForestRegressor

for numeric_feature in num_features:
df_temp = df.copy()
sim = SimpleImputer(missing_values=np.nan, strategy='median')
df_temp = pd.DataFrame(sim.fit_transform(df_temp))
df_temp.columns = df.columns
df_temp[numeric_feature] = df[numeric_feature]
df_train = df_temp[~df_temp[numeric_feature].isnull()]
y = df_train[numeric_feature].values
del df_train[numeric_feature]
df_test = df_temp[df_temp[numeric_feature].isnull()]
del df_test[numeric_feature]
X = df_train.values
standard_scaler = preprocessing.StandardScaler()
x_scaled = standard_scaler.fit_transform(X)
test_scaled = standard_scaler.fit_transform(df_test.values)
rf_regressor = RandomForestRegressor()
rf_regressor = rf_regressor.fit(x_scaled, y)
pred_values = rf_regressor.predict(test_scaled)
df.loc[df[numeric_feature].isnull(), numeric_feature] = pred_values

Categorical feature imputation is done in a similar way. In this case, we are dealing with a classification task and should use the RandomForestClassifier class.

Important note on using categorical features as predictors:
In my opinion, it is correct to perform temporary imputation of categorical features before encoding them.

Consider the below example where the Country feature has already been encoded before beginning the imputation procedure:

id    Austria    Italy     Germany
0            0             1              0
1            1             0              0
2            0             0              1
3            NaN         NaN          NaN

If we apply a simple imputation using the most frequent value for example, we would get the following result on the last row:

id    Austria    Italy    Germany
0            0             0              0

This is a logical mistake in the representation since each row should contain exactly one column that takes 1 as value to denote the presence of a particular county. We can avoid this mistake by imputing before encoding since we are guaranteed to fill the missing values with a certain country value.

2.3.2 Imputation of missing data using XGBoost

The XGBoost algorithm is an improved version of the Gradient Boosting one. Similar to Random Forests, XGBoost is a tree-based estimator, but decisions are taken sequentially rather than in parallel. For more information, check out the official documentation.
The XGB model can actually handle missing values on its own, so it is not necessary to perform temporary simple imputation on predictor variables, i.e. we could skip the first step in the pseudocode.
Training and prediction of missing values are done in a similar fashion to the random forest approach:

import xgboost as xgb
.
.
.
xgbr = xgb.XGBRegressor()
xgbr = xgbr.fit(x_scaled, y)
pred_values = xgbr.predict(test_scaled)
.
.
.

2.3.3 Imputation of missing data using Keras Deep Neural Networks

Neural networks follow a fundamentally different approach during training compared to tree-based estimators. In my work, I have used the neural network implementation offered by the Keras library. Below I wrote an example demonstrating its application in Python:

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
.
.
.
model = Sequential()
model.add(Dense(30, input_dim=input_layer_size, activation='relu')
model.add(Dense(30, activation='relu'))
# identity activation in the output layer for regression
model.add(Dense(1))
# in case of multi classification:
# model.add(Dense(1, activation='softmax'))
model.compile(loss='mean_squared_error')
# in case of multi classification:
# model.compile(loss='categorical_crossentropy')
model.fit(x_scaled, y)
pred_values = model.predict(test_scaled)[:, 0]
.
.
.

2.3.4 Imputation of missing data using Datawig

Datawig is another deep learning model I employed. Its design is specifically made for missing value imputation as it utilizes MXNet’s pre-trained DNNs to make predictions. It can work with missing data during training and it automatically handles categorical data with its CategoricalEncoder class, so we don’t need to pre-encode them. A possible way of implementation is the following:

import datawig
.
.
.
imputer = datawig.SimpleImputer(input_columns=list(df_test.columns),
output_column=numeric_feature, # the column to impute
output_path='imputer_model' # stores model data and metrics)
# Fit the imputer model on the train data:
imputer.fit(train_df = scaled_df_train)

# Alternatively, we could use the fit_hpo() method to find
# the best hyperparameters:
# imputer.fit_hpo(train_df = scaled_df_train)

# Impute missing values, return original dataframe with predictions
pred_vals = imputer.predict(scaled_df_test).iloc[:, -1:].values[:, 0]
.
.
.

Datawig is optimized for pandas DataFrames, meaning that it takes dataframe objects directly as input for training and prediction, so we do not need to transform them into NumPy arrays.
Moreover, we should not drop the target variable column from the training set and input it as a separate argument as we did previously when fitting a model. Datawig handles this automatically.

2.3.5 Imputation of missing data using IterativeImputer

The scikit-learn package also offers a more sophisticated approach to data imputation with the IterativeImputer() class. So where does this approach differ from the ones we saw before? The names give us a hint.

Iterative means that each feature is imputed multiple times. Each iteration’s naming can be “a cycle“. The reason behind running multiple cycles is to achieve some sort of ‘convergence’. Although it is not clear this means exactly that, looking at the scikit-learn documentation. However, you can think of convergence in terms of stabilization of the predicted values:

from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer

iim=IterativeImputer(estimator=xgb.XGBRegressor(),
initial_strategy='median',
max_iter=10,
missing_values=np.nan,
skip_complete=True)

# impute all the numeric columns containing missing values
# with just one line of code:
imputed_df = pd.DataFrame(iim.fit_transform(df))

imputed_df.columns = df.columns

From the code above, we can see that each feature is imputed 10 times (max_iter=10) and in the end, we get the imputed values of the last cycle.

Notice how I used an XGBoost regressor as model input. This shows that IterativeImputer also accepts some ML models that are not native to the scikit-learn library.
Despite being easy to implement, it takes a very large amount of time to calculate compared to the other approaches.

In addition, I would advise using this class with care since it is still in its experimental stages.

Comparing Performances of ML Algorithms

Up until this point, we have seen how various employable techniques to impute missing data as well as the actual process of imputation. However, I have not explained how we can compare the qualities of predictions provided by these approaches.

This is not immediately obvious, because well, we do not possess the missing data to compare them to the predictions. Instead what I decided to do is keep a holdout or validation set from training data, and then use it for model performance evaluation.

So, we are pretending that some data is missing and inferring the actual accuracy of the imputed values based on the accuracy of the imputations on these fake missing values. The snipped below enables us to do this:

from sklearn.metrics import r2_score
import random
.
.
.
train_copy = df_train.copy()
random.seed(23)
current_feat = train_copy[numeric_feature]
missing_pct = int(current_feat.size * 0.2)
i = sorted(random.sample(range(current_feat.shape[0]), missing_pct))
current_feat.iloc[i] = np.nan

y_fake_test = df_train.iloc[i, :][numeric_feature].values
new_train_df = train_copy[~train_copy[numeric_feature].isnull()]
fake_test_df = train_copy[train_copy[numeric_feature].isnull()]
train_y = new_train_df[numeric_feature].values
del new_train_df[numeric_feature]
del fake_test_df[numeric_feature]

rf_regressor = rf_regressor.fit(new_train_df.values, train_y)
train_pred = rf_regressor.predict(new_train_df.values)
test_pred = rf_regressor.predict(fake_test_df.values)

print("R2 train:{} | R2 test:{}".format(r2_score(train_y, train_pred), r2_score(y_fake_test, test_pred)))

The prediction quality, or goodness of fit, can be measured by the coefficient of determination, which expresses as:

Coefficient of determination formula

where RSS is the sum of squared residuals, and TSS represents the total sum of squares. Below I have plotted a visual comparison of the model performances for several attributes. Visualization was done utilizing the seaborn library.

Comparing model prediction accuracy on various attributes

We can see that the random forest model consistently ranks among the best.
To get another hint at the consistency of RF, I have plotted the actual values against the predicted values in the test set for the VAR_1 variable:

Plotting actual values against predicted ones for Var_1

Ideally, the line in any graph should be a straight, diagonal one. The model which comes closest to this is the random forest, which was ultimately my choice for imputation.

Potential Future Steps

Another interesting technique for imputation, which you can employ in the future, is the Multiple Imputation Chained Equations (MICE) method. This takes iterative imputation up a notch. The core idea behind it is to create multiple copies of the original data set (usually 5 to 10 are enough) and perform iterative imputations on each dataset. The obtained results from each data set, in accordance with some metrics that we can define, you can later pool together.

Ultimately, the goal is to somehow account for the variability of the missing data and study the effects of different permutations on the prediction results. The scheme below illustrates this:

In Python, MICE is offered by a few libraries like impyute or statsmodels. However, linear regression estimators are their limit.
Another way to mimic the MICE approach would be to run scikit-learn‘s IterativeImputer many times on the same dataset using different random seeds each time.
Yet another take at the imputation problem is to apply a technique called maximum likelihood estimation. It can derive missing values from a user-defined distribution function, whose parameters chose in a way that maximizes the likelihood of the imputed values actually occurring

Conclusion

We got a glimpse of what the potential approaches for handling missing values are, from the simplest techniques like deletion to more complex ones like iterative imputation.

In general, there is no best way to solve imputation problems and solutions vary according to the nature of the problem, size of the data set, etc. However, I hope to have convinced you that an ML-based approach has inherent value because it offers us a ‘universal’ way out. While missing data may be truly unknowable, we can at least try to come up with an educated guess based on the hidden relationships with the already existing attributes, captured and exposed to us by the power of machine learning.


References

[1] Berthold M.R., and others, Data understanding, in: Guide to Intelligent Data Analysis, Springer, London, pp. 37-40, 42-44.

[2] Raschka S., Data preprocessing, in: Python Machine Learning, Packt, Birmingham, pp. 82-83, 90-91.

[3] Tan P., and others, Data preprocessing, Classification, Ensemble methods, in: Introduction to Data Mining, Addison Wesley, Boston, pp. 187-188, 289-292.

Cardo AI top startup at MILAN FINTECH SUMMIT

MILAN FINTECH SUMMIT: A SELECTION OF THE BEST OF ITALIAN AND INTERNATIONAL INNOVATION
Among the over 70 candidates from 18 countries, 10 Italian and 10 international companies were selected based on their potentials on the market

Milan, 23 November 2020 – The Fintech companies deemed as having the highest market potential will be the protagonists of the second day of Milan Fintech Summit, the international event dedicated to the world of Finance Technology, scheduled as a streaming live on 10 and 11 December 2020. It is promoted and organised by Fintech District and Fiera Milano Media – Business International supported by the City of Milan through Milano&Partners, and sponsored by AIFI, Assolombarda, Febaf, ItaliaFintech and VC Hub.

Following the call launched on an international level and a careful selection by such experts in the sectors as the Conference Chair Alessandro Hatami and representatives of the organizing committee, today they announced the 20 companies that will be given the opportunity to be on the digital stage to present their own ideas and solutions for the future of financial services.
Among the over 70 candidates from 18 countries, 10 Italian and 10 International companies were selected.

The Italian companies are: insurtech Neosurance, See Your Box and Lokky; WizKey, Soisy, Cardo AI, Stonize and Faire Labs, operating in the lending and credit sector; Trakti offering cybersecurity solutions; Indigo.ai dealing with artificial intelligence.

The international ones that were selected are: Insurtech Descartes Underwriting and Zelros (France); Keyless Technologies (UK), CYDEF – Cyber Defence Corporation, Tehama (Canada), dealing with DaaS and Cybersecurity and Privasee (UK), operating in the data market protection; Pocketnest (USA), a SaaS company; Wealth Manager Wondeur (France), DarwinAI (USA) operating in the artificial intelligent sector and Oper Credits (Belgium), operating in the lending and credit field.

These realities, which will be introduced to a parterre of selected Italian and International investors and to fintech experts, were chosen based on the criteria of: innovativeness of the proposal, potential size of the target market, scalability of the proposal, potentials in capital raising; type of the employed technological solution.
The Milan Fintech Summit will thus help introduce the potential of our fintech companies abroad reinforcing the role of Milan as European capital of innovation, an ideal starting point for international companies that want to enter the Italian market.

FINTECH ACCESSIBLE TO EVERYONE, AN OPEN DOOR EVENT
The program of the event is available on the official site and a physical appointment of the summit is already scheduled for next year, on 4 and 5 October 2021. The December appointments are open to all those interested in knowing and understanding in depth the potentials of fintech. You can register now for free using this link, or purchase a premium ticket to participate as listeners to the pitch session (the only closed door part of the program) and be entitled to other benefits offered by the Summit partners.

Fintech District
Fintech District is the reference international community for fintech ecosystem in Italy. It acts with the aim of creating the best conditions to help all the stakeholders (start-ups, financial institutions, corporations, professionals, institutions, investors) operate in synergy and find opportunities of local and international growth. The companies that decide to adhere have in common the tendency to innovate and the will to develop collaborations based on opening and sharing, The community now consists in 160 start-ups and 14 corporate members choosing to participate to the creation of open innovation projects by collaborating with fintech. Fintech District also has relationships with equivalent innovation hubs abroad to multiply the opportunity to invest and cooperate, establishing its own role as access door and reference in the Italian market. Created in 2017, the Fintech District has its seat in Milan in Palazzo COPERNICO ISOLA FOR S32, in Via Sassetti 32. Fintech District is part of Fabrick.

Subscribe to our Newsletter

The ability to operate with technology and true intelligence at speed can be the deciding factor in success or failure in private market investments.

Start lowering your costs, scale faster and use more data in your decisions. Today!

Our Offices
  • Milan:
    Via Monte di Pietà 1A, Milan, Italy
  • London:
    40 New Bond St, London W1S 2DE, UK
  • Tirana:
    Office 1: Rruga Adem Jashari 1, Tirana, AL
    Office 2: Blvd Zogu I, Tirana, AL

Copyright Cardo AI 2021. All rights reserved. P.IVA: 10357440964