Author: Foti Kërkeshi

Office employee doing Financial Data Extraction from Statements with Machine Learning

Financial Data Extraction from Statements with Machine Learning

Data is the foundation that drives the whole decision-making process in the finance ecosystem. With the growth of fin-tech services the process of collecting this data is more easy accessible, and for a data scientist becomes necessary to develop a set of information extraction tools that would automatically fetch and store this relevant data. Doing so, we facilitate the process of financial data extraction, which before the development of this tools was done manually, a very tedious and not very time-efficient task.

One of the main providers of this “key knowledge” in finance is the Financial Statements, which offer important insight for a company through its performance, operations, cash flows, or balance sheets. While this information is usually provided in text-based formats or other data structures like spreadsheets or data-frames (which can easily be utilized using parser-s or converters), there is also the case when this data comes as other document formats or images in a semi-structured fashion, which also varies between different sources. In this post we will go through different approaches we used to automate the data extraction from these Financial Statements that were formerly provided from different external sources.

Financial Data Extraction: Problem introduction

In our particular case, our data consisted of different semi-structured financial statements provided in PDFs and images, each one following a particular template layout. These financial statements consist of relevant company-related information (company name, industry sector, address), different financial metrics, and balance sheets. For us, the extraction process task consists of retrieving all the relevant information of each document for every different entity class and storing them as a key-value pair (e.g. company_name -> CARDO AI). Since the composition of information differs between documents we end up having different clusters of them. Going even further, we observe that the information inside the document itself represents itself through different typologies of data (text, numerical, tables, etc.).

Sample of a financial statement containing relevant entity classes

In this case, two main problems emerge: we have to find an approach to solve the task for one type of document, and secondly by inductive reasoning, form a broader approach for the general problem, which applies in the whole data set. We have to note here that we are trying to find a single solution that works in the same way for all this diverse data. Treating separately every single document with a different approach denotes missing the whole point of the present task at hand.

“An automated system won’t solve the problem. You have to solve a problem, then automate the solution”

Methodology and solution

Text extraction tools

Firstly we started with text extraction tools like Tabula for tabular data, and PDFMiner and Tesseract for text data. Tabula scrapes tabular data from PDF files, meanwhile, PDFMiner and Tesseract are text extraction tools that gather the text data respectively from PDF and images. The way these tools work is by recognizing pieces of text on visual representations (PDFs and images) into textual data (document text). The issue with Tabula was that it worked only on tabular data, however, the most relevant information in the financial documents that we have is not always represented in tabular format.

Meanwhile, when we applied the other tools, PDFMiner and Tesseract, the output raw text was completely unstructured and non-human-unreadable (adding here unnecessary white-spaces or confusing words that contained special characters). This text was hard to break down into the meaningful entity classes that we want to extract from there. This was clearly not enough so we had to discover other approaches.


Before moving on, we made an effort to pre-process the outputted text from the above-mentioned extraction tools, and for that, we tried GPT-2 [1]. GPT-2 is a large transformer-based language model with 1.5 billion parameters developed from OpenAI and was considered a great innovative breakthrough in the field of NLP. This model, and also its successor – GPT-3, has achieved strong performance on many NLP tasks, including text generation, translation, as well as several tasks that require on-the-fly reasoning or domain adaptation. In our case, we tried to exploit one of its capabilities which was text summation. After getting a considerable amount of text from the previous text extraction tool, we tried to summarize all this information using the GPT-2 model and take out non-relevant information, taking advantage of the attention mechanism of the transformer model. But this approach did not seem to work quite well considering the non-structured text which is very hard to summarize. Apart from that, there would always be the possibility of the model removing the important information from the text and we cannot give it the benefit of doubt in this regard.

Bounding boxes relationship – OpenCV

The unpromising results of the above approaches made us entertain the idea of treating it as an object detection task using computer vision. Object detection is done by means of outputting a bounding box around the object of interest along with a class label. Then we could construct a relationship graph between these “boxed” entities [2] (see image above). Going forward with this method we tried to do the same with our documents, but instead draw boxes around text that represents an identifiable entity and label each box with the entity name it contained. The next step would have been to develop an algorithm that calculates a metric that represents the relationship values between these boxes based on their spatial position. We could then train a machine learning model that would learn from these relationship values and sequentially decide the position of the next entity by knowing the document locations of the previous ones.

The model creates a relationship graph between entities

However, that was not an easy task, due to the fact that it is very hard to determine the right box which represents a distinct meaningful component in the report, and also as mentioned above different documents follow different document layouts and the position of the information we want to extract is arbitrarily positioned in the document. Henceforth, the previously mentioned algorithm might be inaccurate in determining the position of every box. We moved on to seek a better plan.

Named Entity Recognition

An example of NER annotation

Named-entity recognition (NER) is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

During our research on the quest to explore new approaches for this task, we came upon the expression “named entity” which generally refers to those entities for which one or many strings, such as words or phrases, stands consistently for some referent. Then we discovered Named Entity Recognition, the task of locating and classifying words from a not annotated block of text into predefined categories. The desired approach to solving this task is by using deep learning NLP models which use linguistic grammar-based techniques. Conceptually this task is divided into distinct problems: detection of names and classifying them into the category they fall into. Hence we started to look for different implementations to design our language model.

NLP model – spaCy

At this point, our process path was pretty straightforward to follow due to the ease that the NLP libraries offer. And for this job, we decided to go with spaCy [3], which offers a very simple and flexible API to develop many NLP tasks, and one of them being Named Entity Recognition. The design pipeline could be conceptualized with the below diagram:

Solution pipeline

Before we start the design of our model we have to first construct the training data set. It will essentially consist of annotated blocks of texts or “sentences” that contain the value of the entity it represents and the entity name itself. For that, firstly we extract the text from the paragraphs where the desired information is present by making use of the previously used extraction tools. Then we annotate this text with the found categories, by providing the starting position and length of the word in the text. Doing so, we also provide some context to the model by keeping the nearby words around the annotated word. This whole information retrieved from the financial statements can then be easily stored in a CSV file. In SpaCy this would be represented with the below structure:

TRAIN_DATA = [    ("Cardo AI SRL is a fintech company", {"entities": [(0, 12, "Company")]}),    ("Company is based in Italy", {"entities": [(20, 25, "LOC")]})]

After we prepared our dataset, we then decided to design the NLP model by choosing between the alternatives that spaCy provided. We started from a blank non-trained model and then made an outline of the input and output of the model. We split the data into train and test sets and then started training the model, following this pipeline. From the training data, the text is firstly tokenized using the Doc module, which basically means breaking down the text into individual linguistic units, and then the annotated text inputted with the supported format is parsed with the GoldParse module to then be fed into the training pipeline of the model.

Training pipeline

Results and constraints

After training the model on about 800 input rows and testing on 200, we got these evaluations:


The evaluation results seemed promising, but that may have come also from the fact that the model was over-fitting or there was not a lot of variability in our data. After our model was trained, all we had to do was feed it with text data taken from the input reports after they had been divided into boxed paragraphs and expect the output represented as a key-value pair.


Lack of data
– in order to avoid bias or over-fitting the model should be trained on a relatively large amount of data.
– acquiring all this data is not an easy process, adding here the data pre-processing step.
Ambiguous output
– the model may output more than one value per entity, which leads to inconsistency in interpreting the results.
Unrecognizable text
– the financial statements have poorly written text not correctly identifiable during the text data extraction tools recognition.
Numerical values
– having lots of numerical values in the reports it is hard to distinguish the real labels they represent.

Potential future steps toward better Financial Data Extraction

In recent years, convolution neural networks have shown great success in various computer vision tasks such as classification and object detection. Seeing the problem from a computer vision perspective as a document segmentation (creating bounding boxes around the text contained in the document and classifying it into categories) is a good approach to proceed on with. And for that, the magic formula might be called “RCNN” [4].

By following this path, we might be able to resolve the above-mentioned issues we ended up with our solution. Integrating many different approaches together may also improve the overall accuracy of the labeling process.

After the solution process is stable and the model’s accuracy is satisfactory we need to streamline this whole workflow. For a machine learning model, it is important to be fed with an abundant amount of new data which improves the overall performance and predicts future observations more reliably. In order to achieve that, it comes necessary to build an Automated Retraining Pipeline for the model, with a workflow displayed as the following diagram:

Workflow diagram


We went through and reviewed a couple of different approaches we attempted on solving the Named-Entity Recognition task on Financial Statements. And from this trial and error journey, it seemed that the best method was solving it using Natural Language Processing models trained with our own data and labels.

But despite seemingly obtaining satisfactory results in our case study regarding financial data extraction, there is still room for improvement. The one thing we know for sure is that the above Machine Learning approach provided us the best results and following the same path on solving this task is the way to go. Machine Learning is very close to reaching super-intelligence and with the right approach to present problems in every domain, it is becoming a powerful tool to make use of.


[1] Better Language Models and Their Implications

[2] Object Detection with Deep Learning: A Review

[3] spaCy Linguistic Features, Named Entity

[4] Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Subscribe to our Newsletter

The ability to operate with technology and true intelligence at speed can be the deciding factor in success or failure in private market investments.

Start lowering your costs, scale faster and use more data in your decisions. Today!

Our Offices
  • Milan:
    Via Monte di Pietà 1A, Milan, Italy
  • London:
    40 New Bond St, London W1S 2DE, UK
  • Tirana:
    Office 1: Rruga Adem Jashari 1, Tirana, AL
    Office 2: Blvd Zogu I, Tirana, AL

Copyright Cardo AI 2021. All rights reserved. P.IVA: 10357440964