Data extraction from Financial Statements with Machine Learning

Data is the foundation that drives the whole decision-making process in the finance ecosystem. With the growth of fintech services, collecting this data has become far more accessible, and it has become necessary for data scientists to develop information extraction tools that automatically fetch and store the relevant data. Doing so facilitates information retrieval that was previously done manually, a tedious and time-consuming task.

One of the main sources of this "key knowledge" in finance is the financial statement, which offers important insight into a company through its performance, operations, cash flows and balance sheets. While this information is usually provided in text-based formats or in data structures like spreadsheets or data frames (which can easily be consumed using parsers or converters), it sometimes arrives as other document formats or images in a semi-structured fashion, which also varies between sources. In this post we will go through the different approaches we used to automate information extraction from financial statements provided by different external sources.

Problem introduction

In our particular case, the data consisted of semi-structured financial statements provided as PDFs and images, each one following a particular template layout. These financial statements contain relevant company-related information (company name, industry sector, address), different financial metrics and balance sheets. The extraction task consists of retrieving the relevant information from each document for every entity class and storing it as a key-value pair (e.g. company_name -> CARDO AI). Since the composition of information differs between documents, we end up with different clusters of them. Going even further, we observe that the information inside a document itself is represented by different types of data (text, numerical, tables, etc.).

Sample of a financial statement containing relevant entity classes

Two main problems emerge here: first, we have to find an approach that solves the task for one type of document; second, by inductive reasoning, we have to form a broader approach that applies to the whole data set. Note that we are trying to find a single solution that works in the same way for all of this diverse data; treating every single document separately with a different approach would miss the whole point of the task at hand.

“An automated system won’t solve the problem. You have to solve a problem, then automate the solution”

Methodology and solution

Text extraction tools

We started with text extraction tools: Tabula for tabular data, and PDFMiner and Tesseract for text data. Tabula scrapes tabular data from PDF files, while PDFMiner and Tesseract extract text data from PDFs and images respectively. These tools work by converting pieces of text in visual representations (PDFs and images) into textual data (document text). The issue with Tabula was that it only worked on tabular data, whereas the most relevant information in our financial documents is not always represented in tabular format.
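As an illustration, here is a minimal sketch of how these three tools are typically driven from Python, assuming the tabula-py, pdfminer.six, pytesseract and Pillow packages; the file names are placeholders:

import tabula  # tabula-py, a wrapper around the Tabula Java library
import pytesseract
from pdfminer.high_level import extract_text
from PIL import Image

# Tabula: scrape tables from a PDF into a list of pandas DataFrames
tables = tabula.read_pdf("statement.pdf", pages="all")

# PDFMiner: extract the raw text layer of a native PDF
pdf_text = extract_text("statement.pdf")

# Tesseract: OCR the text out of a scanned page image
image_text = pytesseract.image_to_string(Image.open("statement_page.png"))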

Meanwhile, when we applied the other tools, PDFMiner and Tesseract, the output raw text was completely unstructured and not human-readable (with unnecessary whitespace and garbled words containing special characters). This text was hard to break down into the meaningful entity classes we wanted to extract. That was clearly not enough, so we had to explore other approaches.

GPT-2

Before moving on, we made an effort to pre-process the text output by the above-mentioned extraction tools, and for that we tried GPT-2 [1]. GPT-2 is a large transformer-based language model with 1.5 billion parameters developed by OpenAI, and it was considered a major breakthrough in the field of NLP. This model, and its successor GPT-3, have achieved strong performance on many NLP tasks, including text generation and translation, as well as several tasks that require on-the-fly reasoning or domain adaptation. In our case, we tried to exploit one of its capabilities: text summarization. After getting a considerable amount of text from the extraction tools, we tried to summarize it with the GPT-2 model and discard the non-relevant information, taking advantage of the attention mechanism of the transformer. This approach did not work well, since unstructured text is very hard to summarize. Apart from that, there would always be the possibility of the model removing important information from the text, and we could not give it the benefit of the doubt in this regard.
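For reference, a minimal sketch of this kind of experiment with the Hugging Face transformers library: GPT-2 has no dedicated summarization head, so the usual trick (used for zero-shot summarization in the GPT-2 paper) is to append a "TL;DR:" prompt and let the model generate a continuation. This illustrates the idea rather than our exact setup:

from transformers import pipeline

# GPT-2 as a plain text generator; summarization is induced via the prompt
generator = pipeline("text-generation", model="gpt2")

raw_text = "...raw output of the PDF/OCR extraction tools..."
prompt = raw_text + "\nTL;DR:"

result = generator(prompt, max_new_tokens=60, do_sample=False)
summary = result[0]["generated_text"][len(prompt):]  # keep only the continuation
print(summary)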

Bounding boxes relationship – OpenCV

The unpromising results of the above approaches made us entertain the idea of treating this as an object detection task using computer vision. Object detection outputs a bounding box around each object of interest along with a class label. We could then construct a relationship graph between these "boxed" entities [2] (see image above). Following this method, we tried to do the same with our documents, but instead draw boxes around text that represents an identifiable entity and label each box with the name of the entity it contains. The next step would have been to develop an algorithm that computes a metric representing the relationship between these boxes based on their spatial positions. We could then train a machine learning model that learns from these relationship values and sequentially decides the position of the next entity given the document locations of the previous ones.

The model creates a relationship graph between entities
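As a sketch of the first half of that idea, assuming OpenCV (cv2) and a scanned page image: binarize the page, dilate the ink so neighboring characters merge into blocks, and take contour bounding rectangles as candidate text boxes; the distance between box centers is one possible relationship metric:

import cv2
import numpy as np

image = cv2.imread("statement_page.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarize (text becomes white on black) and dilate to merge characters into blocks
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
dilated = cv2.dilate(binary, kernel, iterations=1)

# Each external contour becomes a candidate text box (x, y, width, height)
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]

def center_distance(a, b):
    """Euclidean distance between box centers, a simple relationship metric."""
    (ax, ay, aw, ah), (bx, by, bw, bh) = a, b
    return np.hypot(ax + aw / 2 - bx - bw / 2, ay + ah / 2 - by - bh / 2)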

However, this was not an easy task: it is very hard to determine the right box representing a distinct, meaningful component of the report, and, as mentioned above, different documents follow different layouts, so the position of the information we want to extract is arbitrary within the document. As a result, the previously mentioned algorithm might be inaccurate in determining the position of each box. We moved on to seek a better plan.

Named Entity Recognition

An example of NER annotation

Named-entity recognition (NER) is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

During our research into new approaches for this task, we came upon the expression "named entity", which generally refers to entities for which one or more strings, such as words or phrases, stand consistently for some referent. We then discovered Named Entity Recognition: the task of locating and classifying words from an unannotated block of text into predefined categories. A common approach to this task is to use deep learning NLP models, alongside older linguistic grammar-based techniques. Conceptually the task is divided into two distinct problems: detecting names and classifying them into the category they fall into. Hence we started to look at different implementations with which to design our language model.

NLP model – spaCy

At this point, our path was pretty straightforward to follow thanks to the ease of use that NLP libraries offer. For this job we decided to go with spaCy [3], which offers a very simple and flexible API for many NLP tasks, one of them being Named Entity Recognition. The design pipeline can be conceptualized with the diagram below:

Solution pipeline

Before designing the model, we first have to construct the training data set. It essentially consists of annotated blocks of text, or "sentences", that contain the value of each entity and the entity name itself. First we extract the text from the paragraphs where the desired information is present, using the previously mentioned extraction tools. Then we annotate this text with the found categories by providing the start and end positions of the word in the text. Doing so, we also provide some context to the model by keeping the nearby words around the annotated word. All the information retrieved from the financial statements can then easily be stored in a CSV file. In spaCy this is represented with the following structure:

TRAIN_DATA = [
    ("Cardo AI SRL is a fintech company", {"entities": [(0, 12, "Company")]}),
    ("Company is based in Italy", {"entities": [(20, 25, "LOC")]}),
]

After we prepared our dataset, we designed the NLP model by choosing between the alternatives spaCy provides. We started from a blank, untrained model and outlined the input and output of the model. We split the data into train and test sets and then trained the model with the following pipeline: the text from the training data is first tokenized using the Doc module, which breaks the text down into individual linguistic units, and the annotated text in the supported format is parsed with the GoldParse module before being fed into the training pipeline of the model.

Training pipeline
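A minimal sketch of such a training loop, using the spaCy v2-style API that the Doc/GoldParse workflow implies (spaCy v3 later replaced it with a config-driven system); the epoch count and dropout below are illustrative:

import random
import spacy

nlp = spacy.blank("en")        # blank, untrained English pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Register every entity label that appears in the annotations
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # update() builds the Doc and GoldParse objects internally
        nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
    print(epoch, losses)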

Results and constraints

After training the model on about 800 input rows and testing it on 200, we got the following evaluation results:

The evaluation results seemed promising, but that may also have come from the model over-fitting or from a lack of variability in our data. Once the model was trained, all we had to do was feed it text taken from the input reports (after they had been divided into boxed paragraphs) and read off the output as key-value pairs.
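Inference then reduces to a few lines; a sketch, assuming the trained pipeline from above:

doc = nlp("Cardo AI SRL is a fintech company based in Italy")

# Collect the recognized entities as key-value pairs
extracted = {ent.label_: ent.text for ent in doc.ents}
print(extracted)  # e.g. {"Company": "Cardo AI SRL", "LOC": "Italy"}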

Constraints

Lack of data
– to avoid bias or over-fitting, the model should be trained on a relatively large amount of data
– acquiring all this data is not an easy process, and the data pre-processing step adds further effort
Ambiguous output
– the model may output more than one value per entity, which leads to inconsistency in interpreting the results
Unrecognizable text
– the statements contain poorly written text that the extraction tools do not identify correctly
Numerical values
– with many numerical values in the reports, it is hard to distinguish which labels they actually represent

Potential future steps

In recent years, convolutional neural networks have shown great success in various computer vision tasks such as classification and object detection. Seeing the problem from a computer vision perspective as document segmentation (creating bounding boxes around the text contained in the document and classifying it into categories) is a promising way to proceed, and for that the magic formula might be called Faster R-CNN [4]. By following this path, we might be able to resolve the issues our current solution left open. Integrating several different approaches together may also improve the overall accuracy of the labeling process.
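A sketch of where one might start, assuming PyTorch and torchvision: load a Faster R-CNN pre-trained on COCO and run it on a page image. The COCO classes are of course not document entities, so in practice the detection head would have to be fine-tuned on annotated document layouts:

import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = transforms.ToTensor()(Image.open("statement_page.png").convert("RGB"))
with torch.no_grad():
    predictions = model([image])  # one dict per input image

# Candidate regions: bounding boxes, class labels and confidence scores
print(predictions[0]["boxes"], predictions[0]["labels"], predictions[0]["scores"])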

Once the solution process is stable and the model's accuracy is satisfactory, we need to streamline the whole workflow. It is important for a machine learning model to be fed an abundant amount of new data, which improves overall performance and makes predictions on future observations more reliable. To achieve that, it becomes necessary to build an automated retraining pipeline for the model, with a workflow as displayed in the following diagram:

Conclusion

We went through and reviewed the different approaches we attempted for solving the Named Entity Recognition task on financial statements. From this trial-and-error journey, the best method turned out to be Natural Language Processing models trained on our own data and labels. Despite the seemingly satisfactory results in our case study, there is still room for improvement. The one thing we know for sure is that the Machine Learning approach above gave us the best results, and following the same path is the way to go for this task. Machine Learning keeps advancing at a remarkable pace, and with the right approach to the problems in each domain it is becoming a powerful tool to make use of.

References

[1] Better Language Models and Their Implications

[2] Object Detection with Deep Learning: A Review

[3] spaCy Linguistic Features, Named Entity

[4] Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

EU Taxonomy: A tool for ESG transition or a nightmare?

A practical guide to the main steps, and case study results, on how to comply with the EU Taxonomy

The EU Taxonomy is one of the most significant developments in sustainable finance and will have wide-ranging implications for investors and issuers working in the EU and beyond. This tool helps investors navigate the transition to a low carbon, resilient and resource-efficient economy by assessing to what degree investment portfolios (both equity and fixed income) are aligned with the European environmental objectives:

  1. Climate change mitigation
  2. Climate change adaptation
  3. Sustainable use and protection of water and marine resources
  4. Transition to a circular economy
  5. Pollution prevention and control
  6. Protection and restoration of biodiversity and ecosystems

As with all other regulations, nothing comes easy. Applying the Taxonomy requires a five-step approach that in reality becomes seven or eight steps, depending on data availability, internal ESG readiness and the disclosure level of the invested companies.

One of the key disclosures that investors need to make is the definition of the proportion of underlying investments that are Taxonomy-aligned, expressed as a percentage of the investment, fund, or portfolio – including details on the respective proportions of enabling and transition activities.

Source: Cardo AI analysis

To do that, investors need to go through the following steps:

Step 0 – Translate every sector/industry classification system to NACE economic activity code

To determine eligible economic activities under the Taxonomy, investors need to map the classification system currently in use, e.g. NAICS or BICS (already mapped by the Taxonomy), GICS, ICB, SIC, TRBC, etc., to the European industry classification system (NACE).
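In practice this step boils down to a lookup table; a sketch in Python, where the sample codes are illustrative entries and the official correspondence tables should be used in a real implementation:

# Hypothetical fragment of a NAICS -> NACE correspondence table
NAICS_TO_NACE = {
    "522291": "K64.92",  # Consumer lending -> Other credit granting
    "221114": "D35.11",  # Solar electric power generation -> Production of electricity
}

def to_nace(naics_code: str) -> str:
    """Translate a NAICS activity code into its NACE counterpart."""
    return NAICS_TO_NACE[naics_code]

print(to_nace("522291"))  # K64.92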

Step 1 – Break down the invested company's sectors by turnover or capex (and, if relevant, opex) to determine if these activities are listed in the Taxonomy

Starting from turnover, investors need to be able to break down the sectors of activity to which the company's funds are allocated (both equity and fixed income). Subsequently, they map these sectors to the Taxonomy list and flag the sectors that are present.

Step 2 – Validate if the companies meet the substantial contribution criteria

Every company in the portfolio is required to validate whether or not each economic activity meets the substantial contribution criteria for the climate mitigation and/or adaptation objectives. Substantial contribution is assessed through different screening tests carried out against a collection of thresholds by sector.

To do that, investors need to have the data reported by the companies ready at hand. In case data is not available – due to lack of reporting – an estimation of the data point or an approximation of the threshold could be a solution for determining the substantial contribution.
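A sketch of what one such screening test looks like in code; the 100 gCO2e/kWh figure is the TEG's illustrative threshold for electricity generation and is used here purely as an example:

# Illustrative sector thresholds for the climate mitigation objective
# (the metric and unit vary by sector; the value below follows the TEG report example)
THRESHOLDS = {
    "electricity_generation": 100.0,  # gCO2e/kWh, declining over time
}

def substantial_contribution(sector: str, reported_intensity: float) -> bool:
    """Screen one economic activity against its sector threshold."""
    return reported_intensity <= THRESHOLDS[sector]

print(substantial_contribution("electricity_generation", 80.0))  # True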

Step 3 – Validate if the companies meet the “do no significant harm” criteria

The third step requires investors to conduct a due-diligence-type process to verify that the company's activities do no significant harm to the other environmental objectives, using a set of qualitative and quantitative tests. These are typically applied at the company level, looking at the production process or at the use phase and end-of-life treatment of the products.

Step 4 – Control if there are any violations of the social minimum safeguards

Investors need to conduct due diligence to control for any negative impacts on the minimum safeguards related to the UNGP (United Nations Guiding Principles on Business and Human Rights), OECD (Organisation for Economic Co-operation and Development) and ILO (International Labour Organization) conventions. The OECD guidelines for MNEs (Multinational Enterprises), which ensure compliance with the qualitative DNSH criteria and the minimum safeguards, are recommended for the due diligence process.

Step 5 – Calculate the alignment of investment with the Taxonomy and prepare disclosure at the investment product level

Once the previous steps have been completed and the aligned portions of the companies in the portfolio have been identified, investors can calculate the alignment of their funds with the Taxonomy. As an example, if 10% of a fund is invested in a company that makes 10% of its revenue from Taxonomy-aligned activities, that investment contributes 1% to the fund's Taxonomy alignment, and so on.
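The arithmetic is a weighted sum; a short sketch with made-up portfolio figures, reproducing the 10% x 10% = 1% example above:

# (portfolio weight, company's Taxonomy-aligned revenue share) per holding
holdings = [
    (0.10, 0.10),  # 10% of the fund, 10% aligned revenue -> 1% contribution
    (0.25, 0.40),  # 25% of the fund, 40% aligned revenue -> 10% contribution
]

fund_alignment = sum(weight * aligned for weight, aligned in holdings)
print(f"{fund_alignment:.1%}")  # 11.0%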

Assessing a portfolio for Taxonomy alignment

Source: Taxonomy: Final report of the Technical Expert Group on Sustainable Finance (March 2020)

A great initiative has been the set of case studies on how to use the EU Taxonomy shared by the PRI (Principles for Responsible Investment: an investor initiative in partnership with the UNEP Finance Initiative and UN Global Compact). Starting in late 2019, over 40 investment managers and asset owners worked to implement the Taxonomy on a voluntary basis, in anticipation of the upcoming European regulation.

Here is a summary of some of the case studies:

For more details on these case studies and practical EU Taxonomy implementation recommendations, click here!

What you should know about being a developer in a Fintech startup…

Upon finishing my Bachelor's, I never imagined I would find myself at the crossroads between the digital lending market and software engineering, the former being just one part of the broad Fintech domain. I settled into frontend development with no prior knowledge of the tools I am using today, and although it has been barely a year since I started developing for CardoAI, I would like to share my thoughts and experience on what it means to be a software developer in a Fintech startup.

You will not be able to understand everything at once

There are a lot of business and financial terms that will take time to get accustomed to, and as a developer you need to learn to distinguish between different data visualizations and how best to display the 'odd' financial information you are just starting to learn. This is part of the process of immersing yourself in the world of Fintech, so give it some time: you will not understand everything overnight, even if you have some previous background in finance and economics.

Ask questions A LOT

Always ask questions. Everyone has a distinct learning path, so I do not think there are any wrong questions. Especially when developing for a rapidly changing market with the latest technologies, you are more likely to be judged for not asking questions. I have found pair programming and productive brainstorming sessions around a new requirement to be particularly helpful for sharing knowledge between team members.

Put your customers first

Being part of a startup is a significant experience that gives the team a different mindset on how to approach the product and, most importantly, your customers. A customer-centered approach is highly emphasized and is the cornerstone of creating positive and meaningful relationships with customers. The entire planning process and development cycle are adjusted to serve your customers' needs: what they prioritize and what would give them a competitive advantage in the market. Once you start thinking like the customer and embrace the product as your own, you will start to identify problems and even hidden opportunities, proactively improving the software without waiting for a customer request.

Challenges will help you grow

A software developer in Fintech should always expect the unexpected. There will be challenges waiting at every corner, but this should not demotivate you; quite the contrary, accept the new challenges coming your way and use them to your advantage, as opportunities to grow. When dealing with the look and feel of the application, a new challenge may also help you unravel your inner creative spirit and force you to think outside the box.

Flying Airplanes While We Build Them

Being part of a startup in a Fintech environment means that nothing is certain. The market is extremely volatile and everything changes rapidly. Although software developers are not necessarily interested in the business side of things, I think the involvement of the tech team in understanding the business requirements is one of the greatest strengths a startup can have. By recognizing this constant change in the Fintech domain, we as developers stay one step ahead in delivering highly demanded features, and we perceive their importance by putting ourselves in our customers' shoes. "Flying airplanes while we build them" – the catchphrase of CardoAI – best summarizes the challenging environment we face.

Nevertheless, as in any industry, there will be opportunities to capitalize on and challenges to overcome. As a developer, you will learn most things through a hands-on approach; however, there are gaps that could have been filled if universities prepared their students better for what to expect upon graduating. In my opinion, it is important that curricula are updated and adapted around the latest technologies most companies are working with, and that general financial and accounting courses become compulsory for engineering degrees.

Having said that, newcomers should not be frightened to join; quite the contrary, there are plenty of chances to learn, and even with limited knowledge and skills there will always be people willing to help.

MILAN FINTECH SUMMIT – Cardo AI, one of the top startups with the highest market potential

MILAN FINTECH SUMMIT: A SELECTION OF THE BEST OF ITALIAN AND INTERNATIONAL INNOVATION
Among the over 70 candidates from 18 countries, 10 Italian and 10 international companies were selected based on their market potential

Milan, 23 November 2020 – The Fintech companies deemed to have the highest market potential will be the protagonists of the second day of the Milan Fintech Summit, the international event dedicated to the world of financial technology, scheduled as a live stream on 10 and 11 December 2020. It is promoted and organised by Fintech District and Fiera Milano Media – Business International, supported by the City of Milan through Milano&Partners, and sponsored by AIFI, Assolombarda, Febaf, ItaliaFintech and VC Hub.

Following an international call and a careful selection by sector experts, including Conference Chair Alessandro Hatami and representatives of the organizing committee, the 20 companies that will take the digital stage to present their ideas and solutions for the future of financial services were announced today. Among the over 70 candidates from 18 countries, 10 Italian and 10 international companies were selected.

The Italian companies are: the insurtechs Neosurance, See Your Box and Lokky; WizKey, Soisy, Cardo AI, Stonize and Faire Labs, operating in the lending and credit sector; Trakti, offering cybersecurity solutions; and Indigo.ai, dealing with artificial intelligence.

The international companies selected are: the insurtechs Descartes Underwriting and Zelros (France); Keyless Technologies (UK), CYDEF – Cyber Defence Corporation and Tehama (Canada), dealing with DaaS and cybersecurity; Privasee (UK), operating in data market protection; Pocketnest (USA), a SaaS company; the wealth manager Wondeur (France); DarwinAI (USA), operating in the artificial intelligence sector; and Oper Credits (Belgium), operating in the lending and credit field.

These companies, which will be introduced to a parterre of selected Italian and international investors and fintech experts, were chosen based on the following criteria: innovativeness of the proposal, potential size of the target market, scalability, capital-raising potential, and the type of technological solution employed.
The Milan Fintech Summit will thus help introduce the potential of our fintech companies abroad, reinforcing the role of Milan as a European capital of innovation and an ideal starting point for international companies that want to enter the Italian market.

FINTECH ACCESSIBLE TO EVERYONE, AN OPEN DOOR EVENT
The program of the event is available on the official site, and a physical edition of the summit is already scheduled for next year, on 4 and 5 October 2021. The December sessions are open to anyone interested in getting to know and understanding in depth the potential of fintech. You can register for free using this link, or purchase a premium ticket to attend the pitch session as a listener (the only closed-door part of the program) and be entitled to other benefits offered by the Summit partners.

Fintech District
Fintech District is the international reference community for the fintech ecosystem in Italy. It acts with the aim of creating the best conditions to help all stakeholders (start-ups, financial institutions, corporations, professionals, institutions, investors) operate in synergy and find opportunities for local and international growth. The companies that decide to join share a tendency to innovate and the will to develop collaborations based on openness and sharing. The community now consists of 160 start-ups and 14 corporate members that have chosen to participate in the creation of open innovation projects by collaborating with fintechs. Fintech District also has relationships with equivalent innovation hubs abroad to multiply the opportunities to invest and cooperate, establishing its role as the access door to, and reference point for, the Italian market. Created in 2017, Fintech District is based in Milan in Palazzo COPERNICO ISOLA FOR S32, Via Sassetti 32. Fintech District is part of Fabrick.
