ESAs release updated statement on SFDR and Taxonomy disclosure

On the 24th of March 2022, the European Supervisory Authorities (ESAs) published an updated joint statement on the application of the Level-2 disclosures required under the SFDR (Regulation 2019/2088, the Sustainable Finance Disclosure Regulation) and the Taxonomy Regulation (Regulation 2020/852).

In this article, we will look at what the ESAs’ statement entails and what the main takeaways from the update are.

Why the ESAs released an updated statement on SFDR and Taxonomy Disclosure 

The ESAs comprise the European Securities and Markets Authority (ESMA), the European Banking Authority (EBA), and the European Insurance and Occupational Pensions Authority (EIOPA). Together, their objective is to supervise the EU financial markets and develop common regulatory frameworks along with the national supervisory authorities of the member states.

The aim of the update is to prevent the “divergent application” of both Regulations in the interim period from 10 March 2021 (the application date of SFDR Level-1 disclosure) to 1 January 2023, the estimated application date of the Regulatory Technical Standards (“RTS”), the so-called “Level-2” disclosure under the SFDR.

The first draft of the RTS was finalized by the ESAs in February 2021, but on the 8th of July 2021, the European Commission (EC) announced its intention to bundle all RTS (including the new provisions introduced by the EU Taxonomy Regulation) in one delegated act to be applied from 1 July 2022. The ESAs submitted the draft bundled RTS in October 2021, but in a later letter, the EC announced that their application would be further delayed to the 1st of January 2023.

The required disclosures in the interim period

Let’s now look at the main disclosures that are required by financial market participants (FMPs) in this interim period.

Entity-level disclosure: getting ready for PAIS (Principal Adverse Impacts on Sustainability)

Entity-level disclosure on the consideration of PAIS within investment decisions (art.4 SFDR) has applied since March 2021 on a comply or explain basis, except for FMPs exceeding the criterion of an average of 500 employees, for which the statement on due diligence policies with respect to those impacts has been mandatory since 30 June 2021.

FMPs that have published such statements face additional disclosure requirements following art.4 (2) SFDR (currently listed in the draft RTS). On top of extensive qualitative information, FMPs should disclose at least 16 ESG indicators as listed in Annex I. Such additional detailed disclosure is due by 30 June 2023 for the 2022 reference period.

Product-level Taxonomy-alignment disclosure: interim clarity or confusion?

The ESAs are clear in stating that the delay of the application of the RTS to January 2023 does not affect Taxonomy-alignment disclosure, which nonetheless applies from January 2022. This means that financial products claiming to be article 8 or article 9 compliant must disclose the share of Taxonomy-aligned investments for climate change mitigation and/or adaptation.

The ESAs’ expectation in this interim period is that, in order to comply with Taxonomy Regulation (TR) provisions, FMPs should provide in pre-contractual disclosure an explicit percentage of investments that are Taxonomy-aligned, together with a “qualitative clarification” explaining how the final figure was determined, “for example by identifying the sources of information”. No other information should be provided according to the ESAs. However, this is basically what the current article 8 and article 9 pre-contractual templates require, as available within the draft RTS. Detailed website disclosure about the data sources and the methodologies used to calculate the Taxonomy-alignment of article 8 and article 9 products is also required starting January 2023.
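To make the idea of an “explicit percentage” concrete, here is a minimal sketch of a value-weighted alignment figure together with the list of information sources behind it. It is purely illustrative: the `Holding` structure, the field names, and the sample figures are assumptions made for this example, and the actual calculation methodology is the one defined in the draft RTS, not this code.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    name: str             # hypothetical investee label
    market_value: float   # position value in EUR
    aligned_share: float  # assumed Taxonomy-aligned fraction of the investee (0 to 1)
    source: str           # where the aligned_share figure came from


def taxonomy_aligned_percentage(holdings):
    """Value-weighted share of the portfolio that is Taxonomy-aligned, in percent."""
    total = sum(h.market_value for h in holdings)
    if total == 0:
        return 0.0
    aligned = sum(h.market_value * h.aligned_share for h in holdings)
    return 100.0 * aligned / total


def sources_used(holdings):
    """Distinct information sources, for the qualitative clarification."""
    return sorted({h.source for h in holdings})


# Made-up sample portfolio, for illustration only.
portfolio = [
    Holding("Issuer A", 40_000_000, 0.25, "company report"),
    Holding("Issuer B", 60_000_000, 0.00, "third-party provider"),
]

print(f"Taxonomy-aligned: {taxonomy_aligned_percentage(portfolio):.1f}%")
print("Sources:", ", ".join(sources_used(portfolio)))
```

Keeping the source next to each data point is a design choice that makes it straightforward to explain afterwards how the final figure was determined.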

Challenges and concerns surrounding the ESAs’ statement

What raises doubts in the market is the ESAs’ assertion that “estimates should not be used”, and that where data from investee companies are not publicly available, such data should be obtained either directly from investee companies or from third party providers. However, as the Head of Eurosif told Responsible Investor: “They say you shouldn’t rely on estimates, but in practice, we know as long as companies are not reporting the data points, you cannot avoid using some estimates, and therefore data providers”. Furthermore, as he correctly points out, it’s not like data providers have access to sustainability data that “has been hidden from everybody else”; they too rely on their own internal estimates.

It is also hard to reconcile the ESAs’ discouragement of the use of estimates with their encouragement to use the draft RTS in this interim period, since articles 40 and 53 explicitly contemplate disclosure about the proportion of data that is estimated for both article 8 and article 9 financial products.

Data challenges are particularly severe for investors with large exposures to investee companies not subject to non-financial reporting requirements, such as SMEs and small-mid corporates. Asset managers with significant exposures to private debt will very likely always need good estimates, since they are currently neither well served by ESG data providers nor in a position of enough leverage over investee companies to require additional direct disclosure.

As the EBA’s Banking Stakeholder Group correctly pointed out in the feedback provided during the public consultation period, the finalized draft RTS fails to adequately integrate the proportionality principle, as it does not envision specific provisions depending on FMPs’ size and type of activity.

ESAs’ updated statement: looking ahead

While the European Commission’s final adopted Delegated Regulation could differ from the current draft RTS, it is important that financial market participants make good use of 2022 to prepare for the detailed reporting requirements that article 8 and article 9 financial products face starting January 2023.

Do you want to know more about the latest news on ESG regulation? Visit our dedicated page on navigating ESG in the private debt market!

Beyond Excel: Risks of using spreadsheets to manage your private debt investments

What are the risks of using spreadsheets? Is there a better way to manage your private debt investments? Is it worth it to switch from traditional but consolidated processes involving spreadsheets to more advanced technologies?

After reading this article, you will understand the main risks related to the use of spreadsheets when managing private debt investments and explore the real need for modern technologies in the market.

Private debt: a growing asset class

In the last decade, private debt has grown significantly – with an average increase in assets under management (AuM) of 13.5% each year, according to the Preqin 2022 Report. The positive trend is showing no signs of slowing down, and the asset class is expected to reach $2.69tn in AuM by 2026.

The main reason for this growth has been the attractiveness of private debt as an asset class. Low volatility, low correlation with other asset classes, higher yield prospects, and the floating rate of loans are only some of the benefits that this asset class can offer.

But operating in the private debt industry has different challenges. Operational structures are complex and very frequently non-standard. By definition, you win when you are continuously adapting to dynamic market shifts with innovative financial and deal structures that are very difficult to standardize. In addition to that, with increased competition for deals, the need for speed and intelligence becomes paramount.

Private Debt Market in 2021

Risks of using spreadsheets: billions of dollars managed with outdated tools

Every day, agents in the private debt market manage billions of dollars in credit investments using spreadsheets, manual workarounds, PDFs, and Word documents.

Spreadsheets have become the go-to tool for managing these complex structures. They are widely used to model data and handle several important processes, from evaluations, pricing, cash-flow reconciliations, and portfolio monitoring to capital allocations and reporting. At the end of the day, spreadsheets are immediate, flexible, simple, and easy to use. While they may seem like an easy and painless solution, there are actually a lot of hidden risks and costs. Let’s take a closer look at the top six risks of using spreadsheets.

The initial purpose of the electronic spreadsheet was to replace paper-based systems in the business world. Originally developed for accounting or bookkeeping tasks, spreadsheets provided users with a simple way of calculating values.

Since their invention, spreadsheets have evolved into more complex products with many features and enhancements, which are now used for a vast array of tasks by millions of companies around the world.

However, spreadsheets were not developed for the investment management industry and even less for the level of security needed in handling data.

6 risks of using spreadsheets for private debt investments

1. Prone to errors and mistakes

88% of spreadsheets contain errors. A small mistake can have a snowball effect and a very big impact on business. At JP Morgan, one single error resulted in a $6 billion loss when someone copied and pasted from one spreadsheet to another. 

When Lehman Brothers collapsed in September 2008, few were aware of one related incident when Barclays Capital almost bought Lehman Brothers’ 179 trading contracts by accident. Lehman Brothers filed for bankruptcy on September 15, 2008. Three days later, Barclays Capital offered to acquire a portion of the US bank’s assets, including some of Lehman’s trading positions. As part of the deal, Cleary Gottlieb Steen & Hamilton, the law firm representing Barclays, had to submit the purchase offer to the U.S. Bankruptcy Court for the Southern District of New York’s website by midnight on Sept. 18.

Barclays sent an Excel file containing assets they intended to acquire to Cleary Gottlieb at 7:50 pm on September the 18th, only a few hours before the deadline. The spreadsheet had 1,000 rows and 24,000 cells, including those listing the 179 trading contracts that Barclays did not want to buy. They were, however, hidden instead of being deleted. 

Cleary Gottlieb was tasked with reformatting the Excel file to a pdf document so it could be uploaded to the court’s website. They didn’t pay attention to the hidden rows, which were visible again in the pdf file. The mistake was only spotted on October 1st after the deal had been approved. Cleary Gottlieb then had to file a legal motion to exclude those contracts from the deal.

Another case involves the outsourcing specialist Mouchel, which had to endure a £4.3 million profit write-down due to a spreadsheet error in a pension fund deficit calculation, caused by an outside firm of actuaries. Not only did Mouchel’s profits take a huge hit, but the error also caused its share price to drop and its chairman to resign amid fears the company would break its banking agreements.

Axa Rosenberg, the global equity investment manager, was fined £150 million for covering up a spreadsheet error back in 2011.

Documents detailing the collapse of Enron in 2001, released after the conclusion of all legal proceedings, showed that 24% of the corporation’s spreadsheet formulas contained errors.

Then there is Fidelity’s $2.6bn “minus sign” error. Fidelity’s Magellan fund estimated that it would make a $4.32 per share distribution at the end of 1994. This incorrect forecast happened because an in-house tax accountant omitted the minus sign on a net capital loss of $1.3 billion, turning the net capital loss into a net capital gain and throwing the dividend estimate off by $2.6 billion.

More often than not, just one person in a company knows how the financial spreadsheet models are constructed. Other people are unable to understand, and therefore check, the analysis. The potential for errors is massive.
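The arithmetic behind the Fidelity case is worth spelling out: dropping a minus sign does not just lose the value, it doubles the error. The following short sketch reproduces that arithmetic with the figures quoted above; the function name is hypothetical and chosen only for this illustration.

```python
def sign_error_impact(true_value):
    """Size of the error when a value's sign is dropped (entered as its absolute value)."""
    entered_value = abs(true_value)
    return entered_value - true_value


# A net capital loss of $1.3 billion, as it should have been entered (in USD).
net_capital_loss = -1_300_000_000

error = sign_error_impact(net_capital_loss)
print(f"Estimate off by ${error / 1e9:.1f}bn")
```

A loss of 1.3 billion entered as a gain of 1.3 billion leaves the estimate 2.6 billion away from the correct figure, which is why a one-character slip had such a large effect.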

2. Vulnerable to security threats

Security risks make spreadsheets inefficient for storing clients’ investment data. Critical information cannot be encrypted with spreadsheets – which exposes sensitive data (financial information, social security numbers, etc.) to security breaches. 

As Microsoft itself underlines on the Excel support page: “Worksheet level protection is not intended as a security feature. It simply prevents users from modifying locked cells within the worksheet.”

Excel is not immune to cybersecurity attacks. In 2019, it was discovered that hackers could attack Excel files through Power Query.

This is also confirmed by Cisco, which stated that Microsoft Office formats, including Excel, make up the most prevalent group of malicious file extensions in emails, as attackers can use VBA (Visual Basic for Applications) scripts to create macro malware. 

3. No integration capability

Only those with the right form of data can successfully navigate the market, make future predictions, and adjust their business to fit market trends.

There is no doubt about the acceleration of the digital transformation of our economies and our daily jobs. A lot of data is made available to asset managers and asset owners in the private debt market. But most of the data we handle today is unstructured, which means it comes in different forms, sizes, and even shapes. And spreadsheets cannot manage this type of data. 

Excel was mainly built for independent analysis and single files. Mixing data sources that come from different systems is almost impossible to do in Excel. Importing, exporting, and updating data from other platforms or databases can become an extremely tedious and time-consuming task.

4. No real-time update

With spreadsheets, teams are often operating and making decisions on outdated or simply inconsistent information, as infinite versions of the same file are created and saved on each local computer day after day. Furthermore, there is no way of tracking changes to the files – when mistakes occur they can be difficult to identify and correct in time.

5. No permission controls

When it comes to private debt investment data, not all members of the organization need to access the same information. Specific roles and people within the organization need to be able to access a particular spreadsheet, but not others. Spreadsheets don’t come with tools for granting permissions on a single user level. 

Another drawback of Excel is that you have no visibility of who accesses your data and when, a topic that becomes very relevant considering recent GDPR regulations regarding data privacy.

6. Cannot handle large volumes of data

Spreadsheets are not a natural fit for handling large amounts of data. With just ten thousand rows, the program starts to perform poorly and slows down calculations each time a new formula or macro is added. In other words, the more information you enter into spreadsheets, the more complicated it becomes to manage it all.

Data has always been an essential asset to the growth of any organization. There are 2.5 quintillion bytes of data created every day. Once analyzed, this data can help the private debt industry in a multitude of ways. Just as data helps healthcare avoid preventable diseases by detecting them in their early stages, it could be immensely useful in the private debt sector to predict different patterns of behavior and increase performance or decrease losses. It can also aid in recognizing illegal activities such as money laundering or fraud.

Do any of these issues sound familiar?

If you have ever had to work with spreadsheets to manage your private debt investments, you probably have come across these problems before.

But is there anything better in 2022? What if you could apply the advances in technology and data science to this market? From cloud computing to big data, from machine learning to artificial intelligence, there are many ways technology can make your life easier and better. 

The need for advanced technology in the private debt market

As an operator in the private debt market, you likely have two main objectives: making the right investment decisions and scaling your organization in the most efficient way possible.

However, the tools and technology you are using can have a significant impact on these goals. How so? Keep reading to find out:

Making the right investment decisions

When it comes to private debt, speed is a decisive factor. With the competition going up and fewer deals available in the market, you need to act fast and with confidence.

When using traditional tools like spreadsheets, you spend too much time concentrating on how to get and analyze the data, rather than focusing on what the data is telling you. By the time you get the right information, it might already be too late. How can you make timely decisions when you have data that is several days old?

Scaling the organization

As volume increases, so does operations’ complexity, making it very difficult to scale. While hiring more people is not a sustainable and efficient solution, technology can come to the rescue and support your operations teams with more automated tasks. 

As a result, teams will be able to focus on the operations strategy, rather than crunching and reconciling data in Excel files.

Specific tools for complex activities

When managing private debt investments, there are several critical processes that you just cannot handle efficiently with spreadsheets. Let’s dive into two of the most common activities private debt market actors encounter in their day-to-day: credit risk assessment and reporting.

Credit Risk Assessment for asset managers

When it comes to credit risk assessment for innovative credit products (such as buy now pay later, revenue and inventory financing, and salary financing), product-based risk modeling is needed.

Rating classes, risk scores, or probability of default estimations on a yearly basis are not enough. Credit Risk estimates have to follow the structure of the product itself, with the same speed of innovation. 

What market players need are specialized models that target the product risk, such as delay prediction models, propensity to pay back models, and revenue limit estimations. This is what enables them to make quick and timely decisions, moving with the speed of the market.

Producing reports for external stakeholders

Private debt reporting needs have become quite complex, especially considering disruptive events like Covid-19 and the Russia-Ukraine conflict. Asset owners are increasingly demanding more frequent reporting deliveries and more custom-made structures.

In addition, as the asset class grows so does the scrutiny of regulators, which are increasing the reporting requirements with detailed look-through demands. 

To meet these requirements you need to have a flexible approach to data and reporting deliveries. We will look at two examples of common reporting activities: periodic reports from contractual provisions and regulatory reporting.

Contractual provisions require periodic reports that need to be produced monthly or quarterly. In some cases, reports are a prerequisite to making payments or acquiring new assets. Therefore, speed is critical.

In addition, custom requests arrive from time to time, driven by specific external events (e.g. what is the exposure to industries more exposed to the drawbacks of the pandemic? Or towards the Russia-Ukraine conflict?) or even for internal purposes such as investment committees, audit exercises, etc. 

It can become difficult to accommodate such needs quickly by using spreadsheets as they require custom adjustments each time. 

Regulatory reporting (like ESMA transparency reports) is another activity that must be carried out periodically and requires specific processes, standards, formats, and a dedicated tool to produce the report on time and to a high standard. In some cases, this activity is outsourced to third parties that still handle most tasks manually, so the potential for errors and data breaches remains.

Beyond spreadsheets: a better way to manage private debt investments

In this article, we have analyzed the risks of using spreadsheets and the reasons that make them an unsuitable tool for the private debt industry. 

The good news? There is now a better way to manage your private debt investments.

At Cardo AI, we have been working since 2018 to provide asset managers, banks, and digital lending providers with the speed and accuracy they need in the private debt market. 

With Cardo AI’s proprietary technology, our clients and partners are now able to focus on what really matters, on the real work that has to be done: making good investment decisions.

If you are ready to harness the power of technology and abandon your outdated spreadsheets, discover our products today!

The role of regulation in the ESG Securitization Market – What you need to know

Regulation in the ESG securitization market: what are the needs of the industry actors when it comes to sustainability? In this article, we will review the evolution of the market with regard to ESG from a regulatory perspective. We will also look at the EBA’s recommendations concerning the dedicated framework for sustainable securitization.

Considering that the overall market has mainly focused on the Environmental aspect of ESG, in this article we are going to dive into green securitizations and their related standards.

An overview of the European Securitization Market

Before looking at the role of regulation in the ESG securitization landscape, let’s review the market evolution in the last few years. Looking at the performance of the EU sustainable securitization in 2021, you might think that the market boomed. The issuance of EUR 8bn resulted in a +273% increase in volume compared to 2020 (when it was only EUR 2.1bn).

However, sustainable issuance volume was only 3% of the total of EUR 233.1bn securitizations executed. In the same period, EU sustainable loans represented 25.4% of the total loans market, while sustainable bonds accounted for 20.2% of the total bonds market.
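The two headline ratios above can be reproduced with a few lines of arithmetic. The sketch below uses only the rounded volumes quoted in the text (EUR 8bn, EUR 2.1bn, and EUR 233.1bn); the helper names are made up for this example. Because the inputs are rounded, the computed growth rate will not exactly match the published +273%, which presumably relies on unrounded volumes.

```python
def growth_pct(current, previous):
    """Year-on-year growth in percent."""
    return (current - previous) / previous * 100.0


def share_pct(part, total):
    """Share of a total in percent."""
    return part / total * 100.0


# Rounded volumes quoted in the article, in EUR bn.
sustainable_2021 = 8.0
sustainable_2020 = 2.1
total_securitization_2021 = 233.1

print(f"Growth 2020 to 2021: {growth_pct(sustainable_2021, sustainable_2020):.0f}%")
print(f"Share of total issuance: {share_pct(sustainable_2021, total_securitization_2021):.1f}%")
```

The contrast between the two numbers is the point: a very high growth rate on a small base still leaves sustainable issuance at only a few percent of the overall market.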

While the volume of the US green securitization market is staggering compared to its EU equivalent (USD 115.5bn at the end of Q1 2021), its share of the total securitization market is very similar, standing at 1%. By way of comparison, the Chinese green securitization market recorded, by the end of 2020, 85 ABS deals totaling RMB 115 bn (~EUR 17.5bn), representing less than 1% of the overall Chinese securitization market (RMB 4.3 tln).

European ESG Securitisation Issuance by Asset Class – Source: AFME

These numbers suggest that the depth of the US and Chinese green securitization markets is more related to the size of the overall “domestic” securitization market than to factors directly impacting sustainable issuance. However, the EU green securitization market represents only 1.7% of the overall EU ESG bond market. If we compare this ratio with the US market (above 10% in 2021) and the Chinese market (11% in 2020), the European market seems to be missing something compared with its international peers.

Challenges in the development of the EU sustainable securitisation market

The differences between US/China and Europe can be explained by two main factors:

  • In the US, the most active players in the market have set their own standards for sustainable securitizations. Examples are the green securitization issuance programs of the two mortgage loan companies Fannie Mae (USD 100bn of cumulated green multifamily mortgage-backed securities) and Freddie Mac (over USD 600m).
  • In China, green finance policies focused on securitization have been developed, providing definition, eligibility criteria, verification, and disclosure requirements for green ABS.

As also confirmed by an assessment made by the EBA, the main challenges affecting the development of the EU sustainable securitization market are linked to:

The lack of a market standard 

As already highlighted, the European market lacks an “official” (or at least a widely accepted) definition and related disclosures for sustainable securitizations. This generates an asymmetry of information between market participants and poses a reputational risk in case transactions do not meet ESG requirements and end up being labeled as green or social washing.

The lack of available sustainable assets to collateralise and originate

The lack of common definitions and indicators on which assets have to be considered ESG makes it difficult for originators to identify the sustainable assets within their balance sheets. 

One of the options to identify sustainable assets is to use the EU Taxonomy, which has been in place since January 2022 (even if some reporting requirements will apply from January 2023). However, the Taxonomy is still relatively new for market players, and it is estimated that currently only 7.9% of the total assets of EU banking institutions are EU Taxonomy-compliant. In the absence of a reliable alternative, the potential pool of sustainable assets is therefore very limited, which holds the market back from potential development.

Lack of technology

In addition to the lack of assets, the existing internal reporting and data management systems of most institutional lenders have not been adapted yet to track sustainable assets and provide proper monitoring and reporting.

Based on the evidence from the US and China, and as also confirmed by the EBA, a dedicated framework would address the lack of standard definitions and criteria to be met by a securitization to be labeled as “sustainable”, reducing the greenwashing risk and improving market transparency.

What standard is currently available in the EU Market? 

Looking at the available sustainable standard in the EU market, the only one currently “in place” is the so-called EU Green Bond Standard (hereinafter “EU GBS”).

EU Green Bond Framework - Source European Commission

In July 2021, the EU Commission published a legislative proposal for a Regulation on green bonds (2021/0191 COD) aiming at creating a voluntary, high-quality standard for green bonds (‘EU GBS’) to support the financing of green investments while addressing concerns around ‘greenwashing’. Its main objectives are:

  • Improving the ability of investors to identify high-quality green bonds
  • Facilitating the issuance of green bonds by specifying the eligibility conditions to obtain an EU green bond label, thus reducing potential reputational risks for issuers.

The main characteristics of the EU GBS are:

  • 100% of proceeds to be used to finance EU Taxonomy-compliant investments by the maturity of the bond;
  • Compliance with the EU GBS disclosure framework which includes the publication of 
    • EuGB factsheet together with a pre-issuance review of the factsheet by an external reviewer;
    • An allocation report every year until full allocation of the proceeds;
    • A post-issuance review by an external reviewer of the first allocation report following full allocation of the proceeds;
    • An impact report after the full allocation of the proceeds at least once during the lifetime of the bond;
  • Control by an external reviewer supervised by ESMA

The EU GBS is intended to be applied to any type of bond instrument, including securitizations. 

Is the EU GBS applicable to securitizations?

Even though securitizations fall within the scope of the EU Green Bond Standard, its application to them might have significant drawbacks.

In particular, the use of proceeds criterion in the original EU GBS applies at the issuer level, which in the case of a securitization would be the SPV (or Securitisation Special Purpose Entity, SSPE). This results in a collateral-based approach, since it would have the effect of imposing the purchase of a portfolio made up of 100% EU Taxonomy-aligned assets.

Furthermore, the application of the EU GBS requirements to an SPV would not ensure the effectiveness of administrative sanctions and other measures (including material pecuniary sanctions) in case of breach of the regulation, given the limited legal and economic substance of SSPEs.

Regulation in the ESG Securitization market: EBA’s recommendation on a dedicated framework 

EBA mandates on sustainable finance - Source EBA

Following the mandate provided by the EU Capital Markets Recovery Package in the context of the amendments introduced to the Securitisation Regulation (SECR), on the 2nd of March 2022, the EBA released the Report on Developing a framework for sustainable securitization, with the goal of assessing:

  • the main challenges that affect the development of the EU ESG securitization market;
  • how to define and implement sustainability-related disclosures and due diligence requirements for securitisation products;
  • how to establish a specific framework for sustainable securitisation products drawing upon the EU Taxonomy Regulation and the SFDR;
  • the potential impact of a sustainable securitisation framework on financial stability, the scaling-up of the EU securitisation market and bank lending capacity;

Even if EBA performed an analysis of the market status and the viable options, with regards to the definition of a dedicated framework, it has decided not to move ahead at full steam. 

The decision was taken considering, on one side, the limited availability of “pret-a-porter” green assets and on the other, the limited experience of the market so far. 

The regulator has concluded that too little experience has been gained on the application of ESG factors so far, not only for securitizations but also in the wider bond market, where the EU GBS has not been applied yet, since it is still being examined by the Council of the European Union. The EBA has therefore decided to allow the market to self-develop before taking a definitive stance, in order not to stifle innovation by promoting a solution that doesn’t suit investors’ and issuers’ needs. 

Given the early stage of development of the EU green securitization market and the insufficient amount of green assets, the EBA deemed it too early to define a dedicated green securitization framework for both true-sale and synthetic securitizations.

In the meantime, the EBA has suggested the application of the EU GBS to securitizations with some adjustments aimed at solving the identified drawbacks:

Application of the use of proceeds criterion at the originator level rather than at the SPV level

This requires the originator to invest the transaction’s proceeds in the origination of EU-Taxonomy compliant assets. This approach is not as obvious as it might appear, considering the typical ring-fence structure of securitizations, since its effect is to create obligations on the originator, which would be liable towards the noteholder (and eventually the Regulator).

Definition of additional disclosure requirements

This should allow investors to perform proper due diligence and monitoring of securitization transactions and assess their compliance with ESG criteria (e.g. the green assets ratio, the banking book taxonomy-alignment ratio, and similar KPIs for non-financial institutions). Additionally, disclosure needs to cover the originator in order to make sure that the use of proceeds goes in the right direction and that there is no adverse selection on the securitized assets (i.e. selling the non-green part of the portfolio to keep the high-quality green assets).

Compatibility between EU GBS and transparency requirements

The reporting of the EU GBS disclosure framework should be compatible with the transparency requirements for originators, sponsors, and SPVs foreseen by Article 7 of the SECR. This will ensure that investors have access to the securitization-specific information and prevent the EU GBS and the securitization disclosure requirements from being fragmented between different sources. 

The EBA considers the above only as an interim measure to support the development of the EU market until it is ready for a collateral-based approach. Therefore, a dedicated framework for green securitizations should be re-evaluated at a later stage.

Disclosure requirements applicable to green securitisation

As we have seen, one of the key elements to support the development of a sustainable securitization market is the disclosure concerning the performance, in terms of ESG metrics, of the different parties involved (such as originator, issuer, assets, etc.). 

ESG disclosures in the EU: Financial institutions – Source EBA

Sustainability disclosure is not just a matter of supporting the potential development of the market. There are already several sustainability disclosure frameworks that affect, or are going to affect in the near future, the securitization transactions issued in the European Market:

The Sustainable Finance Disclosure Regulation (Regulation (EU) 2019/2088, SFDR) 

The regulation introduces, for financial market participants, periodic and pre-contractual disclosure requirements on how a specific financial product “promotes environmental or social characteristics” or “has sustainable investments as its objective”. Moreover, it introduces the obligation to provide dedicated disclosure on the principal adverse impacts on sustainability (hereinafter “PAI”) of investments. The PAI can be defined as indicators for assessing the potential impacts on sustainability factors that are “caused, compounded by, or directly linked to investment decisions and advice performed by the legal entity”.

The EU Non-financial Reporting Directive (Directive 2014/95/EU, amending Directive 2013/34/EU, NFRD)

The directive introduces disclosure requirements for large financial and non-financial undertakings with regard to environmental, social and employee matters, respect for human rights, and anti-corruption and bribery matters.

The Regulation on EU GBS

This regulation will introduce the obligation to publish several reports concerning, inter alia, the allocation of the proceeds collected.

The Capital Requirements Regulation (Regulation (EU) No 575/2013)

Especially relevant are Article 449a and the EBA’s Implementing Technical Standards on prudential disclosures on ESG risk, which require banking institutions to provide quantitative and qualitative data on ESG risk and on “green” capital allocation.

Once the pending disclosure regulations for EU-Taxonomy-aligned products are finalized, the EBA deems that the above-mentioned regulations should be complemented with additional disclosure requirements for green securitization transactions.

However, it is vital to ensure that both investors and asset managers also take ESG factors into consideration in securitization transactions that are not labeled as “sustainable”.

In this sense, the amendment to the Securitisation Regulation pushed towards the integration of the sustainability-related disclosure set forth by the SFDR.

The importance of standardised data in sustainability-related disclosure

In October 2021, the European Supervisory Authorities’ Joint Committee published the final draft Regulatory Technical Standards, which introduce specific indicators to determine the PAI of investment products on ESG factors.

Ensuring standardized data for the assessment of the principal adverse impacts of securitization investments is key to improving the sustainability of the overall EU securitization market since it would:

  1. Allow originators and issuers to provide better disclosure on ESG factors
  2. Enhance investors’ ability to perform their due diligence on ESG factors

However, securitization products, and therefore their originators/issuers, do not fall directly within the scope of the SFDR. This has a two-fold negative impact:

  • It does not ensure a standardised source of data for investors, who under the SFDR have to collect “principal adverse impact” data in order to determine whether the securitisation transaction is aligned with their ESG target.
  • It makes securitisation products less attractive to investors, since they cannot be included within their ESG investment strategies unless an independent evaluation is carried out.

Having acknowledged the issues above, the amendments to the Securitisation Regulation (“SECR”) included in the Capital Markets Recovery Package and published on the 31st of March 2021 also covered Articles 22(6) and 26d(6). These articles give the ESAs a joint mandate to develop dedicated Regulatory Technical Standards on the disclosures concerning the sustainability indicators in relation to the adverse impacts on climate and other environmental, social and governance-related dimensions. Such RTSs shall mirror or draw upon the other RTSs already prepared with regard to the SFDR framework.

Thus, it is likely that the PAI indicators and other disclosure requirements identified by the ESAs’ RTSs, which apply to financial products currently falling within the scope of the SFDR, could also be extended to securitization products and their originators.

The EBA’s recommendations don’t just refer to STS transactions currently in scope for ESAs’ RTS, but also involve non-STS transactions in order to unleash the full potential of the sustainable market. Additionally, the EBA deems it appropriate to adjust the ESMA templates in order to ensure that the loan data provided are adequate for the PAI calculation at the transaction level.

It is therefore evident that in the near future, originators and issuers will have to be able to gather a wide array of ESG data in order not to lose access to investors needing that information to comply with the disclosure requirements set forth by the SFDR. 

Regulation and the ESG Securitization Market: Conclusions

There is ongoing turmoil in the ESG world, and securitizations are not excluded, even in good old Europe. Although the European market still lags behind international peers in the number and volume of transactions, the attention that sustainability matters have attracted from practitioners and regulators signals that something is going on. 

The EBA’s recommendations on how to apply the EU GBS to securitizations have been widely welcomed, since they are in line with market needs: they support the market’s expansion without introducing too much regulatory overhead (at least for the moment). The recommendations would also potentially leave space for the development of independent labels or standards. While there are several directions the market and its participants could take in the future on ESG matters (self-definition of a sustainable standard or adaptation to regulatory requests), it is clear that new needs will emerge in terms of data, monitoring, and reporting infrastructure.

On the one hand, the Securitization Regulation is already set to be supplemented with new disclosure requirements on PAIs through the upcoming JC RTSs. On the other hand, investors are developing more and more sophisticated approaches to assess the sustainability risks and impacts of their investments. 

Both of the above require a new and dedicated set of data that will have to prove solid and reliable. Originators and servicers will be requested to disclose ESG data in their periodic reporting and, therefore, a dedicated framework for the retrieval and management of internal and external data will be needed. 

The new solutions that will emerge as “best practices” will not only support new issuances but will also allow originators and investors to assess and disclose the sustainability credentials of their legacy assets and existing securitizations. 

Do you want to know more about navigating ESG issues in the private debt market? Visit our ESG dedicated page where we discuss the main challenges regarding sustainability. 

Visualizing BigO Notation in 12 Minutes

In this article, I want to make a tiny contribution to the community by explaining a very important concept like BigO notation using some power of visualization and my background in math to help you all better understand this topic.

I truly believe that there is nothing you can’t learn if you start from the most fundamentals.

If you combine the right resources with the will to learn, nothing can stand between you and your dreams.

Let’s begin with the definition:

BigO notation is a mathematical notation that describes the limiting behavior of a function when the argument tends toward a particular value or infinity.

How do we understand that?

In computer science, there are almost always several ways to solve a problem. Our job as software engineers is to find the most effective solution and implement it. But what does “effective” mean anyway? Is it the fastest way? Is it the one that takes less space than the others? Actually, it depends. This is purely related to your particular problem.

If you are working on embedded systems with limited memory, say the problem is to calculate the power in watts required to defrost 200 g of meat in a microwave, you can prefer an algorithm that is more memory-efficient and takes 1 s to make this calculation over another one that takes milliseconds but uses much more memory. After all, even if the calculation takes a full second, the defrost process itself will require 10 to 15 minutes.

If we talk about the algorithm that locks missiles to a target airplane, it is clear that we are dealing with milliseconds here and the memory consumption can be sacrificed. The plane is big enough to have free space for some more memory slots.

In general, software engineering is about the trade-offs. A good software engineer has to be aware of the requirements and come up with solutions to fulfill them.

With all that being said, it is now clear that we need to somehow quantify and measure the performance and memory implications of any algorithm. One way to do that is to look at how many seconds an algorithm needs to complete. That can provide some value, but the problem is that if my search algorithm takes 2 to 3 seconds on my laptop with an array of 1000 items, it can take less than that on another, more powerful laptop, right? Even if we agree to take my laptop’s performance as a baseline, we still cannot tell what happens when the size of the array doubles. What happens when the size of the array goes to infinity?

To answer these questions we will need a measurement that is independent of the machine and can tell us what will happen with our algorithm when the size of the input gets larger and larger.

Here comes the BigO Notation.

BigO aims to find how many operations you need to perform in the worst-case scenario given an input of any size. It aims to find the limit for the number of operations of your algorithm as the size gets larger and larger. BigO analysis is divided into Time and Space Complexity. For every algorithm, you calculate its Time Complexity by simply counting the number of operations you perform on the given data structure.

If you copy an array of 10 items into another array, you need to loop over the whole array, which means 10 operations. In BigO notation this is expressed as O(N), where N is the size of the input array. The Space Complexity for this example is also O(N) because you allocate additional memory for the copied array.

What BigO does is give you a math function that is purely focused on finding the limit of the number of operations you need to perform as the size of the input gets larger. For example, suppose you are searching for the number 5 in a given list with a linear search. In the worst case, this number will be at the end of the list, and since you start the iteration from the beginning, you will need to perform as many lookup operations as there are items in the input.

[1, 2, 3, 4, 5]  # you will perform 5 operations here to find it

Here I want to stop for a moment on the term “worst case”. If you think about it, there is a chance that the required number will be at the beginning of the list. In this case, you will perform only 1 operation.

[5, 1, 2, 3, 4]  # you will perform only one operation in this case

The problem is that we cannot rely on the best case and hope that it will happen most of the time, because that way we are not able to compare different algorithms with each other. In the context of the BigO notation, we are always interested in the worst case (with some exceptions like hash maps, more on that later).
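
To make the best case and the worst case concrete, here is a minimal sketch that counts the comparisons a linear search performs. The helper name `linear_search_steps` is only an illustration and does not come from the original article.

```python
def linear_search_steps(items, target):
    """Return (index, comparisons) for a left-to-right scan; index is -1 if absent."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1  # one equality check per visited item
        if item == target:
            return index, comparisons
    return -1, comparisons


print(linear_search_steps([5, 1, 2, 3, 4], 5))  # best case: (0, 1)
print(linear_search_steps([1, 2, 3, 4, 5], 5))  # worst case: (4, 5)
```

The same function, on the same values, does between 1 and N comparisons depending only on where the target sits, which is why the worst case is the number we quote.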

I said before that BigO gives you a math function that is focused on finding the limit of the number of operations. When we talk about limits in math, it is hard to talk about them without any visualization. Visualization helps a lot in understanding the trend of the function as the size of the input goes to infinity. Let’s start by analyzing some very common BigO notations one by one, each together with an example.

BigO Notation examples

O(1) Constant Time

This is understandably the best BigO notation an algorithm can have. Especially when you want to perform a certain action and you can perform it in only one operation. Let’s take a look at an example using Python:

country_phone_code_map = {
    'Albania': '+355',
    'Algeria': '+213',
    'American Samoa': '+684',
}
country = 'Albania'
print(country_phone_code_map[country])  # 1 operation
>>> +355

In Python, looking up an item in a dict is an O(1) Time Complexity operation. A dict in Python is similar to a HashMap in other languages.

To be exact, the worst-case scenario is O(N), and this is related to how well the data structure is implemented. The hashing function plays the key role here, but in general it is agreed that the BigO for dict lookups is O(1). If you are in a coding interview, you can assume that it’s O(1).
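
The O(N) worst case can be provoked on purpose. The sketch below uses a hypothetical key class, `CollidingKey`, whose hash is always the same, so every key collides and a lookup has to fall back on equality checks. The exact counts depend on the interpreter's dict implementation, so treat them as an illustration of the trend, not as a guarantee.

```python
class CollidingKey:
    """Hypothetical key whose hash is constant, forcing every key to collide."""
    comparisons = 0  # class-level counter of __eq__ calls

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # deliberately terrible hash function

    def __eq__(self, other):
        CollidingKey.comparisons += 1
        return isinstance(other, CollidingKey) and self.value == other.value


def lookup_comparisons(n):
    """Build a dict of n colliding keys and count __eq__ calls for one lookup."""
    table = {CollidingKey(i): i for i in range(n)}
    CollidingKey.comparisons = 0  # ignore the comparisons made while inserting
    assert table[CollidingKey(n - 1)] == n - 1
    return CollidingKey.comparisons


for n in (10, 100, 1000):
    print(n, lookup_comparisons(n))  # the count grows with n instead of staying flat
```

With a well-distributed hash, such as the one Python uses for strings, collisions are rare and the lookup stays close to one operation.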

An important topic when calculating the BigO notations: the constants.

You may or may not know that constants are ignored when calculating the BigO. I don’t want you to just accept this as a rule without thinking about the reasons behind it.

This is exactly why I am visualizing the BigO notation. So let’s assume that for the above example we will also need to get the 3 and 2 letter country code by having the country name. This means that we have some other mapping for the 2 and 3 letter country code and we will just need to perform 2 more operations inside the same function.

country_phone_code_map = {
    'Albania': '+355',
    'Algeria': '+213',
}
country_2_letter_code_map = {
    'Albania': 'AL',
    'Algeria': 'DZ',
}
country_3_letter_code_map = {
    'Albania': 'ALB',
    'Algeria': 'DZA',
}
country = 'Albania'

phone_code = country_phone_code_map[country]  # 1 operation
two_letter_code = country_2_letter_code_map[country]  # 1 operation
three_letter_code = country_3_letter_code_map[country]  # 1 operation

If we keep counting the number of operations as we agreed before, we will have 3 operations here, which makes the BigO = O(3), right?

Let’s visualize this:

BigO Notation

As you can see, the number of operations moved up to 3, which means we are actually performing more than one operation to complete this task. But BigO says that if there are constants, just ignore them. So O(3), O(2n), and O(2n + 1) become O(1), O(n), and O(n) respectively.

This is because we are interested in the limit of the function as N goes to infinity and not in exactly how many operations it will perform. We are not calculating the number of operations; instead, we want to see how the number of operations grows as N goes to infinity. You may be thinking: yes, but an algorithm with O(1000n) is slower than one with O(n), so we cannot ignore that 1000. That’s true, but the number 1000 is a constant and it does not grow with N. When N is 10 it is 1000, and when N is 1B it is still 1000. So it does not give us any valuable information about the limit of the function. The only important part is O(n), which tells us that the more N increases, the more operations you have to perform to complete the task.
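
One way to see why the constant carries no information about growth is to ask how much the cost increases when the input doubles. The sketch below is only an illustration; `growth_factor` and the cost functions in `costs` are made-up names.

```python
def growth_factor(cost, n):
    """How many times the cost grows when the input size doubles from n to 2n."""
    return cost(2 * n) / cost(n)


costs = {
    'n': lambda n: n,
    '1000n': lambda n: 1000 * n,
    'n^2': lambda n: n ** 2,
}

for name, cost in costs.items():
    print(name, [growth_factor(cost, n) for n in (10, 1_000, 1_000_000)])
# n and 1000n both double (factor 2.0); n^2 quadruples (factor 4.0)
```

The constant 1000 cancels out of the ratio, so `n` and `1000n` grow in exactly the same way, while `n^2` belongs to a different class.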

O(logn) Logarithmic Time

This notation usually comes together with searching algorithms that use the divide-and-conquer approach. If we are searching for a number in a sorted array, we can use the most basic such algorithm, binary search. This algorithm halves the remaining search range on every operation, so it takes about log(n) operations to find the number. Here is a nice tool to visualize this algorithm.

Something important here is that when we talk about logarithms in computer science without specifying the base, we usually mean the logarithm with base 2. In math, we are used to base 10 in this case, but it is different in computer science. Just keep that in mind when dealing with complexity analysis.

As you can see from the image above, this is actually a very good time complexity. To have a complexity of O(logn) in plain English means that every time the size of the input doubles we only need to make one more iteration to complete the task. When N is about a million we need to perform only 20 operations, and when it gets to around 1 billion we need to perform only 30 operations. You can see the power of an algorithm with O(logn) time complexity. For such a huge increase in N, we only need to perform 10 more operations.
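
The 20 and 30 operations quoted above can be checked with a small sketch of binary search that counts its iterations. The function name `binary_search_steps` is an illustrative assumption, and `range` stands in for a sorted array so that a billion items do not have to be stored in memory.

```python
def binary_search_steps(sorted_items, target):
    """Return (index, iterations); index is -1 if the target is absent."""
    low, high = 0, len(sorted_items) - 1
    iterations = 0
    while low <= high:
        iterations += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid, iterations
        if sorted_items[mid] < target:
            low = mid + 1   # discard the left half
        else:
            high = mid - 1  # discard the right half
    return -1, iterations


for size in (1_000, 1_000_000, 1_000_000_000):
    # range() supports len() and indexing without materialising its values
    _, iterations = binary_search_steps(range(size), size - 1)
    print(size, iterations)  # 10, 20 and 30 iterations respectively
```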

O(N) Linear Time

In this case, when N goes to infinity the number of operations goes to infinity as well, at the same rate as N. An example is the linear search we discussed before.

array = [1, 2, 3, 4, 5]
number = 5
for index, item in enumerate(array):  # loop n times
    if item == number:  # check for equality
        print(f'Found item at {index=}')
        break  # stop as soon as the item is found

>>> Found item at index=4

O(NlogN) Log-Linear Time

This notation usually comes together with sorting algorithms. Take a look at this visualization for merge sort. Merge sort keeps dividing the array into two halves, which gives O(logn) levels of division, and takes O(n) linear time to merge the divided arrays at each level, hence O(NlogN) overall.
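
As an illustration of where the two factors come from, here is a minimal top-down merge sort. It is a plain textbook sketch, not an optimized implementation, and the function names are my own.

```python
def merge_sort(items):
    """Return a new sorted list using top-down merge sort."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle])   # halving the input gives O(log n) levels
    right = merge_sort(items[middle:])
    return merge(left, right)           # merging one level touches every item: O(n)


def merge(left, right):
    """Merge two sorted lists into one sorted list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))  # [1, 2, 2, 3, 4, 5, 6, 7]
```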

O(N²) Quadratic Time

Algorithms with nested loops usually fall into this category. An example is a brute-force sorting algorithm that loops over the whole array in two nested for loops. Bubble sort is an example:

def bubble_sort(data):
    for _ in range(len(data)):  # O(n)
        for i in range(len(data) - 1):  # nested O(n)
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data

Since the second loop is nested, we multiply its complexity by the complexity of the first loop, which gives O(n) * O(n) = O(n²).

If the second loop were outside of the first one, we would sum them instead of multiplying because in this case the second loop will not be repeated as many times as the first loop.
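
The difference between multiplying and summing is easy to verify by counting. The two functions below are illustrative only and simply count how many times the innermost statement runs.

```python
def nested_loops(n):
    operations = 0
    for _ in range(n):
        for _ in range(n):  # runs n times for every outer iteration
            operations += 1
    return operations       # n * n


def sequential_loops(n):
    operations = 0
    for _ in range(n):
        operations += 1
    for _ in range(n):      # runs n times in total, not once per outer iteration
        operations += 1
    return operations       # n + n


print(nested_loops(100), sequential_loops(100))  # 10000 200
```

The sequential version is O(2n), which becomes O(n) once the constant is dropped.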

O(N³) Cubic Time

The simplest example can be an algorithm with 3 nested for loops.

for i in range(len(array)):  # O(n)
    for j in range(len(array)):  # O(n)
        for p in range(len(array)):  # O(n)
            print(i, j, p)

If you directly apply the mathematical definition of matrix multiplication then you will end up with an algorithm with cubic time. There are some improved algorithms for this task, take a look here.
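
For reference, this is roughly what the direct definition looks like in code for two square matrices stored as lists of lists. It is a sketch for illustration (the name `matmul` is my own), not something to use in place of a numerical library.

```python
def matmul(a, b):
    """Multiply two n x n matrices (lists of lists) straight from the definition."""
    n = len(a)
    result = [[0] * n for _ in range(n)]
    for i in range(n):          # every row of a
        for j in range(n):      # every column of b
            for k in range(n):  # dot product of row i and column j
                result[i][j] += a[i][k] * b[k][j]
    return result


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```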

O(2^N) Exponential Time

The best-known example for this notation is finding the nth Fibonacci number with a recursive solution.

def nth_fibonacci(n: int) -> int:
    if n in [1, 2]:
        return 1

    return nth_fibonacci(n - 1) + nth_fibonacci(n - 2)
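
To get a feeling for how quickly the work explodes, the following sketch counts how many function calls the recursive solution above makes. The helper `count_fibonacci_calls` is an illustrative addition that mirrors the same recursion.

```python
def count_fibonacci_calls(n: int) -> int:
    """Number of calls the naive recursion makes to compute the nth Fibonacci number."""
    if n in (1, 2):
        return 1  # a base case is a single call
    # one call for this level plus all calls made by the two sub-problems
    return 1 + count_fibonacci_calls(n - 1) + count_fibonacci_calls(n - 2)


for n in (5, 10, 20):
    print(n, count_fibonacci_calls(n))  # 9, 109 and 13529 calls
```

Adding 10 to n multiplies the number of calls by more than a hundred. Strictly speaking the growth rate is about 1.618^N, which O(2^N) bounds from above.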

O(N!) Factorial Time

An example of this would be to generate all the permutations of a list. Take a look at the Traveling Salesman Problem.
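
A short sketch with the standard library shows the factorial growth directly: generating every ordering of N items produces N! results. The helper name `count_permutations` is only for illustration.

```python
from itertools import permutations
from math import factorial


def count_permutations(items):
    """Count the orderings of items by actually generating every one of them."""
    return sum(1 for _ in permutations(items))


for n in range(1, 9):
    # the number of generated orderings always matches n!
    print(n, count_permutations(range(n)), factorial(n))
```

A brute-force solution to the Traveling Salesman Problem checks every such ordering of the cities, which is why it becomes impractical after a handful of cities.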

Take the most important factor.

We’ve already talked about dropping the constants when calculating the complexity of an algorithm because they don’t provide us with any value. There is something more regarding that rule. When performing complexity analysis, we can end up with an algorithm that performs more than 1 type of operation on the input given. For example, we may need some function to initially sort an array and then search on it. Let’s assume that there will be one operation to sort with complexity O(NlogN) plus another one to search with complexity O(logN).

The time complexity for such a function will be O(NlogN) + O(logN). Let’s visualize this:

O(NlogN) + O(logN)

If you take a look at this graph, you will notice that the impact of O(NlogN) is bigger than the impact of O(logN), since the graph is more similar to the one of O(NlogN) than to the one of O(logN). We can even show that mathematically:

O(NlogN) + O(logN) = O((N+1)logN)  # factorize
O((N+1)logN) = O(NlogN) # drop constant 1

In this case, they are relatively close to each other and the difference is not obvious, but if we take another example like O(N! + N³ + N² + N) we will notice that the impact of every term except N! is tiny compared to N! when N gets large!

We can easily compute 1 000 000 ^ 3 but try the same for 1 000 000 factorial.

O(N! + N³ + N² + N)

The factorial of 10 is 3 628 800 whereas 10³ is only 1000. As you can see the impact of N³ is so small compared to the N! that we can actually ignore it. That is why when we have multiple notations summed up we take the most important factor.
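
The same comparison can be made for a few values of N. This sketch computes which share of N! + N³ comes from the N³ term; the helper name `cubic_share` is an assumption made for the example.

```python
from math import factorial


def cubic_share(n):
    """Fraction of N! + N^3 that is contributed by the N^3 term."""
    return n ** 3 / (factorial(n) + n ** 3)


for n in (5, 10, 20):
    print(n, cubic_share(n))
# at N=5 the two terms are comparable; by N=10 the cubic term is under 0.1% of the total
```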

Something very important to know when taking the most important factor is that we group factors by the inputs. This means if we have an algorithm that operates on 2 different arrays one of size N and one of size M and the complexity of the algorithm is O(N² + N + M³ + M²) then we cannot just say that the highest factor is M³ so the complexity is O(M³). This is not true, because they are completely separate variables in our function. Our algorithm depends on both of them to work so what is correct is to take out the highest factors for both variables. We eliminate N since N² is higher, and we eliminate M² since M³ is higher and the result is O(N² + M³).
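
The sketch below illustrates why neither variable can be dropped. It uses a hypothetical routine that does N² work on one input and M³ work on another; which term dominates depends entirely on the sizes you pass in.

```python
def count_operations(first, second):
    """Hypothetical routine: N^2 work on `first`, then M^3 work on `second`."""
    operations = 0
    for _ in first:
        for _ in first:
            operations += 1      # N^2 part
    for _ in second:
        for _ in second:
            for _ in second:
                operations += 1  # M^3 part
    return operations


print(count_operations(range(100), range(2)))  # 10008, dominated by N^2
print(count_operations(range(2), range(20)))   # 8004, dominated by M^3
```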


If you want to learn algorithms and data structures deeply, then you need to question everything. Do not take any of the rules for granted. Question them and try to find answers. Visualization makes a huge difference in understanding complex algorithms. Do not forget that you should be learning these topics not to pass some coding interviews but to make yourself a better engineer.

Visit Betterprogramming.pub for the original article!


Our 2021 in Review at Cardo AI

2021 Review – What happened this year at Cardo AI

As we are starting the new year, we took a look back at what we were able to accomplish during 2021. January is an ideal time to reflect on what we’ve achieved so far and what we expect to do in the months ahead. 

Keep on reading to discover more about what 2021 has been like here at Cardo AI!

Our Key Moments

2021 has been a big year for Cardo AI: we have reached amazing milestones and achievements that have brought us closer to our main mission: revolutionizing the private debt market with technology powered by artificial intelligence. 

Let’s look at some of the major accomplishments of the year:

Our business results

Our technology has helped clients make smarter decisions, as they are able to quickly analyse more data and gain better insights, lower their operational costs, and further scale their operations. With the same team, our clients are now able to manage 2.5x more assets while encountering 95% fewer errors. 

  • 30+ transactions: Alternative Funds, Sub-funds, SPVs
  • €3Bn+ in amount: SME loans, Consumer Loans, Trade Receivables, PA Loans, etc.
  • 600K+ loans: 35 countries, 22 sectors

Becoming a PRI signatory

  • We are 100% committed to promoting responsible investment

    Our mission is to support our clients in having a real influence on sustainable investing and to integrate ESG elements into the private debt market. One of the ways we hope to accomplish this is through our future products, which will allow managing ESG scoring and rating data from both external and internal sources. Find out more by reading our article about becoming a PRI signatory.

Talent & People

One of the things we are really proud of is the fact that we’ve had more than 40 new joiners this year, doubling our team compared to 2020. Both senior and junior talents have joined our teams for Business Analytics & Development, Data Science & Engineering, Financial, Marketing, and Software Engineering. 

  • Reaching 50 talents

    Back in 2018, we started CARDO AI with a small group of people and the mission to bring technology into the private debt market. In September 2021, we reached 50 amazing talents working across three different countries and helping solve the biggest challenges in private debt!

  • Opening new offices

    As we have welcomed many new employees, we also needed to expand our working spaces. We opened a new office in Tirana, Albania (where we now have two) and we moved to a bigger office in Milan, Italy. 

Benefits and initiatives

In 2021 we introduced several new initiatives and benefits for our talents, with the objective of further improving our working environment and making Cardo AI a great place to grow and develop your career.

  • Work-Life balance officer

    We want to make sure that each employee has the perfect environment to thrive & grow in their chosen career. 
    That is why we appointed a Work-Life Balance Officer who, alongside his activities as Growth Manager, will ensure a healthy work-life balance at Cardo, find ways to mitigate stress, burnout, and overtime work, and train people to manage them. 

  • Remote and Flexible Working

    At Cardo we give team members maximum flexibility to choose the setup and schedule that works best for them, whether that’s at home, at one of our offices, or at another location. We truly believe that giving our employees the freedom to choose where and when they work best can boost long-term motivation, happiness, and overall productivity.

  • Stock option plan

    With our stock option plan we give employees the opportunity to share in the ownership of the company and become real shareholders of Cardo. 

  • Relocation Package

    We give the opportunity to employees who have been with us for more than a year to relocate to their office of choice, be it Milan, Tirana, or London!

  • Intrapreneurship in Cardo AI

    “Intrapreneurship in Cardo AI” means that our team members can propose new ideas to launch new products, features, or new technologies. If the idea is selected, a side development team is created to bring it to the MVP and production stages.

  • Cardo AI Startup Incubator

    With this initiative we plan to select two or three ideas per cohort, and everyone who wants to follow any of the startups is free to do so. At the end of the program, both Cardo AI and everyone from our team can invest in the startups, so we all become entrepreneurs.

  • Cardo Kickstart Training Program for recent graduates

    We launched the second edition of Cardo Kickstart, a program aimed at supporting fresh graduates in their transition from university halls to the labor market in Tirana.  

Company Trips

We were able to organize two amazing company trips: a great opportunity for team building, getting to know each other better, relaxing, and recharging for the upcoming challenges! Here are some pictures of our two trips, the first one to Drimades, in June 2021, and the second one to Theth, in September. 

One year of tech innovation

  • Virtual Data Room

    We have integrated a VDR system into our Securitization platform, allowing our clients to securely store critical papers, contracts, and data that they are willing to share with a third party.

    Thanks to Cardo AI’s VDR, originators and arrangers of securitizations are now able to:

    • Ensure quality, accessibility, and reliability of data in all the stages of the transaction.
    • Offer to potential investors a fully digital Due Diligence experience.
    • Keep data always up-to-date to grant full transparency.
  • Marti the Chatbot

    With increasing interest and research on Artificial Intelligence (AI), Natural Language Processing (NLP), and machine learning, bots are becoming progressively more efficient. In the fintech market, bots can support users along their journey through applications, programs, and software. 

    For this reason, in 2021 we launched Marti, Cardo AI’s virtual assistant. Currently, the chatbot is integrated into our digital lending product to guide our users and help them navigate the platform, making it very fast and easy to access information and gain insights on private debt investments and operations.

  • IDP

    IDP is a service that handles Authentication & Authorization, in a focused way, in a cluster infrastructure. This way, we can have a single UserBase for different applications or services while they focus more on providing features and functionalities.

    Some of the main features it offers are:

    • Single Sign-on – This allows a user to access multiple applications in the cluster with the same set of credentials.
    • Granular permissions structure – Roles, Functionalities, Permission.
    • Default Deny – This means that explicit permissions have to be given to each user for everything that they can access.
    • Temporary user access – This feature allows the creation of temporary tokens.
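
To illustrate the “Default Deny” idea in isolation, here is a minimal sketch. It is not Cardo AI’s implementation; the `User` class, the `is_allowed` function, and the permission strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    permissions: set = field(default_factory=set)  # explicit grants only (hypothetical model)


def is_allowed(user: User, permission: str) -> bool:
    """Default deny: access is granted only if the permission was explicitly given."""
    return permission in user.permissions


analyst = User('analyst', {'reports:read'})
print(is_allowed(analyst, 'reports:read'))   # True, explicitly granted
print(is_allowed(analyst, 'reports:write'))  # False, never granted so denied
```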

Looking ahead

There are many initiatives we want to carry out this year, starting with the first one in January: The Women in Tech event.

It will be an online event with the objective of highlighting women’s contribution to the tech industry and their participation in the tech community, along with the challenges they face in becoming part of it. In addition, we want to emphasize female entrepreneurship and leadership in technology. Discover more on the LinkedIn page of the event.

We are very proud of everything we accomplished in 2021, and we can’t wait to work towards new goals and achievements in 2022!

Want to stay updated about our initiatives and products? Don’t forget to follow us on LinkedIn!

About the author

Altin Kadareja

Altin Kadareja is the CEO and co-founder of Cardo AI. Prior to founding Cardo AI, Altin has covered several investment and risk management roles at BlackRock, Prometeia, Intesa Sanpaolo and Allianz Bank Financial Advisors across different European markets. He holds a master of science degree in Economics and Management of Innovation and Technology from Bocconi University in Milan and followed an executive program in Risk Management at Imperial Business School in London.

Financial Data Extraction from Statements with Machine Learning

Data is the foundation that drives the whole decision-making process in the finance ecosystem. With the growth of fin-tech services the process of collecting this data is more easy accessible, and for a data scientist becomes necessary to develop a set of information extraction tools that would automatically fetch and store this relevant data. Doing so, we facilitate the process of financial data extraction, which before the development of this tools was done manually, a very tedious and not very time-efficient task.

One of the main providers of this “key knowledge” in finance is the financial statement, which offers important insight into a company through its performance, operations, cash flows, or balance sheet. While this information is usually provided in text-based formats or other data structures like spreadsheets or data frames (which can easily be processed using parsers or converters), the data sometimes comes in other document formats or as images, in a semi-structured fashion that also varies between sources. In this post we will go through the different approaches we used to automate data extraction from financial statements provided by different external sources.

Financial Data Extraction: Problem introduction

In our particular case, our data consisted of different semi-structured financial statements provided as PDFs and images, each one following a particular template layout. These financial statements contain relevant company-related information (company name, industry sector, address), different financial metrics, and balance sheets. For us, the extraction task consists of retrieving all the relevant information in each document for every entity class and storing it as key-value pairs (e.g. company_name -> CARDO AI). Since the composition of information differs between documents, the documents fall into different clusters. Going even further, the information inside a single document comes in different types of data (text, numerical values, tables, etc.).

Sample of a financial statement containing relevant entity classes

In this case, two main problems emerge: first, we have to find an approach that solves the task for one type of document; second, by inductive reasoning, we have to form a broader approach to the general problem that applies to the whole data set. Note that we are trying to find a single solution that works in the same way for all this diverse data. Treating every single document separately with a different approach would miss the whole point of the task at hand.

“An automated system won’t solve the problem. You have to solve a problem, then automate the solution”

Methodology and solution

Text extraction tools

First, we started with text extraction tools: Tabula for tabular data, and PDFMiner and Tesseract for text data. Tabula scrapes tabular data from PDF files, while PDFMiner and Tesseract extract text from PDFs and images respectively. These tools work by recognizing pieces of text in visual representations (PDFs and images) and converting them into textual data (document text). The issue with Tabula was that it works only on tabular data, whereas the most relevant information in our financial documents is not always represented in tabular format.

Meanwhile, when we applied the other tools, PDFMiner and Tesseract, the raw output text was completely unstructured and not human-readable (it contained unnecessary white space and confusing words with special characters). This text was hard to break down into the meaningful entity classes that we wanted to extract. This was clearly not enough, so we had to explore other approaches.


Before moving on, we tried to pre-process the text output by the above-mentioned extraction tools, and for that we tried GPT-2 [1]. GPT-2 is a large transformer-based language model with 1.5 billion parameters developed by OpenAI, and it was considered a major breakthrough in the field of NLP. This model, along with its successor GPT-3, has achieved strong performance on many NLP tasks, including text generation and translation, as well as several tasks that require on-the-fly reasoning or domain adaptation. In our case, we tried to exploit one of its capabilities: text summarization. After getting a considerable amount of text from the previous text extraction tools, we tried to summarize all this information with the GPT-2 model and filter out non-relevant information, taking advantage of the attention mechanism of the transformer model. But this approach did not work well, since unstructured text is very hard to summarize. Apart from that, there is always the possibility of the model removing important information from the text, and we cannot give it the benefit of the doubt in this regard.

Bounding boxes relationship – OpenCV

The unpromising results of the above approaches made us entertain the idea of treating it as an object detection task using computer vision. Object detection is done by means of outputting a bounding box around the object of interest along with a class label. Then we could construct a relationship graph between these “boxed” entities [2] (see image above). Going forward with this method we tried to do the same with our documents, but instead draw boxes around text that represents an identifiable entity and label each box with the entity name it contained. The next step would have been to develop an algorithm that calculates a metric that represents the relationship values between these boxes based on their spatial position. We could then train a machine learning model that would learn from these relationship values and sequentially decide the position of the next entity by knowing the document locations of the previous ones.

The model creates a relationship graph between entities
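To make the idea of a spatial relationship metric more concrete, here is a minimal sketch. The post does not specify which metric was considered, so the `Box` class and the `relation` function below are purely hypothetical: they describe the relationship between two labeled boxes as the distance and angle between their centers.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """A labeled bounding box in page coordinates (hypothetical structure)."""
    label: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    @property
    def center(self):
        return (self.x + self.width / 2, self.y + self.height / 2)


def relation(a: Box, b: Box):
    """Return (distance, angle in degrees) from the center of a to the center of b."""
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = bx - ax, by - ay
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# Two boxes stacked vertically, as a company name and an address might be
name_box = Box("company_name", x=0, y=0, width=100, height=20)
address_box = Box("address", x=0, y=40, width=100, height=20)
distance, angle = relation(name_box, address_box)
print(distance, angle)
```

Pairs of (distance, angle) values like these could serve as edge features of the relationship graph, although any real layout model would likely need richer features.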

However, that was not an easy task: it is very hard to determine the right box that represents a distinct, meaningful component of the report and, as mentioned above, different documents follow different layouts, so the information we want to extract can appear anywhere in the document. Hence, the previously mentioned algorithm might be inaccurate in determining the position of every box. We moved on to seek a better plan.

Named Entity Recognition

An example of NER annotation

Named-entity recognition (NER) is a sub-task of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

While exploring new approaches for this task, we came upon the expression “named entity”, which generally refers to those entities for which one or many strings, such as words or phrases, stand consistently for some referent. Then we discovered Named Entity Recognition, the task of locating and classifying words from an unannotated block of text into predefined categories. The preferred approach to solving this task is to use deep learning NLP models that rely on linguistic, grammar-based techniques. Conceptually, the task is divided into two distinct problems: detecting names and classifying them into the category they fall into. Hence we started to look at different implementations to design our language model.

NLP model – spaCy

At this point, the path was straightforward to follow thanks to the ease of use that NLP libraries offer. For this job, we decided to go with spaCy [3], which offers a simple and flexible API for many NLP tasks, one of them being Named Entity Recognition. The design pipeline can be conceptualized with the diagram below:

Solution pipeline

Before designing the model, we first have to construct the training data set. It essentially consists of annotated blocks of text, or “sentences”, each containing the value of an entity together with the entity name itself. To build it, we first extract the text from the paragraphs where the desired information is present, using the extraction tools described earlier. Then we annotate this text with the categories found, by providing the start and end character positions of the entity in the text. In doing so, we also give the model some context by keeping the words around the annotated word. All the information retrieved from the financial statements can then easily be stored in a CSV file. In spaCy, this is represented with the structure below:

TRAIN_DATA = [
    ("Cardo AI SRL is a fintech company", {"entities": [(0, 12, "Company")]}),
    ("Company is based in Italy", {"entities": [(20, 25, "LOC")]}),
]
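Computing the character offsets by hand is error-prone, so a small helper can derive them from the text. The following is a minimal sketch rather than the code used in the project: the `annotate` function is a hypothetical name, and it assumes that the entity value occurs verbatim in the sentence (only the first occurrence is annotated).

```python
def annotate(text, value, label):
    """Build one spaCy-style training example for the first occurrence of value."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} not found in text")
    # The annotation is (start_char, end_char, label), with end_char exclusive
    return (text, {"entities": [(start, start + len(value), label)]})


TRAIN_DATA = [
    annotate("Cardo AI SRL is a fintech company", "Cardo AI SRL", "Company"),
    annotate("Company is based in Italy", "Italy", "LOC"),
]
print(TRAIN_DATA)
```

The two generated examples carry the same offsets as the hand-written structure above.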

After preparing the data set, we designed the NLP model by choosing among the alternatives that spaCy provides. We started from a blank, untrained model and outlined the model’s input and output. We split the data into train and test sets and then started training the model, following this pipeline. From the training data, the text is first tokenized into a Doc object, which basically means breaking the text down into individual linguistic units; the annotations, supplied in the supported format, are then parsed with the GoldParse class and fed into the training pipeline of the model.

Training pipeline

Results and constraints

After training the model on about 800 input rows and testing on 200, we got these evaluations:


The evaluation results seemed promising, but that may have come also from the fact that the model was over-fitting or there was not a lot of variability in our data. After our model was trained, all we had to do was feed it with text data taken from the input reports after they had been divided into boxed paragraphs and expect the output represented as a key-value pair.


Lack of data
– to avoid bias or over-fitting, the model should be trained on a relatively large amount of data.
– acquiring all this data is not an easy process, and it also requires a data pre-processing step.
Ambiguous output
– the model may output more than one value per entity, which leads to inconsistency in interpreting the results.
Unrecognizable text
– the financial statements contain poorly written text that the text extraction tools do not recognize correctly.
Numerical values
– the reports contain many numerical values, which makes it hard to determine the labels they actually belong to.

Potential future steps toward better Financial Data Extraction

In recent years, convolutional neural networks have shown great success in various computer vision tasks such as classification and object detection. Framing the problem from a computer vision perspective as document segmentation (creating bounding boxes around the text contained in the document and classifying it into categories) is a promising way to proceed. And for that, the magic formula might be called “R-CNN” [4].

By following this path, we might be able to resolve the issues we ran into with our solution. Integrating several different approaches may also improve the overall accuracy of the labeling process.

Once the solution is stable and the model’s accuracy is satisfactory, we need to streamline the whole workflow. A machine learning model benefits from being fed abundant new data, which improves its overall performance and lets it predict future observations more reliably. To achieve that, it becomes necessary to build an automated retraining pipeline for the model, with the workflow shown in the following diagram:

Workflow diagram


We reviewed several approaches we attempted for solving the Named-Entity Recognition task on financial statements. From this trial-and-error journey, it seemed that the best method was to use Natural Language Processing models trained with our own data and labels.

But despite obtaining seemingly satisfactory results in our case study on financial data extraction, there is still room for improvement. The one thing we know for sure is that the above Machine Learning approach gave us the best results, and that continuing along the same path is the way to go. Machine Learning is advancing quickly and, with the right approach to present problems in every domain, it is becoming a powerful tool to make use of.


[1] Better Language Models and Their Implications

[2] Object Detection with Deep Learning: A Review

[3] spaCy Linguistic Features, Named Entity

[4] Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

EU Taxonomy: A practical guide for navigating troubled waters


What is the EU Taxonomy and what are its implications for financial market participants? In this article you will find a practical guide on how to comply with the EU Taxonomy.

What is the EU Taxonomy?

The EU Taxonomy is one of the main pillars of the EU’s Action Plan for Financing Sustainable Growth (2018), one of whose fundamental aims is to reorient capital flows towards a more sustainable economy.

The Action Plan assigns private finance a pivotal role in reaching the EU’s ambitious goal of transitioning to a low-carbon, more resource-efficient, resilient, and competitive economy – in line with its commitment to fully implement the UN 2030 Agenda and the Paris Agreement, both being an integral part of its Green Deal.

However, it is posited that no shift of capital flows towards more sustainable activities can be truly achieved without a common definition of what “sustainable” means.

Through Regulation 2020/852 (“Taxonomy”), the EU establishes a unified classification system that aims to help financial market participants (FMPs) channel investments towards financial products that truly pursue environmentally sustainable objectives, thereby addressing “greenwashing” concerns.

The Taxonomy is complementary to the SFDR (Sustainable Finance Disclosure Regulation, Reg.2019/2088), in the sense that it mandates additional transparency requirements (both in pre-contractual, website, and periodic disclosures) in case financial products promote environmental characteristics (article 8 products) or pursue an environmental objective (article 9 products).

What does the EU taxonomy specify?

In particular, the EU taxonomy specifies that such financial products should first disclose to which of the following environmental objectives they contribute:

  • Climate change mitigation
  • Climate change adaptation
  • Sustainable use and protection of water and marine resources
  • Transition to a circular economy
  • Pollution prevention and control
  • Protection and restoration of biodiversity and ecosystems

Within its Final Report on the EU Taxonomy, the Technical Expert Group (TEG) recommends that investors should estimate Taxonomy-alignment separately for each of the environmental objectives for which substantial contribution technical screening criteria (TSC) have been developed. This means that it should be completed separately for climate change mitigation and adaptation (the objectives for which TSC are available as of now).

This is just one and the simplest step in assessing portfolio compliance with the Taxonomy. In the next section, we explore in detail the full process that has to be followed.

How does the Taxonomy work in practice?

As with all other regulations, nothing comes easy. This is particularly true when investors have companies in their portfolios that are not subject to the EU Non-Financial Reporting Directive, such as non-EU companies and small and medium-sized enterprises (SMEs). For such cases – which are not negligible, especially for private debt investors – the TEG advises following a five-step approach:

Step 0 – Map in-use industry classification systems to NACE sectors eligible under the EU Taxonomy

A pre-requisite for estimating the Taxonomy-alignment of investment portfolios is the mapping of in-use sector classification systems (e.g. SIC, NAICS, BICS, GICS, ICB, RBICS, TRBC, etc) with the European industry classification system (NACE).

The Platform for Sustainable Finance (a permanent expert group of the European Commission) has elaborated a table providing an indicative mapping of selected industry classification systems, and how they relate to the description of economic activities in the EU Taxonomy Delegated Act adopted by the Commission.

Step 1 – Eligibility screening

Identify the companies whose turnover, CAPEX, or OPEX match the economic activities listed in the Taxonomy

For each entity, investors need to be able to assess the proportion of turnover derived from economic activities eligible under the Taxonomy (approx. 70 activities). If data can be obtained, investors should look also into CAPEX and OPEX. 

The turnover KPI (or revenue, if appropriate) is particularly relevant for the climate change mitigation objective. For climate change adaptation the assessment is rather more complicated (especially in the absence of reported data): for an activity to be eligible, there should be evidence that the entity has implemented tailored solutions to prevent physical climate change risks based on the performance of a vulnerability assessment.

It is recommended at this stage to also group eligible activities in two clusters:

  • economic activities that are “enabling” other activities to make a substantial contribution to one or more environmental objectives (e.g. the manufacturing of renewable energy equipment in the case of climate change mitigation);
  • “transitional” activities, i.e. those activities for which there is no technologically and economically feasible low-carbon alternative (e.g. manufacturing of iron and steel), but that shall qualify as long as their technology is consistent with a 1.5°C scenario (they “pass” certain technical screening criteria).

Economic activities such as electricity generation from Solar PV are considered to substantially contribute to climate change mitigation through their own performance.

Step 2 – Substantial contribution screening

Validate if the eligible companies meet the technical screening criteria (TSC) provided for the economic activity

This is likely the most difficult step to verify, especially in the absence of reported data. While some economic activities (e.g. electricity generation from wind) do not have technical thresholds to comply with, most of them do. As an example, electricity generation from geothermal energy is Taxonomy-eligible, but it must meet the technical criterion of no more than 100g CO2e emissions per kWh over the life-cycle of the installation, as calculated using specific methodologies (e.g. ISO 14067:2018) and verified by a third party.

The final Delegated Act on climate objectives containing all the TSC was published on the 9th of December 2021. For ease of consultation, FMPs can use the Taxonomy Compass.

The Taxonomy Regulation recognizes that in the absence of reported data, this step can be particularly burdensome. For this reason, it allows for complementary assessments and estimates, as long as financial market participants explain the basis for their conclusions and the reasons for having made such estimates.

Step 3 – Do No Significant Harm (DNSH) screening

Validate if the eligible economic activities do not significantly harm other environmental objectives

The Taxonomy Regulation mandates that, once the TSC are deemed satisfied, FMPs should also check that an economic activity that contributes, e.g., to climate change mitigation does not significantly harm the other five environmental objectives.

The third step requires investors to conduct due diligence to verify whether the company’s activities meet certain qualitative, quantitative, and process-based requirements for each of the other environmental objectives, not only at the production stage but over the life-cycle of the activity itself.

Also here the lack of data could be a challenge for FMPs. The TEG recommends the reliance on existing credible information sources, such as reports from international organizations, civil society, and media, as well as established market data providers.

Step 4 – Social minimum safeguards screening

Validate if companies meet minimum human and labor rights standards

The Taxonomy mandates that for an economic activity to be environmentally sustainable, it should also be aligned with the OECD Guidelines for Multinational Enterprises, the UN Guiding Principles on Business and Human Rights, the International Labour Organisation’s (ILO) Core Conventions, and the International Bill of Human Rights. As for step 3, the TEG recommends relying on internal due diligence processes as well as on external credible information sources.

Step 5 – Calculate the alignment of the investment with the Taxonomy 

An economic activity is to be considered Taxonomy-aligned only if it complies with steps 1-4. Once the aligned portions of the companies in the portfolio have been identified, investors can calculate the alignment of their funds with the Taxonomy (as an example, if 10% of a fund is invested in a company that makes 10% of its revenue from Taxonomy-aligned activities, the fund is 1% Taxonomy-aligned for that investment, and so on).
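The arithmetic in this example generalizes to a weighted sum over all holdings. The snippet below is a minimal sketch of that calculation; the `fund_alignment` function and the portfolio figures other than the 10% × 10% case from the text are hypothetical, and turnover is assumed to be the KPI used.

```python
def fund_alignment(holdings):
    """Sum of portfolio weight x Taxonomy-aligned revenue share over all holdings.

    holdings: iterable of (portfolio_weight, aligned_revenue_share), both as fractions.
    """
    return sum(weight * aligned_share for weight, aligned_share in holdings)


holdings = [
    (0.10, 0.10),  # the example from the text: contributes 1% to the fund's alignment
    (0.30, 0.50),  # hypothetical holding: contributes 15%
    (0.60, 0.00),  # hypothetical holding with no aligned revenue
]
alignment = fund_alignment(holdings)
print(f"Taxonomy alignment of the fund: {alignment:.0%}")
```

The same function could be run separately on CAPEX or OPEX shares where that data is available.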

FMPs can find case studies on the application of Taxonomy requirements to several asset classes on the PRI (Principles for Responsible Investment) website.

What is the relationship between the SFDR and the EU Taxonomy?

As stated previously, the Taxonomy Regulation is complementary to the SFDR, since it requires additional disclosure requirements for FMPs in case they market financial products promoting environmental characteristics (article 8) or the attainment of an environmental objective (article 9).

As Regulation 2019/2088 mandates, Taxonomy-alignment disclosure for financial products is due not only in pre-contractual documents (articles 8 and 9) but also on websites (article 10) and through periodic reporting (article 11).

An important point to underline is that website disclosure shall provide, for each art.8 and art.9 product, “information on the methodologies used to assess, measure and monitor the environmental or social characteristics or the impact of the sustainable investments selected for the financial product, including its data sources […] and the relevant sustainability indicators”. In case such products have an environmental focus, there should be also disclosure on the methodology used to estimate Taxonomy-alignment.

Last but not least, DNSH screening for Taxonomy-aligned products should not be confused with PASI (Principal Adverse Sustainability Impacts) reporting, due at entity level pursuant to article 4 SFDR, and at product level pursuant to article 7.

Article 4 requires FMPs with more than 500 employees, or those stating that they consider “principal adverse impacts of investment decisions on sustainability factors”, to publish on their websites a description of such impacts. The ESAs (European Supervisory Authorities: EBA, EIOPA, ESMA) have developed draft Regulatory Technical Standards (RTS) supplementing Reg.2019/2088, according to which financial undertakings will have to disclose on their websites selected aggregate ESG metrics (approx. 20) estimated across all investee companies. Companies with fewer than 500 employees that do not consider the adverse impacts of investment decisions on sustainability factors will also have to publish a clear, reasoned statement explaining why they do not. In both cases, FMPs will have to disclose relevant information by 30 June 2023.

What are the deadlines for reporting Taxonomy alignment?

The SFDR started applying on 10 March 2021 and the Taxonomy on 1 January 2022. However, the design of the Regulatory Technical Standards – which provide the detailed requirements for pre-contractual, website, and periodic disclosure pursuant to both the SFDR and the Taxonomy – has taken longer than expected.

In an effort to jointly develop RTS for both the SFDR and the Taxonomy, the European Commission has postponed the application of the Delegated Act containing the RTS to January 2023.

However, financial undertakings subject to an obligation to publish non-financial information pursuant to Article 19a or Article 29a* of Directive 2013/34/EU shall start disclosing the proportion of Taxonomy-eligible activities within their portfolios from January 2022. Full disclosure of Taxonomy-aligned activities will instead be required from January 2024. Furthermore, Reg.2021/2178 clarifies that exposures to national and supranational issuers, including central banks, shall be excluded from the calculation of Taxonomy KPIs altogether, while derivatives and exposures to undertakings that are not subject to non-financial disclosure regulation (e.g. SMEs) shall be excluded only from the numerator.
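The difference between being excluded from the KPI altogether and being excluded only from the numerator is easy to miss, so here is a simplified sketch of the calculation. It is an illustration of the rule as described above, not an implementation of Reg.2021/2178: the `Exposure` class, its field names, and the figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exposure:
    amount: float
    eligible_share: float = 0.0       # share of the exposure that is Taxonomy-eligible
    sovereign: bool = False           # national/supranational issuer or central bank
    numerator_excluded: bool = False  # derivative, or undertaking outside non-financial disclosure rules


def eligibility_kpi(exposures):
    """Share of Taxonomy-eligible exposures under the simplified exclusion rules."""
    # Sovereign-type exposures are left out of both numerator and denominator
    covered = [e for e in exposures if not e.sovereign]
    denominator = sum(e.amount for e in covered)
    # Derivatives and non-reporting undertakings stay in the denominator only
    numerator = sum(e.amount * e.eligible_share for e in covered if not e.numerator_excluded)
    return numerator / denominator if denominator else 0.0


portfolio = [
    Exposure(amount=100, eligible_share=0.5),
    Exposure(amount=50, sovereign=True),
    Exposure(amount=100, eligible_share=0.8, numerator_excluded=True),
]
kpi = eligibility_kpi(portfolio)
print(f"Taxonomy-eligible share: {kpi:.0%}")
```

In this toy portfolio the SME-type exposure lowers the KPI because it counts in the denominator but cannot contribute to the numerator, while the sovereign exposure has no effect at all.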

It should be borne in mind that such a timeline applies only to matters regarding the publishing of non-financial statements. FMPs considering adverse impacts on sustainability in their investments (PASI) and/or marketing article 8 and 9 products, should follow closely the developments linked to the Delegated Act containing the Regulatory Technical Standards.

(*) Article 19a and 29a pertain to non-financial disclosure requirements for large undertakings

For more details and recommendations on EU Taxonomy implementation, do not hesitate to contact us.

About the Author

Cristina Hanga

Cristina is the ESG Expert at CARDO AI, working across the company’s product suite.

Prior to joining CARDO, she worked as an ESG Analyst at Sustainalytics, where she was Lead Quality Control for the Consumer Goods sector and contributed to several methodology developments. Cristina also spent a period at KPMG, where she advised companies on ESG disclosure and ESG Ratings.

She holds a Master’s Degree in International Cooperation and Development from the University of Bologna, where she focused particularly on climate change policy.


Handling Missing Data in ML Modelling (with Python)

During my experience as a data scientist, one of the most common problems I have faced during the process of data cleaning/exploratory analysis has been handling missing data. In the ideal case, all attributes of all objects in the data table have well-defined values. However, in real data sets, it is not unusual for an attribute to contain missing data.

When dealing with prediction tasks in supervised learning, I quickly came to the realization that for a lot of machine learning algorithms available in Python, the task of handling missing data cannot be done naturally. I.e. the omitted instances have to somehow be filled with a placeholder (most likely a number) for them to run smoothly.

Perhaps you as a reader may have come up with a simple solution already:

Why not just ignore those instances in the pre-processing stage and conclude it that way?

After all, it is not our fault that the data is missing and moreover, we should not make any assumptions about the nature of missing data since their true value is unknowable in principle. While this may certainly be tempting (if not advisable) in some situations, I will attempt to make the case that imputing (rather than ignoring) missing values can be a better practice that in the end leads to more reliable and unbiased results for our machine learning models.

Types of missing data

There may be various reasons responsible for why the data is missing. Depending on those reasons, it can be classified into three main types:

1) Missing completely at random (MCAR) – Imagine that you print out the data table on a sheet of paper with no missing values and then someone accidentally spills a cup of coffee on it. In this case, the conclusion is that the unknown values of an attribute follow the same distribution as known ones. This is the best case for missing values [1].

2) Missing at random (MAR) – In this case, the missing value from an attribute X is dependent on other attributes but is independent of the true value of X. For example, if an outdoor air temperature sensor runs out of batteries and the staff forgets to change them because it was raining, we can conclude that temperature values are more likely to be missing when it is raining, so they are dependent on the rain attribute. If we compute the average temperature based only on the present values, we would probably overestimate it, since the temperature may be lower when it is raining compared to when it’s not.

3) Missing not at random (MNAR) – This usually occurs when the lack of data is directly dependent on its value. For example when a temperature sensor fails if temperatures drop below 0°C. Another example is when people with a certain level of income choose not to disclose that information to a census taker. In this case, it is more difficult to replace the missing values with a reasonable estimate.

It is important to identify these types of missing data since it can help us make certain assumptions about their distribution and therefore improve our chances of making good estimations.
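The bias described in the MAR temperature example can be reproduced with a short simulation. This is a sketch under assumed numbers (about 10°C when raining, about 20°C otherwise, and readings missing 80% of the time in the rain versus 10% otherwise); none of these figures come from real data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
raining = rng.random(n) < 0.5
# Assumed toy climate: colder when it rains
temperature = np.where(raining, 10.0, 20.0) + rng.normal(0.0, 2.0, n)
# MAR: the chance of a missing reading depends on rain, not on the temperature itself
missing = rng.random(n) < np.where(raining, 0.8, 0.1)

true_mean = temperature.mean()
observed_mean = temperature[~missing].mean()
print(f"true mean: {true_mean:.2f}, mean of observed values: {observed_mean:.2f}")
```

Because most of the rainy (colder) readings are lost, the mean of the observed values ends up noticeably higher than the true mean.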

Ways of handling missing data

First of all, we need to identify which attributes exactly contain missing values, as well as get an idea of their frequency, as shown in the table below:

The sorting of the attributes is in descending order based on the number of instances with unknown values.
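A summary of this kind can be produced directly with pandas. The following is a minimal sketch on a small hypothetical DataFrame (the original data set is not reproduced here); `isna()` marks the missing cells, and summing or averaging the resulting booleans gives the count and percentage per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real data set
df = pd.DataFrame({
    "col1": [2.0, np.nan, 7.0, 1.0],
    "col2": [5.0, np.nan, np.nan, np.nan],
    "col3": [3.0, 4.0, 9.0, 8.0],
})

summary = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": df.isna().mean() * 100,
}).sort_values("missing_count", ascending=False)
print(summary)
```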

2.1 Deleting missing data

In my opinion, if the missing value percentage is above a certain threshold (say, 60%), it does not make much sense to try and impute them because it would likely influence our predictions due to the biased estimations. Deletion of the rows or columns with unknown values would be better suited. For illustrative purposes, suppose the data set looks like this (missing instances are denoted with the NaN notation):

The Python pandas library allows us to drop the missing values based on the rows that contain them (i.e. drop rows that have at least one NaN value):

import pandas as pd

df = pd.read_csv('data.csv')
# drop every row that contains at least one NaN value
print(df.dropna(axis=0))

The output is as follows:

id    col1     col2      col3     col4     col5
0      2.0       5.0       3.0       6.0       4.0

Similarly, we can drop columns that have at least one NaN in any row:
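A minimal sketch of this call, using a small hypothetical DataFrame in place of the original data set (which is not reproduced here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only col3 is free of missing values
toy = pd.DataFrame({
    "col1": [2.0, np.nan, 7.0],
    "col2": [5.0, 1.0, np.nan],
    "col3": [3.0, 4.0, 9.0],
})
# axis=1 drops every column that contains at least one NaN
without_nan_columns = toy.dropna(axis=1)
print(without_nan_columns)
```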

The above code produces:

However, I think that in most scenarios it is better to keep data than to discard it. One obvious reason is that removing rows or columns that contain unknown values can result in losing too much valuable information, especially if we don’t have much data to begin with.

2.2 Simple imputation of missing data

We could use simple interpolation techniques to estimate unknown data. One of the most common interpolation techniques is mean imputation [2]. Here, we simply replace the missing values in each column with the mean value of the corresponding feature column.

The scikit-learn library offers us a convenient way to achieve this with the SimpleImputer class and its fit_transform() method:

from sklearn.impute import SimpleImputer
import numpy as np

sim = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = sim.fit_transform(df.values)

After running the code, we get the imputed dataset:

Other imputation strategies are available with this class, for example “median”, or “most_frequent” in the case of categorical data, which replaces the missing data with the most common category.
This simplistic approach does have its drawbacks, however. For example, by using the mean as an imputation strategy we do not:
1) Account for the variability of the missing values, since these values are replaced by a constant.
2) Take into account the potential dependency of the missing data from the other attributes which are present in the data set.
That’s why I decided to focus my attention on a few more sophisticated approaches.
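The first drawback listed above can be seen numerically: filling gaps with a constant shrinks the spread of the column. The snippet below is a small sketch on synthetic data (normally distributed values with roughly 30% of the entries removed at random); the numbers are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
values = rng.normal(50.0, 10.0, size=(1000, 1))
with_gaps = values.copy()
with_gaps[rng.random(1000) < 0.3, 0] = np.nan  # remove roughly 30% of the entries at random

imputed = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(with_gaps)

std_observed = np.nanstd(with_gaps)  # spread of the values that are actually known
std_imputed = imputed.std()          # spread after filling the gaps with the mean
print(f"std of observed values: {std_observed:.2f}, std after mean imputation: {std_imputed:.2f}")
```

Every imputed entry sits exactly at the mean, so the standard deviation of the completed column is smaller than that of the observed values alone.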

2.3 Imputation of missing data using machine learning

A more advanced method of imputation is to model an attribute containing unknown values as a target variable that depends on the other variables present in the data set, and then apply traditional regression or machine learning algorithms to predict its missing instances. A rough mathematical representation could be formulated as follows:

$$y = f(X)$$

where y represents the attribute for which we want to predict the missing values and X is the set of predictor variables, i.e. the other variables. This relationship is most clearly visible in the case of simple linear regression where we have:

$$y = \beta_0 + \beta_1 X$$

After we build our simple model, we can use it to predict the unknown values of y for the rows where the corresponding X values are available. The exact same principle applies to ML algorithms, although the relationship between target and predictors cannot be written down so neatly.
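As a minimal sketch of this idea, assuming a made-up data set with a single predictor x and a target y that depends on it linearly, the gaps in y can be filled like this (all names and numbers are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x, plus a little noise
x = rng.uniform(0, 10, size=100)
y_true = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=100)
toy = pd.DataFrame({"x": x, "y": y_true})

# Pretend the first 20 values of y are unknown
toy.loc[:19, "y"] = np.nan
missing = toy["y"].isnull()

# Fit y = b0 + b1 * x on the rows where y is observed
model = LinearRegression()
model.fit(toy.loc[~missing, ["x"]], toy.loc[~missing, "y"])

# Predict y for the rows where it is missing and fill the gaps
predicted = model.predict(toy.loc[missing, ["x"]])
toy.loc[missing, "y"] = predicted

print(f"estimated slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
```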
Relying on linear regression (or logistic regression for categorical data) to fill the gaps has, of course, its drawbacks as well. Most importantly, this approach assumes that the relationship between the predictors and the target variable (or the log odds of the target in logistic regression) is linear, even though this may not be the case at all.
For this reason, I have chosen to perform imputation using ML algorithms that are able to also capture non-linear relationships. The modus operandi can be summarized in the following pseudocode:

For each attribute containing missing values do:

  1. Substitute missing values in the other variables with temporary placeholder values derived solely from the non-missing values using a simple imputation technique
  2. Drop all rows where the values are missing for the current variable in the loop
  3. Train an ML model on the remaining data set to predict the current variable
  4. Predict the missing values of the current variable with the trained model (when the current variable will be subsequently used as an independent predictor in the models for other variables, both the observed and predicted values in this step will be used).

Firstly, as you probably noticed, I have performed a simple form of imputation (median) already in the first step. This is necessary because there may be multiple features with missing data present, and in order for them to be used as predictors for other features, their gaps need to be temporarily filled somehow.
Secondly, the prediction of missing data is done in a “progressive” manner in the sense that variables which were imputed in the previous iteration are used as predictors along with those imputed values. So at each iteration except the first, we are relying on the predictive power of our model to fill the remaining gaps.
Thirdly, given that the data set provided in this case contained a mix of data types, I have employed ML regressors (for continuous attributes) as well as classifiers (for categorical attributes) to cover all possible scenarios.

In the subsequent sections, I have listed all the ML models used in this study, along with small snippets of code that demonstrate their implementation in Python.

2.3.1 Imputation of missing data using Random Forests

Quick data preprocessing tips

Before training a model on the data, it is necessary to perform a few preprocessing steps first:

  • Scale the numeric attributes (apart from our target) to help the algorithm find a better solution more quickly.
    This can be achieved using scikit-learn’s StandardScaler class:
    from sklearn.preprocessing import StandardScaler
    X = df.values
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)
  • Encode the categorical data so that each category of an attribute is represented in a binary fashion: 1 (present) or 0 (not present). This is necessary because most models cannot handle non-numerical features natively.
    We can do this by using the pandas get_dummies() function:
    import pandas as pd
    encoded_country = pd.get_dummies(df['Country'])
    del df['Country']
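Both tips can be combined into a single feature matrix. The sketch below uses a small made-up data set (the column names and values are hypothetical) and is only meant to show how the pieces fit together:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data set with two numeric columns and one categorical column
toy = pd.DataFrame({
    "Age": [25.0, 40.0, 31.0, 58.0],
    "Income": [30000.0, 52000.0, 41000.0, 75000.0],
    "Country": ["Italy", "Austria", "Germany", "Italy"],
})

numeric_cols = ["Age", "Income"]

# Scale the numeric columns to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# One-hot encode the categorical column (one 0/1 column per country)
encoded_country = pd.get_dummies(toy["Country"], dtype=int)

# Final feature matrix: scaled numeric columns plus the dummy columns
features = pd.concat([scaled, encoded_country], axis=1)
print(features)
```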

The first ML model used was scikit-learn’s RandomForestRegressor. A random forest is a collection of individual decision trees, each trained on a bootstrap sample of the data (bagging), that makes its prediction by averaging the predictions of all the trees. Random forests tend to be resistant to overfitting because the errors of the individual trees partly cancel each other out. If you want to learn more, refer to [3].

Below is a small snippet that translates the above pseudocode into actual Python code:

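import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# df: numeric (already encoded) data set
# num_features: names of the numeric columns that contain missing values
for numeric_feature in num_features:
    # 1. temporarily fill the gaps in all columns with the median
    sim = SimpleImputer(missing_values=np.nan, strategy='median')
    df_temp = pd.DataFrame(sim.fit_transform(df), columns=df.columns, index=df.index)
    # restore the original (still incomplete) target column
    df_temp[numeric_feature] = df[numeric_feature]

    # 2. rows with a known target form the training set
    is_missing = df_temp[numeric_feature].isnull()
    df_train = df_temp[~is_missing]
    y = df_train[numeric_feature].values
    X = df_train.drop(columns=numeric_feature).values
    df_test = df_temp[is_missing].drop(columns=numeric_feature)

    # scale the predictors; the scaler is fitted on the training rows only
    standard_scaler = StandardScaler()
    x_scaled = standard_scaler.fit_transform(X)
    test_scaled = standard_scaler.transform(df_test.values)

    # 3. train the model on the rows where the target is known
    rf_regressor = RandomForestRegressor()
    rf_regressor.fit(x_scaled, y)

    # 4. predict the missing values and write them back into df
    pred_values = rf_regressor.predict(test_scaled)
    df.loc[df[numeric_feature].isnull(), numeric_feature] = pred_values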

Categorical feature imputation is done in a similar way. In this case, we are dealing with a classification task and should use the RandomForestClassifier class.
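A minimal sketch of the categorical case, assuming a made-up data set in which a hypothetical Segment column depends on one of two numeric predictors (none of these names come from the original data), could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: the category depends on the sign of a numeric predictor
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
segment = np.where(x1 > 0, "retail", "corporate").astype(object)
toy = pd.DataFrame({"x1": x1, "x2": x2, "Segment": segment})

# Pretend 30 category labels are unknown
hidden = rng.choice(200, size=30, replace=False)
true_labels = toy.loc[hidden, "Segment"].copy()
toy.loc[hidden, "Segment"] = np.nan

missing = toy["Segment"].isnull()
predictors = ["x1", "x2"]

# Train a classifier on the rows where the category is known ...
rf_classifier = RandomForestClassifier(random_state=0)
rf_classifier.fit(toy.loc[~missing, predictors], toy.loc[~missing, "Segment"])

# ... and predict the category where it is missing
toy.loc[missing, "Segment"] = rf_classifier.predict(toy.loc[missing, predictors])

accuracy = (toy.loc[hidden, "Segment"] == true_labels).mean()
print(f"share of hidden labels recovered: {accuracy:.2f}")
```

Tree-based classifiers in scikit-learn accept string labels as the target, so only the predictors need to be numeric here.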

Important note on using categorical features as predictors:
In my opinion, it is correct to perform temporary imputation of categorical features before encoding them.

Consider the below example where the Country feature has already been encoded before beginning the imputation procedure:

id    Austria    Italy    Germany
0     0          1        0
1     1          0        0
2     0          0        1
3     NaN        NaN      NaN

If we apply a simple imputation using the most frequent value for example, we would get the following result on the last row:

id    Austria    Italy    Germany
3     0          0        0

This is a logical mistake in the representation, since each row should contain exactly one column with the value 1 to denote the presence of a particular country. We can avoid this mistake by imputing before encoding, since we are then guaranteed to fill the missing values with an actual country value.
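The difference between the two orderings can be reproduced with pandas alone. This is only a sketch on the four-row example above; it uses pandas’ mode() as the “most frequent” strategy instead of SimpleImputer to keep it short:

```python
import numpy as np
import pandas as pd

# The Country column from the example above, before encoding
country = pd.Series(["Italy", "Austria", "Germany", np.nan], name="Country")

# Option A: encode first, then impute each dummy column with its most frequent value
encoded = pd.get_dummies(country, dtype=float)
encoded.loc[country.isnull(), :] = np.nan          # keep the gap visible after encoding
encoded_then_imputed = encoded.fillna(encoded.mode().iloc[0])

# Option B: impute the raw column first, then encode
imputed = country.fillna(country.mode().iloc[0])   # ties are broken alphabetically
imputed_then_encoded = pd.get_dummies(imputed, dtype=float)

print(encoded_then_imputed.sum(axis=1).tolist())   # last row sums to 0: no country at all
print(imputed_then_encoded.sum(axis=1).tolist())   # every row sums to 1
```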

2.3.2 Imputation of missing data using XGBoost

The XGBoost algorithm is an optimized implementation of gradient boosting. Like random forests, XGBoost is a tree-based estimator, but its trees are built sequentially, each one trying to correct the errors of the previous ones, rather than independently of each other. For more information, check out the official documentation.
The XGB model can actually handle missing values on its own, so it is not necessary to perform temporary simple imputation on predictor variables, i.e. we could skip the first step in the pseudocode.
Training and prediction of missing values are done in a similar fashion to the random forest approach:

import xgboost as xgb

xgbr = xgb.XGBRegressor()
xgbr.fit(x_scaled, y)
pred_values = xgbr.predict(test_scaled)

2.3.3 Imputation of missing data using Keras Deep Neural Networks

Neural networks follow a fundamentally different approach during training compared to tree-based estimators. In my work, I have used the neural network implementation offered by the Keras library. Below I wrote an example demonstrating its application in Python:

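import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense

input_layer_size = x_scaled.shape[1]  # number of predictor columns

model = Sequential()
model.add(Dense(30, input_dim=input_layer_size, activation='relu'))
model.add(Dense(30, activation='relu'))
# identity activation in the output layer for regression
model.add(Dense(1))
# in case of multi-class classification (n_classes = number of categories):
# model.add(Dense(n_classes, activation='softmax'))
# the loss and optimizer below are common choices for regression, assumed here
model.compile(loss='mean_squared_error', optimizer='adam')
# in case of multi-class classification:
# model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(x_scaled, y)
pred_values = model.predict(test_scaled)[:, 0]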

2.3.4 Imputation of missing data using Datawig

Datawig is another deep learning model I employed. It is designed specifically for missing value imputation and uses deep neural networks built on MXNet to make predictions. It can work with missing data during training and it automatically handles categorical data with its CategoricalEncoder class, so we don’t need to pre-encode them. A possible implementation is the following:

import datawig

imputer = datawig.SimpleImputer(
    input_columns=list(df_test.columns),  # the predictor columns
    output_column=numeric_feature,  # the column to impute
    output_path='imputer_model'  # stores model data and metrics
)

# Fit the imputer model on the train data:
imputer.fit(train_df=scaled_df_train)

# Alternatively, we could use the fit_hpo() method to find
# the best hyperparameters:
# imputer.fit_hpo(train_df=scaled_df_train)

# Impute missing values, return original dataframe with predictions
pred_vals = imputer.predict(scaled_df_test).iloc[:, -1:].values[:, 0]

Datawig is optimized for pandas DataFrames, meaning that it takes dataframe objects directly as input for training and prediction, so we do not need to transform them into NumPy arrays.
Moreover, we should not drop the target variable column from the training set and input it as a separate argument as we did previously when fitting a model. Datawig handles this automatically.

2.3.5 Imputation of missing data using IterativeImputer

The scikit-learn package also offers a more sophisticated approach to data imputation with the IterativeImputer class. So where does this approach differ from the ones we saw before? The name gives us a hint.

Iterative means that each feature is imputed multiple times. Each iteration can be called a “cycle”. The reason behind running multiple cycles is to achieve some sort of “convergence”, although it was not clear to me from the scikit-learn documentation what exactly this means. You can think of convergence in terms of stabilization of the predicted values:

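import pandas as pd
import xgboost as xgb
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer

# use an XGBoost regressor as the estimator and run at most 10 cycles
iim = IterativeImputer(estimator=xgb.XGBRegressor(), max_iter=10)

# impute all the numeric columns containing missing values
# with just one line of code:
imputed_df = pd.DataFrame(iim.fit_transform(df))

imputed_df.columns = df.columns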

From the code above, we can see that each feature is imputed up to 10 times (max_iter=10), and in the end we get the imputed values of the last cycle.

Notice how I used an XGBoost regressor as model input. This shows that IterativeImputer also accepts some ML models that are not native to the scikit-learn library.
Despite being easy to implement, it takes a very large amount of time to calculate compared to the other approaches.

In addition, I would advise using this class with care since it is still in its experimental stages.

Comparing Performances of ML Algorithms

Up until this point, we have seen various techniques that can be employed to impute missing data, as well as the actual process of imputation. However, I have not explained how we can compare the quality of the predictions provided by these approaches.

This is not immediately obvious because, well, we do not possess the missing data to compare the predictions to. Instead, what I decided to do is keep a holdout or validation set from the training data, and then use it for model performance evaluation.

So, we are pretending that some data is missing and inferring the actual accuracy of the imputed values from the accuracy of the imputations on these fake missing values. The snippet below enables us to do this:

import random

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# df_train: rows where numeric_feature is observed (target column still included)
train_copy = df_train.copy()

# hide 20% of the known target values to create a "fake" test set
n_hidden = int(len(train_copy) * 0.2)
i = sorted(random.sample(range(len(train_copy)), n_hidden))
train_copy.iloc[i, train_copy.columns.get_loc(numeric_feature)] = np.nan

# the true values of the hidden entries, in the same row order as fake_test_df
y_fake_test = df_train.iloc[i][numeric_feature].values
is_hidden = train_copy[numeric_feature].isnull()
new_train_df = train_copy[~is_hidden]
fake_test_df = train_copy[is_hidden]
train_y = new_train_df[numeric_feature].values
new_train_df = new_train_df.drop(columns=numeric_feature)
fake_test_df = fake_test_df.drop(columns=numeric_feature)

rf_regressor = RandomForestRegressor()
rf_regressor.fit(new_train_df.values, train_y)
train_pred = rf_regressor.predict(new_train_df.values)
test_pred = rf_regressor.predict(fake_test_df.values)

print("R2 train:{} | R2 test:{}".format(r2_score(train_y, train_pred), r2_score(y_fake_test, test_pred)))
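The same holdout idea can be wrapped in a small helper to compare several models side by side. The sketch below is an illustration on made-up, deliberately non-linear data; the holdout_r2 function and the data are hypothetical and not part of the original study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def holdout_r2(model, X, y, hidden_fraction=0.2, seed=0):
    """Hide a fraction of the known targets, fit on the rest, score on the hidden part."""
    rng = np.random.default_rng(seed)
    n_hidden = int(len(y) * hidden_fraction)
    hidden = rng.choice(len(y), size=n_hidden, replace=False)
    visible = np.setdiff1d(np.arange(len(y)), hidden)
    model.fit(X[visible], y[visible])
    return r2_score(y[hidden], model.predict(X[hidden]))


# Hypothetical data with a non-linear relationship between predictors and target
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

scores = {
    "linear regression": holdout_r2(LinearRegression(), X, y),
    "random forest": holdout_r2(RandomForestRegressor(random_state=0), X, y),
}
for name, score in scores.items():
    print(f"{name}: R2 on hidden values = {score:.2f}")
```

Using the same seed for every model means they are all scored on the same hidden values, which keeps the comparison fair.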

The prediction quality, or goodness of fit, can be measured by the coefficient of determination, which is expressed as:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where RSS is the sum of squared residuals, and TSS represents the total sum of squares. Below I have plotted a visual comparison of the model performances for several attributes. Visualization was done utilizing the seaborn library.

Comparing model prediction accuracy on various attributes

We can see that the random forest model consistently ranks among the best.
To get another hint at the consistency of RF, I have plotted the actual values against the predicted values in the test set for the VAR_1 variable:

Plotting actual values against predicted ones for Var_1

Ideally, the line in any graph should be a straight, diagonal one. The model which comes closest to this is the random forest, which was ultimately my choice for imputation.

Potential Future Steps

Another interesting technique for imputation, which you can employ in the future, is Multiple Imputation by Chained Equations (MICE). This takes iterative imputation up a notch. The core idea behind it is to create multiple copies of the original data set (usually 5 to 10 are enough) and perform iterative imputation on each copy. The results obtained from each copy can later be pooled together according to some metrics that we define.

Ultimately, the goal is to somehow account for the variability of the missing data and study the effects of different permutations on the prediction results. The scheme below illustrates this:

In Python, MICE is offered by a few libraries like impyute or statsmodels. However, they are limited to linear regression estimators.
Another way to mimic the MICE approach would be to run scikit-learn‘s IterativeImputer many times on the same dataset using different random seeds each time.
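A rough sketch of this idea on made-up data is shown below. It assumes the default estimator and sets sample_posterior=True, so that each run draws its imputations from a predictive distribution and different seeds actually produce different values; the data and the simple mean/standard deviation pooling are illustrative choices only:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical data: the third column depends on the first two
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries at random
X_missing = X.copy()
mask = rng.random(X.shape) < 0.1
X_missing[mask] = np.nan

# Run the imputer several times, each time with a different random seed
imputations = np.stack([
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
])

# Pool the results: the mean is the final estimate, the standard
# deviation shows how uncertain each imputed value is
pooled = imputations.mean(axis=0)
spread = imputations.std(axis=0)

print(f"average spread of imputed entries: {spread[mask].mean():.3f}")
```

Observed entries are identical in every run, so their spread is zero; only the imputed entries vary between seeds.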
Yet another take on the imputation problem is to apply a technique called maximum likelihood estimation. It derives missing values from a user-defined distribution function, whose parameters are chosen in a way that maximizes the likelihood of the imputed values actually occurring.


We got a glimpse of what the potential approaches for handling missing values are, from the simplest techniques like deletion to more complex ones like iterative imputation.

In general, there is no best way to solve imputation problems and solutions vary according to the nature of the problem, size of the data set, etc. However, I hope to have convinced you that an ML-based approach has inherent value because it offers us a ‘universal’ way out. While missing data may be truly unknowable, we can at least try to come up with an educated guess based on the hidden relationships with the already existing attributes, captured and exposed to us by the power of machine learning.


[1] Berthold M.R., and others, Data understanding, in: Guide to Intelligent Data Analysis, Springer, London, pp. 37-40, 42-44.

[2] Raschka S., Data preprocessing, in: Python Machine Learning, Packt, Birmingham, pp. 82-83, 90-91.

[3] Tan P., and others, Data preprocessing, Classification, Ensemble methods, in: Introduction to Data Mining, Addison Wesley, Boston, pp. 187-188, 289-292.

No industry for late data

The need for real time transaction data amid regulators’ requirements and investors’ need for higher transparency

Let’s picture this for a second: a world where Spotify had no real-time data and used no AI or ML algorithms. You would need to send an email to the support center, list the songs and genres you have liked in the past, wait a couple of hours or a few days depending on how busy client support is, and only then receive a recommendation for a new song. Crazy, isn’t it?

This is exactly what is happening in financial institutions: old processes, old systems, late and old data. Today, many financial institutions continue to take decisions involving millions of euros (if not billions) on the basis of outdated (and often inconsistent) data deriving from manual processes, usually handled in Excel.

Regulators are well aware of the risk of using outdated data (e.g. year-old financial reports) and non-homogeneous data (coming from different sources and based on different definitions) when assessing new investment opportunities.

An example is the new definition of default set by the CRR (Capital Requirements Regulation). In September 2016, the EBA published final guidelines on the application of Art. 178 related to the definition of default, and Regulatory Technical Standards on the materiality threshold of past due credit obligations.

Paragraph 106 – Timeliness of the identification of default states that “Institutions should have effective processes that allow them to obtain the relevant information in order to identify defaults in a timely manner, and to channel the relevant information in the shortest possible time”.

This RTS does not only require a fast process but also indicates that the identification of default should be performed on a daily basis. This becomes paramount for the industry, as it requires moving processes and procedures to the next level in order to comply with this requirement.

On 1 January 2021, all of this will become real, and credit institutions and investment firms using either the IRB or the Standardized approach will be required to comply with the above.

Taking the securitization industry as an example, credit originators or vehicle servicers report data on a monthly (if not quarterly) basis using Excel or, in some cases, PDF files. This requires a considerable amount of time to manipulate data (cleaning fields, merging files, linking items, standardizing output) and extract relevant information, leaving investors constantly running behind the data.

Regulators are clearly pushing the financial industry to adopt advanced technological solutions to improve the way they manage data. Another example is the Draft Regulatory Technical Standards on the prudential treatment of software assets published in October 2015 [1], which directly support investments by financial institutions in these solutions.

Another need for new and improved technologies to manage data comes from the increased volatility of financial markets (which became even more evident with the Covid-19 pandemic), requiring prompt reactions even in private markets. But how could you react fast in your portfolio if your data are one month old?

The buy-side industry (including Asset Managers, Pension Funds, Investment Funds, etc.) requires an additional level of transparency when it comes to financial data. To establish trust among investors, managers of securitization vehicles are asked to provide detailed information that goes far beyond the publishing of a monthly report and encompasses asset-level information provided on a daily basis. This requires a rethinking of the reporting processes of securitizations, leaving aside Excel and PDF files and starting to embed technologies that allow all stakeholders involved to access real-time data 24/7.

Technology is now available off-the-shelf to small players too, not only the top ones. Thanks to fintech developments and the use of cloud computing, any actor (small or big) can take advantage of advanced, ready-to-use technology propositions. This, in turn, avoids large and risky project-specific capital expenditures.

The Tortoise and the Hare

What makes the difference is the time to market in adopting such new technology propositions, not the deep pockets needed to develop a proprietary tool, as was still the case a few years ago.

[1] Draft Regulatory Technical Standards on the prudential treatment of software assets under Article 36 of Regulation (EU) No 575/2013 (Capital Requirements Regulation – CRR) amending Delegated Regulation (EU) 241/2014 supplementing Regulation (EU) No 575/2013 of the European Parliament and of the Council with regard to regulatory technical standards for own funds requirements for institutions.
