Using Large Language Models For Data Enrichment In Financial Crime and Compliance

Benjamin Wootton

Max Worrall


Many businesses find themselves under near-constant attack from criminals seeking to commit financial crimes such as fraud, scams, and money laundering.

Though businesses have systems in place to protect against these threats and maintain compliance with the law, these systems are falling short in detecting the increasingly sophisticated techniques employed by bad actors.

In response, businesses are applying more advanced data and analytics techniques to close their detection gaps by using the data they collect to identify subtle indications of risk or financial crime.

[Figure: rule-based systems]

Generative AI and Large Language Models

Generative AI and Large Language Models (LLMs) such as GPT-4 are one such innovation that financial crime and compliance teams are looking to leverage.

One of the key use cases is in using LLMs to automatically process long, complex and unstructured text documents, applying techniques such as summarisation and entity extraction to glean key insights from them.

Once these insights are extracted from the text and organised, it then becomes much easier to integrate them into either manual review processes or automated risk systems that monitor individuals and their transactions.

To illustrate, consider the following piece of text, which could be found in a person's online written biography:

James Laskon, a distinguished UK politician (MP) and former airforce 
officer has had a storied career marked by academic achievements and 
notable service in the Polish Air Force. Born and raised in Warsaw, 
Poland, Laskon demonstrated early on a keen intellect and a passion for 
aeronautics. This passion led him to pursue higher education at three prestigious 
universities. His academic journey began at the Uniwersytet Warszawski 
where he earned a Bachelor’s degree in Aerospace Engineering. Here, he 
was renowned for his research on flight dynamics and his 
contributions to the university's aeronautical research programs.

After his undergraduate studies, Laskon furthered his education at 
the Massachusetts Institute of Technology (MIT) in the United States. At MIT, 
he completed his Master's and Ph.D. in Aeronautics and Astronautics. His doctoral 
research focused on advanced propulsion systems, earning him recognition in 
academic circles. During his time at MIT, Laskon also collaborated on several 
international projects, expanding his expertise and forging lasting 
professional relationships. His contributions to the field were acknowledged 
through various awards and publications in esteemed journals, solidifying his 
reputation as an emerging leader in aerospace technology.

Following his illustrious academic tenure in the United States, Laskon returned 
to Poland, where he joined the Polish Air Force. His time in the military was 
marked by significant achievements, including his role as a lead engineer on 
critical defence projects and his work in modernising Poland's aerial capabilities.
Laskon’s military service also saw him teaching at the Polish Air Force Academy, 
where he shared his extensive knowledge with the next generation of pilots and
engineers. 

In 2010 he moved to the UK to take up executive roles with two aeronautical
firms.  In the 2010s he became more politically active as an independent, 
in 2019 becoming the member of parliament for Chesterfield.  

Alongside his UK Parliamentary commitments as an MP he also works 
closely with a number of international corporations helping to develop cross 
border trade. 

Simply reading this takes time. Processing it to extract information such as the institutions involved and James's relationship with each takes further time and effort, which would typically fall on the shoulders of a fincrime analyst.

Using an LLM, however, we can process this passage of text and extract key facts automatically. For instance, in the prompt below we ask the LLM to list and categorise the military and educational institutions that James has been associated with:

Please can you summarise the institutions that James Laskon was associated 
with, categorising them into military or education. Return the answer in 
JSON format.    

The LLM, in this case GPT-4o, returns the following correct result in a format structured for processing:

{
  "Military": [
    "Polish Air Force"
  ],
  "Education": [
    "University of Warsaw",
    "Massachusetts Institute of Technology (MIT)",
    "University of Cambridge",
    "Polish Air Force Academy"
  ]
}

Compared with unstructured text, this structured and organised data can be much more easily consumed by a compliance analyst through an application, requiring less time and manual effort and with much less scope for error. Alternatively, the structured data could be more easily incorporated into automated risk systems to drive up the level of straight-through processing.
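To make this concrete, the following is a minimal Python sketch of how the prompt and reply above might be handled in code. It is an illustration under assumptions rather than a reference implementation: `extract_institutions` and `PROMPT_TEMPLATE` are hypothetical names, and `fake_llm` is a stand-in that simply returns the reply shown above, where a real system would call its chosen model.

```python
import json
from typing import Callable, Dict, List

# Prompt wording taken from the article; {text} is the biography to analyse.
PROMPT_TEMPLATE = (
    "Please can you summarise the institutions that {name} was associated "
    "with, categorising them into military or education. Return the answer "
    "in JSON format.\n\n{text}"
)


def extract_institutions(
    name: str, text: str, call_llm: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Build the prompt, call the model and parse its JSON reply."""
    reply = call_llm(PROMPT_TEMPLATE.format(name=name, text=text))
    # Replies may wrap the JSON in prose or code fences, so keep only
    # the span from the first "{" to the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the LLM reply")
    data = json.loads(reply[start : end + 1])
    # Reject replies that do not have the expected shape.
    for category, names in data.items():
        if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
            raise ValueError(f"category {category!r} is not a list of strings")
    return data


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns the reply shown above."""
    return (
        'Here is the summary: { "Military": [ "Polish Air Force" ], '
        '"Education": [ "University of Warsaw", '
        '"Massachusetts Institute of Technology (MIT)", '
        '"University of Cambridge", "Polish Air Force Academy" ] }'
    )
```

Passing the model call in as a function keeps the parsing logic independent of any particular provider, which also makes it straightforward to test against canned replies.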

Applications In Financial Crime Monitoring

This technique is highly relevant to financial crime and compliance because a great deal of information about people's backgrounds and financial affairs is stored in unstructured text documents such as statements, transaction records, sanctions documents, company formation documents, company websites and onboarding identification.

Analysing these documents and performing entity extraction by hand would be laborious and error-prone, and due to the sheer volume of content it would require significant human effort to do this on a regular basis across a large customer base.

Because of this, a lot of potentially valuable contextual information that is in the public domain is simply not used because it is too hard and expensive to integrate.

The opportunity therefore is to use large language models to automate this process of analysing the documents and making the data ready for downstream analysis.

A Worked Example

We will now illustrate this with a worked example for a fictional person to explain how records are enhanced over time and where LLMs fit in.

The business in question has captured basic demographic data as part of its usual onboarding process, including characteristics such as the name, date of birth and address. In this instance, the date of birth and address are incomplete, which represents a typical data quality issue.

Field             Value
Entity Reference  12979431
Name              James Laskon
Birth Date        1963
Country           GB
Address           Chipping Norton, Oxfordshire, OX7 5AG

As part of the standard compliance processes (e.g. KYC or AML), the record would be matched with structured data sourced from datasets such as politically exposed person (PEP) or sanctions lists. In this instance we have identified that our client of interest, James Laskon, is politically exposed, having served as a member of the UK parliament from 2019 onwards.

Because of this increased risk, the business may then seek a fuller and more up-to-date view of the person’s data. In this case they have manually contacted the customer to clarify their full date of birth, address, aliases and contact details. This new data meets the business's risk-based threshold for continuing to engage with the customer.

Field                Value
Entity Reference     12979431
Name                 James Laskon
Aliases              Jamie Laskon, James Lacks
Birth Date           1963-11-29
Country              GB
Address              199 Portland Place, Chipping Norton, Oxfordshire, OX7 5AG
Phone Number         +4412468888222
Email                Laskonj@gov.committee.uk
Flags                Politically Exposed Person
Political Positions  Member of the 58th Parliament of the United Kingdom (2019-)

LLMs already have potential at this point in the process, in that we could have automatically processed and ingested documents such as his passport provided at onboarding to validate his date of birth. This could have avoided the need for the manual contact with the customer.
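As a small illustration of this validation step, the sketch below checks whether the partial date of birth held on file agrees with a full date read from a document. It assumes the extraction step has already returned the passport date as an ISO string; `dob_consistent` is a hypothetical helper name.

```python
def dob_consistent(on_file: str, from_document: str) -> bool:
    """True when a possibly partial ISO date (YYYY, YYYY-MM or YYYY-MM-DD)
    agrees with the date extracted from a document."""
    known_parts = on_file.split("-")
    document_parts = from_document.split("-")
    # Compare only the components that the record on file actually holds.
    return document_parts[: len(known_parts)] == known_parts
```

With the record above, `dob_consistent("1963", "1963-11-29")` returns `True`, so the fuller date could be accepted without contacting the customer, whereas a mismatch would be routed to manual review.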

At this point, many businesses will look to enhance their records further with other structured data. In this instance, we have matched the individual with the OpenSanctions politically exposed person dataset, allowing us to bring in more fine-grained PEP detail such as the specific political positions that he has held, including start dates and end dates. New aliases have also been identified during this phase.

Field                       Value
Entity Reference            12979431
Name                        James Laskon
Aliases                     Jamie Laskon, James Lacks, マーク・プリチャード; 馬克·普理查
Birth Date                  1963-11-29
Country                     GB
Address                     199 Portland Place, Chipping Norton, Oxfordshire, OX7 5AG
Phone Number                +4412468888222
Email                       Laskonj@gov.committee.uk
Flags                       Politically Exposed Person
Political Positions         Member of the 58th Parliament of the United Kingdom (2019-)
Former Political Positions  House of Commons (member, 2005-2010) · House of Commons (member, 2010-2015) · House of Commons (member, 2015-2017) · House of Commons (member, 2017-) · Member of the Privy Council of the United Kingdom (2021-) · Representative of the Parliamentary Assembly of the Council of Europe (2015-2017) · Member of the 54th Parliament of the United Kingdom (2005-2010) · Member of the 55th Parliament of the United Kingdom (2010-2015) · Member of the 56th Parliament of the United Kingdom (2015-2017) · Member of the 57th Parliament of the United Kingdom (2017-2019)
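The merge that produces a record like the one above could be sketched as follows. This is one possible policy, assumed for illustration: list-valued fields such as aliases are combined without duplicates, while existing single-valued fields are never overwritten by the incoming dataset. The function name `merge_enrichment` and the lower-case field names are hypothetical.

```python
def merge_enrichment(record: dict, enrichment: dict) -> dict:
    """Return a copy of `record` with fields from `enrichment` merged in."""
    merged = dict(record)
    for field, value in enrichment.items():
        if isinstance(value, list):
            existing = merged.get(field, [])
            # dict.fromkeys keeps first-seen order while dropping duplicates.
            merged[field] = list(dict.fromkeys([*existing, *value]))
        elif not merged.get(field):
            # Only fill single-valued fields that are missing or empty.
            merged[field] = value
    return merged
```

For example, merging `{"aliases": ["James Lacks", "マーク・プリチャード"]}` into a record that already holds the aliases Jamie Laskon and James Lacks yields three aliases, with the existing two kept first.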

There are a number of similar structured datasets such as published sanctions lists (OFAC, HMT, EU, UN), policing and judicial watchlists and other key data sources from credit reference agencies, regulatory and enforcement authorities that can form part of the data enhancement journey. This is an ongoing journey which allows businesses to better know their customers and the risks they represent.

LLMs can also be used during this phase to corroborate data across multiple verifiable data sources, giving increased confidence in its accuracy. The models can be adapted to reflect a specific AML risk-based approach or a vendor's data governance process, not just at the point of onboarding or screening but as part of a perpetual KYC process.
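One simple way to express corroboration in code is to accept a value only when enough independent sources agree on it. The sketch below assumes each source has already been reduced to a single string for the field in question; the source names and the `min_sources` threshold are illustrative assumptions, not a recommended policy.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def corroborate(
    values_by_source: Dict[str, Optional[str]], min_sources: int = 2
) -> Tuple[Optional[str], int]:
    """Return the most widely supported value and its number of supporting
    sources, or (None, support) when support is below `min_sources`."""
    # Normalise lightly so trivial differences do not split the vote.
    counts = Counter(v.strip().lower() for v in values_by_source.values() if v)
    if not counts:
        return None, 0
    value, support = counts.most_common(1)[0]
    return (value if support >= min_sources else None), support
```

A value that fails the threshold would not be discarded; it would more likely be held with a lower confidence score or queued for an analyst to check.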

So far this has all been about integrating structured data into the records. However, the next step in this evolution, and where LLMs can potentially help, is giving us the ability to bring in data found in unstructured data sources such as documents and online articles.

Examples of these unstructured documents include interest registers, public procurement documents, government reports, company registration documents and civil and military awards. This type of information is much less likely to be found in easily accessible structured datasets but is still extraordinarily valuable.

Returning to the passage shared above describing James Laskon's biography, we have already used the LLM to identify his associations with the aforementioned educational and military institutions. This information can be added to the record.

Field                       Details
Entity Reference            12979431
Name                        James Laskon
Aliases                     Jamie Laskon, James Lacks, マーク・プリチャード; 馬克·普理查
Birth Date                  1963-11-29
Country                     GB
Address                     199 Portland Place, Chipping Norton, Oxfordshire, OX7 5AG
Phone Number                +4412468888222
Email                       Laskonj@gov.committee.uk
Flags                       Politically Exposed Person
Political Positions         Member of the 58th Parliament of the United Kingdom (2019-)
Former Political Positions  House of Commons (member, 2005-2010) · House of Commons (member, 2010-2015) · House of Commons (member, 2015-2017) · House of Commons (member, 2017-) · Member of the Privy Council of the United Kingdom (2021-) · Representative of the Parliamentary Assembly of the Council of Europe (2015-2017) · Member of the 54th Parliament of the United Kingdom (2005-2010) · Member of the 55th Parliament of the United Kingdom (2010-2015) · Member of the 56th Parliament of the United Kingdom (2015-2017) · Member of the 57th Parliament of the United Kingdom (2017-2019)
Educational Institutions    University of Warsaw · Massachusetts Institute of Technology (MIT) · University of Cambridge · Polish Air Force Academy
Military Branch/Service     Siły Powietrzne

Next, we could combine this online biography with other publicly available documents. In this case we managed to identify that he is associated with five specific companies in the energy and military sector, that he is connected to NATO and received various public financial donations.

Interestingly, this information also provides evidence of associations with certain locations, including North Macedonia, which at the time of writing is ranked 76th globally in Transparency International's Corruption Perceptions Index. This could be a highly relevant risk indicator that should be raised for high-priority manual review.

Field                       Details
Entity Reference            12979431
Name                        James Laskon
Aliases                     Jamie Laskon, James Lacks, マーク・プリチャード; 馬克·普理查
Birth Date                  1963-11-29
Country                     GB
Address                     199 Portland Place, Chipping Norton, Oxfordshire, OX7 5AG
Phone Number                +4412468888222
Email                       Laskonj@gov.committee.uk
Flags                       Politically Exposed Person
Political Positions         Member of the 58th Parliament of the United Kingdom (2019-)
Former Political Positions  House of Commons (member, 2005-2010) · House of Commons (member, 2010-2015) · House of Commons (member, 2015-2017) · House of Commons (member, 2017-) · Member of the Privy Council of the United Kingdom (2021-) · Representative of the Parliamentary Assembly of the Council of Europe (2015-2017) · Member of the 54th Parliament of the United Kingdom (2005-2010) · Member of the 55th Parliament of the United Kingdom (2010-2015) · Member of the 56th Parliament of the United Kingdom (2015-2017) · Member of the 57th Parliament of the United Kingdom (2017-2019)
Educational Institutions    University of Warsaw · Massachusetts Institute of Technology (MIT) · University of Cambridge · Polish Air Force Academy
Military Branch/Service     Siły Powietrzne
Political Affiliation       Independent
Rank/Ensign                 Podporucznik (Lieutenant)
Appointed Bodies            Parliament Security and Intelligence Committee
Associated Companies        HT Anbar Group (Military/Aerospace, Skopje, MKD, St. 1550 no.19, Industrial Zone Vizbegovo)
                            Tex Oil Inc Energy Holdings (Energy, Houston, 432 Milam Street, Suite 1300, Houston, TX 77003)
                            Novagas (Energy, Sofia, Bulgaria, 5 Filip Kutev Street, 1407)
                            Nord Vujl AG (Energy, St Gallen, Switzerland, Oberer Graben 4-6, 2000)
                            Stratgen Holdings LLC (Gov Advisory, Washington, 1900 K Street, NW, Washington, D.C. 20006)
Associated Company Sectors  Energy & Aerospace
Associated Countries        Poland, UK, USA, Switzerland, North Macedonia
Awards Civil                Order of Bath
Awards Military             Order of Polonia Restituta
Donor Name                  JQ8 Limited (13th Floor, 39 Sloane Street, UK) · European Management Covenant (8 St James's Square, St James's, SW1Y 4JU UK) · Israeli Friendship NL Hof Group (3811 NG Netherlands)
Published Papers            Election Cause, Humanitarian
Associated Organisation     NATO, Combatants for Peace, Royal Society
GSA Contract                Kovnak COMMUNICATIONS LLC (Address: 71 KIJI DAVA CIR STE A, PRESCOTT, AZ 86301-5691, Phone: 928-774-0992)
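The country-based risk indicator described above could be derived from the enriched record with a simple lookup. In this sketch the only ranking included is the North Macedonia figure quoted in this article; the threshold of 50 is a hypothetical cut-off, and a real system would load the full index from the publisher's data rather than hard-coding it.

```python
from typing import Dict, List, Tuple

# Only the figure quoted in the article is included here; other countries
# are deliberately left out rather than guessed.
CPI_RANK: Dict[str, int] = {"North Macedonia": 76}
RANK_THRESHOLD = 50  # hypothetical cut-off for raising a manual review


def flag_countries(
    countries: List[str],
    cpi_rank: Dict[str, int] = CPI_RANK,
    threshold: int = RANK_THRESHOLD,
) -> Tuple[List[str], List[str]]:
    """Split countries into those ranked worse than the threshold and
    those with no ranking available."""
    flagged = [c for c in countries if c in cpi_rank and cpi_rank[c] > threshold]
    unknown = [c for c in countries if c not in cpi_rank]
    return flagged, unknown
```

Reporting the countries with no ranking separately avoids silently treating missing data as low risk.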

The net result of all of this is that we now have a highly enriched profile which provides additional contextual information for fincrime analysts to use as part of investigations and risk based classifications.

By putting this information into the right front-end systems, it also reduces the amount of time spent researching an entity when an event such as a higher risk transaction is flagged for review because all of the information is already collated and surfaced to the analyst.

Enhancing Graph Analytics And Graph Databases With LLM Outputs

Another of the key analytical tools that are useful in the fight against financial crime is graph analytics. This is a data science technique that allows us to understand and analyse how entities are connected to each other in order to uncover hidden relationships and other insights.

[Figure: graph analytics]

These connections are at the heart of the financial crime detection problem. If we can confirm that two individuals or organisations are somehow linked, and that for instance a customer is associated with a known fraudster or sanctioned individual, then this is a significant risk indicator which a fincrime team would need to know about.

Again, LLMs have an important role to play by using unstructured data sources to add more detail and fidelity to these graphs. For example, we could process documents that uncover that James Laskon, our politically exposed person, has an employment relationship with a company that is run by sanctioned individuals. This would have gone entirely undetected using structured data sources only.

Example documents which we could interrogate to build our graph include honours and accolades lists, civil and military awards, gifts, investments, educational affiliations and public sector procurement contracts. This type of information, not typically integrated into traditional systems, can provide a richer, more comprehensive view of the entities we analyse and allow us to connect entities which would historically have remained unlinked.

By combining structured and unstructured data, we can create a more accurate, higher-fidelity graph, enhancing our ability to make confident linkages and informed decisions.
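The idea of uncovering an indirect link can be sketched with a small in-memory graph and a breadth-first search. The relationship triples below are hypothetical examples of what an LLM might extract from documents ("Example Co" and "Sanctioned Person A" are placeholders), and a production system would more likely run this kind of query inside a graph database.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source entity, relationship, target entity)


def build_graph(edges: Iterable[Edge]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from extracted relationships."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def shortest_path(
    graph: Dict[str, Set[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain linking two entities."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two entities are not connected


# Hypothetical triples, as might be extracted from unstructured documents.
EDGES = [
    ("James Laskon", "EMPLOYED_BY", "Example Co"),
    ("Example Co", "RUN_BY", "Sanctioned Person A"),
]
```

Here `shortest_path(build_graph(EDGES), "James Laskon", "Sanctioned Person A")` returns the three-step chain through the intermediary company, which is exactly the kind of linkage an analyst would want surfaced.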

Technical Architecture

Today, it is likely that systems like this would need to be built using cloud based analytics platforms such as those offered by Amazon Web Services or Databricks. Over time, we would expect these solutions to become more productised and off the shelf, but today the best place to access the latest LLM innovations is through the hyperscale cloud platforms.

The entrance to the process would be a shared folder where new documents are deposited. When a new document is placed in this folder, a process could identify it and send it to the LLM for entity extraction. When complete, the document would be moved to a folder with a processing complete state.
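A minimal sketch of this entry point is shown below, assuming plain-text documents on a local file system. The folder layout, the `*.txt` filter and the `handle` callback (which would wrap the LLM extraction step) are all assumptions; on a cloud platform this would more likely be an object store with event notifications.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_inbox(inbox: Path, done: Path, handle: Callable[[str], None]) -> List[str]:
    """Pass each document in `inbox` to `handle`, then move it to `done`."""
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for doc in sorted(inbox.glob("*.txt")):
        handle(doc.read_text(encoding="utf-8"))
        # Moving the file only after handling marks it as complete, so a
        # failure part-way through leaves the document in the inbox for retry.
        shutil.move(str(doc), str(done / doc.name))
        processed.append(doc.name)
    return processed
```

Running this function on a schedule gives a simple polling loop; the move into the completed folder acts as the processing state.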

Using prompt engineering, we would guide the LLM on how best to process the documents, including any relevant business rules and regulations. We would also request that results are returned in JSON format to make downstream processing easier.

These prompts would be evaluated using automated and manual evaluation techniques against known test data in order to understand their performance.
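For the automated part of this evaluation, one common approach is to compare the entities a prompt extracts with a hand-labelled expected set and report precision and recall. The sketch below assumes exact string matching between entity names, which is a simplification; `precision_recall` is a hypothetical helper name.

```python
from typing import Set, Tuple


def precision_recall(expected: Set[str], extracted: Set[str]) -> Tuple[float, float]:
    """Score one extraction against its hand-labelled expected entities."""
    true_positives = len(expected & extracted)
    # Precision: how much of what the model returned was right.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: how much of what it should have found was found.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall
```

Tracking these two numbers per prompt version makes it possible to tell whether a prompt change reduces invented entities (precision) at the cost of missed ones (recall), or the reverse.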

Each response from the LLM would be validated, before passing the JSON blocks describing the extracted entities onto a queue for subsequent processing.
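The validation and queueing step might look like the following sketch. The required keys are a hypothetical schema chosen for illustration, and Python's in-process `queue.Queue` stands in for whichever message queue the platform provides.

```python
import json
import queue

# Hypothetical minimum schema for one extracted entity.
REQUIRED_KEYS = {"entity", "type", "source_document"}


def validate_and_enqueue(raw_reply: str, out: queue.Queue) -> int:
    """Parse an LLM reply expected to hold a JSON list of entities, and
    enqueue each well-formed item. Returns the number accepted."""
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError:
        return 0  # not valid JSON at all, so nothing is enqueued
    if not isinstance(items, list):
        return 0
    accepted = 0
    for item in items:
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item):
            out.put(item)
            accepted += 1
    return accepted
```

Malformed items are dropped silently here for brevity; in practice they would more likely be logged or routed to a review queue so that extraction failures remain visible.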

New records would be extracted from the queue and upserted into a graph database such as Neo4j.
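As a rough sketch of what such an upsert could look like, the function below prepares a parameterised Cypher `MERGE` statement for a person record. The `Person` label and the `entity_ref` key are assumptions made for illustration; the function only builds the query and parameters, which would then be executed through the database's client library.

```python
from typing import Dict, Tuple

# MERGE matches the node on its key or creates it, and "+=" adds or updates
# the supplied properties without removing any existing ones.
UPSERT_PERSON = "MERGE (p:Person {entity_ref: $entity_ref}) SET p += $props"


def person_upsert(record: Dict[str, object]) -> Tuple[str, Dict[str, object]]:
    """Return the Cypher statement and parameters for upserting one person."""
    props = {
        key: value
        for key, value in record.items()
        if key != "entity_ref" and value is not None
    }
    return UPSERT_PERSON, {"entity_ref": record["entity_ref"], "props": props}
```

Keying the `MERGE` on a stable entity reference means that reprocessing the same document updates the existing node rather than creating a duplicate.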

This graph could also incorporate transactional relational data, which could be sourced from a real time database such as ClickHouse.

Periodically, entity resolution processes would run against the graph database to determine whether records correspond to the same real-world entity. For instance, if we manage to capture an address for a given person, this could give us confidence that two records do in fact correspond to the same real-world person and allow us to link them.
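A deliberately simple version of this matching rule is sketched below: two records are treated as the same person when they share a name or alias and also have the same address after light normalisation. The rule and the field names are assumptions for illustration; real entity resolution would usually weigh many more attributes and tolerate fuzzier matches.

```python
import re
from typing import Dict, Set


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation and spacing for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def all_names(record: Dict) -> Set[str]:
    """The normalised name plus any aliases held on the record."""
    return {normalise(n) for n in [record["name"], *record.get("aliases", [])]}


def same_entity(a: Dict, b: Dict) -> bool:
    """True when the records share a name or alias and the same address."""
    shared_name = bool(all_names(a) & all_names(b))
    address_a = normalise(a.get("address") or "")
    address_b = normalise(b.get("address") or "")
    return shared_name and bool(address_a) and address_a == address_b
```

Requiring the address to match as well as the name is what gives confidence here, since a name or alias alone is often shared by unrelated people.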

After this entity resolution, we would re-query the database for situations of interest such as a politically exposed person who is newly linked to a sanctioned individual via an intermediary.

If such a situation is identified, the fincrime analyst would be proactively alerted via a bespoke user interface or a workflow system, complete with visualisations like the one below to illustrate the linkage:

[Figure: graph entity relationships]

Conclusion

In this article we have explained how LLMs can be used to automatically process unstructured data and text stored in documents. We explained how insights can be extracted from this text and brought into structured formats which are easier to process.

We demonstrated how this structured data can be used to enhance a record, giving us additional context which is very valuable to fincrime and compliance teams who need to make risk based decisions.

We explained how the same process can be used to enhance the connections stored within graph databases, potentially uncovering hidden linkages between entities and allowing us to perform more effective entity resolution.

Finally, we touched on how we could technically implement this using a cloud analytics platform, such as that provided by Amazon Web Services, to make use of the latest LLM innovations.

Though this is a high-level article, we have hopefully shown how LLMs could move the needle in the detection of sophisticated financial crime, whilst also reducing manual effort for analysts and compliance teams. Our feeling is that this is relatively "cutting edge" work, but even a small uplift in detection rates could have an outsized impact for businesses in their fight against financial crime.

To learn more about this topic, please reach out to us for an informal discussion.


© 2024 Ensemble. All Rights Reserved.