Regulators and auditors already use AI for large-scale data analysis, but because most information relevant to investigations is unstructured, current tools and processes fall short. AI expert Johannes “Jan” Scholtes says that, just as compliance teams and regulators adopted eDiscovery to deal with large data sets, they can leverage the principles of AI and machine learning to detect irregularities and prevent fraud.
These days, almost all regulators use some form of AI for large-scale data analysis in case of irregularities. To auditors, the field of data mining is better known than that of text mining. A good example of data mining is the analysis of financial transactions. A wealth of algorithms and analytical methods are available to find patterns of interest or fraudulent behavior in such data sets.
However, 90% of all information is unstructured: text documents, emails, social media or multimedia files. This information cannot be analyzed with database or data-mining techniques, because data-mining tools work only on structured data organized in columns and rows, as in databases.
In addition, fraudsters are becoming more knowledgeable about how audit and compliance algorithms work, so they tend to make sure that the transactional aspects of their actions do not appear as anomalies to such algorithms. The details of what is really going on can often be found only in auxiliary information, such as email, text messages, WhatsApp, legal agreements, voicemails or discussions in a forum.
Today, auditors, compliance officers and fraud investigators face an overwhelming amount of digital information that can be reviewed. In most cases, they do not know beforehand what exactly they are looking for, nor where to find it.
In addition, individuals or groups may use different forms of deception to hide their behavior and intentions, ranging from complex digital formats and rare languages to code words. Effectively, this means fraud investigators are looking for a needle in the haystack without knowing what the needle looks like or where the haystack is.
Technology is essential to meet the high strategic ambitions auditors and fraud investigators have for such large data collections. The main problem in using such technology is finding the balance between catching what is suspicious and flagging too many false positives, which would create too much work for auditors or victimize innocent individuals.
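The balance described above is commonly measured with precision (how many flagged items are truly suspicious) and recall (how many truly suspicious items are flagged). The following is a minimal sketch, not a description of any regulator's tooling: the scores, labels and thresholds are invented for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when every score >= threshold is flagged."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_suspicious = sum(labels)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_suspicious if total_suspicious else 1.0
    return precision, recall


# Hypothetical model scores for eight documents; label 1 = truly suspicious.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

for threshold in (0.25, 0.65, 0.85):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, a low threshold catches every suspicious document but flags several innocent ones, while a high threshold flags only true cases but misses half of them.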
This is why many regulatory agencies rely extensively on highly accurate methods from the world of AI, such as text mining and natural language processing, to understand the meaning of text and to identify the who, where, when, why, what, how and how much in the large data sets requested or confiscated for compliance investigations.
Helping AI to help compliance
Historically, AI has been concerned mostly with teaching a computer system to understand different forms of human perception: speech, vision and language.
Not all data we have to deal with in compliance audits is text-searchable. This is where AI can help us: metadata extraction (normal and forensic), machine translation, optical character recognition (OCR), audio transcription, and image and video tagging have reached highly reliable levels of quality due to recent developments in deep learning. Therefore, text can be used as a good common denominator describing the content of all electronic data, regardless of the format.
The study of text mining is concerned with the development of various mathematical, statistical, linguistic and deep-learning techniques that allow automatic analysis of unstructured information, extract high-quality and relevant data from it, and make the complete text more searchable. (High quality refers here to the combination of relevance and the acquiring of new and interesting insights.)
A textual document contains characters that together form words, which can be combined to form phrases. These are all syntactic properties that together represent defined categories, concepts, senses or meanings. Text mining algorithms can recognize, extract and use all this information.
Using text mining, instead of searching for words, we search for syntactic, semantic and higher-level linguistic word patterns. With text-mining algorithms, we aim to find someone or something that doesn’t want to be found.
The ability to model the context of text is vital to avoid finding too many false positives in audits and fraud investigations. Algorithms that enable us to properly understand such context have greatly advanced in recent years, thanks to the success of deep-learning algorithms on highly context-sensitive NLP tasks, such as machine translation, human-machine dialogues, named entity recognition, sentiment detection, emotion detection or even complex linguistic tasks like co-reference and pronoun resolution.
The above-mentioned progress originates from the development of the so-called transformer architecture. Transformer models are large pre-trained neural networks built on self-attention rather than recurrence. They already embed significant linguistic knowledge and can be fine-tuned on specific tasks with a relatively small amount of additional training.
A fundamental benefit of the transformer architecture is the ability to perform transfer learning. Traditionally, deep learning models require a large amount of task-specific training data to achieve a desirable performance.
However, for most tasks, we do not have the amount of labeled training data required to train these networks. By pre-training with large sets of natural text, the model learns a significant amount of task-invariant information on how language is constructed. With all this information already contained in these models, we can focus our training process on learning the patterns that are specific for the task at hand. We will still require more data points than needed in most statistical models, but not as many as the billions required if we were to start the training of the deep-learning models from scratch.
Transformers can model a wide scope of linguistic context, depending on previous words and on future words. They are, so to say, more context sensitive than models that can take only past context into consideration. In addition, this context is included in the embedding vectors, which allows for a richer representation and more complex linguistic tasks.
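To illustrate the difference between using both past and future context and using past context only, here is a minimal sketch of attention weights in plain Python. It is not a full transformer: there are no learned projections or multiple heads, and the similarity scores, the example sentence and the `attention_weights` helper are assumptions made up for illustration.

```python
import math


def attention_weights(scores, causal=False):
    """Row-wise softmax over raw scores; causal=True hides future positions."""
    weights = []
    for i, row in enumerate(scores):
        visible = [s if (not causal or j <= i) else float("-inf")
                   for j, s in enumerate(row)]
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = ["the", "bank", "of", "the", "river"]
# Hypothetical similarity scores (row = query token, column = key token):
# all zeros, except that "bank" relates strongly to the later word "river".
scores = [[0.0] * len(tokens) for _ in tokens]
scores[1][4] = 2.0

both = attention_weights(scores)
past_only = attention_weights(scores, causal=True)
print("bidirectional:", [round(w, 2) for w in both[1]])
print("past only:    ", [round(w, 2) for w in past_only[1]])
```

With both directions visible, "bank" places most of its weight on "river", the word that disambiguates it. With past context only, that word is invisible and the weight is spread over "the" and "bank" alone.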
Uncovering code words and entities in fraud investigations
Fraud investigators have another common problem: At the beginning of the investigation, they do not know exactly what to search for. Because encrypting such communication would raise a red flag for an auditor, it is often conducted in plain text, using code words. Investigators do not know these code names, nor do they know exactly which companies, persons, account numbers or amounts they must search for. Using text mining, it is possible to identify all these types of entities or properties from their linguistic role, and then to classify them in a structured manner to present them to the auditor.
For example, one can look for patterns such as: “who paid whom,” “who talked to whom” or “who traveled where” by searching for linguistic matches. The actual sentences and words matching such patterns can then be extracted from the text of the auxiliary documentation and presented to the investigator. By using frequency analysis or simple anomaly detection methods, one can then quickly separate legitimate transactions from the suspicious ones or identify code words.
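The “who paid whom” idea can be sketched in a few lines. This is a deliberately naive illustration under stated assumptions: the messages are invented, a regular expression stands in for the linguistic pattern matching a real text-mining system would perform, and the two flagging rules (a payment reason seen only once, or an amount above three times the median) are arbitrary choices.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical messages standing in for auxiliary documentation.
messages = [
    "Acme Ltd paid Borealis GmbH 12,000 EUR for consulting.",
    "Acme Ltd paid Borealis GmbH 11,500 EUR for consulting.",
    "Acme Ltd paid Borealis GmbH 12,200 EUR for consulting.",
    "Acme Ltd paid Mr. Lee 95,000 EUR for the birthday cake.",
]

# Naive "who paid whom, how much, for what" pattern.
PAID = re.compile(
    r"(?P<payer>[A-Z][\w. ]*?) paid (?P<payee>[A-Z][\w. ]*?) "
    r"(?P<amount>[\d,]+) (?P<currency>[A-Z]{3}) for (?P<reason>[^.]+)\."
)

# Turn matching sentences into structured records.
records = []
for text in messages:
    match = PAID.search(text)
    if match:
        record = match.groupdict()
        record["amount"] = int(record["amount"].replace(",", ""))
        records.append(record)

# Frequency analysis on the stated reason, plus a simple amount check.
reason_counts = Counter(r["reason"] for r in records)
typical = median(r["amount"] for r in records)

suspicious = [
    r for r in records
    if reason_counts[r["reason"]] == 1 or r["amount"] > 3 * typical
]

for r in suspicious:
    print(r["payer"], "->", r["payee"], r["amount"], r["currency"], "|", r["reason"])
```

Here the rare phrase “the birthday cake” attached to an unusually large payment is surfaced for review, while the routine consulting payments are not.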
Techniques and insights for early data assessment
Depending on the type of audit, there are different dimensions that may be interesting for an early data assessment: custodians, data volumes, location, time series, events, modus operandi, motivations, etc. As described in a 2010 paper, traditional investigation methods can provide guidance for the relevant dimensions of such assessments: who, where, when, why, what, how and how much are the basic elements for analysis.
Who, where and when can be determined by named entity recognition (NER) methods. Why is harder, but law enforcement investigations show that data locations with high emotion and sentiment values also provide a good indication of the motivation or insights into the modus operandi. What can be understood by using methods like topic modeling. A good overview of all these techniques can be found in our contribution in Big Data Law with Roland Vogl of Stanford Law School.
eDiscovery technology taught us how to deal with real-world big data. Text mining taught us how to find specific patterns of interest in textual data. The combination of eDiscovery and text mining will teach us how to find even more complex (temporal) relations in big data for audits and ultimately train our algorithms to provide better decision support and assist auditors in detecting anomalies and moments of incidents in our ever-growing electronic data sets.
This is a rapidly evolving field, where new methods to understand the structure, meaning and complexity of natural language are introduced at an ever-accelerating speed. These developments will result in essential tools that help auditors and internal investigators keep up with ever-growing electronic data sets and get to the essence of a case as quickly and efficiently as possible.