Deep Learning For Detecting Financial Statement Fraud
PII: S0167-9236(20)30176-7
DOI: https://doi.org/10.1016/j.dss.2020.113421
Reference: DECSUP 113421
Please cite this article as: P. Craja, A. Kim and S. Lessmann, Deep learning for detecting
financial statement fraud, Decision Support Systems (2020), https://doi.org/10.1016/
j.dss.2020.113421
Abstract

Financial statement fraud is an area of significant consternation for potential investors, auditing companies, and state regulators. The paper proposes an approach for detecting statement fraud through the combination of information from financial ratios and managerial comments within corporate annual reports. We employ a hierarchical attention network (HAN) to extract text features from the Management Discussion and Analysis (MD&A) section of annual reports. The model is designed to offer two distinct features. First, it reflects the structured hierarchy of documents, which previous approaches were unable to capture. Second, the model embodies two different attention mechanisms at the word and sentence level, which allow content to be differentiated in terms of its importance in the process of constructing the document representation. As a result of its architecture, the model captures both the content and the context of managerial comments, which serve as supplementary predictors to financial ratios in the detection of fraudulent reporting. Additionally, the model provides interpretable indicators denoted as "red-flag" sentences, which assist stakeholders in determining whether further investigation of a specific annual report is required. Empirical results demonstrate that textual features of MD&A sections extracted by HAN yield promising classification results and substantially reinforce financial ratios.
1. Introduction
Fraud is a global problem that affects a variety of different businesses, with a severe negative impact on firms and relevant stakeholders. The financial implications of fraudulent activities occurring globally in the past two decades are estimated to amount to $5.127 trillion, with associated losses increasing by 56% in the past ten years [21]. The actual costs of fraud are potentially greater, particularly if one also considers indirect costs, including harm to the credibility of investors, creditors, and employees, and the reduction in business caused by the resultant scandal. Eventually, fraudulent activities may also lead to bankruptcy. All types of businesses and industries are affected by fraud. However, smaller organizations (fewer than 100 employees) as well as nonprofit organizations, which have weaker internal control systems and fewer resources to recover from fraud losses, can be more susceptible to fraud [2].

* Declaration of interest: none
Email addresses: [email protected] (Patricia Craja), [email protected] (Alisa Kim), [email protected] (Stefan Lessmann)
The Association of Certified Fraud Examiners (ACFE), the world's largest anti-fraud organization, recognizes three main classes of fraud: corruption, asset misappropriation, and fraudulent statements [66]. All three have specific properties, and successful fraud detection requires comprehensive knowledge of their particular characteristics. This study concentrates on financial statement fraud and adheres to the definition of fraud proposed by Nguyen [50], who stated that it is "the material omissions or misrepresentations resulting from an intentional failure to report financial information in accordance with generally accepted accounting principles". For this study, the terms "financial statement fraud", "fraudulent financial reporting", and "financial misstatements" are used interchangeably and are distinguished from different factors. According to the ACFE, a typical fraud scheme lasts 14 months before detection. Although the frequency at which asset misappropriation and corruption occur tends to be greater than for financial statement fraud, the impact of the latter crime is significantly more severe, accounting for a median loss of $954,000 and a median duration of 24 months [2].
The Center for Audit Quality indicated that managers commit financial statement fraud for a variety of reasons, such as personal benefit, the necessity to satisfy short-term financial goals, and the intention to hide bad news. Fraudulent financial statements can be manipulated so that they bear a convincing resemblance to non-fraudulent reports, and they can take various distinct forms [28]. Examples of frequently used methods are net income over- or understatements, falsified or understated revenues, hidden or overstated liabilities and expenses, inappropriate valuations of assets, and false disclosures [66]. Authorities reacted directly to the increased prevalence of corporate fraud by adopting new standards for accounting and auditing. Nevertheless, financial statement irregularities are frequently observed and complicate the detection of fraudulent instances.
Detecting financial statement abnormalities is regarded as the duty of the auditor [18]. Despite the existing guidelines, the detection of indicators of fraud can be challenging. A 2020 report revealed that only a limited number of fraud cases were identified by internal and external auditors, with rates of 15% and 4%, respectively [2]. Hence, there has been an increased focus on automated systems for the detection of financial statement fraud [70]. Such systems are of specific importance to all groups of stakeholders: to investors, to facilitate qualified decisions; to auditing companies, to speed up and improve the accuracy of the audit; and to state regulators, to concentrate their investigations more effectively [1, 3]. Therefore, efforts have been made to develop smart systems designed to detect financial statement fraud. Previous studies have examined various quantitative financial and linguistic factors as indicators of financial irregularities [5, 15]. In the context of annual reporting, quantitative financial information is supported by textual information, such as the MD&A section, which aims at providing investors with an insight into the management's opinions regarding the organisation's future prospects. The language used in the MD&A section could reveal managers' cognitive processes and indicate fraudulent behaviour. Even though studies have emphasised the increasing significance of textual analysis, limited attention has been paid to deep learning (DL) methods for detecting statement fraud as well as to the interpretability of predictions, which is a crucial aspect to support auditors.
We aim to bridge this gap and contribute to the development of decision support systems for fraud detection by offering a state-of-the-art DL model for screening submitted reports based on a combination of financial and textual data. The proposed method exhibits superior predictive performance and allows the identification of early warning indicators (red-flags) on both the word and sentence level to facilitate the audit process. Additionally, we showcase the results of comparative modeling on different data types associated with financial reports and offer alternative performance metrics that are centered around the cost imbalance of misclassification errors.
• RQ 3: Can the proposed DL model assist in interpreting textual features signaling fraud and provide "red-flag" indicators to support the decision-making of auditors?
To answer these research questions, we select an array of classification models for detecting fraud based on different combinations of data. We consider techniques well established in the fraud detection literature, including logistic regression (LR), support vector machines (SVM), and random forest (RF). Extreme Gradient Boosting (XGB) and artificial neural networks (ANN), which have not been tested for statement fraud detection, are also part of the study. The main focus of the paper is textual data processing. We introduce a novel DL method called the hierarchical attention network (HAN) to the community and demonstrate its distinctive features of providing accurate and interpretable fraud predictions. In line with previous research, the paper concentrates on the MD&A sections of annual reports filed by firms within the US with the Securities and Exchange Commission (SEC), which are referenced as annual reports on form 10-K. The SEC is the preeminent financial supervisory organisation that is responsible for monitoring the financial reports of firms listed on the US stock exchange.
All selected models are trained on five different combinations of data contained in the statements submitted for audit: financial indicators (FIN), linguistic features (LING) of an MD&A text, financial and linguistic features (FIN + LING), the full text of an MD&A (TXT), and full text combined with the financial indicators (FIN + TXT). We compare the predictive performance of the models with regard to their ability to distinguish fraud cases, for which we use traditional metrics like accuracy and the area under the Receiver Operating Characteristic curve (AUC), as well as metrics that reflect the imbalance in classification error, namely sensitivity, the F1-score, and the F2-score. The comparative study contributes to the empirical literature on fraud detection through i) expanding the set of considered classifiers, ii) offering previously unexplored data combinations, and iii) introducing new DL methods that provide accurate fraud forecasts and interpretative features.

Following RQ 3, we offer a novel fraud detection method that provides signaling tools for scholars and practitioners. We examine words considered "red-flags" by the RF feature importance method and the HAN attention layer output. Given that the use of words for signaling fraud may be subject to manipulation, we offer sentence-level importance indicators as a remedy and demonstrate how the latter can guide the audit process.
2. Related work

Previous studies proposed fraud detection systems and offered systematic literature reviews on fraud detection approaches [71, 56]. Table 1 depicts the status quo in the field of financial fraud detection along four dimensions: the technique utilized, the type of data, the country of study, and the predictive performance in terms of classification accuracy and other metrics. Prior research focused on financial variables and applied a range of modeling techniques, from LR to DL. Several authors experimented with linguistic variables. These variables were based on pre-determined lists of words associated with fraud or on readability measures such as the average length of words and sentences, lexical diversity, and sentence complexity [29]. Few studies applied natural language processing (NLP) techniques representing the whole textual content of 10-K reports and, to the best of our knowledge, no previous study considered DL models for text analysis in financial statements. Furthermore, most studies examined the relation between linguistic aspects and fraudulent actions in isolation. Only Hajek and Henriques [26] combined them with financial data and showed that although financial variables are essential for the detection of fraud, it is possible to enhance performance through the inclusion of linguistic data. However, their study was not targeted at evaluating the textual content of corporate annual reports, and it did not include sophisticated techniques for mining text such as bag-of-words (BOW) and DL. The majority of existing research measured performance in terms of accuracy. Some studies also considered precision and recall. Additionally, most previous studies neglected model interpretability, which is crucial to support auditors during client selection or audit planning. Hajek and Henriques [26] pointed out the importance of transparent fraud detection models and derived interpretable "green-flag" values (for which fraud is likely absent). However, because the detection of fraudulent firms requires more complex, non-interpretable models, no "red-flag" values (for which fraud is likely present) could be derived. We bridge this gap by suggesting the use of textual elements as "red-flags" for auditors. Given that the cost of failing to detect statement fraud is higher than that of incorrectly flagging a legitimate statement as fraudulent [26], focusing on early warning signs of fraud is necessary and can enhance the efficiency of auditing processes. In conclusion, this paper adds to the literature by offering an integrated approach for processing both textual and financial data using interpretable state-of-the-art DL methods. Furthermore, we provide a comprehensive evaluation of different modeling techniques using cost-sensitive metrics to account for the different severities of false alarms versus missed fraud cases.
| Study | Data (fraud/no fraud) | Country | Features | Classifiers (performance) | Metrics |
|---|---|---|---|---|---|
| Hajek and Henriques [26] | 311/311 | US | FIN + LING | BBN (90.3), DTNB (89.5), RF (87.5), Bag (87.1), C4.5 (86.1), LMT (85.4), SVM (78.0), MLP (77.9), AB (77.3), LR (74.5), NB (57.8) | Acc, TPR, TNR, MC, F-score |
| Kim et al. [34] | 788/2156 | US | FIN | LR (88.4), SVM (87.7), BBN (82.5) | Acc, TPR, G-mean, cost matrices |
| Goel and Uzuner [25] | 180/180 | US | LING + POS tags | SVM (81.8) | Acc, TPR, FPR, Precision, F-score |
| Gangolly [23] | | | | LR (0.0026), C4.5 (0.0028), bagging (0.0028), DNN (0.0030) | |
| Dechow et al. [15] | 293/79358 | US | FIN | LR (63.7) | Acc, TPR, FPR, FNR, min. F-score |
| Humpherys et al. [29] | 101/101 | US | LING | C4.5 (67.3), NB (67.3), SVM (65.8) | Acc, Precision, Recall, F-score |
| Glancy and Yadav [22] | 11/20 | US | TXT (BOW) | hierarchical clustering (83.9) | TP, TN, FP, FN, p-value |
| Cecchini et al. [12] | 61/61 | US | LING | SVM (82.0) | AUC, TPR, FPR, FNR |
| Goel et al. [24] | 126/622 | US | LING + TXT (BOW) | SVM (89.5), NB (55.28) | Acc, TPR, FPR, Precision, F-score |
| Lin et al. [42] | 127/447 | Taiwan | FIN | DNN (92.8), CART (90.3), LR (88.5) | Acc, FPR, FNR, MC |
| Ravisankar et al. [61] | 101/101 | China | FIN | PNN (98.1), GP (94.1), GMDH (93.0), DNN (78.8), SVM (73.4) | Acc, TPR, TNR, AUC |

Table 1: Analysis of classifier comparisons in financial statement fraud detection.
FIN – financial data, LING – linguistic data (word category frequency counts, readability and complexity scores, etc.), TXT – text data, BOW – bag-of-words, POS – part-of-speech tags (nouns, verbs, adjectives), BBN – Bayesian belief network, NB – naive Bayes, DTNB – NB with the induction of decision tables, CART – classification and regression tree, LMT – logistic model trees, MLP – multi-layer perceptron, Bag – bagging, AB – AdaBoostM1, GMDH – group method of data handling, GP – genetic programming, GLRT – generalized likelihood ratio test, LR – logistic regression, DNN – deep neural network, PNN – probabilistic neural network, RF – random forest, SVM – support vector machine, Acc – accuracy, AUC – area under the ROC curve, MC – misclassification cost, TPR – true positive rate, TNR – true negative rate, FPR – false positive rate, FNR – false negative rate.
3. Textual analysis of annual reports

Prior work has analyzed the language of various corporate disclosures, including announcements [14], media reports [67], and annual reports [44, 11]. Several studies have concentrated on the MD&A section to examine the language used in annual reports [19, 12, 29]. The MD&A is especially relevant as it offers investors the possibility of reviewing the performance of the company as well as its future potential from the perspective of management. This part also provides scope for the management's opinions on the primary threats to the business and necessary actions.

A factor that renders the textual information contained within annual reports conducive to the detection of fraud is that it is not subject to the same degree of regulation as financial information, thus giving the organisation's management more latitude when divulging textual data [29]. Given that the MD&A involves predictions, presumptions, and decisions, management might be tempted to manipulate the information in order to present the organisation in a more favourable light [51]. In addition to manipulating data, management could purposely exclude important information, leading to the same outcome. Examples of risk factors associated with financial statement fraud are poor financial performance, pressure on management to meet the requirements or expectations of third parties such as investors, or a need to obtain financing or to minimize reported earnings for tax-motivated reasons. Breiman et al. [9] conducted an analysis of infamous examples such as WorldCom and Enron and determined that senior managers participated in, encouraged, approved, and had knowledge of the fraudulent activities in most cases. Social psychology research suggests that the emotions and cognitive processes of managers who intend to conceal the real situation could manifest in specific linguistic cues that facilitate the identification of fraud [16]. Therefore, prior work has emphasised the increasing significance of textual analysis of financial documentation.
Studies that analyze the use of language within annual reports usually adopt one of two strategies [40]. The first strategy draws on research in linguistics and psychology and depends on pre-determined lists of words that have an association with a specific sentiment, like negativity, optimism, deceptiveness, or ambiguity. Loughran and McDonald [44] (L&M) demonstrated that if these lists are adapted to the financial domain, it is possible to determine relationships among financial-negative, financial-uncertain, and financial-litigious word lists and 10-K filing returns, trading volume, return volatility, fraud, material weakness, and unexpected earnings. As they were developed for analyzing 10-K text, the L&M sentiment word lists have been broadly employed in fraud-detection research [26]. Accordingly, the L&M word lists enter this study as a benchmark to DL approaches for extracting features from the MD&A section of 10-Ks. Other researchers based their approaches for detecting fraud on word lists that indicate positive, negative, or neutral emotions [29, 23] or, more specifically, anger, anxiety, and negativity according to the definitions supplied by the Linguistic Inquiry and Word Count dictionary [29, 38, 52].

The second strategy relies on ML to extract informative features for automatic differentiation between fraudulent and non-fraudulent texts. Li [40] contended that this method has various benefits compared with predetermined lists of words and cues, including the fact that no adaptation to the business context is required. ML algorithms have been used in the detection of financial statement fraud by several researchers, such as Cecchini et al. [12], Hajek and Henriques [26], Humpherys et al. [29], Goel and Uzuner [25], Goel et al. [24], Glancy and Yadav [22], and Purda and Skillicorn [58].
Some attempts to integrate different types of data have also been made. Purda and Skillicorn [58] compared a language-based method to detect fraud based on SVM to the financial measures proposed by Dechow et al. [15], and concluded that these approaches are complementary. The methods displayed low forecast correlation, and each identified specific types of fraud that the other could not detect. This finding motivates us to combine financial variables and linguistic variables to complement each other in the detection of statement fraud.

The study of Hajek and Henriques [26] is closest to this work, as they combined financial ratios with linguistic variables from annual reports of US firms and employed a variety of classification models, as shown in Table 1. Despite these similarities, the study by Hajek and Henriques [26] was not targeted at evaluating the textual content of corporate annual reports. Hence, it did not include modern NLP approaches such as deep learning-based feature extraction.

In prior work, the BOW approach was frequently adopted for the extraction of the linguistic properties of financial documentation. The BOW approach represents a document by a vector of the word counts that appear in it. Consequently, word frequencies are used as the input for the ML algorithms. This method does not consider the grammar, context, and structure of sentences and could be overly simple in terms of uncovering the real sense of the text [38]. A different technique for analyzing text is DL. Deep ANN are able to extract high-level features from unstructured data automatically. Textual analysis models based on DL can "learn" the specific patterns that underpin the text, "understand" its meaning, and subsequently output abstract aspects gleaned from the text. Hence, they resolve some of the problems associated with the BOW technique, including the extraction of contextual information from documents. Due to their capacity to deal with sequences of distinct lengths, ANN have shown excellent results in recent studies on text processing [76]. Despite their achievements in NLP, there has been limited focus on the application of state-of-the-art DL methods to the analysis of financial text, with the notable exception of [36]. For adoption in practice, DL models should not only be precise but also interpretable [37, 28]. However, the majority of fraud detection systems reported by researchers aim to maximise prediction accuracy while disregarding how transparent they are [26]. This factor has particular significance as the development of interpretable models is critical for supporting the investigation procedure in auditing.
4. Methodology
The objective of this study is to devise a fraud detection system that classifies annual reports. While financial and linguistic variables represent structured tabular data and require no extensive preprocessing, the unstructured text data has to be transformed into a numeric format that preserves its informative content and facilitates algorithmic processing. To achieve the latter, words are embedded as numeric vectors. The field of NLP has proposed various ways to construct such vectors. We consider two methods for text representation: frequency-based BOW embeddings and prediction-based neural embeddings (word2vec). An advantage of the BOW approach, which has been used in prior work on financial statement fraud (see Table 1), is its simplicity. However, BOW represents a set of words without grammar and disrupts word order. Unlike BOW, the application of DL is still relatively new to the area of regtech (the management of regulatory processes within the financial industry through technology). Therefore, the following subsections clarify neural word embeddings and address the DL components of the proposed HAN model.
The BOW model represents every word as a feature. The number of features determines the dimension of the document vector [46]. Since the number of unique words within a document typically represents only a small proportion of the overall number of unique words within the whole corpus, BOW document vectors are very sparse. A more advanced model for creating lower-dimensional, dense embeddings of words is word2vec. As opposed to BOW, word2vec embeddings enable words that have similar meanings to be given similar vector representations, capturing syntactic and semantic similarities. Word2vec [48] is an example of a NN model that is capable of learning word representations from a large corpus. Every word within the corpus is mapped to a vector of 50 to 300 dimensions. Mikolov et al. [48] demonstrated that such vectors offer advanced capabilities to measure the semantic and syntactic similarities between words. The generated word embeddings are a suitable input for text mining algorithms based on DL, as will be seen in the next part. They constitute the first layer of the model and allow further processing of the text input within the DL architecture.
The initial word2vec algorithm was followed by GloVe [53], FastText [8], and GPT-2 [59], as well as by publicly available sets of pre-trained embeddings that are acquired by applying the above-mentioned algorithms to large text corpora. Pre-trained word embeddings accelerate the training of DL models and have been used successfully in numerous NLP tasks. We apply several types of pre-trained embeddings for the HAN model and for a neural network with a bidirectional Gated Recurrent Unit (GRU) layer that serves as a benchmark from the field of DL. As a result of a performance-based selection, the HAN model is built with 300-dimensional word2vec embeddings, trained on the Google News corpus, with a vocabulary size of 3 million words. The DL benchmark is used with the GPT-2 pre-trained embeddings from WebText, offered by Radford et al. [59], as they arguably constitute the current state-of-the-art language model. The DL benchmark model is thus referred to as GPT-2 and is used together with the attention mechanism, discussed further below.
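To make the embedding step concrete, the following minimal sketch loads pre-trained Google News vectors with the gensim library and maps tokens to 300-dimensional vectors. The file name corresponds to the standard distribution of these vectors, but the local path and the helper function embed_tokens are illustrative rather than part of the original pipeline; mapping out-of-vocabulary words to zero vectors is our assumption.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional vectors from the Google News corpus
# (~3 million word vocabulary); the local file path is assumed.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def embed_tokens(tokens, dim=300):
    """Map a tokenized sentence to a (len(tokens), dim) matrix.

    Unknown words become zero vectors; this is one of several
    plausible conventions.
    """
    return np.stack([w2v[t] if t in w2v else np.zeros(dim) for t in tokens])

print(embed_tokens(["revenue", "liabilities", "restatement"]).shape)  # (3, 300)
```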
The resulting text representations serve as input for predictive modeling. Conventional methods for classifying text involve representations of sparse lexical features, like TF-IDF, and subsequently utilize a linear model or kernel techniques. More recent approaches employ Convolutional Neural Networks (CNN) [32] and Recurrent Neural Networks (RNN) [27] for learning textual representations [73]. The RNN architecture allows retaining the input sequence, which made it widely used for natural language understanding, language generation, and video processing [47, 31]. An LSTM is a special type of RNN, composed of various gates that determine whether information is kept, forgotten, or updated, enabling long-term dependencies to be learned by the model [27]. An LSTM retains or modifies previous information on a selective basis and stores important information in a separate cell, which acts as a memory [69]. Consequently, important information is not overwritten by new inputs and can persist for extended periods.
More advanced DL approaches also address hierarchical patterns of language, such as the hierarchy between words, sentences, and documents. Some methods have covered the hierarchical construction of documents [74, 60]. The specific contexts of words and sentences, whereby the meaning of a word or sentence could change depending on the document, is a comparatively new concept for the process of text classification, and the HAN was developed to address this issue [72]. When computing the document encoding, HAN first detects the words that have importance within a sentence and subsequently the sentences that have importance within a document, while considering the context (see Figure 1). The model recognizes the fact that an occurrence of a word may be significant when found in a particular sentence, whereas another occurrence of that word may not be important in another sentence (context).

Figure 1: HAN Architecture. Image based on Yang et al. [72]
The HAN builds a document representation via the initial construction of sentence vectors based on words, followed by the aggregation of these sentence vectors into a document representation through the application of the attention mechanism. The model consists of an encoder that generates relevant contexts and an attention mechanism that calculates importance weights. The same algorithms are consecutively implemented at the word level and then at the sentence level.
Word Level. The input is transformed into tokens $w_{it}$ that denote word $t$ in sentence $i$, with $t \in [1, T]$. Tokens are passed through a pre-trained embedding matrix $W_e$ that allocates a multidimensional vector $x_{it} = W_e w_{it}$ to every token. As a result, words are denoted in numerical vector form. A bidirectional LSTM then encodes each sentence, and its forward and backward hidden states are concatenated at every time step $t$ to acquire the internal representation of the bi-directional LSTM, $h_{it}$.

Word Attention. The annotations $h_{it}$ constitute the input for the attention mechanism, which learns enhanced annotations denoted by $u_{it}$. The tanh function adjusts the input values so that they fall in the range of -1 to 1 and maps zero to near-zero. The newly generated annotations are then multiplied with a trainable context vector $u_w$ and subsequently normalized to an importance weight per word, $\alpha_{it}$, via a softmax function. As part of the training procedure, the word context vector $u_w$ is initialized randomly and concurrently learned. The sentence vector $s_i$ is then defined as the sum of the context annotations weighted by these importance weights:

$u_{it} = \tanh(W_w h_{it} + b_w)$ (1)

$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)}$ (2)

$s_i = \sum_t \alpha_{it} h_{it}$ (3)
Sentence Level and Sentence Encoder. Subsequently, the entire network is run at the sentence level using the same fundamental process used for the word level. An embedding layer is not required, as the sentence vectors $s_i$ acquired from the word level serve as input. Summarization of sentence contexts is performed using a bi-directional LSTM, which analyzes the document in the forward and backward directions and concatenates the resulting hidden states:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(s_i)$ (4)

$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(s_i)$ (5)

$h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$ (6)
Sentence Attention. For rewarding sentences that are indicators of the correct document classification, the attention mechanism is applied once again along with a sentence-level context vector $u_s$, which is utilized to measure sentence importance. Both trainable weights and biases are initialized randomly and concurrently learned during the training procedure, thus yielding:

$u_i = \tanh(W_s h_i + b_s)$ (7)

$\alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)}$ (8)

$d = \sum_i \alpha_i h_i$ (9)

where $d$ denotes the document vector summarising all the information contained within each of the document's sentences. Finally, the document vector $d$ is a high-level representation of the overall document and can be utilized as a feature vector for document classification to generate the output vector $\hat{y}$:

$\hat{y} = \mathrm{softmax}(W_c d + b_c)$ (10)

where $\hat{y}$ denotes a $K$-dimensional vector whose components $y_k$ model the probability that document $d$ is a member of class $k \in \{1, \dots, K\}$.
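For illustration, the following PyTorch sketch implements the word-level attention of Equations (1)-(3); the sentence-level attention of Equations (7)-(9) is structurally identical. The 300-dimensional annotation size follows the paper, while the class and variable names are our own.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, hidden_dim=300):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # W_w, b_w
        self.context = nn.Parameter(torch.randn(hidden_dim))   # u_w, learned

    def forward(self, h):                 # h: (batch, words, hidden) = h_it
        u = torch.tanh(self.proj(h))      # Eq. (1): u_it = tanh(W_w h_it + b_w)
        scores = u @ self.context         # u_it^T u_w -> (batch, words)
        alpha = torch.softmax(scores, dim=1)            # Eq. (2)
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)        # Eq. (3): sentence vector
        return s, alpha                   # alpha yields the word importances

h = torch.randn(2, 12, 300)               # 2 sentences, 12 words each
s, alpha = WordAttention()(h)
print(s.shape, alpha.shape)               # (2, 300) and (2, 12)
```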
The application of the HAN follows the implementation of Kränkel and Lee [35]. Training of the DL model is performed on the training data set using both textual and quantitative features. Hence, the textual data acquired in the previous section is concatenated with the financial ratios. The model is employed to predict fraud probabilities of annual statements in the corresponding validation and test partitions, which were constructed with random sampling with stratification. Figure 2 shows the architecture of the HAN-based fraud detection model and the dimensions of its outputs.

The LSTM layer consists of 150 neurons, with a HAN dense dimension of 200 and a last dense layer dimension of 6. The combination of forward and backward LSTMs thus gives 300 dimensions for the word and sentence annotations. The last layer of the HAN applies dropout regularization to prevent over-fitting. In a final step, the resulting document representation of dimension 200 is concatenated with the 47 financial ratios and passed to a dense layer before running through a softmax function that outputs the fraud probabilities. For training, a batch size of 32 and 17 epochs were used after hyperparameter tuning on the validation set.
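A minimal sketch of this final fusion step is shown below, assuming the HAN document vector and the ratio vector have already been computed; the dropout rate and all names are illustrative assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

class FraudHead(nn.Module):
    def __init__(self, doc_dim=200, n_ratios=47, n_classes=2, p_drop=0.5):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)   # dropout rate is an assumption
        self.classifier = nn.Linear(doc_dim + n_ratios, n_classes)

    def forward(self, doc_vec, ratios):
        # Concatenate the regularised 200-d document vector with the
        # 47 financial ratios and map to class probabilities via softmax.
        x = torch.cat([self.dropout(doc_vec), ratios], dim=1)
        return torch.softmax(self.classifier(x), dim=1)

probs = FraudHead()(torch.randn(32, 200), torch.randn(32, 47))
print(probs.shape)  # (32, 2); batch size 32 as used in training
```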
Model performance is assessed in terms of the AUC, sensitivity, specificity, F1-score, F2-score, and accuracy.

The F-score is a combination of precision (correct classifications of fraud cases as a percentage of all instances classified as fraudulent) and sensitivity (the share of fraudulent instances that the classifier detects). It measures how precisely and how robustly the models classify fraudulent cases:

$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{sensitivity}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{sensitivity}}$ (11)
Accuracy is the metric most frequently reported in financial statement fraud detection. Nevertheless, the majority of models have exhibited higher performance in detecting truthful transactions than fraudulent ones [71, 61]. This imbalance is problematic because the FN rate (type II error) and the FP rate (type I error) entail different misclassification costs (MC). Hajek and Henriques [26] estimated the cost of failing to detect fraudulent statements (type II error) to be twice as high as the cost of incorrectly flagging truthful statements as fraudulent (type I error). Hence, effective models should concentrate on high sensitivity and correctly classify as many positive samples as possible, rather than maximizing the number of correct classifications. To that end, this study employs the F2-score in addition to the F1-score (the harmonic mean of precision and sensitivity), as it weights sensitivity higher than precision and is, therefore, more suitable for fraud detection. The AUC captures the ability of a model to rank fraud and non-fraud cases in the right order; the higher the AUC, the better the model can distinguish between fraud and non-fraud cases. Being robust toward imbalanced class distributions, the AUC is preferred to accuracy in fraud detection [68, 58] and is also used in this study.
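As a brief illustration with made-up labels, both scores are available in scikit-learn; fbeta_score with beta=2 reproduces Equation (11) with sensitivity weighted twice as heavily as precision.

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = fraud (toy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))             # harmonic mean of precision/recall
print(fbeta_score(y_true, y_pred, beta=2))  # recall-weighted F2-score
```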
5. Data

Fraud detection is a challenging task because of the low number of known fraud cases. A severe imbalance between the positive and the negative class impedes classification. For example, the proportion of fraudulent to non-fraudulent statements in the annual reports submitted to the SEC for the period from 1999 to 2019 was 1:250. In past research, the number of fraudulent firms contained in the data varied between 12 and 788 [68, 34]. The data used here consists of 208 fraudulent and 7 341 non-fraud cases, making it the largest data set with a textual component so far (c.f., Table 1). The data set consists of US companies' annual financial reports (10-K filings), which are publicly available through the EDGAR database on the SEC's website (www.sec.gov/edgar.shtml), and quantitative financial data sourced from the Compustat database (www.compustat.com).
5.1. Labeling

Companies submit yearly reports that undergo an audit. Labeling these reports requires several filtering decisions: when can a report be considered fraudulent, and what type of fraud should we consider? To address the first question, we follow [58, 29, 26] and consider a report as "fraudulent" if the company that filed it was convicted. The SEC publishes statements called "Accounting and Auditing Enforcement Releases" (AAER) that describe financial-reporting-related enforcement actions taken against companies that violated the reporting rules (https://www.sec.gov/divisions/enforce/friactions.shtml). The SEC concentrates on cases of high importance and applies enforcement actions where the evidence of manipulation is sufficiently robust [26, 33], which lends a high degree of trust to this source. Labeling reports based on the AAER offers simplicity and consistency with easy replication, and it avoids possible bias from a subjective categorization. Following [58], we select the AAERs concerning litigations issued during the period from 1999 to 2019 with identified manipulation instances between the years 1995 and 2016 that mention the words "fraud", "fraudulent", "anti-fraud", and "defraud" as well as "annual reports" or "10-K". Addressing the second question, we follow [12, 24, 29, 58, 26] and focus on binary fraud classification. This implies that we do not distinguish between truthful and unintentionally misstated annual reports. The resulting data set contains 187 869 annual reports filed between 1993 and 2019, with 774 firm-years subject to enforcement actions. However, due to missing entries and mismatches in the existing CIK indexation, the final data set is reduced to 7 757 firm-year observations with 208 fraud and 7 549 non-fraud filings. Subsequently, we perform the extraction of text and financial data.
5.2. Text data

The MD&A section constitutes the primary source of raw text data in this study. In addition, nine linguistic features are utilized as predictors (described in the online appendix). The selection of these features is informed by past studies that demonstrated several patterns of fraudulent agents, such as an increased likelihood of using words that indicate negativity [68, 49], an absence of process ownership implying a lack of assurance and resulting in statements containing less certainty [38], or an average of three times more positive sentiment and four times more negative sentiment in comparison to honest reports, as found by Goel and Uzuner [25]. Additionally, the general tone (sentiment) and the proportion of constraining words were included by Hajek and Henriques [26], Loughran and McDonald [44], and Bodnaruk et al. [7]. Lastly, the average sentence length, the proportion of compound words, and the fog index are incorporated as measures of complexity and legibility, because prior work suggests that reports produced by misstating firms had reduced readability [29, 39].
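To illustrate one of these readability features: the Gunning fog index combines the average sentence length with the share of complex (three-or-more-syllable) words. The sketch below uses the textstat package, which is our choice of implementation rather than necessarily the authors'; the sample text is invented.

```python
import textstat

# Invented stand-in for an MD&A passage.
mdna = ("We believe our liquidity position remains adequate. "
        "Nevertheless, unanticipated regulatory developments could "
        "materially affect our consolidated results of operations.")

# Gunning fog index: 0.4 * (words per sentence + 100 * complex words / words),
# interpreted as the years of schooling needed to understand the text.
print(textstat.gunning_fog(mdna))

# Average sentence length, computed crudely via the period count.
words = len(mdna.split())
sentences = mdna.count(".")
print(words / sentences)
```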
5.3. Financial data

Following the guidelines of existing research, the financial ratios and balance sheet variables presented in the online appendix are extracted from Compustat, based on formulas presented by Dechow et al. [15] and Beneish [6]. Financial variables include indicators like total assets (adopted as a proxy for company size [68, 4]), profitability ratios [26], and accounts receivable and inventories as non-cash working capital drivers [1, 12, 55]. Additionally, a reduced ratio of sales, general, and administrative expenses (SGA) to revenues (SGAI) is found to signal fraud [1]. Missing values are imputed using the RF algorithm; however, observations with more than 50% of the variables missing are excluded.
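The paper does not detail the RF-based imputation; one common realisation, shown below as an assumption on our part, is scikit-learn's IterativeImputer with a random-forest regressor as the estimator, applied after dropping rows with more than 50% missing values.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy matrix of financial ratios with missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [7.0, 8.0, 12.0]])

# Each feature with missing values is modeled as a function of the others,
# using an RF regressor; rows with >50% missing would be dropped beforehand.
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100),
                           max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```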
5.4. Imbalance treatment

The majority of previous research has balanced the fraud and non-fraud cases in a data set using undersampling [26, 29, 61]. We follow this approach and consider a fraud-to-non-fraud ratio of 1:4, which reflects the fact that the majority of firms have no involvement in fraudulent behaviour. Both year and sector are utilized for balancing, in order to take into account different economic conditions and changes in regulation, as well as to eradicate any differences across distinct sectors [29, 34]. The sector is extracted from the SIC code [65] and is of particular importance for text mining, as the utilization of words within financial documentation could differ according to the sector. The resulting balanced data set consists of 1 163 reports, out of which 201 are fraudulent and 962 are non-fraudulent annual reports. More financial misstatements can be observed in the years 2002 to 2004 than in other years. This could be attributed to the tightened regulations after the major fraud scandals in 2001 and the resulting implementation of the Sarbanes-Oxley Act (SOX) in 2002. Also, fewer misstatements are noted in recent years, since the average period between the end of the fraud and the publication of an AAER is three years [57].
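A minimal sketch of such a year- and sector-matched 1:4 undersampling is given below; the data frame and its column names (year, sector, fraud) are hypothetical.

```python
import pandas as pd

def balance(df, ratio=4, seed=0):
    """Keep all fraud filings and sample up to `ratio` times as many
    non-fraud filings from the same (year, sector) cell."""
    parts = []
    for _, cell in df.groupby(["year", "sector"]):
        fraud = cell[cell["fraud"] == 1]
        if fraud.empty:
            continue
        nonfraud = cell[cell["fraud"] == 0]
        n = min(len(nonfraud), ratio * len(fraud))
        parts.append(pd.concat([fraud, nonfraud.sample(n, random_state=seed)]))
    return pd.concat(parts, ignore_index=True)

# balanced = balance(filings)  # filings: hypothetical 10-K data frame
```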
6. Classification results

We answer RQ 1 and RQ 2 by means of an empirical analysis comparing a set of classification models in terms of their fraud detection performance. The models generate fraud classifications based on financial indicators, linguistic features of reports, the reports' text, and combinations of these groups of features. Table 2 reports the corresponding results on the out-of-sample test set. The baseline accuracy of classifying all cases of the test set as non-fraudulent (majority class) is 82.81%.
Financial data (FIN)

| Model | AUC | Sensitivity | Specificity | F1-score | F2-score | Accuracy |
|---|---|---|---|---|---|---|
| ANN | 0.7564 | 0.7833 | 0.6574 | 0.4563 | 0.6835 | 0.6790 |

Linguistic data (LING), with deltas relative to FIN

| Model | AUC | Sensitivity | Specificity | F1-score | F2-score | Accuracy | Delta AUC | Delta F1 |
|---|---|---|---|---|---|---|---|---|
| LR | 0.6719 | 0.7000 | 0.6193 | 0.3962 | 0.6398 | 0.8280 | -0.0901 | -0.0805 |
| RF | 0.7713 | 0.7500 | 0.7197 | 0.4839 | 0.7302 | 0.8424 | -0.0896 | -0.0669 |

Financial and linguistic data (FIN + LING), with deltas relative to FIN

| Model | AUC | Sensitivity | Specificity | F1-score | F2-score | Accuracy | Delta AUC | Delta F1 |
|---|---|---|---|---|---|---|---|---|
| LR | 0.8598 | 0.7833 | 0.7854 | 0.5562 | 0.7890 | 0.8424 | 0.0916 | -0.0795 |
| RF | 0.8797 | 0.6660 | 0.9550 | 0.7079 | 0.9043 | 0.8739 | 0.0191 | -0.1571 |
| SVM | 0.8902 | 0.7833 | 0.8961 | 0.6861 | 0.8784 | 0.8280 | 0.0929 | -0.2576 |

Table 2: Comparative performance of selected binary classifiers on the different types of test data.
RF performed especially well in the case of high-dimensional financial fraud data. Hajek and Henriques [26] also reported an accuracy of 88.1% on FIN data and concluded that the ensemble of tree-based algorithms is superior to SVM, LR, and ANN due to the relatively low dimensionality achieved during feature selection. The predictive performance aligns with the results of Kim et al. [34], who found the LR and SVM models to be the most accurate. Lin et al. [42] and Ravisankar et al. [61] showed that DNN models (ANN with more than one hidden layer) outperform LR and SVM, offering an accuracy higher by around 4.5%. The SVM is a widely recognized model and has been applied both for fraud detection [54] and in other fields [17]. However, the results show that inherent configuration complexities make SVM a secondary choice for practitioners. ANN show less impressive predictive performance but proved to be the strongest in terms of sensitivity. For model evaluation, however, balanced indicators like the F1- and F2-scores provide a better perspective. These metrics suggest that XGB outperforms the other models. XGB represents an advancement in the field of ML, and its high performance is noteworthy since it was not considered in prior work on fraud detection. Given the much higher cost of missing actual fraud cases compared to false alarms, we argue that the F2-score is the most suitable threshold-based indicator of model performance. Therefore, we emphasize the F2-score together with the AUC, which allows the tuning of the threshold.
The modeling on linguistic data (LING) was the first step towards including text in fraud detection. The earlier experiments by Cecchini et al. [12], Humpherys et al. [29], and Goel et al. [24] employed SVM and achieved accuracies of 82%, 65.8%, and 89.5%, respectively; the latter additionally included the BOW method that we discuss further below. Our modeling falls in line with the previous work and exhibits SVM as the second strongest predictor, yielding an AUC of 74% and an accuracy of 82%. RF remained the most reliable predictor with the highest AUC, accuracy, and F2-score. Modeling done solely on LING allows us to assess the degree to which both sources of data contribute to accurate classification. In line with Hajek and Henriques [26], all models exhibit higher performance on FIN data than on LING data alone, leading to the conclusion that financial covariates have more predictive power than linguistic variables. However, the performance differences are not substantial and suggest a strong relationship between linguistic features and fraudulent behavior, which agrees with previous studies.

Following the ideas of Hajek and Henriques [26], we combine FIN and LING data to evaluate whether the classifier can make use of both data sources. Our results differ in terms of the leading models, with RF and XGB offering the highest AUCs of 86%. XGB shows an improvement: it performs well on FIN data, falls back a little in the LING setup, but makes better use of the combined input. Once again we observe the superior performance of XGB in terms of F2-score with 76.87%, followed closely by 76.10% for RF and 74.48% for SVM, which once again advocates for the usefulness of advanced ML methods for practical tasks. Interestingly, for the remaining classifiers, the accuracy dropped a little in comparison to FIN, but the AUCs improved (with a minor exception for RF). This serves as an indication that LING and FIN data combined may provide conflicting signals to the classifier. Nonetheless, the data mix is a definite improvement, as it provides a stronger signal to the classifier and enhances predictive performance.
We now offer advanced NLP methods, previously unexplored for fraud detection, and compare them to the performance of more traditional models. Goel et al. [24], Glancy and Yadav [22], and Purda and Skillicorn [58] applied the BOW model to perform modeling on text data, while Goel and Uzuner [25] made use of part-of-speech tagging. They utilized SVM and hierarchical clustering as classifiers and achieved accuracies of 89.5%, 83.4%, 89%, and 81.8%, respectively.

Table 2 offers an overview of the modeling results, starting with purely textual input (TXT) and continuing with text enhanced by financial data (FIN + TXT). Two new DL methods are included in TXT modeling, namely HAN and GPT-2. While traditional benchmarks take TF-IDF transformations of the word input, the DL models make use of pre-trained embeddings. We can observe that modeling on TXT improves all models in comparison to LING, with the largest AUC delta of 0.2 in the case of ANN. This increase can be attributed to the richer input of the actual MD&A content. While the more basic feature input of the LING models solely incorporated linguistic information (e.g., frequency counts of word categories based on the L&M word lists, readability and complexity ratios, etc.), the textual input (TXT) is based on advanced text mining techniques and vector space representations containing the information of the whole MD&A content (TF-IDF based embedding), as well as the grammar, context, and structure of sentences (DL based embeddings).
based embeddings). ANN demonstrates the highest accuracy, 89%, and the best F1- and F2-scores,
which constitutes a strong signal that the neural network architecture is a favorable candidate for
the task, regardless of the BOW input. Given the complexity of text processing, ANN proves its
capacity to pick up on complex relationships between the target and explanatory variables. The
improvement is also visible for the F2-score of 89.93% that closely follows that of RF (89.98%). It
is interesting to compare the BOW-based ANN with GPT-2 and HAN, all of which represent a NN
architecture. GPT-2 performs better on TXT than any other model on LING. Though it fails to
show superior accuracy, its sensitivity is one of the highest, leading to the conclusion that with
some threshold adjustment, it could provide better predictive performance than other models like
of
LR or tree-based models. This example underlines the potential gains of implementing the new DL
ro
methods that allow superior insights into unstructured data. Unlike BOW-based benchmarks,
embeddings-based HAN and GPT-2 retained the structure and context of the input. HAN showed
-p
superior results in terms of AUC 91.08% but fell short in terms of accuracy. However, its
sensitivity is exceeding those of all other benchmarks except SVM, making it a promising model
re
for fraud detection. The appealing performance of HAN can be explained by its intrinsic capacity
lP
to extract significant contextual similarities within documents and that pertinent cues, which allow
truthful text to be distinguished from deceitful ones, are dependent on the context rather than the
na
content [75]. All in all, the results suggest that textual data can offer much more insight than LING
across all classifiers.
We conclude the analysis by examining the feature combination FIN + TXT, which is at the core of our study. The input setup is done in two ways: a combination of word vectors and financial indicators into one data set, and a two-step modeling approach. The latter comprises building a TXT model and using its probability prediction as an input to another DL model that combines it with FIN and outputs the final binary prediction. The first approach is applied in the case of the benchmark models, including ANN; the second one is implemented for HAN and GPT-2. Based on a comparison of models using only TXT or FIN data, Purda and Skillicorn [58] concluded that these data sources are complementary, with each source identifying specific types of fraud that the other cannot detect. In our case, all benchmarks exhibit improved performance in comparison to the FIN + LING setup, especially LR and SVM. However, an equally unanimous decrease in the F1-score is observed. We observe the superiority of the predictive power of full-textual input over the linguistic metrics. If we examine the additional value of FIN for performance, we see only a minor increase in almost all metrics, once again underlining the complexity and potential misalignment of FIN and TXT data. However, it is essential to note that, unlike the F1-score, the F2-score increases across the ML benchmarks, which supports our initial argument for preferring the F2-score as the key metric for model evaluation in practical use. We conclude that with the increased complexity of input, one should opt for advanced ML techniques to extract the extra insight.
The best performance is again yielded by HAN with an AUC of 92.64%, followed by XGB and ANN with AUCs of 89%. HAN also offers the highest sensitivity of 90% across all datasets and models, making it the recommended solution for statement fraud detection. Returning to the triad comparison between ANN, HAN, and GPT-2, we can see that the latter does not show much improvement with added FIN data. This signals a potentially poor choice of pre-trained embeddings, highlighting the importance of this decision in the design of a DL classifier and reminding us that state-of-the-art solutions do not guarantee superior results. ANN does not catch up with HAN AUC-wise. However, it showcases a higher F2-score of 90.55%, surpassed only by XGB, which proved to be a promising alternative to the DL methods. The results of modeling with HAN showed its capacity to incorporate and extract additional value from the diversified input, which contributes to the existing field of research and opens new opportunities for further research. Moreover, the attention properties allow us to offer a look into the "black box" of the DL models and provide the rationale behind the classification decision. This interpretability capacity might be particularly important for practitioners, given the need to substantiate the audit judgment, and will be further explored in the next section.
7. Red-flag indicators

The length of MD&A sections increased after SOX became effective; nevertheless, Li [41] concluded that no changes were made to the information contained within MD&A sections or the style of language adopted. Taking the fraud detection efforts further, we developed a method to facilitate the audit of the MD&A section. We employ state-of-the-art textual analysis to shed light on managers' cognitive processes, which could be revealed by the language used in the MD&A section. Zhou et al. [75] demonstrated that it is plausible to detect lies based on textual cues. Nonetheless, the pertinent cues that allow truthful texts to be distinguished from deceitful ones are dependent on the context. One way to support auditors would be "red-flag" indication in the body of the MD&A section. Hajek and Henriques [26] explored the use of "green-flag" and "red-flag" values of financial indicators and concluded that the identification of non-fraudulent firms is less complex and can be accompanied by interpretable "green-flag" values; however, because the detection of fraudulent firms requires more complex, non-interpretable ML models, no "red-flag" values could be derived. We take this further and suggest the use of textual elements as "red-flags" for auditors. This can be done on the word level or the sentence level and is, to the best of our knowledge, new to the field. The HAN model allows a holistic analysis of the text structure and the underlying semantics. In contrast to BOW, which ignores the specific contextual meanings of words, the HAN model considers the grammar, structure, and context of words within a sentence and of sentences within a document, which is essential for the identification of fraudulent behaviour. The attention mechanisms of the HAN retain the logical dependencies of the content and enable the identification of the words and sentences that contribute the most to attributing fraudulent behaviour, by extracting the word and sentence attention weights defined in Equations 2 and 8. These valuable insights into the internal document structure, together with strong predictive performance, make HAN notably advantageous in comparison to the BOW-based traditional benchmarks.

Based on the assumption that fraudulent actors are capable of manipulating their writings so that they have convincing similarities to those that are non-fraudulent, concentrating only on words, which focuses on the content of the text while disregarding the context, could be overly simplistic for differentiating truthful from misleading statements. We assume that, due to their inherently higher complexity, sentence-level indicators are less prone to manipulation and provide robust insight for auditing.
7.1. Word-level

We provide a comparative analysis of words considered to be "red-flags" by the more traditional RF model and those offered by HAN. The RF model proved to be a potent and consistent classifier throughout the comparative analysis. We apply the lime methodology of Ribeiro et al. [63] to gain insight into the role of different words in the model's classification decision. lime stands for Local Interpretable Model-Agnostic Explanations and is based on explaining the model's functioning in the locality of the chosen observation. It explains every input separately; an example of its application to one of the fraud texts can be found in Figure 3.

Figure 3: Words with top weights indicating fraud from a sample MD&A
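A minimal sketch of this analysis is shown below; the TF-IDF + RF pipeline mirrors the benchmark setup, while the toy training texts and all names are invented placeholders for the fitted MD&A classifier.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the MD&A training corpus (1 = fraud).
train_texts = ["revenue was restated due to accounting errors",
               "operations and cash flows developed as expected"]
train_labels = [1, 0]

pipeline = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
pipeline.fit(train_texts, train_labels)

explainer = LimeTextExplainer(class_names=["non-fraud", "fraud"])
exp = explainer.explain_instance("certain revenue items were restated",
                                 pipeline.predict_proba, num_features=10)

# Words with positive weights push the prediction towards "fraud".
red_flags = [word for word, weight in exp.as_list() if weight > 0]
print(red_flags)
```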
We supply all fraud cases to the lime package and extract the ten words that have the strongest effect on the model in terms of fraud indication. We then aggregate these words to obtain a "red-flag" vocabulary. Additionally, we perform the same analysis with the DNN model and extract the weights assigned by the HAN attention layer. The results are summarized in Figure 4.

Figure 4: "Red-flag" words identified by RF and HAN; the bottom section contains the words identified by both models
Fifteen words are found to be important for an indication of fraudulent activity by both algorithms, including "government", "certain", and "gross", potentially indicating adverse involvement of state institutions. It would seem that RF derives judgment from industry-specific terms: "aerospace", medical terms, "pilotless", "armourgroup". HAN picks up on financial and legal terms like "cost", "acquisition", and "property". Both classifiers also include time- and calendar-related words like the names of months. It is not obvious how much the context affects this selection. Additionally, the derivation of a word-based rule might lead reporting entities to adapt quickly in order to circumvent audits. These ambiguous interpretations and manipulation risks motivate the creation of the sentence-level decision support system.
7.2. Sentence-level

The added contextual information extracted by the HAN yields improved performance on the test set in comparison to linguistic features and other DNN models. This can be partially explained by the hierarchical structure of language, which entails unequal roles of words in the overall structure. Following RQ 3, we want to benefit from the structural and contextual insight retained in the sentence-level analysis, provided uniquely by the HAN model.

We extract the sentence-level attention weights for 200 fraudulent reports obtained from HAN predictions and filter the top ten most important sentences per report. The mean weight of a sentence that can be considered a "red-flag" is 0.05, with a maximum of 0.61. We devise a rule dictating that sentences with weights higher than 0.067 (top 25% quantile) are referred to as "extra important", sentences between 0.04 and 0.067 (top half) are "important", and those between 0.022 and 0.04 are "noteworthy". These three groups of sentences receive respective coloring and are highlighted in the considered MD&A, as depicted in Figure 5.
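The highlighting rule can be summarised in a few lines; the thresholds are those stated above, while the sample sentences are invented.

```python
def importance(weight):
    """Bucket a sentence attention weight (Eq. 8) into a highlight class."""
    if weight > 0.067:        # top 25% quantile of red-flag weights
        return "extra important"
    if weight > 0.04:         # top half
        return "important"
    if weight > 0.022:
        return "noteworthy"
    return None               # left unhighlighted

weights = {"Revenue recognition policies were revised.": 0.61,
           "We operate in competitive markets.": 0.03}
for sentence, w in weights.items():
    print(importance(w), "->", sentence)
```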
We propose to use the probability prediction of the HAN model together with the sentence
weights as a two-step decision support system for auditors. Given its strong predictive
performance, HAN can provide an initial signal about the risk of fraud. Based on a selected
sensitivity threshold, auditors may choose to evaluate a potentially fraudulent report with extra
caution and use the highlighted sentences as additional guidance. Given the length of an
average MD&A and the limited concentration capacity associated with a manual audit,
this kind of visual guidance can improve the accuracy of fraud detection.
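The two-step procedure can be sketched as follows; `predict_fraud_proba` and `sentence_attention` stand in for the trained HAN's interfaces (illustrative assumptions, not the paper's code), and `label_sentences` is the bucketing helper sketched above.

```python
# Sketch of the proposed two-step decision support. `predict_fraud_proba` and
# `sentence_attention` are assumed interfaces to the trained HAN;
# `label_sentences` is the bucketing helper sketched above.
def triage_report(mdna_text, predict_fraud_proba, sentence_attention,
                  threshold=0.5):
    """Step 1: screen by fraud probability. Step 2: collect highlighted sentences."""
    proba = predict_fraud_proba(mdna_text)
    if proba < threshold:
        return {"flagged": False, "probability": proba, "highlights": []}
    highlights = [item for item in label_sentences(sentence_attention(mdna_text))
                  if item[2] is not None]
    return {"flagged": True, "probability": proba, "highlights": highlights}
```

Lowering `threshold` increases sensitivity at the cost of more reports to review, mirroring the trade-off between error costs discussed below.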
Figure 5: A page from MD&A (on the left) and its extract with "red-flag" phrases for the attention
of the auditor (on the right). Sentences that contributed the most to the decision towards "fraud" are
labeled by HAN as extra important and important. Additional examples are provided in the Online
Appendix.
8. Discussion
As reported in the literature review, only Hajek and Henriques [26] and Throckmorton et al.
[68] have tackled the task of jointly mining financial and linguistic data for financial statement
fraud prediction, and no study was found that combines financial data with deep-learning-based
text representations. Given the managerial efforts to conceal bad news by using particular
wording [29] and by generating less understandable reports [39, 45], it is pivotal to adopt more
advanced text processing techniques.
SVM showed good performance across most experimental setups. Due to its ability to deal
with high-dimensional, sparse feature spaces, SVM achieved the best performance in previous
studies that incorporated the BOW approach [23, 58]. In this study, RF showed the best predictive
performance, managing to extract knowledge from both financial and BOW-based textual sources.
DL models also proved capable of distinguishing fraudulent cases. However, only the HAN
architecture showcased an exceptional capacity to extract signals from the FIN + TXT setting, which
is at the center of the current research. HAN detects a high number of fraudulent cases
compared to the remaining models, strengthening the statement by Zhou et al. [75] that the detection
of deception based on text necessitates contextual information.
The results of the AUC measures indicate that the linguistic variables extracted with HAN
and TF-IDF add significant value to fraud detection models in combination with financial ratios.
Performance shifts heterogeneously across data types and models, showing that
different models pick up on different signals, and a combination of these models might be more
effective. We also addressed the practical applicability of the classification models, given the
imbalance of error costs. Superior predictive capacity should be considered in combination with
the model's sensitivity in order to account for the implications of failing to detect a fraudulent case.
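As an illustration of how such a FIN + TXT combination can be scored, the following sketch concatenates financial ratios with text-derived features and computes the AUC; the arrays `fin_train`, `txt_train`, `y_train` (and their test counterparts) are illustrative placeholders, not the paper's data pipeline.

```python
# Sketch: scoring the FIN + TXT setting by concatenating financial ratios with
# text-derived features and computing the AUC. The feature arrays and labels
# below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X_train = np.hstack([fin_train, txt_train])   # financial ratios + text features
X_test = np.hstack([fin_test, txt_test])

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"FIN + TXT AUC: {auc:.3f}")
```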
We have explored the interpretation capacities of the RF and HAN models at the word and
sentence levels. Both models agreed on a specific "red-flag" vocabulary; however, they mostly
picked up on different terms. Moreover, out of context, these words might be misleading. The
indication of "red-flag" words is becoming increasingly unreliable as the alleged offending
parties adapt. The proposed sentence-level markup offers a more robust approach to providing
decision support for auditors.
Auditors must devote effort and time to the risk assessment of financial misstatements, which
is tedious and complex. The utilisation of interpretable state-of-the-art technology is essential to
facilitate the detection of fraud by auditors and will significantly enhance the effectiveness and
efficiency of audit work. Moreover, the presence of enhanced anti-fraud controls will deter
individuals from committing fraudulent acts and therefore reduce fraud risks.
9. Conclusion
The detection of financial fraud is a challenging endeavor. The continually adapting and
complex nature of fraudulent activities necessitates the application of the latest technologies to
confront fraud. The paper examined the potential of a state-of-the-art DL model to contribute to the
development of advanced financial fraud detection methods. Minimal research has been conducted
on methods that combine the analysis of financial and linguistic information, and no
studies were discovered on the application of DL-based text representations to detect financial
statement fraud. In addition to quantitative data, we investigated the potential of the accompanying
text data in annual reports and have emphasized the increasing significance of textual analysis for
the detection of signals of fraud within financial documentation. The proposed HAN method
concentrates on the content as well as the context of textual information. Unlike the BOW method,
which disregards word order and additional grammatical information, DL is capable of capturing
semantic associations and discerning the meanings of different word and phrase combinations.
The results have shown that the DL model achieved a considerable improvement in AUC
compared to the benchmark models. The findings indicate that the DL model is well suited to
identifying fraudulent cases correctly, whereas most ML models fail to detect fraudulent cases
while performing better at correctly identifying truthful statements. The detection of fraudulent
firms is of great importance due to the significantly higher MC associated with fraud. Thus,
specifically in the highly unbalanced case of fraud detection, it is advisable to use multiple models
designed to capture different aspects. Based on these findings, we conclude that the textual
information of the MD&A section extracted through HAN has the potential to enhance the
predictive accuracy of financial statement fraud models, particularly in the generation of warning
signals for fraudulent behavior that can support the decision-making process of
stakeholders. The distorted word order handicaps the ability of the BOW-based ML benchmarks to
offer a concise indication of "red-flags". We offered a decision support solution for
auditors that provides a sentence-level indication of the text fragments that trigger the classifier to
treat a submitted case as fraudulent. The user can select the degree of impact of the indicated
sentences and improve the timing and accuracy of the audit process.
Acknowledgement
Funding: This work was supported by Deutsche Forschungsgemeinschaft in the scope of
the International Research Training Group (IRTG) 1792.
References
[1] A. Abbasi, C. Albrecht, A. Vance, and J. Hansen. “Metafraud: A meta-learning framework
for detecting financial fraud”. In: MIS Quarterly: Management Information Systems 36.4
(2012), pp. 1293–1327.
[2] ACFE. Report to the Nations: 2020 Global Study on Occupational Fraud and Abuse. Tech. rep. 2020. URL: https://acfepublic.s3-us-west-2.amazonaws.com/2020-Report-to-the-Nations.pdf.
[3] W. S. Albrecht, C. Albrecht, and C. C. Albrecht. "Current trends in fraud and its detection". In: Information Security Journal 17.1 (2008), pp. 2–12.
[4] B. Bai, J. Yen, and X. Yang. "False financial statements: Characteristics of China's listed companies and CART detecting approach". In: International Journal of Information Technology and Decision Making 7.2 (2008), pp. 339–359.
[6] M. D. Beneish. “The Detection of Earnings Manipulation”. In: Financial Analysts Journal
55.5 (1999), pp. 24–36.
[7] A. Bodnaruk, T. Loughran, and B. McDonald. “Using 10-K text to gauge financial
constraints”. In: Journal of Financial and Quantitative Analysis 50.4 (2015), pp. 623–646.
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. “Enriching Word Vectors with
Subword Information”. In: arXiv preprint arXiv:1607.04606 (2016).
[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Vol. 19. 1984, p. 368.
[10] L. Breiman. “Random forests”. In: Machine Learning 45.1 (2001), pp. 5–32.
[11] S. V. Brown and J. W. Tucker. "Large-Sample Evidence on Firms' Year-over-Year MD&A Modifications". In: Journal of Accounting Research 49.2 (2011), pp. 309–346.
[12] M. Cecchini, H. Aytug, G. J. Koehler, and P. Pathak. “Making words work: Using financial
text as a predictor of financial events”. In: Decision Support Systems 50.1 (2010), pp.
164–175.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. “Empirical evaluation of gated recurrent
neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[14] A. K. Davis, J. M. Piger, and L. M. Sedor. “Beyond the Numbers: Measuring the
Information Content of Earnings Press Release Language”. In: Contemporary Accounting
Research 29.3 (2012), pp. 845–868.
[15] P. M. Dechow, W. Ge, C. R. Larson, and R. G. Sloan. “Predicting Material Accounting
Misstatements”. In: Contemporary Accounting Research 28.1 (2011), pp. 17–82.
[16] B. M. DePaulo, R. Rosenthal, J. Rosenkrantz, and C. Rieder Green. "Actual and Perceived Cues to Deception: A Closer Look at Speech". In: Basic and Applied Social Psychology 3.4 (1982), pp. 291–312.
[17] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. "Inductive learning algorithms and representations for text categorization". In: Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM '98) (1998), pp. 148–155.
[18] Journal of Finance 65.6 (2010), pp. 2213–2253.
[19] R. Feldman, S. Govindaraj, J. Livnat, and B. Segal. "Management's tone change, post earnings announcement drift and accruals". In: Review of Accounting Studies 15.4 (2010), pp. 915–953.
[20] C. Gaganis. “Classification techniques for the identification of falsified financial
statements: a comparative analysis”. In: Intelligent Systems in Accounting, Finance &
Management 16.3 (2009), pp. 207–229.
[21] J. Gee and M. Button. The Financial Cost of Fraud 2019. Tech. rep. Crowe, 2019.
[22] F. H. Glancy and S. B. Yadav. “A computational model for financial reporting fraud
detection”. In: Decision Support Systems 50.3 (2011), pp. 595–601.
[23] S. Goel and J. Gangolly. “Beyond the numbers: Mining the annual reports for hidden cues
indicative of financial statement fraud”. In: Intelligent Systems in Accounting, Finance and
Management 19.2 (2012), pp. 75–89.
[24] S. Goel, J. Gangolly, S. R. Faerman, and O. Uzuner. “Can linguistic predictors detect
fraudulent financial filings?” In: Journal of Emerging Technologies in Accounting 7.1
(2010), pp. 25–46.
[25] S. Goel and O. Uzuner. “Do Sentiments Matter in Fraud Detection? Estimating Semantic
Orientation of Annual Reports”. In: Intelligent Systems in Accounting, Finance and
Management 23.3 (2016), pp. 215–239.
[26] P. Hajek and R. Henriques. “Mining corporate annual reports for intelligent detection of
financial statement fraud – A comparative study of machine learning methods”. In:
Knowledge- Based Systems 128 (2017), pp. 139–152.
[27] S. Hochreiter and J. Schmidhuber. "Long Short-Term Memory". In: Neural Computation 9.8 (1997), pp. 1735–1780.
[28] S. Y. Huang, R. H. Tsaih, and F. Yu. "Topological pattern discovery and feature extraction for fraudulent financial reporting". In: Expert Systems with Applications 41.9 (2014), pp. 4360–4372.
[29] S. L. Humpherys, K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix. "Identification of fraudulent financial statements using linguistic credibility analysis". In: Decision Support Systems 50.3 (2011), pp. 585–594.
[30] "categorization". In: Proceedings of the 14th International Conference on Machine Learning (ICML '97) (1997), pp. 143–151.
[31] N. Kalchbrenner and P. Blunsom. "Recurrent continuous translation models". In: EMNLP 2013 - Conference on Empirical Methods in Natural Language Processing (2013), pp. 1700–1709.
"using multi-class cost-sensitive learning". In: Expert Systems with Applications 62 (2016), pp. 32–43.
[35] M. Kränkel and H.-E. L. Lee. "Text Classification with Hierarchical Attention Networks". In: (2019). URL: https://humboldt-wi.github.io/blog/research/information_systems_1819/group5_han/.
[36] M. Kraus and S. Feuerriegel. “Decision support from financial disclosures with deep
neural networks and transfer learning”. In: Decision Support Systems 104 (2017), pp.
38–48.
[37] M. Kraus, S. Feuerriegel, and A. Oztekin. "Deep learning in business analytics and operations research: Models, applications and managerial implications". In: European Journal of Operational Research 281.3 (2020), pp. 628–641.
[38] D. F. Larcker and A. A. Zakolyukina. "Detecting Deceptive Discussions in Conference Calls". In: Journal of Accounting Research 50.2 (2012), pp. 495–540.
[39] F. Li. "Annual report readability, current earnings, and earnings persistence". In: Journal of Accounting and Economics 45.2–3 (2008), pp. 221–247.
[41] F. Li. "The information content of forward-looking statements in corporate filings - A naïve Bayesian machine learning approach". In: Journal of Accounting Research 48.5 (2010), pp. 1049–1102.
[42] C. C. Lin, A. A. Chiu, S. Y. Huang, and D. C. Yen. "Detecting the financial statement fraud: The analysis of the differences between data mining techniques and experts' judgments". In: Knowledge-Based Systems 89 (2015), pp. 459–470.
[43] C. Liu, Y. Chan, S. H. Alam Kazmi, and H. Fu. “Financial Fraud Detection Model: Based
on Random Forest”. In: International Journal of Economics and Finance 7.7 (2015).
[44] T. Loughran and B. McDonald. "When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks". In: Journal of Finance 66.1 (2011), pp. 35–65.
[45] T. Loughran and B. McDonald. "Measuring readability in financial disclosures". In: Journal of Finance 69.4 (2014), pp. 1643–1671.
[50] K. Nguyen. "Financial statement fraud: Motives, Methods, Cases and Detection". In: The Secured Lender 51.2 (1995), p. 36.
[51] H. Öğüt, R. Aktaş, A. Alp, and M. M. Doğanay. "Prediction of financial information manipulation by using support vector machine and probabilistic neural network". In: Expert Systems with Applications 36.3 Part 1 (2009), pp. 5419–5423.
[52] J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer. "Psychological Aspects of Natural Language Use: Our Words, Our Selves". In: Annual Review of Psychology 54.1 (2003), pp. 547–577.
[53] J. Pennington, R. Socher, and C. D. Manning. "GloVe: Global vectors for word representation". In: EMNLP 2014 - Conference on Empirical Methods in Natural Language Processing (2014), pp. 1532–1543.
[54] J. Perols. “Financial statement fraud detection: An analysis of statistical and machine
learning algorithms”. In: Auditing 30.2 (2011), pp. 19–50.
[55] O. S. Persons. “Using Financial Statement Data To Identify Factors Associated With
Fraudulent Financial Reporting”. In: Journal of Applied Business Research (JABR) 11.3
(2011), p. 38.
[56] T. Pourhabibi, K.-L. Ong, B. H. Kam, and Y. L. Boo. "Fraud detection: A systematic literature review of graph-based anomaly detection approaches". In: Decision Support Systems 133 (2020), p. 113303. URL: https://www.sciencedirect.com/science/article/pii/S0167923620300580?via%3Dihub.
[57] L. D. Purda and D. Skillicorn. “Reading between the Lines: Detecting Fraud from the
Language of Financial Reports”. In: SSRN Electronic Journal (2012).
[58] L. Purda and D. Skillicorn. “Accounting Variables, Deception, and a Bag of Words:
Assessing the Tools of Fraud Detection”. In: Contemporary Accounting Research 32.3
(2015), pp. 1193–1223.
[59] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. “Language models are
unsupervised multitask learners”. In: OpenAI Blog 1.8 (2019), p. 9.
[60] G. Rao, W. Huang, Z. Feng, and Q. Cong. “LSTM with sentence representations for
document-level sentiment classification”. In: Neurocomputing 308 (2018), pp. 49–57.
[61] P. Ravisankar, V. Ravi, G. Raghava Rao, and I. Bose. "Detection of financial statement fraud and feature selection using data mining techniques". In: Decision Support Systems 50.2 (2011), pp. 491–500.
[62] Z. Rezaee. "Causes, consequences, and deterrence of financial statement fraud". In: Critical Perspectives on Accounting 16.3 (2005), pp. 277–298.
[63] M. T. Ribeiro, S. Singh, and C. Guestrin. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13–17, 2016 (2016), pp. 1135–1144.
437–485.
https://www.sec.gov/info/edgar/siccodes.htm.
[66] T. W. Singleton and A. J. Singleton. Fraud Auditing and Forensic Accounting, Fourth
Edition. 2011.
[67] P. C. Tetlock. “Giving content to investor sentiment: The role of media in the stock
market”. In: Journal of Finance 62.3 (2007), pp. 1139–1168.
[68] C. S. Throckmorton, W. J. Mayew, M. Venkatachalam, and L. M. Collins. “Financial fraud
detection using vocal, linguistic and financial cues”. In: Decision Support Systems 74
(2015), pp. 78–87.
[69] A. J.-P. Tixier. “Notes on Deep Learning for NLP”. In: (2018). URL:
https://arxiv.org/abs/1808.09772.
[70] US Securities and Exchange Commission. Agency Financial Report. Tech. rep. 2019. URL: https://www.sec.gov/files/sec-2019-agency-financial-report.pdf#mission.
[71] J. West and M. Bhattacharya. "Intelligent financial fraud detection: A comprehensive review". In: Computers & Security 57 (2016), pp. 47–66.
[72] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. "Hierarchical Attention Networks for Document Classification". In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016), pp. 1480–1489.
[73] W. Yin, K. Kann, M. Yu, and H. Schütze. "Comparative Study of CNN and RNN for Natural Language Processing". In: (2017). URL: http://arxiv.org/abs/1702.01923.
[74] C. Zhou, C. Sun, Z. Liu, and F. C. M. Lau. "A C-LSTM Neural Network for Text Classification". In: (2015). URL: http://arxiv.org/abs/1511.08630.
[75] L. Zhou, J. K. Burgoon, J. F. Nunamaker, and D. Twitchell. "Automating Linguistics-Based Cues for Detecting Deception in Text-Based Asynchronous Computer-Mediated Communications". In: Group Decision and Negotiation 13.1 (2004), pp. 81–106.
[76] E. Zinovyeva, W. K. Härdle, and S. Lessmann. “Antisocial online behavior detection using
deep learning”. In: Decision Support Systems online first, doi:10.1016/j.dss.2020.113362
(2020).
Biographical Note
Patricia Craja
Patricia Craja received her B.Sc. degree in Mathematics and M.Sc. degree in Statistics, from the
Technical University of Berlin and the Humboldt University of Berlin, Germany, in 2013 and 2020,
respectively. Currently, she is working as a Data Science Freelancer. Her current research
interests are in the fields of natural language processing and decision making.
Alisa Kim
Alisa Kim is a researcher at Humboldt University of Berlin, focusing on the application of deep learning and
natural language processing in the financial and regulatory sectors. She received her Master's degree in
Management from the University of St Andrews and worked in investment banking and consulting
before joining the Business Informatics Chair of HU as a full-time researcher and educator.
Stefan Lessmann
Stefan Lessmann received a diploma in business administration and a PhD from the University of
Hamburg in 2002 and 2007, respectively. Stefan worked as a lecturer and senior lecturer in
business informatics at the Institute of Information Systems of the University of Hamburg. Since
2008, Stefan has been a guest lecturer at the School of Management of the University of Southampton,
where he teaches under- and postgraduate courses on quantitative methods, electronic business,
and web application development. Stefan completed his habilitation in the area of predictive
analytics in 2012. In 2014, Stefan joined the Humboldt University of Berlin, where he heads the
Chair of Information Systems at the School of Business and Economics. Stefan has published several
papers in leading international journals and conferences, including the European Journal of
Operational Research, the IEEE Transactions on Software Engineering, and the International
Conference on Information Systems. He actively participates in knowledge transfer and
consulting projects with industry partners, from small start-up companies to global players.
Stefan Lessmann: Conceptualization, Validation, Supervision
Highlights
● Combining financial and text data enhances the detection of fraudulent financial statements
● HAN, GPT-2, ANN and XGB detect financial misstatements based on textual cues
● Novel NLP techniques capture the content and context of MD&As
● Interpretability offered with “red-flag” sentences in the MD&As of annual reports
● The proposed models provide decision support for stakeholders