0% found this document useful (0 votes)

270 views46 pages

Deep Learning For Detecting Financial Statement Fraud

This document describes a study that uses a hierarchical attention network (HAN) deep learning model to detect financial statement fraud. The model extracts textual features from the Management Discussion and Analysis (MD&A) sections of annual reports. It reflects the hierarchical structure of documents and uses attention mechanisms at the word and sentence levels. The model captures both content and context of managerial comments, which serve as supplementary predictors to financial ratios in fraud detection. Empirical results show that textual features extracted by HAN improve classification results and reinforce financial ratios in determining fraudulent reporting.

Uploaded by

GauravPrakash

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

270 views46 pages

Deep Learning For Detecting Financial Statement Fraud

Uploaded by

GauravPrakash

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Journal Pre-proof

Deep learning for detecting financial statement fraud

Patricia Craja, Alisa Kim, Stefan Lessmann

PII: S0167-9236(20)30176-7
DOI: https://doi.org/10.1016/j.dss.2020.113421
Reference: DECSUP 113421

To appear in: Decision Support Systems

Received date: 5 May 2020

Revised date: 3 September 2020
Accepted date: 30 September 2020

Please cite this article as: P. Craja, A. Kim and S. Lessmann, Deep learning for detecting
financial statement fraud, Decision Support Systems (2020), https://doi.org/10.1016/
j.dss.2020.113421

This is a PDF file of an article that has undergone enhancements after acceptance, such
as the addition of a cover page and metadata, and formatting for readability, but it is
not yet the definitive version of record. This version will undergo additional copyediting,
typesetting and review before it is published in its final form, but we are providing this
version to give early visibility of the article. Please note that, during the production
process, errors may be discovered which could affect the content, and all legal disclaimers
that apply to the journal pertain.

© 2020 Published by Elsevier.

Journal Pre-proof

Deep Learning for detecting financial statement fraud

Patricia Crajaa,, Alisa Kima, Stefan Lessmanna
a
School of Business and Economics, Humboldt University of Berlin, Berlin, Germany

Abstract
Financial statement fraud is an area of significant consternation for potential investors,
auditing companies, and state regulators. The paper proposes an approach for detecting statement
fraud through the combination of information from financial ratios and managerial comments
within corporate annual reports. We employ a hierarchical attention network (HAN) to extract text

of
features from the Management Discussion and Analysis (MD&A) section of annual reports. The

ro
model is designed to offer two distinct features. First, it reflects the structured hierarchy of
documents, which previous approaches were unable to capture. Second, the model embodies two
-p
different attention mechanisms at the word and sentence level, which allows content to be
re
differentiated in terms of its importance in the process of constructing the document
representation. As a result of its architecture, the model captures both content and context of
lP

managerial comments, which serve as supplementary predictors to financial ratios in the detection
of fraudulent reporting. Additionally, the model provides interpretable indicators denoted as
na

“red-flag” sentences, which assist stakeholders in their process of determining whether further
investigation of a specific annual report is required. Empirical results demonstrate that textual
ur

features of MD&A sections extracted by HAN yield promising classification results and
substantially reinforce financial ratios.
Jo

Keywords: fraud detection, financial statements, deep learning, text analytics

1. Introduction
Fraud is a global problem that affects a variety of different businesses with a severe
negative impact on firms and relevant stakeholders. The financial implications of fraudulent

*
Declaration of interest: none
Email addresses: [email protected] (Patricia Craja), [email protected] (Alisa
Kim), [email protected] (Stefan Lessmann)
Journal Pre-proof

activities occurring globally in the past two decades are estimated to amount up to $5.127 trillion,
with associated losses increasing by 56% in the past ten years [21]. The actual costs of fraud are
potentially greater, particularly if one also considers the indirect costs including harm the to
credibility of investors, creditors and, employees and the reduction in business caused by the
resultant scandal. Eventually, fraudulent activities may also lead to bankruptcy. All types of
businesses and industries are affected by fraud. However, smaller organizations (less than 100
employees) as well as nonprofit organizations that have weaker internal control systems and fewer
resources to recover from fraud losses can be more susceptible to fraud [2].
The Association of Certified Fraud Examiners (ACFE), the world‟s largest anti-fraud

of
organization, recognizes three main classes of fraud: corruption, asset misappropriation, and

ro
fraudulent statements [66]. All three have specific properties and successful fraud detection
requires comprehensive knowledge of their particular characteristics. This study concentrates on

-p
financial statement fraud and adheres to the definition of fraud proposed by Nguyen [50], who
stated that it is “the material omissions or misrepresentations resulting from an intentional failure
re
to report financial information in accordance with generally accepted accounting principles“. For
lP

this study, the terminology “financial statement fraud“, “fraudulent financial reporting“, and
“financial misstatements“ are used interchangeably and are distinguished from different factors
na

that cause misrepresentations within financial statements, such as unintended mistakes.

According to the "2020 Report to the Nations" published by the ACFE, the median cost per
instance of occupational fraud around the world amounted to $125,000 and the median period was
ur

14 months. Although the frequency at which asset misappropriation and corruption occur tends to
Jo

be greater than for financial statement fraud, the impact of the latter crime is significantly more
severe, accounting for a median loss of $954,000 and a median duration of 24 months [2].
The Center for Audit Quality indicted that managers commit financial statement fraud for a
variety of reasons, such as personal benefit, the necessity to satisfy short-term financial goals, and
the intention to hide bad news. Fraudulent financial statements can be manipulated so that they
bear a convincing resemblance to non-fraudulent reports, and they can emerge in various distinct
types [28]. Examples of frequently used methods are net income over- or understatements,
falsified or understated revenues, hidden or overstated liabilities and expenses, inappropriate
valuations of assets, and false disclosures [66]. Authorities directly reacted to the increased
prevalence of corporate fraud by adopting new standards for accounting and auditing.
Journal Pre-proof

Nevertheless, financial statement irregularities are frequently observed and complicate the
detection of fraudulent instances.
Detecting financial statement abnormalities is regarded as the duty of the auditor [18].
Despite the existing guidelines, the detection of indicators of fraud can be challenging. A 2020
report revealed that only a limited number of cases of fraud were identified by internal and external
auditors, with rates of 15% and 4%, respectively [2]. Hence, there has been an increased focus on
automated systems for the detection of financial statement fraud [70]. Such systems have specific
importance for all groups of stakeholders: for investors - to facilitate qualified decisions, for
auditing companies - to speed up and improve the accuracy of the audit, and for state regulators - to

of
concentrate their investigations more effectively [1, 3]. Therefore, efforts have been made to

ro
develop smart systems designed to detect financial statement fraud. Previous studies have
examined various quantitative financial and linguistic factors as indicators of financial

-p
irregularities [5, 15]. In the context of annual reporting, quantitative financial information is
supported by textual information, such as the MD&A section, which aims at providing investors
re
with an insight into the management‟s opinions regarding the organisation‟s future prospects. The
lP

language used in the MD&A section could reveal managers‟ cognitive processes and indicate
fraudulent behaviour. Even though studies have emphasised the increasing significance of textual
na

analysis of financial documentation, no research has focused on the application of state-of-the-art

deep learning (DL) models for textual feature extraction. Furthermore, minimal research focused
on the combination of financial and linguistic information for the intelligent prediction of financial
ur

statement fraud as well as on the interpretability of predictions, which is a crucial aspect to support
Jo

auditors.
We aim to bridge this gap and contribute to the development of decision support systems
for fraud detection by offering a state-of-the-art DL model for screening submitted reports based
on a combination of financial and textual data. The proposed method exhibits superior predictive
performance and allows the identification of early warning indicators (red-flags) on both the word-
and sentence-level for the facilitation of the audit process. Additionally, we showcase the results of
comparative modeling on different data types associated with financial reports and offer the
alternative performance metrics that are centered around the cost imbalance of miss-classification
errors.
Journal Pre-proof

2. Research design and contributions

In line with the above goals, we pose three research questions (RQ) that frame our
research:
• RQ 1: Does the novel combination of financial and text data (FIN+TXT) represent a
more informative data type for fraud detection as compared to using FIN or TXT in
isolation?
• RQ 2: Can a state-of-the-art DL model outperform the bag-of-words (BOW)
approach for textual feature extraction in combination with quantitative financial
features?

of
• RQ 3: Can the proposed DL model assist in interpreting textual features signaling

ro
fraud and provide "red-flag" indicators to support the decision-making of auditors?
To answer these research questions, we select an array of classification models for

-p
detecting fraud based on different combinations of data. We consider techniques well established
re
in the fraud detection literature including logistic regression (LR), support vector machines (SVM)
and random forest (RF). Extreme gradient Boosting (XGB) and artificial neural networks (ANN),
lP

which have not been tested for statement fraud detection, are also part of the study. The main focus
of the paper is textual data processing. We introduce a novel DL method called hierarchical
na

attention network (HAN) to the community and demonstrate it distinctive features of providing
accurate and interpretable fraud predictions. In line with previous research, the paper concentrates
ur

on the MD&A sections of annual reports filed by firms within the US with the Securities and
Exchange Commission (SEC), which are referenced as annual reports on form 10-K. The SEC is
Jo

the preeminent financial supervisory organisation that is responsible for monitoring financial
reports of firms listed on the US stock exchange.
All selected models are trained on five different combinations of data contained in the
statements submitted for audit: financial indicators (FIN), linguistic features (LING) of an MD&A
text, financial and linguistic features (FIN + LING), the full text of an MD&A (TXT), full text and
the financial indicators (FIN + TXT). We compare the predictive performance of the models with
regard to their ability to distinguish fraud cases, for which we use traditional metrics like Accuracy
and area under the Receiver Operating Curve (AUC) and metrics that reflect the imbalance in
classification error, namely Sensitivity, F1-score, and F2-scores. The comparative study
contributes to the empirical literature on fraud detection through i) expanding the set of considered
Journal Pre-proof

classifiers, ii) offering previously unexplored data combinations, and iii) introducing new DL
methods that provide accurate fraud forecasts and interpretative features.
Following RQ 3, we offer a novel fraud detection method that provides signaling tools for
scholars and practitioners. We examine words considered "red-flags" by the RF feature
importance method and the HAN attention layer output. Given that the use of words for signaling
fraud may be the subject of manipulation, we offer sentence-level importance indicators as a
remedy and demonstrate how the latter can guide the audit process.

3. Decision support for fraud detection

of
Previous studies proposed fraud detection systems and offered systematic literature

ro
reviews on fraud detection approaches [71, 56]. Table 1 depicts the status-quo in the field of

-p
financial fraud detection along four dimensions: the technique utilized, the type of data, the
country of study, and the predictive performance in terms of classification accuracy and other
re
metrics. Prior research focused on financial variables and applied a range of modeling techniques,
from LR to DL. Several authors experimented with linguistic variables. These variables were
lP

based on pre-determined lists of words associated with fraud or readability measures such as the
average length of words and sentences, lexical diversity, and sentence complexity [29]. Few
na

studies applied natural language processing (NLP) techniques representing the whole textual
content of 10-K reports and, to the best of our knowledge, no previous study considered DL
ur

models for text analysis in financial statements. Furthermore, most studies examined the relation
between linguistic aspects and fraudulent actions. Only Hajek and Henriques [26] combined them
Jo

with financial data and showed that although financial variables are essential for the detection of
fraud, it is possible to enhance the performance through the inclusion of linguistic data. However,
their study was not targeted at evaluating the textual content of corporate annual reports and it did
not include sophisticated techniques for mining text such as BOW and DL. The majority of
existing research measured performance in terms of accuracy. Some studies also considered
precision and recall. Additionally, most previous studies neglected model interpretability, which is
crucial to support auditors during client selection or audit planning. Hajek and Henriques [26]
pointed out the importance of transparent fraud detection models and derived interpretable
"green-flag" values (for which fraud is likely absent). However, because the detection of
Journal Pre-proof

fraudulent firms requires more complex non-interpretable models, no "red-flag" values (for which
fraud is likely present) could be derived. We bridge this gap by suggesting the use of textual
elements as "red-flags" for auditors. Given that the cost of failing to detect statement fraud is
higher than that of incorrectly flagging a legitimate statement as fraudulent [26], focusing on early
warning signs of fraud is necessary and can enhance the efficiency of auditing processes. In
conclusion, this paper adds to the literature by offering an integrated approach for processing both
textual and financial data using interpretable state-of-the-art DL methods. Furthermore, we
provide a comprehensive evaluation of different modeling techniques using cost-sensitive metrics
to account for the different severities of false alarms versus missed fraud cases.

of
ro
Study Data Countr Features Classifiers Used metrics
(fraud / no y
fraud)
-p
BBN(90.3),
re
DTNB(89.5),
lP

RF(87.5),
Hajek and 311/311 US FIN+ LING Bag(87.1), Acc, TPR, TNR, MC, F-score,
na

Henriques [26] JRIP(87.0), AUC

CART(86.2),
ur

C4.5(86.1),
LMT(85.4),
Jo

SVM(78.0),
MLP(77.9),
AB(77.3),
LR(74.5),
NB(57.8)
Kim et al. [34] 788/2156 US FIN LR (88.4), SVM Acc, TPR, G-mean, Cost
(87.7), BBN Matrices
(82.5)
Goel and 180/180 US LING+POS SVM(81.8) Acc,TPR,FPR,Precision,F-scor
Uzuner [25] tags e
Journal Pre-proof

Purda and 1407/470 US TXT SVM (AUC AUC, Fraud Probability

Skillicorn [58] 8 (BOW), top 89.0)
200 RF
words
Throckmorton 41/1531 US FIN+ LING GLRT (AUC AUC
et al. [68] from 81.0)
Conference
Calls
Goel and 405/622 US LING  2 statistics  2 statistics

of
Gangolly [23]

ro
Dechow et al. 293/7935 US FIN LR(63.7) Acc, TPR, FPR, FNR, min
[15] 8 F-Score
Humpherys et 101/101
al. [29]
US LING
-p
C4.5 (67.3), NB Acc, Precision, Recall, F-score
(67.3), SVM
re
(65.8)
lP

Glancy and 11/20 US TXT hierarchical TP, TN, FP, FN, p-value
Yadav [22] (BOW) clustering (83.9)
na

Perols [54] 51/15934 US FIN SVM(MC Fraud Probability and MC

0.0025),
ur

LR(0.0026),
C4.5 (0.0028),
Jo

bagging(0.0028)
, DNN(0.0030)
Cecchini et al. 61/61 US LING SVM (82.0) AUC, TPR, FPR, FNR
[12]
Goel et al. [24] 126/622 US LING+TX SVM(89.5), Acc, TPR, FPR, Precision,
T (BOW) NB(55.28%) F-score
Lin et al. [42] 127/447 Taiwan FIN DNN (92.8), Acc, FPR, FNR, MC
CART (90.3),
LR (88.5)
Journal Pre-proof

Ravisankar et 101/101 China FIN PNN (98.1), GP Acc, TPR, TNR, AUC
al. [61] (94.1), GMDH
(93.0),
DNN (78.8),
SVM (73.4)
Table 1: Analysis of classifier comparisons in financial statement fraud detection

FIN – financial data, LING – linguistic data (word category frequency counts, readability and
complexity scores, etc.), TXT - text data, BOW–bag-of-words, POS – part of speech tags (nouns,

of
verbs, adjectives), BBN – Bayesian belief network, NB-Naive Bayes, DTNB - NB with the

ro
induction of decision tables, CART–classification and regression tree, LMT - logistic model trees,
MLP - multi-layer network, Bag-Bagging, AB - AdaboostM1, GMDH – group method data

-p
handling, GP–genetic programming, GLRT - generalized likelihood ratio test, LR–logistic
regression, DNN – deep neural network, PNN – probabilistic neural network, RF – random forest,
re
SVM – support vector machine, Acc - Accuracy, AUC – area under the ROC curve, MC -
lP

misclassification cost, TPR - true positive rate, TNR - true negative rate, FPR - false-positive rate,
FNR - false-negative rate.
na

3.1. Text-based indicators

Textual analysis is frequently employed for the examination of corporate disclosures.

Linguistic features have been utilized in the analysis of corporate conference calls [38], earnings
Jo

announcements [14], media reports [67] and annual reports [44, 11]. Several studies have
concentrated on the MD&A section to examine the language used in annual reports [19, 12, 29].
The MD&A is especially relevance as it offers investors the possibility of reviewing the
performance of the company as well as its future potential from the perspective of management.
This part also provides scope for the management‟s opinions on the primary threats to the business
and necessary actions.
A positive factor that renders the textual information contained within annual reports
conducive to the detection of fraud is that it is not subject to the same degree of regulation as
financial information, thus providing the organisation‟s management more opportunities when
divulging textual data [29]. Given that MD&A involves predictions, presumptions, and decisions,
Journal Pre-proof

management might be tempted to manipulate the information in order to present the organisation
in a more favourable light [51]. In addition to manipulating data, management could purposely
exclude important information thus leading to the same outcome. Examples for risk factors
associated with financial statement fraud can be a poor financial performance, a pressure for
management to meet the requirements or expectations of third parties such as investors, or a need
to obtain financing or to minimize reported earnings for tax-motivated reasons. Breiman et al. [9]
conducted analysis on infamous examples such as WorldCom and Enron, and determined that
senior managers participated in, encouraged, approved, and had knowledge of the fraudulent
activities in most cases. Social psychology research suggests that the emotions and cognitive

of
processes of managers who intend to conceal the real situation could indicate specific linguistic

ro
cues that facilitate the identification of fraud [16]. Therefore, prior work has emphasised the
increasing significance of textual analysis of financial documentation.

-p
Studies that analyzes the use of language within annual reports usually adopts one of two
strategies [40]. The first strategy draws on research in linguistics and psychology and is dependent
re
on pre-determined lists of words that have an association with a specific sentiment, like negativity,
lP

optimism, deceptiveness, or ambiguity. Loughran and Mcdonald [44] (L&M) demonstrated that if
these lists are adapted to the financial domain, it is possible to determine relationships among
na

financial-negative, financial-uncertain, and financial-litigious word lists and 10-k filing returns,
trading volume, return volatility, fraud, material weakness, and unexpected earnings. As it was
developed for analyzing 10-K text, the L&M sentiment word lists have been broadly employed in
ur

fraud-detection research [26]. Accordingly, the L&M word lists enters this study as a benchmark
Jo

to DL approaches for extracting features from the MD&A section of 10-Ks. Other researchers
based their approaches for detecting fraud on word lists that indicate positive, negative or neutral
emotions [29, 23] or more specifically anger, anxiety, and negativity according to the definitions
supplied by the Linguistic Inquiry and Word Count dictionary [29, 38, 52].
The second strategy relies on ML to extract informative features for automatic
differentiation between fraudulent and non-fraudulent texts. Li [40] contended that this method
has various benefits compared with predetermined lists of words and cues, including the fact that
no adaptation to the business context is required. ML algorithms have been used in the detection of
financial statement fraud by several researchers, such as Cecchini et al. [12], Hajek and Henriques
[26], Humpherys et al. [29], Goel and Uzuner [25], Goel et al. [24], Glancy and Yadav [22], and
Journal Pre-proof

Purda and Skillicorn [58]. Some attempts to integrate different types of data have also been made.
Purda and Skillicorn [58] compared a language-based method to detect fraud based on SVM to the
financial measures proposed by Dechow et al. [15], and concluded that these approaches are
complementary. The methods displayed low forecast correlation and identified specific types of
fraud that the other could not detect. This finding motivates us to combine financial variables and
linguistic variables to complement each other in the detection of statement fraud.
The study of Hajek and Henriques [26] is closest to this work as they combined financial
ratios with linguistic variables from annual reports of US firms and employed a variety of
classification models, as shown in Table 1. Despite these similarities, the study by Hajek and

of
Henriques [26] was not targeted at evaluating the textual content of corporate annual reports.

ro
Hence, it did not include modern NLP approaches such as deep learning-based feature extraction.

3.2. Methods and evaluation metrics -p

re
Prior work has tested a variety of statistical fraud detection models including ANNs,
Decision Trees (DT), SVM, evolutionary algorithms, and text analysis [26]. The BOW technique
lP

was frequently adopted for the extraction of the linguistic properties of financial documentation.
The BOW approach represents a document by a vector of word counts that appear in it.
na

Consequently, the word frequency is used as the input for the ML algorithms. This method does
not consider the grammar, context, and structure of sentences and could be overly simple in terms
ur

of uncovering the real sense of the text [38]. A different technique for analyzing text is DL. Deep
ANN are able to extract high-level features from unstructured data automatically. Textual analysis
Jo

models based on DL can “learn” the specific patterns that underpin the text, “understand” its
meaning and subsequently output abstract aspects gleaned from the text. Hence, they resolve some
of the problems associated with the BOW technique, including the extraction of contextual
information from documents. Due to their capacity to deal with sequences with distinct lengths,
ANN have shown excellent results in recent studies on text processing [76]. Despite their
achievements in NLP, there has been limited focus on the application of state-of-the-art DL
methods to the analysis of a financial text, with the notable exception of [36]. For an adoption in
practice, DK models should not only be precise, but also interpretable [37, 28]. However, the
majority of systems designed to detect fraud reported by researchers aim to maximise the
prediction accuracy, while disregarding how transparent they are [26]. This factor has particular
Journal Pre-proof

significance as the development of interpretable models is critical for supporting the investigation
procedure in auditing.

4. Methodology
The objective of this study is to devise a fraud detection system that classifies annual
reports. While financial and linguistic variables represent structured tabular data and require no
extensive preprocessing, the unstructured text data has to be transformed into a numeric format,
which preserves its informative content and facilitates algorithmic processing. To achieve the

of
latter, words are embedded as numeric vectors. The field of NLP has proposed various ways to
construct such vectors. We consider two methods for text representation: frequency-based BOW

ro
embeddings and prediction-based neural embeddings (word2vec). An advantage of the BOW

-p
approach, which has been used in prior work on financial statement fraud (see Table1), is its
simplicity. However, BOW represents a set of words without grammar and disrupts word order.
re
Unlike BOW, the application of DL is still relatively new to the area of regtech (management of
regulatory processes within the financial industry through technology). Therefore, the following
lP

subsections clarify neural word embeddings and address the DL components of the proposed HAN
model.
na

4.1. Neural Embeddings

The BOW model represents every word as a feature. The amount of features denotes the
Jo

dimension of the document vector [46]. Since the amount of unique words within a document
typically only represents a small proportion of the overall amount of unique words within the
whole corpus, BOW document vectors are very sparse. A more advanced model for creating lower
dimensional, dense embeddings of words is word2vec. As opposed to BOW, word2vec
embeddings enable words that have similar meanings to be given similar vector representations
and capture the syntactic and semantic similarities. Word2vec [48] is an example of a NN model
that is capable of learning word representations from a large corpus. Every word within the corpus
is mapped to a vector of 50 to 300 dimensions. Mikolov et al. [48] demonstrated that such vectors
offer advanced capabilities to measure the semantic and syntactic similarities between words. The
generated word embeddings are a suitable input for text mining algorithms based on DL, as will be
Journal Pre-proof

observed in the next part. They constitute the first layer of the model and allow further processing
of text input within the DL architecture.
The initial word2vec algorithm is followed by GloVe [53], FastText [8], and GPT-2 [59],
as well as the appearance of publicly available sets of pre-trained embeddings that are acquired by
applying the above-mentioned algorithms on large text corpora. Pre-trained word embeddings
accelerate training DL models and were successfully used in numerous NLP tasks. We apply
several types of pre-trained embeddings for HAN model and a neural network with a bidirectional
Gated Recurrent Unit (GRU) layer that serves as a benchmark from the field of DL. As a result of
a performance-based selection, the HAN model is built with word2vec embeddings with 300

of
neurons, trained on the Google News corpus, with a vocabulary size of 3 million words. The DL

ro
benchmark is used with the GPT-2 pre-trained embeddings from the WebText, offered by Radford
et al. [59], as they arguable constitute the current state-of-the-art language model. The DL

-p
benchmark model is thus referred to as GPT-2 and is used together with the attention mechanism,
discussed further.
re
lP

4.2. Deep learning

After representing unstructured textual data in a numerical format, it can be used for
na

predictive modeling. Conventional methods for classifying text involve the representation of
sparse lexical features, like TF-IDF, and subsequently utilize a linear model or kernel techniques
ur

upon this representation [30].

Recently, DL has incorporated new techniques, including Convolutional Neural Networks
Jo

(CNN) [32] and Recurrent Neural Networks (RNN) [27] for learning textual representations [73].
The RNN architecture allows retaining the input sequence, which made it widely used for natural
language understanding, language generation, and video processing [47, 31]. An LSTM is a
special type of RNN, comprised of various gates determining whether the information is kept,
forgotten or updated and enabling long-term dependencies to be learned by the model [27]. An
LSTM retains or modifies previous information on a selective basis and stores important
information in a separate cell, which acts as a memory [69]. Consequently, overwriting of
important information by the new inputs does not occur, it can persist for extended periods.

4.3. Hierarchical Attention Network

Journal Pre-proof

More advanced DL approaches also address hierarchical patterns of language such as the
hierarchy between words, sentences, and documents. Some methods have covered the hierarchical
construction of documents [74, 60]. The specific contexts of words and sentences, whereby the
meaning of a word or sentence could change depending on the document, is a comparatively new
concept for the process of text classification, and the HAN was developed to address this issue
[72]. When computing the document encoding, HAN firstly detects the words that have
importance within a sentence, and subsequently, those sentences that have importance within a
document while considering the context (see Figure 1). The model recognizes the fact that an
occurrence of a word may be significant when found in a particular sentence, whereas another

of
occurrence of that word may not be important in another sentence (context).

ro
Figure 1: HAN Architecture. Image based on Yang et al. [72]

-p
The HAN builds a document representation via the initial construction of sentence vectors
re
based on words followed by the aggregation of these sentence vectors into a document
lP

representation through the application of the attention mechanism. The model consists of an
encoder that generates relevant contexts and an attention mechanism, which calculates importance
na

weights. The same algorithms are consecutively implemented at the word level and then at the
sentence level.
ur

Word Level. The input is transformed into structured tokens wit that denote word i in sentence

t  [1, T ] . Tokens are further passed through a pre-trained embedding matrix We that allocates
Jo

multidimensional vectors xit = We wit to every token. As a result, words are denoted in numerical

format by xit as a projection of the word in a continuous vector space.

Word Encoder. The vectorized tokens represent the inputs for the following layer. While Yang et
al. [72] employed GRU for encoding, we use LSTM as it showed better performance on the large
text sequences at hand [13]. In the context of the current model, a bidirectional LSTM is
implemented to obtain the annotations of words. The model consists of two uni-directional
LSTMs, whose parameters are different apart from the word embedding matrix. Processing of the
sentences in the initial forward LSTM occurs in a left to the right manner, whereas in the backward
LSTM, sentences are processed from right to left. The pair of sentence embeddings are
Journal Pre-proof

concatenated at every time step t to acquire the internal representation of the bi-directional
LSTM hit .

Word Attention. The annotations hit construct the input for the attention mechanism that learns

enhanced annotations denoted by uit . Additionally, the tanh function adjusts the input values so
that they fall in the range of -1 to 1 and maps zero to near-zero. The newly generated annotations
are then multiplied again with a trainable context vector uw and subsequently normalized to an

importance weight per word  it via a softmax function. As part of the training procedure, the

word context vector uw is initialized randomly and concurrently learned. The total of these

of
importance weights concatenated with the already computed context annotations is defined as the

ro
sentence vector si :

uit = tanh(Wwhit  bw )

 it =
-p
exp(uitT uw )
(1)
re
(2)
 t exp(uitT uw )
lP

si =  it hit (3)
t

Sentence Level and Sentence Encoder. Subsequently, the entire network is run at the sentence level
na

using the same fundamental process used for the word level. An embedding layer is not required as
sentence vectors si have previously been acquired from the word level as input. Summarization
ur

of sentence contexts is performed using a bi-directional LSTM, which analyzes the document in
Jo

both forward and backward directions:

hi = LSTM (si ), i [1, L] (4)

hi = LSTM (si ), i [T ,1] (5)

hi = [hi , hi ] (6)
Sentence Attention. For rewarding sentences that are indicators of the correct document
classification, the attention mechanism is applied once again along with a sentence-level context
vector us , which is utilized to measure the sentence importance. Both trainable weights and biases
are initialized randomly and concurrently learned during the training procedure, thus yielding:
ui = tanh(Ws hi  bs ) (7)
Journal Pre-proof

exp(uiT us )
i = (8)
 i exp(uiT us )
d =  i hi (9)
i

where d denotes the document vector summarising all the information contained within each of
the document‟s sentences. Finally, the document vector d is a high-level representation of the
overall document and can be utilized as features for document classification to generate output
vector ŷ :

yˆ = softmax(Wc d  bc ) (10)

of
where ŷ denotes a K dimensional vector and the components yk model the probability that

ro
document d is a member of class k in the set 1,..., K .
The application of the HAN follows the application of Kränkel and Lee [35]. Training of
-p
the DL model is performed on the training data set using both textual and quantitative features.
re
Hence, the textual data acquired in the previous section is concatenated with the financial ratios.
The model is employed to predict fraud probabilities of annual statements in the corresponding
lP

validation and test partitions, which were constructed with random sampling with stratification.
Figure 2 shows the architecture of the HAN based fraud detection model and the output
na

dimensions of each layer.

Figure 2: Architecture of the HAN based Fraud Detection Model

The LSTM layer consists of 150 neurons, a HAN dense dimension of 200, and a last dense
layer dimension of 6. In this case, a combination of forward and backward LSTMs gives 300
dimensions for word and sentence annotation. The last layer of the HAN involves the application
of dropout regularization to prevent over-fitting. In a final step, the resulting
document-representation of dimension 200 is concatenated with 47 financial ratios and inputted to
a dense layer before running through a softmax function that outputs the fraud probabilities. For
training, a batch size of 32 and 17 epochs was used after hyperparameter tuning on the train
validation set.
Journal Pre-proof

4.4. Evaluation metrics

Financial statement fraud detection is approached as a binary classification problem with
four possible outcomes: True positive (TP) denotes the correct classification of a fraud case, false
negative (FN) denotes the incorrect classification of a fraud case as non-fraud, true negative (TN)
denotes the correct classification of a non-fraud case, and false positive (FP) denotes the incorrect
classification of a non-fraud case as fraud.
To estimate predictive performance, many previous studies considered a combination of
measures such as accuracy, sensitivity (also called TP rate or recall), specificity (also called TN
rate), precision, and F1-score [71]. In this study, model performance is evaluated by the AUC,

of
sensitivity, specificity, F1-score, F2-score, and accuracy.

ro
The F-score is a combination of precision (correct classification of fraud cases as a
percentage of all instances classified as fraudulent) and sensitivity (indicates how many fraudulent

-p
instances the classifier misses). It measures how precise and how robust the models classify
fraudulent cases:
re
precision  sensitivity
F  score = (1   2 )  (11)
(   precision)  sensitivity
lP

Prior research emphasized that a higher sensitivity is preferred to higher specificity in

financial statement fraud detection. Nevertheless, the majority of models have exhibited higher
performance in detecting truthful transactions in comparison to those that are fraudulent [71, 61].
ur

An explanation for this preference is that FN rate (type II error) and FP rate (type I error) result in
different misclassification costs (MC). Hajek and Henriques [26] estimated the cost of failing to
Jo

detect fraudulent statements (type II error) to be twice as high as the cost of incorrectly classifying
fraudulent statements (type I error). Hence, effective models should concentrate on high
sensitivity and classify correctly as many positive samples as possible, rather than maximizing the
number of correct classifications. To that end, this study employs the F2-score in addition to the
F1-score (harmonic mean of precision and sensitivity), as it weights sensitivity higher than
precision and is, therefore, more suitable for fraud detection. The AUC captures the ability of a
model to rank fraud and non-fraud cases in the right order. The higher the AUC, the better the
model can distinguish between fraud and non-fraud cases. Being robust toward imbalanced class
distributions, the AUC is preferred to accuracy in fraud detection [68, 58] and also used in this
study.
Journal Pre-proof

Calculating F1- and F2-scores requires a cutoff to threshold model-based fraud

probabilities. We select the threshold that maximizes the difference between sensitivity and FP
rate and use it to evaluate the classification results. For the HAN model, the optimal threshold is
set at 0.03, implying that a statement is classified as fraudulent if its fraud probability is higher
than 3%.

5. Data
Fraud detection is a challenging task because of the low number of known fraud cases. A

of
severe imbalance between the positive and the negative class impedes classification. For example,
the proportion of statements that were fraudulent and non-fraudulent in the annual reports

ro
submitted to the SEC for the period from 1999 to 2019 was 1:250. In past research, the number of

-p
firms that committed fraud contained in the data varied between 12 and 788 [68, 34]. The data used
here consists of 208 fraudulent and 7 341 non-fraud cases, making it the largest data set with a
re
textual component so far (c.f., Table1). The data set consists of US companies‟ annual financial
reports (10-K filings) that are publicly available through the EDGAR database of the SEC‟s
lP

1
website and quantitative financial data, sourced from the Compustat database 2.
na

5.1. Labeling
Companies submit yearly reports that undergo an audit. Labeling these reports requires
ur

several filtering decisions: when can a report be considered fraudulent and what type of fraud
Jo

should we consider? To address the first question, we follow [58, 29, 26] and consider a report as
"fraudulent" if the company that filed it was convicted. The SEC publishes statements called
"Accounting and Auditing Enforcement Releases" (AAER) that describe financial reporting
related enforcement actions taken against companies that violated the reporting rules 3 . SEC
concentrates on cases with high importance and applies enforcement actions where the evidence of
manipulation is sufficiently robust [26, 33]. This provides a high degree of trust to this source.
Labeling reports based on the AAER offers simplicity and consistency with easy replication and

1
www.sec.gov/edgar.shtml
2
www.compustat.com
3
https://www.sec.gov/divisions/enforce/friactions.shtml
Journal Pre-proof

avoids possible bias from a subjective categorization. Following [58], we select the AAERs
concerning litigations issued during the period from 1999 to 2019 with identified manipulation
instances between the year 1995 and 2016 that discuss the words "fraud", "fraudulent",
"anti-fraud" and "defraud" as well as "annual reports" or "10-K". Addressing the second question,
we follow [12, 24, 29, 58, 26] and focus on binary fraud classification. This implies that we do not
distinguish between truthful and unintentionally misstated annual reports. The resulting data set
contains 187 869 annual reports filed between 1993 and 2019, with 774 firm-years subject to
enforcement actions. However, due to missing entries and mismatches in existing CIK indexation,
the final data set is reduced to 7 757 firm-year observations with 208 fraud and 7 549 non-fraud

of
filings. Further, we perform the extraction of text and financial data.

ro
5.2. Text data
-p
The MD&A section constitutes the primary source of raw text data in this study. In
re
addition, nine linguistic features are utilized as predictors (described in the online appendix). The
selection of these features is influenced by past studies that demonstrated several patterns of
lP

fraudulent agents, like an increased likelihood of using words that indicate negativity [68, 49],
absence of process ownership implying lack of assurance and resulting in statements containing
na

less certainty [38] or an average of three times more positive sentiment and four times more
negative sentiment in comparison to honest reports Goel and Uzuner [25]. Additionally, the
ur

general tone (sentiment) and the proportion of constraining words, were included by Hajek and
Henriques [26], Loughran and Mcdonald [44], and Bodnaruk et al. [7]. Lastly, the average length
Jo

of sentence, the proportion of compound words, and the fog index are incorporated as measures of
complexity and legibility because prior work suggests that reports produced by misstating firms
had reduced readability [29, 39].

5.3. Quantitative data

Along with text features, we used 47 quantitative financial predictors (described in the
online appendix), which are capable of capturing financial distress as well as managerial
motivations to misrepresent the performance of the firm. Past studies have presented robust
theoretical evidence supporting the utilization of financial variables [20, 64, 1, 26]. Following the
Journal Pre-proof

guidelines of existing research, the financial ratios and balance sheet variables presented in the
online appendix are extracted from Compustat, based on formulas presented by Dechow et al. [15]
and Beneish [6]. Financial variables include indicators like total assets (adopted as a proxy for
company size [68, 4]), profitability ratios [26], accounts receivable and inventories as non-cash
working capital drivers [1, 12, 55]. Additionally, a reduced ratio of sales general and
administrative expenses (SGA) to revenues (SGAI) is found to signalize fraud [1]. Missing values
are imputed using the RF algorithm. However, observations with more than 50% of the variables
missing are excluded.

of
5.4. Imbalance treatment

ro
The majority of previous research has balanced the fraud and non-fraud cases in a data set
using undersampling [26, 29, 61]. We follow this approach and consider a fraud-to-non-fraud-ratio

-p
of 1:4, which reflects the fact that the majority of firms have no involvement in fraudulent
re
behaviour. Both year and sector are utilized for balancing, in order to take into account different
economic conditions, change in regulation, as well as to eradicate any differences across distinct
lP

sectors [29, 34]. The latter is extracted with the SIC code [65] and is of particular importance for
text mining, as the utilization of words within financial documentation could differ according to
na

the sector. The resulting balanced data set consists of 1 163 reports, out of which 201 are
fraudulent, and 962 are non-fraudulent annual reports. In the years 2002 to 2004 more financial
ur

misstatements than in other years can be observed. This could be attributed to the tightened
regulations after the big fraud scandals in 2001 and the resulting implementation of the
Jo

Sarbanes-Oxley Act (SOX) in 2002. Also, fewer misstatements are noted in recent years since the
average period between the end of the fraud and the publication of an AAER is three years [57].

6. Classification results
We answer RQ 1 and 2 by means of empirical analysis and compare a set of classification
models in terms of their fraud detection performance. The models generate fraud classifications
based on financial indicators, linguistic features of reports, the reports‟ text, and combinations of
these groups of features. Table 2 reports corresponding results from the out-of-sample test set. The
baseline accuracy of classifying all cases of the test set as non-fraudulent (majority class) is
Journal Pre-proof

82.81%.

Finance data (FIN)

AUC Sensitivit Specificit F1-score F2-score Accuracy
y y
LR 0.7620 0.6833 0.7543 0.4767 0.7480 0.8252
RF 0.8609 0.7666 0.7889 0.5508 0.7892 0.8653
SVM 0.7561 0.6166 0.7820 0.4625 0.7595 0.8280
XGB 0.8470 0.6660 0.8719 0.5839 0.8391 0.8481

of
ANN 0.7564 0.7833 0.6574 0.4563 0.6835 0.6790

ro
Linguistics data (LING) Comparison to FIN
AUC Sensitivit Specificit F1-score F2-score Accuracy Delta Delta F1
y y
-p AUC
re
LR 0.6719 0.7000 0.6193 0.3962 0.6398 0.8280 -0.0901 -0.0805
RF 0.7713 0.7500 0.7197 0.4839 0.7302 0.8424 -0.0896 -0.0669
lP

SVM 0.7406 0.7000 0.6747 0.4285 0.6857 0.8280 -0.0155 -0.0340

XGB 0.7219 0.3666 0.9446 0.4489 0.8385 0.8338 -0.1251 -0.1350
na

ANN 0.6782 0.6333 0.6747 0.3958 0.6758 0.6676 -0.0782 -0.0605

Finance data + Linguistics data (FIN + LING) Comparison to FIN
ur

AUC Sensitivit Specificit F1-score F2-score Accuracy Delta Delta F1

y y AUC
Jo

LR 0.7682 0.7666 0.6782 0.4623 0.6984 0.8280 0.0062 -0.0144

RF 0.8606 0.7666 0.7543 0.5197 0.7610 0.8567 -0.0003 -0.0311
SVM 0.7973 0.7166 0.7439 0.4858 0.7448 0.8280 0.0567 0.0573
XGB 0.8651 0.8166 0.7543 0.5444 0.7687 0.8653 0.0181 -0.0395
ANN 0.7733 0.8333 0.6228 0.4566 0.6614 0.6590 0.0169 0.0003
Text data, TF-IDF (TXT) Comparison to
LING
AUC Sensitivit Specificit F1-score F2-score Accuracy Delta Delta F1
y y AUC
Journal Pre-proof

LR 0.8371 0.7333 0.8269 0.5714 0.8145 0.8281 0.1652 0.1752

RF 0.8740 0.7166 0.9377 0.7107 0.8998 0.8681 0.1027 0.2268
SVM 0.8836 0.8382 0.7544 0.5876 0.7731 0.8796 0.1275 0.1251
XGB 0.8785 0.7660 0.8581 0.6258 0.8451 0.8853 0.1566 0.1769
ANN 0.8829 0.7121 0.9434 0.7286 0.8993 0.8990 0.2047 0.3328
HAN 0.9108 0.8000 0.8896 0.5744 0.7982 0.8457
GPT-2+Att 0.7729 0.7619 0.6697 0.4423 0.6905 0.6484
n
Finance data + Text data, TF-IDF (FIN + TXT) Comparison to FIN

of
+ LING

ro
AUC Sensitivit Specificit F1-score F2-score Accuracy Delta Delta F1
y y AUC
LR 0.8598 0.7833 0.7854
-p
0.5562 0.7890 0.8424 0.0916 -0.0795
re
RF 0.8797 0.6660 0.9550 0.7079 0.9043 0.8739 0.0191 -0.1571
SVM 0.8902 0.7833 0.8961 0.6861 0.8784 0.8280 0.0929 -0.2576
lP

XGB 0.8983 0.7000 0.9653 0.7500 0.9187 0.9083 0.0332 -0.1661

ANN 0.8911 0.7460 0.9405 0.7401 0.9055 0.9054 0.1178 -0.2838
na

HAN 0.9264 0.9000 0.8206 0.6506 0.8361 0.8457

GPT-2+Att 0.7776 0.7678 0.6791 0.4455 0.6991 0.6934
ur

n
Table 2: Comparative performance of selected binary classifiers on the different types of test data.
Jo

The baseline accuracy is 0.8281

6.1. Modeling of financial data

Modeling using financial data (FIN) has been the most popular approach (Table 1). The
approach serves this study as a benchmark, to which we compare modeling on linguistic features
(LING) and the combination of both (FIN + LING). The last two columns of Table 2 show the
results of the comparison. In terms of AUC and accuracy, the tree-based models RF [10] and XGB
appear to excel at predicting fraud on FIN, indicating a non-linear dependency between financial
indicators and the fraud status of a report. This result is in line with Liu et al. [43] who showed that
Journal Pre-proof

RF performed especially well in case of high-dimensional financial fraud data. Hajek and
Henriques [26] also reported an accuracy of 88.1% on FIN data and concluded that the ensemble of
tree-based algorithms is superior to SVM, LR and ANN due to a relatively low dimensionality
achieved during feature selection. The predictive performance aligns with the results of Kim et al.
[34], offering the LR and SVM models as the most accurate. Lin et al. [42] and Ravisankar et al.
[61] showcased that the DNN models (ANN with more than one hidden layer) outperform LR and
SVM, offering an accuracy higher by around 4.5%. The SVM is a widely recognized model and
was applied both for fraud detection [54] and in other fields [17]. However, the results show that
inherent configuration complexities make SVM a secondary choice for practitioners. ANN show

of
less impressive predictive performance but proved to be the most efficient in terms of sensitivity.

ro
However, for model evaluation, a balanced indicator like F1- and F2-scores provide a better
perspective. These metrics suggest XGB to outperform other models. XGB represents an

-p
advancement in the in the field of ML, its high performance is noteworthy since it was not
considered in prior work on fraud detection. Given the much higher cost of missing actual fraud
re
cases compared to false alarms, we argue that the F2-score is the most suitable threshold-based
lP

indicator of model performance. Therefore, we emphasize the F2-score together with the AUC,
which allows the tuning of the threshold.
na

6.2. Modeling of linguistic data

The modeling on linguistic data (LING) was the first step towards including text in fraud
detection. The earlier experiments by Cecchini et al. [12], Humpherys et al. [29] and Goel et al.
Jo

[24] employed SVM and achieved accuracy of 82%, 65.8%, and 89.5% respectively. The latter
additionally included the BOW method that we will discuss further. Our modeling falls in line with
the previous work and exhibits SVM as the second strongest predictor, yielding an AUC of 74%
and accuracy of 82%. RF remained the most reliable predictor with the highest AUC, accuracy,
and F2-score. Modeling done solely on LING will allow us to assess the degree to which both
sources of data contribute to accurate classification. In line with Hajek and Henriques [26], all
models exhibit higher performance on FIN data than on LING data solely, leading to the
conclusion that financial covariates have more predictive power than linguistic variables.
However, the performance differences are not substantial and suggest a strong relationship
between linguistic features and fraudulent behavior, which agrees with previous studies.
Journal Pre-proof

Following the ideas of Hajek2017MiningMethods, we combine FIN and LING data to evaluate if
the classifier can make use of both data sources. Our results differ in terms of the leading models,
with RF and XGB offering the highest AUCs of 86%. XGB is showing an improvement,
performing well on FIN data, falling back a little in the LING set up but making better use of the
combined input. Once again we observe the superior performance of XGB in terms of F2-score
with 76.87% followed closely by 76.10% of RF and 74.48% of SVM, which once again advocates
for the usefulness of advanced ML methods for practical tasks. Interestingly, for the rest of
classifiers, the accuracy dropped a little in comparison to FIN, but the AUCs improved (with a
minor exception for RF). This serves as an indication that LING and FIN data combined may

of
provide conflicting signals to the classifier. However, the data mix is a definite improvement as it

ro
provides a stronger signal to the classifier and enhances predictive performance.

6.3. Modeling of text data -p

re
Researchers have been taking a step forward from aggregated linguistic features in an
attempt to derive more predictive power from the vast amounts of text contained in annual reports.
lP

We offer the advanced methods of NLP, previously unexplored for fraud detection, and compare
them to the performance of more traditional models. Goel et al. [24], Glancy and Yadav [22] and
na

Purda and Skillicorn [58] applied the BOW model to perform modeling on text data, while Goel
and Uzuner [25] made use of part-of-speech tagging. They utilized SVM and hierarchical
ur

clustering as classifiers and achieved accuracies of 89.5%, 83.4%, 89%, and 81.8%, respectively.
Table 2 offers an overview of the modeling results, starting with purely textual input (TXT)
Jo

and continuing with text enhanced by financial data (FIN+TXT). Two new DL methods are
included in TXT modeling, namely HAN and GPT-2. While traditional benchmarks take the
TF-IDF transformations of word input, the DL models make use of pre-trained embeddings. We
can observe that modeling on TXT improves all models in comparison to LING, with the largest
AUC delta of 0.2 in the case of ANN. This increase can be attributed to the richer input of the
actual MD&A content. While the more basic feature input of the LING models solely incorporated
linguistic information (e.g., frequency counts of word categories based on LM word lists,
readability and complexity ratios, etc.) the textual input (TXT) is based on advanced text mining
techniques and vector space representations containing the information of the whole MD&A
content (TF-IDF based embedding), as well as grammar, context, and structure of sentences (DL
Journal Pre-proof

based embeddings). ANN demonstrates the highest accuracy, 89%, and the best F1- and F2-scores,
which constitutes a strong signal that the neural network architecture is a favorable candidate for
the task, regardless of the BOW input. Given the complexity of text processing, ANN proves its
capacity to pick up on complex relationships between the target and explanatory variables. The
improvement is also visible for the F2-score of 89.93% that closely follows that of RF (89.98%). It
is interesting to compare the BOW-based ANN with GPT-2 and HAN, all of which represent a NN
architecture. GPT-2 performs better on TXT than any other model on LING. Though it fails to
show superior accuracy, its sensitivity is one of the highest, leading to the conclusion that with
some threshold adjustment, it could provide better predictive performance than other models like

of
LR or tree-based models. This example underlines the potential gains of implementing the new DL

ro
methods that allow superior insights into unstructured data. Unlike BOW-based benchmarks,
embeddings-based HAN and GPT-2 retained the structure and context of the input. HAN showed

-p
superior results in terms of AUC 91.08% but fell short in terms of accuracy. However, its
sensitivity is exceeding those of all other benchmarks except SVM, making it a promising model
re
for fraud detection. The appealing performance of HAN can be explained by its intrinsic capacity
lP

to extract significant contextual similarities within documents and that pertinent cues, which allow
truthful text to be distinguished from deceitful ones, are dependent on the context rather than the
na

content [75]. All in all, the results suggest that textual data can offer much more insight than LING
across all classifiers.
We conclude the analysis by examining the feature combination FIN+TXT, which is at the
ur

core of our study. The input setup is done in two ways: a combination of word vectors with
Jo

financial indicators into one data set and a 2-step modeling approach. The latter comprises
building a TXT model and using its probability prediction as an input to another DL model that
will concoct it with FIN and output the final binary prediction. The first approach is applied in the
case of benchmark models, including ANN. The second one is implemented for HAN and GPT-2.
Based on a comparison of models using only TXT or FIN data, Purda and Skillicorn [58]
concluded that these data sources are complementary with each source identifying specific types
of fraud that the other cannot detect. In our case, all benchmarks exhibit improved performance in
comparison to the FIN + LING setup, especially LR and SVM. However, the same unanimity is
observed in decreased F1-score. We observe the superiority of predictive powers of full-textual
input over the linguistic metrics. If we compare the additional value of FIN for the performance,
Journal Pre-proof

we can see only a minor increase in almost all metrics, once again underlying the complexity and
potential misalignment of FIN and TXT data. However, it is essential to note that unlike F1-score,
F2-score increases across the ML benchmarks, which brings us to the initial assumption behind the
preference toward the F2-score as key to model evaluation for practical use. We conclude that with
the increased complexity of input, one should opt for advanced ML techniques for the extraction of
extra insight.
The best performance is again yielded by HAN with AUC 92.64%, followed by XGB and
ANN with AUCs of 89%. HAN is also offering the highest sensitivity of 90% across all datasets
and models, making it the recommended solution for statement fraud detection. Going back to the

of
triad comparison between ANN, HAN, and GPT-2, we can see that the latter does not show much

ro
improvement with added FIN data. This signals the potentially poor choice of pre-trained
embeddings, highlighting the importance of this decision in the design of a DL classifier and

-p
reminding that state-of-the-art solutions do not guarantee superior results. ANN does not catch up
with HAN AUC-wise. However, it showcases the higher F2-score of 90.55%, surpassed only by
re
XGB, which proved to be a promising alternative to the DL methods. The results of modeling on
lP

HAN showed its capacity to incorporate and extract additional values from the diversified input,
which contributes to the existing field of research and opens new opportunities to the further
na

exploration of data enrichment for fraud detection.

The results of HAN address the RQ 1 and 2, allowing us to conclude that the proposed DL
architecture offers a substantial improvement for fraud detection facilitation. Additionally, its
ur

properties allow us to offer a look into the "black box" of the DL models and provide the rationale
Jo

behind the classification decision. This interpretability capacity might be particularly important for
practitioners, given the need to substantiate the audit judgment, and will be further explored in the
next Section.

7. Interpretation and decision support application

SEC developed software specifically focused on the MD&A section [58] to examine the
use of language for indications of fraud. The importance of the MD&A section can be observed in
reforms introduced by SOX in 2002, which demanded that the relevant section should present and
offer full disclosure on critical accounting estimates and policies [62]. The length of MD&A
Journal Pre-proof

sections increased after SOX became effective; nevertheless, Li [41] concluded that no changes
were made to the information contained within MD&A sections or the style of language adopted.
Taking further the fraud detection efforts, we developed a method to facilitate the audit of the
MD&A section. We employ state-of-the-art textual analysis to shed light on managers‟ cognitive
processes, which could be revealed by the language used in the MD&A section. Zhou et al. [75]
demonstrated that it is plausible to detect lies based on textual cues. Nonetheless, the pertinent cues
that allow truthful texts to be distinguished from deceitful ones are dependent on the context. One
way to support auditors would be the "red-flag" indication in the body of the MD&A section.
Hajek and Henriques [26] explored the use of "green-flag" and "red-flag" values of financial

of
indicators and concluded that the identification of non-fraudulent firms is less complex and can be

ro
accompanied by interpretable "green-flag" values, however because the detection of fraudulent
firms requires more complex non interpretable ML models, no "red-flag" values could be derived.

-p
We will take it further and provide the suggestion for the use of textual elements as "red-flags" for
auditors. This can be done on the word level or the sentence-level and is to our best knowledge,
re
new to the field. The HAN model allows a holistic analysis of the text structure and the underlying
lP

semantics. In contrast to BOW that ignores specific contextual meanings of words, the HAN
model considers the grammar, structure, and context of words within a sentence and of sentences
na

within a document, which is essential for the identification of fraudulent behaviour. The attention
mechanisms of the HAN retain the logical dependencies of the content and enable the
identification of the words and sentences that contribute the most in attributing the fraudulent
ur

behaviour by extracting the word and sentence attention weights defined in Equation 2 and 8.
Jo

These valuable insights into the internal document structure together with strong predictive
performance, make HAN notably advantageous in comparison to BOW-based traditional
benchmarks.
Based on the assumption that fraudulent actors are capable of manipulating their writings
so that they have convincing similarities to those that are non-fraudulent, only concentrating on
words that focus on the content of the text while disregarding the context could be overly
simplistic for differentiating truthful from misleading statements. We assume that due to their
inherently higher complexity, sentence-level indicators are less prone to manipulation and provide
robust insight for auditing.
Journal Pre-proof

7.1. Word-level
We provide a comparative analysis of words considered to be "red-flags" by the more
traditional RF model and those offered by HAN. The RF model proved to be a potent and
consistent classifier throughout the comparative analysis. We apply the lime methodology of
Ribeiro et al. [63] to gain insight into the role of different words in the model‟s classification
decision. lime stands for Local Interpretable Model-Agnostic Explanations and is based on
explaining the model functioning in the locality of the chosen observation. Ribeiro et al. [63]
explains every input separately; the example of its application to one of the fraud texts can be
found in Figure 3:

of
ro
Figure 3: Words with top weights indicating fraud from a sample MD&A

-p
We supply all fraud cases through the lime package and extract the top ten words, that have
the strongest effect on the model in terms of fraud indication. We further aggregate these words
re
and gain a "red-flag" vocabulary. Additionally, we perform the same analysis with the DNN model
lP

and extract the weights assigned by the HAN attention layer. The results are summarized in Figure
4:
na

Figure 4: "Red-flag" words identified by RF and HAN, the bottom section contains the words
ur

matching both sets

Fifteen words are found to be important for an indication of fraudulent activity by both
algorithms, including "government", "certain", "gross", potentially indicating adverse
involvement of the state institutions. It would seem that RF derives judgment from the industry:
"aerospace", medical terms, "pilotless", "armourgroup". HAN picks up on financial and legal
terms like "cost", "acquisition", "property". Both classifiers also include time- and
calendar-related words like names of the month. It is not obvious how much the context affects this
selection. Additionally, derivation of a word-based rule might potentially lead to a quick
adaptation of the reporting entities for audit circumvention. Ambiguous interpretation and
manipulation risks motivate the creation of the sentence-level decision support system.
Journal Pre-proof

7.2. Sentence-level
The added contextual information extracted by the HAN shows improved performance on
the test set in comparison to linguistic features and other DNN models. It can be partially
explained by the hierarchical structure of language, which entails the unequal roles of words in the
overall structure. Following RQ 3, we want to benefit from the structural and context insight
retained in sentence-level analysis, provided uniquely by the HAN model.
We extract the sentence-level attention weights for 200 fraudulent reports gained as a result
of prediction by HAN and filter the top ten most important sentences per report. The mean weight
of a sentence that can be considered a "red-flag" is 0.05, with a maximum at 0.61. We devise a rule,

of
dictating that sentences with weights higher than 0.067 (top 25% quantile) will be referred to as

ro
"extra important", sentences between 0.04 and 0.67 (top half) are "important" and those between
0.022 and 0.04 are "noteworthy". These three groups of words get respective coloring and are

-p
highlighted in the considered MD&A, as depicted in Figure 5.
We propose to use the probability prediction of the HAN model and assign sentence
re
weights as a two-step decision support system for auditors. Given its strong predictive
lP

performance, HAN can provide an initial signal about the risks of fraud. Based on a selected
sensitivity threshold, auditors may select to evaluate a potentially fraudulent report with extra
na

caution and use the highlighted sentences as additional guidance. Given the lengthiness of an
average MD&A and limited physical concentration capacities associated with the manual audit,
ur

this sort of visual guidance can offer higher accuracy of fraud detection.
Jo

Figure 5: A page from MD&A (on the left) and its extract with "red-flag" phrases for the attention
of the auditor (on the right). Sentences that contributed the most to the decision towards "fraud" are
labeled by HAN as extra important and important. Additional examples are provided in the Online
Appendix.

8. Discussion
As reported in the literature review, Hajek and Henriques [26] and Throckmorton et al.
[68] have tackled the task of combined mining financial and linguistic data for financial statement
fraud prediction, and no study was found on the combination of financial and textual data. Given
Journal Pre-proof

the managerial efforts to conceal bad news by using particular wording [29] and by generating less
understandable reports [39, 45], it is pivotal to adopt more advanced text processing techniques.
SVM showed good performance across most experimental setups.Due to its ability to deal
with high dimensional, sparse feature spaces, SVM has achieved the best performances in previous
studies that incorporated the BOW approach [23, 58]. in this study, RF showed the best predictive
performance, managing to extract knowledge from both financial and BOW-based textual sources.
DL models also proved capable of distinguishing fraudulent cases. However, only the HAN
architecture showcased exceptional capacity to extract signals from the FIN + TXT setting, which
is in the center of the current research. The HAN detects a high number of fraudulent cases

of
compared to remaining models, strengthening the statement by Zhou et al. [75] that the detection

ro
of deception based on text necessitates contextual information.
The results of the AUC measures indicate that the linguistic variables extracted with HAN

-p
and TF-IDF add significant value to fraud detection models in combination with financial ratios.
The heterogeneity in performance shifts among different data types for models, showing that
re
different models pick up on different signals, and a combination of these models might be more
lP

appropriate to support the decision-making processes of stakeholders in the determination of fraud

than the choice of a single model. The use of additional performance metrics like F2-score
na

addressed the practical applicability of the classification models, given the imbalance of error
costs. The superior predictive capacity should be considered in combination with the model‟s
sensitivity in order to account for the implications of non-detecting the fraudulent case.
ur

We have explored the interpretation capacities of RF and HAN models on the word and
Jo

sentence levels. Both models agreed on a specific "red-flag" vocabulary; however, mostly, they
picked up on different terms. Also, out of context, these words might be misleading. The indication
of "red-flags" words is becoming increasingly unreliable with the adaptive response of the alleged
offending parties. The offered sentence-level markup showed a more robust approach to the
provision of decision support for the auditors.
Auditors must devote effort and time to risk assessment of financial misstatements, which
is tedious and complex. The utilisation of interpretable state-of-the-art technology is essential to
facilitate the detection of fraud by auditors and will significantly enhance effectiveness and
efficiency of audit work. Moreover, the presence of enhanced anti-fraud controls will prevent
individuals from committing fraudulent acts and therefore reduces fraud risks.
Journal Pre-proof

9. Conclusion
The detection of financial fraud is a challenging endeavor. The continually adapting and
complex nature of fraudulent activities necessitates the application of the latest technologies to
confront fraud. The paper examined the potential of a state-of-the-art DL model to add to the
development of advanced financial fraud detection methods. Minimal research has been conducted
on the subject of methods that combine the analysis of financial and linguistic information, and no
studies were discovered on the application of text representation based on DL to detect financial

of
statement fraud. In addition to quantitative data, we investigated the potential of the accompanying
text data in annual reports, and have emphasized the increasing significance of textual analysis for

ro
the detection of signals of fraud within financial documentation. The proposed HAN method

-p
concentrates on the content as well as the context of textual information. Unlike the BOW method,
which disregards word order and additional grammatical information, DL is capable of capturing
re
semantic associations and discerning the meanings of different word and phrase combinations.
The results have shown that the DL model achieved considerable improvement in AUC
lP

compared to the benchmark models. The findings indicate that the DL model is well suited to
identify the fraudulent cases correctly, whereas most ML models fail to detect fraudulent cases
na

while performing better at correctly identifying the truthful statements. The detection of fraudulent
firms is of great importance due to the significantly higher MC associated with fraud. Thus,
ur

specifically in the highly unbalanced case of fraud detection, it is advisable to use multiple models
designed to capture different aspects. Based on these findings, we conclude that the textual
Jo

information of the MD&A section extracted through HAN has the potential to enhance the
predictive accuracy of financial statement fraud models, particularly in the generation of warning
signals for the fraudulent behavior that can serve to support the decision making-process of
stakeholders. The distorted word order handicaps the ability of the BOW-based ML benchmarks to
offer a concise indication of the "red-flags". We offered the decision support solution to the
auditors that allows a sentence-level indication of text fragments that trigger the classifier to treat
the submitted case as fraudulent. The user can select the degree of impact of indicated sentences
and improve the timing and accuracy of the audit process.
Journal Pre-proof

Acknowledgement
Funding: This work was supported by Deutsche Forschungsgemeinschaft in the scope of
the International Research Training Group (IRTG) 1792.

References
[1] A. Abbasi, C. Albrecht, A. Vance, and J. Hansen. “Metafraud: A meta-learning framework
for detecting financial fraud”. In: MIS Quarterly: Management Information Systems 36.4
(2012), pp.1293–1327.

of
[2] ACFE. Report to the Nations 2020 Global Study on Occupational Fraud and Abuse. Tech.

ro
rep. 2020. URL:

https://acfepublic.s3-us-west-2.amazonaws.com/2020-Report-t

[3]
o-the-Nations.pdf.
-p
W. S. Albrecht, C. Albrecht, and C. C. Albrecht. “Current trends in fraud and its
re
detection”. In: Information Security Journal 17.1 (2008), pp. 2–12.
B. Bai, J. Yen, and X. Yang. “False financial statements: Characteristics of China‟s listed
lP

[4]
companies and cart detecting approach”. In: International Journal of Information
Technology and Decision Making 7.2 (2008), pp. 339–359.
na

[5] M. D. Beneish. “Detecting GAAP violation: Implications for assessing earnings

management among firms with extreme financial performance”. In: Journal of Accounting
ur

and Public Policy 16.3 (1997), pp. 271–309.

[6] M. D. Beneish. “The Detection of Earnings Manipulation”. In: Financial Analysts Journal
55.5 (1999), pp. 24–36.
[7] A. Bodnaruk, T. Loughran, and B. McDonald. “Using 10-K text to gauge financial
constraints”. In: Journal of Financial and Quantitative Analysis 50.4 (2015), pp. 623–646.
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. “Enriching Word Vectors with
Subword Information”. In: arXiv preprint arXiv:1607.04606 (2016).
[9] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Vol. 19. 1984, p. 368.
[10] L. Breiman. “Random forests”. In: Machine Learning 45.1 (2001), pp. 5–32.
[11] S. V. Brown and J. W. Tucker. “Large-Sample Evidence on Firms‟ Year-over-Year
Journal Pre-proof

MD&A Modifications”. In: Journal of Accounting Research 49.2 (2011), pp. 309–346.
[12] M. Cecchini, H. Aytug, G. J. Koehler, and P. Pathak. “Making words work: Using financial
text as a predictor of financial events”. In: Decision Support Systems 50.1 (2010), pp.
164–175.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. “Empirical evaluation of gated recurrent
neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[14] A. K. Davis, J. M. Piger, and L. M. Sedor. “Beyond the Numbers: Measuring the
Information Content of Earnings Press Release Language”. In: Contemporary Accounting
Research 29.3 (2012), pp. 845–868.

of
[15] P. M. Dechow, W. Ge, C. R. Larson, and R. G. Sloan. “Predicting Material Accounting
Misstatements”. In: Contemporary Accounting Research 28.1 (2011), pp. 17–82.

ro
[16] B. M. DePaulo, R. Rosenthal, J. Rosenkrantz, and C. Rieder Green. “Actual and Perceived

-p
Cues to Deception: A Closer Look at Speech”. In: Basic and Applied Social Psychology
3.4 (1982), pp. 291–312.
re
[17] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. “Inductive learning algorithms and
lP

representations for text categorization”. In: Proceedings of the 7th International

Conference on Information and Knowledge Management (1998), pp. 148–155.
A. Dyck, A. Morse, and L. Zingales. “Who blows the whistle on corporate fraud?” In:
na

[18]
Journal of Finance 65.6 (2010), pp. 2213–2253.
[19] R. Feldman, S. Govindaraj, J. Livnat, and B. Segal. “Management‟s tone change, post
ur

earnings announcement drift and accruals”. In: Review of Accounting Studies 15.4 (2010),
Jo

pp. 915–953.
[20] C. Gaganis. “Classification techniques for the identification of falsified financial
statements: a comparative analysis”. In: Intelligent Systems in Accounting, Finance &
Management 16.3 (2009), pp. 207–229.
[21] J. Gee and M. Button. The Financial Cost of Fraud 2019. Tech. rep. Crowe, 2019.
[22] F. H. Glancy and S. B. Yadav. “A computational model for financial reporting fraud
detection”. In: Decision Support Systems 50.3 (2011), pp. 595–601.
[23] S. Goel and J. Gangolly. “Beyond the numbers: Mining the annual reports for hidden cues
indicative of financial statement fraud”. In: Intelligent Systems in Accounting, Finance and
Management 19.2 (2012), pp. 75–89.
Journal Pre-proof

[24] S. Goel, J. Gangolly, S. R. Faerman, and O. Uzuner. “Can linguistic predictors detect
fraudulent financial filings?” In: Journal of Emerging Technologies in Accounting 7.1
(2010), pp. 25–46.
[25] S. Goel and O. Uzuner. “Do Sentiments Matter in Fraud Detection? Estimating Semantic
Orientation of Annual Reports”. In: Intelligent Systems in Accounting, Finance and
Management 23.3 (2016), pp. 215–239.
[26] P. Hajek and R. Henriques. “Mining corporate annual reports for intelligent detection of
financial statement fraud – A comparative study of machine learning methods”. In:
Knowledge- Based Systems 128 (2017), pp. 139–152.

of
[27] S. Hochreiter and J. Schmidhuber. “Long Short-Term Memory”. In: Neural Computation

ro
9.8 (1997), pp. 1735–1780.
[28] S. Y. Huang, R. H. Tsaih, and F. Yu. “Topological pattern discovery and feature extraction

-p
for fraudulent financial reporting”. In: Expert Systems with Applications 41.9 (2014), pp.
4360–4372.
re
[29] S. L. Humpherys, K. C. Moffitt, M. B. Burns, J. K. Burgoon, and W. F. Felix.
lP

“Identification of fraudulent financial statements using linguistic credibility analysis”. In:

Decision Support Systems 50.3 (2011), pp. 585–594.
T. Joachims. “A probabilistic analysis of the Rocchio algorithm with TFIDF for text
na

[30]
categorization”. In: the 14th International Conference on Machine Learning (ICML ’97)
(1997), pp. 143–151.
ur

[31] N. Kalchbrenner and P. Blunsom. “Recurrent continuous translation models”. In: EMNLP
Jo

2013 - 2013 Conference on Empirical Methods in Natural Language Processing,

Proceedings of the Conference. Association for Computational Linguistics (ACL), 2013,
pp. 1700–1709.
[32] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. “A convolutional neural network for
modelling sentences”. In: 52nd Annual Meeting of the Association for Computational
Linguistics, ACL 2014 - Proceedings of the Conference. Vol. 1. Association for
Computational Linguistics (ACL), 2014, pp. 655–665.
[33] J. M. Karpoff, A. Koester, D. S. Lee, and G. S. Martin. “Database Challenges in Financial
Misconduct Research”. In: Working Paper (2014), pp. 1–66.
[34] Y. J. Kim, B. Baik, and S. Cho. “Detecting financial misstatements with fraud intention
Journal Pre-proof

using multi-class cost-sensitive learning”. In: Expert Systems with Applications 62 (2016),
pp. 32–43.
[35] M. Kränkel and H.-E. L. Lee. “Text Classification with Hierarchical Attention Networks”.
In: (2019). URL:

https://humboldt-wi.github.io/blog/research/information_sys
tems_1819/group5_han/.
[36] M. Kraus and S. Feuerriegel. “Decision support from financial disclosures with deep
neural networks and transfer learning”. In: Decision Support Systems 104 (2017), pp.
38–48.

of
[37] M. Kraus, S. Feuerriegel, and A. Oztekin. “Deep learning in business analytics and
operations research: Models, applications and managerial implications”. In: European

ro
Journal of Operational Research 281.3 (2020), pp. 628–641.
[38]
-p
D. F. Larcker and A. A. Zakolyukina. “Detecting Deceptive Discussions in Conference
Calls”. In: Journal of Accounting Research 50.2 (2012), pp. 495–540.
re
[39] F. Li. “Annual report readability, current earnings, and earnings persistence”. In: Journal
lP

of Accounting and Economics 45.2-3 (2008), pp. 221–247.

[40] F. Li. “Textual analysis of corporate disclosures: A survey of the literature”. In: Journal of
accounting literature 29 (2010), p. 143.
na

[41] F. Li. “The information content of forward- looking statements in corporate filings-A naïve
bayesian machine learning approach”. In: Journal of Accounting Research 48.5 (2010), pp.
ur

1049–1102.
Jo

[42] C. C. Lin, A. A. Chiu, S. Y. Huang, and D. C. Yen. “Detecting the financial statement
fraud: The analysis of the differences between data mining techniques and experts‟
judgments”. In: Knowledge-Based Systems 89 (2015), pp. 459–470.
[43] C. Liu, Y. Chan, S. H. Alam Kazmi, and H. Fu. “Financial Fraud Detection Model: Based
on Random Forest”. In: International Journal of Economics and Finance 7.7 (2015).
[44] T. I. M. Loughran and B. Mcdonald. “When is a Liability not a Liability ? Textual Analysis
, Dictionaries , and 10-Ks Journal of Finance , forthcoming”. In: Journal of Finance 66.1
(2011), pp. 35–65.
[45] T. Loughran and B. Mcdonald. “Measuring readability in financial disclosures”. In:
Journal of Finance (2014).
Journal Pre-proof

[46] C. D. Manning, P. Ragahvan, and H. Schutze. “An Introduction to Information Retrieval”.

In: Information Retrieval c (2009), pp. 1–18.
[47] T. Mikolov, M. Karafiát, L. Burget, J. „ Cernocký, and S. Khudanpur. “Recurrent Neural
Network Language Modeling”. In: Interspeech. September. 2010, pp. 1045–1048.
[48] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. “Efficient Estimation of Word
Representations in Vector Space Tomas”. In: IJCAI International Joint Conference on
Artificial Intelligence (2013). URL: http://arxiv.org/abs/1301.3781.
[49] M. L. Newman, J. W. Pennebaker, D. S. Berry, and J. M. Richards. Lying words:
Predicting deception from linguistic styles. 2003.

of
[50] K. Nguyen. “Financial statement fraud:Motives, Methods, Cases and Detection”. In: The

ro
Secured Lender 51.2 (1995), p. 36.
[51] H. Öğüt, R. Aktaş, A. Alp, and M. M. Doğanay. “Prediction of financial information

-p
manipulation by using support vector machine and probabilistic neural network”. In:
Expert Systems with Applications 36.3 PART 1 (2009), pp. 5419–5423.
re
[52] J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer. “Psychological Aspects of Natural
lP

Language Use: Our Words, Our Selves”. In: Annual Review of Psychology 54.1 (2003), pp.
547–577.
J. Pennington, R. Socher, and C. D. Manning. “GloVe: Global vectors for word
na

[53]
representation”. In: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural
ur

Language Processing, Proceedings of the Conference. Association for Computational

Linguistics (ACL), 2014, pp.1532–1543.
Jo

[54] J. Perols. “Financial statement fraud detection: An analysis of statistical and machine
learning algorithms”. In: Auditing 30.2 (2011), pp. 19–50.
[55] O. S. Persons. “Using Financial Statement Data To Identify Factors Associated With
Fraudulent Financial Reporting”. In: Journal of Applied Business Research (JABR) 11.3
(2011), p. 38.
[56] T. Pourhabibi, K.-L. Ong, B. H. Kam, and Y. L. Boo. “Fraud detection: A systematic
literature review of graph-based anomaly detection approaches”. In: Decision Support
Systems 133 (2020), p. 113303. URL:

https://www.sciencedirect.com/science/article/pii/S01679236
20300580?via%3Dihub.
Journal Pre-proof

[57] L. D. Purda and D. Skillicorn. “Reading between the Lines: Detecting Fraud from the
Language of Financial Reports”. In: SSRN Electronic Journal (2012).
[58] L. Purda and D. Skillicorn. “Accounting Variables, Deception, and a Bag of Words:
Assessing the Tools of Fraud Detection”. In: Contemporary Accounting Research 32.3
(2015), pp. 1193–1223.
[59] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. “Language models are
unsupervised multitask learners”. In: OpenAI Blog 1.8 (2019), p. 9.
[60] G. Rao, W. Huang, Z. Feng, and Q. Cong. “LSTM with sentence representations for
document-level sentiment classification”. In: Neurocomputing 308 (2018), pp. 49–57.

of
[61] P. Ravisankar, V. Ravi, G. Raghava Rao, and I. Bose. “Detection of financial statement
fraud and feature selection using data mining techniques”. In: Decision Support Systems

ro
50.2 (2011), pp. 491–500.
[62]
-p
Z. Rezaee. “Causes, consequences, and deterence of financial statement fraud”. In: Critical
Perspectives on Accounting 16.3 (2005), pp. 277–298.
re
[63] M. T. Ribeiro, S. Singh, and C. Guestrin. “"Why Should I Trust You?": Explaining the
lP

Predictions of Any Classifier”. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August
na

13-17, 2016. 2016, pp. 1135–1144.

[64] S. A. Richardson, R. G. Sloan, M. T. Soliman, and I. Tuna. “Accrual reliability, earnings
persistence and stock prices”. In: Journal of Accounting and Economics 39.3 (2005), pp.
ur

437–485.
Jo

[65] Securities and Exchange Commission. “Division of Corporation Finance: Standard

Industrial Classification (SIC) Code List”. In: (2019). URL:

https://www.sec.gov/info/edgar/siccodes.htm.
[66] T. W. Singleton and A. J. Singleton. Fraud Auditing and Forensic Accounting, Fourth
Edition. 2011.
[67] P. C. Tetlock. “Giving content to investor sentiment: The role of media in the stock
market”. In: Journal of Finance 62.3 (2007), pp. 1139–1168.
[68] C. S. Throckmorton, W. J. Mayew, M. Venkatachalam, and L. M. Collins. “Financial fraud
detection using vocal, linguistic and financial cues”. In: Decision Support Systems 74
(2015), pp. 78–87.
Journal Pre-proof

[69] A. J.-P. Tixier. “Notes on Deep Learning for NLP”. In: (2018). URL:

https://arxiv.org/abs/1808.09772.
[70] US Securities and Exchange Comission. “Agency Financial Report”. In: US Department of
State (2019). URL:

https://www.sec.gov/files/sec-2019-agency-financial-report.
pdf#mission.
[71] J. West and M. Bhattacharya. Intelligent financial fraud detection: A comprehensive
review. 2016.
[72] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical Attention

of
Networks for Document Classification. Tech. rep. In Proc. of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human

ro
Language Technologies, 2016, pp. 1480–1489.
[73]
Natural Language
-p
W. Yin, K. Kann, M. Yu, and H. Schütze. “Comparative Study of CNN and RNN for
Processing”. In: (2017). URL:
re
http://arxiv.org/abs/1702.01923.
C. Zhou, C. Sun, Z. Liu, and F. C. M. Lau. “A C-LSTM Neural Network for Text
lP

[74]
Classification”. In: (2015). URL: http://arxiv.org/abs/1511.08630.
[75] L. Zhou, J. K. Burgoon, J. F. Nunamaker, and D. Twitchell. “Automating
na

Linguistics-Based Cues for detecting deception in text-based asynchronous

computer-mediated communication”. In: Group Decision and Negotiation 13.1 (2004), pp.
ur

81–106.
Jo

[76] E. Zinovyeva, W. K. Härdle, and S. Lessmann. “Antisocial online behavior detection using
deep learning”. In: Decision Support Systems online first, doi:10.1016/j.dss.2020.113362
(2020).
Journal Pre-proof

Biographical Note

Patricia Craja
Patricia Craja received her B.Sc. degree in Mathematics and M.Sc. degree in Statistics, from the
Technical University of Berlin and the Humboldt University of Berlin, Germany, in 2013 and 2020,
respectively. Currently, she is working as a Data Science Freelancer. Her current research
interests are in the fields of natural language processing and decision making.

Alisa Kim
Alisa Kim is a researcher at Humboldt University, focusing on the application of deep learning and
natural language processing in financial and regulatory sectors. She got her Master's degree in

of
Management from St. Andrews University and worked in investment banking and consulting
before joining the Business Informatics Chair of HU as a full-time researcher and educator.

ro
Stefan Lessmann
Stefan Lessmann received a diploma in business administration and a PhD from the University of

-p
Hamburg in 2002 and 2007, respectively. Stefan worked as a lecturer and senior lecture in
business informatics at the Institute of Information Systems of the University of Hamburg. Since
re
2008, Stefan is a guest lecturer at the School of Management of University of Southampton,
where he teaches under- and postgraduate courses on quantitative methods, electronic business,
lP

and web application development. Stefan completed his habilitation in the area of predictive
analytics in 2012. In 2014, Stefan joined the Humboldt-University of Berlin, where he heads the
Chair of Information Systems at the School of Business and Economics. Stefan published several
papers in leading international journals and conferences, including the European Journal of
na

Operational Research, the IEEE Transactions of Software Engineering, and the International
Conference on Information Systems. He actively participates in knowledge transfer and
ur

consulting projects with industry partners; from small start-up companies to global players.
Jo
Journal Pre-proof

CRediT author statement

Patricia Craja: Conceptualization, Methodology, Software, Validation, Formal

analysis,Investigation,Writing - Original Draft, Visualization

Alisa Kim: Conceptualization, Methodology, Software, Validation, Formal

analysis,Writing - Review & Editing, Visualization

of
Stefan Lessmann: Conceptualization,Validation, Supervision

ro
-p
re
lP
na
ur
Jo
Journal Pre-proof

Highlights

● Combining financial and text data enhances fraudulent financial statements detection
● HAN, GPT-2, ANN and XGB detect financial misstatements based on textual cues
● Novel NLP techniques allow to capture content and context of MD&As
● Interpretability offered with “red-flag” sentences in the MD&As of annual reports
● The proposed models provide decision support for stakeholders

of
ro
-p
re
lP
na
ur
Jo
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5

2011 DSS Detecting Evolutionary Financial Statement Fraud PDF
No ratings yet
2011 DSS Detecting Evolutionary Financial Statement Fraud PDF
7 pages
Financial Ratios and Fraud Detection
No ratings yet
Financial Ratios and Fraud Detection
14 pages
Combating Tax Crimes: OECD Academy Insights
No ratings yet
Combating Tax Crimes: OECD Academy Insights
25 pages
Behavioural Insights For Better Tax Administration A Brief Guide
No ratings yet
Behavioural Insights For Better Tax Administration A Brief Guide
36 pages
International Taxation: Tax Havens & Pricing
No ratings yet
International Taxation: Tax Havens & Pricing
163 pages
Financial Statement Fraud Detection in The Digital Age - The CPA Journal
No ratings yet
Financial Statement Fraud Detection in The Digital Age - The CPA Journal
18 pages
Beneish M-Score and Altman Z-Score To Predict Financial Statement Fraud and Financial Performance
No ratings yet
Beneish M-Score and Altman Z-Score To Predict Financial Statement Fraud and Financial Performance
9 pages
27 Financial & Cash Flow Analysis
No ratings yet
27 Financial & Cash Flow Analysis
194 pages
Machine Learning in Accounting & Auditing
100% (1)
Machine Learning in Accounting & Auditing
6 pages
Understanding Cash Flow in Capital Budgeting
No ratings yet
Understanding Cash Flow in Capital Budgeting
37 pages
Understanding Financial Statement Fraud
No ratings yet
Understanding Financial Statement Fraud
22 pages
Fraud Detection Techniques and Red Flags
No ratings yet
Fraud Detection Techniques and Red Flags
12 pages
Panduan Pajak Internasional dan Transfer Pricing
No ratings yet
Panduan Pajak Internasional dan Transfer Pricing
87 pages
Introduction to Forensic Accounting
100% (1)
Introduction to Forensic Accounting
27 pages
Taxation Aspects Issues of Intra Group Services 1754665356
No ratings yet
Taxation Aspects Issues of Intra Group Services 1754665356
33 pages
Fraud - Risk - Assessment PWC PDF
No ratings yet
Fraud - Risk - Assessment PWC PDF
2 pages
KPMG Fraud Survey (Ch3)
No ratings yet
KPMG Fraud Survey (Ch3)
38 pages
Cross-Border Business Alliance - Major Issues For Corporate Practice in Japan (v2019UI) F
No ratings yet
Cross-Border Business Alliance - Major Issues For Corporate Practice in Japan (v2019UI) F
23 pages
Bribery Practices in Germany
100% (1)
Bribery Practices in Germany
4 pages
IFRS EU Financial Statements 2023
No ratings yet
IFRS EU Financial Statements 2023
180 pages
Kranacher Chapter 8 Test Bank Answers
100% (1)
Kranacher Chapter 8 Test Bank Answers
10 pages
Basics of International Taxation Guide
No ratings yet
Basics of International Taxation Guide
969 pages
Jawaban Soal Latihan Sesi 1 Dan 2
No ratings yet
Jawaban Soal Latihan Sesi 1 Dan 2
10 pages
2018 Mod Chap003 - BASIC CONCEPT COST - BLOCHER
No ratings yet
2018 Mod Chap003 - BASIC CONCEPT COST - BLOCHER
20 pages
UPDATE 111120 Sore
No ratings yet
UPDATE 111120 Sore
103 pages
Tax Avoidance via Transfer Pricing Analysis
No ratings yet
Tax Avoidance via Transfer Pricing Analysis
55 pages
Comprehensive Financial Statements Overview
No ratings yet
Comprehensive Financial Statements Overview
277 pages
Jawaban Soal Latihan Sesi 3 Dan 4
No ratings yet
Jawaban Soal Latihan Sesi 3 Dan 4
7 pages
Fraud Detection With Active Data
100% (1)
Fraud Detection With Active Data
35 pages
Tax Administration 2021 Oecd
No ratings yet
Tax Administration 2021 Oecd
355 pages
Ethics and Fraud in Internal Control
No ratings yet
Ethics and Fraud in Internal Control
22 pages
IAS 8: Accounting Policies Overview
100% (1)
IAS 8: Accounting Policies Overview
83 pages
PWC Permanent Establishments at The Heart of The Matter Final
No ratings yet
PWC Permanent Establishments at The Heart of The Matter Final
24 pages
Alibaba.com: B2B E-commerce Success
No ratings yet
Alibaba.com: B2B E-commerce Success
158 pages
HR & Payroll Cycle Overview
No ratings yet
HR & Payroll Cycle Overview
9 pages
Understanding Financial Statement Fraud
100% (1)
Understanding Financial Statement Fraud
53 pages
Forensic Accountant Skills in Indonesia
No ratings yet
Forensic Accountant Skills in Indonesia
15 pages
ADIT Syllabus 2021
No ratings yet
ADIT Syllabus 2021
86 pages
Financial Statements Fraud Cases and Theory
No ratings yet
Financial Statements Fraud Cases and Theory
25 pages
Financial Statement Analysis Overview
No ratings yet
Financial Statement Analysis Overview
42 pages
Material Flow Cost Accounting (MFCA) : Pertemuan 6-7 Akuntansi Keberlanjutan
No ratings yet
Material Flow Cost Accounting (MFCA) : Pertemuan 6-7 Akuntansi Keberlanjutan
44 pages
MIA Capital Statement
No ratings yet
MIA Capital Statement
126 pages
Blockchain's Impact on Accounting
No ratings yet
Blockchain's Impact on Accounting
6 pages
Better Tax Researcher: Unlocking The Limits of The Keyword
No ratings yet
Better Tax Researcher: Unlocking The Limits of The Keyword
13 pages
Multinationality, Tax Havens, Intangible Assets and Transfer Pricing Aggressivness
No ratings yet
Multinationality, Tax Havens, Intangible Assets and Transfer Pricing Aggressivness
34 pages
(Group 7) Case 5 - Thieves in Texas
No ratings yet
(Group 7) Case 5 - Thieves in Texas
13 pages
Data Analytics in Accounting Overview
No ratings yet
Data Analytics in Accounting Overview
47 pages
Forensic Audit Procedures Discussion
No ratings yet
Forensic Audit Procedures Discussion
6 pages
Indonesian Tax Guide 2025 Overview
No ratings yet
Indonesian Tax Guide 2025 Overview
124 pages
Understanding Consumer Fraud Types
No ratings yet
Understanding Consumer Fraud Types
30 pages
Champion Group Tax and Financial Analysis
No ratings yet
Champion Group Tax and Financial Analysis
9 pages
Fraud Pentagon Dalam Mendeteksi Financial Statement Fraud
No ratings yet
Fraud Pentagon Dalam Mendeteksi Financial Statement Fraud
14 pages
M14 Salaries Tax
No ratings yet
M14 Salaries Tax
40 pages
Fraud Schemes & Ethical Dilemmas
No ratings yet
Fraud Schemes & Ethical Dilemmas
4 pages
Income Tax Case Study and Analysis
No ratings yet
Income Tax Case Study and Analysis
7 pages
ADIT Prospectus 2018
No ratings yet
ADIT Prospectus 2018
40 pages
Capital Investment Analysis Techniques
No ratings yet
Capital Investment Analysis Techniques
29 pages
KPMG - Global Profiles of The Fraudster
No ratings yet
KPMG - Global Profiles of The Fraudster
28 pages
Deep Learning in Financial Fraud Detection
No ratings yet
Deep Learning in Financial Fraud Detection
13 pages
Fraud Detection in Financial Reporting
No ratings yet
Fraud Detection in Financial Reporting
11 pages
Document 11
No ratings yet
Document 11
13 pages
Auditor's Report Essentials and Opinions
No ratings yet
Auditor's Report Essentials and Opinions
48 pages
Internal Audit Capability Model IA-CM For The Public Sector Overview
100% (3)
Internal Audit Capability Model IA-CM For The Public Sector Overview
40 pages
CH 10
No ratings yet
CH 10
28 pages
Revised Checklist
No ratings yet
Revised Checklist
4 pages
Week 2 - Lesson 2 The Accounting Process
No ratings yet
Week 2 - Lesson 2 The Accounting Process
16 pages
Tally ERP 9 - TAX AUDIT
No ratings yet
Tally ERP 9 - TAX AUDIT
15 pages
Module 1 Exercises - AT Answers To Chap 1 To 3
No ratings yet
Module 1 Exercises - AT Answers To Chap 1 To 3
6 pages
IEEE STD 1028 Reviews Presentation
No ratings yet
IEEE STD 1028 Reviews Presentation
11 pages
Irrigation Dept
100% (1)
Irrigation Dept
32 pages
Full GST in Tally Tutorial
No ratings yet
Full GST in Tally Tutorial
3 pages
Nexgram's RM90M Acquisition Plan
No ratings yet
Nexgram's RM90M Acquisition Plan
43 pages
Internal Audit Plan Template For An Mfi
No ratings yet
Internal Audit Plan Template For An Mfi
7 pages
SEBI Grade A: Companies Act Guide
No ratings yet
SEBI Grade A: Companies Act Guide
8 pages
Bank Risk Management Course
No ratings yet
Bank Risk Management Course
2 pages
APGENCO Annual Report Overview
No ratings yet
APGENCO Annual Report Overview
100 pages
Ca Niranjan Joshi
No ratings yet
Ca Niranjan Joshi
96 pages
PeopleSoft GL Points
100% (1)
PeopleSoft GL Points
14 pages
Hospitality Training Catalog
No ratings yet
Hospitality Training Catalog
100 pages
Employee Attendance Policy
No ratings yet
Employee Attendance Policy
26 pages
The Audit Process
100% (2)
The Audit Process
39 pages
Chapter 7
No ratings yet
Chapter 7
13 pages
Thermo 146C Calibrator Certification Guide
No ratings yet
Thermo 146C Calibrator Certification Guide
55 pages
Audit Evidence: Key Concepts and Procedures
No ratings yet
Audit Evidence: Key Concepts and Procedures
41 pages
ISO 13485 Gantt Chart - May 2020
100% (8)
ISO 13485 Gantt Chart - May 2020
4 pages
Powercor Technical Standards Update
No ratings yet
Powercor Technical Standards Update
83 pages
CFO Candidate Report: Manish Mittal
No ratings yet
CFO Candidate Report: Manish Mittal
11 pages
Cook Islands
No ratings yet
Cook Islands
2 pages
Managerial Accounting in Hospitality
No ratings yet
Managerial Accounting in Hospitality
3 pages
Let's Check - Answer Key
No ratings yet
Let's Check - Answer Key
29 pages

Deep Learning For Detecting Financial Statement Fraud

Uploaded by

Deep Learning For Detecting Financial Statement Fraud

Uploaded by

Journal Pre-proof

Deep learning for detecting financial statement fraud

Patricia Craja, Alisa Kim, Stefan Lessmann

To appear in: Decision Support Systems

Received date: 5 May 2020

© 2020 Published by Elsevier.

Deep Learning for detecting financial statement fraud

Keywords: fraud detection, financial statements, deep learning, text analytics

that cause misrepresentations within financial statements, such as unintended mistakes.

analysis of financial documentation, no research has focused on the application of state-of-the-art

2. Research design and contributions

3. Decision support for fraud detection

Henriques [26] JRIP(87.0), AUC

Purda and 1407/470 US TXT SVM (AUC AUC, Fraud Probability

Perols [54] 51/15934 US FIN SVM(MC Fraud Probability and MC

3.1. Text-based indicators

Textual analysis is frequently employed for the examination of corporate disclosures.

3.2. Methods and evaluation metrics -p

4.1. Neural Embeddings

4.2. Deep learning

upon this representation [30].

4.3. Hierarchical Attention Network

format by xit as a projection of the word in a continuous vector space.

both forward and backward directions:

hi = LSTM (si ), i [1, L] (4)

hi = LSTM (si ), i [T ,1] (5)

dimensions of each layer.

Figure 2: Architecture of the HAN based Fraud Detection Model

4.4. Evaluation metrics

Prior research emphasized that a higher sensitivity is preferred to higher specificity in

Calculating F1- and F2-scores requires a cutoff to threshold model-based fraud

5.3. Quantitative data

Finance data (FIN)

SVM 0.7406 0.7000 0.6747 0.4285 0.6857 0.8280 -0.0155 -0.0340

ANN 0.6782 0.6333 0.6747 0.3958 0.6758 0.6676 -0.0782 -0.0605

AUC Sensitivit Specificit F1-score F2-score Accuracy Delta Delta F1

LR 0.7682 0.7666 0.6782 0.4623 0.6984 0.8280 0.0062 -0.0144

LR 0.8371 0.7333 0.8269 0.5714 0.8145 0.8281 0.1652 0.1752

XGB 0.8983 0.7000 0.9653 0.7500 0.9187 0.9083 0.0332 -0.1661

HAN 0.9264 0.9000 0.8206 0.6506 0.8361 0.8457

The baseline accuracy is 0.8281

6.1. Modeling of financial data

6.2. Modeling of linguistic data

6.3. Modeling of text data -p

exploration of data enrichment for fraud detection.

7. Interpretation and decision support application

matching both sets

appropriate to support the decision-making processes of stakeholders in the determination of fraud

[5] M. D. Beneish. “Detecting GAAP violation: Implications for assessing earnings

and Public Policy 16.3 (1997), pp. 271–309.

representations for text categorization”. In: Proceedings of the 7th International

“Identification of fraudulent financial statements using linguistic credibility analysis”. In:

2013 - 2013 Conference on Empirical Methods in Natural Language Processing,

of Accounting and Economics 45.2-3 (2008), pp. 221–247.

[46] C. D. Manning, P. Ragahvan, and H. Schutze. “An Introduction to Information Retrieval”.

Language Processing, Proceedings of the Conference. Association for Computational

13-17, 2016. 2016, pp. 1135–1144.

[65] Securities and Exchange Commission. “Division of Corporation Finance: Standard

Linguistics-Based Cues for detecting deception in text-based asynchronous

CRediT author statement

Patricia Craja: Conceptualization, Methodology, Software, Validation, Formal

Alisa Kim: Conceptualization, Methodology, Software, Validation, Formal

You might also like