Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach

Author: BIN LI, YANG BAO, Y. JULIA YU, JIE ZHANG, BIN KE
Published date: 01 March 2020
DOI: http://doi.org/10.1111/1475-679X.12292
DOI: 10.1111/1475-679X.12292
Journal of Accounting Research
Vol. 58 No. 1 March 2020
Printed in U.S.A.
YANG BAO,
BIN KE,
BIN LI,
Y. JULIA YU,§
AND JIE ZHANG
Received 7 October 2015; accepted 1 October 2019
ABSTRACT
We develop a state-of-the-art fraud prediction model using a machine
learning approach. We demonstrate the value of combining domain knowledge
and machine learning methods in model building. We select our model input
based on existing accounting theories, but we differ from prior accounting
research by using raw accounting numbers rather than financial ratios.
We employ one of the most powerful machine learning methods, ensemble
learning, rather than the commonly used method of logistic regression.
To assess the performance of fraud prediction models, we introduce a new
performance evaluation metric commonly used in ranking problems that is
more appropriate for the fraud prediction task. Starting with an identical
set of theory-motivated raw accounting numbers, we show that our new fraud
prediction model outperforms two benchmark models by a large margin: the
Dechow et al. logistic regression model based on financial ratios, and the
Cecchini et al. support-vector-machine model with a financial kernel that
maps raw accounting numbers into a broader set of ratios.
JEL codes: C53; M41
Keywords: fraud prediction; machine learning; ensemble learning
Antai College of Economics and Management, Shanghai Jiao Tong University;
Department of Accounting, NUS Business School, National University of Singapore;
Department of Finance, Economics and Management School, Wuhan University;
§McIntire School of Commerce, University of Virginia; School of Computer
Engineering, Nanyang Technological University.
Accepted by Christian Leuz. We wish to thank an anonymous reviewer, Mark
Cecchini, Luo Zuo, and workshop participants at the Singapore Tri-Uni
Accounting Research Conference, the Inaugural Conference on Intelligent
Information Retrieval in Accounting and Finance at CUHK (Shenzhen), and
HKUST for helpful comments. Part of this research is funded by a Singapore
Ministry of Education Tier 2 grant (No. MOE2012-T2-1-045). Yang Bao
acknowledges the financial support from a NSFC grant (No. 71601116) and
Shanghai Pujiang Program (No. 16PJC045). Ke Bin acknowledges the financial
support from an MOE start-up grant (No. R-521-000-032-133). Bin Li
acknowledges the financial support from National Natural Science Foundation
of China (71971164, 91646206). An online appendix to this paper can be
downloaded at
http://research.chicagobooth.edu/arc/journal-of-accounting-research/online-supplements.
The codes and data used for our best model RUSBoost are available at the
Github repository: https://github.com/JarFraud/FraudDetection.
© University of Chicago on behalf of the Accounting Research Center, 2019
1. Introduction
Accounting fraud is a worldwide problem. If not detected and prevented
on a timely basis, it can cause significant harm to the stakeholders of
fraudulent firms (e.g., Enron and WorldCom) as well as the stakeholders
of many nonfraudulent firms indirectly (Gleason, Jenkins, and Johnson
[2008], Goldman, Peyer, and Stefanescu [2012], Hung, Wong, and Zhang
[2015]). Unfortunately, accounting fraud is difficult to detect. Moreover,
even if it is detected, serious damage has usually already been done (Dyck,
Morse, and Zingales [2010]). Hence, efficient and effective methods of cor-
porate accounting fraud detection would offer significant value to regula-
tors, auditors, and investors.
The objective of this study is to develop a new model that predicts
accounting fraud out of sample using readily available financial statement
data from publicly traded U.S. firms. Following Cecchini et al. [2010] and
Dechow et al. [2011], we use the detected material accounting misstatements
disclosed in the SEC’s Accounting and Auditing Enforcement Releases (AAERs)
as our accounting fraud sample. Although there are useful nonfinancial
predictors of accounting fraud (e.g., an executive’s personal behavior),
we use only readily available financial data for two reasons.
First, fraud prediction models based on publicly available financial data
can be applied to any publicly traded firm at low cost. Second, most of the
fraud prediction models in the existing accounting literature also rely on
publicly available financial data (e.g., Green and Choi [1997], Summers
and Sweeney [1998], Beneish [1999], Cecchini et al. [2010], Dechow et al.
[2011]). By limiting the predictors to financial data only, the performance
of our fraud prediction models can be compared with the performance of
such existing models.
There is a fairly large accounting literature on the determinants of
accounting fraud (e.g., Entwistle and Lindsay [1994], Beasley [1996],
Dechow, Sloan, and Sweeney [1996], Beneish [1997, 1999], Summers and
Sweeney [1998], Efendi, Srivastava, and Swanson [2007], Brazel, Jones, and
Zimbelman [2009], Dechow et al. [2011], Schrand and Zechman [2012]),
but most of these studies aim to explain fraud within sample and often
emphasize causal inference. Our objective is different: We wish
to develop a model that can accurately predict accounting fraud out of sam-
ple (i.e., a prediction problem). Shmueli [2010] shows that the problems
of causal inference and prediction, although related, are fundamentally
different. Specifically, the objective of causal inference modeling is to
minimize the bias resulting from model misspecification to obtain the most
accurate representation of the underlying theory. In contrast, predictive
modeling seeks to minimize out-of-sample prediction error, that is, the
combination of the bias and the estimation variance that results from
using a sample to estimate model parameters.¹ Although causal inference
represents the mainstream of existing social science research, Kleinberg
et al. [2015] show that there are many interesting prediction problems that
are neglected in the extant business and economics literatures.
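The trade-off described above can be written as the standard decomposition of expected squared prediction error. The notation here is an assumption for illustration and does not appear in the paper: the outcome is y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and f̂ denotes the model estimated from a sample.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{estimation variance}}
```

Under this reading, causal inference modeling concentrates on driving the bias term toward zero, whereas a predictive model may accept some bias if doing so lowers the estimation variance by more.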
We use two types of fraud prediction models from the extant litera-
ture as benchmarks. The first is ratio-based logistic regression, commonly
used in the accounting literature (e.g., Beneish [1997, 1999], Summers
and Sweeney [1998], Dechow et al. [2011]). Such models typically use
financial ratios as predictors; the ratios are often identified by human
experts based on theories (e.g., the motivation-ability-opportunity frame-
work from the criminology literature). Among these models, the model in
Dechow et al. [2011] is generally regarded as the most comprehensive fraud
prediction model in the accounting literature. Accordingly, we adopt a similar
logistic regression model as our first benchmark model (referred to as the
Dechow et al. model). The second benchmark model is a fraud prediction
model developed by Cecchini et al. [2010] based on a more advanced ma-
chine learning method (hereafter referred to as the Cecchini et al. model).
Rather than using the financial ratios identified by human experts alone,
Cecchini et al. [2010] develop a new fraud prediction model based on sup-
port vector machines (SVM) with a financial kernel that maps raw financial
data into a broader set of ratios within the same year and changes in ra-
tios across different years. Cecchini et al. [2010] find that the SVM with
a financial kernel outperforms the traditional fraud prediction models in
accounting, including the Dechow et al. model.²
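To make the idea of mapping raw financial data into a broader set of ratios concrete, the following is a minimal Python sketch. It is not the financial kernel of Cecchini et al. [2010]: it simply enumerates pairwise ratios explicitly. The function names, the item names, the dollar figures, and the choice to measure a change as a simple year-over-year difference are all hypothetical assumptions made for illustration.

```python
from itertools import permutations


def ratio_features(raw):
    """Return every ordered ratio a/b of two distinct raw items.

    Ratios with a zero denominator are skipped. Illustrative only: this is
    an explicit enumeration, not the kernel used in the cited paper.
    """
    feats = {}
    for num, den in permutations(sorted(raw), 2):
        if raw[den] != 0:
            feats[f"{num}/{den}"] = raw[num] / raw[den]
    return feats


def ratio_change_features(raw_prev, raw_curr):
    """Year-over-year difference of each ratio available in both years.

    Using a simple difference is an assumption; other definitions of
    "change" (e.g., a percentage change) are equally plausible.
    """
    prev = ratio_features(raw_prev)
    curr = ratio_features(raw_curr)
    return {f"d({k})": curr[k] - prev[k] for k in curr if k in prev}


# Hypothetical raw items (in $ millions) for one firm in two consecutive years.
year1 = {"sales": 100.0, "receivables": 20.0, "total_assets": 200.0}
year2 = {"sales": 120.0, "receivables": 36.0, "total_assets": 240.0}

# Three raw items expand into six same-year ratios plus six ratio changes.
features = {**ratio_features(year2), **ratio_change_features(year1, year2)}
```

With n raw items the expansion yields n(n − 1) ordered ratios per year, which illustrates why such a mapping covers far more ratios than a short list chosen by human experts.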
Our proposed fraud prediction model differs from both of these benchmark
models in two key ways. First, we use ensemble learning, a state-of-the-art
machine learning paradigm, to predict fraud. Most prior fraud prediction
research in accounting uses logistic regression (see Dechow et al.
[2011] for a review). Although ensemble learning has been successfully
¹ See the online appendix for a more detailed discussion on the differences between causal
inference and prediction.
² It is important to note that the performance results of our Dechow et al. model and
Cecchini et al. model are not directly comparable to those of Dechow et al. [2011] and Cecchini
et al. [2010] because of a few crucial research design differences, explained in section 3.
