What Are You Saying? Using topic to Detect Financial Misreporting

NERISSA C. BROWN, RICHARD M. CROWLEY, AND W. BROOKE ELLIOTT

Journal of Accounting Research
Vol. 58, No. 1, March 2020
DOI: 10.1111/1475-679X.12294
Printed in U.S.A.
Received 12 September 2016; accepted 25 October 2019
ABSTRACT

We use a machine learning technique to assess whether the thematic content of financial statement disclosures (labeled topic) is incrementally informative in predicting intentional misreporting. Using a Bayesian topic modeling algorithm, we determine and empirically quantify the topic content of a large collection of 10-K narratives spanning 1994 to 2012. We find that the algorithm produces a valid set of semantically meaningful topics that predict financial misreporting, based on samples of Securities and Exchange Commission (SEC) enforcement actions (Accounting and Auditing Enforcement Releases [AAERs]) and irregularities identified from financial restatements and 10-K filing amendments. Our out-of-sample tests indicate that topic significantly improves the detection of financial misreporting by as much as 59% when added to models based on commonly used financial and textual style variables. Furthermore, models that incorporate topic significantly outperform traditional models when detecting serious revenue recognition and core expense errors. Taken together, our results suggest that the topics discussed in annual report filings and the attention devoted to each topic are useful signals in detecting financial misreporting.

University of Illinois at Urbana-Champaign; Singapore Management University. Accepted by Phil Berger. We thank an anonymous reviewer, Andrew Bauer, Matt Cobabe, Amanda Convery, Robert Davidson, Paul Demeré, Lucile Faurel, Shawn Gordon, Jing He, Shiva Rajgopal, Kristina Rennekamp, Kecia Williams Smith, Gang Wang, and workshop participants at Baruch College (City University of New York), Carnegie Mellon University, Columbia University, Hong Kong University of Science and Technology, Nagoya University, University of Illinois, U.S. Securities and Exchange Commission (Division of Economic and Risk Analysis), Virginia Tech, the 2015 AAA FARS Mid-year Meeting, the 2015 AAA Annual Meeting, the 2015 Conference on Convergence of Financial and Managerial Accounting Research, the 2016 Conference on Investor Protection, Corporate Governance, and Fraud Prevention, and the 2016 Conference on Financial Economics and Accounting for helpful comments. We also thank Xiao Yu for insightful comments on methodology and coding, Brian Gale for helpful assistance with Amazon Mechanical Turk, and Stephanie Grant, Chunlei Liu, Jill Santore, and Jingpeng Zhu for excellent research assistance. We thank Derryck Coleman and Olga Usvyatsky (formerly) of Audit Analytics for assistance with the restatement data and text search scripts used in this study. Brown gratefully acknowledges financial support from the PricewaterhouseCoopers LLP Faculty Fellowship. Elliott gratefully acknowledges financial support from the Ernst & Young Distinguished Professorship. An online appendix to this paper can be downloaded at http://research.chicagobooth.edu/arc/journal-of-accounting-research/online-supplements.

© University of Chicago on behalf of the Accounting Research Center, 2019
JEL codes: C80; K22; K42; M40; M41; M48
Keywords: topic modeling; disclosure; latent Dirichlet allocation; financial
misreporting
1. Introduction
This study investigates whether a novel text-based measure of the thematic
content of financial statement disclosures (labeled as topic) is useful for
detecting financial misreporting.¹ Detection models have long focused on
quantitative financial statement and stock market variables as predictive fac-
tors (Beneish [1997], Brazel, Jones, and Zimbelman [2009], Dechow et al.
[2011], Bao et al. [2020]). One drawback of this approach is that financial
misreporting can go undetected for multiple periods, because misreport-
ing firms often manipulate performance metrics and accounting transac-
tions to blend in better with their peers or the firm’s own past performance
(Lewis [2013]). To address this weakness, recent studies analyze the textual
and linguistic features of management disclosures, finding that summary
measures of these features serve as useful warnings of misreporting (see,
e.g., Hobson, Mayew, and Venkatachalam [2012], Larcker and Zakolyukina
[2012], Purda and Skillicorn [2015]).
Despite the usefulness of communication style in revealing misreport-
ing, the literature debates whether textual and linguistic features ade-
quately capture managers’ deliberate attempts to obfuscate or manipu-
late financial information (Bloomfield [2008], Bushee, Gow, and Taylor
[2018]). Further, as Loughran and McDonald [2016] highlight, commonly
used textual measures do not reflect the context or meaning of manage-
ment disclosures, thereby limiting the inferences that can be drawn. We
tackle these issues by introducing a machine learning tool that simultane-
ously detects and quantifies the thematic content (topic) of annual report
¹ We use the terms misreporting and misrepresentation interchangeably to refer to deliberate
violations of financial accounting standards and noncompliance with regulatory financial re-
porting rules. We refrain from using the term fraud because, in a legal sense, violations of
or noncompliance with financial reporting standards and rules are considered fraudulent
only if market participants rely on the misreported or misrepresented information to their
detriment.
narratives. This approach departs from prior text-based research by focus-
ing on what is being disclosed by management rather than how. Using this
unique measure, we evaluate the disclosure topics associated with misre-
porting and how these topics evolve. More importantly, we investigate the
incremental predictive power of topic in detecting misreporting out of sam-
ple, relative to a collection of financial and textual style measures.
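
To make the out-of-sample design concrete, the sketch below illustrates one way to measure the incremental lift from topic features over a baseline of financial and textual style features. It is a stylized illustration, not the specification estimated in this paper: the feature matrices and misreporting labels are simulated placeholders, and the classifier, splitting rule, and evaluation metric are assumptions made only for exposition.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 1000                                  # simulated firm-year observations
    X_financial = rng.normal(size=(n, 5))     # placeholder financial predictors
    X_style = rng.normal(size=(n, 3))         # placeholder textual style features
    X_topic = rng.dirichlet(np.ones(10), n)   # placeholder per-filing topic proportions
    y = (rng.random(n) < 0.1).astype(int)     # 1 = misreporting firm-year (rare event)

    # Hold out the last 30% of observations; the paper's tests train on earlier
    # years and predict later ones, which this simple split only approximates.
    cut = int(0.7 * n)

    def oos_auc(X):
        model = LogisticRegression(max_iter=1000).fit(X[:cut], y[:cut])
        return roc_auc_score(y[cut:], model.predict_proba(X[cut:])[:, 1])

    baseline = np.hstack([X_financial, X_style])
    augmented = np.hstack([baseline, X_topic])
    print(f"baseline AUC:   {oos_auc(baseline):.3f}")
    print(f"with topic AUC: {oos_auc(augmented):.3f}")

Holding the baseline fixed and appending the topic proportions isolates the incremental predictive value of disclosure content, which is the comparison underlying the detection improvements reported above.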
Our focus on the thematic content of financial statement filings draws on
the management disclosure and communications literatures. These bod-
ies of research suggest that the flexible nature of disclosure content allows
for a broader set of dimensions along which annual report narratives can
be used to identify financial misreporting, compared to quantitative finan-
cial metrics and summary measures of textual features (Hoberg and Lewis
[2017]). These literatures also argue that textual features, such as tone and
word usage, are difficult to classify as deceptive, because disclosure narra-
tives can be influenced by individuals’ expectations and motivations, even
when the intent is to communicate objectively and truthfully (Douglas and
Sutton [2003]). In that sense, the content of the disclosure and the atten-
tion devoted to each topic may better predict misreporting than how the
narrative is fashioned. We therefore examine whether the topic content of
financial statement disclosures is incrementally informative in assessing the
likelihood of misreporting, beyond textual style features. We also analyze
the ability of topic to detect misreporting, relative to quantitative financial
variables, given that these measures are typically backward-looking and less
efficient in predicting misreporting, compared to language-based measures
(Cecchini et al. [2010a], Goel and Gangolly [2012], Larcker and Zakolyuk-
ina [2012], Purda and Skillicorn [2015]).
We generate our topic measure by employing a Bayesian topic model-
ing algorithm developed by Blei, Ng, and Jordan [2003], termed Latent
Dirichlet Allocation (LDA). Similar to factor or cluster analysis, LDA is an
unsupervised and unstructured probabilistic model that “learns” or discov-
ers the latent thematic structure of words within a corpus of documents.2
The algorithm (and other variants) is widely used in practice by Internet
search engines to guide keyword selection and improve correlations be-
tween search terms and web content (Fishkin [2014]). A unique advantage
of LDA is that it does not require predetermined word dictionaries or topic
categories and instead relies on the fact that words frequently appearing
together tend to be semantically related. This process reduces researcher
bias, as foreknowledge of document content does not affect the topic clas-
sifications.³ Furthermore, the algorithm can classify the content of large
² LDA is a “bag of words” algorithm that uses the distribution of words across documents to
classify and quantify themes without the need for predefined or researcher-determined word
lists or topic categories.
³ Although LDA is unsupervised and does not rely on human input to identify topics, human judgment is necessary to interpret and label the topics inferred from the algorithm. This is because the LDA output for a given topic consists only of word clusters and word probabilities.
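
For readers unfamiliar with the mechanics, the sketch below runs the bag-of-words pipeline described above on an invented four-sentence corpus, using scikit-learn's LatentDirichletAllocation in place of our estimation code; the corpus, the two-topic setting, and all parameter choices are illustrative assumptions only.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "revenue recognized upon shipment of products to distributors",
        "goodwill impairment charges related to the acquisition",
        "revenue from long term contracts recognized over time",
        "impairment of intangible assets and goodwill writedowns",
    ]

    # Bag of words: the model sees only word counts, never word order.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(counts)   # per-document topic proportions

    # Each fitted topic is just a weighting over words; a human must read
    # the top words to interpret and label the topic (see footnote 3).
    words = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [words[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {top}")

    print(doc_topic.round(2))               # attention devoted to each topic

With only four toy sentences the fitted topics are noisy, but on a large 10-K corpus the same machinery recovers semantically coherent themes, and the per-document proportions are quantities of the kind our topic measure is built from.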
