False (and Missed) Discoveries in Financial Economics

Authors: Campbell R. Harvey, Yan Liu
Published: 01 October 2020
DOI: http://doi.org/10.1111/jofi.12951
The Journal of Finance, Vol. LXXV, No. 5, October 2020
CAMPBELL R. HARVEY and YAN LIU
ABSTRACT
Multiple testing plagues many important questions in finance such as fund and factor selection. We propose a new way to calibrate both Type I and Type II errors. Next, using a double-bootstrap method, we establish a t-statistic hurdle that is associated with a specific false discovery rate (e.g., 5%). We also establish a hurdle that is associated with a certain acceptable ratio of misses to false discoveries (Type II error scaled by Type I error), which effectively allows for differential costs of the two types of mistakes. Evaluating current methods, we find that they lack power to detect outperforming managers.
IN MANAGER SELECTION (or, equivalently, the selection of factors or trading strategies), investors can make two types of mistakes. The first involves selecting a manager who turns out to be unskilled—this is a Type I error, or a false positive.1 The second error is not selecting, or missing, a manager that the investor thought was unskilled but was not—this is a Type II error, or a false negative. Both types of errors display economically important variation. On the false positives side, for instance, one manager might slightly underperform while another manager might have a large negative return. Moreover, the cost of a Type II error is likely different from the cost of a Type I error, with the costs depending on the specific decision at hand. However, while an investor may want to pick managers using the criterion that Type I errors are, say, five times more costly than Type II errors, current tools do not allow for such a selection criterion. On the one hand, current methods ignore the Type II error rate, which may lead us to miss outperforming managers. On the other hand, it is difficult to simply characterize Type I errors because of multiple testing—using a single-hypothesis testing criterion (e.g., two standard errors from zero) will lead to massive Type I errors because, when there are thousands of managers, many will look good (i.e., appear to outperform) purely by luck. Statisticians have suggested a number of fixes that take multiple testing into account. For example, the simplest is the Bonferroni correction, which multiplies each manager's p-value by the number of managers. But this type of correction does not take the covariance structure into account. Further, it is not obvious what the Type I error rate would be after implementing the correction. We know that the error rate would be less than that under a single testing criterion—but how much less?

Campbell R. Harvey is with Duke University and National Bureau of Economic Research. Yan Liu is with Purdue University. We appreciate the comments of Stefan Nagel; two anonymous referees; as well as Laurent Barras, Claude Erb, Juhani Linnainmaa, and Michael Weber; and seminar participants at Rice University, University of Southern California, University of Michigan, University of California at Irvine, Hanken School of Economics, Man-Numeric, Research Affiliates, and the 2018 Western Finance Association meetings in San Diego. Kay Jaitly provided editorial assistance. The authors do not have any potential conflicts of interest, as identified in the JF Disclosure Policy.

Correspondence: Campbell Harvey, Duke University and National Bureau of Economic Research; e-mail: cam.harvey@duke.edu

1 Throughout our paper, we follow the empirical literature on performance evaluation and associate manager skill with alpha. In particular, we take skilled managers to be those that generate positive alpha. Our notion of manager skill is thus different from that in Berk and Green (2004), where skilled managers generate a zero net alpha in equilibrium.

© 2020 the American Finance Association
In this paper, we propose a different approach. Using actual manager data,
we first determine the performance threshold that delivers a particular Type
I error rate (e.g., 5%). We next characterize the Type II error rate associated
with our optimized Type I error rate. It is then straightforward to scale the
Type II error by the Type I error and solve for the cutoff that produces the
desired trade-off of false negatives and false positives.
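To make the selection rule concrete, here is a minimal sketch, in Python, of how a t-statistic hurdle could be read off once the two error-rate curves have been estimated (for example, by the bootstrap procedure described later). The function names, the 5% target, and the 5-to-1 cost ratio are our illustrative assumptions, not the paper's notation.

```python
import numpy as np

def hurdle_for_type1(hurdles, type1_rates, target=0.05):
    """Smallest hurdle whose estimated Type I error rate is at or below `target`.
    `hurdles` must be sorted ascending; `type1_rates` is the matching error curve."""
    ok = np.where(np.asarray(type1_rates) <= target)[0]
    return hurdles[ok[0]] if ok.size else None

def hurdle_for_error_ratio(hurdles, type1_rates, type2_rates, target_ratio=5.0):
    """Hurdle whose ratio of Type II to Type I error is closest to `target_ratio`
    (e.g., 5.0 if a false discovery is judged five times as costly as a miss)."""
    t1 = np.asarray(type1_rates, dtype=float)
    t2 = np.asarray(type2_rates, dtype=float)
    ratio = np.where(t1 > 0, t2 / np.where(t1 > 0, t1, 1.0), np.inf)
    return hurdles[int(np.argmin(np.abs(ratio - target_ratio)))]

# Illustrative error-rate curves on a grid of t-statistic hurdles
# (in practice these would be estimated by the double bootstrap).
hurdles = np.linspace(1.0, 4.0, 61)
type1 = 0.30 * np.exp(-1.5 * (hurdles - 1.0))   # falls as the hurdle rises
type2 = 1.0 - np.exp(-0.8 * (hurdles - 0.5))    # rises as the hurdle rises

print(hurdle_for_type1(hurdles, type1, target=0.05))
print(hurdle_for_error_ratio(hurdles, type1, type2, target_ratio=5.0))
```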
Our focus on both Type I and Type II errors echoes recent studies in economics that highlight the importance of examining test power. For example, in a survey of a large number of studies, Ioannidis, Stanley, and Doucouliagos (2017) show that 90% of results in many research areas are underpowered, leading to an exaggeration of the results. Ziliak and McCloskey (2004) further show that only 8% of the papers published in the American Economic Review in the 1990s consider test power. The question of test power thus represents one more challenge to research practices common in economics research (Leamer (1983), De Long and Lang (1992), Ioannidis and Doucouliagos (2013), Harvey and Liu (2013), Harvey, Liu, and Zhu (2016), Harvey (2017)).
Why is test power important for research in financial economics? On the
one hand, when a study’s main finding is the nonexistence of an effect (i.e.,
the null hypothesis is not rejected), test power directly affects the credibility
of the finding because it determines the probability of not rejecting the null
hypothesis when the effect is true. For example, in one of our applications, we
show that existing studies lack power to detect outperforming mutual funds.
On the other hand, when the main finding is the rejection of the null hypothesis
(i.e., the main hypothesis), this finding often has to survive against alternative
hypotheses (i.e., alternative explanations for the main finding). Low test power
for alternative explanations generates a high Type I error rate for the main
hypothesis (Ioannidis (2005)).
Our paper addresses the question of test power in the context of multiple
tests. Our contribution is threefold. First, we introduce a framework that offers an intuitive definition of test power. Second, we employ a double-bootstrap
approach that can flexibly (i.e., specific to a particular data set) estimate
test power. Finally, we illustrate how taking test power into account can
materially change our interpretation of important research findings in the
current literature.
In a single-hypothesis test, the Type II error rate at a particular parameter
value (in our context, the performance metric for the manager) is calculated as
the probability of failing to reject the null hypothesis at this value. In multiple
tests, the calculation of the Type II error rate is less straightforward because,
instead of a single parameter value, we need to specify a vector of nonzero
parameters, where each parameter corresponds to a single test under the alternative hypothesis.
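To fix ideas, one natural formalization is given below, using notation of our own (a hurdle h, per-manager t-statistics, and the set S of managers with truly positive alpha); the paper itself estimates these quantities by simulation rather than in closed form.

```latex
% Single test: Type II error rate at a fixed alternative alpha = a,
% for a test that rejects when the t-statistic exceeds the hurdle h:
\beta(a) = \Pr\left( t \le h \mid \alpha = a \right)

% Multiple tests: with N managers and alternatives (alpha_1, ..., alpha_N),
% only the subset S = {i : alpha_i > 0} can produce misses, so one natural
% aggregate Type II error rate averages the miss probabilities over S:
\beta(\alpha_1, \ldots, \alpha_N) =
  \frac{1}{|S|} \sum_{i \in S} \Pr\left( t_i \le h \mid \alpha_i \right)
```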
We propose a simple strategy to estimate the Type II error rate. Assuming that a fraction p0 of managers have skill, we adjust the data so that p0 of managers have skill (with their skill level set at the in-sample estimate) and the remaining 1 − p0 of managers have no skill (with their skill level set to a zero excess return or alpha). By bootstrapping from these adjusted data, we evaluate the Type II error rate through simulations. Our method thus circumvents the difficulty of specifying the high-dimensional parameter vector under the alternative hypothesis. We set the parameter vector at what we consider a reasonable value—the in-sample estimate corresponding to a certain p0. In essence, we treat p0 as a sufficient statistic, which helps estimate the Type II error rate. We interpret p0 from both a frequentist and a Bayesian perspective.
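The following sketch illustrates this adjustment and the resulting error-rate simulation under simplifying assumptions of our own: funds are ranked by their in-sample t-statistics to pick the p0 fraction treated as skilled, time periods are resampled IID, and a fixed one-sided t-statistic hurdle stands in for whatever testing procedure is being evaluated. It is not the paper's exact implementation.

```python
import numpy as np

def adjust_sample(excess_ret, p0):
    """Build the p0-adjusted population: the top p0 fraction of funds (ranked by
    in-sample t-statistic) keep their estimated alpha; the rest are demeaned so
    their in-sample alpha is exactly zero."""
    T, N = excess_ret.shape
    alpha = excess_ret.mean(axis=0)
    tstat = alpha / (excess_ret.std(axis=0, ddof=1) / np.sqrt(T))
    n_skilled = int(round(p0 * N))
    skilled = np.zeros(N, dtype=bool)
    skilled[np.argsort(tstat)[::-1][:n_skilled]] = True
    adjusted = excess_ret - np.where(skilled, 0.0, alpha)  # zero alpha for "unskilled"
    return adjusted, skilled

def error_rates(adjusted, skilled, hurdle=2.0, n_boot=1000, seed=0):
    """Bootstrap time periods, re-test every fund at `hurdle`, and average the
    Type I rate (zero-alpha funds rejected) and Type II rate (skilled funds missed)."""
    rng = np.random.default_rng(seed)
    T, N = adjusted.shape
    t1, t2 = [], []
    for _ in range(n_boot):
        sample = adjusted[rng.integers(0, T, size=T)]
        alpha = sample.mean(axis=0)
        t = alpha / (sample.std(axis=0, ddof=1) / np.sqrt(T))
        reject = t > hurdle                      # one-sided test for outperformance
        t1.append(reject[~skilled].mean())
        t2.append((~reject[skilled]).mean() if skilled.any() else np.nan)
    return np.nanmean(t1), np.nanmean(t2)

# Toy data: 240 months of benchmark-adjusted returns for 500 funds,
# 10% of which are given a true alpha of 20 bps per month.
rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.02, size=(240, 500))
returns[:, :50] += 0.002
adj, skilled = adjust_sample(returns, p0=0.10)
print(error_rates(adj, skilled, hurdle=2.0))
```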
Our strategy is related to the bootstrap approach in performance evaluation proposed by Kosowski, Timmermann, Wermers, and White (2006, KTWW) and Fama and French (2010).2 These papers use a single-bootstrap approach to adjust for multiple testing. In particular, under the assumption of no skill for all funds (p0 = 0), they demean the data to create a "pseudo" sample, Y0, for which p0 = 0 holds true in sample. They then bootstrap Y0 to test the overall hypothesis that all funds have zero alpha. Because we are interested in both the Type I and the Type II error rates associated with a given testing procedure (including those of KTWW and Fama and French (2010)), our method uses two rounds of bootstrapping. For example, to estimate the Type I error rate of Fama and French (2010), we first bootstrap Y0 to create a perturbation, Yi, for which the null hypothesis is true. We then apply Fama and French (2010) (i.e., second bootstrap) to each Yi and record the testing outcome (hi = 1 if rejection). We estimate the Type I error rate as the average hi. The Type II error rate can be estimated in similar fashion.
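A minimal sketch of the two bootstrap layers follows. The inner test here is a simplified max-t bootstrap test that we substitute for the actual Fama and French (2010) procedure, and all sample sizes and replication counts are illustrative; only the structure (outer perturbations of the demeaned sample Y0, an inner bootstrap test applied to each perturbation Yi, and averaging of the rejection indicators hi) mirrors the description above.

```python
import numpy as np

def tstats(panel):
    """Per-fund t-statistics for the mean of benchmark-adjusted returns."""
    T = panel.shape[0]
    return panel.mean(axis=0) / (panel.std(axis=0, ddof=1) / np.sqrt(T))

def max_t_bootstrap_test(panel, n_inner=200, level=0.05, rng=None):
    """Simplified bootstrap-based multiple test (stand-in for Fama-French (2010)):
    reject the joint null of zero alpha for all funds if the observed maximum
    t-statistic exceeds the (1 - level) quantile of its bootstrap null distribution."""
    if rng is None:
        rng = np.random.default_rng()
    T = panel.shape[0]
    observed = tstats(panel).max()
    demeaned = panel - panel.mean(axis=0)        # impose the null within the sample
    null_max = np.array([tstats(demeaned[rng.integers(0, T, size=T)]).max()
                         for _ in range(n_inner)])
    return observed > np.quantile(null_max, 1.0 - level)

def type1_error_rate(returns, n_outer=100, seed=0):
    """Outer bootstrap: perturb the fully demeaned data Y0 (null true by construction),
    apply the inner test to each perturbation Yi, and average the rejections hi."""
    rng = np.random.default_rng(seed)
    T = returns.shape[0]
    Y0 = returns - returns.mean(axis=0)          # p0 = 0: every fund has zero alpha
    h = [max_t_bootstrap_test(Y0[rng.integers(0, T, size=T)], rng=rng)
         for _ in range(n_outer)]
    return float(np.mean(h))

# Toy panel: 240 months for 200 funds with no true alpha.
returns = np.random.default_rng(2).normal(0.0, 0.02, size=(240, 200))
print(type1_error_rate(returns))   # roughly the nominal 5% level with this simplified inner test
```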
After introducing our framework, we turn to two empirical applications to
illustrate how our framework helps address important issues related to Type I
and Type II errors associated with multiple tests. We first apply our method to
the selection of two sets of investment factors. The first set includes hundreds
of backtested factor returns. For a given p0, our method allows investors to
measure the Type I and Type II error rates for these factors and thus make
choices that strike a balance between Type I and Type II errors. When p0 is uncertain, investors can use our method to evaluate the performance of existing multiple-testing adjustments and select the adjustment that works well regardless of the value of p0 or for a range of p0 values. Indeed, our application
2 See Harvey and Liu (2019) for another application of the bootstrap approach to the test of factor models.
