CRIMINAL JUSTICE AND BEHAVIOR, 2022, Vol. 49, No. 3, March 2022, 389–410.
DOI: https://doi.org/10.1177/00938548211040544
Article reuse guidelines: sagepub.com/journals-permissions
© 2021 International Association for Correctional and Forensic Psychology
EVALUATING FAIRNESS OF ALGORITHMIC RISK ASSESSMENT INSTRUMENTS
The Problem With Forcing Dichotomies
SAMANTHA A. ZOTTOLA
SARAH L. DESMARAIS
Policy Research Associates
EVAN M. LOWDER
George Mason University
SARAH E. DUHART CLARKE
RTI International
Researchers and stakeholders have developed many definitions to evaluate whether algorithmic pretrial risk assessment
instruments are fair in terms of their error and accuracy. Error and accuracy are often operationalized using three sets of
indicators: false-positive and false-negative percentages, false-positive and false-negative rates, and positive and negative
predictive value. To calculate these indicators, a threshold must be set, and continuous risk scores must be dichotomized. We
provide a data-driven examination of these three sets of indicators using data from three studies on the most widely used
algorithmic pretrial risk assessment instruments: the Public Safety Assessment, the Virginia Pretrial Risk Assessment
Instrument, and the Federal Pretrial Risk Assessment. Overall, our findings highlight how conclusions regarding fairness are
affected by the limitations of these indicators. Future work should move toward examining whether there are biases in how
the risk assessment scores are used to inform decision-making.
Keywords: Risk assessment; pretrial; bias; algorithmic fairness; Public Safety Assessment; Virginia Pretrial Risk
Assessment Instrument; Federal Pretrial Risk Assessment
AUTHORS' NOTE: Correspondence concerning this article should be addressed to Samantha A. Zottola, Policy Research Associates, 345 Delaware Ave, Delmar, NY 12054; e-mail: szottola@prainc.com
INTRODUCTION
The fairness of algorithmic risk assessment instruments has received significant public
and legal attention in recent years (Angwin et al., 2016; Hamilton, 2018; Hellman, 2020;
Mayson, 2019; Minow et al., 2019; Pretrial Justice Institute, 2020; Slobogin, 2018;
Southerland & Woods, 2018; U.S. Department of Justice, 2014). This attention has included
discussion regarding whether algorithms developed from criminal justice data can only
exacerbate racial disparities by carrying forward the racism present in existing criminal justice practices (e.g., disparate arrest rates experienced by Black people; Kochel et al., 2011) and whether algorithms will convey a false, and therefore harmful, sense of objectivity (Eckhouse et al., 2019; Robinson & Koepke, 2019). As part of these discussions, concerns have been raised regarding whether the instruments produce risk estimates that demonstrate evidence of bias in general and as a function of race, specifically. While these concerns are relevant to algorithmic risk assessment instruments in general, discussion has centered on instruments used to inform pretrial decision-making (Vincent & Viljoen, 2020). Briefly, algorithmic pretrial risk assessment instruments are designed to evaluate the likelihood that a person will fail to appear in court or be arrested for a new crime on pretrial release. The likelihood of these outcomes (collectively referred to as pretrial misconduct) is calculated via an algorithm that weights and sums ratings on items such as criminal history, current charges, or personal characteristics to produce a scaled risk score representing the likelihood of pretrial misconduct.
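To make the weight-and-sum mechanics concrete, the following is a minimal sketch of how such a scoring algorithm can work. The item names, weights, and cut points are invented for illustration only; they do not reproduce the scoring rules of the Public Safety Assessment or any other instrument examined in this article.

```python
# Hypothetical illustration of a weight-and-sum pretrial risk algorithm.
# Item names, weights, and cut points are invented for this sketch and do
# not reproduce any actual instrument's scoring rules.

ITEM_WEIGHTS = {
    "pending_charge": 1,
    "prior_misdemeanor_conviction": 1,
    "prior_felony_conviction": 2,
    "prior_failure_to_appear": 2,
    "age_under_23": 1,
}

# Cut points mapping the raw weighted sum onto a 1-6 scaled risk score:
# each tuple is (maximum raw total, scaled score).
SCALE_CUTS = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 5)]


def raw_score(ratings: dict) -> int:
    """Weight and sum the 0/1 item ratings."""
    return sum(ITEM_WEIGHTS[item] * int(bool(value)) for item, value in ratings.items())


def scaled_score(raw: int) -> int:
    """Convert the raw total into a scaled score representing relative risk."""
    for max_total, scale in SCALE_CUTS:
        if raw <= max_total:
            return scale
    return 6  # highest scaled score


ratings = {
    "pending_charge": 1,
    "prior_misdemeanor_conviction": 0,
    "prior_felony_conviction": 1,
    "prior_failure_to_appear": 1,
    "age_under_23": 0,
}
total = raw_score(ratings)
print(total, scaled_score(total))  # a raw total of 5 maps to a scaled score of 5
```

In practice, instruments differ in which items they rate, how heavily each item is weighted, and where the cut points fall, which is one reason the choice of threshold matters so much for the indicators discussed next.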
Currently, there is a great deal of debate around whether algorithmic risk assessments
should be part of pretrial reform efforts sweeping the United States. Those who support the
use of algorithmic risk assessment instruments assert that these instruments are one strategy that may help
promote transparency, accuracy, and consistency in pretrial decisions (Desmarais & Lowder,
2019). They also point out that there is relatively limited empirical evidence demonstrating
differential predictive validity across groups (Desmarais et al., 2020; Vincent & Viljoen,
2020), but that when identified, differential prediction may highlight biases in the criminal
justice system that are otherwise difficult to quantify (Mayson, 2019; Viljoen & Vincent,
2020). Critics of these instruments, in contrast, assert that algorithmic pretrial risk assessment instruments are racially biased and exacerbate racial disparities in pretrial decisions (Angwin et al., 2016; Pretrial Justice Institute, 2020). These differing viewpoints on whether there is racial bias in algorithmic pretrial risk assessments may reflect, at least in part, the various definitions—and consequently, indicators—used for determining whether an algorithmic pretrial risk assessment instrument produces "fair" or "biased" results (see for example, Berk et al., 2021; Chouldechova, 2017; Kleinberg et al., 2018; Mayson, 2019).
Indeed, stakeholders with differing views on the issue tend to rely on evidence derived
using different indicators of predictive accuracy (Hamilton, 2018). Our goal is to examine
three sets of indicators that are being emphasized by advocates, legal scholars, and other
stakeholders as measures of fairness: (a) true-positive, false-positive, false-negative, and
true-negative percentages; (b) false-positive (FPRs) and false-negative rates (FNRs); and
(c) positive (PPVs) and negative predictive values (NPVs). In the sections that follow, we
describe how these indicators are being used to evaluate fairness of algorithmic risk assessment instruments in the pretrial context.
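To ground these definitions, the sketch below dichotomizes continuous risk scores at a threshold, tabulates the resulting confusion matrix, and computes all three sets of indicators, repeating the calculation within two subgroups. The scores, outcomes, threshold, and group labels are all invented for illustration.

```python
# Hypothetical illustration of the three sets of indicators: (a) true/false
# positive and negative percentages, (b) FPR and FNR, and (c) PPV and NPV.
# Scores, outcomes, threshold, and groups below are invented for this sketch.

from dataclasses import dataclass


@dataclass
class Indicators:
    tp_pct: float  # true positives as a percentage of all cases
    fp_pct: float  # false positives as a percentage of all cases
    fn_pct: float  # false negatives as a percentage of all cases
    tn_pct: float  # true negatives as a percentage of all cases
    fpr: float     # false-positive rate: FP / (FP + TN)
    fnr: float     # false-negative rate: FN / (FN + TP)
    ppv: float     # positive predictive value: TP / (TP + FP)
    npv: float     # negative predictive value: TN / (TN + FN)


def compute_indicators(scores, outcomes, threshold):
    """Dichotomize scores at `threshold`, then tabulate the confusion matrix."""
    tp = fp = fn = tn = 0
    for score, misconduct in zip(scores, outcomes):
        predicted_high_risk = score >= threshold
        if predicted_high_risk and misconduct:
            tp += 1
        elif predicted_high_risk and not misconduct:
            fp += 1
        elif misconduct:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn

    def safe(num, den):
        return num / den if den else float("nan")

    return Indicators(
        tp_pct=100 * tp / n, fp_pct=100 * fp / n,
        fn_pct=100 * fn / n, tn_pct=100 * tn / n,
        fpr=safe(fp, fp + tn), fnr=safe(fn, fn + tp),
        ppv=safe(tp, tp + fp), npv=safe(tn, tn + fn),
    )


# Invented example: the same indicators computed separately for two subgroups.
scores = {"group_a": [1, 2, 3, 4, 5, 6, 2, 5], "group_b": [1, 3, 4, 4, 5, 6, 6, 2]}
outcomes = {"group_a": [0, 0, 0, 1, 1, 1, 0, 0], "group_b": [0, 0, 1, 1, 0, 1, 1, 0]}

for group in scores:
    print(group, compute_indicators(scores[group], outcomes[group], threshold=4))
```

In this invented example, the two groups have different base rates of misconduct, so even at the same threshold their false-positive rates and positive predictive values differ; this kind of divergence is central to the fairness debates examined in this article.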
INDICATORS USED TO EVALUATE FAIRNESS
Much of the discussion about fairness in algorithmic risk assessment instruments is concerned with whether the error and accuracy of assessment results are the same across subgroups of people. There are dozens of indicators that can be employed to compare error and accuracy in instruments, both overall and across subgroups. A full discussion of all indicators is beyond the scope of this paper (for information on indicators not covered here, see: American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 2014; Berk et al., 2021; Chouldechova,
