Evidence-based sentencing and the scientific rationalization of discrimination.

Author: Starr, Sonja B.
Position: III. Assessing the Evidence for Evidence-Based Sentencing through Conclusion, with footnotes and tables, p. 842-872

    Protecting society from crime while avoiding excessive incarceration is no doubt an important interest, even a "compelling" one. But the Constitution and good policy also require assessing the strength of the relationship between EBS and that interest. When heightened scrutiny applies, it is the state's burden to provide convincing evidence establishing that relationship. In this Part, I show that the current empirical evidence does not suffice, with a focus on three main concerns. In Subpart A, I consider the limited ability of EBS instruments to precisely and accurately predict individual recidivism, as opposed to group recidivism rates. This individual-versus-group distinction has been raised by others, particularly in criminological literature, but it is often elided by EBS advocates. Because its legal implications are important (and previously unrecognized), I discuss it in some detail. Next, I turn to more novel criticisms. In Subpart B, I critically examine the underlying research and show that there is not yet any persuasive evidence even to support the basic claim that the actuarial instruments outperform judges' individual risk predictions, much less to show that the constitutionally problematic variables add substantial marginal predictive value to the instruments. In Subpart C, I argue that the instruments do not even address the right empirical question: the effect of sentencing decisions on the defendant's recidivism risk.

    A. Precision, Group Averages, and Individual Predictions

      The instruments' first serious limitation is that they do not provide anything even approaching a precise prediction of an individual's recidivism risk. The models are designed to predict the average recidivism rate for all offenders who share with the defendant whichever characteristics are included as variables in the model. If the model is well specified and based on an appropriate and large enough sample, then it might perform this task well. But because individuals vary much more than groups do, even a relatively precisely estimated model will often not do well at predicting individual outcomes in particular cases. "It is a statistical truism that the mean of a distribution tells us about everyone, yet no one." (168) Social scientists sometimes refer to the broader ranges attached to individual predictions as "prediction intervals" (or sometimes as "forecast uncertainty" or "confidence intervals for a forecast") to distinguish them from the "confidence intervals" that are estimated for the group mean or for the effect of a given variable.

      To illustrate simply, let's start with an example that involves predicting a continuous outcome (height)--rather than a binary future event--using just one explanatory variable (sex). The height distributions of the U.S. male and female populations look approximately like Figure 1, which is based on average heights of 70 inches for males and 65 inches for females, and normal distributions with standard deviations of 3 inches and 2.5 inches, respectively. (169)
[Figure 1: Height distributions of the U.S. male and female populations]

      But suppose one did not know the true population distributions and had to estimate them using a random sample. With a large enough sample, it is easy to obtain quite precise estimates of the male and female averages and the difference between them. This point is illustrated in Table 1. I created simulated data for a "true population" of men and women that has the height distributions shown in Figure 1. Then I drew from that population random samples with sample sizes of 20, 200, and 400, regressed height on gender within each sample, and recorded the predicted mean heights for men and women and the confidence intervals for those means.

      How close each sample comes to approximating the true population means involves chance: different random samples of the same size may have different means. But chance can be expected to play a smaller role as the sample gets larger. This expectation is captured in the confidence intervals for the mean, which get narrower as the sample gets larger. Confidence intervals are a way of accounting for chance in sampling. For the 400-person sample, one can express 95% confidence in quite a precise estimate of the mean: for males, between 69.6 inches and 70.4 inches, and for females, between 64.5 inches and 65.3 inches. (170) If you keep drawing additional 400-person samples, they don't tend to differ very much; with that sample size, one can generally do quite a good job approximating the underlying population, which is why the confidence interval is narrow. Meanwhile, the 20-person sample produces much wider confidence intervals, spanning more than three inches.
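The sampling exercise described above can be sketched in a few lines of Python. This is a toy reconstruction, not the article's actual simulation; the only inputs are the Figure 1 population parameters (female mean 65 inches, standard deviation 2.5 inches) and the standard normal-approximation formula for a confidence interval:

```python
import random
import statistics

random.seed(1)

def mean_ci(sample):
    """Approximate 95% confidence interval for the sample mean."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * se, m + 1.96 * se

# Simulated "true population" of women (mean 65 in, sd 2.5 in, per Figure 1)
women = [random.gauss(65, 2.5) for _ in range(100_000)]

for n in (20, 200, 400):
    sample = random.sample(women, n)
    lo, hi = mean_ci(sample)
    print(f"n = {n:3d}: 95% CI for the mean = ({lo:.1f}, {hi:.1f}) inches")
```

Because the standard error shrinks with the square root of the sample size, the interval for the 400-person sample is several times narrower than the interval for the 20-person sample, mirroring the pattern in Table 1.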

      But what if you wanted to use your 400-person sample not to estimate the averages for the population but to predict the height of just the next random woman you meet? Your single best guess would be the female mean from your sample, which is 64.9 inches. But you wouldn't be nearly as confident in that prediction as you would be in the prediction for the group mean. In fact, within the same 400-person sample used above, only 13.5% of women have heights that are between 64.5 and 65.3 inches, which was your 95% confidence interval for the group mean. If you wanted to give an individual forecast for that next woman that you could be 95% confident in, it would have to be much less precise; you could predict that she would be somewhere between 59.5 inches and 70.3 inches, the 95% individual forecast interval shown in Table 1. In other words, you don't know much at all about how tall to expect the next woman to be. (171)

      One could make the example much more complicated, but the individual forecast interval is always wider than the confidence interval for the mean--generally much wider. (172) Note that while the confidence intervals for the means get much narrower as the sample grows, the individual forecast interval does not. The underlying uncertainty that it reflects is not mainly sampling error; it's the variability in the underlying population that we saw in Figure 1.
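The contrast can be made concrete with a minimal sketch (my own illustration, again using the Figure 1 parameters and normal-approximation formulas): as the sample grows, the confidence interval for the mean shrinks, but the approximate 95% individual forecast interval, which must span the population's own variability, barely moves.

```python
import random
import statistics

random.seed(2)

# Simulated female heights (mean 65 in, sd 2.5 in, as in Figure 1)
women = [random.gauss(65, 2.5) for _ in range(100_000)]

for n in (20, 400):
    sample = random.sample(women, n)
    m, sd = statistics.mean(sample), statistics.stdev(sample)
    se = sd / n ** 0.5
    ci_width = 2 * 1.96 * se   # 95% CI for the group mean
    pi_width = 2 * 1.96 * sd   # approx. 95% forecast interval for one woman
    print(f"n = {n:3d}: mean CI width = {ci_width:.1f} in, "
          f"individual forecast width = {pi_width:.1f} in")
```

The forecast width stays near 2 × 1.96 × 2.5 ≈ 9.8 inches regardless of n, on the order of the roughly eleven-inch spread (59.5 to 70.3 inches) reported in Table 1; Table 1's interval is slightly wider because it also folds in sampling uncertainty about the mean.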

      The same basic intuition applies to models of binary outcomes, such as whether a defendant will recidivate. Some of the recidivism risk prediction instruments include confidence intervals for the probabilities they predict. Indeed, some scholars have urged that confidence intervals (rather than mere point estimates) should always be provided so that judges can get an idea of how precise the instruments are. (173) But given that judges are using the instruments to predict individual recidivism risk, providing them with a confidence interval for the group recidivism rate might misleadingly represent the instrument's precision. For instance, if judges are told, "The estimated probability that Defendant X will recidivate is 30%, and the 95% confidence interval for that prediction is 25% to 35%," that may well sound to the judge like a reasonably precise individual prediction, but it is not. It is merely a reasonably precise estimate of an average recidivism rate. (174)

      With binary outcomes, though, while the confidence interval for the mean may be misleading, the individual forecast interval is not a very useful alternative, because it tells you nothing that was not made clear by the point estimate itself. Unless the predicted probability is extremely low or extremely high, a 95% individual prediction interval will always run from 0 to 1, meaning that the only prediction that can be made with 95% confidence is that the individual will either recidivate or not. (175) This fact does not reflect poorly on the design of the prediction instruments or on the underlying research. It reflects the inherent uncertainty of this predictive task and the binary nature of the outcome.
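The logic above can be stated mechanically: for a single binary event with predicted probability p, the smallest outcome set carrying at least 95% probability includes both outcomes unless p itself is below 0.05 or above 0.95. A toy illustration of that logic (not any instrument's actual output):

```python
def prediction_set(p, level=0.95):
    """Smallest set of outcomes with total probability >= level,
    for a single binary event with P(outcome = 1) = p."""
    if p >= level:
        return {1}           # confident the event occurs
    if 1 - p >= level:
        return {0}           # confident it does not
    return {0, 1}            # can only say "it will happen or it won't"

print(prediction_set(0.30))  # {0, 1} -- the uninformative interval
print(prediction_set(0.02))  # {0} -- collapses only at extreme probabilities
```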

      In order to assess how well a model predicts recidivism risk for individuals, some other metric is necessary. (176) There is no single agreed-upon method for assessing the individual predictive accuracy of a binary model. One common metric used in the recidivism prediction literature is called the "area under the curve" (AUC) approach. (177) This method pairs each person who ended up recidivating with a random person who did not; the score is the fraction of these pairs in which the recidivist had been given the higher predicted risk score. A perfect, omniscient model would rank all eventual recidivists higher than all eventual non-recidivists, giving it an AUC score of 1, while coin flips would on average produce a score of 0.5. The best published scores for recidivism prediction instruments appear to be around 0.75, and these are rich models that include various dynamic risk factors, such as detailed psychological assessments, unlike simple point systems based on objective factors. (178) Many studies have reported AUC scores closer to 0.65. (179) By comparison, a prominent meta-analysis of studies of psychologists' clinical predictions of violence found a mean AUC score of 0.73, which the author characterized as a "modest, better-than-chance level of accuracy." (180) As another point of comparison, if one turns height into a binary variable called "tall" (above-average height), our basic, one-variable model does much better at predicting who will be tall than any actuarial model does at predicting who will recidivate; it has an AUC score of 0.825. (181) This is despite the fact that, as we saw, the model gives rather wide bounds for individual predictions of height; gender is actually quite a strong predictor of height, but it still leaves considerable individual variation unexplained. (182)
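The AUC metric is easy to compute directly from the pairing definition given above. The sketch below (my own illustration, not the article's calculation) implements it and rebuilds the "tall" example with a score that is simply an indicator for being male, applied to simulated heights drawn from the Figure 1 distributions:

```python
import random
import statistics

random.seed(3)

def auc(scores, outcomes):
    """Fraction of (positive, negative) pairs in which the positive case
    received the higher score; ties count as one-half."""
    pos = [s for s, y in zip(scores, outcomes) if y]
    neg = [s for s, y in zip(scores, outcomes) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Simulated heights from the Figure 1 distributions
people = ([("M", random.gauss(70, 3.0)) for _ in range(2000)]
          + [("F", random.gauss(65, 2.5)) for _ in range(2000)])
cutoff = statistics.mean(h for _, h in people)  # "tall" = above-average height
scores = [1 if sex == "M" else 0 for sex, _ in people]
tall = [h > cutoff for _, h in people]
print(f"AUC = {auc(scores, tall):.2f}")  # typically near 0.8 with these parameters
```

With this simulated population and a crude one-variable score, the result lands close to the 0.825 figure reported in the text, comfortably above the 0.65 to 0.75 range published for recidivism instruments.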

      Another simple measure of prediction accuracy is the linear correlation between predicted probabilities and actual outcomes for offenders. This measure will be 0 if the instrument explains nothing more than chance and 1 if it predicts perfectly. (183) In 1994, a prominent meta-analysis of studies comparing several actuarial recidivism...
