Interrater Reliability of Psychopathy Checklist–Revised: Results on Multiple Analysis Levels for a Sample of Patients Undergoing Forensic Psychiatric Evaluation

Authors: Anna M. Dåderman, Åke Hellström
Published date: 01 February 2018
CRIMINAL JUSTICE AND BEHAVIOR, 2018, Vol. 45, No. 2, February 2018, 234–263.
DOI: 10.1177/0093854817747647
© 2018 International Association for Correctional and Forensic Psychology
University West
Stockholm University
Scores from the Psychopathy Checklist–Revised (PCL-R) are used to support decisions regarding personal liberty. In our
study, performed in an applied forensic psychiatric setting, intraclass correlation coefficients (ICC) for absolute agreement,
single rater (ICCA1) were .89 for the total score, .82 for Factor 1, .88 for Factor 2, and .78 to .86 for the four facets. These
results stand in contrast to lower reliabilities found in a majority of field studies. Disagreement among raters made a low
contribution (0%-5%) to variability of scores at the total score, factor, and facet levels. For individual items, ICCA1 varied
from .38 to .94, with >.80 for seven of the 20 items. Items 17 (“Many short-term marital relationships”) and 19 (“Revocation
of conditional release”) showed very low reliabilities (.38 and .43, respectively). The importance of knowledge about factors
that can affect scoring of forensic instruments (e.g., education, training, experience, motivation, raters’ personality, and qual-
ity of file data) is emphasized.
Keywords: PCL-R; interrater reliability; separate interview data; generalizability theory; Swedish forensic psychiatric
AUTHORS’ NOTE: The authors would like to thank the three anonymous peer reviewers for their construc-
tively critical comments on earlier versions of this article and Dr. Jaime S. Henderson for her editorial com-
ments. We thank the entire staff at the forensic psychiatric evaluation unit in Stockholm, run by the Swedish
National Board of Forensic Medicine, as well as Jenny Laurell, BSc, for providing PCL-R ratings for the
purpose of this research. We also thank the patients who volunteered their time by providing information for
the PCL-R ratings. Partial results from this study were orally presented at the Second World Conference on
Personality, March 31 to April 4, 2016, Búzios, Brazil; the participation in this conference was sponsored by a
grant from the research environment, Learning in and for the New Working Life (LINA), at University West,
Trollhättan, Sweden. The research time and working facilities for the writing of this article were funded by
University West, while the data collection was supported by grants from Karolinska Institutet, the Söderström-
Königska Foundation, the Swedish Carnegie Institute, the Swedish Foundation for Care Sciences and Allergy
Research, and the Swedish National Board of Forensic Medicine (Rättsmedicinalverket). The funders had no
role in study design, data collection and analysis, decision to publish, or preparation of the article.
Correspondence concerning this article should be addressed to Anna M. Dåderman, Department of Social and
Behavioural Studies, Division of Psychology, Education, and Sociology, University West, SE-461 86 Trollhättan,
Sweden; e-mail:
The Psychopathy Checklist–Revised (PCL-R; Hare, 2003) is one of the most widely
used instruments in forensic psychiatry and psychology (Hurducas, Singh, de Ruiter, &
Petrila, 2014; Neal & Grisso, 2014) to support decisions regarding offenders’ personal lib-
erty. Data presented in the manual of PCL-R (Hare, 2003) indicate that its reliability in
research contexts is good (>.80), but in recent years a growing body of research (see Table 1
below), often using generalizability theory, shows that PCL-R interrater reliability in the
real world is much lower than stated in the manual. The current study aimed at adding to our
understanding of the factors behind this discrepancy by multiple-level analyses of data from
a study conducted in an applied forensic context, with a somewhat unusual design.
The PCL-R consists of 20 items (see Table 2 for item descriptions), 18 of them catego-
rized into two factors: Interpersonal/Affective (Factor 1) and Social Deviance (Factor 2),
each factor having two facets. The interrater reliability of the PCL-R is measured by the
intraclass correlation coefficient (ICC; McGraw & Wong, 1996, Table 4). The most relevant
reliability measure for legal decisions in the real world is ICCA1, the ICC for absolute
agreement of a single rater. Interrater reliability of the
total PCL-R score, as well as of the factor scores, is reported in the manual as being
adequate (Hare, 2003). In contrast, A. K. Miller, Rufino, Boccaccini, Jackson, and Murrie
(2011) reported that agreement among 22 trained but not yet experienced raters on four
videotaped practice cases was much lower than in the manual. Only 38% of the variance in the
PCL-R total score was attributable to the evaluees’ psychopathy, whereas 19% of the vari-
ance was attributable to scoring tendencies of individual raters. These tendencies were
related to raters’ personality traits (e.g., higher rater agreeableness was associated with
lower scoring of Facet 1, Interpersonal).
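As an illustration of the statistic discussed above, the following is a minimal sketch of how ICCA1 can be computed from a complete evaluee-by-rater table of scores, using the two-way mean squares and the absolute-agreement, single-rater formula given by McGraw and Wong (1996). The function name `icc_a1` and the example scores are hypothetical and are not taken from the present study or from any cited data set.

```python
def icc_a1(scores):
    """ICC for absolute agreement, single rater (two-way model).

    `scores` is a list of rows, one row per evaluee, each row holding
    one score per rater. Complete data are assumed (no missing ratings).
    """
    n = len(scores)       # number of evaluees (rows)
    k = len(scores[0])    # number of raters (columns)

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for evaluees, raters, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # mean square, evaluees
    msc = ss_cols / (k - 1)               # mean square, raters
    mse = ss_err / ((n - 1) * (k - 1))    # mean square, residual

    # Rater variance (MSC) stays in the denominator, so systematic
    # differences between raters lower the coefficient.
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


# Hypothetical total scores: the second rater scores every evaluee
# exactly 2 points higher than the first rater.
example = [[10, 12], [20, 22], [30, 32]]
print(round(icc_a1(example), 3))
```

In this hypothetical example the two raters rank the evaluees identically, yet the coefficient falls below 1 because one rater is consistently more severe. Absolute agreement therefore penalizes rater scoring tendencies of the kind described by A. K. Miller et al. (2011), which a consistency-type coefficient would ignore.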
In most research studies, PCL-R scoring is conducted by well-trained research assistants
who make their ratings under more favorable conditions than those found in real-world practice,
but there are very few studies in which clinicians perform ratings for research purposes only.
Recently, Blais, Forth, and Hare (2017) reported results from a research study in which 280
trained raters (both researchers and clinicians) of the Dark Stone Research Group provided
PCL-R ratings on six videotaped practice cases, three of the cases having high PCL-R
scores. The ratings were compared with a criterion of PCL-R scores determined by the con-
sensus of three experienced PCL-R users (one academic researcher and two clinical cor-
rectional psychologists). Values of ICCA1 were .75, .65, and .78 for PCL-R Total, Factor 1,
and Factor 2, respectively. ICCA1 values on the facet level were .66, .51, .83, and .61 for
Interpersonal, Affective, Lifestyle, and Antisocial, respectively. These values are lower than
those given in the manual. Across all raters, ICCA1 was highest for
the three cases with the highest PCL-R scores, suggesting that offenders with intermediate
and low scores are more difficult to assess than genuine psychopaths.
There is no absolute criterion for determining what level of interrater reliability is “reli-
able enough.” Cicchetti (1994) provided commonly cited ICC cutoff values, with interrater
reliability described as “poor” for ICC < .40, “fair” between .40 and .59, “good” between
.60 and .74, and “excellent” at .75 or above. Rosenthal and Rosnow (1991) proposed that the
threshold of good reliability for professional ratings should be at a minimum of .85. Heilbrun
(1992) suggested that “the use of tests with a reliability coefficient of less than .80 is not
advisable” (p. 265), without referring to any particular statistical measure of interrater reliability (such
as ICC or Cohen’s kappa). Heilbrun argued that wherever results from forensic instruments
influence personal liberty, the employed instrument should possess good (at least .80) test–
retest and interrater reliability. The main reason, according to Heilbrun, for using a higher

