Interrater Reliability of Psychopathy Checklist–Revised: Results on Multiple Analysis Levels for a Sample of Patients Undergoing Forensic Psychiatric Evaluation

Published: 1 February 2018
CRIMINAL JUSTICE AND BEHAVIOR, 2018, Vol. 45, No. 2, February 2018, 234–263.
DOI: 10.1177/0093854817747647
© 2018 International Association for Correctional and Forensic Psychology
University West
Stockholm University
Scores from the Psychopathy Checklist–Revised (PCL-R) are used to support decisions regarding personal liberty. In our
study, performed in an applied forensic psychiatric setting, intraclass correlation coefficients (ICC) for absolute agreement,
single rater (ICCA1) were .89 for the total score, .82 for Factor 1, .88 for Factor 2, and .78 to .86 for the four facets. These
results stand in contrast to the lower reliabilities found in a majority of field studies. Disagreement among raters contributed
little (0%-5%) to the variability of scores at the total score, factor, and facet levels. For individual items, ICCA1 varied
from .38 to .94, exceeding .80 for seven of the 20 items. Items 17 (“Many short-term marital relationships”) and 19 (“Revocation
of conditional release”) showed very low reliabilities (.38 and .43, respectively). The importance of knowledge about factors
that can affect scoring of forensic instruments (e.g., education, training, experience, motivation, raters’ personality, and qual-
ity of file data) is emphasized.
Keywords: PCL-R; interrater reliability; separate interview data; generalizability theory; Swedish forensic psychiatric
AUTHORS’ NOTE: The authors would like to thank the three anonymous peer reviewers for their construc-
tively critical comments on earlier versions of this article and Dr. Jaime S. Henderson for her editorial com-
ments. We thank the entire staff at the forensic psychiatric evaluation unit in Stockholm, run by the Swedish
National Board of Forensic Medicine, as well as Jenny Laurell, BSc, for providing PCL-R ratings for the
purpose of this research. We also thank the patients who volunteered their time by providing information for
the PCL-R ratings. Partial results from this study were orally presented at the Second World Conference on
Personality, March 31 to April 4, 2016, Búzios, Brazil; the participation in this conference was sponsored by a
grant from the research environment, Learning in and for the New Working Life (LINA), at University West,
Trollhättan, Sweden. The research time and working facilities for the writing of this article were funded by
University West, while the data collection was supported by grants from Karolinska Institutet, the Söderström-
Königska Foundation, the Swedish Carnegie Institute, the Swedish Foundation for Care Sciences and Allergy
Research, and the Swedish National Board of Forensic Medicine (Rättsmedicinalverket). The funders had no
role in study design, data collection and analysis, decision to publish, or preparation of the article.
Correspondence concerning this article should be addressed to Anna M. Dåderman, Department of Social and
Behavioural Studies, Division of Psychology, Education, and Sociology, University West, SE-461 86 Trollhättan,
Sweden; e-mail:
The Psychopathy Checklist–Revised (PCL-R; Hare, 2003) is one of the most widely
used instruments in forensic psychiatry and psychology (Hurducas, Singh, de Ruiter, &
Petrila, 2014; Neal & Grisso, 2014) to support decisions regarding offenders’ personal lib-
erty. Data presented in the manual of PCL-R (Hare, 2003) indicate that its reliability in
research contexts is good (>.80), but in recent years a growing body of research (see Table 1
below), often using generalizability theory, shows that PCL-R interrater reliability in the
real world is much lower than stated in the manual. The current study aimed at adding to our
understanding of the factors behind this discrepancy by multiple-level analyses of data from
a study conducted in an applied forensic context, with a somewhat unusual design.
The PCL-R consists of 20 items (see Table 2 for item descriptions), 18 of them catego-
rized into two factors: Interpersonal/Affective (Factor 1) and Social Deviance (Factor 2),
each factor having two facets. The interrater reliability of PCL-R is measured by the intra-
class correlation coefficients (ICC; McGraw & Wong, 1996, Table 4). The most relevant
reliability measure for legal decisions in the real world is ICCA1. Interrater reliability of the
total PCL-R score, as well as of the factor scores, is reported in the manual as being ade-
quate (Hare, 2003). In contrast, A. K. Miller, Rufino, Boccaccini, Jackson, and Murrie
(2011) reported agreement on four videotaped practice cases between 22 trained, but not yet
experienced raters as being much lower than in the manual. Only 38% of the variance in the
PCL-R total score was attributable to the evaluees’ psychopathy, whereas 19% of the vari-
ance was attributable to scoring tendencies of individual raters. These tendencies were
related to raters’ personality traits (e.g., higher rater agreeableness was associated with
lower scoring of Facet 1, Interpersonal).
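As an illustration of the statistic at issue, the following is a minimal sketch, not taken from the original article, of how ICC(A,1), the two-way random-effects coefficient for absolute agreement with a single rater (McGraw & Wong, 1996), can be computed from an evaluees × raters score matrix; the function name and example data are hypothetical:

```python
import numpy as np

def icc_a1(scores):
    """ICC(A,1): two-way random-effects model, absolute agreement,
    single rater. `scores` is a 2-D array: rows = evaluees, columns = raters."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for the two-way (evaluee x rater) layout
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between evaluees
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Note that the rater mean square enters the denominator, so raters who score systematically higher or lower than others (the scoring tendencies described above) lower ICC(A,1), whereas consistency-type ICCs ignore such offsets.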
In most research studies, PCL-R scoring is conducted by well-trained research assistants
who make their ratings under more favorable conditions than those of real-world practice;
very few studies exist in which clinicians perform ratings for research purposes only.
Recently, Blais, Forth, and Hare (2017) reported results from a research study in which 280
trained raters (both researchers and clinicians) of the Dark Stone Research Group provided
PCL-R-ratings on six videotaped practice cases, three of the cases having high PCL-R
scores. The ratings were compared with a criterion of PCL-R scores determined by the con-
sensus of three experienced PCL-R users (one academic researcher and two clinical cor-
rectional psychologists). Values of ICCA1 were .75, .65, and .78 for PCL-R Total, Factor 1,
and Factor 2, respectively. ICCA1 values on the facet level were .66, .51, .83, and .61 for
Interpersonal, Affective, Lifestyle, and Antisocial, respectively. These values are lower than
those given in the manual. The highest ICCA1 values, across all raters, were obtained for
the three cases with the highest PCL-R scores, suggesting that offenders with intermediate
and low scores are more difficult to assess than genuine psychopaths.
There is no absolute criterion for determining what level of interrater reliability is “reli-
able enough.” Cicchetti (1994) provided commonly cited ICC cutoff values, with interrater
reliability described as “poor” below .40, “fair” between .40 and .59, “good” between .60
and .74, and “excellent” at .75 or above. Rosenthal and Rosnow (1991) proposed that the
threshold of good reliability for professional ratings should be at a minimum of .85. Heilbrun
(1992) suggested that “the use of tests with a reliability coefficient of less than .80 is not
advisable” (p. 265), not referring to any particular statistical index of interrater reliability (such
as ICC or Cohen’s kappa). Heilbrun argued that wherever results from forensic instruments
influence personal liberty, the employed instrument should possess good (at least .80) test–
retest and interrater reliability. The main reason, according to Heilbrun, for using a higher
TABLE 1: Field Studies Addressing PCL-R Interrater Reliability, ICCA1
Study Setting n Raters PCL-R total F1 F2 Fac1 Fac2 Fac3 Fac4
Adversarial Context (“Allegiance”)
Boccaccini, Turner,
and Murrie (2008)
The United States, Texas 23
Archival prison records: State vs.
Def, termed “petitioners” and
“respondents” (doctoral-level
.39 — — — —
Murrie et al. (2009) The United States, Texas 35
Archival prison records: State
vs. Def (doctoral-level
psychologists, one psychiatrist)
— — — —
Lloyd, Clark, and
Forth (2010)
Canada, “Dangerous
offender hearing”
Sexual- and
nonsexual violent
LexisNexis legal database: (case
law reports of Canadian criminal
trials) “Experts”
.67 (n = 15; Crown vs. Def)
.82 (n = 7; Crt-appt vs. Def)
.71 (n = 5; Crown vs. Crt-appt)
— — — —
Runo, Boccaccini,
Murrie, and Hawes
(2012; part of this
study was a research
study with excellent
ICC between
research raters)
The United States, Texas
(partly overlapping
sample of Murrie,
Boccaccini, Johnson, &
Janke, 2008; Murrie et
al., 2009)
Archival prison records: State
vs. Def—see Murrie et al.,
2008; Murrie et al., 2009) vs.
two independent research
raters (advanced level clinical
psychology doctoral students
each of whom scored 88 SVP
with le data only)
.33 (n = 44; State vs. Def)
.14 (n = 44; independent
vs. Def)
.29 (n = 44; independent vs.
.18 (n = 28) .39 <.50>
(n = 11)
<.50 .65
Boccaccini, Turner,
Murrie, and Runo
The United States, Texas 38
Archival prison records: State
vs. Def
.44, .52
44, 52/28, 20/28, 28
— — — —
DeMatteo, Galloway,
Edens, Cox, and
Smith (2014)
The United States; eight
states, most from
California, Texas, and
Archival prison records:
Prosecution vs. Def
(psychologists and psychiatrists)
.58 (regardless of type of
.68 (n = 14; prosecution
vs. Def)
— — — —
Edens, Cox, Smith,
DeMatteo, and
Sörman (2015)
Canada 8-55
Sexual and nonsexual
LexisNexis legal database: (case
law reports of Canadian criminal
.54 (n = 55, Def vs. Crown)
.68 (n = 32; Def vs. Crt-appt)
.53 (n = 8; Crown vs. Crt-appt)
— — — —
Differences Not Related to Allegiance
Levenson (2004) The United States,
Experts (doctoral-level licensed
psychologists or psychiatrists)
.72 (converted to a single
rater value, see Edens,
Boccaccini, and Johnson
(2010), p. 108
— — — —
Boccaccini et al.
The United States, Texas 22
Archival prison records: State .47
— — — —
Edens et al.
The United States, Texas 20 imprisoned sex offenders with PCL-R total score of at least 25
Archival prison records: Trained
doctoral-level clinicians working
at state-operated psychiatric
.45 (corrected for range) .55 (corrected for range)
— — — —
Murrie et al.
The United States, Texas 20
Archival data: Experts (doctoral-
level psychologists, one
.88 (n = 7; Respondents)
.24 (n = 13; Petitioners)
— — — —
Runo et al.
See details above 44 Archival data: State vs. two
independent research raters
(file data only)
.18 (n = 34) .77 <.50>
(n = 22)
<.50 .76
Boccaccini et al.
The United States, Texas 32
Archival data State .40
— — — —
C. S. Miller, Kimonis,
Otto, Kline, and
The United States,
Archival data: Team of mental
health professionals—
psychologists or psychiatrists
(employees or contractors at
.60 .48 .72 .48 .39 .56 .75
G. T. Harris, Rice,
and Cormier
Canada, High-security
psychiatric institution
Men admitted to
pretrial evaluation
Trained clinicians (psychologists,
psychometrists) vs. one “very
experienced” research assistant
(file data only)
.79 .71 .90 .69 .68 .80 .94
Sturup et al.
Sweden, National
Board of Forensic
Medicine, independent
government authority
prisoners who
applied to receive
a shortened prison sentence
Archival data: Teams comprising
a forensic social investigator,
a psychologist, a forensic
.70 .62 .76 .60 .59 .54 .90
Boccaccini, Murrie,
Runo, and
Gardner (2014)
The United States, Texas 557
Archival data: Psychologists and
one psychiatrist .32 (n = 425) .23 (n = 430) .25 (n = 426) .18 (n = 429)
Edens et al.
Canada 102
Sex offenders and
other offenders
LexisNexis legal database: (case
law reports of Canadian criminal
.59 (total sample)
.66 (sex offenders)
.46 (other offenders)
.64 (n = 28, Crown vs.
.20 (n = 19; Crt-appt vs.
.79 (n = 10, Def vs. Def)
— — — —
Ismail and Looman
Canada, MAU Ontario
vs. SSOA
Sex offenders
Archival data: Trained by Dark
Stone Research Group
psychologists (at MAU
supervised undergraduates, at
SSOA graduates)
.90 .78 .90 .76 .79 .82 .94
Jeandarme et al.
Belgium 74
Forensic psychiatric
patients and
Archival data:
Criminologists and psychologists,
all on master’s degree
.42 .44 .44 .54 .27 .28 .60
Boccaccini, Runo,
Jeon, and Murrie
(2017; part of this
study was a research
study with good ICC
between research
80 (data from Runo
et al., 2012)
Archival data: “state experts” vs.
two graduate students (le
data only)
.45 (State vs. Student 1)
.44 (State vs. Student 2)
.17 (Def vs. Student 1 and 2)
.06 (State
vs. Student
.07 (State
vs. Student
.41 (State vs.
Student 1)
.43 (State vs.
Student 2)
— — — —
Note. Proportions of variance attributable to offender/rater/random error. See the reference list for full references. The definition of field studies by Edens and Boccaccini (2017) was adopted. Studies are
presented chronologically as they were published. PCL-R = Psychopathy Checklist–Revised; ICCA1 = ICC for absolute agreement, single rater score; n = the number and type of evaluees for whom
at least two PCL-R scores were available; SVP = sexually violent predators facing civil commitment; “—” = no data; DCF = Department of Children and Families; MAU = Millhaven Assessment Unit;
SSOA = Specialized Sexual Offender Assessment; Crown = state contracted experts. Def = defense retained experts; Crt-appt = court-appointed experts; F1 = Factor 1; F2 = Factor 2; Fac1 = Facet
1 (Interpersonal); Fac2 = Facet 2 (Affective); Fac3 = Facet 3 (Lifestyle); Fac4 = Facet 4 (Antisocial); ICC = intraclass correlation coefficients.
TABLE 2: Review of Past Research on Rater Agreement Coefficients on the PCL-R Item Level
No. PCL-R item description
Clinical samples Field samples
Grann, Långström,
Tengström, and
Stålenheim (1998)
N = 40
Hildebrand, de
Ruiter, de Vogel,
and van der Wolf
N = 60
Laurell and
N = 35
G. T. Harris,
Rice, and
N = 24
Blais, Forth,
and Hare
N = 6
C. S. Miller,
Kimonis, Otto,
Kline, and
Wasserman (2012)
N = 313
et al.
N = 27
Ismail and
N = 178
et al. (2017)
N = 24-41
1aGlibness/supercial charm
(F1; Fac1)
.37 .46 .61 .73 .65 .31 .31 .75 .27
2 Grandiose sense of self worth
(F1; Fac1)
.24 .51 .51 .47 .64 .41 .54 .76 .49
3 Need for stimulation/proneness to
boredom (F2; Fac3)
.34 .80 .64 .63 .82 .30 .62 .76 .36
4 Pathological lying (F1; Fac1) .24 .65 .66 .59 .39 .29 .42 .59 .49
5 Conning/manipulative (F1; Fac1) .30 .66 .76 .63 .42 .26 .37 .66 .26
6a Lack of remorse or guilt
(F1; Fac2)
.32 .69 .89 .59 .34 .35 .51 .69 .52
7 Shallow affect (F1; Fac2) .27 .60 .74 .55 .49 .09 .51 .68 .38
8 Callous/lack of empathy
(F1; Fac2)
.16 .52 .70 .57 .51 .30 .36 .62 .21
9 Parasitic lifestyle (F2; Fac3) .29 .68 .44 .68 .68 .42 .47 .80 .30
10 Poor behavioral controls
(F2; Fac4)
.23 .65 .85 .89 .45 .39 .28 .75 .37
11 Promiscuous sexual behavior .24 .80 .87 .94 .68 .28 .59 .78 .46
12 Early behavioral problems
(F2; Fac4)
.63 .79 .66 .91 .64 .55 .77 .82 .42
13 Lack of realistic, long-term goals
(F2; Fac3)
.43 .52 .81 .55 .63 .34 .47 .67 .17
14a Impulsivity (F2; Fac3) .49 .67 .60 .49 .40 .27 .23 .70 .33
15 Irresponsibility (F2; Fac3) .27 .63 .64 .52 .43 .37 .39 .70 .19
16a Failure to accept responsibility for
own actions (F1; Fac2)
.23 .67 .71 .32 .14 .35 .66 .63 .20
17 Many short-term marital relationships
.31 .79 .80 .92 .68 .32 .43 .74 .56
18 Juvenile delinquency (F2; Fac4) .55 .78 .80 .95 .70 .73 .80 .86 .66
19 Revocation of conditional release
(F2; Fac4)
.50 .77 .80 .50 .78 .70 .80 .75 .53
20 Criminal versatility (F2; Fac4) .64 .76 .98 .74 .68 .66 .75 .90 .14
Average across the 20 items .35 .67 .72 .66 .56 .38 .51 .73 .37
Note. Data regarding research samples from the PCL-R manual (Hare, 2003) are not shown. Averages across the 20 items were calculated by the authors. All studies with one exception reported
ICCs. The exception was the study of Grann et al., which reported Cohen’s kappa coefficients. For interpretation of interrater reliability coefficients, see Heilbrun (1992), describing values of at
least .80 as good. PCL-R = Psychopathy Checklist–Revised; F1 = Factor 1; F2 = Factor 2; Fac1 = Facet 1 (Interpersonal); Fac2 = Facet 2 (Affective); Fac3 = Facet 3 (Lifestyle); Fac4 = Facet 4
(Antisocial); ICC = intraclass correlation coefficients.
a Indicates items that were assessed by Rufino, Boccaccini, and Guy (2011) as the most difficult to score.
threshold of reliability for “accepting” forensic instruments than in clinical psychology is
that in clinical settings, the result of an evaluation affects the evaluee only, whereas in
forensic psychology it may affect not only the evaluee, but also security, health, and so on,
of members of the society. Another reason is that reliability below .80 “has been criticized
for containing excessive error variance, and hence, poorer validity” (Heilbrun, 1992, p.
265), and Heilbrun proposed that this relationship may be described by generalizability
theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). In the current study, we followed
Heilbrun’s definition of good reliability as being at least .80.
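The benchmark bands discussed above can be collected into a small helper; this is a hypothetical illustration combining Cicchetti's (1994) commonly cited bands with Heilbrun's (1992) forensic criterion of at least .80 (function name is ours):

```python
def reliability_labels(icc):
    """Return (Cicchetti band, meets Heilbrun's forensic criterion).
    Cicchetti (1994): <.40 poor, .40-.59 fair, .60-.74 good, >=.75 excellent.
    Heilbrun (1992): forensic instruments should reach at least .80."""
    if icc < 0.40:
        band = "poor"
    elif icc < 0.60:
        band = "fair"
    elif icc < 0.75:
        band = "good"
    else:
        band = "excellent"
    return band, icc >= 0.80
```

Note that an ICC can be "excellent" by Cicchetti's bands (e.g., .78) and still fall short of Heilbrun's .80 threshold, which is why the two conventions are kept distinct in this study.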
Only two previous studies (Dåderman & Kristiansson, 2003; Wong, 1988) were carried
out in research contexts with similarly trained raters conducting interviews independently
and having access to identical file data. In Wong’s study, two raters independently com-
pleted a PCL-R rating on each of 56 male forensic psychiatric patients based on file data
only, and then each rater interviewed each patient and made a second rating that reflected
both file and interview information. However, the old, outdated version of PCL was used
in Wong’s study, and inappropriately, Wong reported Pearson r values instead of ICCA1.
In Dåderman and Kristiansson’s study, 28 male juvenile delinquents aged 14 to 20 years,
convicted of serious crimes, were rated by two independent experienced raters
(an academic researcher and a forensic psychiatrist) who performed separate interviews
and had access to the file data. The remaining 28 delinquents were interviewed by only
one rater. Dåderman and Kristiansson reported good interrater reliabilities, with ICCA1
values similar to the value reported in the PCL-R manual, .90 for the PCL-R total, .85 for
Factor 1, and .90 for Factor 2. However, because the sample comprised only young
offenders (aged 14-20 years), and about half of them were previously treated within a
correctional institution 3 times or more, all of them were routinely assessed with a version
of PCL-R modified for young people (Forth, Hart, & Hare, 1990) by omitting three items
(“Parasitic lifestyle,” “Promiscuous sexual behavior,” and “Many short-term marital rela-
tionships”). These items were, in accordance with the PCL-R manual, “prorated” (using
the tables in the manual’s appendices to impute scores for up to five omitted items). In
addition, Dåderman and Kristiansson did not report ICCA1 values on the facet level, nor
on the item level. Finally, they did not calculate the proportion of score variance attribut-
able to different sources (see the “Method” section for details). Therefore, there is need
for further research.
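The prorating mentioned above amounts, in effect, to a linear rescaling of the sum of the scored items; the following sketch makes that arithmetic explicit under that assumption (the manual's appendix tables, not this function, are authoritative, and the rounding here is illustrative):

```python
def prorate_pclr_total(item_scores, n_items=20, max_omitted=5):
    """Prorate a PCL-R total score when up to five items are omitted.
    `item_scores`: one entry per item, 0/1/2 for scored items and None
    for omitted ones. Illustrative linear rescaling only; the manual's
    lookup tables govern actual scoring."""
    scored = [s for s in item_scores if s is not None]
    n_omitted = n_items - len(scored)
    if not 0 <= n_omitted <= max_omitted:
        raise ValueError("between 0 and 5 items may be omitted")
    if any(s not in (0, 1, 2) for s in scored):
        raise ValueError("item scores must be 0, 1, or 2")
    # Scale the observed sum up to the full 20-item metric
    return round(sum(scored) * n_items / len(scored))
```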
Field studies of the reliability of PCL-R ratings in forensic clinical practice, often based
on archival data, have been reported, with independent raters producing two sets of scores.
In the field, often only one rater rates an evaluee and incorporates the rating into his or her
report. Some evaluees are rated at another setting and/or at another occasion (e.g., before
release from prison or hospital) by another rater, whereas some others, often being more
complex cases, are rated several times. The information about the use of independent inter-
views, type of education, experience, and PCL-R training is not always available in case
law reports.
Generally, in the literature of PCL-R interrater reliability, there seem to be two main foci
of attention. One concerns whether scores remain as reliable in field settings as they do in
research settings. The other concerns whether scores differ systematically in reliability
depending on the side for which they were produced (“allegiance”), which is a specific
issue in studies of field reliability. Table 1 briefly summarizes results published on PCL-R
reliability field studies.
Table 1 shows that “allegiance” field studies have found that total PCL-R scores from
state-appointed experts are consistently higher than those from defense-appointed
experts, entailing poorer interrater reliability, with ICCA1 between .14 and .82, than in
research contexts. The reported values of ICCA1 are consistently lower for Factor 1 than
for Factor 2.
The question is how much attention should be given to ICCA1 from field studies, espe-
cially those comprising only a few cases, when discussing PCL-R interrater reliability,
because such studies are likely to be marred by some serious biases, such as an offender
type bias (e.g., an overrepresentation of sexual offenders or of offenders with high scores),
a bias due to the length of imprisonment between scoring occasions, which may cause
ratings to be confounded by regression toward the mean (Streiner, 2001), and a bias
caused by the adversariness of the U.S. judicial system, which may prevent clinicians
representing opposite sides from having access to an equal amount of file data. In addition,
there is a small-sample bias that should be evident from Table 1, where it is shown that the
14 results from allegiance studies (see the fifth column) are based on very small samples
(in some cases n < 10). Other field reliability studies, not biased by allegiance, are also based
on small samples, with some exceptions (Edens, Cox, Smith, DeMatteo, & Sörman, 2015;
G. T. Harris, Rice, & Cormier, 2013; Jeandarme et al., 2017; Levenson, 2004; C. S. Miller,
Kimonis, Otto, Kline, & Wasserman, 2012).
Only a few studies have reported PCL-R interrater reliability on the item level. Table 2
shows item-level rater agreement coefficients from studies published to date regarding
results from clinical samples on male patients referred for forensic psychiatric evaluation
before being sentenced (Grann, Långström, Tengström, & Stålenheim, 1998), patients
admitted to forensic psychiatric care (Hildebrand, de Ruiter, de Vogel, & van der Wolf,
2002), offenders convicted for murder (Laurell & Dåderman, 2007), offenders in a high-
security psychiatric institution (G. T. Harris et al., 2013), and also results from “practice
cases” of offenders (Blais et al., 2017). PCL-R field reliability studies generally do not
report reliabilities for individual items because such data are often not available (also,
PCL-R factor scores are relatively rarely reported, which is a limitation acknowledged by
Edens et al., 2015). Table 2 also lists results from four field studies where interviews by
at least two independent raters, as well as file data, were used. Studies were conducted by
C. S. Miller et al. (2012) on sexually violent predators (SVPs) from Florida, by Sturup et
al. (2014) on life-sentenced offenders in Sweden, by Ismail and Looman (2016) on sex
offenders from the Canadian province Ontario, and by Jeandarme et al. (2017) on inmates
and/or patients from Belgium, in two settings (prison and forensic psychiatric hospital).
In summary, there is relatively little knowledge on the reliabilities of individual PCL-R
items, regardless of type of design, and item-level results from past research are not
encouraging. Such knowledge is of importance, as low reliability of some items may
decrease the reliability of the total PCL-R score. When raters disagree on the item level,
this may lead to an unstable reliability of the categorical diagnosis of psychopathy.
Rufino, Boccaccini, and Guy (2011) found that some of the PCL-R items are difficult to
score because these require a great deal of subjective judgment by the raters (see shad-
owed rows in Tables 2, 4, and 6). The reviewed literature shows, however, no clear ten-
dency for particular items to show poorer or better reliability than other items, indicating
that research on interrater reliability on individual PCL-R items should be continued.
It is important to acknowledge the essential difference of the context of the current study
from other forensic psychiatric settings (e.g., in the United States or Canada) that usually
routinely conduct and “deliver” PCL-R scores for the court to support legal decisions.
During forensic psychiatric evaluation in Sweden, a forensic psychiatric team comprising a
forensic social investigator, a psychologist, a nurse, and a forensic psychiatrist investigates
whether the evaluee suffers from a severe mental disorder (or whether the crime was com-
mitted under the influence of such disorder). A severe mental disorder is present if the eval-
uee suffers from a psychotic disorder, a particularly severe depressive disorder, a personality
disorder, or a compulsive disorder. Sometimes, patients with severe neuropsychiatric disor-
ders may be assessed as suffering from a severe mental disorder. Also, patients with severe
dementia, severe mental retardation, or a grave brain injury may be assessed as suffering
from a severe mental disorder. The goal of such evaluation is to produce a verdict on whether
the crime has been committed under the influence of such disorder and whether the person
needs forensic psychiatric care. This gives the court the basis for deciding whether the eval-
uee should be exempted from being sentenced to imprisonment and instead sentenced to
forensic psychiatric care. The outcome of the evaluation may be decisive for the sentence.
However, psychopathy by itself is not considered a severe mental disorder, and therefore
the PCL-R is not routinely used for assessment of the evaluee in the forensic psychiatric
evaluation, and the level of psychopathy is not routinely reported to the Swedish court. In
fact, most potential psychopaths are not referred for forensic psychiatric evaluation,
except those who are suspected of murder.
However, the PCL-R is routinely utilized when a forensic psychiatric team within the
Swedish National Board of Forensic Medicine is asked by the court to conduct a risk (i.e.,
“dangerousness”) assessment of a life-sentenced person. This happens when the person
applies, after having served for at least 10 years, for his or her sentence to be converted into
a time-limited sentence (Sturup et al., 2014; Svennerlind et al., 2010). In such cases, several
risk assessment instruments, among others the PCL-R, are used to support the evaluation.
There are few interrater reliability studies in the real world of forensic psychiatry where
separate raters perform PCL-R ratings, based on both interviews and file data, at different
time points. The present study aimed to fill this gap of knowledge. Therefore, its plan was
to analyze in-depth the interrater reliability and internal consistency of the PCL-R in a
sample of criminal suspects referred for forensic psychiatric evaluation before being sen-
tenced (they are labeled “patients” during this evaluation), using assessments based on
separate interviews by two independent raters (see below), as well as on file data. Specifically,
the aims of the current study were to
assess the interrater reliability of PCL-R scores among independent raters;
estimate how much of the variability in PCL-R scores was attributable to (a) psychopathy
indicators, (b) rater differences, and (c) random error and other sources.
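Aim (b) corresponds to a one-facet G study in generalizability theory. A hedged sketch of how such a decomposition can be obtained, assuming a fully crossed patients × raters design (the function name and the ANOVA estimators shown are ours, not the article's exact procedure):

```python
import numpy as np

def g_study_proportions(scores):
    """Proportions of score variance attributable to patients
    ("psychopathy indicators"), raters, and residual (interaction plus
    random error) in a crossed patients x raters design.
    `scores`: 2-D array, rows = patients, columns = raters."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_p = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_r = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ms_p = ss_p / (n - 1)
    ms_r = ss_r / (k - 1)
    ms_e = (((x - grand) ** 2).sum() - ss_p - ss_r) / ((n - 1) * (k - 1))
    # ANOVA estimators of the variance components, truncated at zero
    var_p = max((ms_p - ms_e) / k, 0.0)
    var_r = max((ms_r - ms_e) / n, 0.0)
    total = var_p + var_r + ms_e
    return {"patient": var_p / total, "rater": var_r / total,
            "residual": ms_e / total}
```

Under these estimators (when no component is truncated at zero), the patient share equals ICC(A,1), which is why a large rater share, i.e., systematic leniency or severity, directly depresses absolute-agreement reliability.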
PCL-R ratings were conducted using separate interviews by two types of independent
raters, in Table 4 called “1” and “2.” Raters 1 were 12 forensic psychiatric team members
(10 women and two men): two forensic psychiatrists, three clinical psychologists, and
seven social workers. They participated voluntarily without any
compensation. Rater 2 (the first author) is an academic researcher and not a member of
any forensic psychiatric team. She is an experienced PCL-R rater, and is often used as
an expert witness for the defense regarding different psychological and forensic issues.
She has also educated and supervised doctoral students in PCL-R ratings at the master’s
and doctoral level, among others Rater 3 (see below). However, she had not trained
Raters 1 on PCL-R ratings, and she was not trained by any of them. All raters (1 and 2)
had sufficient forensic experience, were familiar with the clinical and research literature
on psychopathy, and were experienced PCL-R interviewers. Before the current study,
the clinicians (Raters 1), as well as the researchers (among others, Rater 2), at this foren-
sic unit had received specific formal training and formal authorization in the clinical use
of the PCL-R during basic and advanced workshops (“the Hare PCL-R Training
Program”) led by David J. Cooke, professor of psychiatry. This formal training empha-
sized the importance of the clinical information derived from the instrument and how
that information can be used in violence risk assessment. A variety of teaching modes
were used: didactic, interactive, and practical exercises. Video cases were used to allow
participants to apply the PCL-R ratings and obtain feedback. The strengths and limita-
tions of the PCL-R were discussed. This training was conducted at the Forensic
Assessment Unit in Stockholm, which is operated by the Swedish National Board of
Forensic Medicine. Forensic researchers, including Rater 2, and clinicians (here, Raters
1) at this Forensic Assessment Unit were housed at separate parts of the facility, and had
no discussions of any kind about individual patients. A third rater with comparable quali-
fications, Rater 3, rated 17 of the patients using file data only. She took a master’s course
in forensic psychology at Stockholm University, before her doctoral studies in psychol-
ogy. Because the forensic files were classified documents, she had to read these files at
the Forensic Assessment Unit, without any contact with the clinicians. Therefore, agree-
ment between the three kinds of raters could hardly be attributed to a shared source of information.
The patients, who were PCL-R rated in two separate interviews, were 43 male nonpsy-
chotic offenders who had been referred by Swedish courts, before being sentenced, for
forensic psychiatric evaluation at the Department of Forensic Psychiatry in Stockholm,
National Board of Forensic Medicine, which is the largest facility of its kind in Sweden.
The patients’ ages ranged from 16 to 35 years (M = 27.7, SD = 5.4). Most of them were
sentenced for more than two types of crimes. The majority (74.4%) were sentenced for
violent crimes (including murder, manslaughter, causing another’s death, assault, rape and
other sex crimes, unlawful threat, and unlawful deprivation of liberty). Other common
crimes were narcotic-related offenses (25.6%), weapon offenses (18.6%), arson and inflict-
ing damage (14%), robbery (9.3%), and driving under the influence of alcohol or narcotics
(7%). The intellectual capacity of the patients was estimated by standardized psychological
tests, and all patients had an IQ considered to be at least “normal” (i.e., patients with intel-
lectual disability were not included).
The PCL-R (Hare, 2003) is a 20-item checklist. Each item has four possible scoring
choices (0, 1, 2, and omit); 2 indicates that the item definitely applies, 1 indicates that it may
or may not apply, and 0 indicates that it does not apply. These scores quantify the degree to
which the interviewee manifests a particular diagnostic indicator of psychopathy. The total
score can range from 0 to 40. The manual (Hare, 2003) states that a score of 30 or above
confirms the diagnosis of psychopathy, but its Swedish version also states that when using
file data only, a score of 26 or higher qualifies this diagnosis in Swedish forensic psychiatric
samples. To compare with prior results, we consider both of these alternative criteria. Two
patients were below the age of 18 (one was 16 and another was 17), and for these two
patients, each of the 20 PCL-R items was rated using a modified version of the PCL-R
(Forth et al., 1990). The modifications were identical to those used in Dåderman and
Kristiansson’s (2003) study.
During a 1-year period, all male patients undergoing forensic psychiatric evaluation at the
Forensic Assessment Unit in Stockholm (run by the Swedish National Board of Forensic
Medicine), who fulfilled the inclusion criteria (age 35 or younger, IQ at least normal, not needing an interpreter, nonpsychotic) were invited by the first author to participate
in the current study, which was a part of a larger research project regarding, among other
things, cognitive functions and abuse patterns. The reason for excluding patients above the
age of 35 was that the main focus of this project was on cognitive functions (such as dys-
lexia) and that it was not possible to include all patients because of the time needed for all
psychological tests. Sixty out of the 68 eligible patients volunteered to participate. The proj-
ect was not allowed to interfere with the normal forensic psychiatric evaluation, and it was
not possible to ask the patients to participate before this evaluation was finished. For security reasons, because some of the patients were presumed dangerous, all patients had to be tested while still at the forensic unit; two patients were PCL-R interviewed by Rater 2 in custody. Raters 1 were those 12 clinicians who rated the patients at the forensic psychiatric
evaluation, and who had also participated in the previously described workshop on the use of
the PCL-R (i.e., were “certified”). Therefore, 43 patients out of the 60 who volunteered to
participate were PCL-R rated by Raters 1. The remaining 17 patients were PCL-R rated by a
third rater (Rater 3) using file data only (see below). All 60 patients were rated by Rater 2.
Each patient was informed that results from this project were not to be recorded in his
files. No other persons were to be present during the interviews. Rater 2 conducted the
PCL-R interviews in her office at the forensic psychiatric unit, with two exceptions (see
above), where she also read all files after performing the semistructured PCL-R interviews.
The only information Rater 2 received before the interview was the type of crime of which
the patient was suspected. This was the only difference between the rating procedures of
Raters 1 and Rater 2. All of Raters 1 had access to all criminal investigations before they
evaluated the patient, and sometimes they knew him, because some of the patients were
recidivists and had been psychiatrically evaluated in the past. The patients received no com-
pensation for their participation.
PCL-R ratings were conducted on three occasions. On the first occasion, during the 4-
to 6-week period of forensic psychiatric evaluation, each of the 43 patients was PCL-R
rated by one of the 12 Raters 1. This rating was based on an interview, as well as on files
from the Swedish National Police Register, which included all offenses committed by the
patient after the age of 15 years, on the Hospital Discharge Register of the Swedish National
Board of Health and Welfare, on previous institutional records, social service notes, school
and employment reports, medical records, previous forensic psychiatric files (where rele-
vant), and on court documents (including police reports) concerning current crimes. The
forensic psychiatric teams routinely contact the offenders’ families, school teachers, and so
on, and review earlier medical reports, probation reports, and criminal histories. They also
have access to comprehensive material from Forensic Psychiatric Medicine (e.g., crime-scene photos and descriptions of the victim's injuries) and detailed
documents from the criminal investigation, which in some cases included videotapes. The
forensic psychiatric evaluations, to which Rater 2 had access after performing the
PCL-R interviews, included summaries of this material, as well as nurses’ ratings and
clinical impressions by other staff who intervened with the patient during the evaluation
period. These files included detailed reports by clinical staff with regard to the patient’s
interactions with others, clinical judgments, diagnoses, and so forth.
On the second occasion, about 1 month after the first interview, Rater 2 conducted a
separate interview with each of the 43 patients and rated most of the PCL-R items; after being given file data similar to those available to Raters 1, as detailed above, she completed the ratings. Rater
2 was blind to the PCL-R scores given by Raters 1, these ratings being archived in a secure
location. The patients’ files contained no detailed information relevant to specific PCL-R
features or items. Almost all of the interviews by Rater 2 were performed directly after the
end of the forensic psychiatric evaluation period, while the patients were still at the clinic.
Rater 2 also, within the aforementioned project, sampled data on the patients’ personality
traits, their neuropsychological functions, and their use of substances. In summary, Time 1
assessments were conducted among a group of raters (Raters 1), whereas Time 2 assess-
ments were all conducted by a single rater (Rater 2). To our knowledge, this design was not
used in any prior published study.
On the third occasion, for a purpose different from the current study, the remaining 17
patients were rated at Time 3 by Rater 3, using file data only. She was an experienced rater,
and had been trained by the first author.
All statistical analyses, with one exception, were conducted with the Statistical Package
for the Social Sciences (SPSS), Version 24. The exception was calculations using generaliz-
ability theory (Brennan, 2001; Cronbach, Nageswari, & Gleser, 1963), where the variance
components were calculated with routines in the R-package “gtheory” (R Core Team, 2017).
Where appropriate, we used 1,000-sample bootstrapping to overcome bias due to small
sample size; 95% bias corrected and accelerated confidence intervals (BCa CIs) around the
means were computed.
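As an illustration of this procedure (the published analyses were run in SPSS), a BCa bootstrap interval around a mean can be sketched with SciPy; the score vector below is hypothetical and serves only to show the mechanics:

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical PCL-R total scores (illustration only, not study data).
scores = np.array([12, 31, 25, 18, 27, 22, 30, 15, 24, 28, 20, 26])

# 1,000 resamples; 95% bias-corrected and accelerated (BCa) interval.
res = bootstrap((scores,), np.mean, n_resamples=1000,
                confidence_level=0.95, method="BCa")
ci = res.confidence_interval  # (low, high) bounds around the mean
```

The BCa method adjusts the percentile interval for bias and skewness in the bootstrap distribution, which matters with small samples such as ours.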
Values of ICCA1 (McGraw & Wong, 1996, Table 4) were calculated to assess the inter-
rater reliability of the PCL-R total score, Factor 1 and Factor 2, the four facets, and the 20
individual items. We use the notation ICCA1 to denote absolute agreement of a single mea-
sure. As was noted by Blais et al. (2017),
absolute agreement is preferred over consistency agreement when investigating the PCL-R
given that decisions within real-world context [. . .] are often based on scores that have actual
clinical significance (i.e., 30 or higher; Murrie et al., 2008). (p. 766)
With an approach based on generalizability theory (Brennan, 2001; Cronbach et al., 1963),
we adopted the two-way ANOVA random effects model with an absolute agreement for each
analysis, with participants and different types of raters as random factors. Each analysis pro-
vided estimates of the percentages of variance attributable to patients, raters, and the residual
(Patient x Rater interaction and all other sources), respectively. In the random effects model,
participants and raters are both considered as randomly selected, and the scores are meant to
be generalized to the population of raters (i.e., clinicians within forensic psychiatry). In field studies, it is important that rater agreement coefficients be as high as possible: for example, an ICCA1 of at least .80 (Heilbrun, 1992), indicating that at least 80% of the variance in
PCL-R scores is attributable to differences in psychopathy indicators and at most 20% to
factors such as differences among the raters (training, experience, skills, personality, meth-
ods of interpreting file data, and other characteristics of raters that may inspire the patient to
respond differently to the same questions), combined with file data quality, patients present-
ing different information from one rater to another, and other systematic or random measure-
ment errors (Schmidt & Hunter, 1996). It should be kept in mind that the ratings were
conducted at different points in time, and thus, the participants could have been in different
psychological states, and therefore responded differently to the same questions.
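Both the ICCA1 and the variance shares can be written in terms of the two-way ANOVA mean squares. The following is a minimal Python sketch of the McGraw and Wong (1996) formulas; our computations used SPSS and the R package "gtheory," so this code is illustrative only:

```python
import numpy as np

def icc_a1(scores):
    """ICC(A,1): absolute agreement, single rater (McGraw & Wong, 1996),
    computed from the mean squares of a two-way random-effects ANOVA.
    `scores` is an n_patients x k_raters matrix. Also returns the
    generalizability-theory variance shares (patients/raters/residual)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_p = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # patients
    ms_r = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    ss_e = np.sum((x - grand) ** 2) - (n - 1) * ms_p - (k - 1) * ms_r
    ms_e = ss_e / ((n - 1) * (k - 1))                            # residual
    icc = (ms_p - ms_e) / (ms_p + (k - 1) * ms_e + k / n * (ms_r - ms_e))
    var_p = (ms_p - ms_e) / k          # variance component: patients
    var_r = (ms_r - ms_e) / n          # variance component: raters
    var_e = ms_e                       # residual (interaction + error)
    total = var_p + var_r + var_e
    shares = {"patients": var_p / total, "raters": var_r / total,
              "residual": var_e / total}
    return icc, shares

# Toy data: the second rater scores every patient one point higher, so
# a third of the variance is attributable to raters and ICC(A,1) = 2/3.
icc, shares = icc_a1([[1, 2], [2, 3], [3, 4]])
```

Note that the ICC(A,1) formula is algebraically identical to the share of variance attributable to patients, which is why the two kinds of results in our tables agree.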
The Swedish National Board of Forensic Medicine gave its permission to carry out the
ratings and allowed the first author and a research assistant (Rater 3), who completed 17
PCL-R ratings post factum, to inspect the files. The Research Ethical Committee at Huddinge
University Hospital granted its approval to the project.
To examine the possibility that a large proportion of the variance in the PCL-R scores
given by Raters 1 could be attributed to nonrandom assignment of patients to these raters, we
used the Rater 2 PCL-R total score to check whether some raters were assigned to patients
who on average were more genuinely psychopathic than other raters. The sample comprised
patients varying widely in their degree of psychopathy (see Figure 1), and we submitted the
PCL-R total scores given by Rater 2 to an ANOVA to check whether patients assigned to
some of Raters 1, or to Rater 3, had higher or lower scores than those assigned to other raters.
Only those five Raters 1 who PCL-R rated at least four patients were included.
A one-way ANOVA did not reveal any significant differences in mean values of Rater 2’s
PCL-R total score between the groups of patients assigned to the different Raters 1, or to
Rater 3, F(5, 45) = 0.48, p = .787. We could therefore conclude that the patients were
assigned to Raters 1 and 3 in a random manner regarding the patients’ levels of genuine
psychopathy as assessed by Rater 2.
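This check amounts to a one-way ANOVA on Rater 2's total scores grouped by assigned rater; it can be sketched as follows, where the groups are hypothetical stand-ins rather than our data:

```python
from scipy.stats import f_oneway

# Hypothetical Rater 2 PCL-R totals, grouped by which of Raters 1
# (or Rater 3) the patient was otherwise assigned to (illustration only).
groups = [
    [22, 31, 18, 25, 27, 20, 24, 16],  # e.g., patients seen by one rater
    [19, 28, 23, 30],                  # e.g., a second rater's patients
    [21, 26, 17, 29, 24, 33],          # e.g., a third rater's patients
]
f_stat, p_value = f_oneway(*groups)
# A nonsignificant p suggests assignment was unrelated to severity.
```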
Cohen’s kappa (Cohen, 1960), which is meant to account for agreement occurring by
chance, was calculated for the yes/no diagnosis of psychopathy, using one cutoff limit for
PCL-R total score of 30 and another of 26. Following Blais et al. (2017), results are pre-
sented by using both .60 and .80 as indicators of “good” reliability.
Table 3 shows the agreement between Raters 1 and Rater 2 concerning diagnosis of psy-
chopathy, using the two cutoff limits. Considering values of .40 to .75 as “fair to good”
(Fleiss, 1981), agreement concerning the dichotomous classification was “good” (Table 3).
Using a cutoff score of 30, Cohen's kappa was .69; with a cutoff score of 26, it was .67. Both values were statistically significant (see Table 3).
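The reported kappa values can be reproduced arithmetically from the agreement counts in Table 3 (the published statistics were produced in SPSS); a short sketch:

```python
def cohens_kappa(matrix):
    """Cohen's (1960) kappa from a 2 x 2 agreement matrix whose rows
    are one rater's diagnoses and whose columns are the other's."""
    n = sum(sum(row) for row in matrix)
    p_obs = (matrix[0][0] + matrix[1][1]) / n          # observed agreement
    row_marg = [sum(row) for row in matrix]            # first rater
    col_marg = [matrix[0][0] + matrix[1][0],
                matrix[0][1] + matrix[1][1]]           # second rater
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(2)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)               # chance-corrected

# Counts from Table 3: cutoff 30 outside parentheses, cutoff 26 inside.
kappa_30 = cohens_kappa([[30, 0], [5, 8]])    # -> .69 after rounding
kappa_26 = cohens_kappa([[23, 0], [7, 13]])   # -> .67 after rounding
```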
Figure 1: Frequency Distribution of the Total PCL-R Score (Average of Scores From Raters 1 and 2, or 2
and 3) for 60 Male Patients Referred for Forensic Psychiatric Evaluation
Note. PCL-R = Psychopathy Checklist–Revised.
TABLE 3: Agreement Matrix for Psychopathy Diagnosis Using a Cutoff PCL-R Total Score of 30 (26), With
95% BCa CI in Parentheses
Raters 1
No psychopathy Psychopathy Total
Rater 2
No psychopathy 30 (23) 0 (0) 30 (23)
Psychopathy 5 (7) 8 (13) 13 (20)
Total 35 (30) 8 (13) 43 (43)
Note. CIs and SEs are based on 1,000 bootstrap samples. Cohen's kappa = .69 (.67), SE = .12 (.11), approximate t = 4.76 (4.63). PCL-R = Psychopathy Checklist–Revised; BCa = bias corrected and accelerated; CI = confidence interval.
With a cutoff score of 30, Raters 1 and Rater 2 agreed on the patient’s diagnosis of
“psychopathy” in 65% of cases, and on “no psychopathy” in 100% of cases. With a cutoff
score of 26, they agreed on psychopathy in 61.5% of cases, and on “no psychopathy” in
100% of cases. Kappa statistics reached significance in both analyses. Although these values were statistically significant and had a magnitude that, according to some authors (e.g., Fleiss, 1981), can be considered "good," they may not be considered as
satisfactory in a real-world context, where clinical PCL-R raters are expected to be inter-
changeable. Recall that, according to Heilbrun (1992), reliability of instruments used in
forensic psychiatry should reach a value of at least .80; thus, our kappa statistics were not
satisfactory. Table 3 also shows that disagreements tended to be in one direction; Rater 2
was more likely to assign PCL-R total scores above the cutoff limit. To try to find the
reason for this, we also performed analyses for each rater category on each of the indi-
vidual PCL-R items.
For the individual PCL-R items, Table 4 presents frequencies of ratings in each category,
descriptive statistics for Raters 1 and Rater 2, results of paired-sample t tests, including
effect sizes of mean differences between the raters, ICCA1 with 95% CIs, and proportions of
variance attributable to patients’ individual differences in psychopathy indicators, rater (1
vs. 2), and other sources of errors on the item level (due to round-off errors, their sum some-
times does not equal 100).
To identify those items with comparatively better versus weaker interrater agreement, we
followed calculations presented by Blais et al. (2017) for the PCL-R. Raw (percentage)
agreement in coding individual PCL-R items was determined by calculating the percentage
of scores 0, 1, 2, or omit for each item and for each rater type (1 and 2). Good consistency
was defined as at least 90% scoring agreement between the two rater types. This was in
accordance with the cutoff definition adopted by Blais et al. For example, if Raters 1 scored
Item 1 (“Glibness/superficial charm”) as 0 in 28% of the cases, as 1 in 49%, as 2 in 21%,
and never omitted this item, while Rater 2 scored this item as 0 in 35% of the cases, as 1 in
40%, as 2 in 23%, and never omitted this item, this was considered as good consistency
because none of the percentage differences across scoring categories was larger than 10%.
TABLE 4: Percentage of Raters Scoring the Item as 0, 1, 2, or Omit, With Descriptive Statistics for Raters 1 and Rater 2, and Paired t Tests of Individual PCL-R Items

Item no. | Raters 1, % of 0/1/2/omit | Rater 2, % of 0/1/2/omit | Raters 1, M (SD) | Rater 2, M (SD) | 95% CI of the diff. | t | d | ICCA1 [95% CI] | Variance %
1a | 28/49/21/0 | 35/40/23/0 | 0.93 (0.70) | 0.88 (0.70) | [−0.13, 0.22] | 0.53 | 0.07 | .70 [.50, .82] | 70/0/30
2 | 40/40/21/0 | 42/40/19/0 | 0.81 (0.76) | 0.74 (0.73) | [−0.05, 0.19] | 1.14 | 0.09 | .85 [.75, .92] | 85/0/15
3 | 19/16/65/2 | 44/21/47/0 | 1.46 (0.80) | 1.13 (0.88) | [0.16, 0.51] | 3.91*** | 0.39 | .72 [.43, .86] | 72/7/21
4 | 40/49/12/0 | 30/47/9/0 | 0.72 (0.67) | 0.63 (0.65) | [−0.07, 0.25] | 1.16 | 0.14 | .68 [.48, .81] | 68/0/32
5 | 28/37/35/0 | 44/56/19/0 | 1.07 (0.80) | 0.95 (0.65) | [−0.06, 0.30] | 1.30 | 0.16 | .67 [.47, .81] | 67/1/32
6a | 12/21/67/0 | 23/26/63/0 | 1.56 (0.70) | 1.51 (0.70) | [−0.07, 0.16] | 0.81 | 0.07 | .86 [.75, .92] | 86/0/14
7 | 21/40/40/0 | 12/49/42/0 | 1.19 (0.76) | 1.33 (0.64) | [−0.27, −0.01] | 2.22* | 0.20 | .82 [.68, .90] | 82/2/17
8 | 16/35/51/0 | 12/51/37/0 | 1.37 (0.72) | 1.23 (0.67) | [−0.07, 0.25] | 1.16 | 0.20 | .71 [.53, .83] | 71/0/29
9 | 16/33/44/5 | 16/40/37/5 | 1.29 (0.77) | 1.22 (0.74) | [−0.05, 0.20] | 1.20 | 0.09 | .85 [.75, .92] | 85/0/14
10 | 12/14/70/0 | 12/9/60/0 | 1.63 (0.69) | 1.51 (0.70) | [−0.02, 0.25] | 1.70 | 0.17 | .79 [.64, .88] | 79/1/20
11 | 79/7/9/12 | 70/12/7/5 | 0.39 (0.71) | 0.40 (0.69) | [−0.19, 0.16] | 0.15 | 0.01 | .67 [.46, .80] | 67/0/33
12 | 16/33/51/0 | 16/35/49/0 | 1.35 (0.75) | 1.37 (0.72) | [−0.15, 0.10] | 0.37 | 0.03 | .85 [.74, .92] | 85/0/15
13 | 9/37/51/5 | 12/33/49/0 | 1.42 (0.66) | 1.36 (0.72) | [−0.05, 0.16] | 1.05 | 0.09 | .87 [.78, .93] | 87/0/13
14a | 12/23/65/2 | 12/42/44/0 | 1.53 (0.70) | 1.33 (0.68) | [0.07, 0.35] | 2.95** | 0.29 | .74 [.54, .86] | 74/4/22
15 | 9/35/58/5 | 14/40/47/0 | 1.49 (0.67) | 1.33 (0.71) | [−0.03, 0.35] | 1.69 | 0.23 | .59 [.36, .76] | 59/2/39
16a | 12/28/58/0 | 9/35/56/0 | 1.46 (0.70) | 1.46 (0.67) | [−0.16, 0.16] | 0.00 | 0.00 | .70 [.51, .83] | 70/0/30
17 | 74/19/2/12 | 74/14/0/5 | 0.32 (0.58) | 0.31 (0.56) | [−0.19, 0.20] | 0.06 | 0.02 | .38 [.09, .61] | 38/0/62
18 | 47/28/26/2 | 47/26/26/0 | 0.79 (0.83) | 0.78 (0.85) | [−0.08, 0.10] | 0.22 | 0.01 | .94 [.90, .97] | 94/0/6
19 | 12/16/44/5 | 56/7/33/28 | 1.31 (0.72) | 0.76 (0.93) | [0.30, 0.80] | 4.42*** | 0.66 | .43 [.09, .67] | 43/17/40
20 | 28/21/51/0 | 35/23/42/0 | 1.23 (0.87) | 1.02 (0.89) | [0.01, 0.41] | 2.15* | 0.24 | .72 [.53, .84] | 72/2/26
Note. ICCA1 with 95% CIs. Proportions of variance attributable to patients' individual differences in psychopathy indicators/Rater (1 vs. 2)/random error. "1" denotes one of the 12 clinicians from forensic psychiatric teams who rated the patients on the first occasion (during forensic psychiatric evaluation), and "2" a different rater (an academic researcher) who rated them on the second occasion with a separate clinical interview. Bolded values indicate that 90% to 100% of ratings were provided by the two types of raters with the same rating of 0, 1, 2, or omit; italicized values indicate considerable variability (at least 15% across the scoring options). Two-tailed paired t tests were applied. Effect sizes d (Cohen's d) were calculated using an online calculator. PCL-R = Psychopathy Checklist–Revised (Hare, 2003); ICCA1 = ICC for absolute agreement, single rater score; CI = confidence interval; ICC = intraclass correlation coefficients.
aIndicates items that were assessed by Rufino, Boccaccini, and Guy (2011) as the most difficult to score.
*p < .05. **p < .01. ***p < .001.
Problematic consistency was defined as less than 85% agreement in scoring across catego-
ries between the two rater types.
Thirteen items (see Table 4, bold values) were identified as yielding near-perfect consistency, that is, 90% to 100% of ratings were assigned at the same score level by the two rater types, which contrasts with the findings of Blais et al., who identified only seven items with this good consistency.
For the individual PCL-R items, the mean score difference between the two types of raters
(1 and 2) was evaluated by paired-sample t tests. The effect size was calculated using Cohen’s
d. According to Cohen (1988), d = 0.20 indicates a "small" effect size, d = 0.50 a "moderate" effect size, and d > 0.60 a "large" effect size. There were statistically significant dif-
ferences in mean scores between Raters 1 and Rater 2 for PCL-R items 3, 7, 14, 19, and 20,
indicating small rater effects for the scoring of items 7, 14, and 20. Interestingly, one item
(Item 19, “Revocation of conditional release”), which can be scored in a quite objective man-
ner, still showed a large rater effect; 17% of the variance was attributable to rater type. This
effect may be the result of an incorrect scoring procedure being applied by some of Raters 1.
The manual recommends scoring Item 19 as "Omitted" when an evaluee has no previous charges in his adult criminal record prior to the current offense. In fact, some of Raters 1 instead scored Item 19 as "0." Because omitted items are prorated, that is, effectively assigned scores greater than 0 by scaling up the remaining items, incorrect "0" ratings make the calculated mean score of Factor 2 too low.
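Prorating scales the sum of the scored items up to the full item count. The manual supplies tables for this; the sketch below only illustrates the underlying multiplicative rule, with the 10-item Factor 2 count taken from Hare (2003):

```python
def prorate(raw_sum, n_items, n_omitted):
    """Prorate a PCL-R scale score for omitted items by scaling the
    obtained sum up to the full item count. Illustrative only; the
    manual's prorating tables are authoritative."""
    if n_omitted >= n_items:
        raise ValueError("cannot prorate a fully omitted scale")
    return raw_sum * n_items / (n_items - n_omitted)

# If Item 19 is omitted from the 10-item Factor 2, a raw sum of 12
# over the 9 scored items prorates upward instead of staying at 12:
factor2 = prorate(12, 10, 1)  # 13.33...
```

Scoring an item "0" when it should be omitted skips this upward adjustment, which is why such errors bias Factor 2 means downward.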
Item 3 (“Need for stimulation/proneness to boredom”) showed a “moderate” rater
effect, being more frequently rated as “2” by Raters 1 than by Rater 2, and 7% of the
variance was attributable to rater type. Item 14 (“Impulsivity”) showed a small, but sig-
nificant, rater effect. Results from generalizability theory showed that 4% of the vari-
ance was attributable to rater type. Item 7 (“Shallow affect”) is usually harder to rate.
Nevertheless, only 2% of variance was attributable to rater type (1 vs. 2). However, for
Rater 2 vs. Rater 3 (see Table 6), 23% of the variance in Item 7 was attributable to rater
type, which is probably due to the fact that Rater 3 had no opportunity to interact with
the patients.
Table 4 shows that ICCA1 values varied between .38 (Item 17) and .94 (Item 18, "Juvenile delinquency"), with an average of .73 across the 20 items. ICCA1 values were good (at least .80) for seven items and below this benchmark for 13 items, most notably Item 17 ("Many short-term marital relationships").
Twelve items showed zero variability contribution from rater type, which indicates that the
two types of raters (1 and 2) were interchangeable in ratings of these items. For example,
for Item 2 (“Grandiose sense of self worth”; see the last column in Table 4), 85% of the
variance was attributable to individual differences in the psychopathy indicator, 0% to rater
type (1 vs. 2), and 15% to other sources such as random error and patient–rater interaction.
Item 19 (“Revocation of conditional release”) showed the largest percentage of variability
attributable to rater type (17%).
Table 5 shows statistics for Raters 1 and 2 concerning the total score, the two factors, and
the four facets. We also compared scoring tendencies with norms (Hare, 2003, Table 9.2),
by presenting percentiles and T values of the total score, the two factors, and the four facets.
P. B. Harris, Boccaccini, and Murrie (2015) recommended this comparison between raters
because “rater differences likely impact psychopathy measure scores assigned by evaluators
in both research and clinical practice” (p. 330). Table 5 also shows ICCA1 with 95% CIs, and
TABLE 5: PCL-R Scores as Rated by Raters 1 and Rater 2 for Male Forensic Psychiatric Patients (N = 43)

Score | Raters 1, M (SD) | %ile | T value | Rater 2, M (SD) | %ile | T value | Mean difference (SD) | 95% CI | d | ICCA1 [95% CI]
PCL-R total | 21.36 (8.68) | 46.1 | 49 | 23.31 (9.12) | 57.4 | 52 | 1.95 (3.71)*** | [0.80, 3.09] | 0.22 | .89 [.76, .95]
Factor 1 (Interpersonal/Affective) | 8.77 (3.58) | 5.1 | 53 | 9.00 (4.36) | 5.1 | 53 | 0.23 (2.40) | [−0.51, 0.97] | 0.06 | .82 [.69, .90]
Facet 1 (Interpersonal) | 3.21 (2.05) | 60.4 | 50 | 3.53 (2.33) | 75.5 | 54 | 0.33 (1.46) | [−0.12, 0.77] | 0.15 | .78 [.62, .87]
Facet 2 (Affective) | 5.58 (2.05) | 73.4 | 55 | 5.58 (2.42) | 73.4 | 55 | 0.00 (1.43) | [−0.44, 0.44] | 0.00 | .80 [.66, .89]
Factor 2 (Social Deviance) | 10.85 (4.61) | 39.6 | 48 | 12.37 (4.32) | 50.0 | 50 | 1.52 (1.74)*** | [0.99, 2.06] | 0.34 | .88 [.50, .95]
Facet 3 (Lifestyle) | 6.37 (2.85) | 52.0 | 50 | 7.20 (2.63) | 70.6 | 54 | 0.83 (1.28)*** | [0.44, 1.23] | 0.30 | .85 [.62, .93]
Facet 4 (Antisocial) | 5.45 (2.84) | 40.9 | 46 | 6.31 (2.77) | 55.0 | 50 | 0.86 (1.26)*** | [0.48, 1.25] | 0.31 | .86 [.61, .94]
Note. ICCA1 with 95% CIs. Proportions of variance attributable to patients' individual differences in psychopathy indicators/Rater (1 vs. 2)/random error. The T value is a standardized score with a mean of 50 (SD = 10). T values of 40 to 60 are usually considered as "normal" for a normalized population (here: North American male forensic patients, N = 1,246). Paired-samples t tests were applied. Effect sizes d (Cohen's d) were calculated using an online calculator. The t tests and Cohen's d values were based on mean differences of raw PCL-R values. On the first occasion, one of Raters 1 (the 12 forensic raters; see Table 4 for details) rated the patient. On the second occasion, Rater 2 (the first author) rated each of the patients. The manual (Hare, 2003) provides percentile values and T values for score values that are whole numbers, and therefore the percentile values and T values in this table are approximations based on rounded mean scores from raters. Each rating was based on an interview and file data. PCL-R = Psychopathy Checklist–Revised; ICCA1 = ICC for absolute agreement, single rater score; CI = confidence interval; ICC = intraclass correlation coefficients.
proportions of variance attributable to patients, rater, and other sources for the total PCL-R
score, the two factors, and the four facets.
As can be seen in the last column of Table 5, interrater reliability was good (at least .80) for the total score, the two factors, and three of the four facets (Facets 2-4), and almost good (.78) for Facet 1, indicating that no more than about 22% of the variance in these PCL-R scores was attributable to sources other than the patients' psychopathy. Only small percentages of variance (0%-5%) were
attributable to rater type (1 vs. 2) on the total score, factor, and facet level, and relatively
high percentages of variance (7%-22%) were attributable to other sources. This means that
rater type (1 vs. 2) had no great effects on scoring.
Raters 1 scored the patients generally lower than Rater 2, with the exception of Facet 2
where both types of raters rated the patients at a T value of 50. The difference in mean
PCL-R total score between Raters 1 and Rater 2 was significant (a small effect). Paired t
tests showed no difference between Raters 1 and Rater 2 in the mean score of Factor 1, but
there was a significant difference in the mean score of Factor 2 (a moderate effect). Even if
Item 19 is deleted, the difference in mean score of Factor 2 remains significant, t(42) = 4.41, p < .001.
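The factor-level comparisons in Table 5 follow this pattern. A sketch with SciPy is given below; here d is computed against the pooled SD of the two raters' scores, a common convention that reproduces the reported magnitudes (e.g., 1.95 / ~8.9 ≈ 0.22 for the total score), although the exact calculator used for the published values is not specified:

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_comparison(rater1_scores, rater2_scores):
    """Paired t test plus a Cohen's d computed against the pooled SD
    of the two score sets (one common convention; the study reports
    using an online calculator for its d values)."""
    a = np.asarray(rater1_scores, dtype=float)
    b = np.asarray(rater2_scores, dtype=float)
    t, p = ttest_rel(a, b)                    # paired-samples t test
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = abs(b.mean() - a.mean()) / pooled_sd  # standardized mean diff.
    return t, p, d

# Hypothetical paired ratings of the same five patients (illustration).
t, p, d = paired_comparison([18, 25, 30, 12, 27], [20, 26, 33, 15, 28])
```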
Table 6 (the first row for each variable) presents the interrater reliability (Raters 1 vs.
Rater 2 as well as Rater 2 vs. Rater 3) for the total score, the two factors, the four facets, and
each item. These statistics were calculated for each of those raters (Raters 1-3, 1-4, 1-6, 1-7,
and 1-10) who rated at least four patients, as well as for Rater 3. Table 6 also shows results
from generalizability theory (estimated percentages of variance components; the second
row for each variable).
An analysis on the item/rater level suggested that poor ICCA1 values for some items were mainly attributable to two outliers among Raters 1 (Rater 1-4 and Rater 1-7). One of them (Rater 1-7) was an outlier on seven of the 20 items, producing ICCA1 values below .60 relative to Rater 2 (and also in comparison with the rest of Raters 1), despite having good ICCA1 values
in 13 of the 20 items. For example, the “difficult” (Rufino et al., 2011) Item 14 (“Impulsivity”)
was rated by Raters 1-4, 1-6, and 1-10 in almost perfect agreement with Rater 2, while only
Rater 1-7 deviated considerably in this respect. CIs surrounding individual item ratings for
two raters (Rater 1-4 and Rater 1-7) are too large to be of any practical use.
Results from generalizability theory analyses of the PCL-R total score showed that only
a small amount of the variance was attributable to the type of rater (0%-9%). This variabil-
ity was larger compared with the results presented in Table 5; only the two outlier raters (Rater 1-4 and Rater 1-7) accounted for higher figures (7%-9%), even though ICCA1 values were
in both cases above .80. On the level of individual items, most of the variance in cases of
lower ICCA1 values was attributable to other sources of error such as random error and
patient–rater interaction.
Table 6 also reports PCL-R interrater reliability where, post factum, Rater 3 rated the 17
patients who were rated by Rater 2 but not by Raters 1, using file data only. It should be
noted that generalizability theory identified the ratings by Rater 3 as quite comparable with
those by Raters 1, with 0% or 1% of the variance attributable to rater type in 12 of the 20
items. However, in the remaining eight items, this percentage was between 3% and 17%.
TABLE 6: ICCA1 With 95% Condence Intervals, for PCL-R Ratings on the Item Levels, by Each of Those Forensic Raters (Raters 1), who Rated Four to 10
Patients at Time 1, and Rater 3 Who Rated 17 Patients at Time 3, Compared With Ratings by Rater 2, Who Rated Each of the 60 Patients at Time 2
Rater 1-3
n = 8
Rater 1-4
n = 4
Rater 1-6
n = 10
Rater 1-7
n = 6
Rater 1-10
n = 6
Rater 3
n = 17
PCL-R total .77 [0.24, 0.95]
.93 [−0.01,1.00]
.93 [0.77, 0.98]
.87 [−0.03, 0.98]
.97 [0.81, 1.00]
.87 [0.68, 0.95]
Factor 1 .88 [0.53, 0.98]
.91 [0.02, 0.99]
.83 [0.45, 0.96]
.82 [0.25, 0.97]
.88 [0.44, 0.98]
.79 [0.52, 0.92]
Facet 1—Interpersonal
.72 [0.06, 0.94]
.88 [0.23, 0.99]
.77 [0.33, 0.94]
.83 [0.24, 0.97]
.79 [0.04, 0.97]
.80 [0.54, 0.92]
Item 1a .43 [−0.38, 0.86]
.87 [0.56, 0.97]
.21 [−0.26, 0.79]
.44 [−0.68, 0.90]
.89 [0.73, 0.96]
Item 2 .80 [0.33, 0.96]
.87 [0.18, 0.99]
.87 [0.58, 0.97]
.88 [0.44, 0.98]
.57 [0.14, 0.82]
Item 4 1.00
.57 [−0.48, 0.97]
.51 [−0.16, 0.85]
.86 [0.39, 0.98]
.62 [−0.31, 0.94]
.51 [0.02, 0.79]
Item 5 .50 [−0.35, 0.88]
.63 [−0.20, 0.97]
.75 [0.26, 0.93]
.58 [−0.13, 0.93]
.77 [−0.04, 0.97]
.69 [0.32, 0.88]
Facet 2—Affective .92 [0.68, 0.98]
.86 [0.17, 0.99]
.80 [0.42, 0.95]
.57 [−0.46, 0.93]
.87 [0.42, 0.98]
.75 [0.44, 0.90]
Item 6a .77 [0.27, 0.95]
.80 [−0.05, 0.99]
.93 [0.76, 0.98]
.58 [0.14, 0.82]
Item 7 .78 [0.29, 0.95]
.63 [0.09, 0.89]
.80 [0.20, 0.97]
.83 [0.28, 0.97]
.30 [−0.10, 0.65]
Item 8 .84 [0.43, 0.96]
.80 [−0.05, 0.99]
.87 [0.58, 0.97]
.38 [−0.31, 0.87]
.71 [−0.03, 0.95]
.70 [0.32, 0.88]
Item 16a 1.00
.73 [−0.11, 0.98]
.86 [0.56, 0.96]
.38 [−0.76, 0.89]
.76 [0.12, 0.96]
.78 [0.50, 0.91]
Factor 2 .85 [0.18, 0.97]
.92 [0.20, 1.00]
.86 [0.31, 0.97]
.81 [−0.07, 0.97]
.97 [0.83, 1.00]
.90 [0.67, 0.97]
Facet 3—Lifestyle
.93 [0.73, 0.99]
.87 [−0.06, 0.99]
.90 [0.67, 0.97]
.85 [−0.02, 0.98]
.87 [0.42, 0.98]
.82 [0.58, 0.93]
Item 3 .92 [0.69, 0.98]
.39 [−0.25, 0.93]
.60 [0.05, 0.88]
.92 [0.61, 0.99]
.60 [−0.20, 0.93]
.72 [0.39, 0.89]
Item 9 1.00
.81 [0.02, 0.99]
.90 [0.67, 0.97]
.78 [0.16, 0.97]
.83 [0.28, 0.97]
.78 [0.50, 0.92]
Item 13 .77 [0.27, 0.95]
.76 [0.31, 0.93]
.83 [0.28, 0.97]
.77 [−0.02, 0.97]
.59 [0.19, 0.83]
Item 14a .63 [−0.13, 0.92]
.21 [−0.26, 0.79]
.90 [0.51, 0.99]
.82 [0.56, 0.93]
Item 15 .50 [−0.35, 0.88]
.57 [−0.14, 0.96]
.76 [0.31, 0.93]
.55 [−0.17, 0.92]
.90 [0.52, 0.99]
.75 [0.43, 0.90]
Facet 4—Antisocial
.76 [0.06, 0.95]
.98 [0.79, 1.00]
.89 [0.57, 0.97]
.18 [−0.31, 0.78]
.96 [0.68, 1.00]
.90 [0.56, 0.97]
90/4/6
Item 10 1.00
.68 [0.12, 0.91]
.71 [−0.02, 0.95]
.75 [−0.11, 0.96]
.71 [0.36, 0.88]
Item 12 .77 [0.27, 0.95]
77/0/23
.86 [0.39, 0.98]
.60 [−0.20, 0.93]
.80 [0.45, 0.93]
Item 18 1.00
.87 [0.55, 0.97]
.92 [0.61, 0.99]
.88 [0.44, 0.98]
.87 [0.68, 0.95]
Item 19 .32 [−0.15, 0.78]
32/40/28
.64 [−0.20, 0.97]
.37 [−0.13, 0.78]
.23 [−0.60, 0.84]
.87 [0.38, 0.98]
.64 [0.23, 0.85]
Item 20 .76 [0.19, 0.95]
.00 [−0.57, 0.72]
.32 [−0.37, 0.85]
.72 [0.39, 0.89]
Other Items
Item 11 −.51 [−1.06,
.98 [0.81, 1.00]
.93 [0.76, 0.98]
−.22 [−1.14, 0.70]
.38 [−0.06, 0.71]
Item 17 .08 [−0.76, 0.73]
.00 [−0.88, 0.88]
.57 [0.02, 0.87]
.00 [−0.36, 0.65]
Note. Proportions of variance attributable to patients' individual differences in psychopathy indicators/rater type/other sources such as random error and patient–rater interaction. The two-way random effects model (McGraw & Wong, 1996) for a single rater with absolute agreement was used. Raters 1-3, 1-4, 1-6, 1-7, and 1-10 were forensic social investigators, while Rater 2 was an independent academic researcher. Rater 3 was trained in PCL-R file data ratings and was an experienced research assistant (doctoral student in psychology). Raters 1 and 2 rated the patients using separate interviews and file data, while Rater 3 used file data only. ICCA1 = ICC for absolute agreement, single rater score; PCL-R = Psychopathy Checklist–Revised; ICC = intraclass correlation coefficients.
aIndicates items that were assessed by Rufino, Boccaccini, and Guy (2011) as the most difficult to score.
This indicates that Rater 3, who scored individual items without the opportunity to interact with the evaluees, used the file data with an ability comparable to that of Raters 1 to focus on diagnostically relevant issues. It is, therefore, evident that the file data were of high quality, and that Rater 3 was conscientious in her effort to carefully score items that require a great degree of subjective judgment. As shown in the last column in Table 6, variability attributable to
rater differences (2 vs. 3) for Items 1, 6, 14, and 16 was 0% to 1%. These four items were
ranked by Rufino et al. (2011) as the most difficult to score because they were viewed as
requiring the highest degree of subjective judgment. Here, all types of raters (1, 2, and 3; except for the two outlier raters, Rater 1-7, who showed 37% rater-type variability on Item 1, and Rater 1-4, who showed 9% rater-type variability on Item 16) showed good agreement with Rater 2 on these four items. Thus, we found no clear tendency toward scoring
disagreement among the raters regarding these four items, and therefore we cannot support
Rufino et al.’s (2011) suggestion that they are especially difficult to score. Possibly, this
may still be, to some extent, true for Item 1 (“Glibness/superficial charm”) because for three
of the raters only 21% to 41% of the variance was attributable to patient differences.
Interestingly, Rater 3, who never met the patients, was one of the three raters who had the
highest agreement with Rater 2 regarding Item 1.
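The variance proportions discussed above come from a generalizability-theory decomposition of each item's scores into patient, rater, and residual components (Brennan, 2001; Cronbach et al., 1972). The following is a minimal sketch of such a decomposition for a fully crossed patients × raters design with one score per cell; it illustrates the general technique only, and is not a reproduction of this study's analysis (which involved a more elaborate rater structure and was run in R).

```python
def variance_components(scores):
    """Variance decomposition for a fully crossed patients x raters design
    with one score per cell, via the standard generalizability-theory
    expected mean squares. Returns the proportion of total variance
    attributable to patients, raters, and residual error."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[i][j] for i in range(n_p)) / n_p for j in range(n_r)]

    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ms_e = sum((scores[i][j] - p_means[i] - r_means[j] + grand) ** 2
               for i in range(n_p) for j in range(n_r)) / ((n_p - 1) * (n_r - 1))

    var_p = max((ms_p - ms_e) / n_r, 0.0)   # patient (universe-score) variance
    var_r = max((ms_r - ms_e) / n_p, 0.0)   # rater leniency/severity variance
    total = var_p + var_r + ms_e
    return {"patients": var_p / total, "raters": var_r / total,
            "error": ms_e / total}
```

For example, if one rater scores every patient exactly one point higher than the other, the decomposition attributes part of the variance to raters rather than to patients, which is precisely what the rater-type percentages in Table 6 capture.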
The interrater reliability was good for the total score, the two factors, and three of the
four facets (and .78 for Facet 1), and therefore, as specified in the PCL-R manual, the scores
of Rater 2 were averaged with scores from Raters 1 or 3 to obtain the final PCL-R scores.
The final mean total score for all 60 participants was 23.8 (SD = 8.66), while it was 9.04 (SD
= 3.86) for Factor 1, and 12.36 (SD = 5.07) for Factor 2. This total score, compared with
results from North American male forensic psychiatric patients (see Hare, 2003, Table 9.2),
was at the 62.5th percentile (T = 54). Mean T values for Factors 1 and 2 were 53 and 50,
respectively, indicating that the results may be considered as “normal” for a forensic psy-
chiatric population. Figure 1 presents the distribution of the total PCL-R score (average of
ratings by Rater 2 and by Raters 1 or 3).
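The T values reported here follow the usual linear transformation to a scale with mean 50 and SD 10 relative to a normative sample (here, Hare's 2003 North American male forensic patients, whose normative mean and SD are not reproduced in this article). A minimal sketch, with the normative parameters left as inputs:

```python
def t_score(raw, norm_mean, norm_sd):
    """Linear T transformation: expresses a raw score on a scale with
    mean 50 and SD 10 relative to a normative sample."""
    return 50.0 + 10.0 * (raw - norm_mean) / norm_sd
```

A score one normative SD above the normative mean thus maps to T = 60, and the "normal" band of 40 to 60 mentioned later corresponds to one SD on either side of the normative mean.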
The internal consistency of the total score, Factor 1, and Factor 2 was assessed using
Cronbach’s alpha (Nunnally, 1978). The mean interitem correlation was also calculated
(Green, Lissitz, & Mulaik, 1977), as well as the corrected item-total correlation for each
item. Table 7 presents descriptive statistics and corrected item-total correlations for the
individual items.
The internal consistency of the PCL-R was high, as indicated by Cronbach’s alpha values
of .92 for the PCL-R total score, and of .87 for both Factor 1 and Factor 2. It was also
acceptable on the facet level, .82, .83, .82, and .78 for Facet 1, Facet 2, Facet 3, and Facet
4, respectively. The mean interitem correlation was .38 for the PCL-R total score, and .45
and .41 for items in Factors 1 and 2, respectively. It was .53, .55, .48, and .42 for Facet 1,
Facet 2, Facet 3, and Facet 4, respectively.
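The three internal-consistency statistics just reported can be computed from a persons × items score matrix as follows. This is a generic sketch of the standard formulas (Cronbach's alpha, mean of the off-diagonal interitem correlations, and each item's correlation with the total of the remaining items), not the authors' code.

```python
def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def internal_consistency(items):
    """Cronbach's alpha, mean interitem correlation, and corrected
    item-total correlations for a persons x items score matrix."""
    from statistics import variance
    cols = list(zip(*items))                    # one tuple per item
    k = len(cols)
    totals = [sum(row) for row in items]
    alpha = (k / (k - 1)) * (1 - sum(variance(c) for c in cols)
                             / variance(totals))
    inter = [pearson(cols[i], cols[j])
             for i in range(k) for j in range(i + 1, k)]
    # Corrected item-total: each item vs. the sum of the *other* items.
    corrected = [pearson(cols[j], [t - x for t, x in zip(totals, cols[j])])
                 for j in range(k)]
    return alpha, sum(inter) / len(inter), corrected
```

The "corrected" item-total correlation removes the item itself from the total, so an item cannot inflate its own correlation with the scale.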
Assessing the interrater reliability of the PCL-R in real-world settings has obvious impor-
tance. Our results add to a growing body of research on PCL-R interrater reliability, using
generalizability theory. Here, Raters 1, 2, and 3 used similar, comprehensive file data, and
Raters 1 and 2 conducted separate clinical interviews. This is an uncommon feature of the
present study, which is also one of a small number of studies (see Table 2) using generaliz-
ability theory to locate the sources of unreliability of individual PCL-R items. Our results
show that (a) ICCA1 values were good (at least .80) for the total and factor scores, and were
.78 to .86 for the four facets; (b) the variability in PCL-R scores was low (0%-5%) among
the three types of raters on the total, factor, and facet scores, while it was higher on the item
level. Among Raters 1, we identified two outliers who deviated more than the others from
Rater 2 in their ratings regarding psychopathy indicators. Rater 3, who rated patients post
factum, rated, in general, similarly to the other raters, who had the opportunity to interview
the patients. This may suggest that, given satisfactory quality of file data, PCL-R ratings
without an interview may reach good reliability for a well-trained and experienced rater.
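ICCA1 throughout denotes the intraclass correlation for absolute agreement, single rater, that is, ICC(A,1) in McGraw and Wong's (1996) notation. As an illustration of how it follows from the two-way mean squares (not the software used in the study), a minimal sketch:

```python
def icc_a1(scores):
    """ICC(A,1): intraclass correlation for absolute agreement, single
    rater (McGraw & Wong, 1996), for an n x k subjects-by-raters matrix
    with one score per cell (two-way model)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    s_means = [sum(row) / k for row in scores]
    r_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    ms_r = k * sum((m - grand) ** 2 for m in s_means) / (n - 1)   # rows: subjects
    ms_c = n * sum((m - grand) ** 2 for m in r_means) / (k - 1)   # columns: raters
    ms_e = sum((scores[i][j] - s_means[i] - r_means[j] + grand) ** 2
               for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
```

Because the rater (column) mean square appears in the denominator, a rater who is systematically more lenient lowers ICC(A,1) even when the rank ordering of subjects is identical, which is why absolute agreement is the stricter, and here the more relevant, criterion.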
The current study is not a classic field study where the data, according to the definition
by Edens and Boccaccini (2017), are collected “for the purpose of informing clinical, foren-
sic, or correctional decision making in regard to the examinee” (p. 600). Neither is it a labo-
ratory study where raters watch video recordings and have no opportunity to interact with the
evaluee. Instead, it is a research study performed in a field setting, with independent raters
(1 and 2) who conducted live interviews with forensic psychiatric patients, although these
interviews were not part of the forensic psychiatric evaluation. We also compared ratings
performed by Rater 2 and by Rater 3, who only used file data. These sets of ratings gave us
a unique opportunity to assess interrater reliabilities in our sample.
Overall, regardless of the method used, the reliabilities we found for the scores of PCL-R
total, Factor 1, and Factor 2, as well as the facets, are comparable to the values presented in
TABLE 7: Descriptive Statistics and Corrected Item-Total Correlations (r) of Averaged Scores From Raters 1 and 2, or 2 and 3, on Each Individual PCL-R Item
Item No. M SD Median Kurtosis Skewness r
1 0.86 0.74 1 −1.32 0.22 .54
2 0.82 0.72 1 −1.02 0.40 .65
3 1.37 0.73 1.5 −0.82 −0.78 .60
4 0.77 0.64 0.96 −0.74 0.37 .58
5 0.95 0.69 1 −1.22 −0.01 .66
6 1.58 0.64 2 0.50 −1.31 .60
7 1.22 0.65 1 −0.92 −0.34 .69
8 1.40 0.63 1.5 −1.07 −0.50 .65
9 1.34 0.70 1.5 −0.80 −0.66 .59
10 1.63 0.62 2 1.84 −1.72 .53
11 0.49 0.67 0 0.12 1.18 .46
12 1.35 0.71 1.5 −0.75 −0.70 .66
13 1.38 0.64 1.5 −0.74 −0.58 .55
14 1.46 0.62 1.5 −0.14 −0.89 .58
15 1.48 0.61 1.5 −0.17 −0.93 .79
16 1.50 0.63 2 −0.31 −0.93 .59
17 0.36 0.48 0 0.07 0.98 .44
18 0.87 0.83 1 −1.58 0.22 .62
19 1.09 0.69 1 −1.36 0.05 .55
20 1.09 0.79 1 −1.44 −0.18 .56
Note. N = 60. See Table 2 for a description of the 20 PCL-R items. PCL-R = Psychopathy Checklist–Revised.
the manual (Hare, 2003). This provides a counterpoint to recent critical studies (e.g., Edens,
Boccaccini, & Johnson, 2010; Edens et al., 2015) and is in line with a Swedish field study
(Sturup et al., 2014) where the PCL-R was used to rate offenders with life sentences. Our
discussion will therefore focus on the variability contributions from raters (and other sources
of errors) regarding individual items.
Only one of the studies published to date (Ismail & Looman, 2016) reached an average
value of ICCA1 across the 20 items as high as in the current study (see Table 2). In our study,
this average value was .73, while in the reviewed studies (see Table 2), it varied between .35
(C. S. Miller et al., 2012) and .75 (Ismail & Looman, 2016). Our ICCA1 measures (Table 4)
show good reliability (at least .80) for seven items. Only two items showed very low reli-
ability: Items 17 and 19 (ICCA1 = .38 and .43, respectively). The low reliability of Item 17 is in line with
Miller et al. and Rufino et al. (2011), but the poor reliability of Item 19 stands in contrast to
past research. Whereas Item 17 was interpreted differently by different raters in the present
study, the poor reliability of Item 19 (“Revocation of conditional release”) is due to the fact
that, as was discussed in the “Results” section, some of Raters 1 coded this item as “0” for
those patients who had no prior contact with the criminal justice system, whereas the rest of
the raters followed the PCL-R manual’s recommendation of “Omit.” It seems reasonable to
change this recommendation to “0.”
Item 17 concerns “Many short-term marital relationships.” This concept is ill-defined,
and the item is quite irrelevant in Sweden, where many “marital” relationships take nontra-
ditional forms (such as cohabiting without marriage, or living separately from one’s part-
ner). Therefore, the rating depends on both the participant’s and the rater’s definitions of a
“short-term marital relationship.” This issue is clearly mirrored in Table 6. The proportion
of variance attributable to patients in scores of Item 17 was 0% for three of the six indi-
vidual raters (Rater 1-4, 1-10, and Rater 3).
Significant differences arose between Raters 1 and Rater 2 in their mean scores of five
items: 3, 7, 14, 19, and 20 (Table 4). Only one of these items (Item 14, “Impulsivity”) is,
according to Rufino et al. (2011), assumed to be more difficult to score than other items due
to its subjectivity. An inspection of the results presented in Table 6 shows that the difference
in mean score in Item 14 is only due to Rater 1-7 (and the variability attributable to rater
type is 0% in the other raters, including Rater 3), while for Item 3 (“Need for stimulation/
Prone to boredom”), it is due to Raters 1-4 and 1-6. The interval between the two rating
occasions was relatively short (about 1 month), and it is unlikely that these differences arose
from changes in the participants’ “true” scores. One cause may be a slight difference in the
depth of information possessed by the two types of raters regarding participants’ personality
traits, as well as individual raters’ conceptions of these traits. For example, impulsivity may
take different forms (e.g., cognitive impulsivity or behavioral impulsivity), and the manual
for scoring PCL-R does not differentiate between these forms. The manual recommends
looking for a pattern of criminal behavior without any clear planning and describes a person
whose behavior is generally impulsive, unpremeditated, and lacking forethought, and who
does things on the spur of the moment without thinking about the consequences. It is pos-
sible that the higher mean scores of Rater 2 for Items 3 and 14 arose because Rater 2 was
not a member of the forensic psychiatric team. The participants may have felt more confi-
dent, during this rater’s interview, that the information they provided would not be included
in their forensic psychiatric files. This may have led them to disclose more examples of
their impulsive sensation-seeking behavior. Furthermore, the fact that the participants in the
research project were at the same time involved in a forensic psychiatric evaluation per-
formed by Raters 1 may have caused some suspicion against these raters, which could have
contributed to the differences in PCL-R scores between Raters 1 and Rater 2. It should be
noted that for Item 14, ratings of Rater 3 agreed well with those of Rater 2 (see Table 6;
ICCA1 = .82, and 0% variability due to rater type) indicating that Raters 2 and 3 focused on
the “right” or a similar type of impulsivity while rating the patients. An inspection of Table
6 shows that only Rater 1-7 scored the offenders differently from Rater 2 on Item 14, indi-
cated by a very low (.21) value of ICCA1 and a high amount (32%) of variability attributable
to rater type, as well as a large portion of variability (47%) due to other sources of errors.
Taken together, these findings are in line with past research (see Table 2), where Item 14
was one of the three PCL-R items with a low ICCA1. Interestingly, impulsivity is central in
Hare’s (2003) description of psychopathy, but Cleckley’s (1976) 16 characteristics of psy-
chopathy do not explicitly comprise impulsivity. Cleckley only indicated the manifestation
of impulsivity in a psychopath’s inability to maintain and follow long-term goals.
More attention should be given in both research and field studies to the troubling find-
ings regarding Cohen’s kappa statistics for the dichotomous classification of psychopathy
(yes/no). This classification can have an impact on legal decisions, the common assumption
being that psychopaths are more dangerous and more prone to recidivism than nonpsycho-
paths. Our results are not impressive in this respect (see Table 3) but are still better than
those obtained in the field study by Murrie, Boccaccini, Johnson, and Janke (2008), who
report a kappa value of .13, or by DeMatteo et al. (2014), who report a kappa of .29.
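Cohen's (1960) kappa corrects the observed agreement on a dichotomous classification for the agreement expected by chance. The sketch below shows the computation; the cutoff of 30 used in the dichotomizing helper is the conventional North American research threshold and is assumed here for illustration only, not stated by this study.

```python
def cohens_kappa(a, b):
    """Cohen's (1960) kappa for two raters' binary (0/1) classifications:
    observed agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)   # chance agreement under independence
    return (p_o - p_e) / (1 - p_e)

def classify(totals, cutoff=30):
    """Dichotomize PCL-R total scores into psychopath (1) / nonpsychopath
    (0); the cutoff of 30 is an illustrative assumption."""
    return [int(t >= cutoff) for t in totals]
```

Note that two raters can both agree with chance-level frequency (kappa near 0) while still showing high raw percentage agreement, which is why kappa rather than percentage agreement is reported for such classifications.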
The finding of significant interrater differences for Items 3 (“Need for stimulation/Prone
to boredom”) and 14 (“Impulsivity”) is important, as these items are two of five indicators
of Factor 3 (“Irresponsible behavioral style”) from Cooke and Michie’s (2001) 17-item,
three-factor psychopathy model, and as they are also indicators of Facet 3 (“Impulsive life-
style”) in Hare’s (2003) 18-item, two-factor, four-facet model. Our finding of poor reliabil-
ity of Item 14 is in line with past research (e.g., Jeandarme et al., 2017; Sturup et al., 2014),
and this finding may have theoretical implications for the discussion of different instru-
ments for assessing psychopathy. In their review, Poythress and Hall (2011) have ques-
tioned the appropriateness of impulsivity as a diagnostic criterion of psychopathy.
One interesting finding was that the correlations between individual items and the total
PCL-R score for Items 11, 17, and 19 (Table 7) were all at least .44, which contrasts with
lower values (.13-.27) presented in the PCL-R manual (Hare, 2003, Table 5.2). Our findings
may have theoretical implications because due to their poor psychometric properties, shown
in prior research (Hare, 2003), Items 11 (“Promiscuous sexual behavior”) and 17 (“Many
short-term marital relationships”) are excluded from the scoring grid when summing scores
for the two factors and the four facets, although these items are included in the total score.
Poor reliability is due not only to adversarial bias; many other forms of bias or error can
contribute. Even among the most professional evaluators, there are many ways in which
items can be understood differently or scoring rules applied differently. The
importance of knowledge about the factors that can influence scoring of forensic instru-
ments (such as education, training, experience, and personality traits of raters, and quality
of file data) should be emphasized. Hare (2016b, p. 30) has already commented on this,
citing from his book chapter (Hare, 2016a) that “some perhaps are guided by personal
beliefs and philosophies about human nature, or by explicit or implicit biases against, or
attachments to, their clients; others may be too tender-hearted or tough-minded” (p. viii)
while conducting PCL-R ratings. And, of course, one may suspect that in all field settings,
clinicians consider themselves to work in an unbiased and professional manner. The nature
of bias is that people often consider themselves objective and professional, and are “blind”
to their own bias. Pronin, Lin, and Ross (2002) studied the “bias blind spot,” and their
results are relevant to understanding the rating discrepancies that researchers increasingly
find in studies of forensic evaluators.
The present study used ratings based on separate interviews (except for Rater 3 who did
not perform any interviews) conducted by different raters at different time points with real
interviewees, without contamination by adversarial bias. We have also analyzed rater differ-
ences on the item level, which is seldom possible in field studies because of the unavail-
ability of such data. The raters were both men and women, which may enhance the
generalizability of the findings. We found that psychopathy in male nonpsychotic patients
referred for a forensic psychiatric evaluation could be reliably assessed with the PCL-R—a
conclusion that holds for the total score, the two factors, and the four-facet scores, but is not
tenable for several individual items.
The results presented here were obtained with male offenders younger than 36 years
undergoing forensic psychiatric evaluation. They cannot be generalized to females or to
psychotic or mentally deficient patients. But we believe that they may be generalized to
field studies, and to other male forensic populations, as the T values in Table 4 indicate that
PCL-R scores in the present study were around a T value of 50 (40-60 is usually considered
as “normal”) with reference to the normative population referred to in the PCL-R manual
(North American male forensic patients, N = 1,246). In addition, Figure 1 shows that the
present sample received PCL-R total scores that were approximately normally distributed,
comprising low, medium, and high scores.
The possibility of obtaining a sample size typical of other clinical studies was limited due
to practical issues. Data collection for this study took about 1 year and involved 12 clini-
cians who voluntarily rated the patients. Our approach required Rater 2 to be available on
any given day during this year, because her ratings of a patient could not be made until his ordinary
forensic psychiatric assessment was finished. As some of Raters 1 rated only one to two
patients each, we had to exclude their ratings from the comparative analyses presented in
Table 6, as well as from those analyses that were carried out to assure that there was random
assignment of patients to raters.
Despite generally good interrater reliabilities, we found some differences between the
raters in their scoring tendencies on the level of individual items, and we have already
attempted to explain these differences. We consider all of the raters (Types 1, 2, and 3) as
being on a high and comparable level of education, training, and experience as regards the
use of PCL-R. Possibly, Raters 1 were less motivated than Rater 2 because the PCL-R rat-
ings were performed alongside their normal work. Future studies should include personal
characteristics of the raters, for example, their personality traits. A. K. Miller et al. (2011)
showed a significant negative correlation (–.51) between raters’ agreeableness and their
ratings on Facet 1. Miller et al. also showed a (nonsignificant) positive correlation between
raters’ conscientiousness and their ratings on Facet 1 and Facet 3 (.33 and .31, respectively),
and a negative correlation (–.35) with their ratings on Facet 4. One may speculate that pos-
sessing the latter personality trait may make raters more careful in their scoring and more
focused on important parts of file data. To keep the confidentiality of the 12 Raters 1, we
did not assess their personalities.
We conclude that despite these minor limitations, the comparison between Rater 2 and
five of Raters 1 gave a sufficiently nuanced picture of reality: Interrater reliability is often
insufficient on the item level. Some of the items showed particularly poor reliabilities,
which they also did in past research. This problem should be taken seriously in forensic
assessments, in general, and, in particular, in field assessments where PCL-R is used to sup-
port decisions regarding personal liberty. Future research in field studies should report reli-
abilities for individual items. It would be desirable if the court documents could contain
information about individual PCL-R items, which would allow more research on interrater
reliability. This information would also be suited for performing a larger item response
theory based study of item-level PCL-R scale properties in a real-world setting (Morizot,
Ainsworth, & Reise, 2007).
Careful assessment of psychopathy provides important assistance for risk assessment as
well as for the management of therapeutic efforts in prison settings and in forensic psychi-
atric care. Even trained and experienced evaluators who rate patients with potentially the
same psychopathy level (i.e., with a similar pattern of psychopathy-related traits) may pro-
duce different PCL-R scores, largely due to inconsistency of ratings on the item level. In
field contexts, several additional factors may suppress interrater reliability: There may be
restricted sets of offenders (e.g., with only high PCL-R scores), psychopathy assessment
may be biased by partisanship (Edens et al., 2010; Murrie, Boccaccini, Guarnera, & Rufino,
2013), and scoring tendencies may differ between raters due to inadequate training and
other factors (Boccaccini, Murrie, Rufino, & Gardner, 2014; P. B. Harris et al., 2015). More
work should be done to develop the skills of forensic psychiatric professionals in PCL-R
based psychopathy assessment (Boccaccini et al., 2014) and to maintain their mastery of
the scoring system (Hare, 2003).
Blais, J., Forth, A. E., & Hare, R. D. (2017). Examining the interrater reliability of the Hare Psychopathy Checklist-Revised
across a large sample of trained raters. Psychological Assessment, 29, 762-775.
Boccaccini, M. T., Murrie, D. C., Rufino, K. A., & Gardner, B. O. (2014). Evaluator differences in Psychopathy Checklist-
Revised factor and facet scores. Law and Human Behavior, 38, 337-345.
Boccaccini, M. T., Rufino, K. A., Jeon, H., & Murrie, D. C. (2017). Does the predictive validity of psychopathy ratings
depend on the clinical experience of the raters? International Journal of Forensic Mental Health, 16, 130-138.
Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some evaluators report consistently higher or lower PCL-R
scores than others? Findings from a statewide sample of Sexually Violent Predators evaluations. Psychology, Public
Policy, and Law, 14, 262-283.
Boccaccini, M. T., Turner, D. B., Murrie, D. C., & Rufino, K. A. (2012). Do PCL-R scores from state or defense experts best
predict future misconduct among civilly committed sex offenders? Law and Human Behavior, 36, 159-169.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag.
Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instru-
ments in psychology. Psychological Assessment, 6, 284-290.
Cleckley, H. (1976). The mask of sanity (5th ed.). St. Louis, MO: Mosby.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Cooke, D. J., & Michie, C. (2001). Refining the construct of psychopathy: Towards a hierarchical model. Psychological
Assessment, 13, 171-188.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory
of generalizability for scores and profiles. New York, NY: John Wiley.
Cronbach, L. J., Nageswari, R., & Gleser, G. C. (1963). Theory of generalizability: A liberation of reliability theory. The
British Journal of Statistical Psychology, 16, 137-163.
Dåderman, A. M., & Kristiansson, M. (2003). Degree of psychopathy: Implications for treatment in male juvenile delinquents.
International Journal of Law and Psychiatry, 26, 301-315.
DeMatteo, D., Galloway, M., Edens, J. F., Cox, J., & Smith, S. T. (2014). The role and reliability of the Psychopathy Checklist-
Revised in U.S. sexually violent predator evaluations: A case law survey. Law and Human Behavior, 38, 248-255.
Edens, J. F., & Boccaccini, M. T. (2017). Taking forensic mental health assessment “out of lab” and into “the real world”:
The introduction to the special issue on field utility of forensic assessments and procedures. Psychological Assessment,
29, 599-610.
Edens, J. F., Boccaccini, M. T., & Johnson, D. W. (2010). Inter-rater reliability of the PCL-R total and factor scores among
psychopathic sex offenders: Are personality features more prone to disagreement than behavioral features? Behavioral
Sciences & the Law, 28, 106-119.
Edens, J. F., Cox, J., Smith, S. T., DeMatteo, D., & Sörman, K. (2015). How reliable are Psychopathy Checklist–Revised
scores in Canadian criminal trials? A case law review. Psychological Assessment, 27, 447-456.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York, NY: John Wiley.
Forth, A. E., Hart, S. D., & Hare, R. D. (1990). Assessment of psychopathy in male young offenders. Psychological
Assessment, 2, 342-344.
Grann, M., Långström, N., Tengström, A., & Stålenheim, G. (1998). Reliability of file-based retrospective ratings of psy-
chopathy with the PCL-R. Journal of Personality Assessment, 70, 416-426.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index of unidimensionality.
Educational and Psychological Measurement, 37, 827-838.
Hare, R. D. (2003). Hare PCL-R: Technical manual (2nd ed.). Toronto, Ontario, Canada: Multi-Health Systems.
Hare, R. D. (2016a). Foreword. In C. B. Gacono (Ed.), The clinical and forensic assessment of psychopathy: A practitioner’s
guide (2nd ed., pp. xvi-xviii). New York, NY: Routledge.
Hare, R. D. (2016b). Psychopathy, PCL-R, and criminal justice: Some new findings and current issues. Canadian Psychology/
Psychologie Canadienne, 57, 21-34.
Harris, G. T., Rice, M. E., & Cormier, C. A. (2013). Research and clinical scoring of the Psychopathy Checklist can show
good agreement. Criminal Justice and Behavior, 40, 1349-1362.
Harris, P. B., Boccaccini, M. T., & Murrie, D. C. (2015). Rater differences in psychopathy measure scoring and predictive
validity. Law and Human Behavior, 39, 321-331.
Heilbrun, K. (1992). The role of psychological testing in forensic assessment. Law and Human Behavior, 16, 257-272.
Hildebrand, M., de Ruiter, C., de Vogel, V., & van der Wolf, P. (2002). Reliability and factor structure of the Dutch language
version of Hare’s Psychopathy Checklist-Revised. International Journal of Forensic Mental Health, 1, 139-154.
Hurducas, C. C., Singh, J. P., de Ruiter, C., & Petrila, J. (2014). Violence risk assessment tools: A systematic review of sur-
veys. International Journal of Forensic Mental Health, 13, 181-192.
Ismail, G., & Looman, J. (2016). Field inter-rater reliability of the Psychopathy Checklist-Revised. International Journal of
Offender Therapy and Comparative Criminology. Advance online publication. doi:10.1177/0306624X16652452
Jeandarme, I., Edens, J. F., Habets, P., Bruckers, L., Oei, K., & Bogaerts, S. (2017). PCL-R field validity in prison and hospital
settings. Law and Human Behavior, 41, 29-43.
Laurell, J., & Dåderman, A. M. (2007). Psychopathy (PCL-R) in a forensic psychiatric sample of homicide offenders: Some
reliability issues. International Journal of Law and Psychiatry, 30, 127-135.
Levenson, J. S. (2004). Reliability of sexually violent predator civil commitment criteria in Florida. Law and Human Behavior,
28, 357-368.
Lloyd, C. D., Clark, H. J., & Forth, A. E. (2010). Psychopathy, expert testimony, and indeterminate sentences: Exploring the
relationship between Psychopathy Checklist-Revised testimony and trial outcome in Canada. Legal and Criminological
Psychology, 15, 323-339.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological
Methods, 1, 30-46.
Miller, A. K., Rufino, K. A., Boccaccini, M. T., Jackson, R. L., & Murrie, D. C. (2011). On individual differences in person
perception: Raters’ personality traits relate to their Psychopathy Checklist-Revised scoring tendencies. Assessment, 18,
Miller, C. S., Kimonis, E. R., Otto, R. K., Kline, S. M., & Wasserman, A. L. (2012). Reliability of risk assessment measures
used in sexually violent predator proceedings. Psychological Assessment, 24, 944-953.
Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Towards modern psychometrics: Application of item response theory
models in personality research. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in
personality psychology (pp. 407-422). New York, NY: Guilford Press.
Murrie, D. C., Boccaccini, M. T., Guarnera, L. A., & Rufino, K. A. (2013). Are forensic experts biased by the side that
retained them? Psychological Science, 24, 1889-1897.
Murrie, D. C., Boccaccini, M. T., Johnson, J. T., & Janke, C. (2008). Does interrater (dis)agreement on Psychopathy Checklist
scores in sexually violent predator trials suggest partisan allegiance in forensic investigations? Law and Human Behavior,
32, 352-362.
Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C., & Tussey, C. (2009). Rater (dis)agreement on risk
assessment measures in sexually violent predator proceedings: Evidence of adversarial allegiance in forensic investiga-
tion? Psychology, Public Policy, and Law, 15, 19-53.
Neal, T. M. S., & Grisso, T. (2014). Assessment practices and expert judgment methods in forensic psychology and psychia-
try: An international snapshot. Criminal Justice and Behavior, 41, 1406-1421.
Nunnally, J. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Poythress, N. G., & Hall, J. R. (2011). Psychopathy and impulsivity reconsidered. Aggression and Violent Behavior, 16,
Pronin, E., Lin, D. Y., & Ross, L. (2002). The bias blind spot: Perceptions of bias in self versus others. Personality and Social
Psychology Bulletin, 28, 369-381.
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York,
NY: McGraw-Hill.
Rufino, K. A., Boccaccini, M. T., & Guy, L. S. (2011). Scoring subjectivity and item performance on measures used to assess
violence risk: The PCL-R and HCR-20 as exemplars. Assessment, 18, 453-463.
Rufino, K. A., Boccaccini, M. T., Murrie, D. C., & Hawes, S. W. (2012). When experts disagreed, who was correct? A com-
parison of PCL-R scores from independent raters and opposing forensic experts. Law and Human Behavior, 36, 527-537.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios.
Psychological Methods, 1, 199-223.
Streiner, D. L. (2001). Regression toward the mean: Its etiology, diagnosis and treatment. The Canadian Journal of Psychiatry,
46, 72-76.
Sturup, J., Edens, J. F., Sörman, K., Karlberg, D., Fredriksson, B., & Kristiansson, M. (2014). Field reliability of the
Psychopathy Checklist-Revised among life sentenced prisoners in Sweden. Law and Human Behavior, 38, 315-324.
Svennerlind, C., Nilsson, T., Kerekes, N., Andiné, P., Lagerkvist, M., Forsman, A., . . . Malmgren, H. (2010). Mentally dis-
ordered criminal offenders in the Swedish criminal system. International Journal of Law and Psychiatry, 33, 220-226.
Wong, S. (1988). Is Hare’s Psychopathy Checklist reliable without the interview? Psychological Reports, 62, 931-934.
Anna M. Dåderman is a professor of psychology and head of psychology at University West, Trollhättan, Sweden. She
earned her PhD in psychology at Stockholm University in 2002 and her MD in forensic psychiatry at Karolinska Institutet in
2005. Her research focuses on the development and improvement of psychological assessment instruments, and her teaching
is mainly on applied forensic psychology, as well as on methods and statistics and scientific writing at advanced level. She
also delivers expert opinions regarding forensic psychiatric cases.
Åke Hellström is an emeritus professor at Stockholm University. He earned his PhD in psychology at Stockholm University
in 1977 and is specializing in quantitative areas of psychology.
