Does peer review work? An experiment of experimentalism.

Author: Ho, Daniel E.
Position: Author abstract

Table of Contents

Introduction
I. Peer Review and Democratic Experimentalism
   A. Antecedents of Peer Review
   B. Democratic Experimentalism
   C. Limited Evidence Base
II. Food Safety
   A. Inspection Systems as Fertile Testing Ground
   B. Washington State
   C. King and Pierce Counties
      1. King County
      2. County comparison
III. Experimental Design
   A. Preparation and Rollout
   B. Randomized Peer Inspections
   C. Weekly Huddles
IV. Results
   A. Peer Inspections
   B. Independent Inspections
   C. Qualitative Results
V. Limitations
   A. Substantive
   B. Methodological
VI. Implications
   A. Peer Review
   B. Rules and Guidance
   C. New Governance, Old Problems?
   D. Quality Assurance
   E. Facile Reforms
Conclusion
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F
Appendix G
Appendix H
Appendix I
Appendix J

Introduction

Every day, thousands of frontline government officials carry out the law. These officials often have extensive discretion, and the quality and consistency of their decisions can vary dramatically.

This problem of inconsistency is endemic, spanning all areas of law, levels of government, and types of institutional structures. To provide a sense of the scope, consider the following:

The U.S. Citizenship and Immigration Services deploys some 450 asylum officers (1) and some 250 immigration judges (2) to decide whether an asylum applicant has a "well-founded fear" of persecution. (3) Cases are assigned irrespective of the merits within an office, but examiners and judges vary widely in granting relief. (4) In New York, of the judges hearing more than one hundred cases, one judge had a grant rate of 6% and another 91%. (5) Scholars have denounced the process as a form of "refugee roulette." (6) Judge Richard Posner lamented "a complete breakdown of this immigration adjudication business." (7) Judge Marsha Berzon described one immigration judge's decision as "def[ying] parsing under ordinary rules of English grammar." (8)

The Social Security Administration (SSA) employs over 1300 administrative law judges (ALJs) to adjudicate whether an individual is entitled to social security disability. (9) The determination hinges on whether the individual is unable to engage in "substantial gainful activity" for her age, education, and work experience. (10) Based on an exhaustive study of this adjudicative system, six leading scholars concluded that the "evidence is persuasive that the interjudge dispersion in reversal rates is truly a product of subjective factors, probably relating to the interpretative role of the ALJ rather than the investigative one." (11) They found that inter-ALJ consistency was the "most glaring" weakness of the ALJ system, as some ALJs reversed state agency determinations only about 10% of the time, while others reversed upwards of 90% of the time. (12) With respect to ALJs, Justice Scalia argued that "we should be concerned not about bias but about bona fide incompetence." (13) Jerry Mashaw argued that conventional due process doctrine has failed to produce adjudicatory fairness and that due process should be reconceptualized to mandate improvement in management. (14) Inconsistencies continue to plague the system. (15) In 2013, of the San Francisco Bay Area ALJs deciding more than forty cases that year, one had a grant rate of 15% and another above 90%. (16) In two out of three cases, decisions are reversed on appeal, (17) and the "massive unexplained differences" between ALJs (18) have led to assessments of the process as "rife with errors," (19) "systematically wrong," (20) and "wildly out of control." (21)

The Patent and Trademark Office employs some 9100 examiners (22) to decide whether an invention is novel, nonobvious, and useful so as to warrant a patent. (23) Patent grants (24) and the search for prior art (25) vary considerably across examiners. Claim language amendments appear similarly affected by examiners, so that patent scope, according to one scholar, is "remarkably sensitive to the happenstance of examiner identity." (26) One widely cited diagnosis: "There may be as many patent offices as there are patent examiners." (27)

The Centers for Medicare and Medicaid Services (CMS) contracts with states to conduct nursing home surveys for compliance with federal regulations. (28) By one count, inspectors enforced over a thousand regulations, (29) some involving highly discretionary or subjective judgments such as whether a home cares for residents in a manner that "maintains or enhances each resident's dignity." (30) The Government Accountability Office (GAO) found that an "important and continuing issue[]" is the "inconsistency among state surveyors." (31) One study attributed the low reliability of U.S. inspections to regulatory complexity: "How do [inspectors] cope with such a daunting task? The answer is that they do not. Some of the standards are completely forgotten.... " (32)

In response to allegations of child abuse or neglect, juvenile court judges decide whether to remove children from parental custody based on assessments of, for instance, "substantial risk of serious future injury." (33) One study of child welfare determinations from 1990 to 2001 in Illinois found statistically significant differences in removal rates across 409 case managers, even though cases were close to randomly assigned. (34) Other scholars assailed these standards as presenting "uncabinable discretion" with institutional "chaos, oppression, and tragic ineffectiveness." (35)

The Nuclear Regulatory Commission (NRC) employs roughly eight hundred staff members to conduct oversight inspections of some one hundred civilian nuclear reactors and thirty research reactors. (36) One audit concluded, "The subjectivity of some inspection criteria, coupled with considerable staff discretion, provides an environment for potential program inconsistency." (37) A study of forty inspectors found violation detection rates ranging from under 10% to over 60%. (38) "[N]ondetection is endemic.... " (39) Said one NRC section chief: "People can write requirements forever. But it's a case of the alligator mouth and the hummingbird stomach." (40)

Environmental health inspectors visit restaurants, food trucks, schools, nursing homes, and cafeterias to assess compliance with food safety regulations to prevent foodborne illness. Vagueness in health codes can give inspectors considerable discretion. For instance, health codes mandate that there be "adequate spacing" between foods to prevent cross-contamination. (41) A 2009 audit of New York City's health department, which then employed roughly 160 inspectors who were randomly assigned to inspect establishments, found that oversight was lacking. (42) Inspector variation was substantial. At the time, twenty-eight violation points would have been considered a failed inspection, (43) and some inspectors had average inspection scores as low as fifteen and others as high as fifty points. (44) (Between 1988 and 1989, forty-six staff and former staff members--over half of the seventy-person inspection corps at the time--pled guilty to or were convicted of extortion, with the U.S. Attorney noting that "[t]he city agency was the criminal enterprise.") (45) A 2015 audit found that the department "did not consistently attempt follow-up inspections," violating mandatory timing 50% of the time, and "supervisors failed to consistently perform [required] supervisory field inspections." (46) Due in part to this "inspector lottery," one study of over 120,000 New York inspections showed that scores from one routine, unannounced inspection had virtually no predictive power for future inspection outcomes. (47)

As Appendix A documents, the examples go on and on, spanning tax law, labor law, privacy law, vehicle safety, criminal sentencing, drug manufacturing, and occupational safety, to name a few. (48) Frontline decision-making is where "100 percent of bureaucratic implementation begins, and most of it ends." (49) And perceptions of arbitrariness in frontline decisions can seriously erode trust in government. (50)

Yet administrative law--the body of law most directly concerned with accurate and consistent public administration of the laws--has shockingly little to offer as a proven remedy. (51)

This Article makes the following six contributions. First, this Article investigates the lynchpin of "democratic experimentalism," the widely influential school in "New Governance" (52) that posits that peer review can help government agencies implement the law more effectively and consistently. (53) Peer review consists of the direct and deliberative evaluation of work product by peers in the discipline. In the experimentalist sense, peer review also entails efforts at programmatic improvement based on pooling the results of such reviews with feedback to frontline employees. Scholars have discussed numerous variations of peer review, but the literature provides little sense of how to affirmatively design an effective peer review program given real regulatory constraints. (54) This Article shows concretely how to design a prospective, affirmative, and effective intervention of experimentalist peer review within such constraints.

Second, this Article demonstrates how to empirically ground our understanding of the administrative state by designing and tailoring a randomized controlled trial (RCT) of peer review in an actual regulatory enforcement setting. The evidence base for peer review remains weak, consisting primarily of limited case studies. (55) Observational inferences about one-time interventions are inherently fragile, as it can be difficult to attribute outcome differences to the intervention. RCTs, by contrast, represent the gold standard for assessing the causal effects of an intervention. (56) Randomization ensures that differences in outcomes can be credibly attributed to the intervention. In collaboration with the largest health department in Washington State (Public Health--Seattle & King...
