Author: Guha, Neel

TABLE OF CONTENTS

I. INTRODUCTION
II. BACKGROUND: MODERN DISCOVERY AND TAR
    A. Discovery Standards and the FRCP
    B. Discovery and Technology-Assisted Review
    C. Court-Imposed Standards, Cooperation, and Transparency in TAR
III. THE TAR GAMESMANSHIP AND ABUSE FRAMEWORK
    A. Rules and Standards of Discovery Abuse
    B. The Potential for TAR Abuse
IV. IDENTIFYING TAR VULNERABILITIES
    A. Seed Set Composition and Data Distribution
    B. Data Content and Composition: Data Poisoning and Adversarial Examples
    C. Data Labeling: Hidden Stratification and Underspecification
    D. Sampling Strategy and Choice of Stopping Point for Active Learning Systems
    E. Validation Method and Aggregate Metrics
    F. Role of Proprietary Datasets
V. EVALUATING TAR ABUSE: POSSIBLE BUT PREVENTABLE
    A. Existing Sanctions and Counter-Moves Limit the Risks of TAR Abuse
    B. TAR, Gamesmanship, and New Sanctions?
VI. SAFEGUARDING TAR & DISCOVERY: BEST PRACTICES, METRICS, AND BENCHMARKS, NOT TRANSPARENCY
    A. The End of Process Transparency and the Rise of Algorithmic Transparency
    B. Short Term: Updating Protocols with Better Metrics and Disclosures
    C. Long Term: Sedona Working Group on Benchmark Methods and Additional Research
VII. CONCLUSION
APPENDIX

I. INTRODUCTION

In 2016, a group of plaintiffs sued the City of New York in federal court, alleging that an affordable housing program discriminated against minority applicants. (1) After surviving a motion to dismiss, plaintiffs "sought wide-ranging discovery, which the City... resisted vigorously." (2) Over the following two years, "plaintiffs lodged numerous complaints about the pace of discovery" and the court responded by directing the City to use "Technology Assisted Review ('TAR') software ... to hasten" the process of searching through millions of documents. (3) TAR software uses machine learning algorithms to identify documents responsive to a discovery request. While TAR was supposed to resolve discovery disputes, it instead spurred a new set of quarrels. Plaintiffs objected that TAR software was "improperly trained on what constitutes a responsive and non-responsive document" and therefore failed to produce documents that were "truly responsive" to plaintiffs' discovery requests. (4) The court disagreed with plaintiffs but reviewed the TAR process in camera and ordered defendants to produce further details about the training method. (5) Winfield v. City of New York is now on the verge of trial, and plaintiffs' ability to prove their claims hinges on the accuracy of TAR.

As Winfield demonstrates, much of our civil justice system now depends on the accuracy of e-discovery and, more specifically, TAR. Recent cases involving heated disputes on the use of TAR include claims that the City of Chicago Fire Department discriminated against women applicants, (6) a large antitrust claim by Epic against Apple, (7) and a class action claim against Barnes & Noble over its failure to pay employee wages under the Fair Labor Standards Act. (8) Even Department of Justice antitrust approval of corporate mergers depends on compliance with a complex TAR protocol. (9) And according to some general counsel, TAR has also reshaped the relationship between in-house and outside counsel, forcing them to increase collaboration. (10) If, as some have argued, discovery is the "backbone of American litigation," (11) then TAR is the engine that moves discovery forward.

For nearly two decades, technologies have reinvented discovery in complex litigation, normalizing the use of TAR. "Predictive Coding" and "Continuous Active Learning" are but two commonly cited terms representing a variety of algorithms, software, and methods that fall under the general umbrella of TAR. (12) Attorneys and data vendors use TAR to speed up the discovery process and decrease the costs of review. Done well, TAR is welfare enhancing, as it makes discovery more accessible, saves thousands of hours of manual review, and helps parties find relevant documents. (13) While manual review can be riddled with problems, including human error, fatigue, and costs, (14) TAR at its best can leverage technologies that make litigation more efficient and fairer. (15) That is why diverse groups, from plaintiffs' attorneys and defense counsel, to the Department of Justice and the Federal Trade Commission, embrace TAR in their cases. (16)

Yet, debates over the use of TAR are heating up and the e-discovery community is nearing an inflection point. A budding literature casts TAR as opaque, open to abuse, and unduly benefiting repeat players at the cost of small litigants. (17) Defense counsel, for their part, complain that plaintiffs' attorneys have weaponized TAR and are exploiting an emphasis on transparency to increase costs, stymie innovations, and force defendants to release confidential information. (18) On top of this developing maelstrom, advances in discovery tech are growing beyond the reach of most attorneys and into the realm of computer science. (19) In turn, this change has provoked anxiety in the legal profession about a lack of control over discovery technology. Judges, attorneys, and scholars warn that lawyers need to supervise the technical aspects of TAR and avoid the dangers of abuse and risks of opaque technology. (20) This growing chorus of commentators has offered an array of reforms ranging from radical transparency (by the compelled sharing of the information used to train the algorithm) to third-party validation and has even proposed flipping the responsibility for running TAR searches. (21)

A key emerging question is "whether TAR increases or decreases gaming and abuse" (22) and how the legal field should respond. Gamesmanship has always been a part of discovery, with attorneys employing techniques to avoid producing valuable documents or to drown opponents in irrelevant documents. (23) However, the new twist is whether TAR can expand or transform abusive strategies, and whether attorneys can effectively safeguard the discovery process. Scholars like David Engstrom and Jonah Gelbach worry that "automated discovery might breed more abuse, and prove less amenable to oversight, than an analog system built upon 'eyes-on' review." (24) Engstrom and Gelbach note that as technology advances, "lawyers will progressively cede professional jurisdiction to technologists" and "discovery disputes will play out as expert battles in which dueling technologists opine about the propriety of data manipulations." (25) Others, including Seth Endo, argue that predictive coding in discovery can diminish participation values in the system and promote gamesmanship. (26) Dana Remus similarly warns that TAR does not eliminate discovery abuse because "lawyers who train the computer systems can continue to [employ] aggressive and even abusive" strategies with algorithms. (27) Still others worry that attorneys cannot "uncritically rely on outside advisors" to resolve their problems--they must do the hard work themselves. (28)

While scholars have identified important gaps in the system, they have defined neither the precise risks nor the appropriate boundaries for debate. A few open questions are clear: What, exactly, is the potential for abuse of TAR? Does TAR increase abuse? If so, how? And what can opposing counsel do about it?

In this Article, we investigate the possibilities of abuse and gamesmanship in technology-assisted discovery. We do so with three main goals in mind. First, our most basic aim is to provide a technical but accessible assessment of the potential for TAR abuse. To do so, we use the latest computer science research to break down the different ways that TAR can go awry, either due to intentional abuse or mistakes. Second, with a better understanding of how discovery can be subverted, we then map out potential remedies and reframe current debates in a more helpful light. Finally, we propose reforms to improve the system in the short and long term, with an emphasis on improved metrics that can more fully measure the quality of TAR. By exploring the technical background of discovery abuse we also seek to demystify the engineering substrate of modern discovery. Undertaking this study shows that lawyers--with the right technical knowledge and assistance--can safeguard technology-assisted discovery. There is no need for attorneys to surrender professional jurisdiction to engineers. (29)

Parts II and III of the Article provide a basic background on TAR and FRCP discovery standards and build a framework to evaluate discovery abuse in TAR. In Part IV, the heart of the Article, we then expose TAR to the most cutting-edge engineering research on algorithmic "attacks," or attempts to sabotage the process. Our methodology mirrors that of security research in computer science, where engineers routinely study worst-case outcomes. (30) The Article seeks to catalogue potential engineering techniques that could sabotage or disrupt the aims of discovery. We then assess these techniques' likelihood of success, potential solutions, indicia of manipulation, and whether the Federal Rules of Civil Procedure need updates.

Drawing on the most recent computer science literature, we identify six vulnerabilities in the discovery process:

(1) "Seed Set" and "Data Distribution": Associated problems occur when attorneys train a TAR algorithm on a subset of documents that is biased in some important way. For instance, if attorneys leave out of a seed set any emails that come from a particular mailing list, an algorithm may never be able to tag other mailing list emails as likely relevant, even if they are indeed relevant. This problem is not fully solved by using advanced learning processes.

(2) "Data Poisoning" or "Adversarial Examples": These sources of abuse arise when a party inserts a document which consistently tricks a machine learning algorithm into making an incorrect prediction...
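The seed set problem described in item (1) can be made concrete with a toy example. The sketch below is not a TAR product or any vendor's method; it is a minimal word-weight classifier, written under simplifying assumptions, in which every document, token, and function name is hypothetical. It assumes that relevant mailing-list emails use vocabulary that never appears in the directly addressed emails, and that words the model has never seen contribute nothing to a document's score.

```python
import math
from collections import Counter

def train(seed):
    """Learn a log-odds weight per token from (tokens, is_responsive) pairs."""
    pos, neg = Counter(), Counter()
    for tokens, responsive in seed:
        (pos if responsive else neg).update(set(tokens))
    n_pos = sum(1 for _, responsive in seed if responsive)
    n_neg = len(seed) - n_pos
    weights = {}
    for tok in set(pos) | set(neg):
        # Laplace smoothing keeps the ratio finite for one-sided tokens.
        p = (pos[tok] + 1) / (n_pos + 2)
        q = (neg[tok] + 1) / (n_neg + 2)
        weights[tok] = math.log(p / q)
    return weights

def predict(weights, tokens):
    """Tag a document as responsive if its summed weights are positive.
    Tokens never seen in training contribute 0 (an assumption of this toy)."""
    return sum(weights.get(t, 0.0) for t in set(tokens)) > 0

def recall(weights, docs):
    """Share of truly responsive documents that the model tags as responsive."""
    responsive = [tokens for tokens, is_resp in docs if is_resp]
    return sum(predict(weights, t) for t in responsive) / len(responsive)

# Hypothetical seed documents: directly addressed emails only.
direct_seed = [
    (["lottery", "preference", "applicants"], True),
    (["community", "preference", "policy"], True),
    (["lunch", "friday"], False),
    (["parking", "permit"], False),
]
# Hypothetical mailing-list emails that a representative seed set would include.
list_seed = [
    (["digest", "set-aside", "allocation"], True),
    (["digest", "newsletter", "events"], False),
]
# Held-out mailing-list emails the model must classify.
held_out_list = [
    (["digest", "set-aside", "quota"], True),
    (["digest", "set-aside", "allocation", "memo"], True),
    (["digest", "newsletter", "picnic"], False),
]

biased_model = train(direct_seed)
representative_model = train(direct_seed + list_seed)

biased_recall = recall(biased_model, held_out_list)
representative_recall = recall(representative_model, held_out_list)
print("recall on mailing-list emails, biased seed set:", biased_recall)
print("recall on mailing-list emails, representative seed set:", representative_recall)
```

In this toy setting, the model trained without mailing-list emails still recognizes directly addressed responsive emails, but it has no basis for scoring the mailing-list vocabulary and so misses the responsive mailing-list emails. Adding even two mailing-list examples to the seed set changes that result. Real TAR systems are far more sophisticated, so this only illustrates the mechanism and says nothing about its magnitude in practice.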
