by John Cutler.
The age of big data started . . . yesterday. See Steve Lohr, The Age of Big Data, New York Times, Feb. 11, 2012, available at https://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html. It is time for lawyers to catch up. Corpus linguistics is one tool that can bring the power of big data to your practice. First, a few statistics. Eighty-five Westlaw secondary sources reference corpus linguistics - over one-third were published last year. Eleven judicial opinions identify this tool. But briefs in only eight cases (eleven briefs in total) apply it.1 That means judges have been willing to look to corpus linguistics even when the parties do not. E.g., State v. Rasabout, 2015 UT 72, ¶¶ 61, 66, 356 P.3d 1258.
Corpus linguistics is not hard. Judges are doing it. Lawyers should do more. If we do not, sooner or later we will be having hard conversations with clients about why we did not. For example, in American Bankers Ass'n v. National Credit Union Administration, 306 F. Supp. 3d 44 (D.D.C. 2018), the court used a mix of its own search of the Corpus of Historical American English and party-submitted Westlaw judicial opinion data on the phrase "rural district" to conclude that the agency's expanded definition was "manifestly contrary to the statute." Id. at 66-70. That's a strong conclusion - grounded directly in corpus data. The data will not always be conclusive. But, in some cases, it can be. We owe it to our clients to understand this tool. This article will help you get up to speed.
The article proceeds in four parts: (1) background on corpus linguistics, (2) application of corpus linguistics to law, (3) corpus linguistic tools, and (4) resources to learn more.
BACKGROUND ON CORPUS LINGUISTICS
"What is corpus linguistics? Well, simply put, it is the use of computers to analyze large collections of real examples of language in use." Tony McEnery, Lancaster University, What is corpus linguistics?, YouTube, https://www.youtube.com/ watch?v=KabHl_Bsx4U. Corpus linguistic analysis "refocuses the study of language on what's actually written or said rather than on what experts think people can or should say." Id. " [W]e can do this because computers enable us to analyze millions, nowadays billions, of words, of evidence to account for the changing patterns of use in written and spoken language in everyday communication." Id. These large collections of naturally occurring language are called corpora (or a corpus - singular). See The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK, Corpus linguistics: Some Key Terms, at 5 (2013), available at http://cass.lancs.ac.uk/ wp-content/uploads/2013/12/CASS-Gloss-final 1.pdf. archived at https://perma.cc/2ANY-9FP5. The language collected in a corpus generally aims to be "representative of a particular variety of language or genre." Id. At its core, corpus linguistics involves the analysis of frequency data. Stefan Th. Cries, What Is Corpus linguistics?, 3 Language & Linguistic Compass 1188,1226-27 (Sept. 2009). This frequency data includes: • "frequencies of occurrence of linguistic elements, i.e. how often morphemes, words, grammatical patterns etc. occur in (parts of) a corpus...;"
• "frequencies of co-occurrence of these elements, i.e. how often morphemes occur with particular words, how often particular words occur in a certain grammatical construction;"
• "[whether] something (an individual element or the co-occurrence of more than one individual element) is attested in corpora; i.e. whether the observed frequency (of occurrence or co-occurrence) is 0 or larger;"
• "[whether] something is attested in corpora more often than something else; i.e. whether an observed frequency is larger than the observed frequency of something else;" and
• “[whether] something is observed more or less often than you would expect by chance.”
Id. at 1226–27.
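To make the five kinds of frequency data concrete, here is a minimal sketch in Python. It is an illustration only, not a method drawn from Gries or from any court: the five-sentence "corpus," the function names, and the choice to count co-occurrence at the sentence level are all assumptions made for the example. A real corpus holds millions or billions of words and is normally searched through a dedicated corpus interface.

```python
from collections import Counter
import re

# Hypothetical toy corpus, invented for illustration only.
TEXTS = [
    "The officer discharged his firearm twice.",
    "She discharged the firearm into the air.",
    "The hospital discharged the patient on Monday.",
    "The factory discharged waste into the river.",
    "The patient thanked the nurse.",
]


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


sentences = [tokenize(t) for t in TEXTS]
tokens = [word for sentence in sentences for word in sentence]

# 1. Frequency of occurrence: how often each word appears in the corpus.
freq = Counter(tokens)


def cooccurrence(a, b):
    """2. Frequency of co-occurrence: sentences containing both words."""
    return sum(1 for s in sentences if a in s and b in s)


def attested(word):
    """3. Attestation: is the observed frequency larger than zero?"""
    return freq[word] > 0


def more_frequent(a, b):
    """4. Comparison: is one word observed more often than another?"""
    return freq[a] > freq[b]


def expected_cooccurrence(a, b):
    """5. Chance baseline: sentences expected to contain both words
    if the two words were distributed independently of each other."""
    n = len(sentences)
    with_a = sum(1 for s in sentences if a in s)
    with_b = sum(1 for s in sentences if b in s)
    return with_a * with_b / n


print(freq["discharged"])                              # occurrence
print(cooccurrence("discharged", "firearm"))           # co-occurrence
print(attested("weapon"))                              # attestation
print(more_frequent("firearm", "waste"))               # comparison
print(expected_cooccurrence("discharged", "firearm"))  # chance baseline
```

Comparing the observed co-occurrence count with the chance baseline is the simplest version of the fifth question: whether two words appear together more or less often than independence would predict. With a corpus this small the comparison proves nothing; it only shows the shape of the calculation.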
But this data does not itself provide instant answers to linguistic (or legal) questions. Standing alone, “there are no meanings, no functions, no concepts in corpora – corpora are (usually text) files and all you can get out of such files is distributional (or quantitative/statistical) information.” Id. at 1226. Transforming raw data into information useful to linguists and lawyers requires two important ingredients: (A) a sound method for analyzing corpus data and (B) a theory that the data inputs inform.
Corpus linguistics frequency data are statistics. See id. at 1228. Like any statistic, corpus data can be bungled, mischaracterized, or manipulated by a linguist or lawyer’s failure to use appropriate methods in analyzing the data. Cf. Joel Best, Damned Lies And Statistics 1-6 (Updated Edition 2012) (identifying the pitfalls and perils inherent in statistics and the importance of methodologically sound statistics). The entire purpose of turning to corpus data was to get away from “intuiting acceptability judgments about what one can say and what one cannot” – for lawyers it is to get away from judges intuiting the ordinary meaning of statutes from their own personal experience with language usage. Gries, supra, at 1228. Because corpus data “provide distributional information in the sense mentioned earlier,” linguists and lawyers must use tools and methods “designed to deal with distributional information”: i.e. statistics. Id. If lawyers and linguists are going to criticize “faulty introspective judgments” of judges or theoretical linguists, “introspectively eyeball[ing] distributions and frequencies” will not cut it. Id.
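As one illustration of what replacing eyeballing with a statistical test can look like, the sketch below computes a Pearson chi-square statistic for a two-by-two table of counts. This is one common test chosen for the example, not a method the sources cited here prescribe, and every count is hypothetical: imagine 100 coded concordance lines from each of two periods, sorted by which of two senses of a word they use. The statistic is compared with 3.841, the standard critical value for one degree of freedom at the five percent level. The function assumes that no row or column total is zero.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumes every row and column total is greater than zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Counts expected in each cell if rows and columns were independent.
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical value for 1 degree of freedom at the 5% significance level.
CRITICAL_VALUE_5_PERCENT = 3.841

# Hypothetical counts: (sense X, sense Y) in period A, then in period B.
large_gap = chi_square_2x2(70, 30, 55, 45)
small_gap = chi_square_2x2(60, 40, 55, 45)

print(large_gap > CRITICAL_VALUE_5_PERCENT)  # gap unlikely to be chance alone
print(small_gap > CRITICAL_VALUE_5_PERCENT)  # gap consistent with chance
```

Both hypothetical tables show sense X ahead in period A, and a reader eyeballing them might treat both gaps as meaningful. Only the first exceeds the critical value, which is the kind of distinction introspection about frequencies tends to miss.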
As lawyers, we need not be expert statisticians, but we ought to familiarize ourselves with the basics. For a primer on how to be a more critical consumer of statistical information, see generally Best, supra. Statistics should not scare us. With a bit of background knowledge, eyeballing the results of a simple corpus search can offer some initial information that may...