Article A Practical Primer on Law & Corpus Linguistics, 0219 UTBJ, Vol. 32, No. 1. 14

Author: John Cutler
Utah Bar Journal

February, 2019





The age of big data started ... yesterday. See Steve Lohr, The Age of Big Data, New York Times, Feb. 11, 2012. It is time for lawyers to catch up. Corpus linguistics is one tool that can bring the power of big data to your practice.

First, a few statistics. Eighty-five Westlaw secondary sources reference corpus linguistics, and over one-third of them were published last year. Eleven judicial opinions identify this tool. But briefs in only eight cases (eleven total) apply it. That means judges have been willing to look to corpus linguistics even when the parties do not. E.g., State v. Rasabout, 2015 UT 72, ¶¶ 61, 66, 356 P.3d



Corpus linguistics is not hard. Judges are doing it. Lawyers should do more. If we do not, sooner or later we will be having hard conversations with clients about why we did not. For example, in American Bankers Ass'n v. National Credit Union Administration, 306 F. Supp. 3d 44 (D.D.C. 2018), the court used a mix of its own search of the Corpus of Historical American English and party-submitted Westlaw judicial opinion data on the phrase "rural district" to conclude that the agency's expanded definition was "manifestly contrary to the statute." Id. at 66-70. That is a strong conclusion, grounded directly in corpus data. The data will not always be conclusive. But, in some cases, it can be. We owe it to our clients to understand this tool. This article will help you get up to speed.


This article proceeds in four parts: (1) background on corpus linguistics, (2) application of corpus linguistics to law, (3) corpus linguistic tools, and (4) resources to learn more.




"What is corpus linguistics? Well, simply put, it is the use of computers to analyze large collections of real examples of language in use." Tony McEnery, Lancaster University, What is corpus linguistics?, YouTube, watch?v=KabHl_Bsx4U. Corpus linguistic analysis "refocuses the study of language on what's actually written or said rather than on what experts think people can or should say." Id. "[W]e can do this because computers enable us to analyze millions, nowadays billions, of words, of evidence to account for the changing patterns of use in written and spoken language in everyday communication." Id. These large collections of naturally occurring language are called corpora (singular: corpus). See The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK, Corpus Linguistics: Some Key Terms, at 5 (2013), available at wp-content/uploads/2013/12/CASS-Gloss-final 1.pdf. The language collected in a corpus generally aims to be "representative of a particular variety of language or genre." Id. At its core, corpus linguistics involves the analysis of frequency data. Stefan Th. Gries, What Is Corpus Linguistics?, 3 Language & Linguistics Compass 1188, 1226-27 (Sept. 2009). This frequency data includes:

• "frequencies of occurrence of linguistic elements, i.e. how often morphemes, words, grammatical patterns etc. occur in (parts of) a corpus...;"
• "frequencies of co-occurrence of these elements, i.e. how often morphemes occur with particular words, how often particular words occur in a certain grammatical
• "[whether] something (an individual element or the co-occurrence of more than one individual element) is attested in corpora; i.e. whether the observed frequency (of occurrence or co-occurrence) is 0 or larger;"
• "[whether] something is attested in corpora more often than something else; i.e. whether an observed frequency is larger than the observed frequency of something else;" and
• "[whether] something is observed more or less often than you would expect by chance."

Id. at 1226–27.
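For readers who want to see what these categories of frequency data look like in practice, the following is a minimal sketch in Python. The four-sentence "corpus" and every function name are invented for illustration; they are not drawn from Gries or from any actual corpus tool, and the expected-count formula is only a rough independence approximation, not a significance test.

```python
from collections import Counter

# Hypothetical toy corpus (invented for illustration, not from the article).
# A real corpus would hold millions or billions of words.
TEXT = (
    "the rural district voted . "
    "the rural district grew . "
    "the urban district voted . "
    "the rural road flooded ."
)

tokens = TEXT.split()
unigrams = Counter(tokens)                   # counts of single words
bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs


def frequency(word: str) -> int:
    """Frequency of occurrence: how often a word appears."""
    return unigrams[word]


def cooccurrence(first: str, second: str) -> int:
    """Frequency of co-occurrence: how often `second` directly follows `first`."""
    return bigrams[(first, second)]


def is_attested(word: str) -> bool:
    """Whether the word appears in the corpus at all (frequency above zero)."""
    return frequency(word) > 0


def expected_cooccurrence(first: str, second: str) -> float:
    """Rough pair count we would expect if the two words were unrelated."""
    return frequency(first) * frequency(second) / len(tokens)


print("occurrences of 'rural':", frequency("rural"))
print("'rural district' pairs:", cooccurrence("rural", "district"))
print("'suburban' attested:", is_attested("suburban"))
print("'rural' more frequent than 'urban':",
      frequency("rural") > frequency("urban"))
print("'rural district' more often than chance:",
      cooccurrence("rural", "district")
      > expected_cooccurrence("rural", "district"))
```

Each printed line corresponds to one of the five kinds of frequency data listed above. With a corpus this small the comparison to chance means little; it is included only to show the shape of the question.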


This data does not itself provide instant answers to linguistic (or legal) questions. Standing alone, “there are no meanings, no functions, no concepts in corpora – corpora are (usually text) files and all you can get out of such files is distributional (or quantitative/statistical) information.” Id. at 1226. Transforming raw data into information useful to linguists and lawyers requires two important ingredients: (A) a sound method for analyzing corpus data and (B) a theory that the data inputs inform.



Corpus linguistics frequency data are statistics. See id. at 1228. Like any statistic, corpus data can be bungled, mischaracterized, or manipulated when a linguist or lawyer fails to use appropriate methods in analyzing the data. Cf. Joel Best, Damned Lies and Statistics 1-6 (Updated Edition 2012) (identifying the pitfalls and perils inherent in statistics and the importance of methodologically sound statistics). The entire purpose of turning to corpus data was to get away from “intuiting acceptability judgments about what one can say and what one cannot” – for lawyers it is to get away from judges intuiting the ordinary meaning of statutes from their own personal experience with language usage. Gries, supra, at 1228. Because corpus data “provide distributional information in the sense mentioned earlier,” linguists and lawyers must use tools and methods “designed to deal with distributional information”: i.e. statistics. Id. If lawyers and linguists are going to criticize “faulty introspective judgments” of judges or theoretical linguists, “introspectively eyeball[ing] distributions and frequencies” will not cut it. Id.


As lawyers, we need not be expert statisticians, but we ought to familiarize ourselves with the basics. For a primer on how to be a more critical consumer of statistical information, see generally Best, supra. Statistics should not scare us. With a bit of background knowledge, eyeballing the results of a simple corpus search can offer some initial information that may shape how we proceed. For example, in the American Bankers Ass’n case noted in the introduction, Judge Friedrich’s corpus search revealed that “the phrase rural district was used with some frequency in the first half of the twentieth century before mostly falling out of usage in the second half.” 306 F. Supp. 3d at 68 (discussing a search of the Corpus of Historical American English). This type of corpus search is incredibly simple to perform, but the results can be quite powerful. A smart lawyer who finds potentially valuable information by eyeballing corpus data will consult with an expert in statistics to ensure the rigor of the analysis and be prepared for arguments opposing counsel or the court might raise to undermine the credibility of the corpus data.
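One basic point worth understanding before eyeballing a search like this: raw hit counts from different decades are not directly comparable if the decades contain different amounts of text, so counts are usually expressed as a rate, such as hits per million words. The sketch below shows that arithmetic. The counts and corpus sizes are placeholders invented for illustration; they are NOT results from the Corpus of Historical American English or from the American Bankers Ass’n opinion, and the function name is hypothetical.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Convert a raw hit count into a rate per million words of text."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


# Placeholder numbers for illustration only, NOT actual corpus results.
# Each entry is (raw hits for the phrase, total words in that decade's texts).
DECADES = {
    "1910s": (120, 20_000_000),
    "1950s": (40, 20_000_000),
    "1990s": (5, 25_000_000),
}

rates = {decade: per_million(hits, size)
         for decade, (hits, size) in DECADES.items()}

for decade, rate in rates.items():
    print(f"{decade}: {rate:.2f} hits per million words")
```

A falling rate across decades is the kind of pattern a quick search can surface. Whether the decline is meaningful, rather than an artifact of how the corpus was assembled, is the question to put to a statistics expert.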


Judges performing their own corpus analysis do not have the option of consulting with outside experts. But a judge who identifies potentially useful corpus data may invite supplemental briefing on the results, allowing the adversary process to test the judge’s initial findings. Courts do this all the time when internal research discovers legal authority missed by the parties that may materially alter the outcome of the case. Using this process to handle judicial inquiry into corpus data allows judges to access the full panoply of interpretive tools, while also subjecting judicial corpus analysis to the crucible of testing likely to expose any problems with the court’s methodology or resulting data.



Even statistically sound data cannot advance legal interpretation unless there is a linguistic or legal theory that makes the data consequential. Gries, supra, at 1228–29. Among the linguistic or legal theories that can give corpus data meaning is the notion that differences in language usage reflect differences in meaning. Id. at 1229. The law embraces a similar theory of meaning in the mirrored

