Investigating Cohort Similarity as an Ex Ante Alternative to Patent Forward Citations

Author: Jonathan H. Ashtor
DOI: http://doi.org/10.1111/jels.12237
Date: 01 December 2019
Published date: 01 December 2019
Journal of Empirical Legal Studies
Volume 16, Issue 4, 848–880, December 2019
Jonathan H. Ashtor*
Forward citations are arguably the most widely used empirical metric for patents, including
as indicators of patent information content, cumulative innovation, value, and knowledge
flows. However, forward citations have major shortcomings. Citations require long time
horizons to accrue, and therefore they cannot be observed until several years after a patent
issues. Citation data are often noisy, discontinuous, and highly skewed, complicating
empirical analysis. Moreover, recent studies have questioned the reliability of citation data.
As such, the most widely used empirical metric of patents is also the most suspect. This study
constructs a measure of patents that correlates with forward citations, but is observable ex ante,
immediately upon patent issuance or even earlier upon publication of a patent application.
In addition, this measure is continuous and evenly distributed, such that it is suitable for
large-scale patent analytics applications. Finally, unlike citations, the measure is portable
across patent systems, facilitating cross-border comparisons of portfolios and datasets.
Specifically, I construct a measure of the similarity of a patent to its technological-temporal
cohort, based on linguistic analysis of claim text. I employ advanced computational linguistic
techniques to analyze the claims of all U.S. patents issued in the period 1976–2017, over 6 million
patents in total, and I calculate the average degree of conceptual similarity of each patented
invention to all others in the same technology field and time period cohort. I then extend
the methodology to all issued EP patents, over 1.6 million in total. I validate the resulting
measures against multiple established patent metrics for U.S. and EP patents. I test the
robustness of this measure as a forecast for future patent citations in empirical research and
big-data applications. I find that cohort similarity correlates significantly with forward
citations received by both U.S. and EP patents. Cohort similarity also substitutes for citations in
leading prior studies of R&D output and innovation. Finally, I demonstrate that, unlike
citations, cohort similarity is comparable across the U.S. and EP patent systems. Accordingly,
cohort similarity may be useful for empirical patent research, comparative studies of patent
policy, and analytics of large-scale patent portfolios.

*Address correspondence to Jonathan H. Ashtor, Professor of Practice at Benjamin N. Cardozo School of Law;
Counsel at Paul, Weiss, Rifkind, Wharton & Garrison LLP, 1285 Ave. of the Americas, New York, NY 10019 USA;
email: jashtor@paulweiss.com. This project does not reflect the views of Paul, Weiss or any of its attorneys or
clients. The research for and writing of this article was supported by a Leonardo da Vinci Research Grant from the
Center for the Protection of Intellectual Property at George Mason University School of Law.
I am grateful for the helpful comments and suggestions of (alphabetically): Chris Buccafusco, Michael Frakes
(discussant), Janet Freilich, Jeanne Fromer, Deepak Hegde, Jeff Kuhn (discussant), Brian Love, Alan Marco, Shawn
Miller, Adam Mossoff, Amanda Myers, Gaétan de Rassenfosse, Lisa Oullette, Lindsey Raymond, Bhaven Sampat,
Josh Sarnoff, Dave Schwartz, Laura Sheridan, Ted Sichelman, Scott Stern, Asrat Tesfayesus (discussant), Neil
Thompson, Andy Toole, Ian Wetherbee and Brian Wright, as well as the organizers and participants of Conference
on Empirical Legal Studies 2018, European Policy for Intellectual Property Conference 2017–2018, Empirical
Methods in IP 2017–2018, IPSDM 2018 (presentation), IPSDM 2017 (poster), IPSC 2017–2018, WIPIP 2018,
PatCon 2018 and the USPTO Visiting Speaker Series in March 2018.
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs
License, which permits use and distribution in any medium, provided the original work is properly cited, the use is
non-commercial and no modifications or adaptations are made.
I. Introduction
Forward citations have many strengths but also many weaknesses. They have been found
to reflect the technological importance and private value of patents (Lanjouw &
Schankerman 2001; Allison et al. 2004). They also provide measures of cumulative inno-
vation (Caballero & Jaffe 1993; Trajtenberg et al. 1997; Galasso & Schankerman 2014),
and R&D output (Hall et al. 2005). Citations have been used as proxies for the informa-
tion content of a patent’s disclosure (Ashtor forthcoming), as well as technological
impact (Corredoira & Banerjee 2015). As such, they are a widely used metric of patents
for both economics research and commercial patent analytics (Kuhn et al. 2018).
However, forward citation data have significant shortcomings. Foremost, citations
require long time horizons to accrue (Marco 2007) and therefore they cannot be reliably
observed until several years post-issuance. Distributions of forward citations are typically noisy,
discontinuous, and highly skewed, which complicates empirical analysis (Gambardella et al.
2008; Roach & Cohen 2013). Moreover, citation data often lack precision and suffer from inac-
curacies. Studies have documented evidence of strategic applicant behavior influencing the dis-
tribution of citations (Sampat 2010; Abrams et al. 2013; Cotropia et al. 2013), and have also
found that many “forward” citations actually predate the patents they cite (Arora et al. 2017).
Most recently, leading academics have documented a surge in citation counts and a
decline in their relevance to the corresponding cited patents (Kuhn et al. 2018). Finally, because
citation measures are heavily driven by patent counts, they vary widely across patent systems
and are not directly comparable for patents in different jurisdictions (Lerner & Seru 2015).
In light of the aforementioned problems, empirical experts have questioned
whether forward citations are still viable as a metric for research and large-scale patent
analysis. Indeed, Kuhn et al. (2018) have forcefully argued that “future research with pat-
ent data will require new measures and methods … that are not based on patent citations
at all.” Accordingly, it is advantageous to construct an alternative measure to citations for
use in empirical research and analytics applications.
This study seeks to develop and investigate a potential alternative to citations. It is
based on the recognition that the forward citations to a patent reflect, at least in part,
the information contained therein (Trajtenberg 1990b; Jung & Lee 2015; Corredoira &
Banerjee 2015). I model a patent’s information content by measuring the conceptual
similarity of the patent’s claims to the claims of a cohort of reference patents. Lower simi-
larity to the cohort is hypothesized to correspond to greater information content, which
may in turn result in more citations.
Specifically, I conduct latent semantic analysis (LSA) on the claims of the entire
population of U.S. patents issued from 1976 through 2017, nearly 6 million patents in
total, and I calculate the average degree of similarity of each patented invention relative
to all others in its technology-temporal cohort.¹ The cohorts are selected dynamically for
each patent, based on the application date and technology field of the patent in ques-
tion. I further extend the algorithm to all European patents (EP) issued by the EPO from
1978 to present, over 1.6 million in total, parsing the English-language claims from the
published XML files and using technology fields based on the International Patent
Classification system.²
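The cohort similarity calculation described above can be illustrated with a minimal sketch. It is not the article's implementation: it assumes a plain TF-IDF weighting, a small truncated SVD as the LSA step, and a cohort defined as patents in the same technology field whose year falls within a fixed window. The dictionary keys (`id`, `field`, `year`, `claims`) and the parameters `k` and `window_years` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

import numpy as np


def lsa_embed(docs, k=2):
    """Return one k-dimensional LSA vector per document (TF-IDF followed by SVD)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    X = np.zeros((n_docs, len(vocab)))
    for row, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            X[row, index[term]] = count * idf
    # Keep only the k largest singular directions (the "latent" concepts).
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    return U[:, :k] * s[:k]


def cohort_similarity(patents, k=2, window_years=2):
    """Average cosine similarity of each patent's claims to its cohort.

    Assumption: the cohort is every other patent in the same field whose
    year is within `window_years`. Returns None for an empty cohort.
    """
    vecs = lsa_embed([p["claims"] for p in patents], k)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    unit = vecs / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T  # pairwise cosine similarities
    scores = {}
    for i, p in enumerate(patents):
        cohort = [
            j for j, q in enumerate(patents)
            if j != i
            and q["field"] == p["field"]
            and abs(q["year"] - p["year"]) <= window_years
        ]
        scores[p["id"]] = float(np.mean(sims[i, cohort])) if cohort else None
    return scores
```

Calling `cohort_similarity` on a list of such dictionaries returns one score per patent identifier. Under the hypothesis stated earlier, a lower score would indicate claims that are conceptually more distant from the rest of the cohort.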
I test the cohort similarity measure using multiple accepted patent metrics, includ-
ing number of patent classifications, claim breadth, prior art citations, patent mainte-
nance, recorded assignments, and backward citations. Additionally, I confirm that the
cohort similarity measure does not change substantially during patent examination, facilitating
early assessment based on the published claims of patent applications.³
Finally, I investigate the relationship between cohort similarity and forward cita-
tions through multiple lenses. I find significant correlations with forward citation counts
for five-year and seven-year time horizons for both U.S. and EP patents, after controlling
for other relevant factors. I also study the relationship between cohort similarity and several
of the phenomena that citations are used to model, such as R&D output and innovation
value (Caballero & Jaffe 1993; Trajtenberg et al. 1997). In particular, I replicate one
of the pioneering studies of citations by Hall et al. (2005), replacing citations with cohort
similarity, and I also introduce citations into a more recent study by Kogan et al. (2017).
Finally, I compare cohort similarity measures of U.S. and EP counterparts and confirm
that, unlike citations, they are comparable across jurisdictions.
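For readers unfamiliar with fixed-horizon citation counts, the following is a minimal sketch of how a five-year or seven-year forward citation count might be computed. It assumes the horizon runs from the issue date of the cited patent and that citing documents dated before issuance are excluded; both are assumptions made for illustration, and the function name is hypothetical.

```python
from datetime import date


def forward_citations_within(issue_date, citing_dates, horizon_years):
    """Count citing documents dated within `horizon_years` of `issue_date`.

    Assumption: the window is inclusive at both ends and starts at issuance,
    so any "forward" citation dated before the issue date is ignored.
    """
    try:
        cutoff = issue_date.replace(year=issue_date.year + horizon_years)
    except ValueError:
        # Issue date is 29 February and the cutoff year is not a leap year.
        cutoff = issue_date.replace(year=issue_date.year + horizon_years, day=28)
    return sum(1 for d in citing_dates if issue_date <= d <= cutoff)
```

A fixed window of this kind puts patents of different ages on the same footing, at the cost of waiting the full horizon before the count can be observed.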
This article is organized as follows. Section II outlines the strategy and relevant
prior scholarship. Section III details the methodology and algorithms used herein.
Section IV provides the results of validation testing as well as benchmark testing of multiple
variations in the underlying algorithms (provided in Online Appendix B).⁴ Section V
conducts specific studies of the cohort similarity measure and its relationship with for-
ward citations. Concluding remarks follow in Section VI.
Before proceeding, it is important to note that I am not performing any legal
analysis in this study, nor is the cohort similarity measure suitable for analyzing the legal
attributes or validity of any patent. This study constructs a statistical measure designed for
comparative and trend analyses, and, like forward citations, it does not reflect the validity
or value of any given patent or patent application. Rather, the purpose of this measure is
¹ Except where indicated, results are currently reported for all U.S. patents issued in the period 1976–2017,
approximately 6 million total, and for EP patents issued in the period 1978–2017, approximately 1.6 million total.
² Notably, EP data suffer from left-truncation because the earliest cohorts include very few patents, as applicants
were gaining familiarity with the EPO. Accordingly, in the analyses below I exclude the earliest years of the EP
dataset, retaining measures for over 1 million patents.
³ This is performed using the U.S. dataset, due to unavailability of application data for EP patents.
⁴ Additionally, for further reference, interactive demos of the validation testing are provided via Google’s
DataStudio at https://datastudio.google.com/open/1U-htaXLMuexkpKwjE-yBI68mtxgCfiNh.