Patent Similarity Data and Innovation Metrics

Published date: 01 September 2020
Authors: Leslie DeChurch, Noshir Contractor, Alina Lungeanu, Ryan Whalen
DOI: http://doi.org/10.1111/jels.12261
Journal of Empirical Legal Studies
Volume 17, Issue 3, 615–639, September 2020
Ryan Whalen,* Alina Lungeanu, Leslie DeChurch, and Noshir Contractor
We introduce and describe the Patent Similarity Dataset, comprising vector space model-based
similarity scores for U.S. utility patents. The dataset provides approximately 640 million
pre-calculated similarity scores, as well as the code and computed vectors required to
calculate further pairwise similarities. In addition to the raw data, we introduce measures
that leverage patent similarity to provide insight into innovation and intellectual property
law issues of interest to both scholars and policymakers. Code is provided in accompanying
scripts to assist researchers in obtaining the dataset, joining it with other available patent
data, and using it in their research.
I. Introduction and Overview
For decades, empirical research on patent law and innovation has benefited from access
to increasingly high-quality patent datasets. Scholars have used these datasets to study
innovation in a wide variety of contexts at the national level,[1] firm level,[2] team level,[3] and
individual level.[4] In these studies, patent data have served a wide variety of purposes.[5] For
example, citations have been used as a proxy for knowledge inputs or as a measure of a patent’s
value,[6] patents themselves have been used as proxy measures for innovation more generally,[7]
and the structure of the prior art citation network has been used to infer the existence of
thickets of intellectual property rights.[8]
*Address correspondence to Ryan Whalen, NUS Faculty of Law, 469G Bukit Timah Road 259776, Singapore;
email: ryanwhalen@nus.edu.sg. Whalen is Assistant Professor of Law at the National University of Singapore,
Singapore; Lungeanu is Research Assistant Professor at Northwestern University School of Communication, Evanston;
DeChurch is Professor at Northwestern University School of Communication, Evanston; Contractor is the Jane
S. & William J. White Professor of Behavioral Sciences in the School of Engineering, School of Communication
and the Kellogg School of Management, Evanston.
This research was supported by NSF award number 1856090.
[1] See, e.g., Raffaele Paci, Antonio Sassu & Stefano Usai, International Patenting and National Technological
Specialization, 17 Technovation 25 (1997).
[2] See, e.g., Michele Grimaldi, Livio Cricelli, Martina Di Giovanni & Francesco Rogo, The Patent Portfolio Value
Analysis: A New Framework to Leverage Patent Information for Strategic Technology Planning, 94 Technological
Forecasting & Soc. Change 286 (2015).
[3] See, e.g., Margherita Balconi, Stefano Breschi & Francesco Lissoni, Networks of Inventors and the Role of
Academia: An Exploration of Italian Patent Data, 33 Res. Pol’y 127 (2004).
This project engages with the tradition of providing patent-related data that can
enrich future research on patent law and innovation. We begin by briefly reviewing the
state of available patent data and the research that relies on it. We subsequently introduce
the Patent Similarity Dataset, which uses a vector space model to compute pairwise
distances between a large number of patents. After introducing vector space models generally
and explaining how the Patent Similarity Dataset was created, the article’s final
section describes the Patent Similarity Dataset’s qualities and demonstrates how
it can be used to generate a wide variety of metrics that provide new perspectives on patent
law and innovation.
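To make the idea of pairwise comparison in a vector space concrete, the following is a minimal sketch, assuming simple term-count vectors and cosine similarity. It is not the dataset's own pipeline; the model actually used is described later in the article and may differ. The patent texts, the identifiers A, B, and C, and the helper names `term_vector` and `cosine_similarity` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse term-count vector (a toy vector space model)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical patent texts, invented for illustration only.
patents = {
    "A": "battery electrode with lithium coating",
    "B": "lithium battery electrode assembly",
    "C": "method for brewing coffee",
}

vectors = {pid: term_vector(text) for pid, text in patents.items()}

# One similarity score per unordered pair of patents.
ids = sorted(vectors)
scores = {
    (a, b): cosine_similarity(vectors[a], vectors[b])
    for i, a in enumerate(ids)
    for b in ids[i + 1:]
}

for pair, score in scores.items():
    print(pair, round(score, 3))
```

In this toy setup, patents A and B share several terms and so score higher than either does against C. A distance, where one is needed, can be taken as one minus the similarity. Because the number of unordered pairs grows quadratically with the number of patents, pre-calculated scores of the kind the dataset provides can save substantial computation.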
A. The Growth in Patent Data Availability, and Patent-Data-Driven Research
One of the functions of patent law is to incentivize the disclosure of information relating
to innovation.[9] As a result, the patent system generates a large amount of data,
much of which is publicly available.[10] For decades now, researchers have been drawing
on this increasingly large body of available patent data to help better understand innovation,
science, and intellectual property law. Because the universe of patent data is quite
large and data are available in varying formats, many of these projects require substantial
data cleaning and preparation work. Researchers thus often publish their datasets both
[4] See, e.g., Martin G. Moehrle, Lothar Walter, Anja Geritz & Sandra Müller, Patent-Based Inventor Profiles as a
Basis for Human Resource Decisions in Research and Development, 35 R&D Mgmt. 513 (2005).
[5] For a review of patent data as an economic indicator, see Zvi Griliches, Patent Statistics as Economic Indicators: A
Survey, in R&D and Productivity 287 (Univ. of Chicago Press 1998); Sadao Nagaoka, Kazuyuki Motohashi & Akira
Goto, Patent Statistics as an Innovation Indicator, 2 Handbook of the Economics of Innovation 1083 (B. H. Hall &
N. Rosenberg, eds., North-Holland 2010).
[6] See Manuel Trajtenberg, A Penny for Your Quotes: Patent Citations and the Value of Innovations, Rand J. of
Econ. 172 (1990).
[7] Daron Acemoglu, Ufuk Akcigit & William R. Kerr, Innovation Network, 113 PNAS 11483 (2016).
[8] George von Graevenitz, Stefan Wagner & Dietmar Harhoff, How to Measure Patent Thickets—A Novel Approach,
111 Econ. Letters 6 (2011).
[9] Jeanne C. Fromer, Patent Disclosure, 94 Iowa L. Rev. 539 (2008–2009).
[10] Indeed, the Patent Act requires that the PTO make patent data available. 35 U.S.C. § 41(i). For an overview of IP
data, see David L. Schwartz & Ted Sichelman, Data Sources on Patents, Copyrights, Trademarks, and Other
Intellectual Property, in Research Handbook on the Economics of Intellectual Property Law (Edward Elgar
Publishing 2019).
