Text matching to measure patent similarity

Published date: 01 January 2018
Author: Sam Arts, Juan Carlos Gomez, Bruno Cassiman
DOI: http://doi.org/10.1002/smj.2699
RESEARCH ARTICLE
Sam Arts¹ | Bruno Cassiman² | Juan Carlos Gomez³
¹ Faculty of Economics and Business, KU Leuven, Antwerp, Belgium
² IESE Business School, KU Leuven and CEPR, Barcelona, Spain
³ Department of Electronics Engineering, University of Guanajuato Campus Irapuato-Salamanca, Salamanca, Mexico
Correspondence
Sam Arts, KU Leuven, Korte Nieuwstraat 33, 2000 Antwerp, Belgium.
Email: sam.arts@kuleuven.be
Funding information
KU Leuven, Grant/Award number: IMP/16/002; Fonds Wetenschappelijk Onderzoek Vlaanderen, Grant/Award number: G071417N; Spanish Government, Grant/Award number: MEC grant ECO2015-71173-P
Research Summary: We propose using text matching to measure the technological similarity between patents. Technology experts from different fields validate the new similarity measure and its improvement on measures based on the United States Patent Classification System, and identify its limitations. As an application, we replicate prior findings on the localization of knowledge spillovers by constructing a case-control group of text-matched patents. We also provide open access to the code and data to calculate the similarity between any two utility patents granted by the United States Patent and Trademark Office between 1976 and 2013, or between any two patent portfolios.
Managerial Summary: We propose using text matching to measure the technological similarity between patents. The method can be used by various practitioners such as inventors, attorneys, patent examiners, and managers to search for closely related prior art, to assess the novelty of a patent, to identify R&D opportunities in less crowded areas, to detect in- or out-licensing opportunities, to map companies in technology space, and to find acquisition targets. We use an expert panel to validate the improvement of the new similarity measure on measures based on the United States Patent Classification System, and provide open access to the code and data to calculate the similarity between any two utility patents granted by the USPTO between 1976 and 2013, or between any two patent portfolios.
KEYWORDS
matching, patent, patent classification, technological similarity, text mining
The online appendix provides a description of the code and the data files. Code and data are available at https://dataverse.harvard.edu/dataverse/patenttext.
Received: 31 March 2016 Revised: 23 August 2017 Accepted: 25 August 2017 Published on: 24 October 2017
DOI: 10.1002/smj.2699
Copyright © 2017 John Wiley & Sons, Ltd. wileyonlinelibrary.com/journal/smj Strat Mgmt J. 2018;39:62–84.
1 | INTRODUCTION
Measuring the similarity between particular bodies of knowledge is a critical step in many innovation and strategy studies: Are knowledge spillovers geographically localized (Jaffe, Trajtenberg, & Henderson, 1993)? Does interfirm mobility of engineers influence the transfer of knowledge between firms or regions (Almeida & Kogut, 1999; Singh & Agrawal, 2011)? Do alliance partners draw on related technological knowledge from each other (Rosenkopf & Almeida, 2003)? Are mergers and acquisitions between firms with similar technological knowledge more successful (Makri, Hitt, & Lane, 2010)? How do knowledge spillovers affect R&D investments and the productivity of firms (Bloom, Schankerman, & Van Reenen, 2013)? In order to answer these and related questions convincingly, it is necessary to carefully measure the technological similarity between patents or patent portfolios.
Prior and current research on innovation and strategy traditionally relies on the classification system of patent offices to measure the similarity between patents (e.g., Aharonson & Schilling, 2016; Singh & Marx, 2013), to construct a case-control group of similar patents (e.g., Agrawal, Cockburn, & Rosell, 2010; Almeida, 1996; Almeida & Kogut, 1999; Belenzon & Schankerman, 2013; Jaffe et al., 1993), or to measure the similarity between patent portfolios of firms (e.g., Ahuja, 2000; Makri et al., 2010; Bloom et al., 2013; Rosenkopf & Almeida, 2003). However, this aggregated classification system might not capture all the technological characteristics of an invention (Singh & Agrawal, 2011; Thompson & Fox-Kean, 2005). Moreover, different classes and subclasses might contain significant overlap so that technologically similar patents can have a different classification (McNamee, 2013).
In this article, we use a text-mining technique based on common keywords to develop a new measure of technological similarity for all utility patents granted by the United States Patent and Trademark Office (USPTO) between 1976 and 2013. This new similarity measure and its improvement on measures based on the United States Patent Classification System (USPC) are validated in two ways.¹ First, we use an expert panel in which 13 independent experts from five different fields assess the technological similarity of a random sample of patents from their field of expertise. Second, we start from the assumption that patents are more likely to be technologically similar if they belong to the same patent family, list the same inventors, are owned by the same assignees, or if they cite each other. Using the full population of patents, we confirm that text-matched patents are more likely to cite each other, to belong to the same patent family, or to have a common inventor or assignee compared to patents that are matched based on their patent classification. Moreover, we find significant differences across different groups of text-matched patent pairs depending on the degree of similarity in text.
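As an illustration of keyword-based matching, the sketch below scores two patent texts by the share of keywords they have in common. It is a minimal sketch, not the authors' implementation: the Jaccard index, the tokenizer, the stop-word list, and the names `keywords` and `keyword_similarity` are assumptions made for this example, since this section does not spell out the formula or the preprocessing.

```python
import re

# Hypothetical stop-word list; the preprocessing actually used for the
# published measure is not described in this section.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def keywords(text):
    """Return the set of lowercase alphabetic tokens, minus stop words
    and one-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) > 1 and t not in STOPWORDS}


def keyword_similarity(text_a, text_b):
    """Jaccard index (an assumed choice): shared keywords divided by
    the number of distinct keywords appearing in either text."""
    ka, kb = keywords(text_a), keywords(text_b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


# Made-up patent titles, for illustration only.
a = "Lithium battery electrode with graphite coating"
b = "Graphite coating for a lithium battery anode"
c = "Method for brewing coffee"

print(round(keyword_similarity(a, b), 3))  # 4 shared of 6 distinct keywords
print(keyword_similarity(a, c))            # no shared keywords
```

In this toy case `a` and `b` share four of six distinct keywords, while `a` and `c` share none, so the first pair would rank as the closer text match.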
To identify the limitations of the new similarity measure, we asked for feedback from the experts on the discrepancies between our text-based similarity measure and their personal rating. Not surprisingly, patents with only a few keywords that have little discriminatory power (such as method, system, process, and material) increase the likelihood of false positives, or type I errors. Different spelling variants and synonyms increase the likelihood of false negatives, or type II errors.
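Both failure modes can be reproduced with toy keyword sets. The example below is a hedged illustration only: the keyword sets are invented, the Jaccard overlap is an assumed scoring rule, and the `VARIANTS` mapping is a hypothetical normalization step rather than part of the published method.

```python
def jaccard(set_a, set_b):
    """Shared items divided by distinct items in either set."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Type I risk: two unrelated patents described only by generic keywords.
software_patent = {"method", "system", "process"}
chemistry_patent = {"method", "system", "material"}
false_positive_score = jaccard(software_patent, chemistry_patent)  # 2 of 4

# Type II risk: related patents that use different spellings.
uk_patent = {"aluminium", "alloy", "colour", "coating"}
us_patent = {"aluminum", "alloy", "color", "paint"}
false_negative_score = jaccard(uk_patent, us_patent)  # 1 of 7

# Hypothetical spelling normalization; synonyms such as
# "coating" and "paint" would still be missed.
VARIANTS = {"aluminium": "aluminum", "colour": "color"}


def normalize(keyword_set):
    """Map known spelling variants onto one canonical form."""
    return {VARIANTS.get(k, k) for k in keyword_set}


normalized_score = jaccard(normalize(uk_patent), normalize(us_patent))  # 3 of 5

print(false_positive_score, round(false_negative_score, 3), normalized_score)
```

The generic pair scores higher than the genuinely related pair until spelling variants are harmonized, which mirrors the direction of the two error types the experts reported.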
¹ The USPTO stopped classifying patents according to the USPC and switched to the Cooperative Patent Classification (CPC) system in January 2015. The CPC is based on a harmonization of the existing classification systems of the European Patent Office and the USPTO (ECLA and USPC, respectively). Nonetheless, the large majority of studies rely on U.S. patents filed before 2015 and classified according to the USPC.