Received: 2 July 2015 Revised: 1 August 2017 Accepted: 2 August 2017
DOI: 10.1111/jems.12259
SPECIAL ISSUE
Machine learning and natural language processing on the patent
corpus: Data, tools, and new measures
Benjamin Balsmeier1, Mohamad Assaf2,3, Tyler Chesebro4, Gabe Fierro4, Kevin Johnson4, Scott Johnson4, Guan-Cheng Li2, Sonja Lück5, Doug O'Reagan2, Bill Yeh4, Guangzheng Zang4, Lee Fleming2
1 Centre for Research in Economics and Management, University of Luxembourg, Esch-sur-Alzette, Luxembourg
2 Coleman Fung Institute for Engineering Leadership, UC Berkeley, Berkeley, CA, USA
3 Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
4 Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA, USA
5 Department of Economics, University of Paderborn, Paderborn, Germany
Correspondence
Lee Fleming, Coleman Fung Institute for Engineering Leadership, UC Berkeley, Berkeley, CA, USA.
Email: lfleming@berkeley.edu

The authors wish to thank two very careful reviewers who greatly enhanced the work and manuscript, and Google for hosting BigQuery. This work is supported by NSF grants 1360228 and 1536022, the U.S. Patent and Trademark Office, the American Institutes for Research, and the Coleman Fung Institute for Engineering Leadership; errors and omissions remain the authors' own. Balsmeier gratefully acknowledges financial support from the Flemish Science Foundation.
Abstract
Drawing upon recent advances in machine learning and natural language processing, we introduce new tools that automatically ingest, parse, disambiguate, and build an updated database using U.S. patent data. The tools identify unique inventor, assignee, and location entities mentioned on each granted U.S. patent from 1976 to 2016. We describe data flow, algorithms, user interfaces, descriptive statistics, and a novelty measure based on the first appearance of a word in the patent corpus. We illustrate an automated coinventor network mapping tool and visualize trends in patenting over the last 40 years. Data and documentation can be found at https://console.cloud.google.com/launcher/partners/patents-public-data.
KEYWORDS
database, disambiguation, machine learning, natural language processing, patent, social networks
JEL CLASSIFICATION:
C80, C81, C88, O33, O34
1 INTRODUCTION
Patent data have been used to study invention and innovation for over half a century (see Hall & Harhoff, 2012, for an overview).
The popularity of patent data stems largely from the rich, consistent, and comparable information that can be obtained for a
huge number of entities, that is, organizations, individuals, and locations. Aggregating patents remains difficult because entities
are only listed by their names on each patent document and do not always receive a unique identifier from the patent office (at worst they remain inconsistent text fields). Looking at these fields as they appear on the patent document reveals various forms of misspelling as well as correct but different spellings of the same name. The ambiguous names further limit the ability to assemble patent portfolios for research, as it is difficult to foresee all the kinds of name abbreviations that can occur. As a result of the lack of unique
identifiers, individual researchers spend significant amounts of time and resources on labor-intensive manual disambiguations
of relatively small numbers of patents. A few of these researchers (laudably) make these efforts available to the community,
which results in a patchwork of retrospective coverages of different periods, types of data, and manual cleaning methods.
The problem of disambiguating inventor names has received considerable attention (Carayol & Cassi, 2009; Fleming & Juda, 2004; Lai, D'Amour, & Fleming, 2009; Li et al., 2014; Monath & McCallum, 2015; Pezzoni, Lissoni, & Tarasconi, 2012; Raffo & Lhuillery, 2009; Singh, 2005; Trajtenberg, Shiff, & Melamed, 2006). These efforts are gaining in sophistication, accuracy, and speed, such that fully automated approaches can now compete with smaller, hand-crafted, and manually tuned
datasets. Concurrent efforts have been made at the assignee level using automated and manual methods. Hall, Jaffe, and
Trajtenberg (2001) disambiguated the assignees and introduced their patent data project under the auspices of the National
Bureau of Economic Research (NBER). These data are widely used, partly because many assignees have also been matched
to unique identifiers of publicly listed firms, which, in turn, enables easy matching with various other firm-level databases,
for example, Compustat. Producing updates of the NBER patent data is costly, however, due to the quite sophisticated but still
often labor-intensive process. Location data are available for most inventors (their home towns in particular), and while these data have been used previously, no comprehensive and automated approach to their disambiguation exists.
Drawing upon recent advances in machine learning and natural language processing, the intent of this paper is to provide
working prototypes for automating the disambiguation of patent entities and patent data manipulation, with the ultimate goal
of providing tools and reasonably accurate and timely disambiguation data. Automation enables real-time updates with little
manual effort and thus enables investigation of contemporary questions (e.g., many papers that study firms stop at the end of
the NBER effort in 2006). The tools and data presented here are far from perfect and have not been fully characterized for their
many potential applications; adopting an open innovation model, the thrust of this work is to put forth prototypes of automated
disambiguation and data manipulation, with the hope that it greatly decreases manual cleaning and motivates further automation.
Illustrating the power of machine learning and natural language processing techniques in a different context than disambiguation, this paper also presents data and tools that open up new areas of investigation. It provides a measure of novelty, based on the first occurrence of a word in the patent corpus, following a baseline period of 1975–1985. This measure of a patent's novelty is prospective and remains orthogonal to the typically used measure of citations (which have been used to measure very different concepts such as impact, financial value, or knowledge diffusion).
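As a rough sketch of how such a first-appearance measure can be computed (this is not the paper's implementation; the tokenizer, the baseline handling, and the (patent_id, grant_year, text) input format are illustrative assumptions), consider the following Python fragment:

import re
from collections import defaultdict

BASELINE_END = 1985   # end of the 1975-1985 baseline period described in the text

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a simplifying assumption).
    return set(re.findall(r"[a-z]+", text.lower()))

def first_appearance_novelty(patents):
    # patents: iterable of (patent_id, grant_year, text), sorted by grant_year.
    # Returns {patent_id: number of words appearing for the first time in the corpus};
    # patents granted during the baseline period only seed the vocabulary.
    seen = set()
    novelty = defaultdict(int)
    for patent_id, year, text in patents:
        new_words = tokenize(text) - seen
        if year > BASELINE_END:
            novelty[patent_id] = len(new_words)
        seen |= new_words
    return novelty

toy = [("4000001", 1976, "A widget with a lever"),
       ("9000001", 2015, "A widget using graphene nanotubes")]
print(first_appearance_novelty(toy))   # the 2015 patent contributes three new words

Because the vocabulary accumulates in grant-year order, any given word can raise at most one patent's novelty count.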
The paper also provides a real-time rendering of social networks, based on a list of patents or inventor input, along with two levels of surrounding indirect linkages. All code is available from the last author. Users are encouraged to consult the BigQuery documentation at https://cloud.google.com/bigquery/quickstart-web-ui; patent-specific data and tool links are at
https://console.cloud.google.com/launcher/partners/patents-public-data.
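The network tool itself is described later in the paper; as a hedged illustration of the underlying idea, the sketch below (using networkx, an assumption, with a made-up input format) builds a coinventor graph and extracts an inventor's neighborhood out to two levels of indirect linkages:

import itertools
import networkx as nx

def coinventor_graph(patents):
    # patents: iterable of (patent_id, [inventor names]) -> undirected coinventor graph.
    g = nx.Graph()
    for patent_id, inventors in patents:
        for a, b in itertools.combinations(sorted(set(inventors)), 2):
            g.add_edge(a, b, patent=patent_id)
    return g

def neighborhood(g, inventor, hops=2):
    # Subgraph containing the inventor plus `hops` levels of surrounding linkages.
    return nx.ego_graph(g, inventor, radius=hops)

g = coinventor_graph([("D0001", ["A. Smith", "B. Jones"]),
                      ("D0002", ["B. Jones", "C. Lee"]),
                      ("D0003", ["C. Lee", "D. Kim"])])
print(sorted(neighborhood(g, "A. Smith").nodes()))   # A. Smith, B. Jones, C. Lee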
2 DATA SOURCES, PARSING, AND PREPARATION
Preparing the data is more time-consuming and arduous than running the actual disambiguation. This section describes how
data are obtained, parsed, cleaned, and structured. Knowing the necessary details should allow researchers to either circumvent
this step by building directly on these efforts or at least save themselves some inevitable detours. Rather than parse weekly data
available from the USPTO (in formats that have varied greatly over time and still include errors that have been since fixed by
the USPTO), we scraped every granted patent from the USPTO web site and parsed, cleaned, and inserted the data into a SQL
database. All patents up to the end of May 2017, including utility, design, plant, and reissues, are scraped and processed (some
of the descriptive statistics below may reflect a December 2016 cutoff). Figure 1 illustrates the process.
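Figure 1 is not reproduced here; the flow it depicts (scrape each granted patent's page, parse it, and insert the result into a SQL database, logging failures rather than aborting) can be sketched roughly as follows. The URL list, the trivial title-only parser, and the local SQLite target are illustrative assumptions, not the project's actual code:

import re
import sqlite3
import urllib.request

def fetch(url):
    # Download one patent full-text HTML page (the URL source is an assumption).
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_title(html):
    # Placeholder parser: pull the <title> tag; the real field extraction is described in Section 2.1.
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else ""

def load(db_path, rows):
    # Insert (url, title) pairs into a raw-records table in a local SQLite file.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS rawpatent (url TEXT PRIMARY KEY, title TEXT)")
    con.executemany("INSERT OR REPLACE INTO rawpatent VALUES (?, ?)", rows)
    con.commit()
    con.close()

def run_pipeline(urls, db_path="patents.db"):
    # Scrape, parse, and load; record failures and keep going.
    rows, failed = [], []
    for url in urls:
        try:
            rows.append((url, parse_title(fetch(url))))
        except Exception:
            failed.append(url)
    load(db_path, rows)
    return failed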
As the data are extracted from USPTO documents, they are streamed into a relational database that contains records linking
patents, citations, inventors, assignees, technology classes, and locations. The database itself uses the SQL database engine, but
the patent library uses an object relational mapper (ORM) to allow database entries to be manipulated as Python objects. This
simplifies development by removing the need to write code for a specific database back end, and facilitates use by not requiring the user to be familiar with relational databases in order to query and manipulate the data. The raw tables in the database
contain data as it appears in the patent record; this preserves the resolution of the data and gives the user freedom to develop their
own processes over the original data. As the disambiguations are run, the raw records are linked to the disambiguated records.
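As a hedged sketch of what such an ORM layer looks like, the fragment below uses SQLAlchemy (an assumption; the paper does not name its ORM, and the real schema links patents, citations, inventors, assignees, classes, and locations) to show a raw inventor-mention table that is linked to a disambiguated inventor table once the disambiguation has run:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Inventor(Base):
    # Disambiguated inventor entity.
    __tablename__ = "inventor"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    raw_records = relationship("RawInventor", back_populates="inventor")

class RawInventor(Base):
    # Inventor name exactly as printed on a patent; linked after disambiguation runs.
    __tablename__ = "rawinventor"
    id = Column(Integer, primary_key=True)
    patent_id = Column(String)
    name = Column(String)
    inventor_id = Column(Integer, ForeignKey("inventor.id"), nullable=True)
    inventor = relationship("Inventor", back_populates="raw_records")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Raw mentions are stored as-is; the disambiguation later links them to one Inventor.
lee = Inventor(name="Lee Fleming")
session.add_all([lee,
                 RawInventor(patent_id="7000001", name="Fleming; Lee", inventor=lee),
                 RawInventor(patent_id="8000001", name="Fleming; Lee O.", inventor=lee)])
session.commit()
print([r.name for r in lee.raw_records])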
The input data to be parsed come from the USPTO web site.1 The output of the parsing algorithm is a tab-separated values (TSV) file, with each row corresponding to a particular parsed patent. Going from raw HTML to TSV requires five steps.
2.1 Step 1: Vectorization
Step 1 of the parsing algorithm iterates through the list of HTML files and extracts useful data from them. A Python module called "Beautiful Soup" is used to convert the raw HTML text into a structured Python object, making it easier to extract text strings. The data are extracted into 23 fields, including grant ID/date, application ID/date, assignee names/locations, inventor names/locations, US Class, CPC Class, referenced US patents, etc. If there is an error extracting data from a particular patent, the program records that patent's ID and skips to the next patent (we encountered 195 errors, of which 183 resulted
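A minimal sketch of the kind of extraction Step 1 describes is given below; the HTML selectors and the three-field subset are illustrative assumptions (the real parser extracts 23 fields), but the error handling mirrors the description above: record the failing patent's ID and continue.

import csv
from bs4 import BeautifulSoup   # the "Beautiful Soup" module referenced above

FIELDS = ["grant_id", "grant_date", "title"]   # the real parser extracts 23 fields

def parse_patent(html):
    # Extract a few illustrative fields from one patent's full-text HTML page.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "grant_id": soup.find("span", class_="grant-id").get_text(strip=True),      # selector assumed
        "grant_date": soup.find("span", class_="grant-date").get_text(strip=True),  # selector assumed
        "title": soup.find("font", size="+1").get_text(strip=True),                 # selector assumed
    }

def parse_all(pages, out_path="patents.tsv"):
    # Write one TSV row per patent; on any extraction error, record the ID and move on.
    errors = []
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for patent_id, html in pages:
            try:
                writer.writerow(parse_patent(html))
            except Exception:
                errors.append(patent_id)
    return errors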
