Received: 2 July 2015 Revised: 1 August 2017 Accepted: 2 August 2017
DOI: 10.1111/jems.12259
SPECIAL ISSUE
Machine learning and natural language processing on the patent
corpus: Data, tools, and new measures
Benjamin Balsmeier1, Mohamad Assaf2,3, Tyler Chesebro4, Gabe Fierro4, Kevin Johnson4, Scott Johnson4, Guan-Cheng Li2, Sonja Lück5, Doug O'Reagan2, Bill Yeh4, Guangzheng Zang4, Lee Fleming2
1 Centre for Research in Economics and Management, University of Luxembourg, Esch-sur-Alzette, Luxembourg
2 Coleman Fung Institute for Engineering Leadership, UC Berkeley, Berkeley, CA, USA
3 Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon
4 Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA, USA
5 Department of Economics, University of Paderborn, Paderborn, Germany
Correspondence
Lee Fleming, Coleman Fung Institute for Engineering Leadership, UC Berkeley, Berkeley, CA, USA.
Email: lfleming@berkeley.edu

The authors wish to thank two very careful reviewers who greatly enhanced the work and manuscript, and Google for hosting BigQuery. This work is supported by NSF grants 1360228 and 1536022, the U.S. Patent and Trademark Office, the American Institutes for Research, and the Coleman Fung Institute for Engineering Leadership; errors and omissions remain the authors' own. Balsmeier gratefully acknowledges financial support from the Flemish Science Foundation.
Abstract
Drawing upon recent advances in machine learning and natural language processing, we introduce new tools that automatically ingest, parse, disambiguate, and build an updated database using U.S. patent data. The tools identify unique inventor, assignee, and location entities mentioned on each granted U.S. patent from 1976 to 2016. We describe data flow, algorithms, user interfaces, descriptive statistics, and a novelty measure based on the first appearance of a word in the patent corpus. We illustrate an automated coinventor network mapping tool and visualize trends in patenting over the last 40 years. Data and documentation can be found at https://console.cloud.google.com/launcher/partners/patents-public-data.
KEYWORDS
database, disambiguation, machine learning, natural language processing, patent, social networks
JEL CLASSIFICATION:
C80, C81, C88, O33, O34
1 INTRODUCTION
Patent data have been used to study invention and innovation for over half a century (see Hall & Harhoff, 2012, for an overview).
The popularity of patent data stems largely from the rich, consistent, and comparable information that can be obtained for a
huge number of entities, that is, organizations, individuals, and locations. Aggregating patents remains difficult because entities
are only listed by their names on each patent document and do not always receive a unique identifier from the patent office (at worst they remain inconsistent text fields). Looking at these fields as they appear on the patent document reveals various forms of misspelling as well as correct but different spellings of the same name. The ambiguous names further limit the ability to assemble patent portfolios for research, as it is difficult to foresee all the kinds of name abbreviations that can occur. As a result of the lack of unique
identifiers, individual researchers spend significant amounts of time and resources on labor-intensive manual disambiguations
of relatively small numbers of patents. A few of these researchers (laudably) make these efforts available to the community,
which results in a patchwork of retrospective coverages of different periods, types of data, and manual cleaning methods.
The problem of disambiguating inventor names has received considerable attention (Carayol & Cassi, 2009; Fleming & Juda, 2004; Lai, D'Amour, & Fleming, 2009; Li et al., 2014; Monath & McCallum, 2015; Pezzoni, Lissoni, & Tarasconi, 2012; Raffo & Lhuillery, 2009; Singh, 2005; Trajtenberg, Shiff, & Melamed, 2006). These efforts are gaining in sophistication, accuracy, and speed, such that fully automated approaches can now compete with smaller, hand-crafted, and manually tuned
datasets. Concurrent efforts have been made at the assignee level using automated and manual methods. Hall, Jaffe, and
Trajtenberg (2001) disambiguated the assignees and introduced their patent data project under the auspices of the National
Bureau of Economic Research (NBER). These data are widely used, partly because many assignees have also been matched
to unique identifiers of publicly listed firms, which, in turn, enables easy matching with various other firm-level databases,
for example, Compustat. Producing updates of the NBER patent data is costly, however, due to the quite sophisticated but still
often labor-intensive process. Location data are available for most inventors (their home towns in particular), and while these data have been used previously, no comprehensive and automated approach to their disambiguation exists.
Drawing upon recent advances in machine learning and natural language processing, the intent of this paper is to provide
working prototypes for automating the disambiguation of patent entities and patent data manipulation, with the ultimate goal
of providing tools and reasonably accurate and timely disambiguation data. Automation enables real-time updates with little
manual effort and thus enables investigation of contemporary questions (e.g., many papers that study firms stop at the end of
the NBER effort in 2006). The tools and data presented here are far from perfect and have not been fully characterized for their
many potential applications; adopting an open innovation model, the thrust of this work is to put forth prototypes of automated
disambiguation and data manipulation, with the hope that it greatly decreases manual cleaning and motivates further automation.
Illustrating the power of machine learning and natural language processing techniques in a different context than disambiguation, this paper also presents data and tools that open up new areas of investigation. It provides a measure of novelty, based on the first occurrence of a word in the patent corpus, following a baseline period of 1975–1985. This measure of a patent's novelty is prospective and remains orthogonal to the typically used measure of citations (which have been used to measure very different concepts such as impact, financial value, or knowledge diffusion).
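As a rough sketch of how such a first-appearance measure can be computed (this is not the paper's implementation; the tokenizer, the baseline handling, and the (patent_id, grant_year, text) input format are illustrative assumptions), consider the following Python fragment:

import re
from collections import defaultdict

BASELINE_END = 1985   # end of the 1975-1985 baseline period described in the text

def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a simplifying assumption).
    return set(re.findall(r"[a-z]+", text.lower()))

def first_appearance_novelty(patents):
    # patents: iterable of (patent_id, grant_year, text), sorted by grant_year.
    # Returns {patent_id: number of words appearing for the first time in the corpus};
    # patents granted during the baseline period only seed the vocabulary.
    seen = set()
    novelty = defaultdict(int)
    for patent_id, year, text in patents:
        new_words = tokenize(text) - seen
        if year > BASELINE_END:
            novelty[patent_id] = len(new_words)
        seen |= new_words
    return novelty

toy = [("4000001", 1976, "A widget with a lever"),
       ("9000001", 2015, "A widget using graphene nanotubes")]
print(first_appearance_novelty(toy))   # the 2015 patent contributes three new words

Because the vocabulary accumulates in grant-year order, any given word can raise at most one patent's novelty count.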
The paper also provides a real-time rendering of social networks, based on a list of patents or inventor input, along with two levels of surrounding indirect linkages. All code is available from the last author. Users are encouraged to consult the BigQuery documentation at https://cloud.google.com/bigquery/quickstart-web-ui; patent-specific data and tool links are at
https://console.cloud.google.com/launcher/partners/patents-public-data.
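The network tool itself is described later in the paper; as a hedged illustration of the underlying idea, the sketch below (using networkx, an assumption, with a made-up input format) builds a coinventor graph and extracts an inventor's neighborhood out to two levels of indirect linkages:

import itertools
import networkx as nx

def coinventor_graph(patents):
    # patents: iterable of (patent_id, [inventor names]) -> undirected coinventor graph.
    g = nx.Graph()
    for patent_id, inventors in patents:
        for a, b in itertools.combinations(sorted(set(inventors)), 2):
            g.add_edge(a, b, patent=patent_id)
    return g

def neighborhood(g, inventor, hops=2):
    # Subgraph containing the inventor plus `hops` levels of surrounding linkages.
    return nx.ego_graph(g, inventor, radius=hops)

g = coinventor_graph([("D0001", ["A. Smith", "B. Jones"]),
                      ("D0002", ["B. Jones", "C. Lee"]),
                      ("D0003", ["C. Lee", "D. Kim"])])
print(sorted(neighborhood(g, "A. Smith").nodes()))   # A. Smith, B. Jones, C. Lee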
2 DATA SOURCES, PARSING, AND PREPARATION
Preparing the data is more time-consuming and arduous than running the actual disambiguation. This section describes how
data are obtained, parsed, cleaned, and structured. Knowing the necessary details should allow researchers to either circumvent
this step by building directly on these efforts or at least save themselves some inevitable detours. Rather than parse weekly data
available from the USPTO (in formats that have varied greatly over time and still include errors that have been since fixed by
the USPTO), we scraped every granted patent from the USPTO web site and parsed, cleaned, and inserted the data into a SQL
database. All patents up to the end of May 2017, including utility, design, plant, and reissues, are scraped and processed (some
of the descriptive statistics below may reflect a December 2016 cutoff). Figure 1 illustrates the process.
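Figure 1 is not reproduced here; the flow it depicts (scrape each granted patent's page, parse it, and insert the result into a SQL database, logging failures rather than aborting) can be sketched roughly as follows. The URL list, the trivial title-only parser, and the local SQLite target are illustrative assumptions, not the project's actual code:

import re
import sqlite3
import urllib.request

def fetch(url):
    # Download one patent full-text HTML page (the URL source is an assumption).
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_title(html):
    # Placeholder parser: pull the <title> tag; the real field extraction is described in Section 2.1.
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else ""

def load(db_path, rows):
    # Insert (url, title) pairs into a raw-records table in a local SQLite file.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS rawpatent (url TEXT PRIMARY KEY, title TEXT)")
    con.executemany("INSERT OR REPLACE INTO rawpatent VALUES (?, ?)", rows)
    con.commit()
    con.close()

def run_pipeline(urls, db_path="patents.db"):
    # Scrape, parse, and load; record failures and keep going.
    rows, failed = [], []
    for url in urls:
        try:
            rows.append((url, parse_title(fetch(url))))
        except Exception:
            failed.append(url)
    load(db_path, rows)
    return failed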
As the data are extracted from USPTO documents, they are streamed into a relational database that contains records linking
patents, citations, inventors, assignees, technology classes, and locations. The database itself uses the SQL database engine, but
the patent library uses an object relational mapper (ORM) to allow database entries to be manipulated as Python objects. This
simplifies development by removing the need to write code for a specific database back end, and facilitates use by not requiring the user to be familiar with relational databases in order to query and manipulate the data. The raw tables in the database
contain data as it appears in the patent record; this preserves the resolution of the data and gives the user freedom to develop their
own processes over the original data. As the disambiguations are run, the raw records are linked to the disambiguated records.
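As a hedged sketch of what such an ORM layer looks like, the fragment below uses SQLAlchemy (an assumption; the paper does not name its ORM, and the real schema links patents, citations, inventors, assignees, classes, and locations) to show a raw inventor-mention table that is linked to a disambiguated inventor table once the disambiguation has run:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Inventor(Base):
    # Disambiguated inventor entity.
    __tablename__ = "inventor"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    raw_records = relationship("RawInventor", back_populates="inventor")

class RawInventor(Base):
    # Inventor name exactly as printed on a patent; linked after disambiguation runs.
    __tablename__ = "rawinventor"
    id = Column(Integer, primary_key=True)
    patent_id = Column(String)
    name = Column(String)
    inventor_id = Column(Integer, ForeignKey("inventor.id"), nullable=True)
    inventor = relationship("Inventor", back_populates="raw_records")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Raw mentions are stored as-is; the disambiguation later links them to one Inventor.
lee = Inventor(name="Lee Fleming")
session.add_all([lee,
                 RawInventor(patent_id="7000001", name="Fleming; Lee", inventor=lee),
                 RawInventor(patent_id="8000001", name="Fleming; Lee O.", inventor=lee)])
session.commit()
print([r.name for r in lee.raw_records])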
The input data to be parsed come from the USPTO web site.1 The output of the parsing algorithm is a tab-separated values (TSV) file, with each row corresponding to a particular parsed patent. Going from raw HTML to TSV requires five steps.
2.1 Step 1: Vectorization
Step 1 of the parsing algorithm iterates through the list of HTML files and extracts useful data from them. A Python module called "Beautiful Soup" is used to convert the raw HTML text into a structured Python object, making it easier to extract text strings. The data are extracted into 23 fields, including grant ID/date, application ID/date, assignee names/locations, inventor names/locations, US Class, CPC Class, referenced US patents, etc. If there is an error extracting data from a particular patent, the program records that patent's ID and skips to the next patent (we encountered 195 errors, of which 183 resulted
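A minimal sketch of the kind of extraction Step 1 describes is given below; the HTML selectors and the three-field subset are illustrative assumptions (the real parser extracts 23 fields), but the error handling mirrors the description above: record the failing patent's ID and continue.

import csv
from bs4 import BeautifulSoup   # the "Beautiful Soup" module referenced above

FIELDS = ["grant_id", "grant_date", "title"]   # the real parser extracts 23 fields

def parse_patent(html):
    # Extract a few illustrative fields from one patent's full-text HTML page.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "grant_id": soup.find("span", class_="grant-id").get_text(strip=True),      # selector assumed
        "grant_date": soup.find("span", class_="grant-date").get_text(strip=True),  # selector assumed
        "title": soup.find("font", size="+1").get_text(strip=True),                 # selector assumed
    }

def parse_all(pages, out_path="patents.tsv"):
    # Write one TSV row per patent; on any extraction error, record the ID and move on.
    errors = []
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for patent_id, html in pages:
            try:
                writer.writerow(parse_patent(html))
            except Exception:
                errors.append(patent_id)
    return errors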
