Accessing Online Data: Web-Crawling and Information-Scraping Techniques to Automate the Assembly of Research Data

Brett Massimino
Cornell University

DOI: http://doi.org/10.1111/jbl.12120
Published: 01 March 2016
There is a growing interest in leveraging alternate sources of empirical data, with an increasing emphasis being placed on the Internet. This paper serves as a primer for supply chain management (SCM) researchers who may be interested in leveraging Internet-based sources for their own research, but are perhaps not familiar with how to begin. Here, definitions and concepts critical to successful implementation in practice are provided. In addition, concrete, discipline-relevant examples accompany the discussion, and are aided by a fully detailed online code supplement. Performance enhancements are discussed, as well as associated caveats and limitations. Additionally, insights and guidance are offered on the unique responsibilities of researchers to uphold the ethical spirit of scientific research when continuing along these paths. Pragmatic issues related to the application of these techniques are presented for consideration by individual researchers and the SCM community as a whole.
Keywords: web crawling; information scraping; Internet data; programming etiquette; social responsibility
INTRODUCTION

"This is the digital universe. It is growing 40% a year into the next decade, expanding to include not only the increasing number of people and enterprises doing everything online, but also all the 'things' – smart devices connected to the Internet, unleashing a new wave of opportunities for businesses and people around the world."

– EMC (2014)
In this day and age there is no need to rehash the remarkable growth in scope and volume of data generated on a daily basis. Worthy of explicit consideration, however, is how managerial insights may be gleaned from these seemingly omnipresent resources. Contemporary developments in "big data" and "analytics" – fields defined by the use of data far exceeding the realm of manual tractability – offer tremendous potential to remove many limitations of traditional research, or at least greatly augment them. This potential extends well into areas that are perhaps not immediately evident: creative applications may enrich case studies, enhance the power of controlled laboratory experiments, and assess the impact of sampling alternatives. Indeed, this potential is becoming evident in contemporary supply chain management (SCM) research, such as in the use of high-velocity sensor data (Perdikaki et al. 2012) or granular social networks (Misopoulos et al. 2014). Yet, many opportunities remain.
As empirical SCM research continues to increase in scope and rigor, the likelihood that extensive, clean, relevant, and wholly untapped data sets can be obtained from single sources will continue to drop; the ability of researchers to automate data management processes will become indispensable. Yet, many SCM researchers wonder how such automation may be practically implemented. This paper directly addresses such questions by systematically introducing practical techniques which tackle the navigation, retrieval, and parsing of large, remote data sets. I illustrate these concepts through concrete, functional examples, carefully selected on the basis of timeliness and relevance to the SCM researcher and presented in a functional accompaniment. In addition to these practical illustrations, I discuss caveats and social considerations necessary to uphold the "traditions" of scientific research in this new domain.
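
To give a flavor of those techniques before the formal treatment, the following is a minimal sketch of automated retrieval and parsing. It is an illustrative sketch only and is not drawn from the paper's online code supplement; it assumes Python's standard library and targets the generic placeholder page at https://example.com/ and its single <h1> heading.

    # Illustrative sketch (not from this paper's online supplement):
    # retrieve a remote page, then parse one structured element from
    # its markup, using only the Python standard library.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class HeadingParser(HTMLParser):
        """Collect the text content of every <h1> element on a page."""

        def __init__(self):
            super().__init__()
            self.in_h1 = False
            self.headings = []

        def handle_starttag(self, tag, attrs):
            if tag == "h1":
                self.in_h1 = True

        def handle_endtag(self, tag):
            if tag == "h1":
                self.in_h1 = False

        def handle_data(self, data):
            if self.in_h1:
                self.headings.append(data.strip())

    # Retrieval: fetch the raw HTML of the placeholder page.
    with urlopen("https://example.com/") as response:
        html = response.read().decode("utf-8", errors="replace")

    # Parsing: extract the element of interest from the markup.
    parser = HeadingParser()
    parser.feed(html)
    print(parser.headings)  # ['Example Domain']

A real project would substitute the target data source for the placeholder URL and add error handling, polite request pacing, and storage, but the two-step structure (retrieve, then parse) is the same.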
This paper is structured as follows. I begin by delineating the scope and assumptions made of the reader. Next, I describe concepts central to web crawling and information scraping, and provide definitions necessary for practical implementation. I detail three generalizable, illustrative cases implementing these methods. Afterward, I describe performance enhancements that may appeal to the target SCM audience. Then, I present a separate, comprehensive illustration which also has direct applicability for SCM research. I conclude with a commentary on the implications and questions for SCM researchers.
DEFINITION OF SCOPE
Target audience and article objectives
The overarching intent of this article is twofold: (1) provide a generalizable primer on web-crawling and information-scraping methods for researchers who may need it, and (2) offer guidance toward a sustainable, unified effort in using these methods. Because several extremely broad topics are being considered, it is not possible to provide an exhaustive review within this article. Thus, I will first delineate the scope of this article, and outline the simplifying assumptions and restrictions that were necessary to develop a useful, but brief, essay.

This article spans two distinct automated methodologies: web crawling and information scraping.[1] Although each method serves a distinctive purpose, the two are often implemented in unison and confounded in practice. For semantic clarity, I employ the following functional definitions throughout this article (a brief sketch contrasting the two methods follows the footnote below).
Corresponding author: Brett Massimino, 340 Statler Hall, Cornell University, Ithaca, NY 14853, USA; E-mail: bjm272@cornell.edu
[1] Numerous aliases are used in practice (e.g., "web scraping," "web harvesting," and "information extraction"). For consistency, I use only "web crawling" and "information scraping," as defined in this paragraph.
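
To anticipate those definitions concretely, the sketch below keeps the two methods separate: a crawling step that discovers pages by following hyperlinks, and a scraping step that extracts one field from each page retrieved. It is again an assumed illustration rather than the article's supplement code; the seed URL, the breadth-first traversal, the page cap, and the scraped field are all placeholder choices.

    # Illustrative sketch (not the article's supplement code) separating
    # the two methods: "web crawling" discovers pages by following
    # hyperlinks; "information scraping" extracts a field from each
    # retrieved page. Seed URL, page cap, and field are placeholders.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href targets of <a> tags (the crawling concern)."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch(url):
        """Retrieve the raw HTML of one page."""
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def crawl(seed, max_pages=3):
        """Web crawling: breadth-first visit of pages reachable from seed."""
        queue, seen, pages = [seed], set(), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = fetch(url)
            except OSError:
                continue  # skip unreachable pages
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            # Resolve relative links and keep only http(s) targets.
            queue.extend(
                link for link in (urljoin(url, raw) for raw in parser.links)
                if link.startswith("http")
            )
        return pages

    def scrape_title(html):
        """Information scraping: pull one field (the <title>) from a page."""
        start = html.find("<title>")
        end = html.find("</title>")
        if start == -1 or end == -1:
            return None
        return html[start + len("<title>"):end].strip()

    for url, html in crawl("https://example.com/").items():
        print(url, "->", scrape_title(html))

Keeping the two concerns apart in this way means the crawler (which decides where to go next) and the scraper (which decides what to keep) can each be replaced without touching the other, which is why the two methods are so often implemented in unison yet remain conceptually distinct.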