Accessing Online Data: Web-Crawling and Information-Scraping Techniques to Automate the Assembly of Research Data

Brett Massimino
Cornell University

DOI: http://doi.org/10.1111/jbl.12120
Published: 01 March 2016
There is a growing interest in leveraging alternate sources of empirical data, with an increasing emphasis being placed on the Internet. This paper serves as a primer for supply chain management (SCM) researchers who may be interested in leveraging Internet-based sources for their own research, but are perhaps not familiar with how to begin. Here, definitions and concepts critical to successful implementation in practice are provided. In addition, concrete, discipline-relevant examples accompany the discussion, and are aided by a fully detailed online code supplement. Performance enhancements are discussed, as well as associated caveats and limitations. Additionally, insights and guidance are offered on the unique responsibilities of researchers to uphold the ethical spirit of scientific research when continuing along these paths. Pragmatic issues related to the application of these techniques are presented for consideration by individual researchers and the SCM community as a whole.
Keywords: web crawling; information scraping; Internet data; programming etiquette; social responsibility
INTRODUCTION

"This is the digital universe. It is growing 40% a year into the next decade, expanding to include not only the increasing number of people and enterprises doing everything online, but also all the 'things' – smart devices connected to the Internet, unleashing a new wave of opportunities for businesses and people around the world."

– EMC (2014)
In this day and age there is no need to rehash the remarkable growth in scope and volume of data generated on a daily basis. Worthy of explicit consideration, however, is how managerial insights may be gleaned from these seemingly omnipresent resources. Contemporary developments in "big data" and "analytics" – fields defined by the use of data far exceeding the realm of manual tractability – offer tremendous potential to remove many limitations of traditional research, or at least greatly augment them. This potential extends well into areas that are perhaps not immediately evident: creative applications may enrich case studies, enhance the power of controlled laboratory experiments, and assess the impact of sampling alternatives. Indeed, this potential is becoming evident in contemporary supply chain management (SCM) research, such as in the use of high-velocity sensor data (Perdikaki et al. 2012) or granular social networks (Misopoulos et al. 2014). Yet, many opportunities remain.
As empirical SCM research continues to increase in scope and rigor, the likelihood that extensive, clean, relevant, and wholly untapped data sets can be obtained from single sources will continue to drop; the ability of researchers to automate data management processes will become indispensable. Yet, many SCM researchers wonder how such automation may be practically implemented. This paper directly addresses such questions by systematically introducing practical techniques which tackle the navigation, retrieval, and parsing of large, remote data sets. I illustrate these concepts through concrete, functional examples, carefully selected on the basis of timeliness and relevance to the SCM researcher and presented in a functional accompaniment. In addition to these practical illustrations, I discuss caveats and social considerations necessary to uphold the "traditions" of scientific research in this new domain.
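
To give a flavor of those techniques before the formal treatment, the following is a minimal sketch of automated retrieval and parsing. It is an illustrative sketch only and is not drawn from the paper's online code supplement; it assumes Python's standard library and targets the generic placeholder page at https://example.com/ and its single <h1> heading.

    # Illustrative sketch (not from this paper's online supplement):
    # retrieve a remote page, then parse one structured element from
    # its markup, using only the Python standard library.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class HeadingParser(HTMLParser):
        """Collect the text content of every <h1> element on a page."""

        def __init__(self):
            super().__init__()
            self.in_h1 = False
            self.headings = []

        def handle_starttag(self, tag, attrs):
            if tag == "h1":
                self.in_h1 = True

        def handle_endtag(self, tag):
            if tag == "h1":
                self.in_h1 = False

        def handle_data(self, data):
            if self.in_h1:
                self.headings.append(data.strip())

    # Retrieval: fetch the raw HTML of the placeholder page.
    with urlopen("https://example.com/") as response:
        html = response.read().decode("utf-8", errors="replace")

    # Parsing: extract the element of interest from the markup.
    parser = HeadingParser()
    parser.feed(html)
    print(parser.headings)  # ['Example Domain']

A real project would substitute the target data source for the placeholder URL and add error handling, polite request pacing, and storage, but the two-step structure (retrieve, then parse) is the same.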
This paper is structured as follows. I begin by delineating the scope and assumptions made of the reader. Next, I describe concepts central to web crawling and information scraping, and provide definitions necessary for practical implementation. I detail three generalizable, illustrative cases implementing these methods. Afterward, I describe performance enhancements that may appeal to the target SCM audience. Then, I present a separate, comprehensive illustration which also has direct applicability for SCM research. I conclude with a commentary on the implications and questions for SCM researchers.
DEFINITION OF SCOPE
Target audience and article objectives
The overarching intent of this article is twofold: (1) provide a generalizable primer on web-crawling and information-scraping methods for researchers who may need it, and (2) offer guidance toward a sustainable, unified effort in using these methods. Because several extremely broad topics are being considered, it is not possible to provide an exhaustive review within this article. Thus, I will first delineate the scope of this article, and outline the simplifying assumptions and restrictions that were necessary to develop a useful, but brief, essay.

This article spans two distinct automated methodologies: web crawling and information scraping.[1] Although each method serves a distinctive purpose, the two are often implemented in unison and confounded in practice. For semantic clarity, I employ the following functional definitions throughout this article (a brief sketch contrasting the two methods follows the footnote below).
Corresponding author: Brett Massimino, 340 Statler Hall, Cornell University, Ithaca, NY 14853, USA; E-mail: bjm272@cornell.edu
[1] Numerous aliases are used in practice (e.g., "web scraping," "web harvesting," and "information extraction"). For consistency, I use only "web crawling" and "information scraping," as defined in this paragraph.
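
To anticipate those definitions concretely, the sketch below keeps the two methods separate: a crawling step that discovers pages by following hyperlinks, and a scraping step that extracts one field from each page retrieved. It is again an assumed illustration rather than the article's supplement code; the seed URL, the breadth-first traversal, the page cap, and the scraped field are all placeholder choices.

    # Illustrative sketch (not the article's supplement code) separating
    # the two methods: "web crawling" discovers pages by following
    # hyperlinks; "information scraping" extracts a field from each
    # retrieved page. Seed URL, page cap, and field are placeholders.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href targets of <a> tags (the crawling concern)."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch(url):
        """Retrieve the raw HTML of one page."""
        with urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def crawl(seed, max_pages=3):
        """Web crawling: breadth-first visit of pages reachable from seed."""
        queue, seen, pages = [seed], set(), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = fetch(url)
            except OSError:
                continue  # skip unreachable pages
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            # Resolve relative links and keep only http(s) targets.
            queue.extend(
                link for link in (urljoin(url, raw) for raw in parser.links)
                if link.startswith("http")
            )
        return pages

    def scrape_title(html):
        """Information scraping: pull one field (the <title>) from a page."""
        start = html.find("<title>")
        end = html.find("</title>")
        if start == -1 or end == -1:
            return None
        return html[start + len("<title>"):end].strip()

    for url, html in crawl("https://example.com/").items():
        print(url, "->", scrape_title(html))

Keeping the two concerns apart in this way means the crawler (which decides where to go next) and the scraper (which decides what to keep) can each be replaced without touching the other, which is why the two methods are so often implemented in unison yet remain conceptually distinct.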