Choosing Your Platform for Social Media Drug Research and Improving Your Keyword Filter List

Author: Eleanor Erin Artigiani, Nikki Adams, Eric D. Wish
Published date: 01 July 2019
DOI: 10.1177/0022042619833911
Subject Matter: Articles
https://doi.org/10.1177/0022042619833911
Journal of Drug Issues
2019, Vol. 49(3) 477–492
© The Author(s) 2019
Nikki Adams¹, Eleanor Erin Artigiani¹, and Eric D. Wish¹
Abstract
Social media research often has two things in common: Twitter is the platform used and a
keyword filter list is used to extract only relevant Tweets. Here we propose that (a) alternative
platforms be considered more often when doing social media research, and (b) regardless of
platform, researchers use word embeddings as a type of synonym discovery to improve their
keyword filter list, both of which lead to more relevant data. We demonstrate the benefit of
these proposals by comparing how successful our synonym discovery method is at finding terms
for marijuana and select opioids on Twitter versus a platform that can be filtered by topic,
Reddit. We also find words that are not on the U.S. Drug Enforcement Administration (DEA) drug
slang list for that year, some of which appear on the list the subsequent year, showing that this
method could be employed to find drug terms faster than traditional means.
Keywords
social media, keyword filter list, machine learning, synonym detection
Introduction
Research in the public health domain using social media has expanded greatly in recent years. In
particular, many scholars have begun taking advantage of social media’s big data properties by
using automated and semiautomated methods to analyze these data. This research spans many
topics, such as sentiment toward drugs, that is, whether people feel positively or negatively about a
particular drug (e.g., Cavazos-Rehg et al., 2015; Daniulaityte et al., 2016); whether people are
misusing prescription drugs (e.g., Chary et al., 2017); and whether some characteristics of the
drug can be correlated to its legalization status in different states (e.g., Daniulaityte et al., 2015).
The National Institute on Drug Abuse (NIDA)–funded National Drug Early Warning System
(NDEWS) sought to tap into this pool of data as well to advance its mission to monitor indicators
of emerging drug trends.
NDEWS staff, in collaboration with staff at the Center for Advanced Study of Language
(CASL) at the University of Maryland in College Park, hypothesized that an important step in
monitoring drug trends was keeping pace with the vocabulary around drugs
and drug use. They used machine learning on data from Twitter to demonstrate that new drug
terms for marijuana could be discovered using big data methods (Simpson, Adams, Brugman,
& Conners, 2018). The results, however, were less impressive for other types of drugs. The aim
of the research we present here is to improve on that performance by looking at another social
media platform, Reddit, and comparing its utility for discovering drug terms with that of Twitter
for both marijuana and opioids. In doing so, we hope to provide methods and recommendations that
other drug researchers can use as they do their research on social media.
¹University of Maryland, College Park, USA
Corresponding Author:
Nikki Adams, Center for Advanced Study of Language, University of Maryland, 7005 52nd Ave, College Park, MD
20742, USA.
Email: nadams2@umd.edu
Most social media research has two things in common: (a) Twitter is used as the data source,
and (b) a keyword filter list is used, which extracts from Twitter only those tweets containing a
term on that list. Against that background, our aim here is to demonstrate how to improve such
research using social media in two ways. First, we would like to bring more attention to the idea
of using other corpora (i.e., data sets of text used for language analysis), in particular, to using
Reddit. We hypothesized that Reddit would be a better platform for drug term exploration,
especially for opioids, as we could focus on a drug topic subset of the platform; our results
support this hypothesis. Second, we show that, regardless of which corpus a researcher
uses, one can improve one’s keyword filter to increase the number of relevant posts for
examination by doing some synonym discovery before deciding on the final list. The ability to
discover (perhaps previously unknown) synonyms for drugs has implications for anyone doing
research or working in a field that requires knowing how people refer to drugs, and for anyone
wanting to learn more about the language used around drugs (or, in fact, any topic).
The method we use to achieve both of these goals is a machine-learning algorithm for word
embeddings, Word2Vec. In short, in this algorithm, the corpus data are used to model how
words are related to one another based on how those words are used in context. As a result, a
researcher can take terms that he or she already knows and find terms that are like them,
such as synonyms, some of which the researcher had not thought of in advance.
We build these word-embedding models based on Twitter and Reddit and compare them. The
results of our comparison show that we can use synonym discovery to find “new” terms, with
“new” in this context being terms not already on the U.S. Drug Enforcement Administration (DEA)
list of drug slang words (DEA Intelligence Report, 2017). Simpson et al. (2018) used such
models for uncovering marijuana-related terms on Twitter. We do it for marijuana and opioid
terms on Twitter and Reddit, and we show that this method reveals new terms on both social
media platforms.
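The idea of synonym discovery can be illustrated with a minimal sketch. It is not the Word2Vec model described here: it swaps in a much simpler count-based technique (context-word counts compared by cosine similarity) that rests on the same intuition, namely that words used in similar contexts are likely to be related. The posts, the made-up term "zorp", the short known-term list (a stand-in for a reference list such as the DEA slang list), and all function names are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(posts, window=2):
    """Count, for every word, the words appearing within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for post in posts:
        tokens = post.lower().split()
        for i, word in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def discover_terms(seed, vectors, known_terms, top_n=3):
    """Rank words by contextual similarity to `seed`, skipping already-known terms."""
    seed_vector = vectors.get(seed, Counter())
    scored = [
        (word, cosine(seed_vector, vector))
        for word, vector in vectors.items()
        if word not in known_terms
    ]
    # Highest similarity first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy corpus (invented): "zorp" is a hypothetical unknown term used like "weed".
toy_posts = [
    "smoked some weed last night",
    "smoked some zorp last night",
    "smoked some marijuana last night",
    "bought some weed from a friend",
    "bought some zorp from a friend",
    "my cat slept all night",
]
known_terms = {"weed", "marijuana"}  # hypothetical stand-in for a reference list

vectors = context_vectors(toy_posts)
candidates = discover_terms("weed", vectors, known_terms)
print(candidates[0][0])  # the top-ranked candidate term
```

In this toy corpus the made-up term ranks first because it occurs in exactly the same contexts as the seed word, while terms already on the known list are excluded from the candidates. A real study would train an embedding model on a far larger corpus and review the candidates by hand.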
Overall, our results show that Reddit outperforms Twitter for synonym detection for the terms
we selected. The Reddit model was built using drug-related subforums only, making the data less
noisy. Filtering by keyword limits what is ultimately analyzed to posts containing terms that are
already known. This is particularly limiting for term discovery, where the goal is finding unknown
terms. For this reason, choosing a social media platform that is organized by topic can be
beneficial, as it can be filtered by topic without being filtered by keyword.
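The difference between the two filtering strategies can be made concrete with a small sketch. Everything in it is assumed for illustration: the post records, the `forum` and `text` field names, the forum names, and the made-up term "zorp" are hypothetical and are not drawn from the study's data.

```python
# Invented example posts; each record has a hypothetical forum label and text.
posts = [
    {"forum": "drug_forum", "text": "tried zorp for the first time"},
    {"forum": "drug_forum", "text": "weed prices went up"},
    {"forum": "gardening", "text": "weed the garden before planting basil"},
    {"forum": "gardening", "text": "best compost recipe"},
]
known_terms = {"weed"}       # terms the researcher already knows
drug_forums = {"drug_forum"}  # topic-based subset of the platform


def filter_by_keyword(posts, terms):
    """Keep posts whose text contains at least one known term."""
    return [p for p in posts if terms & set(p["text"].lower().split())]


def filter_by_topic(posts, forums):
    """Keep posts from the selected forums, whatever words they use."""
    return [p for p in posts if p["forum"] in forums]


by_keyword = filter_by_keyword(posts, known_terms)
by_topic = filter_by_topic(posts, drug_forums)

print([p["text"] for p in by_keyword])
print([p["text"] for p in by_topic])
```

In this toy data the keyword filter keeps an off-topic gardening post and drops the post containing the unknown term, whereas the topic filter keeps both on-topic posts, including the one a keyword list would have missed.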
Of course, researchers should still choose Twitter when the research question calls for it.
One such reason is that the researcher wants to analyze the geographic distribution of
information: Twitter includes geolocation in its metadata, a feature most other social
media sites do not have. Our research findings, however, show there is good reason to consider
other sources like Reddit, which also provides a large user base and which, unlike Twitter, allows
the researcher to restrict by topic without restricting by keyword.
Background
Why should researchers care to learn drug terms for social media research beyond those that they
already know? What can be gained by discovering new terms for a substance? First, as stated, if
