Recently, an economy of data and service providers has evolved around Points of Interest (POI). This includes map-centric applications such as Google Maps, local directory services such as Yelp, several location-based social networks, e.g., Foursquare, as well as numerous spatially enabled sharing services such as Path or Flickr. Each of these services specializes in certain kinds of place-related information. Additionally, while they collect user-generated content, access to this content is restricted by APIs, frequency limitations (e.g., 1000 queries per day), and storage restrictions (e.g., the queried data has to be deleted within 24 hours). As user data and profiles are the key assets of these companies, they have to carefully balance openness and interlinkage to other platforms with their own interest in walling off the data.
From a research perspective, however, the combination of these data sources would be desirable for multiple reasons. First, by conflating POI, we can exploit complementary attributes to arrive at a more holistic understanding of places. For instance, one can combine user reviews from different communities to study sentiment, compare the place categorization hierarchies and match them using ontology alignment techniques, mine check-in behavior for patterns, compare pictures from tourists versus locals, and so on. In fact, this variety aspect is one of the key value propositions of Big Data and semantic interoperability in general. Second, we can increase data quality by comparing the same attributes across data sets. One potential application would be to remove typos in place names contributed by volunteers. Finally, combining the data sources would also increase their coverage.
The process of conflating POI can be divided into two steps. First, identity has to be established between them. That is, it has to be determined whether both information entities correspond to the same place in the physical world. We refer to this part as matching throughout the paper. To do so, one would usually compare the values of attributes common to both data sets using a particular similarity measure. For example, if two data sets contain a name attribute for their POI, the Levenshtein distance can be used to match them. Comparing names alone, however, will only work for certain cases. Thus, other attributes such as geographic locations and tags will be compared using appropriate measures as well. In practice, these measures will rarely return exact matches, and we have to combine them and define a matching threshold. This, of course, assumes that we can successfully match the attributes in the first place, i.e., establish that place_name in one data set describes the same attribute as POIname in another one. Strictly speaking, it also requires an understanding of what it means for places to be equal. If an established restaurant moves to another building, does it become a new place even though we still unambiguously refer to it using the same name? Similarly, is an old movie theater that becomes a night club known by the same name, and preserving the theater's original ambiance, still the same place?
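The name-matching idea described above can be illustrated with a minimal sketch. It assumes a normalized Levenshtein similarity in [0, 1] and an illustrative threshold of 0.8; the function names, the lowercasing step, and the threshold value are assumptions made for this example and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_name_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept if the similarity reaches the threshold."""
    return name_similarity(a, b) >= threshold
```

With these assumptions, `is_name_match("Joe's Pizza", "Joes Pizza")` accepts the pair because only one edit separates the two strings, while two unrelated names fall well below the threshold. Such a rule says nothing about the harder question of place identity raised above.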
In the second step, the attributes of the involved POI have to be conflated. For example, while a place may have multiple names and one can be chosen to be canonical, this is not feasible for geographic locations. Understanding how to proceed with different attributes is an ontological question. For instance, a particular POI definition may only allow for one place category such as Restaurant (see Equation (1)). Consequently, in this example, conflating data may require one to use the least upper bound of a hierarchy of place types or any other method that meaningfully reduces the set of types to one.
$$\mathrm{POI} \sqsubseteq \mathrm{Place} \sqcap \exists\,\mathrm{hasName}.\mathrm{Name} \sqcap ({=}\,1\ \mathrm{hasCategory}.\mathrm{Type}) \sqcap \ldots \tag{1}$$
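To make the least-upper-bound idea concrete, the following sketch reduces a set of place types to a single type by finding their most specific common ancestor. The hierarchy is a small hypothetical example and is assumed to be tree-shaped (each type has at most one parent); it does not reproduce the category systems of Foursquare or Yelp.

```python
# Hypothetical place-type hierarchy (child -> parent), for illustration only.
PARENT = {
    "Pizzeria": "Restaurant",
    "SushiBar": "Restaurant",
    "Restaurant": "FoodVenue",
    "Cafe": "FoodVenue",
    "FoodVenue": "Place",
    "NightClub": "NightlifeVenue",
    "NightlifeVenue": "Place",
}


def ancestors(place_type):
    """Return the type followed by its ancestors, most specific first."""
    chain = [place_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_upper_bound(types):
    """Most specific type that subsumes every type in `types`."""
    types = list(types)
    if not types:
        raise ValueError("at least one type is required")
    common = set(ancestors(types[0]))
    for place_type in types[1:]:
        common &= set(ancestors(place_type))
    # Walk upward from the first type; the first shared ancestor is the bound.
    for candidate in ancestors(types[0]):
        if candidate in common:
            return candidate
    raise ValueError("types share no common ancestor")
```

For example, `least_upper_bound(["Pizzeria", "SushiBar"])` yields `"Restaurant"`, which satisfies the cardinality restriction in Equation (1) at the cost of some specificity. In a hierarchy with multiple inheritance the bound need not be unique, so a different reduction strategy would be needed.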
In addition to this ontological perspective, it is also beneficial to understand the process by which attribute values are recorded as well as the resulting types of errors. For example, one could naively assume that because POI locations from location-based social networks (LBSN) are recorded by GPS positioning via smartphones, they may be inaccurate to about 5-30 meters and averaging positions from two LBSN would improve accuracy. We will later discuss why this is not the case.
In this work we will focus on the first step of conflation and show how to match POI from the LBSN Foursquare and the local directory service Yelp. Foursquare specializes in user check-ins and, thus, social and temporal aspects. While it also provides user tips, those are typically short personal statements. In contrast, Yelp focuses on detailed user reviews and a wide range of semi-structured place attributes such as the ambience, prices, noise level, and wifi availability. This makes conflating POI based on both data sets attractive; see Figure 1 for a comparison. For example, this would enable queries for places visited by friends, that have a low noise level, friendly staff and free wifi. Our more immediate interest, however, lies in exploiting the conflated POI to improve user similarity measures for the analysis of sparse semantic trajectories (McKenzie, Adams, and Janowicz 2013).
The contributions of this work are as follows:
* Intuitively, one may assume that both data sets have well-curated and canonical place names for their POI. Consequently, a syntactic string measure such as Levenshtein distance should be a strong matcher. We will test whether it can successfully match at least 80% of our sample data. It is important to keep in mind that for an automatic matcher a success rate of 80-90% is not sufficient. In our case, given the >30 million POI in the United States alone, at least 3 million POI would still have to be corrected and matched manually.
* Following the Pareto principle (Reed 2001), we assume that matching the remaining (less than) 20% of POI will require a weighted combination of matchers that exploit additional POI attributes. First, we will investigate whether an alternative place name matcher can improve our previous results. To do so, we will use Double Metaphone to match for phonetic similarity. Next, we will introduce matchers based on place categories, textual user reviews, as well as geographic distance, and evaluate their performance. To the best of our knowledge, phonetic and user review-based matchers have not been used in the literature before.
* Subsequently, we will use binomial probit regression to arrive at a weighted combination of all matchers and evaluate the results against an ordinal weight combination and an unweighted baseline.
* For the development and evaluation of our matching strategy we selected a subset of all POI so that for each randomly selected Foursquare POI there exists a true positive matching POI in Yelp. However, this is not always the case, especially if we also take other platforms such as OpenStreetMap, Google Places, Yahoo Local, etc., into account. Thus, we will also investigate what match score should be used for the automatic on-the-fly matching of POI from different sources. While essential for matching noisy data like VGI on the Web, the challenging topic of arriving at robust match scores has not been discussed in the literature before.
* Given the restrictions of the APIs used, we outline how to use Linked Data to preserve the match results. This makes it possible to conflate the data on-the-fly in the future without violating the terms of use.
* Finally, we will discuss some interesting insights made during our work. For instance, we will try to explain why the geographic coordinates of POI clearly differ between Yelp and Foursquare.
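As a rough illustration of how several matchers might be combined, the sketch below computes a great-circle distance, converts it into a score in [0, 1], and forms a weighted mean of per-attribute scores. The function names, the 500 m cutoff, and the weights in the usage note are placeholders chosen for this example; they are not the weights the probit regression would produce, and the linear distance decay is an assumption.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def distance_score(meters, cutoff_m=500.0):
    """Score of 1.0 at 0 m, falling linearly to 0.0 at the (assumed) cutoff."""
    return max(0.0, 1.0 - meters / cutoff_m)


def combined_score(scores, weights):
    """Weighted mean of per-attribute scores, each expected to lie in [0, 1]."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total
```

With placeholder weights `{"name": 3.0, "distance": 1.0}` and scores `{"name": 1.0, "distance": 0.5}`, `combined_score` returns 0.875, whereas equal weights (the unweighted baseline) give 0.75. Comparing the combined score against a match threshold then yields the accept or reject decision.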
The remainder of this paper is divided into four sections. First, we will give a brief overview of related work. Next, we will introduce the POI attributes, the matching methodology, and the data sets. We then evaluate our work and compare the performance of the independent measures as well as their combination. Finally, we present conclusions, observations, and directions for further research.
The matching and conflation of geographic data sets has a long history in the field of geographic information science. Work in this area has typically divided conflation and matching patterns by the attributes present and the end goal of the research being conducted. Historically, two related areas of research have emerged, one focusing on the geometric or geographic properties of the data (Chen, Knoblock, and Shahabi 2006; Devogele 2002; Haunert 2005; Li and Goodchild 2011) and another centered on the descriptive attributes (Hastings 2008). Adams et al. (2010) proposed a general framework for conflation that combines geometric and other attributes. Likewise, work by Sehgal, Getoor, and Viechnicki (2006) integrated spatial and non-spatial components (names, types, and demographic information) of geospatial locations with the goal of consolidating a collection of true locations. Recent work on online social networking applications has proposed matching Qype and Facebook Places to OpenStreetMap POI through geographic distance and name matching (Scheffler, Schirru, and Lehmann 2012). While related in nature, the number of attributes accessible in that work was limited, restricting the ability of the researchers to explore a weighted approach.
Matching attributes of spatial data can take any number of forms, though it usually involves some level of place name matching. The concept of name matching has been investigated in several different computationally focused fields. Hundreds of methods have been developed for analyzing text and assessing similarities between strings for duplication detection (Bilenko and Mooney 2003; Elmagarmid, Ipeirotis, and Verykios 2007; Lait and Randell 1996), language translation (Freeman, Condon, and Ackerman 2006), and information retrieval (Cohen, Ravikumar, and Fienberg 2003; Jones and...