Metadata and Linked Data in Word Sense Disambiguation

Author: Corsmeier, Matthew


Word Sense Disambiguation (WSD) is referred to as an "AI-complete" problem (Mallery, 1998), i.e., a task that is relatively easy for people, but considerably more difficult for machines. If someone makes a query for a polysemous word (e.g., "plant," "bass," "mercury," etc.), how is an information retrieval system to understand which sense of the word is intended? There exist tried-and-tested methods, such as simply using the most predominant sense of the word (McCarthy, Koeling, Weeds, & Carroll, 2004), or looking at the words next to the query term to determine the statistically most likely meaning (Jurafsky & Martin, 2009; Manning & Schutze, 1999), but these methods often produce less-than-satisfactory results [often around 70% accuracy] (Navigli, 2009). Furthermore, these methods have been heavily dependent on the manual creation of knowledge sources (Edmonds, 2000), which are expensive to create and subject to change, thus creating what is termed a knowledge acquisition bottleneck (Gale, Church, & Yarowsky, 1992). Linked Data technologies (Berners-Lee, 2006), however, allow us to utilize existing ontologies and lexica, which can then be exploited to improve the automatic semantic understanding of the word. This paper will examine several systems that purport to disambiguate words by using Linked Data, and some of the models these systems use to ensure interoperability.
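
To make the two baseline strategies mentioned above concrete, the following is a minimal sketch in Python. It is not taken from any of the cited systems: the sense inventory, the frequency counts, and the cue words are all invented for illustration, and the context method is reduced to simple word overlap rather than a trained statistical model.

```python
# Hypothetical toy sense inventory: sense id -> (corpus frequency, cue words).
# All counts and cue words are invented for illustration only.
SENSES = {
    "bass": {
        "bass/fish": (30, {"fishing", "lake", "caught", "river"}),
        "bass/music": (70, {"guitar", "band", "play", "sound"}),
    },
}


def predominant_sense(word):
    """Baseline 1: always return the most frequent sense of the word."""
    senses = SENSES[word]
    return max(senses, key=lambda sense: senses[sense][0])


def context_sense(word, context):
    """Baseline 2: choose the sense whose cue words overlap the context most.

    Falls back to the predominant sense when no cue word appears.
    """
    tokens = set(context.lower().split())
    senses = SENSES[word]
    scores = {sense: len(cues & tokens) for sense, (_, cues) in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else predominant_sense(word)
```

Under these assumptions, a bare query for "bass" resolves to the musical sense because it is the more frequent one, while a query such as "he caught a bass in the lake" resolves to the fish sense. The sketch also shows where the knowledge acquisition bottleneck arises: every entry in the inventory has to come from somewhere.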

Literature Review

The most complete treatment of the subject of WSD is arguably Agirre & Edmonds [Eds.] (2007), which presents a detailed definition of the problem, along with its history and numerous algorithms that are used in practice. Kwong (2013) offers slightly more recent coverage, along with predictions as to how WSD methods will evolve in the near future. Generalists might find sufficient the survey from Navigli (2009), or the chapters covering WSD in either Jurafsky & Martin (2009) or Manning & Schutze (1999). SemEval [which was originally named Senseval (Kilgarriff, 1998)] is an ongoing evaluation project which is used as a baseline to assess various WSD methods, including many which will be examined in this paper.

Linked Linguistic Open Data (LLOD) is heavily dependent on metadata, and any consideration thereof would require an examination of its standards. A brief history of the topic of linguistic annotation can be found in Palmer & Xue (2013). Bird & Simons (2003a) and Ide, Romary, & de la Clergerie (2004) proposed sets of best practices for linguistic annotations, while Simons, Bird, & Spanne (2008) offered a more recent set of recommendations that specifically suggested language codes from ISO 639-3 be used in metadata. Ide & Pustejovsky (2010) suggested a list of best practices for language technology metadata, focusing heavily on the work of OLAC and the European Language Resources Association (ELRA). Gracia, Montiel-Ponsoda, Cimiano, Gomez-Perez, Buitelaar, & McCrae (2012) considered the issue of Linked Data being stored in different languages, and suggested that techniques such as ontology localization, ontology mapping, and cross-lingual ontology-based information access and presentation would help prevent information from being locked up in linguistic data silos. Gayo, Kontokostas, & Auer (2013) presented a set of best practices for multilingual linked open data, and pointed out that SPARQL queries can be improved if tags are identified by language. Reviews of specific linguistic annotation schemes include: the Open Language Archives Community [OLAC] metadata set (Bird & Simons, 2003b); the General Ontology for Linguistic Description [GOLD] (Farrar & Langendoen, 2003); ISOcat, a Data Category Registry (DCR) for the ISO TC 37 (terminology and other language and content resources) registry (Kemps-Snijders, Windhouwer, Wittenburg, & Wright, 2009); the ISO/TC 37/SC 4 standard (Lee & Romary, 2010); the lemon (LExicon Model for ONtologies) model (McCrae, Aguado-de-Cea, Buitelaar, Cimiano, Declerck, Gomez-Perez, ... & Wunner, 2012); and Lexical Markup Framework [LMF] (Francopoulo, 2013), which had a strong influence on the lemon model.
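
The point about language-identified tags can be illustrated with a small sketch. The Python below is only an analogue of what a SPARQL language filter does, not code from Gayo, Kontokostas, & Auer (2013); the `ex:` identifiers, the labels, and the helper name `labels_in` are hypothetical, and the tags follow the short form commonly seen on RDF literals rather than the three-letter ISO 639-3 codes.

```python
# Hypothetical labels as (subject, label text, language tag) records.
# In RDF these would be language-tagged literals such as "plant"@en.
LABELS = [
    ("ex:plant", "plant", "en"),
    ("ex:plant", "Pflanze", "de"),
    ("ex:plant", "plante", "fr"),
    ("ex:bass", "bass", "en"),
]

# A roughly equivalent SPARQL pattern (shown for comparison, not executed):
#   SELECT ?s ?label
#   WHERE { ?s rdfs:label ?label . FILTER (lang(?label) = "en") }


def labels_in(language, labels=LABELS):
    """Return (subject, text) pairs whose label carries the given language tag."""
    return [(subject, text) for subject, text, tag in labels if tag == language]
```

Without the tag, a query for labels of `ex:plant` would return all three strings and leave the consumer to guess which language each belongs to; with it, `labels_in("de")` yields only the German label.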

A number of papers detail projects that utilized the schemes listed above. Montiel-Ponsoda, Gracia del Rio, Aguado de Cea, & Gomez-Perez (2011) showed how the lemon model can be extended using a metamodel in OWL, which would allow translation to be represented on a separate layer. Buitelaar, Cimiano, Haase, & Sintek (2009) advocated using ontologies beyond those of RDFS, OWL, and SKOS, and presented a model called LexInfo, which combines aspects of older models. Chiarcos, Dipper, Gotze, Leser, Ludeling, Ritz, & Stede (2008) treated the Ontology of Linguistic Annotation, which is especially useful for corpora that have been annotated a number of different times using a number of different methods.

Two of the most commonly used linguistic tools on the Semantic Web are the general-purpose lexical ontologies WordNet (Fellbaum, 1998) and FrameNet (Baker, Fillmore, & Lowe, 1998). Although Ide (2014) argued that FrameNet was the "ideal resource for representation as linked data" (p. 18), the majority of the projects covered later in this paper utilized WordNet, and thus this tool will be examined in more detail. Both FrameNet and WordNet are often used in Linked Data projects to automatically annotate texts with semantic metadata. Projects that have used these databases include Huang (2007), wherein WordNet files were converted to be presented in OWL to assist in machine comprehension of metaphor; and BabelNet (Navigli, 2012), a resource which will be reviewed later in this paper. Ehrmann, Cecconi, Vannella, McCrae, Cimiano, & Navigli (2014) converted BabelNet into Linked Data via the lemon model; and Moro, Navigli, Tucci, & Passonneau (2014) used BabelNet to automatically annotate the Manually Annotated Sub-Corpus 3.0 (MASC) and were thereby able to perform automatic WSD with an accuracy of 70%, an impressive figure, but still too low to see much practical adoption.

Other examples of linguistic tools used with Semantic Web technologies include Krizhanovsky & Smirnov (2013), wherein Wiktionary was utilized to automatically create a general-purpose lexical ontology; Hellmann, Brekle, & Auer (2013), who described a similar project wherein Wiktionary extractors made use of DBpedia to create RDF triples; de Melo (2014a), who introduced a system which automatically creates URIs for each word and sense, thus guaranteeing a constant reference; Mendes, Jakob, Garcia-Silva, & Bizer (2011), who introduced DBpedia Spotlight, an open-source program that automatically annotates texts with links to the Linked Open Data cloud by using the URIs in DBpedia; and Serasset (2014), who described the extraction of multilingual lexical data from Wiktionary, the importation thereof into DBNary, and the final conversion of the data into MLLOD (Multilingual Lexical Linked Open Data) via the lemon model.

A number of very different methods of using Semantic Web technologies to disambiguate word senses have been attempted and analyzed. Elbedweihy, Wrigley, Ciravegna, & Zhang (2013) used a combination of WordNet, BabelNet, and Wikipedia to help generate SPARQL queries, which would subsequently resolve ambiguities in the original queries with a success rate of 76%; Fragos (2013) also used WordNet (in this case, its extended glosses) to train WSD systems; McCarthy et al. (2004) used the WordNet similarity package and raw textual corpora to perform WSD by using the predominant sense of the word, a method which achieved a success rate of 64%; and Ide (2006) treated the problem of polysemy by mapping FrameNet sets to WordNet.

Some case study reviews show the strengths and weaknesses of more general models: Haase (2004) looked at tags for digital images to argue that semantic metadata can help alleviate some of the issues of precision caused by selecting overly narrow terms; de Melo & Weikum (2008) argued that "language-related knowledge" forms the backbone of the semantic web, and presented ways in which linguistic items such as languages, scripts, and terms can unambiguously be linked with URIs, from which new links can automatically be formed; and Tagarelli, Longo, & Greco (2009) showed how notions of sense relatedness can be calculated by examining overlaps between dictionary glosses and measuring distances for ontology paths.
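
The two relatedness signals just described, gloss overlap and ontology path distance, can be sketched as follows. This is an illustrative simplification and not the method of Tagarelli, Longo, & Greco (2009): the glosses, the hypernym links, the stopword list, and the function names are all invented for the example.

```python
from collections import deque

# Hypothetical glosses and hypernym links, invented for illustration only.
GLOSSES = {
    "plant/flora": "a living organism that grows in soil and has leaves",
    "plant/factory": "a building where industrial goods are manufactured",
    "tree": "a tall living organism with a trunk, branches and leaves",
}
STOPWORDS = {"a", "and", "in", "that", "with", "has", "are", "where"}
HYPERNYMS = {
    "tree": ["plant/flora"],
    "plant/flora": ["organism"],
    "plant/factory": ["building"],
    "organism": ["entity"],
    "building": ["entity"],
}


def content_words(sense):
    """Lowercased gloss words with punctuation and stopwords removed."""
    words = {w.strip(",.") for w in GLOSSES[sense].lower().split()}
    return words - STOPWORDS


def gloss_overlap(a, b):
    """Number of content words shared by the two glosses."""
    return len(content_words(a) & content_words(b))


def path_length(a, b):
    """Shortest number of hypernym edges between two nodes, or None."""
    graph = {}
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search over the undirected graph
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

With this toy data, "tree" shares three content words with the botanical sense of "plant" and none with the industrial sense, and it sits one edge away from the former but five edges from the latter, so both signals point to the same sense.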

The rest of the paper will cover in more detail several...
