Similarity of Semantic Relations

Peter D. Turney
Institute for Information Technology, National Research Council Canada, Ottawa, Canada, K1A 0R6

There are at least two kinds of semantic similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM.

1. Introduction

There are at least two kinds of semantic similarity. Attributional similarity is correspondence between attributes and relational similarity is correspondence between relations (Medin, Goldstone, and Gentner, 1990). When two words have a high degree of attributional similarity, we call them synonyms. When two word pairs have a high degree of relational similarity, we say they are analogous. Verbal analogies are often written in the form A:B::C:D, meaning A is to B as C is to D; for example, traffic:street::water:riverbed. Traffic flows over a street; water flows over a riverbed. A street carries traffic; a riverbed carries water. There is a high degree of relational similarity between the word pair traffic:street and the word pair water:riverbed. In fact, this analogy is the basis of several mathematical theories of traffic flow (Daganzo, 1994).

In Section 2, we look more closely at the connections between attributional and relational similarity. In analogies such as mason:stone::carpenter:wood, it seems that relational similarity can be reduced to attributional similarity, since mason and carpenter are attributionally similar, as are stone and wood. In general, this reduction fails. Consider the analogy traffic:street::water:riverbed. Traffic and water are not attributionally similar. Street and riverbed are only moderately attributionally similar.

Many algorithms have been proposed for measuring the attributional similarity between two words (Lesk, 1969; Resnik, 1995; Landauer and Dumais, 1997; Jiang and Conrath, 1997; Lin, 1998b; Turney, 2001; Budanitsky and Hirst, 2001; Banerjee and Pedersen, 2003).
Measures of attributional similarity have been studied extensively, due to their applications in problems such as recognizing synonyms (Landauer and Dumais, 1997), information retrieval (Deerwester et al., 1990), determining semantic orientation (Turney, 2002), grading student essays (Rehder et al., 1998), measuring textual cohesion (Morris and Hirst, 1991), and word sense disambiguation (Lesk, 1986). On the other hand, since measures of relational similarity are not as well developed as measures of attributional similarity, the potential applications of relational similarity are not as well known. Many problems that involve semantic relations would benefit from an algorithm for measuring relational similarity. We discuss related problems in natural language processing, information retrieval, and information extraction in more detail in Section 3.

This paper builds on the Vector Space Model (VSM) of information retrieval. Given a query, a search engine produces a ranked list of documents. The documents are ranked in order of decreasing attributional similarity between the query and each document. Almost all modern search engines measure attributional similarity using the VSM (Baeza-Yates and Ribeiro-Neto, 1999). Turney and Littman (2005) adapt the VSM approach to measuring relational similarity. They used a vector of frequencies of patterns in a corpus to represent the relation between a pair of words. Section 4 presents the VSM approach to measuring similarity.

In Section 5, we present an algorithm for measuring relational similarity, which we call Latent Relational Analysis (LRA). The algorithm learns from a large corpus of unlabeled, unstructured text, without supervision. LRA extends the VSM approach of Turney and Littman (2005) in three ways: (1) The connecting patterns are derived automatically from the corpus, instead of using a fixed set of patterns. (2) Singular Value Decomposition (SVD) is used to smooth the frequency data. (3) Given a word pair such as traffic:street, LRA considers transformations of the word pair, generated by replacing one of the words by synonyms, such as traffic:road, traffic:highway.

Section 6 presents our experimental evaluation of LRA with a collection of 374 multiple-choice word analogy questions from the SAT college entrance exam.1 An example of a typical SAT question appears in Table 1. In the educational testing literature, the first pair (mason:stone) is called the stem of the analogy. The correct choice is called the solution and the incorrect choices are distractors. We evaluate LRA by testing its ability to select the solution and avoid the distractors. The average performance of college-bound senior high school students on verbal SAT questions corresponds to an accuracy of about 57%. LRA achieves an accuracy of about 56%. On these same questions, the VSM attained 47%.

One application for relational similarity is classifying semantic relations in noun-modifier pairs (Turney and Littman, 2005). In Section 7, we evaluate the performance of LRA with a set of 600 noun-modifier pairs from Nastase and Szpakowicz (2003). The problem is to classify a noun-modifier pair, such as "laser printer", according to the semantic relation between the head noun (printer) and the modifier (laser). The 600 pairs have been manually labeled with 30 classes of semantic relations. For example, "laser printer" is classified as instrument; the printer uses the laser as an instrument for printing.
We approach the task of classifying semantic relations in noun-modifier pairs as a supervised learning problem. The 600 pairs are divided into training and testing sets and a testing pair is classified according to the label of its single nearest neighbour in the training set. LRA is used to measure distance (i.e., similarity, nearness). LRA achieves an accuracy of 39.8% on the 30-class problem and 58.0% on the 5-class problem. On the same 600 noun-modifier pairs, the VSM had accuracies of 27.8% (30-class) and 45.7% (5-class) (Turney and Littman, 2005).

Footnote 1: The College Board eliminated analogies from the SAT in 2005, apparently because it was believed that analogy questions discriminate against minorities, although it has been argued by liberals (Goldenberg, 2005) that dropping analogy questions has increased discrimination against minorities and by conservatives (Kurtz, 2002) that it has decreased academic standards. Analogy questions remain an important component in many other tests, such as the GRE.

Table 1
An example of a typical SAT question, from the collection of 374 questions.

Stem:          mason:stone
Choices:   (a) teacher:chalk
           (b) carpenter:wood
           (c) soldier:gun
           (d) photograph:camera
           (e) book:word
Solution:  (b) carpenter:wood

We discuss the experimental results, limitations of LRA, and future work in Section 8 and we conclude in Section 9.

2. Attributional and Relational Similarity

In this section, we explore connections between attributional and relational similarity.

2.1 Types of Similarity

Medin, Goldstone, and Gentner (1990) distinguish attributes and relations as follows: Attributes are predicates taking one argument (e.g., X is red, X is large), whereas relations are predicates taking two or more arguments (e.g., X collides with Y, X is larger than Y). Attributes are used to state properties of objects; relations express relations between objects or propositions. Gentner (1983) notes that what counts as an attribute or a relation can depend on the context. For example, large can be viewed as an attribute of X, LARGE(X), or a relation between X and some standard Y, LARGER_THAN(X, Y).

The amount of attributional similarity between two words, A and B, depends on the degree of correspondence between the properties of A and B. A measure of attributional similarity is a function that maps two words, A and B, to a real number, sim_a(A, B). The more correspondence there is between the properties of A and B, the greater their attributional similarity. For example, dog and wolf have a relatively high degree of attributional similarity.

The amount of relational similarity between two pairs of words, A:B and C:D, depends on the degree of correspondence between the relations between A and B and the relations between C and D. A measure of relational similarity is a function that maps two pairs, A:B and C:D, to a real number, sim_r(A:B, C:D). The more correspondence there is between the relations of A:B and C:D, the greater their relational similarity. For example, dog:bark and cat:meow have a relatively high degree of relational similarity.

Cognitive scientists distinguish words that are semantically associated (bee-honey) from words that are semantically similar (deer-pony), although they recognize that some words are both associated and similar (doctor-nurse) (Chiarello et al., 1990).
Both of these are types of attributional similarity, since they are based on correspondence between attributes (e.g., bees and honey are both found in hives; deer and ponies are both mammals).

Budanitsky and Hirst (2001) describe semantic relatedness as follows:

Recent research on the topic in computational linguistics has emphasized the perspective of semantic relatedness of two lexemes in a lexical resource, or its inverse, semantic distance. It's important to note that semantic relatedness is a more general concept than similarity; similar entities are usually assumed to be related by virtue of their likeness (bank-trust company), but dissimilar entities may also be semantically related by lexical relationships such as meronymy (car-wheel) and antonymy (hot-cold), or just by any kind of functional relationship or frequent association (pencil-paper, penguin-Antarctica).

As these examples show, semantic relatedness is the same as attributional similarity (e.g., hot and cold are both kinds of temperature, pencil and paper are both used for writing). Here we prefer to use the term attributional similarity, because it emphasizes the contrast with relational similarity. The term semantic relatedness may lead to confusion when the term relational similarity is also under discussion.

Resnik (1995) describes semantic similarity as follows:

Semantic similarity represents a special case of semantic relatedness: for example, cars and gasoline would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar. Rada et al. (1989) suggest that the assessment of similarity in semantic networks can in fact be thought of as involving just taxonomic (IS-A) links, to the exclusion of other link types; that view will also be taken here, although admittedly it excludes some potentially useful information.

Thus semantic similarity is a specific type of attributional similarity. The term semantic similarity is misleading, because it refers to a type of attributional similarity, yet relational similarity is not any less semantic than attributional similarity. To avoid confusion, we will use the terms attributional similarity and relational similarity, following Medin, Goldstone, and Gentner (1990). Instead of semantic similarity (Resnik, 1995) or semantically similar (Chiarello et al., 1990), we prefer the term taxonomical similarity, which we take to be a specific type of attributional similarity. We interpret synonymy as a high degree of attributional similarity. Analogy is a high degree of relational similarity.

2.2 Measuring Attributional Similarity

Algorithms for measuring attributional similarity can be lexicon-based (Lesk, 1986; Budanitsky and Hirst, 2001; Banerjee and Pedersen, 2003), corpus-based (Lesk, 1969; Landauer and Dumais, 1997; Lin, 1998a; Turney, 2001), or a hybrid of the two (Resnik, 1995; Jiang and Conrath, 1997; Turney et al., 2003). Intuitively, we might expect that lexicon-based algorithms would be better at capturing synonymy than corpus-based algorithms, since lexicons, such as WordNet, explicitly provide synonymy information that is only implicit in a corpus. However, experiments do not support this intuition.

Several algorithms have been evaluated using 80 multiple-choice synonym questions taken from the Test of English as a Foreign Language (TOEFL). An example of one of the 80 TOEFL questions appears in Table 2. Table 3 shows the best performance on the TOEFL questions for each type of attributional similarity algorithm.
Table 2
An example of a typical TOEFL question, from the collection of 80 questions.

Stem:          levied
Choices:   (a) imposed
           (b) believed
           (c) requested
           (d) correlated
Solution:  (a) imposed

Table 3
Performance of attributional similarity measures on the 80 TOEFL questions. (The average non-English US college applicant's performance is included in the bottom row, for comparison.)

Reference                        Description                      Percent Correct
Jarmasz and Szpakowicz (2003)    best lexicon-based algorithm     78.75%
Terra and Clarke (2003)          best corpus-based algorithm      81.25%
Turney et al. (2003)             best hybrid algorithm            97.50%
Landauer and Dumais (1997)       average human score              64.50%

The results support the claim that lexicon-based algorithms have no advantage over corpus-based algorithms for recognizing synonymy.

2.3 Using Attributional Similarity to Solve Analogies

We may distinguish near analogies (mason:stone::carpenter:wood) from far analogies (traffic:street::water:riverbed) (Gentner, 1983; Medin, Goldstone, and Gentner, 1990). In an analogy A:B::C:D, where there is a high degree of relational similarity between A:B and C:D, if there is also a high degree of attributional similarity between A and C, and between B and D, then A:B::C:D is a near analogy; otherwise, it is a far analogy.

It seems possible that SAT analogy questions might consist largely of near analogies, in which case they can be solved using attributional similarity measures. We could score each candidate analogy by the average of the attributional similarity, sim_a, between A and C and between B and D:

    sim_r(A:B, C:D) = (sim_a(A, C) + sim_a(B, D)) / 2    (1)

This kind of approach was used in two of the thirteen modules in Turney et al. (2003).

To evaluate this approach, we applied several measures of attributional similarity to our collection of 374 SAT questions. The performance of the algorithms was measured by precision, recall, and F, defined as follows:

    precision = number of correct guesses / total number of guesses    (2)

    recall = number of correct guesses / maximum possible number of correct guesses    (3)

    F = 2 x precision x recall / (precision + recall)    (4)

Table 4
Performance of attributional similarity measures on the 374 SAT questions. (The bottom two rows are not attributional similarity measures. They are included for comparison.)

Algorithm                     Type              Precision   Recall   F
Hirst and St-Onge (1998)      lexicon-based     34.9%       32.1%    33.4%
Jiang and Conrath (1997)      hybrid            29.8%       27.3%    28.5%
Leacock and Chodorow (1998)   lexicon-based     32.8%       31.3%    32.0%
Lin (1998b)                   hybrid            31.2%       27.3%    29.1%
Resnik (1995)                 hybrid            35.7%       33.2%    34.4%
Turney (2001)                 corpus-based      35.0%       35.0%    35.0%
Turney and Littman (2005)     relational (VSM)  47.7%       47.1%    47.4%
random                        random            20.0%       20.0%    20.0%

Note that recall is the same as percent correct (for multiple-choice questions, with only zero or one guesses allowed per question, but not in general). Table 4 shows the experimental results for our set of 374 analogy questions. For example, using the algorithm of Hirst and St-Onge (1998), 120 questions were answered correctly, 224 incorrectly, and 30 questions were skipped. When the algorithm assigned the same similarity to all of the choices for a given question, that question was skipped. The precision was 120/344 = 34.9% and the recall was 120/374 = 32.1%.
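The scoring rule in equation (1) and the evaluation measures in equations (2)-(4) are simple enough to state directly in code. The following is a minimal Python sketch; sim_a stands for any attributional similarity measure (it is a placeholder argument, not one of the cited algorithms), and the skip rule mirrors the tie-handling described above.

```python
def solve_with_attributional_similarity(stem, choices, sim_a):
    """Score each choice (C, D) against the stem (A, B) using equation (1):
    the average of sim_a(A, C) and sim_a(B, D). Returns the index of the
    best choice, or None when all choices are tied (the question is skipped)."""
    a, b = stem
    scores = [(sim_a(a, c) + sim_a(b, d)) / 2.0 for c, d in choices]
    if len(set(scores)) == 1:          # same score for every choice: skip
        return None
    return max(range(len(choices)), key=lambda i: scores[i])

def precision_recall_f(correct, guessed, total):
    """Equations (2)-(4): `guessed` excludes skipped questions."""
    precision = correct / guessed
    recall = correct / total
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# The figures reported for Hirst and St-Onge (1998):
# 120 correct out of 344 guesses, 374 questions in total.
print(precision_recall_f(120, 344, 374))   # approximately (0.349, 0.321, 0.334)
```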
The first five algorithms in Table 4 are implemented in Pedersen's WordNet-Similarity package.2 The sixth algorithm (Turney, 2001) used the Waterloo MultiText System, as described in Terra and Clarke (2003).

Footnote 2: See http://www.d.umn.edu/~tpederse/similarity.html.

The difference between the lowest performance (Jiang and Conrath, 1997) and random guessing is statistically significant with 95% confidence, according to the Fisher Exact Test (Agresti, 1990). However, the difference between the highest performance (Turney, 2001) and the VSM approach (Turney and Littman, 2005) is also statistically significant with 95% confidence. We conclude that there are enough near analogies in the 374 SAT questions for attributional similarity to perform better than random guessing, but not enough near analogies for attributional similarity to perform as well as relational similarity.

3. Related Work

This section is a brief survey of the many problems that involve semantic relations and could potentially make use of an algorithm for measuring relational similarity.

3.1 Recognizing Word Analogies

The problem of recognizing word analogies is, given a stem word pair and a finite list of choice word pairs, select the choice that is most analogous to the stem. This problem was first attempted by a system called Argus (Reitman, 1965), using a small hand-built semantic network. Argus could only solve the limited set of analogy questions that its programmer had anticipated. Argus was based on a spreading activation model and did not explicitly attempt to measure relational similarity.

Turney et al. (2003) combined 13 independent modules to answer SAT questions. The final output of the system was based on a weighted combination of the outputs of each individual module. The best of the 13 modules was the VSM, which is described in detail in Turney and Littman (2005). The VSM was evaluated on a set of 374 SAT questions, achieving a score of 47%.

In contrast with the corpus-based approach of Turney and Littman (2005), Veale (2004) applied a lexicon-based approach to the same 374 SAT questions, attaining a score of 43%. Veale evaluated the quality of a candidate analogy A:B::C:D by looking for paths in WordNet, joining A to B and C to D. The quality measure was based on the similarity between the A:B paths and the C:D paths.

Turney (2005) introduced Latent Relational Analysis (LRA), an enhanced version of the VSM approach, which reached 56% on the 374 SAT questions. Here we go beyond Turney (2005) by describing LRA in more detail, performing more extensive experiments, and analyzing the algorithm and related work in more depth.

3.2 Structure Mapping Theory

French (2002) cites Structure Mapping Theory (SMT) (Gentner, 1983) and its implementation in the Structure Mapping Engine (SME) (Falkenhainer, Forbus, and Gentner, 1989) as the most influential work on modeling of analogy-making. The goal of computational modeling of analogy-making is to understand how people form complex, structured analogies. SME takes representations of a source domain and a target domain, and produces an analogical mapping between the source and target. The domains are given structured propositional representations, using predicate logic. These descriptions include attributes, relations, and higher-order relations (expressing relations between relations). The analogical mapping connects source domain relations to target domain relations.
For example, there is an analogy between the solar system and Rutherford's model of the atom (Falkenhainer, Forbus, and Gentner, 1989). The solar system is the source domain and Rutherford's model of the atom is the target domain. The basic objects in the source model are the planets and the sun. The basic objects in the target model are the electrons and the nucleus. The planets and the sun have various attributes, such as mass(sun) and mass(planet), and various relations, such as revolve(planet, sun) and attracts(sun, planet). Likewise, the nucleus and the electrons have attributes, such as charge(electron) and charge(nucleus), and relations, such as revolve(electron, nucleus) and attracts(nucleus, electron). SME maps revolve(planet, sun) to revolve(electron, nucleus) and attracts(sun, planet) to attracts(nucleus, electron).

Each individual connection (e.g., from revolve(planet, sun) to revolve(electron, nucleus)) in an analogical mapping implies that the connected relations are similar; thus, SMT requires a measure of relational similarity, in order to form maps. Early versions of SME only mapped identical relations, but later versions of SME allowed similar, non-identical relations to match (Falkenhainer, 1990). However, the focus of research in analogy-making has been on the mapping process as a whole, rather than measuring the similarity between any two particular relations, hence the similarity measures used in SME at the level of individual connections are somewhat rudimentary. We believe that a more sophisticated measure of relational similarity, such as LRA, may enhance the performance of SME. Likewise, the focus of our work here is on the similarity between particular relations, and we ignore systematic mapping between sets of relations, so LRA may also be enhanced by integration with SME.

3.3 Metaphor

Metaphorical language is very common in our daily life; so common that we are usually unaware of it (Lakoff and Johnson, 1980). Gentner et al. (2001) argue that novel metaphors are understood using analogy, but conventional metaphors are simply recalled from memory. A conventional metaphor is a metaphor that has become entrenched in our language (Lakoff and Johnson, 1980). Dolan (1995) describes an algorithm that can recognize conventional metaphors, but is not suited to novel metaphors. This suggests that it may be fruitful to combine Dolan's (1995) algorithm for handling conventional metaphorical language with LRA and SME for handling novel metaphors.

Lakoff and Johnson (1980) give many examples of sentences in support of their claim that metaphorical language is ubiquitous. The metaphors in their sample sentences can be expressed using SAT-style verbal analogies of the form A:B::C:D.
The first column in Table 5 is a list of sentences from Lakoff and Johnson (1980) and the second column shows how the metaphor that is implicit in each sentence may be made explicit as a verbal analogy.

Table 5
Metaphorical sentences from Lakoff and Johnson (1980), rendered as SAT-style verbal analogies.

Metaphorical sentence                      SAT-style verbal analogy
He shot down all of my arguments.          aircraft:shoot down::argument:refute
I demolished his argument.                 building:demolish::argument:refute
You need to budget your time.              money:budget::time:schedule
I've invested a lot of time in her.        money:invest::time:allocate
My mind just isn't operating today.        machine:operate::mind:think
Life has cheated me.                       charlatan:cheat::life:disappoint
Inflation is eating up our profits.        animal:eat::inflation:reduce

3.4 Classifying Semantic Relations

The task of classifying semantic relations is to identify the relation between a pair of words. Often the pairs are restricted to noun-modifier pairs, but there are many interesting relations, such as antonymy, that do not occur in noun-modifier pairs. However, noun-modifier pairs are interesting due to their high frequency in English. For instance, WordNet 2.0 contains more than 26,000 noun-modifier pairs, although many common noun-modifiers are not in WordNet, especially technical terms.

Rosario and Hearst (2001) and Rosario, Hearst, and Fillmore (2002) classify noun-modifier relations in the medical domain, using MeSH (Medical Subject Headings) and UMLS (Unified Medical Language System) as lexical resources for representing each noun-modifier pair with a feature vector. They trained a neural network to distinguish 13 classes of semantic relations. Nastase and Szpakowicz (2003) explore a similar approach to classifying general noun-modifier pairs (i.e., not restricted to a particular domain, such as medicine), using WordNet and Roget's Thesaurus as lexical resources. Vanderwende (1994) used hand-built rules, together with a lexical knowledge base, to classify noun-modifier pairs. None of these approaches explicitly involved measuring relational similarity, but any classification of semantic relations necessarily employs some implicit notion of relational similarity, since members of the same class must be relationally similar to some extent.

Barker and Szpakowicz (1998) tried a corpus-based approach that explicitly used a measure of relational similarity, but their measure was based on literal matching, which limited its ability to generalize. Moldovan et al. (2004) also used a measure of relational similarity, based on mapping each noun and modifier into semantic classes in WordNet. The noun-modifier pairs were taken from a corpus and the surrounding context in the corpus was used in a word sense disambiguation algorithm, to improve the mapping of the noun and modifier into WordNet. Turney and Littman (2005) used the VSM (as a component in a single nearest neighbour learning algorithm) to measure relational similarity. We take the same approach here, substituting LRA for the VSM, in Section 7.

Lauer (1995) used a corpus-based approach (using the BNC) to paraphrase noun-modifier pairs, by inserting the prepositions of, for, in, at, on, from, with, and about. For example, reptile haven was paraphrased as haven for reptiles. Lapata and Keller (2004) achieved improved results on this task, by using the database of AltaVista's search engine as a corpus.

3.5 Word Sense Disambiguation

We believe that the intended sense of a polysemous word is determined by its semantic relations with the other words in the surrounding text. If we can identify the semantic relations between the given word and its context, then we can disambiguate the given word. Yarowsky's (1993) observation that collocations are almost always monosemous is evidence for this view. Federici, Montemagni, and Pirrelli (1997) present an analogy-based approach to word sense disambiguation.

For example, consider the word plant. Out of context, plant could refer to an industrial plant or a living organism. Suppose plant appears in some text near food.
A typical approach to disambiguating plant would compare the attributional similarity of food and industrial plant to the attributional similarity of food and living organism (Lesk, 1986; Banerjee and Pedersen, 2003). In this case, the decision may not be clear, since industrial plants often produce food and living organisms often serve as food. It would be very helpful to know the relation between food and plant in this example. In the phrase "food for the plant", the relation between food and plant strongly suggests that the plant is a living organism, since industrial plants do not need food. In the text "food at the plant", the relation strongly suggests that the plant is an industrial plant, since living organisms are not usually considered as locations. Thus an algorithm for classifying semantic relations (as in Section 7) should be helpful for word sense disambiguation.

3.6 Information Extraction

The problem of relation extraction is, given an input document and a specific relation R, extract all pairs of entities (if any) that have the relation R in the document. The problem was introduced as part of the Message Understanding Conferences (MUC) in 1998. Zelenko, Aone, and Richardella (2003) present a kernel method for extracting the relations person-affiliation and organization-location. For example, in the sentence "John Smith is the chief scientist of the Hardcom Corporation," there is a person-affiliation relation between "John Smith" and "Hardcom Corporation" (Zelenko, Aone, and Richardella, 2003). This is similar to the problem of classifying semantic relations (Section 3.4), except that information extraction focuses on the relation between a specific pair of entities in a specific document, rather than a general pair of words in general text. Therefore an algorithm for classifying semantic relations should be useful for information extraction.

In the VSM approach to classifying semantic relations (Turney and Littman, 2005), we would have a training set of labeled examples of the relation person-affiliation, for instance. Each example would be represented by a vector of pattern frequencies. Given a specific document discussing "John Smith" and "Hardcom Corporation", we could construct a vector representing the relation between these two entities, and then measure the relational similarity between this unlabeled vector and each of our labeled training vectors. It would seem that there is a problem here, because the training vectors would be relatively dense, since they would presumably be derived from a large corpus, but the new unlabeled vector for "John Smith" and "Hardcom Corporation" would be very sparse, since these entities might be mentioned only once in the given document. However, this is not a new problem for the Vector Space Model; it is the standard situation when the VSM is used for information retrieval. A query to a search engine is represented by a very sparse vector whereas a document is represented by a relatively dense vector. There are well-known techniques in information retrieval for coping with this disparity, such as weighting schemes for query vectors that are different from the weighting schemes for document vectors (Salton and Buckley, 1988).

3.7 Question Answering

In their paper on classifying semantic relations, Moldovan et al. (2004) suggest that an important application of their work is Question Answering.
As defined in the Text REtrieval Conference (TREC) Question Answering (QA) track, the task is to answer simple questions, such as "Where have nuclear incidents occurred?", by retrieving a relevant document from a large corpus and then extracting a short string from the document, such as "The Three Mile Island nuclear incident caused a DOE policy crisis." Moldovan et al. (2004) propose to map a given question to a semantic relation and then search for that relation in a corpus of semantically tagged text. They argue that the desired semantic relation can easily be inferred from the surface form of the question. A question of the form "Where ...?" is likely to be seeking for entities with a location relation and a question of the form "What did ... make?" is likely to be looking for entities with a product relation. In Section 7, we show how LRA can recognize relations such as location and product (see Table 19).

3.8 Automatic Thesaurus Generation

Hearst (1992) presents an algorithm for learning hyponym (type of) relations from a corpus and Berland and Charniak (1999) describe how to learn meronym (part of) relations from a corpus. These algorithms could be used to automatically generate a thesaurus or dictionary, but we would like to handle more relations than hyponymy and meronymy. WordNet distinguishes more than a dozen semantic relations between words (Fellbaum, 1998) and Nastase and Szpakowicz (2003) list 30 semantic relations for noun-modifier pairs. Hearst (1992) and Berland and Charniak (1999) use manually generated rules to mine text for semantic relations. Turney and Littman (2005) also use a manually generated set of 64 patterns. LRA does not use a predefined set of patterns; it learns patterns from a large corpus. Instead of manually generating new rules or patterns for each new semantic relation, it is possible to automatically learn a measure of relational similarity that can handle arbitrary semantic relations. A nearest neighbour algorithm can then use this relational similarity measure to learn to classify according to any set of classes of relations, given the appropriate labeled training data.

Girju, Badulescu, and Moldovan (2003) present an algorithm for learning meronym relations from a corpus. Like Hearst (1992) and Berland and Charniak (1999), they use manually generated rules to mine text for their desired relation. However, they supplement their manual rules with automatically learned constraints, to increase the precision of the rules.

3.9 Information Retrieval

Veale (2003) has developed an algorithm for recognizing certain types of word analogies, based on information in WordNet. He proposes to use the algorithm for analogical information retrieval. For example, the query "Muslim church" should return "mosque" and the query "Hindu bible" should return "the Vedas". The algorithm was designed with a focus on analogies of the form adjective:noun::adjective:noun, such as Christian:church::Muslim:mosque.

A measure of relational similarity is applicable to this task. Given a pair of words, A and B, the task is to return another pair of words, C and D, such that there is a high degree of relational similarity between the pair A:C and the pair D:B. For example, given A = "Muslim" and B = "church", return C = "mosque" and D = "Christian". (The pair Muslim:mosque has a high relational similarity to the pair Christian:church.)

Marx et al. (2002) developed an unsupervised algorithm for discovering analogies by clustering words from two different corpora.
Each cluster of words in one corpus is coupled one-to-one with a cluster in the other corpus. For example, one experiment used a corpus of Buddhist documents and a corpus of Christian documents. A cluster of words such as Hindu, Mahayana, Zen, ... from the Buddhist corpus was coupled with a cluster of words such as Catholic, Protestant, ... from the Christian corpus. Thus the algorithm appears to have discovered an analogical mapping between Buddhist schools and traditions and Christian schools and traditions. This is interesting work, but it is not directly applicable to SAT analogies, because it discovers analogies between clusters of words, rather than individual words.

3.10 Identifying Semantic Roles

A semantic frame for an event such as judgement contains semantic roles such as judge, evaluee, and reason, whereas an event such as statement contains roles such as speaker, addressee, and message (Gildea and Jurafsky, 2002). The task of identifying semantic roles is to label the parts of a sentence according to their semantic roles. We believe that it may be helpful to view semantic frames and their semantic roles as sets of semantic relations; thus a measure of relational similarity should help us to identify semantic roles. Moldovan et al. (2004) argue that semantic roles are merely a special case of semantic relations (Section 3.4), since semantic roles always involve verbs or predicates, but semantic relations can involve words of any part of speech.

4. The Vector Space Model

This section examines past work on measuring attributional and relational similarity using the Vector Space Model (VSM).

4.1 Measuring Attributional Similarity with the Vector Space Model

The VSM was first developed for information retrieval (Salton and McGill, 1983; Salton and Buckley, 1988; Salton, 1989) and it is at the core of most modern search engines (Baeza-Yates and Ribeiro-Neto, 1999). In the VSM approach to information retrieval, queries and documents are represented by vectors. Elements in these vectors are based on the frequencies of words in the corresponding queries and documents. The frequencies are usually transformed by various formulas and weights, tailored to improve the effectiveness of the search engine (Salton, 1989). The attributional similarity between a query and a document is measured by the cosine of the angle between their corresponding vectors. For a given query, the search engine sorts the matching documents in order of decreasing cosine.

The VSM approach has also been used to measure the attributional similarity of words (Lesk, 1969; Ruge, 1992; Pantel and Lin, 2002). Pantel and Lin (2002) clustered words according to their attributional similarity, as measured by a VSM. Their algorithm is able to discover the different senses of polysemous words, using unsupervised learning.

Latent Semantic Analysis enhances the VSM approach to information retrieval by using the Singular Value Decomposition (SVD) to smooth the vectors, which helps to handle noise and sparseness in the data (Deerwester et al., 1990; Dumais, 1993; Landauer and Dumais, 1997). SVD improves both document-query attributional similarity measures (Deerwester et al., 1990; Dumais, 1993) and word-word attributional similarity measures (Landauer and Dumais, 1997). LRA also uses SVD to smooth vectors, as we discuss in Section 5.
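The SVD smoothing idea can be sketched on a tiny term-document matrix. The following Python fragment (using numpy, with made-up counts) projects rows onto the top k singular vectors and compares them by cosine; it is an illustration of the general technique under toy assumptions, not the actual LSA implementation or the LRA pipeline described in Section 5.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [4, 0, 3, 0],   # "car"
    [3, 0, 4, 0],   # "automobile"
    [0, 5, 0, 2],   # "banana"
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine between "car" and "automobile" in the raw space.
print(cosine(X[0], X[1]))

# Smooth with a rank-k SVD and compare rows in the reduced space (U_k * Sigma_k).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]
print(cosine(reduced[0], reduced[1]))
```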
4.2 Measuring Relational Similarity with the Vector Space Model

Let R1 be the semantic relation (or set of relations) between a pair of words, A and B, and let R2 be the semantic relation (or set of relations) between another pair, C and D. We wish to measure the relational similarity between R1 and R2. The relations R1 and R2 are not given to us; our task is to infer these hidden (latent) relations and then compare them.

In the VSM approach to relational similarity (Turney and Littman, 2005), we create vectors, r1 and r2, that represent features of R1 and R2, and then measure the similarity of R1 and R2 by the cosine of the angle theta between r1 and r2:

    r1 = (x_1, x_2, ..., x_n),  r2 = (y_1, y_2, ..., y_n)    (5)

    cosine(theta) = (sum_i x_i y_i) / sqrt((sum_i x_i^2) (sum_i y_i^2)) = (r1 . r2) / (|r1| |r2|)    (6)

We create a vector, r, to characterize the relationship between two words, X and Y, by counting the frequencies of various short phrases containing X and Y. Turney and Littman (2005) use a list of 64 joining terms, such as "of", "for", and "to", to form 128 phrases that contain X and Y, such as "X of Y", "Y of X", "X for Y", "Y for X", "X to Y", and "Y to X". These phrases are then used as queries for a search engine and the number of hits (matching documents) is recorded for each query. This process yields a vector of 128 numbers. If the number of hits for a query is x, then the corresponding element in the vector is log(x + 1). Several authors report that the logarithmic transformation of frequencies improves cosine-based similarity measures (Salton and Buckley, 1988; Ruge, 1992; Lin, 1998b).

Turney and Littman (2005) evaluated the VSM approach by its performance on 374 SAT analogy questions, achieving a score of 47%. Since there are five choices for each question, the expected score for random guessing is 20%. To answer a multiple-choice analogy question, vectors are created for the stem pair and each choice pair, and then cosines are calculated for the angles between the stem pair and each choice pair. The best guess is the choice pair with the highest cosine. We use the same set of analogy questions to evaluate LRA in Section 6.

The VSM was also evaluated by its performance as a distance (nearness) measure in a supervised nearest neighbour classifier for noun-modifier semantic relations (Turney and Littman, 2005). The evaluation used 600 hand-labeled noun-modifier pairs from Nastase and Szpakowicz (2003). A testing pair is classified by searching for its single nearest neighbour in the labeled training data. The best guess is the label for the training pair with the highest cosine. LRA is evaluated with the same set of noun-modifier pairs in Section 7.
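As a concrete illustration, here is a minimal Python sketch of the VSM scoring procedure just described. The joining-term list is truncated and the corpus frequencies are supplied by a stub function argument, so this is a schematic of the method rather than a reproduction of Turney and Littman's (2005) system, which used the full 64 joining terms and AltaVista hit counts.

```python
import math

JOINING_TERMS = ["of", "for", "to", "in", "on", "with"]   # the real list has 64 terms

def pattern_vector(x, y, hits):
    """Build the VSM vector for the pair X:Y. `hits(phrase)` should return a
    corpus frequency for the phrase; each element is log(frequency + 1)."""
    vec = []
    for term in JOINING_TERMS:
        vec.append(math.log(hits(f"{x} {term} {y}") + 1))
        vec.append(math.log(hits(f"{y} {term} {x}") + 1))
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_question(stem, choices, hits):
    """Pick the choice pair whose vector has the highest cosine with the stem pair."""
    stem_vec = pattern_vector(*stem, hits)
    scores = [cosine(stem_vec, pattern_vector(*c, hits)) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```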
Turney and Littman (2005) used the AltaVista search engine to obtain the frequency information required to build vectors for the VSM. Thus their corpus was the set of all web pages indexed by AltaVista. At the time, the English subset of this corpus consisted of about 5 x 10^11 words. Around April 2004, AltaVista made substantial changes to their search engine, removing their advanced search operators. Their search engine no longer supports the asterisk operator, which was used by Turney and Littman (2005) for stemming and wild-card searching. AltaVista also changed their policy towards automated searching, which is now forbidden.3

Footnote 3: See http://www.altavista.com/robots.txt for AltaVista's current policy towards "robots" (software for automatically gathering web pages or issuing batches of queries). The protocol of the "robots.txt" file is explained in http://www.robotstxt.org/wc/robots.html.

Turney and Littman (2005) used AltaVista's hit count, which is the number of documents (web pages) matching a given query, but LRA uses the number of passages (strings) matching a query. In our experiments with LRA (Sections 6 and 7), we use a local copy of the Waterloo MultiText System (Clarke, Cormack, and Palmer, 1998; Terra and Clarke, 2003), running on a 16 CPU Beowulf Cluster, with a corpus of about 5 x 10^10 English words. The Waterloo MultiText System (WMTS) is a distributed (multiprocessor) search engine, designed primarily for passage retrieval (although document retrieval is possible, as a special case of passage retrieval). The text and index require approximately one terabyte of disk space. Although AltaVista only gives a rough estimate of the number of matching documents, the Waterloo MultiText System gives exact counts of the number of matching passages.

Turney et al. (2003) combine 13 independent modules to answer SAT questions. The performance of LRA significantly surpasses this combined system, but there is no real contest between these approaches, because we can simply add LRA to the combination, as a fourteenth module. Since the VSM module had the best performance of the thirteen modules (Turney et al., 2003), the following experiments focus on comparing VSM and LRA.

5. Latent Relational Analysis

LRA takes as input a set of word pairs and produces as output a measure of the relational similarity between any two of the input pairs. LRA relies on three resources, a search engine with a very large corpus of text, a broad-coverage thesaurus of synonyms, and an efficient implementation of SVD.

5.1 Input and Output

In our experiments, the input set contains from 600 to 2,244 word pairs. The output similarity measure is based on cosines, so the degree of similarity can range from -1 (dissimilar) to +1 (similar). Before applying SVD, the vectors are completely nonnegative, which implies that the cosine can only range from 0 to +1, but SVD introduces negative values, so it is possible for the cosine to be negative, although we have never observed this in our experiments.

5.2 Search Engine and Corpus

In the following experiments, we use a local copy of the Waterloo MultiText System (Clarke, Cormack, and Palmer, 1998; Terra and Clarke, 2003).4 The corpus consists of about 5 x 10^10 English words, gathered by a web crawler, mainly from US academic web sites. The web pages cover a very wide range of topics, styles, genres, quality, and writing skill. The WMTS is well suited to LRA, because the WMTS scales well to large corpora (one terabyte, in our case), it gives exact frequency counts (unlike most web search engines), it is designed for passage retrieval (rather than document retrieval), and it has a powerful query syntax.

Footnote 4: See http://multitext.uwaterloo.ca/.

5.3 Thesaurus

As a source of synonyms, we use Lin's (1998a) automatically generated thesaurus. This thesaurus is available through an online interactive demonstration or it can be downloaded.5 We used the online demonstration, since the downloadable version seems to contain fewer words.

Footnote 5: The online demonstration is at http://www.cs.ualberta.ca/~lindek/demos/depsim.htm and the downloadable version is at http://armena.cs.ualberta.ca/lindek/downloads/sims.lsp.gz.
For each word in the input set of word pairs, we automatically query the online demonstration and fetch the resulting list of synonyms. As a courtesy to other users of Lin's online system, we insert a 20 second delay between each query.

Lin's thesaurus was generated by parsing a large corpus of English text, consisting of text from the Wall Street Journal, San Jose Mercury, and AP Newswire (Lin, 1998a). The parser was used to extract pairs of words and their grammatical relations. Words were then clustered into synonym sets, based on the similarity of their grammatical relations. Two words were judged to be highly similar when they tended to have the same kinds of grammatical relations with the same sets of words.

Given a word and its part of speech, Lin's thesaurus provides a list of words, sorted in order of decreasing attributional similarity. This sorting is convenient for LRA, since it makes it possible to focus on words with higher attributional similarity and ignore the rest. WordNet, in contrast, given a word and its part of speech, provides a list of words grouped by the possible senses of the given word, with groups sorted by the frequencies of the senses. WordNet's sorting does not directly correspond to sorting by degree of attributional similarity, although various algorithms have been proposed for deriving attributional similarity from WordNet (Resnik, 1995; Jiang and Conrath, 1997; Budanitsky and Hirst, 2001; Banerjee and Pedersen, 2003).

5.4 Singular Value Decomposition

We use Rohde's SVDLIBC implementation of the Singular Value Decomposition, which is based on SVDPACKC (Berry, 1992).6 In LRA, SVD is used to reduce noise and compensate for sparseness.

Footnote 6: SVDLIBC is available at http://tedlab.mit.edu/~dr/SVDLIBC/ and SVDPACKC is available at http://www.netlib.org/svdpack/.

5.5 The Algorithm

We will go through each step of LRA, using an example to illustrate the steps. Assume that the input to LRA is the 374 multiple-choice SAT word analogy questions of Turney and Littman (2005). Since there are six word pairs per question (the stem and five choices), the input consists of 2,244 word pairs. Let's suppose that we wish to calculate the relational similarity between the pair quart:volume and the pair mile:distance, taken from the SAT question in Table 6.

Table 6
This SAT question, from Claman (2000), is used to illustrate the steps in the LRA algorithm.

Stem:          quart:volume
Choices:   (a) day:night
           (b) mile:distance
           (c) decade:century
           (d) friction:heat
           (e) part:whole
Solution:  (b) mile:distance

The LRA algorithm consists of the following twelve steps:

1. Find alternates: For each word pair A:B in the input set, look in Lin's (1998a) thesaurus for the top num_sim words (in the following experiments, num_sim is 10) that are most similar to A. For each A' that is similar to A, make a new word pair A':B. Likewise, look for the top num_sim words that are most similar to B, and for each B', make a new word pair A:B'. A:B is called the original pair and each A':B or A:B' is an alternate pair. The intent is that alternates should have almost the same semantic relations as the original. For each input pair, there will now be 2 x num_sim alternate pairs. When looking for similar words in Lin's (1998a) thesaurus, avoid words that seem unusual in any way (e.g., hyphenated words, words with three characters or less, words with non-alphabetical characters, multi-word phrases, and capitalized words). The first column in Table 7 shows the alternate pairs that are generated for the original pair quart:volume.
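A schematic of step 1 in Python. The thesaurus lookup is abstracted as a function argument, the filter for unusual words is simplified, and the parameter name num_sim follows the description above; this is an illustrative sketch, not the system's actual code.

```python
def looks_unusual(word):
    """Simplified version of the filter described in step 1."""
    return ("-" in word or len(word) <= 3 or not word.isalpha()
            or " " in word or word[0].isupper())

def find_alternates(pair, thesaurus, num_sim=10):
    """Step 1: generate alternate pairs A':B and A:B' from the top num_sim
    thesaurus neighbours of A and of B. `thesaurus(word)` is assumed to return
    a list of words sorted by decreasing attributional similarity to `word`."""
    a, b = pair
    similar_a = [w for w in thesaurus(a) if not looks_unusual(w)][:num_sim]
    similar_b = [w for w in thesaurus(b) if not looks_unusual(w)][:num_sim]
    return [(a2, b) for a2 in similar_a] + [(a, b2) for b2 in similar_b]
```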
2. Filter alternates: For each original pair A:B, filter the 2 x num_sim alternates as follows. For each alternate pair, send a query to the WMTS, to find the frequency of phrases that begin with one member of the pair and end with the other. The phrases cannot have more than max_phrase words (we use max_phrase = 5). Sort the alternate pairs by the frequency of their phrases. Select the top num_filter most frequent alternates and discard the remainder (we use num_filter = 3, so 17 alternates are dropped). This step tends to eliminate alternates that have no clear semantic relation. The third column in Table 7 shows the frequency with which each pair co-occurs in a window of max_phrase words. The last column in Table 7 shows the pairs that are selected.

3. Find phrases: For each pair (originals and alternates), make a list of phrases in the corpus that contain the pair. Query the WMTS for all phrases that begin with one member of the pair and end with the other (in either order). We ignore suffixes when searching for phrases that match a given pair. The phrases cannot have more than max_phrase words and there must be at least one word between the two members of the word pair. These phrases give us information about the semantic relations between the words in each pair. A phrase with no words between the two members of the word pair would give us very little information about the semantic relations (other than that the words occur together with a certain frequency in a certain order). Table 8 gives some examples of phrases in the corpus that match the pair quart:volume.

4. Find patterns: For each phrase found in the previous step, build patterns from the intervening words. A pattern is constructed by replacing any or all or none of the intervening words with wild cards (one wild card can only replace one word). If a phrase is n words long, there are n - 2 intervening words between the members of the given word pair (e.g., between quart and volume). Thus a phrase with n words generates 2^(n-2) patterns. (We use max_phrase = 5, so a phrase generates at most eight patterns.) For each pattern, count the number of pairs (originals and alternates) with phrases that match the pattern (a wild card must match exactly one word). Keep the top num_patterns most frequent patterns and discard the rest (we use num_patterns = 4,000). Typically there will be millions of patterns, so it is not feasible to keep them all.
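Step 4's wildcard expansion is easy to state in code. A minimal sketch follows; the "*" symbol and the counting helpers are illustrative conventions, and the real system ranks patterns over all pairs in the input set before keeping the most frequent ones.

```python
from itertools import product
from collections import Counter

def patterns_from_phrase(phrase_words):
    """Step 4: for a phrase such as 'quart ... volume', replace any subset of
    the intervening words with wildcards. A phrase of n words yields 2**(n-2) patterns."""
    first, *between, last = phrase_words
    pats = []
    for mask in product([False, True], repeat=len(between)):
        middle = ["*" if wild else word for word, wild in zip(between, mask)]
        pats.append(" ".join([first] + middle + [last]))
    return pats

print(patterns_from_phrase(["quart", "of", "spray", "volume"]))
# ['quart of spray volume', 'quart of * volume',
#  'quart * spray volume', 'quart * * volume']

def top_patterns(phrases_by_pair, num_patterns):
    """Count how many distinct pairs have at least one phrase matching each
    pattern, then keep the top num_patterns (4,000 in the experiments above)."""
    counts = Counter()
    for pair, phrases in phrases_by_pair.items():
        pats = {p for phrase in phrases for p in patterns_from_phrase(phrase)}
        counts.update(pats)            # each pair counted once per pattern
    return [p for p, _ in counts.most_common(num_patterns)]
```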
Table 7
Alternate forms of the original pair quart:volume. The first column shows the original pair and the alternate pairs. The second column shows Lin's similarity score for the alternate word compared to the original word. For example, the similarity between quart and pint is 0.209843. The third column shows the frequency of the pair in the WMTS corpus. The fourth column shows the pairs that pass the filtering step (i.e., step 2).

Word pair            Similarity   Frequency   Filtering step
quart:volume         NA           632         accept (original pair)
pint:volume          0.209843     372
gallon:volume        0.158739     1500        accept (top alternate)
liter:volume         0.122297     3323        accept (top alternate)
squirt:volume        0.0842603    54
pail:volume          0.083708     28
vial:volume          0.083708     373
pumping:volume       0.0734792    1386        accept (top alternate)
ounce:volume         0.0709759    430
spoonful:volume      0.0704245    42
tablespoon:volume    0.0685988    96
quart:turnover       0.228795     0
quart:output         0.224934     34
quart:export         0.206013     7
quart:value          0.203389     266
quart:import         0.185549     16
quart:revenue        0.184562     0
quart:sale           0.16854      119
quart:investment     0.160734     11
quart:earnings       0.156212     0
quart:profit         0.155507     24

Table 8
Some examples of phrases that contain quart:volume. Suffixes are ignored when searching for matching phrases in the WMTS corpus. At least one word must occur between quart and volume. At most max_phrase (five) words can appear in a phrase.

"quarts liquid volume"              "volume in quarts"
"quarts of volume"                  "volume capacity quarts"
"quarts in volume"                  "volume being about two quarts"
"quart total volume"                "volume of milk in quarts"
"quart of spray volume"             "volume include measures like quart"

5. Map pairs to rows: In preparation for building the matrix X, create a mapping of word pairs to row numbers. For each pair A:B, create a row for A:B and another row for B:A. This will make the matrix more symmetrical, reflecting our knowledge that the relational similarity between A:B and C:D should be the same as the relational similarity between B:A and D:C. This duplication of rows is examined in Section 6.6.

6. Map patterns to columns: Create a mapping of the top num_patterns patterns to column numbers. For each pattern P, create a column for "word1 P word2" and another column for "word2 P word1". Thus there will be 2 x num_patterns columns in X. This duplication of columns is examined in Section 6.6.

7. Generate a sparse matrix: Generate a matrix X in sparse matrix format, suitable for input to SVDLIBC. The value for the cell in row i and column j is the frequency of the j-th pattern (see step 6) in phrases that contain the i-th word pair (see step 5). Table 9 gives some examples of pattern frequencies for quart:volume.

Table 9
Frequencies of various patterns for quart:volume. The asterisk "*" represents the wildcard. Suffixes are ignored, so "quart" matches "quarts". For example, "quarts in volume" is one of the four phrases that match "quart P volume" when P is "in".

                          P = "in"   P = "* of"   P = "of *"   P = "* *"
freq("quart P volume")    4          1            5            19
freq("volume P quart")    10         0            2            16
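Steps 5-7 amount to building a pair-by-pattern count matrix. The sketch below uses scipy's sparse matrix classes as a stand-in for the SVDLIBC input format; the helper functions `phrases_for` and `matches` are assumptions introduced for illustration, not part of the original system.

```python
from scipy.sparse import lil_matrix

def build_matrix(pairs, patterns, phrases_for, matches):
    """Steps 5-7: one row per word pair in each order (A:B and B:A), one column
    per pattern (the pattern list is assumed to already contain both orders,
    "w1 P w2" and "w2 P w1"), and cell (i, j) counts the phrases of pair i that
    match pattern j. `phrases_for(pair)` returns the phrases found in step 3
    and `matches(phrase, pattern)` tests one phrase against one pattern."""
    row_of = {}
    for a, b in pairs:
        for ordered in ((a, b), (b, a)):
            row_of.setdefault(ordered, len(row_of))

    X = lil_matrix((len(row_of), len(patterns)))
    for pair, i in row_of.items():
        for phrase in phrases_for(pair):
            for j, pattern in enumerate(patterns):
                if matches(phrase, pattern):
                    X[i, j] += 1
    return X.tocsr(), row_of
```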
8. Calculate entropy: Apply log and entropy transformations to the sparse matrix X (Landauer and Dumais, 1997). These transformations have been found to be very helpful for information retrieval (Harman, 1986; Dumais, 1990). Let x_ij be the cell in row i and column j of the matrix X from step 7. Let m be the number of rows in X and let n be the number of columns. We wish to weight the cell x_ij by the entropy of the j-th column. To calculate the entropy of the column, we need to convert the column into a vector of probabilities. Let p_ij be the probability of x_ij, calculated by normalizing the column vector so that the sum of the elements is one, p_ij = x_ij / sum_{k=1..m} x_kj. The entropy of the j-th column is then H_j = - sum_{i=1..m} p_ij log(p_ij). Entropy is at its maximum when p_ij is a uniform distribution, p_ij = 1/m, in which case H_j = log(m). Entropy is at its minimum when p_ij is 1 for some value of i and 0 for all other values of i, in which case H_j = 0. We want to give more weight to columns (patterns) with frequencies that vary substantially from one row (word pair) to the next, and less weight to columns that are uniform. Therefore we weight the cell x_ij by w_j = 1 - H_j / log(m), which varies from 0 when p_ij is uniform to 1 when entropy is minimal. We also apply the log transformation to frequencies, log(x_ij + 1). (Entropy is calculated with the original frequency values, before the log transformation is applied.) For all i and all j, replace the original value x_ij in X by the new value w_j log(x_ij + 1). This is an instance of the TF-IDF (Term Frequency-Inverse Document Frequency) family of transformations, which is familiar in information retrieval (Salton and Buckley, 1988; Baeza-Yates and Ribeiro-Neto, 1999): log(x_ij + 1) is the TF term and w_j is the IDF term.

9. Apply SVD: After the log and entropy transformations have been applied to the matrix X, run SVDLIBC. SVD decomposes a matrix X into a product of three matrices, X = U S V^T, where U and V are in column orthonormal form (i.e., the columns are orthogonal and have unit length: U^T U = V^T V = I) and S is a diagonal matrix of singular values (hence SVD) (Golub and Van Loan, 1996). If X is of rank r, then S is also of rank r. Let S_k, where k < r, be the diagonal matrix formed from the top k singular values, and let U_k and V_k be the matrices produced by selecting the corresponding columns from U and V. The matrix U_k S_k V_k^T is the matrix of rank k that best approximates the original matrix X, in the sense that it minimizes the approximation errors. That is, U_k S_k V_k^T minimizes the Frobenius norm of the difference from X over all matrices of rank k (Golub and Van Loan, 1996). We may think of this matrix as a "smoothed" or "compressed" version of the original matrix. In the subsequent steps, we will be calculating cosines for row vectors. For this purpose, we can simplify calculations by dropping V. The cosine of two vectors is their dot product, after they have been normalized to unit length. The matrix X X^T contains the dot products of all of the row vectors. We can find the dot product of the i-th and j-th row vectors by looking at the cell in row i, column j of the matrix X X^T. Since X X^T = U S V^T (U S V^T)^T = U S V^T V S^T U^T = U S (U S)^T, we can calculate cosines with the smaller matrix U S, instead of using X (Deerwester et al., 1990).

10. Projection: Calculate U_k S_k (we use k = 300). This matrix has the same number of rows as X, but only k columns (instead of 2 x num_patterns columns; in our experiments, that is 300 columns instead of 8,000). We can compare two word pairs by calculating the cosine of the corresponding row vectors in U_k S_k. The row vector for each word pair has been projected from the original 8,000 dimensional space into a new 300 dimensional space. The value k = 300 is recommended by Landauer and Dumais (1997) for measuring the attributional similarity between words. We investigate other values in Section 6.4.
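Steps 8-10 can be sketched compactly with numpy. This is a schematic of the transformations described above (dense arrays instead of SVDLIBC's sparse format, toy counts, and a small k), not the production pipeline.

```python
import numpy as np

def log_entropy_transform(X):
    """Step 8: replace x_ij by w_j * log(x_ij + 1), where
    w_j = 1 - H_j / log(m) and H_j is the entropy of column j."""
    m = X.shape[0]
    col_sums = X.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                  # avoid division by zero
    P = X / col_sums
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    H = -(P * logP).sum(axis=0)
    w = 1.0 - H / np.log(m)
    return np.log(X + 1.0) * w

def project(X, k):
    """Steps 9-10: rank-k SVD, then keep U_k * S_k as the row representation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy usage: 6 word-pair rows, 8 pattern columns, k = 2.
X = np.random.poisson(1.0, size=(6, 8)).astype(float)
rows = project(log_entropy_transform(X), k=2)
norms = np.linalg.norm(rows, axis=1)
cos = rows @ rows.T / (np.outer(norms, norms) + 1e-12)
print(np.round(cos, 2))                            # pairwise cosines between word pairs
```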
Table 10 gives the cosines for the sixteen combinations.

Table 10
The sixteen combinations and their cosines. A:B::C:D expresses the analogy "A is to B as C is to D". The third column indicates those combinations for which the cosine is greater than or equal to the cosine of the original analogy, quart:volume::mile:distance.

Word pairs                        Cosine     Cosine >= original pairs
quart:volume::mile:distance       0.524568   yes (original pairs)
quart:volume::feet:distance       0.463552
quart:volume::mile:length         0.634493   yes
quart:volume::length:distance     0.498858
liter:volume::mile:distance       0.735634   yes
liter:volume::feet:distance       0.686983   yes
liter:volume::mile:length         0.744999   yes
liter:volume::length:distance     0.576477   yes
gallon:volume::mile:distance      0.763385   yes
gallon:volume::feet:distance      0.709965   yes
gallon:volume::mile:length        0.781394   yes (highest cosine)
gallon:volume::length:distance    0.614685   yes
pumping:volume::mile:distance     0.411644
pumping:volume::feet:distance     0.439250
pumping:volume::mile:length       0.446202
pumping:volume::length:distance   0.490511

12. Calculate relational similarity: The relational similarity between A:B and C:D is the average of the cosines, among the sixteen cosines from step 11, that are greater than or equal to the cosine of the original pairs, A:B and C:D. The requirement that the cosine must be greater than or equal to the original cosine is a way of filtering out poor analogies, which may be introduced in step 1 and may have slipped through the filtering in step 2. Averaging the cosines, as opposed to taking their maximum, is intended to provide some resistance to noise. For quart:volume and mile:distance, the third column in Table 10 shows which alternates are used to calculate the average. For these two pairs, the average of the selected cosines is 0.677258. In Table 7, we see that pumping:volume has slipped through the filtering in step 2, although it is not a good alternate for quart:volume. However, Table 10 shows that all four analogies that involve pumping:volume are dropped here, in step 12.

Steps 11 and 12 can be repeated for each two input pairs that are to be compared. This completes the description of LRA.

Table 11 gives the cosines for the sample SAT question. The choice pair with the highest average cosine (column three in the table) is the solution for this question; LRA answers the question correctly. For comparison, column four gives the cosines for the original pairs and column five gives the highest cosine (the maximum over the sixteen alternates). For this particular SAT question, there is one choice that has the highest cosine for all three columns (choice (b)), although this is not true in general. However, note that the gap between the first choice (b) and the second choice (d) is largest for the average cosines. This suggests that the average of the cosines is better at discriminating the correct choice than either the original cosine or the highest cosine.

Table 11
Cosines for the sample SAT question.

                                  Average    Original   Highest
                                  cosines    cosines    cosines
Stem:      quart:volume
Choices:   (a) day:night          0.373725   0.326610   0.443079
           (b) mile:distance      0.677258   0.524568   0.781394
           (c) decade:century     0.388504   0.327201   0.469610
           (d) friction:heat      0.427860   0.336138   0.551676
           (e) part:whole         0.370172   0.329997   0.408357
Solution:  (b) mile:distance      0.677258   0.524568   0.781394
Gap:       (b) - (d)              0.249398   0.188430   0.229718

6. Experiments with Word Analogy Questions

This section presents various experiments with 374 multiple-choice SAT word analogy questions.

6.1 Baseline LRA System

Table 12 shows the performance of the baseline LRA system on the 374 SAT questions, using the parameter settings and configuration described in Section 5. LRA correctly answered 210 of the 374 questions. 160 questions were answered incorrectly and 4 questions were skipped, because the stem pair and its alternates were represented by zero vectors.
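These counts determine the precision, recall, and F figures reported in Table 12. The arithmetic below (our own check, using the usual convention in this literature that precision is computed over the answered questions, recall over all questions, and F is their harmonic mean) reproduces them:

    # Baseline counts for LRA on the 374 SAT questions.
    correct, incorrect, skipped = 210, 160, 4
    total = correct + incorrect + skipped               # 374
    precision = correct / (correct + incorrect)         # 0.568 -> 56.8%
    recall = correct / total                            # 0.561 -> 56.1%
    f = 2 * precision * recall / (precision + recall)   # 0.565 -> 56.5%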
Table 12
Performance of LRA on the 374 SAT questions. (The bottom five rows are included for comparison.)

Algorithm                          Precision   Recall       F
LRA                                  56.8%      56.1%   56.5%
Veale (2004)                         42.8%      42.8%   42.8%
best attributional similarity        35.0%      35.0%   35.0%
random guessing                      20.0%      20.0%   20.0%
lowest co-occurrence frequency       16.8%      16.8%   16.8%
highest co-occurrence frequency      11.8%      11.8%   11.8%

The performance of LRA is significantly better than the lexicon-based approach of Veale (2004) (see Section 3.1) and the best performance using attributional similarity (see Section 2.3), with 95% confidence, according to the Fisher Exact Test (Agresti, 1990).

As another point of reference, consider the simple strategy of always guessing the choice with the highest co-occurrence frequency. The idea here is that the words in the solution pair may occur together frequently, because there is presumably a clear and meaningful relation between the solution words, whereas the distractors may only occur together rarely, because they have no meaningful relation. This strategy is significantly worse than random guessing. The opposite strategy, always guessing the choice pair with the lowest co-occurrence frequency, is also worse than random guessing (but not significantly). It appears that the designers of the SAT questions deliberately chose distractors that would thwart these two strategies.

With 374 questions and 6 word pairs per question (one stem and five choices), there are 2,244 pairs in the input set. In step 2, introducing alternate pairs multiplies the number of pairs by four, resulting in 8,976 pairs. In step 5, for each pair A:B, we add B:A, yielding 17,952 pairs. However, some pairs are dropped because they correspond to zero vectors (they do not appear together in a window of five words in the WMTS corpus). Also, a few words do not appear in Lin's thesaurus, and some word pairs appear twice in the SAT questions (e.g., lion:cat). The sparse matrix (step 7) has 17,232 rows (word pairs) and 8,000 columns (patterns), with a density of 5.8% (percentage of nonzero values).
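The bookkeeping behind these matrix dimensions is summarized in the short sketch below (our own arithmetic; the step numbers refer to Section 5.5):

    questions, pairs_per_question = 374, 6
    input_pairs = questions * pairs_per_question   # 2,244 pairs (one stem and five choices each)
    with_alternates = input_pairs * 4              # 8,976 pairs (original plus three alternates, step 2)
    with_reversals = with_alternates * 2           # 17,952 pairs (A:B and B:A, step 5)
    rows, cols, density = 17232, 8000, 0.058       # after dropping zero vectors and duplicates (step 7)
    nonzero_cells = rows * cols * density          # roughly 8 million non-zero frequencies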
Table 13 gives the time required for each step of LRA, a total of almost nine days. All of the steps used a single CPU on a desktop computer, except step 3, finding the phrases for each word pair, which used a 16 CPU Beowulf cluster. Most of the other steps are parallelizable; with a bit of programming effort, they could also be executed on the Beowulf cluster. All CPUs (both desktop and cluster) were 2.4 GHz Intel Xeons. The desktop computer had 2 GB of RAM and the cluster had a total of 16 GB of RAM.

Table 13
LRA elapsed run time.

Step    Description                        Time H:M:S   Hardware
1       Find alternates                      24:56:00   1 CPU
2       Filter alternates                     0:00:02   1 CPU
3       Find phrases                        109:52:00   16 CPUs
4       Find patterns                        33:41:00   1 CPU
5       Map pairs to rows                     0:00:02   1 CPU
6       Map patterns to columns               0:00:02   1 CPU
7       Generate a sparse matrix             38:07:00   1 CPU
8       Calculate entropy                     0:11:00   1 CPU
9       Apply SVD                             0:43:28   1 CPU
10      Projection                            0:08:00   1 CPU
11      Evaluate alternates                   2:11:00   1 CPU
12      Calculate relational similarity       0:00:02   1 CPU
Total                                       209:49:36

6.2 LRA versus VSM

Table 14 compares LRA to the Vector Space Model with the 374 analogy questions. VSM-AV refers to the VSM using AltaVista's database as a corpus. The VSM-AV results are taken from Turney and Littman (2005). As mentioned in Section 4.2, we estimate this corpus contained about $5 \times 10^{11}$ English words at the time the VSM-AV experiments took place. VSM-WMTS refers to the VSM using the WMTS, which contains about $5 \times 10^{10}$ English words. We generated the VSM-WMTS results by adapting the VSM to the WMTS. The algorithm is slightly different from Turney and Littman (2005), because we used passage frequencies instead of document frequencies.

Table 14
LRA versus VSM with 374 SAT analogy questions.

Algorithm   Correct   Incorrect   Skipped   Precision   Recall       F
VSM-AV        176        193          5       47.7%      47.1%   47.4%
VSM-WMTS      144        196         34       42.4%      38.5%   40.3%
LRA           210        160          4       56.8%      56.1%   56.5%

All three pairwise differences in recall in Table 14 are statistically significant with 95% confidence, using the Fisher Exact Test (Agresti, 1990). The pairwise differences in precision between LRA and the two VSM variations are also significant, but the difference in precision between the two VSM variations (42.4% versus 47.7%) is not significant.

Although VSM-AV has a corpus ten times larger than LRA's, LRA still performs better than VSM-AV. Comparing VSM-AV to VSM-WMTS, the smaller corpus has reduced the score of the VSM, but much of the drop is due to the larger number of questions that were skipped (34 for VSM-WMTS versus 5 for VSM-AV). With the smaller corpus, many more of the input word pairs simply do not appear together in short phrases in the corpus. LRA is able to answer as many questions as VSM-AV, although it uses the same corpus as VSM-WMTS, because Lin's thesaurus allows LRA to substitute synonyms for words that are not in the corpus.
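The significance tests and confidence intervals used in this section can be reproduced in outline. The sketch below is our own illustration with scipy (not the authors' code); it applies Fisher's exact test to the LRA versus VSM-WMTS recall counts from Table 14 and an exact binomial interval to LRA's recall, as used in Section 6.3:

    from scipy.stats import fisher_exact, binomtest

    lra_correct, vsm_wmts_correct, total = 210, 144, 374

    # Fisher's exact test on the 2x2 table of correct vs. not-correct questions.
    table = [[lra_correct, total - lra_correct],
             [vsm_wmts_correct, total - vsm_wmts_correct]]
    oddsratio, p_value = fisher_exact(table)
    print(p_value)  # far below 0.05: the recall difference is significant

    # Exact (Clopper-Pearson) 95% confidence interval for LRA's recall.
    ci = binomtest(lra_correct, total).proportion_ci(confidence_level=0.95)
    print(ci.low, ci.high)  # roughly 0.51 to 0.61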
VSM-AV required 17 days to process the 374 analogy questions (Turney and Littman, 2005), compared to 9 days for LRA. As a courtesy to AltaVista, Turney and Littman (2005) inserted a five second delay between each query. Since the WMTS is running locally, there is no need for delays. VSM-WMTS processed the questions in only one day.

6.3 Human Performance

The average performance of college-bound senior high school students on verbal SAT questions corresponds to a recall (percent correct) of about 57% (Turney and Littman, 2005). The SAT I test consists of 78 verbal questions and 60 math questions (there is also an SAT II test, covering specific subjects, such as chemistry). Analogy questions are only a subset of the 78 verbal SAT questions. If we assume that the difficulty of our 374 analogy questions is comparable to the difficulty of the 78 verbal SAT I questions, then we can estimate that the average college-bound senior would correctly answer about 57% of the 374 analogy questions.

Of our 374 SAT questions, 190 are from a collection of ten official SAT tests (Claman, 2000). On this subset of the questions, LRA has a recall of 61.1%, compared to a recall of 51.1% on the other 184 questions. The 184 questions that are not from Claman (2000) seem to be more difficult. This indicates that we may be underestimating how well LRA performs, relative to college-bound senior high school students. Claman (2000) suggests that the analogy questions may be somewhat harder than other verbal SAT questions, so we may be slightly overestimating the mean human score on the analogy questions.

Table 15 gives the 95% confidence intervals for LRA, VSM-AV, and VSM-WMTS, calculated by the Binomial Exact Test (Agresti, 1990). There is no significant difference between LRA and human performance, but VSM-AV and VSM-WMTS are significantly below human-level performance.

Table 15
Comparison with human SAT performance. The last column in the table indicates whether (YES) or not (NO) the average human performance (57%) falls within the 95% confidence interval of the corresponding algorithm's performance. The confidence intervals are calculated using the Binomial Exact Test (Agresti, 1990).

System      Recall (% correct)   95% confidence interval for recall   Human-level (57%)
VSM-AV            47.1%                 42.2–52.5%                          NO
VSM-WMTS          38.5%                 33.5–43.6%                          NO
LRA               56.1%                 51.0–61.2%                          YES

6.4 Varying the Parameters in LRA

There are several parameters in the LRA algorithm (see Section 5.5). The parameter values were determined by trying a small number of possible values on a small set of questions that were set aside. Since LRA is intended to be an unsupervised learning algorithm, we did not attempt to tune the parameter values to maximize the precision and recall on the 374 SAT questions. We hypothesized that LRA is relatively insensitive to the values of the parameters.
Table 16 shows the variation in the performance of LRA as the parameter values are adjusted. We take the baseline parameter settings (given in Section 5.5) and vary each parameter, one at a time, while holding the remaining parameters fixed at their baseline values.

Table 16
Variation in performance with different parameter values. The Baseline column marks the baseline parameter values. The Step column gives the step number in Section 5.5 where each parameter is discussed.

Parameter                            Value   Baseline   Step   Precision   Recall       F
similar words per word (thesaurus)       5                 1      54.2%      53.5%   53.8%
similar words per word (thesaurus)      10        *        1      56.8%      56.1%   56.5%
similar words per word (thesaurus)      15                 1      54.1%      53.5%   53.8%
maximum words per phrase                 4                 2      55.8%      55.1%   55.5%
maximum words per phrase                 5        *        2      56.8%      56.1%   56.5%
maximum words per phrase                 6                 2      56.2%      55.6%   55.9%
alternates kept by the filter            1                 2      54.3%      53.7%   54.0%
alternates kept by the filter            2                 2      55.7%      55.1%   55.4%
alternates kept by the filter            3        *        2      56.8%      56.1%   56.5%
alternates kept by the filter            4                 2      55.7%      55.1%   55.4%
alternates kept by the filter            5                 2      54.3%      53.7%   54.0%
number of patterns                    1000                 4      55.9%      55.3%   55.6%
number of patterns                    2000                 4      57.6%      57.0%   57.3%
number of patterns                    3000                 4      58.4%      57.8%   58.1%
number of patterns                    4000        *        4      56.8%      56.1%   56.5%
number of patterns                    5000                 4      57.0%      56.4%   56.7%
number of patterns                    6000                 4      57.0%      56.4%   56.7%
number of patterns                    7000                 4      58.1%      57.5%   57.8%
k (SVD dimensions)                     100                10      55.7%      55.1%   55.4%
k (SVD dimensions)                     300        *       10      56.8%      56.1%   56.5%
k (SVD dimensions)                     500                10      57.6%      57.0%   57.3%
k (SVD dimensions)                     700                10      56.5%      55.9%   56.2%
k (SVD dimensions)                     900                10      56.2%      55.6%   55.9%

None of the precision and recall values are significantly different from the baseline, according to the Fisher Exact Test (Agresti, 1990), at the 95% confidence level. This supports the hypothesis that the algorithm is not sensitive to the parameter values.

Although a full run of LRA on the 374 SAT questions takes nine days, for some of the parameters it is possible to reuse cached data from previous runs. We limited the experiments with two of the parameters (the two that appear with only three values in Table 16) because caching was not as helpful for these parameters, so experimenting with them required several weeks.

6.5 Ablation Experiments

As mentioned in the introduction, LRA extends the VSM approach of Turney and Littman (2005) by (1) exploring variations on the analogies by replacing words with synonyms (step 1), (2) automatically generating connecting patterns (step 4), and (3) smoothing the data with SVD (step 9). In this subsection, we ablate each of these three components to assess their contribution to the performance of LRA. Table 17 shows the results.

Table 17
Results of ablation experiments.

             LRA baseline   LRA no SVD   LRA no synonyms   LRA no SVD,     VSM-WMTS
             system                                        no synonyms
             #1             #2           #3                #4              #5
Correct        210            198          185               178             144
Incorrect      160            172          167               173             196
Skipped          4              4           22                23              34
Precision    56.8%          53.5%        52.6%             50.7%           42.4%
Recall       56.1%          52.9%        49.5%             47.6%           38.5%
F            56.5%          53.2%        51.0%             49.1%           40.3%

Without SVD (compare column #1 to #2 in Table 17), performance drops, but the drop is not statistically significant with 95% confidence, according to the Fisher Exact Test (Agresti, 1990). However, we hypothesize that the drop in performance would be significant with a larger set of word pairs. More word pairs would increase the sample size, which would decrease the 95% confidence interval, which would likely show that SVD is making a significant contribution. Furthermore, more word pairs would increase the matrix size, which would give SVD more leverage. For example, Landauer and Dumais (1997) apply SVD to a matrix of 30,473 columns by 60,768 rows, but our matrix here is 8,000 columns by 17,232 rows.
We are currently gathering more SAT questions, to test this hypothesis.

Without synonyms (compare column #1 to #3 in Table 17), recall drops significantly (from 56.1% to 49.5%), but the drop in precision is not significant. When the synonym component is dropped, the number of skipped questions rises from 4 to 22, which demonstrates the value of the synonym component of LRA for compensating for sparse data.

When both SVD and synonyms are dropped (compare column #1 to #4 in Table 17), the decrease in recall is significant, but the decrease in precision is not significant. Again, we believe that a larger sample size would show the drop in precision is significant.

If we eliminate both synonyms and SVD from LRA, all that distinguishes LRA from VSM-WMTS is the patterns (step 4). The VSM approach uses a fixed list of 64 patterns to generate 128 dimensional vectors (Turney and Littman, 2005), whereas LRA uses a dynamically generated set of 4,000 patterns, resulting in 8,000 dimensional vectors. We can see the value of the automatically generated patterns by comparing LRA without synonyms and SVD (column #4) to VSM-WMTS (column #5). The difference in both precision and recall is statistically significant with 95% confidence, according to the Fisher Exact Test (Agresti, 1990).

The ablation experiments support the value of the patterns (step 4) and synonyms (step 1) in LRA, but the contribution of SVD (step 9) has not been proven, although we believe more data will support its effectiveness. Nonetheless, the three components together result in a 16% increase in F (compare #1 to #5).

6.6 Matrix Symmetry

We know a priori that, if A:B::C:D, then B:A::D:C. For example, "mason is to stone as carpenter is to wood" implies "stone is to mason as wood is to carpenter". Therefore a good measure of relational similarity, $\mathrm{sim}_r$, should obey the following equation:

$\mathrm{sim}_r(A{:}B, C{:}D) = \mathrm{sim}_r(B{:}A, D{:}C)$   (7)

In steps 5 and 6 of the LRA algorithm (Section 5.5), we ensure that the matrix is symmetrical, so that equation (7) is necessarily true for LRA. The matrix is designed so that the row vector for A:B is different from the row vector for B:A only by a permutation of the elements. The same permutation distinguishes the row vectors for C:D and D:C. Therefore the cosine of the angle between A:B and C:D must be identical to the cosine of the angle between B:A and D:C (see equation (6)).

To discover the consequences of this design decision, we altered steps 5 and 6 so that symmetry is no longer preserved. In step 5, for each word pair A:B that appears in the input set, we only have one row. There is no row for B:A unless B:A also appears in the input set. Thus the number of rows in the matrix dropped from 17,232 to 8,616. In step 6, we no longer have two columns for each pattern P, one for "word1 P word2" and another for "word2 P word1". However, to be fair, we kept the total number of columns at 8,000. In step 4, we selected the top 8,000 patterns (instead of the top 4,000), distinguishing the pattern "word1 P word2" from the pattern "word2 P word1" (instead of considering them equivalent). Thus a pattern with a high frequency is likely to appear in two columns, in both possible orders, but a lower frequency pattern might appear in only one column, in only one possible order. These changes resulted in a slight decrease in performance. Recall dropped from 56.1% to 55.3% and precision dropped from 56.8% to 55.9%. The decrease is not statistically significant. However, the modified algorithm no longer obeys equation (7).
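The symmetry argument above rests on a simple property of the cosine: applying one fixed permutation to both vectors leaves their cosine unchanged. The toy sketch below (our own illustration; the block-swap permutation stands in for exchanging the "word1 P word2" and "word2 P word1" columns) checks this numerically:

    import numpy as np

    rng = np.random.default_rng(1)
    n_patterns = 5

    # Permutation that swaps the two column blocks (forward and reversed patterns).
    perm = np.concatenate([np.arange(n_patterns, 2 * n_patterns),
                           np.arange(0, n_patterns)])

    ab = rng.random(2 * n_patterns)   # hypothetical row vector for A:B
    cd = rng.random(2 * n_patterns)   # hypothetical row vector for C:D
    ba, dc = ab[perm], cd[perm]       # row vectors for B:A and D:C

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    assert np.isclose(cosine(ab, cd), cosine(ba, dc))  # the equality in equation (7)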
Although dropping symmetry appears to cause no significant harm to the performance of the algorithm on the SAT questions, we prefer to retain symmetry, to ensure that equation (7) is satisfied.

Note that, if A:B::C:D, it does not follow that B:A::C:D. For example, it is false that "stone is to mason as carpenter is to wood". In general (except when the semantic relations between A and B are symmetrical), we have the following inequality:

$\mathrm{sim}_r(A{:}B, C{:}D) \neq \mathrm{sim}_r(B{:}A, C{:}D)$   (8)

Therefore we do not want A:B and B:A to be represented by identical row vectors, although it would ensure that equation (7) is satisfied.

6.7 All Alternates versus Better Alternates

In step 12 of LRA, the relational similarity between A:B and C:D is the average of the cosines, among the sixteen cosines from step 11, that are greater than or equal to the cosine of the original pairs, A:B and C:D. That is, the average includes only those alternates that are "better" than the originals. Taking all alternates instead of the better alternates, recall drops from 56.1% to 40.4% and precision drops from 56.8% to 40.8%. Both decreases are statistically significant with 95% confidence, according to the Fisher Exact Test (Agresti, 1990).

6.8 Interpreting Vectors

Suppose a word pair A:B corresponds to a row vector $\mathbf{x}$ in the matrix $X$. It would be convenient if inspection of $\mathbf{x}$ gave us a simple explanation or description of the relation between A and B. For example, suppose the word pair ostrich:bird maps to the row vector $\mathbf{x}$. It would be pleasing to look in $\mathbf{x}$ and find that the largest element corresponds to the pattern "is the largest" (i.e., "ostrich is the largest bird"). Unfortunately, inspection of $\mathbf{x}$ reveals no such convenient patterns.

     1        3       10       30      100      300     1000     3000
 34.2%    40.2%    45.0%    46.8%    47.8%    51.6%    53.2%    55.6%

We hypothesize that the semantic content of a vector is distributed over the whole vector; it is not concentrated in a few elements. To test this hypothesis