Volkan Cirik, Louis-Philippe Morency and Deniz Yuret. Context Vectors using Substitute Word Distributions. (in preparation). [ai.ku]
Title: Context Embeddings using Substitute Word Distributions
Authors: Volkan Cirik, Louis-Philippe Morency and Deniz Yuret
Review #1
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 3
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 5
Comments
This work proposes to compute context embeddings based on the embeddings of substitute words. For a given context, a list of substitute words is first collected. Then the weight of each substitute word is calculated based on a language model. Finally, the context embedding is computed as the weighted sum of all the substitute word embeddings. The authors concatenate the context embedding with the word embedding to represent a word in that context. Experiments on a POS tagging task show the effectiveness of this method.
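In code, the pipeline summarized above looks roughly like the following (a minimal sketch; the function and variable names are illustrative assumptions, not the authors' implementation):

    # Minimal sketch of the described method; names and inputs are illustrative assumptions.
    import numpy as np

    def context_embedding(substitute_probs, word_vectors, dim=50):
        """Context embedding = probability-weighted sum of substitute word embeddings.
        substitute_probs: dict mapping substitute word -> probability (from a language model)
        word_vectors:     dict mapping word -> pre-trained embedding (numpy array of length dim)
        """
        ctx = np.zeros(dim)
        for word, prob in substitute_probs.items():
            if word in word_vectors:          # skip substitutes without a pre-trained vector
                ctx += prob * word_vectors[word]
        return ctx

    def token_representation(word, substitute_probs, word_vectors, dim=50):
        """Concatenate the type-level word embedding with the token-level context embedding."""
        w = word_vectors.get(word, np.zeros(dim))
        c = context_embedding(substitute_probs, word_vectors, dim)
        return np.concatenate([w, c])         # 2 * dim features, e.g. 100 when dim = 50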
The method is simple, and makes sense. However, I have some questions about the experiment part.
(1) In Table 2, the number of features input to LIBSVM in the “word embedding” column is 50, whereas it is 100 in the last column. Could the higher performance simply be due to the larger number of features?
(2) Your word embeddings are pre-trained on a much larger corpus. Is this the reason you got better performance in Table 4?
(3) I am wondering what the results would be if you used only the context embedding instead of concatenating it with the original word embedding.
Review #2
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 5
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 4
Comments
The paper presents an extension of Yatbaz et al. (2012), which introduced substitute vectors for word context representations, and elegantly defines so-called context embeddings.
The context embeddings possess a few interesting properties: they are adaptable to different word embedding methods, and are thus universal with respect to the input data; they naturally extend any word embedding method; and, most importantly, they are able to improve the results of word embeddings by making it possible to differentiate word sense occurrences. The improvement is documented in the paper by increased POS tagging accuracy for three word embedding models and by state-of-the-art results on 5 unsupervised POS induction tasks in several languages (the other 5 results were comparable to the state of the art).
Comments and questions:
The context of size 2n+1 for the substitute vectors uses only n-grams in the computation - why? Word2vec uses full word contexts for word embeddings; shouldn't full contexts be used here too, instead of Markov estimates?
The text uses both p() and P() for probability - is there a difference? If not, they should be unified.
Table 4 shows higher values for "Our method" in two more cases (Bulgarian CoNLL-X and Turkish CoNLL-X); these are, however, not bold. Why?
The related work could be expanded with other recent results on "context embedding" computations with similar "independence" qualities, e.g.:
- Instance-context embeddings - Kågebäck, Mikael, et al. "Neural context embeddings for automatic discovery of word senses." Proceedings of NAACL-HLT. 2015.
- Vu, Thuy, and D. Stott Parker. "K-Embeddings: Learning Conceptual Embeddings for Words using Context." Proceedings of NAACL-HLT. 2016.
However, the context embeddings method from the current paper can be regarded as more transparent.
Review #3
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 2
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 4
Comments
The paper describes an approach to learn embeddings of words in context and uses these embeddings to tackle POS tagging problems. It combines the concept of substitute word distributions (i.e. the probability with which other words can replace a word in context) with traditional word embeddings (e.g. Collobert & Weston 2011). The contextual word embedding is the concatenation of the original embedding and the weighted sum of the top-K substitute word embeddings, where the weights are determined by a statistical 4-gram language model. On supervised POS tagging the system achieves an accuracy of 96.7%, which is close to the state of the art. On unsupervised POS tagging the system beats the state of the art on 5 languages.
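Concretely, the concatenated representations can be fed to a linear classifier, roughly in the spirit of the LIBSVM setup mentioned in Review #1; the snippet below is illustrative only, with scikit-learn's LinearSVC as a stand-in rather than the authors' actual tool chain:

    # Illustrative only: a linear classifier over the concatenated representations.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_pos_tagger(token_vectors, tags):
        """token_vectors: one concatenated vector per token (e.g. 50 word-embedding
        dimensions + 50 context-embedding dimensions = 100 features); tags: gold POS tags."""
        X = np.vstack(token_vectors)
        clf = LinearSVC()
        clf.fit(X, tags)
        return clf

    def predict_tags(clf, token_vectors):
        return clf.predict(np.vstack(token_vectors))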
Overall, the combination of an n-gram language model with word embeddings to determine contextual embeddings seems novel.
Although the supervised tagging results are not very positive, the unsupervised tagging results seem promising. However, it is unclear how impactful unsupervised tagging results are when typical downstream applications require a much higher accuracy, which can be achieved through supervised or semi-supervised techniques.
The other issue I had with the paper was that the way the contextual embeddings are inferred is unsatisfying. Instead of learning such embeddings from first principles, the embeddings are just a weighted sum of the embeddings of other words that appear in a similar context. In fact, the authors seem unaware of recent work on learning a different embedding for each sense of a given word (e.g. Iacobacci et al.'s work at EMNLP'15 and others). It would be very helpful to compare against such approaches.
Finally, the authors' claim that state-of-the-art supervised POS tagging results require hand-engineered features is misleading. There are several papers that use word embeddings and neural networks without hand-engineering to achieve >97.2% accuracy on the WSJ corpus (e.g. see Collobert & Weston 2011 or dos Santos et al., ICML'14).
Onur Kuru and Deniz Yuret. Recognizing Lexical Entailment using Substitutability. (in preparation). [ai.ku]
============================================================================
COLING 2016 Reviews for Submission #378
============================================================================
Title: Recognizing Lexical Entailment using Substitutability
Authors: Onur Kuru and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 5
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 5
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper proposes a new solution for lexical entailment that brings together
substitutability with entailment. While substitutability has always been tied
in with the definition of lexical entailment, the authors claim that this is
the first attempt to directly model that aspect. They compare their
(unsupervised) approach to other existing approaches, showing respectable
performance across different datasets and settings.
Overall, I like this paper. It is clear to read, and uses simple ideas, but at
the same time it tries to approach the problem from a slightly different point
of view than existing work, and shows that this novel approach does fairly
well. While lexical entailment cannot always be modeled by substitution (for
example, when the word pair has different POS tags), there is definitely some
advantage to tying the two together - and this paper does that much more
explicitly than prior work.
I have some (mostly minor, except point 1) concerns with this work though:
1) The most serious concern is a factual error in Table 3 - the numbers for
balAPinc evaluated on KDSZ in the different setting are incorrect (the Turney
and Mohammad paper reports 0.60 AP1 and 0.60 AP0). This possibly invalidates
some of the claims made in this paper regarding the efficacy of balAPinc vs.
Subs.
2) Some of the comparisons in this paper are not apples-to-apples since the
authors in this work do not deal with multi-word expressions and hence they
have to work with only a subset of some of the datasets (specifically the
comparisons on the KDSZ and the Zeichner datasets).
3) Since this is an unsupervised setup, I would have liked a bit more detail on
the experimental setup. Is it k-fold cross validation, or do you just use a
small subset of the dataset to tune thresholds?
4) The error analysis is not very insightful. Instead of reporting performance
across corpora / number of tokens / n in n-gram, I would have liked to see an
error analysis that is specific to the approach, along the lines of the
"Substitute Distributions" section. For example, what is it that makes this
model better? Where does it do better than balAPinc? Are there things that
balAPinc gets right that the Subs model does not?
5) There are some grammatical issues and typos. For instance:
- Page 1, second paragraph: the sentence starting "Since lexical
entailment..." is not grammatical
- Page 6, the line above Table 3: "they only dependent" is incorrect
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 3
Readability and clarity: 4
Meaningful comparison: 4
Substance: 3
Impact of ideas: 2
Impact of resources: 1
Overall recommendation: 2
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper presents a rather simplistic scheme for inferring whether a lexical
entailment relation between a pair of words holds. The approach is based on a
probabilistic formulation of word substitutability in contexts, where language
models are used to estimate the probabilities of candidate words occurring in
given contexts. For each context with a blank placeholder, substitute
distributions are computed for the words occurring in that context by computing
the probability of each word in that context and then normalizing.
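The computation the reviewer summarizes can be sketched as follows (a schematic illustration; the language-model scorer lm_score is an assumed callable, not part of the paper's actual toolchain):

    # Schematic sketch of a substitute distribution for one blanked context.
    import math

    def substitute_distribution(left_context, right_context, vocabulary, lm_score):
        """lm_score(tokens) is assumed to return the log-probability of a token sequence,
        e.g. from an n-gram language model. Scores for every candidate word filling the
        blank are exponentiated and normalized into a probability distribution."""
        log_scores = {w: lm_score(left_context + [w] + right_context) for w in vocabulary}
        max_log = max(log_scores.values())                   # for numerical stability
        unnorm = {w: math.exp(s - max_log) for w, s in log_scores.items()}
        z = sum(unnorm.values())
        return {w: p / z for w, p in unnorm.items()}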
The proposed method has been tested on four data sets, and Average Precision
scores for both the entails and does-not-entail classes have been computed.
With the exception of one data set, the proposed approach performs better than
other competing approaches based on WordNet.
Additional Comments:
-- What is the point of equations 1-4 in Section 2, given that they are not
used/referred to later? (2) seems to have a typesetting problem: shouldn't the
n in the denominator actually be the summation bound?
-- In your derivation of the P(b|a) approximation, I can probably buy the first
assumption of a and b being independent. I understand why you need the second
assumption mathematically, but is it a reasonable assumption? It could very
well be, but you really need to provide a justification argument. Also, in the
final equation, do C and C' vary over the same set of contexts?
-- In the references, please make sure your conference names are consistently
named and capitalized (e.g. the first two); also Srilm -> SRILM. Journal names
should all be capitalized (e.g. in the Turney et al. entry), and you should
list all the authors rather than using et al.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 3
Readability and clarity: 3
Meaningful comparison: 1
Substance: 4
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 3
Reviewer Confidence: 5
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper proposes to address the lexical entailment problem by
directly modeling lexical substitutability: the word "dog" is assumed
to entail the word "animal" if "animal" can replace "dog" in the
contexts where "dog" occurs. This idea is closely related to the
context inclusion hypothesis (dog entails animal if dog occurs in a
subset of contexts where animal occurs), which has been used to design
asymmetric distributional similarity functions to detect lexical
entailment. This paper proposes instead to use n-gram language models
to directly score how often "dog" is a good replacement for "animal"
in contexts drawn from corpora.
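A rough sketch of that idea, under the simplifying assumption that the entailment score is the average substitute probability of the candidate word over contexts of the narrower word (not necessarily the paper's exact scoring function):

    # Rough sketch only; the averaging scheme here is an assumption, not the paper's formula.
    def substitutability_score(candidate, substitute_dists):
        """substitute_dists: one substitute distribution (dict word -> probability) per
        context in which the potentially entailing word (e.g. "dog") occurs."""
        if not substitute_dists:
            return 0.0
        return sum(d.get(candidate, 0.0) for d in substitute_dists) / len(substitute_dists)

A pair would then be labelled "entails" when the score exceeds a threshold tuned on held-out data; for example, substitutability_score("animal", dists_for_dog) should exceed substitutability_score("cat", dists_for_dog) if the method works as intended (dists_for_dog being a hypothetical list of substitute distributions for contexts of "dog").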
This is an interesting idea which is presented clearly and with some
positive empirical results. However, several modeling and experimental
choices should be explained and motivated more thoroughly.
In Section (3), what are the consequences of the simplifying
assumption that context probabilities P(C) are uniform? This should be
discussed since some contexts are clearly more likely than others.
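For concreteness, one plausible reading of where that assumption enters (a reconstruction for illustration, not quoted from the paper) is, in LaTeX:

    P(b \mid a) \;=\; \sum_{C} P(b \mid C)\, P(C \mid a)
                \;=\; \sum_{C} P(b \mid C)\, \frac{P(a \mid C)\, P(C)}{\sum_{C'} P(a \mid C')\, P(C')}
                \;\approx\; \sum_{C} P(b \mid C)\, \frac{P(a \mid C)}{\sum_{C'} P(a \mid C')}

where the last step assumes P(C) is uniform, i.e. frequent and rare contexts are weighted identically.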
Experiments could be strengthened to include a controlled comparison with other
asymmetric unsupervised methods (such as the approaches introduced in related
work) beyond the random and (symmetric?) similarity baselines (Table 2).
Results published elsewhere are reported for comparison (Table 3) in an
out-of-domain evaluation setting, but it would be useful to see a comparison
under controlled training conditions.
Selecting good contexts to test how often "dog" can be substituted by "animal"
seems to be a crucial step in the approach introduced here. This raises the
question of how sensitive the approach is to the nature, domain, amount, and
diversity of the contexts. Relatedly, what was the motivation for extracting
contexts from the Reuters RCV1 dataset and using distinct corpora for language
modeling?
Other comments:
In Equation (2), $n$ should be on top of the sum in the denominator.
In Section (4), what does the FASTSUBS algorithm do?