Volkan Cirik, Louis-Philippe Morency and Deniz Yuret. Context Embeddings using Substitute Word Distributions. In ?. (in preparation). [ai.ku]
Title: Context Embeddings using Substitute Word Distributions
Authors: Volkan Cirik, Louis-Philippe Morency and Deniz Yuret
Instructions
The author response period has begun. The reviews for your submission are displayed on this page. If you want to respond to the points raised in the reviews, you may do so in the box provided below.
The response should be entered by 17 July, 2016 (11:59pm Pacific Daylight Savings Time, UTC -7h).
The response can be edited multiple times throughout the author response period.
Please note: you are not obligated to respond to the reviews.
Review #1
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 3
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 5
Comments
This work proposes to compute context embeddings from the embeddings of substitute words. For a given context, a list of substitute words is first collected. Then the weight of each substitute word is calculated with a language model. Finally, the context embedding is computed as the weighted sum of all the substitute word embeddings. The authors concatenate the context embedding with the word embedding to represent a word in that context. Experiments on the POS tagging task show the effectiveness of this method.
The method is simple and makes sense. However, I have some questions about the experimental part.
(1) In Table 2, the number of features input to LIBSVM in the “word embedding” column is 50, whereas it is 100 for the last column. Could the higher performance simply be due to the larger number of features?
(2) Your word embeddings are pre-trained on a much larger corpus. Is this the reason for the better performance in Table 4?
(3) I wonder what the results would be if you used only the context embedding instead of concatenating it with the original word embedding.
Review #2
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 5
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 4
Comments
The paper presents an extension of (Yatbaz et al., 2012), which introduced substitute vectors for word context representations, and elegantly defines so-called context embeddings.
The context embeddings possess a few interesting properties: they are adaptable to different word embedding methods, and thus universal with respect to the input data; they naturally extend any word embedding method; and, most importantly, they are able to improve the results of word embeddings by differentiating between word sense occurrences. The improvement is documented in the paper by the increased POS tagging accuracy of three word embedding models and by state-of-the-art results on 5 unsupervised POS induction tasks across several languages (the other 5 results were comparable to the state of the art).
Comments and questions:
The context of size 2n+1 of the substitute vectors uses only n-grams for the computation - why? Word2vec uses full word contexts for word embeddings, shouldn't they be used here too instead of Markov estimates?
The text uses both p() and P() for probability - is there a difference? If not, they should be unified.
Table 4 shows higher values for "Our method" in two more cases (Bulgarian CoNLL-X and Turkish CoNLL-X); these are, however, not bold. Why?
The related work section could be expanded with other recent results on "context embedding" computations with similar "independence" qualities, e.g.:
- Instance-context embeddings - Kågebäck, Mikael, et al. "Neural context embeddings for automatic discovery of word senses." Proceedings of NAACL-HLT. 2015.
- Vu, Thuy, and D. Stott Parker. "K-Embeddings: Learning Conceptual Embeddings for Words using Context." Proceedings of NAACL-HLT. 2016.
However, the context embeddings method from the current paper can be regarded as more transparent.
Review #3
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 2
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 4
Comments
The paper describes an approach to learning embeddings of words in context and uses these embeddings to tackle POS tagging. It combines the concept of substitute word distributions (i.e., the probability with which other words can replace a word in context) with traditional word embeddings (e.g. Collobert & Weston 2011). The contextual word embedding is the concatenation of the original embedding and the weighted sum of the top-K substitute word embeddings, where the weights are determined by a statistical 4-gram language model. On supervised POS tagging the system achieves an accuracy of 96.7%, which is close to the state of the art. On unsupervised POS tagging the system beats the state of the art on 5 languages.
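[Editor's note] For concreteness, the construction summarized above can be sketched as follows. This is not the authors' code: the embeddings, the substitute set, and the language-model probabilities below are toy values invented for illustration; only the weighted-sum-plus-concatenation structure follows the description in the reviews.

```python
import numpy as np

# Toy 4-dimensional word embeddings (assumed, for illustration only).
embeddings = {
    "bank":  np.array([0.1, 0.2, 0.3, 0.4]),
    "shore": np.array([0.2, 0.1, 0.4, 0.3]),
    "side":  np.array([0.0, 0.3, 0.2, 0.5]),
}

def context_embedding(substitute_probs, embeddings):
    """Weighted sum of substitute-word embeddings.

    substitute_probs maps each of the top-K substitute words to its
    probability under the language model (assumed already renormalized
    to sum to 1 over the top-K substitutes).
    """
    return sum(p * embeddings[w] for w, p in substitute_probs.items())

def contextual_representation(word, substitute_probs, embeddings):
    """Concatenate the word's own embedding with its context embedding."""
    return np.concatenate([embeddings[word],
                           context_embedding(substitute_probs, embeddings)])

# Hypothetical top-K substitutes for one occurrence of "bank",
# with made-up language-model probabilities.
subs = {"shore": 0.6, "side": 0.4}
rep = contextual_representation("bank", subs, embeddings)
print(rep.shape)  # (8,): word embedding (4 dims) + context embedding (4 dims)
```

In the paper's setup the substitute probabilities would come from a statistical 4-gram language model rather than being fixed by hand, and the embeddings would be pre-trained (e.g. Collobert & Weston 2011); this sketch only shows how the two pieces are combined.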
Overall, the combination of n-gram language model with word embeddings to determine contextual embeddings seems novel.
Although the supervised tagging results are not very positive, the unsupervised tagging results seem promising. However, it is unclear how impactful unsupervised tagging results are when typical downstream applications require a much higher accuracy, which can only be achieved through supervised or semi-supervised techniques.
The other issue I had with the paper was that the way the contextual embeddings were inferred was unsatisfying. Instead of learning such embeddings from first principles, the embeddings are just the weighted sum of the embeddings of other words that appear in a similar context. In fact, the authors seem unaware of recent work on learning a different embedding for each sense of a given word (e.g. the work of Iacobacci et al. at EMNLP'15, among others). It would be very helpful to compare against such approaches.
Finally, the authors' claim that state-of-the-art supervised POS tagging results require hand-engineered features is misleading. There are several papers that use word embeddings and neural networks without hand-engineering to achieve >97.2% accuracy on the WSJ corpus (e.g. see Collobert & Weston 2011 or dos Santos et al., ICML'14).