Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret. 2016. Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition. In COLING, December. [ai.ku, scode]
============================================================================
COLING 2016 Reviews for Submission #383
============================================================================
Title: Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition
Authors: Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 2
Originality: 2
Technical correctness / soundness: 2
Readability and clarity: 2
Meaningful comparison: 2
Substance: 2
Impact of ideas: 2
Impact of resources: 1
Overall recommendation: 2
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper suggests that the prediction of a word's syntactic category in CHILDES corpora performs better with a model of the pattern with a slot (described as "a * b" in the article) than with a model that separates the pattern heads ('aX') from the tails ('Xb') (described as "aX+Xb").
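To spell out the two models being compared, here is a minimal sketch of the features each one would extract for a target word (our illustration for this archive, not code from the paper; all names are ours):

```python
def joint_context(words, i):
    """The 'a*b' pattern: both neighbors of the target kept as one joint feature."""
    return [(words[i - 1], words[i + 1])]

def split_context(words, i):
    """The 'aX' + 'Xb' patterns: each neighbor kept as an independent feature."""
    return [('aX', words[i - 1]), ('Xb', words[i + 1])]

sent = ['the', 'cat', 'sat']
print(joint_context(sent, 1))  # [('the', 'sat')]
print(split_context(sent, 1))  # [('aX', 'the'), ('Xb', 'sat')]
```

The joint feature preserves the co-occurrence of the two neighbors, which the split features discard; this is the information asymmetry the next paragraph objects to.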
That "a*b" outperforms "aX + Xb" in prediction seemed obvious to me in the presented experiment, since, as shown in Figure 1, the information used for prediction in the former is larger than in the latter. In this sense, the comparison performed in this paper is not a "fair comparison". Maybe I misread, but if so, the authors must convince the readers that they are conducting a fair comparison.
Therefore, I did not understand what claim is made in this article, and my judgement with respect to COLING presentation unfortunately has to be rejection.
Apart from this main point, there are controversial issues related to
this article.
1. The authors claim that "a*b" corresponds to a "paradigmatic" representation and "aX+Xb" to a "syntagmatic" one, but the use of these two terms had better be reconsidered. In linguistics, paradigmatic and syntagmatic have their origin in F. de Saussure's definitions. Calling "a*b"/"aX+Xb" paradigmatic/syntagmatic seems too broad.
2. The authors refer to an EMNLP-CoNLL paper (the last reference), which has similar keywords in the title. If this referenced paper is the authors' own, then, since the review process must be double blind, the authors should disclose the author names. If not, it must be shown exactly what is new in this paper compared with the reference.
3. The authors must explain more about the exact intent of this work. Do the authors want to engineer the CHILDES text? Then the authors must explain the application needs and situate their work among previous work in the engineering domain. Do the authors want to say, based on this paper, that children must be processing grammar through "a*b"? Then the authors must explain how their experiment could relate to a cognitive claim.
4. Above all, I wonder whether this paper is appropriate for presentation at the conferences of the ACL community. The majority of the references are from the cognitive science domain. If a cognitive question lies behind this paper, the authors would do better to seek discussion at a conference dedicated to cognitive science.
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 5
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
## General comments
This paper presents an interesting study about the use of paradigmatic representations for word category learning in the setting of natural language acquisition. The authors compare two models of word representation -- syntagmatic (where each word is represented based on its neighbors) and paradigmatic (where the word is represented using substitute words). The experiments presented here clearly show the benefit of paradigmatic representations, and this is a very interesting contribution to the field.
However, I'm having a hard time understanding the dichotomy presented between these two representations: both seem to rely on context-based substitutability criteria, since even the paradigmatic representation constructs the substitute words via n-grams, which in turn are created by looking at the neighboring words of each target word. A related point is that there is a slight misrepresentation of Harris's distributional hypothesis in the example presented in the Introduction: in Harris's view, substitutability arises through the use of all (total) environments of the words/morphemes; it is therefore "unfair" to compare the substitutes, which have been created from a very large collection of environments (n-grams), with the local context of these two sentences.
## Specific issues/comments
The first sentence of your introduction would find an entire section of the field in disagreement, namely researchers who subscribe to the constructivist approach and believe that children learn rules about individual words and only later form abstract syntactic categories.
After the example presented in the Introduction, you mention that "the high probability substitute reflect both semantic and grammatical properties". As many papers in the distributional semantics area have shown, this is not necessarily the case, especially if you select a very small window from which to create your context representations (something which applies to the n-grams you use in this work), in which case they mostly reflect syntactic properties.
How do you justify the claim that the Redington et al. method "lacks
completeness"?
There are a few missing references from the part-of-speech induction literature that would be interesting to include: Alex Clark's 2003 "Combining distributional and morphological information for part of speech induction", Christodoulopoulos et al.'s 2011 "A Bayesian Mixture Model for PoS Induction Using Multiple Features", and Blunsom and Cohn's "A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction".
In the input corpora that you used, did you remove the punctuation marks? They
tend to provide a very strong distributional cue and are sometimes explicitly
removed to better simulate the linguistic input of children.
It is unclear what the contribution of Figure 1 is. Is it theoretically better
or worse to have more units as input to the model? Is there any evidence to
support these theoretical claims?
Given the size of the training corpus for the n-gram model, the comparison to the syntagmatic models, which are trained on significantly fewer words, seems problematic.
I would have liked to see a per-tag graph of the number of substitutes for each target word, since I would think that for some word classes (e.g. interjections) this number would introduce a lot of noise.
It would be better if you described the connection between the results you get in this work and those of St. Clair et al. 2010.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 4
Originality: 4
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 3
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper describes a comparison of two approaches to POS tagging, one (syntagmatic) in which tags are assigned based on n-grams, and one (paradigmatic) in which tags are assigned based on possible lexical substitutions. The paper reports that the latter (paradigmatic) gives higher accuracy.
Overall it's an interesting result, but I have some reservations about how
generally the results may be interpreted. First, if this model is to be taken
as an acquisition model, as Section 1 seems to describe, I'm surprised the
model uses supervised learning. Second, given this model is a supervised POS
tagger, the accuracy results seem rather low compared to state-of-the-art
sequence model taggers. This is potentially a problem because the gains due to
paradigmatic modeling may not turn out to be significant in the presence of a
more robust syntagmatic baseline.
Minor comments:
The last term in equation 3 should be "P(w_{n-1}|w^{n-2}_0)" (the zero is
missing).
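For reference, assuming equation 3 is the standard chain-rule factorization of the n-gram language model (an assumption on our part; the equation itself is not reproduced here), the corrected version would read:

$$P(w_0^{n-1}) = P(w_0)\, P(w_1 \mid w_0)\, P(w_2 \mid w_0^1) \cdots P(w_{n-1} \mid w_0^{n-2})$$

i.e. each word is conditioned on all the words from position 0 up to the one before it, so the subscript 0 belongs on the final conditioning history as well.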
Figures 2 and 3 should use bar graphs, since the slopes of the lines between columns aren't interpretable.
"patter(n)s" p. 5.
You may want to cite work on syntagmatic and paradigmatic parser acquisition by Simon Dennis at U. Adelaide in Australia.
Volkan Cirik, Louis-Philippe Morency and Deniz Yuret. Context Vectors using Substitute Word Distributions. (in preparation). [ai.ku]
Title: Context Embeddings using Substitute Word Distributions
Authors: Volkan Cirik, Louis-Philippe Morency and Deniz Yuret
Review #1
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 3
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 5
Comments
This work proposes to calculate context embeddings based on the embeddings of substitute words. For a given context, a list of substitute words is first collected. Then, the weight of each substitute word is calculated based on a language model. Finally, the context embedding is computed as the weighted sum of all the substitute word embeddings. The authors concatenate the context embedding with the word embedding to represent a word in that context. Experiments on the POS tagging task show the effectiveness of this method.
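To make the pipeline concrete, here is a minimal sketch of the computation as we read it; this is our reconstruction under stated assumptions (all function names, the toy data, and the use of NumPy are ours, not the authors'):

```python
import numpy as np

def context_embedding(substitutes, probs, emb):
    """Weighted sum of substitute word embeddings.

    substitutes: candidate replacement words for one context (from a language model)
    probs: their substitute probabilities, assumed normalized to sum to 1
    emb: dict mapping word -> pre-trained embedding vector
    """
    return sum(p * emb[w] for w, p in zip(substitutes, probs))

def token_representation(word, substitutes, probs, emb):
    """Concatenation of the word's own embedding and its context embedding."""
    return np.concatenate([emb[word], context_embedding(substitutes, probs, emb)])

# Toy usage with 3-dimensional embeddings and two substitutes for one context.
emb = {'bank':   np.array([1.0, 0.0, 0.0]),
       'shore':  np.array([0.0, 1.0, 0.0]),
       'teller': np.array([0.0, 0.0, 1.0])}
print(token_representation('bank', ['shore', 'teller'], [0.7, 0.3], emb))
# -> [1.  0.  0.  0.  0.7 0.3], i.e. twice the original dimensionality
```

Note that the concatenation doubles the dimensionality, which is exactly what question (1) below is probing.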
The method is simple and makes sense. However, I have some questions about the experiments.
(1) In Table 2, the number of features input to LIBSVM in the “word embedding” column is 50, whereas the number is 100 for the last column. Could the higher performance simply be due to the larger number of features?
(2) Your word embeddings are pre-trained on a much larger corpus. Is this the reason you got better performance in Table 4?
(3) I'm wondering what the results would be if you used only the context embedding instead of concatenating it with the original word embedding.
Review #2
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 5
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 4
Comments
The paper presents an extension of (Yatbaz et al., 2012), which introduced substitute vectors for word context representation, and elegantly defines so-called context embeddings.
The context embeddings possess a few interesting properties: they are adaptable to different word embedding methods, and are thus universal with respect to the input data; they naturally extend any word embedding method; and, most importantly, they are able to improve the results of word embeddings by making it possible to differentiate word sense occurrences. The improvement is documented in the paper by the increased POS tagging accuracy of three word embedding models and by state-of-the-art results on 5 unsupervised POS induction tasks in several languages (the other 5 results were comparable to the state of the art).
Comments and questions:
The context of size 2n+1 used for the substitute vectors relies only on n-grams for the computation - why? Word2vec uses full word contexts for word embeddings; shouldn't they be used here too, instead of Markov estimates?
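As background for this question: our understanding (an assumption based on (Yatbaz et al., 2012), not on the text of this submission) is that the substitute probability of a candidate word $w$ at target position $0$ is scored with the n-gram model over every window that spans the target,

$$P(w_0 = w \mid c) \propto \prod_{i=0}^{n-1} P(w_i \mid w_{i-n+1}^{i-1})$$

so only the $n-1$ words on each side of the target can influence the distribution, which is what the contrast between Markov estimates and full contexts comes down to.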
The text uses both p() and P() for probability - is there a difference? If not, they should be unified.
Table 4 shows higher values for "Our method" in two more cases (Bulgarian CoNLL-X and Turkish CoNLL-X); these are, however, not bold. Why?
The related work could be expanded with other recent results on "context embeddings" computations with similar "independence" qualities, e.g.
- Instance-context embeddings: Kågebäck, Mikael, et al. "Neural context embeddings for automatic discovery of word senses." Proceedings of NAACL-HLT. 2015.
- Vu, Thuy, and D. Stott Parker. "K-Embeddings: Learning Conceptual Embeddings for Words using Context." Proceedings of NAACL-HLT. 2016.
However, the context embeddings method in the current paper can be regarded as more transparent.
Review #3
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 2
Substance: 3
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 4
Comments
The paper describes an approach to learning embeddings of words in context and uses these embeddings to tackle POS tagging problems. It combines the concept of substitute word distributions (i.e. the probability with which other words can replace a word in context) with traditional word embeddings (e.g. Collobert & Weston 2011). The contextual word embedding is the concatenation of the original embedding and the weighted sum of the top-K substitute word embeddings, where the weights are determined by a statistical 4-gram language model. On supervised POS tagging the system achieves an accuracy of 96.7%, which is close to the state of the art. On unsupervised POS tagging the system beats the state of the art on 5 languages.
Overall, the combination of n-gram language model with word embeddings to determine contextual embeddings seems novel.
Although the supervised tagging results are not very positive, the unsupervised tagging results seem promising. However, it is unclear how impactful unsupervised tagging results are when typical downstream applications require a much higher accuracy that can only be achieved through supervised or semi-supervised techniques.
The other issue I had with the paper was that the way the contextual embeddings were inferred was unsatisfying. Instead of learning such embeddings from first principles, the embeddings are just the weighted sum of the embeddings of other words that appear in a similar context. In fact, the authors seem unaware of recent work on learning a different embedding for each sense of a given word (e.g. Iacobacci et al.'s work in EMNLP'15 and others). It would be very helpful to compare against such approaches.
Finally, the authors' claim that state-of-the-art supervised POS tagging results require hand-engineered features is misleading. There are several papers that use word embeddings and neural networks without hand-engineering to achieve >97.2% accuracy on the WSJ corpus (e.g. see Collobert & Weston 2011 or dos Santos et al. ICML'14).