Onur Kuru, Ozan Arkan Can and Deniz Yuret. 2016. CharNER: Character-Level Named Entity Recognition. In COLING, December. [ai.ku]
COLING 2016 review:
Title: CharNER: Character-Level Named Entity Recognition
Authors: Onur Kuru, Ozan Arkan Can and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 3
Readability and clarity: 4
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 3
Overall recommendation: 4
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
The paper proposes character-based NER for languages with word segmentation.
Character-based tagging has been proposed previously; their contribution is to
apply LSTM models in the character-based tagging setting.
However, the results are only fair for the targeted languages.
In my opinion, the method may be promising for languages without word
segmentation, such as Chinese and Japanese, since word segmentation errors
affect NER scores.
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 5
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper presents a character-based model for named-entity recognition based
on bidirectional LSTMs. There is very recent research that tries to do
something similar, at NAACL-16 (Lample et al.) and ACL-16 (Ma and Hovy, for
example), with the exact same motivation: remove external resources, such as
gazetteers, and use a character-based approach to achieve high results. This
should not invalidate the paper, though.
The main difference from previous research (mentioned above) is that this model
examines a sentence as a sequence of characters and outputs a tag distribution
for each character. They later use transition matrices that only allow tags
consistent within each word.
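The character-level formulation described here can be sketched as a round trip between word-level tags and character-level tags (a minimal illustration, not the authors' code; the tag names and the majority-vote collapse are assumptions made for the sketch):

```python
def word_tags_to_char_tags(words, tags):
    """Expand word-level NER tags to one tag per character.

    The space between words gets the 'O' tag, so the model can
    treat the sentence as one flat character sequence.
    """
    char_tags = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        if i > 0:
            char_tags.append("O")  # tag for the separating space
        char_tags.extend([tag] * len(word))
    return char_tags


def char_tags_to_word_tags(words, char_tags):
    """Collapse character tags back to word tags by majority vote
    over each word's span (one simple consistency heuristic)."""
    word_tags, pos = [], 0
    for i, word in enumerate(words):
        if i > 0:
            pos += 1  # skip the space character's tag
        span = char_tags[pos:pos + len(word)]
        word_tags.append(max(set(span), key=span.count))
        pos += len(word)
    return word_tags


char_tags = word_tags_to_char_tags(["John", "works"], ["PER", "O"])
# 10 characters in "John works" -> 10 tags: PER over "John", O elsewhere
assert char_tags_to_word_tags(["John", "works"], char_tags) == ["PER", "O"]
```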
The results are good, though not the best overall. They are very good
compared to systems that do not use external resources, including word
embeddings; however, it should be a requirement to report the results of
the other systems without external resources (see Lample et al., for example).
In Table 6, you present results for Ma and Hovy and Lample et al. and
include them in the "External" row; as far as I know, they only use word
embeddings (if that). I think you should add "word embeddings" to the
caption, otherwise readers might think that they use gazetteers.
Some missing references: two EMNLP-15 papers that presented interesting results
for tagging, parsing and language modeling using character-based embeddings.
- Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez,
Silvio Amir, Luis Marujo and Tiago Luis. Finding Function in Form:
Compositional Character Models for Open Vocabulary Word Representation.
- Miguel Ballesteros, Chris Dyer and Noah A. Smith. Improved Transition-based
Parsing by Modeling Characters instead of Words with LSTMs.
This paper is also worth mentioning, since it also produces an entire character
sequence for sentences:
- Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl and William
Cohen. Tweet2Vec: Character-Based Distributed Representations for Social Media.
Minor comment:
Lample et al. do more than LSTM-CRF; they also presented a shift-reduce
algorithm that exploits character-based embeddings.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 5
Meaningful comparison: 4
Substance: 4
Impact of ideas: 4
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
Very interesting work.
It clearly shows that a deep bidirectional LSTM architecture combined with a
Viterbi decoder effectively finds language specific features for NER.
Applying the algorithm to languages written without space characters, e.g.,
Chinese and/or Japanese, may be interesting.
=================================
EMNLP 2016 review:
Title: CharNER: Character-Level Named Entity Recognition
Authors: Onur Kuru, Ozan Arkan Can and Deniz Yuret
Instructions
The author response period has begun. The reviews for your submission are displayed on this page. If you want to respond to the points raised in the reviews, you may do so in the box provided below.
The response should be entered by 17 July 2016 (11:59pm Pacific Daylight Time, UTC-7).
The response can be edited multiple times during the author response period.
Please note: you are not obligated to respond to the reviews.
Review #1
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 5
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 3
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 5
Comments
This paper presents a named entity recognizer in which the entire sentence is encoded as a sequence of characters and a bidirectional LSTM is used to make predictions. This is unlike previous (and recent) approaches, such as Lample et al. 2016, which presented character-based representations of words and then an LSTM/bidirectional-LSTM/stack-LSTM on top of them. The model is similar to the Tweet2Vec model recently accepted at ACL 2016, even though it tries to solve a different task. They examine a sentence as a sequence of characters and output a tag distribution for each character. This model, like Lample et al., has the potential of being language independent, and they apply it cross-lingually. The motivation and goals are also similar to Lample et al.: remove external features such as gazetteers, and still achieve high results.
Figure 3 does a great job summarizing the entire paper.
In order to avoid inconsistent character taggings, e.g., "J o h n w o r k s" tagged "P O O O G G G G O", they use a decoder as in Wang et al. 2015 that applies a transition matrix; at the end they output the entire sequence.
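The constrained decoding described here can be sketched as a small Viterbi pass in which a hand-written transition rule stands in for the transition matrix and forbids tag changes inside a word (a hedged illustration, not the paper's exact decoder; the tag set, the probabilities, and the space-based rule are invented for the example):

```python
import math


def viterbi_char_decode(char_probs, chars, tags):
    """Viterbi decoding over per-character tag distributions.

    trans_ok plays the role of the transition matrix: inside a word
    the tag may not change; at a space boundary any transition is
    allowed.
    """
    def trans_ok(prev_tag, tag, prev_char, cur_char):
        return prev_char == " " or cur_char == " " or prev_tag == tag

    # scores[t] = best log-prob of a consistent path ending in tag t
    scores = {t: math.log(char_probs[0][t]) for t in tags}
    back = []
    for i in range(1, len(chars)):
        step, new_scores = {}, {}
        for t in tags:
            best_score, best_prev = max(
                (scores[p], p) for p in tags
                if trans_ok(p, t, chars[i - 1], chars[i]))
            new_scores[t] = best_score + math.log(char_probs[i][t])
            step[t] = best_prev
        scores = new_scores
        back.append(step)
    # backtrack from the best final tag
    tag = max(scores, key=scores.get)
    path = [tag]
    for step in reversed(back):
        tag = step[tag]
        path.append(tag)
    return path[::-1]


# Greedy per-character tagging of "ab c" would give the inconsistent
# sequence P O O O; the constrained decoder keeps "ab" on a single tag.
probs = [{"P": 0.9, "O": 0.1}, {"P": 0.4, "O": 0.6},
         {"P": 0.1, "O": 0.9}, {"P": 0.2, "O": 0.8}]
assert viterbi_char_decode(probs, "ab c", ["O", "P"]) == ["P", "P", "O", "O"]
```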
They make a good comparison with related work, but since some of the models are freely available, I'd expect the authors to run them on the languages without public results (such as Arabic or Turkish).
In Table 5 you should definitely differentiate between systems that use gazetteers and neural models that use pretrained word embeddings. They are not the same thing, and as presented it might confuse the reader.
This is an interesting paper, but it lacks a bit of novelty given all the previous work that already demonstrated the usefulness of characters and sequential models for NER.
Minor comments: Missing ref (?) in related work.
Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret. 2016. Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition. In COLING, December. [ai.ku, scode]
============================================================================
COLING 2016 Reviews for Submission #383
============================================================================
Title: Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition
Authors: Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 2
Originality: 2
Technical correctness / soundness: 2
Readability and clarity: 2
Meaningful comparison: 2
Substance: 2
Impact of ideas: 2
Impact of resources: 1
Overall recommendation: 2
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper suggests that prediction of the syntactic category of a word in
CHILDES corpora performs better with a model of the pattern with a slot
(described as "a * b" in the article) than with a model that separates the
pattern heads ('aX') from the tails ('Xb') (described as "aX+Xb").
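The slot-based substitute distribution summarized above can be illustrated with a toy bigram model (a sketch only: the factorization P(w | a _ b) proportional to P(w|a)*P(b|w), the toy corpus, and the add-one smoothing are assumptions for illustration, not the paper's setup):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)


def p(nxt, prev):
    """Bigram probability P(nxt | prev) with add-one smoothing."""
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + len(unigrams))


def substitutes(a, b):
    """Distribution over words w that could fill the slot in 'a * b',
    assuming P(w | a _ b) is proportional to P(w|a) * P(b|w)."""
    scores = {w: p(w, a) * p(b, w) for w in unigrams}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


dist = substitutes("the", "sat")  # words that could fill "the * sat"
assert max(dist, key=dist.get) in ("cat", "dog")
```

On this toy corpus the highest-probability substitutes for "the * sat" are the words that actually occur after "the" and before "sat", which is the sense in which the substitute distribution carries grammatical (and some semantic) information.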
That "a*b" performs better than "aX + Xb" in prediction seemed obvious to me
in the presented experiment, since, as in Figure 1, the information used for
prediction by the former is larger than by the latter. In this sense, the
comparison performed in this paper is not a "fair comparison". Maybe I
misread, but if not, the authors must make sure to convince the readers that
they are conducting a fair comparison.
Therefore, I did not understand what claim is made in this article, and my
judgement with respect to a COLING presentation unfortunately has to be
rejection.
Apart from this main point, there are controversial issues related to
this article.
1. The authors claim that "a*b" corresponds to a "paradigmatic"
representation, whereas "aX+Xb" is "syntagmatic", but the use of these two
terms should be reconsidered. In linguistics, paradigmatic and syntagmatic
have their origin in F. de Saussure's definitions. Calling "a*b"/"aX+Xb"
paradigmatic/syntagmatic seems too broad.
2. The authors refer to an EMNLP-CoNLL paper (the last reference), which
has similar keywords in the title. If this referred paper is the authors'
own, then, since the review process must be double blind, the authors should
disclose the author names. If not, it must be shown what exactly is new in
this paper compared with the reference.
3. The authors must explain more about the exact intent of this work. Do the
authors want to engineer the CHILDES text? Then they must explain the
application needs and situate their work among previous work in that
engineering domain. Do the authors want to claim that children must be
processing grammar through "a*b"? Then they must explain how their experiment
could relate to such a cognitive claim.
4. Above all, I wonder whether this paper is appropriate for presentation at
a conference of the ACL community. The majority of the references are in the
cognitive science domain. If a cognitive question lies behind this paper, it
would be better to seek discussion at a conference dedicated to cognitive
science.
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 5
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
## General comments
This paper presents an interesting study about the use of paradigmatic
representations for word category learning in the setting of natural language
acquisition. The authors compare two models of word representation:
syntagmatic (where each word is represented based on its neighbors) and
paradigmatic (where the word is represented using substitute words). The
experiments presented here clearly show the benefit of paradigmatic
representations, and this is a very interesting contribution to the field.
However, I'm having a hard time understanding the dichotomy presented between
these two representations: both seem to rely on context-based substitutability
criteria, since even the paradigmatic representation constructs the substitute
words via n-grams, which in turn are created by looking at the neighboring
words of each target word. A related point is that there is a slight
misrepresentation of Harris's distributional hypothesis in the example
presented in the Introduction: in Harris's view, substitutability arises
through the use of all (total) environments of the words/morphemes; it is
therefore "unfair" to compare the substitutes, which have been created from a
very large collection of environments (n-grams), with the local context of
these two sentences.
## Specific issues/comments
The first sentence of your introduction would find an entire section of the
field in disagreement, namely researchers who subscribe to the constructivist
approach who believe that children learn rules about individual words and only
later form abstract syntactic categories.
After the example presented in the Introduction, you mention that "the high
probability substitute reflect both semantic and grammatical properties". As
many papers in the distributional semantics area have shown, this is not
necessarily the case, especially if you select a very small window from which
to create your context representations (something which applies to the
n-grams you use in this work), in which case they mostly reflect syntactic
properties.
How do you justify the claim that the Redington et al. method "lacks
completeness"?
There are a few missing references from the part of speech induction literature
that would be interesting to include: Alex Clark's 2003 "Combining
distributional and morphological information for part of speech induction",
Christodoulopoulos et al. 2011 "A Bayesian Mixture Model for PoS Induction
Using Multiple Features", and Blunsom and Cohn's "A Hierarchical Pitman-Yor
Process HMM for Unsupervised Part of Speech Induction".
In the input corpora that you used, did you remove the punctuation marks? They
tend to provide a very strong distributional cue and are sometimes explicitly
removed to better simulate the linguistic input of children.
It is unclear what the contribution of Figure 1 is. Is it theoretically better
or worse to have more units as input to the model? Is there any evidence to
support these theoretical claims?
Given the size of the training corpus for the n-gram model, I would suggest
that the comparison to the syntagmatic models which are trained on
significantly fewer words seems problematic.
I would've liked to see a per-tag graph of the number of substitutes for each
target word, since I would think that for some word classes (interjections)
this number would introduce a lot of noise.
It would be better if you described the connection between the results you
get in this work and those of St. Clair et al. 2010.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 4
Originality: 4
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 3
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper describes a comparison of two approaches to POS tagging, one
(syntagmatic) in which tags are assigned based on ngrams, and one
(paradigmatic) in which tags are assigned based on possible lexical
substitutions. The paper reports the latter (paradigmatic) gives higher
accuracy.
Overall it's an interesting result, but I have some reservations about how
generally the results may be interpreted. First, if this model is to be taken
as an acquisition model, as Section 1 seems to describe, I'm surprised the
model uses supervised learning. Second, given this model is a supervised POS
tagger, the accuracy results seem rather low compared to state-of-the-art
sequence model taggers. This is potentially a problem because the gains due to
paradigmatic modeling may not turn out to be significant in the presence of a
more robust syntagmatic baseline.
Minor comments:
The last term in equation 3 should be "P(w_{n-1}|w^{n-2}_0)" (the zero is
missing).
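Assuming equation 3 is the standard chain-rule factorization of the sequence probability (the equation itself is not reproduced in this review, so this is a reconstruction from the reviewer's correction), the corrected form would read:

```latex
P(w_0^{n-1}) = \prod_{i=0}^{n-1} P(w_i \mid w_0^{i-1})
             = P(w_0)\, P(w_1 \mid w_0) \cdots P(w_{n-1} \mid w_0^{n-2})
```

That is, the history of the final conditional runs from w_0 to w_{n-2}; the zero subscript on the history is what the reviewer notes is missing.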
Figures 2 and 3 should use bar graphs, since the slopes of the lines between
columns aren't interpretable.
"patter(n)s" p. 5.
You may want to cite work on syntagmatic and paradigmatic parser acquisition
by Simon Dennis at U. Adelaide in Australia.