Onur Kuru, Ozan Arkan Can and Deniz Yuret. 2016. CharNER: Character-Level Named Entity Recognition. In COLING, December. [ai.ku]
COLING 2016 review:
Title: CharNER: Character-Level Named Entity Recognition
Authors: Onur Kuru, Ozan Arkan Can and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 3
Readability and clarity: 4
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 3
Overall recommendation: 4
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
The paper proposes character-based NER for languages with word segmentation.
Character-based tagging has been proposed previously; their contribution is to
apply LSTM models in the character-based tagging setting.
However, the results are only fair for the targeted languages.
In my opinion, the method may be promising for languages without word
segmentation, such as Chinese and Japanese, since word segmentation errors
affect NER scores.
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 5
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper presents a character-based model for named-entity recognition based
on bidirectional LSTMs. There is very recent research that tries to do
something similar, at NAACL-16 (Lample et al.) and ACL-16 (Ma and Hovy, for
example), with the exact same motivation: remove external resources, such as
gazetteers, and use a character-based approach to achieve high results. This
should not invalidate the paper, though.
The main difference from previous research (mentioned above) is that this model
examines a sentence as a sequence of characters and outputs a tag distribution
for each character. They later use transition matrices that only allow tags
consistent within each word.
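The character-level formulation described here can be sketched as a round trip between word-level tags and character-level tags (a minimal illustration, not the authors' code; the tag names and the majority-vote collapse are assumptions made for the sketch):

```python
def word_tags_to_char_tags(words, tags):
    """Expand word-level NER tags to one tag per character.

    The space between words gets the 'O' tag, so the model can
    treat the sentence as one flat character sequence.
    """
    char_tags = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        if i > 0:
            char_tags.append("O")  # tag for the separating space
        char_tags.extend([tag] * len(word))
    return char_tags


def char_tags_to_word_tags(words, char_tags):
    """Collapse character tags back to word tags by majority vote
    over each word's span (one simple consistency heuristic)."""
    word_tags, pos = [], 0
    for i, word in enumerate(words):
        if i > 0:
            pos += 1  # skip the space character's tag
        span = char_tags[pos:pos + len(word)]
        word_tags.append(max(set(span), key=span.count))
        pos += len(word)
    return word_tags


char_tags = word_tags_to_char_tags(["John", "works"], ["PER", "O"])
# 10 characters in "John works" -> 10 tags: PER over "John", O elsewhere
assert char_tags_to_word_tags(["John", "works"], char_tags) == ["PER", "O"]
```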
The results are good, though not the best overall. They are very good
compared to systems that do not use external resources, including word
embeddings; however, it should be a requirement to report the results of
the other systems without external resources (see Lample et al., for example).
In Table 6, you present results for Ma and Hovy and Lample et al. and
include them in the "External" row; as far as I know, they only use word
embeddings (if that). I think you should add "word embeddings" to the
caption, otherwise readers might think that they use gazetteers.
Some missing references: two EMNLP-15 papers that presented interesting results
for tagging, parsing and language modeling using character-based embeddings.
- Wang Ling, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez,
Silvio Amir, Luis Marujo and Tiago Luis. Finding Function in Form:
Compositional Character Models for Open Vocabulary Word Representation.
- Miguel Ballesteros, Chris Dyer and Noah A. Smith. Improved Transition-based
Parsing by Modeling Characters instead of Words with LSTMs.
This paper is also worth mentioning, since it also produces an entire character
sequence for sentences:
- Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl and William
Cohen. Tweet2Vec: Character-Based Distributed Representations for Social Media.
Minor comment:
Lample et al. do more than LSTM-CRF; they also presented a shift-reduce
algorithm that exploits character-based embeddings.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 5
Meaningful comparison: 4
Substance: 4
Impact of ideas: 4
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
Very interesting work.
It clearly shows that a deep bidirectional LSTM architecture combined with a
Viterbi decoder effectively finds language specific features for NER.
Applying the algorithm to languages written without space characters, e.g.,
Chinese and/or Japanese, may be interesting.
=================================
EMNLP 2016 review:
Title: CharNER: Character-Level Named Entity Recognition
Authors: Onur Kuru, Ozan Arkan Can and Deniz Yuret
Instructions
The author response period has begun. The reviews for your submission are displayed on this page. If you want to respond to the points raised in the reviews, you may do so in the box provided below.
The response should be entered by 17 July 2016 (11:59pm Pacific Daylight Time, UTC-7).
The response can be edited multiple times during the author response period.
Please note: you are not obligated to respond to the reviews.
Review #1
Appropriateness: 5
Clarity: 4
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 5
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 3
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 5
Comments
This paper presents a named entity recognizer in which the entire sentence is encoded as a sequence of characters and a bidirectional LSTM is used to make predictions. This is unlike previous (and recent) approaches, such as Lample et al. 2016, which presented character-based representations of words and then an LSTM/bidirectional-LSTM/stack-LSTM on top of them. The model is similar to the Tweet2Vec model recently accepted at ACL 2016, even though it tries to solve a different task. They examine a sentence as a sequence of characters and output a tag distribution for each character. This model, like Lample et al., has the potential of being language independent, and they apply it cross-lingually. The motivation and goals are also similar to Lample et al.: remove external features such as gazetteers, and still achieve high results.
Figure 3 does a great job summarizing the entire paper.
In order to avoid inconsistent character taggings, e.g., "J o h n w o r k s" tagged "P O O O G G G G O", they use a decoder as in Wang et al. 2015 that applies a transition matrix; at the end they output the entire sequence.
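The constrained decoding described here can be sketched as a small Viterbi pass in which a hand-written transition rule stands in for the transition matrix and forbids tag changes inside a word (a hedged illustration, not the paper's exact decoder; the tag set, the probabilities, and the space-based rule are invented for the example):

```python
import math


def viterbi_char_decode(char_probs, chars, tags):
    """Viterbi decoding over per-character tag distributions.

    trans_ok plays the role of the transition matrix: inside a word
    the tag may not change; at a space boundary any transition is
    allowed.
    """
    def trans_ok(prev_tag, tag, prev_char, cur_char):
        return prev_char == " " or cur_char == " " or prev_tag == tag

    # scores[t] = best log-prob of a consistent path ending in tag t
    scores = {t: math.log(char_probs[0][t]) for t in tags}
    back = []
    for i in range(1, len(chars)):
        step, new_scores = {}, {}
        for t in tags:
            best_score, best_prev = max(
                (scores[p], p) for p in tags
                if trans_ok(p, t, chars[i - 1], chars[i]))
            new_scores[t] = best_score + math.log(char_probs[i][t])
            step[t] = best_prev
        scores = new_scores
        back.append(step)
    # backtrack from the best final tag
    tag = max(scores, key=scores.get)
    path = [tag]
    for step in reversed(back):
        tag = step[tag]
        path.append(tag)
    return path[::-1]


# Greedy per-character tagging of "ab c" would give the inconsistent
# sequence P O O O; the constrained decoder keeps "ab" on a single tag.
probs = [{"P": 0.9, "O": 0.1}, {"P": 0.4, "O": 0.6},
         {"P": 0.1, "O": 0.9}, {"P": 0.2, "O": 0.8}]
assert viterbi_char_decode(probs, "ab c", ["O", "P"]) == ["P", "P", "O", "O"]
```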
They make a good comparison with related work, but since some of the models are freely available, I'd expect the authors to run them on the languages without public results (such as Arabic or Turkish).
In Table 5 you should definitely differentiate between systems that use gazetteers and neural models that use pretrained word embeddings. They are not the same thing, and as presented it might confuse the reader.
This is an interesting paper, but it lacks a bit of novelty given all the previous work that already demonstrated the usefulness of characters and sequential models for NER.
Minor comments: Missing ref (?) in related work.
Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret. 2016. Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition. In COLING, December. [ai.ku, scode]
============================================================================
COLING 2016 Reviews for Submission #383
============================================================================
Title: Learning grammatical categories using paradigmatic representations: Substitute words for language acquisition
Authors: Mehmet Ali Yatbaz, Volkan Cirik, Aylin Küntay and Deniz Yuret
============================================================================
REVIEWER #1
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 2
Originality: 2
Technical correctness / soundness: 2
Readability and clarity: 2
Meaningful comparison: 2
Substance: 2
Impact of ideas: 2
Impact of resources: 1
Overall recommendation: 2
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper suggests that prediction of the syntactic category of a word in
CHILDES corpora performs better with a model of the pattern with a slot
(described as "a * b" in the article) than with a model that separates the
pattern heads ('aX') from the tails ('Xb') (described as "aX+Xb").
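The slot-based substitute distribution summarized above can be illustrated with a toy bigram model (a sketch only: the factorization P(w | a _ b) proportional to P(w|a)*P(b|w), the toy corpus, and the add-one smoothing are assumptions for illustration, not the paper's setup):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)


def p(nxt, prev):
    """Bigram probability P(nxt | prev) with add-one smoothing."""
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + len(unigrams))


def substitutes(a, b):
    """Distribution over words w that could fill the slot in 'a * b',
    assuming P(w | a _ b) is proportional to P(w|a) * P(b|w)."""
    scores = {w: p(w, a) * p(b, w) for w in unigrams}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


dist = substitutes("the", "sat")  # words that could fill "the * sat"
assert max(dist, key=dist.get) in ("cat", "dog")
```

On this toy corpus the highest-probability substitutes for "the * sat" are the words that actually occur after "the" and before "sat", which is the sense in which the substitute distribution carries grammatical (and some semantic) information.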
That "a*b" performs better than "aX + Xb" in prediction seemed obvious to me
in the presented experiment, since, as in Figure 1, the information used for
prediction by the former is larger than by the latter. In this sense, the
comparison performed in this paper is not a "fair comparison". Maybe I
misread, but if not, the authors must make sure to convince the readers that
they are conducting a fair comparison.
Therefore, I did not understand what claim is made in this article, and my
judgement with respect to a COLING presentation unfortunately has to be
rejection.
Apart from this main point, there are controversial issues related to
this article.
1. The authors claim that "a*b" corresponds to a "paradigmatic"
representation, whereas "aX+Xb" is "syntagmatic", but the use of these two
terms should be reconsidered. In linguistics, paradigmatic and syntagmatic
have their origin in F. de Saussure's definitions. Calling "a*b"/"aX+Xb"
paradigmatic/syntagmatic seems too broad.
2. The authors refer to an EMNLP-CoNLL paper (the last reference), which
has similar keywords in the title. If this referred paper is the authors'
own, then, since the review process must be double blind, the authors should
disclose the author names. If not, it must be shown what exactly is new in
this paper compared with the reference.
3. The authors must explain more about the exact intent of this work. Do the
authors want to engineer the CHILDES text? Then they must explain the
application needs and situate their work among previous work in that
engineering domain. Do the authors want to claim that children must be
processing grammar through "a*b"? Then they must explain how their experiment
could relate to such a cognitive claim.
4. Above all, I wonder whether this paper is appropriate for presentation at
a conference of the ACL community. The majority of the references are in the
cognitive science domain. If a cognitive question lies behind this paper, it
would be better to seek discussion at a conference dedicated to cognitive
science.
============================================================================
REVIEWER #2
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 5
Originality: 3
Technical correctness / soundness: 4
Readability and clarity: 5
Meaningful comparison: 4
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 4
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
## General comments
This paper presents an interesting study about the use of paradigmatic
representations for word category learning in the setting of natural language
acquisition. The authors compare two models of word representation:
syntagmatic (where each word is represented based on its neighbors) and
paradigmatic (where the word is represented using substitute words). The
experiments presented here clearly show the benefit of paradigmatic
representations, and this is a very interesting contribution to the field.
However, I'm having a hard time understanding the dichotomy presented between
these two representations: both seem to rely on context-based substitutability
criteria, since even the paradigmatic representation constructs the substitute
words via n-grams, which in turn are created by looking at the neighboring
words of each target word. A related point is that there is a slight
misrepresentation of Harris's distributional hypothesis in the example
presented in the Introduction: in Harris's view, substitutability arises
through the use of all (total) environments of the words/morphemes; it is
therefore "unfair" to compare the substitutes, which have been created from a
very large collection of environments (n-grams), with the local context of
these two sentences.
## Specific issues/comments
The first sentence of your introduction would find an entire section of the
field in disagreement, namely researchers who subscribe to the constructivist
approach who believe that children learn rules about individual words and only
later form abstract syntactic categories.
After the example presented in the Introduction, you mention that "the high
probability substitute reflect both semantic and grammatical properties". As
many papers in the distributional semantics area have shown, this is not
necessarily the case, especially if you select a very small window from which
to create your context representations (something which applies to the
n-grams you use in this work), in which case they mostly reflect syntactic
properties.
How do you justify the claim that the Redington et al. method "lacks
completeness"?
There are a few missing references from the part of speech induction literature
that would be interesting to include: Alex Clark's 2003 "Combining
distributional and morphological information for part of speech induction",
Christodoulopoulos et al. 2011 "A Bayesian Mixture Model for PoS Induction
Using Multiple Features", and Blunsom and Cohn's "A Hierarchical Pitman-Yor
Process HMM for Unsupervised Part of Speech Induction".
In the input corpora that you used, did you remove the punctuation marks? They
tend to provide a very strong distributional cue and are sometimes explicitly
removed to better simulate the linguistic input of children.
It is unclear what the contribution of Figure 1 is. Is it theoretically better
or worse to have more units as input to the model? Is there any evidence to
support these theoretical claims?
Given the size of the training corpus for the n-gram model, I would suggest
that the comparison to the syntagmatic models which are trained on
significantly fewer words seems problematic.
I would've liked to see a per-tag graph of the number of substitutes for each
target word, since I would think that for some word classes (interjections)
this number would introduce a lot of noise.
It would be better if you described the connection between the results you
get in this work and those of St. Clair et al. 2010.
============================================================================
REVIEWER #3
============================================================================
---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
Relevance: 4
Originality: 4
Technical correctness / soundness: 4
Readability and clarity: 4
Meaningful comparison: 3
Substance: 3
Impact of ideas: 3
Impact of resources: 1
Overall recommendation: 4
Reviewer Confidence: 3
---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------
This paper describes a comparison of two approaches to POS tagging, one
(syntagmatic) in which tags are assigned based on ngrams, and one
(paradigmatic) in which tags are assigned based on possible lexical
substitutions. The paper reports the latter (paradigmatic) gives higher
accuracy.
Overall it's an interesting result, but I have some reservations about how
generally the results may be interpreted. First, if this model is to be taken
as an acquisition model, as Section 1 seems to describe, I'm surprised the
model uses supervised learning. Second, given this model is a supervised POS
tagger, the accuracy results seem rather low compared to state-of-the-art
sequence model taggers. This is potentially a problem because the gains due to
paradigmatic modeling may not turn out to be significant in the presence of a
more robust syntagmatic baseline.
Minor comments:
The last term in equation 3 should be "P(w_{n-1}|w^{n-2}_0)" (the zero is
missing).
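Assuming equation 3 is the standard chain-rule factorization of the sequence probability (the equation itself is not reproduced in this review, so this is a reconstruction from the reviewer's correction), the corrected form would read:

```latex
P(w_0^{n-1}) = \prod_{i=0}^{n-1} P(w_i \mid w_0^{i-1})
             = P(w_0)\, P(w_1 \mid w_0) \cdots P(w_{n-1} \mid w_0^{n-2})
```

That is, the history of the final conditional runs from w_0 to w_{n-2}; the zero subscript on the history is what the reviewer notes is missing.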
Figures 2 and 3 should use bar graphs, since the slopes of the lines between
columns aren't interpretable.
"patter(n)s" p. 5.
You may want to cite work on syntagmatic and paradigmatic parser acquisition
by Simon Dennis at U. Adelaide in Australia.