Barret Zoph, Deniz Yuret, Jon May and Kevin Knight. 2016. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1568--1575, Austin, Texas, November. Association for Computational Linguistics. [ai.ku]
Title: Transfer Learning for Low-Resource Neural Machine Translation
Authors: Barret Zoph, Deniz Yuret, Jonathan May and Kevin Knight
Review #1
Appropriateness: 5
Clarity: 4
Originality: 4
Soundness / Correctness: 4
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 4
Comments
Neural machine translation (NMT) performs much worse than phrase-based or syntax-based statistical machine translation (SMT) for low-resource language pairs. This paper adopts transfer learning to improve low-resource NMT: an NMT model is first trained on a high-resource language pair, and some of the learned parameters are then transferred to the low-resource NMT model. In this way, the low-resource NMT model achieves performance comparable to SMT.
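For concreteness, the parent-to-child transfer summarised above can be sketched as follows. This is an illustrative PyTorch reconstruction, not the authors' code: the architecture, sizes, and the choice to freeze the target embeddings are assumptions made only for the example.

```python
# Illustrative sketch of parent->child parameter transfer for NMT.
# Not the authors' implementation; sizes, names, and the frozen components
# are assumptions for the example. Forward pass and training loops omitted.
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)   # source-side embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)   # target-side embeddings
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

# 1) Train the parent on the high-resource pair (training loop omitted).
parent = TinySeq2Seq(src_vocab=20000, tgt_vocab=20000)

# 2) Build the child with identical shapes and copy every parent parameter;
#    child source words are simply assigned to parent embedding rows.
child = TinySeq2Seq(src_vocab=20000, tgt_vocab=20000)
child.load_state_dict(parent.state_dict())

# 3) Freeze some components (here the target embeddings, as an assumption)
#    and fine-tune the rest on the low-resource pair.
for p in child.tgt_emb.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(
    (p for p in child.parameters() if p.requires_grad), lr=0.5)
```

In a setup like this, the learning-rate question raised below corresponds simply to the `lr` value handed to the child's optimizer; whether the parent's rate is reused or re-tuned is exactly what the reviewer is asking.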
Although the idea is simple, experiments on four low-resource language pairs demonstrate its effectiveness. Moreover, the paper includes a detailed and well-presented analysis, discussing the effect of different high-resource parent languages, fixing different parts of the model, learning curves, etc.
The section about the key idea (Section 3) lacks a bit of clarity; I only understood the whole idea after reading Sections 4 and 5. Perhaps you should make it clearer in Section 3 exactly how you apply transfer learning to NMT. A small question related to this: do you use the same learning rate to train the parent model and the child model? If not, how do you set the learning rate?
Review #2
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 3
Reviewer Confidence: 4
Comments
The paper presents a transfer learning approach for low-resource NMT. The idea is extremely simple: a full-scale parent model is trained and used to initialise a child model for a low-resource language pair. The approach seems to be surprisingly effective, and the authors convincingly present empirical evidence for the method. The final model still falls below the performance of the baseline syntax-based SMT model in three out of four language pairs, but outperforms the baseline when used for rescoring n-best lists.
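The n-best rescoring mentioned at the end of the previous paragraph is, in its generic form, a simple score interpolation. The sketch below is a hedged illustration of that generic recipe, not the paper's exact setup; the interpolation weight `lam` and the stand-in NMT scorer are assumptions.

```python
# Generic n-best rescoring sketch: interpolate the original SMT model score
# with an NMT log-probability and re-rank. Illustrative only; the weight and
# scoring details are assumptions, not the paper's configuration.

def rescore_nbest(nbest, nmt_logprob, lam=0.5):
    """nbest: list of (hypothesis, smt_score); nmt_logprob: hypothesis -> float."""
    rescored = [
        (hyp, lam * smt_score + (1.0 - lam) * nmt_logprob(hyp))
        for hyp, smt_score in nbest
    ]
    return max(rescored, key=lambda item: item[1])[0]

# Toy usage with a stand-in NMT scorer (hypothetical scores).
nbest = [("the cat sat", -2.1), ("cat the sat", -2.4)]
best = rescore_nbest(nbest, nmt_logprob=lambda hyp: -0.1 * len(hyp.split()))
print(best)  # -> "the cat sat"
```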
It seems surprising that the randomised mapping of word embeddings to the low-resource language word types actually works well enough. It would be interesting to know how much they change during training and how much they differ from training without a parent model (if that can be quantified in some reasonable way). In the analysis part, the authors present a test in which they improve the mapping with dictionary-based assignment, which sounds much more intuitive. But the effect of this initialisation is minimal and seems to fade out completely in later training epochs. I'm still puzzled why there shouldn't be any effect from more appropriate mappings. Is there a problem with the dictionary-based assignments?
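To make the two initialisation schemes contrasted above concrete, here is a minimal sketch of assigning parent source-embedding rows to child source words either at random or via a bilingual dictionary. The dictionary format and the fallback behaviour are assumptions; the paper's actual vocabulary handling may differ.

```python
# Sketch: initialise child source embeddings from a parent embedding matrix,
# either randomly or via a (child word -> parent word) dictionary.
# Illustrative only; not the authors' implementation.
import numpy as np

def init_child_embeddings(parent_emb, parent_vocab, child_vocab,
                          dictionary=None, seed=0):
    rng = np.random.default_rng(seed)
    child_emb = np.empty((len(child_vocab), parent_emb.shape[1]),
                         dtype=parent_emb.dtype)
    parent_index = {w: i for i, w in enumerate(parent_vocab)}
    for i, word in enumerate(child_vocab):
        translation = dictionary.get(word) if dictionary else None
        if translation in parent_index:
            row = parent_index[translation]        # dictionary-based assignment
        else:
            row = rng.integers(len(parent_vocab))  # random assignment (fallback)
        child_emb[i] = parent_emb[row]
    return child_emb
```

Comparing a child trained from the dictionary-based initialisation against one trained from the random one (and tracking how far the embeddings drift during training) would be one way to quantify the difference the review asks about.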
Another question is whether this approach would also work the other way around. Could you fix the source language embeddings when translating from English into low-resource languages, and would you expect the same kind of improvements? Did you already try this, and do you have anything to report in the reverse direction? This would be more interesting than some of the ablation tests (like 5.2, which seems to argue for the same point as 5.1). In any case, the discussion of the effect of the parent language pair is limited, as it does not systematically evaluate the effect across many language pairs and language families with various kinds of relationships.
I also have a question about the syntax-based model used as the baseline. What target syntax are you using in your string-to-tree model, and how does your setup compare to state-of-the-art settings? 26 BLEU for the English-French system sounds low to me; what is the test set, and how does the score compare to the state of the art?
Another question concerns evaluation beyond BLEU scores. What happens to the actual translations, and what do humans think of them? The limits of automatic evaluation metrics are well known, and the authors should discuss translation quality in other terms as well.
Review #3
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 4
Meaningful Comparison: 5
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 3
Comments
The paper presents a method for transferring parameters from an NMT model learned on a large data set (fr-en) to a low resource translation task with the same output language (ha-en, tr-en, uz-en and ur-en). Part of the transfer process is selecting which parameters to fine-tune to the low resource translation task.
Although I'm not convinced that the presented method is the best way to leverage the larger data set in a low resource translation setting, I strongly believe that this method needs to be presented to the community.
In the prose you say that French' would be a good parent language for French, and I agree; in the experiment it seems like you flipped them so that French is the parent language, and French' is the child language. I don't know whether this makes any difference in your model at all, but please be consistent.
In Table 7 you report ablation tests, measuring model perplexity and BLEU on the dev set. I'm left with the question of how these two quantities correlate with BLEU on an unseen test set. This question comes up again in the dictionary initialization experiments (Figure 4), where perplexity is again reported, but there is no way to tell how well it correlates with translation performance.
Minor comments
s3p2: "employing w a separate" -> "employing a separate"
Fig1: the colors are really hard to see on a regular office laser printer print-out.
Fig2: I'm guessing that, e.g., "Source Word" and "Source Word" are different source words, and that they may be close to each other, but please use proper indexing to clarify exactly how.
Xing Shi, Kevin Knight and Deniz Yuret. 2016. Why Neural Translations are the Right Length. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2278--2282, Austin, Texas, November. Association for Computational Linguistics. [ai.ku]
Title: Why Neural Translations are the Right Length
Authors: Xing Shi, Kevin Knight and Deniz Yuret
Review #1
Appropriateness: 5
Clarity: 5
Originality: 4
Soundness / Correctness: 5
Meaningful Comparison: 5
Substance: 4
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 5
Reviewer Confidence: 4
Comments
This paper convincingly explains why NMT translations are the right length: the authors show the mechanism (components of the vectors that keep track of length during encoding and decoding) explicitly with a very clear toy example, as well as the same tendency in a real-world task. The paper is admirably clearly written and very accessible.
Please keep in mind that some people still print papers out, and stick to a coloring scheme that is legible in grayscale (blue and red are indistinguishable after the grayscale dimensionality reduction has been applied).
s1p1: "covert that vector in a target sentence." -> "convert that vector into a target sentence."
Review #2
Appropriateness: 5
Clarity: 4
Originality: 4
Soundness / Correctness: 4
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 3
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 3
Comments
This paper investigates the question of how neural MT models manage to produce output of the right length. Starting with a small toy problem and proceeding to an actual neural MT system, the paper shows that (groups of) specific cells are "dedicated" to encoding sentence length.
The paper is refreshingly different from many other NMT papers I've seen lately, in that it attempts to understand what's going on within the neural model, thus addressing a point of criticism that is occasionally brought forward against neural approaches to MT: that they are black boxes and no-one seems to care about what's going on inside.
The paper is well written and generally easy to understand. What I'm missing most is a good motivation for addressing this question. What do we gain from knowing that neural models explicitly encode length? Also, is this behavior consistent across neural models? What happens if we start model training with different random initializations?
A few minor quibbles:
- The images in Figure 2 are too small. You've got plenty of room left (6 pages max.!) to provide bigger images.
- Figures 4 and 5 should be tables, not figures.
Review #3
Appropriateness: 5
Clarity: 5
Originality: 3
Soundness / Correctness: 5
Meaningful Comparison: 4
Substance: 4
Impact of Ideas / Results: 4
Impact of Accompanying Software: 1
Impact of Accompanying Dataset / Resource: 1
Recommendation: 4
Reviewer Confidence: 4
Comments
One of the issues with NMT is the apparent opacity of the model. It is hard to know what is going on inside the black box. The authors start to peel back the curtain here by investigating the question of how NMT models output target sentences of the right length. By looking at both a toy auto-encoder with 4 units and a regular NMT model, they provide interesting insight and show that the model devotes a small handful of units specifically to the token-counting task.
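A rough way to reproduce this kind of probing on any encoder is to correlate each LSTM cell-state unit with the encoding timestep. The sketch below uses a random, untrained PyTorch LSTM purely to show the mechanics, so the correlations it prints are meaningless; with a trained model, units with |r| close to 1 would be the "counting" candidates. This is not the authors' toy model or code.

```python
# Probe sketch: correlate each LSTM cell-state unit with the timestep to find
# length-tracking units. Uses an untrained LSTM only to demonstrate mechanics;
# not the authors' toy model or implementation.
import torch

torch.manual_seed(0)
dim, steps = 4, 30
lstm = torch.nn.LSTM(input_size=8, hidden_size=dim, batch_first=True)
x = torch.randn(1, steps, 8)          # one random "sentence" of 30 steps

cell_states, state = [], None
with torch.no_grad():
    for t in range(steps):
        _, state = lstm(x[:, t:t + 1, :], state)
        cell_states.append(state[1].squeeze().clone())   # c_t, shape (dim,)

C = torch.stack(cell_states)                      # (steps, dim)
t_idx = torch.arange(steps, dtype=torch.float32)
for unit in range(dim):
    r = torch.corrcoef(torch.stack([t_idx, C[:, unit]]))[0, 1]
    print(f"unit {unit}: corr with timestep = {r.item():.2f}")
```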
The paper is well written & easy to follow and the insights will be of interest to the EMNLP audience.