Deniz Yuret. 2016. Knet: beginning deep learning with 100 lines of Julia. In Machine Learning Systems Workshop at NIPS 2016, December. [ai.ku]
Masked Reviewer ID: Assigned_Reviewer_2
Review:
Overall rating: Accept
Reviewer expertise: Some familiarity
Paper strengths (2-3 brief points):
- Knet provides useful building blocks for efficiently implementing standard machine learning algorithms.
- The concise implementations of various algorithms deliver performance competitive with existing frameworks.
- Knet's gradient-based training lets applications define models as ordinary code in an existing high-level language.
Paper weaknesses (2-3 brief points please):
- The paper does not clearly describe the limitations of the approach taken by Knet. Are there loss functions used in practice that cannot be handled by autograd? Are there algorithms used in practice that do not use gradient descent?
- The paper does not clearly state which functions are implemented or supplied by Knet. Is the grad function used in Section 2 the same grad function supplied by autograd in Section 3.1? (A minimal sketch of the grad workflow follows below.)
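To make the grad question concrete, here is a minimal sketch of the workflow in question, assuming Knet's grad mirrors the interface of the Python autograd grad; the loss definition and toy data are illustrative, not the paper's exact code:

```julia
using Knet  # exports grad, per Section 3.1

# Illustrative loss (assumption, not the paper's code); it takes the
# weights w as its first argument, which is what grad differentiates.
loss(w, x, y) = sum(abs2, w[1]*x .+ w[2] .- y) / size(y, 2)

w = Any[0.1*randn(3, 5), zeros(3, 1)]  # toy weights
x, y = randn(5, 10), randn(3, 10)      # toy minibatch

lossgradient = grad(loss)   # a function with the same signature as loss
dw = lossgradient(w, x, y)  # dw mirrors the structure of w
```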
Detailed Comments:
How are the KnetArrays used by an application? Are the code segments in Section 2 using KnetArrays, or would the caller of "train" just supply arguments of type KnetArray?
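One plausible answer, sketched under the assumption that train keeps the paper's generic Section 2 signature train(w, data; lr=.1) and never mentions the array type itself; the weights and minibatches here are hypothetical toy values:

```julia
using Knet

# Hypothetical caller: the device choice lives entirely here. Wrapping
# weights and data in KnetArray moves them to the GPU; the generic
# train loop from Section 2 is assumed unchanged.
w = Any[KnetArray(0.1*randn(10, 784)), KnetArray(zeros(10, 1))]
data = [(KnetArray(randn(784, 100)), KnetArray(rand(10, 100))) for _ in 1:5]
w = train(w, data; lr=0.1)
```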
It would be useful to see larger-scale benchmarks in Section 2.6. In addition, these benchmarks should state whether a CPU or a GPU is being used, which models, how much memory, and which versions of the various pieces of software were tested.
The code examples in Section 2 are clear and concise, which is very good, but some details of how the functions behave may be lost on readers unfamiliar with Julia (e.g., the array slicing syntax). In particular, the loss functions in 2.5 could be explained in a little more detail.
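For readers unfamiliar with Julia, here is a sketch of the kind of loss Section 2.5 uses and of the slicing idiom in question; this is an illustration, not the paper's exact code:

```julia
# Illustrative softmax (negative log-likelihood) loss: each column of
# x and ygold holds one instance of the minibatch.
function softloss(w, x, ygold)
    ypred = w[1]*x .+ w[2]
    ynorm = ypred .- log.(sum(exp.(ypred), dims=1))  # log-softmax per column
    return -sum(ygold .* ynorm) / size(ygold, 2)     # average over instances
end

# The slicing idiom: x[:, r] selects the columns (instances) in range r.
x = randn(784, 1000)
xbatch = x[:, 1:100]   # first 100 instances
```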
The paper could be clearer if ordered slightly differently. Section 3 describes the functions provided by the Knet library, which are used by the example implementations in Section 2. It may make more sense to swap this order.
In 3.2, you mention that KnetArrays use a custom memory manager to reuse garbage-collected pointers. This may be an interesting detail to elaborate on and evaluate the effects of (though in a short paper you may be space-limited).
Also in 3.2, you state that the paper's examples run up to 17x faster on a GPU than on a CPU. This number needs to be contextualized with the models of the CPU and GPU you are using.
Masked Reviewer ID: Assigned_Reviewer_3
Review:
Overall rating: Weak reject
Reviewer expertise: Some familiarity
Paper strengths (2-3 brief points):
- The automatic differentiation approach seems like it could be an interesting mechanism, assuming the set of programs it works on is well characterized.
- It's nice to see higher-level constructs applied to traditionally low-level environments like GPUs.
- The code segments look like something a reasonable person could write.
Paper weaknesses (2-3 brief points please):
- The contribution and motivation are unclear. It reads like the major contribution is the porting of an automatic differentiation module to Julia, along with some syntactic sugar over a CUDA library, though I assume there is more to it than that.
- Though the automatic differentiation technique looks promising, blackbox program transformations are slightly fraught in that they can fail on otherwise reasonable programs if they encounter a construct the author did not expect.
Detailed Comments:
Overall, I'm slightly confused about who the paper's target audience is. Based on the examples in the paper, Knet targets those who wish to design their own models, i.e., people with a fairly sophisticated ML background. Yet for those readers much of the code in Section 2 is redundant; a compelling example would simply take one non-trivial model and compare it against the next best framework. (Caffe is probably not the best example to compare against: the decision to use a declarative rather than an algorithmic approach is usually more fundamental than the choice of which library gets used.)
Furthermore, it's ambiguous what is actually being benchmarked: footnote 1 says the testbed uses AWS GPU instances, but lines 262-263 imply that only CPU code was actually tested. Considering the importance of GPU targets, this needs to be addressed, preferably with a GPU comparison. Additionally, since TensorFlow is designed to function in a distributed setting, comparing against it invites questions about Knet's scalability. Since such a comparison is likely necessary, it may be a good idea to head off the issue by determining at what dataset size such scalability actually becomes beneficial. Also, since the benchmarks use different models, it would be good to have an indication that the Knet-based ones maintain rough parity on test data as well; as is, there's no indication that they produce correct results.
Additionally, while automatic differentiation looks nifty, a cursory glance at the Python autograd documentation implies the technique only works for a limited subset of the language's features. For a blackbox program transformation to be usable, it really requires a clear distinction between which language features work and which don't, ideally one that is automatically checked. Lacking this forces users to retain a deep knowledge of which constructs are valid, and to intuit when incorrect outputs are due not to their own bugs but to a mismatch between what they wrote and what the Knet backend can handle. It may be that you're already providing this, but if so, that should be mentioned in the paper.
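To make the concern concrete, here is a sketch of the kind of boundary being asked for; the specific failure mode (in-place mutation) is an assumption drawn from the documented restrictions of the Python autograd, not a verified property of Knet:

```julia
using Knet

# A purely functional loss: the kind of code tape-based AD handles.
ok_loss(w, x) = sum(abs2, w * x)

# A loss that mutates an intermediate array in place: the kind of
# construct blackbox AD tools commonly reject or silently mishandle.
function risky_loss(w, x)
    y = w * x
    y[1] = 0.0          # in-place mutation
    return sum(abs2, y)
end

g = grad(ok_loss)       # expected to work
# grad(risky_loss) may error or return wrong gradients; a usable tool
# should make this boundary explicit and automatically checkable.
```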
Similarly, if you are achieving auto-vectorization, either through clever use of operator overloading or through Julia's toolchain, this is a significant achievement and needs to be made explicit. As written, I cannot tell how much effort, if any, is required to translate the example models into performant versions.