Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, vol. 7, pp. 551--585.
(PA). [perceptron]
File: crammer06a - annotated.pdf
Annotation summary:
--- Page 2 ---
Note (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
how do we measure capacity in this case? num SV? margin? |w|?
Note (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
functional margin?
Highlight (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
we would like the new classifier to remain as close as possible to the current one while achieving at least a unit margin on the most recent example.
Note (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
why remain close?
Highlight (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
Kivinen and Warmuth, 1997
Highlight (yellow), Jun 16, 2014, 1:52 PM, Deniz Yuret:
the core of our construction can be viewed as finding a support vector machine based on a single example while replacing the norm constraint of SVM with a proximity constraint to the current classifier
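in symbols, my reading of that single-example problem (unlike SVM, which minimizes |w|^2 subject to margin constraints on all examples):
\[
w_{t+1} = \arg\min_{w} \tfrac{1}{2}\|w - w_t\|^2
\quad \text{s.t.} \quad \ell(w;(x_t,y_t)) = 0,
\qquad
\ell(w;(x,y)) = \max\{0,\; 1 - y\,(w \cdot x)\}.
\]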
--- Page 3 ---
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
the signed margin is positive if the example is correctly classified; is the margin used here signed or unsigned?
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
still, a margin of 0 vs. a nonzero margin makes a difference. why?
--- Page 7 ---
Note (yellow), Jun 17, 2014, 12:07 PM, Deniz Yuret:
pa seems to collect a lot more svs. this is probably due to the use of hinge loss. the aggressive version of cotter13 is claimed to have fewer svs. what gives?
we should write an online library with all perceptron variants, as pa seems to be very close in character. at least write a summary and post it on the blog. compare with ROMMA, LaSVM, MIRA, etc.
does pa optimize the svm objective?
Note (yellow), Jun 17, 2014, 7:21 PM, Deniz Yuret:
tau_t is the learning rate for iteration t:
w_{t+1} = w_t + tau_t y_t x_t
if there is a margin violation, i.e. hinge loss l_t > 0.
tau_t = l_t / |x_t|^2
or some variation thereof: PA-I caps tau_t at C, PA-II uses l_t / (|x_t|^2 + 1/(2C)).
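a minimal sketch of the closed-form updates as I read them (plain NumPy; pa_update and the toy stream below are my own names, not from the paper):

    import numpy as np

    def pa_update(w, x, y, C=1.0, variant="PA-I"):
        """One passive-aggressive step for binary classification, y in {-1, +1}."""
        loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss of w_t on the new example
        if loss == 0.0:                          # passive step: margin already >= 1
            return w
        sqnorm = np.dot(x, x)
        if variant == "PA":                      # hard-constraint version
            tau = loss / sqnorm
        elif variant == "PA-I":                  # linear slack penalty: cap tau at C
            tau = min(C, loss / sqnorm)
        else:                                    # PA-II: squared slack penalty
            tau = loss / (sqnorm + 1.0 / (2.0 * C))
        return w + tau * y * x                   # aggressive step toward unit margin

    # toy usage: one pass over a linearly separable stream
    rng = np.random.default_rng(0)
    u = rng.normal(size=5)                       # hidden separator
    w = np.zeros(5)
    for _ in range(1000):
        x = rng.normal(size=5)
        y = 1.0 if np.dot(u, x) > 0 else -1.0
        w = pa_update(w, x, y, C=1.0, variant="PA-I")

the only difference between the variants is how tau is computed from the loss, which is why a single library covering all the perceptron-style updates seems doable.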
Drawing (red) x2, Jun 17, 2014, 7:21 PM, Deniz Yuret
--- Page 8 ---
Drawing (red) x7, Jun 17, 2014, 7:21 PM, Deniz Yuret
Drawing (red) x4, Jun 17, 2014, 7:43 PM, Deniz Yuret
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
this pa mistake bound would apply to regular perceptron?
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
work on applying the above bound to perceptron?
--- Page 11 ---
Note (yellow), Jun 18, 2014, 7:57 PM, Deniz Yuret:
go through each of these bound proofs.
some bound the number of mistakes, some bound the cumulative hinge loss.
the bound is usually a function of l*, the cumulative loss of some fixed vector u, and of |u|...
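rough shapes from memory, constants to be checked against the actual theorems (R = max_t |x_t|, u any fixed vector, l*_t its hinge loss on round t):
\[
\text{separable case (PA):}\quad \sum_t \ell_t^2 \;\le\; R^2\,\|u\|^2,
\qquad
\text{PA-I mistake bound:}\quad M \;\le\; \max\{R^2,\, 1/C\}\Big(\|u\|^2 + 2C\sum_t \ell_t^*\Big).
\]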
--- Page 14 ---
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
uniclass prediction: but that means yt is iid?
--- Page 17 ---
Note (yellow), Jun 19, 2014, 10:43 AM, Deniz Yuret:
ok, in real sequence prediction we would have autoregression, i.e. y_t as a function of y_{t-i}. can we try a language model with word vectors and kernels?
Highlight (yellow), Jun 19, 2014, 10:43 AM, Deniz Yuret:
the algorithm outputs a score for each of the k labels in Y.
--- Page 18 ---
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
multiclass same as lsp
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
this margin is like lsp and unlike multiclass perceptron in that we are only interested in the top wrong answer, not all wrong answers above the right answer.
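in symbols, the margin and loss I believe they use (notation roughly following the paper; only the top-scoring wrong label enters):
\[
\gamma_t = w_t \cdot \Phi(x_t, y_t) - \max_{r \neq y_t} w_t \cdot \Phi(x_t, r),
\qquad
\ell_t = \max\{0,\; 1 - \gamma_t\}.
\]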
--- Page 19 ---
Highlight (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
we focus only on the single constraint which is violated the most by wt.
Highlight (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
satisfying all of these constraints simultaneously leads to the online algorithm presented in (Crammer and Singer, 2003a).
Highlight (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
Their online update is more involved and computationally expensive, and moreover, their analysis only covers the realizable case.
--- Page 20 ---
Highlight (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
PA-I and PA-II
Note (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
figure out why there are two versions and which one is better. also why pa and not svm or mira etc? what objective is pa the sgd for? need to make a chart of all update variants and their objectives and mistake bounds.
--- Page 22 ---
Note (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
pa makes sure we stay close to previous w vector. why is this any better than the alternatives?
Note (yellow), Jun 19, 2014, 11:13 AM, Deniz Yuret:
focusing on a single constraint is the only feasible way in structured prediction.
--- Page 23 ---
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
the max loss update requires what lsp calls cost-sensitive decoding.
--- Page 26 ---
Highlight (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
In summary, both algorithms suffer from some theoretical disadvantage and neither of them is theoretically superior to the other.
--- Page 27 ---
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
why does pa accumulate more sv than alternatives?
Note (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
in PA-I the slack variable enters the cost linearly, as in SVM.
in PA-II its square is added.
in both, rather than minimizing |w|^2, we minimize the distance to the previous weight vector.
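written out, my reading of the two constrained problems (binary case; compare with the paper's formulation):
\[
\text{PA-I:}\quad w_{t+1} = \arg\min_{w,\,\xi \ge 0}\; \tfrac{1}{2}\|w - w_t\|^2 + C\,\xi
\quad \text{s.t.}\quad \ell(w;(x_t,y_t)) \le \xi,
\]
\[
\text{PA-II:}\quad w_{t+1} = \arg\min_{w,\,\xi}\; \tfrac{1}{2}\|w - w_t\|^2 + C\,\xi^2
\quad \text{s.t.}\quad \ell(w;(x_t,y_t)) \le \xi.
\]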
--- Page 31 ---
Highlight (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
the performance of PA-I is comparable to that of MIRA with a slight advantage to the latter. However, while each online update of MIRA requires solving a complex optimization problem, each update of PA has a simple closed-form expression and is thus much faster and easier to implement.
--- Page 32 ---
Highlight (yellow), Jun 19, 2014, 12:18 PM, Deniz Yuret:
The update taken by our algorithms is aggressive in the sense that even a small loss forces an update of the hypothesis. When using kernels, this property often results in the use of many examples for representing the learned predictor. Thus, the memory requirements imposed when using kernels can be quite demanding.
(report generated by GoodReader)