Dear Geoff,

I haven't written for a while, but I am still working on decision lists. I wanted to clarify some questions I had regarding the prepend algorithm for decision lists. I have a new implementation based on OPUS and a couple of students trying to replicate your results from the prepend papers. I am interested in playing with the preference function, if you remember, so I would like to figure out exactly how your old preference function works.

Let us consider a rule that matches a total of r instances in the training set. r1 of these instances belong to the class that matches the conclusion of the rule, and r0 of the instances belong to other classes. Thus r = r1 + r0.

Let us also say that there is an existing decision list we have been building using prepend. This decision list classifies d1 of the r instances correctly and d0 of them incorrectly. Thus r = d1 + d0 as well.

Finally, let us consider a decision list we have been building using append. In this case the r instances can be split into three groups: d0 of them are handled incorrectly by the current list, d1 of them are handled correctly, and d2 of them are not yet covered by the existing rules. The difference from prepend is due to the fact that append does not have a default rule until the very end, but prepend does. Thus in the case of append, r = d0 + d1 + d2.

Below is an excerpt from your 93-94 paper, which I will try to interpret according to the variables defined above:

> A positive case with respect to a rule is a case
> belonging to the class identified by the conclusion
> of the rule.

I assume here that we only consider the instances which match the rule's antecedent. In that case, according to this definition, the number of positive cases is r1.

> A negative case with respect to a rule is any case
> that does not belong to the class identified by the
> conclusion of the rule.

Similarly, the number of negative cases is r0.
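For concreteness, here is a minimal sketch of the counts defined above. The representations are my own assumptions, not from your paper: an instance is a (features, label) pair, a rule is an (antecedent, conclusion) pair with the antecedent a predicate on features, and the existing list's behavior is given by a classify() function.

```python
def rule_counts(rule, instances, classify):
    """Count r, r1, r0, d1, d0 for one rule against one training set.

    rule      -- (antecedent, conclusion): predicate on features, class label
    instances -- iterable of (features, label) pairs
    classify  -- the existing decision list as a function features -> label
    (All of these representations are illustrative assumptions.)
    """
    antecedent, conclusion = rule
    matched = [(x, y) for (x, y) in instances if antecedent(x)]
    r = len(matched)
    r1 = sum(1 for (x, y) in matched if y == conclusion)   # positive cases
    r0 = r - r1                                            # negative cases
    d1 = sum(1 for (x, y) in matched if classify(x) == y)  # list already correct
    d0 = r - d1                                            # list already incorrect
    return r, r1, r0, d1, d0

# Tiny example: a constant "neg" classifier standing in for the list.
instances = [({"a": 1}, "pos"), ({"a": 1}, "neg"), ({"a": 0}, "pos")]
counts = rule_counts((lambda x: x["a"] == 1, "pos"), instances, lambda x: "neg")
assert counts == (2, 1, 1, 1, 1)   # r, r1, r0, d1, d0
```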
> To guarantee termination of the algorithm it is
> necessary to terminate the repeat loop when the best
> rule covers less positive than negative
> cases. Otherwise the addition of a rule can increase
> the number of misclassified cases, necessitating the
> development of further rules to allow for those
> cases. Such a process can continue in an infinite
> loop.

I have assumed so far that positive and negative cases are defined only with respect to the rule itself, without considering the existing decision list. Under that reading, this statement claims that we should stop if r1 < r0. If that is the case, I do not agree with the statement. A rule that covers fewer positive than negative cases can decrease the number of misclassified cases (e.g. when all r instances are handled incorrectly by the current decision list), and a rule that covers more positive than negative cases can increase the number of misclassified cases (e.g. when all r instances are handled correctly by the current decision list). It seems to me that if we want to use the decreasing number of misclassified cases as an assurance of termination, we have to include d1 and d0 in the calculation. In particular, we should simply require r1 - d1 > 0.

> The same preference functions may be applied to
> selecting successive rules as for the append
> algorithm, with the difference that the positive
> cover is calculated only from cases not correctly
> covered by the existing decision list and the
> negative cover is calculated only from cases
> correctly covered by the existing decision list.

In the case of append, the rule we add will only apply to the d2 instances that have not yet been covered by the existing decision list. Therefore I do not quite understand your definition that mentions "correctly covered" and "not correctly covered". Those seem to imply that you want to use d0 and d1, which the new rule has no way of altering.
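The counterexamples above can be made concrete with a few lines of arithmetic (my own illustration): in prepend, the new rule is placed first and so takes over all r matched instances, changing the errors among them from d0 (under the old list) to r0 (under the new rule).

```python
def error_change(r1, r0, d1, d0):
    """Change in misclassified cases after prepending the rule.

    The prepended rule takes over all r matched instances, so errors
    among them go from d0 to r0. Note r0 - d0 = d1 - r1, i.e. errors
    decrease exactly when r1 > d1, matching the r1 - d1 > 0 criterion.
    """
    assert r1 + r0 == d1 + d0   # both pairs partition the same r instances
    return r0 - d0              # new errors minus old errors

# Fewer positive than negative cases (r1 < r0), yet errors decrease,
# because the old list got all 10 matched instances wrong:
assert error_change(r1=4, r0=6, d1=0, d0=10) == -4

# More positive than negative cases (r1 > r0), yet errors increase,
# because the old list already got all 10 matched instances right:
assert error_change(r1=6, r0=4, d1=10, d0=0) == 4
```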
Shouldn't the correct expression for both positive and negative cover be "calculated from cases not covered by the existing decision list"? Let r12 and r02 be the numbers of instances not yet covered by the existing decision list that match and do not match the current rule's classification, respectively. Thus d2 = r12 + r02. It is r12 and r02 that we want to use in the preference function.

> When investigating the value of a rule, with respect
> to negative cover, one should not consider cases
> that are already misclassified. Further
> misclassification of cases that are already
> misclassified will not affect the performance of the
> decision list. Similarly, one should not consider
> the correct classification of cases that are already
> correctly classified as this also will not affect
> the performance of the system.

In this paragraph, I assume we are talking about prepend again. My understanding here is that the negative cover should be the size of the intersection of r0 and d1, and the positive cover should be the size of the intersection of r1 and d0. In particular, instances that belong to (r1 and d1) or (r0 and d0) do not affect the performance. If I reinterpret the first paragraph according to this definition of positive and negative cover, we should stop if r1d0 < r0d1. Since r0 + r1 = d0 + d1, this is equivalent to the rule I suggested, i.e. stop if r1 < d1.

Finally, if I understand correctly, the Laplace evaluation function you are using is:

                r1d0 + 1
    value = -----------------
            r1d0 + r0d1 + 2

where r1d0 represents the number of instances correctly classified by the rule but incorrectly classified by the existing decision list, etc. Is this correct?

thanks,
deniz

P.S. Has anybody been working on a probabilistic model for the error rate of decision lists, trees, etc.? Or is it a well-known fact that it is an impossibly difficult problem?
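To make sure we are talking about the same function, here is the Laplace value as I have written it above, taking the positive cover r1d0 and the negative cover r0d1 as inputs (the helper is mine, not from the paper):

```python
def laplace_value(r1d0, r0d1):
    """Laplace preference value as written in the letter above.

    r1d0 -- positive cover: matched instances the rule gets right
            but the existing list gets wrong
    r0d1 -- negative cover: matched instances the rule gets wrong
            but the existing list gets right
    """
    return (r1d0 + 1) / (r1d0 + r0d1 + 2)

# With no evidence either way, the value is the uniform prior 1/2:
assert laplace_value(0, 0) == 0.5
# More positive than negative cover pushes the value above 1/2:
assert laplace_value(8, 2) == 0.75
```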