Andrew McCallum, Kamal Nigam, et al. 1998.
A comparison of event models for naive Bayes text classification. In
AAAI-98 workshop on learning for text categorization, vol. 752, pp. 41-48.
Citeseer. [comp542]
Recent approaches to text classification have used two
different first-order probabilistic models for classification, both of which make the naive Bayes assumption.
Some use a multi-variate Bernoulli model, that is, a
Bayesian Network with no dependencies between words
and binary word features (e.g. Larkey and Croft 1996;
Koller and Sahami 1997). Others use a multinomial
model, that is, a uni-gram language model with integer
word counts (e.g. Lewis and Gale 1994; Mitchell 1997).
This paper aims to clarify the confusion by describing
the differences and details of these two models, and by
empirically comparing their classification performance
on five text corpora. We find that the multi-variate
Bernoulli performs well with small vocabulary sizes,
but that the multinomial usually performs
even better at larger vocabulary sizes, providing on
average a 27% reduction in error over the multi-variate
Bernoulli model at any vocabulary size.
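
The difference between the two event models described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (`train`, `bernoulli_log_score`, `multinomial_log_score`, `classify`), the toy data, and the use of add-one (Laplace) smoothing are all assumptions made for the example. The multi-variate Bernoulli score looks at every vocabulary word and uses only presence or absence; the multinomial score looks only at words that occur and weights each by its count.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Estimate class priors and per-class word parameters for both models.

    Add-one (Laplace) smoothing is an assumption of this sketch.
    """
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    bern, multi = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # Bernoulli: smoothed fraction of class documents containing the word.
        df = Counter(w for d in class_docs for w in set(d) if w in vocab)
        bern[c] = {w: (1 + df[w]) / (2 + len(class_docs)) for w in vocab}
        # Multinomial: smoothed fraction of class word occurrences.
        tf = Counter(w for d in class_docs for w in d if w in vocab)
        total = sum(tf.values())
        multi[c] = {w: (1 + tf[w]) / (len(vocab) + total) for w in vocab}
    return prior, bern, multi


def bernoulli_log_score(doc, c, prior, bern):
    """Unnormalized log posterior: binary features, absent words count too."""
    present = set(doc)
    score = math.log(prior[c])
    for w, p in bern[c].items():
        score += math.log(p if w in present else 1.0 - p)
    return score


def multinomial_log_score(doc, c, prior, multi):
    """Unnormalized log posterior: integer word counts, absent words ignored.

    Class-independent terms (document length, multinomial coefficient)
    are dropped because they do not affect the argmax over classes.
    """
    score = math.log(prior[c])
    for w, n in Counter(doc).items():
        if w in multi[c]:
            score += n * math.log(multi[c][w])
    return score


def classify(doc, score_fn, prior, params):
    """Return the class with the highest score under the given model."""
    return max(prior, key=lambda c: score_fn(doc, c, prior, params))


# Hypothetical toy corpus, for illustration only.
docs = [["ball", "ball", "goal"], ["goal", "team"],
        ["vote", "law"], ["law", "law", "vote"]]
labels = ["sports", "sports", "politics", "politics"]
vocab = {"ball", "goal", "team", "vote", "law"}

prior, bern, multi = train(docs, labels, vocab)
test_doc = ["ball", "ball", "team"]
print(classify(test_doc, bernoulli_log_score, prior, bern))
print(classify(test_doc, multinomial_log_score, prior, multi))
```

Note that repeating "ball" in `test_doc` changes the multinomial score but leaves the Bernoulli score untouched, which is the distinction between integer word counts and binary word features that the abstract draws.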