Determines the probability that a text contains spam.
Based on Paul Graham's "A Plan for Spam",
extended with ideas from Paul Graham's "Better Bayesian Filtering".
Sample method usage:
Use:
void addHam(Reader)
and
void addSpam(Reader)
methods to build up the Maps of ham and spam tokens and their occurrence counts.
Both addHam and addSpam assume they read exactly one message per call.
If you feed more than one message per call, be sure to adjust the
appropriate message counter, hamMessageCount or spamMessageCount, to match.
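The counting step can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name HamCounterSketch, its fields, and the tokenization rule (alphanumerics, dashes, apostrophes and dollar signs form tokens, as suggested in "A Plan for Spam") are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ham side of the analyzer.
class HamCounterSketch {
    // token -> number of occurrences seen in ham messages
    final Map<String, Integer> hamTokens = new HashMap<>();
    // number of ham messages fed in so far
    int hamMessageCount = 0;

    // Reads ONE message and adds its tokens to the ham counts.
    void addHam(Reader message) {
        try (BufferedReader in = new BufferedReader(message)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed rule: everything except letters, digits,
                // '$', apostrophe and '-' separates tokens.
                for (String token : line.split("[^\\p{Alnum}$'-]+")) {
                    if (!token.isEmpty()) {
                        hamTokens.merge(token, 1, Integer::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // One call counts as one message. A caller that passes several
        // messages in a single Reader has to add the difference itself.
        hamMessageCount++;
    }
}
```

The spam side would look the same with its own Map and counter. The message counters matter because the later probability step divides token counts by the number of messages.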
Then...
Use:
void buildCorpus()
to build the final Map of tokens to spam probabilities.
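For orientation, here is a sketch of the per-token formula given in "A Plan for Spam": ham counts are doubled, tokens seen fewer than 5 times (after doubling) are skipped, and the result is clamped to the range 0.01 to 0.99. The class CorpusSketch and its method signature are hypothetical, and the documented class may use different thresholds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating the corpus-building step.
class CorpusSketch {
    static Map<String, Double> buildCorpus(Map<String, Integer> hamTokens,
                                           Map<String, Integer> spamTokens,
                                           int hamMessageCount,
                                           int spamMessageCount) {
        if (hamMessageCount <= 0 || spamMessageCount <= 0) {
            throw new IllegalArgumentException("need at least one ham and one spam message");
        }
        Set<String> tokens = new HashSet<>(hamTokens.keySet());
        tokens.addAll(spamTokens.keySet());

        Map<String, Double> corpus = new HashMap<>();
        for (String token : tokens) {
            // Ham occurrences are doubled to bias against false positives.
            double g = 2.0 * hamTokens.getOrDefault(token, 0);
            double b = spamTokens.getOrDefault(token, 0);
            if (g + b < 5) {
                continue; // too rare to say anything useful
            }
            double hamRatio = Math.min(1.0, g / hamMessageCount);
            double spamRatio = Math.min(1.0, b / spamMessageCount);
            double p = spamRatio / (hamRatio + spamRatio);
            // Clamp so that no single token is treated as absolute proof.
            corpus.put(token, Math.max(0.01, Math.min(0.99, p)));
        }
        return corpus;
    }
}
```

With 20 ham and 20 spam messages, a token seen 10 times in spam and never in ham ends up at 0.99, while a token seen 10 times in ham and once in spam ends up near 0.048.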
Use your own methods to persist the individual ham/spam corpora and
message counts, the final corpus, or both.
Then you can...
Use:
double computeSpamProbability(Reader)
to determine the probability that a particular text contains spam.
A returned result of 0.9 or above indicates that the text is
probably spam.
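How the per-token probabilities become a single score can be sketched with the combination rule from "A Plan for Spam": take the tokens whose probability lies farthest from 0.5 (Graham uses 15), give unknown tokens 0.4, and combine them with a naive Bayes product. The class CombineSketch, its constants, and the choice to count each distinct token once are assumptions of this example and not necessarily what the documented class does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the scoring step.
class CombineSketch {
    static final double UNKNOWN_TOKEN_PROBABILITY = 0.4; // value from "A Plan for Spam"
    static final int MAX_INTERESTING_TOKENS = 15;        // value from "A Plan for Spam"

    static double combine(List<String> messageTokens, Map<String, Double> corpus) {
        // Look up each distinct token; unseen tokens lean slightly toward ham.
        List<Double> probabilities = new ArrayList<>();
        for (String token : new LinkedHashSet<>(messageTokens)) {
            probabilities.add(corpus.getOrDefault(token, UNKNOWN_TOKEN_PROBABILITY));
        }
        // Most "interesting" first: largest distance from the neutral 0.5.
        probabilities.sort(
            Comparator.comparingDouble((Double p) -> Math.abs(p - 0.5)).reversed());

        double spamProduct = 1.0;
        double hamProduct = 1.0;
        int limit = Math.min(MAX_INTERESTING_TOKENS, probabilities.size());
        for (double p : probabilities.subList(0, limit)) {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        // With no tokens at all this yields the neutral value 0.5.
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

A caller would then compare the result against the 0.9 threshold, for example `boolean isSpam = CombineSketch.combine(tokens, corpus) >= 0.9;`.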
If you use persistent storage, use:
void setCorpus(Map)
to load the stored corpus before calling computeSpamProbability.
since: 2.3.0