Spam detection mailet using Bayesian analysis techniques.
Sets an email message header indicating the
probability that an email message is SPAM.
Based upon the principles described in:
A Plan For Spam
by Paul Graham.
Extended to Paul Graham's Better Bayesian Filtering.
The analysis capabilities are based on token frequencies (the Corpus)
learned through a training process (see
BayesianAnalysisFeeder )
and stored in a JDBC database.
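As a rough illustration of how token frequencies turn into a spam probability, the following minimal sketch follows the formulas published in A Plan For Spam. It is not the code of org.apache.james.util.BayesianAnalyzer; the class name GrahamSketch, the method names, and the in-memory maps standing in for the ham and spam tables are assumptions made for this example.

```java
import java.util.Map;

// Hypothetical sketch of the formulas from "A Plan For Spam".
// Not the actual BayesianAnalyzer implementation.
public class GrahamSketch {

    /**
     * Per-token spam probability.
     * ham/spam map a token to its occurrence count in each corpus;
     * nHam/nSpam are the number of messages in each corpus (assumed > 0).
     */
    static double tokenProbability(String token,
                                   Map<String, Integer> ham,
                                   Map<String, Integer> spam,
                                   int nHam, int nSpam) {
        int g = 2 * ham.getOrDefault(token, 0); // ham counts are doubled to bias against false positives
        int b = spam.getOrDefault(token, 0);
        if (g + b < 5) {
            // Simplification for this sketch: rarely seen and unseen tokens
            // both get the essay's "unknown word" value.
            return 0.4;
        }
        double goodRatio = Math.min(1.0, (double) g / nHam);
        double badRatio = Math.min(1.0, (double) b / nSpam);
        double p = badRatio / (goodRatio + badRatio);
        return Math.max(0.01, Math.min(0.99, p)); // clamp to [0.01, 0.99]
    }

    /** Combines the probabilities of the selected tokens into one message score. */
    static double combine(double[] probabilities) {
        double product = 1.0;
        double inverseProduct = 1.0;
        for (double p : probabilities) {
            product *= p;
            inverseProduct *= 1.0 - p;
        }
        return product / (product + inverseProduct);
    }
}
```

In the essay, only the most "interesting" tokens (those farthest from 0.5) are passed to the combining step; selecting them is left out here for brevity.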
After a training session, the Corpus must be rebuilt from the database in order to
acquire the new frequencies.
Every 10 minutes a special thread in this mailet will check if any
change was made to the database by the feeder, and rebuild the corpus if necessary.
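The periodic check described above could look roughly like the sketch below. It is only an illustration of the pattern, assuming the database exposes some "last updated" timestamp; CorpusReloader, lastDatabaseUpdate, and rebuildCorpus are hypothetical names and do not come from the James code base.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch of a "rebuild the corpus if the database changed" loop.
public class CorpusReloader {
    private final LongSupplier lastDatabaseUpdate; // assumed: timestamp of the feeder's last write
    private final Runnable rebuildCorpus;          // assumed: reloads token frequencies from the database
    private long lastCorpusLoad = Long.MIN_VALUE;  // nothing loaded yet

    CorpusReloader(LongSupplier lastDatabaseUpdate, Runnable rebuildCorpus) {
        this.lastDatabaseUpdate = lastDatabaseUpdate;
        this.rebuildCorpus = rebuildCorpus;
    }

    /** Rebuilds the corpus only if the database changed since the last load. */
    synchronized boolean checkAndReload() {
        long updated = lastDatabaseUpdate.getAsLong();
        if (updated > lastCorpusLoad) {
            rebuildCorpus.run();
            lastCorpusLoad = updated;
            return true;
        }
        return false;
    }

    /** Starts a daemon thread that runs the check every 10 minutes. */
    ScheduledExecutorService start() {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "corpus-reloader");
                t.setDaemon(true); // do not keep the JVM alive for this thread
                return t;
            });
        scheduler.scheduleWithFixedDelay(this::checkAndReload, 10, 10, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Comparing timestamps keeps the common case cheap: when the feeder has not run, the check does no rebuild work at all.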
An org.apache.james.spam.probability mail attribute will be created
containing the computed spam probability as a
java.lang.Double .
A message header named by the headerName parameter will be created, containing that
probability in floating-point representation.
Sample configuration:
<mailet match="All" class="BayesianAnalysis">
<repositoryPath>db://maildb</repositoryPath>
<!--
Set this to the header name to add with the spam probability
(default is "X-MessageIsSpamProbability").
-->
<headerName>X-MessageIsSpamProbability</headerName>
<!--
Set this to true if you want to ignore messages coming from local senders
(default is false).
By local sender we mean a return-path with a local server part (server listed
in <servernames> in config.xml).
-->
<ignoreLocalSender>true</ignoreLocalSender>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be considered spam (default is 100000).
-->
<maxSize>100000</maxSize>
</mailet>
The spam probability is prepended to the subject if
it is greater than 0.1 (10%).
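The threshold rule for the subject line can be sketched as follows. The exact text the mailet writes into the subject is not specified here, so the bracketed two-decimal format, the class name SubjectTagger, and the method tagSubject are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical sketch of the "prepend if above threshold" rule.
public class SubjectTagger {
    static final double THRESHOLD = 0.1; // 10%, as stated in the description

    /** Returns the subject with the probability prepended when it exceeds the threshold. */
    static String tagSubject(String subject, double probability) {
        if (probability > THRESHOLD) {
            // Assumed format: "[0.95] Original subject"
            String tag = String.format(Locale.ROOT, "[%.2f] ", probability);
            return tag + (subject == null ? "" : subject);
        }
        return subject; // at or below the threshold: leave the subject untouched
    }
}
```

Note that the comparison is strict, so a probability of exactly 0.1 leaves the subject unchanged.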
The required tables are automatically created if not already there (see sqlResources.xml).
The token field in both the ham and spam tables is case sensitive.
See Also: BayesianAnalysisFeeder
See Also: org.apache.james.util.BayesianAnalyzer
See Also: org.apache.james.util.JDBCBayesianAnalyzer
since: 2.3.0