Efficiently learn spam and ham with Dovecot and virtual users

I guess nearly every admin of a webserver knows the article "ISP-style Email Server with Debian-Etch and Postfix 2.3" from workaround.org. I use a similar setup (with different paths) for my private mailserver. I have a number of users and accounts, and on some days I get quite a lot of spam.

Step 11 of the tutorial shows an example of how to learn spam and ham automatically. But I think this can be done better.

There’s nothing wrong with the script. It just has some drawbacks:

Already learned messages are processed again and again.
Only the spam folder and the trash are processed. I think it’s right to learn spam from the spam folder. But I sometimes move my spam into my trash, too, which means that this spam gets learned as ham. I would prefer to learn ham from folders like my inbox (and other folders, as I use Sieve to sort incoming mail). I could set this up, but then I would have to change the script whenever my mailbox structure changes. I prefer a blacklist: my spam folders aren’t renamed that often.
sa-learn gets called for each mail, which is slower than calling it for a number of messages.

As my setup is very small, performance is not that important. Also, I can trust all my users, so they won’t try to learn ham as spam and vice versa. So my solution looks like this (I used the original path names here):

#!/bin/sh
 
SADIR=/var/lib/amavis/.spamassassin
DBPATH=$SADIR/bayes
 
IGNOREPATTERN="\(.*\.\)\?trash"   # "trash" itself or any nested folder ending in ".trash"
SPAMPATTERN="junk\|spam"
 
GREP_SPAMPATTERN="/\.\($SPAMPATTERN\)/\(cur\|new\)/[^/][^/]*\$"
GREP_IGNOREPATTERN="\(/\.\($IGNOREPATTERN\)/\(cur\|new\)/[^/][^/]*\|2,[^,]*T[^/]*\)\$"
 
find /var/vmail/*/*/Maildir -type f -ctime -1 \
                \( -path '*/Maildir/cur/*' -o -path '*/Maildir/.[A-Za-z0-9]*/cur/*' -o \
                -path '*/Maildir/new/*' -o -path '*/Maildir/.[A-Za-z0-9]*/new/*' \) \
                -print0 \
        | grep -zZi "$GREP_SPAMPATTERN" \
        | grep -zZvi "$GREP_IGNOREPATTERN" \
        | xargs --null --no-run-if-empty nice sa-learn --spam --no-sync --dbpath "$DBPATH" >/dev/null
 
find /var/vmail/*/*/Maildir -type f -ctime -1 \
                \( -path '*/Maildir/cur/*' -o -path '*/Maildir/.[A-Za-z0-9]*/cur/*' -o \
                -path '*/Maildir/new/*' -o -path '*/Maildir/.[A-Za-z0-9]*/new/*' \) \
                -print0 \
        | grep -zZvi "$GREP_SPAMPATTERN" \
        | grep -zZvi "$GREP_IGNOREPATTERN" \
        | xargs --null --no-run-if-empty nice sa-learn --ham --no-sync --dbpath "$DBPATH" >/dev/null
 
nice sa-learn --sync --dbpath "$DBPATH" >/dev/null
chown -R amavis:amavis "$SADIR"

Okay, what happens here? Lines 6 and 7 define patterns. IGNOREPATTERN (line 6) means: don’t learn mails which are in "trash" folders. That means "trash" and "my/dir/trash" are ignored, so spam and ham can safely be deleted the way Thunderbird & co. do it. SPAMPATTERN (line 7) defines the folders which contain spam. This is my blacklist. Some users prefer "spam", others "junk". Note that "my/dir/junk" would not be found.
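To see what the spam pattern accepts, here is a small sketch that runs the same grep expression over a few made-up paths. It assumes GNU grep and Dovecot’s Maildir++ layout, in which a nested folder such as "my/dir/junk" is stored as a directory named ".my.dir.junk". The example paths and the helper is_spam_path are purely illustrative and not part of the script.

```shell
#!/bin/sh
# Same pattern construction as in the script above.
SPAMPATTERN="junk\|spam"
GREP_SPAMPATTERN="/\.\($SPAMPATTERN\)/\(cur\|new\)/[^/][^/]*\$"

# is_spam_path PATH: succeeds if PATH lies in a top-level spam folder.
# (Hypothetical helper, only for this demonstration.)
is_spam_path() {
    printf '%s\n' "$1" | grep -qi "$GREP_SPAMPATTERN"
}

# Made-up sample paths: two top-level spam folders, one nested junk
# folder and the inbox.
for p in \
    "/var/vmail/example.org/bob/Maildir/.Junk/cur/1:2,S" \
    "/var/vmail/example.org/bob/Maildir/.spam/new/2" \
    "/var/vmail/example.org/bob/Maildir/.my.dir.junk/cur/3:2,S" \
    "/var/vmail/example.org/bob/Maildir/cur/4:2,S"
do
    if is_spam_path "$p"; then
        echo "spam:     $p"
    else
        echo "not spam: $p"
    fi
done
```

Only the first two paths are reported as spam. The nested ".my.dir.junk" folder does not match, so the script would treat its content like any other non-spam folder.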

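Lines 9 and 10 wrap the patterns into grep-compatible regexes; the one in line 10 additionally skips messages which carry the Maildir "T" (trashed) flag. Lines 12-18 and 20-26 are nearly identical. First, /var/vmail is scanned for mail files whose status changed in the last 24 hours (-ctime -1), which typically covers mails that were delivered, moved or re-flagged in that time. This assumes that the script runs once per day (e.g. as a cron job). Lines 13 and 14 filter out directories which do not contain mails. Folder names may contain whitespace, so a null character is used to separate the file names.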
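Why the null character matters can be shown with a throwaway directory. This is a minimal sketch, assuming GNU find, grep and xargs; the folder name "My Folder" and the file names are invented for the demonstration, and "sh -c" only counts the arguments it receives instead of calling sa-learn.

```shell
#!/bin/sh
# Build a temporary Maildir with a folder name that contains a space.
demo=$(mktemp -d)
mkdir -p "$demo/Maildir/.My Folder/cur" "$demo/Maildir/cur"
: > "$demo/Maildir/.My Folder/cur/msg1"
: > "$demo/Maildir/cur/msg2"

# Whitespace-separated: xargs splits ".My Folder" into two arguments.
broken=$(find "$demo/Maildir" -type f \
        | grep '/cur/' \
        | xargs sh -c 'echo "$#"' sh)

# NUL-separated (-print0, grep -z, xargs -0): one argument per file.
# -0 is the short form of --null.
intact=$(find "$demo/Maildir" -type f -print0 \
        | grep -z '/cur/' \
        | xargs -0 sh -c 'echo "$#"' sh)

echo "arguments without -print0: $broken"
echo "arguments with -print0:    $intact"

rm -rf "$demo"
```

With two files on disk, the NUL-separated pipeline hands over exactly two arguments, while the plain one hands over more because the path with the space is torn apart.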

Line 16 keeps only the spam mails. Note the "-v" in line 24: there, everything that is not spam, i.e. the ham, is kept. This is one of the two differences. Next (lines 17 and 25), ignored folders are removed, because we can’t tell whether they contain spam or ham.

After this, xargs is used to call sa-learn. This is a performance boost, since sa-learn is not called separately for each mail. Here is the second difference: in line 18 the parameter is "--spam", in line 26 it’s "--ham". If a mail was learned as ham and is classified as spam later (or the other way around), SpamAssassin automatically forgets the old classification before it re-learns the mail, so there is no need to do this manually. Last but not least, the correct owner is set.
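The batching effect of xargs can be made visible without touching the Bayes database. In this sketch "echo" stands in for the real sa-learn call, and the three message names are invented; it only counts how many command lines xargs builds.

```shell
#!/bin/sh
# Three fake message names, NUL-separated like the output of find -print0.
names() { printf '%s\0' msg1 msg2 msg3; }

# Batched: xargs builds one command line that receives all names.
batched=$(names | xargs -0 echo sa-learn --spam --no-sync | wc -l | tr -d ' ')

# Per message: -n 1 forces one command line per name, which corresponds
# to calling sa-learn once for every mail.
single=$(names | xargs -0 -n 1 echo sa-learn --spam --no-sync | wc -l | tr -d ' ')

echo "batched invocations:     $batched"
echo "per-message invocations: $single"
```

Each real sa-learn invocation has to start Perl and load SpamAssassin, so fewer invocations mean less overhead.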

The script runs as root, because it has to read all mails. For me, it’s perfect. Perhaps there is someone out there who also likes it.

Update:

There are some situations in which it is perhaps better not to use this approach. If you have a large number of users, scanning all mails on the server may slow things down. In this case, it’s perhaps better to let the users train on their mails individually. A second reason to avoid automatic learning is when different users write mails about totally different subjects. One user may report mails as spam whose content another user would like to read. Thanks to Christoph for this idea.

Update 2:

Learning spam and ham is much faster when the --no-sync and --sync parameters are used. Thanks to Alexey Vazhnov for this very valuable suggestion. Sometimes you should just read the manpage more than once.

One Reply to “Efficiently learn spam and ham with Dovecot and virtual users”

Alexey Vazhnov says:

28. November 2011 at 10:05

Change
sa-learn --spam
to
sa-learn --no-sync --spam

sa-learn --ham
to
sa-learn --no-sync --ham

and add this line to the end of the script:
sa-learn --sync --dbpath $DBPATH

From "man sa-learn":
--no-sync
Skip the slow synchronization step which normally takes place after changing database entries. If you plan to learn from many folders in a batch, or to learn many individual messages one-by-one, it is faster to use this switch and run "sa-learn --sync" once all the folders have been scanned.
