HMM Bayes and training

Colin
Hi folks,

Posting over here rather than the dev list for once with what is probably one of the more basic things about ASSP.
When HMM and Bayes have been discussed in the past, people seem to find them very reliable, with few false positives. Unfortunately for me, I have never been able to replicate this.

As a result we have HMM set to score 49 and the threshold at 50, so it needs HMM plus something else to push a message over. We regularly get a few messages slipping through that HMM would have blocked, but I can't turn on blocking because there would be too many false positives.
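To make the arithmetic concrete, here is a minimal Python sketch of a per-check score adding up against a threshold. It only illustrates the setup described above and is not ASSP's actual code: the function name, the check names and the 1-point DNSBL hit are made up, and it assumes a message counts as spam once its total reaches the threshold. Only the values 49 and 50 come from this post.

```python
def is_blocked(check_scores, threshold=50):
    """Return True when the summed check scores reach the threshold.

    Assumption: "reaching" means total >= threshold.
    """
    return sum(check_scores.values()) >= threshold


hmm_only = {"HMM": 49}
hmm_plus_dnsbl = {"HMM": 49, "DNSBL": 1}  # hypothetical extra 1-point hit

print(is_blocked(hmm_only))        # False
print(is_blocked(hmm_plus_dnsbl))  # True
```

With HMM valued at 49, any single extra point from another check tips the message over, which is why HMM on its own never blocks in this setup.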

I ask everyone using the service to report messages using the email interface and to release things that have been picked up accidentally, but not everyone does. I can't force this by turning on blocking because everyone will just ditch us for something else.

To give you an idea, I have 157 messages that have been saved to the discarded folder today with "[spam found] and possibly passing because messagescore(49) low". About 50% of these seem to be legitimate messages. The remainder are legitimate mailing lists, though for some it is questionable whether the user would have subscribed themselves.

I started going through the discard folder moving these into errors/notspam and putting any spam into the spam corpus in the hope that it would train better, but there are many other messages in the discard folder. I got it down to 13,200 yesterday but now it is at 14,910.

I don't think there is a way to have these possibly passing messages save somewhere else so I can review just those - unless someone can correct me?
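As a stopgap for reviewing only those messages, a small script could pick out the "possibly passing" entries from whatever log or report carries that reason text. The sketch below is a rough idea in Python; it assumes the reason string appears verbatim as quoted above, and the surrounding line layout and the sample lines are invented for illustration.

```python
import re

# Wording copied from the reason text quoted in this post; everything else
# on the line is ignored because its layout is not known.
PASSING_RE = re.compile(r"possibly passing because messagescore\((\d+)\) low")


def near_miss_scores(lines):
    """Yield (score, line) for each line carrying the 'possibly passing' reason."""
    for line in lines:
        match = PASSING_RE.search(line)
        if match:
            yield int(match.group(1)), line.rstrip("\n")


# Made-up sample lines, not real ASSP output.
sample = [
    "msg-1 [spam found] and possibly passing because messagescore(49) low",
    "msg-2 [spam found] blocked",
]
for score, line in near_miss_scores(sample):
    print(score, line)
```

The same function works on an open file handle, since it only needs an iterable of lines.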

Settings are:
DoBayesian: Score
DoHMM: Score
BayesAfterHMM: 0.5-0.5

None of the other related settings seem to be changed from defaults except AddSpamProbHeader and AddConfidenceHeader are enabled.

So how have other people got their databases to be accurate?
All the best,
Colin.

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Assp-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/assp-user

Re: HMM Bayes and training

GrayHat
:: On Tue, 25 Jul 2017 14:22:01 +0100
:: <[hidden email]>
:: cw <[hidden email]> wrote:

 
> So how have other people got their databases to be accurate?
> All the best,

A decent approach is to use the default regexps and some good, reliable
DNSBLs/URIBLs to catch "surefire spam"; that will help train the
Bayes/HMM engines, which after a while may be set to reject.

As for training, you may also add a properly set up ClamD scanner to
the arsenal; just add some of the signatures found here to it

http://sanesecurity.com/usage/signatures/

and configure a scheduled script to keep them up-to-date; these, along
with the DNS lists will greatly help training the heuristic engines
(and then you may also feed some spam mail to the corpus); I know, it
isn't a "setup and forget", but then ASSP needs to be configured *and*
trained; the great advantage is that, once it starts humming along you
won't need to do too much to keep it running :)

Sure, you'll also need to properly configure automatic whitelisting and
train users about the email interface (it's easy, believe me), but
that's more or less all you'll need



Re: HMM Bayes and training

Colin

Thanks for the reply.

I've never taken anything out of ASSP unless it was causing a problem. I've had to add plenty to bombSubjectRe though over the years. I've the following DNSBL:

  • zen.spamhaus.org=>127.0.0.2=>1
  • zen.spamhaus.org=>127.0.0.3=>1
  • zen.spamhaus.org=>127.0.0.4=>1
  • zen.spamhaus.org=>127.0.0.5=>1
  • zen.spamhaus.org=>127.0.0.6=>1
  • zen.spamhaus.org=>127.0.0.7=>1
  • zen.spamhaus.org=>127.0.0.8=>1
  • bl.spamcop.net=>1
  • #safe.dnsbl.sorbs.net=>1
  • ix.dnsbl.manitu.net=>2
  • bb.barracudacentral.org=>2
  • bogons.cymru.com=>1
  • db.wpbl.info=>2
  • dnsbl-1.uceprotect.net=>2
  • psbl.surriel.com=>2
  • #dnsbl-2.uceprotect.net=>4
  • bl.spameatingmonkey.net=>127.0.0.2=>1
  • dnsrbl.swinog.ch=>3
  • dsn.rfc-ignorant.org=>1
  • bl.mailspike.net=>1
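For anyone comparing lists, the entries above can be read mechanically. The following Python sketch parses them under two assumptions that are my reading of the list rather than documented behaviour: the last field is the weight added on a hit, and a leading "#" means the entry is commented out. The function name is hypothetical.

```python
def parse_dnsbl_entry(entry):
    """Parse 'zone=>reply=>weight' or 'zone=>weight' as written in the list above.

    Returns a dict; 'enabled' is False for entries prefixed with '#'.
    """
    text = entry.strip()
    enabled = not text.startswith("#")
    parts = text.lstrip("#").split("=>")
    if len(parts) == 3:
        zone, reply, weight = parts
    elif len(parts) == 2:
        zone, weight = parts
        reply = None  # no specific reply address given
    else:
        raise ValueError(f"unrecognised entry: {entry!r}")
    return {"zone": zone, "reply": reply, "weight": int(weight), "enabled": enabled}


entries = [
    "zen.spamhaus.org=>127.0.0.2=>1",
    "#dnsbl-2.uceprotect.net=>4",
    "dnsrbl.swinog.ch=>3",
]
parsed = [parse_dnsbl_entry(e) for e in entries]
# Total weight if every enabled list in this small sample were to hit.
print(sum(p["weight"] for p in parsed if p["enabled"]))
```

Summing the weights this way gives a quick feel for how far DNSBL hits alone can move a message towards the threshold.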

Re clam, I've got unofficial-clamav-sigs running which does low and medium risk defs for the following:

Sanesecurity
Malware Expert
Foxhole
Winnow
MiscreantPunch
BOFHland
RookSecurity
Porcupine
SecuriteInfo
Linux Malware Detect
Yara Rules

I've manually reclassified a few hundred emails today, ones that HMM would have blocked but that were allowed through, and I got bored after getting back as far as the 21st. The problem is that I can't turn on blocking for HMM and force people to release everything that gets blocked, as it would cause way too much upset. I can see hundreds of legitimate emails per day that would be blocked.

I can't see an easy way to improve this. The closest I can get is to have emails that fail HMM/Bayes but do not get blocked collected in a different folder, so that I can whip through them and reclassify them. Once that has retrained the database to the point where there are very few false positives, I can be confident in turning the blocking on.

All the best,
Colin.

On 25/07/2017 15:46, Grayhat wrote:

Re: HMM Bayes and training

assp-user mailing list
In reply to this post by Colin
On 7/25/2017 6:22 AM, cw wrote:
>
> I don't think there is a way to have these possibly passing messages
> save somewhere else so I can review just those - unless someone can
> correct me?
>

You may want to explore sendAllSpam and ccSpamAlways.  I have
sendAllSpam set to a dedicated spam account which I check if a user
tells me they didn't get the message - then I can fix it via the email
interface.  I haven't had to do that in quite some time - so every few
months I just empty it.

Everyone's use case is different - I've had excellent success with my
own settings which are rather aggressive.  The two keys I've found:

1.  My local users are whitelist approved - so I've trained them to send
messages to anyone they want to receive a message from.
2.  Any regular correspondents get added to the no-processing domains.

Daniel


Re: HMM Bayes and training

Colin
Hi Daniel,

Thanks for the suggestion.

I'm not 100% sure this will work. The messages I am interested in are
those that are currently below the spam threshold but close to it.

I can see that I would be able to use ccMaxScore to prevent it from
copying anything above a certain score, and I can see that there is
sendHamInbound, but that doesn't provide a way to limit which ham is
sent by score. In this case I would want to send all ham above a score
of, say, 45.
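Put as code, the selection being asked for is just a score band. This is a tiny Python sketch of the idea, not an existing ASSP option: the function name and the sample data are made up, the 50 is the threshold mentioned earlier in the thread, and it treats a score of exactly 45 as inside the band.

```python
def in_review_band(score, low=45, threshold=50):
    """True for messages that passed (below threshold) but scored at least `low`."""
    return low <= score < threshold


# Made-up message ids and scores.
scores = {"msg-a": 12, "msg-b": 45, "msg-c": 49, "msg-d": 50}
to_review = [name for name, score in scores.items() if in_review_band(score)]
print(to_review)  # ['msg-b', 'msg-c']
```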

All the best,
Colin.


On 27/07/2017 06:27, Daniel Miller via Assp-user wrote:
