HTML Emails & False Positives

Doldrums
For the past several months we have been using ASSP v1.1.1 Final.

From day 1, it has been giving us a lot of False Positives (legit emails
marked as Bayesian spam).  We have been sending these to the notspam@
address, and now have the following number of messages in our folders: spam
= 615; notspam = 1552; errors/spam = 4; errors/notspam = 322.  However, the
system seems to be learning VERY slowly: even emails we have reported as
"notspam" are still sometimes getting marked as spam.

A lot of the legit emails that are being marked are from companies such as
Intuit, Monster, and Fidelity.  Needless to say, I have not taken the system
out of test mode.

One thing that has me VERY concerned is that when I run the Analysis tool,
most of the words marked as "Bad Words" / "Good Words" are really HTML tags
or styles.  Am I missing a configuration setting?

I'm including a few analyses below, along with my rebuildspamdb log.  Any
help you could provide would be GREATLY appreciated.

Thanks,

Hugh O'Donnell

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

----------------------------
[Analysis of Fidelity email]
----------------------------

has a greylist value of 0.857609 (adds 0.857609 0.857609)

Good Words Good Prob
  investments href 0.0001
  size 3 0.0001
  - hlo 0.0002
  href investments 0.0003
  size 2 0.0019
  randword  0.0159
  has changed 0.0227
  helo rcpt 0.0348
Bad Words Bad Prob
  0 table 1.0000
  randword 7bit 1.0000
  7bit img 1.0000
  0 img 0.9999
  0 linkedimage 0.9998
  hlo fmr 0.9998
  hlo maillnx 0.9998
  href customersendemail 0.9997
  href emailoption 0.9997
  href confirmshelp 0.9997
  href 7746028 0.9987
  7746028 href 0.9987
  hlo us101 0.9984
  border 0 0.9977

Analysis totals: 1.0000 1.0000 1.0000 0.9999 0.9999 0.0001 0.0001 0.0002
0.9998 0.9998 0.9998 0.9998 0.9997 0.9997 0.9997 0.0003 0.0003 0.9987 0.9987
0.9987 0.9987 0.9984 0.0019 0.9977 0.9977 0.0159 0.0159 0.0227 0.0348 0.0364
0.0364

spam-prob = 1.00000


------------------------------
[Analysis of RealCities email]
------------------------------

66.165.105 has a greylist value of 0.242424 (adds 0.242424 0.242424)

Good Words Good Prob
  <none>
Bad Words Bad Prob
  cccccc text-decoration 1.0000
  ..l randword 1.0000
  none a.headline 1.0000
  none ..l 1.0000
  none a.headline2 1.0000
  none ..greytext 1.0000
  none a.npnav 1.0000
  none a.headline3 1.0000
  8bit if 1.0000
  underline a.headline2 1.0000
  underline ..l 1.0000
  a.headline3 visited 1.0000
  underline a.npnav 1.0000
  underline ..greytext 1.0000
  underline a.headline3 1.0000
  bold a.headline 1.0000
  a.npnav hover 1.0000
  a.headline visited 1.0000
  a.headline2 link 1.0000
  a.npnav visited 1.0000
  hlo esp 1.0000
  a.headline hover 1.0000

Analysis totals: 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000

spam-prob = 1.00000


-------------------------------
[Analysis of Monster.com email]
-------------------------------
Good Words Good Prob
  advice newsletter 0.0000

Bad Words Bad Prob
  atxt have 1.0000
  have atxt 1.0000
  join atxt 1.0000
  atxt join 1.0000
  0 href 1.0000
  atxt people 1.0000
  people atxt 1.0000
  than atxt 1.0000
  connecting atxt 1.0000
  atxt connecting 1.0000
  general newsletter 1.0000
  coffee atxt 1.0000
  atxt such 1.0000
  such atxt 1.0000
  atxt match 1.0000
  could atxt 0.9999
  atxt could 0.9999
  atxt 6 0.9999
  comes atxt 0.9999
  atxt comes 0.9999
  href dcel 0.9999

Analysis totals: 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000
0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
0.9999

spam-prob = 1.00000


-------------------
[rebuildspamdb log]
-------------------
mt=179535
Analyzing c:\assp/errors/spam  1..4
Analyzing c:\assp/errors/notspam 65..322
Analyzing c:\assp/spam 321..615
Analyzing c:\assp/notspam  897..1552
Found 360458 spam words, 1632076 non-spam words.
Generating weighted keys...
norm=0.2209



-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Assp-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/assp-user

Re: HTML Emails & False Positives

fletcher sandbeck-2
On 1/10/06 at 4:50 PM by [hidden email] (Doldrums):

>For the past several months we have been using ASSP v1.1.1 Final.
>
>From day 1, it has been giving us a lot of False Positives (emails marked as
>Bayesian spam that are legit).  We have been sending these to the notspam@
>address, and now have the following number of messages in our folders: spam
>= 615; notspam = 1552; errors/spam = 4; errors/notspam = 322.  However, the
>system seems to be learning VERY slowly, as emails we even mark as "notspam"
>are still getting marked sometimes.
>
>A lot of the legit emails that are being marked are from companies like:
>Intuit, Monster, Fidelity.  Needless to say, I have not taken the system out
>of test mode.
>
>One thing that has me VERY concerned, is that when I run the Analysis tool;
>most of the words marked at "Bad Words" / "Good Words" are really HTML tags
>or styles.  Am I missing a configuration setting?
>
>I'm including a few analyses below, along with my rebuildspamdb log.  Any
>help you could provide would be GREATLY appreciated.

A lot of spam contains HTML, so HTML markup naturally becomes part of what the Bayesian filter learns.  I always try to whitelist mail from companies which send me marketing emails, newsletters, or receipts.  This is easier than tuning the Bayesian filter until it can tell spam marketing from legitimate marketing.
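
To see why tags and style names end up in the word lists, here is a minimal sketch of a word-pair tokenizer applied to raw HTML. It is not ASSP's actual tokenizer: the function name `word_pairs`, the splitting rule, and the sample HTML are assumptions, chosen only to mimic pair-style entries such as "border 0" and "href investments" in the analyses quoted in this thread.

```python
import re

def word_pairs(raw_message):
    """Split a raw message into lowercase tokens and return adjacent pairs.

    Hypothetical tokenizer for illustration only: markup is NOT stripped,
    so tag names, attribute names, and values become tokens like any word.
    """
    tokens = re.findall(r"[a-z0-9][a-z0-9.\-]*", raw_message.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

# Made-up sample; not taken from any real message.
html = '<table border="0"><a href="/investments">Your statement has changed</a>'
pairs = word_pairs(html)
print(pairs)
```

With input like this, markup pairs ("border 0", "href investments") sit alongside content pairs ("has changed"), so whichever folder has seen more HTML decides which way the markup pairs lean.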

Also, I have adopted the strategy of leaving ASSP in test mode indefinitely.  My email client automatically recognizes the ASSP headers and sends any marked messages into its spam folder.  I usually glance through this folder before I delete the messages to make sure no important mail has been caught as a false positive.
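
A client-side rule of this kind might look roughly like the sketch below. The header name `X-Assp-Spam-Prob`, the 0.6 threshold, and the folder names are assumptions for illustration; check which headers your ASSP version actually adds in test mode before relying on any of them.

```python
from email import message_from_string

# Assumed header name and threshold; verify against your own ASSP headers.
SPAM_PROB_HEADER = "X-Assp-Spam-Prob"
THRESHOLD = 0.6

def target_folder(raw_message):
    """Return "Spam" if the assumed ASSP header reports a high probability."""
    msg = message_from_string(raw_message)
    value = msg.get(SPAM_PROB_HEADER)
    try:
        prob = float(value) if value is not None else 0.0
    except ValueError:
        prob = 0.0  # unparseable header: treat as not marked
    return "Spam" if prob >= THRESHOLD else "Inbox"

flagged = "X-Assp-Spam-Prob: 0.9999\nSubject: hi\n\nbody\n"
clean = "Subject: hi\n\nbody\n"
print(target_folder(flagged), target_folder(clean))
```

Because nothing is rejected at the server, a false positive costs only a glance through the spam folder.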

[fletcher]



Re: HTML Emails & False Positives

Joseph Armstrong
----- Original Message -----
From: "fletcher sandbeck" <[hidden email]>
To: <[hidden email]>
Sent: Wednesday, January 11, 2006 8:58 AM
Subject: Re: [Assp-user] HTML Emails & False Positives


On 1/10/06 at 4:50 PM by [hidden email] (Doldrums):


>>A lot of spam contains HTML so that markup can naturally become part of
>>the Bayesian filter.  I always try to whitelist mail from companies which
>>send me marketing emails, newsletters, or receipts.  This is easier than
>>getting the Bayesian filter tuned so that it can recognize the difference
>>between spam marketing and legitimate marketing.


You might be better off redlisting or ignore-listing these.




Re: HTML Emails & False Positives

Bill Christensen
In reply to this post by Doldrums
At 4:50 PM -0600 1/10/06, Doldrums wrote:
>For the past several months we have been using ASSP v1.1.1 Final.
>
>From day 1, it has been giving us a lot of False Positives (emails marked as
>Bayesian spam that are legit).  We have been sending these to the notspam@
>address, and now have the following number of messages in our folders: spam
>= 615; notspam = 1552; errors/spam = 4; errors/notspam = 322.  However, the
>system seems to be learning VERY slowly, as emails we even mark as "notspam"
>are still getting marked sometimes.


Those are extremely low numbers for your spam/notspam folders.  By
default ASSP lets the collection grow to 14,000 messages before it
starts randomly deleting.

So either you're deleting mail files from your folders (don't!) or
you need to send a LOT more mail to your spam/notspam addresses.

I started mine off by sending a year's worth of my own correspondence
to the notspam list.  And since I have several 10+ year old email
addresses which were widely published on the web, I had plenty of
spam coming in and sent some 200-300 spam messages a day to the list
for 6 months to tune it before we switched out of test mode - but I
was being overly cautious.

Also make sure that all your users are sending their outgoing mail
through ASSP.  This both adds it to the notspam files AND whitelists
any email addresses they send mail to.

--
Bill Christensen
<http://greenbuilder.com/contact/>

Green Building Professionals Directory: <http://directory.greenbuilder.com>
Sustainable Building Calendar: <http://www.greenbuilder.com/calendar/>
Green Real Estate: <http://www.greenbuilder.com/realestate/>
Straw Bale Registry: <http://sbregistry.greenbuilder.com/>
Books/videos/software: <http://bookstore.greenbuilder.com/>



Re: HTML Emails & False Positives

Bill Christensen
In reply to this post by Doldrums
At 4:50 PM -0600 1/10/06, Doldrums wrote:

Oh, by the way.

You'll want to REDLIST any accounts that are set up to forward or
auto-reply.  Otherwise, if they receive spam it'll be marked as NOT
spam when it's sent back out.



--
Bill Christensen
<http://greenbuilder.com/contact/>


RE: HTML Emails & False Positives

Theo Aukerman
In reply to this post by Doldrums
Personally, I think you need to wait for your email collections
(spam/notspam) to grow.

615 spam and 1552 notspam collected over a period of 2 months works out
to about 36 emails a day.  This sounds rather low.  We do about 400 a
day here, and the average (from the assp sourceforge statistics page)
seems to be 1600 a day.  If you truly are only processing about 30
emails a day, then it is expected that your filter will learn slowly.
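
The arithmetic behind that estimate, as a quick sketch. The folder counts come from the original post; treating "2 months" as 60 days is an assumption, and the function name is made up for the example.

```python
def mails_per_day(spam_count, notspam_count, days):
    """Average number of collected messages per day."""
    return (spam_count + notspam_count) / days

# 615 and 1552 are the folder counts from the post; 60 days is assumed.
rate = mails_per_day(615, 1552, 60)
print(f"{rate:.1f} mails/day")  # 36.1 mails/day
```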

I remember some users in the past thinking that they need to purge the
spam/notspam folders.  DON'T DO THIS.  The spam/notspam folders should
grow until they reach some size you specify (mine is set to 10,000 as an
example).

As you populate the spam/notspam folders (and rebuild the database
periodically), and you collect HTML formatted email in both the spam and
notspam folders, the HTML tags should begin looking more grey, and will
be taken out of the calculations in favor of words that provide stronger
indications one way or the other.
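
The "greying" described above can be sketched with a Graham-style per-token estimate. This illustrates the general technique, not ASSP's actual formula; the function names and the per-token message counts (300, 700, 2, 120) are made up, and only the folder sizes 615 and 1552 come from this thread.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Graham-style estimate: spam frequency as a share of total frequency."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # unseen token: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

def most_interesting(probs, k):
    """Keep the k tokens whose probability is farthest from the neutral 0.5."""
    return sorted(probs, key=lambda t: abs(probs[t] - 0.5), reverse=True)[:k]

# "border 0" seen only in spam: looks maximally bad.
before = token_spam_prob(300, 0, 615, 1552)
# Same token once HTML mail has also accumulated in notspam: close to 0.5.
after = token_spam_prob(300, 700, 615, 1552)

probs = {
    "border 0": after,
    "has changed": token_spam_prob(2, 120, 615, 1552),
}
print(before, round(after, 2), most_interesting(probs, 1))
```

Once the markup pair sits near 0.5, a content pair such as "has changed" wins the selection, which is the effect described above.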

If you never send out HTML formatted email, then your notspam collection
might be low on HTML formatted email.

I'm still running V1.1.1 Beta 11, and I know that with my version, when
you send mail to assp-notspam, it leaves the mail in the spam folder,
presumably to be overwritten later via the random number method of file
generation.  The problem is with only 30 mails a day, the false positive
email might sit in the spam folder for a while.
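
A rough estimate of how long "a while" could be, under a loudly stated assumption: each new file gets a name drawn uniformly at random from `max_files` possibilities, so every arrival overwrites one particular file with probability 1/max_files. Whether ASSP names files exactly this way is not confirmed here; the function name is hypothetical, and the 10,000 limit and the 30 and 400 mails a day are the figures mentioned in this post.

```python
def expected_days_until_overwritten(max_files, new_files_per_day):
    """Mean wait before one specific file name is reused.

    Under the uniform-naming assumption the number of arrivals until a
    given name is hit is geometric with mean max_files.
    """
    return max_files / new_files_per_day

slow = expected_days_until_overwritten(10_000, 30)
busy = expected_days_until_overwritten(10_000, 400)
print(round(slow), round(busy))  # 333 25
```

At low volume a stale false positive could stay in the spam folder for most of a year, which is why removing it by hand may help.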

Therefore, you might be able to speed up the learning curve by finding
the false positives in the spam folder that now also exist in the
errors/notspam folder and removing them from the spam folder (or moving
them to the notspam folder).
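
One hedged way to find such candidates is to compare Subject headers between the two folders, as sketched below. Matching on Subject is an assumption: a message reported by forwarding usually differs from the stored original in its other headers and body, so this only lists candidates for manual review and deletes nothing. The function names are hypothetical.

```python
import re
from email import message_from_binary_file
from pathlib import Path

def normalized_subject(path):
    """Read the Subject header, dropping Re:/Fw:/Fwd: prefixes and case."""
    with open(path, "rb") as fh:
        subject = str(message_from_binary_file(fh).get("Subject", ""))
    subject = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.I)
    return " ".join(subject.lower().split())

def candidate_false_positives(spam_dir, errors_notspam_dir):
    """List files in spam_dir whose subject also appears in errors_notspam_dir."""
    reported = {
        normalized_subject(p)
        for p in Path(errors_notspam_dir).iterdir()
        if p.is_file()
    }
    reported.discard("")  # empty subjects would match too much
    return sorted(
        p for p in Path(spam_dir).iterdir()
        if p.is_file() and normalized_subject(p) in reported
    )

# Example call using the folder names from the rebuildspamdb log:
# candidate_false_positives(r"c:\assp\spam", r"c:\assp\errors\notspam")
```

Review the returned paths by hand before removing or moving anything, since two unrelated messages can share a subject line.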

Theo



