size of spam & notspam folders in /Library/ASSP


Marc Lucke
Hi,

It takes ASSP a lot of work, and seemingly forever, to process the roughly 35,000 spam and not-spam messages in my spam & notspam folders (a few hundred megabytes).  I don't care about the storage, since it's MB rather than GB; it just seems that after the regular rebuild.pl run, none of the email is removed from the spam/notspam folders.

Is it meant to?  Is it meant to process every message every time?  If mine is somehow malfunctioning, where should I begin looking to fix it?


--
Marc Lucke
Manager, Online Services

Main Street Publishing
Ph: +61 2 9929 1910 (direct)
Fx: +61 2 9929 1999


RE: size of spam & notspam folders in /Library/ASSP

Lars Troen
I'm not sure what you mean by forever, but at least for me it takes ~100 minutes to process ~40,000 messages. I don't think it removes any messages. It only overwrites messages at random.
 
Lars


From: [hidden email] [mailto:[hidden email]] On Behalf Of Marc Lucke
Sent: 11 January 2006 08:59
To: [hidden email]
Subject: [Assp-user] size of spam & notspam folders in /Library/ASSP



Re: size of spam & notspam folders in /Library/ASSP

Micheal Espinola Jr
100 minutes!  I thought mine was bad taking 25 minutes for 40,000 msgs...



--
ME2


_______________________________________________
Assp-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/assp-user

RE: size of spam & notspam folders in /Library/ASSP

Lars Troen
In reply to this post by Marc Lucke
Hmm... This made me a bit curious, so I've now increased the memory
in the box from 256 MB to 768 MB, and a rebuild now takes 22
minutes. Memory is certainly a key factor here.

Lars


Re: size of spam & notspam folders in /Library/ASSP

Cameron Wilhelm
FWIW,

My rebuild takes about 22 minutes every night.  This is for about
18.5k messages (combined spam and notspam).  I'm assuming that the
number rebuildspamdb.pl outputs is a token count or something...?
That may be a better indication of how much processing it's
actually doing.  Mine comes out to 272,705.

This is on an old PowerMac G4 400 MHz with 1.38 GB of RAM.

I think mine is processor bound (as opposed to memory bound).

-Cameron Wilhelm
[hidden email]




RE: size of spam & notspam folders in /Library/ASSP

Lars Troen
In reply to this post by Marc Lucke
I've now upgraded to 3.6 GB of RAM, but it still takes 22 minutes. I
guess ReiserFS might perform better. It's known to be faster than ext3
(which I'm using now) when handling directories with many files.

Token count? Which one is that?

mt=374227
Analyzing ./errors/spam
32

Analyzing ./errors/notspam
34

Analyzing ./spam
21574 ****

Analyzing ./notspam
44860 ********

Found 10294607 spam words, 9752793 non-spam words.
Generating weighted keys...
norm=1.0556
Saving rebuilt SPAM database*************************************************

total time processing=1355 second(s)
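
To put the numbers above in perspective, here is a minimal Python sketch that parses the pasted output and derives a throughput figure. It assumes that the number printed after each "Analyzing" line is the count of files processed in that folder; the thread does not confirm this, and `summarize` is a hypothetical helper, not part of ASSP. The last field simply compares the two word totals with the printed `norm` value. That they agree to four decimals is an observation about these particular numbers, not a claim about how rebuildspamdb.pl computes `norm`.

```python
import re

# Output pasted from the message above (progress asterisks shortened).
REBUILD_LOG = """\
mt=374227
Analyzing ./errors/spam
32

Analyzing ./errors/notspam
34

Analyzing ./spam
21574 ****

Analyzing ./notspam
44860 ********

Found 10294607 spam words, 9752793 non-spam words.
Generating weighted keys...
norm=1.0556
Saving rebuilt SPAM database

total time processing=1355 second(s)
"""


def summarize(log: str) -> dict:
    """Extract per-folder counts, word totals and run time from a rebuild log."""
    # Assumption: the integer after "Analyzing <folder>" is a file count.
    folders = {
        name: int(count)
        for name, count in re.findall(r"Analyzing (\S+)\s+(\d+)", log)
    }
    spam_words, ham_words = map(
        int,
        re.search(r"Found (\d+) spam words, (\d+) non-spam words", log).groups(),
    )
    seconds = int(re.search(r"total time processing=(\d+)", log).group(1))
    files = sum(folders.values())
    return {
        "folders": folders,
        "files": files,
        "minutes": seconds / 60,
        "files_per_second": files / seconds,
        "spam_to_ham_word_ratio": spam_words / ham_words,
    }


if __name__ == "__main__":
    for key, value in summarize(REBUILD_LOG).items():
        print(key, value)
```

Under that assumption, the run above handled 66,500 files in about 22.6 minutes, or roughly 49 files per second.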


RE: size of spam & notspam folders in /Library/ASSP

Lars Troen
In reply to this post by Marc Lucke
Just to follow this track a bit further (while I have a lot of
memory available), I put the spam & notspam directories on a RAM disk.
The processing time was then 12 minutes. This was done in a Debian VM
on ESX Server.

Lars
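
For anyone who wants to repeat this kind of experiment, one way to separate disk time from processing time is to measure how long it takes just to read every file in a corpus folder. The sketch below is a rough illustration in Python, not part of ASSP; `time_corpus_read` is a hypothetical helper and the folder path is whatever your installation uses. Note that a second run over the same folder will usually be served from the operating system's cache and will look much faster than a cold run.

```python
import time
from pathlib import Path
from typing import Tuple


def time_corpus_read(folder: Path) -> Tuple[int, int, float]:
    """Read every regular file in `folder` once.

    Returns (file count, total bytes read, elapsed seconds).
    """
    start = time.perf_counter()
    files = 0
    total_bytes = 0
    for path in folder.iterdir():
        if path.is_file():
            total_bytes += len(path.read_bytes())
            files += 1
    return files, total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    # Assumed location; adjust to your own spam or notspam folder.
    count, size, elapsed = time_corpus_read(Path("./spam"))
    print(f"{count} files, {size} bytes, {elapsed:.1f} s")
```

If the read alone accounts for most of the rebuild time, a faster filesystem or a RAM disk is likely to help; if not, the rebuild is probably processor bound.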

Re: size of spam & notspam folders in /Library/ASSP

Marc Lucke
I run MailScanner on Linux, which handles spam, viruses and DNS blacklists together, and it's stunningly effective & easy.  Unfortunately it seems to be quite a pain to set up (if it's possible at all) on Mac OS X, the platform I'm stuck with at work.  Essentially I feed the spam to SpamAssassin (sa-learn) and then remove the email message, and that is that!  It is done on the fly.  This seems to me much better than reprocessing the same (or almost the same) spam and non-spam every day.

I don't quite understand the overwriting of an email at random.

I'd like to know more about how ASSP works and whether a more SpamAssassin-like way could be achieved.  It sounds like many people would prefer this immediate & cumulative processing approach to the batch, reprocess-every-day approach.  ASSP seems really resource intensive.


Marc



RE: size of spam & notspam folders in /Library/ASSP

Ged West
In reply to this post by Marc Lucke
Go here and read.  It should answer all your questions.
 



Re: size of spam & notspam folders in /Library/ASSP

Cameron Wilhelm
In reply to this post by Lars Troen
I don't know what that 44860 is at the end. All I know is that it  
gets larger as you have more e-mail in your directories.


-Cameron Wilhelm
[hidden email]




Re: size of spam & notspam folders in /Library/ASSP

Cameron Wilhelm
In reply to this post by Marc Lucke
Disclaimer:  I am not an expert at this, so please correct any errors.

This is a 2-in-1 answer for the sake of simplicity.

ASSP randomly overwrites e-mails in its corpus as it goes along.
This is so that your database of spam and non-spam doesn't get
stale.  It does this by assigning every received e-mail a random
number and then writing the e-mail to XXXX.eml in either your spam
or non-spam directory.  If an e-mail with that number was already in
the directory, it gets overwritten.  The point is to have a decent
sample size that's never too old and never too new.
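
The overwrite scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ASSP's actual code (ASSP is written in Perl); `store_message` and the `max_slots` parameter are hypothetical names chosen for the example.

```python
import random
from pathlib import Path


def store_message(corpus_dir: Path, raw: bytes, max_slots: int,
                  rng: random.Random) -> Path:
    """Write a message into a randomly chosen numbered slot.

    `max_slots` (a made-up parameter) caps how many files the folder can
    ever hold. If the chosen slot is already occupied, the older message
    is silently replaced.
    """
    slot = rng.randrange(max_slots)
    path = corpus_dir / f"{slot}.eml"
    path.write_bytes(raw)  # overwrites any message already in this slot
    return path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        rng = random.Random()
        for i in range(1000):
            store_message(Path(tmp), f"message {i}".encode(), 50, rng)
        # However many messages arrive, the folder never exceeds 50 files.
        print(len(list(Path(tmp).glob("*.eml"))))
```

Because every slot is equally likely to be overwritten, the folder ends up holding a mix of older and newer mail, and its size stays bounded without any explicit clean-up step. This is also why a rebuild never appears to remove anything from the folders.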

Now, given that ASSP uses Bayesian filtering and that it's designed
to keep the corpus from getting too stale, there must be some sort of
processing to drop out the old stuff.  If new e-mails were simply
being added to a database, that could be done (at some resource
expense) on the fly.  However, since the rebuild process needs to
drop out the old e-mail in addition to incorporating the new e-mail,
the entire database must be rebuilt.  There's no other (efficient)
way to determine which tokens belong to which e-mails and drop them
out on the fly when using a random aging process like the one above.
If you don't use a random aging process, you run the risk of your
database becoming too new.
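
To make the "rebuild everything" argument concrete, here is a deliberately simplified Python sketch of a batch rebuild. It is not ASSP's algorithm: the tokenizer and the smoothed spam share used as a score are assumptions made for illustration, and `count_tokens` and `rebuild` are hypothetical names. The point is only that the counts are derived from whatever files are in the folders right now, so messages that were overwritten since the last run drop out automatically.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenizer: runs of letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def count_tokens(folder: Path) -> Counter:
    """Count token occurrences across every .eml file in a folder."""
    counts = Counter()
    for path in folder.glob("*.eml"):
        text = path.read_text(errors="replace").lower()
        counts.update(TOKEN_RE.findall(text))
    return counts


def rebuild(spam_dir: Path, notspam_dir: Path) -> dict:
    """Recompute a token -> spam score table from scratch.

    The score is a Laplace-smoothed share of spam occurrences, a
    stand-in for whatever weighting the real rebuild script applies.
    """
    spam = count_tokens(spam_dir)
    ham = count_tokens(notspam_dir)
    scores = {}
    for token in spam.keys() | ham.keys():
        s, h = spam[token], ham[token]
        scores[token] = (s + 1) / (s + h + 2)
    return scores
```

An incremental design would instead have to remember which tokens each stored file contributed, so that they could be subtracted when that file is overwritten. That bookkeeping is the cost the batch approach avoids.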

ASSP is actually not very resource intensive at all.  One of the
design philosophies is that it processes incoming e-mail as quickly
as possible so that it does not tie up extra resources on your
machine that are waiting for it to pass mail through.  If it were
updating the database with every mail that came in, it would not
scale to large volumes as easily as it does now, and ASSP would place
a larger constant load on the server (though admittedly with no peak
load).

ASSP could be redesigned to do the processing on the fly; I'm just
kinda giving the basics about why it's the way it is.  One thing to
keep in mind is that ASSP is written (almost?) entirely in Perl.
That makes it extremely easy to port.  Many of the on-the-fly
processing solutions that I can think of are much more difficult to
implement with a Perl-only solution.

Again, I'm not an expert - please feel free to correct any mistakes  
I've made in here.

-Cameron Wilhelm
[hidden email]




Thunderbird

Chris Norman
Does anyone know of a Thunderbird plug-in that forwards an email to a
preset address when I click "junk" on it?

That way I could train ASSP with a button click.

C



Re: size of spam & notspam folders in /Library/ASSP

Marc Lucke
In reply to this post by Cameron WIlhelm
Thanks, Cameron.  If there are any corrections to be made to this,
they certainly aren't obvious.  You've now got me wondering about the
Bayesian database that my SpamAssassin keeps - could it be stale?
Thanks a lot for your work.  I appreciate it a lot.


Marc


Re: size of spam & notspam folders in /Library/ASSP

Marc Lucke
In reply to this post by Ged West
der.  Thanks.  Sorry.  RTFM.  I liked Cameron Wilhelm's description too.

It's all good.  Unfortunately my server still has to chug through thousands of emails periodically though; I don't like that very much.  But I guess with a low priority it's not that harmful.
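
For reference, running a batch job at low priority on a POSIX system can be sketched in Python as below. The helper `run_low_priority` is hypothetical, and the command in the usage comment assumes the rebuild script is started with `perl rebuildspamdb.pl` from the ASSP directory, which may not match a given installation.

```python
import os
import subprocess


def run_low_priority(command, niceness=19):
    """Run `command` with its niceness raised (POSIX only).

    A higher niceness means the scheduler favours other processes, so the
    job mostly uses CPU time that would otherwise be idle.
    """
    return subprocess.run(
        command,
        preexec_fn=lambda: os.nice(niceness),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=False,
    )


# Assumed invocation; adjust the interpreter and script path as needed:
# result = run_low_priority(["perl", "rebuildspamdb.pl"])
# print(result.returncode, result.stdout[-200:])
```

This only lowers CPU priority. If the rebuild is limited by disk access, as the RAM disk result earlier in the thread suggests it can be, niceness alone will not stop it from competing for I/O.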

Thanks to all that answered!


Marc
