Max Number Duplicate File Names


Max Number Duplicate File Names

K Post
I've got UseSubjectsAsMaillogNames checked (the messages are stored in the
folders using the subject name followed by a 6 digit number, as expected).

I've got MaxAllowedDups set to 3

MaxBayesFileAge is 0
MaxFiles is 15000

I'm noticing that MaxAllowedDups doesn't seem to be working.

For example, a couple users often send emails with the subject
"Your Donation Receipt"
There are about 600 of those files in NotSpam.
Your_Donation_Receipt--123456.txt
where 123456 is a random differing number.

Shouldn't only 3 of these files exist in the folder (with the exception of
those that were sent since the rebuild / maintenance window)?

Thanks

------------------------------------------------------------------------------
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785111&iu=/4140
_______________________________________________
Assp-test mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/assp-test

Re: Max Number Duplicate File Names

Thomas Eckardt/eck
>There are about 600 of those files in NotSpam.

'MaxAllowedDups','Max Number of Duplicate File Names'
  'The maximum number of logged files with the same filename (subject)
that are stored in the spam folder (spamlog),........

I'll write it in Hebrew. Possibly the English will be better if you translate it
back to English.

Thomas



From:    K Post <[hidden email]>
To:      ASSP development mailing list <[hidden email]>
Date:    10.03.2016 00:29
Subject: [Assp-test] Max Number Duplicate File Names







DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally
privileged and protected in law and are intended solely for the use of the
individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no
known virus in this email!
*******************************************************



Re: Max Number Duplicate File Names

K Post
I know you're all RTFM, but there are plenty of places in the GUI where the
description isn't exactly clear or right.  For example:

MaxFiles
If you're not using subjects as file names ( UseSubjectsAsMaillogNames ),
this is the maximum number of files to keep in each collection (spam &
nonspam)
It's actually less than this -- files get a random number between 1 and
MaxFiles.

I AM using subjects as file names, and MaxFiles DOES control the maximum
number of files in each collection when MaintBayesCollection is on and no
max age is set, despite what the description says. The language is not clear,
and that makes us assume things, sometimes incorrectly, about what the GUI
really means.  We've been working this way since ASSP came out.  Because of
this, I had no way of knowing that MaxAllowedDups >really< only applied to
the spam collection.  I assumed the GUI meant the whole log of spam and
NOTspam.  I don't think that's an unreasonable assumption. Call it an
oversight or a mistake on my part, but none of that justifies an
angry-sounding response from you.

I'm not looking for a fight, but I feel like I have to keep justifying
myself after you appear to be so angry with me, and with the rest of us who
turn to you for enlightenment.  You're carrying the entire weight of this
project on your shoulders.  It's a lot, I know.  Can we move on and have a
reasonable discussion here?

Is there a reason that MaxAllowedDups shouldn't also apply to the notspam
collection?   Shouldn't we want that to be the case for the same reason
that we have it for spam?   Maybe also to the errors collections?

If we don't, wouldn't a case where a staff member sends the same basic
message to 5000 people (against my wishes, but I can't control everything)
take 1/3 of the other notspam messages out of the rebuild process (5000 of
the 15000 MaxFiles slots)?  How about if 20k messages are sent?

Maybe I'm just not understanding, and that's why I'm asking, but I hope it
doesn't result in any more scolding.

Thank you


On Thu, Mar 10, 2016 at 4:15 AM, Thomas Eckardt <[hidden email]>
wrote:

> >There are about 600 of those files in NotSpam.
>
> 'MaxAllowedDups','Max Number of Duplicate File Names'
>   'The maximum number of logged files with the same filename (subject)
> that are stored in the spam folder (spamlog),........

Re: Max Number Duplicate File Names

Thomas Eckardt/eck
Just think about the logic behind Bayesian and HMM - this will answer your
question.

Having the same mail in the spam folder many times will score its content
as extremely spam-heavy, even if your own users send the same content,
only less often.
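
The effect of duplicates on a Bayesian corpus can be illustrated with a minimal sketch. This is not ASSP's scoring code: the helper `token_spam_probability` and its simple per-message frequency ratio are assumptions, chosen only to show how many stored copies of one message push its words toward "spam".

```python
from collections import Counter


def token_counts(messages):
    """Count in how many messages each lowercase word appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_probability(token, spam_msgs, ham_msgs):
    """Naive estimate of P(spam | token) from per-corpus message frequencies.

    Hypothetical helper for illustration only, not ASSP's algorithm.
    """
    spam_rate = token_counts(spam_msgs)[token] / max(len(spam_msgs), 1)
    ham_rate = token_counts(ham_msgs)[token] / max(len(ham_msgs), 1)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen in either corpus: neutral
    return spam_rate / (spam_rate + ham_rate)


ham = ["your donation receipt is attached", "meeting moved to friday",
       "lunch on friday", "quarterly report attached"]
spam_once = ["cheap pills online", "win a prize now",
             "your donation receipt click here", "lottery winner"]
# Same spam corpus, but the receipt look-alike was stored 50 times in total.
spam_dups = spam_once + ["your donation receipt click here"] * 49

p_once = token_spam_probability("receipt", spam_once, ham)
p_dups = token_spam_probability("receipt", spam_dups, ham)
```

With one stored copy the word "receipt" is neutral in this toy corpus; with fifty copies its estimate moves clearly toward spam, although legitimate mail uses the same word.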

Thomas





From:    K Post <[hidden email]>
To:      ASSP development mailing list <[hidden email]>
Date:    10.03.2016 16:58
Subject: Re: [Assp-test] Max Number Duplicate File Names




Re: Max Number Duplicate File Names

Scott MacLean-4
In reply to this post by K Post
I agree with this. I have a user that sends out a newsletter monthly to
~50,000 users, and it completely wipes out my notspam collection every
time, with 15,000 almost identical copies of the same email. I wait
until the newsletter send has completed, and then I restore the notspam
files from a backup previous to the newsletter sending.

On 3/10/2016 10:55 AM, K Post wrote:

>
> Is there a reason that MaxAllowedDups shouldn't also apply to the notspam
> collection?   Shouldn't we want that to be the case for the same reason
> that we have it for spam?   Maybe also to the errors collections?
>
> If we don't, wouldn't the case where a staff member sends the same basic
> message to 5000 people (against my wishes, but I can't control everything)
> that'll take 1/3 of the other notspam messages out of the rebuild
> processes?  How about if 20k messages are sent?
>



Re: Max Number Duplicate File Names

Thomas Eckardt/eck
MaxAllowedDups forces spam mails that exceed this limit to be moved to
the discarded folder.
What should be done with HAM mails that have the same subject (like RE:
or FW: without any additional text) but possibly different content?

>sends out a newsletter monthly

ASSP has several options to solve this problem:

I would use either or both of:

noCollecting
noCollectRe

Tell the user that he has to tag the newsletter every time with the same
unique tag (a special greeting, a header tag, ...).
Put the tag into 'noCollectRe'. Note: the tag has to be in the header or
the first MaxBytes of the body.

Or tell the user that he has to use a different sender address for the
newsletter, and put that address into 'noCollecting'.

DoNotCollectRedRe and RedRe may do the same.
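
The tagging approach can be sketched as follows. This only illustrates the idea and is neither ASSP's code nor its configuration syntax: the tag `X-Newsletter-Tag: monthly`, the function name `skip_collection`, and the default `max_bytes` value are made up for the example, and the real matching is done inside ASSP.

```python
import re


def skip_collection(raw_message: str, tag_pattern: str, max_bytes: int = 10000) -> bool:
    """Return True if the tag occurs in the header or the start of the body.

    Hypothetical helper: max_bytes is applied to characters here, as a
    stand-in for the byte limit mentioned in the thread.
    """
    header, _, body = raw_message.partition("\n\n")
    scanned = header + "\n\n" + body[:max_bytes]
    return re.search(tag_pattern, scanned, re.IGNORECASE) is not None


TAG = r"X-Newsletter-Tag:\s*monthly"  # made-up tag, agreed with the sender

newsletter = ("From: news@example.org\nSubject: March news\n"
              "X-Newsletter-Tag: monthly\n\nHello all ...")
normal = "From: bob@example.org\nSubject: Lunch\n\nSee you at noon."
# A tag that only appears beyond the scanned part of the body is not seen.
late_tag = "Subject: x\n\n" + "a" * 20000 + " X-Newsletter-Tag: monthly"
```

A message for which `skip_collection` returns True would be left out of the corpus, which is why the tag has to sit in the header or early in the body.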


Thomas





From:    Scott MacLean <[hidden email]>
To:      ASSP development mailing list <[hidden email]>
Date:    10.03.2016 18:40
Subject: Re: [Assp-test] Max Number Duplicate File Names




Re: Max Number Duplicate File Names

Colin
In reply to this post by Scott MacLean-4
I have:

UseSubjectsAsMaillogNames - ticked
MaxAllowedDups - 5
MaintBayesCollection - ticked
MaxBayesFileAge - 21

I was actually surprised by this. I could have sworn that in the past ASSP
used to check when saving messages. I would see lines in the logs during
message receipt saying something like "max number of duplicates found",
followed by a message saying it was deleting the oldest, with the filename listed.

I've just run a test and ended up with 7 messages with the same subject so
that is presumably not happening for some reason, unless I am remembering
wrong.
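
The behavior described above (keep at most N files per subject, drop the oldest) can be sketched like this. It is an illustration under assumptions, not ASSP's implementation: the file name layout `<subject>--<6 digits>.txt` is taken from the example earlier in the thread, age is judged by modification time, and `find_excess_duplicates` is a hypothetical name.

```python
import os
import re
from collections import defaultdict
from typing import List

# Assumed layout, based on "Your_Donation_Receipt--123456.txt".
NAME_RE = re.compile(r"^(?P<subject>.+)--\d{6}\.txt$")


def find_excess_duplicates(folder: str, max_dups: int) -> List[str]:
    """Return paths of the oldest files beyond max_dups per subject."""
    by_subject = defaultdict(list)
    for name in os.listdir(folder):
        match = NAME_RE.match(name)
        if match:
            path = os.path.join(folder, name)
            by_subject[match.group("subject")].append(path)

    excess = []
    for paths in by_subject.values():
        # Newest first, so everything after the first max_dups is surplus.
        paths.sort(key=os.path.getmtime, reverse=True)
        excess.extend(paths[max_dups:])
    return excess
```

The function only reports candidates; whether they are deleted or moved to a discard folder is left to the caller, so it can be run as a dry check against a copy of the notspam folder.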

Scott: Yours could be easy to solve. Can you not identify those messages by
sender/source and redlist them to stop them from poisoning the corpus? I
had to do that for one client that receives thousands of messages per day
from their line of business app and ecommerce platform.


Re: Max Number Duplicate File Names

K Post
In reply to this post by Thomas Eckardt/eck
Isn't that exact same logic an argument for having the maximum number of
duplicate subjects apply to the HAM / notspam folder too?  5000 or 15000
copies of the same message sent individually by (untrainable / apathetic)
users would fill the notspam folder and mess up HMM / Bayesian, right?

And for those RE / FWD / no-subject emails, maybe we could have ASSP ignore
subjects shorter than, say, 5 or 6 characters when deleting duplicate file
names?  Then those files could get wiped out oldest first during the
maintenance.
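
The short-subject exemption suggested here could look roughly like the following. It is a sketch of the proposal only, not an existing ASSP option: `dup_limit_applies` and `min_subject_len` are hypothetical names, and the assumption that underscores stand in for spaces comes from the example file name earlier in the thread.

```python
import re

# Assumed layout: <subject>--<6 digits>.txt (subject may be empty).
NAME_RE = re.compile(r"^(?P<subject>.*)--\d{6}\.txt$")


def dup_limit_applies(filename: str, min_subject_len: int = 6) -> bool:
    """Decide whether a file should count toward the duplicate limit."""
    match = NAME_RE.match(filename)
    if not match:
        return False  # not a subject-named log file
    # Assumption: underscores replace spaces in the stored names.
    subject = match.group("subject").replace("_", " ").strip()
    return len(subject) >= min_subject_len
```

Files that fail this check would simply be left to the normal oldest-first maintenance.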


On Thu, Mar 10, 2016 at 11:18 AM, Thomas Eckardt <[hidden email]> wrote:

> Just think about the logic behind Bayesian and HMM - this will answer your
> question.
>
> Having the same mail in the spam folder many times will score its content
> as extremely spam-heavy, even if your own users send the same content,
> only less often.

Re: Max Number Duplicate File Names

K Post
One of our staff inadvertently sent about 3400 copies of the same test
message out through our server.  Okay, okay, it was me: I had a loop coded
wrong, and before I noticed what was going on and could stop it, about 3400
copies of the same message had gone out. Fortunately, they were just to me.
Sure enough, all 3400 were in notspam.

So, could we keep discussing this, and does it make sense to?



Re: Max Number Duplicate File Names

K Post
From Thomas, posted elsewhere:
>Remains the (my) question - what should be done with mails that
>reach the 'MaxAllowedHamDups' without breaking any concept and without
>creating a new folder (which breaks several concepts)?

The scenario where a bonehead user sends 5000 copies of the same message in
an Outlook mail merge isn't just a conceptual possibility; it happens.  And
it's happening more and more frequently despite training, memos, reminders,
and a very good email blast system that eliminated the need for mail merges.

What if, during the nightly cleanup, you were to delete files with the same
name in excess of max dups, then delete, as you already do, files in
excess of the maximum total number of files?  I thought that was what was
already happening with the spam corpus, but apparently not.
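
The two-pass cleanup proposed above can be sketched as follows.  This is only
an illustration of the idea, not ASSP code: the function name plan_cleanup,
its parameters, and the assumption that the subject is everything before the
trailing "--<digits>.txt" are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of ASSP. It assumes file names look like
# "Your_Donation_Receipt--123456.txt" (subject, "--", digits, ".txt").
SUFFIX = re.compile(r"--\d+\.txt$")


def plan_cleanup(files, max_dups, max_files):
    """Return the names of files that the two-pass cleanup would delete.

    files: list of (name, mtime) tuples; a larger mtime means a newer file.
    Pass 1 keeps only the newest `max_dups` files per subject.
    Pass 2 deletes oldest-first until at most `max_files` files remain.
    """
    by_subject = defaultdict(list)
    for name, mtime in files:
        subject = SUFFIX.sub("", name)
        by_subject[subject].append((mtime, name))

    delete = []
    survivors = []
    for entries in by_subject.values():
        entries.sort(reverse=True)            # newest first
        survivors.extend(entries[:max_dups])  # keep the newest few
        delete.extend(name for _, name in entries[max_dups:])

    survivors.sort()                          # oldest first
    excess = max(0, len(survivors) - max_files)
    delete.extend(name for _, name in survivors[:excess])
    return delete
```

With five "Your_Donation_Receipt" files, two other files, max_dups=3 and
max_files=4, the sketch removes the two oldest receipts in pass 1 and the
next-oldest receipt in pass 2, leaving the two unrelated files untouched.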

I only see upside to limiting the number of dups in notspam, but you've
stated elsewhere that the arguments herein don't make sense to you.  If
you're saying that what we suggest doesn't make any sense, I know that we
must be missing something significant.  I know that Bayesian filtering works
really well, but I only understand the inner workings from 35,000 feet.  I
just can't understand how making every effort to ensure that our notspam
corpus remains diverse doesn't make sense.

Thanks again.  Hope we can continue this discussion.



Re: Max Number Duplicate File Names

Thomas Eckardt/eck
Bonehead user sends 5000 -> LocalFrequencyInt and the config options that follow it

Regular user sends 5000 -> noCollecting, noCollectRe, ...

This is not a coding task - this is an organizing and configuration task.
As I always say - RTFM!

>then delete as you already do files in
>excess of the maximum total number of files?

Oldest first - no content check.

>that our notspam corpus remains diverse

For HMM and Bayes, having the 100% identical mail body 5000 times in one
folder is the same as having that mail once in this folder.
Having the same mail once in the opposite folder eliminates all 5000
for HMM and Bayes.
BTW: this is independent of the filename or subject.
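
A toy model of the behaviour described above, for illustration only.  It
assumes, without confirmation from this thread or from the ASSP source, that
the rebuild effectively works on the set of distinct mail bodies and drops
any body found in both folders.  The function name effective_corpus and the
use of SHA-256 digests are hypothetical choices.

```python
import hashlib


def effective_corpus(spam_bodies, ham_bodies):
    """Toy model, not ASSP code.

    Identical bodies collapse to one entry per folder, and a body present
    in both folders is removed from both.
    Returns (spam_digests, ham_digests) as sets of hex digests.
    """
    def digests(bodies):
        return {hashlib.sha256(b.encode("utf-8")).hexdigest() for b in bodies}

    spam, ham = digests(spam_bodies), digests(ham_bodies)
    conflicted = spam & ham  # same body seen on both sides
    return spam - conflicted, ham - conflicted
```

In this model, 5000 identical receipts in notspam contribute a single entry,
and one copy of that receipt in the spam folder removes it from both sides.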

This is not new (it has been so for more than 10 years), because it is one
of the basic concepts of HMM and Bayes.

>I know that we must be missing something significant.

Yes - the concept!

You are wasting my time, Ken.

Thomas




Re: Max Number Duplicate File Names

K Post
I'm sorry if you're finding this discussion to be such a time waster.  My
goal is to improve ASSP for all, not to waste your time; you must know
that.

In the interest of conserving your time - summary question:
*Wouldn't it be better for ASSP to remove duplicate file names in excess of
X from notspam than to keep them and instead remove other files with more
varied notspam content?*

Expanded:

It's helpful to now understand that the HMM/Bayes analysis doesn't weigh
repeated copies more heavily than a single copy, and that one copy in the
opposite folder cancels them all out.  Thank you for that explanation.  When
users do mail merges, the body is often subtly different (a different "Dear"
line or other per-person customization, for example), but based on what
you're saying, I'd think the messages would be similar enough to behave the
way you describe.  So, good.

But what is the downside of having ASSP remove file names with more than X
copies in notspam?  I understand that keeping more wouldn't increase
scoring for the content of >those< messages, but wouldn't it also push out,
say, 5000 of the OTHER files that we want when the rebuild removes files
based on their age, and therefore give us a file store that's not as diverse
as it could be?  Isn't that a bad thing, or at least not as good as it would
be if the emails with duplicate file names were removed?

LocalFrequencyInt isn't going to help, at least in my case.  If the director
wants to ignore my instruction and policy, she needs to be able to.  Yes,
this is a policy problem, but the people high up in the charity will always
argue that if they need to send a message, they're going to.  They don't
pay me much, but it's a job that I need - I can't risk that by turning on
LocalFrequencyInt.  I don't see how noCollecting / noCollectRe is going to
help either.  I have no way of knowing who is going to send next or what
they're going to send.

I guess I just don't see the downside (other than your time in coding and
testing, along with a slightly longer cleanup process) of having ASSP remove
those duplicate file names at cleanup time, before removing oldest first.
Wouldn't that be better than a notspam folder that, once cleanup runs, could
hold only a handful of files with significantly different content (say, if a
couple of users sent a boatload of mail merges in one day)?
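
The displacement being argued about here can be put into numbers with a small
sketch.  It assumes the folder is pruned oldest-first down to MaxFiles, that
the burst consists of files with one shared subject, and that a duplicate cap
(if any) is applied before that pruning.  The function name
displaced_by_burst and its parameters are hypothetical.

```python
def displaced_by_burst(existing, max_files, burst, max_dups=None):
    """Count older files pushed out by oldest-first pruning after a burst.

    existing:  files already in the folder before the burst
    max_files: folder limit enforced at cleanup
    burst:     number of newly collected files sharing one subject
    max_dups:  optional cap on same-subject files, applied before pruning
    """
    kept_from_burst = burst if max_dups is None else min(burst, max_dups)
    overflow = existing + kept_from_burst - max_files
    # Cannot displace more older files than actually exist.
    return min(existing, max(0, overflow))
```

With a full folder of 15000 files and MaxFiles at 15000, a burst of 5000
displaces 5000 older files (one third of the folder) and a burst of 20000
displaces all of them, while a cap of 3 duplicates limits the loss to 3.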







Re: Max Number Duplicate File Names

Thomas Eckardt/eck
The concept of Bayes, or at least of HMM, is too complex for most admins.
ASSP is nice and stores all files in clear text and with nice filenames in
the filesystem, if configured this way.

Now someone like you looks into these folders and tries to find a reason
to do something so that the folder looks nicer to humans.
Now assume I would store all files RSA-encrypted in a database. Nobody
would care about the content, the size, the subject, or the count of
records, as long as the corpus norm is fine and the detection rate is OK,
because nobody would be able to read anything in this database.
Looking back to the old days of ASSP, move2num was used and the filenames
were built using random numbers, so it was possible that any file could
be deleted at any time. Not really nice, but it worked: numbers, only
numbers, nothing more, and the highest randomness you can think of.

Now, reading this, you must come to a conclusion: 'MaxAllowedDups' is
NONSENSE. And YES, you are right - SWITCH IT OFF!

Thomas
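
For readers who never ran the old numbered scheme, the effect described
above can be sketched roughly as follows.  This is only an illustration
under stated assumptions, not the move2num code: it assumes each mail is
stored under a random number between 1 and MaxFiles (as the MaxFiles
description quoted earlier in this thread says) and that a mail drawing an
already-used number simply overwrites the older file.  The function name is
hypothetical.

```python
import random


def simulate_random_slots(messages, max_files, seed=0):
    """Store `messages` mails, each under a random number in 1..max_files.

    A later mail that draws an already-used number overwrites the earlier
    file, so the folder ends up holding fewer than max_files files.
    """
    rng = random.Random(seed)
    slots = {}
    for msg_id in range(messages):
        slots[rng.randint(1, max_files)] = msg_id   # overwrite on collision
    return slots


slots = simulate_random_slots(messages=15000, max_files=15000)
print(len(slots), "files on disk after 15000 mails")
```

Under these assumptions, any stored mail can be replaced at any time,
whatever its subject or age, which matches the point that the old scheme
did no per-subject bookkeeping at all.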








Re: Max Number Duplicate File Names

K Post
Summary:
*Isn't it important to have as random and diverse a sampling in the file
store as possible?*  I feel like removing excessive duplicate file names
before cleaning up the file store gives us the best chance of using the
most varied set of messages possible, both for building Bayes/HMM and for
manual review, resends, copying to corrected, etc.

Detailed discussion:

You're right, I came to the conclusion that you thought I would at the end
of your last post: if you argue that it doesn't make sense for notspam,
then it doesn't make sense for spam either. But I still disagree.
Please allow me to explain better.

It's funny that you mention move2num; I was just looking at that on a
really old installation.  I was also trying to find my circa-2003 script
that removed duplicate file names before the rebuild. (I couldn't find it,
but I stopped using it once MaxDupFileNames was introduced.)

And yes, you nailed it: I periodically poke around in the file store
manually, but that's not why I >still< think it's important to have the
removal of dups.  A random selection of varying data is important, no?

If we have a maximum number of total files, a system that removes the
excess based on file age, and the potential for a single user to send the
same email more times than the maximum number of files allowed, we could
end up with NO data diversity in notspam.  All of the older files would be
deleted, leaving almost all of the remaining files identical, with no
other files to build that random sampling from.  Right?
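
The worry in the paragraph above can be made concrete with a small
simulation. This is only a sketch in Python, not ASSP's actual maintenance
code: the helper name trim_oldest_first, the tuple layout, and the sample
subjects are all assumptions for illustration. It models a folder capped at
MaxFiles in which cleanup deletes the oldest files first without looking at
the content.

```python
from collections import Counter

def trim_oldest_first(files, max_files):
    """Keep only the newest `max_files` entries.

    `files` is a list of (timestamp, subject) tuples; this is a hypothetical
    stand-in for a collection folder, not ASSP's real data structure.
    """
    newest_first = sorted(files, key=lambda f: f[0], reverse=True)
    return newest_first[:max_files]

MAX_FILES = 15000  # value of MaxFiles quoted in this thread

# 15,000 older messages, each with its own subject.
folder = [(t, f"subject {t}") for t in range(15000)]
# A later mail merge: 15,000 messages that all share one subject.
folder += [(15000 + t, "Your Donation Receipt") for t in range(15000)]

kept = trim_oldest_first(folder, MAX_FILES)
subjects = Counter(subject for _, subject in kept)
print(len(kept), "files kept,", len(subjects), "distinct subject(s)")
# prints: 15000 files kept, 1 distinct subject(s)
```

In this toy model every file that survives the trim comes from the mail
merge. Whether that actually hurts the Bayes/HMM databases is a separate
question that the sketch does not answer.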

And even if we went back to number only filenaming, or something like md5
filenames, the same problem would exist - we only have so many files to
work with, if you replace enough of them with essentially the same file,
that just removes data that we wanted more than what we replaced it with.

Simply put, if we replaced 15,000 not spam files with the same not spam
file 15,000 times and did a rebuild, wouldn't we get a bayesian/hmm
database that's not as good as what we would have had if we first removed
those duplicate file names?

Wouldn't turning off MaxAllowedDups open us up to the possibility that
the spam corpus fills with the same message over and over (same subject
at least), which would result in the same problem: the spam folder
having fewer different messages than it could have had otherwise?

Besides my own neuroses, isn't it important for the blockreport and manual
inspections to keep as many of the message files in place as possible for
resends, corrections, etc.?  Yes, deleting all of the excess duplicate
filenames means that those duplicates wouldn't be available, but playing
the odds, files with different filenames are more likely to differ from
each other than files with the same filename.
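
For illustration, the two-step cleanup I'm suggesting could look roughly
like the Python sketch below. It is not ASSP code. The function names, the
(timestamp, filename) tuples, and the choice to keep the newest copies of
each subject are assumptions; the only thing taken from the real system is
the "Subject--123456.txt" naming pattern.

```python
from collections import defaultdict

def subject_of(filename):
    """Strip the trailing '--<number>.txt' part, assuming the
    'Subject--123456.txt' naming used when subjects are file names."""
    return filename.rsplit("--", 1)[0]

def cleanup(files, max_dups, max_files):
    """Return the files to keep; `files` holds (timestamp, filename) tuples.

    Step 1: per subject, keep only the newest `max_dups` files.
    Step 2: if still over `max_files`, keep only the newest `max_files`.
    """
    newest_first = sorted(files, key=lambda f: f[0], reverse=True)
    seen = defaultdict(int)
    deduped = []
    for ts, name in newest_first:
        subject = subject_of(name)
        if seen[subject] < max_dups:
            seen[subject] += 1
            deduped.append((ts, name))
    return deduped[:max_files]

# Ten older messages with different subjects, then a 20-message mail merge.
folder = [(t, f"Newsletter_{t}--{100000 + t}.txt") for t in range(10)]
folder += [(10 + t, f"Your_Donation_Receipt--{200000 + t}.txt") for t in range(20)]

kept = cleanup(folder, max_dups=3, max_files=10)
kept_subjects = [subject_of(name) for _, name in kept]
print(len(kept), "files kept,", len(set(kept_subjects)), "distinct subjects")
# prints: 10 files kept, 8 distinct subjects
```

With oldest-first trimming alone, the same toy folder would end up holding
ten copies of the receipt and nothing else.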




On Mon, Mar 21, 2016 at 2:19 PM, Thomas Eckardt <[hidden email]>
wrote:

> The concept of Bayes, or at least of HMM, is too complex for most admins.
> ASSP is nice and stores all files in clear text and with nice filenames in
> the filesystem, if configured this way.
>
> Now someone like you looks into these folders and tries to find a reason
> to do something, so that the folder looks nicer to humans.
> Now assume I stored all files RSA-encrypted in a database. Nobody
> would care about the content, the size, the subject, or the count of
> records as long as the corpus norm is fine and the detection rate is OK,
> because nobody would be able to read anything in this database.
> Looking back to the old days of ASSP, move2num was used and the filenames
> were built using random numbers, and it was possible that any file could
> be deleted at any time. Not really nice, but it worked: numbers, only
> numbers, nothing more, and the highest randomness you can think of.
>
> Now, reading this, you must come to a conclusion: 'MaxAllowedDups' is
> NONSENSE. And YES, you are right: SWITCH IT OFF!
>
> Thomas

Re: Max Number Duplicate File Names

Thomas Eckardt/eck
>If we have a maximum number of total files, a system that removes excess
>based on file age, and the potential that there could be a single user that
>could send the same email more than the maximum number of files that are
>allowed, that could give us NO data diversity in notspam.  All of the older
>files would be deleted, leaving us with almost all of the remaining files
>being identical, without any other files to build that random sampling
>from.  Right?

It is a configuration mistake if you set 'MaxFiles' too low and you don't
use UseSubjectsAsMaillogNames.

It is very simple. If you can find and explain a mathematical proof, for
the stochastic analysis used by ASSP, of whether and how any count of equal
mails in any collection folder will compromise the corpus norm and/or the
corpus confidence, then and ONLY then will I think about this suggestion.

precondition:

- a finished rebuildspamdb
- MaxBytes is set to 8,000
- MaxFiles is ignored because UseSubjectsAsMaillogNames is configured
- 5000 files in the spam folder
- 2000 files in the notspam folder
- 500,000 records in spamdb
- 1,000,000 records in HMMdb
- Spam Weight == 3,500,000
- Not-Spam Weight == 3,500,000
- corpus norm == 1
- corpus confidence == 1

event after the rebuildspamdb:

- an arbitrary count of equal files is stored in one or multiple collection
  folders
- the length of the mail body is 10,000 bytes

proof:

after the next rebuildspamdb:

- what is the minimum corpus norm and the minimum corpus confidence if all
  files are stored in the notspam folder?
- what is the maximum corpus norm and the minimum corpus confidence if all
  files are stored in the spam folder?
- what are the maximum and minimum corpus norm and the minimum corpus
  confidence if the files are stored randomly in the spam and notspam
  folders, for these distributions (spam / notspam)?
  25% / 75%
  50% / 50%
  75% / 25%
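
To make the questions above concrete, here is a rough sketch of the
arithmetic in Python. It rests on assumptions that are not confirmed in this
thread: that corpus norm is simply the spam weight divided by the not-spam
weight (which at least matches equal weights giving a norm of 1), that each
stored copy adds the same hypothetical amount of weight (weight_per_mail),
and that the equal mails in a folder are either all counted or counted only
once. The 5000 copies and the weight of 800 per mail are made-up inputs. The
sketch says nothing about corpus confidence or the detection rate.

```python
def corpus_norm(spam_weight, ham_weight):
    """Assumed definition: ratio of spam weight to not-spam weight."""
    return spam_weight / ham_weight

def norm_after_event(copies, spam_share, weight_per_mail, count_each_copy,
                     spam_weight=3_500_000, ham_weight=3_500_000):
    """Corpus norm after `copies` equal mails are added to the folders.

    spam_share is the fraction of the copies that land in the spam folder.
    If count_each_copy is False, the equal mails in a folder are treated
    as a single mail.
    """
    spam_copies = round(copies * spam_share)
    ham_copies = copies - spam_copies
    if not count_each_copy:
        spam_copies = min(spam_copies, 1)
        ham_copies = min(ham_copies, 1)
    return corpus_norm(spam_weight + spam_copies * weight_per_mail,
                       ham_weight + ham_copies * weight_per_mail)

for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    each = norm_after_event(5000, share, 800, count_each_copy=True)
    once = norm_after_event(5000, share, 800, count_each_copy=False)
    print(f"spam share {share:.0%}: every copy counted {each:.3f}, "
          f"counted once {once:.3f}")
```

Under these assumptions the norm only moves noticeably when every copy is
counted and the copies land mostly in one folder.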

Until now, this is more or less simple mathematics (+-*/ exp), but here is
the last and most important question:

Explain, for only one of the above cases, why and how the changed corpus
norm and corpus confidence will affect the average detection rate for spam
and notspam of the Bayesian and HMM engines, for 100,000 incoming mails.
The detection rate before the event occurred was 99.0% spam detection with
no false positives, for the Bayesian and HMM engines (only!).

assumed distribution of real spam / notspam:

95,000 / 5,000
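
As a reference point, the baseline before the event works out as follows.
This is plain arithmetic on the figures given above (100,000 incoming
mails, 95,000 of them spam, 99.0% spam detection, no false positives); the
variable names are only illustrative.

```python
incoming = 100_000
real_spam, real_ham = 95_000, 5_000
assert real_spam + real_ham == incoming

# 99.0% spam detection, written as 990 per 1000 to stay in integers.
spam_caught = real_spam * 990 // 1000
spam_missed = real_spam - spam_caught
ham_blocked = 0  # "no false positives"

print(spam_caught, spam_missed, ham_blocked)  # prints: 94050 950 0
```

Any answer to the question has to say how these three numbers change after
the next rebuild.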

good luck

Thomas



Von:    K Post <[hidden email]>
An:     ASSP development mailing list <[hidden email]>
Datum:  21.03.2016 20:31
Betreff:        Re: [Assp-test] Max Number Duplicate File Names



Summary:
*Isn't it important to have as random and diverse of a sampling in the
file
store as possible?  *I feel like removing excessive duplicate file names
before cleaning up the file store gives us the best chance of using the
most varied set of messages possible for both building bayes/hmm and also
manual review, resends, copying to corrected, etc.

Detailed discussion:

You're right, I came to the conclusion that you thought I would at the end
of your last post  - if you argue that it doesn't make sense for notspam,
then it doesn't make sense for spam either...  but I still disagree.
Please allow me to explain better.

It's funny that you talk about move2num, I was just looking at that from a
really old installation.  I was also trying to find my circa 2003 script
that removed duplicate file names before the rebuild. (couldn't find it,
but I stopped using that once MaxDupFileNames was introduced).

And yes, you nailed it that I periodically manually poke around in the
file
store, but that's not why I'm >still< thinking that it's still important
to
have the removal of dups.  A random selection of varying data is important
no?

If we have a maximum number of total files, a system that removes excess
based on file age, and the potential that there could be a single user
that
could send the same email more than the maximum number of files that are
allowed, that could give us NO data diversity in notspam.  All of the
older
files would be deleted, leaving us with almost all fo the remaining files
being identical, without any other files to build that random sampling
from.  Right?

And even if we went back to number only filenaming, or something like md5
filenames, the same problem would exist - we only have so many files to
work with, if you replace enough of them with essentially the same file,
that just removes data that we wanted more than what we replaced it with.

Simply put, if we replaced 15,000 not spam files with the same not spam
file 15,000 times and did a rebuild, wouldn't we get a bayesian/hmm
database that's not as good as what we would have had if we first removed
those duplicate file names?

Wouldn't turning off maxalloweddups open us up for the the potential that
the spam corpus could be filled with the same message over and over (same
subject at least) that would then result in the same problem - the spam
folder having fewer different messages than we could have had otherwise.

Besides my own neuroses, isn't it important for the blockreport and manual
inspections to also keep as many of the message files in place for
resends,
corrections, etc?  Yes, deleting all of the excess duplicate filenames
means that those duplicates they wouldn't be available, but playing our
odds, it's more likely that files with different filenames are different
than those with the same file names.




On Mon, Mar 21, 2016 at 2:19 PM, Thomas Eckardt
<[hidden email]>
wrote:

> The concept of bayes but at least for HMM is too complex for most
admins.
> ASSP is nice and stores all file in clear text and with nice filenames
in
> filesystem if configured this way.
>
> No someone like you, looks in to this folders and tries to find a reason
> to do something - that this folder looks more nice to humans.
> Now assume, I would store all files RSA encrypted in a database. Nobody
> would care about the content, the size, the subject and the count of
> records as long as the corpus norm is fine and the detection rate is OK.
> Because nobody would be able to read anything in this database.
> Looking back to the old days of assp, move2num was used and the
filenames

> were build using random numbers and it was possible, that any file could
> be deleted at any time. Not really nice but it worked - numbers only
> numbers nothing more - and the highest randomness you can think about..
>
> Now - reading this you must come to a conclusion - 'MaxAllowedDups' is
> NONSENSE - and YES you are right - SWITCH IT OFF!
>
> Thomas
>
>
>
>
>
>
>
> Von:    K Post <[hidden email]>
> An:     ASSP development mailing list <[hidden email]>
> Datum:  21.03.2016 18:56
> Betreff:        Re: [Assp-test] Max Number Duplicate File Names
>
>
>
> I'm worry if you're finding this discussion to be such a time waster..
My

> goal is to improve ASSP for all, not to waste your time, you must know
> that.
>
> In the interest of conserving your time - summary question:
> *Wouldn't it be better for ASSP to remove duplicate file names in excess
> of
> X from notspam than for it to not remove them and instead remove other
> files with more varied notspam content?*
>
> Expanded:
>
> It's helpful for me to now understand that the hmm/bayes analysis
doesn't
> weigh repetition more heavily than just one file in the opposite folder.
> Thank you for that explanation.  When the users do mail merges, a lot of
> the time, the body is subtly different (different dear line or other per
> person customization for example), but based on what you're saying, I'd
> think that they'd be substantially similar enough to act the same way as
> you describe.  So good.
>
> But, what is the downside of having ASSP remove filenames with more than
X
> of the same in notpsam?  I understand that having more wouldn't increase
> scoring for the content of >those< messages, but wouldn't it also remove
> say 5000 of the OTHER files that we want during a rebuild based on their
> age, and therefore give us a file store that's not as diverse as it
could
> be?  Isn't that a bad thing or at least not as good as it would be if
the
> duplicate file name emails were removed?
>
> Localfrequency isn't going to help, at least in my case.  If the
director
> wants to ignore my instruction and policy, she needs to be able to. Yes,
> this is a policy problem, but the people high up in the charity will
> always
> argue that if they need to send a message, they're going to.  They don't
> pay me much, but it's a job that I need - I can't risk that by turning
on
> localfrequency.  I don't see how nocollecting / re is going to help.  I
> have no way of knowing who is going to send next or what they're going
to
> send.
>
> I guess I just don't see the downside (other than your time in coding
and
> testing along with a slightly longer cleanup process) to have ASSP
remove
> those duplicate file names at cleanup time, before removing oldest
first.

> Wouldn't that be better than having a notspam folder that once cleanup
> runs
> could only have only a handful of files that are significantly different
> content (say if a couple users sent a boatload of mailmerges in 1 day)?
>
>
>
>
>
>
> On Mon, Mar 21, 2016 at 12:49 PM, Thomas Eckardt
> <[hidden email]
> > wrote:
>
> > bonehead user sends 5000 -> LocalFrequencyInt and next configs
> >
> > regular user sends 5000 -> noCollecting , noCollectRe ...........
> >
> > This is not a coding task - this is an organizing and configuration
> task.
> > As I always say - RTMF!
> >
> > >then delete as you already do files in
> > >excess of the maximum total number of files?
> >
> > Oldest fist - no content check.
> >
> > >that our notspam corpus remains diverse
> >
> > having 5000 times the 100% same mail-body in one folder is the same,
> like
> > having the mail one time in this folder for HMM and bayes
> > having the same mail in the opposit folder one time - elimiates all
the

> > 5000 for HMM and bayes
> > BTW : this is independend from the filename or subject
> >
> > This is not new (since more than 10 years) - because it is one of the
> > basic concepts of HMM and bayes.
> >
> > >I know that we must be missing something significant.
> >
> > Yes - the concept!
> >
> > You waste my time Ken.
> >
> > Thomas
> >
> >
> >
> > Von:    K Post <[hidden email]>
> > An:     ASSP development mailing list
<[hidden email]>

> > Datum:  21.03.2016 16:41
> > Betreff:        Re: [Assp-test] Max Number Duplicate File Names
> >
> >
> >
> > -From Thomas, posted elsewhere
> > >Remains the (my) question - what should be done with mails that
> > >reach the 'MaxAllowedHamDups' without breaking any concept and without
> > >creating a new folder (which breaks several concepts)?
> >
> > The scenario where a bonehead user sends 5000 of the same message in an
> > Outlook mailmerge isn't just a conceptual possibility, it happens. And
> > it's happening more and more frequently despite training, memos, reminders,
> > and a very good email blast system in place that eliminated the need for
> > mailmerges.
> >
> > What about when doing the nightly cleanup if you were to delete files with
> > the same name in excess of max dups, then delete as you already do files in
> > excess of the maximum total number of files?  I thought that was what was
> > already happening with the spam corpus, but apparently not.
> >
> > I only see upside to limiting the number of dups in notspam, but you've
> > stated elsewhere that the arguments herein don't make sense to you. If
> > you're saying what we suggest doesn't make any sense, I know that we must
> > be missing something significant.  I know that Bayesian filtering works
> > really well, but I only understand the inner workings from 35,000 feet. I
> > just can't understand how making every effort to ensure that our notspam
> > corpus remains diverse doesn't make sense.
> >
> > Thanks again.  Hope we can continue this discussion.
> >
> > On Mon, Mar 14, 2016 at 5:28 PM, K Post <[hidden email]> wrote:
> >
> > > One of our staff inadvertently sent about 3400 of the same test messages
> > > out through our server.  Okay, okay, it was me - had a loop coded wrong, and
> > > before I noticed what was going on and could stop it, about 3400 of the same
> > > messages went out. Fortunately, they were just to me.  Sure enough, all
> > > 3400 were in notspam.
> > >
> > > So, could we, and does it make sense, to keep discussing this?
> > >
> > > On Thu, Mar 10, 2016 at 1:47 PM, K Post <[hidden email]> wrote:
> > >
> > >> Isn't that exact same logic an argument for having the maximum
number
> > of
> > >> duplicate subjects apply to the HAM / notspam folder too?  5000 or
> > 15000 of
> > >> the same message sent individually by (untrainable / apathetic)
users
> > would
> > >> fill the notspam folder and mess up HMM / Bayesian right?
> > >>
> > >> And for those RE / FWD / No subject emails, maybe we could have
ASSP
> > >> ignore subjects shorter than say 5 or 6 characters when deleting
> > duplicate
> > >> file names?  Then those files could get wiped out oldest first
during

> > the
> > >> maintenance.
> > >>
> > >> \
> > >>
> > >> On Thu, Mar 10, 2016 at 11:18 AM, Thomas Eckardt <
> > >> [hidden email]> wrote:
> > >>
> > >>> Just think about the logic behind Bayesian and HMM - this will
> answer
> > >>> your
> > >>> question.
> > >>>
> > >>> Having the same mail in the spam folder multiple times, this will
> > score
> > >>> the content to extreme spam havy, even your users are using the
same

> > >>> content - but less often.
> > >>>
> > >>> Thomas
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Von:    K Post <[hidden email]>
> > >>> An:     ASSP development mailing list
> > <[hidden email]>
> > >>> Datum:  10.03.2016 16:58
> > >>> Betreff:        Re: [Assp-test] Max Number Duplicate File Names
> > >>>
> > >>>
> > >>>
> > >>> I know you're all RTFM, but there's plenty of places in the GUI
> where
> > the
> > >>> description isn't exactly clear or right.  For example
> > >>>
> > >>> MaxFiles
> > >>> If you're not using subjects as file names (
> UseSubjectsAsMaillogNames
> > ),
> > >>> this is the maximum number of files to keep in each collection
(spam
> &
> > >>> nonspam)
> > >>> It's actually less than this -- files get a random number between
1

> > and
> > >>> MaxFiles.
> > >>>
> > >>> I AM using file names and MaxFiles DOES control the maximum number
> of
> > >>> files
> > >>> in each collection, despite what the description says when
> > >>> MaintBayesCollection is on and no max age is set. The language is
> not
> > >>> clear
> > >>> and that makes us assume things, sometimes incorrectly, about what
> the
> > >>> GUI
> > >>> really mean.  We've been working this way since ASSP came out.
> Because
> > >>> of
> > >>> this, I had no way of knowing that MaxAllowedDups >really< only
> > applied
> > >>> to
> > >>> the spam collection.  I assumed the GUI meant the whole log of
spam
> > and
> > >>> NOTspam.  I don't think that's an unreasonable assumption, or call
> it
> > an
> > >>> oversight, or a mistake on my part - but none of that justifies
and
> > angry
> > >>> sounding response from you.
> > >>>
> > >>>  I'm not looking for a fight, but I feel like I have to keep
> > justifying
> > >>> myself after you appear to be so angry with me, and the rest of
us,
> > who
> > >>> turn to you for enlightenment.  You're carrying the entire weight
of
> > this
> > >>> project on your shoulders.  It's a lot, I know,  Can we move on
and

> > have
> > >>> a
> > >>> reasonable discussion here?
> > >>>
> > >>> Is there a reason that MaxAllowedDups shouldn't also apply to the
> > notspam
> > >>> collection?   Shouldn't we want that to be the case for the same
> > reason
> > >>> that we have it for spam?   Maybe also to the errors collections?
> > >>>
> > >>> If we don't, wouldn't the case where a staff member sends the same
> > basic
> > >>> message to 5000 people (against my wishes, but I can't control
> > >>> everything)
> > >>> that'll take 1/3 of the other notspam messages out of the rebuild
> > >>> processes?  How about if 20k messages are sent?
> > >>>
> > >>> Maybe I'm just not understanding, and that's why I'm asking, but I
> > hope
> > >>> it
> > >>> doesn't result in any more scolding.
> > >>>
> > >>> Thank you
> > >>>
> > >>>
> > >>> On Thu, Mar 10, 2016 at 4:15 AM, Thomas Eckardt
> > >>> <[hidden email]>
> > >>> wrote:
> > >>>
> > >>> > >There are about 600 of those files in NotSpam.
> > >>> >
> > >>> > 'MaxAllowedDups','Max Number of Duplicate File Names'
> > >>> >   'The maximum number of logged files with the same filename
> > (subject)
> > >>> > that are stored in the spam folder (spamlog),........
> > >>> >
> > >>> > I'll write in Hebrew - possibly the english is better, if you
> > translate
> > >>> it
> > >>> > back to english.
> > >>> >
> > >>> > Thomas
> > >>> >
> > >>> >
> > >>> >
> > >>> > Von:    K Post <[hidden email]>
> > >>> > An:     ASSP development mailing list
> > <[hidden email]
> > >>> >
> > >>> > Datum:  10.03.2016 00:29
> > >>> > Betreff:        [Assp-test] Max Number Duplicate File Names
> > >>> >
> > >>> >
> > >>> >
> > >>> > I've got UseSubjectAsMaillogNames checked (the messages are
stored

> > in
> > >>> the
> > >>> > folders user the subject name followed by a 6 digit number as
> > expected)
> > >>> >
> > >>> > I've got MaxAllowedDups set to 3
> > >>> >
> > >>> > MaxBayesFileAge is 0
> > >>> > MaxFiles is 15000
> > >>> >
> > >>> > I'm noticing that MaxAllowedDups doesn't seem to be working.
> > >>> >
> > >>> > For example, a couple users often send emails with the subject
> > >>> > "Your Donation Receipt"
> > >>> > There are about 600 of those files in NotSpam.
> > >>> > Your_Donation_Receipt--123456.txt
> > >>> > where 123456 is a random differing number.
> > >>> >
> > >>> > Shouldn't only 3 of these files exist in the folder (with the
> > exception
> > >>> of
> > >>> > those that were sent since the rebuild / maintenance window)?
> > >>> >
> > >>> > Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Max Number Duplicate File Names

K Post
I had to look up "stochastic".  I barely understand the surface of what
you're challenging me to answer.  I rely on your wizardry, and that of
those who came before you, for all of this, and have faith that it works
(and absolutely have proof that it does in the real world).  I trust what
you're arguing - that it doesn't matter for the rebuild process - and I believe
you without question, though I don't understand why (and don't need to).

Let me be clearer: I no longer think that a poor distribution of
randomness in notspam will impact ASSP accuracy.  I don't know why, but if
you say it's true, I believe it.  BUT, why are you resistant to removing
these duplicate files in notspam at cleanup time before trashing others?
What's the downside to removing the duplicate notspam files by name?  *We're
going to have to delete some files during the cleanup, so why not give us
the best chance of keeping the ones that we care about most - if not for
the rebuild, then for manual analysis, copying to corrected folders,
etc.?*  Would it cause harm that I'm not understanding?
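
To make the proposed cleanup order concrete, here is a rough sketch in Python. It is only an illustration of the idea, not ASSP's actual maintenance code (which is Perl and is not shown in this thread). The function names `subject_key` and `cleanup` and the sample file names are made up for the example, and it assumes every file is named like Your_Donation_Receipt--123456.txt. It compares plain oldest-first trimming with trimming after excess duplicate names have been removed.

```python
import re
from collections import defaultdict


def subject_key(filename):
    """Hypothetical helper: drop the trailing '--<digits>.<ext>' so that
    files logged with the same subject share one key."""
    return re.sub(r"--\d+\.\w+$", "", filename)


def cleanup(files, max_files, max_dups=None):
    """files is a list of (filename, mtime) tuples; returns the names kept.

    Step 1 (the proposal): per subject, keep only the newest max_dups files.
    Step 2 (oldest first, no content check): drop the oldest files until
    at most max_files remain.
    """
    ordered = sorted(files, key=lambda f: f[1])  # oldest first
    if max_dups is not None:
        seen = defaultdict(int)
        kept = []
        for name, mtime in reversed(ordered):  # walk newest first
            key = subject_key(name)
            if seen[key] < max_dups:
                seen[key] += 1
                kept.append((name, mtime))
        ordered = kept[::-1]  # back to oldest first
    if len(ordered) > max_files:
        ordered = ordered[len(ordered) - max_files:]
    return [name for name, _ in ordered]


# Ten older mails with distinct subjects, then a 20-message mail merge.
older = [(f"Subject_{i}--{100000 + i}.txt", i) for i in range(10)]
merge = [(f"Your_Donation_Receipt--{200000 + i}.txt", 100 + i) for i in range(20)]
files = older + merge

plain = cleanup(files, max_files=12)
dedup = cleanup(files, max_files=12, max_dups=3)

print(len({subject_key(n) for n in plain}))  # distinct subjects kept, oldest-first only
print(len({subject_key(n) for n in dedup}))  # distinct subjects kept, dups removed first
```

In this toy run the plain trim keeps only mail-merge files, while the dedup-first trim keeps three of them plus nearly all of the older, differently named files. It says nothing about Bayes/HMM scoring; it only shows which files survive for manual review.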



On Wed, Mar 23, 2016 at 8:17 AM, Thomas Eckardt <[hidden email]>
wrote:

> >If we have a maximum number of total files, a system that removes excess
> >based on file age, and the potential that there could be a single user
> that
> >could send the same email more than the maximum number of files that are
> >allowed, that could give us NO data diversity in notspam.  All of the
> older
> >files would be deleted, leaving us with almost all fo the remaining files
> >being identical, without any other files to build that random sampling
> >from.  Right?
>
> It is a configuration mistake if you set 'MaxFiles' too low and you don't
> use UseSubjectsAsMaillogNames.
>
> It is very simple. If you can find and explain a mathematical proof, for
> the stochastic analysis used by ASSP, of whether and how any count of equal
> mails in any collection folder will compromise the corpus norm and/or the
> corpus confidence, then and ONLY then I'll think about this suggestion.
>
> precondition:
>
> - a finished rebuildspamdb
> - MaxBytes is set to 8.000
> - MaxFiles is ignored because UseSubjectsAsMaillogNames is configured
> - 5000 files in the spam folder
> - 2000 files in the notspam folder
> - 500.000 records in spamdb
> - 1.000.000 records in HMMdb
> - Spam Weight  =  3,500,000
> - Not-Spam Weight:   3,500,000
> - corpus norm == 1
> - corpus confidence == 1
>
> event after the rebuild spamdb:
>
> - an arbitrary count of equal files is stored in one or multiple collection
> folders.
> - the length of the mail body is 10.000 Bytes
>
> proof:
>
> after the next rebuild spamdb :
>
> - what is the minimum corpus norm and minimum corpus confidence if all
> files are stored in the notspam folder
> - what is the maximum corpus norm and minimum corpus confidence if all
> files are stored in the spam folder
> - what is the maximum and minimum corpus norm and minimum corpus
> confidence if the files are stored randomly in the spam and notspam folder
> - for the distribution
> spam / notspam
> 25% / 75%
> 50% / 50%
> 75% / 25%
>
> Until now, this is more or less simple mathematics (+-*/ exp) - but here is
> the last, but most important, question:
>
> Explain for only one of the above cases why and how the changed corpus
> norm and corpus confidence will affect the average detection rate for spam
> and notspam of the Bayesian and HMM engine - for 100.000 incoming mails.
> The detection rate before the event occurred was 99,0% spam detection with
> no false positives for the Bayesian and HMM engine (only!)
>
> assumed distribution of real spam / notspam
>
> 95.000 / 5.000
>
> good luck
>
> Thomas
>
>
>
> Von:    K Post <[hidden email]>
> An:     ASSP development mailing list <[hidden email]>
> Datum:  21.03.2016 20:31
> Betreff:        Re: [Assp-test] Max Number Duplicate File Names
>
>
>
> Summary:
> *Isn't it important to have as random and diverse of a sampling in the
> file
> store as possible?  *I feel like removing excessive duplicate file names
> before cleaning up the file store gives us the best chance of using the
> most varied set of messages possible for both building bayes/hmm and also
> manual review, resends, copying to corrected, etc.
>
> Detailed discussion:
>
> You're right, I came to the conclusion that you thought I would at the end
> of your last post  - if you argue that it doesn't make sense for notspam,
> then it doesn't make sense for spam either...  but I still disagree.
> Please allow me to explain better.
>
> It's funny that you talk about move2num, I was just looking at that from a
> really old installation.  I was also trying to find my circa 2003 script
> that removed duplicate file names before the rebuild. (couldn't find it,
> but I stopped using that once MaxDupFileNames was introduced).
>
> And yes, you nailed it that I periodically manually poke around in the
> file
> store, but that's not why I'm >still< thinking that it's still important
> to
> have the removal of dups.  A random selection of varying data is important
> no?
>
> If we have a maximum number of total files, a system that removes excess
> based on file age, and the potential that there could be a single user
> that
> could send the same email more than the maximum number of files that are
> allowed, that could give us NO data diversity in notspam.  All of the
> older
> files would be deleted, leaving us with almost all fo the remaining files
> being identical, without any other files to build that random sampling
> from.  Right?
>
> And even if we went back to number only filenaming, or something like md5
> filenames, the same problem would exist - we only have so many files to
> work with, if you replace enough of them with essentially the same file,
> that just removes data that we wanted more than what we replaced it with.
>
> Simply put, if we replaced 15,000 not spam files with the same not spam
> file 15,000 times and did a rebuild, wouldn't we get a bayesian/hmm
> database that's not as good as what we would have had if we first removed
> those duplicate file names?
>
> Wouldn't turning off maxalloweddups open us up for the the potential that
> the spam corpus could be filled with the same message over and over (same
> subject at least) that would then result in the same problem - the spam
> folder having fewer different messages than we could have had otherwise.
>
> Besides my own neuroses, isn't it important for the blockreport and manual
> inspections to also keep as many of the message files in place for
> resends,
> corrections, etc?  Yes, deleting all of the excess duplicate filenames
> means that those duplicates they wouldn't be available, but playing our
> odds, it's more likely that files with different filenames are different
> than those with the same file names.
>
>
>
>
> On Mon, Mar 21, 2016 at 2:19 PM, Thomas Eckardt
> <[hidden email]>
> wrote:
>
> > The concept of Bayes, or at least that of HMM, is too complex for most
> > admins.
> > ASSP is nice and stores all files in clear text and with nice filenames in
> > the filesystem if configured this way.
> >
> > Now someone like you looks into these folders and tries to find a reason
> > to do something so that these folders look nicer to humans.
> > Now assume I would store all files RSA encrypted in a database. Nobody
> > would care about the content, the size, the subject and the count of
> > records as long as the corpus norm is fine and the detection rate is OK,
> > because nobody would be able to read anything in this database.
> > Looking back to the old days of ASSP, move2num was used and the filenames
> > were built using random numbers, and it was possible that any file could
> > be deleted at any time. Not really nice, but it worked - numbers, only
> > numbers, nothing more - and the highest randomness you can think about.
> >
> > Now - reading this you must come to a conclusion - 'MaxAllowedDups' is
> > NONSENSE - and YES you are right - SWITCH IT OFF!
> >
> > Thomas
> >
> >
> >
> >
> >
> >
> >
> > Von:    K Post <[hidden email]>
> > An:     ASSP development mailing list <[hidden email]>
> > Datum:  21.03.2016 18:56
> > Betreff:        Re: [Assp-test] Max Number Duplicate File Names
> >
> >
> >
> > I'm sorry if you're finding this discussion to be such a time waster. My
> > goal is to improve ASSP for all, not to waste your time; you must know
> > that.
> >
> > In the interest of conserving your time - summary question:
> > *Wouldn't it be better for ASSP to remove duplicate file names in excess
> > of
> > X from notspam than for it to not remove them and instead remove other
> > files with more varied notspam content?*
> >
> > Expanded:
> >
> > It's helpful for me to now understand that the hmm/bayes analysis
> doesn't
> > weigh repetition more heavily than just one file in the opposite folder.
> > Thank you for that explanation.  When the users do mail merges, a lot of
> > the time, the body is subtly different (different dear line or other per
> > person customization for example), but based on what you're saying, I'd
> > think that they'd be substantially similar enough to act the same way as
> > you describe.  So good.
> >
> > But, what is the downside of having ASSP remove filenames with more than
> X
> > of the same in notpsam?  I understand that having more wouldn't increase
> > scoring for the content of >those< messages, but wouldn't it also remove
> > say 5000 of the OTHER files that we want during a rebuild based on their
> > age, and therefore give us a file store that's not as diverse as it
> could
> > be?  Isn't that a bad thing or at least not as good as it would be if
> the
> > duplicate file name emails were removed?
> >
> > Localfrequency isn't going to help, at least in my case.  If the
> director
> > wants to ignore my instruction and policy, she needs to be able to. Yes,
> > this is a policy problem, but the people high up in the charity will
> > always
> > argue that if they need to send a message, they're going to.  They don't
> > pay me much, but it's a job that I need - I can't risk that by turning
> on
> > localfrequency.  I don't see how nocollecting / re is going to help.  I
> > have no way of knowing who is going to send next or what they're going
> to
> > send.
> >
> > I guess I just don't see the downside (other than your time in coding
> and
> > testing along with a slightly longer cleanup process) to have ASSP
> remove
> > those duplicate file names at cleanup time, before removing oldest
> first.
> > Wouldn't that be better than having a notspam folder that once cleanup
> > runs
> > could only have only a handful of files that are significantly different
> > content (say if a couple users sent a boatload of mailmerges in 1 day)?
> >
> >
> >
> >
> >
> >
> > On Mon, Mar 21, 2016 at 12:49 PM, Thomas Eckardt
> > <[hidden email]
> > > wrote:
> >
> > > bonehead user sends 5000 -> LocalFrequencyInt and next configs
> > >
> > > regular user sends 5000 -> noCollecting , noCollectRe ...........
> > >
> > > This is not a coding task - this is an organizing and configuration
> > > task.
> > > As I always say - RTFM!
> > >
> > > >then delete as you already do files in
> > > >excess of the maximum total number of files?
> > >
> > > Oldest first - no content check.
> > >
> > > >that our notspam corpus remains diverse
> > >
> > > Having 5000 times the 100% same mail body in one folder is the same as
> > > having the mail one time in this folder for HMM and Bayes.
> > > Having the same mail in the opposite folder one time eliminates all the
> > > 5000 for HMM and Bayes.
> > > BTW: this is independent of the filename or subject.
> > >
> > > This is not new (for more than 10 years) - because it is one of the
> > > basic concepts of HMM and Bayes.
> > >
> > > >I know that we must be missing something significant.
> > >
> > > Yes - the concept!
> > >
> > > You waste my time Ken.
> > >
> > > Thomas
> > >
> > >
> > >
> > > Von:    K Post <[hidden email]>
> > > An:     ASSP development mailing list
> <[hidden email]>
> > > Datum:  21.03.2016 16:41
> > > Betreff:        Re: [Assp-test] Max Number Duplicate File Names
> > >
> > >
> > >
> > > -From Thomas, posted elsewhere
> > > >Remains the (my) question - what should be done with mails that
> > > >reaches the 'MaxAllowedHamDups' without breaking any concept and
> > without
> > > >creating a new folder (which breaks several concepts)?
> > >
> > > The scenario where a bonehead user sends 5000 of the same message in
> an
> > > Outlook mailmerge isn't just a conceptual possibility, it happens. And
> > > it's happening more and more frequently despite training, memos,
> > > reminders,
> > > and a very good email blast system in place that eliminated the need
> for
> > > mailmerges.
> > >
> > > What about when doing the nightly cleanup if you were to delete files
> > with
> > > the same name in excess of max dups, then delete as you already do
> files
> > > in
> > > excess of the maximum total number of files?  I thought that was what
> > was
> > > already happening with the spam corpus, but apparently not.
> > >
> > > I only see upside to limiting the number of dups it notspam, but
> you've
> > > stated elsewhere that the arguments herein don't make sense to you. If
> > > you're saying what we suggest doesn't make any sense, I know that we
> > must
> > > be missing something significant.  I know that bayesian filtering
> works
> > > really well, but I only understand the inner workings from 35,000
> feet.
> > I
> > > just can't understand how making every effort to insure that our
> notspam
> > > corpus remains diverse doesn't make sense.
> > >
> > > Thanks again.  Hope we can continue this discussion.
> > >
> > > On Mon, Mar 14, 2016 at 5:28 PM, K Post <[hidden email]> wrote:
> > >
> > > > On of our staff inadvertently sent about 3400 of the same test
> > messages
> > > > out through our server.  Okay, okay, it was me - had a loop coded
> > wrong
> > > and
> > > > before I noticed what was going on and could stop it about 3400 of
> the
> > > same
> > > > messages went out, fortunately, they were just to me.  Sure enough,
> > all
> > > > 3400 were in notspam.
> > > >
> > > > So, could we, and does it make sense, to keep discussing this?
> > > >
> > > > On Thu, Mar 10, 2016 at 1:47 PM, K Post <[hidden email]> wrote:
> > > >
> > > >> Isn't that exact same logic an argument for having the maximum
> number
> > > of
> > > >> duplicate subjects apply to the HAM / notspam folder too?  5000 or
> > > 15000 of
> > > >> the same message sent individually by (untrainable / apathetic)
> users
> > > would
> > > >> fill the notspam folder and mess up HMM / Bayesian right?
> > > >>
> > > >> And for those RE / FWD / No subject emails, maybe we could have
> ASSP
> > > >> ignore subjects shorter than say 5 or 6 characters when deleting
> > > duplicate
> > > >> file names?  Then those files could get wiped out oldest first
> during
> > > the
> > > >> maintenance.
> > > >>
> > > >> \
> > > >>
> > > >> On Thu, Mar 10, 2016 at 11:18 AM, Thomas Eckardt <
> > > >> [hidden email]> wrote:
> > > >>
> > > >>> Just think about the logic behind Bayesian and HMM - this will
> > answer
> > > >>> your
> > > >>> question.
> > > >>>
> > > >>> Having the same mail in the spam folder multiple times, this will
> > > score
> > > >>> the content to extreme spam havy, even your users are using the
> same
> > > >>> content - but less often.
> > > >>>
> > > >>> Thomas
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> Von:    K Post <[hidden email]>
> > > >>> An:     ASSP development mailing list
> > > <[hidden email]>
> > > >>> Datum:  10.03.2016 16:58
> > > >>> Betreff:        Re: [Assp-test] Max Number Duplicate File Names
> > > >>>
> > > >>>
> > > >>>
> > > >>> I know you're all RTFM, but there's plenty of places in the GUI
> > where
> > > the
> > > >>> description isn't exactly clear or right.  For example
> > > >>>
> > > >>> MaxFiles
> > > >>> If you're not using subjects as file names (
> > UseSubjectsAsMaillogNames
> > > ),
> > > >>> this is the maximum number of files to keep in each collection
> (spam
> > &
> > > >>> nonspam)
> > > >>> It's actually less than this -- files get a random number between
> 1
> > > and
> > > >>> MaxFiles.
> > > >>>
> > > >>> I AM using file names and MaxFiles DOES control the maximum number
> > of
> > > >>> files
> > > >>> in each collection, despite what the description says when
> > > >>> MaintBayesCollection is on and no max age is set. The language is
> > not
> > > >>> clear
> > > >>> and that makes us assume things, sometimes incorrectly, about what
> > the
> > > >>> GUI
> > > >>> really mean.  We've been working this way since ASSP came out.
> > Because
> > > >>> of
> > > >>> this, I had no way of knowing that MaxAllowedDups >really< only
> > > applied
> > > >>> to
> > > >>> the spam collection.  I assumed the GUI meant the whole log of
> spam
> > > and
> > > >>> NOTspam.  I don't think that's an unreasonable assumption, or call
> > it
> > > an
> > > >>> oversight, or a mistake on my part - but none of that justifies
> and
> > > angry
> > > >>> sounding response from you.
> > > >>>
> > > > >>> I'm not looking for a fight, but I feel like I have to keep justifying
> > > > >>> myself after you appear to be so angry with me, and the rest of us, who
> > > > >>> turn to you for enlightenment.  You're carrying the entire weight of
> > > > >>> this project on your shoulders.  It's a lot, I know.  Can we move on and
> > > > >>> have a reasonable discussion here?
> > > >>>
> > > > >>> Is there a reason that MaxAllowedDups shouldn't also apply to the
> > > > >>> notspam collection?   Shouldn't we want that to be the case for the same
> > > > >>> reason that we have it for spam?   Maybe also to the errors collections?
> > > >>>
> > > > >>> If we don't, then when a staff member sends the same basic message to
> > > > >>> 5000 people (against my wishes, but I can't control everything), won't
> > > > >>> that push 1/3 of the other notspam messages out of the rebuild process
> > > > >>> (5000 of the 15000 allowed by MaxFiles)?  How about if 20k messages are
> > > > >>> sent?
> > > >>>
> > > > >>> Maybe I'm just not understanding, and that's why I'm asking, but I hope
> > > > >>> it doesn't result in any more scolding.
> > > >>>
> > > >>> Thank you
> > > >>>
> > > >>>
> > > > >>> On Thu, Mar 10, 2016 at 4:15 AM, Thomas Eckardt <[hidden email]> wrote:
> > > >>>
> > > >>> > >There are about 600 of those files in NotSpam.
> > > >>> >
> > > >>> > 'MaxAllowedDups','Max Number of Duplicate File Names'
> > > > >>> >   'The maximum number of logged files with the same filename (subject)
> > > > >>> > that are stored in the spam folder (spamlog),........
> > > > >>> >
> > > > >>> > I'll write in Hebrew - possibly the English is better if you translate
> > > > >>> > it back to English.
> > > >>> >
> > > >>> > Thomas
> > > >>> >
> > > >>> >
> > > >>> >
> > > > >>> > From:    K Post <[hidden email]>
> > > > >>> > To:      ASSP development mailing list <[hidden email]>
> > > > >>> > Date:    10.03.2016 00:29
> > > > >>> > Subject: [Assp-test] Max Number Duplicate File Names
> > > >>> >
> > > >>> >
> > > >>> >
> > > > >>> > I've got UseSubjectAsMaillogNames checked (the messages are stored in
> > > > >>> > the folders using the subject name followed by a 6 digit number, as
> > > > >>> > expected)
> > > >>> >
> > > >>> > I've got MaxAllowedDups set to 3
> > > >>> >
> > > >>> > MaxBayesFileAge is 0
> > > >>> > MaxFiles is 15000
> > > >>> >
> > > >>> > I'm noticing that MaxAllowedDups doesn't seem to be working.
> > > >>> >
> > > >>> > For example, a couple users often send emails with the subject
> > > >>> > "Your Donation Receipt"
> > > >>> > There are about 600 of those files in NotSpam.
> > > >>> > Your_Donation_Receipt--123456.txt
> > > >>> > where 123456 is a random differing number.
> > > >>> >
> > > > >>> > Shouldn't only 3 of these files exist in the folder (with the
> > > > >>> > exception of those that were sent since the rebuild / maintenance
> > > > >>> > window)?
> > > >>> >
> > > >>> > Thanks
> > > >>> >
> > > >>> >
> > > >>>
> > > >>>
> > >
> > >
> >
> >
>
> ------------------------------------------------------------------------------
> > > >>> > Transform Data into Opportunity.
> > > >>> > Accelerate data analysis in your applications with
> > > >>> > Intel Data Analytics Acceleration Library.
> > > >>> > Click to learn more.
> > > >>> > http://pubads.g.doubleclick.net/gampad/clk?id=278785111&iu=/4140
> > > >>> > _______________________________________________
> > > >>> > Assp-test mailing list
> > > >>> > [hidden email]
> > > >>> > https://lists.sourceforge.net/lists/listinfo/assp-test
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > DISCLAIMER:
> > > >>> > *******************************************************
> > > >>> > This email and any files transmitted with it may be
> confidential,
> > > >>> legally
> > > >>> > privileged and protected in law and are intended solely for the
> > use
> > > of
> > > >>> the
> > > >>> >
> > > >>> > individual to whom it is addressed.
> > > >>> > This email was multiple times scanned for viruses. There should
> be
> > > no
> > > >>> > known virus in this email!
> > > >>> > *******************************************************
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>>
> > > >>>
> > >
> > >
> >
> >
>
> ------------------------------------------------------------------------------
> > > >>> > Transform Data into Opportunity.
> > > >>> > Accelerate data analysis in your applications with
> > > >>> > Intel Data Analytics Acceleration Library.
> > > >>> > Click to learn more.
> > > >>> > http://pubads.g.doubleclick.net/gampad/clk?id=278785111&iu=/4140
> > > >>> > _______________________________________________
> > > >>> > Assp-test mailing list
> > > >>> > [hidden email]
> > > >>> > https://lists.sourceforge.net/lists/listinfo/assp-test
> > > >>> >
> > > >>> >
> > > >>>
> > > >>>
> > > >>
> > > >
> > >
> > >
> >
> >
>
> ------------------------------------------------------------------------------
> > > Transform Data into Opportunity.
> > > Accelerate data analysis in your applications with
> > > Intel Data Analytics Acceleration Library.
> > > Click to learn more.
> > > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
> > > _______________________________________________
> > > Assp-test mailing list
> > > [hidden email]
> > > https://lists.sourceforge.net/lists/listinfo/assp-test
> > >
> > >
> > >
> > >
> > > DISCLAIMER:
> > > *******************************************************
> > > This email and any files transmitted with it may be confidential,
> > legally
> > > privileged and protected in law and are intended solely for the use of
> > the
> > >
> > > individual to whom it is addressed.
> > > This email was multiple times scanned for viruses. There should be no
> > > known virus in this email!
> > > *******************************************************
> > >
> > >
> > >
> > >
> >
> >
>
> ------------------------------------------------------------------------------
> > > Transform Data into Opportunity.
> > > Accelerate data analysis in your applications with
> > > Intel Data Analytics Acceleration Library
> >
> > > Click to learn more.
> > > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
> > > _______________________________________________
> > > Assp-test mailing list
> > > [hidden email]
> > > https://lists.sourceforge.net/lists/listinfo/assp-test
> > >
> > >
> >
> >
>
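
For anyone who wants to check how many same-subject files are actually sitting in a collection folder (the "about 600 in NotSpam" situation from the start of this thread), here is a rough Python sketch. It is not part of ASSP and does not reproduce how ASSP enforces MaxAllowedDups. It only assumes that file names look like the example given in the thread, `Your_Donation_Receipt--123456.txt`, i.e. subject, two hyphens, a number, and an extension. All function names and the folder path are made up for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed naming pattern, taken from the example in this thread:
#   <subject>--<digits>.<extension>
_NAME_RE = re.compile(r"^(?P<subject>.+)--\d+\.[^.]+$")


def subject_of(filename):
    """Return the subject part of a log file name, or None if it does not match."""
    match = _NAME_RE.match(filename)
    return match.group("subject") if match else None


def count_by_subject(filenames):
    """Count how many file names share each subject part."""
    counts = Counter()
    for name in filenames:
        subject = subject_of(name)
        if subject is not None:
            counts[subject] += 1
    return counts


def subjects_over_limit(filenames, max_dups):
    """Return {subject: count} for subjects that occur more than max_dups times."""
    counts = count_by_subject(filenames)
    return {subject: n for subject, n in counts.items() if n > max_dups}


def scan_folder(folder, max_dups):
    """Apply subjects_over_limit to the plain files in one folder (hypothetical path)."""
    names = [p.name for p in Path(folder).iterdir() if p.is_file()]
    return subjects_over_limit(names, max_dups)
```

Calling something like `scan_folder("/path/to/assp/notspam", 3)` (the path is a placeholder) would list the subjects that have more than 3 files in that folder, which makes it easy to see whether a limit is being applied to a given collection or not.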