mersenneforum.org > Data > And now for some TF results...

2019-02-24, 03:47   #34
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·491 Posts

Quote:
 Originally Posted by R. Gerbicz
 Obviously the hard part is that for no factor (which is the common case) there is nothing to check in the usual codes. In the past I thought it was just impossible. I don't have an exact idea of the slowdown on the gpu (I'd say less than 1%), but the check on the server would be terribly fast and close to impossible to fool:

 Let x mod N be the unique mod N residue in [0,N), and H=hash(p,q,t)=(2^p mod q) mod (2^t). We expect to see H=1 for cnt/p/2^(t+1) different q primes (q=1,7 mod 8) if there are cnt primes in [2^b,2^(b+1)). Save these q numbers if q is prime (or pseudoprime) and send them to the server at the end of the computation. The number of hits follows a binomial distribution with Binom(N,r)=Binom(cnt/2/p,1/2^t)
 Code:
 where:
 N=cnt/(2*p)
 r=1/2^t
 ev=N*r (the expected value)
 sigma=sqrt(N*r*(1-r))
 Here mark the submission as suspect if you see fewer than ev-5*sigma different(!) q values from the [2^b,2^(b+1)) interval, where q mod 8={1,7} and q is prime (or pseudoprime; that would change almost nothing). Of course accept the submission for p if it has found at least one q>1 divisor of mp, because in that case it is possible the search used an early-abort strategy at success, which is common. Choose t so that, say, 1000<ev<2000 (assuming the range is not very small).

 ps. For large bounds we have exact prime counts for [2^b,2^(b+1)), but you can also use estimations. Don't count composite q values, because that would mess up these things; it depends on your sieving bound etc. Why would it be hard to fool? Because for actual q prime divisors of mp we also have H=1; if you could find these q much faster, then the whole tf would also be much faster.
I'm not going to pretend to have fully understood that. However.
If it depends on the factoring program mfaktc or mfakto using only prime candidate factors in the algorithm, or a predictable number of candidate factors of predictable values, there may be a problem. In gpu sieving there are 3 parameters which we users can adjust to tune the application for maximum throughput. Some allow only a few values; others allow a wide integer range. One of these is typically of order 10^5, and sets how far to sieve candidate factors. (Another can range up to 2047 in modified code.) Some composite candidate factors get let through and used in the algorithm, because that is faster than perfectly sieving candidate factors for primality. If not using gpu sieving for candidate factors, there's another set of tuning parameters that do approximately the same. See https://www.mersenneforum.org/showpo...23&postcount=6 mainly items 8, 9 and 26.
The crux of it I think is the application does not know how many of the candidate factors employed were prime, in usual operation, or which to count as prime and which not to.

Let's see how much I got wrong. Here's how I interpreted it.
p is the exponent we seek a factor for and want to validate the TF attempt for, over the TF interval 2^b to 2^(b+1)
cnt(*) is count of (whatever); number of primes (q's?) in the TF interval 2^b to 2^(b+1);
(all primes in the range, or only primes of form 1 or 7 mod 8 and 2kp+1?)
q is a candidate factor that may or may not itself be prime, having been generated from an efficient generator and sieved imperfectly in such applications as mfaktc, mfakto, or prime95/mprime

t is chosen so that the expected number of H=1 hits is between 1000 and 2000.
If for example p=~90M and b=76,
cnt =~ .05 * 2^76 ~ 2^72 (and most of those will be sieved out or not even generated)
N =~ 2^72 /2 /90M =~ 2623536935.
aim for about the middle, ev = 1536 = 1.5*2^10 = N*r; ev/N = r =~ 5.85E-7. t = log2(1/r) =~ 21.
round t to integer, t=21.
check; ev = N*r = N / 2^t = 2623536935 / 2^21 ~ 1251.
sigma = sqrt(2623536935 * 2^-21 * (1 - 2^-21)) =~ sqrt(1251) = 35.37.
ev - 5*sigma =~ 1251 - 5*35.37 = 1251 - 176.85 =~ 1074.
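The arithmetic above can be sketched in a few lines of Python (function names are mine; N is taken as the example's value rather than recomputed):

```python
import math

def choose_t(N, ev_target=1536):
    # pick t so that ev = N / 2^t lands near the target expected hit count
    return round(math.log2(N / ev_target))

def count_test_bounds(N, t, n_sigma=5):
    # binomial parameters: r = 1/2^t, ev = N*r, sigma = sqrt(N*r*(1-r))
    r = 2.0 ** -t
    ev = N * r
    sigma = math.sqrt(N * r * (1 - r))
    return ev, sigma, ev - n_sigma * sigma

N = 2623536935                       # the example's N for p ~ 90M, b = 76
t = choose_t(N)                      # -> 21
ev, sigma, cutoff = count_test_bounds(N, t)
print(t, round(ev), round(sigma, 2), round(cutoff))
```

This reproduces t=21, ev ~1251, sigma ~35.37, and the ev-5*sigma cutoff ~1074.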

while computing the TF, after each (2^p mod q) trial,
compute H = residue mod 2^t;
if H=1, add q to a list.
When the TF is complete, if no factor is found, the report gets the list of H=1 q values appended as a form of validation. That's of order 1250 entries, of roughly 23-digit numbers plus a separator character each, an additional ~30 KBytes for every NF report. Some size (17%?) could be saved by using hex representation rather than decimal.
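The client-side logic can be sketched as follows (purely illustrative: the real clients sieve and batch this on the gpu, the function name is mine, and a single base-3 Fermat test stands in for the prime/pseudoprime filter):

```python
def tf_with_proof(p, b, t):
    # Walk candidate factors q = 2*k*p + 1 in [2^b, 2^(b+1)) with q mod 8 in {1,7};
    # record the q's whose trial residue has H = (2^p mod q) mod 2^t == 1.
    proof = []
    mask = (1 << t) - 1
    k = (2**b - 1) // (2 * p) + 1          # first k with q >= 2^b
    while (q := 2 * k * p + 1) < 2**(b + 1):
        if q % 8 in (1, 7) and pow(3, q - 1, q) == 1:  # (pseudo)prime filter
            r = pow(2, p, q)               # the TF trial itself
            if r == 1:
                return ('factor', q)       # q divides M(p): early abort
            if (r & mask) == 1:            # H = 1: keep q for the validation list
                proof.append(q)
        k += 1
    return ('no_factor', proof)
```

For example, tf_with_proof(11, 4, 2) finds the factor 23 of M(11), while tf_with_proof(31, 5, 2) reports no factor (M(31) is prime).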

The computation of the H values itself seems quite quick compared to the base TF trial. Although the residue is still multiword (in the example q is 77 bits, at 32 bits or less per word), since the modulus is 2^t, for t not greater than the usable word length the high words can be ignored.
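A one-liner makes the point that reducing mod 2^t only ever touches the lowest word (the residue value here is an arbitrary 77-bit example):

```python
t = 21
mask = (1 << t) - 1                     # mod 2^t is a bitwise AND
r = 0x1ABCDEF0123456789ABC              # arbitrary 77-bit residue
low_word = r & 0xFFFFFFFF               # lowest 32-bit word only
# the high words never affect H as long as t <= the word length
print((r & mask) == (low_word & mask))
```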

Some of the top TF producers each send in millions of attempt results annually. 30KB each x 7M+ results for one prolific contributor is another >210GB of traffic to, and storage on, the server annually; probably of order triple that for the entire project. That's approx 160Kbps of additional sustained rate to the primenet server on average. And it's probably quite bursty.

I think in lists of such size, there should be a CRC or maybe ECC added.

So, what is the trivially quick server side verification? Pick a q at random from the list included and verify H=1? It would need to be able to reproduce the t value for the exponent and bit level, or receive it in the message.
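One plausible shape for that server-side check (a sketch only; the ev/sigma bookkeeping repeats the scheme's formulas, the function name and 12-sample default are mine, and the base-3 Fermat test stands in for a real primality check):

```python
import math
import random

def server_verify(p, b, t, qlist, N, samples=12):
    # Count test first: fewer than ev - 5*sigma distinct q values => suspect.
    r = 2.0 ** -t
    ev = N * r
    sigma = math.sqrt(N * r * (1 - r))
    if len(set(qlist)) < ev - 5 * sigma:
        return False
    # Then spot-check a few entries: in range, right form and residue classes,
    # (pseudo)prime, and H = (2^p mod q) mod 2^t == 1.
    lo, hi = 2**b, 2**(b + 1)
    for q in random.sample(qlist, min(samples, len(qlist))):
        if not (lo <= q < hi and q % 8 in (1, 7) and (q - 1) % (2 * p) == 0):
            return False
        if pow(3, q - 1, q) != 1:
            return False
        if (pow(2, p, q) & ((1 << t) - 1)) != 1:
            return False
    return True
```

A toy call: server_verify(5, 4, 1, [31], 2) accepts (31 = 2*3*5+1 is prime, 31 mod 8 = 7, and 2^5 mod 31 = 1), while a list containing 29 is rejected since 29 is not of the form 2kp+1 mod-8-qualified.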

Last fiddled with by kriesel on 2019-02-24 at 03:59

 2019-02-24, 05:06   #35
Madpoo
Serpentine Vermin Jar

Jul 2014

3×5×7×31 Posts

Our mystery spammer was back under a different random girl's name. Same or similar porn site link for the web url... And that web link may lead to the actual motive of this person: forum spamming.

If they're reading this thread... don't bother. You may notice that any weblinks to 3rd party sites that appear on mersenne.org have rel="nofollow", so it's not doing you a bit of good to try and billboard the recent results or anything. Take a look, it's true. And if you think somehow a search engine is going to see those and follow them anyway, you're wrong, but keep this up and we'll just remove those links from those pages entirely.

By the way, the VPN you're using has a terrible reputation on all of their endpoints. But that's probably why they allow scum like you (or because they allow scum like you). No reputation to protect. At my last job, we ran into web spammers like this on a somewhat regular basis, and I know the game.
2019-02-24, 05:28   #36
Serpentine Vermin Jar

Jul 2014

3×5×7×31 Posts

Quote:
 Originally Posted by Madpoo ... And if you think somehow a search engine is going to see those and follow them anyway, you're wrong, but keep this up and we'll just remove those links from those pages entirely. ...
I realized I could just limit those links so they only show up on accounts over a certain age. Keep the bozos from submitting bogus results just so their useless porn links show up. Or not bother showing them at all for "no factor" results. That way we're not throwing the baby out with the bath water.

 2019-02-24, 06:35   #37
Madpoo
Serpentine Vermin Jar

Jul 2014

6267_8 Posts

Idea #whatever is to use something like this, either on the account creation page (which I just updated, FYI) or maybe even on each hit to the manual result page: https://www.stopforumspam.com/

Basically the server would call the API and get a reliability score for the IP address. I noticed that in each of these recent cases, the offending IP was listed as a TOR exit node with terrible reputation scores on that particular service. Like, 99.9%+ confidence that anything coming from them is bad.

I can do a little "offline" run of that and see how many, if any, of the manual results we got in the past couple days would have been affected. With any service like that, the danger is false positives.
2019-02-24, 07:00   #38
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2^3·3·13·17 Posts

Quote:
 Originally Posted by Madpoo ... the danger is false positives.
If you decide to try and automate detection of bad actors then you will always catch innocent people with it. It just goes with the territory.

Is it really such a problem that catching people manually is not an option? Humans are always better at determining good from bad. Can't we keep the humans in the process?

So what if we occasionally get extreme results in the DB? We can go back and erase them, no big deal.

2019-02-24, 07:31   #39
Serpentine Vermin Jar

Jul 2014

110010110111_2 Posts

Quote:
 Originally Posted by retina If you decide to try and automate detection of bad actors then you will always catch innocent people with it. It just goes with the territory. Is it really such a problem that catching people manually is not an option? Humans are always better at determining good from bad. Can't we keep the humans in the process? So what if we occasionally get extreme results in the DB? We can go back and erase them, no big deal.
It's a hassle to do it manually... I don't check the forums all the time (although I have today because of these issues). Automating some filtering would be great.

I just ran a list of IP addresses that have submitted manual results lately through that "forum spam" API and only got one hit. And that one hit turned out to be this doofus back for more. No false positives, although it was a limited sample size, and the one it found was correctly flagged.

I also just implemented a change to the "recent results" page so any account under 30 days old won't have their website appear as a link on their user name. The hope being when this person (and I use the term lightly) realized their porn site isn't showing up anywhere, they'll get bored and move on.

2019-02-24, 11:44   #40
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

2457_8 Posts

Quote:
 Originally Posted by kriesel However. If it depends on the factoring program mfaktc or mfakto using only candidate prime factors in the algorithm, or a predictable number of candidate factors, of predictable values, there may be a problem.
The distribution of primes in reduced residue classes is close to equal (it is a known theorem), and we have eulerphi(8*p)=4*(p-1) such classes mod (8*p) and only two good classes, thus we have roughly N=cnt*2/(4*(p-1))=cnt/(2*(p-1)) primes q=2*k*p+1 in the [2^b,2^(b+1)) interval with q=1,7 mod 8. So yes, N=cnt/(2*p) was slightly incorrect, but for "large" p there is no large difference. As pointed out, we send q only if it is prime (or pseudoprime), and you have to test these q primes whatever your sieve bound is.
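The N = cnt/(2*(p-1)) estimate is easy to sanity-check empirically for toy sizes (small p and b so it runs instantly; the real case is p ~ 9e7, b = 76, and all names here are mine):

```python
def sieve(limit):
    # simple Eratosthenes: primality flags for 0..limit-1
    flags = bytearray([1]) * limit
    flags[0:2] = b'\x00\x00'
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = b'\x00' * len(range(i * i, limit, i))
    return flags

p, b = 101, 20
lo, hi = 1 << b, 1 << (b + 1)
flags = sieve(hi)
cnt = sum(flags[lo:hi])               # primes in [2^b, 2^(b+1))
predicted = cnt / (2 * (p - 1))       # the estimate above
actual, k = 0, (lo - 1) // (2 * p) + 1
while (q := 2 * k * p + 1) < hi:      # primes q = 2kp+1 with q mod 8 in {1,7}
    if q % 8 in (1, 7) and flags[q]:
        actual += 1
    k += 1
print(cnt, round(predicted, 1), actual)
```

For these sizes the actual count lands within a few percent of the prediction.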

Quote:
 Originally Posted by kriesel When the TF is complete, if no factor is found, the report gets the list of H=1 q values appended as a form of validation. That's of order 1250 entries, of roughly 23-digit numbers plus a separator character each, an additional 30 KBytes for every NF report. Some size (17%?) could be saved by using hex representation rather than decimal. .. Some of the top TF producers send in millions of attempt results annually each. 30KB each x 7M+ for one prolific contributor is another >210GB of traffic to and storage on the server annually; probably of order triple that for the entire project. Approx 160Kbps average additional sustained rate to the primenet server. And it's probably quite bursty.
To save bandwidth you could send only k, and reconstruct: q=2*k*p+1, and after the verification you can delete these and mark the entry (p,b) as verified, though you could also download it daily to an external drive.
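The suggested k-compression is a one-liner each way (function names are mine), and the per-entry saving is about log2(2*p) bits:

```python
import math

def to_k(p, qlist):
    # transmit only k, where q = 2*k*p + 1
    return [(q - 1) // (2 * p) for q in qlist]

def from_k(p, klist):
    # server-side reconstruction
    return [2 * k * p + 1 for k in klist]

# for p near 90M the saving is ~27 bits per entry
print(round(math.log2(2 * 90_000_000), 1))
```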

Quote:
 Originally Posted by kriesel So, what is the trivially quick server side verification? Pick a q at random from the list included and verify H=1? It would need to be able to reproduce the t value for the exponent and bit level, or receive it in the message.
Testing 1-2 thousand q primes per p is not an issue. If the value of t is deterministic then you don't need to send it, but maybe sending one byte per p,b is still not a big effort.

Last fiddled with by R. Gerbicz on 2019-02-24 at 11:46

2019-02-24, 14:41   #41
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×491 Posts

Quote:
 Originally Posted by R. Gerbicz ... To save bandwidth you could send only k, and reconstruct: q=2*k*p+1, and after the verification you can delete these and mark the entry (p,b) as verified, though you could also download it daily to an external drive. Testing 1-2 thousand q primes per p is not an issue. If the value of t is deterministic then you don't need to send it, but maybe sending one byte per p,b is still not a big effort.
If I follow you, part of what makes it work is the count does not need to be determinate and matched between client and server, because of the use of a list transmitted for verification. So tuning or moderate miscount effects are accommodated.

Sending k rather than q knocks off about 27 bits, or 8 digits, per q: about a third, for my example.

Checking all the listed q's would be a very considerable increase in the server effort of verifying TF. Instead of re-running only the occasional one-factor-found, at cost ~log2(p) and frequency 1/b, we would also be running on the server, at frequency (b-1)/b, ~1500 q's at cost ~log2(p) each. (1/b + ((b-1)/b)*1500) / (1/b) = 112501, a ~10^5-fold ratio. Sampling a dozen or more pseudorandomly selected q's might be a good compromise between increased server loading and detection of discrepancies. I'm unpersuaded of the marginal utility of checking the last thousand or more if the first dozen to hundred per report all check out ok. (How many sigma do we really need? Five sigma, better than 1 ppm, is impressive. I'd be satisfied with around half that on a log scale; 0.1% would be a great improvement over the status quo. But it's probably George's call.)
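One way to put numbers on that compromise (my own back-of-envelope, not part of the scheme): if a cheater fabricated some fraction of the list, the chance a small random sample catches at least one bad entry is hypergeometric:

```python
def detect_prob(n_bad, n_total, m):
    # chance a sample of m entries (without replacement) hits at least one bad one
    miss = 1.0
    for i in range(m):
        miss *= (n_total - n_bad - i) / (n_total - i)
    return 1.0 - miss

# if 10% of a 1500-entry list were fabricated, a dozen spot checks catch it
# with probability ~0.72; a hundred checks, ~0.99998
print(round(detect_prob(150, 1500, 12), 3))
```

So a dozen samples deters wholesale faking, though it is weak against a handful of bad entries.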

If just one or two of 1500 fail to check out, what then? Does the server record it as a verified NF report with n verification errors of 1500 values or of however many actually checked?

I like the idea of sending also t. The impact on message size is small, and it is one more piece for the rare few people faking results to get wrong. I'd probably choose to express it as " t dd" or some such, around 5 characters, as easily human readable.

I'll leave it to folks like Madpoo, Prime95, Scott Kurowski, and James Heinrich to decide what's no problem and what's maybe too much to add to their servers' loadings.

Thanks once again for continuing to look for ways to refine the methods of the GIMPS project. This seems an exciting development.

Last fiddled with by kriesel on 2019-02-24 at 14:46

2019-02-24, 15:17   #42
chalsall
If I May

"Chris Halsall"
Sep 2002

227D_16 Posts

Quote:
 Originally Posted by Mark Rose We could move to giving TF credit only when a factor is found.
I would argue against this idea...

It is true that /most/ people who do TF'ing don't really care all that much about the credits. But, *some* do.

Further, the "No Factor between 2^X and 2^Y" information is almost as valuable as a found factor. Not only to keep track of where the next bit (no joke intended) of TF'ing work should be done, but also P-1'ing takes this information into consideration.

Lastly, since there would be no "reward" to the participant to submit the "No factor found" information, some _might_ simply not submit such results, thus making others redo the work with the resultant lower "success rate".

My 4 cents (BDS)....

2019-02-24, 15:42   #43
SELROC

2×3×199 Posts

Quote:
 Originally Posted by chalsall I would argue against this idea... It is true that /most/ people who do TF'ing don't really care all that much about the credits. But, *some* do. Further, the "No Factor between 2^X and 2^Y" information is almost as valuable as a found factor. Not only to keep track of where the next bit (no joke intended) of TF'ing work should be done, but also P-1'ing takes this information into consideration. Lastly, since there would be no "reward" to the participant to submit the "No factor found" information, some _might_ simply not submit such results, thus making others redo the work with the resultant lower "success rate". My 4 cents (BDS)....

+1

Also there would be no reward for the energy spent to find "No Factor between 2^X and 2^Y".

2019-02-24, 16:17   #44
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

2457_8 Posts

Quote:
 Originally Posted by kriesel How many sigma do we really need? Five sigma, better than 1ppm is impressive. I'd be satisfied with around half that, on a log scale; 0.1% as a great improvement over the status quo. But it's probably George's call.
See the last table at https://en.wikipedia.org/wiki/Standard_deviation; of course we have a different (binomial) distribution, but for large N we are very close to this. Or you can compute Probab(x<ev-5*sigma) for the binomial distribution very quickly. So we'd fail this test only once per ~3500000 primes p, because we're interested in only one side; if we see more than ev+5*sigma q primes, that isn't as interesting.
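The once-per-~3.5-million figure is just the one-sided normal tail at five sigma, e.g.:

```python
import math

def one_sided_tail(n_sigma):
    # P(X < ev - n_sigma*sigma) under the normal approximation: Phi(-n_sigma)
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

print(round(1 / one_sided_tail(5)))   # once per ~3.5 million
```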

Make the decision per p,b and not per input file.

