mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   PrimeNet (https://www.mersenneforum.org/forumdisplay.php?f=11)
-   -   Trial Factoring Elimination up to 2^64 Project (https://www.mersenneforum.org/showthread.php?t=19258)

tapion64 2014-04-09 18:45

Trial Factoring Elimination up to 2^64 Project
 
(For those curious about the math behind this: the multiplicative order of n mod m is defined as the smallest k such that n^k = 1 mod m. Mersenne numbers are of the form 2^p-1 with p prime, so if a given prime q divides 2^p-1, then 2^p mod q = 1, which means the multiplicative order of 2 mod q divides p; since p is prime and the order is greater than 1, the order is exactly p. After that point, q only divides numbers of the form 2^(k*p)-1, whose exponents are composite for k > 1 and therefore not Mersenne prime candidates. This is also the reason 2^n-1 can only be prime when n is prime: if n = a*b with a, b > 1, then 2^a-1 divides 2^n-1, so 2^n-1 is composite.)
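For concreteness, the definition turns directly into code (a naive Python sketch for tiny q only; the function name is just for illustration, and nothing like this would be used at scale):

```python
def ord2_naive(q):
    """Multiplicative order of 2 mod q: the smallest k with 2^k = 1 (mod q).
    Computed by direct repeated doubling, so only usable for tiny q."""
    r, k = 2 % q, 1
    while r != 1:
        r = r * 2 % q
        k += 1
    return k
```

For example, ord2_naive(23) is 11, matching the fact that 23 divides 2^11 - 1 = 2047 = 23 * 89.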

The purpose of the project is to eventually get to 2^64 (or beyond) in contiguous primes, calculating their multiplicative order and then eliminating them from the trial factoring range. Calculating this for an individual prime is trivial even in the 2^80 range and above using the Bach–Shallit algorithm, so processing-wise this isn't very intensive; it's just a huge number of test cases (around 416 quadrillion primes).

The original intent was to calculate and store this information for the primes, but this would easily take up several exabytes of information, so instead my server is going to track information about the range covered so far. This introduces new architectural problems I'm attempting to solve at the moment.

Inevitably this will run across factors for exponents far outside the range that GIMPS currently stores, which I believe is exponents less than or equal to 2^32. For the purpose of this project, it's possible to ignore those as 'outside of scope'. Thus if the calculated multiplicative order is greater than 2^32, we can still add the prime to the range of eliminated trial factors. If the multiplicative order is less than 2^32 and prime, we can immediately report the prime to GIMPS as a factor of that Mersenne number, again eliminating it from the trial factoring range. The final and most common outcome is that the order is less than 2^32 and not prime, so we can eliminate the prime (permanently) from the trial factoring range. However, since we aren't storing any of this information in a database, it'll simply have to be recalculated the next time the exponent range on GIMPS increases.

The main challenge of distributing this work is actually having the primes on hand to test. But if we assume that all non-primes are eliminated before the first step (testing for multiplicative order), and the average CPU can run the whole algorithm on 1 core for 2 million of these primes in around 1 second (I'll have to benchmark this; it could be slower), total calculation time ends up being around 6592 years. If the average contributor has at least a quad core, 1649 years or so. If the entire community of GIMPS currently working on trial factoring (around 2184 of them) switched over to doing these calculations, that would be 9 months of testing, and we'd never have to trial factor below 2^64 again (until GIMPS increases their exponent range), as we'll have eliminated every single Mersenne prime candidate with exponent less than 2^32 that has a factor less than 2^64.

Obviously, this type of approach is also well suited for CUDA divide and conquer computing, but I have no idea how widespread CUDA GPUs are, and their performance can vary quite a bit. But ideally this would be a project that would support both types of contributors. I'd like any feedback/criticism/help the community can offer. I'll be doing some benchmarking to come back with hard estimates for the time this would actually take.

EDIT: (I'm also aware that it's unreasonable to expect that many people to be running the program 24/7 for that long, and also that many of them may only have dual or even single cores working on it, so I will factor that in.)

kracker 2014-04-09 19:32

Not sure where you are going with this, :smile: but programs exist for TF'ing on the GPU and people run them. See the project "GPU to 72" also. Also, no one runs TF on the CPU because the GPU can do it much, [I]much [/I]faster.
[url]http://mersenneforum.org/forumdisplay.php?f=92[/url]

OpenCL is faster at this moment compared to CUDA in this generation of GPU's, but that always changes :razz: Maxwell I believe has really good potential.

tapion64 2014-04-09 19:42

[QUOTE=kracker;370679]Not sure where you are going with this, :smile: but programs exist for TF'ing on the GPU and people run them. See the project "GPU to 72" also. Also, no one runs TF on the CPU because the GPU can do it much, [I]much [/I]faster.
[URL]http://mersenneforum.org/forumdisplay.php?f=92[/URL]

OpenCL is faster at this moment compared to CUDA in this generation of GPU's, but that always changes :razz: Maxwell I believe has really good potential.[/QUOTE]

The approach with trial factoring is to pick a specific exponent and test if any number in a range like 2^65 to 2^66 divides 2^exponent - 1. The difference with this project is that it picks a specific prime and discovers which Mersenne number, if one exists, it will divide. It's a faster computation, but there are magnitudes more test cases to go through to cover the 2^64 range.

TheMawn 2014-04-09 19:59

I only vaguely caught the math behind this idea. If I understand correctly, it tests whether 2KP+1 divides 2[SUP]P[/SUP]-1, but tests every P for one K instead of every K for one P, the latter being our Trial Factoring method?

I can only assume I'm wrong, because this has actually been done. There is at least one user who is taking this approach currently, and has been finding factors < 2[SUP]~45[/SUP] (last I checked) all over the map.


Also, in the interest of finding all factors less than 64 bits, this sounds pretty neat if it does work. However, you'll have a hard time getting a lot of traction because trial factoring to 64 bits has been completed up to 1000M except for very low exponents whose Prime / Composite status has long been proven.

Your efforts will not clear any exponents in the list. It will only find factors for exponents we already know to be composite. I personally would rather direct efforts at exponents with unknown status, but I can't speak for everyone.

tapion64 2014-04-09 20:04

[QUOTE=TheMawn;370684]I only vaguely caught the math behind this idea. If I understand correctly, it tests whether 2KP+1 divides 2[SUP]P[/SUP]-1, but tests every P for one K instead of every K for one P, the latter being our Trial Factoring method?

I can only assume I'm wrong, because this has actually been done. There is at least one user who is taking this approach currently, and has been finding factors < 2[SUP]~45[/SUP] (last I checked) all over the map.


Also, in the interest of finding all factors less than 64 bits, this sounds pretty neat if it does work. However, you'll have a hard time getting a lot of traction because trial factoring to 64 bits has been completed up to 1000M except for very low exponents whose Prime / Composite status has long been proven.

Your efforts will not clear any exponents in the list. It will only find factors for exponents we already know to be composite. I personally would rather direct efforts at exponents with unknown status, but I can't speak for everyone.[/QUOTE]
I did not know they'd all been factored that far up to 1000M; I figured we were going sequentially. If that's the case, then we can consider the 2^64 part of this project already complete. Man, I feel accomplished. Of course, the idea behind the project will still work for 2^65 through 2^80; there are just more and more primes to test. Computationally it's an insignificant difference for numbers this 'small'.

EDIT: In that case, I think this should definitely be a GPU-only project. The CPU will be too slow on the sieving of primes to make a difference. I need to learn how to do the sieving using only the GPU and convert my Go code for calculating the order into CUDA. That part could take some time, as I'm not familiar with CUDA in general. This could take far less time than I thought it might if the conversion from Go to CUDA is seamless.

TheMawn 2014-04-09 21:16

Is this similar to finding factors by k instead of by P? Or is there some other improvement?

tapion64 2014-04-09 21:38

[QUOTE=TheMawn;370694]Is this similar to finding factors by k instead of by P? Or is there some other improvement?[/QUOTE]
We're taking a prime P and finding which 2^k-1 it divides, which is just calculating k as the multiplicative order of 2 mod P.

In order to make this feasible, we'll limit k to be less than or equal to the exponents that GIMPS currently tracks, which are those less than 2^32 (or in other words, Mersenne numbers less than 2^(2^32)-1). So after we've calculated k, if it's greater than 2^32, throw the number away and move on to the next one. If not, check the primality of k (this should be simple, it's a 32-bit number). If it's composite, we don't care: 2^k-1 can't possibly be prime, it's already eliminated. If it's prime, go check against GIMPS to see if P is in the system for 2^k-1 already. If it isn't, great success. Either way, move on to the next prime and keep checking. If GPU sieving is this fast, and we can write efficient CUDA code to calculate multiplicative order, one person alone with a 560 Ti or better could wipe out those ranges in a week's time, I think. Possibly faster.
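A sketch of that per-prime loop in Python, with trial-division factoring and primality testing standing in for whatever msieve/GPU code would actually be used (all names here are illustrative; the 2^32 cutoff and the three outcomes follow the post above):

```python
GIMPS_LIMIT = 2**32

def prime_factors(n):
    """Distinct prime factors by trial division (a stand-in for a
    real factoring routine)."""
    fs = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            fs.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        fs.append(n)
    return fs

def mult_order_2(q):
    """ord_q(2): start from q-1 (Fermat) and strip prime factors while
    the power still comes out 1 -- the idea behind Bach-Shallit."""
    k = q - 1
    for p in prime_factors(q - 1):
        while k % p == 0 and pow(2, k // p, q) == 1:
            k //= p
    return k

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

def classify(q):
    """The three outcomes for a candidate factor q = +/-1 (mod 8)."""
    k = mult_order_2(q)
    if k >= GIMPS_LIMIT:
        return "out of scope"             # exponent beyond GIMPS's range
    if is_prime(k):
        return "report %d divides 2^%d-1" % (q, k)
    return "eliminated"                   # composite exponent: not a candidate
```

For instance, q = 23 gives order 11 (prime, so reportable), while q = 73 gives order 9 (composite, so permanently eliminated).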

I wish I could code CUDA as fast as I pseudo-coded this algorithm, we'd already be done. Luckily the GPU sieving library in mfaktc should come in handy, I just have to figure out how to use it and use CUDA in general. I keep expecting to find some terrible flaw in my logic, but if I don't, it means we're about to find a lot of factors.

LaurV 2014-04-10 02:18

What you want to do is called "searching by k" (as opposed to "searching by p", which is what p95 and mfaktX do - the "k" and "p" refer to a factor q of 2^p-1 being of the form q=2kp+1), and it was discussed at least 20 times on the forum. I personally posted a thousand pari code snippets, and versions of that code were used by others (James) to extend the factor database (see[URL="http://www.mersenne.ca/graphs/factor_bits/factor_bits_20140409.png"] his graphs[/URL]) for small factors (below 2^45 or so). This is very fast for small factors, but it gets exponentially slower as q gets higher, mostly because of the "garbage" it produces.

This method, with all possible and impossible optimizations together (like testing only primes which are 1 or 7 mod 8, the most obvious optimization, though other less obvious ones can be done too) [U]is not feasible[/U] beyond 2^48 (or max. 2^52, if you have a supercomputer), due to the large number of primes you need to test which won't produce a prime exponent (most of the primes produce composite exponents). I have a database of factors to about 2^56, which can be "easily" reproduced by letting a single core of pari crunch for half a year or 8 months. James has a larger database. When you succeed in going over this limit, tell us :razz:

For the record, PrimeNet has such a database to 2^64 at least (all exponents over some limit were TF-ed to 2^65 or so, by other - faster - methods).

There are about 2.2*10^17 primes between 2^56 and 2^64 (a rough calculation, forgive me if it is not accurate, but the magnitude is right), of which half are 1 or 7 mod 8. Even if you have ten thousand computers and manage to test a hundred thousand primes per second on each computer, you will still need about 3 to 5 years to reach 2^64. So, do you have 10 thousand computers?

Edit: I had a look for the old code but can't find it on the forum; anyhow, here is a sample of what you want to do. Imagine the "getp" function implements the algorithm you talk about.

[code]
gp > \r tf
gp > forprime(q=7,1000, print(q" : "getp(q,10^9)))
[COLOR=Red]7 : 3[/COLOR]
11 : 10
13 : 12
17 : 8
19 : 18
[COLOR=Red]23 : 11[/COLOR]
29 : 28
[COLOR=Red]31 : 5[/COLOR]
37 : 36
41 : 20
43 : 14
[COLOR=Red]47 : 23[/COLOR]
53 : 52
59 : 58
61 : 60
67 : 66
71 : 35
73 : 9
79 : 39
83 : 82
[COLOR=Red]89 : 11[/COLOR]
97 : 48
101 : 100
103 : 51
107 : 106
109 : 36
113 : 28
[COLOR=Red]127 : 7[/COLOR]
131 : 130
137 : 68
139 : 138
149 : 148
151 : 15
157 : 52
163 : 162
[COLOR=Red]167 : 83[/COLOR]
173 : 172
179 : 178
181 : 180
191 : 95
193 : 96
197 : 196
199 : 99
211 : 210
[COLOR=Red]223 : 37[/COLOR]
227 : 226
229 : 76
[COLOR=Red]233 : 29[/COLOR]
239 : 119
241 : 24
251 : 50
257 : 16
[COLOR=Red]263 : 131[/COLOR]
269 : 268
271 : 135
277 : 92
281 : 70
283 : 94
293 : 292
307 : 102
311 : 155
313 : 156
317 : 316
331 : 30
337 : 21
347 : 346
349 : 348
353 : 88
[COLOR=Red]359 : 179[/COLOR]
367 : 183
373 : 372
379 : 378
[COLOR=Red]383 : 191[/COLOR]
389 : 388
397 : 44
401 : 200
409 : 204
419 : 418
421 : 420
[COLOR=Red]431 : 43[/COLOR]
433 : 72
439 : 73
443 : 442
449 : 224
457 : 76
461 : 460
463 : 231
467 : 466
[COLOR=Red]479 : 239[/COLOR]
487 : 243
491 : 490
499 : 166
[COLOR=Red]503 : 251[/COLOR]
509 : 508
521 : 260
523 : 522
541 : 540
547 : 546
557 : 556
563 : 562
569 : 284
571 : 114
577 : 144
587 : 586
593 : 148
599 : 299
601 : 25
607 : 303
613 : 612
617 : 154
619 : 618
631 : 45
641 : 64
643 : 214
647 : 323
653 : 652
659 : 658
661 : 660
673 : 48
677 : 676
683 : 22
691 : 230
701 : 700
709 : 708
[COLOR=Red]719 : 359[/COLOR]
727 : 121
733 : 244
739 : 246
743 : 371
751 : 375
757 : 756
761 : 380
769 : 384
773 : 772
787 : 786
797 : 796
809 : 404
811 : 270
821 : 820
823 : 411
827 : 826
829 : 828
[COLOR=Red]839 : 419[/COLOR]
853 : 852
857 : 428
859 : 858
[COLOR=Red]863 : 431[/COLOR]
877 : 876
881 : 55
883 : 882
[COLOR=Red]887 : 443[/COLOR]
907 : 906
911 : 91
919 : 153
929 : 464
937 : 117
941 : 940
947 : 946
953 : 68
967 : 483
971 : 194
977 : 488
[COLOR=Red]983 : 491[/COLOR]
991 : 495
997 : 332
gp >
[/code]where each line is "q : p" and can be read as "q divides 2^p-1". I marked in red the ones which interest us. As you go higher, they get exponentially rarer. Even with optimizations - as I said, the simplest is to take only primes q which are 1 or 7 (mod 8) - you will need not "one week", but years to finish.

More clear:
[code]
gp > forprime(q=7,10000, if(q%8==1||q%8==7,p=getp(q,10^9); if(isprime(p), print(q" divides 2^"p"-1"))))
7 divides 2^3-1
23 divides 2^11-1
31 divides 2^5-1
47 divides 2^23-1
89 divides 2^11-1
127 divides 2^7-1
167 divides 2^83-1
223 divides 2^37-1
233 divides 2^29-1
263 divides 2^131-1
359 divides 2^179-1
383 divides 2^191-1
431 divides 2^43-1
439 divides 2^73-1
479 divides 2^239-1
503 divides 2^251-1
719 divides 2^359-1
839 divides 2^419-1
863 divides 2^431-1
887 divides 2^443-1
983 divides 2^491-1
1103 divides 2^29-1
1319 divides 2^659-1
1367 divides 2^683-1
1399 divides 2^233-1
1433 divides 2^179-1
1439 divides 2^719-1
1487 divides 2^743-1
1823 divides 2^911-1
1913 divides 2^239-1
2039 divides 2^1019-1
2063 divides 2^1031-1
2089 divides 2^29-1
2207 divides 2^1103-1
2351 divides 2^47-1
2383 divides 2^397-1
2447 divides 2^1223-1
2687 divides 2^79-1
2767 divides 2^461-1
2879 divides 2^1439-1
2903 divides 2^1451-1
2999 divides 2^1499-1
3023 divides 2^1511-1
3119 divides 2^1559-1
3167 divides 2^1583-1
3343 divides 2^557-1
3391 divides 2^113-1
3449 divides 2^431-1
3463 divides 2^577-1
3607 divides 2^601-1
3623 divides 2^1811-1
3863 divides 2^1931-1
4007 divides 2^2003-1
4079 divides 2^2039-1
4127 divides 2^2063-1
4513 divides 2^47-1
4567 divides 2^761-1
4679 divides 2^2339-1
4703 divides 2^2351-1
4799 divides 2^2399-1
4871 divides 2^487-1
4919 divides 2^2459-1
5087 divides 2^2543-1
5399 divides 2^2699-1
5471 divides 2^547-1
5639 divides 2^2819-1
5711 divides 2^571-1
5737 divides 2^239-1
5807 divides 2^2903-1
5879 divides 2^2939-1
5927 divides 2^2963-1
6047 divides 2^3023-1
6079 divides 2^1013-1
6089 divides 2^761-1
6353 divides 2^397-1
6361 divides 2^53-1
6599 divides 2^3299-1
6719 divides 2^3359-1
6857 divides 2^857-1
6983 divides 2^3491-1
7079 divides 2^3539-1
7247 divides 2^3623-1
7487 divides 2^197-1
7529 divides 2^941-1
7559 divides 2^3779-1
7607 divides 2^3803-1
7703 divides 2^3851-1
7727 divides 2^3863-1
7823 divides 2^3911-1
7927 divides 2^1321-1
8039 divides 2^4019-1
8167 divides 2^1361-1
8191 divides 2^13-1
8287 divides 2^1381-1
8423 divides 2^4211-1
8543 divides 2^4271-1
8719 divides 2^1453-1
8783 divides 2^4391-1
8831 divides 2^883-1
9511 divides 2^317-1
9623 divides 2^283-1
9719 divides 2^43-1
9743 divides 2^4871-1
9839 divides 2^4919-1
9887 divides 2^4943-1
gp >
[/code]This is "quite straight" and produces "lots of factors" as long as q stays below 2^33, because - guess what - p is most of the time of the magnitude of q/2 (i.e. under 2^32, in our range of interest), but after that the "productivity" is gone. Puff..

tapion64 2014-04-10 03:32

Well, you're certainly right that it will take more time than I thought. I was misled by how fast I was verifying the factors I found with ECM/trial factoring, because p-1 is naturally easier to factor. Still, I think with an Atkin sieve modified to only find +/-1 mod 8 primes that runs on the GPU, using msieve to factor p-1, then Bach–Shallit to calculate the order, then some primality test that's fast for 32-bit numbers on the order, we reach feasible for at least 2^64 through 2^65 if it's well distributed. The key is that we don't have a CPU bottleneck: everything happens on the GPU, and all of the difficult parts are highly parallelizable. Throw enough high-compute cards at it and we should get it done in a reasonable amount of time. But I'll have to throw it together and test on my 560 Ti.

retina 2014-04-10 03:50

[QUOTE=tapion64;370731]... then some primality test that's fast for 32 bit numbers ...[/QUOTE]Maybe [url]http://miller-rabin.appspot.com/[/url]

axn 2014-04-10 06:10

[QUOTE=retina;370732]Maybe [url]http://miller-rabin.appspot.com/[/url][/QUOTE]

A 256MB bitmap is sufficient to represent all odd numbers under 2^32. Not that it matters, since this step is essentially noise; almost no time will be spent on it.
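The arithmetic behind the 256MB figure: one bit per odd number below 2^32 is 2^31 bits = 256 MiB. A small sketch of such an odd-only bitmap, filled by an Eratosthenes pass (names illustrative; shown at a small limit, but the layout scales to 2^32):

```python
def odd_prime_bitmap(limit):
    """Bit i represents the odd number n = 2i+1 below `limit`; a set
    bit means composite. At limit = 2^32 this is 2^31 bits = 256 MiB."""
    bm = bytearray((limit // 2 + 7) // 8)

    def set_composite(n):
        i = n >> 1
        bm[i >> 3] |= 1 << (i & 7)

    set_composite(1)  # 1 is not prime
    n = 3
    while n * n < limit:
        i = n >> 1
        if not bm[i >> 3] & (1 << (i & 7)):       # n is still prime
            for m in range(n * n, limit, 2 * n):  # odd multiples of n
                set_composite(m)
        n += 2
    return bm

def is_odd_prime(bm, n):
    """Query the bitmap for an odd n below its limit."""
    i = n >> 1
    return not bm[i >> 3] & (1 << (i & 7))
```

At limit = 2^16 the table is 4 KiB; the same indexing gives 256 MiB at 2^32.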

tapion64 2014-04-10 08:00

[QUOTE=retina;370732]Maybe [URL]http://miller-rabin.appspot.com/[/URL][/QUOTE]
I'd rather use a deterministic method since it should be just as fast on such a small number. Technically we could use msieve for the test to screen, and then check with GIMPS to see if any data for the exponent exists, since GIMPS knows all the primes less than 2^32.

jnml 2014-04-10 08:20

[QUOTE=tapion64;370764]I'd rather use a deterministic method since it should be just as fast on such a small number. Technically we could use msieve for the test to screen, and then check with GIMPS to see if any data for the exponent exists, since GIMPS knows all the primes less than 2^32.[/QUOTE]

M-R test [I]is[/I] deterministic in that range if you use proper bases: [url]http://miller-rabin.appspot.com/[/url]

LaurV 2014-04-10 08:27

So, to have an idea about the involved effort: divide this by thousands, considering that pari is quite a slow animal compared with the gpu, but the proportions remain. As I said in the post in the parallel thread, you need an efficient algorithm to factor small numbers very fast, which you don't have. You will lose most of the time factoring q-1, for every q. After that, you don't need to actually find the order; it is much faster to check, for all prime factors [B]p[/B] of [B]q-1[/B] which are in the range of p's you are interested in (that is, from 0 to 1 billion, the GIMPS range, or say 2^32), whether q divides 2^p-1 by exponentiation. This is faster, but still too slow compared with the algorithms used by GIMPS when q gets higher.

To try, see this code:
[CODE]
/* helper constants used by the printf lines below (digit and bit counts) */
ten = log(10); two = log(2);

/* "random" trial factoring, specify a starting possible factor and the output file */
ratf(a,fis)=
{
if(fis==0, print("Specify an output file!"); return(0));
while(1,
until(a%8==1 || a%8==7,a=nextprime(a+1));
c=factorint(a-1)[,1]~;
for(i=1,#c,
if(c[i]<1000000000 && Mod(2,a)^c[i]==1, print(a","c[i]); write(fis,a","c[i]))
)
);
}

/*same as above, but split the output to different files depending on n and k
-if a=-1, it will read the last_k from the file, assuming the file has ONE LINE only
-use the flags.txt file to gracefully stop it
*/
ratfpro(a,lastvalue,printstep)=
{
my(tnextpr,tmod,tfactint,telse,wa,sg);
if(a==-1, a=read("fact_last_k.txt"));
if(lastvalue<=0, error("You must provide a maximum value of k."));
print("Starting/continuing from: "a);
tnextpr=tfactint=tmod=telse=0;
if(printstep==0,printstep=2^20);
prst=a;
gettime();
while(a<=lastvalue,
telse+=gettime();
until(a%8==1 || a%8==7,a=nextprime(a+1));
tnextpr+=gettime();
if(a>prst,
prst=a+printstep;
printf("... %d, (%d digits, 2^%d++) --- Timers: %6.4f, %6.4f, %6.4f, %6.4f (seconds)\n",
a,ceil(log(a)/ten),floor(log(a)/two),
tnextpr/1000, tfactint/1000, tmod/1000, telse/1000);
extern("del fact_last_k.bak");
extern("ren fact_last_k.txt fact_last_k.bak");
write("fact_last_k.txt",a-1);
shouldstop=read("flags.txt");
if(shouldstop==1,break)
);
if(a>lastvalue,print("Candidate list is finished now."); break);
telse+=gettime();
c=factorint((a-1)>>1)[,1]~;
tfactint+=gettime();
for(i=1,#c,
if(c[i]<10000000000,
telse+=gettime();
if(Mod(2,a)^c[i]==1,
tmod+=gettime();
print(a","c[i]" ");
if(c[i]<1000000000,fis="fact_0B_",
if(c[i]<2000000000,fis="fact_1B_",
if(c[i]<3000000000,fis="fact_2B_",
if(c[i]<4000000000,fis="fact_3B_",
if(c[i]<5000000000,fis="fact_4B_",
if(c[i]<6000000000,fis="fact_5B_",
if(c[i]<7000000000,fis="fact_6B_",
if(c[i]<8000000000,fis="fact_7B_",
if(c[i]<9000000000,fis="fact_8B_",
fis="fact_9B_"
)
)
)
)
)
)
)
)
);
expo=37;
k=1<<expo;
for(j=1,40,
if(a<k,fis=concat(fis,expo); break,
k<<=1; expo++));
if(a%8==1,
wa=(a-1)/(8*c[i])-1;
sg="+",
/*else*/
sg="-";
wa=(a-1-2*(4-c[i]%4)*c[i])/(8*c[i])
);
write(fis,wa,sg,c[i]),
/*else*/
tmod+=gettime()
),
/*else*/
break
)
)
);
}

/*same as above, but use modular order instead of factoring k-1*/
/*for bigger stuff it may become faster than factoring k-1*/
ratfzno(a,lastvalue,printstep)=
{
my(tnextpr,tprchk,tznord,telse);
if(a==-1, a=read("fact_last_k"));
tnextpr=tznord=tprchk=telse=0;
gettime();
if(printstep==0,printstep=2^20);
prst=a;
while(a<=lastvalue,
until(a%8==1 || a%8==7,a=nextprime(a+1));
tnextpr+=gettime();
if(a>lastvalue,break);
if(a>prst,
prst=a+printstep;
printf("... %d, (%d digits, 2^%d++) --- Timers: %10.4f, %10.4f, %10.4f, %10.4f seconds\n",
a,ceil(log(a)/ten),floor(log(a)/two),
tnextpr/1000, tznord/1000, tprchk/1000, telse/1000);
write("fact_last_k",a));
telse+=gettime();
c=znorder(Mod(2,a));
tznord+=gettime();
if(isprime(c),
tprchk+=gettime();
if(c<1000000000,fis="fact_0B_",
if(c<2000000000,fis="fact_1B_",
if(c<3000000000,fis="fact_2B_",
if(c<4000000000,fis="fact_3B_",
if(c<5000000000,fis="fact_4B_",
if(c<6000000000,fis="fact_5B_",
if(c<7000000000,fis="fact_6B_",
if(c<8000000000,fis="fact_7B_",
if(c<9000000000,fis="fact_8B_",
if(c<10000000000,fis="fact_9B_",
telse+=gettime();
next))))))))));
expo=40;
k=1<<expo;
for(j=1,40,
if(a<k,fis=concat(fis,expo); break,
k<<=1; expo++));
print(a","c" ");
write(fis,a","c);
telse+=gettime()
);
tprchk+=gettime()
);
}
[/CODE]

save the code, load it in pari, try: "ratfpro(2^10,2^64,10^6)"
(the parameters are start, stop, how often to print)

then stop it with CTRL/C, check which files were created.

then try "ratfpro(2^45,2^64,10^6)"

The timers show how much time was spent finding the next prime q, how much factoring it, how much on exponentiation, etc.

The lines which are printed without "timers" represent factors of Mersenne numbers, in the form "q,p", i.e. "q divides 2^p-1".

Then try starting from 2^58, for example. See how many factors you find.... Not so efficient...

This is nice for didactic purposes only. The function creates a series of files where the factors are stored in a special form; I had some reason at that time to use that form. It also checks (reads/writes) a couple of other files which can be used to resume the work after a stop or crash, or to "gracefully" shut it down (i.e. let it finish the current calculation first; I could implement some pari hook for ctrl-C, but it was beyond the scope).

tapion64 2014-04-10 13:25

The Go code I referenced in the other thread used the Bach–Shallit algorithm, which factors q-1 so that it can do exactly what you're saying in order to calculate the order; otherwise there's no point in factoring the number. And we have a fast way to factor the numbers: msieve has CUDA implementations that factor 30+ digit numbers in no time (the timer literally says 00:00:00). Factoring isn't the bottleneck on this; it's the insanely large number of primes we need to test.

I need to learn more about the logic behind Atkin's sieve before I attempt to modify it, but I did a traditional Eratosthenes sieve coupled with +/-1 mod 8, and we can really only get up to 23. The factor we reduce by when adding a new prime p to the wheel is (p-1)/p, which has diminishing returns. 23 gets you to around .085, which is just a bit less than 4 times the n/ln(n) approximation for the density of primes in the 2^64 to 2^65 range. One option would be to try a 2-step sieve, although I think modifying Atkin's sieve will probably yield better results. Even so, if we assume the factoring is fast enough for numbers in that range (msieve says it is), that just means it takes 4 times longer, which really isn't /that/ bad.
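The .085 figure can be checked exactly: by CRT, over a wheel of size 8·3·5·7·11·13·17·19·23, the surviving residues are the 2 classes mod 8 times (p-1) classes mod each odd prime. A quick illustrative check:

```python
from math import log, prod

# Fraction of integers surviving a wheel over the primes up to 23
# combined with the n = +/-1 (mod 8) restriction. By CRT the counts
# of admissible residues multiply across the coprime moduli.
odd_primes = [3, 5, 7, 11, 13, 17, 19, 23]
wheel = 8 * prod(odd_primes)                     # wheel modulus
survivors = 2 * prod(p - 1 for p in odd_primes)  # +/-1 mod 8, nonzero mod each odd p
density = survivors / wheel                      # ~0.0818, the ".085" above

prime_density = 1 / log(2.0**64)                 # ~0.0225, prime density near 2^64
ratio = density / prime_density                  # ~3.6, "a bit less than 4 times"
```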

Also, these are based on what a 560 Ti can do. A faster GPU with more cores can increase the factoring speed and the resulting calculation of the order. Even if the latest cards in the market can't do this quite yet, give it a few years and they will.

tapion64 2014-04-10 18:46

So I went ahead and read the algorithm and the reasoning for why Atkin's sieve works, and then I ignored all of that and just multiplied everything by 2, sieving with respect to 120 instead of 60 and taking out all numbers that aren't +/-1 mod 8.

The result, surprisingly, appears not only to work, but also to eliminate the need for crossing off multiples of the squares in the final step. In other words, it appears to be a completely parallelizable algorithm that can start from any point in the set of integers instead of just starting with 1. Perhaps more interesting, this algorithm appears to be able to take a pre-sieved list with no problem, not just a contiguous section.

Granted, I only ran the algorithm on 1 through 240 by hand, but since the cycle length is 120, you'd think that means the results are good throughout. I'll do some more extensive testing and see if I find discrepancies. It'd be really neat if this turned out to be a fast way to search for primes = +/- 1 mod 8. I think the need to solve for the solutions to the curves might impact larger numbers too much to be too feasible, though.

If anyone's interested, the revised algorithm:

1. List all integers k to be sieved, marked as composite (well, composite or prime and not = +/-1 mod 8)
2. For all k:[INDENT]a. If k is congruent to {1, 17, 41, 49, 73, 89, 97, 113} mod 120, find the number of solutions d to 4x^2 + y^2 = k, where x,y > 0. If d is odd, mark k as prime = +/- 1 mod 8.

b. If k is congruent to {7, 31, 79, 103} mod 120, find the number of solutions d to 3x^2 + y^2 = k, where x,y > 0. If d is odd, mark k as prime = +/- 1 mod 8.

c. If k is congruent to {23, 47, 71, 119} mod 120, find the number of solutions d to 3x^2 - y^2 = k, where x > y > 0. If d is odd, mark k as prime = +/- 1 mod 8.

d. If k is congruent to any other residue mod 120, leave k as marked as it is (in other words, composite or prime and not = +/- 1 mod 8)
[/INDENT]3. You appear to be done! But the original algorithm says to square each element marked prime and then mark multiples of the square as composite. I didn't have to do this, but that doesn't mean it's not necessary for other test sets.

EDIT: Now that I think about it, this algorithm should certainly be extendable to cover 7, 11, 13, and so on, by multiplying by 120 and increasing the cycle range then sieving out the 'bad' elements in the congruency tests. It may not be necessary, but it might prevent needless curve solving, and it makes the problem set larger to be split up by cards with more cores.
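The revised algorithm as a small Python sketch, so it can be tested well beyond 240. One caution from trying it: the square-elimination step of the standard algorithm still seems necessary here (for example 169 = 13^2 appears to have exactly one representation as 4x^2 + y^2, namely 4*6^2 + 5^2, so it passes the parity test), so it is kept below:

```python
S1 = {1, 17, 41, 49, 73, 89, 97, 113}  # test 4x^2 + y^2        (k = 1 mod 4)
S2 = {7, 31, 79, 103}                  # test 3x^2 + y^2        (k = 1 mod 6)
S3 = {23, 47, 71, 119}                 # test 3x^2 - y^2, x > y (k = 11 mod 12)

def sieve_pm1_mod8(limit):
    """Primes p <= limit with p = +/-1 (mod 8), via the mod-120
    residue classes and the three quadratic-form parity tests."""
    mark = [False] * (limit + 1)
    root = int(limit**0.5) + 1
    for x in range(1, root):
        for y in range(1, root):
            n = 4 * x * x + y * y
            if n <= limit and n % 120 in S1:
                mark[n] = not mark[n]
            n = 3 * x * x + y * y
            if n <= limit and n % 120 in S2:
                mark[n] = not mark[n]
            n = 3 * x * x - y * y
            if x > y and n <= limit and n % 120 in S3:
                mark[n] = not mark[n]
    # square elimination, kept from the standard algorithm
    for n in range(3, root, 2):
        for k in range(n * n, limit + 1, n * n):
            mark[k] = False
    return [n for n in range(limit + 1) if mark[n]]
```

In quick checks this matches trial division for every prime = +/-1 mod 8 up to a few thousand.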

tapion64 2014-04-10 20:27

Apparently after some amount of time you can't edit posts, so sorry for the multiple posts. A basic estimate of the running time of the original Atkin's sieve is O(N/log(log(N))). If we assume a clock rate of 1 GHz on the GPU and 384 cores, and we're doing primes up to 2^65, that takes around 185 days (6 months).

With the modified algorithm, we only look from 2^64 through 2^65, which halves the work, and we assume at least 1/2 of the primes are filtered out as not +/-1 mod 8. Thus we're looking at about a month and a half (possibly less; I need to see how removing the square sieving affects running time) just to iterate through the primes.

Since we previously assumed 1/2 the primes are filtered out, we're dealing with 208 quadrillion inputs. With the same clock rate and cores as before, and quadratic sieve factoring of the 65-bit numbers, 1 core could factor roughly 2048 of them per second, which gives us roughly 8306 years of computing time. Ok, yeah, that's a bottleneck. My earlier calculations were off by a factor of 100, which is enough. I should see if there's a better way to calculate the multiplicative order that doesn't require factoring, possibly a modified naive algorithm that takes into account the qualities of p and works for numbers of this size.

tapion64 2014-04-10 22:42

The solution was there all along. Since we only care if the multiplicative order is less than 2^32, a naive multiply-and-reduce can be run for up to 2^31 steps; hitting -1 at step j means the order is exactly 2j, so this also catches even orders up to 2^32, and if we never hit 1 or -1 we give up on that prime. Probably use some squared exponentiation techniques and operate in [-(p+1)/2,(p+1)/2) to speed it up. No factoring, no ridiculous bottleneck, but this algorithm isn't as parallelizable... I'll keep looking for better options
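That capped search, sketched in Python (illustrative; the -1 shortcut works because the multiplicative group mod a prime q is cyclic, so 2^j = -1 (mod q) forces the order to be exactly 2j; an odd order just past the cap would still be missed):

```python
def capped_order(q, cap):
    """Scan 2^1, 2^2, ... (mod q) for up to `cap` steps.
    Returns ord_q(2) if we hit 1 (order j) or -1 (order exactly 2j)
    within the cap, else None. With cap = 2^31 this catches every
    order up to 2^31 plus every even order up to 2^32."""
    r = 2 % q
    for j in range(1, cap + 1):
        if r == 1:
            return j
        if r == q - 1:      # 2^j = -1 (mod q), so the order is 2j
            return 2 * j
        r = r * 2 % q
    return None
```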

retina 2014-04-10 23:53

tapion64: Maybe you should write all this in a blog somewhere and when you've finally figured it all out come back and post here to get people to join you in your search.

tapion64 2014-04-11 01:30

Will do. I can't believe the site doesn't have perma-editable posts though.

TheMawn 2014-04-11 02:45

[QUOTE=tapion64;370843]Will do. I can't believe the site doesn't have perma-editable posts though.[/QUOTE]

It's different from what other places do, but I think it adds an element of permanence that is more valuable. I've seen posts that were proven wrong get changed weeks after they were first written because the original poster wanted to save face.

EDIT: On the other hand, it appears there are no Crusades against double-posting here, unlike the other forums I am referring to.

LaurV 2014-04-11 03:53

Now you've got to more realistic times :D
Related to your question, there are many ways to get that order without factoring. Unfortunately, none of them is faster.

One very simple (but still exponential) method is extremely easy to implement in binary, as it only needs additions and shifts:

Starting with 1:
-- add q
-- count the (binary) zeroes at the end, and eliminate them
-- repeat the two steps above until you get 1.

Example:
[U]say q=5[/U], which in binary is 101, so 101+1=110, we have 11 and a zero.
-- 11+101=1000, we have 1 (so we stop) and additional 3 zeroes, so we have a total of 4 zeroes. Therefore 5 divides 2[SUP]4[/SUP]-1, more exactly 5 divides all numbers of the form 2[SUP]4k[/SUP]-1, and none others (i.e. there is no other power x such as 2[SUP]x[/SUP]-1 is divisible by 5)

You can do this for all [U]odd[/U] numbers (not necessarily prime) and there is a simple proof that you will always end in 1, after the right number of steps.

[U]say q=9[/U], which in binary is 1001, so 1001+1=1010, we have 101 and a zero.
-- 101+1001=1110, we have 111 and additional one zero, total 2 zeroes.
-- 111+1001=10000, we have 1 (so we stop) and additional 4 zeroes, total 6 zeroes.
Therefore, 9 divides 2^6-1, and all numbers of the form 2[SUP]6k[/SUP]-1 and none others.

[U]say q=11[/U] (decimal), which in binary is 1011, so 1011+1=1100, we have 11 and 2 zeroes.
-- 11+1011=1110, we have 111 and additional one zero, total 3 zeroes.
-- 111+1011=10010, we have 1001 and additional 1 zero, total 4 zeroes.
-- 1001+1011=10100, we have 101 and additional 2 zeroes, total 6 zeroes.
-- 101+1011=10000, we have 1 (so we stop) and additional 4 zeroes, total 10 zeroes.
Therefore, 11 divides 2^10-1, and all numbers of the form 2[SUP]10k[/SUP]-1 and none others.

[U] One which will interest us, where the result is prime:[/U]

[U] say q=23[/U], which in binary is 10111, so 10111+1=11000, we have 11 and 3 zeroes.
-- 11+10111=11010, we have 1101 and additional one zero, total 4 zeroes.
-- 1101+10111=100100, we have 1001 and additional 2 zeroes, total 6 zeroes.
-- 1001+10111=100000, we have 1 (so we stop) and an additional 5 zeroes, total 11 (decimal) zeroes.
Therefore, 23 divides 2^11-1, and all numbers of the form 2[SUP]11k[/SUP]-1 and none others.

The same (last) calculation in decimal, involving "divide by 2 if it is even", would look like:
q=23+1=24, even, we have q=q/2=12 and p=1 (one zero);
q=12 even, we have q=6 and p=2;
q=6 even we have q=3 and p=3;
q=3 odd, etc...
q=3+23=26, q=13, p=4;
q=13+23=36; q=18, p=5; q=9, p=6;
q=9+23=32; q=16, p=7; q=8, p=8; q=4, p=9; q=2, p=10; q=1, p=11; end.
So, 23 divides 2^11-1.
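
LaurV's add-and-shift procedure above can be sketched in a few lines of Python (a didactic sketch, name mine; works for any odd q > 1, not only primes):

```python
def order_of_2_addshift(q):
    """Add-and-shift method: for odd q > 1, repeatedly add q and strip
    the trailing binary zeroes; the total zero count when we return
    to 1 is the multiplicative order of 2 mod q."""
    assert q % 2 == 1 and q > 1
    r, zeros = 1, 0
    while True:
        r += q                 # r is odd, so r + q is even
        while r % 2 == 0:      # count and eliminate the trailing zeroes
            r //= 2
            zeros += 1
        if r == 1:             # back to 1: done
            return zeros
```

Running it on the worked examples reproduces the zero counts above: 4 for q=5, 6 for q=9, 10 for q=11, and 11 for q=23.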

Next step, you have to build an ASIC to compute this (it won't be difficult; it is much simpler than the ASICs for SHA-256 bitcoin mining, hehe, just a few counters and shifters), but it will still be (much) slower than normal exponentiation by squaring - but well... you do no factoring, and neither multiplication nor division :razz:

kladner 2014-04-11 03:57

The one hour limit on editing posts gives time to polish things up, add things, and remove typos. The issue of further editing has been the subject of heated debate for at least one user here.

I think freezing things at some point makes exchanges more like vocal conversation. You can always reconsider and correct earlier statements in later comments, but the process by which you arrive at later positions is retained. I find this to be a valuable feature.

axn 2014-04-11 04:32

[QUOTE=tapion64;370780]msieve has CUDA implementations that factor 30+ digit numbers in no time (the timer literally says 00:00:00). [/quote]
Really? News to me. Where did you get the idea that msieve has CUDA implementation for factoring 30 digit numbers?
[QUOTE=tapion64;370780]Factoring isn't the bottle neck on this, it's the insanely large number of primes we need to test.[/quote]
The bottleneck is the factoring / group order calculation [B]for the[/B] insanely large number of primes, not the generation of the primes themselves.

tapion64 2014-04-11 08:06

[QUOTE=axn;370864]Really? News to me. Where did you get the idea that msieve has CUDA implementation for factoring 30 digit numbers?

The bottleneck is the factoring / group order calculation [B]for the[/B] insanely large number of primes, not the generation of the primes themselves.[/QUOTE]
Maybe it's recent? If you go on SourceForge for msieve, they have a version for CUDA. I tried it out on a few 30 digit numbers and it handled them really quickly.

Yeah, I'm working on getting that down. Unfortunately my field theory isn't quite up to par. I'm reading through a Jager/Lenstra paper, and I roughly understand why p = -1 mod 8 yields odd orders of 2 mod p. I'm trying to extend the logic to determine how the Dirichlet character values for the remaining case p = 1 mod 8 affect the parity of the order, but I'm just not grounded enough in GF theory to do so at the moment; I'll be working on it tomorrow. If I can filter the p = 1 mod 8 cases that will yield even orders out of the modified Atkin sieve, Lenstra says this removes 17/24 of primes asymptotically. That leaves almost 1/4 instead of my assumed 1/2, which is a good start. Also, if we can actually skip the last step of the Atkin sieve without a problem, that should bring the complexity of the sieve itself down by the same factor and another log(log(N)) term. So we're looking at cutting the work to a potential 1/96 of the original, which would get us closer to the 88 years of compute time I had originally predicted; that's certainly feasible when distributed across 100 GPUs or so, and even more so if we consider faster cards than the 560 Ti.

tapion64 2014-04-11 08:35

@ LaurV

Interesting. That appears to be the reverse of the way I normally calculate it, starting from (p+1)/2 and going backwards to 1. It has the same complexity as starting at 2 and shifting left until the value is greater than p, then subtracting p, and repeating until you reach -1; the number of shifts times 2 is the order. I usually do it on the interval [-(p-1)/2, (p-1)/2], since if a shift takes r close to p, r-p is a smaller number in magnitude, and the shifts can be calculated in the negative direction until you get back below (p-1)/2.

I.e., for the order of 2 mod 13: 2 -> 4 -> 8 = -5 -> -10 = 3 -> 6 -> 12 = -1, 6 shifts, so the order is 12. The actual complexity of this method is less than O(order/2), since you can do a quick calculation to see how many shifts you need to reach the next boundary, which speeds it up for larger numbers by reducing the number of individual operations. Since we're ignoring cases where the order > 2^32, we will never exceed 2^31 steps even with the naive method, so it's effectively O(sqrt(N)) with a hard cap of 2^31. Still not as good as O(log(n)), but it can be greatly helped by combining those trivial shifts into one instruction. I'd rather not do it this way, though, because it's a sequential algorithm with no advantage on the GPU; I'd rather come up with a parallelizable algorithm that will be faster for the primes we're dealing with.
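
The "combine the trivial shifts into one instruction" idea can be sketched like this (my own hypothetical illustration, not project code): the number of doublings to the next reduction boundary is computed from bit lengths, so each loop iteration performs one multi-bit shift and one reduction instead of single doublings.

```python
def order_of_2_batched(p, cap=1 << 31):
    """Order of 2 mod odd prime p, batching the 'free' shifts: from a
    balanced residue r, jump straight to the next reduction boundary
    with one shift of s bits.  Intermediate values |r * 2^i| for i < s
    stay at or below (p-1)/2 and cannot be +/-1, so no root of 1 or -1
    is skipped.  Gives up past `cap` total doublings."""
    half = p // 2
    r, k = 1, 0
    while k < cap:
        # smallest s with |r| * 2^s > half (valid since |r| <= half)
        s = half.bit_length() - abs(r).bit_length() + 1
        r <<= s
        k += s
        r %= p                 # Python's % maps negatives into [0, p)
        if r > half:           # fold back into [-(p-1)/2, (p-1)/2]
            r -= p
        if r == 1:
            return k
        if r == -1:
            return 2 * k
    return None                # order exceeds the cap
```

The total work drops from one operation per doubling to one per boundary crossing, which is the speed-up described above for larger residues.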

axn 2014-04-11 08:54

[QUOTE=tapion64;370879]Maybe it's recent? If you go on SourceForge for msieve, they have a version for CUDA. I tried it out on a few 30 digit numbers and it handled them really quickly.[/quote]
There is a version [B]with[/B] CUDA. The CUDA support is for GNFS poly select acceleration, not general factorization. What you're seeing is the CPU code in action, which is quite capable of handling 30-digit factorizations that fast.

[QUOTE=tapion64;370879]Yeah, I'm working on getting that down. [/QUOTE]

A little bit here and there won't cut it. Here is what you face: there are about 4.1e17 primes b/w 2^64 & 2^65, of which we need to consider half (1 or 7 mod 8). Assuming you can compute the order for a prime in 1 microsecond (million p/s), you will need ~200 Gsecs on 1 core (~6000 Core years).

Straightaway, I am giving you a generous 1 us/prime. Are you anywhere close to that? Even with all potential optimizations, if you can't reach the microsecond territory, then it is hopeless.
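
axn's numbers check out with a quick back-of-the-envelope x/ln x estimate (an approximation of the prime count, not an exact computation):

```python
import math

lo, hi = 2**64, 2**65
# x/ln(x) approximation to the number of primes in [2^64, 2^65]
primes = hi / math.log(hi) - lo / math.log(lo)   # ~4.1e17
candidates = primes / 2                          # keep only p = 1 or 7 (mod 8)
seconds = candidates * 1e-6                      # assume 1 microsecond per prime
core_years = seconds / (365.25 * 24 * 3600)
print(f"{primes:.2e} primes, {seconds:.2e} s, {core_years:.0f} core-years")
```

This reproduces the ~200 Gsec / ~6000 core-year figure quoted above.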

retina 2014-04-11 09:04

[QUOTE=axn;370881]... then it is hopeless.[/QUOTE]No. Just make it distributed. Convince everyone here and there that it is important and useful and that they should give cycles to this. Should be easy peasy. :rolleyes:

Or buy a super computer. It doesn't even need to be one in the top 10.

tapion64 2014-04-11 09:15

[QUOTE=axn;370881]There is a version [B]with[/B] CUDA. The CUDA support is for GNFS ploy select acceleration, not general factorization. What you're seeing is the CPU code in action, which is quite capable of handling 30-digit factorizations that fast.



A little bit here and there won't cut it. Here is what you face: there are about 4.1e17 primes b/w 2^64 & 2^65, of which we need to consider half (1 or 7 mod 8). Assuming you can compute the order for a prime in 1 microsecond (million p/s), you will need ~200 Gsecs on 1 core (~6000 Core years).

Straightaway, I am giving you a generous 1 us/prime. Are you anywhere close to that? Even with all potential optimizations, if you can't reach the microsecond territory, then it is hopeless.[/QUOTE]
As Retina said, if it doesn't drop into one-year compute-time territory, I'd just make it distributed and assign people blocks of 120 numbers to sieve and test. Also, yeah, I see now that the CUDA part doesn't affect the quadratic sieve. I'm trying to get away from factoring anyway, so I guess it's a good thing.

LaurV 2014-04-11 09:27

[edit: crosspost with the last 3-4 posts]

[QUOTE=tapion64;370880]I'd rather come up with a parallelizable algorithm that will be faster for the cases of primes we're dealing with.[/QUOTE]
You can "parallelize" it how much you want, by handling one billion primes in the same time. It seems that you didn't get the right issue. This method (same as the other method given by me) will do a hundred million steps for one "hundred million exponent". And this for most of the primes which are over 2^33 (almost all of them have orders higher than 2^32 in this case). So, you will end up with (how much it was? 2^17 primes? I already forgot) about 2^32 steps for 2^17 exponents...

These elementary things are interesting to talk about, because they help a newcomer understand what is going on, but their practical value is zero. You will never be able to finish this type of calculation in your lifetime, even with all the computing power in the world at your fingertips. In this case, even factoring q-1 (which is where we started from, remember?) would be much faster.

Try taking 10 random primes in the 2^58 range, and do this "add and shift" method for all of them. When you finish, come back here... (now we got rid of you for a week or two, and there were only 10 primes!).

Don't get me wrong, I keep this conversation with you alive only because I walked exactly this path years ago when I started with hunting for prime numbers. I also invested/wasted a lot of time in "genuine" paths like this, that went nowhere... This is how you learn.

Please see [URL="http://www.mersenneforum.org/showthread.php?p=319330"]this topic[/URL] for a more interesting (yet didactic only) discussion about factoring.

tapion64 2014-04-11 13:43

[QUOTE=LaurV;370885][edit: crosspost with the last 3-4 posts]
You can "parallelize" it how much you want, by handling one billion primes in the same time. It seems that you didn't get the right issue. This method (same as the other method given by me) will do a hundred million steps for one "hundred million exponent". And this for most of the primes which are over 2^33 (almost all of them have orders higher than 2^32 in this case). So, you will end up with (how much it was? 2^17 primes? I already forgot) about 2^32 steps for 2^17 exponents...

Try taking 10 random primes in the 2^58 range, and do this "add and shift" method for all of them. When you finish, come back here... (now we got rid of you for a week or two, and there were only 10 primes!).[/QUOTE]
There are around 416 quadrillion total primes less than 2^65 by x/ln(x) estimation. We can eliminate half of those by screening for +/-1 mod 8, and 17/24 of them by screening for primes = 1 mod 8 that will yield an even multiplicative order (which I almost have sorted out).

That leaves us with ~2.946 * 10^17 primes to test. Also, since we can stop once we've hit 2^k = -1 mod p (the order will then be 2k), the step cap is 2^31, which works out to roughly 3.728 * 10^26 instructions (actually far less than this, because we're compressing consecutive right/left shifts into one instruction, but let's go with it). With only that optimization, we'd need to drop the worst-case complexity from 2^31 to 2^5 to do this in 87 compute-years, which is indeed too much.

We could address this better if we could solve the problem in parallel, but ultimately the goal is to sieve out unnecessary primes in the first place. While finding 10 random primes in that range and using this method on them won't take as long as you think it would (roughly 3 primes could be handled per second on one of my 3.4 GHz cores without trying for shift optimization, yielding 12 in 1 second), it will take too long over the whole range we're testing. That's why I'm trying to find an algorithm with complexity near log(n) that can be parallelized; that brings it back into a reasonable range. Factoring also gives a better expected running time, but it is prone to catching on numbers with large prime factors, which I believe would in practice slow the effort down more than the naive methods.

There are two ways to attack this problem: further increasing the efficiency of the sieve (which will have the higher impact), or decreasing the complexity of the order calculation to O(log(n)). I think there's only so much that can be done on the order-calculation side, but if I can get a generalized process to sieve out numbers with even (and then odd composite) orders before calculation, by exploiting the facts we know (and finding facts we don't), we can break the problem down into something more feasible, even if we keep the CPU factoring bottleneck or the O(2^31) cap of the naive method.

I'll be working to better understand GF theory and Dirichlet characters to improve the efficiency of the sieve today. I'm also looking into why, for numbers p = 1 mod 8 which have even orders, the order is never p-1; it always comes out as (p-1)/2 or less. I believe it's related to the Kronecker/Jacobi symbol properties of p = +/-1 mod 8 numbers, but more testing is needed.

tapion64 2014-04-12 02:42

Today was a productive day. I learned a bit more about GF theory and Dirichlet characters. As I suspected, the key to sieving out p = 1 mod 8 which have even orders is the Jacobi symbol (well, the generalized power residue symbol). Every prime = +/-1 mod 8 necessarily has J(2/p) = 1, but unfortunately, since (p-1)/2 is always even, this only tells us that the order of 2 mod p is at most (p-1)/2 (which was a problem I wanted to solve anyways).

In order to correctly determine whether the order is odd or even, we have to do a chain of power residue symbol checks for 2^(2^n) = 2 mod p. J(2/p), we know, is always 1. Then Pr(2/p)[SUB]4[/SUB], which functions like Jacobi except the subscript replaces the 2, i.e. x^4 = 2 mod p. If that's -1, we know immediately that the order is even and we can stop. If it's 1, then we test Pr(2/p)[SUB]8[/SUB]. If this is -1, again we know that the order is even and we can stop. After this point, we may already be done: (p-1)/8 is a guaranteed integer, but (p-1)/8 might still be even, and (p-1)/16, and so forth. Essentially we need Pr(2/p)[SUB]2^n[/SUB] to be equal to 1 for all n = 2 ... k, where k is the number of times 2 divides p-1, and k will always be at least 3. That's a somewhat complicated calculation just for sieving, but we could sieve out to at least k = 3 and probably have sufficient results, as it gives diminishing returns as k increases.

EDIT: It might be worth noting that this might be the precise requirement for calculating totient(p)/order = (p-1)/order. Find the max value of n for each prime factor p[SUB]i[/SUB] of p-1 such that Pr(p[SUB]i[/SUB]/p)[SUB]2^n[/SUB] = 1, then divide p-1 by p[SUB]i[/SUB]^(max n) for each p[SUB]i[/SUB] to get the order of 2 mod p. And maybe (maybe maybe maybe), if you replace the 2 in 2^n in the power symbol with another number z, then you find the order of z mod p. I'll look into whether anyone's conjectured this (or proven it) before.

So forget that stuff; it was just for my own edification. I found a better way, which just involves calculating 2^e = a mod p, where e = (p-1)/2^k (i.e. divide p-1 by 2 until it's odd). If a = 1, the order is odd and divides e (and heuristically speaking, e is usually either the order or a small multiple of it; I'm working out exact rules on this at the moment). If e is a large multiple of the order, we've found a large factor of a smaller Mersenne number or a Mersenne prime, but since we know that none of the numbers we're testing are Mersenne primes, it means they're factors of smaller Mersenne numbers we haven't found yet. If a is anything else, by definition we'd have to square it between 1 and k times to arrive at the actual order, at which point the order must be even, so we can ignore it. Much more efficient than doing all those power residue symbol calculations.
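
That simpler test is just a few lines (a sketch; the name is mine):

```python
def order_of_2_is_odd(p):
    """Compute a = 2^e mod p with e the odd part of p-1.
    a == 1 means the order of 2 mod p is odd (and divides e);
    any other value means the order is even."""
    e = p - 1
    while e % 2 == 0:          # strip factors of 2: e = (p-1) / 2^k
        e //= 2
    return pow(2, e, p) == 1
```

One modular exponentiation per prime, versus a whole chain of residue symbol evaluations.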

But much more importantly, this made me realize a useful feature of Mersenne factors that I touched on earlier but didn't focus on until today. If p = -1 mod 8, then the order is naturally going to be at most (p-1)/2. By further modifying the Atkin sieve variant I came up with earlier, it's possible to sieve out every Sophie Germain prime = 3,-1 mod 8 less than 4, then 4*3, then 4*3*5, then 4*3*5*7, and so forth. It's a complicated 2-step process: sieve out p = -1 mod 8, then repeat the sieve on (p-1)/2. It should be possible to combine the two sieve steps into one; it was just logically easier to put it into two.

I want to say that because the sieve always finds all Sophie Germain primes = 3,-1 mod 8 less than 2*Primorial(n), and it would appear that every iteration of the sieve must yield new primes, this implies that there are infinitely many Sophie Germain primes... but I know that's not a proof. I think there is a way to use this to prove the conjecture, if you can combine it with a sieve that finds the other set of Sophie Germain primes (those = 1,5 mod 8) and then prove that if either sieve stopped finding primes, there would be some contradiction based on the Mersenne-factor properties. But I'm too burnt out to work on that, and I'd rather focus on bringing the elimination of Mersenne factors into reasonable compute time.

EDIT: Ok there might be too much math in this post I need to go to sleep <.<

