No, too big an FFT will cause errors too. I think it has to do with how far the carries get propagated.
|
Stage 1 save files are now implemented. It's not very polite in that it doesn't clean these up when it's done. Some of you will want to keep these for extending b1 later. I'm starting work on stage 2 save files and will figure out the cleanup when that's ready.
|
[QUOTE=owftheevil;340949]Stage 1 save files are now implemented. It's not very polite in that it doesn't clean these up when its done. Some of you will want to keep these for extending b1 later. I'm starting work on stage 2 save files and will figure out the cleanup when that's ready.[/QUOTE]
Do you have a Win-32-bit compiled version of this available? |
Not yet. frmky has been doing the Windows builds. I don't know when he will have time to get to it.
|
Just wanted to mention that without frmky's help none of this would be available until later this summer or maybe even fall.
|
Windows binaries with latest changes, untested as usual.
Win32 [URL="https://www.dropbox.com/s/ecwuwbezul6t65m/cudapm1_win32_20130520.zip"]https://www.dropbox.com/s/ecwuwbezul6t65m/cudapm1_win32_20130520.zip[/URL] x64 [URL="https://www.dropbox.com/s/ik1g9eza96t767q/cudapm1_x64_20130520.zip"]https://www.dropbox.com/s/ik1g9eza96t767q/cudapm1_x64_20130520.zip[/URL] |
Thank you very much! :smile:
|
Latest and greatest 64-bit binary works here :smile:
Stopped and resumed a couple of times during stage 1; here are the end results on the new WHQL ForceWare:
[CODE]Accumulated product stage 1: M63137587, 0x1f2595c1236f31dc, n = 3456K, CUDAPm1 v0.10
Accumulated product stage 2: M63137587, 0x412ca727e7d21026, n = 3456K, CUDAPm1 v0.10[/CODE] |
I'm having trouble with CUDAPm1. When I use "-b1 3100000" on the command line it works, but it spends a long time in the CPU routine that computes the product. The PARI line "n=3*10^6; lgn=log(n); z=prod(x=1,n,if(isprime(x),x^floor(lgn/log(x)),1)); ceil(log(z)/log(2))" returns in about the same time, against all logic and reason (PARI should be much slower!).
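For readers without PARI/GP, the one-liner above can be sketched in Python. This only illustrates the quantity being measured (the bit length of the product of the largest prime powers not exceeding B1). The function names are hypothetical and this is not CUDAPm1's code. It uses integer arithmetic where the PARI line uses `floor(lgn/log(x))`, to avoid floating-point rounding at exact prime powers.

```python
from math import isqrt, prod

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes <= n."""
    if n < 2:
        return []
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, isqrt(n) + 1):
        if sieve[i]:
            # Clear every multiple of i starting at i*i.
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if sieve[i]]

def max_prime_power(p, b1):
    """Largest power of p that does not exceed b1."""
    q = p
    while q * p <= b1:
        q *= p
    return q

def stage1_exponent(b1):
    """Product of the largest prime powers <= b1 (equal to lcm(1..b1))."""
    return prod(max_prime_power(p, b1) for p in primes_up_to(b1))

if __name__ == "__main__":
    e = stage1_exponent(10**4)
    # Bit length of the exponent, roughly the number of squarings needed.
    print(e.bit_length())
```

The bit length grows roughly like 1.44 * B1, which is in line with the iteration count reported in the log further down this post. A plain left-to-right product like this one is slow for B1 in the millions, since each step multiplies an ever larger integer by a small one.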
But this is not the main problem. All values between 3200000 and 20M are parsed wrong: it says "B1 need to be at least 1" and runs a test with B1=1 and B2=393xxx or so, which [B]does[/B] find a factor, if one exists for these values. I am not sure whether smaller values starting with 1 (like -b1 150000) are parsed wrong too. When I use a -b1 value over 20M, it is parsed correctly (but it never returns from the CPU multiplication routine, not even after half an hour).

So, what are the restrictions on B1? Or are there no restrictions and I am doing something completely silly? (I would like to run "CUDAPm1 160403 -b1 12000000 -b2 12000000", for example... The max value I can use is around 3.1M, which is not enough; the former one is 10M. And that is ignoring the fact that it wants B2 to be 13 times higher than B1, which is total nonsense for these numbers.)

Also, how can we "extend" a former B1? I tried the test cases:
CUDAPm1 58610467 -b1 70843 -b2 694201
and
CUDAPm1 58610467 -b1 694201 -b2 694201
They both find the factor [edit: the first one in stage 2, the second one in stage 1, as is normal] if started from scratch (deleting the checkpoint file in between). But now, assuming I have a run with the first, I want the second run to continue from where B1 left off. This is not possible, as the former B1 is recorded in the file, and if I leave the file there, it totally ignores my command line: it says "found limits in the file" and only runs stage 2. If I delete the file, obviously it starts from scratch, duplicating most of the work. This is not what was intended when we talked about "extending B1". OTOH, resuming stage 1 works very nicely, and I believe it is only a matter of ignoring that former B1 stored in the file (I did not look into the sources, however, and for the record, I use the Win7 64-bit binaries).

Question: why are you doing that whole product in the beginning?
You can do exponentiation for every prime; this would make it easy to "extend" the B1 limit, and you would not need to stress "only" the CPU (the GPU is idle during this time, for [U]minutes[/U], depending on how big B1 is).
[CODE]
>CUDAPm1 630893 -b1 3100000
mkdir: cannot create directory `savefiles': File exists
CUDA reports 1306M of 1535M GPU memory free.
Using e=6, d=2310, nrp=480
Using approximately 155M GPU memory.
B2 should be at least 390390, increasing it.
B2 should be at least 40300000, increasing it.
[COLOR=Red]<<<< here it stays about 2 minutes, GPU is idle, CPU hard computing the product, then everything continues normally.[/COLOR]
Starting stage 1 P-1, M630893, B1 = 3100000, B2 = 40300000, e = 6, fft length = 40K
Doing 4471985 iterations
Iteration 10000 M630893, 0x280b630169a8b5f7, n = 40K, CUDAPm1 v0.10 err = 0.00049 (0:17 real, 1.6675 ms/iter, ETA 2:04:00)
Iteration 20000 M630893, 0xfb3b1f4975308539, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1044 ms/iter, ETA 7:44)
Iteration 30000 M630893, 0xc90545f20507538b, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1039 ms/iter, ETA 7:41)
Iteration 40000 M630893, 0x3ff1f732d6ebab86, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1041 ms/iter, ETA 7:41)
[/CODE] |
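The two B2 adjustments in the log above are consistent with the quantities the author names later in the thread: e (the B-S exponent), d (the primorial), and p (the smallest prime not dividing d). The sketch below reproduces just those two numbers from e=6, d=2310 and B1=3100000. It is an assumption that the program derives them this way, and the helper names are made up for illustration. The sketch only computes the products; it does not settle in which direction the program enforces each inequality.

```python
def smallest_prime_not_dividing(d):
    """Smallest prime p with d % p != 0 (trial division, fine for small p)."""
    p = 2
    while True:
        is_prime = all(p % q for q in range(2, int(p ** 0.5) + 1))
        if is_prime and d % p != 0:
            return p
        p += 1

def b2_thresholds(b1, e, d):
    """Return (p * d * (2e + 1), p * b1) for the given B1, e and d."""
    p = smallest_prime_not_dividing(d)
    return p * d * (2 * e + 1), p * b1

if __name__ == "__main__":
    # Values taken from the log: e=6, d=2310, B1=3100000.
    print(b2_thresholds(3_100_000, e=6, d=2310))  # (390390, 40300000)
```

With d = 2310 = 2 * 3 * 5 * 7 * 11 the smallest non-dividing prime is 13, which also matches the "13 times higher" observation in the post.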
LaurV, thanks for your input. I'll have time for a more complete response in about an hour, but for now I'll just say that most of what you are talking about hasn't been implemented yet, or hasn't been cleaned up yet. I was unaware of any problems parsing b1; I'll take a look as soon as I have time.
|
[QUOTE]I'm having trouble with CUDAPm1. When I use "-b1 3100000" on the command line it works, but it spends a long time in the CPU routine that computes the product. The PARI line "n=3*10^6; lgn=log(n); z=prod(x=1,n,if(isprime(x),x^floor(lgn/log(x)),1)); ceil(log(z)/log(2))" returns in about the same time, against all logic and reason (PARI should be much slower!).[/QUOTE]
My lack of imagination strikes again. As in, who the heck would want to spend that much time doing P-1? Well, I know the answer to that question now. Currently, the computation of the product of prime powers is rather inefficient. And now that I realize some people will want to use huge b1's, I should probably split large b1's into two parts: a reasonable-length large exponent, and then piecewise smaller exponents to fill in the gap.
[QUOTE]But this is not the main problem. All values between 3200000 and 20M are parsed wrong: it says "B1 need to be at least 1" and runs a test with B1=1 and B2=393xxx or so, which [B]does[/B] find a factor, if one exists for these values. I am not sure whether smaller values starting with 1 (like -b1 150000) are parsed wrong too. When I use a -b1 value over 20M, it is parsed correctly (but it never returns from the CPU multiplication routine, not even after half an hour).[/QUOTE]
Like I said earlier, I was not aware of this problem. I'll look into it.

[QUOTE]So, what are the restrictions on B1? Or are there no restrictions and I am doing something completely silly? (I would like to run "CUDAPm1 160403 -b1 12000000 -b2 12000000", for example... The max value I can use is around 3.1M, which is not enough; the former one is 10M. And that is ignoring the fact that it wants B2 to be 13 times higher than B1, which is total nonsense for these numbers.)[/QUOTE]
Currently there are a few silly restrictions caused by my failure to consider boundary cases in the initialization of stage 2. These are first on the list to be removed after stage 2 save files are working. Exactly what the restrictions are depends on many factors, so it's hard to say exactly how big b1 must be. If e is the B-S exponent, d is the primorial being used, and p is the smallest prime which does not divide d, then b2 / p <= b1 and b2 / p / d >= 2 * e + 1 are the primary restrictions.

[QUOTE]Also, how can we "extend" a former B1?[/QUOTE]
You can't yet. It's on the list of things to do. The code for splitting large b1's up will automatically provide most of this.

[QUOTE]Question: why are you doing that whole product in the beginning? You can do exponentiation for every prime; this would make it easy to "extend" the B1 limit, and you would not need to stress "only" the CPU (the GPU is idle during this time, for [U]minutes[/U], depending on how big B1 is).[/QUOTE]
Speed.
Every bit in the binary representation of the exponent requires a squaring, and each 1 bit requires an additional multiplication by the base. If the base is 3, that multiplication can be done with a modified normalization kernel with a negligible increase in time, but with a huge integer base it requires an additional FFT multiplication. |
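The trade-off described above can be illustrated with a small sketch that counts operations in plain square-and-multiply. It compares raising 3 to the whole product at once against exponentiating one prime power at a time. The modulus is a toy value and the function names are hypothetical; none of this is CUDAPm1's code, and real timings depend on the FFT implementation.

```python
def pow_count(base, exp, mod):
    """Left-to-right square-and-multiply for exp >= 1.

    Returns (result, squarings, multiplications_by_base).
    """
    result = base % mod
    squarings = multiplies = 0
    for bit in bin(exp)[3:]:  # skip '0b' and the leading 1 bit
        result = result * result % mod
        squarings += 1
        if bit == "1":
            result = result * base % mod
            multiplies += 1
    return result, squarings, multiplies

def stage1_whole(prime_powers, mod, base=3):
    """One exponentiation by the full product; every multiply is by the small base."""
    exp = 1
    for q in prime_powers:
        exp *= q
    return pow_count(base, exp, mod)

def stage1_per_prime(prime_powers, mod, base=3):
    """One exponentiation per prime power; every multiply is by the running residue."""
    x, squarings, multiplies = base % mod, 0, 0
    for q in prime_powers:
        x, s, m = pow_count(x, q, mod)
        squarings += s
        multiplies += m
    return x, squarings, multiplies

if __name__ == "__main__":
    mod = 2**61 - 1  # toy modulus, far smaller than a real Mersenne candidate
    powers = [16, 27, 25, 7, 11, 13, 17, 19, 23, 29]  # largest prime powers <= 30
    print(stage1_whole(powers, mod))
    print(stage1_per_prime(powers, mod))
```

Both routes give the same residue, since (3^a)^b = 3^(ab). The difference is in what the extra multiplications cost: in the whole-product route they are multiplications by 3, while in the per-prime route each one is a full-size multiplication by the running residue. The per-prime route does make it straightforward to continue from an earlier residue with additional factors, which is the appeal raised in the question.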