|
1 Attachment(s)
Ver 1.048 with linux64 exec file.
[code] $ grep 1.048 GeneferCUDA.cu //1.048 cufftSetCompatibilityMode(plan,CUFFT_COMPATIBILITY_NATIVE); //1.047 check( 4000,2097152, "ff3daf8908789696"); //1.048 from AG5BPilot [/code] [code] $ ./GeneferCUDA.cuda4.0.Linux64 -b GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1 Copyright (C) 2001-2003, Yves Gallot (v1.3) Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1) Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA) A program for finding large probable generalized Fermat primes. Generalized Fermat Number Bench 2009574^8192+1 Time: 407 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 414 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 467 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 611 us/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 843 us/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 1.37 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 2.65 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 5.3 ms/mul. Err: 1.72e-01 5946413 digits 380742^2097152+1 Time: 10.7 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 23.4 ms/mul. Err: 1.56e-01 23028076 digits 251196^8388608+1 Time: 48 ms/mul. Err: 1.56e-01 45298590 digits [/code] |
I'm not sure how long this has been in the code, but the source of the occasional checkpoint read failures is the code within the checkpoint read function using !strcmp when it should be using strcmp, and also not adding a terminating null to the end of one of the strings it's comparing.
This code: [code] fread(build, 1, bytes, fPtr); if (!strcmp(build, CPU_TARGET)) {[/code] should be this: [code] fread(build, 1, bytes, fPtr); build[bytes] = 0; if (strcmp(build, CPU_TARGET)) {[/code] |
[QUOTE=AG5BPilot;284535]I'm not sure how long this has been in the code, but the source of the occasional checkpoint read failures is due to the code within the checkpoint read function using !strcmp when it should be using strcmp AND also not adding a terminating null to the end of one of the strings it's comparing.
[/QUOTE] You are right. [code] $ od -cx genefer.ckpt |head 0000000 335 \0 \0 \0 004 \0 \0 \0 C U D A 232 372 036 \0 00dd 0000 0004 0000 5543 4144 fa9a 001e #define SaveFileVersion 221 oldSaveVer="00dd 0000" byte="0004 0000" build="5543 4144" (x86 is little-endian format.) http://research.microsoft.com/en-us/um/redmond/projects/invisible/src/crt/strcmp.c.htm [/code] |
1 Attachment(s)
Ver 1.049 with linux64 exec file.
Fixed #99 issue. |
1 Attachment(s)
Ver 1.05 with linux64 exec file.
Fixed "Corruption of error message" issue. |
1 Attachment(s)
Minor cosmetics (i*2) to ((i)*2) in v1.051.
Do you have any idea why your source produces 100% load on one CPU core? Is it also by design that stopping the bench results in a hanging app after writing the message "writing checkpoint" to the screen? Do you know what causes the underlined differences in err-rates between the 32-bit and 64-bit app? Here are some timings for our apps and, in comparison, the timings for genefer, genefer80 and geneferX64: [QUOTE]boinc@vmware2k-3: ./[B]GeneferCUDA.cuda4.0.Linux32[/B] -b 2009574^8192+1 Time: 663 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 680 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 746 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 980 us/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 1.34 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.07 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 4.06 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 8.21 ms/mul. [U]Err: 1.64e-01[/U] 5946413 digits 380742^2097152+1 Time: 16.6 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 35.8 ms/mul. [U]Err: 4.07e-01[/U] 23028076 digits 251196^8388608+1 Time: 73.2 ms/mul. [U]Err: 4.33e-01[/U] 45298590 digits [/QUOTE] [QUOTE]boinc@vmware2k-3: ./[B]GeneferCUDA.cuda4.0.Linux64[/B] -b 2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 713 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 779 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 8.21 ms/mul. [U]Err: 1.72e-01[/U] 5946413 digits 380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 36.9 ms/mul. [U]Err: 1.56e-01[/U] 23028076 digits 251196^8388608+1 Time: 74.9 ms/mul. 
[U]Err: 1.56e-01[/U] 45298590 digits [/QUOTE] CPU: [QUOTE]boinc@vmware2k-3: ./[B]genefer[/B] -b 6631258^256+1 Time: 6.1 us/mul. Err: 0.4991 1747 digits 5386256^512+1 Time: 13.4 us/mul. Err: 0.4361 3447 digits 4375000^1024+1 Time: 28.1 us/mul. Err: 0.4200 6801 digits 3553604^2048+1 Time: 59.8 us/mul. Err: 0.4686 13416 digits 2886422^4096+1 Time: 129 us/mul. Err: 0.4362 26462 digits 2344504^8192+1 Time: 327 us/mul. Err: 0.4133 52184 digits 1904328^16384+1 Time: 693 us/mul. Err: 0.4014 102888 digits 1546796^32768+1 Time: 1.46 ms/mul. Err: 0.4412 202816 digits 1256388^65536+1 Time: 3.05 ms/mul. Err: 0.4634 399713 digits 1020504^131072+1 Time: 6.33 ms/mul. Err: 0.3661 787588 digits 828906^262144+1 Time: 13.8 ms/mul. Err: 0.4133 1551501 digits 673282^524288+1 Time: 29.4 ms/mul. Err: 0.4228 3055654 digits 546874^1048576+1 Time: 71.2 ms/mul. Err: 0.3784 6016611 digits 444200^2097152+1 Time: 155 ms/mul. Err: 0.3295 11843831 digits [/QUOTE] [QUOTE]boinc@vmware2k-3: ./[B]genefer80[/B] -b 5683936^256+1 Time: 9.77 us/mul. Err: 0.0002 1730 digits 4616790^512+1 Time: 20.8 us/mul. Err: 0.0002 3413 digits 3750000^1024+1 Time: 47.6 us/mul. Err: 0.0002 6732 digits 3045946^2048+1 Time: 101 us/mul. Err: 0.0002 13279 digits 2474076^4096+1 Time: 227 us/mul. Err: 0.0002 26188 digits 2009574^8192+1 Time: 522 us/mul. Err: 0.0002 51636 digits 1632282^16384+1 Time: 1.08 ms/mul. Err: 0.0002 101791 digits 1325824^32768+1 Time: 2.25 ms/mul. Err: 0.0002 200622 digits 1076904^65536+1 Time: 5 ms/mul. Err: 0.0002 395325 digits 874718^131072+1 Time: 11.2 ms/mul. Err: 0.0002 778813 digits 710492^262144+1 Time: 22.7 ms/mul. Err: 0.0002 1533952 digits 577098^524288+1 Time: 55.3 ms/mul. Err: 0.0002 3020555 digits 468750^1048576+1 Time: 128 ms/mul. Err: 0.0002 5946413 digits 380742^2097152+1 Time: 270 ms/mul. Err: 0.0002 11703432 digits [/QUOTE] [QUOTE]boinc@vmware2k-3: ./[B]geneferX64[/B] -b 5683936^256+1 Time: 3.81 us/mul. Err: 0.2500 1730 digits 4616790^512+1 Time: 8.24 us/mul. 
Err: 0.2500 3413 digits 3750000^1024+1 Time: 17.7 us/mul. Err: 0.2500 6732 digits 3045946^2048+1 Time: 37.8 us/mul. Err: 0.2500 13279 digits 2474076^4096+1 Time: 80.6 us/mul. Err: 0.2500 26188 digits 2009574^8192+1 Time: 200 us/mul. Err: 0.2500 51636 digits 1632282^16384+1 Time: 420 us/mul. Err: 0.2500 101791 digits 1325824^32768+1 Time: 879 us/mul. Err: 0.2188 200622 digits 1076904^65536+1 Time: 1.88 ms/mul. Err: 0.2031 395325 digits 874718^131072+1 Time: 3.83 ms/mul. Err: 0.2188 778813 digits 710492^262144+1 Time: 7.97 ms/mul. Err: 0.1875 1533952 digits 577098^524288+1 Time: 17.5 ms/mul. Err: 0.1719 3020555 digits 468750^1048576+1 Time: 44.4 ms/mul. Err: 0.1875 5946413 digits 380742^2097152+1 Time: 98.8 ms/mul. Err: 0.1875 11703432 digits [/QUOTE] |
[QUOTE=rroonnaalldd;284769]
Do you have an idea, why your source produces a 100% load on one cpu-core? Is this too by design that stopping the bench results in a hanging app after writing the message "writing checkpoint" to the screen? Do you know what causes the underlined differences in err-rates between the 32bit- and 64bit-app? CPU:[/QUOTE] Ronald, I can probably answer some of your questions as my boinc version is, essentially, a superset of Shoichiro's software. The problem with not being able to terminate the benchmarks has to do with the way the CTRL-C trapping works. The handler for the signal capture is set up at the beginning of the program, BEFORE it's determined that you're going to run a benchmark. When you hit CTRL-C, the handler captures it and sets the quitting flag, which tells "check" to stop running, do a checkpoint, and exit. Unfortunately, "check" isn't running. So the quitting flag is set, but nothing ever reads that. I fixed this in my boinc version of the code by only setting the CTRL-C handler if we're going to do a real PRP test. Otherwise, I let the default run-time handler catch the signal and terminate the program normally. As for the CPU core loading, mine doesn't do that, and the math code in my version is identical to Shoichiro's. But my build is 32 bit windows. By the way, when I did build a 64 bit version, it ran very slightly slower than the 32 bit build. |
[QUOTE=rroonnaalldd;284769]Minor cosmetics (i*2) to ((i)*2) in v1.051.[/QUOTE]
I must be missing something. These are the only changes I can find between 1.05 and 1.051: [code]Compare: (<)C:\GeneferCUDA test\genefercuda.1.051\GeneferCUDA.cu (37817 bytes) with: (>)C:\GeneferCUDA test\genefercuda.1.05\GeneferCUDA.cu (37811 bytes) 137d137 < 476,477c475 < } < while (f != 0); --- > } while (f != 0); 898,899c896,897 < } < while (maxError > ErrThreshold); --- > > } while (maxError > ErrThreshold); 1121c1119,1120 < while (fgets(str, 132, fp) != NULL) --- > > while (fgets(str, 132, fp) != NULL) 1267d1266 < 1271,1275d1269 < < < < < [/code] I don't see anything about (i*2). Also, the only place I can find (i*2), i is a variable, not a macro parameter, so I don't understand the benefit of changing it to ((i)*2). |
Shoichiro,
I just completed testing 4000^2097152+1 with older code (0.99) that uses SHIFT=8. Good news: the residual is the same as the one that came from the 1.04 version with the new SHIFT logic. Mike |
I'm hoping there's a simple answer to this question.
How do I convert an array of integers (which represent a very very large integer) to and from the complex arrays that are the input and output from FFTsquareFFT? Here's the background: Once upon a time (say, when I was in high school some 35 years ago), I understood and really, really enjoyed things like Fourier transforms. Having not ever, once, had to actually use a Fourier transform since then, it's not difficult to imagine that I've forgotten nearly everything about them! Until, lo and behold, I find myself needing to do multiplications on astoundingly big numbers. So, here's the problem. Genefer, at the beginning of its test, starts by computing the value of b^N by repeatedly taking the square of b. The code in Genefer worked fine, for years, until some fool (that fool would be me, of course) had the audacity to go and discover that 25898^524288+1 was a prime number. So, of course, that sent us all thinking about finding even bigger primes with Genefer. So far, so good. Except that I'm a programmer by training, so I started tinkering with Genefer. First thing I did was convert GeneferCUDA to run under BOINC, so we can get lots and lots of people running it (more people are willing to run Boinc than PRPNet. That's a shame; I like PRPNet better. Mark rocks.) Along the way, I found that at N=4194304 Genefer takes a very long time to start up. Looking at the code, I discovered that the really simple code at the beginning that computes the actual value of b^N, by repeatedly squaring b, takes a very long time when N=4194304. About 2 hours on my Core2Quad. Now, I could live with it taking 2 hours. After all, the actual primality test takes about 200 hours on a GTX 460. But, I know it can be done faster. It would take me, for example, a few minutes to write the code to use the gwnum library to do it, and that could do the calculations very quickly. Of course, it's silly to link the gwnum library into a program that's already doing math like that on the GPU. 
The programmer in me won't let me accept that 2 hours of processing when I know it can be eliminated. Which brings me to the part about my having forgotten what I learned about Fourier transforms 35 years ago. GeneferCUDA already has the code in it to do all the real work to do the square. All I have to do is call FFTsquareFFT. It does a forward FFT, calls a CUDA kernel to do the complex multiply on the FFT result, and does an inverse FFT. The result is the square of the input. The problem is, I'm not sure how to convert the input number (an array of integers) into the complex array that's the input to the FFT, nor how to get the resulting number back out. I might figure it out eventually, but I thought I'd ask because there's probably an incredibly simple answer. Thanks, Mike |
Hi ,rroonnaalldd
[QUOTE=rroonnaalldd;284769]Do you have an idea, why your source produces a 100% load on one cpu-core? [/QUOTE] Please see. [url]http://mersenneforum.org/showpost.php?p=284251&postcount=86[/url] [url]http://mersenneforum.org/showpost.php?p=284284&postcount=87[/url] [url]http://mersenneforum.org/showpost.php?p=284292&postcount=88[/url] [url]http://mersenneforum.org/showpost.php?p=284300&postcount=89[/url] |
Hi ,
[QUOTE=AG5BPilot;284813]How do I get convert an array of integers (which represent a very very large integer) to and from the complex arrays that are the input and output from FFTsquareFFT? [/QUOTE] I think the right way is to implement a library. [url]http://mersenneforum.org/showpost.php?p=248716&postcount=63[/url] Or something like this (GMP?). llrCUDA has the same problem. |
[QUOTE]Please see.[/QUOTE] Thanks for the links.
|
Hi ,
[QUOTE=rroonnaalldd;284769]Do you know what causes the underlined differences in err-rates between the 32bit- and 64bit-app? [/QUOTE] cudatoolkit_4.0.17_linux_32_ubuntu10.10.run gpucomputingsdk_4.0.17_linux.run devdriver_4.1_linux_32_285.05.23.run with GTX-550Ti [code] genefercuda.1.051$ ./GeneferCUDA.cuda4.0.Linux32 -b GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1 Copyright (C) 2001-2003, Yves Gallot (v1.3) Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1) Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA) A program for finding large probable generalized Fermat primes. Generalized Fermat Number Bench 2009574^8192+1 Time: 378 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 384 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 437 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 587 us/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 817 us/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 1.33 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 2.58 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 5.25 ms/mul. Err: 1.64e-01 5946413 digits 380742^2097152+1 Time: 10.6 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 22.7 ms/mul. Err: 1.56e-01 23028076 digits 251196^8388608+1 Time: 46.5 ms/mul. Err: 1.56e-01 45298590 digits [/code] |
Hi ,
[QUOTE=msft;284826]cudatoolkit_4.0.17_linux_32_ubuntu10.10.run gpucomputingsdk_4.0.17_linux.run devdriver_4.1_linux_32_285.05.23.run with GTX-550Ti [/QUOTE] I have installed the same Cuda41rc2-driver and the older Cuda40-sdk. The only difference is my smaller GTS450... [add] All your error-rates seem to be the same except: [QUOTE]309258^4194304+1 Time: 22.7 ms/mul. Err: 1.56e-01 23028076 digits 251196^8388608+1 Time: 46.5 ms/mul. Err: 1.56e-01 45298590 digits[/QUOTE] I have: [QUOTE]309258^4194304+1 Time: 35.8 ms/mul. Err: 4.07e-01 23028076 digits 251196^8388608+1 Time: 73.2 ms/mul. Err: 4.33e-01 45298590 digits[/QUOTE] |
[QUOTE=AG5BPilot;284813]
Along the way, I found that at N=4194304 Genefer takes a very long time to start up. Looking at the code, I discovered that the really simple code at the beginning that computes the actual value of b^N, by repeatedly squaring b, takes a very long time when N=44194304. About 2 hours on my Core2Quad.[/QUOTE] [QUOTE=msft;284818] Or something like this(GMP ?).[/QUOTE] Use the GMP, Luke :smile: [CODE]? N=222222^4194304; ? ## *** last result computed in 2,749 ms.[/CODE] That's 2.7 seconds on a 32-bit build of PARI (w/ GMP) running on Core 2 duo 2Ghz. A 64-bit build will be much faster. It can probably be made even faster, by removing the 2's from the (even) base, and just doing the relevant number of squarings in the PRP loop. |
[QUOTE=axn;284842]Use the GMP, Luke :smile:
[/QUOTE]I like this joke.:smile: Doesn't it have a licence issue? |
1 Attachment(s)
Ver 1.06
Implement GMP. |
gmp.h missing...
Solved by using apt-cache search gmp... |
You have to install the external GMP software yourself :smile:
|
Got it working and did a comparison between v1.051 and 1.060:
[QUOTE]boinc@vmware2k-3:~/Cuda/test$ time ./GeneferCUDA_cudart_cuda40_64bit -b 2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 713 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 777 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 8.21 ms/mul. Err: 1.72e-01 5946413 digits 380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 36.8 ms/mul. Err: 1.56e-01 23028076 digits 251196^8388608+1 Time: 74.7 ms/mul. Err: 1.56e-01 45298590 digits real 34m31.539s user 14m8.597s sys 20m12.152s[/quote] [quote]boinc@vmware2k-3:~/Cuda/test$ time ./GeneferCUDA_gmp_cuda40_64bit -b 2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits 1632282^16384+1 Time: 714 us/mul. Err: 2.53e-01 101791 digits 1325824^32768+1 Time: 777 us/mul. Err: 2.03e-01 200622 digits 1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits 874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits 577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits 468750^1048576+1 Time: 8.21 ms/mul. Err: 1.72e-01 5946413 digits 380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits 309258^4194304+1 Time: 36.8 ms/mul. Err: 1.56e-01 23028076 digits 251196^8388608+1 Time: 74.7 ms/mul. Err: 1.56e-01 45298590 digits real 34m31.597s user 14m10.249s sys 20m10.672s [/QUOTE] |
[QUOTE=rroonnaalldd;285090]Got it working and did a comparison between v1.051 and 1.060:[/QUOTE]
"-b" option not use GMP. Please try normal test. By the way. [url]http://www.primegrid.com/forum_thread.php?id=3902&nowrap=true#46251[/url] AG5BPilot make non GMP version. We can wait his work.:smile: |
[QUOTE=msft;285194]"-b" option not use GMP.
Please try normal test. By the way. [url]http://www.primegrid.com/forum_thread.php?id=3902&nowrap=true#46251[/url] AG5BPilot make non GMP version. We can wait his work.:smile:[/QUOTE] Hopefully it will work. :) BTW, you can add this line into the residual check: [code] check( 1248,4194304, "8f985a974820a6d3");[/code] |
1 Attachment(s)
Ver 1.061
Add: [code] check( 1248,4194304, "8f985a974820a6d3"); [/code] |
In preparation for assaulting the higher N numbers, I made some changes to the calculation of b^N at the beginning of the check() function. Here's the new beginning of that function, along with a few new helper functions it uses:
This is the OLD code: [code]void check(const Int32 b, const UInt32 m, char *expectedResidue) { ~~snip~~ if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart)) { fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart); //1.00 fflush(STDOUT); } else { Nlen = 1; Na = (UInt32 *)myAlloc(1 * sizeof(UInt32)); Na[0] = (UInt32)b; for (j = m; j != 1; j /= 2) { a = (UInt32 *)myAlloc(2 * Nlen * sizeof(UInt32)); for (i = 0; i != Nlen; ++i) a[i] = 0; for (i = 0; i != Nlen; ++i) { BOINC_GPU_CHECK_FOR_ABORT(true); // should be good spot for abort check. Runs sqrt of total times a[i + Nlen] = mul_1_add_n(&a[i], Na[i], Na, Nlen); } myFree(Na); Na = a; Nlen *= 2; if (Na[Nlen - 1] == 0) --Nlen; } iStart = iOrigStart = lg(Na, Nlen) - 1; fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart); fflush(STDOUT); }[/code] This is the new code and helper functions: [code]// transfer input int array into complex array for fft __global__ void fftIn_kernel(cufftDoubleComplex * const g_fftIn, const UInt32 * const g_intArray, const UInt32 len) { int threadID = blockIdx.x * blockDim.x + threadIdx.x; g_fftIn[threadID*2].y = g_fftIn[threadID*2+1].y = 0.0; if (threadID < len) { g_fftIn[threadID*2].x = (double) (g_intArray[threadID] & 0xFFFF); g_fftIn[threadID*2+1].x = (double) (g_intArray[threadID] >> 16); } else g_fftIn[threadID*2].x = g_fftIn[threadID*2+1].x = (double) 0.0; } // Compute the size of the FFT array for compute_square // round up to the nearest power of 2, minimum 128 UInt32 fftSize(const UInt32 len) { UInt32 size = lpt(len); return size >= 128 ? size : 128; } // In order to compute the value of b^N, we need to repeatedly square b. 
// Use cuFFT libraries to do the square // The resulting value, the address of which is returned by this function // will be twice as long as the input (i.e., 2*Nlen) // This function allocates memory for the result -- the calling routine has // responsibility to eventually free that memory. // // Try reducing bits per element to 16 -- maybe we have a loss of precision // This means doubling the size of the FFT // Yes, this was the problem -- so do fft's using 16 bit unsigned ints in doubles rather than 32 bits. // UInt32 * computeSquare(const UInt32 * Na, const UInt32 Nlen) { cufftDoubleComplex * g_fftIn; // FFT input array cufftDoubleComplex * g_fftOut; // FFT Output array cufftDoubleComplex * fftOut; // final output (easier to do this on CPU because of carry propagation) UInt32 * g_intArray; // input/output integer array on the GPU UInt32 * a; // imaginatively named pointer to the result UInt32 fftLen = 2*fftSize(2*Nlen); // Size of the FFT -- NLen*2 rounded up to the next power of 2 cufftHandle plan; // the FFT plan UInt32 i, carry; // index counter and carry UInt64 l; // temporary 64 bit unsigned int for preserving carry // allocate GPU memory for the FFT in and out (Complex double, 2*Nlen, rounded up to a good FFT size) cutilSafeCall(cudaMalloc((void**)&g_fftIn, sizeof(*g_fftIn)*fftLen)); cutilSafeCall(cudaMalloc((void**)&g_fftOut, sizeof(*g_fftOut)*fftLen)); // allocate GPU memory to load the integer array (UInt32, same count as the fft array) cutilSafeCall(cudaMalloc((void**)&g_intArray, sizeof(*g_intArray)*fftLen/2)); // Copy Na into the device memory cutilSafeCall(cudaMemcpy(g_intArray, Na, sizeof(*g_intArray)*Nlen, cudaMemcpyHostToDevice)); // invoke kernel to build FFT input fftIn_kernel<<< (fftLen/2)/128, 128 >>> (g_fftIn, g_intArray, Nlen); // Create the FFT plan cufftSafeCall(cufftPlan1d(&plan, fftLen, CUFFT_Z2Z, 1)); // Do the forward FFT cufftSafeCall(cufftExecZ2Z(plan, g_fftIn, g_fftOut, CUFFT_FORWARD)); // invoke kernel to square the resulting 
complex matrix mul_kernel<<< fftLen/128,128 >>> (g_fftOut); // Do the reverse FFT (from 'out' back to 'in') cufftSafeCall(cufftExecZ2Z(plan, g_fftOut, g_fftIn, CUFFT_INVERSE)); // allocate memory for the result (UInt32, 2*Nlen) a = (UInt32 *)myAlloc(sizeof(*a)*2*Nlen); // allocate memory for output and copy the result from GPU fftOut = (cufftDoubleComplex *)myAlloc(sizeof(*fftOut)*fftLen); cutilSafeCall(cudaMemcpy(fftOut, g_fftIn, sizeof(*fftOut)*fftLen, cudaMemcpyDeviceToHost)); // Now extract the result from the array of complex variables // Slower than doing this on the gpu, but much easier on host because of carry propagation. // It's only run log n times, so it should be insignificant for(i = 0, carry = 0; i < 2*Nlen; i++) { l = (UInt64) carry + (UInt64) round(fftOut[i*2].x / (double) fftLen) + (((UInt64) round(fftOut[i*2+1].x / (double) fftLen)) << 16); a[i] = (UInt32) l; // lower 32 bits carry = (UInt32) (l >> 32); } // free the 3 device memory arrays cutilSafeCall(cudaFree((char *)g_fftIn)); cutilSafeCall(cudaFree((char *)g_fftOut)); cutilSafeCall(cudaFree((char *)g_intArray)); myFree((void*)fftOut); // Destroy the FFT plan cufftSafeCall(cufftDestroy(plan)); return a; } void check(const Int32 b, const UInt32 m, char *expectedResidue) { ~~snip~~ if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart)) { fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart); //1.00 fflush(STDOUT); } else { Nlen = 1; Na = (UInt32 *)myAlloc(1 * sizeof(UInt32)); Na[0] = (UInt32)b; for (j = m; j != 1; j /= 2) { a = computeSquare(Na, Nlen); // compute_square will allocate "a" myFree(Na); Na = a; Nlen *= 2; if (Na[Nlen - 1] == 0) --Nlen; } iStart = iOrigStart = lg(Na, Nlen) - 1; fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart); fflush(STDOUT); }[/code] Also, the mul_1_add_n() function is no longer used and can be safely deleted. 
For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second. Full source and Windows executables will be available shortly. |
Source and executable can be downloaded here: [url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03beta.7z]GeneferCUDA-boinc.1.03beta.7z[/url]
I expected this to be released into production on PrimeGrid's Boinc server, but I want to do more testing first. This source is 100% compatible with Shoichiro's code up through Shoichiro's v1.049. Modifications from 1.050 onward have NOT been incorporated into my source. It can operate under Boinc, under PRPNet, or standalone. It's currently built for Windows, but there are only a handful of places where the code is incompatible with Linux. (Setting Windows thread priority and variable arguments in a logging function are the two things I remember.) |
[QUOTE=AG5BPilot;285590]
Also, the mul_1_add_n() function is no longer used and can be safely deleted. For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second.[/QUOTE] Now that Na computation is fast, there is probably no need for it to be written to/read from checkpoint. Doesn't the computesquare call need BOINC_START_GPU/END_GPU? |
[QUOTE=axn;285687]Now that Na computation is fast, there is probably no need for it to be written to/read from checkpoint.[/quote]
A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100. :( The extra information in the checkpoint file doesn't make that much of a difference. I'm doing that calculation using each element as a 16 bit unsigned integer, rather than 32 bits, because if you square two 32 bit ints, your result is 64 bits long -- which will NOT fit in a 64 bit float, whose mantissa is only 53 bits (the rest goes to the sign and exponent). However, this made the FFT have twice as many elements, which is itself causing problems. [quote]Doesn't the computesquare call need BOINC_START_GPU/END_GPU?[/QUOTE] It might, if those calls actually had any code in them anymore. Those calls, which are supposed to be boinc critical sections, would actually prevent the task from ever shutting down. Boinc's critical section code appears to be horribly broken. It's only useful when you're protecting a tiny percentage of your code within the critical section. In our case, where only a tiny percentage ISN'T protected (since we spend almost all our time in the GPU), the BOINC client -- which effectively talks to the app via polling, not interrupts (at least as far as the critical section code is concerned) -- never gets the signal through unless it's unbelievably lucky. BTW, I'm going to pull that build off the website until I find a fix for the problem. At N=4194304, it would produce incorrect results without any indication that the result is wrong. If anyone wants that source, let me know. |
[QUOTE=AG5BPilot;285706]A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100.[/QUOTE]
You can use a Karatsuba-style multiplication using 3 half-sized FFT multiplies for the final iteration. [url]http://en.wikipedia.org/wiki/Karatsuba_algorithm#The_basic_step[/url] [Note that, while the wiki example is X & Y, when you need to square, all the underlying mults also become squares.] Also, you can gain a little bit more by dividing out all the twos from the 'b'. Eg:- 1248^4194304 = 39^4194304 * (2^5)^4194304 (1248 = 39*2^5). So you only need to do the FFT to calculate 39^4194304, and multiplication by (2^5)^4194304 is just shifting. Definitely getting more complex, isn't it? :smile: Well, good luck. |
Hi ,
I retried "Reduce CPU Time". [code] if(n1 > 16384) ithreads = 128; else ithreads = 16; for (i = iStart; i != 0; --i) { FFTsquareFFT(z, n1); bt = Na[i / (8 * sizeof(UInt32))] >> (i % (8 * sizeof(UInt32))) & 1; FFTnextStepGFN(b,invBase,(bt == 0) ? t1 : t2,n1,t3,ithreads,SHIFT); if(i % 0xF == 0) cutilSafeCall(cudaMemcpy(l_maxErr,g_maxErr,sizeof(float),cudaMemcpyDeviceToHost)); if (!(i & 0xffff)) { printf("\rTesting %d^%u+1... %d steps to go ", b, m, i); fflush(stdout); cutilSafeCall(cudaMemcpy(l_maxErr,g_maxErr,sizeof(float),cudaMemcpyDeviceToHost)); //1.00 [/code] Ver 1.061: [code] 24518^262144+1 is a probable prime. (1150678 digits) (err = 0.0002) (time = 1:27:48) 08:02:55 2347.29user 2912.59system 1:27:47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 336inputs+238616outputs (0major+13265minor)pagefaults 0swaps [/code] This code version: [code] 24518^262144+1 is a probable prime. (1150678 digits) (err = 0.0002) (time = 1:29:02) 06:35:07 215.20user 25.99system 1:29:02elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+238528outputs (0major+11867minor)pagefaults 0swaps [/code] 1% slower, but 95% less CPU time. |
A 1% increase in time in exchange for an almost free CPU? Nice.
|
[QUOTE=axn;285707]
Definitely getting more complex, isn't it? :smile: Well, good luck.[/QUOTE] It's actually a heck of a lot simpler than I imagined. First time I looked at this my first thought was "It just can not be that simple. I have to be missing something." I was very surprised when the right numbers started coming out. It's very much like making a smoothie in a blender, reversing the power plug, and having whole fruit emerge. :) Thanks for the advice, and the luck! |
About 4 different ideas later, I think I've fixed 1.03. I'm going to test it more thoroughly and THEN post the code.
|
[url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03a.7z]geneferCUDA-boinc.1.03a.7z[/url]
This has working code for the fast initialization. In the end, I sacrificed speed for precision. It's a whopping one and a half seconds slower than 1.03, but it has the slight advantage of actually working. I reduced the FFT element size to 8 bits, down from 16, which gave me tons of room to avoid rounding errors, but did push the FFT size above 8 million. I guess it's a good thing GPUs are pretty fast! |
Linux needs the #127 patch for the 100% CPU time issue.
Windows does not need the #127 patch. |
[QUOTE=msft;286136]Linux need #127 patch with CPU TIME 100% issue.
Win not need #127 patch.[/QUOTE] Shoichiro, I don't understand why your fix would work. Is the only change you made to add an extra DeviceToHost copy of maxErr? But, I did see that the init_device() routine makes 4 cuda device calls, and unlike the rest of the software, it does not use cutilSafeCall() to check for errors. In particular, I think this call may be important: [code]cudaSetDeviceFlags(cudaDeviceBlockingSync);[/code] If that were to fail for any reason, CUDA would act as if you were in cudaDeviceScheduleSpin mode, and that would cause the 100% CPU utilization that you're seeing. However, I do not understand why that call would be failing. I would suggest adding cutilSafeCall() to the API calls inside init_device() function. Seems like a good thing to do regardless of whether this is related to the 100% CPU core problem. I'm adding that to my version of the code. Mike |
Hi, AG5BPilot
[QUOTE=AG5BPilot;286150] I don't understand why your fix would work. Is the only change you made to add an extra DeviceToHost copy of maxErr? [/QUOTE] Yes. I think this code triggers it. [QUOTE=AG5BPilot;286150] But, I did see that the init_device() routine makes 4 cuda device calls, and unlike the rest of the software, it does not use cutilSafeCall() to check for errors. In particular, I think this call may be important: [code]cudaSetDeviceFlags(cudaDeviceBlockingSync);[/code] If that were to fail for any reason, CUDA would act as if you were in cudaDeviceScheduleSpin mode, and that would cause the 100% CPU utilization that you're seeing. However, I do not understand why that call would be failing. [/QUOTE] I think, if that were to fail, the "extra DeviceToHost copy" should not work either. [QUOTE=AG5BPilot;286150] I would suggest adding cutilSafeCall() to the API calls inside init_device() function. Seems like a good thing to do regardless of whether this is related to the 100% CPU core problem. I'm adding that to my version of the code. [/QUOTE] The results did not change. Anyway, a strange phenomenon. |
Hi,
Versions 1.06, 1.061, and 1.062 have a fatal error on 32-bit OSes. I introduced a bug in the GMP code. Do not use them on a 32-bit OS. Thank you, |
[QUOTE=msft;287104]Hi,
Versions 1.06, 1.061, and 1.062 have a fatal error on 32-bit OSes. I introduced a bug in the GMP code. Do not use them on a 32-bit OS. Thank you,[/QUOTE] Which bug? I am getting no error here or in llrCUDA. I have Lubuntu 11.10 with kernel 3.0.0-15.26 as my compiling platform. |
Hi,
llrCUDA was OK. [code] m_b = b; mpz_init_set_ui(m_Na,m_b); for(j = m; j != 1; j/=2) mpz_mul(m_Na,m_Na,m_Na); m_Na_size = mpz_size(m_Na); m_Na_size_byte = m_Na_size*sizeof(long int); m_a = (long int*) malloc(m_Na_size*sizeof(long int)); mpz_export(m_a,NULL,0,m_Na_size_byte,0,0,m_Na); [/code] Sometimes m_a[m_Na_size-1]==0 on linux32. [code] if(m_a_32[m_a_32_len-1]== 0) m_a_32_len--; [/code] This code is not enough. |
Hi,
mini test program. [code] #include <stdio.h> #include <gmp.h> int main() { mpz_t m; int i,j; unsigned long long e_m[10]; mpz_init_set_ui(m,2); for(j = 0; j != 8; j++) { mpz_mul(m,m,m); for(i=0;i<10;i++)e_m[i]=0; mpz_export(e_m,NULL,0,80,0,0,m); printf(" mpz_size(m)=%d ",(int) mpz_size(m)); for(i=9;i>=0;i--) printf(" %llx",e_m[i]); printf("\n"); } mpz_clear(m); } [/code] linux32: [code] $ ./a.out mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000 mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000 mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0 mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0 mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0 [/code] linux64: [code] $ ./a.out mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100000000 mpz_size(m)=2 0 0 0 0 0 0 0 0 1 0 mpz_size(m)=3 0 0 0 0 0 0 0 1 0 0 mpz_size(m)=5 0 0 0 0 0 1 0 0 0 0 [/code] |
Hmm, I couldn't get it to compile:
[QUOTE]boinc@Lubuntu32:~/Cuda$ gcc a a: In function `main': a.c:(.text.startup+0x25): undefined reference to `__gmpz_init_set_ui' a.c:(.text.startup+0x3c): undefined reference to `__gmpz_mul' a.c:(.text.startup+0x110): undefined reference to `__gmpz_export' a.c:(.text.startup+0x2ba): undefined reference to `__gmpz_clear' collect2: ld returned 1 exit status [/QUOTE] |
[QUOTE=rroonnaalldd;287485]Hmm, got it not compiled:[/QUOTE]
Add -lgmp |
[code]
mpz_mul(m_Na,m_Na,m_Na); m_Na_size = (mpz_sizeinbase(m_Na,2)+sizeof(long long int)*8-1)/(sizeof(long long int)*8); m_Na_size_byte = m_Na_size*sizeof(long long int); [/code] It's the only answer. |
[QUOTE=rogue;287486]Add -lgmp[/QUOTE]
:redface: Lubuntu11.10-32 [QUOTE]boinc@Lubuntu32:~/Cuda$ ./a32.out mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000 mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000 mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0 mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0 mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0 [/QUOTE] DotschUX-64 based on Ubuntu8.10 [QUOTE]boinc@vmware2k-3:~/Cuda$ ./a32.out mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000 mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000 mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0 mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0 mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0 [/QUOTE] [quote]boinc@vmware2k-3:~/Cuda$ ./a64.out mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000 mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100000000 mpz_size(m)=2 0 0 0 0 0 0 0 0 1 0 mpz_size(m)=3 0 0 0 0 0 0 0 1 0 0 mpz_size(m)=5 0 0 0 0 0 1 0 0 0 0 [/quote] |
Does anyone have any idea of the relative time spent in the "fft square" routine vs. the "next step" kernels in genefer?
|
[QUOTE=axn;305570]Does anyone have any idea of the relative time spent in the "fft square" routine vs. the "next step" kernels in genefer?[/QUOTE]
I used Nvidia's visual profiler to check the usage. The single biggest consumer of GPU time was the transpose kernel. However, I don't remember right now what the exact conditions of the test were -- for example, I may have been looking more at the initialization code than at the genefer algorithm -- so I wouldn't put too much weight on those results. But you should be able to get the information you want using their tools. |
[QUOTE=AG5BPilot;305585]But you should be able to get the information you want using their tools.[/QUOTE]
I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test? |
[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]
Not sure when I'll have time, so that's a definite maybe. |
[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]
I figured it was easier, so I put timing code into Genefer's benchmarks: [quote]C:\GeneferCUDA test\GeneferTest>genefercudatest -b genefercuda 2.3.1-0 (Windows x86 CUDA 3.2) Copyright 2001-2003, Yves Gallot Copyright 2009, Mark Rodenkirch, David Underbakke Copyright 2010-2012, Shoichiro Yamada, Ken Brazier Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider Command line: genefercudatest -b Generalized Fermat Number Bench FFTsquareGFN=53.46% FFTnextStepGFN=46.54% (raw: 0.0850 0.0740 seconds) 2009574^8192+1 Time: 159 us/mul. Err: 0.1719 51636 digits FFTsquareGFN=55.00% FFTnextStepGFN=45.00% (raw: 0.0880 0.0720 seconds) 1632282^16384+1 Time: 161 us/mul. Err: 0.1563 101791 digits FFTsquareGFN=48.95% FFTnextStepGFN=51.05% (raw: 0.0930 0.0970 seconds) 1325824^32768+1 Time: 212 us/mul. Err: 0.1563 200622 digits FFTsquareGFN=32.93% FFTnextStepGFN=67.07% (raw: 0.0810 0.1650 seconds) 1076904^65536+1 Time: 339 us/mul. Err: 0.1602 395325 digits FFTsquareGFN=49.48% FFTnextStepGFN=50.52% (raw: 0.0950 0.0970 seconds) 874718^131072+1 Time: 601 us/mul. Err: 0.1563 778813 digits FFTsquareGFN=43.55% FFTnextStepGFN=56.45% (raw: 0.2530 0.3280 seconds) 710492^262144+1 Time: 1.05 ms/mul. Err: 0.1641 1533952 digits FFTsquareGFN=29.27% FFTnextStepGFN=70.73% (raw: 0.3990 0.9640 seconds) 577098^524288+1 Time: 1.98 ms/mul. Err: 0.1875 3020555 digits FFTsquareGFN=21.41% FFTnextStepGFN=78.59% (raw: 0.6850 2.5150 seconds) 468750^1048576+1 Time: 3.97 ms/mul. Err: 0.1563 5946413 digits FFTsquareGFN=32.74% FFTnextStepGFN=67.26% (raw: 1.8470 3.7940 seconds) 380742^2097152+1 Time: 8.23 ms/mul. Err: 0.1484 11703432 digits FFTsquareGFN=25.51% FFTnextStepGFN=74.49% (raw: 3.4360 10.0320 seconds) 309258^4194304+1 Time: 16.5 ms/mul. Err: 0.1719 23028076 digits FFTsquareGFN=21.27% FFTnextStepGFN=78.73% (raw: 6.4150 23.7390 seconds) 100^8388608+1 Time: 33.7 ms/mul. Err: 0.0000 16777217 digits[/quote] The timings from N=262144 and above seem fairly consistent between runs. |
[QUOTE=AG5BPilot;305807]I figured it was easier, so I put timing code into Genefer's benchmarks:[/QUOTE]
Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(n log n) and carry propagation (nextstep) is O(n), so you'd expect that as n grows, the FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused: |
[QUOTE=axn;305881]Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(n log n) and carry propagation (nextstep) is O(n), so you'd expect that as n grows, the FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused:[/QUOTE]
The numbers are definitely not reversed. It's possible it's an instrumentation anomaly, for example if either of the two routines' run time is close to the clock step frequency of 1 ms you might get odd results with ONE of the benchmarks, but that doesn't appear to be what's happening. Here's the relevant code (everything except the declarations): [code]SETCLOCK(clock1); FFTsquareGFN(z, n1, n2); squareTime += elapsedTime(clock1); bt = Na[i / (8 * sizeof(uint32_t))] >> (i % (8 * sizeof(uint32_t))) & 1; SETCLOCK(clock1); FFTnextStepGFN(z, b, (bt == 0) ? t1 : t2, n1, n2, t3); nextTime += elapsedTime(clock1); ... if (squareTime+nextTime==0.0) squareTime=nextTime=1.0; printf("\n FFTsquareGFN=%.2f%% FFTnextStepGFN=%.2f%% (raw: %.4f %.4f seconds)\n", 100*squareTime/(squareTime+nextTime), 100*nextTime/(squareTime+nextTime), squareTime, nextTime);[/code] |
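One effect worth ruling out with host-side timers like these: CUDA kernel launches return control to the CPU before the kernel finishes, so a wall clock stopped right after FFTsquareGFN() can charge part of its run time to whatever is timed next. CUDA events synchronize before the clock is read — a sketch (the phase-wrapper callback is hypothetical, not genefer's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Time one GPU phase with CUDA events so asynchronous launches are
// measured correctly: cudaEventSynchronize waits for the recorded work
// to finish before the elapsed time is read.
static float timePhase(void (*phase)(void))
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    phase();                      // e.g. a wrapper that calls FFTsquareGFN
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // block until the phase has completed
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```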
[QUOTE=AG5BPilot;305882]The numbers are definitely not reversed.[/QUOTE]
I know. It was one of those idle hopes :smile: The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time. |
1 Attachment(s)
[QUOTE=axn;305884]The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time.[/QUOTE]
A GT 520's performance will probably be in the vicinity of what you can get on your CPU running Genefx64. The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer. It would be interesting to see if you see the same behavior in the CPU versions. |
[QUOTE=AG5BPilot;305925]A GT 520's performance will probably be in the vicinity of what you can get on your CPU running Genefx64.
The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer. It would be interesting to see if you see the same behavior in the CPU versions.[/QUOTE] I built GenefX64 under linux and the numbers were going in the right direction. Unfortunately, I could only run it up to 1M, since the initialization was taking longer and longer. I'll hack it a bit to overcome that and post the full details. Summary is: it starts out at roughly 50:50, but by 1M, it is 80:20 (sq:nextstep). With each increase in N, square tends to be more and more dominant factor. The GPU numbers are definitely an anomaly. |
Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but does yield more accurate measurements:
[quote]C:\GeneferCUDA test\GeneferTest>genefercudaTest.exe -b genefercuda 2.3.1-0 (Windows x86 CUDA 3.2) Copyright 2001-2003, Yves Gallot Copyright 2009, Mark Rodenkirch, David Underbakke Copyright 2010-2012, Shoichiro Yamada, Ken Brazier Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider Command line: genefercudaTest.exe -b Generalized Fermat Number Bench FFTsquareGFN=46.32% FFTnextStepGFN=53.68% (raw: 0.2520 0.2920 seconds) 2009574^8192+1 Time: 546 us/mul. Err: 0.1719 51636 digits FFTsquareGFN=47.40% FFTnextStepGFN=52.60% (raw: 0.2830 0.3140 seconds) 1632282^16384+1 Time: 597 us/mul. Err: 0.1563 101791 digits FFTsquareGFN=47.75% FFTnextStepGFN=52.25% (raw: 0.3190 0.3490 seconds) 1325824^32768+1 Time: 668 us/mul. Err: 0.1563 200622 digits FFTsquareGFN=48.98% FFTnextStepGFN=51.02% (raw: 0.4310 0.4490 seconds) 1076904^65536+1 Time: 880 us/mul. Err: 0.1602 395325 digits FFTsquareGFN=61.86% FFTnextStepGFN=38.14% (raw: 0.6650 0.4100 seconds) 874718^131072+1 Time: 1.08 ms/mul. Err: 0.1563 778813 digits FFTsquareGFN=59.12% FFTnextStepGFN=40.88% (raw: 0.9820 0.6790 seconds) 710492^262144+1 Time: 1.66 ms/mul. Err: 0.1641 1533952 digits FFTsquareGFN=63.43% FFTnextStepGFN=36.57% (raw: 1.7070 0.9840 seconds) 577098^524288+1 Time: 2.69 ms/mul. Err: 0.1875 3020555 digits FFTsquareGFN=65.57% FFTnextStepGFN=34.43% (raw: 3.2610 1.7120 seconds) 468750^1048576+1 Time: 4.97 ms/mul. Err: 0.1563 5946413 digits FFTsquareGFN=74.32% FFTnextStepGFN=25.68% (raw: 6.7420 2.3290 seconds) 380742^2097152+1 Time: 9.07 ms/mul. Err: 0.1484 11703432 digits FFTsquareGFN=75.52% FFTnextStepGFN=24.48% (raw: 13.3390 4.3230 seconds) 309258^4194304+1 Time: 17.7 ms/mul. Err: 0.1719 23028076 digits FFTsquareGFN=78.22% FFTnextStepGFN=21.78% (raw: 27.3330 7.6120 seconds) 100^8388608+1 Time: 34.9 ms/mul. Err: 0.0000 16777217 digits[/quote] |
[QUOTE=AG5BPilot;306037]Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but does yield more accurate measurements:[/QUOTE]
Phew. That's more like it! It is in line with the CPU figures. Anyway, once I have my GT520, I'll do some visual profiling and see if any improvement can be wrung out of the code. But I'm not holding my breath for any big improvements. |
Is genefercuda correct?
I tried genefercuda on a GTX 680 NVIDIA GPU with an AMD FX8120 CPU and 32 GB RAM for
the Fermat number F19, which is known to be composite, with b=2 and N=524288=2^19, and the program prints out that it is probably prime. Can anyone else try this out? A... |
[QUOTE=prime7989;314331]I tried genefercuda on a GTX 680 NVIDIA GPU with an AMD FX8120 CPU and 32 GB RAM for
the Fermat number F19, which is known to be composite, with b=2 and N=524288=2^19, and the program prints out that it is probably prime. Can anyone else try this out? A...[/QUOTE] genefercuda is running a PRP test in base 2 (just checked the source), which is fine for checking any GF(b), b>2 (and not a power of 2, which would be pointless). Of course, all Fermat numbers are 2-PRPs. |
[QUOTE=Batalov;314332]genefercuda is running a PRP test in base 2 (just checked the source), which is fine for checking any GF(b), b>2 (and not a power of 2, which would be pointless).
Of course, all Fermat numbers are 2-PRPs.[/QUOTE] It should be relatively trivial to change the PRP-base (to say, 3). I will try it out later. Also, 2^524288 = 65536^32768. So a much smaller FFT may be used. |
[QUOTE=axn;314339]It should be relatively trivial to change the PRP-base (to say, 3). I will try it out later.
Also, 2^524288 = 65536^32768. So a much smaller FFT may be used.[/QUOTE] So, I tried this out... and you wouldn't believe what I found out: [CODE]65536^32768+1 is composite. (RES=0df151a4161b6c99) (157827 digits) (err = 0.0005) (time = 0:07:24)[/CODE] :shock: |
That's creepy! Don't say all the tests need to be redone? :ouch:
|
[QUOTE=LaurV;314489]That's creepy! Don't say all the tests need to be redone? :ouch:[/QUOTE]
I Bet your side.:smile: |
Oh, I forgot about this; my post was kind of a joke, based on the fact that the incriminated number is F19, which is known to be composite. If axn did a 3-PRP test it was normal for it to come out composite (it is like a Pepin test... squared; I might be wrong here, but this will never return a false result, i.e. Fx is 3-PRP if and only if it is prime).
|
base=3
[code] $ diff genefer.c genefer.c.ORG 1757c1757 < CONST double t1 = 2.0 / m, t2 = 3 * t1; --- > CONST double t1 = 2.0 / m, t2 = 2 * t1; 1781c1781 < z = FFTinitGFN(m, 3.0, &n1, &n2); --- > z = FFTinitGFN(m, 2.0, &n1, &n2); $ cc genefer.c -lm $ ./a.out -t GeneFer 1.3 (standard) Copyright (C) 2001-2002, Yves GALLOT A program for finding large probable generalized Fermat primes. Usage: GeneFer -b run bench GeneFer -t run test GeneFer <filename> test <filename> GeneFer use interactive mode 100050^16+1 is a probable prime. (err = 1.53e-05) 100014^32+1 is a probable prime. (err = 3.05e-05) 100234^64+1 is a probable prime. (err = 4.58e-05) 100032^128+1 is a probable prime. (err = 9.16e-05) 5684328^256+1 is a probable composite (err = 5.00e-01). 4619000^512+1 is a probable composite (err = 5.00e-01). 3752220^1024+1 is a probable composite (err = 5.00e-01). [/code] |
Aaah! msft is [URL="http://ru.wikipedia.org/wiki/%D0%9C%D0%BE%D1%80%D1%80%D0%B0"]Morra[/URL]!
(...I suddenly expect Morra defence in the Vote chess thread. No. I did expect it before. That's when I looked up Morra. Too late. ;-) ) "The Groke" on the other hand reminds me of Piet Hein's Grooks. [QUOTE] THE ROAD TO WISDOM? Well, it's plain and simple to express. Err and err and err again, but less and less and less.[/QUOTE] |
[QUOTE=axn;314396]So, I tried this out... and you wouldn't believe what I found out:
[CODE]65536^32768+1 is composite. (RES=0df151a4161b6c99) (157827 digits) (err = 0.0005) (time = 0:07:24)[/CODE]:shock:[/QUOTE] Dear axn, I dislike reading code. So please, which variable did you change to make it 3-PRP in GeneferCUDA.cu? If you could give me the line number of the function and variable, I would be very grateful! Thank you, prime7989 |
[QUOTE=prime7989;333674]Dear axn,
I dislike reading code. So please, which variable did you change to make it 3-PRP in GeneferCUDA.cu? If you could give me the line number of the function and variable, I would be very grateful! Thank you, prime7989[/QUOTE] There are many versions of G/CUDA out there. If you can post the one you want to look at (as a forum attachment), I can tell you. FWIW, the change is easy and localised to the "check" function. |
Extend GeneferCUDA pseudo-primality checks
1 Attachment(s)
[QUOTE=axn;333686]There are many versions of G/CUDA out there. If you can post the one you want to look at (as a forum attachment), I can tell you. FWIW, the change is easy and localised to "check" function.[/QUOTE]
Thank you axn,
I would also like to make the change to GeneferCUDA.cu such that the input, instead of n=b^N+1, is as follows: n=b^N+c, where c is a non-zero integer such that |c|<log(n). Thank you, Prime7989 |
[QUOTE=prime7989;333689]Thank you axn,
I would also like to make the change to GeneferCUDA.cu such that the input, instead of n=b^N+1, is as follows: n=b^N+c, where c is a non-zero integer such that |c|<log(n). Thank you, Prime7989[/QUOTE] For changing the base used for the test, you need to modify the two lines in the check() function from: [CODE]const double t1 = 1.0,t2 = 2.0,t3 = 2.0 / m; <snip> z = FFTinitGFN(m, 2.0, &n1,&SHIFT); [/CODE] to: [CODE]const double t1 = 1.0,t2 = prp_base,t3 = 2.0 / m; <snip> z = FFTinitGFN(m, prp_base, &n1,&SHIFT);[/CODE] As far as testing b^N+c, that is "above my paygrade", so to speak. :sad: |
c=-1?
Dear axn,
How about c=-1? That is, change the code in geneferCUDA so that the input goes from b^m+1 to b^m-1. prime7989 |
This is easy to do.
The whole code will become: [CODE]//...some headers // skip to page 10 j = m; while (j > 0) { if (j > 1 && j & 1) { fprintf(stderr, "Cannot test. m is not a power of 2.\n"); return; } j >>= 1; } printf("Testing %d^%u-1...\r", b, m); printf("\r%d^%u-1 is composite.", b, m); //... some footer [/CODE] |
2^p-1
Dear Batalov, axn,
Of course, looking at the code, I know my previous post was incomplete. What I meant was the following mods to geneferCUDA: 1) Specify the PRP base p as a #define macro or as a run-time input by the user (easy to do). Give the user a chance to run a version option: 2a) Option A: Make it such that it is a base 2, 3, 5-PRP test for b^m+1, where m is a power of 2. 2b) Option B: Make it such that it is a base-3 PRP test for b^q-1, where b=2 and q is a prime, so as to accommodate Mersenne numbers as well as GFNs. Does anyone have recent code of geneferCUDA.cu > ver 1.061, or is willing to modify the code with these improvements? And willing to post it? prime7989 |
GeNeFer (and geneferCUDA, which is its CUDA port) is written specifically to test Generalized Fermat Numbers. That is, m must be a power of two. Your question about changing the PRP test's base was already addressed, even though it was somewhat moot, because a prp_base=2 test is fine for any b>2, while for b=2 the (true Fermat) candidates have already been searched to death by many people before you.
For other applications, there are other programs, e.g. CUDALucas (prime m and c=-1). Why would someone bother to modify a program so that it would become another, already written, specialized program? If you want a universal program that does everything that other programs can do, then that's easy - make a script that would call the appropriate binary. That would be much more robust than keeping up with all disparate sources, putting them together and then recompiling when any component had changed. |
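The script approach can be as small as a case analysis on the candidate's form — a sketch (the program names and argument conventions below are placeholders, not a tested toolchain):

```shell
#!/bin/sh
# Print which specialized tester handles a candidate b^m+c.
# Program names are illustrative placeholders.
dispatch() {
    b=$1; m=$2; c=$3
    if [ "$c" = "1" ]; then
        echo "genefercuda $b $m"      # GFN b^m+1, m a power of two
    elif [ "$b" = "2" ] && [ "$c" = "-1" ]; then
        echo "cudalucas $m"           # Mersenne number 2^m-1
    else
        echo "pfgw b=$b m=$m c=$c"    # general-purpose fallback
    fi
}

dispatch 1000000 8192 1
dispatch 2 86243 -1
```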
[QUOTE=Batalov;335856]GeNeFer (and geneferCUDA, which is its CUDA port) is written specifically to test Generalized Fermat Numbers. That is, m must be a power of two. Your question about changing the PRP test's base was already addressed, even though it was somewhat moot, because a prp_base=2 test is fine for any b>2, while for b=2 the (true Fermat) candidates have already been searched to death by many people before you.
For other applications, there are other programs, e.g. CUDALucas (prime m and c=-1). Why would someone bother to modify a program so that it would become another, already written, specialized program? If you want a universal program that does everything that other programs can do, then that's easy - make a script that would call the appropriate binary. That would be much more robust than keeping up with all disparate sources, putting them together and then recompiling when any component had changed.[/QUOTE] GeneferCUDA is a base-2 Pepin's test as far as I can ascertain (please correct me if I am wrong). To allow Fermat number testing with the GTX TITAN we can make it a base-3 or base-5 deterministic Pepin's test. To allow tests of Mersenne numbers, GeneferCUDA should be run in base 3 with the change from c=+1 (the Fermat form) to c=-1 (the Mersenne form). Furthermore, CUDALucas is a Lucas-Lehmer test for Mersenne numbers only and not a Pepin's test. Al |
1 Attachment(s)
Hi,
I ported genefer to RADEON. [CODE] + time ./GeneferCUDA -t GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1 Copyright (C) 2001-2003, Yves Gallot (v1.3) Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1) Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA) A program for finding large probable generalized Fermat primes. Platform :Advanced Micro Devices, Inc. Device 0 : Capeverde Build Options are : -D KHR_DP_EXTENSION 2030234^8192+1 is a probable prime. (51672 digits) (err = 0.1719) (time = 0:03:01) 00:52:08 [/CODE] |
1 Attachment(s)
Hi, New Version.
HW:Celeron+RADEON7750(Low end) 1) install ubuntu Desktop 64bit 12.04 LTS 2) install amd-driver-installer-catalyst-13-4-linux-x86.x86_64.zip 3) install AMD-APP-SDK-v2.8-lnx64.tgz 4) install clAmdFft-1.10.321.tar.gz 5) install g++ 6) install libglu1-mesa-dev [CODE] $ pwd /opt/AMDAPP/samples/opencl/genefer/0.05 $ tar -xvf 0.04.tar.bz2 0.04/ 0.04/clFFTPlans.cpp 0.04/Makefile 0.04/MatrixMulDouble_Kernels.cl 0.04/run.sh 0.04/clFFTPlans.h 0.04/GeneferCUDA.c 0.04/MatrixMulDouble.hpp $ cd 0.04 0.04$ sh -x ./run.sh g++ -Wpointer-arith -Wfloat-equal -g -O3 -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/ -g3 -ffor-scope -I ../../../../../samples/opencl/SDKUtil/include -I ../../../../../samples/bolt/BoltUtil/include -I ../../../../../samples/C++Amp/AmpUtil/include -I "/opt/AMDAPP/include" -I ../../../../../include -o build/debug/x86_64//GeneferCUDA.o -c GeneferCUDA.c g++ -Wpointer-arith -Wfloat-equal -g -O3 -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/ -g3 -ffor-scope -I ../../../../../samples/opencl/SDKUtil/include -I ../../../../../samples/bolt/BoltUtil/include -I ../../../../../samples/C++Amp/AmpUtil/include -I "/opt/AMDAPP/include" -I ../../../../../include -o build/debug/x86_64//clFFTPlans.o -c clFFTPlans.cpp g++ -o build/debug/x86_64/GeneferCUDA build/debug/x86_64//GeneferCUDA.o build/debug/x86_64//clFFTPlans.o -lpthread -ldl -L/usr/X11R6/lib -g /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so -lOpenCL -lgmp -lSDKUtil -lOpenCL -L../../../../../lib/x86_64 -L../../../../../TempSDKUtil/lib/x86_64 -L"/opt/AMDAPP/lib/x86_64" + export LD_LIBRARY_PATH=:/opt/clAmdFft-1.10.321/lib64/ + time ./GeneferCUDA -t GeneferCUDA 2.2.1 (GPU) based on Genefer v2.2.1 Copyright (C) 2001-2003, Yves Gallot (v1.3) Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1) Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA) A program for finding large probable generalized Fermat primes. Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde Build Options are : -D KHR_DP_EXTENSION 2030234^8192+1 is a probable prime. (51672 digits) (err = 0.1738) (time = 0:01:47) 02:41:09 Platform :Advanced Micro Devices, Inc. Device 0 : Capeverde Build Options are : -D KHR_DP_EXTENSION 572186^131072+1 is a probable prime. (754652 digits) (err = 0.0801) (time = 1:02:35) 03:43:44 [/CODE] |
[bitchy]Who wants to find Fermat primes which are "large probable" or "probable generalized"? (or both)
Wouldn't "large, generalized Fermat (probable) primes" (native speakers can help here!) sound better? [/bitchy] All in all, well done msft! kutgw! (you spoiled us already) |
1 Attachment(s)
Hi, New Version.
HD7750: [CODE] 0.10$ ./GeneferCUDA -b 874718^131072+1 Time: 1.66 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.99 ms/mul. Err: 5.00e-01 1533952 digits 577098^524288+1 Time: 4.88 ms/mul. Err: 5.00e-01 3020555 digits 468750^1048576+1 Time: 9.83 ms/mul. Err: 5.00e-01 5946413 digits 380742^2097152+1 Time: 18.9 ms/mul. Err: 5.00e-01 11703432 digits 309258^4194304+1 Time: 40 ms/mul. Err: 5.00e-01 23028076 digits 251196^8388608+1 Time: 80.4 ms/mul. Err: 5.00e-01 45298590 digits 0.10$ ./GeneferCUDA -t 572186^131072+1 is a probable prime. (754652 digits) (err = 0.0762) (time = 1:10:03) 18:36:25 [/CODE] PS. clAmdFft has this issue: [url]http://devgurus.amd.com/message/1298209#1298209[/url] |
[QUOTE=LaurV;347410][bitchy]Who wants to find Fermat primes which are "large probable" or "probable generalized"? (or both)
Wouldn't "large, generalized Fermat (probable) primes" (native speakers can help here!) sound better? [/bitchy] All in all, well done msft! kutgw! (you spoiled us already)[/QUOTE] +1 Indeed.. :razz: |
1 Attachment(s)
New version. Conversion finished.
[CODE] 572186^131072+1 is a probable prime. (754652 digits) (err = 0.0742) (time = 0:54:27) 12:09:13 0.13$ ./GeneferCUDA -b 1076904^65536+1 Time: 809 us/mul. Err: 5.00e-01 395325 digits 874718^131072+1 Time: 1.27 ms/mul. Err: 3.47e-01 778813 digits 710492^262144+1 Time: 2.56 ms/mul. Err: 5.00e-01 1533952 digits 577098^524288+1 Time: 4.73 ms/mul. Err: 5.00e-01 3020555 digits 468750^1048576+1 Time: 9.28 ms/mul. Err: 5.00e-01 5946413 digits 380742^2097152+1 Time: 18.1 ms/mul. Err: 5.00e-01 11703432 digits 309258^4194304+1 Time: 38.2 ms/mul. Err: 5.00e-01 23028076 digits 251196^8388608+1 Time: 76.4 ms/mul. Err: 5.00e-01 45298590 digits [/CODE] |