1 Attachment(s)
Hi ,
[QUOTE=AG5BPilot;284292]By the way, even though the changes in 1.045 greatly slowed down GeneferCUDA, the benchmarks are now working:[/QUOTE] I gave up on the "reduce CPU time" change. Thank you, |
1.046:
[quote]Generalized Fermat Number Bench
2009574^8192+1 Time: 397 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 427 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 458 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 610 us/mul. Err: 1.72e-001 395325 digits
874718^131072+1 Time: 610 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 977 us/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 1.46 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 2.93 ms/mul. Err: 1.56e-001 5946413 digits
380742^2097152+1 Time: 0 us/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 0 us/mul. Err: 1.48e-001 23028076 digits
251196^8388608+1 Time: 0 us/mul. Err: 1.41e-001 45298590 digits
[/quote]
Awesome speed improvement! ;-) |
[QUOTE=AG5BPilot;284307]1.046:
Awesome speed improvement! ;-)[/QUOTE] Quantization error. What a wonderful world! :lol: |
1 Attachment(s)
Ver 1.047 with linux64 exec file.
[QUOTE=msft;284313]Quantization Error.[/QUOTE] Fixed this issue. |
[QUOTE=msft;284333]Ver 1.047 with linux64 exec file.
Fixed this issue.[/QUOTE]
1.047:
[quote]Generalized Fermat Number Bench
2009574^8192+1 Time: 399 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 423 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 461 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 614 us/mul. Err: 1.78e-001 395325 digits
874718^131072+1 Time: 785 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.1 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.03 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 3.89 ms/mul. Err: 1.64e-001 5946413 digits
380742^2097152+1 Time: 8.22 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 16 ms/mul. Err: 1.56e-001 23028076 digits
251196^8388608+1 Time: 32.6 ms/mul. Err: 1.56e-001 45298590 digits
[/quote]
The problem is indeed fixed. Thanks!

However, the software got a LOT slower at higher Ns. This is 1.04 with SHIFT=8:
[quote]Generalized Fermat Number Bench
2009574^8192+1 Time: 400 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 424 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 457 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 585 us/mul. Err: 1.72e-001 395325 digits
874718^131072+1 Time: 711 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 941 us/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 1.51 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 2.31 ms/mul. Err: 1.56e-001 5946413 digits
380742^2097152+1 Time: 3.56 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 6.05 ms/mul. Err: 1.56e-001 23028076 digits
251196^8388608+1 Time: 11.9 ms/mul. Err: 1.41e-001 45298590 digits
[/quote]
I wanted to see if the compatibility mode setting was the cause of the slowdown, so I commented out this line:
[code] cufftSetCompatibilityMode(plan,CUFFT_COMPATIBILITY_NATIVE); //1.047[/code]
This is 1.047 with that line commented out:
[quote]Generalized Fermat Number Bench
2009574^8192+1 Time: 401 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 424 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 463 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 617 us/mul. Err: 1.78e-001 395325 digits
874718^131072+1 Time: 791 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.11 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.04 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 3.9 ms/mul. Err: 1.64e-001 5946413 digits
380742^2097152+1 Time: 8.19 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 15.9 ms/mul. Err: 1.56e-001 23028076 digits
251196^8388608+1 Time: 32.4 ms/mul. Err: 1.56e-001 45298590 digits
[/quote]
As you can see, it doesn't affect speed much.

I'm wondering if the "slowdown" is actually just an instrumentation error and the benchmarks were giving us bad timings at higher Ns all along.

I'm going to do some timing measurements on real processing later today. I'll let you know what I find.

Mike |
[QUOTE=AG5BPilot;284355]I'm wondering if the "slowdown" is actually just an instrumentation error and the benchmarks were giving us bad timings at higher Ns all along. I'm going to do some timing measurements on real processing later today. I'll let you know what I find.
[/QUOTE] I just tested 0.99 and 1.047 with -q "14^2097152+1" and measured the actual time for the counter to decrease. [B]Both programs run at exactly the same speed.[/B]

So, the information I posted before about 1.047 with its higher SHIFT values being slower IS WRONG. It looks like the faster speeds reported by the older software are probably due to the timing not being measured accurately.

This also calls into question ALL of the previous timing measurements I've made, as well as the conclusions. Previous measurements have shown that under Windows 7 x64, CUDA 3.2 is the fastest, driver 285.86 is the fastest, and 32 bit is the fastest. I'm going to need to retest all of those.

On the other hand, at least in a 32-bit Windows app, this also shows that increasing SHIFT at N=2097152 doesn't affect the speed at all. (I doubt it would affect a 64 bit app either, since the CPU is hardly being used.)

Shoichiro,

Thanks for all the effort you put into figuring this out over the last few days.

Mike |
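For anyone repeating the stopwatch check described above, the arithmetic can be sketched in C as follows. This is only an illustration: the helper name `ms_per_mul` is hypothetical and not part of GeneferCUDA, and it assumes the on-screen counter drops by one per multiplication, which the thread does not state explicitly.

```c
#include <assert.h>

/* Hypothetical helper, not part of GeneferCUDA.
 * Converts a wall-clock interval and the change in the progress counter
 * into milliseconds per multiplication.
 * Assumption: the counter counts down by one per multiplication. */
static double ms_per_mul(double elapsed_seconds, long counter_start, long counter_end)
{
    long steps = counter_start - counter_end;

    /* The counter must have moved and the interval must be non-negative. */
    assert(steps > 0);
    assert(elapsed_seconds >= 0.0);

    return elapsed_seconds * 1000.0 / (double)steps;
}
```

With made-up numbers, a stopwatch reading of 82 seconds while the counter falls from 20000 to 10000 gives `ms_per_mul(82.0, 20000, 10000)`, about 8.2 ms/mul, which can then be compared with the `-b` figure for the same N.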
Sorry, the -b results from old versions (<1.047) were inaccurate.
Ver 1.047 benchmark with GTX-550Ti
SHIFT=8:
[code]
Generalized Fermat Number Bench
710492^262144+1 Time: 1.29 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.51 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.03 ms/mul. Err: 1.68e-01 5946413 digits
380742^2097152+1 Time: 10.6 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 21.7 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 44.7 ms/mul. Err: 1.64e-01 45298590 digits
[/code]
SHIFT=9:
[code]
Generalized Fermat Number Bench
710492^262144+1 Time: 1.9 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.88 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.08 ms/mul. Err: 1.64e-01 5946413 digits
380742^2097152+1 Time: 10.5 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 21.5 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 44.2 ms/mul. Err: 1.64e-01 45298590 digits
[/code]
SHIFT=10:
[code]
Generalized Fermat Number Bench
710492^262144+1 Time: 3.89 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 4.87 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 6.89 ms/mul. Err: 1.76e-01 5946413 digits
380742^2097152+1 Time: 11.8 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 21.5 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 44.1 ms/mul. Err: 1.64e-01 45298590 digits
[/code]
Changing the SHIFT value had little effect. |
I just completed my timing tests with 1.047.
Previously, I had found that CUDA 3.2 is fastest. That is still true. However, the difference between 3.2 and 4.0/4.1 is about half of what was measured before (the readings with 1.047 are more accurate). This difference was also confirmed by timing real PRP testing in addition to the benchmarks.

I also tested video driver 275.33 vs. 285.86 as I had done previously. This time, the results were surprising. I could detect no significant difference between 275.33 and 285.86.

So, at least in 32 bit Windows, CUDA 3.2 is the way to go, but the driver doesn't matter.

Shoichiro,

You can add this line into the test residue routine:
[code]check( 4000,2097152, "ff3daf8908789696");[/code]
[quote]4000^2097152+1 is composite. (RES=ff3daf8908789696) (7554068 digits) (err = 0.0000) (time = 60:34:16) 19:13:44[/quote]
I'm not 100% certain the residue is correct yet, but I'm running it against another GeneferCUDA version to verify it. That was run with 1.04, and took 60 hours (CUDA 4.2).

Here are the benchmark results: (GTX 460, v1.047)
[quote]driver 285.86:

CUDA 3.2:
Generalized Fermat Number Bench
2009574^8192+1 Time: 400 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 426 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 464 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 606 us/mul. Err: 1.78e-001 395325 digits
874718^131072+1 Time: 776 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.09 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.04 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 3.88 ms/mul. Err: 1.64e-001 5946413 digits
380742^2097152+1 Time: 8.18 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 15.9 ms/mul. Err: 1.56e-001 23028076 digits
251196^8388608+1 Time: 32.3 ms/mul. Err: 1.56e-001 45298590 digits

CUDA 4.0:
Generalized Fermat Number Bench
2009574^8192+1 Time: 424 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 429 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 465 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 610 us/mul. Err: 1.88e-001 395325 digits
874718^131072+1 Time: 780 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.19 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.18 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 4.09 ms/mul. Err: 1.72e-001 5946413 digits
380742^2097152+1 Time: 8.32 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 17.2 ms/mul. Err: 1.64e-001 23028076 digits
251196^8388608+1 Time: 35.1 ms/mul. Err: 1.72e-001 45298590 digits

CUDA 4.1:
Generalized Fermat Number Bench
2009574^8192+1 Time: 430 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 438 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 476 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 611 us/mul. Err: 1.91e-001 395325 digits
874718^131072+1 Time: 810 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.2 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.2 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 4.14 ms/mul. Err: 1.88e-001 5946413 digits
380742^2097152+1 Time: 8.39 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 17.1 ms/mul. Err: 1.64e-001 23028076 digits
251196^8388608+1 Time: 35.7 ms/mul. Err: 1.56e-001 45298590 digits

driver 275.33:

CUDA 3.2:
Generalized Fermat Number Bench
2009574^8192+1 Time: 400 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 424 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 460 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 603 us/mul. Err: 1.78e-001 395325 digits
874718^131072+1 Time: 773 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.09 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.03 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 3.89 ms/mul. Err: 1.64e-001 5946413 digits
380742^2097152+1 Time: 8.2 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 15.9 ms/mul. Err: 1.56e-001 23028076 digits
251196^8388608+1 Time: 32.4 ms/mul. Err: 1.56e-001 45298590 digits

CUDA 4.0:
Generalized Fermat Number Bench
2009574^8192+1 Time: 424 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 430 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 464 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 610 us/mul. Err: 1.88e-001 395325 digits
874718^131072+1 Time: 779 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.19 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 2.17 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 4.11 ms/mul. Err: 1.72e-001 5946413 digits
380742^2097152+1 Time: 8.35 ms/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 17.2 ms/mul. Err: 1.64e-001 23028076 digits
251196^8388608+1 Time: 35.2 ms/mul. Err: 1.72e-001 45298590 digits

CUDA 4.1 FAILS[/quote] |
CUFFT benchmark program:
[code]
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cutil_inline.h> /* CUDA SDK helper header: cutilSafeCall, cufftSafeCall */

int main()
{
    cufftHandle plan;
    cudaEvent_t start, stop;
    double *x;
    cufftDoubleComplex *g_x;
    float outerTime;
    int i, j, imax;

    /* Allocate and zero the largest buffer once; every plan size reuses it. */
    imax = 1024*1024*4+1;
    cutilSafeCall(cudaMalloc((void**)&g_x, sizeof(cufftDoubleComplex)*imax));
    x = ((double *)malloc(sizeof(cufftDoubleComplex)*imax));
    for(i=0;i<imax*2;i++) x[i]=0;
    cutilSafeCall(cudaMemcpy(g_x, x, sizeof(cufftDoubleComplex)*imax, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));

    for(j=131072;j<imax+1;j*=2)
    {
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_Z2Z, 1));
        /* One untimed warm-up transform before the timed loop. */
        cufftSafeCall(cufftExecZ2Z(plan,g_x,g_x,CUFFT_FORWARD));
        cutilSafeCall( cudaEventRecord(start, 0) );
        for(i=0;i<100;i++)
            cufftSafeCall(cufftExecZ2Z(plan,g_x,g_x,CUFFT_FORWARD));
        cutilSafeCall( cudaEventRecord(stop, 0) );
        cutilSafeCall( cudaEventSynchronize(stop) );
        cutilSafeCall( cudaEventElapsedTime(&outerTime, start, stop) );
        /* Average over the 100 timed transforms. */
        printf("CUFFT_Z2Z fft length =%d time=%f msec\n",j*2,outerTime/100);
        cufftSafeCall(cufftDestroy(plan));
    }

    cutilSafeCall(cudaFree((char *)g_x));
    cutilSafeCall( cudaEventDestroy(start) );
    cutilSafeCall( cudaEventDestroy(stop) );
    free(x);
    return 0;
}
[/code]
Result:
[code]
CUDA3.2
CUFFT_Z2Z fft length =262144 time=0.416510 msec
CUFFT_Z2Z fft length =524288 time=0.876336 msec
CUFFT_Z2Z fft length =1048576 time=1.833723 msec
CUFFT_Z2Z fft length =2097152 time=3.977337 msec
CUFFT_Z2Z fft length =4194304 time=8.249021 msec
CUFFT_Z2Z fft length =8388608 time=17.154232 msec
CUDA4.0
CUFFT_Z2Z fft length =262144 time=0.458137 msec
CUFFT_Z2Z fft length =524288 time=0.951801 msec
CUFFT_Z2Z fft length =1048576 time=1.974829 msec
CUFFT_Z2Z fft length =2097152 time=4.103585 msec
CUFFT_Z2Z fft length =4194304 time=9.244113 msec
CUFFT_Z2Z fft length =8388608 time=19.159618 msec
[/code]
Ver 1.047 -b result:
[code]
CUDA3.2
Generalized Fermat Number Bench
710492^262144+1 Time: 1.29 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.51 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.03 ms/mul. Err: 1.68e-01 5946413 digits
380742^2097152+1 Time: 10.5 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 21.5 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 44.1 ms/mul. Err: 1.64e-01 45298590 digits
CUDA4.0
Generalized Fermat Number Bench
710492^262144+1 Time: 1.37 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.65 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.3 ms/mul. Err: 1.72e-01 5946413 digits
380742^2097152+1 Time: 10.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 23.4 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 48 ms/mul. Err: 1.56e-01 45298590 digits
[/code]
N=8388608
GeneferCUDA: 48 ms (CUDA4.0) - 44.1 ms (CUDA3.2) = 3.9 ms
CUFFT: 19.159618 msec (CUDA4.0)*2 - 17.154232 msec (CUDA3.2)*2 = 4.0 msec
The cause is CUFFT. |
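The comparison at N=8388608 can be written out as a small C sketch. The function names here are hypothetical, and the factor of 2 is simply the multiplier used in the post above (presumably one forward and one inverse transform per multiplication, though the thread does not say so).

```c
#include <assert.h>

/* Hypothetical helpers for illustration only; not part of GeneferCUDA. */

/* Slowdown per multiplication seen in the -b output, in ms. */
static double bench_delta_ms(double new_ms, double old_ms)
{
    return new_ms - old_ms;
}

/* Slowdown per multiplication explained by the FFT library alone, in ms,
 * given the time of one transform under each toolkit and the number of
 * transforms per multiplication (2 in the post above). */
static double fft_delta_ms(double new_fft_ms, double old_fft_ms, int ffts_per_mul)
{
    assert(ffts_per_mul > 0);
    return (new_fft_ms - old_fft_ms) * ffts_per_mul;
}
```

With the figures quoted above, `bench_delta_ms(48.0, 44.1)` is about 3.9 ms and `fft_delta_ms(19.159618, 17.154232, 2)` is about 4.01 ms, so the transform time accounts for essentially the whole difference between CUDA 3.2 and CUDA 4.0 at this size.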
1 Attachment(s)
Ver 1.048 with linux64 exec file.
[code]
$ grep 1.048 GeneferCUDA.cu
//1.048 cufftSetCompatibilityMode(plan,CUFFT_COMPATIBILITY_NATIVE); //1.047
check( 4000,2097152, "ff3daf8908789696"); //1.048 from AG5BPilot
[/code]
[code]
$ ./GeneferCUDA.cuda4.0.Linux64 -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 407 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 414 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 467 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 611 us/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 843 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.37 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.65 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.3 ms/mul. Err: 1.72e-01 5946413 digits
380742^2097152+1 Time: 10.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 23.4 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 48 ms/mul. Err: 1.56e-01 45298590 digits
[/code] |
I'm not sure how long this has been in the code, but the occasional checkpoint read failures come from two bugs in the checkpoint read function: it uses !strcmp where it should use strcmp, and it doesn't add a terminating null to the end of one of the strings it's comparing.
This code:
[code]
fread(build, 1, bytes, fPtr);
if (!strcmp(build, CPU_TARGET)) {
[/code]
should be this:
[code]
fread(build, 1, bytes, fPtr);
build[bytes] = 0;
if (strcmp(build, CPU_TARGET)) {
[/code] |
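To show why both changes matter, here is a minimal, self-contained C sketch of the same pattern. It is not the GeneferCUDA source: the function names, the `CPU_TARGET` value, and the assumption that the guarded branch handles a build mismatch are all made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical value; the real CPU_TARGET string is not shown in the thread. */
#define CPU_TARGET "CUDA-x86"

/* Reads exactly `bytes` bytes of build tag into `out` and NUL-terminates it.
 * fread() copies raw bytes and never appends a terminator, so without the
 * explicit '\0' strcmp() would run on into whatever was already in `out`.
 * Returns 0 on success, -1 on a short read or a buffer that is too small. */
static int read_build_tag(FILE *fp, size_t bytes, char *out, size_t cap)
{
    assert(fp != NULL && out != NULL);
    if (bytes + 1 > cap)
        return -1;
    if (fread(out, 1, bytes, fp) != bytes)
        return -1;
    out[bytes] = '\0';
    return 0;
}

/* strcmp() returns 0 when the strings are EQUAL, so a non-zero result is
 * the mismatch case. Assumption: the branch in the original code is meant
 * to run when the checkpoint came from a different build. */
static int build_mismatch(const char *build)
{
    return strcmp(build, CPU_TARGET) != 0;
}
```

A caller would read the tag with `read_build_tag(fPtr, bytes, build, sizeof build)` and reject the checkpoint when `build_mismatch(build)` is non-zero. Checking the return value of `fread` is an extra safeguard that the fix quoted above does not include.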