mersenneforum.org

genefer/CUDA (https://www.mersenneforum.org/showthread.php?t=14297)

ET_ 2011-11-27 10:14

[QUOTE=msft;280077]Ken Brazier's code changes that may produce a percent or two speed increase in GeneferCUDA.[/QUOTE]

Thank you :-) :smile:

Luigi

msft 2011-12-09 06:57

1 Attachment(s)
New version. :smile:
[QUOTE]
OLD:
$ ./GeneferCUDA -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 396 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 402 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 453 us/mul. Err: 1.88e-01 200622 digits
1076904^65536+1 Time: 608 us/mul. Err: 1.72e-01 395325 digits
874718^131072+1 Time: 835 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.35 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.58 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.12 ms/mul. Err: 1.56e-01 5946413 digits
380742^2097152+1 Time: 10.5 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 23.7 ms/mul. Err: 1.47e-01 23028076 digits
251196^8388608+1 Time: 48.7 ms/mul. Err: 1.56e-01 45298590 digits

NEW:
$ ./GeneferCUDA -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 407 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 412 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 463 us/mul. Err: 1.88e-01 200622 digits
1076904^65536+1 Time: 612 us/mul. Err: 1.72e-01 395325 digits
874718^131072+1 Time: 837 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.35 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.58 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.12 ms/mul. Err: 1.56e-01 5946413 digits
380742^2097152+1 Time: 9.61 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 20.7 ms/mul. Err: 1.47e-01 23028076 digits
251196^8388608+1 Time: 31.7 ms/mul. Err: 1.41e-01 45298590 digits
[/QUOTE]
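
To read the two listings side by side: the small sizes are marginally slower in the new version (407 vs. 396 us/mul at N=8192), while the three largest sizes take roughly 8%, 13% and 35% less time per multiplication. A minimal sketch of that arithmetic in C follows; the helper names `percent_saved` and `print_savings` are made up for illustration and are not part of GeneferCUDA.

```c
#include <stdio.h>

/* Percentage of per-multiplication time saved when going from old_t to new_t
   (both in the same unit). A negative result means the new run is slower. */
static double percent_saved(double old_t, double new_t)
{
    return 100.0 * (old_t - new_t) / old_t;
}

/* Times in ms/mul copied from the OLD and NEW listings above. */
static const struct { unsigned long n; double old_ms, new_ms; } bench_rows[] = {
    {2097152UL, 10.5, 9.61},
    {4194304UL, 23.7, 20.7},
    {8388608UL, 48.7, 31.7},
};

/* Hypothetical helper: prints one line per benchmark size. */
static void print_savings(void)
{
    for (size_t i = 0; i < sizeof bench_rows / sizeof bench_rows[0]; ++i)
        printf("N=%lu: %.1f%% less time per mul\n", bench_rows[i].n,
               percent_saved(bench_rows[i].old_ms, bench_rows[i].new_ms));
}
```

Calling `print_savings()` from a `main` prints one line for each of the three sizes.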

AG5BPilot 2011-12-26 05:17

Shoichiro,

Last month I started experimenting with GeneferCUDA at N=4194304. I discovered that the program's initialization takes two hours before the GPU actually starts working. At N=524288 it takes only 5 minutes. This initialization runs every time GeneferCUDA is restarted, so every stop and restart costs two hours.

I played with the code a bit and found this section near the beginning of the "check" routine:

[code]    Nlen = 1;
    Na = (UInt32 *)myAlloc(1 * sizeof(UInt32));
    Na[0] = (UInt32)b;
    for (j = m; j != 1; j /= 2)
    {
        a = (UInt32 *)myAlloc(2 * Nlen * sizeof(UInt32));
        for (i = 0; i != Nlen; ++i)
            a[i] = 0;

        for (i = 0; i != Nlen; ++i)
            a[i + Nlen] = mul_1_add_n(&a[i], Na[i], Na, Nlen);

        myFree(Na);
        Na = a;
        Nlen *= 2;
        if (Na[Nlen - 1] == 0) --Nlen;
    }
[/code]

This loop, together with the mul_1_add_n routine, computes b^N by repeatedly squaring b. That is efficient for small N, but the final couple of squarings become exceedingly time-consuming with larger numbers.
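
To make the cost concrete, here is a minimal, self-contained sketch of the same idea in plain C. It is not the GeneferCUDA source: `square_schoolbook` and `pow_by_squaring` are hypothetical names, and the sketch assumes, going by the shape of the loop above, that each squaring is a plain schoolbook multiplication on 32-bit limbs stored least significant first. Under that assumption, squaring an n-limb value costs about n^2 limb multiplications, and since the length roughly doubles at every step, the last squaring dominates the total.

```c
#include <stdint.h>
#include <stdlib.h>

/* r[0..2n) = x[0..n)^2, limbs least significant first (schoolbook, O(n^2)). */
static void square_schoolbook(uint32_t *r, const uint32_t *x, size_t n)
{
    for (size_t i = 0; i < 2 * n; ++i)
        r[i] = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t carry = 0;
        for (size_t k = 0; k < n; ++k) {
            /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)x[i] * x[k] + r[i + k] + carry;
            r[i + k] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + n] = (uint32_t)carry;
    }
}

/* Computes b^n for n a power of two using log2(n) squarings.
   Returns a malloc'ed limb array (or NULL on allocation failure)
   and stores its length in *len. */
static uint32_t *pow_by_squaring(uint32_t b, unsigned long n, size_t *len)
{
    size_t cur = 1;
    uint32_t *v = malloc(sizeof *v);
    if (!v) return NULL;
    v[0] = b;
    for (unsigned long j = n; j != 1; j /= 2) {
        uint32_t *sq = malloc(2 * cur * sizeof *sq);
        if (!sq) { free(v); return NULL; }
        square_schoolbook(sq, v, cur);
        free(v);
        v = sq;
        cur *= 2;
        if (v[cur - 1] == 0) --cur;   /* drop a zero top limb, as in the loop above */
    }
    *len = cur;
    return v;
}
```

For example, `pow_by_squaring(65536, 4, &len)` yields the three limbs `{0, 0, 1}`, i.e. 2^64. Replacing the inner multiplication with an FFT-based one, on the CPU or on the GPU, is what would remove the quadratic blow-up.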

I was thinking of ways to eliminate the 2 hour initialization, and while it's certainly possible to make the computation a lot more efficient (it could probably be computed on the GPU very quickly), it might be simpler to merely save the computed value of b^N in the checkpoint file. Then it would get computed only once.
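
A rough sketch of what saving and reloading that value could look like is below. Everything here is an assumption made for illustration: the function names `save_power` and `load_power`, and the file layout (a 32-bit limb count followed by the raw limbs in native byte order), are invented and say nothing about GeneferCUDA's actual checkpoint format.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes len, then len limbs, to path. Returns 0 on success, -1 on error.
   The layout is hypothetical and not portable across byte orders. */
static int save_power(const char *path, const uint32_t *limbs, uint32_t len)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&len, sizeof len, 1, f) == 1 &&
             fwrite(limbs, sizeof *limbs, len, f) == len;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Reads a file written by save_power. Returns a malloc'ed limb array and
   stores its length in *len, or returns NULL if the file is missing or
   truncated (the caller then recomputes the value the slow way). */
static uint32_t *load_power(const char *path, uint32_t *len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    uint32_t n = 0;
    uint32_t *limbs = NULL;
    if (fread(&n, sizeof n, 1, f) == 1 && n > 0) {
        limbs = malloc((size_t)n * sizeof *limbs);
        if (limbs && fread(limbs, sizeof *limbs, n, f) != n) {
            free(limbs);          /* short read: treat as no checkpoint */
            limbs = NULL;
        }
    }
    fclose(f);
    if (limbs) *len = n;
    return limbs;
}
```

At startup the program would try `load_power` first and fall back to the squaring loop only when it returns NULL, calling `save_power` once the value has been computed.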

What do you think?

Mike

msft 2011-12-27 15:08

Hi, AG5BPilot
[QUOTE=AG5BPilot;283524]I was thinking of ways to eliminate the 2 hour initialization, and while it's certainly possible to make the computation a lot more efficient (it could probably be computed on the GPU very quickly), it might be simpler to merely save the computed value of b^N in the checkpoint file. Then it would get computed only once.
[/QUOTE]
I made a checkpoint-file version. :smile:

AG5BPilot 2011-12-28 02:28

[QUOTE=msft;283664]Hi, AG5BPilot

I made a checkpoint-file version. :smile:[/QUOTE]

Thank you!

I did some timing tests running GeneferCUDA with 2 different video drivers, and three different CUDA toolkits. You might be interested in the results, which can be found [url=http://www.primegrid.com/forum_thread.php?id=1528&nowrap=true#45768]here[/url] on the PrimeGrid forums.

There's about a 20% speed difference between the best and the worst combination.

Mike

msft 2011-12-28 04:02

1 Attachment(s)
Fix for AG5BPilot's issue.

AG5BPilot 2011-12-28 06:41

[QUOTE=msft;283748]Fix for AG5BPilot's issue.[/QUOTE]

[quote]C:\GeneferCUDA test\genefercuda.1.04>GeneferCUDA-X41.exe -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 446 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 444 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 477 us/mul. Err: 1.82e-001 200622 digits
1076904^65536+1 Time: 610 us/mul. Err: 1.80e-001 395325 digits
874718^131072+1 Time: 771 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 1.06 ms/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 1.69 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 2.61 ms/mul. Err: 1.56e-001 5946413 digits
[color=red]380742^2097152+1 Time: 150 us/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 102 us/mul. Err: 1.47e-001 23028076 digits
251196^8388608+1 Time: 93.8 us/mul. Err: 1.41e-001 45298590 digits[/COLOR][/quote]

The benchmark times are wrong, but the real runtimes are correct, so the problem appears to be only in the benchGFN function. Perhaps it has to do with the SHIFT parameter to the FFT---GFN routines?

Mike

msft 2011-12-28 07:35

Hi, AG5BPilot
[QUOTE=AG5BPilot;283763]The benchmark times are wrong, but the real runtimes are correct, so the problem appears to be only in the benchGFN function. Perhaps it has to do with the SHIFT parameter to the FFT---GFN routines?[/QUOTE]
On my GTX-550Ti (cudatoolkit_3.2.16_linux_64_ubuntu10.04.run, gpucomputingsdk_3.2.16_linux.run, devdriver_4.1_linux_64_285.05.23.run):
[code]
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 391 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 414 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 466 us/mul. Err: 2.19e-01 200622 digits
1076904^65536+1 Time: 608 us/mul. Err: 1.80e-01 395325 digits
874718^131072+1 Time: 829 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.27 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.44 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 4.84 ms/mul. Err: 1.56e-01 5946413 digits
380742^2097152+1 Time: 9.39 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 19 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 29.1 ms/mul. Err: 1.56e-01 45298590 digits
[/code]
Is GeneferCUDA Ver 1.03 correct ?

LaurV 2011-12-28 08:03

"A program for finding large probable generalized Fermat primes"

:P That looks like my English :P

AG5BPilot 2011-12-28 13:27

[QUOTE=msft;283766]Hi, AG5BPilot

On my GTX-550Ti (cudatoolkit_3.2.16_linux_64_ubuntu10.04.run, gpucomputingsdk_3.2.16_linux.run, devdriver_4.1_linux_64_285.05.23.run):

[i]~Normal-looking benchmark results removed~[/i]

Is GeneferCUDA Ver 1.03 correct ?[/QUOTE]

EDIT: No. That's using 1.04, the latest one you posted. I wanted to test the new checkpointing.

It's compiled with Toolkit 4.1, SDK 3.2, and driver 285.86, all for Windows 7 64-bit. However, it's a 32-bit build, using these nvcc compiler options: -O3 -m32 -arch=sm_21

I'm going to revert to the 3.2 toolkit and rebuild your 1.03 and see what happens. I'll post the results in a few minutes.

Mike

AG5BPilot 2011-12-28 14:01

Your v1.04 source code, CUDA toolkit 3.2 (64 bit), SDK 3.2 (64 bit), driver 285.86, Windows 7 64 bit, nvcc options -O3 -arch=sm_21 -m32 (32 bit build):

[quote]C:\GeneferCUDA test\genefercuda.1.04>GeneferCUDA-X32.exe -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 397 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 420 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 451 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 589 us/mul. Err: 1.72e-001 395325 digits
874718^131072+1 Time: 715 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 944 us/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 1.5 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 2.31 ms/mul. Err: 1.56e-001 5946413 digits
[color=red]380742^2097152+1 Time: 225 us/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 316 us/mul. Err: 1.48e-001 23028076 digits
251196^8388608+1 Time: 273 us/mul. Err: 1.41e-001 45298590 digits[/color][/quote]

Same problem at N=2097152 and above.

I also tried building v1.03, using the same options as above:

[quote]C:\GeneferCUDA test\genefercuda.1.03>GeneferCUDA-X32.exe -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 396 us/mul. Err: 3.82e-001 51636 digits
1632282^16384+1 Time: 420 us/mul. Err: 2.53e-001 101791 digits
1325824^32768+1 Time: 451 us/mul. Err: 1.88e-001 200622 digits
1076904^65536+1 Time: 590 us/mul. Err: 1.72e-001 395325 digits
874718^131072+1 Time: 717 us/mul. Err: 3.47e-001 778813 digits
710492^262144+1 Time: 942 us/mul. Err: 4.21e-001 1533952 digits
577098^524288+1 Time: 1.5 ms/mul. Err: 2.01e-001 3020555 digits
468750^1048576+1 Time: 2.31 ms/mul. Err: 1.56e-001 5946413 digits
[color=red]380742^2097152+1 Time: 232 us/mul. Err: 3.63e-001 11703432 digits
309258^4194304+1 Time: 316 us/mul. Err: 1.48e-001 23028076 digits
251196^8388608+1 Time: 281 us/mul. Err: 1.41e-001 45298590 digits[/color][/quote]

There were a bunch of compiler warnings building 1.03:

[quote]GeneferCUDA.cu(204): warning: variable "j" was declared but never referenced

GeneferCUDA.cu(551): warning: variable "STRIDE" is used before its value is set

GeneferCUDA.cu(690): warning: variable "STRIDE" is used before its value is set

GeneferCUDA.cu(686): warning: variable "j" was declared but never referenced

GeneferCUDA.cu(793): warning: variable "STRIDE" is used before its value is set

GeneferCUDA.cu(786): warning: variable "j" was declared but never referenced

GeneferCUDA.cu(1245): warning: variable "args" is used before its value is set[/quote]

The 'args' warning also happens with 1.04, but I'm not too worried about that right now.

EDIT: I'm wondering if it has anything to do with my using a 32 bit build (nvcc -m32).

