mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   genefer/CUDA (https://www.mersenneforum.org/showthread.php?t=14297)

msft 2012-01-02 14:48

1 Attachment(s)
Ver 1.048 with linux64 exec file.
[code]
$ grep 1.048 GeneferCUDA.cu
//1.048 cufftSetCompatibilityMode(plan,CUFFT_COMPATIBILITY_NATIVE); //1.047
check( 4000,2097152, "ff3daf8908789696"); //1.048 from AG5BPilot
[/code]
[code]

$ ./GeneferCUDA.cuda4.0.Linux64 -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 407 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 414 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 467 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 611 us/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 843 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.37 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.65 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.3 ms/mul. Err: 1.72e-01 5946413 digits
380742^2097152+1 Time: 10.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 23.4 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 48 ms/mul. Err: 1.56e-01 45298590 digits
[/code]

AG5BPilot 2012-01-03 01:51

I'm not sure how long this has been in the code, but the source of the occasional checkpoint read failures is the checkpoint read function using !strcmp where it should use strcmp, and also not adding a terminating null to the end of one of the strings it compares.

This code:
[code] fread(build, 1, bytes, fPtr);
if (!strcmp(build, CPU_TARGET))
{[/code]

should be this:
[code] fread(build, 1, bytes, fPtr);
build[bytes] = 0;
if (strcmp(build, CPU_TARGET))
{[/code]

msft 2012-01-04 01:10

[QUOTE=AG5BPilot;284535]I'm not sure how long this has been in the code, but the source of the occasional checkpoint read failures is the checkpoint read function using !strcmp where it should use strcmp, and also not adding a terminating null to the end of one of the strings it compares.
[/QUOTE]
You are right.
[code]
$ od -cx genefer.ckpt |head
0000000 335 \0 \0 \0 004 \0 \0 \0 C U D A 232 372 036 \0
00dd 0000 0004 0000 5543 4144 fa9a 001e

#define SaveFileVersion 221

oldSaveVer="00dd 0000"
byte="0004 0000"
build="5543 4144"
(x86 is little-endian format.)

http://research.microsoft.com/en-us/um/redmond/projects/invisible/src/crt/strcmp.c.htm
[/code]

msft 2012-01-04 04:05

1 Attachment(s)
Ver 1.049 with linux64 exec file.
Fixed #99 issue.

msft 2012-01-04 12:33

1 Attachment(s)
Ver 1.05 with linux64 exec file.
Fixed "Corruption of error message" issue.

rroonnaalldd 2012-01-04 15:15

1 Attachment(s)
Minor cosmetics (i*2) to ((i)*2) in v1.051.

Do you have an idea why your source produces 100% load on one CPU core?
Is it also by design that stopping the bench leaves the app hanging after it writes "writing checkpoint" to the screen?
Do you know what causes the underlined differences in error rates between the 32-bit and 64-bit apps?


Here are some timings for our apps, and for comparison the timings for genefer, genefer80 and geneferX64:
[QUOTE]boinc@vmware2k-3: ./[B]GeneferCUDA.cuda4.0.Linux32[/B] -b

2009574^8192+1 Time: 663 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 680 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 746 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 980 us/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 1.34 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.07 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 4.06 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 8.21 ms/mul. [U]Err: 1.64e-01[/U] 5946413 digits
380742^2097152+1 Time: 16.6 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 35.8 ms/mul. [U]Err: 4.07e-01[/U] 23028076 digits
251196^8388608+1 Time: 73.2 ms/mul. [U]Err: 4.33e-01[/U] 45298590 digits
[/QUOTE]
[QUOTE]boinc@vmware2k-3: ./[B]GeneferCUDA.cuda4.0.Linux64[/B] -b

2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 713 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 779 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 8.21 ms/mul. [U]Err: 1.72e-01[/U] 5946413 digits
380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 36.9 ms/mul. [U]Err: 1.56e-01[/U] 23028076 digits
251196^8388608+1 Time: 74.9 ms/mul. [U]Err: 1.56e-01[/U] 45298590 digits
[/QUOTE]

CPU:
[QUOTE]boinc@vmware2k-3: ./[B]genefer[/B] -b

6631258^256+1 Time: 6.1 us/mul. Err: 0.4991 1747 digits
5386256^512+1 Time: 13.4 us/mul. Err: 0.4361 3447 digits
4375000^1024+1 Time: 28.1 us/mul. Err: 0.4200 6801 digits
3553604^2048+1 Time: 59.8 us/mul. Err: 0.4686 13416 digits
2886422^4096+1 Time: 129 us/mul. Err: 0.4362 26462 digits
2344504^8192+1 Time: 327 us/mul. Err: 0.4133 52184 digits
1904328^16384+1 Time: 693 us/mul. Err: 0.4014 102888 digits
1546796^32768+1 Time: 1.46 ms/mul. Err: 0.4412 202816 digits
1256388^65536+1 Time: 3.05 ms/mul. Err: 0.4634 399713 digits
1020504^131072+1 Time: 6.33 ms/mul. Err: 0.3661 787588 digits
828906^262144+1 Time: 13.8 ms/mul. Err: 0.4133 1551501 digits
673282^524288+1 Time: 29.4 ms/mul. Err: 0.4228 3055654 digits
546874^1048576+1 Time: 71.2 ms/mul. Err: 0.3784 6016611 digits
444200^2097152+1 Time: 155 ms/mul. Err: 0.3295 11843831 digits
[/QUOTE]
[QUOTE]boinc@vmware2k-3: ./[B]genefer80[/B] -b

5683936^256+1 Time: 9.77 us/mul. Err: 0.0002 1730 digits
4616790^512+1 Time: 20.8 us/mul. Err: 0.0002 3413 digits
3750000^1024+1 Time: 47.6 us/mul. Err: 0.0002 6732 digits
3045946^2048+1 Time: 101 us/mul. Err: 0.0002 13279 digits
2474076^4096+1 Time: 227 us/mul. Err: 0.0002 26188 digits
2009574^8192+1 Time: 522 us/mul. Err: 0.0002 51636 digits
1632282^16384+1 Time: 1.08 ms/mul. Err: 0.0002 101791 digits
1325824^32768+1 Time: 2.25 ms/mul. Err: 0.0002 200622 digits
1076904^65536+1 Time: 5 ms/mul. Err: 0.0002 395325 digits
874718^131072+1 Time: 11.2 ms/mul. Err: 0.0002 778813 digits
710492^262144+1 Time: 22.7 ms/mul. Err: 0.0002 1533952 digits
577098^524288+1 Time: 55.3 ms/mul. Err: 0.0002 3020555 digits
468750^1048576+1 Time: 128 ms/mul. Err: 0.0002 5946413 digits
380742^2097152+1 Time: 270 ms/mul. Err: 0.0002 11703432 digits
[/QUOTE]
[QUOTE]boinc@vmware2k-3: ./[B]geneferX64[/B] -b

5683936^256+1 Time: 3.81 us/mul. Err: 0.2500 1730 digits
4616790^512+1 Time: 8.24 us/mul. Err: 0.2500 3413 digits
3750000^1024+1 Time: 17.7 us/mul. Err: 0.2500 6732 digits
3045946^2048+1 Time: 37.8 us/mul. Err: 0.2500 13279 digits
2474076^4096+1 Time: 80.6 us/mul. Err: 0.2500 26188 digits
2009574^8192+1 Time: 200 us/mul. Err: 0.2500 51636 digits
1632282^16384+1 Time: 420 us/mul. Err: 0.2500 101791 digits
1325824^32768+1 Time: 879 us/mul. Err: 0.2188 200622 digits
1076904^65536+1 Time: 1.88 ms/mul. Err: 0.2031 395325 digits
874718^131072+1 Time: 3.83 ms/mul. Err: 0.2188 778813 digits
710492^262144+1 Time: 7.97 ms/mul. Err: 0.1875 1533952 digits
577098^524288+1 Time: 17.5 ms/mul. Err: 0.1719 3020555 digits
468750^1048576+1 Time: 44.4 ms/mul. Err: 0.1875 5946413 digits
380742^2097152+1 Time: 98.8 ms/mul. Err: 0.1875 11703432 digits
[/QUOTE]

AG5BPilot 2012-01-04 16:26

[QUOTE=rroonnaalldd;284769]
Do you have an idea why your source produces 100% load on one CPU core?
Is it also by design that stopping the bench leaves the app hanging after it writes "writing checkpoint" to the screen?
Do you know what causes the underlined differences in error rates between the 32-bit and 64-bit apps?

CPU:[/QUOTE]

Ronald,

I can probably answer some of your questions as my boinc version is, essentially, a superset of Shoichiro's software.

The problem with not being able to terminate the benchmarks has to do with the way the CTRL-C trapping works. The handler for the signal capture is set up at the beginning of the program, BEFORE it's determined that you're going to run a benchmark. When you hit CTRL-C, the handler captures it and sets the quitting flag, which tells "check" to stop running, do a checkpoint, and exit.

Unfortunately, "check" isn't running. So the quitting flag is set, but nothing ever reads that.

I fixed this in my boinc version of the code by only setting the CTRL-C handler if we're going to do a real PRP test. Otherwise, I let the default run-time handler catch the signal and terminate the program normally.

As for the CPU core loading, mine doesn't do that, and the math code in my version is identical to Shoichiro's. But my build is 32 bit windows.

By the way, when I did build a 64 bit version, it ran very slightly slower than the 32 bit build.

AG5BPilot 2012-01-04 16:52

[QUOTE=rroonnaalldd;284769]Minor cosmetics (i*2) to ((i)*2) in v1.051.[/QUOTE]

I must be missing something. These are the only changes I can find between 1.05 and 1.051:

[code]Compare: (<)C:\GeneferCUDA test\genefercuda.1.051\GeneferCUDA.cu (37817 bytes)
with: (>)C:\GeneferCUDA test\genefercuda.1.05\GeneferCUDA.cu (37811 bytes)

137d137
<
476,477c475
< }
< while (f != 0);
---
> } while (f != 0);
898,899c896,897
< }
< while (maxError > ErrThreshold);
---
>
> } while (maxError > ErrThreshold);
1121c1119,1120
< while (fgets(str, 132, fp) != NULL)
---
>
> while (fgets(str, 132, fp) != NULL)
1267d1266
<
1271,1275d1269
<
<
<
<
< [/code]

I don't see anything about (i*2). Also, in the only place I can find (i*2), i is a variable, not a macro parameter, so I don't understand the benefit of changing it to ((i)*2).

AG5BPilot 2012-01-04 19:37

Shoichiro,

I just completed testing 4000^2097152+1 with older code (0.99) that uses SHIFT=8. Good news; the residual is the same as the residual that came from the 1.04 version with the new SHIFT logic.

Mike

AG5BPilot 2012-01-04 23:48

I'm hoping there's a simple answer to this question.

How do I convert an array of integers (which represents a very, very large integer) to and from the complex arrays that are the input and output of FFTsquareFFT?

Here's the background:

Once upon a time (say, when I was in high school some 35 years ago), I understood and really, really enjoyed things like Fourier transforms. Having not ever, once, had to actually use a Fourier transform since then, it's not difficult to imagine that I've forgotten nearly everything about them!

Until, lo and behold, I find myself needing to do multiplications on astoundingly big numbers.

So, here's the problem. Genefer, at the beginning of its test, starts by computing the value of b^N by repeatedly taking the square of b. The code in Genefer worked fine, for years, until some fool (that fool would be me, of course) had the audacity to go and discover that 25898^524288+1 was a prime number. So, of course, that sent us all thinking about finding even bigger primes with Genefer.

So far, so good. Except that I'm a programmer by training, so I started tinkering with Genefer. First thing I did was convert GeneferCUDA to run under BOINC, so we can gets lots and lots of people running it (more people are willing to run Boinc than PRPNet. That's a shame; I like PRPNet better. Mark rocks.)

Along the way, I found that at N=4194304 Genefer takes a very long time to start up. Looking at the code, I discovered that the really simple code at the beginning that computes the actual value of b^N, by repeatedly squaring b, takes a very long time when N=4194304. About 2 hours on my Core2Quad.

Now, I could live with it taking 2 hours. After all, the actual primality test takes about 200 hours on a GTX 460. But, I know it can be done faster. It would take me, for example, a few minutes to write the code to use the gwnum library to do it, and that could do the calculations very quickly.

Of course, it's silly to link the gwnum library into a program that's already doing math like that on the GPU.

The programmer in me won't let me accept that 2 hours of processing when I know it can be eliminated.

Which brings me to the part about my having forgotten what I learned about Fourier transforms 35 years ago. GeneferCUDA already has the code in it to do all the real work to do the square. All I have to do is call FFTsquareFFT. It does a forward FFT, calls a CUDA kernel to do the complex multiply on the FFT result, and does an inverse FFT. The result is the square of the input.

The problem is, I'm not sure how to convert the input number (an array of integers) into the complex array that's the input to the FFT, nor how to get the resulting number back out. I might figure it out eventually, but I thought I'd ask because there's probably an incredibly simple answer.

Thanks,
Mike

msft 2012-01-05 00:18

Hi, rroonnaalldd
[QUOTE=rroonnaalldd;284769]Do you have an idea why your source produces 100% load on one CPU core?
[/QUOTE]
Please see.
[url]http://mersenneforum.org/showpost.php?p=284251&postcount=86[/url]
[url]http://mersenneforum.org/showpost.php?p=284284&postcount=87[/url]
[url]http://mersenneforum.org/showpost.php?p=284292&postcount=88[/url]
[url]http://mersenneforum.org/showpost.php?p=284300&postcount=89[/url]

msft 2012-01-05 01:10

Hi,
[QUOTE=AG5BPilot;284813]How do I convert an array of integers (which represents a very, very large integer) to and from the complex arrays that are the input and output of FFTsquareFFT?
[/QUOTE]
I think the right way is to implement a library.
[url]http://mersenneforum.org/showpost.php?p=248716&postcount=63[/url]
Or something like this (GMP?).
llrCUDA has the same problem.

rroonnaalldd 2012-01-05 02:21

[QUOTE]Please see.[/QUOTE] Thanks for the links.

msft 2012-01-05 02:23

Hi,
[QUOTE=rroonnaalldd;284769]Do you know what causes the underlined differences in error rates between the 32-bit and 64-bit apps?
[/QUOTE]
cudatoolkit_4.0.17_linux_32_ubuntu10.10.run
gpucomputingsdk_4.0.17_linux.run
devdriver_4.1_linux_32_285.05.23.run
with GTX-550Ti
[code]
genefercuda.1.051$ ./GeneferCUDA.cuda4.0.Linux32 -b
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Generalized Fermat Number Bench
2009574^8192+1 Time: 378 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 384 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 437 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 587 us/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 817 us/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 1.33 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 2.58 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 5.25 ms/mul. Err: 1.64e-01 5946413 digits
380742^2097152+1 Time: 10.6 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 22.7 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 46.5 ms/mul. Err: 1.56e-01 45298590 digits
[/code]

rroonnaalldd 2012-01-05 03:37

Hi,
[QUOTE=msft;284826]cudatoolkit_4.0.17_linux_32_ubuntu10.10.run
gpucomputingsdk_4.0.17_linux.run
devdriver_4.1_linux_32_285.05.23.run
with GTX-550Ti
[/QUOTE]

I have installed the same Cuda41rc2-driver and the older Cuda40-sdk. Only difference is my smaller GTS450...


[add]
All your error rates seem to be the same except: [QUOTE]309258^4194304+1 Time: 22.7 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 46.5 ms/mul. Err: 1.56e-01 45298590 digits[/QUOTE]

I have: [QUOTE]309258^4194304+1 Time: 35.8 ms/mul. Err: 4.07e-01 23028076 digits
251196^8388608+1 Time: 73.2 ms/mul. Err: 4.33e-01 45298590 digits[/QUOTE]

axn 2012-01-05 04:29

[QUOTE=AG5BPilot;284813]
Along the way, I found that at N=4194304 Genefer takes a very long time to start up. Looking at the code, I discovered that the really simple code at the beginning that computes the actual value of b^N, by repeatedly squaring b, takes a very long time when N=44194304. About 2 hours on my Core2Quad.[/QUOTE]

[QUOTE=msft;284818]
Or something like this(GMP ?).[/QUOTE]

Use the GMP, Luke :smile:

[CODE]? N=222222^4194304;
? ##
*** last result computed in 2,749 ms.[/CODE]
That's 2.7 seconds on a 32-bit build of PARI (w/ GMP) running on a Core 2 Duo 2GHz. A 64-bit build will be much faster.

It can probably be made even faster by removing the 2's from the (even) base and just doing the relevant number of squarings in the PRP loop.

msft 2012-01-05 07:04

[QUOTE=axn;284842]Use the GMP, Luke :smile:
[/QUOTE]I like this joke.:smile:
Doesn't it have a license issue?

msft 2012-01-06 11:34

1 Attachment(s)
Ver 1.06
Implement GMP.

rroonnaalldd 2012-01-06 11:57

gmp.h missing...

Solved by using apt-cache search gmp...

debrouxl 2012-01-06 12:05

You have to install the external GMP software yourself :smile:

rroonnaalldd 2012-01-06 16:31

Got it working and did a comparison between v1.051 and 1.060:
[QUOTE]boinc@vmware2k-3:~/Cuda/test$ time ./GeneferCUDA_cudart_cuda40_64bit -b

2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 713 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 777 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 8.21 ms/mul. Err: 1.72e-01 5946413 digits
380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 36.8 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 74.7 ms/mul. Err: 1.56e-01 45298590 digits

real 34m31.539s
user 14m8.597s
sys 20m12.152s[/quote]

[quote]boinc@vmware2k-3:~/Cuda/test$ time ./GeneferCUDA_gmp_cuda40_64bit -b

2009574^8192+1 Time: 696 us/mul. Err: 3.82e-01 51636 digits
1632282^16384+1 Time: 714 us/mul. Err: 2.53e-01 101791 digits
1325824^32768+1 Time: 777 us/mul. Err: 2.03e-01 200622 digits
1076904^65536+1 Time: 1.01 ms/mul. Err: 1.88e-01 395325 digits
874718^131072+1 Time: 1.37 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.11 ms/mul. Err: 4.21e-01 1533952 digits
577098^524288+1 Time: 4.09 ms/mul. Err: 2.01e-01 3020555 digits
468750^1048576+1 Time: 8.21 ms/mul. Err: 1.72e-01 5946413 digits
380742^2097152+1 Time: 16.7 ms/mul. Err: 3.63e-01 11703432 digits
309258^4194304+1 Time: 36.8 ms/mul. Err: 1.56e-01 23028076 digits
251196^8388608+1 Time: 74.7 ms/mul. Err: 1.56e-01 45298590 digits

real 34m31.597s
user 14m10.249s
sys 20m10.672s
[/QUOTE]

msft 2012-01-07 04:54

[QUOTE=rroonnaalldd;285090]Got it working and did a comparison between v1.051 and 1.060:[/QUOTE]
"-b" option not use GMP.
Please try normal test.
By the way.
[url]http://www.primegrid.com/forum_thread.php?id=3902&nowrap=true#46251[/url]
AG5BPilot make non GMP version.
We can wait his work.:smile:

AG5BPilot 2012-01-07 05:06

[QUOTE=msft;285194]"-b" option not use GMP.
Please try normal test.
By the way.
[url]http://www.primegrid.com/forum_thread.php?id=3902&nowrap=true#46251[/url]
AG5BPilot make non GMP version.
We can wait his work.:smile:[/QUOTE]

Hopefully it will work. :)

BTW, you can add this line into the residual check:

[code] check( 1248,4194304, "8f985a974820a6d3");[/code]

msft 2012-01-08 02:55

1 Attachment(s)
Ver 1.061
Add:
[code]
check( 1248,4194304, "8f985a974820a6d3");
[/code]

AG5BPilot 2012-01-09 19:24

In preparation for assaulting the higher N numbers, I made some changes to the calculation of b^N at the beginning of the check() function. Here's the new beginning of that function, along with a few new helper functions it uses:

This is the OLD code:

[code]void check(const Int32 b, const UInt32 m, char *expectedResidue)
{
~~snip~~
if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart))
{
fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart); //1.00
fflush(STDOUT);
}
else
{
Nlen = 1;
Na = (UInt32 *)myAlloc(1 * sizeof(UInt32));
Na[0] = (UInt32)b;
for (j = m; j != 1; j /= 2)
{
a = (UInt32 *)myAlloc(2 * Nlen * sizeof(UInt32));
for (i = 0; i != Nlen; ++i)
a[i] = 0;

for (i = 0; i != Nlen; ++i)
{
BOINC_GPU_CHECK_FOR_ABORT(true); // should be good spot for abort check. Runs sqrt of total times
a[i + Nlen] = mul_1_add_n(&a[i], Na[i], Na, Nlen);
}

myFree(Na);
Na = a;
Nlen *= 2;
if (Na[Nlen - 1] == 0) --Nlen;
}
iStart = iOrigStart = lg(Na, Nlen) - 1;
fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart);
fflush(STDOUT);
}[/code]

This is the new code and helper functions:

[code]// transfer input int array into complex array for fft
__global__ void fftIn_kernel(cufftDoubleComplex * const g_fftIn, const UInt32 * const g_intArray, const UInt32 len)
{
int threadID = blockIdx.x * blockDim.x + threadIdx.x;

g_fftIn[threadID*2].y = g_fftIn[threadID*2+1].y = 0.0;
if (threadID < len)
{
g_fftIn[threadID*2].x = (double) (g_intArray[threadID] & 0xFFFF);
g_fftIn[threadID*2+1].x = (double) (g_intArray[threadID] >> 16);
}
else
g_fftIn[threadID*2].x = g_fftIn[threadID*2+1].x = (double) 0.0;

}

// Compute the size of the FFT array for compute_square
// round up to the nearest power of 2, minimum 128
UInt32 fftSize(const UInt32 len)
{
UInt32 size = lpt(len);

return size >= 128 ? size : 128;
}

// In order to compute the value of b^N, we need to repeatedly square b.
// Use cuFFT libraries to do the square
// The resulting value, the address of which is returned by this function
// will be twice as long as the input (i.e., 2*Nlen)
// This function allocates memory for the result -- the calling routine has
// responsibility to eventually free that memory.
//
// Try reducing bits per element to 16 -- maybe we have a loss of precision
// This means doubling the size of the FFT
// Yes, this was the problem -- so do fft's using 16 bit unsigned ints in doubles rather than 32 bits.
//
UInt32 * computeSquare(const UInt32 * Na, const UInt32 Nlen)
{
cufftDoubleComplex * g_fftIn; // FFT input array
cufftDoubleComplex * g_fftOut; // FFT Output array
cufftDoubleComplex * fftOut; // final output (easier to do this on CPU because of carry propagation)
UInt32 * g_intArray; // input/output integer array on the GPU
UInt32 * a; // imaginatively named pointer to the result
UInt32 fftLen = 2*fftSize(2*Nlen); // Size of the FFT -- NLen*2 rounded up to the next power of 2
cufftHandle plan; // the FFT plan
UInt32 i, carry; // index counter and carry
UInt64 l; // temporary 64 bit unsigned int for preserving carry

// allocate GPU memory for the FFT in and out (complex double, 2*Nlen, rounded up to a good FFT size)
cutilSafeCall(cudaMalloc((void**)&g_fftIn, sizeof(*g_fftIn)*fftLen));
cutilSafeCall(cudaMalloc((void**)&g_fftOut, sizeof(*g_fftOut)*fftLen));

// allocate GPU memory to load the integer array (UInt32, same count as the fft array)
cutilSafeCall(cudaMalloc((void**)&g_intArray, sizeof(*g_intArray)*fftLen/2));

// Copy Na into the device memory
cutilSafeCall(cudaMemcpy(g_intArray, Na, sizeof(*g_intArray)*Nlen, cudaMemcpyHostToDevice));


// invoke kernel to build FFT input
fftIn_kernel<<< (fftLen/2)/128, 128 >>> (g_fftIn, g_intArray, Nlen);

// Create the FFT plan
cufftSafeCall(cufftPlan1d(&plan, fftLen, CUFFT_Z2Z, 1));

// Do the forward FFT
cufftSafeCall(cufftExecZ2Z(plan, g_fftIn, g_fftOut, CUFFT_FORWARD));

// invoke kernel to square the resulting complex matrix
mul_kernel<<< fftLen/128,128 >>> (g_fftOut);

// Do the reverse FFT (from 'out' back to 'in')
cufftSafeCall(cufftExecZ2Z(plan, g_fftOut, g_fftIn, CUFFT_INVERSE));

// allocate memory for the result (UInt32, 2*Nlen)
a = (UInt32 *)myAlloc(sizeof(*a)*2*Nlen);

// allocate memory for output and copy the result from GPU
fftOut = (cufftDoubleComplex *)myAlloc(sizeof(*fftOut)*fftLen);
cutilSafeCall(cudaMemcpy(fftOut, g_fftIn, sizeof(*fftOut)*fftLen, cudaMemcpyDeviceToHost));

// Now extract the result from the array of complex variables
// Slower than doing this on the GPU, but much easier on the host because of carry propagation.
// It's only run log n times, so it should be insignificant
for(i = 0, carry = 0; i < 2*Nlen; i++)
{
l = (UInt64) carry + (UInt64) round(fftOut[i*2].x / (double) fftLen) + (((UInt64) round(fftOut[i*2+1].x / (double) fftLen)) << 16);
a[i] = (UInt32) l; // lower 32 bits
carry = (UInt32) (l >> 32);
}

// free the 3 device memory arrays
cutilSafeCall(cudaFree((char *)g_fftIn));
cutilSafeCall(cudaFree((char *)g_fftOut));
cutilSafeCall(cudaFree((char *)g_intArray));
myFree((void*)fftOut);

// Destroy the FFT plan
cufftSafeCall(cufftDestroy(plan));

return a;
}

void check(const Int32 b, const UInt32 m, char *expectedResidue)
{
~~snip~~
if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart))
{
fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart); //1.00
fflush(STDOUT);
}
else
{
Nlen = 1;
Na = (UInt32 *)myAlloc(1 * sizeof(UInt32));
Na[0] = (UInt32)b;
for (j = m; j != 1; j /= 2)
{
a = computeSquare(Na, Nlen); // compute_square will allocate "a"
myFree(Na);
Na = a;
Nlen *= 2;
if (Na[Nlen - 1] == 0) --Nlen;
}
iStart = iOrigStart = lg(Na, Nlen) - 1;
fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart);
fflush(STDOUT);
}[/code]

Also, the mul_1_add_n() function is no longer used and can be safely deleted.

For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second.

Full source and Windows executables will be available shortly.

AG5BPilot 2012-01-09 19:47

Source and executable can be downloaded here: [url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03beta.7z]GeneferCUDA-boinc.1.03beta.7z[/url]

I expected this to be released into production on PrimeGrid's Boinc server, but I want to do more testing first.

This source is 100% compatible with Shoichiro's code up through Shoichiro's v1.049. Modifications from 1.050 onward have NOT been incorporated into my source. It can operate under Boinc, under PRPNet, or standalone.

It's currently built for windows, but there's only a handful of places where the code is incompatible with Linux. (Setting Windows thread priority and variable arguments in a logging function are the two things I remember.)

axn 2012-01-10 09:15

[QUOTE=AG5BPilot;285590]
Also, the mul_1_add_n() function is no longer used and can be safely deleted.

For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second.[/QUOTE]

Now that Na computation is fast, there is probably no need for it to be written to/read from checkpoint.

Doesn't the computesquare call need BOINC_START_GPU/END_GPU?

AG5BPilot 2012-01-10 12:29

[QUOTE=axn;285687]Now that Na computation is fast, there is probably no need for it to be written to/read from checkpoint.[/quote]

A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100. :(

The extra information in the checkpoint file doesn't make that much of a difference.

I'm doing that calculation using each element as a 16 bit unsigned integer, rather than 32 bits, because if you square two 32 bit ints, your result is 64 bits long -- which will NOT fit in a 64 bit float, since the float uses about 10 bits for the two signs and the exponent.

However, this made the FFT have twice as many elements, which is itself causing problems.

[quote]Doesn't the computesquare call need BOINC_START_GPU/END_GPU?[/QUOTE]

It might, if those calls actually had any code in them anymore. Those calls, which are supposed to be boinc critical sections, would actually prevent the task from ever shutting down. Boinc's critical section code appears to be horribly broken. It's only useful when you're protecting a tiny percentage of your code within the critical section. In our case, where only a tiny percentage ISN'T protected (since we spend almost all our time in the GPU), the BOINC client -- which effectively talks to the app via polling, not interrupts (at least as far as the critical section code is concerned) -- never gets the signal through unless it's unbelievably lucky.

BTW, I'm going to pull that build off the website until I find a fix for the problem. At N=4194304, it would produce incorrect results without any indication that the result is wrong. If anyone wants that source, let me know.

axn 2012-01-10 13:07

[QUOTE=AG5BPilot;285706]A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100.[/QUOTE]

You can use a Karatsuba-style multiplication using 3 half-sized FFT multiplies for the final iteration. [url]http://en.wikipedia.org/wiki/Karatsuba_algorithm#The_basic_step[/url] [Note that, while the wiki example is X & Y, when you need to square, all the underlying mults also become squares.]

Also, you can gain a little bit more by dividing out all the twos from the 'b'. E.g., 1248^4194304 = 39^4194304 * (2^5)^4194304 (1248 = 39*2^5). So you only need the FFT to calculate 39^4194304, and multiplication by (2^5)^4194304 is just shifting.

Definitely getting more complex, isn't it? :smile: Well, good luck.

msft 2012-01-10 13:14

Hi ,
I retry "Reduce CPU Time".
[code]
if(n1 > 16384) ithreads = 128;
else ithreads = 16;
for (i = iStart; i != 0; --i)
{
FFTsquareFFT(z, n1);
bt = Na[i / (8 * sizeof(UInt32))] >> (i % (8 * sizeof(UInt32))) & 1;
FFTnextStepGFN(b,invBase,(bt == 0) ? t1 : t2,n1,t3,ithreads,SHIFT);
if(i % 0xF == 0) cutilSafeCall(cudaMemcpy(l_maxErr,g_maxErr,sizeof(float),cudaMemcpyDeviceToHost));
if (!(i & 0xffff))
{
printf("\rTesting %d^%u+1... %d steps to go ", b, m, i);
fflush(stdout);
cutilSafeCall(cudaMemcpy(l_maxErr,g_maxErr,sizeof(float),cudaMemcpyDeviceToHost)); //1.00

[/code]
Ver 1.061:
[code]
24518^262144+1 is a probable prime.
(1150678 digits) (err = 0.0002) (time = 1:27:48) 08:02:55
2347.29user 2912.59system 1:27:47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
336inputs+238616outputs (0major+13265minor)pagefaults 0swaps
[/code]
This code version:
[code]
24518^262144+1 is a probable prime.
(1150678 digits) (err = 0.0002) (time = 1:29:02) 06:35:07
215.20user 25.99system 1:29:02elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+238528outputs (0major+11867minor)pagefaults 0swaps
[/code]
1% slower, but a 95% reduction in CPU time.

firejuggler 2012-01-10 13:35

A 1% increase in time in exchange for an almost free CPU core? Nice.

AG5BPilot 2012-01-10 13:36

[QUOTE=axn;285707]
Definitely getting more complex, isn't it? :smile: Well, good luck.[/QUOTE]

It's actually a heck of a lot simpler than I imagined. First time I looked at this my first thought was "It just can not be that simple. I have to be missing something." I was very surprised when the right numbers started coming out. It's very much like making a smoothie in a blender, reversing the power plug, and having whole fruit emerge. :)

Thanks for the advice, and the luck!

AG5BPilot 2012-01-11 00:24

About 4 different ideas later, I think I've fixed 1.03. I'm going to test it more thoroughly and THEN post the code.

AG5BPilot 2012-01-11 04:23

[url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03a.7z]geneferCUDA-boinc.1.03a.7z[/url]

This has working code for the fast initialization. In the end, I sacrificed speed for precision. It's a whopping one and a half seconds slower than 1.03, but it has the slight advantage of actually working.

I reduced the FFT element size to 8 bits, down from 16, which gave me tons of room to avoid rounding errors, but did push the FFT size above 8 million. I guess it's a good thing GPUs are pretty fast!

msft 2012-01-13 06:44

Linux needs the #127 patch for the 100% CPU time issue.
Windows does not need the #127 patch.

AG5BPilot 2012-01-13 11:45

[QUOTE=msft;286136]Linux needs the #127 patch for the 100% CPU time issue.
Windows does not need the #127 patch.[/QUOTE]

Shoichiro,

I don't understand why your fix would work. Is the only change you made to add an extra DeviceToHost copy of maxErr?

But I did see that the init_device() routine makes four CUDA device calls and, unlike the rest of the software, does not use cutilSafeCall() to check for errors.

In particular, I think this call may be important:

[code]cudaSetDeviceFlags(cudaDeviceBlockingSync);[/code]

If that were to fail for any reason, CUDA would act as if you were in cudaDeviceScheduleSpin mode, and that would cause the 100% CPU utilization that you're seeing.

However, I do not understand why that call would be failing.

I would suggest adding cutilSafeCall() to the API calls inside init_device() function. Seems like a good thing to do regardless of whether this is related to the 100% CPU core problem. I'm adding that to my version of the code.

Mike

msft 2012-01-13 12:35

Hi ,AG5BPilot
[QUOTE=AG5BPilot;286150]
I don't understand why your fix would work. Is the only change you made to add an extra DeviceToHost copy of maxErr?
[/QUOTE]
Yes.
I think that extra copy is what triggers it.
[QUOTE=AG5BPilot;286150]
But, I did see that the init_device() routine makes 4 cuda device calls, and unlike the rest of the software, it does not use cutilSafeCall() to check for errors.
In particular, I think this call may be important:
[code]cudaSetDeviceFlags(cudaDeviceBlockingSync);[/code]
If that were to fail for any reason, CUDA would act as if you were in cudaDeviceScheduleSpin mode, and that would cause the 100% CPU utilization that you're seeing.
However, I do not understand why that call would be failing.
[/QUOTE]
I think that if that call were failing, the extra DeviceToHost copy would not work either.
[QUOTE=AG5BPilot;286150]
I would suggest adding cutilSafeCall() to the API calls inside init_device() function. Seems like a good thing to do regardless of whether this is related to the 100% CPU core problem. I'm adding that to my version of the code.
[/QUOTE]
The results did not change.

Anyway, it is a strange phenomenon.

msft 2012-01-24 11:28

Hi,
Versions 1.06, 1.061 and 1.062 have a fatal error on 32-bit OSes.
I introduced a bug in the GMP code.
Do not use them on a 32-bit OS.
Thank you,

rroonnaalldd 2012-01-25 11:28

[QUOTE=msft;287104]Hi,
Versions 1.06, 1.061 and 1.062 have a fatal error on 32-bit OSes.
I introduced a bug in the GMP code.
Do not use them on a 32-bit OS.
Thank you,[/QUOTE]

Which bug?
I am getting no error here or in llrCUDA. I have Lubuntu 11.10 with kernel 3.0.0-15.26 as my compiling platform.

msft 2012-01-25 12:38

Hi ,
llrCUDA was OK.
[code]
m_b = b;
mpz_init_set_ui(m_Na, m_b);
for (j = m; j != 1; j /= 2)
    mpz_mul(m_Na, m_Na, m_Na);
m_Na_size = mpz_size(m_Na);
m_Na_size_byte = m_Na_size * sizeof(long int);
m_a = (long int*) malloc(m_Na_size * sizeof(long int));
mpz_export(m_a, NULL, 0, m_Na_size_byte, 0, 0, m_Na);
[/code]
Sometimes m_a[m_Na_size-1] == 0 on linux32.
[code]
if(m_a_32[m_a_32_len-1]== 0) m_a_32_len--;
[/code]
This check alone is not enough.

msft 2012-01-26 03:44

Hi,
Here is a mini test program.
[code]
#include <stdio.h>
#include <gmp.h>
int main()
{
    mpz_t m;
    int i, j;
    unsigned long long e_m[10];
    mpz_init_set_ui(m, 2);
    for (j = 0; j != 8; j++)
    {
        mpz_mul(m, m, m);
        for (i = 0; i < 10; i++) e_m[i] = 0;
        mpz_export(e_m, NULL, 0, 80, 0, 0, m);
        printf(" mpz_size(m)=%d ", (int) mpz_size(m));
        for (i = 9; i >= 0; i--) printf(" %llx", e_m[i]);
        printf("\n");
    }
    mpz_clear(m);
    return 0;
}
[/code]
linux32:
[code]
$ ./a.out
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000
mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000
mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0
mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0
mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0
[/code]
linux64:
[code]
$ ./a.out
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100000000
mpz_size(m)=2 0 0 0 0 0 0 0 0 1 0
mpz_size(m)=3 0 0 0 0 0 0 0 1 0 0
mpz_size(m)=5 0 0 0 0 0 1 0 0 0 0
[/code]

rroonnaalldd 2012-01-28 02:12

Hmm, I could not get it to compile:
[QUOTE]boinc@Lubuntu32:~/Cuda$ gcc a
a: In function `main':
a.c:(.text.startup+0x25): undefined reference to `__gmpz_init_set_ui'
a.c:(.text.startup+0x3c): undefined reference to `__gmpz_mul'
a.c:(.text.startup+0x110): undefined reference to `__gmpz_export'
a.c:(.text.startup+0x2ba): undefined reference to `__gmpz_clear'
collect2: ld returned 1 exit status
[/QUOTE]

rogue 2012-01-28 02:23

[QUOTE=rroonnaalldd;287485]Hmm, got it not compiled:[/QUOTE]

Add -lgmp

msft 2012-01-28 03:18

[code]
mpz_mul(m_Na,m_Na,m_Na);
m_Na_size = (mpz_sizeinbase(m_Na,2)+sizeof(long long int)*8-1)/(sizeof(long long int)*8);
m_Na_size_byte = m_Na_size*sizeof(long long int);
[/code]
This is the only answer.

rroonnaalldd 2012-01-28 10:04

[QUOTE=rogue;287486]Add -lgmp[/QUOTE]
:redface:



Lubuntu11.10-32 [QUOTE]boinc@Lubuntu32:~/Cuda$ ./a32.out
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000
mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000
mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0
mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0
mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0
[/QUOTE]

DotschUX-64 based on Ubuntu8.10 [QUOTE]boinc@vmware2k-3:~/Cuda$ ./a32.out
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000
mpz_size(m)=2 0 0 0 0 0 0 0 0 0 100000000
mpz_size(m)=3 0 0 0 0 0 0 0 0 1 0
mpz_size(m)=5 0 0 0 0 0 0 0 1 0 0
mpz_size(m)=9 0 0 0 0 0 1 0 0 0 0
[/QUOTE]

[quote]boinc@vmware2k-3:~/Cuda$ ./a64.out
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 4
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 10000
mpz_size(m)=1 0 0 0 0 0 0 0 0 0 100000000
mpz_size(m)=2 0 0 0 0 0 0 0 0 1 0
mpz_size(m)=3 0 0 0 0 0 0 0 1 0 0
mpz_size(m)=5 0 0 0 0 0 1 0 0 0 0
[/quote]

axn 2012-07-23 08:21

does anyone have any idea of the relative time spent in the "fft square" routine vs. the "next step" kernels in genefer?

AG5BPilot 2012-07-23 11:56

[QUOTE=axn;305570]does anyone have any idea of the relative time spent in the "fft square" routine vs. the "next step" kernels in genefer?[/QUOTE]

I used Nvidia's visual profiler to check the usage. The single biggest consumer of GPU time was the transpose kernel.

However, I don't remember right now what the exact conditions of the test were -- for example, I may have been looking more at the initialization code than at the genefer algorithm -- so I wouldn't put too much weight on those results. But you should be able to get the information you want using their tools.

axn 2012-07-23 15:14

[QUOTE=AG5BPilot;305585]But you should be able to get the information you want using their tools.[/QUOTE]

I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?

AG5BPilot 2012-07-23 16:00

[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]

Not sure when I'll have time, so that's a definite maybe.

AG5BPilot 2012-07-24 21:51

[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]

I figured it was easier, so I put timing code into Genefer's benchmarks:

[quote]C:\GeneferCUDA test\GeneferTest>genefercudatest -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider

Command line: genefercudatest -b

Generalized Fermat Number Bench

FFTsquareGFN=53.46% FFTnextStepGFN=46.54% (raw: 0.0850 0.0740 seconds)
2009574^8192+1 Time: 159 us/mul. Err: 0.1719 51636 digits

FFTsquareGFN=55.00% FFTnextStepGFN=45.00% (raw: 0.0880 0.0720 seconds)
1632282^16384+1 Time: 161 us/mul. Err: 0.1563 101791 digits

FFTsquareGFN=48.95% FFTnextStepGFN=51.05% (raw: 0.0930 0.0970 seconds)
1325824^32768+1 Time: 212 us/mul. Err: 0.1563 200622 digits

FFTsquareGFN=32.93% FFTnextStepGFN=67.07% (raw: 0.0810 0.1650 seconds)
1076904^65536+1 Time: 339 us/mul. Err: 0.1602 395325 digits

FFTsquareGFN=49.48% FFTnextStepGFN=50.52% (raw: 0.0950 0.0970 seconds)
874718^131072+1 Time: 601 us/mul. Err: 0.1563 778813 digits

FFTsquareGFN=43.55% FFTnextStepGFN=56.45% (raw: 0.2530 0.3280 seconds)
710492^262144+1 Time: 1.05 ms/mul. Err: 0.1641 1533952 digits

FFTsquareGFN=29.27% FFTnextStepGFN=70.73% (raw: 0.3990 0.9640 seconds)
577098^524288+1 Time: 1.98 ms/mul. Err: 0.1875 3020555 digits

FFTsquareGFN=21.41% FFTnextStepGFN=78.59% (raw: 0.6850 2.5150 seconds)
468750^1048576+1 Time: 3.97 ms/mul. Err: 0.1563 5946413 digits

FFTsquareGFN=32.74% FFTnextStepGFN=67.26% (raw: 1.8470 3.7940 seconds)
380742^2097152+1 Time: 8.23 ms/mul. Err: 0.1484 11703432 digits

FFTsquareGFN=25.51% FFTnextStepGFN=74.49% (raw: 3.4360 10.0320 seconds)
309258^4194304+1 Time: 16.5 ms/mul. Err: 0.1719 23028076 digits

FFTsquareGFN=21.27% FFTnextStepGFN=78.73% (raw: 6.4150 23.7390 seconds)
100^8388608+1 Time: 33.7 ms/mul. Err: 0.0000 16777217 digits[/quote]

The timings from N=262144 and above seem fairly consistent between runs.

axn 2012-07-25 04:20

[QUOTE=AG5BPilot;305807]I figured it was easier, so I put timing code into Genefer's benchmarks:[/QUOTE]

Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(nlogn) and carry propagation (nextstep) is O(n), so you'd expect that as n grows higher, FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused:

AG5BPilot 2012-07-25 05:09

[QUOTE=axn;305881]Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(nlogn) and carry propagation (nextstep) is O(n), so you'd expect that as n grows higher, FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused:[/QUOTE]

The numbers are definitely not reversed. It's possible it's an instrumentation anomaly -- for example, if either of the two routines' run times is close to the 1 ms clock granularity you might get odd results with ONE of the benchmarks -- but that doesn't appear to be what's happening.

Here's the relevant code (everything except the declarations):

[code]SETCLOCK(clock1);
FFTsquareGFN(z, n1, n2);
squareTime += elapsedTime(clock1);
bt = Na[i / (8 * sizeof(uint32_t))] >> (i % (8 * sizeof(uint32_t))) & 1;
SETCLOCK(clock1);
FFTnextStepGFN(z, b, (bt == 0) ? t1 : t2, n1, n2, t3);
nextTime += elapsedTime(clock1);

...


if (squareTime+nextTime==0.0) squareTime=nextTime=1.0;
printf("\n FFTsquareGFN=%.2f%% FFTnextStepGFN=%.2f%% (raw: %.4f %.4f seconds)\n", 100*squareTime/(squareTime+nextTime), 100*nextTime/(squareTime+nextTime), squareTime, nextTime);[/code]

axn 2012-07-25 05:34

[QUOTE=AG5BPilot;305882]The numbers are definitely not reversed.[/QUOTE]

I know. It was one of those idle hopes :smile:

The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time.

AG5BPilot 2012-07-25 11:51

1 Attachment(s)
[QUOTE=axn;305884]The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time.[/QUOTE]

A GT 520's performance will probably be in the vicinity of what you can get on your CPU running Genefx64.

The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer.

It would be interesting to see if you see the same behavior in the CPU versions.

axn 2012-07-26 03:56

[QUOTE=AG5BPilot;305925]A GT520s performance will probably be in the vicinity of what you can get on your CPU running Genefx64.

The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer.

It would be interesting to see if you see the same behavior in the CPU versions.[/QUOTE]

I built GenefX64 under linux and the numbers were going in the right direction. Unfortunately, I could only run it up to 1M, since the initialization was taking longer and longer. I'll hack it a bit to overcome that and post the full details. Summary: it starts out at roughly 50:50, but by 1M it is 80:20 (sq:nextstep). With each increase in N, the square becomes a more and more dominant factor.
The GPU numbers are definitely an anomaly.

AG5BPilot 2012-07-26 04:41

Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but does yield more accurate measurements:

[quote]C:\GeneferCUDA test\GeneferTest>genefercudaTest.exe -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider

Command line: genefercudaTest.exe -b

Generalized Fermat Number Bench

FFTsquareGFN=46.32% FFTnextStepGFN=53.68% (raw: 0.2520 0.2920 seconds)
2009574^8192+1 Time: 546 us/mul. Err: 0.1719 51636 digits

FFTsquareGFN=47.40% FFTnextStepGFN=52.60% (raw: 0.2830 0.3140 seconds)
1632282^16384+1 Time: 597 us/mul. Err: 0.1563 101791 digits

FFTsquareGFN=47.75% FFTnextStepGFN=52.25% (raw: 0.3190 0.3490 seconds)
1325824^32768+1 Time: 668 us/mul. Err: 0.1563 200622 digits

FFTsquareGFN=48.98% FFTnextStepGFN=51.02% (raw: 0.4310 0.4490 seconds)
1076904^65536+1 Time: 880 us/mul. Err: 0.1602 395325 digits

FFTsquareGFN=61.86% FFTnextStepGFN=38.14% (raw: 0.6650 0.4100 seconds)
874718^131072+1 Time: 1.08 ms/mul. Err: 0.1563 778813 digits

FFTsquareGFN=59.12% FFTnextStepGFN=40.88% (raw: 0.9820 0.6790 seconds)
710492^262144+1 Time: 1.66 ms/mul. Err: 0.1641 1533952 digits

FFTsquareGFN=63.43% FFTnextStepGFN=36.57% (raw: 1.7070 0.9840 seconds)
577098^524288+1 Time: 2.69 ms/mul. Err: 0.1875 3020555 digits

FFTsquareGFN=65.57% FFTnextStepGFN=34.43% (raw: 3.2610 1.7120 seconds)
468750^1048576+1 Time: 4.97 ms/mul. Err: 0.1563 5946413 digits

FFTsquareGFN=74.32% FFTnextStepGFN=25.68% (raw: 6.7420 2.3290 seconds)
380742^2097152+1 Time: 9.07 ms/mul. Err: 0.1484 11703432 digits

FFTsquareGFN=75.52% FFTnextStepGFN=24.48% (raw: 13.3390 4.3230 seconds)
309258^4194304+1 Time: 17.7 ms/mul. Err: 0.1719 23028076 digits

FFTsquareGFN=78.22% FFTnextStepGFN=21.78% (raw: 27.3330 7.6120 seconds)
100^8388608+1 Time: 34.9 ms/mul. Err: 0.0000 16777217 digits[/quote]

axn 2012-07-26 05:46

[QUOTE=AG5BPilot;306037]Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but does yield more accurate measurements:[/QUOTE]

Phew. That's more like it! It is in line with the CPU figures.

Anyway, once I have my GT520, I'll do some visual profiling and see if any improvement can be wrung out of the code. But I'm not holding my breath for any big improvements.

prime7989 2012-10-12 02:20

Is genefercuda correct?
 
I tried genefercuda on a GTX 680 Nvidia GPU (AMD FX8120 CPU, 32 GB RAM) for
the Fermat number F19, which is known to be composite, with b=2 and N=524288=2^19,
and the program prints out that it is probably prime.
Can anyone else try this out?
A...

Batalov 2012-10-12 02:47

[QUOTE=prime7989;314331]I tried genefercuda on a GTX 680 Nvidia GPU (AMD FX8120 CPU, 32 GB RAM) for
the Fermat number F19, which is known to be composite, with b=2 and N=524288=2^19,
and the program prints out that it is probably prime.
Can anyone else try this out?
A...[/QUOTE]
genefercuda is running a PRP test in base 2 (just checked the source), which is fine for checking any GF(b), b>2 (and not a power of 2, which would be pointless).
Of course, all Fermat numbers are 2-PRPs.

axn 2012-10-12 03:35

[QUOTE=Batalov;314332]genefercuda is running a PRP test in base 2 (just checked the source), which is fine for checking any GF(b), b>2 (and not a power of 2, which would be pointless).
Of course, all Fermat numbers are 2-PRPs.[/QUOTE]

It should be relatively trivial to change the PRP-base (to say, 3). I will try it out later.

Also, 2^524288 = 65536^32768. So a much smaller FFT may be used.

axn 2012-10-12 13:52

[QUOTE=axn;314339]It should be relatively trivial to change the PRP-base (to say, 3). I will try it out later.

Also, 2^524288 = 65536^32768. So a much smaller FFT may be used.[/QUOTE]

So, I tried this out... and you wouldn't believe what I found out:
[CODE]65536^32768+1 is composite. (RES=0df151a4161b6c99) (157827 digits) (err = 0.0005) (time = 0:07:24)[/CODE]

:shock:

LaurV 2012-10-13 12:19

That's creepy! Don't say all the tests need to be redone? :ouch:

msft 2012-10-15 01:26

[QUOTE=LaurV;314489]That's creepy! Don't say all the tests need to be redone? :ouch:[/QUOTE]
I bet on your side. :smile:

LaurV 2012-10-15 09:12

Oh, I forgot about this; my post was kind of a joke, based on the fact that the incriminated number is F19, which is known to be composite. If axn did a 3-PRP test, it was normal for it to come out composite (it is like a Pepin test... squared; I might be wrong here, but this should never return a false result, i.e. Fx is a 3-PRP if and only if it is prime).

msft 2012-10-16 06:35

base=3
[code]
$ diff genefer.c genefer.c.ORG
1757c1757
< CONST double t1 = 2.0 / m, t2 = 3 * t1;
---
> CONST double t1 = 2.0 / m, t2 = 2 * t1;
1781c1781
< z = FFTinitGFN(m, 3.0, &n1, &n2);
---
> z = FFTinitGFN(m, 2.0, &n1, &n2);
$ cc genefer.c -lm
$ ./a.out -t
GeneFer 1.3 (standard) Copyright (C) 2001-2002, Yves GALLOT
A program for finding large probable generalized Fermat primes.

Usage: GeneFer -b run bench
GeneFer -t run test
GeneFer <filename> test <filename>
GeneFer use interactive mode

100050^16+1 is a probable prime. (err = 1.53e-05)
100014^32+1 is a probable prime. (err = 3.05e-05)
100234^64+1 is a probable prime. (err = 4.58e-05)
100032^128+1 is a probable prime. (err = 9.16e-05)
5684328^256+1 is a probable composite (err = 5.00e-01).
4619000^512+1 is a probable composite (err = 5.00e-01).
3752220^1024+1 is a probable composite (err = 5.00e-01).
[/code]

Batalov 2012-10-16 06:45

Aaah! msft is [URL="http://ru.wikipedia.org/wiki/%D0%9C%D0%BE%D1%80%D1%80%D0%B0"]Morra[/URL]!
(...I suddenly expect Morra defence in the Vote chess thread. No. I did expect it before. That's when I looked up Morra.
Too late. ;-) )

"The Groke" on the other hand reminds me of Piet Hein's Grooks.
[QUOTE]
THE ROAD TO WISDOM?

Well, it's plain
and simple to express.
Err and err and err again,
but less and less and less.[/QUOTE]

prime7989 2013-03-17 04:54

[QUOTE=axn;314396]So, I tried this out... and you wouldn't believe what I found out:
[CODE]65536^32768+1 is composite. (RES=0df151a4161b6c99) (157827 digits) (err = 0.0005) (time = 0:07:24)[/CODE]:shock:[/QUOTE]
Dear axn,
I dislike reading code, so which variable did you change to make it a 3-PRP test
in GeneferCUDA.cu? If you could give me the line numbers of the function and variable, I would be very grateful!
Thank you,
prime7989

axn 2013-03-17 08:23

[QUOTE=prime7989;333674]Dear axn,
I dislike reading code, so which variable did you change to make it a 3-PRP test
in GeneferCUDA.cu? If you could give me the line numbers of the function and variable, I would be very grateful!
Thank you,
prime7989[/QUOTE]

There are many versions of G/CUDA out there. If you can post the one you want to look at (as a forum attachment), I can tell you. FWIW, the change is easy and localised to the "check" function.

prime7989 2013-03-17 09:28

Extend GeneferCUDA pseudo-primality checks
 
1 Attachment(s)
[QUOTE=axn;333686]There are many versions of G/CUDA out there. If you can post the one you want to look at (as a forum attachment), I can tell you. FWIW, the change is easy and localised to "check" function.[/QUOTE]


Thank you axn,
I would also like to change GeneferCUDA.cu so that instead of:
n=b^N+1
the input is n=b^N+c, where c is a non-zero integer such that |c|<Log(n).
Thank you,
Prime7989

axn 2013-03-17 15:57

[QUOTE=prime7989;333689]Thank you axn,
I would also like to change GeneferCUDA.cu so that instead of:
n=b^N+1
the input is n=b^N+c, where c is a non-zero integer such that |c|<Log(n).
Thank you,
Prime7989[/QUOTE]

For changing the base used for test, you need to modify the two lines in check() function from:
[CODE]const double t1 = 1.0,t2 = 2.0,t3 = 2.0 / m;
<snip>
z = FFTinitGFN(m, 2.0, &n1,&SHIFT);
[/CODE]
to:
[CODE]const double t1 = 1.0,t2 = prp_base,t3 = 2.0 / m;
<snip>
z = FFTinitGFN(m, prp_base, &n1,&SHIFT);[/CODE]

As far as testing b^N+c, that is "above my paygrade", so to speak. :sad:

prime7989 2013-03-31 04:36

c=-1?
 
Dear axn,
How about c=-1? That is, change the code in geneferCUDA so that c=-1, i.e.
the input goes from b^m+1 to b^m-1.
prime7989

Batalov 2013-03-31 18:09

This is easy to do.
The whole code will become:
[CODE]//...some headers
// skip to page 10
j = m;
while (j > 0)
{
if (j > 1 && j & 1)
{
fprintf(stderr, "Cannot test. m is not a power of 2.\n");
return;
}
j >>= 1;
}
printf("Testing %d^%u-1...\r", b, m);

printf("\r%d^%u-1 is composite.", b, m);
//... some footer
[/CODE]

prime7989 2013-04-01 03:43

2^p-1
 
Dear Batalov, axn,
Of course, looking at the code, I know my previous post was incomplete.
What I meant was the following mods to geneferCUDA:
1) Specify the PRP base as a #define macro or as a run-time input from the user (easy to do).
Give the user the option to run one of two versions:
2a) Option A: make it a base 2, 3, 5 PRP test for b^m+1, where m is a power of 2.
2b) Option B: make it a base-3 PRP test for b^q-1, where b=2 and q is a prime, so as to accommodate Mersenne numbers as well as GFNs.
Does anyone have recent code for geneferCUDA.cu > ver 1.061, or is anyone willing to modify the code with these improvements and post it?
prime7989

Batalov 2013-04-02 17:33

GeNeFer (and geneferCUDA, which is its CUDA port) are written specifically to test Generalized Fermat Numbers. That is, m must be a power of two. Your question about changing the PRP test's base was already addressed, even though it was somewhat moot, because prp_base=2 is fine for any b>2, while for b=2 the (true Fermat) candidates have already been searched to death by many people before you.

For other applications, there are other programs, e.g. CUDALucas (prime m and c=-1). Why would someone bother to modify a program so that it would become another, already written, specialized program?

If you want a universal program that does everything that other programs can do, then that's easy - make a script that would call the appropriate binary. That would be much more robust than keeping up with all disparate sources, putting them together and then recompiling when any component had changed.

prime7989 2013-06-07 07:00

[QUOTE=Batalov;335856]GeNeFer (and geneferCUDA which is its CUDA port) are written specifically to test Generalized Fermat Numbers. That is, m must be a power of two. Your question about changing the PRP test's base was already addressed, even though it was somewhat moot, because test prp_base=2 is fine for any b>2, while for b=2 the (true Fermat) candidates are already searched to death by many people before you.

For other applications, there are other programs, e.g. CUDALucas (prime m and c=-1). Why would someone bother to modify a program so that it would become another, already written, specialized program?

If you want a universal program that does everything that other programs can do, then that's easy - make a script that would call the appropriate binary. That would be much more robust than keeping up with all disparate sources, putting them together and then recompiling when any component had changed.[/QUOTE]



GeneferCUDA is a base-2 Pepin's test as far as I can ascertain (please correct me if I am wrong). To allow Fermat number testing with the GTX TITAN, we can make it a base-3 or base-5 deterministic Pepin's test. To allow tests for Mersenne numbers with a Pepin-style test, GeneferCUDA should be run in base 3 with the modification from c=+1 (Fermat numbers) to c=-1 (Mersenne numbers).
Furthermore, CUDALucas is a Lucas-Lehmer test for Mersenne numbers only and not a Pepin's test.
Al

msft 2013-07-24 17:32

1 Attachment(s)
Hi,
I ported genefer to RADEON.
[CODE]
+ time ./GeneferCUDA -t
GeneferCUDA 2.2.1 (CUDA) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.

Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde

Build Options are : -D KHR_DP_EXTENSION
2030234^8192+1 is a probable prime.
(51672 digits) (err = 0.1719) (time = 0:03:01) 00:52:08
[/CODE]

msft 2013-07-25 18:51

1 Attachment(s)
Hi, new version.
HW: Celeron + RADEON 7750 (low end)
1) install ubuntu Desktop 64bit 12.04 LTS
2) install amd-driver-installer-catalyst-13-4-linux-x86.x86_64.zip
3) install AMD-APP-SDK-v2.8-lnx64.tgz
4) install clAmdFft-1.10.321.tar.gz
5) install g++
6) install libglu1-mesa-dev
[CODE]
$ pwd
/opt/AMDAPP/samples/opencl/genefer/0.05
$ tar -xvf 0.04.tar.bz2
0.04/
0.04/clFFTPlans.cpp
0.04/Makefile
0.04/MatrixMulDouble_Kernels.cl
0.04/run.sh
0.04/clFFTPlans.h
0.04/GeneferCUDA.c
0.04/MatrixMulDouble.hpp
$ cd 0.04
0.04$ sh -x ./run.sh
g++ -Wpointer-arith -Wfloat-equal -g -O3 -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/ -g3 -ffor-scope -I ../../../../../samples/opencl/SDKUtil/include -I ../../../../../samples/bolt/BoltUtil/include -I ../../../../../samples/C++Amp/AmpUtil/include -I "/opt/AMDAPP/include" -I ../../../../../include -o build/debug/x86_64//GeneferCUDA.o -c GeneferCUDA.c
g++ -Wpointer-arith -Wfloat-equal -g -O3 -I /opt/AMDAPP/include/ -I /opt/clAmdFft-1.10.321/include/ -g3 -ffor-scope -I ../../../../../samples/opencl/SDKUtil/include -I ../../../../../samples/bolt/BoltUtil/include -I ../../../../../samples/C++Amp/AmpUtil/include -I "/opt/AMDAPP/include" -I ../../../../../include -o build/debug/x86_64//clFFTPlans.o -c clFFTPlans.cpp
g++ -o build/debug/x86_64/GeneferCUDA build/debug/x86_64//GeneferCUDA.o build/debug/x86_64//clFFTPlans.o -lpthread -ldl -L/usr/X11R6/lib -g /opt/clAmdFft-1.10.321/lib64/libclAmdFft.Runtime.so -lOpenCL -lgmp -lSDKUtil -lOpenCL -L../../../../../lib/x86_64 -L../../../../../TempSDKUtil/lib/x86_64 -L"/opt/AMDAPP/lib/x86_64"
+ export LD_LIBRARY_PATH=:/opt/clAmdFft-1.10.321/lib64/
+ time ./GeneferCUDA -t
GeneferCUDA 2.2.1 (GPU) based on Genefer v2.2.1
Copyright (C) 2001-2003, Yves Gallot (v1.3)
Copyright (C) 2009, 2011 Mark Rodenkirch, David Underbakke (v2.2.1)
Copyright (C) 2010, 2011, Shoichiro Yamada (CUDA)
A program for finding large probable generalized Fermat primes.
Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde
Build Options are : -D KHR_DP_EXTENSION
2030234^8192+1 is a probable prime.
(51672 digits) (err = 0.1738) (time = 0:01:47) 02:41:09
Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde

Build Options are : -D KHR_DP_EXTENSION
572186^131072+1 is a probable prime.
(754652 digits) (err = 0.0801) (time = 1:02:35) 03:43:44
[/CODE]

LaurV 2013-07-26 04:51

[bitchy]Who wants to find Fermat primes which are "large probable" or "probable generalized"? (or both)
Wouldn't "large, generalized Fermat (probable) primes" (native speakers can help here!) sound better?
[/bitchy]

All in all, well done msft! kutgw! (you spoiled us already)

msft 2013-07-26 09:40

1 Attachment(s)
Hi, New Version.
HD7750:
[CODE]
0.10$ ./GeneferCUDA -b
874718^131072+1 Time: 1.66 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.99 ms/mul. Err: 5.00e-01 1533952 digits
577098^524288+1 Time: 4.88 ms/mul. Err: 5.00e-01 3020555 digits
468750^1048576+1 Time: 9.83 ms/mul. Err: 5.00e-01 5946413 digits
380742^2097152+1 Time: 18.9 ms/mul. Err: 5.00e-01 11703432 digits
309258^4194304+1 Time: 40 ms/mul. Err: 5.00e-01 23028076 digits
251196^8388608+1 Time: 80.4 ms/mul. Err: 5.00e-01 45298590 digits
0.10$ ./GeneferCUDA -t
572186^131072+1 is a probable prime.
(754652 digits) (err = 0.0762) (time = 1:10:03) 18:36:25
[/CODE]
PS: clAmdFft has this issue.
[url]http://devgurus.amd.com/message/1298209#1298209[/url]

kracker 2013-07-26 13:16

[QUOTE=LaurV;347410][bitchy]Who wants to find Fermat primes which are "large probable" or "probable generalized"? (or both)
Wouldn't "large, generalized Fermat (probable) primes" (native speakers can help here!) sound better?
[/bitchy]

All in all, well done msft! kutgw! (you spoiled us already)[/QUOTE]
+1
Indeed.. :razz:

msft 2013-07-28 03:58

1 Attachment(s)
New version; the conversion is finished.
[CODE]
572186^131072+1 is a probable prime.
(754652 digits) (err = 0.0742) (time = 0:54:27) 12:09:13
0.13$ ./GeneferCUDA -b
1076904^65536+1 Time: 809 us/mul. Err: 5.00e-01 395325 digits
874718^131072+1 Time: 1.27 ms/mul. Err: 3.47e-01 778813 digits
710492^262144+1 Time: 2.56 ms/mul. Err: 5.00e-01 1533952 digits
577098^524288+1 Time: 4.73 ms/mul. Err: 5.00e-01 3020555 digits
468750^1048576+1 Time: 9.28 ms/mul. Err: 5.00e-01 5946413 digits
380742^2097152+1 Time: 18.1 ms/mul. Err: 5.00e-01 11703432 digits
309258^4194304+1 Time: 38.2 ms/mul. Err: 5.00e-01 23028076 digits
251196^8388608+1 Time: 76.4 ms/mul. Err: 5.00e-01 45298590 digits
[/CODE]

