mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2012-07-23, 11:56   #144
AG5BPilot (Dec 2011, New York, U.S.A.)
Quote:
Originally Posted by axn View Post
Does anyone have any idea of the relative time spent in the "fft square" routine vs the "next step" kernels in genefer?
I used NVIDIA's Visual Profiler to check the usage. The single biggest consumer of GPU time was the transpose kernel.

However, I don't remember right now what the exact conditions of the test were -- for example, I may have been looking more at the initialization code than at the genefer algorithm -- so I wouldn't put too much weight on those results. But you should be able to get the information you want using their tools.
2012-07-23, 15:14   #145
axn (Jun 2003)

Quote:
Originally Posted by AG5BPilot View Post
But you should be able to get the information you want using their tools.
I would... if I had any DP-capable cards. Would you be able to run a quick test and figure out the execution profile of a WR test?
2012-07-23, 16:00   #146
AG5BPilot

Quote:
Originally Posted by axn View Post
I would... if I had any DP-capable cards. Would you be able to run a quick test and figure out the execution profile of a WR test?
Not sure when I'll have time, so that's a definite maybe.
2012-07-24, 21:51   #147
AG5BPilot

Quote:
Originally Posted by axn View Post
I would... if I had any DP-capable cards. Would you be able to run a quick test and figure out the execution profile of a WR test?
I figured it was easier, so I put timing code into Genefer's benchmarks:

Quote:
C:\GeneferCUDA test\GeneferTest>genefercudatest -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider

Command line: genefercudatest -b

Generalized Fermat Number Bench

FFTsquareGFN=53.46% FFTnextStepGFN=46.54% (raw: 0.0850 0.0740 seconds)
2009574^8192+1 Time: 159 us/mul. Err: 0.1719 51636 digits

FFTsquareGFN=55.00% FFTnextStepGFN=45.00% (raw: 0.0880 0.0720 seconds)
1632282^16384+1 Time: 161 us/mul. Err: 0.1563 101791 digits

FFTsquareGFN=48.95% FFTnextStepGFN=51.05% (raw: 0.0930 0.0970 seconds)
1325824^32768+1 Time: 212 us/mul. Err: 0.1563 200622 digits

FFTsquareGFN=32.93% FFTnextStepGFN=67.07% (raw: 0.0810 0.1650 seconds)
1076904^65536+1 Time: 339 us/mul. Err: 0.1602 395325 digits

FFTsquareGFN=49.48% FFTnextStepGFN=50.52% (raw: 0.0950 0.0970 seconds)
874718^131072+1 Time: 601 us/mul. Err: 0.1563 778813 digits

FFTsquareGFN=43.55% FFTnextStepGFN=56.45% (raw: 0.2530 0.3280 seconds)
710492^262144+1 Time: 1.05 ms/mul. Err: 0.1641 1533952 digits

FFTsquareGFN=29.27% FFTnextStepGFN=70.73% (raw: 0.3990 0.9640 seconds)
577098^524288+1 Time: 1.98 ms/mul. Err: 0.1875 3020555 digits

FFTsquareGFN=21.41% FFTnextStepGFN=78.59% (raw: 0.6850 2.5150 seconds)
468750^1048576+1 Time: 3.97 ms/mul. Err: 0.1563 5946413 digits

FFTsquareGFN=32.74% FFTnextStepGFN=67.26% (raw: 1.8470 3.7940 seconds)
380742^2097152+1 Time: 8.23 ms/mul. Err: 0.1484 11703432 digits

FFTsquareGFN=25.51% FFTnextStepGFN=74.49% (raw: 3.4360 10.0320 seconds)
309258^4194304+1 Time: 16.5 ms/mul. Err: 0.1719 23028076 digits

FFTsquareGFN=21.27% FFTnextStepGFN=78.73% (raw: 6.4150 23.7390 seconds)
100^8388608+1 Time: 33.7 ms/mul. Err: 0.0000 16777217 digits
The timings from N=262144 and above seem fairly consistent between runs.
2012-07-25, 04:20   #148
axn

Quote:
Originally Posted by AG5BPilot View Post
I figured it was easier, so I put timing code into Genefer's benchmarks:
Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. The FFT is O(n log n) and carry propagation (nextstep) is O(n), so you'd expect that as n grows, the FFT takes a larger fraction of the time, not a smaller one. Something is rotten.
2012-07-25, 05:09   #149
AG5BPilot

Quote:
Originally Posted by axn View Post
Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. The FFT is O(n log n) and carry propagation (nextstep) is O(n), so you'd expect that as n grows, the FFT takes a larger fraction of the time, not a smaller one. Something is rotten.
The numbers are definitely not reversed. It could be an instrumentation anomaly -- for example, if either routine's run time is close to the 1 ms clock resolution you might get odd results with one of the benchmarks -- but that doesn't appear to be what's happening.

Here's the relevant code (everything except the declarations):

Code:
// Time the two kernels separately; squareTime and nextTime accumulate
// across iterations.
SETCLOCK(clock1);
FFTsquareGFN(z, n1, n2);
squareTime += elapsedTime(clock1);

// Extract bit i of the exponent to select the multiplier table.
bt = Na[i / (8 * sizeof(uint32_t))] >> (i % (8 * sizeof(uint32_t))) & 1;

SETCLOCK(clock1);
FFTnextStepGFN(z, b, (bt == 0) ? t1 : t2, n1, n2, t3);
nextTime += elapsedTime(clock1);

...

// Avoid division by zero, then report each kernel's share of the total.
if (squareTime + nextTime == 0.0) squareTime = nextTime = 1.0;
printf("\n FFTsquareGFN=%.2f%%  FFTnextStepGFN=%.2f%%  (raw: %.4f %.4f seconds)\n",
       100 * squareTime / (squareTime + nextTime),
       100 * nextTime / (squareTime + nextTime),
       squareTime, nextTime);
2012-07-25, 05:34   #150
axn

Quote:
Originally Posted by AG5BPilot View Post
The numbers are definitely not reversed.
I know. It was one of those idle hopes.

The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (the cheapest DP-capable part I could find) to explore this further. Hopefully, before it arrives, I will have finished up the Linux port of the sieve and will have some free time.
2012-07-25, 11:51   #151
AG5BPilot

Quote:
Originally Posted by axn View Post
The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (the cheapest DP-capable part I could find) to explore this further. Hopefully, before it arrives, I will have finished up the Linux port of the sieve and will have some free time.
A GT 520's performance will probably be in the vicinity of what you can get from your CPU running Genefx64.

The ONLY code changes needed to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer.

It would be interesting to see if you see the same behavior in the CPU versions.
Attached Files
File Type: zip genefertest.zip (12.3 KB, 81 views)
2012-07-26, 03:56   #152
axn

Quote:
Originally Posted by AG5BPilot View Post
A GT 520's performance will probably be in the vicinity of what you can get from your CPU running Genefx64.

The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer.

It would be interesting to see if you see the same behavior in the CPU versions.
I built GenefX64 under Linux and the numbers go in the right direction. Unfortunately, I could only run it up to 1M, since the initialization was taking longer and longer. I'll hack it a bit to overcome that and post the full details. Summary: it starts out at roughly 50:50, but by 1M it is 80:20 (square:nextstep). With each increase in N, the square becomes a more and more dominant factor.
The GPU numbers are definitely an anomaly.
2012-07-26, 04:41   #153
AG5BPilot

Figured it out. It was the parallel processing of the square and nextstep steps overlapping each other that was skewing the metrics. Serializing the steps slowed the program down a bit, but it does yield more accurate measurements:

Quote:
C:\GeneferCUDA test\GeneferTest>genefercudaTest.exe -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider

Command line: genefercudaTest.exe -b

Generalized Fermat Number Bench

FFTsquareGFN=46.32% FFTnextStepGFN=53.68% (raw: 0.2520 0.2920 seconds)
2009574^8192+1 Time: 546 us/mul. Err: 0.1719 51636 digits

FFTsquareGFN=47.40% FFTnextStepGFN=52.60% (raw: 0.2830 0.3140 seconds)
1632282^16384+1 Time: 597 us/mul. Err: 0.1563 101791 digits

FFTsquareGFN=47.75% FFTnextStepGFN=52.25% (raw: 0.3190 0.3490 seconds)
1325824^32768+1 Time: 668 us/mul. Err: 0.1563 200622 digits

FFTsquareGFN=48.98% FFTnextStepGFN=51.02% (raw: 0.4310 0.4490 seconds)
1076904^65536+1 Time: 880 us/mul. Err: 0.1602 395325 digits

FFTsquareGFN=61.86% FFTnextStepGFN=38.14% (raw: 0.6650 0.4100 seconds)
874718^131072+1 Time: 1.08 ms/mul. Err: 0.1563 778813 digits

FFTsquareGFN=59.12% FFTnextStepGFN=40.88% (raw: 0.9820 0.6790 seconds)
710492^262144+1 Time: 1.66 ms/mul. Err: 0.1641 1533952 digits

FFTsquareGFN=63.43% FFTnextStepGFN=36.57% (raw: 1.7070 0.9840 seconds)
577098^524288+1 Time: 2.69 ms/mul. Err: 0.1875 3020555 digits

FFTsquareGFN=65.57% FFTnextStepGFN=34.43% (raw: 3.2610 1.7120 seconds)
468750^1048576+1 Time: 4.97 ms/mul. Err: 0.1563 5946413 digits

FFTsquareGFN=74.32% FFTnextStepGFN=25.68% (raw: 6.7420 2.3290 seconds)
380742^2097152+1 Time: 9.07 ms/mul. Err: 0.1484 11703432 digits

FFTsquareGFN=75.52% FFTnextStepGFN=24.48% (raw: 13.3390 4.3230 seconds)
309258^4194304+1 Time: 17.7 ms/mul. Err: 0.1719 23028076 digits

FFTsquareGFN=78.22% FFTnextStepGFN=21.78% (raw: 27.3330 7.6120 seconds)
100^8388608+1 Time: 34.9 ms/mul. Err: 0.0000 16777217 digits
2012-07-26, 05:46   #154
axn

Quote:
Originally Posted by AG5BPilot View Post
Figured it out. It was the parallel processing of the square and nextstep steps overlapping each other that was skewing the metrics. Serializing the steps slowed the program down a bit, but it does yield more accurate measurements:
Phew. That's more like it! It is in line with the CPU figures.

Anyway, once I have my GT 520, I'll do some visual profiling and see if any improvement can be wrung out of the code. But I'm not holding my breath for any big improvements.