[QUOTE=axn;305570]does anyone have any idea of the relative time spent of "fft square" routine vs "next step" kernels in genefer?[/QUOTE]
I used Nvidia's visual profiler to check the usage. The single biggest consumer of GPU time was the transpose kernel. However, I don't remember right now what the exact conditions of the test were -- for example, I may have been looking more at the initialization code than at the genefer algorithm -- so I wouldn't put too much weight on those results. But you should be able to get the information you want using their tools.
[QUOTE=AG5BPilot;305585]But you should be able to get the information you want using their tools.[/QUOTE]
I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?
[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]
Not sure when I'll have time, so that's a definite maybe.
[QUOTE=axn;305606]I would... if I had any DP capable cards :sad: Would you be able to run a quick test and figure out the execution profile of a WR test?[/QUOTE]
I figured it was easier, so I put timing code into Genefer's benchmarks:
[quote]C:\GeneferCUDA test\GeneferTest>genefercudatest -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercudatest -b
Generalized Fermat Number Bench
FFTsquareGFN=53.46% FFTnextStepGFN=46.54% (raw: 0.0850 0.0740 seconds)
2009574^8192+1 Time: 159 us/mul. Err: 0.1719 51636 digits
FFTsquareGFN=55.00% FFTnextStepGFN=45.00% (raw: 0.0880 0.0720 seconds)
1632282^16384+1 Time: 161 us/mul. Err: 0.1563 101791 digits
FFTsquareGFN=48.95% FFTnextStepGFN=51.05% (raw: 0.0930 0.0970 seconds)
1325824^32768+1 Time: 212 us/mul. Err: 0.1563 200622 digits
FFTsquareGFN=32.93% FFTnextStepGFN=67.07% (raw: 0.0810 0.1650 seconds)
1076904^65536+1 Time: 339 us/mul. Err: 0.1602 395325 digits
FFTsquareGFN=49.48% FFTnextStepGFN=50.52% (raw: 0.0950 0.0970 seconds)
874718^131072+1 Time: 601 us/mul. Err: 0.1563 778813 digits
FFTsquareGFN=43.55% FFTnextStepGFN=56.45% (raw: 0.2530 0.3280 seconds)
710492^262144+1 Time: 1.05 ms/mul. Err: 0.1641 1533952 digits
FFTsquareGFN=29.27% FFTnextStepGFN=70.73% (raw: 0.3990 0.9640 seconds)
577098^524288+1 Time: 1.98 ms/mul. Err: 0.1875 3020555 digits
FFTsquareGFN=21.41% FFTnextStepGFN=78.59% (raw: 0.6850 2.5150 seconds)
468750^1048576+1 Time: 3.97 ms/mul. Err: 0.1563 5946413 digits
FFTsquareGFN=32.74% FFTnextStepGFN=67.26% (raw: 1.8470 3.7940 seconds)
380742^2097152+1 Time: 8.23 ms/mul. Err: 0.1484 11703432 digits
FFTsquareGFN=25.51% FFTnextStepGFN=74.49% (raw: 3.4360 10.0320 seconds)
309258^4194304+1 Time: 16.5 ms/mul. Err: 0.1719 23028076 digits
FFTsquareGFN=21.27% FFTnextStepGFN=78.73% (raw: 6.4150 23.7390 seconds)
100^8388608+1 Time: 33.7 ms/mul. Err: 0.0000 16777217 digits[/quote]
The timings from N=262144 and above seem fairly consistent between runs.
[QUOTE=AG5BPilot;305807]I figured it was easier, so I put timing code into Genefer's benchmarks:[/QUOTE]
Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(nlogn) and carry propagation (nextstep) is O(n), so you'd expect that as n grows higher, FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused:
[QUOTE=axn;305881]Yeesh! That thing is scaling the wrong way. I hope you made a mistake and flipped the numbers. FFT is O(nlogn) and carry propagation (nextstep) is O(n), so you'd expect that as n grows higher, FFT will take a larger fraction of the time -- not smaller! Something is rotten. :confused:[/QUOTE]
The numbers are definitely not reversed. It's possible it's an instrumentation anomaly -- for example, if either routine's run time is close to the clock's tick resolution of 1 ms you might get odd results on ONE of the benchmarks -- but that doesn't appear to be what's happening. Here's the relevant code (everything except the declarations):
[code]SETCLOCK(clock1);
FFTsquareGFN(z, n1, n2);
squareTime += elapsedTime(clock1);

bt = Na[i / (8 * sizeof(uint32_t))] >> (i % (8 * sizeof(uint32_t))) & 1;

SETCLOCK(clock1);
FFTnextStepGFN(z, b, (bt == 0) ? t1 : t2, n1, n2, t3);
nextTime += elapsedTime(clock1);
...
if (squareTime + nextTime == 0.0)
    squareTime = nextTime = 1.0;
printf("\n FFTsquareGFN=%.2f%% FFTnextStepGFN=%.2f%% (raw: %.4f %.4f seconds)\n",
       100 * squareTime / (squareTime + nextTime),
       100 * nextTime / (squareTime + nextTime),
       squareTime, nextTime);[/code]
[QUOTE=AG5BPilot;305882]The numbers are definitely not reversed.[/QUOTE]
I know. It was one of those idle hopes :smile: The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time.
[QUOTE=axn;305884]The outcome is serious enough that I have just gone ahead and ordered my very own GT 520 (cheapest DP capable part I could find :smile:) to further explore this. Hopefully, before it arrives, I will have finished up the linux port of the sieve, and have some free time.[/QUOTE]
A GT 520's performance will probably be in the vicinity of what you can get on your CPU running Genefx64. The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer. It would be interesting to see if you see the same behavior in the CPU versions.
[QUOTE=AG5BPilot;305925]A GT520s performance will probably be in the vicinity of what you can get on your CPU running Genefx64.
The ONLY code changes to measure the internal timings are in the check() routine, so you can just plug in my modified genefer.cpp file and build the other versions of genefer. It would be interesting to see if you see the same behavior in the CPU versions.[/QUOTE]
I built GenefX64 under linux and the numbers were going in the right direction. Unfortunately, I could only run it up to 1M, since the initialization was taking longer and longer. I'll hack it a bit to overcome that and post the full details. Summary: it starts out at roughly 50:50, but by 1M it is 80:20 (sq:nextstep). With each increase in N, square becomes a more and more dominant factor. The GPU numbers are definitely an anomaly.
Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but it does yield more accurate measurements:
[quote]C:\GeneferCUDA test\GeneferTest>genefercudaTest.exe -b
genefercuda 2.3.1-0 (Windows x86 CUDA 3.2)
Copyright 2001-2003, Yves Gallot
Copyright 2009, Mark Rodenkirch, David Underbakke
Copyright 2010-2012, Shoichiro Yamada, Ken Brazier
Copyright 2011-2012, Iain Bethune, Michael Goetz, Ronald Schneider
Command line: genefercudaTest.exe -b
Generalized Fermat Number Bench
FFTsquareGFN=46.32% FFTnextStepGFN=53.68% (raw: 0.2520 0.2920 seconds)
2009574^8192+1 Time: 546 us/mul. Err: 0.1719 51636 digits
FFTsquareGFN=47.40% FFTnextStepGFN=52.60% (raw: 0.2830 0.3140 seconds)
1632282^16384+1 Time: 597 us/mul. Err: 0.1563 101791 digits
FFTsquareGFN=47.75% FFTnextStepGFN=52.25% (raw: 0.3190 0.3490 seconds)
1325824^32768+1 Time: 668 us/mul. Err: 0.1563 200622 digits
FFTsquareGFN=48.98% FFTnextStepGFN=51.02% (raw: 0.4310 0.4490 seconds)
1076904^65536+1 Time: 880 us/mul. Err: 0.1602 395325 digits
FFTsquareGFN=61.86% FFTnextStepGFN=38.14% (raw: 0.6650 0.4100 seconds)
874718^131072+1 Time: 1.08 ms/mul. Err: 0.1563 778813 digits
FFTsquareGFN=59.12% FFTnextStepGFN=40.88% (raw: 0.9820 0.6790 seconds)
710492^262144+1 Time: 1.66 ms/mul. Err: 0.1641 1533952 digits
FFTsquareGFN=63.43% FFTnextStepGFN=36.57% (raw: 1.7070 0.9840 seconds)
577098^524288+1 Time: 2.69 ms/mul. Err: 0.1875 3020555 digits
FFTsquareGFN=65.57% FFTnextStepGFN=34.43% (raw: 3.2610 1.7120 seconds)
468750^1048576+1 Time: 4.97 ms/mul. Err: 0.1563 5946413 digits
FFTsquareGFN=74.32% FFTnextStepGFN=25.68% (raw: 6.7420 2.3290 seconds)
380742^2097152+1 Time: 9.07 ms/mul. Err: 0.1484 11703432 digits
FFTsquareGFN=75.52% FFTnextStepGFN=24.48% (raw: 13.3390 4.3230 seconds)
309258^4194304+1 Time: 17.7 ms/mul. Err: 0.1719 23028076 digits
FFTsquareGFN=78.22% FFTnextStepGFN=21.78% (raw: 27.3330 7.6120 seconds)
100^8388608+1 Time: 34.9 ms/mul. Err: 0.0000 16777217 digits[/quote]
[QUOTE=AG5BPilot;306037]Figured it out. It was the parallel processing of the square and nextstep overlapping each other that was messing up the metrics. Serializing the steps slowed the program down a bit, but does yield more accurate measurements:[/QUOTE]
Phew. That's more like it! It is in line with the CPU figures. Anyway, once I have my GT 520, I'll do some visual profiling and see if any improvement can be wrung out of the code. But I'm not holding my breath for any big improvements.