[QUOTE=preda;490474]I added an initial CUDA backend to gpuOwl. I expect this to be rough, buggy and not-optimized yet, but it's a start.
<snip> Not so nice: the performance on GTX 1080 is disappointing. 5.9ms/it at the PRP wavefront, 4480K FFT. (thus I don't think it's such a good idea to do PRP or LL on Nvidia yet. Probably TF is a better fit for the 32bit-oriented hardware).[/QUOTE] 32-bit hardware? Does this mean you aren't using FP64 variables? ALL Nvidia boards support FP64; it's just that some have fewer (in most cases, vastly fewer) FP64 math units than FP32 ones. Despite that, FP64 is the way to get excellent performance.
[QUOTE=preda;490474]I added an initial CUDA backend to gpuOwl. I expect this to be rough, buggy and not-optimized yet, but it's a start.
The approach I ended with was to use most of the same codebase, but split out two backends, OpenCL and CUDA. [I'm thinking, should I rename the previous gpuOwl to openOwl for symmetry with cudaOwl?] So, the savefile format, and much of the logic, is shared between the cudaOwl and gpuOwl. There are some notable differences though:
- gpuOwl supports "offset extension", which means varying the offset (aka "shift") when a PRP error is encountered. Not a big deal unfortunately, this trick achieves about 0.5% exponent extension for a given FFT size. This was motivated by the severe lack of FFT size choice in openOwl. (cudaOwl doesn't have "offset".)
- cudaOwl has a rich choice of FFT sizes (unlike openOwl). FFT selection is controlled with the "-fft" argument, which allows specifying hard sizes such as 4096K or 4M, or delta steps from the "default" size for the exponent, such as +1 or -1.
A few nice things:
- it's possible to switch the savefile between CUDA/OpenCL in mid-flight.
- it's possible to change the FFT size in mid-flight.
Not so nice: the performance on GTX 1080 is disappointing. 5.9ms/it at the PRP wavefront, 4480K FFT. (Thus I don't think it's such a good idea to do PRP or LL on Nvidia yet. Probably TF is a better fit for the 32bit-oriented hardware.)[/QUOTE] Common front end & different back ends makes a lot of sense. It positions you to adopt other interface standards in the future. (There's untapped OpenGL hardware. Or gpuFish?
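The quoted "-fft" flag (hard sizes like 4096K or 4M, or delta steps like +1/-1 from a per-exponent default) might be interpreted along the lines below. This is only a sketch: the size table, the bits-per-word default rule, and the function names are invented for illustration and are not gpuOwl's actual code.

```python
# Illustrative sketch of interpreting a "-fft"-style argument: either a hard
# size ("4096K", "4M") or a delta from the exponent's default size ("+1", "-1").
# The size table and default-selection rule are made up for illustration.

FFT_SIZES_K = [4096, 4608, 5120, 5632, 6144, 7168, 8192]  # sizes in K words

def default_fft_k(exponent):
    """Pick the smallest listed size giving at most ~18 bits per FFT word
    (an assumed threshold, purely for illustration)."""
    for k in FFT_SIZES_K:
        if exponent / (k * 1024) <= 18:
            return k
    return FFT_SIZES_K[-1]

def parse_fft_arg(arg, exponent):
    """Return an FFT size in K for a '-fft' value like '4096K', '4M', '+1', '-1'."""
    if arg.startswith(('+', '-')):          # delta step from the default size
        idx = FFT_SIZES_K.index(default_fft_k(exponent))
        idx = max(0, min(len(FFT_SIZES_K) - 1, idx + int(arg)))
        return FFT_SIZES_K[idx]
    if arg.upper().endswith('M'):
        return int(arg[:-1]) * 1024
    if arg.upper().endswith('K'):
        return int(arg[:-1])
    return int(arg)

print(parse_fft_arg('4M', 77_000_000))   # 4096
print(parse_fft_arg('+1', 77_000_000))   # 5120 (one step above the 4608K default)
```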
(Play on barraCUDA.)

gtx1070 FFT timings in CUDALucas, for comparison (multiply by about 0.703 for expected gtx1080 timings):

fft (K)  max exponent  ms/it
4096     75846319      4.4804
4608     85111207      5.3196   (~3.74 for gtx1080)
5184     95507747      5.9684
5292     97454309      6.6808
5600     103000823     6.8442
5832     107174381     7.0365
6144     112781477     7.3922
6272     115080019     7.5335
6480     118813021     7.9496
6912     126558077     8.0038
7168     131142761     8.4981
7200     131715607     8.7282
8192     149447533     9.1301

For CUDALucas 2.06 on gtx1070, running at 4480K is slightly slower than 4608K:

fft size = 4096K, ave time = 4.4804 msec, max-ave = 0.00427
fft size = 4116K, ave time = 5.8768 msec, max-ave = 0.00866
fft size = 4200K, ave time = 5.7410 msec, max-ave = 0.00837
fft size = 4320K, ave time = 5.6031 msec, max-ave = 0.00523
fft size = 4374K, ave time = 5.7722 msec, max-ave = 0.01244
fft size = 4375K, ave time = 6.2813 msec, max-ave = 0.00654
fft size = 4410K, ave time = 5.8451 msec, max-ave = 0.00626
fft size = 4480K, ave time = 5.6864 msec, max-ave = 0.00568
fft size = 4500K, ave time = 5.7041 msec, max-ave = 0.00836
fft size = 4536K, ave time = 5.6158 msec, max-ave = 0.00727
fft size = 4608K, ave time = 5.3196 msec, max-ave = 0.00626

Does your program benchmark the different fft lengths or autoselect, or is that up to the user? Sounds like the latter.

When checking the speed of your new code on the GTX 1080, did you monitor whether it was thermally limited? I routinely see a performance hit on the GTX 1070 because of that, and the gtx10xx temperature limits are significantly lower than other cards'.

Sounds like good progress. Whether you find ways to tune for more speed or not, it's now the fastest PRP on CUDA for Mersennes. What's next; explore an NVIDIA-specific OpenCL flavor? Tuning CUDA with a profiling tool?
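Those ms/it figures translate directly into wall-clock time per test, since a PRP or LL test takes roughly `exponent` iterations. A quick back-of-envelope sketch using the numbers quoted above (the ~0.703 gtx1070-to-gtx1080 scale factor is the estimate from this post, not a measurement):

```python
# Estimate days per primality test from a per-iteration time.
# The ms/it figure and the ~0.703 gtx1070 -> gtx1080 scale factor are taken
# from the timings above; a PRP/LL test needs roughly `exponent` iterations.

def days_per_test(exponent, ms_per_iter):
    return exponent * ms_per_iter / 1000 / 86400  # ms -> s -> days

gtx1070_4608k = 5.3196                  # ms/it at 4608K FFT on gtx1070
gtx1080_est = gtx1070_4608k * 0.703     # ~3.74 ms/it estimated for gtx1080

print(f"gtx1080 estimate: {gtx1080_est:.2f} ms/it")
print(f"~{days_per_test(85_111_207, gtx1080_est):.1f} days for an 85.1M exponent")
```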
[QUOTE=kriesel;490491]Common front end & different back ends makes a lot of sense. It positions you to adopt other interface standards in the future. (There's untapped OpenGL hardware.
...[/QUOTE] Are you sure OpenGL is a thing? AFAIK OpenGL was used for some limited form of compute before compute was a thing GPUs regularly did, but it's now a relic of the past.
[QUOTE=preda;490474]I added an initial CUDA backend to gpuOwl. I expect this to be rough, buggy and not-optimized yet, but it's a start.
<snip> Not so nice: the performance on GTX 1080 is disappointing. 5.9ms/it at the PRP wavefront, 4480K FFT. (thus I don't think it's such a good idea to do PRP or LL on Nvidia yet. Probably TF is a better fit for the 32bit-oriented hardware).[/QUOTE] It is faster than an RX 580, which is at 6.7ms/it. PassMark also reports it as faster: [url]https://www.videocardbenchmark.net/directCompute.html[/url]
[QUOTE=M344587487;490503]Are you sure OpenGL is a thing? AFAIK OpenGL was used for some limited form of compute before compute was a thing GPUs regularly did, but it's now a relic of the past.[/QUOTE]
You're right, it's not a high priority. As long as someone like Preda can find ways to squeeze another 0.5% performance out of 1000 GhzD/day gpus, that's a more productive use of time and talent than getting 70% performance out of ~2 GhzD/day igps.

There are still a number of old systems around with IGPs for which OpenGL is supported, or DirectX is available, and CUDA and OpenCL are not. These would probably be in the single-digit GhzD/day range for performance. Just for fun. Now that I think about it, they may lack floating point. That, and low speed, would favor a trial factoring application, if any. I have:
- multiple Intel 82865G (supposedly OpenGL 1.3, but no OpenCL and no CUDA of course)
- multiple ATI Mobility X300 (DirectX 9, OpenGL 2.0, but no OpenCL and no CUDA of course)
- an ATI M26 Radeon Mobility X700 (OpenGL 2.0, DirectX 9)
- an ATI RV350 Mobility Radeon 9600 M10 (OpenGL 2.0)
Leave no silicon unused. The device features are very different; note the transistor counts.

As I recall, doing math with OpenGL involves arranging the math into a graphics form, to get the graphics-oriented OpenGL doing the math you want. Examples at [URL]https://stackoverflow.com/questions/39045419/using-opengl-shaders-for-math-calculation-c[/URL] and [URL]https://github.com/skanti/Gaussian-Filter-GPU-OpenGL[/URL]

But who knows, maybe a break from OpenCL and CUDA someday, in OpenGL, would lead to a creative new look at CUDA or OpenCL later.
I am not sure why you are even bringing up decade-plus-old hardware. We would be hard pushed to find hardware to debug on, never mind it being worth the time.
[QUOTE=henryzz;490528]I am not sure why you are even bringing up decade+ old hardware. We would be hard pushed to find hardware to debug on nevermind it be worth the time.[/QUOTE]
I included 3 possible reasons in my previous post. And how about, for entertainment value? eBay and my basement are full of such hardware. They're generally not a great way to spend the electrical budget, but they're available and inexpensive. There are people running various software on Raspberry Pi: [url]http://www.mersenneforum.org/showthread.php?t=23250[/url] [URL]https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313.TR11.TRC2.A0.H0.Xlaptop+compaq.TRS1&_nkw=laptop+compaq&_sacat=0&LH_TitleDesc=0&_udhi=115&_osacat=0&_odkw=laptop[/URL]

Preda knocked out a CUDA port of gpuOwL in about 10 days, including the front-end/back-end split. I wonder what challenge he'll tackle next. (After, presumably, some CUDA tuning.)
[QUOTE=kriesel;490537]I included 3 possible reasons in my previous post. And how about, for entertainment value?
...[/QUOTE] I look forward to your implementation of an OpenGL PRP tester ;)
[QUOTE=tServo;490487]32 bit hardware? Does this mean you aren't using FP64 variables?
ALL Nvidia boards support FP64, it's just that some have fewer ( in most cases, vastly fewer ) FP64 math units that FP32 ones. Despite that, FP64 is the way to get excellent performance.[/QUOTE] I'm using FP64. What I meant is that "cheap" (i.e. less than $4K) Nvidia GPUs have *very* weak FP64 power (and INT64 too); that's why I called them "32-bit oriented".
[QUOTE=M344587487;490540]I look forward to your implementation of an OpenGL PRP tester ;)[/QUOTE]
I'll get right on that, after a few other todo lists, and after your OpenGL or Vulkan P-1 factoring program is released. :smile: Somewhere among the answers at this link is the claim that OpenGL is faster than OpenCL: [URL]https://stackoverflow.com/questions/7907510/opengl-vs-opencl-which-to-choose-and-why#[/URL]
[QUOTE=preda;490558]I'm using FP64. What I meant is that "cheap" (i.e. less than $4K) Nvidia GPUs have *very* weak FP64 power (and INT64 too), that's why I called them "32-bit oriented".[/QUOTE]
The Kepler GPU based boards have excellent FP64 performance and are quite reasonable. This would include the old standby original Titan and Titan Black (approx. 300-400 US dollars on eBay, used), the Titan Z, the Tesla K80 (900 dollars), and some of the higher-end Quadros of that architecture such as the K6000 and K5000. The Quadros cost about the same as the Tesla K80. Most of these boards have one third as many FP64 math units as FP32 ones.
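To put rough numbers on the Kepler-vs-Pascal FP64 gap, here is a back-of-envelope sketch: theoretical FP64 throughput is the number of FP64 units times clock times 2 flops per FMA. The core counts, clocks, and FP64:FP32 ratios below (1:3 for Kepler GK110, 1:32 for consumer Pascal) are approximate spec-sheet figures, not measurements.

```python
# Back-of-envelope theoretical FP64 throughput: FP64 units x clock x 2 (FMA).
# Core counts, base clocks, and FP64 ratios are approximate public spec
# figures, used only to illustrate the order-of-magnitude gap.

def fp64_gflops(fp32_cores, fp64_ratio, clock_ghz):
    return fp32_cores * fp64_ratio * clock_ghz * 2  # 2 flops per FMA

titan   = fp64_gflops(2688, 1 / 3,  0.837)  # Kepler GK110 Titan, FP64 at 1:3
gtx1080 = fp64_gflops(2560, 1 / 32, 1.607)  # Pascal GP104, FP64 at 1:32

print(f"Titan    ~{titan:.0f} GFLOPS FP64")    # ~1500
print(f"GTX 1080 ~{gtx1080:.0f} GFLOPS FP64")  # ~257
```

By this estimate the old Titan has roughly 5-6x the FP64 throughput of a GTX 1080, despite the much lower price and clock, which is consistent with Kepler being the better LL/PRP buy per dollar.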