mersenneforum.org  

2019-01-14, 15:19   #100
kriesel

Quote:
Originally Posted by dcheuk
I have two machines running Turing graphics cards, and both seem to get stuck in a zero-residue loop while running CUDALucas 2.05.1. On 2.06beta, however, the problem seems to go away.

Meanwhile, with the same NVIDIA driver and CUDA version (10.0) on the same computer, CUDALucas 2.05.1 seems to function normally on a Pascal graphics card. So this is isolated, at least locally on my machines, to the Turing cards.

1. Is this a known issue with Turing graphics cards running CUDALucas?

2. Should I worry about the reliability of the residue at the end of the test if the roundoff error is somewhere in the neighborhood of 0.2000?

Thanks.
A zero or otherwise repeating interim residue is a known issue in CUDALucas 2.05.1. Use the v2.06 May 2017 beta; don't use 2.05.1 or earlier.

Test any new installation of CUDALucas thoroughly: run -memtest, repeat a small known Mersenne prime, and complete at least one doublecheck assignment. Redirect console output to a log file during such tests so it can be examined later for errors. A roundoff error of 0.2 is not a problem. See some of the earlier entries in the CUDALucas bug and wish list at https://www.mersenneforum.org/showpo...24&postcount=3
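For concreteness, here is roughly what such a shakedown might look like from a shell, with console output redirected to log files for later inspection. This is a sketch, not CUDALucas's documented usage: -memtest is the option mentioned above, but check your build's help text for its exact arguments, and if your build does not accept a bare exponent on the command line, put the test exponent in worktodo.txt instead.

Code:
# Hypothetical shakedown of a fresh CUDALucas install; flags and arguments may differ by build.
# 1) GPU memory test, output saved for later review:
./CUDALucas -memtest > memtest.log 2>&1
# 2) Repeat a small known Mersenne prime (2^216091-1 is prime); the run should report it as prime:
./CUDALucas 216091 > knownprime.log 2>&1
# 3) Work a doublecheck assignment from worktodo.txt and compare the final residue on mersenne.org:
./CUDALucas > doublecheck.log 2>&1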

2019-01-15, 06:22   #101
dcheuk

Quote:
Originally Posted by kriesel
A zero or otherwise repeating interim residue is a known issue in CUDALucas 2.05.1. Use the v2.06 May 2017 beta; don't use 2.05.1 or earlier.

Test any new installation of CUDALucas thoroughly: run -memtest, repeat a small known Mersenne prime, and complete at least one doublecheck assignment. Redirect console output to a log file during such tests so it can be examined later for errors. A roundoff error of 0.2 is not a problem. See some of the earlier entries in the CUDALucas bug and wish list at https://www.mersenneforum.org/showpo...24&postcount=3
Okay, thanks for the clarification.

2019-01-16, 04:27   #102
kladner

RTX 2060 $350 to $420
https://promotions.newegg.com/neemai...x-landing.aspx

2019-01-16, 07:00   #103
Mark Rose

Quote:
Originally Posted by nomead
Phew. That really took some time. Anyway, I tested everything on the same exponent and at the same bit depth to keep things constant in that regard. First I started filling a table with values at GPUSieveProcessSize=8 and increased the sieve size; 128 was the best there. Then I increased the process size step by step, and to be sure I didn't miss something unexpected, at every step I also tested the top three sieve sizes. No change there: every time, 128 was the best size. Process size 16 ended up slightly better than 24, but the difference is really marginal. (I had been running at 32 for some reason thus far.) The result: 2884 GHz-d/day at 176 watts. An increase in clock speed (or running against the default power limit of 215 W) would of course produce even more throughput, but also more heat and somewhat less performance per watt.

What bothered me, though, was that no combination of those settings could push GPU utilization above 94%. So I had to try some potentially risky things, but in the end everything went well. I edited GPU_SIEVE_SIZE_MAX in params.h to 256, then 512, then 1024, and recompiled the program. Yes, it's below the large warning "DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO!", but since the comment on that parameter says "We've only tested up to 128M bits. The GPU sieve code may be able to go higher." I thought, well, let's give it a try. After each recompilation I ran the long self test and everything worked fine. Of course it uses more GPU memory now, but even at the largest size there's still plenty left, and memory bandwidth usage stays at 1%. Every increase in sieve size brought a corresponding increase in performance, but of course, the further I got, the smaller the difference between steps. Diminishing returns. Finally, at 1024, GPU utilization was at 99% and the per-class timings reported by mfaktc itself stayed stable (at 128 they vary a bit from row to row, for some reason). And 3085 GHz-d/day with just a few more watts consumed than at sieve size 128.

Is there some risk of missing factors or something else if the sieve size is increased like that? I mean, the difference is 200 GHz-d/day just from that one setting. Or is it just a matter of further tests being needed, but nobody has done them? (Could I do them?)
This has been making me think about doing similar benchmarking with my 1070s. A free 7% improvement is nice. My utilization is sitting at 98%, though, not 94%.

Kind of sad that I'm spending just under 600 watts (4 cards) to match your card.
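For anyone wanting to repeat nomead's experiment: the change described in the quote is a one-line edit to GPU_SIEVE_SIZE_MAX in params.h followed by a rebuild. The sketch below paraphrases the quoted description; the exact line and surrounding comments in params.h may differ, so verify against your own source tree before recompiling.

Code:
/* params.h -- sketch of the edit described in the quote, not a verified excerpt.  */
/* Raising this cap lets mfaktc.ini request GPUSieveSize above 128 (M bits).       */
/* The stock comment warns that only up to 128M bits has been tested, and larger   */
/* sieves use proportionally more GPU memory.                                      */
#define GPU_SIEVE_SIZE_MAX 1024   /* stock value was 128 in the build described above */

After rebuilding, re-run the built-in self test, as nomead did, before trusting any results.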

2019-01-17, 01:36   #104
kladner

Quote:
Originally Posted by Mark Rose
My utilization is sitting at 98%, though, not 94%.
I run two mfaktc instances on a GTX 1060 and one instance on a GTX 460. Both cards are overclocked. If Prime95 is not running, they hold steady at 100% utilization. With P95 running on all 4 cores of a 6700K, utilization drops to 99% on both. GHz-d/day drops from about 308.1 to 306.2 on each 1060 instance, and from 208.4 to 207.4 on the 460. (This is just eyeball averaging of the output.)
EDIT: Neither card drives a display.
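For readers wondering how a multi-instance setup like this is typically arranged: each mfaktc instance runs from its own directory, with its own mfaktc.ini and worktodo.txt, and mfaktc's -d flag selects the CUDA device. A rough Linux-shell sketch with made-up directory names and assumed device numbers; on Windows the same idea applies with separate folders:

Code:
# Hypothetical three-instance layout: two instances share the GTX 1060 (assumed device 0),
# one runs on the GTX 460 (assumed device 1).  Each directory holds its own worktodo.txt.
(cd mfaktc-1060-a && ./mfaktc.exe -d 0 > run.log 2>&1) &
(cd mfaktc-1060-b && ./mfaktc.exe -d 0 > run.log 2>&1) &
(cd mfaktc-460    && ./mfaktc.exe -d 1 > run.log 2>&1) &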


2019-01-28, 10:41   #105
nomead

The display card in my home machine started showing signs of dying over the weekend: a couple of system BSODs while watching YouTube, and it appears the old GT430 now has a dead fan. Not worth fixing anymore in my opinion, but I guess I'll keep it on the shelf for a few years in case I need a backup card for some other system. Of course, this gave me a good excuse to order an RTX 2060. Unfortunately it's a Windows system, and I don't think the precompiled binary supports GPUSieveSize above 128, but I'll post comparison benchmarks against the 2080, on identical parameters in mfaktc.ini, as soon as I'm able to.

It's an old case, though, with many hard disks and plenty of cable clutter, so the airflow and thermal performance might be a bit underwhelming. Keeping the GPU cool reduces leakage inside the chip, which reduces power draw, which in turn keeps the GPU even cooler, up to a limit of course. It feels like the power draw really goes off a cliff above about 60 °C. Still, I'm expecting about 65% of the performance of the 2080 for 50% of the price, purely based on the number of CUDA cores.

So I did some thermal and power measurements on the 2080 at different fan speeds to see the effect in quantitative terms; feelings are nice, but they're no replacement for actual benchmark data. To be specific, the card used is an MSI Ventus RTX 2080, standard edition, not "OC", with GPUSieveSize=1024 for better performance. By the way, at 128 it produces less heat but also does less work, and the net effect is that GHz-d/day per watt is better at GPUSieveSize=1024 at any GPU clock frequency.

The default fan speed seemed to stay under 40% even at maximum power. This didn't allow running over 1800 MHz without hitting the power limit of 240 W. The specified TDP is 215 W, but nvidia-smi lets you set the power limit slightly higher than that. Maybe some Windows-based overclocking utilities would allow even higher boost clock rates, maybe not. At a constant 60% fan speed, the maximum was 1830 MHz. At that speed the fan noise is still bearable, as most of it is just white noise from the airflow and there is not much motor whine. At 70% fan speed the motor whine appears; it's kind of fine at work, but I wouldn't want something like that at home. The maximum frequency, however, went up one notch to 1845 MHz. Note that in none of these cases is the GPU thermally throttling; it is only the hard power limit of 240 W that it runs against.
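For reference, both the power-limit change and the logging behind numbers like these can be done with nvidia-smi alone (fan-speed overrides need a separate, OS-dependent tool). A minimal sketch, assuming GPU index 0; raising the limit requires admin rights, and -pl only accepts values up to the board's maximum power limit:

Code:
# Allow the board to draw up to 240 W on GPU 0 (only accepted up to the board's max limit).
nvidia-smi -i 0 -pl 240
# Log power draw, GPU temperature and SM clock once per second while the benchmark runs.
nvidia-smi -i 0 --query-gpu=power.draw,temperature.gpu,clocks.sm --format=csv -l 1 > rtx2080_power.csv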

The data is attached as a PDF in case anyone is interested.
Attached: GHz-RTX2080-1024-fan60-70.pdf (25.2 KB)

2019-01-28, 15:45   #106
kriesel

Quote:
Originally Posted by nomead
[snip: the GPUSieveProcessSize / GPUSieveSize benchmarking post quoted in #103 above]

Is there some risk of missing factors or something else if the sieve size is increased like that? I mean, the difference is 200 GHz-d/day just from that one setting. Or is it just a matter of further tests being needed, but nobody has done them? (Could I do them?)

I also tested if NumStreams had any effect on performance, but no, not really. Any difference is practically indistinguishable from noise and measurement uncertainty. The final check was to see if changing GPUSievePrimes had any effect. Well, yes, mostly negative ones. Going higher increased power consumption and noticeably decreased performance. Going lower perhaps decreased power consumption a bit, but also the performance went down a bit. So the default value of 82486 is spot on.

I've attached a printout of the timings I gathered.
Very interesting; gains of several percent all the way to GPUSieveSize=1024. On what OS? Could you post the executable?
I suggest you try ~90000 for GPUSievePrimes, which is near the optimum I found for a GTX 1080 Ti.
Also, have you considered trying GPUSieveSize 2048 or even 4096? The 512 to 1024 step gave about a 1% rise in GHz-d/day, so there may be a bit more gain left, if it's actual trial factoring throughput that's being indicated.
Attached: big gpusievesize effect.pdf (14.4 KB)
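For anyone following along, the knobs being discussed here all live in mfaktc.ini. A sketch of the combination under discussion; these are the thread's experimental values, not the shipped defaults, and GPUSieveSize above 128 only works after the params.h rebuild described in the quoted post:

Code:
# mfaktc.ini excerpt -- experimental values from this thread, not stock defaults
# Sieve size in M bits; values above 128 require GPU_SIEVE_SIZE_MAX to be raised in params.h
GPUSieveSize=1024
# nomead found 16 marginally faster than 24 or 32 on the RTX 2080
GPUSieveProcessSize=16
# kriesel's suggestion, near his GTX 1080 Ti optimum; the shipped default is 82486
GPUSievePrimes=90000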


2019-01-28, 18:40   #107
nomead

Quote:
Originally Posted by kriesel
Very interesting; gains of several percent all the way to GPUSieveSize=1024. On what OS? Could you post the executable?
I suggest you try ~90000 for GPUSievePrimes, which is near the optimum I found for a GTX 1080 Ti.
Also, have you considered trying GPUSieveSize 2048 or even 4096? The 512 to 1024 step gave about a 1% rise in GHz-d/day, so there may be a bit more gain left, if it's actual trial factoring throughput that's being indicated.
Debian Linux, kernel 4.19, and the executable is compiled just for CUDA 10.0 and compute capability 7.5. But that's just another build-time option if someone really wants it... I don't have a Windows build environment yet, but maybe now there's some motivation for me to set one up.

OK, so I'm running all the tests again with the same exponent and the same settings (clock speed etc.) as before, just varying GPUSieveSize. The first value I tried was 2048, but no luck: the error below came up when trying to run the self tests. And it's not a matter of actually running out of memory; mfaktc.exe only uses 405 MiB at the 2047 setting.
Code:
gpusieve.cu(1276) : CUDA Runtime API error 2: out of memory.
1536 seems to pass the self tests, but there is only a slight improvement (from 3085 GHz-d/day at 1024 to 3108, a further 0.7%).
Amazingly, 2047 also passes the self tests. The result there is 3115, a 0.2% improvement over 1536; really minuscule by that point.

Then on to GPUSievePrimes. Maybe my search for the optimum was a bit coarse before. For these runs I went back to GPUSieveSize=1024 and increased GPUSievePrimes by about 4000 per step. For me, performance stayed about flat, with a barely perceptible decline at each step. I didn't feel like going any further up than 110K (111158 adjusted, to be exact), because at that point performance was down 0.5%. Then, stepping down, I saw a very slight improvement at 78K (79158 adjusted), but as it was just +0.2% it could just as well be noise in the measurement: under half a second for a 5-minute run. After that, with smaller values, performance started declining again.

So for this card at least, adjusting GPUSievePrimes away from the default doesn't bring any benefit. But these things are highly dependent on the GPU architecture; who knows, maybe there's a way to make even better use of the faster INT32 on Volta and Turing, and of the fact that FP and INT operations can now run at the same time.
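One possible, unverified explanation for the hard wall between 2047 and 2048: GPUSieveSize is given in M bits, and 2048 M bits is exactly 2^31 bits, which no longer fits in a signed 32-bit integer, while 2047 M bits still does. If some part of the allocation path keeps the bit count in a 32-bit int, the size would wrap negative and the allocation would fail with exactly this kind of "out of memory" error even though plenty of memory is free. This is speculation about the cause, not something checked against the mfaktc source; the arithmetic itself is shown below.

Code:
#include <limits.h>
#include <stdio.h>

/* Toy arithmetic only -- NOT mfaktc code.  GPUSieveSize is specified in M bits
   (multiples of 2^20 bits); 2047M bits fits in a signed 32-bit int, 2048M does not. */
int main(void)
{
    for (int mbits = 2047; mbits <= 2048; mbits++) {
        long long bits = (long long)mbits << 20;   /* exact bit count */
        printf("GPUSieveSize=%d -> %lld bits (INT_MAX = %d)\n", mbits, bits, INT_MAX);
    }
    return 0;
}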

2019-02-04, 19:11   #108
nomead

Quote:
Originally Posted by nomead
Still, I'm expecting about 65% of the performance of the 2080 for 50% of the price, purely based on the number of CUDA cores.
Pretty close: the actual ratio seems to be 67%. Benchmarks at a few frequencies are again attached; the options are the same for both cards (2080 on Linux, 2060 on Windows 7), so it was necessary to use GPUSieveSize=128 for these tests. The 2060 can be clocked higher, but my card seems to hit some limit at 1920 MHz and won't go any higher without touching the overvolt settings, which I'm not really willing to do in the long run. Besides, it's already at the rated TDP at that point, so there's very little left to gain.
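For the record, the core-count arithmetic behind the original estimate: the RTX 2080 has 2944 CUDA cores and the RTX 2060 has 1920, so 1920 / 2944 ≈ 0.652, or about 65%; the measured 67% at matched settings is a touch better than pure core-count scaling would predict.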
Attached: GHz-RTX2080-comp.pdf (19.5 KB)

2019-02-04, 19:42   #109
Mark Rose

Quote:
Originally Posted by nomead
Still, I'm expecting about 65% of the performance of the 2080 for 50% of the price.
Don't forget that the effective price should also include providing the PCIe slot (and thus the power supply, case, etc.), so the 2080 may still be a better deal.

2019-02-04, 19:50   #110
nomead

Quote:
Originally Posted by Mark Rose
Don't forget that the effective price should also include providing the PCIe slot (and thus the power supply, case, etc.), so the 2080 may still be a better deal.
That is only valid if you're building a system just for that purpose, not upgrading a pre-existing one (like I did: GT430 out, RTX 2060 in, on an old 6-core Phenom system from 2011...).