mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2019-01-07, 16:32   #67
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×11²×47 Posts
Default

Quote:
Originally Posted by kriesel View Post
Which gpu model #s? (and that's on linux I expect) And hasn't gpu sieving on/enabled/1 been the default in the distributed ini files for quite a while now?
One (1) GTX 560 and two (2) GTX 1050s. And, yes, of course Linux. CentOS 7.3, to be exact.

With regard to the GPU sieving, yes, I believe "on" has been the default ever since George and Oliver implemented it, because it is SO much faster!
chalsall is offline   Reply With Quote
Old 2019-01-07, 16:38   #68
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Default

Quote:
Originally Posted by chalsall View Post
One (1) GTX 560 and two (2) GTX 1050s. And, yes, of course Linux. CentOS 7.3, to be exact.
Thanks. What gpu load does nvidia-smi give for them?
kriesel is online now   Reply With Quote
Old 2019-01-07, 16:46   #69
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×11²×47 Posts
Default

Quote:
Originally Posted by kriesel View Post
Thanks. What gpu load does nvidia-smi give for them?
The 560 doesn't give any details except for fan speed, temp and memory usage (74%, 84C and 64MiB / 1985MiB). Nvidia seems to have intentionally broken nvidia-smi for older cards...

For the two 1050s (both in the same machine):
Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0 Off |                  N/A |
| 48%   83C    P0    N/A /  65W |     67MiB /  2000MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1050    Off  | 00000000:03:00.0 Off |                  N/A |
| 63%   70C    P0    N/A /  75W |     67MiB /  2000MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16053      C   ./mfaktc.exe                                  57MiB |
|    1     16095      C   ./mfaktc.exe                                  57MiB |
+-----------------------------------------------------------------------------+
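For anyone who would rather poll those figures programmatically than parse the nvidia-smi table, the same utilization numbers come from NVML (the library nvidia-smi itself sits on). A minimal C sketch follows; the file name and build line are just illustrative, and a card as old as the 560 may not report utilization through NVML either.
Code:
/* nvml_util.c -- illustrative sketch: read the GPU-Util figure via NVML.
 * Hypothetical build line: gcc nvml_util.c -o nvml_util -lnvidia-ml
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);

    for (i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        char name[64];

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));

        /* util.gpu: percent of the last sample period during which at
         * least one kernel was executing -- the "GPU-Util" column. */
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("GPU %u (%s): %u%% gpu, %u%% memory\n",
                   i, name, util.gpu, util.memory);
        else
            printf("GPU %u (%s): utilization not reported\n", i, name);
    }

    nvmlShutdown();
    return 0;
}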

Last fiddled with by chalsall on 2019-01-07 at 16:50 Reason: s/The 580/The 560/;
chalsall is offline   Reply With Quote
Old 2019-01-07, 17:14   #70
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Default

Note that in chalsall's nvidia-smi output, gpu load is 99%, not 100%, which nvidia-smi is capable of displaying.
On a 3-disparate-gpu system, Win7 x64:

Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.66                 Driver Version: 378.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 0000:03:00.0     Off |                  N/A |
| 89%   85C    P2   119W / 158W |    345MiB /  8192MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro 2000        WDDM  | 0000:1C:00.0     Off |                  N/A |
|100%   91C    P0    N/A /  N/A |     87MiB /  1024MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 105... WDDM  | 0000:28:00.0     Off |                  N/A |
| 40%   67C    P0    65W /  75W |    304MiB /  4096MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5888    C   ...uments\gtx1070-mfaktc\2\mfaktc-win-64.exe N/A      |
|    0     11748    C   ...ocuments\gtx1070-mfaktc\mfaktc-win-64.exe N/A      |
|    1      9908    C   ...ments\mfaktc-quadro2000\mfaktc-win-64.exe N/A      |
|    2      9884    C   ...DALucas2.06beta-CUDA6.5-Windows-WIN32.exe N/A      |
+-----------------------------------------------------------------------------+
vs.
Code:
Mon Jan 07 10:57:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.66                 Driver Version: 378.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 0000:03:00.0     Off |                  N/A |
| 89%   85C    P2   113W / 158W |    226MiB /  8192MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro 2000        WDDM  | 0000:1C:00.0     Off |                  N/A |
|100%   92C    P0    N/A /  N/A |     87MiB /  1024MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 105... WDDM  | 0000:28:00.0     Off |                  N/A |
| 41%   68C    P0    65W /  75W |    304MiB /  4096MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     11748    C   ...ocuments\gtx1070-mfaktc\mfaktc-win-64.exe N/A      |
|    1      9908    C   ...ments\mfaktc-quadro2000\mfaktc-win-64.exe N/A      |
|    2      9884    C   ...DALucas2.06beta-CUDA6.5-Windows-WIN32.exe N/A      |
+-----------------------------------------------------------------------------+
Note the 6 watt (5.3%) difference on the 1070 with the second instance.
Another hypothesis about the performance: in doing a class, the last batch of thread blocks may not fully occupy the gpu, and so temporarily underutilize it for the duration of their run. I remember reading a post somewhere about that. Running multiple instances may reduce the extent and impact of that brief underutilization. More classes would make that occurrence more frequent; fewer classes, less frequent.

Last fiddled with by kriesel on 2019-01-07 at 17:18
kriesel is online now   Reply With Quote
Old 2019-01-07, 17:17   #71
Mark Rose
 
Mark Rose's Avatar
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/

2⁴·199 Posts
Default

Quote:
Originally Posted by nomead View Post
RTX 2060 announced at "$349". Based on the released specs it could have a better "bang for the buck" than either the 2070 or 2080. It has 1920 CUDA cores, so that's 17% less than RTX2070, and 35% less than RTX2080. Clock speeds are almost the same as on the 2070 - 1365 MHz base and 1680 MHz boost. The RTX2080 is clocked higher, though, so there the performance differential will likely be more than 35%. TDP 160 watts. 6 GB of GDDR6 on a 192-bit bus, for 336 GB/s of bandwidth.

30% less price than the RTX2070 for let's say 20% less performance?

And 50% less price than the RTX2080 for maybe 40-45% less performance.

Again speculation based purely on published specifications, not running any actual LL or TF benchmarks.
Does better bang for the buck include:

1. The cost of electricity over two or three years?
2. The cost of providing a PCIe slot and power?

The 2080 Ti is probably the best value once those are taken into account.
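As a rough way to fold electricity into "bang for the buck": cost per GHz-day ≈ (purchase price + watts × hours × $/kWh) / total GHz-days produced. A toy C sketch of that arithmetic; the $349 and 160 W for the 2060 come from the announcement quoted above, while the throughput and electricity rate are made-up placeholders, not benchmarks.
Code:
/* gpu_tco.c -- toy total-cost arithmetic for "bang for the buck".
 * Only the RTX 2060's announced $349 / 160 W are from the thread;
 * the throughput and electricity price below are made-up placeholders.
 */
#include <stdio.h>

static double usd_per_ghzd(double purchase_usd, double watts,
                           double usd_per_kwh, double years,
                           double ghzd_per_day)
{
    double hours  = years * 365.0 * 24.0;
    double energy = watts / 1000.0 * hours * usd_per_kwh; /* electricity cost */
    double total  = purchase_usd + energy;    /* ignores PCIe slot, PSU, ... */
    return total / (ghzd_per_day * years * 365.0);
}

int main(void)
{
    /* 700 GHzD/day and $0.12/kWh are illustrative guesses, not benchmarks */
    printf("RTX 2060, 3 years: $%.4f per GHz-day\n",
           usd_per_ghzd(349.0, 160.0, 0.12, 3.0, 700.0));
    return 0;
}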
Mark Rose is offline   Reply With Quote
Old 2019-01-07, 18:04   #72
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Default

Less-classes build, one instance, 92M, peregrine laptop, Win10 x64

Code:
Mon Jan 07 11:20:20 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 398.36                 Driver Version: 398.36                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   84C    P0    N/A /  N/A |    137MiB /  4096MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     49772      C   ...0ti\mfaktc-win-64.LessClasses-CUDA8.exe N/A      |
+-----------------------------------------------------------------------------+
vs.
Code:
Mon Jan 07 11:46:48 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 398.36                 Driver Version: 398.36                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   86C    P0    N/A /  N/A |    197MiB /  4096MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     49772      C   ...0ti\mfaktc-win-64.LessClasses-CUDA8.exe N/A      |
|    0     54744      C   ...i\2\mfaktc-win-64.LessClasses-CUDA8.exe N/A      |
+-----------------------------------------------------------------------------+
300 GHzD/day for the single instance vs. 154 * 2 = 308 for two.


More-classes build, one instance: 304 GHzD/day, 98% load.


One more-classes instance at 162.2, one less-classes at 147.6: 309.8 GHzD/day combined, 98% load.

Last fiddled with by kriesel on 2019-01-07 at 18:53
kriesel is online now   Reply With Quote
Old 2019-01-07, 18:42   #73
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×11²×47 Posts
Default

Quote:
Originally Posted by kriesel View Post
One more-classes instance at 162.2, one less-classes at 147.6: 309.8 GHzD/day combined, 98% load.
I find the discrepancy between your installation and mine interesting (both using 1050s, if I'm reading your nvidia-smi output correctly; the model name is clipped).

Could it be an OS issue? You're running WinBlows (sorry, couldn't resist...) and I'm running a "headless" server-class Linux.

Frankly, it is not worth my time to squeeze ~1% more out of my kit if it takes ongoing human cycles....
chalsall is offline   Reply With Quote
Old 2019-01-07, 19:11   #74
GP2
 
GP2's Avatar
 
Sep 2003

2·5·7·37 Posts
Default

Quote:
Originally Posted by kriesel View Post
Another hypothesis about the performance: in doing a class, the last batch of thread blocks may not fully occupy the gpu, and so temporarily underutilize it for the duration of their run. I remember reading a post somewhere about that. Running multiple instances may reduce the extent and impact of that brief underutilization. More classes would make that occurrence more frequent; fewer classes, less frequent.
I know very little about CUDA, but one of my New Year's resolutions was to learn more.

So, from the programmer's guide:

Quote:
However, there will be context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists. The individual attribute query function cudaDeviceGetAttribute() with the attribute cudaDevAttrComputePreemptionSupported can be used to determine if the device in use supports Compute Preemption. Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode.
So it doesn't seem like running multiple processes would get you better throughput.
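The attribute query the guide mentions is a one-liner against the CUDA runtime API. A minimal sketch (file name and build line are illustrative) that just prints whether each visible device reports Compute Preemption support:
Code:
/* preempt_check.c -- query the Compute Preemption attribute named in the
 * programming guide, for each visible CUDA device.
 * Hypothetical build line: nvcc preempt_check.c -o preempt_check
 */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0, dev;

    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    for (dev = 0; dev < count; dev++) {
        int preempt = 0;

        cudaDeviceGetAttribute(&preempt,
                               cudaDevAttrComputePreemptionSupported, dev);

        /* 1 = the device can preempt compute work (Pascal and later) */
        printf("device %d: compute preemption %ssupported\n",
               dev, preempt ? "" : "not ");
    }
    return 0;
}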
GP2 is offline   Reply With Quote
Old 2019-01-07, 19:35   #75
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Default

Quote:
Originally Posted by GP2 View Post
I know very little about CUDA, but one of my New Year's resolutions was to learn more.

So, from the programmer's guide:

So it doesn't seem like running multiple processes would get you better throughput.
Thanks for looking into it. Maybe we can learn together.

Context switch is one type of overhead. Apparently not the whole picture.
Consider one task running. It will spend time in non-compute phases, loading data and CUDA code and transferring results; it seems like something else could be using the compute cores then.
Consider TF with a number of thread blocks that is more than, but not an exact multiple of, the particular gpu's core count. How fully utilized is the gpu hardware during the last, "runt" subset of a TF class? Simple example: 10 parallel tasks on 8 processors, equal-length tasks; 8 run, then 2, for an average utilization of 10/16 (less if allowing for setup time in series). With 100 tasks it's 12 sets of 8, then 4, for utilization < 100/(13*8). And I think there's no guarantee the tasks take the same time; some may wait for the slowest one to finish. There was an online conversation related to this by R. Gerbicz and TheJudger, I think recently. I may be misremember-mangling some of the vaguely recalled details.
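That runt-wave arithmetic generalizes to utilization ≈ blocks / (ceil(blocks / slots) × slots), where "slots" stands in for however many thread blocks the gpu can run at once. A toy C sketch with made-up block and slot counts, ignoring unequal task lengths and setup time:
Code:
/* tail_util.c -- toy model of the last-wave ("runt") underutilization.
 * "slots" stands in for however many thread blocks run concurrently;
 * the numbers below are illustrative, not measured mfaktc grid sizes.
 */
#include <stdio.h>

static double tail_utilization(long blocks, long slots)
{
    long waves = (blocks + slots - 1) / slots;      /* ceil(blocks / slots) */
    return (double)blocks / (double)(waves * slots);
}

int main(void)
{
    /* the two worked examples above */
    printf("10 blocks,  8 slots: %.1f%%\n", 100.0 * tail_utilization(10, 8));
    printf("100 blocks, 8 slots: %.1f%%\n", 100.0 * tail_utilization(100, 8));

    /* a larger, hypothetical grid: the runt wave matters far less */
    printf("100000 blocks, 640 slots: %.2f%%\n",
           100.0 * tail_utilization(100000, 640));
    return 0;
}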


Edit: some good background starts around post 2995 and goes to 3020 in the mfaktc thread. https://www.mersenneforum.org/showth...12827&page=273
And I see you were an active participant in that, contributing content that was useful to me, thanks!

Last fiddled with by kriesel on 2019-01-07 at 19:59
kriesel is online now   Reply With Quote
Old 2019-01-07, 19:39   #76
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·163 Posts
Default

Quote:
Originally Posted by chalsall View Post
I find the discrepancy between your installation and mine interesting (both using 1050s, if I'm reading your nvidia-smi output correctly; the model name is clipped).

Could it be an OS issue? You're running Win and I'm running a "headless" server-class Linux.

Frankly, it is not worth my time to squeeze ~1% more out of my kit if it takes ongoing human cycles....
I don't have a GTX 1050. It's the laptop version of a GTX 1050 Ti that's being described in this thread. It's comparable in performance to the PCIe GTX 1050 Ti I have on another system.

Display duties of the gpus here are minimal, since nearly all systems are run by remote desktop access, which puts a small load on the cpu, not the gpu. The peregrine laptop is further configured to do normal local display with its UHD 630 integrated graphics, not the GTX 1050 Ti it contains. So, some bases covered.

It's very understandable to not sweat 1% of a 1050 (~2.1 GHzD/day TF difference), definitely a case of small potential gains, but it's less understandable when the underutilization is 5 to 10% of a GTX 1070 or 1080 Ti, or 5% of an RTX 2080 as it was for another poster, which amounts to more than half the throughput of a GTX 1050.

I assume you're running the more-classes executable. That vs. Less-classes (which I had been running for reduced console output volume) could account for a lot of difference. Lots of points of difference: OS, gpu model, driver level, possibly app classes count, degree of "headlessness", exponent, TF level, ?...

Last fiddled with by kriesel on 2019-01-07 at 19:45
kriesel is online now   Reply With Quote
Old 2019-01-07, 20:12   #77
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2C6E₁₆ Posts
Default

Quote:
Originally Posted by kriesel View Post
It's very understandable to not sweat 1% of a 1050 (~2.1 GHzD/day TF difference), definitely a case of small potential gains, but it's less understandable when the underutilization is 5 to 10% of a GTX 1070 or 1080 Ti, or 5% of an RTX 2080 as it was for another poster, which amounts to more than half the throughput of a GTX 1050.
Agreed. But it would be interesting to find out how much of a difference the OS makes.

If you're not able/willing to "dual boot" between WinDoze and Linux, perhaps others are, in order to get some heuristics from a "bare metal" perspective.

All of my GPUs (admittedly, a small sample set of slower GPUs; even those I sometimes rent from Amazon, Google and M$) don't even have a display connected; they're just for "compute". And they always run Linux. And they always report ~99% utilization in nvidia-smi.

Quote:
Originally Posted by kriesel View Post
I assume you're running the more-classes executable. That vs. Less-classes (which I had been running for reduced console output volume) could account for a lot of difference. Lots of points of difference: OS, gpu model, driver level, possibly app classes count, degree of "headlessness", exponent, TF level, ?...
Yes, I always run the default executable. I never work so low that the "less-classes" build Makes Sense.

Last fiddled with by chalsall on 2019-01-07 at 20:14 Reason: s/by nividia-smi/by nvidia-smi/;
chalsall is offline   Reply With Quote