mersenneforum.org  

2020-06-16, 00:43   #2333
Xyzzy
"Mike"
Aug 2002
16637₈ Posts

So our system with two cards is stable.

But we couldn't get clinfo to find our cards when we booted headless.

We bought some DisplayPort "dummy" dongles but they did not work.

We moved around the cards too many times.

We tried just one card. We tried three.

Eventually, we figured out that we need to log into at least one display's GUI to get the cards initialized. Just booting to the GDM login screen wasn't enough. This works without any dongles.

We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)


2020-06-16, 08:48   #2334
preda
"Mihai Preda"
Apr 2015
2200₈ Posts

Quote:
Originally Posted by Xyzzy
So our system with two cards is stable.

But we couldn't get clinfo to find our cards when we booted headless.

We bought some DisplayPort "dummy" dongles but they did not work.

We moved around the cards too many times.

We tried just one card. We tried three.

Eventually, we figured out that we need to log into at least one display's GUI to get the cards initialized. Just booting to the GDM login screen wasn't enough. This works without any dongles.

We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)


Yes, I confirm I see the same behavior. I'm mystified by what sort of bug could cause it.

2020-06-16, 13:36   #2335
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3⁴·13 Posts

Quote:
Originally Posted by preda
Yes I confirm I see the same behavior, I'm mystified by what sort of bug may cause it..

Try checking any power-saving options. They have been seen to cause problems in other OS, hardware, and software combinations.

Re high $ pro compute cards, some of those have different drivers entirely, in the NVIDIA line. Might apply to AMD too (MI50 etc).

Last fiddled with by kriesel on 2020-06-16 at 13:38

2020-06-16, 15:58   #2336
LaurV
Romulan Interpreter
Jun 2011
Thailand
5·11·157 Posts

Quote:
Originally Posted by preda
The savefiles are FFT-length agnostic. They only store the "compacted" residue which is not affected by FFT-len in any way. That's why one can restart the savefile with any FFT-length, changing the FFT size midway.

We love that. Don't change it!

When the shift?

2020-06-16, 17:56   #2337
Xyzzy
"Mike"
Aug 2002
7,583 Posts

Quote:
Originally Posted by kriesel
Try checking any power saving options.

We have every power-saving option turned off.

Quote:
Originally Posted by kriesel
Re high $ pro compute cards, some of those have different drivers entirely, in the NVIDIA line. Might apply to AMD too (MI50 etc).

We are using the "pro" drivers. The open-source drivers don't work with Navi cards.

2020-06-16, 18:57   #2338
PhilF
Feb 2005
Colorado
479 Posts

Quote:
Originally Posted by Xyzzy
We have every power-saving option turned off. We are using the "pro" drivers. The opensource drivers don't work with Navi cards.

I used those drivers too, with a Radeon VII, and remember using the "headless" option. Maybe that option is how others are getting around the problem of having to run the GUI in order to initialize cards.

2020-06-16, 19:31   #2339
ewmayer
2ω=0
Sep 2002
República de California
11399₁₀ Posts

Quote:
Originally Posted by Prime95
BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first time tests for quite a while.

Yes, the 3 errors code path could be better. It could retry once, then switch to next higher accuracy level, then try STATS, etc. I'll leave that to Mihai who is up to his eyeballs in PRP proofs. Don't expect anything soon!

MIDDLE refers to what in my code is the leading (containing any odd factors of the FFT length) FFT radix, i.e. MIDDLE=13,14,15 covers FFT lengths 6.5M, 7M, 7.5M?

Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work? The obvious thing that comes to my mind would be to allow bumping the various internal accuracy-setting flags up or down one, or even up or down a user-set number of notches, so long as the result lies within the available accuracy settings for the given FFT length.

Simplifying your 2-param accuracy scheme for purposes of illustration: let's say those 2 params map to an ascending set of accuracy settings ranging from 0-9, 0 is least accurate (and presumably fastest), 9 is most accurate (and presumably slowest). Say the user's current exponent has a default accuracy setting mapping to 4. User could fiddle that up or down one notch via -acc_flag=[+1 or -1], or some desired number of notches with -acc_flag=n, where for this case, any n < -4 or > 5 would trigger a warning message and subsequently be ignored. (n = +-1 makes the most sense as you set the defaults with some care, but other values may be useful for experimentation.)
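The notch rule in the simplified example above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from GpuOwl or Mlucas: the function name `apply_acc_flag` and the constants are hypothetical, and the 0-9 range is the made-up one from the example.

```python
import sys

# Hypothetical accuracy range from the simplified example:
# 0 = least accurate (fastest), 9 = most accurate (slowest).
ACC_MIN, ACC_MAX = 0, 9


def apply_acc_flag(default_level: int, offset: int) -> int:
    """Return the accuracy level after applying a user offset in notches.

    An offset that would leave the [ACC_MIN, ACC_MAX] range prints a
    warning and is ignored, so the default level is returned unchanged.
    """
    level = default_level + offset
    if level < ACC_MIN or level > ACC_MAX:
        print(
            f"warning: acc_flag={offset:+d} is out of range for default "
            f"level {default_level}; ignoring it",
            file=sys.stderr,
        )
        return default_level
    return level


if __name__ == "__main__":
    for n in (-5, -4, -1, 1, 5, 6):
        print(n, "->", apply_acc_flag(4, n))
```

With a default of 4, offsets from -4 to +5 are accepted and anything outside that window leaves the level at 4, which matches the "n < -4 or > 5" rule stated above.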

What do you have in mind here?

2020-06-16, 20:15   #2340
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2×3×7×167 Posts

Quote:
Originally Posted by ewmayer
MIDDLE refers to what in my code is the leading (containing any odd factors of the FFT length) FFT radix, i.e. MIDDLE=13,14,15 covers FFT lengths 6.5,7,7.5M?

Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work?

The 5M FFT does a 1024 complex FFT, a MIDDLE=10 step, and a 256-complex FFT. So, yes MIDDLE=13,14,15 gives you 6.5,7,7.5M FFTs.
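The arithmetic behind that statement can be checked with a short Python sketch. It assumes the FFT length in words is twice the number of complex points (two real words per complex point), which is what makes 1024 × 10 × 256 complex points come out as "5M"; the function and parameter names are hypothetical, not GpuOwl identifiers.

```python
M = 1 << 20  # "1M" here means 2**20 words


def fft_length_words(first_step: int, middle: int, last_step: int) -> int:
    """FFT length in words for a first-step x MIDDLE x last-step layout.

    Assumption: each complex point packs two real words, hence the factor 2.
    """
    return 2 * first_step * middle * last_step


if __name__ == "__main__":
    for middle in (10, 13, 14, 15):
        words = fft_length_words(1024, middle, 256)
        print(f"MIDDLE={middle}: {words / M:g}M")
```

Under this assumption each unit of MIDDLE adds 0.5M words when the first and last steps are 1024 and 256.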

The 1024 complex and 256 complex were the fastest first and last steps when I got involved. There was a first and last step of 512-complex that was slower and may have been deprecated. If they have not been deprecated, then 7M FFT could have been done with a MIDDLE=7 step.

I envision a very simple accuracy tweaking flag. Presently, the maximum bits-per-word with all accuracy options on is hardwired. Various constants based on bits-per-word indicate when it is safe to turn off each accuracy option. My scheme would be to simply increase (more aggressive) or decrease (more conservative) the FFT's maximum bits per word. For example, -use aggressive=0.1 or -use aggressive=-.05.
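A minimal Python sketch of how such a flag might act, assuming the value is simply added to the hardwired maximum bits-per-word before the largest testable exponent is computed. The additive interpretation, the function name, and the 18.0 bits-per-word figure are all assumptions made for illustration; none of them comes from the GpuOwl source.

```python
def max_exponent(fft_words: int, max_bits_per_word: float,
                 aggressive: float = 0.0) -> int:
    """Largest exponent an FFT of fft_words words would be allowed to handle.

    aggressive > 0 raises the bits-per-word limit (more aggressive),
    aggressive < 0 lowers it (more conservative).
    """
    return int(fft_words * (max_bits_per_word + aggressive))


if __name__ == "__main__":
    fft_words = 5 * 2**20  # a 5M FFT
    placeholder_bpw = 18.0  # hypothetical limit, not a real GpuOwl constant
    for aggressive in (-0.05, 0.0, 0.1):
        print(aggressive, max_exponent(fft_words, placeholder_bpw, aggressive))
```

The point of the sketch is only that a single number shifts every FFT-size crossover at once, instead of the user toggling individual accuracy options.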

In theory, one could set aggressive to 1.0 and manually set the accuracy options. That would be tedious/dangerous.

I prefer for Mihai to muck in the C++ code with heavy std library usage. I'm a dinosaur brought up on C code.

2020-06-17, 00:14   #2341
Xyzzy
"Mike"
Aug 2002
7,583 Posts

We figured out how to control our fans and get temperature readings thanks to this article: https://wiki.archlinux.org/index.php...fs_fan_control
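For readers who have not used the sysfs interface, here is a hedged Python sketch of the general idea. It relies on the standard Linux hwmon files that the amdgpu driver exposes (`temp1_input` in millidegrees Celsius, `pwm1_enable` where 1 means manual and 2 means automatic, and `pwm1` in the range 0-255). The function names are hypothetical, and the directory passed in (typically something under `/sys/class/drm/card0/device/hwmon/`) varies per system, so treat the path as an assumption.

```python
from pathlib import Path

PWM_MAX = 255  # hwmon pwm1 accepts values from 0 to 255


def percent_to_pwm(percent: float) -> int:
    """Convert a fan percentage (0-100) to a hwmon PWM value (0-255)."""
    if not 0 <= percent <= 100:
        raise ValueError("fan percentage must be between 0 and 100")
    return round(percent * PWM_MAX / 100)


def read_temp_celsius(hwmon: Path, sensor: str = "temp1_input") -> float:
    """Read a temperature sensor; hwmon reports millidegrees Celsius."""
    return int((hwmon / sensor).read_text()) / 1000


def set_fan_percent(hwmon: Path, percent: float) -> None:
    """Switch the fan to manual control and set its speed (needs root)."""
    (hwmon / "pwm1_enable").write_text("1")  # 1 = manual, 2 = automatic
    (hwmon / "pwm1").write_text(str(percent_to_pwm(percent)))
```

Writing "2" back to `pwm1_enable` returns the card to automatic fan control.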

Using the auto settings (default) our cards were incredibly hot.

Here are the hard-coded limits for each card:
Code:
    fan1:       min =    0 RPM
                max = 5400 RPM

    edge: critical  = +113.0°C
          emergency =  +99.0°C

junction: critical  =  +99.0°C
          emergency =  +99.0°C

     mem: critical  =  +99.0°C
          emergency =  +99.0°C

  power1:       cap = 105.00 W
Here are the values at various fan speed percentages:
Code:
   speed:     %     auto auto      100  100       90   90       80   80
    card:     #        0    1        0    1        0    1        0    1
    fan1:   RPM     4246 3973     5636 5348     5399 5119     5077 4805
    edge:     C       87   83       76   74       77   75       79   76
junction:     C       99   94       88   83       89   84       91   86
     mem:     C       74   72       66   66       66   66       68   68
  power1:     W       91   93       88   87       89   89       89   91
The amazing thing is cooling the cards down doesn't speed up the calculations any!

We think dropping the temperature has to be good for the cards. They are certainly louder and maybe the fans will wear out sooner?

What values do you all suggest?

2020-06-17, 00:34   #2342
ewmayer
2ω=0
Sep 2002
República de California
11,399 Posts

Quote:
Originally Posted by Xyzzy
We think dropping the temperature has to be good for the cards. They are certainly louder and maybe the fans will wear out sooner?

What values do you all suggest?

Absolute temps are more or less meaningless ... even among users of the same GPU (like my R7), the temps they observe relative to where their card begins to auto-throttle differ, depending on e.g. which of the many distributed internal temp monitor points is being used by the system. I've always used throttling to tell me when I need to up the fan speed, and among my 4 R7s, all from the same manufacturer (XFX) and all running under the same OS and the same ROCm version, I see a roughly 5C range in that, from 80C for the lowest-threshold card to 85C for the highest. Thus I try to keep all temps <= 80C for simplicity.

Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans - one assumes/hopes the mfr has set the auto-throttling temps based on what is actually dangerous for the silicon, but ya never know until ya know. OTOH, depending on the design-for-repair-ability of the card, burning out a fan may be just as reliable a way to brick the card as overheating the silicon.

Alas, ROCm under Linux doesn't display fan RPM, only "fan level" as a % - I try to keep mine under 70%, because that corresponds to the level where the fan noise really gets noticeable. Putting a single cheap $15 plug-in Honeywell desk fan at one end of my test-frame build - pic below - made a huge difference: it moves a lot of air over the whole 3-GPU array for very low watts and noise, as it uses a much larger set of fan blades, i.e. needs fewer RPMs to achieve the same airflow velocity. The leftmost GPU there is the one running at sclk = 4 (the other 2 are at sclk = 3, ~50-60W less each), and yet I can set its fan speed to the minimum because the desk fan is blowing right into its fan intakes. (Well, after first hitting the 3 Android phones running Mlucas. :) That leftmost GPU is #1 in this ROCm listing; you can see its higher sclk and low fan-% setting, and yet its temp is also the lowest of the 3, due to its favorable geometry:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%  
0    80.0c  161.0W  1373Mhz  1201Mhz  63.92%  manual  250.0W    5%   100%  
1    76.0c  211.0W  1547Mhz  1151Mhz  21.96%  manual  250.0W    5%   100%  
2    78.0c  161.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W    5%   100%
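The listing can also be checked mechanically. The Python sketch below parses the three rows exactly as posted and picks out the coolest card; the helper names are hypothetical, and the parsing assumes whitespace-separated columns as in this listing rather than any guaranteed output format of the ROCm tools.

```python
LISTING = """\
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    80.0c  161.0W  1373Mhz  1201Mhz  63.92%  manual  250.0W    5%   100%
1    76.0c  211.0W  1547Mhz  1151Mhz  21.96%  manual  250.0W    5%   100%
2    78.0c  161.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W    5%   100%
"""


def parse_listing(text: str) -> list[dict[str, str]]:
    """Turn the whitespace-separated listing into one dict per GPU row."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def numeric(value: str) -> float:
    """Strip a trailing unit suffix (c, W, Mhz, %) and return the number."""
    return float(value.rstrip("cCW%Mhz"))


rows = parse_listing(LISTING)
coolest = min(rows, key=lambda row: numeric(row["Temp"]))
print("coolest GPU:", coolest["GPU"], coolest["Temp"], "fan", coolest["Fan"])
```

On the posted numbers this selects GPU 1, the card with the highest SCLK and the lowest fan setting.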
Attached image: new_build4.jpg (230.5 KB)

Last fiddled with by ewmayer on 2020-06-17 at 00:38

2020-06-17, 02:55   #2343
PhilF
Feb 2005
Colorado
479₁₀ Posts

Quote:
Originally Posted by ewmayer
Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans

I'll stress the fans. They aren't that hard to change.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.