mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2020-06-16, 00:43   #2333
Xyzzy

"Mike"
Aug 2002

166378 Posts

So our system with two cards is stable. But we couldn't get clinfo to find our cards when we booted headless. We bought some DisplayPort "dummy" dongles, but they did not work. We moved the cards around too many times. We tried just one card. We tried three. Eventually, we figured out that we need to log into at least one display's GUI to get the cards initialized. Just booting to the GDM login screen wasn't enough. This works without any dongles. We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)

2020-06-16, 08:48   #2334
preda

"Mihai Preda"
Apr 2015

22008 Posts

Quote:
 Originally Posted by Xyzzy So our system with two cards is stable. But we couldn't get clinfo to find our cards when we booted headless. [...] We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)
Yes, I confirm I see the same behavior. I'm mystified by what sort of bug may cause it.

2020-06-16, 13:36   #2335
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22·34·13 Posts

Quote:
 Originally Posted by preda Yes, I confirm I see the same behavior. I'm mystified by what sort of bug may cause it.
Try checking any power-saving options. Power saving has been seen to cause problems in other OS/hardware/software situations. Re high-$ pro compute cards: some of those have entirely different drivers, in the NVIDIA line at least. That might apply to AMD too (MI50 etc.).

Last fiddled with by kriesel on 2020-06-16 at 13:38
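One quick way to act on kriesel's suggestion is to read the amdgpu power-management state out of sysfs. The sketch below is a minimal, hypothetical helper (not part of gpuowl); it assumes the standard amdgpu `power_dpm_force_performance_level` node, whose location varies per system, and simply reports what each card is set to.

```python
from pathlib import Path

def dpm_levels(root="/sys/class/drm"):
    """Collect the amdgpu performance-level setting for each card, if present.

    Returns an empty dict on systems without amdgpu sysfs nodes.
    """
    levels = {}
    for pm in Path(root).glob("card[0-9]/device/power_dpm_force_performance_level"):
        # pm.parts[-3] is the 'cardN' component of the path
        levels[pm.parts[-3]] = pm.read_text().strip()
    return levels

print(dpm_levels())  # e.g. {'card0': 'auto'}; empty on a machine with no AMD GPU
```

To pin a card at a fixed performance level (ruling out power-saving transitions as the culprit), one would write e.g. `high` or `manual` to the same node as root.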

2020-06-16, 15:58   #2336
LaurV
Romulan Interpreter

Jun 2011
Thailand

5·11·157 Posts

Quote:
 Originally Posted by preda The savefiles are FFT-length agnostic. They only store the "compacted" residue which is not affected by FFT-len in any way. That's why one can restart the savefile with any FFT-length, changing the FFT size midway.
We love that. Don't change it!

What about the shift?

2020-06-16, 17:56   #2337
Xyzzy

"Mike"
Aug 2002

7,583 Posts

Quote:
 Originally Posted by kriesel Try checking any power saving options.
We have every power-saving option turned off.
Quote:
 Originally Posted by kriesel Re high $pro compute cards, some of those have different drivers entirely, in the NVIDIA line. Might apply to AMD too (MI50 etc). We are using the "pro" drivers. The opensource drivers don't work with Navi cards. 2020-06-16, 18:57 #2338 PhilF Feb 2005 Colorado 479 Posts Quote:  Originally Posted by Xyzzy We have every power-saving option turned off.We are using the "pro" drivers. The opensource drivers don't work with Navi cards. I used those drivers too, with a Radeon VII, and remember using the "headless" option. Maybe that option is how others are getting around the problem of having to run the GUI in order to initialize cards. 2020-06-16, 19:31 #2339 ewmayer 2ω=0 Sep 2002 República de California 1139910 Posts Quote:  Originally Posted by Prime95 BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first time tests for quite a while. Yes, the 3 errors code path could be better. It could retry once, then switch to next higher accuracy level, then try STATS, etc. I'll leave that to Mihai who is up to his eyeballs in PRP proofs. Don't expect anything soon! MIDDLE refers to what in my code is the leading (containing any odd factors of the FFT length) FFT radix, i.e. MIDDLE=13,14,15 covers FFT lengths 6.5,7,7.5M? Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work? The obvious thing that comes to my mind would be to allow bumping the various internal accuracy-setting flags up or down one, or even up or down a user-set number of notches, so long as the result lies within the available accuracy settings for the given FFT length. Simplifying your 2-param accuracy scheme for purposes of illustration: let's say those 2 params map to an ascending set of accuracy settings ranging from 0-9, 0 is least accurate (and presumably fastest), 9 is most accurate (and presumably slowest). Say the user's current exponent has a default accuracy setting mapping to 4. 
The user could fiddle that up or down one notch via -acc_flag=[+1 or -1], or some desired number of notches with -acc_flag=n, where in this case any n < -4 or > 5 would trigger a warning message and subsequently be ignored. (n = ±1 makes the most sense, since you set the defaults with some care, but other values may be useful for experimentation.) What do you have in mind here?

2020-06-16, 20:15   #2340
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2×3×7×167 Posts

Quote:
 Originally Posted by ewmayer MIDDLE refers to what in my code is the leading (containing any odd factors of the FFT length) FFT radix, i.e. MIDDLE=13,14,15 covers FFT lengths 6.5,7,7.5M? Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work?
The 5M FFT does a 1024-complex FFT, a MIDDLE=10 step, and a 256-complex FFT. So yes, MIDDLE=13,14,15 gives you 6.5, 7, 7.5M FFTs. The 1024-complex and 256-complex were the fastest first and last steps when I got involved. There was a first and last step of 512-complex that was slower and may have been deprecated. If they have not been deprecated, then a 7M FFT could have been done with a MIDDLE=7 step.

I envision a very simple accuracy-tweaking flag. Presently, the maximum bits-per-word with all accuracy options on is hardwired. Various constants based on bits-per-word indicate when it is safe to turn off each accuracy option. My scheme would be to simply increase (more aggressive) or decrease (more conservative) the FFT's maximum bits-per-word. For example, -use aggressive=0.1 or -use aggressive=-.05. In theory, one could set aggressive to 1.0 and manually set the accuracy options. That would be tedious/dangerous.

I prefer for Mihai to muck in the C++ code with its heavy std library usage. I'm a dinosaur brought up on C code.
2020-06-17, 00:14   #2341
Xyzzy

"Mike"
Aug 2002

7,583 Posts

We figured out how to control our fans and get temperature readings thanks to this article: https://wiki.archlinux.org/index.php...fs_fan_control

Using the auto settings (default) our cards were incredibly hot. Here are the hard-coded limits for each card:
Code:
fan1:      min = 0 RPM,  max = 5400 RPM
edge:      critical = +113.0°C,  emergency = +99.0°C
junction:  critical = +99.0°C,   emergency = +99.0°C
mem:       critical = +99.0°C,   emergency = +99.0°C
power1:    cap = 105.00 W
Here are the values at various fan speed percentages:
Code:
speed:     %    auto  auto  100   100   90    90    80    80
card:      #    0     1     0     1     0     1     0     1
fan1:      RPM  4246  3973  5636  5348  5399  5119  5077  4805
edge:      C    87    83    76    74    77    75    79    76
junction:  C    99    94    88    83    89    84    91    86
mem:       C    74    72    66    66    66    66    68    68
power1:    W    91    93    88    87    89    89    89    91
The amazing thing is that cooling the cards down doesn't speed up the calculations at all! We think dropping the temperature has to be good for the cards. They are certainly louder, though, and maybe the fans will wear out sooner? What values do you all suggest?

2020-06-17, 00:34   #2342
ewmayer
2ω=0

Sep 2002
República de California

11,399 Posts

Quote:
 Originally Posted by Xyzzy We think dropping the temperature has to be good for the cards. They are certainly louder and maybe the fans will wear out sooner? What values do you all suggest?
Absolute temps are more or less meaningless ... even among users of the same GPU (like my R7), the temps they observe relative to where their card begins to auto-throttle differ, depending e.g. on which of the many distributed internal temp-monitor points the system is using. I've always used throttling to tell me when I need to up the fan speed, and among my 4 R7s, all from the same manufacturer (XFX) and all running under the same OS and same ROCm version, I see a roughly 5C range in that threshold, from 80C for the lowest-threshold card to 85C for the highest. Thus I try to keep all temps <= 80C for simplicity.
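For anyone following the same Arch wiki recipe, the hwmon interface it describes takes a duty cycle of 0-255 in `pwm1`, with `pwm1_enable` set to 1 for manual control. The sketch below is a hypothetical helper along those lines, not anything from gpuowl; the hwmon path varies per system, and writing to it needs root.

```python
from pathlib import Path

def pct_to_pwm(pct):
    """Map a fan-speed percentage to the 0-255 range of the pwm1 node,
    clamping out-of-range requests."""
    return round(max(0, min(100, pct)) * 255 / 100)

def set_fan(hwmon, pct):
    """Set a fixed fan duty cycle on an amdgpu hwmon directory,
    e.g. set_fan(Path('/sys/class/hwmon/hwmon3'), 80). Needs root."""
    (hwmon / "pwm1_enable").write_text("1")            # 1 = manual control
    (hwmon / "pwm1").write_text(str(pct_to_pwm(pct)))  # 0-255 duty cycle

print(pct_to_pwm(80))  # -> 204
```

Writing `2` back to `pwm1_enable` returns the card to automatic fan control.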
Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans. One assumes/hopes the manufacturer has set the auto-throttling temps based on what is actually dangerous for the silicon, but ya never know until ya know. OTOH, depending on the design-for-repairability of the card, burning out a fan may be just as reliable a way to brick the card as overheating the silicon. Alas, ROCm under Linux doesn't display fan RPM, only "fan level" as a %. I try to keep mine under 70%, because that is the level where the fan noise really gets noticeable. Putting a single cheap $15 plug-in Honeywell desk fan at one end of my test-frame build (pic below) made a huge difference: it moves a lot of air over the whole 3-GPU array for very low watts and noise, because it uses a much larger set of fan blades, i.e. needs fewer RPMs to achieve the same airflow velocity. The leftmost GPU there is the one running at sclk = 4 (the other 2 are at sclk = 3, ~50-60W each less), and yet I can set its fan speed to the minimum because the desk fan is blowing right into its fan intakes. (Well, after first hitting the 3 Android phones running Mlucas. :) That leftmost GPU is #1 in this ROCm listing; you can see its higher sclk and low fan-% setting, and yet its temp is also the lowest of the 3, due to its favorable geometry:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    80.0c  161.0W  1373Mhz  1201Mhz  63.92%  manual  250.0W    5%   100%
1    76.0c  211.0W  1547Mhz  1151Mhz  21.96%  manual  250.0W    5%   100%
2    78.0c  161.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W    5%   100%
Attached Thumbnails

Last fiddled with by ewmayer on 2020-06-17 at 00:38

2020-06-17, 02:55   #2343
PhilF

Feb 2005

479 Posts

Quote:
 Originally Posted by ewmayer Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans
I'll stress the fans. They aren't that hard to change.


