So our system with two cards is stable.
But we couldn't get [c]clinfo[/c] to find our cards when we booted headless. We bought some DisplayPort [URL="https://www.amazon.com/FUERAN-DP-DisplayPort-emulator-2560x1600/dp/B071CGCTMY"]"dummy"[/URL] dongles but they did not work. We moved the cards around too many times. We tried just one card. We tried three. Eventually, we figured out [SPOILER]that we need to log into at least one display's GUI to get the cards initialized. Just booting to the GDM login screen wasn't enough. This works without any dongles. We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)[/SPOILER] :mike:
[QUOTE=Xyzzy;548122]So our system with two cards is stable.
But we couldn't get [c]clinfo[/c] to find our cards when we booted headless. We bought some DisplayPort [URL="https://www.amazon.com/FUERAN-DP-DisplayPort-emulator-2560x1600/dp/B071CGCTMY"]"dummy"[/URL] dongles but they did not work. We moved around the cards too many times. We tried just one card. We tried three. Eventually, we figured out [SPOILER]that we need to log into at least one display's GUI to get the cards initialized. Just booting to the GDM login screen wasn't enough. This works without any dongles. We wonder how people who have no display connections at all on their cards get them running? (The serious $$$ compute cards have no display connectors.)[/SPOILER] :mike:[/QUOTE] Yes, I can confirm I see the same behavior; I'm mystified by what sort of bug may cause it.
[QUOTE=preda;548147]Yes I confirm I see the same behavior, I'm mystified by what sort of bug may cause it..[/QUOTE]
Try checking any power-saving options. Power saving has been seen to cause problems in other OS/hardware/software combinations. Re high-$ pro compute cards: some of those use entirely different drivers, at least in the NVIDIA line. The same might apply to AMD too (MI50 etc.).
[QUOTE=preda;548114]The savefiles are FFT-length agnostic. They only store the "compacted" residue which is not affected by FFT-len in any way. That's why one can restart the savefile with any FFT-length, changing the FFT size midway.[/QUOTE]
We love that. Don't change it! When the shift? :razz:
[QUOTE=kriesel;548158]Try checking any power saving options.[/QUOTE]We have every power-saving option turned off.[QUOTE=kriesel;548158]Re high $ pro compute cards, some of those have different drivers entirely, in the NVIDIA line. Might apply to AMD too (MI50 etc).[/QUOTE]We are using the "pro" drivers. The open-source drivers don't work with Navi cards.
:mike:
[QUOTE=Xyzzy;548185]We have every power-saving option turned off.We are using the "pro" drivers. The opensource drivers don't work with Navi cards.
:mike:[/QUOTE] I used those drivers too, with a Radeon VII, and I remember using the "headless" option. Maybe that option is how others get around the problem of having to run the GUI in order to initialize the cards.
[QUOTE=Prime95;548120]BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first time tests for quite a while.
Yes, the 3 errors code path could be better. It could retry once, then switch to next higher accuracy level, then try STATS, etc. I'll leave that to Mihai who is up to his eyeballs in PRP proofs. Don't expect anything soon![/QUOTE] MIDDLE refers to what in my code is the leading FFT radix (the one containing any odd factors of the FFT length), i.e. MIDDLE=13,14,15 covers FFT lengths 6.5, 7, 7.5M?

Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work? The obvious thing that comes to my mind would be to allow bumping the various internal accuracy-setting flags up or down one notch, or even up or down a user-set number of notches, so long as the result lies within the available accuracy settings for the given FFT length.

Simplifying your 2-param accuracy scheme for purposes of illustration: let's say those 2 params map to an ascending set of accuracy settings ranging from 0-9, where 0 is least accurate (and presumably fastest) and 9 is most accurate (and presumably slowest). Say the user's current exponent has a default accuracy setting mapping to 4. The user could fiddle that up or down one notch via -acc_flag=[+1 or -1], or some desired number of notches with -acc_flag=n, where for this case any n < -4 or n > 5 would trigger a warning message and subsequently be ignored. (n = +-1 makes the most sense, as you set the defaults with some care, but other values may be useful for experimentation.) What do you have in mind here?
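For what it's worth, the clamp-and-warn behavior proposed above is simple enough to sketch in a few lines. This is only an illustration of the proposal (the function and flag names here are made up, not actual gpuowl/Mlucas code):

```python
def apply_acc_flag(default_level: int, n: int, lo: int = 0, hi: int = 9) -> int:
    """Bump an accuracy level by n notches, clamped to the available range.

    default_level: the built-in accuracy setting for this FFT length (0-9
    in the simplified scheme above). n: the user's -acc_flag adjustment,
    e.g. +1 (more accurate/slower) or -1 (less accurate/faster).
    Out-of-range requests are warned about and ignored, per the proposal.
    """
    target = default_level + n
    if target < lo or target > hi:
        print(f"warning: -acc_flag={n:+d} out of range for "
              f"default level {default_level}; ignored")
        return default_level
    return target
```

With a default of 4, this accepts any n from -4 to +5 (landing on 0..9) and rejects the rest, matching the worked example in the post: apply_acc_flag(4, 1) gives 5, while apply_acc_flag(4, 6) warns and returns 4.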
[QUOTE=ewmayer;548192]MIDDLE refers to what in my code is the leading (containing any odd factors of the FFT length) FFT radix, i.e. MIDDLE=13,14,15 covers FFT lengths 6.5,7,7.5M?
Re. the accuracy-tweaking user flag you say you hope to deploy soon - how will that work? [/QUOTE] The 5M FFT does a 1024-complex FFT, a MIDDLE=10 step, and a 256-complex FFT. So yes, MIDDLE=13,14,15 gives you 6.5, 7, 7.5M FFTs. The 1024-complex and 256-complex were the fastest first and last steps when I got involved. There were first and last steps of 512-complex that were slower and may have been deprecated. If they have not been deprecated, then the 7M FFT could have been done with a MIDDLE=7 step.

I envision a very simple accuracy-tweaking flag. Presently, the maximum bits-per-word with all accuracy options on is hardwired. Various constants based on bits-per-word indicate when it is safe to turn off each accuracy option. My scheme would be to simply increase (more aggressive) or decrease (more conservative) the FFT's maximum bits-per-word. For example, -use aggressive=0.1 or -use aggressive=-0.05. In theory, one could set aggressive to 1.0 and manually set the accuracy options, but that would be tedious/dangerous.

I prefer for Mihai to muck in the C++ code with its heavy std library usage. I'm a dinosaur brought up on C code.
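To illustrate the scheme (illustrative names and numbers only; the real hardwired limits are per-FFT-length and live in the program): bits-per-word is just exponent / FFT-length, and the proposed aggressive flag shifts the hardwired cutoff by a user-chosen delta:

```python
def bits_per_word(exponent: int, fft_length: int) -> float:
    """Average bits of the residue packed into each FFT word: p / N."""
    return exponent / fft_length

def fits(exponent: int, fft_length: int, max_bpw: float,
         aggressive: float = 0.0) -> bool:
    """Sketch of the proposed knob: shift the hardwired bits-per-word
    limit by a delta. aggressive > 0 permits denser packing (faster,
    riskier); aggressive < 0 is more conservative."""
    return bits_per_word(exponent, fft_length) <= max_bpw + aggressive
```

For instance, with an (illustrative) 18.0 bits-per-word limit, a 4M FFT (4*1024*1024 words) covers exponents up to about 75.5M; "-use aggressive=0.1" would stretch that by roughly 419k more bits of exponent.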
We figured out how to control our fans and get temperature readings thanks to this article: [URL]https://wiki.archlinux.org/index.php/Fan_speed_control#AMDGPU_sysfs_fan_control[/URL]
Using the auto settings (default) our cards were incredibly hot. Here are the hard-coded limits for each card:[CODE]fan1:      min = 0 RPM, max = 5400 RPM
edge:      critical = +113.0°C, emergency = +99.0°C
junction:  critical = +99.0°C, emergency = +99.0°C
mem:       critical = +99.0°C, emergency = +99.0°C
power1:    cap = 105.00 W[/CODE]Here are the values at various fan-speed percentages:[CODE]speed: %        auto  auto   100   100    90    90    80    80
card: #            0     1     0     1     0     1     0     1
fan1: RPM       4246  3973  5636  5348  5399  5119  5077  4805
edge: °C          87    83    76    74    77    75    79    76
junction: °C      99    94    88    83    89    84    91    86
mem: °C           74    72    66    66    66    66    68    68
power1: W         91    93    88    87    89    89    89    91[/CODE]The amazing thing is that cooling the cards down doesn't speed up the calculations any! We think dropping the temperature has to be good for the cards. They are certainly louder now, though, and maybe the fans will wear out sooner?

What values do you all suggest?

:mike:
[QUOTE=Xyzzy;548220] We think dropping the temperature has to be good for the cards. They are certainly louder and maybe the fans will wear out sooner?
What values do you all suggest? :mike:[/QUOTE] Absolute temps are more or less meaningless ... even among users of the same GPU (like my R7), the temps they observe relative to where their card begins to auto-throttle differ, depending on e.g. which of the many distributed internal temp-monitor points is being used by the system. I've always used throttling to tell me when I need to up the fan speed, and among my 4 R7s, all from the same manufacturer (XFX), and all running under the same OS and same ROCm version, I see a roughly 5C range in that, from 80C for the lowest-threshold card to 85C for the highest. Thus I try to keep all temps <= 80C for simplicity.

Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans - one assumes/hopes the mfr has set the auto-throttling temps based on what is actually dangerous for the silicon, but ya never know until ya know. OTOH, depending on the design-for-repairability of the card, burning out a fan may be just as reliable a way to brick the card as overheating the silicon. Alas, ROCm under Linux doesn't display fan RPM, only "fan level" as a % - I try to keep mine under 70%, because that corresponds to the level where the fan noise really gets noticeable.

Putting a single cheap $15 plug-in Honeywell desk fan at one end of my test-frame build - pic below - made a huge difference. It moves a lot of air over the whole 3-GPU array for very low watts and noise, since it uses a much larger set of fan blades, i.e. needs fewer RPMs to achieve the same airflow velocity. The leftmost GPU there is the one running at sclk = 4 (the other 2 are at sclk = 3, ~50-60W each less), and yet I can set its fan speed to the minimum because the desk fan is blowing right into its fan intakes. (Well, after first hitting the 3 Android phones running Mlucas.) :)

That leftmost GPU is #1 in this ROCm listing; you can see its higher sclk and low fan-% setting, and yet its temp is also the lowest of the 3, due to its favorable geometry: [code]GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    80.0c  161.0W  1373Mhz  1201Mhz  63.92%  manual  250.0W  5%    100%
1    76.0c  211.0W  1547Mhz  1151Mhz  21.96%  manual  250.0W  5%    100%
2    78.0c  161.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W  5%    100%[/code]
[QUOTE=ewmayer;548222]Hard to say which is more dangerous, running hot (but not throttling-hot) or stressing the fans[/quote]
I'll stress the fans. They aren't that hard to change. :tire: