I tried the washer mod with minimal success, YMMV: [URL]https://www.mersenneforum.org/showpost.php?p=521689&postcount=13[/URL]
edit: [quote]Warning: the thermal pad that the GPU comes with is very good -- it's hard to replace it with anything of comparable performance. When taking the cooler apart, the thermal pad may be damaged and need to be replaced, which would be a net loss. Personally I would recommend against trying the washer hack.[/quote]The thermal pad is good enough, and it will need replacing if you remove the cooler, but even the moderately decent thermal paste I used has better effective conductivity (the pad has better thermal properties on paper, but paper is misleading: at equal thickness the pad would win, but the paste is applied much more thinly). That said, IMO it's not worth doing a repaste or the washer mod.
[QUOTE=M344587487;536079]I tried the washer mod with minimal success, YMMV: [url]https://www.mersenneforum.org/showpost.php?p=521689&postcount=13[/url][/QUOTE]
Thanks - but did you first try just the washers before doing the other stuff? If you tried a bunch of different things at once you risk what the terminology of clinical trials in medicine calls "confounding effects".
After a fresh Ubuntu 19.10 install on my ~6-year-old Haswell system and several afternoons' work, including some awkward Dremel hackery of both the R7 mounting bracket and the back of my ATX case in order to resolve a geometric mismatch there, the R7 is in and recognized by the OS; lspci shows 2 R7 entries:
[i]03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 HDMI Audio [Radeon VII][/i]

In terms of needed drivers, Matt (a.k.a. M344587487) noted this: "Ubuntu 19.10 uses kernel 5.3, which means the open-source AMD driver that's built into the kernel can handle the Vega 20. If you were on an earlier kernel you'd need to install the amdgpu-pro driver from AMD's site, but you should be good. Something you might need is the Vega 20 firmware; there was a strange period where the kernel had the right drivers but some distros hadn't caught up to providing Vega 20 firmware. To check if you have the firmware, open a terminal and run 'ls /lib/firmware/amdgpu/vega20*'." That list command shows 13 vega20_*.bin files, so that seems set to go.

But -- and I was clued in to the problem by my usual Mlucas 4-thread job on the Haswell CPU running 3x slower than usual -- there is some kind of misconfiguration/driver problem remaining. 'top' shows multiple cycle-eating 'systemd-udevd' and 'modprobe' processes. Invoking 'dmesg' shows what appears to be the problem, endless repeats of this message:

[i]NVRM: No NVIDIA graphics adapter found!
nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
nvidia-nvlink: Nvlink Core is being initialized, major device number 238[/i]

It's not clear to me which of the following 3 possible causes is the likely culprit:

1. Preparing to install the R7, I first removed an old nvidia gtx430 card from the PCI 2.0 slot (seems unlikely, because I quickly found the issue with the R7 mounting bracket after that, at which point I rebooted sans any gfx card, and had been running happily for several days like that);
2. The R7 needs some nVidia drivers and is not finding them;
3. The system is detecting *a* new video card - brand not important - and doing something nVidia-ish as a result.
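A quick way to narrow down which of the three it is, assuming the usual Ubuntu tool names and sysfs paths, is to check which kernel module has actually claimed the card and whether any leftover nvidia modules are still loading (a sketch, not verified on this exact box):

```shell
# Which kernel module has claimed the display device?
# (Expect "Kernel driver in use: amdgpu" for the Radeon VII.)
vga=$(lspci -nnk 2>/dev/null | grep -A3 -E 'VGA|Display' \
      || echo "lspci unavailable or no display device listed")
echo "$vga"

# Are any leftover nvidia kernel modules still being loaded?
loaded=$(lsmod 2>/dev/null | grep -i nvidia || echo "no nvidia modules loaded")
echo "$loaded"

# Is the Vega 20 firmware the in-kernel driver needs actually present?
fw=$(ls /lib/firmware/amdgpu/vega20*.bin 2>/dev/null \
     || echo "no vega20 firmware found")
echo "$fw"
```

If "Kernel driver in use" shows amdgpu and no nvidia modules are loaded, the dmesg spam most likely comes from leftover nvidia userspace packages repeatedly modprobing a driver for a card that is no longer there, rather than from anything the R7 needs.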
Maybe it is just easier to back up, reinstall afresh and, like most of us do, use the ROCm drivers.
I will be interested in how the R7 performs on a PCIE-2 slot rather than a PCIE-3...
1 Attachment(s)
[QUOTE=paulunderwood;536345]Maybe it is just easier to back up, reinstall afresh and, like most of us do, use the ROCm drivers.

I will be interested in how the R7 performs on a PCIE-2 rather than a PCIE-3...[/QUOTE]

My old gtx430 was in the PCIE-2 slot ... the R7 is in the PCIE-3 one, plus it uses both of the 8-pin power connectors on this system's PSU. It also needed me to use my Dremel with a small cutting wheel to chop out the metal bridge between the 2 back-of-case PCI cutouts used by the R7. Here's the gory post-surgery picture of the patient's innards:
Re. the nVidia-related dmesg errors in post #47, one additional possibility occurs to me ... the only nVidia drivers I ever explicitly installed were under the old headless Debian setup, which I blew away.
I removed the nVidia card a week ago, in prep. for trying to install the R7. However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?
[QUOTE=ewmayer;536081]Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".[/QUOTE]
I did both at once and ruined the clinical trial; my understanding that the paste made a bigger difference than the washer mod comes from a tech youtuber, so YMMV.

[QUOTE=ewmayer;536356]... However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?[/QUOTE]

Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things. The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit, preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical, but highly recommended, that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable.

This command should list all nvidia packages; there should be a few dozen of them:
[code]dpkg -l | grep -i nvidia[/code]
Purge all packages beginning with nvidia-, which should also remove their dependencies:
[code]sudo apt-get remove --purge '^nvidia-.*'[/code]
Reinstall ubuntu-desktop, which was just erroneously removed:
[code]sudo apt-get install ubuntu-desktop[/code]
Then reboot and see where you stand.
[QUOTE=M344587487;536371]Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems ... it can't hurt to try purging nvidia from the system if you feel like it ... This command should list all nvidia packages; there should be a few dozen of them:
[code]dpkg -l | grep -i nvidia[/code]
Purge all packages beginning with nvidia-, which should also remove their dependencies:
[code]sudo apt-get remove --purge '^nvidia-.*'[/code]
Reinstall ubuntu-desktop, which was just erroneously removed:
[code]sudo apt-get install ubuntu-desktop[/code]
Then reboot and see where you stand.[/QUOTE]

Thanks, Matt - I PMed you the 'before' and 'after' results of 'dpkg -l | grep -i nvidia' ... on reboot, I still quickly get a "system program problem detected" popup (but now only one, versus multiple before), which I dismiss, but 'dmesg' now shows no more of the repeating nVidia crud. I PMed you the shortlist of bold-highlighted warnings/errors I did find in the dmesg output, one of which involves a vega20*.bin firmware file, namely

[i][ 2.517924] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2[/i]

I see 13 files in the /lib/firmware/amdgpu/vega20*.bin set which Ubuntu 19.10 auto-installed, but no vega20_ta.bin among them; probably I just need to grab that one separately. Most importantly, 'top' no longer shows any out-of-control system processes, and my Mlucas runs on the CPU are once again back at normal throughput. So, progress!
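On grabbing the missing vega20_ta.bin separately: the file ships in the upstream linux-firmware tree, so one approach that may work is to fetch it directly (the URL below follows the git.kernel.org linux-firmware layout as I understand it -- verify before relying on it), copy it into /lib/firmware/amdgpu/, and rebuild the initramfs:

```shell
# Hypothetical fetch of the one missing firmware blob; the URL follows
# the upstream linux-firmware repository layout and should be verified.
url="https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/vega20_ta.bin"

if wget -q -O /tmp/vega20_ta.bin "$url" 2>/dev/null; then
    result="fetched /tmp/vega20_ta.bin; next: sudo cp it to /lib/firmware/amdgpu/ and run 'sudo update-initramfs -u'"
else
    result="download failed; check the URL against the current linux-firmware tree"
fi
echo "$result"
```

Alternatively, a newer linux-firmware package from the distro may already carry it. Since error -2 is ENOENT (file not found) and only affects the 'ta' blob, the card apparently runs fine without it, as the restored throughput suggests.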
Now that Super Bowl Sunday (a quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update: the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" [url=https://www.mersenneforum.org/showpost.php?p=511655&postcount=76]here[/url], and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that exponent via command-line flags. I tried just sticking a prime exponent in there, then running without any arguments whatever; both gave the following kind of error:
[code]ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269
2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e
2020-02-01 18:43:36 Note: not found 'config.txt'
2020-02-01 18:43:36 config: 90110269
2020-02-01 18:43:36 device 0, unique id ''
2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:43:36 Bye
ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl
2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e
2020-02-01 18:44:02 Note: not found 'config.txt'
2020-02-01 18:44:02 device 0, unique id ''
2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-02-01 18:44:02 Bye[/code]

Matt had noted to me, "If the PRP test starts we are good to go. If it fails with something along the lines of clGetDeviceId then gpuowl couldn't see the card." How to debug that latter problem?

Looking ahead, the first 2 steps of the setup-for-2-instances script are these:

[code]#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage[/code]

How do I find the max stock voltage? rocm-smi gives a bunch of things, but not that:

[code]GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
1    31.0c  21.0W   809Mhz  351Mhz  21.96%  auto  250.0W  0%    0%[/code]

...and fiddling with various values of "/opt/rocm/bin/rocm-smi --setfan [n]" to set a constant fan speed causes the Fan value in the above to rise and fall. Thanks for any help from current gpuowl users.
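For what it's worth, DEVICE_NOT_FOUND from clGetDeviceIDs usually points at the OpenCL userspace rather than the kernel driver: the in-kernel amdgpu driver provides no OpenCL runtime by itself; that comes from ROCm or the amdgpu-pro OpenCL component. A sketch of checks that may narrow it down (group names and paths are common ROCm defaults, not verified against this exact setup):

```shell
# 1. Does the OpenCL runtime see any platform at all?
#    ('clinfo' is in the Ubuntu package of the same name.)
platforms=$(clinfo -l 2>/dev/null)
[ -n "$platforms" ] || platforms="clinfo not installed, or no OpenCL platforms found"
echo "$platforms"

# 2. Is an ICD file present? ROCm/amdgpu-pro drop one here, pointing
#    the OpenCL loader at their implementation.
icds=$(ls /etc/OpenCL/vendors/*.icd 2>/dev/null || echo "no ICD files found")
echo "$icds"

# 3. Does the current user have access to the GPU device nodes?
#    Non-root users typically need the 'video' and/or 'render' groups.
groups_out=$(id -nG)
echo "current groups: $groups_out"
ls -l /dev/kfd /dev/dri 2>/dev/null || echo "no ROCm/DRM device nodes visible"
```

If clinfo lists zero platforms, the OpenCL runtime needs installing first; if it works under sudo but not as a normal user, adding that user to the video/render groups and logging back in is the usual fix.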
[QUOTE=ewmayer;536581]... First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that exponent via command-line flags. I tried just sticking a prime exponent in there, then running without any arguments whatever; both gave the following kind of error:

[code]2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs[/code][/QUOTE]

Run as root (or sudo) with the [c]-user ewmayer[/c] switch (or is it --user? I just run as root.)

Start with fans at 170; monitor the temperatures and, depending on your overclock, undervolt and ambient temperature, you might be able to reduce the fan speed.
[QUOTE=paulunderwood;536593]Run as root (or sudo) with the [c]-user ewmayer[/c] switch (or is it --user? I just run as root.)

Start with fans at 170; monitor the temperatures and, depending on your overclock and undervolt, you might be able to reduce the fan speed.[/QUOTE]

Thanks - per the readme, it's a single minus sign ... from within a subdir 'run0', where I have created a worktodo.txt file containing a pair of PRP assignments, I tried 'sudo ../gpuowl -user ewmayer' ... after entering my sudo password, the run echoed the same as the 2nd #fail above, just with an added 'config: -user ewmayer' line. Trying instead to log in as root and run that way [this is the Ubuntu 19.10 setup I created last week] using the same password gives 'Authentication failure'. I don't recall entering any other password during the set-password phase of the Ubuntu 19.10 setup. Not needed yet, since I can't run at all, but how do I determine the max stock voltage of my R7?
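On the max-stock-voltage question: with the in-kernel amdgpu driver, the card exposes its stock frequency/voltage curve through the same pp_od_clk_voltage file the init script writes to, so reading it before changing anything shows the ceiling. A sketch, parsing an illustrative sample of what that file typically contains on a Radeon VII (the real numbers vary card to card):

```shell
# On the live system the table comes straight from sysfs:
#   cat /sys/class/drm/card0/device/pp_od_clk_voltage
# Here an illustrative sample of that file is parsed instead; the
# voltages below are typical Radeon VII numbers, not from this card.
table="OD_SCLK:
0:        808Mhz
1:       1801Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        808Mhz        712mV
1:       1304Mhz        862mV
2:       1801Mhz       1077mV
OD_RANGE:
SCLK:     808Mhz       2200Mhz
MCLK:     351Mhz       1200Mhz"

# The stock maximum is the highest voltage in the OD_VDDC_CURVE section.
max_mv=$(printf '%s\n' "$table" | awk '
    /OD_VDDC_CURVE/ { in_curve=1; next }
    /OD_SCLK|OD_MCLK|OD_RANGE/ { in_curve=0 }
    in_curve { gsub(/mV/, "", $3); if ($3+0 > max) max = $3+0 }
    END { print max }')
echo "max stock voltage in this sample: ${max_mv}mV"
```

The highest OD_VDDC_CURVE point is the stock maximum; per the script's comment, set the 'vc 2 1801 <mV>' value about 50mV below it and tune down from there.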
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.