mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Radeon VII @ newegg for 500 dollars US on 11-27 (https://www.mersenneforum.org/showthread.php?t=24979)

ewmayer 2020-05-13 21:58

Update: I further installed clinfo and libncurses5 - thanks, George - and we are up and running. Gratifyingly, with sclk=4 and fan=120 *and* gpuowl running, the fan noise is actually much more tolerable than in last night's experiment, where I simply tried upping to fan=120 without any significant compute load on the card. So there is still hope that this beast could find a home tucked away in some corner of the apartment without my having to resort to extreme sound-damping measures.

In preparation for completing this build, a couple of questions:

1. How do I get the system set up to switch back to using the onboard gfx for the display interface, rather than the hdmi-out on the card? I'd like to be able to use the new vga-to-hdmi adapter (bought once I realized the old one was bad and was hosing setup of my new build) for interfacing with my Odroid as needed for ARM builds. I've also bought a vga-to-dvi-d adapter to be able to use the latter output on the mobo, but need to get the system to use it rather than defaulting to the GPU;

2. Does anything need doing system-setup-wise before adding cards 2 and 3, or just shut down, physically install, boot up? Assuming the system recognizes the added cards, is it then just a matter of adding the proper -d device flag to rocm-smi commands and gpuowl invocations?

paulunderwood 2020-05-13 22:00

2. I am glad you got the rig working with one card. Measure the wall wattage carefully at --setsclk 3. When you install all three GPUs you want to stay within safe limits, otherwise things will fry! Have a fail-safe way of booting the machine that sets the sclk to 3 (if that speed is tolerable).
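One way to make that fail-safe concrete is a oneshot systemd unit that caps the clocks at every boot, before any compute load starts. This is only a sketch: the unit name is made up, and the rocm-smi path should be adjusted to your ROCm install.

```shell
# Illustrative systemd unit: cap sclk to a PSU-safe level on every boot.
# Unit name is hypothetical; adjust the rocm-smi path for your install.
sudo tee /etc/systemd/system/gpu-safe-clocks.service <<'EOF'
[Unit]
Description=Cap GPU sclk to a PSU-safe level
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/opt/rocm/bin/rocm-smi --setsclk 3
EOF
sudo systemctl enable gpu-safe-clocks.service
```

That way a plain reboot is always enough to get back to a known-safe power draw, even if an interactive session is not possible.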

1. I suspect your motherboard defaults to the plugged-in card. Linux will be efficient when you ssh into it.

preda 2020-05-13 22:19

[QUOTE=ewmayer;545295]now gpuowl starts but immediately coredumps:
[code]2020-05-13 13:31:31 gpuowl v6.11-278-ga39cc1a
2020-05-13 13:31:31 Note: not found 'config.txt'
2020-05-13 13:31:31 device 0, unique id 'df7080c172fd5d6e'
2020-05-13 13:31:31 df7080c172fd5d6e 104954387 FFT: 5.50M 1K:11:256 (18.20 bpw)
2020-05-13 13:31:31 df7080c172fd5d6e Expected maximum carry32: 50D10000
Segmentation fault (core dumped)[/code][/QUOTE]

Can you please rebuild with debug symbols (add "-g" to CXXFLAGS), and afterwards run the executable under gdb, to see where it coredumps.

[code]> gdb ./gpuowl
(gdb) r -prp 104954387
[segfault]
(gdb) bt    (to see the stack)[/code]

Alternatively, enable coredump files ("ulimit -c unlimited"), get a coredump file after the crash, load it with gdb and see where it segfaults. (This still needs the build with -g.)

Edit: sorry, I missed the follow-up messages; it seems the problem is solved (probably the libncurses thing), good.

ewmayer 2020-05-13 22:20

[QUOTE=paulunderwood;545309]2. I am glad you got the rig working with one card. Measure the wall wattage carefully at --setsclk 3. When you install all three GPUs you want to stay in safe limits, otherwise things will fry! Have a fail-safe way of booting the machine to set the sclk to 3 (if that is tolerable.)[/QUOTE]
Good point, will do - currently the box is simply plugged into the wall, but I have a 2nd wattmeter (the first is on the Haswell+R7 system) ready to go. First I'm gonna let things run for a few hours on the one card at the current sclk=4, then shut down, re-plug in via the wattmeter, boot up and try 2-jobs-one-card at sclk=3 and 4. The CPU is unloaded except for system tasks, so hopefully the wattage will point to 3 cards being runnable using my 850W PSU, even if I have to drop down a smidge to sclk=3. Also, I haven't yet tried doing any mem-clock fiddles on this new system - my first system (Haswell+R7 under rocm 2.10) didn't support them, but per Mihai they can be very useful for maximizing FLOPs/Watt.

[QUOTE]1. I suspect you motherboard defaults to the plugged-in card. Linux will be efficient when you ssh into it.[/QUOTE]
The current pair of jobs were started and are being monitored via ssh from my laptop - I simply used 'nohup' to invoke the program, and am now logged out, but the fan noise indicates all is well. I'll just leave the wifi usb stick on this system enabled ... I suppose if I were worried about security I could leave the wifi enabled but only physically insert the stick for occasional sshing-in. The wattmeter makes an excellent means of is-everything-running-as-normal monitoring.

preda 2020-05-13 22:38

[QUOTE=ewmayer;545307]is it then just a matter of adding the proper -d device flag to rocm-smi commands and gpuowl invocations?[/QUOTE]

I find running gpuowl with -uid <id> much more useful than running with -d <position>. This way the identity of the card is preserved even when swapping it around the PCIe slots.

And the script tools/device.py can be used to convert the UID to a -d "position" for rocm-smi.
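As a concrete sketch, using the uid and exponent already quoted in this thread (the exact calling convention of tools/device.py may differ by gpuowl version, so treat these invocations as illustrative):

```shell
# Pin gpuowl to a card by its unique id rather than its PCIe position:
./gpuowl -uid df7080c172fd5d6e

# Translate that uid into a -d position usable with rocm-smi
# (per the post above; argument format is an assumption):
python3 tools/device.py df7080c172fd5d6e
```

The advantage shows up exactly in the multi-card scenario being planned here: the uid follows the physical card, so savefiles and per-card settings survive a slot shuffle.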

kriesel 2020-05-14 00:11

[QUOTE=ewmayer;545307]1. How do I get the system setup to switch back to using the onboard gfx for the display interface, rather than the hdmi-out on the card?[/QUOTE]On at least some motherboards, there's a BIOS setting to lock it to the igp. Asrock H81 is like that, as is this Dell laptop.

ewmayer 2020-05-14 19:05

After no problems running 2 gpuOwl instances overnight, I just shut down and re-plugged the system in via a wattmeter, in preparation for wattage tests with 1, 2 and eventually 3 GPUs installed. On reboot, the system is not finding the GPU - invoking gpuowl has it looking for the same-ID device as yesterday, but not finding it:
[code]2020-05-14 11:58:16 gpuowl v6.11-278-ga39cc1a
2020-05-14 11:58:16 Note: not found 'config.txt'
2020-05-14 11:58:16 device 0, unique id 'df7080c172fd5d6e'
2020-05-14 11:58:16 df7080c172fd5d6e 104954387 FFT: 5.50M 1K:11:256 (18.20 bpw)
2020-05-14 11:58:16 df7080c172fd5d6e Expected maximum carry32: 50D10000
2020-05-14 11:58:16 df7080c172fd5d6e Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-05-14 11:58:16 df7080c172fd5d6e Bye[/code]
And clinfo shows 0 devices:
[code]Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3098.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD

Platform Name AMD Accelerated Parallel Processing
Number of devices 0

NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) AMD Accelerated Parallel Processing
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)
clCreateContext(NULL, ...) [default] No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform

ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1[/code]
Looking through the long dmesg log now to see what I can see ... I *really* hope the 'shutdown -h' and reboot didn't bork the card.

paulunderwood 2020-05-14 19:26

What does [c]uname -a[/c] give? Are you using rocm v3.3? What command are you issuing to try and run gpuowl?

ewmayer 2020-05-14 19:34

[QUOTE=paulunderwood;545395]What does [c]uname -a[/c] give? Are you using rocm v3.3? What command are you issuing to try and run gpuowl?[/QUOTE]

In order:

Linux ewmayer-gimp 5.3.0-51-generic #44-Ubuntu SMP Wed Apr 22 21:09:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
yes
[From within one of my 2 run-subdirs above the main gpuowl dir] ../gpuowl

The executable is clearly firing up and trying to resume from where it left off, but is not finding a valid device with the ID in question to run on (not sure whether it caches that ID in a local file or queries the system for it).

One more diagnostic data point - on the post-shutdown reboot I had the video cable disconnected from the card, figuring on doing remote management via ssh. I did another shutdown just now, plugged the cable back in, and on reboot the display showed the expected BIOS menu, followed by boot into Ubuntu. So the basic video-out part of the GPU must be functioning.

paulunderwood 2020-05-14 19:42

[QUOTE=ewmayer;545396]In order:

[From within one of my 2 run-subdirs above the main gpuowl dir] ../gpuowl
[/QUOTE]

Don't you need to prefix with a sudo command?

ewmayer 2020-05-14 19:49

[QUOTE=paulunderwood;545398]Don't you need to prefix with a sudo command?[/QUOTE]

I don't need to do that on my Haswell+R7 system ... but on this new build, despite the gpuowl executable showing regular-user x permission, that seems to do the trick - back up and running. Thanks!

Still not understanding why clinfo would show 0 devices post-boot, find none when attempting to run the program as a regular user, then suddenly find the card when run via sudo ... is gpuowl initing the card now, when run via sudo?
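For what it's worth, a common cause of exactly this symptom under ROCm (clinfo reports 0 devices as a regular user while everything works under sudo) is device-node permissions: /dev/kfd and the /dev/dri render nodes are typically accessible only to the video and/or render groups. A hedged sketch of the usual fix - group names vary by distro, so check the node ownership first, and log out and back in afterwards:

```shell
# Check which groups may open the ROCm compute device nodes:
ls -l /dev/kfd /dev/dri/

# Add the regular user to the owning groups so gpuowl/clinfo
# work without sudo (re-login required for this to take effect):
sudo usermod -aG video,render "$USER"
```

If that is the issue, running via sudo isn't "initing" the card at all; root simply bypasses the group check on the device nodes.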

[b]Edit:[/b] New build wattages:

o On powerup, one R7 plugged in but unloaded: 40W
o 2 gpuowl instances @5.5M FFT, sclk = 4: 1350 us/iter, 285W
o 2 gpuowl instances @5.5M FFT, sclk = 3: 1425 us/iter, 230W
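Putting numbers on the FLOPs/Watt tradeoff, the combined throughput per watt for the two instances works out directly from the figures above (pure arithmetic on the quoted values, nothing newly measured):

```shell
# Combined iterations/sec per watt for 2 gpuowl instances,
# using the per-iteration times and wall wattages listed above.
awk 'BEGIN {
  printf "sclk=4: %.2f iter/s/W\n", (2 / 1350e-6) / 285;  # 285 W at 1350 us/iter
  printf "sclk=3: %.2f iter/s/W\n", (2 / 1425e-6) / 230;  # 230 W at 1425 us/iter
}'
```

So sclk=3 is roughly 17% better on throughput-per-watt, at the cost of iterations about 5-6% slower.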

Oddly, sclk = 5 ups the wattage to 340W but yields no performance gain ... I figured it might be due to throttling (though that should cut the wattage), and tried upping the fan from 120 to 150 - no change. Not that anything above sclk = 4 would be viable once I add the remaining GPUs, anyway.

Any advice on memclk fiddles which might boost the FLOPs/Watt is welcome - for reasons unknown (some unable-to-write-a-system-file-even-as-su issue), I was unable to alter the mclk settings on my Haswell system's R7; hopefully the new one will allow it.
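For reference, the usual rocm-smi incantation for memory-clock levels looks like the below. Flag behavior has shifted across ROCm releases and the valid level numbers are card-specific, so this is only a sketch, not a recipe:

```shell
# Sketch: manual memory-clock control via rocm-smi (ROCm-version dependent).
sudo rocm-smi --setperflevel manual   # allow manual clock-level selection
sudo rocm-smi --setmclk 1             # pick one of the card's supported mclk levels
```

On cards where this works, dropping one mclk level while holding sclk fixed is the typical first experiment for FLOPs/Watt tuning.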

Will install the 2nd card in a couple hours, after catching up on the stuff I would normally have done over the last few hours. Will need each new card to draw < 200W when running 2 gpuowl instances, figuring that 600-650W total is the max I want to run on my 850W PSU on a 24/7 basis.
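The budget arithmetic behind that per-card target, spelled out using only numbers quoted earlier in this thread (40W unloaded baseline, 230W total at sclk=3):

```shell
# 850W PSU, aiming for <= 650W sustained, 40W idle baseline, 3 cards.
awk 'BEGIN {
  budget = 650; idle = 40; cards = 3;
  printf "allowed per card: %.0f W\n", (budget - idle) / cards;
  printf "measured per card at sclk=3: %d W\n", 230 - idle;  # 2 instances
}'
```

So the sclk=3 measurement already sits under the roughly 200W-per-card ceiling, with a little margin.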



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.