mersenneforum.org  

Old 2020-05-13, 21:58   #188
ewmayer
2ω=0
 
Sep 2002
República de California

2D7F₁₆ Posts

Update: further installed clinfo and libncurses5 - thanks, George - and we are up and running. Gratifyingly, with sclk=4 and fan=120 *and* gpuowl running, the fan noise is actually much more tolerable than in last night's experiment, where I simply tried upping to fan=120 without any significant compute load on the card. So there is still hope that this beast could find a home tucked away in some corner of the apartment without having to resort to extreme sound-damping measures.

In preparation for completing this build, a couple of questions:

1. How do I get the system setup to switch back to using the onboard gfx for the display interface, rather than the hdmi-out on the card? I'd like to be able to use the new vga-to-hdmi adapter (bought once I realized the old one was bad and was hosing setup of my new build) for interfacing with my Odroid as needed for ARM builds. I have also bought a vga-to-dvi-i adapter to be able to use the DVI-I output on the mobo, but need to get the system to use it rather than defaulting to the GPU.

2. Does anything need doing system-setup-wise before adding cards 2 and 3, or just shut down, physically install, boot up? Assuming the system recognizes the added cards, is it then just a matter of adding the proper -d device flag to rocm-smi commands and gpuowl invocations?
Old 2020-05-13, 22:00   #189
paulunderwood
 
Sep 2002
Database er0rr

2²×941 Posts

2. I am glad you got the rig working with one card. Measure the wall wattage carefully at --setsclk 3. When you install all three GPUs you want to stay in safe limits, otherwise things will fry! Have a fail-safe way of booting the machine to set the sclk to 3 (if that is tolerable.)

1. I suspect your motherboard defaults to the plugged-in card. Linux will be efficient when you ssh into it.
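
For the fail-safe in point 2, one option is a small script run at boot that pins every card to a conservative sclk before any gpuowl job starts. How it gets hooked into boot (root cron, systemd unit, etc.) is left open here. Below is a minimal Python sketch, assuming only the rocm-smi `-d` and `--setsclk` options already mentioned in this thread; the helper names and the card count of 3 are hypothetical.

```python
import subprocess


def setsclk_argv(device, level=3, tool="rocm-smi"):
    """Build the rocm-smi command that pins one card to a given sclk level."""
    return [tool, "-d", str(device), "--setsclk", str(level)]


def pin_all(num_devices, level=3, dry_run=True):
    """Pin devices 0..num_devices-1; with dry_run, only return the commands."""
    cmds = [setsclk_argv(d, level) for d in range(num_devices)]
    if not dry_run:
        for cmd in cmds:
            # Changing clocks typically needs root.
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # Dry run: just show what would be executed for a 3-card box.
    for cmd in pin_all(3):
        print(" ".join(cmd))
```

The dry-run default is deliberate, so the commands can be eyeballed before anything touches the clocks.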

Last fiddled with by paulunderwood on 2020-05-13 at 22:03
Old 2020-05-13, 22:19   #190
preda
 
"Mihai Preda"
Apr 2015

3×457 Posts

Quote:
Originally Posted by ewmayer
now gpuowl starts but immediately coredumps:
Code:
2020-05-13 13:31:31 gpuowl v6.11-278-ga39cc1a
2020-05-13 13:31:31 Note: not found 'config.txt'
2020-05-13 13:31:31 device 0, unique id 'df7080c172fd5d6e'
2020-05-13 13:31:31 df7080c172fd5d6e 104954387 FFT: 5.50M 1K:11:256 (18.20 bpw)
2020-05-13 13:31:31 df7080c172fd5d6e Expected maximum carry32: 50D10000
Segmentation fault (core dumped)
Can you please rebuild with debug symbols (add "-g" to CXXFLAGS), and afterwards run the executable under gdb, to see where it coredumps.

>gdb ./gpuowl
> r -prp 104954387
[segfault]
> bt (to see the stack)

Alternatively, enable coredump files ("ulimit -c unlimited"), get a coredump file after the crash, load it with gdb and see where it segfaults. (This still needs the build with -g.)
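
The interactive gdb session above can also be run non-interactively, which is handy for pasting the stack into a post. Below is a minimal Python sketch that wraps gdb's batch mode (`-batch`, `-ex`, `--args`). The helper names are hypothetical, and it assumes gdb is installed and gpuowl was built with -g.

```python
import subprocess


def gdb_backtrace_argv(exe="./gpuowl", prog_args=("-prp", "104954387")):
    """gdb command line: run the program, then print the stack once it stops."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", exe, *prog_args]


def capture_backtrace(exe="./gpuowl", prog_args=("-prp", "104954387")):
    """Run the program under gdb and return everything gdb printed."""
    proc = subprocess.run(
        gdb_backtrace_argv(exe, prog_args),
        capture_output=True,
        text=True,
    )
    return proc.stdout + proc.stderr


if __name__ == "__main__":
    print(capture_backtrace())
```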

Edit: sorry missed the follow-up messages, seems problem solved (probably the libncurses thing), good.

Last fiddled with by preda on 2020-05-13 at 22:22
Old 2020-05-13, 22:20   #191
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts

Quote:
Originally Posted by paulunderwood
2. I am glad you got the rig working with one card. Measure the wall wattage carefully at --setsclk 3. When you install all three GPUs you want to stay in safe limits, otherwise things will fry! Have a fail-safe way of booting the machine to set the sclk to 3 (if that is tolerable.)
Good point, will do - currently simply plugged into wall, but have a 2nd wattmeter (first is on the Haswell+R7 system) ready to go. First gonna let things run for a few hours on the one card at current sclk=4, then shutdown, replug-in via wattmeter, boot up and try 2-jobs-one-card at sclk=3 and 4. CPU is unloaded except for system tasks, so hopefully wattage will point to 3 cards being runnable using my 850W PSU, even if I have to drop down a smidge to sclk=3. Also, I haven't yet tried doing any mem-clock fiddles on this new system - my first system (Haswell+R7 under rocm 2.10) didn't support them, but per Mihai they can be very useful for maximizing FLOPs/Watt.

Quote:
1. I suspect your motherboard defaults to the plugged-in card. Linux will be efficient when you ssh into it.
Current pair of jobs were started and are being monitored via ssh from my laptop, simply used 'nohup' to invoke the program, now logged out but the fan noise indicates all is well. I'll just leave the wifi usb stick on this system enabled ... I suppose if I were worried about security I could leave the wifi enabled but only physically insert the stick for occasional sshing-in. The wattmeter makes an excellent means of is-everything-running-as-normal monitoring.

Last fiddled with by ewmayer on 2020-05-13 at 22:39
Old 2020-05-13, 22:38   #192
preda
 
"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
Originally Posted by ewmayer
is it then just a matter of adding the proper -d device flag to rocm-smi commands and gpuowl invocations?
I find running gpuowl with -uid <id> much more useful than running with -d <position> . This way the identity of the card is preserved even when swapping it around the PCIe slots.

And the script tools/device.py can be used to convert the UID to -d "position" for rocm-smi
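
As an illustration of the -uid approach, here is a minimal Python sketch that starts one detached gpuowl instance per card, each in its own run directory. It uses only the `-uid <id>` option described above. The helper names, directory layout and second UID are hypothetical, and `start_new_session=True` is only a rough stand-in for nohup.

```python
import os
import subprocess


def gpuowl_argv(uid, exe="../gpuowl"):
    """Command line selecting a card by its unique id rather than its slot position."""
    return [exe, "-uid", uid]


def launch(uid, run_dir, exe="../gpuowl"):
    """Start one detached instance in run_dir, appending its output to nohup.out there."""
    log = open(os.path.join(run_dir, "nohup.out"), "ab")
    return subprocess.Popen(
        gpuowl_argv(uid, exe),
        cwd=run_dir,               # a relative exe is resolved against run_dir on POSIX
        stdout=log,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        start_new_session=True,    # keep running after the ssh session ends
    )


if __name__ == "__main__":
    # First UID is the one from the log earlier in the thread; the second is made up.
    cards = {"run0": "df7080c172fd5d6e", "run1": "0123456789abcdef"}
    for run_dir, uid in cards.items():
        print(run_dir, " ".join(gpuowl_argv(uid)))
```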
Old 2020-05-14, 00:11   #193
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12475₈ Posts

Quote:
Originally Posted by ewmayer
1. How do I get the system setup to switch back to using the onboard gfx for the display interface, rather than the hdmi-out on the card?
On at least some motherboards, there's a BIOS setting to lock it to the igp. Asrock H81 is like that, as is this Dell laptop.
Old 2020-05-14, 19:05   #194
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts

After no problems running 2 gpuOwl instances overnight, just shut down and replugged system in via a wattmeter, in preparation for wattage tests with 1,2 and eventually 3 GPUs installed. On reboot, system is not finding the GPU - invoking gpuowl has it looking for the same-ID device as yesterday, but not finding it:
Code:
2020-05-14 11:58:16 gpuowl v6.11-278-ga39cc1a
2020-05-14 11:58:16 Note: not found 'config.txt'
2020-05-14 11:58:16 device 0, unique id 'df7080c172fd5d6e'
2020-05-14 11:58:16 df7080c172fd5d6e 104954387 FFT: 5.50M 1K:11:256 (18.20 bpw)
2020-05-14 11:58:16 df7080c172fd5d6e Expected maximum carry32: 50D10000
2020-05-14 11:58:16 df7080c172fd5d6e Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs
2020-05-14 11:58:16 df7080c172fd5d6e Bye
And clinfo shows 0 devices:
Code:
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3098.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   
  clCreateContext(NULL, ...) [default]            No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
Looking through the long dmesg log now to see what I can see ... I *really* hope the 'shutdown -h' and reboot didn't bork the card.
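
The same check can be scripted as a quick post-boot sanity test before launching anything. Below is a minimal Python sketch that parses the "Number of devices" lines from clinfo output in the format shown above. The function names are hypothetical, and it assumes clinfo is on the PATH.

```python
import re
import subprocess


def count_opencl_devices(clinfo_text):
    """Sum the 'Number of devices' counts over all platforms in clinfo output."""
    counts = re.findall(
        r"^\s*Number of devices\s+(\d+)\s*$", clinfo_text, re.MULTILINE
    )
    return sum(int(c) for c in counts)


def devices_visible():
    """Run clinfo as the current user and count the devices it reports."""
    out = subprocess.run(["clinfo"], capture_output=True, text=True).stdout
    return count_opencl_devices(out)


if __name__ == "__main__":
    n = devices_visible()
    print(f"OpenCL devices visible to this user: {n}")
```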
Old 2020-05-14, 19:26   #195
paulunderwood
 
Sep 2002
Database er0rr

EB4₁₆ Posts

What does uname -a give? Are you using rocm v3.3? What command are you issuing to try and run gpuowl?
Old 2020-05-14, 19:34   #196
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts

Quote:
Originally Posted by paulunderwood
What does uname -a give? Are you using rocm v3.3? What command are you issuing to try and run gpuowl?
In order:

Linux ewmayer-gimp 5.3.0-51-generic #44-Ubuntu SMP Wed Apr 22 21:09:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
yes
[From within one of my 2 run-subdirs above the main gpuowl dir] ../gpuowl

The executable is clearly firing up and trying to resume from where it left off, but not finding a valid device at the device ID in question (not sure if it caches that in a local-stuff file or queries the system to get that) to run on.

One more diagnostic data point - on post-shutdown reboot, had the video cable disconnected from the card, figuring on doing remote management via ssh. Did another shutdown just now, plugged that back in and on reboot the display showed the expected BIOS menu, followed by boot into Ubuntu. So the basic video-out part of the GPU must be functioning.
Old 2020-05-14, 19:42   #197
paulunderwood
 
Sep 2002
Database er0rr

7264₈ Posts

Quote:
Originally Posted by ewmayer
In order:

[From within one of my 2 run-subdirs above the main gpuowl dir] ../gpuowl
Don't you need to prefix with a sudo command?

Last fiddled with by paulunderwood on 2020-05-14 at 19:50
Old 2020-05-14, 19:49   #198
ewmayer
2ω=0
 
Sep 2002
República de California

2D7F₁₆ Posts

Quote:
Originally Posted by paulunderwood
Don't you need to prefix with a sudo command?
I don't need to do that on my Haswell+R7 system ... but on this new build, despite the gpuowl executable showing regular-user x permission, that seems to do the trick, back up and running. Thanks!

Still not understanding why clinfo would show 0 devices post-boot, find none when attempting to run the program as regular user, then suddenly find the card when run via sudo ... is gpuowl initing the card now, when run via sudo?
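
One thing worth ruling out, as a guess rather than anything confirmed in this thread: ROCm compute goes through the /dev/kfd device node, and a regular user who is not in the group owning that node (commonly `video`, on some newer distros `render`) will see zero OpenCL devices while root sees the card. Below is a minimal Python sketch to check that. The function names are hypothetical and the default path is an assumption about a stock ROCm install.

```python
import grp
import os
import stat


def device_access_report(path="/dev/kfd"):
    """Report the owning group of a device node and whether this user can open it read/write."""
    if not os.path.exists(path):
        return {"path": path, "exists": False, "rw": False, "group": None, "mode": None}
    st = os.stat(path)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid with no name in the group database
    return {
        "path": path,
        "exists": True,
        "rw": os.access(path, os.R_OK | os.W_OK),
        "group": group,
        "mode": stat.filemode(st.st_mode),
    }


def my_groups():
    """Names of the groups the current process belongs to."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))
    return names


if __name__ == "__main__":
    print(device_access_report())
    print("my groups:", ", ".join(my_groups()))
```

If "rw" comes back False and the owning group is missing from the user's group list, adding the user to that group and logging in again would be the thing to try instead of sudo.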

Edit: New build wattages:

o On powerup, one R7 plugged in but unloaded: 40W
o 2 gpuowl instances @5.5M FFT, sclk = 4: 1350 us/iter, 285W
o 2 gpuowl instances @5.5M FFT, sclk = 3: 1425 us/iter, 230W

Oddly, sclk = 5 ups the wattage to 340W but yields no performance gain ... figured it might be due to throttling (though that should cut the wattage), and tried upping fan from 120 to 150, no change. Not that anything above sclk = 4 would be viable once I add the remaining GPUs, anyway.
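
To put numbers on the sclk=3 vs sclk=4 tradeoff, here is the arithmetic on the two measurements above as a short Python sketch. It uses whole-system wall watts, so the roughly 40W baseline is included, and the function name is hypothetical.

```python
def wall_energy_uj(watts, us_per_iter):
    """Wall energy drawn during one iteration time: W x us = microjoules."""
    return watts * us_per_iter


# Measurements from the list above: (wall watts, us/iter) with 2 instances running.
sclk4 = wall_energy_uj(285, 1350)
sclk3 = wall_energy_uj(230, 1425)

time_penalty = 1425 / 1350 - 1        # how much slower sclk=3 is
energy_saving = 1 - sclk3 / sclk4     # how much less wall energy per iteration time

print(f"sclk=4: {sclk4} uJ, sclk=3: {sclk3} uJ")
print(f"sclk=3 is {time_penalty:.1%} slower for {energy_saving:.1%} less wall energy")
```

So dropping to sclk=3 costs a bit under 6% in speed and saves a bit under 15% in wall energy per iteration time, on these numbers.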

Any advice on memclk-fiddles which might boost the FLOPs/Watt welcome - for reasons unknown (some unable-to-write-a-system-file-even-as-su issue), was unable to alter the mclk settings on my Haswell-system's R7, hopefully the new one will allow it.

Will install 2nd card in a couple hours, after catching up on the stuff I would normally have done over the last few hours. Will need each new card to draw < 200W when running 2 gpuowl instances, figuring that 600-650W total is the max I want to run on my 850W PSU on a 24/7 basis.
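
The < 200W figure follows from the cap minus the measured one-card system draw, split over the two cards still to be added. Below is a minimal Python sketch of that budget. It assumes nothing else in the system changes its draw, and the function name is hypothetical.

```python
def per_added_card_budget(total_cap_w, one_card_system_w, added_cards=2):
    """Watts left for each additional card under a total wall-power cap."""
    return (total_cap_w - one_card_system_w) / added_cards


# Measured one-card system draw at each sclk setting, from the list above.
for sclk, measured_w in ((3, 230), (4, 285)):
    lo = per_added_card_budget(600, measured_w)
    hi = per_added_card_budget(650, measured_w)
    print(f"sclk={sclk}: {lo:.1f}-{hi:.1f} W available per added card")
```

With the first card at sclk=3 that leaves 185-210W per added card, and at sclk=4 only 157.5-182.5W.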

Last fiddled with by ewmayer on 2020-05-14 at 20:19

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.