mersenneforum.org AVX512 hardware recommendations?

2020-06-04, 21:04   #34
ewmayer
∂²ω=0

Sep 2002
República de California

2×5,647 Posts

Quote:
 Originally Posted by paulunderwood I think you are out of luck: https://github.com/RadeonOpenCompute...ftware-Support https://en.wikipedia.org/wiki/Radeon_RX_500_series
You need to be a little more specific about what I should be looking at on those pages. Clicking on the AMD link for the Radeon 540 at the Wikipedia page, I see the following - what crucial element is missing from that list?
[Attachment: screenshot of the AMD product page for the Radeon 540]

2020-06-04, 22:13   #35
paulunderwood

Sep 2002
Database er0rr

17×193 Posts

Quote:
 Originally Posted by ewmayer You need to be a little more specific about what I should be looking at on those pages. Clicking on the AMD link for the Radeon 540 at the Wikipedia page, I see the following - what crucial element is missing from that list?
According to (the fallible) Wikipedia, yours is a "Lexa" type, and ROCm says it only supports "Polaris" from the 500 series. It does, however, also say:

Quote:
 ROCm is a collection of software ranging from drivers and runtimes to libraries and developer tools. Some of this software may work with more GPUs than the "officially supported" list above, though AMD does not make any official claims of support for these devices on the ROCm software platform.
 The following list of GPUs are enabled in the ROCm software, though full support is not guaranteed:
 GFX8 GPUs
   "Polaris 11" chips, such as on the AMD Radeon RX 570 and Radeon Pro WX 4100
   "Polaris 12" chips, such as on the AMD Radeon RX 550 and Radeon RX 540
Further, according to this page, the Polaris 12, a.k.a. Lexa Pro, has an X tagged on: RX 540X

I am confused.

Last fiddled with by paulunderwood on 2020-06-04 at 22:27

2020-06-04, 22:29   #36
ewmayer
∂²ω=0

Sep 2002
República de California

2×5,647 Posts

Well, at this point I still don't know if the hang-on-startup issue is due to ROCm or not - as I said, rocm-smi seems to recognize the GPU. How to get more specific info re. the, um, well, "how's it hanging?" question? :P

And, Ken mentioned trying a prebuilt-for-linux LL/PRP tester, but I saw no links to such at the mersenne.ca site he suggested.

What are my alternative options here? I don't really care which precise code runs on the GPU, just that it be efficient. Was simply hoping the same workflow that works for gpuowl on Radeon VII under Linux would also work here.

2020-06-04, 22:37   #37
preda

"Mihai Preda"
Apr 2015

44E₁₆ Posts

Quote:
 Originally Posted by ewmayer Well, at this point I still don't know if the hang-on-startup issue is due to ROCm or not - as I said, rocm-smi seems to recognize the GPU. How to get more specific info re. the, um, well, "how's it hanging?" question? :P And, Ken mentioned trying a prebuilt-for-linux LL/PRP tester, but I saw no links to such at the mersenne.ca site he suggested. What are my alternative options here? I don't really care which precise code runs on the GPU, just that it be efficient. Was simply hoping the same workflow that works for gpuowl on Radeon VII under Linux would also work here.
One of the first things that gpuowl does at startup is to compile the kernels (OpenCL compilation); it prints a message with the timing once that's done.

You could start gpuowl under the debugger, interrupt it after a while, and see where the threads are sitting (i.e. what it is doing, what it is waiting for). Another way would be to add log() lines at strategic places in the source to mark the passage.

Also, gpuowl -h should display an entry for the GPU (if it doesn't, it's a bad omen). clinfo working is also a good sign.
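
A lighter-weight variant of "see where it stops", for cases where attaching a debugger is awkward: start the program from a small watchdog that gives up after a timeout and reports the last line printed before the stall. This is only a sketch and not something gpuowl provides; `run_with_timeout` and `last_line` are hypothetical helper names, and the timeout value is arbitrary. It relies only on Python's standard `subprocess` module. Two caveats: output that the program had buffered but not yet flushed is lost, and a process stuck in uninterruptible kernel sleep may not respond to the kill.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd (a list of strings); return (hung, output).

    hung is True if the command was still running when the timeout expired.
    output is whatever the command wrote to stdout/stderr up to that point.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            timeout=timeout_s,
        )
        return False, done.stdout.decode(errors="replace")
    except subprocess.TimeoutExpired as exc:
        # subprocess.run() has already tried to kill the child here.
        partial = exc.stdout or b""
        return True, partial.decode(errors="replace")


def last_line(output):
    """Return the last non-empty line of output, or '' if there is none."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```

Calling `run_with_timeout(["../gpuowl", "-h"], timeout_s=30.0)` and passing the returned text to `last_line` would then show how far the help output got before the stall.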

2020-06-04, 22:55   #38
paulunderwood

Sep 2002
Database er0rr

110011010001₂ Posts

I told Ernst to

sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/x86_64-linux-gnu/libOpenCL.so

but I have this:
Code:
ls -l /opt/rocm-3.5.0/lib/libOpenCL.so*
lrwxrwxrwx 1 root root 30 Jun 3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so -> ../opencl/lib/libOpenCL.so.1.2
lrwxrwxrwx 1 root root 30 Jun 3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so.1 -> ../opencl/lib/libOpenCL.so.1.2
lrwxrwxrwx 1 root root 30 Jun 3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so.1.2 -> ../opencl/lib/libOpenCL.so.1.2
I am wondering whether it would be better if he used -l/opt/rocm-3.5.0/lib/OpenCL in the gpuowl Makefile (and recompiled). (Then he can rm the link /usr/lib/x86_64-linux-gnu/libOpenCL.so.)

Edit: A bad idea. I just tried the altered "-l" option. This is what I have in my Makefile:
Code:
LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64 -L.

Last fiddled with by paulunderwood on 2020-06-04 at 23:30

2020-06-04, 23:11   #39
ewmayer
∂²ω=0

Sep 2002
República de California

2×5,647 Posts

Quote:
 Originally Posted by preda One of the first things that gpuowl does in the beginning is to compile the kernels (OpenCL compilation), and prints a message once that's done with timing. You could start gpuowl under the debugger, and interrupt it after a while, and see where the threads are sitting (i.e. what is it doing, what is it waiting for). Another way would be to add log() lines at strategic places in source to mark the passage. Also, gpuowl -h should display an entry for the GPU. (if it doesn't it's a bad omen). clinfo working is also a good sign.
Here is what happens on program start:
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl
[sudo] password for ewmayer:
2020-06-03 18:26:55 gpuowl v6.11-311-gfa76bd9
2020-06-03 18:26:55 Note: not found 'config.txt'
2020-06-03 18:26:55 device 0, unique id ''
At that point it hangs, and the ssh-session stops accepting signals ... I had to kill said remote session from another term on my Macbook. Comparing to run-start diagnostics on one of my R7s, after those 3 lines it next prints a line with this format:

[date & time] [GPU ID, e.g. gfx906+sram-ecc-0] [exponent] [FFT info & bpw]

That is followed by an "Expected maximum carry" line, an "OpenCL args" line, then an OpenCL-compilation-timing line. I'm getting none of those, i.e. it's hanging on the way to printout of the above informational line.

As noted yesterday, clinfo shows the platform and CL version (2.0), but no valid devices:
Code:
Number of platforms 1
  Platform Name AMD Accelerated Parallel Processing
  Platform Vendor Advanced Micro Devices, Inc.
  Platform Version OpenCL 2.0 AMD-APP (3137.0)
  Platform Profile FULL_PROFILE
  Platform Extensions cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix AMD

  Platform Name AMD Accelerated Parallel Processing
Number of devices 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
  clCreateContext(NULL, ...) [default] No platform
  clCreateContext(NULL, ...) [other] No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform
The ROCm interface seems OK, even allows me to do the usual --setsclk and --setfan fiddles, and the corresponding numbers in the status (rocm-smi, no args) output change in the expected directions.

Using 'sudo gdb' to run under gdb starts with the expected "(No debugging symbols found in ../gpuowl)", then 'run' hits the hung state; even under gdb, ctrl-c and ctrl-z fail to have any effect. Advice on instrumenting via log(), enabling debug symbols, etc, welcome.

2020-06-04, 23:20   #40
paulunderwood

Sep 2002
Database er0rr

17×193 Posts

Try running gpuowl with -h as preda wrote.

What is your gpuowl Makefile's LIBPATH?

Last fiddled with by paulunderwood on 2020-06-04 at 23:22

2020-06-04, 23:51   #41
ewmayer
∂²ω=0

Sep 2002
República de California

2×5,647 Posts

Quote:
 Originally Posted by paulunderwood Try running gpuowl with -h as preda wrote. What is your gpuowl Makefile's LIBPATH?
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl -h
2020-06-04 16:45:04 gpuowl v6.11-311-gfa76bd9

-dir <folder>      : specify local work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-pool <dir>        : specify a directory with the shared (pooled) worktodo.txt and results.txt
Multiple GpuOwl instances, each in its own directory, can share a pool of assignments and report
the results back to the common pool.
-uid <unique_id>   : specifies to use the GPU with the given unique_id (only on ROCm/Linux)
-user <name>       : specify the user name.
-cpu  <name>       : specify the hardware name.
-time              : display kernel profiling information.
-fft <spec>        : specify FFT e.g.: 1152K, 5M, 5.5M, 256:10:1K
-block <value>     : PRP GEC block size, or LL iteration-block size. Must divide 10'000.
-log <step>        : log every <step> iterations. Multiple of 10'000.
-jacobi <step>     : (LL-only): do Jacobi check every <step> iterations. Default 1'000'000.
-carry long|short  : force carry type. Short carry may be faster, but requires high bits/word.
-B1                : P-1 B1 bound, default 1000000
-B2                : P-1 B2 bound, default B1 * 30
-rB2               : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-cleanup           : delete save files at end of run
-prp <exponent>    : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent>    : run a single P-1 test and exit, ignoring worktodo.txt
-ll <exponent>     : run a single LL test and exit, ignoring worktodo.txt
-verify <file>|<exponent> : verify PRP-proof contained in <file> or in the folder <exponent>/
-proof [<power>]   : enable PRP proof generation. Default <power> is 9.
-results <file>    : name of results file, default 'results.txt'
-iters <N>         : run next PRP test for <N> iterations and exit. Multiple of 10000.
-maxAlloc          : limit GPU memory usage to this value in MB (needed on non-AMD GPUs)
-yield             : enable work-around for CUDA busy wait taking up one CPU core
-nospin            : disable progress spinner
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning)
-safeMath          : do not use -cl-unsafe-math-optimizations (OpenCL)
-binary <file>     : specify a file containing the compiled kernels binary
-device <N>        : select a specific device:
[hangs]
Grep finds 2 occurrences of LIBPATH in the Makefile ... not sure what the '.' ending the actual-path one means:
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl$ grep LIBPATH Makefile
3:LIBPATH = -L/opt/rocm-3.3.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.
5:LDFLAGS = -lstdc++fs -lOpenCL -lgmp -pthread ${LIBPATH}
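
On the trailing '.': in GNU linker syntax each `-L<dir>` adds one directory to the library search path, so `-L.` just adds the current directory. The `-L` directories are searched in the order given, and `-lOpenCL` resolves to the first one that holds `libOpenCL.so` (or, failing that, `libOpenCL.a`). A minimal sketch of that lookup follows, assuming only the `-L` directories matter (the real linker also falls back to its built-in system directories); `lib_dirs` and `first_match` are hypothetical names used for illustration.

```python
import os
import shlex


def lib_dirs(flags):
    """Return the directories named by -L options, in command-line order."""
    dirs = []
    tokens = shlex.split(flags)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-L" and i + 1 < len(tokens):  # "-L dir" form
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-L"):  # "-Ldir" form
            dirs.append(tok[2:])
        i += 1
    return dirs


def first_match(flags, libname="OpenCL", exists=os.path.exists):
    """Return the first -L directory holding lib<libname>.so or .a, else None."""
    for d in lib_dirs(flags):
        for suffix in (".so", ".a"):
            if exists(os.path.join(d, "lib" + libname + suffix)):
                return d
    return None
```

Running `first_match` on the LIBPATH string above would show which of the listed ROCm directories, if any, actually supplies the library at link time. This models link-time lookup only; which `libOpenCL` the binary loads at run time is decided separately by the dynamic loader.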

2020-06-05, 00:00   #42
paulunderwood

Sep 2002
Database er0rr

17×193 Posts

It is a long shot... Make the beginning of your LIBPATH look like this:
Code:
LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64 ...the rest
and run make clean && make gpuowl

Last fiddled with by paulunderwood on 2020-06-05 at 00:00

2020-06-05, 00:08   #43
preda

"Mihai Preda"
Apr 2015

2·19·29 Posts

Quote:
 Originally Posted by ewmayer Code: ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl -h
OpenCL is not initialized correctly. It hangs when doing a basic OpenCL operation such as listing all the devices of an OpenCL provider. There's not much to fix in gpuowl for this IMO; you should get clinfo to report a valid device first.

The fact that rocm-smi works does not tell us much. It just means that the GPU is initialized correctly (basically, the files under /sys/class/drm/card0/ are there). It does not mean that OpenCL works.
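
The "get clinfo to report a valid device first" check can be scripted. The sketch below assumes the human-readable clinfo layout quoted earlier in the thread, where each platform has a `Platform Name` line followed later by a `Number of devices` line; other clinfo versions may format these lines differently. `device_counts` and `has_usable_device` are hypothetical helper names.

```python
import re


def device_counts(clinfo_text):
    """Map each platform name to its 'Number of devices' value."""
    counts = {}
    platform = None
    for line in clinfo_text.splitlines():
        m = re.match(r"\s*Platform Name\s+(.*\S)", line)
        if m:
            platform = m.group(1)  # remember the most recent platform
            continue
        m = re.match(r"\s*Number of devices\s+(\d+)", line)
        if m and platform is not None:
            counts[platform] = int(m.group(1))
    return counts


def has_usable_device(clinfo_text):
    """True if at least one platform reports one or more devices."""
    return any(n > 0 for n in device_counts(clinfo_text).values())
```

Fed the clinfo text from this thread, `device_counts` would map "AMD Accelerated Parallel Processing" to 0, so `has_usable_device` returns False; that is the condition to clear before looking at gpuowl itself.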

2020-06-05, 00:11   #44
ewmayer
∂²ω=0

Sep 2002
República de California

2×5,647 Posts

Quote:
 Originally Posted by paulunderwood It is a long shot... Make the beginning of your LIBPATH look like this: Code: LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64 ...the rest and run make clean && make gpuowl
Compile succeeds; here is the link line:

g++ -o gpuowl Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o sha3.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm-3.5.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.

...but same result (hang after 'select a specific device:' informational) as before.

Gotta run, thanks for the various try-this advice, we'll see what tomorrow brings.

(I know one thing tomorrow will bring ... a cool pic of The Beast in its new glass-roof-and-floor lair, now that I have the resulting airflow-restriction issues resolved).

Last fiddled with by ewmayer on 2020-06-05 at 00:11
