mersenneforum.org  

Old 2020-06-04, 21:04   #34
ewmayer

Quote:
Originally Posted by paulunderwood View Post
You need to be a little more specific about what I should be looking at on those pages. Clicking on the AMD link for the Radeon 540 at the Wikipedia page, I see the following - what crucial element is missing from that list?
[Attached image: radeon540.png]

Old 2020-06-04, 22:13   #35
paulunderwood

Quote:
Originally Posted by ewmayer View Post
You need to be a little more specific about what I should be looking at on those pages. Clicking on the AMD link for the Radeon 540 at the Wikipedia page, I see the following - what crucial element is missing from that list?
According to the (fallible) Wikipedia page, yours is a "Lexa" type, and the ROCm documentation says it only supports "Polaris" from the 500 series, although it does say:

Quote:
ROCm is a collection of software ranging from drivers and runtimes to libraries and developer tools. Some of this software may work with more GPUs than the "officially supported" list above, though AMD does not make any official claims of support for these devices on the ROCm software platform. The following list of GPUs are enabled in the ROCm software, though full support is not guaranteed:

GFX8 GPUs
"Polaris 11" chips, such as on the AMD Radeon RX 570 and Radeon Pro WX 4100
"Polaris 12" chips, such as on the AMD Radeon RX 550 and Radeon RX 540
Further, according to this page, the Polaris 12 a.k.a. Lexa Pro has an X tagged on: RX 540X.

I am confused.

Last fiddled with by paulunderwood on 2020-06-04 at 22:27

Old 2020-06-04, 22:29   #36
ewmayer

Well, at this point I still don't know if the hang-on-startup issue is due to ROCm or not - as I said, rocm-smi seems to recognize the GPU. How to get more specific info re. the, um, well, "how's it hanging?" question? :P

And, Ken mentioned trying a prebuilt-for-linux LL/PRP tester, but I saw no links to such at the mersenne.ca site he suggested. What are my alternative options here? I don't really care which precise code runs on the GPU, just that it be efficient. Was simply hoping the same workflow that works for gpuowl on Radeon VII under Linux would also work here.

Old 2020-06-04, 22:37   #37
preda

Quote:
Originally Posted by ewmayer View Post
Well, at this point I still don't know if the hang-on-startup issue is due to ROCm or not - as I said, rocm-smi seems to recognize the GPU. How to get more specific info re. the, um, well, "how's it hanging?" question? :P

And, Ken mentioned trying a prebuilt-for-linux LL/PRP tester, but I saw no links to such at the mersenne.ca site he suggested. What are my alternative options here? I don't really care which precise code runs on the GPU, just that it be efficient. Was simply hoping the same workflow that works for gpuowl on Radeon VII under Linux would also work here.
One of the first things gpuowl does at startup is compile the kernels (OpenCL compilation); it prints a message with the timing once that's done.

You could start gpuowl under the debugger, interrupt it after a while, and see where the threads are sitting (i.e. what it is doing and what it is waiting for). Another way would be to add log() lines at strategic places in the source to mark the passage.

Also, gpuowl -h should display an entry for the GPU (if it doesn't, it's a bad omen). clinfo working is also a good sign.
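The debugger suggestion above can be sketched roughly as follows. This is only an illustration and not part of gpuowl: the function names dump_threads and describe_timeout_status are made up here, and the sketch assumes gdb, pgrep and GNU timeout are installed. The idea is to start gpuowl with a time limit in one terminal and, from a second terminal, attach gdb in batch mode to print a backtrace of every thread.

```shell
#!/bin/sh
# Hypothetical helpers for locating a startup hang; not part of gpuowl.

# Attach gdb to a running process and print a backtrace for every thread.
# Usage: dump_threads <pid>   (attaching usually needs root)
dump_threads() {
    gdb -p "$1" -batch -ex "thread apply all bt"
}

# Translate the exit status of GNU timeout into a readable verdict.
# 124 = time limit reached, 137 = the process had to be killed with SIGKILL.
describe_timeout_status() {
    case "$1" in
        0)   echo "finished" ;;
        124) echo "timed out" ;;
        137) echo "killed" ;;
        *)   echo "failed with status $1" ;;
    esac
}

# Example session (commented out so that sourcing this file runs nothing):
#   timeout -k 5 60 ../gpuowl -h; describe_timeout_status $?
#   dump_threads "$(pgrep -n gpuowl)"      # from a second terminal
```

The backtraces are only readable if the binary carries debug symbols, which probably means rebuilding with -g added to the compiler flags in the Makefile. Note also that a process stuck inside a driver call may not react even to SIGKILL, in which case timeout itself will keep waiting.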

Old 2020-06-04, 22:55   #38
paulunderwood

I told Ernst to sudo ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/x86_64-linux-gnu/libOpenCL.so but I have this:
Code:
ls -l /opt/rocm-3.5.0/lib/libOpenCL.so*
lrwxrwxrwx 1 root root 30 Jun  3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so -> ../opencl/lib/libOpenCL.so.1.2
lrwxrwxrwx 1 root root 30 Jun  3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so.1 -> ../opencl/lib/libOpenCL.so.1.2
lrwxrwxrwx 1 root root 30 Jun  3 04:58 /opt/rocm-3.5.0/lib/libOpenCL.so.1.2 -> ../opencl/lib/libOpenCL.so.1.2
I am wondering whether it would be better if he used -l/opt/rocm-3.5.0/lib/OpenCL in the gpuowl Makefile (and recompiled).

(Then he can rm the link /usr/lib/x86_64-linux-gnu/libOpenCL.so.)

Edit: A bad idea. I just tried the altered "-l" option.

This is what I have in my Makefile:
Code:
LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64  -L.
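Since -L only affects where the linker looks at build time, it may also be worth checking which libOpenCL the finished binary resolves at run time. A minimal sketch, assuming a glibc system where ldd is available; opencl_lib_from_ldd is a name invented for this example:

```shell
#!/bin/sh
# Read `ldd` output on stdin and print where libOpenCL resolves to.
# Prints "not found" if the loader cannot locate it, and nothing at all if
# the binary does not link against libOpenCL.
opencl_lib_from_ldd() {
    awk '$1 ~ /^libOpenCL/ {
        if ($3 == "not") print "not found"; else print $3
    }'
}

# Example (commented out; run from the gpuowl build directory):
#   ldd ./gpuowl | opencl_lib_from_ldd
```

If the printed path is not under /opt/rocm-3.5.0, the binary is loading a different OpenCL loader than the one ROCm installed, which is at least worth knowing before changing the Makefile again.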

Last fiddled with by paulunderwood on 2020-06-04 at 23:30

Old 2020-06-04, 23:11   #39
ewmayer

Quote:
Originally Posted by preda View Post
One of the first things gpuowl does at startup is compile the kernels (OpenCL compilation); it prints a message with the timing once that's done.

You could start gpuowl under the debugger, interrupt it after a while, and see where the threads are sitting (i.e. what it is doing and what it is waiting for). Another way would be to add log() lines at strategic places in the source to mark the passage.

Also, gpuowl -h should display an entry for the GPU (if it doesn't, it's a bad omen). clinfo working is also a good sign.
Here is what happens on program start:
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl
[sudo] password for ewmayer: 
2020-06-03 18:26:55 gpuowl v6.11-311-gfa76bd9
2020-06-03 18:26:55 Note: not found 'config.txt'
2020-06-03 18:26:55 device 0, unique id ''
At that point it hangs, and the ssh-session stops accepting signals ... I had to kill said remote session from another term on my Macbook. Comparing to run-start diagnostics on one of my R7s, after those 3 lines it next prints a line with this format:

[date & time] [GPU ID, e.g. gfx906+sram-ecc-0] [exponent] [FFT info & bpw]

That is followed by an "Expected maximum carry" line, an "OpenCL args" line, then an OpenCL-compilation-timing line. I'm getting none of those, i.e. it's hanging on the way to the printout of the above informational line.

As noted yesterday, clinfo shows the platform and CL version (2.0), but no valid devices:
Code:
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3137.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform
The ROCm interface seems OK; it even allows me to do the usual --setsclk and --setfan fiddles, and the corresponding numbers in the status (rocm-smi, no args) output change in the expected directions.

Using 'sudo gdb' to run under gdb starts with the expected "(No debugging symbols found in ../gpuowl)", then 'run' hits the hung state; even under gdb, ctrl-c and ctrl-z fail to have any effect.

Advice on instrumenting via log(), enabling debug symbols, etc, welcome.
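When a process ignores ctrl-c even under gdb, one possibility (an assumption, not something established in this thread) is that it is blocked inside a kernel or driver call in uninterruptible sleep, which ps reports as state D. A minimal sketch for checking this from a second terminal, using standard procps ps; the function names are hypothetical:

```shell
#!/bin/sh
# Show PID, state, kernel wait channel and command name for one process.
show_state() {
    ps -o pid=,stat=,wchan=,comm= -p "$1"
}

# Explain the first letter of a ps STAT field.
explain_state() {
    case "$1" in
        D*)    echo "uninterruptible sleep (usually waiting inside the kernel or a driver)" ;;
        R*)    echo "running" ;;
        S*)    echo "interruptible sleep" ;;
        T*|t*) echo "stopped" ;;
        Z*)    echo "zombie" ;;
        *)     echo "other" ;;
    esac
}

# Example (commented out):
#   pid=$(pgrep -n gpuowl)
#   show_state "$pid"
#   sudo cat "/proc/$pid/stack"    # kernel-side stack, if the kernel exposes it
#   dmesg | tail -n 30             # recent driver messages
```

A D state together with driver complaints in dmesg would point at the kernel driver rather than at gpuowl, and would also explain why signals appear to be ignored.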

Old 2020-06-04, 23:20   #40
paulunderwood

Try running gpuowl with -h as preda wrote.

What is your gpuowl Makefile's LIBPATH?

Last fiddled with by paulunderwood on 2020-06-04 at 23:22

Old 2020-06-04, 23:51   #41
ewmayer

Quote:
Originally Posted by paulunderwood View Post
Try running gpuowl with -h as preda wrote.

What is your gpuowl Makefile's LIBPATH?
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl -h
[sudo] password for ewmayer: 
2020-06-04 16:45:04 gpuowl v6.11-311-gfa76bd9

-dir <folder>      : specify local work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-pool <dir>        : specify a directory with the shared (pooled) worktodo.txt and results.txt
                     Multiple GpuOwl instances, each in its own directory, can share a pool of assignments and report
                     the results back to the common pool.
-uid <unique_id>   : specifies to use the GPU with the given unique_id (only on ROCm/Linux)
-user <name>       : specify the user name.
-cpu  <name>       : specify the hardware name.
-time              : display kernel profiling information.
-fft <spec>        : specify FFT e.g.: 1152K, 5M, 5.5M, 256:10:1K
-block <value>     : PRP GEC block size, or LL iteration-block size. Must divide 10'000.
-log <step>        : log every <step> iterations. Multiple of 10'000.
-jacobi <step>     : (LL-only): do Jacobi check every <step> iterations. Default 1'000'000.
-carry long|short  : force carry type. Short carry may be faster, but requires high bits/word.
-B1                : P-1 B1 bound, default 1000000
-B2                : P-1 B2 bound, default B1 * 30
-rB2               : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-cleanup           : delete save files at end of run
-prp <exponent>    : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent>    : run a single P-1 test and exit, ignoring worktodo.txt
-ll <exponent>     : run a single LL test and exit, ignoring worktodo.txt
-verify <file>|<exponent> : verify PRP-proof contained in <file> or in the folder <exponent>/
-proof [<power>]   : enable PRP proof generation. Default <power> is 9.
-results <file>    : name of results file, default 'results.txt'
-iters <N>         : run next PRP test for <N> iterations and exit. Multiple of 10000.
-maxAlloc          : limit GPU memory usage to this value in MB (needed on non-AMD GPUs)
-yield             : enable work-around for CUDA busy wait taking up one CPU core
-nospin            : disable progress spinner
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning)
-safeMath          : do not use -cl-unsafe-math-optimizations (OpenCL)
-binary <file>     : specify a file containing the compiled kernels binary
-device <N>        : select a specific device:
[hangs]
Grep finds 2 occurrences of LIBPATH in the Makefile ... not sure what the '.' ending the actual-path one means:
Code:
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl$ grep LIBPATH Makefile
3:LIBPATH = -L/opt/rocm-3.3.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.
5:LDFLAGS = -lstdc++fs -lOpenCL -lgmp -pthread ${LIBPATH}
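On the trailing -L.: it simply adds the current directory to the linker's library search path. To see which of the listed -L directories actually exist and contain a libOpenCL, something like the following sketch could be used (check_libpath is a hypothetical helper, not part of the gpuowl build):

```shell
#!/bin/sh
# For each -L<dir> argument, report whether <dir> exists and holds a
# libOpenCL.so*. Arguments that are not -L options are ignored.
check_libpath() {
    for opt in "$@"; do
        case "$opt" in
            -L*) dir=${opt#-L} ;;
            *)   continue ;;
        esac
        if [ ! -d "$dir" ]; then
            echo "$dir: missing"
        elif ls "$dir"/libOpenCL.so* >/dev/null 2>&1; then
            echo "$dir: has libOpenCL"
        else
            echo "$dir: no libOpenCL"
        fi
    done
}

# Example, using directories from the grep output above (commented out):
#   check_libpath -L/opt/rocm-3.3.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L.
```

The linker searches the -L directories in the order given, so the first directory reported as "has libOpenCL" is the one that satisfies -lOpenCL at link time.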

Old 2020-06-05, 00:00   #42
paulunderwood

It is a long shot... Make the beginning of your LIBPATH look like this:

Code:
LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64 ...the rest
and run make clean && make gpuowl

Last fiddled with by paulunderwood on 2020-06-05 at 00:00

Old 2020-06-05, 00:08   #43
preda

Quote:
Originally Posted by ewmayer View Post
ewmayer@ewmayer-NUC8i3CYS:~/gpuowl/RUN$ sudo ../gpuowl -h
OpenCL is not initialized correctly. It hangs when doing something as basic as listing all the devices of an OpenCL provider. There's not much to fix in gpuowl for this IMO; you should get clinfo to report a valid device first.

The fact that rocm-smi works is not much. It just means that the GPU is initialized correctly. (basically the files under /sys/class/drm/card0/ are there). It does not mean that OpenCL works.
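The "get clinfo to report a valid device first" check can be scripted by totalling the "Number of devices" lines in clinfo's output. A sketch, assuming the default human-readable clinfo format shown earlier in this thread; count_cl_devices is a made-up name:

```shell
#!/bin/sh
# Sum the "Number of devices" counts from clinfo output read on stdin.
# Prints 0 when no such line is present (for example, on empty input).
count_cl_devices() {
    awk '/^[[:space:]]*Number of devices/ { n += $NF } END { print n + 0 }'
}

# Example: only launch gpuowl once OpenCL reports at least one device.
# (Commented out; the 30-second limit is an arbitrary choice.)
#   if [ "$(timeout 30 clinfo | count_cl_devices)" -gt 0 ]; then
#       ../gpuowl
#   else
#       echo "OpenCL sees no devices yet"
#   fi
```

With the clinfo output posted above this would print 0, so the gpuowl launch would be skipped.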

Old 2020-06-05, 00:11   #44
ewmayer

Quote:
Originally Posted by paulunderwood View Post
It is a long shot... Make the beginning of your LIBPATH look like this:

Code:
LIBPATH = -L/opt/rocm-3.5.0/opencl/lib/x86_64 ...the rest
and run make clean && make gpuowl
Compile succeeds; here is the link line:

g++ -o gpuowl Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o sha3.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm-3.5.0/opencl/lib/x86_64 -L/opt/rocm-3.1.0/opencl/lib/x86_64 -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L.

...but same result (hang after 'select a specific device:' informational) as before.

Gotta run, thanks for the various try-this advice, we'll see what tomorrow brings.

(I know one thing tomorrow will bring ... a cool pic of The Beast in its new glass-roof-and-floor lair, now that I have the resulting airflow-restriction issues resolved).

Last fiddled with by ewmayer on 2020-06-05 at 00:11

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.