gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2017-04-20 01:08

Something is definitely unexpected about the performance you see. I'm using AMDGPU-Pro as the OpenCL compiler (both the most recent version, 17.10, and the previous 16.xx produced similar performance). Which OpenCL driver do you use? Is the GPU in a good PCIe slot (no transfer bottleneck)?

I would dump the compiled .isa kernels and look there for differences (you can pass "-save-temps" in clwrap.h, where it is already present in a comment, and send me the .isa). Alternatively, I'll add an option to enable that with a command-line argument.

tului 2017-04-20 03:22

Sam Harris would approve of the name. They're perfectly good after all.

kracker 2017-04-22 01:08

Been tinkering around with it on Windows. I got this after compiling:

[code]
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
LL of 76000021 at iteration 0
FFT 1024*2048 (4M words, 18.12 bits per word)
log An invalid option was specified.

error -11
Assertion failed!
Program: C:\Users\Back\Desktop\gpuowl-0832c6d\gpuowl-0832c6d.exe
File: clwrap.h, Line 66

Expression: false

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
[/code]
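For reference, the "18.12 bits per word" figure in the log is consistent with dividing the exponent by the number of words. Here is a minimal Python sketch of that arithmetic, assuming each of the 1024*2048 FFT elements is a complex value holding two words (that reading of "4M words" is an assumption, and `bits_per_word` is a hypothetical helper, not gpuOwL code):

```python
def bits_per_word(exponent: int, fft_rows: int, fft_cols: int,
                  words_per_element: int = 2) -> float:
    """Average number of bits of the exponent carried by each FFT word.

    words_per_element=2 is an assumption: one complex element = two words.
    """
    words = fft_rows * fft_cols * words_per_element
    return exponent / words


print(1024 * 2048 * 2)                               # 4194304, i.e. 4M words
print(f"{bits_per_word(76000021, 1024, 2048):.2f}")  # 18.12
```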

Also, it might be better for the program to accept any assignment ID. I tried N/A and also some random gibberish, and it accepted neither. It probably requires a minimum number of characters.

EDIT: Pulled and recompiled from the latest commit:

[code]
OpenCL Compilation error -11, log:
An invalid option was specified.
[/code]

VictordeHolland 2017-04-22 09:10

1 Attachment(s)
Same here: it compiles without errors on mingw64, but I also get:
error -11
Assertion failed!
clwrap.h, Line 66

preda 2017-04-22 11:37

Kracker, Victor: thanks for trying the compile! As I only compiled with one CL implementation myself, I was not aware of these problems.

Now, could you try again with a freshly checked-out version? I simplified the CL options.

If it still doesn't pass the CL compiler, you can try deleting -cl-fast-relaxed-math -cl-std=CL2.0 from clwrap.h line 89 (but hopefully you won't have to do this).

Concerning the AID in worktodo.txt, try a single 0, like this:
Test=0,24036583,75,1

Otherwise, what should pass in any case is 32 hex digits, e.g. the digit 0 repeated 32 times.
But Test=N/A would not pass, because N/A is not a valid hexadecimal string.
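
To make the rule above concrete, here is a minimal Python sketch of such a check. It is not the gpuOwL parser: the accepted forms (a single 0, or exactly 32 hex digits) are taken only from the description above, the real parser may accept more, and `is_valid_aid` and `parse_test_line` are hypothetical names.

```python
import re

# Assumed rule, based only on the description in this post: an assignment ID
# (AID) is accepted if it is a single "0" or exactly 32 hexadecimal digits.
_AID_RE = re.compile(r"0|[0-9A-Fa-f]{32}")


def is_valid_aid(aid: str) -> bool:
    """Return True if the AID matches the assumed rule."""
    return _AID_RE.fullmatch(aid) is not None


def parse_test_line(line: str):
    """Split a 'Test=AID,exponent,...' line; return (aid, exponent) or None."""
    if not line.startswith("Test="):
        return None
    fields = line[len("Test="):].strip().split(",")
    if len(fields) < 2 or not is_valid_aid(fields[0]) or not fields[1].isdigit():
        return None
    return fields[0], int(fields[1])


print(parse_test_line("Test=0,24036583,75,1"))    # ('0', 24036583)
print(parse_test_line("Test=N/A,24036583,75,1"))  # None
print(is_valid_aid("0" * 32))                     # True
```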

- Mihai

preda 2017-04-22 11:44

@airsquirrels: could you run a fresh checked-out version with command line args "-cl -save-temps", e.g.: ./gpuowl -cl -save-temps

This passes "-save-temps" to the CL compiler, which should produce a dump of the GCN ISA. If that works, you should see a file like "_temp_1_Fiji.isa". Could you send that .isa file to me, so I can see whether the perf degradation is caused by poorly generated ISA code (e.g. too many VGPRs used)?

thanks,
Mihai

VictordeHolland 2017-04-22 13:10

1 Attachment(s)
Hi,

I tried with the default "-cl-fast-relaxed-math" and with a "-cl-opt-disable" change on line 89 of clwrap.h, but still got some errors.

I forgot to mention that my card is an HD7950, which only supports OpenCL [U]1.2[/U]:
[URL]https://en.wikipedia.org/wiki/Radeon_HD_7000_Series[/URL]

At least gpuOwl detects it is a Tahiti OpenCL 1.2 AMD-APP 2079.5 device :).

Attached is the _temp_0_Tahiti.cl that was created. (I had to rename it to .txt, or else I couldn't upload it to the forum.)

[code]
C:\msys64\home\gpuowl>gpuowl
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2079.5)
OpenCL compilation error -11, log:
"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 15: error: attributes may not appear here
double2 _O mul(double2 u, double a, double b) { return (double2) { u.x * a - u.y * b, u.x * b + u.y * a}; }
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 16: error: attributes may not appear here
double2 _O mul(double2 u, double2 v) { return mul(u, v.x, v.y); }
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 59: error: attributes may not appear here
void _O shuffle(local double *lds, double2 *u, uint n, uint f) {
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 83: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[1024];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 86: error: identifier "lds" is undefined
shuffle(lds, u, 4, 64);
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 102: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[2048];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 105: error: identifier "lds" is undefined
shuffle(lds, u, 8, 32);
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 365: error: variable with automatic storage duration cannot be stored in the named address space
local double lds[4096];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 372: error: identifier "lds" is undefined
lds[l * 64 + (c + l) % 64] = ((double *)(u + i))[b];
^

"C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl", line 378: error: identifier "lds" is undefined
((double *)(u + i))[b] = lds[l * 64 + (c + l) % 64];
^

10 errors detected in the compilation of "C:\Users\Victor\AppData\Local\Temp\OCL6952T1.cl".
Frontend phase failed compilation.
[/code]
Here is what I get from the clinfo command:
[code]
clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2079.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_amd_event_callback cl_amd_offline_devices


Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon HD 7900 Series
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 28
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 900Mhz
Address bits: 32
Max memory allocation: 2214174021
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 3221225472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 000007FED5DF5188
Name: Tahiti
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2079.5 (VM)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2079.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event


Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 3300Mhz
Address bits: 64
Max memory allocation: 4289850368
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 17159401472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 4289850368
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 310
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 000007FED5DF5188
Name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.2
Driver version: 2079.5 (sse2,avx)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2079.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_spir cl_khr_gl_event
[/code]

preda 2017-04-23 00:17

OK, I tried to fix these CL compilation errors as well (please retry).

OTOH I need to investigate a bug that appears to be present. I would not recommend doing any serious LL testing with gpuOwL right now; I need to validate it a bit more first.

kracker 2017-04-23 01:07

Recompiled.. upon launch I got:
[code]
OpenCL compilation error -11, log:
An invalid option was specified.
[/code]

I removed -cl-std=CL2.0 from clwrap.h and it appears to be working... will see how it goes.
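
One way to avoid editing clwrap.h by hand would be to request -cl-std=CL2.0 only when the device compiler reports OpenCL C 2.0 or later. The Python sketch below shows that decision only; it is not how gpuOwL builds its options. It assumes the version string has the "OpenCL C <major>.<minor>" form shown by clinfo as "Device OpenCL C version", and `opencl_c_version` and `build_options` are hypothetical names.

```python
import re


def opencl_c_version(version_string: str) -> tuple:
    """Parse 'OpenCL C <major>.<minor> ...' into (major, minor)."""
    m = re.match(r"OpenCL C (\d+)\.(\d+)", version_string)
    if m is None:
        raise ValueError(f"unrecognised OpenCL C version string: {version_string!r}")
    return int(m.group(1)), int(m.group(2))


def build_options(version_string: str) -> str:
    """Request -cl-std=CL2.0 only when the device compiler reports 2.0 or later."""
    options = ["-cl-fast-relaxed-math"]
    if opencl_c_version(version_string) >= (2, 0):
        options.append("-cl-std=CL2.0")
    return " ".join(options)


print(build_options("OpenCL C 1.2"))  # -cl-fast-relaxed-math
print(build_options("OpenCL C 2.0"))  # -cl-fast-relaxed-math -cl-std=CL2.0
```

With a check like this, an OpenCL C 1.2 device would never be handed the CL2.0 flag, which is the flag whose removal made the build work here.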

kracker 2017-04-23 01:55

1 Attachment(s)
Very impressive! It is actually slightly faster on my low-end HD7770 (I will try it on my R9 285, which has OCL 2.0 capability, when I have time), and the error numbers are better too. The residues also seem to match clLucas.

kladner 2017-04-23 08:06

If I may say, as a spectator and a non-coder, it amazes me to watch this birth process. The cooperation and involvement by several parties is impressive. Seeing this play out is one of the big pay-offs for hanging out on this forum.

