mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

nomead 2019-12-09 12:50

RTX 2080, Linux. Just some general observations and incoherent rambling; I haven't done much tuning. In fact, I hadn't touched gpuOwL at all before this past weekend, so it's all a bit new to me.

I started with whatever was committed to GitHub up until 2019-12-04. At first I had some compilation issues, but those were due to the #pragma statements commented out in gpuowl.cl (fixed now).

Then the program apparently needed the -use NO_ASM option to run (thanks SELROC for pointing that out on IRC).

After that I got the program running... but the timings seemed to be all over the place. 2816K took 3.743 ms/iter, but the next size I tested, 5120K, took 3.884 ms/iter; the difference should be bigger, so something had to be wrong. Well, after some fiddling around, I found out how to specify the FFT options (width/height/middle) and found that the default settings were, most of the time, not the fastest ones. Maybe I should have read through this thread better...

Anyway, how would one specify both FFT size and other options? -fft 5632K and -fft +2 seem to be mutually exclusive; only one of them works. It would also be really useful to have these options in some configuration file, so the program is ready to use even if the FFT size changes (and the new size is faster with different options).
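Until something like that exists in the program itself, a small wrapper could hold the per-size tuning. The following is only a sketch under assumptions: the executable path `./gpuowl`, the table contents, and the idea that several `-use` options can be joined with commas are all illustrative and not confirmed in this post. Only the `-fft` and `-use` flags themselves appear above.

```python
import shlex

# Hypothetical tuning table: FFT size -> "-use" options found fastest for it.
# The entries are placeholders, not measured results.
TUNED_OPTIONS = {
    "5632K": ["NO_ASM", "MERGED_MIDDLE"],
    "4M": ["NO_ASM"],
}

# Fallback for FFT sizes that have not been tuned yet.
DEFAULT_OPTIONS = ["NO_ASM"]


def build_command(fft_size, exe="./gpuowl"):
    """Return an argument list for the given FFT size.

    Assumes (unverified) that -use accepts a comma-separated list.
    """
    use = TUNED_OPTIONS.get(fft_size, DEFAULT_OPTIONS)
    return [exe, "-fft", fft_size, "-use", ",".join(use)]


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(build_command("5632K")))
```

The wrapper only builds the command line, so it can be inspected before anything is launched.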

Then came the commits from yesterday (2019-12-08). On the RTX 2080, the calculation is limited by the FP64 units, not memory bandwidth (memory bus usage is in the 20-30% range depending on FFT size), but there were still some noticeable improvements. For example, 5632K (-fft +2) was 4.396 ms/iter before and 4.237 ms/iter after the update, using MERGED_MIDDLE. So that's about 3.6% better. The improvements vary quite a bit, from 0% to 5.5%, and the average across all FFT sizes and parameters I tested (2M to 20M) was 2.2%.

Another comparison: CUDALucas with the closest applicable FFT size (5760K) and the same hardware is 5.585 ms/iter, so gpuOwL is a bit over 30% faster. Of course the difference varies quite a lot there, too, but 20-30% seems to be the norm.
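For reference, the arithmetic behind those two comparisons can be sketched as follows. The function names are made up for illustration; the timings are the ones quoted above. Note that the CUDALucas figure is for a slightly larger FFT (5760K vs 5632K), so the second number is only approximate.

```python
def time_reduction_pct(before_ms, after_ms):
    """Percentage by which the per-iteration time dropped."""
    return (before_ms - after_ms) / before_ms * 100.0


def speedup_pct(slow_ms, fast_ms):
    """Percentage throughput advantage of the faster program."""
    return (slow_ms / fast_ms - 1.0) * 100.0


if __name__ == "__main__":
    # gpuOwL 5632K before vs after the 2019-12-08 commits
    print(f"{time_reduction_pct(4.396, 4.237):.1f}%")  # 3.6%
    # CUDALucas 5760K vs gpuOwL 5632K on the same card
    print(f"{speedup_pct(5.585, 4.237):.1f}%")  # 31.8%
```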

One observation, though, about that MERGED_MIDDLE improvement. If the FFT size happens to be one without that "middle" part (2M, 4M, 8M, 16M...) and the dumb user (me) still instructs the program to -use MERGED_MIDDLE, then the calculation will fail. In hindsight this shouldn't be a surprise, but I plead ignorance and the effects of a Monday morning. :coffee: The error is:
[CODE]2019-12-09 08:24:18 38000009 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)
2019-12-09 08:24:18 Exiting because "error on load"
[/CODE]
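A wrapper script could guard against this by dropping MERGED_MIDDLE for power-of-two FFT sizes. This is a minimal sketch, not anything gpuOwL does itself; the helper names are hypothetical, and it assumes that "K" means 1024 and "M" means 1024 * 1024 in FFT size strings.

```python
def parse_fft_size(text):
    """Convert a size string such as '5632K' or '4M' to an integer.

    Assumption: K = 1024 and M = 1024 * 1024.
    """
    units = {"K": 1024, "M": 1024 * 1024}
    suffix = text[-1].upper()
    if suffix not in units:
        raise ValueError(f"unrecognised FFT size: {text!r}")
    return int(text[:-1]) * units[suffix]


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def safe_use_flags(fft_size, flags):
    """Drop MERGED_MIDDLE when the FFT size is a power of two."""
    if is_power_of_two(parse_fft_size(fft_size)):
        return [f for f in flags if f != "MERGED_MIDDLE"]
    return list(flags)


if __name__ == "__main__":
    print(safe_use_flags("4M", ["NO_ASM", "MERGED_MIDDLE"]))
    print(safe_use_flags("5632K", ["NO_ASM", "MERGED_MIDDLE"]))
```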

kracker 2019-12-09 15:33

Some warnings when compiling... Probably unimportant.
[code]
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from checkpoint.h:5,
from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs

[/code]

kriesel 2019-12-09 16:04

[QUOTE=kracker;532452]Some warnings when compiling... Probably unimportant.
[/QUOTE]Maybe [url]https://www.mersenneforum.org/showpost.php?p=530766&postcount=40[/url] will help.

kriesel 2019-12-09 17:38

Feature request: OpenCL version test
 
Preda, please add an OpenCL version test. As previously posted, [URL]https://www.mersenneforum.org/showpost.php?p=525496&postcount=1354[/URL]

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, opencl 1.1/1.2), and assorted other older gpus, producing a shower of cl compile errors relating to atomics. I think it requires at least OpenCL 2 and therefore a CUDA compute capability above 2.x. [B]An explicit test for opencl version[/B] by gpuowl and clear message if the version is too low might be a good thing. ("Gpuowl requires OpenCL 2 support for atomics, which this gpu does not appear to support. Exiting now." or some such helpful message.)

kracker 2019-12-09 18:41

Some numbers:

RX570
[code]
5033 NO_ASM
4384 NO_ASM,MERGED_MIDDLE
7285 NO_ASM,MERGED_MIDDLE,WORKINGIN
4365 NO_ASM,MERGED_MIDDLE,WORKINGIN1
4360 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
4459 NO_ASM,MERGED_MIDDLE,WORKINGIN2
4381 NO_ASM,MERGED_MIDDLE,WORKINGIN3
4358 NO_ASM,MERGED_MIDDLE,WORKINGIN5
7433 NO_ASM,MERGED_MIDDLE,WORKINGOUT
5818 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
4400 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
4410 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
4762 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
4385 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
4610 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
4517 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
[/code]

Tesla P100
[code]
1318 NO_ASM
951 NO_ASM,MERGED_MIDDLE
945 NO_ASM,MERGED_MIDDLE,WORKINGIN
944 NO_ASM,MERGED_MIDDLE,WORKINGIN1
952 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
945 NO_ASM,MERGED_MIDDLE,WORKINGIN2
952 NO_ASM,MERGED_MIDDLE,WORKINGIN3
939 NO_ASM,MERGED_MIDDLE,WORKINGIN4
942 NO_ASM,MERGED_MIDDLE,WORKINGIN5
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
956 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
951 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
954 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
949 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
[/code]

kriesel 2019-12-09 19:50

[QUOTE=kracker;532464]Some numbers:[/QUOTE]Wow, 15% and 40%.
Thanks for running these.
Please try on any other colab gpu models when you get a chance.
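The 15% and 40% figures can be checked from the tables with a few lines of Python. This sketch uses only a subset of the rows (the NO_ASM baseline, plain MERGED_MIDDLE, and the fastest entry for each card), and it assumes the numbers are per-iteration timings where lower is better; the posts do not state the unit.

```python
# Subset of the posted numbers: baseline, MERGED_MIDDLE, and fastest variant.
RX570 = {
    "NO_ASM": 5033,
    "NO_ASM,MERGED_MIDDLE": 4384,
    "NO_ASM,MERGED_MIDDLE,WORKINGIN5": 4358,
}
TESLA_P100 = {
    "NO_ASM": 1318,
    "NO_ASM,MERGED_MIDDLE": 951,
    "NO_ASM,MERGED_MIDDLE,WORKINGIN4": 939,
}


def best_config(timings):
    """Return (options, time) for the fastest entry."""
    return min(timings.items(), key=lambda item: item[1])


def gain_pct(timings, baseline="NO_ASM"):
    """Throughput gain of the fastest entry over the baseline, in percent."""
    _, best_time = best_config(timings)
    return (timings[baseline] / best_time - 1.0) * 100.0


if __name__ == "__main__":
    for name, table in (("RX570", RX570), ("Tesla P100", TESLA_P100)):
        options, _ = best_config(table)
        print(f"{name}: {options} is {gain_pct(table):.1f}% faster than NO_ASM")
```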

kriesel 2019-12-09 19:53

[QUOTE=Prime95;532431]Of course, that was from a huge sample size of 1 nVidia card.[/QUOTE]The need for sleep and a more efficient way of getting the data intervened. You're welcome. I also had multiple gpus tied up in P-1 limit and runtime scaling runs in gpuowl versions predating P-1 save file capability at the time.

preda 2019-12-09 19:58

Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as OpenCL 1.2 although it compiled OpenCL 2.0 code fine. I'm worried that, with this check in place, gpuowl would not even attempt to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.

[QUOTE=kriesel;532458]Preda, please add an OpenCL version test. As previously posted, [URL]https://www.mersenneforum.org/showpost.php?p=525496&postcount=1354[/URL]

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, opencl 1.1/1.2), and assorted other older gpus, producing a shower of cl compile errors relating to atomics. I think it requires at least OpenCL 2 and therefore a CUDA compute capability above 2.x. [B]An explicit test for opencl version[/B] by gpuowl and clear message if the version is too low might be a good thing. ("Gpuowl requires OpenCL 2 support for atomics, which this gpu does not appear to support. Exiting now." or some such helpful message.)[/QUOTE]
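To illustrate the trade-off discussed here, the sketch below parses a device version string and only warns rather than refusing to run, since a driver may under-report its version. It is not gpuowl's actual check. It relies on the OpenCL convention that the device version string starts with "OpenCL <major>.<minor> "; the sample strings and function names are illustrative.

```python
import re


def parse_opencl_version(version_string):
    """Extract (major, minor) from a string like 'OpenCL 1.2 <vendor info>'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def version_warning(version_string, required=(2, 0)):
    """Return a warning message, or None if the reported version is sufficient.

    This only warns: a driver may report a lower version than it can
    actually compile, so the caller should still attempt the build.
    """
    version = parse_opencl_version(version_string)
    if version is None:
        return f"Could not parse OpenCL version from {version_string!r}"
    if version < required:
        return (
            f"Device reports OpenCL {version[0]}.{version[1]}; "
            f"OpenCL {required[0]}.{required[1]} is expected, so compilation may fail"
        )
    return None


if __name__ == "__main__":
    # Illustrative strings, not captured from real devices.
    for sample in ("OpenCL 1.1 CUDA", "OpenCL 2.0 vendor-info"):
        print(sample, "->", version_warning(sample))
```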

kriesel 2019-12-09 20:17

gpuowl feature request
 
Gpuowl feature request: P-1 res64 check for special very likely bad interim residues 0x00 and 0x01. P-1 is currently computing on the high wire without a safety net.

Undetected errors could cost hours or days in single lengthy P-1 runs, and also missed factors.



Come to think of it, that res64 check could save some lost PRP time too when errors occur. Less incentive there though, since the excellent GEC safety net catches the errors eventually.
[CODE]2019-12-08 06:01:08 91305491 OK 62250000 68.18%; 1184 us/sq; ETA 0d 09:34; 5cf68328b1473b4a (check 0.90s) 2 errors
2019-12-08 06:02:07 91305491 62300000 68.23%; 1184 us/sq; ETA 0d 09:32; 3efaa597c7d9c53d
2019-12-08 06:03:06 91305491 62350000 68.29%; 1184 us/sq; ETA 0d 09:31; 392c10b87c906301
2019-12-08 06:04:05 91305491 62400000 68.34%; 1178 us/sq; ETA 0d 09:28; 0000000000000000 <-- already have the res64 for output, test it, return to 62250000, save 100,000 additional bad iterations until the GEC check
2019-12-08 06:05:03 91305491 62450000 68.40%; 1155 us/sq; ETA 0d 09:15; 0000000000000000
2019-12-08 06:06:02 91305491 EE 62500000 68.45%; 1156 us/sq; ETA 0d 09:15; 0000000000000000 (check 0.90s) 2 errors
2019-12-08 06:07:02 91305491 62300000 68.23%; 1204 us/sq; ETA 0d 09:42; 3efaa597c7d9c53d
2019-12-08 06:08:01 91305491 62350000 68.29%; 1184 us/sq; ETA 0d 09:31; 392c10b87c906301
2019-12-08 06:09:01 91305491 62400000 68.34%; 1184 us/sq; ETA 0d 09:30; e0c75b60654dbfb4
2019-12-08 06:10:00 91305491 62450000 68.40%; 1183 us/sq; ETA 0d 09:29; c271cc2b8386285f
2019-12-08 06:11:00 91305491 OK 62500000 68.45%; 1183 us/sq; ETA 0d 09:28; 070950e467249083 (check 0.96s) 3 errors
[/CODE]
Average savings would be 125,000 iterations (about 2.5 minutes at the wavefront on a Radeon VII, proportionally more on bigger exponents or slower GPUs), with a minimum of 50,000 (~1 minute) and a maximum of 200,000 (about 4 minutes per error in this case).
[CODE]2019-12-03 21:19:42 89064097 OK 75500000 84.77%; 1214 us/sq; ETA 0d 04:34; fba20ffb703f9fb7 (check 0.91s)
2019-12-03 21:20:42 89064097 75550000 84.83%; 1189 us/sq; ETA 0d 04:28; 0000000000000000
2019-12-03 21:21:40 89064097 75600000 84.88%; 1167 us/sq; ETA 0d 04:22; 0000000000000000
2019-12-03 21:22:39 89064097 75650000 84.94%; 1171 us/sq; ETA 0d 04:22; 0000000000000000
2019-12-03 21:23:38 89064097 75700000 84.99%; 1171 us/sq; ETA 0d 04:21; 0000000000000000
2019-12-03 21:24:37 89064097 EE 75750000 85.05%; 1172 us/sq; ETA 0d 04:20; 0000000000000000 (check 0.94s)
2019-12-03 21:25:39 89064097 75550000 84.83%; 1239 us/sq; ETA 0d 04:39; 49985b238359ff96
2019-12-03 21:26:40 89064097 75600000 84.88%; 1217 us/sq; ETA 0d 04:33; 78c0f7429d9a238f
2019-12-03 21:27:41 89064097 75650000 84.94%; 1217 us/sq; ETA 0d 04:32; efab7475b57165bb
2019-12-03 21:28:42 89064097 75700000 84.99%; 1216 us/sq; ETA 0d 04:31; f43c80e5de778e68
2019-12-03 21:29:44 89064097 OK 75750000 85.05%; 1212 us/sq; ETA 0d 04:29; d83c92710ddb50e8 (check 0.91s) 1 errors
[/CODE]
It appears that the majority of PRP3 GEC errors on my Radeon VII are of the 0x00 variety. I've not seen 0x01 yet. The rest are seemingly normal residue values.
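The requested check could look roughly like the sketch below. This is an illustration of the idea, not gpuowl code; the function names are hypothetical, and the iteration counts and timing are taken from the two logs above.

```python
# Interim res64 values treated as almost certainly bad (0x00 and 0x01).
SUSPECT_RES64 = {0x0, 0x1}


def is_suspect(res64_hex):
    """True if a logged res64 (16 hex digits) is one of the suspect values."""
    return int(res64_hex, 16) in SUSPECT_RES64


def iterations_saved(detected_at, next_check_at):
    """Iterations not wasted by rolling back as soon as a bad res64 is logged."""
    return next_check_at - detected_at


def minutes_saved(iterations, us_per_iteration):
    """Convert saved iterations to minutes at the given speed."""
    return iterations * us_per_iteration / 1e6 / 60.0


if __name__ == "__main__":
    # First log: zero residue first shown at 62,400,000; GEC check at 62,500,000.
    first = iterations_saved(62_400_000, 62_500_000)
    # Second log: zero residue first shown at 75,550,000; GEC check at 75,750,000.
    second = iterations_saved(75_550_000, 75_750_000)
    print(is_suspect("0000000000000000"), first, second)
    print(f"{minutes_saved(second, 1172):.1f} minutes")
```

In the second log the bad residue appears at the first report after the last good check, which is the best case for this kind of early detection.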

Prime95 2019-12-09 20:25

Latest windows build (with a fix for power-of-two FFT size with MERGED_MIDDLE).

[url]https://www.dropbox.com/s/bxty3e5qz5is68d/gpuowl-win.exe?dl=0[/url]

kriesel 2019-12-09 20:27

[QUOTE=preda;532469]Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as OpenCL 1.2 although it compiled OpenCL 2.0 code fine. I'm worried that, with this check in place, gpuowl would not even attempt to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.[/QUOTE]Thanks, will do. Even a less than perfectly reliable warning as to why there may be trouble will help us ordinary users. I still don't know whether a certain CUDA GPU failed to do gpuowl P-1 because of its OpenCL level, a bad -maxAlloc value, or something else. I have the impression I should go back and retest many GPUs for limits with some very recent version. If I recall correctly, the memory handling improved sometime after v6.9. [URL]https://www.mersenneforum.org/showpost.php?p=525696&postcount=1361[/URL]
I recently found that V6.11-9 could do P-1 on a 2GB RX550 that a 3GB GTX1060 with v6.9-0-gc137007 could not.

