mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

ewmayer 2020-03-30 02:51

[QUOTE=Prime95;541293]We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development.

The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.[/QUOTE]

I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.

For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?
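
For illustration, the overflow check described above can be sketched in C++ as follows; `times3_overflows` is a hypothetical name for this post, not gpuowl's actual code, which would do the equivalent inline in the OpenCL carry kernel:

```cpp
#include <cstdint>

// Sketch (hypothetical, not gpuowl's actual code) of the "sign after *3 same
// as sign of input" overflow check: multiply a 32-bit carry word by 3 with
// wraparound, and flag the cases where the true product no longer fits in a
// signed 32-bit int.
inline bool times3_overflows(int32_t x) {
    int32_t d = (int32_t)((uint32_t)x << 1);            // 2*x, wraparound
    if ((d ^ x) < 0) return true;                       // doubling already flipped the sign
    int32_t y = (int32_t)((uint32_t)d + (uint32_t)x);   // 3*x, wraparound
    return (y ^ x) < 0;                                 // sign after *3 differs from input
}
```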

Prime95 2020-03-30 03:17

[QUOTE=ewmayer;541297]I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.[/quote]

Yes, that is the goal.

[quote]For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?[/QUOTE]

In the latest code, I set -use DEBUG,CARRY32_LIMIT=0x70000000 to print any iterations where the 32-bit carry is getting close to the limit. This is slow code, useful for analysis but not for production runs.
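
As a host-side illustration (names and logic hypothetical; the actual check runs in the OpenCL carry kernel), such a limit watchdog amounts to:

```cpp
#include <cstdint>

// Hypothetical sketch of the CARRY32_LIMIT watchdog: report any carry whose
// magnitude is at or above the configured limit (0x70000000 by default here),
// i.e. uncomfortably close to the signed-32-bit ceiling of 2^31.
inline bool carryNearLimit(int64_t carry, uint32_t limit = 0x70000000u) {
    // |carry| computed safely even for INT64_MIN
    uint64_t mag = carry < 0 ? (uint64_t)(-(carry + 1)) + 1 : (uint64_t)carry;
    return mag >= limit;
}
```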

ewmayer 2020-03-30 19:34

Update on my Radeon VII sudden-onset-slowdown yesterday: more weirdness. Haven't yet spotted anything of note in the dmesg logs, but I'm still getting familiar with which AMD-GPU-related messages are normal and which are not. Now to the weirdness.

My usual post-reboot procedure is:

1. Fire up gpuOwl job in each of two terminal windows, each job in a separate working dir;
2. Open 3rd window, fiddle settings to sclk=4 and fan=120 (or higher if interior temps warrant it);
3. Fire up LL/PRP job on the CPU;
4. Look at rocm-smi output to check GPU state.

Last night, again rebooted the system (just to cover all bases), fired up the first gpuOwl job, but then skipped to [4] above - all looked normal, wattage ~200, temp nearing 70C, SCLK and MCLK at their expected values, fan noise ramping up nicely. Thought "yay! I fixed it!" Fired up the second gpuOwl job - within seconds the fan noise starts dropping fast, and a check of rocm-smi shows the dreaded "the workers have gone on strike" numbers. Killed the second job, and things reverted back to normal. No clue why the GPU is suddenly balking at running 2 jobs, but it being late, I figured better to quit while ahead - set sclk=5 to help compensate for the throughput hit from running just 1 job, and went to bed.

Even more weirdness - Just fired up 2nd job to see if the issue is reproducible, now all seems back to normal.

I believe the technical term is "gremlins".

ATH 2020-03-31 00:07

From gpuowl.cl:

[QUOTE]OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing>
IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>[/QUOTE]

What are the possible values and range to test for these variables?


On the Tesla P100 on Google Colab Pro, the 5M FFT ran at 909 µs/iteration with default settings; with tuned settings it is now at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4

Prime95 2020-03-31 03:56

[QUOTE=ATH;541364]From gpuowl.cl:



What are the possible values and range to test for these variables?


On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4[/QUOTE]

IN/OUT_WG=64,128,256,512
IN/OUT_SIZEX=4,8,16,32,64,128 (gpuowl will whine when the combination does not make sense)
IN/OUT_SPACING=4,8,16,32,64,128
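
A throwaway sketch of a sweep over the OUT_* ranges above (illustrative only, not part of gpuowl; gpuowl itself validates which combinations make sense):

```cpp
#include <string>
#include <vector>

// Enumerate candidate "-use" strings for the OUT_* tuning parameters above.
// The IN_* parameters take the same ranges: 4 * 6 * 6 = 144 combinations
// per direction.
std::vector<std::string> outCombos() {
    std::vector<std::string> v;
    for (int wg : {64, 128, 256, 512})
        for (int sx : {4, 8, 16, 32, 64, 128})
            for (int sp : {4, 8, 16, 32, 64, 128})
                v.push_back("-use OUT_WG=" + std::to_string(wg) +
                            ",OUT_SIZEX=" + std::to_string(sx) +
                            ",OUT_SPACING=" + std::to_string(sp));
    return v;
}
```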

You are the first nVidia user to test all these combinations. Alas, previously the colab GPUs showed only minor differences in these settings whereas nVidia consumer GPUs benefitted much more from an optimal setting.

BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.

ATH 2020-03-31 17:10

[QUOTE=Prime95;541384]BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.[/QUOTE]

I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use right?
git clone [url]https://github.com/preda/gpuowl[/url]

Prime95 2020-03-31 17:30

[QUOTE=ATH;541410]I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use right?
git clone [url]https://github.com/preda/gpuowl[/url][/QUOTE]

Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.

Yes, that is the correct source.

ewmayer 2020-03-31 19:56

[QUOTE=Prime95;541411]Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.[/QUOTE]

Just grabbed the latest (v. 62a3025) and built. Switched one of my 2 runs to it to give it a spin; see this on start (PRP of p = 103937143 @5632K):
[i]
Expected maximum carry32: 47840000
[/i]
Aside - before switching that run to the new version, both were getting ~1335 µs/iter (total 1498 iter/sec). With 1 run using the new version, that run is now @ 1580 µs/iter and the other has sped up to 1168 µs/iter (total 1490 iter/sec). With both runs using the new version, both are at 1333 µs/iter (total 1500 iter/sec). Probably some weird rocm-process-priority thing.

Aside #2: I've been doing near-daily price checks of new XFX Radeon VII cards on Amazon - they fluctuate interestingly. Couple days ago, $580. Yesterday, back to the same $550 I paid for mine in Feb. Just now, $600.

ATH 2020-04-01 00:20

Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 × 2 combinations of the 6 variables, but I did test many, and found 2 different combinations that both run at 809 µs:
[QUOTE]-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=128,IN_SIZEX=16,IN_SPACING=4[/QUOTE]

Many other combinations ran at 810-820 µs, many more at 820-850 µs, and a few rare bad ones at 980-990 µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings changed the speed from 809 µs to 814-815 µs, so not a big difference.

preda 2020-04-01 01:15

[QUOTE=ATH;541441]Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:


Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.[/QUOTE]

ORIG_X2 and INLINE_X2 do not exist anymore; setting them has no effect whatsoever.

This seems to suggest these changes to Nvidia defaults:
- handle T2_SHUFFLE like on AMD (i.e. default to NO_T2_SHUFFLE)
- handle CARRY like on AMD (i.e. default to CARRY32)

Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy? If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG we may consider making it the default on Nvidia.

Could other Nvidia users speak up if those proposed Nvidia defaults have adverse performance effects for them (due to different hardware)?
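
For the roundoff comparison requested above, the figure of merit is just the mean of the per-iteration roundoff magnitudes that -use ROUNDOFF reports; a trivial sketch (the helper name is made up, and collecting the numbers from gpuowl's output is assumed):

```cpp
#include <cmath>
#include <vector>

// Average the absolute per-iteration roundoff errors reported under
// -use ROUNDOFF, to compare ORIG_SLOWTRIG vs NEW_SLOWTRIG accuracy.
// (Hypothetical helper; gpuowl prints the errors, we only average them.)
double meanRoundoff(const std::vector<double>& errs) {
    if (errs.empty()) return 0.0;
    double sum = 0.0;
    for (double e : errs) sum += std::fabs(e);
    return sum / errs.size();
}
```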

kracker 2020-04-01 03:39

Windows compilation:

[code]
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
33 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~
make: *** [Makefile:30: Gpu.o] Error 1
[/code]
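
The root cause appears to be that `long` is 32 bits on Windows, so `1L << 32` shifts past the width of the type. Assuming the intent was a 2^-32 scale factor, a sketch of a portable fix is simply to widen the shifted literal:

```cpp
#include <cstdint>

// On Windows/MinGW 'long' is 32 bits, so (1L << 32) is undefined behaviour.
// Widening the shifted literal to 64 bits gives the intended 2^-32 scale.
// (Illustrative fix; the actual patch is up to the gpuowl authors.)
constexpr float roundScale = 1.0 / (1ULL << 32);
```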

