mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2020-03-30, 02:51   #2014
ewmayer
∂2ω=0

Sep 2002
República de California

29×401 Posts

Quote:
 Originally Posted by Prime95 We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development. The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.
I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.

For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?

2020-03-30, 03:17   #2015
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2×5×7×107 Posts

Quote:
 Originally Posted by ewmayer I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.
Yes, that is the goal.

Quote:
 For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?
In the latest code, I set -use DEBUG,CARRY32_LIMIT=0x70000000 to print any iterations where 32-bit carry is getting close to the limit. This is slow code, useful for analysis, not for production runs.

2020-03-30, 19:34   #2016
ewmayer
∂2ω=0

Sep 2002
República de California

2D6D₁₆ Posts

Update on my Radeon VII sudden-onset slowdown yesterday: more weirdness. I haven't yet spotted anything of note in the dmesg logs, but I'm still getting familiar with which AMD-GPU-related messages are normal and which are not.

Now to the weirdness. My usual post-reboot procedure is:
1. Fire up a gpuOwl job in each of two terminal windows, each job in a separate working dir;
2. Open a 3rd window, fiddle settings to sclk=4 and fan=120 (or higher if interior temps warrant it);
3. Fire up an LL/PRP job on the CPU;
4. Look at rocm-smi output to check GPU state.

Last night, again rebooted the system (just to cover all bases), fired up the first gpuOwl job, but then skipped to [4] above - all looked normal: wattage ~200, temp nearing 70C, SCLK and MCLK at their expected values, fan noise ramping up nicely. Thought "yay! I fixed it!" Fired up the second gpuOwl job - within seconds the fan noise starts dropping fast, and a check of rocm-smi shows the dreaded "the workers have gone on strike" numbers. Kill the second job, and things revert to normal. No clue why the GPU is suddenly balking at running 2 jobs, but it being late I figured better to quit while ahead - set sclk=5 to help compensate for the throughput hit from running 1 job, went to bed.

Even more weirdness - just fired up the 2nd job to see if the issue is reproducible, and now all seems back to normal. I believe the technical term is "gremlins".

Last fiddled with by ewmayer on 2020-03-30 at 19:35
2020-03-31, 00:07   #2017
ATH
Einyen

Dec 2003
Denmark

3131₁₀ Posts

From gpuowl.cl:

Quote:
 OUT_WG,OUT_SIZEX,OUT_SPACING IN_WG,IN_SIZEX,IN_SPACING
What are the possible values and range to test for these variables?

On the Tesla P100 on Google Colab Pro, it ran at 909 µs/iteration with default settings at 5M FFT; with tuned settings it is now at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4

Last fiddled with by ATH on 2020-03-31 at 00:59

2020-03-31, 03:56   #2018
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7490₁₀ Posts

Quote:
 Originally Posted by ATH From gpuowl.cl: What are the possible values and range to test for these variables? On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration: -use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4
IN/OUT_WG=64,128,256,512
IN/OUT_SIZEX=4,8,16,32,64,128 (gpuowl will whine when the combination does not make sense)
IN/OUT_SPACING=4,8,16,32,64,128

You are the first nVidia user to test all these combinations. Alas, previously the colab GPUs showed only minor differences in these settings whereas nVidia consumer GPUs benefitted much more from an optimal setting.

BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.

2020-03-31, 17:10   #2019
ATH
Einyen

Dec 2003
Denmark

31×101 Posts

Quote:
 Originally Posted by Prime95 BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.
I compiled it 2 days ago; any important changes since then? I can try compiling it again later today or tomorrow and test again.

This is the one to use right?
git clone https://github.com/preda/gpuowl

Last fiddled with by ATH on 2020-03-31 at 17:15

2020-03-31, 17:30   #2020
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2·5·7·107 Posts

Quote:
 Originally Posted by ATH I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again. This is the one to use right? git clone https://github.com/preda/gpuowl
Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.

Yes, that is the correct source.

2020-03-31, 19:56   #2021
ewmayer
∂2ω=0

Sep 2002
República de California

29·401 Posts

Quote:
 Originally Posted by Prime95 Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate. For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.
Just grabbed it (v. 62a3025) and built. Switched one of my 2 runs to it to give it a spin; I see this on start (PRP of p = 103937143 @ 5632K):

Expected maximum carry32: 47840000

Aside - before switching that run to the new version, both were getting ~1335 µs/iter (total 1498 iter/sec). With 1 run using the new version, that run is now at 1580 µs/iter and the other has sped up to 1168 µs/iter (total 1490 iter/sec). With both runs using the new version, both are at 1333 µs/iter (total 1500 iter/sec). Probably some weird rocm-process-priority thing.

Aside #2: I've been doing near-daily price checks on new XFX Radeon VII cards on Amazon - they fluctuate interestingly. A couple of days ago, $580. Yesterday, back to the same $550 I paid for mine in Feb. Just now, $600.

2020-04-01, 00:20   #2022
ATH
Einyen

Dec 2003
Denmark

31·101 Posts

Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895 µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:
Quote:
 -use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2
 -use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=128,IN_SIZEX=16,IN_SPACING=4
Many other combinations run at 810-820 µs and many more at 820-850 µs; a few rare bad ones ran at 980-990 µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings changed the speed from 809 µs to 814-815 µs, so not a big difference.

Last fiddled with by ATH on 2020-04-01 at 00:25

2020-04-01, 01:15   #2023
preda

"Mihai Preda"
Apr 2015

3²·151 Posts

Quote:
 Originally Posted by ATH Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB. This version is a bit faster on default settings at 5M FFT: 895µs/iteration Got down to 832 µs with: -use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32 I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs: Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs. Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.
ORIG_X2 and INLINE_X2 do not exist anymore, setting them has no effect whatsoever.

This seems to suggest these changes to Nvidia defaults:
- handle T2_SHUFFLE like on AMD (i.e. default to NO_T2_SHUFFLE)
- handle CARRY like on AMD (i.e. default to CARRY32)

Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error, to evaluate their respective accuracy? If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG, we may consider making it the default on Nvidia.

Could other Nvidia users speak up if these proposed Nvidia defaults have adverse performance effects for them (due to different hardware)?

Last fiddled with by preda on 2020-04-01 at 01:19

2020-04-01, 03:39   #2024
kracker

"Mr. Meeseeks"
Jan 2012
California, USA

3²·241 Posts

Windows compilation:
Code:
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
                 from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
   33 |     log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
      |                      ~^                  ~~~~~~~~~~~~
      |                       |                  |
      |                       char*              const value_type* {aka const wchar_t*}
      |                      %hs
Gpu.cpp: In member function 'std::tuple Gpu::isPrimePRP(u32, const Args&, std::atomic&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
  881 |     constexpr float roundScale = 1.0 / (1L << 32);
      |                                              ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
  881 |     constexpr float roundScale = 1.0 / (1L << 32);
      |                                  ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
  881 |     constexpr float roundScale = 1.0 / (1L << 32);
      |                                         ~~~~^~~~~
make: *** [Makefile:30: Gpu.o] Error 1

