mersenneforum.org  

Old 2020-03-30, 02:51   #2014
ewmayer

Quote:
Originally Posted by Prime95 View Post
We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development.

The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.
I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.

For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?
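A minimal sketch of the check described above, assuming nothing about gpuowl's actual carry code; the guard macro and function names are illustrative only. Note the sign comparison only catches overflows that land in bit 31, not every wrap, which is exactly the "overflow-into-sign-bit" class of error being discussed:

```cpp
#include <cstdint>

// Illustrative build guard; gpuowl's real preprocessor flags will differ.
#define CHECK_SIGN_OVERFLOW 1

// Multiply a signed 32-bit carry word by 3 with well-defined wraparound,
// and (in debug builds) flag the case where the result overflowed into
// the sign bit: if x and 3*x have different signs (and x != 0), bit 31
// was clobbered. An overflow that wraps all the way past the sign bit
// (e.g. 0x60000000 * 3) is NOT caught by this test.
static int32_t mul3(int32_t x, bool* overflowed) {
  int32_t y = (int32_t)((uint32_t)x * 3u);  // wraps mod 2^32, no UB
#if CHECK_SIGN_OVERFLOW
  if (x != 0 && ((x ^ y) < 0)) *overflowed = true;
#endif
  return y;
}
```

With the guard compiled out, the function is just the multiply, consistent with the "3 or 4 instructions" cost estimate quoted above for the checked build.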
Old 2020-03-30, 03:17   #2015
Prime95

Quote:
Originally Posted by ewmayer View Post
I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.
Yes, that is the goal.

Quote:
For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?
In the latest code, I set -use DEBUG,CARRY32_LIMIT=0x70000000 to print any iterations where 32-bit carry is getting close to the limit. This is slow code, useful for analysis, not for production runs.
Old 2020-03-30, 19:34   #2016
ewmayer

Update on my Radeon VII sudden-onset slowdown yesterday: more weirdness. I haven't yet spotted anything of note in the dmesg logs, but I'm still getting familiar with which AMD-GPU-related messages are normal and which are not. Now to the weirdness.

My usual post-reboot procedure is:

1. Fire up gpuOwl job in each of two terminal windows, each job in a separate working dir;
2. Open 3rd window, fiddle settings to sclk=4 and fan=120 (or higher if interior temps warrant it);
3. Fire up LL/PRP job on the CPU;
4. Look at rocm-smi output to check GPU state.

Last night I again rebooted the system (just to cover all bases), fired up the first gpuOwl job, but then skipped to [4] above - all looked normal: wattage ~200, temp nearing 70C, SCLK and MCLK at their expected values, fan noise ramping up nicely. Thought "yay! I fixed it!" Fired up the second gpuOwl job - within seconds the fan noise starts dropping fast, and a check of rocm-smi shows the dreaded "the workers have gone on strike" numbers. Kill the second job, and things revert back to normal. No clue why the GPU is suddenly balking at running 2 jobs, but it being late I figured better to quit while ahead - set sclk=5 to help compensate for the throughput hit of running 1 job, and went to bed.

Even more weirdness: just fired up the 2nd job to see if the issue is reproducible, and now all seems back to normal.

I believe the technical term is "gremlins".

Last fiddled with by ewmayer on 2020-03-30 at 19:35
Old 2020-03-31, 00:07   #2017
ATH

From gpuowl.cl:

Quote:
OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing>
IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>
What are the possible values and ranges to test for these variables?


On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4

Last fiddled with by ATH on 2020-03-31 at 00:59
Old 2020-03-31, 03:56   #2018
Prime95

Quote:
Originally Posted by ATH View Post
From gpuowl.cl:

What are the possible values and ranges to test for these variables?


On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4
IN/OUT_WG=64,128,256,512
IN/OUT_SIZEX=4,8,16,32,64,128 (gpuowl will whine when the combination does not make sense)
IN/OUT_SPACING=4,8,16,32,64,128
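For reference, the OUT-side grid above can be enumerated mechanically; a throwaway sketch (the -use spelling follows the posts in this thread; the helper name is made up). gpuowl itself rejects combinations that make no sense, so not all of these will actually run:

```cpp
#include <cstdio>

// Enumerate the OUT_WG/OUT_SIZEX/OUT_SPACING grid listed above:
// 4 * 6 * 6 = 144 combinations (the IN-side grid is analogous,
// matching the "144 x2" count mentioned later in the thread).
int emitOutCombos(bool print) {
  const int wg[]      = {64, 128, 256, 512};
  const int sizex[]   = {4, 8, 16, 32, 64, 128};
  const int spacing[] = {4, 8, 16, 32, 64, 128};
  int n = 0;
  for (int w : wg)
    for (int x : sizex)
      for (int s : spacing) {
        if (print)
          printf("-use OUT_WG=%d,OUT_SIZEX=%d,OUT_SPACING=%d\n", w, x, s);
        ++n;
      }
  return n;
}
```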

You are the first nVidia user to test all these combinations. Alas, previously the Colab GPUs showed only minor differences across these settings, whereas nVidia consumer GPUs benefited much more from an optimal setting.

BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.
Old 2020-03-31, 17:10   #2019
ATH

Quote:
Originally Posted by Prime95 View Post
BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.
I compiled it 2 days ago; any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use, right?
git clone https://github.com/preda/gpuowl

Last fiddled with by ATH on 2020-03-31 at 17:15
Old 2020-03-31, 17:30   #2020
Prime95

Quote:
Originally Posted by ATH View Post
I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use right?
git clone https://github.com/preda/gpuowl
Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.

Yes, that is the correct source.
Old 2020-03-31, 19:56   #2021
ewmayer

Quote:
Originally Posted by Prime95 View Post
Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.
Just grabbed (v. 62a3025) and built. Switched one of my 2 runs to it to give it a spin; see this on start (PRP of p = 103937143 @ 5632K):

Expected maximum carry32: 47840000

Aside - before switching that run to the new version, both were getting ~1335 µs/iter (total 1498 iter/sec). With 1 run using the new version, that run is now at 1580 µs/iter and the other has sped up to 1168 µs/iter (total 1490 iter/sec). With both runs using the new version, both are at 1333 µs/iter (total 1500 iter/sec). Probably some weird rocm process-priority thing.

Aside #2: I've been doing near-daily price checks on new XFX Radeon VII cards on Amazon - they fluctuate interestingly. A couple of days ago, $580. Yesterday, back to the same $550 I paid for mine in February. Just now, $600.
Old 2020-04-01, 00:20   #2022
ATH

Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895 µs/iteration.

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:
Quote:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=128,IN_SIZEX=16,IN_SPACING=4
Many other combinations ran at 810-820 µs, many more at 820-850 µs, and a few rare bad ones at 980-990 µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings changed the speed from 809 µs to 814-815 µs, so not a big difference.

Last fiddled with by ATH on 2020-04-01 at 00:25
Old 2020-04-01, 01:15   #2023
preda
 

Quote:
Originally Posted by ATH View Post
Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:

Many other combinations ran at 810-820 µs, many more at 820-850 µs, and a few rare bad ones at 980-990 µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.
ORIG_X2 and INLINE_X2 no longer exist; setting them has no effect whatsoever.

This seems to suggest these changes to Nvidia defaults:
- handle T2_SHUFFLE like on AMD (i.e. default to NO_T2_SHUFFLE)
- handle CARRY like on AMD (i.e. default to CARRY32)

Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy? If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG, we may consider making it the default on Nvidia.

Could other Nvidia users speak up if these proposed Nvidia defaults have adverse performance effects for them (due to different hardware)?

Last fiddled with by preda on 2020-04-01 at 01:19
Old 2020-04-01, 03:39   #2024
kracker

Windows compilation:

Code:
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17   -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
                 from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
   33 |       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
      |                        ~^                  ~~~~~~~~~~~~
      |                         |                            |
      |                         char*                        const value_type* {aka const wchar_t*}
      |                        %hs
Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                                   ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                      ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                            ~~~~^~~~~~
make: *** [Makefile:30: Gpu.o] Error 1
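The final error is a Windows/LLP64 quirk: under MinGW, long is 32 bits, so 1L << 32 shifts by the full width of the type, which cascades into the bogus division-by-zero warning as well. One likely fix (a guess at the intent here, not necessarily how upstream resolved it) is to use a 64-bit literal, which is well-defined on every platform:

```cpp
#include <cmath>

// On LLP64 targets (Windows) long is 32 bits, so (1L << 32) is invalid.
// long long is at least 64 bits everywhere, so this shift is well-defined,
// and 2^32 converts to float exactly:
constexpr float roundScale = 1.0f / (1LL << 32);
// equivalently: constexpr float roundScale = 1.0f / 4294967296.0f;
```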