mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Reply
 
Thread Tools
Old 2020-04-09, 21:30   #2058
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,691 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Mihai/George, could you explain why residue shift apparently incurs such a heavy performance penalty in gpuOwl?
I don't believe there would be a performance penalty.
Prime95 is online now   Reply With Quote
Old 2020-04-09, 21:50   #2059
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

55F16 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Mihai/George, could you explain why residue shift apparently incurs such a heavy performance penalty in gpuOwl?

For LL with shift, I can maybe understand it - during the carry step of each iteration, one needs to precompute the bit offset of the -2 for the current shift value and then inject it into the corresponding residue word - not a lot of cycles needed, but perhaps in a massively-parallel GPU context, slowing whichever one of those smaller work units gets the shifted -2 causes the others to stall - just speculating here.

But in a PRP context, there is no per-iteration -2 subtrahend, we simply apply some initial shift value to the starting residue, then repeated-square-mod happily away, with the only shift-related expense being the per-iteration update of the shift value, shift = 2*shift (mod p), where the * and mod can both be replaced with low-latency operations, shift (or add), compute shift2 = shift - p, followed by cmov to select the proper one of shift and shift2.
I would also need to look into how the error check needs to be updated for non-zero offset. I don't think it would be a big cost, but probably not a zero-cost either. IMO not the highest priority thing to do ATM.
preda is offline   Reply With Quote
Old 2020-04-09, 21:53   #2060
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
Rep├║blica de California

2D9C16 Posts
Default

Quote:
Originally Posted by Prime95 View Post
I don't believe there would be a performance penalty.
Mihai in #2049:
Quote:
For the "offset", it's too much trouble to fit it in without adversely affecting PRP which is still the main focus; also "offset" brings too little benefit to be worth bothering with IMO.
I read "adversely affecting PRP" as implying a performance hit.

I have a special interest here - once I finish my current round of v20-related updates to Mlucas I intend to get up to speed on the programming model used for gpuowl, with a long-term goal of enhancing it to support the negacyclic DWT (on top of the Mersenne-style IBDWT) and right-angle-transform data layout needed to support Fermat-mod arithmetic. For my Fermat number testing to date I've used pairs of side-by-side runs, both with 0 shift but at different FFT lengths. The problem is, as we approach F33, the window of possible sizes for the smaller, slightly-less-than-power-of-2 FFT length of said run pairings rapidly shrinks. For F31 a smaller-FFT length of 120M = 15*8M is gonna really be pushing the accuracy limits of a doubles-based FFT. For F33 we'd need at a minimum 496M = 31*16M, but that prime 31 means a 31-DFT, and even the best-of-breed such algorithm is horribly inefficient. So I originally had in mind some highly-composite length < 512M, specifically 504M = 63*8M, but even though 63 = 3^2.7 is decently smooth, the result will likely be slower than the accompanying 512M run. But in the meantime I've worked out all the needed details to do residue-shifted Fermat-mod arithmetic - it's quite a bit more involved than Mersenne-mod, for reasons I'll detail soon in an upcoming post to the "Pepin tests of Fermat numbers beyond F24" thread - but now that I've worked out the mathematical details and have working proof-of-principle code, it's clear that performance-wise it should be no worse than Mersenne-mod with shift. So F33 - starting with a deep p-1 stage 1 (where it's crucial to obtain a correct residue, since absent a resulting factor one wants to distribute said residue to many stage-2 subinterval-runners) can use paired runs at 512M FFT, each with a different shift.

Last fiddled with by ewmayer on 2020-04-09 at 21:55
ewmayer is offline   Reply With Quote
Old 2020-04-09, 22:37   #2061
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

53·11 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Mihai in #2049:

I read "adversely affecting PRP" as implying a performance hit.

I have a special interest here - once I finish my current round of v20-related updates to Mlucas I intend to get up to speed on the programming model used for gpuowl, with a long-term goal of enhancing it to support the negacyclic DWT (on top of the Mersenne-style IBDWT) and right-angle-transform data layout needed to support Fermat-mod arithmetic. For my Fermat number testing to date I've used pairs of side-by-side runs, both with 0 shift but at different FFT lengths. The problem is, as we approach F33, the window of possible sizes for the smaller, slightly-less-than-power-of-2 FFT length of said run pairings rapidly shrinks. For F31 a smaller-FFT length of 120M = 15*8M is gonna really be pushing the accuracy limits of a doubles-based FFT. For F33 we'd need at a minimum 496M = 31*16M, but that prime 31 means a 31-DFT, and even the best-of-breed such algorithm is horribly inefficient. So I originally had in mind some highly-composite length < 512M, specifically 504M = 63*8M, but even though 63 = 3^2.7 is decently smooth, the result will likely be slower than the accompanying 512M run. But in the meantime I've worked out all the needed details to do residue-shifted Fermat-mod arithmetic - it's quite a bit more involved than Mersenne-mod, for reasons I'll detail soon in an upcoming post to the "Pepin tests of Fermat numbers beyond F24" thread - but now that I've worked out the mathematical details and have working proof-of-principle code, it's clear that performance-wise it should be no worse than Mersenne-mod with shift. So F33 - starting with a deep p-1 stage 1 (where it's crucial to obtain a correct residue, since absent a resulting factor one wants to distribute said residue to many stage-2 subinterval-runners) can use paired runs at 512M FFT, each with a different shift.
Nice work! and your contributions to gpuowl are more than welcome. Did you consider a GEC-style error check for Pepin instead of paired runs?
preda is offline   Reply With Quote
Old 2020-04-10, 16:28   #2062
kracker
 
kracker's Avatar
 
"Mr. Meeseeks"
Jan 2012
California, USA

32·241 Posts
Default

Tried to submit an LL result, got "Did not understand 1 lines."

{"exponent":"54907981", "worktype":"LL", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-252-gaf403e2"}, "timestamp":"2020-04-10 14:05:02 UTC", "user":"kracker", "computer":"core", "aid":"xxxxxxxxxx", "fft-length":3145728, "res64":"xxxxxxxxxxxxx", "offset":0}
kracker is offline   Reply With Quote
Old 2020-04-10, 16:59   #2063
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

17·349 Posts
Default gpuowl-v6.11-255-g81fa7c3 for Win 7 x64 or up

Latest commit build, build log, help output, etc.
Attached Files
File Type: txt build-log.txt (8.2 KB, 85 views)
File Type: 7z gpuowl-v6.11-255-g81fa7c3.7z (464.9 KB, 86 views)
kriesel is offline   Reply With Quote
Old 2020-04-10, 19:27   #2064
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
Rep├║blica de California

22·3·7·139 Posts
Default

Quote:
Originally Posted by preda View Post
Nice work! and your contributions to gpuowl are more than welcome. Did you consider a GEC-style error check for Pepin instead of paired runs?
The primality-test runs - whatever hardware they end up being done on - will of course use the GEC, but for this type of rare-but-historic computation, us all believing the GEC is foolproof will not suffice in terms of a research-quality announcement and attendant peer-reviewed paper - there must be at least 2 runs, which, if not done by independently developed programs, must at least provide reasonable assurance of independent-FFT-data. Having done F24, believe me, merely omitting a third pure-integer-code "drone" run (using interim residues provided by 2 cross-checked fast floating-FFT runs) is already a stretch, as far as more conservative parts of the computational number theory community are concerned.

One could object "but they accept GIMPS new-prime announcements, based on matching independent floating-FFT runs" -- true, and that establishes the minimum baseline for e.g. an F33 testing effort. Further, my ongoing Fermat number tests - currently finishing up run #2 of F30 @64M FFT, first run @60M finished late last year - all deposit interim every-10Miter checkpoint files, so knowing the format of same, anyone could do a parallel (in the sense of multiple runs, each covering a separate 10Miter subinterval) triple-check using whatever code they like. For F33 the resulting fileset, at ~1 GB per checkpoint and 858 such, will occupy slightly less than 1TB, so any such file sharing might have to be done using physical disk drives, depending on the state of storage technology at that timepoint.

Last fiddled with by ewmayer on 2020-04-10 at 19:29
ewmayer is offline   Reply With Quote
Old 2020-04-10, 19:43   #2065
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

2·1,601 Posts
Default

Quote:
Originally Posted by kracker View Post
Tried to submit an LL result, got "Did not understand 1 lines."

{"exponent":"54907981", "worktype":"LL", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-252-gaf403e2"}, "timestamp":"2020-04-10 14:05:02 UTC", "user":"kracker", "computer":"core", "aid":"xxxxxxxxxx", "fft-length":3145728, "res64":"xxxxxxxxxxxxx", "offset":0}
Yep, I just got the same message, so I guess George or Aaron needs to update the manual result script for this "new" gpuowl LL test. I guess it is different than back in gpuowl 0.6?
ATH is offline   Reply With Quote
Old 2020-04-10, 22:02   #2066
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

DCE16 Posts
Default

Quote:
Originally Posted by kracker View Post
Tried to submit an LL result, got "Did not understand 1 lines."
Quote:
Originally Posted by ATH View Post
Yep, I just got the same message, so I guess George or Aaron needs to update the manual result script for this "new" gpuowl LL test. I guess it is different than back in gpuowl 0.6?
Looks like Mihai changed the result format and forgot to tell me about it. (I do the manual results form, most everything else is George or Aaron).

I'm just waiting to hear back from Mihai regarding the change in format, I'll post back when the manual form will accept these results.
James Heinrich is online now   Reply With Quote
Old 2020-04-11, 00:36   #2067
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

53×11 Posts
Default

Quote:
Originally Posted by James Heinrich View Post
Looks like Mihai changed the result format and forgot to tell me about it. (I do the manual results form, most everything else is George or Aaron).

I'm just waiting to hear back from Mihai regarding the change in format, I'll post back when the manual form will accept these results.
Sorry for that. I need to remind myself what is the right format for LL JSON.
preda is offline   Reply With Quote
Old 2020-04-11, 00:45   #2068
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

17×349 Posts
Default

Quote:
Originally Posted by ATH View Post
I guess it is different than back in gpuowl 0.6?
Very different; V0.6 was before the switch to JSON.
https://www.mersenneforum.org/showpo...9&postcount=28
kriesel is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1680 2021-09-13 17:01
GPUOWL AMD Windows OpenCL issues xx005fs GpuOwl 0 2019-07-26 21:37
Testing an expression for primality 1260 Software 17 2015-08-28 01:35
Testing Mersenne cofactors for primality? CRGreathouse Computer Science & Computational Number Theory 18 2013-06-08 19:12
Primality-testing program with multiple types of moduli (PFGW-related) Unregistered Information & Answers 4 2006-10-04 22:38

All times are UTC. The time now is 05:05.


Wed Dec 8 05:05:12 UTC 2021 up 137 days, 23:34, 1 user, load averages: 1.82, 1.86, 1.68

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.