[QUOTE=ewmayer;542231]Mihai/George, could you explain why residue shift apparently incurs such a heavy performance penalty in gpuOwl?[/QUOTE]
I don't believe there would be a performance penalty. |
[QUOTE=ewmayer;542231]Mihai/George, could you explain why residue shift apparently incurs such a heavy performance penalty in gpuOwl?
For LL with shift, I can maybe understand it - during the carry step of each iteration, one needs to precompute the bit offset of the -2 for the current shift value and then inject it into the corresponding residue word - not a lot of cycles needed, but perhaps in a massively-parallel GPU context, slowing whichever one of those smaller work units gets the shifted -2 causes the others to stall - just speculating here. But in a PRP context, there is no per-iteration -2 subtrahend, we simply apply some initial shift value to the starting residue, then repeated-square-mod happily away, with the only shift-related expense being the per-iteration update of the shift value, shift = 2*shift (mod p), where the * and mod can both be replaced with low-latency operations, shift (or add), compute shift2 = shift - p, followed by cmov to select the proper one of shift and shift2.[/QUOTE] I would also need to look into how the error check needs to be updated for a non-zero offset. I don't think it would be a big cost, but probably not zero-cost either. IMO not the highest-priority thing to do ATM. |
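The low-latency shift-count update described in the quote above can be sketched as follows. This is a minimal Python illustration of the branchless 2*shift (mod p) sequence (the names are illustrative, not gpuowl's; the conditional expression stands in for the cmov):

```python
def update_shift(shift: int, p: int) -> int:
    """Per-iteration PRP shift update: shift -> 2*shift (mod p), branchless."""
    doubled = shift << 1   # the "shift (or add)": 2*shift
    reduced = doubled - p  # candidate shift2 = 2*shift - p
    # select standing in for the cmov: pick whichever lies in [0, p)
    return reduced if reduced >= 0 else doubled
```

Since shift is always in [0, p), 2*shift is in [0, 2p), so a single conditional subtraction of p suffices for the full reduction.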
[QUOTE=Prime95;542241]I don't believe there would be a performance penalty.[/QUOTE]
Mihai in #2049: [quote]For the "offset", it's too much trouble to fit it in without adversely affecting PRP which is still the main focus; also "offset" brings too little benefit to be worth bothering with IMO.[/quote] I read "adversely affecting PRP" as implying a performance hit.

I have a special interest here: once I finish my current round of v20-related updates to Mlucas, I intend to get up to speed on the programming model used for gpuowl, with the long-term goal of enhancing it to support the negacyclic DWT (on top of the Mersenne-style IBDWT) and the right-angle-transform data layout needed for Fermat-mod arithmetic.

For my Fermat number testing to date I've used pairs of side-by-side runs, both with 0 shift but at different FFT lengths. The problem is that as we approach F33, the window of possible sizes for the smaller, slightly-less-than-power-of-2 FFT length of such run pairings rapidly shrinks. For F31, a smaller-FFT length of 120M = 15*8M is gonna really be pushing the accuracy limits of a doubles-based FFT. For F33 we'd need at a minimum 496M = 31*16M, but that prime 31 means a 31-DFT, and even the best-of-breed such algorithm is horribly inefficient. So I originally had in mind some highly composite length < 512M, specifically 504M = 63*8M, but even though 63 = 3^2·7 is decently smooth, the result will likely be slower than the accompanying 512M run.

In the meantime, however, I've worked out all the needed details to do residue-shifted Fermat-mod arithmetic. It's quite a bit more involved than Mersenne-mod, for reasons I'll detail soon in an upcoming post to the "Pepin tests of Fermat numbers beyond F24" thread, but now that I've worked out the mathematical details and have working proof-of-principle code, it's clear that performance-wise it should be no worse than Mersenne-mod with shift.

So F33 - starting with a deep p-1 stage 1 (where it's crucial to obtain a correct residue, since absent a resulting factor one wants to distribute said residue to many stage-2 subinterval-runners) - can use paired runs at 512M FFT, each with a different shift. |
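The reason a shifted PRP run needs no per-iteration fixup is that squaring commutes with the shift modulo a Mersenne number: (x·2^s)² ≡ x²·2^(2s) (mod 2^p − 1), so the shift count simply doubles each iteration. A toy Python sketch of this identity, with illustrative names and a small exponent (this is my illustration of the principle, not any program's actual code):

```python
def mers_shift(x: int, s: int, p: int) -> int:
    """x * 2^s mod (2^p - 1): a circular left-rotation of x's p-bit residue."""
    m = (1 << p) - 1
    return (x << (s % p)) % m

# toy check: squaring the shifted residue equals shifting the plain
# square by 2s, modulo 2^p - 1
p, s, x = 31, 11, 123456789
m = (1 << p) - 1
lhs = mers_shift(x, s, p) ** 2 % m     # (x * 2^s)^2 mod (2^p - 1)
rhs = mers_shift(x * x % m, 2 * s, p)  # (x^2) * 2^(2s) mod (2^p - 1)
# lhs == rhs
```

This is exactly why, as the quoted post notes, the only per-iteration cost of a shifted PRP test is the cheap update of the shift counter itself.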
[QUOTE=ewmayer;542245]I read "adversely affecting PRP" as implying a performance hit. [...] So F33 [...] can use paired runs at 512M FFT, each with a different shift.[/QUOTE] Nice work! Your contributions to gpuowl are more than welcome.
Did you consider a GEC-style error check for Pepin instead of paired runs? |
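A GEC-style check applies to any pure repeated-squaring chain, and Pepin's test (repeated squaring of 3 mod F_m) is one. A toy Python sketch of the Gerbicz invariant, with illustrative names and a tiny block size (my illustration of the general technique, not gpuowl's implementation):

```python
def gerbicz_demo(N: int, x0: int = 3, L: int = 10, blocks: int = 5) -> bool:
    """Run `blocks` blocks of L squarings mod N, verifying the Gerbicz
    invariant d_new == d_old^(2^L) * x0 (mod N) at each block boundary."""
    x = x0
    d = x0  # running product of block-boundary residues
    for _ in range(blocks):
        d_prev = d
        for _ in range(L):   # L squarings per block
            x = x * x % N
        d = d * x % N        # fold in the new block-boundary residue
        # the check: d_prev^(2^L) * x0 must reproduce d
        if d != pow(d_prev, 1 << L, N) * x0 % N:
            return False     # a hardware/software error corrupted the chain
    return True
```

The invariant holds because d is a product of residues x_{tL}, and raising that product to 2^L advances every factor by one block, reproducing the next product up to the single factor x0.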
Tried to submit an LL result, got "Did not understand 1 lines."
{"exponent":"54907981", "worktype":"LL", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-252-gaf403e2"}, "timestamp":"2020-04-10 14:05:02 UTC", "user":"kracker", "computer":"core", "aid":"xxxxxxxxxx", "fft-length":3145728, "res64":"xxxxxxxxxxxxx", "offset":0} |
gpuowl-v6.11-255-g81fa7c3 for Win 7 x64 or up
Latest commit build, build log, help output, etc.
|
[QUOTE=preda;542252]Nice work! and your contributions to gpuowl are more than welcome. Did you consider a GEC-style error check for Pepin instead of paired runs?[/QUOTE]
The primality-test runs - whatever hardware they end up being done on - will of course use the GEC, but for this type of rare-but-historic computation, all of us believing the GEC is foolproof will not suffice for a research-quality announcement and the attendant peer-reviewed paper. There must be at least 2 runs which, if not done by independently developed programs, must at least provide reasonable assurance of independent FFT data.

Having done F24, believe me: merely omitting a third pure-integer-code "drone" run (using interim residues provided by 2 cross-checked fast floating-FFT runs) is already a stretch, as far as the more conservative parts of the computational number theory community are concerned. One could object, "but they accept GIMPS new-prime announcements, based on matching independent floating-FFT runs" - true, and that establishes the minimum baseline for e.g. an F33 testing effort.

Further, my ongoing Fermat number tests - currently finishing up run #2 of F30 @64M FFT, the first run @60M having finished late last year - all deposit interim every-10Miter checkpoint files, so knowing the format of same, anyone could do a parallel (in the sense of multiple runs, each covering a separate 10Miter subinterval) triple-check using whatever code they like. For F33 the resulting fileset, at ~1 GB per checkpoint and 858 such, will occupy slightly less than 1 TB, so any such file sharing might have to be done using physical disk drives, depending on the state of storage technology at that point. |
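The storage figures above follow directly from the size of an F33 residue and the 10Miter checkpoint spacing. A quick back-of-envelope check (arithmetic only; this says nothing about any program's actual checkpoint file format, which will carry some header overhead beyond the raw residue):

```python
# Pepin test of F33 = 2^(2^33) + 1 takes 2^33 - 1 modular squarings
iters = 2**33 - 1
ckpts = iters // 10_000_000    # one checkpoint every 10M iterations
res_bytes = 2**33 // 8         # an F33 residue is 2^33 bits = 1 GiB
total_gib = ckpts * res_bytes / 2**30
# ckpts == 858 and total_gib == 858.0: slightly under 1 TB, as stated
```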
[QUOTE=kracker;542292]Tried to submit an LL result, got "Did not understand 1 lines."
{"exponent":"54907981", "worktype":"LL", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-252-gaf403e2"}, "timestamp":"2020-04-10 14:05:02 UTC", "user":"kracker", "computer":"core", "aid":"xxxxxxxxxx", "fft-length":3145728, "res64":"xxxxxxxxxxxxx", "offset":0}[/QUOTE] Yep, I just got the same message, so I guess George or Aaron needs to update the manual result script for this "new" gpuowl LL test. I guess it is different than back in gpuowl 0.6? |
[QUOTE=kracker;542292]Tried to submit an LL result, got "Did not understand 1 lines."[/QUOTE][QUOTE=ATH;542304]Yep, I just got the same message, so I guess George or Aaron needs to update the manual result script for this "new" gpuowl LL test. I guess it is different than back in gpuowl 0.6?[/QUOTE]Looks like Mihai changed the result format and forgot to tell me about it. :smile: (I do the manual results form, most everything else is George or Aaron).
I'm just waiting to hear back from Mihai regarding the change in format; I'll post back when the manual form accepts these results. |
[QUOTE=James Heinrich;542315]Looks like Mihai changed the result format and forgot to tell me about it. [...][/QUOTE] Sorry about that. I need to remind myself what the right format for the LL JSON is. |
[QUOTE=ATH;542304]I guess it is different than back in gpuowl 0.6?[/QUOTE]Very different; V0.6 was before the switch to JSON.
[URL]https://www.mersenneforum.org/showpost.php?p=531029&postcount=28[/URL] |