[QUOTE=Xyzzy;306576]More "extra credit":
[CODE]Processing result: M56505451 has a factor: 86553876518403762963169
CPU credit is 323.9309 GHz-days.
Processing result: M56488651 has a factor: 35566445275259107720993
CPU credit is 129.5622 GHz-days.
Processing result: M56491177 has a factor: 23502006329787341695151
CPU credit is 89.0731 GHz-days.[/CODE][/QUOTE] :omg: |
[QUOTE=Xyzzy;306576]More "extra credit":[/QUOTE]
Yeah, you looked like you needed some credit, so that's why :razz:
------------------------------------------------------------
@prime95, related to barrett77: "Enter George Woltman, an excellent programmer and organizer..." (from the Encyclopedia Galactica :smile:, the [URL="http://primes.utm.edu/mersenne/"]History of Mersenne Primes[/URL] section) (we need a smiley that takes off its hat!) (edit: ok, this one will do as a substitute: :bow:)
When can we get mfaktc binaries for win64? (Eventually for both the "classic" version and the one for TF of small exponents; a 20% improvement there will look great. In fact, we would be happier with a "barrett67" and a 50% improvement :razz:) |
Wow, that is a performance boost :w00t:
|
[QUOTE=Prime95;306572]Oliver,
I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached.[/QUOTE]
Very nice! In mfakto, this new 77-bit kernel is even 5% faster than the 70-bit, and 10% faster than the 73-bit kernels I have, making it the fastest again for VLIW5. The newer architectures benefit less from this kernel. Too bad this trick only works for the 79-bit kernel with its fixed 2[SUP]81[/SUP]/f inverse. The other barretts with the 2[SUP]bit_max+1[/SUP]/f inverse cannot deal with the larger square in my kernels (the inverse does not seem to have enough significant digits). |
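The bit budget in the barrett77_mul32 proposal can be sanity-checked with plain integer arithmetic. This is a minimal sketch in Python, not mfaktc code: it assumes the only constraint is that the square of a must fit in the 160-bit output of square_96_160, and that a can be up to 3 bits wider than f (the 77 → 80 figure from the quote). The names `square_bits`, `fits`, `SLOP_BITS`, and `SQUARE_LIMIT` are made up for illustration.

```python
SLOP_BITS = 3       # assumption: a may be up to 3 bits wider than f (77 -> 80)
SQUARE_LIMIT = 160  # output width of square_96_160, per the quoted post


def square_bits(a_bits: int) -> int:
    """Bit length of the largest possible square of an a_bits-bit value."""
    largest = (1 << a_bits) - 1
    return (largest * largest).bit_length()


def fits(f_bits: int) -> bool:
    """True if the square of the un-reduced a still fits in SQUARE_LIMIT bits."""
    return square_bits(f_bits + SLOP_BITS) <= SQUARE_LIMIT


if __name__ == "__main__":
    for f_bits in (76, 77, 78, 79):
        print(f_bits, square_bits(f_bits + SLOP_BITS), fits(f_bits))
```

Under these assumptions a 77-bit f is the largest size for which the square still fits; one more bit in f pushes the square past 160 bits.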
[QUOTE=Bdot;306663]Very nice! In mfakto, this new 77-bit kernel is even 5% faster than the 70-bit, and 10% faster than the 73-bit kernels [/QUOTE]
Glad it helped and is passing your tests too. You can also create a 78-bit kernel that only adjusts the result when there is a multiplication by 2. |
Hi George,
[QUOTE=Prime95;306572]Oliver, I propose creating a barrett77_mul32. This is the same as barrett79_mul32 but with the mod_simple_96 moved out of the loop. As long as f does not exceed 77 bits, a will not exceed 80 bits (above 80 bits and square_96_160 will fail). I tested this out and it passes the self tests up through 77 bits. Raw speed went from 205M/sec to 250M/sec. Crude source is attached.[/QUOTE]
Cool, I'll test this (again!). Some time ago I tried something similar, but it failed somehow. Did you run the tests with CHECKS_MODBASECASE (src/params.h) enabled?
@others: please be careful, there are some other changes and testing needed before this is safe for daily usage. With this modification alone it [B]will[/B] choose this kernel for TF up to 2[SUP]79[/SUP] and it [B]will[/B] fail there.
I guess I'll reschedule my release plan for 0.19 and add this.
Oliver |
[QUOTE=TheJudger;306707]I guess I'll reschedule my release plan for 0.19 and add this.[/QUOTE]
We fully agree with this! Eagerly waiting! |
[QUOTE=Bdot;306663]
Too bad this trick only works for the 79-bit kernel with it's fixed 2[SUP]81[/SUP]/f inverse. The other barretts with the 2[SUP]bit_max+1[/SUP]/f inverse cannot deal with the larger square in my kernels (the inverse does not seem to have enough significant digits).[/QUOTE] I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. That is, a 90-bit factor will generate a 90-bit remainder + 3 bits because we're pretty sloppy calculating the remainder. When we square the 93-bit result we get a 186-bit value. We then apply 1/f to get a 96-bit quotient - which just fits in our 3 registers. |
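The 90 → 93 → 186 → 96 chain above can be checked with big integers. A rough sketch in Python, not kernel code: it reads "90-bit remainder + 3 bits" as the assumption that the sloppy remainder a satisfies a < 8·f, and it ignores the bit-shifting the real kernel performs. `worst_case_widths` and `SLOP_BITS` are hypothetical names.

```python
SLOP_BITS = 3  # assumption: the sloppy remainder a satisfies a < 2**SLOP_BITS * f


def worst_case_widths(f_bits: int) -> tuple[int, int, int]:
    """Bit widths of a, a*a and floor(a*a / f) for the largest f_bits-bit f."""
    f = (1 << f_bits) - 1         # largest factor candidate of this size
    a = (f << SLOP_BITS) - 1      # largest remainder allowed by the assumption
    square = a * a
    quotient = square // f
    return a.bit_length(), square.bit_length(), quotient.bit_length()


if __name__ == "__main__":
    print(worst_case_widths(90))  # widths of remainder, square, quotient
    print(worst_case_widths(91))
```

With these assumptions a 90-bit f gives a 93-bit remainder, a 186-bit square, and a quotient that still fits in three 32-bit words; at 91 bits the quotient needs a 97th bit.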
[QUOTE=Prime95;306723]I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. That is, a 90-bit factor will generate a 90-bit remainder + 3 bits because we're pretty sloppy calculating the remainder. When we square the 93-bit result we get a 186-bit value. We then apply 1/f to get a 96-bit quotient - which just fits in our 3 registers.[/QUOTE]
I tried (with my 75-bit kernel[SUP]*[/SUP], and a 68-bit factor), and the 1/f step left a remainder that increased with each loop until I had an overflow. I'll check the details later. [SUP]*[/SUP] 5 words with 15 bits each, to avoid the expensive 32-bit multiplications and use mul24 instead. |
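To illustrate the "5 words with 15 bits each" layout mentioned above: every limb-by-limb product is at most 30 bits wide, so it fits a 24-bit multiply with a 32-bit result. The following is only a sketch in Python of that representation with a schoolbook multiply; it is not mfakto's kernel, and all names here are invented for the example.

```python
LIMB_BITS = 15
LIMBS = 5                      # 5 * 15 = 75 bits
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x: int) -> list[int]:
    """Split a value below 2**75 into 5 limbs, least significant first."""
    assert 0 <= x < 1 << (LIMB_BITS * LIMBS)
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(LIMBS)]


def from_limbs(limbs: list[int]) -> int:
    """Recombine limbs (least significant first) into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_limbs(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook product of two 5-limb values, returned as 10 limbs."""
    cols = [0] * (2 * LIMBS)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            p = ai * bj
            assert p < 1 << 30          # each partial product fits in 30 bits
            cols[i + j] += p
    out, carry = [], 0
    for c in cols:                      # propagate carries between limbs
        c += carry
        out.append(c & MASK)
        carry = c >> LIMB_BITS
    assert carry == 0
    return out
```

For example, `from_limbs(mul_limbs(to_limbs(x), to_limbs(y)))` equals `x * y` for any x, y below 2[SUP]75[/SUP]. Note that a column sum of up to five such products can exceed 32 bits, so a real kernel has to handle carries earlier than this sketch does.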
[QUOTE=TheJudger;304737]Some data from tf_barrett96.cu: mod_simple_96():
[CODE]
qi  = 0
q   = 00000007 3C3F1F[COLOR="Red"]20[/COLOR] C454D397
nn  = 00000000 00000000 00000000
res = 00000007 3C3F1F[COLOR="Red"]1F[/COLOR] C454D397
[/CODE]
res = q - nn; So for now it looks like CUDA 5.0.7 fails when somebody uses sub with carry when the subtrahend is 0. So for now it looks like a bug in CUDA 5.0.7. Oliver[/QUOTE] [QUOTE=TheJudger;305015]Nvidia confirmed the bug so I would say: not my fault/problem! :smile: Oliver[/QUOTE]
btw.: Nvidia told me they have fixed the bug with a driver update. Unfortunately this driver is not yet available for me. |
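For reference, this is what a three-word subtract-with-borrow should produce for the dump above. It is a plain Python emulation, not the CUDA code from tf_barrett96.cu, and `sub_96` is a hypothetical helper; words are listed most significant first, as in the dump.

```python
WORD_MASK = 0xFFFFFFFF


def sub_96(q: tuple[int, int, int], nn: tuple[int, int, int]) -> tuple[int, ...]:
    """Compute q - nn over three 32-bit words with a borrow chain."""
    result = []
    borrow = 0
    # Walk from the least significant word to the most significant one.
    for q_word, nn_word in zip(reversed(q), reversed(nn)):
        diff = q_word - nn_word - borrow
        borrow = 1 if diff < 0 else 0
        result.append(diff & WORD_MASK)
    return tuple(reversed(result))


if __name__ == "__main__":
    q = (0x00000007, 0x3C3F1F20, 0xC454D397)
    nn = (0x00000000, 0x00000000, 0x00000000)
    print(" ".join(f"{word:08X}" for word in sub_96(q, nn)))
```

Subtracting zero must return q unchanged. The dumped res is one lower in the middle word, which is consistent with a spurious borrow out of the low word, although that reading is only an inference from the dump.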
[QUOTE=Prime95;306723]I haven't tried it, but this should also work for the barrett 96-bit kernel for factors up to 90 bits. [/QUOTE]
I forgot about all the nasty bit-shifting that kernel performs. It may not be possible to retrieve a 96-bit quotient -- needs further research. |