#1783 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
Quote:
Thanks for the bit-shift optimization timings. Very interesting. Do you have any ideas as to where the bottlenecks are? Please keep us informed as to other optimizations you try. The info may be useful for other CUDA projects. |
#1784 |
|
"Oliver"
Mar 2005
Germany
2127₈ Posts |
Yes, for many GPU computing applications Kepler is (another) step backwards.

I've no clue what the bottlenecks are. The barrett79 kernel uses shifts only for the initial setup (precomputing the (scaled) inverse of the factor candidate); the main loop is without shifts! So I'm really surprised how much worse the replacement of shiftright with multiply (high word) is. The lower number of registers per core on Kepler is not an issue for mfaktc, half of them would be enough! Shared memory, on-chip cache and off-chip memory are barely used by mfaktc. The performance of mfaktc primarily depends on 32-bit integer throughput.

-----

Back to the multiword shiftleft example:

Code:
nn.d4 = __umad32(nn.d4, 8388608, (nn.d3 >> 9));

Code:
nn.d4 = __umadhi32(nn.d3, 8388608, (nn.d4 << 23));

-----

A small improvement for 0.19-pre2: 381.94M/s!

Oliver

Last fiddled with by TheJudger on 2012-05-26 at 16:51
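For readers following the multiword shiftleft example: both quoted lines should produce the same upper word as a plain shift-based expression, because 8388608 is 2^23. The following is a minimal host-side sketch in plain C, not GPU code. It assumes that `__umad32(a, b, c)` yields the low 32 bits of `a*b + c` and that `__umadhi32(a, b, c)` yields the high 32 bits of `a*b`, plus `c`. The helper names `umad32`, `umadhi32`, `shl23_*` and `check_shift_variants` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-ins for the GPU helpers (assumed semantics, see above). */
static uint32_t umad32(uint32_t a, uint32_t b, uint32_t c) {
    return a * b + c;                               /* wraps modulo 2^32 */
}

static uint32_t umadhi32(uint32_t a, uint32_t b, uint32_t c) {
    return (uint32_t)(((uint64_t)a * b) >> 32) + c; /* high word of a*b, plus c */
}

/* Reference: upper word of a two-word value shifted left by 23 bits. */
static uint32_t shl23_ref(uint32_t d4, uint32_t d3) {
    return (d4 << 23) + (d3 >> 9);
}

/* Variant 1: the left shift of d4 is replaced by a multiply (low word). */
static uint32_t shl23_mad_lo(uint32_t d4, uint32_t d3) {
    return umad32(d4, 8388608u, d3 >> 9);
}

/* Variant 2: the right shift of d3 is replaced by a multiply (high word). */
static uint32_t shl23_mad_hi(uint32_t d4, uint32_t d3) {
    return umadhi32(d3, 8388608u, d4 << 23);
}

/* Asserts that all three variants agree on a pseudo-random sample. */
void check_shift_variants(void) {
    uint32_t x = 2463534242u;                       /* xorshift32 seed */
    for (int i = 0; i < 100000; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        uint32_t d4 = x;
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        uint32_t d3 = x;
        uint32_t ref = shl23_ref(d4, d3);
        assert(shl23_mad_lo(d4, d3) == ref);
        assert(shl23_mad_hi(d4, d3) == ref);
    }
}
```

Calling `check_shift_variants()` from a test program exercises the equivalence. It says nothing about speed, which is the open question in the post.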
#1785 |
|
Mar 2003
Melbourne
1003₈ Posts |
Do the above limitations apply to both Keplers (mini and the recently announced big Kepler), mini Kepler being the GTX 680 and big Kepler being the announced K20, to be released at the end of the year? Or is it too early to comment on big Kepler?

-- Craig

Last fiddled with by nucleon on 2012-05-26 at 18:10. Reason: grammar
#1786 |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Hi Craig,
I've no clue... and I don't trust numbers written on paper. So give me a Kepler and I'll tell you. But for now I need to understand how Kepler light works. Mini Kepler and big Kepler... for me it is Kepler light and Kepler (just as Fermi (CC 2.0) and Fermi light (CC 2.1)).

Oliver
#1787 |
|
Dec 2009
Peine, Germany
14B₁₆ Posts |
Just curious: I've run the following line:

Code:
Factor=N/A,2373583,61,62

My code gives me a bit length of 62.02 so a bit higher than demanded... No problem but is this expected / wanted?

Code:
mfaktc v0.18 (64bit built)
got assignment: exp=2373583 bit_min=61 bit_max=62
Starting trial factoring M2373583 from 2^61 to 2^62
k_min = 485730431340
k_max = 971460871270
[k_fac = 984834546993]
Using GPU kernel "75bit_mul32"

Might k_max be a soft border because of the classes system?
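The k_min and k_max values in the log above can be reproduced with plain integer arithmetic. This is a sketch under two assumptions: factor candidates have the form q = 2kp + 1, and k_min is rounded down to a multiple of the class count. The value 4620 for `NUM_CLASSES` is inferred from the printed k_min, and the function names are hypothetical, not taken from the mfaktc source.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLASSES 4620u   /* assumption: inferred from the printed k_min */

/* floor(2^bits / (2p)), written as 2^(bits-1) / p to stay within 64 bits. */
static uint64_t k_for_bits(uint64_t p, unsigned bits) {
    return ((uint64_t)1 << (bits - 1)) / p;
}

/* Lower bound, rounded down to a class boundary. */
static uint64_t k_min_for(uint64_t p, unsigned bit_min) {
    uint64_t k = k_for_bits(p, bit_min);
    return k - k % NUM_CLASSES;
}

/* Upper bound as printed, with no rounding applied. */
static uint64_t k_max_for(uint64_t p, unsigned bit_max) {
    return k_for_bits(p, bit_max);
}

/* Factor candidate q = 2kp + 1 (fits in 64 bits for the values used here). */
static uint64_t candidate(uint64_t p, uint64_t k) {
    return 2 * k * p + 1;
}

/* Checks the numbers from the post: exp=2373583, bit_min=61, bit_max=62. */
void check_posted_range(void) {
    const uint64_t p = 2373583u;
    const uint64_t k_fac = 984834546993ull;
    assert(k_min_for(p, 61) == 485730431340ull);
    assert(k_max_for(p, 62) == 971460871270ull);
    assert(k_fac > k_max_for(p, 62));                    /* beyond the printed k_max */
    assert(candidate(p, k_fac) > ((uint64_t)1 << 62));   /* q is above 2^62 */
}
```

With these definitions k_fac lies above the printed k_max and the corresponding q is slightly larger than 2^62, which matches the reported bit length of about 62.02.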
#1788 | |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
7221₁₀ Posts |
Yes.
Quote:
#1789 |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Will somebody miss the debug option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc. If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code...

Oliver
#1790 |
|
Mar 2011
Germany
3×31 Posts |
Hi,
I took the source of mfaktc 0.18 and changed some parts to handle base 10 repunits. I added a new code path (64-bit kernel) and removed some other parts. Here is an overview of the changes:

This version gave our search for the next repunit (probable) prime a massive boost. The bottleneck is still the PRP test.

Danilo
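For context on what trial factoring a base 10 repunit involves, here is a small CPU-side sketch, unrelated to the actual kernel code in the modified mfaktc. It relies on the fact that for a prime p > 3, prime factors q of R_p = (10^p - 1)/9 have the form 2kp + 1, and that such a q (not divisible by 3) divides R_p exactly when 10^p mod q = 1. The function names are hypothetical, and the simple `powmod` only handles moduli below 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* base^exp mod m by square-and-multiply.
   m must be below 2^32 so the 64-bit products cannot overflow. */
static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

/* Returns 1 if q = 2kp + 1 divides the repunit R_p = (10^p - 1)/9. */
static int divides_repunit(uint64_t p, uint64_t k) {
    uint64_t q = 2 * k * p + 1;
    return q % 3 != 0 && powmod(10, p, q) == 1;
}

/* R_7 = 1111111 = 239 * 4649, with 239 = 2*17*7 + 1 and 4649 = 2*332*7 + 1. */
void check_repunit_example(void) {
    assert(divides_repunit(7, 17));
    assert(divides_repunit(7, 332));
    assert(!divides_repunit(7, 2));   /* q = 29 is not a factor */
}
```

A real sieve would skip composite candidates and use wider arithmetic; this only shows the divisibility test for a single k.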
#1791 |
|
Mar 2011
Germany
3×31 Posts |
Due to the size restriction of the forum I could not put the executable in the source, so here it is:
#1792 | |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Hi Danilo,
I'll take a look at your modifications later.

Quote:

Oliver
#1793 | |
|
Nov 2010
Germany
3·199 Posts |
Quote: