[QUOTE=TheJudger;300278]
[B]Raw[/B] GPU speed for TF M66362159 69 70 mfaktc 0.19-pre1: 380.74M/s (my stock GTX 470 does ~335M/s) mfaktc 0.19-pre2: 380.92M/s [/QUOTE] Disappointing. I had hoped that Kepler would be better on mfaktc as it doesn't do much slow double-precision FP. Thanks for the bit-shift optimization timings. Very interesting. Do you have any ideas as to where the bottlenecks are? Please keep us informed as to other optimizations you try. The info may be useful for other CUDA projects. |
Yes, for many GPU computing applications Kepler is (another) step backwards. :sad:
I've no clue what the bottlenecks are. The barrett79 kernel uses shifts only for the initial setup (precomputing the (scaled) inverse of the factor candidate); the main loop has no shifts at all! So I'm really surprised how much worse the replacement of shiftright with multiply (high word) is. The lower number of registers per core on Kepler is not an issue for mfaktc; half of them would be enough! Shared memory, on-chip cache and off-chip memory are barely used by mfaktc. The performance of mfaktc primarily depends on 32-bit integer throughput.
-----
Back to the multiword shiftleft example:
[CODE] nn.d4 = __umad32(nn.d4, 8388608, (nn.d3 >> 9)); [/CODE]
This is a very, very small improvement over shiftleft + shiftright + add!
[CODE] nn.d4 = __umadhi32(nn.d3, 8388608, (nn.d4 << 23)); [/CODE]
And this is worse than shiftleft + shiftright + add!
-----
A small improvement for 0.19-pre2: 381.94M/s!
Oliver |
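For readers following the multiword shiftleft example: multiplying by 8388608 (2^23) and keeping the low word acts like a left shift by 23, while keeping the high word acts like a right shift by 9. The sketch below emulates this in Python on 32-bit words. It assumes that `__umad32(a, b, c)` returns the low 32 bits of `a * b + c` and that `__umadhi32(a, b, c)` returns the high 32 bits of `a * b` plus `c`; the post does not spell out these semantics, and the Python helper names are hypothetical stand-ins, not mfaktc code.

```python
MASK32 = 0xFFFFFFFF  # keep results in one 32-bit word, like a GPU register


def umad32(a, b, c):
    """Assumed semantics: low 32 bits of a * b + c."""
    return (a * b + c) & MASK32


def umadhi32(a, b, c):
    """Assumed semantics: high 32 bits of a * b, plus c (wrapped to 32 bits)."""
    return (((a * b) >> 32) + c) & MASK32


def shift_plain(d4, d3):
    """shiftleft + shiftright + add: bits 32..63 of the pair (d4:d3) << 23."""
    return (((d4 << 23) & MASK32) + (d3 >> 9)) & MASK32


def shift_mad_lo(d4, d3):
    """First variant from the post: the left shift becomes a multiply-add."""
    return umad32(d4, 8388608, d3 >> 9)


def shift_mad_hi(d4, d3):
    """Second variant from the post: the right shift becomes a multiply-high."""
    return umadhi32(d3, 8388608, (d4 << 23) & MASK32)


if __name__ == "__main__":
    for d4, d3 in [(0x12345678, 0x9ABCDEF0), (MASK32, MASK32), (1, 0)]:
        results = {shift_plain(d4, d3), shift_mad_lo(d4, d3), shift_mad_hi(d4, d3)}
        print(hex(d4), hex(d3), [hex(r) for r in results])
```

All three variants compute the same word, so any speed difference between them comes from instruction throughput on the GPU, not from the arithmetic.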
Do the above limitations apply to both Keplers (mini and the recently announced big Kepler)?
Mini Kepler being the GTX 680, and big Kepler being the announced K20, to be released at the end of the year. Or is it too early to comment on big Kepler? -- Craig |
Hi Craig,
I've no clue... and I don't trust numbers written on paper. So give me a Kepler and I'll tell you. But for now I need to understand how Kepler light works. Mini Kepler and big Kepler... for me it is Kepler light and Kepler (just like Fermi (CC 2.0) and Fermi light (CC 2.1)). Oliver |
k?
Just curious: I've run the following line:
[CODE]Factor=N/A,2373583,61,62[/CODE]
and found a factor: 4675173077110571839 (prime) [B][k = 984834546993][/B]
My code gives me a bit length of 62.02, so slightly above the requested range. No problem, but is this expected / wanted?
[CODE]mfaktc v0.18 (64bit built)
got assignment: exp=2373583 bit_min=61 bit_max=62
Starting trial factoring M2373583 from 2^61 to 2^62
k_min = 485730431340
k_max = 971460871270
[B][k_fac = 984834546993][/B]
Using GPU kernel "75bit_mul32"
[/CODE]
I haven't looked at the mfaktc source... Might k_max be a soft border because of the classes system? |
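The numbers in this post can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the usual form of Mersenne factors, f = 2kp + 1; the function names are hypothetical and nothing here is taken from the mfaktc source.

```python
import math

P = 2373583                # exponent from the worktodo line
F = 4675173077110571839    # factor reported in the post
K_MAX = 971460871270       # k_max printed by mfaktc for 2^61..2^62


def k_of_factor(f, p):
    """Return k such that f = 2*k*p + 1, or raise if f lacks that form."""
    k, rem = divmod(f - 1, 2 * p)
    if rem != 0:
        raise ValueError("f is not of the form 2*k*p + 1")
    return k


def divides_mersenne(f, p):
    """True if f divides 2^p - 1, checked by modular exponentiation."""
    return pow(2, p, f) == 1


if __name__ == "__main__":
    k = k_of_factor(F, P)
    print("k =", k, "beyond k_max:", k > K_MAX)
    print("size in bits: %.2f" % math.log2(F))
    print("divides M%d: %s" % (P, divides_mersenne(F, P)))
```

With the values from the post, k works out to 984834546993, which lies above the printed k_max, consistent with a factor slightly larger than 2^62.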
[QUOTE=Brain;300852]
Might k_max be a soft border because of the classes system?[/QUOTE]
Yes.
[quote=README.txt]
##################################################################
# 5.1 Stuff that looks like an issue but actually isn't an issue #
##################################################################

- mfaktc runs slower on small ranges. Usually it doesn't make much sense to run mfaktc with an upper limit smaller than 2^64. It is designed for trial factoring above 2^64 up to 2^95 (factor sizes). ==> mfaktc needs "long runs"!
- mfaktc can find factors outside the given range. E.g. './mfaktc.exe -tf 66362159 40 41' has a high chance to report 124246422648815633 as a factor. Actually this is a factor of M66362159 but its size is between 2^56 and 2^57! Of course './mfaktc.exe -tf 66362159 56 57' will find this factor, too. The reason for this behaviour is that mfaktc works on huge factor blocks. This is controlled by GridSize in mfaktc.ini. The default value is 3, which means that mfaktc runs up to 1048576 factor candidates at once (per class). So the last block of each class is filled up with factor candidates above the upper limit. While this is a huge overhead for small ranges, it's safe to ignore it on bigger ranges. If a class contains 100 blocks the overhead is on average 0.5%. When a class needs 1000 blocks the overhead is 0.05%...[/quote] |
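The overhead figures at the end of the README excerpt follow from a simple estimate: if only the last block of a class overshoots the upper limit, and the overshoot is half a block on average, the relative overhead is 0.5 divided by the number of blocks. A minimal sketch of that arithmetic follows; the half-a-block model is an assumption read from the README text, and the function name is hypothetical.

```python
def average_overhead(blocks_per_class):
    """Expected fraction of wasted work when the last block of a class
    overshoots the upper limit by half a block on average (assumed model)."""
    if blocks_per_class < 1:
        raise ValueError("a class needs at least one block")
    return 0.5 / blocks_per_class


if __name__ == "__main__":
    for blocks in (1, 10, 100, 1000):
        print("%5d blocks -> %.2f%% overhead" % (blocks, 100 * average_overhead(blocks)))
```

Under this model a range small enough to fit in a single block per class wastes about half of the work, which is why short runs look slow.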
Will somebody miss the [B]debug[/B] option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc. If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code... Oliver |
mfaktc for base 10 repunits
1 Attachment(s)
Hi,
I took the source of mfaktc 0.18 and changed some parts to handle base 10 repunits. I added a new code path (64-bit kernel) and removed some other parts. Here is an overview of the changes:
[LIST]
[*]rewrote lots of code to handle repunits
[*]removed the barrett kernel as it is not needed for repunits
[*]removed the 72-bit kernel -> no support for older GPUs
[*]added a 64-bit kernel
[*]added a selftest for repunits
[*]MORE_CLASSES is switched off by default (faster for smaller exponents)
[*]moved the optional multiply into the modulus calculation (faster because the multiplication is done over fewer registers)
[*]reduced the lower limit for exponents to 1000 (tested, but not guaranteed to work)
[/LIST]
Some notes about running mfaktc-repunit:
[LIST=1]
[*]The format of worktodo files is the same (the exponents are now for base 10).
[*]On a GeForce GTX 460, one instance of mfaktc (using the 64-bit kernel) uses only a bit more than 50% of the GPU resources, so running two instances saturates the GPU, with each running nearly as fast as a single instance alone.
[/LIST]
@Oliver: I am not sure if this could be merged with the original mfaktc; you have to decide. This version gave our search for the next repunit (probable) prime a massive boost. Still, the bottleneck is the PRP test. Danilo |
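For context, the divisibility test behind trial factoring of base 10 repunits can be sketched in a few lines. This is an illustration only, not code from mfaktc-repunit: it assumes the repunit R_p = (10^p - 1)/9 and tests a candidate f with one modular exponentiation; the function name is hypothetical.

```python
def divides_repunit(f, p):
    """True if f divides the base 10 repunit R_p = (10**p - 1) // 9.

    f divides R_p exactly when 9*f divides 10**p - 1, so a single
    modular exponentiation modulo 9*f is enough.
    """
    return pow(10, p, 9 * f) == 1


if __name__ == "__main__":
    # R_5 = 11111 = 41 * 271; both factors have the form 2*k*5 + 1.
    for f in (41, 271, 43):
        print(f, divides_repunit(f, 5))
```

The modulus 9*f keeps the check correct even for candidates divisible by 3; a GPU kernel would more likely work modulo f directly and handle that case separately.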
Win64 executable for mfaktc-repunit
1 Attachment(s)
Due to the size restriction of the forum I could not put the executable in the source, so here it is:
|
Hi Danilo,
I'll take a look at your modifications later. [QUOTE=MrRepunit;301522]Due to the size restriction of the forum I could not put the executable in the source, so here it is:[/QUOTE] I guess I'll upload it to [url]www.mersenneforum.org/mfaktc/[/url] later, including Win32 and Win64 builds and CUDA DLLs. Oliver |
[QUOTE=TheJudger;301332]Will somebody miss the [B]debug[/B] option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc. If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code... Oliver[/QUOTE] I've used it only at the very beginning of my OpenCL port. Go ahead and kick it. |