Old 2012-05-26, 15:27   #1783
Prime95

Quote:
Originally Posted by TheJudger View Post
Raw GPU speed for TF M66362159 69 70
mfaktc 0.19-pre1: 380.74M/s (my stock GTX 470 does ~335M/s)
mfaktc 0.19-pre2: 380.92M/s
Disappointing. I had hoped that Kepler would be better on mfaktc as it doesn't do much slow double-precision FP.

Thanks for the bit-shift optimization timings. Very interesting. Do you have any ideas as to where the bottlenecks are? Please keep us informed as to other optimizations you try. The info may be useful for other CUDA projects.
Old 2012-05-26, 16:45   #1784
TheJudger
 

Yes, for many GPU computing applications Kepler is (another) step backwards.

I've no clue what the bottlenecks are. The barrett79 kernel uses shifts only for the initial setup (precomputing the (scaled) inverse of the factor candidate); the main loop is without shifts! So I'm really surprised how much worse replacing shiftright with multiply (high word) is.
The lower number of registers per core on Kepler is not an issue for mfaktc; half of them would be enough! Shared memory, on-chip cache and off-chip memory are barely used by mfaktc. The performance of mfaktc depends primarily on 32-bit integer throughput.
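
For readers unfamiliar with the idea of a precomputed (scaled) inverse, here is a minimal textbook Barrett reduction in Python. It is only a sketch of the general technique (compute the scaled inverse once per factor candidate, then replace every division in the main loop by a multiply and a small correction). It is not the barrett79 kernel; the function names and the scaling by 4^k are assumptions chosen for illustration.

```python
def barrett_setup(n):
    """One-time setup: k = bit length of n, mu = floor(4^k / n) (the scaled inverse)."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n
    return k, mu


def barrett_reduce(x, n, k, mu):
    """Return x mod n for 0 <= x < n*n, using mu instead of a division."""
    # Estimate the quotient; the estimate is never larger than floor(x / n).
    q = (x * mu) >> (2 * k)
    r = x - q * n
    # The estimate can be slightly too small, so correct by subtraction.
    while r >= n:
        r -= n
    return r


def powmod_2(p, n):
    """Left-to-right square-and-multiply computing 2^p mod n with Barrett reduction."""
    k, mu = barrett_setup(n)
    result = 1
    for bit in bin(p)[2:]:
        result = barrett_reduce(result * result, n, k, mu)
        if bit == "1":
            result = barrett_reduce(result * 2, n, k, mu)
    return result
```

A candidate q is a factor of M_p exactly when `powmod_2(p, q) == 1`; the result can be cross-checked against Python's built-in `pow(2, p, q)`.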

-----

Back to the multiword shiftleft example:
Code:
nn.d4 = __umad32(nn.d4, 8388608, (nn.d3 >> 9));
This is a very, very small improvement over shiftleft + shiftright + add!

Code:
nn.d4 = __umadhi32(nn.d3, 8388608, (nn.d4 << 23));
And this is worse than shiftleft + shiftright + add!
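
Why both one-liners compute the same thing as shiftleft + shiftright + add: 8388608 is 2^23, so multiplying by it is a left shift by 23, and the high word of a 32x32-bit product with 2^23 is a right shift by 9. The Python sketch below emulates this on 32-bit words. The semantics given to `umad32` and `umadhi32` here (low or high 32 bits of the product, plus the addend) are an assumption inferred from the names, not taken from the mfaktc source.

```python
MASK32 = 0xFFFFFFFF


def umad32(a, b, c):
    """Assumed semantics: low 32 bits of a*b + c."""
    return (a * b + c) & MASK32


def umadhi32(a, b, c):
    """Assumed semantics: high 32 bits of the 64-bit product a*b, plus c, truncated."""
    return (((a * b) >> 32) + c) & MASK32


def shift_variant(d4, d3):
    """Reference: shiftleft + shiftright + add on 32-bit words."""
    return (((d4 << 23) & MASK32) + (d3 >> 9)) & MASK32


def umad_variant(d4, d3):
    """First one-liner: multiply-add replaces the left shift."""
    return umad32(d4, 8388608, d3 >> 9)


def umadhi_variant(d4, d3):
    """Second one-liner: multiply-high replaces the right shift."""
    return umadhi32(d3, 8388608, (d4 << 23) & MASK32)
```

All three variants agree for any pair of 32-bit words, so the timing differences reported above come purely from instruction throughput, not from a change in the arithmetic.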

-----

A small improvement for 0.19-pre2: 381.94M/s!

Oliver

Last fiddled with by TheJudger on 2012-05-26 at 16:51
Old 2012-05-26, 18:09   #1785
nucleon
 

Do the above limitations apply to both Keplers (mini and the recently announced big Kepler)?

Mini Kepler being GTX680, and Big Kepler being the announced K20 to be released at the end of the year?

Or is it too early to comment on big Kepler?

-- Craig

Last fiddled with by nucleon on 2012-05-26 at 18:10 Reason: grammar
Old 2012-05-26, 18:19   #1786
TheJudger
 

Hi Craig,

I've no clue... and I don't trust numbers written on paper. So give me a Kepler and I'll tell you. But for now I need to understand how Kepler light works.

mini Kepler and big Kepler... for me it is Kepler light and Kepler (just as Fermi (CC 2.0) and Fermi light (CC 2.1)).

Oliver
Old 2012-05-31, 17:46   #1787
Brain
 
k?

Just curious: I've run the following line:
Code:
 Factor=N/A,2373583,61,62
and found a factor: 4675173077110571839 (prime) [k = 984834546993]
My code gives me a bit length of 62.02, so a bit higher than requested... No problem, but is this expected / wanted?
Code:
mfaktc v0.18 (64bit built)
got assignment: exp=2373583 bit_min=61 bit_max=62
Starting trial factoring M2373583 from 2^61 to 2^62
 k_min = 485730431340
 k_max = 971460871270
[k_fac = 984834546993]
Using GPU kernel "75bit_mul32"
I haven't looked in the mfaktc source...

Might k_max be a soft border because of the classes system?
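
The numbers above can be double-checked with a few lines of Python. Every factor q of M_p has the form q = 2kp + 1, so k follows directly from q and p. This is only a sketch; the helper names are made up for illustration and are not part of mfaktc.

```python
import math


def k_of(q, p):
    """Return k such that q = 2*k*p + 1, or None if q does not have that form."""
    k, rem = divmod(q - 1, 2 * p)
    return k if rem == 0 else None


def is_mersenne_factor(q, p):
    """q divides M_p = 2^p - 1 exactly when 2^p mod q == 1."""
    return pow(2, p, q) == 1


p = 2373583
q = 4675173077110571839
k_max = 971460871270          # from the mfaktc output above

k = k_of(q, p)                # the k of the reported factor
bits = math.log2(q)           # size of the factor in bits
beyond_range = k > k_max      # True: the factor lies above the requested 2^62
```

With these inputs `k` comes out as 984834546993 and `bits` as roughly 62.02, matching the values quoted in the post, so the factor really does sit just above the requested upper limit.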
Old 2012-05-31, 18:24   #1788
Dubslow

Quote:
Originally Posted by Brain View Post
Might k_max be a soft border because of the classes system?
Yes.
Quote:
Originally Posted by README.txt
##################################################################
# 5.1 Stuff that looks like an issue but actually isn't an issue #
##################################################################

- mfaktc runs slower on small ranges. Usually it doesn't make much sense to
run mfaktc with an upper limit smaller than 2^64. It is designed for trial
factoring above 2^64 up to 2^95 (factor sizes). ==> mfaktc needs
"long runs"!
- mfaktc can find factors outside the given range.
E.g. './mfaktc.exe -tf 66362159 40 41' has a high chance to report
124246422648815633 as a factor. Actually this is a factor of M66362159 but
its size is between 2^56 and 2^57! Of course
'./mfaktc.exe -tf 66362159 56 57' will find this factor, too. The reason
for this behaviour is that mfaktc works on huge factor blocks. This is
controlled by GridSize in mfaktc.ini. The default value is 3 which means
that mfaktc runs up to 1048576 factor candidates at once (per class). So
the last block of each class is filled up with factor candidates above the
upper limit. While this is a huge overhead for small ranges it's safe to
ignore it on bigger ranges. If a class contains 100 blocks the overhead is
on average 0.5%. When a class needs 1000 blocks the overhead is 0.05%...
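
The overhead figures in the README excerpt follow from simple arithmetic: the last block of a class is padded up to a full block, so on average about half a block per class is wasted. A small Python sketch of that reasoning, assuming the block size of 1048576 candidates quoted above (the function names are hypothetical):

```python
BLOCK = 1048576  # factor candidates per block, as quoted for GridSize = 3


def padding_overhead(candidates_in_class, block=BLOCK):
    """Fraction of extra candidates tested because the last block is filled up."""
    blocks = -(-candidates_in_class // block)   # ceiling division
    padded = blocks * block - candidates_in_class
    return padded / candidates_in_class


def average_overhead(blocks_per_class):
    """Expected overhead when, on average, half of the last block is padding."""
    return 0.5 / blocks_per_class
```

`average_overhead(100)` gives 0.005 (0.5%) and `average_overhead(1000)` gives 0.0005 (0.05%), the two figures from the README. For a class that needs barely more than one block, `padding_overhead` approaches 100%, which is why small ranges run so inefficiently and why factors above the upper limit can show up.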
Old 2012-06-05, 15:50   #1789
TheJudger
 

Will somebody miss the debug option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc.
If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code...

Oliver
Old 2012-06-06, 13:13   #1790
MrRepunit
 
mfaktc for base 10 repunits

Hi,

I took the source of mfaktc 0.18 and changed some parts to handle base 10 repunits. I added a new code path (64 Bit kernel) and removed some other parts. Here is an overview of the changes:
  • rewrote lots of code to handle repunits
  • removed barrett kernel as it is not needed for repunits
  • removed 72 bit kernel -> No support for older GPUs
  • added 64 bit kernel
  • added a selftest for repunits
  • by default MORE_CLASSES is switched off (faster for smaller exponents)
  • moved the optional multiply into the modulus calculation (faster because the multiplication is done over fewer registers)
  • reduced the lower limit for exponents to 1000 (tested, but not guaranteed to work)
Some notes about running mfaktc-repunit:
  1. Format of worktodo files is the same (the exponents are now for base 10).
  2. On a GeForce GTX 460, one instance of mfaktc (using the 64-bit kernel) uses only a bit more than 50% of the GPU resources, so running two instances saturates the GPU, with each running nearly as fast as it would alone.
@Oliver: I am not sure whether this could be merged with the original mfaktc; that is for you to decide.

This version gave our search for the next repunit (probable) prime a massive boost. The bottleneck is still the PRP test.
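
For anyone wondering what changes when moving from Mersenne numbers to base 10 repunits: the divisibility test uses powers of 10 instead of powers of 2. The Python sketch below shows the underlying check for R_n = (10^n - 1)/9. It illustrates the mathematics only and is not taken from the mfaktc-repunit source; the function names are hypothetical.

```python
def repunit(n):
    """R_n = 111...1 (n ones) = (10^n - 1) / 9."""
    return (10 ** n - 1) // 9


def divides_repunit(q, n):
    """True if q divides R_n, tested by modular exponentiation without building R_n.

    q divides (10^n - 1)/9 exactly when 9*q divides 10^n - 1,
    i.e. when 10^n mod (9*q) == 1.
    """
    return pow(10, n, 9 * q) == 1
```

Small example: R_5 = 11111 = 41 × 271, so `divides_repunit(41, 5)` and `divides_repunit(271, 5)` are both true. Working modulo 9q rather than q keeps the test correct for q = 3 as well.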

Danilo
Attached Files
File Type: zip mfaktc-repunit.zip (154.9 KB, 109 views)
Old 2012-06-07, 11:58   #1791
MrRepunit
 
Win64 executable for mfaktc-repunit

Due to the size restriction of the forum I could not put the executable in the source, so here it is:
Attached Files
File Type: zip mfaktc-repunit-win-64.zip (126.7 KB, 98 views)
Old 2012-06-07, 14:30   #1792
TheJudger
 

Hi Danilo,

I'll take a look at your modifications later.

Quote:
Originally Posted by MrRepunit View Post
Due to the size restriction of the forum I could not put the executable in the source, so here it is:
I guess I'll upload it to www.mersenneforum.org/mfaktc/ later, including Win32 and Win64 builds and the CUDA DLLs.

Oliver
Old 2012-06-07, 21:11   #1793
Bdot
 

Quote:
Originally Posted by TheJudger View Post
Will somebody miss the debug option "VERBOSE_TIMING"? (If you don't know this option you won't miss it.)
It is a debugging/timing option which I've used in ancient versions of mfaktc.
If nobody tells me a good reason why I shouldn't remove it I'll remove it in mfaktc 0.19! I'm not even sure if it works as expected in the current code...

Oliver
I've used it only at the very beginning of my OpenCL port. Go ahead and kick it.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.