mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2012-03-05, 20:07   #1618
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

3·199 Posts
Default

Quote:
Originally Posted by rcv View Post
So, let me re-ask my 2nd question... Aside from validating the code, is there a reason why the user shouldn't be allowed to set a lower SievePrimes than 5K?
I think, the best way to free your CPU core(s) for other tasks is to set mfaktc to low priority.

Regarding the pure question "why not lower SievePrimes": because it's not tested. As long as SievePrimes remains above SIEVE_SPLIT (250), the code will probably work correctly. However, before mfakt[co] is given to the public, some more tests are performed. You may have seen the CHECKS_MODBASECASE #define. Completing the full selftest at a certain SievePrimes value is a prerequisite, but not sufficient.

OK, now that this is out I can tell you that some time ago I had finished a few tests with CHECKS_MODBASECASE at SievePrimes=256, and that did not show up any error, but it was not the full test, and it was mfakto, which runs different kernels.

And I briefly tested a special version of mfakto that skips sieving and memory transfer to the GPU completely, testing all candidates of a class. This one also successfully completed a few rounds of CHECKS_MODBASECASE tests.

So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.

Quote:
Originally Posted by rcv View Post
I maintain that if both sieving and trial factoring is done on the GPU, the tradeoff *is* as simple as matching candidates per second being removed by the siever to candidates per second being tested by the trial factoror.

I actually have some prototype sieving code. It is not optimized. At the smallest prime factor, not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second, At p=2297, the incremental rate of candidate removal is 500 megacandidates per second, and at p=4093, the incremental rate of candidate removal is 261 megacandidates per second. But the curve is rather flat, here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]

@Bdot: If you are interested, would you please weigh in on how this compares with your results.

Thanks to all who responded!
I'd be really be interested to see this work. Though I'm maintaining the OpenCL version, receiving a hint how GPU-sieving could be done in a fast way would certainly help.

My first approach was so terribly slow that I could only sieve the first 256 primes to get the 30M/s (sieve output for the siever alone) on a GPU that otherwise runs ~150M/s through factoring. I was so disappointed that I gave up ... Your results, however ...
Bdot is offline   Reply With Quote
Old 2012-03-05, 20:16   #1619
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

10010101012 Posts
Default

Quote:
Originally Posted by James Heinrich View Post
I highly second this. The "M/s" value is somewhat meaningless for the user, and often misunderstood. The conversion of time-per-class into GHz-days-per-day should be very simple: GHz-days for the assignment is given by:
Code:
0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days

// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63-and 64-bit
// magic constant is 0.01116 for 62-bit and below
Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

Of course, above code is based on a single bitlevel, but easily adapted to multi-bitlevel assigments.
It's the first time I've seen the formula to calculate this. (OK, if I had asked you had probably told me earlier ... ) I think, I'll leave the M/s as an option in the ini-file, but switch the default to GHz-days/day (= GHz ?). Added to todo-list ...

Last fiddled with by Bdot on 2012-03-05 at 20:19 Reason: if I had asked
Bdot is offline   Reply With Quote
Old 2012-03-05, 22:23   #1620
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

11100001101012 Posts
Default

Perhaps call it Eq. GHz, for Equivalent GHz (to a Core2).

@rcv: What a wonderful defense I like where this is going. I'd volunteer to do some 460 testing, but unfortunately I cannot for the life of me upgrade my Linux Nvidia drivers past 270.xx, so I'm stuck with mfaktc 0.17.
Dubslow is offline   Reply With Quote
Old 2012-03-05, 22:35   #1621
flashjh
 
flashjh's Avatar
 
"Jerry"
Nov 2011
Vancouver, WA

1,123 Posts
Default

Quote:
Originally Posted by Bdot View Post
So I'm quite confident that the full tests of lower SievePrimes would pass, but so far nobody has done these tests.
Just point me in the right direction and I'll test it. What needs to be done, specifically?
flashjh is offline   Reply With Quote
Old 2012-03-05, 23:19   #1622
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,537 Posts
Default

Quote:
Originally Posted by rcv View Post
I actually have some prototype sieving code. It is not optimized.
I too would be interested in seeing this code.
Prime95 is offline   Reply With Quote
Old 2012-03-05, 23:26   #1623
Bdot
 
Bdot's Avatar
 
Nov 2010
Germany

59710 Posts
Default

Quote:
Originally Posted by flashjh View Post
Just point me in the right direction and I'll test it. What needs to be done, specifically?
1. Build a binary that allows for setting low SievePrimes, and that has CHECKS_MODBASECASE defined.
2. Run the full selftest with that binary with fixed, low SievePrimes
3. Analyze the output for any CHECKS_MODBASECASE violations (and of course, all factors need to be found).

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.

Last fiddled with by Bdot on 2012-03-05 at 23:28 Reason: fix SievePrimes
Bdot is offline   Reply With Quote
Old 2012-03-06, 02:57   #1624
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

11100001101012 Posts
Default

Quote:
Originally Posted by Bdot View Post

But before we all run and do something just for the sake of doing something: Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.
Quote:
Originally Posted by rcv View Post

In the parlance of Mathematica, the fraction of candidates which pass the sieving is given by Apply[Times,Prime[5+Range[sp]]-1)/Prime[5+Range[sp]])], where sp is the number of SievePrimes. At SievePrimes=1500, the above formula yields 28.5914%. At SievePrimes=5000, the above formula yields 25.0285%.

The number of candidates reported in each class by mfaktc (1.39G, as shown above) with SievePrimes=1500 agrees with the theoretical. Floor[.285914665945569*2^71/4620/52575179/2+1/2]=1389675478 candidates per class.

At SievePrimes=5000, the number of candidates per class is theoretically Floor[0.250284623178239*2^71/4620/52575179/2+1/2]=1216497244.

When I switched from SievePrimes=5000 to SievePrimes=1500, the number of candidates per second remained constant, but the time per class increased by about 14% (0.285915/0.250285-1). As best as I could tell, my CPU usage due to mfaktc went down by more than half. Now, the GPU is almost never starved for work. In contrast, with high fixed values of SievePrimes, my CPU becomes saturated, the GPU is often starved for work, the net mfaktc throughput goes down, and I can't use my CPU for other useful work. With moderate values of SievePrimes, the CPU burns a lot of time and the GPU is sometimes starved for work.
Seems like a pretty clear benefit to me. (My emphasis.)

Last fiddled with by Dubslow on 2012-03-06 at 02:57 Reason: (My emphasis.)
Dubslow is offline   Reply With Quote
Old 2012-03-06, 04:18   #1625
kjaget
 
kjaget's Avatar
 
Jun 2005

8116 Posts
Default

Quote:
Originally Posted by Dubslow View Post
Seems like a pretty clear benefit to me. (My emphasis.)
Not sure which side you meant the benefit was, but here's my take.

Considering how much more efficient GPUs are at generating GHz-days of work, trading 100% of 1 CPU for a 14% speedup in GPU production feels like a net win to me. If you choose to measure it that way, of course.

For a 560ti, that's ~25 GHz-days / day extra throughput. I don't see that a CPU is going to give anywhere that kind of throughput doing other types of work.

Last fiddled with by kjaget on 2012-03-06 at 04:21
kjaget is offline   Reply With Quote
Old 2012-03-06, 06:00   #1626
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts
Default

No, it's about half a CPU core, there'd still be significant CPU use (unless he pushes the GPU sieve thing through as well, which I'm hoping for).
Dubslow is offline   Reply With Quote
Old 2012-03-06, 10:59   #1627
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

21278 Posts
Default

Hi rcv,

Quote:
Originally Posted by rcv View Post
I actually have some prototype sieving code. It is not optimized. At the smallest prime factor, not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second, At p=2297, the incremental rate of candidate removal is 500 megacandidates per second, and at p=4093, the incremental rate of candidate removal is 261 megacandidates per second. But the curve is rather flat, here. With my 560Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the tradeoff between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]
this sounds interesting. I would love to see your code. Are you talking about sieving only or does this include the translation of set/unset bits into FC candidates, too?

Oliver
TheJudger is offline   Reply With Quote
Old 2012-03-06, 14:18   #1628
apsen
 
Jun 2011

131 Posts
Default

Quote:
Originally Posted by Bdot View Post
Maybe go back and think again about "Why do we want lower SievePrimes?" and "Do we really want that?". I haven't quite understood that part yet.
I have old CPU and new GPU. With lower limit at 5000 I saturate all 4 cores of CPU but my GPU usage is well below 100%. I played with reducing the limit too and found that the lowest time per class for me is around 1000-1500. As the limit is lowered below 1000 - the CPU usage will go down. I did run long tests at 1000 and there were no problems detected but I did not use the CHECKS_MODBASECASE define.

And low priority does not always help. I have not specifically checked mfaktc but prime95 would cause hiccups in some programs even on idle priority. In fact IIRC it has setting to pause processing when it detects specified programs running.

Last fiddled with by apsen on 2012-03-06 at 14:23
apsen is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1676 2021-06-30 21:23
The P-1 factoring CUDA program firejuggler GPU Computing 753 2020-12-12 18:07
gr-mfaktc: a CUDA program for generalized repunits prefactoring MrRepunit GPU Computing 32 2020-11-11 19:56
mfaktc 0.21 - CUDA runtime wrong keisentraut Software 2 2020-08-18 07:03
World's second-dumbest CUDA program fivemack Programming 112 2015-02-12 22:51

All times are UTC. The time now is 10:27.


Mon Aug 2 10:27:19 UTC 2021 up 10 days, 4:56, 0 users, load averages: 1.57, 1.41, 1.24

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.