mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2012-03-02, 21:44   #1607
flashjh ("Jerry")

Quote:
Originally Posted by James Heinrich View Post
Not yet.
I'm experiencing infinitely more difficulty setting up a development environment than I expected. WIMP != LAMP (or even WAMP as I have on my home/development server).
James,

My factor for M58703263 showed up correctly today:

Code:
Manual testing 58703263 F 2012-03-02 16:45 0.0 920694316080604322623 1.7365
Did you make some changes?
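A reported factor like the one above can be sanity-checked locally: any factor of 2^p - 1 (p prime) must have the form 2kp+1, be congruent to ±1 mod 8, and of course actually divide 2^p - 1. A minimal Python sketch using the values from the result line:

```python
p = 58703263               # exponent of the Mersenne number M58703263
f = 920694316080604322623  # factor reported in the result line above

# Any factor of 2^p - 1 (p prime) has the form 2*k*p + 1 ...
assert (f - 1) % (2 * p) == 0

# ... is congruent to +/-1 modulo 8 ...
assert f % 8 in (1, 7)

# ... and, of course, divides 2^p - 1 (checked by modular exponentiation,
# without ever forming the ~17-million-digit number itself).
assert pow(2, p, f) == 1

print("factor checks out")
```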
Old 2012-03-02, 22:07   #1608
James Heinrich

Quote:
Originally Posted by flashjh View Post
Did you make some changes?
No. I'm working through changes, but nothing has been published to the site yet.

I assume you mean that your TF factor was correctly credited as TF (instead of P-1)? Your example is a relatively small factor (<2^70), so that's probably expected. It's only when you submit factors larger than what PrimeNet considers "reasonable" for TF that it falsely assumes they must have come from P-1.
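A hedged sketch of the heuristic described here (the fixed 70-bit cutoff below is purely illustrative; PrimeNet's actual threshold depends on the exponent's assigned TF depth):

```python
# Illustrative only: classify a submitted factor as TF-plausible by size.
# The real PrimeNet rule is per-exponent; the 70-bit cutoff is hypothetical.
def plausible_tf(factor, max_tf_bits=70):
    return factor.bit_length() <= max_tf_bits

f = 920694316080604322623  # flashjh's factor, between 2^69 and 2^70
print(f.bit_length())      # 70, i.e. f < 2^70
print(plausible_tf(f))     # True: small enough to be credited as TF
```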

Old 2012-03-02, 22:08   #1609
flashjh

Quote:
Originally Posted by James Heinrich View Post
No. I'm working through changes, but nothing has been published to the site yet.

I assume you mean that your TF factor was correctly credited as TF (instead of P-1)? It is a relatively small factor (<2^70), so that's probably expected. It's only when you submit factors larger than what PrimeNet considers "reasonable" for TF that it falsely assumes they must have come from P-1.
Ah, that explains some from the past, as well. Thanks for the update.
Old 2012-03-05, 07:44   #1610
rcv

I got my first CUDA capable card (560 Ti) a little over a month ago, and have been running mfaktc on 64-bit Linux. I have a few questions.

1. Can anyone explain why Compute Capability 2.1 is about "half" as fast as 2.0 for running mfaktc? [Yes, I know there are a billion or so fewer transistors, but what specific feature/function do the 2.0 cards have that 2.1 lacks that makes such a huge difference to mfaktc.]

2. I am disappointed in how much CPU it takes to feed my GPU. I would happily give up a fraction of my GPU performance to get back my CPU performance. [It's no trouble consuming nearly two full i7 cores to feed the GPU via two instances of mfaktc.]

mfaktc is compiled with a minimum SievePrimes=5000. I have tweaked the code to let me run at SievePrimes=1000. Is there a discussion as to why the user shouldn't be allowed to set a lower SievePrimes than 5K?

3. Has anyone considered running the sieving on the GPU? Is it just that nobody has written the code or is there a reason the idea was rejected? [If one were running the sieve and the trial factoring on the same processor, the proper tradeoff between sieving and trial factoring seems pretty clear -- If trial factoring can test, say, 250 million candidates per second, then sieving should stop at the point it can no longer remove more than 250 million candidates per second.]

Thanks in advance!
Old 2012-03-05, 08:44   #1611
Dubslow ("Bunslow the Bold")

http://www.mersenneforum.org/showthr...245#post281245

^ That's the answer for your first question.

For the other two, I'm not sure, though I'll cast my vote again for on-GPU sieving.

Old 2012-03-05, 10:51   #1612
TheJudger ("Oliver")

Hi!

Quote:
Originally Posted by rcv View Post
2. I am disappointed in how much CPU it takes to feed my GPU. I would happily give up a fraction of my GPU performance to get back my CPU performance. [It's no trouble consuming nearly two full i7 cores to feed the GPU via two instances of mfaktc.]
Just run CudaLucas to free up your CPU. Primenet needs more LL and less TF!

Quote:
Originally Posted by rcv View Post
mfaktc is compiled with a minimum SievePrimes=5000. I have tweaked the code to let me run at SievePrimes=1000. Is there a discussion as to why the user shouldn't be allowed to set a lower SievePrimes than 5K?
Have you compared the speed of mfaktc when you lower SievePrimes to 1000? The average rate is not the speed; the time per class is the speed! Lowering SievePrimes is just a waste of energy, and it isn't validated!

src/params.h:
Code:
[...]
/******************************************************************************
*******************************************************************************
*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
*** DO NOT EDIT DEFINES BELOW THIS LINE UNLESS YOU REALLY KNOW WHAT YOU DO! ***
*******************************************************************************
******************************************************************************/
[...]
#define SIEVE_PRIMES_MIN      5000 /* DO NOT CHANGE! */
#define SIEVE_PRIMES_DEFAULT 25000 /* DO NOT CHANGE! */
#define SIEVE_PRIMES_MAX    200000 /* DO NOT CHANGE! */
[...]

Quote:
Originally Posted by rcv View Post
3. Has anyone considered running the sieving on the GPU? Is it just that nobody has written the code or is there a reason the idea was rejected? [If one were running the sieve and the trial factoring on the same processor, the proper tradeoff between sieving and trial factoring seems pretty clear -- If trial factoring can test, say, 250 million candidates per second, then sieving should stop at the point it can no longer remove more than 250 million candidates per second.]
Yes, for sure... but I don't know how to implement this efficiently. Bdot (mfakto) tried it and IIRC got ~30M/s on a not-so-slow GPU. And the trade-off is not that simple, because the number of candidates per second doesn't matter; the time per assignment matters!

Oliver
Old 2012-03-05, 12:29   #1613
TheJudger

Btw., don't take it personally if my previous post sounds too rude.

I'm getting the same questions again and again, so I might be a little bit annoyed.

Oliver
Old 2012-03-05, 16:20   #1614
kjaget

Quote:
Originally Posted by TheJudger View Post
Have you compared the speed of mfaktc when you lower SievePrimes to 1000? The average rate is not the speed; the time per class is the speed!
I've seen this misunderstanding quite a bit as well. Any thought of removing it in future versions? Maybe replacing it with something like GHz-days/day, something that's easy to sum across instances to see total throughput?
Old 2012-03-05, 16:28   #1615
Dubslow

That's difficult to calculate and estimate. Sticking with the raw data is better, but we do need to figure out a better way to present it.
Old 2012-03-05, 16:44   #1616
James Heinrich

Quote:
Originally Posted by kjaget View Post
I've seen this misunderstanding quite a bit as well. Any thought of removing it in future versions? Maybe replacing it with something like GHz-days/day, something that's easy to sum across instances to see total throughput?
I highly second this. The "M/s" value is somewhat meaningless to the user, and often misunderstood. The conversion of time-per-class into GHz-days-per-day should be very simple: GHz-days for the assignment is given by:
Code:
0.016968 * pow(2, $bitlevel - 48) * 1680 / $exponent

// example using M50,000,000 from 2^69-2^70:
= 0.016968 * pow(2, 70 - 48) * 1680 / 50000000
= 2.3912767291392 GHz-days

// magic constant is 0.016968 for TF to 65-bit and above
// magic constant is 0.017832 for 63- and 64-bit
// magic constant is 0.01116 for 62-bit and below
Then all you need is: 86400 / (time_per_class * classes_per_exponent) * ghz_days_assignment

Of course, the above code is based on a single bit level, but it is easily adapted to multi-bitlevel assignments.
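A runnable version of the formula above, for the curious (Python; the function names are mine, and 4620 is mfaktc's class count as quoted later in this thread):

```python
def tf_ghz_days(exponent, bitlevel):
    """GHz-days credit for trial factoring one bit level (formula above)."""
    if bitlevel >= 65:
        k = 0.016968
    elif bitlevel >= 63:
        k = 0.017832
    else:
        k = 0.01116
    return k * 2 ** (bitlevel - 48) * 1680 / exponent

def ghz_days_per_day(exponent, bitlevel, time_per_class, classes=4620):
    """Throughput, given mfaktc's reported time per class in seconds."""
    return 86400 / (time_per_class * classes) * tf_ghz_days(exponent, bitlevel)

print(tf_ghz_days(50000000, 70))  # ~2.3912767291392, the worked example above
```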
Old 2012-03-05, 19:17   #1617
rcv

@Dubslow: Thank you for the pointer. That's exactly what I was looking for.

@kjaget/TheJudger: On my setup, the time per class and the megacandidates per second are in a lock-step inverse relationship with each other (at constant SievePrimes). Whether I set SievePrimes to 1000, 1500, or 2000, the average rate remains a little above 125 megacandidates per second on each of the two instances I am running. When I vary SievePrimes, the number of candidates changes, as expected, and the time per class changes proportionally. If I have a misunderstanding, I'm sure you folks will correct me. [See more, below.]

Code:
Starting trial factoring M52575179 from 2^71 to 2^72
    class | candidates |    time |    ETA | avg. rate | SievePrimes | CPU wait
1669/4620 |      1.39G | 11.097s |  1h53m | 125.30M/s |        1500 |    0.39%
1672/4620 |      1.39G | 10.776s |  1h49m | 129.03M/s |        1500 |    0.41%
1680/4620 |      1.39G | 10.533s |  1h47m | 132.01M/s |        1500 |    0.41%
1681/4620 |      1.39G | 11.658s |  1h58m | 119.27M/s |        1500 |    0.37%
1689/4620 |      1.39G | 10.670s |  1h48m | 130.31M/s |        1500 |    0.41%
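The lock-step relationship is visible directly in this output: candidates divided by the average rate reproduces the reported time per class. A quick check in Python, using the first row:

```python
candidates = 1.39e9    # candidates per class, from the table above
avg_rate = 125.30e6    # candidates/s, first row
time_per_class = candidates / avg_rate
print(round(time_per_class, 3))  # ~11.093 s, matching the reported 11.097 s
```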
Quote:
Originally Posted by TheJudger View Post
Btw., don't take it personally if my previous post sounds too rude. I'm getting the same questions again and again, so I might be a little bit annoyed.
OK. I won't take it personally. For all you know, I just fell off the turnip truck. At least your answers are all here in one big bold place for future questioners to find.


Quote:
Originally Posted by TheJudger View Post
Have you compared the speed of mfaktc when you lower SievePrimes to 1000? The average rate is not the speed; the time per class is the speed! Lowering SievePrimes is just a waste of energy, and it isn't validated!
I disagree completely about this being a waste of energy! I saw the warning in the code, which I heeded. I saw the word "unless". I've looked at the code. I've run the self-test. I've found 4 factors at the 71/72 bit size with the tweaked version. Who should I see to get it validated? If you know of a problem, PLEASE let me know.

In the parlance of Mathematica, the fraction of candidates which pass the sieving is given by Apply[Times, (Prime[5 + Range[sp]] - 1)/Prime[5 + Range[sp]]], where sp is the number of SievePrimes. At SievePrimes=1500, this formula yields 28.5914%; at SievePrimes=5000, it yields 25.0285%.

The number of candidates per class reported by mfaktc with SievePrimes=1500 (1.39G, as shown above) agrees with the theoretical value: Floor[0.285914665945569*2^71/4620/52575179/2 + 1/2] = 1389675478 candidates per class.

At SievePrimes=5000, the number of candidates per class is theoretically Floor[0.250284623178239*2^71/4620/52575179/2+1/2]=1216497244.

When I switched from SievePrimes=5000 to SievePrimes=1500, the number of candidates per second remained constant, but the time per class increased by about 14% (0.285915/0.250285-1). As best as I could tell, my CPU usage due to mfaktc went down by more than half. Now, the GPU is almost never starved for work. In contrast, with high fixed values of SievePrimes, my CPU becomes saturated, the GPU is often starved for work, the net mfaktc throughput goes down, and I can't use my CPU for other useful work. With moderate values of SievePrimes, the CPU burns a lot of time and the GPU is sometimes starved for work.
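These figures are straightforward to reproduce; here is a self-contained Python sketch (my own code, not mfaktc's; the first five primes are skipped because mfaktc's class mechanism already accounts for 2, 3, 5, 7, and 11):

```python
import math

def primes_up_to(n):
    """Plain sieve of Eratosthenes."""
    is_p = bytearray([1]) * (n + 1)
    is_p[0:2] = b"\x00\x00"
    for i in range(2, int(n**0.5) + 1):
        if is_p[i]:
            is_p[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(n + 1) if is_p[i]]

def surviving_fraction(sp, limit=60000):
    """Fraction of candidates passing a sieve of `sp` primes starting at 13
    (2, 3, 5, 7 and 11 are covered by the class mechanism)."""
    frac = 1.0
    for p in primes_up_to(limit)[5 : 5 + sp]:
        frac *= (p - 1) / p
    return frac

def candidates_per_class(frac, exponent=52575179, classes=4620):
    """Candidates per class for M52575179 from 2^71 to 2^72."""
    return math.floor(frac * 2**71 / classes / exponent / 2 + 0.5)

f1500 = surviving_fraction(1500)
f5000 = surviving_fraction(5000)
print(f1500)                        # ~0.285915, as stated above
print(candidates_per_class(f1500))  # ~1389675478
print(f1500 / f5000 - 1)            # ~0.1424, the ~14% extra time per class
```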

With a smaller number of cores and a slower GPU, the default and minimum SievePrimes may make very good sense. But with a larger number of cores and a faster GPU, the minimum SievePrimes does not make sense for me. And, I would respectfully suggest it may not make sense for other people.

So, let me re-ask my 2nd question... Aside from validating the code, is there a reason why the user shouldn't be allowed to set a lower SievePrimes than 5K?


Quote:
Originally Posted by TheJudger View Post
Yes, for sure... but I don't know how to implement this efficiently. Bdot (mfakto) tried it and IIRC got ~30M/s on a not-so-slow GPU. And the trade-off is not that simple, because the number of candidates per second doesn't matter; the time per assignment matters!
I maintain that if both the sieving and the trial factoring are done on the GPU, the trade-off *is* as simple as matching the candidates per second removed by the siever to the candidates per second tested by the trial factorer.

I actually have some prototype sieving code. It is not optimized. At the smallest prime factor not inherently sieved by the class mechanism (p=13), it can sieve out 64 billion candidates per second. At p=1583, the incremental rate of candidate removal is 1 billion candidates per second; at p=2297 it is 500 megacandidates per second; and at p=4093 it is 261 megacandidates per second. But the curve is rather flat here. With my 560 Ti GPU, my prototype sieving code, and your trial factoring code, it would seem the trade-off between more sieving and more trial factoring is probably in the vicinity of SievePrimes=1000+/-500, and not especially sensitive to variations. [This would leave your CPU essentially unused.]
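The four data points above are roughly consistent with an incremental removal rate that falls off like C/p (adding sieve prime p removes about a 1/p share of the surviving candidates). A back-of-the-envelope break-even estimate, entirely my own illustrative model rather than anything from mfaktc:

```python
# rcv's prototype data points: (prime p, incremental removal rate, cand/s).
data = [(13, 64e9), (1583, 1e9), (2297, 500e6), (4093, 261e6)]

# Each point yields an estimate of C in rate ~= C/p; take a middle value.
cs = sorted(p * r for p, r in data)
C = cs[len(cs) // 2]    # ~1.15e12 (the p=2297 point)

tf_rate = 250e6         # hypothetical GPU trial-factoring rate, cand/s
p_cutoff = C / tf_rate
print(round(p_cutoff))  # ~4600: sieving with larger primes no longer pays off
```

There are roughly 600 primes below 4600, so even this crude fit lands in the same ballpark as the SievePrimes=1000+/-500 estimate above.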

@Bdot: If you are interested, would you please weigh in on how this compares with your results?

Thanks to all who responded!
