mersenneforum.org  

Old 2012-03-29, 02:34   #1728
Dubslow
Basketry That Evening!
 

Quote:
Originally Posted by kladner
I have certainly been observing on an x6 AMD that running mfaktc by itself, even a single instance with priority assigned to a core, is faster than when it is competing with P95-64 running 4x P-1 and 1x LL (all with priority assigned). Starting P95-64 has more effect on mfaktc than making it compete with CuLu for the GPU.
I think that's a memory issue, and it's become even more pronounced for me since AVX came out. Without P95 I can get a 195-200M avg. rate; with P95/AVX I get 165-170M, and with P95/SSE I previously got 172-175. Anything that either Prime95 or mfaktc can do to reduce memory usage would be a great gain for me at least, and it seems for many others as well. (George has known that P95 is memory-limited, but it's clear that mfaktc is as well, and I have no idea how much more room for improvement there is in that regard.)

Old 2012-03-29, 02:35   #1729
LaurV
Romulan Interpreter
 

Quote:
Originally Posted by Prime95
No, going past the breakeven point makes no sense. The GPU will clear more exponents by switching to CUDALucas rather than TFing past the breakeven. The question the GPU owner faces is: do I TF a range that hasn't reached the breakeven, or do I switch to CUDALucas?

How should we modify the chart to take into account the loss of CPU cores? You need to know how much CPU power is lost to keep mfaktc busy. For example, if it takes 2 i7-860 cores, then you'd compare mfaktc's factor-found rate to the LL rate of CUDALucas plus 2 i7-860 cores. Has anyone tried to gather this kind of data?
I did not gather data from other people (I am not good at that :D), but I did the calculation in a post long ago, for my cards and a few others that I tested. At the time I was trying to justify my opinion that TF-ing at the DC level makes no sense, and I also brought the two CPU cores lost to mfaktc into the argument. But people jumped on my head, so I gave up.
Old 2012-03-29, 04:12   #1730
bcp19
 

Quote:
Originally Posted by LaurV
I did not gather data from other people (I am not good at that :D), but I did the calculation in a post long ago, for my cards and a few others that I tested. At the time I was trying to justify my opinion that TF-ing at the DC level makes no sense, and I also brought the two CPU cores lost to mfaktc into the argument. But people jumped on my head, so I gave up.
I think part of the problem is that the CPU/GPU combination makes a huge difference. My 2500 uses a single core to feed a 560; that core can do 37% of a 26M DC in a day, while the 560 running CL can do 64%. Running mfaktc, that core puts out around 144 GHzD, so this combo gives me ~142.82 GHzD for each DC I could have done. Now, my entire Core2 Quad can only do 22.7% of the same DC in a day and the 550 Ti in it can do 41%, but the Quad is kinda screwy: if you set all 4 cores to DC, you can complete 8 DC in the same amount of time as 2 cores could complete 6 while the other 2 cores run mfaktc. I'm sure it has something to do with some sort of shared memory. If I run the Quad with 1 core doing DC and 3 cores on mfaktc, I can output almost 230 GHzD for each 'lost' DC. I have since installed a 480 in the 2500 and the 560 in one of the quads. This is the capability of my systems when using a 26M exponent:

2500/480 - 2 cores mfaktc - 149.26GHzD/DC 'lost'
2400/560Ti - 3 cores mfaktc - 159.53
X4 645/460 - 3 cores mfaktc - 172.78
Q6600/560 - 3 cores mfaktc - 204.85
Q8200/550Ti - 3 cores mfaktc - 228.15

Hard to believe, but the older system is actually 50% more efficient than my 'speed demon'.

The 'shared memory' in the quads that messes up running all cores as DC/LL has no such effect on mfaktc, which is what makes the quad so highly efficient compared to newer systems. Also, as I mentioned before, if the above systems with 3 instances are trimmed down to 2, they produce 10-15% less GHzD per 'lost' DC.
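For illustration, the "GHzD per 'lost' DC" figure from the 2500/560 example above can be reconstructed, assuming (my reading, not stated outright in the post) that the metric is mfaktc output per day divided by the DC-per-day given up, and that all three quoted rates are per-day figures:

```python
# Hypothetical reconstruction of bcp19's "GHzD per 'lost' DC" metric,
# using the 2500 + 560 numbers quoted above.
core_dc_per_day = 0.37        # one CPU core: fraction of a 26M DC per day
gpu_dc_per_day = 0.64         # the 560 running CUDALucas: DC per day
mfaktc_ghzd_per_day = 144.0   # same core + GPU pair running mfaktc instead

dc_given_up = core_dc_per_day + gpu_dc_per_day  # 1.01 DC/day forgone
ghzd_per_lost_dc = mfaktc_ghzd_per_day / dc_given_up
print(round(ghzd_per_lost_dc, 2))  # 142.57, close to the ~142.82 quoted
```

The small gap from 142.82 suggests the post's inputs were slightly less rounded than the figures quoted in it.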
Old 2012-03-29, 16:36   #1731
kjaget
 

Let me run through an example with my system - a 560ti448 (basically a 570) and overclocked i5-750 system. I'll use M55000000 as an example. I don't have exact measurements but this is more or less correct, I think. OTOH I'm rushing through this on my lunch break so any of the math could be wrong. On the third hand, at least I get the same rough answer that's been shown previously, so that's good (or confirmation bias).

TF on my system uses 3 CPU cores to saturate the GPU. The instances settle down to about 7.85 sec/class or 7540 seconds per exponent (TF from 71 to 72). Since there are 3 of them running, I get 3 results each 7540 seconds, or 1 result every 2510 seconds = 0.7 hours / exponent.
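The per-result arithmetic above can be sketched as follows; the 960-class count is my assumption (mfaktc splits a bit level into 4620 classes and computes the 960 that survive its class sieve), inferred from 7540 / 7.85 ≈ 960:

```python
# TF throughput for one 71->72 bit level with three mfaktc instances
# sharing the GPU (timings from the post; class count inferred above).
sec_per_class = 7.85
classes = 960
instances = 3

sec_per_exponent = sec_per_class * classes      # ~7536 s for one exponent
sec_per_result = sec_per_exponent / instances   # 3 results finish per cycle
print(round(sec_per_result), round(sec_per_result / 3600, 2))  # 2512 0.7
```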

Switching that to LL testing, I get the results of the GPU plus 3 CPU cores. Here I'm using data from mersennaries since I don't have it in front of me, but it should be a reasonable guess. The GPU gives a result every 95.1 hours. Each CPU core gives a result every 675 hours (~28 days per exponent). Since the GPU & CPU rates aren't the same, you have to do 1/(1/GPU rate + core count/CPU rate) to get the average, which = 66.9 hours per exponent.
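The averaging step is worth spelling out: result rates add across devices, so the per-exponent times combine harmonically.

```python
# Combined LL throughput of one GPU plus n CPU cores: rates (results per
# hour) add, so T_combined = 1 / (1/T_gpu + n/T_core).
t_gpu = 95.1    # hours per LL result on the GPU (CUDALucas)
t_core = 675.0  # hours per LL result on one CPU core (~28 days)
for cores in (2, 3, 4):
    t_combined = 1.0 / (1.0 / t_gpu + cores / t_core)
    print(cores, round(t_combined, 1))
# 3 cores -> ~66.8 h/exponent, matching the post's 66.9 within rounding.
```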

Assuming we're finding factors 1.12% of the time, as shown in the GPU to 72 stats, TF takes about 62 hours to find a factor while an LL test takes 67 hours. But since each factor found saves 2 LL tests, and each extra bit level doubles the run time, I should be TFing to one more bit level to make the time for 1 factor roughly the same as the time for 2 LL tests. This is the same 73-bit optimal depth we've seen calculated by ignoring the CPUs entirely.
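Putting the pieces together, the breakeven comparison above amounts to:

```python
# Breakeven check: TFing one bit deeper doubles the TF time per factor,
# and each factor found saves two LL tests. Figures from this post.
hours_per_tf_result = 0.7    # one 71->72 TF run (previous calculation)
factor_rate = 0.0112         # fraction of TF runs that find a factor
hours_per_ll = 66.9          # combined GPU + 3-core LL time per exponent

hours_per_factor = hours_per_tf_result / factor_rate  # ~62 h at this level
deeper = 2 * hours_per_factor                         # ~125 h at 73 bits
saved = 2 * hours_per_ll                              # ~134 h per factor
print(round(deeper), round(saved))  # 125 134 -> 73 bits is (just) worth it
```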

Some problems: mfaktc run time scales with exponent size, while P95 scales differently (n·log(n)?), so the decision may differ along the exponent range.

CPUs don't scale linearly when adding more cores to LL testing.

Different CPUs are relatively more or less effective at mfaktc sieving versus LL tests (AVX, etc).

The calculation is sensitive to the TF factor-found rate. 1% vs 1.1% isn't many extra successes, but it is 10% more of them...

On the plus side, though, since TF cost scales by 2x each time you increase the bit level, a few 10-20% hits on either side don't change the conclusion. My gut tells me that we could test some of the extremes (CPUs really good and really bad at LL vs mfaktc) and see if we get the same answer. Since 2x is a large factor, my guess is there's a good chance the answer is yes, which would really simplify life.

Since most of the data I have here is from the mersennaries page, we may be able to plot this stuff out over a wider range of CPU and GPU types. The big piece missing is how many CPU cores it takes to feed various GPUs. But again, going back to the idea that 10-20% each way doesn't matter when compared to 2x hit for each bit level, that might not be as bad as I imagine.

For a quick test, going from 2 -> 3 -> 4 CPU cores takes the LL time from 74 to 67 to 61 hours per exponent. That last one looks like it would move the breakeven point back to 72 bits of factoring (just barely), but adding the 4th CPU core to TF work gives me a ~10% better TF rate as well, so the overall conclusion doesn't change much. No matter how many CPU cores I use, it doesn't move the results enough to justify a 2x jump in TF time one way or the other.

I have no gut feel for whether this holds for faster CPUs. On the one hand, they influence the LL rate a lot more. On the other hand, you need fewer of them to saturate a GPU, so there's less to be gained from moving cores from TF to LL.

Old 2012-03-29, 17:51   #1732
bcp19
 

Quote:
Originally Posted by kjaget
Let me run through an example with my system
<snip>

Assuming we're finding factors 1.12% of the time, as shown in the GPU to 72 stats, TF takes about 62 hours to find a factor while an LL test takes 67 hours. But since each factor found saves 2 LL tests, and each extra bit level doubles the run time, I should be TFing to one more bit level to make the time for 1 factor roughly the same as the time for 2 LL tests. This is the same 73-bit optimal depth we've seen calculated by ignoring the CPUs entirely.
Using the information you supplied, I got the same answer you did, and you are correct: going 1 bit level deeper would be a 'tossup' on your system at that exponent level, since it is right at the breakeven point. Once you get to 56M-58M, though, it starts swinging more in favor of the TF.

A 'balanced' GPU also makes a difference within the same system. The Q6600/560 used to be a Q6600/450, which with 2 cores was at 185.56 and with 3 cores was 184.52. The switch to a 560 was a little over 10% more 'efficient' at 204.85. The 450 was 'too little' GPU, but too much GPU is as bad or worse. I initially put the 480 into the quad, but all 4 cores could not max it out. The 4 cores and the 480 could do 1.43 DC/day, but I calculated I'd only get ~200GD with 4 instances, or 139.8GD/DC 'lost'.

Old 2012-03-29, 18:27   #1733
Prime95
P90 years forever!
 

Quote:
Originally Posted by bcp19
The Q6600/560 used to be a Q6600/450, which with 2 cores was at 185.56 and with 3 cores was 184.52. The switch to a 560 was a little over 10% more 'efficient' at 204.85....The 4 cores and the 480 could do 1.43 DC/day, but I calculated I'd only get ~200GD with 4 instances, or 139.8GD/DC 'lost'.
All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).

For GPU to 72, we should then set the "official" breakeven point by taking an average of the typical reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just the right amount in the low-to-mid 50M range, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.

Old 2012-03-29, 19:25   #1734
James Heinrich
 

Quote:
Originally Posted by Prime95
But we need more data!
One piece of data that might be relevant (or at least interesting) is the ratio of potential GHz-days/day of your GPU vs the potential GHd/d of the CPU cores required to power it. I don't want to pollute this thread with trivial responses, so if everyone could PM or email me with:
* what GPU you have
* what CPU you have
* how many instances of mfaktc you run (thereby how many CPU cores are used)
* what average GPU usage you get by doing so (should typically be close to 100%).

e.g.: "GTX 570; Core i7-3930K, 2 instances; 98% GPU".

This gives me a GPU-CPU GHd/d ratio of almost 19:1 (281/(2*7.4)). My theory is that this ratio should be roughly constant and could serve as a basis for including CPU usage into the equation. Once I get a reasonable sample of data I'll post back with what I find.
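For the record, the quoted ratio works out as:

```python
# James's GPU:CPU GHd/d ratio for a GTX 570 fed by two i7-3930K cores,
# using the figures from his post.
gpu_ghd_per_day = 281.0   # potential GHd/d of the GPU
core_ghd_per_day = 7.4    # potential GHd/d of one CPU core
cores_used = 2            # cores consumed keeping mfaktc fed
ratio = gpu_ghd_per_day / (cores_used * core_ghd_per_day)
print(round(ratio, 1))  # 19.0 -> "almost 19:1"
```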
Old 2012-03-29, 20:02   #1735
bcp19
 

Quote:
Originally Posted by Prime95
All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).
Which is why I use http://www.gpu72.com/reports/factoring_cost/ to see the 'cost' per factor found in the range I am working, and why I specify 26M in all my posts. At 26M, one of my machines would be good for ^69 while the rest should only do ^68. Comparing the CPU effort to complete a 26M exponent vs a 29M or 30M one, the DC would run 19-29% longer, meaning the GPU would produce 19-29% more GHzD per DC. Using this, one of my machines is getting close to being good for ^70 at 29M while the rest are comfortable at ^69. LaurV's machine, which I believe is around 120 GD at the 26M range, would be at 143-155 GD running 29-30M exponents, which would still be outside the ^69 range.

Quote:

For GPU to 72, we should then set the "official" breakeven point by taking an average of the typical reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just the right amount in the low-to-mid 50M range, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.
Due to 27.4 and the AVX speedup (20%? 30%? 50%?), Sandy Bridge CPUs are much more efficient at completing DC/LL than non-SB ones. The 'quirk' in Core2 Quads makes them only 75% as efficient as a comparable i3/i5/i7/AMD quad.

Using this, any SB is likely to be 1-2 bit levels lower than a non-SB i3/i5/i7, which would probably be 1/2 to 1 bit level lower than the quads. AMD quads seem to fall between the i3/i5/i7 bracket and the C2Qs, but I only have 1 data point to fall back on.
Old 2012-03-29, 20:59   #1736
James Heinrich
 

Quote:
Originally Posted by James Heinrich
* what GPU you have
* what CPU you have
* how many instances of mfaktc you run (thereby how many CPU cores are used)
* what average GPU usage you get by doing so (should typically be close to 100%)
Please also include:
* CPU usage per core (assuming AllowSleep=1)
* CPU speed (whether overclocked or not)
* SievePrimes value
Old 2012-03-29, 23:16   #1737
bcp19
 

Quote:
Originally Posted by Prime95
All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).

For GPU to 72, we should then set the "official" breakeven point by taking an average of the typical reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just the right amount in the low-to-mid 50M range, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.
I just realized you didn't understand my terms. The GHzD figures I listed were NOT for a bit level, but the total the card could produce in the same time it would take the GPU and the CPUs combined to complete 1 DC. On GPU to 72, at 26M, it takes 106 TF assignments on average to find a factor, which is 237 GHzD. My most efficient machine at the 26M level can do 228, which means it would almost break even compared to current estimates.

So, I just finished timings on 45M exponents on all my CPUs and GPUs. The credit for a 45M exponent is around 72.22 and a 26M is 22.208, so a 45M exponent takes about 3.25 times more effort than a 26M. Using the timings at 45M the same way as I did for 26M, I ended up with an increase of between 2.9 and 3.05 times, which is fairly close. If I use 3, I get 448 GHzD on my worst system and 684 GHzD on my most efficient. Double that for 2 LL tests saved and you get 996 to 1368. This tells me all of my machines can efficiently take 45M exponents to ^71, while one could maybe get away with ^72, seeing that 12 factors have been found in 1708 runs, which is kind of a small pool to use for an estimate.
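As a sanity check on the scaling argument, the credit ratio can be computed directly:

```python
# GHzD credit ratio between a 45M and a 26M exponent; credit is meant to
# track effort, so this should predict the run-time increase.
credit_45m = 72.22
credit_26m = 22.208
predicted_ratio = credit_45m / credit_26m
print(round(predicted_ratio, 2))  # 3.25; measured timings gave 2.9-3.05x
```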
Old 2012-03-30, 03:07   #1738
LaurV
Romulan Interpreter
 

Quote:
Originally Posted by kjaget
Let me run through an example ...
Very good post.