mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

bcp19 2012-03-09 06:04

[QUOTE=kladner;292402]@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 4 instances.[/QUOTE]

FYI, without completing the run, you cannot tell how much increase you have in throughput with 4 running compared to 1, 2, or 3.

LaurV 2012-03-09 06:19

[QUOTE=bcp19;292404]FYI, without completing the run, you cannot tell how much increase you have in throughput with 4 running compared to 1, 2, or 3.[/QUOTE]
Why can't you? Is the ETA not reliable, or what? (especially for the one instance that already did ~800 classes). For me the estimate seems quite reliable.

bcp19 2012-03-09 07:05

[QUOTE=LaurV;292405]Why can't you? Is the ETA not reliable, or what? (especially for the one instance that already did ~800 classes). For me the estimate seems quite reliable.[/QUOTE]

Look at the 4 together: the upper left goes 9:11, 9:13, 9:14, 9:11 remaining; the bottom right 7:50, 7:46, 7:47. All of the timings for the last 3-4 lines show 34.9-35.9 seconds. Looking at the ETA may give you a ROUGH estimate, but you need to complete a full run to get an accurate one. As an example:
[quote]no factor for M29231441 from 2^68 to 2^69 [mfaktc 0.18 barrett79_mul32]
tf(): total time spent: 1h 0m 10.935s
no factor for M29231479 from 2^68 to 2^69 [mfaktc 0.18 barrett79_mul32]
tf(): total time spent: 1h 0m 4.711s
no factor for M29231617 from 2^68 to 2^69 [mfaktc 0.18 barrett79_mul32]
tf(): total time spent: 1h 0m 10.746s
no factor for M29231743 from 2^68 to 2^69 [mfaktc 0.18 barrett79_mul32]
tf(): total time spent: 1h 0m 12.207s
no factor for M29231773 from 2^68 to 2^69 [mfaktc 0.18 barrett79_mul32]
tf(): total time spent: 1h 0m 5.245s
[/quote]
These exponents print a new line every 3.7 seconds, but I can see the ETA jump by 10-60 seconds up or down. That's a pretty big margin of error.

LaurV 2012-03-09 07:31

[QUOTE=bcp19;292408]These exponents print a new line every 3.7 seconds, but I can see the ETA jump by 10-60 seconds up or down. That's a pretty big margin of error.[/QUOTE]
This happens if your computer does something else which steals clocks from mfaktx, so some classes get more CPU/GPU time than others, with the former showing shorter ETAs than the latter. If the computer is balanced, all affinities are set right, and neither CPU nor GPU is "starving", then the ETAs are VERY stable and reliable. And if not, you can always estimate an EMA or SMA quite accurately from the sequence of class times you see on screen.
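A moving-average smoother like LaurV describes can be sketched in a few lines (illustrative Python; the per-class times, smoothing factor, and function name are invented for the example, this is not mfaktc's actual ETA code):

```python
def eta_seconds(class_times, classes_left, alpha=0.3):
    """Estimate time remaining from a noisy stream of per-class timings.

    Smooths the per-class time with an exponential moving average (EMA)
    so a few slow classes (e.g. when Windows steals CPU/GPU time) don't
    make the ETA jump around.
    """
    ema = class_times[0]
    for t in class_times[1:]:
        ema = alpha * t + (1 - alpha) * ema
    return ema * classes_left

# Illustrative per-class times (seconds): mostly ~3.7s with a few spikes.
times = [3.7, 3.7, 3.8, 5.9, 3.7, 3.6, 4.8, 3.7]
print(round(eta_seconds(times, classes_left=500), 1))
```

A simple moving average (SMA) over the last few classes would work similarly; the EMA just weights recent classes more heavily.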

kladner 2012-03-09 14:22

[QUOTE=bcp19;292404]FYI, without completing the run, you cannot tell how much increase you have in throughput with 4 running compared to 1, 2, or 3.[/QUOTE]

Oops. I guess I didn't fully grasp what was significant information. I was mainly going for Time/Class readings once Sieve Primes had more or less stabilized.

@LaurV - Yeah. It's difficult to impossible to prevent Windows from doing other stuff, and it's unpredictable which cores Windows will steal time from. I guess if I had completely killed P95 instead of just stopping workers for the cores I was giving to mfaktc, it would have left 2 cores idle and available for the system to mess with.

kjaget 2012-03-09 15:25

[QUOTE=kladner;292423]Oops. I guess I didn't fully grasp what was significant information. I was mainly going for Time/Class readings once Sieve Primes had more or less stabilized.[/QUOTE]

Cool, that's more than enough. There are 970 classes tested per exponent, so simple multiplication will get you the theoretical run time when nothing else is stealing CPU & GPU time from the system.

But since the number of classes is constant, it's easier to compare time/class and avoid doing extra math. So the average time/class for your examples:

1 cpu = 19.9 sec/class
2 cpus = 11.0 (80% speedup vs 1 CPU)
3 cpus = 9.27 (19% speedup vs 2 CPUs)
4 cpus = 8.82 (5.2% speedup vs 3 CPUs)

The 1 cpu version is just grabbed from the timing. The tough part is that this timing jumps around a bit, so 3 sig digits might be pushing it - if you left the computer totally idle for half an hour I bet they'd stabilize, but what we have here is good enough.

The N cpu calculation is 1/(1/t1 + 1/t2 + 1/t3 + ... + 1/tN). Basically you're converting to a classes/sec throughput rate, adding that up across instances, and then converting back to a time/class.
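kjaget's combination formula is easy to check numerically (a sketch; the four per-instance times are illustrative, loosely based on the ~35 s/class readings bcp19 quoted above):

```python
def combined_time_per_class(instance_times):
    """Combine per-instance time/class values into an aggregate figure.

    Each instance contributes 1/t classes per second; summing those
    rates and inverting gives the effective time/class of the whole
    system: 1 / (1/t1 + 1/t2 + ... + 1/tN).
    """
    return 1.0 / sum(1.0 / t for t in instance_times)

# Four mfaktc instances, each taking roughly 35 s/class:
print(round(combined_time_per_class([35.9, 35.3, 34.9, 35.2]), 2))
```

With those illustrative inputs the combined figure lands near the 8.82 s/class quoted for 4 cpus above.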

Using the formula posted a few pages back, each run of 46,166,291 from 68-72 gives 19.4 GHz-days. Figuring the throughput is just converting to exponents/day: sec/class * 970 = run time per exponent in seconds; convert to hours by dividing by 3600 seconds/hour; then 24/(hours per exponent) = exponents per day.

1 CPU = 4.76exp/day = 86.9 GHz-days / day
2 CPUs = 8.08 exp/day = 157 GHz-days / day
3 CPUs = 9.61 exp/day = 187 GHz-days / day
4 CPUs = 10.1 exp/day = 196 GHz-days / day

Since all of these are just scaling by a constant, the percentage difference between the values is the same as above. But it gives a more concrete example of how many GHz-days/day you're giving up in exchange for doing something else with the extra CPUs.
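The conversion described above can be scripted directly (a sketch using the figures from this post: 970 classes per exponent and 19.4 GHz-days per 68-72 run; kjaget notes below the class count could actually be 960 or 961):

```python
CLASSES_PER_EXPONENT = 970    # kjaget's estimate; could be 960 or 961
GHZ_DAYS_PER_EXPONENT = 19.4  # per 68-72 run of a ~46.1M exponent

def exponents_per_day(sec_per_class):
    """Convert an average time/class into exponents cleared per day."""
    sec_per_exponent = sec_per_class * CLASSES_PER_EXPONENT
    return 86400.0 / sec_per_exponent

def ghz_days_per_day(sec_per_class):
    """Convert an average time/class into GHz-days of credit per day."""
    return exponents_per_day(sec_per_class) * GHZ_DAYS_PER_EXPONENT

# Reproduce the multi-CPU rows of the table above:
for n_cpus, t in [(2, 11.0), (3, 9.27), (4, 8.82)]:
    print(n_cpus, round(exponents_per_day(t), 2), round(ghz_days_per_day(t), 1))
```

The computed values land within rounding of the table above (e.g. 8.82 s/class gives about 10.1 exp/day and 196 GHz-days/day).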

Your biggest gain is going from 1 to 2 CPUs, since that's where you finally max out the GPU. You don't get quite a 2x scaling because it takes less than the full second CPU to max the GPU. Since there's then extra CPU power, SievePrimes increases to add extra load to the CPU, balancing it against the work the GPU is doing.

Moving to 3 is a smaller gain - here the increase is strictly from increases in SievePrimes reducing the candidates per class that the GPU has to test. This increase may or may not be worth it - you're trading 30 GHz-days/day of TF for 5-8(?) of LL results. Depends on how much you value each type of result, along with lots of other factors.

Same for the 3-4 CPU jump, except that the increase is smaller. If all you care about is max GHz-days/day from any source it makes sense (barely), but that's not the only way to decide this. I have a similar situation (adding the 4th core gives me ~9% better throughput) but decided to leave at least one core to do LL/DCs, since I want to give balanced results from all the work types.

ETA - my guess on classes/exponent might be off. Looking at the code it might be 960 or 961. That changes the numbers here by ~1%, but all in the same way, so the percentage changes are the same. Considering the noise in the timing data, it's not a huge deal, but hopefully someone who understands the math better than me (i.e. most anyone :) ) can help.

kladner 2012-03-09 15:39

Thanks for the explanation. It was an interesting exercise. As mentioned, I can't really afford to make this machine into a P95/mfaktc slave. At some point, I will probably experiment with replacing the second mfaktc with CUDALucas, thus regaining most of a core for other uses. (I think?) It will be interesting to see how that affects GIMPS performance vs. overall usability.

I'm just watching the conversation on the progressive versions of CL to see when there is a general consensus on reliability of results.

James Heinrich 2012-03-09 16:12

[QUOTE=kladner;292427]I will probably experiment with replacing the second mfaktc with CUDALucas, thus regaining most of a core for other uses. (I think?) It will be interesting to see how that affects GIMPS performance vs. overall usability.[/QUOTE]I just started experimenting with CUDALucas yesterday. First impressions: it uses zero CPU, but the GPU usage is more aggressive than mfaktc's. Normal Windows usage is fine, but I can't watch even DVD-quality video smoothly with CUDALucas, whereas it's only 1080p video that I have to switch mfaktc off for. Most likely I'll go back to mfaktc, partly for usability, but also because the extra two cores don't scale so well with the new AVX code in Prime95 (iteration times when running 6 workers are significantly slower than with 4 workers).

kladner 2012-03-09 16:26

Thanks for the info, James. It's good to know how it works out for others. It might not hit me as hard, since YouTube and low-res .wmv's are just about the only videos I watch on the computer.

Xyzzy 2012-03-12 04:56

A cute post from a few years ago:

[url]http://www.mersenneforum.org/showpost.php?p=10266&postcount=12[/url]

TheJudger 2012-03-23 23:07

I guess I need to buy a GTX 6[78]0... ;)

LaurV 2012-03-24 05:33

[QUOTE=TheJudger;293953]I guess I need to buy a GTX 6[78]0... ;)[/QUOTE]
Do you mean 6(or 7, or 8)80 ? :smile:

bcp19 2012-03-24 05:34

[QUOTE=LaurV;293999]Do you mean 6(or 7, or 8)80 ? :smile:[/QUOTE]

More likely 670/680 (even though the 670 is not yet out)

ET_ 2012-03-24 09:55

[QUOTE=bcp19;294000]More likely 670/680 (even tho 670 is not yet out)[/QUOTE]

I'd wait for a GK110...

E.T.

TheJudger 2012-03-24 12:40

[QUOTE=LaurV;293999]Do you mean 6(or 7, or 8)80 ? :smile:[/QUOTE]
670 or 680. Use your favorite search engine and ask for "regular expressions".

LaurV 2012-03-24 15:37

[QUOTE=TheJudger;294023]670 or 680. Use your favorite search engine and ask for "regular expressions".[/QUOTE]
:razz:

BigBrother 2012-03-26 17:28

[QUOTE=TheJudger;293953]I guess I need to buy a GTX 6[78]0... ;)[/QUOTE]

Well, mfaktc works on my brand new GTX680 after compiling it with CUDA Toolkit 4.2 and sm_30.

bcp19 2012-03-26 18:15

[QUOTE=BigBrother;294268]Well, mfaktc works on my brand new GTX680 after compiling it with CUDA Toolkit 4.2 and sm_30.[/QUOTE]

Can you give us some exponents run and timings? Edit: forgot % GPU usage and # of instances.

James Heinrich 2012-03-26 18:46

[QUOTE=bcp19;294279]Can you give us some exponents run and timings? Edit: forgot % GPU usage and # of instances.[/QUOTE]For my purposes of updating the chart at [url]http://mersenne-aries.sili.net/mfaktc.php[/url] can I request that you run through an exponent on a [i]single mfaktc instance[/i], even if it doesn't max out your GPU. And let me know:
* what assignment (e.g. Factor=123456789,68,70)
* wall-clock runtime to complete exponent (e.g. 1h23m45s)
* average GPU usage (e.g. 56%)
* average SievePrimes value

BigBrother 2012-03-26 19:01

[QUOTE=James Heinrich;294293]For my purposes of updating the chart at [url]http://mersenne-aries.sili.net/mfaktc.php[/url] can I request that you run through an exponent on a [i]single mfaktc instance[/i], even if it doesn't max out your GPU. And let me know:
* what assignment (e.g. Factor=123456789,68,70)
* wall-clock runtime to complete exponent (e.g. 1h23m45s)
* average GPU usage (e.g. 56%)
* average SievePrimes value[/QUOTE]

I'm running a test right now. Is there a precise way to measure average GPU usage or do I just have to make an educated guess using GPU-Z?

James Heinrich 2012-03-26 19:12

[QUOTE=BigBrother;294297]I'm running a test right now. Is there a precise way to measure average GPU usage or do I just have to make an educated guess using GPU-Z?[/QUOTE]GPU-Z eyeballed is perfectly fine.

BigBrother 2012-03-26 20:05

[QUOTE=James Heinrich;294293]For my purposes of updating the chart at [url]http://mersenne-aries.sili.net/mfaktc.php[/url] can I request that you run through an exponent on a [i]single mfaktc instance[/i], even if it doesn't max out your GPU. And let me know:
* what assignment (e.g. Factor=123456789,68,70)
* wall-clock runtime to complete exponent (e.g. 1h23m45s)
* average GPU usage (e.g. 56%)
* average SievePrimes value[/QUOTE]

* Factor=N/A,55504619,70,71
* 0h 40m 47s
* 74% GPU usage
* SievePrimes=5000 throughout

firejuggler 2012-03-26 20:09

So around 35 a day, or 150 GHz-days/day. One more detail, if possible: which CPU?

BigBrother 2012-03-26 20:22

[QUOTE=firejuggler;294308]so around 35 a day, 150 GHz day/day. one more precision if possible : which cpu?[/QUOTE]

One core of an i5-2500K @ 4.2 GHz, no other processes running. Only single-channel memory though; don't know if that matters much in this case.

firejuggler 2012-03-26 20:23

I was just wondering if your 680 was 'underfed'.

bcp19 2012-03-26 20:25

[QUOTE=firejuggler;294308]so around 35 a day, 150 GHz day/day. one more precision if possible : which cpu?[/QUOTE]

You forgot the 74% load, which would take it to a hair over 200, or a bit less than what a GTX 470 can do.

Edit: With that performance, if I were BB, I think I'd take the card and sell it on EBAY, since the scarcity seems to be driving up prices.

BigBrother 2012-03-26 20:38

[QUOTE=bcp19;294311]You forgot the 74% load, which would take it to a hair over 200, or a bit less than what a GTX 470 can do.

Edit: With that performance, if I were BB, I think I'd take the card and sell it on EBAY, since the scarcity seems to be driving up prices.[/QUOTE]

I primarily bought it for gaming, so I'm not really disappointed :)

James Heinrich 2012-03-26 21:32

[QUOTE=bcp19;294311]You forgot the 74% load, which would take it to a hair over 200, or a bit less than what a GTX 470 can do.

Edit: With that performance, if I were BB, I think I'd take the card and sell it on EBAY, since the scarcity seems to be driving up prices.[/QUOTE]I've updated [url]http://mersenne-aries.sili.net/mfaktc.php[/url] for GTX 680 and compute 3.0

GTX 680 performance is horrible for mfaktc and CUDALucas. Updating my [url=http://www.mersenneforum.org/showpost.php?p=294283&postcount=1126]previous post[/url], relative performance of various compute versions using 2.1 (e.g. GTX 560) as a baseline:

CUDALucas:
compute 1.3 = [color=darkorange]82%[/color]
compute 2.0 = [color=darkgreen]137%[/color]
compute 2.1 = [color=blue]100%[/color]
compute 3.0 = [color=orangered]56%[/color]

mfaktc:
compute 1.3 = [color=orangered]54%[/color]
compute 2.0 = [color=limegreen]150%[/color]
compute 2.1 = [color=blue]100%[/color]
compute 3.0 = [color=red]33%[/color]

James Heinrich 2012-03-26 21:35

[QUOTE=BigBrother;294314]I primarily bought it for gaming, so I'm not really disappointed :)[/QUOTE]It also presents a good opportunity for those who want to get (more) into mfaktc / CUDALucas: a number of gamers will be trying to sell off their GTX 570(s) and/or GTX 580(s) at quite reasonable prices so they can upgrade to a GTX 680. A used GTX 570 could provide excellent price/performance for GIMPS if you get it at a good price.

Prime95 2012-03-26 21:51

[QUOTE=James Heinrich;294327]GTX 680 performance is horrible for mfaktc and CUDALucas. [/QUOTE]

This is somewhat surprising to me. I guessed CUDALucas would be bad because it does FP64 in 8 special computation units rather than the more numerous CUDA cores (an effective 1/24 FP64 speed). However, I thought mfaktc would use the more numerous CUDA cores to do the 32-bit muls and adds that predominate in TF. Where did I go wrong?

bcp19 2012-03-26 22:27

[QUOTE=James Heinrich;294328]It also presents a good opportunity for those who want to get (more) into mfaktc / CUDALucas: a number of gamers will be trying to sell off their GTX 570(s) and/or GTX 580(s) at quite reasonable prices so they can upgrade to a GTX 680. A used GTX 570 could provide excellent price/performance for GIMPS if you get it at a good price.[/QUOTE]

I just picked up a GTX 480 for $125 :D

Dubslow 2012-03-26 22:34

[QUOTE=bcp19;294332]I just picked up a GTX 480 for $125 :D[/QUOTE]

Newegg has new 580s for <$400

[url]http://www.newegg.com/Product/Product.aspx?Item=N82E16814162092[/url]
[url]http://www.newegg.com/Product/Product.aspx?Item=N82E16814162073[/url]

Batalov 2012-03-26 22:47

I'd stay away from Galaxy or Zotac...

Dubslow 2012-03-26 22:54

The reviews seemed good. You've had bad experiences?

James Heinrich 2012-03-26 22:55

[QUOTE=Batalov;294337]I'd stay away from Galaxy or Zotac...[/QUOTE]For what it's worth, my 8800GT is from Galaxy and it's performed admirably for the last 4+ years (it continues to run cool and quiet and it churns out a modest amount of mfaktc).

James Heinrich 2012-03-26 23:02

I've updated [url]http://mersenne-aries.sili.net/cudalucas.php[/url] such that if you click any CPU model name down the left, it'll give you a chart of breakeven points between mfaktc TF and CUDALucas L-L (ignoring CPU entirely, including the CPU cores that CUDALucas [i]doesn't[/i] use). Cutoff points only vary by compute version (e.g. 2.1 vs 2.0 = GTX 570 vs GTX 560), but they do vary a fair bit (due to relative performance differences between mfatkc and CUDALucas, see [url=http://www.mersenneforum.org/showpost.php?p=294327&postcount=1677]post #1677[/url] above).

chalsall 2012-03-26 23:06

Thanks very much for doing this James.

And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests?

Nice to have hard data, rather than a gut feel.... :smile:

axn 2012-03-26 23:16

[QUOTE=Prime95;294330]This is somewhat surprising to me. I guessed CUDALucas would be bad because it does FP64 in 8 special computation units rather than the more numerous CUDA cores (an effective 1/24 FP64 speed). However, I thought mfaktc would use the more numerous CUDA cores to do the 32-bit muls and adds that predominate in TF. Where did I go wrong?[/QUOTE]

I shall add my own surprise to yours :shock: Perhaps mfaktc needs 680-specific optimizations?

James Heinrich 2012-03-26 23:38

[QUOTE=chalsall;294343]And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests?[/QUOTE]Correct. It's comparing the wall-clock runtime to run a single L-L on the exponent using CUDALucas vs the time to TF to said bit level, combined with the probability [i]above[/i] Prime95 default TF levels. If you mouseover the various cells it gives some extra info. The number displayed is a percentage of sorts: 100 means that it's the breakeven point; below 100 TF is more likely to clear the exponent faster; above 100 then LL is likely to clear it faster.
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.

bcp19 2012-03-27 00:21

[QUOTE=James Heinrich;294345]Correct. It's comparing the wall-clock runtime to run a single L-L on the exponent using CUDALucas vs the time to TF to said bit level, combined with the probability [I]above[/I] Prime95 default TF levels. If you mouseover the various cells it gives some extra info. The number displayed is a percentage of sorts: 100 means that it's the breakeven point; below 100 TF is more likely to clear the exponent faster; above 100 then LL is likely to clear it faster.
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.[/QUOTE]

There are many things that can 'skew' the data. A mid-high end CPU can saturate a mid-high end GPU with a single core. That single core (now that AVX has been incorporated) can produce more output than an entire Core 2 Quad, but if you devote all 4 cores of said Quad to the same GPU and let SievePrimes adjust as needed, you now have 130-180% GPU throughput for the same 'cost' as the high-end core. Using older machines in this manner, you could theoretically push an extra bit or 2 beyond current levels.

axn 2012-03-27 00:39

[QUOTE=James Heinrich;294345]
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.[/QUOTE]

Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right.

James Heinrich 2012-03-27 01:28

[QUOTE=axn;294348]Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right.[/QUOTE]That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful.

axn 2012-03-27 02:26

[QUOTE=James Heinrich;294349]That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful.[/QUOTE]

You're nearly there. Rather than using the cum.prob., just use the probability for the given bit depth. You should see a rough doubling of the % with every bit.
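As a toy illustration of axn's point (all numbers and the 1/bit probability rule of thumb are rough assumptions, not James's actual chart data; the model also ignores that a found factor saves the double-check as well):

```python
def tf_breakeven_percent(bit, tf_time_at_bit, ll_time):
    """Toy TF-vs-LL breakeven metric, 100 = breakeven.

    Rule of thumb: the chance of a Mersenne number having a factor
    between 2^(bit-1) and 2^bit is roughly 1/bit. TF of this bit level
    pays off while tf_time < prob * ll_time; express that as a percent.
    """
    prob = 1.0 / bit  # rough incremental (not cumulative) probability
    return 100.0 * tf_time_at_bit / (prob * ll_time)

# Illustrative: 1h to TF bit 70, doubling per bit; 200h for one L-L test.
ll_hours = 200.0
tf_hours = 1.0
for bit in range(70, 75):
    print(bit, round(tf_breakeven_percent(bit, tf_hours, ll_hours), 1))
    tf_hours *= 2.0
```

Since the TF time doubles per bit while the incremental probability barely moves, the percentage roughly doubles with every bit, as axn predicts.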

LaurV 2012-03-27 02:29

[QUOTE=Prime95;294330]Where did I go wrong?[/QUOTE]
You did not. As I said before, mfaktc would need not only recompiling but a bit of re-thinking too, to take advantage of the numerous cores instead of the double-speed shader clock, which is now gone.

rcv 2012-03-27 05:19

[QUOTE=James Heinrich;294327]CUDALucas:
compute 1.3 = [COLOR=darkorange]82%[/COLOR]
compute 2.0 = [COLOR=darkgreen]137%[/COLOR]
compute 2.1 = [COLOR=blue]100%[/COLOR]
compute 3.0 = [COLOR=orangered]56%[/COLOR]

mfaktc:
compute 1.3 = [COLOR=orangered]54%[/COLOR]
compute 2.0 = [COLOR=limegreen]150%[/COLOR]
compute 2.1 = [COLOR=blue]100%[/COLOR]
compute 3.0 = [COLOR=red]33%[/COLOR][/QUOTE]

Here's another way to look at this. Using the data James posted and the raw attributes of the various chips, I compare the GTX 680 versus GTX 570 versus GTX 560 Ti:


Number of multiprocessors: 8 / 15 / 8
Cores per multiprocessor: 192 / 32 / 48
Total cores: 1536 / 480 / 384
Base clock rates (MHz): 1006 / 732 / 822.5
Base clock rate * #multiprocessors: 8048 / 10960 / 6580

From James' data:
mfaktc gigahertz days per day: 206 / 281 / 168.4

If we define "efficiency" as GHz-days/day divided by (clock rate * #multiprocessors):
mfaktc efficiency per multiprocessor: [COLOR=SeaGreen]29.60[/COLOR] / [COLOR=RoyalBlue]29.59[/COLOR] / [COLOR=Red]29.59[/COLOR]

From James' data:
cudalucas gigahertz days per day: 28.4 / 31.5 / 20.6

cudalucas efficiency per multiprocessor: [COLOR=Lime] [COLOR=SeaGreen]3.5[/COLOR][/COLOR] / [COLOR=RoyalBlue]2.9[/COLOR] / [COLOR=Red]3.1[/COLOR]

By this metric, the performance of cudalucas on the new 680 is a bit better than I expected. (Maybe the increased memory bandwidth is especially beneficial to cudalucas.)

But, by this metric, the performance of mfaktc on the new 680 is woefully below what I expected. Let me also remind everybody that Oliver didn't compile mfaktc to run the benchmarks. I wouldn't be a bit surprised if a trivial change could yield twice the performance. But until someone with the know-how and the hardware can run the profiler on a 680, we shouldn't assume these are *final* benchmarks.
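rcv's per-multiprocessor efficiency figure can be recomputed from the numbers in this post (a sketch; the absolute scale comes out different from rcv's 29.6, he may have normalized by a different clock, but the point survives: the mfaktc ratio is essentially constant across the three cards):

```python
# (SM count, base clock MHz, mfaktc GHz-days/day) from the post above:
cards = {
    "GTX 680":    (8, 1006.0, 206.0),
    "GTX 570":    (15, 732.0, 281.0),
    "GTX 560 Ti": (8, 822.5, 168.4),
}

# Efficiency = throughput per (multiprocessor * MHz):
efficiency = {
    name: ghd / (sms * clock)
    for name, (sms, clock, ghd) in cards.items()
}

for name, eff in efficiency.items():
    print(name, round(eff, 4))

# The spread is well under 1%: per-SM, per-clock mfaktc throughput is
# basically identical, i.e. the 680's extra cores per SM buy it nothing.
vals = list(efficiency.values())
print(round((max(vals) - min(vals)) / min(vals) * 100, 2))
```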

BigBrother 2012-03-27 08:05

I can saturate (100% GPU) my GTX680 when running two instances of mfaktc.

[IMG]http://gpuz.techpowerup.com/12/03/27/a9a.png[/IMG]

Note that the GPU core clock is constantly boosted to ~1100 MHz and the power consumption hovers around 72% of TDP, which could mean that the performance/Watt for this chip is higher than in James' calculations. This power-consumption sensor seems to be a new feature on this chip; I've never seen it displayed in GPU-Z before on any other card.

BigBrother 2012-03-27 11:13

It turns out that I plugged my brand new shiny bling-bling GTX680 into a PCI-E 2.0 x8 slot instead of a PCI-E 2.0 x16 slot... :blush: I'll change it tonight, and also try to fix a crazy problem that causes my motherboard to refuse more than one memory module, forcing it to use single channel DDR3. I don't expect radically improved CUDA performance, but we'll see.

msft 2012-03-27 11:44

[QUOTE=Prime95;294330]This is somewhat surprising to me. However, I thought mfaktc would use the more numerous CUDA cores to do the 32-bit muls and adds that predominate in TF. Where did I go wrong?[/QUOTE]
[URL="http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312"]http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312[/URL]
[QUOTE]
* Relative to the throughput of single precision multiply-add, the throughput of integer shifts, integer comparison, and integer multiplication is lower than before.
[/QUOTE]
Is that the answer?

nucleon 2012-03-27 12:33

[QUOTE=TheJudger;293953]I guess I need to buy a GTX 6[78]0... ;)[/QUOTE]

I'd be curious if you can weave some more TheJudger magic to get more out of the GTX680. :)

Now with some performance figures out, I'm pretty disappointed. I was hoping to buy some GTX680s to replace some hardware here to reduce my power bill.

It doesn't even surpass what I have on performance per watt metrics.

-- Craig

James Heinrich 2012-03-27 14:13

[QUOTE=axn;294351]You're nearly there. Rather than using the cum.prob., just use the probability for the given bit depth. You should see a rough doubling of the % with every bit.[/QUOTE]Thanks. I didn't have my brain screwed on quite straight yesterday, but I think I've fixed it so it makes sense now.
[url]http://mersenne-aries.sili.net/cudalucas.php?model=13[/url]

BigBrother 2012-03-27 14:37

[QUOTE=msft;294375][URL="http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312"]http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312[/URL]

Is that the answer?[/QUOTE]

Some exact numbers: (Operations per Clock Cycle per Multiprocessor)
[CODE]
                          CC 1.x      CC 2.0   CC 2.1   CC 3.0

32-bit floating-point
add, multiply,               8          32       48      192
multiply-add

64-bit floating-point
add, multiply,               1          16        4        8
multiply-add

32-bit integer add          10          32       48      168

32-bit integer multiply,
multiply-add, sum of     Multiple       16       16       32
absolute difference    instructions
[/CODE]
From table 5-1 in the CUDA C Programming Guide Version 4.2

Not much love for 32-bit integer multiply & multiply-add, compared to 32-bit floating point operations.

axn 2012-03-27 15:01

[QUOTE=James Heinrich;294383]Thanks. I didn't have my brain screwed on quite straight yesterday, but I think I've fixed it so it makes sense now.
[url]http://mersenne-aries.sili.net/cudalucas.php?model=13[/url][/QUOTE]

Much better. Now, if we could just drill down an individual row to 1M granularity... :whistle:

kladner 2012-03-27 15:12

[QUOTE=James Heinrich;294383]I didn't have my brain screwed on quite straight yesterday[URL="http://mersenne-aries.sili.net/cudalucas.php?model=13"][/URL][/QUOTE]

That page has really come a long way in a short time. Another great tool!
Thanks for doing it.

BTW: I wasn't thinking too well, either, when I ran the CuLu benchmarks. Sorry for the incomplete data, James.

James Heinrich 2012-03-27 16:09

[QUOTE=axn;294388]Much better. Now, if we could just drill down an individual row to 1M granularity... :whistle:[/QUOTE]You can if you click the zoom in/out links I just added. :smile:

msft 2012-03-27 16:14

[QUOTE=BigBrother;294384]Some exact numbers: (Operations per Clock Cycle per Multiprocessor)
[CODE]
CC 1.x CC 2.0 CC 2.1 CC 3.0
32-bit integer
multiply,
multiply-add, Multiple 16 16 32
sum of absolute instructions
difference
[/CODE]
[/QUOTE]
The GTX 580 has 16 multiprocessors; the GTX 680 has 8.
Each GTX 680 multiprocessor has 192 cores, but only 32 of them can execute a 32-bit integer multiply per clock.
So lots of threads wait to execute.
[CODE]
                  CC 1.x   CC 2.0   CC 2.1   CC 3.0

32-bit integer
shift, compare       8       16       16        8
[/CODE]

BigBrother 2012-03-27 17:52

[QUOTE=BigBrother;294372]It turns out that I plugged my brand new shiny bling-bling GTX680 into a PCI-E 2.0 x8 slot instead of a PCI-E 2.0 x16 slot... :blush: I'll change it tonight, and also try to fix a crazy problem that causes my motherboard to refuse more than one memory module, forcing it to use single channel DDR3. I don't expect radically improved CUDA performance, but we'll see.[/QUOTE]

Well, the card is now inserted into a PCI-E 2.0 x16 slot, and my brain-surgery skills allowed me to fix a bent pin on the CPU socket, so my memory is back to dual channel again. :cool:

One instance of mfaktc is now taking ~70% GPU instead of the 74% I reported yesterday, and nVidia's Visual Profiler shows transfer rates of 6 GB/s instead of 3 GB/s, but since the amount of data to transfer is relatively small, there's no earth-shattering improvement. I could run the same benchmark I did yesterday again if James would like me to do that.

axn 2012-03-27 17:56

[QUOTE=BigBrother;294415]Well, the card is now inserted into a PCI-E 2.0 x16 slot, and my brain-surgery skills allowed me to fix a bent pin on the CPU socket, so my memory is back to dual channel again. :cool:

One instance of mfaktc is now taking ~70% GPU instead of the 74% I reported yesterday, and nVidia's Visual Profiler shows transfer rates of 6 GB/s instead of 3 GB/s, but since the amount of data to transfer is relatively small, there's no earth-shattering improvement. I could run the same benchmark I did yesterday again if James would like me to do that.[/QUOTE]

Improved memory should have a more pronounced impact on CUDALucas.

Dubslow 2012-03-27 20:20

[QUOTE=James Heinrich;294396]You can if you click the zoom in/out links I just added. :smile:[/QUOTE]

Would it be possible to somehow overlay the current TF bounds ([url]http://www.mersenne.org/various/math.php[/url], plus 3 bits) on top of the chart? It would be so pretty :smile:

Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have highest weighting, 4xx next highest, and then everything else a lower weighting.

Edit: Perhaps a mod should move all the posts relating to James' new page to a separate "TF vs. LL" thread in this forum?

James Heinrich 2012-03-27 20:48

[QUOTE=Dubslow;294429]Would it be possible to somehow overlay the current TF bounds ([url]http://www.mersenne.org/various/math.php[/url], plus 3 bits) on top of the chart? It would be so pretty :smile:[/quote]I've put the current PrimeNet CPU-TF limits on the chart as orange.

[QUOTE=Dubslow;294429]Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have highest weighting, 4xx next highest, and then everything else a lower weighting.[/QUOTE]Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas vs mfaktc performance may mean they need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].

Dubslow 2012-03-27 21:09

[QUOTE=James Heinrich;294435]I've put the current PrimeNet CPU-TF limits on the chart as orange.[/quote]...pretty :smile:
[QUOTE=James Heinrich;294435]
Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas vs mfaktc performance may mean they need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].[/QUOTE]
Heh, I didn't realize it was the same numbers, but now I do. 2.1 has slightly more conservative TF bounds but otherwise matches up fairly well with 2.0. Now, perhaps this should be put in the GPU272 forum, but I think that project should retain PrimeNet's TF bounds, unless you, James, can modify assignment rules (somehow I doubt that). If we do that, then +3 bits is the conservative goal and +4 bits would be aggressive TF bounds. I vote for the former, because many of the cyan cells are in fact well above 200, and it is GIMPS, not GIMFS, as petrw1 has pointed out elsewhere.

rcv 2012-03-27 21:42

[QUOTE=msft;294375][URL]http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312[/URL]

It is answer?[/QUOTE]
I swear I searched NVIDIA's Web site on Thursday and Friday after the announcement, but found no new technical details. [Just pointers to new drivers.]

The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow. And mfaktc does use C-style shifts in the inner loop. I still suspect there may be an occupancy issue that is halving the performance.

@BigBrother: Are you up to running the [FONT=Courier New]nvvp[/FONT] profiler?

@msft: Thanks for the link!!!

Prime95 2012-03-28 00:51

[QUOTE=rcv;294438]The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow.[/QUOTE]

I'd say they are about 20 times slower than they should be!! 32-bit muls are much faster than shift lefts! Repeated adds are much faster than small shift lefts. Algorithms may have to change to avoid shift rights.

Prime95 2012-03-28 03:09

I also noticed that type conversion is dreadfully slow. Try to minimize these.

Does anyone know if add.cc runs on 168 cores, or does it get restricted to 32 or even worse 8 cores??

Could bfe (bit field extract) be used as a replacement for the slow shift right?

In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?

rcv 2012-03-28 04:13

After years of coaching developers to change their multiplies to shifts, NVIDIA may now be coaching developers to change their shifts back to multiplies. Even a shift right (by a constant) might be performed by a mul.hi instruction. How ironic.
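The identity behind that trick is easy to check. Here's a sketch in plain Python modeling 32-bit unsigned arithmetic (illustrative only, not CUDA code, and a compiler's actual instruction choice may differ):

```python
# Sketch of the shift-right-via-mul.hi trick on 32-bit unsigned values:
# x >> k equals the high 32 bits of x * 2^(32-k), for 0 < k < 32.

def shr_via_mulhi(x: int, k: int) -> int:
    """Compute x >> k (0 < k < 32) as the high word of a multiply."""
    assert 0 < k < 32 and 0 <= x < 2**32
    multiplier = 2 ** (32 - k)      # constant a compiler could precompute
    return (x * multiplier) >> 32   # "mul.hi": keep only the upper 32 bits

# The identity holds for every 32-bit input:
for x in (0, 1, 0xDEADBEEF, 0xFFFFFFFF):
    for k in (1, 7, 31):
        assert shr_via_mulhi(x, k) == x >> k
```

On hardware where 32-bit multiplies are fast and shifts are slow, as reported for the 680, this substitution trades one instruction class for the other.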

[QUOTE=Prime95;294466]In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?[/QUOTE]

I've not done it, but I've read that NVIDIA provides a disassembler that shows the honest-and-true (post PTX) machine code that is executed. But, as far as I know, NVIDIA doesn't document the low-level machine code.

I suppose that would answer questions such as "does the bit-field-extract PTX macro generate a single instruction or a series of instructions?"

LaurV 2012-03-28 04:44

[QUOTE=James Heinrich;294435]I've put the current PrimeNet CPU-TF limits on the chart as orange.

Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas-vs-mfaktc performance may mean some cards need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [URL="http://mersenne-aries.sili.net/cudalucas.php?model=467"]3.0[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=7"]2.1[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=13"]2.0[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=15"]1.3[/URL].[/QUOTE]

Man, those graphics are wonderful! And they perfectly match my cards and my calculations (despite the fact that I never submitted results to your site :smile:). Kotgw!

apsen 2012-03-28 19:52

[QUOTE=James Heinrich;294435]But there are only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].[/QUOTE]

What are the numbers in the cells?

James Heinrich 2012-03-28 20:16

[QUOTE=apsen;294534]What are the numbers in the cells?[/QUOTE]As I (probably poorly) tried to explain in the text above the graph: 100 means that the time spent on TF, weighted by the probability of finding a factor, gives an equal chance of clearing an exponent either by TF to that bit level or by L-L'ing it. 200 = double 100, which factors in the fact that 2x L-L tests are needed. It does not factor in the lesser amounts of triple-checks and P-1 testing that might be saved with a factor. My interpretation is that TF should be done to the 200 mark, or a little bit higher. Since "200" will rarely fall exactly on an integer bitlevel (the actual breakeven point for "100" is in the rightmost column), TF to the next bit level up from that is appropriate when 2 L-Ls would be saved. If only 1 L-L would be saved, then TF to 1 bitlevel less (half the TF effort).

chalsall 2012-03-28 20:43

[QUOTE=James Heinrich;294540]My interpretation is that TF should be done to the 200 mark, or a little bit higher. Since "200" will rarely fall exactly on an integer bitlevel (the actual breakeven point for "100" is in the rightmost column), TF to the next bit level up from that is appropriate when 2 L-Ls would be saved. If only 1 L-L would be saved, then TF to 1 bitlevel less (half the TF effort).[/QUOTE]

Thanks again for putting this together James. It has helped bring some solid answers to what was hotly debated (read: just how far should we GPU TF).

Would it be possible to have another right-most column for "2 L-L"? I'm guessing it isn't exactly "1 L-L" + 1.0.

James Heinrich 2012-03-28 21:48

[QUOTE=chalsall;294543]It has helped bring some solid answers to what was hotly debated (read: just how far should we GPU TF).[/quote]Remember this is only a partial answer. This is based on the assumption that CPUs do no useful work, and is purely between GPU-TF and GPU-LL. It completely sidesteps the whole debate of balance between GPU-TF and CPU-LL, and totally ignores P-1 (which is currently CPU-only, but that too may change soon).

[QUOTE=chalsall;294543]Would it be possible to have another right-most column for "2 L-L"? I'm guessing it isn't exactly "1 L-L" + 1.0.[/QUOTE]I could (and have), but it would be (is) exactly +1.0

Dubslow 2012-03-28 21:51

[QUOTE=James Heinrich;294548]Remember this is only a partial answer. This is based on the assumption that CPUs do no useful work, and is purely between GPU-TF and GPU-LL. It completely sidesteps the whole debate of balance between GPU-TF and CPU-LL, and totally ignores P-1 (which is currently CPU-only, but that too may change soon).[/QUOTE]

The hand-wavy method to deal with those is just to take one bit off, and then you wind up with more-or-less what we're doing now.

chalsall 2012-03-28 22:21

[QUOTE=Dubslow;294549]The hand-wavy method to deal with those is just to take one bit off, and then you wind up with more-or-less what we're doing now.[/QUOTE]

Is it? Or it is the other way around (add one bit)?

Given that there are [I][U]many[/U][/I] more CPUs than GPUs participating in GIMPS, I think the argument can be made that doing TFing past the break-even point for each GPU makes sense for [I][U]GIMPS[/U][/I], even if not for the individual participant.

But, as always, people are encouraged to do whatever they want (other than poaching or hoarding). It's their hardware, time and electricity.

flashjh 2012-03-28 22:50

[QUOTE=chalsall;294553]Is it? Or it is the other way around (add one bit)?

Given that there are [I][U]many[/U][/I] more CPUs than GPUs participating in GIMPS, I think the argument can be made that doing TFing past the break-even point for each GPU makes sense for [I][U]GIMPS[/U][/I], even if not for the individual participant.

But, as always, people are encouraged to do whatever they want (other than poaching or hoarding). It's their hardware, time and electricity.[/QUOTE]
Based on the charts it looks like we should GPU TF almost all exponents one more level.

Prime95 2012-03-28 23:03

[QUOTE=chalsall;294553]Given that there are [I][U]many[/U][/I] more CPUs than GPUs participating in GIMPS, I think the argument can be made that doing TFing past the break-even point for each GPU makes sense for [I][U]GIMPS[/U][/I], even if not for the individual participant.[/QUOTE]

No, going past the breakeven point makes no sense. The GPU will clear more exponents by switching to CUDALucas rather than TFing past the breakeven. The question the GPU owner faces is: do I TF a range that hasn't reached the breakeven, or do I switch to CUDALucas?

How should we modify the chart to take into account the loss of CPU cores? You need to know how much CPU power is lost to keep mfaktc busy. For example, if it takes 2 i7-860 cores, then you'd compare mfaktc's factor-found rate to CUDALucas + 2 i7-860 cores' LL rate. Has anyone tried to gather this kind of data?

James Heinrich 2012-03-28 23:21

[QUOTE=Prime95;294558]...then you'd compare mfaktc's factor-found rate to CUDALucas + 2 i7-860 cores' LL rate.[/QUOTE]GPUs are ridiculously good at TF, even after factoring in the "lost" CPU cores. GPUs are still much faster than CPUs for LL, but less distinctly so. To throw some numbers out, with a GTX 570 and 2 cores of an i7-3930K @ 4.5GHz, I can:
* mfaktc = 281GHz-days/day (TF)
* CUDALucas + Prime95 = 31 + 15 = 46GHz-days/day (LL + LL or P-1)
P-1 still needs doing, and has no GPU option at the moment.

axn 2012-03-28 23:41

[QUOTE=Prime95;294558]No, going past the breakeven point makes no sense. The GPU will clear more exponents by switching to CUDALucas rather than TFing past the breakeven. [/QUOTE]
It is worse than that. The break-even point is calculated vis-a-vis two LL tests. The second of those LL tests is something we'd only get to many, many years later. If the GPU instead focuses on LL, it'd clear twice the number of exponents (compared to TF), thus speeding up the main LL wave even further.

bcp19 2012-03-28 23:59

[QUOTE=Prime95;294558]No, going past the breakeven point makes no sense. The GPU will clear more exponents by switching to CUDALucas rather than TFing past the breakeven. The question the GPU owner faces is: do I TF a range that hasn't reached the breakeven, or do I switch to CUDALucas?

How should we modify the chart to take into account the loss of CPU cores? You need to know how much CPU power is lost to keep mfaktc busy. For example, if it takes 2 i7-860 cores, then you'd compare mfaktc's factor-found rate to CUDALucas + 2 i7-860 cores' LL rate. Has anyone tried to gather this kind of data?[/QUOTE]

I worked on that a bit using my systems. If I were to devote every core and GPU to DC in the 26M range, I could clear, on average, around 5.9 per day. My current active cores total 1.7 DC per day, so I am 'losing' 4.2 DC/day in exchange for ~800 GHzD of TF. Since very few 26-28M exponents still needing an extra bit or two of TF have been showing up lately, most of my work goes toward 29-30M exponents. These exponents are 'worth' 26-30% more credit than the 26M ones, so you might say I am only 'losing' 3.3-3.6 DC/day while still generating the 800GHzD of TF.

CPU/GPU combinations and # of instances are also a factor. I find the Core2 Quads running 3 instances on mid-high end GPUs (450/550/460/560) are 30%-50% more 'efficient' than the AMD x4 and i5/i7. At the 26M level, the quads are easily efficient to ^69 even adding in the 'loss' of 3 cores, while the others would 'break even' about 1/3 of the way between ^68 and ^69. Using 2 instances is actually 10-15% less efficient than 3.

Prime95 2012-03-29 00:34

[QUOTE=James Heinrich;294561]GPUs are ridiculously good at TF, even after factoring in the "lost" CPU cores.[/QUOTE]

I'm not denying that. My point is the lost CPU cores affect your calculation of the breakeven point.

kladner 2012-03-29 01:22

[CODE]CPU/GPU combinations and # of instances are also a factor. I find the Core2 Quads running 3 instances on mid-high end GPUs (450/550/460/560) are 30%-50% more 'efficient' than the AMD x4 and i5/i7.[/CODE]

I have certainly been observing on an x6 AMD that running mfaktc by itself, even a single instance with Priority assigned to a core, is faster than when it is competing with P95-64 running 4x P-1 and 1x LL (all with Priority assigned). Starting P95-64 has more effect on mfaktc than making it compete with CuLu for the GPU.

Dubslow 2012-03-29 02:34

[QUOTE=kladner;294583]
I have certainly been observing on an x6 AMD that running mfaktc by itself, even a single instance with Priority assigned to a core, is faster than when it is competing with P95-64 running 4x P-1 and 1x LL (all with Priority assigned). Starting P95-64 has more effect on mfaktc than making it compete with CuLu for the GPU.[/QUOTE]
I think that's a memory issue, and it's become even more pronounced for me since AVX came out. Without P95, I can get 195-200M avg. rate; with AVX/P95, I get 165-170M, and P95/SSE I previously got 172-175. Anything that either Prime95 or mfaktc can do to reduce memory would be a great gain for me at least, and it seems for many others as well. (George has known that P95 is mem-limited, but it's clear that mfaktc is as well, and I have no idea how much more room for improvement there is in that regard.)

LaurV 2012-03-29 02:35

[QUOTE=Prime95;294558]No, going past the breakeven point makes no sense. The GPU will clear more exponents by switching to CUDALucas rather than TFing past the breakeven. The question the GPU owner faces is: do I TF a range that hasn't reached the breakeven, or do I switch to CUDALucas?

How should we modify the chart to take into account the loss of CPU cores? You need to know how much CPU power is lost to keep mfaktc busy. For example, if it takes 2 i7-860 cores, then you'd compare mfaktc's factor-found rate to CUDALucas + 2 i7-860 cores' LL rate. Has anyone tried to gather this kind of data?[/QUOTE]
I did not gather data from other people (I am not good at doing that :D), but I did the calculations in some post long ago, for my cards and a few others which I tested. At the time I was trying to justify my opinion that TF-ing at DC level makes no sense, and I brought the two CPU cores lost to mfaktc into the argument too. But people jumped on my head, so I gave up.

bcp19 2012-03-29 04:12

[QUOTE=LaurV;294591]I did not gather data from other people (I am not good at doing that :D), but I did the calculations in some post long ago, for my cards and a few others which I tested. At the time I was trying to justify my opinion that TF-ing at DC level makes no sense, and I brought the two CPU cores lost to mfaktc into the argument too. But people jumped on my head, so I gave up.[/QUOTE]

I think part of the problem is that the CPU/GPU combination makes a [B]huge[/B] difference. My 2500 uses a single core to run a 560, that core can do 37% of a 26M DC in a day, while the 560 running CL can do 64%. Running mfaktc, that core puts out around 144GHzD, so this combo gives me ~142.82GHzD for each DC I could have done. Now, my entire Core2 Quad can only do 22.7% of the same DC in a day and the 550 Ti in it can do 41%, but the Quad is kinda screwy. If you set all 4 cores to DC, you can complete 8 DC in the same amount of time as 2 cores could complete 6 while the other 2 cores run mfaktc, I'm sure it has something to do with some sort of shared memory. If I run the Quad with 1 core doing DC and 3 cores mfaktc, I can output almost 230 GHzD for each 'lost' DC. I have since installed a 480 in the 2500 and the 560 in one of the quads. This is the capability of my systems when using a 26M exponent:

2500/480 - 2 cores mfaktc - 149.26GHzD/DC 'lost'
2400/560Ti - 3 cores mfaktc - 159.53
X4 645/460 - 3 cores mfaktc - 172.78
Q6600/560 - 3 cores mfaktc - 204.85
Q8200/550Ti - 3 cores mfaktc - 228.15

Hard to believe, but the older system is actually 50% more efficient than my 'speed demon'.

The 'shared memory' in the quads that messes up running all cores as DC/LL has no such effect on mfaktc, which is what makes the quad so highly efficient compared to newer systems. Also, as I mentioned before, if the above systems with 3 instances are trimmed down to 2, they produce 10-15% less GHzD per 'lost' DC.

kjaget 2012-03-29 16:36

Let me run through an example with my system - a 560ti448 (basically a 570) and overclocked i5-750 system. I'll use M55000000 as an example. I don't have exact measurements but this is more or less correct, I think. OTOH I'm rushing through this on my lunch break so any of the math could be wrong. On the third hand, at least I get the same rough answer that's been shown previously, so that's good (or confirmation bias).

TF on my system uses 3 CPU cores to saturate the GPU. The instances settle down to about 7.85 sec/class or 7540 seconds per exponent (TF from 71 to 72). Since there are 3 of them running, I get 3 results each 7540 seconds, or 1 result every 2510 seconds = 0.7 hours / exponent.

Switching that to LL testing, I get the results of the GPU plus 3 CPU cores. Here I'm using data from mersennaries since I don't have it in front of me, but it should be a reasonable guess. The GPU gives a result every 95.1 hours. Each CPU core gives a result every 675 hours (~28 days per exponent). Since the GPU & CPU rates aren't the same, you have to do 1/(1/GPU rate + core count/CPU rate) to get the average, which = 66.9 hours per exponent.

Assuming we're finding factors 1.12% of the time as shown on the GPUto72 stats, TF takes about 62 hours to find a factor while LLing an exponent takes 67. But since each factor found saves 2 LL tests, and each extra bit level doubles the run time, I should be TFing to one more bit level to make the time for 1 factor roughly the same as the time for 2 LL tests. This is the same 73-bit optimal depth as we've seen calculated by ignoring the CPUs entirely.
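The combined-rate step above can be written out as a small sketch (using the numbers from this post, which describe my particular 560 Ti 448 + i5-750 setup, so treat them as illustrative of the method rather than a general benchmark):

```python
# When one GPU and N CPU cores each produce LL results at different rates,
# the rates add, and the average time per result is the reciprocal of the
# summed rate: 1 / (1/gpu_hours + cores/cpu_hours).

def hours_per_ll_result(gpu_hours: float, cpu_hours: float, cores: int) -> float:
    """Average hours per LL result from one GPU plus `cores` CPU cores."""
    results_per_hour = 1.0 / gpu_hours + cores / cpu_hours
    return 1.0 / results_per_hour

# 95.1 h/result on the GPU, 675 h/result per CPU core:
for cores in (2, 3, 4):
    h = hours_per_ll_result(gpu_hours=95.1, cpu_hours=675.0, cores=cores)
    print(f"{cores} cores: {h:.0f} h/result")
```

With 3 cores this gives the ~67 hours per exponent used above; sweeping 2 to 4 cores reproduces the 74/67/61-hour figures quoted further down in this post.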

Some problems - mfaktc run time scales with exponent size, while P95 scales differently (nlog(n)?) so the decision may be different along the exponent range.

CPUs don't scale linearly when adding more cores to LL testing.

Different CPUs are relatively more or less effective at mfaktc sieving versus LL tests (AVX, etc).

The calculation is sensitive to TF found factor rate. 1% vs 1.1% isn't many extra successes, but it is 10% more of them...

On the plus side, though, since TF scales by 2x each time you increase the bit level a few 10-20% hits on either side don't change the conclusion. My gut is telling me that we could test some of the extremes (CPUs really good and really bad at LL vs mfaktc) and see if we get the same answer. Since 2x performance is large, my guess is there's a good chance the answer is yes which would really simplify life.

Since most of the data I have here is from the mersennaries page, we may be able to plot this stuff out over a wider range of CPU and GPU types. The big piece missing is how many CPU cores it takes to feed various GPUs. But again, going back to the idea that 10-20% each way doesn't matter when compared to 2x hit for each bit level, that might not be as bad as I imagine.

For a quick test, going from 2->3->4 CPU cores takes the LL time from 74 to 67 to 61 hours per exponent. That last one looks like it would move the break even point back to 72 bits factoring (just barely) but adding the 4th CPU to TF work gives me a ~10% better TF rate as well so the overall conclusion doesn't change much. No matter how many CPUs I use, it doesn't move the results enough to justify a 2x jump in TF time one way or the other.

I have no gut feel for whether this holds for faster CPUs. On the one hand, they influence the LL rate a lot more. On the other hand, you need less of them to saturate a GPU so there's less to be gained from moving CPUs from TF to LL.

bcp19 2012-03-29 17:51

[QUOTE=kjaget;294667]Let me run through an example with my system
<snip>

Assuming we're finding factors 1.12% of the time as shown on the GPUto72 stats, TF takes about 62 hours to find a factor while LLing an exponent takes 67. But since each factor found saves 2 LL tests, and each extra bit level doubles the run time, I should be TFing to one more bit level to make the time for 1 factor roughly the same as the time for 2 LL tests. This is the same 73-bit optimal depth as we've seen calculated by ignoring the CPUs entirely.[/QUOTE]

Using the information you supplied, I got the same answer you did, and you are correct, going 1 bit level deeper would be a 'tossup' on your system at that exp level since it is right at the 'breakeven' point. Once you get to 56M-58M though, it starts swinging more in favor of the TF.

A 'balanced' GPU also makes a difference within the same system. The Q6600/560 used to be a Q6600/450, which with 2 cores was at 185.56 and with 3 cores was 184.52. The switch to a 560 was a little over 10% more 'efficient' at 204.85. The 450 was 'too little' GPU, but too much GPU is as bad or worse. I initially put the 480 into the quad, but all 4 cores could not max it out. The 4 cores and the 480 could do 1.43 DC/day, but I calculated I'd only get ~200GD with 4 instances, or 139.8GD/DC 'lost'.

Prime95 2012-03-29 18:27

[QUOTE=bcp19;294683]The Q6600/560 used to be a Q6600/450, which with 2 cores was at 185.56 and with 3 cores was 184.52. The switch to a 560 was a little over 10% more 'efficient' at 204.85....The 4 cores and the 480 could do 1.43 DC/day, but I calculated I'd only get ~200GD with 4 instances, or 139.8GD/DC 'lost'.[/QUOTE]

All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).

For GPU272, we should then set the "official" breakeven point by taking an average of the reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just right in the low-to-mid 50M, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.

James Heinrich 2012-03-29 19:25

[QUOTE=Prime95;294687]But we need more data![/QUOTE]One piece of data that might be relevant (or at least interesting) is the ratio of potential GHz-days/day of your GPU vs the potential GHd/d of the CPU cores required to power it. I don't want to pollute this thread with trivial responses, so if everyone could PM or email me with:
* what GPU you have
* what CPU you have
* how many instances of mfaktc you run (thereby how many CPU cores are used)
* what average GPU usage you get by doing so (should typically be close to 100%).

e.g.: "GTX 570; Core i7-3930K, 2 instances; 98% GPU".

This gives me a GPU-CPU GHd/d ratio of almost 19:1 ([url=http://mersenne-aries.localhost/mfaktc.php]281[/url]/(2*[url=http://mersenne-aries.localhost/throughput.php?cpu1=Intel%28R%29+Core%28TM%29+i7-3930K+CPU+%40+3.20GHz|256|12288&mhz1=4500]7.4[/url])). My theory is that this ratio should be roughly constant and could serve as a basis for including CPU usage into the equation. Once I get a reasonable sample of data I'll post back with what I find.

bcp19 2012-03-29 20:02

[QUOTE=Prime95;294687]All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).
[/quote]
Which is why I use [URL="http://www.gpu72.com/reports/factoring_cost/"][COLOR=#000080]http://www.gpu72.com/reports/factoring_cost/[/COLOR][/URL] to see the 'cost' per factor found in the range I am working and why I specify [B]at 26M[/B] in all my posts. At 26M 1 of my machines would be good for ^69 while the rest should only do ^68. In comparing CPU effort to complete a 26M exp vs 29M or 30M exp, the DC would run 19-29% longer, meaning the GPU would produce 19-29% more GHzD/DC. Using this, one of my machines is getting close to being good for ^70 at 29M while the rest are comfortable at ^69. LaurV's machine, which I believe is around 120GD at the 26M range, would be at 143-155GD running 29-30M exps, which would still be outside of the ^69 range.

[quote]

For GPU272, we should then set the "official" breakeven point by taking an average of the reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just right in the low-to-mid 50M, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.[/QUOTE]

Due to 27.4 and the AVX speedup (20%? 30%? 50%?), Sandy Bridges are much more efficient completing DC/LL than non-SB. The 'quirk' in Core2Quads makes them only 75% as efficient as a comparable i3/i5/i7/AMD quad.

Using this, any SB is likely to be 1-2 bits lower than a non-SB i3/i5/i7, which would probably be 1/2 to 1 bit lower than the quads. AMD quads seem to fall between the i3/i5/i7 bracket and the C2Qs, but I only have 1 data point to fall back on.

James Heinrich 2012-03-29 20:59

[QUOTE=James Heinrich;294700]
* what GPU you have
* what CPU you have
* how many instances of mfaktc you run (thereby how many CPU cores are used)
* what average GPU usage you get by doing so (should typically be close to 100%)[/QUOTE]Please also include:
* CPU usage per core (assuming AllowSleep=1)
* CPU speed (whether overclocked or not)
* SievePrimes value

bcp19 2012-03-29 23:16

[QUOTE=Prime95;294687]All these GHz-days calculations are irrelevant to the calculation of the TF breakeven point. You'll find that you'll get more GHz-days credit TFing from 2^90 to 2^91, but we can all agree that GIMPS would be better off with a GPU LLing rather than TFing to 2^91.

Kjaget's calculations are the way to go. You compare how much LL a system can do to the amount of TF a system can do coupled with the chance of TF finding a factor (I'm sure P-1 changes the calculation slightly, but I'd bet it can safely be ignored).

For GPU272, we should then set the "official" breakeven point by taking an average of the reported breakeven points.

I'm guessing we're presently doing too much TF at 45M, just right in the low-to-mid 50M, and too little at 55M+. But we need more data! James has gotten us closer, but his estimates are a little high because of the unaccounted-for CPU cores used by mfaktc.[/QUOTE]

I just realized you didn't understand my terms. The GHzDs I listed were NOT for the bit level, but the total the card could produce in the same time as it would take the GPU and the CPUs combined to complete 1 DC. On GPU72, at 26M, on average it takes 106 TF to find a factor which is 237GHzD. My most efficient machine at the 26M level can do 228, which means it would almost break even compared to current estimates.

So, I just finished timings on 45M exps on all my CPUs and GPUs. The credit for a 45M exp is around 72.22, a 26M is 22.208, so a 45M exp takes 3.24 times more effort than a 26M. Using the timings at 45M the same way as I did for 26M, I ended up getting an increase between 2.9 and 3.05 times, which is fairly close. If I use 3, I get 448GHzD on my worst system and 684GHzD on my most efficient. Double that for 2 LLs saved and you get 996 to 1368. This tells me all of my machines are efficient doing 45M exponents to ^71, while one could maybe get away with doing ^72, seeing that 12 factors in 1708 runs have been found, which is kind of a small pool to use for an estimate.

LaurV 2012-03-30 03:07

[QUOTE=kjaget;294667]Let me run through an example ...[/QUOTE]
:tu: :tu: very good post.

bcp19 2012-03-30 13:45

I find it interesting that kjaget and I have basically said the same thing but in different terms, yet what I said is not understood. Maybe instead of saying at 26M I get:

2500/480 - 2 cores mfaktc - 149.26GHzD/DC 'lost'
2400/560Ti - 3 cores mfaktc - 159.53
X4 645/460 - 3 cores mfaktc - 172.78
Q6600/560 - 3 cores mfaktc - 204.85
Q8200/550Ti - 3 cores mfaktc - 228.15

I should say: At 26M, these systems can either perform 1 DC or X ^69 TFs:
2500/480 - 2 cores mfaktc - 1 DC or 64.9 ^69 TFs
2400/560Ti - 3 cores mfaktc - 1 DC or 69.4 ^69 TFs
X4 645/460 - 3 cores mfaktc - 1 DC or 75.1 ^69 TFs
Q6600/560 - 3 cores mfaktc - 1 DC or 89 ^69 TFs
Q8200/550Ti - 3 cores mfaktc - 1 DC or 99.2 ^69 TFs

At the 45M level, the above systems could perform 168.2, 180, 195.1, 231.3 and 257.6 TF to ^71 or 2 LLs.

chalsall 2012-03-30 14:16

[QUOTE=bcp19;294832]At the 45M level, the above systems could perform 168.2, 180, 195.1, 231.3 and 257.6 TF to ^71 or 2 LLs.[/QUOTE]

Interesting...

Could you redo this analysis for 30M TF to 70 vs. 1 LL(DC), 52M TF to 72 vs 2 LLs, and 58.52M TF to 73 vs 2LLs?

I chose these numbers because the first two are where we're working currently, and the last is where Prime95 currently transitions to 73.

I, like George et al., feel the transition to 73 at 58.52M should be lower, but I don't think that should happen until we've (mostly) cleared out the wave.

kjaget 2012-03-30 14:28

I think the confusion is that GHz-days/day are generated at different rates on GPUs and CPUs (and for different assignment types on the same hardware). So adding that in rather than just measuring raw times confuses the issue. Even if you're using it to convert to and from time temporarily, it adds an extra layer of complexity - and an additional assumption - that isn't needed.

bcp19 2012-03-30 14:54

1 Attachment(s)
[QUOTE=chalsall;294833]Interesting...

Could you redo this analysis for 30M TF to 70 vs. 1 LL(DC), 52M TF to 72 vs 2 LLs, and 58.52M TF to 73 vs 2LLs?

I choose these numbers because the first two are where we're working currently, and the last is the current transition point to 73 based on Prime95's transition point.

I, like George et al, feel the transition at 58.52 to 73 should be lower, but I don't think that should happen until we've (mostly) cleared out the wave.[/QUOTE]

This isn't up to 70, but I had this worked out when I saw your post:

chalsall 2012-03-30 15:11

[QUOTE=bcp19;294841]This isn't up to 70, but I had this worked out when I saw your post:[/QUOTE]

Sweet!!! Thanks.

So, this clearly shows that we're going to 70 bits too early in the DC range. But you've said you still want to do that. Do you still? You're the main producer remaining in that range, so I'll defer to you on that.

In the LL range it shows that what we're doing now is "economical", and that we can go to 73 a little lower than the Prime95 transition point once we've finished everything below 58.52M to 72.

It also clearly shows that the CPU/GPU combinations have a huge influence on the cross-over points.

bcp19 2012-03-30 15:19

[QUOTE=kjaget;294836]I think the confusion is that GHz-day/day is generated at different rates on GPUs and CPUs (and for different assignment and types on the same hardware). So adding that in rather than just measuring raw times is confusing the issue. Even if you're using it to convert to and from time temporarily it adds an extra layer of complexity - and an additional assumption - that isn't needed.[/QUOTE]

Actually, I have less complexity than you. The givens I use are %DC(2LL)/day/CPU core, # of CPU cores used, %DC(2LL)/day/gpu, GHzD/Day output by GPU. Let's call those a,b,c,d. My formula then is d/(a*b+c). b and d are static while a and c vary with the exp tested. I have no need to take timings, as I can check James' site to see them and convert. I have to add a new variable to the formula to give #TF/DC(2LL), making it (d/(a*b+c))/e where e is the GHz credit for the exponent at the target bit level.
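Spelled out with those variable names (the sample values below are my reconstruction from the earlier 2500/560 post, so treat them as illustrative):

```python
# bcp19's formula: a = fraction of a DC/day per CPU core, b = cores feeding
# mfaktc, c = fraction of a DC/day the GPU would do under CUDALucas,
# d = GHz-days/day of TF output from the GPU, e = GHz-days credit per TF
# assignment at the target bit level.

def ghzd_per_dc_lost(a: float, b: int, c: float, d: float) -> float:
    """GHz-days of TF produced for each DC forgone: d / (a*b + c)."""
    return d / (a * b + c)

def tfs_per_dc_lost(a: float, b: int, c: float, d: float, e: float) -> float:
    """Number of TF assignments produced per DC forgone: (d / (a*b + c)) / e."""
    return ghzd_per_dc_lost(a, b, c, d) / e

# Illustrative values: 1 core at 37% of a DC/day, GPU at 64% of a DC/day,
# 144 GHz-days/day of TF output.
print(round(ghzd_per_dc_lost(a=0.37, b=1, c=0.64, d=144.0), 1))  # ~142.6
```

That lands close to the ~143 GHzD/DC figure quoted for that combo a few posts back (small differences are just input rounding).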

bcp19 2012-03-30 15:25

[QUOTE=chalsall;294844]Sweet!!! Thanks.

So, this clearly shows that we're going to 70 bits too early in the DC range. But you've said you still want to do that. Do you still? You're the main producer remaining in that range, so I'll defer to you on that.

In the LL range it shows that what we're doing now is "economical", and that we can go to 73 a little lower than the Prime95 transition point once we've finished everything below 58.52M to 72.

It also clearly shows that the CPU/GPU combinations have a huge influence on the cross-over points.[/QUOTE]

I have a feeling that my graph is off where theory is concerned, as I used a flat 1% factor found per bit level. Some people say the chance to find a factor is 1/bit level, but P-1 has been done on the DCs so that alters the equation. This is an EXTREMELY rough graph, and without P-1 on the LL candidates, they probably have greater than a 1% chance per bit level.
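The "1/bit level" rule of thumb mentioned above is easy to sketch (a standard GIMPS heuristic; P-1 then finds some of those factors first, which is why the rate observed by TF comes in lower):

```python
# Heuristic: the chance that a Mersenne number has a factor between
# 2^b and 2^(b+1) is roughly 1/b, before any P-1 testing. P-1 removes a
# slice of those factors, pulling the observed TF rate down toward the
# ~1% flat figure used in the graph.

def factor_chance(bit_level: int) -> float:
    """Heuristic probability of a factor in [2^b, 2^(b+1))."""
    return 1.0 / bit_level

for b in (68, 70, 72):
    print(f"2^{b} to 2^{b+1}: ~{100 * factor_chance(b):.2f}% before P-1")
```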

Edit: I also just found an error... the timings I was using for the 2400 were from v26.6, which means the 2400 is actually worse than the 2500 as it loses 13% with the timing change.

chalsall 2012-03-30 15:43

[QUOTE=bcp19;294847]This is an EXTREMELY rough graph, and without P-1 on the LL candidates, they probably have greater than a 1% chance per bit level.[/QUOTE]

Agreed. Probably about 1.125%.

And so everyone knows, the empirical data on the [URL="http://www.gpu72.com/reports/factor_percentage/"]Factor Found Percentage[/URL] report is undercounting a bit in the "to 71" and "to 72" columns for reasons I won't go into now, but this will hopefully correct itself shortly. (Hint, hint to the person responsible... :wink:)

But this doesn't change the fact that by both your and James' analysis, we're going to 70 bits too early in the DC range.

