mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

kladner 2012-03-08 18:10

[QUOTE=kjaget;292341].....until you max out sieve primes.....[/QUOTE]

Just to make sure I understand correctly, what kind of numbers are you considering maxed out?

kjaget 2012-03-08 18:39

[QUOTE=kladner;292342]Just to make sure I understand correctly, what kind of numbers are you considering maxed out?[/QUOTE]

High - possibly up to the max of 200,000. From what I've been able to test, adding CPU cores once you've got the GPU to 100% usage gives more improvement in TF throughput than you lose by taking that CPU off of other tasks. But that's not hard since GPUs are so much better at generating GHz-days of work. That was my point in showing that ~5% better GPU throughput is just as many GHz-days as 100% of a CPU core.

I'm working from a small sample size (just my personal hardware at home) so I don't have enough different systems to say exactly where the break-even point is (and even that would just be a rough approximation). But I see lots of people locking sieve primes at 5000 to free up CPU time without measuring the TF GHz-day performance hit you take, just assuming that once the Mcandidates/sec is maxed out there's nothing more mfaktc can do.

Of course, this exact tradeoff depends on a complex interaction of how efficient your CPU is at both LL testing and sieving and how fast the GPU is. That's why I keep going back to the fact it's hard to give specific recommendations without knowing mfaktc timings running a bunch of instances in any particular system. I'm genuinely interested to see them for various CPU & GPU combinations since I have a limited set to test with here at home.

I also keep reiterating that GHz-days/day isn't the only way to measure this, so there can be other correct answers. For example, you might be willing to give up the absolute max GHz-days/day if you value your ranking in each category above absolute total throughput (so a TF GHz-day isn't equally valuable to an LL GHz-day or whatever). Can't argue with that approach, especially considering the GPU firepower.

Xyzzy 2012-03-08 18:54

We have four cores per box sieving for our GTX570s. If we let mfaktc automatically choose the SievePrimes parameter, the cores go up to about 30,000 each and we net around 1250 GHz-days/day overall.

If we run Prime95 on each box as well, doing P-1 work using all four cores for one instance, and we set SievePrimes to 5000, we net around 1100 GHz-days/day overall.

In the second example, we are able to complete (roughly) three P-1 tests every two days per box.

Since each box has 8GiB or more of memory, it kinda makes sense to do the P-1 work and lose the 150 GHz-days/day. It does, however, push our temperatures up about 5°C.
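As a back-of-envelope sketch of the tradeoff described above, using only the figures quoted in this post (nothing newly measured):

```python
# TF vs. P-1 tradeoff, using the numbers quoted in this post.
tf_only = 1250           # GHz-days/day with all four cores sieving for mfaktc
tf_with_p1 = 1100        # GHz-days/day with SievePrimes locked at 5000
p1_tests_per_day = 3 / 2  # roughly three P-1 tests every two days per box

tf_cost = tf_only - tf_with_p1             # TF throughput given up
cost_per_p1 = tf_cost / p1_tests_per_day   # TF GHz-days "paid" per P-1 test

print(f"TF cost: {tf_cost} GHz-days/day")          # 150
print(f"Per P-1 test: {cost_per_p1:.0f} GHz-days")  # 100
```

So each P-1 test effectively costs about 100 TF GHz-days; whether that is worth it depends on how you value P-1 work for the project.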

We are not sure what the optimal settings are, but we know P-1 testing needs to be done, so we think it is more helpful for the project. While it is fun to crank out GHz-days/day, doing optimal work for the project is probably best. We are currently running through a pile of 70-71 bit TF work but once that is done we will go back to just taking whatever work we get to 72 bits, which is kinda the goal of the project, or something like that.

(The fact that Craig is smoking us real bad makes it easier to make decisions like this.)

:max:

bcp19 2012-03-08 19:11

[QUOTE=kjaget;292341]You have to balance that against the fact that a faster CPU will be able to sieve deeper, allowing the GPU to generate more GHz-days per GFLOP. My gut feel is that GPUs are so much more efficient at producing GHz-days that until you max out sieve primes, it always makes sense to use CPUs running mfaktc rather than anything else. So if you can do that by running the 8200, that might be the way to go. If you can't, the faster cards need to be paired up with faster CPUs.[/QUOTE]

Actually, you are kind of thinking backwards: the slower systems will sieve higher per class than the faster ones, while the faster ones will sieve for more classes in the same amount of time. If I set the adjust to 1 on the 2500k, it stays unchanged at 5000 SP, while the 2400 with 2 instances will battle back and forth a while and then equalize around 14000-16000. Back when the 560 was in the 8200, the SP would equalize around 40k.

As a test, I just set the SP on the 2500 to 10k. M/s dropped by over 40, and it took over a minute longer to run the same exponent, which is approximately a 5% decrease in throughput.

kjaget 2012-03-08 19:59

[QUOTE=bcp19;292348]Actually, you are kind of thinking backwards: the slower systems will sieve higher per class than the faster ones, while the faster ones will sieve for more classes in the same amount of time. If I set the adjust to 1 on the 2500k, it stays unchanged at 5000 SP, while the 2400 with 2 instances will battle back and forth a while and then equalize around 14000-16000. Back when the 560 was in the 8200, the SP would equalize around 40k.[/QUOTE]

If you're using 2 cores of 1 system and comparing it to 1 core on another, why are you saying that the 2 core setup is slower? 2 * 0.9 is more than 1 * 1.0 so I think you have the slower and faster setups backwards in your first sentence.

[QUOTE]As a test, I just set the SP on the 2500 to 10k. M/s dropped by over 40 and it took over a minute longer to run the same exponent, which is approx a 5% decrease in throughput.[/QUOTE]

This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPUs to max out the GPU. Of course, if you don't have the CPU power to max out the GPU, asking that CPU to do more will make things even slower. But there are additional gains from adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.

Try an experiment. Set sieve prime adjust to 1. Run 1 instance and let it stabilize and note how long a class takes. Then run 2 of the same exponent, again letting it stabilize and keep track of the time. Repeat for 3 and 4 (up to however many cores you have). Post the results here and I'll show you what I'm talking about with respect to scaling.

Your throughput will increase rapidly with each additional core until you load the GPU 100%. Then you'll see smaller increases as the increased sieve primes make the GPU run quicker per class. The large increase is obviously worth it; the smaller one is closer to GHz-day parity with CPU power, so it takes more careful analysis to figure out whether it's worth it.
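To illustrate the scaling pattern being described, here is a sketch with made-up per-class times (illustrative only, not measured on any real system):

```python
# Hypothetical per-class wall times (minutes) for 1..4 identical mfaktc
# instances running the same exponent. Values are invented to illustrate
# the expected shape: a big jump while the GPU fills up, then small gains.
class_time_min = {1: 16.0, 2: 18.0, 3: 26.0, 4: 34.0}

for n, t in sorted(class_time_min.items()):
    # n instances each finish a class every t minutes, so aggregate
    # throughput is n / t classes per minute.
    print(f"{n} instance(s): {n / t:.4f} classes/min")
```

With these illustrative numbers, going from 1 to 2 instances nearly doubles throughput (the GPU was starved), while 2 to 4 adds only a few percent (the GPU is already saturated and only the deeper sieving helps) - which is exactly the two regimes distinguished above.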

bcp19 2012-03-08 21:27

[QUOTE=kjaget;292359]If you're using 2 cores of 1 system and comparing it to 1 core on another, why are you saying that the 2 core setup is slower? 2 * 0.9 is more than 1 * 1.0 so I think you have the slower and faster setups backwards in your first sentence. [/QUOTE]
Apples and oranges. 1 core on system A can do 9.6 iter/day; 2 cores on system B can do 8.64 iter/day (4.32 each). The two cores of B combined equal 0.9 × the one core of A, therefore B is the slower setup.
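Spelling out the arithmetic above, using only the iteration rates quoted in this post:

```python
# Per-system throughput comparison from the post above.
iters_A = 9.6            # iterations/day, 1 core on system A
iters_B_total = 8.64     # iterations/day, 2 cores on system B combined
iters_B_each = iters_B_total / 2   # 4.32 per core

# Both cores of B together deliver only 0.9 x what A's single core does,
# so B is the "slower" setup even though it uses twice the cores.
ratio = iters_B_total / iters_A
print(f"B combined / A single core: {ratio:.2f}")
```

The point is that "slower" here refers to aggregate sieving throughput per system, not per core, which is why the two posters were talking past each other.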

[quote]This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPUs to max out the GPU. Of course if you don't have the CPU power to max out the GPU asking that CPU to do more will make things even slower. But there's additional gains by adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.[/quote]
1 core (A) = 99% GPU load (maxed), 2 core (B) = 99% GPU load (maxed)

[Quote]Try an experiment. Set sieve prime adjust to 1. Run 1 instance and let it stabilize and note how long a class takes. Then run 2 of the same exponent, again letting it stabilize and keep track of the time. Repeat for 3 and 4 (up to however many cores you have). Post the results here and I'll show you what I'm talking about with respect to scaling.

Your throughput will increase rapidly with each additional core until you load the GPU 100%. Then you'll see smaller increases as the increased sieve primes make the GPU run quicker per class. The large increase is obviously worth it, the smaller one is the one that's closer to a GHz-day parity with CPU power so it takes more careful analysis to figure out whether it's worth it.[/Quote]
1 core of a 2500k, adjust = 1, SP = 5000: 16.5 min = ~87 exp/day, CPU wait <3%, ~171 GHzD/day
2 cores of a 2500k, adjust = 1, SP = ~25000: 31 min each = ~93 exp/day, CPU wait ~20% each, ~183 GHzD/day (should clarify this: once SP climbed above 25k, the est. time also increased, so the run was actually adjust = 0 and SP = 25k)
1 core of a 2400, adjust = 1, SP = 5000: ~25 min = ~58 exp/day, CPU wait <3%, ~114 GHzD/day
2 cores of a 2400, adjust = 0, SP = 5000: ~36 min each = ~80 exp/day, CPU wait ~20%, ~157 GHzD/day
2 cores of a 2400, adjust = 1, SP = ~12000: ~33.75 min each = ~85.3 exp/day, CPU wait <3%, ~168 GHzD/day

So, a 7% gain on the 2500 from 160% CPU usage. Barely worth using a 2nd core. With such results, I did not continue.

The 2400 with 1 core is obvious: not enough CPU. 2 cores @ 5k vs 2 cores @ 12k: the same 7% gain, but CPU usage goes from ~160% to ~200%. Did not bother to continue.
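The quoted gains can be rechecked from the GHzD/day figures in the table above (a quick recomputation of the posted numbers, nothing new):

```python
# Percentage gains recomputed from the posted GHzD/day figures.
gain_2500 = (183 - 171) / 171   # adding a 2nd sieving core on the 2500k
gain_2400 = (168 - 157) / 157   # raising SP from 5000 to ~12000 on the 2400

print(f"2500k: {gain_2500:.1%}, 2400: {gain_2400:.1%}")  # ~7.0% each
```

Both work out to roughly 7%, matching the "same 7% gain" observation.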

It can be argued that since one has a 560 and one a 560 Ti you cannot adequately compare the two, but it sure seems like the 2400 is more efficient.

Edit: I reran it after thinking about the slowdown; I must have had a process running, as the new run was 36k SP and ~25 min, giving a 32% increase. A fair improvement, but it takes a lot of resources. It's possible similar increases could be had on the 2400; I will have to test later.

kladner 2012-03-08 21:36

[QUOTE=kjaget;292345]High - possibly up to the max of 200,000. From what I've been able to test, adding CPU cores once you've got the GPU to 100% usage gives more improvement in TF throughput than you lose by taking that CPU off of other tasks. But that's not hard since GPUs are so much better at generating GHz-days of work. That was my point in showing that ~5% better GPU throughput is just as many GHz-days as 100% of a CPU core.

I'm working from a small sample size (just my personal hardware at home) so I don't have enough different systems to say exactly where the break-even point is (and even that would just be a rough approximation). But I see lots of people locking sieve primes at 5000 to free up CPU time without measuring the TF GHz-day performance hit you take, just assuming that once the Mcandidates/sec is maxed out there's nothing more mfaktc can do.

Of course, this exact tradeoff depends on a complex interaction of how efficient your CPU is at both LL testing and sieving and how fast the GPU is. That's why I keep going back to the fact it's hard to give specific recommendations without knowing mfaktc timings running a bunch of instances in any particular system. I'm genuinely interested to see them for various CPU & GPU combinations since I have a limited set to test with here at home.

I also keep reiterating that GHz-days/day isn't the only way to measure this, so there can be other correct answers. For example, you might be willing to give up the absolute max GHz-days/day if you value your ranking in each category above absolute total throughput (so a TF GHz-day isn't equally valuable to an LL GHz-day or whatever). Can't argue with that approach, especially considering the GPU firepower.[/QUOTE]

Thanks for the response. I have been watching developments in CUDALucas and considering rearranging what I run. Right now on a 1090T @ 3.5GHz w/16GB RAM, that is 3 P-1 cores, 1 LL/DC core, and 2 feeding mfaktc on a GTX 460. I am not really ready to change, as yet.

On the other hand, I would be willing to produce some data on different numbers of mfaktc instances if that would be useful.

EDIT: At the moment, I'm trying out locking Sieve Primes at 14000 for two instances. This was aimed at making other programs run a little better on the system. Clearly, there are trade-offs. If I left mfaktc to decide, it would be running SP ~18-19k, depending on the exponents. The GPU fluctuates around 95%, and can be driven up with 3 instances. Thing is, I'm not sure I can live with the system under those circumstances. This is my only machine, and I need it to behave moderately well for general use.

kladner 2012-03-09 05:01

Some tests Part 1
 
1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL] I will have to put these test runs up in more than one post. I guess I should have redirected the outputs to text files. In any case, I ran from 1 to 4 instances of mfaktc, with affinities set to individual cores of the 1090T. For this test, I used the same exponent for all instances. I let the tests run until Sieve Primes had held steady for several classes.

These are the results for 2 and 3 instances.

kladner 2012-03-09 05:03

1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 1 instance.

kladner 2012-03-09 05:05

1 Attachment(s)
@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 4 instances.

bcp19 2012-03-09 06:04

[QUOTE=kladner;292402]@[URL="http://www.mersenneforum.org/member.php?u=1870"]kjaget[/URL]

These are the results for 4 instances.[/QUOTE]

FYI, without completing the run, you cannot tell how much increase you have in throughput with 4 running compared to 1, 2, or 3.

