mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2012-03-08, 18:10   #1640
kladner ("Kieren")

Quote:
Originally Posted by kjaget View Post
.....until you max out sieve primes.....
Just to make sure I understand correctly, what kind of numbers are you considering maxed out?
2012-03-08, 18:39   #1641
kjaget

Quote:
Originally Posted by kladner View Post
Just to make sure I understand correctly, what kind of numbers are you considering maxed out?
High, possibly up to the maximum of 200,000. From what I've been able to test, adding CPU cores once you've got the GPU to 100% usage gains more TF throughput than you lose by taking those cores off other tasks. That's not surprising, since GPUs are so much better at generating GHz-days of work. That was my point in showing that ~5% better GPU throughput is worth as many GHz-days as 100% of a CPU core.

I'm working from a small sample size (just my personal hardware at home), so I don't have enough different systems to say exactly where the break-even point is (and even that would just be a rough approximation). But I see lots of people locking SievePrimes at 5000 to free up CPU time without measuring the TF GHz-day performance hit they take, just assuming that once Mcandidates/sec is maxed out there's nothing more that mfaktc can do.

Of course, the exact tradeoff depends on a complex interaction between how efficient your CPU is at both LL testing and sieving and how fast the GPU is. That's why I keep going back to the fact that it's hard to give specific recommendations without seeing mfaktc timings for a range of instance counts on any particular system. I'm genuinely interested to see them for various CPU and GPU combinations, since I have a limited set to test with here at home.

I also keep reiterating that GHz-days/day isn't the only way to measure this, so there can be other correct answers. For example, you might be willing to give up the absolute maximum GHz-days/day if you value your ranking in each category above total throughput (so that, to you, a TF GHz-day isn't worth the same as an LL GHz-day). I can't argue with that approach, especially considering the GPU firepower involved.
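The parity argument above (a few percent of GPU throughput equals a whole CPU core's worth of GHz-days) can be sketched numerically. Both rates below are hypothetical placeholders, not measurements from this thread:

```python
# Illustrative sketch of the CPU-core vs. GPU-throughput tradeoff described
# above. Both rates are hypothetical placeholders, not measured values.
cpu_core_ghzd_per_day = 8.0    # assumed output of one CPU core doing LL/P-1 work
gpu_ghzd_per_day = 170.0       # assumed output of one GPU running mfaktc

# Suppose dedicating that core to sieving lifts GPU throughput by ~5%:
gpu_gain = 0.05 * gpu_ghzd_per_day

print(f"CPU core on its own work: {cpu_core_ghzd_per_day:.1f} GHz-days/day")
print(f"Same core boosting GPU:   {gpu_gain:.1f} GHz-days/day extra")
if gpu_gain > cpu_core_ghzd_per_day:
    print("Sieving for the GPU wins on raw GHz-days/day")
```

With these assumed rates, a 5% GPU bump (8.5 GHz-days/day) already edges out the core's own output; the real break-even obviously depends on the actual CPU and GPU in question.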
2012-03-08, 18:54   #1642
Xyzzy ("Mike")

We have four cores per box sieving for our GTX 570s. If we let mfaktc choose the SievePrimes parameter automatically, the cores settle at about 30,000 each and we net around 1250 GHz-days/day overall.

If we also run Prime95 on each box, doing P-1 work with all four cores in one instance, and set SievePrimes to 5000, we net around 1100 GHz-days/day overall.

In the second example, we are able to complete (roughly) three P-1 tests every two days per box.

Since each box has 8 GiB or more of memory, it makes sense to do the P-1 work and lose the 150 GHz-days/day. It does, however, push our temperatures up about 5°C.

We are not sure what the optimal settings are, but we know P-1 testing needs to be done, so we think this is more helpful for the project. While it is fun to crank out GHz-days/day, doing optimal work for the project is probably best. We are currently running through a pile of 70-71 bit TF work, but once that is done we will go back to taking whatever work we get to 72 bits, which is more or less the goal of the project.
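As a rough cross-check of the numbers above: the post doesn't say how many boxes are involved, so the box count below is a hypothetical parameter; everything else comes from the quoted figures.

```python
# Rough cost of the P-1 work in TF terms, using the figures quoted above.
tf_only = 1250.0             # GHz-days/day overall, all four cores sieving
tf_plus_p1 = 1100.0          # GHz-days/day overall with P-1 running as well
p1_per_box_per_day = 3 / 2   # roughly three P-1 tests every two days, per box

n_boxes = 4                  # hypothetical; the post doesn't state the box count
tf_cost = tf_only - tf_plus_p1                       # 150 GHz-days/day given up
per_test = tf_cost / (p1_per_box_per_day * n_boxes)  # TF GHz-days per P-1 test
print(f"Each P-1 test costs about {per_test:.0f} TF GHz-days ({n_boxes} boxes)")
```

Whether that trade is worthwhile is exactly the project-value question the post raises: a P-1 test's worth to GIMPS isn't measured only in GHz-days.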

(The fact that Craig is smoking us real bad makes it easier to make decisions like this.)

2012-03-08, 19:11   #1643
bcp19

Quote:
Originally Posted by kjaget View Post
You have to balance that against the fact that a faster CPU will be able to sieve deeper, allowing the GPU to generate more GHz-days per GFLOP. My gut feel is that GPUs are so much more efficient at producing GHz-days that until you max out sieve primes, it always makes sense to use CPUs running mfaktc rather than anything else. So if you can do that by running the 8200, that might be the way to go. If you can't, the faster cards need to be paired up with faster CPUs.
Actually, you have it kind of backwards: the slower systems will sieve higher per class than the faster ones, while the faster one will sieve for more classes in the same amount of time. If I set the adjust to 1 on the 2500K, it stays unchanged at 5000 SP, while the 2400 with two instances will battle back and forth for a while and then equalize around 14,000-16,000. Back when the 560 was in the 8200, the SP would equalize around 40K.

As a test, I just set the SP on the 2500K to 10K. M/s dropped by over 40, and the same exponent took over a minute longer to run, which is approximately a 5% decrease in throughput.
2012-03-08, 19:59   #1644
kjaget

Quote:
Originally Posted by bcp19 View Post
Actually, you are kind of thinking backwards, the slower systems will sieve higher than the faster ones per class while the faster one will sieve for more classes in the same amount of time. If I set the adjust to 1 on the 2500k, it stays unchanged at 5000 SP while the 2400 with 2 instances will battle back and forth a while then equalize around 14000-16000. Back when the 560 was in the 8200, the SP would equalize around 40k.
If you're using two cores of one system and comparing it to one core on another, why are you saying the two-core setup is slower? 2 × 0.9 is more than 1 × 1.0, so I think you have the slower and faster setups backwards in your first sentence.

Quote:
As a test, I just set the SP on the 2500 to 10k. M/s dropped by over 40 and it took over a minute longer to run the same exponent, which is approx a 5% decrease in throughput.
This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPU to max out the GPU. Of course, if you don't have the CPU power to max out the GPU, asking that CPU to do more will make things even slower. But there are additional gains from adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.

Try an experiment. Set the sieve prime adjust to 1. Run one instance, let it stabilize, and note how long a class takes. Then run two instances of the same exponent, again letting it stabilize, and record the time. Repeat for three and four (up to however many cores you have). Post the results here and I'll show you what I mean about scaling.

Your throughput will increase rapidly with each additional core until the GPU is loaded to 100%. After that you'll see smaller increases, as the higher sieve primes make the GPU run quicker per class. The large increase is obviously worth it; the smaller one is closer to GHz-day parity with CPU power, so it takes more careful analysis to figure out whether it's worth it.
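The suggested experiment boils down to converting per-class times at each instance count into total throughput. The timings below are made-up placeholders standing in for the measurements being asked for:

```python
# Turn per-class times from the experiment described above into relative
# throughput. Keys are instance counts; values are seconds per class per
# instance (made-up placeholders, not real measurements).
class_seconds = {1: 60.0, 2: 63.0, 3: 68.0, 4: 75.0}

baseline = 3600.0 / class_seconds[1]       # classes/hour with one instance
for n in sorted(class_seconds):
    total = n * 3600.0 / class_seconds[n]  # classes/hour across n instances
    print(f"{n} instance(s): {total:6.1f} classes/h ({total / baseline:.2f}x)")
```

Each instance slows the others a little, but total classes/hour keeps rising; the interesting question is whether the tail-end gains beat what those cores could earn elsewhere.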
2012-03-08, 21:27   #1645
bcp19

Quote:
Originally Posted by kjaget View Post
If you're using 2 cores of 1 system and comparing it to 1 core on another, why are you saying that the 2 core setup is slower? 2 * 0.9 is more than 1 * 1.0 so I think you have the slower and faster setups backwards in your first sentence.
Apples and oranges. One core on system A can do 9.6 iterations/day; two cores on system B can do 8.64 iterations/day (4.32 each). The two cores on B total 0.9 times what the one core on A does, therefore 2B × 0.9 < 1A × 1.0.

Quote:
This is a symptom of not having enough CPU power to max out the GPU. I was discussing the behavior after you add enough CPUs to max out the GPU. Of course if you don't have the CPU power to max out the GPU asking that CPU to do more will make things even slower. But there's additional gains by adding more CPU cores once the GPU is at 100%, and that's what I'm discussing.
1 core (A) = 99% GPU load (maxed), 2 core (B) = 99% GPU load (maxed)

Quote:
Try an experiment. Set sieve prime adjust to 1. Run 1 instance and let it stabilize and note how long a class takes. Then run 2 of the same exponent, again letting it stabilize and keep track of the time. Repeat for 3 and 4 (up to however many cores you have). Post the results here and I'll show you what I'm talking about with respect to scaling.

Your throughput will increase rapidly with each additional core until you load the GPU 100%. Then you'll see smaller increases as the increased sieve primes make the GPU run quicker per class. The large increase is obviously worth it, the smaller one is the one that's closer to a GHz-day parity with CPU power so it takes more careful analysis to figure out whether it's worth it.
1 core of a 2500K, adjust = 1, SP = 5000, 16.5 min = ~87 exp/day, CPU wait <3%, ~171 GHzD/day
2 cores of a 2500K, adjust = 1, SP = ~25000, 31 min each = ~93 exp/day, CPU wait ~20% each, ~183 GHzD/day (I should clarify this: once SP climbed above 25K the estimated time also increased, so the run was effectively adjust = 0 and SP = 25K)
1 core of a 2400, adjust = 1, SP = 5000, ~25 min = ~58 exp/day, CPU wait <3%, ~114 GHzD/day
2 cores of a 2400, adjust = 0, SP = 5000, ~36 min each, ~80 exp/day, CPU wait ~20%, ~157 GHzD/day
2 cores of a 2400, adjust = 1, SP = ~12000, ~33.75 min, ~85.3 exp/day, CPU wait <3%, ~168 GHzD/day

So, a 7% gain on the 2500K for 160% CPU usage. Barely worth using a second core. With such results, I did not continue.

The 2400 with one core is obviously short on CPU. Comparing two cores @ 5K vs. two cores @ 12K: the same 7% gain, but CPU usage goes from ~160% to ~200%. I did not bother to continue.

It can be argued that, since one has a 560 and the other a 560 Ti, you cannot compare them directly, but it sure seems like the 2400 is more efficient.

Edit: After thinking about the slowdown, I reran it; I must have had another process running, as the new run settled at 36K SP and ~25 min, giving a 32% increase. A fair improvement, but it takes a lot of resources. It's possible similar increases could be had on the 2400; I will have to test later.
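The 2500K figures in this post can be sanity-checked from the minutes-per-exponent reported; only numbers stated above are used here:

```python
# Sanity-check of the 2500K runs reported above.
minutes_per_day = 24 * 60

one_core = minutes_per_day / 16.5       # 1 core @ SP 5000: ~87 exp/day
two_cores = 2 * minutes_per_day / 31.0  # 2 cores @ SP ~25000: ~93 exp/day
rerun = 2 * minutes_per_day / 25.0      # rerun @ SP ~36000: ~115 exp/day

print(f"1 core:  {one_core:.0f} exp/day")
print(f"2 cores: {two_cores:.0f} exp/day ({two_cores / one_core - 1:+.0%})")
print(f"rerun:   {rerun:.0f} exp/day ({rerun / one_core - 1:+.0%})")
```

The per-exponent times reproduce both the ~7% first-run gain and the ~32% gain from the rerun, so the reported throughput figures are internally consistent.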

Last fiddled with by bcp19 on 2012-03-08 at 22:12
2012-03-08, 21:36   #1646
kladner

Quote:
Originally Posted by kjaget View Post
High - possibly up to the max of 200,000. [...] Can't argue with that approach, especially considering the GPU firepower
Thanks for the response. I have been watching developments in CUDALucas and considering rearranging what I run. Right now, on a 1090T @ 3.5 GHz with 16 GB RAM, that is three P-1 cores, one LL/DC core, and two cores feeding mfaktc on a GTX 460. I am not really ready to change yet.

On the other hand, I would be willing to produce some data on different numbers of mfaktc instances if that would be useful.

EDIT: At the moment, I'm trying out locking SievePrimes at 14000 for two instances. This was aimed at making other programs run a little better on the system. Clearly, there are trade-offs. If I left mfaktc to decide, it would run SP at ~18-19K, depending on the exponents. The GPU fluctuates around 95% and can be driven higher with three instances. The thing is, I'm not sure I can live with the system under those circumstances. This is my only machine, and I need it to behave moderately well for general use.

Last fiddled with by kladner on 2012-03-08 at 21:46
2012-03-09, 05:01   #1647
kladner
Some tests Part 1

@kjaget: I will have to put these test runs up in more than one post. I guess I should have redirected the outputs to text files. In any case, I ran from one to four instances of mfaktc, with affinities set to individual cores of the 1090T. For this test, I used the same exponent for all instances and let each test run until SievePrimes had held steady for several classes.

These are the results for 2 and 3 instances.
Attached: 2and3_mfaktc.zip (180.0 KB)
2012-03-09, 05:03   #1648
kladner

@kjaget

These are the results for 1 instance.
Attached: 1_mfaktc.JPG (92.2 KB)
2012-03-09, 05:05   #1649
kladner

@kjaget

These are the results for 4 instances.
Attached: 4_mfaktc.zip (216.4 KB)
2012-03-09, 06:04   #1650
bcp19

Quote:
Originally Posted by kladner View Post
@kjaget

These are the results for 4 instances.
FYI, without completing the runs, you cannot tell how much of an increase in throughput you get with four instances running compared to one, two, or three.