#1629
James Heinrich
May 2004
ex-Northern Ontario
D5D₁₆ Posts
Indeed it does: PauseWhileRunning= setting in prime.txt (and also the related LowMemWhileRunning=). Memory isn't an issue in mfaktc, but I would love to see a PauseWhileRunning setting so I don't have to kill off mfaktc before running a game or other GPU-intensive task, or (more importantly) remember to restart it afterwards.
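For reference, both settings named above live in prime.txt; the lines below are a sketch, and the program names are only examples:

```ini
; prime.txt (Prime95): pause all work whenever a listed program is running
PauseWhileRunning=game.exe
; the related setting mentioned above: switch to low-memory work instead
LowMemWhileRunning=render.exe
```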
#1630
Jun 2005
3×43 Posts
I wasn't quite sure, but I thought it cut his CPU usage in half; since he was running two cores to start with, that would be a one-core reduction. Or maybe not, but at least that was my reading of it.
#1631
Dec 2011
11·13 Posts
@All: I'm not claiming that everyone would want to do less sieving. There are many whose sole purpose is to get their PrimeNet ranking as high as possible. Those people *should* spend all of their CPU cores helping feed the GPU.

But, as some of you have pointed out, PrimeNet already has more TF than it needs. If you can give it 20% less TF, but also give the project an extra core of P-1 or LL, it might be a net benefit to the project. What I *am* saying is that the user should be *allowed to choose* a higher or a lower SievePrimes, according to each user's needs and personal preferences. (As long as the code works, of course.) What I am also suggesting (in the way of sieving on the GPU) is that there are some people, such as myself, who have plenty of useful work to keep our expensive Intel i7 cores busy, and who would prefer to see the entire sieving/trial-factoring operation moved onto the GPU, even at the cost of some TF performance.
-- Rocke
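The knob in question is already user-visible; a hedged sketch of the relevant mfaktc.ini lines (the values are illustrative, not recommendations, and SievePrimesAdjust may not exist in every build):

```ini
; mfaktc.ini: lower values = less CPU time spent sieving, more candidates
; passed to the GPU; higher values = deeper sieving, fewer candidates,
; more CPU load per GHz-day of TF
SievePrimes=25000
; some builds can adjust SievePrimes automatically toward a CPU-load
; target; if your build has SievePrimesAdjust, 1 enables that behavior
SievePrimesAdjust=1
```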
#1632
Oct 2011
7·97 Posts
The talk here brings up something I've been mulling over: how efficient is it to use a faster CPU to feed your GPU? When I started, I had a 560Ti in a Core 2 Quad 8200. Two cores would get the card to around 200M/s output, which is about 75-80% of its capability, and adding a third core seemed a waste. The same card is now in an i5-2500K, runs around 240M/s, and takes up one full core (under 3% wait time).

Doing some calculations, a single core of the 2500K can perform around 9.6M iterations per day on a 26M exponent, while all four cores of the 8200 combined only manage around 5.9M iterations per day. Using multiple cores of the 8200 is likely more power-hungry than one core of the 2500K, but should I 'waste' 9.6M iterations of calculation per day when the same GPU work could be done on 3 of the 4 8200 cores while sacrificing only around 4.5M iterations?
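As a sanity check on these throughput figures, iterations per day follow directly from the per-iteration time; a minimal sketch (the 9 ms/iter and 14.6 ms/iter inputs are back-derived from the ~9.6M and ~5.9M iter/day claims above, not measured):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day

def iters_per_day(ms_per_iter: float) -> float:
    """LL iterations per day at a given per-iteration time in milliseconds."""
    return MS_PER_DAY / ms_per_iter

# ~9 ms/iter on one i5-2500K core reproduces the ~9.6M iter/day figure
print(iters_per_day(9.0))                     # 9600000.0
# ~14.6 ms/iter across the Q8200 reproduces the ~5.9M iter/day figure
print(round(iters_per_day(14.6) / 1e6, 1))    # 5.9
```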
#1633
Nov 2010
Germany
3·199 Posts
#1634
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts
This idea can be tweaked further if the cost of setting up a new chunk to process is low: use smaller chunks, and rather than waiting on CUDA cores that are processing chunks with lots of set bits, let CUDA cores simply grab the next available chunk.
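The grab-next-chunk scheme described here is a standard dynamic work-distribution pattern; a minimal CPU-thread analogy follows (on a GPU, thread blocks would instead increment an atomic counter in device memory, and the per-chunk work here is just a stand-in):

```python
import threading

def process_chunks(chunks, n_workers=4):
    """Hand out variable-cost chunks dynamically: each worker grabs the next
    unclaimed chunk as soon as it finishes, so expensive (densely set)
    chunks never stall the idle workers."""
    next_idx = 0
    lock = threading.Lock()
    results = [None] * len(chunks)

    def worker():
        nonlocal next_idx
        while True:
            with lock:                 # the "atomic grab" step
                i = next_idx
                next_idx += 1
            if i >= len(chunks):
                return                 # no work left
            results[i] = sum(chunks[i])  # stand-in for sieving a chunk

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(process_chunks([[1, 2], [3], [4, 5, 6]]))  # [3, 3, 15]
```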
#1635
Jun 2005
3·43 Posts
#1636
Oct 2011
7×97 Posts
Two systems with a 26M exponent as a benchmark: the 2500 manages ~9.6M iterations per day on a single core, while the 2400 manages ~4.32M iterations per day per core. The 2500 outputs ~168 GHzD/day on a 560Ti using one core, while the 2400 outputs ~160 GHzD/day on a 560 using two cores (SievePrimes = 5000 on both systems). GPU-Z shows 99% GPU load on both cards. Mfaktc on the 2500 reports CPU wait < 3%, while the two instances on the 2400 show roughly 20% CPU wait.

P95 on the 2400 is set to run all four cores: cores 1 and 3 share mfaktc instances and average 103-109 ms/iter on a 45M exponent, while core 2 averages 19.3 ms/iter. When I tried the same on the 2500, core 2 was running a 26M exponent at 18 ms/iter and core 3 (shared with mfaktc) was something ridiculous like 3 seconds per iteration, so I gave up sharing P95 on it.

So, basically, you have a 9.6M-iterations CPU = 168 GHzD/day on the 2500, and 80% of two 4.32M-iteration cores (6.912M) = 160 GHzD/day on the 2400. No M/s listed, so we've taken that out of the equation; we only have work output. The result looks the same: in terms of work per day, it seems you are more efficient using a lesser system to run a GPU.

Edit: Looking at the 8200, it's hard to really compare. I am not sure if it is from L1/L2/L3 sharing, but using a 45M exponent, the 8200 running all four cores shows 90 ms/iter, while running the exponent on cores 1/3 with cores 2/4 doing TF shows 60 ms/iter. I currently have a 550Ti in the 8200; using a single core on a 26M exponent it comes out at ~2.2M iter/day. That theoretically means it is capable of 8.8M iter/day on four cores, but in reality gives ~5.9M iter/day. You can, however, run LL on cores 1/3 and use 2/4 to power the GPU. The 550Ti outputs ~96 GHzD/day. If you give the two cores their 'full' capability, you have 96 GHzD = 4.4M iter. The 2500 = 17.5 GHzD per 1M iter, the 2400 = 23.188, and at 4.4M the 8200 = 21.36.

Here is where it gets tricky, though: cores 1 and 3 are unaffected by the GPU running on 2/4, but are affected by LL/DC on 2/4. There is only a 1.5M iter/day difference between four cores on LL and two cores on LL, so with 96 GHzD = 1.5M iter, you get 64. That makes it hard to compare.

Last fiddled with by bcp19 on 2012-03-07 at 23:24
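The "GHz-days per million LL iterations given up" comparison above reduces to a one-line ratio; a quick sketch using the figures quoted in the post (the 2400's effective iterations apply the stated 80% factor, and the second ratio comes out just under the 23.188 quoted):

```python
def ghzd_per_m_iters(ghzd_per_day: float, m_iters_given_up: float) -> float:
    """TF credit (GHz-days/day) gained per million LL iterations/day given up."""
    return ghzd_per_day / m_iters_given_up

# i5-2500: one core feeding the 560Ti
print(round(ghzd_per_m_iters(168, 9.6), 2))             # 17.5
# i5-2400: two cores at 80% effectiveness feeding the 560
print(round(ghzd_per_m_iters(160, 0.8 * 2 * 4.32), 2))  # 23.15
```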
#1637
Jun 2005
10000001₂ Posts
Again, without timings I can't say much.

In general, it only takes a small improvement in mfaktc to overcome what you could get from P95. Using your numbers, for example: a 26M exponent is worth ~22.5 GHz-days. It takes about 2.7 days to run this on the 2500 CPU, so you're generating 8.3 GHz-days/day using one core. If the GPU is outputting 168 GHz-days/day using one core, all you need is a 5% speedup from adding a second core to match that performance. If you get more than 5% more mfaktc throughput by adding a second core, it's a net win.

To give an example from my fastest system: I need 3 cores to load up my GPU. Going to 4 cores lets each instance sieve a bit deeper, giving me an overall 10% increase in performance (timings go from 7.3 sec/class with 3 instances to 8.8 sec/class with 4 instances, i.e. 4/8.8 classes per second in total versus 3/7.3). If you're shooting for overall max GHz-days/day, that's a net win. It's not intuitive that trading 25% of my CPU capability for a 10% speedup is the right thing to do, but GPUs are so much quicker than CPUs at generating GHz-days that you can't assume 10% of one equals 25% of the other.

So the first question is what you are trying to optimize. The second question is back to my original one: let's see timings for mfaktc running on 1, 2, 3, ... cores of each system. (It makes sense to use the same exponent and bit depth for testing just to simplify things, and SievePrimes should stabilize within a few minutes of running, so you're not wasting much time.)

Last fiddled with by kjaget on 2012-03-08 at 15:45 Reason: fixed timing numbers
#1638
Oct 2011
2A7₁₆ Posts
#1639
Jun 2005
201₈ Posts
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1676 | 2021-06-30 21:23 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |