mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2012-03-06, 14:32   #1629   James Heinrich
Quote:
Originally Posted by apsen
IIRC it has a setting to pause processing when it detects specified programs running.
Indeed it does: PauseWhileRunning= setting in prime.txt (and also the related LowMemWhileRunning=). Memory isn't an issue in mfaktc, but I would love to see a PauseWhileRunning setting so I don't have to kill off mfaktc before running a game or other GPU-intensive task, or (more importantly) remember to restart it afterwards.
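For reference, the Prime95 side of this looks something like the following in prime.txt. The program names below are placeholders, and PauseCheckInterval (the polling interval in seconds) is from memory, so check the undoc.txt that ships with Prime95 for the authoritative syntax:

```ini
PauseWhileRunning=game.exe,encoder.exe
PauseCheckInterval=10
LowMemWhileRunning=backup.exe
```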
2012-03-06, 18:58   #1630   kjaget
Quote:
Originally Posted by Dubslow
No, it's about half a CPU core, there'd still be significant CPU use (unless he pushes the GPU sieve thing through as well, which I'm hoping for).
I wasn't quite sure, but I thought it cut his CPU usage in half. Since he was running 2 cores to start with, that's a 1-core reduction. Or maybe not, but at least that was my reading of it.
2012-03-06, 21:28   #1631   rcv
@All: I'm not claiming that everyone would want to do less sieving. There are many whose sole purpose is to push their PrimeNet ranking as high as possible, and for them it *does* make sense to spend all of the CPU's cores helping feed the GPU.

But, as some of you have pointed out, PrimeNet already has more TF than it needs. If you can give it 20% less TF but also give the project an extra core of P-1 or LL, it might be a net benefit to the project.

What I *am* saying is that the user should be *allowed to choose* a higher or a lower SievePrimes, according to each user's needs and personal preferences. (As long as the code works, of course.)

What I am also suggesting (in the way of sieving on the GPU) is that there are some people, such as myself, who have plenty of useful work to keep our expensive Intel i7 cores busy, and who would prefer to see the entire sieving/trial-factoring compute operation dumped onto the GPU, even at the cost of a decrease in TF performance.

Quote:
Originally Posted by TheJudger
Are you talking about sieving only or does this include the translation of set/unset bits into FC candidates, too?
Yes, I have a prototype kernel that converts the bitmap into a vector of candidates in the form your siever uploads to the GPU. But no, the timings I quoted previously did not include this kernel, because it is a relatively fixed overhead. [The timings were the approximate *incremental* costs of using a higher or lower SievePrimes on the GPU.] For the record, this step of my unoptimized prototype code can convert candidates for the trial factorer's use on my 560Ti in about 3-4% of the time it takes your code to do the trial factoring. In exchange, you don't have to copy the list of candidates from host to device.
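As a rough illustration of what such a conversion step does (the actual prototype is not posted; this is a CPU-side Python sketch with my own names, assuming 32-bit sieve words): bit i of the bitmap being set means k = k_base + i survived sieving, so the factor candidate 2*k*p + 1 still needs testing.

```python
def bitmap_to_candidates(bitmap_words, k_base, p):
    """Expand set bits of a sieve bitmap into candidate factors 2*k*p + 1."""
    out = []
    for w, word in enumerate(bitmap_words):
        while word:
            b = (word & -word).bit_length() - 1  # index of lowest set bit
            k = k_base + 32 * w + b              # surviving k value
            out.append(2 * k * p + 1)            # candidate factor to TF
            word &= word - 1                     # clear that bit
    return out

# Bits 1 and 3 set, p = 13: candidates 2*1*13+1 and 2*3*13+1.
print(bitmap_to_candidates([0b1010], 0, 13))  # [27, 79]
```

On a GPU this is a stream compaction (a prefix sum over bit counts plus a scatter), which is why it parallelizes well enough to cost only a few percent of the TF time.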

-- Rocke
2012-03-06, 21:56   #1632   bcp19

The talk here kind of brings up a thought I've been working on, namely: how efficient is it to use a faster CPU to do your GPU work? When I started, I had a 560Ti in a Core 2 Quad 8200. Two cores would get this card to around 200M/s output, which is about 75-80% of the card's capability; adding a 3rd core seemed a waste. This same card is now in an i5-2500K running around 240M/s and takes up 1 full core (under 3% wait time). Doing some calculations, a single core of the 2500K can perform around 9.6M iterations per day on a 26M exponent, while all 4 cores of the 8200 combined would only manage around 5.9M iterations per day. While using multiple cores of the 8200 is likely more power-hungry than 1 core of the 2500K, should I 'waste' 9.6M iterations of calculations per day when the same work could be done using 3 of the 4 8200 cores while only sacrificing around 4.5M iterations?
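The trade-off described above reduces to simple arithmetic. A sketch using the post's own estimates (the variable names are mine; nothing here is a fresh measurement):

```python
# LL throughput estimates quoted in the post, iterations/day on a 26M exponent.
i5_2500k_one_core = 9.6e6   # one core of the i5-2500K
q8200_all_cores = 5.9e6     # all four Q8200 cores combined

# Feeding the GPU from the 2500K costs one fast core's worth of LL work.
cost_on_2500k = i5_2500k_one_core

# Feeding it from 3 of the Q8200's 4 cores costs roughly 3/4 of its LL
# output (close to the post's "around 4.5M" estimate).
cost_on_q8200 = q8200_all_cores * 3 / 4

# LL iterations/day saved by pairing the GPU with the slower CPU instead.
print(cost_on_2500k - cost_on_q8200)  # ~5.2M
```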
2012-03-06, 23:11   #1633   Bdot

Quote:
Originally Posted by rcv
What I *am* saying is that the user should be *allowed to choose* a higher or a lower SievePrimes, according to each user's needs and personal preferences. (As long as the code works, of course.)
OK, added to the todo list: run all tests with SievePrimes at 256 and 1000. If they all succeed, then we can still decide what the new lower limit should be.
Quote:
Originally Posted by rcv
What I am also suggesting (in the way of sieving on the GPU) is that there are some people, such as myself, who have plenty of useful work to keep our expensive Intel i7 cores busy, and who would prefer to see the entire sieving/trial-factoring compute operation dumped onto the GPU, even at the cost of a decrease in TF performance.
I read the description of your prototype. It's a very neat approach, and I already fear how much code would need to be ported to OpenCL. But I'll definitely throw in some effort if you allow me to. Add a GPL header (or another license of your choice) to it to avoid misuse.
2012-03-07, 02:38   #1634   Prime95

Quote:
Originally Posted by rcv
Yes, I have a prototype kernel that converts from the bitmap to a vector of candidates in the form your siever uploads to the GPU.
I was wondering if we could do away with this step. If you divide the bit array into reasonably sized chunks, say 1KB, and put each CUDA core in charge of TFing one chunk, each core then processes trial factors until its chunk is complete. Each chunk ought to have roughly the same number of set bits. You waste some TFs at end-of-chunk processing, as CUDA cores that got a chunk with fewer set bits wait for the CUDA cores processing chunks that had more set bits. The advantage is that you save the memory accesses spent writing the vector of candidates.

This idea can be tweaked further if the cost of setting up a new chunk to process is low: use smaller chunks, and rather than waiting for CUDA cores processing chunks with lots of set bits, CUDA cores simply grab the next available chunk.
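A toy CPU-side model of the chunk idea, under the assumptions that chunks are assigned statically and that every set bit costs one trial-factoring step (the function and its names are illustrative, not mfaktc code):

```python
def chunked_tf(bits, n_workers, chunk_size):
    """Model static chunk assignment: worker (chunk_index % n_workers) owns
    each chunk and trial-factors only the set bits inside it, so no candidate
    vector is ever written to memory. Returns TFs done per worker; the spread
    between workers is the work wasted waiting at end-of-chunk sync."""
    work = [0] * n_workers
    for c in range(0, len(bits), chunk_size):
        owner = (c // chunk_size) % n_workers
        work[owner] += sum(bits[c:c + chunk_size])  # one TF per set bit
    return work

# Two chunks with uneven set bits: worker 0 does 1 TF, worker 1 does 3,
# so worker 0 idles for 2 TFs' worth of time before the next bit array.
print(chunked_tf([1, 0, 0, 0, 1, 1, 1, 0], n_workers=2, chunk_size=4))
```

The second variant in the post, grabbing the next available chunk instead of owning a fixed set, amounts to replacing the static `owner` computation with a shared work queue, which shrinks that idle spread at the cost of per-chunk setup.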
2012-03-07, 14:35   #1635   kjaget

Quote:
Originally Posted by bcp19
The talk here kind of brings up a thought I've been working on, namely: how efficient is it to use a faster CPU to do your GPU work? When I started, I had a 560Ti in a Core 2 Quad 8200. Two cores would get this card to around 200M/s output, which is about 75-80% of the card's capability; adding a 3rd core seemed a waste. This same card is now in an i5-2500K running around 240M/s and takes up 1 full core (under 3% wait time). Doing some calculations, a single core of the 2500K can perform around 9.6M iterations per day on a 26M exponent, while all 4 cores of the 8200 combined would only manage around 5.9M iterations per day. While using multiple cores of the 8200 is likely more power-hungry than 1 core of the 2500K, should I 'waste' 9.6M iterations of calculations per day when the same work could be done using 3 of the 4 8200 cores while only sacrificing around 4.5M iterations?
Impossible to tell until you give us some timing info from your mfaktc runs. That goes back to my point that we should think seriously about removing the candidates/sec report from mfaktc's output, since it's so easily misunderstood.
2012-03-07, 23:03   #1636   bcp19

Quote:
Originally Posted by kjaget
Impossible to tell until you give us some timing info from your mfaktc runs. That goes back to my point that we should think seriously about removing the candidates/sec report from mfaktc's output, since it's so easily misunderstood.
Well, since I have no timings from the 8200 on hand, how about this; not quite apples to apples, but close:
Two systems, each benchmarked with a 26M exponent: the 2500 does ~9.6M iterations per day on a single core, while the 2400 does ~4.32M. The 2500 outputs ~168 GHzD/day on a 560Ti using 1 core, while the 2400 outputs ~160 GHzD/day on a 560 using 2 cores (SievePrimes = 5000 on both systems). GPU-Z shows 99% GPU load on both cards. mfaktc on the 2500 reports CPU wait under 3%, while the two instances on the 2400 show roughly 20% CPU wait. P95 on the 2400 is set to run all 4 cores: cores 1 and 3 share the mfaktc instances and average 103-109 ms/iter on a 45M exponent, while core 2 averages 19.3 ms/iter. When I tried the same on the 2500, core 2 was running a 26M exponent at 18 ms/iter and core 3 (shared with mfaktc) was something ridiculous like 3 seconds per iteration, so I gave up the P95 share on it.

So basically you have one 9.6M-iteration CPU core = 168 GHzD/day on the 2500, versus 80% of two 4.32M-iteration cores (6.912M) = 160 GHzD/day on the 2400. No M/s figures are listed, so we've taken that out of the equation; we only have work output. The result looks the same: in terms of work per day, it seems you are more efficient using a lesser system to run a GPU.

Edit: Looking at the 8200, it's hard to really compare. I'm not sure if it's from L1/L2/L3 sharing, but using a 45M exponent, the 8200 running all 4 cores shows 90 ms/iter, while running the 45M exponent on cores 1/3 with TF on cores 2/4 shows 60 ms/iter. I currently have a 550Ti in the 8200. Using a single core on a 26M exponent it comes out at ~2.2M iter/day, which theoretically means it is capable of 8.8M iter/day on 4 cores, but in reality it gives ~5.9M. You can, however, run LL on cores 1/3 and use 2/4 to power the GPU. The 550Ti outputs ~96 GHzD/day. If you credit the 2 cores with their 'full' capability, 96 GHzD = 4.4M iter. The 2500 gets 17.5 GHzD per 1M iter, the 2400 gets 23.188, and at 4.4M the 8200 gets 21.36. Here is where it gets tricky, though: cores 1 and 3 are unaffected by the GPU running on 2/4, but are affected by LL/DC on 2/4. There is only a 1.5M iter/day difference between 4 cores on LL and 2 cores on LL, so with 96 GHzD = 1.5M iter, you get 64. That makes it hard to compare.
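The efficiency comparison running through these posts boils down to GHz-days of TF produced per million LL iterations given up. Recomputing from the numbers quoted above (the helper name is mine; the small differences from the post's 23.188 and 21.36 look like rounding in the inputs):

```python
def tf_per_m_iter(ghzd_per_day, ll_iters_sacrificed_per_day):
    """GHz-days/day of TF gained per million LL iterations/day given up."""
    return ghzd_per_day / (ll_iters_sacrificed_per_day / 1e6)

print(round(tf_per_m_iter(168, 9.6e6), 2))    # 2500 + 560Ti: 17.5
print(round(tf_per_m_iter(160, 6.912e6), 2))  # 2400 + 560:   23.15
print(round(tf_per_m_iter(96, 4.4e6), 2))     # 8200 + 550Ti: 21.82
```

By this metric the slower systems trade away fewer LL iterations per GHz-day of TF, which is the whole argument for pairing the GPU with the lesser CPU.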

Last fiddled with by bcp19 on 2012-03-07 at 23:24
2012-03-08, 14:58   #1637   kjaget

Again, without timings I can't say much.

Quote:
Originally Posted by bcp19
Well, since I have no timings from the 8200 on hand, how about this; not quite apples to apples, but close:
Two systems, each benchmarked with a 26M exponent: the 2500 does ~9.6M iterations per day on a single core, while the 2400 does ~4.32M. The 2500 outputs ~168 GHzD/day on a 560Ti using 1 core, while the 2400 outputs ~160 GHzD/day on a 560 using 2 cores (SievePrimes = 5000 on both systems). GPU-Z shows 99% GPU load on both cards. mfaktc on the 2500 reports CPU wait under 3%, while the two instances on the 2400 show roughly 20% CPU wait.
Allow SievePrimes to auto-adjust in both cases and you should get better throughput, at least on the 2400 system. Consider adding at least one more core on the 2500 system, maybe more on both. My thinking:

In general, it only takes a small improvement in mfaktc to overcome what you could get from P95. Using your numbers as an example: a 26M exponent is worth ~22.5 GHz-days, and it takes about 2.7 days to run it on the 2500 CPU, so you're generating ~8.3 GHz-days/day using 1 core.

If the GPU is outputting 168 GHz-days/day using 1 core, all you need is a 5% speedup to match that performance. If adding a second core gets you more than 5% extra mfaktc throughput, it's a net win.

To give an example from my fastest system: I need 3 cores to load up my GPU. Going to 4 cores lets each instance sieve a bit deeper, giving me an overall 10% increase in performance (timings go from 7.3 sec/class with 3 instances to 8.8 sec/class with 4; the extra instance more than offsets the slower per-class time). So if you're shooting for overall max GHz-days/day, that's a net win. It's not intuitive that trading 25% of my CPU capability for a 10% speedup is the right thing to do, but GPUs are so much quicker than CPUs at generating GHz-days that you can't trust that 10% of one is equal to 25% of the other.

So the first question is what you're trying to optimize. The second question goes back to my original one: let's see timings for mfaktc running on 1, 2, 3,... cores of each system (it makes sense to use the same exponent and bit depth for testing just to simplify things, and SievePrimes should stabilize within a few minutes of running, so you won't waste much time).
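The break-even arithmetic in this post, spelled out (all numbers are the post's own estimates):

```python
ll_credit_ghzdays = 22.5   # GHz-days credit for LL-testing a 26M exponent
days_per_test = 2.7        # one 2500 core on that exponent
gpu_ghzd_per_day = 168.0   # 560Ti TF output with one feeder core

# GHz-days/day one CPU core earns doing LL instead of feeding the GPU.
ll_rate = ll_credit_ghzdays / days_per_test   # ~8.33

# Fraction of GPU output a second feeder core must add to break even.
breakeven = ll_rate / gpu_ghzd_per_day        # ~0.05, i.e. the 5% above
print(round(100 * breakeven, 1))
```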

Last fiddled with by kjaget on 2012-03-08 at 15:45 Reason: fixed timing numbers
2012-03-08, 17:04   #1638   bcp19

Quote:
Originally Posted by kjaget
Again, without timings I can't say much.

Allow SievePrimes to auto-adjust in both cases and you should get better throughput, at least on the 2400 system. Consider adding at least one more core on the 2500 system, maybe more on both. My thinking:

In general, it only takes a small improvement in mfaktc to overcome what you could get from P95. Using your numbers as an example: a 26M exponent is worth ~22.5 GHz-days, and it takes about 2.7 days to run it on the 2500 CPU, so you're generating ~8.3 GHz-days/day using 1 core.

If the GPU is outputting 168 GHz-days/day using 1 core, all you need is a 5% speedup to match that performance. If adding a second core gets you more than 5% extra mfaktc throughput, it's a net win.

To give an example from my fastest system: I need 3 cores to load up my GPU. Going to 4 cores lets each instance sieve a bit deeper, giving me an overall 10% increase in performance (timings go from 8.x sec/class with 3 instances to 10.x sec/class with 4 instances). So if you're shooting for overall max GHz-days/day, that's a net win. It's not intuitive that trading 25% of my CPU capability for a 10% speedup is the right thing to do, but GPUs are so much quicker than CPUs at generating GHz-days that it makes sense.

So the first question is what you're trying to optimize. The second question goes back to my original one: let's see timings for mfaktc running on 1, 2, 3,... cores of each system (it makes sense to use the same exponent and bit depth for testing just to simplify things, and SievePrimes should stabilize within a few minutes of running, so you won't waste much time).
<sigh> You are clearly not understanding. Imagine this: you have 3 computer systems, an overclocked 2500K, a normal 2400 and a normal 8200. If you have a single GPU, in which system would it run most efficiently? From what I have listed in previous messages, it appears the 8200 would be most efficient, considering a single core of the 2500K can produce more LL work than the entire 8200 can, while the 8200 would only need 2 or 3 cores to produce the same amount of GPU output.
2012-03-08, 18:05   #1639   kjaget

Quote:
Originally Posted by bcp19
<sigh> You are clearly not understanding. Imagine this: you have 3 computer systems, an overclocked 2500K, a normal 2400 and a normal 8200. If you have a single GPU, in which system would it run most efficiently? From what I have listed in previous messages, it appears the 8200 would be most efficient, considering a single core of the 2500K can produce more LL work than the entire 8200 can, while the 8200 would only need 2 or 3 cores to produce the same amount of GPU output.
You have to balance that against the fact that a faster CPU can sieve deeper, allowing the GPU to generate more GHz-days per GFLOP. My gut feeling is that GPUs are so much more efficient at producing GHz-days that, until you max out SievePrimes, it always makes sense to devote CPU cores to running mfaktc rather than anything else. So if you can do that with the 8200, that might be the way to go. If you can't, the faster cards need to be paired with faster CPUs.
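The sieve-depth effect described above can be illustrated with a crude stand-in: sieving plain integers by small primes (mfaktc actually sieves candidate factors of the form 2kp+1, but the survivor-fraction behavior is the same in spirit; the function and its limits are illustrative only):

```python
def survivor_fraction(primes, limit=10**5):
    """Fraction of 1..limit not divisible by any of the given small primes."""
    alive = sum(1 for n in range(1, limit + 1)
                if all(n % p for p in primes))
    return alive / limit

# Deeper sieving leaves fewer candidates for the GPU to trial-factor, so the
# GPU earns more GHz-days per FLOP, at the cost of more CPU time sieving.
shallow = survivor_fraction([2, 3, 5])           # ~27% of candidates remain
deep = survivor_fraction([2, 3, 5, 7, 11, 13])   # ~19% remain
print(shallow, deep)
```

Each additional prime q removes roughly a 1/q slice of the remaining candidates, which is why the returns on deeper sieving diminish and why there is a sensible per-system optimum for SievePrimes.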