mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

Old 2011-12-01, 18:24   #45
kjaget
 

If we're looking to optimize, I guess one of the questions is whether we should use hardware X for factoring another bit level or turn it over to LL/DC tests. Even for GPUs, there may come a time when it makes more sense to run 2 first-time LL tests than to run enough TF at a given bit level to find 1 expected factor.

For CPUs, we have a lot of data for comparing this switch-over point (ignore for now that we know CPUs shouldn't be used to TF, period). So instead of comparing TF time on a GPU to LL time on a CPU, should we instead compare the time of each type of test running on a GPU? You can add in a fudge factor for the CPU cores needed to feed TF on a GPU to make it more reasonable.

The next question: on CPUs, I believe the TF vs LL efficiency is pretty consistent across CPU families, so the switchover point is more or less the same regardless of which architecture you're running. Is this true for GPUs as well, or do different types of GPUs behave radically differently on the two types of tests? If they scale pretty similarly, we end up with a nice way to compare TF vs LL effort without having to figure out how to compare a CPU to a GPU*


* - and probably create a bunch of other unanswerable questions, but it's a start :)
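One way to sketch that switch-over test: the standard heuristic says a Mersenne number has a factor between 2^b and 2^(b+1) with probability about 1/b, so a pass at bit level b costs about b passes' worth of time per expected factor. The timings below are made-up placeholders, not measurements from any card:

```python
# Switch-over sketch. Heuristic: ~1/b chance of a factor in [2^b, 2^(b+1)],
# so the expected TF cost per factor is (time for one pass) * b.
# All timings here are made-up placeholders.
def hours_per_expected_factor(tf_hours_one_level, bit_level):
    return tf_hours_one_level * bit_level

tf_cost = hours_per_expected_factor(1.0, 69)  # placeholder: 1 GPU-hour for 69->70
ll_cost = 40.0                                # placeholder: GPU-hours per first-time LL

# A found factor removes the need for 2 LL tests (first-time plus double-check),
# which is the "2 first-time LL tests vs 1 expected factor" comparison above.
print("keep factoring" if tf_cost < 2 * ll_cost else "switch to LL")  # keep factoring
```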

Last fiddled with by kjaget on 2011-12-01 at 18:24
Old 2011-12-01, 19:28   #46
chalsall

Quote:
Originally Posted by kjaget
If we're looking to optimize, I guess one of the questions is whether we should use hardware X for factoring another bit level or turn it over to LL/DC tests. Even for GPUs, there may come a time when it makes more sense to run 2 first-time LL tests than to run enough TF at a given bit level to find 1 expected factor.
I agree that the fundamental question is to (try to) find the (near) optimal utilization of the GPU firepower. But I think we need to consider all of the firepower available to GIMPS, not just among the GPUs.

Quote:
Originally Posted by kjaget
For CPUs, we have a lot of data for comparing this switch-over point (ignore for now that we know CPUs shouldn't be used to TF, period). So instead of comparing TF time on a GPU to LL time on a CPU, should we instead compare the time of each type of test running on a GPU? You can add in a fudge factor for the CPU cores needed to feed TF on a GPU to make it more reasonable.
But another dimension to this extremely complex problem space is that there are orders of magnitude more CPU cycles available to GIMPS than GPU cycles. Thus, when analysing the system as a whole (which is what I think we should be doing) we need to consider that even when a single GPU can (for example) do two LLs at a given exponent range in equal or less "real" time than it takes to find one factor, it would still make sense to continue doing TF on the GPUs until it takes equal or longer wall-clock time than the average CPU can do two LLs.

Quote:
Originally Posted by kjaget
The next question: on CPUs, I believe the TF vs LL efficiency is pretty consistent across CPU families, so the switchover point is more or less the same regardless of which architecture you're running. Is this true for GPUs as well, or do different types of GPUs behave radically differently on the two types of tests? If they scale pretty similarly, we end up with a nice way to compare TF vs LL effort without having to figure out how to compare a CPU to a GPU.
For the reason I (hopefully reasonably) argued above, I think we still need to find an (approximate) way of comparing CPUs and GPUs before we can reasonably determine where the payback "curves" cross.
Old 2011-12-02, 05:40   #47
Dubslow

Quote:
Originally Posted by chalsall

So, in order to be able to (approximately) compare GPUs to CPUs, we either also divide all the CPU GHzDays values by 3, or else multiply the GPU wall-clock time by 3 to bring it to (as you pointed out) "Normalized GHzDays". I have chosen the latter because GHzDays are a metric which PrimeNet has been using and people are comfortable with.
If we're going by what people are comfortable with, this thread is useless. The former is more useful as far as interpretation and optimization go.
Quote:
Originally Posted by chalsall
I find it interesting that this entire discussion has come out of one small report which isn't really all that important....
Indeed, I have mostly stopped paying attention to this thread (though I have at least skimmed every post).
Old 2011-12-02, 06:32   #48
axn
 

Quote:
Originally Posted by chalsall
Thus, when analysing the system as a whole (which is what I think we should be doing) we need to consider that even when a single GPU can (for example) do two LLs at a given exponent range in equal or less "real" time than it takes to find one factor, it would still make sense to continue doing TF on the GPUs until it takes equal or longer wall-clock time than the average CPU can do two LLs.
Your intuition has failed you. If indeed a GPU can do two LLs in the time it takes to find a factor, it is better off doing the LLs. In fact, since we wouldn't get to double-checking an exponent until years later, I would make a stronger statement: you're better off LL-ing if you can do 1 LL quicker than the "expected time to find a factor by TF".
Think about it -- GPUs are doing TF so that they can eliminate candidates, thus advancing the LL wavefront. It doesn't matter whether you eliminate via TF or via LL -- what matters is only how quickly you can eliminate them.
The only reason GPUs are doing higher TF than the CPU-optimal level is because they have relatively higher efficiency for TF compared to LL (x100 vs x10). If they were equally good at TF and LL, we would _not_ be doing additional TF with them -- we'd use them as we would any other CPU. Further, suppose that tomorrow a GPU comes out that is somehow _more_ efficient at LL than TF. In that case, the recommended option for it would be to do LLs that have less than optimal TF done.
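The elimination-rate framing can be put in a small sketch (timings are made-up placeholders, not measurements): whichever work type removes exponents from the LL wavefront faster per GPU-hour is the right assignment.

```python
# Eliminations per GPU-hour. One TF pass of [2^b, 2^(b+1)] eliminates the
# exponent with probability ~1/b; one LL result always eliminates it
# (the double-check question is set aside here, as in the post above).
def tf_eliminations_per_hour(tf_hours_one_level, bit_level):
    return (1.0 / bit_level) / tf_hours_one_level

def ll_eliminations_per_hour(ll_hours):
    return 1.0 / ll_hours

# Placeholder numbers: 1 GPU-hour for 69->70 TF, 40 GPU-hours per LL.
print(tf_eliminations_per_hour(1.0, 69))  # ~0.0145 eliminations/hour
print(ll_eliminations_per_hour(40.0))     # 0.025 -> with these numbers, LL wins
```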

Last fiddled with by axn on 2011-12-02 at 06:32
Old 2011-12-02, 06:42   #49
Dubslow

It takes two LLs to clear an exponent, so I would disagree with "you're better off LL-ing if you can do 1 LL quicker than the 'expected time to find a factor by TF'." It doesn't matter when the 2 tests take place, only that 2 are necessary to clear an exponent.

Quote:
The only reason GPUs are doing higher TF than the CPU-optimal level is because they have relatively higher efficiency for TF compared to LL (x100 vs x10). If they were equally good at TF and LL, we would _not_ be doing additional TF with them -- we'd use them as we would any other CPU. Further, suppose that tomorrow a GPU comes out that is somehow _more_ efficient at LL than TF. In that case, the recommended option for it would be to do LLs that have less than optimal TF done.
True -- but not relevant, in the sense that such a thing hasn't happened yet, and if it does, we can rethink all of this then.
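The disagreement reduces to a single parameter: how many LL tests a found factor is credited with saving. A toy check with made-up timings (1 GPU-hour for a 69→70 pass, 40 GPU-hours per LL; neither number comes from this thread), using the ~1/b factor-probability heuristic:

```python
# Does TF beat LL? Depends on how many LL tests a found factor saves.
# The ~1/b heuristic says one bit level from 2^b costs about b passes
# per expected factor. Timings are made-up placeholders.
def better_to_tf(tf_hours_one_level, bit_level, ll_hours, lls_saved_per_factor):
    expected_hours_per_factor = tf_hours_one_level * bit_level
    return expected_hours_per_factor < lls_saved_per_factor * ll_hours

print(better_to_tf(1.0, 69, 40.0, 1))  # False: one-LL criterion (69 > 40)
print(better_to_tf(1.0, 69, 40.0, 2))  # True: counting the double-check (69 < 80)
```

With these placeholder numbers the answer flips on exactly this point, which is why the factor of two matters.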
Old 2011-12-02, 17:14   #50
chalsall

Quote:
Originally Posted by axn
Your intuition has failed you. If indeed a GPU can do two LLs in the time it takes to find a factor, it is better off doing the LLs. In fact, since we wouldn't get to double-checking an exponent until years later, I would make a stronger statement: you're better off LL-ing if you can do 1 LL quicker than the "expected time to find a factor by TF".
You are right. I was wrong.
Old 2011-12-06, 02:20   #51
Christenson
 

The promised Benchmark:
AMD Phenom II X6, 1 core running mfaktc 0.17 with 4620 classes, feeding a Galaxy GTX 440, all at stock speed.
TF(52765829,69,70) = 1 hr 3 min.
Computer is otherwise unloaded. SievePrimes is stable at 80,900.
OS is Xubuntu 11.10.
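For scale (my arithmetic, not part of the benchmark): with the usual ~1/b heuristic for the chance of a factor between 2^b and 2^(b+1), that timing implies:

```python
# Expected GPU time per factor implied by the benchmark above.
# Heuristic: ~1/69 chance of a factor between 2^69 and 2^70.
tf_minutes = 63            # TF(52765829, 69, 70) took 1 hr 3 min
bit_level = 69
expected_hours = tf_minutes * bit_level / 60   # 63 min divided by the 1/69 odds
print(f"~{expected_hours:.0f} GPU-hours per expected factor")   # ~72
```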

Last fiddled with by Christenson on 2011-12-06 at 02:25 Reason: Benchmark needs sievePrimes reported.
Old 2011-12-07, 19:31   #52
garo
 
garo's Avatar
 
Aug 2002
Termonfeckin, IE

22·691 Posts
Default

no factor for M25406911 from 2^68 to 2^69 [mfaktc 0.17-Win....]
tf(): total time spent: 34m 41.908s

GTX550Ti, fully loaded one instance fed by one core of an i5-750.
Old 2011-12-07, 20:19   #53
ET_

Quote:
Originally Posted by garo
no factor for M25406911 from 2^68 to 2^69 [mfaktc 0.17-Win....]
tf(): total time spent: 34m 41.908s

GTX550Ti, fully loaded one instance fed by one core of an i5-750.
A GTX 275, fed by about 70% of a core of an i5-750, needs 66 minutes to do the same work.

Luigi
Old 2012-01-03, 22:38   #54
sonjohan
 
sonjohan's Avatar
 
May 2003
Belgium

2·139 Posts
Default

If I understand the last posts here, I should not have Prime95 running on all cores while CUDA is running. Because when I check what my GTX 570M does in 24h, I don't get 100 GHz-days of credit.

But then again, I've got 2 instances of an MMORPG running idle (AFK fishing), which might be eating into the GPU as well :)
Old 2012-01-03, 22:51   #55
diamonddave
 
diamonddave's Avatar
 
Feb 2004

A016 Posts
Default

Quote:
Originally Posted by sonjohan
If I understand the last posts here, I should not have Prime95 running on all cores while CUDA is running. Because when I check what my GTX 570M does in 24h, I don't get 100 GHz-days of credit.

But then again, I've got 2 instances of an MMORPG running idle (AFK fishing), which might be eating into the GPU as well :)
What do you mean by CUDA running?

CUDALucas? Mfaktc? Other?

It's perfectly fine to have Prime95 running alongside CUDALucas, while with mfaktc I would free up 1 core from Prime95 per mfaktc instance you are running.