mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2012-01-12, 15:48   #320
kjaget
Quote:
Originally Posted by KyleAskine View Post
I have no idea, and this could be way off target, but why don't you just get the M/s value and the GPU usage? Wouldn't that be the easiest benchmark to get? Or does that not account for everything?
For one, I don't think M/sec tells you anything about run time per factor; at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof, and so would execution time, since you'd be doing a lot of unnecessary work.
Old 2012-01-12, 19:29   #321
KyleAskine
Quote:
Originally Posted by kjaget View Post
For one, I don't think M/sec tells you anything about run time per factor; at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof, and so would execution time, since you'd be doing a lot of unnecessary work.
Sure it does. Candidates / M/s = time per class. I admit I'm not sure what each class is, or why the class numbers seem to skip around at random, but I'm sure there's a simple explanation for that.
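Kyle's formula can be sanity-checked in a couple of lines. The candidate count and rate below are made-up numbers purely for illustration, not measurements from any real run:

```python
# Hypothetical numbers: a class with 50M factor candidates, processed
# at a reported rate of 120 M/s (millions of candidates per second).
candidates = 50_000_000
rate_m_per_s = 120.0

# Candidates divided by the rate gives the time to process one class.
time_per_class_s = candidates / (rate_m_per_s * 1e6)
print(round(time_per_class_s, 3))  # 0.417 seconds
```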
Old 2012-01-12, 19:55   #322
kjaget
Quote:
Originally Posted by KyleAskine View Post
Sure it does. Candidates / M/s = time per class.
Well yeah, if you record X/sec and X you can work back to seconds, but since the time is also given in the output there's no point making things more complex than they have to be. My point was that the rate by itself tells you nothing, because you have no idea how much work the GPU is doing, even if you know the rate at which it is doing it.

Last fiddled with by kjaget on 2012-01-12 at 19:59
Old 2012-01-13, 12:37   #323
Bdot
Quote:
Originally Posted by kjaget View Post
Well yeah, if you record X/sec and X you can work back to seconds, but since the time is also given in the output there's no point making things more complex than they have to be. My point was that the rate by itself tells you nothing, because you have no idea how much work the GPU is doing, even if you know the rate at which it is doing it.
I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which only 960 need to be tested; the others can be excluded right away, because they contain only factor candidates (FCs) that are divisible by 3, 5, 7 or 11. Now, I can change the current display of class numbers to a display of the class counter (e.g. 23/960), or a percent complete. What would you prefer?
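The 4620/960 split can be reproduced directly. Factor candidates of 2^p-1 have the form q = 2kp+1, and the classes are the residues of k mod 4620 = 2²·3·5·7·11; a class survives only if its q values are never divisible by 3, 5, 7 or 11, and satisfy q ≡ ±1 (mod 8). A sketch, where the exponent value is arbitrary (any odd p coprime to 3·5·7·11 gives the same count):

```python
# Count the classes of k mod 4620 that can contain a factor q = 2*k*p + 1.
p = 1000003  # arbitrary odd exponent, not divisible by 3, 5, 7 or 11

def class_survives(k: int) -> bool:
    # q mod 3,5,7,11 and q mod 8 depend only on the class of k mod 4620,
    # since 3, 5, 7, 11 divide 4620 and 8 divides 2*4620.
    q = 2 * k * p + 1
    not_sieved = all(q % s != 0 for s in (3, 5, 7, 11))
    # Any factor of a Mersenne number must be q = +/-1 mod 8.
    return not_sieved and q % 8 in (1, 7)

print(sum(class_survives(k) for k in range(4620)))  # 960
```

By the Chinese remainder theorem the five conditions are independent, which gives 2·2·4·6·10 = 960 surviving classes out of 4·3·5·7·11 = 4620.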
Old 2012-01-13, 13:06   #324
Bdot

Quote:
Originally Posted by James Heinrich View Post
I have some rough data now, enough to at least put up the chart. I may refine it slightly as I get more data, but it should be reasonably close.
Thanks a lot for this table addition!
The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support: 1.1 for all cards except those with an RVxxx chip where xxx < 700; RV700 is the first chip to support OpenCL 1.1. But I don't know whether the earlier cards supported 1.0 or no OpenCL at all ...

And since OpenCL 1.1 is required for mfakto, the same split can be used to mark the AMD cards that will not run mfakto. It should be the same as flagging anything below the HD4xxx series as "will not run".

Quote:
Originally Posted by James Heinrich View Post
You'll notice that the Radeon+mfakto combination is considerably less efficient at turning theoretical GFLOPS into GHz-days/day TF results than GeForce+mfaktc. Right now I'm using a divider of 18 (for mfaktc, I'm using 14 for older v1.x GPUs, 5 for v2.0 and 7.5 for v2.1). So that's why you see a Radeon 6990 and a GeForce GTX 570 both expecting ~282GHz-days/day, even though the 6990 has 5100 GFLOPS to the 570's 1400.
Do you really need to show that so clearly?
I now got the barrett mul24 kernel to work (correctly!), which increases efficiency by ~20-30%. But reaching the "5" divider will be hard ... maybe with the HD7970 ...
Old 2012-01-13, 13:21   #325
KyleAskine

Quote:
Originally Posted by kjaget View Post
Well yeah, if you record X/sec and X you can work back to seconds, but since the time is also given in the output there's no point making things more complex than they have to be. My point was that the rate by itself tells you nothing, because you have no idea how much work the GPU is doing, even if you know the rate at which it is doing it.
Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.

I think M/s and GPU usage should be sufficient to determine a theoretical maximum GHz-d/d.

Of course things like the CPU (which affects SievePrimes) matter too, but I think you can get a theoretical max (independent of the system) from those numbers.

Last fiddled with by KyleAskine on 2012-01-13 at 13:28 Reason: Added last line, fixed a typo
Old 2012-01-13, 14:13   #326
James Heinrich

Quote:
Originally Posted by Bdot View Post
The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support
I've updated the table with that data. Does this seem reasonable?
Code:
UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 7%");
UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 6%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 5%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 4%");

UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND ((`codename` = "Westler") OR (`codename` = "Zacate") OR (`codename` = "Ontario") OR (`codename` = "WinterPark") OR (`codename` = "BeaverCreek"));

Quote:
Originally Posted by Bdot View Post
Do you really need to show that so clearly?
Sorry!
It's no reflection on your programming, just the design of AMD GPUs. This article illustrates some of the problems with VLIW4 that Graphics Core Next is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?)

Quote:
Originally Posted by Bdot View Post
But to reach the "5" divider will be hard
It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0; I'm not quite sure why. But I still need more benchmark data: I've seen results ranging from 13x to 18x in the few benchmarks I've received so far, and I need more data points to figure out what patterns there may be.
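The divider heuristic quoted above is just theoretical GFLOPS divided by an empirical per-architecture constant. Checking it against the two cards named in the thread (GFLOPS figures and dividers are the ones quoted there):

```python
# Estimated TF throughput (GHz-days/day) = theoretical GFLOPS / empirical divider.
# Dividers as quoted in the thread: 18 for Radeon+mfakto, 5 for compute 2.0 GeForce.
def est_ghzdays_per_day(gflops: float, divider: float) -> float:
    return gflops / divider

print(round(est_ghzdays_per_day(5100, 18)))  # Radeon HD 6990: 283
print(round(est_ghzdays_per_day(1400, 5)))   # GeForce GTX 570: 280
```

Both land near the ~282 GHz-days/day figure mentioned, despite the 3.6x raw-GFLOPS gap.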
Old 2012-01-13, 14:43   #327
kjaget

Quote:
Originally Posted by KyleAskine View Post
Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.
Again, candidates per second is only useful for determining run time if you know how many candidates are needed to test an exponent. Since, as you say, this number changes depending on lots of factors, measuring the time directly is the better gauge of how long one exponent takes.

If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so that as many candidates as possible are fed to the GPU per CPU<->GPU transaction. Run as many copies as necessary to max out the GPU (or compare against a single instance scaled by GPU load, to see if that gives the same answer). Then we'd need one pass with SievePrimes maxed to find the minimum number of candidates required to test an exponent. This last step only has to be done once, since it's independent of the GPU.

Combining the peak candidates per second with the minimum number of candidates per exponent would get us close to the theoretical peak throughput.
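As a sketch of how those two measurements would combine, with entirely hypothetical numbers standing in for the two quantities described above:

```python
# Hypothetical measurements, for illustration only:
peak_rate = 180e6        # candidates/sec with sieving disabled, GPU saturated
min_candidates = 40e9    # minimum candidates per exponent, SievePrimes maxed

# Theoretical fastest time to trial-factor one exponent on this GPU:
best_case_seconds = min_candidates / peak_rate
print(round(best_case_seconds))  # 222 seconds
```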

But I'm not convinced that ignoring the real overhead in real systems is any more accurate a measurement than just seeing how long an exponent takes to run in a real system.
Old 2012-01-13, 14:46   #328
kjaget

Quote:
Originally Posted by James Heinrich View Post
It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0; I'm not quite sure why.
See http://www.mersenneforum.org/showpos...postcount=1399.

If I understand it correctly, 2.1 removed some compute resources and relies on a better scheduler to try and run more instructions in parallel. But mfaktc instruction parallelism can't be improved by the better scheduler so it gets hit by the reduced resources without any corresponding gain from the better scheduler.
Old 2012-01-13, 14:58   #329
kjaget

Quote:
Originally Posted by Bdot View Post
I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which only 960 need to be tested; the others can be excluded right away, because they contain only factor candidates (FCs) that are divisible by 3, 5, 7 or 11. Now, I can change the current display of class numbers to a display of the class counter (e.g. 23/960), or a percent complete. What would you prefer?
A % complete would be interesting, but in a way it's implied by the ETA field.

I would like to see the timing info grouped together first (time/class & ETA), then sieve primes, then the throughput stuff grouped together last. This orders it roughly by performance importance, at least from a user's perspective. I've seen too many people set SievePrimes as low as possible to get a higher candidates/sec number, when all that does is kill their run times. Hopefully putting the time first will encourage them to minimize that, instead of trying to max M/s by making the GPU do unnecessary work.

But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).
Old 2012-01-13, 15:12   #330
Dubslow

Quote:
Originally Posted by KyleAskine View Post
M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.
I found that when testing a 200M exponent, the average rate dropped from ~195 to ~170, sometimes ~165. When I went back to 50M, the rate went up again. Could this be due to a higher cost of checking factor candidates?