mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

kjaget 2012-01-12 15:48

[QUOTE=KyleAskine;286047]I have no idea, and this could be way off target, but why don't you just get the M/s value and the GPU usage? Wouldn't that be the easiest benchmark to get? Or does that not account for everything?[/QUOTE]

For one, I don't think that M/sec has anything to do with run time per factor. Or at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof - as would execution time since you're doing a lot of unnecessary work.

KyleAskine 2012-01-12 19:29

[QUOTE=kjaget;286062]For one, I don't think that M/sec has anything to do with run time per factor. Or at least there's no easy way to map from one to the other. Turn off the siever and your M/sec would go through the roof - as would execution time since you're doing a lot of unnecessary work.[/QUOTE]

Sure it does. Candidates / M/s = time per class. I do admit that I am not sure what each class is, or why it seems to skip around at random, but I am sure there is a simple explanation of that.

kjaget 2012-01-12 19:55

[QUOTE=KyleAskine;286080]Sure it does. Candidates / M/s = time per class.[/QUOTE]

Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.

Bdot 2012-01-13 12:37

[QUOTE=kjaget;286084]Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.[/QUOTE]

I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which only 960 need to be tested; the others can be excluded right away because they contain only factor candidates (FCs) that are divisible by 3, 5, 7 or 11, or that cannot be 1 or 7 mod 8, as any factor of a Mersenne number must be. Now, I can change the current display of class numbers to a display of the class counter (e.g. 23/960), or a percent complete. What would you prefer?
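[Editor's note: the 4620/960 split can be checked directly. The sketch below is my own illustration, not mfakto's actual code: it counts the classes of k, for factor candidates q = 2kp+1, that survive both the small-prime sieve and the requirement that any factor of a Mersenne number M_p is 1 or 7 mod 8. The count comes out to 960 for any prime exponent p > 11.]

```python
# Count mfakto's surviving classes: factor candidates of M_p have the form
# q = 2*k*p + 1, and k is split into classes mod 4620 = 2^2 * 3 * 5 * 7 * 11.
# A class can be skipped if every candidate in it is divisible by 3, 5, 7
# or 11, or fails q % 8 in (1, 7) - a necessary property of Mersenne factors.
# Both conditions depend only on k mod 4620, so testing one representative
# per class suffices.
def surviving_classes(p, num_classes=4620):
    count = 0
    for k in range(num_classes):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue  # no candidate in this class can divide M_p
        if any(q % small == 0 for small in (3, 5, 7, 11)):
            continue  # whole class sieved out by a small prime
        count += 1
    return count

# The count is the same for any prime exponent p > 11:
print(surviving_classes(13))  # -> 960
print(surviving_classes(31))  # -> 960
```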

Bdot 2012-01-13 13:06

[QUOTE=James Heinrich;286059]I have some rough data now, enough to at least put up the chart. I may refine it slightly as I get more data, but it should be reasonably close.
[/QUOTE]

Thanks a lot for this table addition!
The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support: 1.1 for all cards except those with an RVxxx chip where xxx < 700; RV700 was the first chip to support OpenCL 1.1. But I don't know whether the earlier cards supported 1.0 or no OpenCL at all ...

And since OpenCL 1.1 is required for mfakto, the same split can be used to identify the AMD cards that will not run mfakto. It should be the same as marking anything below the HD4xxx series as "will not run".

[QUOTE=James Heinrich;286059]
You'll notice that the Radeon+mfakto combination is considerably less efficient at turning theoretical GFLOPS into GHz-days/day TF results than GeForce+mfaktc. Right now I'm using a divider of 18 (for mfaktc, I'm using 14 for older v1.x GPUs, 5 for v2.0 and 7.5 for v2.1). So that's why you see a Radeon 6990 and a GeForce GTX 570 both expecting ~282GHz-days/day, even though the 6990 has 5100 GFLOPS to the 570's 1400.
[/QUOTE]

Do you really need to show that so clearly :cry:
I've now got the Barrett mul24 kernel working (correctly!), which increases efficiency by ~20-30%. But reaching the "5" divider will be hard ... Maybe with the HD7970 ...
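[Editor's note: a minimal sketch of the chart's estimate, using the formula and dividers as quoted in the thread (GHz-days/day ≈ theoretical GFLOPS / empirical divider); the dictionary keys are my own labels.]

```python
# Empirical dividers quoted in the thread, per program / GPU generation.
DIVIDERS = {
    "mfakto": 18.0,       # current Radeon divider
    "mfaktc_v1x": 14.0,   # older NVIDIA compute 1.x GPUs
    "mfaktc_v20": 5.0,    # NVIDIA compute 2.0
    "mfaktc_v21": 7.5,    # NVIDIA compute 2.1
}

def est_ghz_days_per_day(gflops, key):
    """Rough expected TF credit rate from theoretical GFLOPS."""
    return gflops / DIVIDERS[key]

# Radeon 6990 (5100 GFLOPS) vs GeForce GTX 570 (1400 GFLOPS) land close
# together, as the post notes:
print(round(est_ghz_days_per_day(5100, "mfakto")))      # -> 283
print(round(est_ghz_days_per_day(1400, "mfaktc_v20")))  # -> 280
```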

KyleAskine 2012-01-13 13:21

[QUOTE=kjaget;286084]Well yeah, if you record X/sec and X you can work back to seconds, but since time is also given in the output, there's no point in making things more complex than they have to be. But my point was that the rate by itself tells you nothing since you have no idea how much work the GPU is doing, even if you do know what rate it is doing it at.[/QUOTE]

Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.

I think M/s and GPU usage should be sufficient to determine a theoretical maximum GHz-d/d.

Of course, things like the CPU (which affects SievePrimes) matter too, but I think you can get a theoretical max (independent of system) from those numbers.

James Heinrich 2012-01-13 14:13

[QUOTE=Bdot;286156]The question marks in the "Compute" column could be replaced by the OpenCL version that these cards support[/quote]I've updated the table with that data. Does this seem reasonable?[code]UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 7%");
UPDATE `gpu` SET `compute` = "1.2" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 6%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 5%");
UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND (`model` LIKE "Radeon HD 4%");

UPDATE `gpu` SET `compute` = "1.1" WHERE (`brand` = "A") AND ((`codename` = "Westler") OR (`codename` = "Zacate") OR (`codename` = "Ontario") OR (`codename` = "WinterPark") OR (`codename` = "BeaverCreek"));[/code]


[QUOTE=Bdot;286156]Do you really need to show that so clearly :cry:[/QUOTE]Sorry! :blush:
It's no reflection on your programming, just the design of AMD GPUs. [url=http://www.tomshardware.com/reviews/radeon-hd-7970-benchmark-tahiti-gcn,3104-2.html]This article[/url] illustrates some of the problems with VLIW4 that [i]Graphics Core Next[/i] is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?)

[QUOTE=Bdot;286156]But to reach the "5" divider will be hard[/quote]It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0, not quite sure why. But I still [b]need more benchmark data[/b]. I've seen results ranging from 13x to 18x in the few benchmarks I've received so far, I need more data points to figure out what patterns there may be.

kjaget 2012-01-13 14:43

[QUOTE=KyleAskine;286158]Which is why you need GPU usage as well.

M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.[/QUOTE]

Again, candidates per second is only useful for determining run time if you know how many candidates are needed to test an exponent. Since, as you say, this number changes depending on lots of factors, just measuring the time is a better measure of how long it takes to test one exponent.

If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so that as many candidates as possible are fed to the GPU per CPU<->GPU transaction. Then run as many copies as necessary to max out the GPU (or compare this to running one instance and scaling by GPU load, to see if it gives the same answer). Then we'd need to run through a pass with SievePrimes maxed to find the minimum number of candidates required to test an exponent. This last step would only have to be done once, since it's independent of the GPU.

Combining the peak candidates per second with the minimum number of candidates per exponent would get us close to the theoretical peak throughput.

But I'm not convinced that ignoring the real overhead in real systems is any more accurate a measurement than just seeing how long an exponent takes to run in a real system.
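[Editor's note: the proposed estimate boils down to one division. The sketch below uses hypothetical placeholder numbers, not measurements from any real GPU.]

```python
# Theoretical-peak estimate as proposed above: combine the peak candidate
# rate (sieving disabled, GPU maxed) with the minimum candidate count per
# exponent (SievePrimes maxed). Both figures here are assumed placeholders.
peak_rate_candidates_per_s = 200.0e6   # hypothetical peak GPU throughput
min_candidates_per_exponent = 9.0e12   # hypothetical minimum candidate count

# Lower bound on the run time for one exponent at peak throughput:
theoretical_seconds = min_candidates_per_exponent / peak_rate_candidates_per_s
print(theoretical_seconds)  # -> 45000.0
```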

kjaget 2012-01-13 14:46

[QUOTE=James Heinrich;286162]It's also hard for mfaktc/NVIDIA. Older v1.x GPUs are pretty close to the current Radeon efficiency, and newer v2.1 GPUs actually take a 33% performance hit compared to v2.0, not quite sure why. [/QUOTE]

See [url]http://www.mersenneforum.org/showpost.php?p=281245&postcount=1399[/url].

If I understand it correctly, 2.1 removed some compute resources and relies on a better scheduler to run more instructions in parallel. But mfaktc's instruction-level parallelism can't be improved by the better scheduler, so it gets hit by the reduced resources without any corresponding gain.

kjaget 2012-01-13 14:58

[QUOTE=Bdot;286155]I also thought a bit about what mfakto should display. Each test is split into 4620 classes, of which 960 need to be tested, the others can be excluded right away (because they contain only FC's that are divisible by 3, 5, 7 or 11). Now, I can change the current display of class numbers to a display of the class counter (e. g. 23/960), or a percent complete. What would you prefer?[/QUOTE]

A % complete would be interesting, but in a way it's implied by the ETA field.

I would like to see the timing info grouped together first (time/class & ETA), then SievePrimes, then the throughput stuff grouped together last. This orders it roughly by importance performance-wise, at least from a user's perspective. I've seen too many people set SievePrimes as low as possible to get a higher candidates/sec number, when all that does is kill their run times. Hopefully putting time first will inspire them to minimize that, instead of trying to max M/s by making the GPU do unnecessary work.

But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).

Dubslow 2012-01-13 15:12

[QUOTE=KyleAskine;286158]
M/s should be constant no matter what the assignment. Time and Candidates change. Thus M/s is better.
[/QUOTE]
I found that when testing a 200M number, the average rate dropped from ~195 to ~170, sometimes ~165. When I went back to 50M, the rate went up again. Could this be due to the higher cost of checking larger factor candidates?

