mersenneforum.org  

Old 2012-01-13, 15:22   #331
Dubslow

Quote:
Originally Posted by kjaget
If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so as many candidates are fed to the GPU as possible per CPU<->GPU transaction. Run as many copies of these as necessary to max the GPU (or compare this to running 1 instance and scaling it with GPU load to see if it gives the same answer).
This is what TheJudger does when he tests efficiency; he posted such a test with the mfaktc 0.18 release. As for what Mr. Askine was saying: yes, you'd need the number of candidates tested to get the runtime, but OTOH the average rate should correlate with GHz-days/day regardless of runtime. E.g. I get ~190 M/s and roughly ~100 GHz-days/day. Then you multiply by the runtime to get GHz-days, i.e. the total FC count per assignment.
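To spell that multiplication out (a trivial sketch; the 8-hour runtime is a made-up example, and the conversion from candidates to GHz-days depends on the exponent and bit level, so I leave it out):

```python
rate = 190e6             # candidates per second (the ~190 M/s figure above)
runtime_s = 8 * 3600     # hypothetical 8-hour assignment
total_candidates = rate * runtime_s
print(total_candidates)  # 5.472e12 factor candidates over the whole assignment
```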
Quote:
Originally Posted by kjaget
A % complete would be interesting, but in a way it's implied by the ETA field.

I would like to see the timing info grouped together first (time/class & eta), then sieve primes, then the throughput stuff grouped together last. This orders it roughly by order of importance performance-wise, at least from a user's perspective. I've seen too many people set sieveprimes as low as possible to get a higher candidates/sec number when all that does is kill their run times. Hopefully moving time first will inspire them to minimize that instead of trying to max M/s by making the GPU do unnecessary work.
Quote:
Originally Posted by kjaget
I'd prefer a count/960 rather than percentage.
How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?
Quote:
Originally Posted by kjaget
But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).
I agree on the coordination point, but as for integration, at this point at least we'd need to include both. Has anybody ever tested mfakto on nVidia cards?
Old 2012-01-13, 17:17   #332
Bdot
 

Quote:
Originally Posted by James Heinrich
I've updated the table with that data. Does this seem reasonable?
Thanks, looks good to me! When I checked what was new in 1.2 and didn't find anything important, I didn't even bother to check which cards support it, so I can't speak to that difference.

Quote:
Originally Posted by James Heinrich
Sorry!
It's no reflection on your programming, just the design of AMD GPUs. This article illustrates some of the problems with VLIW4 that Graphics Core Next is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?)
No worries, I did not take it too hard.
While this article shows a basic problem, it is one that the OpenCL compiler is brilliant at circumventing. Those optimizations probably cost quite some effort, but the translated OpenCL code was reordered so much that I sometimes had trouble matching it to the original code. The compiler knows about the VLIW4/5 dependency issue, analyzes it, and reorders as much as the dependencies allow.
But often it is hard to find enough independent instructions to fill the gaps.
An even bigger problem on VLIW5 are the instructions that can run only on the special "t" unit, leaving the other 4 units empty. mul32 and mul_hi are the widely discussed cases, but conversions back and forth between integer and floating-point representation are just as bad. And finally, all the operations needed for carry/borrow handling cost their share of the available GFLOPS.

Quote:
Originally Posted by James Heinrich
But I still need more benchmark data.
I need more machines to test on
Quote:
Originally Posted by Dubslow
I found that when testing a 200M number, avg. rate dropped from ~195 to ~170, maybe ~165 sometimes. When I went back to 50M, the rate went up again. Could this be due to a higher cost of checking factors?
It may not seem like much: 2 or 3 more bits are just 2 or 3 more loops. But testing a 50M exponent usually requires about 19 loops, so that is an increase of more than 10%. So yes, I'd say it is the higher cost of checking the factors. The barrett kernel should not suffer as much from the additional loops, as its loops are simpler, at the cost of some extra one-time effort.
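Putting rough numbers on that (my own arithmetic, using the loop counts from above):

```python
loops_50M = 19                # ~19 loops per candidate for a 50M exponent
loops_200M = loops_50M + 2    # a 200M exponent needs 2-3 more bits, hence 2-3 more loops
slowdown = loops_200M / loops_50M
print(round(slowdown, 3))     # -> 1.105, i.e. more than 10% extra work per candidate
observed = 195 / 170          # the reported drop from ~195 M/s to ~170 M/s
print(round(observed, 3))     # -> 1.147, the same ballpark
```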

Quote:
Originally Posted by Dubslow
How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?
Because I've counted them all!
That's the nice thing about modulo: it all repeats over and over ... No matter where in the circle of 4620 classes you start, you'll always hit each class exactly once. By excluding FCs that are 3 or 5 mod 8, along with multiples of 3, 5, 7 and 11, you keep 2/4 · 2/3 · 4/5 · 6/7 · 10/11 · 4620 = 960 of the 4620 classes.
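For anyone who wants to repeat the count, here is a brute-force sketch (my own illustration, not mfakto code; it works for any odd exponent p not divisible by 3, 5, 7 or 11, which covers every Mersenne exponent above 11):

```python
def usable_classes(p):
    """Count classes k mod 4620 whose candidates f = 2*k*p + 1 can contain factors."""
    count = 0
    for k in range(4620):                  # 4620 = 4 * 3 * 5 * 7 * 11
        f = 2 * k * p + 1
        if f % 8 in (3, 5):                # factors of 2^p - 1 are +-1 mod 8
            continue
        if any(f % q == 0 for q in (3, 5, 7, 11)):
            continue                       # the whole class consists of multiples of q
        count += 1
    return count

print(usable_classes(332192831))  # -> 960, matching 2/4 * 2/3 * 4/5 * 6/7 * 10/11 * 4620
```

Because 4620 is divisible by 4, 3, 5, 7 and 11, each residue check is constant across an entire class, which is why whole classes can be skipped.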

Quote:
Originally Posted by Dubslow
Has anybody ever tested mfakto on nVidia cards?
I had not noticed that the newer NV drivers add OpenCL 1.1 support! Thanks for the hint. Currently the "-O3" parameter to the OpenCL compiler makes it fail, but I'll try without it ...
Old 2012-01-13, 17:31   #333
KyleAskine
 

Quote:
Originally Posted by Bdot
I need more machines to test on
In a sort-of-serious way: do you need anything that would help with mfakto? Do you have a 6xxx card? I would be more than willing to pitch in something to help you get appropriate equipment so you can test in-house.
Old 2012-01-13, 17:45   #334
James Heinrich
 

Quote:
Originally Posted by KyleAskine
Do you have a 6xxx card?
I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7.
Old 2012-01-13, 18:47   #335
KyleAskine
 

Quote:
Originally Posted by James Heinrich
I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7.
Well, no one has one yet.

On my 5870 I get around 200 M/s with Barrett32.

On my 6970 I get around 120 M/s with Barrett32. I get around 140 M/s with MUL24.

So I think we still need major refinements with the 6xxx series.

Though hopefully Barrett24 fixes everything!

Last fiddled with by KyleAskine on 2012-01-13 at 18:48 Reason: Added last line!
Old 2012-01-14, 01:27   #336
Bdot
 

Quote:
Originally Posted by KyleAskine
Though hopefully Barrett24 fixes everything!
Well, certainly not everything. Currently it is only capable of finding factors between 2^63 and 2^70. It should be able to find them up to 2^71, but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell whether it will stay at the 2^70 limit, or whether I can fix it to work all the way to 2^71. And once that is done, I'd like to send it out to a few people for testing.

But 2^72, the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check whether that can increase the efficiency on AMD GPUs ...
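For readers wondering what the "barrett" in the kernel names refers to: the idea is to replace the division in a modular reduction by multiplications with a precomputed reciprocal. A minimal big-integer sketch (illustration only; the real kernels do this on 24-bit or 15-bit limbs without arbitrary-precision support, and the modulus values below are made up):

```python
def barrett_setup(n, k):
    # Precompute mu = floor(2^(2k) / n) once per modulus n, where n < 2^k.
    return (1 << (2 * k)) // n

def barrett_reduce(x, n, mu, k):
    # Reduce x (with x < n*n) modulo n using two multiplies and no division.
    q = (x * mu) >> (2 * k)    # quotient estimate; can undershoot by at most 2
    r = x - q * n
    while r >= n:              # hence at most two correction subtractions
        r -= n
    return r

n = (1 << 70) - 33             # a made-up 70-bit modulus
mu = barrett_setup(n, 70)      # the one-time effort per modulus
x = (1 << 139) + 12345         # some value below n*n
print(barrett_reduce(x, n, mu, 70) == x % n)  # -> True
```

The per-modulus setup is exactly the "some more one-time effort" trade-off mentioned earlier in the thread for the barrett kernels.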

BTW, testing mfakto on Nvidia is turning out to be way more effort than it might be worth. Nvidia's OpenCL compiler is buggy and incomplete. I had to remove all printfs even though they were inside inactive #ifdefs. And once that was done, the compiler crashed.
Code:
Error in processing command line: Don't understand command line argument "-O3"!
Code:
(0) Error: call to external function printf is not supported
Code:
Select device - Get device info - Compiling kernels .Stack dump:
0.      Running pass 'Function Pass Manager' on module ''.
1.      Running pass 'Combine redundant instructions' on function '@mfakto_cl_barrett79'

mfakto-nv.exe has stopped working
Old 2012-01-14, 01:30   #337
Dubslow

Lol, I can't help; I hardly know anything about programming, only the very basics.
Old 2012-01-14, 03:13   #338
KyleAskine
 

Quote:
Originally Posted by Bdot
Well, certainly not everything. Currently it is only capable of finding factors between 2^63 and 2^70. It should be able to find them up to 2^71, but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell whether it will stay at the 2^70 limit, or whether I can fix it to work all the way to 2^71. And once that is done, I'd like to send it out to a few people for testing.

But 2^72, the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check whether that can increase the efficiency on AMD GPUs ...
Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.
Old 2012-01-15, 05:44   #339
flashjh
 

Quote:
Originally Posted by Bdot
I'd like to send it out to a few people for testing.
I can help test when you're ready.
Old 2012-01-16, 09:36   #340
Bdot
 

Quote:
Originally Posted by KyleAskine
Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.
The MUL24 kernel can handle factors up to 2^72; having the limit at 2^71 was a mistake in one of the test versions I sent to you, fixed in the 0.10 release.

The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3×24 bits, I need to limit the new kernel's bit_max to 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in a 1-2% performance gain when using a vector size of 8 instead of 4.
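To picture the limb layout (my own sketch, not kernel code): three 24-bit limbs give 72 bits of raw width, and reserving spare bits for intermediate carries is what pushes bit_max down to 70.

```python
LIMB_BITS = 24

def to_limbs(f):
    # Split a factor candidate into three 24-bit limbs, least significant first.
    mask = (1 << LIMB_BITS) - 1
    return [f & mask, (f >> 24) & mask, f >> 48]

raw_width = 3 * LIMB_BITS      # 72 bits of storage
spare_bits = 2                 # normally 3 are needed; one was "borrowed" back
print(raw_width - spare_bits)  # -> 70, the new kernel's bit_max

f = (1 << 70) - 123            # a 70-bit candidate
lo, mid, hi = to_limbs(f)
assert (hi << 48) | (mid << 24) | lo == f   # the limbs reassemble losslessly
```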

I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...

Last fiddled with by Bdot on 2012-01-16 at 09:37
Old 2012-01-16, 11:36   #341
KyleAskine
 

Quote:
Originally Posted by Bdot
The MUL24 kernel can handle factors up to 2^72; having the limit at 2^71 was a mistake in one of the test versions I sent to you, fixed in the 0.10 release.

The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3×24 bits, I need to limit the new kernel's bit_max to 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in a 1-2% performance gain when using a vector size of 8 instead of 4.

I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...
I will try to harvest some from GPU to 72 tonight when I get home.