[QUOTE=kjaget;286165]If you're looking for a theoretical measure, we'd need to hack the code to turn off sieving so as many candidates are fed to the GPU as possible per CPU<->GPU transaction. Run as many copies of these as necessary to max the GPU (or compare this to running 1 instance and scaling it with GPU load to see if it gives the same answer).[/QUOTE]
This is what TheJudger [url=http://www.mersenneforum.org/showpost.php?p=281726&postcount=1409]does[/url] when he tests efficiency. He posted such a test on mfaktc 0.18's release.

As for what Mr. Askine was saying: yes, you'd need the number of candidates tested to get the runtime, but OTOH the average rate should correlate with GHzD/d regardless of runtime, e.g. I get ~190 M/s and a GHzD/d of roughly ~100. Then you multiply by the runtime to get the totals for the assignment: GHzD/d times runtime gives total GHz-days, just as M/s times runtime gives the total number of factor candidates.

[QUOTE=kjaget;286168]A % complete would be interesting, but in a way it's implied by the ETA field. I would like to see the timing info grouped together first (time/class & eta), then sieve primes, then the throughput stuff grouped together last. This orders it roughly by order of importance performance-wise, at least from a user's perspective. I've seen too many people set sieveprimes as low as possible to get a higher candidates/sec number when all that does is kill their run times. Hopefully moving time first will inspire them to minimize that instead of trying to max M/s by making the GPU do unnecessary work.[/quote]
[QUOTE=kjaget;286165]I'd prefer a count/960 rather than percentage. How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?[/quote]
[QUOTE=kjaget;286165]But whatever you do, I'd coordinate with Oliver so you guys keep as much of the code common as possible. Should make it easier later on when it's integrated into Prime95 (I can dream, can't I).[/QUOTE]
I agree on the coordination point, but as for integration, at this point at least we'd need to include both.
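The back-of-envelope relation above can be sketched numerically. The rate and credit figures are just the rough values quoted in the post; the runtime is a made-up example, not a measured value:

```python
# Rough figures from the post: ~190 M candidates/s and ~100 GHz-days/day.
rate = 190e6                 # factor candidates tested per second
ghzd_per_day = 100.0         # credit rate; roughly proportional to 'rate'
runtime_days = 2.0           # hypothetical length of one assignment

total_candidates = rate * 86400 * runtime_days   # total FCs in the assignment
total_ghzd = ghzd_per_day * runtime_days         # total GHz-days of credit

print(f"{total_candidates:.3e} candidates, {total_ghzd:.0f} GHz-days")
```

Either total scales linearly with runtime, which is why the instantaneous rate alone is a fair efficiency measure.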
Has anybody ever tested mfakto on nVidia cards?
[QUOTE=James Heinrich;286162]I've updated the table with that data. Does this seem reasonable?
[/QUOTE]
Thanks, looks good to me! When I checked what was new in 1.2, I did not find anything important, so I never bothered to check which cards support it; I can't really speak to that difference.

[QUOTE=James Heinrich;286162] Sorry! :blush: It's no reflection on your programming, just the design of AMD GPUs. [URL="http://www.tomshardware.com/reviews/radeon-hd-7970-benchmark-tahiti-gcn,3104-2.html"]This article[/URL] illustrates some of the problems with VLIW4 that [I]Graphics Core Next[/I] is supposed to remedy. Perhaps it can translate into better mfakto efficiency(?) [/QUOTE]
No worries, I did not take it too hard :smile:. While the article shows a basic problem, it is one that the OpenCL compiler was brilliant at circumventing. Those optimizations probably cost quite some effort, but the translated OpenCL code was reordered so much that I sometimes had trouble matching it to the original code. The compiler knows about the VLIW4/5 dependency issue, analyzes it, and reorders as much as the dependencies allow; but often it is hard to find enough independent instructions to fill the gaps.

An even bigger problem for VLIW5 are the instructions that can run only on the special "t" unit, leaving the other 4 units empty. mul32 and mul_hi are the widely discussed cases, but conversions back and forth between integer and floating-point representation are just as bad. And finally, all the operations needed for carry/borrow handling cost their share of the available GFLOPS.

[QUOTE=James Heinrich;286162] But I still [B]need more benchmark data[/B].[/QUOTE]
I need more machines to test on :grin:

[quote=Dubslow]I found that when testing a 200M number, avg. rate dropped from ~195 to ~170, maybe ~165 sometimes. When I went back to 50M, the rate went up again. Could this be due to a higher cost of checking factors?[/quote]
It may not seem like much: 2 or 3 more bits are just 2 or 3 more loops. But testing a 50M exponent usually requires only 19 loops, so that is an increase of more than 10%.
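To illustrate the carry/borrow overhead mentioned above: multi-word arithmetic split into 24-bit limbs (roughly the style these kernels work in; this is a simplified sketch, not mfakto's actual code) spends extra operations per limb just moving carries around:

```python
MASK24 = (1 << 24) - 1  # one 24-bit limb

def add_72bit(a, b):
    """Add two 72-bit numbers stored as three 24-bit limbs (lo, mid, hi).
    The shift and mask per limb are pure carry bookkeeping: they add no
    'useful' multiply/add throughput but still occupy ALU slots."""
    lo  = a[0] + b[0]
    mid = a[1] + b[1] + (lo >> 24)    # propagate carry out of the low limb
    hi  = a[2] + b[2] + (mid >> 24)   # propagate carry out of the middle limb
    return (lo & MASK24, mid & MASK24, hi & MASK24)

# e.g. 0xFFFFFF_FFFFFF + 1 carries all the way into the high limb:
print(add_72bit((MASK24, MASK24, 0), (1, 0, 0)))  # -> (0, 0, 1)
```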
I'd say: yes, it's the higher cost of checking the factors. The barrett kernel should not suffer as much from the additional loops, as its loops are simpler at the cost of some more one-time effort.

[quote=Dubslow]How do you know that it's always exactly 960 classes and that all the others don't work for a given assignment? Why couldn't it be 961, 962, or 1063?[/quote]
Because I've counted them all :smile: That's the nice thing about modulo arithmetic: it all repeats over and over ... No matter where in the circle of 4620 classes you start, you'll always hit each class exactly once. By excluding factor candidates that are 3 or 5 mod 8, as well as multiples of 3, 5, 7 and 11, you keep 2/4 * 2/3 * 4/5 * 6/7 * 10/11 = 960 of the 4620 classes.

[quote=Dubslow]Has anybody ever tested mfakto on nVidia cards?[/quote]
I had not noticed that the newer NV drivers add OpenCL 1.1 support! Thanks for the hint. Currently the "-O3" parameter to the OpenCL compiler makes it fail, but I'll try without it ...
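The count can be verified in a few lines. This is a sketch of the counting argument above, not mfakto's sieve: for a prime exponent p > 11, the candidates q = 2kp+1 of 4620 consecutive classes hit every odd residue mod 9240 = 8*3*5*7*11 exactly once, so counting the surviving odd residues counts the surviving classes:

```python
def surviving_classes():
    """Count odd residues mod 9240 that survive the filters described
    above: q must be +-1 mod 8 (a property of factors of 2^p-1) and
    not a multiple of 3, 5, 7 or 11 (trivially composite otherwise)."""
    kept = 0
    for q in range(1, 9240, 2):          # all 4620 odd residues mod 9240
        if q % 8 in (3, 5):              # factors of 2^p-1 are +-1 mod 8
            continue
        if any(q % s == 0 for s in (3, 5, 7, 11)):
            continue                     # divisible by a small prime
        kept += 1
    return kept

print(surviving_classes())  # -> 960
```

This matches the product 4620 * 2/4 * 2/3 * 4/5 * 6/7 * 10/11 = 960 exactly.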
[QUOTE=Bdot;286179]
I need more machines to test on :grin: [/QUOTE]
On a somewhat serious note: do you need anything that would help with mfakto? Do you have a 6xxx card? I would be more than willing to pitch in to help you get appropriate equipment for in-house testing.
[QUOTE=KyleAskine;286181]Do you have a 6xxx card?[/QUOTE]I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7. :smile:
[QUOTE=James Heinrich;286184]I think a 7xxx-series card would be far more useful, since things actually changed between 6 and 7. :smile:[/QUOTE]
Well, no one has one yet. On my 5870 I get around 200 M/s with Barrett32. On my 6970 I get around 120 M/s with Barrett32, and around 140 M/s with MUL24. So I think we still need major refinements for the 6xxx series. Though hopefully Barrett24 fixes everything!
[QUOTE=KyleAskine;286189]
Though hopefully Barrett24 fixes everything![/QUOTE]
Well, certainly not everything. Currently it is capable only of finding factors between 2[SUP]63[/SUP] and 2[SUP]70[/SUP]. It should be able to find them up to 2[SUP]71[/SUP], but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell whether it will stay with the 2[SUP]70[/SUP] limit, or whether I can fix it to work for all of 2[SUP]71[/SUP] as well. And once that is done, I'd like to send it out to a few people for testing.

But 2[SUP]72[/SUP], the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, let's see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check whether that can increase the efficiency on AMD GPUs ...

BTW, testing mfakto on Nvidia turns out to be way more effort than it might be worth. Nvidia's OpenCL compiler is buggy and not yet complete. I had to remove all printf's, even though they were in inactive #ifdefs. And once that was done, the compiler crashed.
[code]Error in processing command line: Don't understand command line argument "-O3"![/code]
[code](0) Error: call to external function printf is not supported[/code]
[code]Select device - Get device info - Compiling kernels .
Stack dump:
0. Running pass 'Function Pass Manager' on module ''.
1. Running pass 'Combine redundant instructions' on function '@mfakto_cl_barrett79'
mfakto-nv.exe has stopped working[/code]
Lol, I can't help; I hardly know anything about programming, only the very basics.
[QUOTE=Bdot;286230]Well, certainly not everything. Currently it is capable only of finding factors between 2[SUP]63[/SUP] and 2[SUP]70[/SUP]. It should be able to find them up to 2[SUP]71[/SUP], but at 70.8 bits I see some misses. Once I see that in the debugger I will be able to tell if it will stay with the 2[SUP]70[/SUP] limit, or if I can fix it to work for all 2[SUP]71[/SUP] as well. And once that is done, I'd like to send it out to a few people for testing.
But 2[SUP]72[/SUP], the goal of GPU-to-72, will not be possible with this kernel. The next kernel will add another 24 bits, which will certainly slow it down considerably. Or maybe just add 12 bits? Hmm, lets see ... I also started a kernel that uses 15-bit chunks in order to avoid the expensive mul_hi instructions, just to check if that maybe can increase the efficiency of AMD GPUs ... [/QUOTE] Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.
[QUOTE=Bdot;286230]I'd like to send it out to a few people for testing.[/QUOTE]
I can help test when you're ready.
[QUOTE=KyleAskine;286238]Well, since I have to use MUL24 anyway, I cannot factor to 72 as is, so I am not really losing any functionality. Though being able to factor to 71 would be helpful, since there really aren't too many candidates left that are only done to 69 or less.[/QUOTE]
The MUL24 kernel can handle up to 72; having the limit at 71 was a mistake in one of the test versions I sent you, but it is fixed in the 0.10 release.

The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more, and as the processing width is 3x24 bits, I need to limit the new kernel's bit_max to 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in a 1-2% performance gain when using a vector size of 8 instead of 4. I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...
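The bit limits discussed in this thread can be summed up in a small dispatch sketch. The function and the selection logic are illustrative only, not mfakto's actual code:

```python
def pick_kernel(bit_min, bit_max):
    """Illustrative kernel choice based on the limits in this thread:
    barrett24 works in 3x24 = 72-bit words but needs spare bits,
    capping it at bit_max 70 (and factors above 2^63);
    MUL24 is slower but covers the full range up to 2^72."""
    if bit_min >= 63 and bit_max <= 70:
        return "barrett24"   # faster, limited headroom
    if bit_max <= 72:
        return "mul24"       # handles factors all the way to 2^72
    raise ValueError("no kernel can search above 2^72")

print(pick_kernel(69, 70))   # -> barrett24
print(pick_kernel(70, 72))   # -> mul24
```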
[QUOTE=Bdot;286453]The MUL24 kernel can handle up to 72, having the limit at 71 was a mistake in one of the test versions I sent to you but fixed in the 0.10 release.
The barrett24 kernel, however, normally needs 3 spare bits. I managed to "borrow" one, but not more. As the processing width is 3x24 bits, I need to limit the new kernel's bit_max at 70. I also noticed that the new kernel's register usage seems to be very efficient, resulting in 1-2% performance gain when using a vector size of 8 instead of 4. I'll send you and flashjh a test version within the next few days. Try to save a few 69 -> 70 assignments for it ...[/QUOTE]
I will try to harvest some from GPU-to-72 tonight when I get home.