#419 | |
|
Oct 2011
679₁₀ Posts |
Quote:
Anyway, I'll try removing /low when I get my PSU for the Duo as I don't want to take the quad apart again to switch cards. |
#420 | |
|
Nov 2010
Germany
3×199 Posts |
Quote:
I believe it is caused by the threading design AMD chose for their OpenCL implementation. When CUDA programs are built, all code to drive the GPU is compiled right into the control flow of the program (thread) that calls the GPU functions. AMD's OpenCL library instead creates another thread upon initialization that drives the GPU. OpenCL API calls just issue requests to this thread, and may or may not wait for it to complete the task.

This design works very well if you have a "stand-by" CPU core to run the background thread. But if all cores are busy, activation of the background thread has to wait until the time slice of another task finishes. Unfortunately, mfakto counts this switching time towards the CPU wait time, which suggests that the CPU has to wait for the GPU, and consequently it increases SievePrimes. I have not yet found a way to distinguish between "wait for GPU" and "wait for CPU to process GPU requests", as this is all hidden in the OpenCL APIs.

On my AMD Phenom system, I need to use a fixed SievePrimes value in order to run mfakto in addition to prime95. On a Sandy Bridge, I noticed that an available hyper-thread is fully sufficient to serve its needs. There, I can run 3x mfakto and 3 prime95 LL tests on 8 hyper-threads. 4 LL tests work as well, but lower SievePrimes too much for my gusto. As the AVX FFTs are memory-bandwidth-limited on my machine, it would not be faster to run LL tests on each hyper-thread.

On another machine, a 12-CPU Xeon without hyper-threading, I run just 8 threads of mprime. To have mfakto run at full speed, I let 3 instances use 4 CPUs, each at 133% CPU (Unix-style counting).

On none of these machines did I set the affinity for mfakto; the OS normally figures out what's available. I'll add a note to my to-do list to allow setting the affinity for the sieving thread. This may be an advantage especially on Windows, where threads are normally switched around for no good reason. |
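The feedback effect described in this post can be illustrated with a toy model. This is only a sketch under stated assumptions, not mfakto's actual tuning code: the function names, the 500 us threshold, the 12.5% step and the SievePrimes limits are all hypothetical. The point it shows is that the host only sees one combined wait time, so scheduling delay pushes SievePrimes up exactly as a slow GPU would.

```python
def measured_wait_us(gpu_wait_us, sched_delay_us):
    """Wait time as the host sees it: the two causes cannot be told apart."""
    return gpu_wait_us + sched_delay_us


def adjust_sieve_primes(sieve_primes, wait_us, threshold_us=500.0,
                        step=0.125, sp_min=5000, sp_max=200000):
    """Hypothetical rule: long waits raise SievePrimes, short waits lower it."""
    factor = 1.0 + step if wait_us > threshold_us else 1.0 - step
    return max(sp_min, min(sp_max, int(sieve_primes * factor)))


def simulate(sched_delay_us, gpu_wait_us=100.0, start=25000, rounds=40):
    """Apply the rule repeatedly with a constant (assumed) wait pattern."""
    sp = start
    for _ in range(rounds):
        wait = measured_wait_us(gpu_wait_us, sched_delay_us)
        sp = adjust_sieve_primes(sp, wait)
    return sp


# A spare core serves the OpenCL driver thread immediately.
idle_core = simulate(sched_delay_us=0.0)
# All cores busy: the driver thread has to wait for a time slice.
busy_cores = simulate(sched_delay_us=2000.0)
print(idle_core, busy_cores)
```

With identical GPU behaviour in both runs, only the scheduling delay differs, yet the busy-core run drifts to the upper limit. Fixing SievePrimes, as done on the Phenom system above, simply takes this loop out of the picture.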
#421 | |
|
Oct 2011
1247₈ Posts |
Quote:
#422 |
|
Nov 2010
Germany
1125₈ Posts |
I'm in the final steps of creating a better-performing solution for HD69xx (Cayman) and probably HD7xxx as well.
I've created a kernel using a word size of 15 bits per int. This way I can completely avoid the expensive 32-bit mul and mul_hi instructions. Using 5x15 bits, it is currently capable of doing TF from 60 to 72 bits. I should be able to bring it to 73 bits soon.

The kernel is still rather immature: I have only one generic 75 x 75 bit -> 150 bit multiplication. With an optimized squaring function, another multiplication that only calculates the required precision, and a few other optimizations, I should be able to improve its speed by 30-50%. Currently, on HD5770 it runs at ~80% of the speed of the best kernel. For Cayman, predictions are that it is already 5% faster right now. With a little luck, HD6970 may finally be faster than HD5870 (hmm, probably just a bad joke).

I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else? HD69xx and HD7xxx are of special interest.

BTW, if anyone wants to follow/help development, I've put the source code at https://github.com/Bdot42/mfakto and usually do regular updates whenever I change (improve?) anything. |
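To make the 15-bit word idea concrete, here is a minimal sketch of a generic 75 x 75 bit -> 150 bit schoolbook multiplication on 5 words of 15 bits each. It is written in Python for readability and is not the OpenCL kernel from the repository; the names `to_limbs`, `from_limbs` and `mul_limbs` are made up for this illustration. The assertion checks the property the post relies on: every intermediate value fits in 32 bits, so the high half of a 32-bit product (mul_hi) is never needed.

```python
LIMB_BITS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1
NUM_LIMBS = 5  # 5 x 15 bits = 75 bits


def to_limbs(x, n=NUM_LIMBS):
    """Split a non-negative integer into n words of 15 bits, lowest first."""
    if x < 0 or x >> (LIMB_BITS * n):
        raise ValueError("value does not fit into the requested number of words")
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine 15-bit words (lowest first) into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two word arrays, carrying after every step."""
    result = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # 15 x 15 bit product (< 2^30) plus one result word plus the carry.
            t = ai * bj + result[i + j] + carry
            assert t < (1 << 32)  # fits in an unsigned 32-bit int
            result[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        result[i + len(b)] = carry
    return result


x = (1 << 75) - 1            # largest 75-bit value
y = 0x5A5A5A5A5A5A5A5A5A5    # an arbitrary 75-bit value
product = from_limbs(mul_limbs(to_limbs(x), to_limbs(y)))
print(product == x * y)
```

Carrying after each partial product keeps the sketch simple. A real kernel would more likely accumulate several products per column before normalizing, and the optimized squaring mentioned above would skip the symmetric half of the partial products.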
#423 | |
|
Oct 2011
Maryland
1221₆ Posts |
Quote:
#424 | |
|
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts |
Quote:
#425 |
|
Nov 2010
Germany
3·199 Posts |
Thanks to the very fast (and so far successful) testing of both of you, I think this kernel can be called stable very soon.
And for Cayman the result is even better than I expected: almost 50% speed-up!

So, for TF up to 70 bits, HD5870 is still fastest, with ~320 M/s raw speed. TF up to 73 bits now runs at ~285 M/s (up from ~255 M/s). HD6970 will now*) do all these ranges at ~295 M/s (up from ~205 M/s), making it the fastest AMD card for the usual GPU272 work. At least until someone can tell how HD7970 performs. Note, these are all raw figures without scheduling overhead - you should see 80-90% of that in the end.

These significant performance improvements make me think I should release them even before I'm done with auxiliary changes like
- display GHz-days/day
- worktodo.add
- perftest modes for kernel speed
- two optional fields in mfaktc.ini for username and computerid
- output datestamp lines in results.txt
File locking for worktodo and results files is already included. GPU sieving is then the next big project.

*) a slight change in the kernel selection is needed to make the new kernel the default for up to 70 bits in Cayman - so far it is selected only for 71-73 bits
Last fiddled with by Bdot on 2012-04-12 at 13:28 Reason: HD7970 |
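For anyone who wants to turn the raw figures in this post into what to expect in practice, the arithmetic can be sketched as below. The helper names are made up for this example; the rates and the 80-90% range are the ones quoted in the post.

```python
def speedup_percent(old_rate, new_rate):
    """Relative improvement of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


def expected_net_rate(raw_rate, efficiency=(0.80, 0.90)):
    """Range left after scheduling overhead, using the 80-90% rule of thumb."""
    low, high = efficiency
    return raw_rate * low, raw_rate * high


# (label, old raw M/s, new raw M/s) as quoted in the post
figures = [
    ("TF up to 73 bits", 255.0, 285.0),
    ("HD6970", 205.0, 295.0),
]
for label, old, new in figures:
    low, high = expected_net_rate(new)
    gain = speedup_percent(old, new)
    print(f"{label}: +{gain:.1f}% raw, expect about {low:.0f}-{high:.0f} M/s net")
```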
#426 |
|
Nov 2010
Germany
1125₈ Posts |
I've noticed there are different opinions about what mfakto should display while working on an exponent. When adding the GHz-days/day, I had difficulties getting everything into a standard 80-character line. On the other hand, I usually have my terminal windows ~220 chars wide, and mfakto does not use most of that.
So, here we go: in mfakto.ini:
Code:
V5UserID=Bdot
ComputerID=mfakto
PrintFormat=[%d %T] M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz %ts %e to go, %n FCs, %rM/s, SP: %s, wait:%wus=%W%, %U@%H
Code:
[Apr 30 09:18] M53910019[70-71]: 204/4620 45/960 4.69% 76.40GHz 5.225s 1h19m to go, 589.82M FCs, 112.88M/s, SP: 5316, wait: 106us= 0.92%, Bdot@mfakto
Code:
+ %C - class ID (n/4620)                "%4d"
+ %c - class number (n/960)             "%3d"
+ %p - percent complete (%)             "%6.2f"
+ %g - GHz-days/day (GHz)               "%7.2f"
+ %t - time per class (s)               "%6G"
+ %e - ETA (d/h/m/s)                    "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+ %n - number of candidates (M/G)       "%6.2fM"/"%6.2fG"
+ %r - rate (M/s)                       "%6.2f"
+ %s - SievePrimes                      "%7d"
+ %w - CPU wait time for GPU (us)       "%6lld"
+ %W - CPU wait % (%)                   "%6.2f"
+ %d - date (Mon nn)                    "%b %d"
+ %T - time (HH:MM)                     "%H:%M"
+ %U - username (as configured)         "%s"  !! variable length, 15 chars at most
+ %H - hostname (as configured)         "%s"  !! variable length, 15 chars at most
+ %M - the exponent being worked on     "%d"  !! no fixed width to allow prepending 'M' !!
+ %l - the lower bit-limit              "%2d"
+ %u - the upper bit-limit              "%2d"
If you do specify your UserID and ComputerID, then the result lines will also contain them. A boolean "TimeStampInResults" setting brings the results files even closer to what the prime95 original looks like. |
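How such a PrintFormat string could be expanded is sketched below. This is an illustration in Python, not mfakto's implementation: the `SPECS` table, the `expand` function and the key names in `SAMPLE` are hypothetical, while the placeholders and field widths follow the list in the post. The date, time, ETA and candidate count are passed in as ready-made strings to keep the sketch short, and any `%` that is not followed by a known placeholder letter is assumed to be printed literally.

```python
import re

# Placeholder letter -> (printf-style spec from the post, key in the values dict).
SPECS = {
    "C": ("%4d", "class_id"),
    "c": ("%3d", "class_number"),
    "p": ("%6.2f", "percent"),
    "g": ("%7.2f", "ghz_days_per_day"),
    "t": ("%6G", "class_time_s"),
    "e": ("%s", "eta"),             # pre-formatted, e.g. "1h19m"
    "n": ("%s", "candidates"),      # pre-formatted, e.g. "589.82M"
    "r": ("%6.2f", "rate_m_per_s"),
    "s": ("%7d", "sieve_primes"),
    "w": ("%6d", "wait_us"),        # the post lists C's "%6lld"
    "W": ("%6.2f", "wait_percent"),
    "d": ("%s", "date"),            # pre-formatted with strftime "%b %d"
    "T": ("%s", "time"),            # pre-formatted with strftime "%H:%M"
    "U": ("%s", "user"),
    "H": ("%s", "host"),
    "M": ("%d", "exponent"),
    "l": ("%2d", "bit_min"),
    "u": ("%2d", "bit_max"),
}


def expand(fmt, values):
    """Replace every known %X placeholder; leave anything else untouched."""
    def substitute(match):
        letter = match.group(1)
        if letter not in SPECS:
            return match.group(0)   # e.g. the literal "% " after %p
        spec, key = SPECS[letter]
        return spec % values[key]
    return re.sub(r"%(.)", substitute, fmt)


FORMAT = ("[%d %T] M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz %ts %e to go, "
          "%n FCs, %rM/s, SP: %s, wait:%wus=%W%, %U@%H")
SAMPLE = {
    "date": "Apr 30", "time": "09:18", "exponent": 53910019,
    "bit_min": 70, "bit_max": 71, "class_id": 204, "class_number": 45,
    "percent": 4.69, "ghz_days_per_day": 76.40, "class_time_s": 5.225,
    "eta": "1h19m", "candidates": "589.82M", "rate_m_per_s": 112.88,
    "sieve_primes": 5316, "wait_us": 106, "wait_percent": 0.92,
    "user": "Bdot", "host": "mfakto",
}
line = expand(FORMAT, SAMPLE)
print(line)
```

Apart from the padding produced by the fixed field widths, the printed line has the same fields in the same order as the sample output in the post.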
#427 |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts |
... DO WANT ...
One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc? From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger. Thanks
Last fiddled with by Dubslow on 2012-04-30 at 08:22 |
#428 |
|
Romulan Interpreter
Jun 2011
Thailand
2⁶×151 Posts |
[edit: I replied to Bdot's post, but took time to conceive the reply, busy at the job. Dubslow went in between]
That is a very nice idea! [edit: about customizing the output]. What a pity I have no AMD/OpenCL/GL cards...
Under windoze, you don't need to limit the line length to 80 characters. You can specify a bigger buffer (number of lines and characters per line) for the DOS prompt: just right-click on the header of the window, choose Properties, then Layout, and modify the screen buffer size. I usually have 150 characters per line with the 7x12 font (selectable from the Fonts tab), which fits perfectly even on a small (low-resolution) monitor. There are a lot of advantages in having a wider screen, for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied to there too!
Last fiddled with by LaurV on 2012-04-30 at 08:26 |
#429 | |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts |
Quote:
Last fiddled with by Dubslow on 2012-04-30 at 08:41 |