mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

bcp19 2012-03-31 18:53

[QUOTE=kladner;294989]Just for grins, you might take out the /low, or replace it with /high. It might show if something else is stealing CPU cycles from mfakto. Since the second batch defaults to /normal, I guess that would be the best comparison. Affinity is the other variable, so that might make a difference, too.[/QUOTE]

I use that same line for mfaktc and never noticed a difference, which is why I asked. I was just curious whether locking the program to 1 core caused it or whether it was something else, as I had also noticed that Task Manager reported 30% usage from 1 instance of mfakto during testing when I only had 1 core on P95. Also, 1 core of P95 and 3 cores running mfakto with Adjust=1 caused SP to climb and climb while M/s kept dropping and dropping. I exited when the time remaining on all 3 instances had climbed to over 10 hours and SP was in the 140k's. Locking all 3 at 25k SP worked fairly well, but the estimate still fluctuated between 2.5 and 3 hours to go on each instance.

Anyway, I'll try removing /low when I get my PSU for the Duo as I don't want to take the quad apart again to switch cards.

Bdot 2012-04-02 13:02

[QUOTE=bcp19;295004]I use that same line for mfaktc and never noticed a difference, which is why I asked.
...[/QUOTE]

I've seen this difference between mfakto and mfaktc as well.

I believe it is caused by the threading design AMD chose for their OpenCL implementation. When CUDA programs are built, all code to drive the GPU is compiled right into the control flow of the program (thread) that calls the GPU functions. AMD's OpenCL library instead creates another thread upon initialization that drives the GPU. OpenCL API calls just issue requests to this thread, and may or may not wait for it to complete the task.
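As a rough illustration of the design described above (a toy model, not AMD's actual runtime code), the following Python sketch starts one background thread at initialization and turns every API call into a request on a queue. The names `ToyRuntime` and `enqueue` are hypothetical and exist only for this example.

```python
import queue
import threading


class ToyRuntime:
    """Toy model: one background thread services every 'GPU' request."""

    def __init__(self):
        self._requests = queue.Queue()
        # Created once at initialization, like the driver thread described above.
        self._driver = threading.Thread(target=self._serve, daemon=True)
        self._driver.start()

    def _serve(self):
        while True:
            func, args, done, result = self._requests.get()
            result.append(func(*args))  # stand-in for driving the GPU
            done.set()                  # wake up a caller that is waiting

    def enqueue(self, func, *args, blocking=True):
        """Hand a request to the driver thread; optionally wait for it."""
        done, result = threading.Event(), []
        self._requests.put((func, args, done, result))
        if blocking:
            done.wait()
            return result[0]
        return done, result             # the caller may wait later


rt = ToyRuntime()
print(rt.enqueue(pow, 2, 10))           # blocking request
done, result = rt.enqueue(sum, [1, 2, 3], blocking=False)
done.wait()                             # non-blocking request, waited on later
print(result[0])
```

In this model the caller never drives the "GPU" itself. If every core is busy, the driver thread only runs once the scheduler gives it a time slice, so the time spent in `done.wait()` includes that scheduling delay as well.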

This design works very well if you have a "stand-by" CPU core to run the background thread. But if all cores are busy, activation of the background thread has to wait until a time slice of another task finishes. Unfortunately, mfakto counts this switching time towards the CPU wait time, which suggests that the CPU has to wait for the GPU, and consequently it increases SievePrimes. I have not yet found a way to distinguish between "wait for GPU" and "wait for CPU to process GPU requests", as this is all hidden in the OpenCL APIs.
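The feedback effect can be sketched with a toy controller. This is only an assumption about the general shape of such an auto-adjust loop, not mfakto's actual code: the function name, the thresholds, the step factor and the SievePrimes limits are all invented for illustration.

```python
def adjust_sieve_primes(sp, wait_us, class_time_us,
                        high=0.05, low=0.01, step=1.125,
                        sp_min=5000, sp_max=200000):
    """Hypothetical rule: raise SievePrimes when the CPU appears to wait
    for the GPU, lower it when there is almost no wait."""
    wait_frac = wait_us / class_time_us
    if wait_frac > high:
        sp = min(sp_max, int(sp * step))
    elif wait_frac < low:
        sp = max(sp_min, int(sp / step))
    return sp


# The GPU itself never makes the CPU wait here, but the measured wait also
# contains the delay until the busy CPU schedules the driver thread.
sp = 25000
history = []
for _ in range(30):
    gpu_wait_us = 0
    sched_delay_us = 400_000                 # 8% of a 5 s class, assumed
    measured_wait_us = gpu_wait_us + sched_delay_us
    sp = adjust_sieve_primes(sp, measured_wait_us, 5_000_000)
    history.append(sp)

print(history[0], history[-1])
```

Because the controller cannot tell the two kinds of wait apart, SievePrimes keeps climbing until it hits the upper limit, even though raising it does nothing to shorten the scheduling delay.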

On my AMD Phenom system, I need to use a fixed SievePrimes in order to be able to use it in addition to prime95. On a SandyBridge, I noticed that an available hyper-thread is fully sufficient to serve the needs. There, I can run 3x mfakto and 3 prime95 LL tests on 8 hyper-threads. 4 LL tests work as well, but lower SievePrimes too much for my taste. As the AVX FFTs are memory-bandwidth-limited on my machine, it would not be faster to run LL tests on each hyper-thread.

On another machine, a 12-CPU Xeon w/o hyper-threading, I run just 8 threads of mprime. In order to have mfakto run at full speed, I let 3 instances use 4 CPUs, each at 133% CPU (unix-style counting).

I do not set the affinity for mfakto on any of these machines; the OS normally figures out what's available. I'll add a note to my to-do list to allow setting the affinity for the sieving thread. This may be of some advantage especially on Windows, where threads are normally switched around for no good reason.

bcp19 2012-04-02 15:16

[QUOTE=Bdot;295150]I've seen this difference between mfakto and mfaktc as well.
...[/QUOTE]

Sounds like there is a little bit of 'hidden' CPU cost in running mfakto, which explains the numbers I was seeing. A single mfakto instance on my quad produced ~64GD with 0 or 1 cores running P95; with 2 cores it dropped to ~60, and with 3 it was ~56.

Bdot 2012-04-04 09:21

5 x 15 bit kernel - testers wanted!
 
I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.

I've created a kernel using a word size of 15 bits per int. This way I can completely avoid the expensive 32-bit mul and mul_hi instructions. Using 5x15 bits, it is currently capable of doing TF from 60 to 72 bits. I should be able to bring it to 73 bits soon. The kernel is still kind of immature: I have only one generic 75 x 75 bit -> 150 bit multiplication. Using an optimized squaring function, another one that only calculates the required precision, and a few other optimizations, I should be able to improve its speed by 30-50%. Currently, on HD5770 it runs at ~80% of the best kernel. For Cayman, predictions are that it is already 5% faster right now. With a little luck, HD6970 may finally be faster than HD5870 (hmm, probably just a bad joke).
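To make the 5 x 15 bit idea concrete, here is a minimal Python sketch of a schoolbook 75 x 75 -> 150 bit multiplication on 15-bit words. It is not the kernel code; the helper names are made up for this example. The point it shows is that a product of two 15-bit words always fits in 30 bits, so no high-word multiply is needed. Python integers never overflow, so the sketch does not model the register-width limits a real kernel has to respect when it sums several partial products.

```python
LIMB_BITS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 15-bit words."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian 15-bit words."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_5x15(a, b):
    """Schoolbook product of two 5-word numbers, returned as 10 words."""
    cols = [0] * 10
    for i in range(5):
        for j in range(5):
            p = a[i] * b[j]
            assert p < (1 << 30)        # 15 bit x 15 bit fits in 30 bits
            cols[i + j] += p
    out, carry = [], 0
    for c in cols:                      # propagate carries, lowest word first
        c += carry
        out.append(c & LIMB_MASK)
        carry = c >> LIMB_BITS
    assert carry == 0                   # a 150-bit result fits in 10 words
    return out


x = (1 << 75) - 1
y = (1 << 74) + 12345
assert from_limbs(mul_5x15(to_limbs(x, 5), to_limbs(y, 5))) == x * y
```

The generic routine performs 25 word multiplications. A dedicated squaring routine can reuse each cross product `a[i] * a[j]` for `i != j`, which leaves 15 distinct multiplications.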

I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else? HD69xx or HD7xxx cards are of special interest.

BTW, if anyone wants to follow/help development, I've put the source code on
[URL]https://github.com/Bdot42/mfakto[/URL]
I usually push updates whenever I have changed (improved?) anything.

KyleAskine 2012-04-04 10:52

[QUOTE=Bdot;295370]I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.
...[/QUOTE]

I am in Boston from 4/5 to 4/8, but would love to test after that!

flashjh 2012-04-04 13:05

[QUOTE=Bdot;295370]I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.
...
I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else, of special interest are HD69xx or HD7xxx ?

BTW, if anyone wants to follow/help development, I've put the source code to
[URL]https://github.com/Bdot42/mfakto[/URL]
I usually do regular updates whenever I changed (improved?) anything.[/QUOTE]

I'd love to help. I don't have a 69xx or 7xxx though, only a 5870. If you can use data from that, let me know.

Bdot 2012-04-12 12:53

release it?
 
Thanks to the very fast (and so far successful) testing of both of you, I think this kernel can be called stable very soon.

And for Cayman the result is even better than I expected: almost 50% speed-up!

So, for TF up to 70 bits, HD5870 is still fastest, with ~320M/s raw speed. TF up to 73 bits now runs at ~285 M/s (up from ~255 M/s).

HD6970 will now[SUP]*)[/SUP] do all these ranges at ~295M/s (up from ~205M/s), making it the fastest AMD card for the usual GPU272 work. At least until someone can tell how HD7970 performs.

Note, these are all raw figures without scheduling overhead - you should see 80-90% of that in the end.
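As a quick worked example of what the 80-90% rule of thumb means for the raw figures quoted above, the small sketch below converts them into expected ranges and computes the relative speed-ups. The function names are hypothetical; the input numbers are the ones from this post.

```python
def expected_range(raw_mps, low=0.80, high=0.90):
    """Expected real-world rate (M/s) for a raw kernel rate."""
    return raw_mps * low, raw_mps * high


def speedup_percent(old_mps, new_mps):
    """Relative improvement of new over old, in percent."""
    return (new_mps / old_mps - 1.0) * 100.0


raw_figures = [
    ("HD5870, up to 70 bits", 320),
    ("HD5870, up to 73 bits", 285),
    ("HD6970, all ranges", 295),
]
for label, raw in raw_figures:
    lo, hi = expected_range(raw)
    print(f"{label}: about {lo:.0f} to {hi:.0f} M/s after overhead")

print(f"HD5870 73-bit speed-up: {speedup_percent(255, 285):.1f}%")
print(f"HD6970 speed-up: {speedup_percent(205, 295):.1f}%")
```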

These significant performance improvements make me think I should release them even before I'm done with auxiliary changes like
- display GHz-days/day
- worktodo.add
- perftest modes for kernel speed
- two optional fields in mfakto.ini for username and computerid
- output datestamp lines in results.txt

File locking for worktodo and results files is already included.

GPU sieving is then the next big project.

[SIZE=1][SUP]*)[/SUP] a slight change in the kernel selection is needed to make the new kernel the default for up to 70 bits in Cayman - so far it is selected only for 71-73 bits[/SIZE]

Bdot 2012-04-30 07:41

variable progress lines
 
I've noticed there are different opinions about what mfakto should display while working on an exponent. When adding the GHz-days/day I had difficulties getting everything into a standard 80-character line. On the other hand, I usually have my terminal windows ~220 chars wide, and mfakto uses little of that.

So, here we go:

in mfakto.ini:
[code]
V5UserID=Bdot
ComputerID=mfakto
PrintFormat=[%d %T] M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz %ts %e to go, %n FCs, %rM/s, SP: %s, wait:%wus=%W%, %U@%H
[/code]
you get:
[code]
[Apr 30 09:18] M53910019[70-71]: 204/4620 45/960 4.69% 76.40GHz 5.225s 1h19m to go, 589.82M FCs, 112.88M/s, SP: 5316, wait: 106us= 0.92%, Bdot@mfakto
[/code]
These are the possible formats right now. Is there anything missing?
[code]
+ %C - class ID (n/4620) "%4d"
+ %c - class number (n/960) "%3d"
+ %p - percent complete (%) "%6.2f"
+ %g - GHz-days/day (GHz) "%7.2f"
+ %t - time per class (s) "%6G"
+ %e - ETA (d/h/m/s) "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+ %n - number of candidates (M/G) "%6.2fM"/"%6.2fG"
+ %r - rate (M/s) "%6.2f"
+ %s - SievePrimes "%7d"
+ %w - CPU wait time for GPU (us) "%6lld"
+ %W - CPU wait % (%) "%6.2f"
+ %d - date (Mon nn) "%b %d"
+ %T - time (HH:MM) "%H:%M"
+ %U - username (as configured) "%s" !! variable length, 15 chars at most
+ %H - hostname (as configured) "%s" !! variable length, 15 chars at most
+ %M - the exponent being worked on "%d" !! no fixed width to allow prepending 'M' !!
+ %l - the lower bit-limit "%2d"
+ %u - the upper bit-limit "%2d"
[/code]
The format string may combine up to 20 of these specifiers. I'll probably add a separate setting for the header line, as getting it aligned automatically is too much effort.
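For readers who want to see how such a format string could be expanded, here is a minimal Python sketch. It is not mfakto's implementation: the `SPECIFIERS` table covers only a subset of the list above, and the value keys (`class_id`, `rate_mps` and so on) are names invented for this example. A `%` that is not followed by a known letter is passed through unchanged, which is how the literal percent signs after `%p` and `%W` in the sample PrintFormat are treated here.

```python
# Subset of the specifiers listed above: letter -> (C-style format, value key).
# The value keys are hypothetical names chosen for this sketch.
SPECIFIERS = {
    "C": ("%4d", "class_id"),
    "c": ("%3d", "class_number"),
    "p": ("%6.2f", "percent"),
    "g": ("%7.2f", "ghz_days_per_day"),
    "r": ("%6.2f", "rate_mps"),
    "s": ("%7d", "sieve_primes"),
    "w": ("%6d", "wait_us"),            # "%6lld" in C
    "W": ("%6.2f", "wait_percent"),
    "U": ("%s", "user"),
    "H": ("%s", "host"),
    "M": ("%d", "exponent"),
    "l": ("%2d", "bit_min"),
    "u": ("%2d", "bit_max"),
}


def expand(print_format, values):
    """Replace each known %x specifier with its formatted value."""
    out = []
    i = 0
    while i < len(print_format):
        ch = print_format[i]
        nxt = print_format[i + 1] if i + 1 < len(print_format) else ""
        if ch == "%" and nxt in SPECIFIERS:
            c_format, key = SPECIFIERS[nxt]
            out.append(c_format % values[key])
            i += 2
        else:
            out.append(ch)              # ordinary text or a literal '%'
            i += 1
    return "".join(out)


values = {
    "class_id": 204, "class_number": 45, "percent": 4.69,
    "ghz_days_per_day": 76.40, "rate_mps": 112.88, "sieve_primes": 5316,
    "wait_us": 106, "wait_percent": 0.92, "user": "Bdot", "host": "mfakto",
    "exponent": 53910019, "bit_min": 70, "bit_max": 71,
}
print(expand("M%M[%l-%u]: %C/4620 %c/960 %p% %rM/s, SP: %s, %U@%H", values))
```

Because every numeric field has a fixed width, the printed line is padded with spaces, so it may look wider than the sample line shown earlier.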

If you do specify your UserID and ComputerID, then the result lines will also contain them. A boolean "TimeStampInResults" setting brings the results files even closer to what the original prime95 output looks like.

Dubslow 2012-04-30 08:20

:shock:
...
...
...
...
...
...
:explode:






DO WANT



:smile:

...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc?

From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger.

Thanks :smile:

LaurV 2012-04-30 08:23

[edit: I replied to Bdot's post, but took some time to write the reply, being busy at work. Dubslow posted in between]

That is a very nice idea! [edit: about customizing the output]. What a pity I have no AMD/OpenCL/GL cards...

Under windoze, you don't need to limit the line length to 80 characters. You can specify a bigger buffer (number of lines and characters per line) for the DOS prompt: just right-click on the header of the window, choose Properties, then Layout, and modify the screen buffer size. I usually have 150 characters per line with the 7x12 font (selectable from the Fonts tab), which fits perfectly even on a small (low-resolution) monitor. There are a lot of advantages in having a wider screen for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied there too!

Dubslow 2012-04-30 08:41

[QUOTE=LaurV;297967]That is a very nice idea! What a pity I have no AMD/OpenCL/GL cards...
Under windoze, you don't need to limit the line length to 80 characters, you can specify a bigger buffer (number of lines and characters per line) for the dos prompt, just rightclick on the header of the window, properties, layout, and modify screen buffer size. I usually have 150 characters per line with 7x12 font (selectable from the fonts tab), which perfectly fit even a small (low resolution) monitor. There are a lot of advantages in having a wider screen, for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied to there too![/QUOTE]

Heh, in Linux (Gnome, specifically) all you need to do is make the terminal window bigger and the output matches on the fly. That's one complaint I had about the DOS prompt :razz:


All times are UTC.