2012-03-31, 18:53   #419
bcp19
Oct 2011
679₁₀ Posts

Quote:
Originally Posted by kladner View Post
Just for grins, you might take out the /low, or replace it with /high. It might show if something else is stealing CPU cycles from mfakto. Since the second batch defaults to /normal, I guess that would be the best comparison. Affinity is the other variable, so that might make a difference, too.
I use that same line for mfaktc and never noticed a difference, which is why I asked. I was just curious whether locking the program to 1 core caused it or whether it was something else, as I had also noticed that Task Manager reported 30% CPU usage from 1 instance of mfakto during testing when I only had 1 core on P95. Also, running P95 on 1 core and 3 mfakto instances with Adjust=1 caused SP (SievePrimes) to climb and climb while M/s kept dropping and dropping. I exited when the time remaining on all 3 instances had climbed to over 10 hours and SP was in the 140k's. Locking all 3 at 25k SP worked fairly well, but the estimates still fluctuated between 2.5 and 3 hours to go on each instance.

Anyway, I'll try removing /low when I get my PSU for the Duo as I don't want to take the quad apart again to switch cards.

2012-04-02, 13:02   #420
Bdot
Nov 2010
Germany
3×199 Posts

I've seen this difference between mfakto and mfaktc as well.

I believe it is caused by the threading design chosen by AMD for their OpenCL implementation. When CUDA programs are built, all code to drive the GPU is compiled right into the control flow of the program (thread) that calls the GPU functions. AMD's OpenCL library creates another thread upon initialization that will drive the GPU. OpenCL API calls just issue requests to this thread, and may or may not wait for it to complete the task.

This design works very well if you have a "stand-by" CPU core to run the background thread. But if all cores are busy, the background thread cannot run until another task's time slice ends. Unfortunately, mfakto counts this switching time towards the CPU wait time, which suggests that the CPU is waiting for the GPU, and it consequently increases SievePrimes. I have not yet found a way to distinguish between "wait for GPU" and "wait for CPU to process GPU requests", as this is all hidden in the OpenCL APIs.
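
To see why a misattributed wait time can make SievePrimes run away, here is a toy model in Python. It is only a sketch: the thresholds, the step size, the limits and the names `adjust_sieve_primes` and `simulate` are assumptions made up for illustration, not mfakto's actual adjustment code.

```python
# Toy model of an auto-adjusting SievePrimes loop.
# All constants below are hypothetical, chosen only to illustrate the effect.

def adjust_sieve_primes(sieve_primes, wait_fraction,
                        low=0.02, high=0.05, step=0.125,
                        sp_min=5_000, sp_max=200_000):
    """Return the next SievePrimes value for a measured CPU wait fraction."""
    if wait_fraction > high:
        # The CPU seems to be waiting for the GPU, so give the CPU more sieving work.
        sieve_primes = int(sieve_primes * (1 + step))
    elif wait_fraction < low:
        # The GPU seems to be starved, so sieve less.
        sieve_primes = int(sieve_primes * (1 - step))
    # Clamp to the allowed range.
    return max(sp_min, min(sp_max, sieve_primes))


def simulate(wait_fraction, start=25_000, rounds=50):
    """Apply the adjustment repeatedly with a wait fraction that never changes."""
    sp = start
    for _ in range(rounds):
        sp = adjust_sieve_primes(sp, wait_fraction)
    return sp


if __name__ == "__main__":
    # A wait that comes from thread scheduling does not shrink when
    # SievePrimes grows, so the loop keeps raising it until the cap.
    print("constant 10% wait:", simulate(0.10))
    # A wait inside the target band leaves SievePrimes alone.
    print("constant 3% wait: ", simulate(0.03))
```

In this model the only way out is to fix SievePrimes by hand, which matches what was reported above for a fully loaded CPU.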

On my AMD Phenom system, I need to use a fixed SievePrimes value in order to be able to run mfakto in addition to prime95. On a SandyBridge, I noticed that an available hyper-thread is fully sufficient to serve the background thread's needs. There, I can run 3x mfakto and 3 prime95 LL tests on 8 hyper-threads. 4 LL tests work as well, but that lowers SievePrimes too much for my taste. As the AVX FFTs are memory-bandwidth-limited on my machine, it would not be faster to run LL tests on each hyper-thread.

On another machine, a 12-CPU Xeon w/o hyper-threading, I run just 8 threads of mprime. In order to have mfakto run at full speed, I let 3 instances use 4 CPUs, each at 133% CPU (unix-style counting).

I do not set the affinity for mfakto on any of these machines; the OS normally figures out what's available. I'll add a note to my to-do list to allow setting the affinity for the sieving thread. This may be of some advantage, especially on Windows, where threads are normally switched around for no good reason.

2012-04-02, 15:16   #421
bcp19
Oct 2011
1247₈ Posts

Sounds like there is a little bit of 'hidden' cpu cost in running mfakto, which explains the numbers I was seeing. A single mfakto instance on my quad with 0/1 cores P95 produced ~64GD, with 2 cores it dropped to ~60 and with 3 it was ~56.

2012-04-04, 09:21   #422
Bdot
Nov 2010
Germany
1125₈ Posts
5 x 15 bit kernel - testers wanted!

I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.

I've created a kernel using a word size of 15 bits per int. This way I can completely avoid the expensive 32-bit mul and mul_hi instructions. Using 5x15 bits, it is currently capable of doing TF 60 to 72 bits. I should be able to bring it to 73 bits soon. The kernel is still kind of immature: I have only one generic 75 x 75 bit -> 150 bit multiplication. Using an optimized squaring function, another one that only calculates the required precision and a few other optimizations I should be able to improve its speed by 30-50%. Currently, on HD5770 it runs at ~80% of the best kernel. For Cayman, predictions are that it is already 5% faster right now. With a little luck, HD6970 may finally be faster than HD5870 (hmm, probably just a bad joke).
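
For readers who want to see what "5x15 bits" means in practice, here is a minimal Python sketch of the generic 75 x 75 bit -> 150 bit schoolbook multiplication on 15-bit words. It is not the OpenCL kernel from the repository; the helper names `to_limbs`, `from_limbs` and `mul_5x15` are made up for this example. The point it illustrates is that every partial product of two 15-bit words stays below 2^30, so only the low half of a 32-bit multiply is needed. Note that a column of five such products can exceed 32 bits, so a real 32-bit kernel has to handle carries more carefully than this sketch, which simply relies on Python's big integers.

```python
# Schoolbook multiplication on 15-bit words (illustrative sketch only).

LIMB_BITS = 15
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(n, count):
    """Split a non-negative integer into `count` 15-bit words, least significant first."""
    if n < 0 or n >> (LIMB_BITS * count):
        raise ValueError("value does not fit into the requested number of words")
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Reassemble an integer from 15-bit words, least significant first."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_5x15(a, b):
    """Multiply two 5-word (75-bit) numbers, returning 10 words (150 bits)."""
    result = [0] * 10
    carry = 0
    for k in range(9):                       # columns 0..8 of the product
        acc = carry
        for i in range(max(0, k - 4), min(4, k) + 1):
            product = a[i] * b[k - i]
            assert product < (1 << 30)       # low 32 bits suffice, no mul_hi needed
            acc += product
        result[k] = acc & LIMB_MASK          # keep 15 bits in this word
        carry = acc >> LIMB_BITS             # pass the rest to the next column
    result[9] = carry
    return result


if __name__ == "__main__":
    x = (1 << 75) - 1
    y = 0x123456789ABCDEF0123
    product = from_limbs(mul_5x15(to_limbs(x, 5), to_limbs(y, 5)))
    print(product == x * y)
```

The optimized squaring mentioned above would exploit that a[i]*a[j] and a[j]*a[i] are the same product; that is left out here to keep the sketch short.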

I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else? HD69xx or HD7xxx cards are of special interest.

BTW, if anyone wants to follow/help development, I've put the source code at
https://github.com/Bdot42/mfakto
I usually push updates whenever I have changed (improved?) anything.

2012-04-04, 10:52   #423
KyleAskine
Oct 2011
Maryland
122₁₆ Posts

I am in Boston from 4/5 to 4/8, but would love to test after that!

2012-04-04, 13:05   #424
flashjh
"Jerry"
Nov 2011
Vancouver, WA
1,123 Posts

Quote:
Originally Posted by Bdot View Post
I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.
...
I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else, of special interest are HD69xx or HD7xxx ?

BTW, if anyone wants to follow/help development, I've put the source code to
https://github.com/Bdot42/mfakto
I usually do regular updates whenever I changed (improved?) anything.
I'd love to help. I don't have a 69xx or 7xxx though, only a 5870. If you can use the data, let me know.

2012-04-12, 12:53   #425
Bdot
Nov 2010
Germany
3·199 Posts
release it?

Thanks to the very fast (and so far successful) testing of both of you, I think this kernel can be called stable very soon.

And for Cayman the result is even better than I expected: almost 50% speed-up!

So, for TF up to 70 bits, HD5870 is still fastest, with ~320M/s raw speed. TF up to 73 bits now runs at ~285 M/s (up from ~255 M/s).

HD6970 will now*) do all these ranges at ~295M/s (up from ~205M/s), making it the fastest AMD card for the usual GPU272 work. At least until someone can tell how HD7970 performs.

Note, these are all raw figures without scheduling overhead - you should see 80-90% of that in the end.
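
As a quick sanity check of the figures above, the following Python snippet works out the speed-ups and the range you could expect after scheduling overhead. The helper names are made up for this example; the 80-90% band is taken from the note above, and the input numbers are the raw M/s values quoted in this post.

```python
# Worked arithmetic for the raw-speed figures quoted above (illustrative helpers).

def speedup_percent(old_rate, new_rate):
    """Relative speed-up of new_rate over old_rate, in percent."""
    return (new_rate / old_rate - 1.0) * 100.0


def effective_range(raw_rate, low=0.80, high=0.90):
    """Expected throughput band once 10-20% scheduling overhead is subtracted."""
    return raw_rate * low, raw_rate * high


if __name__ == "__main__":
    print("HD6970: %.1f%% faster" % speedup_percent(205, 295))
    print("HD5870, up to 73 bits: %.1f%% faster" % speedup_percent(255, 285))
    lo, hi = effective_range(295)
    print("HD6970 expected in practice: %.1f to %.1f M/s" % (lo, hi))
```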

These significant performance improvements make me think I should release them even before I'm done with auxiliary changes like
- display GHz-days/day
- worktodo.add
- perftest modes for kernel speed
- two optional fields in mfakto.ini for username and computerid
- output datestamp lines in results.txt

File locking for worktodo and results files is already included.

GPU sieving is then the next big project.

*) a slight change in the kernel selection is needed to make the new kernel the default for up to 70 bits in Cayman - so far it is selected only for 71-73 bits

Last fiddled with by Bdot on 2012-04-12 at 13:28 Reason: HD7970

2012-04-30, 07:41   #426
Bdot
Nov 2010
Germany
1125₈ Posts
variable progress lines

I've noticed there are different opinions about what mfakto should display while working on an exponent. When adding the GHz-days/day I had difficulties fitting everything into a standard 80-character line. On the other hand, I usually have my terminal windows ~220 chars wide, and mfakto does not use most of that.

So, here we go:

in mfakto.ini:
Code:
V5UserID=Bdot
ComputerID=mfakto
PrintFormat=[%d %T] M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz %ts %e to go, %n FCs, %rM/s, SP: %s, wait:%wus=%W%, %U@%H
you get
Code:
[Apr 30 09:18] M53910019[70-71]:  204/4620  45/960   4.69%   76.40GHz 5.225s  1h19m to go, 589.82M FCs, 112.88M/s, SP:    5316, wait:   106us=  0.92%, Bdot@mfakto
These are the possible formats right now. Is there anything missing?
Code:
+  %C - class ID (n/4620)            "%4d"
+  %c - class number (n/960)         "%3d"
+  %p - percent complete (%)         "%6.2f"
+  %g - GHz-days/day (GHz)           "%7.2f"
+  %t - time per class (s)           "%6G"
+  %e - ETA (d/h/m/s)                "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+  %n - number of candidates (M/G)   "%6.2fM"/"%6.2fG"
+  %r - rate (M/s)                   "%6.2f"
+  %s - SievePrimes                  "%7d"
+  %w - CPU wait time for GPU (us)   "%6lld"
+  %W - CPU wait % (%)               "%6.2f"
+  %d - date (Mon nn)                "%b %d"
+  %T - time (HH:MM)                 "%H:%M"
+  %U - username (as configured)     "%s"    !! variable length, 15 chars at most
+  %H - hostname (as configured)     "%s"    !! variable length, 15 chars at most
+  %M - the exponent being worked on "%d"    !! no fixed width to allow prepending 'M' !!
+  %l - the lower bit-limit          "%2d"
+  %u - the upper bit-limit          "%2d"
The format allows a multi-selection of up to 20 of these formats. I'll probably add another setting for the header line, as it is too much effort to get that aligned automatically.

If you do specify your UserID and ComputerID, then the result lines will also contain them. A boolean "TimeStampInResults" setting lets you get the results files even closer to what the prime95 original looks like.
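
To illustrate how such a PrintFormat string could be expanded, here is a small Python sketch. It is not mfakto's C implementation: the function name `render_progress` and the `FORMATS` table are made up for this example, only a subset of the placeholders listed above is covered, and `%w` uses `"%6d"` because Python's `%` operator does not accept the C `ll` length modifier. A `%` that is not followed by a known letter is passed through unchanged, which is how the `%p%` in the example line ends up printing a literal percent sign.

```python
# Sketch of a PrintFormat expander for a subset of the placeholders listed above.

FORMATS = {
    "C": "%4d",     # class ID (n/4620)
    "c": "%3d",     # class number (n/960)
    "p": "%6.2f",   # percent complete
    "g": "%7.2f",   # GHz-days/day
    "r": "%6.2f",   # rate (M/s)
    "s": "%7d",     # SievePrimes
    "w": "%6d",     # CPU wait time for GPU (us)
    "W": "%6.2f",   # CPU wait %
    "U": "%s",      # username
    "H": "%s",      # hostname
    "M": "%d",      # exponent, no fixed width
    "l": "%2d",     # lower bit-limit
    "u": "%2d",     # upper bit-limit
}


def render_progress(fmt, values):
    """Replace each known %X placeholder in fmt with its formatted value."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt) and fmt[i + 1] in FORMATS:
            key = fmt[i + 1]
            out.append(FORMATS[key] % values[key])
            i += 2                      # skip the '%' and the placeholder letter
        else:
            out.append(ch)              # ordinary character or a literal '%'
            i += 1
    return "".join(out)


if __name__ == "__main__":
    fmt = "M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz, %rM/s, SP: %s, %U@%H"
    values = {"M": 53910019, "l": 70, "u": 71, "C": 204, "c": 45,
              "p": 4.69, "g": 76.40, "r": 112.88, "s": 5316,
              "U": "Bdot", "H": "mfakto"}
    print(render_progress(fmt, values))
```

Because every placeholder has a fixed width (except the exponent, username and hostname), consecutive progress lines stay aligned in columns.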

2012-04-30, 08:20   #427
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts


...
...
...
...
...
...







DO WANT





...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc?

From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger.

Thanks

Last fiddled with by Dubslow on 2012-04-30 at 08:22

2012-04-30, 08:23   #428
LaurV
Romulan Interpreter
Jun 2011
Thailand
2⁶×151 Posts

[edit: I replied to Bdot's post, but took time to conceive the reply, busy at the job. Dubslow went in between]

That is a very nice idea! [edit: about customizing the output]. What a pity I have no AMD/OpenCL/GL cards...

Under windoze, you don't need to limit the line length to 80 characters. You can specify a bigger buffer (number of lines and characters per line) for the DOS prompt: just right-click on the header of the window, choose Properties, then Layout, and modify the screen buffer size. I usually have 150 characters per line with the 7x12 font (selectable from the Font tab), which fits perfectly even on a small (low resolution) monitor. There are a lot of advantages in having a wider screen, for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied to there too!

Last fiddled with by LaurV on 2012-04-30 at 08:26

2012-04-30, 08:41   #429
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts

Quote:
Originally Posted by LaurV View Post
That is a very nice idea! What a pity I have no AMD/OpenCL/GL cards...
Under windoze, you don't need to limit the line length to 80 characters, you can specify a bigger buffer (number of lines and characters per line) for the dos prompt, just rightclick on the header of the window, properties, layout, and modify screen buffer size. I usually have 150 characters per line with 7x12 font (selectable from the fonts tab), which perfectly fit even a small (low resolution) monitor. There are a lot of advantages in having a wider screen, for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied to there too!
Heh, in Linux (Gnome, specifically) all you need to do is make the terminal window bigger and the output matches on the fly. That's one complaint I had about the DOS prompt

Last fiddled with by Dubslow on 2012-04-30 at 08:41
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.