mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Dubslow 2012-02-04 20:15

I tried looking through undoc.txt, but couldn't find the following: How do multiple workers work with worktodo.add? Do I put all the assignments in one list, or do I split it up as in the .txt file? Sorry, but like I said, I couldn't find this anywhere in undoc.txt.

flashjh 2012-02-04 20:20

[QUOTE=Dubslow;288314]I tried looking through undoc.txt, but couldn't find the following: How do multiple workers work with worktodo.add? Do I put all the assignments in one list, or do I split it up as in the .txt file? Sorry, but like I said, I couldn't find this anywhere in undoc.txt.[/QUOTE]

Add a section for each worker.

[Worker #1]

[Worker #2]

etc...

It adds the work to the bottom of each worker. I've never tried it without the sections, so I don't know what would happen. The only thing I don't like is that Prime95 stops all workers when it adds work with the .add file, so all stage-2 P-1 workers have to 'start over', which takes a lot of time when there are a lot of relative primes.

Jerry

Edit: I see you already pointed out the S-2 delay...

chalsall 2012-02-04 20:32

[QUOTE=flashjh;288315]It adds the work to the bottom of each worker. I've never tried it without the sections, so I don't know what would happen.[/QUOTE]

Everything is added to Worker #1.

[QUOTE=flashjh;288315]The only thing I don't like is that Prime95 stops all workers when it adds work with the .add file, so all stage-2 P-1 workers have to 'start over', which takes a lot of time when there are a lot of relative primes.[/QUOTE]

But this is also true if you stop Prime95.

Although I agree -- I can think of no reason the current program state has to change for work underway when the program is simply adding lines to worktodo.txt from worktodo.add.

flashjh 2012-02-04 20:36

[QUOTE=chalsall;288316]But this is also true if you stop Prime95.
[/QUOTE]

Yes, which is why I don't like messing with Prime95 once it's up and running P-1s. :smile:

[Quote]Everything is added to Worker #1.[/Quote]

Ahh...

chalsall 2012-02-04 20:42

[QUOTE=Bdot;288241]I've added worktodo.add to the todo list. But it's separate from the locking.[/QUOTE]

Sorry -- was too fast (busy) yesterday. I should have thanked you for that.

[QUOTE=Bdot;288241](Some may have noticed I currently favor flock() over fcntl().)[/QUOTE]

I don't disagree at all -- flock facilitates everything you're after, and is available from the shell(s) and scripts.

Dubslow 2012-02-04 21:05

Seems to me, then, that there's absolutely no reason to use the worktodo.add function. I was looking into it because I thought it wouldn't reset anything; that's my main problem... (and it can't add work besides to Worker 1?)
[/offtopic]

chalsall 2012-02-04 21:12

[QUOTE=Dubslow;288320](and it can't add work besides to Worker 1?)[/QUOTE]

Reread the above. It *can* add work to all workers; the file is in the exact same format as worktodo.txt, and will add work to each worker as specified (although, as you point out, appended rather than in order of exponents / cost of the work / etc).
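For illustration, a worktodo.add laid out with the same worker sections as worktodo.txt might look like the following (the exponents and bit levels here are made-up placeholders, not real assignments; "N/A" stands in for an assignment key):

```
[Worker #1]
Factor=N/A,332220523,73,74

[Worker #2]
Factor=N/A,332220611,73,74
```

Each Factor= line is appended to the bottom of the queue of the worker whose section it appears under.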

DigiK-oz 2012-02-05 17:26

I am working on a sort of GUI for mfakto. I would also like it to play nice with mfaktc. This brings me to questions about the -d option.

On single-GPU systems, things are easy. I simply start mfakto/mfaktc with no parameters and let them sort it out :). This simply works.

Multi-GPU systems are harder. Mfakto seems to work nicely if I pass -d 1, -d 2, etc. Mfaktc seems to start with 0, so -d 0, -d 1, etc.?

Mixed systems (e.g. 1 ATI and 1 Nvidia GPU) are a mystery (I now simply start both with no parameters, which seems to (sometimes?) work).

Systems with any other combination (e.g. 2 ATI and 1 Nvidia GPU) are killing me. Can anybody explain the -d option in great detail, given the cases:

ATI x 2 (-d 1 and -d 2?)
Nvidia x 2 (-d 0 and -d 1?)
ATI x 1, Nvidia x 1 (no parameters for either, or one (which?) with no parameter, and the other being the second GPU?)
ATI x 2, Nvidia x 1 (baffles me)

Bottom line : how are the device-ids to use with -d allotted in above cases?

Cheers!

Dubslow 2012-02-05 18:37

Feel free to continue work, but be aware that at some point in the future (admittedly probably not for at least a year, but I'm not the authority on this) mfakt* will be integrated into Prime95. At that point, having its own GUI would probably be rendered redundant. If you still want to forge ahead though, go for it.

flashjh 2012-02-05 20:26

Is there any way to reserve the 30M exponents that need DC TF 69 to 70?

chalsall 2012-02-05 20:52

[QUOTE=flashjh;288398]Is there any way to reserve the 30M exponents that need DC TF 69 to 70?[/QUOTE]

I'm being presumptuous and assuming you're talking to me wrt G72...

If you set the Pledge Level to 70, you will first be assigned candidates which have already been TFed to 69. There are (at this moment) 22 of them.

This is just a quick hack to facilitate this -- I'll need to add the same Options feature as is available on the LLTF assignment page. Oh, and just to be pedantic, everything above 29.69M is to be taken to 70.

flashjh 2012-02-05 21:01

[QUOTE=chalsall;288402]I'm being presumptuous and assuming you're talking to me wrt G72...[/QUOTE]

Yes :smile:

[QUOTE]If you set the Pledge Level to 70, you will first be assigned candidates which have already been TFed to 69. There are (at this moment) 22 of them.[/QUOTE]

Thanks!

DigiK-oz 2012-02-06 06:29

[QUOTE=Dubslow;288391]Feel free to continue work, but be aware that at some point in the future (admittedly probably not for at least a year, but I'm not the authority on this) mfakt* will be integrated into Prime95. At that point, having its own GUI would probably be rendered redundant. If you still want to forge ahead though, go for it.[/QUOTE]

Thanks for the info. I hope mfakt* will soon be integrated into prime95. Till that happens, I want to get rid of the good old command prompt windows cluttering my taskbar :) It is a rather small programming effort (it is already working OK for single-GPU-brand systems). Of course, as soon as mfakt* is integrated I will ditch it :)

Dubslow 2012-02-06 06:35

Well, okay. Like I said, it's likely to be a year or more, though I'm still not the authority on this. mfakt* are still, compared to Prime95, somewhat immature. (No wonder, since P95's been in development for ~15 years.)

KyleAskine 2012-02-07 23:01

Does anyone have any idea how the 79xx's should perform? I have a $50 gift card to Newegg that expires at the end of the month, and am considering picking one up.

But with the absolute disaster that Cayman is in terms of TF (25% slower than Cypress), I wonder if Tahiti will be just as bad.

Edit - I am not looking for theoretical flops, since they do not tell the story with Cayman.

Bdot 2012-02-08 19:23

[QUOTE=KyleAskine;288582]Does anyone have any idea how the 79xx's should perform? I have a $50 gift card to Newegg that expires at the end of the month, and am considering picking one up.

But with the absolute disaster that Cayman is in terms of TF (25% slower than Cypress), I wonder if Tahiti will be just as bad.

Edit - I am not looking for theoretical flops, since they do not tell the story with Cayman.[/QUOTE]
That is really hard to tell until someone can try. Cayman's big issue is with 32-bit multiplications, which occupy 4 compute units (SIMDs) for one cycle. Some sources say that 3 compute units are used for that, but looking at the assembly I saw all 4 units busy with the same multiplication. Cypress, on the other hand, can also run only one 32-bit mul per cycle per SIMD array, but there only the "special" SIMD is occupied; the 4 simple ones can do other tasks.

mfakto's kernels (so far) have enough mul32 instructions that Cayman pays a big penalty for them. There are also lots of other instructions that Cypress can run in parallel with the mul32, while Cayman has to schedule them after it.

In the profiler I saw that the 800 SIMDs of a 5770 are utilized ~93% (when using proper vectoring); a SIMD sits unused because of instruction dependencies for only ~7% of the cycles. This will be similar for the 1600 SIMDs of a 5870.

The 1536 SIMDs of a 6970 should be occupied almost 100% of the time with a vector size of 4 as they are independent. The poor mul32 is what hurts.

Back to the question: How fast will the 7970 be? I expect it will also use 4 of its SIMDs for a mul32, so my assumption (which is kind of a worst case) is that mfakto throughput will scale with GFlops relative to Cayman. That is 2048 vs. 1536 SIMDs and 925 vs. 880 MHz, yielding about 1.4 times the 6970 results.
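As a sanity check on the arithmetic, the worst-case scaling estimate above can be reproduced directly (figures taken from the post; the 1.4x is simply the SIMD-count ratio times the clock ratio):

```python
# Worst-case throughput estimate for HD7970 vs. HD6970, assuming mul32 still
# occupies 4 SIMDs so that mfakto throughput scales with raw GFlops.
simds_7970, simds_6970 = 2048, 1536
mhz_7970, mhz_6970 = 925, 880

scaling = (simds_7970 / simds_6970) * (mhz_7970 / mhz_6970)
print(f"expected 7970/6970 throughput ratio: {scaling:.2f}")  # ~1.40
```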

If someone has access to one of these, I could provide an instrumented mfakto version that reports exact kernel runtime numbers.

KyleAskine 2012-02-15 15:29

I might buy a 7750 once they become available on Newegg, since I have a $50 gift card I need to use this month. If I do, I will try out mfakto once I get it!

chair 2012-02-23 04:31

Hello, I'm having trouble running 2 instances of mfakto. I have two graphics cards that meet the specs, but my trouble is where to enter the -d 2 option (at least I think it's that command) to tell the second mfakto which card to run on.
I'd say I'm new to this kind of computing, but I think this post already shows that. Any help would be nice.

Bdot 2012-02-23 19:41

[QUOTE=chair;290512]Hello, I'm having trouble running 2 instances of mfakto. I have two graphics cards that meet the specs, but my trouble is where to enter the -d 2 option (at least I think it's that command) to tell the second mfakto which card to run on.
I'd say I'm new to this kind of computing, but I think this post already shows that. Any help would be nice.[/QUOTE]

Having -d 2 as the first option is usually good, but it should not matter where you place it (except in some test modes). For example:
[code]
mfakto -d 2 -i instance2.ini
[/code]will try to use GPU #2 and read instance2.ini instead of mfakto.ini for the config parameters.

However, OpenCL also has the notion of "platforms", not just a simple numbering of all available GPUs. "Platforms" can be thought of as "vendors". Therefore, if you have other OpenCL-enabled devices (built-in graphics, other GPUs), you may have to use the correct platform number. It's hard to predict in which order the platforms will appear. The [I]clinfo[/I] tool can help you find out, but you need to install the AMD APP SDK to use it. Or you can just try it out:
[code]
mfakto -d 11
mfakto -d 21
mfakto -d 31
...
[/code]This will use the first device of platform 1, 2, 3, ... Once you've found the platform that contains a GPU you wish to use, you can increase the last digit to use other devices of the same platform:
[code]
mfakto -d 22
mfakto -d 23
...
[/code]
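To summarize the numbering scheme: a two-digit -d value appears to encode platform and device, while a single digit selects a device on the default platform. This is an inference from the examples above, not a description of mfakto's actual argument parser; the helper below is purely illustrative:

```python
# Hypothetical decoder mirroring the apparent -d convention from the examples:
# "-d 21" = platform 2, device 1; a single digit = device on the default
# platform. Not mfakto's real code -- an inference for illustration only.
def decode_d(value: int):
    if value >= 10:
        return value // 10, value % 10   # (platform, device)
    return None, value                   # (default platform, device)

print(decode_d(21))  # (2, 1): first device of platform 2
print(decode_d(2))   # (None, 2): device 2 on the default platform
```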

Bdot 2012-03-16 10:10

Catalyst 12.2 seems to work well
 
AMD released Catalyst version 12.2. I have it running on two W7-64 machines and the first tests went well. I'll do a few tests on Linux later, but I'd say this time no new bugs interfere with mfakto.

bcp19 2012-03-30 21:08

Has anyone tried the new 12.3 catalyst version yet?

flashjh 2012-03-30 22:12

[QUOTE=bcp19;294892]Has anyone tried the new 12.3 catalyst version yet?[/QUOTE]

Not yet, but 12.2 caused blue screens for me.

KyleAskine 2012-03-31 00:11

[QUOTE=flashjh;294897]Not yet, but 12.2 caused blue screens for me.[/QUOTE]

Odd. 12.2 works perfectly for me on all boxes.

flashjh 2012-03-31 00:16

[QUOTE=KyleAskine;294915]Odd. 12.2 works perfectly for me on all boxes.[/QUOTE]

I never spent any time troubleshooting the problem; I've been too busy. It was easier to revert and let it go. Is 12.2/.3 faster for you?

chair 2012-03-31 07:54

Thanks for the help, Bdot. I was able to get the second instance running.

bcp19 2012-03-31 13:40

Bdot, when running some tests with mfakto yesterday, I noticed that when I used the first half of the following in a batch file, the program ran at only about 2/3 the speed it did when I ran the second half.

[code]c:
cd mfakto
cmd.exe /k "start /b /low /affinity 0x08 mfakto-x64.exe"

vs

c:
cd mfakto
mfakto-x64.exe[/code]

Any thoughts?

Dubslow 2012-03-31 13:51

This can't really account for the speed difference, but if you're launching it from a batch file, calling cmd.exe shouldn't be necessary. This is what I use (developed by kladner):
[code]C:
cd C:\Users\Bill\GPU-Prime95\mfaktc-0.17
start /low /affinity 8 mfaktc-win-64[/code]
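A note on the /affinity argument used in both batch files: it takes a hexadecimal bitmask in which bit n pins the process to logical core n, so 0x08 (or just 8) selects core 3. A tiny illustration of the mask arithmetic:

```python
# start /affinity takes a hex bitmask: bit n = logical core n.
def affinity_mask(*cores):
    mask = 0
    for c in cores:
        mask |= 1 << c   # set one bit per requested core
    return mask

print(hex(affinity_mask(3)))     # 0x8 -> core 3, as in the batch files above
print(hex(affinity_mask(0, 1)))  # 0x3 -> cores 0 and 1
```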

bcp19 2012-03-31 14:32

Looking for data
 
I'd like to work up some charts for GPUs running mfakto to determine how well they work once the cores running them are factored in.

If you could PM me your CPU and GPU model (Q8200/5770), # cores running P95/mfakto (1/3, 2/2, 5/1), exponent size and iteration time on the P95 cores (45381221/.059ms, 26202373/.035ms), and exponent size, bit level, and rough average time for each mfakto instance (30311929/68-69/74m29s, 30363997/68-69/73m52s, 29499839/69-70/162m11s), it would be a great help. Thanks in advance.

kladner 2012-03-31 16:53

[QUOTE=bcp19;294968]Bdot, when running some tests with mfakto yesterday, I noticed that when I used the first half of the following in a batch file, the program ran at only about 2/3 the speed it did when I ran the second half.[/QUOTE]

Just for grins, you might take out the /low, or replace it with /high. It might show if something else is stealing CPU cycles from mfakto. Since the second batch defaults to /normal, I guess that would be the best comparison. Affinity is the other variable, so that might make a difference, too.

KyleAskine 2012-03-31 17:46

[QUOTE=flashjh;294916]I never spent any time troubleshooting the problem; I've been too busy. It was easier to revert and let it go. Is 12.2/.3 faster for you?[/QUOTE]

Not really. About the same as 12.1

bcp19 2012-03-31 18:53

[QUOTE=kladner;294989]Just for grins, you might take out the /low, or replace it with /high. It might show if something else is stealing CPU cycles from mfakto. Since the second batch defaults to /normal, I guess that would be the best comparison. Affinity is the other variable, so that might make a difference, too.[/QUOTE]

I use that same line for mfaktc and never noticed a difference, which is why I asked. I was just curious whether locking the program to 1 core caused it or it was something else, as I had also noticed that Task Manager reported 30% usage from 1 instance of mfakto during testing when I only had 1 core on P95. Also, 1 core P95 and 3 cores mfakto with Adjust=1 caused SP to climb and climb while M/s kept dropping and dropping. I exited when the time remaining on all 3 instances had climbed to over 10 hours and SP was in the 140k's. Locking all 3 at 25k SP worked fairly well, but the estimate still fluctuated between 2.5 and 3 hours to go on each instance.

Anyway, I'll try removing /low when I get my PSU for the Duo as I don't want to take the quad apart again to switch cards.

Bdot 2012-04-02 13:02

[QUOTE=bcp19;295004]I use that same line for mfaktc and never noticed a difference, which is why I asked. I was just curious if locking the program to 1 core caused it or if it was something else ...[/QUOTE]

I've seen this difference between mfakto and mfaktc as well.

I believe it is caused by the threading design chosen by AMD for their OpenCL implementation. When CUDA programs are built, all the code to drive the GPU is compiled right into the control flow of the program (thread) that calls the GPU functions. AMD's OpenCL library creates another thread upon initialization that drives the GPU. OpenCL API calls just issue requests to this thread, and may or may not wait for it to complete the task.

This design works very well if you have a "stand-by" CPU core to run the background thread. But if all cores are busy, then activation of the background thread has to wait until another task's time slice finishes. Unfortunately, mfakto counts this switching time towards the CPU wait time, indicating that the CPU has to wait for the GPU, and consequently increases SievePrimes. I have not yet found a way to distinguish between "wait for GPU" and "wait for CPU to process GPU requests", as this is all hidden in the OpenCL APIs.
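The self-tuning described here can be pictured with a toy model: measured wait time above a target pushes SievePrimes up, shifting work from the GPU queue to the CPU sieve. The adjustment rule, step size, and bounds below are illustrative guesses, not mfakto's actual code:

```python
# Toy model of SievePrimes auto-adjustment: a high CPU-wait percentage (which,
# as described above, may really be thread-switching delay) raises SievePrimes;
# a very low one lowers it. All constants are made up for illustration.
def adjust_sieve_primes(sieve_primes, wait_pct,
                        target_pct=2.0, step=0.05,
                        lo=5000, hi=200000):
    if wait_pct > target_pct:
        sieve_primes *= 1 + step      # sieve deeper, feed the GPU less
    elif wait_pct < target_pct / 2:
        sieve_primes *= 1 - step      # sieve less, feed the GPU more
    return max(lo, min(hi, int(sieve_primes)))

print(adjust_sieve_primes(25000, 10.0))  # 26250: wait too high, SP climbs
```

With background-thread switching misread as GPU wait, this loop keeps ratcheting SP upward, which matches the runaway-SievePrimes behavior bcp19 reported.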

On my AMD Phenom system, I need to use a fixed SievePrimes in order to be able to run it alongside prime95. On a SandyBridge, I noticed that an available hyperthread is fully sufficient to serve its needs. There, I can run 3x mfakto and 3 prime95 LL tests on 8 hyperthreads. 4 LL tests work as well, but that lowers SievePrimes too much for my taste. As the AVX FFTs are memory-bandwidth-limited on my machine, it would not be faster to run LL tests on every hyperthread.

On another machine, a 12-CPU Xeon without hyper-threading, I run just 8 threads of mprime. In order to have mfakto run at full speed, I let 3 instances use 4 CPUs, each at 133% CPU (Unix-style counting).

On none of these machines do I set the affinity for mfakto; the OS normally figures out what's available. I'll add a note to my todo list to allow setting the affinity for the sieving thread - this may be of some advantage, especially on Windows, where threads are normally switched around for no good reason.

bcp19 2012-04-02 15:16

[QUOTE=Bdot;295150]I've seen this difference between mfakto and mfaktc as well.
...[/QUOTE]

Sounds like there is a little bit of 'hidden' CPU cost in running mfakto, which explains the numbers I was seeing. A single mfakto instance on my quad with 0 or 1 cores on P95 produced ~64GD; with 2 cores it dropped to ~60, and with 3 it was ~56.

Bdot 2012-04-04 09:21

5 x 15 bit kernel - testers wanted!
 
I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.

I've created a kernel using a word size of 15 bits per int. This way I can completely avoid the expensive 32-bit mul and mul_hi instructions. Using 5x15 bits, it is currently capable of doing TF from 60 to 72 bits. I should be able to bring it to 73 bits soon. The kernel is still somewhat immature: I have only one generic 75 x 75 bit -> 150 bit multiplication. With an optimized squaring function, another version that only calculates the required precision, and a few other optimizations, I should be able to improve its speed by 30-50%. Currently, on HD5770 it runs at ~80% of the speed of the best kernel. For Cayman, predictions are that it is already 5% faster right now. With a little luck, HD6970 may finally be faster than HD5870 (hmm, probably just a bad joke).
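The idea behind the 5x15-bit representation can be sketched in a few lines. This is a model of the arithmetic, not the OpenCL kernel: with 15-bit limbs, every partial product is at most 30 bits, so it fits comfortably in a 32-bit integer and the expensive mul_hi is never needed.

```python
# Model of the 5x15-bit representation: a 75-bit factor candidate is five
# 15-bit limbs; schoolbook multiplication gives the 150-bit product using
# only partial products that fit in 32 bits (15+15 = 30-bit results).
LIMB_BITS, LIMBS = 15, 5
MASK = (1 << LIMB_BITS) - 1

def to_limbs(n):
    return [(n >> (LIMB_BITS * i)) & MASK for i in range(LIMBS)]

def mul_75x75(a, b):
    res = [0] * (2 * LIMBS)
    for i in range(LIMBS):
        for j in range(LIMBS):
            res[i + j] += a[i] * b[j]   # each partial product <= 30 bits
    carry = 0
    for k in range(2 * LIMBS):          # normalize: propagate carries
        carry += res[k]
        res[k] = carry & MASK
        carry >>= LIMB_BITS
    return res

a, b = (1 << 74) + 12345, (1 << 73) + 67890
limbs = mul_75x75(to_limbs(a), to_limbs(b))
assert sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs)) == a * b
```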

I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else? Of special interest are HD69xx and HD7xxx.

BTW, if anyone wants to follow/help development, I've put the source code at
[URL]https://github.com/Bdot42/mfakto[/URL]
I usually push updates whenever I change (improve?) anything.

KyleAskine 2012-04-04 10:52

[QUOTE=Bdot;295370]I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.
...[/QUOTE]

I am in Boston from 4/5 to 4/8, but would love to test after that!

flashjh 2012-04-04 13:05

[QUOTE=Bdot;295370]I'm in the final steps of creating a better performing solution for HD69xx (Cayman) and probably HD7xxx as well.
...
I could use some testing help towards the end of this week or next week ... flash, Kyle? Anyone else, of special interest are HD69xx or HD7xxx ?

BTW, if anyone wants to follow/help development, I've put the source code to
[URL]https://github.com/Bdot42/mfakto[/URL]
I usually do regular updates whenever I changed (improved?) anything.[/QUOTE]

I'd love to help. I don't have a 69xx or 7xxx though, only a 5870. If you can use data from it, let me know.

Bdot 2012-04-12 12:53

release it?
 
Thanks to the very fast (and so far successful) testing by both of you, I think this kernel can be called stable very soon.

And for Cayman the result is even better than I expected: almost 50% speed-up!

So, for TF up to 70 bits, HD5870 is still fastest, with ~320M/s raw speed. TF up to 73 bits now runs at ~285 M/s (up from ~255 M/s).

HD6970 will now[SUP]*)[/SUP] do all these ranges at ~295M/s (up from ~205M/s), making it the fastest AMD card for the usual GPU272 work. At least until someone can tell how HD7970 performs.

Note, these are all raw figures without scheduling overhead - you should see 80-90% of that in the end.

These significant performance improvements make me think I should release them even before I'm done with auxiliary changes like
- display GHz-days/day
- worktodo.add
- perftest modes for kernel speed
- two optional fields in mfakto.ini for username and computerid
- output datestamp lines in results.txt

File locking for worktodo and results files is already included.

GPU sieving is then the next big project.

[SIZE=1][SUP]*)[/SUP] a slight change in the kernel selection is needed to make the new kernel the default for up to 70 bits in Cayman - so far it is selected only for 71-73 bits[/SIZE]

Bdot 2012-04-30 07:41

variable progress lines
 
I've noticed there are different opinions about what mfakto should display while working on an exponent. When adding the GHz-days/day I had difficulty getting everything into a standard 80-character line. On the other hand, I usually have my terminal windows ~220 characters wide, most of which mfakto doesn't use.

So, here we go:

in mfakto.ini:
[code]
V5UserID=Bdot
ComputerID=mfakto
PrintFormat=[%d %T] M%M[%l-%u]: %C/4620 %c/960 %p% %gGHz %ts %e to go, %n FCs, %rM/s, SP: %s, wait:%wus=%W%, %U@%H
[/code]you get
[code]
[Apr 30 09:18] M53910019[70-71]: 204/4620 45/960 4.69% 76.40GHz 5.225s 1h19m to go, 589.82M FCs, 112.88M/s, SP: 5316, wait: 106us= 0.92%, Bdot@mfakto
[/code]These are the possible formats right now. Is there anything missing?
[code]
+ %C - class ID (n/4620) "%4d"
+ %c - class number (n/960) "%3d"
+ %p - percent complete (%) "%6.2f"
+ %g - GHz-days/day (GHz) "%7.2f"
+ %t - time per class (s) "%6G"
+ %e - ETA (d/h/m/s) "%2dm%02ds"/"%2dh%02dm"/"%2dd%02dh"
+ %n - number of candidates (M/G) "%6.2fM"/"%6.2fG"
+ %r - rate (M/s) "%6.2f"
+ %s - SievePrimes "%7d"
+ %w - CPU wait time for GPU (us) "%6lld"
+ %W - CPU wait % (%) "%6.2f"
+ %d - date (Mon nn) "%b %d"
+ %T - time (HH:MM) "%H:%M"
+ %U - username (as configured) "%s" !! variable length, 15 chars at most
+ %H - hostname (as configured) "%s" !! variable length, 15 chars at most
+ %M - the exponent being worked on "%d" !! no fixed width to allow prepending 'M' !!
+ %l - the lower bit-limit "%2d"
+ %u - the upper bit-limit "%2d"
[/code]The format string can combine up to 20 of these specifiers. I'll probably add a separate header line, as aligning everything automatically is too much effort.
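How such a format string gets expanded can be illustrated with a simplified sketch (not mfakto's implementation; field widths and most specifiers are ignored here):

```python
# Minimal re-implementation of the idea behind PrintFormat: scan the string
# and substitute each %X specifier from a value table, leaving everything
# else (including a bare % before an unknown character) untouched.
def render(fmt, values):
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in values:
            out.append(str(values[fmt[i + 1]]))
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)

line = render("[%d %T] M%M[%l-%u]: %p% done",
              {"d": "Apr 30", "T": "09:18", "M": 53910019,
               "l": 70, "u": 71, "p": "4.69"})
print(line)  # [Apr 30 09:18] M53910019[70-71]: 4.69% done
```

Note how the literal "%" in "%p% done" survives because the character after it is not a known specifier.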

If you specify your UserID and ComputerID, the result lines will also contain them. A boolean "TimeStampInResults" setting brings the results file even closer to what the prime95 original looks like.

Dubslow 2012-04-30 08:20

:shock:
...
...
...
...
...
...
:explode:






DO WANT



:smile:

...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc?

From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger.

Thanks :smile:

LaurV 2012-04-30 08:23

[edit: I replied to Bdot's post, but took time to compose the reply, being busy at the job. Dubslow got in in between]

That is a very nice idea! [edit: about customizing the output]. What a pity I have no AMD/OpenCL/GL cards...

Under windoze, you don't need to limit the line length to 80 characters; you can specify a bigger buffer (number of lines and characters per line) for the DOS prompt: just right-click the window's title bar, choose Properties > Layout, and modify the screen buffer size. I usually have 150 characters per line with the 7x12 font (selectable from the Fonts tab), which fits perfectly even on a small (low-resolution) monitor. There are a lot of advantages to a wider screen for yafu, msieve, cudalucas, etc. Practically the only program limited to 80 cpl is mfaktc. The idea of "custom output lines" could be copied there too!

Dubslow 2012-04-30 08:41

[QUOTE=LaurV;297967]That is a very nice idea! What a pity I have no AMD/OpenCL/GL cards...
Under windoze, you don't need to limit the line length to 80 characters ...[/QUOTE]

Heh, in Linux (Gnome, specifically) all you need to do is make the terminal window bigger and the output matches on the fly. That's one complaint I had about the DOS prompt :razz:

Bdot 2012-04-30 12:35

[QUOTE=Dubslow;297966]:shock:

...One (not-so-)small request. With the multi-threading (sort of) and now this, would you be willing to "backport" your changes/additions into mfaktc?

From a user's standpoint (i.e. helping people, and for those with both nVidia and AMD cards), it's optimal if mfaktc and mfakto are as similar as possible (TF algos aside), and it's clear you have more time (or desire/drive/whatever) for developing the extra non-math goodies than TheJudger.

Thanks :smile:[/QUOTE]

I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision if he wants any of that in.

When I started building the OpenCL stuff, I screwed up my CUDA dev env, and I never really spent the effort to fix it. But anyone who's ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time now I've been regularly checking in my code at [URL]https://github.com/Bdot42/mfakto[/URL], and it's still open source :smile:.
Have a look at [URL]https://github.com/Bdot42/mfakto/commit/ccf6d26fe3be5d4ab655b0069e0885c16337d05a[/URL], for instance, to see the first check-in bringing the variable progress - there's still some work needed though ...

Dubslow 2012-04-30 13:02

[QUOTE=Bdot;297979]I'd be happy if TheJudger decided to use some of my code for mfaktc, after I took so much of his code to make mfakto. However, it remains his decision if he wants any of that in.

When I started building the OpenCL stuff, I screwed up my CUDA dev env, and I never really spent the effort to fix it. But anyone who's ever built mfaktc and knows how to read code should be able to merge these changes. For quite some time now I've been regularly checking in my code at [URL]https://github.com/Bdot42/mfakto[/URL], and it's still open source :smile:.
Have a look at [URL]https://github.com/Bdot42/mfakto/commit/ccf6d26fe3be5d4ab655b0069e0885c16337d05a[/URL], for instance, to see the first check-in bringing the variable progress - there's still some work needed though ...[/QUOTE]
I'd love to do a merge -- there are two reasons I asked you:
1) I have very limited C experience, though merging already-written code should be a good thing from an experience standpoint
2) For the next 2 weeks I will have little time to spend on coding/merging -- but then after that is summer :smile:

I guess that means that if no one else has in two weeks' time, I'll take a crack at it. (Some people know I already took a shot at merging some mfaktc code into CUDALucas, and I had planned on extending that.)

KyleAskine 2012-05-02 21:31

500M/s Sieving Cap??
 
I have around 600M/s worth of cards in my main PC (2x6970), but I cannot seem to feed my cards more than 480M candidates per second, no matter how I arrange my instances of mfakto. I suspect I am at a 'sieving cap'. The processor isn't the issue, as I am only at around 70% use now, nor are the GPUs themselves, which can both go to 99% if I kill one of the instances of mfakto feeding the other.

What could be holding me back? Memory Bandwidth? Something to do with Caching? Something I am not considering?

bcp19 2012-05-02 22:47

[QUOTE=KyleAskine;298219]I have around 600M/s worth of cards in my main PC (2x6970), but I cannot seem to feed them more than about 480M candidates per second, no matter how I arrange my instances of mfakto. I suspect I am at a 'sieving cap'. The processor isn't the issue, as I am only at around 70% usage, nor are the GPUs themselves, which can both go to 99% if I kill one of the instances of mfakto feeding the other card.

What could be holding me back? Memory Bandwidth? Something to do with Caching? Something I am not considering?[/QUOTE]

Are you running Windows or Linux? AMD or Intel chip? I have never run Linux (so I don't know if it would change things), but I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, plus the 'active' window always ran faster. When I put it in my 2500, I could hit 88% load if the mfakto window was 'active' and 66% load if it was not (it had a pretty high SP as well, or it'd have had a 20% cpu wait at 5k). I have since gotten an AMD Phenom II x6 1055 on which I run 3 cores on a GTX 460 and 2 cores on the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.

KyleAskine 2012-05-03 01:31

[QUOTE=bcp19;298230]Are you running Windows or Linux? AMD or Intel chip? I have never run Linux (so I don't know if it would change things), but I know that with a Win7/i5 2400 combo I could not max out my HD5770 even with 2 cores feeding it, plus the 'active' window always ran faster. When I put it in my 2500, I could hit 88% load if the mfakto window was 'active' and 66% load if it was not (it had a pretty high SP as well, or it'd have had a 20% cpu wait at 5k). I have since gotten an AMD Phenom II x6 1055 on which I run 3 cores on a GTX 460 and 2 cores on the 5770. It too runs Win7, but the 2 cores keep it at 88% load regardless of which window is active. I think the problem boils down to OpenCL not being able to run as well as CUDA, unless Linux makes a difference.[/QUOTE]

I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto, I cannot sieve more than around 480-490 M/s across all of my instances combined. Like I said, I can max out either of my graphics cards (requires around 300M/s sieving to get 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time. I cannot get it right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.

bcp19 2012-05-03 11:00

[QUOTE=KyleAskine;298245]I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto, I cannot sieve more than around 480-490 M/s across all of my instances combined. Like I said, I can max out either of my graphics cards (requires around 300M/s sieving to get 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time. I cannot get it right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.[/QUOTE]

When I was testing mfakto on one of my quads I ran into a somewhat similar problem with what I refer to as 'diminishing returns'. mfakto, unlike mfaktc, seems to need an extra bit of computing power beyond the core supplied (a single mfakto instance reports 30% cpu usage overall), so when I used all 4 cores it would actually give me less throughput than if I used 3 and left the 4th core idle. (I only tested this on a Core2Quad, which has its own quirks, but I would imagine an i5/i7 quad would show somewhat similar results)

Bdot 2012-05-03 11:23

[QUOTE=KyleAskine;298245]I think I was really hazy in my last post. Let me try to be a bit more precise with my language.

I am running four instances of mfakto on a windows 7 machine. Two for each of my two graphics cards.

No matter how I arrange my mfakto, I cannot sieve more than around 480-490 M/s across all of my instances combined. Like I said, I can max out either of my graphics cards (requires around 300M/s sieving to get 99% GPU load) by simply killing one of the processes feeding the other graphics card. However, I need around 600M/s of sieving power to saturate both cards at the same time. I cannot get it right now.

Processor is definitely not the bottleneck. Nor is raw GPU power.

I just want to know if there is a way for me to discover what is.[/QUOTE]

a few ideas:
[LIST][*]if you run two instances for one card, none for the other and two threads prime95, then what's the sieving you get out of the mfakto instances? If starting prime95 slows down sieving to ~60%, then it is probably just hyper-threads that are assigned to the tasks. An available hyper-thread is reported by windows as available, but if you start using it, you're actually stealing CPU-time from its twin.[*]on an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What's the total sieving performance with that? This test eliminates any GPU interaction (copy and process).[*]take the variable-SieveSize binary that I sent you and try with 24k SieveSizeLimit. If L1 caches are the problem, then this should be faster than 36k [B]with the same binary[/B]. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting is a multiple of ~12k - it will use the highest such multiple that is not more than what you specify)[*]I added to my todo-list a dummy kernel that would just read each transferred byte but not do any TF with it, just for better performance tests.[*]I finished my tests with a lower SievePrimes (as low as 256, but I think I'll not allow below 1000). Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish that. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.[/LIST]
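
The SieveSizeLimit rounding described in the third bullet can be sketched like this (the ~12k quantum is taken from the post's parenthetical note; the constant and function name are illustrative, not mfakto's actual code):

```c
#include <assert.h>

/* Hypothetical quantum, per the post: the sieve size is used in multiples of ~12 kiB. */
#define SIEVE_QUANTUM_KIB 12

/* Round a user-specified SieveSizeLimit (in kiB) down to the highest
   multiple of the quantum that does not exceed it, with a floor of one quantum. */
static unsigned int effective_sieve_size(unsigned int limit_kib)
{
    unsigned int size = (limit_kib / SIEVE_QUANTUM_KIB) * SIEVE_QUANTUM_KIB;
    return size < SIEVE_QUANTUM_KIB ? SIEVE_QUANTUM_KIB : size;
}
```

Under this sketch, a 24k limit yields 24 kiB and a 36k limit yields 36 kiB, matching the two settings compared in the post, while e.g. 30k would silently round down to 24 kiB.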

KyleAskine 2012-05-03 12:52

I will take a look tonight. Thanks for the suggestions!

KyleAskine 2012-05-03 18:10

And just so you know, the reason I suspect it isn't processor related is that I moved SievePrimes around, and once I got below a certain point the processor load dropped but the M/s stayed constant.

KyleAskine 2012-05-06 02:24

[QUOTE=Bdot;298277]a few ideas:
[LIST][*]if you run two instances for one card, none for the other and two threads prime95, then what's the sieving you get out of the mfakto instances? If starting prime95 slows down sieving to ~60%, then it is probably just hyper-threads that are assigned to the tasks. An available hyper-thread is reported by windows as available, but if you start using it, you're actually stealing CPU-time from its twin.[*]on an otherwise idle machine, run 4 instances of "mfakto --perftest" at once. What's the total sieving performance with that? This test eliminates any GPU interaction (copy and process).[*]take the variable-SieveSize binary that I sent you and try with 24k SieveSizeLimit. If L1 caches are the problem, then this should be faster than 36k [B]with the same binary[/B]. You can also use this binary for the --perftest above, in order to see what the best SieveSizeLimit would be. (Note this setting is a multiple of ~12k - it will use the highest such multiple that is not more than what you specify)[*]I added to my todo-list a dummy kernel that would just read each transferred byte but not do any TF with it, just for better performance tests.[*]I finished my tests with a lower SievePrimes (as low as 256, but I think I'll not allow below 1000). Together with the new GHz-days/day display you can easily see the effects of lowering SP that far, so I think I can publish that. In many cases you would sacrifice total throughput just to get higher GPU utilization - but your case may be different.[/LIST][/QUOTE]

I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90%, which is what the 'faster' of the two cards runs at when I run 4x mfakto.

The perftests look fine - they all run at around a max of 500M/s each, even when I run 4 at the same time. So raw sieving isn't the issue.

Mfakto runs around the same speed with the 36k exec and with the var exec with 24 specified in the ini file.

So I continue to be stumped. If it were somehow processor-bound, I'd think lowering my sieving should increase GPU load, but it just doesn't. When I run four instances at 5000 SievePrimes, I get 50% processor usage and 170% GPU load across both of my graphics cards (when you add them together).

When I run four instances at 25000 SievePrimes, I get 80% processor usage and 170% GPU load across both of my graphics cards.

I just don't know why I can't push my GPU usage to close to 200%, since each card can easily reach 99% individually.

Bdot 2012-05-06 21:01

[QUOTE=KyleAskine;298577]I ran the stress test with two threads and two instances of mfakto hitting the same card. It ran at around 90%, which is what the 'faster' of the two cards runs at when I run 4x mfakto.

The perftests look fine - they all run at around a max of 500M/s each, even when I run 4 at the same time. So raw sieving isn't the issue.

Mfakto runs around the same speed with the 36k exec and with the var exec with 24 specified in the ini file.

So I continue to be stumped. If it were somehow processor-bound, I'd think lowering my sieving should increase GPU load, but it just doesn't. When I run four instances at 5000 SievePrimes, I get 50% processor usage and 170% GPU load across both of my graphics cards (when you add them together).

When I run four instances at 25000 SievePrimes, I get 80% processor usage and 170% GPU load across both of my graphics cards.

I just don't know why I can't push my GPU usage to close to 200%, since each card can easily reach 99% individually.[/QUOTE]

The tests somewhat point towards memory, if they point to anything at all. It's not the CPU, I perfectly agree. And the GPUs also have some headroom.

If 2x mfakto can bring one GPU to 99%, but adding the prime95 stress test lowers the GPU load to 90%, then we already see that there is some influence. In that case, it can only be the memory system, including the caches.

Delivering 500M candidates per second to the GPUs also means transferring 2GB of data per second over the bus. PCIe 2.0 x16 should be able to transfer 8GB/s to each card (4GB/s if you enabled CrossFire) - plenty of headroom, you'd think.
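
The bus-bandwidth arithmetic above, as a quick sanity check (assuming each factor candidate is shipped as one 32-bit offset, i.e. 4 bytes, as the later posts imply):

```c
#include <assert.h>

/* Bytes per second on the PCIe bus for a given sieved-candidate rate,
   assuming one 32-bit offset (4 bytes) per factor candidate. */
static unsigned long long bus_bytes_per_s(unsigned long long candidates_per_s)
{
    return candidates_per_s * 4ULL;
}
```

500M candidates/s works out to 2 GB/s, nominally only a quarter of the ~8 GB/s that PCIe 2.0 x16 offers per card.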

I suggest another test: I also sent you the performance-info binary in the last package. This is a normal mfakto binary, except that it additionally queries and displays OpenCL performance data for both the data transfer and the kernel execution. The perf-info you sent me last time showed transfer rates of 2.1-2.3 GB/s. Please start the pi-binary instead of the real ones, but start them one by one and monitor the transfer rates being reported. The first one will certainly start with a fairly consistent ~2.2 GB/s. When you add another mfakto-pi on the same card, does it start to fluctuate? Is the reaction the same when adding an instance on the other card? And what do the transfer rates look like with 4 instances?

I expect them to still show 2.3GB/s quite often, but in between they will also show much lower values if the memory transfer to the GPU is an issue.

I think I will compile a version for you that adds another debug-flag to show detailed timing info for each of the steps. This will show where more time is spent when more instances start up.

I also have a version that skips sieving and transfer of the candidates to the GPU completely, I just need to adapt it to the new kernels. This way we could test what the GPUs really could do if they had all data they needed.

Did you already play around with the clocks of your memory modules? Of course, overclocking is always a bit dangerous, but how about slowing it down a bit? To see if the capping effect gets stronger?

And I have yet another idea: of each 32-bit offset for the FCs, only 24 bits are evaluated. Each GPU thread needs 4 FCs to fill its vector. Instead of transferring 4x32=128 bits per GPU thread, I could squeeze the 4x24 bits into 3x32-bit integers. This should reduce the required bandwidth by 25%. A bit more computational effort, but maybe the reduced I/O more than offsets that? Certainly worth a test.
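
The 4x24-bit-into-3x32-bit idea can be sketched as follows. This is one straightforward packing; the layout mfakto eventually uses may well differ:

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 24-bit FC offsets into three 32-bit words: 96 bits instead of
   128 on the bus, i.e. 25% less transfer. Inputs must be < 2^24. */
static void pack4x24(const uint32_t in[4], uint32_t out[3])
{
    out[0] = (in[0] & 0xFFFFFF)      | (in[1] << 24); /* in0[23:0] | in1[7:0]   */
    out[1] = ((in[1] >> 8) & 0xFFFF) | (in[2] << 16); /* in1[23:8] | in2[15:0]  */
    out[2] = ((in[2] >> 16) & 0xFF)  | (in[3] << 8);  /* in2[23:16] | in3[23:0] */
}

/* Reverse operation, as the GPU kernel would do before building its vector. */
static void unpack4x24(const uint32_t in[3], uint32_t out[4])
{
    out[0] = in[0] & 0xFFFFFF;
    out[1] = ((in[0] >> 24) | (in[1] << 8)) & 0xFFFFFF;
    out[2] = ((in[1] >> 16) | (in[2] << 16)) & 0xFFFFFF;
    out[3] = (in[2] >> 8) & 0xFFFFFF;
}
```

The unpacking costs a handful of shifts and ORs per vector, which is the "bit more computational effort" traded against the 25% bandwidth saving.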

KyleAskine 2012-05-07 00:55

1 Attachment(s)
I have uploaded my results when I run one instance alone, when I run two instances on the same card, and when I run four instances (two on each card).

As you guessed, transfer rate gets demolished with more than one.

Bdot 2012-05-07 10:28

[QUOTE=KyleAskine;298634]I have uploaded my results when I run one instance alone, when I run two instances on the same card, and when I run four instances (two on each card).

As you guessed, transfer rate gets demolished with more than one.[/QUOTE]

Ouch, I did not expect the copying performance to deteriorate so much ... at least we seem to have found the reason for the strange behavior.

OK, I just measured the same thing on my HD5770 here, and I get ~2.1GB/s single instance, or 2x 190MB/s, 4x 54MB/s.

I guess this is some serious scheduling issue inside the OpenCL runtime. I think I'll prepare a case for AMD ...

I then modified the code to ignore the number of compute units in the GPU and always run 2M FCs at once, which increased the copy performance to 2.4GB/s, 2x220MB/s, 4x105MB/s. Certainly some improvement, but I guess I need to invest in the 4x24bit=3x32bit idea for data transfers.

aketilander 2012-05-15 18:16

One of my oldest boxes has a GPU: [B]AMD Radeon X1650 Series.[/B] If I have understood it correctly, this GPU cannot be used for TF. Just to make sure, I have installed mfakto 0.10p1 and the "Additional required software".

When I run the program with the -st I get the following output:

[code]mfakto 0.10p1-Win (32bit build)

Runtime options
Inifile mfakto.ini
SievePrimes 25000
SievePrimesAdjust 1
NumStreams 5
GridSize 4
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode full
AllowSleep yes
VectorSize 4
PreferKernel mfakto_cl_barrett79
SieveOnGPU no
Compiletime options
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 193154bits
SIEVE_SPLIT 250
MORE_CLASSES enabled
Select device - GPU not found, fallback to CPU.
Get device info - Compiling kernels .
BUILD OUTPUT
Internal Error: as failed
END OF BUILD OUTPUT
init_CL(5, 0) failed[/code]

I just want to make sure that I cannot use this GPU ([B]AMD Radeon X1650 Series[/B]) for TF or any other GIMPS related work. Is that so?

chalsall 2012-05-15 18:42

[QUOTE=aketilander;299545]I just want to make sure that I cannot use this GPU ([B]AMD Radeon X1650 Series[/B]) for TF or any other GIMPS related work. Is that so?[/QUOTE]

This line should have given it away: "Select device - GPU not found, fallback to CPU."

flashjh 2012-05-15 18:55

You can visit James' website to see which cards will work.

[url]http://mersenne-aries.sili.net/mfaktc.php?sort=ghdpd&noN=1[/url]

Bdot 2012-05-15 21:24

[QUOTE=aketilander;299545]One of my oldest boxes has a GPU: [B]AMD Radeon X1650 Series. [/B]If I have understood it rightly this GPU cannot be used for TF.[/QUOTE]
The X1650 has got an RV535 GPU chip. The first chip that supports OpenCL is RV700. Therefore, no OpenCL program will run on this GPU. In fact, it is 3 generations too old (X1000 -> HD2000 -> HD3000 -> HD4000, which is the first generation for OpenCL).

aketilander 2012-05-16 12:37

[QUOTE=Bdot;299564]The X1650 has got an RV535 GPU chip. The first chip that supports OpenCL is RV700. Therefore, no OpenCL program will run on this GPU. In fact, it is 3 generations too old (X1000 -> HD2000 -> HD3000 -> HD4000, which is the first generation for OpenCL).[/QUOTE]

Thank you, chalsall, flashjh and Bdot. Your help was much appreciated!

Bdot 2012-05-20 21:58

v0.11
 
v0.11 is ready. Please get it from [URL]http://mersenneforum.org/mfakto/mfakto-0.11/[/URL]

What's new:
[LIST][*]24-bit barrett kernel for FCs up to 2^70 - very fast![*]15-bit barrett kernel for FCs up to 2^73 - almost as fast; especially on Cayman this one has a speedup of 50% over 0.10p1[*]new [B]SievePrimesMin [/B]ini-file variable to replace the previously fixed value of 5000 (hard minimum is 256)[*]new [B]V5UserID [/B]and [B]ComputerID [/B]ini-file variables that let you configure these IDs for the results file output (so far only useful for mersenne-aries.sili.net)[*]new [B]TimeStampInResults [/B]ini-file variable lets you configure that each result line be preceded by a time stamp[*]new [B]ProgressHeader [/B]and [B]PrintFormat [/B]ini-file variables to adapt the information that is printed after each class is finished. See the included mfakto.ini file for details.[*]On Linux: siever code is now compiled with gcc 4.6: ~10% faster sieve[*]file locking: worktodo and results file accesses are now synchronized using a lock file (.lck appended to the file name)[*]evaluation of the GHz-days of assignments, and current speed as GHz-days/day[*]Ctrl-C handler also in the selftest, to get a summary of the tests completed so far[*]new --perftest option to test siever performance depending on SievePrimes and SieveSizeLimit (if the latter is not fixed at compile time)[*]using a fixed power of 2 for the number of GPU threads (still set via GridSize)[/LIST]Source code is at [URL]https://github.com/Bdot42/mfakto[/URL], [URL="https://github.com/Bdot42/mfakto/zipball/v0.11"]v0.11[/URL]

Note that the new fast kernels cannot be used without Stages=1, as they need to process each bit level separately. Also, because of the other new config variables, I suggest using the newly shipped ini file and adjusting it to your needs.

And, as usual, let me know if anything does not work as expected :smile:

LaurV 2012-05-21 03:12

You make me feel terribly sad that I don't have an AMD card... :smile:
I believe some of those are already implemented in mfaktc, but some of them are still missing, especially a lot of the "cosmetic" stuff... Do you still have a dialog with Oliver, or have you gone totally separate ways now? It would be nice (for us, the blind users) if the two programs grew up together and didn't become totally different programs in a few years...

Bdot 2012-05-21 11:23

[QUOTE=LaurV;299942]You make me feel terribly sad that I don't have an AMD card... :smile:
I believe some of those are already implemented in mfaktc, but some of them are still missing, especially a lot of the "cosmetic" stuff... Do you still have a dialog with Oliver, or have you gone totally separate ways now? It would be nice (for us, the blind users) if the two programs grew up together and didn't become totally different programs in a few years...[/QUOTE]

Hehe, mfaktc has the performance, mfakto has the fancy stuff?

I'm in contact with Oliver and he said he'd merge the stuff into mfaktc [B]if users requested it explicitly[/B]. I understood he did not want to blindly merge everything. But if you, the mfaktc users, tell him exactly which features you'd like to see in mfaktc, then he will. In most cases I can easily extract the changes that would be required - still, it is quite some effort on Oliver's side to build and test. As CUDA code is not as separated from the C code as OpenCL, merging may also be challenging in some cases.

TheJudger 2012-05-21 12:58

[QUOTE=Bdot;299918][*]new [B]SievePrimesMin [/B]ini-file variable to replace the previously fixed value of 5000 (hard minimum is 256)
[/QUOTE]
Let us extend this to SievePrimesMin + SievePrimesMax in mfakt?.ini:
SIEVE_PRIMES_MIN <= SievePrimesMin < SievePrimesMax <= SIEVE_PRIMES_MAX
with SIEVE_PRIMES_M[IN|AX] hardcoded and fixed, and SievePrimesM[in|ax] user-tunable in mfakt?.ini. (Something I have on my todo list for 0.19)
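
The proposed constraint chain amounts to a small validation step when reading the ini file. A sketch (the hard bounds use the values mentioned in this thread for mfakto; the function itself is illustrative, not either program's actual code):

```c
#include <assert.h>

#define SIEVE_PRIMES_MIN 256      /* hard lower bound, per this thread */
#define SIEVE_PRIMES_MAX 1000000  /* mfakto's hard upper bound, per a later post */

/* Clamp user-supplied SievePrimesMin/Max into the hardcoded range and keep
   min strictly below max, mirroring
   SIEVE_PRIMES_MIN <= SievePrimesMin < SievePrimesMax <= SIEVE_PRIMES_MAX. */
static void clamp_sieve_primes(unsigned int *min, unsigned int *max)
{
    if (*min < SIEVE_PRIMES_MIN) *min = SIEVE_PRIMES_MIN;
    if (*max > SIEVE_PRIMES_MAX) *max = SIEVE_PRIMES_MAX;
    if (*min >= *max) *min = *max - 1; /* enforce strict ordering */
}
```
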
[QUOTE=Bdot;299918][*]new [B]V5UserID [/B]and [B]ComputerID [/B]ini-file variables that let you configure these IDs for the results file output (so far only useful for mersenne-aries.sili.net)[*]new [B]TimeStampInResults [/B]ini-file variable lets you configure that each result line be preceded by a time stamp
[/QUOTE]
I guess I can adapt those two easily in mfaktc.
[QUOTE=Bdot;299918][*]new [B]ProgressHeader [/B]and [B]PrintFormat [/B]ini-file variables to adapt the information that is printed after each class is finished. See the included mfakto.ini file for details.
[/QUOTE]
I have to look at this, fancy stuff! :smile:
[QUOTE=Bdot;299918][*]On Linux: Siever code is now compiled with gcc4.6: ~10% faster sieve
[/QUOTE]
mfaktc compiles fine with gcc 4.6 / CUDA >= 4.2. The sieve code is ~10% faster on my IVB compared to gcc 4.4. :cool:
[QUOTE=Bdot;299918][*]file locking: worktodo and results files accesses are now synchronized using a lock file (.lck appended to the file name).
[/QUOTE]
I have to check, but personally I'm not really a fan of file locking... too many failures in the past...

[QUOTE=LaurV;299942]You make me feel terribly sad that I don't have an AMD card... :smile:
I believe some of those are already implemented in mfaktc, but some of them are still missing, especially a lot of the "cosmetic" stuff... Do you still have a dialog with Oliver, or have you gone totally separate ways now? It would be nice (for us, the blind users) if the two programs grew up together and didn't become totally different programs in a few years...[/QUOTE]

Yes, we are talking, usually via PM in German (which is easier for both of us, I guess). It is a good idea to have both mfaktc and mfakto similar/identical in places where that is doable. Of course this is not the case for the GPU code and CUDA/OpenCL-specific stuff. And it is no secret that my focus is on performance, while I tend to ignore the "useless stuff" like a user interface. :blush:

Oliver

bcp19 2012-05-21 15:01

I like the V5UserID item; with something like that, it seems an easy step to either have the spider send results like P95 does, or possibly even incorporate submission into the program.

chalsall 2012-05-21 15:27

[QUOTE=bcp19;299977]I like the V5UserID item; with something like that, it seems an easy step to either have the spider send results like P95 does, or possibly even incorporate submission into the program.[/QUOTE]

If I understand what Bdot has done here, the information will be in the results string itself. Thus, the current submission spider will be sending the data to PrimeNet for it to use when it's ready for it.

This also means GPU72 will be able to be extended to determine which computer sent the results as well.

chalsall 2012-05-21 15:36

[QUOTE=TheJudger;299967]I have to check, but personally I'm not really a fan of file locking... too many failures in the past...[/QUOTE]

Personally I [B][I][U]would[/U][/I][/B] really like to see WORKTODO.ADD functionality added to both programs. This is not mutually exclusive with file locking, but in my opinion it is a safer way to add work to a running system.

One creator/writer; one reader/deleter. The spider wakes up and checks "worktodo.txt" to see if any more work is needed. If not, it goes back to sleep. If more work is needed, it next checks whether "worktodo.add" already exists. If it does, it again goes back to sleep. If it doesn't, it attempts to get new work and places it into a file like "worktodo.adt".

The last step is to move (rename) "worktodo.adt" to "worktodo.add" and go back to sleep. No race conditions; no locks.
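
The writer side of this scheme leans on rename() being atomic within one filesystem, so the consuming program can never observe a half-written worktodo.add. A minimal sketch of the writer step (file names from the post; error handling trimmed, and note the atomic-replace guarantee is a POSIX one):

```c
#include <stdio.h>

/* Write new assignments to a temporary name, then atomically publish them.
   The program reading worktodo.add sees either the complete file or nothing. */
static int publish_work(const char *lines)
{
    FILE *f = fopen("worktodo.adt", "w");
    if (!f)
        return -1;
    fputs(lines, f);
    fclose(f);
    /* rename() atomically replaces the target on POSIX filesystems */
    return rename("worktodo.adt", "worktodo.add");
}
```
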

For people like Dubslow who like to order new work with old, file locking is very useful. For most people, worktodo.add functionality is fine and sane.

Bdot 2012-05-21 20:31

[QUOTE=TheJudger;299967]Let us extend this to SievePrimesMin + SievePrimesMax in mfakt?.ini:
SIEVE_PRIMES_MIN <= SievePrimesMin < SievePrimesMax <= SIEVE_PRIMES_MAX
with SIEVE_PRIMES_M[IN|AX] hardcoded and fixed, and SievePrimesM[in|ax] user-tunable in mfakt?.ini. (Something I have on my todo list for 0.19)
[/QUOTE]

Yes, that's how it's implemented in mfakto (SievePrimesMax already came in version 0.10). Currently SIEVE_PRIMES_MIN=256 and SIEVE_PRIMES_MAX=1000000 (the latter is possible because mfakto always uses 4620 classes; with only 420 classes this could overflow the 24 bits per FC offset).

[QUOTE=TheJudger;299967]
I have to look at this, fancy stuff! :smile:
[/QUOTE]

I guess, [URL="https://github.com/Bdot42/mfakto/compare/acf0f840cbd6d60df56f17175289da6756ab9649...b22f4aaeaabc5f89bb34d95b582cee41640a1ff0"]this github diff[/URL] should be close to what you want to look at.

[QUOTE=TheJudger;299967]
Yes, we are talking, usually via PM in German (which is easier for both of us, I guess). It is a good idea to have both mfaktc and mfakto similar/identical in places where that is doable. Of course this is not the case for the GPU code and CUDA/OpenCL-specific stuff. And it is no secret that my focus is on performance, while I tend to ignore the "useless stuff" like a user interface. :blush:

Oliver[/QUOTE]

That's all right if you allow others to take care of it :smile:

[QUOTE=bcp19;299977]I like the V5UserID item; with something like that, it seems an easy step to either have the spider send results like P95 does, or possibly even incorporate submission into the program.[/QUOTE]

Yes, the automatic primenet/gpu72 integration would be nice, and having the IDs will certainly help. But this was the smallest part :wink:
[QUOTE=chalsall;299979]If I understand what Bdot has done here, the information will be in the results string itself. Thus, the current submission spider will be sending the data to PrimeNet for it to use when it's ready for it.

This also means GPU72 will be able to be extended to determine which computer sent the results as well.[/QUOTE]

Yes, the UID can now be part of the results line. However, as long as we use primenet's manual submit page, the UID is ignored there. So far, only mersenne-aries can make use of it ...

[QUOTE=chalsall;299980]Personally I [B][I][U]would[/U][/I][/B] really like to see WORKTODO.ADD functionality added to both programs. This is not mutually exclusive of file locking, but in my opinion is a safer way to add work to a running system.[/QUOTE]

You shall have it with the next version, I promise :grin:

Dubslow 2012-05-24 22:10

[QUOTE=Bdot;299962]As CUDA code is not as separated from the C-code as OpenCL, merging may also be challenging in some cases.[/QUOTE]
Just taking an initial look at it, TheJudger does a very good job of keeping them separate.
[code]bill@Gravemind:~/mfaktc-0.18/src∰∂ ls
checkpoint.c mfaktc.c selftest-data.c tf_72bit.h tf_debug.h
checkpoint.h my_intrinsics.h sieve.c [COLOR="RoyalBlue"]tf_96bit.cu[/COLOR] timer.c
compatibility.h my_types.h sieve.h tf_96bit.h timer.h
Makefile params.h signal_handler.c [COLOR="RoyalBlue"]tf_barrett96.cu[/COLOR] timeval.h
Makefile.win read_config.c signal_handler.h tf_barrett96.h
Makefile.win32 read_config.h [COLOR="RoyalBlue"]tf_72bit.cu[/COLOR] [COLOR="RoyalBlue"]tf_common.cu[/COLOR][/code]

Bdot 2012-05-25 12:20

[QUOTE=Dubslow;300191][quote=Bdot]As CUDA code is not as separated from the C-code as OpenCL, merging may also be challenging in some cases.[/quote]Just taking an initial look at it, TheJudger does a very good job of keeping them separate.[/QUOTE]

I perfectly agree with you on that! It seems my note was easy to misunderstand ...

I was referring to a conceptual difference between OpenCL and CUDA:

In CUDA, you compile the device code right into your binary, which enables shared header files, for instance. In the .cu files you can (and usually will) have CPU code and GPU code mixed at the function level.

Using OpenCL, you usually provide the GPU source code to the GPU compiler at runtime of the binary. If you want to share header files between the binary and the GPU code, you need to ship them with the binary, for example.

For this reason I needed a different source file structure for mfakto, which makes merges between mfakto and mfaktc more difficult. This is what I wanted to say, no more, no less :smile:.

Dubslow 2012-05-25 17:26

[QUOTE=Bdot;300211]
Using OpenCL, you usually provide the GPU source code to the GPU compiler at runtime of the binary. If you want to share header files between the binary and the GPU code, you need to ship them with the binary, for example.[/QUOTE]

:huh:
:yucky:


Doesn't that rather defeat the purpose of compiling?

Bdot 2012-05-25 23:12

[QUOTE=Dubslow;300220]
Doesn't that rather defeat the purpose of compiling?[/QUOTE]
You compile and link the parts that run on the CPU and drive the GPU. However, as OpenCL's claim is to run on a wide variety of devices, it is impractical to ship pre-compiled device code for all possible platforms. Instead, the device vendors ship the compiler in their drivers, and the GPU code is compiled at runtime. During the build of mfakto, the OpenCL files are not touched. You can easily modify them before starting mfakto, and your changes will be compiled and executed. An approach somewhere between Java and shell scripts :wink:.

Dubslow 2012-05-25 23:18

[QUOTE=Bdot;300238]However, as OpenCL's claim is to run on a wide variety of devices, it is impractical to have pre-compiled device-code for all possible platforms. [/QUOTE]

Ah, okay, so whereas nVidia knows exactly which cards are CUDA-capable and what each of them can do (and it's the only driver provider), OpenCL is (in theory) supposed to be agnostic about whatever device it's running on, which potentially includes a lot more than AMD GPUs, up to and including regular old CPUs. Makes sense :smile: (Still, the compilers can't do too much optimization, otherwise you'd have to wait five minutes between starting the program and it actually running, especially for more complex code.)

LaurV 2012-05-26 04:59

[QUOTE=Dubslow;300239]...which potentially includes a lot more than AMD GPUs...[/QUOTE]
Which includes - certainly, not potentially - the NV GPUs too, [URL="http://www.nvidia.com/object/cuda_opencl_1.html"]they are all OpenCL-able[/URL], at least at a theoretical level...

Dubslow 2012-05-26 06:49

[QUOTE=LaurV;300255] at least at theoretical level...[/QUOTE]
[url]http://www.mersenneforum.org/showpost.php?p=286230&postcount=336[/url]
[QUOTE=Bdot;286230]
BTW, testing mfakto on Nvidia turns out to be way more effort than it might be worth. Nvidia's OpenCL compiler is buggy and not yet complete. I had to remove all printf's even though they were in inactive #ifdefs. And once that was done, the compiler crashes.
[code]
Error in processing command line: Don't understand command line argument "-O3"!
[/code][code]
(0) Error: call to external function printf is not supported
[/code][code]
Select device - Get device info - Compiling kernels .Stack dump:
0. Running pass 'Function Pass Manager' on module ''.
1. Running pass 'Combine redundant instructions' on function '@mfakto_cl_barrett79'

mfakto-nv.exe has stopped working
[/code][/QUOTE]
:smile:
"No plan survives contact with the enemy."

kracker 2012-05-31 19:14

0.11 is much, much, much better here... thanks :)

On 0.10p1, one LL TF test took about 8 hours; with 0.11 it takes about 4.5 hours

Bdot 2012-06-01 20:35

[QUOTE=kracker;300867]0.11 is much, much, much better here... thanks :)

On 0.10p1, one LL TF test took about 8 hours; with 0.11 it takes about 4.5 hours[/QUOTE]

Thanks :smile:

What hardware are you running? I have not seen an almost-doubling in my tests ... Did you change anything else?

Anyway, it's nice to hear it's working well.

kracker 2012-06-03 19:22

[QUOTE=Bdot;300981]Thanks :smile:

what hw are you running? I have not seen an almost-doubling in my tests ... Did you change anything else?

Anyway, it's nice to hear it's working well.[/QUOTE]

[URL]http://www.mersenneforum.org/showpost.php?p=285827&postcount=307[/URL]
[URL]http://www.newegg.com/Product/Product.aspx?Item=N82E16819103942[/URL]

P.S.: Using 64k binary

dbaugh 2012-06-06 09:35

on-chip graphics
 
I am using an i7-3960x with a Radeon HD 7970. mfakto compiles and passes both the self test and the two suggested test worktodo values. When I give it something that takes more than a couple of minutes, I lose my screen (all monitor timeouts are disabled). Sometimes it gives me a messed-up screen first. I can only get it back by doing a power-on reset. Is there a way to force the system (Win7) to use the on-chip graphics and leave the video card for OpenCL work?

Thanks,

David

axn 2012-06-06 12:01

[QUOTE=dbaugh;301424]I am using an i7-3960x with a Radeon HD 7970. mfakto compiles and passes both the self test and the two suggested test worktodo values. When I give it something that takes more than a couple of minutes, I lose my screen (all monitor timeouts are disabled). Sometimes it gives me a messed-up screen first. I can only get it back by doing a power-on reset. Is there a way to force the system (Win7) to use the on-chip graphics and leave the video card for OpenCL work?

Thanks,

David[/QUOTE]

Ummm... Connect the monitor to the IGP output?

KyleAskine 2012-06-06 12:21

[QUOTE=dbaugh;301424]I am using an i7-3960x with a Radeon HD 7970. mfakto compiles and passes both the self test and the two suggested test worktodo values. When I give it something that takes more than a couple of minutes, I lose my screen (all monitor timeouts are disabled). Sometimes it gives me a messed-up screen first. I can only get it back by doing a power-on reset. Is there a way to force the system (Win7) to use the on-chip graphics and leave the video card for OpenCL work?

Thanks,

David[/QUOTE]

This simply shouldn't happen. What do you mean by 'all monitor timeouts are disabled'? It sounds to me like there are either driver issues, or you have a faulty video card and the stress is simply exposing the bad hardware.

fivemack 2012-06-06 12:34

[QUOTE=dbaugh;301424]I am using an i7-3960x ... is there a way to force the system (Win7) to use the onchip graphics and leave the video card for opencl work?[/QUOTE]

The i7-3960x does not have on-chip graphics, so that's difficult. Do you have another PCIe slot that you could put a cheap low-end Radeon in?

dbaugh 2012-06-06 19:57

on-chip graphics
 
fivemack is on to something. It looks like the first thing I need to do to force the use of on-chip graphics is use a processor that has it. By "all monitor timeouts disabled", I meant that under power settings, among other places, I chose to never turn off the display. I'll give new drivers (if any have come out in the last month) a try, and a low-end buddy card is probably my best shot. I sure hope my card is not bad. I'll need to find a non-mfakto way to test it.

Many thanks,

David

Dubslow 2012-06-06 20:48

God, I hate the designations on the SB-E... I saw the '3' and without thinking assumed it was Ivy Bridge. Good catch fivemack. Damn you and your stupid names, Intel.

kracker 2012-06-06 23:15

[QUOTE=Dubslow;301472]God, I hate the designations on the SB-E... I saw the '3' and without thinking assumed it was Ivy Bridge. Good catch fivemack. Damn you and your stupid names, Intel.[/QUOTE]

+1

Me too. Those are damned confusing.

nucleon 2012-06-07 14:36

i7-39xx - hex core +HT LGA2011
i7-38xx - quad core +HT LGA2011
i7-37xx - quad core +HT LGA1155
i7-35xx - quad core no HT LGA1155

LGA2011 - no inbuilt GPU
LGA1155 - on die GPU

It does have some sense to it.

-- Craig

Dubslow 2012-06-07 14:40

[QUOTE=nucleon;301534]i7-39xx - hex core +HT LGA2011 -- SANDY BRIDGE (32nm)
i7-38xx - quad core +HT LGA2011 -- SANDY BRIDGE (32nm)
i7-37xx - quad core +HT LGA1155 -- IVY BRIDGE (22nm)
i7-35xx - quad core no HT LGA1155 -- IVY BRIDGE (22nm)

LGA2011 - no inbuilt GPU
LGA1155 - on die GPU

It does have some sense to it.

-- Craig[/QUOTE]
Except for the whole different-architectures thing. One architecture should be 2xxx and the other 3xxx so we can tell the difference at a glance. It would be a lovely and excellent naming system, [i]if they got the series/architecture part right[/i]. I do agree the rest is sensible.

nucleon 2012-06-07 15:19

True - I was a bit surprised that SNB-E was i7-3xxx; I was expecting i7-2xxx.

They kind of baked themselves into a corner with IVB-E (if it ever gets released).



-- Craig Meyers

Dubslow 2012-06-07 15:25

[QUOTE=nucleon;301544]True - I was a bit surprised that SNB-E was i7-3xxx; I was expecting i7-2xxx.

They kind of baked themselves into a corner with IVB-E (if it ever gets released).



-- Craig Meyers[/QUOTE]

If they call it i7-4xxx I'll shoot something.

Bdot 2012-06-07 21:19

[QUOTE=kracker;301155][URL]http://www.mersenneforum.org/showpost.php?p=285827&postcount=307[/URL]
[URL]http://www.newegg.com/Product/Product.aspx?Item=N82E16819103942[/URL]

P.S.: Using 64k binary[/QUOTE]

Would you mind running a single assignment when the machine is otherwise idle and send
[LIST=1][*]GPU model (including core speed if overclocked)[*]Assignment (e.g. "Factor=54321987,69,70")[*]Real time to run assignment[*]GPU usage in percent (average)[/LIST] to [EMAIL="//james@jamesheinrich.com"]James[/EMAIL]? So he can update the [URL="http://mersenne-aries.sili.net/mfaktc.php"]GPU performance page.[/URL]

Actually, not just you. It would be good if everyone running the 0.11 version helps this page by submitting measurements ...

kracker 2012-06-07 21:39

[QUOTE=Bdot;301577]Would you mind running a single assignment when the machine is otherwise idle and send
[LIST=1][*]GPU model (including core speed if overclocked)[*]Assignment (e.g. "Factor=54321987,69,70")[*]Real time to run assignment[*]GPU usage in percent (average)[/LIST] to [EMAIL="//james@jamesheinrich.com"]James[/EMAIL]? So he can update the [URL="http://mersenne-aries.sili.net/mfaktc.php"]GPU performance page.[/URL]

Actually, not just you. It would be good if everyone running the 0.11 version helps this page by submitting measurements ...[/QUOTE]

Ok, I will, after this current batch finishes :)

Bdot 2012-06-07 21:42

[QUOTE=dbaugh;301424]I am using an i7-3960x with a Radeon HD 7970. mfakto compiles and passes both the self test and the two suggested test worktodo values. When I give it something that takes more than a couple of minutes, I lose my screen (all monitor timeouts are disabled). Sometimes it gives me a messed-up screen first. I can only get it back by doing a power-on reset. Is there a way to force the system (Win7) to use the on-chip graphics and leave the video card for OpenCL work?

Thanks,

David[/QUOTE]

Finally someone testing mfakto on HD7970, Thanks!

I guess the on-chip graphics question has been "solved" :smile:; I'll try to help with the current setup.

If all went well, there should be no need to switch to another card - your card should be able to handle it all. So if the kernels run for too long, then there's something wrong.

First test would be to lower GridSize in mfakto.ini and see if that improves the situation. Start with GridSize=0 and increase it as long as it is stable.
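A minimal sketch of that first test, assuming the key sits in mfakto.ini as described above (the comment syntax is illustrative):

```ini
; Debugging step from the post above: start with the smallest grid
; and raise the value again only while the card stays stable.
GridSize=0
```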

Second, I'd suggest to use GPU-Z or something similar to see if the card's core clock goes up when mfakto is running. It is hard to believe that this card could not handle a block of 2M candidates within the timeout.

Third, I'd like you to run the performance info version of mfakto (from [URL="http://mersenneforum.org/mfakto/mfakto-0.11/specialVersions_x64.zip"]here[/URL]) like this:
mfakto-pi -st > perfinfo.txt
You can Ctrl-C it after a minute or so. Using this output I can determine which of the kernels potentially takes too long.

dbaugh 2012-06-08 03:40

suggested test results
 
Installed the 12-4 driver. Set the ini to GridSize=0. Had GPU-Z 0.6.2 running. The GPU core clock jumped from 300 to 950, the GPU memory clock jumped from 150 to 1425, and GPU load jumped from 0% to 68% when I started mfakto. The same blank screen after a couple of minutes or less. I have the CPU box on a wattmeter and it goes from 180 to 325 when I start mfakto. It drops to 220 when the screen blanks out, indicating that the GPU is no longer working very hard. I tried to attach the perfinfo.txt file covering from start to a couple of minutes after blankout, but it is 5 MB.

WBR,

David

Bdot 2012-06-08 08:56

[QUOTE=dbaugh;301605]Installed the 12-4 driver. Set the ini to GridSize=0. Had GPU-Z 0.6.2 running. The GPU core clock jumped from 300 to 950, the GPU memory clock jumped from 150 to 1425, and GPU load jumped from 0% to 68% when I started mfakto. The same blank screen after a couple of minutes or less. I have the CPU box on a wattmeter and it goes from 180 to 325 when I start mfakto. It drops to 220 when the screen blanks out, indicating that the GPU is no longer working very hard. I tried to attach the perfinfo.txt file covering from start to a couple of minutes after blankout, but it is 5 MB.

WBR,

David[/QUOTE]

I've sent you a pm with an email address where you can forward the perfinfo.txt to.

Well, it seems I had misunderstood the problem a bit. I thought the test was running into the Windows graphics-driver timeout. But now I understand that testing an assignment starts OK, processes a few classes, and then stops with screen garbage or a blank screen. This would indeed indicate a hardware problem, maybe with the GPU fan(s). In GPU-Z, do you see very high GPU temperatures (above 80 or 90 °C)? Are all fans running?

The AMD driver should also have installed the "AMD VISION Engine Control Center". In there, under power or performance, there should be a Graphics Overdrive section - usually used for overclocking. But I'd like you to move the GPU and memory clock sliders to the lowest possible settings (probably around 600 MHz for the GPU) and then test again. This setting should produce far less heat, and if that was the issue, it should now run stable (but slow).

If heat was the problem, then other GPU tests, like furmark, will also show the issue.

kracker 2012-06-08 19:31

What are compute elements?
 
@bdot:
OT, but I just wanted to ask you, are "compute elements" stream processors or is it something different?

Thanks

...
maximum threads per block 256
maximum threads per grid 16777216
number of multiprocessors 5 (400 compute elements (estimate for ATI GPUs))
clock rate 600MHz
...

Bdot 2012-06-09 13:34

[QUOTE=kracker;301704]@bdot:
OT, but I just wanted to ask you, are "compute elements" stream processors or is it something different?

Thanks

...
maximum threads per block 256
maximum threads per grid 16777216
number of multiprocessors 5 (400 compute elements (estimate for ATI GPUs))
clock rate 600MHz
...[/QUOTE]
:smile:
So far it is just 80 times the number of multiprocessors reported by the hardware. Unfortunately the number of stream processors is not reported, so I used the multiplier of the VLIW5 architecture. For VLIW4 (Cayman) this is already different, and for GCN (Tahiti) anyway. 400 is obviously not quite true for your card ...

And that is not off-topic. I wanted to automate this calculation long ago, but it is harder than I imagined. I'll try to come up with something for the next release.

Hmm, which card was that? The report you sent me for HD7970 contains
[code]number of multiprocessors 32 (2560 compute elements (estimate for ATI GPUs))[/code]
Thanks for your perfinfo, btw. There's good and bad news.

The bad news is that these results do not show any indication of what might go wrong with your card - the tests advance normally and successfully, as on any other card (which is actually good).

The good news is that the HD7970 can top the HD6970 by at least 10% in mfakto - however, in a completely different way than I thought. The 15-bit kernel that helped bring Cayman to the top is the absolute slowest one on Tahiti. It appears that Tahiti no longer imposes a penalty on 32-bit operations, because suddenly the plain 32-bit barrett79 kernel is the fastest, with a theoretical throughput of 360M/s.

I guess with a little optimization for this platform, 400M/s seems feasible. However, that stability issue is certainly alarming. Maybe you could post the issue to [url]http://devgurus.amd.com/community/opencl[/url] - there are AMD folks and geeks there who can help troubleshoot better.

kracker 2012-06-09 14:13

Yeah, I believe mine has 600 stream processors on it

Edit: BTW, the 7970 has 2048.

Bdot 2012-06-14 15:43

[QUOTE=kracker;301838]Yeah, I believe mine has 600 stream processors on it
[/QUOTE]

It's 400 (@600 MHz) in the [FONT=Verdana][SIZE=2][COLOR=#000000][FONT=verdana,geneva][SIZE=2]HD 6550D / A8 3850[/SIZE][/FONT][/COLOR][/SIZE][/FONT], so mfakto guessed it right :smile:



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.