mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

apsen 2011-08-31 13:25

1 Attachment(s)
[QUOTE=Bdot;270427]I posted that on the [URL="http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=154048&enterthread=y"]AMD Forum[/URL] and they want to know what exactly the error is. Could you please try again and tell me?[/QUOTE]

Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open the log file. The log file says "Error messages" and it looks like some details should follow, but there are none.

Bdot 2011-08-31 19:08

[QUOTE=AldoA;270456]Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device-GPU not found-fallback to CPU". What should I do? Anyway, I ran the selftest and it passed. What else can I do? (Sorry for the questions but I'm not really into computing).[/QUOTE]

Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work, not sure about 11.6, but they definitely should not be older.

If that is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run a 32-bit OS). This should contain one section for your GPU and one for the CPU.


@apsen: Thanks for the details, I forwarded it - looks like W2k8 should work as well ...

AldoA 2011-08-31 19:44

1 Attachment(s)
[QUOTE=Bdot;270478]Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work, not sure about 11.6, but they definitely should not be older.

If that is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run 32-bit OS). This should contain one section for your GPU and one for the CPU.


@apsen: Thanks for the details, I forwarded it - looks like W2k8 should work as well ...[/QUOTE]

No, I tried to install 11.8, but my antivirus found a virus, so it stopped.
Anyway, this should be the output of clinfo.exe

MrHappy 2011-08-31 21:00

With two instances running, every instance just gets half the throughput (25M/s) - so nothing gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs while it is at ~100µs with one instance. Is that a good or bad thing?

Christenson 2011-09-01 00:17

[QUOTE=Bdot;270394]I'm still in a stage of collecting ideas how to distribute the work onto multiple threads.

Easiest would be to give each thread a different exponent to work on. (Yuck!)

Each thread could also process a fixed block of sieve-input. This would require sieve-initialization for each block as you cannot build upon the state of the previous block. Therefore each block needs to have a good size to make the initialization less prominent. An extra step (i.e. extra kernel) would be needed to combine the output of all the threads into the sieve-output. And only after that step we know if we have enough FCs to fill a block for the GPU factoring. (I'd rather do variable size blocks and toss the ends when needed)

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require to have good estimates about where each block will start. Usually you don't know where a certain block starts until the previous block is finished sieving. It can be estimated, but to be safe, there needs to be a certain overlap, some checks and maybe re-runs of the sieving if gaps were detected. (again, variable sized blocks are better)

We could split the primes that are used to sieve a block. Disadvantages include different run-lengths for the loops, lots of (slow) global memory operations and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-size blocks and local memory that is considerably faster, and combining that later into global memory.

Maybe the best would be to split the task (factor M[SUB]exp[/SUB] from 2[SUP]n[/SUP] to 2[SUP]m[/SUP]) into <workgroup> equally-sized blocks and run sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup numbers seem to be 32 to 256, depending on the GPU.

More suggestions, votes, comments?[/QUOTE]
Impressionistically, on mfaktc, my exponents are running a half-hour or more apiece. The really big ones from Operation Billion Digits can run for days and weeks. So I would argue that you want to assume the card is TF'ing on only one exponent at a time, not TF'ing multiple exponents in parallel.

I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs then split into a number of blocks of the right size to do in parallel, plus one "runt" block.

From a dataflow perspective, the list of probable primes (PrPs) of the right form is fed from the CPU to the GPU in blocks for calculating (2^Exp) mod PrP and checking if the result is unity. The quality of the sieving is adjusted so as to just keep the exponentiation and modulus process busy.
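
For readers who want the exponentiation step spelled out, here is a minimal Python sketch of the check described above. It is not mfaktc or mfakto code; the function name and the small exponents are chosen only for illustration. A candidate q = 2kp + 1 divides 2^p - 1 exactly when 2^p mod q is 1.

```python
def divides_mersenne(p, k):
    """Return True if q = 2*k*p + 1 divides the Mersenne number 2**p - 1."""
    q = 2 * k * p + 1
    # Modular exponentiation: q divides 2**p - 1 exactly when 2**p mod q == 1.
    return pow(2, p, q) == 1


# Small, well-known cases: 2**11 - 1 = 23 * 89 and 2**23 - 1 = 47 * 178481.
print(divides_mersenne(11, 1))  # q = 23
print(divides_mersenne(11, 4))  # q = 89
print(divides_mersenne(23, 1))  # q = 47
print(divides_mersenne(11, 2))  # q = 45, which is not a factor
```

On a GPU the same test is run for many candidates at once, which is why the candidates are handed over in blocks.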

I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, and the CPU just feeds pointers to the GPU kernels.
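
A likely explanation for the number, stated here as an assumption rather than something confirmed in this thread: 4620 = 4 * 3 * 5 * 7 * 11, and candidates q = 2kp + 1 are grouped by k mod 4620. Within one class, q has the same residue mod 8, 3, 5, 7 and 11, so a class can be dropped wholesale if its candidates are not congruent to +-1 mod 8 or are divisible by one of those small primes. The Python sketch below counts the classes that remain; the names are made up for illustration.

```python
NUM_CLASSES = 4 * 3 * 5 * 7 * 11  # 4620


def surviving_classes(p):
    """List residues c = k mod 4620 whose candidates q = 2*k*p + 1
    can still be factors of 2**p - 1 (p an odd prime larger than 11)."""
    survivors = []
    for c in range(NUM_CLASSES):
        q = 2 * c * p + 1
        # Factors of a Mersenne number with odd prime exponent are +-1 mod 8.
        if q % 8 not in (1, 7):
            continue
        # Every candidate in this class is divisible by a small prime.
        if any(q % r == 0 for r in (3, 5, 7, 11)):
            continue
        survivors.append(c)
    return survivors


# Exponent borrowed from a later post in this thread (M41774351).
print(len(surviving_classes(41774351)), "of", NUM_CLASSES, "classes remain")
```

Two of the four residues mod 4 survive the mod-8 condition, and r - 1 residues survive for each small prime r, so 2 * 2 * 4 * 6 * 10 = 960 classes are left to sieve and test.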

I think the right way to think about the threading problem is that a single thread on the CPU acts like an operating system for the GPU....it allocates the GPU resources, sets the kernels in motion, and wakes up again when a kernel completes and figures out what the GPU should do next, and/or fires the next GPU kernel. The optimum level of parallelism in time on the GPU is an open question....should you sieve until the GPU memory is half-full of PrP FCs, then run them to completion, or should you sieve into more digestible blocks and queue them up so they run in parallel? It all depends on the cost and form of context switching on the GPU.

There's also some fairly heavy dependency on your representation for sieving; as with various modified versions of the Sieve of Eratosthenes, the representation proper can be optimised for space and speed, and how that works depends tremendously on the card internals, like the relative cost of clearing bits versus setting bytes or words, and the relative cost of scanning for the remaining FCs at the end of sieving.

Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could be carrying the FCs from the siever card to the one that did the exponentiation.

***********
But I am still working on automating mfaktc communications with Primenet; the only real progress has been an upgrade to parse.c that you (Bdot) will probably be very interested in. I'm supporting comments in worktodo.txt.
***********

Christenson 2011-09-01 00:25

[QUOTE=MrHappy;270487]With two instances running, every instance just gets half the throughput (25M/s) - so nothing gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs while it is at ~100µs with one instance. Is that a good or bad thing?[/QUOTE]

This equality of throughput is the statement that the CPU is waiting for the GPU...that is, the process is GPU-bound. Have a look at what sieveprimes is doing...I'm betting that it is higher with two instances running than one, meaning the CPU is doing a bit better job of sieving and working a bit harder.

Running two instances is probably using a little more of your CPU than just one (you should check this), and you have to deal with a little bit more confusion, so the data says, for your case, to just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.

Chaichontat 2011-09-01 10:55

[QUOTE=Bdot;270353]
Chaichontat's HD6850 (at this speed rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3 instances running on at least 4 CPU-cores.
[/QUOTE]

Hi,
My GPU reached 90M/s on 2 CPU cores at 58 percent GPU usage, with the "Sieve Prime" setting at 5000. When I started another instance, GPU usage went up to 96 percent, using 4 threads; the first instance gives 65M/s at Sieve Prime 10000 and the second gives 70M/s at Sieve Prime 8000. Both instances combined give ~140M/s.

P.S. My GPU is HD6950 though.

Christenson 2011-09-01 12:43

[QUOTE=Chaichontat;270553]Hi,
My GPU reached 90M/s on 2 CPU cores at 58 percent GPU usage, with the "Sieve Prime" setting at 5000. When I started another instance, GPU usage went up to 96 percent, using 4 threads; the first instance gives 65M/s at Sieve Prime 10000 and the second gives 70M/s at Sieve Prime 8000. Both instances combined give ~140M/s.

P.S. My GPU is HD6950 though.[/QUOTE]

This is the opposite of Mr Happy's situation... with two instances, the CPUs here are just barely keeping up with the GPUs, and have to lower the quality of the candidates (SievePrimes) to do it. That is, you are definitely CPU bound.

Bdot 2011-09-01 13:28

[QUOTE=AldoA;270481]No, I tried to install 11.8, but my antivirus found a virus, so it stopped.
Anyway, this should be the output of clinfo.exe[/QUOTE]

This teaches us that we really need a (fairly) up-to-date Catalyst driver. The clinfo output does not even mention your GPU.

Please try again to download and install 11.8. If your antivirus still complains, try to get an updated virus definition, or temporarily disable checking. If all fails, AMD provides a link to [URL="http://support.amd.com/us/gpudownload/windows/previous/Pages/radeonaiw_vista64.aspx"]older versions[/URL] - you can try 11.7.

Bdot 2011-09-01 13:31

[QUOTE=Christenson;270505]This equality of throughput is the statement that the CPU is waiting for the GPU...that is, the process is GPU-bound. Have a look at what sieveprimes is doing...I'm betting that it is higher with two instances running than one, meaning the CPU is doing a bit better job of sieving and working a bit harder.

Running two instances is probably using a little more of your CPU than just one(you should check this), and you have to deal with a little bit more confusion, so, the data says, for your case, to just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.[/QUOTE]

Yes, that depends on the SievePrimes. I guess when running 2 instances, it will go to 200k for both. For a single instance it is probably lower. If you can spare the CPU power, running 2 instances will still be some advantage, because at the same total speed of 50M/s fewer candidates need to be tested, thus the classes are finished faster.

Bdot 2011-09-01 14:09

GPU sieving
 
Christenson, thanks for your thoughts, it's good to have some discussion about it ...

[QUOTE=Christenson;270502]
I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs then split into a number of blocks of the right size to do in parallel, plus one "runt" block.

I'm not sure where the 4620 classes (in mfaktc) comes from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, and the CPU just feeds pointers to the GPU kernels.
[/QUOTE]

I'll probably go ahead and create siever and compacter kernels that initially do not do a lot of sieving, just to see how they can work together and how the CPU can control them. The kernels need to run interleaved as OpenCL does not (yet) support running them in parallel (except on different devices). However, there's no need to copy the blocks to the CPU and back as the subsequent kernel can easily access them. This will be the major benefit of the GPU sieve (no pressure on the memory bus, reducing interference with prime95, for instance) - I do not expect it to be much faster than CPU sieving. Also, when leaving the blocks in GPU memory, optimizing for size does not seem to be so important.
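
To make the siever/compacter split concrete, here is a rough CPU-side Python model of the two stages. It is neither OpenCL nor mfakto's code: one pass clears flags for candidates q = 2kp + 1 that are divisible by a small prime, and a second pass packs the surviving k values into a dense list that a factoring kernel could consume. All names and the choice of sieve primes are illustrative assumptions.

```python
def sieve_block(p, k_start, block_size, sieve_primes):
    """Stage 1: flag which k in [k_start, k_start + block_size) survive sieving."""
    alive = [True] * block_size
    for r in sieve_primes:
        # Brute-force scan for clarity; a real siever would stride by r.
        for i in range(block_size):
            if (2 * (k_start + i) * p + 1) % r == 0:
                alive[i] = False
    return alive


def compact(k_start, alive):
    """Stage 2: pack the surviving k values into a dense list."""
    return [k_start + i for i, flag in enumerate(alive) if flag]


p = 41774351
primes = (3, 5, 7, 11, 13)
flags = sieve_block(p, 1, 1000, primes)
candidates = compact(1, flags)
print(len(candidates), "of", len(flags), "candidates left after sieving")
```

In this model the flag array never leaves the "device": the compacted list is the only thing the next stage needs to read.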

BTW, GPU context switching requires copying memory blocks from and to the GPU, therefore having smaller memory blocks can be advantageous. However, different kernels can run (sequentially) in the same context, with almost no switching time.

[QUOTE=Christenson;270502]
Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could be carrying the FCs from the siever card to the one that did the exponentiation.
[/QUOTE]

This would require good balancing to make the kernels run equally long, which is not necessary when running the kernels serialized. And you'd need to copy the blocks around again.

Bdot 2011-09-05 21:57

[QUOTE=apsen;270466]Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open log file. The log file says "Error messages" and it looks like some details should follow but there are none.[/QUOTE]

They say it may happen if the Catalyst driver is too old, or did not install properly.

Did you have an outdated version on the W2008 box (or no Catalyst at all)?

apsen 2011-09-06 14:11

[QUOTE=Bdot;270908]They say it may happen if the Catalyst driver is too old, or did not install properly.

Did you have an outdated version on the W2008 box (or no Catalyst at all)?[/QUOTE]

I actually tried to install the whole bunch. Every other component got installed but that one. After that I tried to install just the APP but it failed too.

Bdot 2011-09-06 21:36

[QUOTE=apsen;270959]I actually tried to install the whole bunch. Every other component got installed but that one. After that I tried to install just the APP but it failed too.[/QUOTE]

The Catalyst drivers are a separate package - they're not included in the AMD APP SDK. They are the Graphics drivers for your card.

apsen 2011-09-08 03:46

[QUOTE=Bdot;270996]The Catalyst drivers are a separate package - they're not included in the AMD APP SDK. They are the Graphics drivers for your card.[/QUOTE]

But the drivers come with the SDK if you get the big package.

KingKurly 2011-09-08 15:55

Bdot,

Thank you for your work on mfakto; I enjoy it and find it useful.

When do you think the program will be ready for us to start using it to submit "no factor found" results? I have found one factor (which I already mentioned to you) but only "no factor" results since then. Until I get a go-ahead, I will continue to NOT submit those results: only factors.

-Greg

Bdot 2011-09-08 18:32

[QUOTE=KingKurly;271192]Bdot,

Thank you for your work on mfakto; I enjoy it and find it useful.

When do you think the program will be ready for us to start using it to submit "no factor found" results? I have found one factor (which I already mentioned to you) but only "no factor" results since then. Until I get a go-ahead, I will continue to NOT submit those results: only factors.

-Greg[/QUOTE]


I did fixes for the issues reported for 0.07. Currently there are some tests underway because of confusing performance results with 0.07: It seems on HD6xxx, version 0.05 was faster than 0.07 (on HD5xxx, 0.07 is about 10% faster than 0.05 was). If that is true, then I will need one more change before releasing the next version - should come within the next few days.

AldoA 2011-09-10 16:21

[QUOTE=Bdot;270567]This teaches us that we really need a (fairly) up-to-date Catalyst driver. The clinfo output does not even mention your GPU.

Please try again to download and install 11.8. If your antivirus still complains, try to get an updated virus definition, or temporarily disable checking. If all fails, AMD provides a link to [URL="http://support.amd.com/us/gpudownload/windows/previous/Pages/radeonaiw_vista64.aspx"]older versions[/URL] - you can try 11.7.[/QUOTE]

Sorry, but the new driver gives me too many problems. However, I installed mfakto on my notebook, which has an ATI Mobility Radeon HD5470 with driver version 11.7, but it doesn't find the GPU there either. I tried to copy and post the clinfo output, but the window closed too fast. Do you have any advice on what to do? Thanks

Bdot 2011-09-12 07:57

[QUOTE=AldoA;271403]Sorry, but the new driver gives me too many problems. However, I installed mfakto on my notebook, which has an ATI Mobility Radeon HD5470 with driver version 11.7, but it doesn't find the GPU there either. I tried to copy and post the clinfo output, but the window closed too fast. Do you have any advice on what to do? Thanks[/QUOTE]

To get the output of command-line tools it is usually best to open a command prompt (Run -> cmd, or find the black "Command Prompt" icon). In there, run the tool you want, in this case clinfo. You may need to add the full path to it, or cd (change directory) to where the tool sits.

Redirecting the output to a file is quite useful, such as
clinfo >clinfo.txt or
mfakto --CLtest >mfakto-test.txt

For the installation of the Catalyst drivers and AMD APP SDK I cannot suggest much more ... if both installations succeeded yet the GPU cannot be found, I don't know. You can try some of the sample applications that come with the APP SDK, and see if they find the GPU, but if clinfo does not list it, that's not likely.

DigiK-oz 2011-09-13 12:47

[QUOTE=Bdot;270043]Did anyone else give mfakto a try? Any experiences to share (anything strange happening, suggestions you'd like to get included or excluded for the next versions, performance figures for other GPUs, ...)?

I'm running this version on a SuSE 11.4 box with AMD APP SDK 2.4, and when multiple instances are running I occasionally see one instance hanging. It will completely occupy one CPU core but no GPU resources. It is looping inside some kernel code, being immune to kill, kill -9 or attempts to attach a debugger or gcore. So far, reboot was the only way I know to get rid of it. How can I find out where that hang occurs? And what else could I try to kick such a process without a reboot?[/QUOTE]

Just set up 0.07 on my WIN7 X64 system with AMD 5770 (driver version 11.8). Having the same problem as reported earlier, selftest fails on 1-5 and 9-11.

I do not have the entire SDK installed, just the drivers including OpenCL. Should I install the entire SDK? Anything else I can do to get started? (I know the source can be changed slightly, but no compiler/SDK handy).

Looking forward to the "official" version we can actually report "No factors" with!

MrHappy 2011-09-13 19:10

1 Attachment(s)
I just updated catalyst drivers from 11.6 to 11.8 and.... got errors in the test and mfakto just shuts down...

Bdot 2011-09-13 20:35

Version 0.08 release
 
1 Attachment(s)
Here it is, [B]mfakto version 0.08[/B]

This one is now compatible with Catalyst 11.8, and has a few other changes (from the changelog):
[LIST]
[*]added --help and parameter checking
[*]exclude single-vectored 72-bit-mul24-kernel (not working with AMD APP SDK 2.5)
[*]removed THREADS_PER_BLOCK (no longer needed, will be selected automatically based on the GPU capabilities)
[*]removed slow 95-bit kernel (not usable anyway)
[*]added to mfakto.ini a config setting PreferKernel=mfakto_cl_barrett79|mfakto_cl_71 as HD5xxx and HD6xxx have their top-speed with different kernels
[*]tuned the settings for SievePrimesAdjust - should now be usable
[/LIST]

Quite important is the new config setting [B]PreferKernel[/B] at the end of mfakto.ini. The two possible settings, mfakto_cl_barrett79 and mfakto_cl_71, decide which of these two kernels will be used for factors between 2[SUP]64[/SUP] and 2[SUP]71[/SUP]. On HD5xxx cards, mfakto_cl_barrett79 seems to be faster while HD6xxx may run better with mfakto_cl_71 (which is the MUL24 kernel). Try it out and let me know :smile:.

And feel free to submit your results to primenet with this version.

Bdot 2011-09-13 20:38

mfakto version 0.08 - Windows
 
1 Attachment(s)
Windows 32-bit and 64-bit.

Bdot 2011-09-13 20:40

mfakto version 0.08 - Sources
 
1 Attachment(s)
and the source

KingKurly 2011-09-14 01:40

I tested the Linux build successfully:

[CODE]
Selftest statistics
number of tests 3332
successfull tests 3332

selftest PASSED!


real 43m31.235s
user 9m22.360s
sys 31m10.151s
[/CODE]

I did find it a bit strange that test cases 1552 through 1557 had no output to the terminal, but I will transition to using 0.08 now and I will begin submitting results to PrimeNet as they become available.

Do we need to redo "no factor" work that was done under 0.07, or can those results be submitted? I have not downloaded the source to check a diff to see if it is a reasonable thing to do.

Bdot 2011-09-14 08:53

[QUOTE=KingKurly;271678]I was able to test the Linux build as a success:
[CODE]

selftest PASSED!

[/CODE][/QUOTE]
Great to see!
[QUOTE=KingKurly;271678]I did find it a bit strange that test cases 1552 through 1557 had no output to the terminal, but I will transition to using 0.08 now and I will begin submitting results to PrimeNet as they become available.
[/QUOTE]

The reason is that test cases 1552 to 1557 are for factors of more than 91 bits. I had removed the 95-bit kernel because it was so terribly slow that you would not want to use it anyway. Therefore, mfakto currently only supports TF up to 91 bits.

[QUOTE=KingKurly;271678]
Do we need to redo "no factor" work that was done under 0.07 or can those be submitted? I have not downloaded the source to check a diff to see if it is reasonable thing to do.[/QUOTE]

In version 0.07, the single-vectored MUL24 kernel did not work with Catalyst 11.8. In your self-compiled version you removed that kernel from the selftest, but not from the program. If you never changed the mfakto.ini-Parameter VectorSize (i.e. if you left it at 4), then that faulty kernel has not been used and you can submit the previous results without re-running them.

KingKurly 2011-09-14 14:50

[QUOTE=Bdot;271700]In version 0.07, the single-vectored MUL24 kernel did not work with Catalyst 11.8. In your self-compiled version you removed that kernel from the selftest, but not from the program. If you never changed the mfakto.ini-Parameter VectorSize (i.e. if you left it at 4), then that faulty kernel has not been used and you can submit the previous results without re-running them.[/QUOTE]
I have confirmed that the VectorSize never changed from 4. I am submitting the results, "closing the book" on 0.07, and moving to 0.08. Thank you so much! :smile:

Bdot 2011-09-14 16:03

[QUOTE=KingKurly;271707] moving to 0.08. Thank you so much! :smile:[/QUOTE]

Did you already check whether mfakto_cl_barrett79 or mfakto_cl_71 is faster on your GPU? I'm really interested to see the two compared on different GPUs. On my HD5770, barrett is about 10% faster ...

Razor_FX_II 2011-09-14 16:28

[QUOTE=Bdot;271717]Did you already check if mfakto_cl_barrett79 or mfakto_cl_71 are faster on your GPU? I'm really interested to see the two compared on different GPUs. On my HD5770, barrett is about 10% faster ...[/QUOTE]
Using mfakto-0.08 on my HD4870's and HD4890's mfakto_cl_barrett79 is about 10% faster.
mfakto_cl_barrett79 avg rate: 55M/s
mfakto_cl_71 avg rate: 50M/s

KingKurly 2011-09-14 17:53

[QUOTE=Bdot;271717]Did you already check if mfakto_cl_barrett79 or mfakto_cl_71 are faster on your GPU? I'm really interested to see the two compared on different GPUs. On my HD5770, barrett is about 10% faster ...[/QUOTE]
I am finding similar results. My HD5450 seems to do about 8.6M/s on the mfakto_cl_71 and about 9.1M/s on the mfakto_cl_barrett79, doing TF on M41774351 from 68 to 69.

James Heinrich 2011-09-14 23:13

Since it appeared to be missing, I've created a stub article on MersenneWiki for mfakto:
[url]http://www.mersennewiki.org/index.php/Mfakto[/url]

But since I don't actually use mfakto, perhaps someone else could fill in and fix all the details in the article.

DigiK-oz 2011-09-20 15:33

In the GPUGRID forum :

there's a bug in the latest sdk that makes a full use of a cpu-core whenever an opencl app is running.
They promised a fix, but still not here in 11.8
maybe in 11.9??

Maybe mfakto suffers from this as well? One of the threads using 100% of one cpu happens to be in the ATI libs....

Bdot 2011-09-20 18:23

[QUOTE=DigiK-oz;272157]In the GPUGRID forum :

there's a bug in the latest sdk that makes a full use of a cpu-core whenever an opencl app is running.
They promised a fix, but still not here in 11.8
maybe in 11.9??

Maybe mfakto suffers from this as well? One of the threads using 100% of one cpu happens to be in the ATI libs....[/QUOTE]

They seem to have implemented some kind of busy-wait (futex-based) whenever something needs to be synchronized with the GPU. As this is usually the CPU just waiting for the GPU to complete something, that is a total waste of CPU resources.

However, mfakto is not hit that badly, as it passes the prepared factor candidates to the GPU but does not wait for the results immediately. Instead, the next block of factor candidates is prepared on the CPU. mfakto synchronizes with the GPU only when the CPU prepares blocks faster than the GPU can process them, and of course at the end of a class.

So yes, mfakto will also consume a full CPU core, but it will do something useful most of that time.
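
The overlap described in this post can be modelled in a few lines of Python. This is only a sketch under stated assumptions: a single worker thread stands in for the GPU queue, `prepare_block` and `process_block` are hypothetical placeholders for sieving and the factoring kernel, and nothing here reflects mfakto's actual OpenCL calls.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_block(n):
    """Stand-in for CPU sieving: build block number n of factor candidates."""
    return list(range(n * 4, n * 4 + 4))


def process_block(block):
    """Stand-in for the GPU kernel: here it merely sums the block."""
    return sum(block)


def run_pipeline(num_blocks):
    results = []
    # A single worker thread plays the role of the GPU command queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = None
        for n in range(num_blocks):
            block = prepare_block(n)  # CPU work overlaps the pending job
            if pending is not None:
                results.append(pending.result())  # synchronize only now
            pending = gpu.submit(process_block, block)
        if pending is not None:
            results.append(pending.result())  # final sync, like the end of a class
    return results


print(run_pipeline(3))
```

The point of the structure is that the wait happens after the next block is already prepared, so a busy-waiting driver only burns CPU time when the CPU has run ahead of the GPU.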

Samoflan 2011-09-20 18:33

[QUOTE=Razor_FX_II;271719]Using mfakto-0.08 on my HD4870's and HD4890's mfakto_cl_barrett79 is about 10% faster.
mfakto_cl_barrett79 avg rate: 55M/s
mfakto_cl_71 avg rate: 50M/s[/QUOTE]

I get similar results on my HD4890

mfakto_cl_barrett79 avg rate: 51.9M/s
mfakto_cl_71 avg rate: 48.7M/s

GPU load is 91-95%
CPU load will almost cap out 2 cores on my Phenom II x4 955

Bdot 2011-09-26 19:14

Bug warning
 
I'm sorry to report: yesterday I found a bug, mfakto up to 0.08 does not find the factor for k=3 for M6599953.

The reason is an invalid "optimization" that I made over the mfaktc-code. Mfaktc does [B]not [/B]have this problem. I have fixed the bug and added a test case for it to the selftests.

The mfakto kernel "mfakto_cl_71" (all vector sizes) sometimes calculated a bad modulus when the factor candidate was <2[SUP]48[/SUP]. Smaller FCs (~2[SUP]24[/SUP]) had a higher chance for the error to occur, FCs >2[SUP]48[/SUP] were always calculated correctly. The problem does not depend on the exponent size.

I'm sorry for possibly having wasted effort and resources, but I hope it's not too many tests that need to be repeated as it's only about small FCs. I will provide a fixed version within the next few days.

apsen 2011-09-26 20:04

[QUOTE=Bdot;272797]I'm sorry to report: yesterday I found a bug, mfakto up to 0.08 does not find the factor for k=3 for M6599953.
[/QUOTE]

Does this affect 0.07 too?

Bdot 2011-09-26 23:10

[QUOTE=apsen;272801]Does this affect 0.07 too?[/QUOTE]
Yes. The code there is the same as 0.08 and I just tested it - the mentioned factor is not found by 0.07 either.

KingKurly 2011-09-27 03:26

[QUOTE=Bdot;272797]I'm sorry to report: yesterday I found a bug, mfakto up to 0.08 does not find the factor for k=3 for M6599953.

The reason is an invalid "optimization" that I made over the mfaktc-code. Mfaktc does [B]not [/B]have this problem. I have fixed the bug and added a test case for it to the selftests.

The mfakto kernel "mfakto_cl_71" (all vector sizes) sometimes calculated a bad modulus when the factor candidate was <2[SUP]48[/SUP]. Smaller FCs (~2[SUP]24[/SUP]) had a higher chance for the error to occur, FCs >2[SUP]48[/SUP] were always calculated correctly. The problem does not depend on the exponent size.

I'm sorry for possibly having wasted effort and resources, but I hope it's not too many tests that need to be repeated as it's only about small FCs. I will provide a fixed version within the next few days.[/QUOTE]
So tests like "no factor for M43207903 from 2^68 to 2^69 [mfakto 0.08 mfakto_cl_barrett79]" would be unaffected. A test would have to be for much lower bit levels to be affected?

Bdot 2011-09-27 07:20

[QUOTE=KingKurly;272844]So tests like "no factor for M43207903 from 2^68 to 2^69 [mfakto 0.08 mfakto_cl_barrett79]" would be unaffected. A test would have to be for much lower bit levels to be affected?[/QUOTE]
Yes. If you did the test like "no factor for M43207903 from 2^1 to 2^69 [mfakto 0.08 ...]" then the test might have missed factors that are below 2[SUP]48[/SUP], and a safety test would need to be done from 2[SUP]1[/SUP] to 2[SUP]48[/SUP] (with mfakto 0.09, or mfaktc).

Furthermore, tests done by the barrett kernel "[mfakto 0.08 mfakto_cl_barrett79]" are not affected, but this kernel cannot and will not be used for low factor sizes anyway.

So affected are:
- version 0.07 and 0.08 of mfakto
- all tests that have been run by the "mfakto_cl_71" kernel AND where the starting bit level is <48.
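
The criteria above can be written down as a small check for anyone sorting through a results file. This is a sketch only: the function name and the way version, kernel and starting bit level are passed in are assumptions, not part of mfakto.

```python
AFFECTED_VERSIONS = {"0.07", "0.08"}


def needs_rerun(version, kernel, start_bits):
    """Apply the criteria from the post above to one trial-factoring result."""
    return (
        version in AFFECTED_VERSIONS
        and kernel == "mfakto_cl_71"
        and start_bits < 48
    )


print(needs_rerun("0.08", "mfakto_cl_barrett79", 68))  # barrett kernel
print(needs_rerun("0.08", "mfakto_cl_71", 1))          # low starting bit level
print(needs_rerun("0.08", "mfakto_cl_71", 68))         # starts above 2^48
```

Only the second call reports a result that should be repeated, and then only for the range up to 2^48.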

henryzz 2011-09-27 11:38

I am pretty certain all of the GIMPS candidates have been tested past 2^48 using Prime95 so this shouldn't be an issue. mfakto just didn't double check that range.

KingKurly 2011-09-27 14:18

[QUOTE=henryzz;272878]I am pretty certain all of the GIMPS candidates have been tested past 2^48 using Prime95 so this shouldn't be an issue. mfakto just didn't double check that range.[/QUOTE]
Correct, and since I haven't checked any exponents for factors below 2^48 using mfakto, I can confidently say that my results are unaffected. Still, good catch, and I look forward to 0.09 soon.

DigiK-oz 2011-09-29 22:10

[QUOTE=Bdot;272179]They seem to have implemented some kind of busy-wait (futex-based) whenever something needs to be synchronized with the GPU. As this is usually the CPU just waiting for the GPU to complete something, that is a total waste of CPU resources.

However, mfakto is not hit that badly as mfakto passes the prepared factor candidates to the GPU but does not wait for the results immediately. Instead, the next block of factor candidates is prepared on the CPU. Only when the CPU is faster preparing the stuff than the GPU can process it, then mfakto will synchronize with the GPU. And of course at the end of a class.

So yes, mfakto will also consume a full CPU core, but it will do something useful most of that time.[/QUOTE]

Well, mfakto used to eat 2 full cores alongside the GPU (1 thread for mfakto itself, 1 thread somewhere in an ATI dll), but since the 11.9 drivers the only thread eating CPU is mfakto itself! So the guys at ATI seem to have fixed their drivers in that respect :)

Samoflan 2011-09-30 05:39

ATI drivers 11.9 seem to have increased the performance of mfakto 0.08 slightly, by almost 2% on my Radeon HD4870. CPU utilization is still the same on my Phenom II x4 955 at about 47% across all 4 cores. The video card now seems to stay at a consistent 95% load instead of fluctuating between 91% and 95%.

DigiK-oz 2011-09-30 05:55

Strange, the 11.9 drivers brought down CPU usage on my i7 920 with hyperthreading from almost 25% (=2 cpus) to about 12% (=1 cpu)... With the 11.8 drivers, a thread in some ATI dll used 12%, as well as mfakto itself. With 11.9, the only thread using 12% is mfakto. The thread in the ATI dll is still there, but sits at 0.07% cpu. Performance has stayed about the same.

Anyone else seeing this behaviour?

jeebee 2011-09-30 14:20

[QUOTE=DigiK-oz;273030]Strange, the 11.9 drivers brought down CPU usage on my i7 920 with hyperthreading from almost 25% (=2 cpus) to about 12% (=1 cpu)... With the 11.8 drivers, a thread in some ATI dll used 12%, as well as mfakto itself. With 11.9, the only thread using 12% is mfakto. The thread in the ATI dll is still there, but sits at 0.07% cpu. Performance has stayed about the same.

Anyone else seeing this behaviour?[/QUOTE]


I have a similar experience. 11.9 seems like a major improvement upon 11.8. On a 2500k & HD6780, the old drivers needed 2 cores to output roughly 140mb/s. The newer driver delivers about 130mb/s with only one core running. I've thus decided to devote the third core to p95.

Bdot 2011-10-05 21:07

mfakto 0.09 - Windows versions
 
1 Attachment(s)
[QUOTE=jeebee;273056]I have a similar experience. 11.9 seems like a major improvement upon 11.8. On a 2500k & HD6780, the old drivers needed 2 cores to output roughly 140mb/s. The newer driver delivers about 130mb/s with only one core running. I've thus decided to devote the third core to p95.[/QUOTE]

Yes, the upgrade to 11.9 is certainly recommended, way lower CPU usage.

I think it is time to release the fix to the previously reported bug. I played around with a few ideas to fix the bug without affecting performance, but I did not have time to do it right. Therefore, the fixed 72-bit kernel of version 0.09 will be 3-5% slower than 0.08. The barrett kernels are not affected. I'm working on getting the same speed as before, but that will take some more time.

So here is version 0.09, first Windows ...

Bdot 2011-10-05 21:10

mfakto 0.09 - Linux version
 
1 Attachment(s)
Linux 64-bit

Bdot 2011-10-05 21:12

mfakto 0.09 - sources
 
1 Attachment(s)
... and the source code

KyleAskine 2011-10-27 00:05

Hi!

I am new here, and I might have missed a point of discussion earlier in the thread. If that is the case, I am sorry.

Anyway, I have two GPUs in my current PC (6950s flashed as 6970s), but it looks like mfakto only uses one of them. Is this a known issue, or could there be a problem with my setup?

Also, I have 11.9, but to get one GPU to 90%, it took me two cores.

Thanks for your help!

Bdot 2011-10-27 20:07

[QUOTE=KyleAskine;275878]
Anyway, I have two GPUs in my current PC (6950s flashed as 6970s), but it looks like mfakto only uses one of them.
[/QUOTE]
Did you already play around with the -d <dev-num> switch? This is supposed to let you decide which device an mfakto instance will use.

If that does not work, please send me the clinfo output (e.g. in C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe).

[QUOTE=KyleAskine;275878]
Also, I have 11.9, but to get one GPU to 90%, it took me two cores.
[/QUOTE]

That is expected with higher-end cards. One mfakto process will always use only one device, so you may need 4 instances in total to get both GPUs to 90%.

KyleAskine 2011-10-28 12:01

[QUOTE=Bdot;276017]Did you already play around with the -d <dev-num> switch? This is supposed to let you decide which device an mfakto instance will use.

If that does not work, please send me the clinfo output (e.g. in C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe).



That is expected with higher-end cards. One mfakto process will always use only one device, so you may need 4 instances in total to get both GPUs to 90%.[/QUOTE]

Thanks for your answers! I will play around with it today!

Ethan (EO) 2011-11-07 19:19

Upgrading to the 11.10 driver on x64 Windows broke the mfakto 0.09 executable for me, because the kernel compiler in this driver version is hung up on calls to mad24 with mixed argument types.

Casting all of the integer constants in the mad24 calls to (uint) fixed this for me!

Edit: No it didn't -- this lets the executable run but it's failing the selftest. I've used up the time I can spend on this today unfortunately but there it is.

ReEdit: Only the 64bit build is failing the selftest.

i.e.
[CODE]
nn.d1 = mad24(mul_hi(n.d0, qi), (uint)256, tmp >> 24);
[/CODE]

Also of note, I had to change the Platform Toolset setting to Windows7.1SDK from v100 to get this to build in Visual Studio Express.

Bdot 2011-11-07 19:47

[QUOTE=Ethan (EO);277464]Upgrading to the 11.10 driver on x64 Windows broke the mfakto 0.09 executable for me, because the kernel compiler in this driver version is hung up on calls to mad24 with mixed argument types.
[/QUOTE]

Uh-oh ... every new version adds new surprises ... With that I'm afraid to upgrade my drivers ;-)

I'll see if I can do something about it ...

Dubslow 2011-11-07 21:58

On another note, Bdot: the FAQ PDF available in the FAQ threads says not to report no-factor results from mfakto. I don't know why it says that, but someone somewhere said it was because of the problem with factors < 2^48, which has been fixed. If that was the reason, please tell Brain to fix the PDF. I just hope we haven't lost too much work.

Brain 2011-11-08 06:12

PDF going to be updated... Do submit all results.

Bdot 2011-11-10 15:37

Fix for 11.10?
 
1 Attachment(s)
[QUOTE=Ethan (EO);277464]Upgrading to the 11.10 driver on x64 Windows broke the mfakto 0.09 executable for me.

[CODE]
nn.d1 = mad24(mul_hi(n.d0, qi), (uint)256, tmp >> 24);
[/CODE][/QUOTE]

I've replaced all of those constants by their uint equivalent (256 => 256u), and on my slow test box (W7-64) this seems to work, the small selftest succeeded, and so far it found all factors of the full selftest - still running.

[CODE]
nn.d1 = mad24(mul_hi(n.d0, qi), 256u, tmp >> 24);
[/CODE]I've attached the kernel file. Could you please check if this one still fails the selftest on your machine?
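
As a side note on what that line computes: OpenCL has no mixed-type `mad24`, which is why the bare `256` (an int) next to uint arguments trips the stricter compiler. The arithmetic itself can be checked outside OpenCL. The Python sketch below emulates `mul_hi` and `mad24` for unsigned values and assumes, as the kernel line suggests, that `tmp` holds the low 32 bits of the same 24x24-bit product. `mul_24_48` and the mask constants are names made up for this sketch, and masking the `mad24` inputs to 24 bits is a simplification.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul_hi(a, b):
    """Upper 32 bits of the 64-bit product of two unsigned 32-bit values."""
    return ((a * b) >> 32) & MASK32


def mad24(a, b, c):
    """a*b + c on the low 24 bits of a and b, truncated to 32 bits."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32


def mul_24_48(a, b):
    """Split the product of two 24-bit limbs into (low 24 bits, high 24 bits)."""
    tmp = (a * b) & MASK32                    # low 32 bits of the product
    res_lo = tmp & MASK24
    # mul_hi gives bits 32..47; shifting left by 8 (times 256) and adding
    # bits 24..31 from tmp yields bits 24..47 of the product.
    res_hi = mad24(mul_hi(a, b), 256, tmp >> 24)
    return res_lo, res_hi
```

So the constant 256 is just a shift by 8 bits; writing it as `256u` changes its type, not the result.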

Ethan (EO) 2011-11-10 22:41

I'm still failing about half of the selftests with that kernel file. I'm going to revert to the exact contents of your 0.09 src zip to make sure I haven't mucked anything up in the project settings.


Ethan

edit: No luck -- unzipped your src file directly, put the updated cl file in src, built Release/x64, and ran. No runtime cl compilation errors, but -- aha -- just noticed that it is passing the first test in each test case, and then failing the rest:

[CODE]
########## testcase 6/1558 ##########
tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_8"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.468s | 30.25M/s | 25000 | n.a. | 10798us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_8]
selftest for M53134687 passed (mfakto_cl_71_8)!
tf(): total time spent: 0.487s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_4"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.215s | 65.84M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_4]
ERROR: selftest failed for M53134687 (mfakto_cl_71_4)
no factor found
tf(): total time spent: 0.234s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.214s | 66.15M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
ERROR: selftest failed for M53134687 (mfakto_cl_barrett79)
no factor found
tf(): total time spent: 0.233s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett92"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.214s | 66.15M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett92]
ERROR: selftest failed for M53134687 (mfakto_cl_barrett92)
no factor found
tf(): total time spent: 0.232s
[/CODE]

And that's consistent across the testcases.

reedit: The same executable runs without error on the CPU:

[CODE]
########## testcase 6/1558 ##########
tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_8"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.68M | 5.737s | 2.56M/s | 25000 | n.a. | 362964us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_8]
selftest for M53134687 passed (mfakto_cl_71_8)!
tf(): total time spent: 5.753s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_4"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.68M | 6.460s | 2.27M/s | 25000 | n.a. | 410991us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_4]
selftest for M53134687 passed (mfakto_cl_71_4)!
tf(): total time spent: 6.479s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.68M | 4.439s | 3.31M/s | 25000 | n.a. | 276762us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
selftest for M53134687 passed (mfakto_cl_barrett79)!
tf(): total time spent: 4.459s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett92"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.68M | 5.800s | 2.53M/s | 25000 | n.a. | 366906us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett92]
selftest for M53134687 passed (mfakto_cl_barrett92)!
tf(): total time spent: 5.822s
[/CODE]

...and the 32bit build runs fine on both CPU and GPU.
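
The factor in those logs is easy to check independently of any OpenCL driver. A factor f of M_p = 2^p - 1 must satisfy 2^p mod f = 1 and has the form f = 2kp + 1. The Python sketch below checks both for the testcase above; `is_mersenne_factor` and `k_of` are names made up for this sketch, and reading "3120/4620" as "k mod 4620" is an assumption about the class numbering.

```python
def is_mersenne_factor(p, f):
    """True if f divides 2^p - 1, using modular exponentiation."""
    return pow(2, p, f) == 1


def k_of(p, f):
    """Return k such that f = 2*k*p + 1, or raise if f lacks that form."""
    k, remainder = divmod(f - 1, 2 * p)
    if remainder != 0:
        raise ValueError("f is not of the form 2*k*p + 1")
    return k


# Values taken from testcase 6/1558 in the log above.
p = 53134687
f = 337073926433410950601
k = k_of(p, f)

print("divides M_p:", is_mersenne_factor(p, f))
print("k =", k, "class =", k % 4620)
print("k within logged range:", 2999999998380 <= k <= 3300000000000)
```

A kernel that reports "no factor" for this range is therefore wrong no matter how fast it ran.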

Bdot 2011-11-11 08:48

[QUOTE=Ethan (EO);277845]I'm still failing about half of the selftests with that kernel file.
...

edit: No luck -- unzipped your src file directly, put the updated cl file in src, built Release/x64, and ran. No runtime cl compilation errors, but -- aha -- just noticed that it is passing the first test in each test case, and then failing the rest:

[CODE]
########## testcase 6/1558 ##########
tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_8"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.468s | 30.25M/s | 25000 | n.a. | 10798us
Result[00]: M53134687 has a factor: 337073926433410950601
found 1 factor(s) for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_8]
selftest for M53134687 passed (mfakto_cl_71_8)!
tf(): total time spent: 0.487s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_71_4"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.215s | 65.84M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_71_4]
ERROR: selftest failed for M53134687 (mfakto_cl_71_4)
no factor found
tf(): total time spent: 0.234s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett79"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.214s | 66.15M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett79]
ERROR: selftest failed for M53134687 (mfakto_cl_barrett79)
no factor found
tf(): total time spent: 0.233s

tf(53134687, 68, 69, ...);
k_min = 2999999998380 - k_max = 3300000000000
Using GPU kernel "mfakto_cl_barrett92"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
3120/4620 | 14.16M | 0.214s | 66.15M/s | 25000 | n.a. | 0us
no factor for M53134687 from 2^68 to 2^69 [mfakto 0.09-Win mfakto_cl_barrett92]
ERROR: selftest failed for M53134687 (mfakto_cl_barrett92)
no factor found
tf(): total time spent: 0.232s
[/CODE]And that's consistent across the testcases.

[/QUOTE]

There should be no need to rebuild anything - just replace the kernel file next to the original 0.09 binary. On the other hand, rebuilding should not hurt.

I'll update my home PC over the weekend, maybe I can reproduce the error there. The symptom looks a bit like something not initialized in the correct place ... and now it depends on memory layout or other rather random things.

Ethan (EO) 2011-11-11 20:28

[QUOTE=Bdot;277904]There should be no need to rebuild anything - just replace the kernel file next to the original 0.09 binary.[/QUOTE]

Yeah -- I just rebuilt from your unaltered project to make sure I hadn't messed something up on the executable I had built previously :)


Ethan

Ethan (EO) 2011-11-12 00:53

I reordered the test cases to see if the failure pattern was the same, and it turns out that the order of the kernels within a testcase is irrelevant -- mfakto_cl_71_4, mfakto_cl_barrett79, and mfakto_cl_barrett92 are failing, but mfakto_cl_71_8 is working.

TheJudger 2011-11-12 01:31

Hello,

just a shot in the dark: the average wait is 0 when the known factor is not found. Does the GPU kernel run at all?

Oliver

Ethan (EO) 2011-11-12 01:47

[QUOTE=TheJudger;278000]Hello,

just a shot in the dark: the average wait is 0 when the known factor is not found. Does the GPU kernel run at all?

Oliver[/QUOTE]

Yep -- they are running. Just turned on the kernel tracing stuff in the OpenCL kernels, and I've found a difference:

32 bit build cl_71_4:
[CODE]
########## testcase 1/1558 ##########
tf(50804297, 67, 68, ...);
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "mfakto_cl_71_4"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
mfakto_cl_71: tid=0: p=3073649, *2 =6:e6c92, k=0, 0, 0, 0:17487, 17487, 17487, 1
7487:6e8773, 6ef3bb, 6f05c7, 6f3beb, f=8d029, 8d029, 8d029, 8d02a:fccff8, ff5fc2
, ffcd0e, 114f3:77c397, 53e4a7, a33f7f, 915007, shift=19, b=0, 0, 0, 0:1, 1, 1,
1:0, 0, 0, 0:0, 0, 0, 0:0, 0, 0, 0:0, 0, 0, 0
mod_144_72#1: qf=3.51844E+013, nf=6.15105E-021, *=2.16421E-007, qi=0
mod_144_72#1: q=0:1:0:0:0:0, n=8d029:fccff8:77c397, qi=0
mod_144_72#1.1: nn=0:0:0:0:0:0
mod_144_72#1.2: nn=0:0:0:0:0:0
mod_144_72#1.3: nn=0:0:0:0:0:0Error: The arguments don't match the printf format
string. printf(mod_144_72#1.3: nn=%x:%x:%x:%x:%x:%x
[/CODE]

64bit build cl_71_4:
[CODE]
########## testcase 1/1558 ##########
tf(50804297, 67, 68, ...);
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "mfakto_cl_71_4"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
mfakto_cl_71: tid=0: p=3073649, *2 =6:e6c92, k=0, 0, 0, 0:17487, 17487, 17487, 1
7487:6e8773, 6ef3bb, 6f05c7, 6f3beb, f=8d029, 8d029, 8d029, 8d02a:fccff8, ff5fc2
, ffcd0e, 114f3:77c397, 53e4a7, a33f7f, 915007, shift=19, b=0, 0, 0, 0:0, 0, 0,
0:0, 0, 0, 0:0, 0, 0, 0:0, 0, 0, 0:0, 0, 0, 0
mod_144_72#1: qf=0.000000, nf=6.15105E-021, *=0.000000, qi=0
mod_144_72#1: q=0:0:0:0:0:0, n=8d029:fccff8:77c397, qi=0
mod_144_72#1.1: nn=0:0:0:0:0:0
mod_144_72#1.2: nn=0:0:0:0:0:0
mod_144_72#1.3: nn=0:0:0:0:0:0Error: The arguments don't match the printf format
string. printf(mod_144_72#1.3: nn=%x:%x:%x:%x:%x:%x
[/CODE]

64 bit build cl_71_8:
[CODE]
########## testcase 1/1558 ##########
tf(50804297, 67, 68, ...);
k_min = 1599999998520 - k_max = 1900000000000
Using GPU kernel "mfakto_cl_71_8"
class | candidates | time | avg. rate | SievePrimes | ETA | avg. wait
mfakto_cl_71: tid=0: p=3073649, *2 =6:e6c92, k=0, 0, 0, 0, 0, 0, 0, 0:17487, 174
87, 17487, 17487, 17487, 17487, 17487, 17487:6e8773, 6ef3bb, 6f05c7, 6f3beb, 6fd
e57, 70147b, 706eb7, 71c59b, f=8d029, 8d029, 8d029, 8d02a, 8d02a, 8d02a, 8d02a,
8d02a:fccff8, ff5fc2, ffcd0e, 114f3, 4eca2, 63487, 85704, 1073ae:77c397, 53e4a7,
a33f7f, 915007, 5b819f, 499227, d6585f, ba1667, shift=19, b=0, 0, 0, 0, 0, 0, 0
, 0:1, 1, 1, 1, 1, 1, 1, 1:0, 0, 0, 0, 0, 0, 0, 0:0, 0, 0, 0, 0, 0, 0, 0:0, 0, 0
, 0, 0, 0, 0, 0:0, 0, 0, 0, 0, 0, 0, 0
mod_144_72#1: qf=3.51844E+013, nf=6.15105E-021, *=2.16421E-007, qi=0
mod_144_72#1: q=0:1:0:0:0:0, n=8d029:fccff8:77c397, qi=0
mod_144_72#1.1: nn=0:0:0:0:0:0
mod_144_72#1.2: nn=0:0:0:0:0:0
mod_144_72#1.3: nn=0:0:0:0:0:0Error: The arguments don't match the printf format
string. printf(mod_144_72#1.3: nn=%x:%x:%x:%x:%x:%x
[/CODE]

I haven't worked back yet to see if b is correct in the caller in the 64bit/v4 case... and haven't finished changing the kernels' printf format strings to vector types as you can see :)

TheJudger 2011-11-12 16:04

qf = 0.00000 doesn't look good.
[LIST]
[*]precomputation failed
[*]floating-point conversion failed (unlikely?)
[*]data transfer doesn't work / isn't finished
[*]something else
[/LIST]

Bdot 2011-11-12 19:40

[QUOTE=TheJudger;278049]qf = 0.00000 doesn't look good.
[LIST]
[*]precomputation failed
[*]floating-point conversion failed (unlikely?)
[*]data transfer doesn't work / isn't finished
[*]something else
[/LIST][/QUOTE]

I did not yet change all the trace statements to work for vectors. The kernel trace is only accurate when tracing non-vectored kernels. That's also the reason for the "arguments don't match" message.

Looks like some work to do ...

Edit: And the average wait can be zero for mfakto because the necessary wait time for the last block of a class is not included in the calculation (one of the differences from the earlier mfaktc versions, to work better on small classes).

Bdot 2011-11-12 23:14

The kernels do not receive the input parameter that holds the pre-processing information, but get a zero there.

With the kernel tracing fixed and set to at least level 3, the mfakto_cl_71_4 kernel will receive the correct parameters and find the factors. So far I did not get the barrett kernels to receive all input parameters.

My guess is that the optimizer removed them as it did not deem them important. But trying to build the kernel non-optimized crashes the kernel compiler.

In the light of this it is probably not helping that the barrett kernels are ~4% faster with 11.10. Probably because crucial parts have been optimized away.

I guess we just need to skip the Catalyst 11.10 version :-(

bcp19 2011-11-13 13:32

I just recently got myself an HD 6770 and picked up mfakto .09, and when I try to run the 64 bit Windows exe I get multiple errors about too many instances of mad24 and then a message saying there were 27 errors and the program shut down. (Paraphrasing, as I am not sitting at the machine atm.) If I run the 32 bit exe, everything appears to run normally. I have 11.9 drivers installed as I read on here about the problems with 11.10.

Bdot 2011-11-13 21:24

[QUOTE=bcp19;278121]I just recently got myself an HD 6770 and picked up mfakto .09, and when I try to run the 64 bit Windows exe I get multiple errors about too many instances of mad24 and then a message saying there were 27 errors and the program shut down. (Paraphrasing, as I am not sitting at the machine atm.) If I run the 32 bit exe, everything appears to run normally. I have 11.9 drivers installed as I read on here about the problems with 11.10.[/QUOTE]

[code]
Select device - Get device info - Compiling kernels.
BUILD OUTPUT
C:\Users\root\AppData\Local\Temp\OCLCEF5.tmp.cl(2192): error: more than one
instance of overloaded function "mad24" matches the argument list:
function "mad24(int, int, int) C++"
function "mad24(uint, uint, uint) C++"
argument types are: (uint, int, uint)
*res_hi = mad24(mul_hi(a,b), 256, (*res_lo >> 24));
^
...

C:\Users\root\AppData\Local\Temp\OCLCEF5.tmp.cl(2726): error: more than one
instance of overloaded function "mad24" matches the argument list:
function "mad24(int, int, int) C++"
function "mad24(uint, uint, uint) C++"
argument types are: (uint, int, uint)
nn.d2 = mad24(mul_hi(n.d1, qi), 256, tmp >> 24);
^

27 errors detected in the compilation of "C:\Users\root\AppData\Local\Temp\OCLCEF5.tmp.cl".

Internal error: clc compiler invocation failed.

END OF BUILD OUTPUT
Error -11: clBuildProgram
init_CL(5, 0) failed
[/code]This is exactly the error that on my machine started appearing with the installation of Catalyst 11.10. These compilation errors are easy to solve, but mfakto will still fail the selftest as there are other bugs in the compiled kernel.

I tried uninstalling 11.10 and went back as far as 11.6 - the errors remain. It's not the first time that the ATI drivers have failed to uninstall themselves correctly. Maybe they do, but some hardware switch remained in a bad position. Anyway, the sad result is: once in that state, I could not get out. (I cannot try reinstalling the machine.)

I'll see if I can build an "11.10-workaround-version" for trapped folks like me. There will certainly be a performance-penalty. Where it still works, it's probably faster to run the 32-bit version for now - on my machine the 32-bit version fails as well.

Strange, strange, strange. Maybe there's still a bug in the main program that just has these side effects.

bcp19 2011-11-13 21:27

Hmmm, does that mean the exponents I've been doing on the 32 bit client are suspect?

Bdot 2011-11-13 23:25

[QUOTE=bcp19;278180]Hmmm, does that mean the exponents I've been doing on the 32 bit client are suspect?[/QUOTE]

No. If the selftest succeeds, then the code works well and the results can be trusted. To be sure, you can run the extended selftest (-st).


Fighting the problem, I found out that reinstalling Windows helps - in a new Windows installation, with Catalyst 11.9, mfakto resumed normal operation. So the problem is either caused by some bad registry entries, or files, or persistent hardware state that are not corrected when uninstalling 11.10 and installing 11.9 again ... I'm trying to compare the registry, but the new Windows installation has corrupted my bootloader.

Dubslow 2011-11-14 01:50

[QUOTE=Bdot;278189]No. If the selftest succeeds, then the code works well and the results can be trusted. To be sure, you can run the extended selftest (-st).


Fighting the problem, I found out that reinstalling Windows helps - in a new Windows installation, with Catalyst 11.9, mfakto resumed normal operation. So the problem is either caused by some bad registry entries, or files, or persistent hardware state that are not corrected when uninstalling 11.10 and installing 11.9 again ... I'm trying to compare the registry, but the new Windows installation has corrupted my bootloader.[/QUOTE]
Reminds me of [URL="http://xkcd.com/349/"]this[/URL] :)

Ethan (EO) 2011-11-14 22:05

[QUOTE=Bdot;278069]I did not yet change all the trace statements to work for vectors. The kernel trace is only accurate when tracing non-vectored kernels. That's also the reason for the "arguments don't match" message.
[/QUOTE]

I changed the trace format strings to v4 and v8 for the first few outputs from each kernel for the output I posted above... the "arguments don't match" message marks the end of the changes I made.

Bdot 2011-11-14 23:08

[QUOTE=Ethan (EO);278351]I changed the trace format strings to v4 and v8 for the first few outputs from each kernel for the output I posted above... the "arguments don't match" message marks the end of the changes I made.[/QUOTE]

I see ... I now have a version that allows tracing all the way through, but that does not help. The trace shows that the input b value is zero for all components. Not having a one anywhere can never find a factor.

I suspect the new compiler does not handle a struct of 6 uints passed by value. I'll see that I can change that. If that does not work either, then I'll just send the bit-position of the 1 and each kernel thread needs to calculate b on its own.
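
A rough sketch of that fallback idea, in Python rather than OpenCL: instead of passing the whole six-word value, pass only the position of its single set bit and rebuild the words on the receiving side. It assumes six 24-bit limbs (as the `mod_144_72` name suggests) and that the trace prints the highest limb first; `b_from_bit`, `limbs_to_int` and `format_limbs` are hypothetical names, not mfakto functions.

```python
LIMB_BITS = 24
NUM_LIMBS = 6


def b_from_bit(bit_pos):
    """Rebuild b = 2**bit_pos as six 24-bit limbs, least significant first."""
    if not 0 <= bit_pos < LIMB_BITS * NUM_LIMBS:
        raise ValueError("bit position does not fit into 144 bits")
    limbs = [0] * NUM_LIMBS
    limbs[bit_pos // LIMB_BITS] = 1 << (bit_pos % LIMB_BITS)
    return limbs


def limbs_to_int(limbs):
    """Combine the limbs back into one Python integer (for checking only)."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def format_limbs(limbs):
    """Print limbs in hex, highest limb first (assumed trace layout)."""
    return ":".join(format(limb, "x") for limb in reversed(limbs))


# Bit 96 is only an illustrative value; it happens to give the same
# 0:1:0:0:0:0 pattern that the working traces show for q.
print(format_limbs(b_from_bit(96)))
```

The cost is a shift and a few stores per thread, which should be negligible next to the modular squarings that follow.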

This Catalyst version does not leave a good impression. AMD says APP SDK 2.6 will come out soon, with a newer compiler. Let's see if that already fixes this. I cannot get rid of 11.10, so one machine can now throw all cores at P-1 and LL testing ... and the GPU temp is 35 degrees lower than usual.

bcp19 2011-11-15 18:32

I've been noticing a weird thing with mfakto. I have an i5-2400 with an HD 6770 running 2 instances of the 32 bit mfakto with P95 running a P-1 and an LL. If I have P95 selected as the 'active' window, both instances of mfakto show 40-44M/s. If I have one of the mfakto windows 'active', the active runs at 55M/s while the other runs at 46M/s. I cannot test the 64 bit version thanks to 11.10, but I see no similar behavior on my other 2 GPU machines which are running the 64 bit mfaktc. Any thoughts?

LaurV 2011-11-15 18:39

[QUOTE=bcp19;278524] Any thoughts?[/QUOTE]
Could be from Windows? Priorities? In Win7, right-click on Computer, then Properties, Advanced, Performance, and check how the priorities are balanced between "background task" or "service" and "program in front". You can set Windows to automatically (dynamically) give more priority to tasks according to their z-level, so windows in front get more processor power.

bcp19 2011-11-15 19:13

[QUOTE=LaurV;278530]Could be from Windows? Priorities? In Win7, right-click on Computer, then Properties, Advanced, Performance, and check how the priorities are balanced between "background task" or "service" and "program in front". You can set Windows to automatically (dynamically) give more priority to tasks according to their z-level, so windows in front get more processor power.[/QUOTE]

It has 2 selections, Programs and Background Services. If I change the setting both instances run slower regardless of task in front so I left it as it was.

Chalk up another reason to dislike Win7.

Bdot 2011-11-16 10:04

[QUOTE=bcp19;278524]I've been noticing a weird thing with mfakto. I have an i5-2400 with an HD 6770 running 2 instances of the 32 bit mfakto with P95 running a P-1 and an LL. If I have P95 selected as the 'active' window, both instances of mfakto show 40-44M/s. If I have one of the mfakto windows 'active', the active runs at 55M/s while the other runs at 46M/s. I cannot test the 64 bit version thanks to 11.10, but I see no similar behavior on my other 2 GPU machines which are running the 64 bit mfaktc. Any thoughts?[/QUOTE]

I've seen this behavior too, also with 64-bit-mfakto on Win7.

mfakto (in fact, OpenCL) uses a background thread to handle the communication with the GPU. So whenever the main thread says "Go!" and then waits for the results, some background thread will do some magic to drive the GPU, collect the execution status and trigger the main thread when the kernel has finished. I did not check yet, but have the feeling that this background thread runs at lower-than-normal priority. This way, the prime95-threads (running at lowest priority) can interfere with the mfakto threads. And then the fact that LaurV posted can help mfakto, if it is the foreground application. Collecting the kernel results requires two thread switches (from p95 to the background thread, and then to the main thread). Priorities can play a big role, but certainly other things as well, e.g. CPU cache invalidation, as p95 and mfakto compete for memory access.

What throughput do the two instances have when no P95 runs? Probably ~60M/s each?

jeebee 2011-11-16 16:35

People with problems with 11.10 might as well try out the newest revision 11.11. I'm still on 11.9 and don't plan on switching until confirmation the software renews its compatibility...

Bdot 2011-11-16 19:27

[QUOTE=jeebee;278698]People with problems with 11.10 might as well try out the newest revision 11.11. I'm still on 11.9 and don't plan on switching until confirmation the software renews its compatibility...[/QUOTE]

Uninstalling 11.10, removing system32\amdocl64.dll, system32\amdoclcl64.dll, syswow64\amdocl.dll and syswow64\amdoclcl.dll, and then installing 11.9 did the trick; now mfakto runs again, also in 64-bit!

And now that I know that these are the critical files that are not removed during the driver uninstallation, I can as well try the latest version ;-)

Edit: I tried, and 11.11 has the same issues as 11.10. So 11.9 stays the last usable version (for mfakto).

bcp19 2011-11-16 23:52

[QUOTE=Bdot;278639]I've seen this behavior too, also with 64-bit-mfakto on Win7.

mfakto (in fact, OpenCL) uses a background thread to handle the communication with the GPU. So whenever the main thread says "Go!" and then waits for the results, some background thread will do some magic to drive the GPU, collect the execution status and trigger the main thread when the kernel has finished. I did not check yet, but have the feeling that this background thread runs at lower-than-normal priority. This way, the prime95-threads (running at lowest priority) can interfere with the mfakto threads. And then the fact that LaurV posted can help mfakto, if it is the foreground application. Collecting the kernel results requires two thread switches (from p95 to the background thread, and then to the main thread). Priorities can play a big role, but certainly other things as well, e.g. CPU cache invalidation, as p95 and mfakto compete for memory access.

What throughput do the two instances have when no P95 runs? Probably ~60M/s each?[/QUOTE]

There is no change in the throughput when I shut down P95. There are only 3 'states' of the mfakto window... if 'on top' and selected it runs 55-56M/s, if 'on top' and not selected it runs at 46M/s. If another window is active over it (like IE, P95 maximized, Notepad, etc) both 'background' instances run at 40-42M/s.

nucleon 2011-11-21 10:33

If I wanted to max the amount of GHz-days/day from an ATI/AMD card with mfakto, what should I be getting? And how many GHz-days/day could I hope to achieve?

Just hypothetical questions at this stage.

The best I can do so far is about 300GHz-days/day with a single GTX580.

-- Craig

KyleAskine 2011-11-21 12:00

[QUOTE=nucleon;279366]If I wanted to max the amount of GHz-days/day from an ATI/AMD card with mfakto, what should I be getting? And how many GHz-days/day could I hope to achieve.

Just hypothetical questions at this stage.

The best I can do so far is about 300GHz-days/day with a single GTX580.

-- Craig[/QUOTE]

I am not sure, but I have a 5870 and two 6950's (flashed as 6970's). With only one instance of mfakto for each card and sieving 5000 primes I get numbers right around 150 on the output (or maybe 150M... I don't remember) for the 5870. Of course I don't know what the column means, other than bigger is better, which is why I can't remember what exactly it said.

Another metric is that I factor one number every half hour from 70 to 71 on the 5870.

I am positive I can do better with more primes being sieved and more instances. I can look a bit more into it when I get home today and let you know more exactly. I am interested how AMDs match up with nVidia's myself.

bcp19 2011-11-21 12:41

[QUOTE=nucleon;279366]If I wanted to max the amount of GHz-days/day from an ATI/AMD card with mfakto, what should I be getting? And how many GHz-days/day could I hope to achieve.

Just hypothetical questions at this stage.

The best I can do so far is about 300GHz-days/day with a single GTX580.

-- Craig[/QUOTE]

With an HD 6770 I can get 100 M/s with 2 mfaktos running on an i5 2400, which is similar to the 120 M/s of my GTS 450, but it is kind of a low end card. Can do roughly 3 48M 69-72 per mfakto ~= 90-105GHz/day. With a 560 Ti running 1 mfaktc I can get 170 M/s on the 2400. Since 2 only gets me up to 200 M/s I let P95 have the core.

Wizzard 2011-11-21 14:52

Hello. Is Radeon HD 3400 supported too? If so, where can I download the latest mfakto? Thank you :)

edit: well, I found version 0.08, and I see "GPU not found", so I assume, it is not supported.

KyleAskine 2011-11-21 15:41

[QUOTE=Wizzard;279383]Hello. Is Radeon HD 3400 supported too? If so, where can I download the latest mfakto? Thank you :)

edit: well, I found version 0.08, and I see "GPU not found", so I assume, it is not supported.[/QUOTE]

OpenCL is supported in 4xxx series and newer.

nucleon 2011-11-22 22:26

[QUOTE=KyleAskine;279371]I am not sure, but I have a 5870 and two 6950's (flashed as 6970's). With only one instance of mfakto for each card and sieving 5000 primes I get numbers right around 150 on the output (or maybe 150M... I don't remember) for the 5870. Of course I don't know what the column means, other than bigger is better, which is why I can't remember what exactly it said.

Another metric is that I factor one number every half hour from 70 to 71 on the 5870.

I am positive I can do better with more primes being sieved and more instances. I can look a bit more into it when I get home today and let you know more exactly. I am interested how AMDs match up with nVidia's myself.[/QUOTE]

On one machine, I have 2 instances running on a GTX580. According to GPU-Z, its usage hovers around 95-97%, so it is practically maxed out. SievePrimes=5000, and the CPU is at a constant 100%. I don't have any more CPU cycles to throw at it at this stage.

Some timing data:

[CODE]Instance0:
20111123-033143 no factor for M45006901 from 2^69 to 2^72 [mfaktc 0.16-Win barre
20111123-064935 no factor for M45034081 from 2^69 to 2^72 [mfaktc 0.16-Win barre

Instance1:
20111123-044206 no factor for M46251449 from 2^68 to 2^72 [mfaktc 0.16-Win barre
20111123-074124 no factor for M46629067 from 2^68 to 2^72 [mfaktc 0.16-Win barre[/CODE]

The first column is time completed. i.e. YYYYMMDD-hhmmss format.

To get the timing data, I run this command in the background:

[CODE]tail -n 0 -F results.txt | xargs -I XX -n 1 bash -c "echo \`date +%Y%m%d-%H%M%S\` \"XX\"" >> results.log &[/CODE]

The timings are accurate to within +/-1 sec if I understand tail correctly. Yes, it's a hack. But it's a good start.
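
For anyone without tail and xargs at hand (for example on Windows), roughly the same timestamping can be sketched in Python. This is a minimal sketch and not part of mfaktc or mfakto: the function names `stamp` and `follow` and the one-second polling interval are made up for illustration, and unlike `tail -F` it does not reopen the file if it gets replaced.

```python
import time
from datetime import datetime


def stamp(line, when):
    """Prefix one result line with a YYYYMMDD-hhmmss time stamp."""
    return f"{when:%Y%m%d-%H%M%S} {line.rstrip()}"


def follow(path, poll_seconds=1.0):
    """Yield stamped lines as they are appended to the file at `path`.

    Starts at the current end of the file (like `tail -n 0`) and polls
    for new data; the stamp is taken when a complete line is seen, so it
    is only as accurate as the polling interval.
    """
    with open(path, "r") as handle:
        handle.seek(0, 2)  # jump to the end, ignore existing results
        pending = ""
        while True:
            chunk = handle.readline()
            if not chunk:
                time.sleep(poll_seconds)
                continue
            pending += chunk
            if pending.endswith("\n"):  # only emit complete lines
                yield stamp(pending, datetime.now())
                pending = ""
```

Something like `for entry in follow("results.txt"): print(entry, flush=True)`, with the output redirected to results.log, would then stand in for the one-liner. The loop runs until it is interrupted.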

By my guess, 70-71 takes about 45mins and I'll have about 2 results in this time.

-- Craig

KyleAskine 2011-11-23 12:10

[QUOTE=nucleon;279534]
To get the timing data, I run this command in the background:

[CODE]tail -n 0 -F results.txt | xargs -I XX -n 1 bash -c "echo \`date +%Y%m%d-%H%M%S\` \"XX\"" >> results.log &[/CODE]

The timings are accurate to within +/-1 sec if I understand tail correctly. Yes, it's a hack. But it's a good start.

By my guess, 70-71 takes about 45mins and I'll have about 2 results in this time.

-- Craig[/QUOTE]

Alright, I will throw this on my linux box tonight!

KyleAskine 2011-11-24 16:01

[QUOTE=KyleAskine;279577]Alright, I will throw this on my linux box tonight![/QUOTE]

So I embarrassed myself. When I said I did one factor per half hour from 70 to 71, I meant one factor per half hour from 69 to 70. Only off by a factor of two!! This is with an HD5870. I have two 6970s that are around the same speed.
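
The factor of two comes from the range doubling with every bit level: there are twice as many candidates between 2^70 and 2^71 as between 2^69 and 2^70. A rough sketch of that scaling, assuming the candidate rate stays the same at every bit level (which need not hold if a different kernel is used for larger factors); the function names are hypothetical:

```python
def level_minutes(known_minutes, known_bit, target_bit):
    """Scale the time measured for 2^known_bit..2^(known_bit+1)
    to the single bit level starting at 2^target_bit."""
    return known_minutes * 2.0 ** (target_bit - known_bit)


def range_minutes(known_minutes, known_bit, bit_from, bit_to):
    """Estimate the total time for 2^bit_from..2^bit_to by summing levels."""
    return sum(
        level_minutes(known_minutes, known_bit, bit)
        for bit in range(bit_from, bit_to)
    )


# Half an hour for 69 to 70 implies about an hour for 70 to 71 ...
print(level_minutes(30, 69, 70))
# ... and 30 + 60 + 120 minutes for the whole 69 to 72 range.
print(range_minutes(30, 69, 69, 72))
```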

Anyway, this is with only one instance of mfakto only sieving 5000 primes on an old AMD Phenom system. I can get a little bit more with two systems, but not enough to make it worthwhile in my opinion.

[CODE]20111123-163748 no factor for M50771309 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-171212 no factor for M50781161 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-174634 no factor for M50781917 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-182057 no factor for M50783597 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-185520 no factor for M50789623 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-192943 no factor for M50801119 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-200406 no factor for M50803657 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-203829 no factor for M50804087 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-211252 no factor for M50806543 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-214716 no factor for M50807389 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-222140 no factor for M50807563 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-225603 no factor for M50807587 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-233027 no factor for M50812409 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111124-000449 no factor for M50823419 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79][/CODE]
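
The average time per test can be read off such a log mechanically. A minimal sketch, assuming every line starts with the YYYYMMDD-hhmmss stamp used earlier in the thread; `mean_interval_seconds` is a hypothetical helper name, and the sample is just the first four lines of the log above.

```python
from datetime import datetime

SAMPLE = """\
20111123-163748 no factor for M50771309 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-171212 no factor for M50781161 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-174634 no factor for M50781917 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
20111123-182057 no factor for M50783597 from 2^69 to 2^70 [mfakto 0.09 mfakto_cl_barrett79]
"""


def mean_interval_seconds(lines):
    """Average gap between consecutive results, in seconds."""
    stamps = [
        datetime.strptime(line.split()[0], "%Y%m%d-%H%M%S")
        for line in lines
        if line.strip()
    ]
    if len(stamps) < 2:
        raise ValueError("need at least two stamped lines")
    # Elapsed time from first to last result, divided by the number of gaps.
    return (stamps[-1] - stamps[0]).total_seconds() / (len(stamps) - 1)


print(mean_interval_seconds(SAMPLE.splitlines()))
```

For the four sample lines this gives 2063.0 seconds, i.e. a little over 34 minutes per 2^69 to 2^70 test.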

So it looks like this approx. $200 video card is around half as fast as a comparable nVidia???

bcp19 2011-11-24 16:25

[QUOTE=KyleAskine;279716]So I embarrassed myself. When I said I did one factor per half hour from 70 to 71, I meant one factor per half hour from 69 to 70. Only off by a factor of two!! This is with an HD5870. I have two 6970s that are around the same speed.

Anyway, this is with only one instance of mfakto only sieving 5000 primes on an old AMD Phenom system. I can get a little bit more with two systems, but not enough to make it worthwhile in my opinion.

So it looks like this approx. $200 video card is around half as fast as a comparable nVidia???[/QUOTE]

Not sure what is considered a comparable nVidia card, but you are a tad slower than my 560 Ti. I see 30-32 minutes on it from 2^69 to 2^70. Looking at [URL]http://www.hwcompare.com/8915/geforce-gtx-560-ti-vs-radeon-hd-5870/[/URL], your card has a higher memory bandwidth than mine. Like you, I can get a little better with 2 instances running (~20%), but it's not really worth it.

Edit: My bad, that was my 560... the Ti does it in ~24 min.

flashjh 2011-11-24 19:41

Works!
 
[QUOTE=Bdot;278750]Deinstalling 11.10, removing system32\amdocl64.dll, system32\amdoclcl64.dll, syswow64\amdocl.dll and syswow64\amdoclcl.dll, and then installing 11.9 did the trick, now mfakto runs again, also in 64-bits!

And now that I know that these are the critical files that are not removed during the driver deinstallation, I can as well try the latest version ;-)

Edit: I tried, and 11.11 has the same issues as 11.10. So 11.9 stays the last usable version (for mfakto).[/QUOTE]

This works, thanks! Also, I can confirm that 11.11 does not work.

Bdot 2011-11-24 22:37

[QUOTE=KyleAskine;279716]So I embarrassed myself. When I said I did one factor per half hour from 70 to 71, I meant one factor per half hour from 69 to 70. Only off by a factor of two!! This is with an HD5870. I have two 6970s that are around the same speed.

Anyway, this is with only one instance of mfakto only sieving 5000 primes on an old AMD Phenom system. I can get a little bit more with two systems, but not enough to make it worthwhile in my opinion.
[/QUOTE]

The OpenCL version of mfaktc is slower than its original for various reasons:
[LIST=1]
[*]It is a rather plain port, not initially designed for OpenCL. I made some changes to "make it work", but only a few performance optimizations (yet).
[*]OpenCL does not (easily) allow direct access to the hardware's capabilities. For instance, no mul24_hi is available in OpenCL, even though the GPU has that instruction.
[*]ATI GPUs do not have hardware carry. Even if direct access to the whole instruction set were available, arithmetic with more than 32 bits requires additional instructions and/or registers to maintain carry/borrow.
[*]The kernel that would most likely get the optimal performance out of the AMD chips is not yet included: a Barrett kernel based on 24-bit instructions.
[*]OpenCL's multi-threaded approach to driving the GPU has disadvantages in heavily loaded systems. mfakto will slow down when prime95 runs on the same box, even though mfakto runs at higher priority.
[/LIST]

Given all that, I think it would not be too bad if same-price NV cards delivered only double what AMD cards do. However, the 35 min per test that you see is what I get on my HD5770 card (and 2 Phenom cores @3.4GHz, SievePrimes between 130k and 180k). An HD5870 should be 50 to 100% faster, so I guess that the limit is not the GPU in your case. An HD6970 should add another ~10% speed ... did you test switching to the mul24 kernel, which should suit the HD6xxx better?
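
Point 3 in the list above (no hardware carry) can be illustrated with a small emulation. This is only a sketch in Python and not mfakto's kernel code: it adds two 96-bit numbers held as three 32-bit words and recovers each carry with a comparison, which is the kind of extra instruction the missing carry flag costs. The names `add96`, `to_words` and `from_words` are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def to_words(value):
    """Split a 96-bit integer into (low, mid, high) 32-bit words."""
    return (value & MASK32, (value >> 32) & MASK32, (value >> 64) & MASK32)


def from_words(words):
    """Reassemble (low, mid, high) 32-bit words into one integer."""
    low, mid, high = words
    return low | (mid << 32) | (high << 64)


def add96(a, b):
    """Add two 96-bit values word by word, wrapping modulo 2^96.

    Without a carry flag, an overflow is detected by checking whether
    the wrapped sum is smaller than one of its operands.
    """
    low = (a[0] + b[0]) & MASK32
    carry_low = 1 if low < a[0] else 0

    mid_partial = (a[1] + b[1]) & MASK32
    carry_mid = 1 if mid_partial < a[1] else 0
    mid = (mid_partial + carry_low) & MASK32
    # At most one of the two mid-word additions can overflow.
    carry_mid |= 1 if mid < mid_partial else 0

    high = (a[2] + b[2] + carry_mid) & MASK32
    return (low, mid, high)
```

Each word beyond the first needs one or two extra compare/select steps here, where hardware with a carry flag would chain a single add-with-carry instruction.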



I have a lot of ideas what I could test/enhance/implement ... however, the current driver issues are not exactly motivating. And time is always limited ...

KyleAskine 2011-11-25 12:21

[QUOTE=Bdot;279761]
Given all that I think it would not be too bad if same-price NV cards delivered only double of AMDs. However, the 35 min per your test is what I get on my HD5770 card (and 2 Phenom-cores @3.4GHz, SievePrimes between 130k and 180k). HD5870 should be 50 to 100% faster, so I guess that the limit is not the GPU in your case. HD6970 should add another ~10% speed ... did you test switching to the mul24 kernel which should suit the HD6xxx better?

[/QUOTE]

That is confusing to me. Unless you get a 100% speed boost from sieving the extra primes, I agree, my card should be significantly faster than yours. So you are saying you have two instances of mfakto running at around 75M/s each, because that would indeed match what I am able to do with one instance (150M/s).

I tried running multiple instances once on my linux box, but the second instance actually locked up. Is there some trick that other people do to get a second instance up and running on linux? I could have done something wrong. However, I don't think it looked significantly faster so I didn't worry about it. Plus this way I could have two cores devoted to mprime, instead of 0.

However, the 6970s are a touch slower than the 5870, perhaps because they actually have fewer shaders than the 5870 (which is the gold standard for shader count, if those are what really matter with the OpenCL implementation).

KyleAskine 2011-11-25 12:51

I have a general mfakto/c question. Is avg. wait the video card waiting for the processor, or the processor waiting for the video card?

Bdot 2011-11-25 12:56

[QUOTE=KyleAskine;279801]That is confusing to me. Unless you get a 100% speed boost from sieving the extra primes, I agree, my card should be significantly faster than yours. So you are saying you have two instances of mfakto running at around 75M/s each, because that would indeed match what I am able to do with one instance (150M/s).

I tried running multiple instances once on my linux box, but the second instance actually locked up. Is there some trick that other people do to get a second instance up and running on linux? I could have done something wrong. However, I don't think it looked significantly faster so I didn't worry about it. Plus this way I could have two cores devoted to mprime, instead of 0.

However, the 6970s are a touch slower than the 5870, perhaps because they actually have fewer shaders than the 5870 (which is the gold standard for shader count, if those are what really matter with the OpenCL implementation).[/QUOTE]

Actually, in order to get both the CPU and the GPU to (almost) full load, I have to run 3 mfakto instances and 3 prime95 threads on my quad-core Phenom. This way, the 3 mfakto instances add up to almost 2 CPU cores (with peaks to ~220%). Two of the prime95 threads advance at normal speed, the third is just taking what's left over (~5-10% CPU, i.e. rather crawling along). I don't pin mfakto to any core, I let Windows7 choose.

Each mfakto instance is running at ~40M/s, which due to the higher SievePrimes turns out to be as much as 1x 150M/s at SievePrimes 5k, when looking at the test throughput.
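
That comparison can be sanity-checked with a rough model. The sketch below assumes that the M/s figure counts candidates that survive the sieve and reach the GPU, that a SievePrimes value of N means sieving with the first N primes, and that each sieve prime r removes a fraction 1/r of the remaining candidates. All of these are assumptions for illustration, as is picking 150000 as a representative value from the 130k to 180k range; the function names are hypothetical.

```python
import math


def first_primes(count):
    """Return the first `count` primes using a growing Eratosthenes sieve."""
    limit = 64
    while True:
        flags = bytearray([1]) * (limit + 1)
        flags[0] = flags[1] = 0
        for i in range(2, math.isqrt(limit) + 1):
            if flags[i]:
                # Clear all multiples of i starting at i*i.
                flags[i * i :: i] = bytes(len(range(i * i, limit + 1, i)))
        primes = [n for n, is_prime in enumerate(flags) if is_prime]
        if len(primes) >= count:
            return primes[:count]
        limit *= 2  # not enough primes yet, retry with a larger sieve


def candidates_left(low_setting, high_setting):
    """Fraction of candidates that survive the extra sieve primes when
    going from `low_setting` to `high_setting` sieve primes."""
    primes = first_primes(high_setting)
    fraction = 1.0
    for r in primes[low_setting:high_setting]:
        fraction *= 1.0 - 1.0 / r
    return fraction


def equivalent_rate(rate_at_high, low_setting, high_setting):
    """Rate at `low_setting` that finishes a test in the same time as
    `rate_at_high` does at `high_setting`."""
    return rate_at_high / candidates_left(low_setting, high_setting)


print(candidates_left(5000, 150000))
print(equivalent_rate(3 * 40, 5000, 150000))
```

Under those assumptions roughly three quarters of the candidates remain after the extra sieving, so 3 x 40 M/s at the high setting corresponds to somewhat more than 150 M/s at SievePrimes 5k, which is in the same ballpark as the single-instance figure.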

The lock-up on Linux happened to me quite often, usually with a line "[fglrx] ASIC hang happened" and a stack dump in /var/log/messages. Around the time I upgraded to 11.9, I gave the card a little more air (moved it to the other PCIe slot and raised the fan setting). Since then I have had no more such lock-ups. Not sure which of the actions helped.

I think I'll add a raw performance measurement mode to mfakto, detailing the pure kernel runtime per kernel. This way it would be easier to compare the cards, also to NV/mfaktc. Until then, use tools like GPU-Z, or "aticonfig --od-getclocks | grep load" to find out how much room the GPU still has. I unfortunately have access to only 2 different ATI cards, one of them bound to 11.11 :-(

[QUOTE=KyleAskine]
I have a general mfakto/c question. Is avg. wait the video card waiting for the processor, or the processor waiting for the video card?
[/quote]
The latter.

bcp19 2011-11-25 14:25

I'm guessing the cpu 'quality' is a fair factor here? I'm running 2x mfaktc on a Core 2 Quad Q8200/GTS 450 with SievePrimes set to 15000 and getting ~60M/s each with a bit over 2kus wait times. This setup keeps the GPU at 99% load. When I don't lock SievePrimes, the M/s drops (can't remember to what, seems 40 or so) but the overall time to complete the same assignment increased and if I remember correctly, GPU load was 60-70%. My core 2 Duo took both cores to almost max out my 560, but when I upgraded to the 2500k one core outdoes what the Duo did. Is this a fair assumption?

flashjh 2011-11-25 16:13

I'm running a QX9650 with a Gigabyte EP45-UD3P (OC to 400 FSB, 3.4GHz) & 8GB RAM.

I have two Sapphire 5870s in CrossFire. I can run two instances (I use -d 11 and -d 12). I get ~120M/s each while TF in the 50M range from 70 to 71. If I run one instance, I only get ~130M/s, so two is definitely better. With two instances, my CPU runs at 85% across all four cores. When I start Prime95 with one worker doing an LL test, my rates drop to ~105M/s with the CPU @ 100%. The system is still usable in all circumstances, but I have to shut down mfakto to use the GPU.

flashjh 2011-11-25 19:46

[QUOTE=bcp19;279817]I'm guessing the cpu 'quality' is a fair factor here? I'm running 2x mfaktc on a Core 2 Quad Q8200/GTS 450 with SievePrimes set to 15000 and getting ~60M/s each with a bit over 2kus wait times. This setup keeps the GPU at 99% load. When I don't lock SievePrimes, the M/s drops (can't remember to what, seems 40 or so) but the overall time to complete the same assignment increased and if I remember correctly, GPU load was 60-70%. My core 2 Duo took both cores to almost max out my 560, but when I upgraded to the 2500k one core outdoes what the Duo did. Is this a fair assumption?[/QUOTE]

From what I can tell you need two or more instances to max out the GPUs. If you have a slower CPU, you might max it out before you max the GPU. Like in my case, my GPUs sit at 60% each with two instances, but my CPU doesn't have much more to throw at it, sitting at 85%. My wait times are quite low, mostly 0, but up to 200µs. But, when I change sieve to anything other than 5000, my M/s drops way down. Could be the difference between ATI & nVidia?

KyleAskine 2011-11-25 22:17

Alright, I just started a second instance of mfakto on my linux (5870) box. I now have two instances running at around 30000 primes sieved and 100 M/s.


All times are UTC.
