mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Bdot 2011-08-29 22:07

Thanks a lot for your reports, good to hear some people would actually use it :smile:

I did some detailed testing on the CPU demands of mfakto vs. mprime/prime95. (Keep SievePrimes fixed for this kind of test.)

My HD 5770 can reach about 120M/s total with 2 instances running and no other big consumer. Then, mfakto's CPU load is about 320%. Yes, only two instances, single-threaded, will occupy a little more than 3 cores. A single instance will reach about 105M/s, at 195% CPU.

Starting mprime (mostly LL tests) will drastically decrease mfakto's CPU load. Throughput also drops, but to a lesser degree: with ~100% CPU a single instance still reaches 75M/s. My conclusion is that the OpenCL runtime does quite a lot of busy-waiting behind the user-level events ...

And what timings change inside mfakto when mprime starts? Well, the pure kernel runtime is absolutely unchanged (as expected). The siever slows down by 15-20% (370 -> 433 ms for 20 blocks of 1.25M), even though mfakto runs at normal priority while mprime is the "nicest" of all. But what may count far more is the time required to copy the blocks to the GPU. While the copy rate is normally above 3 GB/s, it starts fluctuating a lot when mprime runs, averaging 1.55 GB/s. The worst case was 14.7 ms to copy a block and 9.4 ms to process it on the GPU. Unlike the longer sieving times, the longer transfer times will not be hidden by parallelism: OpenCL does not yet support copying data to the GPU while a kernel is still running, so mfakto copies and processes blocks in strict alternation.
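
To put numbers on the cost of strict alternation, here is a back-of-the-envelope sketch in Python (not mfakto code). It assumes that the block copied to the GPU holds the same 1.25M candidates as a sieve block, which is not stated above; the function name and the "overlapped" case are purely hypothetical.

```python
def block_throughput(block_size, copy_ms, kernel_ms, overlapped=False):
    """Candidates per second when processing one block after another.

    Strictly alternating: every block costs copy time plus kernel time.
    Overlapped (hypothetical): copy and kernel run concurrently, so the
    slower of the two stages determines the rate.
    """
    per_block_ms = max(copy_ms, kernel_ms) if overlapped else copy_ms + kernel_ms
    return block_size / (per_block_ms / 1000.0)


BLOCK = 1_250_000  # assumed number of candidates per copied block

# Worst case reported above: 14.7 ms to copy, 9.4 ms to process.
worst_alternating = block_throughput(BLOCK, 14.7, 9.4)
worst_overlapped = block_throughput(BLOCK, 14.7, 9.4, overlapped=True)

print(f"alternating: {worst_alternating / 1e6:.1f} M/s")
print(f"overlapped:  {worst_overlapped / 1e6:.1f} M/s")
```

Under that assumption the worst-case block runs at roughly 52M/s when alternating, versus roughly 85M/s if the transfer could be hidden behind the kernel.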

Conclusion? Both mprime and mfakto put quite some stress on the memory bus (and mfakto is not yet optimized to be cache-friendly). When a CPU waits for data from memory, this is counted as "CPU busy" for the application, even though the CPU spends many of those cycles just waiting.

I'll see if I can make mfakto a bit "cache-friendlier", but the ultimate solution to this problem will be when the siever runs on the GPU.

Regarding the maximum throughput of your cards:

Chaichontat's HD6850 (at this speed rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3 instances running on at least 4 CPU-cores.

MrHappy's HD5670 should have its max at ~40M/s. Maybe 2 instances are needed here too in order to keep the GPU at 99%.

@Christenson: keep dreaming, one day it will come true ...

Christenson 2011-08-30 03:23

I do keep dreaming...the question is whether I will be the implementer, or someone else....

Bdot 2011-08-30 12:22

GPU sieving for Trial Factoring
 
[QUOTE=Christenson;270374]I do keep dreaming...the question is whether I will be the implementer, or someone else....[/QUOTE]
I'm still at the stage of collecting ideas on how to distribute the work onto multiple threads.

Easiest would be to give each thread a different exponent to work on. This would eliminate the need for threads to communicate with each other, each could work in the fast private storage ... However, you'd need at least 64 exponents to work on, for high-end GPUs up to 1024. The factoring progress of each would be about 2-4M/s, leading to huge runtime even for medium bitlevels.
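
A rough feel for that runtime, as a Python sketch (not mfakto code). It only uses the fact that factors of M[SUB]p[/SUB] have the form 2kp+1. The example exponent, the bit level and the `survival` parameter (fraction of candidates left after sieving) are illustrative assumptions, and the function names are made up.

```python
def candidate_count(p, bits):
    """Number of k with 2^bits <= 2*k*p + 1 < 2^(bits+1), before any sieving."""
    lo, hi = 1 << bits, 1 << (bits + 1)
    k_min = -(-(lo - 1) // (2 * p))  # ceil((lo - 1) / (2p))
    k_max = (hi - 2) // (2 * p)      # largest k with 2kp + 1 <= hi - 1
    return max(0, k_max - k_min + 1)


def days_per_bitlevel(p, bits, rate_per_s, survival=1.0):
    """Days one thread needs for one bit level at the given candidate rate.

    survival=1.0 means "no sieving at all"; any smaller value is a guess.
    """
    return candidate_count(p, bits) * survival / rate_per_s / 86400


p_example = 50_000_000  # hypothetical exponent size, not a real assignment
raw = candidate_count(p_example, 68)
print(f"raw candidates from 2^68 to 2^69: {raw:.3e}")
print(f"days at 3M/s without sieving: {days_per_bitlevel(p_example, 68, 3e6):.1f}")
```

The point is only the order of magnitude: a single bit level in this range is trillions of raw candidates, which is a lot for one thread running at 2-4M/s even after sieving removes most of them.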

Each thread could also process a fixed block of sieve-input. This would require sieve-initialization for each block, as you cannot build upon the state of the previous block. Therefore each block needs to be large enough to make the initialization less prominent. An extra step (i.e. an extra kernel) would be needed to combine the output of all the threads into the sieve-output. Only after that step do we know if we have enough factor candidates (FCs) to fill a block for the GPU factoring.
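
A minimal sketch of the fixed-input-block idea in Python (3.8+ for the modular inverse via `pow`), run sequentially here instead of in GPU threads. It sieves the k values of candidates 2kp+1 and assumes all sieve primes are smaller than 2p+1, so no candidate is itself a sieve prime. All names are hypothetical; this is not how mfakto's siever is written.

```python
def sieve_block(p, k_start, k_len, small_primes):
    """Sieve one block of k values without any state from other blocks."""
    alive = [True] * k_len
    for s in small_primes:
        if (2 * p) % s == 0:
            continue  # 2kp + 1 is never divisible by s
        # Per-block initialization: residue k0 with 2*k0*p + 1 == 0 (mod s),
        # then the first k >= k_start in that residue class.
        k0 = (-pow(2 * p, -1, s)) % s
        first = k_start + (k0 - k_start) % s
        for k in range(first, k_start + k_len, s):
            alive[k - k_start] = False
    return [k_start + i for i, ok in enumerate(alive) if ok]


def sieve_in_blocks(p, k_begin, k_end, block_len, small_primes):
    """Combine step: concatenate the survivors of all independent blocks."""
    survivors = []
    for k_start in range(k_begin, k_end, block_len):
        k_len = min(block_len, k_end - k_start)
        survivors.extend(sieve_block(p, k_start, k_len, small_primes))
    return survivors


# Tiny example: exponent 11, k in [1, 50), blocks of 8 k values each.
print(sieve_in_blocks(11, 1, 50, 8, [3, 5, 7, 13]))
```

The per-block cost is one modular inverse and one offset computation per sieve prime, which is why small blocks make the initialization dominate.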

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require good estimates of where each block will start. Usually you don't know where a certain block starts until the previous block has finished sieving. It can be estimated, but to be safe, there needs to be a certain overlap, some checks and maybe re-runs of the sieving if gaps are detected.
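
One way such an estimate could look, as a Python sketch (3.8+ for `math.prod`). Each sieve prime s that does not divide 2p removes exactly one residue class of k modulo s, so in the long run the surviving fraction is the product of (1 - 1/s). The 5% overlap and all names are assumptions for illustration; duplicates in the overlap regions would still have to be removed afterwards.

```python
from math import prod


def survival_density(p, small_primes):
    """Long-run fraction of k values that survive sieving."""
    return prod(1 - 1 / s for s in small_primes if (2 * p) % s != 0)


def estimated_block_ranges(p, k_begin, n_blocks, fcs_per_block,
                           small_primes, overlap=0.05):
    """Estimated [start, end) k range per output block, widened by a margin."""
    span = fcs_per_block / survival_density(p, small_primes)
    margin = int(span * overlap) + 1
    ranges = []
    for j in range(n_blocks):
        start = k_begin + int(j * span)
        end = k_begin + int((j + 1) * span)
        ranges.append((max(k_begin, start - margin), end + margin))
    return ranges


for r in estimated_block_ranges(11, 0, 4, 1000, [3, 5, 7, 13]):
    print(r)
```

The density is only exact over a full period of the sieve primes, so short blocks can deviate from it; that is where the overlap and the gap checks come in.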

We could split the primes that are used to sieve a block among the threads. Disadvantages include different run-lengths for the loops, lots of (slow) global memory operations and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-size blocks in local memory, which is considerably faster, and combining them into global memory later.
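
A sequential Python sketch (3.8+) of the split-the-primes variant. Each simulated worker marks a private copy of the block for its own subset of primes, and the copies are ANDed together at the end, which sidesteps synchronization at the price of extra memory. The round-robin split to even out the loop lengths is my own assumption, as are all names; sieve primes are again assumed to be smaller than 2p+1.

```python
def mark_block(p, k_start, k_len, primes_subset):
    """One worker: flags for a block, using only this worker's primes."""
    alive = [True] * k_len
    for s in primes_subset:
        if (2 * p) % s == 0:
            continue  # 2kp + 1 is never divisible by s
        k0 = (-pow(2 * p, -1, s)) % s          # 2*k0*p + 1 == 0 (mod s)
        first = k_start + (k0 - k_start) % s   # first such k in the block
        for k in range(first, k_start + k_len, s):
            alive[k - k_start] = False
    return alive


def sieve_split_primes(p, k_start, k_len, small_primes, n_workers):
    """Split primes round-robin over workers, then AND the partial results."""
    subsets = [small_primes[w::n_workers] for w in range(n_workers)]
    partial = [mark_block(p, k_start, k_len, sub) for sub in subsets]
    combined = [all(flags) for flags in zip(*partial)]
    return [k_start + i for i, ok in enumerate(combined) if ok]


print(sieve_split_primes(11, 1, 49, [3, 5, 7, 13], n_workers=2))
```

Small primes hit many more positions than large ones, so handing out contiguous ranges of primes would give the workers very uneven run-lengths; interleaving is one simple way to reduce that.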

Maybe the best would be to split the task (factor M[SUB]exp[/SUB] from 2[SUP]n[/SUP] to 2[SUP]m[/SUP]) into <workgroup> equally-sized blocks and run sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup numbers seem to be 32 to 256, depending on the GPU.

More suggestions, votes, comments?

MrHappy 2011-08-30 18:27

With Prime95 stopped mfakto reaches ~50M/s on the HD5670.

AldoA 2011-08-30 18:52

Hi everyone. I wanted to help this project with my ATI Radeon HD 4650.
Then I downloaded OpenCL and mfakto. I installed OpenCL, but when I started mfakto it said "Impossible to start the application, MSVCR100.dll was not found, a new installation of the program could solve the problem".
Can anyone tell me what I have to do or install? Thanks

Bdot 2011-08-30 21:09

Hi AldoA,

this is the Microsoft Visual C++ runtime, which you can download from MS (the links below are for the German version, but you can change the language there):
[URL="http://www.microsoft.com/downloads/details.aspx?FamilyID=a7b7a05e-6de6-4d3a-a423-37bf0912db84&displayLang=de"]Microsoft Visual C++ 2010 Redistributable Package (x86)[/URL]
[URL="http://www.microsoft.com/downloads/details.aspx?familyid=BD512D9E-43C8-4655-81BF-9350143D5867&displaylang=de"]Microsoft Visual C++ 2010 Redistributable Package (x64)[/URL]

I'll add it to the list of dependencies in the README.

Bdot 2011-08-30 21:18

[QUOTE=MrHappy;270416]With Prime95 stopped mfakto reaches ~50M/s on the HD5670.[/QUOTE]

Did you try a single instance only, or also two separate invocations (two different exponents)? That would certainly add something on the totals line.

Bdot 2011-08-30 21:48

[QUOTE=apsen;270091]Apart from that AMD_APP refused to install on Win2008 [/QUOTE]

I posted that on the [URL="http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=154048&enterthread=y"]AMD Forum[/URL] and they want to know what exactly the error is. Could you please try again and tell me?

AldoA 2011-08-31 10:03

[QUOTE=Bdot;270422]Hi AldoA,

this is the Microsoft Visual C++ runtime, to download from MS (the below links are for the German version, but there you can also change the language):
[URL="http://www.microsoft.com/downloads/details.aspx?FamilyID=a7b7a05e-6de6-4d3a-a423-37bf0912db84&displayLang=de"]Microsoft Visual C++ 2010 Redistributable Package (x86)[/URL]
[URL="http://www.microsoft.com/downloads/details.aspx?familyid=BD512D9E-43C8-4655-81BF-9350143D5867&displaylang=de"]Microsoft Visual C++ 2010 Redistributable Package (x64)[/URL]

I'll add it to the list of dependencies in the README.[/QUOTE]

Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device-GPU not found-fallback to CPU". What should I do? Anyway, I ran the selftest and it passed. What else can I do? (Sorry for the questions, but I'm not really into computing.)

apsen 2011-08-31 13:25

[QUOTE=Bdot;270427]I posted that on the [URL="http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=154048&enterthread=y"]AMD Forum[/URL] and they want to know what exactly the error is. Could you please try again and tell me?[/QUOTE]

Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open the log file. The log file says "Error messages" and it looks like some details should follow, but there are none.

Bdot 2011-08-31 19:08

[QUOTE=AldoA;270456]Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device-GPU not found-fallback to CPU". What to do? Anyway I made the selftest and it passed it. What other can I do? (Sorry for the questions but I'm not really into computing).[/QUOTE]

Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work, not sure about 11.6, but the driver definitely should not be older.

If the driver is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run a 32-bit OS). The output should contain one section for your GPU and one for the CPU.


@apsen: Thanks for the details, I forwarded it - looks like W2k8 should work as well ...

