mersenneforum.org  

Old 2011-08-29, 22:07   #89
Bdot
 
Nov 2010
Germany

3×199 Posts
Default

Thanks a lot for your reports; it's good to hear some people would actually use it.

I did some detailed testing on the CPU demands of mfakto vs. mprime/prime95. (Fix SievePrimes for this test)

My HD 5770 can reach about 120M/s total with 2 instances running and no other big consumer. Then, mfakto's CPU load is about 320%. Yes, only two instances, single-threaded, will occupy a little more than 3 cores. A single instance will reach about 105M/s, at 195% CPU.

Starting mprime (mostly LL tests) will drastically decrease mfakto's CPU load. Throughput also drops, but to a lesser degree: with ~100% CPU, a single instance still reaches 75M/s. My conclusion is that the OpenCL runtime has quite a few busy-waits behind the user-level events ...

And what timings change inside mfakto when mprime starts? The pure kernel runtime is unchanged, as expected. The siever slows down by 15-20% (370 -> 433 ms for 20 blocks of 1.25M), even though mfakto runs at normal priority while mprime is the "nicest" of all. But what may count far more is the time required to copy the blocks to the GPU. While the transfer rate is normally above 3 GB/s, it starts fluctuating a lot when mprime runs, averaging 1.55 GB/s. The worst case was 14.7 ms to copy a block and 9.4 ms to process it on the GPU. Unlike the longer sieving times, the longer transfer times will not be hidden by parallelism: OpenCL does not yet support copying data to the GPU while another kernel is still running, so mfakto copies and processes blocks in strict alternation.
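To put the worst-case numbers in perspective, here is a minimal Python sketch of the arithmetic. It only uses the two timings quoted above (14.7 ms copy, 9.4 ms kernel); the function name and the "perfectly overlapped" comparison are my own assumptions for illustration, not something mfakto or the OpenCL runtime provides.

```python
def alternating_vs_overlapped(copy_ms, kernel_ms):
    """Compare strict copy/compute alternation with an idealized overlap.

    Alternating: each block costs copy_ms + kernel_ms.
    Overlapped (hypothetical): each block costs max(copy_ms, kernel_ms),
    i.e. the copy of the next block is fully hidden behind the kernel
    (or vice versa).
    """
    serial_ms = copy_ms + kernel_ms
    overlapped_ms = max(copy_ms, kernel_ms)
    return {
        # Share of wall time during which the GPU waits for data.
        "gpu_idle_fraction": copy_ms / serial_ms,
        # Upper bound on the gain if copies could be overlapped.
        "speedup_if_overlapped": serial_ms / overlapped_ms,
    }


worst = alternating_vs_overlapped(copy_ms=14.7, kernel_ms=9.4)
print(f"GPU idle: {worst['gpu_idle_fraction']:.2f}")           # GPU idle: 0.61
print(f"Possible speedup: {worst['speedup_if_overlapped']:.2f}")  # Possible speedup: 1.64
```

So in that worst case the GPU would sit idle for roughly 60% of the time per block, which is why the transfer rate matters more than the slower siever.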

Conclusion? Both mprime and mfakto put quite some stress on the memory bus (and mfakto is not yet optimized to be cache-friendly). When a CPU waits for data from memory, this is counted as "CPU busy" towards the application, even though the CPU is stalled for many cycles.

I'll see if I can make mfakto a bit "cache-friendlier", but the ultimate solution to this problem will be when the siever runs on the GPU.

Regarding the maximum throughput of your cards:

Chaichontat's HD6850 (at this speed rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3 instances running on at least 4 CPU-cores.

MrHappy's HD5670 should have its max at ~40M/s. Maybe 2 instances are needed here too in order to keep the GPU at 99%.

@Christenson: keep dreaming, one day it will come true ...
Old 2011-08-30, 03:23   #90
Christenson
 
Dec 2010
Monticello

5×359 Posts
Default

I do keep dreaming...the question is whether I will be the implementer, or someone else....
Old 2011-08-30, 12:22   #91
Bdot
 
Nov 2010
Germany

3·199 Posts
Default GPU sieving for Trial Factoring

Quote:
Originally Posted by Christenson
I do keep dreaming...the question is whether I will be the implementer, or someone else....
I'm still at the stage of collecting ideas on how to distribute the work across multiple threads.

Easiest would be to give each thread a different exponent to work on. This would eliminate the need for threads to communicate with each other, each could work in the fast private storage ... However, you'd need at least 64 exponents to work on, for high-end GPUs up to 1024. The factoring progress of each would be about 2-4M/s, leading to huge runtime even for medium bitlevels.

Each thread could also process a fixed block of sieve-input. This would require sieve-initialization for each block, as you cannot build upon the state of the previous block. Therefore each block needs to be large enough to make the initialization less prominent. An extra step (i.e. an extra kernel) would be needed to combine the output of all the threads into the sieve-output. Only after that step would we know whether we have enough FCs (factor candidates) to fill a block for the GPU factoring.
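The fixed-block idea can be sketched on the CPU in a few lines of Python. This is only an illustration under my own assumptions: candidates are written as q = 2·k·exponent + 1 (the standard form of Mersenne factors), each block covers a range of k, and the function names, the exponent value and the prime limit are made up for the example. Real sieves also drop candidates by their residue mod 8, which is left out here.

```python
def small_primes(limit):
    """Odd primes below `limit` (plain sieve of Eratosthenes)."""
    flags = bytearray([1]) * limit
    flags[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i in range(3, limit) if flags[i]]


def sieve_block(exponent, k_start, k_count, primes):
    """Sieve k in [k_start, k_start + k_count) without any state from
    other blocks. Returns the k whose q = 2*k*exponent + 1 survives.
    Assumes exponent is larger than every sieve prime, so q != s."""
    alive = bytearray([1]) * k_count
    step = 2 * exponent
    for s in primes:
        if step % s == 0:
            continue  # s divides 2*exponent, so it never divides q
        # Per-block initialization: the k residue with q == 0 (mod s) ...
        k_hit = (-pow(step, -1, s)) % s
        # ... turned into the first offset inside this block.
        first = (k_hit - k_start) % s
        for i in range(first, k_count, s):
            alive[i] = 0
    return [k_start + i for i in range(k_count) if alive[i]]


def sieve_range(exponent, k_begin, k_end, block_size, primes):
    """The 'extra step': concatenate the per-block outputs in order."""
    out = []
    for start in range(k_begin, k_end, block_size):
        count = min(block_size, k_end - start)
        out.extend(sieve_block(exponent, start, count, primes))
    return out


primes = small_primes(100)
exponent = 66362159  # arbitrary illustrative value, not from this thread
survivors = sieve_range(exponent, 1, 5000, 512, primes)
print(len(survivors), "of 4999 candidates survive")
```

The modular inverse in `sieve_block` is the per-block initialization cost mentioned above: it is paid once per prime per block, so small blocks spend a larger share of their time on it. `pow(x, -1, m)` needs Python 3.8 or later.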

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require good estimates of where each block will start. Usually you don't know where a certain block starts until the previous block has finished sieving. It can be estimated, but to be safe there needs to be a certain overlap, some checks, and maybe re-runs of the sieving if gaps are detected.

We could split the primes that are used to sieve a block. Disadvantages include different run-lengths for the loops, lots of (slow) global memory operations and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-size blocks and local memory that is considerably faster, and combining that later into global memory.
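A rough CPU-side sketch of the split-by-primes variant, again in Python and again with assumptions of my own: each "worker" gets its own private bitmap (standing in for local memory), the primes are dealt out round-robin to even out the loop lengths a little, and a final pass ANDs the bitmaps together (standing in for the write to global memory). The names and the example exponent are hypothetical.

```python
def strike(exponent, k_start, k_count, primes):
    """One worker: mark, in a private bitmap, every k in the block whose
    q = 2*k*exponent + 1 is divisible by one of this worker's primes.
    Assumes exponent is larger than every sieve prime."""
    bitmap = bytearray([1]) * k_count
    step = 2 * exponent
    for s in primes:
        if step % s == 0:
            continue  # s can never divide q
        first = (-pow(step, -1, s) - k_start) % s
        bitmap[first::s] = bytes(len(range(first, k_count, s)))
    return bitmap


def sieve_split_primes(exponent, k_start, k_count, primes, workers):
    """Deal the primes out round-robin, sieve privately, then combine."""
    partial = [
        strike(exponent, k_start, k_count, primes[w::workers])
        for w in range(workers)
    ]
    # Combine step: a candidate survives only if every worker kept it.
    return [
        k_start + i
        for i in range(k_count)
        if all(bitmap[i] for bitmap in partial)
    ]


sieve_primes = [3, 5, 7, 11, 13, 17, 19, 23]
exponent = 66362159  # arbitrary illustrative value, not from this thread
one = sieve_split_primes(exponent, 1, 2000, sieve_primes, workers=1)
four = sieve_split_primes(exponent, 1, 2000, sieve_primes, workers=4)
print(one == four)  # True
```

In this form no synchronization is needed while sieving, at the price of one bitmap per worker and an extra combine pass; whether that trade pays off on a real GPU is exactly the open question above.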

Maybe the best would be to split the task (factor M_exp from 2^n to 2^m) into <workgroup> equally-sized blocks and run sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup numbers seem to be 32 to 256, depending on the GPU.

More suggestions, votes, comments?

Last fiddled with by Bdot on 2011-08-30 at 12:27 Reason: )
Old 2011-08-30, 18:27   #92
MrHappy
 
Dec 2003
Paisley Park & Neverland

5×37 Posts
Default

With Prime95 stopped mfakto reaches ~50M/s on the HD5670.
Old 2011-08-30, 18:52   #93
AldoA
 
Aug 2011

22 Posts
Default

Hi everyone. I wanted to help this project with my ATI Radeon HD 4650.
So I downloaded OpenCL and mfakto. I installed OpenCL, but when I started mfakto it said "Impossible to start the application, MSVCR100.dll wasn't found; a new installation of the program could solve the problem".
Can anyone tell me what I have to do or install? Thanks
Old 2011-08-30, 21:09   #94
Bdot
 
Nov 2010
Germany

3·199 Posts
Default

Hi AldoA,

this is the Microsoft Visual C++ runtime, which you can download from Microsoft (the links below are for the German version, but you can change the language there):
Microsoft Visual C++ 2010 Redistributable Package (x86)
Microsoft Visual C++ 2010 Redistributable Package (x64)

I'll add it to the list of dependencies in the README.
Old 2011-08-30, 21:18   #95
Bdot
 
Nov 2010
Germany

3·199 Posts
Default

Quote:
Originally Posted by MrHappy
With Prime95 stopped mfakto reaches ~50M/s on the HD5670.
Did you try a single instance only, or also two separate invocations (two different exponents)? That would certainly add something on the totals line.
Old 2011-08-30, 21:48   #96
Bdot
 
Nov 2010
Germany

3×199 Posts
Default

Quote:
Originally Posted by apsen
Apart from that AMD_APP refused to install on Win2008
I posted that on the AMD Forum and they want to know what exactly the error is. Could you please try again and tell me?
Old 2011-08-31, 10:03   #97
AldoA
 
Aug 2011

22 Posts
Default

Quote:
Originally Posted by Bdot
Hi AldoA,

this is the Microsoft Visual C++ runtime, which you can download from Microsoft (the links below are for the German version, but you can change the language there):
Microsoft Visual C++ 2010 Redistributable Package (x86)
Microsoft Visual C++ 2010 Redistributable Package (x64)

I'll add it to the list of dependencies in the README.
Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device-GPU not found-fallback to CPU". What should I do? Anyway, I ran the selftest and it passed. What else can I do? (Sorry for the questions, but I'm not really into computing.)
Old 2011-08-31, 13:25   #98
apsen
 
Jun 2011

131 Posts
Default

Quote:
Originally Posted by Bdot
I posted that on the AMD Forum and they want to know what exactly the error is. Could you please try again and tell me?
Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open the log file. The log file says "Error messages" and it looks like some details should follow, but there are none.
Attached Files
File Type: zip Reports.zip (2.3 KB, 145 views)
Old 2011-08-31, 19:08   #99
Bdot
 
Nov 2010
Germany

255₁₆ Posts
Default

Quote:
Originally Posted by AldoA
Thanks. Now I can open mfakto, but I think it's using the CPU because it says "select device-GPU not found-fallback to CPU". What should I do? Anyway, I ran the selftest and it passed. What else can I do? (Sorry for the questions, but I'm not really into computing.)
Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work, not sure about 11.6, but they definitely should not be older.

If that is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run a 32-bit OS). This should contain one section for your GPU and one for the CPU.


@apsen: Thanks for the details, I forwarded it - looks like W2k8 should work as well ...

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.