mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2011-08-31, 19:44   #100
AldoA

Aug 2011

2² Posts

Quote:
Originally Posted by Bdot
Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work; I'm not sure about 11.6, but the driver should definitely not be older than that.

If that is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run a 32-bit OS). The output should contain one section for your GPU and one for the CPU.


@apsen: Thanks for the details, I forwarded them - looks like W2k8 should work as well ...
No, I tried to install 11.8, but my antivirus found a virus, so it stopped.
Anyway, this should be the output of clinfo.exe
Attached Files
File Type: txt ATI.txt (3.3 KB, 110 views)
Old 2011-08-31, 21:00   #101
MrHappy

Dec 2003
Paisley Park & Neverland

B9₁₆ Posts

With two instances running, each instance gets just half the throughput (25M/s) - so nothing is gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs, while it is at ~100µs with a single instance. Is that a good thing or a bad thing?
Old 2011-09-01, 00:17   #102
Christenson

Dec 2010
Monticello

5×359 Posts

Quote:
Originally Posted by Bdot
I'm still at the stage of collecting ideas on how to distribute the work onto multiple threads.

Easiest would be to give each thread a different exponent to work on. (Yuck!)

Each thread could also process a fixed block of sieve input. This would require sieve initialization for each block, as you cannot build upon the state of the previous block; therefore each block needs to be of a good size to make the initialization less prominent. An extra step (i.e. an extra kernel) would be needed to combine the output of all the threads into the sieve output. And only after that step do we know whether we have enough FCs to fill a block for the GPU factoring. (I'd rather use variable-size blocks and toss the ends when needed.)

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require good estimates of where each block will start: usually you don't know where a certain block starts until the previous block has finished sieving. It can be estimated, but to be safe there needs to be a certain overlap, some checks, and maybe re-runs of the sieving if gaps are detected. (Again, variable-sized blocks are better.)

We could split the primes that are used to sieve a block. Disadvantages include different run lengths for the loops, lots of (slow) global memory operations, and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-sized blocks in local memory, which is considerably faster, and combining the results into global memory later.

Maybe the best would be to split the task (factor Mexp from 2^n to 2^m) into <workgroup> equally sized blocks and run sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup sizes seem to be 32 to 256, depending on the GPU.

More suggestions, votes, comments?
Impressionistically, on mfaktc, my exponents are running a half-hour or more apiece. The really big ones from Operation Billion Digits can run for days or weeks. So I would argue that you want to assume the card is TF'ing only one exponent at a time, not TF'ing multiple exponents in parallel.

I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs are then split into a number of blocks of the right size to run in parallel, plus one "runt" block.
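
To make that concrete, here's a minimal plain-C sketch of just the splitting step (BLOCK_SIZE, fc_list and the rest are invented names for illustration; nothing here is mfakto code):

Code:
/* Split a variable-length list of factor candidates (FCs) into
 * fixed-size blocks for parallel exponentiation, plus one "runt"
 * block holding the leftovers. Purely illustrative. */
#include <stdio.h>

#define BLOCK_SIZE 4096  /* FCs per GPU launch; tune per card */

static void dispatch_blocks(const unsigned long long *fc_list, size_t n_fcs)
{
    size_t full_blocks = n_fcs / BLOCK_SIZE;
    size_t runt        = n_fcs % BLOCK_SIZE;

    for (size_t b = 0; b < full_blocks; b++) {
        /* here: hand fc_list + b * BLOCK_SIZE (BLOCK_SIZE entries) to the GPU */
        printf("block %zu: %d FCs starting at %llu\n",
               b, BLOCK_SIZE, fc_list[b * BLOCK_SIZE]);
    }
    if (runt)  /* the runt block: pad it, or run it at a smaller size */
        printf("runt block: %zu FCs\n", runt);
}

int main(void)
{
    static unsigned long long fcs[10000];  /* stand-in for real sieve output */
    for (size_t i = 0; i < 10000; i++) fcs[i] = 2 * i + 1;
    dispatch_blocks(fcs, 10000);           /* 2 full blocks + a 1808-FC runt */
    return 0;
}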

From a dataflow perspective, the list of probable primes (PrPs) of the right form is fed from the CPU to the GPU in blocks for calculating (2^Exp) mod PrP and checking whether the result is unity. The quality of the sieving is adjusted so as to just keep the exponentiation-and-modulus process busy.
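
The exponentiation-and-check step itself is just a modular powering. Here's a plain-C sketch for factors below 2^63, only to show the dataflow - the real GPU kernels use multi-word arithmetic instead:

Code:
/* Test whether f divides 2^p - 1 by computing 2^p mod f and
 * comparing against 1. Uses GCC/Clang's unsigned __int128. */
#include <stdio.h>
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

static uint64_t powmod2(uint64_t p, uint64_t m)  /* 2^p mod m */
{
    uint64_t r = 1, base = 2;
    while (p) {
        if (p & 1) r = mulmod(r, base, m);
        base = mulmod(base, base, m);
        p >>= 1;
    }
    return r;
}

int main(void)
{
    /* M11 = 2047 = 23 * 89, so both 23 and 89 should report a factor;
     * 1607 = 2*73*11+1 is a valid candidate (7 mod 8) that fails. */
    uint64_t p = 11, f[] = { 23, 89, 1607 };
    for (int i = 0; i < 3; i++)
        printf("f=%llu: %s\n", (unsigned long long)f[i],
               powmod2(p, f[i]) == 1 ? "divides 2^p-1" : "no factor");
    return 0;
}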

I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, with the CPU just feeding pointers to the GPU kernels.
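
For what it's worth, 4620 = 2^2·3·5·7·11, and my understanding is that the classes are the residues of k (with candidate factors f = 2kp+1) taken mod 4620; most classes die immediately because f would always be divisible by 3, 5, 7 or 11, or fail f ≡ ±1 (mod 8). A quick sketch of that reading (not mfaktc's actual code) - it counts 960 survivors:

Code:
/* Count which residue classes k mod 4620 can survive for a given
 * exponent p. A class is dead if f = 2kp+1 is always divisible by
 * 3, 5, 7 or 11, or if f mod 8 is not 1 or 7. My reading of the
 * scheme, not mfaktc's code. */
#include <stdio.h>

int main(void)
{
    const unsigned p = 127;  /* any p coprime to 2,3,5,7,11 gives the same count */
    unsigned alive = 0;

    for (unsigned k = 0; k < 4620; k++) {
        unsigned long long f = 2ULL * k * p + 1;
        unsigned m8 = f % 8;
        if (m8 != 1 && m8 != 7) continue;   /* q | 2^p-1 => q = +-1 (mod 8) */
        if (f % 3 == 0 || f % 5 == 0 || f % 7 == 0 || f % 11 == 0)
            continue;                       /* a tiny prime kills the class */
        alive++;
    }
    printf("classes alive: %u of 4620\n", alive);  /* prints 960 */
    return 0;
}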

I think the right way to think about the threading problem is that a single thread on the CPU acts like an operating system for the GPU: it allocates the GPU resources, sets the kernels in motion, wakes up when a kernel completes, figures out what the GPU should do next, and fires the next GPU kernel. The optimum level of parallelism in time on the GPU is an open question: should you sieve until the GPU memory is half-full of PrP FCs and then run them to completion, or should you sieve into more digestible blocks and queue them up so they run in parallel? It all depends on the cost and form of context switching on the GPU.
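
In OpenCL terms that scheduler thread could look roughly like this (a sketch only: context, queue and kernel setup are omitted, and run_one_class is a name I made up):

Code:
/* Sketch of the "CPU as GPU operating system" loop body. Assumes the
 * OpenCL context, command queue and the two kernels were created
 * earlier; only standard OpenCL 1.1 calls are used. */
#include <CL/cl.h>
#include <stddef.h>

static int run_one_class(cl_command_queue q, cl_kernel sieve_krn,
                         cl_kernel factor_krn, size_t n_candidates)
{
    size_t gws = n_candidates;  /* global work size */
    cl_event sieve_done, factor_done;
    cl_int err;

    /* fire the siever */
    err = clEnqueueNDRangeKernel(q, sieve_krn, 1, NULL, &gws, NULL,
                                 0, NULL, &sieve_done);
    if (err != CL_SUCCESS) return -1;

    /* fire the factoring kernel; the event dependency guarantees it
     * only starts once the sieve kernel has finished */
    err = clEnqueueNDRangeKernel(q, factor_krn, 1, NULL, &gws, NULL,
                                 1, &sieve_done, &factor_done);
    if (err != CL_SUCCESS) return -1;

    /* the CPU sleeps here and "wakes up when a kernel completes" */
    clWaitForEvents(1, &factor_done);
    clReleaseEvent(sieve_done);
    clReleaseEvent(factor_done);
    return 0;
}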

There's also some fairly heavy dependency on your representation for sieving; as with various modified sieves of Eratosthenes, the representation proper can be optimized for space and speed, and how that works depends tremendously on the card internals, like the relative cost of clearing bits versus setting bytes or words, and the relative cost of scanning for the remaining FCs at the end of sieving.
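
For example, the "strike out multiples" inner loop changes character completely between a bit-per-candidate and a byte-per-candidate representation (plain-C sketch, sizes invented):

Code:
/* The same sieve inner loop with two representations: one bit per
 * candidate versus one byte per candidate. Which wins depends on
 * the hardware, as discussed above. */
#include <stdint.h>
#include <string.h>

#define N 65536  /* candidates per sieve block (illustrative) */

static uint32_t bits[N / 32];
static uint8_t  bytes[N];

static void strike_bits(unsigned first, unsigned prime)
{
    /* clearing bits: read-modify-write, but 32x less memory traffic */
    for (unsigned i = first; i < N; i += prime)
        bits[i >> 5] &= ~(1u << (i & 31));
}

static void strike_bytes(unsigned first, unsigned prime)
{
    /* setting bytes: plain stores, no read-modify-write */
    for (unsigned i = first; i < N; i += prime)
        bytes[i] = 0;
}

int main(void)
{
    memset(bits, 0xff, sizeof bits);  /* all candidates start alive */
    memset(bytes, 1, sizeof bytes);
    strike_bits(0, 3);                /* strike every 3rd candidate */
    strike_bytes(0, 3);
    return 0;
}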

Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could carry the FCs from the siever card to the one doing the exponentiation.

***********
But I am still working on automating mfaktc's communication with PrimeNet; the only real progress has been an upgrade to parse.c that you (Bdot) will probably be very interested in: I'm supporting comments in worktodo.txt.
***********
Old 2011-09-01, 00:25   #103
Christenson

Dec 2010
Monticello

5·359 Posts

Quote:
Originally Posted by MrHappy
With two instances running, each instance gets just half the throughput (25M/s) - so nothing is gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs, while it is at ~100µs with a single instance. Is that a good thing or a bad thing?
This equality of throughput is a statement that the CPU is waiting for the GPU... that is, the process is GPU-bound. Have a look at what SievePrimes is doing... I'm betting that it is higher with two instances running than with one, meaning the CPU is doing a somewhat better job of sieving and working a bit harder.

Running two instances probably uses a little more of your CPU than just one (you should check this), and you have to deal with a little more confusion, so the data says that in your case you should just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.
Old 2011-09-01, 10:55   #104
Chaichontat

Aug 2011

10₂ Posts

Quote:
Originally Posted by Bdot
Chaichontat's HD6850 (at this speed, rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3, instances running on at least 4 CPU cores.
Hi,
My GPU was doing 90M/s with 2 CPU threads, 58 percent GPU usage, and a SievePrimes setting of 5000. When I started another instance, GPU usage swung up to 96 percent with 4 threads in use; the first instance gives 65M/s at a SievePrimes of 10000, the second 70M/s at 8000. Combined, the two instances give ~140M/s.

P.S. My GPU is HD6950 though.
Old 2011-09-01, 12:43   #105
Christenson

Dec 2010
Monticello

1795₁₀ Posts

Quote:
Originally Posted by Chaichontat
Hi,
My GPU was doing 90M/s with 2 CPU threads, 58 percent GPU usage, and a SievePrimes setting of 5000. When I started another instance, GPU usage swung up to 96 percent with 4 threads in use; the first instance gives 65M/s at a SievePrimes of 10000, the second 70M/s at 8000. Combined, the two instances give ~140M/s.

P.S. My GPU is HD6950 though.
This is the opposite of MrHappy's situation... with two instances, the CPU cores here are just barely keeping up with the GPU, and have to lower the quality of the candidates (SievePrimes) to do it. That is, you are definitely CPU-bound.
Old 2011-09-01, 13:28   #106
Bdot

Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by AldoA
No, I tried to install 11.8, but my antivirus found a virus, so it stopped.
Anyway, this should be the output of clinfo.exe
This teaches us that we really need a (fairly) up-to-date Catalyst driver. The clinfo output does not even mention your GPU.

Please try again to download and install 11.8. If your antivirus still complains, try to get updated virus definitions, or temporarily disable checking. If all else fails, AMD provides a link to older versions - you could try 11.7.
Old 2011-09-01, 13:31   #107
Bdot

Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by Christenson
This equality of throughput is a statement that the CPU is waiting for the GPU... that is, the process is GPU-bound. Have a look at what SievePrimes is doing... I'm betting that it is higher with two instances running than with one, meaning the CPU is doing a somewhat better job of sieving and working a bit harder.

Running two instances probably uses a little more of your CPU than just one (you should check this), and you have to deal with a little more confusion, so the data says that in your case you should just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.
Yes, that depends on the SievePrimes value. I guess when running 2 instances it will go to 200k for both; for a single instance it is probably lower. If you can spare the CPU power, running 2 instances will still be somewhat of an advantage, because at the same total speed of 50M/s fewer candidates need to be tested, so the classes finish faster.
Old 2011-09-01, 14:09   #108
Bdot

Nov 2010
Germany

255₁₆ Posts

GPU sieving

Christenson, thanks for your thoughts, it's good to have some discussion about it ...

Quote:
Originally Posted by Christenson
I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs are then split into a number of blocks of the right size to run in parallel, plus one "runt" block.

I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, with the CPU just feeding pointers to the GPU kernels.
I'll probably go ahead and create siever and compacter kernels that initially do not do a lot of sieving, just to see how they can work together and how the CPU can control them. The kernels need to run interleaved, as OpenCL does not (yet) support running them in parallel (except on different devices). However, there's no need to copy the blocks to the CPU and back, as the subsequent kernel can easily access them. This will be the major benefit of the GPU sieve (no pressure on the memory bus, reducing interference with prime95, for instance) - I do not expect it to be much faster than CPU sieving. Also, when leaving the blocks in GPU memory, optimizing for size does not seem to be so important.
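
Roughly the shape I have in mind for that kernel pair (OpenCL C, compiled at runtime with clBuildProgram; buffer names and layout are invented for this sketch, not the real mfakto kernels):

Code:
/* Both kernels see the same GPU buffers, so nothing is copied to the
 * CPU between them; the host merely enqueues them back to back. */

__kernel void siever(__global uint *flags, const ulong k_base,
                     const ulong mexp)
{
    uint i = get_global_id(0);
    ulong f = 2UL * (k_base + i) * mexp + 1UL;
    /* placeholder "sieve": keep only f = +-1 (mod 8); a real siever
     * would strike out multiples of many small primes here */
    flags[i] = (f % 8 == 1 || f % 8 == 7) ? 1u : 0u;
}

__kernel void compacter(__global const uint *flags,
                        __global ulong *fc_out,
                        __global uint *fc_count,
                        const ulong k_base)
{
    uint i = get_global_id(0);
    if (flags[i]) {
        uint slot = atomic_inc(fc_count);  /* simple, not the fastest */
        fc_out[slot] = k_base + i;         /* surviving k values */
    }
}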

BTW, GPU context switching requires copying memory blocks from and to the GPU, so having smaller memory blocks can be advantageous. However, different kernels can run (sequentially) in the same context with almost no switching time.

Quote:
Originally Posted by Christenson View Post
Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could be carrying the FCs from the siever card to the one that did the exponentiation.
This would require good balancing to make the kernels run equally long, which is not necessary when running the kernels serialized. And you'd need to copy the blocks around again.

Last fiddled with by Bdot on 2011-09-01 at 14:10 Reason: title
Old 2011-09-05, 21:57   #109
Bdot

Nov 2010
Germany

597₁₀ Posts

Quote:
Originally Posted by apsen
Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open the log file. The log file says "Error messages", and it looks like some details should follow, but there are none.
They say it may happen if the Catalyst driver is too old, or did not install properly.

Did you have an outdated version on the W2008 box (or no Catalyst at all)?
Old 2011-09-06, 14:11   #110
apsen

Jun 2011

131 Posts

Quote:
Originally Posted by Bdot
They say it may happen if the Catalyst driver is too old, or did not install properly.

Did you have an outdated version on the W2008 box (or no Catalyst at all)?
I actually tried to install the whole bundle. Every other component got installed except that one. After that I tried to install just the APP, but it failed too.