#100
Aug 2011
416 Posts
Quote:
Anyway, this should be the output of clinfo.exe
#101
Dec 2003
Paisley Park & Neverland
5×37 Posts
With two instances running, each instance gets just half the throughput (25M/s), so nothing is gained or lost. But with two instances the "avg. wait" time sits constantly at 6000µs, while it is ~100µs with one instance. Is that a good or bad thing?
#102
Dec 2010
Monticello
5×359 Posts
Quote:
I vote for optimizing large, equally sized blocks that sieve down to a variable number of factor candidates (FCs), much larger than the number of exponentiations that can be done in parallel. These FCs are then split into a number of blocks of the right size to do in parallel, plus one "runt" block.

From a dataflow perspective, the list of probable primes (PrPs) of the right form is fed from the CPU to the GPU in blocks for calculating (2^Exp) mod PrP and checking whether the result is unity. The quality of the sieving is adjusted so as to just keep the exponentiation-and-modulus process busy.

I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks entirely in GPU memory, with the CPU just feeding pointers to the GPU kernels.

I think the right way to think about the threading problem is that a single thread on the CPU acts like an operating system for the GPU: it allocates the GPU resources, sets the kernels in motion, wakes up when a kernel completes, figures out what the GPU should do next, and/or fires the next GPU kernel.

The optimum level of parallelism in time on the GPU is an open question: should you sieve until GPU memory is half-full of PrP FCs and then run them to completion, or should you sieve into more digestible blocks and queue them up so they run in parallel? It all depends on the cost and form of context switching on the GPU.

There's also some fairly heavy dependency on your representation for sieving; as with various modified sieves of Eratosthenes, the representation proper can be optimized for space and speed, and how that works depends tremendously on the card internals, like the relative cost of clearing bits versus setting bytes or words, and the relative cost of scanning for the remaining FCs at the end of sieving.

Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could carry the FCs from the siever card to the card doing the exponentiation.

***********

But I am still working on automating mfaktc communications with Primenet; the only real progress has been an upgrade to parse.c that you (Bdot) will probably be very interested in: I'm now supporting comments in worktodo.txt.

***********
#103
Dec 2010
Monticello
703₁₆ Posts
Quote:
Running two instances probably uses a little more of your CPU than just one (you should check this), and you have to deal with a bit more confusion. So the data says that in your case you should just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.
#104
Aug 2011
102 Posts
Quote:
My GPU did 90M/s with 2 CPU threads and 58 percent GPU usage, with "SievePrimes" set to 5000. When I started another instance, GPU usage swung up to 96 percent and 4 threads were in use; the first instance gives 65M/s at SievePrimes 10000, the second instance 70M/s at SievePrimes 8000. Combining both instances gives ~140M/s. P.S. My GPU is an HD6950, though.
#105
Dec 2010
Monticello
5×359 Posts
Quote:
#106
Nov 2010
Germany
255₁₆ Posts
Quote:
Please try again to download and install 11.8. If your antivirus still complains, try to get updated virus definitions, or temporarily disable the check. If all else fails, AMD provides a link to older versions; you could try 11.7.
#107
Nov 2010
Germany
3×199 Posts
Quote:
#108
Nov 2010
Germany
3×199 Posts
Christenson, thanks for your thoughts, it's good to have some discussion about it ...
Quote:
BTW, GPU context switching requires copying memory blocks to and from the GPU, so smaller memory blocks can be an advantage there. However, different kernels can run (sequentially) in the same context with almost no switching time. That would require good load balancing so that the kernels run for roughly equal times, which is not necessary when the kernels are serialized; but then you'd need to copy the blocks around again.

Last fiddled with by Bdot on 2011-09-01 at 14:10 Reason: title
#109
Nov 2010
Germany
3·199 Posts
Quote:
Did you have an outdated version on the W2008 box (or no Catalyst at all)?
#110
Jun 2011
131₁₀ Posts
I actually tried to install the whole bunch. Every other component got installed, but not that one. After that I tried to install just the APP, but it failed too.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| gpuOwL: an OpenCL program for Mersenne primality testing | preda | GpuOwl | 2938 | 2023-06-30 14:04 |
| mfaktc: a CUDA program for Mersenne prefactoring | TheJudger | GPU Computing | 3628 | 2023-04-17 22:08 |
| LL with OpenCL | msft | GPU Computing | 433 | 2019-06-23 21:11 |
| OpenCL for FPGAs | TObject | GPU Computing | 2 | 2013-10-12 21:09 |
| Program to TF Mersenne numbers with more than 1 sextillion digits? | Stargate38 | Factoring | 24 | 2011-11-03 00:34 |