[QUOTE=Bdot;270478]Did you also install one of the recent Catalyst graphics drivers? 11.7 and 11.8 should work, not sure about 11.6, but they definitely should not be older.
If that is up-to-date, then please post the output of clinfo (e.g. C:\Program Files (x86)\AMD APP\bin\x86_64\clinfo.exe, or in the x86 directory if you run a 32-bit OS). This should contain one section for your GPU and one for the CPU. @apsen: Thanks for the details, I forwarded it - looks like W2k8 should work as well ...[/QUOTE] No, I tried to install 11.8, but my antivirus found a virus, so it stopped. Anyway, the attached file should be the output of clinfo.exe.
With two instances running, every instance just gets half the throughput (25M/s) - so nothing gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs, while it is at ~100µs with one instance. Is that a good or bad thing?
[QUOTE=Bdot;270394]I'm still at the stage of collecting ideas on how to distribute the work onto multiple threads.

Easiest would be to give each thread a different exponent to work on. (Yuck!)

Each thread could also process a fixed block of sieve-input. This would require sieve-initialization for each block, as you cannot build upon the state of the previous block. Therefore each block needs to have a good size to make the initialization less prominent. An extra step (i.e. an extra kernel) would be needed to combine the output of all the threads into the sieve-output. And only after that step do we know if we have enough FCs to fill a block for the GPU factoring. (I'd rather do variable-size blocks and toss the ends when needed.)

Similarly, we could let each thread prepare a whole block of sieve-output factor candidates. This would require good estimates about where each block will start. Usually you don't know where a certain block starts until the previous block is finished sieving. It can be estimated, but to be safe there needs to be a certain overlap, some checks, and maybe re-runs of the sieving if gaps were detected. (Again, variable-sized blocks are better.)

We could split the primes that are used to sieve a block. Disadvantages include different run-lengths for the loops, lots of (slow) global memory operations, and synchronization for access to the block of FCs (not sure about that). Maybe that could be optimized by using workgroup-size blocks and local memory, which is considerably faster, and combining that later into global memory.

Maybe the best would be to split the task (factor M[SUB]exp[/SUB] from 2[SUP]n[/SUP] to 2[SUP]m[/SUP]) into <workgroup> equally-sized blocks and run sieving and factoring of those blocks in independent threads. Again, lots of initializations, plus maybe too many private resources required ... Preferred workgroup sizes seem to be 32 to 256, depending on the GPU.

More suggestions, votes, comments?[/QUOTE] Impressionistically, on mfaktc, my exponents are running a half-hour or more apiece. The really big ones from Operation Billion Digits can run for days and weeks. So I would argue that you want to assume the card is TF'ing only one exponent at a time, not TF'ing multiple exponents in parallel.

I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs then split into a number of blocks of the right size to do in parallel, plus one "runt" block.

From a dataflow perspective, the list of probable primes (PrPs) of the right form is fed from the CPU to the GPU in blocks for calculating (2^Exp) mod PrP and checking if the result is unity (a plain-C sketch of this check is at the end of this post). The quality of the sieving is adjusted so as to just keep the exponentiation-and-modulus process busy.

I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, and have the CPU just feed pointers to the GPU kernels.

I think the right way to think about the threading problem is that a single thread on the CPU acts like an operating system for the GPU: it allocates the GPU resources, sets the kernels in motion, and when a kernel completes it wakes up, figures out what the GPU should do next, and/or fires the next GPU kernel.
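To make that "CPU as OS" loop concrete, here is a minimal sketch of a driver for one class, using plain OpenCL host calls. All the kernel, buffer and size names (sieve_kernel, tf_kernel, result_buf, ...) are made up for illustration - this is not mfakto's actual interface:

[CODE]#include <CL/cl.h>

/* Drive one class: enqueue the sieve, chain the factoring kernel
 * behind it via an event, then sleep until the GPU is done.
 * (Hypothetical names; error checking omitted for brevity.) */
void drive_one_class(cl_command_queue q,
                     cl_kernel sieve_kernel, cl_kernel tf_kernel,
                     size_t sieve_gws, size_t tf_gws,
                     cl_mem result_buf, cl_uint *result_host)
{
    cl_event sieve_done, tf_done;

    /* set the sieve running; the call returns immediately */
    clEnqueueNDRangeKernel(q, sieve_kernel, 1, NULL, &sieve_gws, NULL,
                           0, NULL, &sieve_done);

    /* the factoring kernel waits on the sieve's event, so the GPU
       serializes the two without any CPU involvement */
    clEnqueueNDRangeKernel(q, tf_kernel, 1, NULL, &tf_gws, NULL,
                           1, &sieve_done, &tf_done);

    /* the CPU thread sleeps here until the GPU signals completion */
    clWaitForEvents(1, &tf_done);

    /* fetch only the (tiny) result; the FC blocks stay on the GPU */
    clEnqueueReadBuffer(q, result_buf, CL_TRUE, 0, sizeof(cl_uint),
                        result_host, 0, NULL, NULL);

    clReleaseEvent(sieve_done);
    clReleaseEvent(tf_done);
}[/CODE]

The real version would check return codes and double-buffer (sieve block n+1 while block n is being factored), but the shape is the same: the CPU only allocates, launches, and waits.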
The optimum level of parallelism in time on the GPU is an open question... should you sieve until the GPU memory is half-full of PrP FCs, then run them to completion, or should you sieve into more digestible blocks and queue them up so they run in parallel? It all depends on the cost and form of context switching on the GPU.

There's also some fairly heavy dependency on your representation for sieving; as with various modified sieves of Eratosthenes, the representation proper can be optimized for space and speed, and how that works depends tremendously on the card internals, like the relative cost of clearing bits versus setting bytes or words, and the relative cost of scanning for the remaining FCs at the end of sieving.

Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could be carrying the FCs from the siever card to the one that did the exponentiation.

***********

But I am still working on automating mfaktc communications with Primenet; the only real progress has been an upgrade to parse.c that you (Bdot) will probably be very interested in: I'm supporting comments in worktodo.txt.

***********
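P.S. The per-candidate check I referred to above, in plain C. Just a sketch: it assumes the candidate still fits in 64 bits and leans on GCC's unsigned __int128 for the modular squaring, whereas the real kernels need multi-word arithmetic (candidates pass 2[SUP]64[/SUP]) and something cheaper than the % operator:

[CODE]#include <stdint.h>

/* (a*b) mod m without overflow; GCC/Clang-specific 128-bit type */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

/* left-to-right binary exponentiation: does f divide 2^p - 1 ?
 * f is an odd factor candidate > 1, p the Mersenne exponent */
int is_factor(uint64_t p, uint64_t f)
{
    uint64_t r = 1;
    for (int bit = 63; bit >= 0; bit--) {
        r = mulmod(r, r, f);       /* square */
        if ((p >> bit) & 1)
            r = mulmod(r, 2, f);   /* multiply by the base, 2 */
    }
    return r == 1;   /* 2^p == 1 (mod f), so f divides M_p */
}[/CODE]

The sieving's only job is to make sure as few of these exponentiations as possible are wasted on candidates with small prime factors.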
[QUOTE=MrHappy;270487]With two instances running, every instance just gets half the throughput (25M/s) - so nothing gained or lost. But with two instances the "avg. wait" time is constantly at 6000µs while it is at ~100µs with one instance. Is that a good or bad thing?[/QUOTE]
This equality of throughput is the statement that the CPU is waiting for the GPU... that is, the process is GPU-bound. Have a look at what SievePrimes is doing... I'm betting that it is higher with two instances running than one, meaning the CPU is doing a bit better job of sieving and working a bit harder. Running two instances probably uses a little more of your CPU than just one (you should check this), and you have to deal with a little more confusion. So the data says, for your case, to just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.
[QUOTE=Bdot;270353]
Chaichontat's HD6850 (at this speed rather a 6870!) should achieve around 160M/s. For that you'll need at least 2, probably 3 instances running on at least 4 CPU-cores. [/QUOTE] Hi, my GPU made 90M/s with 2 CPU threads and 58 percent GPU usage, with the "SievePrimes" setting at 5000. When I started another instance, GPU usage swung up to 96 percent, using 4 threads; the first instance gives 65M/s at 10000 SievePrimes, the second 70M/s at 8000. Combined, both instances give ~140M/s. P.S. My GPU is an HD6950, though.
[QUOTE=Chaichontat;270553]Hi,
my GPU made 90M/s with 2 CPU threads and 58 percent GPU usage, with the "SievePrimes" setting at 5000. When I started another instance, GPU usage swung up to 96 percent, using 4 threads; the first instance gives 65M/s at 10000 SievePrimes, the second 70M/s at 8000. Combined, both instances give ~140M/s. P.S. My GPU is an HD6950, though.[/QUOTE] This is the opposite of MrHappy's situation... with two instances, the CPU cores here are just barely keeping up with the GPU, and have to lower the quality of the candidates (SievePrimes) to do it. That is, you are definitely CPU-bound.
[QUOTE=AldoA;270481]No, I tried to install 11.8, but my antivirus found a virus, so it stopped.
Anyway, the attached file should be the output of clinfo.exe.[/QUOTE] This teaches us that we really need a (fairly) up-to-date Catalyst driver. The clinfo output does not even mention your GPU. Please try again to download and install 11.8. If your antivirus still complains, try to get updated virus definitions, or temporarily disable checking. If all else fails, AMD provides a link to [URL="http://support.amd.com/us/gpudownload/windows/previous/Pages/radeonaiw_vista64.aspx"]older versions[/URL] - you can try 11.7.
[QUOTE=Christenson;270505]This equality of throughput is the statement that the CPU is waiting for the GPU... that is, the process is GPU-bound. Have a look at what SievePrimes is doing... I'm betting that it is higher with two instances running than one, meaning the CPU is doing a bit better job of sieving and working a bit harder.
Running two instances probably uses a little more of your CPU than just one (you should check this), and you have to deal with a little more confusion. So the data says, for your case, to just run one instance, especially if your CPU isn't doing LL or P-1 tests as a consequence of running mfakto.[/QUOTE] Yes, that depends on the SievePrimes. I guess when running 2 instances it will go to 200k for both; for a single instance it is probably lower. If you can spare the CPU power, running 2 instances will still be some advantage: at the same total speed of 50M/s, fewer candidates need to be tested, thus the classes are finished faster.
GPU sieving
Christenson, thanks for your thoughts, it's good to have some discussion about it ...
[QUOTE=Christenson;270502] I vote for optimizing large, equally sized blocks that sieve down to a variable number of FCs, much larger than the number of exponentiations that can be done in parallel. These FCs then split into a number of blocks of the right size to do in parallel, plus one "runt" block. I'm not sure where the 4620 classes (in mfaktc) come from, but it seems to me you would want to create each class (block of factor candidates) in turn on the GPU. As a first step, let that class be copied back to CPU memory while awaiting GPU readiness. As a second step, keep those blocks completely in GPU memory, and have the CPU just feed pointers to the GPU kernels. [/QUOTE]

I'll probably go ahead and create siever and compacter kernels that initially do not do a lot of sieving, just to see how they can work together and how the CPU can control them (a rough sketch of such a pair is at the end of this post). The kernels need to run interleaved, as OpenCL does not (yet) support running them in parallel (except on different devices). However, there's no need to copy the blocks to the CPU and back, as the subsequent kernel can easily access them. This will be the major benefit of the GPU sieve (no pressure on the memory bus, reducing interference with prime95, for instance) - I do not expect it to be much faster than CPU sieving. Also, when leaving the blocks in GPU memory, optimizing for size does not seem to be so important.

BTW, GPU context switching requires copying memory blocks from and to the GPU, so having smaller memory blocks can be advantageous. However, different kernels can run (sequentially) in the same context, with almost no switching time.

[QUOTE=Christenson;270502] Finally, if you had two GPUs on the bus, especially in SLI mode, the SLI connector could be carrying the FCs from the siever card to the one that did the exponentiation. [/QUOTE]

This would require good balancing to make the kernels run equally long - which is not necessary when running the kernels serialized. And you'd need to copy the blocks around again.
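For the record, roughly how I picture such a pair of kernels in OpenCL C. Every name and the one-bit-per-candidate layout are placeholders for illustration, not a final design: the siever clears a bit for each candidate it eliminates, and the compacter gathers the survivors into a dense FC block that never leaves GPU memory:

[CODE]/* one work-item per sieving prime: strike out its multiples */
__kernel void sieve_block(__global uint *bits,          /* 1 bit per FC, all set initially */
                          __global const uint *primes,  /* small sieving primes */
                          __global const uint *offset,  /* first multiple of primes[i] in block */
                          const uint block_len)
{
    size_t i = get_global_id(0);
    uint p = primes[i];
    for (uint j = offset[i]; j < block_len; j += p)
        atomic_and(&bits[j >> 5], ~(1u << (j & 31)));   /* clear: composite */
}

/* one work-item per candidate: append survivors to a dense list */
__kernel void compact_block(__global const uint *bits,
                            __global uint *fc_out,    /* dense block of surviving indices */
                            __global uint *fc_count,  /* global counter, zeroed beforehand */
                            const uint block_len)
{
    uint j = (uint)get_global_id(0);
    if (j < block_len && (bits[j >> 5] & (1u << (j & 31))))
        fc_out[atomic_inc(fc_count)] = j;
}[/CODE]

The atomic_inc means the surviving FCs come out in arbitrary order, which does not matter for trial factoring; and because fc_out stays on the card, the factoring kernel can consume it directly, as described above.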
[QUOTE=apsen;270466]Actually it does not even give an error. The installer says that the installation of that part has failed and provides a way to open log file. The log file says "Error messages" and it looks like some details should follow but there are none.[/QUOTE]
They say it may happen if the Catalyst driver is too old, or did not install properly. Did you have an outdated version on the W2008 box (or no Catalyst at all)?
[QUOTE=Bdot;270908]They say it may happen if the Catalyst driver is too old, or did not install properly.
Did you have an outdated version on the W2008 box (or no Catalyst at all)?[/QUOTE] I actually tried to install the whole bunch. Every other component got installed except that one. After that I tried to install just the APP, but it failed too.