mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

monst 2011-08-15 13:39

I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.

Christenson 2011-08-16 01:52

[QUOTE=monst;269144]I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.[/QUOTE]

Thank you. It seems to be a go, especially with Brain's handy little cheat sheet, but it's complaining it needs cudart64_40_17.dll. That's at [URL]http://www.mersenneforum.org/showpost.php?p=266923&postcount=1096[/URL]

It also needs cudafftw64_40_17 from Nvidia. Mangle Detachments wouldn't let me upload it....it's too big!

On a GTX480, also running mfaktc, I'm getting 40 or 50 ms per iteration. Exponent size is 25.0M, for an LL-D done in perhaps 300 hours. Is that about right?

amphoria 2011-08-16 19:04

[QUOTE=Christenson;269194]On a GTX480, also running mfaktc, I'm getting 40 or 50 ms per iteration. Exponent size is 25.0M, for an LL-D done in perhaps 300 hours. Is that about right?[/QUOTE]

That's very slow. On a GTX-570 I was completing a 26M exponent in about 33 hours. That was using the stock CUDALucas on Linux.
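As a rough sanity check on both figures (assuming, as for any LL test of M(p), that p - 2 squarings are needed):

```python
# Compare the two timings quoted above: 40-50 ms/iter on the shared
# GTX 480 vs. a 26M exponent finished in ~33 hours on a dedicated GTX 570.

def ll_hours(exponent, ms_per_iter):
    """An LL test of M(p) needs p - 2 squarings; convert to hours."""
    return (exponent - 2) * ms_per_iter / 1000 / 3600

print(ll_hours(25_000_000, 45))        # ~312 h, matching the ~300 h estimate
print(33 * 3600 * 1000 / 26_000_000)   # ~4.6 ms/iter implied by the 33 h run
```

So the shared card is running roughly ten times slower per iteration than the dedicated one.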

Christenson 2011-08-16 23:07

Ideas? Have I saturated the card's memory? (mfaktc is busy TF'ing something for OBD to 84 bits, and will be done with the assignment in 8 days or so.) Or did I simply not give CUDALucas enough priority with respect to mfaktc? The machine is 4-core, Win64, and running P95 on the other 3 cores. I ran out of experimentation time last night; I need to remember how I got the temperature out of the GPU last time.

Ethan (EO) 2011-08-17 02:54

If memory serves me correctly, mfaktc keeps the ALUs and the host-to-device memory controllers pretty busy, and executes its inner loops almost entirely in shared memory -- I don't think there's terribly much GPU left over if you're already running mfaktc on a fast host processor.

The priority you give to the two executables shouldn't make too much difference, as CUDALucas uses essentially 0 host CPU.

Because it eliminates one device-to-host memory copy in the inner loop, you may want to give my build from [url]http://www.mersenneforum.org/showpost.php?p=268927&postcount=524[/url] a try -- I've completed successful double-checks on 25056947 and 25038353 and verified all the known Mersenne primes between 216091 and 20996011 with it.

On an unrelated topic, CUDALucas on my card is much less prone to errors with overclocking than mfaktc, which surprises me. Both of the double-checks I mentioned were run with the core clock at 775 MHz, while mfaktc starts failing selftests around 700 MHz on the same card.

Christenson 2011-08-17 03:27

The host CPU isn't terribly fast -- it only managed to use about 40% of the GPU capacity with mfaktc.
To my mind, your overclocking result says that mfaktc stresses different parts of the GPU. I suppose that when I next visit that machine, I'll double-check my timing and try your build and see if it is significantly faster. Is it worth forcing mfaktc and cudaLucas to run on separate cores?

TheJudger 2011-08-17 09:15

Hi!

[QUOTE=Christenson;269077]I'm considering running CUDALucas on the same card as mfaktc, since, if only one CPU is running mfaktc, the process is CPU bound.[/QUOTE]

[QUOTE=monst;269144]I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.[/QUOTE]

[B]AFAIK[/B] Technically you can't really run CUDALucas and mfaktc [B]concurrently[/B] on the same GPU. Fermi-class GPUs can concurrently run multiple kernels from a single CUDA context (a single application) but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, followed by...
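A toy model (my own rough assumption, not anything from the CUDA docs) of what that strict interleaving costs each program:

```python
# If two CUDA contexts alternate whole kernel calls on one GPU, each
# program's wall-clock iteration time is stretched by the other's share.

def slowdown(my_kernel_ms, other_kernel_ms):
    """Factor by which one-for-one interleaving stretches one program."""
    return (my_kernel_ms + other_kernel_ms) / my_kernel_ms

# E.g. if a CUDALucas iteration needs ~5 ms of GPU time and the
# interleaved mfaktc turn takes ~40 ms, CUDALucas runs ~9x slower:
print(slowdown(5, 40))   # 9.0
```

The real scheduling is surely messier than one-for-one alternation, but it gives a feel for why neither program sees anything close to a dedicated card.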

[QUOTE=Ethan (EO);269290]If memory serves me correctly, mfaktc keeps the ALUs and the host-to-device memory controllers pretty busy, and executes its inner loops almost entirely in shared memory -- I don't think there's terribly much GPU left over if you're already running mfaktc on a fast host processor.[/QUOTE]
Correct, mfaktc uses the ALUs, register files, a little bit of SP FP and [B]some[/B] shared memory. PCIe bandwidth is used too. The GPU memory controller is virtually idle while running mfaktc; typical memory controller load is less than 2%.

[QUOTE=Ethan (EO);269290]On an unrelated topic, CUDALucas on my card is much less prone to errors with overclocking than mfaktc, which surprises me. Both of the double-checks I mentioned were run with the core clock at 775 MHz, while mfaktc starts failing selftests around 700 MHz on the same card.[/QUOTE]
Interesting! Can you compare GPU temperatures and power consumption (CUDALucas vs. mfaktc)?
CUDALucas is memory-bandwidth limited, that's why those 2000-series Teslas aren't much faster than a GTX [45][78]0. :sad: I would [I]assume[/I] that CUDALucas doesn't stress the ALU/FPU that hard.
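A back-of-envelope sketch of the bandwidth limit (every number here is an assumption: the FFT length, the pass count, and using peak rather than achieved bandwidth):

```python
# Why an LL squaring iteration is bound by memory traffic, not FLOPs.
# All figures are rough assumptions for illustration only.

fft_len   = 1_440_000   # plausible DP FFT length for a ~25M exponent
bytes_per = 8           # double precision
passes    = 10          # assumed global-memory read+write passes for
                        # forward FFT + pointwise square + inverse FFT
bandwidth = 177e9       # GTX 480 peak memory bandwidth, ~177 GB/s

traffic = fft_len * bytes_per * passes    # bytes moved per iteration
print(traffic / bandwidth * 1000)         # ~0.65 ms/iter at peak bandwidth
```

Real kernels make more passes and reach well under peak bandwidth, so measured iteration times come out several times larger; the point is only that iteration time scales as traffic divided by memory bandwidth, which is why cards with similar bandwidth land at similar speeds.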


Oliver

Christenson 2011-08-17 12:12

Any idea what my context switches might cost? Any good ideas on how to reduce their number? (The system has 2-4 GB of real memory; mfaktc on one core cannot keep up, so it could certainly be made to buffer a bit more and swap out of the GPU less often.)

Ethan (EO) 2011-08-17 16:49

[QUOTE=TheJudger;269303][B]AFAIK[/B] Technically you can't really run CUDALucas and mfaktc [B]concurrently[/B] on the same GPU. Fermi-class GPUs can concurrently run multiple kernels from a single CUDA context (a single application) but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, followed by...[/QUOTE]

This is my understanding too; in contrast, the Radeon 6xxx series has explicit support for concurrent kernel execution, including concurrent PCIe access (well, I'm sure it is serialized at some point...) by multiple kernels. I don't know about priority/resource allocation being exposed in the OpenCL implementation, though...

[QUOTE=TheJudger;269303]
Interesting! Can you compare GPU temperatures and power consumption (CUDALucas vs. mfaktc)?
CUDALucas is memory-bandwidth limited, that's why those 2000-series Teslas aren't much faster than a GTX [45][78]0. :sad: I would [I]assume[/I] that CUDALucas doesn't stress the ALU/FPU that hard.
Oliver[/QUOTE]

Yeah -- I plan to do some more investigation later on the power consumption of the two programs. Interestingly, CUDALucas performance scales linearly with shader clock on my setup even if I hold memory clock fixed, even though all of the hot kernels in the code seem bandwidth limited both intuitively and in Visual Profiler.

-Ethan

ps. I haven't been keeping up on goings on in the mfaktc world for a while; hope things are good :)

Bdot 2011-08-18 21:31

[QUOTE=TheJudger;269303]Hi!

[B]AFAIK[/B] Technically you can't really run CUDALucas and mfaktc [B]concurrently[/B] on the same GPU. Fermi-class GPUs can concurrently run multiple kernels from a single CUDA context (a single application) but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, followed by...
[/QUOTE]

Hmm, but isn't that the same as running multiple mfaktc instances in order to fully utilize the GPU? Would the GPU know that the kernels of the different mfaktc processes are actually the same? Would surprise me. And if the GPU treats them as different kernels anyway, why not replace one or two mfaktc instances by CUDALucas? Of course, neither will run at full speed, but together they should bring the GPU to 99%. Kind of: run CUDALucas while the (comparably slow) CPU prepares the next mfaktc grid, then run that mfaktc kernel, then switch back again ... ?

Regarding priorities and OpenCL: at least in the current OpenCL version 1.1 there is no mention of them. It's possible that AMD will provide them as a non-standard extension, but otherwise the host processes need to do that scheduling themselves. This is a little like I/O: two processes writing data to the same disk will get their writes done independent of the process priorities.

Context switches: they happen only on the CPU. A running GPU kernel will not be interrupted when its host process is suspended or even paged out. I know of no way to interrupt a running OpenCL kernel (maybe it is possible with CUDA or lower-level GPU access methods). As the GPU runs the kernels one after the other anyway, context switches should not be a big problem here.

Christenson 2011-08-18 22:20

Context Switches, in the classical sense, only happen on the CPU. But I need some word to describe what it is that happens when we switch between the CUDALucas kernel and data and the mfaktc kernel and data on the GPU. Whatever this mechanism is called, it becomes more important when mfaktc adds a second kernel to the GPU for sieving.

What my multicore machine is doing right now is running both CUDALucas and mfaktc. Prime95 runs on the other 3 cores. The mfaktc performance was definitely bound by the fact that it could only sieve so quickly on the one core; I estimated it was using 40% of the GPU's power. Timing CUDALucas with my watch, I estimated 300 hours to finish a 25M LL test. mfaktc seemed to speed up as I killed off the stuff I'd had to do to get the cufft DLLs installed from Nvidia. I'd like CUDALucas to replace the CPU I have lost to mfaktc. The GPU card is an unmarked GTX480; Nvidia seems to be handing them out to the right people, such as FIRST Robotics teams at competitions, to whet the appetite of the future market.


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.