mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2011-08-15, 13:39   #529
monst
 

I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.
Old 2011-08-16, 01:52   #530
Christenson
 

Quote:
Originally Posted by monst View Post
I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.
Thank you. It looks like a go, especially with Brain's handy little cheat sheet, but it's complaining that it needs cudart64_40_17.dll. That's at http://www.mersenneforum.org/showpos...postcount=1096

It also needs cudafftw64_40_17 from Nvidia. Mangle Detachments wouldn't let me upload it... it's too big!

On a GTX480 that is also running mfaktc, I'm getting 40-50 ms per iteration. Exponent size is 25.0M, for an LL-D (double-check) finished in perhaps 300 hours. Is that about right?
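That estimate can be sanity-checked with a few lines of arithmetic (45 ms is just the midpoint of the reported 40-50 ms range):

```python
# Back-of-the-envelope check: an LL test of exponent p takes about p iterations.
iter_ms = 45            # midpoint of the reported 40-50 ms per iteration
exponent = 25_000_000   # the 25.0M exponent
hours = exponent * iter_ms / 1000 / 3600
print(f"{hours:.1f} hours")   # about 312 hours, consistent with the ~300 h guess
```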

Last fiddled with by Christenson on 2011-08-16 at 02:47 Reason: Found out why I can't mangle that detachment!
Old 2011-08-16, 19:04   #531
amphoria
 

Quote:
Originally Posted by Christenson View Post
On a GTX480 that is also running mfaktc, I'm getting 40-50 ms per iteration. Exponent size is 25.0M, for an LL-D (double-check) finished in perhaps 300 hours. Is that about right?
That's very slow. On a GTX 570 I was completing a 26M exponent in about 33 hours, using the stock CUDALucas on Linux.
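For comparison, amphoria's figure works out to a much shorter per-iteration time (the same arithmetic run in reverse):

```python
# Convert 33 hours for a 26M exponent back into ms per iteration.
hours, exponent = 33, 26_000_000
ms_per_iter = hours * 3600 * 1000 / exponent
print(f"{ms_per_iter:.2f} ms/iter")   # about 4.57 ms, roughly 10x faster than 40-50 ms
```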
Old 2011-08-16, 23:07   #532
Christenson
 

Ideas? Have I saturated the card's memory? (mfaktc is busy TF'ing something for OBD to 84 bits, and will be done with the assignment in 8 days or so.) Or did I simply not give CUDALucas enough priority relative to mfaktc? The machine is a 4-core Win64 box, running P95 on the other 3 cores. I ran out of experimentation time last night; I need to remember how I read the temperature out of the GPU last time.
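For the temperature question: with NVIDIA's drivers the bundled nvidia-smi tool can report it from the command line (assuming it is on your PATH; on Windows it lives in the driver's install directory, and the query-style flags require a reasonably recent driver):

```shell
# Report GPU temperature plus core and memory-controller utilization.
nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,utilization.memory --format=csv
```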
Old 2011-08-17, 02:54   #533
Ethan (EO)
 

If memory serves me correctly, mfaktc keeps the ALUs and the host-to-device memory controllers pretty busy, and executes its inner loops almost entirely in shared memory -- I don't think there's terribly much GPU left over if you're already running mfaktc on a fast host processor.

The priority you give to the two executables shouldn't make too much difference, as CUDALucas uses essentially 0 host CPU.

Because it eliminates one device-to-host memory copy in the inner loop, you may want to give my build from http://www.mersenneforum.org/showpos...&postcount=524 a try -- I've completed successful double-checks on 25056947 and 25038353 and verified all the known Mersenne primes between 216091 and 20996011 with it.

On an unrelated topic, CUDALucas on my card is much less prone to errors when overclocking than mfaktc, which surprises me. Both of the double-checks I mentioned were run with the core clock at 775 MHz, while mfaktc starts failing self-tests around 700 MHz on the same card.
Old 2011-08-17, 03:27   #534
Christenson
 

The host CPU isn't terribly fast -- it only managed to use about 40% of the GPU capacity with mfaktc.
To my mind, your overclocking result says that mfaktc stresses different parts of the GPU. I suppose that when I next visit that machine, I'll double-check my timing and try your build to see if it is significantly faster. Is it worth forcing mfaktc and CUDALucas onto separate cores?
Old 2011-08-17, 09:15   #535
TheJudger
 

Hi!

Quote:
Originally Posted by Christenson View Post
I'm considering running CUDALucas on the same card as mfaktc, since, if only one CPU is running mfaktc, the process is CPU bound.
Quote:
Originally Posted by monst View Post
I haven't played games with the priorities, but I have successfully run CUDALucas and mfaktc concurrently on the same card. With respect to a production-worthy binary, I returned a successful double-check residue with apsen's build from post #518.
AFAIK, technically you can't really run CUDALucas and mfaktc concurrently on the same GPU. Fermi-class GPUs can run multiple kernels concurrently from a single CUDA context (a single application), but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, and so on.

Quote:
Originally Posted by Ethan (EO) View Post
If memory serves me correctly, mfaktc keeps the ALUs and the host-to-device memory controllers pretty busy, and executes its inner loops almost entirely in shared memory -- I don't think there's terribly much GPU left over if you're already running mfaktc on a fast host processor.
Correct: mfaktc uses the ALUs, the register files, a little single-precision floating point and some shared memory. PCIe bandwidth is used too. The GPU memory controller is virtually idle while mfaktc runs; typical memory-controller load is below 2%.

Quote:
Originally Posted by Ethan (EO) View Post
On an unrelated topic, CUDALucas on my card is much less prone to errors when overclocking than mfaktc, which surprises me. Both of the double-checks I mentioned were run with the core clock at 775 MHz, while mfaktc starts failing self-tests around 700 MHz on the same card.
Interesting! Can you compare GPU temperatures and power consumption (CUDALucas vs. mfaktc)?
CUDALucas is memory-bandwidth limited; that's why the 2000-series Teslas aren't much faster than a GTX [45][78]0. I would assume that CUDALucas doesn't stress the ALU/FPU that hard.


Oliver
Old 2011-08-17, 12:12   #536
Christenson
 

Any idea what my context switches might cost? Any good ideas on how to reduce their number? (The system has 2-4 GB of real memory, and mfaktc on one core cannot keep up, so it could certainly be made to buffer a bit more and swap out of the GPU less often.)
Old 2011-08-17, 16:49   #537
Ethan (EO)
 

Quote:
Originally Posted by TheJudger View Post
AFAIK, technically you can't really run CUDALucas and mfaktc concurrently on the same GPU. Fermi-class GPUs can run multiple kernels concurrently from a single CUDA context (a single application), but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, and so on.
This is my understanding too; the Radeon 6xxx series, in contrast, has explicit support for concurrent kernel execution by multiple kernels, including concurrent PCIe access (well, I'm sure it is serialized at some point...). I don't know whether priority/resource allocation is exposed in the OpenCL implementation, though...

Quote:
Originally Posted by TheJudger View Post
Interesting! Can you compare GPU temperatures and power consumption (CUDALucas vs. mfaktc)?
CUDALucas is memory-bandwidth limited; that's why the 2000-series Teslas aren't much faster than a GTX [45][78]0. I would assume that CUDALucas doesn't stress the ALU/FPU that hard.
Oliver
Yeah -- I plan to investigate the power consumption of the two programs further later on. Interestingly, CUDALucas performance scales linearly with the shader clock on my setup even when I hold the memory clock fixed, even though all of the hot kernels in the code look bandwidth-limited, both intuitively and in Visual Profiler.

-Ethan

ps. I haven't been keeping up on goings on in the mfaktc world for a while; hope things are good :)
Old 2011-08-18, 21:31   #538
Bdot
 

Quote:
Originally Posted by TheJudger View Post
Hi!

AFAIK, technically you can't really run CUDALucas and mfaktc concurrently on the same GPU. Fermi-class GPUs can run multiple kernels concurrently from a single CUDA context (a single application), but not from different CUDA contexts. In the best case it will run one complete kernel call from CUDALucas, followed by one complete kernel call from mfaktc, and so on.
Hmm, but isn't that the same as running multiple mfaktc instances in order to fully utilize the GPU? Would the GPU know that the kernels of the different mfaktc processes are actually the same? That would surprise me. And if the GPU treats them as different kernels anyway, why not replace one or two mfaktc instances with CUDALucas? Of course, neither will run at full speed, but together they should bring the GPU to 99%. Something like: run CUDALucas while the (comparatively slow) CPU prepares the next mfaktc grid, then run that mfaktc kernel, then switch back again...?
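That interleaving idea can be sketched with a toy scheduler simulation (all timings below are made-up illustrative numbers, not measurements): if the CPU needs 30 ms of sieving to prepare each 20 ms mfaktc kernel, the GPU alone sits at 40% busy -- close to Christenson's estimate -- and filling the gaps with 10 ms CUDALucas kernels takes it to 100%.

```python
def gpu_busy_fraction(fill_with_lucas, total_ms=10_000,
                      mfaktc_prep=30.0, mfaktc_kernel=20.0, ll_kernel=10.0):
    """Fraction of wall time the GPU spends running kernels (serialized, as on Fermi)."""
    t = 0.0                    # wall clock, in ms
    busy = 0.0                 # ms spent executing kernels
    next_mfaktc = mfaktc_prep  # first mfaktc batch is ready after one CPU prep
    while t < total_ms:
        if t >= next_mfaktc:               # an mfaktc batch is ready: run it
            t += mfaktc_kernel
            busy += mfaktc_kernel
            next_mfaktc = t + mfaktc_prep  # CPU starts sieving the next batch
        elif fill_with_lucas:              # otherwise fill the gap with an LL kernel
            t += ll_kernel
            busy += ll_kernel
        else:                              # or let the GPU idle until work arrives
            t = next_mfaktc
    return busy / t

print(gpu_busy_fraction(False))  # 0.4 -- mfaktc alone, CPU-bound
print(gpu_busy_fraction(True))   # 1.0 -- gaps filled by CUDALucas kernels
```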

Regarding priorities and OpenCL: at least in the current OpenCL version 1.1 there is no mention of them. It's possible that AMD will provide priorities as a non-standard extension, but otherwise the host processes need to do that scheduling themselves. This is a little like I/O: two processes writing data to the same disk will get their writes done regardless of the process priorities.

Context switches: they happen only on the CPU. A running GPU kernel will not be interrupted when its host process is deactivated or even paged out. I know of no way to interrupt a running OpenCL kernel (maybe it is possible with CUDA or lower-level GPU access methods). Since the GPU runs the kernels one after the other anyway, context switches should not be a big problem here.
Old 2011-08-18, 22:20   #539
Christenson
 

Context switches, in the classical sense, only happen on the CPU. But I need some word for what happens when we switch between the CUDALucas kernel and data and the mfaktc kernel and data on the GPU. Whatever this mechanism is called, it becomes more important when mfaktc adds a second kernel to the GPU for sieving.

What my multicore machine is doing right now is running both CUDALucas and mfaktc, with Prime95 on the other 3 cores. The mfaktc performance was definitely bound by how quickly it could sieve on the one core; I estimated it was using 40% of the GPU's power. Timing CUDALucas with my watch, I estimated 300 hours to finish a 25M LL test. mfaktc seemed to speed up as I killed off the stuff I'd had to run to get the cufft DLLs installed from Nvidia. I'd like CUDALucas to make up for the CPU core I have lost to mfaktc. The GPU card is an unmarked GTX480; Nvidia seems to be handing them out to the right people, such as FIRST Robotics teams at competitions, to whet the appetite of the future market.


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.