mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

msft 2011-03-01 14:44

[QUOTE=vsuite;254074]
Does it make sense to run multiple instances of CudaLucas to increase overall throughput, or is the GPU really maxed out?
[/QUOTE]
The GPU really is maxed out. :smile:

vsuite 2011-03-01 23:02

Thanks msft.

So it makes sense to run prime95 LL (maxes out the CPU) plus one instance of CudaLucas (maxes out the GPU and is about 5x faster than Prime95).

Does CudaLucas share 50-50 with mfaktc if both are run simultaneously?

Thanks.

msft 2011-03-02 02:37

[QUOTE=vsuite;254124]Does CudaLucas share 50-50 with mfaktc if both are run simultaneously?
[/QUOTE]
No, the GPU job-scheduling algorithm is not fair.

msft 2011-03-06 06:10

I found cufftSetCompatibilityMode().
[CODE]
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cutil_inline.h>

int main()
{
    cufftHandle plan;
    cudaEvent_t start, stop;
    double *x, *g_x;
    float outerTime;
    int i, j, imax;

    imax = 1024*1024*4 + 1;
    cutilSafeCall(cudaMalloc((void **)&g_x, sizeof(double)*imax));
    x = (double *)malloc(sizeof(double)*imax);
    for (i = 0; i < imax; i++) x[i] = 0;
    cutilSafeCall(cudaMemcpy(g_x, x, sizeof(double)*imax, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));

    /* Time FFT sizes from 2M to 4M points, stepping by 512k. */
    for (j = 1024*1024*2; j < imax; j += 1024*1024/2) {
        /* Complex-to-complex (Z2Z) on j/2 complex points:
           one warm-up run, then average over 10 timed iterations. */
        cufftSafeCall(cufftPlan1d(&plan, j/2, CUFFT_Z2Z, 1));
        cufftSafeCall(cufftExecZ2Z(plan, (cufftDoubleComplex *)g_x, (cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecZ2Z(plan, (cufftDoubleComplex *)g_x, (cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_Z2Z size=%d k time=%f msec\n", j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));

        /* Real-to-complex (D2Z) on j real points, default
           (FFTW-compatible) output layout. */
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_D2Z, 1));
        cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_D2Z size=%d k time=%f msec\n", j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));

        /* Same D2Z, but with CUFFT's native data layout. */
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_D2Z, 1));
        cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cufftSafeCall(cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=%d k time=%f msec\n", j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));
    }

    cutilSafeCall(cudaFree(g_x));
    free(x);
    cutilSafeCall(cudaEventDestroy(start));
    cutilSafeCall(cudaEventDestroy(stop));
    return 0;
}
[/CODE]
[CODE]

$ ./cufftbench
CUFFT_Z2Z size=2048 k time=3.042378 msec
CUFFT_D2Z size=2048 k time=4.063469 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2048 k time=3.623491 msec
CUFFT_Z2Z size=2560 k time=4.008445 msec
CUFFT_D2Z size=2560 k time=5.339828 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2560 k time=4.789418 msec
CUFFT_Z2Z size=3072 k time=4.938816 msec
CUFFT_D2Z size=3072 k time=6.547917 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3072 k time=5.887846 msec
CUFFT_Z2Z size=3584 k time=5.683993 msec
CUFFT_D2Z size=3584 k time=7.542262 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3584 k time=6.773341 msec
CUFFT_Z2Z size=4096 k time=6.289344 msec
CUFFT_D2Z size=4096 k time=8.457850 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=4096 k time=7.577504 msec
[/CODE]

TheJudger 2011-03-06 13:23

[QUOTE=vsuite;254124]Thanks msft.

So it makes sense to run prime95 LL (maxes out CPU) plus 1 instance CudaLucas (maxes out GPU & about 5x faster than Prime95).

Does CudaLucas share 50-50 with mfaktc if both are run simultaneously?

Thanks.[/QUOTE]

[QUOTE=msft;254132]No, GPU Job Scheduling algorithm is not fair.[/QUOTE]

Yep, GPU job scheduling is almost nonexistent... very coarse-grained. Once Nvidia implements fine-grained scheduling on the GPU, running both at once should be a nice option. CUDALucas seems to be memory-bandwidth bound (and leaves some compute power "idle"), while mfaktc is totally compute bound (and needs very, very little memory bandwidth). But without fine-grained scheduling it won't work.

Oliver

msft 2011-03-07 04:14

1 Attachment(s)
Here is MacLucasFFTW, built against fftw-3.2.2.
[CODE]
$ tar -xvf MacLucasFFTW.fftw-3.2.2.tar.bz2
$ cd MacLucasFFTW.fftw-3.2.2/
MacLucasFFTW.fftw-3.2.2$ make
MacLucasFFTW.fftw-3.2.2$ time ./MacLucasFFTW 11213
M( 11213 )P, n = 640, MacLucasFFTW v8.1 Ballester

real 0m0.542s
user 0m0.440s
sys 0m0.000s
[/CODE]
This version uses
[CODE]
forw=fftw_plan_r2r_1d(n,(double *)x,(double*)x,FFTW_R2HC,FFTW_ESTIMATE);
back=fftw_plan_r2r_1d(n,(double *)x,(double *) x,FFTW_HC2R,FFTW_ESTIMATE);
[/CODE]
But these r2r (halfcomplex) transforms are not supported by CUFFT.
I want to change them to
[CODE]
forw=fftw_plan_dft_r2c_1d(n,(double *)x,(fftw_complex *) x,FFTW_ESTIMATE);
back=fftw_plan_dft_c2r_1d(n,(fftw_complex *)x,(double *) x,FFTW_ESTIMATE);
[/CODE]
But it does not work.

Christenson 2011-03-22 02:39

[QUOTE=alexhiggins732;253480]Thanks for your work and the fast reply.

<snip>

I have attached the source code with make files for win32, win64, and Linux and added a short README.txt with basic instructions on how to compile on each of the systems. Feel free to include these in your distribution.

The attachment also contains win32 binaries.[/QUOTE]
Was it decided whether these win32 binaries were working, or should I try feeding them to my GeForce 210 (CUDA compute capability 1.2) card with some test exponents and see if they work?

Thanks
Eric Christenson

msft 2011-03-24 20:37

Hi, Christenson
[QUOTE=Christenson;256332] or should I try to feed them to my GEForce210 Cuda 1.2 card with some test exponents and see if it works?
[/QUOTE]
CUDALucas needs CUDA compute capability 1.3 (for double precision).

Christenson 2011-03-28 02:27

[QUOTE=msft;256544]Hi, Christenson

CUDALucas needs CUDA 1.3.[/QUOTE]
Thanks.
:down:
Any hope of doing LL tests on that GeForce 210 with CUDA 1.2, or am I stuck running mfaktc?

ET_ 2011-04-02 12:56

Hi all, hi msft! :smile:

I have an i5-750 with Ubuntu 64-bit and a GTX 275 with CUDA 3.0.

I successfully ran this version of MacLucasFFTW on my system:

$Id: MacLucasFFTW.c,v 8.1 2007/06/23 22:33:35 wedgingt Exp $ [email]wedgingt@acm.org[/email]

Is there a newer/faster version that I can try and compile or, better, a ready binary that I can have?

Could you give me any pointers?

Thank you :smile:

Luigi

msft 2011-04-04 20:50

Hi, ET_
[QUOTE=ET_;257374] $Id: MacLucasFFTW.c,v 8.1 2007/06/23 22:33:35 wedgingt Exp $ [email]wedgingt@acm.org[/email]
[/QUOTE]
CUDALucas.1.0.tar.gz is the newer version.
Thank you,

