[QUOTE=vsuite;254074]
Does it make sense to run multiple instances of CudaLucas to increase overall throughput, or is the GPU really maxed out? [/QUOTE] The GPU really is maxed out. :smile:
Thanks msft.
So it makes sense to run Prime95 LL (which maxes out the CPU) plus one instance of CudaLucas (which maxes out the GPU and is about 5x faster than Prime95). Does CudaLucas share 50-50 with mfaktc if both are run simultaneously? Thanks.
[QUOTE=vsuite;254124]Does CudaLucas share 50-50 with mfaktc if both are run simultaneously?
[/QUOTE] No, the GPU job-scheduling algorithm is not fair.
I found cufftSetCompatibilityMode().
[CODE]
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cutil_inline.h>

int main()
{
    cufftHandle plan;
    cudaEvent_t start, stop;
    double *x;
    double *g_x;
    float outerTime;
    int i, j, imax;

    imax = 1024*1024*4 + 1;
    cutilSafeCall(cudaMalloc((void **)&g_x, sizeof(double)*imax));
    x = (double *)malloc(sizeof(double)*imax);
    for (i = 0; i < imax; i++) x[i] = 0;
    cutilSafeCall(cudaMemcpy(g_x, x, sizeof(double)*imax, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));

    for (j = 1024*1024*2; j < imax; j += 1024*1024/2) {
        /* Z2Z at half the length (same number of doubles as the D2Z cases below) */
        cufftSafeCall(cufftPlan1d(&plan, j/2, CUFFT_Z2Z, 1));
        cufftSafeCall(cufftExecZ2Z(plan, (cufftDoubleComplex *)g_x,
                                   (cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecZ2Z(plan, (cufftDoubleComplex *)g_x,
                                       (cufftDoubleComplex *)g_x, CUFFT_INVERSE));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_Z2Z size=%d k time=%f msec\n", j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));

        /* D2Z, default compatibility mode */
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_D2Z, 1));
        cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_D2Z size=%d k time=%f msec\n", j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));

        /* D2Z with CUFFT_COMPATIBILITY_NATIVE set before the timed loop */
        cufftSafeCall(cufftPlan1d(&plan, j, CUFFT_D2Z, 1));
        cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cufftSafeCall(cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE));
        cutilSafeCall(cudaEventRecord(start, 0));
        for (i = 0; i < 10; i++)
            cufftSafeCall(cufftExecD2Z(plan, g_x, (cufftDoubleComplex *)g_x));
        cutilSafeCall(cudaEventRecord(stop, 0));
        cutilSafeCall(cudaEventSynchronize(stop));
        cutilSafeCall(cudaEventElapsedTime(&outerTime, start, stop));
        printf("CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=%d k time=%f msec\n",
               j/1024, outerTime/10);
        cufftSafeCall(cufftDestroy(plan));
    }

    cutilSafeCall(cudaFree(g_x));
    cutilSafeCall(cudaEventDestroy(start));
    cutilSafeCall(cudaEventDestroy(stop));
    free(x);
    return 0;
}
[/CODE] [CODE]
$ ./cufftbench
CUFFT_Z2Z size=2048 k time=3.042378 msec
CUFFT_D2Z size=2048 k time=4.063469 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2048 k time=3.623491 msec
CUFFT_Z2Z size=2560 k time=4.008445 msec
CUFFT_D2Z size=2560 k time=5.339828 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2560 k time=4.789418 msec
CUFFT_Z2Z size=3072 k time=4.938816 msec
CUFFT_D2Z size=3072 k time=6.547917 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3072 k time=5.887846 msec
CUFFT_Z2Z size=3584 k time=5.683993 msec
CUFFT_D2Z size=3584 k time=7.542262 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3584 k time=6.773341 msec
CUFFT_Z2Z size=4096 k time=6.289344 msec
CUFFT_D2Z size=4096 k time=8.457850 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=4096 k time=7.577504 msec
[/CODE]
[QUOTE=vsuite;254124]Thanks msft.
So it makes sense to run prime95 LL (maxes out CPU) plus 1 instance CudaLucas (maxes out GPU & about 5x faster than Prime95). Does CudaLucas share 50-50 with mfaktc if both are run simultaneously? Thanks.[/QUOTE] [QUOTE=msft;254132]No, GPU Job Scheduling algorithm is not fair.[/QUOTE] Yep, GPU job scheduling is almost nonexistent; it is very coarse-grained. Once Nvidia implements fine-grained scheduling on the GPU, running both should be a nice option. CUDALucas seems to be memory-bandwidth bound (and leaves some compute power idle), while mfaktc is totally compute bound (and needs very little memory bandwidth). But without fine-grained scheduling it won't work. Oliver
1 Attachment(s)
MacLucasFFTW, built against fftw-3.2.2.
[CODE]
$ tar -xvf MacLucasFFTW.fftw-3.2.2.tar.bz2
$ cd MacLucasFFTW.fftw-3.2.2/
MacLucasFFTW.fftw-3.2.2$ make
MacLucasFFTW.fftw-3.2.2$ time ./MacLucasFFTW
11213M( 11213 )P, n = 640, MacLucasFFTW v8.1 Ballester

real    0m0.542s
user    0m0.440s
sys     0m0.000s
[/CODE] This version uses [CODE]
forw = fftw_plan_r2r_1d(n, (double *)x, (double *)x, FFTW_R2HC, FFTW_ESTIMATE);
back = fftw_plan_r2r_1d(n, (double *)x, (double *)x, FFTW_HC2R, FFTW_ESTIMATE);
[/CODE] but these r2r halfcomplex plans have no CUFFT equivalent. I want to change them to [CODE]
forw = fftw_plan_dft_r2c_1d(n, (double *)x, (fftw_complex *)x, FFTW_ESTIMATE);
back = fftw_plan_dft_c2r_1d(n, (fftw_complex *)x, (double *)x, FFTW_ESTIMATE);
[/CODE] but it does not work.
[QUOTE=alexhiggins732;253480]Thanks for your work and the fast reply.
<snip> I have attached the source code with make files for Win32, Win64, and Linux and added a short README.txt with basic instructions on how to compile on each of the systems. Feel free to include these in your distribution. The attachment also contains Win32 binaries.[/QUOTE] Was it decided whether these Win32 binaries were working, or should I try to feed them to my GeForce 210 (compute capability 1.2) card with some test exponents and see if it works? Thanks Eric Christenson
Hi, Christenson
[QUOTE=Christenson;256332] or should I try to feed them to my GEForce210 Cuda 1.2 card with some test exponents and see if it works? [/QUOTE] CUDALucas needs compute capability 1.3 (for double precision).
[QUOTE=msft;256544]Hi ,Christenson
CUDALucas need CUDA 1.3.[/QUOTE] Thanks. :down: Any hope of doing LL tests on that GeForce 210 with compute capability 1.2, or am I stuck running mfaktc?
Hi all, hi msft! :smile:
I have an i5-750 with Ubuntu 64-bit and a GTX 275 with CUDA 3.0. I successfully ran this version of MacLucasFFTW on my system: $Id: MacLucasFFTW.c,v 8.1 2007/06/23 22:33:35 wedgingt Exp $ [email]wedgingt@acm.org[/email] Is there a newer/faster version that I can try to compile or, better, a ready binary that I can have? Can you give me any pointers? Thank you :smile: Luigi
Hi, ET_
[QUOTE=ET_;257374] $Id: MacLucasFFTW.c,v 8.1 2007/06/23 22:33:35 wedgingt Exp $ [email]wedgingt@acm.org[/email] [/QUOTE] CUDALucas.1.0.tar.gz is the newer version. Thank you,
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.