mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   GPU and Mersenne primes (https://www.mersenneforum.org/showthread.php?t=17061)

bulldog 2012-08-13 13:26

GPU and Mersenne primes
 
I have played a little with CUDALucas and gpuLucas, and I have the following question. These algorithms use all of the GPU's threads and all of its power to test a single Mersenne exponent.

How does the efficiency depend on the total number of GPU threads used?
Has anyone checked this?
I haven't checked it myself, but I would guess that the dependence is not linear.

If so, why not change the algorithm?
Why not use, say, each group of 128 threads to test a different exponent?
One would then be able to test thousands of Mersenne exponents in parallel on a single GPU. Wouldn't that be much more efficient and make much more sense?

Dubslow 2012-08-13 15:34

This has been discussed in detail recently, but I can't find the thread again, so a summary:

Yes, the utilization/efficiency is particularly low for FFTs, but the problem is that one test requires some CPU time (not enough to show up in a system monitor, but it's there), and the PCI transfers are enormously laggy and are to be avoided. Running more than one test in parallel on one card would double, triple, etc. the number of transfers, enormously increasing the time taken. See, e.g., the third post in the CUDALucas [URL="http://www.mersenneforum.org/showthread.php?t=12576"]thread[/URL].
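A rough back-of-envelope of what one such transfer costs. All numbers here are illustrative assumptions (a ~2M-point FFT residue and PCIe 2.0 x16 effective bandwidth), not measurements:

```python
# Hypothetical cost of one host<->device copy of a full double-precision
# FFT residue over PCIe. Both the FFT length and the effective bandwidth
# are assumed, illustrative figures for a 2012-era card.
fft_len = 2 ** 21              # ~2M-point FFT (assumed, typical for large exponents)
bytes_per_copy = fft_len * 8   # 8 bytes per double-precision FFT word
pcie_bps = 6e9                 # assumed effective PCIe 2.0 x16 bandwidth, bytes/s
copy_ms = bytes_per_copy / pcie_bps * 1e3
print(f"{bytes_per_copy // 2**20} MiB per copy, ~{copy_ms:.1f} ms over PCIe")
```

A few milliseconds per copy is enormous next to a sub-millisecond iteration, which is why the transfers dominate if done every iteration.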

bulldog 2012-08-13 15:59

Dubslow, thank you for the explanation. But it seems silly to transfer data between the CPU and GPU and to use CPU time at every iteration, rather than only at the very beginning and very end of the test. It seems that if one modified the algorithm to do everything inside a single GPU kernel, then it would be scalable and able to test many exponents in parallel...

Dubslow 2012-08-13 16:08

I don't know about the details, you'd have to ask msft why it's a nightmare.

axn 2012-08-13 16:17

[QUOTE=Dubslow;307813]This has been discussed in detail recently, but I can't find the thread again, so a summary:[/QUOTE]

I thought CuLu did all of its computation inside the GPU? At least, its cousin GeneferCUDA does.

The main problem is the same one facing CPUs today: bandwidth to memory. There is also a second problem: total available memory. These two factors limit the number of tests that can run concurrently.
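To make the capacity limit concrete, here is a hypothetical sketch; the per-test buffer count and sizes are assumptions for illustration, not figures taken from CUDALucas:

```python
# Assumed sketch: how many concurrent LL tests fit in GPU memory if each
# test needs its own set of FFT-sized double buffers?
fft_len = 2 ** 21              # ~2M-point FFT (assumed)
buffers_per_test = 4           # residue + weights + scratch (assumed count)
bytes_per_test = fft_len * 8 * buffers_per_test
gpu_mem = 3 * 2 ** 30          # 3 GiB card, typical Tesla of the era (assumed)
print(bytes_per_test // 2 ** 20, "MiB per test,",
      gpu_mem // bytes_per_test, "tests fit")
```

Under these assumptions only a few dozen tests fit, nowhere near "thousands", and they would all be fighting over the same memory bandwidth besides.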

Dubslow 2012-08-13 16:37

[QUOTE=axn;307817]I thought CuLu did all of its computation inside the GPU? At least, its cousin GeneferCUDA does.

The main problem is the same one facing CPUs today: bandwidth to memory. There is also a second problem: total available memory. These two factors limit the number of tests that can run concurrently.[/QUOTE]

[url]https://sourceforge.net/p/cudalucas/code/37/tree/trunk/CUDALucas.cu?force=True[/url]

The CUDA kernel names are:
rftfsub_kernel
normalize_kernel
normalize2_kernel
last_normalize
...and of course the two FFT lib calls.

Everything else is CPU.

You'll have to go digging for details yourself.

[code]double
lucas_square (double *x, int N, int iter, int last, float *maxerr,
              int error_flag)
{
  double terr;
  rdft (N, 1, x, ip);  // CUFFT, rftfsub_kernel (presumably the convolution), CUIFFT
  if (iter == last)
    {
      cutilSafeCall (cudaMemcpy (x, g_x, sizeof (double) * N,
                                 cudaMemcpyDeviceToHost));
      terr = last_normalize (x, N, error_flag);
    }
  else
    {
      if ((iter % checkpoint_iter) == 0)
        {
          cutilSafeCall (cudaMemcpy (x, g_x, sizeof (double) * N,
                                     cudaMemcpyDeviceToHost));
          terr = last_normalize (x, N, error_flag);
        }

      if (error_flag || t_f)
        {
          normalize_kernel<1> <<<N / threads, threads>>>
            (g_x, threads, g_err, g_carry, g_mask, g_ttp, g_ttmp,
             g_numbits, *maxerr);
        }
      else
        {
          normalize_kernel<0> <<<N / threads, threads>>>
            (g_x, threads, g_err, g_carry, g_mask, g_ttp, g_ttmp,
             g_numbits, *maxerr);
        }
      normalize2_kernel <<<((N + threads - 1) / threads + 127) / 128, 128>>>
        (g_x, threads, g_carry, N, g_inv2, g_ttp2, g_ttmp2,
         g_inv3, g_ttp3, g_ttmp3);

      terr = 0.0;
      if (error_flag || (polite_f && (iter % polite) == 0))
        {
          float c_err;
          cutilSafeCall (cudaMemcpy (&c_err, g_err, sizeof (float),
                                     cudaMemcpyDeviceToHost));
          if (c_err > *maxerr)
            *maxerr = c_err;
          terr = c_err;
        }
    }
  return (terr);
}[/code]
So the individual iterations run largely on the GPU. (The error flag is true every 100 iterations, or on every iteration when the extra error-checking option is enabled.)

Edit: In case it isn't obvious from the fact that lucas_square is a CPU function: the main LL loop runs on the CPU.
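For reference, that main loop is just this (a plain-Python sketch of the standard Lucas-Lehmer test; CUDALucas does the squaring step via the GPU FFT instead of with bignums, but the control flow is the same trivial loop):

```python
# Sketch of the Lucas-Lehmer main loop. The loop itself is trivial control
# flow; essentially all of the work is in the squaring, which CUDALucas's
# lucas_square() performs on the GPU.
def lucas_lehmer(p: int) -> bool:
    """Return True iff 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):       # p - 2 iterations, as in the CPU loop
        s = (s * s - 2) % m      # the step lucas_square() does on the GPU
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19) if lucas_lehmer(p)])
# prints [3, 5, 7, 13, 17, 19]  (M11 = 23 * 89 is composite)
```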

bulldog 2012-08-13 17:20

OK! Thank you.

I understand, of course, that the main loop in CUDALucas currently runs in CPU code. I don't actually know how much memory a single FFT requires, but I would guess that total available GPU memory could be a serious limitation, as could the memory access speed. I haven't checked this, but at first sight the arguments are reasonable. Thank you.

I also have another question, which is quite natural as well.
Suppose I have access to a cluster of 1000 Nvidia Teslas,
with MPI and CUDA naturally supported. Is it possible, in principle, to test the primality of a 12-million-digit Mersenne number in several hours of cluster time on such hardware? If yes, has the corresponding software already been written?

axn 2012-08-13 18:00

[QUOTE=Dubslow;307818]Edit: If it isn't obvious by the fact that lucas_square is a CPU function, the main LL loop is run on the CPU.[/QUOTE]
Not really. The CPU routine is just control logic; there is no actual computation running on the CPU.

[QUOTE=Dubslow;307813]Running more than one test in parallel on one card would double, triple, etc., the number of transfers, enormously increasing the time taken. [/QUOTE]

PCI transfer = cudaMemcpy

As you can see from the code, the data is transferred to the CPU only at checkpoints and on the final iteration; the error is transferred once every few iterations. That's not a whole lot. Otherwise, pretty much all of the computation happens inside the GPU, with zero involvement from the CPU.

xilman 2012-08-13 18:16

[QUOTE=bulldog;307814]Dubslow, thank you for the explanation. But it seems silly to transfer data between the CPU and GPU and to use CPU time at every iteration, rather than only at the very beginning and very end of the test. It seems that if one modified the algorithm to do everything inside a single GPU kernel, then it would be scalable and able to test many exponents in parallel...[/QUOTE]Nice argument, and sometimes the right way to go about things.

Your approach tends to optimise throughput rather than latency. More can be done per unit time but it can take many units of time before you get any output.

This effect is particularly noticeable with parallel implementations of ECM factoring. The EPFL people found that they could run roughly a hundred times as many curves per annum on a reasonable number of GPUs as they could on their PS3 cluster. Unfortunately, they would have to wait a year before any curve completed but then they'd have many thousands. On the PS3 the curves took on the order of a day to complete.

Which would you rather have: ultimate throughput or answers which arrive before your career is over?

Dubslow 2012-08-13 19:02

[QUOTE=axn;307823]Not really. The CPU routine is just control logic; there is no actual computation running on the CPU.



PCI transfer = cudaMemcpy
[/quote]
Yes, I know. (I never said the CPU did any work, I just said the main loop was [i]controlled[/i] by the CPU.)
[QUOTE=axn;307823]
As you can see from the code, the data is transferred to the CPU only at checkpoints and on the final iteration; the error is transferred once every few iterations. That's not a whole lot. Otherwise, pretty much all of the computation happens inside the GPU, with zero involvement from the CPU.[/QUOTE]
Earlier in the thread it occurred to me that it might be possible to create arrays of all the data that needs to be transferred: one array per variable, with each element holding a different parallel test's copy of that variable. The whole array could then be transferred for essentially the same price as transferring one element. The only place this argument fails is transferring the whole FFT data, which is FFTlen*8 bytes per exponent, quite a large amount of data compared to the rest of the variables.
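A toy model of why batching the small variables helps: each cudaMemcpy pays a fixed launch latency, so tiny transfers are latency-bound. The latency and bandwidth figures below are made-up illustrative assumptions:

```python
# Hypothetical transfer-cost model: fixed per-copy latency plus size/bandwidth.
# 10 us latency and 6 GB/s are assumed, illustrative numbers.
def copy_us(nbytes, latency_us=10.0, bytes_per_s=6e9):
    return latency_us + nbytes / bytes_per_s * 1e6

tests = 100
separate = tests * copy_us(4)   # one 4-byte error value copied per test
batched = copy_us(tests * 4)    # one array holding all 100 values
print(f"separate: {separate:.0f} us, batched: {batched:.2f} us")
```

In this model the batched copy is roughly 100x cheaper, because the per-copy latency, not the payload size, dominates small transfers.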

bulldog 2012-08-13 22:03

[QUOTE=Dubslow;307832]Yes, I know. (I never said the CPU did any work, I just said the main loop was [i]controlled[/i] by the CPU.)

Earlier in the thread it occurred to me that it might be possible to create arrays of all the data that needs to be transferred: one array per variable, with each element holding a different parallel test's copy of that variable. The whole array could then be transferred for essentially the same price as transferring one element. The only place this argument fails is transferring the whole FFT data, which is FFTlen*8 bytes per exponent, quite a large amount of data compared to the rest of the variables.[/QUOTE]

FFTlen*8 bytes is not a lot. That is only megabytes, not gigabytes, right?
In that case the device RAM would be enough to test 1000 Mersenne exponents in parallel.
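A quick sanity check of "only megabytes" for the 12-million-digit case mentioned earlier. The ~18 bits packed per FFT word is an assumed, rough density for an irrational-base DWT, not a figure from CUDALucas:

```python
# Rough FFT sizing for a 12-million-digit Mersenne number, assuming ~18 bits
# packed per FFT word (an assumption) and a power-of-two FFT length.
import math

digits = 12_000_000
p = math.ceil(digits / math.log10(2))        # exponent in bits, ~39.9M
fft_len = 1 << math.ceil(math.log2(p / 18))  # next power of two, ~4M words
mb = fft_len * 8 / 2 ** 20
print(f"p ~ {p}, FFT length ~ {fft_len}, residue ~ {mb:.0f} MiB")
```

Under these assumptions each test needs roughly 32 MiB for the residue alone (before weights and scratch buffers), so 1000 parallel tests would want tens of gigabytes, well beyond any single 2012-era card.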


All times are UTC.
