#1
Aug 2012
Moscow, Russia
22 Posts
I have played a bit with CUDALucas and gpuLucas, and I have the following question. These programs use all the threads, and all the power, of the GPU to test a single Mersenne exponent.
How does efficiency depend on the total number of GPU threads used? Has anyone checked this? I haven't, but I would guess the dependency is not linear. If so, why not change the algorithm? Why not use, say, every 128 threads to test a different exponent? One could then test thousands of Mersenne exponents in parallel on a single GPU. Wouldn't that be much more efficient and make much more sense?
#2
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts
This has been discussed in detail recently, but I can't find the thread now, so a summary:
Yes, utilization/efficiency is particularly low for FFTs, but the problem is that one test requires some CPU time (not enough to show up in a system monitor, but it's there). The PCIe transfers are enormously laggy and are to be avoided. Running more than one test in parallel on one card would double, triple, etc. the number of transfers, greatly increasing the time taken. See, e.g., the third post in the CUDALucas thread.
#3
Aug 2012
Moscow, Russia
22 Posts
Dubslow, thank you for the explanation. But it seems silly to transfer data between CPU and GPU, and to use CPU time, at every iteration rather than only at the very beginning and the very end of the test. It seems that if one modified the algorithm to do everything inside a single GPU kernel, it would be scalable and able to test many primes in parallel...
#4
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
160658 Posts
I don't know about the details; you'd have to ask msft why it's a nightmare.
#5
Jun 2003
5,087 Posts
Quote:
The main problem is the same one facing CPUs today: bandwidth to memory. There is also a second problem: total available memory. These two factors limit the number of tests that can execute concurrently.
#6
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
11100001101012 Posts
Quote:
The CUDA kernel names are:
rftfsub_kernel
normalize_kernel
normalize2_kernel
last_normalize
...and of course the two FFT lib calls. Everything else is CPU. You'll have to go digging for details yourself.

Code:
double
lucas_square (double *x, int N, int iter, int last, float* maxerr,
int error_flag)
{
double terr;
rdft (N, 1, x, ip); // CUFFT, rftfsub_kernel (presumably the convolution), CUIFFT
if (iter == last)
{
cutilSafeCall (cudaMemcpy
(x, g_x, sizeof (double) * N, cudaMemcpyDeviceToHost));
terr = last_normalize (x, N, error_flag);
}
else
{
if ((iter % checkpoint_iter) == 0)
{
cutilSafeCall (cudaMemcpy
(x, g_x, sizeof (double) * N,
cudaMemcpyDeviceToHost));
terr = last_normalize (x, N, error_flag);
}
if (error_flag || t_f)
{
normalize_kernel < 1 > <<<N / threads, threads >>> (g_x, threads,
g_err, g_carry,
g_mask, g_ttp,
g_ttmp,
g_numbits,
*maxerr);
}
else
{
normalize_kernel < 0 > <<<N / threads, threads >>> (g_x, threads,
g_err, g_carry,
g_mask, g_ttp,
g_ttmp,
g_numbits,
*maxerr);
}
normalize2_kernel <<< ((N + threads - 1) / threads + 127) / 128,
128 >>> (g_x, threads, g_carry, N, g_inv2, g_ttp2, g_ttmp2, g_inv3,
g_ttp3, g_ttmp3);
terr = 0.0;
if (error_flag || (polite_f && (iter % polite) == 0))
{
float c_err;
cutilSafeCall (cudaMemcpy
(&c_err, g_err, sizeof (float),
cudaMemcpyDeviceToHost));
if(c_err > *maxerr)
*maxerr = c_err;
terr = c_err;
}
}
return (terr);
}
Edit: in case it isn't obvious from the fact that lucas_square is a CPU function: the main LL loop runs on the CPU.

Last fiddled with by Dubslow on 2012-08-13 at 16:58
#7
Aug 2012
Moscow, Russia
22 Posts
OK! Thank you.
I understand, of course, that the main loop in CUDALucas currently lives in the CPU code. I don't actually know how much memory a single FFT requires, but I can guess that total available GPU memory could be a serious limitation, as could memory access speed. I haven't checked, but at first sight the arguments are reasonable. Thank you.

I also have another question, which is quite natural. Suppose I had access to a cluster of 1000 Nvidia Teslas, with MPI and CUDA naturally supported. Is it possible, in principle, to test the primality of a 12-million-digit Mersenne number in several hours of cluster time on such hardware? If yes, has the corresponding software already been written?
#8
Jun 2003
10011110111112 Posts
Quote:
As you can see from the code, the data is transferred to the CPU only at checkpoints and on the final iteration, and the error value is transferred once every few iterations. That's not a whole lot. Otherwise, pretty much the whole computation happens inside the GPU, with essentially zero involvement from the CPU.
#9
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
2A2216 Posts
Quote:
Your approach optimises throughput rather than latency. More can be done per unit time, but it can take many units of time before you get any output. This effect is particularly noticeable with parallel implementations of ECM factoring. The EPFL people found that they could run roughly a hundred times as many curves per annum on a reasonable number of GPUs as they could on their PS3 cluster. Unfortunately, they would have to wait a year before any curve completed, but then they'd have many thousands; on the PS3, the curves took on the order of a day each. Which would you rather have: ultimate throughput, or answers that arrive before your career is over?
#10
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
11100001101012 Posts
Quote:
#11
Aug 2012
Moscow, Russia
416 Posts
Quote:
In this case the device RAM would be enough for testing 1000 Mersenne exponents in parallel.