GPU and Mersenne primes
I have played a little with cudaLucas and gpuLucas, and I have the following question. These algorithms use all the threads and all the power of the GPU to test a single Mersenne exponent.
How does the efficiency depend on the total number of GPU threads used? Has anyone checked this? I haven't checked it myself, but I would guess the dependence is not linear. If so, why not change the algorithm? Why not use, say, each group of 128 threads to test a different exponent? One could then test thousands of Mersenne exponents in parallel on a single GPU. Wouldn't this be much more efficient and make much more sense?
This has been discussed in detail recently, but I can't find the thread again, so here's a summary:
Yes, the utilization/efficiency is particularly low for FFTs, but the problem is that each test requires some CPU time (not enough to show up in a System Monitor, but it's there). The PCI transfers are enormously laggy and are to be avoided. Running more than one test in parallel on one card would double, triple, etc. the number of transfers, enormously increasing the time taken. See, e.g., the third post in the CUDALucas [URL="http://www.mersenneforum.org/showthread.php?t=12576"]thread[/URL].
Dubslow, thank you for the explanation. But it seems silly to transfer data between the CPU and GPU and to use CPU time on every iteration, rather than only at the very beginning and very end of the test. If the algorithm were modified to do everything inside a single GPU kernel, it seems it would be scalable and able to test many exponents in parallel...
I don't know about the details, you'd have to ask msft why it's a nightmare.
[QUOTE=Dubslow;307813]This has been discussed in detail, recently, but I can't find it back, so a summary:[/QUOTE]
I thought CuLu did all of its compute inside the GPU? At least its cousin, GeneferCUDA, does. The main problem is the same one facing CPUs today: bandwidth to memory. There is also one more problem: total available memory. These two factors limit the number of tests that can execute concurrently.
[QUOTE=axn;307817]I thought that CuLu did all of the compute inside GPU? At least, its cousin GeneferCUDA does so.
The main problem is the same one facing CPUs today: bandwidth to memory. There is also one more problem: total available memory. These two factors limit the number of tests that can execute concurrently.[/QUOTE]

[url]https://sourceforge.net/p/cudalucas/code/37/tree/trunk/CUDALucas.cu?force=True[/url]

The CUDA kernel names are:
rftfsub_kernel
normalize_kernel
normalize2_kernel
last_normalize
...and of course the two FFT lib calls. Everything else is CPU. You'll have to go digging for details yourself.

[code]
double lucas_square (double *x, int N, int iter, int last, float *maxerr, int error_flag)
{
  double terr;

  rdft (N, 1, x, ip);  // CUFFT, rftfsub_kernel (presumably the convolution), CUIFFT

  if (iter == last)
    {
      cutilSafeCall (cudaMemcpy (x, g_x, sizeof (double) * N, cudaMemcpyDeviceToHost));
      terr = last_normalize (x, N, error_flag);
    }
  else
    {
      if ((iter % checkpoint_iter) == 0)
        {
          cutilSafeCall (cudaMemcpy (x, g_x, sizeof (double) * N, cudaMemcpyDeviceToHost));
          terr = last_normalize (x, N, error_flag);
        }
      if (error_flag || t_f)
        {
          normalize_kernel < 1 > <<<N / threads, threads >>>
            (g_x, threads, g_err, g_carry, g_mask, g_ttp, g_ttmp, g_numbits, *maxerr);
        }
      else
        {
          normalize_kernel < 0 > <<<N / threads, threads >>>
            (g_x, threads, g_err, g_carry, g_mask, g_ttp, g_ttmp, g_numbits, *maxerr);
        }
      normalize2_kernel <<< ((N + threads - 1) / threads + 127) / 128, 128 >>>
        (g_x, threads, g_carry, N, g_inv2, g_ttp2, g_ttmp2, g_inv3, g_ttp3, g_ttmp3);

      terr = 0.0;
      if (error_flag || (polite_f && (iter % polite) == 0))
        {
          float c_err;
          cutilSafeCall (cudaMemcpy (&c_err, g_err, sizeof (float), cudaMemcpyDeviceToHost));
          if (c_err > *maxerr)
            *maxerr = c_err;
          terr = c_err;
        }
    }
  return (terr);
}
[/code]

So the individual iterations are largely GPU. (The error flag is true every 100 iterations, or true every iteration with the extra error-checking option enabled.)

Edit: If it isn't obvious from the fact that lucas_square is a CPU function, the main LL loop runs on the CPU.
OK! Thank you.
I understand, of course, that the main loop in CUDALucas currently runs in CPU code. I don't actually know how much memory a single FFT requires, but I can imagine that total available GPU memory could be a serious limitation, as could memory access speed. I haven't checked, but at first sight the arguments are reasonable. Thank you.

I also have another question, which is quite natural. Suppose I have access to a cluster of 1000 Nvidia Teslas, with MPI and CUDA naturally supported. Is it possible, in principle, to test the primality of a 12-million-digit Mersenne number in several hours of cluster time on such hardware? If yes, has the corresponding software already been written?
[QUOTE=Dubslow;307818]Edit: If it isn't obvious by the fact that lucas_square is a CPU function, the main LL loop is run on the CPU.[/QUOTE]
Not really. The CPU routine is just a controller; no actual computation runs on the CPU.

[QUOTE=Dubslow;307813]Running more than one test in parallel on one card would double, triple, etc., the number of transfers, enormously increasing the time taken.[/QUOTE]

PCI transfer = cudaMemcpy

As you can see from the code, the data is transferred to the CPU only at checkpoints and on the final iteration. The error is transferred once every few iterations. That's not a whole lot. Otherwise, pretty much the whole computation happens inside the GPU, with zero involvement from the CPU.
[QUOTE=bulldog;307814]Dubslow, thank you for the explaination. But this seems silly to transfer data between CPU and GPU and to use the CPU time at each iteration, not only in the very beginning and in the very end of the test. It seems that if one modifies the algorithm in order to do everything inside a single GPU kernel, then the algorithm would be scalable and it would be able to test a lot of primes in parallel......[/QUOTE]Nice argument and sometimes the right way to go about things.
Your approach tends to optimise throughput rather than latency. More can be done per unit time, but it can take many units of time before you get any output. This effect is particularly noticeable with parallel implementations of ECM factoring. The EPFL people found that they could run roughly a hundred times as many curves per annum on a reasonable number of GPUs as they could on their PS3 cluster. Unfortunately, they would have to wait a year before any curve completed, but then they'd have many thousands; on the PS3, the curves took on the order of a day to complete. Which would you rather have: ultimate throughput, or answers that arrive before your career is over?
[QUOTE=axn;307823]Not really. The CPU routine is just a controller; no actual computation runs on the CPU.
PCI transfer = cudaMemcpy[/quote]

Yes, I know. (I never said the CPU did any work, just that the main loop is [i]controlled[/i] by the CPU.)

[QUOTE=axn;307823]As you can see from the code, the data is transferred to CPU only during checkpoints and final iteration. Error is being transferred once every few iterations. That's not a whole lot. Otherwise, pretty much the whole of computation is happening inside GPU, with zero involvement from CPU.[/QUOTE]

Earlier in the thread it occurred to me that it might be possible to create arrays of all the data that needs to be transferred: one array per variable, with each element corresponding to a different parallel test's copy of that variable. The whole array could then be transferred for essentially the same price as transferring one element. The only place this argument fails is the full FFT data, which is FFTlen*8 bytes per exponent, quite a lot of data compared to the rest of the variables.
[QUOTE=Dubslow;307832]Yes, I know. (I never said the CPU did any work, I just said the main loop was [i]controlled[/i] by the CPU.)
Earlier in the thread it occurred to me that it might be possible to create arrays of all the data that needs to be transferred, one array corresponding to one variable, each element corresponding to a different parallel test's copy of that variable; then transfer the whole array for essentially the same price as transferring the one element. The only place where this argument doesn't work is transferring the whole FFT data, which is FFTlen*8bytes of data per exponent, quite a large amount of data compared to the rest of the variables.[/QUOTE]

FFTlen*8 bytes is not a lot -- only megabytes, not gigabytes, right? In that case the device RAM would be enough to test 1000 Mersenne exponents in parallel.
I'm only thinking of two or three parallel tests.
An int is 4 bytes; even ten of them is only 40 bytes, and 40 bytes takes no longer to transfer than 4 bytes. FFTlen*8 bytes for a typical current double-check exponent is around [URL="http://www.wolframalpha.com/input/?i=1536*1024*8+bytes"]12.6 MB[/URL]. I imagine 126 MB would take perhaps a few ms longer to transfer, but keep in mind that (on my 460) each iteration itself only takes around 6-7 ms... Then again, such a heavy transfer would only occur at checkpoint iterations... I suppose it is doable, but there are a lot of other challenges that would have to be solved...
[QUOTE=bulldog;307865]FFTlen*8bytes is not a lot. This is only megabytes, not gigabytes, right?
in this case the device RAM would be enough for testing 1000 Mersenne exponents in parallel.[/QUOTE]

msft is the real person to answer your questions. I haven't studied CUDALucas carefully, but I'll give you my (somewhat informed) opinion. The key point, I think, is that CUDALucas uses the NVIDIA-provided FFT routines. NVIDIA is very proud of its FFT routines, and I believe they have been optimized to make good use of the parallel hardware. The data at [URL]http://mersenne-aries.sili.net/cudalucas.php[/URL] shows that performance scales well with the amount of parallel hardware on a card.

Despite some of their marketing literature, NVIDIA's cards aren't magic. Running 1000 Mersenne numbers in parallel isn't going to yield 1000x the throughput of an algorithm that already exploits the parallelism of the hardware.
[QUOTE=rcv;307883]Despite some of their marketing literature, NVIDIA's cards aren't magic. Running 1000 Mersenne numbers in parallel isn't going to yield 1000x throughput of an algorithm that already exploits the parallelism of the hardware.[/QUOTE]
Seconding that! On a GTX 580 (assuming your exponent is big enough to max out the card with one instance), running two cufft (CL) instances in parallel decreases the performance of each to about 47%, which means running only one instance is about 6% more profitable. Somehow the two instances get in each other's way, like football (soccer) players who shoot with the right foot into their own left leg... :smile:

You can get some speed increase if the exponent is small (small FFT size) and the card is not maxed out by a single instance. In that case (as with mfaktc, where more instances are always necessary) you can get some more output by running more instances, but the additional output comes from additional occupancy of the card. When the card is 97-100% occupied, there is no way to increase the output. The next step, if you really want more output, would be overclocking the card, but that is generally not profitable and not advisable (without liquid cooling, cheap or free electricity, etc.). Overclocking gives a small increase in output and a huge increase in power consumption, as has been discussed many times around here. The stability of the system decreases without good/expensive cooling equipment, and if you get a mismatch from time to time, all your increased output was in vain. If you increase your output by 10% but get one bad residue in 10, then in total you gained nothing, except a 30% higher electricity bill for the whole period. Don't ask how we know that...

Edit, a numeric example: on a GTX 580 clocked at 782 MHz, an LL double check of a 26M exponent takes about 17 hours with one instance, with the card 97% busy. Two tests in serial (one after the other) would take about 34 hours (all times rounded up; in addition, the card is a bit faster in these initial conditions).
Running both tests in parallel as two different CL instances gives an average ETA of 36 hours for the two tests to finish (average because usually one test finishes a bit earlier, resulting in "dead time" between the two instances, assuming equal worktodo files -- more wasted time).
[QUOTE=bulldog;307821]I also have another question, which is quite natural.
Suppose I have access to a cluster of 1000 Nvidia Teslas, with MPI and CUDA naturally supported. Is it possible, in principle, to test the primality of a 12-million-digit Mersenne number in several hours of cluster time on such hardware? If yes, has the corresponding software already been written?[/QUOTE]

From a superficial reading of the thread, I see that no one really picked up this gauntlet (sorry if anybody answered already). Yes, it is possible to test a 12-million-digit Mersenne number in several hours with a [B]single Tesla[/B], and the software is already written. In fact, adding to the numeric example in the former post, a Tesla M2090 takes about 30-35 hours to test a 40M exponent. My GTX 580 needs less than 40 hours with CUDALucas at FFT length 2592K (about 13% faster than the CL default of 2400K/2560K at this exponent size). On the other hand, a Tesla M2090 is about 1.5 times faster at double-precision arithmetic, but is clocked about 20-30% lower, so they finish the job in about the same time, with the Tesla a bit faster.

I have no idea how cufft scales across two Teslas; most probably it does not, or at least I was not able to make two Teslas behave as a "double-speed single Tesla". Better to run two CL instances, one per Tesla. If you have 1000 boards, run 1000 tests. That would be faster than running each test in parallel across all the boards and doing the 1000 tests one by one, because of the communication between them. Compare Prime95: the best-performing setup is one worker per CPU core; even though a single test finishes faster if you put more cores on it, when you draw the line at the end, the "single core, single worker" configuration still gives more output.