I'm only thinking of two or three parallel tests.
An int is 4 bytes; two, three, even ten ints are still only 40 bytes, and 40 bytes takes no longer to transfer than 4 bytes. FFTlen*8 bytes, for a typical current double-check exponent, is around [URL="http://www.wolframalpha.com/input/?i=1536*1024*8+bytes"]12.6 MB[/URL]. I imagine 126 MB would take perhaps a few ms longer to transfer, but keep in mind that (on my 460) each iteration itself only takes around 6-7 ms... Then again, such a heavy transfer would only occur at checkpoint iterations... I suppose it is doable, but there are a lot of other challenges that would have to be solved.
[QUOTE=bulldog;307865]FFTlen*8bytes is not a lot. This is only megabytes, not gigabytes, right?
in this case the device RAM would be enough for testing 1000 Mersenne exponents in parallel.[/QUOTE] msft is the real person to answer your questions. I haven't studied CUDALucas carefully, but I'll give you my (somewhat informed) opinion. The key point, I think, is that CUDALucas uses the NVIDIA-provided FFT routines. NVIDIA is very proud of those routines, and they have optimized them to make good use of the parallel hardware. The data at [URL]http://mersenne-aries.sili.net/cudalucas.php[/URL] shows that performance scales well with the amount of parallel hardware on a card. Despite some of their marketing literature, NVIDIA's cards aren't magic. Running 1000 Mersenne numbers in parallel isn't going to yield 1000x the throughput of an algorithm that already exploits the parallelism of the hardware.
[QUOTE=rcv;307883]Despite some of their marketing literature, NVIDIA's cards aren't magic. Running 1000 Mersenne numbers in parallel isn't going to yield 1000x throughput of an algorithm that already exploits the parallelism of the hardware.[/QUOTE]
Seconding that! On a GTX 580 (assuming your exponent is big enough to max out the card with one instance), running two cuFFT (CL) instances in parallel decreases the performance of each to about 47%. This means that running one instance only is about 6% more profitable. Somehow the two instances get in each other's way, like football (soccer) players who shoot with the right foot into their own left leg... :smile:

You can get some speedup if the exponent is small (small FFT size) and the card is not maxed out by a single instance. In that case (same as for mfaktc, where multiple instances are always necessary) you can get more output by running more instances, but the additional output comes from the additional occupancy of the card. When the card is 97-100% occupied, there is no way to increase the output.

The next step to increase output, if you really want it, would be overclocking the card, but that is generally not profitable and not advisable (without liquid cooling, cheap or free electricity, etc.). Overclocking yields a small increase in output and a huge increase in power consumption, as has been discussed many times around here. System stability decreases without good/expensive cooling equipment, and if you get a mismatch from time to time, all your increased output was in vain. If you increase your output by 10% but get one bad residue in 10, then in total you gained nothing, except a 30% higher electricity bill for the whole period. Don't ask how we know that...

Edit: numeric example, GTX 580 clocked at 782 MHz: an LL double check of a 26M exponent takes about 17 hours with one instance, with the card 97% busy. Two tests in series (one after the other) would take about 34 hours (all times rounded up; in addition, the card is a bit faster in these initial conditions). Running both tests in parallel, as two different CL instances, gives an average ETA of 36 hours for the two tests to finish (average, because usually one test finishes a bit faster, resulting in "dead time" between the two instances, assuming you have equal worktodo files, which wastes even more time).
[QUOTE=bulldog;307821]I also have another question, which is actually quite natural.
Suppose that I have access to a cluster with 1000 NVIDIA Teslas. MPI and CUDA are, naturally, supported. Is it possible in principle to test the primality of a 12-million-digit Mersenne number during several hours of cluster time, with such hardware? If yes, has the corresponding software already been written?[/QUOTE] I see from a superficial reading of the thread that no one really picked up this glove (sorry if anybody answered already). Yes, it is possible to test a 12-million-digit Mersenne number in several hours with a [B]single Tesla[/B], and the software is already written. In fact, adding to the numeric example in the former post, a Tesla M2090 takes about 30-35 hours to test a 40M exponent. My GTX 580 needs less than 40 hours of CudaLucas with FFT length 2592K (about 13% faster than the CL default 2400K/2560K at this size of exponent). On the other hand, a Tesla M2090 is about 1.5 times faster for DP calculation, but is clocked about 20-30% lower, so they finish the job in about the same time, with the Tesla a bit faster.

I have no idea how cufft scales across two Teslas; most probably it does not, or at least I was not able to make two Teslas behave as a "twice-as-fast single Tesla". Better to run two CL instances, each Tesla with its own. If you have 1000 boards, run 1000 tests. That would be faster than running one test in parallel on all the boards and doing 1000 tests one by one, because of the communication between them. See Prime95: the best throughput comes from running one worker per CPU core. Even though you can finish a single test faster by putting more cores on it, when you draw the line at the end, the "single core, single worker" configuration still gives more total output.