mersenneforum.org llrCUDA

2018-01-04, 15:00   #342
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

3×5×397 Posts

Quote:
 Originally Posted by Jean Penné About your last question, I presently don't know... Regards, Jean
As a test of this, it would be worth comparing something like 3*2^11895718-1 on both CPU and GPU.
I will try to get it working on my own PC soon, but my 750 Ti isn't really capable of matching my 6700K.

2018-01-04, 16:38   #343
Jean Penné

May 2004
FRANCE

246₁₆ Posts

Quote:
 Originally Posted by pinhodecarlos Jean, whilst running llrcuda do you see CPU consumption? What’s the CPU percentage load during the candidate test?
Presently, I see 84 to 85% of the GPU used by llrCUDA, and 50 to 52% of the CPU used
(while IBDWT-testing ~112000-decimal-digit numbers).
While testing with larger k's (rational base DWT and zero padding),
the percentages become 31% and 80% respectively!

Last fiddled with by Jean Penné on 2018-01-04 at 16:44 Reason: adding results

2018-01-04, 21:05   #344
pinhodecarlos

"Carlos Pinho"
Oct 2011
Milton Keynes, UK

1390₁₆ Posts

Quote:
 Originally Posted by Jean Penné Presently, I see 84 to 85% of GPU used by llrCUDA, and 50 to 52% CPU used. (while IBDWT testing ~112000 decimal digits numbers). while testing with larger k's (Rational base DWT and zero padding), the percentages become respectively 31% and 80% !
No further questions; nevertheless, thank you for the llrcuda version. The next steps are to tweak the code for better GPU efficiency.

2018-01-04, 23:03   #345
diep

Sep 2006
The Netherlands

2×17×23 Posts

On my newly installed Debian 9.3, as it appeared after I had set it all up, gcc 6 is the default compiler. CUDA 8 needs gcc 5.3.1 or earlier... So I'm afraid my 'benchmarking' needs a bit more time, as I'm going to do a fresh Debian 8 install here to get CUDA 8 working there :)

Last fiddled with by diep on 2018-01-04 at 23:03
2018-01-05, 18:07   #346
diep

Sep 2006
The Netherlands

2×17×23 Posts

Quote:
 Originally Posted by Jean Penné
 That is one of the largest known non-Mersenne primes:

 jpenne@421360c21a63:~/llrcuda381/llrcuda381linux64$ ./llrCUDA -a1 -oVerbose=1 -d -q"10223*2^31172165+1"
 Starting Proth prime test of 10223*2^31172165+1
 10223*2^31172165+1, bit: 60000 / 31172178 [0.19%]. Time per bit: 11.648 ms.

 To be compared to:

 jpenne@crazycomp:~$ llr64 -a2 -t4 -oVerbose=1 -d -q"10223*2^31172165+1"
 Starting Proth prime test of 10223*2^31172165+1
 Using all-complex FMA3 FFT length 2560K, Pass1=640, Pass2=4K, 4 threads, a = 3
 10223*2^31172165+1, bit: 60000 / 31172178 [0.19%]. Time per bit: 7.542 ms.

 So, not large enough to really compete... Regards, Jean
Initial results here: time per bit at 0.03% is 12.665 ms;
at 0.06% it goes down to 12.029 ms;
0.09%: 12.030 ms;
0.12%: 12.022 ms.

Yet it compiles by default with -g here, and I haven't checked whether the GPU is still lobotomized, nor whether it uses both GPUs of the Titan Z here.

Starting with CUDA toolkit 7.0 and newer, NVIDIA by default lobotomizes double precision on cards older than the latest ones. You seemingly have some options to turn it back to what it should be. That's a software lobotomization.

If I start 4 llrCUDAs, I guess I'd get the same speed for each one of them.

I'll post an update soon if I manage to improve upon this...

2018-01-06, 12:40   #347
diep

Sep 2006
The Netherlands

2×17×23 Posts

The limitation of cudallr is that one can only run 1 instance. What would I need to change to run several? It is seemingly eating resources on just 1 of the 2 GPUs that the dual-GPU Titan Z has.

If I run a single instance of cudallr on the prime 69*2^3140225-1, its iteration time drops to 1.438 ms. If I try to start cudallr a second time, it refuses to launch. I used the same parameters Jean posted here. How do I launch it several times? Starting with the Kepler generation, Nvidia cards can handle 32 connections simultaneously (edit: not to be confused with the total number of connections, which is far, far more of course).

If I launch gpu_burn simultaneously, the iteration time of cudallr drops to 6.0 to 6.6 ms, with 6.6 ms there in the overwhelming majority. gpu_burn loses roughly 300 GFLOPS double precision on 1 GPU. It might not be very accurate to put it that way, as gpu_burn doesn't get the maximum out of both GPUs anyway, so it's better to say the loss is *at least* 300 GFLOPS from a single GPU.

Not a very good SMP algorithm by the cufft library, if introducing favourable circumstances for race conditions causes scaling to drop by a factor of 5 while still swallowing 300+ GFLOPS worth of resources, and all with a "Woltman performance" far, far under 10 GFLOPS. A factor of 30 overhead wasted. Regrettably we cannot modify the cufft library, as its code is not public.

What we would want to do is run the cufft library's code on a single SIMD for each exponent we test, and throw several exponents at each SIMD. On the Titan Z each SIMD delivers 64 × 0.7 = 45G instructions per second, or 90 GFLOPS double precision. There are 30 SIMDs spread over 2 GPUs; 2.7 TFLOPS available in total.

Last fiddled with by diep on 2018-01-06 at 13:00
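Diep's back-of-the-envelope Titan Z figures can be checked directly. A small sketch, assuming (as the post does) 64 double-precision lanes per SMX, a 0.7 GHz clock, 2 FLOPs per FMA, and 30 SMXes across the two GPUs:

```shell
# Sanity-check the Titan Z throughput arithmetic from the post:
# 64 DP lanes/SMX * 0.7 GHz = 44.8 G FMA/s per SMX (the "~45G instructions")
# * 2 FLOPs per FMA         = 89.6 GFLOPS per SMX  (the "~90 GFLOPS")
# * 30 SMX over both GPUs   = 2688 GFLOPS          (the "~2.7 TFLOPS")
awk 'BEGIN {
  lanes = 64; ghz = 0.7; smx = 30
  fma    = lanes * ghz     # G FMA operations per second, per SMX
  gflops = fma * 2         # each FMA counts as 2 FLOPs
  total  = gflops * smx
  printf "per-SMX: %.1f GFLOPS, total: %.3f TFLOPS\n", gflops, total / 1000
}'
# prints: per-SMX: 89.6 GFLOPS, total: 2.688 TFLOPS
```

So the rounded 90 GFLOPS/SMX and 2.7 TFLOPS totals in the post are consistent with each other.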
2018-01-06, 14:58   #348
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1011111100001₂ Posts

Quote:
 Originally Posted by diep The limitation of cudallr is that one can only run 1 instance. What would i need to change to run several? It's eating seemingly resources at just 1 out of the 2 gpu's that the double gpu titanZ has. If i run single instance cudallr the prime: 69 * 2 ^ 3140225 - 1 then iteration time drops to 1.438 ms for it. If i try to start cudallr a 2nd time, it refuses to launch. I used same parameters like Jean posted here. How do i launch it several times? Starting with Kepler generation the Nvidia cards can handle 32 connections simultaneously (edit: not to confuse with total number of connections which is far far more of course) If i launch gpu burn simultaneously, then iteration time of cudallr drops to 6.0 - 6.6 ms with 6.6 ms being there in overwhelming majority. gpuburn loses at 1 gpu roughly 300 gflops double precision. Now that might not be very accurate to mention it that way as gpu burn doesn't get the maximum out of both gpu's anyway, so it's better to mention the loss is *at least* 300 gflops from a single gpu. Not a very good SMP algorithm by the cufft library if introducing favourable circumstances for race conditions cause scaling to drop by factor 5 meanwhile still swallowing 300 gflops+ worth of resources while having a "Woltman performance" far far under 10 Gflops. Factor 30 overhead wasted. Regrettably we cannot modify the cufft library as the code is not public. What we would want to do is run the cufft libraries code at a single SIMD for each exponent we test and throw several exponents at each SIMD. At titanZ each SIMD delivers 64 * 0.7 = 45G instructions a clock or 90 Gflops double precision. There is 30 SIMDs spreaded over 2 gpu's. 2.7Tflop available in total.
Have a look at other source code for how CUDALucas, mfaktc, etc. support using a GPU device other than the lowest-numbered one. (One GPU per instance, usually running in separate directories.)

Note also that in these other applications there are sometimes gains from running multiple instances on the same GPU. Try it with llrcuda, time it, and see.

There are also pitfalls, such as the NVIDIA driver timeout and restart under Windows, or thermal shutdown of a GPU, which can cause CUDA application instances intended to run on different GPUs to land on the same one at times.

In the meantime, you could run llrcuda on device 0 and other CUDA code on device 1.

Last fiddled with by kriesel on 2018-01-06 at 15:07
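For anyone wanting to try the one-instance-per-GPU setup kriesel describes without modifying llrCUDA: the CUDA runtime honors the standard CUDA_VISIBLE_DEVICES environment variable, so each instance can be pinned to one GPU from the shell. A sketch, with the llrCUDA invocation itself left as an illustrative comment (untested here); each instance gets its own directory so save/ini files don't collide:

```shell
#!/bin/sh
# Pin one llrCUDA instance to each GPU of a dual-GPU card (devices 0 and 1),
# each running in its own working directory.
for dev in 0 1; do
  mkdir -p "run$dev"
  # A real run would look something like (illustrative, not tested):
  #   (cd "run$dev" && CUDA_VISIBLE_DEVICES=$dev ./llrCUDA -d -q"69*2^3140225-1" &)
  # Here we only demonstrate the per-instance device masking each would see:
  (cd "run$dev" && env CUDA_VISIBLE_DEVICES="$dev" \
     sh -c 'echo "instance in $PWD sees GPU(s): $CUDA_VISIBLE_DEVICES"')
done
```

With CUDA_VISIBLE_DEVICES=0, the masked instance sees only that GPU as device 0, so no source change is needed for device selection.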

2018-01-06, 17:15   #349
diep

Sep 2006
The Netherlands

2·17·23 Posts

Quote:
 Originally Posted by kriesel Have a look at other source code for how CUDALucas, Mfaktc. etc support use of a GPU device other than the lowest numbered one. (One gpu per instance; usually running in separate directories.) Note also that sometimes in these other applications there are gains by running multiple instances on the same gpu. Try it on llrcuda, time it, and see. There are also pitfalls, such as the NVIDIA driver timeout and restart under Windows, or thermal shutdown of a gpu, which can cause CUDA application instances intended to run on different gpus to land on the same one at times. In the meantime, you could run llrcuda on device 0 and other cuda code on device 1.
The only thing that makes sense with llrCUDA is to try running 120 instances on the Titan Z I have (which is 2 CUDA devices) and have each llrCUDA consume a single SIMD. It doesn't scale well between SMXes (SIMDs, nowadays called SMs at Nvidia; back then, SMX, for this generation).

One instance per device is just wasting power.

Using cufft and trying to 'tune it' is like trying to get a Flintstone car with stone wheels to compete in Formula 1.

Last fiddled with by diep on 2018-01-06 at 17:18

2018-01-06, 19:02   #350
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

17E1₁₆ Posts

Quote:
 Originally Posted by diep The only thing that makes sense with LLRCuda is try run 120 instances of LLRcuda for the TitanZ i have (which is 2 cuda devices) and have each LLRCUDA consume a single SIMD. It's not scaling well between SMX'es (SIMDs, nowadays called SM's at Nvidia - back then SMX for this generation). One instance per device is just wasting power. Using CUfft and trying to 'tune it' is like trying to get a Flintstone car with stone wheels compete in the formula 1.
I have no hands-on experience with llrcuda, but a quick trip through the readme didn't reveal a provision for specifying which GPU device it uses.

I'm stumped as to how you propose to run 120 instances when you're unclear on, and asking about, how to run two.

Then again my unfamiliarity and the flu bug may be in the way.

2018-01-14, 00:24   #351
bgbeuning

Dec 2014

11111111₂ Posts

Quote:
 Originally Posted by diep The only thing that makes sense with LLRCuda is try run 120 instances of LLRcuda for the TitanZ i have (which is 2 cuda devices) and have each LLRCUDA consume a single SIMD.
This seems to assume one GPU PCI card per machine.
I have machines with 4 GPUs each.
The other NVIDIA programs have a -d (device) argument to say which GPU to run on.

Last fiddled with by bgbeuning on 2018-01-14 at 00:25

2018-01-19, 01:20   #352
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

6,113 Posts

Quote:
 Originally Posted by bgbeuning This seems to assume one GPU PCI card per machine. I have machines with 4 GPU each. The other nvidia programs have a -d argument (device) to say which GPU to run on.
And it's common, I've read, for code to need -d 0 specified in one instance to use one GPU of a dual-GPU card, and -d 1 in another instance to use the other GPU. (That was CUDA app syntax; OpenCL apps would typically use -d 11 and -d 12.)
I have machines with 2 and 3 GPU cards each, one GPU per card, so software support for device selection is essential.

There's no impediment to running more than one instance per GPU, provided the software supports selecting a GPU other than the "first" one. In some cases the available memory supports many instances per GPU (particularly TF, but also often LL and P-1). Even in software designed for maximum throughput from a single instance, where one instance already provides considerable utilization, multiple instances can improve total throughput, sometimes by more than 10 percent; running dissimilar code (TF, LL, P-1) shows this on some NVIDIA cards. In other cases multiple instances lower total throughput slightly (TF or P-1 alongside LL on a GTX 1060 or 1070). In still other cases, all GPU memory is best applied to a single instance (P-1 stage 2, for example).
