#342
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
13775₈ Posts
Quote:
I will try to get it working on my own PC soon, but my 750Ti isn't really capable of matching my 6700K.

#343
May 2004
FRANCE
2⁴×3×13 Posts
Quote:
(while IBDWT testing numbers of ~112,000 decimal digits). While testing with larger k's (rational base DWT and zero padding), the percentages become 31% and 80%, respectively!

Last fiddled with by Jean Penné on 2018-01-04 at 16:44 Reason: adding results

#344
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
2·5·11·47 Posts
No further questions; nevertheless, thank you for the llrcuda version. The next step is to tweak the code for better GPU efficiency.

#345
Sep 2006
The Netherlands
1447₈ Posts
On my newly installed Debian 9.3, as it turns out now that I have set it all up, gcc 6 is the default compiler there. CUDA 8 needs gcc 5.3.1 or earlier...
So I'm afraid my 'benchmarking' needs a bit more time, as I'm going to do a fresh Debian 8 install here to get CUDA 8 working. :)

Last fiddled with by diep on 2018-01-04 at 23:03

#346
Sep 2006
The Netherlands
3×269 Posts
Quote:
at 0.06% it goes down to 12.029 ms
0.09%: 12.030 ms
0.12%: 12.022 ms

Yet it compiles by default with -g here, and I didn't check whether the GPU is still lobotomized, nor whether it uses both GPUs of the Titan Z here. From CUDA toolkit 7.0 onward, NVIDIA by default lobotomizes double precision on cards older than the latest ones. You seemingly have some options to turn it back to what it should be; that's a software lobotomization. If I start 4 LLRcudas, I guess I get the same speed for each of them. I'll post an update soon if I manage to improve upon this...
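One way to answer the "is the GPU still lobotomized" question is to time a small double-precision kernel and compare the measured rate against the card's rated DP throughput. This is just a minimal sketch, not part of llrcuda; the kernel, grid size, and iteration count are arbitrary, and a dependent FMA chain per thread only approaches peak when enough warps are resident:

```cuda
// Rough double-precision throughput check (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp_fma_kernel(double *out, int iters)
{
    double a = 1.0 + threadIdx.x * 1e-9;
    double b = 1.000000001;
    for (int i = 0; i < iters; ++i)
        a = fma(a, b, 1e-12);          // one FMA = 2 double-precision flops
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dp_fma_kernel<<<blocks, threads>>>(d_out, iters);   // warm-up launch
    cudaEventRecord(start);
    dp_fma_kernel<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * (double)blocks * threads * iters;   // 2 flops per FMA
    printf("~%.1f GFLOPS double precision\n", flops / (ms * 1e-3) / 1e9);

    cudaFree(d_out);
    return 0;
}
```

If the reported number is a small fraction of the card's spec-sheet DP rate, the throttling is still in effect.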

#347
Sep 2006
The Netherlands
3×269 Posts
The limitation of cudallr is that one can only run 1 instance.
What would I need to change to run several? It seemingly eats resources on just 1 of the 2 GPUs that the dual-GPU Titan Z has.

If I run a single cudallr instance on the prime 69 * 2^3140225 - 1, the iteration time drops to 1.438 ms. If I try to start cudallr a second time, it refuses to launch. I used the same parameters Jean posted here. How do I launch it several times? Starting with the Kepler generation, the NVIDIA cards can handle 32 connections simultaneously (edit: not to be confused with the total number of connections, which is far, far more of course).

If I launch gpu_burn at the same time, the iteration time of cudallr drops to 6.0 - 6.6 ms, with 6.6 ms in the overwhelming majority. gpu_burn loses roughly 300 GFLOPS of double precision on one GPU. It might not be very accurate to put it that way, since gpu_burn doesn't get the maximum out of both GPUs anyway, so it's better to say the loss is *at least* 300 GFLOPS from a single GPU.

Not a very good SMP algorithm by the cuFFT library, if conditions favourable to race conditions make the scaling drop by a factor of 5 while it still swallows 300+ GFLOPS worth of resources and delivers a "Woltman performance" far, far under 10 GFLOPS. A factor of 30 of overhead wasted. Regrettably we cannot modify the cuFFT library, as the code is not public.

What we would want to do is run the cuFFT library's code on a single SIMD for each exponent we test and throw several exponents at each SIMD. On the Titan Z each SIMD delivers 64 * 0.7 GHz = 45G instructions per second, or 90 GFLOPS double precision (counting an FMA as two flops). There are 30 SIMDs spread over the 2 GPUs, so 2.7 TFLOPS available in total.

Last fiddled with by diep on 2018-01-06 at 13:00
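Hedged illustration, not actual llrcuda source: the Titan Z shows up to CUDA as two separate devices, so a second instance would have to bind to device 1 before doing any GPU work (including cuFFT plan creation). A minimal enumeration sketch, assuming only the standard CUDA runtime API:

```cuda
// List the CUDA devices the runtime exposes; a Titan Z should appear as two.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) visible\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %d SMs, %.0f MHz\n",
               d, prop.name, prop.multiProcessorCount,
               prop.clockRate / 1000.0);   // clockRate is reported in kHz
    }

    // A second instance targeting the other half of the card would call
    // cudaSetDevice(1) before any other CUDA work in its process.
    return 0;
}
```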

#348
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts
Quote:
Note also that sometimes in these other applications there are gains from running multiple instances on the same GPU. Try it on llrcuda, time it, and see. There are also pitfalls, such as the NVIDIA driver timeout and restart under Windows, or thermal shutdown of a GPU, which can cause CUDA application instances intended to run on different GPUs to land on the same one at times.

In the meantime, you could run llrcuda on device 0 and other CUDA code on device 1.

Last fiddled with by kriesel on 2018-01-06 at 15:07

#349
Sep 2006
The Netherlands
327₁₆ Posts
Quote:
One instance per device is just wasting power. Using cuFFT and trying to 'tune it' is like trying to get a Flintstone car with stone wheels to compete in Formula 1.

Last fiddled with by diep on 2018-01-06 at 17:18

#350
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts
Quote:
I'm stumped as to how you propose to run 120 instances when you're unsure of, and asking, how to run two. Then again, my unfamiliarity and the flu bug may be in the way.

#351
Dec 2014
FF₁₆ Posts
Quote:
I have machines with 4 GPUs each. The other NVIDIA programs have a -d argument (device) to say which GPU to run on.

Last fiddled with by bgbeuning on 2018-01-14 at 00:25
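A minimal sketch of that -d convention, assuming nothing about llrcuda's real option handling (the flag name and parsing here are illustrative): read the device index from the command line and call cudaSetDevice() before any other CUDA work, so each launched instance lands on its own GPU.

```cuda
// Illustrative "-d <device>" handling; not llrcuda's actual argument parser.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int device = 0;                            // default: first GPU
    for (int i = 1; i + 1 < argc; ++i)
        if (strcmp(argv[i], "-d") == 0)
            device = atoi(argv[i + 1]);

    if (cudaSetDevice(device) != cudaSuccess) {
        fprintf(stderr, "cannot select CUDA device %d\n", device);
        return 1;
    }
    printf("running on CUDA device %d\n", device);

    // ... the LL/cuFFT work for this instance would follow here ...
    return 0;
}
```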

#352
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
I have machines with 2 and 3 GPU cards each, one GPU per card, and software support for device selection is essential. There's no impediment to running more than one instance per GPU, if the software supports selecting other than the "first" GPU. In some cases the available memory supports many instances per GPU (particularly TF, but also often LL and P-1).

In some software that aims for maximum throughput from a single instance, one instance already provides considerable utilization, yet multiple instances can still improve total throughput, sometimes by more than 10 percent. Running dissimilar code (TF, LL, P-1) shows this on some NVIDIA cards. In other cases multiple instances lower total throughput slightly (TF or P-1 alongside LL on a GTX 1060 or 1070). In other cases, all GPU memory is best applied to a single instance (P-1 stage 2, for example).
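A rough sketch along those lines, assuming a made-up per-instance footprint of 512 MiB: query the free memory on each device with cudaMemGetInfo() to estimate how many instances could share each GPU before committing to a launch plan.

```cuda
// Estimate how many fixed-size instances fit per GPU (footprint is assumed).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t per_instance = 512ull << 20;   // assumed ~512 MiB per instance
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("device %d: %zu MiB free of %zu MiB -> room for ~%zu instance(s)\n",
               d, free_bytes >> 20, total_bytes >> 20,
               free_bytes / per_instance);
    }
    return 0;
}
```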
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| LLRcuda | shanecruise | Riesel Prime Search | 8 | 2014-09-16 02:09 |
| LLRCUDA - getting it to work | diep | GPU Computing | 1 | 2013-10-02 12:12 |