#2696 — "Eric", Jan 2018, USA, 2²×5³ Posts

Quote:
#2697 — "Oliver", Mar 2005, Germany, 11·101 Posts

Quote:

Oliver
#2698 — "TF79LL86GIMPS96gpu17", Mar 2017, US midwest, 31×173 Posts

Quote:
Code:
CUDA version info
  binary compiled for CUDA   8.0
  CUDA runtime version       8.0
  CUDA driver version        10.0
CUDA device info
  name                       GeForce RTX 2080
  compute capability         7.5

A separate issue: via Windows TDR for low-compute-capability devices, or perhaps regardless of sm/CC through hardware problems, a device dropping out in a multi-GPU system may actually change, during a single run, which GPU the program instance is using. In that case the required minimum may change mid-run. Detecting the change by periodic recheck, and at least providing a clear message if not an orderly termination, would be useful. On the flip side, detecting a compute capability that is too low for the minimum CUDA version available, and providing a clear message about that case, would also be useful.

Last fiddled with by kriesel on 2018-10-06 at 18:27
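The startup check suggested above can be sketched with the standard CUDA runtime API (`cudaGetDeviceProperties`, `cudaRuntimeGetVersion`). This is only an illustration of the idea, not CUDALucas code: the minimum compute capability passed in is the caller's assumption (CUDA 8.0 binaries can target CC 2.0+, CUDA 9 dropped Fermi), and the function names besides the CUDA API calls are made up here. It requires the CUDA toolkit to build.

```c
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Return 0 if device `dev` meets the required compute capability,
   -1 (with a clear message) otherwise. `min_major.min_minor` is
   whatever minimum the binary was compiled for -- an input here,
   not something this sketch derives. */
int check_device(int dev, int min_major, int min_minor) {
    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "device %d: property query failed "
                        "(device dropped out?)\n", dev);
        return -1;
    }
    if (prop.major < min_major ||
        (prop.major == min_major && prop.minor < min_minor)) {
        fprintf(stderr, "device %d (%s): compute capability %d.%d is "
                        "below the required %d.%d\n",
                dev, prop.name, prop.major, prop.minor,
                min_major, min_minor);
        return -1;
    }
    return 0;
}
```

Re-running this periodically, and also comparing `prop.name` and `prop.pciBusID` against the values recorded at startup, would catch the device-switch case described in the post and allow an orderly message instead of a silent change of hardware.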
#2699 — "Eric", Jan 2018, USA, 2²·5³ Posts

Quote:
#2700 — P90 years forever!, Aug 2002, Yeehaw, FL, 2×5³×71 Posts
If I understand the 2080 architecture correctly, LL test speed could be improved (perhaps greatly) by going to 128-bit fixed-point reals represented as four 32-bit integers. I investigated this somewhat four years ago, when 32-bit adds had a huge throughput advantage but 32-bit multiplies had no advantage over DP throughput. IIUC, on the 2080 both 32-bit adds and 32-bit multiplies have a huge throughput advantage over DP.

The basic idea: adding two 128-bit fixed-point reals requires four 32-bit adds (with carries) plus some overhead for handling signs. Multiplying two 128-bit fixed-point reals requires sixteen 32-bit multiplies, plus some adds, and some overhead for handling signs.

Each FFT butterfly adds and subtracts FFT data values, which increases the maximum FFT data value by one bit. Thus the fixed-point reals must be shifted one bit prior to a butterfly (i.e. the implied radix point moves). This adds some overhead to a fixed-point-real FFT implementation.

My research indicated we could store as many as 51 bits of input data in each 128-bit fixed-point real. This (51/128) is much more memory-efficient than current DP FFTs, which store only about 17 bits of data in each 64-bit double.

Is there any flaw in my understanding of the 2080 architecture? Does anyone have time to explore the feasibility of this approach?
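The sixteen-multiply count above can be sketched in plain C. This is a hedged illustration, not code from any GIMPS project: the `fix128` limb layout and all names are assumptions, and sign handling and the radix-point bookkeeping the post describes are omitted. It shows one 128×128-bit magnitude multiply as sixteen 32×32→64-bit partial products with carry-propagating adds.

```c
#include <stdint.h>
#include <assert.h>

/* A 128-bit magnitude as four 32-bit limbs, least significant first.
   Layout and names are illustrative assumptions. */
typedef struct { uint32_t w[4]; } fix128;

/* Full 256-bit product of two 128-bit magnitudes: sixteen 32x32->64-bit
   multiplies plus carry-propagating adds, as described in the post.
   out[0] is the least significant 32-bit limb of the product. */
static void mul128(const fix128 *a, const fix128 *b, uint32_t out[8]) {
    uint64_t acc[8] = {0};          /* each slot holds one 32-bit limb */
    for (int i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {
            /* 32x32->64 multiply, plus limb already there, plus carry;
               the sum cannot overflow 64 bits. */
            uint64_t p = (uint64_t)a->w[i] * b->w[j] + acc[i + j] + carry;
            acc[i + j] = (uint32_t)p;
            carry = p >> 32;
        }
        acc[i + 4] = carry;         /* slot i+4 is untouched until now */
    }
    for (int k = 0; k < 8; k++) out[k] = (uint32_t)acc[k];
}
```

A fixed-point implementation would pair this with the one-bit shifts before each butterfly and with sign handling; the point of the sketch is just that the inner loop is exactly the sixteen cheap 32-bit multiplies the post counts.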
#2701 — "TF79LL86GIMPS96gpu17", Mar 2017, US midwest, 31×173 Posts

Quote:
Code:
First system:
  GTX 1050 Ti  92% TDP, fan 49%, gpu temp 82C, perfcap: Pwr, Therm, idle alternately (in nonstop mfaktc)
  GTX 1070     ~80% TDP, fan 93%, gpu temp 87C, perfcap: mostly Therm, some idle (in nonstop mfaktc)
  Quadro 2000  perfcap: none
Second system:
  GTX 1060     70% TDP, fan 60%, gpu temp 74C, perfcap: VRel (in nonstop cudapm1)
  Quadro 4000  perfcap: none
  Quadro 2000  perfcap: none
Third system:
  GTX 1080     49% TDP, fan 87%, gpu temp 83C, perfcap: Therm (in nonstop cudapm1), gpu load 99%, memory controller 74%
  Quadro 4000  perfcap: none
Fourth system:
  Quadro 5000  perfcap: none
  GTX 480      perfcap: none

Even so, the GTX 1080 pounds out ~882 GHzD/day in mfaktc as-is at 88% GPU load. Two mfaktc instances take it up to ~920 GHzD/day aggregate, at 95% GPU load, 93C, and 70% TDP.
#2702 — "Oliver", Mar 2005, Germany, 11·101 Posts

Quote:

Oliver
#2703 — "Eric", Jan 2018, USA, 3248 Posts

I am currently running some double checks to test it out. So far it is outputting a nonzero residue at every checkpoint (i.e. not 0x000...0; in my case every 100,000 iterations), and it seems to be going pretty fast too. I still have around 4 hours left on the exponent, and hopefully it works with cuFFT64_100, because that is a fairly significant speedup (at least on my Titan V; I haven't tested the 1070 yet).

Last fiddled with by xx005fs on 2018-10-06 at 23:00
#2704 — "Eric", Jan 2018, USA, 2²×5³ Posts

Quote:

Last fiddled with by xx005fs on 2018-10-07 at 02:46
#2705 — "Eric", Jan 2018, USA, 2²·5³ Posts

Quote:

Generally I favor AMD GPUs for this project because of their much higher DP throughput and memory bandwidth compared to similarly priced GeForce cards. I am not really a fan of TF, because those cards simply pull too much power and the GPU temperatures make me uncomfortable.
#2706 — "James Heinrich", May 2004, ex-Northern Ontario, 110101010000₂ Posts
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| Don't DC/LL them with CudaLucas | LaurV | Data | 131 | 2017-05-02 18:41 |
| CUDALucas / cuFFT Performance on CUDA 7 / 7.5 / 8 | Brain | GPU Computing | 13 | 2016-02-19 15:53 |
| CUDALucas: which binary to use? | Karl M Johnson | GPU Computing | 15 | 2015-10-13 04:44 |
| settings for cudaLucas | fairsky | GPU Computing | 11 | 2013-11-03 02:08 |
| Trying to run CUDALucas on Windows 8 CP | Rodrigo | GPU Computing | 12 | 2012-03-07 23:20 |