mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

xx005fs 2018-10-06 18:01

[QUOTE=TheJudger;497508]Had the same issue yesterday when running benchmarks on an RTX 2080 Ti. In my case it was a combination of user error and lack of error checking. I had compiled CUDALucas for a quick & dirty benchmark only for sm75 (Turing). [U]Then I decided to check the performance of Volta with CUDA 10.0, too.[/U] The binary ran without any warnings or error messages, and benchmarks showed an improvement of nearly 50% over my previous benchmark. But during the first 30000 iterations of M57885161 for James I noticed those 0x0000000000000000 residues and knew something was wrong. Recompiling CUDALucas for sm70 solved the issue and performance was back at the same level as CUDA 9.1/9.2.

I know it is not perfect, but in mfaktc I do[CODE]
cudaError_t cudaError;  /* declared once */
cudaError = cudaGetLastError();
if(cudaError != cudaSuccess)
    printf("ERROR: cudaGetLastError() returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));[/CODE]

every now and then. From the host, those CUDA calls are asynchronous, so there is no return value and you have to ask for errors [I]later[/I]. This would catch those types of errors easily, and it is the main (but not the only) source of the famous[CODE]
[B]ERROR: cudaGetLastError() returned 8: invalid device function[/B][/CODE] in mfaktc, for example [URL="https://mersenneforum.org/showpost.php?p=496755&postcount=2883"]here[/URL] when running old binaries on an RTX 2080.

Oliver[/QUOTE]

I tried the precompiled 2.06 beta binary and it works like a charm. I also realized that if you take cuFFT64_100.dll and rename it to cuFFT64_80.dll (and do the same for the runtime DLLs), the speedup is substantial: I went from 0.95 ms/it to 0.79 ms/it on a Titan V at a really low power target, so I'm extremely happy about it now :smile:
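The "check every now and then" pattern Oliver quotes can be sketched in plain C without requiring the CUDA toolkit. Here `check_last_error()` is a hypothetical stub standing in for `cudaGetLastError()` (with a `fake_error_at` variable to inject an error for testing); `run_iterations()` and `CHECK_INTERVAL` are illustrative names, not CUDALucas/mfaktc internals:

```c
#include <stdio.h>

#define CHECK_INTERVAL 1000  /* iterations between error polls */

/* Hypothetical stand-in for cudaGetLastError(): returns 0 on success or a
 * nonzero error code. Set fake_error_at >= 0 to simulate an asynchronous
 * error becoming visible at that iteration. */
static int fake_error_at = -1;
static int check_last_error(int iter) {
    return (fake_error_at >= 0 && iter >= fake_error_at) ? 8 : 0;
}

/* Run up to max_iter iterations, polling for asynchronous errors every
 * CHECK_INTERVAL iterations. Returns the iteration at which an error was
 * detected, or max_iter if the run completed cleanly. */
int run_iterations(int max_iter) {
    for (int iter = 1; iter <= max_iter; iter++) {
        /* ... kernel launches would go here; they return immediately,
         * so any failure only shows up at a later error poll ... */
        if (iter % CHECK_INTERVAL == 0) {
            int err = check_last_error(iter);
            if (err != 0) {
                fprintf(stderr, "ERROR: cudaGetLastError() returned %d\n", err);
                return iter;  /* orderly exit instead of silent bad residues */
            }
        }
    }
    return max_iter;
}
```

The point of the interval is that polling costs almost nothing relative to the iteration work, yet bounds how long a run can continue producing garbage (such as all-zero residues) before anyone notices.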

TheJudger 2018-10-06 18:03

[QUOTE=James Heinrich;497462]Thanks, benchmark page is updated now with the single result. Does this look about right, with 2080 [i]significantly[/i] slower than V100?
[url]https://www.mersenne.ca/cudalucas.php?filter=V100|2080[/url][/QUOTE]

We knew that CUDALucas is limited by memory bandwidth on P100 and V100; some time ago I posted benchmarks of the P100 16 GiB and the P100 12 GiB (the latter with 75% of the memory bandwidth of the 16 GiB version). On the other hand, it is safe to assume that on current consumer cards the performance is compute bound (FP64 performance). And Turing's FP64 rate is, IIRC, 1/32 of its FP32 rate, while (consumer) Pascal's is 1/24 (again IIRC).

Oliver

kriesel 2018-10-06 18:23

[QUOTE=TheJudger;497508]I know it is not perfect, but in mfaktc I do[CODE]
cudaError_t cudaError;  /* declared once */
cudaError = cudaGetLastError();
if(cudaError != cudaSuccess)
    printf("ERROR: cudaGetLastError() returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));[/CODE]every now and then. From the host, those CUDA calls are asynchronous, so there is no return value and you have to ask for errors [I]later[/I]. This would catch those types of errors easily, and it is the main (but not the only) source of the famous[CODE]
[B]ERROR: cudaGetLastError() returned 8: invalid device function[/B][/CODE] in mfaktc, for example [URL="https://mersenneforum.org/showpost.php?p=496755&postcount=2883"]here[/URL] when running old binaries on an RTX 2080.

Oliver[/QUOTE]
It would be good to have a code fragment that compares the minimum of the first three numbers below to what the fourth requires. That could be implemented with a very small table.

[CODE]CUDA version info
    binary compiled for CUDA   8.0
    CUDA runtime version       8.0
    CUDA driver version        10.0

CUDA device info
    name                       GeForce RTX 2080
    compute capability         7.5[/CODE]If the minimum required version is not met at program launch, a clear message could be printed about that, followed by an orderly program exit.

A separate issue: whether through Windows TDR (for low compute capability devices) or, perhaps regardless of sm/CC, through hardware problems, a device dropping out in a multi-GPU system may actually change mid-run which GPU the program instance is using. In that case the required minimum may change during the run. Detecting the change by periodic recheck, and at least providing a clear message if not an orderly termination, would be useful.

On the flip side, detecting a compute capability that's too low for the CUDA version available, and providing a clear message in that case, would also be useful.
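The table-driven startup check suggested above might look roughly like this in plain C. The table entries and the `check_cuda_support()` helper are illustrative assumptions, not existing CUDALucas code; in a real build the three version numbers would come from `CUDART_VERSION`, `cudaRuntimeGetVersion()`, and `cudaDriverGetVersion()`, and the compute capability from `cudaGetDeviceProperties()`:

```c
#include <stdio.h>

/* Minimum CUDA toolkit version (scaled as major*1000 + minor*10, matching
 * the CUDA convention) needed to target a given compute capability.
 * Entries are illustrative, not exhaustive. */
struct cc_req { int major, minor, min_cuda; };

static const struct cc_req table[] = {
    { 6, 1,  8000 },   /* Pascal (GTX 10xx): needs CUDA >= 8.0  */
    { 7, 0,  9000 },   /* Volta (Titan V/V100): needs CUDA >= 9.0  */
    { 7, 5, 10000 },   /* Turing (RTX 20xx): needs CUDA >= 10.0 */
};

/* Compare the minimum of (compiled-for, runtime, driver) versions against
 * what the device's compute capability requires. Returns 0 if OK, nonzero
 * (with a clear message) if the minimum is not met. */
int check_cuda_support(int compiled, int runtime, int driver,
                       int cc_major, int cc_minor)
{
    int min_avail = compiled;
    if (runtime < min_avail) min_avail = runtime;
    if (driver  < min_avail) min_avail = driver;

    for (int i = 0; i < (int)(sizeof table / sizeof table[0]); i++) {
        if (table[i].major == cc_major && table[i].minor == cc_minor) {
            if (min_avail < table[i].min_cuda) {
                fprintf(stderr,
                        "ERROR: device (CC %d.%d) needs CUDA >= %d.%d, but the "
                        "minimum of compiled/runtime/driver versions is %d.%d\n",
                        cc_major, cc_minor,
                        table[i].min_cuda / 1000, (table[i].min_cuda % 1000) / 10,
                        min_avail / 1000, (min_avail % 1000) / 10);
                return 1;
            }
            return 0;
        }
    }
    return 0;  /* unknown CC: don't block, just proceed */
}
```

With the numbers from the example above (compiled 8.0, runtime 8.0, driver 10.0, device CC 7.5), the minimum of the first three is 8.0, which is below the 10.0 the RTX 2080 requires, so the check would fail at launch with a clear message instead of producing zero residues.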

xx005fs 2018-10-06 18:29

[QUOTE=TheJudger;497511]We knew that CUDALucas is limited by memory bandwidth on P100 and V100; some time ago I posted benchmarks of the P100 16 GiB and the P100 12 GiB (the latter with 75% of the memory bandwidth of the 16 GiB version). On the other hand, it is safe to assume that on current consumer cards the performance is compute bound (FP64 performance). And Turing's FP64 rate is, IIRC, 1/32 of its FP32 rate, while (consumer) Pascal's is 1/24 (again IIRC).

Oliver[/QUOTE]

I can definitely confirm that, as overclocking the memory clock on my 1070 doesn't really change the speed, while increasing the core clock speeds it up quite a bit. On AMD consumer cards with a 1/16 FP64:FP32 ratio, core and HBM clock increases give roughly equal improvement. The Titan V doesn't scale AT ALL with core clock above 1300 MHz, offering zero improvement even with a 1000 MHz HBM clock.

Prime95 2018-10-06 21:17

Project idea for an ambitious programmer
 
If I understand the 2080 architecture correctly, LL test speed could be improved (perhaps greatly) by going to 128-bit fixed-point reals represented as four 32-bit integers. I investigated this somewhat 4 years ago, when 32-bit adds had a huge throughput advantage but 32-bit multiplies had no advantage compared to DP throughput. IIUC, in the 2080 both 32-bit adds and 32-bit multiplies have a huge throughput advantage compared to DP throughput.

The basic idea is that adding two 128-bit fixed point reals requires four 32-bit adds (with carries) plus some overhead for handling signs. Multiplying two 128-bit fixed point reals requires sixteen 32-bit multiplies, plus some adds, and some overhead for handling signs.

Each FFT butterfly adds and subtracts FFT data values, which increases the maximum FFT data value by one bit. Thus, the fixed-point reals must be shifted one bit prior to a butterfly (i.e. the implied radix point must move). This adds some additional overhead to implementing a fixed-point real FFT.

My research indicated we could store as many as 51 bits of input data in each 128-bit fixed-point real. This (51/128) is much more memory-efficient than current DP FFTs, which store about 17 bits of data in each 64-bit double.

Is there any flaw in my understanding of the 2080 architecture? Does anyone have time to explore the feasibility of this approach?
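As a small feasibility sketch of the limb arithmetic described above: a schoolbook multiply of two 128-bit unsigned values held as four 32-bit limbs (least significant first), using exactly the sixteen 32-bit multiplies counted in the post. Sign handling and the implied-point shift are omitted, and the `mul128` name and limb layout are illustrative, not from any existing implementation:

```c
#include <stdint.h>

/* Multiply two 128-bit unsigned values, each stored as four 32-bit limbs
 * (little-endian), into an 8-limb (256-bit) product. Uses 16 single-width
 * 32x32->64 multiplies; partial products are accumulated in 64-bit slots
 * (at most eight 32-bit terms per slot, so no overflow), then carries are
 * propagated in one final pass. */
void mul128(const uint32_t a[4], const uint32_t b[4], uint32_t out[8])
{
    uint64_t acc[8] = {0};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            uint64_t p = (uint64_t)a[i] * b[j];  /* one of the 16 multiplies */
            acc[i + j]     += p & 0xFFFFFFFFu;   /* low 32 bits of partial  */
            acc[i + j + 1] += p >> 32;           /* high 32 bits of partial */
        }
    uint64_t carry = 0;
    for (int k = 0; k < 8; k++) {                /* propagate carries */
        uint64_t t = acc[k] + carry;
        out[k] = (uint32_t)t;
        carry = t >> 32;
    }
}
```

On the memory-density point: 51 data bits per 128-bit word is 51/128 ≈ 0.40 bits of data per stored bit, versus roughly 17/64 ≈ 0.27 for the DP FFTs mentioned above, which is consistent with the claimed efficiency gain.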

kriesel 2018-10-06 21:23

[QUOTE=xx005fs;497513]I can definitely confirm that as overclocking the memory clock on my 1070 doesn't really change speed, however, increasing core clock has increased the speed quite a bit. On AMD consumer cards with 1/16 FP:DP ratio has pretty much equal improvement on the core and HBM clock. The Titan V doesn't scale AT ALL with the core lock above 1300MHz offering 0 improvements even with 1000MHz HBM clock.[/QUOTE]

How much of the rated TDP is your 1070 running at? I see considerable differences between models within the same system and similar thermal environment. From GPU-Z,

[CODE]In one system:
GTX 1050Ti 92% TDP, fan 49% gpu temp 82C, Perfcap reason: Pwr, Therm, idle alternately (in nonstop mfaktc)
GTX 1070 ~80% TDP, fan 93% gpu temp 87C, Perfcap: mostly therm, some idle (in nonstop mfaktc)
quadro 2000 perfcap: none

second system,
GTX 1060 70% TDP, fan 60%, gpu temp 74C, perfcap: VRel (in nonstop cudapm1)
quadro 4000 perfcap: none
quadro 2000 perfcap: none

third system,
GTX 1080 49% TDP, fan 87%, gpu temp 83C, perfcap: therm (in nonstop cudapm1) gpu load 99%, memory controller 74%
quadro 4000 perfcap: none

fourth system,
quadro 5000 perfcap: none
GTX 480 perfcap: none[/CODE]I was shocked to see the GTX 1080 come up limited to half power. The 1080 benchmarks at 1.25 times the speed of the TDP-limited 1070, rather than the 1.5x I had expected. I may soon look into cramming another fan into each of systems 1-3.


Even so, the GTX 1080 pounds out ~882 GHzD/day in mfaktc as-is, at 88% GPU load. Two mfaktc instances take it up to ~920 GHzD/day aggregate, at 95% GPU load, 93C, and 70% TDP.

TheJudger 2018-10-06 21:53

[QUOTE=xx005fs;497510]I tried the precompiled 2.06 beta binary and it works like a charm. I also realized that if you take cuFFT64_100.dll and rename it to cuFFT64_80.dll (and do the same for the runtime DLLs), the speedup is substantial: I went from 0.95 ms/it to 0.79 ms/it on a Titan V at a really low power target, so I'm extremely happy about it now :smile:[/QUOTE]

Are you sure it produces correct results this way?

Oliver

xx005fs 2018-10-06 22:56

[QUOTE=TheJudger;497520]Are you sure it produces correct results this way?

Oliver[/QUOTE]

I am currently running a double check to test it out. So far it seems to be outputting a proper residue for every output (i.e. not 0x000...0; in my case every 100,000 iterations) and it seems to be going pretty fast too. I still have around 4 hours left on the exponent, and hopefully it works with cuFFT64_100, because it is a fairly significant speedup (at least for my Titan V; I haven't tested the 1070 yet).

xx005fs 2018-10-07 02:36

[QUOTE=xx005fs;497522]I am currently running a double check to test it out. So far it seems to be outputting a proper residue for every output (i.e. not 0x000...0; in my case every 100,000 iterations) and it seems to be going pretty fast too. I still have around 4 hours left on the exponent, and hopefully it works with cuFFT64_100, because it is a fairly significant speedup (at least for my Titan V; I haven't tested the 1070 yet).[/QUOTE]

Update: the double check completed successfully with a correct result. The residue matches; the exponent was 47655299. Best of all, with this hack the exponent took only 8 hours to complete on the Titan V. However, the speedup isn't as drastic as it seemed, since when I was getting false results the program was using the wrong FFT size. It is still a good thing, though, because cuFFT64_100 is significantly smaller than cuFFT64_80 and there is a small speedup in general.

xx005fs 2018-10-07 02:54

[QUOTE=kriesel;497518]How much of the rated TDP is your 1070 running at? I see considerable differences between models within the same system and similar thermal environment. From GPU-Z,

[CODE]In one system:
GTX 1050Ti 92% TDP, fan 49% gpu temp 82C, Perfcap reason: Pwr, Therm, idle alternately (in nonstop mfaktc)
GTX 1070 ~80% TDP, fan 93% gpu temp 87C, Perfcap: mostly therm, some idle (in nonstop mfaktc)
quadro 2000 perfcap: none

second system,
GTX 1060 70% TDP, fan 60%, gpu temp 74C, perfcap: VRel (in nonstop cudapm1)
quadro 4000 perfcap: none
quadro 2000 perfcap: none

third system,
GTX 1080 49% TDP, fan 87%, gpu temp 83C, perfcap: therm (in nonstop cudapm1) gpu load 99%, memory controller 74%
quadro 4000 perfcap: none

fourth system,
quadro 5000 perfcap: none
GTX 480 perfcap: none[/CODE]I was shocked to see the GTX 1080 come up limited to half power. The 1080 benchmarks at 1.25 times the speed of the TDP-limited 1070, rather than the 1.5x I had expected. I may soon look into cramming another fan into each of systems 1-3.


Even so, the GTX 1080 pounds out ~882 GHzD/day in mfaktc as-is, at 88% GPU load. Two mfaktc instances take it up to ~920 GHzD/day aggregate, at 95% GPU load, 93C, and 70% TDP.[/QUOTE]

I don't really do trial factoring work with GPUs anymore; I am mostly focused on LL and PRP testing. I haven't touched my 1070, and it has been sitting idle for over 4 months now. I am fairly sure the TDP I was using for it was 70%, due to power concerns (I have a relatively weak PSU at only 650 W, and I want to hit the top of the efficiency curve at around 300-400 W or so). I have the MSI Gaming X model, so with fans at 100% the thermals were really impressive, at only 51C or so IIRC. For trial factoring I just run a single GPU at 100% power target with max fan speed, at around 60C, which for my Gaming X model pulls around 210 W. Honestly I can't remember its TF speed, but I might test it soon when I get a new power supply.

Generally I favour AMD GPUs for this computing project because of their much higher DP throughput and memory bandwidth compared to similarly priced GeForce cards. I am not really a fan of TF because TF runs simply pull too much power, and the resulting GPU temperatures make me uncomfortable.

James Heinrich 2018-10-07 12:20

[QUOTE=xx005fs;497535]I haven't touched my 1070 and it's just sitting idle currently for over 4 months[/QUOTE]It has 3x the throughput of my 670 at less power. But I guess we each have to run with what we can get our hands on.

