#89
"David"
Jul 2015
Ohio
205₁₆ Posts
Quote:
@laurv - I still have the Titans (and one or two more, I'm afraid); however, the last few months have been so busy that I have not had a chance to bundle them up and ship them. They are still tagged for you, just waiting for a slow day!
@preda I very much appreciate your work on this. I did a good bit of performance evaluation work on clLucas a few months back and theorized several times that the sequence of kernel calls in clFFT could be combined into a single kernel, or a more efficient form, that would be about 2x faster; however, I never found time to do the work. Do you think you could easily add a power-of-two 16K FFT option, just for some tests? I presume that would be easier than implementing efficient mixed-radix FFTs.
Last fiddled with by airsquirrels on 2017-05-02 at 12:18
#90
"Mr. Meeseeks"
Jan 2012
California, USA
2³·271 Posts
FYI: with -Wall passed to g++ I'm getting this warning. Not sure if it's anything (only bringing it up because you had -Werror in the makefile).
Code:
gpuowl.cpp: In function 'void doLog(int, int, float, float, double, u64)':
gpuowl.cpp:285:93: warning: unknown conversion type character 'l' in format [-Wformat=]
k, E, k * percent, msPerIter, days, hours, mins, (unsigned long long) res, err, maxErr);
^
gpuowl.cpp:285:93: warning: format '%g' expects argument of type 'double', but argument 9 has type 'u64 {aka long long unsigned int}' [-Wformat=]
gpuowl.cpp:285:93: warning: too many arguments for format [-Wformat-extra-args]
#91
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
In this case, it seems gcc does not like %llx in printf() ("long long unsigned"). printf() may still execute it correctly, though; give it a try.
#92
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
I just now saw your thread about the LL implementation; sorry I didn't see it earlier. I was also initially thinking about merging kernels for performance, but that didn't work well because of VGPR (register) pressure, which is a major limit on the GCN ISA. Keeping the kernels "small" reduces VGPR usage, allowing more workgroups to run at the same time.
In fact, I initially tried to implement Nussbaumer convolution, which needs no floating point (integers only) and fewer multiplications. I stopped when I became convinced that it would still be slower than classical LL (double-precision FFT) on GPUs, because Nussbaumer is more memory-intensive.
The main optimization in gpuOwL, IMO, is using a transposed representation of the data matrix (what I call "transposed convolution"), which fits very nicely with the GPU memory access pattern. This saves two transposition steps in both the direct and inverse FFT. The transposed representation is also good for the parallel carry propagation.
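To make the register-pressure point concrete, here is a rough sketch of how per-thread VGPR usage caps the number of wavefronts resident on one GCN SIMD. The limits used (256 VGPRs per lane, at most 10 wavefronts per SIMD, allocation in blocks of 4) are the commonly quoted GCN figures and are assumptions here, not numbers taken from gpuOwL; the helper name `waves_per_simd` is made up for illustration.

```python
# Rough GCN occupancy estimate from per-thread VGPR usage.
# All three limits below are assumed hardware figures, not gpuOwL values.
VGPRS_PER_LANE = 256      # vector registers available to each SIMD lane
MAX_WAVES_PER_SIMD = 10   # cap on resident wavefronts per SIMD
VGPR_GRANULARITY = 4      # registers are handed out in blocks of 4


def waves_per_simd(vgprs_used: int) -> int:
    """Hypothetical helper: wavefronts that fit on one SIMD for a kernel
    whose threads each use `vgprs_used` vector registers."""
    if not 1 <= vgprs_used <= VGPRS_PER_LANE:
        raise ValueError("vgprs_used must be between 1 and 256")
    # Round the request up to the allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


for vgprs in (24, 64, 128, 200):
    print(f"{vgprs:3d} VGPRs -> {waves_per_simd(vgprs)} waves per SIMD")
```

Under these assumptions a merged kernel needing 128 registers leaves room for only 2 wavefronts per SIMD, while a small kernel at 24 registers reaches the cap of 10, which is the trade-off described above.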
#93
"David"
Jul 2015
Ohio
517₁₀ Posts
Quote:
Your transposed representation solves one of the big problems with the default runtime-generated kernels in clFFT. The other problem was a post-processing step that had to reload from memory all of the values that had just been available in registers, in order to convert them back to integers for carry propagation; that was most of the bottleneck.
I did get some good advice from Mr Prime95 himself regarding the carry step: it is not necessary to complete the entire carry chain. You only need to carry enough to reduce your word sizes back to the point where they stay within the error bounds of another FFT and squaring step.
I also intend to test your OpenCL code on the Nvidia system and see if it runs and how it performs vs cuFFT.
Last fiddled with by airsquirrels on 2017-05-03 at 02:13
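The partial-carry idea can be sketched in a few lines. This is a simplification under stated assumptions: fixed-size non-negative words of `BITS` bits and a modulus of 2^(BITS·n) − 1, whereas real LL code uses two word sizes and balanced (signed) digits. The function names are illustrative only.

```python
BITS = 17  # assumed fixed word size; real LL code mixes two word sizes


def value(words):
    """Integer represented by little-endian words in base 2**BITS."""
    return sum(w << (BITS * i) for i, w in enumerate(words))


def carry_pass(words):
    """One parallel pass: every word keeps its low BITS bits and hands its
    overflow to the next word; the top overflow wraps around to word 0,
    which is valid modulo 2**(BITS * len(words)) - 1."""
    mask = (1 << BITS) - 1
    low = [w & mask for w in words]
    high = [w >> BITS for w in words]
    return [low[i] + high[i - 1] for i in range(len(words))]


# Oversized words, as they might look right after a squaring step.
words = [123456789012, 987654321, 42, 7777777777]
modulus = (1 << (BITS * len(words))) - 1

# Two independent passes instead of walking the whole carry chain.
partial = carry_pass(carry_pass(words))
print(partial, max(partial).bit_length())
```

After two passes the words are not fully normalized (some may be one bit wider than `BITS`), but they represent the same residue and are small enough to feed into the next squaring, which is the point of the advice quoted above.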
#94
"David"
Jul 2015
Ohio
11×47 Posts
Here is a quick test on a GTX 1080. Residues match; however, gpuOwl runs at 4.53 ms/iter vs 3.55 ms/iter for CUDALucas. Still impressively close, given that it was optimized for AMD cards.
I did another quick test on a Titan Black with double precision boost on, and gpuOwl was around 5.5 ms vs 2.5 ms for CUDALucas. Interestingly, they were closer with double precision boost off, suggesting that cuFFT is perhaps better able to take advantage of the faster compute while gpuOwl's code is memory bound.
Code:
gpuowl -logstep 5000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
GeForce GTX 1080; OpenCL 1.2 CUDA
Will log every 5000 iterations, and persist checkpoint every 2500000 iterations.
Falling back to CL1.x compilation (error -11)
Checkpoint file 'c71561261.ll' not found. You can use 't71561261.ll'.
LL FFT 4096K (1024*2048*2) of 71561261 (17.06 bits/word) at iteration 0
OpenCL setup: 1395 ms
00005000 / 71561261 [0.01%], ms/iter: 4.530, ETA: 3d 18:03; b40dd71dc9998cfd error 0.0390625 (max 0.0390625)
00010000 / 71561261 [0.01%], ms/iter: 4.538, ETA: 3d 18:12; 9421fec94352d8fd error 0.0390625 (max 0.0390625)
00015000 / 71561261 [0.02%], ms/iter: 4.545, ETA: 3d 18:20; 7ff289450308f24f error 0.0390625 (max 0.0390625)
00020000 / 71561261 [0.03%], ms/iter: 4.557, ETA: 3d 18:33; 02729de7028e2114 error 0.0390625 (max 0.0390625)
Code:
CUDALucas -f 4096k 71561261
------- DEVICE 0 -------
name GeForce GTX 1080
Compatibility 6.1
clockRate (MHz) 1733
memClockRate (MHz) 5005
totalGlobalMem 8507555840
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 20
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1
Using threads: square 256, splice 128.
Starting M71561261 fft length = 4096K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 02 22:30:22 | M71561261 10000 0x9421fec94352d8fd | 4096K 0.04688 3.5572 35.57s | 2:22:42:05 0.01% |
| May 02 22:30:57 | M71561261 20000 0x02729de7028e2114 | 4096K 0.05078 3.5814 35.81s | 2:22:55:55 0.02% |
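As a quick sanity check on the two logs, the "17.06 bits/word" figure and the speed gap can be reproduced from the numbers printed above. This is only arithmetic on the logged values; the variable names are illustrative.

```python
# Values copied from the gpuOwL and CUDALucas logs above.
exponent = 71561261
fft_words = 1024 * 2048 * 2           # the "(1024*2048*2)" in the gpuOwL log
assert fft_words == 4096 * 1024       # i.e. the "4096K" FFT length

bits_per_word = exponent / fft_words  # average bits stored per FFT word
print(f"bits/word: {bits_per_word:.2f}")

gpuowl_ms = 4.530      # first gpuOwL ms/iter reading
cudalucas_ms = 3.5572  # first CUDALucas ms/It reading
ratio = gpuowl_ms / cudalucas_ms
print(f"gpuOwL / CUDALucas time per iteration: {ratio:.2f}x")
```

On these readings gpuOwL takes roughly 1.27 times as long per iteration as CUDALucas on the GTX 1080, at the same 4096K FFT length.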
#95
"Mihai Preda"
Apr 2015
3·457 Posts
#96
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2³·3·5·7² Posts
Quote:
More generally, a Fermat PRP test would be fine for k*b^n±1. The proof can be done on a CPU. The FFT is the harder thing to get right, though.
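For readers unfamiliar with the term, a Fermat PRP test on N = k·b^n ± 1 just checks whether a^(N−1) ≡ 1 (mod N) for some base a. The sketch below uses Python's built-in modular exponentiation in place of the FFT-based squaring a GPU program would need; the function name and the default base 3 are illustrative choices, not taken from any program in this thread.

```python
def fermat_prp(k: int, b: int, n: int, sign: int, base: int = 3) -> bool:
    """Return True if N = k*b**n + sign passes a Fermat test to `base`.

    Passing means N is a probable prime only; a separate proof is still
    needed. `sign` must be +1 or -1.
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    N = k * b**n + sign
    if N < 3:
        raise ValueError("N is too small for a meaningful test")
    return pow(base, N - 1, N) == 1


print(fermat_prp(3, 2, 2, +1))    # N = 13
print(fermat_prp(1, 2, 7, -1))    # N = 127
print(fermat_prp(1, 2, 11, -1))   # N = 2047 = 23 * 89
```

The expensive part is the long chain of modular squarings inside `pow`, which is exactly where the FFT comes in for numbers of this size.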
#97
Romulan Interpreter
Jun 2011
Thailand
258B₁₆ Posts
1. Shifting (more important and easier to implement than other FFT sizes - a step toward making gpuOwl production-ready).
2. Some command line switch to enumerate the existing devices (like GPU-Z does). (Not important, but useful: some of us have no idea what treasures are hidden in our computer boxes...)
Last fiddled with by LaurV on 2017-05-03 at 13:16
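Assuming "shifting" here means the usual GIMPS practice of running the LL test on a bit-shifted residue (so that a first test and a double-check exercise different bit patterns), the idea can be sketched as follows. This is a plain big-integer illustration with made-up names, not how gpuOwL or any other program implements it.

```python
def lucas_lehmer_residue(p: int, shift: int = 0) -> int:
    """Final LL residue for M_p = 2**p - 1, computed on a shifted value.

    The stored value is x = s * 2**shift (mod M_p). Squaring doubles the
    shift, so the "-2" term must be shifted by the new amount as well.
    """
    m = (1 << p) - 1
    shift %= p
    x = (4 << shift) % m                  # shifted seed s0 = 4
    for _ in range(p - 2):
        shift = (2 * shift) % p           # 2**p == 1 (mod M_p)
        x = (x * x - (2 << shift)) % m
    # Undo the shift: multiplying by 2**(p - shift) divides by 2**shift.
    return (x << (p - shift)) % m


for p in (7, 11, 13):
    residues = {lucas_lehmer_residue(p, s) for s in range(p)}
    print(p, residues)
```

Every starting shift yields the same unshifted residue, and that residue is zero exactly when M_p is prime, so two runs with different shifts can be compared directly.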
#98
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
It looks like this (R9 Nano):
Code:
51442000 / 72155953 [71.29%], ms/iter: 2.991, ETA: 0d 17:13; c9ac48b9de3e0d80 error 0.046875 (max 0.046875)
fftPremul1K 373.7us, 12.5%
transpose1K 341.6us, 11.4%
fft2K_1K 385.9us, 12.9%
cquare2K 318.7us, 10.7%
fft2K 389.7us, 13.0%
mtranspose2K 344.2us, 11.5%
fft1K_2K 361.2us, 12.1%
carryA 330.6us, 11.1%
carryB 143.9us, 4.8%
Total 2989.7us
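To see where the iteration time goes, the per-kernel figures in this log can be grouped by kind of work. The grouping by kernel name below is an assumption made for illustration (names containing "transpose", names starting with "carry", everything else counted as FFT or squaring); the numbers are copied from the log.

```python
# Per-kernel times in microseconds, copied from the R9 Nano log above.
timings_us = {
    "fftPremul1K": 373.7, "transpose1K": 341.6, "fft2K_1K": 385.9,
    "cquare2K": 318.7, "fft2K": 389.7, "mtranspose2K": 344.2,
    "fft1K_2K": 361.2, "carryA": 330.6, "carryB": 143.9,
}
total_us = 2989.7  # the "Total" line from the log


def group_of(kernel: str) -> str:
    """Assumed grouping, based only on the kernel names."""
    if "transpose" in kernel:
        return "transpose"
    if kernel.startswith("carry"):
        return "carry"
    return "fft/square"


shares = {}
for kernel, us in timings_us.items():
    group = group_of(kernel)
    shares[group] = shares.get(group, 0.0) + 100.0 * us / total_us

for group, pct in sorted(shares.items()):
    print(f"{group:10s} {pct:5.1f}%")
```

On this breakdown the two carry kernels together account for about 16% of an iteration and the two transposes for about 23%, with the FFT and squaring kernels taking the rest.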
#99
"David"
Jul 2015
Ohio
11·47 Posts
Quote: