mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
Old 2017-05-02, 12:10   #89
airsquirrels
 

Quote:
Originally Posted by preda
@airsquirrels : Impressive hardware! Do you have a description somewhere of your hardware setup (e.g. what motherboard, how the GPUs are connected and cooled, pictures, power use, etc.)?

30 °C is such a low temperature -- how do you cool? Or was that only at startup?
There is a thread somewhere around here detailing the setup. The liquid is on a large multi-system loop with industrial pumps and a large external heat exchanger, so I target 35-45 °C for the loop temperature. The 7- and 8-GPU systems are all based on SuperMicro 4027/4028 8-GPU chassis.

@LaurV - I still have the Titans (and one or two more, I'm afraid); however, the last few months have been so busy that I have not had a chance to bundle them up and ship them. They are still tagged for you, just waiting for a slow day!

@preda I very much appreciate your work on this. I did a good bit of performance-evaluation work on clLucas a few months back, and theorized several times that the sequence of kernel calls in clFFT could be combined into a single kernel or a more efficient form that would be about 2x faster; I never found time to do the work, however. Do you think you could easily add a power-of-two 16K FFT option, just for some tests? I presume that would be easier than implementing efficient mixed-radix FFTs.

Last fiddled with by airsquirrels on 2017-05-02 at 12:18
Old 2017-05-02, 23:04   #90
kracker
 

FYI: with -Wall passed to g++ I'm getting this warning. Not sure if it's anything (only bringing it up because you had -Werror in the makefile):
Code:
gpuowl.cpp: In function 'void doLog(int, int, float, float, double, u64)':
gpuowl.cpp:285:93: warning: unknown conversion type character 'l' in format [-Wformat=]
       k, E, k * percent, msPerIter, days, hours, mins, (unsigned long long) res, err, maxErr);
                                                                                             ^
gpuowl.cpp:285:93: warning: format '%g' expects argument of type 'double', but argument 9 has type 'u64 {aka long long unsigned int}' [-Wformat=]
gpuowl.cpp:285:93: warning: too many arguments for format [-Wformat-extra-args]
Old 2017-05-03, 00:00   #91
preda
 

Quote:
Originally Posted by kracker
FYI: with -Wall passed to g++ I'm getting this warning. Not sure if it's anything (only bringing it up because you had -Werror in the makefile):
Code:
gpuowl.cpp: In function 'void doLog(int, int, float, float, double, u64)':
gpuowl.cpp:285:93: warning: unknown conversion type character 'l' in format [-Wformat=]
       k, E, k * percent, msPerIter, days, hours, mins, (unsigned long long) res, err, maxErr);
                                                                                             ^
gpuowl.cpp:285:93: warning: format '%g' expects argument of type 'double', but argument 9 has type 'u64 {aka long long unsigned int}' [-Wformat=]
gpuowl.cpp:285:93: warning: too many arguments for format [-Wformat-extra-args]
Feel free to drop -Werror to get the compilation going. (I only added it because, in general, it's useful to check every warning.)

In this case, it seems gcc does not like %llx ("long long unsigned") in printf(). printf() may still execute it correctly at runtime though; try it.
Old 2017-05-03, 00:18   #92
preda
 

Quote:
Originally Posted by airsquirrels
@preda I very much appreciate your work on this. I did a good bit of performance-evaluation work on clLucas a few months back, and theorized several times that the sequence of kernel calls in clFFT could be combined into a single kernel or a more efficient form that would be about 2x faster; I never found time to do the work, however. Do you think you could easily add a power-of-two 16K FFT option, just for some tests? I presume that would be easier than implementing efficient mixed-radix FFTs.
POT FFTs are clearly easier. Do you want 16K or 16M? (Why 16K -- so small?) In fact, for 16K I wonder how many bits/word would work with float (single precision).

I just saw your thread about the LL implementation -- sorry, I didn't see it earlier. I was also initially thinking about merging kernels for performance, but that didn't work well because of VGPR (register) pressure, which is a major limit on the GCN ISA. Keeping the kernels "small" reduces VGPR usage, allowing more workgroups to run at the same time.

In fact, I initially tried to implement Nussbaumer convolution, which needs no floating point (integers only) and fewer multiplications. I stopped when I became convinced that it would still be slower than the classical LL (double-precision FFT) on GPUs, because Nussbaumer is more memory-intensive.

The main optimization in gpuOwL, IMO, is using a transposed representation of the data matrix (what I call "transposed convolution"), which fits the GPU memory-access pattern very nicely. This saves two transposition steps across the direct and inverse FFTs. The transposed representation is also good for the parallel carry propagation.
Old 2017-05-03, 01:58   #93
airsquirrels
 

Quote:
Originally Posted by preda
POT FFTs are clearly easier. Do you want 16K or 16M? (Why 16K -- so small?) In fact, for 16K I wonder how many bits/word would work with float (single precision).

I just saw your thread about the LL implementation -- sorry, I didn't see it earlier. I was also initially thinking about merging kernels for performance, but that didn't work well because of VGPR (register) pressure, which is a major limit on the GCN ISA. Keeping the kernels "small" reduces VGPR usage, allowing more workgroups to run at the same time.

In fact, I initially tried to implement Nussbaumer convolution, which needs no floating point (integers only) and fewer multiplications. I stopped when I became convinced that it would still be slower than the classical LL (double-precision FFT) on GPUs, because Nussbaumer is more memory-intensive.

The main optimization in gpuOwL, IMO, is using a transposed representation of the data matrix (what I call "transposed convolution"), which fits the GPU memory-access pattern very nicely. This saves two transposition steps across the direct and inverse FFTs. The transposed representation is also good for the parallel carry propagation.
You are correct - 16M, although 8M is good too! I am curious whether you have given any thought at this point to the difficulty of mixed-radix FFT sizes.

Your transposed representation solves one of the big problems with the default runtime-generated kernels in clFFT. The other problem was a post-processing step that had to reload from memory all of the values that had just been available in registers, in order to convert them back to integers for carry propagation; that was most of the bottleneck.

I did get some good advice from Mr. Prime95 himself regarding the carry step: it is not necessary to complete the entire carry chain. You only need to carry enough to reduce your word sizes back to the point where they stay within the error bounds of another FFT and squaring step.

I also intend to test your OpenCL code on the Nvidia system, to see whether it runs and how it performs vs cuFFT.

Last fiddled with by airsquirrels on 2017-05-03 at 02:13
Old 2017-05-03, 02:28   #94
airsquirrels
 

Here is a quick test on a GTX 1080; the residues match, but gpuOwl runs at 4.53 ms/iter vs 3.55 for CUDALucas. Still impressively close, given that the code was optimized for AMD cards.

I did another quick test on a Titan Black with double-precision boost on, and gpuOwl was around 5.5 ms/iter vs 2.5 for CUDALucas. Interestingly, they were closer with double-precision boost off, suggesting cuFFT is perhaps better able to take advantage of the faster compute while gpuOwl's code remains memory-bound.

Code:
gpuowl -logstep 5000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
GeForce GTX 1080; OpenCL 1.2 CUDA
Will log every 5000 iterations, and persist checkpoint every 2500000 iterations.
Falling back to CL1.x compilation (error -11)
Checkpoint file 'c71561261.ll' not found. You can use 't71561261.ll'.
LL FFT 4096K (1024*2048*2) of 71561261 (17.06 bits/word) at iteration 0
OpenCL setup: 1395 ms
00005000 / 71561261 [0.01%], ms/iter: 4.530, ETA: 3d 18:03; b40dd71dc9998cfd error 0.0390625 (max 0.0390625)
00010000 / 71561261 [0.01%], ms/iter: 4.538, ETA: 3d 18:12; 9421fec94352d8fd error 0.0390625 (max 0.0390625)
00015000 / 71561261 [0.02%], ms/iter: 4.545, ETA: 3d 18:20; 7ff289450308f24f error 0.0390625 (max 0.0390625)
00020000 / 71561261 [0.03%], ms/iter: 4.557, ETA: 3d 18:33; 02729de7028e2114 error 0.0390625 (max 0.0390625)
Code:
CUDALucas -f 4096k 71561261

------- DEVICE 0 -------
name                GeForce GTX 1080
Compatibility       6.1
clockRate (MHz)     1733
memClockRate (MHz)  5005
totalGlobalMem      8507555840
totalConstMem       65536
l2CacheSize         2097152
sharedMemPerBlock   49152
regsPerBlock        65536
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsPerMP     2048
multiProcessorCount 20
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      2147483647,65535,65535
textureAlignment    512
deviceOverlap       1

Using threads: square 256, splice 128.
Starting M71561261 fft length = 4096K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  May 02  22:30:22  |  M71561261     10000  0x9421fec94352d8fd  |  4096K  0.04688   3.5572   35.57s  |   2:22:42:05   0.01%  |
|  May 02  22:30:57  |  M71561261     20000  0x02729de7028e2114  |  4096K  0.05078   3.5814   35.81s  |   2:22:55:55   0.02%  |
Old 2017-05-03, 08:29   #95
preda
 

Quote:
Originally Posted by henryzz
There are many people who would appreciate a fast FFT modulo k*2^n-1 for the LLR test on the GPU.
Please allow me a few days to think about it (I'm quite new to LLR).
Old 2017-05-03, 10:37   #96
henryzz

Quote:
Originally Posted by preda
Please allow me a few days to think about it (I'm quite new to LLR).
The LLR test would be nice, as it means the results would be comparable with other programs for k*2^n-1. There are also Proth primality tests for k*2^n+1.
More generally, a Fermat PRP test would be fine for k*b^n+-1; proving can then be done on a CPU.


The FFT is the harder thing to get right, though.
Old 2017-05-03, 13:15   #97
LaurV

1. Shifting (more important and easier to implement than other FFT sizes -- a step toward making gpuOwl production-ready).
2. A command-line switch to enumerate the existing devices (like GPU-Z does). (Not important, but useful; some of us have no idea what treasures are hidden in our computer boxes...)

Last fiddled with by LaurV on 2017-05-03 at 13:16
Old 2017-05-03, 18:27   #98
preda
 
New option to time kernels

Quote:
Originally Posted by airsquirrels
Here is a quick test on a GTX 1080; the residues match, but gpuOwl runs at 4.53 ms/iter vs 3.55 for CUDALucas. Still impressively close, given that the code was optimized for AMD cards.
I added a new command-line option, "-time kernels" (also mentioned in "-h"), which prints per-kernel timing information at each log step. This should help point out a "trouble" kernel that performs particularly badly on a different platform (taking up a disproportionate percentage of the time).

It looks like this (R9 Nano):
Code:
51442000 / 72155953 [71.29%], ms/iter: 2.991, ETA: 0d 17:13; c9ac48b9de3e0d80 error 0.046875 (max 0.046875)
  fftPremul1K  373.7us, 12.5%
  transpose1K  341.6us, 11.4%
  fft2K_1K     385.9us, 12.9%
  cquare2K     318.7us, 10.7%
  fft2K        389.7us, 13.0%
  mtranspose2K 344.2us, 11.5%
  fft1K_2K     361.2us, 12.1%
  carryA       330.6us, 11.1%
  carryB       143.9us, 4.8%
  Total        2989.7us
"-time kernels" does have a performance hit, so it should be enabled only for a few log steps while investigating.
Old 2017-05-03, 21:00   #99
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11·47 Posts
Default

Quote:
Originally Posted by preda
I added a new command line option, "-time kernels" (also mentioned in "-h"), which prints on each log-step timing information per kernel. This should allow to point out a "trouble" kernel that performs particularly badly on the different platform (taking up a disproportionate percentage...
Excellent! I will give this a run on all the different OpenCL GPU hardware I currently have running and send you the feedback.