mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2015-10-12, 05:51   #1
Karl M Johnson
 
 
Mar 2010

3×137 Posts
CUDALucas: which binary to use?

Good morning.

I decided to do a bit of research on binary selection for CUDALucas.
I downloaded the available binaries from SourceForge and used a nearly default configuration file.
I used M43112609 to time the LL iterations, with a 2352K FFT size and 256 threads for both square & splice, all running on the ol' GTX Titan.
For each binary, the average of five timing samples was calculated:
Code:
C_4.2: 1.8538
C_5.0: 1.82502
C_5.5: 1.8279
C_6.0: 1.83506
C_6.5: 1.6982
What this means is that, for this particular exponent range (M43ish), the binary compiled with CUDA toolkit v6.5 performs best on sm_50 hardware.
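For reference, the averaging-and-ranking step can be sketched in a few lines of Python (the helper name is my own, and feeding in the already-averaged table values is purely illustrative; the real input would be the five raw samples per binary):

```python
# Hypothetical sketch of the comparison above: average several per-iteration
# timings per binary, then rank the binaries fastest-first.
def rank_binaries(samples):
    """samples: {binary: [time, ...]} -> [(binary, avg_time), ...] ascending."""
    averages = {name: sum(ts) / len(ts) for name, ts in samples.items()}
    return sorted(averages.items(), key=lambda kv: kv[1])

# Using the averages from the table (one "sample" each) just reproduces
# the ranking, with C_6.5 fastest for the 2352K test:
table = {"C_4.2": [1.8538], "C_5.0": [1.82502], "C_5.5": [1.8279],
         "C_6.0": [1.83506], "C_6.5": [1.6982]}
print(rank_binaries(table)[0][0])  # C_6.5
```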


Now, same series of tests on M58496057 with 4320K FFT size:
Code:
C_4.2: 3.52024
C_5.0: 3.40014
C_5.5: 3.32784
C_6.0: 3.31762
C_6.5: 3.33362
The situation changes: for the 58M-ish exponent range, the binary compiled with CUDA toolkit 6.0 is the fastest on sm_50 hardware, though not by much.


Miscellaneous observation(s):
1. For some reason, while the C4.2-C5.5 binaries are OK with a smaller starting FFT length, the C6.0 and C6.5 binaries require bigger FFT sizes, and this erratic behaviour always occurs.
Why this is happening is beyond my knowledge of the topic.
There is more to it than that:
Code:
Using threads: square 256, splice 256.
Starting M58496057 fft length = 4320K
Running careful round off test for 1000 iterations.
If average error > 0.25, or maximum error > 0.35,
the test will restart with a longer FFT.
Iteration = 80 < 1000 && err = 0.50000 > 0.35, increasing n from 4320K
The fft length 4608K is too large for exponent 58496057, decreasing to 4320K
Using threads: square 256, splice 256.
Starting M58496057 fft length = 4320K
Running careful round off test for 1000 iterations.
If average error > 0.25, or maximum error > 0.35,
the test will restart with a longer FFT.
Iteration  100, average error = 0.00021, max error = 0.00032
Iteration  200, average error = 0.00024, max error = 0.00033
Iteration  300, average error = 0.00025, max error = 0.00033
Iteration  400, average error = 0.00026, max error = 0.00032
Iteration  500, average error = 0.00026, max error = 0.00034
Iteration  600, average error = 0.00026, max error = 0.00032
Iteration  700, average error = 0.00027, max error = 0.00032
Iteration  800, average error = 0.00027, max error = 0.00032
Iteration  900, average error = 0.00027, max error = 0.00032
Iteration 1000, average error = 0.00027 <= 0.25 (max error = 0.00034), continuing test.
Doesn't look right, does it? It tries to increase the FFT size because of a roundoff error, then goes back to the original size and runs without problems.
Some initialisation bug?
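For what it's worth, the retest decision in the log above is just two thresholds; a minimal sketch of that check (the function name is mine, the limits are the ones the log quotes):

```python
# Error bounds quoted in the CUDALucas log above.
AVG_LIMIT = 0.25
MAX_LIMIT = 0.35

def roundoff_verdict(avg_err, max_err):
    """Return the action the log describes: restart with a longer FFT if
    either error bound is exceeded, otherwise keep going."""
    if avg_err > AVG_LIMIT or max_err > MAX_LIMIT:
        return "increase fft"
    return "continue"

# Iteration 80 of the first run: err = 0.50000 > 0.35 -> longer FFT.
print(roundoff_verdict(0.0, 0.5))          # increase fft
# Iteration 1000 of the second run: 0.00027 / 0.00034, well inside bounds.
print(roundoff_verdict(0.00027, 0.00034))  # continue
```

Which is what makes the log so strange: the same FFT size that trips the 0.35 bound at iteration 80 sails through 1000 iterations on the retry.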

2. The situation may (and I have a feeling it will) be different for NV cards of other shader models.
Tracking down the "golden" binary for a particular exponent range isn't easy, and adding particular shader models into the mix makes it tougher.
One day the developers of CUDALucas may have to consider maintaining only a single build and deprecating the rest, thus "embracing progress".


Comments, along with tests of other CUDA builds (especially 7.0 and 7.5), are welcome!
Old 2015-10-12, 06:08   #2
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2·17·251 Posts

There is nothing wrong with the program; it's just that your FFT is too big. For this exponent I think a ~3M FFT would work better and faster than the 4320K FFT you use. When I reach home I will check with my CUDALucas setup.
[edit: the 3M estimate comes from looking at your error size. You may need a bit more than 3M. Generally, the best configuration (i.e. optimal, fast and still safe) is when the error is around 0.2 (say from 0.15 to 0.25, depending on your FFT selection). This FFT is definitely too big.]
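LaurV's rule of thumb can be written down as a tiny helper (the function name and the exact band edges are my paraphrase of his 0.15-0.25 guideline):

```python
def fft_size_hint(max_err, lo=0.15, hi=0.25):
    """Classify an FFT choice by its steady-state roundoff error:
    too close to the 0.35 hard limit -> go bigger; far below the
    sweet spot -> the FFT is larger (and slower) than it needs to be."""
    if max_err > hi:
        return "increase fft"
    if max_err < lo:
        return "decrease fft"
    return "keep"

# The 0.00034 max error in the log above lands firmly in "decrease fft",
# matching "this FFT is definitely too big".
print(fft_size_hint(0.00034))  # decrease fft
```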

Last fiddled with by LaurV on 2015-10-12 at 06:11
Old 2015-10-12, 06:31   #3
Karl M Johnson
 
 
Mar 2010

19B16 Posts

Indeed, I've done more tests and found that this "bug", or whatever is happening here, can't always be reproduced.

Once the proper binary for a given FFT size is picked, proper internal benchmarks should be done to find the best FFT size/thread/splice combinations.
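That benchmark sweep is essentially a grid search; a hypothetical sketch (the `bench` callback here is a toy stand-in for an actual timed CUDALucas run, and the cost model below is made up purely to exercise the search):

```python
import itertools

def best_config(bench, fft_sizes, square_threads, splice_threads):
    """Time every (fft, square, splice) combination with `bench` and
    return the fastest one. In reality `bench` would launch a short
    timed run on the GPU; here it is just a callback."""
    configs = itertools.product(fft_sizes, square_threads, splice_threads)
    return min(configs, key=lambda cfg: bench(*cfg))

# Toy cost model whose minimum is at (4320, 256, 256); real timings
# would come from the card.
fake_bench = lambda fft, sq, sp: (abs(fft - 4320)
                                  + abs(sq - 256) / 100
                                  + abs(sp - 256) / 100)
print(best_config(fake_bench, [4096, 4320, 4608], [128, 256], [128, 256]))
# (4320, 256, 256)
```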

Last fiddled with by Karl M Johnson on 2015-10-12 at 06:49 Reason: yes
Old 2015-10-12, 09:41   #4
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT)

22·5·283 Posts

Sounds like an unstable card to me.
Old 2015-10-12, 11:19   #5
Karl M Johnson
 
 
Mar 2010

6338 Posts

Not unlikely, even though I don't recall ever submitting a bad LL test before migrating away from the 337.xx Forceware drivers.
As usual, it needs more scientific testing.

Any comments on the method?
Any hints regarding CUDALucas and how it works with newer CUDA toolkit versions?

Last fiddled with by Karl M Johnson on 2015-10-12 at 11:20 Reason: yes
Old 2015-10-12, 15:06   #6
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

3×1,087 Posts

Quote:
Originally Posted by Karl M Johnson View Post
Not unlikely, even though I don't recall ever submitting a bad LL test before migrating from 337.xx Forceware.
As usual, needs more scientific testing.

Any comments on the method?
Any hints regarding CUDALucas and how it works with newer CUDA toolkit versions?
I figured I'd throw out the obvious possibility: the 6.5 version mentioned might have more conservative settings as they relate to memory usage, perhaps? Thus the larger FFT being impacted in some way.

The basic code itself won't change with different FFT sizes, so it's only the way the program allocates memory which I think would be the big variable here, and 6.0/6.5 must not be entirely the same in that regard.

I've never used any of the CUDA compilers, so I can't be more specific, but I'd look to see whether any default options have changed, especially as they relate to memory. Maybe 6.5 had some build option that makes it spend a little more time in garbage collection or something weird, something that would affect a larger memory chunk more than a smaller one. Or maybe it allocates memory differently, etc.

Last fiddled with by Madpoo on 2015-10-12 at 15:07
Old 2015-10-12, 16:37   #7
Karl M Johnson
 
 
Mar 2010

3·137 Posts

Okay, I thought the hardware was unstable (it may have lost some overclocking headroom), but as it turns out, it's not entirely related to that.
So far I've gotten no bad residues with the C4.2 binaries, but this doesn't mean anything yet.
C6.0 and C6.5 (and possibly C5.5) could indeed be more stability-demanding, even if previous versions worked flawlessly for years.
Will report my further findings.
Old 2015-10-12, 16:53   #8
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

11×19×43 Posts

Quote:
Originally Posted by Karl M Johnson View Post
Will report my further findings.
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" -- Isaac Asimov
Old 2015-10-12, 17:37   #9
frmky
 
 
Jul 2003
So Cal

43·47 Posts

Quote:
Originally Posted by Karl M Johnson View Post
2. The situation may (and I have a feeling it will) be different for NV cards of other shader model.
Tracking particular "golden" binaries for particular exponent isn't easy, and adding particular shader models into that makes it tougher.
One day the developers of CUDALucas may have to consider maintaining only a single build of CUDALucas and deprecating the rest, thus "embracing progress".
I don't spend time worrying about small differences. When I upgrade to a newer CUDA toolkit, I recompile CUDALucas and rerun the FFT and thread benchmarks. This usually ends up with somewhat different preferred FFT sizes. I then run a few double checks, and once they are fine I trash the old version.
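The double-check step boils down to comparing residues between the old and new builds; a hypothetical sketch (the helper name and the residue strings are made up for illustration; real residues come from the CUDALucas output files):

```python
def residues_match(old_results, new_results):
    """A new build is trusted only if it reproduces the final LL residue
    of the old build for every exponent that was re-run."""
    return all(new_results.get(p) == res for p, res in old_results.items())

# Illustrative (fabricated) residues keyed by exponent.
old = {43112609: "0x3ac2e0f7deadbeef", 58496057: "0x11aa22bb33cc44dd"}
print(residues_match(old, dict(old)))                    # True
print(residues_match(old, {43112609: "0x0000000000000000",
                           58496057: "0x11aa22bb33cc44dd"}))  # False
```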
Old 2015-10-12, 17:56   #10
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

11·19·43 Posts

Quote:
Originally Posted by frmky View Post
I don't spend time worrying about small differences ...and once they are fine I trash the old version.
So, then, you throw important information away?

WTF?
Old 2015-10-12, 18:15   #11
frmky
 
 
Jul 2003
So Cal

43×47 Posts

I throw information away, yes. I question whether it's important.