CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

LaurV 2012-08-29 10:42

[QUOTE=TheJudger;309504]
There is only one point where I'm unsure: I don't know whether you can run both apps (with different CUDA versions) concurrently or not.[/QUOTE]
You can't. I use different sm's for CL and mfaktc, and they run perfectly as long as I don't mix them on the same card. I can mix them in the same computer at the same time if they target different cards and the cards are not in SLI. Keeping many versions in the computer at the same time is easy: you only put the right dlls in each folder, as both mfaktc and CL look in their own folder for the dll if it is not already loaded. But you can't RUN two versions on the same card at the same time.

LaurV 2012-08-29 11:09

[QUOTE=Dubslow;309566]Okay, slight change of plans: I recall LaurV somewhere saying that a larger FFT length was faster than some smaller ones in CUDALucas' table, but I wasn't able to relocate that post. In addition, I will also add the signal-handling fix discussed before to r39.

In the meantime, all Windows users should test flash's latest compile for the filelocking bug; note, however, that compared to earlier beta releases, some FFT lengths might not appear. If the bug is confirmed killed, then the final release (non-beta) of 2.04 will reincorporate the changes from the old binary lost in the new ones (i.e., it will be r39). r39 will be committed when LaurV responds.[/QUOTE]

Yes, I have the bad habit of tuning the FFT size manually for every range, and sometimes this has saved me hours of CL-ing. The numbers below are from memory (IIRC); I will confirm them tomorrow. I have had no internet at home since Friday morning, and here I have internet but no GPU, so I still have to go home this evening, in half an hour at most, and check that I am not mistaking the numbers. For example, 2304k (smooth, as 2^18*3^2) is much faster than the smaller ones (about 6% faster than the default one), and 2592k (smooth, 2^15*3^4) is about 14% faster (NO JOKE!) than the default one (I can't remember which: maybe 2400k, maybe 2568k, or the next higher one). I have an Excel table somewhere; if the net problem can't be solved soon, I will bring it here. The table is not complete, covering just the ranges I had expos to test, i.e. 25M to 46M expos, but it is very detailed.

The cards are gtx580, gtx570 and tesla c2050, with no difference between them. An FFT size with smaller granulation (a smoother number) is always faster than a smaller FFT size with bigger granulation (a less smooth number), with very few exceptions. 1440k is one such exception, which is 5-smooth but still very fast! Above 1440k, the default FFT size can almost always be tuned to a better one. I can't say for sure that this is not card/OS/whatever dependent. Someone should try FFT 2592k against the smaller defaults on a gtx580 on Linux. I consistently get (besides smaller, safer rounding errors) a speed improvement of 13-14% on win64/gtx580, which is my main setup. This translates into 46-49 hours for a 4xM expo, instead of 52-55 hours.
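For anyone who wants to check the smoothness of a candidate length, here is a minimal sketch in C. The function name `largest_small_prime` is made up for illustration and is not part of CUDALucas; lengths are passed as full element counts (the "k" value times 1024).

```c
#include <stdint.h>

/* Hypothetical helper, not from CUDALucas.
   Returns the largest prime factor of n if n is 7-smooth
   (no prime factor above 7), 1 for n == 1, and 0 otherwise. */
static int largest_small_prime(uint64_t n)
{
    static const int primes[] = { 2, 3, 5, 7 };
    int largest = 1;

    if (n == 0)
        return 0;
    for (int i = 0; i < 4; i++) {
        /* Divide out every power of this prime. */
        while (n % (uint64_t)primes[i] == 0) {
            n /= (uint64_t)primes[i];
            largest = primes[i];
        }
    }
    /* Anything left over is a prime factor larger than 7. */
    return n == 1 ? largest : 0;
}
```

With this, 2304k and 2592k both come out as 3-smooth and 1440k as 5-smooth, which agrees with the factorizations given in the post.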

edit: I am going home now, but you can search the forum for "2592k" and you should find my former posts. I am 100% sure of this number (it also seems to be a multiple of only 2 and 3 :D). Trust the numbers in those posts over the numbers in this one.

Dubslow 2012-08-29 20:23

[QUOTE=LaurV;309608]Yes, I have the bad habit of tuning the FFT size manually for every range [...] I have an Excel table somewhere [...][/QUOTE]

I would love to see the spreadsheet. For what it's worth, here's all [STRIKE]four[/STRIKE] five lines of how CUDALucas chooses a length:

[code]#define COUNT 119
// FFT lengths in units of K (1024) elements, sorted in increasing order
int multipliers[COUNT] = {
        6,     8,    12,    16,    18,    24,    32,
       40,    48,    64,    72,    80,    96,   120,
      128,   144,   160,   192,   224,   240,   256,
      288,   320,   336,   384,   448,   480,   512,
      576,   640,   672,   768,   800,   864,   896,
      960,  1024,  1120,  1152,  1200,  1280,  1344,
     1440,  1536,  1600,  1680,  1728,  1792,  1920,
     2048,  2240,  2304,  2400,  2560,  2688,  2880,
     3072,  3200,  3360,  3456,  3584,  3840,  4000,
     4096,  4480,  4608,  4800,  5120,  5376,  5600,
     5760,  6144,  6400,  6720,  6912,  7168,  7680,
     8000,  8192,  8960,  9216,  9600, 10240, 10752,
    // 14336 was posted as 14366, which is not 7-smooth; presumably a typo
    11200, 11520, 12288, 12800, 13440, 13824, 14336,
    15360, 16000, 16128, 16384, 17920, 18432, 19200,
    20480, 21504, 22400, 23040, 24576, 25600, 26880,
    // 28672 was posted as 29672, which is not 7-smooth; presumably a typo
    28672, 30720, 32000, 32768, 34992, 36864, 38400,
    40960, 46080, 49152, 51200, 55296, 61440, 65536 };
// Largely copied from Prime95's jump tables, up to 32M
// Support up to 64M, the maximum length with threads == 1024
...
// q/20 is a rough lower bound on the usable length
int len, i, estimate = q/20;
for(i = 0; i < COUNT; i++) {
    len = 1024*multipliers[i];
    if( len >= estimate )
    {
        return len; // first table entry at or above the estimate
    }
}[/code]
If you say larger lengths are faster, it should just be a matter of removing the slower lengths from the table.
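To make the effect of pruning concrete, here is a small self-contained sketch of the same first-fit rule applied to a slice of the table. The function name `choose_len`, the two slice arrays, and the choice of 2240 as the entry to drop are all illustrative assumptions; `q` is assumed to be the exponent under test, and nothing in the thread says 2240k is actually one of the slow lengths.

```c
#include <stddef.h>

/* Same first-fit rule as above, taking the table as a parameter.
   Returns the first length (in elements) at or above q/20,
   or 0 if no entry in the table is large enough. */
static int choose_len(int q, const int *multipliers, size_t count)
{
    int estimate = q / 20;

    for (size_t i = 0; i < count; i++) {
        int len = 1024 * multipliers[i];
        if (len >= estimate)
            return len;
    }
    return 0;
}

/* Slice of the original table around 2M to 3M. */
static const int full_slice[] = { 2048, 2240, 2304, 2400, 2560, 2688, 2880 };

/* Same slice with 2240 removed (purely illustrative choice). */
static const int pruned_slice[] = { 2048, 2304, 2400, 2560, 2688, 2880 };
```

For q = 45,000,000 the estimate is 2,250,000, so `choose_len(45000000, full_slice, 7)` stops at 2240k, while `choose_len(45000000, pruned_slice, 6)` moves on to 2304k. Note that 2592 does not appear in the table at all, so using it would mean adding an entry, not just removing slower ones.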

flashjh 2012-08-30 04:58

We need a switch like -fft that does more than just take q/20 and then increase until the length is >= that estimate. When enabled, it can test several FFT lengths, log the time and error for each, and then select the best one for that particular exponent. If a worktodo file is used, then it runs the FFT test when each exponent is started. Once an FFT is selected, it will need to be able to put the FFT into the worktodo file for that exponent. The main problem is how many FFTs to test before it's a waste of time. (If LaurV's suggestion can be vetted, it may be possible to narrow down the FFTs to a small enough number to test all of them each time.) Once enough test data is collected and reviewed, it may be possible to have the program select a particular set of FFTs to test based on the exponent number and GPU chipset.
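A rough sketch of what such a selection step might look like, under heavy assumptions: the names `trial_result`, `trial_fn`, `pick_best_len` and `fake_trial` are invented here, the error limit is a caller-supplied parameter, not a known CUDALucas threshold, and `fake_trial` returns made-up numbers (not measurements) so that the example can run without a GPU.

```c
#include <stddef.h>

/* Outcome of a short trial run at one FFT length (hypothetical). */
typedef struct {
    double ms_per_iter;  /* time per iteration in the trial */
    double max_err;      /* largest rounding error seen in the trial */
} trial_result;

typedef trial_result (*trial_fn)(int len);

/* Try up to max_trials table lengths at or above q/20 and return the
   fastest one whose error stays below err_limit; 0 if none qualifies. */
static int pick_best_len(int q, const int *multipliers, size_t count,
                         size_t max_trials, double err_limit,
                         trial_fn run_trial)
{
    int estimate = q / 20;
    int best_len = 0;
    double best_ms = 0.0;
    size_t tried = 0;

    for (size_t i = 0; i < count && tried < max_trials; i++) {
        int len = 1024 * multipliers[i];
        if (len < estimate)
            continue;               /* too small for this exponent */
        trial_result r = run_trial(len);
        tried++;
        if (r.max_err >= err_limit)
            continue;               /* reject lengths with unsafe error */
        if (best_len == 0 || r.ms_per_iter < best_ms) {
            best_len = len;
            best_ms = r.ms_per_iter;
        }
    }
    return best_len;
}

/* Stand-in for a real GPU benchmark: a made-up cost model in which
   2304k happens to be the cheapest length. NOT real timing data. */
static trial_result fake_trial(int len)
{
    trial_result r;
    int k = len / 1024;

    r.ms_per_iter = (k == 2304) ? 3.2 : 3.4 + 0.0005 * (k - 2240);
    r.max_err = 0.1;
    return r;
}

/* Illustrative slice of the length table around 2M to 3M. */
static const int demo_slice[] = { 2048, 2240, 2304, 2400, 2560, 2688, 2880 };
```

Capping the number of trials with `max_trials` is one way to address the "how many FFTs to test before it's a waste of time" question: with a cap of 1 the function degenerates to the current first-fit behaviour, and larger caps trade startup time for a better chance of finding a faster length.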

One thing I noticed: when the .ini file contains a particular FFT length and the program needs to change FFT sizes, it always goes up. However, I was testing smaller exponents that needed smaller FFTs (it took me a while to figure out why the program was failing; then I remembered the FFT size in the .ini file). The test mentioned above could also be used to select correct FFTs for all exponents when the default FFT is too big for the exponent (which caused serious rounding errors). (I guess if the -fft switch can be implemented, there will be no reason to specify FFTs in the .ini file. One could still put an incorrect FFT in the worktodo, though.)


Thoughts?

------
So far, testing of the new 2.04 beta is going well, for me. I was able to place many smaller exponents in the worktodo file and they all continued just fine. My DC still has a while left though...

[B]How is the testing going for everyone else?[/B]

kladner 2012-08-30 13:21

[QUOTE=flashjh;309712]
------
So far, testing of the new 2.04 beta is going well, for me. I was able to place many smaller exponents in the worktodo file and they all continued just fine. My DC still has a while left though...

[B]How is the testing going for everyone else?[/B][/QUOTE]

I have successfully completed 13 DC's and 2 LL's with 2.04-Beta-3.2-sm_13-x64. I think there were two times when I saw the Corrupt Save File cause a restart. I spotted these pretty quickly and was able to resume from a very recent good Save File with little lost work time.

flashjh 2012-08-30 15:23

[QUOTE=kladner;309735]I have successfully completed 13 DC's and 2 LL's with 2.04-Beta-3.2-sm_13-x64. I think there were two times when I saw the Corrupt Save File cause a restart. I spotted these pretty quickly and was able to resume from a very recent good Save File with little lost work time.[/QUOTE]

Have you switched to the updated 2.04 beta? Have you had any file locking problems with the new one?

kladner 2012-08-30 16:29

Would this creation date be the latest?[INDENT]Friday, August 03, 2012, 9:21:17 AM
[/INDENT]I just downloaded it to be sure, but the one I was running has the same date. So... I guess I probably have been running the latest version.

I confess that I do not entirely understand the file locking issue.

I think most or all of the savefile corruption episodes were associated with unrelated (I think) BSODs. I have not seen CL restart (corrupt savefile) in the last 5-6 runs.

Please ask if there's other data you want.

Thanks to flash and dubslow (EDIT: and LaurV!) for all their work on this project. Bravo, Guys!

Dubslow 2012-08-30 18:01

[QUOTE=kladner;309748]Thanks to flash and dubslow (EDIT: and LaurV!) for all their work on this project. Bravo, Guys![/QUOTE]
Don't forget msft! He does all the mathy stuff :smile:

kladner 2012-08-30 18:23

[QUOTE=Dubslow;309762]Don't forget msft! He does all the mathy stuff :smile:[/QUOTE]

That's always the hazard of giving credit: leaving someone out.

Thanks msft! Sorry for the omission.

flashjh 2012-08-30 18:24

[QUOTE=kladner;309748]Would this creation date be the latest?[INDENT]Friday, August 03, 2012, 9:21:17 AM
[/INDENT][/QUOTE][INDENT]Go [URL="http://sourceforge.net/projects/cudalucas/files/2.04%20Beta/"]here[/URL]. The latest build is 28 Aug 2012.
[/INDENT][QUOTE=Dubslow;309762]Don't forget msft! He does all the mathy stuff :smile:[/QUOTE]
Agree, and many others! I just make it compile on Windows :smile:

kladner 2012-08-30 18:29

Thanks Jerry. Done!


All times are UTC.