mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

flashjh 2012-09-26 00:28

1 Attachment(s)
[QUOTE=Dubslow;312748]Since you asked for it... this isn't the first time a too-aggressive FFT length has been picked. If you can, please run a test of the dinky little program I posed in [URL="http://www.mersenneforum.org/showthread.php?p=306898#post306898"]this first discussion[/URL] of the issue. (Unfortunately, I can't compile it, so you'll have to ask flash or start playing around with MSVS -- it's a very simple program, and so should be quite a bit easier to compile than CUDALucas.)[/QUOTE]

Attached the compiled program. It works, successfully converted a 1600K to 1728K on M27232109. I tried converting a 1440K to 1600K and 1728K and it didn't work. I kept getting a roundoff error. I'll test some more when I get a chance. This is all on a GT 430.

@Dubslow - what successful conversions have you made?

Dubslow 2012-09-26 02:49

[QUOTE=flashjh;312781]@Dubslow - what successful conversions have you made?[/QUOTE]

...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.
[QUOTE=flashjh;312781]
Attached the compiled program. It works, successfully converted a 1600K to 1728K on M27232109. I tried converting a 1440K to 1600K and 1728K and it didn't work. I kept getting a roundoff error. I'll test some more when I get a chance. This is all on a GT 430.[/QUOTE]
I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.

flashjh 2012-09-26 02:53

[QUOTE=Dubslow;312786]...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.

I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.[/QUOTE]
That may be the problem, I started the program and stopped it really early and tried the conversion. I never did one that had run for a while. I'll try different settings and conversions over the next few days.

Good luck on your exam! :yucky:

flashjh 2012-09-26 04:59

1 Attachment(s)
[QUOTE=Dubslow;312786]...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.

I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.[/QUOTE]
Attached some test results. Unfortunately, the results are not good. After a few scenarios, I discovered that the residues do not match.

I started M27232109 from an old save file and then used a converted version of the same save file, and got residue mismatches. I then started the same exponent from scratch, let it run for a while, and tried a couple of different things. Basically, converted files don't give good residues. The only good thing was that I double-converted a file (1536 -> 1600 -> 1728) and did get a matching residue from the run that went straight from 1536 to 1728.

Is there hope for on-the-fly FFT conversion or is this the end?

Dubslow 2012-09-26 05:02

[QUOTE=flashjh;312797]Attached some test results. Unfortunately, the results are not good. After a few scenarios, I discovered that the residues do not match.

I started M27232109 from an old save file and then used a converted version of the same save file, and got residue mismatches. I then started the same exponent from scratch, let it run for a while, and tried a couple of different things. Basically, converted files don't give good residues. The only good thing was that I double-converted a file (1536 -> 1600 -> 1728) and did get a matching residue from the run that went straight from 1536 to 1728.

Is there hope for on-the-fly FFT conversion or is this the end?[/QUOTE]

Unless you know something I don't, that's the end of it. I have no justification of any sort, like I said, I just wanted to try it and see what happened :razz:

Edit: I sorta take that back, depending on your answer to this: the double-conversion, did you let that run a few iterations at the middle length before the second conversion? Otherwise the effect would be exactly the same, so of course they match. If you [i]did[/i] run a few iterations at the middle length, then there is hope, but I don't think I can do it; what the conversion does is pad the new/extra space with 0 (I couldn't think of anything better) so it might be that a different padding would work properly. (Among other things, I have no idea how to represent a bigint in an array of doubles.)
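For what it's worth, a length change can only preserve the residue if the integer is rebuilt first and then re-split for the new length. Here is a minimal Python sketch under a big simplification: it uses a plain fixed-base word array, while CUDALucas save files hold a variable-base (IBDWT) array of doubles whose per-word bases depend on the exponent and the FFT length -- which is exactly why zero-padding the raw array can change the number it represents:

```python
def digits_to_int(digits, bits_per_word):
    """Reassemble the residue integer from its word array (word 0 is
    least significant)."""
    value = 0
    for d in reversed(digits):
        value = (value << bits_per_word) | d
    return value

def int_to_digits(value, bits_per_word, length):
    """Re-split the integer into a word array of the new length."""
    mask = (1 << bits_per_word) - 1
    digits = []
    for _ in range(length):
        digits.append(value & mask)
        value >>= bits_per_word
    return digits

# Round trip: a residue held in 4 words survives conversion to 6 words
# because the integer is rebuilt first. Naive zero-padding of the array
# would only be equivalent if every word used the same base, which the
# IBDWT words in a real save file do not.
old = [5, 7, 0, 3]
n = digits_to_int(old, 16)
new = int_to_digits(n, 16, 6)
assert digits_to_int(new, 16) == n
```

This is only a sketch of the arithmetic, not of the actual save-file format; the hard part Dubslow mentions (representing the bigint in the weighted array of doubles) is precisely what the fixed base here sidesteps.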

flashjh 2012-09-26 05:09

[QUOTE=Dubslow;312798]Unless you know something I don't, that's the end of it. I have no justification of any sort, like I said, I just wanted to try it and see what happened :razz:[/QUOTE]
Nothing at the moment.

I've not seen it happen before: what does Prime95 do when the error gets too large? Does it quit, or does it increase the FFT size and continue?

LaurV 2012-09-26 07:57

[QUOTE=kladner;312780]Interestingly, 1536K isn't on LaurV's list.[/QUOTE]
This means that [U]for my cards[/U] (a GTX 580, but some things were also tested with a 570 and a Tesla C2050) there is a length which is higher, and faster. "The list" shows the best speed/accuracy trade-off. I first made a list with all values from min to max (like 1, 2, 3, 4, 5... etc.). Then I timed every one of them and made a table: (size, time). Then, for each size in the table, I cut out all rows which have a smaller size but a bigger time. This way (again: for my cards and my system, I have no idea how system-dependent it is), every entry that survives has no alternative which is [B]higher in size[/B] (i.e. more accurate, lower errors) and [B]faster in speed[/B]. Looking at the list, 1568K is the culprit. You can read below that post with the list: [QUOTE]Using 1568k instead of 1536k, you get about 6%-8% faster.[/QUOTE]I don't know why the gtx580 seems to pair better with 7 (1568=2^5*7^2) than with 3 (1536=2^9*3) this time. Try both of them with -cufftbench on your card and see which one is faster (I am curious to know too!).
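The table-pruning step described above is a Pareto filter over (size, time) pairs. A small sketch with made-up timings (purely illustrative, not from any real bench):

```python
def prune(rows):
    """rows: list of (size_k, ms_per_iter) pairs.
    Drop any row that is dominated, i.e. some other row has a larger
    size (more accuracy) and a smaller-or-equal time (more speed)."""
    keep = []
    for size, t in rows:
        dominated = any(s2 > size and t2 <= t for s2, t2 in rows)
        if not dominated:
            keep.append((size, t))
    return sorted(keep)

# Hypothetical timings: 1472K and 1536K get dropped because a larger
# length (1504K and 1568K respectively) is also faster on this
# imagined card, which mirrors the 1536K vs 1568K observation.
bench = [(1440, 1.81), (1472, 2.10), (1504, 2.05),
         (1536, 5.20), (1568, 4.90), (1600, 5.55)]
assert [s for s, _ in prune(bench)] == [1440, 1504, 1568, 1600]
```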

Pairing those butterflies from the FFT multiplication with the number of threads is tricky. To see what it is like, imagine you have a frying pan which can cook two donuts at the same time. Cooking one donut on one side takes 1 minute, and each donut has to be cooked on both sides. If you have to cook 5 donuts, this means [B]theoretically[/B] you have 10 sides to cook, and you will be done in 5 minutes, cooking two sides at a time (of different donuts, of course:razz:). That is theoretically. [B]Practically[/B], the thing is tricky. You can cook two donuts first, one side and then the other, and you have spent 2 minutes. Repeat with another two donuts, cooking them on both sides, and you have already spent 4 minutes and cooked 4 donuts, 8 sides altogether. But now you have only one donut left, and you still need two minutes to cook it, one minute per side. Of course, during those two minutes you "wasted" half of the energy, as the frying pan was half empty. And you spent 6 minutes to cook the 5 donuts. You can still cook them in 5 minutes, but this involves "trickery". No, you are not allowed to split the donuts or the pan, nor to put more than two in the pan. But after you cook the first two, spending 2 minutes, you still have 3 to cook (say x, y, z), and you can cook them in the following way: put donuts x and y on one side and cook them for one minute; then flip donut x, take out donut y, put in donut z, and cook x and z for a second minute. After two minutes (four in total, counting from the start with all 5 donuts), you have x cooked on both sides and y, z cooked on one side each. Spend the last minute finishing y and z, and you are done with all of them in 5 minutes.

How about cooking 1103 pieces of 17-sided donuts in a 23-slots frying pan? :razz: (this is good for a post in the puzzle thread, for a general case, hehe).

Now imagine your GPU is a big frying pan, able to cook 256, 512, or 1024 [strike]threads[/strike] donuts at the same time, but you have an arbitrary number of complex [strike]fft multiplications[/strike] donuts to cook, some with 2 sides, some with more, and you have to cook them in groups (like you can't split these 7 pieces, or those 29 pieces: each group has to be cooked at the same time). The "granularity" of the FFT gives the cooking rules. The size gives the number of donuts. You may be able to cook them faster, even if you have more, if the rules are flexible enough to allow you more freedom with pairing. But this may be related to how big your frying pan is, too...
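The donut arithmetic above generalizes to a simple lower bound: with a pan cooking `slots` sides per minute and `n` donuts of `sides` sides each, no schedule can beat max(sides, ceil(n*sides/slots)) minutes, since one donut can only present one side at a time. A sketch that also evaluates the 1103/17/23 puzzle (as a bound only, with no claim here that it is always achievable):

```python
import math

def lower_bound_minutes(n_donuts, sides, slots):
    """Two limits apply: a single donut needs `sides` minutes of pan
    time on its own, and the pan cooks at most `slots` sides per
    minute overall."""
    return max(sides, math.ceil(n_donuts * sides / slots))

# The worked example from the post: 5 two-sided donuts, 2 slots -> the
# clever x/y/z schedule meets this bound exactly.
assert lower_bound_minutes(5, 2, 2) == 5

# The puzzle: 1103 donuts with 17 sides each in a 23-slot pan.
print(lower_bound_minutes(1103, 17, 23))  # 816
```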

kladner 2012-09-26 14:26

[QUOTE]I don't know why it seems that gtx580 makes better pairing with 7 (1568=2^5*7^2) than with 3 (1536=2^9*3) this time. Try both of them with -cufftbench on your card and see which one is faster (I am curious to know too!). [/QUOTE]It seems to cook faster in my frying pan (GTX 460) at 1536K. I wonder how your cufftbench compares to mine (posted above) in this range. Could the number of CUDA cores or SM units or the bus width explain the difference in speed?

[CODE]1536K - Note: the 2nd line covers twice the iterations.
Iteration 12550000 M( 27278xxx )C, 0xe7043e1df2c39cf7, n = 1536K, CUDALucas v2.04 Beta err = 0.0703 (4:03 real, 4.8622 ms/iter, ETA 19:51:13)
Iteration 12700000 M( 27278xxx )C, 0xcdbfe69d44c35dfc, n = 1536K, CUDALucas v2.04 Beta err = 0.0664 (8:07 real, 4.8707 ms/iter, ETA 19:37:04)

1568K
Iteration 150000 M( 27278xxx )C, 0x9498598baf7dd0f3, n = 1568K, CUDALucas v2.04 Beta err = 0.0425 (4:15 real, 5.0981 ms/iter, ETA 38:22:38)

1600K
Iteration 100000 M( 27278xxx )C, 0x4a2808ec7f06a47d, n = 1600K, CUDALucas v2.04 Beta err = 0.0234 (4:37 real, 5.5513 ms/iter, ETA 41:51:58)[/CODE]EDIT: A definite plus for me from all this is that I think I finally grasp the FFT selection process. I never had quite understood before.

LaurV 2012-09-26 16:28

1 Attachment(s)
[QUOTE=kladner;312846]I wonder how your cufftbench compares to mine (posted above) in this range. Could the number of CUDA cores or SM units or the bus width explain the difference in speed?
[/QUOTE]
My -cufftbench was done long ago, about 20 times, with different configurations (p95 running or not, aliquot sequences factoring, yafu running or not, video playing or not, AutoCAD/Protel running or not, etc.). The average of ALL THOSE RUNS is in column E of the table below, and the table is sorted by it in increasing order. When I run some exponent, I see what CL selects and how big the error is, and I enter that FFT value in the yellow cell F2 of the table. The formula in cell F25, for example, is
[CODE]=IF(G25>=$F$2,"<<This","-")[/CODE] and so on for all of column F.
The next step is to look for the first "<<This" in column F and use that value [B]instead[/B] of the one recommended by CL. I (almost) always get it faster :D
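The spreadsheet lookup can be sketched in a few lines: the sizes are listed fastest first (column E order), and we take the first one at least as large as CL's selection, mirroring `=IF(G25>=$F$2,"<<This","-")`. The ordering below is hypothetical, just for illustration:

```python
def pick_fft(sorted_by_speed, selected_k):
    """sorted_by_speed: FFT sizes in K, fastest first.
    Return the fastest size no smaller than CL's own choice."""
    for size in sorted_by_speed:
        if size >= selected_k:
            return size
    return selected_k  # fall back to CL's selection

# Hypothetical column E ordering where 1568K beats 1536K: when CL
# recommends 1536K, the first qualifying entry is 1568K.
order = [1440, 1568, 1600, 1536, 1728]
assert pick_fft(order, 1536) == 1568
```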

LaurV 2012-09-26 17:08

1 Attachment(s)
And a log report I just did (I could not attach 2 files). The exponent was chosen at random, the next "nice looking" prime after your double-27-xxxx :smile: (i.e. my exponent is nicer because it is triple-27, and 27 is 3 to the third power, hehe).

[ATTACH]8668[/ATTACH]

The attached file is the log for a gtx580 card, no overclock, P95 running in the background. If you open the file and look inside, you will notice that the -polite option does not work on the command line if the ini file specifies a different value. I have no idea why, but it was easy to spot (I argued about this with somebody here a long time ago: if the times are not constant, something is stealing clocks, and as long as nothing else was running, I suspected the polite switch -- see the first run in the file, for FFT=1440K. When CL runs in aggressive mode and nobody else steals GPU clocks, the times must be fixed and NOT oscillating). I had no time (and didn't want) to edit the files, so I pressed p each time. Note that when you press p, the next row displayed is not "reliable", as part of those iterations was done before the keypress (in polite mode) and part after it (in aggressive mode). So only the rows after the first row following the keypress matter.

The "del" lines are to delete checkpoint files before switching to a new fft size, otherwise the old one is resumed (I forgot to delete it once!).

Note that CL is a bit stupid when 1472K/1504K is selected: it switches to a higher FFT for SOME of the exponents in the range (only for SOME), which is not normal if you use 256 threads (for 512 threads the FFT must be a multiple of 64K, and there it may be normal). Anyhow, for 1472K and 1504K, when they DO run, they are slower than 1536K and 1568K.

Also note that the best choice after 1600K is much higher, 1728K; the others in between are really bad. To make it clear, I ended the log with a -cufftbench, and if you like you can convert the raw sizes to K by dividing by 1024 (e.g. 1474560 is 1440K, 1572864 is 1536K, etc.).

And to make you green with envy :razz: here is the "get the smoke out of it" version, put inline (I still can't attach two files, grrr). This is with the card overclocked 30%, P95 stopped, the water pump at max speed, and the second card doing nothing (even water cooling is not enough to cool them both at such speed). This is for demo purposes only, and it will always crash or give strange errors and mismatching residues, but it is nice to see the timing anyhow :razz:

[CODE]e:\-99-Prime\CudaLucas\CL1>cl204b4020x64 -d 1 -c 10000 -f 1440k -s backup1 -t -polite 0 -k 27272723

mkdir: cannot create directory `backup1': File exists
Starting M27272723 fft length = 1440K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.17627, max error = 0.24609
Iteration 200, average error = 0.20707, max error = 0.26563
Iteration 300, average error = 0.21572, max error = 0.25000
Iteration 400, average error = 0.22152, max error = 0.26563
Iteration 500, average error = 0.22436, max error = 0.28906
Iteration 600, average error = 0.22512, max error = 0.25391
Iteration 700, average error = 0.22587, max error = 0.25000
Iteration 800, average error = 0.22717, max error = 0.25098
Iteration 900, average error = 0.22826, max error = 0.25781
Iteration 1000, average error = 0.22902 < 0.25 (max error = 0.25391), continuing test.
p
-polite 0
Iteration 10000 M( 27272723 )C, 0x47acf390a9fa95a4, n = 1440K, CUDALucas v2.04 Beta err = 0.2910 (0:19 real, 1.9369 ms/iter, ETA 14:40:18)
Iteration 20000 M( 27272723 )C, 0x8eee5fea6b377293, n = 1440K, CUDALucas v2.04 Beta err = 0.2969 (0:19 real, 1.8131 ms/iter, ETA 13:43:25)
Iteration 30000 M( 27272723 )C, 0x5231010685e0ed53, n = 1440K, CUDALucas v2.04 Beta err = 0.3047 (0:18 real, 1.8132 ms/iter, ETA 13:43:27)
Iteration 40000 M( 27272723 )C, 0x862d8eb96bd24428, n = 1440K, CUDALucas v2.04 Beta err = 0.2969 (0:18 real, 1.8131 ms/iter, ETA 13:43:04)
Iteration 50000 M( 27272723 )C, 0x38bc606b9b959f78, n = 1440K, CUDALucas v2.04 Beta err = 0.2813 (0:18 real, 1.8131 ms/iter, ETA 13:42:46)
SIGINT caught, writing checkpoint. Estimated time spent so far: 1:36
[/CODE]

kladner 2012-09-26 17:24

Hey LaurV,
Many thanks for that load of data. I think I see the point you are making. There is certainly a lot for me to learn from your posts. :goodposting:

EDIT: I should note that I always ignore the first output line, either at the beginning or after p=0.

