mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

kladner 2012-09-26 00:21

1 Attachment(s)
[QUOTE=flashjh;312777]Some testing still needs to be done, but LaurV put together a list of FFTs that perform better [URL="http://www.mersenneforum.org/showthread.php?p=310136#post310136"]here[/URL].

It may be worthwhile to do testing on your 460 and see if the results match.[/QUOTE]

I don't know why the timing went down. The previous expo was getting ~5.2433 ms/iter, while the current one is doing ~4.8614 ms/iter, and the two exponents agree in at least their first five digits. Interestingly, 1536K isn't on LaurV's list.

EDIT: The results of cufftbench are attached.

flashjh 2012-09-26 00:28

1 Attachment(s)
[QUOTE=Dubslow;312748]Since you asked for it... this isn't the first time a too-aggressive FFT length has been picked. If you can, please run a test of the dinky little program I posed in [URL="http://www.mersenneforum.org/showthread.php?p=306898#post306898"]this first discussion[/URL] of the issue. (Unfortunately, I can't compile it, so you'll have to ask flash or start playing around with MSVS -- it's a very simple program, and so should be quite a bit easier to compile than CUDALucas.)[/QUOTE]

Attached is the compiled program. It works: it successfully converted a 1600K save file to 1728K on M27232109. I tried converting a 1440K file to 1600K and to 1728K and it didn't work; I kept getting a roundoff error. I'll test some more when I get a chance. This is all on a GT 430.

@Dubslow - what successful conversions have you made?

Dubslow 2012-09-26 02:49

[QUOTE=flashjh;312781]@Dubslow - what successful conversions have you made?[/QUOTE]

...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.
[QUOTE=flashjh;312781]
Attached the compiled program. It works, successfully converted a 1600K to 1728K on M27232109. I tried converting a 1440K to 1600K and 1728K and it didn't work. I kept getting a roundoff error. I'll test some more when I get a chance. This is all on a GT 430.[/QUOTE]
I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.

flashjh 2012-09-26 02:53

[QUOTE=Dubslow;312786]...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.

I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.[/QUOTE]
That may be the problem: I started the program, stopped it really early, and tried the conversion. I never did one that had run for a while. I'll try different settings and conversions over the next few days.

Good luck on your exam! :yucky:

flashjh 2012-09-26 04:59

1 Attachment(s)
[QUOTE=Dubslow;312786]...not sure if I've ever tried. I have an exam tomorrow though... I'll try and do it over the weekend.

I have no idea if this is mathematically valid or not, I'm just winging it. For the one you said worked, did you get the same interim residues with the older and newer length? For the ones that didn't work, try running the test from the start with the longer FFTs -- too long can cause errors just the same as too short, so we should eliminate that as the cause.[/QUOTE]
Attached are some test results. Unfortunately, they are not good. After a few scenarios, I discovered that the residues do not match.

I started M27232109 from an old save file, then used a converted version of the same save file, and got residue mismatches. I then started the same exponent from scratch, let it run for a while, and tried a couple of different things. Basically, converted files don't give good residues. The only good thing was that a double-converted file (1536K -> 1600K -> 1728K) did match the residue of the file converted straight from 1536K to 1728K.

Is there hope for on-the-fly FFT conversion or is this the end?

Dubslow 2012-09-26 05:02

[QUOTE=flashjh;312797]Attached are some test results. Unfortunately, they are not good. After a few scenarios, I discovered that the residues do not match.

I started M27232109 from an old save file, then used a converted version of the same save file, and got residue mismatches. I then started the same exponent from scratch, let it run for a while, and tried a couple of different things. Basically, converted files don't give good residues. The only good thing was that a double-converted file (1536K -> 1600K -> 1728K) did match the residue of the file converted straight from 1536K to 1728K.

Is there hope for on-the-fly FFT conversion or is this the end?[/QUOTE]

Unless you know something I don't, that's the end of it. I have no justification of any sort, like I said, I just wanted to try it and see what happened :razz:

Edit: I sorta take that back, depending on your answer to this: for the double conversion, did you let it run a few iterations at the middle length before the second conversion? Otherwise the effect would be exactly the same, so of course they match. If you [i]did[/i] run a few iterations at the middle length, then there is hope, but I don't think I can do it; what the conversion does is pad the new/extra space with 0 (I couldn't think of anything better), so it might be that a different padding would work properly. (Among other things, I have no idea how to represent a bigint in an array of doubles.)
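For what it's worth, the standard way to put a bigint into an array of doubles for this kind of transform is a variable-base balanced-digit representation in the style of the Crandall-Fagin IBDWT (which is what the naive zero-padding ignores). A rough Python sketch, with all function names my own and the final wrap-around carry mod 2^p-1 deliberately ignored:

```python
# Sketch (my own, not CUDALucas code): variable-base balanced-digit
# representation of a big integer in an array of doubles, in the style
# of the Crandall-Fagin IBDWT. Word i carries a variable number of bits
# so that the n words together hold exactly p bits.
import math

def word_bits(p, n):
    """Bits carried by each of the n FFT words for exponent p."""
    return [math.ceil(p * (i + 1) / n) - math.ceil(p * i / n)
            for i in range(n)]

def to_words(x, p, n):
    """Split integer x into n balanced digits stored as doubles.

    Balancing keeps each digit in (-base/2, base/2], which is what
    keeps FFT round-off errors small. A final wrap-around carry
    (mod 2^p - 1) is ignored in this sketch.
    """
    words = []
    carry = 0
    for b in word_bits(p, n):
        base = 1 << b
        w = (x + carry) % base
        x = (x + carry) >> b
        carry = 0
        if w > base // 2:       # push the excess into the next word
            w -= base
            carry = 1
        words.append(float(w))
    return words

def from_words(words, p, n):
    """Reassemble the integer from the balanced digits."""
    x, shift = 0, 0
    for w, b in zip(words, word_bits(p, n)):
        x += int(w) << shift
        shift += b
    return x
```

This also hints at why zero-padding a save file fails: changing n changes every word's bit count, so the whole number has to be reassembled and re-split, not just extended with zeros.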

flashjh 2012-09-26 05:09

[QUOTE=Dubslow;312798]Unless you know something I don't, that's the end of it. I have no justification of any sort, like I said, I just wanted to try it and see what happened :razz:[/QUOTE]
Nothing at the moment.

I've not seen it before: what does Prime95 do when the error gets too large? Does it quit, or increase the FFT size and continue?

LaurV 2012-09-26 07:57

[QUOTE=kladner;312780]Interestingly, 1536K isn't on LaurV's list.[/QUOTE]
This means that [U]for my cards[/U] (gtx580, but some things tested also with a 570 and a Tesla C2050) there is one which is higher, and faster. "The list" shows the best speed/accuracy trade-off. I first made a list with all values from min to max (like 1, 2, 3, 4, 5, ... etc). Then I tested every one of them and made a table: (size, time). Then, for each size, I cut from the table every row that some other row beats in both ways: larger size [B]and[/B] smaller time. This way (repeat: for my cards, my system; no idea how system-dependent it is) there is a value which is [B]higher in size[/B] (i.e. more accurate, lower errors) and [B]faster in speed[/B]. Looking at the list, 1568K is the criminal. You can read below that post with the list, [QUOTE]Using 1568k instead of 1536k, you get about 6%-8% faster.[/QUOTE]I don't know why the gtx580 seems to pair better with 7 (1568=2^5*7^2) than with 3 (1536=2^9*3) this time. Try both of them with -cufftbench on your card and see which one is faster (I am curious to know too!).
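LaurV's pruning rule is a Pareto filter over (size, time) pairs: drop any FFT size for which some larger size is also faster. A hedged Python sketch, with the benchmark timings invented purely for illustration (roughly shaped like his 1568K-beats-1536K observation):

```python
def prune_fft_table(rows):
    """Keep only (size, time) rows not dominated by a larger-and-faster size.

    A row is dropped when some other row has BOTH a larger FFT size
    (more accurate, lower errors) and a smaller per-iteration time.
    """
    return [(s, t) for (s, t) in rows
            if not any(s2 > s and t2 < t for (s2, t2) in rows)]

# Invented (size in K, ms/iter) pairs for illustration only:
bench = [(1440, 4.90), (1472, 5.30), (1504, 5.25),
         (1536, 5.10), (1568, 4.95), (1600, 5.55)]
print(prune_fft_table(bench))
```

With these made-up numbers, 1472K, 1504K, and 1536K all drop out because 1568K is both larger and faster, which is exactly the shape of "the list".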

Pairing those butterflies from the FFT multiplication with the number of threads is tricky. To see what it is like, imagine you have a frying pan which can cook two donuts at the same time. Cooking one donut on one side takes 1 minute, and you have to cook each of them on both sides. If you have to cook 5 donuts, this means [B]theoretically[/B] you have 10 sides to cook, and you will be done in 5 minutes, cooking two sides at a time (of different donuts, of course :razz:). That is theoretically. [B]Practically[/B], the thing is tricky. You can cook two donuts first, on one side, then on the other side, and you have spent 2 minutes. Repeat with another two donuts, cook them on both sides, and you have spent already 4 minutes and cooked 4 donuts, altogether 8 sides. But now you have only one donut left, and you still need two minutes to cook it: one minute on one side and another minute on the other. Of course, during those two minutes you "wasted" half of the energy, as the frying pan was half empty. And you spent 6 minutes to cook the 5 donuts. Of course, you still can cook them in 5 minutes, but this involves "trickery". No, you are not allowed to split the donuts or the pan, nor to put more than two in the pan. But after you cook the first two, spending 2 minutes, you still have 3 to cook (say x, y, z), and you can cook them in the following way: put donuts x and y in on one side, cook them one minute, flip donut x to the other side, take out donut y, put in donut z, and cook them (x and z) for a second minute. After two minutes (four in total, counting from the start with all 5 donuts), you have x cooked on both sides, and y, z cooked on one side each. Spend the last minute finishing y and z, and you are done with all of them in 5 minutes.

How about cooking 1103 pieces of 17-sided donuts in a 23-slots frying pan? :razz: (this is good for a post in the puzzle thread, for a general case, hehe).

Now imagine your GPU is a big frying pan, able to cook 256, 512, 1024 [strike]threads[/strike] donuts at the same time, but you have an arbitrary number of complex [strike]fft multiplications[/strike] donuts to cook, some with 2 sides, some with more sides, and you have to cook them in groups (like you can't split these 7 pieces, or those 29 pieces; you have to cook them all at the same time). The "granularity" of the FFT gives the cooking rules. The size gives the number of donuts. You may be able to cook them faster, even if you have more, if the rules are flexible enough to allow you more freedom with pairing. But this may also be related to how big your frying pan is...
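The donut puzzle is the classic preemptive scheduling problem on identical machines: when you may interrupt and resume cooking (and no donut cooks two sides at once), the optimal time is max(longest single donut, ceil(total sides / slots)), and McNaughton's wrap-around rule shows that bound is achievable. A quick sketch, with the function name my own:

```python
import math

def min_cook_time(sides_per_donut, slots):
    """Optimal time to cook all donut-sides, one side at a time per
    donut, with `slots` sides cooking in parallel.

    This is the classic preemptive-scheduling makespan bound:
    max(longest job, ceil(total work / machines)). McNaughton's
    wrap-around rule makes it achievable when interruptions are
    allowed, which is exactly LaurV's flip-and-swap trick.
    """
    total = sum(sides_per_donut)
    return max(max(sides_per_donut), math.ceil(total / slots))

# 5 two-sided donuts in a 2-slot pan: 5 minutes, not 6.
print(min_cook_time([2] * 5, 2))        # 5
# LaurV's puzzle: 1103 donuts of 17 sides each, 23 slots.
print(min_cook_time([17] * 1103, 23))   # 816
```

The grouping constraints of a real FFT kernel are of course harsher than this idealized pan, which is why the achievable occupancy depends on the "granularity" LaurV describes.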

kladner 2012-09-26 14:26

[QUOTE]I don't know why it seems that gtx580 makes better pairing with 7 (1568=2^5*7^2) than with 3 (1536=2^9*3) this time. Try both of them with -cufftbench on your card and see which one is faster (I am curious to know too!). [/QUOTE]It seems to cook faster in my frying pan (GTX 460) at 1536K. I wonder how your cufftbench compares to mine (posted above) in this range. Could the number of CUDA cores or SM units or the bus width explain the difference in speed?

[CODE]1536K -Note: 2nd line twice the iterations.
Iteration 12550000 M( 27278xxx )C, 0xe7043e1df2c39cf7, n = 1536K, CUDALucas v2.04 Beta err = 0.0703 (4:03 real, 4.8622 ms/iter, ETA 19:51:13)
Iteration 12700000 M( 27278xxx )C, 0xcdbfe69d44c35dfc, n = 1536K, CUDALucas v2.04 Beta err = 0.0664 (8:07 real, 4.8707 ms/iter, ETA 19:37:04)

1568K
Iteration 150000 M( 27278xxx )C, 0x9498598baf7dd0f3, n = 1568K, CUDALucas v2.04 Beta err = 0.0425 (4:15 real, 5.0981 ms/iter, ETA 38:22:38)

1600K
Iteration 100000 M( 27278xxx )C, 0x4a2808ec7f06a47d, n = 1600K, CUDALucas v2.04 Beta err = 0.0234 (4:37 real, 5.5513 ms/iter, ETA 41:51:58)[/CODE]EDIT: A definite plus for me from all this is that I think I finally grasp the FFT selection process. I had never quite understood it before.

LaurV 2012-09-26 16:28

1 Attachment(s)
[QUOTE=kladner;312846]I wonder how your cufftbench compares to mine (posted above) in this range. Could the number of CUDA cores or SM units or the bus width explain the difference in speed?
[/QUOTE]
I did my -cufftbench runs long ago, about 20 times, with different configurations (P95 running or not, aliquots factoring, yafu-ing or not, video playing or not, AutoCAD/Protel running or not, etc.). The average of ALL THOSE RUNS is in column E of the table below, and the table is sorted by it in increasing order. When I run some expo, I see what CL selects and how big the error is, and I enter that FFT value in the yellow cell F2 of the table. The formula in cell F25, for example, is
[CODE]=IF(G25>=$F$2,"<<This","-")[/CODE] and so for all column F.
The next step is to look for the first "<<This" in column F and use that value [B]instead[/B] of the one recommended by CL. I (almost) always get it faster :D
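In other words, the sheet takes the time-sorted benchmark column and picks the first size at least as large as CL's choice. A hedged Python equivalent (function name and timings invented for illustration):

```python
def better_fft(cl_choice, bench):
    """bench: list of (avg_time_ms, fft_size_k) tuples.

    Walk the table in increasing time order (like column E) and return
    the fastest size that is >= the size CL selected, mirroring the
    =IF(G25>=$F$2,"<<This","-") column in LaurV's spreadsheet.
    """
    for _, size in sorted(bench):        # increasing average time
        if size >= cl_choice:
            return size
    return cl_choice                     # nothing better: keep CL's pick

# Invented benchmark averages (ms/iter, size in K):
bench = [(4.95, 1568), (5.10, 1536), (5.55, 1600), (5.90, 1728)]
print(better_fft(1536, bench))
```

With these made-up numbers, CL's 1536K gets replaced by 1568K, which is both larger (safer) and faster.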

LaurV 2012-09-26 17:08

1 Attachment(s)
And a log report I just did (I could not attach 2 files). The exponent was chosen at random, the next "nice looking" prime after your double-27-xxxx :smile: (i.e. my exponent is nicer because it is triple-27, 27 being 3 to the third power, hehe).

[ATTACH]8668[/ATTACH]

The attached file is the log for a gtx580 card, no overclock, P95 running in the background. If you open the file and look inside, you will notice that the -polite option does not work on the command line if the ini file specifies a different value. I have no idea why, but it was easy to spot (I argued about this with somebody here a long time ago: if the times are not constant, something is stealing clocks, and as nothing else was running, I suspected the polite switch - see the first run in the file, for FFT=1440k. When CL runs in aggressive mode and nobody else steals GPU clocks, the times must be fixed, NOT oscillating). I had no time (and didn't want) to edit the files, so I pressed p each time. Note that when you press p, the next row displayed is not "reliable", as part of those iterations was done before you pressed p (in polite mode) and part after (in aggressive mode). So only the rows after the first row following the keypress matter.

The "del" lines are to delete checkpoint files before switching to a new fft size, otherwise the old one is resumed (I forgot to delete it once!).

Remark that CL is a bit stupid when 1472/1504k is selected: it switches to a higher FFT for SOME of the expos in the range (only for SOME), which is not normal if you use 256 threads (for 512 the FFT must be a multiple of 64k, and it may be normal). Anyhow, when 1472 and 1504 DO run, they are slower than 1536 and 1568.

Also remark that the best choice after 1600 is much higher, 1728; the others in between are really bad. To make it clear, I ended the log with a -cufftbench, and if you like you can convert the values to k by dividing by 1024 (1474560 is 1440k, 1572864 is 1536k, etc.).

And to make you blue with envy :razz: here is the "get the smoke out of me" version, put inline (I still can't attach two files, grrr). This is with the card OC'ed by 30%, P95 stopped, the water pump at max speed, and the second card doing nothing (even water cooling is not enough to cool them both at such speed). This is for demo purposes only, and it will always crash or give strange errors and mismatching residues, but it is nice to see the timing anyhow :razz:

[CODE]e:\-99-Prime\CudaLucas\CL1>cl204b4020x64 -d 1 -c 10000 -f 1440k -s backup1 -t -polite 0 -k 27272723

mkdir: cannot create directory `backup1': File exists
Starting M27272723 fft length = 1440K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.17627, max error = 0.24609
Iteration 200, average error = 0.20707, max error = 0.26563
Iteration 300, average error = 0.21572, max error = 0.25000
Iteration 400, average error = 0.22152, max error = 0.26563
Iteration 500, average error = 0.22436, max error = 0.28906
Iteration 600, average error = 0.22512, max error = 0.25391
Iteration 700, average error = 0.22587, max error = 0.25000
Iteration 800, average error = 0.22717, max error = 0.25098
Iteration 900, average error = 0.22826, max error = 0.25781
Iteration 1000, average error = 0.22902 < 0.25 (max error = 0.25391), continuing test.
p
-polite 0
Iteration 10000 M( 27272723 )C, 0x47acf390a9fa95a4, n = 1440K, CUDALucas v2.04 Beta err = 0.2910 (0:19 real, 1.9369 ms/iter, ETA 14:40:18)
Iteration 20000 M( 27272723 )C, 0x8eee5fea6b377293, n = 1440K, CUDALucas v2.04 Beta err = 0.2969 (0:19 real, 1.8131 ms/iter, ETA 13:43:25)
Iteration 30000 M( 27272723 )C, 0x5231010685e0ed53, n = 1440K, CUDALucas v2.04 Beta err = 0.3047 (0:18 real, 1.8132 ms/iter, ETA 13:43:27)
Iteration 40000 M( 27272723 )C, 0x862d8eb96bd24428, n = 1440K, CUDALucas v2.04 Beta err = 0.2969 (0:18 real, 1.8131 ms/iter, ETA 13:43:04)
Iteration 50000 M( 27272723 )C, 0x38bc606b9b959f78, n = 1440K, CUDALucas v2.04 Beta err = 0.2813 (0:18 real, 1.8131 ms/iter, ETA 13:42:46)
SIGINT caught, writing checkpoint. Estimated time spent so far: 1:36
[/CODE]

kladner 2012-09-26 17:24

Hey LaurV,
Many thanks for that load of data. I think I see the point you are making. There is certainly a lot for me to learn from your posts. :goodposting:

EDIT: I should note that I always ignore the first output line, either at the beginning or after p=0.

Dubslow 2012-09-26 20:29

[QUOTE=LaurV;312859]Remark that CL is a bit stupid when 1472/1504k is selected: it switches to a higher FFT for SOME of the expos in the range (only for SOME), which is not normal if you use 256 threads (for 512 the FFT must be a multiple of 64k, and it may be normal). Anyhow, when 1472 and 1504 DO run, they are slower than 1536 and 1568.

Also remark that the best choice after 1600 is much higher, 1728; the others in between are really bad. To make it clear, I ended the log with a -cufftbench, and if you like you can convert the values to k by dividing by 1024 (1474560 is 1440k, 1572864 is 1536k, etc.).[/QUOTE]

See, your mistake is assuming that CL is any kind of smart :razz:

Like I've mentioned before, selection just picks the smallest length from the list that's > exp/20, and the list was chosen from Prime95's jump tables. Clearly exp/20 isn't very good :razz: Even the test for a too-small length appears to be too aggressive (i.e. the first iterations' errors aren't very representative of the long-run average).
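Spelled out, that heuristic is just a threshold scan over the length table. A sketch of it in Python (the candidate list below is illustrative, not CL's actual table):

```python
def select_fft(exponent, lengths_k):
    """Pick the smallest FFT length (in K = 1024 words) exceeding
    exponent/20 bits, per the exp/20 heuristic described above.
    `lengths_k` stands in for CL's list taken from Prime95's jump
    tables; the values used below are an illustrative subset.
    """
    threshold = exponent / 20 / 1024     # required size in K words
    for k in sorted(lengths_k):
        if k > threshold:
            return k
    return max(lengths_k)                # exponent too big for the table

# Illustrative candidate list (K):
lengths = [1296, 1440, 1472, 1504, 1536, 1568, 1600, 1728]
print(select_fft(27232109, lengths))     # 1440
```

For M27232109 the threshold is about 1330K, so the rule lands on 1440K, matching what CL picked for the ~27.3M exponents in this thread; LaurV's benchmarks show that is not the fastest safe choice on his cards.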

Since kladner reports your timings aren't as good on a 460, it's reasonable to conclude that the best FFT lengths vary significantly with the speed/threads/memory/etc. of the card in use. I'll add the lengths you mention to CL's list, if they aren't there already, and then the README will get a paragraph about how the automatic selection really isn't optimal, and that the curious user should experiment to squeeze the most speed out of CL. (At least the automatic choice is close :razz:)

kladner 2012-09-27 13:53

Latest run GTX 460
 
Fortunately, this only ran a couple of hours before I caught it. 1440K had been selected, and the average error was in the range which eventually got stopped by -t on a previous expo. Too bad it's so close to the edge. In this case 1440K was faster.
These were run at Polite 90. I forgot to hit P on the first and didn't want to do it over.

Test results-
[CODE]Starting M27303xxx fft length = 1440K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.18088, max error = 0.25000
Iteration 200, average error = 0.20465, max error = 0.24219
Iteration 300, average error = 0.21246, max error = 0.26367
Iteration 400, average error = 0.21672, max error = 0.25000
Iteration 500, average error = 0.22009, max error = 0.29102
Iteration 600, average error = 0.22095, max error = 0.24219
Iteration 700, average error = 0.22143, max error = 0.26563
Iteration 800, average error = 0.22165, max error = 0.25000
Iteration 900, average error = 0.22200, max error = 0.28125
Iteration 1000, average error = 0.22289 < 0.25 (max error = 0.26563), continuing test.
Iteration 1000 M( 27303xxx )C, 0xbfe1e2e63170f3b9, n = 1440K, CUDALucas v2.04 Beta err = 0.0000 (0:06 real, 5.9563 ms/iter, ETA 45:10:19)
Iteration 2000 M( 27303xxx )C, 0x7f6a3d1c91672f83, n = 1440K, CUDALucas v2.04 Beta err = 0.2578 (0:05 real, 4.6708 ms/iter, ETA 35:25:16)
Iteration 3000 M( 27303xxx )C, 0x658167a19034ff0c, n = 1440K, CUDALucas v2.04 Beta err = 0.2656 (0:04 real, 4.6578 ms/iter, ETA 35:19:18)
Iteration 4000 M( 27303xxx )C, 0xcfd036cabb908829, n = 1440K, CUDALucas v2.04 Beta err = 0.2813 (0:05 real, 4.6606 ms/iter, ETA 35:20:30)
Iteration 5000 M( 27303xxx )C, 0x69aae7e03823395c, n = 1440K, CUDALucas v2.04 Beta err = 0.2734 (0:05 real, 4.6578 ms/iter, ETA 35:19:08)

1536K
Iteration 4000 M( 27303xxx)C, 0xcfd036cabb908829, n = 1536K, CUDALucas v2.04 Beta err = 0.0547 (0:05 real, 4.9678 ms/iter, ETA 37:40:15)

1568K
Iteration 4000 M( 27303xxx )C, 0xcfd036cabb908829, n = 1568K, CUDALucas v2.04 Beta err = 0.0361 (0:05 real, 5.2011 ms/iter, ETA 39:26:25)[/CODE]

patrik 2012-09-27 15:42

What does this error mean?
 
I'm starting to learn to run CUDALucas and I get the following error message:

[CODE]C:\Users\Patrik Johansson\Documents\CUDALucas>CUDALucas-2.03-cuda4.0-sm_20-x86-64.exe -t worktodo.txt

Warning: No ini file detected. Using defaults for non-specified options.
Starting M33271093 fft length = 1835008
iteration = 1301 >= 1000 && err = 0.5 >= 0.35, fft length = 1835008, writing checkpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>CUDALucas-2.03-cuda4.0-sm_20-x86-64.exe -t worktodo.txt

Warning: No ini file detected. Using defaults for non-specified options.
Continuing work from a partial result of M33271093 fft length = 1835008 iteration = 1202
iteration = 2601 >= 1000 && err = 0.5 >= 0.35, fft length = 1835008, writing checkpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>[/CODE]

What does this mean? Do I have to manually select a different FFT size?

EDIT: Downloading CUDALucas.ini seems to have solved the problem.

kladner 2012-09-27 16:16

@ Patrik

Glad you got it fixed. If you want to take the trouble, you could run -cufftbench as described at the end of the ini file. It is quite possible that you can get noticeably better performance from selecting the FFT length. There is considerable discussion of this in the last couple of pages of this thread. Note that you can leave FFT=0 (auto) in the ini, and specify a custom length on the worktodo line for the expo. This is also described in the last section of the ini.

Dubslow 2012-09-27 17:01

[QUOTE=kladner;312935]Note that you can leave FFT=0 (auto) in the ini, and specify a custom length on the worktodo line for the expo. This is also described in the last section of the ini.[/QUOTE]

Actually, that's only in 2.04, and he's using 2.03. (Note the full FFT length is given, not the length/1024.)

kladner 2012-09-27 17:16

[QUOTE=Dubslow;312938]Actually, that's only in 2.04, and he's using 2.03. (Note the full FFT length is given, not the length/1024.)[/QUOTE]

Oopsy! :redface: Thanks for correcting that.

patrik 2012-09-27 18:09

Well, the error occurred again. Does this mean that I should choose a different FFT size, or is it my hardware? The card I use (a Gigabyte GTX 570) is overclocked by default, IIRC.

[CODE]C:\Users\Patrik Johansson\Documents\CUDALucas>CUDALucas-2.03-cuda4.0-sm_20-x86-6
4.exe -t worktodo.txt

Starting M33271093 fft length = 1835008
Iteration 10000 M( 33271093 )C, 0x5348d62b85363b87, n = 1835008, CUDALucas v2.03
err = 0.2031 (0:38 real, 3.7536 ms/iter, ETA 34:40:44)
Iteration 20000 M( 33271093 )C, 0xd261b237d0ea981a, n = 1835008, CUDALucas v2.03
err = 0.2188 (0:37 real, 3.7551 ms/iter, ETA 34:40:55)
Iteration 30000 M( 33271093 )C, 0x0d91040de48abe77, n = 1835008, CUDALucas v2.03
err = 0.2188 (0:38 real, 3.7991 ms/iter, ETA 35:04:42)
[---]
Iteration 200000 M( 33271093 )C, 0x165d9f1fb29fb1f4, n = 1835008, CUDALucas v2.0
3 err = 0.2188 (0:38 real, 3.7499 ms/iter, ETA 34:26:49)
Iteration 210000 M( 33271093 )C, 0x0fe59984a99d2238, n = 1835008, CUDALucas v2.0
3 err = 0.2188 (0:37 real, 3.7483 ms/iter, ETA 34:25:17)
iteration = 216901 >= 1000 && err = 0.5 >= 0.35, fft length = 1835008, writing c
heckpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>[/CODE]

Dubslow 2012-09-27 18:24

[QUOTE=patrik;312950]Well, the error occurred again. Does this mean that I should choose a different FFT size, or is it my hardware? The card I use (a Gigabyte GTX 570) is overclocked by default, IIRC.

[CODE]C:\Users\Patrik Johansson\Documents\CUDALucas>CUDALucas-2.03-cuda4.0-sm_20-x86-6
4.exe -t worktodo.txt

Starting M33271093 fft length = 1835008
Iteration 10000 M( 33271093 )C, 0x5348d62b85363b87, n = 1835008, CUDALucas v2.03
err = 0.2031 (0:38 real, 3.7536 ms/iter, ETA 34:40:44)
Iteration 20000 M( 33271093 )C, 0xd261b237d0ea981a, n = 1835008, CUDALucas v2.03
err = 0.2188 (0:37 real, 3.7551 ms/iter, ETA 34:40:55)
Iteration 30000 M( 33271093 )C, 0x0d91040de48abe77, n = 1835008, CUDALucas v2.03
err = 0.2188 (0:38 real, 3.7991 ms/iter, ETA 35:04:42)
[---]
Iteration 200000 M( 33271093 )C, 0x165d9f1fb29fb1f4, n = 1835008, CUDALucas v2.0
3 err = 0.2188 (0:38 real, 3.7499 ms/iter, ETA 34:26:49)
Iteration 210000 M( 33271093 )C, 0x0fe59984a99d2238, n = 1835008, CUDALucas v2.0
3 err = 0.2188 (0:37 real, 3.7483 ms/iter, ETA 34:25:17)
iteration = 216901 >= 1000 && err = 0.5 >= 0.35, fft length = 1835008, writing c
heckpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>[/CODE][/QUOTE]

Run the self test, which (IIRC) is -r.

If all those pass, then just increase the FFT length; otherwise, reduce the clock speed if you can. kladner et al. can be more helpful in determining if it's the card or not. (Personally, I highly doubt it, but you never know.)

kladner 2012-09-27 20:22

Hi Patrik,

Which Gigabyte card do you have? Mine is a GV-N570OC-13I, the 3 fan model. This is just curiosity on my part. I will be extremely interested in your progress with this card, whichever it is. I have yet to succeed in getting mine to run CuLu.

If you want or need to pursue testing beyond what Dubslow suggested, there are a few possibilities: OCCT and MemtestG80 especially.

MemtestG80: [url]http://folding.stanford.edu/English/DownloadUtils#ntoc2[/url]

There are versions for Windows, Linux, and Mac.

OCCT (Windows only): [url]http://www.ocbase.com/index.php/download[/url]

For MemtestG80 I suggest a command line like this:
[CODE]memtestg80 -g 0 -b 1140 200[/CODE]
-g [N] sets the GPU number, starting from 0, if you have more than one.
-b bypasses the query about sending test data to Stanford.
1140 is the amount of memory (in MiB) to test; it is the most I could get to run on my 1.25 GiB card.
200 is the number of iterations to run. Your choice.

While the app is labeled as a bandwidth test, it also keeps a running total of errors detected.

OCCT is a stress tester with different tabs for GPU, CPU, and PSU. On the GPU tab check Full Screen and Error Check. You can play with the length of runtime. This will heat up the GPU on the same order as Fur Mark (a Lot!), but it provides monitoring so you can track it.

I hope you get things worked out. For now I have put my 570 back on mfaktc duty because I couldn't make CuLu work. It works great on my Gigabyte GTX 460.

patrik 2012-09-28 12:09

I think we have identical cards: GV-N570OC-13I V2.0

It passes both MemtestG80 and the three minutes of OCCT I ran before the rising temperatures made me worried.

The error in CUDALucas is not reproducible and happens at different iterations. Also in the self-test it fails at different exponents (but most often at M20996011).

kladner 2012-09-28 15:26

1 Attachment(s)
I just set up a test folder to try CuLu on the 570 again. It still quits the same way on the self-test (last few lines):
[CODE]Starting M2976221 fft length = 8000K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.50000 >= 0.35, increasing n from 8000K
Starting M2976221 fft length = 8192K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
CUDALucas.cu(159) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/CODE]It seems that this card did not get as far as M20996011 on this run. I'll give it a couple more tries.

EDIT: Did you get the error 6 result? "CUDALucas.cu(159) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED"

EDIT2: The attached text file shows the tail end of the next -r run. This time it got to M2976221 as well, but tested up to a ridiculous FFT length (5760K) at which point it reports
[CODE]Iteration 1000, average error = 0.00450 < 0.25 (max error = 0.00000), continuing test.
Iteration = 1701 >= 1000 && err = 0.5 >= 0.35, fft length = 5760K, writing checkpoint file (because -t is enabled) and exiting.[/CODE]I am at a loss. Back to mfaktc.

patrik 2012-09-28 16:53

The output I got when I re-ran it right now was
[CODE][---]
Starting M24036583 fft length = 1835008
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1835008, CUDALucas v2.03
err = 0.0002 (0:35 real, 3.4479 ms/iter, ETA 23:00:17)
This residue is correct.
Starting M25964951 fft length = 1310720
iteration = 22 < 1000 && err = 0.319336 >= 0.25, increasing n from 1310720
Starting M25964951 fft length = 1572864
iteration = 501 < 1000 && err = 0.5 >= 0.25, increasing n from 1572864
Starting M25964951 fft length = 1835008
iteration = 9901 >= 1000 && err = 0.5 >= 0.35, fft length = 1835008, writing che
ckpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>[/CODE]
and
[CODE]C:\Users\Patrik Johansson\Documents\CUDALucas>CUDALucas-2.03-cuda4.0-sm_20-x86-6
4.exe -t worktodo.txt

Starting M33271093 fft length = 1835008
iteration = 533 < 1000 && err = 0.5 >= 0.25, increasing n from 1835008
Starting M33271093 fft length = 1966080
iteration = 11 < 1000 && err = 0.499999 >= 0.25, increasing n from 1966080
Starting M33271093 fft length = 2097152
iteration = 1201 >= 1000 && err = 0.5 >= 0.35, fft length = 2097152, writing che
ckpoint file (because -t is enabled) and exiting.

C:\Users\Patrik Johansson\Documents\CUDALucas>[/CODE]
but I think I have seen different output as well.

Another point to make: before I started using -t (check the round-off error on all iterations), I saw (by comparing with two later matching runs) that the residues were already wrong a few tens of thousands of iterations before CUDALucas detected the rounding error.

Dubslow 2012-09-28 17:58

[QUOTE=patrik;313047]I think we have identical cards: GV-N570OC-13I V2.0

The error in CUDALucas is not reproducible and happens at different iterations. Also in the self-test it fails at different exponents (but most often at M20996011).[/QUOTE]
That's not good :razz:

I would say it's a card problem, except kladner has been testing his own 570 for a while now, and everything except CUDALucas works great on it. I haven't known what to say to him, and I still don't know what to say to you, sorry :razz:

kladner 2012-10-05 14:36

Successful CL -r run on GTX 570!
 
1 Attachment(s)
OMFG! :shock:

I thought that I had tried everything, but it seems I had not. I downclocked the GPU to nVidia stock (732 MHz) and the memory to 1900 MHz. While it still threw an error on a test run of a DC (because -t is DISabled!?), it continued on a restart. It then completed on -r. See attached. This has never happened before.

I am going to try again with memory underclocked to 1800. Results anon.

This seems to back LaurV's hunch that memory is the culprit. Too bad I can't do an elimination test as on a motherboard by removing parts of the RAM and testing each piece.

EDIT: Completed -r at factory GPU OC of 781 MHz, RAM at 1800. The 570 is reporting a beginning ETA of 22.5 hrs on M27361157 vs ~37 hrs on the GTX 460 in the same general range. While this makes me feel rather foolish for not having tried these settings before :redface:, it is very gratifying to see it actually work. :grin:

EDIT2: This is puzzling. -r just completed with factory settings of 781 GPU, 1900 RAM. Perhaps this is due to cooler conditions, though the readouts aren't that different from warmer days. Of course, I can't monitor the VRAM temps, so that might be involved. If I put this card to DC work in CuLu I will probably go with the RAM underclocked to 1800, just to be a bit safer.

EDIT3: Stop the presses! I just realized that I did not Apply the 1900 RAM setting in Afterburner. I'll have to run it again.

EDIT4: It craps out in less than a minute at 1900. It seems that this is the answer.

flashjh 2012-10-06 05:11

[QUOTE=kladner;313723]OMFG! :shock:

I thought that I had tried everything, but it seems I had not. I downclocked the GPU to nVidia stock (732 MHz) and the memory to 1900 MHz. While it still threw an error on a test run of a DC (because -t is DISabled!?), it continued on a restart. It then completed on -r. See attached. This has never happened before.

I am going to try again with memory underclocked to 1800. Results anon.

This seems to back LaurV's hunch that memory is the culprit. Too bad I can't do an elimination test as on a motherboard by removing parts of the RAM and testing each piece.

EDIT: Completed -r at factory GPU OC of 781 MHz, RAM at 1800. The 570 is reporting a beginning ETA of 22.5 hrs on M27361157 vs ~37 hrs on the GTX 460 in the same general range. While this makes me feel rather foolish for not having tried these settings before :redface:, it is very gratifying to see it actually work. :grin:

EDIT2: This is puzzling. -r just completed with factory settings of 781 GPU, 1900 RAM. Perhaps this is due to cooler conditions, though the readouts aren't that different from warmer days. Of course, I can't monitor the VRAM temps, so that might be involved. If I put this card to DC work in CuLu I will probably go with the RAM underclocked to 1800, just to be a bit safer.

EDIT3: Stop the presses! I just realized that I did not Apply the 1900 RAM setting in Afterburner. I'll have to run it again.

EDIT4: It craps out in less than a minute at 1900. It seems that this is the answer.[/QUOTE]
Glad you got it figured out! I run all my cards at 1600. My MSI cards come @ 1600 from the factory; I have to OC the others a bit to get that.

kladner 2012-10-06 15:01

[QUOTE=flashjh;313811]Glad you got it figured out! I run all my cards at 1600, though my MSI cards come @ 1600 from the factory, I have to OC the others a bit to get that.[/QUOTE]

nVidia spec for the 570 is 732 MHz for the GPU, 1900 RAM. Gigabyte runs this card at 780/1900, but 1900 doesn't cut it. I have it at 780/1750. It seemed OK at 1800, but I backed off a hair more just for margin. It has run overnight without problem and should complete ~0230 UTC, 10/7/12. We'll see if it matches the first LL.

What's funny is that I can push the GTX 460 much harder without issues. It has turned out one DC after another running at 823 MHz GPU, 2000 RAM. nVidia spec is 675/1800. Gigabyte is 715/1800. Still, in the 27M range the 460 is running at ~58% the speed of the 570: 4.86 ms/iter vs 2.83 ms/iter.

ATM, I have both running CL, and 4 P-1s on the CPU, with 2 cores unassigned. I'm waiting to clear the current CL assignments (which are due within a few minutes of each other) before I decide whether to leave the 570 on CuLu. That would mean putting 2x or 3x mfaktc on the 460 and adjusting the P-1s if necessary. 3 and 4 P-1 instances are "good" numbers, in that they stay in relative balance between S1 and S2 with 2 or 3 HighMemWorkers respectively.

patrik 2012-10-15 05:10

I just completed a successful double-check of M33273391 on my GPU with its memory downclocked to 1800 MHz. This seems to be the solution for this GPU.

The main problem for me was that I had never underclocked (or overclocked) a GPU before, so I had to learn that nvidia has some tools I could download.

kladner 2012-10-15 05:16

[QUOTE=patrik;314706]I just completed a successful double-check of M33273391 on my GPU with its memory downclocked to 1800 MHz. This seems to be the solution for this GPU.

The main problem for me was that I had never underclocked (or overclocked) a GPU before, so I had to learn that nvidia has some tools I could download.[/QUOTE]

MSI Afterburner is freely available and provides monitoring and control functions. I should have mentioned that.

[url]http://event.msi.com/vga/afterburner/download.htm[/url]

flashjh 2012-11-18 03:32

Does anyone know what would be involved in writing a small program (or modifying CUDALucas) to automatically assign exponents?

What I mean is, instead of getting LL-DC work from GPU72, then putting it into P95 to get it assigned to me, and then moving it to the CUDALucas worktodo, just have a small program in the same directory: anytime the worktodo file gets updated, you double-click it and it updates your PrimeNet account with your exponent and an expected completion date, maybe 30 days or a settable date. I'm thinking even a Perl script might be enough; I just don't know how to do it.
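
A natural first piece of such a script is parsing the worktodo entries themselves; here is a minimal, purely illustrative Python sketch (the PrimeNet registration step is the hard part and is not shown, and the exact line format here is an assumption):

```python
import re

def parse_worktodo(path):
    """Extract (assignment_id, exponent) pairs from a CUDALucas worktodo file.

    Assumes lines of the form Test=<32-hex AID>,<exponent>,... or
    DoubleCheck=<32-hex AID>,<exponent>,...; anything else is skipped.
    """
    entries = []
    with open(path) as f:
        for line in f:
            m = re.match(r"(?:Test|DoubleCheck)=([0-9A-Fa-f]{32}),(\d+)", line.strip())
            if m:
                entries.append((m.group(1), int(m.group(2))))
    return entries
```

A script like this could diff the parsed list against its previous run and report only the newly added exponents.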

kladner 2012-11-18 03:53

[QUOTE=flashjh;318815]Does anyone know what would be involved in writing a small program (or modifying CUDALucas) to automatically assign exponents?

What I mean is instead of getting LL-DC work from GPU72 and then putting it into P95 to get it assigned to me and then moving them to CUDALucas worktodo, just have a small program in the same directory and anytime the worktodo file gets updated, you double click it and it updates your PrimeNet account with your exponent and an expected completion date, maybe 30 days or a settable date. I'm thinking even a Perl script might be enough; I just don't know how to do it.[/QUOTE]

I don't do it quite that way, but such a feature would be great. I had been using a separate instance of P95 to get LL-DC via the proxy, then moving the assignments to CuLu worktodo.txt. Right now, I'm short-handed in the hardware department, so I haven't been doing CL.

flashjh 2012-11-18 04:05

[QUOTE=kladner;318817]I don't do it quite that way, but such a feature would be great. I had been using a separate instance of P95 to get LL-DC via the proxy, then moving the assignments to CuLu worktodo.txt. Right now, I'm short-handed in the hardware department, so I haven't been doing CL.[/QUOTE]
Really, using one directory for CuLu and P95, one could share a worktodo.txt file. Anytime P95 was fired up it would update all the exponents and delete the completed ones. I just don't want to end up in the same boat, getting DC assignments I don't want. It seems like it would be easier to use a program or script designed just for CuLu processing. If nothing surfaces, I'll probably go with P95 in the directory and just make sure the computer name is different from the main worker, to hopefully avoid any troubles.

Dubslow 2012-11-18 04:31

[QUOTE=flashjh;318815]Does anyone know what would be involved in writing a small program (or modifying CUDALucas) to automatically assign exponents?

What I mean is instead of getting LL-DC work from GPU72 and then putting it into P95 to get it assigned to me and then moving them to CUDALucas worktodo, just have a small program in the same directory and anytime the worktodo file gets updated, you double click it and it updates your PrimeNet account with your exponent and an expected completion date, maybe 30 days or a settable date. I'm thinking even a Perl script might be enough; I just don't know how to do it.[/QUOTE]
The problem is, you'd need a fairly decent understanding of the PrimeNet protocols, which isn't that hard; more importantly, I'm not sure how strict PrimeNet is about the information in account/work requests. When I looked at the protocol, it [i]seemed[/i] to require a lot of things, such as computer name, GUID, hardware ID, basic computer info, and various other miscellanea on top of the expo/AID. Once we understand that, writing such a script would be easy.

I would ask chalsall for his advice; AFAIK, he, Prime95, Scott Kurowski and perhaps Christenson (who I haven't seen for a year) are the only ones who are more than a bit knowledgeable about the protocol. (There is a public description of the API, but like I said, I don't know how strictly it must be adhered to, which is why I would ask chalsall.)
[QUOTE=flashjh;318818]Really, using one directory for CuLu and P95, one could share a worktodo.txt file. Anytime P95 was fired up it would update all the exponents and delete the completed ones. I just don't want to end up in the same boat, getting DC assignments I don't want. It seems like it would be easier to use a program or script designed just for CuLu processing. If nothing surfaces, I'll probably go with P95 in the directory and just make sure the computer name is different from the main worker, to hopefully avoid any troubles.[/QUOTE]

That's a pretty good idea in the meantime.

diep 2012-11-18 08:59

[snip]

[QUOTE=Dubslow;318819](who I haven't seen for a year)
[/QUOTE]

Lemme do a posting, maybe magically the posse once again shows up.

Oh, on automatically assigning exponents - I thought about it for Wagstaff as well. Yet with many computers searching offline, it doesn't make much sense, so I hand out ranges by hand. What would be interesting, though, is a tool that checks whether you forgot something, and a tool that takes samples in ranges you haven't sampled yet, as a double check. Tony Reix found a bug in the LLR software, for example, when we got to higher ranges for Wagstaff. That thirst for double-checking will grow when I manage to get the Teslas searching them.

swl551 2012-12-01 01:22

[QUOTE=flashjh;318815]Does anyone know what would be involved in writing a small program (or modifying CUDALucas) to automatically assign exponents?

What I mean is instead of getting LL-DC work from GPU72 and then putting it into P95 to get it assigned to me and then moving them to CUDALucas worktodo, just have a small program in the same directory and anytime the worktodo file gets updated, you double click it and it updates your PrimeNet account with your exponent and an expected completion date, maybe 30 days or a settable date. I'm thinking even a Perl script might be enough; I just don't know how to do it.[/QUOTE]

Jerry,
Can you get LL-DC from the GIMPS manual assignment page? (It seems so, but I need confirmation.) If so, I can whip up a util to fetch it via HTTP GET, just like I get TF factors from GIMPS in MISFIT.

flashjh 2012-12-01 01:32

[QUOTE=swl551;320100]Jerry,
Can you get LL-DC from the gimps manual assign page (seem so, but need confirmation) if so I can whip up a util to fetch it via HTTP GET. Just like I get TF factors from GIMPS in MISFIT.[/QUOTE]

Yes, I can, however, right now I'm getting all my LL-DC assignments from GPU72.

swl551 2012-12-01 01:38

[QUOTE=flashjh;320101]Yes, I can, however, right now I'm getting all my LL-DC assignments from GPU72.[/QUOTE]

As you know, I've already written code to get TF assignments from GPU72 via HTTP POST. Wouldn't that same approach be useful here? I don't get the Prime95 "connector" methodology. (Really, what is that for?)

Can I help here?

swl551 2012-12-01 01:42

Also has anyone looked at creating a WEB SERVICE for fetching/reporting work?

PrimeNet and Gimps? These products are ripe for such services.

flashjh 2012-12-01 01:51

[QUOTE=swl551;320102]As you know I've already written code to get TF assignment from GPU72 via HTTP POST. Wouldn't that same approach be useful here. I don't get the Prime95 "connector" methodology. (really, what is that for?)

Can I help here?[/QUOTE]
If you can make MISFIT get LL-DC assignments from GPU72 and submit the results to PrimeNet, that would be very helpful. I'm currently only doing one DC per week on a GT 430, but I'm sure there are others who would like CuLu to be automated with both PrimeNet and GPU72. My main question is how to let PrimeNet know that the assignment belongs to 'me' once I get it from GPU72. I don't understand the API at all. Right now I've decided to use a copy of P95 as I described above, but it's not optimal, because it causes another 'computer' to show up in GPU72 and on PrimeNet even though it's actually not doing anything. A 'real' i7 would be able to do more LL-DC than one per week.

swl551 2012-12-01 04:10

Looks like adapting MISFIT to CuLu will be simple since they use the same worktodo/results files as mfaktO/C

Minor changes
1. the prefix of the factor row is different
2. the type of work to fetch from gimps is different


Everything else, including my SendCtrlSignal.exe and MISFITServer.exe will work without modification.

The first thing I'll do is outfit MISFIT with the file locking system implemented in mfaktO and CuLu and finally get that addressed.

swl551 2012-12-01 18:43

How to calculate GHZdaysCredit for DC and LL
 
As I look at adding support for CuLu to MISFIT I need the formula for calculating GHZdaysCredit for DC and LL that CuLu processes.

Please don't point me to the MONSTER source code at Mersenne.ca to try to find it.


For TF Chalsall gave me the following in perl. Below is my version converted to C#. I'm looking for the same type of documentation for DC/LL.
[B]public static double CalcGHDZ(int exp, int from, int to)
{
    double GHZdays = 0;

    for (int i = from + 1; i <= to; i++)
    {
        GHZdays += (0.00707 * 2.4) * Math.Pow(2, i - 48) * 1680 / exp;
    }

    return GHZdays;
}[/B]

Thanks!

Dubslow 2012-12-01 22:42

Chalsall got his info from Mersenne.ca, or more specifically its owner; I have absolutely no idea how PrimeNet calculates the credit for CUDALucas tests.

ckdo 2012-12-01 23:32

[QUOTE=swl551;320167]For TF Chalsall gave me the following in perl. [...] I'm looking for the same type of documentation for DC/LL. [/QUOTE]

Why use a loop for TF credit?

Anyway, LL/DC credit is based on the FFT size actually used for a given assignment. Come back when you have that info.

Dubslow 2012-12-01 23:58

[QUOTE=ckdo;320193]Why use a loop for TF credit?[/quote]It loops over each bit level.
[QUOTE=ckdo;320193]
Anyway, LL/DC credit is based on the FFT size actually used for a given assignment. Come back when you have that info.[/QUOTE]
That it does; however, for Prime95 results there's more to it than that, and even with the FFT fixed, the credit still varies by exponent. You could probably make a decent guess at the formula experimentally, but that would take a variety of results.

(PS @swl551 the FFT length is the "n = ..." part of the result line.)

flashjh 2012-12-02 00:07

Anyone working with CUDALucas in Windows: swl551 is working on a program to track CUDALucas workers [URL="http://www.mersenneforum.org/showthread.php?p=320183#post320183"]here[/URL]. It's still in initial testing, but I have it working quite well right now. There is no automatic GPU72 support for now, but you can get assignments from GPU72 and add them manually. You can get assignments from PrimeNet through the program. The version of this program for TFing is quite good, so I expect this one to mature quickly as well. Except for getting assignments, it makes the process all but completely automatic, including submitting results.

ckdo 2012-12-02 06:40

[QUOTE=Dubslow;320194]It loops over each bit level.[/QUOTE]

Obviously. And it's silly to do so, which was the point of "Why?". :razz:

swl551 2012-12-02 13:26

[QUOTE=ckdo;320211]Obviously. And it's silly to do so, which was the point of "Why?". :razz:[/QUOTE]

We are always looking for ways to improve. Feel free to give us an updated version of the function without the loop.

Antonio 2012-12-02 14:39

[QUOTE=swl551;320226]We are always looking for ways to improve. Feel free to give us an updated version of the function without the loop.[/QUOTE]

The following expression calculates GHz days without the use of loops:

GHzD=28.50624*(POWER(2;(to-from+1))-2)*POWER(2;(from-48))/exp

Checked using a spreadsheet against Primenet credits for work I've submitted.
The constant is = 0.00707 * 2.4 * 1680 (just rearranging your equation).

swl551 2012-12-02 15:02

[QUOTE=Antonio;320228]The following expression calculates GHz days without the use of loops:

GHzD=28.50624*(POWER(2;(to-from+1))-2)*POWER(2;(from-48))/exp

Checked using a spreadsheet against Primenet credits for work I've submitted.
The constant is = 0.00707 * 2.4 * 1680 (just rearranging your equation).[/QUOTE]
Nicely done!

swl551 2012-12-03 20:08

[QUOTE=Dubslow;320189]Chalsall got his info from Mersenne.ca, or more specifically its owner; I have absolutely no idea how PrimeNet calculates the credit for CUDALucas tests.[/QUOTE]

I worked with James H. and translated his PHP calc functions over to c#. It was not the thrill of my day! Crazy stuff.

Antonio 2012-12-06 11:29

[QUOTE=swl551;320229]Nicely done![/QUOTE]

Thanks, an identical but neater solution is:-

GHzD=28.50624 * (POWER(2;to-47) - POWER(2;from-47)) / exp
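
For anyone wiring this into code, the loop from the C# function and Antonio's closed form can be checked against each other; a quick illustrative Python sketch (the exponent and bit levels below are made-up example values):

```python
def ghzdays_loop(exp, frm, to):
    # Per-bit-level TF credit, summed (mirrors the C# loop posted above;
    # 28.50624 = 0.00707 * 2.4 * 1680)
    return sum(28.50624 * 2.0 ** (i - 48) / exp for i in range(frm + 1, to + 1))

def ghzdays_closed(exp, frm, to):
    # Antonio's closed form: the same geometric series summed analytically
    return 28.50624 * (2.0 ** (to - 47) - 2.0 ** (frm - 47)) / exp
```

Both give the same result for any (exp, from, to), since the loop is a geometric series in powers of two.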

dbaugh 2013-01-22 09:06

To run two instances of CuLu on a 590 do I just do the "-d 1" like with mfaktc? FYI on a 3970x a 60M LL is twice as fast with P95 as half a 590, 3.886ms vs 7.765ms.

flashjh 2013-01-22 15:42

[QUOTE=dbaugh;325464]To run two instances of CuLu on a 590 do I just do the "-d 1" like with mfaktc? FYI on a 3970x a 60M LL is twice as fast with P95 as half a 590, 3.886ms vs 7.765ms.[/QUOTE]

Yes, that will work.

owftheevil 2013-02-03 21:55

In response to a post of LaurV from September last year, here are the fft timings I get on a 570 running on a Linux box. I know it's been a while, but I assume the interest in this data still exists. The first column is the fft length in multiples of 1024, the second is the timing in milliseconds per iteration. Missing lengths were slower than longer ffts in the table.


[CODE]1 0.007
2 0.011
8 0.019
9 0.020
14 0.022
18 0.023
20 0.028
22 0.028
26 0.030
32 0.030
36 0.037
40 0.039
48 0.040
56 0.043
64 0.054
70 0.064
80 0.064
84 0.070
96 0.072
112 0.075
120 0.092
128 0.095
144 0.099
160 0.110
180 0.128
192 0.135
224 0.141
256 0.168
288 0.174
320 0.204
336 0.229
360 0.246
384 0.256
392 0.267
400 0.269
448 0.270
512 0.309
576 0.342
640 0.405
648 0.418
672 0.457
720 0.474
768 0.513
784 0.522
896 0.522
1024 0.645
1152 0.722
1176 0.849
1280 0.855
1296 0.868
1344 0.928
1440 0.956
1568 1.020
1600 1.069
1728 1.110
1792 1.169
2048 1.263
2304 1.503
2560 1.731
2592 1.734
2688 1.953
2880 1.954
3136 2.101
3200 2.288
3456 2.377
3584 2.412
3600 2.651
4096 2.696
4608 3.088
4704 3.553
5120 3.639[/CODE]
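
The pruning rule described above (drop any length that is slower than some longer length in the table) can be sketched like this; an illustrative Python helper, with made-up timings in the example:

```python
def prune_fft_table(timings):
    """Keep only FFT lengths not beaten by a longer-but-faster length.

    timings: list of (length_in_K, ms_per_iter) pairs, in any order.
    Walks from the longest length down, keeping an entry only if it is
    faster than every longer length already kept.
    """
    keep, best_longer = [], float("inf")
    for length, ms in sorted(timings, reverse=True):  # longest first
        if ms < best_longer:
            keep.append((length, ms))
            best_longer = ms
    return sorted(keep)
```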

LaurV 2013-02-04 16:49

1 Attachment(s)
The conclusion was that everybody needs to tune it for his/her system (card, CPU, etc.). For me, for example, 2688 is very slow. I tune it for every exponent range, in small ranges (every meg or so). Here is a snap from my tables, with the difference that I gray out the slower entries instead of deleting them. They are updated periodically by averaging the real test times with the times in the table, so over time they become very accurate. Also note that the values are the real iteration times for the LL test, not the values given by the -cufftbench parameter (which are about 2.66 times lower, as the bench does a single FFT, while the test also does the multiplication and the inverse FFT, to subtract 2 and control the errors).
[ATTACH]9250[/ATTACH]

Also, please note that not all FFTs are "usable". They have to be multiples of 16k, 32k, or 64k, depending on your card (see msft's posts). For example, 2160 is faster, but it is not a multiple of 32k, so you have to live with 2304 in case you have a GTX 580 and want to use 512 threads, which would be a bit faster. Also, 2646 may be faster, but it is not even a multiple of 16k, so you would need 128 threads for it, which does not max out the card. You must use either 2800 with 256 threads, or 2880 with 512 (as 2800 is a multiple of 16k, but not of 32k).

owftheevil 2013-02-05 00:15

I don't like this conclusion; I would prefer to figure out how to refine the fft selection process. The only reason I can see to avoid the lengths that are not multiples of 16K, 32K, etc. is that one block in the second normalize kernel will have some idle threads. For example, using a 1701k fft and threads = 512, one out of 27 blocks in normalize2_kernel has about 42% idle threads. But this is only 42% of 1/27 of ~1/7 of ~1/4 of the total iteration time. Hardly noticeable. To test this I ran a few iterations with fft = 1701k and the iteration times matched well with the prediction from the cufftbench test: 3.238 actual as opposed to 3.234 predicted. (I use 2 * cb + a * fft, a = .00044 for my 570.)

Thanks for pointing me back to previous posts in this thread. I thought I had read the entire thing several times over, but looking for msft's posts about fft length, I found whole sections I must have slept through. In particular there are many posts which have cufftbench data I had never seen before.
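
The timing model quoted above can be written out explicitly. In this sketch the cufftbench time cb is back-derived from the quoted numbers rather than measured, so treat both it and the fitted constant as assumptions specific to that GTX 570:

```python
def predicted_ms(cb_ms, fft_k, a=0.00044):
    # owftheevil's model: an LL iteration costs roughly two FFT-sized
    # transforms (2 * cb) plus per-element normalization work proportional
    # to the length (a * fft); a = 0.00044 is his fitted value for a 570.
    return 2.0 * cb_ms + a * fft_k
```

With cb ≈ 1.243 ms for the 1701K length, this gives 2 * 1.243 + 0.00044 * 1701 ≈ 3.23 ms/iter, consistent with the 3.234 prediction quoted above.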

Keldor 2013-02-24 10:54

I hope this is the correct thread for my questions:

This week I downloaded CL to see what my GTX570 could do.
The self test ("-r") ran twice without problems (apart of annoying sound from the gfx card, varying with different FFT lengths).

Then I decided to do some "real" LL-ing and put a M59.xxx.xxx exponent in worktodo.txt.
CL ran for a while (CL chose FFT length 3670016) and then stopped due to a rounding error. After starting CL again it went on for a while, but then stopped again (again err >= 0.35). That happened several times, so I wonder if the result even has a chance of being correct, given the huge(?) number of rounding errors (>10 so far).

Then I tried M26026433 to compare the residues with the residues frmky kindly published. At 350101 iterations I got a rounding error (and CL exited), but the residues matched up to 350000 iterations. After starting CL once more the residues still matched, but 30000 iterations later the next rounding error appeared.
So should I continue on the M59.xxx.xxx exponent (about 1/2 of the work is done), or is it just a waste of time, since the result is probably not correct?

The other thing I don't know is why the rounding errors appear and disappear despite using the same FFT length. Does it mean the gfx card is faulty? CPU temperature was up to 69C, CPU about 71C (probably because of Prime95). I use the precompiled CUDALucas-2.03_cuda4.0_sm_20-x86-64.exe, Windows 7 64-bit, and a GigaByte GV-N570OC-13I.

Any comments are welcome.

flashjh 2013-02-25 14:30

Current changes are being made to address your issues and others, as well. v2.05 should be ready for testing soon. We'll post when the version is up for testing.

Jerry

LaurV 2013-02-25 14:39

[QUOTE=flashjh;330929]Current changes are being made to address your issues and others, as well. v2.05 should be ready for testing soon. We'll post when the version is up for testing.

Jerry[/QUOTE]
Current changes are being made to address "increasing the FFT size on the fly" when the default one is too small and this is found later in the test, to avoid re-starting everything from the beginning.

However, this is not his problem, as the errors are NOT reproducible. His problem is thermal.

@Keldor: Please run some temperature monitoring software (GPU-Z from TechPowerUp is a good start) and monitor the temperature of your card. When it gets very hot, it may spit out random values, which CL will treat as errors. The "noises from the card" are an indication that the fans are running like crazy. You may have to clean out the dust and ensure proper ventilation around your computer case. Under a closed desk or in a very warm room, it will not feel so comfortable.

Edit: and do DC for a while, expos around 30M-35M, until you are sure the card is running well and stable. Only after that should you go to first-time LL. Otherwise you will feel very sorry 2 years later when I find a prime during DC of an exponent you LL'd with a bad residue... :razz:

Keldor 2013-02-25 19:09

Thank you for the information.
The GPU temperature always stayed < 70C, which shouldn't be a problem, since fan speed is about 50% (sorry for the typing error in my first post, the first "CPU" should be "GPU").
The "noises from the card" are not from the fans. I can reproduce them with the self-test or by going to the loading screen of some games. (Sounds like in this video [URL]http://www.youtube.com/watch?v=CzHwXuujpZo[/URL] but it's not a CL problem.)

So I will complete the current exponent (3/4 done) and switch to DC (or TF) afterwards.


Dubslow 2013-02-25 19:19

In that case you must have just gotten unlucky and been assigned an exponent that's right near an FFT crossover point.

2.05 will have code to deal with these issues without straight up aborting.

Batalov 2013-02-25 19:56

[QUOTE=Keldor;330794]... and a GigaByte GV-N570OC-13I.

Any comments are welcome.[/QUOTE]
I am afraid I have bad news for you.

Welcome to the club -- in two senses: 1) to the mersenneforum, and 2) to the club of unlucky GV-N570OC-13I owners.

I am not the only one here who has already swapped this particular card multiple times (there's at least one more owner who was likewise lured by the attractive design). I returned one of them (without charge) when it failed many CUDA tests (genefer, cudalucas, some others). With the second specimen, I have been stuck for half a year, during which I have already had two RMAs, and am now in the process of a third. It fails every 1.5-2 months (blows some transistors-or-capacitors* with a nasty smell). This time I picked up the phone and complained, got a verbal assurance that I am not going to receive the same card back again (with a few replaced VRM elements), and got a prepaid label. We'll see what will be different this time. /sigh/

In comparison, my other card (an EVGA 560-448 OC, twin-fan variant) has worked without any problems for a year.


____________
[SIZE=1]*I cannot tell precisely because these are under the heatsink and I am not curious enough to open it up and mess with the paste to put it back together again.[/SIZE]

Batalov 2013-02-25 20:10

The sound bit is also a true and annoying feature. I second that.
("It is a sound card!" :w00t:)

In my case, it is not that annoying - maybe you have a worse specimen. In my hands, it was apparent on some very short job units, e.g. cudalucas -st2, or on many of the tests I ran while debugging mmff-gfn on small ranges. For normal jobs, the card is silent.

kracker 2013-02-25 20:35

[QUOTE=Batalov;330976]The sound bit is also a true and annoying feature. I second that.
("It is a sound card!" :w00t:)

In my case, it is not that much annoying - maybe you have a worse specimen. In my hands, it was apparent for some very short job units, e.g. cudalucas -st2, or for many tests that I ran while debugging mmff-gfn on small ranges. For normal jobs, the card is silent.[/QUOTE]

With three fans, yeah! My single-fan GPU gets loud with the fan over ~55%; under that, quiet.

Batalov 2013-02-25 21:36

Three fans are in fact silent[I][B]er[/B][/I] than one.

It is not the fans. It is probably not the GPU either, but some of the external circuitry; the VRMs, maybe. They may vibrate under changing load characteristics (from idle to load, and again, and so on, four thousand times under cudalucas -st2), hence the sound (similar to the el cheapo 8-bit music of Mario Brothers or Leisure Suit Larry from the 80s), and hence the eventual blowups. That's just my guess.

If you have seen The Matrix, the sound is reminiscent of what you hear when Neo gets covered with liquid metal and the POV camera flies into his mouth. :davieddy:

ewmayer 2013-02-25 22:03

[QUOTE=LaurV;330930]Current changes are being made to address "increasing the FFT size on the fly" when the default one is too small and this is found later in the test, to avoid re-starting everything from the beginning. [/QUOTE]

Does the code already use an FFT-length-independent savefile format? That makes such length-changing hacks much simpler to effect.

Dubslow 2013-02-25 22:08

[QUOTE=ewmayer;330998]Does the code already use an FFT-length-independent savefile format? That makes such length-changing hacks much simpler to effect.[/QUOTE]

No it does not, however owftheevil already wrote the length-changing routines and they are in use.

owftheevil 2013-02-25 22:55

[QUOTE=ewmayer;330998]Does the code already use an FFT-length-independent savefile format? That makes such length-changing hacks much simpler to effect.[/QUOTE]

You are definitely right about that. I initially didn't use such a format, and getting everything to work properly was a nightmare. I later changed to an FFT-agnostic format and things are now much simpler.

Hey--who gave me the funky avatar?

frmky 2013-02-26 00:11

[QUOTE=owftheevil;331007]Hey--who gave me the funky avatar?[/QUOTE]
Around here, if you don't choose one then one will be chosen for you. Makes life more interesting. :smile:

kracker 2013-02-26 00:34

[QUOTE=frmky;331017]Around here, if you don't choose one then one will be chosen for you. Makes life more interesting. :smile:[/QUOTE]

:raman:

Brain 2013-02-26 06:18

[QUOTE=flashjh;330929]Current changes are being made to address your issues and others, as well. v2.05 should be ready for testing soon. We'll post when the version is up for testing.

Jerry[/QUOTE]
Hi Jerry,
would you mind compiling 2.05 with CUDA 4.2/5.0 and CC 2.0/3.0/3.5?
Especially CUDA 5.0 and CC3.5 would be interesting for the GTX Titan...

I haven't reset my build environment.
Greetings, Sebastian

[QUOTE=Brain;290851]One way to compile CUDALucas for Win64:
0. Have Win7 64 bit
1. Install Nvidia GPU Toolkit (e.g. version 4.1)
2. Install Nvidia GPU SDK (e.g. version 4.1)
3. Install Make for Windows
4. Install MS Visual Studio 2010 Professional Trial Edition (needed for 64bit, trial will not run out as only command line usage)
5. Set Path for nvcc, make and cl.exe (from VS/bin)
6. Edit given makefile for Win64: Adapt CUDA and SM parameter (e.g. 4.1/2.0). Rename it to makefile.
7. Enter "make" in console being in the CUDALucas/src directory.
8. Delete *.obj files.
9. Find the exe and be happy.

This should be it.

The day will come I won't be there to compile it. So a backup person/compiler will be needed. Any volunteers?[/QUOTE]
Any volunteers? --> Thanks to Jerry...
By the way, what is your build environment?

LaurV 2013-02-27 01:39

[QUOTE=owftheevil;331007]Hey--who gave me the funky avatar?[/QUOTE]
We are sniffing you :razz:

owftheevil 2013-02-27 01:46

Nah. If it was really me you were sniffing, you wouldn't be able to sit at your computer and type.

Batalov 2013-02-27 01:56

"You has a smell"

Note: ewmayer [URL="http://mersenneforum.org/showpost.php?p=331040&postcount=10"]has no idea[/URL] that he, too, has an avatar!!
Pssst. Don't tell him. :missingteeth:

ewmayer 2013-02-27 02:03

[QUOTE=Batalov;331183]"You has a smell"

Note: ewmayer [URL="http://mersenneforum.org/showpost.php?p=331040&postcount=10"]has no idea[/URL] that he, too, has an avatar!!
Pssst. Don't tell him. :missingteeth:[/QUOTE]

I don't just have an avatar ... I have the Hoff-atar. Try not to be jealous - it's both a blessing and a curse, a heavy Solomonic burden to be borne painfully but stoically.

Not that I am aware of it - for all I know Mike is changing my av 6 times an hour but rigging it so when I choose to view it, it comes up "Hoff".

Aramis Wyler 2013-02-27 02:08

The noise.
 
The noise my video cards make when I try to run cudalucas on either of them is part of the reason I don't intend to run it until it's been recoded. Both of my cards are water cooled, and are silent when trial factoring. But they each scream like their little souls are being torn forcibly from their heatsinks if I run cudalucas on them. It puts my teeth on edge; I can't run the program.

flashjh 2013-02-27 02:30

I'll get it compiled in the next couple of days; we still need to incorporate the changes. Also, I don't have the newest CC installed yet.

My Titan shipped today!! :smile:

Brain 2013-02-27 06:18

[QUOTE=flashjh;331189]
My Titan shipped today!! :smile:[/QUOTE]
[URL="https://www.youtube.com/watch?v=8EI7p2p1QJI"]You lucky bastard![/URL] ;-)

Me still waitin'. Maybe still flying over the ocean.
What partner card? Asus/EVGA/Gainward/Zotac?

flashjh 2013-02-27 07:21

EVGA - I had a pre-order in for Asus, but I checked at Newegg earlier today and they had the EVGAs in stock, so I ordered that one and cancelled the other order. It's supposed to be here on Friday.

Brain 2013-02-28 21:59

CUDA 5.0 compile
 
I've reset up my build environment for CUDA 5.0 on Win64.
Here's a [URL="https://dl.dropbox.com/u/72392549/CUDALucas-2.04%20Beta-5.0-x64.exe.zip"]ZIP[/URL] with CUDALucas@latest SVN and @SM={2.0,3.5} and the necessary dlls (8.5 MB).
I've tested 2.0 with 216091: OK
I've tested 3.5 with 216091: CUFFTERR (Titan version but I don't have mine yet).
I only did these 2 tests as it's late.

Maybe this is useful for somebody. I'd love to hear from successful Titan runs. ;-)
Greetings, Sebastian

owftheevil 2013-02-28 22:46

The latest stuff on SF has some serious problems with the FFT-changing code. That has hopefully been fixed, but I don't see the new code with the fix anywhere up there. If it has the x_packed variable in it, it's OK.

Dubslow 2013-02-28 22:48

I'll update your changes in a few hours. (Did you get my email about getting commit access?)

owftheevil 2013-02-28 23:23

Yes, got the email, but have been too busy to do anything about it yet. Sounds like a good idea.

flashjh 2013-03-01 03:02

[QUOTE=Dubslow;331463]I'll update your changes in a few hours. (Did you get my email about getting commit access?)[/QUOTE]

Have you started the update? I can take the changes and test and upload... I have some time tonight.

flashjh 2013-03-01 05:20

The most recent CUDALucas.cu file is giving me a problem compiling:

[CODE]Line 703: struct stat FileAttrib;
Line 717: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the checkpoint file. Trying the backup file.\n");
Line 737: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the backup file. Restarting test.\n");
[/CODE]

Compiler output:
[CODE]C:\CUDA\src>make -f makefile.win
makefile.win:12: Extraneous text after `else' directive
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/bin/nvcc" -c CUDALucas.
cu -o CUDALucas.cuda5.0.sm_35.x64.obj -m64 --ptxas-options=-v -ccbin="C:\Program
Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64" -Xcompiler /EHsc,/W3,/no
logo,/Ox,/Oy,/GL -arch=sm_35 -O3
CUDALucas.cu
CUDALucas.cu(703): error: incomplete type is not allowed

CUDALucas.cu(717): error: incomplete type is not allowed

CUDALucas.cu(737): error: incomplete type is not allowed[/CODE]

If you let me know what you're trying to do with the struct maybe I can figure out how to make MSVS happy.

Brain 2013-03-01 05:26

[QUOTE=Brain;331455]I've reset up my build environment for CUDA 5.0 on Win64.
Here's a [URL="https://dl.dropbox.com/u/72392549/CUDALucas-2.04%20Beta-5.0-x64.exe.zip"]ZIP[/URL] with CUDALucas@latest SVN and @SM={2.0,3.5} and the necessary dlls (8.5 MB).
I've tested 2.0 with 216091: OK
I've tested 3.5 with 216091: CUFFTERR (Titan version but I don't have mine yet).
I only did these 2 tests as it's late.

Maybe this is useful for somebody. I'd love to hear from successful Titan runs. ;-)
Greetings, Sebastian[/QUOTE]
ZIP removed due to code problems mentioned above.

flashjh 2013-03-01 05:29

I'm using a .cu file that was emailed to me. I haven't uploaded to svn yet because I want to get it to compile first. If you want to take a look at it, PM me your email address.

owftheevil 2013-03-01 12:31

Flash, you can take those lines out if you want. They are just checking if the file exists before printing the message that they can't be opened.

kjaget 2013-03-01 14:45

[QUOTE=flashjh;331496]The most recent CUDALucas.cu file is giving me a problem compiling:

[CODE]Line 703: struct stat FileAttrib;
Line 717: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the checkpoint file. Trying the backup file.\n");
Line 737: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the backup file. Restarting test.\n");
[/CODE]

Compiler output:
[CODE]C:\CUDA\src>make -f makefile.win
makefile.win:12: Extraneous text after `else' directive
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/bin/nvcc" -c CUDALucas.
cu -o CUDALucas.cuda5.0.sm_35.x64.obj -m64 --ptxas-options=-v -ccbin="C:\Program
Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64" -Xcompiler /EHsc,/W3,/no
logo,/Ox,/Oy,/GL -arch=sm_35 -O3
CUDALucas.cu
CUDALucas.cu(703): error: incomplete type is not allowed

CUDALucas.cu(717): error: incomplete type is not allowed

CUDALucas.cu(737): error: incomplete type is not allowed[/CODE]

If you let me know what you're trying to do with the struct maybe I can figure out how to make MSVS happy.[/QUOTE]

Try changing stat to _stat and see if it compiles.

flashjh 2013-03-01 19:54

I'll try tonight.

frmky 2013-03-01 21:40

Regarding FFT timings, I have brief access to a Tesla K20m. I ran a benchmark starting at 1440K and going up in 16K increments; here are the results. As usual, lengths that are slower than a longer FFT have been deleted. I'm surprised how short the resulting table is.

[CODE]FFT length (k), ms/iteration
1568 0.508496
1600 0.596174
2000 0.645019
2048 0.655126
2592 0.820283
3136 1.123238
4000 1.256788
4096 1.304601
4320 1.804463
4608 1.876166
4704 1.910958
5120 2.120896
5488 2.136009
5600 2.270577
6000 2.438436
6048 2.448022
6144 2.480506
6272 2.526666
7776 2.620803
[/CODE]

For the most recent prime, actually using 4000K rather than 3200K results in a slightly longer ETA, presumably because the multiplication and normalization are done on the longer array.

[CODE]./CUDALucas -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0933 ms/iter, ETA 65:47:57)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0613 ms/iter, ETA 65:16:29)


./CUDALucas -f 4000k -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 4000K, CUDALucas v2.04 Beta err = 0.0009 (0:42 real, 4.2419 ms/iter, ETA 68:11:21)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 4000K, CUDALucas v2.04 Beta err = 0.0010 (0:42 real, 4.2032 ms/iter, ETA 67:33:15)
[/CODE]

ET_ 2013-03-02 12:54

Just to be sure...

I'm working fine with CUDALucas v2.01 under 64-bit Linux: the program automatically recognizes errors in the FFT computation and rolls back to a safer size, even if it's not the most efficient one.

1 - Is v2.04 available for Linux?
2 - Is it more reliable?
3 - Is it faster?
4 - Does it automagically choose the fastest FFT size?

Thank you for the info.

Luigi

Dubslow 2013-03-02 21:36

2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.

None is safer or more reliable than another; the main differences from 2.01 to 2.04 are interface features (i.e. worktodo support in version 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually get an extra 5-10% performance boost.

ET_ 2013-03-02 23:19

[QUOTE=Dubslow;331713]2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.

None is safer or more reliable than another; the main differences from 2.01 to 2.04 are interface features (i.e. worktodo support in version 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually get an extra 5-10% performance boost.[/QUOTE]

Thank you Dubslow, that's exactly what I supposed... I'm waiting at the window, looking at the development, helping with tests if needed, but still keeping my old plain-vanilla v2.01 :smile:

Luigi

Brain 2013-03-13 21:07

Efficiency formula
 
[QUOTE=aaronhaviland;295742]Absolutely (from the CUFFT documentation):

I've been testing CUFFT timings for other lengths than just multiples of 32768. I've excluded the timings because they're not run exactly as CUDALucas would run them, but the fact that they are "optimal lengths" should still apply.

Eff% is calculated similarly to the prior examples here, but scaled so that the results all fall within the range 0-100. Very few lengths have an Eff% between 15% and 75%; the majority of inefficient lengths ran around 9-10%. These have all been excluded. Some of the 70-80% efficient run-lengths have also been excluded because they are smaller than a larger+faster length. [COLOR=Blue]Note the exponents in blue, which would be skipped over if only looking at multiples of 32768[/COLOR]:

[CODE]
FFT Exponent
Size Eff% 2 3 5 7
======================
1048576 97.23 20 0 0 0
[COLOR=Blue]1105920 88.82 13 3 1 0[/COLOR]
1179648 91.20 17 2 0 0[COLOR=Blue]
1204224 82.49 13 1 0 2[/COLOR]
1310720 89.06 18 0 1 0[COLOR=Blue]
1327104 90.86 14 4 0 0[/COLOR]
1376256 85.13 16 1 0 1
1474560 89.14 15 2 1 0[COLOR=Blue]
1548288 89.05 13 3 0 1[/COLOR]
1572864 89.23 19 1 0 0
1605632 88.84 15 0 0 2
1769472 92.58 16 3 0 0
1835008 89.17 18 0 0 1
2097152 95.87 21 0 0 0[COLOR=Blue]
2211840 87.81 14 3 1 0[/COLOR]
2359296 89.84 18 2 0 0[COLOR=Blue]
2370816 80.62 8 3 0 3[/COLOR][COLOR=Blue]
2408448 81.08 14 1 0 2[/COLOR]
2621440 87.60 19 0 1 0
2654208 85.52 15 4 0 0
[COLOR=Blue]2709504 82.21 11 3 0 2
2809856 82.38 13 0 0 3
2985984 87.28 12 6 0 0
3096576 85.87 14 3 0 1
[/COLOR]3145728 85.74 20 1 0 0
3211264 82.12 16 0 0 2
[COLOR=Blue]3317760 82.69 13 4 1 0
3359232 74.71 9 8 0 0
3386880 71.31 9 3 1 2
[/COLOR]3932160 80.93 18 1 1 0
[COLOR=Blue]4014080 80.65 14 0 1 2
[/COLOR]4096000 73.66 15 0 3 0
4194304 95.87 22 0 0 0
4423680 87.81 15 3 1 0
4718592 89.84 19 2 0 0
[COLOR=Blue]4741632 80.62 9 3 0 3
[/COLOR]4816896 81.08 15 1 0 2
5242880 87.60 20 0 1 0
5308416 85.52 16 4 0 0
[COLOR=Blue]5419008 82.21 12 3 0 2
5619712 82.38 14 0 0 3
5971968 87.28 13 6 0 0
[/COLOR]6193152 85.87 15 3 0 1
6291456 85.74 21 1 0 0
6422528 82.12 17 0 0 2
[COLOR=Blue]6635520 82.69 14 4 1 0
6718464 74.71 10 8 0 0
6773760 71.31 10 3 1 2
[/COLOR]7864320 80.93 19 1 1 0
8028160 80.65 15 0 1 2
8192000 73.66 16 0 3 0[/CODE][/QUOTE]

[QUOTE]For sizes handled by the Cooley-Tukey code path (that is, representable as 2^a ⋅ 3^b ⋅ 5^c ⋅ 7^d), the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).
[LIST][*] [I]Restrict the size along all dimensions to be representable as 2^a ⋅ 3^b ⋅ 5^c ⋅ 7^d.[/I]
The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.[*] [I]Restrict the size along each dimension to use fewer distinct prime factors.[/I]
For example, a transform of size 3^n will usually be faster than one of size 2^i ⋅ 3^j even if the latter is slightly smaller.[*] [I]Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms.[/I]
This further aids with memory coalescing.[*] [I]Restrict the x dimension of single-precision transforms to be strictly a power of two, between 2 and 8192 for Fermi-class GPUs or between 2 and 2048 for earlier architectures.[/I]
These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.[*] [I]Use Native compatibility mode for in-place complex-to-real or real-to-complex transforms.[/I]
This scheme reduces the write/read of padding bytes, hence helping with coalescing of the data.[/LIST][/QUOTE]I'd like to have a formula that computes an efficiency score between 0 and 100% for a given FFT length, like Aaron did. Any suggestions? Aaron, can you post yours?
Thoughts:
- measure the runtime of each FFT length and fit the formula to the measured ms/iter
- theoretical Gflops throughput as given by Nvidia for radix-2 through radix-7
- theoretical model with weighted scores for radix-2 through radix-7 and a penalty for more distinct prime factors <-- my attempt, but the scores still need analysis
- ...

Suggestions? Or has this already been done?

Brain 2013-03-14 21:48

[QUOTE=Brain;333237]
Suggestions? Or has this been invented yet?[/QUOTE]
I think Aaronhaviland did this with the cufftbench option.
I will do the same and normalize the timings by FFT length / time.
Raw data that will be used can be found in this [URL="http://www.mersenneforum.org/showpost.php?p=333372&postcount=234"]Titan's thread post[/URL].


All times are UTC. The time now is 22:00.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.