mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

James Heinrich 2012-03-28 19:20

Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? A quick glance at the source didn't jump out at me where the lookup table was for what FFT size CUDALucas will start with by default. How can I find this (ideally for all possible exponents that it can handle)?

Prime95 2012-03-28 19:38

Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.

flashjh 2012-03-28 19:46

[QUOTE=James Heinrich;294529]Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? A quick glance at the source didn't jump out at me where the lookup table was for what FFT size CUDALucas will start with by default. How can I find this (ideally for all possible exponents that it can handle)?[/QUOTE]

As far as I know, for now you just pick a range around the exponent and run the cufft test to choose the best FFT for your card/system/exponent. I have not yet tested different thread sizes and the corresponding cufft test for different FFTs. When I get some time, I'll see what I can do. Look back through the thread for more FFT discussions.
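A minimal sketch of that "benchmark a range, then pick the fastest" step, assuming the benchmark prints lines in the `CUFFT_Z2Z size= ... time= ... msec` form quoted further down this thread. The names `parse_cufftbench` and `fastest_at_least` are hypothetical helpers, not part of CUDALucas, and the minimum FFT length for a given exponent is taken as an input because the thread does not say how to derive it.

```python
import re

# Line format as posted later in this thread, e.g.
# "CUFFT_Z2Z size= 1146880 time= 1.417851 msec"
LINE_RE = re.compile(r"CUFFT_Z2Z\s+size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")


def parse_cufftbench(text):
    """Return a list of (fft_size, time_msec) pairs found in the text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(text)]


def fastest_at_least(results, min_size):
    """Fastest benchmarked size that is >= min_size, or None if there is none."""
    candidates = [r for r in results if r[0] >= min_size]
    return min(candidates, key=lambda r: r[1]) if candidates else None


# Three timings taken from the GTX460 results quoted in this thread.
sample = """\
CUFFT_Z2Z size= 1146880 time= 1.417851 msec
CUFFT_Z2Z size= 1179648 time= 1.390691 msec
CUFFT_Z2Z size= 1310720 time= 1.533345 msec
"""
results = parse_cufftbench(sample)
print(fastest_at_least(results, 1100000))  # (1179648, 1.390691)
```

Timings depend on the card, so the sample numbers only illustrate the selection logic.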

Dubslow 2012-03-28 19:50

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much.

Prime95 2012-03-28 19:58

[QUOTE=Dubslow;294533]I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much.[/QUOTE]

We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff.

axn 2012-03-28 19:59

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

Found a problem: 1146880 is flagged "Y", but the larger 1179648 is faster and is not flagged.
[CODE]CUFFT_Z2Z size= 1146880 time= 1.417851 msec Y
CUFFT_Z2Z size= 1179648 time= 1.390691 msec
[/CODE]

axn 2012-03-28 20:13

Looking at the multipliers, there are definite patterns. The multipliers 1,3,5,7,9,21,27,45,49 and 81 are always selected as preferred.

Except for one instance of 21 vs 45. 1376256 is slower than 1474560. It might be worth re-benchmarking the following four to see if the results are consistent.

[CODE]CUFFT_Z2Z size= 1376256 time= 1.818975 msec (21)
CUFFT_Z2Z size= 1474560 time= 1.809079 msec Y (45)

CUFFT_Z2Z size= 2752512 time= 3.812189 msec Y (21)
CUFFT_Z2Z size= 2949120 time= 3.853927 msec Y (45)
[/CODE]
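The flagging rule implied by these comparisons can be sketched as follows. This assumes "makes sense" means that no larger FFT size runs faster, which is an inference from the examples above rather than a stated definition, and `flag_sensible` is a hypothetical name.

```python
def flag_sensible(results):
    """results: list of (fft_size, time_msec) pairs with distinct sizes.

    Returns the set of sizes for which no larger size has a strictly
    lower time (assumed meaning of the "Y" flag).
    """
    flagged = set()
    best_above = float("inf")  # fastest time seen among larger sizes
    for size, t in sorted(results, reverse=True):
        if t <= best_above:
            flagged.add(size)
        best_above = min(best_above, t)
    return flagged


# Timings quoted in the posts above.
data = [
    (1146880, 1.417851),
    (1179648, 1.390691),
    (1376256, 1.818975),
    (1474560, 1.809079),
    (2752512, 3.812189),
    (2949120, 3.853927),
]
print(sorted(flag_sensible(data)))  # [1179648, 1474560, 2752512, 2949120]
```

Under that assumption 1146880 and 1376256 drop out, while both 2752512 and 2949120 stay flagged. The margins are small enough that a re-run of the benchmark could change the outcome.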

Dubslow 2012-03-28 20:13

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

Rehashed to show a bit more about the FFT size, as well as axn's correction.

Edit: Whoops, cross post.
@axn: msft said CuLu can use any multiple of 32K, that's why I did as such.

Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M.

[code]CUFFT_Z2Z size= 1048576 = 1024K = 32*32K time= 1.130540 msec Y
CUFFT_Z2Z size= 1146880 = 1120K = 35*32K time= 1.417851 msec M
CUFFT_Z2Z size= 1179648 = 1152K = 36*32K time= 1.390691 msec Y
CUFFT_Z2Z size= 1310720 = 1280K = 40*32K time= 1.533345 msec Y
CUFFT_Z2Z size= 1376256 = 1344K = 42*32K time= 1.818975 msec M
CUFFT_Z2Z size= 1474560 = 1440K = 45*32K time= 1.809079 msec Y
CUFFT_Z2Z size= 1572864 = 1536K = 48*32K time= 1.937807 msec Y
CUFFT_Z2Z size= 1605632 = 1568K = 49*32K time= 2.023415 msec Y
CUFFT_Z2Z size= 1638400 = 1600K = 50*32K time= 2.217558 msec M
CUFFT_Z2Z size= 1769472 = 1728K = 54*32K time= 2.141137 msec Y
CUFFT_Z2Z size= 1835008 = 1792K = 56*32K time= 2.163136 msec Y
CUFFT_Z2Z size= 1966080 = 1920K = 60*32K time= 2.700584 msec M
CUFFT_Z2Z size= 2064384 = 2016K = 63*32K time= 2.551482 msec M
CUFFT_Z2Z size= 2097152 = 2048K = 64*32K time= 2.409963 msec Y
CUFFT_Z2Z size= 2293760 = 2240K = 70*32K time= 3.018234 msec M
CUFFT_Z2Z size= 2359296 = 2304K = 72*32K time= 2.766602 msec Y
CUFFT_Z2Z size= 2457600 = 2400K = 75*32K time= 3.627161 msec M
CUFFT_Z2Z size= 2621440 = 2560K = 80*32K time= 3.239111 msec Y
CUFFT_Z2Z size= 2654208 = 2592K = 81*32K time= 3.409978 msec Y
CUFFT_Z2Z size= 2752512 = 2688K = 84*32K time= 3.812189 msec Y
CUFFT_Z2Z size= 2949120 = 2880K = 90*32K time= 3.853927 msec Y
CUFFT_Z2Z size= 3145728 = 3072K = 96*32K time= 4.029561 msec Y
CUFFT_Z2Z size= 3211264 = 3136K = 98*32K time= 4.324980 msec Y
CUFFT_Z2Z size= 3276800 = 3200K = 100*32K time= 4.702814 msec M
CUFFT_Z2Z size= 3440640 = 3360K = 105*32K time= 4.934543 msec M
CUFFT_Z2Z size= 3538944 = 3456K = 108*32K time= 4.573230 msec Y
CUFFT_Z2Z size= 3670016 = 3584K = 112*32K time= 4.591721 msec Y
CUFFT_Z2Z size= 3932160 = 3840K = 120*32K time= 5.395338 msec M
CUFFT_Z2Z size= 4128768 = 4032K = 126*32K time= 5.436691 msec M
CUFFT_Z2Z size= 4194304 = 4096K = 128*32K time= 5.049356 msec Y
CUFFT_Z2Z size= 4423680 = 4320K = 135*32K time= 5.862155 msec M
CUFFT_Z2Z size= 4587520 = 4480K = 140*32K time= 6.353941 msec M
CUFFT_Z2Z size= 4718592 = 4608K = 144*32K time= 5.858453 msec Y
CUFFT_Z2Z size= 4816896 = 4704K = 147*32K time= 7.085539 msec M
CUFFT_Z2Z size= 4915200 = 4800K = 150*32K time= 7.661496 msec M
[/code]
[QUOTE=Prime95;294535]We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff.[/QUOTE]
:ouch1:

...The former is 2800 lines. Did you write those all by hand?

axn 2012-03-28 20:40

[QUOTE=Dubslow;294539]Rehashed to show a bit more about the FFT size, as well as axn's correction.

Edit: Whoops, cross post.
@axn: msft said CuLu can use any multiple of 32K, that's why I did as such.

Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M.

[/QUOTE]

I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse.
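For anyone repeating the exercise, here is a small sketch of the two calculations: the normalized time, and the 7-smooth test on the multiplier. It takes the multiplier to be the odd part of the FFT size, which is an assumption about how the sizes were grouped, and the function names are illustrative only.

```python
def odd_part(n):
    """Strip all factors of 2 from n."""
    while n % 2 == 0:
        n //= 2
    return n


def largest_prime_factor(n):
    """Largest prime factor of n by trial division (returns 1 for n == 1)."""
    largest, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            largest, n = p, n // p
        p += 1
    return n if n > 1 else largest


def is_7_smooth(fft_size):
    """True if the FFT size has no prime factor larger than 7."""
    return largest_prime_factor(odd_part(fft_size)) <= 7


def normalized(fft_size, time_msec):
    """Time scaled to a 1048576-point FFT, as described above."""
    return time_msec / (fft_size / 1048576)


print(odd_part(1474560), is_7_smooth(1474560))  # 45 True
print(odd_part(1441792), is_7_smooth(1441792))  # 11 False
```

Trial division is plenty for multipliers this small.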

Dubslow 2012-03-28 20:48

[QUOTE=axn;294542]I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse.[/QUOTE]
Could you post a chart of the multipliers' factorizations, or do you want me to do it?

axn 2012-03-28 21:16

[QUOTE=Dubslow;294544]Could you post a chart of the multipliers' factorizations, or do you want me to do it?[/QUOTE]

[CODE] FFT Pref Mult Smooth Time (ms) Normalized
1048576 Y 1 1 1.1305 1.130
2097152 Y 1 1 2.4099 1.204
4194304 Y 1 1 5.0493 1.262
1572864 Y 3 3 1.9378 1.291
3145728 Y 3 3 4.0295 1.343
1310720 Y 5 5 1.5333 1.226
2621440 Y 5 5 3.2391 1.295
1835008 Y 7 7 2.1631 1.236
3670016 Y 7 7 4.5917 1.311
1179648 Y 9 3 1.3906 1.236
2359296 Y 9 3 2.7666 1.229
4718592 Y 9 3 5.8584 1.301
1441792 11 11 12.0507 8.764
2883584 11 11 25.2414 9.178
1703936 13 13 15.8089 9.728
3407872 13 13 32.9923 10.151
1966080 15 5 2.7005 1.440
3932160 15 5 5.3953 1.438
1114112 17 17 17.8324 16.783
2228224 17 17 23.8903 11.242
4456448 17 17 54.9814 12.936
1245184 19 19 14.2480 11.998
2490368 19 19 28.5571 12.024
4980736 19 19 65.0263 13.689
1376256 ? 21 7 1.8189 1.385
2752512 Y 21 7 3.8121 1.452
1507328 23 23 19.8118 13.782
3014656 23 23 40.1851 13.977
1638400 25 5 2.2175 1.419
3276800 25 5 4.7028 1.504
1769472 Y 27 3 2.1411 1.268
3538944 Y 27 3 4.5732 1.355
1900544 29 29 30.3831 16.763
3801088 29 29 61.4396 16.948
2031616 31 31 33.3520 17.213
4063232 31 31 67.5301 17.427
1081344 33 11 9.9185 9.618
2162688 33 11 18.5583 8.997
4325376 33 11 40.4085 9.796
1146880 35 7 1.4178 1.296
2293760 35 7 3.0182 1.379
4587520 35 7 6.3539 1.452
1212416 37 37 22.6343 19.575
2424832 37 37 45.7872 19.799
4849664 37 37 99.3098 21.472
1277952 39 13 12.9222 10.602
2555904 39 13 23.8400 9.780
1343488 41 41 27.0680 21.126
2686976 41 41 54.9051 21.426
1409024 43 43 29.3962 21.876
2818048 43 43 59.6049 22.178
1474560 Y 45 5 1.8090 1.286
2949120 Y 45 5 3.8539 1.370
1540096 47 47 33.5578 22.847
3080192 47 47 68.1485 23.199
1605632 Y 49 7 2.0234 1.321
3211264 Y 49 7 4.3249 1.412
1671168 51 17 18.4646 11.585
3342336 51 17 37.9425 11.903
1736704 53 53 13.0645 7.888
3473408 53 53 26.7417 8.072
1802240 55 11 15.3619 8.937
3604480 55 11 33.7740 9.825
1867776 57 19 22.4452 12.600
3735552 57 19 45.0705 12.651
1933312 59 59 15.6682 8.498
3866624 59 59 32.2185 8.737
1998848 61 61 17.2398 9.043
3997696 61 61 36.1076 9.470
2064384 63 7 2.5514 1.295
4128768 63 7 5.4366 1.380
2129920 65 13 20.0319 9.861
4259840 65 13 43.8546 10.794
2195456 67 67 14.6807 7.011
4390912 67 67 31.1684 7.443
2260992 69 23 30.2865 14.045
4521984 69 23 62.9652 14.600
2326528 71 71 17.2002 7.752
4653056 71 71 36.2993 8.180
2392064 73 73 21.2844 9.330
4784128 73 73 44.9508 9.852
2457600 75 5 3.6271 1.547
4915200 75 5 7.6614 1.634
2523136 77 11 21.6817 9.010
2588672 79 79 20.6799 8.376
2654208 Y 81 3 3.4099 1.347
2719744 83 83 19.7181 7.602
2785280 85 17 31.6756 11.924
2850816 87 29 44.9787 16.543
2916352 89 89 20.1506 7.245
2981888 91 13 28.5019 10.022
3047424 93 31 49.9787 17.197
3112960 95 19 37.6290 12.675
3178496 97 97 26.7683 8.830
3244032 99 11 32.7558 10.587
3309568 101 101 23.0312 7.297
3375104 103 103 26.8451 8.340
3440640 105 7 4.9345 1.503
3506176 107 107 23.0966 6.907
3571712 109 109 30.6403 8.995
3637248 111 37 70.8848 20.435
3702784 113 113 26.2526 7.434
3768320 115 23 52.5471 14.621
3833856 117 13 38.0365 10.403
3899392 119 17 44.3925 11.937
3964928 121 11 54.2576 14.349
4030464 123 41 84.7203 22.041
4096000 125 5 6.4623 1.654
4161536 127 127 26.5980 6.701
4227072 129 43 91.9276 22.803
4292608 131 131 36.1097 8.820
4358144 133 19 52.7457 12.690
4423680 135 5 5.8621 1.389
4489216 137 137 40.8194 9.534
4554752 139 139 36.4544 8.392
4620288 141 47 104.9800 23.825
4685824 143 13 68.4353 15.314
4751360 145 29 79.7051 17.590
4816896 147 7 7.0855 1.542
4882432 149 149 38.9179 8.358
4947968 151 151 38.3244 8.121
[/CODE]

