mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Lorenzo 2017-02-11 16:11

Oliver, could you please test for 332M?

TheJudger 2017-02-12 15:59

[QUOTE=ATH;452776]Now there is a Nvidia Quadro GP100 coming in March:
[url]http://www.anandtech.com/show/11102/nvidia-announces-quadro-gp100[/url]

Unfortunately they only state FP64 = 1/2 FP32, not the actual numbers.[/QUOTE]

Same chip, even the same number of cores enabled, and I'm pretty sure there are no other differences, so just scale based on clock rates. Keep in mind that the Quadro has a lower TDP and thus a lower probability of running at max boost clock.
For CUDALucas it is safe to assume that the TDP won't limit the clock rates (will post some numbers later).

Oliver

TheJudger 2017-02-12 16:23

Benchmark for FFTs from 8M to 32M:
[CODE]Device Tesla P100-PCIE-16GB
Compatibility 6.0
clockRate (MHz) 1328
memClockRate (MHz) 715

fft max exp ms/iter
8192 149447533 2.2734
8232 150161509 2.7919
8748 159365399 2.9310
9000 163856051 2.9859
9072 165138601 3.0350
9216 167703023 3.0529
10976 198980129 3.1378
11200 202952693 3.8478
11664 211176269 3.9333
11907 215480183 4.0265
12000 217126817 4.0990
12250 221551991 4.1330
12348 223286171 4.1555
12500 225975263 4.1805
12544 226753511 4.1911
12800 231280639 4.2456
13122 236972111 4.3220
15552 279831199 4.3367
15680 282084599 5.2484
16000 287716357 5.2611
16384 294471259 5.2623
16807 301908293 5.6589
17496 314013451 5.7786
18000 322861793 5.9534
21952 392070229 6.0496
22400 399897793 7.5535
23328 416101459 7.9283
23814 424581893 7.9459
24500 436545821 8.0674
25088 446794913 8.1354
25600 455715121 8.4951
26244 466929581 8.5585
27000 480086839 8.8713
27648 491358173 9.0432
27783 493705637 9.4553
28672 509158127 9.5131
28800 511382147 9.7871
30375 538730923 10.2642
31104 551379091 10.3052
31752 562616531 10.3843
32000 566915989 10.4485
32768 580225813 10.5754
[/CODE]

And as requested (I won't continue; it's a known composite, a factor has been found):
[CODE]Starting M332192879 fft length = 21952K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Feb 12 17:18:57 | M332192879 10000 0xa19043095e213f4c | 21952K 0.00684 6.0656 60.65s | 23:07:41:48 0.00% |
| Feb 12 17:19:58 | M332192879 20000 0xcb7bc66ac81b24be | 21952K 0.00635 6.0665 60.66s | 23:07:43:22 0.00% |
| Feb 12 17:20:59 | M332192879 30000 0x38e4cc517de8fda3 | 21952K 0.00647 6.0663 60.66s | 23:07:42:50 0.00% |
[/CODE]
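
The log starts with "fft length = 21952K", which matches what the benchmark table suggests for this exponent. A minimal Python sketch of that lookup, assuming the rule is simply "fastest benchmarked length whose max exp covers the exponent" (that rule and the name `pick_fft` are assumptions for illustration, not CUDALucas's actual selection code); the rows are copied from the P100-16GB table above:

```python
# (fft length in K, max exponent, ms/iter), copied from the P100-16GB benchmark
P100_16GB = [
    (16384, 294471259, 5.2623),
    (16807, 301908293, 5.6589),
    (17496, 314013451, 5.7786),
    (18000, 322861793, 5.9534),
    (21952, 392070229, 6.0496),
    (22400, 399897793, 7.5535),
    (23328, 416101459, 7.9283),
]


def pick_fft(rows, exponent):
    """Return the fastest row whose max exponent covers `exponent`, or None.

    Hypothetical helper: an assumed selection rule, not CUDALucas's own logic.
    """
    usable = [row for row in rows if row[1] >= exponent]
    return min(usable, key=lambda row: row[2]) if usable else None


print(pick_fft(P100_16GB, 332192879))
```

For M332192879 this picks the 21952K row (6.0496 ms/iter in the benchmark), close to the ~6.066 ms/iter seen in the actual run.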

Power consumption (board power as reported by nvidia-smi) is around 180W-185W.

Oliver

Lorenzo 2017-02-12 19:24

Cool! Thank you, Oliver. Looks like an absolute record: ~6.0665 ms/iter and ETA ~23 days and 7 hours. :tu:
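
The ETA can be roughly cross-checked from the numbers in the log. A small Python sketch, assuming a constant iteration time for the whole run and using the fact that a Lucas-Lehmer test of M(p) needs p - 2 squarings; `eta_seconds` and `fmt_eta` are names made up for this example, and CUDALucas may compute its ETA slightly differently:

```python
def eta_seconds(exponent, ms_per_iter, done=0):
    """Remaining run time in seconds, assuming a constant ms/iter."""
    # A Lucas-Lehmer test of M(p) needs p - 2 squarings in total.
    remaining_iterations = exponent - 2 - done
    return remaining_iterations * ms_per_iter / 1000.0


def fmt_eta(seconds):
    """Format seconds as days:hours:minutes:seconds, like the ETA column."""
    s = int(seconds)
    days, s = divmod(s, 86400)
    hours, s = divmod(s, 3600)
    minutes, s = divmod(s, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{s:02d}"


# Values from the second log line: 20000 iterations done at 6.0665 ms/iter.
print(fmt_eta(eta_seconds(332192879, 6.0665, done=20000)))
```

This lands at about 23 days and 7 hours, within a few minutes of the ETA column in the log.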

TheJudger 2017-02-12 21:38

[QUOTE=Lorenzo;452842]Cool! Thank you, Oliver. Looks like an absolute record: ~6.0665 ms/iter and ETA ~23 days and 7 hours. :tu:[/QUOTE]

Guess performance per watt isn't that bad, either. :smile:
The Quadro GP100 should beat the absolute performance of this baby.
The Tesla P100-SXM2 will be even faster (same chip, higher clock rate and 300W TDP).

On the other hand, performance per money (hardware purchase) is somewhat on the lower end. :blush:

Oliver

P.S. 1000th post for me!

TheJudger 2017-03-08 18:17

Need moar baaaaaaandwidth!
 
Linux, CUDALucas 2.05.1, CUDA 8.0 running './CUDALucas -cufftbench 2048 32768 20'
[CODE]Device Tesla P100-PCIE-[B][COLOR="Red"]12[/COLOR][/B]GB
Compatibility 6.0
clockRate (MHz) 1328
memClockRate (MHz) 715

fft max exp ms/iter
2048 38492887 0.8290
2058 38676779 1.0388
2592 48471289 1.0416
2744 51250889 1.1145
3136 58404433 1.2523
3200 59570449 1.5077
3240 60298969 1.5151
3888 72075517 1.5259
4096 75846319 1.5597
5184 95507747 2.0062
5488 100984691 2.3054
5600 103000823 2.6145
5832 107174381 2.6678
6075 111541967 2.8229
6125 112440191 2.8590
6272 115080019 2.8595
6400 117377567 2.8738
7776 142017539 2.9085
8000 146019329 3.0031
8192 149447533 3.0191
8640 157439981 3.9448
8748 159365399 4.0645
9072 165138601 4.1045
9604 174608443 4.2650
9800 178094491 4.3199
10976 198980129 4.4162
11200 202952693 5.1175
11664 211176269 5.2472
11907 215480183 5.3872
12150 219782179 5.4020
12250 221551991 5.4762
12544 226753511 5.4953
12800 231280639 5.6827
15552 279831199 5.7517
15625 281116351 7.0191
15876 285534331 7.0333
16384 294471259 7.0384
16807 301908293 7.7869
17150 307935821 7.8517
17280 310219633 7.8554
17496 314013451 8.0196
18144 325388893 8.2523
21952 392070229 8.3905
22400 399897793 10.1928
23328 416101459 10.6708
23814 424581893 10.6803
24300 433058579 10.7332
24500 436545821 10.7743
25088 446794913 10.9927
25600 455715121 11.3715
25920 461288279 11.7309
26244 466929581 12.0685
27216 483844577 12.3786
27648 491358173 12.5897
28224 501372343 12.7468
28672 509158127 13.1385
28800 511382147 13.2269
30375 538730923 13.6812
31104 551379091 13.8575
32000 566915989 13.9916
32768 580225813 14.1780
[/CODE]

[CODE]Starting M332192879 fft length = 21952K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 08 19:12:13 | M332192879 10000 0xa19043095e213f4c | 21952K 0.00658 8.4420 84.42s | 32:10:58:17 0.00% |
| Mar 08 19:13:37 | M332192879 20000 0xcb7bc66ac81b24be | 21952K 0.00684 8.4417 84.41s | 32:10:56:14 0.00% |
| Mar 08 19:15:02 | M332192879 30000 0x38e4cc517de8fda3 | 21952K 0.00696 8.4417 84.41s | 32:10:54:27 0.00% |
[/CODE]

The Tesla P100-PCIE-[B][COLOR="Red"]12[/COLOR][/B]GB is identical to the P100-PCIE-[B][COLOR="Red"]16[/COLOR][/B]GB except that [U]memory capacity and bandwidth are only 3/4[/U]. Seems like CUDALucas is memory bandwidth bound on that P100... 732 GB/s (16GB) vs. 549 GB/s (12GB).

Again, no special tuning: just check out the source code, compile and run the benchmark.

Oliver

flashjh 2017-03-09 01:36

Nice

Lorenzo 2017-03-09 09:09

Very sensitive to bandwidth. And the results for the 332M test differ by more than the 3/4 bandwidth ratio (i.e. more than 25%) would suggest.
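
To put numbers on that, here is a short Python sketch comparing the bandwidth ratio with the measured iteration times. It only uses figures posted in this thread (732 vs. 549 GB/s, and the second log line of each M332192879 run); the variable names are made up for the example:

```python
bw_16gb, bw_12gb = 732.0, 549.0    # GB/s, as quoted for the two P100 cards
ms_16gb, ms_12gb = 6.0665, 8.4417  # ms/iter on M332192879 with the 21952K FFT

bandwidth_ratio = bw_12gb / bw_16gb   # 12GB card relative to 16GB card
throughput_ratio = ms_16gb / ms_12gb  # iterations per second, same direction
slowdown = ms_12gb / ms_16gb - 1.0    # extra time per iteration on the 12GB card

print(f"bandwidth ratio:  {bandwidth_ratio:.3f}")
print(f"throughput ratio: {throughput_ratio:.3f}")
print(f"time per iteration is {slowdown:.1%} longer on the 12GB card")
```

So the 12GB card has 75% of the bandwidth but only about 72% of the throughput, i.e. each iteration takes roughly 39% longer rather than the 33% that pure bandwidth scaling would give.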

TheJudger 2017-03-23 22:42

stock 1080 Ti "Founders Edition"
 
Again Linux, CUDALucas 2.05.1, CUDA 8.0 running './CUDALucas -cufftbench 2048 32768 20'
[CODE]
Device Graphics Device
Compatibility 6.1
clockRate (MHz) 1582
memClockRate (MHz) 5505

fft max exp ms/iter
2048 38492887 1.3294
2160 40551479 1.5075
2304 43194913 1.5292
2592 48471289 1.6809
2625 49075057 1.9591
2700 50446621 1.9755
2800 52274087 2.0140
2916 54392209 2.0207
3136 58404433 2.0756
3240 60298969 2.2794
3402 63247511 2.3988
3584 66556463 2.4230
3600 66847171 2.5569
4096 75846319 2.6252
4608 85111207 3.1588
5184 95507747 3.5201
5292 97454309 3.9470
5600 103000823 4.0726
5832 107174381 4.1691
6144 112781477 4.3976
6272 115080019 4.4866
6480 118813021 4.7310
6912 126558077 4.7951
7168 131142761 5.0873
7200 131715607 5.2188
8192 149447533 5.4126
8640 157439981 6.3776
9216 167703023 6.4584
9408 171120919 7.1689
9600 174537299 7.2167
9720 176671801 7.3573
10080 183071879 7.6530
10240 185914837 7.8567
10368 188188471 8.0238
10584 192023851 8.1263
10935 198252811 8.2703
11200 202952693 8.3223
11664 211176269 8.4596
12096 218826341 8.9538
12544 226753511 9.3923
12960 234109067 9.6682
13824 249369863 10.0077
14336 258403573 10.2316
14400 259532291 10.6340
15552 279831199 11.1631
16384 294471259 11.7147
18432 330441847 13.0869
18816 337176443 14.4504
19440 348113921 14.9802
20480 366326371 14.9975
20736 370806323 15.9969
21168 378363589 16.3615
23040 411074273 16.5980
23328 416101459 17.3808
24192 431175197 19.0066
25088 446794913 19.2028
25600 455715121 19.9979
27648 491358173 20.5194
28672 509158127 21.1422
28800 511382147 21.5948
32256 571353353 23.3608
32768 580225813 23.7932
[/CODE]

Fun fact: driver version 378.13 doesn't know the name of the card...

[CODE]Starting M332192879 fft length = 18432K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Mar 23 23:37:01 | M332192879 10000 0xa19043095e213f4c | 18432K 0.25000 12.5995 125.99s | 48:10:35:53 0.00% |
| Mar 23 23:39:12 | M332192879 20000 0xcb7bc66ac81b24be | 18432K 0.25000 13.0765 130.76s | 49:08:34:04 0.00% |
| Mar 23 23:41:23 | M332192879 30000 0x38e4cc517de8fda3 | 18432K 0.25781 13.0962 130.96s | 49:16:28:27 0.00% |
[/CODE]

Power consumption during this test run hovers around 180W (board power as reported by the nvidia driver).
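
For a rough performance-per-watt comparison, the energy for one full test of M332192879 can be estimated from the board power and ms/iter figures posted in this thread. This is a back-of-the-envelope Python sketch: it assumes constant power and iteration time over the whole run, counts board power only, and uses 182.5W for the P100-16GB as the midpoint of the reported 180W-185W. `energy_kwh` is a name made up for this example. Note that the two runs also used different FFT lengths (21952K vs. 18432K).

```python
def energy_kwh(exponent, ms_per_iter, watts):
    """Estimated energy in kWh for one full Lucas-Lehmer test of M(exponent)."""
    seconds = (exponent - 2) * ms_per_iter / 1000.0  # p - 2 squarings
    return seconds * watts / 3.6e6                   # joules -> kWh


p = 332192879
p100_16gb = energy_kwh(p, 6.0665, 182.5)   # assumed midpoint of 180W-185W
gtx_1080ti = energy_kwh(p, 13.0962, 180.0)

print(f"P100-PCIE-16GB: {p100_16gb:.0f} kWh")
print(f"1080 Ti:        {gtx_1080ti:.0f} kWh")
```

Under these assumptions the P100-16GB needs roughly 100 kWh per test and the 1080 Ti a bit more than twice that.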

Oliver

kriesel 2017-03-30 14:30

[QUOTE=LaurV;301515] ... For mfaktc I solve this with batches, anyhow I should create some batches for CL, just in case :razz:

[CODE]copy /b allresults.txt+cl0\result.txt
del cl0\results.txt
copy /b allresults.txt+cl1\result.txt
del cl1\results.txt
copy /b allresults.txt+cl2\result.txt
del cl2\results.txt
etc
[/CODE]and launch it from time to time...[/QUOTE]

That doesn't look very safe to me. Does this work, or do you not mind losing some results? (The copy is assumed to be successful. Note result.txt in the copy versus results.txt in the del.) (A little late nitpicking of my own ;) on not enough sleep to code myself.)
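
A safer variant of the same idea, sketched in Python rather than as a batch file: append each per-instance results file to the combined file and delete the source only after the append has been flushed to disk. The directory names (cl0, cl1, ...) and allresults.txt come from the quoted batch; the function name `merge_results` and the ".merging" suffix are assumptions made up for this sketch. If an append fails partway, the staged file is kept, so a retry may write duplicate lines but should not lose results.

```python
import os
from pathlib import Path


def merge_results(workdirs, combined, name="results.txt"):
    """Append each workdir/name to `combined`; delete a source only after success."""
    merged = []
    for workdir in workdirs:
        src = Path(workdir) / name
        staged = src.with_name(name + ".merging")
        if not staged.exists():
            if not src.is_file():
                continue  # nothing to collect for this instance
            # Claim the file under a new name so later writes start a fresh file.
            os.replace(src, staged)
        with open(combined, "ab") as out:
            out.write(staged.read_bytes())
            out.flush()
            os.fsync(out.fileno())
        # Only reached if the append above raised no exception.
        staged.unlink()
        merged.append(str(src))
    return merged
```

It would be called as merge_results(["cl0", "cl1", "cl2"], "allresults.txt"). On Windows the rename fails while another process still has the file open, which errs on the side of leaving the results where they are.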

kriesel 2017-03-30 21:11

Adding logging
 
[QUOTE=Dubslow;302866]Sigh... it had been on my personal todo list to clean up a lot of the functions, which would be necessary to produce logging functionality... it'd probably take me a couple of days.[/QUOTE]

I wouldn't think adding the logging itself would take long. A few decades ago (pre-386!), when I was writing LL code the hard, slow way in C, before I had heard of Woltman, Crandall, prime95, etc., I had gotten in deep before deciding to add logging. One day I bit the bullet, replaced all my printf's with dprintf, and wrote a little routine called dprintf to print whatever it was fed to both stdout and a log file.
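
The same trick, sketched in Python for illustration (the original was C, and this is not CUDALucas code): one routine that takes printf-style arguments and writes the formatted text to both stdout and a log file. The class name `TeeLog` and the log path are made up for this example.

```python
import sys


class TeeLog:
    """Write every message to stdout and append it to a log file."""

    def __init__(self, path):
        self._log = open(path, "a", encoding="utf-8")

    def dprintf(self, fmt, *args):
        """printf-style output to both destinations; returns characters written."""
        text = fmt % args if args else fmt
        sys.stdout.write(text)
        self._log.write(text)
        self._log.flush()  # keep the log current if the program dies mid-run
        return len(text)

    def close(self):
        self._log.close()
```

Usage would look like log = TeeLog("run.log") followed by log.dprintf("iter %d residue 0x%016x\n", 10000, residue). Replacing the existing print calls with that one routine is then mostly a search-and-replace job.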


All times are UTC.