Oliver, could you please test for 332M?
|
[QUOTE=ATH;452776]Now there is a Nvidia Quadro GP100 coming in March:
[url]http://www.anandtech.com/show/11102/nvidia-announces-quadro-gp100[/url] Unfortunately they only state FP64 = 1/2 FP32, not the actual numbers.[/QUOTE]
Same chip, even the same number of cores enabled, and I'm pretty sure there are no other differences, so just scale based on clock rates. Keep in mind that the Quadro has a lower TDP and is thus less likely to run at max boost clock. For CUDALucas it is safe to assume that the TDP won't limit the clock rates (will post some numbers later).

Oliver |
Benchmark for FFTs from 8M to 32M:
[CODE]Device              Tesla P100-PCIE-16GB
Compatibility       6.0
clockRate (MHz)     1328
memClockRate (MHz)  715

  fft    max exp  ms/iter
 8192  149447533   2.2734
 8232  150161509   2.7919
 8748  159365399   2.9310
 9000  163856051   2.9859
 9072  165138601   3.0350
 9216  167703023   3.0529
10976  198980129   3.1378
11200  202952693   3.8478
11664  211176269   3.9333
11907  215480183   4.0265
12000  217126817   4.0990
12250  221551991   4.1330
12348  223286171   4.1555
12500  225975263   4.1805
12544  226753511   4.1911
12800  231280639   4.2456
13122  236972111   4.3220
15552  279831199   4.3367
15680  282084599   5.2484
16000  287716357   5.2611
16384  294471259   5.2623
16807  301908293   5.6589
17496  314013451   5.7786
18000  322861793   5.9534
21952  392070229   6.0496
22400  399897793   7.5535
23328  416101459   7.9283
23814  424581893   7.9459
24500  436545821   8.0674
25088  446794913   8.1354
25600  455715121   8.4951
26244  466929581   8.5585
27000  480086839   8.8713
27648  491358173   9.0432
27783  493705637   9.4553
28672  509158127   9.5131
28800  511382147   9.7871
30375  538730923  10.2642
31104  551379091  10.3052
31752  562616531  10.3843
32000  566915989  10.4485
32768  580225813  10.5754
[/CODE]
And as requested (won't continue, known composite (factor found)):
[CODE]Starting M332192879 fft length = 21952K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Feb 12  17:18:57  | M332192879     10000  0xa19043095e213f4c  | 21952K  0.00684   6.0656   60.65s  |  23:07:41:48   0.00%  |
|  Feb 12  17:19:58  | M332192879     20000  0xcb7bc66ac81b24be  | 21952K  0.00635   6.0665   60.66s  |  23:07:43:22   0.00%  |
|  Feb 12  17:20:59  | M332192879     30000  0x38e4cc517de8fda3  | 21952K  0.00647   6.0663   60.66s  |  23:07:42:50   0.00%  |
[/CODE]
Power consumption (board power as reported by nvidia-smi) is around 180W-185W.

Oliver |
Cool! Thank you, Oliver. Looks like an absolute record: ~6.0665 ms/iter and ETA ~23 days and 7 hours. :tu:
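
The ETA in the log can be roughly reproduced from the ms/iter figure alone. The following is a minimal sketch, assuming a Lucas-Lehmer test of M(p) needs about p - 2 iterations and that the iteration time stays constant; the function names are made up for illustration and this is not how CUDALucas itself necessarily computes its ETA.

```python
def ll_eta_seconds(exponent, iterations_done, ms_per_iter):
    """Estimated remaining wall time in seconds for an LL test.

    Assumes the test needs (exponent - 2) iterations in total and that
    ms_per_iter stays constant for the rest of the run.
    """
    remaining = (exponent - 2) - iterations_done
    return remaining * ms_per_iter / 1000.0


def format_eta(seconds):
    """Format seconds as d:hh:mm:ss, the layout used in the log above."""
    total = int(seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{secs:02d}"


# Values from the first log line of the P100-PCIE-16GB run.
eta = ll_eta_seconds(332192879, 10000, 6.0656)
print(format_eta(eta))
```

The result lands within a minute of the 23:07:41:48 shown in the first log line; the small difference is presumably rounding of the reported ms/iter.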
|
[QUOTE=Lorenzo;452842]Cool! Thank you, Oliver. Looks like an absolute record: ~6.0665 ms/iter and ETA ~23 days and 7 hours. :tu:[/QUOTE]
Guess performance per watt isn't that bad, either. :smile: Quadro GP100 should beat the absolute performance of this baby. Tesla P100-SXM2 will be even faster (same chip, higher clock rate and 300W TDP). On the other hand, performance per money (hardware purchase) is somewhat on the lower end. :blush:

Oliver

P.S. 1000th post for me! |
Need moar baaaaaaandwidth!
Linux, CUDALucas 2.05.1, CUDA 8.0 running './CUDALucas -cufftbench 2048 32768 20'
[CODE]Device              Tesla P100-PCIE-[B][COLOR="Red"]12[/COLOR][/B]GB
Compatibility       6.0
clockRate (MHz)     1328
memClockRate (MHz)  715

  fft    max exp  ms/iter
 2048   38492887   0.8290
 2058   38676779   1.0388
 2592   48471289   1.0416
 2744   51250889   1.1145
 3136   58404433   1.2523
 3200   59570449   1.5077
 3240   60298969   1.5151
 3888   72075517   1.5259
 4096   75846319   1.5597
 5184   95507747   2.0062
 5488  100984691   2.3054
 5600  103000823   2.6145
 5832  107174381   2.6678
 6075  111541967   2.8229
 6125  112440191   2.8590
 6272  115080019   2.8595
 6400  117377567   2.8738
 7776  142017539   2.9085
 8000  146019329   3.0031
 8192  149447533   3.0191
 8640  157439981   3.9448
 8748  159365399   4.0645
 9072  165138601   4.1045
 9604  174608443   4.2650
 9800  178094491   4.3199
10976  198980129   4.4162
11200  202952693   5.1175
11664  211176269   5.2472
11907  215480183   5.3872
12150  219782179   5.4020
12250  221551991   5.4762
12544  226753511   5.4953
12800  231280639   5.6827
15552  279831199   5.7517
15625  281116351   7.0191
15876  285534331   7.0333
16384  294471259   7.0384
16807  301908293   7.7869
17150  307935821   7.8517
17280  310219633   7.8554
17496  314013451   8.0196
18144  325388893   8.2523
21952  392070229   8.3905
22400  399897793  10.1928
23328  416101459  10.6708
23814  424581893  10.6803
24300  433058579  10.7332
24500  436545821  10.7743
25088  446794913  10.9927
25600  455715121  11.3715
25920  461288279  11.7309
26244  466929581  12.0685
27216  483844577  12.3786
27648  491358173  12.5897
28224  501372343  12.7468
28672  509158127  13.1385
28800  511382147  13.2269
30375  538730923  13.6812
31104  551379091  13.8575
32000  566915989  13.9916
32768  580225813  14.1780
[/CODE]
[CODE]Starting M332192879 fft length = 21952K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 08  19:12:13  | M332192879     10000  0xa19043095e213f4c  | 21952K  0.00658   8.4420   84.42s  |  32:10:58:17   0.00%  |
|  Mar 08  19:13:37  | M332192879     20000  0xcb7bc66ac81b24be  | 21952K  0.00684   8.4417   84.41s  |  32:10:56:14   0.00%  |
|  Mar 08  19:15:02  | M332192879     30000  0x38e4cc517de8fda3  | 21952K  0.00696   8.4417   84.41s  |  32:10:54:27   0.00%  |
[/CODE]
The Tesla P100-PCIE-[B][COLOR="Red"]12[/COLOR][/B]GB is identical to the P100-PCIE-[B][COLOR="Red"]16[/COLOR][/B]GB except that [U]memory capacity and bandwidth are only 3/4[/U]. Seems like CUDALucas is memory bandwidth bound on that P100... 732 GB/s (16GB) vs. 549 GB/s (12GB). Again no special tuning: just check out the source code, compile, and run the benchmark.

Oliver |
Nice
|
Very sensitive to bandwidth. And the results for the 332M test differ by more than the 3/4 bandwidth ratio (i.e. by more than 25%) would suggest.
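
The bandwidth comparison can be checked with a few lines of arithmetic. This is only a sketch using numbers copied from the two P100 posts above (three benchmark rows plus the M332192879 run at iteration 20000); the variable and function names are invented for illustration.

```python
# Memory bandwidth in GB/s as quoted in the thread.
BANDWIDTH_GBS = {"16GB": 732.0, "12GB": 549.0}

# (ms/iter on the 16GB card, ms/iter on the 12GB card), copied from the posts.
MS_PER_ITER = {
    "8192K benchmark": (2.2734, 3.0191),
    "16384K benchmark": (5.2623, 7.0384),
    "32768K benchmark": (10.5754, 14.1780),
    "M332192879 run (21952K)": (6.0665, 8.4417),
}


def throughput_ratio(ms_16gb, ms_12gb):
    """Throughput of the 12GB card relative to the 16GB card.

    Throughput is inversely proportional to ms/iter, so the ratio is
    ms_16gb / ms_12gb. A purely bandwidth-bound workload would be
    expected to land near the bandwidth ratio.
    """
    return ms_16gb / ms_12gb


bandwidth_ratio = BANDWIDTH_GBS["12GB"] / BANDWIDTH_GBS["16GB"]
ratios = {name: throughput_ratio(*pair) for name, pair in MS_PER_ITER.items()}

print(f"bandwidth ratio: {bandwidth_ratio:.3f}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The three benchmark rows come out close to the 0.75 bandwidth ratio, while the 332M run comes out near 0.72, which is the "more than 25%" gap noted above.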
|
stock 1080 Ti "Founders Edition"
Again Linux, CUDALucas 2.05.1, CUDA 8.0 running './CUDALucas -cufftbench 2048 32768 20'
[CODE]Device              Graphics Device
Compatibility       6.1
clockRate (MHz)     1582
memClockRate (MHz)  5505

  fft    max exp  ms/iter
 2048   38492887   1.3294
 2160   40551479   1.5075
 2304   43194913   1.5292
 2592   48471289   1.6809
 2625   49075057   1.9591
 2700   50446621   1.9755
 2800   52274087   2.0140
 2916   54392209   2.0207
 3136   58404433   2.0756
 3240   60298969   2.2794
 3402   63247511   2.3988
 3584   66556463   2.4230
 3600   66847171   2.5569
 4096   75846319   2.6252
 4608   85111207   3.1588
 5184   95507747   3.5201
 5292   97454309   3.9470
 5600  103000823   4.0726
 5832  107174381   4.1691
 6144  112781477   4.3976
 6272  115080019   4.4866
 6480  118813021   4.7310
 6912  126558077   4.7951
 7168  131142761   5.0873
 7200  131715607   5.2188
 8192  149447533   5.4126
 8640  157439981   6.3776
 9216  167703023   6.4584
 9408  171120919   7.1689
 9600  174537299   7.2167
 9720  176671801   7.3573
10080  183071879   7.6530
10240  185914837   7.8567
10368  188188471   8.0238
10584  192023851   8.1263
10935  198252811   8.2703
11200  202952693   8.3223
11664  211176269   8.4596
12096  218826341   8.9538
12544  226753511   9.3923
12960  234109067   9.6682
13824  249369863  10.0077
14336  258403573  10.2316
14400  259532291  10.6340
15552  279831199  11.1631
16384  294471259  11.7147
18432  330441847  13.0869
18816  337176443  14.4504
19440  348113921  14.9802
20480  366326371  14.9975
20736  370806323  15.9969
21168  378363589  16.3615
23040  411074273  16.5980
23328  416101459  17.3808
24192  431175197  19.0066
25088  446794913  19.2028
25600  455715121  19.9979
27648  491358173  20.5194
28672  509158127  21.1422
28800  511382147  21.5948
32256  571353353  23.3608
32768  580225813  23.7932
[/CODE]
Funny fact: driver version 378.13 doesn't know the name of the card...
[CODE]Starting M332192879 fft length = 18432K
|   Date     Time    |   Test Num     Iter        Residue        |    FFT   Error     ms/It     Time  |       ETA      Done   |
|  Mar 23  23:37:01  | M332192879     10000  0xa19043095e213f4c  | 18432K  0.25000  12.5995  125.99s  |  48:10:35:53   0.00%  |
|  Mar 23  23:39:12  | M332192879     20000  0xcb7bc66ac81b24be  | 18432K  0.25000  13.0765  130.76s  |  49:08:34:04   0.00%  |
|  Mar 23  23:41:23  | M332192879     30000  0x38e4cc517de8fda3  | 18432K  0.25781  13.0962  130.96s  |  49:16:28:27   0.00%  |
[/CODE]
Power consumption during this test run hovers around 180W (board power as reported by the nvidia driver).

Oliver |
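
For the performance-per-watt remark earlier in the thread, a rough energy-per-iteration comparison can be sketched from the reported board power and ms/iter on the same exponent. The figures are copied from the posts (the upper end of 180W-185W for the P100-PCIE-16GB, about 180W for the 1080 Ti); board power is only an approximation of total system draw, and the names below are illustrative.

```python
def joules_per_iteration(board_watts, ms_per_iter):
    """Energy per LL iteration in joules: power (W) times time (s)."""
    return board_watts * ms_per_iter / 1000.0


# (board power in W, ms/iter on M332192879 at iteration 20000), from the posts.
RUNS = {
    "Tesla P100-PCIE-16GB": (185.0, 6.0665),
    "GTX 1080 Ti": (180.0, 13.0765),
}

energy = {name: joules_per_iteration(*run) for name, run in RUNS.items()}

for name, joules in energy.items():
    print(f"{name}: {joules:.2f} J/iter")
```

Even taking the upper power figure for the P100, it needs roughly half the energy per iteration of the 1080 Ti on this exponent. No power figure was posted for the 12GB card, so it is left out.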
[QUOTE=LaurV;301515] ... For mfaktc I solve this with batches, anyhow I should create some batches for CL, just in case :razz:
[CODE]copy /b allresults.txt+cl0\result.txt
del cl0\results.txt
copy /b allresults.txt+cl1\result.txt
del cl1\results.txt
copy /b allresults.txt+cl2\result.txt
del cl2\results.txt
etc
[/CODE]and launch it from time to time...[/QUOTE]
That doesn't look very safe to me. Does this work, or do you not mind losing some results? (The copy is assumed to be successful. Note result.txt in the copy versus results.txt in the del.) (A little late nitpicking of my own ;) on not enough sleep to code it myself.) |
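
One way to address the safety concern is to delete a results file only after its contents have demonstrably been appended to the combined file. The sketch below is an assumption-laden illustration, not anyone's actual setup: the directory names `cl0`, `cl1`, `cl2` and the file names come from the quoted batch, while `merge_results`, `append_and_remove` and the `.merging` suffix are hypothetical. Renaming the file first means a worker that writes a new result in the meantime creates a fresh `results.txt` instead of appending to a file that is about to be deleted.

```python
import os
from pathlib import Path


def append_and_remove(src, combined):
    """Append src to combined, flush to disk, and only then delete src."""
    data = src.read_bytes()
    with open(combined, "ab") as out:
        out.write(data)
        out.flush()
        os.fsync(out.fileno())
    src.unlink()  # reached only if the append above did not raise


def merge_results(worker_dirs, combined, name="results.txt"):
    """Collect each worker's results file into one combined file.

    Returns the number of files merged. Missing files are skipped.
    """
    combined = Path(combined)
    merged = 0
    for directory in worker_dirs:
        src = Path(directory) / name
        claimed = src.with_name(name + ".merging")
        if claimed.exists():  # left over from an interrupted earlier run
            append_and_remove(claimed, combined)
            merged += 1
        if src.exists():
            src.rename(claimed)  # claim the file before reading it
            append_and_remove(claimed, combined)
            merged += 1
    return merged
```

A call such as `merge_results(["cl0", "cl1", "cl2"], "allresults.txt")` would then replace the copy/del pairs. If an append fails, the exception propagates and the claimed file stays on disk for the next run.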
Adding logging
[QUOTE=Dubslow;302866]Sigh... it had been on my personal todo list to clean up a lot of the functions, which would be necessary to produce logging functionality... it'd probably take me a couple of days.[/QUOTE]
I wouldn't think adding the logging itself would take long. A few decades ago (pre-386!) when I was writing LL code the hard slow way in C, before I had heard of Woltman, Crandall, prime95, etc., I had gotten in deep before deciding to add logging. One day I bit the bullet, replaced all my printf's with dprintf, and wrote a little routine called dprintf that prints whatever it is fed to both stdout and a log file. |
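
The "print to both stdout and a log file" idea is small enough to sketch. The original was written in C; what follows is a Python analogue of the same pattern, not the poster's code, and the default log file name is a made-up placeholder.

```python
import sys

LOG_PATH = "ll_test.log"  # hypothetical default log file name


def dprintf(fmt, *args, log_path=LOG_PATH):
    """printf-style output duplicated to stdout and appended to a log file.

    Returns the number of characters written, loosely mirroring printf.
    """
    text = fmt % args if args else fmt
    sys.stdout.write(text)
    sys.stdout.flush()
    # Open in append mode on every call so the log survives a crash mid-run.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(text)
    return len(text)
```

Existing print calls can then be swapped one for one, e.g. `dprintf("Iter %d  Residue %s\n", 10000, "0xa19043095e213f4c")`. Reopening the file on every call is slow but simple; a long-running program might keep the handle open and flush periodically instead.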