Fast is phun!
CUDA 9.0, CUDA driver 384.98, CUDALucas 2.05.1 (SVN rev. 99)
Benchmark FFT sizes, './CUDALucas -cufftbench 2048 32768 20':
[CODE]Device               Tesla V100-PCIE-16GB
Compatibility        7.0
clockRate (MHz)      1380
memClockRate (MHz)   877

  fft     max exp   ms/iter
 2048    38492887    0.3978
 2187    41047411    0.5123
 2304    43194913    0.5183
 2401    44973503    0.5293
 2500    46787207    0.5429
 2592    48471289    0.5460
 2744    51250889    0.5997
 3136    58404433    0.6361
 3200    59570449    0.6514
 3456    64229677    0.7015
 4096    75846319    0.7591
 4375    80897867    0.9595
 4608    85111207    0.9649
 5184    95507747    1.0124
 5488   100984691    1.1235
 6272   115080019    1.2037
 6400   117377567    1.2445
 6561   120266023    1.3328
 6912   126558077    1.3391
 8000   146019329    1.5105
 8192   149447533    1.5230
 8575   156280961    1.8316
10368   188188471    1.9362
10976   198980129    2.1451
11907   215480183    2.3303
12544   226753511    2.3331
12800   231280639    2.3830
13824   249369863    2.5663
16384   294471259    2.9531
16807   301908293    3.3334
16875   303103441    3.5138
18225   326810201    3.7274
20736   370806323    3.7880
21952   392070229    4.2109
25088   446794913    4.5286
27783   493705637    5.5610
32000   566915989    5.8087
32768   580225813    5.8343
[/CODE]
And timing a 100M exponent, './CUDALucas 332192879':
[CODE]Starting M332192879 fft length = 20736K
|   Date    Time   |   Test Num   Iter        Residue        |   FFT   Error   ms/It   Time  |     ETA     Done  |
| Dec 06 21:52:36 | M332192879  10000  0xa19043095e213f4c | 20736K  0.01758  3.8055  38.05s | 14:15:09:00  0.00% |
| Dec 06 21:53:14 | M332192879  20000  0xcb7bc66ac81b24be | 20736K  0.01709  3.8051  38.05s | 14:15:07:16  0.00% |
| Dec 06 21:53:52 | M332192879  30000  0x38e4cc517de8fda3 | 20736K  0.01758  3.8051  38.05s | 14:15:06:19  0.00% |
[/CODE]
Power consumption (board power reported by 'nvidia-smi') is around 145 W while running the LL test of M332192879.

Oliver
So the energy consumed is roughly 351 h * 145 W ≈ 51 kWh (the ETA of 14 days 15 hours is about 351 hours).
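A quick back-of-the-envelope check of those figures (the 3.8055 ms/iter and 145 W values come from the posts above; the rest is arithmetic):

```python
# An LL test of M_p needs about p iterations, so total time is p * ms/iter.
exponent = 332192879       # exponent under test
ms_per_iter = 3.8055       # reported by CUDALucas on the V100
board_power_w = 145.0      # board power reported by nvidia-smi

hours = exponent * ms_per_iter / 1000 / 3600
energy_kwh = hours * board_power_w / 1000

# about 351 h (~14.6 days), ~51 kWh -- consistent with the ETA of 14:15:09:00
print(f"{hours:.0f} h (~{hours / 24:.1f} days), ~{energy_kwh:.1f} kWh")
```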
Effect of multiple instances on performance (win some, lose some)
To follow up on [URL]http://www.mersenneforum.org/showpost.php?p=472866&postcount=2649[/URL]: testing several combinations of applications (among CUDALucas, CUDAPm1, and Mfaktc) run on several GPU models, I have preliminary results, per GPU and application combination, ranging from a few percent throughput reduction to over thirteen percent throughput increase. Throughput is computed as the sum, over each simultaneously running instance on an individual GPU, of its rate of progress divided by the rate benchmarked when that application was the only one running on that GPU. (This approach treats all run types, LL, P-1, and trial factoring, as equally valuable; what's valued is a GPU-day of that model.) Estimated standard deviations so far are on the order of 0.2% to 0.5% for the cases I've evaluated, so the observed 1-13% changes are statistically significant. A spot check of a benchmark was quickly repeatable to 0.2%. The memory requirement is typically a small fraction of total GPU RAM.
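The throughput metric described above can be sketched as follows; all the rates here are made up for illustration (the post does not give per-app numbers), and units cancel in the ratios:

```python
# Each app's relative rate is its rate while sharing the GPU divided by its
# solo benchmark rate; per-GPU throughput is the sum over co-running instances.
# 1.0 means "one dedicated GPU-day"; above 1.0, co-running is a net win.
solo_rate = {"CUDALucas": 1 / 3.8, "Mfaktc": 950.0}    # hypothetical: iters/ms, GHz-days/day
shared_rate = {"CUDALucas": 1 / 6.9, "Mfaktc": 551.0}  # hypothetical rates while co-running

throughput = sum(shared_rate[app] / solo_rate[app] for app in solo_rate)
print(f"combined throughput: {throughput:.3f} GPU-equivalents")
```

With these invented numbers the sum comes out near 1.13, i.e. the thirteen-percent-gain end of the reported range.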
CUDA 9.1, CUDA driver 387.34, CUDALucas 2.05.1 (SVN rev. 99)
Updated P100-16GiB benchmark (older CUDA 8 benchmarks are [URL="http://mersenneforum.org/showpost.php?p=452751&postcount=2561"]here[/URL] and [URL="http://mersenneforum.org/showpost.php?p=452834&postcount=2566"]here[/URL]).

Benchmark FFT sizes, './CUDALucas -cufftbench 2048 32768 20':
[CODE]Device               Tesla P100-PCIE-16GB
Compatibility        6.0
clockRate (MHz)      1328
memClockRate (MHz)   715

  fft     max exp   ms/iter
 2048    38492887    0.5972
 2187    41047411    0.7118
 2304    43194913    0.7301
 2401    44973503    0.7656
 2592    48471289    0.7971
 2744    51250889    0.8863
 3136    58404433    0.9482
 3200    59570449    0.9733
 3456    64229677    1.0467
 3584    66556463    1.1321
 4096    75846319    1.1423
 4608    85111207    1.4124
 5184    95507747    1.4988
 5488   100984691    1.6450
 6272   115080019    1.8127
 6400   117377567    1.8730
 6561   120266023    1.9556
 6912   126558077    2.0301
 7776   142017539    2.2474
 8192   149447533    2.2688
 8575   156280961    2.6593
 9261   168504209    2.8483
10368   188188471    2.9439
10976   198980129    3.1604
12544   226753511    3.5621
12800   231280639    3.6567
13824   249369863    3.9843
15552   279831199    4.4018
16384   294471259    4.5018
16807   301908293    5.1300
16875   303103441    5.5609
18225   326810201    5.7337
20736   370806323    5.8287
21952   392070229    6.2511
25088   446794913    7.0258
27783   493705637    8.2696
31104   551379091    8.7884
32000   566915989    9.0541
32768   580225813    9.0641
[/CODE]
And timing a 100M exponent, './CUDALucas 332192879':
[CODE]Starting M332192879 fft length = 20736K
|   Date    Time   |   Test Num   Iter        Residue        |   FFT   Error   ms/It   Time  |     ETA     Done  |
| Dec 23 20:37:12 | M332192879  10000  0xa19043095e213f4c | 20736K  0.01758  5.8218  58.21s | 22:09:12:09  0.00% |
| Dec 23 20:38:10 | M332192879  20000  0xcb7bc66ac81b24be | 20736K  0.01709  5.8218  58.21s | 22:09:11:04  0.00% |
| Dec 23 20:39:08 | M332192879  30000  0x38e4cc517de8fda3 | 20736K  0.01855  5.8249  58.24s | 22:09:15:49  0.00% |
[/CODE]
Oliver
[QUOTE=TheJudger;474726]CUDA 9.1, CUDA driver 387.34, CUDALucas 2.05.1 (SVN rev. 99)
Updated P100-16GiB benchmark [...] And timing a 100M exponent, './CUDALucas 332192879' [...] ETA 22:09:15:49 [...] Oliver[/QUOTE]
22 days for a 100,000,000-digit number?
Hi Luigi,
[QUOTE=ET_;474744]22 days for a 100,000,000-digit number?[/QUOTE]
Yes, but look at [URL="http://mersenneforum.org/showpost.php?p=473281&postcount=2652"]this[/URL] :smile:

Oliver
Updated bug and wish list
1 Attachment(s)
[QUOTE=kriesel;465505]Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.[/QUOTE]
After a few months and holidays, here's an updated version.
CUDALucas runtime scaling
1 Attachment(s)
The attachment is based on actual timed exponents on a GTX 480 clocked at 701 MHz. GTX 1070 times are about 70% of those; that is, what takes the 480 ten days takes the 1070 a week.
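The scaling rule is simple enough to state as a one-liner; a minimal sketch, with the 0.70 factor and the ten-day example taken from the post:

```python
# GTX 1070 run time is roughly 70% of the GTX 480 (701 MHz) run time
# for the same exponent, per the timings behind the attachment.
def scale_runtime(gtx480_days, factor=0.70):
    return gtx480_days * factor

print(scale_runtime(10.0))  # ten days on the 480 is about a week on the 1070
```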
[QUOTE=kriesel;476927]After a few months and holidays, here's an updated version.[/QUOTE]
I had a fast read; some of it I didn't understand (I'll need more time to read it in depth, I'm in a hurry now), but point 10 seems to be not actually a bug. What you see is an effect of the save file storing the time when the test was started. The program computes the ETA roughly as "how long you have worked on it" divided by "how many iterations you did", multiplied by "how many iterations you still have", and that gives a date in the future. You will see the same effect if you interrupt your work for a while (days) and resume on the same computer.

I remember a past discussion where we argued whether the interruption time should be considered (i.e. averaged into the calculation), and it seems to me that it is better included. It doesn't matter if you take one picosecond per iteration: if you spent one week doing half of the test (for whatever reason, including interruptions), it looks normal to me that you will spend another week on the other half. This way, your new computer doesn't know that the time per iteration is faster, but the ETA will "catch up" soon as the iteration count grows.

The other way, displaying the ETA as "number of remaining iterations" multiplied by "current iteration time", gives an immediate result when you move the test to a faster toy, but the ETA will be very, VERY jumpy, because the iteration time varies a lot with how busy your computer is. Some of us use our computers for other activities too, so it is not reliable. Some kind of averaging over past values (either SMA or EMA) needs to be done to avoid a jumpy ETA, and you will still see "no effect" when you move the test, until the moving average's main period passes.

Of course, it would be nice to have an option in the ini file to choose an averaging period: for example, something like 255 could be the current behavior and 0 could mean "no averaging" (jumpy). But I feel we request too much already.
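The two ETA schemes discussed above can be sketched as follows. This is not CUDALucas's actual code, just an illustration: the "lifetime average" is smooth but slow to notice a faster machine, while an EMA over recent per-iteration times reacts faster yet stays damped. The period of 255 mirrors the post's example; all numbers are made up.

```python
def eta_lifetime(elapsed_s, iters_done, iters_total):
    # "lifetime average": remaining seconds = (elapsed / done) * remaining
    return elapsed_s / iters_done * (iters_total - iters_done)

def eta_ema(per_iter_s, iters_remaining, period=255):
    # exponential moving average of recent per-iteration times, then extrapolate
    alpha = 2 / (period + 1)
    ema = per_iter_s[0]
    for t in per_iter_s[1:]:
        ema = alpha * t + (1 - alpha) * ema
    return ema * iters_remaining

# One week spent (interruptions included) on the first half of a test:
half = 10**8
days_left = eta_lifetime(7 * 86400, half, 2 * half) / 86400
print(days_left)  # about 7 days for the second half, as argued above
```

With the lifetime average, moving the half-done test to a faster GPU changes nothing at first; with the EMA, the estimate converges to the new speed after roughly one averaging period.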
EVGA GeForce GTX 1050 (2GB GDDR5)
CUDALucas2.05.1-CUDA8.0-Windows-x64.exe
GeForce GTX 1050 CUDALucas benchmarks below, followed by Intel i3-4150 Prime95 benchmarks for comparison.
[QUOTE]Device               GeForce GTX 1050
Compatibility        6.1
clockRate (MHz)      1531
memClockRate (MHz)   3504

 fft    max exp   ms/iter
1024   19535569    3.1435
1080   20580341    3.6334
1134   21586693    3.7268
1152   21921901    3.7988
1296   24599717    4.0779
1323   25101101    4.5591
1350   25602229    4.7156
1440   27271147    4.8550
1458   27604673    5.0420
1568   29640913    5.0514
1600   30232693    5.1335
1728   32597297    5.5383
1792   33778141    6.0359
2048   38492887    6.2727
2304   43194913    7.2045
2352   44075249    8.4636
2592   48471289    8.4958
2688   50227213    9.5387
2700   50446621    9.9943
2916   54392209   10.0159
3024   56362639   10.4233
3136   58404433   10.4462
3200   59570449   11.2744
3240   60298969   11.5216
3402   63247511   11.9492
3584   66556463   12.3054
3600   66847171   13.0066
4096   75846319   13.0730
4608   85111207   15.4038
4800   88579669   17.4774
5184   95507747   17.5136
5376   98967641   19.3875
5600  103000823   20.1571
5760  105879517   20.5611
5832  107174381   20.6808
6144  112781477   21.5606
6272  115080019   22.1675
6912  126558077   23.4366
7168  131142761   25.0548
7200  131715607   25.6965
8192  149447533   26.9798
[/QUOTE]
[QUOTE]Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
CPU speed: 3491.95 MHz, 2 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 3 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Timing FFTs using 2 threads on 2 cores.
Best time for 1792K FFT length: 4.751 ms., avg: 4.932 ms.
Best time for 1920K FFT length: 5.255 ms., avg: 5.418 ms.
Best time for 2016K FFT length: 5.469 ms., avg: 5.605 ms.
Best time for 2048K FFT length: 5.513 ms., avg: 5.574 ms.
Best time for 2304K FFT length: 6.246 ms., avg: 6.298 ms.
Best time for 2400K FFT length: 6.436 ms., avg: 6.484 ms.
Best time for 2560K FFT length: 6.825 ms., avg: 6.991 ms.
Best time for 2688K FFT length: 7.409 ms., avg: 7.502 ms.
Best time for 2880K FFT length: 7.736 ms., avg: 7.801 ms.
Best time for 3072K FFT length: 8.351 ms., avg: 8.448 ms.
Best time for 3200K FFT length: 8.811 ms., avg: 8.946 ms.
Best time for 3360K FFT length: 9.705 ms., avg: 9.879 ms.
Best time for 3456K FFT length: 9.940 ms., avg: 10.082 ms.
Best time for 3584K FFT length: 10.128 ms., avg: 10.220 ms.
Best time for 3840K FFT length: 10.919 ms., avg: 11.034 ms.
Best time for 4096K FFT length: 13.515 ms., avg: 13.819 ms.
Best time for 4480K FFT length: 12.547 ms., avg: 12.789 ms.
Best time for 4608K FFT length: 12.952 ms., avg: 13.141 ms.
Best time for 4800K FFT length: 13.462 ms., avg: 13.636 ms.
Best time for 5120K FFT length: 14.454 ms., avg: 14.626 ms.
Best time for 5376K FFT length: 15.308 ms., avg: 15.433 ms.
Best time for 5760K FFT length: 16.797 ms., avg: 16.957 ms.
Best time for 6144K FFT length: 17.702 ms., avg: 17.988 ms.
Best time for 6400K FFT length: 18.452 ms., avg: 18.641 ms.
Best time for 6720K FFT length: 20.265 ms., avg: 20.463 ms.
Best time for 6912K FFT length: 20.733 ms., avg: 21.296 ms.
Best time for 7168K FFT length: 22.067 ms., avg: 24.565 ms.
Best time for 7680K FFT length: 22.115 ms., avg: 22.333 ms.
Best time for 8064K FFT length: 24.796 ms., avg: 25.473 ms.
Best time for 8192K FFT length: 26.976 ms., avg: 28.400 ms.
[/QUOTE]
Problem compiling CUDALucas for 1080 Ti under Linux
Hello!
I'm not sure this is the correct place to ask, but I'm running into a problem while trying to compile the [URL="https://sourceforge.net/p/cudalucas/code/HEAD/tree/trunk/"]latest[/URL] CUDALucas under Linux:
[QUOTE]$ make
/usr/local/cuda/bin/nvcc -O1 --generate-code arch=compute_61,code=sm_61 --compiler-options=-Wall -I/usr/local/cuda/include -c CUDALucas.cu
CUDALucas.cu(756): error: identifier "nvmlInit" is undefined
CUDALucas.cu(757): error: identifier "nvmlDevice_t" is undefined
CUDALucas.cu(758): error: identifier "nvmlDeviceGetHandleByIndex" is undefined
CUDALucas.cu(759): error: identifier "nvmlDeviceGetUUID" is undefined
CUDALucas.cu(760): error: identifier "nvmlShutdown" is undefined
[/QUOTE]
It's the same if I try different compute/sm versions. I have CUDA Toolkit 9.1 installed. Any suggestions, please?