mersenneforum.org

CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

TheJudger 2017-12-06 21:01

Fast ist phun!
 
CUDA 9.0, CUDA driver 384.98, CUDALucas 2.05.1 (SVN rev. 99)

Benchmark FFT sizes './CUDALucas -cufftbench 2048 32768 20'
[CODE]Device Tesla V100-PCIE-16GB
Compatibility 7.0
clockRate (MHz) 1380
memClockRate (MHz) 877

fft max exp ms/iter
2048 38492887 0.3978
2187 41047411 0.5123
2304 43194913 0.5183
2401 44973503 0.5293
2500 46787207 0.5429
2592 48471289 0.5460
2744 51250889 0.5997
3136 58404433 0.6361
3200 59570449 0.6514
3456 64229677 0.7015
4096 75846319 0.7591
4375 80897867 0.9595
4608 85111207 0.9649
5184 95507747 1.0124
5488 100984691 1.1235
6272 115080019 1.2037
6400 117377567 1.2445
6561 120266023 1.3328
6912 126558077 1.3391
8000 146019329 1.5105
8192 149447533 1.5230
8575 156280961 1.8316
10368 188188471 1.9362
10976 198980129 2.1451
11907 215480183 2.3303
12544 226753511 2.3331
12800 231280639 2.3830
13824 249369863 2.5663
16384 294471259 2.9531
16807 301908293 3.3334
16875 303103441 3.5138
18225 326810201 3.7274
20736 370806323 3.7880
21952 392070229 4.2109
25088 446794913 4.5286
27783 493705637 5.5610
32000 566915989 5.8087
32768 580225813 5.8343
[/CODE]


And timing a 100M-digit exponent with './CUDALucas 332192879'
[CODE]Starting M332192879 fft length = 20736K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Dec 06 21:52:36 | M332192879 10000 0xa19043095e213f4c | 20736K 0.01758 3.8055 38.05s | 14:15:09:00 0.00% |
| Dec 06 21:53:14 | M332192879 20000 0xcb7bc66ac81b24be | 20736K 0.01709 3.8051 38.05s | 14:15:07:16 0.00% |
| Dec 06 21:53:52 | M332192879 30000 0x38e4cc517de8fda3 | 20736K 0.01758 3.8051 38.05s | 14:15:06:19 0.00% |
[/CODE]
Power consumption (board power as reported by 'nvidia-smi') is around 145 W while running the LL test of M332192879.
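How the benchmark table relates to the "fft length = 20736K" line in the log can be sketched like this. It is only an illustration: the rule used here (take the fastest benchmarked length whose "max exp" is at least the exponent) is an assumption, the name `pick_fft` is hypothetical, and none of this is taken from the CUDALucas source. The rows are copied from the V100 table above.

```python
# Rows (FFT length in K, max exponent, ms/iter) copied from the V100 table above.
V100_ROWS = [
    (16384, 294471259, 2.9531),
    (16807, 301908293, 3.3334),
    (16875, 303103441, 3.5138),
    (18225, 326810201, 3.7274),
    (20736, 370806323, 3.7880),
    (21952, 392070229, 4.2109),
    (25088, 446794913, 4.5286),
]


def pick_fft(exponent, rows):
    """Return the fastest row whose max exponent can hold `exponent`.

    Assumed rule, for illustration only; not taken from the CUDALucas source.
    """
    usable = [row for row in rows if row[1] >= exponent]
    if not usable:
        raise ValueError("no benchmarked FFT length is large enough")
    return min(usable, key=lambda row: row[2])


fft_k, max_exp, ms_per_iter = pick_fft(332192879, V100_ROWS)
print(f"{fft_k}K at {ms_per_iter} ms/iter")
```

With these rows the sketch lands on 20736K, the same length the log reports; 18225K is ruled out because its max exponent (326810201) is below 332192879.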

Oliver

Luis 2017-12-07 17:07

So roughly 351 h × 145 W ≈ 50.9 kWh is the energy consumed for the whole test.
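The arithmetic behind that estimate, as a rough sketch. It assumes the LL test needs about p - 2 iterations, that the 3.8051 ms/iter rate from the V100 log holds for the whole run, and that the board draws a constant 145 W. The function names are illustrative only.

```python
def ll_runtime_hours(exponent, ms_per_iter):
    """Estimated wall-clock hours for a Lucas-Lehmer test of M(exponent).

    Assumes exponent - 2 squarings at a constant time per iteration.
    """
    iterations = exponent - 2
    return iterations * ms_per_iter / 1000.0 / 3600.0


def energy_kwh(hours, watts):
    """Energy in kWh at a constant power draw."""
    return hours * watts / 1000.0


hours = ll_runtime_hours(332192879, 3.8051)  # V100 rate from the log above
print(f"{hours:.0f} h, {energy_kwh(hours, 145):.1f} kWh")
```

The runtime comes out at about 351 hours, which agrees with the "14:15:..." (14 days 15 hours) ETA column in the log.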

kriesel 2017-12-07 17:08

multiple instances effect on performance (win some lose some)
 
To follow up on [URL]http://www.mersenneforum.org/showpost.php?p=472866&postcount=2649[/URL]: I have been testing several combinations of applications (among CUDALucas, CUDAPm1, Mfaktc) running simultaneously on several GPU models. Preliminary results, per GPU and application combination, range from a throughput reduction of a few percent to a throughput increase of over thirteen percent. Throughput is computed as the sum, over the instances running simultaneously on one GPU, of each instance's rate of progress divided by the rate benchmarked when that application was the only one running on that GPU. (This approach treats all run types, LL, P-1, and trial factoring, as equally valuable; what is valued is a GPU-day of that model.) Estimated standard deviations so far are of order 0.2% to 0.5% for the cases I have evaluated, so the observed gains of 1% to 13% are statistically significant. A quick spot check of one benchmark was repeatable to within 0.2%. The memory requirement is typically a small fraction of total GPU RAM.
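The throughput figure described above can be written down in a few lines. This is a minimal sketch of the stated definition; the function name and the example numbers are made up for illustration and are not measurements from this thread.

```python
def combined_throughput(instances):
    """Sum of (rate while sharing the GPU) / (rate when running alone).

    `instances` is a list of (shared_rate, solo_rate) pairs, one per
    application instance; both rates in a pair must use the same unit.
    A result above 1.0 means running the instances together gained throughput.
    """
    return sum(shared / solo for shared, solo in instances)


# Made-up numbers for illustration only: an LL instance at 55% of its solo
# speed plus a trial-factoring instance at 52% of its solo speed.
example = [(0.55, 1.0), (52.0, 100.0)]
print(f"{combined_throughput(example):.2f}")
```

Because each term is a ratio against the same application's solo rate, the units of the individual rates (iterations/s, GHz-days/day, and so on) cancel, which is what lets LL, P-1, and trial factoring be added together.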

TheJudger 2017-12-23 19:44

CUDA 9.1, CUDA driver 387.34, CUDALucas 2.05.1 (SVN rev. 99)

Updated P100-16GiB benchmark (the older CUDA 8 benchmarks are [URL="http://mersenneforum.org/showpost.php?p=452751&postcount=2561"]here[/URL] and [URL="http://mersenneforum.org/showpost.php?p=452834&postcount=2566"]here[/URL]).

Benchmark FFT sizes './CUDALucas -cufftbench 2048 32768 20'
[CODE]Device Tesla P100-PCIE-16GB
Compatibility 6.0
clockRate (MHz) 1328
memClockRate (MHz) 715

fft max exp ms/iter
2048 38492887 0.5972
2187 41047411 0.7118
2304 43194913 0.7301
2401 44973503 0.7656
2592 48471289 0.7971
2744 51250889 0.8863
3136 58404433 0.9482
3200 59570449 0.9733
3456 64229677 1.0467
3584 66556463 1.1321
4096 75846319 1.1423
4608 85111207 1.4124
5184 95507747 1.4988
5488 100984691 1.6450
6272 115080019 1.8127
6400 117377567 1.8730
6561 120266023 1.9556
6912 126558077 2.0301
7776 142017539 2.2474
8192 149447533 2.2688
8575 156280961 2.6593
9261 168504209 2.8483
10368 188188471 2.9439
10976 198980129 3.1604
12544 226753511 3.5621
12800 231280639 3.6567
13824 249369863 3.9843
15552 279831199 4.4018
16384 294471259 4.5018
16807 301908293 5.1300
16875 303103441 5.5609
18225 326810201 5.7337
20736 370806323 5.8287
21952 392070229 6.2511
25088 446794913 7.0258
27783 493705637 8.2696
31104 551379091 8.7884
32000 566915989 9.0541
32768 580225813 9.0641[/CODE]

And timing a 100M-digit exponent with './CUDALucas 332192879'
[CODE]Starting M332192879 fft length = 20736K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Dec 23 20:37:12 | M332192879 10000 0xa19043095e213f4c | 20736K 0.01758 5.8218 58.21s | 22:09:12:09 0.00% |
| Dec 23 20:38:10 | M332192879 20000 0xcb7bc66ac81b24be | 20736K 0.01709 5.8218 58.21s | 22:09:11:04 0.00% |
| Dec 23 20:39:08 | M332192879 30000 0x38e4cc517de8fda3 | 20736K 0.01855 5.8249 58.24s | 22:09:15:49 0.00% |
[/CODE]

Oliver

ET_ 2017-12-23 23:40

[QUOTE=TheJudger;474726]CUDA 9.1, CUDA driver 387.34, CUDALucas 2.05.1 (SVN rev. 99)

And timing 100M exponent './CUDALucas 332192879'
[CODE]Starting M332192879 fft length = 20736K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Dec 23 20:37:12 | M332192879 10000 0xa19043095e213f4c | 20736K 0.01758 5.8218 58.21s | 22:09:12:09 0.00% |
| Dec 23 20:38:10 | M332192879 20000 0xcb7bc66ac81b24be | 20736K 0.01709 5.8218 58.21s | 22:09:11:04 0.00% |
| Dec 23 20:39:08 | M332192879 30000 0x38e4cc517de8fda3 | 20736K 0.01855 5.8249 58.24s | 22:09:15:49 0.00% |
[/CODE]

Oliver[/QUOTE]

22 days for a 100,000,000-digit number?

TheJudger 2017-12-24 00:01

Hi Luigi,

[QUOTE=ET_;474744]22 days for a 100,000,000-digit number?[/QUOTE]

yes, but look at [URL="http://mersenneforum.org/showpost.php?p=473281&postcount=2652"]this[/URL] :smile:

Oliver

kriesel 2018-01-08 03:41

Updated bug and wish list
 
1 Attachment(s)
[QUOTE=kriesel;465505]Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.[/QUOTE]

After a few months and holidays, here's an updated version.

kriesel 2018-01-08 03:52

CUDALucas runtime scaling
 
1 Attachment(s)
The attachment is based on actual timed exponents on a GTX480 clocked at 701 MHz. Times for a GTX1070 are about 70% of those for the GTX480. That is, what takes the 480 ten days takes the 1070 about a week.

LaurV 2018-01-08 07:24

[QUOTE=kriesel;476927]After a few months and holidays, here's an updated version.[/QUOTE]
I had a quick read. Some points I didn't understand (I need more time to read them in depth; I am in a hurry now), but point 10 does not seem to be true. What you see is an effect of the save file storing the time when the test was started. The program computes the ETA as "how long you have worked on it" divided by "how many iterations you did", multiplied by "how many iterations you still have", and that gives a date in the future. You will experience the same effect if you interrupt your work for a while (days) and resume on the same computer. I remember a past discussion where we argued whether the interruption time should be considered (i.e. averaged into the calculation), and it seems to me that it is better to include it. No matter if an iteration takes one picosecond: if you spent one week doing half of the test (for whatever reason, including interruptions), it would look normal to me that you will spend another week on the other half. This way, your new computer doesn't know that the time per iteration is now shorter, but the ETA will "catch up" soon, as the iteration count progresses.

The other way, displaying the ETA as the "number of remaining iterations" multiplied by the "iteration time", gives an immediate result when you move the work to a faster toy, but the ETA will be very, VERY jumpy, because the iteration time varies a lot with how busy your computer is. Some of us use the computers for other activities too. So it is not "reliable". Some kind of averaging with past values (either SMA or EMA) needs to be done to avoid the jumpy ETA, and you will still see "no effect" when you move the work until the main period of the MA (moving average) has passed. Of course, it would be nice to have an option in the ini file to choose an averaging period: for example, something like 255 would be the current method and something like 0 would be "no averaging" (jumpy). But I feel we request too much already.
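The three ways of estimating the ETA discussed above can be compared in a short sketch. Everything here is an illustration under stated assumptions: the function and class names are hypothetical, the `period` parameter and the conventional 2/(period + 1) EMA smoothing factor are my choices, and none of it corresponds to an existing CUDALucas option or to its source code.

```python
def eta_cumulative(done, total, elapsed_s):
    """Remaining seconds from the whole-run average (elapsed / done)."""
    return (total - done) * elapsed_s / done


def eta_instant(done, total, last_s_per_iter):
    """Remaining seconds from the latest iteration time only (jumpy)."""
    return (total - done) * last_s_per_iter


class EmaEta:
    """Remaining seconds from an exponential moving average of iteration time."""

    def __init__(self, period):
        # period <= 0 means "no averaging"; this mapping is an assumption.
        self.alpha = 1.0 if period <= 0 else 2.0 / (period + 1)
        self.avg = None

    def update(self, s_per_iter):
        if self.avg is None:
            self.avg = s_per_iter
        else:
            self.avg += self.alpha * (s_per_iter - self.avg)
        return self.avg

    def eta(self, done, total):
        return (total - done) * self.avg


WEEK = 7 * 24 * 3600
total, done = 1_000_000, 500_000

# Half the test took a week (interruptions included), so the whole-run
# average predicts one more week regardless of the new machine's speed.
print(eta_cumulative(done, total, WEEK) / WEEK)

# After moving to a machine that needs 0.3 s per iteration, the
# instantaneous estimate reacts at once; the EMA drifts toward it.
ema = EmaEta(period=3)
ema.update(WEEK / done)  # old average pace, about 1.21 s/iter
for _ in range(10):
    ema.update(0.3)
print(eta_instant(done, total, 0.3) / WEEK, ema.eta(done, total) / WEEK)
```

A longer `period` makes the EMA behave more like the whole-run average (smooth but slow to notice a faster machine), while a short one approaches the instantaneous estimate.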

wfgarnett3 2018-01-28 12:35

EVGA GeForce GTX 1050 (2GB GDDR5)
 
CUDALucas2.05.1-CUDA8.0-Windows-x64.exe

GeForce GTX 1050 CUDALucas benchmarks are below, followed by Intel i3-4150 Prime95 benchmarks for comparison.

[QUOTE]Device GeForce GTX 1050
Compatibility 6.1
clockRate (MHz) 1531
memClockRate (MHz) 3504

fft max exp ms/iter
1024 19535569 3.1435
1080 20580341 3.6334
1134 21586693 3.7268
1152 21921901 3.7988
1296 24599717 4.0779
1323 25101101 4.5591
1350 25602229 4.7156
1440 27271147 4.8550
1458 27604673 5.0420
1568 29640913 5.0514
1600 30232693 5.1335
1728 32597297 5.5383
1792 33778141 6.0359
2048 38492887 6.2727
2304 43194913 7.2045
2352 44075249 8.4636
2592 48471289 8.4958
2688 50227213 9.5387
2700 50446621 9.9943
2916 54392209 10.0159
3024 56362639 10.4233
3136 58404433 10.4462
3200 59570449 11.2744
3240 60298969 11.5216
3402 63247511 11.9492
3584 66556463 12.3054
3600 66847171 13.0066
4096 75846319 13.0730
4608 85111207 15.4038
4800 88579669 17.4774
5184 95507747 17.5136
5376 98967641 19.3875
5600 103000823 20.1571
5760 105879517 20.5611
5832 107174381 20.6808
6144 112781477 21.5606
6272 115080019 22.1675
6912 126558077 23.4366
7168 131142761 25.0548
7200 131715607 25.6965
8192 149447533 26.9798[/QUOTE]


[QUOTE]Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
CPU speed: 3491.95 MHz, 2 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 3 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64

Timing FFTs using 2 threads on 2 cores.
Best time for 1792K FFT length: 4.751 ms., avg: 4.932 ms.
Best time for 1920K FFT length: 5.255 ms., avg: 5.418 ms.
Best time for 2016K FFT length: 5.469 ms., avg: 5.605 ms.
Best time for 2048K FFT length: 5.513 ms., avg: 5.574 ms.
Best time for 2304K FFT length: 6.246 ms., avg: 6.298 ms.
Best time for 2400K FFT length: 6.436 ms., avg: 6.484 ms.
Best time for 2560K FFT length: 6.825 ms., avg: 6.991 ms.
Best time for 2688K FFT length: 7.409 ms., avg: 7.502 ms.
Best time for 2880K FFT length: 7.736 ms., avg: 7.801 ms.
Best time for 3072K FFT length: 8.351 ms., avg: 8.448 ms.
Best time for 3200K FFT length: 8.811 ms., avg: 8.946 ms.
Best time for 3360K FFT length: 9.705 ms., avg: 9.879 ms.
Best time for 3456K FFT length: 9.940 ms., avg: 10.082 ms.
Best time for 3584K FFT length: 10.128 ms., avg: 10.220 ms.
Best time for 3840K FFT length: 10.919 ms., avg: 11.034 ms.
Best time for 4096K FFT length: 13.515 ms., avg: 13.819 ms.
Best time for 4480K FFT length: 12.547 ms., avg: 12.789 ms.
Best time for 4608K FFT length: 12.952 ms., avg: 13.141 ms.
Best time for 4800K FFT length: 13.462 ms., avg: 13.636 ms.
Best time for 5120K FFT length: 14.454 ms., avg: 14.626 ms.
Best time for 5376K FFT length: 15.308 ms., avg: 15.433 ms.
Best time for 5760K FFT length: 16.797 ms., avg: 16.957 ms.
Best time for 6144K FFT length: 17.702 ms., avg: 17.988 ms.
Best time for 6400K FFT length: 18.452 ms., avg: 18.641 ms.
Best time for 6720K FFT length: 20.265 ms., avg: 20.463 ms.
Best time for 6912K FFT length: 20.733 ms., avg: 21.296 ms.
Best time for 7168K FFT length: 22.067 ms., avg: 24.565 ms.
Best time for 7680K FFT length: 22.115 ms., avg: 22.333 ms.
Best time for 8064K FFT length: 24.796 ms., avg: 25.473 ms.
Best time for 8192K FFT length: 26.976 ms., avg: 28.400 ms.
[/QUOTE]
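For a quick side-by-side of the two tables, here is a small sketch that computes the GPU/CPU time ratio at FFT lengths present in both. The values are copied from the tables above (the CPU figures are the "Best time" column, 2 threads on 2 cores); the function name is illustrative, and the comparison ignores that the two programs are different implementations.

```python
# ms/iter at FFT lengths (in K) that appear in both tables above.
GTX1050_MS = {2048: 6.2727, 2304: 7.2045, 4096: 13.0730, 8192: 26.9798}
I3_4150_MS = {2048: 5.513, 2304: 6.246, 4096: 13.515, 8192: 26.976}


def time_ratios(gpu_ms, cpu_ms):
    """Return {fft_k: gpu_time / cpu_time} for lengths present in both."""
    shared = sorted(gpu_ms.keys() & cpu_ms.keys())
    return {fft_k: gpu_ms[fft_k] / cpu_ms[fft_k] for fft_k in shared}


for fft_k, ratio in time_ratios(GTX1050_MS, I3_4150_MS).items():
    # A ratio above 1 means the GPU needed more time per iteration.
    print(f"{fft_k}K: GPU/CPU time ratio {ratio:.2f}")
```

At these four lengths the two devices stay within roughly 15% of each other.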

Lexicographer 2018-04-04 17:23

Problem compiling CUDALucas for 1080 Ti under Linux
 
Hello!

I'm not sure this is the correct place to ask, but I'm running into a problem while trying to compile the [URL="https://sourceforge.net/p/cudalucas/code/HEAD/tree/trunk/"]latest[/URL] CUDALucas under Linux.

The problem is:

[QUOTE]$ make
/usr/local/cuda/bin/nvcc -O1 --generate-code arch=compute_61,code=sm_61 --compiler-options=-Wall -I/usr/local/cuda/include -c CUDALucas.cu
CUDALucas.cu(756): error: identifier "nvmlInit" is undefined

CUDALucas.cu(757): error: identifier "nvmlDevice_t" is undefined

CUDALucas.cu(758): error: identifier "nvmlDeviceGetHandleByIndex" is undefined

CUDALucas.cu(759): error: identifier "nvmlDeviceGetUUID" is undefined

CUDALucas.cu(760): error: identifier "nvmlShutdown" is undefined
[/QUOTE]
It's the same if I try different versions of compute/sm.

I have CUDA Toolkit 9.1 installed.

Any suggestions, please?

