#782
"Svein Johansen", May 2013, Norway
Code:
c:\mfakto-0.13pre5>mfakto-0.13pre5-pi-win64.exe --perftest
Runtime options
  Inifile              mfakto.ini
  Verbosity            1
  SieveOnGPU           yes
  GPUSievePrimes       82486
  GPUSieveSize         64Mi bits
  GPUSieveProcessSize  16Ki bits
  WorkFile             worktodo.txt
  ResultsFile          results.txt
  Checkpoints          enabled
  CheckpointDelay      300s
  Stages               enabled
  StopAfterFactor      class
  PrintMode            compact
  V5UserID             none
  ComputerID           none
  TimeStampInResults   yes
  VectorSize           4
  GPUType              VLIW4
  SmallExp             no
Select device - Get device info - Compiling kernels.

Perftest

Generate list of the first 10^6 primes: 6913.06 ms

1. Sieve-Init (once per class, 960 times per test, avg. for 10 iterations)
   Init_class(sieveprimes=   5000):    1.40 ms
   Init_class(sieveprimes=  20000):    6.32 ms
   Init_class(sieveprimes=  80000):   28.83 ms
   Init_class(sieveprimes= 200000):   78.58 ms
   Init_class(sieveprimes= 500000):  213.20 ms
   Init_class(sieveprimes=1000000):  451.98 ms

2. Sieve (M/s)
   Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.
   SievePrimes:  256   396   611   945  1460  2257  3487  5389  8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
   24 kiB:     264.8 241.2 220.6 202.2 184.8 168.9 155.0 141.8 128.8 116.1 105.0  90.4  75.2  61.7   50.4   40.5   31.8   24.1   17.1    10.0
   24 kiB:     264.1 241.2 220.7 202.1 184.8 169.4 155.3 142.2 128.6 116.1 104.5  90.6  75.3  61.7   50.4   40.4   31.6   23.8   16.2    10.1
   24 kiB:     263.1 240.3 207.6 200.7 183.8 169.2 154.9 141.9 128.8 115.9 104.7  90.5  75.0  61.7   50.5   40.5   31.8   24.0   17.2    10.0
   Best SieveSizeLimit for
   SievePrimes:  256   396   611   945  1460  2257  3487  5389  8328 12871 19890 30738 47503 73411 113449 175323 270944 418716 647083 1000000
   at kiB:        24    24    24    24    24    24    24    24    24    24    24    24    24    24     24     24     24     24     24      24
   max M/s:    264.8 241.2 220.7 202.2 184.8 169.4 155.3 142.2 128.8 116.1 105.0  90.6  75.3  61.7   50.5   40.5   31.8   24.1   17.2    10.1
   Survivors: 36.36% 34.06% 32.05% 30.28% 28.69% 27.27% 26.00% 24.84% 23.79% 22.82% 21.94% 21.12% 20.36% 19.67% 19.01% 18.40% 17.82% 17.29% 16.80% 16.32%

3. Memory copy to GPU (blocks of 8388608 bytes)
   Standard copy, standard queue: 800 MB in 244.5 ms (3430.4 MB/s) (real)
   Standard copy, profiled queue: 800 MB in 244.4 ms (3432.1 MB/s) (real)
                                  800 MB in   0.0 ms (103409861.9 MB/s) (profiled data)
                                    8 MB in   0.0 ms (1.$ MB/s) (profiled data, peak)
   Standard copy, two queues:     800 MB in 194.5 ms (4312.0 MB/s) (real)

4. mfakto_cl_63 kernel: soon
5. mfakto_cl_71 kernel: soon
6. barrett_79 kernel: soon
7. barrett_92 kernel: soon

c:\mfakto-0.13pre5>
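The Survivors row falls off only slowly as SievePrimes grows, which matches theory: mfakto's 4620 classes already exclude candidates divisible by 2, 3, 5, 7 and 11, so the GPU sieve effectively starts at p = 13, and the expected survivor fraction is the product of (1 − 1/p) over the sieve primes. A minimal Python sketch of that check (my own back-of-envelope, not mfakto code; the starting prime 13 is my assumption from the class count):

```python
# Sketch: expected survivor fraction after sieving with the first n primes,
# assuming (as mfakto's 4620 = 2^2 * 3 * 5 * 7 * 11 classes suggest) that
# 2, 3, 5, 7 and 11 are already excluded, so the sieve starts at p = 13.

def small_primes(limit):
    """Plain sieve of Eratosthenes up to `limit`."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            flags[p * p::p] = bytearray(len(flags[p * p::p]))
    return [p for p in range(2, limit + 1) if flags[p]]

def survivor_fraction(n_sieve_primes, start=13, limit=30000):
    """Product of (1 - 1/p) over the first n sieve primes >= `start`."""
    frac = 1.0
    count = 0
    for p in small_primes(limit):
        if p < start:
            continue
        frac *= 1.0 - 1.0 / p
        count += 1
        if count == n_sieve_primes:
            return frac
    raise ValueError("increase limit")

# SievePrimes=256 -> ~0.364, close to the 36.36% the perftest reports;
# SievePrimes=2257 -> ~0.27, matching the 27.27% column.
print(survivor_fraction(256))
print(survivor_fraction(2257))
```

By Mertens' theorem this product only shrinks like 1/ln(p_n), which is why doubling SievePrimes removes so few additional candidates.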
#783
"Svein Johansen", May 2013, Norway
I see that the output is very different from mfaktc's; I guess that will get cleaned up and maybe get a similar kind of output?
Code:
CalcBitToClear 82688 primes: 250 us (330.752 M/s)
sieve using 262144 threads: 10.34 ms (25.3524 M/s), 6490.22 M FCs/s sieved
TF using 1048576 threads: 48.4379 ms (21.6478 M/s), 1385.46 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 260.445 us (317.487 M/s)
sieve using 262144 threads: 10.3958 ms (25.2164 M/s), 6455.4 M FCs/s sieved
TF using 1048576 threads: 48.4279 ms (21.6523 M/s), 1385.75 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 240.778 us (343.42 M/s)
sieve using 262144 threads: 10.4173 ms (25.1642 M/s), 6442.04 M FCs/s sieved
TF using 1048576 threads: 48.4308 ms (21.651 M/s), 1385.67 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 266.556 us (310.209 M/s)
sieve using 262144 threads: 10.2207 ms (25.6484 M/s), 6566 M FCs/s sieved
TF using 1048576 threads: 48.4203 ms (21.6557 M/s), 1385.96 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 268.556 us (307.899 M/s)
sieve using 262144 threads: 10.3352 ms (25.3641 M/s), 6493.22 M FCs/s sieved
TF using 1048576 threads: 48.4021 ms (21.6638 M/s), 1386.49 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 289.445 us (285.678 M/s)
sieve using 262144 threads: 10.2964 ms (25.4597 M/s), 6517.67 M FCs/s sieved
TF using 1048576 threads: 48.4094 ms (21.6606 M/s), 1386.28 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 273 us (302.886 M/s)
sieve using 262144 threads: 10.4549 ms (25.0738 M/s), 6418.9 M FCs/s sieved
TF using 1048576 threads: 48.4148 ms (21.6582 M/s), 1386.12 M FCs/s TF'd (incl. sieving)

Using Factor=218687F2FF894FA83B3425A0F89061D5,77115127,70,71
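If I read these numbers right (my interpretation of the output, not documented semantics), each block covers one 64Mi-bit sieve chunk: the "M/s" figure is threads divided by elapsed time, and the "M FCs/s" figure is the 64Mi candidate bits divided by the same elapsed time (the TF time already including sieving). A quick Python sanity check against the first block:

```python
# Sanity-check of the first block above, assuming (my reading, not
# documented semantics) that "M/s" = threads / time and
# "M FCs/s" = GPUSieveSize bits / time, with GPUSieveSize = 64Mi bits.

SIEVE_BITS = 64 * 1024 * 1024  # 64Mi factor candidates per sieve block

# sieve kernel: 262144 threads in 10.34 ms
sieve_ms = 10.34
print(262144 / (sieve_ms / 1e3) / 1e6)      # ~25.35 M/s  (reported 25.3524)
print(SIEVE_BITS / (sieve_ms / 1e3) / 1e6)  # ~6490 M FCs/s (reported 6490.22)

# TF kernel: 1048576 threads in 48.4379 ms (time includes sieving)
tf_ms = 48.4379
print(1048576 / (tf_ms / 1e3) / 1e6)        # ~21.65 M/s  (reported 21.6478)
print(SIEVE_BITS / (tf_ms / 1e3) / 1e6)     # ~1385 M FCs/s (reported 1385.46)
```

All four derived rates land within rounding of the printed figures, which supports that reading.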
#784
"Svein Johansen", May 2013, Norway
-st run: log attached, selftest passed.
I will start 500 tests now, let them run to completion, and then double-check those 500 with mfaktc on 2 Titans afterwards, then send you the results. Great work on the new version; it seems to have huge throughput on the AMD cards.
#785
"Mr. Meeseeks", Jan 2012, California, USA
They should be the same. What Bdot wants, I think, is -st2.
#786
Nov 2010, Germany
Try the binary without -pi- in its name, and the output should look familiar. And yes, throughput on high-end cards greatly benefits from GPU sieving. It should not be too long until I've finished my work for the 0.13 release. Then it would be good if every user sent the output of one of the runs to James to allow more accurate updates of this page. Maybe I'll put this as a requirement into the license.
#787
"Svein Johansen", May 2013, Norway
-st: passed. -st2: passed as well; it wrote 201 MB of output into the logfile, hehe, but it passed all tests, and it took many hours, around 10. I started it a few hours after the first -st test I reported here, and it finished some time ago. It would be great to have a timestamp at the start and end of both -st and -st2, and a calculated total runtime, as this software really uses the card to its full potential. Great job, Bdot. I hope I can improve CUDALucas during the summer to the same extent; however, it's going to take time :) I'm really impressed.
#788
"Svein Johansen", May 2013, Norway
#789
May 2005
#790
"Mr. Meeseeks", Jan 2012, California, USA
#791
May 2005
#792
Nov 2010, Germany
This is within 5% of the theoretical #-of-cores × clockspeed comparison.
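The cores × clockspeed comparison can be sketched in a couple of lines; the card figures below are made up purely for illustration:

```python
# Hypothetical sketch of the "#-of-cores x clockspeed" scaling mentioned
# above: predict card B's throughput from card A's measured rate.
# All core counts, clocks and rates here are invented for illustration.

def predicted_throughput(measured_a, cores_a, mhz_a, cores_b, mhz_b):
    """Scale a measured throughput by the ratio of cores * clock."""
    return measured_a * (cores_b * mhz_b) / (cores_a * mhz_a)

# e.g. card A: 1536 cores @ 900 MHz measuring 500 M/s,
#      card B: 2048 cores @ 1000 MHz
predicted = predicted_throughput(500.0, 1536, 900, 2048, 1000)
print(round(predicted, 1))  # ~740.7 M/s; "within 5%" would put the real
                            # measurement between roughly 704 and 778 M/s
```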