mersenneforum.org

mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

Manpowre 2013-05-18 09:30

6970
 
c:\mfakto-0.13pre5>mfakto-0.13pre5-pi-win64.exe --perftest

Runtime options
Inifile mfakto.ini
Verbosity 1
SieveOnGPU yes
GPUSievePrimes 82486
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
TimeStampInResults yes
VectorSize 4
GPUType VLIW4
SmallExp no
Select device - Get device info - Compiling kernels.


Perftest

Generate list of the first 10^6 primes: 6913.06 ms

1. Sieve-Init (once per class, 960 times per test, avg. for 10 iterations)
Init_class(sieveprimes= 5000): 1.40 ms
Init_class(sieveprimes= 20000): 6.32 ms
Init_class(sieveprimes= 80000): 28.83 ms
Init_class(sieveprimes= 200000): 78.58 ms
Init_class(sieveprimes= 500000): 213.20 ms
Init_class(sieveprimes=1000000): 451.98 ms

2. Sieve (M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.

SievePrimes:         256     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
SieveSizeLimit
  24 kiB           264.8   241.2   220.6   202.2   184.8   168.9   155.0   141.8   128.8   116.1   105.0    90.4    75.2    61.7    50.4    40.5    31.8    24.1    17.1    10.0
  24 kiB           264.1   241.2   220.7   202.1   184.8   169.4   155.3   142.2   128.6   116.1   104.5    90.6    75.3    61.7    50.4    40.4    31.6    23.8    16.2    10.1
  24 kiB           263.1   240.3   207.6   200.7   183.8   169.2   154.9   141.9   128.8   115.9   104.7    90.5    75.0    61.7    50.5    40.5    31.8    24.0    17.2    10.0
Best SieveSizeLimit for
SievePrimes:         256     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
at kiB:               24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24
max M/s:           264.8   241.2   220.7   202.2   184.8   169.4   155.3   142.2   128.8   116.1   105.0    90.6    75.3    61.7    50.5    40.5    31.8    24.1    17.2    10.1
Survivors:        36.36%  34.06%  32.05%  30.28%  28.69%  27.27%  26.00%  24.84%  23.79%  22.82%  21.94%  21.12%  20.36%  19.67%  19.01%  18.40%  17.82%  17.29%  16.80%  16.32%


3. Memory copy to GPU (blocks of 8388608 bytes)

Standard copy, standard queue:
800 MB in 244.5 ms (3430.4 MB/s) (real)

Standard copy, profiled queue:
800 MB in 244.4 ms (3432.1 MB/s) (real)
800 MB in 0.0 ms (103409861.9 MB/s) (profiled data)
8 MB in 0.0 ms ( 1.$ MB/s) (profiled data, peak)

Standard copy, two queues:
800 MB in 194.5 ms (4312.0 MB/s) (real)

4. mfakto_cl_63 kernel
soon
5. mfakto_cl_71 kernel
soon
6. barrett_79 kernel
soon
7. barrett_92 kernel
soon

c:\mfakto-0.13pre5>
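
A rough cross-check of the Survivors row in the perftest output above, as a sketch and not mfakto's actual code. It assumes (this is not stated in the thread) that the "960 times per test" classes come from 4620 = 4*3*5*7*11 classes, so that the sieve proper starts at the prime 13, and that each sieve prime p removes a fraction 1/p of the remaining candidates. The function names are made up for illustration.

```python
def primes_from_13(count):
    """Return the first `count` primes >= 13 using a simple sieve of Eratosthenes."""
    limit = 64
    while True:
        is_prime = bytearray([1]) * (limit + 1)
        is_prime[0:2] = b"\x00\x00"
        for n in range(2, int(limit ** 0.5) + 1):
            if is_prime[n]:
                # Clear all multiples of n starting at n*n.
                is_prime[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
        found = [n for n in range(13, limit + 1) if is_prime[n]]
        if len(found) >= count:
            return found[:count]
        limit *= 2  # not enough primes yet: retry with a larger bound


def survivor_fraction(sieve_primes):
    """Expected fraction of candidates left after sieving with `sieve_primes` primes.

    Assumption: sieving starts at 13 and each prime p removes 1/p of what is left.
    """
    fraction = 1.0
    for p in primes_from_13(sieve_primes):
        fraction *= 1.0 - 1.0 / p
    return fraction


if __name__ == "__main__":
    for n in (256, 5389, 73411):
        print(f"SievePrimes={n}: {100 * survivor_fraction(n):.2f}% survivors")
```

If the assumptions hold, the printed percentages should land near the corresponding Survivors entries in the table, which would suggest that row is simply the product of (1 - 1/p) over the sieve primes.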

Manpowre 2013-05-18 09:38

6970
 
I see that the output is very different from mfaktc's. I guess that will get cleaned up, and maybe end up looking similar?

CalcBitToClear 82688 primes: 250 us (330.752 M/s)
sieve using 262144 threads: 10.34 ms (25.3524 M/s), 6490.22 M FCs/s sieved
TF using 1048576 threads: 48.4379 ms (21.6478 M/s), 1385.46 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 260.445 us (317.487 M/s)
sieve using 262144 threads: 10.3958 ms (25.2164 M/s), 6455.4 M FCs/s sieved
TF using 1048576 threads: 48.4279 ms (21.6523 M/s), 1385.75 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 240.778 us (343.42 M/s)
sieve using 262144 threads: 10.4173 ms (25.1642 M/s), 6442.04 M FCs/s sieved
TF using 1048576 threads: 48.4308 ms (21.651 M/s), 1385.67 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 266.556 us (310.209 M/s)
sieve using 262144 threads: 10.2207 ms (25.6484 M/s), 6566 M FCs/s sieved
TF using 1048576 threads: 48.4203 ms (21.6557 M/s), 1385.96 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 268.556 us (307.899 M/s)
sieve using 262144 threads: 10.3352 ms (25.3641 M/s), 6493.22 M FCs/s sieved
TF using 1048576 threads: 48.4021 ms (21.6638 M/s), 1386.49 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 289.445 us (285.678 M/s)
sieve using 262144 threads: 10.2964 ms (25.4597 M/s), 6517.67 M FCs/s sieved
TF using 1048576 threads: 48.4094 ms (21.6606 M/s), 1386.28 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 273 us (302.886 M/s)
sieve using 262144 threads: 10.4549 ms (25.0738 M/s), 6418.9 M FCs/s sieved
TF using 1048576 threads: 48.4148 ms (21.6582 M/s), 1386.12 M FCs/s TF'd (incl. sieving)

Using
Factor=218687F2FF894FA83B3425A0F89061D5,77115127,70,71
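
The rates in these log lines can be reproduced from the counts and timings they print. The sketch below redoes the arithmetic for the first CalcBitToClear / sieve / TF group. That the "M FCs/s" figures correspond to one 64Mi-bit sieve block (the GPUSieveSize shown in the perftest settings) per kernel run is an inference from the numbers, not something stated in the thread.

```python
def rate_millions_per_second(items, seconds):
    """Items processed per second, expressed in millions."""
    return items / seconds / 1e6


# Assumption: one sieve block is 64Mi bits, one bit per factor candidate (FC).
SIEVE_BITS = 64 * 1024 * 1024

calc_rate = rate_millions_per_second(82688, 250e-6)          # primes per second
sieve_rate = rate_millions_per_second(262144, 10.34e-3)       # sieve threads per second
sieve_fcs = rate_millions_per_second(SIEVE_BITS, 10.34e-3)    # FCs sieved per second
tf_rate = rate_millions_per_second(1048576, 48.4379e-3)       # TF threads per second
tf_fcs = rate_millions_per_second(SIEVE_BITS, 48.4379e-3)     # FCs per second incl. sieving

print(f"CalcBitToClear: {calc_rate:.3f} M/s")
print(f"sieve: {sieve_rate:.4f} M/s, {sieve_fcs:.2f} M FCs/s")
print(f"TF:    {tf_rate:.4f} M/s, {tf_fcs:.2f} M FCs/s")
```

Read this way, the sieve kernel alone covers candidates several times faster than the combined sieve-plus-TF figure, so in this run most of the time per block goes to the trial-factoring kernel.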

Manpowre 2013-05-18 09:51

6970
 
1 Attachment(s)
-st run, log attached, selftest passed.

I will start 500 tests now, let them run to completion, then double-check those 500 with mfaktc on 2 Titans and send you the results.

Great work on the new version. It seems to have huge throughput on the AMD cards.

kracker 2013-05-18 14:26

They should be the same. What Bdot wants, I think, is -st2.

Bdot 2013-05-18 20:11

[QUOTE=Manpowre;340869]-st run, log attached, selftest passed.

I will start 500 tests now, let them run to completion, then double-check those 500 with mfaktc on 2 Titans and send you the results.

Great work on the new version. It seems to have huge throughput on the AMD cards.[/QUOTE]

Thanks for your tests. I hope you're not using the -pi- version for the 500 tests: this version has the [B]P[/B]erformance[B]I[/B]nfo DEBUG option enabled, which considerably slows down overall processing but allows exact measurement of the kernel runtime. That is also why the output looks so different from mfaktc's. This binary is best used with CPU sieving (SieveOnGPU=0) and a short test (-st). I use this output to see which kernel runs at which speed on this particular hardware.

Try the binary without -pi- in its name, and the output should look familiar.

And yes, throughput on high-end cards greatly benefits from GPU sieving. It should not be too long until I have finished my work for the 0.13 release. Then it would be good if every user sent the output of one of the runs to James to allow more accurate updates of [URL="http://www.mersenne.ca/mfaktc.php"]this page[/URL]. Maybe I'll put this into the license as a requirement :smile:

Manpowre 2013-05-18 22:56

[QUOTE=Bdot;340900]Thanks for your tests. I hope you're not using the -pi- version for the 500 tests: this version has the [B]P[/B]erformance[B]I[/B]nfo DEBUG option enabled, which considerably slows down overall processing but allows exact measurement of the kernel runtime. That is also why the output looks so different from mfaktc's. This binary is best used with CPU sieving (SieveOnGPU=0) and a short test (-st). I use this output to see which kernel runs at which speed on this particular hardware.

Try the binary without -pi- in its name, and the output should look familiar.

And yes, throughput on high-end cards greatly benefits from GPU sieving. It should not be too long until I have finished my work for the 0.13 release. Then it would be good if every user sent the output of one of the runs to James to allow more accurate updates of [URL="http://www.mersenne.ca/mfaktc.php"]this page[/URL]. Maybe I'll put this into the license as a requirement :smile:[/QUOTE]

Yes, I did run the 05 -pi- build, mfakto-0.13pre5-pi-win64 (not the 04 as shown in the report logfile; I copied the command from an earlier post, hehe).
-st: passed
-st2: passed. It wrote 201 MB of output to the logfile, hehe, but it passed all tests. It took many hours, around 10. I started it a few hours after the first -st test I reported here, and it finished some time ago.

It would be great to have a timestamp at the beginning and at the end of each test, for both -st and -st2, and to have the total runtime calculated, as it seems this software really uses the card to the full.

Great job, Bdot. I hope I can improve CUDALucas to the same extent over the summer; however, it's going to take time :)

I'm really impressed here.
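
Until something like this is built in, a small wrapper script can record the start time, end time and total runtime around a self-test. This is only a sketch: the helper name `timed_run` is made up, and the executable name in the usage comment is the one from the earlier post and would need adjusting to whichever binary is actually in use.

```python
import subprocess
import sys
import time
from datetime import datetime


def timed_run(command):
    """Run `command`, print start/end timestamps, and return (exit code, elapsed seconds)."""
    print("start:", datetime.now().isoformat(timespec="seconds"))
    started = time.monotonic()
    result = subprocess.run(command)  # the program's output goes straight to the console
    elapsed = time.monotonic() - started
    print("end:  ", datetime.now().isoformat(timespec="seconds"))
    print(f"total runtime: {elapsed:.1f} s")
    return result.returncode, elapsed


# Example call (hypothetical; adjust the executable name to your binary):
# timed_run(["mfakto-0.13pre5-pi-win64.exe", "-st2"])
```

`time.monotonic()` is used for the elapsed time so that a clock adjustment during a run of many hours does not distort the total.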

Manpowre 2013-05-18 22:58

[QUOTE=Bdot;340900]
Try the binary without -pi- in its name, and the output should look familiar.
[/QUOTE]

Ahh, that's why the ATI card doesn't clock up when I run the -pi- build, hehe. OK, I'll run the 500 tests without -pi- and then double-check them with the Titans.

Thanks.

Cruelty 2013-05-19 23:11

HD7790 results
 
[url]https://www.box.com/s/pmxo4x26k5g2lcry46mk[/url]
st+st2+perftest :smile:

kracker 2013-05-19 23:25

[QUOTE=Cruelty;340982][url]https://www.box.com/s/pmxo4x26k5g2lcry46mk[/url]
st+st2+perftest :smile:[/QUOTE]

Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?

Cruelty 2013-05-20 08:46

[QUOTE=kracker;340983]Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?[/QUOTE]
Actually I did it out of curiosity. Let me know which tests you would like me to perform and I'll run them in the late evening (CET).

Bdot 2013-05-20 09:52

[QUOTE=Cruelty;340982][URL]https://www.box.com/s/pmxo4x26k5g2lcry46mk[/URL]
st+st2+perftest :smile:[/QUOTE]
Thanks a lot for these tests! This HD7790 seems to be a pretty good card. I've added "Bonaire" to the list of known GPUs; not sure how I missed it when I added the latest models ... But this really has been the last change for the 0.13 release!

[QUOTE=kracker;340983]Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?[/QUOTE]

At 1200MHz (reference default is 1000MHz), this card delivers about 148% of the throughput of your 7770@1100MHz (default 1000MHz), or 98% of my 7850@1050MHz (default 860MHz).
This is within 5% of the theoretical #-of-cores x clockspeed comparison.
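
The cores x clockspeed comparison can be redone by hand. The stream-processor counts below are not from this thread; they are the commonly published figures for these cards (HD 7790: 896, HD 7770: 640, HD 7850: 1024) and should be treated as assumptions. The clocks and the measured percentages are the ones quoted in the post.

```python
# Stream-processor counts are assumptions (published specs, not from the thread);
# clocks in MHz are the overclocked values quoted in the post.
CARDS = {
    "HD7790": {"cores": 896, "mhz": 1200},
    "HD7770": {"cores": 640, "mhz": 1100},
    "HD7850": {"cores": 1024, "mhz": 1050},
}

# Relative throughput of the HD7790 against each reference card, as reported in the post.
MEASURED = {"HD7770": 1.48, "HD7850": 0.98}


def theoretical_ratio(card, reference):
    """Ratio of (cores x clock) between two cards."""
    a, b = CARDS[card], CARDS[reference]
    return (a["cores"] * a["mhz"]) / (b["cores"] * b["mhz"])


for reference, measured in MEASURED.items():
    theory = theoretical_ratio("HD7790", reference)
    deviation = abs(measured / theory - 1)
    print(f"HD7790 vs {reference}: theory {theory:.1%}, "
          f"measured {measured:.0%}, deviation {deviation:.1%}")
```

With these assumed core counts, both deviations stay below 5%, in line with the statement above.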


All times are UTC.