mersenneforum.org  

Old 2013-05-18, 09:30   #782
Manpowre
 
"Svein Johansen"
May 2013
Norway

311₈ Posts
Default 6970

c:\mfakto-0.13pre5>mfakto-0.13pre5-pi-win64.exe --perftest

Runtime options
Inifile mfakto.ini
Verbosity 1
SieveOnGPU yes
GPUSievePrimes 82486
GPUSieveSize 64Mi bits
GPUSieveProcessSize 16Ki bits
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
TimeStampInResults yes
VectorSize 4
GPUType VLIW4
SmallExp no
Select device - Get device info - Compiling kernels.


Perftest

Generate list of the first 10^6 primes: 6913.06 ms

1. Sieve-Init (once per class, 960 times per test, avg. for 10 iterations)
Init_class(sieveprimes= 5000): 1.40 ms
Init_class(sieveprimes= 20000): 6.32 ms
Init_class(sieveprimes= 80000): 28.83 ms
Init_class(sieveprimes= 200000): 78.58 ms
Init_class(sieveprimes= 500000): 213.20 ms
Init_class(sieveprimes=1000000): 451.98 ms

2. Sieve (M/s)
Sieve size is fixed at compile time, cannot test with variable sizes. Just running 3 fixed tests.

SievePrimes:     256     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
SieveSizeLimit
24 kiB         264.8   241.2   220.6   202.2   184.8   168.9   155.0   141.8   128.8   116.1   105.0    90.4    75.2    61.7    50.4    40.5    31.8    24.1    17.1    10.0
24 kiB         264.1   241.2   220.7   202.1   184.8   169.4   155.3   142.2   128.6   116.1   104.5    90.6    75.3    61.7    50.4    40.4    31.6    23.8    16.2    10.1
24 kiB         263.1   240.3   207.6   200.7   183.8   169.2   154.9   141.9   128.8   115.9   104.7    90.5    75.0    61.7    50.5    40.5    31.8    24.0    17.2    10.0
Best SieveSizeLimit for
SievePrimes:     256     396     611     945    1460    2257    3487    5389    8328   12871   19890   30738   47503   73411  113449  175323  270944  418716  647083 1000000
at kiB:           24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24      24
max M/s:       264.8   241.2   220.7   202.2   184.8   169.4   155.3   142.2   128.8   116.1   105.0    90.6    75.3    61.7    50.5    40.5    31.8    24.1    17.2    10.1
Survivors:    36.36%  34.06%  32.05%  30.28%  28.69%  27.27%  26.00%  24.84%  23.79%  22.82%  21.94%  21.12%  20.36%  19.67%  19.01%  18.40%  17.82%  17.29%  16.80%  16.32%


3. Memory copy to GPU (blocks of 8388608 bytes)

Standard copy, standard queue:
800 MB in 244.5 ms (3430.4 MB/s) (real)

Standard copy, profiled queue:
800 MB in 244.4 ms (3432.1 MB/s) (real)
800 MB in 0.0 ms (103409861.9 MB/s) (profiled data)
8 MB in 0.0 ms ( 1.$ MB/s) (profiled data, peak)

Standard copy, two queues:
800 MB in 194.5 ms (4312.0 MB/s) (real)

4. mfakto_cl_63 kernel
soon
5. mfakto_cl_71 kernel
soon
6. barrett_79 kernel
soon
7. barrett_92 kernel
soon

c:\mfakto-0.13pre5>
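
A side note on the Survivors row in the perftest output above: the percentages look like what you would expect if each sieve prime p removes a fraction 1/p of the remaining factor candidates. The sketch below is only an illustration of that idea, not mfakto's code. It assumes (my assumption, not stated in the log) that the five smallest primes 2, 3, 5, 7 and 11 are already handled by the class selection, so sieving starts at 13. The names `first_primes` and `expected_survivors` are made up for this example.

```python
import math


def first_primes(count):
    """Return the first `count` primes using a sieve of Eratosthenes."""
    # n * (ln n + ln ln n) is an upper bound for the n-th prime when n >= 6.
    if count < 6:
        limit = 15
    else:
        limit = int(count * (math.log(count) + math.log(math.log(count)))) + 1
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Clear every multiple of n starting at n*n.
            is_prime[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))
    primes = [n for n in range(limit + 1) if is_prime[n]]
    return primes[:count]


def expected_survivors(sieve_primes, skipped=5):
    """Fraction of candidates left after sieving with `sieve_primes` primes.

    `skipped` is the number of smallest primes assumed to be covered
    elsewhere (assumption: 2, 3, 5, 7, 11), so sieving starts at 13.
    """
    primes = first_primes(sieve_primes + skipped)[skipped:]
    fraction = 1.0
    for p in primes:
        fraction *= 1.0 - 1.0 / p
    return fraction


for count in (256, 5389):
    print(f"SievePrimes={count}: {expected_survivors(count):.2%} survive")
```

Under these assumptions the two values should land close to the 36.36% and 24.84% reported in the table for SievePrimes 256 and 5389.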
Old 2013-05-18, 09:38   #783
Manpowre
 
"Svein Johansen"
May 2013
Norway

3·67 Posts
Default 6970

I see that the output is very different from mfaktc. I guess that will get cleaned up, and maybe end up with a similar kind of output?

CalcBitToClear 82688 primes: 250 us (330.752 M/s)
sieve using 262144 threads: 10.34 ms (25.3524 M/s), 6490.22 M FCs/s sieved
TF using 1048576 threads: 48.4379 ms (21.6478 M/s), 1385.46 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 260.445 us (317.487 M/s)
sieve using 262144 threads: 10.3958 ms (25.2164 M/s), 6455.4 M FCs/s sieved
TF using 1048576 threads: 48.4279 ms (21.6523 M/s), 1385.75 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 240.778 us (343.42 M/s)
sieve using 262144 threads: 10.4173 ms (25.1642 M/s), 6442.04 M FCs/s sieved
TF using 1048576 threads: 48.4308 ms (21.651 M/s), 1385.67 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 266.556 us (310.209 M/s)
sieve using 262144 threads: 10.2207 ms (25.6484 M/s), 6566 M FCs/s sieved
TF using 1048576 threads: 48.4203 ms (21.6557 M/s), 1385.96 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 268.556 us (307.899 M/s)
sieve using 262144 threads: 10.3352 ms (25.3641 M/s), 6493.22 M FCs/s sieved
TF using 1048576 threads: 48.4021 ms (21.6638 M/s), 1386.49 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 289.445 us (285.678 M/s)
sieve using 262144 threads: 10.2964 ms (25.4597 M/s), 6517.67 M FCs/s sieved
TF using 1048576 threads: 48.4094 ms (21.6606 M/s), 1386.28 M FCs/s TF'd (incl. sieving)
CalcBitToClear 82688 primes: 273 us (302.886 M/s)
sieve using 262144 threads: 10.4549 ms (25.0738 M/s), 6418.9 M FCs/s sieved
TF using 1048576 threads: 48.4148 ms (21.6582 M/s), 1385.12 M FCs/s TF'd (incl. sieving)

Using
Factor=218687F2FF894FA83B3425A0F89061D5,77115127,70,71
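
For anyone wondering how the figures in parentheses relate to the timings: they can be reproduced by dividing a count by the elapsed time. The sketch below is a reading of the log, not mfakto's source. It assumes that the "M FCs/s" figures are the sieve size in bits divided by the elapsed time, and that the run used the GPUSieveSize of 64Mi bits shown in the perftest options earlier in the thread. `rate_per_second` is a name invented for this example.

```python
# Assumption: this run used GPUSieveSize = 64Mi bits, as in the perftest options.
SIEVE_BITS = 64 * 2**20


def rate_per_second(items, seconds):
    """Items processed per second for a given elapsed time."""
    return items / seconds


# First block of the log: 250 us, 10.34 ms and 48.4379 ms.
calc_rate = rate_per_second(82688, 250e-6)            # primes per second
sieve_threads = rate_per_second(262144, 10.34e-3)     # sieve threads per second
sieve_fcs = rate_per_second(SIEVE_BITS, 10.34e-3)     # factor candidates sieved per second
tf_threads = rate_per_second(1048576, 48.4379e-3)     # TF threads per second
tf_fcs = rate_per_second(SIEVE_BITS, 48.4379e-3)      # factor candidates per second incl. sieving

for label, value in [
    ("CalcBitToClear", calc_rate),
    ("sieve threads", sieve_threads),
    ("sieve FCs", sieve_fcs),
    ("TF threads", tf_threads),
    ("TF FCs", tf_fcs),
]:
    print(f"{label:15s} {value / 1e6:10.3f} M/s")
```

If that reading is right, each sieve thread covers 256 bits and each TF thread covers 64 bits of the sieve, which is simply the ratio of the two rates printed on each log line.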
Old 2013-05-18, 09:51   #784
Manpowre
 
"Svein Johansen"
May 2013
Norway

11001001₂ Posts
Default 6970

-st run, log attached, selftest passed.

I will start 500 tests now, let them run to completion, then double-check those 500 with mfaktc on two Titans and send you the results.

Great work on the new version. It seems to have huge throughput on the AMD cards.
Attached Files
File Type: zip st-0.13pre4-pi-win64.zip (193.6 KB, 96 views)
Old 2013-05-18, 14:26   #785
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

They should be the same. What Bdot wants, I think, is -st2.
Old 2013-05-18, 20:11   #786
Bdot
 
Nov 2010
Germany

3·199 Posts
Default

Quote:
Originally Posted by Manpowre View Post
-st run, log attached, selftest passed.

I will start 500 tests now, let them run to completion, then double-check those 500 with mfaktc on two Titans and send you the results.

Great work on the new version. It seems to have huge throughput on the AMD cards.
Thanks for your tests. I hope you're not using the -pi- version for the 500 tests. That version has the PerformanceInfo DEBUG option enabled, which considerably slows down overall processing but allows for exact measurement of the kernel runtime. That is also the reason why the output looks so different from mfaktc. It is best to use this binary with CPU sieving (SieveOnGPU=0) and a short test (-st). I use this output to see which of the kernels runs at which speed on this particular hardware.

Try the binary without -pi- in its name, and the output should look familiar.

And yes, throughput on high-end cards greatly benefits from GPU sieving. It should not be too long until I have finished my work and can release 0.13. Then it would be good if every user sent the output of one of the runs to James to allow more accurate updates of this page. Maybe I'll put this as a requirement into the license.
Old 2013-05-18, 22:56   #787
Manpowre
 
"Svein Johansen"
May 2013
Norway

3×67 Posts
Default

Quote:
Originally Posted by Bdot View Post
Thanks for your tests. I hope you're not using the -pi- version for the 500 tests. That version has the PerformanceInfo DEBUG option enabled, which considerably slows down overall processing but allows for exact measurement of the kernel runtime. That is also the reason why the output looks so different from mfaktc. It is best to use this binary with CPU sieving (SieveOnGPU=0) and a short test (-st). I use this output to see which of the kernels runs at which speed on this particular hardware.

Try the binary without -pi- in its name, and the output should look familiar.

And yes, throughput on high-end cards greatly benefits from GPU sieving. It should not be too long until I have finished my work and can release 0.13. Then it would be good if every user sent the output of one of the runs to James to allow more accurate updates of this page. Maybe I'll put this as a requirement into the license.
Yes, I did run the pre5 -pi- build, mfakto-0.13pre5-pi-win64 (not the pre4 one named in the attached log file; I copied that command from an earlier post, hehe).
-st - passed
-st2 - passed. It wrote 201 MB of output into the log file, hehe, but it passed all tests. It took many hours, around 10. I started it a few hours after the first -st test I reported here, and it finished some time ago.

It would be great to have a timestamp at the beginning and at the end of the test, for both -st and -st2, and to have the total runtime calculated, as it seems this software is really using the card to its full potential.

Great job, Bdot. I hope I can improve CUDALucas to the same extent over the summer; however, it's going to take time :)

I'm really impressed here.
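
Until something like that is built in, a small wrapper script can provide the start and end timestamps and the total runtime. This is only a sketch that wraps any command line; it is not part of mfakto, and `run_timed` is a name invented here. The binary name to pass in is whatever your mfakto executable is called.

```python
import subprocess
import sys
import time
from datetime import datetime


def run_timed(command):
    """Run `command` (a list of arguments), printing start/end timestamps.

    Returns a tuple (exit code, elapsed seconds).
    """
    start = time.monotonic()  # monotonic clock, unaffected by clock adjustments
    print(f"Started:  {datetime.now():%Y-%m-%d %H:%M:%S}")
    result = subprocess.run(command)
    elapsed = time.monotonic() - start
    print(f"Finished: {datetime.now():%Y-%m-%d %H:%M:%S}")
    print(f"Total runtime: {elapsed:.1f} s")
    return result.returncode, elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    exit_code, _ = run_timed(sys.argv[1:])
    sys.exit(exit_code)
```

Saved as, say, `run_timed.py` (a hypothetical file name), it would be called with the mfakto binary and `-st` or `-st2` as arguments.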
Old 2013-05-18, 22:58   #788
Manpowre
 
"Svein Johansen"
May 2013
Norway

201₁₀ Posts
Default

Quote:
Originally Posted by Bdot View Post
Try the binary without -pi- in its name, and the output should look familiar.
Ahh, that's why the ATI card doesn't clock up when I run the -pi- build, hehe. OK, I'll run the 500 tests without the -pi- and then double-check them with the Titans.

Thanks.
Old 2013-05-19, 23:11   #789
Cruelty
 
May 2005

2³×7×29 Posts
Default HD7790 results

https://www.box.com/s/pmxo4x26k5g2lcry46mk
st+st2+perftest
Old 2013-05-19, 23:25   #790
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

100001111000₂ Posts
Default

Quote:
Originally Posted by Cruelty View Post
Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?
Old 2013-05-20, 08:46   #791
Cruelty
 
May 2005

1624₁₀ Posts
Default

Quote:
Originally Posted by kracker View Post
Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?
Actually, I did it out of curiosity. Let me know which tests you would like me to perform and I'll run them in the late evening (CET).
Old 2013-05-20, 09:52   #792
Bdot
 
Nov 2010
Germany

3×199 Posts
Default

Quote:
Originally Posted by Cruelty View Post
Thanks a lot for these tests! Seems to be a pretty good card, this HD7790. I've added "Bonaire" to the list of known GPUs. Not sure how I missed it when I added the latest models ... But this really has been the last change for the 0.13 release!

Quote:
Originally Posted by kracker View Post
Interesting. I was considering that instead of the 7770. How many GHz-days/day do you get on it?
At 1200MHz (reference default is 1000MHz), this card delivers about 148% of the throughput of your 7770@1100MHz (default 1000), or 98% of my 7850@1050MHz (default 860).
This is within 5% of the theoretical number-of-cores × clock-speed comparison.
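
The cores × clock comparison can be worked through as a quick sketch. The stream-processor counts below are taken from AMD's published specifications (896 for the HD 7790, 640 for the HD 7770, 1024 for the HD 7850) and are not stated in this thread; the clock speeds and the measured 148% and 98% figures are the ones quoted above. `CARDS`, `theoretical` and `relative` are names chosen for this example.

```python
# (stream processors, clock in MHz); core counts are an assumption from public specs.
CARDS = {
    "HD 7790": (896, 1200),
    "HD 7770": (640, 1100),
    "HD 7850": (1024, 1050),
}


def theoretical(name):
    """Crude throughput proxy: number of cores times clock speed."""
    cores, clock_mhz = CARDS[name]
    return cores * clock_mhz


def relative(card, baseline):
    """Theoretical throughput of `card` as a fraction of `baseline`."""
    return theoretical(card) / theoretical(baseline)


# Measured ratios quoted in the post above.
measured = {"HD 7770": 1.48, "HD 7850": 0.98}

for baseline, observed in measured.items():
    predicted = relative("HD 7790", baseline)
    deviation = abs(observed - predicted) / predicted
    print(f"HD 7790 vs {baseline}: predicted {predicted:.1%}, "
          f"measured {observed:.0%}, deviation {deviation:.1%}")
```

With these core counts both deviations come out below 5%, in line with the statement above.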
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.