mersenneforum.org  

Old 2011-01-09, 20:17   #23
Ralf Recker
 
Oct 2010

191 Posts

Quote:
Originally Posted by em99010pepe
For

GPU: GTX470 TDP of 215 W
CPU: Q9550 TDP of 121 W ( 4 cores at 3.6 GHz)
CPU timing for 5*2^1282755+1 of 1056 secs

Then

the GPU client would need to be 7.1 times faster than the CPU version to match the CPU LLR client's watts per candidate over 24 hours. That is the break-even point for this pair, the Q9550 and the GTX470.
At PrimeGrid's CPU/GPU sieves (PPS Sieve / CW Sieve) the 460 is faster than 12 Q9550s (48 cores).

Last fiddled with by Ralf Recker on 2011-01-09 at 20:30
Old 2011-01-09, 20:31   #24
em99010pepe
 
Sep 2004

2·5·283 Posts

Quote:
Originally Posted by Ralf Recker
At PrimeGrid's GPU sieves (PPS Sieve / CW Sieve) the 460 is faster than 12 Q9550s (48 cores)...
Being faster is one thing; being more energy efficient is another, ok? If your statement is true, then the GTX 460 is more efficient at sieving than 12 Q9550s.
For the calculations above I didn't consider whether the GPU client uses some CPU power; I don't know if that is the case. If it does, and considering the investment in a machine with a GPU, the GPU client would probably need to be more than 10x faster than the CPU client.
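
The 7.1x break-even figure quoted earlier in the thread can be reproduced with a few lines of arithmetic. This is only a sketch under stated assumptions: TDP is used as a stand-in for real power draw, the 1056 s timing is taken to be per core with all four Q9550 cores each running their own test, and the GPU client is assumed to need no CPU power. The function name is made up for illustration.

```python
# Figures quoted in the thread.
GPU_TDP_W = 215.0           # GTX 470
CPU_TDP_W = 121.0           # Q9550, whole package
CPU_CORES = 4
CPU_SECS_PER_TEST = 1056.0  # 5*2^1282755+1; assumed to be a per-core timing
SECONDS_PER_DAY = 24 * 3600


def break_even_speedup(gpu_w, cpu_w, cpu_cores):
    """Speedup over ONE CPU core at which GPU and CPU spend equal watts per candidate."""
    return cpu_cores * gpu_w / cpu_w


# Candidates the whole CPU finishes in 24 hours (one test per core at a time).
cpu_tests_per_day = CPU_CORES * SECONDS_PER_DAY / CPU_SECS_PER_TEST
speedup = break_even_speedup(GPU_TDP_W, CPU_TDP_W, CPU_CORES)
# Time per candidate the GPU client would have to reach to break even.
gpu_secs_per_test = CPU_SECS_PER_TEST / speedup

print(f"CPU candidates per day: {cpu_tests_per_day:.1f}")
print(f"break-even speedup:     {speedup:.2f}x")
print(f"GPU time per candidate: {gpu_secs_per_test:.1f} s")
```

If the GPU client also keeps a CPU core busy, that core's share of the CPU power would have to be added to the GPU side, which pushes the break-even speedup higher.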

Anyway, the next test should be at n=3M, e.g. 7*2^3015762+1, which is also prime.

Last fiddled with by em99010pepe on 2011-01-09 at 20:35
Old 2011-01-09, 20:43   #25
Karl M Johnson
 
Mar 2010

19B₁₆ Posts

It's also good to compare GPU vs. CPU computational performance on perf/$.
Oh, and there are motherboards with 8 PCI-E slots.
I haven't encountered a MB with 8 CPU sockets.
Old 2011-01-09, 20:51   #26
em99010pepe
 
Sep 2004

5416₈ Posts

Quote:
Originally Posted by Karl M Johnson
It's also good to compare GPU vs. CPU computational performance on perf/$.
Oh, and there are motherboards with 8 PCI-E slots.
I haven't encountered a MB with 8 CPU sockets.
You can have a motherboard with 8 PCI-E slots, but you can also have quad-socket motherboards where each socket takes a 4- to 6-core CPU (Tyan motherboards with Xeon processors). You need to compare specific energy; perf/$ depends on it.

I think msft needs to concentrate on getting the GPU client at least 10x faster than a CPU client. Then we need to know whether the ratio holds as we increase the size of the tests (higher n for k*2^n+1).

Last fiddled with by em99010pepe on 2011-01-09 at 20:54
Old 2011-01-09, 23:42   #27
Ken_g6
 
Jan 2005
Caught in a sieve

2×197 Posts

Very nice work, Shoichiro! Thank you!

I've been going over the code, and so far I've made a few optimizations with double2 that improve the speed of Ralf's test on my GTX 460/768@750 from about .84 to about .82 ms/bit. But now I'm looking at the normalization kernels.

First, I'm wondering about the error checking. Does maxerr really have to be a double? Or can it be just a float?
Old 2011-01-09, 23:54   #28
msft
 
Jul 2009
Tokyo

262₁₆ Posts

Quote:
Originally Posted by Ken_g6
First, I'm wondering about the error checking. Does maxerr really have to be a double? Or can it be just a float?
I think float has sufficient accuracy.
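
A rough illustration of why float may be enough here. It assumes that maxerr is the usual round-off check for FFT multiplication, i.e. the largest distance between an output value and its nearest integer; the thread does not show the actual llrCUDA code, so the sample values, the 0.4 limit, and the helper names below are all hypothetical. The distance itself is still computed in double precision; only the stored maximum, which always lies between 0 and 0.5, is narrowed to single precision.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def roundoff_error(x):
    """Distance from x to the nearest integer, computed in double precision."""
    return abs(x - round(x))


# Hypothetical FFT outputs: integers plus some accumulated round-off noise.
samples = [123456789.0 + 0.0003, -98765.0 - 0.12, 4096.0 + 0.31]
ERROR_LIMIT = 0.4  # hypothetical threshold, not taken from llrCUDA

maxerr_double = max(roundoff_error(x) for x in samples)
maxerr_float = to_float32(maxerr_double)  # what a float maxerr would hold

print(f"maxerr as double: {maxerr_double:.10f}")
print(f"maxerr as float:  {maxerr_float:.10f}")
print("same verdict:", (maxerr_double > ERROR_LIMIT) == (maxerr_float > ERROR_LIMIT))
```

Single precision keeps roughly seven significant decimal digits, so the two values differ only far below the level at which an error threshold would be set. The verdict could differ only for an error within about 1e-8 of the limit.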
Old 2011-01-10, 02:18   #29
msft
 
Jul 2009
Tokyo

610₁₀ Posts

Supports RE64. Only a 2-line change. Waiting for Ken_g6's work.
Attached Files
File Type: gz llrCUDA.0.12.tar.gz (94.0 KB, 120 views)
Old 2011-01-10, 04:07   #30
Ken_g6
 
Jan 2005
Caught in a sieve

2×197 Posts

5*2^1282755+1 is prime! Time : 998.783 sec. .765 ms/bit, down from .84 or so.

Changes from v0.12 also included.

That seems to be about as far as blind tweaking will take me. I've been unable to get cudaprof nee computeprof to run the app, which is why I'm tweaking blind. If someone could give me a ranking of the most time-costly kernels, I might be able to do more. Edit: Looks like I hit pretty much all the bases, so probably not.

Also, cuda_normalize2_kernel just bugs me, because it looks like it's an entire kernel using only one thread. But since I don't understand the algorithm well enough - I don't even see how the split between #2 and #3 works - I don't see a way to fix it.
Attached Files
File Type: bz2 llrcuda.0.13.tar.bz2 (87.5 KB, 119 views)

Last fiddled with by Ken_g6 on 2011-01-10 at 04:39
Old 2011-01-10, 04:29   #31
msft
 
Jul 2009
Tokyo

2·5·61 Posts

Quote:
Originally Posted by Ken_g6
Also, cuda_normalize2_kernel just bugs me, because it looks like it's an entire kernel using only one thread. But since I don't understand the algorithm well enough - I don't even see how the split between #2 and #3 works - I don't see a way to fix it.
I am concerned about index conflicts with x[].
I cannot understand "wrapindex".
Old 2011-01-10, 05:16   #32
Ralf Recker
 
Oct 2010

191 Posts

A first glance at v0.13:

ralf@quadriga ~/llrcuda.0.13 $ time ./llrCUDA -q5*2^1282755+1 -d
Starting Proth prime test of 5*2^1282755+1, FFTLEN = 131072 ; a = 3
5*2^1282755+1, bit: 80000 / 1282757 [6.23%]. Time per bit: 0.722 ms

Around 11% faster than v0.11...

Last fiddled with by Ralf Recker on 2011-01-10 at 05:20
Old 2011-01-10, 05:30   #33
Ralf Recker
 
Oct 2010

191 Posts
Nearly 2 minutes faster

ralf@quadriga ~/llrcuda.0.13 $ time ./llrCUDA -q5*2^1282755+1 -d
Starting Proth prime test of 5*2^1282755+1, FFTLEN = 131072 ; a = 3
5*2^1282755+1 is prime! Time : 935.885 sec.

real 15m36.000s
user 4m27.065s
sys 6m40.837s
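
As a cross-check, the per-bit timing from the 6% progress line and the total time reported at completion can be converted into each other. This is plain arithmetic on the numbers printed above; the variable names are only for illustration, and reading the small gap as start-up and finishing overhead or run-to-run variation is a guess, not something the output confirms.

```python
BITS = 1282757              # total bit count shown in the progress line
MS_PER_BIT_AT_6PCT = 0.722  # "Time per bit" reported at bit 80000
TOTAL_SECS = 935.885        # "Time :" reported when the test finished

# Total run time projected from the early per-bit figure.
projected_secs = BITS * MS_PER_BIT_AT_6PCT / 1000.0
# Average per-bit cost implied by the reported total.
average_ms_per_bit = TOTAL_SECS / BITS * 1000.0

print(f"projected total:   {projected_secs:.1f} s")
print(f"reported total:    {TOTAL_SECS:.1f} s")
print(f"average over run:  {average_ms_per_bit:.4f} ms/bit")
```

The projection lands within about ten seconds of the reported total, so the early per-bit figure was a good predictor for this run.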

Last fiddled with by Ralf Recker on 2011-01-10 at 05:31

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.