#23
Oct 2010
191 Posts
Quote:
Last fiddled with by Ralf Recker on 2011-01-09 at 20:30
#24
Sep 2004
5416₈ Posts
Quote:
For the calculations above, I didn't consider whether the GPU client uses any CPU power; I don't know if it does. If it does, then given the investment in a machine with a GPU, the GPU client would probably need to be more than 10x faster than the CPU client. Anyway, the next test should be at n=3M, for example 7*2^3015762+1, which is also prime.
Last fiddled with by em99010pepe on 2011-01-09 at 20:35
#25
Mar 2010
110011011₂ Posts
It's also good to compare GPU and CPU computational performance on a perf/$ basis.
Oh, and there are motherboards with 8 PCI-E slots. I haven't encountered a motherboard with 8 CPU sockets.
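A perf/$ comparison of this kind could be sketched roughly as below. The function names are hypothetical, and the timings and prices are made-up placeholders, not measurements from this thread; the only idea taken from the post is dividing throughput by cost.

```python
def perf_per_dollar(ms_per_bit, price_usd):
    """Throughput (bits tested per second) divided by hardware price."""
    bits_per_second = 1000.0 / ms_per_bit
    return bits_per_second / price_usd


def gpu_wins(gpu_ms, gpu_price, cpu_ms, cpu_price):
    """True if the GPU delivers more bits per second per dollar than the CPU."""
    return perf_per_dollar(gpu_ms, gpu_price) > perf_per_dollar(cpu_ms, cpu_price)


# Placeholder inputs only (NOT real benchmark data or real prices).
example_gpu = perf_per_dollar(ms_per_bit=0.8, price_usd=200.0)
example_cpu = perf_per_dollar(ms_per_bit=4.0, price_usd=300.0)
```

Plugging in measured ms/bit figures and actual hardware prices would give the comparison the post asks for; a fuller version would probably also charge the GPU for whatever host CPU time it consumes.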
#26
Sep 2004
2·5·283 Posts
Quote:
I think msft needs to concentrate on making the GPU client at least 10x faster than a CPU client. Then we need to know whether that ratio holds as we increase the size of the tests (higher n for k*2^n+1).
Last fiddled with by em99010pepe on 2011-01-09 at 20:54
#27
Jan 2005
Caught in a sieve
5·79 Posts
Very nice work, Shoichiro! Thank you!
I've been going over the code, and so far I've made a few optimizations with double2 that improve the speed of Ralf's test on my GTX 460/768@750 from about .84 to about .82 ms/bit.
But now I'm looking at the normalization kernels. First, I'm wondering about the error checking. Does maxerr really have to be a double? Or can it be just a float?
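One way to think about the maxerr question, as a rough sketch: assume (this is not confirmed from the llrCUDA source) that maxerr is the usual FFT round-off check, the largest distance between a computed value and its nearest integer. That distance is at most 0.5, so narrowing the stored maximum to single precision loses very little, whereas narrowing the value itself before the subtraction would destroy the check. The helper names below are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def roundoff_error(x):
    """Distance from x to its nearest integer (assumed definition of the error)."""
    return abs(x - round(x))


# Illustrative values only, standing in for un-normalized FFT outputs.
values = [12345678.0625, -987654.25, 4.4999, 1e9 + 0.37]

errs64 = [roundoff_error(v) for v in values]
maxerr64 = max(errs64)                          # maximum kept as a double
maxerr32 = max(to_float32(e) for e in errs64)   # maximum kept as a float

# Narrowing the *input* first is a different story: the fractional part is lost.
bad_err = roundoff_error(to_float32(12345678.0625))
```

Under that assumption, a float maxerr would seem adequate for comparing against a threshold such as 0.4, as long as the subtraction itself stays in double precision.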
#28
Jul 2009
Tokyo
1142₈ Posts
#29
Jul 2009
Tokyo
2×5×61 Posts
Supports RE64 now; only a 2-line change. Waiting for Ken_g6's work.
#30
Jan 2005
Caught in a sieve
18B₁₆ Posts
5*2^1282755+1 is prime! Time : 998.783 sec.
.765 ms/bit, down from .84 or so. Changes from v0.12 also included.
That seems to be about as far as blind tweaking will take me. I've been unable to get cudaprof née computeprof to run the app, which is why I'm tweaking blind. If someone could give me a ranking of the most time-costly kernels, I might be able to do more.
Edit: Looks like I hit pretty much all the bases, so probably not.
Also, cuda_normalize2_kernel just bugs me, because it looks like it's an entire kernel using only one thread. But since I don't understand the algorithm well enough - I don't even see how the split between #2 and #3 works - I don't see a way to fix it.
Last fiddled with by Ken_g6 on 2011-01-10 at 04:39
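For context on why a normalization step can end up single-threaded: carry propagation has a left-to-right dependency, since each digit needs the carry from the one before it. The sketch below is a generic illustration of that dependency and of one blocked workaround, written in Python for readability. It is not a description of what cuda_normalize2_kernel or the #2/#3 split actually do in llrCUDA; the function names and the small base are assumptions for the example.

```python
BASE = 10  # small base for readability; real code would use a power of two


def normalize_sequential(digits):
    """Reference: one pass, each digit waits for the previous carry."""
    out, carry = [], 0
    for d in digits:
        carry, rem = divmod(d + carry, BASE)
        out.append(rem)
    return out, carry


def normalize_blocked(digits, block):
    """Blocks are normalized independently (parallelizable); carries that
    leave a block are handed to the next block on the following pass."""
    out = list(digits)
    if not out:
        return [], 0
    nblocks = (len(out) + block - 1) // block
    carry_in = [0] * nblocks
    final_carry = 0
    while True:
        carry_out = [0] * nblocks
        for b in range(nblocks):  # each iteration could be its own thread
            c = carry_in[b]
            for i in range(b * block, min((b + 1) * block, len(out))):
                c, out[i] = divmod(out[i] + c, BASE)
            carry_out[b] = c
        final_carry += carry_out[-1]       # carry leaving the top block
        if not any(carry_out[:-1]):        # nothing left to hand over
            break
        carry_in = [0] + carry_out[:-1]    # shift carries one block up
    return out, final_carry
```

In the typical case the hand-over carries are small and die out after one or two passes, which is presumably why the hand-over step is often left to a tiny kernel; whether that applies here would need checking against the actual source.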
#31
Jul 2009
Tokyo
2·5·61 Posts
Quote:
I cannot understand "wrapindex".
#32
Oct 2010
191 Posts
A first glance at v0.13:
ralf@quadriga ~/llrcuda.0.13 $ time ./llrCUDA -q5*2^1282755+1 -d
Starting Proth prime test of 5*2^1282755+1, FFTLEN = 131072 ; a = 3
5*2^1282755+1, bit: 80000 / 1282757 [6.23%]. Time per bit: 0.722 ms
Around 11% faster than v0.11...
Last fiddled with by Ralf Recker on 2011-01-10 at 05:20
#33
Oct 2010
191 Posts
ralf@quadriga ~/llrcuda.0.13 $ time ./llrCUDA -q5*2^1282755+1 -d
Starting Proth prime test of 5*2^1282755+1, FFTLEN = 131072 ; a = 3
5*2^1282755+1 is prime! Time : 935.885 sec.
real 15m36.000s
user 4m27.065s
sys 6m40.837s
Last fiddled with by Ralf Recker on 2011-01-10 at 05:31
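The total run time can be turned back into an average time per bit for comparison with the progress lines. A minimal sketch, using the 1282757-bit count from the earlier progress output and the two total times posted in this thread (the function names are just for illustration):

```python
def ms_per_bit(total_seconds, bits):
    """Average wall-clock milliseconds per bit over a whole run."""
    return total_seconds * 1000.0 / bits


def percent_faster(old_seconds, new_seconds):
    """Relative reduction in run time, as a percentage of the old time."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


BITS = 1282757                       # from "bit: 80000 / 1282757"
ralf_avg = ms_per_bit(935.885, BITS)   # this run
ken_avg = ms_per_bit(998.783, BITS)    # Ken_g6's run on different hardware
```

The whole-run average comes out slightly above the 0.722 ms reported mid-run, which would be consistent with startup and final steps being included in the total. The two totals come from different machines, so percent_faster is only meaningful between runs on the same card.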