#12
"Dave"
Sep 2005
UK
2³·347 Posts
The peak throughput was with m=127, where I got 282M p/sec with 0.81 CPU used. With m=128 the throughput fell off a cliff to 32M p/sec and 0.11 CPU used. m=127 is equal to CUDA cores per multiprocessor (32 for a 465) * 4 - 1. Go figure!
#13
Jan 2005
Caught in a sieve
5·79 Posts
I'll tell you why that happened: I interpret that parameter differently when it exceeds BLOCKSIZE, because I never thought more than BLOCKSIZE blocks would be needed. Guess what BLOCKSIZE is set to!

So to keep going, start at 16384 (128*128, which is interpreted as total threads, not blocks), and increment by at least 128 at a time!

Edit: By the way, that's total threads per multiprocessor.
Edit2: As -m gets bigger, you should make sure your test range is big enough to account for all those tests. My little 20M test probably isn't big enough.
Edit3: Finally, you should be looking at the number printed at the end, not any intermediate progress reports, for the best assessment of speed.

Last fiddled with by Ken_g6 on 2010-09-07 at 21:37
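A minimal sketch of the -m handling described above, assuming BLOCKSIZE is 128 and a cutoff consistent with the m=127/m=128 cliff from post #12; the names and exact boundary are guesses, not the actual tpsieve-cuda source:

Code:
    /* Sketch only: -m read as a block count below BLOCKSIZE, but as a
       total thread count at or above it, rounded to whole blocks. */
    #include <stdio.h>

    #define BLOCKSIZE 128  /* threads per CUDA block */

    static unsigned int total_threads(unsigned int m)
    {
        if (m < BLOCKSIZE)
            return m * BLOCKSIZE;   /* m = number of blocks */
        return m - m % BLOCKSIZE;   /* m = total threads    */
    }

    int main(void)
    {
        printf("m=127   -> %u threads\n", total_threads(127));   /* 16256 */
        printf("m=128   -> %u threads\n", total_threads(128));   /* 128   */
        printf("m=16384 -> %u threads\n", total_threads(16384)); /* 16384 */
        return 0;
    }

On this reading, m=127 launches 16256 threads while m=128 launches only 128, matching the reported cliff, and m=16384 lands within one block of m=127's thread count.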
#14
"Dave"
Sep 2005
UK
2³·347 Posts
I have confirmed that m=16384 gives the same performance as m=127. However, any value of m greater than this produces an "out of range argument" error.
#15
Jan 2005
Caught in a sieve
5×79 Posts
OK, I've removed the apparently arbitrary range restriction. But I'm not entirely sure it was arbitrary, so don't be too surprised if it crashes now.
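A guess at the shape of the restriction being removed here; the names and wiring are assumptions, but a limit of 128*128 = 16384 would match the "out of range argument" Dave reported in #14:

Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCKSIZE 128

    /* Hypothetical pre-change validation of -m: anything above
       BLOCKSIZE*BLOCKSIZE (16384) was refused. */
    static void check_m(unsigned int m)
    {
        if (m > BLOCKSIZE * BLOCKSIZE) {
            fprintf(stderr, "out of range argument: m=%u\n", m);
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        check_m(16384); /* accepted */
        check_m(16385); /* rejected before the restriction was lifted */
        return 0;
    }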
#16
"Dave"
Sep 2005
UK
2³×347 Posts
It doesn't appear to crash. Peak throughput on a GTX465 appears to be 329M p/sec using 0.95 CPU (Core i7 @ 3.6GHz). This was achieved with 24,576 threads per multiprocessor, or 6 threads per CUDA core. After completing this I ran a 200G range to confirm stability.
#17
Jan 2005
Caught in a sieve
18B₁₆ Posts
#18
"Dave"
Sep 2005
UK
2776₁₀ Posts
A GTX465 actually has 11 multiprocessors, each containing 32 CUDA cores, for 352 CUDA cores in total. I believe you are thinking of the GTX460. That would suggest 768 threads per CUDA core (24,576 threads per multiprocessor / 32 cores)!
#19
May 2010
499 Posts
The CPU I'm using gives me 138M p/sec at 900T with 5.99 CPU. However, this drops to 134M p/sec at 6000T, since only 5.4 CPU is being used. Sieve speed goes down even further at 8000T: at that point I'm getting just 129M p/sec with 5.21 CPU used. I'm curious to see how pronounced this effect is with GPUs.
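Simple arithmetic on the figures above (an interpretation, not Oddball's own analysis): dividing throughput by the CPU actually in use shows the per-CPU rate holding roughly steady, which suggests the slowdown comes from idle threads rather than from the sieve getting slower per core:

Code:
    #include <stdio.h>

    int main(void)
    {
        const double rate[] = {138.0, 134.0, 129.0}; /* M p/sec     */
        const double cpus[] = {5.99, 5.40, 5.21};    /* CPUs in use */
        const char  *at[]   = {"900T", "6000T", "8000T"};

        /* prints roughly 23.0, 24.8, 24.8 M p/sec per CPU */
        for (int i = 0; i < 3; i++)
            printf("p=%s: %.1f M p/sec per CPU\n", at[i], rate[i] / cpus[i]);
        return 0;
    }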
#20
Jan 2005
Caught in a sieve
5·79 Posts
The first thing I think of on the CPU client, when CPU usage drops, is that you should run a second process and split the threads between them. That might work on high-end Fermis as well.

For this particular problem, you could instead try bumping up your blocksize in tpconfig.txt to the size of your L2 cache, and maybe increase your chunksize as well.
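Illustrative only: the blocksize and chunksize keys come from the post above, but the values (sized here for an assumed 8 MB L2 cache) and the exact tpconfig.txt syntax are assumptions; check the tpsieve documentation for the real format:

Code:
    # Hypothetical tpconfig.txt entries (values and syntax assumed):
    blocksize=8388608    # 8 MB, matched to the L2 cache size
    chunksize=1048576    # raised along with blocksize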
#21
May 2010
499 Posts
Edit: One minor drawback with those settings is the small hit in speed for lower p values. At p=940T, sieve speed goes down from 138.2M p/sec to 135.6M p/sec.

Last fiddled with by Oddball on 2010-09-11 at 23:03
#22
"Dave"
Sep 2005
UK
AD8₁₆ Posts