mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

kjaget 2010-05-11 13:00

1 Attachment(s)
Luigi -

I built these releases using an older version of CUDA - 2.2 I believe. It looks like Nvidia changed the name of the dll with the latest release. It looks like it's OK to redistribute this file, so I've attached it. Just drop it in the same directory as the exe and it should work.

[ATTACH]5180[/ATTACH]

(I'm stuck with the older version because any drivers newer than the ones which support CUDA 2.2 break a few games I'm still playing through. I guess I should try to upgrade again - I haven't in a few months so maybe the problem has been fixed.)

Oliver -

I haven't done any testing with the non-ASYNC_COPY code path - I just made a best guess at what would work there but didn't even try to compile it. Like you said, I don't think it would hurt to remove it.

Kevin

ET_ 2010-05-11 13:16

Thank you Kevin,

Unfortunately the executable requires the runtime library cudart.dll, but you attached the static .lib instead. I don't have the right environment to recompile the sources on Windows 7 :-(

Luigi

kjaget 2010-05-11 15:17

1 Attachment(s)
Luigi -

Oops. Here's the DLL instead of the static lib.

Kevin

[ATTACH]5181[/ATTACH]

ET_ 2010-05-11 16:12

Here are some results.

[code]
Environment:
------------
mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

CUDA device info
name: GeForce 9500M GS
compute capabilities: 1.1
maximum threads per block: 512
number of multiprocessors: 4 (32 shader cores)
clock rate: 950MHz

Processor: Intel Core2 Duo T8300 @ 2.40 GHZ, 4 GB RAM

Windows 7 - 64 bit

tf(3321928097, 1, 66);

Tests:
------
Runtime Options
SievePrimes 25000
SievePrimesAdjust 0
CudaStreams 2

Elapsed time: 117.109 sec
-----
Runtime Options
SievePrimes 25000-100000
SievePrimesAdjust 1
CudaStreams 2


Elapsed time: 105.087 sec
-----
Runtime Options
SievePrimes 25000
SievePrimesAdjust 2
CudaStreams 2


Elapsed time: 101.790 sec
-----
Runtime Options
SievePrimes 25000-100000
SievePrimesAdjust 1
CudaStreams 1

Elapsed time: 114.625 sec
[/code]

Luigi

kjaget 2010-05-11 17:00

OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.

The only other test I'd like to see is if increasing CudaStreams above 2 helps. It did on my system - a small bit at least.

Also, I finished up the self-test using the hacked up code with no problems. This doesn't test the sieve primes adjustments but it does test the changes I made for CudaStreams so I feel a bit better about the code.

I'll run similar tests tonight with the same exponent and share the results. Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents but I'm not sure how much of a difference so it will be interesting to see.

ET_ 2010-05-11 17:24

[QUOTE=kjaget;214698]OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.

The only other test I'd like to see is if increasing CudaStreams above 2 helps. It did on my system - a small bit at least.

Also, I finished up the self-test using the hacked up code with no problems. This doesn't test the sieve primes adjustments but it does test the changes I made for CudaStreams so I feel a bit better about the code.

I'll run similar tests tonight with the same exponent and share the results. Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents but I'm not sure how much of a difference so it will be interesting to see.[/QUOTE]


On the system I am testing, and with tf(3321928097,1,66), raising CudaStreams only increases the average wait (with SievePrimes = 100000).

I might see some advantage with a larger bit range, but right now I'm too short of time to choose parameters carefully.
Let me know which range you are testing, and I'll run some tests to see whether these parameters compare.

Luigi

kjaget 2010-05-12 00:21

Here are my results from a run that just finished. The exponent and range are listed, as are the total run times (in ms, or I guess in seconds if "," is a decimal point) for various CudaStreams values. I ran them with both SievePrimesAdjust = 1 and 2 to see the difference.

[CODE]
            Range                  CudaStreams, Adjust = 1             CudaStreams, Adjust = 2
Exponent    Start  End        1        2        3        4        1        2        3        4
3321928097      1   66   23,963   17,434   18,275   19,763   21,557   13,844   15,527   15,776
66362159        1   64  236,184  119,814  106,914  101,451  219,989  112,710   98,128   99,912
66362159       64   65  229,864  112,251  106,517  101,182  207,941  108,016   94,563   94,092
66362159       65   66  458,005  224,759  195,118  203,379  414,306  211,274  182,944  186,795
[/CODE]

For the large exponent you were testing it looks like 2 CudaStreams is best on my system as well. For smaller exponents in the range currently assigned by the PrimeNet server, 3-4 are best. I'll have to run 5 to see whether the times get steadily slower beyond 4 - hopefully it's not the case where 5 is better, 6 is worse, 7 is better, and so on.
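The comparison above can be checked mechanically. The sketch below copies the timings from the table (thousands separators dropped) and picks the fastest CudaStreams value per row; the names `best_streams` and `timings` are made up for this illustration and are not part of mfaktc.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the CudaStreams value (1..4) with the shortest run time.
   t[i] is the time measured with i+1 streams. */
static int best_streams(const long t[4])
{
    assert(t != NULL);
    int best = 0;
    for (int i = 1; i < 4; i++)
        if (t[i] < t[best])
            best = i;
    return best + 1;
}

/* Rows 0-3: SievePrimesAdjust = 1, rows 4-7: SievePrimesAdjust = 2.
   Within each half the order is the same as in the table:
   3321928097 (1-66), then 66362159 (1-64, 64-65, 65-66). */
static const long timings[8][4] = {
    {  23963,  17434,  18275,  19763 },
    { 236184, 119814, 106914, 101451 },
    { 229864, 112251, 106517, 101182 },
    { 458005, 224759, 195118, 203379 },
    {  21557,  13844,  15527,  15776 },
    { 219989, 112710,  98128,  99912 },
    { 207941, 108016,  94563,  94092 },
    { 414306, 211274, 182944, 186795 },
};
```

Calling `best_streams(timings[0])` and `best_streams(timings[4])` gives 2 for the large exponent, while the 66362159 rows give 3 or 4 depending on the range and adjust setting.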

I'd also like to make sure that exponents near each other behave the same. Although the numbers here roughly match what I've seen with other exponents in the 6xxxxxxx range there could be some variations. I doubt they'll be major but then again neither is the difference between 3 and 4 streams in some cases...

I'd also like to run the higher ranges to see if the pattern holds.

No need to duplicate any of these runs if you have more important things to do with your CPU cycles. I'm just trying to understand why my timings look different from Oliver's. It could be Windows vs. Linux, it could be that my machine is strange, or I may have messed up the software somehow. Maybe someone will recognize a pattern here that I'm missing...

Edit - adding machine info :

[CODE]
mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 2
CudaStreams 4

CUDA device info
name: GeForce GTX 275
compute capabilities: 1.3
maximum threads per block: 512
number of multiprocessors: 30 (240 shader cores)
clock rate: 1440MHz
[/CODE]

CPU is an overclocked i7-750 @ 3.7GHz, Win7 64-bit, 4GB RAM.

TheJudger 2010-05-12 21:29

Hi,

A code snippet from the siever (sieve.c):

[CODE]ktab[k ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];[/CODE]

This looks like it could be done via SIMD.
ktab and sieve_table_ are unsigned int; in the current version ic is a signed int (it could be changed to unsigned if needed).
ktab[k] [B]can't[/B] be aligned on a 16-byte boundary.

Perhaps there is an SSE instruction which could do the trick?
If a suitable instruction exists: how would I use it?

Oliver

Ken_g6 2010-05-12 22:05

Hi, Oliver,

What you probably want is a set of SSE2 intrinsic functions. You probably want to first set up a __m128i with four copies of ic. Then load from sieve_table_ which can hopefully be aligned (did you mean sieve_table_[0-3] instead of [1-4]?), add your four copies of ic to that, and store it unaligned.
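A minimal sketch of that sequence, assuming an x86 compiler with SSE2 support and assuming the four source values start on a 16-byte boundary. The helper name `add_ic_4` is made up for illustration. If the source values stay at [1-4], they start 4 bytes into the row, so `_mm_loadu_si128` would be needed in place of the aligned load.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: writes ic + src[0..3] to dst[0..3].
   src must be 16-byte aligned; dst may be unaligned. */
static void add_ic_4(unsigned int *dst, const unsigned int *src, int ic)
{
    assert(((uintptr_t)src & 15) == 0);                    /* aligned load needs this */
    __m128i vic  = _mm_set1_epi32(ic);                     /* four copies of ic */
    __m128i vtab = _mm_load_si128((const __m128i *)src);   /* aligned 128-bit load */
    __m128i sum  = _mm_add_epi32(vic, vtab);               /* four 32-bit adds */
    _mm_storeu_si128((__m128i *)dst, sum);                 /* unaligned 128-bit store */
}
```

The 32-bit adds wrap modulo 2^32, which matches the result of the scalar `ic + sieve_table_[i]` once it is stored in an unsigned int.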

Ken

TheJudger 2010-05-12 22:33

Hi Ken,

First of all, I should point out that you won't find this code in mfaktc-0.06. This is an optimisation for the siever in version 0.07 (not yet released). Sorry for that.

sieve_table_ could be aligned (I think). I really do mean [1-4], but this can be changed, too.

old code:
[CODE]sieve_table_=sieve_table[ s &0xFF];
for(p=1;p<=sieve_table_[0];p++) ktab[k++]=ic +sieve_table_[p];[/CODE]

newer code:
[CODE]sieve_table_=sieve_table[ s &0xFF];
ktab[k  ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];
if(sieve_table_[0]>4)
{
  ktab[k+4]=ic+sieve_table_[5];
  ktab[k+5]=ic+sieve_table_[6];
  ktab[k+6]=ic+sieve_table_[7];
  ktab[k+7]=ic+sieve_table_[8];
}
k+=sieve_table_[0];[/CODE]

So sieve_table_[0] contains the number of elements that should be processed.
The newer code is faster on my Core i7. If it helps with alignment, I could move the content of sieve_table_[0] to e.g. sieve_table_[8], and then the blocks would be [0..3] and [4..7].

sieve_table_ is just a pointer to a part of sieve_table[256][9].
sieve_table[X][0] contains the number of bits set in X and sieve_table[X][1...8] contain the positions of those bits. ic is some offset.
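To make the layout concrete, here is a rough reconstruction of how such a table could be filled. This is not the actual mfaktc code: the function name `init_sieve_table` is made up, and counting bit positions from the least significant bit (0..7) is an assumption.

```c
#include <assert.h>

/* row[0] = number of set bits in x, row[1..row[0]] = positions of those bits.
   Assumption: positions are counted from the least significant bit. */
static unsigned int sieve_table[256][9];

static void init_sieve_table(void)
{
    for (unsigned int x = 0; x < 256; x++) {
        unsigned int count = 0;
        for (unsigned int bit = 0; bit < 8; bit++) {
            if (x & (1u << bit)) {
                count++;
                sieve_table[x][count] = bit;  /* slots 1..8 hold positions */
            }
        }
        assert(count <= 8);
        sieve_table[x][0] = count;
        /* Unused slots stay 0 because static storage is zero-initialised. */
    }
}
```

With this layout, X = 0x05 (bits 0 and 2 set) gives the row {2, 0, 2, 0, ...}.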

Oliver

kjaget 2010-05-13 15:10

Two ideas -

It might be quicker to make sieve_table_[5-8] zero if there's nothing to add. That way you can just unconditionally add in the whole thing - this will avoid a compare and branch which might be more expensive than just adding in the unneeded values. They're going to end up in the cache anyway because you're using [0-3] so it should be quick to do another add.

For streaming/cache efficiency, it might be best to move sieve_table_[0] to its own separate array. That way the start of each entry you're interested in will be aligned to 8, so you might get better performance when streaming through the caches (maybe?).

No guarantees any of this will matter ... you'll have to play around and see what works the best.
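A sketch of both ideas combined, under assumptions: the counts live in a separate array, each row holds exactly 8 positions (0-based here, since the count no longer occupies slot 0), and unused slots are zero. All names (`sieve_count`, `sieve_pos`, `init_tables`, `expand`) and the `cap` parameter are made up for illustration. Whether this beats the branchy version would have to be measured.

```c
#include <assert.h>

static unsigned char sieve_count[256];   /* number of set bits per byte value */
static unsigned int  sieve_pos[256][8];  /* bit positions, unused slots are 0 */

static void init_tables(void)
{
    for (unsigned int x = 0; x < 256; x++) {
        unsigned int n = 0;
        for (unsigned int bit = 0; bit < 8; bit++)
            if (x & (1u << bit))
                sieve_pos[x][n++] = bit;
        sieve_count[x] = (unsigned char)n;
    }
}

/* Always writes 8 values, then advances k by the real count, so the
   surplus values are overwritten by the next call. ktab has cap entries
   and needs 8 writable ones starting at k. Returns the new k. */
static unsigned int expand(unsigned int *ktab, unsigned int k, unsigned int cap,
                           int ic, unsigned int s)
{
    const unsigned int *row = sieve_pos[s & 0xFF];
    assert(k + 8 <= cap);
    for (int i = 0; i < 8; i++)
        ktab[k + i] = (unsigned int)ic + row[i];  /* no compare, no branch */
    return k + sieve_count[s & 0xFF];
}
```

The price is that ktab needs up to 8 entries of slack beyond the last valid element, because every call stores a full row.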


All times are UTC.