1 Attachment(s)
Luigi -
I built these releases using an older version of CUDA - 2.2, I believe. It looks like Nvidia changed the name of the DLL with the latest release. It looks like it's OK to redistribute this file, so I've attached it. Just drop it in the same directory as the exe and it should work. [ATTACH]5180[/ATTACH]

(I'm stuck with the older version because any drivers newer than the ones which support CUDA 2.2 break a few games I'm still playing through. I guess I should try to upgrade again - I haven't in a few months, so maybe the problem has been fixed.)

Oliver - I haven't done any testing with the non-ASYNC_COPY code path - I just made a best guess at what would work there and didn't even try to compile it. Like you said, I don't think it would hurt to remove it.

Kevin
Thank you Kevin,
unfortunately the executable requires the cudart.dll runtime library, but you attached the static .lib to the forum. I don't have the right environment to recompile the sources on Windows 7 :-(

Luigi
1 Attachment(s)
Luigi -
Oops. Here's the DLL instead of the static lib.

Kevin [ATTACH]5181[/ATTACH]
Here are some results.
[code]
Environment:
------------
mfaktc v0.06

Compiletime Options
  THREADS_PER_GRID          983040
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                230945bits
  USE_PINNED_MEMORY         enabled
  USE_ASYNC_COPY            enabled
  VERBOSE_TIMING            disabled
  SELFTEST                  disabled
  MORE_CLASSES              disabled

CUDA device info
  name                      GeForce 9500M GS
  compute capabilities      1.1
  maximum threads per block 512
  number of multiprocessors 4 (32 shader cores)
  clock rate                950MHz

Processor: Intel Core2 Duo T8300 @ 2.40 GHz, 4 GB RAM
Windows 7 - 64 bit

tf(3321928097, 1, 66);

Tests:
------
Runtime Options
  SievePrimes        25000
  SievePrimesAdjust  0
  CudaStreams        2
Elapsed time: 117.109 sec
-----
Runtime Options
  SievePrimes        25000-100000
  SievePrimesAdjust  1
  CudaStreams        2
Elapsed time: 105.087 sec
-----
Runtime Options
  SievePrimes        25000
  SievePrimesAdjust  2
  CudaStreams        2
Elapsed time: 101.790 sec
-----
Runtime Options
  SievePrimes        25000-100000
  SievePrimesAdjust  1
  CudaStreams        1
Elapsed time: 114.625 sec
[/code]
Luigi
OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.
The only other test I'd like to see is whether increasing CudaStreams above 2 helps. It did on my system - by a small bit at least.

Also, I finished the self-test using the hacked-up code with no problems. This doesn't test the sieve primes adjustments, but it does test the changes I made for CudaStreams, so I feel a bit better about the code. I'll run similar tests tonight with the same exponent and share the results.

Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents, but I'm not sure how much of a difference, so it will be interesting to see.
[QUOTE=kjaget;214698]OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.
The only other test I'd like to see is if increasing CudaStreams above 2 helps. It did on my system - a small bit at least. Also, I finished up the self-test using the hacked up code with no problems. This doesn't test the sieve primes adjustments but it does test the changes I made for CudaStreams so I feel a bit better about the code. I'll run similar tests tonight with the same exponent and share the results. Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents but I'm not sure how much of a difference so it will be interesting to see.[/QUOTE]

On the system I am testing, and with tf(3321928097,1,66), raising CudaStreams only increases the average wait (with SievePrimes = 100,000). Maybe I could see some advantage with a larger bit range, but right now I'm short of time to carefully choose parameters.

Let me know what range you are testing, and I'll run some tests to see how these parameters compare.

Luigi
Here are my results from a run that just finished. The exponent and range are listed, as are the total run times (in msec - or I guess in seconds if "," is a decimal point) for various CudaStreams values. I ran them with both adjust = 1 and 2 to see the difference.
[CODE]
                         CudaStreams, Adjust = 1            CudaStreams, Adjust = 2
Exponent    Start End      1       2       3       4          1       2       3       4
3321928097      1  66   23,963  17,434  18,275  19,763     21,557  13,844  15,527  15,776
  66362159      1  64  236,184 119,814 106,914 101,451    219,989 112,710  98,128  99,912
  66362159     64  65  229,864 112,251 106,517 101,182    207,941 108,016  94,563  94,092
  66362159     65  66  458,005 224,759 195,118 203,379    414,306 211,274 182,944 186,795
[/CODE]

For the large exponent you were testing, it looks like 2 CudaStreams is best on my system as well. For smaller exponents in the range currently assigned by the PrimeNet server, 3-4 are best. I'll have to run 5 to see if it's a steady trend slower after 4 - hopefully it's not the case where 5 is better, 6 is worse, 7 is better, and so on.

I'd also like to make sure that exponents near each other behave the same. Although the numbers here roughly match what I've seen with other exponents in the 6xxxxxxx range, there could be some variations. I doubt they'll be major, but then again neither is the difference between 3 and 4 streams in some cases... I'd also like to run the higher ranges to see if the pattern holds.

No need to duplicate any of these runs if you have more important things to do with CPU cycles. I'm just trying to understand why my timings look different than Oliver's. Could be Windows vs. Linux, could be that my machine is strange, or I've messed up the software somehow. Maybe someone will recognize a pattern here that I'm missing...
Edit - adding machine info:

[CODE]
mfaktc v0.06

Compiletime Options
  THREADS_PER_GRID          983040
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                230945bits
  USE_PINNED_MEMORY         enabled
  USE_ASYNC_COPY            enabled
  VERBOSE_TIMING            disabled
  SELFTEST                  disabled
  MORE_CLASSES              disabled

Runtime Options
  SievePrimes        25000
  SievePrimesAdjust  2
  CudaStreams        4

CUDA device info
  name                      GeForce GTX 275
  compute capabilities      1.3
  maximum threads per block 512
  number of multiprocessors 30 (240 shader cores)
  clock rate                1440MHz
[/CODE]

CPU is an overclocked i7-750 @ 3.7GHz, Win7 64-bit, 4GB RAM.
Hi,
a code snippet from the siever (sieve.c):

[CODE]
ktab[k  ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];
[/CODE]

This looks like it could be done via SIMD. ktab and sieve_table_ are unsigned int; in the current version ic is a signed int (it could be changed to unsigned if needed). ktab[k] [B]can't[/B] be aligned on a 16-byte boundary. Perhaps there is an SSE instruction which could do the trick? If a suitable instruction exists: how would I use it?

Oliver
Hi, Oliver,
What you probably want is a set of SSE2 intrinsic functions. First set up a __m128i with four copies of ic. Then load from sieve_table_, which can hopefully be aligned (did you mean sieve_table_[0-3] instead of [1-4]?), add your four copies of ic to that, and store the result unaligned.

Ken
Hi Ken,
first of all, I have to note that you won't find this code in mfaktc-0.06. This is an optimisation for the siever in the 0.07 version (unreleased yet). Sorry for that. sieve_table_ could be aligned (I think). I really mean [1-4], but this can be changed, too.

old code:
[CODE]
sieve_table_=sieve_table[s&0xFF];
for(p=1;p<=sieve_table_[0];p++)
  ktab[k++]=ic+sieve_table_[p];
[/CODE]

newer code:
[CODE]
sieve_table_=sieve_table[s&0xFF];
ktab[k  ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];
if(sieve_table_[0]>4)
{
  ktab[k+4]=ic+sieve_table_[5];
  ktab[k+5]=ic+sieve_table_[6];
  ktab[k+6]=ic+sieve_table_[7];
  ktab[k+7]=ic+sieve_table_[8];
}
k+=sieve_table_[0];
[/CODE]

So sieve_table_[0] contains the number of elements that should be processed. The newer code is faster on my Core i7.

If helpful for alignment: I could move the content of sieve_table_[0] to e.g. sieve_table_[8], and then it would be [0..3] and [4..7]. *sieve_table_ is just a pointer to a part of sieve_table[256][9]. sieve_table[X][0] contains the number of bits set in X, and sieve_table[X][1...8] contain the positions of the bits. ic is some offset.

Oliver
Two ideas -
It might be quicker to set sieve_table_[5-8] to zero when there's nothing to add. That way you can just unconditionally add in the whole thing - this avoids a compare and branch, which might be more expensive than just adding in the unneeded values. They're going to end up in the cache anyway because you're using [0-3], so it should be quick to do another add.

For streaming/cache efficiency, it might be best to move sieve_table_[0] into its own separate array. That way the start of each entry you're interested in will be aligned to 8, so you'll get better streaming-through-the-caches performance (maybe?).

No guarantees any of this will matter... you'll have to play around and see what works best.