mersenneforum.org  

Old 2010-05-11, 13:00   #199
kjaget
Jun 2005
3·43 Posts

Luigi -

I built these releases using an older version of CUDA (2.2, I believe). It looks like Nvidia changed the name of the DLL in the latest release. As far as I can tell it's OK to redistribute this file, so I've attached it. Just drop it in the same directory as the exe and it should work.

cudart-2.2-x64.zip

(I'm stuck with the older version because any drivers newer than the ones which support CUDA 2.2 break a few games I'm still playing through. I guess I should try to upgrade again - I haven't in a few months so maybe the problem has been fixed.)

Oliver -

I haven't done any testing with the non-ASYNC_COPY code path - I just made a best guess at what would work there but didn't even try to compile it. Like you said, I don't think it would hurt to remove it.

Kevin

Old 2010-05-11, 13:16   #200
ET_
Banned
"Luigi"
Aug 2002
Team Italia
11323₈ Posts

Thank you Kevin,

Unfortunately the executable requires the runtime cudart.dll library, but you posted the static .lib to the forum. I don't have the right environment to recompile the sources on Windows 7 :-(

Luigi

Old 2010-05-11, 15:17   #201
kjaget
Jun 2005
10000001₂ Posts

Luigi -

Oops. Here's the DLL instead of the static lib.

Kevin

cudart-dll-22-x64.zip

Old 2010-05-11, 16:12   #202
ET_
Banned
"Luigi"
Aug 2002
Team Italia
1001011010011₂ Posts

Here are some results.

Code:
Environment:
------------
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

CUDA device info
  name:                      GeForce 9500M GS
  compute capabilities:      1.1
  maximum threads per block: 512
  number of multiprocessors: 4 (32 shader cores)
  clock rate:                950MHz

Processor: Intel Core2 Duo T8300 @ 2.40 GHz, 4 GB RAM

Windows 7 - 64 bit

tf(3321928097, 1, 66);

Tests:
------
Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   0
  CudaStreams         2

Elapsed time: 117.109 sec
-----
Runtime Options
  SievePrimes         25000-100000
  SievePrimesAdjust   1
  CudaStreams         2


Elapsed time: 105.087 sec
-----
Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   2
  CudaStreams         2


Elapsed time: 101.790 sec
-----
Runtime Options
  SievePrimes         25000-100000
  SievePrimesAdjust   1
  CudaStreams         1

Elapsed time: 114.625 sec

Luigi

Old 2010-05-11, 17:00   #203
kjaget
Jun 2005
81₁₆ Posts

OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.

The only other test I'd like to see is if increasing CudaStreams above 2 helps. It did on my system - a small bit at least.

Also, I finished up the self-test using the hacked up code with no problems. This doesn't test the sieve primes adjustments but it does test the changes I made for CudaStreams so I feel a bit better about the code.

I'll run similar tests tonight with the same exponent and share the results. Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents but I'm not sure how much of a difference so it will be interesting to see.

Old 2010-05-11, 17:24   #204
ET_
Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

Quote:
Originally Posted by kjaget
OK, great to see you got it working. The results are similar to what I expected - a small improvement using the new way to adjust the sieve depth but nothing too huge.

The only other test I'd like to see is if increasing CudaStreams above 2 helps. It did on my system - a small bit at least.

Also, I finished up the self-test using the hacked up code with no problems. This doesn't test the sieve primes adjustments but it does test the changes I made for CudaStreams so I feel a bit better about the code.

I'll run similar tests tonight with the same exponent and share the results. Most of my testing has been in the lower ranges - checking them against P95 factoring assignments in the 6xxxxxxx and 7xxxxxxx range. I know things behave differently on larger exponents but I'm not sure how much of a difference so it will be interesting to see.

On the system I am testing, and with tf(3321928097,1,66), raising CudaStreams only increases the average wait (with SievePrimes = 100000).

I might see some advantage with a larger bit range, but right now I'm short of time to choose parameters carefully.
Let me know which range you are testing, and I'll run some tests to see whether these parameters compare.

Luigi

Old 2010-05-12, 00:21   #205
kjaget
Jun 2005
3×43 Posts

Here are my results from a run that just finished. The exponent and range are listed, as are the total run times (in mSec, or I guess in seconds if , is a decimal point) for various CudaStreams values. I ran them with both adjust = 1 and 2 to see the difference.

Code:
            Range            CudaStreams, Adjust = 1                 CudaStreams, Adjust = 2
Exponent    Start  End        1        2        3        4            1        2        3        4
3321928097      1   66   23,963   17,434   18,275   19,763       21,557   13,844   15,527   15,776
66362159        1   64  236,184  119,814  106,914  101,451      219,989  112,710   98,128   99,912
66362159       64   65  229,864  112,251  106,517  101,182      207,941  108,016   94,563   94,092
66362159       65   66  458,005  224,759  195,118  203,379      414,306  211,274  182,944  186,795

For the large exponent you were testing, 2 CudaStreams looks best on my system as well. For smaller exponents in the range currently assigned by the PrimeNet server, 3-4 are best. I'll have to run 5 to see whether it gets steadily slower after 4. Hopefully it's not the case that 5 is better, 6 is worse, 7 is better, and so on.

I'd also like to make sure that exponents near each other behave the same. Although the numbers here roughly match what I've seen with other exponents in the 6xxxxxxx range there could be some variations. I doubt they'll be major but then again neither is the difference between 3 and 4 streams in some cases...

I'd also like to run the higher ranges to see if the pattern holds.

No need to duplicate any of these runs if you have more important things to do with CPU cycles. I'm just trying to understand why my timings look different from Oliver's. It could be Windows vs Linux, it could be that my machine is strange, or I may have messed up the software somehow. Maybe someone will recognize a pattern here that I'm missing...

Edit - adding machine info :

Code:
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         25000
  SievePrimesAdjust   2
  CudaStreams         4

CUDA device info
  name:                      GeForce GTX 275
  compute capabilities:      1.3
  maximum threads per block: 512
  number of multiprocessors: 30 (240 shader cores)
  clock rate:                1440MHz

CPU is an overclocked i7-750 @ 3.7GHz, Win7 64-bit, 4GB RAM.

Last fiddled with by kjaget on 2010-05-12 at 00:31

Old 2010-05-12, 21:29   #206
TheJudger
"Oliver"
Mar 2005
Germany
457₁₆ Posts

Hi,

a code snippet from the siever (sieve.c):

Code:
ktab[k  ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];

This looks like it could be done via SIMD.
ktab and sieve_table_ are unsigned int; in the current version ic is a signed int (it could be changed to unsigned if needed).
ktab[k] can't be aligned on a 16-byte boundary.

Perhaps there is an SSE instruction which could do the trick?
If a suitable instruction exists, how would I use it?

Oliver

Old 2010-05-12, 22:05   #207
Ken_g6
Jan 2005
Caught in a sieve
18B₁₆ Posts

Hi, Oliver,

What you probably want is a set of SSE2 intrinsic functions. You probably want to first set up a __m128i with four copies of ic. Then load from sieve_table_ which can hopefully be aligned (did you mean sieve_table_[0-3] instead of [1-4]?), add your four copies of ic to that, and store it unaligned.

Ken

Old 2010-05-12, 22:33   #208
TheJudger
"Oliver"
Mar 2005
Germany
11×101 Posts

Hi Ken,

first of all I should point out that you won't find this code in mfaktc-0.06. This is an optimisation for the siever in the 0.07 version (not yet released). Sorry for that.

sieve_table_ could be aligned (I think). I really mean [1-4], but this can be changed, too.

old code:
Code:
sieve_table_=sieve_table[s&0xFF];
for(p=1;p<=sieve_table_[0];p++) ktab[k++]=ic+sieve_table_[p];

newer code:
Code:
sieve_table_=sieve_table[s&0xFF];
ktab[k  ]=ic+sieve_table_[1];
ktab[k+1]=ic+sieve_table_[2];
ktab[k+2]=ic+sieve_table_[3];
ktab[k+3]=ic+sieve_table_[4];
if(sieve_table_[0]>4)
{
  ktab[k+4]=ic+sieve_table_[5];
  ktab[k+5]=ic+sieve_table_[6];
  ktab[k+6]=ic+sieve_table_[7];
  ktab[k+7]=ic+sieve_table_[8];
}
k+=sieve_table_[0];

So sieve_table_[0] contains the number of elements that should be processed.
The newer code is faster on my Core i7. If it helps with alignment, I could move the content of sieve_table_[0] to e.g. sieve_table_[8], and then the positions would be in [0..3] and [4..7].

*sieve_table_ is just a pointer to a part of sieve_table[256][9].
sieve_table[X][0] contains the number of bits set in X and sieve_table[X][1...8] contain the positions of the bits. ic is some offset.

Oliver

Last fiddled with by TheJudger on 2010-05-12 at 22:33

Old 2010-05-13, 15:10   #209
kjaget
Jun 2005
201₈ Posts

Two ideas -

It might be quicker to make sieve_table_[5-8] zero if there's nothing to add. That way you can just unconditionally add in the whole thing - this will avoid a compare and branch which might be more expensive than just adding in the unneeded values. They're going to end up in the cache anyway because you're using [0-3] so it should be quick to do another add.

For streaming/cache efficiency, it might be best to move sieve_table_[0] to its own separate array. That way the start of each entry you're interested in will be aligned to 8, so you might get better streaming performance through the caches.

No guarantees any of this will matter ... you'll have to play around and see what works the best.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.