mersenneforum.org  

Old 2010-07-09, 21:54   #298
Ethan (EO)

By the way, thanks so much for your work on this!!

It's a lot of fun to discover factors so quickly -- I found 34 factors of exponents in the 9.5e8 range in about four hours yesterday :)

And as I'm just learning CUDA, it's been good to have a project I can relate to while learning.

Thanks again :)

Last fiddled with by Ethan (EO) on 2010-07-09 at 21:55 Reason: typo.
Old 2010-07-09, 22:19   #299
TheJudger

Hi Ethan,

I'm pretty sure that the issue on Windows is unchanged with 0.09.

Quote:
Originally Posted by Ethan (EO) View Post
NumStreams = 64
SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
THREADS_PER_GRID = 6 * 3584
SIEVE_SIZE_LIMIT = 7

With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.
Those parameters look scary to me.
- high NumStreams eats lots of memory
- low SievePrimes isn't very energy efficient and lowers overall TF throughput
- low THREADS_PER_GRID should cause more overhead; I didn't expect that this would yield good results
- SIEVE_SIZE_LIMIT = 7... won't work if MORE_CLASSES is used (note to self: add a note in params.h about the limits)
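For context, the "classes" that MORE_CLASSES refers to can be sketched quickly. mfaktc only tests candidate factors q = 2kp+1 with q ≡ ±1 (mod 8) and q coprime to a few tiny primes, so whole residue classes of k can be skipped. The class counts (420, or 4620 with MORE_CLASSES) are an assumption here; this is a quick Python sketch of the idea, not mfaktc's actual code:

```python
# Sketch of the "classes" idea: a factor of M(p) has the form q = 2*k*p + 1
# with q = +/-1 (mod 8). Classes of k (mod num_classes) where q would always
# be divisible by a tiny prime can be skipped entirely.
def eligible_classes(p, num_classes=420, tiny_primes=(3, 5, 7)):
    good = []
    for k in range(num_classes):
        q = 2 * k * p + 1
        if q % 8 in (1, 7) and all(q % r for r in tiny_primes):
            good.append(k)
    return good

p = 73708469
print(len(eligible_classes(p)))                       # 96 of 420 classes survive
print(len(eligible_classes(p, 4620, (3, 5, 7, 11))))  # 960 of 4620 with MORE_CLASSES
```

Only 96 of 420 (or 960 of 4620) classes ever need sieving, independent of the exponent as long as p is prime and larger than the tiny primes.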

Quote:
Originally Posted by Ethan (EO) View Post
To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
Code:
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)

                                                 1 Instance     2 Instances
3 Streams/SievePrimes 5000                        1 per 88s      1 per 44s
64 Streams/SievePrimes 250/5000                   1 per 52s      1 per 40s
Perhaps this exponent / bit level isn't a good measurement. Time per class is well below 1 second. Please try bigger assignments (e.g. 2^64 to 2^67).
Edit: M73708469 from 2^64 to 2^65 should be big enough as long as you are running without MORE_CLASSES defined.

Quote:
Originally Posted by Ethan (EO) View Post
So if you want to leave the other cores on a processor free to LL or something, the many-streams setting seems to be the clear winner.
A clear winner would be Linux in this case. (SCNR)
This is somehow screwed up on some Windows systems. Is there a chance that you could run it with a 19x.xx (e.g. 197.45) driver version?

Oliver

Last fiddled with by TheJudger on 2010-07-09 at 23:02
Old 2010-07-09, 22:52   #300
amphoria

Quote:
Originally Posted by TheJudger View Post
This is somehow screwed up on some Windows systems. Is there a chance that you could run it with a 19x.xx (e.g. 197.45) driver version?

Oliver
Ethan,

I have had similar issues with Windows 7 64-bit and a GTX 470. The latest driver that I have found to work with sensible values for NUM_STREAMS, SievePrimes, etc. was 197.75, which can be downloaded from www.nvidia.com/Download/Find.aspx?lang=en-us. With this driver I found that NUM_STREAMS=6 gave around 110-115 M/sec throughput. Note that I have only tried this with the CUDA 3.0 toolkit.

Dave

Last fiddled with by amphoria on 2010-07-09 at 22:54
Old 2010-07-09, 23:08   #301
TheJudger

Ethan, Dave:

perhaps you could run mfaktc with VERBOSE_TIMING enabled. Be warned, this will create a lot of output on the screen. You should redirect the output to a file.
Please run 4 variants (if possible):

- 197.xx driver + small SievePrimes
- 197.xx driver + big SievePrimes
- 25x.xx driver + small SievePrimes
- 25x.xx driver + big SievePrimes

The first 200-300 lines of output should be enough.
Send me PM or email me the output if you want.

Oliver
Old 2010-07-09, 23:19   #302
Ethan (EO)

Quote:
Originally Posted by TheJudger View Post
Ethan, Dave:

perhaps you could run mfaktc with VERBOSE_TIMING enabled. Be warned, this will create a lot of output on the screen. You should redirect the output to a file.
Please run 4 variants (if possible):
Any particular stream count, exponent, or candidate range you would like run? Want to check before a driver install/uninstall cycle :)

ethan
Old 2010-07-09, 23:36   #303
TheJudger

Hi Ethan,

Quote:
Originally Posted by Ethan (EO) View Post
Any particular stream count, exponent, or candidate range you would like run? Want to check before a driver install/uninstall cycle :)

ethan
you're right, I should have been a bit more specific.

M73708469 from 2^64 to 2^65

Defaults from 0.09 and change
- THREADS_PER_GRID (28<<15) // (GTX 470 has 448 CUDA cores)
- enable VERBOSE_TIMING
- NumStreams=6 (suggested by Dave)
- SievePrimesAdjust=0

small SievePrimes=5000
big SievePrimes=50000

Oliver
Old 2010-07-10, 00:28   #304
Ethan (EO)

Quote:
Originally Posted by TheJudger View Post
Hi Ethan,

..


Those parameters look scary to me.
- high NumStreams eats lots of memory
I noticed your remarks to that effect in the source; here is what I observe in practice:

Code:
NStreams         GPU Memory Use      mfaktc Private Working Set
-------------------------------------------------------------
3                  61MB               31MB
10                 88MB               58MB
64                 296MB             266MB
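Those numbers line up with a simple model in which each extra stream pins one ktab of THREADS_PER_GRID 32-bit candidate offsets on the host, mirrored by a matching device buffer. This is only a back-of-the-envelope sketch; the 983040 default comes from a 0.09 log later in the thread, and the per-stream layout is an assumption:

```python
# Rough per-stream memory estimate, assuming one ktab of THREADS_PER_GRID
# 32-bit candidate offsets per stream (hypothetical model, not mfaktc's
# exact allocation scheme).
THREADS_PER_GRID = 983040                  # 0.09 default, per ckdo's log below
ktab_mib = THREADS_PER_GRID * 4 / 2**20    # size of one ktab in MiB
print(ktab_mib)                            # 3.75
print(61 * ktab_mib)                       # 228.75 -> close to the ~235 MB
                                           # growth seen going from 3 to 64 streams
```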
Quote:
- low THREADS_PER_GRID should cause more overhead; I didn't expect that this would yield good results
It looks like a very low THREADS_PER_GRID is only best in the fewer-streams case, but it is still better than the default or maximal values in the 64-streams case:

Code:
                                Seconds/Test (73708469 64 65) 
THREADS_PER_GRID                64Streams     5Streams
-----------------------------------------------------------
        1*14*256                      114           144
        6*14*256                       62            76
       14*14*256                       59            91
       28*14*256                       57            89
       56*14*256                       57            89
      100*14*256                       57            89
        30 << 15                       63            90
      585*14*256                       74            91
Quote:
- SIEVE_SIZE_LIMIT = 7... won't work if MORE_CLASSES is used (mark to myself: add a note in params.h about limits)
Hmm -- on further inspection, this only helps in the fewer-streams case; I bet it is an artifact of the Windows problem?

Quote:
A clear winner would be Linux in this case. (SCNR)
:D

Last fiddled with by Ethan (EO) on 2010-07-10 at 00:35 Reason: Fix a quote.
Old 2010-07-10, 06:20   #305
ckdo

Quote:
Originally Posted by TheJudger View Post
Hello!

Here is mfaktc 0.09!

Highlights:
- should compile with CUDA 3.1
Ethan's binary requires CUDA 3.1, actually.

So I upgraded CUDA to 3.1 and the display drivers to 257.21. Oh the joy...

Now that binary still doesn't exactly work for me:

Code:
mfaktc v0.09-Win

Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  VERBOSE_TIMING      disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         5000
  SievePrimesAdjust   1
  NumStreams          5
  WorkFile            worktodo.txt
  Checkpoints         enabled

CUDA device info
  name:                      GeForce GT 220
  compute capabilities:      1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate:                1200MHz

running a simple selftest...
ERROR: cudaGetLastError() returned 8: invalid device function
At least 0.08 still works with the updated CUDA and drivers...

Last fiddled with by ckdo on 2010-07-10 at 06:20
Old 2010-07-10, 08:31   #306
Ethan (EO)

! I think I [edit]may[/edit] have found a way around the "Windows Problem"... I just got

mfaktc -tf 73708469 64 65

to complete in 48s with only 2 streams on windows!

But I think I have an off-by-one error somewhere and I'm failing half of the self-tests. Basically I reworked the stream dispatch so that we no longer wait on streams to finish:

- h_ktab is now a buffer of pregenerated ktabs in no particular order, larger than num_streams.
- Whenever the number of pregenerated ktabs is above a definable minimum, we can iterate cudaStreamQuery over the stream list to dispatch ktabs to free streams.
- After checking for free streams, generate another ktab if there's space (and more if necessary to get above the low-water mark).
- Then check for free streams again.

Code:
  while(k_min <= k_max)
  {
#ifdef VERBOSE_TIMING
    printf("##### k_start = %" PRIu64 " #####\n",k_min);
    printf("mfakt(%u,...) start:    %" PRIu64 "msec\n",exp,timer_diff(&timer)/1000);
#endif

/* preprocessing and dispatch: populate the ktab array and check for available streams */

	do
	{
		if(ktabdirtycount < NUM_BUFFERS) 
		{ /* We have a free ktab slot */
			sieve_candidates(k_min, THREADS_PER_GRID, mystuff->h_ktab[ktabdirtycount], mystuff->sieve_primes);
			k_diff=mystuff->h_ktab[ktabdirtycount][THREADS_PER_GRID-1]+1;
			k_diff*=NUM_CLASSES; /* NUM_CLASSES because classes are mod NUM_CLASSES */
			k_min += k_diff;
			
			
			mystuff->h_kbase[ktabdirtycount] = k_min;
			
			ktabdirtycount++; /* Move to the next ktab slot. */
			
#ifdef VERBOSE_TIMING
			printf("Incrementing dirty count. Now %u.\n",ktabdirtycount);
			printf("mfakt(%u,...) sieved ktab slot %u:  %" PRIu64 "msec\n",exp,ktabdirtycount,timer_diff(&timer)/1000);
#endif
		}
	} while(ktabdirtycount < FULL_BUFFERS_MIN);
	

	/* Round-Robin Stream Dispatch */
	for(int i=0;i<4;i++)
		{
			streamNum = schedulingAttempts++ % mystuff->num_streams;
			
			if(ktabdirtycount > 0)
			{			
				if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess) 
				{
#ifdef VERBOSE_TIMING
					printf("We have a free stream and a dirty ktab. Transferring.\n");
#endif
					cudaGetLastError();
					cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
					cudaError = cudaGetLastError();
					if(cudaError != cudaSuccess)
						{
							printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
						}
#ifdef SHORTCUT_75BIT
					mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else    
					mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif				
					ktabdirtycount--;
#ifdef VERBOSE_TIMING
					printf("Decremented ktabdirtycount. Now %u.\n",ktabdirtycount);
#endif
				}  
			} else {
#ifdef VERBOSE_TIMING
				printf("Ran out of ktabs during scheduling round. Generating some.\n");
#endif
				break;
			}
		}
	}

/* Clean up any remaining dirty ktabs */

  while(ktabdirtycount > 0)
  {
		streamNum = schedulingAttempts++ % mystuff->num_streams;
		
		if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess) 
		{
			printf("We have a free stream and a dirty ktab. Transferring.\n");
			cudaGetLastError();
			cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
			cudaError = cudaGetLastError();
			if(cudaError != cudaSuccess)
			{
				printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
			}
#ifdef SHORTCUT_75BIT
			mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else    
			mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif				
			ktabdirtycount--;
			printf("Decremented ktabdirtycount. Now %u.\n",ktabdirtycount);
		}
  }
Like I said, this seems to work well for performance but fails about half of the self-tests. This is my first go at CUDA and I'm pretty tired, so there may be a deeper flaw, but I'm hoping someone with fresher eyes can spot something obvious! The candidate counting and timing code is all torn out right now too, and my debugging printfs are just scattered around in there :)

If this approach is sound, the next step is probably to auto-tune sieve_primes using low and high water marks in the buffer.
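Such an auto-tuner could be as simple as a feedback rule on the buffer fill level. A hypothetical sketch — the function name, constants, and step sizes here are all made up for illustration, nothing like this exists in 0.09:

```python
# Hypothetical SievePrimes auto-tuner, driven by how many pregenerated
# ktabs are sitting in the buffer. Illustration only, not mfaktc code.
def adjust_sieve_primes(sieve_primes, buffered, low=2, high=6,
                        sp_min=256, sp_max=200000):
    if buffered <= low:
        # GPU is draining the buffer faster than the CPU fills it:
        # sieve less deeply so each ktab is produced faster.
        sieve_primes = max(sp_min, sieve_primes * 9 // 10)
    elif buffered >= high:
        # CPU is comfortably ahead: sieve more deeply to cut the
        # number of candidates the GPU has to test.
        sieve_primes = min(sp_max, sieve_primes * 11 // 10)
    return sieve_primes

print(adjust_sieve_primes(5000, buffered=1))  # 4500
print(adjust_sieve_primes(5000, buffered=6))  # 5500
```

Called once per scheduling round, this would nudge SievePrimes toward the point where the sieve and the GPU run at the same rate.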

Last fiddled with by Ethan (EO) on 2010-07-10 at 08:40 Reason: Toned down enthusiasm, and added a few thoughts at bottom.
Old 2010-07-10, 12:27   #307
TheJudger

Hi!

Quote:
Originally Posted by ckdo View Post
Ethan's binary requires CUDA 3.1, actually.

So I upgraded CUDA to 3.1 and the display drivers to 257.21. Oh the joy...
In theory CUDA executables should be upward compatible but not downward. E.g. an executable compiled for CUDA 3.0 should run on CUDA 3.0 and 3.1, but not on 2.3.

Quote:
Originally Posted by ckdo View Post
Now that binary still doesn't exactly work for me:

ERROR: cudaGetLastError() returned 8: invalid device function
One possible reason for "invalid device function": the code is compiled for an architecture (compute capability) higher than your GPU's. So if Ethan compiled the code for sm_20 (compute capability 2.0) for his GTX 470, then there is no chance for you to run it on a compute capability 1.2 GPU.

Hopefully we'll have a CUDA 3.0 Windows binary (compiled for compute capability 1.0 or 1.1) soon.
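One way to produce a single binary that covers both 1.x and 2.0 devices is an nvcc "fat" binary. A sketch of the flags — the source file name is illustrative, and the exact option spelling should be checked against the CUDA 3.x toolkit in use:

```shell
# Embed code for several compute capabilities in one executable, so the
# same binary runs on a GT 220 (1.2) and a GTX 470 (2.0):
nvcc -O3 \
     -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_20,code=sm_20 \
     -o mfaktc mfaktc.cu
```

The driver then picks the best embedded variant for the installed GPU at load time.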

Oliver

P.S. If you just want to use mfaktc and have no motivation to debug/fix the issue on Windows/CUDA 3.1, I recommend staying with CUDA 3.0.
Old 2010-07-10, 12:37   #308
TheJudger

Hi Ethan,

Quote:
Originally Posted by Ethan (EO) View Post
But I think I have an off-by-one error somewhere and I'm failing half of the self-tests. Basically I reworked the stream dispatch so that we no longer wait on streams to finish:
my first guess is that you're modifying those h_ktabs before they are completely uploaded to the GPU.

If your modification solves the issue, then I would say one possibility is that the streams are not executed in the order they were issued.

My .plan for this issue
- make sure that it is actually related to CUDA 3.1. (I hope for some reports like "works fine with CUDA 3.0 but with 3.1 the issue occurs" from some more users)
- perhaps it is just a bug in the Windows CUDA 3.1 runtime and it will be fixed in a future version

Oliver