#298
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
1348 Posts
By the way, thanks so much for your work on this!!
It's a lot of fun to discover factors so quickly -- I found 34 factors of exponents in the 9.5e8 range in about four hours yesterday :) And as I'm just learning CUDA, it's been good to have a project I can relate to while learning. Thanks again :)

Last fiddled with by Ethan (EO) on 2010-07-09 at 21:55 Reason: typo.
#299
"Oliver"
Mar 2005
Germany
11·101 Posts
Hi Ethan,

I'm pretty sure that the issue on Windows is unchanged with 0.09.

- high NumStreams eats lots of memory
- low SievePrimes isn't very energy efficient and has a low overall TF throughput
- low THREADS_PER_GRID should cause more overhead; I didn't expect that this would yield good results
- SIEVE_SIZE_LIMIT = 7... won't work if MORE_CLASSES is used (note to myself: add a note in params.h about the limits)

Edit: M73708469 from 2^64 to 2^65 should be big enough, as long as you are running without MORE_CLASSES defined.

(SCNR) This is somehow screwed up on some Windows systems. Is there a chance that you're running a 19x.xx (e.g. 197.45) driver version?

Oliver

Last fiddled with by TheJudger on 2010-07-09 at 23:02
#300
"Dave"
Sep 2005
UK
2³·347 Posts
I have had similar issues with Windows 7 64-bit and a GTX 470. The latest driver that I have found to work with sensible values for NUM_STREAMS, SievePrimes, etc. is 197.75, which can be downloaded from www.nvidia.com/Download/Find.aspx?lang=en-us. With this driver I found that NUM_STREAMS=6 gave around 110-115 M/sec throughput. Note that I have only tried this with the CUDA 3.0 toolkit.

Dave

Last fiddled with by amphoria on 2010-07-09 at 22:54
#301
"Oliver"
Mar 2005
Germany
10001010111₂ Posts
Ethan, Dave:

Perhaps you could run mfaktc with VERBOSE_TIMING enabled. Be warned: this will create a lot of output on the screen, so you should redirect the output to a file. Please run four variants (if possible):

- 197.xx driver + small SievePrimes
- 197.xx driver + big SievePrimes
- 25x.xx driver + small SievePrimes
- 25x.xx driver + big SievePrimes

The first 200-300 lines of output should be enough. Send me a PM or email me the output if you want.

Oliver
#302
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
1011100₂ Posts
Quote:

ethan
#303
"Oliver"
Mar 2005
Germany
1111₁₀ Posts
Hi Ethan,

Quote:

M73708469 from 2^64 to 2^65

Defaults from 0.09, with these changes:
- THREADS_PER_GRID (28<<15) // (GTX 470 has 448 CUDA cores)
- enable VERBOSE_TIMING
- NumStreams=6 (suggested by Dave)
- SievePrimesAdjust=0; small: SievePrimes=5000, big: SievePrimes=50000

Oliver
#304
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
1011100₂ Posts
Code:
NStreams   GPU Memory Use   mfaktc Private Working Set
------------------------------------------------------
3          61MB             31MB
10         88MB             58MB
64         296MB            266MB
Code:
Seconds/Test (73708469 64 65)
THREADS_PER_GRID   64 Streams   5 Streams
-----------------------------------------
1*14*256               114         144
6*14*256                62          76
14*14*256               59          91
28*14*256               57          89
56*14*256               57          89
100*14*256              57          89
30 << 15                63          90
585*14*256              74          91
Last fiddled with by Ethan (EO) on 2010-07-10 at 00:35 Reason: Fix a quote.
#305
Dec 2007
Cleves, Germany
2×5×53 Posts
So I upgraded CUDA to 3.1 and the display drivers to 257.21. Oh, the joy...

Now that binary still doesn't exactly work for me:

Code:
mfaktc v0.09-Win

Compiletime Options
  THREADS_PER_GRID          983040
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                230945bits
  VERBOSE_TIMING            disabled
  MORE_CLASSES              disabled

Runtime Options
  SievePrimes               5000
  SievePrimesAdjust         1
  NumStreams                5
  WorkFile                  worktodo.txt
  Checkpoints               enabled

CUDA device info
  name                      GeForce GT 220
  compute capabilities      1.2
  maximum threads per block 512
  number of multiprocessors 6 (48 shader cores)
  clock rate                1200MHz

running a simple selftest...
ERROR: cudaGetLastError() returned 8: invalid device function
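(Editor's note, hedged: "invalid device function" on a kernel launch typically means the binary contains no kernel image matching the card's compute capability -- the GT 220 is CC 1.2, so a build targeting only newer architectures would fail exactly like this. Assuming a standard nvcc build (the actual mfaktc build line may differ), compiling with a CC 1.1 target would cover this card:)

```shell
# Hypothetical build line -- file names and flags are illustrative.
# -arch=sm_11 emits device code usable on compute capability 1.1+ cards
# such as the GT 220 (CC 1.2).
nvcc -O3 -arch=sm_11 -c mfaktc.cu -o mfaktc.o
```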
Last fiddled with by ckdo on 2010-07-10 at 06:20
#306
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
2²·23 Posts
I think I [edit]may[/edit] have found a way around the "Windows Problem"... I just got

mfaktc -tf 73708469 64 65

to complete in 48s with only 2 streams on Windows! But I think I have an off-by-one error somewhere, and I'm failing half of the self-tests.

Basically, I reworked the stream dispatch so that we no longer wait on streams to finish:

- h_ktab is now a buffer of pregenerated ktabs in no particular order, larger than num_streams
- whenever the number of pregenerated ktabs is above a definable minimum, we iterate cudaStreamQuery over the stream list to dispatch ktabs to free streams
- after checking for free streams, generate another ktab if there's space (and more if necessary to get above the low-water mark), then check for free streams again

Code:
while(k_min <= k_max)
{
#ifdef VERBOSE_TIMING
printf("##### k_start = %" PRIu64 " #####\n",k_min);
printf("mfakt(%u,...) start: %" PRIu64 "msec\n",exp,timer_diff(&timer)/1000);
#endif
/* preprocessing and dispatch: populate the ktab array and check for available streams */
do
{
if(ktabdirtycount < NUM_BUFFERS)
{ /* We have a free ktab slot */
sieve_candidates(k_min, THREADS_PER_GRID, mystuff->h_ktab[ktabdirtycount], mystuff->sieve_primes);
k_diff=mystuff->h_ktab[ktabdirtycount][THREADS_PER_GRID-1]+1;
k_diff*=NUM_CLASSES; /* NUM_CLASSES because classes are mod NUM_CLASSES */
k_min += k_diff;
          mystuff->h_kbase[ktabdirtycount] = k_min; /* note: k_min was already advanced above -- a likely spot for the off-by-one I mentioned */
ktabdirtycount++; /* Move to the next ktab slot. */
#ifdef VERBOSE_TIMING
printf("Incrementing dirty count. Now %u.\n",ktabdirtycount);
printf("mfakt(%u,...) sieved ktab slot %u: %" PRIu64 "msec\n",exp,ktabdirtycount,timer_diff(&timer)/1000);
#endif
}
} while(ktabdirtycount < FULL_BUFFERS_MIN);
/* Round-Robin Stream Dispatch */
for(int i=0;i<4;i++)
{
streamNum = schedulingAttempts++ % mystuff->num_streams;
if(ktabdirtycount > 0)
{
if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess)
{
#ifdef VERBOSE_TIMING
          printf("We have a free stream and a dirty ktab. Transferring.\n");
#endif
cudaGetLastError();
cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
cudaError = cudaGetLastError();
if(cudaError != cudaSuccess)
{
printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
}
#ifdef SHORTCUT_75BIT
mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else
mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif
ktabdirtycount--;
#ifdef VERBOSE_TIMING
printf("Decremented ktabdirtycount. Now %u.\n",ktabdirtycount);
#endif
}
} else {
#ifdef VERBOSE_TIMING
printf("Ran out of ktabs during scheduling round. Generating some.\n");
#endif
break;
}
}
}
/* Clean up any remaining dirty ktabs */
while(ktabdirtycount > 0)
{
streamNum = schedulingAttempts++ % mystuff->num_streams;
if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess)
{
    printf("We have a free stream and a dirty ktab. Transferring.\n");
cudaGetLastError();
cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
cudaError = cudaGetLastError();
if(cudaError != cudaSuccess)
{
printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
}
#ifdef SHORTCUT_75BIT
mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else
mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif
ktabdirtycount--;
printf("Decremented ktabdirtycount. Now %u.\n",ktabdirtycount);
}
}
If this approach is sound, the next step is probably to auto-tune sieve_primes using low and high watermarks in the buffer.

Last fiddled with by Ethan (EO) on 2010-07-10 at 08:40 Reason: Toned down enthusiasm, and added a few thoughts at bottom.
#307
"Oliver"
Mar 2005
Germany
457₁₆ Posts
Hi!

Hopefully we'll have a CUDA 3.0 Windows binary (compiled for compute capability 1.0 or 1.1) soon.

Oliver

P.S. If you just want to use mfaktc and have no motivation to debug/fix the issue on Windows/CUDA 3.1, I recommend staying with CUDA 3.0.
#308
"Oliver"
Mar 2005
Germany
11·101 Posts
Hi Ethan,

If your modification solves the issue, then I would say one possibility is that the streams are not executed in the order they were issued.

My .plan for this issue:
- make sure that it is actually related to CUDA 3.1 (I hope for some more reports like "works fine with CUDA 3.0, but with 3.1 the issue occurs" from other users)
- perhaps it is just a bug in the Windows CUDA 3.1 and it will be fixed in a future version

Oliver