By the way, thanks so much for your work on this!!
It's a lot of fun to discover factors so quickly -- I found 34 factors of exponents in the 9.5e8 range in about four hours yesterday :) And as I'm just learning CUDA, it's been good to have a project I can relate to while learning. Thanks again :)
Hi Ethan,
I'm pretty sure that the issue on Windows is unchanged with 0.09. :sad:

[QUOTE=Ethan (EO);220939]NumStreams = 64
SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
THREADS_PER_GRID = 6 * 3584
SIEVE_SIZE_LIMIT = 7

With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.[/QUOTE]

Those parameters look scary to me. :surprised:
- high NumStreams eats lots of memory
- low SievePrimes isn't very energy efficient and has a low overall TF throughput
- low THREADS_PER_GRID should cause more overhead; I didn't expect that this would yield good results
- SIEVE_SIZE_LIMIT = 7... won't work if MORE_CLASSES is used (note to myself: add a note in params.h about the limits)

[QUOTE=Ethan (EO);220939]To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
[code]
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)

                                  1 Instance    2 Instances
3 Streams/SievePrimes 5000        1 per 88s     1 per 44s
64 Streams/SievePrimes 250/5000   1 per 52s     1 per 40s
[/code][/QUOTE]

Perhaps this exponent / bit level isn't a good measurement: the time per class is well below one second. Please try bigger assignments (e.g. 2^64 to 2^67).

[B]Edit:[/B] M73708469 from 2^64 to 2^65 should be big enough as long as you are running without MORE_CLASSES defined.

[QUOTE=Ethan (EO);220939]So if you want to leave the other cores on a processor free to LL or something, the many-streams setting seems to be the clear winner.[/QUOTE]

A clear winner would be Linux in this case. :razz: (SCNR) This is somehow screwed up on some Windows systems. Is there any chance you could run it with a 19x.xx (e.g. 197.45) driver version?

Oliver
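Since the SIEVE_SIZE_LIMIT / MORE_CLASSES combination is only caught by a mental note right now, a compile-time guard in params.h could make the bad combination fail loudly at build time. This is just a sketch: the macro names follow the ones quoted in this thread, but the actual lower bound under MORE_CLASSES (12 kiB below) is a placeholder assumption that would need checking against the real sieve code.

```c
#include <assert.h>

/* Hypothetical params.h sanity check. With MORE_CLASSES the sieve runs
   over 4620 classes instead of 420, so a very small SIEVE_SIZE_LIMIT
   (e.g. the 7 kiB from Ethan's settings) may no longer be usable. */
#ifndef SIEVE_SIZE_LIMIT
#define SIEVE_SIZE_LIMIT 7   /* kiB, as in the settings quoted above */
#endif

#ifdef MORE_CLASSES
#if SIEVE_SIZE_LIMIT < 12    /* 12 is an assumed bound, not from mfaktc */
#error "SIEVE_SIZE_LIMIT too small when MORE_CLASSES is defined"
#endif
#endif

/* The same check as a runtime helper, so it can be exercised in tests. */
static int sieve_size_ok(int size_kib, int more_classes)
{
  return more_classes ? (size_kib >= 12) : (size_kib >= 4);
}
```

The `#error` fires at compile time, so a user who edits params.h into an unsupported state gets a message instead of silently wrong sieving.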
[QUOTE=TheJudger;220944]
This is somehow screwed up on some Windows systems. Is there a chance that you run it with 19x.xx (e.g. 197.45) driver version?

Oliver[/QUOTE]

Ethan, I have had similar issues with Windows 7 64-bit and a GTX 470. The latest driver that I have found to work with sensible values for NUM_STREAMS, SievePrimes, etc. was 197.75, which can be downloaded from [url]www.nvidia.com/Download/Find.aspx?lang=en-us[/url]. With this driver I found that NUM_STREAMS=6 gave around 110-115 M/sec throughput. Note that I have only tried this with the CUDA 3.0 toolkit.

Dave
Ethan, Dave:
perhaps you could run mfaktc with VERBOSE_TIMING enabled. Be warned, this will create a lot of output on the screen, so you should redirect the output to a file. Please run 4 variants (if possible):
- 197.xx driver + small SievePrimes
- 197.xx driver + big SievePrimes
- 25x.xx driver + small SievePrimes
- 25x.xx driver + big SievePrimes

The first 200-300 lines of output should be enough. Send me a PM or email me the output if you want.

Oliver
[QUOTE=TheJudger;220948]Ethan, Dave:
perhaps you could run mfaktc with VERBOSE_TIMING enabled. Be warned, this will create a lot of output on the screen. You should redirect the output to a file. Please run 4 variants (if possible): [/QUOTE]

Any particular stream count, exponent, or candidate range you would like run? I want to check before a driver install/uninstall cycle :)

ethan
Hi Ethan,
[QUOTE=Ethan (EO);220949]Any particular stream count, exponent, or candidate range you would like run? Want to check before a driver install/uninstall cycle :) ethan[/QUOTE]

You're right, I should have been a bit more specific.

M73708469 from 2^64 to 2^65, defaults from 0.09 with these changes:
- THREADS_PER_GRID (28<<15) // (GTX 470 has 448 CUDA cores)
- enable VERBOSE_TIMING
- NumStreams=6 (suggested by Dave)
- SievePrimesAdjust=0

small SievePrimes=5000
big SievePrimes=50000

Oliver
[QUOTE=TheJudger;220944]Hi Ethan,
.. Those parameters look scary to me. :surprised:
- high NumStreams eats lots of memory[/QUOTE]

I noticed your remarks to that effect in the source; here is what I observe in practice:
[CODE]
NStreams   GPU Memory Use   mfaktc Private Working Set
-------------------------------------------------------------
 3          61MB             31MB
10          88MB             58MB
64         296MB            266MB
[/CODE]

[QUOTE]- low THREADS_PER_GRID should cause more overhead, I didn't expect that this would yield good results[/QUOTE]

It looks like the very low THREADS_PER_GRID is only best in the fewer-streams case, but it is still better than the default or maximal values in the 64-streams case:
[CODE]
Seconds/Test (73708649 64 65)

THREADS_PER_GRID   64 Streams   5 Streams
-----------------------------------------------------------
  1*14*256          114          144
  6*14*256           62           76
 14*14*256           59           91
 28*14*256           57           89
 56*14*256           57           89
100*14*256           57           89
 30 << 15            63           90
585*14*256           74           91
[/CODE]

[QUOTE]- SIEVE_SIZE_LIMIT = 7... won't work if MORE_CLASSES is used (mark to myself: add a note in params.h about limits)[/QUOTE]

Hmm -- on further inspection, this only helps in the fewer-streams case; I bet it is an artifact of the Windows problem?

[QUOTE]A clear winner would be Linux in this case. :razz: (SCNR)[/QUOTE]

:D
[quote=TheJudger;220929]Hello!
Here is mfaktc 0.09! :smile: Highlights: - should compile with CUDA 3.1 [/quote]

Ethan's binary [I]requires[/I] CUDA 3.1, actually. So I upgraded CUDA to 3.1 and the display drivers to 257.21. Oh the joy... :loco:

Now that binary still doesn't exactly work for me:
[code]
mfaktc v0.09-Win

Compiletime Options
  THREADS_PER_GRID     983040
  THREADS_PER_BLOCK    256
  SIEVE_SIZE_LIMIT     32kiB
  SIEVE_SIZE           230945bits
  VERBOSE_TIMING       disabled
  MORE_CLASSES         disabled

Runtime Options
  SievePrimes          5000
  SievePrimesAdjust    1
  NumStreams           5
  WorkFile             worktodo.txt
  Checkpoints          enabled

CUDA device info
  name:                      GeForce GT 220
  compute capabilities:      1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate:                1200MHz

running a simple selftest...
ERROR: cudaGetLastError() returned 8: invalid device function
[/code]

At least 0.08 still works with the updated CUDA and drivers... :wink:
I think I [edit][U]may[/U][/edit] have found a way around the "Windows Problem"... I just got

mfaktc -tf 73708469 64 65

to complete in 48s with only 2 streams on Windows! But I think I have an off-by-one error somewhere, and I'm failing half of the self-tests.

Basically I reworked the stream dispatch so that we no longer wait on streams to finish:
- h_ktab is now a buffer of pregenerated ktabs in no particular order, larger than num_streams
- whenever the number of pregenerated ktabs is above a definable minimum, we iterate cudaStreamQuery over the stream list to dispatch ktabs to free streams
- after checking for free streams, generate another ktab if there's space (and more if necessary to get above the low water mark)
- then check for free streams again

[CODE]
while(k_min <= k_max)
{
#ifdef VERBOSE_TIMING
  printf("##### k_start = %" PRIu64 " #####\n", k_min);
  printf("mfakt(%u,...) start: %" PRIu64 "msec\n", exp, timer_diff(&timer)/1000);
#endif

  /* preprocessing and dispatch: populate the ktab array and check for
     available streams */
  do
  {
    if(ktabdirtycount < NUM_BUFFERS)
    {
      /* We have a free ktab slot */
      sieve_candidates(k_min, THREADS_PER_GRID, mystuff->h_ktab[ktabdirtycount], mystuff->sieve_primes);
      k_diff  = mystuff->h_ktab[ktabdirtycount][THREADS_PER_GRID-1] + 1;
      k_diff *= NUM_CLASSES; /* NUM_CLASSES because classes are mod NUM_CLASSES */
      k_min += k_diff;
      mystuff->h_kbase[ktabdirtycount] = k_min;
      ktabdirtycount++; /* Move to the next ktab slot. */
#ifdef VERBOSE_TIMING
      printf("Incrementing dirty count. Now %u.\n", ktabdirtycount);
      printf("mfakt(%u,...) sieved ktab slot %u: %" PRIu64 "msec\n", exp, ktabdirtycount, timer_diff(&timer)/1000);
#endif
    }
  } while(ktabdirtycount < FULL_BUFFERS_MIN);

  /* Round-robin stream dispatch */
  for(int i = 0; i < 4; i++)
  {
    streamNum = schedulingAttempts++ % mystuff->num_streams;
    if(ktabdirtycount > 0)
    {
      if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess)
      {
#ifdef VERBOSE_TIMING
        printf("We have a free stream and a dirty ktab. Transferring.\n");
#endif
        cudaGetLastError();
        cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
        cudaError = cudaGetLastError();
        if(cudaError != cudaSuccess)
        {
          printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
        }
#ifdef SHORTCUT_75BIT
        mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else
        mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif
        ktabdirtycount--;
#ifdef VERBOSE_TIMING
        printf("Decremented ktabdirtycount. Now %u.\n", ktabdirtycount);
#endif
      }
    }
    else
    {
#ifdef VERBOSE_TIMING
      printf("Ran out of ktabs during scheduling round. Generating some.\n");
#endif
      break;
    }
  }
}

/* Clean up any remaining dirty ktabs */
while(ktabdirtycount > 0)
{
  streamNum = schedulingAttempts++ % mystuff->num_streams;
  if(cudaStreamQuery(mystuff->stream[streamNum]) == cudaSuccess)
  {
    printf("We have a free stream and a dirty ktab. Transferring.\n");
    cudaGetLastError();
    cudaMemcpyAsync(mystuff->d_ktab[streamNum], mystuff->h_ktab[ktabdirtycount-1], size, cudaMemcpyHostToDevice, mystuff->stream[streamNum]);
    cudaError = cudaGetLastError();
    if(cudaError != cudaSuccess)
    {
      printf("ERROR: cudaMemcpyAsync() in stream dispatch returned %d: %s\n", cudaError, cudaGetErrorString(cudaError));
    }
#ifdef SHORTCUT_75BIT
    mfakt_95_75<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#else
    mfakt_95<<<blocksPerGrid, threadsPerBlock, 0, mystuff->stream[streamNum]>>>(exp, mystuff->h_kbase[ktabdirtycount-1], mystuff->d_ktab[streamNum], shiftcount, b_preinit, mystuff->d_RES);
#endif
    ktabdirtycount--;
    printf("Decremented ktabdirtycount. Now %u.\n", ktabdirtycount);
  }
}
[/CODE]

Like I said, this seems to work well for performance but is missing ~1/2 the self-tests. This is my first go at CUDA and I'm pretty tired, so there may be a deeper flaw, but I'm hoping someone with fresher eyes can spot something obvious! The candidate counting and timing code is all torn out right now too, and my debugging printfs are just scattered around in there :)

If this approach is sound, the next step is probably to auto-tune sieve_primes using low and high watermarks in the buffer.
Hi!
[QUOTE=ckdo;220964]Ethan's binary [I]requires[/I] CUDA 3.1, actually. So I upgraded CUDA to 3.1 and the display drivers to 257.21. Oh the joy... :loco:[/QUOTE]

In theory CUDA executables should be upwards compatible but not downwards. E.g. an executable compiled for CUDA 3.0 should run on CUDA 3.0 and 3.1, but not on 2.3.

[QUOTE=ckdo;220964]Now that binary still doesn't exactly work for me: ERROR: cudaGetLastError() returned 8: invalid device function[/QUOTE]

One possible reason for "invalid device function": the code is compiled for an architecture (compute capability) higher than your GPU supports. So if Ethan compiled the code for sm_20 (compute capability 2.0) for his GTX 470, then there is no chance for you to run it on a compute capability 1.2 GPU. :sad:

Hopefully we'll have a CUDA 3.0 Windows binary (compiled for compute capability 1.0 or 1.1) soon. :wink:

Oliver

P.S. If you just want to use mfaktc without any motivation to debug/fix the issue on Windows/CUDA 3.1, I recommend staying with CUDA 3.0.
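One way to sidestep the "compiled for sm_20, run on 1.2" mismatch is to embed code for several compute capabilities in a single fat binary. This is only a sketch of the nvcc invocation (the source file name is a placeholder, and the exact flag spelling should be checked against the nvcc shipped with your toolkit version):

```shell
# Build one binary carrying cubins for CC 1.0, 1.1 and 2.0, so the
# driver can pick whichever matches the installed GPU at load time.
nvcc -O2 \
     -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_11,code=sm_11 \
     -gencode arch=compute_20,code=sm_20 \
     -o mfaktc mfaktc.cu
```

The cost is a larger executable and longer compile times; the benefit is that one Windows binary would serve both the GT 220 (1.2, via the 1.1 code) and the GTX 470 (2.0).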
Hi Ethan,
[QUOTE=Ethan (EO);220970]But I think I have an off-by-one error somewhere and I'm failing half of the self-tests. Basically I reworked the stream dispatch so that we no longer wait on streams to finish: [/QUOTE]

My first guess is that you modify those h_ktabs [B]before[/B] they are completely uploaded to the GPU. If your modification solves the issue, then I would say one possibility is that those streams are not executed in the order they were issued.

My .plan for this issue:
- make sure that it is actually related to CUDA 3.1 (I hope for some reports like "works fine with CUDA 3.0 but with 3.1 the issue occurs" from some more users)
- [B]perhaps[/B] it is just a bug in Windows CUDA 3.1 and it will be fixed in a future version

Oliver
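Oliver's guess is a classic cudaMemcpyAsync hazard: the call returns immediately, so the CPU may re-sieve into an h_ktab slot while the previous upload from that same slot is still in flight. A plain-C model of the fix is to mark each host buffer busy at dispatch and only clear it once its stream reports completion (i.e. when cudaStreamQuery() returns cudaSuccess). This is a sketch, not mfaktc code; the in_flight flag stands in for the real stream state, and NUM_BUFFERS is an arbitrary illustration value.

```c
#include <assert.h>

#define NUM_BUFFERS 4

/* 1 while the buffer is attached to a pending async upload, 0 otherwise */
static int in_flight[NUM_BUFFERS];

/* Returns the index of a buffer that is safe to refill (and marks it
   busy), or -1 if every buffer still has an upload in flight. */
static int acquire_free_buffer(void)
{
  for(int i = 0; i < NUM_BUFFERS; i++)
  {
    if(!in_flight[i])
    {
      in_flight[i] = 1;
      return i;
    }
  }
  return -1;
}

/* Called once the stream that consumed buffer i has drained, e.g. when
   cudaStreamQuery() on that stream returns cudaSuccess. Only now may
   the sieve write into h_ktab[i] again. */
static void release_buffer(int i)
{
  in_flight[i] = 0;
}
```

The key point for the self-test failures: in the posted dispatch loop, `h_ktab[ktabdirtycount-1]` can be reused by the sieve on the very next round even though the cudaMemcpyAsync that reads it may not have finished, so a guard of this shape (or a per-buffer cudaEvent) would be needed before re-sieving a slot.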