For a 50M exponent, the GPU usage is ~86%.
An image with the precise data is attached. If it's the dll, well, not much can be done then. Hey, do you know why performance is measured in time per LL product (mostly in ms), rather than, say, LL products per second? Tradition? It's easy to convert, but I wonder. |
GPU Observer Gadget tells me 97% load on GTX 560 Ti for CUDALucas 1.0b and 1.2.
|
Karl, I'm not sure if this is the situation for you, but I think I've identified the cause of low GPU utilization with CUDALucas.
I noticed that on my system, even though CUDALucas hovers below 1% CPU utilization (approximately 1 second of CPU time for 5 minutes of wall time), running Prime95 alongside CUDALucas slowed CUDALucas by about 20%. Additionally, GPU utilization as reported by MSI Afterburner dropped from ~99% to ~85% when I started Prime95, even if I had fewer worker threads than physical cores.

So I took a look at the 1.2 source from apsen's package above (which is what I was working from), and I noticed an extraneous device-to-host transfer in lucas_square, at line 853 of CUDALucas.cu:

[CODE]cutilSafeCall(cudaMemcpy(c_maxerr, g_maxerr, sizeof(BIG_DOUBLE), cudaMemcpyDeviceToHost));
err = 0.0;
if (error_log)
{
    cutilSafeCall(cudaMemcpy(c_maxerr, g_maxerr, sizeof(BIG_DOUBLE)*N/512, cudaMemcpyDeviceToHost));
    for (i = 0; i < (N/512); i++)
        if (c_maxerr[i] > err)
            err = c_maxerr[i];
}
}
#ifdef _MSC_VER
free(c_maxerr);
#endif
return(err);
}[/CODE]

You can see that there's a cudaMemcpyDeviceToHost in the if(error_log) case, which is appropriate, but there's also one sitting above it that gets executed on every lucas_square call -- which is to say, on every iteration! Profiling with the Compute Visual Profiler showed that this copy only takes a few microseconds of the GPU's time, but when a memory-intensive CPU application (like Prime95) is running, the CPU has to wait a noticeable time to complete the transfer. Unless I'm missing something, c_maxerr is never referenced before the return in the !error_log path, so this copy is extraneous.

A version of CUDALucas compiled with this call removed runs at 99% GPU utilization all the time for me, and doesn't slow down when Prime95 runs! This is great, since the promise of CUDALucas lies in the fact that virtually the entire calculation can happen on the GPU, with no host transfers except at initialization, checkpointing, and completion.
I'm attaching my build below; I've built against CUDA 4.0 because on my system (GTX 470, Windows 7 x64, driver 280.19) it is about 7% faster than a 3.2 build. Karl -- can you try this out and see if you get better GPU utilization? This needs more testing, since the change is in the compute path and I'm working from apsen's base, which he indicated needs more testing.

I plan to add a Torture Test mode using the exponents and residues George uses in Prime95's Torture Test to help with this... please speak up if anyone else is going down that road so I don't duplicate work :)

I'd also like to do some really serious pruning of the code base in general -- it is quite a lot to dive into right now, in large part due to its heritage as a much more generalized application. For CUDALucas's purposes, we can probably get more development help if people aren't scared off by the code that isn't active anymore. So -- before I do that -- is apsen's archive above a reasonable place to start? And when is it time to consider some source control?

Thanks all! |
Hi, Ethan (EO).
CUDALucas is open source; you can do whatever you like. Have fun! :bounce: |
The GPU usage has indeed increased to 99%, and it was indeed a bit faster than Apsen's build (ForceWare 275.33), but on the first 10K iterations I instantly got this as a residue: 0x00000000000000c2. I think that means roundoff errors, right?
|
[QUOTE=Karl M Johnson;268177]The GPU usage has indeed increased to 99%, and it was indeed a bit faster than Apsen's build (ForceWare 275.33), but on the first 10K iterations I instantly got this as a residue: 0x00000000000000c2. I think that means roundoff errors, right?[/QUOTE]
Hi Karl -- what exponent was this? It's entirely possible that I missed a side effect of the copy I removed, so I appreciate the testing. Before posting that build, I ran 216,091 to completion as a quick test and didn't have any problem, and I'm doing a double check of a 25###### number that should be finishing up about now, so I'll let you know what I see there. In the meantime -- the build I posted should be assumed bad unless proven otherwise! Ethan |
'twas 50M.
Maybe it's the drivers causing these errors? Nvidia likes to change things silently, you know. Maybe it's because I'm on 275.33 and you're on 280.19. |
Karl -- can you check whether you see the same behavior (the bad residue) with the .exe apsen posted? Because I do see it, and the residue reported at a given iteration depends on how frequently I ask it to checkpoint (ignore the slow iteration times):
[CODE]>dir CUDALucas.cuda3.2.sm_13.WIN64.exe
07/23/2011  01:07 PM    178,176 CUDALucas.cuda3.2.sm_13.WIN64.exe

>.\CUDALucas.cuda3.2.sm_13.WIN64.exe -c 100 -t 39845887
CUDALucas: Could not find a checkpoint file to resume from
Iteration 100 M( 39845887 )C, 0x00000000000000c2, n = 4194304, CUDALucas v1.2b (0:13 real, 134.0818 ms/iter, ETA 1484:03:03)
Iteration 200 M( 39845887 )C, 0x54da87a48f8a830b, n = 4194304, CUDALucas v1.2b (0:10 real, 99.4864 ms/iter, ETA 1101:08:14)

>.\CUDALucas.cuda3.2.sm_13.WIN64.exe -c 25 -t 39845887
CUDALucas: Could not find a checkpoint file to resume from
Iteration 25 M( 39845887 )C, 0x00000000000000c2, n = 4194304, CUDALucas v1.2b (0:06 real, 225.9590 ms/iter, ETA 2500:58:46)
Iteration 50 M( 39845887 )C, 0x711c79d64566435b, n = 4194304, CUDALucas v1.2b (0:05 real, 212.7032 ms/iter, ETA 2354:15:32)
Iteration 75 M( 39845887 )C, 0x2c9ed624e25b45d0, n = 4194304, CUDALucas v1.2b (0:06 real, 214.3305 ms/iter, ETA 2372:16:09)
Iteration 100 M( 39845887 )C, 0xbdecbb426516968b, n = 4194304, CUDALucas v1.2b (0:05 real, 213.2418 ms/iter, ETA 2360:13:04)
Iteration 125 M( 39845887 )C, 0x54da87a48f8a830b, n = 4194304, CUDALucas v1.2b (0:05 real, 213.7901 ms/iter, ETA 2366:17:07)
Iteration 150 M( 39845887 )C, 0xeebce568af9b1274, n = 4194304, CUDALucas v1.2b (0:06 real, 212.3993 ms/iter, ETA 2350:53:24)
Iteration 175 M( 39845887 )C, 0xaeb573ffa26dab12, n = 4194304, CUDALucas v1.2b (0:05 real, 212.7814 ms/iter, ETA 2355:07:03)
Iteration 200 M( 39845887 )C, 0xa33e0ddda016adb1, n = 4194304, CUDALucas v1.2b (0:05 real, 212.8008 ms/iter, ETA 2355:19:53)[/CODE]

Just noticed this behavior -- it may be cosmetic. Investigating now :) |
Uhm, it (the bad residue) DOES happen with Apsen's version if you start your LL test with it.
|
Okay -- my version (no c_maxerr copy in the lucas_square loop, based on apsen's 1.2b source) successfully completed a 25xxxxxx-range double-check:
[CODE]Processing result: M( 25128553 )C, 0x7e4eb730d558a5bf, n = 2097152, CUDALucas v1.2b_eoc
LL test successfully completes double-check of M25128553
CPU credit is 24.9889 GHz-days.[/CODE]

My inclination is to say that the bad residue displays are cosmetic only, but I won't be able to look at the issue in detail for a few days, so caution is recommended for now.

George has given me the go-ahead to use the residues from the Prime95 torture test to implement a self-test for CUDALucas, and that should help with confidence in future builds! Look for it early next week.

-Ethan |
[QUOTE=Ethan (EO);268251] Just noticed this behavior -- it may be cosmetic. Investigating now :)[/QUOTE]
I seem to remember that if you checkpoint after a low number of iterations, you may be reporting data that has not been transferred from the GPU yet. So it may indeed be just cosmetic. It is also possible that you get repeated residues because the data may be transferred to the CPU side less often than it is reported. I've started removing extra code from the program, but there's still a long way to go. If you are up to setting up source control, that may be a good idea. I was going to look into the code again at some point, but I'm not sure when I'll get enough time. However, looking into this reporting issue shouldn't take long, so maybe I'll get to it in the next couple of days.