[QUOTE=kriesel;456055]Hi,
Thanks for a quick reply. I am more interested in updated Windows binaries...[/QUOTE]
What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
[QUOTE=LaurV;456224]does it get any faster? Otherwise we are happy with the current version we use, and we don't want to fix it (i.e. upgrade) as long as it works... :razz:[/QUOTE]
Only faster if CUDA 8 is faster on your system :razz: |
re Which version to update first? (or at all?)
1 Attachment(s)
[QUOTE=flashjh;456204]Hello all,
From what I can see the updates from post [URL="http://mersenneforum.org/showpost.php?p=425734&postcount=2464"]2464[/URL] were not incorporated on sourceforge, so they're not in any of the code now. It's been over a year since all that discussion went on; are there still issues with residues? Either way, I have the code updated, along with some miscellaneous changes. The biggest change is that CUDA 8 does not support 32 bit for CUDALucas, nor compute < 2.0. What versions is everyone using now? I don't mind getting set up to compile versions <8, but I don't want to do it if no one is using them anymore. So, let me know what architecture everyone needs and I'll make it happen :smile:[/QUOTE]
I'm curious what those miscellaneous code changes might be, or rather what feature changes they implement. There are still issues with bad intermediate residues. That cost me days of run time early on.
I looked some more at ATH's benchmarks varying CUDA version and fft length (post 2535, page 231). See the updated attachment. Granted, it's risky comparing for slight differences, but I'm using the data that's available. It should be pretty good since he ran 20 iterations. V8 has most of the slowest timings in that table, and very few of the fastest timings. V6.5 has the largest number of fastest timings, the most in the top 1% for an fft length, and none of the slowest. That holds for that card and that system; the driver version is unknown, or at least not in my notes.
ATH's benchmarks are a comparison among 64-bit builds. Benchmarks reported in post 2534 showed a speed advantage for 32-bit, for driver v 373. That would put V8 at an additional speed disadvantage. That was for a smallish Mersenne prime ~2.97M, so some compact fft length. That test indicated V8 was a little slower than V4.2. ATH found V4.2 too slow to include in his.
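For anyone who wants to repeat the "who is fastest, who is slowest" tally on their own benchmark table, here is a minimal sketch. It assumes the timings have already been collected into a dictionary keyed by fft length and CUDA version; the function name and the example numbers are placeholders for illustration, not ATH's data.

```python
from collections import Counter

def tally_extremes(timings):
    """timings: {fft_length: {cuda_version: ms_per_iteration}}.

    Returns two Counters keyed by CUDA version: how often each version
    was the fastest, and how often it was the slowest, per fft length.
    """
    fastest, slowest = Counter(), Counter()
    for by_version in timings.values():
        if not by_version:
            continue  # no data for this fft length
        fastest[min(by_version, key=by_version.get)] += 1
        slowest[max(by_version, key=by_version.get)] += 1
    return fastest, slowest

# Placeholder numbers for illustration only, NOT measured data.
example = {
    2048: {"5.5": 3.20, "6.5": 3.10, "8.0": 3.25},
    4096: {"5.5": 6.45, "6.5": 6.40, "8.0": 6.50},
    8192: {"5.5": 13.1, "6.5": 13.3, "8.0": 13.4},
}
fastest, slowest = tally_extremes(example)
```

Small differences between versions can easily fall within run-to-run noise, so a tally like this is only a rough guide.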
My faster cards (maybe I should say less-slow) are compute capability 2.0, which you state V8 does not support, but the driver timeout issue does P) They also appear to be about to drop off the bottom of the list of products supported as NVIDIA continues to add new card support. So other things being equal, which they never are, I think 32-bit V6.5 would be a pretty good candidate.
I'm in the process of doing benchmark timings for my old cards, versus version, analogous to what ATH has done, but got distracted by some hardware issues on other systems. Also the driver timeout issue is derailing my benchmarking on one card type, so I'm preparing to downgrade considerably on driver version and start that one over. The driver timeout issue seems to be getting worse as I step up in version on that card. Or maybe it's a time trend. It is not temperature.
I'm thinking of doing benchmark timing versus driver level for my cards too. Has anyone reported or seen a noticeable effect of that? Thanks! |
[QUOTE=flashjh;456238]What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
[/QUOTE] Sounds great. Quadro 2000 and GeForce GTX480 for now. More follows. |
1 Attachment(s)
[QUOTE=flashjh;456238]What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
[/QUOTE] I'm running Quadro 2000 and GeForce GTX480 currently, contemplating adding some other models later. Running threads 1024 on either reliably causes bad residues on them for current exponents, and for some but not the initial -r checks. This left me with the question, how to know which thread counts or other parameters are reasonable for a particular GPU type. Is there a better way than simply testing, to determine good parameters? Or are both of these GPUs defective somehow? And perhaps threadbench could be modified, to check for and flag occurrence of pathological values, after completing its individual timing loops, and then exclude the flagged ones from selection as the optimal. |
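The sanity check suggested above could look roughly like the sketch below. It is not how threadbench works today; the function name, the input layout, and the factor-of-two threshold are assumptions chosen for illustration. The idea is to flag any thread count whose timing is implausibly faster than the next best one, and to leave it out when picking the optimum.

```python
def pick_threads(timings, ratio=2.0):
    """timings: {thread_count: ms_per_iteration} for one fft length.

    Repeatedly flags the fastest entry if it is more than `ratio` times
    faster than the runner-up, then returns (best_thread_count, flagged).
    The threshold is a heuristic assumption, not a measured limit.
    """
    remaining = dict(timings)
    flagged = []
    while len(remaining) >= 2:
        ordered = sorted(remaining, key=remaining.get)
        best, runner_up = ordered[0], ordered[1]
        if remaining[best] * ratio < remaining[runner_up]:
            flagged.append(best)   # suspiciously fast: exclude it
            del remaining[best]
        else:
            break
    return min(remaining, key=remaining.get), flagged

# Placeholder timings for illustration only, NOT measured data.
example = {32: 5.0, 64: 4.2, 128: 4.0, 256: 4.1, 512: 4.4, 1024: 1.5}
best, flagged = pick_threads(example)
```

A timing test alone cannot prove a setting is good. Checking the residue of a short run against a known value for each candidate would be a stronger test.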
ambitious fftlength crash on Quadro 2000
[QUOTE=flashjh;456238]What video card are you running CUDALucas on? I can build you an updated version for now and then work to incorporate more changes and update all binaries soon.
[/QUOTE] On a Quadro 2000, testing the limits of fft length crashes the program, without completing any benchmarks as far as I can tell. This is repeatable on the only Quadro 2000 I've run it on so far.
$ cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>bigfftbench.txt
CUDALucas.cu(1055) : cudaSafeCall() Runtime API error 2: out of memory.
On a GeForce GTX480, it will run many fft lengths and output timings to stdout, then terminate before reaching 65536 or producing the fft lengths file. At least on mine. Scaling the maximum back to the length it reached on stdout then produces a file. The Quadro 2000 has 1 GB of VRAM; the GTX480 has 1.5 GB. |
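A rough back-of-the-envelope check shows why the largest lengths cannot fit on these cards. The sketch below assumes the -cufftbench arguments are in units of 1024 elements and that each element is an 8-byte double. The number of working buffers is a hypothetical parameter, since the real footprint (including cuFFT workspace) is not stated in this thread.

```python
def buffer_mib(fft_k, bytes_per_element=8):
    """Size in MiB of one array of fft_k * 1024 elements."""
    return fft_k * 1024 * bytes_per_element / 2**20

def largest_fft_k(vram_mib, buffers=3, bytes_per_element=8):
    """Largest fft length (in K) whose estimated footprint fits in VRAM.

    `buffers` is an assumed count of same-sized arrays; adjust it to
    match whatever the program actually allocates.
    """
    per_k_bytes = 1024 * bytes_per_element * buffers
    return (vram_mib * 2**20) // per_k_bytes

one_array = buffer_mib(65536)          # a single array at 65536K
limit_1gb = largest_fft_k(1024)        # 1 GB card, 3 assumed buffers
```

Under these assumptions a single array at 65536K already takes 512 MiB, so a few such buffers exceed both 1 GB and 1.5 GB. A pre-check of this kind could cap the benchmark range instead of crashing.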
Anomalous 1024-square-threads timings examples
2 Attachment(s)
Examples of bad 1024-thread timings. This occurs on both the Quadro 2000 (compute capability 2.1) and the GeForce GTX480 (CC 2.0): the 1024-thread runs produce the minimal timings, get selected, and then produce bad residues such as a repeating 0xfffffffffffffffd. Above 1024k fft length, the 1024-thread timings are more than a factor of two faster than for any other thread number. (If I recall correctly, at very short fft lengths the difference disappears, and at large fft lengths it becomes an even more dramatic difference. So for currently useful lengths for first time testing or double checking, a bad 1024-thread timing could easily be screened for by a modified program.)
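One possible reading of the repeating 0xfffffffffffffffd residue, offered as an assumption rather than a diagnosis: if the squaring step silently returns zero (for example, a kernel that fails to launch at 1024 threads), every Lucas-Lehmer iteration computes 0 - 2, which is 2^p - 3 modulo 2^p - 1, and the low 64 bits of that value are 0xfffffffffffffffd. That would also explain the unrealistically fast timings. The sketch below uses plain Python integers to show the arithmetic; the function names and the list of suspect values are hypothetical.

```python
def ll_step(s, p, square=lambda x: x * x):
    """One Lucas-Lehmer iteration: s -> s^2 - 2 (mod 2^p - 1)."""
    m = (1 << p) - 1
    return (square(s) - 2) % m

def res64(s):
    """Low 64 bits of the residue, as 16 hex digits."""
    return format(s & 0xFFFFFFFFFFFFFFFF, "016x")

# Intermediate residues worth flagging: 0 and 2 were reported elsewhere
# in this thread; ...fffd is the value discussed here.
SUSPECT = {"0000000000000000", "0000000000000002", "fffffffffffffffd"}

# Sanity check of the correct iteration: M7 = 127 is prime,
# so p - 2 = 5 steps starting from 4 must end at 0.
s = 4
for _ in range(5):
    s = ll_step(s, 7)
good_final = s

# Simulated failure: the squaring always returns 0.
p = 127
bad = 4
for _ in range(3):
    bad = ll_step(bad, p, square=lambda x: 0)
bad_res = res64(bad)
```

A program could compare each interim residue against a small set like SUSPECT and stop with a warning instead of continuing a run that is already lost.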
|
Some threadbench combinations timed twice, minimum missed
1 Attachment(s)
threadbench could be accelerated a bit by benchmarking the squaring and slice combinations once, rather than timing one combination per fft length twice.
When a combination is timed twice, the second run is sometimes slower than the first and apparently replaces the first timing. The difference can be large enough that the parameters with the minimum time are not selected for storage as the combination to use. Individual time savings are small. Minimum timing per fft length is marked with an *. |
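The selection logic I have in mind is sketched below: time each distinct combination once, and always keep the minimum. The function name, the tuple layout of a combination, and the fake timer are assumptions for illustration and do not reflect the actual threadbench code.

```python
def best_combination(combos, time_fn):
    """Time each distinct combination once and return (combo, ms) for the fastest.

    combos: iterable of hashable parameter tuples (may contain repeats).
    time_fn: callable that benchmarks one combination and returns ms.
    """
    timed = {}
    for combo in combos:
        if combo not in timed:          # skip combinations already timed
            timed[combo] = time_fn(combo)
    return min(timed.items(), key=lambda kv: kv[1])

# Fake timer for illustration: a second run of (256, 32) would be slower.
calls = []
def fake_timer(combo):
    calls.append(combo)
    table = {(256, 32): [2.0, 2.3], (128, 32): [2.1, 2.1]}
    return table[combo][calls.count(combo) - 1]

combos = [(256, 32), (128, 32), (256, 32)]   # (256, 32) is listed twice
best = best_combination(combos, fake_timer)
```

If repeated runs are wanted for noise reduction, keeping the smaller of the two timings instead of the later one would fix the selection problem without changing the run time.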
For now, what version do you want compiled for your tests with the new code?
|
Next build
[QUOTE=flashjh;456462]For now, what version do you want compiled for your tests with the new code?[/QUOTE]
I think 32-bit V6.5 would be a pretty good candidate. |
CUDALucas 2.06beta
I incorporated the code changes listed and made some modifications to the supporting code -- no changes to any of the math. I'll upload the code tomorrow; it's getting late.
2.06beta is [URL="https://sourceforge.net/projects/cudalucas/files/2.06Beta/"]here[/URL]
Lib files are [URL="https://sourceforge.net/projects/cudalucas/files/CUDA%20Libs/"]here[/URL], if you need them.
I have a 1050ti and was able to test all versions. I changed the way it compiles because when I used the old way, it would not run any version on my 1050ti except for CUDA 8. When I switched to a 940M I was able to get >=6.0 to run. Any version below the version that worked on my cards *always* caused 0 or 2 results during self-test.
NOTE: You might see a small delay on the first startup of each CUDA version now, due to JIT, but only if it doesn't find code for your GPU.
So, now it's working but I need a lot of testing done. Anyone who was having issues with the bad residues before, please test these versions and let me know if you're able to make it give you bad results. Everyone, let me know what you find that needs to be fixed and what you would like changed. ~Cheers |
feature request
[QUOTE=flashjh;456872]I incorporated the code changes listed and made some modifications to the supporting code -- no changes to any of the math. I'll upload the code tomorrow; it's getting late.
2.06beta is [URL="https://sourceforge.net/projects/cudalucas/files/2.06Beta/"]here[/URL]
Lib files are [URL="https://sourceforge.net/projects/cudalucas/files/CUDA%20Libs/"]here[/URL], if you need them.
I have a 1050ti and was able to test all versions. I changed the way it compiles because when I used the old way, it would not run any version on my 1050ti except for CUDA 8. When I switched to a 940M I was able to get >=6.0 to run. Any version below the version that worked on my cards *always* caused 0 or 2 results during self-test.
NOTE: You might see a small delay on the first startup of each CUDA version now, due to JIT, but only if it doesn't find code for your GPU.
So, now it's working but I need a lot of testing done. Anyone who was having issues with the bad residues before, please test these versions and let me know if you're able to make it give you bad results. Everyone, let me know what you find that needs to be fixed and what you would like changed. ~Cheers[/QUOTE]
For benchmarking, and more generally for keeping the output of various versions and systems straight, it would be helpful if, in addition to GPU card information, the header output of -info, -fftbench, -threadbench, the GPU fft length and threads output files, and general program operation included the following items:
1) CUDALucas version
2) CUDA level
3) if possible, NVIDIA driver version
4) Operating system
5) 32 or 64 bit
6) system name
Incorporate thread benchmarking sanity checks: check for and flag occurrence of pathological values after completing the individual timing loops, and then exclude the flagged ones from selection as the optimal.
Add a runtime estimate column for maximum exponent per fft length to fft.txt.
Add checks that the card at least meets the compute capability required, and that the driver supports the CUDA level that CUDALucas was compiled for.
In prime95, each results line is tagged with the program version ID in just a few characters (for example, We4:). I'd like to see something like that added to CUDALucas too. (Maybe at the far right in case someone has a program that parses or pattern matches the results lines.) |
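To make the header request concrete, here is a small sketch of the kind of line I mean. It only illustrates the layout: the field names, the tag text, and both function names are made up, and the OS, bitness, and host values come from the Python interpreter running the sketch, not from CUDALucas itself.

```python
import platform

def run_header(cudalucas_version, cuda_level, driver_version=None):
    """Build a one-line header identifying program and system.

    The version arguments are supplied by the caller; the driver version
    is optional because it may not always be obtainable.
    """
    fields = [
        ("CUDALucas", cudalucas_version),
        ("CUDA", cuda_level),
        ("driver", driver_version or "unknown"),
        ("OS", f"{platform.system()} {platform.release()}"),
        ("bits", platform.architecture()[0]),   # bitness of this interpreter
        ("host", platform.node() or "unknown"),
    ]
    return ", ".join(f"{name}={value}" for name, value in fields)

def tag_result_line(line, tag):
    """Append a short version tag at the far right of a results line."""
    return f"{line.rstrip()} {tag}"

header = run_header("2.06beta", "6.5")
tagged = tag_result_line("example results line\n", "CL206")  # hypothetical tag
```

Putting the tag at the end of the line keeps the existing fields in place, so scripts that match on the start of a results line should be unaffected.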