I can do that if someone provides me with CUDALucas 1.2 64 bit binaries, compiled for 3.2 TK and 32 bit binaries, compiled for 4.0 TK, for Windows.
|
[QUOTE=Karl M Johnson;266367]I can do that if someone provides me with CUDALucas 1.2 64 bit binaries, compiled for 3.2 TK and 32 bit binaries, compiled for 4.0 TK, for Windows.[/QUOTE]
I could compile all of them tonight unless someone beats me to it. |
Good, good!
I look forward to this! If you don't mind, please use the [B]-arch=sm_20[/B] flag instead of [B]-arch=sm_13[/B]. I never lose hope, even though most apps did not benefit much from being compiled specifically for sm_20 Fermi GPUs. |
1 Attachment(s)
[QUOTE=Karl M Johnson;266381]Good, good!
I look forward to this ! If you dont mind, please use [B]-arch=sm_20[/B] flag instead of [B]-arch=sm_13[/B] ..[/QUOTE] Here they are. :redface: Sorry for the archive inside an archive, but zip cannot compress multiple files together and this board does not allow rars, so it's a rar inside a zip. This way it's only one small file. |
[QUOTE=apsen;266446]but zip cannot compress multiple files together[/QUOTE]
OT: Since when? :glare: |
'tis ok, though I personally prefer 7z/.tar.lzma.
Before you read the results, here's how I did all these benchmarks:
Launched CUDALucas against a 50M exponent. Waited till it printed the first 10K iterations, then I started measuring time, using a stopwatch :smile: Stopped when it printed 60K iterations. A little bit of math then.

Additional notes:
Windows 7 SP1 x64
Q6600 CPU @ 3.0 GHz
GTX 480 @ 800 MHz
ForceWare 275.33

Results:
[CODE]CUDALucas.cuda3.2.sm_20.WIN32 = 8.0728 ms/iter
CUDALucas.cuda3.2.sm_20.WIN64 = 8.0726 ms/iter
CUDALucas.cuda4.0.sm_20.WIN32 = 8.3190 ms/iter
CUDALucas.cuda4.0.sm_20.WIN64 = 8.7344 ms/iter
CUDALucas.cuda3.2.sm_13.WIN32 = 8.0570 ms/iter
CUDALucas.cuda3.2.sm_13.WIN64 = 8.0560 ms/iter
CUDALucas.cuda4.0.sm_13.WIN32 = 8.3466 ms/iter
CUDALucas.cuda4.0.sm_13.WIN64 = 8.7036 ms/iter[/CODE]

Conclusions:
-arch=sm_xx flag has little effect on performance, sm_13 seems best for Fermi, not sm_20.
CUDA 3.2 toolkit seems to be the best in terms of performance.
CUDA 4.0 toolkit creates 64 bit binaries, which are slower than their 32 bit variants (at least for Windows). |
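Editor's sketch: the "little bit of math" in the post above is presumably elapsed stopwatch time divided by the 50K iterations between the 10K and 60K marks. The Python below reproduces that arithmetic and compares the reported figures. The function names and the 403.64 s stopwatch reading are hypothetical, chosen only so the example lands on one of the posted numbers. The timings in the table are copied from the post.

```python
# Timings reported in the post (ms per iteration), keyed by
# (toolkit version, -arch value, bitness).
RESULTS = {
    ("3.2", "sm_20", 32): 8.0728,
    ("3.2", "sm_20", 64): 8.0726,
    ("4.0", "sm_20", 32): 8.3190,
    ("4.0", "sm_20", 64): 8.7344,
    ("3.2", "sm_13", 32): 8.0570,
    ("3.2", "sm_13", 64): 8.0560,
    ("4.0", "sm_13", 32): 8.3466,
    ("4.0", "sm_13", 64): 8.7036,
}


def ms_per_iter(elapsed_seconds, start_iter=10_000, stop_iter=60_000):
    """Convert a stopwatch reading into milliseconds per iteration."""
    return elapsed_seconds * 1000.0 / (stop_iter - start_iter)


def slowdown_percent(key, baseline_key):
    """How much slower `key` is than `baseline_key`, in percent."""
    return (RESULTS[key] / RESULTS[baseline_key] - 1.0) * 100.0


# Build with the lowest ms/iter in the table.
fastest = min(RESULTS, key=RESULTS.get)

if __name__ == "__main__":
    # Hypothetical stopwatch reading, for illustration only.
    print(f"{ms_per_iter(403.64):.4f} ms/iter")
    print("fastest build:", fastest)
    for key in sorted(RESULTS):
        print(key, f"{slowdown_percent(key, fastest):+.2f}%")
```

A hand-operated stopwatch probably carries a fraction of a second of error over roughly 400 seconds, so differences in the third or fourth decimal place (such as WIN32 vs WIN64 under CUDA 3.2) are likely within measurement noise.
|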
[QUOTE=axn;266450]OT: Since when? :glare:[/QUOTE]
...cannot compress multiple files together [COLOR="DarkRed"]as if they were one file (it's called a solid archive in rar)[/COLOR]. So it misses an opportunity to capitalize on a lot of redundancy across the multiple files. Just try to compress the content of the above with zip directly: there's about a threefold difference in size. If some zip tool can do it, I'd like to know which one. |
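Editor's sketch: the solid-archive point above can be illustrated in Python. ZIP compresses each member independently, while a solid archive compresses the members as one stream, so redundancy shared between files is paid for only once. The example uses the standard `lzma` module as a stand-in for a solid compressor. It is not how rar or any zip tool is implemented, and the synthetic "files" (random data differing in a single byte) are an assumption made to keep the effect obvious.

```python
import lzma
import random


def make_similar_files(count=8, size=4096, seed=0):
    """Build `count` byte strings that are identical except for one byte each."""
    rng = random.Random(seed)
    base = bytes(rng.getrandbits(8) for _ in range(size))
    files = []
    for i in range(count):
        data = bytearray(base)
        data[i] ^= 0xFF  # flip one byte so every file is slightly different
        files.append(bytes(data))
    return files


def per_file_size(files):
    """Total size when every file is compressed on its own (zip-like)."""
    return sum(len(lzma.compress(f)) for f in files)


def solid_size(files):
    """Size when all files are compressed as one stream (solid-like)."""
    return len(lzma.compress(b"".join(files)))


files = make_similar_files()
separate = per_file_size(files)
solid = solid_size(files)

if __name__ == "__main__":
    print(f"per-file: {separate} bytes, solid: {solid} bytes")
```

Real binaries built from the same source share far less than these synthetic files do, so the gain on the attachment discussed here would be smaller than in this toy case.
|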
[QUOTE=Karl M Johnson;266455]
-arch=sm_xx flag has little effect on performance...[/QUOTE] BTW CUDALucas compiles fine with sm_10 or sm_11. Is there a problem with that, or was sm_13 just a random pick? If sm_10/11 is fine, would you, Karl, care to test them too? |
Thanks for compiling and testing the various binaries.
|
sm_13 is required for double precision support.
My guess is that it may compile, but it will not work. |
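Editor's sketch: some background on why double precision matters for this kind of program. FFT-based big-number multiplication, which Lucas-Lehmer programs of this type rely on, produces floating-point values that must be rounded back to exact integers, so the width of the mantissa limits how large those integers can be. The Python below only shows where single and double precision stop representing integers exactly. It says nothing about what an sm_10 build actually does on a given GPU, and the helper name `to_float32` is made up for the example.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


n32 = 2**24 + 1  # first positive integer a single cannot hold exactly
n64 = 2**53 + 1  # first positive integer a double cannot hold exactly

single_ok = to_float32(float(n32)) == n32   # lost in single precision
double_ok = float(n32) == n32               # still exact in double precision
double_limit_ok = float(n64) == n64         # lost even in double precision

if __name__ == "__main__":
    print("2**24 + 1 exact in single:", single_ok)
    print("2**24 + 1 exact in double:", double_ok)
    print("2**53 + 1 exact in double:", double_limit_ok)
```
|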
[QUOTE=Karl M Johnson;266492]sm_13 is required for double precision support.
My guess that it may compile, but it will not work.[/QUOTE] sm_10 produces exactly the same output as sm_13. :unsure: (It also seems to be faster, but I have other stuff running, so that might not really mean anything.) Does anyone know if double precision is actually necessary and, if so, how to make it show? |
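Editor's sketch: one way to "make it show" might be to run each build on exponents whose outcome is already known and compare the verdicts, since lost precision should eventually produce a wrong result. The plain-integer Lucas-Lehmer test below is a reference for that idea, not CUDALucas's code. The small exponents are for illustration only, and whether CUDALucas accepts exponents this small is not something the thread confirms.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # exact integer arithmetic, no rounding involved
    return s == 0


# Odd prime exponents up to 31; the test picks out the Mersenne primes.
CANDIDATES = (3, 5, 7, 11, 13, 17, 19, 23, 29, 31)
mersenne_exponents = [p for p in CANDIDATES if lucas_lehmer(p)]

if __name__ == "__main__":
    print(mersenne_exponents)
```

A build that reports a known Mersenne prime as composite, or disagrees with a trusted build on the final residue of some exponent, would be evidence that something (possibly precision) is going wrong.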