1 Attachment(s)
[QUOTE=Ethan (EO);268251] Just noticed this behavior -- it may be cosmetic. Investigating now :)[/QUOTE]
I took a quick look and replaced:
[CODE]
if ((j % output_frequency) == 0) {
    printbits(x, q, n, (q > 64L) ? 64L : q, b, c, high, low,
              version, outfp, dupfp, output_frequency, j);
}
[/CODE]
with:
[CODE]
if ((j % output_frequency) == 0) {
    cutilSafeCall(cudaMemcpy(x, g_x, sizeof(BIG_DOUBLE) * n,
                             cudaMemcpyDeviceToHost));
    printbits(x, q, n, (q > 64L) ? 64L : q, b, c, high, low,
              version, outfp, dupfp, output_frequency, j);
}
[/CODE]
and I'm not seeing the problem anymore. Please test it with the attached version. If the problem disappears, then it is purely cosmetic. I'm attaching a version compiled with CUDA 4.0, as I no longer have a 3.2 environment.
The residue is now correct, but the despicable CUDA 4.0 toolkit spat out slower code :smile:
[QUOTE=Karl M Johnson;268448]The residue is now correct, but the despicable CUDA 4.0 toolkit spat out slower code :smile:[/QUOTE]
Maybe I'll set up a 3.2 environment on a different computer... But you could still use the earlier compile, as the difference is only in the progress output...
Hey guys... Random interjection here...
How do I compile the source code in Linux? I've got it running now in Windows, but I have some things I need to do in Linux. (Ubuntu 11.04) Also, what's the most recent version? Is there a way to automatically move on to another exponent once the current one is done?
M( 51936263 )C, 0x38dd7484726e3e_, n = 4194304, CUDALucas v1.2
CUDALucas version information: $Id: MacLucasFFTW.c,v 8.1 2007/06/23 22:33:35 wedgingt Exp $ [EMAIL="wedgingt@acm.org"]wedgingt@acm.org[/EMAIL]
[QUOTE=Dubslow;268466]Hey guys... Random interjection here...
How do I compile the source code in Linux? I've got it running now in Windows, but I have some things I need to do in Linux. (Ubuntu 11.04) Also, what's the most recent version? Is there a way to automatically move on to another exponent once the current one is done?[/QUOTE] Read the instructions in README.txt.
1 Attachment(s)
I've been busy painting for several days, so haven't had much time to poke at CUDALucas, but I did whip up a testing-only build incorporating apsen's fix for the residue display bug and my elimination of an extra cudaMemcpy.
As a bonus, this build is about 10% faster on my setup (280 series drivers, GTX 470)!:
[CODE]
Iteration 1540000 M( 25038353 )C, 0x9b8f301501e89155, n = 2097152, CUDALucas v1.3alpha_eoc (0:52 real, 5.1601 ms/iter, ETA 33:40:10)

versus

Iteration 1590000 M( 25038353 )C, 0x5b07a8286cad15ba, n = 2097152, CUDALucas v1.2b-test (0:57 real, 5.7521 ms/iter, ETA 37:27:09)
[/CODE]
As a side effect, this build is _only_ for compute capability 2.0 devices, because I changed the block size for the transpose kernels from 16 to 32; I think that will overrun shared memory on sm <= 1.3, or it may fail silently and mess up your run. I've tested this build against 216091, but haven't tested it at larger FFT sizes -- so please test! :)

There's further room for improvement in the transpose kernels, and I'll look at that eventually, but I should return to the self-test implementation before fiddling around more.

Also -- in no way am I trying to centralize or control CUDALucas source development, but I have set up a bzr repository for my work on CUDALucas, and anyone else who wants to coordinate can send me their public key and I'll add them for write access.
Read-only access is available at
[CODE]bzr+ssh://bzruser@utila.eocys.com/CUDALucas/trunk[/CODE]
using the following private key:
[CODE]
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAuf0oTjwsArSJRbbUQ+mFYpDZrt+f3d0wulSb9i+XojMTnmZC
WApmcW8dkKdQVmXo7bcmRatJkHPuzTkYXMoTbASg4rjXDmKaGRrBz68InzCa76Qt
FVC7xExHm4KrdXErWGSDQ5Nn+AMjw7nt8khKeA8FBPhI2hZDTYYG75FFisaWn0bb
BPyK8OqyPO0bH4C8NpeYWxORRhdZoFY8vm8OZCkCtqFXHUuSI6sYfp4vwiswb1XB
coaRjmqWKahpP8+MFCRhOlwkPtMT/UXC47xPEu9mXBPD6I/blWtSHvSSbtZHzK33
l81VmmRyACp5II9l10AFpcw4l0Pb7qm5iTC60QIDAQABAoIBAQCtSsKuOoRrNNme
Wh5m9IMydnJM7NGgwAIx6rmyZV+sYljKQs9YBsCyumxapnpFNgkUzIxdZ55QeKSt
FKCtfB8iiyF4fe7q2VZpQ7QHlTe8U2ZZGKhk7uc0nDowHE0zTPGtF2Hyqbq6q/o7
2NZq4453VM9BdTEz+oBVECcQOlQIWy0f6t7ZC6qkAA8zBVug6ZC3P4POyXrK5J8e
0gAJfy5D8W2pe2W/p5BU0GRr/z4W1X4/P7lQ4mBEZVONg/rXS/6apMmwlUZ5GH17
V7LR+yMhNZszH9vHQ3fljLOXUrLB413fUDRQE2ER1Nk6gYPGIaJuwI3R1UEDosNS
9Q3lyFkRAoGBAPSXQbTwVpEpszA9UXG5SbY121AVV56booLc9rVOlge/7GdRh1Mi
/l/ueS9r7bWimJQkbXC70rMRCcZFJ4b4c5On55nhXwdEFctCRQWhu1a/kDGKkwUX
4kVIl4YofRZMaxYPVL3vuN1VcskLVc29oRt+nvgykGaOmRyVT4r8Qkh9AoGBAMKq
HWCoN+bOCPAZRYOEBJC8/3+qSqrE2qnmBucJwgcI/VU5GitwxNJm1HCpZQjbb32S
ZF62tlieiIbdsVnD0tvQwKc2HMP5Ml3mCGgPt1r4n99Lcl6803kbwLzOojR8+3Sa
/ddOWw51+oF0RJezUXaYZCjr1w+nckSJzVoEld/lAoGBAJm/asI+QWxGdijgoo2G
F1u+Rvn0MHu3AVSZaUtW9uAwOH5JtXMBED1lPjAc+/OtHZQhwdmXdz6weyBy4AHr
s1shtGf6Ty3WEo0OPyznGUfSauV5YilVdhpvIzBlyxt1NetL/8zVH7OhvuG5ilol
1VvfIDaMMlEFWiGpibKoF1JRAoGAGwnkALv838s4hJkOBcF9nNkTqBjwPB4RvU2d
IdRCJhYCkibXUrdcL1lnIqr0xLEuIEQIOvuoAlEq54i9jJldnXi2ecNTZYkkjNRZ
0JJ2RmWIV0y0eyJBQW4wbElLUH0XtE+e+JwCm9SZUgfjSyr2IsHyD5kKizsX7Rsy
8dD3PF0CgYB0ft0q6+S4YWuja9r2aSYFdhBJYp2rmV3ZZC2wQXkatcU968y7DqUf
0OU0NITw5RoDXKHqCIUBmGEtxkolc8nNEXo4TKkV6+aY0DEK9gKLgvpJDOVavm1O
m4HGdh0J7AWwleRgCll69WG8xxX4aIADHjGHAmVPdW9olkHev+NHjQ==
-----END RSA PRIVATE KEY-----
[/CODE]
(If you find that it isn't read-only, please let me know!! :) )

I'm doing my own work in a separate branch and pushing changes in clumps to trunk; my finer-grained commit notes are available with --include-merges (or through your favorite bzr gui).

Cheers,
Ethan (EO)
[QUOTE=Ethan (EO);268927]As a bonus, this build is about 10% faster on my setup (280 series drivers, GTX 470)!:[/QUOTE]
On this topic, the first WHQL 280 series driver build is out: [url]http://www.nvidia.com/object/win7-winvista-64bit-280.26-whql-driver.html[/url] And my gut feeling is they've improved CUDA 4.0 performance in these releases. No data to back that up :P
msft,
[CODE]
>.\cufftBench.cuda4.0.sm_20.WIN64.exe
CUFFT_Z2Z size=2048 k time=1.276797 msec
CUFFT_D2Z size=2048 k time=2.060746 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2048 k time=1.768099 msec
CUFFT_Z2Z size=2560 k time=1.873456 msec
CUFFT_D2Z size=2560 k time=2.604093 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=2560 k time=2.252314 msec
CUFFT_Z2Z size=3072 k time=2.468416 msec
CUFFT_D2Z size=3072 k time=3.350067 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3072 k time=2.900029 msec
CUFFT_Z2Z size=3584 k time=2.667795 msec
CUFFT_D2Z size=3584 k time=3.691437 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=3584 k time=3.190755 msec
CUFFT_Z2Z size=4096 k time=3.164784 msec
CUFFT_D2Z size=4096 k time=4.203641 msec
CUFFT_D2Z(CUFFT_COMPATIBILITY_NATIVE) size=4096 k time=3.683325 msec
[/CODE]
GTX 470 / Windows 7 x64 / 280.26 Driver

I think the performance is coming closer...?

Thank you,
Ethan
[QUOTE=Ethan (EO);268935]
I think the performance is coming closer...? [/QUOTE] Indeed. But I can't rewrite code.
Question about CUDALucas and mfaktc on the same card
I'm considering running CUDALucas on the same card as mfaktc, since, if only one CPU core is running mfaktc, the process is CPU-bound.
My question is, how are the relative priorities of the two sets of requests balanced? Is there some kind of priority structure that can be accessed from the CUDA toolkit? I'd like to continue to run the one CPU ragged with mfaktc and have the remaining CUDA horsepower running CUDALucas. This will be on Win64 initially, with a GTX480 card. Can you point me to production-quality binaries? (That is, I want to produce good LL tests, not test the latest changes -- which I realise will cost some productivity)