[QUOTE=flashjh;331496]The most recent CUDALucas.cu file is giving me a problem compiling:
[CODE]Line 703: struct stat FileAttrib;
Line 717: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the checkpoint file. Trying the backup file.\n");
Line 737: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the backup file. Restarting test.\n");[/CODE]
Compiler output:
[CODE]C:\CUDA\src>make -f makefile.win
makefile.win:12: Extraneous text after `else' directive
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/bin/nvcc" -c CUDALucas.cu -o CUDALucas.cuda5.0.sm_35.x64.obj -m64 --ptxas-options=-v -ccbin="C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64" -Xcompiler /EHsc,/W3,/nologo,/Ox,/Oy,/GL -arch=sm_35 -O3 CUDALucas.cu
CUDALucas.cu(703): error: incomplete type is not allowed
CUDALucas.cu(717): error: incomplete type is not allowed
CUDALucas.cu(737): error: incomplete type is not allowed[/CODE]
If you let me know what you're trying to do with the struct maybe I can figure out how to make MSVS happy.[/QUOTE]
Try changing stat to _stat and see if it compiles.
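For reference, a minimal sketch of how those stat calls could be made portable across MSVC and gcc in one place (the wrapper names here are my own invention, not what CUDALucas actually uses):

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical portability shim: MSVC deprecates the POSIX names and
   expects _stat / struct _stat instead. Mapping both the struct and the
   function through one pair of names lets the same call sites compile
   on both toolchains. */
#ifdef _MSC_VER
typedef struct _stat file_stat_t;
#define file_stat _stat
#else
typedef struct stat file_stat_t;
#define file_stat stat
#endif

/* Returns nonzero if the checkpoint file exists (stat succeeded). */
static int checkpoint_exists(const char *path)
{
    file_stat_t attrib;
    return file_stat(path, &attrib) == 0;
}
```

With this in place, lines 703/717/737 would all go through `file_stat_t` / `file_stat`, and the "incomplete type" error should disappear on MSVC.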
I'll try tonight.
Regarding FFT timings, I have brief access to a Tesla K20m. I ran a benchmark starting at 1440k and going up in 16k increments; here are the results. As usual, lengths that ran slower than a longer FFT have been deleted. I'm surprised how relatively short this table is.
[CODE]FFT length (k)   ms/iteration
1568             0.508496
1600             0.596174
2000             0.645019
2048             0.655126
2592             0.820283
3136             1.123238
4000             1.256788
4096             1.304601
4320             1.804463
4608             1.876166
4704             1.910958
5120             2.120896
5488             2.136009
5600             2.270577
6000             2.438436
6048             2.448022
6144             2.480506
6272             2.526666
7776             2.620803[/CODE]
For the most recent prime, actually using 4000k rather than 3200k results in a slightly longer ETA, presumably because the multiplication and normalization run on the longer array.
[CODE]./CUDALucas -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0933 ms/iter, ETA 65:47:57)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0613 ms/iter, ETA 65:16:29)

./CUDALucas -f 4000k -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 4000K, CUDALucas v2.04 Beta err = 0.0009 (0:42 real, 4.2419 ms/iter, ETA 68:11:21)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 4000K, CUDALucas v2.04 Beta err = 0.0010 (0:42 real, 4.2032 ms/iter, ETA 67:33:15)[/CODE]
Just to be sure...
I'm working fine with CUDALucas v2.01 under 64-bit Linux: the program automatically recognizes errors in the FFT computation and rolls back to a safer size, even if it is not the most efficient one.
1 - Is v2.04 available for Linux?
2 - Is it more reliable?
3 - Is it faster?
4 - Does it automagically choose the fastest FFT size?
Thank you for the info.
Luigi
2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.
None are safer and more reliable; the main differences from 2.01 to 2.04 are interface features (i.e. worktodo in version 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually get an extra 5-10% performance boost.
[QUOTE=Dubslow;331713]2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.
None are safer and more reliable; the main differences from 2.01 to 2.04 are interface features (i.e. worktodo in version 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually get an extra 5-10% performance boost.[/QUOTE] Thank you Dubslow, that's exactly what I supposed... I'm waiting at the window, looking at the development, helping with tests if needed, but still keeping my old plain-vanilla v2.01 :smile: Luigi
Efficiency formula
[QUOTE=aaronhaviland;295742]Absolutely (from the CUFFT documentation):
I've been testing CUFFT timings for lengths other than just multiples of 32768. I've excluded the timings because they're not run exactly as CUDALucas would run them, but the fact that they are "optimal lengths" should still apply. Eff% is calculated similarly to the prior examples here, but scaled so that the results all fall within the range 0-100. Very few lengths have an Eff% between 15% and 75%; the majority of inefficient lengths ran around 9-10%. These have all been excluded. Some of the 70-80% efficient run-lengths have also been excluded because they are smaller than a larger+faster length. [COLOR=Blue]Note the entries in blue, which would be skipped over if only looking at multiples of 32768[/COLOR]:
[CODE]   FFT            Exponent of
   Size    Eff%    2  3  5  7
=============================
1048576   97.23   20  0  0  0
[COLOR=Blue]1105920   88.82   13  3  1  0[/COLOR]
1179648   91.20   17  2  0  0
[COLOR=Blue]1204224   82.49   13  1  0  2[/COLOR]
1310720   89.06   18  0  1  0
[COLOR=Blue]1327104   90.86   14  4  0  0[/COLOR]
1376256   85.13   16  1  0  1
1474560   89.14   15  2  1  0
[COLOR=Blue]1548288   89.05   13  3  0  1[/COLOR]
1572864   89.23   19  1  0  0
1605632   88.84   15  0  0  2
1769472   92.58   16  3  0  0
1835008   89.17   18  0  0  1
2097152   95.87   21  0  0  0
[COLOR=Blue]2211840   87.81   14  3  1  0[/COLOR]
2359296   89.84   18  2  0  0
[COLOR=Blue]2370816   80.62    8  3  0  3[/COLOR]
[COLOR=Blue]2408448   81.08   14  1  0  2[/COLOR]
2621440   87.60   19  0  1  0
2654208   85.52   15  4  0  0
[COLOR=Blue]2709504   82.21   11  3  0  2[/COLOR]
[COLOR=Blue]2809856   82.38   13  0  0  3[/COLOR]
[COLOR=Blue]2985984   87.28   12  6  0  0[/COLOR]
[COLOR=Blue]3096576   85.87   14  3  0  1[/COLOR]
3145728   85.74   20  1  0  0
3211264   82.12   16  0  0  2
[COLOR=Blue]3317760   82.69   13  4  1  0[/COLOR]
[COLOR=Blue]3359232   74.71    9  8  0  0[/COLOR]
[COLOR=Blue]3386880   71.31    9  3  1  2[/COLOR]
3932160   80.93   18  1  1  0
[COLOR=Blue]4014080   80.65   14  0  1  2[/COLOR]
4096000   73.66   15  0  3  0
4194304   95.87   22  0  0  0
4423680   87.81   15  3  1  0
4718592   89.84   19  2  0  0
[COLOR=Blue]4741632   80.62    9  3  0  3[/COLOR]
4816896   81.08   15  1  0  2
5242880   87.60   20  0  1  0
5308416   85.52   16  4  0  0
[COLOR=Blue]5419008   82.21   12  3  0  2[/COLOR]
[COLOR=Blue]5619712   82.38   14  0  0  3[/COLOR]
[COLOR=Blue]5971968   87.28   13  6  0  0[/COLOR]
6193152   85.87   15  3  0  1
6291456   85.74   21  1  0  0
6422528   82.12   17  0  0  2
[COLOR=Blue]6635520   82.69   14  4  1  0[/COLOR]
[COLOR=Blue]6718464   74.71   10  8  0  0[/COLOR]
[COLOR=Blue]6773760   71.31   10  3  1  2[/COLOR]
7864320   80.93   19  1  1  0
8028160   80.65   15  0  1  2
8192000   73.66   16  0  3  0[/CODE][/QUOTE]
[QUOTE]For sizes handled by the Cooley-Tukey code path (that is, representable as 2^a * 3^b * 5^c * 7^d), the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).
[LIST][*] [I]Restrict the size along all dimensions to be representable as 2^a * 3^b * 5^c * 7^d.[/I] The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.[*] [I]Restrict the size along each dimension to use fewer distinct prime factors.[/I] For example, a transform of size 3^n will usually be faster than one of size 2^i * 3^j even if the latter is slightly smaller.[*] [I]Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms.[/I] This further aids with memory coalescing.[*] [I]Restrict the x dimension of single-precision transforms to be strictly a power of two, either between 2 and 8192 for Fermi-class GPUs or between 2 and 2048 for earlier architectures.[/I] These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.[*] [I]Use Native compatibility mode for in-place complex-to-real or real-to-complex transforms.[/I] This scheme reduces the write/read of padding bytes, hence helping with coalescing of the data.[/LIST][/QUOTE]I'd like a formula that computes an efficiency score between 0 and 100% for a given FFT length, like Aaron did. Any suggestions? Aaron, can you post yours?
Thoughts:
- measure FFT-length runtimes and fit the formula to match ms/iter
- theoretical GFLOPS throughput as given by Nvidia for radix-2 to radix-7
- theoretical model with weighted scores for radix-2 to radix-7 and a penalty for more distinct prime factors <-- my try, but the scores still need to be analysed
- ...
Suggestions? Or has this already been invented?
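As a sketch of the third idea (weighted radix scores plus a distinct-prime penalty), something like the following could be a starting point. The weights and the 5% penalty here are arbitrary assumptions of mine, not measured values, so the scores would still need to be fitted against real timings:

```c
#include <stdio.h>

/* Factor n over {2,3,5,7}, filling exp[] with the exponents.
   Returns 1 iff n is 7-smooth, i.e. handled by CUFFT's
   Cooley-Tukey code path at all. */
static int factor_2357(long n, int exp[4])
{
    static const int primes[4] = { 2, 3, 5, 7 };
    for (int i = 0; i < 4; i++) {
        exp[i] = 0;
        while (n % primes[i] == 0) { n /= primes[i]; exp[i]++; }
    }
    return n == 1;
}

/* Toy score in [0,100]: favour radix-2 work, then penalise each
   additional distinct prime factor by an assumed 5%. */
static double toy_score(long n)
{
    int e[4];
    if (!factor_2357(n, e))
        return 0.0;                     /* not a CUFFT-friendly size */
    static const double weight[4] = { 1.00, 0.90, 0.80, 0.70 };
    double s = 0.0;
    int total = 0, distinct = 0;
    for (int i = 0; i < 4; i++) {
        s += weight[i] * e[i];
        total += e[i];
        if (e[i]) distinct++;
    }
    return 100.0 * (s / total) * (1.0 - 0.05 * (distinct - 1));
}
```

A pure power of two such as 1048576 = 2^20 scores 100 under this model, matching the intuition (and Aaron's table) that those lengths are the most efficient.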
[QUOTE=Brain;333237]
Suggestions? Or has this been invented yet?[/QUOTE] I think aaronhaviland did this with the cufftbench option. I will do the same and normalize the timings by FFT length / time. The raw data I will use can be found in [URL="http://www.mersenneforum.org/showpost.php?p=333372&postcount=234"]this post in the Titan thread[/URL].
Might there be a bug in CUDALucas, or do I have bad hardware?
Hey all. Looking for some guidance from the Gurus...
I offered to help alpha test owftheevil's new GPU P-1 program on my 2GB GTX 560, and very quickly it started reporting round-off errors. owftheevil asked if I had run the CUDALucas self-test, which I had to admit I hadn't. After receiving a couple of different versions of the source from owftheevil, and also downloading the code from GitHub, it rarely passed the self-test (only once out of about ten different runs).

Concerned that my mfaktc work might be bad, I ran its deep self-test: 100% success. (I know, of course, that the two programs work in very different ways.) I then reran the memory test program I used when I first bought the 560 -- [url]http://wili.cc/blog/gpu-burn.html[/url] -- no errors. I downloaded and compiled the open-source version of memtestG80 -- [url]https://github.com/ihaque/memtestG80[/url] -- after more than an hour, no errors.

I often use this GPU for a computer vision project I'm working on, which involves SIFTing large images and then matching the descriptors. The former process uses about 90% of the card's memory. Shortly after purchasing the card I ran a sanity-check experiment where I ran a >1000-image job on the GPU, and then the same job only on the CPU, and the results were almost identical (GPU SIFTing is known to produce slightly different results). Lastly, when I first bought the card I also ran several tests under WinBlows, including FurMark. Not a single reported error. (I can't immediately rerun those tests, as the machine is in an office on the other side of the country (not really that far away).)

I am running the latest CUDA 5.0. The box is a hyper-threaded quad-core with 4GB of RAM, running CentOS 6.3 64-bit. Any thoughts from anyone? I'd be happy to provide unprivileged SSH access to the box to any of the CUDALucas developers if it might help being "in situ".
If you can, downclock the GPU memory and try again. It is almost certainly a hardware problem. You can also get GeneferCUDA and run a test; it should produce similar issues.
It is not unknown for programs on this forum to find hardware faults that nothing else will. I would suggest reducing your memory clock and finding what is stable.