mersenneforum.org CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)

2013-03-01, 14:45   #1739
kjaget

Jun 2005

3×43 Posts

Quote:
 Originally Posted by flashjh
 The most recent CUDALucas.cu file is giving me a problem compiling:
 Code:
 Line 703: struct stat FileAttrib;
 Line 717: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the checkpoint file. Trying the backup file.\n");
 Line 737: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the backup file. Restarting test.\n");
 Compiler output:
 Code:
 C:\CUDA\src>make -f makefile.win
 makefile.win:12: Extraneous text after `else' directive
 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/bin/nvcc" -c CUDALucas.cu -o CUDALucas.cuda5.0.sm_35.x64.obj -m64 --ptxas-options=-v -ccbin="C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64" -Xcompiler /EHsc,/W3,/nologo,/Ox,/Oy,/GL -arch=sm_35 -O3
 CUDALucas.cu
 CUDALucas.cu(703): error: incomplete type is not allowed
 CUDALucas.cu(717): error: incomplete type is not allowed
 CUDALucas.cu(737): error: incomplete type is not allowed
 If you let me know what you're trying to do with the struct maybe I can figure out how to make MSVS happy.
Try changing stat to _stat and see if it compiles.
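For illustration, the stat/_stat suggestion could be wrapped so the same source builds with both MSVC and GCC. This is only a sketch, not the code in CUDALucas.cu: `file_stat_t`, `FILE_STAT` and `file_exists` are hypothetical names, and it assumes `<sys/stat.h>` is included (a missing header is one possible cause of the "incomplete type" error).

```c
#include <sys/types.h>
#include <sys/stat.h>

/* MSVC spells the POSIX names with a leading underscore. */
#ifdef _MSC_VER
typedef struct _stat file_stat_t;
#define FILE_STAT(path, buf) _stat((path), (buf))
#else
typedef struct stat file_stat_t;
#define FILE_STAT(path, buf) stat((path), (buf))
#endif

/* Hypothetical helper (not part of CUDALucas):
   returns 1 if `path` exists and can be stat'ed, 0 otherwise. */
static int file_exists(const char *path)
{
    file_stat_t attrib;
    return FILE_STAT(path, &attrib) == 0;
}
```

With a wrapper like this, the checks at lines 717 and 737 would call `file_exists(chkpnt_cfn)` instead of using `struct stat` directly.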

2013-03-01, 19:54   #1740
flashjh

"Jerry"
Nov 2011
Vancouver, WA

1123₁₀ Posts

I'll try tonight.

2013-03-01, 21:40   #1741
frmky

Jul 2003
So Cal

2³·11·23 Posts

Regarding FFT timings, I have brief access to a Tesla K20m. I ran a benchmark starting at 1440k going up in 16k increments, and here are the results. As usual, lengths slower than longer FFTs have been deleted. I'm surprised how relatively short this table is.
Code:
FFT length (k), ms/iteration
1568 0.508496
1600 0.596174
2000 0.645019
2048 0.655126
2592 0.820283
3136 1.123238
4000 1.256788
4096 1.304601
4320 1.804463
4608 1.876166
4704 1.910958
5120 2.120896
5488 2.136009
5600 2.270577
6000 2.438436
6048 2.448022
6144 2.480506
6272 2.526666
7776 2.620803
For the most recent prime, actually using 4000k rather than 3200k results in a slightly longer ETA, presumably because of doing the multiplication and normalization on the longer array.
Code:
./CUDALucas -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0933 ms/iter, ETA 65:47:57)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0613 ms/iter, ETA 65:16:29)

./CUDALucas -f 4000k -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 4000K, CUDALucas v2.04 Beta err = 0.0009 (0:42 real, 4.2419 ms/iter, ETA 68:11:21)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 4000K, CUDALucas v2.04 Beta err = 0.0010 (0:42 real, 4.2032 ms/iter, ETA 67:33:15)

Last fiddled with by frmky on 2013-03-01 at 21:56

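The pruning step described in the post (dropping every length that is slower than some longer FFT) can be sketched as a single backward scan. This is not frmky's actual script: the `fft_timing` struct and `prune_dominated` are hypothetical names, and the input is assumed to be sorted by ascending length.

```c
#include <stddef.h>
#include <float.h>

/* Hypothetical record: one benchmark row. */
typedef struct {
    int    len_k;  /* FFT length in K */
    double ms;     /* milliseconds per iteration */
} fft_timing;

/* Keeps only rows that are strictly faster than every longer length.
   `rows` must be sorted by ascending len_k. Compacts the array in place
   and returns the number of rows kept. */
static size_t prune_dominated(fft_timing *rows, size_t n)
{
    size_t write = n;       /* kept rows are packed at rows[write..n-1] */
    double best = DBL_MAX;  /* fastest time seen among longer lengths */

    /* Scan from the longest length down to the shortest. */
    for (size_t i = n; i-- > 0; ) {
        if (rows[i].ms < best) {
            best = rows[i].ms;
            rows[--write] = rows[i];
        }
    }

    /* Move the kept rows to the front, preserving ascending order. */
    size_t kept = n - write;
    for (size_t i = 0; i < kept; i++)
        rows[i] = rows[write + i];
    return kept;
}
```

Applied to a raw benchmark, the surviving rows form a list like the table above, in which ms/iteration strictly increases with length.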
2013-03-02, 12:54   #1742
ET_
Banned

"Luigi"
Aug 2002
Team Italia

1001010010110₂ Posts

Just to be sure... I'm working fine with CudaLucas v2.01 under 64-bit Linux: the program automatically recognizes errors in the FFT computation and rolls back to a safer size, even if not the most efficient.
1 - Is v2.04 available for Linux?
2 - Is it more reliable?
3 - Is it faster?
4 - Does it automagically choose the fastest FFT size?
Thank you for the info.
Luigi

 2013-03-02, 21:36 #1743 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
2013-03-02, 23:19   #1744
ET_
Banned

"Luigi"
Aug 2002
Team Italia

2·3·13·61 Posts

Quote:
 Originally Posted by Dubslow 2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux. None are safer and more reliable; the main differences from 2.01 to 2.04 are interface features (i.e. worktodo in version 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually get an extra 5-10% performance boost.
Thank you Dubslow, that's exactly what I supposed... I'm waiting at the window, looking at the development, helping with tests if needed, but still keeping my old plain-vanilla v2.01

Luigi

2013-03-13, 21:07   #1745
Brain

Dec 2009
Peine, Germany

331 Posts
Efficiency formula

Quote:
 Originally Posted by aaronhaviland
 Absolutely (from the CUFFT documentation):
 I've been testing CUFFT timings for lengths other than just multiples of 32768. I've excluded the timings because they're not run exactly as CUDALucas would run them, but the fact that they are "optimal lengths" should still apply. Eff% is calculated similarly to the prior examples here, but scaled so that the results are all within the range 0-100. Very few lengths have Eff% between 15% and 75%; the majority of inefficient lengths ran around 9-10%. These have all been excluded. Some of the 70-80% efficient run-lengths have also been excluded because they are smaller than a larger+faster length. Note the exponents in blue which would be skipped over if only looking at multiples of 32768:
 Code:
 FFT               Exponent
 Size      Eff%    2  3  5  7
 ======================
 1048576   97.23  20  0  0  0
 1105920   88.82  13  3  1  0
 1179648   91.20  17  2  0  0
 1204224   82.49  13  1  0  2
 1310720   89.06  18  0  1  0
 1327104   90.86  14  4  0  0
 1376256   85.13  16  1  0  1
 1474560   89.14  15  2  1  0
 1548288   89.05  13  3  0  1
 1572864   89.23  19  1  0  0
 1605632   88.84  15  0  0  2
 1769472   92.58  16  3  0  0
 1835008   89.17  18  0  0  1
 2097152   95.87  21  0  0  0
 2211840   87.81  14  3  1  0
 2359296   89.84  18  2  0  0
 2370816   80.62   8  3  0  3
 2408448   81.08  14  1  0  2
 2621440   87.60  19  0  1  0
 2654208   85.52  15  4  0  0
 2709504   82.21  11  3  0  2
 2809856   82.38  13  0  0  3
 2985984   87.28  12  6  0  0
 3096576   85.87  14  3  0  1
 3145728   85.74  20  1  0  0
 3211264   82.12  16  0  0  2
 3317760   82.69  13  4  1  0
 3359232   74.71   9  8  0  0
 3386880   71.31   9  3  1  2
 3932160   80.93  18  1  1  0
 4014080   80.65  14  0  1  2
 4096000   73.66  15  0  3  0
 4194304   95.87  22  0  0  0
 4423680   87.81  15  3  1  0
 4718592   89.84  19  2  0  0
 4741632   80.62   9  3  0  3
 4816896   81.08  15  1  0  2
 5242880   87.60  20  0  1  0
 5308416   85.52  16  4  0  0
 5419008   82.21  12  3  0  2
 5619712   82.38  14  0  0  3
 5971968   87.28  13  6  0  0
 6193152   85.87  15  3  0  1
 6291456   85.74  21  1  0  0
 6422528   82.12  17  0  0  2
 6635520   82.69  14  4  1  0
 6718464   74.71  10  8  0  0
 6773760   71.31  10  3  1  2
 7864320   80.93  19  1  1  0
 8028160   80.65  15  0  1  2
 8192000   73.66  16  0  3  0
Quote:
 For sizes handled by the Cooley-Tukey code path (that is, representable as 2^a ⋅ 3^b ⋅ 5^c ⋅ 7^d), the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).
 - Restrict the size along all dimensions to be representable as 2^a ⋅ 3^b ⋅ 5^c ⋅ 7^d. The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.
 - Restrict the size along each dimension to use fewer distinct prime factors. For example, a transform of size 3^n will usually be faster than one of size 2^i ⋅ 3^j even if the latter is slightly smaller.
 - Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing.
 - Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class GPUs or between 2 and 2048 for earlier architectures. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.
 - Use Native compatibility mode for in-place complex-to-real or real-to-complex transforms. This scheme reduces the write/read of padding bytes hence helping with coalescing of the data.
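As a rough illustration of the first constraint, the exponent columns in the table can be reproduced by trial division by 2, 3, 5 and 7. The helper name `factor_7smooth` is hypothetical and is not part of CUFFT or CUDALucas.

```c
#include <stdint.h>

/* Hypothetical helper: writes the exponents of 2, 3, 5 and 7 in n
   into exps[0..3]. Returns 1 if n == 2^a * 3^b * 5^c * 7^d, and 0 if
   n is 0 or has any other prime factor. */
static int factor_7smooth(uint64_t n, int exps[4])
{
    static const uint64_t primes[4] = {2, 3, 5, 7};

    for (int i = 0; i < 4; i++) {
        exps[i] = 0;
        while (n != 0 && n % primes[i] == 0) {
            n /= primes[i];
            exps[i]++;
        }
    }
    return n == 1;  /* nothing left over means only 2, 3, 5, 7 divide n */
}
```

For example, 1105920 yields the exponents 13, 3, 1, 0, matching its row in the table.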
I'd like to have a formula that computes an efficiency score between 0 and 100% for a given FFT length like Aaron did. Any suggestions? Aaron, can you post yours?
Thoughts:
- measure FFT lengths runtime and adapt formula to match ms/iter
- theoretical Gflops throughput as given by Nvidia for radix-2 to -7
- theoretical model with weighted scores for radix-2 to -7 and penalty for more distinct prime factors used <-- my try but scores must be analysed
- ...

Suggestions? Or has this been invented yet?
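One possible shape for the third idea (weighted radix scores with a penalty for each extra distinct prime factor) is sketched below. Every constant here is a placeholder assumption, not a measured value and not Aaron's formula; the weights and penalty would have to be fitted against real ms/iter data before the scores mean anything. The function name `fft_score` is hypothetical.

```c
/* Sketch of a weighted-radix score in [0, 100].
   exps[0..3] are the exponents of 2, 3, 5 and 7 in the FFT length.
   ALL weights and the penalty are hypothetical placeholders. */
static double fft_score(const int exps[4])
{
    /* log2 of 2, 3, 5, 7: how much of the length each factor contributes */
    static const double bits[4]   = {1.0, 1.584963, 2.321928, 2.807355};
    /* assumed relative speed of each radix (placeholder values) */
    static const double weight[4] = {1.00, 0.90, 0.85, 0.80};
    /* assumed points lost per extra distinct prime (placeholder value) */
    const double penalty = 3.0;

    double total = 0.0, weighted = 0.0;
    int distinct = 0;

    for (int i = 0; i < 4; i++) {
        if (exps[i] > 0)
            distinct++;
        total    += exps[i] * bits[i];
        weighted += exps[i] * bits[i] * weight[i];
    }
    if (total <= 0.0)
        return 0.0;  /* length 1 or invalid exponents */

    double score = 100.0 * (weighted / total) - penalty * (distinct - 1);
    if (score < 0.0)   score = 0.0;
    if (score > 100.0) score = 100.0;
    return score;
}
```

With these placeholder values a pure power of two scores 100 and mixed-radix lengths score lower; whether that ordering matches real hardware is exactly what the measurements would need to confirm.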

2013-03-14, 21:48   #1746
Brain

Dec 2009
Peine, Germany

101001011₂ Posts

Quote:
 Originally Posted by Brain Suggestions? Or has this been invented yet?
I think Aaronhaviland did this with the cufftbench option.
I will do the same and normalize the timings by FFT length / time.
Raw data that will be used can be found in this Titan's thread post.
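A minimal sketch of that normalization, assuming the score is simply length divided by time, scaled so that the best length in the set gets 100. Whether this matches the Eff% column in Aaron's table is not known; `normalized_efficiency` is a hypothetical name.

```c
#include <stddef.h>

/* For each FFT length, computes throughput = len / ms and scales it so
   that the best throughput in the set maps to 100.
   len[i]: FFT length, ms[i]: time per iteration, eff[i]: output score.
   Entries with a non-positive time get a score of 0. */
static void normalized_efficiency(const double *len, const double *ms,
                                  double *eff, size_t n)
{
    double best = 0.0;

    /* First pass: raw throughput, tracking the maximum. */
    for (size_t i = 0; i < n; i++) {
        eff[i] = (ms[i] > 0.0) ? len[i] / ms[i] : 0.0;
        if (eff[i] > best)
            best = eff[i];
    }

    /* Second pass: scale to the 0-100 range. */
    for (size_t i = 0; i < n; i++)
        eff[i] = (best > 0.0) ? 100.0 * (eff[i] / best) : 0.0;
}
```

Fed the K20m rows for 1568k, 1600k and 2048k from earlier in the thread, this gives 2048k a score of 100, with 1568k close behind and 1600k clearly lower.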

2013-04-19, 15:51   #1748
axn

Jun 2003

4662₁₀ Posts

If you can, downclock the GPU memory and try again. It is almost certainly a hardware problem. You can also get GeneferCUDA and do a test. It should also produce similar issues.

2013-04-19, 15:52   #1749
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT)

5690₁₀ Posts

It is not unknown for programs on this forum to find hardware faults that nothing else will. I would suggest reducing your memory clock to find what is stable.


