mersenneforum.org  

2013-03-01, 14:45   #1739
kjaget

Quote:
Originally Posted by flashjh
The most recent CUDALucas.cu file is giving me a problem compiling:

Code:
Line 703: struct stat FileAttrib;
Line 717: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the checkpoint file. Trying the backup file.\n");
Line 737: if(stat(chkpnt_cfn, &FileAttrib) == 0) fprintf (stderr, "\nUnable to open the backup file. Restarting test.\n");
Compiler output:
Code:
C:\CUDA\src>make -f makefile.win
makefile.win:12: Extraneous text after `else' directive
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/bin/nvcc" -c CUDALucas.
cu -o CUDALucas.cuda5.0.sm_35.x64.obj -m64 --ptxas-options=-v -ccbin="C:\Program
 Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64" -Xcompiler /EHsc,/W3,/no
logo,/Ox,/Oy,/GL -arch=sm_35 -O3
CUDALucas.cu
CUDALucas.cu(703): error: incomplete type is not allowed

CUDALucas.cu(717): error: incomplete type is not allowed

CUDALucas.cu(737): error: incomplete type is not allowed
If you let me know what you're trying to do with the struct maybe I can figure out how to make MSVS happy.
Try changing stat to _stat and see if it compiles.
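For reference, here is a minimal sketch of the kind of portability guard that usually resolves this (the macro and helper names are my own placeholders, not from the CUDALucas source): MSVC documents the underscore-prefixed struct _stat and _stat(), while POSIX systems use plain stat.

Code:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Map one common name onto MSVC's underscore-prefixed stat; illustrative only. */
#ifdef _MSC_VER
typedef struct _stat file_stat_t;
#define file_stat _stat
#else
typedef struct stat file_stat_t;
#define file_stat stat
#endif

/* Returns 1 if the file exists (stat succeeds), 0 otherwise. */
static int checkpoint_exists(const char *path)
{
    file_stat_t attrib;
    return file_stat(path, &attrib) == 0;
}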
2013-03-01, 19:54   #1740
flashjh

I'll try tonight.
2013-03-01, 21:40   #1741
frmky

Regarding FFT timings, I have brief access to a Tesla K20m. I ran a benchmark starting at 1440k and going up in 16k increments; here are the results. As usual, lengths that run slower than a longer FFT have been deleted. I'm surprised how relatively short this table is.

Code:
FFT length (k), ms/iteration
1568	0.508496
1600	0.596174
2000	0.645019
2048	0.655126
2592	0.820283
3136	1.123238
4000	1.256788
4096	1.304601
4320	1.804463
4608	1.876166
4704	1.910958
5120	2.120896
5488	2.136009
5600	2.270577
6000	2.438436
6048	2.448022
6144	2.480506
6272	2.526666
7776	2.620803
For the most recently discovered prime, M57885161, using 4000k rather than 3200k actually results in a slightly longer ETA, presumably because the multiplication and normalization are done on the longer array.

Code:
./CUDALucas -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0933 ms/iter, ETA 65:47:57)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:41 real, 4.0613 ms/iter, ETA 65:16:29)


./CUDALucas -f 4000k -k 57885161
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 4000K, CUDALucas v2.04 Beta err = 0.0009 (0:42 real, 4.2419 ms/iter, ETA 68:11:21)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 4000K, CUDALucas v2.04 Beta err = 0.0010 (0:42 real, 4.2032 ms/iter, ETA 67:33:15)
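As a quick sanity check on those ETAs (my own back-of-the-envelope arithmetic, not part of the CUDALucas output): an LL test of M(p) needs roughly p - 2 iterations, so the ETA is essentially the iteration count times the ms/iter figure.

Code:
#include <stdio.h>

/* Rough ETA check: an LL test of M(p) takes about p - 2 squarings. */
int main(void)
{
    const unsigned long p = 57885161UL;
    const double ms_per_iter[] = { 4.0613, 4.2032 };   /* 3200K and 4000K runs above */

    for (int i = 0; i < 2; i++) {
        double hours = (p - 2) * ms_per_iter[i] / 1000.0 / 3600.0;
        printf("%.4f ms/iter -> about %.1f hours\n", ms_per_iter[i], hours);
    }
    return 0;   /* prints roughly 65.3 and 67.6 hours, matching the ETAs above */
}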

2013-03-02, 12:54   #1742
ET_

Just to be sure...

CUDALucas v2.01 is working fine for me under 64-bit Linux: the program automatically recognizes errors in the FFT computation and rolls back to a safer size, even if it is not the most efficient one.

1 - Is v2.04 available for Linux?
2 - Is it more reliable?
3 - Is it faster?
4 - Does it automagically choose the fastest FFT size?

Thank you for the info.

Luigi
2013-03-02, 21:36   #1743
Dubslow

2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.

None is safer or more reliable than the others; the main differences from 2.01 to 2.04 are interface features (e.g. worktodo support in 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually gain an extra 5-10% in performance.
2013-03-02, 23:19   #1744
ET_

Quote:
Originally Posted by Dubslow
2.03 is stable, 2.04 Beta works (but was never pushed out of beta), 2.05 is under development, and all work on GNU-Linux.

None is safer or more reliable than the others; the main differences from 2.01 to 2.04 are interface features (e.g. worktodo support in 2.03, other good stuff in 2.04). They all automagically choose a *roughly* good FFT size, but manual futzing can usually gain an extra 5-10% in performance.
Thank you Dubslow, that's exactly what I supposed... I'm watching from the sidelines, following development and helping with tests if needed, but still keeping my old plain-vanilla v2.01.

Luigi
2013-03-13, 21:07   #1745
Brain
Efficiency formula

Quote:
Originally Posted by aaronhaviland
Absolutely (from the CUFFT documentation):

I've been testing CUFFT timings for lengths other than just multiples of 32768. I've excluded the timings because they're not run exactly as CUDALucas would run them, but the fact that these are "optimal lengths" should still apply.

Eff% is calculated similarly to the prior examples here, but scaled so that the results all fall within the range 0-100. Very few lengths have an Eff% between 15% and 75%; the majority of inefficient lengths ran around 9-10%, and these have all been excluded. Some of the 70-80% efficient lengths have also been excluded because they are smaller than a larger-and-faster length. Note the entries (highlighted in blue in the original post) that would be skipped over if only looking at multiples of 32768:

Code:
FFT           Exponent
Size    Eff%   2 3 5 7
======================
1048576 97.23 20 0 0 0
1105920 88.82 13 3 1 0
1179648 91.20 17 2 0 0
1204224 82.49 13 1 0 2
1310720 89.06 18 0 1 0
1327104 90.86 14 4 0 0
1376256 85.13 16 1 0 1
1474560 89.14 15 2 1 0
1548288 89.05 13 3 0 1
1572864 89.23 19 1 0 0
1605632 88.84 15 0 0 2
1769472 92.58 16 3 0 0
1835008 89.17 18 0 0 1
2097152 95.87 21 0 0 0
2211840 87.81 14 3 1 0
2359296 89.84 18 2 0 0
2370816 80.62 8  3 0 3
2408448 81.08 14 1 0 2
2621440 87.60 19 0 1 0
2654208 85.52 15 4 0 0
2709504 82.21 11 3 0 2
2809856 82.38 13 0 0 3
2985984 87.28 12 6 0 0
3096576 85.87 14 3 0 1
3145728 85.74 20 1 0 0
3211264 82.12 16 0 0 2
3317760 82.69 13 4 1 0
3359232 74.71 9  8 0 0
3386880 71.31 9  3 1 2
3932160 80.93 18 1 1 0
4014080 80.65 14 0 1 2
4096000 73.66 15 0 3 0
4194304 95.87 22 0 0 0
4423680 87.81 15 3 1 0
4718592 89.84 19 2 0 0
4741632 80.62 9  3 0 3
4816896 81.08 15 1 0 2
5242880 87.60 20 0 1 0
5308416 85.52 16 4 0 0
5419008 82.21 12 3 0 2
5619712 82.38 14 0 0 3
5971968 87.28 13 6 0 0
6193152 85.87 15 3 0 1
6291456 85.74 21 1 0 0
6422528 82.12 17 0 0 2
6635520 82.69 14 4 1 0
6718464 74.71 10 8 0 0
6773760 71.31 10 3 1 2
7864320 80.93 19 1 1 0
8028160 80.65 15 0 1 2
8192000 73.66 16 0 3 0
Quote:
For sizes handled by the Cooley-Tukey code path (that is, representable as 2^a · 3^b · 5^c · 7^d), the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).
  • Restrict the size along all dimensions to be representable as 2^a · 3^b · 5^c · 7^d.
    The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.
  • Restrict the size along each dimension to use fewer distinct prime factors.
    For example, a transform of size 3^n will usually be faster than one of size 2^i · 3^j even if the latter is slightly smaller.
  • Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms.
    This further aids with memory coalescing.
  • Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class GPUs or between 2 and 2048 for earlier architectures.
    These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.
  • Use Native compatibility mode for in-place complex-to-real or real-to-complex transforms.
    This scheme reduces the write/read of padding bytes, hence helping with coalescing of the data.
I'd like to have a formula that computes an efficiency score between 0 and 100% for a given FFT length like Aaron did. Any suggestions? Aaron, can you post yours?
Thoughts:
- measure the runtime of various FFT lengths and fit the formula to the measured ms/iter
- use the theoretical Gflops throughput given by Nvidia for radix-2 through radix-7
- build a theoretical model with weighted scores for radix-2 through radix-7 and a penalty for more distinct prime factors <-- my attempt (a rough sketch of this idea follows below), but the scores still need to be analysed
- ...

Suggestions? Or has this been invented yet?
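To make the third idea concrete, here is a minimal sketch of such a scoring model (the per-radix weights, the mixing penalty, and the function name are invented placeholders, not anything measured): factor the length into 2^a · 3^b · 5^c · 7^d, charge a weighted cost per radix pass, penalise mixing radices, and rescale so that a pure power of two scores 100%. The numbers it prints will not match Aaron's table; the point is only the structure, which could then be fitted to measured cufftbench timings.

Code:
#include <math.h>
#include <stdio.h>

/* Toy efficiency score for a CUFFT length N = 2^a * 3^b * 5^c * 7^d.
 * The per-radix weights and the mixing penalty are invented placeholders;
 * they would have to be fitted against real cufftbench data. */
static double eff_score(unsigned long n)
{
    const int primes[4]    = { 2, 3, 5, 7 };
    const double weight[4] = { 1.00, 1.70, 2.60, 3.40 }; /* guessed cost per radix pass */
    const double mix_pen   = 0.05;                       /* guessed penalty per extra radix */
    double bits = log2((double)n);   /* cost of an ideal pure-power-of-two length */
    double cost = 0.0;
    int distinct = 0;

    for (int i = 0; i < 4; i++) {
        int e = 0;
        while (n % primes[i] == 0) { n /= (unsigned long)primes[i]; e++; }
        if (e > 0) { distinct++; cost += e * weight[i]; }
    }
    if (n != 1) return 0.0;          /* not 7-smooth: CUFFT uses a slower code path */

    cost *= 1.0 + mix_pen * (distinct - 1);
    return 100.0 * bits / cost;      /* exactly 100% for a pure power of two */
}

int main(void)
{
    const unsigned long lens[] = { 1048576UL, 1105920UL, 3359232UL, 4096000UL };
    for (int i = 0; i < 4; i++)
        printf("%8lu -> %5.1f%%\n", lens[i], eff_score(lens[i]));
    return 0;
}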
2013-03-14, 21:48   #1746
Brain

Quote:
Originally Posted by Brain
Suggestions? Or has this been invented yet?
I think aaronhaviland already did this with the cufftbench option.
I will do the same and normalize the timings as FFT length / time.
The raw data I will use can be found in this post in the Titan thread.
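A minimal sketch of that normalization (the scaling N·log2 N is the standard FFT work estimate; the code is mine, and only the sample timings are taken from the K20m table above): compute throughput as N·log2(N) divided by ms/iter, then rescale so the best measured length is 100%.

Code:
#include <math.h>
#include <stdio.h>

/* Normalize measured ms/iteration by the usual FFT work estimate N*log2(N),
 * then rescale so the best length in the sample scores 100%.
 * The (length, ms/iter) pairs are taken from the K20m benchmark above. */
struct sample { double len_k; double ms; };

int main(void)
{
    const struct sample s[] = { { 1568, 0.508496 }, { 2048, 0.655126 },
                                { 4096, 1.304601 }, { 7776, 2.620803 } };
    const int n = (int)(sizeof s / sizeof s[0]);
    double tput[16], best = 0.0;

    for (int i = 0; i < n; i++) {
        double N = s[i].len_k * 1024.0;        /* FFT length in elements */
        tput[i] = N * log2(N) / s[i].ms;       /* "useful work" per millisecond */
        if (tput[i] > best) best = tput[i];
    }
    for (int i = 0; i < n; i++)
        printf("%5.0fK  %.6f ms/iter  Eff %5.1f%%\n",
               s[i].len_k, s[i].ms, 100.0 * tput[i] / best);
    return 0;
}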
2013-04-19, 15:39   #1747
chalsall
Might there be a bug in CUDALucas, or do I have bad hardware?

Hey all. Looking for some guidance from the Gurus...

I offered to help alpha test owftheevil's new GPU P-1 program on my 2GB GTX560, and very quickly it started reporting round-off errors. owftheevil asked if I had run the CUDALucas self-test, which I had to admit I hadn't.

After receiving a couple of different versions of the source from owftheevil, and also downloading the code from GitHub, I found it rarely passed the self-test (only once out of about ten runs).

Concerned that my mfaktc work might be bad, I ran its deep self-test. 100% success. (I know, of course, that the two programs work in very different ways.)

I then reran the memory test program I used when I first bought the 560 -- http://wili.cc/blog/gpu-burn.html -- no errors.

I downloaded and compiled the open-source version of memtestG80 -- https://github.com/ihaque/memtestG80 -- after more than an hour, no errors.

I often use this GPU for a computer vision project I'm working on, which involves SIFTing large images and then matching the descriptors. The former process uses about 90% of the card's memory. Shortly after purchasing the card I ran a sanity-check experiment: a >1000-image job on the GPU, then the same job on the CPU only, and the results were almost identical (GPU SIFTing is known to give slightly different results).

Lastly, when I first bought the card I also ran several tests under WinBlows, including FurMark. Not a single reported error. (I can't immediately rerun those tests as the machine is in an office on the other side of the country (not really that far away).)

I am running the latest CUDA 5.0. The box is a hyper-threaded quad-core with 4GB of RAM, running CentOS 6.3 64-bit.

Any thoughts from anyone?

I'd be happy to provide unprivileged SSH access to the box to any of the CUDALucas developers if being "in situ" might help.
2013-04-19, 15:51   #1748
axn

If you can, downclock the GPU memory and try again. It is almost certainly a hardware problem. You can also get GeneferCUDA and run a test; it should show similar issues.
2013-04-19, 15:52   #1749
henryzz

It is not unknown for programs on this forum to find hardware faults that nothing else will. I would suggest reducing your memory clock and finding what is stable.