2013-11-11, 19:50   #419
ET_
Banned
"Luigi"
Aug 2002
Team Italia
4818₁₀ Posts

Quote:
Originally Posted by owftheevil
How far past stage1 gcd does it get? What is the reported available memory and estimate of memory it will use (beginning of stage 1)? Are you doing stage 1 and stage 2 on different cards?

I had the same behavior. It turned out that the outside temperature was too hot...

Luigi

2013-11-12, 05:20   #420
LaurV
Romulan Interpreter
Jun 2011
Thailand
7²×197 Posts

Sorry for the lack of details. I will get home in the evening and come back with some output snippets. I still have the stage 1 final files. To confirm: the problem is not related to temperature, to my cards, to my computer, etc. I can accurately reproduce it on different cards (all 580 with 1536M RAM) and different computers. The problem is related to 69M exponents: the program just crashes without giving any error, immediately after the init of stage 2 finishes. I tried the -e2 and -d2 switches with different values, and I also tried to put "UnusedMem=" in the ini file. The number of primes and the memory used always differ (like from 1 prime to 27 primes, from 308M to 1370M used; the card says ~1400M free), a sign that the switches are working. I also retried stage 1 from previous checkpoints, which finishes without problem, does the gcd, inits stage 2, BOOM!

You know me from the past: I have always been the first to blame heat, dust, whatever, when other people had problems, but this time it is not a hardware or temperature problem. I have an external water cooler (I mean, outside of the house!) with 4 liters of water, two pumps and 12 fans, and at night the temperature outside drops to +16 at this time of the year. The installation is designed for the tremendously hot Thai Aprils, when the temperature outside is on average 40 degrees C. If I start all the fans NOW, in this period of the year, there is no way the temperatures go over 50C.

Trust me, this time, there is a bug in the program.

Last fiddled with by LaurV on 2013-11-12 at 05:21

2013-11-12, 15:09   #421
owftheevil
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²×5×7 Posts

Once stage 2 has begun, it would be a big mess to allow the various parameters e, d, nrp to change. So in fact, regardless of any command line parameters, if stage 2 has already been initialized, e, d, nrp, b1, and b2 are taken from the savefile. If stage 2 is initialized with 3GB of available memory and then resumed with only 1.5GB of available memory, there will be problems. If this is the case, delete the stage 2 savefile. Then everything should work fine. If not, I'll need all the information you can give me. Thanks for pointing out this bug.
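
A minimal sketch of the precedence rule described in this post, written in Python rather than taken from the program's source. The names Stage2Params and resolve_stage2_params are hypothetical; the only point illustrated is that values stored in a stage 2 savefile win over anything requested on the command line.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage2Params:
    """Hypothetical container for the stage 2 parameters named in the post."""
    b1: int
    b2: int
    d: int
    e: int
    nrp: int


def resolve_stage2_params(requested: Stage2Params,
                          saved: Optional[Stage2Params]) -> Stage2Params:
    """Return the parameters stage 2 would run with.

    Assumption: `saved` is None when no stage 2 savefile exists.
    Once stage 2 has been initialized, the saved values are used
    and the requested (command line) values are ignored.
    """
    return saved if saved is not None else requested


if __name__ == "__main__":
    requested = Stage2Params(b1=640000, b2=14400000, d=30, e=6, nrp=8)
    saved = Stage2Params(b1=640000, b2=14400000, d=2310, e=2, nrp=30)
    print(resolve_stage2_params(requested, saved))   # saved values are used
    print(resolve_stage2_params(requested, None))    # requested values are used
```

Under this rule, new -e2 or -d2 values can only take effect after the stage 2 savefile has been deleted.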

2013-11-12, 15:10   #422
LaurV
Romulan Interpreter
Jun 2011
Thailand
22665₈ Posts

I am back.

Here is the stuff:
Code:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

e:\cpm1_1>
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name                GeForce GTX 580
Compatibility       2.0
clockRate (MHz)     1564
memClockRate (MHz)  2004
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         786432
sharedMemPerBlock   zu
regsPerBlock        32768
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     1536
multiProcessorCount 16
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 2310, e = 2, nrp = 30
Zeros: 644312, Ones: 737128, Pairs: 145269
Processing 1 - 30 of 480 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>cudapm1 69010547 -e2 6 -d2 30
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name                GeForce GTX 580
Compatibility       2.0
clockRate (MHz)     1564
memClockRate (MHz)  2004
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         786432
sharedMemPerBlock   zu
regsPerBlock        32768
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     1536
multiProcessorCount 16
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 30, e = 6, nrp = 8
Zeros: 898832, Ones: 746888, Pairs: 135479
Processing 1 - 8 of 8 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>

Note that there is no "stage 2" checkpoint file, and none is created during the crash. The only file is the last stage 1 checkpoint.

Same story with the undocumented option in the ini file, setting the unused mem to 1200M (note that the memory used in this case is much less; it could be anything between about 300M and 800M):

Code:
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
------- DEVICE 1 -------
name                GeForce GTX 580
Compatibility       2.0
clockRate (MHz)     1564
memClockRate (MHz)  2004
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         786432
sharedMemPerBlock   zu
regsPerBlock        32768
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     1536
multiProcessorCount 16
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 608M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 210, e = 2, nrp = 1
Zeros: 100255, Ones: 109505, Pairs: 19827
Processing 1 - 1 of 48 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02612, (0.56 real, 3.3270 ms/tran,  ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>cudapm1 69010547 -e2 2 -d2 30
CUDAPm1 v0.20
------- DEVICE 1 -------
name                GeForce GTX 580
Compatibility       2.0
clockRate (MHz)     1564
memClockRate (MHz)  2004
totalGlobalMem      zu
totalConstMem       zu
l2CacheSize         786432
sharedMemPerBlock   zu
regsPerBlock        32768
warpSize            32
memPitch            zu
maxThreadsPerBlock  1024
maxThreadsPerMP     1536
multiProcessorCount 16
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
textureAlignment    zu
deviceOverlap       1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 352M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 30, e = 2, nrp = 1
Zeros: 132682, Ones: 111990, Pairs: 17316
Processing 1 - 1 of 8 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02441, (0.56 real, 3.3294 ms/tran,  ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>

2013-11-12, 15:14   #423
LaurV
Romulan Interpreter
Jun 2011
Thailand
7²·197 Posts

Post size limit, I had to delete some stuff.

For now I solved the problem very simply: I reported stage 1 only (and am not going to do any stage 2), moved all the unstarted 69M exponents into prime95's worktodo, and brought all the 64M exponents from prime95 to cudapm1. Under 69M (fft 3584 and smaller) everything works fine.

I also tried fft 3600, and it still works. I tried 4096 with all the thread combinations I could think of, and it does not work. But it works on cards with 3G memory. So, it is related to some allocation.

edit2: still keeping stage 1 files

Last fiddled with by LaurV on 2013-11-12 at 15:20

2013-11-12, 15:42   #424
owftheevil
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²·5·7 Posts

LaurV, please edit the threads.txt file, changing the norm1 threads to 128, e.g. from

Code:
4096 64 64 32
to

Code:
4096 128 64 32
and try again.

The error you are getting is a round off error (Maybe I could add a line which actually tells you this?). The thread optimizing function probably has a <= where it needs a < when checking if the thread sizes are acceptable.
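
The same edit can be scripted. This is a minimal Python sketch, assuming each line of threads.txt starts with the FFT length followed by the norm1, mult and norm2 thread counts, as in the two example lines above. set_norm1_threads is a hypothetical helper, not part of CUDAPm1.

```python
def set_norm1_threads(text: str, fft_k: int, norm1: int) -> str:
    """Return `text` with the norm1 column changed on the line for `fft_k`.

    Assumed line layout: "<fft> <norm1> <mult> <norm2> [anything else]".
    Lines for other FFT lengths are passed through unchanged.
    """
    out = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == str(fft_k):
            fields[1] = str(norm1)
            line = " ".join(fields)
        out.append(line)
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    original = "4096 64 64 32\n"
    print(set_norm1_threads(original, 4096, 128), end="")
```

Reading the file, passing its contents through the helper and writing the result back is left to the caller, so the original can be kept as a backup.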

2013-11-12, 16:00   #425
LaurV
Romulan Interpreter
Jun 2011
Thailand
25B5₁₆ Posts

You are my man! Tell me when you come to northern Thailand so I fill the refrigerators with beer!

It is working. Plain and simple...

And I am sure I tried many different thread combinations too (in command line only, however, not in the thread tuning file!).

BTW, related to the tuning file, I just wrote a "tuning batch" with a pari/gp for loop, so if someone is tired of tuning all FFT possibilities by hand, run this batch and keep the "GeForce GTX blabla threads.txt" file. You may have a few FFT sizes which are not in the list, but there won't be more than a few (depending on your card). It saves you a lot of typing...

Code:
cudapm1 -cufftbench 4 4 6
cudapm1 -cufftbench 5 5 6
cudapm1 -cufftbench 14 14 6
cudapm1 -cufftbench 32 32 6
cudapm1 -cufftbench 36 36 6
cudapm1 -cufftbench 40 40 6
cudapm1 -cufftbench 64 64 6
cudapm1 -cufftbench 80 80 6
cudapm1 -cufftbench 96 96 6
cudapm1 -cufftbench 98 98 6
cudapm1 -cufftbench 128 128 6
cudapm1 -cufftbench 144 144 6
cudapm1 -cufftbench 160 160 6
cudapm1 -cufftbench 162 162 6
cudapm1 -cufftbench 192 192 6
cudapm1 -cufftbench 224 224 6
cudapm1 -cufftbench 256 256 6
cudapm1 -cufftbench 288 288 6
cudapm1 -cufftbench 320 320 6
cudapm1 -cufftbench 324 324 6
cudapm1 -cufftbench 336 336 6
cudapm1 -cufftbench 384 384 6
cudapm1 -cufftbench 392 392 6
cudapm1 -cufftbench 400 400 6
cudapm1 -cufftbench 448 448 6
cudapm1 -cufftbench 512 512 6
cudapm1 -cufftbench 576 576 6
cudapm1 -cufftbench 640 640 6
cudapm1 -cufftbench 648 648 6
cudapm1 -cufftbench 672 672 6
cudapm1 -cufftbench 720 720 6
cudapm1 -cufftbench 768 768 6
cudapm1 -cufftbench 784 784 6
cudapm1 -cufftbench 800 800 6
cudapm1 -cufftbench 864 864 6
cudapm1 -cufftbench 896 896 6
cudapm1 -cufftbench 1024 1024 6
cudapm1 -cufftbench 1152 1152 6
cudapm1 -cufftbench 1176 1176 6
cudapm1 -cufftbench 1280 1280 6
cudapm1 -cufftbench 1296 1296 6
cudapm1 -cufftbench 1344 1344 6
cudapm1 -cufftbench 1440 1440 6
cudapm1 -cufftbench 1512 1512 6
cudapm1 -cufftbench 1536 1536 6
cudapm1 -cufftbench 1568 1568 6
cudapm1 -cufftbench 1600 1600 6
cudapm1 -cufftbench 1728 1728 6
cudapm1 -cufftbench 1792 1792 6
cudapm1 -cufftbench 2048 2048 6
cudapm1 -cufftbench 2240 2240 6
cudapm1 -cufftbench 2304 2304 6
cudapm1 -cufftbench 2352 2352 6
cudapm1 -cufftbench 2592 2592 6
cudapm1 -cufftbench 2688 2688 6
cudapm1 -cufftbench 2880 2880 6
cudapm1 -cufftbench 3024 3024 6
cudapm1 -cufftbench 3136 3136 6
cudapm1 -cufftbench 3150 3150 6
cudapm1 -cufftbench 3200 3200 6
cudapm1 -cufftbench 3360 3360 6
cudapm1 -cufftbench 3456 3456 6
cudapm1 -cufftbench 3584 3584 6
cudapm1 -cufftbench 3600 3600 6
cudapm1 -cufftbench 4096 4096 6
cudapm1 -cufftbench 4320 4320 6
cudapm1 -cufftbench 4608 4608 6
cudapm1 -cufftbench 4704 4704 6
cudapm1 -cufftbench 5040 5040 6
cudapm1 -cufftbench 5184 5184 6
cudapm1 -cufftbench 5292 5292 6
cudapm1 -cufftbench 5400 5400 6
cudapm1 -cufftbench 5600 5600 6
cudapm1 -cufftbench 5670 5670 6
cudapm1 -cufftbench 5760 5760 6
cudapm1 -cufftbench 6144 6144 6
cudapm1 -cufftbench 6272 6272 6
cudapm1 -cufftbench 6480 6480 6
cudapm1 -cufftbench 6720 6720 6
cudapm1 -cufftbench 6912 6912 6
cudapm1 -cufftbench 7056 7056 6
cudapm1 -cufftbench 7168 7168 6
cudapm1 -cufftbench 7200 7200 6
cudapm1 -cufftbench 7776 7776 6
cudapm1 -cufftbench 8064 8064 6
cudapm1 -cufftbench 8192 8192 6
(hehehe)
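
A list like this can be generated instead of typed. The pari/gp loop mentioned above is not shown in the thread, so this is a Python sketch of the same idea. cufftbench_commands is a hypothetical helper; the command shape `cudapm1 -cufftbench N N 6` is copied from the batch above, and the FFT sizes have to come from your own "fft.txt" file.

```python
def cufftbench_commands(fft_sizes, passes=6, exe="cudapm1"):
    """Build one thread-tuning command per FFT size (sizes are in K).

    Each command benchmarks a single FFT length, so the start and end
    of the range are the same number.
    """
    return [f"{exe} -cufftbench {n} {n} {passes}" for n in fft_sizes]


if __name__ == "__main__":
    # A few sizes taken from the batch above, for illustration only.
    for cmd in cufftbench_commands([64, 648, 3584, 4096]):
        print(cmd)
```

Generating the lines this way avoids hand-typing slips such as a missing space or a mismatched pair of sizes.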


edit: reserving more 69M exponents... At first I was stupid, trying to bring back those from prime95, until some light bulb turned on in my head...

Last fiddled with by LaurV on 2013-11-12 at 16:05

2013-11-12, 16:21   #426
owftheevil
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
473₈ Posts

I'm glad it was that simple. Thanks again for your input. I'll get a fix up as soon as no home internet and windows adjusting to new hardware allow.

2013-11-18, 20:07   #427
owftheevil
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²×5×7 Posts

The code with the fix for what turned out to be a threads issue in one of the kernels is now up at SourceForge, as is a CUDA 5.5-linked Windows executable.

https://sourceforge.net/projects/cudapm1

Last fiddled with by owftheevil on 2013-11-18 at 20:09

2013-11-22, 17:34   #428
ET_
Banned
"Luigi"
Aug 2002
Team Italia
2·3·11·73 Posts
CudaPm1 v0.20

I just downloaded and recompiled CudaPm1 v0.20 svn 51.

Before testing an exponent, I ran

Code:
./CudaPm1 -cufftbench 1 8192 1
that gave me a file named "GeForce GTX 580 fft.txt"

I then ran the program on M67,231.XXX with the default INI file. Everything looked fine.

I also noticed a message from the output:

Code:
No GeForce GTX 580 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDAPm1 -cufftbench 4096 4096 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 128, norm2 128.
I ran the program with r = 4 and got the following message:

Code:
CUDA bench, testing various thread sizes for fft 4096K, doing 4 passes.
fft size = 4096K, ave time = 6.0904 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 32
fft size = 4096K, ave time = 6.0979 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 64
fft size = 4096K, ave time = 6.1026 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 128
fft size = 4096K, ave time = 6.1052 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 256
fft size = 4096K, ave time = 6.1038 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 512
fft size = 4096K, ave time = 6.1022 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 1024
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
fft size = 4096K, ave time = 6.0929 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 64
fft size = 4096K, ave time = 6.0960 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 128
fft size = 4096K, ave time = 6.1007 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 256
fft size = 4096K, ave time = 6.0981 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 512
fft size = 4096K, ave time = 6.0974 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 1024
fft size = 4096K, ave time = 6.1389 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 32
fft size = 4096K, ave time = 6.1475 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 64
fft size = 4096K, ave time = 6.1538 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 128
fft size = 4096K, ave time = 6.1559 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 256
fft size = 4096K, ave time = 6.1564 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 512
fft size = 4096K, ave time = 6.1548 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 1024
fft size = 4096K, ave time = 6.1958 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 32
fft size = 4096K, ave time = 6.2042 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 64
fft size = 4096K, ave time = 6.2083 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 128
fft size = 4096K, ave time = 6.2126 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 256
fft size = 4096K, ave time = 6.2120 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 512
fft size = 4096K, ave time = 6.2096 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 1024
fft size = 4096K, ave time = 6.2603 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 32
fft size = 4096K, ave time = 6.2692 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 64
fft size = 4096K, ave time = 6.2705 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 128
fft size = 4096K, ave time = 6.2746 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 256
fft size = 4096K, ave time = 6.2736 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 512
fft size = 4096K, ave time = 6.2739 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 1024
CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
It seems that the best timing for my system was

Code:
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
I also tried to modify the "Threads=" parameter in the INI file: the best results were obtained with 128.
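
Picking the fastest row out of a benchmark log like the one above can be done mechanically. This is a minimal Python sketch that assumes the line format shown in the output above; fastest_config is a hypothetical helper. Note that this particular log stops at the CUFFT error before any Norm1 value other than 64 was tried, so the minimum only covers the configurations that actually ran.

```python
import re

# Matches lines such as:
# fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
LINE_RE = re.compile(
    r"fft size = (\d+)K, ave time = ([\d.]+) msec, "
    r"Norm1 threads (\d+), Mult threads (\d+), Norm2 threads (\d+)"
)


def fastest_config(log: str):
    """Return (ms, fft_k, norm1, mult, norm2) for the fastest row, or None."""
    rows = []
    for m in LINE_RE.finditer(log):
        fft_k, ms, norm1, mult, norm2 = m.groups()
        rows.append((float(ms), int(fft_k), int(norm1), int(mult), int(norm2)))
    # Tuples compare by their first element first, i.e. by average time.
    return min(rows) if rows else None


if __name__ == "__main__":
    sample = (
        "fft size = 4096K, ave time = 6.0904 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 32\n"
        "fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32\n"
        "fft size = 4096K, ave time = 6.1389 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 32\n"
    )
    print(fastest_config(sample))
```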

As the program still failed to recognize the file created by the -cufftbench option, I renamed it as asked by the executable, and got this message:

Code:
Using threads: norm1 75846319, mult 6, norm2 0.
over specifications Grid = 174762
try increasing mult threads (6) or decreasing FFT length (4096K)
I rolled back the last change.
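
The "Grid = 174762" figure in that message is consistent with the grid being computed as the FFT length in elements divided by four times the mult thread count, and the device listings earlier in the thread report a maximum grid dimension of 65535 for the GTX 580. The formula below is an inference from this single message, not something taken from the CUDAPm1 source, and mult_grid is a hypothetical name. As a Python sketch:

```python
MAX_GRID_X = 65535  # first value of maxGridSize[3] in the GTX 580 listings in this thread


def mult_grid(fft_k: int, mult_threads: int) -> int:
    """Grid size for the mult kernel under the assumed formula.

    Assumption (inferred, not confirmed): grid = N / (4 * mult_threads),
    with N = fft_k * 1024 elements and integer division.
    """
    return (fft_k * 1024) // (4 * mult_threads)


if __name__ == "__main__":
    for threads in (6, 64, 128):
        grid = mult_grid(4096, threads)
        status = "ok" if grid <= MAX_GRID_X else "over specifications"
        print(f"mult threads {threads}: grid {grid} ({status})")
```

Under that assumption, a mult value of 6 (garbage read from a file in the wrong format) overshoots the limit, while sane values such as 64 or 128 stay well below it.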

My current configuration is the one shown in the second CODE box, with Threads=256 in the INI file, getting 6.3266 ms/iter.

What am I missing?

Luigi

2013-11-22, 18:06   #429
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario
11×311 Posts

Quote:
Originally Posted by ET_
As the program still failed to recognize the file created by the -cufftbench option, I renamed it as asked by the executable

Note that the "<gpu> fft.txt" and "<gpu> threads.txt" files are distinct from each other.

<gpu> fft.txt should look something like
Code:
Device              GeForce GTX 670
Compatibility       3.0
clockRate (MHz)     980
memClockRate (MHz)  3004

  fft    max exp  ms/iter
    4      85933   0.0697
   16     333803   0.1153
   32     657719   0.1306
   36     738083   0.1618
   48     978041   0.1635
... skip a whole bunch of fft lines ...
28800  511382147  76.5273
32768  580225813  79.6749
Whereas "<gpu> threads.txt" should be quite short (and more cryptic), mine looks like:
Code:
17496  256   64  512  45.9160
 3456  256  128   32   8.0790
I suspect it didn't make a "<gpu> threads.txt" file for you because it appears to have failed partway through the process:
Quote:
CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
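
One way to catch this kind of mix-up before the program reads the file is a quick format check. This is a Python sketch, assuming each threads.txt line holds an FFT length, three thread counts and optionally a timing, and assuming thread counts are powers of two between 32 and 1024 (the range seen in the benchmark output quoted earlier). parse_threads_file is a hypothetical helper, not part of CUDAPm1.

```python
def parse_threads_file(text: str):
    """Parse lines of the assumed form: fft norm1 mult norm2 [ms/iter].

    Returns a list of (fft, norm1, mult, norm2) tuples.
    Raises ValueError if a line does not look like a threads.txt entry,
    which is what happens when an fft.txt file is passed in by mistake.
    """
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (4, 5):
            raise ValueError(f"line {lineno}: expected 4 or 5 fields, got {len(fields)}")
        try:
            fft, norm1, mult, norm2 = (int(f) for f in fields[:4])
        except ValueError:
            raise ValueError(f"line {lineno}: non-integer field") from None
        for name, t in (("norm1", norm1), ("mult", mult), ("norm2", norm2)):
            # Assumed valid range: powers of two from 32 to 1024.
            if not (32 <= t <= 1024 and (t & (t - 1)) == 0):
                raise ValueError(f"line {lineno}: implausible {name} thread count {t}")
        entries.append((fft, norm1, mult, norm2))
    return entries


if __name__ == "__main__":
    good = "17496  256   64  512  45.9160\n 3456  256  128   32   8.0790\n"
    print(parse_threads_file(good))
```

Feeding it the first lines of an fft.txt file raises ValueError instead of yielding nonsense thread counts.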
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.