mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

ET_ 2013-11-11 19:50

[QUOTE=owftheevil;358998]How far past stage1 gcd does it get? What is the reported available memory and estimate of memory it will use (beginning of stage 1)? Are you doing stage 1 and stage 2 on different cards?[/QUOTE]

I had the same behavior. It turned out that the outside temperature was too hot... :smile:

Luigi

LaurV 2013-11-12 05:20

Sorry for the lack of details. I will get home in the evening and come back with some output snippets; I still have the stage 1 final files. To confirm: the problem is not related to temperature, to my cards, to my computer, etc. I can reproduce it accurately on different cards (all 580s with 1536M RAM) and different computers. The problem is related to 69M exponents: the program just crashes without giving any error, immediately after the stage 2 init finishes. I tried the -e2 and -d2 switches with different values, and I [B][U]also[/U][/B] tried putting "UnusedMem=" in the ini file. The number of primes and the memory used always differ (anywhere from 1 prime to 27 primes, from 308M to 1370M used, while the card reports ~1400M free), a sign that the switches [B][U]are working[/U][/B]. I also retried stage 1 from previous checkpoints, which finishes without problems, does the gcd, inits stage 2, BOOM!

You know me from the past; I have always been the first to blame heat, dust, whatever, when other people had problems, but this time it is neither a hardware nor a temperature problem. I have an [URL="http://www.mersenneforum.org/showthread.php?t=16829"]external water cooler[/URL] (I mean, outside the house!), with 4 liters of water, two pumps and 12 fans, and at night the outside temperature drops to +16 at this time of year. The installation is designed for the tremendously hot Thai Aprils, when the outside temperature averages 40°C. If I start all the fans NOW, at this time of year, there is no way the temperatures go over 50°C.

Trust me, this time, there is a bug in the program.

owftheevil 2013-11-12 15:09

Once stage 2 has begun, it would be a big mess to allow the various parameters (e, d, nrp) to change. So, in fact, regardless of any command line parameters, if stage 2 has already been initialized, e, d, nrp, b1, and b2 are taken from the savefile. If stage 2 is initialized with 3Gb of available memory and then resumed with only 1.5Gb of available memory, there will be problems. If this is the case, delete the stage 2 savefile; then everything should work fine. If not, I'll need all the information you can give me. Thanks for pointing out this bug.
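The precedence just described can be sketched like this (a toy Python sketch with made-up field names, not CUDAPm1's actual code):

```python
# Toy sketch of the stage 2 parameter precedence described above
# (hypothetical field names, not CUDAPm1's actual data structures).

PARAMS = ("e", "d", "nrp", "b1", "b2")

def resolve_stage2_params(savefile, cli):
    """If stage 2 has already been initialized, e, d, nrp, b1 and b2
    come from the savefile, regardless of the command line."""
    if savefile is not None and savefile.get("stage2_initialized"):
        return {k: savefile[k] for k in PARAMS}
    # No stage 2 checkpoint yet: command-line values (or defaults) apply.
    return {k: cli[k] for k in PARAMS}
```

Deleting the stage 2 savefile corresponds to passing savefile=None here, which lets the command-line values through again.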

LaurV 2013-11-12 15:10

I am back.

Here is the stuff:
[CODE]
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

e:\cpm1_1>
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 2310, e = 2, nrp = 30
Zeros: 644312, Ones: 737128, Pairs: 145269
Processing 1 - 30 of 480 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>cudapm1 69010547 -e2 6 -d2 30
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 30, e = 6, nrp = 8
Zeros: 898832, Ones: 746888, Pairs: 135479
Processing 1 - 8 of 8 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>
[/CODE]

I should mention that there is no "stage 2" checkpoint file, and none is created during the crash. The only file is the last stage 1 checkpoint.

Same story with the undocumented stuff in the ini file, setting the unused memory to 1200M (note that the memory actually used in this case is much less; it could be anything between about 300 and 800M, or so):

[CODE]
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 608M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 210, e = 2, nrp = 1
Zeros: 100255, Ones: 109505, Pairs: 19827
Processing 1 - 1 of 48 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02612, (0.56 real, 3.3270 ms/tran, ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>cudapm1 69010547 -e2 2 -d2 30
CUDAPm1 v0.20
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 352M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 30, e = 2, nrp = 1
Zeros: 132682, Ones: 111990, Pairs: 17316
Processing 1 - 1 of 8 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02441, (0.56 real, 3.3294 ms/tran, ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>[/CODE]

LaurV 2013-11-12 15:14

Due to the post size limit, I had to delete some stuff.

For now I solved the problem very simply: I reported the stage 1 results only (and am not going to do any stage 2), moved all the unstarted 69M exponents into prime95's worktodo, and brought all the 64M exponents from prime95 to cudapm1. Under 69M (fft 3584 and smaller) everything works fine.

I also tried fft 3600; it still works. I also tried 4096, with all the thread combinations I could think of: not working. But it works on cards with 3G of memory. So it is related to some allocation.

edit2: still keeping stage 1 files

owftheevil 2013-11-12 15:42

LaurV, please edit the threads.txt file, changing the norm1 threads to 128, e.g. from

[CODE]4096 64 64 32 [/CODE]

to

[CODE]4096 128 64 32[/CODE]

and try again.

The error you are getting is a round-off error (maybe I could add a line which actually tells you this?). The thread-optimizing function probably has a <= where it needs a < when checking whether the thread sizes are acceptable.
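A toy illustration of that kind of boundary slip (made-up limit semantics, not the actual optimizer code): with 64 norm1 threads, a 4096K fft launched one block per element would need 4096*1024/64 = 65536 blocks, right at the edge of the maxGridSize of 65535 shown in the device listing, and a check that is off by one at such a limit lets a bad size through.

```python
MAX_GRID = 65535  # maxGridSize[0] from the device listing above

def grid_size(fft_len_k, threads):
    # Blocks launched if each thread handles one element of the signal
    # (a simplifying assumption for illustration).
    return (fft_len_k * 1024) // threads

def acceptable(grid, strict=True):
    # The suspected bug in miniature: "<=" accepts the exact boundary
    # value that a strict "<" would reject.
    return grid < MAX_GRID if strict else grid <= MAX_GRID
```

acceptable(MAX_GRID, strict=False) returns True while acceptable(MAX_GRID) returns False; whether the boundary itself is legal is exactly the one-value difference between <= and <.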

LaurV 2013-11-12 16:00

:party:

You are my man! Tell me when you come to northern Thailand so I can fill the refrigerators with beer!

It is working. Plain and simple...

And I am sure I tried many different thread combinations too (in command line only, however, not in the thread tuning file!).

BTW, related to the tuning file: I just wrote a "tuning batch" with a pari/gp for loop, so if someone is tired of tuning all the FFT possibilities, run this batch and keep the "GeForce GTX blabla threads.txt" file. You may have a few FFT sizes which are not in the list, but they won't be more than a few (depending on your card). Saves you a lot of typing...

[CODE]
cudapm1 -cufftbench 4 4 6
cudapm1 -cufftbench 5 5 6
cudapm1 -cufftbench 14 14 6
cudapm1 -cufftbench 32 32 6
cudapm1 -cufftbench 36 36 6
cudapm1 -cufftbench 40 40 6
cudapm1 -cufftbench 64 64 6
cudapm1 -cufftbench 80 80 6
cudapm1 -cufftbench 96 96 6
cudapm1 -cufftbench 98 98 6
cudapm1 -cufftbench 128 128 6
cudapm1 -cufftbench 144 144 6
cudapm1 -cufftbench 160 160 6
cudapm1 -cufftbench 162 162 6
cudapm1 -cufftbench 192 192 6
cudapm1 -cufftbench 224 224 6
cudapm1 -cufftbench 256 256 6
cudapm1 -cufftbench 288 288 6
cudapm1 -cufftbench 320 320 6
cudapm1 -cufftbench 324 324 6
cudapm1 -cufftbench 336 336 6
cudapm1 -cufftbench 384 384 6
cudapm1 -cufftbench 392 392 6
cudapm1 -cufftbench 400 400 6
cudapm1 -cufftbench 448 448 6
cudapm1 -cufftbench 512 512 6
cudapm1 -cufftbench 576 576 6
cudapm1 -cufftbench 640 640 6
cudapm1 -cufftbench 648 648 6
cudapm1 -cufftbench 672 672 6
cudapm1 -cufftbench 720 720 6
cudapm1 -cufftbench 768 768 6
cudapm1 -cufftbench 784 784 6
cudapm1 -cufftbench 800 800 6
cudapm1 -cufftbench 864 864 6
cudapm1 -cufftbench 896 896 6
cudapm1 -cufftbench 1024 1024 6
cudapm1 -cufftbench 1152 1152 6
cudapm1 -cufftbench 1176 1176 6
cudapm1 -cufftbench 1280 1280 6
cudapm1 -cufftbench 1296 1296 6
cudapm1 -cufftbench 1344 1344 6
cudapm1 -cufftbench 1440 1440 6
cudapm1 -cufftbench 1512 1512 6
cudapm1 -cufftbench 1536 1536 6
cudapm1 -cufftbench 1568 1568 6
cudapm1 -cufftbench 1600 1600 6
cudapm1 -cufftbench 1728 1728 6
cudapm1 -cufftbench 1792 1792 6
cudapm1 -cufftbench 2048 2048 6
cudapm1 -cufftbench 2240 2240 6
cudapm1 -cufftbench 2304 2304 6
cudapm1 -cufftbench 2352 2352 6
cudapm1 -cufftbench 2592 2592 6
cudapm1 -cufftbench 2688 2688 6
cudapm1 -cufftbench 2880 2880 6
cudapm1 -cufftbench 3024 3024 6
cudapm1 -cufftbench 3136 3136 6
cudapm1 -cufftbench 3150 3150 6
cudapm1 -cufftbench 3200 3200 6
cudapm1 -cufftbench 3360 3360 6
cudapm1 -cufftbench 3456 3456 6
cudapm1 -cufftbench 3584 3584 6
cudapm1 -cufftbench 3600 3600 6
cudapm1 -cufftbench 4096 4096 6
cudapm1 -cufftbench 4320 4320 6
cudapm1 -cufftbench 4608 4608 6
cudapm1 -cufftbench 4704 4704 6
cudapm1 -cufftbench 5040 5040 6
cudapm1 -cufftbench 5184 5184 6
cudapm1 -cufftbench 5292 5292 6
cudapm1 -cufftbench 5400 5400 6
cudapm1 -cufftbench 5600 5600 6
cudapm1 -cufftbench 5670 5670 6
cudapm1 -cufftbench 5760 5760 6
cudapm1 -cufftbench 6144 6144 6
cudapm1 -cufftbench 6272 6272 6
cudapm1 -cufftbench 6480 6480 6
cudapm1 -cufftbench 6720 6720 6
cudapm1 -cufftbench 6912 6912 6
cudapm1 -cufftbench 7056 7056 6
cudapm1 -cufftbench 7168 7168 6
cudapm1 -cufftbench 7200 7200 6
cudapm1 -cufftbench 7776 7776 6
cudapm1 -cufftbench 8064 8064 6
cudapm1 -cufftbench 8192 8192 6
[/CODE](hehehe)
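For anyone who prefers not to use pari/gp, here is a short Python sketch that emits the same kind of batch (the handful of sizes below are just examples copied from the list above; substitute the full list for your card):

```python
# Sketch of the tuning-batch generator (Python instead of pari/gp).
# example_sizes holds just a few fft lengths from the batch above;
# the real list depends on your card.

example_sizes = [4, 5, 14, 32, 3584, 3600, 4096, 8192]

def tuning_batch(sizes, passes=6, exe="cudapm1"):
    # One "-cufftbench n n passes" invocation per fft size.
    return "\n".join(f"{exe} -cufftbench {n} {n} {passes}" for n in sizes)

print(tuning_batch(example_sizes))
```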


edit: reserving more 69M exponents... :smile: At first I was stupid, trying to bring back those from prime95, until a light bulb turned on in my head...

owftheevil 2013-11-12 16:21

I'm glad it was that simple. Thanks again for your input. I'll get a fix up as soon as my lack of home internet, and Windows adjusting to new hardware, allow.

owftheevil 2013-11-18 20:07

The code, with the fix for what turned out to be a threads issue in one of the kernels, is now up at sourceforge, as is a CUDA 5.5-linked Windows executable.

[URL="https://sourceforge.net/projects/cudapm1/"]https://sourceforge.net/projects/cudapm1[/URL]

ET_ 2013-11-22 17:34

CudaPm1 v0.20
 
I just downloaded and recompiled CudaPm1 v0.20 svn 51.

Before testing an exponent, I ran

[code]
./CudaPm1 -cufftbench 1 8192 1
[/code]

that gave me a file named "[FONT="Courier New"]GeForce GTX 580 fft.txt[/FONT]"

I then ran the program on M67,231.XXX with the default INI file. Everything looked fine.

I also noticed a message from the output:

[code]
No GeForce GTX 580 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDAPm1 -cufftbench 4096 4096 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 128, norm2 128.
[/code]

I ran the program with r = 4 and got the following message:

[code]
CUDA bench, testing various thread sizes for fft 4096K, doing 4 passes.
fft size = 4096K, ave time = 6.0904 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 32
fft size = 4096K, ave time = 6.0979 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 64
fft size = 4096K, ave time = 6.1026 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 128
fft size = 4096K, ave time = 6.1052 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 256
fft size = 4096K, ave time = 6.1038 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 512
fft size = 4096K, ave time = 6.1022 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 1024
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
fft size = 4096K, ave time = 6.0929 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 64
fft size = 4096K, ave time = 6.0960 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 128
fft size = 4096K, ave time = 6.1007 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 256
fft size = 4096K, ave time = 6.0981 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 512
fft size = 4096K, ave time = 6.0974 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 1024
fft size = 4096K, ave time = 6.1389 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 32
fft size = 4096K, ave time = 6.1475 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 64
fft size = 4096K, ave time = 6.1538 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 128
fft size = 4096K, ave time = 6.1559 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 256
fft size = 4096K, ave time = 6.1564 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 512
fft size = 4096K, ave time = 6.1548 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 1024
fft size = 4096K, ave time = 6.1958 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 32
fft size = 4096K, ave time = 6.2042 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 64
fft size = 4096K, ave time = 6.2083 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 128
fft size = 4096K, ave time = 6.2126 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 256
fft size = 4096K, ave time = 6.2120 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 512
fft size = 4096K, ave time = 6.2096 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 1024
fft size = 4096K, ave time = 6.2603 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 32
fft size = 4096K, ave time = 6.2692 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 64
fft size = 4096K, ave time = 6.2705 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 128
fft size = 4096K, ave time = 6.2746 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 256
fft size = 4096K, ave time = 6.2736 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 512
fft size = 4096K, ave time = 6.2739 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 1024
CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
[/code]

It seems that the best timing for my system was

[code]
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
[/code]

I also tried to modify the "[FONT="Courier New"]Threads=[/FONT]" parameter in the INI file: the best results were obtained with [COLOR="Red"]128[/COLOR].

As the program still failed to recognize the file created by the [FONT="Courier New"]-cufftbench[/FONT] option, I renamed it as the executable asked, and got this message:

[code]
Using threads: norm1 75846319, mult 6, norm2 0.
over specifications Grid = 174762
try increasing mult threads (6) or decreasing FFT length (4096K)
[/code]

I rolled back the last change.

My current configuration is the one shown in the second CODE box, with Threads=256 in the INI file, getting 6.3266 ms/iter.

What am I missing? :help:

Luigi

James Heinrich 2013-11-22 18:06

[QUOTE=ET_;359999]As the program still failed to recognize the file created by the [FONT="Courier New"]-cufftbench[/FONT] option, I renamed it as asked by the executable[/QUOTE]Note that the "<gpu> fft.txt" and "<gpu> threads.txt" files are distinct from each other.

<gpu> fft.txt should look something like[code]Device GeForce GTX 670
Compatibility 3.0
clockRate (MHz) 980
memClockRate (MHz) 3004

fft max exp ms/iter
4 85933 0.0697
16 333803 0.1153
32 657719 0.1306
36 738083 0.1618
48 978041 0.1635
... skip a whole bunch of fft lines ...
28800 511382147 76.5273
32768 580225813 79.6749[/code]Whereas "<gpu> threads.txt" should be quite short (and more cryptic), mine looks like:[code]17496 256 64 512 45.9160
3456 256 128 32 8.0790[/code]
I suspect it didn't make a "<gpu> threads.txt" file for you because it appears to have failed partway through the process:[quote]CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/quote]
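For reference, a line of "<gpu> threads.txt" can be split into fields like this (the field meanings — fft length, norm1/mult/norm2 thread counts, and a timing — are inferred from this thread, so treat them as an assumption):

```python
def parse_threads_line(line):
    # Fields inferred from this thread: fft length (K), norm1, mult
    # and norm2 thread counts, plus an optional ms-per-iter timing.
    parts = line.split()
    rec = {"fft": int(parts[0]), "norm1": int(parts[1]),
           "mult": int(parts[2]), "norm2": int(parts[3])}
    if len(parts) > 4:
        rec["ms"] = float(parts[4])
    return rec
```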

