mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

Karl M Johnson 2013-09-29 03:12

How about a global source code search and replace from "CUDALucas" to "CUDAPm1" ?
It's in the ini file too.

owftheevil 2013-09-29 03:51

Thanks. Done, except for the ini file. It needs a complete rewrite anyway.

ixfd64 2013-11-02 04:49

I have a few questions:

1. Has stage 2 saving been implemented?
2. Does this program output Prime95-style timestamps?
3. Considering that GPUs are much faster than CPUs, would it be reasonable to use larger B1 and B2 values?

Sorry if any of them have already been covered.

LaurV 2013-11-02 05:22

[QUOTE=ixfd64;358173]I have a few questions:

1. Has stage 2 saving been implemented?
2. Does this program output Prime95-style timestamps?
3. Considering that GPUs are much faster than CPUs, would it be reasonable to use larger B1 and B2 values?

Sorry if any of them have already been covered.[/QUOTE]
1. Yes it is, and resuming works very nicely (REALLY NICELY!) for both stage 1 and stage 2, but [U]be careful[/U] and DO NOT DELETE the stage [B][U]1[/U][/B] save files when resuming. I don't know if it is a bug or if it was intended to work like that, but if you want to resume stage 2, even if you have the stage 2 checkpoints (so theoretically you don't need the stage 1 checkpoints anymore), the program (is a little bit stupid :razz: and it) will do stage 1 from scratch if the stage 1 checkpoints are not found. I found this by mistake, because normally I keep all the stage 1 files, with big hopes that sometime in the future we will be able to EXTEND the B1 limit, which, in my opinion, is more important than resuming stage 2. As it is now, you can't extend B1 without doing all the stage 1 work from scratch. (edit: which is the case for Prime95 too, and it is a pity, because a lot of contributors wasted time to redo the P-1 stage 1 when they wanted to extend the limits. The main problem is that keeping P-1 huge checkpoint files on the server will take too much space and it will generate too much traffic).

But well... regardless of my dreams, resuming both stage 1 and 2 works very nice in the actual implementation.

2. No. But you don't need those, anyhow manual reports can't parse them and you would need to manually edit your result files, which would be not recommended...

3. You can still specify your own B1 and B2 on the command line; just create a small batch file. I do low expos (under 1M) to high limits this way. The developers promised (see a few posts above) a fully rewritten ini file :razz:, so we wait for the time when B1, B2, b (the base; sometimes specifying a base different from 3 might help, see my pari implementation), e, d, etc. can all be specified in the ini file. Dreaming on...

BTW, same batch files with command lines for testing, same hardware, same limits, same everything, the new version shows "e=2" in the result files, where the old version used to show "e=6". There seems to be no difference in working. Stupid question: why?

Also: (@owftheevil) the residues which are attached to the file names (you know my old fixed idea... can't get it out of my head... :razz:) are wrong. Compare them with the displayed values. They are "one -c step" behind, showing not the residue reached after the calculation, but the residue from which the calculation started. I found this the hard way: I was running one test with "-c 1000", because I didn't want to wait between the outputs (it was just for testing), and then I realized that the disk space was being consumed too fast (due to the many checkpoints), so I switched to -c 10k for the second test (the comparison, witness, DC, whatever you want to call it). Of course, in the meantime I deleted all checkpoints from the first run which were not multiples of 10k... Big mistake! At the end, the remaining files of the first run had the residues from iterations 9000, 19000, 29000, and so on, and the names did not match those of the second run, which had the residues from 10k, 20k, etc., but attached to the wrong files. That is, in the second run the 20k iteration checkpoint had the residue of the 10k iteration, the 30k checkpoint had the residue of the 20k iteration, and so on, whereas in the first run, among the files remaining after deletion, the 10k iteration checkpoint had the residue of the 9k iteration, the 20k checkpoint had the residue of the 19k iteration, and so on. Interestingly enough, the content of the files (when compared ignoring the file name) was identical, and correct.

And to finish on a positive note: I really REALLY love the FFT and especially the threads tuning mechanism. BRILLIANT! You have to try it and see it to believe it. On a GTX 580, you can get about 10% or more performance from thread tuning alone, without touching the clocks or FFT lengths.

ixfd64 2013-11-02 05:32

Thanks for the detailed reply. I hope these shortcomings will soon be eliminated.

Timestamps aren't [I]that[/I] essential, but they can be quite useful for investigating interruptions. For example, if I let the program run unsupervised for a few days and come back to see that it had crashed, the timestamp could quickly help me determine when the crash occurred. In any case, implementing timestamps probably wouldn't require more than a few lines of code. :smile:

LaurV 2013-11-02 05:43

I use the date/time of the checkpoint files for that. Also (see a few posts above), having a batch file like

:label
cudapm1
goto label

is very helpful, as the program may often crash due to a bug in memory allocation of the cuda55 drivers [edit: for my system, this seems to happen only for the card which drives the monitors, and not for the other cards, even if they do physx too, but SLI is disabled, and they are not connected to monitors]. Having the right ini file and starting from a cycling batch as above helps a lot in this situation, the program will restart and resume properly (tried already).
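For anyone who wants both the restart loop and the timestamps discussed above, here is a minimal Python sketch of the same idea. It is not part of CUDAPm1: the function name `run_with_restarts`, the restart cap, and the pause are illustrative assumptions, and it also assumes the program exits with code 0 when it finishes normally, which this thread does not confirm. Unlike the batch file, it stops after a fixed number of restarts instead of looping forever.

```python
import subprocess
import time
from datetime import datetime


def run_with_restarts(cmd, max_restarts=5, pause_s=1.0,
                      runner=subprocess.call, log=print):
    """Launch cmd, relaunching it after a non-zero exit code.

    Hypothetical helper: stops when the exit code is 0 (assumed to mean
    "finished normally") or after max_restarts relaunches.
    Returns the total number of launches.
    """
    launches = 0
    while launches <= max_restarts:
        launches += 1
        # Timestamp every launch and exit so a crash can be located later.
        log(f"{datetime.now():%Y-%m-%d %H:%M:%S} launch {launches}: {' '.join(cmd)}")
        code = runner(cmd)
        log(f"{datetime.now():%Y-%m-%d %H:%M:%S} exit code {code}")
        if code == 0:
            break
        time.sleep(pause_s)
    return launches


# Example (not executed here): run_with_restarts(["cudapm1"], max_restarts=20)
```

The cap is a design choice: a plain `goto` loop keeps relaunching even when every attempt fails at the same point, while a capped loop with timestamps leaves a short log showing when each attempt started and ended.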

LaurV 2013-11-02 05:57

@owftheevil re. extending B1: If it helps, I have a pari implementation of the algorithm, with all the calculus of the power differences for the small primes, etc.
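To make the "extend B1" idea concrete, here is a small Python sketch. It is neither the pari implementation mentioned above nor CUDAPm1's code; the function names are made up for illustration, and it leaves out details a real Mersenne P-1 program handles (for instance, explicitly multiplying the exponent by 2p). Stage 1 raises the base to the product of all maximal prime powers up to B1, so a saved residue can be extended to a larger bound by raising it to the quotient of the two exponent products.

```python
from math import gcd


def primes_up_to(n):
    """Simple sieve of Eratosthenes returning all primes <= n (n >= 2)."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i, flag in enumerate(sieve) if flag]


def stage1_exponent(b1):
    """Product of the largest power of each prime that does not exceed b1."""
    e = 1
    for p in primes_up_to(b1):
        pk = p
        while pk * p <= b1:
            pk *= p
        e *= pk
    return e


def stage1(n, b1, base=3):
    """P-1 stage 1 residue: base^E(b1) mod n."""
    return pow(base, stage1_exponent(b1), n)


def extend_stage1(residue, n, old_b1, new_b1):
    """Continue from a saved residue instead of starting from scratch."""
    # E(new_b1) is always a multiple of E(old_b1), so the division is exact.
    extra = stage1_exponent(new_b1) // stage1_exponent(old_b1)
    return pow(residue, extra, n)


n = 2 ** 67 - 1                       # = 193707721 * 761838257287
saved = stage1(n, 100)                # the "checkpoint" at B1 = 100
extended = extend_stage1(saved, n, 100, 3000)
factor = gcd(extended - 1, n)         # divisible by 193707721
```

The extension works here because 193707721 - 1 = 2^3 * 3^3 * 5 * 67 * 2677, so every prime power of that number is covered once B1 reaches 2677. The point of the sketch is only that the saved residue is enough: the extra work is proportional to the new primes, not to the whole of stage 1.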

flashjh 2013-11-02 16:35

[QUOTE=LaurV;358176](edit: which is the case for Prime95 too, and it is a pity, because a lot of contributors wasted time to redo the P-1 stage 1 when they wanted to extend the limits. The main problem is that keeping P-1 huge checkpoint files on the server will take too much space and it will generate too much traffic). [/QUOTE]
This is not true... see [URL="http://www.mersenneforum.org/showthread.php?p=40816#post40816"]here[/URL]. Also, from the P95 undoc.txt file:
[CODE]By default P-1 work does not delete the save files when the work unit completes.
This lets you run P-1 to a higher bound at a later date. You can force
the program to delete save files by adding this line to prime.txt:
KeepPminus1SaveFiles=0[/CODE]I have experimented with many settings. To extend B1, while B1 is in progress, just increase B1. If you know you'll want to increase B1 later, do B1=B2 and then save the file(s) once complete, noting the current B1 value. When you want to increase B1, just set B1=B2 > current B1 and P95 will continue. If B1 is very high, it may appear like it's not working, but just be patient. (To see the progress, just stop the worker and then restart, you'll see it's going quite fast to get back to where it left off). This allows you to get to a certain B1, run a B2 value and then raise B1 and do another B2 run. You can also continue to raise B1 on another system (or from another folder) and then run B2 again.

LaurV 2013-11-02 17:09

[QUOTE=flashjh;358204]This is not true... see [URL="http://www.mersenneforum.org/showthread.php?p=40816#post40816"]here[/URL].
[/QUOTE]
That is very true. Read what I wrote. I did some exponent to B1=100. How do YOU extend it to B1=1000?

James Heinrich 2013-11-02 17:16

[QUOTE=flashjh;358204]This is not true...[/QUOTE]I think what [i]LaurV[/i] meant is that B1 can't be extended [u]without having the stage1 savefile[/u]. If you do have the savefile then sure you can extend stage1 no problem. And if the PrimeNet server would save all the stage1 savefiles then any user could extend B1 done by any other user, but that would take up too much server storage and bandwidth.

LaurV 2013-11-02 17:28

Yes indeed, you explained it better. I was referring to the fact that if some user wants to extend some B1 (for an expo assigned by primenet or gpu72, for which he didn't do any work before, like there are many on mersenne.ca with insufficient P-1 done), he currently needs to do everything from scratch. That is because storing checkpoint files on the server is costly (not so much to store, since there are cheap big HDDs now, as to download; the traffic would be overkill).

If you look at how many P-1 "extensions" were done, especially for small exponents (I just [URL="http://www.mersenne.org/report_exponent/?exp_lo=219647&exp_hi=&B1=Get+status"]did one recently[/URL]), a lot of resources were wasted by doing stage 1 from scratch, for some expos 5 or 6 times!

Anyhow, in spite of the fact that what Jerry said about P95 is also true, my current problem wasn't P95, but cudapm1.

flashjh 2013-11-02 18:15

Right. I see I overlooked what you meant. I just wanted to point out that P95 can extend B1, so the code for such work exists, even if it can't be used in cudapm1. Extending B2 would be awesome!

TheMawn 2013-11-02 19:40

How efficient are GPU's at P-1? Compared to trial factoring, for example?

James Heinrich 2013-11-02 20:30

[QUOTE=TheMawn;358217]How efficient are GPU's at P-1? Compared to trial factoring, for example?[/QUOTE]I'm sure someone will give a better answer, but:
CUDAPm1 is based on CudaLucas so the relative performance charts on my [url=http://www.mersenne.ca/cudalucas.php]CudaLucas page[/url] compared to my [url=http://www.mersenne.ca/mfaktc.php]mfaktc page[/url] should be vaguely applicable.

The latest version of CUDAPm1 does [url=http://www.mersenneforum.org/showpost.php?p=354013&postcount=375]include a benchmark[/url] but so far only one person has sent me any data from it so I don't want to read too much into such a small sample size.

firejuggler 2013-11-02 21:57

for a 560
[code]
Iteration 32000 M11802799, 0xb8423c5eaf567790, n = 648K, CUDAPm1 v0.10 err = 0.
7080 (0:01 real, 2.5918 ms/iter, ETA 8:17)
Iteration 33000 M11802799, 0x22fb44273d4c946e, n = 648K, CUDAPm1 v0.10 err = 0.
6982 (0:03 real, 2.3335 ms/iter, ETA 7:25)
Iteration 34000 M11802799, 0x50efa92a42ce2b4b, n = 648K, CUDAPm1 v0.10 err = 0.
7031 (0:02 real, 2.3368 ms/iter, ETA 7:23)
Iteration 35000 M11802799, 0xeb03cb8632c5b33b, n = 648K, CUDAPm1 v0.10 err = 0.
6836 (0:02 real, 2.3391 ms/iter, ETA 7:21)
Iteration 36000 M11802799, 0xdd08a619769a545f, n = 648K, CUDAPm1 v0.10 err = 0.
7031 (0:03 real, 2.3192 ms/iter, ETA 7:15)
Iteration 37000 M11802799, 0xd874a96b5ddd0ff7, n = 648K, CUDAPm1 v0.10 err = 0.
6641 (0:02 real, 2.3400 ms/iter, ETA 7:17)
[/code]

and stage 2
[code]
Transforms: 2052 M11802943, 0xde1309e648ca422a, n = 648K, CUDAPm1 v0.10 err = 0
.06641 (0:03 real, 1.1989 ms/tran, ETA 0:07)
Transforms: 2148 M11802943, 0xe642ce5b422dd69d, n = 648K, CUDAPm1 v0.10 err = 0
.07031 (0:02 real, 1.2181 ms/tran, ETA 0:05)
Transforms: 2046 M11802943, 0x894a405d75167ee1, n = 648K, CUDAPm1 v0.10 err = 0
.06641 (0:03 real, 1.1928 ms/tran, ETA 0:02)
Transforms: 2092 M11802943, 0xfcb58d1a13c3410f, n = 648K, CUDAPm1 v0.10 err = 0
.06641 (0:02 real, 1.2029 ms/tran, ETA 0:00)
[/code]

James Heinrich 2013-11-02 22:45

[QUOTE=firejuggler;358231]for a 560[/QUOTE]The benchmark invoked by the linked code snippet generates a benchmark file, and [i]owftheevil[/i] was even kind enough to suggest in the screen output that people email it to me. :smile:[code]./CUDAPm1 -cufftbench 1 8192 1[/code]

firejuggler 2013-11-02 22:54

sorry
[code]
CUDAPm1 v0.10
CUFFT bench start = 1 end = 8192 distance = 1
CUFFT_Z2Z size= 1024 time= 0.008226 msec
[/code]
better?
timing varies between 0.007559 msec to 0.008309 msec

James Heinrich 2013-11-03 01:41

Better, yes, but if you (and anyone else who cares to share) could email me the whole file that'd be great, I can try and establish some expected-performance numbers once I have a decent sample size.

firejuggler 2013-11-03 02:09

1 Attachment(s)
Earlier version only gave me the one line I posted earlier.
Now here is a file that might help. Sent by mail too.
The first version of the file was while I ran Msieve_gpu. So I just reran it.

flashjh 2013-11-03 02:24

1 Attachment(s)
GTX 580, emailed also.

Mark Rose 2013-11-03 04:10

[QUOTE=LaurV;358209]Yes indeed, you explained it better. I was referring to the fact that if some user wants to extend some B1 (for an expo assigned by primenet or gpu72, for which he didn't do any work before, like there are many on mersenne.ca with insufficient P-1 done), he currently needs to do everything from scratch. That is because storing checkpoint files on the server is costly (not so much to store, since there are cheap big HDDs now, as to download; the traffic would be overkill).[/QUOTE]

It doesn't seem that expensive to me. I'm P-1ing M63970349, using 3456K, which result in a save file of about 13.5 MB. My internet connection at home could download or upload that in 2.16 seconds (unlimited 50/50 Mbps symmetrical can be had for $115/month, or 175/175 for $211). Alternatively, I could rent a server with 20 TB/month traffic for $100/month, which would provide for the uploading or downloading of over 1 million save files.
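The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper names are made up, units are decimal (1 TB = 1,000,000 MB), and protocol overhead is ignored.

```python
def transfer_seconds(size_mb, link_mbps):
    """Seconds to move size_mb megabytes over a link of link_mbps megabits/s."""
    return size_mb * 8 / link_mbps


def files_per_quota(quota_tb, size_mb):
    """How many files of size_mb fit in a traffic quota of quota_tb terabytes."""
    return int(quota_tb * 1_000_000 / size_mb)


save_file_mb = 13.5                                # save file size quoted above
seconds = transfer_seconds(save_file_mb, 50)       # 50 Mbps link
per_month = files_per_quota(20, save_file_mb)      # 20 TB/month server
```

With these inputs the transfer takes 2.16 seconds and the monthly quota covers roughly 1.48 million save files, which matches the "over 1 million" estimate.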

James Heinrich 2013-11-03 04:28

[QUOTE=Mark Rose;358288]My internet connection at home could download or upload that in 2.16 seconds (unlimited 50/50 Mbps symmetrical can be had for $115/month, or 175/175 for $211)[/QUOTE]You are most fortunate. Many of us would consider 1Mbps upstream "very good".

Mark Rose 2013-11-03 04:41

[QUOTE=James Heinrich;358290]You are most fortunate. Many of us would consider 1Mbps upstream "very good".[/QUOTE]

I suppose fast internet is one benefit of living in Hogtown. Personally, I'd rather be out in the country! Though I'd probably start scheming to trench fiber...

I do remember the days of having to use a 2400 bits per second connection a mere 14 years ago. My parents took the computer away, but I assembled another out of antique free parts. Downloading JPEGs was a luxury.

LaurV 2013-11-03 05:07

1 Attachment(s)
[QUOTE=James Heinrich;358258]Better, yes, but if you (and anyone else who cares to share) could email me the whole file that'd be great, I can try and establish some expected-performance numbers once I have a decent sample size.[/QUOTE]
here you are for 580

Mini-Geek 2013-11-03 13:04

1 Attachment(s)
FWIW, here's another GTX 560, CUDAPm1 v0.20.

LaurV 2013-11-10 10:29

Trouble with cudapm1. I have a couple of 69M exponents reserved from gpu72, for which stage 1 was done, but stage 2 crashes without any error. Two of my cards wasted the last 20 hours or so, retrying stage 2 (the batch loop, remember?).

When I saw this (in fact, I saw fewer lines in the result file than expected), I stopped the batch and tried different parameters for -e2 and -d2, and even "UnusedMem=" in the ini file, with or without "M" at the end. The result is the same: some memory-related crash (I assume, because the test works on cards with 3G and fails on all cards with 1536M of memory).

I did not report the "stage 1" yet, and still keep the s1 checkpoint files, if someone else is going to try (8MB checkpoint file size, for a 4096k FFT)

ET_ 2013-11-10 19:57

[QUOTE=LaurV;358915]Trouble with cudapm1. I have a couple of 69M exponents reserved from gpu72, for which stage 1 was done, but stage 2 crashes without any error. Two of my cards wasted the last 20 hours or so, retrying stage 2 (the batch loop, remember?).

When I saw this (in fact, I saw fewer lines in the result file than expected), I stopped the batch and tried different parameters for -e2 and -d2, and even "UnusedMem=" in the ini file, with or without "M" at the end. The result is the same: some memory-related crash (I assume, because the test works on cards with 3G and fails on all cards with 1536M of memory).

I did not report the "stage 1" yet, and still keep the s1 checkpoint files, if someone else is going to try (8MB checkpoint file size, for a 4096k FFT)[/QUOTE]

Two weeks on vacation, and I've lost you guys :sad:

Is there any documentation on how to use benchmark files with CUDAPm1?

Luigi

firejuggler 2013-11-10 20:07

Quoting post 404,
[QUOTE=James Heinrich;358234]The benchmark invoked by the linked code snippet generates a benchmark file, and [i]owftheevil[/i] was even kind enough to suggest in the screen output that people email it to me. :smile:[code]./CUDAPm1 -cufftbench 1 8192 1[/code][/QUOTE]

ET_ 2013-11-11 16:52

[QUOTE=firejuggler;358944]Quoting post 404,[/QUOTE]

Thanks :smile: :bow:

Luigi

owftheevil 2013-11-11 19:13

[QUOTE=LaurV;358915]Trouble with cudapm1. I have a couple of 69M exponents reserved from gpu72, for which stage 1 was done, but stage 2 crashes without any error. Two of my cards wasted the last 20 hours or so, retrying stage 2 (the batch loop, remember?).

When I saw this (in fact, I saw fewer lines in the result file than expected), I stopped the batch and tried different parameters for -e2 and -d2, and even "UnusedMem=" in the ini file, with or without "M" at the end. The result is the same: some memory-related crash (I assume, because the test works on cards with 3G and fails on all cards with 1536M of memory).

I did not report the "stage 1" yet, and still keep the s1 checkpoint files, if someone else is going to try (8MB checkpoint file size, for a 4096k FFT)[/QUOTE]

How far past stage1 gcd does it get? What is the reported available memory and estimate of memory it will use (beginning of stage 1)? Are you doing stage 1 and stage 2 on different cards?

ET_ 2013-11-11 19:50

[QUOTE=owftheevil;358998]How far past stage1 gcd does it get? What is the reported available memory and estimate of memory it will use (beginning of stage 1)? Are you doing stage 1 and stage 2 on different cards?[/QUOTE]

I had the same behavior. It turned out that the outside temperature was too hot... :smile:

Luigi

LaurV 2013-11-12 05:20

Sorry for the lack of details. I will get home in the evening and come back with some output snippets. I still have the stage 1 final files. To confirm: the problem is not related to temperature, to my cards, to my computer, etc. I can accurately reproduce it on different cards (all 580s with 1536M RAM) and different computers. The problem is related to 69M exponents: the program just crashes without giving any error, immediately after the init of stage 2 finishes. I tried the -e2 and -d2 switches with different values, and I [B][U]also[/U][/B] tried to put "UnusedMem=" in the ini file. The number of primes and the memory used always differ (like from 1 prime to 27 primes, from 308M to 1370M used; the card says ~1400M free), a sign that the switches [B][U]are working[/U][/B]. I also retried stage 1 from previous checkpoints, which can finish without problem, do the gcd, init stage 2, BOOM!

You know me from the past, I have always been the first to blame heat, dust, whatever, when other people had problems, but now, it is not a hardware nor a temperature problem. I have an [URL="http://www.mersenneforum.org/showthread.php?t=16829"]external water cooler[/URL] (I mean, outside of the house!), with 4 liters of water, two pumps and 12 fans, and at night the temperature outside drops to +16 at this time of the year. The installation is designed for tremendously hot Thai Aprils, when the temperature outside is on average 40 degrees C. If I start all the fans NOW, at this time of the year, there is no way that the temperatures go over 50C.

Trust me, this time, there is a bug in the program.

owftheevil 2013-11-12 15:09

Once stage 2 has begun, it would be a big mess to allow the various parameters e, d, nrp to change. So in fact, regardless of any command line parameters, if stage 2 has already been initialized, e, d, nrp, b1, and b2 are taken from the savefile. If stage 2 is initialized with 3GB of available memory and then resumed with only 1.5GB of available memory, there will be problems. If this is the case, delete the stage 2 savefile. Then everything should work fine. If not, I'll need all the information you can give me. Thanks for pointing out this bug.

LaurV 2013-11-12 15:10

I am back.

here is the stuff:
[CODE]
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

e:\cpm1_1>
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 2310, e = 2, nrp = 30
Zeros: 644312, Ones: 737128, Pairs: 145269
Processing 1 - 30 of 480 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>cudapm1 69010547 -e2 6 -d2 30
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 1312M GPU memory.
Selected B1=640000, B2=14400000, 3.7% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 14400000, d = 30, e = 6, nrp = 8
Zeros: 898832, Ones: 746888, Pairs: 135479
Processing 1 - 8 of 8 relative primes.
Inititalizing pass... )

Quitting, estimated time spent = 0:01

e:\cpm1_1>
[/CODE]

Note that there is no "stage 2" checkpoint file, and none is created during the crash. The only file is the last stage 1 checkpoint.

Same story with the undocumented stuff in the ini file, setting the unused mem to 1200M (note that the used mem in this case is much less; it could be anything between about 300M and 800M).

[CODE]
e:\cpm1_1>cudapm1 69010547
CUDAPm1 v0.20
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 608M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 210, e = 2, nrp = 1
Zeros: 100255, Ones: 109505, Pairs: 19827
Processing 1 - 1 of 48 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02612, (0.56 real, 3.3270 ms/tran, ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>cudapm1 69010547 -e2 2 -d2 30
CUDAPm1 v0.20
------- DEVICE 1 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 786432
sharedMemPerBlock zu
regsPerBlock 32768
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 1432M of 1536M GPU memory free.
Using threads: norm1 64, mult 64, norm2 32.
No stage 2 checkpoint.
Using up to 352M GPU memory.
Selected B1=495000, B2=2475000, 2.48% chance of finding a factor
Using B1 = 640000 from savefile.
Continuing stage 2 from a partial result of M69010547 fft length = 4096K
Starting stage 2.
Using b1 = 640000, b2 = 2475000, d = 30, e = 2, nrp = 1
Zeros: 132682, Ones: 111990, Pairs: 17316
Processing 1 - 1 of 8 relative primes.
Inititalizing pass... done. transforms: 168, err = 0.02441, (0.56 real, 3.3294 ms/tran, ETA NA)

Quitting, estimated time spent = 0:00

e:\cpm1_1>[/CODE]

LaurV 2013-11-12 15:14

post size limit, I had to delete some stuff

For now I solved the problem very simple: I reported the stage 1 only (and not going to do any stage 2), and I moved all the unstarted 69M exponents into prime95's worktodo, and I brought all the 64M exponents from prime95 to cudapm1. Under 69M (fft 3584 and smaller) everything works fine.

I also tried fft 3600, it is still working. I tried 4096 also with all threads combinations I could think of, not working. But working in cards with 3G memory. So, it is related to some allocation.

edit2: still keeping stage 1 files

owftheevil 2013-11-12 15:42

LaurV please edit the threads.txt file, changing the norm1 threads to 128. e.g. from

[CODE]4096 64 64 32 [/CODE]

to

[CODE]4096 128 64 32[/CODE]

and try again.

The error you are getting is a round off error (Maybe I could add a line which actually tells you this?). The thread optimizing function probably has a <= where it needs a < when checking if the thread sizes are acceptable.

LaurV 2013-11-12 16:00

:party:

You are my man! Tell me when you come to northern Thailand so I fill the refrigerators with beer!

It is working. Plain and simple...

And I am sure I tried many different thread combinations too (in command line only, however, not in the thread tuning file!).

BTW, related to the tuning file: I just wrote a "tuning batch" with a pari/gp for loop, so if someone is tired of tuning all the FFT possibilities by hand, run this batch and keep the "GeForce GTX blabla threads.txt" file. You may have a few FFT sizes which are not in the list, but there won't be more than a few (depending on your card). Saves you a lot of typing...

[CODE]
cudapm1 -cufftbench 4 4 6
cudapm1 -cufftbench 5 5 6
cudapm1 -cufftbench 14 14 6
cudapm1 -cufftbench 32 32 6
cudapm1 -cufftbench 36 36 6
cudapm1 -cufftbench 40 40 6
cudapm1 -cufftbench 64 64 6
cudapm1 -cufftbench 80 80 6
cudapm1 -cufftbench 96 96 6
cudapm1 -cufftbench 98 98 6
cudapm1 -cufftbench 128 128 6
cudapm1 -cufftbench 144 144 6
cudapm1 -cufftbench 160 160 6
cudapm1 -cufftbench 162 162 6
cudapm1 -cufftbench 192 192 6
cudapm1 -cufftbench 224 224 6
cudapm1 -cufftbench 256 256 6
cudapm1 -cufftbench 288 288 6
cudapm1 -cufftbench 320 320 6
cudapm1 -cufftbench 324 324 6
cudapm1 -cufftbench 336 336 6
cudapm1 -cufftbench 384 384 6
cudapm1 -cufftbench 392 392 6
cudapm1 -cufftbench 400 400 6
cudapm1 -cufftbench 448 448 6
cudapm1 -cufftbench 512 512 6
cudapm1 -cufftbench 576 576 6
cudapm1 -cufftbench 640 640 6
cudapm1 -cufftbench 648 648 6
cudapm1 -cufftbench 672 672 6
cudapm1 -cufftbench 720 720 6
cudapm1 -cufftbench 768 768 6
cudapm1 -cufftbench 784 784 6
cudapm1 -cufftbench 800 800 6
cudapm1 -cufftbench 864 864 6
cudapm1 -cufftbench 896 896 6
cudapm1 -cufftbench 1024 1024 6
cudapm1 -cufftbench 1152 1152 6
cudapm1 -cufftbench 1176 1176 6
cudapm1 -cufftbench 1280 1280 6
cudapm1 -cufftbench 1296 1296 6
cudapm1 -cufftbench 1344 1344 6
cudapm1 -cufftbench 1440 1440 6
cudapm1 -cufftbench 1512 1512 6
cudapm1 -cufftbench 1536 1536 6
cudapm1 -cufftbench 1568 1568 6
cudapm1 -cufftbench 1600 1600 6
cudapm1 -cufftbench 1728 1728 6
cudapm1 -cufftbench 1792 1792 6
cudapm1 -cufftbench 2048 2048 6
cudapm1 -cufftbench 2240 2240 6
cudapm1 -cufftbench 2304 2304 6
cudapm1 -cufftbench 2352 2352 6
cudapm1 -cufftbench 2592 2592 6
cudapm1 -cufftbench 2688 2688 6
cudapm1 -cufftbench 2880 2880 6
cudapm1 -cufftbench 3024 3024 6
cudapm1 -cufftbench 3136 3136 6
cudapm1 -cufftbench 3150 3150 6
cudapm1 -cufftbench 3200 3200 6
cudapm1 -cufftbench 3360 3360 6
cudapm1 -cufftbench 3456 3456 6
cudapm1 -cufftbench 3584 3584 6
cudapm1 -cufftbench 3600 3600 6
cudapm1 -cufftbench 4096 4096 6
cudapm1 -cufftbench 4320 4320 6
cudapm1 -cufftbench 4608 4608 6
cudapm1 -cufftbench 4704 4704 6
cudapm1 -cufftbench 5040 5040 6
cudapm1 -cufftbench 5184 5184 6
cudapm1 -cufftbench 5292 5292 6
cudapm1 -cufftbench 5400 5400 6
cudapm1 -cufftbench 5600 5600 6
cudapm1 -cufftbench 5670 5670 6
cudapm1 -cufftbench 5760 5760 6
cudapm1 -cufftbench 6144 6144 6
cudapm1 -cufftbench 6272 6272 6
cudapm1 -cufftbench 6480 6480 6
cudapm1 -cufftbench 6720 6720 6
cudapm1 -cufftbench 6912 6912 6
cudapm1 -cufftbench 7056 7056 6
cudapm1 -cufftbench 7168 7168 6
cudapm1 -cufftbench 7200 7200 6
cudapm1 -cufftbench 7776 7776 6
cudapm1 -cufftbench 8064 8064 6
cudapm1 -cufftbench 8192 8192 6
[/CODE](hehehe)
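A batch like the one above can also be generated instead of typed. The following Python sketch is an illustration only (the original was produced with a pari/gp loop that is not shown in this thread); the function name `tuning_commands` is hypothetical, and the list of FFT sizes has to be supplied by the user. The command format `-cufftbench <size> <size> <passes>` is copied from the batch itself.

```python
def tuning_commands(fft_sizes_k, passes=6, exe="cudapm1"):
    """Build one thread-tuning command line per FFT size (in K).

    passes defaults to 6 only to mirror the batch above.
    """
    return [f"{exe} -cufftbench {size} {size} {passes}" for size in fft_sizes_k]


# A short excerpt of the size list used in the batch above.
sizes = [4, 5, 14, 32, 64, 648, 4096, 8192]
batch_text = "\n".join(tuning_commands(sizes))
```

Writing `batch_text` to a `.bat` file gives the same kind of script, and generating it avoids slips such as a missing space or a mismatched start and end size.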


edit: reserving more 69M exponents... :smile: I first was stupid, trying to bring back those from prime95, until some light bulb turned on in my head...

owftheevil 2013-11-12 16:21

I'm glad it was that simple. Thanks again for your input. I'll get a fix up as soon as no home internet and windows adjusting to new hardware allow.

owftheevil 2013-11-18 20:07

The code with the fix of what turned out to be a threads issue on one of the kernels is now up at sourceforge, as is a cuda5.5 linked windows executable.

[URL="https://sourceforge.net/projects/cudapm1/"]https://sourceforge.net/projects/cudapm1[/URL]

ET_ 2013-11-22 17:34

CudaPm1 v0.20
 
I just downloaded and recompiled CudaPm1 v0.20 svn 51.

Before testing an exponent, I ran

[code]
./CudaPm1 -cufftbench 1 8192 1
[/code]

that gave me a file named "[FONT="Courier New"]GeForce GTX 580 fft.txt[/FONT]"

I then ran the program on M67,231.XXX with the default INI file. Everything looked fine.

I also noticed a message from the output:

[code]
No GeForce GTX 580 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDAPm1 -cufftbench 4096 4096 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 128, norm2 128.
[/code]

I ran the program with r = 4 and got the following message:

[code]
CUDA bench, testing various thread sizes for fft 4096K, doing 4 passes.
fft size = 4096K, ave time = 6.0904 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 32
fft size = 4096K, ave time = 6.0979 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 64
fft size = 4096K, ave time = 6.1026 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 128
fft size = 4096K, ave time = 6.1052 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 256
fft size = 4096K, ave time = 6.1038 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 512
fft size = 4096K, ave time = 6.1022 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 1024
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
fft size = 4096K, ave time = 6.0929 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 64
fft size = 4096K, ave time = 6.0960 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 128
fft size = 4096K, ave time = 6.1007 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 256
fft size = 4096K, ave time = 6.0981 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 512
fft size = 4096K, ave time = 6.0974 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 1024
fft size = 4096K, ave time = 6.1389 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 32
fft size = 4096K, ave time = 6.1475 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 64
fft size = 4096K, ave time = 6.1538 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 128
fft size = 4096K, ave time = 6.1559 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 256
fft size = 4096K, ave time = 6.1564 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 512
fft size = 4096K, ave time = 6.1548 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 1024
fft size = 4096K, ave time = 6.1958 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 32
fft size = 4096K, ave time = 6.2042 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 64
fft size = 4096K, ave time = 6.2083 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 128
fft size = 4096K, ave time = 6.2126 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 256
fft size = 4096K, ave time = 6.2120 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 512
fft size = 4096K, ave time = 6.2096 msec, Norm1 threads 64, Mult threads 256, Norm2 threads 1024
fft size = 4096K, ave time = 6.2603 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 32
fft size = 4096K, ave time = 6.2692 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 64
fft size = 4096K, ave time = 6.2705 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 128
fft size = 4096K, ave time = 6.2746 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 256
fft size = 4096K, ave time = 6.2736 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 512
fft size = 4096K, ave time = 6.2739 msec, Norm1 threads 64, Mult threads 512, Norm2 threads 1024
CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
[/code]

It seems that the best timing for my system was

[code]
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
[/code]
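
Picking the fastest line by eye is error-prone with this many rows, so here is a minimal Python sketch that does it from the captured output. The regular expression is written against the line format shown in the benchmark output above; the function name `best_thread_config` is made up for this example and is not part of CUDAPm1. Note that the run above crashed before any Norm1 value other than 64 was tried, so the result is only the best of the combinations that actually ran.

```python
import re

# Matches one line of the thread-size benchmark output shown above.
LINE_RE = re.compile(
    r"fft size = (\d+)K, ave time = ([\d.]+) msec, "
    r"Norm1 threads (\d+), Mult threads (\d+), Norm2 threads (\d+)"
)

def best_thread_config(output):
    """Return (ms, norm1, mult, norm2) for the fastest line, or None."""
    best = None
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip headers and error lines such as CUFFT_EXEC_FAILED
        config = (float(m.group(2)), int(m.group(3)),
                  int(m.group(4)), int(m.group(5)))
        if best is None or config[0] < best[0]:
            best = config
    return best

# A few lines copied from the output above, including the final error line.
sample = """\
fft size = 4096K, ave time = 6.0904 msec, Norm1 threads 64, Mult threads 32, Norm2 threads 32
fft size = 4096K, ave time = 6.0837 msec, Norm1 threads 64, Mult threads 64, Norm2 threads 32
fft size = 4096K, ave time = 6.1389 msec, Norm1 threads 64, Mult threads 128, Norm2 threads 32
CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
"""
print(best_thread_config(sample))  # (6.0837, 64, 64, 32)
```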

I also tried to modify the "[FONT="Courier New"]Threads=[/FONT]" parameter in the INI file: the best results were obtained with [COLOR="Red"]128[/COLOR].

As the program still failed to recognize the file created by the [FONT="Courier New"]-cufftbench[/FONT] option, I renamed it as asked by the executable, and got this message:

[code]
Using threads: norm1 75846319, mult 6, norm2 0.
over specifications Grid = 174762
try increasing mult threads (6) or decreasing FFT length (4096K)
[/code]

I rolled back the last change.

My current configuration is the one shown in the second CODE box, with Threads=256 in the INI file, which gives 6.3266 ms/iter.

What am I missing? :help:

Luigi

James Heinrich 2013-11-22 18:06

[QUOTE=ET_;359999]As the program still failed to recognize the file created by the [FONT="Courier New"]-cufftbench[/FONT] option, I renamed it as asked by the executable[/QUOTE]Note that the "<gpu> fft.txt" and "<gpu> threads.txt" files are distinct from each other.

<gpu> fft.txt should look something like[code]Device GeForce GTX 670
Compatibility 3.0
clockRate (MHz) 980
memClockRate (MHz) 3004

fft max exp ms/iter
4 85933 0.0697
16 333803 0.1153
32 657719 0.1306
36 738083 0.1618
48 978041 0.1635
... skip a whole bunch of fft lines ...
28800 511382147 76.5273
32768 580225813 79.6749[/code]Whereas "<gpu> threads.txt" should be quite short (and more cryptic), mine looks like:[code]17496 256 64 512 45.9160
3456 256 128 32 8.0790[/code]
I suspect it didn't make a "<gpu> threads.txt" file for you because it appears to have failed partway through the process:[quote]CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/quote]
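
The odd values in the "Using threads: norm1 75846319, mult 6, norm2 0" message are consistent with a row of the fft.txt file being read as if it were a threads.txt row. The Python sketch below illustrates that reading. It rests on several assumptions: that a threads.txt row is laid out as `fft norm1 mult norm2 ms`, that the program reads the fields as integers (emulated here by taking the integer prefix of each token), and that the grid size is the FFT length divided by four times the mult thread count. None of this is confirmed from the CUDAPm1 source; the divisor of 4 is chosen only because it reproduces the "Grid = 174762" figure. The timing 6.0904 is a stand-in for the unknown ms/iter value in the renamed file, and any 6.x value gives the same result.

```python
import re

def leading_int(token):
    """Integer prefix of a token, roughly what a C "%d" conversion would read."""
    m = re.match(r"[+-]?\d+", token)
    return int(m.group()) if m else None

def read_as_threads_row(line):
    """Read a line as 'fft norm1 mult norm2'; fields that are missing stay 0."""
    values = [0, 0, 0, 0]
    for i, token in enumerate(line.split()[:4]):
        n = leading_int(token)
        if n is None:
            break
        values[i] = n
    return tuple(values)

# An fft.txt-style row (fft, max exp, ms/iter) misread as a threads.txt row.
fft, norm1, mult, norm2 = read_as_threads_row("4096 75846319 6.0904")

n = fft * 1024           # 4096K FFT length in elements
grid = n // (4 * mult)   # assumed grid formula, see the note above
MAX_GRID = 65535         # maxGridSize reported for the GTX 580 later in the thread

print(norm1, mult, norm2)     # 75846319 6 0
print(grid, grid > MAX_GRID)  # 174762 True
```

A genuine threads.txt row such as `3456 256 128 32 8.0790` passes through the same reader as fft 3456 with threads 256, 128, 32.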

ET_ 2013-11-22 18:12

[QUOTE=James Heinrich;360002]Note that the "<gpu> fft.txt" and "<gpu> threads.txt" files are distinct from each other.

<gpu> fft.txt should look something like[code]Device GeForce GTX 670
Compatibility 3.0
clockRate (MHz) 980
memClockRate (MHz) 3004

fft max exp ms/iter
4 85933 0.0697
16 333803 0.1153
32 657719 0.1306
36 738083 0.1618
48 978041 0.1635
... skip a whole bunch of fft lines ...
28800 511382147 76.5273
32768 580225813 79.6749[/code]Whereas "<gpu> threads.txt" should be quite short (and more cryptic), mine looks like:[code]17496 256 64 512 45.9160
3456 256 128 32 8.0790[/code]
I suspect it didn't make a "<gpu> threads.txt" file for you because it appears to have failed partway through the process:[/QUOTE]

Thanks James. :bow:

From what you said, I assume that there should be 2 distinct files: the first created by -cufftbench 1 8192 1, the second by -cufftbench 4096 4096 4

I'll try to modify the [COLOR="Red"]r[/COLOR] parameter of the second bench run and see if it suffices.

Luigi

ET_ 2013-11-22 18:26

[QUOTE=ET_;360003]Thanks James. :bow:

From what you said, I assume that there should be 2 distinct files: the first created by -cufftbench 1 8192 1, the second by -cufftbench 4096 4096 4

I'll try to modify the [COLOR="Red"]r[/COLOR] parameter of the second bench run and see if it suffices.

Luigi[/QUOTE]

Sadly, I always get "[FONT="Courier New"][COLOR="Red"]CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/COLOR][/FONT]" with r between 1 and 5 and Threads=128 or 256.

Hints?

Luigi

owftheevil 2013-11-22 19:57

Does it always fail at the same place in the test?

Also try putting

[CODE]cutilSafeThreadSync();[/CODE]

after the cufft call on line 2161 and after the square call on 2162. That will at least tell us what is failing.

ET_ 2013-11-22 20:05

[QUOTE=owftheevil;360013]Does it always fail at the same place in the test?

Also try putting

[CODE]cutilSafeThreadSync();[/CODE]

after the cufft call on line 2161 and after the square call on 2162. That will at least tell us what is failing.[/QUOTE]

Yes, it always fails at the same place.

Added the line in the 2 places you asked. A new result:

[code]
CUDAPm1.cu(2165) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED
[/code]

Added a new sync after line 2165: same error.

Luigi

owftheevil 2013-11-22 20:10

Sorry, I jumped too quickly on the safecall stuff. More is needed. Let me think a bit.

ET_ 2013-11-22 20:14

[QUOTE=owftheevil;360017]Sorry, I jumped too quickly on the safecall stuff. More is needed. Let me think a bit.[/QUOTE]

No hurry. I'm actually playing with Threads=128 and the program is working: I just tried to squeeze some more juice from it.

I'll be quietly waiting for your thoughts, thank you.

Luigi :smile:

owftheevil 2013-11-22 20:38

Could you try this little snippet after the square call on 2162?

[CODE]cudaThreadSynchronize();
{
cudaError_t error = cudaGetLastError();
if(error != cudaSuccess)
{
printf("CUDA error: %s\n", cudaGetErrorString(error));
exit(2);
}
}[/CODE]

ET_ 2013-11-22 21:13

[QUOTE=owftheevil;360021]Could you try this little snippet after the square call on 2162?

[CODE]cudaThreadSynchronize();
{
cudaError_t error = cudaGetLastError();
if(error != cudaSuccess)
{
printf("CUDA error: %s\n", cudaGetErrorString(error));
exit(2);
}
}[/CODE][/QUOTE]

The error is:

[code]
CUDA error: too many resources requested for launch
[/code]

while the environment is:

[code]
------- DEVICE 0 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1594
memClockRate (MHz) 2025
totalGlobalMem 1610285056
totalConstMem 65536
l2CacheSize 786432
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
[/code]

HTH... :smile: thanks.

Luigi

ET_ 2013-11-22 21:15

Sorry for the delay... I was dining.

owftheevil 2013-11-22 22:42

Thanks for getting back with that. The only thing I can think of right now is that somehow, either t2 or the threads array have messed up values. I'll look at it over the weekend and get back on Monday.

ET_ 2013-11-23 11:03

[QUOTE=owftheevil;360032]Thanks for getting back with that. The only thing I can think of right now is that somehow, either t2 or the threads array have messed up values. I'll look at it over the weekend and get back on Monday.[/QUOTE]

Thanks :bow:

I add that I am using Linux_64, driver 304.88.

CUDA version info:

[code]
CUDA version info
binary compiled for CUDA 4.10
CUDA runtime version 4.10
CUDA driver version 5.0
[/code]

Luigi

James Heinrich 2013-11-23 14:29

[QUOTE=ET_;360005]Sadly, I always get "[FONT="Courier New"][COLOR="Red"]CUDAPm1.cu(2163) : cufftSafeCall() CUFFT error 6: CUFFT_EXEC_FAILED[/COLOR][/FONT]" with r between 1 and 5 and Threads=128 or 256.[/QUOTE]I just tried running the FFT benchmark (CudaPm1 -cufftbench 1 8192 1) on my new GTX 580, and I also got failure:[code]...
fft size = 3645K, ave time = 6.4376 msec, max-ave = 0.00000
fft size = 3675K, ave time = 6.9818 msec, max-ave = 0.00000
fft size = 3750K, ave time = 6.7061 msec, max-ave = 0.00000
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(2279)
: cudaSafeCall() Runtime API error 30: unknown error.[/code]Screen went black for a second as the NVIDIA drivers recovered from the crash.
Win7, drivers v331.82, GTX 580 3GB
Line number is slightly different, but this is the 24-Sep-2013 Windows binary if that helps.

[i]edit: but a second attempt at running the same command, with no changes, resulted in success.[/i] :cmd:

kladner 2013-11-23 15:46

Owftheevil has said that some errors like this one are caused by a problem in the nVidia drivers starting with the 3xx series. While I have 64 bit drivers going back to 285.62, I have assumed that it is not worth trying to install anything that old as they are probably not compatible with current CUDA libraries.

ET_ 2013-11-23 16:35

[QUOTE=James Heinrich;360071]I just tried running the FFT benchmark (CudaPm1 -cufftbench 1 8192 1) on my new GTX 580, and I also got failure:[code]...
fft size = 3645K, ave time = 6.4376 msec, max-ave = 0.00000
fft size = 3675K, ave time = 6.9818 msec, max-ave = 0.00000
fft size = 3750K, ave time = 6.7061 msec, max-ave = 0.00000
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(2279)
: cudaSafeCall() Runtime API error 30: unknown error.[/code]Screen went black for a second as the NVIDIA drivers recovered from the crash.
Win7, drivers v331.82, GTX 580 3GB
Line number is slightly different, but this is the 24-Sep-2013 Windows binary if that helps.

[i]edit: but a second attempt at running the same command, with no changes, resulted in success.[/i] :cmd:[/QUOTE]

I got the error while running [COLOR="Red"]Cudapm1 -cufftbench 4096 4096 3[/COLOR] and reaching [COLOR="Red"]Mult threads 1024[/COLOR].

My run of [COLOR="SeaGreen"]Cudapm1 -cufftbench 1 8192 1[/COLOR] ran smoothly :smile:

Luigi

owftheevil 2013-11-25 15:36

Revision 52, now up at SourceForge, has a partial fix. I haven't tested this; power was off due to a snowstorm this weekend. It might not even compile. But barring any stupid mistakes, it should allow you to run that benchmark. Looks like CUDA 4.1 is not as good at optimizing register use as 5.5. It will still fail in stage 2 if you try to test with mult threads = 1024, but I will wait until I can test it to make all the other necessary changes.
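
The link between register use and the "too many resources requested for launch" error can be sketched with simple arithmetic. The Python below uses the `regsPerBlock`, `maxThreadsPerBlock`, and `warpSize` values from the GTX 580 listing earlier in the thread. The per-thread register counts (32 and 34) are hypothetical: the real counts for the CUDAPm1 kernels are not given here and would have to be read from the compiler, for example with nvcc's `--ptxas-options=-v`. The check also ignores register allocation granularity and shared memory limits, so treat it as a rough bound rather than what the driver computes.

```python
def max_threads_for_registers(regs_per_block, regs_per_thread, warp_size=32):
    """Largest block size (a multiple of warp_size) whose registers fit."""
    threads = regs_per_block // regs_per_thread
    return (threads // warp_size) * warp_size

def launch_fits(threads_per_block, regs_per_thread,
                regs_per_block=32768, max_threads_per_block=1024):
    """Rough check: the block must respect both the thread and register limits."""
    if threads_per_block > max_threads_per_block:
        return False
    return threads_per_block * regs_per_thread <= regs_per_block

# Hypothetical register counts: 32 just fits at 1024 threads, 34 does not.
for regs in (32, 34):
    print(regs, launch_fits(1024, regs), max_threads_for_registers(32768, regs))
```

With 32768 registers per block, a 1024-thread block leaves a budget of 32 registers per thread, so a kernel that compiles to slightly more under one toolkit than another could fail only at the largest thread setting.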

ET_ 2013-11-25 19:04

[QUOTE=owftheevil;360263]Revision 52, now up at SourceForge, has a partial fix. I haven't tested this; power was off due to a snowstorm this weekend. It might not even compile. But barring any stupid mistakes, it should allow you to run that benchmark. Looks like CUDA 4.1 is not as good at optimizing register use as 5.5. It will still fail in stage 2 if you try to test with mult threads = 1024, but I will wait until I can test it to make all the other necessary changes.[/QUOTE]

Same error. :no:

I'm planning on updating to 5.5 (although I'd rather wait for 6.0...)

Luigi

flashjh 2013-11-25 19:59

[QUOTE=ET_;360278]Same error. :no:

I'm planning on updating to 5.5 (although I'd rather wait for 6.0...)

Luigi[/QUOTE]
It's going to be a while before 6 comes out.

ET_ 2013-11-26 09:18

[QUOTE=flashjh;360282]It's going to be a while before 6 comes out.[/QUOTE]

Thanks for the hint. Now I have no reason to wait further...

Luigi

firejuggler 2014-04-14 19:38

speed test on a GTX 750 ti
[code]
Device GeForce GTX 750 Ti
Compatibility 5.0
clockRate (MHz) 1110
memClockRate (MHz) 2700

fft max exp ms/iter
4 85933 0.0725
8 169409 0.1211
16 333803 0.1354
18 374587 0.1556
20 415253 0.1585
25 516589 0.1904
28 577177 0.1980
32 657719 0.1988
36 738083 0.2152
40 818239 0.2225
48 978041 0.2775
50 1017889 0.2837
56 1137271 0.2938
64 1296011 0.3271
72 1454273 0.3766
80 1612249 0.4374
81 1631969 0.4650
84 1691093 0.4911
90 1809193 0.5352
96 1927129 0.5400
100 2005673 0.5505
112 2240863 0.5863
128 2553659 0.6611
135 2690201 0.7673
144 2865601 0.7755
160 3176779 0.8483
168 3332107 0.9370
180 3564823 1.0058
200 3951977 1.0640
216 4261051 1.1387
224 4415431 1.1826
225 4434721 1.2089
256 5031737 1.2806
288 5646379 1.4322
320 6259537 1.7545
324 6336103 1.7849
360 7024163 1.9349
392 7634537 2.0081
400 7786967 2.1736
432 8395997 2.2762
448 8700169 2.3520
450 8738161 2.4762
512 9914521 2.4840
576 11125619 3.0138
588 11352347 3.4336
640 12333809 3.4786
648 12484649 3.4864
720 13840423 3.8105
729 14009689 3.9701
800 15343429 4.1448
864 16543493 4.5113
896 17142793 4.7358
900 17217653 5.1151
1024 19535569 5.1737
1080 20580341 5.9243
1152 21921901 6.0684
1280 24302527 6.8055
1296 24599717 7.0482
1344 25490893 7.6906
1350 25602229 7.7762
1440 27271147 7.9282
1512 28604657 8.2026
1568 29640913 8.3392
1600 30232693 8.4548
1728 32597297 9.1322
1792 33778141 9.5213
1800 33925711 10.2388
2048 38492887 10.4488
2304 43194913 11.8395
2560 47885689 13.6211
2592 48471289 14.2237
2688 50227213 15.4735
2880 53735041 16.1879
2916 54392209 16.8134
3072 57237889 17.1524
3136 58404433 17.1638
3200 59570449 18.1494
3240 60298969 18.7758
3584 66556463 19.2476
4096 75846319 21.2812
4608 85111207 25.4287
4800 88579669 28.4094
5120 94353877 28.4263
5184 95507747 29.3381
5376 98967641 32.0367
5600 103000823 32.5258
5760 105879517 33.4296
5832 107174381 33.9940
6048 111056879 35.1400
6144 112781477 35.5020
6272 115080019 35.8715
6400 117377567 38.3167
6912 126558077 38.6704
7168 131142761 40.1696
7200 131715607 41.7772
8192 149447533 44.0675
[/code]
The strange thing is that the mem speed is half of what it is supposed to be.
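
A table like this can be used to look up which FFT length applies to a given exponent. The Python sketch below parses rows in the `fft max exp ms/iter` layout and returns the smallest FFT length whose "max exp" is at least the exponent. Reading "max exp" as the largest exponent an FFT length can handle is an assumption based on the column header, and whether CUDAPm1 selects lengths exactly this way is not confirmed here. The function names and the exponent 67231000 are made up for the example.

```python
def parse_fft_table(text):
    """Return sorted (fft_k, max_exp, ms_per_iter) rows, skipping header lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            rows.append((int(parts[0]), int(parts[1]), float(parts[2])))
        except ValueError:
            continue  # e.g. "clockRate (MHz) 1110"
    return sorted(rows)

def smallest_fft_for(exponent, rows):
    """Smallest FFT length (in K) whose max exponent covers `exponent`."""
    for fft_k, max_exp, ms in rows:
        if exponent <= max_exp:
            return fft_k, ms
    return None

# Header plus three rows copied from the GTX 750 Ti table above.
sample = """\
Device GeForce GTX 750 Ti
clockRate (MHz) 1110
fft max exp ms/iter
3584 66556463 19.2476
4096 75846319 21.2812
4608 85111207 25.4287
"""
rows = parse_fft_table(sample)
print(smallest_fft_for(67231000, rows))  # (4096, 21.2812)
```

This agrees with the 4096K length reported elsewhere in the thread for exponents in the 67M range.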

James Heinrich 2014-04-14 20:02

[QUOTE=firejuggler;371177]The strange thing is that the mem speed is half of what it is supposed to be.[/QUOTE]Not uncommon to see that. There's a subtle distinction between the clock frequency of the memory and the rate of data transfers. In the good old days there was one transaction per clock cycle. Then they invented [url=http://en.wikipedia.org/wiki/Double_data_rate]DDR = Double Data Rate[/url] where data is transferred twice per clock cycle. Often utilities will report the memory clock frequency (for modern GDDR5 video cards that's usually in the 2.5-3.0GHz range) whereas marketing materials will report the number of transactions per second (double the clock rate), usually mislabeled with "GHz" (billion cycles per second) rather than GT/s (billion transactions per second).

houding 2014-07-22 07:18

Is there a way to specify the B1 and B2 values manually?

Currently I get a manual assignment, put it in the worktodo txt file and run the program (using 0.20). The software decides on the B1 and B2 values.

Adolf

LaurV 2014-07-22 08:30

One way to increase the limits for all assignments, but still let the program calculate them optimally for each exponent, is to specify a higher number of "LL tests saved", like substituting the default "1" or "2" at the end of the line with "3"... "9" (it can be higher, but it is not effective, and generally higher values are a waste of time).
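
As a rough illustration of that edit, the Python sketch below replaces the last comma-separated field of a worktodo line. The sample line follows the Prime95-style `Pfactor=<id>,1,2,<exponent>,-1,<bits>,<tests saved>` layout, which is an assumption here; the exponent and bit level are invented for the example and the function name `set_tests_saved` is hypothetical.

```python
def set_tests_saved(worktodo_line, tests_saved):
    """Return the line with its last comma-separated field replaced."""
    head, sep, _old = worktodo_line.rstrip().rpartition(",")
    if not sep:
        raise ValueError("line has no comma-separated fields")
    return f"{head},{tests_saved}"

# Invented example line; only the trailing "2" is the field being changed.
line = "Pfactor=N/A,1,2,67231001,-1,73,2"
print(set_tests_saved(line, 3))  # Pfactor=N/A,1,2,67231001,-1,73,3
```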

owftheevil 2014-07-22 13:03

Yes, for example

[CODE]./CUDAPm1 -b1 value -b2 value exponent[/CODE]

nucleon 2014-09-07 09:35

When submitting results from cuda p-1, I'm getting this:

Found 1 lines to process.
processing: P-1 no-factor for M68xxxxxx (B1=645,000, B2=15,802,500, E=4)
Error: Missing checksum. Correct the problem or email results to [email]woltman@alum.mit.edu[/email].

Is there anything we can do to fix this?

-- Craig

James Heinrich 2014-09-07 14:01

[QUOTE=nucleon;382377]Error: Missing checksum.
Is there anything we can do to fix this?[/QUOTE]I have fixed the manual results form to accept CUDAPm1 results without checksum, as it was supposed to be.

I guess there's potentially also the possibility that George could share his checksum-generating code with the CUDAPm1 authors (either explicitly or wrapped inside a closed-source DLL or similar) in which case CUDAPm1 could generate the correct checksums on its own, but that's a whole other area of discussion.

nucleon 2014-09-07 14:08

Thanks.

I have my titan working on P-1, I'd hate to stop it. :)

nucleon 2014-09-07 14:12

Still no good.

Submitting:
M67xxxxxx found no factor (P-1, B1=635000, B2=14446250, e=2, n=4096K CUDAPm1 v0.20)

And I get:

Found 1 lines to process.
processing: P-1 no-factor for M67xxxxxx (B1=635,000, B2=14,446,250, E=2)
Error: Missing checksum. Correct the problem or email results to [email]woltman@alum.mit.edu[/email].

-- Craig
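
For anyone scripting around these results, here is a minimal Python sketch that pulls the bounds out of a CUDAPm1 no-factor line of the form submitted above. The pattern is written against that single example line only, so other result variants (such as a found factor) are not covered, and the function name is made up for the example. The exponent is kept as text because it is masked in the post.

```python
import re

# Matches the no-factor result line format shown in the post above.
RESULT_RE = re.compile(
    r"M(?P<exponent>\S+) found no factor "
    r"\(P-1, B1=(?P<b1>\d+), B2=(?P<b2>\d+), e=(?P<e>\d+), "
    r"n=(?P<fft>\d+)K CUDAPm1 v(?P<version>[\d.]+)\)"
)

def parse_no_factor_line(line):
    """Return the fields of a no-factor result line as a dict, or None."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    return {
        "exponent": m.group("exponent"),
        "b1": int(m.group("b1")),
        "b2": int(m.group("b2")),
        "e": int(m.group("e")),
        "fft_k": int(m.group("fft")),
        "version": m.group("version"),
    }

line = ("M67xxxxxx found no factor "
        "(P-1, B1=635000, B2=14446250, e=2, n=4096K CUDAPm1 v0.20)")
parsed = parse_no_factor_line(line)
# Format the bounds the way the server echoes them back.
echo = f"B1={parsed['b1']:,}, B2={parsed['b2']:,}, E={parsed['e']}"
print(echo)  # B1=635,000, B2=14,446,250, E=2
```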

James Heinrich 2014-09-07 15:02

[QUOTE=nucleon;382390]Still no good.[/QUOTE]Sorry, I put the checksum code in the wrong place. Please try again?

owftheevil 2014-09-07 15:11

It's working for me now. Thanks.

ugonabuj 2014-09-07 15:23

It is not working for me. When a factor is found I get the answer:
[COLOR=darkgreen]processing: P-1 factor 888024044817831733817 for [URL="http://www.mersenne.org/report_exponent/?exp_lo=2297327&full=1"][COLOR=#000080]M2297327[/COLOR][/URL] (B1=615,000, B2=12,000,000, E=12)[/COLOR]
[COLOR=red]Insufficient information for accurate CPU credit. For stats purposes, assuming factor was found using ECM with B1 = 50000. CPU credit is 0.0908 GHz-days. [/COLOR]

James Heinrich 2014-09-07 15:31

[QUOTE=ugonabuj;382395]It is not working for me. When a factor is found I get the answer:
[COLOR=darkgreen]processing: P-1 factor 888024044817831733817 for [URL="http://www.mersenne.org/report_exponent/?exp_lo=2297327&full=1"][COLOR=#000080]M2297327[/COLOR][/URL] (B1=615,000, B2=12,000,000, E=12)[/COLOR]
[COLOR=red]Insufficient information for accurate CPU credit. For stats purposes, assuming factor was found using ECM with B1 = 50000. CPU credit is 0.0908 GHz-days.[/COLOR][/QUOTE]That's very interesting. That shouldn't happen. Investigating...

edit: and found the problem. Turns out I was still calling the method-unknown factor recording function rather than the found-by-PM1 function. Sorry about that.

ugonabuj 2014-09-07 15:38

When I send it to your site it is working perfectly. In fact it has never
worked for CUDAPm1 from the start in May 2013.


Userid:Hbendtz

James Heinrich 2014-09-07 15:45

It's a long-standing problem with mersenne.org's manual results parser which was just upgraded within the past 48 hours. Apparently I'm still finding a few bugs.

The old code ignored anything on the result line other than the exponent+factor, then made some broad assumptions about how factors were found. In this example, the rule that applied was "exponent < 16M, therefore must be ECM".

But, the bug should be fixed now, please let me know if you see that message again, because you shouldn't. :smile:

[i]edit: There is a probable plan to update the previously-recorded factor results that were falsely recorded as ECM or P-1 when in fact they were P-1 or TF, but that won't happen for a few days at the earliest.[/i]

nucleon 2014-09-07 23:42

I'm happy.

Everything working aok now :)

ugonabuj 2014-09-08 17:29

The primenet server is still kicking my head (repeatedly).
Now when I report a factor for a P-1 assigned exponent the bastard answers:

processing: P-1 factor 6272775095469249097847 for M3406181 (B1=1,000,000, B2=20,000,000, E=12)
Result type inappropriate for the assignment type. Processing result but not deleting assignment.
CPU credit is 0.3099 GHz-days.


If I don't have the assignment ID (AID) for that exponent anymore I can't unreserve it; it
doesn't show in my account assignments, but it is counted.


HBendtz

houding 2014-09-08 19:18

This has happened to me 5 times in the past. And it happens when I found a factor using cudapm1.

In my assignment list, there is no P-1 listed.

But on my summary page I have 5 P-1 listed under workload.

As I have factored those 5 exponents, I'm not too worried. Even though the avg days is almost 100 now.

James Heinrich 2014-09-08 19:25

PrimeNet tries hard to give credit to the right user. Checking the [url=http://www.mersenne.org/report_exponent/default.php?exp_lo=3406181&full=1]exponent history for M3406181[/url] you can see that the P-1 factor was indeed credited to "HBendtz".

I just had another look at the code, and the "result type inappropriate" message should not have been shown in your case (it was being incorrectly shown for P-1 and ECM factors). I have fixed that now.

If the exponent was assigned to you, it would show up under [url=http://www.mersenne.org/workload/]My Account > Assignments[/url] when logged in and you could unreserve or check on the status of the assignment. If it's not there, then it's not assigned to you.

James Heinrich 2014-09-08 19:26

[QUOTE=houding;382493]In my assignment list, there is no P-1 listed.
But on my summary page I have 5 P-1 listed under workload.[/QUOTE]Could you please tell me these exponents so I can investigate what's happening, please?

houding 2014-09-08 19:47

[QUOTE=James Heinrich;382495]Could you please tell me these exponents so I can investigate what's happening, please?[/QUOTE]

I had to go digging through my results page. Luckily all of these were manual testing.

66402887
71020303
69804067
69454157
69453871

As you will see in exponent status, when I submitted the factor, on the same day/moment it says expired as well.

Just in case things look fishy - my forum name is houding, my primenet name is AdolfNor.

James Heinrich 2014-09-08 19:56

At a quick glance things seem as they should be, but perhaps I'm not looking in the right section.
Can you please PM or [url=mailto:james@mersenne.ca]email me[/url] a screenshot of where you see these exponents showing up on your list?

ugonabuj 2014-09-08 20:00

No James it is not shown in [URL="http://www.mersenne.org/workload/"][COLOR=#000080]My Account > Assignments[/COLOR][/URL] but it is counted for
in the number of total assignment for me.
When I look in "Manual Testing > Extensions" it is shown. But that does not
help there I only can extend the time.

HBendtz

chalsall 2014-09-08 21:24

[QUOTE=ugonabuj;382498]No James it is not shown in [URL="http://www.mersenne.org/workload/"][COLOR=#000080]My Account > Assignments[/COLOR][/URL] but it is counted for
in the number of total assignment for me.
When I look in "Manual Testing > Extensions" it is shown. But that does not
help there I only can extend the time.

HBendtz[/QUOTE]

If I may please just spend a little bit of hard earned experience to speak some reasonable advice...

Some mean to be serious. But some others just want to seriously take the piss to cause a problem for those who actually think.

Welcome to realitity....

Prime95 2014-09-08 21:36

[QUOTE=James Heinrich;382497]At a quick glance things seem as they should be, but perhaps I'm not looking in the right section.
Can you please PM or [EMAIL="james@mersenne.ca"]email me[/EMAIL] a screenshot of where you see these exponents showing up on your list?[/QUOTE]

James: It looks like $done is properly set to TRUE. I'm guessing the global $t_assigned is not set properly for manual_post_processing to delete the assignment row.

James Heinrich 2014-09-08 21:39

[QUOTE=chalsall;382506]Welcome to realitity....[/QUOTE]I'm not sure I quite follow what your above statement was supposed to mean.
But [i]houding[/i] and [i]ugonabuj[/i] have indeed pointed out an actual inconsistency in what's displayed in the assignments page. I've got George looking into it. It may just be a display issue or it may involve something deeper in relation to submitting PM1-factor results after an assignment expires.

James Heinrich 2014-09-08 21:42

[QUOTE=Prime95;382509]James: It looks like $done is properly set to TRUE. I'm guessing the global $t_assigned is not set properly for manual_post_processing to delete the assignment row.[/QUOTE]The 5 examples quoted by [I]houding[/I] were all submitted on the old manual_results form. It's possible that whatever bug was there has already been fixed with the new manual_results form. It is, of course, also possible that the bug still exists. If someone submits a new P-1 factor and finds that the assignment still shows up on your Extension or Workload lists, please let me know.

Prime95 2014-09-08 21:59

[QUOTE=James Heinrich;382512]The 5 examples quoted by [i]houding[/i] were all submitted on the old manual_results form.[/QUOTE]

I'll manually repair the database

ugonabuj 2014-09-08 22:18

Thank you James for fixing the scripts for P-1 manual reporting.
It is really working now.


HBendtz

houding 2014-09-09 04:35

Thank you James, George.

I will do a few pm1's with cudapm1 and hopefully find a factor or 2.

Will let you know what I find.

ET_ 2015-02-18 14:46

Cudapm1 for linux
 
I have been on SourceForge, and found release 0.20 of CudaPm1 executable for Windows.

Is there a place where some Linux executables or sources can be found?

Luigi

owftheevil 2015-02-19 01:20

What parameters fit your needs, ie, Cuda version, cc 2.0, 3.5, etc.? I don't have any posted but if you can't or don't want to build it yourself, I would be happy to post some.

ET_ 2015-02-19 09:40

[QUOTE=owftheevil;395800]What parameters fit your needs, ie, Cuda version, cc 2.0, 3.5, etc.? I don't have any posted but if you can't or don't want to build it yourself, I would be happy to post some.[/QUOTE]

I will as soon as my new environment is set up (a new 980 on its way), thanks :bow:

ET_ 2015-02-21 16:52

[QUOTE=owftheevil;395800]What parameters fit your needs, ie, Cuda version, cc 2.0, 3.5, etc.? I don't have any posted but if you can't or don't want to build it yourself, I would be happy to post some.[/QUOTE]

Linux Ubuntu 14.04 LTS 64 bit

[code]
CUDA version info
binary compiled for CUDA 6.50
CUDA runtime version 6.50
CUDA driver version 6.50

CUDA device info
name GeForce GTX 980
compute capability 5.2
max threads per block 1024
max shared memory per MP 98304 byte
number of multiprocessors 16
CUDA cores per MP 128
CUDA cores - total 2048
clock rate (CUDA cores) 1342MHz
memory clock rate: 3505MHz
memory bus width: 256 bit
[/code]

Thanks :-)

Luigi

ET_ 2015-05-01 10:37

[QUOTE=ET_;395994]Linux Ubuntu 14.04 LTS 64 bit

[code]
CUDA version info
binary compiled for CUDA 6.50
CUDA runtime version 6.50
CUDA driver version 6.50

CUDA device info
name GeForce GTX 980
compute capability 5.2
max threads per block 1024
max shared memory per MP 98304 byte
number of multiprocessors 16
CUDA cores per MP 128
CUDA cores - total 2048
clock rate (CUDA cores) 1342MHz
memory clock rate: 3505MHz
memory bus width: 256 bit
[/code]

Thanks :-)

Luigi[/QUOTE]

I can try and build CUDAP-1 by myself, I just need the right source and makefile :smile:

Luigi

owftheevil 2015-05-01 11:54

With subversion: [CODE] svn checkout svn://svn.code.sf.net/p/cudapm1/code/trunk cudapm1-code [/CODE] or http:

[URL="http://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/#"]http://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/[/URL]

The readme is obsolete. Alter the make file as you did for cudalucas. To run with assignments in a worktodo.txt, no command line parameters are needed.

ET_ 2015-05-01 16:59

[QUOTE=owftheevil;401413]With subversion: [CODE] svn checkout svn://svn.code.sf.net/p/cudapm1/code/trunk cudapm1-code [/CODE] or http:

[URL="http://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/#"]http://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/[/URL]

The readme is obsolete. Alter the make file as you did for cudalucas. To run with assignments in a worktodo.txt, no command line parameters are needed.[/QUOTE]

Thank you! :-)

henryzz 2015-05-01 19:21

I noticed today that when my screen goes blank my iteration times triple. I am guessing that this is a known feature on windows. Is there an easy way around this?

