mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

Uncwilly 2013-04-17 00:25

[QUOTE=garo;337309]P-1 with 2GB memory in the 61M range gives a probability of success of 3.3-3.6% depending on the TF level. Dunno where you got 5-8%.[/QUOTE]I have seen Prime95 give around 3.75% for 60M exponents that have been taken to 73.

c10ck3r 2013-04-17 00:56

Any luck getting Winbloze compiled? Once it's compiled and available, I'll reinstall my 460 to play with it :)

owftheevil 2013-04-17 01:13

No Windows version. I'll need help with that, as I don't have access to any Windows boxes.

axn 2013-04-17 04:00

[QUOTE=NBtarheel_33;337330]
M61000000, factored to 70 bits, assuming 2 L-L tests saved, with B1=670,000 and B2=16,750,000, using K*B^N+C = 1*2^61000000-1
Probability = [B]5.664070%[/B]

M65000000, factored to 70 bits, with B1=800,000 and B2=24,000,000, using K*B^N+C = 1*2^65000000-1
Probability = [B]6.224824%[/B][/QUOTE]

But 70 bits is not realistic for these assignments anymore, is it? 73 and 74 respectively would be more accurate. Here you'd get probabilities of 3.7-3.8% (for 73) and 3.3-3.4% (for 74).

c10ck3r 2013-04-17 15:27

[QUOTE=axn;337369]But 70 bits is not realistic for these assignments anymore, is it? 73 and 74 respectively would be more accurate. Here you'd get probabilities of 3.7-3.8% (for 73) and 3.3-3.4% (for 74).[/QUOTE]
No matter how you cut it, GPUs are faster than CPUs. We can get to the specifics once it's public and has more testing. :)

Mini-Geek 2013-04-17 15:48

[QUOTE=c10ck3r;337403]No matter how you cut it, GPUs are faster than CPUs. We can get to the specifics once it's public and has more testing. :)[/QUOTE]

GPUs are also faster than CPUs at LLs, but they are so much faster at TF that TF makes the most sense. Do we know yet that the P-1 speed is sufficient to make sense? (I haven't been following this thread closely enough to know, so feel free to tell me to :rtfm:)

chappy 2013-04-17 16:15

And perhaps even more importantly, doesn't the project need P-1s right now? So even if the speed difference isn't of the magnitude of the TF increase, it might still be best for the project. I know that there is no general consensus on this, and also that everyone should do whatever part of the project makes them happy. But many of us are willing to sacrifice GHz-days to work on whatever needs working on.

firejuggler 2013-04-17 16:18

I heard (read) of a 25-times speed increase.
Ah, found the quote.


[code]Originally Posted by owftheevil
Cudapm1 output:

Code:
M61076737 has a factor: 432634830991289176546683053423
Run with B1 = 65000, B2 = 12035000, n = 3360k, d = 2310, e =2, 8 rp per pass. It used about 600Mb of device memory. Stage 2 took ~53 minutes.

Edit: Looks like about 15 minutes longer to make e = 4.
[/code]
[code]
To compare with CPU speed running the same curve in Prime95.
Laptop with Corei7 2720QM sandy bridge:
using 1 core: stage1 43min, stage2 ~ 8h (3 Gb RAM)
using 4 cores: stage1: 19 min, stage2 ~ 3.8h (3 Gb RAM)

I only completed ~20% of stage2 and extrapolated the runtime.
[/code]

axn 2013-04-17 16:54

[QUOTE=firejuggler;337409]I heard (read) of a 25-times speed increase.
Ah, found the quote.
[/QUOTE]

8h/53min ~= 8x. But later found that only half the things were being processed, so it is more like 4x?

firejuggler 2013-04-17 17:16

There is no word on the stage 1 speed increase, and maybe there is none. But I'm sure I've read of a 25-times speedup somewhere. Oh well, a 4-times speed increase, that's still good.

Aramis Wyler 2013-04-17 19:31

[QUOTE=Aramis Wyler;336999]That would definitely put a dent in our p-1 deficit. Though it's hard to trade 25x p-1 work for 125x factoring work.

EDIT: Not that it wouldn't get used though. I was trading up 10 ghz day of factoring per ghz day of p-1, and this is a better deal than that. :smile:[/QUOTE]

I said 25x because it takes me 20 hours to do a stage 2 on my Athlon x4, though that's if I'm running 3 at a time. It was a personal benchmark, because I tf 125x faster on my gpus than my cpus.

garo 2013-04-17 21:58

[QUOTE=axn;337369]But 70 bits is not realistic for these assignments anymore, is it? 73 and 74 respectively would be more accurate. Here you'd get probabilities of 3.7-3.8% (for 73) and 3.3-3.4% (for 74).[/QUOTE]

Bingo!

owftheevil 2013-04-17 23:13

Optimal b1 and b2 are not going to be much different from what mprime selects. The basic unit of measurement used to compute optimal values is the time it takes to do 1 fft, which is relative to the device it's running on. GPUs will probably favor slightly higher b2 because of memory bandwidth, and smaller e, at least for most cards that don't have lots of memory.

Recent run of 6108xxxx with b1 = 580000, b2 = 12035000, e = 2, d = 2310, and rp = 20:

Stage1 96m, 43s, 1673702 ffts
Stage2 89m, 42s, 1504234 ffts

Each increase of e by 2 will take about 15 minutes more.

bcp19 2013-04-18 03:07

[QUOTE=axn;337412]8h/53min ~= 8x. But later found that only half the things were being processed, so it is more like 4x?[/QUOTE]
Actually, I'm pretty sure it was stage 2 vs stage 2, so 8x would still apply.

This matters not though, since you need a comparison to a fairly fixed given. If the GPU is not using a CPU core, then what do you compare it to? If I have a GTX 480 in an I5 2500 I get a speed up, but if I run that same 480 in a Core2Quad I have a more significant speed up when compared to the system it is in.

I'd say you'd need to do it like CPU-based P-1: compare the time it takes to run the GPU P-1 vs. the time it takes to run that same exponent on CUDALucas (or Mfaktc/o?). Otherwise the 'speed increase' is technically unknown.

axn 2013-04-18 03:46

[QUOTE=NBtarheel_33;337164]Factoring to 7x bits (assuming an increase of one bit level) gives you (roughly) a 1/7x = 1.27-1.43% chance of finding a factor.

P-1 with decent bounds will typically give you a 5-8% chance of finding a factor.

So, given 125 TF attempts, we'd expect roughly 1.6-1.8 factors found. On the other hand, 25 P-1 attempts should yield roughly 1.25-2.0 factors found.

If GPU P-1 allows us to increase bounds or make more frequent use of the Brent-Suyama extension, the expected number of successes will be at or above the higher end of this range. In that case, it would make complete sense to trade 125x TF for 25x P-1.

Note also that GPU P-1 will make use of the *GPU* RAM, rather than the system RAM. This could bring in P-1'ers who were previously unable to dedicate large quantities of RAM to Stage 2.[/QUOTE]

This looks like a good jumping point into the discussion. For those of you who're thinking of switching from GPU TF to GPU P-1 and worried whether it'll impact the number of expos cleared, I propose the following model.
Compare the efficiency (expos cleared/unit time) of doing the _last bit_ of TF to the efficiency of doing P-1. Simple. Assuming that the last bit work is for 73->74, how many "last bit" TFs can be done in a day on a particular GPU, and how many P-1 can be done in the same time? Then calculate the expected number of factors. Whichever is higher wins. If they're approximately the same (within 20% of each other), picking either one should be fine.

frmky 2013-04-20 22:31

[QUOTE=garo;337309]P-1 with 2GB memory in the 61M range gives a probability of success of 3.3-3.6% depending on the TF level. Dunno where you got 5-8%.[/QUOTE]

What typical values of B1, B2, and e does Prime95 choose at 61M with 2GB memory and TF to 73 bits?

garo 2013-04-21 15:41

[QUOTE=frmky;337743]What typical values of B1, B2, and e does Prime95 choose at 61M with 2GB memory and TF to 73 bits?[/QUOTE]

No factor to 74 bits:
M61482791 completed P-1, B1=545000, B2=10355000

To 73 bits:
M59518889 completed P-1, B1=555000, B2=10961250

e=0 in both cases.

axn 2013-04-21 16:31

[QUOTE=garo;337793]e=0 in both cases.[/QUOTE]

Which is secretly e=2 :smile:

garo 2013-04-23 05:40

[QUOTE=axn;337795]Which is secretly e=2 :smile:[/QUOTE]

Ah didn't know that. Thanks.

frmky 2013-04-25 06:38

With owftheevil's permission, I have posted a [I]very[/I] early version at Sourceforge, [URL="https://sourceforge.net/projects/cudapm1/?source=directory"]https://sourceforge.net/projects/cudapm1/?source=directory[/URL]

It does read Pfactor lines from worktodo.txt and output to results.txt, and George has indicated that he will add support for the results output soon. The core routines have survived testing on 30+ known factors over the past few days. Autoselection of FFT sizes may need tweaking. It currently does not intelligently select B1 and B2 sizes; for now parameters should be specified manually (it defaults to B1=600k, B2=12M, e=6 which is reasonable for current ~61M exponents). Error checking should be added in many places. It does not support checkpointing. In summary, it is still very alpha.

frmky 2013-04-25 07:46

The default parameters will require ~900 MB of GPU memory. If you do not have that available, try using -nrp2 10 or -nrp2 4. You can also save a little memory by using -e2 4 or -e2 2. For really low memory cards, use -d2 30 -e2 2 -nrp2 2. Autoselection of these parameters based on available GPU memory is on the TODO.

firejuggler 2013-04-25 07:57

My 560 has 1024MB. Might be a bit tight.

frmky 2013-04-25 08:09

[QUOTE=firejuggler;338239]My 560 has 1024. Might be a bit tight.[/QUOTE]

Try a run with a small B1, say 3000, just to see if stage 2 starts successfully without waiting a long time for B1 to finish.

ET_ 2013-04-25 09:44

[QUOTE=frmky;338242]Try a run with a small B1, say 3000, just to see if stage 2 starts successfully without waiting a long time for B1 to finish.[/QUOTE]

Hi Greg, I am doing some time testing on the e, d and nrp values for different B2 values.
It will take some time to check all the combinations for 5 distinct test-cases, but I think it may be useful to automate the choice of these parameters.

I hope I am not stepping over others' feet.

Luigi

P.S. Thanks again to Carl that started the project... :smile:

henryzz 2013-04-25 09:48

With 1024MB of memory, as it is close, it might be worth slightly reducing B2 in order to fit. B1 can be increased to compensate.

James Heinrich 2013-04-25 10:47

[QUOTE=frmky;338230]George has indicated that he will add support for the results output soon.[/QUOTE]Can you please post a sample of possible variations in the output? I need to add support to both mersenne.org and mersenne.ca

owftheevil 2013-04-25 11:02

[QUOTE=henryzz;338248]With 1024MB of memory, as it is close, it might be worth slightly reducing B2 in order to fit. B1 can be increased to compensate.[/QUOTE]

Only e2 and nrp2 have an effect on memory use. b1 and b2 have an effect on how many transforms need to be done.

The reason frmky said to try a low b1 is that you don't know until the end of stage 1 whether the necessary memory for stage 2 is available. So until you are confident that you will be able to allocate all the stage 2 memory, it's best to do short stage 1 runs.

ET_ 2013-04-25 11:13

[QUOTE=owftheevil;338255]Only e2 and nrp2 have an effect on memory use. b1 and b2 have an effect on how many transforms need to be done.

The reason frmky said to try a low b1 is that you don't know until the end of stage 1 whether the necessary memory for stage 2 is available. So [B]until you are confident that you will be able to allocate all the stage 2 memory, it's best to do short stage 1 runs[/B].[/QUOTE]

Good advice! This will help me shorten the time required for the speed tests :smile:

Luigi

henryzz 2013-04-25 12:58

[QUOTE=owftheevil;338255]Only e2 and nrp2 have an effect on memory use. b1 and b2 have an effect on how many transforms need to be done.

The reason frmky said to try a low b1 is that you don't know until the end of stage 1 whether the necessary memory for stage 2 is available. So until you are confident that you will be able to allocate all the stage 2 memory, it's best to do short stage 1 runs.[/QUOTE]
B2 normally affects memory usage with Prime95 and GMP-ECM. My fault for misunderstanding.

Stef42 2013-04-25 17:16

No Windows version yet? :redface:

James Heinrich 2013-04-25 17:26

[QUOTE=Stef42;338288]No Windows version yet? :redface:[/QUOTE]I would also very much appreciate the ability to try it out :smile:

frmky 2013-04-26 00:45

Autoselection of B1, B2, and GPU memory related parameters (d2, e, nrp) should now work. It tries to use as much GPU memory as it thinks is safe, so let me know if you get memory allocation errors.

It still lacks proper error checking in many places, checkpointing, and the ability to interrupt stage 2 with a Ctrl-C. If anyone wants to dive into that code, feel free!

Also, I'm not set up to compile on Windows with both CUDA and GMP. If anyone here is, I'm sure it will be appreciated! :smile:

frmky 2013-04-26 06:48

Haven't posted timing in a while.

M60973753 P-1, B1=535000, B2=10432500, e=6, n=3584K
Time for stage 1: 56:27
Time for stage 2: 47:18

This was on a K20. A GTX Titan should be a bit faster. My GTX 480 will likely take nearly twice as long.

And it does find factors. :smile:
M60870653 has a factor: 87951105041429114235889 (P-1, B1=600000, B2=12000000, e=6, n=3584K CUDAPm1 v0.00)

ET_ 2013-04-26 08:18

[QUOTE=frmky;338341]Autoselection of B1, B2, and GPU memory related parameters (d2, e, nrp) should now work. It tries to use as much GPU memory as it thinks is safe, so let me know if you get memory allocation errors.

It still lacks proper error checking in many places, checkpointing, and the ability to interrupt stage 2 with a Ctrl-C. If anyone wants to dive into that code, feel free!

Also, I'm not set up to compile on Windows with both CUDA and GMP. If anyone here is, I'm sure it will be appreciated! :smile:[/QUOTE]

If the autoselection now works, I guess I can quit my work on timing different exponents with different e, d and nrp... :davieddy::bangheadonwall:

I will get the alpha from sourceforge...

Luigi

owftheevil 2013-04-26 11:07

@ ET The timings are interesting, but the work on how big nrp can get without a crash is still just as important.

A recent run on a 570:

Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 732M of 1279M GPU memory free.
Using e=6, d=2310, nrp=10
Starting stage 1 P-1, M61410829, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
.
.
.
Stage 1 complete, estimated total time = 1:41:14
.
.
.
Stage 2 complete, estimated total time = 3:42:23
M61410829 Stage 2 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.00)


There were ~350Mb of free memory during stage 2.

ET_ 2013-04-26 12:02

[QUOTE=owftheevil;338386]@ ET The timings are interesting, but the work on how big nrp can get without a crash is still just as important.

A recent run on a 570:

Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 732M of 1279M GPU memory free.
Using e=6, d=2310, nrp=10
Starting stage 1 P-1, M61410829, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
.
.
.
Stage 1 complete, estimated total time = 1:41:14
.
.
.
Stage 2 complete, estimated total time = 3:42:23
M61410829 Stage 2 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.00)


There were ~350Mb of free memory during stage 2.[/QUOTE]

Should I continue the analysis?

Luigi

owftheevil 2013-04-26 13:00

Yes. Right now the memory settings are on the conservative side. Data from different cards about how much memory use they can tolerate would be very useful.

Edit: small b1 and small b2 or terminate after 1 pass of stage2, I just want to see if it handles the memory load.

ET_ 2013-04-26 13:28

[QUOTE=owftheevil;338391]Yes. Right now the memory settings are on the conservative side. Data from different cards about how much memory use they can tolerate would be very useful.

Edit: small b1 and small b2 or terminate after 1 pass of stage2, I just want to see if it handles the memory load.[/QUOTE]

Fine :smile:

Just one more question. The version I have can't be stopped during B2, while the one on sourceforge doesn't allow modifying the automatically selected parameters.

Or at least I haven't yet seen how. :guilty:
I'll study during the weekend...

Luigi

owftheevil 2013-04-26 13:52

I just look up the process number and kill it.

ET_ 2013-04-26 14:12

[QUOTE=owftheevil;338398]I just look up the process number and kill it.[/QUOTE]

That way you don't get the estimated time to complete Stage 2...

But it will be waaaaay faster :wink:

Luigi

frmky 2013-04-28 01:46

[QUOTE=owftheevil;338386]There were ~350Mb of free memory during stage 2.[/QUOTE]

The current version now reports expected memory use. Please compare that to the amount of memory used as reported by the driver to see if it is reasonably accurate. It looks accurate on both my K20 and GTX 480. There is additional memory free as reported by the driver, but CUDA says it isn't free and I'm going with what CUDA reports.

nucleon 2013-05-01 11:36

I'll add a me too.

When the windows exe gets released, I have a Titan here itching to have a bash at P-1.

-- Craig

frmky 2013-05-02 06:07

Here's a Windows version to try:
[URL="https://www.dropbox.com/s/hdu6eqwkk9vr9p8/cudapm1_20130501.zip"]https://www.dropbox.com/s/hdu6eqwkk9vr9p8/cudapm1_20130501.zip[/URL]

Included is a worktodo.txt that should find a factor. I can't actually test it here since I have no Windows machine with a CC >= 1.3 card, so please try that first, make sure the factor is found, and please let me know if it worked!

There is a problem with reading the text entries in the .ini file, but the numerical entries work fine. Something in parse.c that Visual Studio doesn't like, but I don't have the motivation to track that down right now. It just uses the default input and output files, worktodo.txt and results.txt.

Also, make sure you have the latest graphics drivers. I compiled with CUDA 5.0 downloaded today.

firejuggler 2013-05-02 08:47

looking good for now
[code]
C:\Users\Vincent\Desktop\cudapm1>CUDAPm1.exe

Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
Warning: Couldn't parse ini file option ResultsFile; using default "results.txt"

Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports zuM of zuM GPU memory free.
Using e=2, d=210, nrp=4
Using approximately zuM GPU memory.
Starting stage 1 P-1, M61262347, B1 = 605000, B2 = 16637500, e = 2, fft length =
3360K
Doing 873133 iterations
Iteration 1000 M61262347, 0xf19a7f6041953a97, n = 3360K, CUDAPm1 v0.00 err = 0.2
1094 (0:16 real, 16.1885 ms/iter, ETA 3:55:18)
Iteration 2000 M61262347, 0xaf1d15aad49fcee8, n = 3360K, CUDAPm1 v0.00 err = 0.1
9336 (0:13 real, 12.8097 ms/iter, ETA 3:05:58)
Iteration 3000 M61262347, 0xb702298e7a8c9a8e, n = 3360K, CUDAPm1 v0.00 err = 0.2
1680 (0:13 real, 12.8040 ms/iter, ETA 3:05:41)
Iteration 4000 M61262347, 0xc53d1695707d3dc0, n = 3360K, CUDAPm1 v0.00 err = 0.1
9922 (0:13 real, 12.7794 ms/iter, ETA 3:05:07)
Iteration 5000 M61262347, 0xf154bc3c5f15a9c9, n = 3360K, CUDAPm1 v0.00 err = 0.1
9727 (0:12 real, 12.8228 ms/iter, ETA 3:05:31)
[/code]
Will report back tonight

Stef42 2013-05-02 09:37

I got it running in Windows 8, but it crashed when starting stage 2.

[CODE]Stage 1 complete, estimated total time = 1:20:10
Starting stage 1 gcd.
M61262347 Stage 1 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.00
Starting stage 2[/CODE]

Some details:
memusage during stage 1: 334MB (MSI afterburner)
memusage during start of stage 2: 1185MB

Total video memory is 1.5GB. I'm trying to figure out if there was not enough memory for some reason.
Details of my graphics card: [url]http://www.msi.com/product/vga/N580GTX-Lightning.html[/url]

owftheevil 2013-05-02 11:59

If it's not able to allocate the stage 2 memory, it should be giving you a message like this:

[CODE]CUDAPm1.cu(2628) : cudaSafeCall() Runtime API error 2: out of memory.[/CODE]So something else is likely going on.

Also, the difference in memory use between stage 1 and the beginning of stage 2 is consistent with e = 6 and nrp = 24. Is that what you saw?

James Heinrich 2013-05-02 11:59

[QUOTE=frmky;338990]Here's a Windows version to try
Included is a worktodo.txt that should find a factor.
please try that first, make sure the factor is found, and please let me know if it worked![/QUOTE]Starting a run on my GTX 570. One thing that might be a concern (especially in light of [i]Stef42[/i]'s comment about stage2 memory) is the references to "zu" as a quantity of graphics memory:[quote]Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports [color=red]zu[/color]M of [color=red]zu[/color]M GPU memory free.
Using e=6, d=2310, nrp=16
Using approximately [color=red]zu[/color]M GPU memory.
Starting stage 1 P-1, M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length =
3360K
Doing 873133 iterations[/quote]Running at 7.0ms/it, stage1 should be done in 1h40m and I'll report back with what happens when stage2 starts.

owftheevil 2013-05-02 12:21

The zu is not a big deal; it's simply a printf size specifier specific to gcc. For Windows we need Iu instead.

Stef42 2013-05-02 12:22

Right after stage 1 finished and stage 2 was initiated, I got a popup saying that CUDAPm1 crashed.
The Windows error log showed an APPCRASH, which is not very useful I think.

What followed was that after the GPU load dropped from 99% to 0%,
the memory remained at 1134MB usage until it was flushed, I guess because of a time-out.

Maybe I'll try a smaller P-1 exponent with a factor found to check.

Karl M Johnson 2013-05-02 13:14

Feedback so far:
1. Does not create checkpoints.
2. Beats CUDALucas in memory stability stress testing (60M exponents were free from errors on CL; found errors at 50K iterations on CPm1).
3. Fails at the beginning of stage 2 with an out-of-memory error, which should not be the case (6GB of vRAM, 16GB of RAM).

James Heinrich 2013-05-02 13:56

[QUOTE=owftheevil;339001]The zu is not a big deal, simply a size specifier specific to gcc. For windows we need Iu instead.[/QUOTE]Would it be a big deal that we're seeing "zu" instead of "Iu" on Windows?

owftheevil 2013-05-02 14:06

I think I found one problem. chalsall has his SPEs. Well, this was an ISPE, I for Ineffably. It's an easy fix, but it will have to wait until I get home from work. In the meantime, running with a b2 that is an even multiple of 2310 should bypass the error.

@Karl M Johnson:

1. Checkpoints are coming soon, maybe this weekend.
2. CPm1 during stage 1 does do more global memory reads than CuLu, so maybe that's why.
3. Unexpected. What is the error message?

Thank you all for your input.

owftheevil 2013-05-02 14:09

[QUOTE=James Heinrich;339008]Would it be a big deal that we're seeing "zu" instead of "Iu" on Windows?[/QUOTE]

%zu in printf prints size_t variable values, you need %Iu in windows to do the same thing.

James Heinrich 2013-05-02 14:10

Ah, now I understand what you mean.

Stef42 2013-05-02 14:26

I restarted CUDAPm1, with a manual b2 value, on a known P-1 exponent with a factor found by me earlier.

This was the command-line:
[CODE]cudapm1.exe -b2 550000[/CODE]

Output:
[CODE]Iteration 164000 M9090017, 0xd7661b0c859fa9e5, n = 512K, CUDAPm1 v0.00 err = 0.0
2734 (0:01 real, 0.7921 ms/iter, ETA 0:01)
Iteration 165000 M9090017, 0x7d3f99a08f445b8b, n = 512K, CUDAPm1 v0.00 err = 0.0
2734 (0:01 real, 0.7878 ms/iter, ETA 0:00)
M9090017, 0x1d50507696eeef9f, offset = 0, n = 512K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 2:14
Starting stage 1 gcd.
M9090017 Stage 1 found no factor (P-1, B1=115000, B2=[COLOR="Red"]1495000[/COLOR], e=6, n=512K CUDAPm
1 v0.00)
Starting stage 2.
Zeros: 59077, Ones: 84923, Pairs: 18379
itime: 14.921770, transforms: 1, average: 14921.770000
ptime: 35.394836, transforms: 88612, average: 0.399436
ETA: 0:50
itime: 17.911887, transforms: 1, average: 17911.887000
ptime: 35.547328, transforms: 88434, average: 0.401964
ETA: 0:00
Stage 2 complete, estimated total time = 1:43
Accumulated Product: M9090017, 0x1a6840caa5d05db3, n = 512K, CUDAPm1 v0.00
Starting stage 2 gcd.
M9090017 has a factor: 516770062491225473521 (P-1, B1=115000, B2=1495000, e=6, n
=512K CUDAPm1 v0.00)[/CODE]

As you can see, there is a different B2 value. Still, it finished well.
Earlier on, the program would crash when starting stage 2. Any thoughts? I must have done something wrong :smile:

Bit more surprising: according to mersenne.ca,
in the past the factor was found in stage 1 using prime95, but CudaPm1 reports stage 2 in the output... ?
Exponent [URL="http://www.mersenne.ca/exponent/9090017#"]9090017[/URL]

James Heinrich 2013-05-02 14:45

[QUOTE=Stef42;339012][i]M9090017 Stage 1 found no factor (P-1, [b]B1=115000, B2=1495000[/b], e=6, n=512K CUDAPm
1 v0.00)[/i]
Bit more surprising: according to mersenne.ca,
in the past the factor was found in stage 1 using prime95, but CudaPm1 reports stage 2 in the output... ?[/QUOTE]That is a bit disturbing.

[URL=http://www.mersenne.ca/exponent/9090017]M9090017[/url] has factor [url=http://www.mersenne.ca/factor/516770062491225473521]516770062491225473521[/url], with a k of [url=http://www.mersenne.ca/k/28425142796280]28425142796280[/url]
k-factored = 2[sup]3[/sup] × 3 × 5 × 61 × 97 × 389 × 102913
minimal bounds to find this factor in stage2 would be B1=389,B2=102913
minimal bounds to find this factor in stage1 would be B1=102913

You ran this with B1=115000 so it [i]should[/i] have found the factor, at least according to my understanding of P-1 :unsure:

Stef42 2013-05-02 14:48

It can still find it, although I wonder, as you mentioned, in stage 2 rather than stage 1....
I will do some further testing on a different exponent.

Did the same test again on prime95 to verify:
[CODE][May 2 16:52] Worker starting
[May 2 16:52] Setting affinity to run worker on any logical CPU.
[May 2 16:52] P-1 on M9090017 with B1=110000
[May 2 16:54] M9090017 stage 1 complete. 317502 transforms. Time: 128.915 sec.
[May 2 16:54] Stage 1 GCD complete. Time: 6.593 sec.
[May 2 16:54] P-1 found a factor in stage #1, B1=110000.
[May 2 16:54] M9090017 has a factor: 516770062491225473521
[May 2 16:54] No work to do at the present time. Waiting.
[/CODE]

Aramis Wyler 2013-05-02 15:18

[QUOTE=James Heinrich;339013]That is a bit disturbing.

[URL="http://www.mersenne.ca/exponent/9090017"]M9090017[/URL] has factor [URL="http://www.mersenne.ca/factor/516770062491225473521"]516770062491225473521[/URL], with a k of [URL="http://www.mersenne.ca/k/28425142796280"]28425142796280[/URL]
k-factored = 2[sup]3[/sup] × 3 × 5 × 61 × 97 × 389 × 102913
minimal bounds to find this factor in stage2 would be B1=389,B2=102913
minimal bounds to find this factor in stage1 would be B1=102913

You ran this with B1=115000 so it [I]should[/I] have found the factor, at least according to my understanding of P-1 :unsure:[/QUOTE]


If I follow that correctly, then with a B1 of 110000 not only should it have found it in stage 1, but it should not have been possible to find it in stage 2 (B1 too high). Is that right? Or could it have found it in stage 2 as a multiple of 102913 (like 205826)? Further, if it did find it as a multiple of 102913, would it have given the same factor (516770062491225473521) as prime95 did in stage 1?

owftheevil 2013-05-02 15:24

You are all right. There's something weird going on here.

Edit: 102913 is pairing up with 1341143, which gets caught in stage 2. But I still don't know why stage 1 is not finding the factor.

Edit2: Found it. Stage 1 doesn't stand a chance of finding any factor at the moment. It's not looking at the right data. Fix coming this evening.

Stef42 2013-05-02 15:48

I did some other exponents which had factors in low P-1 bounds.
Each and every one of them was reported by prime95 in stage 1; CUDAPm1 found them in stage 2.

James Heinrich 2013-05-02 15:54

1 Attachment(s)
[QUOTE=Stef42;339002]Right after stage 1 finished and stage 2 was initiated, I got a popup saying that CUDAPm1 crashed.[/QUOTE]Just for clarity, I have attached a screenshot showing this happening.

James Heinrich 2013-05-02 16:07

[QUOTE=Stef42;339019]I did some other exponents which had factors in low P-1 bounds.
Each and every one of them was reported by prime95 in stage 1; CUDAPm1 found them in stage 2.[/QUOTE]I tried looking for a factor where k=1, so I tried[code]CUDAPm1 4444091 -b1 100 -b2 1000[/code]It should've been found in stage 1; actually, it should have found 2 factors in stage 1:
[url=http://www.mersenne.ca/factor/8888183]8888183[/url] k = 1
[url=http://www.mersenne.ca/factor/319974553]319974553[/url] k = 36

But no factor(s) found:[quote]Stage 1 complete, estimated total time = 0:00
Starting stage 1 gcd.
M4444091 Stage 1 found no factor (P-1, B1=100, B2=390390, e=6, n=256K CUDAPm1 v0.00)
Starting stage 2.
Zeros: 12986, Ones: 24934, Pairs: 8522[/quote]

One side note:[quote]B2 should be at least 390390, increasing it.
Starting stage 1 P-1, M4444091, B1 = 100, B2 = 390390, e = 6, fft length = 256K[/quote]Is the B2>=390390 a fixed limitation, or tied to the exponent, or FFT, or...? It could be interesting to play with CUDAPm1 with a smaller B2 bound than that, if possible.

chalsall 2013-05-02 16:08

[QUOTE=owftheevil;339018]You are all right. There's something weird going on here.

Edit: 102913 is pairing up with 1341143, which gets caught in stage 2. But I still don't know why stage 1 is not finding the factor.

Edit2. Found it. Stage 1 doesn't stand a chance of finding any factor at the moment. Its not looking at the right data. Fix coming this evening.[/QUOTE]

A fundamental truth: software is hard. Computers do [B][I][U]exactly[/U][/I][/B] what we tell them to do (usually; damn bad hardware!). My second born for a DWIM command! :wink:

This is why extensive testing -- by many different people -- is required.

Good work everyone! :smile:

James Heinrich 2013-05-02 16:14

[QUOTE=James Heinrich;339025]actually it should have found 2 factors in stage 1[/QUOTE]On the plus side, it did find all 3 known factors in stage2, albeit as the composite of all of them:[code]M4444091 has a factor: 1809798096458971047321927127 (P-1, B1=100, B2=390390, e=6, n=256K CUDAPm1 v0.00)[/code]

Stef42 2013-05-02 16:20

[QUOTE=James Heinrich;339025]One side note:Is the B2>=390390 a fixed limitation, or tied to the exponent, or FFT, or...? It could be interesting to play with CUDAPm1 with a smaller B2 bound than that, if possible.[/QUOTE]

[CODE]B2 should be at least 1560000, increasing it.
Starting stage 1 P-1, M9090017, B1 = 120000, B2 = 1560000, e = 6, fft length = 5
12K[/CODE]

I'm not that good at figuring out what it's bound to. An example might help, though.

James Heinrich 2013-05-02 16:47

Playing with some limits checking.

smallest exponent: [url=http://www.mersenne.ca/exponent/86243]M86243[/url] (aka 28[sup]th[/sup] [url=http://www.mersenne.ca/prime.php]Mersenne Prime[/url]) -- checked for and warns user

non-prime exponents: checks for and warns user

maximum exponent: uncertain. I haven't tested extensively, but testing in the OBD range isn't working nicely: "CUDAPm1 3333333011 -b1 100 -b2 1000" crashes quickly ("CUDAPm1 has stopped working..."), whereas "CUDAPm1 3333333011" (no bounds specified) just sits there (no GPU load, no crash, no progress). Just under 2[sup]31[/sup] (M2000000011) does the same thing.
Just under 2[sup]30[/sup], it doesn't crash, but the error message is somewhat cryptic to me as an end-user:[code]CUDAPm1 1000000009 -b1 1000 -b2 10000
over specifications Grid = 110592
try increasing threads (512) or decreasing FFT length (55296K)[/code]

Specifying a negative exponent (e.g. "CUDAPm1 -3333333011") doesn't work, but doesn't issue any warnings either. I guess it's being treated as an unrecognized parameter, but a warning should be generated for unrecognized parameters.

owftheevil 2013-05-02 18:31

Threads is a parameter you can set in the ini file. 1024 is the largest possible value. That value should enable 1000000009 to run.

c10ck3r 2013-05-02 18:32

Q? about proto-p-1-cuda...
Does it write a .bu or .bu2 file like P95 does? If so, are they compatible? i.e. could I run stage 1 on GPU and stage 2 on CPU?

kracker 2013-05-02 19:13

[QUOTE=c10ck3r;339039]Q? about proto-p-1-cuda...
Does it write a .bu or .bu2 file like P95 does? If so, are they compatible? i.e. could I run stage 1 on GPU and stage 2 on CPU?[/QUOTE]

I believe the answer to that is somewhere in this thread.

EDIT: Or maybe that was on CuLu, I can't remember. :(

owftheevil 2013-05-02 19:18

About the 1000000009 run, I got to thinking (funny how I do most of that after speaking) that you would need an incredible amount of memory to get that to work: ~4.2 GB at the absolute minimum, ~2.4 GB just for stage 1.
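A rough back-of-the-envelope is consistent with those figures. The sketch below assumes the 55296K FFT length reported earlier for M1000000009 and 8-byte double-precision FFT elements; the 5-6 resident buffer count is a guess for illustration, not taken from CUDAPm1's source:

```python
# Rough memory estimate for P-1 on M1000000009.
# Assumptions (hypothetical, not from CUDAPm1's source): the 55296K FFT
# length reported above, 8-byte doubles, and 5-6 FFT-sized buffers
# resident during stage 1.
fft_len = 55296 * 1024               # FFT points for p = 1000000009
bytes_per_buffer = fft_len * 8       # one double-precision buffer
mib = bytes_per_buffer / 2**20
print(f"one buffer: {mib:.0f} MiB")                           # 432 MiB
print(f"5-6 buffers: {5*mib/1024:.1f}-{6*mib/1024:.1f} GiB")  # 2.1-2.5 GiB
```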

James Heinrich 2013-05-02 19:23

Far beyond my card, but Karl mentioned above he has a 6GB vidcard...

kracker 2013-05-02 19:35

[QUOTE=James Heinrich;339050]Far beyond my card, but Karl mentioned above he has a 6GB vidcard...[/QUOTE]

I think the Titan is the only nVidia GeForce card that has 6 GB.

EDIT: That or the Tesla K20X...

firejuggler 2013-05-02 20:29

There.
Windows 7 Home Premium (64-bit):
[code]
Iteration 873000 M61262347, 0x92b46441f57f0dc1, n = 3360K, CUDAPm1 v0.00 err = 0
.19531 (0:13 real, 12.7994 ms/iter, ETA 0:01)
M61262347, 0xfd7ab9d857ea4a36, offset = 0, n = 3360K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 3:06:32
Starting stage 1 gcd.
M61262347 Stage 1 found no factor (P-1, B1=605000, B2=16637500, e=2, n=3360K CUD
APm1 v0.00)
Starting stage 2.
Zeros: 875508, Ones: 853116, Pairs: 166845
itime: 1.982408, transforms: 1, average: 1982.408000
ptime: 1863.879498, transforms: 285724, average: 6.523356
ETA: 5:42:04
itime: 2.236556, transforms: 1, average: 2236.556000
ptime: 1867.422307, transforms: 286126, average: 6.526573
ETA: 5:11:17
itime: 2.341590, transforms: 1, average: 2341.590000
ptime: 1863.484573, transforms: 286070, average: 6.514086
ETA: 4:40:04
itime: 2.443132, transforms: 1, average: 2443.132000
ptime: 1864.386307, transforms: 286206, average: 6.514141
ETA: 4:08:56
itime: 2.479896, transforms: 1, average: 2479.896000
ptime: 1865.738907, transforms: 286420, average: 6.513997
ETA: 3:37:50
itime: 2.566038, transforms: 1, average: 2566.038000
ptime: 1866.830105, transforms: 286588, average: 6.513986
ETA: 3:06:45
itime: 2.578672, transforms: 1, average: 2578.672000
ptime: 1863.986985, transforms: 286146, average: 6.514112
ETA: 2:35:37
itime: 2.578564, transforms: 1, average: 2578.564000
ptime: 1868.104663, transforms: 286782, average: 6.514023
ETA: 2:04:31
itime: 2.616162, transforms: 1, average: 2616.162000
ptime: 1864.357941, transforms: 286198, average: 6.514224
ETA: 1:33:23
itime: 2.704018, transforms: 1, average: 2704.018000
ptime: 1869.413957, transforms: 286978, average: 6.514137
ETA: 1:02:16
itime: 2.703811, transforms: 1, average: 2703.811000
ptime: 1861.521090, transforms: 285758, average: 6.514327
ETA: 31:07
itime: 2.665333, transforms: 1, average: 2665.333000
ptime: 1862.245724, transforms: 285860, average: 6.514538
ETA: 0:00
Stage 2 complete, estimated total time = 6:13:31
Accumulated Product: M61262347, 0xa77ba20d6e2648c2, n = 3360K, CUDAPm1 v0.00
Starting stage 2 gcd.
M61262347 has a factor: 195362848474407049033033 (P-1, B1=605000, B2=16637500, e
=2, n=3360K CUDAPm1 v0.00)
[/code]
CUDA 5.0 installed too, on a GTX 560 with 1024 MB of RAM.

firejuggler 2013-05-02 20:49

Now, thanks to jwb52z, we have:
[code]
P-1 found a factor in stage #1, B1=580000.
UID: Jwb52z/Clay, M61761811 has a factor: 664146289430268916763473

79.136 bits.
[/code]

which is k = 2^3 * 3 * 269 * 331 * 8363 * 300857,
which theoretically could have been found with a B1 of 8363 and a B2 of 300857.
Stage 1 used 355 MB of RAM, stage 2 used 838 MB,
and 850 MB during the stage 2 GCD:
[code]
C:\Users\Vincent\Desktop\cudapm1>CUDAPm1.exe 61761811 -b1 8363 -b2 300857

Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
Warning: Couldn't parse ini file option ResultsFile; using default "results.txt"

CUDA reports zuM of zuM GPU memory free.
Using e=2, d=210, nrp=6
Using approximately zuM GPU memory.
Starting stage 1 P-1, M61761811, B1 = 8363, B2 = 300857, e = 2, fft length = 336
0K
Doing 12072 iterations
Iteration 1000 M61761811, 0x58d24f1daf85c89d, n = 3360K, CUDAPm1 v0.00 err = 0.2
2656 (0:16 real, 16.2453 ms/iter, ETA 2:59)
Iteration 2000 M61761811, 0xbf2f93dbb5319ece, n = 3360K, CUDAPm1 v0.00 err = 0.2
4219 (0:13 real, 12.9319 ms/iter, ETA 2:10)
Iteration 3000 M61761811, 0x92d6f0e4c26aff33, n = 3360K, CUDAPm1 v0.00 err = 0.2
3438 (0:13 real, 12.8572 ms/iter, ETA 1:56)

...

22656 (0:13 real, 12.8680 ms/iter, ETA 0:13)
Iteration 12000 M61761811, 0x19f76d7f61bb24ed, n = 3360K, CUDAPm1 v0.00 err = 0.
23828 (0:13 real, 12.9096 ms/iter, ETA 0:00)
M61761811, 0xd041eb56158c648e, offset = 0, n = 3360K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 2:39
Starting stage 1 gcd.
M61761811 Stage 1 found no factor (P-1, B1=8363, B2=300857, e=2, n=3360K CUDAPm1
v0.00)
Starting stage 2.
Zeros: 11856, Ones: 19440, Pairs: 5592
itime: 2.051921, transforms: 1, average: 2051.921000
ptime: 48.817234, transforms: 7426, average: 6.573826
ETA: 5:56
iETA: 0:51
itime: 2.995641, transforms: 1, average: 2995.641000
ptime: 49.863910, transforms: 7440, average: 6.702138
ETA: 0:00
Stage 2 complete, estimated total time = 6:56
Accumulated Product: M61761811, 0x5e6c85d01c0aae6e, n = 3360K, CUDAPm1 v0.00
Starting stage 2 gcd.
M61761811 has a factor: 664146289430268916763473 (P-1, B1=8363, B2=300857, e=2,
n=3360K CUDAPm1 v0.00)
[/code]
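As a sanity check on the factorization above (taking the claimed k as given), the factor can be rebuilt from it:

```python
# Rebuild jwb52z's factor from the claimed factorization of k.
# For P-1 on M(p), any factor f satisfies f = 2*k*p + 1; since k's two
# largest primes are 8363 and 300857, B1=8363 with B2=300857 suffices.
p = 61761811
k = 2**3 * 3 * 269 * 331 * 8363 * 300857
f = 2 * k * p + 1
print(f)  # 664146289430268916763473 -- matches the reported factor
```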

Karl M Johnson 2013-05-02 21:51

Technically, with that binary, we should not be able to use more than 2 GB of vRAM and 3-4 GB of RAM, since it is 32-bit:smile:
This is worth mentioning.

firejuggler 2013-05-02 22:16

Another thing:
[code]
M55824233 has a factor: 833043841114609831879 (P-1, B1=839, B2=11550, e=2, n=307
2K CUDAPm1 v0.00)
[/code]
This one has a k of p1*p2*...*839 and should have been found with a B1 of 839, but stage 1 doesn't find it; I have to wait until the end of stage 2 to get the factor.
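The smoothness claim can be checked directly: computing k from the reported factor and trial-dividing it (the `factorize` helper is just an illustration, not part of CUDAPm1) confirms that 839 is k's largest prime factor, so a B1=839 stage 1 should indeed suffice:

```python
# Verify that k = (f-1)/(2p) for this M55824233 factor is 839-smooth.
f = 833043841114609831879
p = 55824233
k, r = divmod(f - 1, 2 * p)
assert r == 0  # f = 2*k*p + 1 as expected for a Mersenne factor

def factorize(n):
    """Naive trial division; fine for a k this small (~7.5e12)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(factorize(k))  # [3, 3, 29, 167, 373, 547, 839]
```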

owftheevil 2013-05-02 23:05

One problem fixed.

[CODE]filbert@filbert:~/Build/cudapm1-0.00/cudapm1-code/trunk$ ./CUDAPm1 55824233 -b1 839

CUDA reports 716M of 1279M GPU memory free.
Using e=6, d=2310, nrp=12
Using approximately 681M GPU memory.
B1 should be at least 18324, increasing it.
Starting stage 1 P-1, M55824233, B1 = 839, B2 = 12625000, e = 6, fft length = 3072K
Doing 1239 iterations
Iteration 1000 M55824233, 0xa6b0b535ca74136a, n = 3072K, CUDAPm1 v0.00 err = 0.16406 (0:16 real, 15.6659 ms/iter, ETA 0:03)
M55824233, 0x8e2dd418ceb91638, offset = 0, n = 3072K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 0:18
Starting stage 1 gcd.
M55824233 has a factor: 833043841114609831879 (P-1, B1=839, B2=12625000, e=6, n=3072K CUDAPm1 v0.00)

[/CODE]

firejuggler 2013-05-02 23:42

Thanks. I guess that was an easy one.

Aramis Wyler 2013-05-02 23:59

Will there be a new windows build with the stage1 fix in? I have an [URL="http://www.evga.com/products/pdf/03G-P3-1591.pdf"]unusual 580[/URL] that I would be willing to run some tests against.

kladner 2013-05-03 03:43

[QUOTE=Aramis Wyler;339072]Will there be a new windows build with the stage1 fix in? I have an [URL="http://www.evga.com/products/pdf/03G-P3-1591.pdf"]unusual 580[/URL] that I would be willing to run some tests against.[/QUOTE]

Dang! That [U]is[/U] unusual! Nice amount of RAM, too. Does it OC at all?

Aramis Wyler 2013-05-03 04:34

Well, the 1.5gb version runs at 850, so as a lark I tried to crank this one up to 850/1700 as well. Sure enough, it has been stable. I have never tried to take it past 850/1700.

EDIT: Saying it clocks at 850 doesn't always mean anything in speed terms, so I grabbed this out of the mfaktc window for reference.
[CODE]
got assignment: exp=63249397 bit_min=73 bit_max=74 (30.25 GHz-days)
Starting trial factoring M63249397 from 2^73 to 2^74 (30.25 GHz-days)
k_min = 74662632479820
k_max = 149325264962435
Using GPU kernel "barrett76_mul32_gs"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
May 03 00:35 | 3827 82.8% | 5.773 15m53s | 471.52 69941 n.a.%
[/CODE]

frmky 2013-05-03 07:17

Still don't have the motivation to track down the problem reading text from ini files, but here's the next version to try.
[URL="https://www.dropbox.com/s/2b840sgu33bqm6l/cudapm1_20130502.zip"]https://www.dropbox.com/s/2b840sgu33bqm6l/cudapm1_20130502.zip[/URL]

Again, completely untested in Windows by me.

frmky 2013-05-03 07:30

[QUOTE=Stef42;339030][CODE]B2 should be at least 1560000, increasing it.
Starting stage 1 P-1, M9090017, B1 = 120000, B2 = 1560000, e = 6, fft length = 512K[/CODE]

I'm not that good at figuring out what it's bound to. An example might help, though.[/QUOTE]

The limits depend on B1, B2, d2, and e2. It's somewhat non-trivial, which is why the code handles it automatically.
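The roles of B1 and B2 themselves are easier to see in a toy two-stage P-1. This is a minimal textbook sketch on M29 = 233 * 1103 * 2089, not CUDAPm1's actual stage 2 with its d/e pairing:

```python
# Toy two-stage P-1 on M29.  A factor f = 2kp+1 falls to stage 1 when
# every prime power in k is <= B1, and to stage 2 when k has exactly
# one extra prime in (B1, B2].
from math import gcd, lcm
from functools import reduce

p, B1, B2 = 29, 8, 19
N = 2**p - 1

# Stage 1: x = 3^(p * lcm(1..B1)) mod N; the factor p of f-1 comes free.
E = p * reduce(lcm, range(2, B1 + 1))
x = pow(3, E, N)
print(gcd(x - 1, N))   # 233: its k = 4 = 2^2 is B1-powersmooth

# Stage 2: accumulate (x^q - 1) mod N over the primes q in (B1, B2].
acc = 1
for q in (11, 13, 17, 19):
    acc = acc * (pow(x, q, N) - 1) % N
print(gcd(acc, N))     # 256999 = 233 * 1103: the k of 1103 is the prime 19
```

2089 is missed by both stages because its k = 36 = 2^2 * 3^2 contains the prime power 9, which is above B1 and not a prime in (B1, B2].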

frmky 2013-05-03 07:33

[QUOTE=owftheevil;339065][CODE]
Starting stage 1 gcd.
M55824233 has a factor: 833043841114609831879 (P-1, B1=839, B2=12625000, e=6, n=3072K CUDAPm1 v0.00)
[/CODE][/QUOTE]
If the factor is found in stage 1, should the value of B2 in the output be equal to B1 as in the following:
M55824233 has a factor: 833043841114609831879 (P-1, B1=839, B2=839, e=6, n=3072K CUDAPm1 v0.00)

If so, that's an easy change.

Karl M Johnson 2013-05-03 08:07

Yay, it works!
Managed to get to stage 2 with -b1 500 -b2 0.5M; it was using around 1.8GB of vRAM, as the program calculated.
Now, how do I change the 'e' parameter to increase memory usage? Is there a way at all, perhaps indirect?
[CODE]
CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 99000 -b2 99000000 8000117
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
Warning: Couldn't parse ini file option ResultsFile; using default "results.txt"
------- DEVICE 0 -------
name GeForce GTX TITAN
totalGlobalMem -1
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
totalConstMem 65536
Compatibility 3.5
clockRate (MHz) 928
textureAlignment 512
deviceOverlap 1
multiProcessorCount 14

CUDA reports 4095M of 4095M GPU memory free.
Using e=6, d=2310, nrp=480
Using approximately 1737M GPU memory.
B1 should be at least 143687, increasing it.
Starting stage 1 P-1, M8000117, B1 = 143687, B2 = 99000000, e = 6, fft length = 448K
Doing 207401 iterations
Iteration 10000 M8000117, 0xcdeeefc9ed0c8af2, n = 448K, CUDAPm1 v0.00 err = 0.03418 (0:10 real, 1.0008 ms/iter, ETA 3:17)
Iteration 20000 M8000117, 0xc1f6fb554bd5366d, n = 448K, CUDAPm1 v0.00 err = 0.03809 (0:10 real, 0.9963 ms/iter, ETA 3:06)
Iteration 30000 M8000117, 0xa8c3682070917470, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:10 real, 0.9987 ms/iter, ETA 2:57)
Iteration 40000 M8000117, 0x8641b21065c7c3c4, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:10 real, 0.9896 ms/iter, ETA 2:45)
Iteration 50000 M8000117, 0xdde465fe55ac1ecb, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:10 real, 0.9817 ms/iter, ETA 2:34)
Iteration 60000 M8000117, 0xa795e30debbb03a1, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 0.9906 ms/iter, ETA 2:26)
Iteration 70000 M8000117, 0xbe53ab8c34cac0e3, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:10 real, 0.9818 ms/iter, ETA 2:14)
Iteration 80000 M8000117, 0x3f4e80a97be2b8b8, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:10 real, 0.9974 ms/iter, ETA 2:07)
Iteration 90000 M8000117, 0x64c10d213e1edda8, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 0.9985 ms/iter, ETA 1:57)
Iteration 100000 M8000117, 0xb85ce1dc3b7d9537, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:10 real, 0.9842 ms/iter, ETA 1:45)
Iteration 110000 M8000117, 0xa031e593b3e4eb0e, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0001 ms/iter, ETA 1:37)
Iteration 120000 M8000117, 0x33806a25d8628703, n = 448K, CUDAPm1 v0.00 err = 0.03418 (0:10 real, 1.0034 ms/iter, ETA 1:27)
Iteration 130000 M8000117, 0x4d78d18fdbe49d31, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0009 ms/iter, ETA 1:17)
Iteration 140000 M8000117, 0x8340c832411bb464, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0004 ms/iter, ETA 1:07)
Iteration 150000 M8000117, 0xf9531f22d8b5d8fb, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0039 ms/iter, ETA 0:57)
Iteration 160000 M8000117, 0xa5ee8cd34b352e31, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:10 real, 1.0033 ms/iter, ETA 0:47)
Iteration 170000 M8000117, 0x978568529d0b8b98, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0025 ms/iter, ETA 0:37)
Iteration 180000 M8000117, 0xe27641ef5a0da890, n = 448K, CUDAPm1 v0.00 err = 0.03442 (0:10 real, 1.0026 ms/iter, ETA 0:27)
Iteration 190000 M8000117, 0x28680bf513e7c074, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 0.9994 ms/iter, ETA 0:17)
Iteration 200000 M8000117, 0xc34709e98a8b4b98, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:10 real, 1.0015 ms/iter, ETA 0:07)
M8000117, 0x930fd9ded05878a5, offset = 0, n = 448K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 3:27
Starting stage 1 gcd.
M8000117 has a factor: 418913928878609399 (P-1, B1=143687, B2=99000000, e=6, n=448K CUDAPm1 v0.00)[/CODE]

Running the same binary with the same options, but on the GTX 480 (1.5 GB vRAM), also results in great success!
[CODE]CUDAPm1 -d 1 -threads 512 -c 10000 -t -polite 0 -b1 99000 -b2 99000000 8000117

Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
Warning: Couldn't parse ini file option ResultsFile; using default "results.txt"
------- DEVICE 1 -------
name GeForce GTX 480
totalGlobalMem 1610285056
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
Compatibility 2.0
clockRate (MHz) 1600
textureAlignment 512
deviceOverlap 1
multiProcessorCount 15

CUDA reports 1404M of 1535M GPU memory free.
Using e=6, d=2310, nrp=240
Using approximately 897M GPU memory.
B1 should be at least 143687, increasing it.
Starting stage 1 P-1, M8000117, B1 = 143687, B2 = 99000000, e = 6, fft length = 448K
Doing 207401 iterations
Iteration 10000 M8000117, 0xcdeeefc9ed0c8af2, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2635 ms/iter, ETA 4:09)
Iteration 20000 M8000117, 0xc1f6fb554bd5366d, n = 448K, CUDAPm1 v0.00 err = 0.03418 (0:12 real, 1.2055 ms/iter, ETA 3:45)
Iteration 30000 M8000117, 0xa8c3682070917470, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:12 real, 1.2043 ms/iter, ETA 3:33)
Iteration 40000 M8000117, 0x8641b21065c7c3c4, n = 448K, CUDAPm1 v0.00 err = 0.03418 (0:13 real, 1.2206 ms/iter, ETA 3:24)
Iteration 50000 M8000117, 0xdde465fe55ac1ecb, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2151 ms/iter, ETA 3:11)
Iteration 60000 M8000117, 0xa795e30debbb03a1, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.1971 ms/iter, ETA 2:56)
Iteration 70000 M8000117, 0xbe53ab8c34cac0e3, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2213 ms/iter, ETA 2:47)
Iteration 80000 M8000117, 0x3f4e80a97be2b8b8, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2120 ms/iter, ETA 2:34)
Iteration 90000 M8000117, 0x64c10d213e1edda8, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2175 ms/iter, ETA 2:22)
Iteration 100000 M8000117, 0xb85ce1dc3b7d9537, n = 448K, CUDAPm1 v0.00 err = 0.03418 (0:12 real, 1.2178 ms/iter, ETA 2:10)
Iteration 110000 M8000117, 0xa031e593b3e4eb0e, n = 448K, CUDAPm1 v0.00 err = 0.03711 (0:13 real, 1.2204 ms/iter, ETA 1:58)
Iteration 120000 M8000117, 0x33806a25d8628703, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2092 ms/iter, ETA 1:45)
Iteration 130000 M8000117, 0x4d78d18fdbe49d31, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2287 ms/iter, ETA 1:35)
Iteration 140000 M8000117, 0x8340c832411bb464, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2191 ms/iter, ETA 1:22)
Iteration 150000 M8000117, 0xf9531f22d8b5d8fb, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2197 ms/iter, ETA 1:10)
Iteration 160000 M8000117, 0xa5ee8cd34b352e31, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:13 real, 1.2187 ms/iter, ETA 0:57)
Iteration 170000 M8000117, 0x978568529d0b8b98, n = 448K, CUDAPm1 v0.00 err = 0.03369 (0:12 real, 1.1983 ms/iter, ETA 0:44)
Iteration 180000 M8000117, 0xe27641ef5a0da890, n = 448K, CUDAPm1 v0.00 err = 0.03516 (0:12 real, 1.2121 ms/iter, ETA 0:33)
Iteration 190000 M8000117, 0x28680bf513e7c074, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2194 ms/iter, ETA 0:21)
Iteration 200000 M8000117, 0xc34709e98a8b4b98, n = 448K, CUDAPm1 v0.00 err = 0.03320 (0:12 real, 1.2186 ms/iter, ETA 0:09)
M8000117, 0x930fd9ded05878a5, offset = 0, n = 448K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 4:12
Starting stage 1 gcd.
M8000117 has a factor: 418913928878609399 (P-1, B1=143687, B2=99000000, e=6, n=448K CUDAPm1 v0.00)[/CODE]

firejuggler 2013-05-03 08:30

I would suggest fiddling with the e value.

frmky 2013-05-03 08:59

[QUOTE=Karl M Johnson;339106]
Now, how do I change the 'e' parameter to increase memory usage (is there a way at all, indirect perhaps?)?
[/QUOTE]
Yes. e can be 2, 4, 6, 8, 10, or 12. Just use -e2 12.

Karl M Johnson 2013-05-03 09:13

Setting the Brent-Suyama exponent to the maximum resulted in a slight increase in memory usage, around 20 MB more.
Now, the other parameter is nrp (-nrp2 n?); I've tried increasing it too, but the program ignored the switch.

firejuggler 2013-05-03 09:16

Then... a higher exponent? Or higher bounds?

Karl M Johnson 2013-05-03 09:22

We do not seek simple solutions:smile:
I'll try to find the current binary's vRAM threshold; I can confirm it is NOT 2147483647 bytes (memPitch).

firejuggler 2013-05-03 09:36

Grab the Windows binary; there is an ini file in it that might help you.
A higher FFT length?

Karl M Johnson 2013-05-03 10:23

With e=12, d=2310 and nrp=480, the largest exponent that can be checked with the current binary is 14,155,777.
The next exponent, 14,155,807, can't go to stage 2.

Now, the real vRAM usage of CUDAPm1 for the 14,155,777 exponent is ~3073 MB (MSI Afterburner delta method), while the reported approximate vRAM usage was 3014 MB.

To conclude this micro-research: if you see an approximate memory usage of >=3139 MB, stage 2 will not work, even if your card has far more than that.

Proof:
[URL]http://i.imgur.com/iUpQaMr.png[/URL]
[URL]http://i.imgur.com/W8fqlWQ.png[/URL]

James Heinrich 2013-05-03 11:57

[QUOTE=frmky;339103]If the factor is found in stage 1, should the value of B2 in the output be equal to B1 as in the following:
M55824233 has a factor: 833043841114609831879 (P-1, B1=839, B2=839, e=6, n=3072K CUDAPm1 v0.00)
If so, that's an easy change.[/QUOTE]Yes, please, it would be helpful if the results indicated that.

James Heinrich 2013-05-03 12:06

[QUOTE=frmky;339101]here's the next version to try.[/QUOTE]Starting a new run looks better than last time:[code]Selected B1=560000, B2=14280000, 3.55% chance of finding a factor
CUDA reports 781M of 1279M GPU memory free.
Using e=6, d=2310, nrp=12
Using approximately 744M GPU memory.
Starting stage 1 P-1, M60817711, B1 = 560000, B2 = 14280000, e = 6, fft length = 3360K
Doing 807829 iterations[/code]I'll let it run and see if it finds the [url=http://www.mersenne.ca/exponent/60817711]known stage2 factor[/url].

kjaget 2013-05-03 14:02

[QUOTE=frmky;339101]Still don't have the motivation to track down the problem reading text from ini files[/QUOTE]


Remove the #define sscanf sscanf_s line from parse.c. sscanf_s requires that each string variable scanned into be followed by an argument giving the size of that string, but that's not done in the sscanf call in IniGetStr. As a result, sscanf_s picks a random uninitialized value off the stack as the length of the destination string, leading to random failures.

A real fix is implementing a wrapper, like the sprintf() one, that includes this size parameter in the call to sscanf_s. Or just don't use the "safe" version of this function, since it is more trouble than it's worth.

Stef42 2013-05-03 15:45

I'm getting a lot of cudaDeviceSynchronize() error 30...
Usually on high B2 values, while only 400-500 MB is used (low exponents).
Why this might have happened: [url]http://stackoverflow.com/questions/12200994/cuda-runtime-api-error-30-repeated-kernel-calls[/url]

James Heinrich 2013-05-03 17:34

[QUOTE=James Heinrich;339130]I'll let it run and see if it finds the [url=http://www.mersenne.ca/exponent/60817711]known stage2 factor[/url].[/QUOTE]It did:[code]Stage 2 complete, estimated total time = 2:57:29
Accumulated Product: M60817711, 0x978923630c42303f, n = 3360K, CUDAPm1 v0.00
Starting stage 2 gcd.
M60817711 has a factor: 3493866477323309653137460319 (P-1, B1=560000, B2=14280000, e=6, n=3360K CUDAPm1 v0.00)[/code]4.212GHz-days in 2h57m29s = 34GHz-days/day. A far cry from the ~400GHd/d the GTX570 can push in mfaktc, but also notably faster than can be done on my CPU.

