mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

firejuggler 2013-02-20 18:52

The P-1 factoring CUDA program
 
6 GB of non-shared memory means decent P-1 power, right?

firejuggler 2013-02-20 19:29

to **** with it. i'll try something, but it will be unclever, slow, and useless at first.

chalsall 2013-02-20 19:35

Plans for a P-1 factoring CUDA program
 
[QUOTE=firejuggler;330225]to **** with it. i'll try something, but it will be unclever, slow, and useless at first.[/QUOTE]

Hey, deliver a prototype P-1 GPU program and I'll give you my first born child.

(Please note: I'm intentionally childless.)

ET_ 2013-02-20 19:40

[QUOTE=firejuggler;330225]to **** with it. i'll try something, but it will be unclever, slow, and useless at first.[/QUOTE]

Are you planning to use CuFFT or a GPU-enabled GMP?

I have the second, but I'm afraid I can't reach decent efficiency using Montgomery multiplication...

What I need to know is how to apply the FFT multiplication routines to the code.

Yes, I'm dreaming about coding something myself...

Firejuggler, if you like you may share your ideas with us.
We could open a dedicated thread on how to design a GPU-enabled P-1 program (and invite people with experience...).

What do you think? :boxer::big grin::sos::et_:

Luigi

firejuggler 2013-02-20 19:41

Please note: I have a lot of free time, and less-than-average programming capacity...
the important part being the lot of free time.

ET_ 2013-02-20 19:45

[QUOTE=firejuggler;330229]please note : I have a lot of free time, and a less than average programming capacity...[/QUOTE]

Fine! I have some programming capacity, have already worked out some CUDA and math programs, and can read and check blueprints, code pseudo-algorithms and read specifications, but my time to code is limited (no CUDA 2.0 at work).

If you have free time, I'd like links to different working, easy-to-read P-1 programs...

Luigi

Dubslow 2013-02-20 20:02

Using CUDALucas' FFT/mul code would be a good place to start. I believe there is also other work being done on this particular front of the GPU P-1 issue.

chalsall 2013-02-20 20:07

[QUOTE=Dubslow;330233]Using CUDALucas' FFT/mul code would be a good place to start. I believe there is also other work being done on this particular front of the GPU P-1 issue.[/QUOTE]

Care to expand?

Belief is one thing. Knowledge is quite another....

Dubslow 2013-02-20 20:13

[QUOTE=chalsall;330235]Care to expand?[/quote]Nope.
[QUOTE=chalsall;330235]
Belief is one thing. Knowledge is quite another....[/QUOTE]
I did just double check that I was remembering correctly, and I was. One more knowledgeable than I in such large-number arithmetic on GPUs has "been thinking about p-1 on gpus".

I don't know if that's going anywhere though.

ET_ 2013-02-20 20:30

Plans for a P-1 factoring CUDA program
 
I've been thinking about a CUDA program for P-1 factoring for quite a while, and I think many other Mersennaries have too.

First of all, note that I have only limited knowledge of the math involved, but I'm willing to push past this limitation by studying under the guidance of more informed people, and eventually to start coding something with their help.

I'd like to gather ideas about how such a program should be designed. Some questions will be trivial, some others maybe deeper, but all of them will be enclosed in this thread.

Some naive subjects to talk about:

- Parallelization of tasks
- Limitations due to the GPU's memory (how far can we go with 0.5, 1, 2, 3 or 6 GB of memory?)
- Limitations of the GPU shared memory.
- Description of steps 1 and 2 (from MersenneWiki I got a grasp of them, but a talk would explain more).
- Use of streams to pass chunks of bytes to analyze.
- How to apply the CuFFT library to the algorithm.
- Is a parallel Montgomery multiplication algorithm out of the question for such an algorithm?

I hope it may help people in need of a CUDA P-1 program, programmers, and mathematicians alike.

Luigi

ET_ 2013-02-20 20:31

[QUOTE=Dubslow;330237]Nope.

I did just double check that I was remembering correctly, and I was. One more knowledgeable than I in such large-number arithmetic on GPUs has "been thinking about p-1 on gpus".

I don't know if that's going anywhere though.[/QUOTE]

Would you mind posting [URL="http://www.mersenneforum.org/showthread.php?p=330241#post330241"]here[/URL]?

Luigi

firejuggler 2013-02-20 20:39

Since we were talking about Mersenne tests, my first ideas would include arrays,
multiple Miller-Rabin primality tests, and modulo operations. But I realise that
would only work for the first phase of a P-1.

Naive implementation:
split the B1 range over as many cores (minus one) as there are;
determine k, calculate 2*k*p+1; if prime (determined by an MR test) => put k in an array.

The last core would take the values in the array and determine whether they divide Mp.
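
[Editor's note] A serial Python sketch of that outline, with hypothetical names (on a GPU, each core would take its own slice of k values). Note that as written this is really sieved trial factoring rather than P-1 proper:

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13)):
    """Miller-Rabin test; this base set is deterministic for n below ~3*10^14."""
    if n < 2:
        return False
    for b in bases:
        if n % b == 0:
            return n == b
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for b in bases:
        x = pow(b, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def find_factors(p, kmax):
    """Serial stand-in for the per-core loop: each core would get a k-range."""
    found = []
    for k in range(1, kmax + 1):
        q = 2 * k * p + 1
        # keep only prime candidates, then check divisibility of M_p = 2^p - 1
        if is_probable_prime(q) and pow(2, p, q) == 1:
            found.append(q)
    return found
```

For example, `find_factors(11, 10)` returns `[23, 89]`, the two prime factors of M11 = 2047.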

owftheevil 2013-02-20 23:39

[QUOTE=chalsall;330227]Hey, deliver a prototype P-1 GPU program and I'll give you my first born child.

(Please note: I'm intentionally childless.)[/QUOTE]

Does this mean that for a stage 1 only program, we would get half of your nonexistent first born?

chalsall 2013-02-20 23:56

[QUOTE=owftheevil;330265]Does this mean that for a stage 1 only program, we would get half of your nonexistent first born?[/QUOTE]

Sure.

I'll even put you up in a hotel in the "Gap" for half a night....

nucleon 2013-02-23 09:06

Thinking outside the box for a minute.

Are there any other primality/factoring tests that would easily suit GPUs and deliver good search percentages over time (similar to the per-time chance of finding a factor with TF/P-1)?

The round-peg-for-a-round-hole analogy comes to mind.

-- Craig

ewmayer 2013-02-26 20:17

[QUOTE=ET_;330241]Some naif subjects to talk about:

- Parallelization of tasks
- Limitations due to the memory factor of the GPU (how far may we go having 0.5, 1, 2, 3 or 6GB of memory?)
- Limitations of the GPU shared memory.
- Description of steps 1 and 2 (from MersenneWiki I got a grasp of it, but a talk would explain more).
- use of streams to pass chunks of bytes to analyze.
- How to apply CuFFT library to the algorithm.
- Is a parallel Montgomery multiplication algorithm out of question for such algorithm?[/QUOTE]

Let's start with the last item: Montmul is used for multiply w.r.to *generic* moduli ... for Mersenne/Fermat/etc-mod arithmetic you want the specialized FFT-based implicit mod algos, same as for LL (or more general primality) testing.

All the core ops for p-1 are quite similar to what you know from the lens of LL testing, just a couple twists:

1. For stage 1 you need to decide whether you want to use a fixed bound B1 or allow for later "powering-up" of a previous stage 1 residue to a larger B1. Most efficient is a fixed B1, that way you can precompute the product of all the needed stage 1 prime powers and use the resulting bitstring for fast LR binary modpow;

2. For stage 2 you need to precompute a set of small powers of the stage 1 residue, and store those in RAM in forward-FFTed form. Those get pointwise-multiplied by a "live" forward-FFTed dataset (i.e. the dyadic-mul takes data from 2 vectors and combines them, as opposed to the single-vector-data squaring used in LL) and then iFFT/carry steps are as before. Multiple stage 2 intervals can run in parallel starting with a common stage 1 residue.

But what you really need to do first of all is some kind of "simple" implementation of the algorithm in HLL code ... useless to try to port an algorithm to specialized hardware unless you understand it in some detail.
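
[Editor's note] ewmayer's point 1 can be sketched in a few lines of plain Python (hypothetical function names; Python's built-in `pow` stands in for the FFT-based modular squaring a GPU build would use):

```python
from math import gcd

def primes_up_to(n):
    """Plain Eratosthenes sieve."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]

def pm1_stage1(p, B1):
    """Fixed-B1 stage 1 for M_p = 2^p - 1: one big LR modpow, then a GCD."""
    N = (1 << p) - 1
    # E = p * product of all maximal prime powers <= B1; every factor
    # q = 2*k*p + 1 of M_p has p | q - 1, so include p up front.
    E = p
    for q in primes_up_to(B1):
        pe = q
        while pe * q <= B1:
            pe *= q
        E *= pe
    x = pow(3, 2 * E, N)   # binary modpow, like scanning the precomputed bitstring
    return gcd(x - 1, N)
```

For example, `pm1_stage1(11, 2)` returns 23 (since 23 - 1 = 2 * 11), and `pm1_stage1(29, 8)` finds a nontrivial factor of M29 divisible by 233 (since 233 - 1 = 2^3 * 29).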

owftheevil 2013-02-26 21:31

[QUOTE=ewmayer;331136]Let's start with the last item: Montmul is used for multiply w.r.to *generic* moduli ... for Mersenne/Fermat/etc-mod arithmetic you want the specialized FFT-based implicit mod algos, same as for LL (or more general primality) testing.

All the core ops for p-1 are quite similar to what you know from the lens of LL testing, just a couple twists:

1. For stage 1 you need to decide whether you want to use a fixed bound B1 or allow for later "powering-up" of a previous stage 1 residue to a larger B1. Most efficient is a fixed B1, that way you can precompute the product of all the needed stage 1 prime powers and use the resulting bitstring for fast LR binary modpow;

2. For stage 2 you need to precompute a set of small powers of the stage 1 residue, and store those in RAM in forward-FFTed form. Those get pointwise-multiplied by a "live" forward-FFTed dataset (i.e. the dyadic-mul takes data from 2 vectors and combines them, as opposed to the single-vector-data squaring used in LL) and then iFFT/carry steps are as before. Multiple stage 2 intervals can run in parallel starting with a common stage 1 residue.

But what you really need to do first of all is some kind of "simple" implementation of the algorithm in HLL code ... useless to try to port an algorithm to specialized hardware unless you understand it in some detail.[/QUOTE]

I haven't thought much about the stage 2 part yet, but for stage 1, the basic GPU kernels in CudaLucas can easily be modified to do everything except providing the list of prime powers to multiply, multiplying that list of prime powers, and then finding the GCD at the end. The first two would not be that hard to implement, but I don't think I have the tenacity to code an efficient CUDA GCD algorithm. However, there's no reason all this has to be done on the GPU.

ewmayer 2013-02-26 22:04

GCDs need to be done infrequently enough - once at the end of stage 1 and perhaps every few hours during stage 2 - that one should just use some decent third-party code (GMP, or George's optimized version of the Crandall/Buhler "giants" library GCD).

owftheevil 2013-02-27 01:37

Pretty much what I had in mind. I was thinking at first to do the product of the prime powers with GMP too. Also, maybe it's possible to save the result of stage 1 in an mprime- or Prime95-readable format and let those programs finish up stage 2. On many cards without much memory this would be useful.

owftheevil 2013-02-27 02:39

Proto-cuda-pm1 now computes powers of 3. Now to teach it which powers of 3 to compute.

Anybody know where I can get some residues of relatively small powers of 3 mod 2^q - 1 for various values of q?

henryzz 2013-02-27 11:21

[QUOTE=owftheevil;331191]Proto-cuda-pm1 now computes powers of 3. Now to teach it which powers of 3 to compute.

Anybody know where i can get some residues of relatively small powers of 3 mod 2^q - 1 for various value of q?[/QUOTE]

Run the following code through pfgw:
[CODE]SCRIPT

DIM expo, 31
DIM base, 3
DIM max_n, 100
DIM min_n, 1
OPENFILEAPP r_file,results.txt



DIM n, min_n
DIM result, 0
DIM mers, 2^expo-1
DIMS rstr



LABEL next_n
POWMOD result,base,n,mers
WRITE r_file,result
SET n, n+1
IF (n<=max_n) THEN GOTO next_n

LABEL end[/CODE]
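
[Editor's note] For anyone without pfgw handy, an equivalent sketch using Python's built-in three-argument `pow` (variable names mirror henryzz's script; this is a throwaway check, not part of any program in this thread):

```python
# Residues of 3^n mod 2^expo - 1 for n = 1..100, mirroring the pfgw script.
expo = 31
mers = (1 << expo) - 1               # 2147483647, itself a Mersenne prime
residues = [pow(3, n, mers) for n in range(1, 101)]

with open("results.txt", "w") as r_file:
    for r in residues:
        r_file.write("%d\n" % r)
```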

ET_ 2013-02-27 13:20

[QUOTE=henryzz;331226]Run the following code through pfgw:
[CODE]SCRIPT

DIM expo, 31
DIM base, 3
DIM max_n, 100
DIM min_n, 1
OPENFILEAPP r_file,results.txt



DIM n, min_n
DIM result, 0
DIM mers, 2^expo-1
DIMS rstr



LABEL next_n
POWMOD result,base,n,mers
WRITE r_file,result
SET n, n+1
IF (n<=max_n) THEN GOTO next_n

LABEL end[/CODE][/QUOTE]

Thanks Henry, you gave me a hint to solve another problem :bow:

Luigi

R.D. Silverman 2013-02-27 15:18

[QUOTE=ET_;330241]I've been thinking about a CUDA program for P-1 factoring for quite a bit, and think that many other Mersennaries had.

First of all, note that I have only a limited knowledge about the math involved, but I'm willing to expand this limitation studying under the guide of more informed people, and eventually start coding something with the ir help.
[/QUOTE]

Read:
P. Montgomery & R.D. Silverman
An FFT Extension to the P-1 Factoring Algorithm
Math. Comp.

It is available on the net. It should tell you everything you need to know.

[QUOTE]

I'd like to gather ideas about how such a program should be designed. Some questions will be trivial, some other maybe deeper, but all of them will be enclosed in this thread.

Some naif subjects to talk about:

- Parallelization of tasks
- Limitations due to the memory factor of the GPU (how far may we go having 0.5, 1, 2, 3 or 6GB of memory?)
- Description of steps 1 and 2 (from MersenneWiki I got a grasp of it, but a talk would explain more).
[/QUOTE]

It discusses all of this.

[QUOTE]
- Is a parallel Montgomery multiplication algorithm out of question for such algorithm?
[/QUOTE]

It would be OK for step 1. It is not relevant to step 2.

ET_ 2013-02-27 15:32

[QUOTE=R.D. Silverman;331247]Read:
P. Montgomery & R.D. Silverman
An FFT Extension to the P-1 Factoring Algorithm
Math. Comp.

It is available on the net. It should tell you everything you need to know.



It discusses all of this.



It would be OK for step 1. It is not relevant to step 2.[/QUOTE]

Thank you very much for the information!

I'm going to study that paper and eventually will come back for more questions.

Luigi

owftheevil 2013-02-28 13:18

Output from proto-CUDA-P-1:

[CODE]
filbert@filbert:~/Build/cudapm1-0.00$ ./CUDA-P-1 1594559

Starting Stage 1 P-1, M1594559, B1 = 1, fft length = 96K
Stage 1 complete, estimated total time = 0:00
3189119 is a factor of M1594559

[/CODE]

Of course, with B1 = 1 it's easy.

ET_ 2013-02-28 14:58

[QUOTE=owftheevil;331380]Output from proto-CUDA-P-1:

[CODE]
filbert@filbert:~/Build/cudapm1-0.00$ ./CUDA-P-1 1594559

Starting Stage 1 P-1, M1594559, B1 = 1, fft length = 96K
Stage 1 complete, estimated total time = 0:00
3189119 is a factor of M1594559

[/CODE]

Of course with B1 = 1 its easy.[/QUOTE]

It found a factor with B1=1 in "no time"! :et_:
Obviously it had k=1, but what the hell, it works!!! :smile:

Luigi

owftheevil 2013-02-28 15:29

Well, it only has to do 21 squarings and a few multiplies by 3 with a 96K FFT.
The output time is calibrated in seconds, so anything less than 1 second won't show up.

owftheevil 2013-02-28 15:31

[QUOTE=henryzz;331226]Run the following code through pfgw:
[CODE]SCRIPT

DIM expo, 31
DIM base, 3
DIM max_n, 100
DIM min_n, 1
OPENFILEAPP r_file,results.txt



DIM n, min_n
DIM result, 0
DIM mers, 2^expo-1
DIMS rstr



LABEL next_n
POWMOD result,base,n,mers
WRITE r_file,result
SET n, n+1
IF (n<=max_n) THEN GOTO next_n

LABEL end[/CODE][/QUOTE]


Thanks for the script.

owftheevil 2013-03-02 04:20

Stage 1 seems to be working ok.

[CODE]
filbert@filbert:~/Build/cudapm1-0.00$ ./CUDALucas 60981299 -f 3360k

Starting Stage 1 P-1, M60981299, B1 = 580000, fft length = 3360K
Stage 1 complete, estimated total time = 1:34:14
7124551590389340568394253966215081 is a factor of M60981299
[/CODE]

To put the elapsed time in context, this is on a 570 that completes a 28xxxxxx double check in 21:00--21:30 hours.

ET_ 2013-03-02 13:00

[QUOTE=owftheevil;331625]Stage 1 seems to be working ok.

[CODE]
filbert@filbert:~/Build/cudapm1-0.00$ ./CUDALucas 60981299 -f 3360k

Starting Stage 1 P-1, M60981299, B1 = 580000, fft length = 3360K
Stage 1 complete, estimated total time = 1:34:14
7124551590389340568394253966215081 is a factor of M60981299
[/CODE]

To put the elapsed time in context, this is on a 570 that completes a 28xxxxxx double check in 21:00--21:30 hours.[/QUOTE]

KUDOS! :bow:

Should you need a tester with a GTX580, I'm here to help. :smile:

I still can't understand the "estimated total time" (1:34:14) thingie. Do you mean that the code completes stage 1 at B1=580,000 in 1 hour and 34 minutes, and the DC assignment in 21 hours with CuLu?

Which version of CuFFT are you using?


Luigi

chalsall 2013-03-02 13:11

[QUOTE=ET_;331649]KUDOS! :bow:[/QUOTE]

+1!!! :smile:

[QUOTE=ET_;331649]Should you need a tester with a GTX580, I'm here to help.[/QUOTE]

Ditto. I only have a 560, but it is the 2GB version, so when you need some Stage 2 testing.... :wink:

henryzz 2013-03-02 13:13

[QUOTE=ET_;331649]KUDOS! :bow:

Should you need a tester with a GTX580, I'm here to help. :smile:

I still can't understand the "estimated total time" (1:34:14) thingie. Do you mean thet the code completes stage 1 at B1=580,000 in 1 hour and 34 minutes the DC assignment in 21 hours with CuLu?

Which version of CuFFT are you using?


Luigi[/QUOTE]

Stage 1 of a ~61M exponent in 1:34:14 and a DC of a ~28M exponent.


I am intrigued by how fast this would run small numbers up to a much higher B1. How much does the exponent affect the runtime? Could you try a sub-1000-bit exponent? Maybe a range of different-size exponents would be appropriate.

firejuggler 2013-03-02 13:35

1h and 34 min? That's 5 to 7 times faster than an ordinary CPU. Nice.

ET_ 2013-03-02 13:42

[QUOTE=firejuggler;331659]1H and 34 min? that's 5 to 7 time faster than a ordinary CPU. nice.[/QUOTE]

Please notice the B1=580,000... :smile:

Is the B1 limit actually a fixed one?

Luigi

firejuggler 2013-03-02 13:54

1 Attachment(s)
Yup,
1h34 is about 5700 seconds.
Below is a run of a similar-sized expo; the total run for phase 1 is 24400 seconds. OK... so the speedup is only about 4.2 times.

ET_ 2013-03-02 14:13

[QUOTE=firejuggler;331664]yup,
1H34 is about 5700 second.
Below a run of a similar sized expo , total run for phase 1 is 24400 seconds. Ok.. so speed up is only about 4.2 time[/QUOTE]

Please note that not everyone here has a high-clocked AVX2 processor available... Not to mention that the code here is still a proof of concept with just a few optimizations.

Luigi

firejuggler 2013-03-02 14:45

Please note that I'm genuinely impressed by the "proof-of-concept" speedup.

As for my CPU, it is a stock-speed i5 2500K, which is pretty 'ordinary' for today's computer enthusiast (the run was done on one core).

kracker 2013-03-02 15:42

[QUOTE=chalsall;331651]+1!!! :smile:



Ditto. I only have a 560, but it is the 2GB version, so when you need some Stage 2 testing.... :wink:[/QUOTE]

Now give us your half-first born child. Naow! :razz:

chalsall 2013-03-02 15:50

[QUOTE=kracker;331677]Now give us your half-first born child. Naow! :razz:[/QUOTE]

Sure. Where should half of my non-existent first-born be delivered? :wink:

owftheevil 2013-03-02 16:32

[QUOTE=henryzz;331652]
I am intrigued how fast this would run small numbers upto a much higher B1. How much does the exponent affect the runtime? Could you try a sub 1000-bit exponent? Maybe a range of different size exponents would be appropriate.[/QUOTE]

It could be coerced into taking exponents that small (I assume you mean exponents p with Mp < 1000 bits), but it wouldn't be very efficient. Toom-Cook multiplication would be better, or even grammar school multiplication if you go small enough. A very rough upper bound on the number of iterations you need for a given B1 is log2(B1) * the number of primes < B1. Iteration times will be close to what CuLu gets for the same fft. This is after all only a slight modification of CuLu. For very large B1 things will be about 5-10% slower for some final segment.
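
[Editor's note] The rough iteration bound owftheevil gives is easy to check numerically. A small sketch (sieve and function names are mine, not from the program): each maximal prime power in the stage 1 exponent E is at most B1, so the number of squarings, which is the bit length of E, is bounded by about log2(B1) times the number of primes below B1 (ignoring the extra ~26-bit factor of p in the exponent):

```python
from math import log2

def primes_up_to(n):
    """Plain Eratosthenes sieve."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]

def stage1_squarings(B1):
    """Exact squaring count (bit length of E) vs. the log2(B1)*pi(B1) bound."""
    ps = primes_up_to(B1)
    E = 1
    for q in ps:
        pe = q
        while pe * q <= B1:
            pe *= q
        E *= pe
    exact = E.bit_length()             # squarings in a left-to-right modpow of 3^E
    bound = int(log2(B1) * len(ps)) + 1
    return exact, bound
```

For instance `stage1_squarings(10)` gives `(12, 14)`: E = 2^3 * 3^2 * 5 * 7 = 2520 needs 12 squarings against a bound of 14, and for B1 = 580000 the exact count likewise stays comfortably under the bound.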

owftheevil 2013-03-02 16:40

[QUOTE=ET_;331661]Please notice the B1=580,000... :smile:

Is the B1 limit actually a fixed one?

Luigi[/QUOTE]

Yes, at the moment I have to rebuild it every time I want to use a different B1. In fact, the next thing I'm going to do is enable it to get the B1 from the command line.

[QUOTE]
Please note that not everyone here has an high-clocked AVX2 processor available... Not to mention that the code here is still a proof of concept with just few optimizations.

Luigi[/QUOTE]It won't get much faster than it is now.

owftheevil 2013-03-02 16:45

[QUOTE=chalsall;331678]Sure. Where should half of my non-existent first-born be delivered? :wink:[/QUOTE]

E-mail would be fine. But if it's half of a boy, don't bother. My wife only wants it if it's half of a girl.

And as for the half night in the gap hotel, I presume I have to find my own way to Barbados? Or are you also going to provide transportation for half of the way there?

ET_ 2013-03-02 16:55

[QUOTE=owftheevil;331679]It could be coerced into taking exponents that small (I assume you mean exponents p with Mp < 1000 bits), but it wouldn't be very efficient. Toom-Cook multiplication would be better, or even grammar school multiplication if you go small enough. A very rough upper bound on the number of iterations you need for a given B1 is log2(B1) * the number of primes < B1. Iteration times will be close to what CuLu gets for the same fft. This is after all only a slight modification of CuLu. For very large B1 things will be about 5-10% slower for some final segment.[/QUOTE]

Maybe Montgomery/Barrett could fill the gap between Toom-Cook and grammar-school multiplication, but at the moment I'd rather have a "fully-functional" P-1 on current exponents, leaving the smaller ones to either mfakt? or gpu-ecm...

You rock, owftheevil! :et_:

Luigi

chalsall 2013-03-02 17:00

[QUOTE=owftheevil;331683]e-mail would be fine. But if its half of a boy, don't bother. My wife only wants it if its half of a girl.[/QUOTE]

Actually it's transgendered... MtF... :wink:

[QUOTE=owftheevil;331683]And as for the half night in the gap hotel, I presume I have to find my own way to Barbados? Or are you also going to provide transportation for half of the way there?[/QUOTE]

Since I just got a big contract, I'll also provide half a trip here. You'll have to swim the rest of the way.... :smile:

Sincerely though, thanks [I][U]very[/U][/I] much for your work! :bow:

owftheevil 2013-03-02 17:06

And I thank you for yours.

chalsall 2013-03-02 17:28

[QUOTE=owftheevil;331688]And I thank you for yours.[/QUOTE]

Thank you.

xilman 2013-03-02 17:47

[QUOTE=kracker;331677]Now give us your half-first born child. Naow! :razz:[/QUOTE]Top half or bottom half?

ET_ 2013-03-02 18:49

[QUOTE=xilman;331690]Top half or bottom half?[/QUOTE]

Axial symmetry?

Luigi

owftheevil 2013-03-02 21:56

1 Attachment(s)
Here's the code if anyone wants to play with it. It builds without problems on Ubuntu 12.04 with either CUDA 4.2 or CUDA 5.0 and GMP 5.1.0. It's essentially a slightly modified CUDALucas, so if you can build CuLu on Windows, you have a good start on building this.


Edit to add: run it with e.g.

[CODE]./CUDA-pm1 60593041, -b1 1000, [-f 3360k][/CODE]

If you don't specify b1 it defaults to 1. Specifying the fft is only necessary if b1 is small (< ~ 690). In that case, the test will finish before it knows if the fft is big enough and will sometimes give invalid results.

Batalov 2013-03-02 22:15

Very impressive and very "evil"! "The lesser of two weevils". :tu:

ET_ 2013-03-02 23:21

[QUOTE=owftheevil;331717]Here's the code if anyone wants to play with it. It builds without problems on Ubuntu 12.04 with either CUDA 4.2 or CUDA 5.0 and GMP 5.1.0. It's essentially a slightly modified CUDALucas, so if you can build CuLu on Windows, you have a good start on building this.


Edit to add: run it with e.g.

[CODE]./CUDA-pm1 60593041, -b1 1000, [-f 3360k][/CODE]

If you don't specify b1 it defaults to 1. Specifying the fft is only necessary if b1 is small (< ~ 690). In that case, the test will finish before it knows if the fft is big enough and will sometimes give invalid results.[/QUOTE]

Version 0.00, I love it!

Thanks a lot!

Luigi

P.S. the auto-correction works like a charm...
[code]
luigi@luigi-ubuntu:~/luigi/CUDA/cudapm1-0.00$ ./CUDA-Pm1 60593041, -b1 1000

Starting Stage 1 P-1, M60593041, B1 = 1000, fft length = 3200K
Doing 1475 iterations
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.14299, max error = 0.34317
Iteration 200, average error = 0.13821, max error = 0.32842
Iteration = 284 < 1000 && err = 0.35178 >= 0.35, increasing n from 3200K
Starting Stage 1 P-1, M60593041, B1 = 1000, fft length = 3360K
Doing 1475 iterations
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.05615, max error = 0.12774
Iteration 200, average error = 0.05472, max error = 0.13322
Iteration 300, average error = 0.05601, max error = 0.13491
Iteration 400, average error = 0.05518, max error = 0.14575
Iteration 500, average error = 0.05690, max error = 0.12720
Iteration 600, average error = 0.05746, max error = 0.13921
Iteration 700, average error = 0.05881, max error = 0.13656
Iteration 800, average error = 0.05901, max error = 0.13856
Iteration 900, average error = 0.05955, max error = 0.14368
Iteration 1000, average error = 0.05916 < 0.25 (max error = 0.14575), continuing test.
M60593041, 0x962b95049cafb7d9, offset = 0, n = 3360K, CUDA-P-1 v0.00
Stage 1 complete, estimated total time = 0:58
M60593041 has a factor: 2105528336291622770155712978260232660484461209
[/code]

P.S.2: M1257787 has a factor: 1

owftheevil 2013-03-02 23:35

I'm not sure, but I think Dubslow is responsible for the round-off test part. It's hard to tell who did what on CuLu.

Edit: I didn't see the PS. I have so far been too lazy to make a different message for no factor found. I was thinking of just adding "but you already knew that, didn't you."

henryzz 2013-03-02 23:39

I can see that CPUs are going to become obsolete for P-1 stage 1 soon. This should help kill the P-1 deficit.

firejuggler 2013-03-02 23:43

Thanks for your work owftheevil.

frmky 2013-03-03 01:39

Great! I very much look forward to this being polished. Stage-1 only is certainly much better than no P-1. Once this uses GIMPS input files and saves results in GIMPS output format, I'll switch my 4 C1060's from double-checks to P-1 Stage-1. :smile:

ixfd64 2013-03-03 01:43

I may be wrong, but I believe stage 2 can be easily parallelized. If that's true, then the GPU firepower will really come in handy.

owftheevil 2013-03-03 03:59

I have now seen one case where an exponent, 59756923, passes the round-off test but shows no factor. Increasing the FFT correctly finds the factor. Going to have to slow it down a little.

frmky 2013-03-03 04:55

For that case, the max error is significantly larger than the average error. Looks like an average error < [STRIKE]0.15[/STRIKE] 0.1 might be an appropriate check with some safety margin?

[CODE][childers@physicstitan cudapm1]$ ./CUDA-Pm1 59756923, -b1 1100

Starting Stage 1 P-1, M59756923, B1 = 1100, fft length = 3072K
Doing 1637 iterations
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration = 27 < 1000 && err = 0.50000 >= 0.35, increasing n from 3072K
Starting Stage 1 P-1, M59756923, B1 = 1100, fft length = 3200K
Doing 1637 iterations
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.08286, max error = 0.27734
Iteration 200, average error = 0.08826, max error = 0.29688
Iteration 300, average error = 0.09689, max error = 0.28125
Iteration 400, average error = 0.10635, max error = 0.30859
Iteration 500, average error = 0.11458, max error = 0.27344
Iteration 600, average error = 0.11786, max error = 0.28125
Iteration 700, average error = 0.11941, max error = 0.28125
Iteration 800, average error = 0.12019, max error = 0.28125
Iteration 900, average error = 0.12194, max error = 0.31250
Iteration 1000, average error = 0.12054 < 0.25 (max error = 0.31250), continuing test.
M59756923, 0xd040e885dd81e22d, offset = 0, n = 3200K, CUDA-P-1 v0.00
Stage 1 complete, estimated total time = 0:53
M59756923 has a factor: 1
[/CODE]

frmky 2013-03-03 05:07

[QUOTE=owftheevil;331717]./CUDA-pm1 60593041, -b1 1000, [-f 3360k][/QUOTE]
How is this factor found with B1=1000?
P-1 = 2^3 * 3^5 * [B]2551[/B] * 60593041 * [B]P9[/B] * [B]P23[/B]

Batalov 2013-03-03 07:32

It's composite, so that's ok.
[CODE]2105528336291622770155712978260232660484461209 =
969488657 * 19874517449 * 2208979902697 * 49468643416729[/CODE]each of which passes with B1=41 (!).
There are more factors known for [URL="http://mersenne.org/report_exponent/?exp_lo=60593041&exp_hi=10000&B1=Get+status"]this Mp[/URL]
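
[Editor's note] Batalov's observation is easy to double-check with Python's bignums (a throwaway verification script, not part of any program in this thread): every prime factor q of M60593041 has the form q = 2kp + 1, and the k values here are tiny and smooth, which is why B1 = 41 suffices (strictly, it is the multiplicative order of 3 mod q that must divide the stage 1 exponent; e.g. k = 18228 contains 7^2 = 49 > 41):

```python
p = 60593041
qs = [969488657, 19874517449, 2208979902697, 49468643416729]

for q in qs:
    assert (q - 1) % (2 * p) == 0   # q = 2*k*p + 1, as every factor of M_p must be
    assert pow(2, p, q) == 1        # q really divides 2^p - 1

ks = [(q - 1) // (2 * p) for q in qs]
print(ks)                           # [8, 164, 18228, 408204] -- tiny and smooth

# The product of the four primes is exactly the composite "factor" reported.
prod = 1
for q in qs:
    prod *= q
assert prod == 2105528336291622770155712978260232660484461209
```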

frmky 2013-03-03 07:34

Ah, makes sense. Thanks! :smile:

Dubslow 2013-03-03 10:51

[QUOTE=owftheevil;331733]I'm not sure, but I think Dubslow is responsible for the roundoff test part. Its hard to tell who did what on CuLu.

Edit: I didn't see the PS. I have so far been too lazy to make a different message for no factor found. I was thinking of just adding "but you already knew that, didn't you."[/QUOTE]
I added a bit, but msft had the gist of it.

[QUOTE=henryzz;331735]I can see that cpus are going to become obsolete for P-1 stage 1 soon. This should help kill the P-1 deficit.[/QUOTE]
I wouldn't quite go that far yet, especially w.r.t. stage 2. :smile:

[QUOTE=frmky;331757]For that case, the max error is significantly larger than the average error. Looks like an average error < [STRIKE]0.15[/STRIKE] 0.1 might be an appropriate check with some safety margin?
[/QUOTE]

Wowzers... I've not seen anything like that before. Perhaps a better solution would be maxerr/avgerr < 1.5 (or maybe 2)?

owftheevil 2013-03-03 14:58

1 Attachment(s)
Ok, this should be better. 3%-4% slower, but gives correct results, even when the max error is allowed to go as high as 0.42. Also fixes the error reporting problem.

ET_ 2013-03-03 18:59

[QUOTE=owftheevil;331784]Ok, this should be better. 3%-4% slower, but gives correct results, even when the max error is allowed to go as high as 0.42. Also fixes the error reporting problem.[/QUOTE]

With this update I got a couple of strange behaviors:

1:
[code]
luigi@luigi-ubuntu:~/luigi/CUDA/cudapm1-0.00$ ./CUDA-Pm1 4170308402961950452420687314125107372845632692860124825390003761727514150572517983869509135472975278394865154210790597209778982578895669768763371749038447454396115727404741278971617695528084038894140322072199744865271524521758726031117787322230290427036555791315034863880063825719334586180093, -b1 1000

Can't open workfile worktodo.txt
[/code]

I suppose that the P-1 program actually only works with Mersenne exponents, and that's not a bad thing, but searching the source code for the string "Can't open workfile worktodo.txt" turned up nothing :sad:


2:
[code]
luigi@luigi-ubuntu:~/luigi/CUDA/cudapm1-0.00$ ./CUDA-Pm1 60593041, -b1 1000

Starting Stage 1 P-1, M60593041, B1 = 1000, fft length = 3200K
Doing 1475 iterations
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.22696, max error = 0.34664
^C SIGINT caught, writing checkpoint.Iteration 200, average error = 0.26131, max error = 0.34241
Iteration 300, average error = 0.27237, max error = 0.33556
^C^C SIGINT caught, writing checkpoint. SIGINT caught, writing checkpoint.Iteration 400, average error = 0.27678, max error = 0.36553
Iteration 500, average error = 0.28131, max error = 0.33734
Iteration 600, average error = 0.28373, max error = 0.33752
Iteration 700, average error = 0.28460, max error = 0.34566
Iteration 800, average error = 0.28564, max error = 0.35941
Iteration 900, average error = 0.28658, max error = 0.35676
Iteration 1000, average error = 0.28638 < 0.25 (max error = 0.36553), continuing test.
Estimated time spent so far: 0:39
[/code]

i.e. the Ctrl-C is trapped during the error-checking routine, but is not acted upon by the program.

Luigi

Dubslow 2013-03-04 00:22

The second is deliberate (though I forget why). It should quit immediately after the roundoff test is finished (as it seems it did).

As for the first, it's probably a printf substitution -- search for "Can't open workfile %s" or, more safely, search for "Can't open" or "Can't open workfile".

owftheevil 2013-03-06 00:09

[QUOTE=ixfd64;331747]I may be wrong, but I believe stage 2 can be easily parallelized. If that's true, then the GPU firepower will really come in handy.[/QUOTE]

You are right. The way I see it now, stage 2 naturally splits into 3 tasks that can be separated into different streams. Not sure yet how much this will speed things up, or how much it will have the different streams stepping on each other's toes.

owftheevil 2013-03-10 16:39

Just a quick progress report. Stage 2 is coming along nicely. Referring to (p*k)^e - rp^e, it understands different values of e and p, and I'm teaching it about rp now. Then it's just a matter of choreography and logistics. The general multiplication kernel is functioning properly; a quick and dirty test on M51943429 with b1 = 1248, b2 = 1249, e = 2, and p = 6 found the factor.

ET_ 2013-03-10 17:34

[QUOTE=owftheevil;332672]Just a quick progress report. Stage 2 is coming along nicely. Referring to (p*k)^e - rp^e, it understands different values of e and p, I'm teaching it about rp now. Then its just a matter of choreography and logistics. The general multiplication kernel is functioning properly, a quick and dirty test on M51943429 with b1 = 1248, b2 = 1249, e = 2, and p = 6 found the factor.[/QUOTE]

:tu::big grin::fusion:

swl551 2013-03-10 18:06

I realize your work is still under development, but I have to ask whether CudaP-1 will have a configuration/.INI file similar to MFAKTO/C's. Of course, I ask this to determine, or sway, its development to ensure MISFIT can easily work with the final product. Ideally, MISFIT for MFAKTO/C would be able to front CudaP-1 with only minor changes.


Thanks

Scott

ET_ 2013-03-10 18:11

[QUOTE=swl551;332693]I realize your work is still under development, but I have to ask if CudaP-1 will have a similar configuration/.INI to MFAKTO/C. Of course, I ask this to determine or sway its development to ensure MISFIT can easily work with the final product. Ideally MISFIT for MFAKTO/C would be able to front CudaP-1 with only minor changes.


Thanks

Scott[/QUOTE]

CUDA P-1 is still under development, and AFAIK (and as far as I have tested) there is still no user interface (though I found a reference to a "worktodo" file).

I expect that the final version should be similar to either Mfakt? or CudaLucas, to be usable with manual reservations.

Luigi

Dubslow 2013-03-10 23:56

The "final" version should be easy to make work the same way as CUDALucas, seeing as it's a CUDALucas conversion.

Stef42 2013-03-11 15:59

I suppose there is only a linux version atm?

Dubslow 2013-03-11 16:30

[QUOTE=Stef42;332821]I suppose there is only a linux version atm?[/QUOTE]

flash/Brain ought to be able to compile it for Winbloze.

Stef42 2013-03-11 18:01

That would be very nice. I got one GTX 590 and one GTX 570 running full speed, P-1 could be a nice addition to that :smile:

Brain 2013-03-12 21:17

No Windows compile
 
[QUOTE=Dubslow;332825]flash/Brain ought to be able to compile it for Winbloze.[/QUOTE]
I cannot compile for Windows for at least 2 reasons:
- #include <gmp.h> fails. Changed to "gmp.h" but I have no Windows GMP library, only the header file gmp.h
- #include <cuda_safecalls.h> fails
I could investigate, but I don't have much time and I'm no linker expert. If anybody else can offer/point to precompiled libs and a suitable VC10 makefile...

Dubslow 2013-03-12 21:33

Hmm... I'm fairly certain that cuda_safecalls.h is included in CUDALucas as well, but I'm absolutely certain that the GMP dependency is new to P-1. You or others who want a Windows build ought to ask Brian Gladman for help; he's the forum's resident Windows/VS/GMP-for-Windows-aka-MPIR expert. :smile:

ET_ 2013-03-21 20:32

Question.

How is the GPU memory use computed, related to exponent, B1 and B2?

In other words, how can I know if B1 and B2 ranges fit in my GPU memory?

Luigi

owftheevil 2013-03-22 12:36

[QUOTE=ET_;334402]Question.

How is the GPU memory use computed, related to exponent, B1 and B2?

In other words, how can I know if B1 and B2 ranges fit in my GPU memory?

Luigi[/QUOTE]

The exponent affects the memory use by way of the fft size needed for that exponent. The B1 and B2 don't really have an effect. What really affects the memory use is the B-S exponent e and the number of relative primes p processed in a pass. If n is the fft size, each data sequence uses 8 * n bytes. I think I can get by with an overhead of 4 such sequences. In addition, e + p sequences are needed.
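owftheevil's description above can be turned into a rough back-of-the-envelope estimate. A sketch only: the 8-bytes-per-element figure and the overhead of 4 sequences are taken from the post, while the sample values of n, e, and p below are purely illustrative and say nothing about cudapm1's actual defaults.

```python
def stage2_memory_bytes(fft_size, e, rp_per_pass, overhead_seqs=4):
    """Rough stage-2 device-memory estimate per the post: each data
    sequence holds fft_size doubles (8 bytes each), and
    (overhead + e + rp_per_pass) sequences are resident at once."""
    bytes_per_seq = 8 * fft_size
    return bytes_per_seq * (overhead_seqs + e + rp_per_pass)

# Illustrative values: a 3200K FFT with e = 2 and 8 relative primes per pass.
n = 3200 * 1024
est = stage2_memory_bytes(n, e=2, rp_per_pass=8)
print(f"{est / 2**20:.0f} MiB")  # 350 MiB for these sample values
```

Raising e or the number of relative primes per pass trades device memory for stage-2 efficiency, which is why B1 and B2 themselves barely matter here.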

NBtarheel_33 2013-03-22 22:46

Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.

Dubslow 2013-03-22 23:06

[QUOTE=NBtarheel_33;334570]Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.[/QUOTE]

Host-to-device and back memory transfers are [i]painfully[/i] slow. CUDALucas (at least pre-bit-shift) used one device-to-host memory transfer (just one!) per iteration at its maximum "-polite" setting -- and that caused a performance hit of 20%. Actually using main memory (many transfers of much data per "iteration") in a useful way would be impossibly slow.

owftheevil 2013-03-23 01:04

[QUOTE=NBtarheel_33;334570]Is there any way of getting the GPU to make use of the system RAM? That would *really* give you some power.[/QUOTE]

There is one place where host ram can be used. Stage 2 initialization data can be stored there. That will make starting a new pass for the next batch of relative primes relatively quick and painless. The host to device transfers would be spread out enough so as not to cause a log jam.

NBtarheel_33 2013-03-24 09:07

[QUOTE=Dubslow;334575]Host to device and back memory transfers are [I]painfully[/I] slow. CUDALucas (at least pre-bit shift) used one device to host memory transfer (just one!) per iteration at its maximum "-polite" setting -- that caused a performance hit of 20%. Actually using main memory (many transfers of much data in an "iteration) in a useful way will be impossibly slow.[/QUOTE]

Bummer. But owftheevil's idea of storing Stage 2 initialization data there is better than nothing, I suppose.

GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.

Dubslow 2013-03-24 10:45

[QUOTE=NBtarheel_33;334752]
GPUs should have ever-increasing amounts of RAM in the years to come, anyway. It's also much faster RAM - I think GPUs are already at DDR5, while DDR4 system RAM is still in its infancy.[/QUOTE]

Don't be confused -- it's GDDR5, not DDR5. It is rather faster than DDR3 though
:smile:

[url]http://en.wikipedia.org/wiki/GDDR5[/url]
[quote]Like its predecessor, GDDR4, GDDR5 is based on DDR3 SDRAM memory which has double the data lines compared to DDR2 SDRAM...[/quote]
(GDDR3 is based on DDR2 tech.)

owftheevil 2013-04-13 22:45

Cudapm1 output:

[CODE]
M61076737 has a factor: 432634830991289176546683053423
[/CODE]Run with B1 = 65000, B2 = 12035000, n = 3360K, d = 2310, e = 2, 8 rp per pass. It used about 600 MB of device memory. Stage 2 took ~53 minutes.

Edit: Looks like about 15 minutes longer to make e = 4.
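For what it's worth, the d = 2310 and "8 rp per pass" figures above can be related with a small sketch. This is an illustration only, not a statement about cudapm1's internals: it assumes the usual primorial choice d = 2*3*5*7*11 and that the relative primes are the residues coprime to d.

```python
from math import gcd

d = 2310  # = 2 * 3 * 5 * 7 * 11, a common stage-2 block size
rel_primes = [r for r in range(1, d) if gcd(r, d) == 1]
print(len(rel_primes))  # phi(2310) = 480

# With 8 relative primes processed per pass, covering all of them would
# take 480 / 8 = 60 passes (implementations often halve this by
# exploiting the r <-> d - r symmetry, but that is a separate choice).
rp_per_pass = 8
print(len(rel_primes) // rp_per_pass)  # 60
```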

Aramis Wyler 2013-04-13 23:22

That would definitely put a dent in our P-1 deficit. Though it's hard to trade 25x P-1 work for 125x factoring work.

EDIT: Not that it wouldn't get used, though. I was trading 10 GHz-days of factoring per GHz-day of P-1, and this is a better deal than that. :smile:

bcp19 2013-04-14 15:22

Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?

owftheevil 2013-04-14 16:36

[QUOTE=bcp19;337058]Is the program fairly stable now or is it still 'in beta'? Also, is everything done on the GPU or does it take up a CPU core?[/QUOTE]

It's not at all stable yet, and it lacks a lot of basic functionality besides. It makes heavy use of a CPU core during initialization of stage 1 and when computing the GCD after either stage. Other than that, the CPU load is not noticeable, much like CUDALucas.

kladner 2013-04-14 16:44

[QUOTE=owftheevil;337063]Its not at all stable yet, and lacks a lot of basic functionality besides. It makes heavy use of a cpu core during initialization of stage 1 and when computing the gcd after either stage. Other than that, the cpu load is not noticeable, much like CUDALucas.[/QUOTE]

I look forward to trying it when it is ready to debut!

Chuck 2013-04-14 17:33

Please make it use a worktodo file to stage work.

ATH 2013-04-14 21:20

[QUOTE=owftheevil;336998]Cudapm1 output:

[CODE]
M61076737 has a factor: 432634830991289176546683053423
[/CODE]Run with B1 = 65000, B2 = 12035000, n = 3360k, d = 2310, e =2, 8 rp per pass. It used about 600Mb of device memory. Stage 2 took ~53 minutes.

Edit: Looks like about 15 minutes longer to make e = 4.[/QUOTE]

To compare with CPU speed, I ran the same curve in Prime95 on a laptop with a Core i7-2720QM (Sandy Bridge):
using 1 core: stage 1 43 min, stage 2 ~8h (3 GB RAM)
using 4 cores: stage 1 19 min, stage 2 ~3.8h (3 GB RAM)

I only completed ~20% of stage2 and extrapolated the runtime.

owftheevil 2013-04-15 01:11

I've been thinking about the numbers, and it seems that 53 minutes for stage 2 is faster than possible. I have a strong suspicion that there is an extra or missing factor of 2 in the code, causing only half the rps to get processed. Off to look for it.

owftheevil 2013-04-15 01:19

Found it. Sorry about the false expectations.

kladner 2013-04-15 05:11

It is interesting to watch the process of things getting worked out.:smile: :popcorn::smile:

NBtarheel_33 2013-04-15 08:52

[QUOTE=owftheevil;337120]Found it. Sorry about the false expectations.[/QUOTE]

53 minutes, 106 minutes, still a huge improvement over CPU P-1! :smile:

NBtarheel_33 2013-04-15 09:03

[QUOTE=Aramis Wyler;336999]Though it's hard to trade 25x p-1 work for 125x factoring work.[/QUOTE]

Factoring to 7x bits (i.e., one additional bit level somewhere in the 70s) gives you roughly a 1/7x = 1.27-1.43% chance of finding a factor.

P-1 with decent bounds will typically give you a 5-8% chance of finding a factor.

So, given 125 TF attempts, we'd expect roughly 1.6-1.8 factors found. On the other hand, 25 P-1 attempts should yield roughly 1.25-2.0 factors found.

If GPU P-1 allows us to increase bounds or make more frequent use of the Brent-Suyama extension, the expected number of successes will be at or above the higher end of this range. In that case, it would make complete sense to trade 125x TF for 25x P-1.

Note also that GPU P-1 will make use of the *GPU* RAM, rather than the system RAM. This could bring in P-1'ers who were previously unable to dedicate large quantities of RAM to Stage 2.
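The arithmetic behind the estimates above can be checked directly. A sketch: the 1/bit-level rule of thumb for TF and the 5-8% per-attempt P-1 range are the figures quoted in the post, not independent data.

```python
# Expected factors from 125 TF attempts, each one bit level in the 70s:
# a factor between 2^(b-1) and 2^b exists with probability ~ 1/b.
tf_low, tf_high = 125 / 78, 125 / 70
print(f"125 TF attempts: {tf_low:.2f}-{tf_high:.2f} expected factors")

# Expected factors from 25 P-1 attempts at a 5-8% per-attempt success rate.
pm1_low, pm1_high = 25 * 0.05, 25 * 0.08
print(f"25 P-1 attempts: {pm1_low:.2f}-{pm1_high:.2f} expected factors")
```

With these inputs the two workloads come out roughly even in expected factors, which is why the per-attempt P-1 probability (the 3.3-3.6% vs. 5-8% dispute that follows) decides whether the trade is worthwhile.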

garo 2013-04-16 18:56

P-1 with 2GB memory in the 61M range gives a probability of success of 3.3-3.6% depending on the TF level. Dunno where you got 5-8%.

NBtarheel_33 2013-04-16 20:55

[QUOTE=garo;337309]P-1 with 2GB memory in the 61M range gives a probability of success of 3.3-3.6% depending on the TF level. Dunno where you got 5-8%.[/QUOTE]

OK, probably won't see 8% unless you're a fan of strong P-1 with high B1 and B2, but one doesn't have to stretch too much to see 5%...

From James' site ([URL]http://mersenne.ca[/URL]):

M61000000, factored to 70 bits, assuming 2 L-L tests saved, with B1=670,000 and B2=16,750,000, using K*B^N+C = 1*2^61000000-1
Probability = [B]5.664070%[/B]

M65000000, factored to 70 bits, with B1=800,000 and B2=24,000,000, using K*B^N+C = 1*2^65000000-1
Probability = [B]6.224824%[/B]

