mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2013-05-30, 00:01   #276
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32·5·7 Posts
Default

Past the edit time limit. I realized I left out a factor of 53 in the bit about b1 and b2 restrictions.

b2 / d / 53 <= b1, otherwise some small primes > b1 do not get included.
owftheevil is offline   Reply With Quote
Old 2013-05-30, 11:00   #277
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

24·13·47 Posts
Default

Quote:
Originally Posted by owftheevil View Post
Speed. 0's in the binary representation of the exponent require a squaring, 1's require an additional multiplication by the base. If the base is 3, this can be done with a modified normalization kernel with negligible increase in time, but with a huge integer base, it requires an additional fft multiplication.
That is generally not true, I mean in average, you get the same amount of squarings and shifts/additions if you do the primes one by one, or multiply them first.

Example:
5*7*11=385, or in binary 101*111*1011=110000001, in the "one by one" you do 2+2+3=7 squaring and 1+2+2=5 shift/add, in the "mul first" case you do 8 squaring and 2 shift/add, certainly advantage for "one by one" for a small base.
7*61=427 (I deliberately select primes with many ones!) = 111*111101=110101011, (7+6) against (8+5).
If there are many factors, you must do their sum(size[i]-1) squarings, but when you multiply first you do size(sum-1), i.e. you waste a squaring for each aditional factor. 75 factors, each having 20 bits, you do 75*19=1425 squarings if you take them one by one (because you ignore the first bit being 1 every time in exponentiation), but if you do the product first, you have also 75*(20-1)=1425 squarings. Statistically, in 1425 bits you have to do 713 multiplications (they can be 1 or 0), in any case. Even if not, computing them one by one has a lot of other advantages (easy to extend the B1, lower waiting state, etc).

Also, for the "one huge prime variation", (the stage 2 part), you always can get a compromise which is "almost" same fast, but used much less memory (this does not work if you want to have the BrS extension, but if you don't want it, you can do stage 2 almost as same fast, but using less memory, which will allow much higher B2, and will also put the "cheap cards" in the game, right now only people with a lot of mem on the card can do "good" P-1 with the GPU).

You can have a look to a pari implementation attached below, I think I posted it in the past, but I don't know where. Rename to delete the ".txt", load it in pari with \r, and use "mpm1(257,,10^6,10^7)" as a sample command. The stage 2 of that is VERY fast (well, for pari!) and it does not consume memory (of course, no BrS extension, just classical). I am also working to implement something like this with cuda, but I stopped due to the lack of math knowledge related to fft and due to poor skill with cuda, which I am still learning. The last cuda version works with my gtx580 with a kind of karatsuba multiplication (divide and conquer, integer only) and it is quite slow. I can give parameters for the number of iterations after which a GCD() is taken, which (depending of the expo) can speed up the things. For example if B2 is very big, better take a couple of GCD's, one might get lucky and find the factor faster, etc. You can consider these things (and many other) for the future versions of cudaPm1.
Attached Files
File Type: txt pollardq.gp.txt (11.4 KB, 115 views)

Last fiddled with by LaurV on 2013-05-30 at 11:26
LaurV is offline   Reply With Quote
Old 2013-05-30, 12:17   #278
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32×5×7 Posts
Default

Quote:
That is generally not true, I mean in average, you get the same amount of squarings and shifts/additions if you do the primes one by one, or multiply them first.

Example:
5*7*11=385, or in binary 101*111*1011=110000001, in the "one by one" you do 2+2+3=7 squaring and 1+2+2=5 shift/add, in the "mul first" case you do 8 squaring and 2 shift/add, certainly advantage for "one by one" for a small base.
7*61=427 (I deliberately select primes with many ones!) = 111*111101=110101011, (7+6) against (8+5).
If there are many factors, you must do their sum(size[i]-1) squarings, but when you multiply first you do size(sum-1), i.e. you waste a squaring for each aditional factor. 75 factors, each having 20 bits, you do 75*19=1425 squarings if you take them one by one (because you ignore the first bit being 1 every time in exponentiation), but if you do the product first, you have also 75*(20-1)=1425 squarings. Statistically, in 1425 bits you have to do 713 multiplications (they can be 1 or 0), in any case. Even if not, computing them one by one has a lot of other advantages (easy to extend the B1, lower waiting state, etc).
You are correct, but I was not claiming that the total number of operations was reduced by lumping the prime powers together. Multiplying by 3 as opposed to multiplying by a huge integer is where the time gets saved.

Quote:
Also, for the "one huge prime variation", (the stage 2 part), you always can get a compromise which is "almost" same fast, but used much less memory (this does not work if you want to have the BrS extension, but if you don't want it, you can do stage 2 almost as same fast, but using less memory, which will allow much higher B2, and will also put the "cheap cards" in the game, right now only people with a lot of mem on the card can do "good" P-1 with the GPU).
I don't know what you mean by "one huge prime variation" and "cheap cards". I would be interested in hearing hows this works.

Again, thanks for your input.

Carl
owftheevil is offline   Reply With Quote
Old 2013-05-30, 12:38   #279
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

24×13×47 Posts
Default

On the bright side, look what I have found in the logs when I just arrived home now:

Code:
M631247 has a factor: 83947913480780207864691737 (P-1, B1=3100000, B2=3100000, e=6, n=40K CUDAPm1 v0.10)
Unfortunately other 30 of them had no factor, and the situation would have been differently if I could give a 12M limit for B1 for a 10% chance

BTW, for all of them, the B2 was automatically extended to 40300000, which is what the log shows (the screen redirected in a text file), the test was done, in a single "gulp" (1 to 480 primes), but the results.txt only shows 3M1 for both limits (?!), I don't know if the test was really done to which B2 limit. For the factor found, the "k" is smoother than B1 so this proves nothing, there was no stage 2.

On the dark side, PrimeNet says:
Code:
No factor lines found: 0 
Mfaktc no factor lines found: 0 
Mfakto no factor lines found: 0 
CUDAPm1-factor lines found: 1 
Insufficient information for accurate CPU credit. 
For stats purposes, assuming factor was found using ECM with B1 = 50000. 
CPU credit is 0.0170 GHz-days. 
CUDAPm1-nofactor lines found: 38 
CPU credit is 0.1130 GHz-days. 
CPU credit is 0.1130 GHz-days. 
CPU credit is 0.1130 GHz-days.
<etc>
and robbed me of 0.1 GD of credit The "factor" line was not the firtst in the file, and anyhow, I thought this problem is only for TF/mfaktc factors seen as P-1, but not for P-1 seen as ECM shock:

Last fiddled with by LaurV on 2013-05-30 at 12:41
LaurV is offline   Reply With Quote
Old 2013-05-30, 13:04   #280
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

350210 Posts
Default

Quote:
Originally Posted by LaurV View Post
I thought this problem is only for TF/mfaktc factors seen as P-1, but not for P-1 seen as ECM shock:
I still need to grind through the manual-results code and get it parsing sensibly. Several result types (e.g. mfakt*, CUDAPm1) are not unclear as to the result type, but the extant PrimeNet code still "guesses" how the factor was found, even though it should already be known.

I've posted this before somewhere a while back, but this is the abbreviated logic for how PrimeNet currently guesses as to the factor method:
Code:
add_user_msg( $resp, "Insufficient information for accurate CPU credit.");
if ( $k == 1 && $b == 2 && $n % 4096 == 0 && $c == 1 ) {
	add_user_msg( $resp, "For stats purposes, assuming factor was found using ECM with B1 = 44M.");
} else if ( $k != 1 || $b != 2 || $c != -1 || $n < 16000000) {
	if ($bits > 55) {
		add_user_msg( $resp, "For stats purposes, assuming factor was found using ECM with B1 = 50000.");
	} else {
		add_user_msg( $resp, "For stats purposes, assuming factor was found by trial factoring using prime95.");
	}
} else {
	if ($bits > $t_results[tflimit]) {
		add_user_msg( $resp, "For stats purposes, assuming factor was found using P-1 with B1 = 800000.");
	} else {
		add_user_msg( $resp, "For stats purposes, assuming factor was found by trial factoring using prime95.");
	}
}
Hopefully I'll get to rewriting the manual-results parsing logic within the next couple weeks or so.
James Heinrich is offline   Reply With Quote
Old 2013-05-30, 13:26   #281
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

24×13×47 Posts
Default

Quote:
Originally Posted by LaurV View Post
I don't know if the test was really done to which B2 limit. For the factor found, the "k" is smoother than B1 so this proves nothing, there was no stage 2.
Ignore that! I was stupid. The results are stored properly in the results file, with the right extended B2, and the test is done to extended B2, too (just found a new factor). After a shower and a some dinner I can read properly the results.txt
LaurV is offline   Reply With Quote
Old 2013-05-30, 13:31   #282
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

24×13×47 Posts
Default

Quote:
Originally Posted by James Heinrich View Post
I still need to grind through the manual-results code and get it parsing sensibly.
Looking into my log file, I see that the "factor" lines were searched first, and the "no factor" were searched only after (see the former post, log included). Maybe that is the problem? Adding "no factor lines in front would be no use in this case (like in mfaktx would have no preliminary "no factor" line, misfit is changing a line order, problem solved). Just an idea...
LaurV is offline   Reply With Quote
Old 2013-05-30, 13:40   #283
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

350210 Posts
Default

The whole manual results parsing code needs to be reworked. Fortunately I already have nicely-working code on mersenne.ca that I can pretty much just copy over. I just need a little while to sit down and work through it.
James Heinrich is offline   Reply With Quote
Old 2013-05-30, 13:45   #284
ET_
Banned
 
ET_'s Avatar
 
"Luigi"
Aug 2002
Team Italia

3×1,609 Posts
Default

Quote:
Originally Posted by James Heinrich View Post
The whole manual results parsing code needs to be reworked. Fortunately I already have nicely-working code on mersenne.ca that I can pretty much just copy over. I just need a little while to sit down and work through it.
Thanks James

Luigi
ET_ is online now   Reply With Quote
Old 2013-05-30, 14:21   #285
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32×5×7 Posts
Default

From LaurV:

Quote:
But not this is the main problem. All values between 3200000 and 20M are parsed wrong, it says "B1 need to be at least 1" and does a test with B1=1 and B2=393xxx or so, which does find a factor, if one exists for these values.
Integer overflow. It will go away when the silly b1 and b2 restrictions are removed.
owftheevil is offline   Reply With Quote
Old 2013-05-31, 18:50   #286
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32×5×7 Posts
Default

LaurV,

I took a look at your pari implementation of p-1. What you are doing is essentially the case e = 1 and d = 6. CUDAPm1 currently can deal with d = 6, but to get e =1, I would have to include clauses in some of the stage 2 initialization functions, and provide for incrementing the main stage 2 loop counter by 1 instead of 2. Not hard to do. Any coder that doesn't suck could do it in an hour. I might be able to do it in a day. It would save 16 * (fft length) bytes of device memory, but take a performance hit directly related to the number of prime pairs 6*k - 5, 6*k +5 and 6*k -1, 6*k + 1 between b2 / 5 and b2 compared to using e = 2 and d = 6.

By the way, very clean and easy to read code. I wish I could do that.

Last fiddled with by owftheevil on 2013-05-31 at 19:19
owftheevil is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfaktc: a CUDA program for Mersenne prefactoring TheJudger GPU Computing 3506 2021-09-18 00:04
World's second-dumbest CUDA program fivemack Programming 112 2015-02-12 22:51
World's dumbest CUDA program? xilman Programming 1 2009-11-16 10:26
Factoring program need help Citrix Lone Mersenne Hunters 8 2005-09-16 02:31
Factoring program ET_ Programming 3 2003-11-25 02:57

All times are UTC. The time now is 18:28.


Tue Oct 19 18:28:21 UTC 2021 up 88 days, 12:57, 0 users, load averages: 1.35, 1.19, 1.17

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.