mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   SQUFOF implementation (https://www.mersenneforum.org/showthread.php?t=13276)

alpertron 2010-04-10 23:14

SQUFOF implementation
 
Since there are not many places on the Internet where you can find a SQUFOF implementation, I've written mine in Java based on the algorithm explained in the Gower and Wagstaff paper. This will be used in my factorization applet.

[code]
/* Implementation of algorithm explained in Gower and Wagstaff paper */
static int SQUFOF(long N, int queue[])
{
double sqrt;
int B, Q, Q1, P, P1, L, S;
int i, r, s, t, q, u;
int queueHead, queueTail, queueIndex;
/* Step 1: Initialize */
if ((N & 3) == 1)
{
N <<= 1;
}
sqrt = Math.sqrt(N);
S = (int)sqrt;
if ((long)(S+1)*(long)(S+1)<=N)
{
S++;
}
if ((long)S*(long)S > N)
{
S--;
}
if ((long)S*(long)S == N)
{
return S;
}
Q1 = 1;
P = S;
Q = (int)N - P*P;
L = (int)(2*Math.sqrt(2*sqrt));
B = L << 1;
queueHead = 0;
queueTail = 0;
/* Step 2: Cycle forward to find a proper square form */
for (i=0; i<=B; i++)
{
q = (S+P)/Q;
P1 = q*Q-P;
if (Q <= L)
{
if ((Q & 1) == 0)
{
queue[queueHead++] = Q >> 1;
queue[queueHead++] = P % (Q >> 1);
if (queueHead == 100)
{
queueHead = 0;
}
}
else if (Q+Q<=L)
{
queue[queueHead++] = Q;
queue[queueHead++] = P % Q;
if (queueHead == 100)
{
queueHead = 0;
}
}
}
t = Q1+q*(P-P1);
Q1 = Q;
Q = t;
P = P1;
if ((i & 1) == 0 && ((Q & 7) < 2 || (Q & 7) == 4))
{
r = (int)Math.sqrt(Q);
if (r*r == Q)
{
queueIndex = queueTail;
for (;;)
{
if (queueIndex == queueHead)
{
/* Step 3: Compute inverse square root of the square form */
Q1 = r;
u = (S-P)%r;
u += (u >> 31) & r;
P = S-u;
Q = (int)((N-(long)P*(long)P)/Q1);
/* Step 4: Cycle in the reverse direction to find a factor of N */
for (;;)
{
q = (S+P)/Q;
P1 = q*Q-P;
if (P == P1)
{
break;
}
t = Q1 +q*(P-P1);
Q1 = Q;
Q = t;
P = P1;
}
/* Step 5: Get the factor of N */
if ((Q & 1) == 0)
{
return Q >> 1;
}
return Q;
}
s = queue[queueIndex++];
t = queue[queueIndex++];
if (queueIndex == 100)
{
queueIndex = 0;
}
if ((P-t)%s == 0)
{
break;
}
}
if (r > 1)
{
queueTail = queueIndex;
}
if (r == 1)
{
queueIndex = queueTail;
for (;;)
{
if (queueIndex == queueHead)
{
break;
}
if (queue[queueIndex] == 1)
{
return 0;
}
queueIndex += 2;
if (queueIndex == 100)
{
queueIndex = 0;
}
}
}
}
}
}
return 0;
}
[/code]

The queue (an integer array with 100 elements) is created by the caller so it can be reused.
Notice that in Step 2 a square root must be computed every two cycles of the loop. This can probably be optimized. The paper says that step 4 can be optimized but I will need to investigate it.
Maybe some forum participants who programmed SQUFOF before (Bob and others) can give me some hints in order to optimize it.

bsquared 2010-04-11 01:45

Are you using multipliers? I didn't see that you were, but maybe I missed where they were implemented or maybe they are already in N when this routine is called. Racing several small multipliers (by keeping a small amount of state memory for each multipler and running smallish batches of iterations on each) is a good optimization. Msieve and yafu both have implementations which do this.

The step 4 optimization I believe is called fast return. Search for papers by Steven McMath on this subject - he researched it extensively and also published some code, I think... yes, it's called "Parallel integer factorization using quadratic forms". I never got around to implementing fast return, so if you are successful I'd be very interested in seeing it.

alpertron 2010-04-11 02:01

I have not implemented multipliers in the code. First I wanted to be sure that the code worked. There were two errors in the algorithm (section 3.3 of the Gower and Wagstaff paper) that I corrected in the implementation.

From what I read in [url]http://hal.inria.fr/docs/00/45/16/04/PDF/smallint_expcomp_draft_02_1.pdf[/url] it appears that the "fast return" is really faster when factoring numbers greater than 20 digits, but I think that in this case SIQS will be faster.

jasonp 2010-04-11 11:57

What corrections did you have to make? It's been a long time, but the SQUFOF code in msieve was taken directly from Gower's description, and I didn't understand enough of the underlying math to make corrections.

Also, it was quite helpful in C to handle a special case of q=1 before doing the divisions at the beginning of the even and odd cycles. It probably wouldn't help in Java though.

R.D. Silverman 2010-04-11 12:16

[QUOTE=alpertron;211339]Since there are not many places on the Internet where you can find a SQUFOF implementation, I've written mine in Java based on the algorithm explained in the Gower and Wagstaff paper. This will be used in my factorization applet.

[code]
/* Implementation of algorithm explained in Gower and Wagstaff paper */
static int SQUFOF(long N, int queue[])
{
double sqrt;
int B, Q, Q1, P, P1, L, S;
int i, r, s, t, q, u;
int queueHead, queueTail, queueIndex;
/* Step 1: Initialize */
if ((N & 3) == 1)
{
N <<= 1;
}
sqrt = Math.sqrt(N);
S = (int)sqrt;
if ((long)(S+1)*(long)(S+1)<=N)
{
S++;
}
if ((long)S*(long)S > N)
{
S--;
}
if ((long)S*(long)S == N)
{
return S;
}
Q1 = 1;
P = S;
Q = (int)N - P*P;
L = (int)(2*Math.sqrt(2*sqrt));
B = L << 1;
queueHead = 0;
queueTail = 0;
/* Step 2: Cycle forward to find a proper square form */
for (i=0; i<=B; i++)
{
q = (S+P)/Q;
P1 = q*Q-P;
if (Q <= L)
{
if ((Q & 1) == 0)
{
queue[queueHead++] = Q >> 1;
queue[queueHead++] = P % (Q >> 1);
if (queueHead == 100)
{
queueHead = 0;
}
}
else if (Q+Q<=L)
{
queue[queueHead++] = Q;
queue[queueHead++] = P % Q;
if (queueHead == 100)
{
queueHead = 0;
}
}
}
t = Q1+q*(P-P1);
Q1 = Q;
Q = t;
P = P1;
if ((i & 1) == 0 && ((Q & 7) < 2 || (Q & 7) == 4))
{
r = (int)Math.sqrt(Q);
if (r*r == Q)
{
queueIndex = queueTail;
for (;;)
{
if (queueIndex == queueHead)
{
/* Step 3: Compute inverse square root of the square form */
Q1 = r;
u = (S-P)%r;
u += (u >> 31) & r;
P = S-u;
Q = (int)((N-(long)P*(long)P)/Q1);
/* Step 4: Cycle in the reverse direction to find a factor of N */
for (;;)
{
q = (S+P)/Q;
P1 = q*Q-P;
if (P == P1)
{
break;
}
t = Q1 +q*(P-P1);
Q1 = Q;
Q = t;
P = P1;
}
/* Step 5: Get the factor of N */
if ((Q & 1) == 0)
{
return Q >> 1;
}
return Q;
}
s = queue[queueIndex++];
t = queue[queueIndex++];
if (queueIndex == 100)
{
queueIndex = 0;
}
if ((P-t)%s == 0)
{
break;
}
}
if (r > 1)
{
queueTail = queueIndex;
}
if (r == 1)
{
queueIndex = queueTail;
for (;;)
{
if (queueIndex == queueHead)
{
break;
}
if (queue[queueIndex] == 1)
{
return 0;
}
queueIndex += 2;
if (queueIndex == 100)
{
queueIndex = 0;
}
}
}
}
}
}
return 0;
}
[/code]

The queue (an integer array with 100 elements) is created by the caller so it can be reused.
Notice that in Step 2 a square root must be computed every two cycles of the loop. This can probably be optimized. The paper says that step 4 can be optimized but I will need to investigate it.
Maybe some forum participants who programmed SQUFOF before (Bob and others) can give me some hints in order to optimize it.[/QUOTE]



[code]
squfof(n,fac1,fac2)
int n[];
int *fac1, *fac2;

{ /* start of squfof: factor n as fac1*fac2 faster in FP?????*/

int temp[MPDIM], temp1[MPDIM];
register int iq,ll,l2,p,pnext,q,qlast,r,s,t,i;
int jter,iter;

if (SQUFOF) squfof_calls++;
#define DEBUG_QSC 0
if (DEBUG_QSC) { (void) printf("trying qs on: "); write_num(n,numstr); fflush(stdout); }
jter = small_qs(n,fac1,fac2);
if (DEBUG_QSC)
{
(void) printf("qs returns %d\n",jter);
(void) printf("tried qs on "); write_num(n, numstr);
(void) printf("fac1, fac2 = %d %d, squfof returning %d\n", *fac1, *fac2, jter);
fflush(stdout);
} /* -1 should be very rare; QS found no non-trivial GCD's */
if (jter != -1) return(jter);

/* Notes: could replace calls to the MP routines [on Pentium] by using*/
/* __int64 for temp,temp1, convert n to __int64 etc. But the MP stuff */
/* is not inside a loop. It is not worth doing */

qlast = 1;
mpsqrt(n,temp);

s = LAST(temp);
p = s;
mult(temp,temp,temp1);
subt(n,temp1,temp); /* temp = n - floor(sqrt(n))^2 */
if (SIZE(temp) <= SINGLE && LAST(temp) == 0)
{ /* Here n is a square */
*fac1 = s;
*fac2 = s;
return(1);
}

q = LAST(temp); /* q = excess of n over next smaller square */
ll = 1 + 2*(int)sqrt((double)(p+p));
l2 = ll/2;
qpoint = 0;

/* In the loop below, we need to check if q is a square right before */
/* the end of the loop. Is there a faster way? The current way is */
/* EXPENSIVE! (many branches and double prec sqrt) */

for (jter=0; jter < 800000; jter++) /* I see no way to speed this */
{ /* main loop */
iq = (s + p)/q;
pnext = iq*q - p;
if (q <= ll)
{
if ((q & 1) == 0) enqu(q/2,&jter);
else if (q <= l2) enqu(q,&jter);
if (jter < 0)
{ /* queue overflow: try pollard rho */
if (Q_OVERFLOW) overflow_count++; /* should be rare */
return(pollardrho2(n,fac1,fac2)); /* also try squfof(3N)? */
}
}
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext; /* check for square; even iter */
if (jter & 1) continue; /* jter is odd:omit square test */
// if (t = (q & 3) > 1) continue; /* skip t = 2,3 mod 4 */
// if (q & 7 == 5) continue; /* skip q = 5 mod 8 */
// if (t == 0 && (q & 15) > 5) continue; /* skip 8, 12 mod 16 */
// if (s3465[q % 3465] == 'n') continue; /* skip QNR's mod 3465 */
r = (int)sqrt((double)q); /* r = floor(sqrt(q)) */
if (q != r*r) continue;
if (qpoint == 0) goto gotit;
for (i=0; i<qpoint-1; i+=2) /* treat queue as list for simplicity*/
{
if (r == qqueue[i]) goto contin;
if (r == qqueue[i+1]) goto contin;
}
if (r == qqueue[qpoint-1]) continue;
goto gotit;
contin:;
} /* end of main loop */

void sqfprep()
{
int i;
for (i=0; i<3465; i++) s3465[i] = 'n';
for (i=0; i<1733; i++) s3465[(i*i) % 3465] = 'r';
}

void enqu(q,iter)
int q;
int *iter;
{
qqueue[qpoint] = q;
if (++qpoint > 50) *iter = -1;
}
[/code]

alpertron 2010-04-11 13:40

[QUOTE=jasonp;211373]What corrections did you have to make? It's been a long time, but the SQUFOF code in msieve was taken directly from Gower's description, and I didn't understand enough of the underlying math to make corrections.

Also, it was quite helpful in C to handle a special case of q=1 before doing the divisions at the beginning of the even and odd cycles. It probably wouldn't help in Java though.[/QUOTE]

In Step 3 the paper says: [tex]Q \leftarrow (N-P*P)/Q'[/tex] , but it must be [tex]Q \leftarrow (D-P*P)/Q'[/tex]

In Step 4a the paper says Set [tex]q \leftarrow \lfloor (S+P)/Q'\rfloor[/tex] but it must be [tex]q \leftarrow \lfloor (S+P)/Q\rfloor[/tex]

The last one should be easy to catch because even the example provided does not work using the step 4a as written on the paper.

alpertron 2010-04-11 14:37

Acording to the paper, the implementation of the queue as a list as written by Bob above is slower than the queue because some proper square forms could be skipped.

bsquared 2010-04-11 16:49

comparisons
 
I went ahead and did some comparisons. Bob, I had to port your code so that it works outside your lattice siever, in my test routine. If anything it should be faster, since it uses native 64 bit integers rather than your multiple precision routines, when necessary. I also got rid of the calls to pollard rho, and your mini-qs, so that the comparison is between squfof routines only. Here is your modified code:

[CODE]int qqueue[100];
int qpoint;

void enqu(int q,int *iter)
{
qqueue[qpoint] = q;
if (++qpoint > 100) *iter = -1;
}


int squfof_rds(int64 n,int *fac1,int *fac2)
{ /* start of squfof: factor n as fac1*fac2 faster in FP?????*/
int64 temp, temp1;
register int iq,ll,l2,p,pnext,q,qlast,r,s,t,i;
int jter,iter;

qlast = 1;
s = (int)sqrt(n);

p = s;
temp1 = s * s;
temp = n - temp1; /* temp = n - floor(sqrt(n))^2 */
if (temp == 0)
{ /* Here n is a square */
*fac1 = s;
*fac2 = s;
return(1);
}

q = (int)temp; /* q = excess of n over next smaller square */
ll = 1 + 2*(int)sqrt((double)(p+p));
l2 = ll/2;
qpoint = 0;

/* In the loop below, we need to check if q is a square right before */
/* the end of the loop. Is there a faster way? The current way is */
/* EXPENSIVE! (many branches and double prec sqrt) */

for (jter=0; jter < 800000; jter++) /* I see no way to speed this */
{ /* main loop */
iq = (s + p)/q;
pnext = iq*q - p;
if (q <= ll)
{
if ((q & 1) == 0) enqu(q/2,&jter);
else if (q <= l2) enqu(q,&jter);
if (jter < 0)
{
return 0;
}
}
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext; /* check for square; even iter */
if (jter & 1) continue; /* jter is odd:omit square test */
r = (int)sqrt((double)q); /* r = floor(sqrt(q)) */
if (q != r*r) continue;
if (qpoint == 0) goto gotit;
for (i=0; i<qpoint-1; i+=2) /* treat queue as list for simplicity*/
{
if (r == qqueue[i]) goto contin;
if (r == qqueue[i+1]) goto contin;
}
if (r == qqueue[qpoint-1]) continue;
goto gotit;
contin:;
} /* end of main loop */

gotit: ;
qlast = r;
p = p + r*((s - p)/r);
temp = (int64)p * (int64)p;
temp = n - temp;
temp1 = temp / qlast;
q = (int)temp1; /* q = (n - p*p)/qlast (div is exact)*/
for (iter=0; iter<40000; iter++)
{ /* begin second main loop */
iq = (s + p)/q; /* unroll it, of course */
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
}


return(0); /* this shouldn't happen */

gotfac: ; if ((q & 1) == 0) q/=2; /* q was factor or 2*factor */
*fac1 = q;
temp = n / q;
*fac2 = (int)temp;
return(1);
}[/CODE]

and here are the comparisons between yafu, msieve, Dario, your routines. I run each routine in a loop over 100,000 randomly generated pseudoprimes of varying bit sizes and record the average time and the number of successful factorizations.

[CODE]
avg time (ms) / % correct
avg. input size yafu msieve dario RDS
41 bits 0.03/100 0.05/100 0.04/100 0.04/100
51 bits 0.16/100 0.16/100 0.22/95.8 0.22/99.98
55 bits 0.30/100 0.31/99.99 0.43/95.85 0.44/99.99
59 bits 0.64/97.9 0.68/98.7 -- 0.87/99.82

[/CODE]

jasonp 2010-04-11 17:12

Actually the pseudocode in Gower's thesis did not look familiar, I probably used McMath's papers as pointed to in the SQUFOF wiki page.

alpertron 2010-04-12 02:14

By unrolling both loops so there are two copies for each, the code is now 15% faster, according to my measurements.

R.D. Silverman 2010-04-12 15:22

[QUOTE=bsquared;211396]I went ahead and did some comparisons. Bob, I had to port your code so that it works outside your lattice siever, in my test routine. If anything it should be faster, since it uses native 64 bit integers rather than your multiple precision routines, when necessary. I also got rid of the calls to pollard rho, and your mini-qs, so that the comparison is between squfof routines only. Here is your modified code:

[CODE]int qqueue[100];
int qpoint;

void enqu(int q,int *iter)
{
qqueue[qpoint] = q;
if (++qpoint > 100) *iter = -1;
}


int squfof_rds(int64 n,int *fac1,int *fac2)
{ /* start of squfof: factor n as fac1*fac2 faster in FP?????*/
int64 temp, temp1;
register int iq,ll,l2,p,pnext,q,qlast,r,s,t,i;
int jter,iter;

qlast = 1;
s = (int)sqrt(n);

p = s;
temp1 = s * s;
temp = n - temp1; /* temp = n - floor(sqrt(n))^2 */
if (temp == 0)
{ /* Here n is a square */
*fac1 = s;
*fac2 = s;
return(1);
}

q = (int)temp; /* q = excess of n over next smaller square */
ll = 1 + 2*(int)sqrt((double)(p+p));
l2 = ll/2;
qpoint = 0;

/* In the loop below, we need to check if q is a square right before */
/* the end of the loop. Is there a faster way? The current way is */
/* EXPENSIVE! (many branches and double prec sqrt) */

for (jter=0; jter < 800000; jter++) /* I see no way to speed this */
{ /* main loop */
iq = (s + p)/q;
pnext = iq*q - p;
if (q <= ll)
{
if ((q & 1) == 0) enqu(q/2,&jter);
else if (q <= l2) enqu(q,&jter);
if (jter < 0)
{
return 0;
}
}
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext; /* check for square; even iter */
if (jter & 1) continue; /* jter is odd:omit square test */
r = (int)sqrt((double)q); /* r = floor(sqrt(q)) */
if (q != r*r) continue;
if (qpoint == 0) goto gotit;
for (i=0; i<qpoint-1; i+=2) /* treat queue as list for simplicity*/
{
if (r == qqueue[i]) goto contin;
if (r == qqueue[i+1]) goto contin;
}
if (r == qqueue[qpoint-1]) continue;
goto gotit;
contin:;
} /* end of main loop */

gotit: ;
qlast = r;
p = p + r*((s - p)/r);
temp = (int64)p * (int64)p;
temp = n - temp;
temp1 = temp / qlast;
q = (int)temp1; /* q = (n - p*p)/qlast (div is exact)*/
for (iter=0; iter<40000; iter++)
{ /* begin second main loop */
iq = (s + p)/q; /* unroll it, of course */
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
iq = (s + p)/q;
pnext = iq*q - p;
if (p == pnext) goto gotfac;
t = qlast + iq*(p - pnext);
qlast = q;
q = t;
p = pnext;
}


return(0); /* this shouldn't happen */

gotfac: ; if ((q & 1) == 0) q/=2; /* q was factor or 2*factor */
*fac1 = q;
temp = n / q;
*fac2 = (int)temp;
return(1);
}[/CODE]

and here are the comparisons between yafu, msieve, Dario, your routines. I run each routine in a loop over 100,000 randomly generated pseudoprimes of varying bit sizes and record the average time and the number of successful factorizations.

[CODE]
avg time (ms) / % correct
avg. input size yafu msieve dario RDS
41 bits 0.03/100 0.05/100 0.04/100 0.04/100
51 bits 0.16/100 0.16/100 0.22/95.8 0.22/99.98
55 bits 0.30/100 0.31/99.99 0.43/95.85 0.44/99.99
59 bits 0.64/97.9 0.68/98.7 -- 0.87/99.82

[/CODE][/QUOTE]

I never bothered to do more than trivial optimizations to the squfof code.

(1) NFS spends so little time inside this routine, that optimizing it will not
have much effect.

(2) MPQS is a [b]lot[/b] faster for numbers of this size.


All times are UTC. The time now is 23:22.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.