mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2008-08-08, 21:09   #1
pstach
 
Aug 2008

19 Posts
Default Yet another Block Wiedemann thread

I'm trying to implement block wiedemann to see if I can reduce the per system memory requirements for the linear algebra step of nfs. I'm currently using block lanczos code with the matrix divided into sub-matrices and the resultant of each mat-vector or transmat-vector per submatrix is summed, although as my matrix size increases obviously the communication bandwidth increases. I've read Coppersmith's paper, Villard's, and the FFT minpoly paper and am left with the following questions:

Am I to assume this algorithm requires matrix B to be symmetric such that i have to do tmp = B*x, then y = trans(B) * tmp? I can get results with symmetric matrices, but not with assymetric. Although my code is based on the wlss2 code, which contains many bugs.

Does anyone have a reference implementation of the FFT minpoly?

PS - Its been a decade since I took a linear algeba course
pstach is offline   Reply With Quote
Old 2008-08-09, 22:10   #2
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

22×32×179 Posts
Default

I'm interested that you get results at all - I'm stuck at getting a kernel vector from the minpoly, on small test cases (small enough to do the linear algebra densely) with non-symmetric square matrices. The minpoly I get is correct - at least it divides MinimalPolynomial(M) as given by magma - but I don't see how you get a kernel from it.

I've not seen a working FFT minpoly, it's one of the things I'm hoping to see at Nancy in October; isn't the FFT used basically to multiply the polynomials, so you could use NTL's poly arithmetic, which is pretty decent (Zimmermann did better but I don't know if his changes have been merged), initially?
fivemack is offline   Reply With Quote
Old 2008-08-11, 01:44   #3
pstach
 
Aug 2008

19 Posts
Default Asymmetric to symmetric trick

So I don't know if this falls under the realm of obvious, but I've not seen it presented in any implementations or papers. What I've been doing to get symmetric matrices is to pad the row count with extra QCB entries (beyond the 128 I'm using). This removes the necessity of computing y = B' * B * x and you are left with y = A * x (A = sym, B = asym)
I guess the additional QCB entries only pose a problem if you heavily oversieved (ie: ncols - nrows > 128 or such)

-Patrick
pstach is offline   Reply With Quote
Old 2008-08-13, 01:02   #4
pstach
 
Aug 2008

19 Posts
Default

Sorry for the confusion, I meant non-square to square not asym to sym.
pstach is offline   Reply With Quote
Old 2008-08-13, 15:26   #5
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3·1,181 Posts
Default

Quote:
Originally Posted by pstach View Post
Sorry for the confusion, I meant non-square to square not asym to sym.
If anyone can clarify based on the discussions we've had via email: does block Lanczos require structural symmetry of the matrix, or can it get by with a structureally unsymmetric matrix that happens to be both square and singular? Montgomery's original block Lanczos paper requires a structurally symmetric matrix to generate the orthogonality relations that makes the Lanczos iteration work, but can that requirement get relaxed?

Modifying my code to find out would be fairly straightforward, but could take a while.
jasonp is offline   Reply With Quote
Old 2008-08-13, 16:53   #6
Chris Card
 
Chris Card's Avatar
 
Aug 2004

2×5×13 Posts
Default

Quote:
Originally Posted by jasonp View Post
If anyone can clarify based on the discussions we've had via email: does block Lanczos require structural symmetry of the matrix, or can it get by with a structureally unsymmetric matrix that happens to be both square and singular? Montgomery's original block Lanczos paper requires a structurally symmetric matrix to generate the orthogonality relations that makes the Lanczos iteration work, but can that requirement get relaxed?

Modifying my code to find out would be fairly straightforward, but could take a while.
I thought block Lanczos required a symmetric matrix, but it's a while since I read Montgomery's paper.
But you never have to calculate (or store) the symmetric matrix in the BL implementation, so does it matter?

Chris
Chris Card is offline   Reply With Quote
Old 2008-08-13, 17:27   #7
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3×1,181 Posts
Default

Quote:
Originally Posted by Chris Card View Post
I thought block Lanczos required a symmetric matrix, but it's a while since I read Montgomery's paper.
But you never have to calculate (or store) the symmetric matrix in the BL implementation, so does it matter?
As Patrick mentioned, if the matrix only has to be square, then you can add quadratic character rows (most of which will be redundant), so that you end up with a matrix that is square but still singular. Then the core of the Lanczos iteration only needs one matrix multiply instead of now, when you need two matrix multiplies to construct a symmetrized matrix. That could make block Lanczos 50% faster.
jasonp is offline   Reply With Quote
Old 2008-08-13, 17:54   #8
Chris Card
 
Chris Card's Avatar
 
Aug 2004

2×5×13 Posts
Default

Quote:
Originally Posted by jasonp View Post
As Patrick mentioned, if the matrix only has to be square, then you can add quadratic character rows (most of which will be redundant), so that you end up with a matrix that is square but still singular. Then the core of the Lanczos iteration only needs one matrix multiply instead of now, when you need two matrix multiplies to construct a symmetrized matrix. That could make block Lanczos 50% faster.
I can find some references to nonsymmetric block Lanczos on google, so maybe it's possible.

Chris
Chris Card is offline   Reply With Quote
Old 2008-08-14, 05:09   #9
pstach
 
Aug 2008

19 Posts
Default

Quote:
Originally Posted by Chris Card View Post
I can find some references to nonsymmetric block Lanczos on google, so maybe it's possible.

Chris
Currently the block lanczos code, say in msieve, or your code (Chris Card), uses a non-square asymmetric matrix in GF(2). Lanczos relies on the matrix being square and symmetric. Any matrix can be multiplied by its transpose to create a symmetric matrix. Thus you compute y = B' * B * x, which is broken into tmp = B * x, y = B' * tmp. Now just because you're multiplying by the transpose doesn't mean you have to store the transpose anywhere, you "compute it on the fly" when performing the y = B' * tmp operation. Thus its 2 operations, one matrix vector product and one transpose matrix vector product. If you're storing the matrix once, obviously the performance with regards to scatter/gather effects on cache and memory bandwidth due to non-coalesced reads/writes.

What I've been dabbling with is a way to work Lanczos with a square matrix (not a symmetric matrix) with only matrix vector products. However I've yet to find a tridiagonal without calculating some form of generator polynomial (ala block wiedemann). A buddy of mine and I have been working on alternate schemes.

Here's some good docs on the background of Lanczos algorithms:
http://elib.zib.de/netlib/lanczos/vol2/

None of these are related to GF(2), but the concepts can be carried forward when reading Montgomery's paper.

So back to my mischief. I've put the Lanczos work on halt, because of the previously mentioned problem of needing a generator polynomial. Which is exactly what the Block Wiedemann algorithm computes in Step 2. Step 1 and Step 3 of Wiedemann require repeated calculation of matrix vector products and scalar products. Note: no transpose matrix vector products, which means you can optimize the hell out of your storage structures and get higher performance. My implementation, which has minpoly code is based on Kaltofen's paper and probably less than optimal, seems to require similar number of matrix (regardless of non-transpose or transpose) to that of lanczos. I'm trying to better understand the heuristic nature of these algorithms with regards to overall number of operations (as I discussed with jasonp via email).

I've been putting together a possible submission to the CADO workshop on optimizing the expensive parts of block wiedemann and block lanczos and debunking a few pseudo myths about the distributed nature of Wiedemann (splitting its blocking factor). I've targetted most of my numbers around the matrices I've constructed from RSA-100, RSA-110, RSA-130, and now a fake SNFS-1024, using parameters from the Aoki, Franke, Kleinjung, Lenstra, and Osvik report. If the talk is accepted or no, I'll forward some finalized code to whomever wants a copy once I get an implementation of the subquadratic FFT minpoly (used in Stage 2 of Wiedemann).

-Patrick
pstach is offline   Reply With Quote
Old 2008-08-14, 05:34   #10
pstach
 
Aug 2008

19 Posts
Default

Oh, forgot to post a couple of questions about the SNFS 1024 if anyone knows as I've emailed Kleinjung and Aoki but have yet to hear back.

The matrix used was 66,718,154 x 66,718,154 with a weight of ~9.5b. However later it mentions quadratic character tests were performed after block wiedemann? Am I to read this as no quadratic character base was present in the matrix? Further if one was present, how many entries?

The fake matrix I've generated takes into account a random yield between 50 and 100 per q for the lattice ranges mentioned, a 1/p distribution for the factor base, and 128 qcb entries.
pstach is offline   Reply With Quote
Old 2008-08-14, 18:40   #11
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

8AF16 Posts
Default

Quote:
Originally Posted by pstach View Post
debunking a few pseudo myths about the distributed nature of Wiedemann
Currently Bruce Dodson and I have been completing a number of SNFS factorizations with difficulty around 260-270 digits. We have the sieving resources available to increase that difficulty, but have found that the Block Lanczos is the bottleneck. We end up with matrices around 20M x 20M. We each have access to a computer with 32 cores and lots of memory (64 GB in my case). Block Lanczos as implemented in msieve (which is the fastest implementation we know of) slows down when run with more than 8 threads, and running with 8 threads takes about 1/3 the time of running with 1 thread. As a balance between overall throughput and storage space of the relations, I am currently running 4 instances of msieve, each with 8 threads, on 4 different matrices.

Do you think Block Wiedemann would scale well with 32 threads? Even if overall throughput wouldn't be better, the storage space advantages of running only one matrix at a time would make it preferable. We have been hoping that a good implementation would come along, but in light of your comment I'm wondering if it will be the answer to our problem.

Thanks,
Greg
frmky is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Block Wiedemann for beginners paul0 Factoring 7 2015-11-16 17:09
block wiedemann and block lanczos ravlyuchenko Msieve 5 2011-05-09 13:16
Very basic question about Wiedemann methods fivemack Math 0 2008-06-16 10:57
P-1: Block of 73: 19.6-19.65M dave_0273 Completed Missions 2 2005-05-16 17:55
Deutscher Thread (german thread) TauCeti NFSNET Discussion 0 2003-12-11 22:12

All times are UTC. The time now is 21:32.


Tue Nov 30 21:32:54 UTC 2021 up 130 days, 16:01, 0 users, load averages: 1.37, 1.46, 1.50

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.