mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Yet another Block Wiedemann thread (https://www.mersenneforum.org/showthread.php?t=10528)

pstach 2008-08-08 21:09

Yet another Block Wiedemann thread
 
I'm trying to implement block wiedemann to see if I can reduce the per system memory requirements for the linear algebra step of nfs. I'm currently using block lanczos code with the matrix divided into sub-matrices and the resultant of each mat-vector or transmat-vector per submatrix is summed, although as my matrix size increases obviously the communication bandwidth increases. I've read Coppersmith's paper, Villard's, and the FFT minpoly paper and am left with the following questions:

Am I to assume this algorithm requires matrix B to be symmetric such that i have to do tmp = B*x, then y = trans(B) * tmp? I can get results with symmetric matrices, but not with assymetric. Although my code is based on the wlss2 code, which contains many bugs.

Does anyone have a reference implementation of the FFT minpoly?

PS - Its been a decade since I took a linear algeba course

fivemack 2008-08-09 22:10

I'm interested that you get results at all - I'm stuck at getting a kernel vector from the minpoly, on small test cases (small enough to do the linear algebra densely) with non-symmetric square matrices. The minpoly I get is correct - at least it divides MinimalPolynomial(M) as given by magma - but I don't see how you get a kernel from it.

I've not seen a working FFT minpoly, it's one of the things I'm hoping to see at Nancy in October; isn't the FFT used basically to multiply the polynomials, so you could use NTL's poly arithmetic, which is pretty decent (Zimmermann did better but I don't know if his changes have been merged), initially?

pstach 2008-08-11 01:44

Asymmetric to symmetric trick
 
So I don't know if this falls under the realm of obvious, but I've not seen it presented in any implementations or papers. What I've been doing to get symmetric matrices is to pad the row count with extra QCB entries (beyond the 128 I'm using). This removes the necessity of computing y = B' * B * x and you are left with y = A * x (A = sym, B = asym)
I guess the additional QCB entries only pose a problem if you heavily oversieved (ie: ncols - nrows > 128 or such)

-Patrick

pstach 2008-08-13 01:02

Sorry for the confusion, I meant non-square to square not asym to sym.

jasonp 2008-08-13 15:26

[QUOTE=pstach;139235]Sorry for the confusion, I meant non-square to square not asym to sym.[/QUOTE]
If anyone can clarify based on the discussions we've had via email: does block Lanczos require structural symmetry of the matrix, or can it get by with a structureally unsymmetric matrix that happens to be both square and singular? Montgomery's original block Lanczos paper requires a structurally symmetric matrix to generate the orthogonality relations that makes the Lanczos iteration work, but can that requirement get relaxed?

Modifying my code to find out would be fairly straightforward, but could take a while.

Chris Card 2008-08-13 16:53

[QUOTE=jasonp;139275]If anyone can clarify based on the discussions we've had via email: does block Lanczos require structural symmetry of the matrix, or can it get by with a structureally unsymmetric matrix that happens to be both square and singular? Montgomery's original block Lanczos paper requires a structurally symmetric matrix to generate the orthogonality relations that makes the Lanczos iteration work, but can that requirement get relaxed?

Modifying my code to find out would be fairly straightforward, but could take a while.[/QUOTE]

I thought block Lanczos required a symmetric matrix, but it's a while since I read Montgomery's paper.
But you never have to calculate (or store) the symmetric matrix in the BL implementation, so does it matter?

Chris

jasonp 2008-08-13 17:27

[QUOTE=Chris Card;139279]I thought block Lanczos required a symmetric matrix, but it's a while since I read Montgomery's paper.
But you never have to calculate (or store) the symmetric matrix in the BL implementation, so does it matter?
[/QUOTE]
As Patrick mentioned, if the matrix only has to be square, then you can add quadratic character rows (most of which will be redundant), so that you end up with a matrix that is square but still singular. Then the core of the Lanczos iteration only needs one matrix multiply instead of now, when you need two matrix multiplies to construct a symmetrized matrix. That could make block Lanczos 50% faster.

Chris Card 2008-08-13 17:54

[QUOTE=jasonp;139282]As Patrick mentioned, if the matrix only has to be square, then you can add quadratic character rows (most of which will be redundant), so that you end up with a matrix that is square but still singular. Then the core of the Lanczos iteration only needs one matrix multiply instead of now, when you need two matrix multiplies to construct a symmetrized matrix. That could make block Lanczos 50% faster.[/QUOTE]

I can find some references to nonsymmetric block Lanczos on google, so maybe it's possible.

Chris

pstach 2008-08-14 05:09

[quote=Chris Card;139285]I can find some references to nonsymmetric block Lanczos on google, so maybe it's possible.

Chris[/quote]

Currently the block lanczos code, say in msieve, or your code (Chris Card), uses a non-square asymmetric matrix in GF(2). Lanczos relies on the matrix being square and symmetric. Any matrix can be multiplied by its transpose to create a symmetric matrix. Thus you compute y = B' * B * x, which is broken into tmp = B * x, y = B' * tmp. Now just because you're multiplying by the transpose doesn't mean you have to store the transpose anywhere, you "compute it on the fly" when performing the y = B' * tmp operation. Thus its 2 operations, one matrix vector product and one transpose matrix vector product. If you're storing the matrix once, obviously the performance with regards to scatter/gather effects on cache and memory bandwidth due to non-coalesced reads/writes.

What I've been dabbling with is a way to work Lanczos with a square matrix (not a symmetric matrix) with only matrix vector products. However I've yet to find a tridiagonal without calculating some form of generator polynomial (ala block wiedemann). A buddy of mine and I have been working on alternate schemes.

Here's some good docs on the background of Lanczos algorithms:
[url]http://elib.zib.de/netlib/lanczos/vol2/[/url]

None of these are related to GF(2), but the concepts can be carried forward when reading Montgomery's paper.

So back to my mischief. I've put the Lanczos work on halt, because of the previously mentioned problem of needing a generator polynomial. Which is exactly what the Block Wiedemann algorithm computes in Step 2. Step 1 and Step 3 of Wiedemann require repeated calculation of matrix vector products and scalar products. Note: no transpose matrix vector products, which means you can optimize the hell out of your storage structures and get higher performance. My implementation, which has minpoly code is based on Kaltofen's paper and probably less than optimal, seems to require similar number of matrix (regardless of non-transpose or transpose) to that of lanczos. I'm trying to better understand the heuristic nature of these algorithms with regards to overall number of operations (as I discussed with jasonp via email).

I've been putting together a possible submission to the CADO workshop on optimizing the expensive parts of block wiedemann and block lanczos and debunking a few pseudo myths about the distributed nature of Wiedemann (splitting its blocking factor). I've targetted most of my numbers around the matrices I've constructed from RSA-100, RSA-110, RSA-130, and now a fake SNFS-1024, using parameters from the Aoki, Franke, Kleinjung, Lenstra, and Osvik report. If the talk is accepted or no, I'll forward some finalized code to whomever wants a copy once I get an implementation of the subquadratic FFT minpoly (used in Stage 2 of Wiedemann).

-Patrick

pstach 2008-08-14 05:34

Oh, forgot to post a couple of questions about the SNFS 1024 if anyone knows as I've emailed Kleinjung and Aoki but have yet to hear back.

The matrix used was 66,718,154 x 66,718,154 with a weight of ~9.5b. However later it mentions quadratic character tests were performed after block wiedemann? Am I to read this as no quadratic character base was present in the matrix? Further if one was present, how many entries?

The fake matrix I've generated takes into account a random yield between 50 and 100 per q for the lattice ranges mentioned, a 1/p distribution for the factor base, and 128 qcb entries.

frmky 2008-08-14 18:40

[QUOTE=pstach;139296]debunking a few pseudo myths about the distributed nature of Wiedemann[/QUOTE]

Currently Bruce Dodson and I have been completing a number of SNFS factorizations with difficulty around 260-270 digits. We have the sieving resources available to increase that difficulty, but have found that the Block Lanczos is the bottleneck. We end up with matrices around 20M x 20M. We each have access to a computer with 32 cores and lots of memory (64 GB in my case). Block Lanczos as implemented in msieve (which is the fastest implementation we know of) slows down when run with more than 8 threads, and running with 8 threads takes about 1/3 the time of running with 1 thread. As a balance between overall throughput and storage space of the relations, I am currently running 4 instances of msieve, each with 8 threads, on 4 different matrices.

Do you think Block Wiedemann would scale well with 32 threads? Even if overall throughput wouldn't be better, the storage space advantages of running only one matrix at a time would make it preferable. We have been hoping that a good implementation would come along, but in light of your comment I'm wondering if it will be the answer to our problem.

Thanks,
Greg


All times are UTC. The time now is 15:40.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.