mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Linear algebra with large vectors (https://www.mersenneforum.org/showthread.php?t=22386)

jasonp 2017-06-14 12:05

Linear algebra with large vectors
 
This was a set of patches that were first written in 2008 and have been folded into the block Lanczos code every few years. The idea is that the Lanczos algorithm does not care how large a group of bits in the solution is, and larger such groups use up more memory but also cache better. So instead of using 64-bit words, you can use 128-bit words and do half the number of Lanczos iterations. You can also use SIMD instructions that match the vector width.

For many years this was never a performance win; then in 2013 the threading code was changed to cache much better, so I ported the large-vector patches to the latest source. It's not polished yet but finally the initial results are encouraging.

If you want to play with this yourself, check out branches/lacuda from sourceforge and add VBITS=X (X=64,128,256) to your standard invocation of 'make'. Do *not* build with CUDA=1; that also turns on CUDA code that is broken. I've verified that multithreaded runs work, and checkpoint/restart also works, although the savefile format is now vector-length-specific. This means savefiles are not backward-compatible, and you will not be able to restart from an existing linear algebra checkpoint. MPI is a work in progress; it's supposed to work now, but I haven't tested it yet.

On my Sandy Bridge laptop and gcc 6.2, using VBITS=128 is about 15% faster with no other special tuning or manual use of SSE2. Hopefully that will apply to everyone else as well. I will update this thread as improvements accumulate.

(Incidentally, this branch was supposed to be for adding CUDA support for running the linear algebra, but that's nowhere close to done and turns out to be a PhD-sized problem.)

bsquared 2017-06-15 16:24

Cool! I'm starting to test this out.

When attempting a build with MPI=1 I get some errors:
[CODE]common/lanczos/lanczos.c: In function ‘block_lanczos’:
common/lanczos/lanczos.c:1582:29: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.row_counts *= VWORDS;
^
common/lanczos/lanczos.c:1583:30: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.row_offsets *= VWORDS;
^
common/lanczos/lanczos.c:1584:32: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.subcol_counts *= VWORDS;
^
common/lanczos/lanczos.c:1585:33: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.subcol_offsets *= VWORDS;
^
common/lanczos/lanczos.c:1588:29: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.col_counts *= VWORDS;
^
common/lanczos/lanczos.c:1589:30: error: invalid operands to binary * (have ‘int32[35]’ and ‘int’)
packed_matrix.col_offsets *= VWORDS;
[/CODE]

bsquared 2017-06-15 17:42

With three different builds:
make ECM=1 NO_ZLIB=1 VBITS={64 | 128 | 256}

I get these results:

VBITS=64
[CODE]matrix is 5603830 x 5604055 (1643.2 MB) with weight 425206718 (75.87/col)
sparse part has weight 374704646 (66.86/col)
using block size 8192 and superblock size 3440640 for processor cache size 35840 kB
commencing Lanczos iteration
memory use: 1588.5 MB
linear algebra at 0.0%, ETA 43h 3m
checkpointing every 140000 dimensions
linear algebra completed 12889 of 5604055 dimensions (0.2%, ETA 41h56m)
[/CODE]

VBITS=128
[CODE]matrix is 5603766 x 5604055 (1570.7 MB) with weight 374704646 (66.86/col)
sparse part has weight 344512627 (61.48/col)
using block size 8192 and superblock size 3440640 for processor cache size 35840 kB
commencing Lanczos iteration
memory use: 1871.9 MB
linear algebra at 0.0%, ETA 50h51m
checkpointing every 120000 dimensions
[/CODE]

VBITS=256
[CODE]matrix is 5603638 x 5604055 (1569.1 MB) with weight 344512627 (61.48/col)
sparse part has weight 321663773 (57.40/col)
using block size 8192 and superblock size 3440640 for processor cache size 35840 kB
commencing Lanczos iteration
memory use: 2510.4 MB
linear algebra at 0.0%, ETA 55h50m
checkpointing every 100000 dimensions
linear algebra completed 23220 of 5604055 dimensions (0.4%, ETA 55h24m)
received signal 2; shutting down
linear algebra completed 23475 of 5604055 dimensions (0.4%, ETA 55h24m)
lanczos halted after 92 iterations (dim = 23475)
[/CODE]


These are all using the same filtering output (just re-running -nc2 on the three different msieves after an -nc1 with the VBITS=64 version), yet the matrix in each case has significantly different weight, which I'm guessing is the main reason for the difference in ETA, not the VBITS parameter. Should I be setting a target density?

jasonp 2017-06-16 00:24

I would have expected the superblock size to go down as the vector size increases. The makefile doesn't know a full rebuild is necessary if you just change VBITS in the makefile; are you sure the code rebuilt?

bsquared 2017-06-16 01:10

[QUOTE=jasonp;461332]I would have expected the superblock size to go down as the vector size increases. The makefile doesn't know a full rebuild is necessary if you just change VBITS in the makefile; are you sure the code rebuilt?[/QUOTE]

It did go down; I changed it to keep it constant :redface:. However, if I remember correctly, I changed it because it was faster that way... I will redo it, but I think it was even worse with the decreased superblock size.

I remembered wrong... it wasn't worse, but neither was it better than the VBITS=64 case. I guess that's why I decided to keep the superblock size constant. And now I'm noticing that the density is changing, probably because the matrix holds more packed rows. I'll try throwing more threads at it and let it go further.

[CODE]saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 5603638 x 5604055 (1569.1 MB) with weight 344512627 (61.48/col)
sparse part has weight 321663773 (57.40/col)
using block size 8192 and superblock size 860160 for processor cache size 35840 kB
commencing Lanczos iteration
memory use: 2510.4 MB
linear algebra at 0.0%, ETA 46h45m
checkpointing every 120000 dimensions
linear algebra completed 9454 of 5604055 dimensions (0.2%, ETA 45h22m)
[/CODE]

bsquared 2017-06-16 02:13

Ok, here's data while letting superblock size scale:

64b
[CODE]saving the first 48 matrix rows for later
matrix includes 64 packed rows
matrix is 5603830 x 5604055 (1643.2 MB) with weight 425206718 (75.87/col)
sparse part has weight 374704646 (66.86/col)
using block size 8192 and superblock size 3440640 for processor cache size 35840 kB
commencing Lanczos iteration (14 threads)
memory use: 1588.8 MB
linear algebra at 0.0%, ETA 5h 5m
checkpointing every 1050000 dimensions
linear algebra completed 103663 of 5604055 dimensions (1.8%, ETA 5h31m)
[/CODE]

128b
[CODE]saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 5603766 x 5604055 (1570.7 MB) with weight 374704646 (66.86/col)
sparse part has weight 344512627 (61.48/col)
using block size 8192 and superblock size 1720320 for processor cache size 35840 kB
commencing Lanczos iteration (14 threads)
memory use: 1872.5 MB
linear algebra at 0.0%, ETA 6h 6m
checkpointing every 920000 dimensions
linear algebra completed 100516 of 5604055 dimensions (1.8%, ETA 5h37m)
[/CODE]

256b
[CODE]saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 5603638 x 5604055 (1569.1 MB) with weight 344512627 (61.48/col)
sparse part has weight 321663773 (57.40/col)
using block size 8192 and superblock size 860160 for processor cache size 35840 kB
commencing Lanczos iteration (14 threads)
memory use: 2511.5 MB
linear algebra at 0.0%, ETA 8h 7m
checkpointing every 740000 dimensions
linear algebra completed 101847 of 5604055 dimensions (1.8%, ETA 7h21m)
[/CODE]

bsquared 2017-06-19 13:00

Over the weekend I let both the 64b and 128b solve and found they took nearly identical amounts of time: 5 hrs 59 min for 64b and 5 hrs 58 min for 128b. 64b needed 88622 iterations and 128b needed 44046 iterations.

Oh, and I used [URL="http://www.mersenneforum.org/showthread.php?t=9787"]this[/URL] as my test number. What took two people nearly a month in 2007 took me about 1.5 days in 2017.

I'm guessing that my old gcc (version 4.8.5) is not as good at auto-vectorizing as the newer ones, and that your 15% speedup comes from the compiler's improved use of SSE2/AVX/AVX2 in the wider generic-C code paths.

frmky 2017-07-06 04:32

I played with the MPI version a bit more today, but not very successfully. The patch below allows it to compile, and the 64-bit version seems to run OK, but the 128-bit version quickly aborts with 'lanczos error: submatrix is not invertible'.

[CODE]login3.stampede2(8)$ diff -w -u msieve/common/lanczos/lanczos.c msieve_lacuda_1015/common/lanczos/lanczos.c
--- msieve/common/lanczos/lanczos.c 2017-07-05 23:15:03.000000000 -0500
+++ msieve_lacuda_1015/common/lanczos/lanczos.c 2017-07-05 23:17:57.000000000 -0500
@@ -1579,14 +1579,14 @@
{
uint32 i;
for (i = 0; i < obj->mpi_nrows; i++) {
- packed_matrix.row_counts *= VWORDS;
- packed_matrix.row_offsets *= VWORDS;
- packed_matrix.subcol_counts *= VWORDS;
- packed_matrix.subcol_offsets *= VWORDS;
+ packed_matrix.row_counts[i] *= VWORDS;
+ packed_matrix.row_offsets[i] *= VWORDS;
+ packed_matrix.subcol_counts[i] *= VWORDS;
+ packed_matrix.subcol_offsets[i] *= VWORDS;
}
for (i = 0; i < obj->mpi_ncols; i++) {
- packed_matrix.col_counts *= VWORDS;
- packed_matrix.col_offsets *= VWORDS;
+ packed_matrix.col_counts[i] *= VWORDS;
+ packed_matrix.col_offsets[i] *= VWORDS;
}
}
#endif
@@ -1597,7 +1597,7 @@
if (post_lanczos_matrix != NULL && obj->mpi_la_row_rank == 0) {
if (obj->mpi_la_col_rank == 0) {
post_lanczos_matrix = xrealloc(post_lanczos_matrix,
- max_ncols * sizeof(uint64));
+ max_ncols * sizeof(v_t));
}

MPI_TRY(MPI_Gatherv((obj->mpi_la_col_rank == 0) ?[/CODE]

frmky 2017-07-11 06:14

I tracked down the remaining problem, so the MPI version works. It was a one-line change beyond the previous patch. Here's the final patch against rev 1015:

[CODE]diff -r -w -u msieve/common/lanczos/lanczos.c msieve_la_test_works/common/lanczos/lanczos.c
--- msieve/common/lanczos/lanczos.c 2017-07-11 00:56:18.000000000 -0500
+++ msieve_la_test_works/common/lanczos/lanczos.c 2017-07-10 18:30:18.000000000 -0500
@@ -1579,14 +1579,14 @@
{
uint32 i;
for (i = 0; i < obj->mpi_nrows; i++) {
- packed_matrix.row_counts *= VWORDS;
- packed_matrix.row_offsets *= VWORDS;
- packed_matrix.subcol_counts *= VWORDS;
- packed_matrix.subcol_offsets *= VWORDS;
+ packed_matrix.row_counts[i] *= VWORDS;
+ packed_matrix.row_offsets[i] *= VWORDS;
+ packed_matrix.subcol_counts[i] *= VWORDS;
+ packed_matrix.subcol_offsets[i] *= VWORDS;
}
for (i = 0; i < obj->mpi_ncols; i++) {
- packed_matrix.col_counts *= VWORDS;
- packed_matrix.col_offsets *= VWORDS;
+ packed_matrix.col_counts[i] *= VWORDS;
+ packed_matrix.col_offsets[i] *= VWORDS;
}
}
#endif
@@ -1597,7 +1597,7 @@
if (post_lanczos_matrix != NULL && obj->mpi_la_row_rank == 0) {
if (obj->mpi_la_col_rank == 0) {
post_lanczos_matrix = xrealloc(post_lanczos_matrix,
- max_ncols * sizeof(uint64));
+ max_ncols * sizeof(v_t));
}

MPI_TRY(MPI_Gatherv((obj->mpi_la_col_rank == 0) ?
diff -r -w -u msieve/common/lanczos/matmul_util.c msieve_la_test_works/common/lanczos/matmul_util.c
--- msieve/common/lanczos/matmul_util.c 2017-07-11 00:56:18.000000000 -0500
+++ msieve_la_test_works/common/lanczos/matmul_util.c 2017-07-10 18:52:00.000000000 -0500
@@ -73,7 +73,7 @@

/* asynchronously send the current chunk */

- MPI_TRY(MPI_Isend(curr_buf + m * chunk, size,
+ MPI_TRY(MPI_Isend(curr_buf + m * chunk, VWORDS * size,
MPI_LONG_LONG, next_id, 97,
comm, &mpi_req))
[/CODE]

I also created a version that uses a derived MPI data type rather than multiplying the sizes to send by VWORDS. It's a more extensive set of changes, and it also seems to work fine. Not sure which is faster yet. I'll run benchmarks hopefully later this week.

[PASTEBIN]egncBYwr[/PASTEBIN]

pinhodecarlos 2018-01-10 21:35

Bump.

Any progress on the benchmarks, Greg?

jasonp 2018-01-11 00:43

I at least folded his patches into the branch with the changes. No time to get any further though.

