mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

 frmky 2021-09-24 06:17

[QUOTE=charybdis;588469]@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).[/QUOTE]
I'll try that, thanks!

 frmky 2021-09-24 06:21

[QUOTE=frmky;588086]filtering yielded
[CODE]matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)[/CODE]
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
[CODE]linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)[/CODE]
[/QUOTE]
And it's done. LA on the 102M matrix with restarts took 5 days 14 hours.
[PASTEBIN]cB1qD1hJ[/PASTEBIN]

 charybdis 2021-09-24 12:36

[QUOTE=frmky;588534]I'll try that, thanks![/QUOTE]

Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?

 pinhodecarlos 2021-09-24 13:40

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

95%.

 frmky 2021-09-24 15:13

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

 charybdis 2021-09-24 15:50

If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA that bit harder as a result.

But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this.
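A back-of-envelope sketch of that trade-off, with the ~30% sec/rel penalty and ~50% yield gain taken from the paragraph above. The 25% extra-relations figure is a made-up placeholder for the unknown increase:

```python
# Sketch of the 3LP/3LP trade-off described above: with lims capped at
# 250M, sieving gets ~30% slower per relation and needs some unknown
# fraction of extra relations, but each special-q yields ~50% more
# relations, which is what stretches the range of doable jobs.

sec_per_rel_penalty = 1.30   # sieving time per relation, relative
yield_gain = 1.50            # relations found per special-q, relative
extra_rels = 0.25            # hypothetical: 25% more relations needed

# Total sieving time always rises with 3LP on both sides:
time_ratio = sec_per_rel_penalty * (1 + extra_rels)

# But the special-q range consumed shrinks as long as extra_rels < 50%:
q_range_ratio = (1 + extra_rels) / yield_gain

print(f"sieving time: {time_ratio:.2f}x, "
      f"special-q range used: {q_range_ratio:.2f}x")
```

So the switch trades wall-clock sieving time for a wider range of usable special-q, which is exactly the lever that matters when the lims are fixed.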

 ryanp 2021-10-22 13:34

In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

 frmky 2021-10-22 23:00

Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.

Volta (and, I hope, Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a cub SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz only sets an estimate of each block's size, especially for the transpose matrix, so I've been a bit conservative in setting it at 1.75B. We also want to keep the number of blocks reasonably small, since each block of both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in GPU memory, in addition to the 4*num_nonzeros-byte column index array.
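As a rough illustration of those limits, here is the block-count and index-memory arithmetic for the 102M matrix quoted earlier in the thread (a sketch using the formulas from the paragraph above; in a real multi-GPU run these arrays are split across the grid):

```python
import math

# Block-count / memory arithmetic for the limits described above, using
# the 102M-column matrix quoted earlier in the thread as an example.
N = 102_063_602              # matrix dimension (columns)
total_nnz = 14_484_270_868   # total nonzeros ("weight")
block_nnz = 1_750_000_000    # conservative default block size

num_blocks = math.ceil(total_nnz / block_nnz)
# Each cub SpMV call indexes its nonzeros with an int32, so the average
# (and ideally every) block must stay under 2^31 nonzeros:
assert total_nnz / num_blocks < 2**31

# Index-array memory, counting both the normal and transpose copies:
# each block carries a 4*(N+1)-byte row offset array, plus 4 bytes per
# nonzero of column indices overall.
row_offset_bytes = 2 * num_blocks * 4 * (N + 1)
column_bytes = 2 * 4 * total_nnz
print(f"{num_blocks} blocks, "
      f"{(row_offset_bytes + column_bytes) / 2**30:.1f} GiB of index arrays")
```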

For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on the A100). With VBITS=128, we use only 16 of those bytes, with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of each fetch and thus makes more efficient use of global memory bandwidth in the SpMV. However, each iteration also performs multiple (VBITS x N)(N x VBITS) dense matrix multiplications, which require strided access to arrays; this strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory. In practice, on the V100 I've gotten about equal performance from VBITS of 384 and 512, and progressively poorer performance with smaller values. Of the two I use 384 since it requires less GPU memory, though lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare VBITS=256 with cudaLimitMaxL2FetchGranularity set to 32 against VBITS=384 or 512 with the default.
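To put numbers on the memory side of that choice, here is the 7*N*VBITS/8 formula quoted above evaluated for the usual VBITS options, with N taken as the ~102M matrix from earlier in the thread (in a multi-GPU grid run the vectors, like the matrix, are split across the GPUs):

```python
# Lanczos vector memory as a function of VBITS, using the 7*N*VBITS/8
# formula from the post above. N is the ~102M matrix quoted earlier.
N = 102_063_602

vector_bytes = {v: 7 * N * v // 8 for v in (128, 256, 384, 512)}
for v, b in vector_bytes.items():
    print(f"VBITS={v}: {b / 2**30:5.1f} GiB of vectors")
```

This shows the concrete cost of preferring 512 over 384: the vectors scale linearly with VBITS, so the step up buys bandwidth efficiency at the price of roughly a third more vector memory.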

So, in short, unless GPU memory is tight use VBITS=384 and the default block_nnz on V100 and likely on A100 as well.

 frmky 2021-10-26 04:05

2,2174M is in LA, so here's one more data point. Running on [B]eight[/B] NVLink-connected V100's,
[CODE]Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021 sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021 commencing linear algebra
Sun Oct 24 23:03:59 2021 using VBITS=384
Sun Oct 24 23:03:59 2021 skipping matrix build
Sun Oct 24 23:03:59 2021 initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021 matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021 matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021 sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021 saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021 matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021 matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021 sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021 using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021 selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021 commencing Lanczos iteration
Sun Oct 24 23:12:47 2021 memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021 linear algebra at 0.0%, ETA 90h17m
[/CODE]
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.

 pinhodecarlos 2021-10-26 06:21

And I suppose you will be comparing this with the other sieve run using higher LPs; probably there are still some leftovers.

 frmky 2021-10-26 07:48

[QUOTE=pinhodecarlos;591648]And I suppose you will be comparing this with the other sieve run using higher LPs; probably there are still some leftovers.[/QUOTE]
We didn't sieve it twice. Only a little at the beginning was sieved with 33-bit LPs, and all the relations were combined. There are a few stragglers that I'm not worrying about.

All times are UTC.