mersenneforum.org Msieve GPU Linear Algebra

2021-09-24, 06:17   #56
frmky

Jul 2003
So Cal

100010110010₂ Posts

Quote:
 Originally Posted by charybdis @frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).
I'll try that, thanks!

2021-09-24, 06:21   #57
frmky

Jul 2003
So Cal

2×3×7×53 Posts

Quote:
 Originally Posted by frmky filtering yielded
 Code:
 matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)
 Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
 Code:
 linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)
And it's done. LA on the 102M matrix with restarts took 5 days 14 hours.

2021-09-24, 12:36   #58
charybdis

Apr 2020

547 Posts

Quote:
 Originally Posted by frmky I'll try that, thanks!
Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?

2021-09-24, 13:40   #59
pinhodecarlos

"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3·1,663 Posts

Quote:
 Originally Posted by charybdis Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?
95%.

2021-09-24, 15:13   #60
frmky

Jul 2003
So Cal

2×3×7×53 Posts

Quote:
 Originally Posted by charybdis Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?
A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

2021-09-24, 15:50   #61
charybdis

Apr 2020

547 Posts

If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA a bit harder as a result. But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this.

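To put rough numbers on that trade-off, here is a minimal sketch. It assumes a simple model that is not stated in the post: total sieving time scales as (relations needed) × (sec/rel), and the special-q range that has to be sieved scales as (relations needed) / (yield). The function name and parameters are hypothetical, and the fraction of extra relations needed is left as an input because the post gives it only as "almost certainly below 50%".

```python
def both_sides_3lp_tradeoff(extra_rels_fraction,
                            sec_per_rel_penalty=0.30,
                            yield_gain=0.50):
    """Return (time_factor, q_range_factor) relative to 3LP on one side only.

    Assumed model (not from the post):
      time    ~ relations_needed * sec_per_rel
      q range ~ relations_needed / yield
    The 30% and 50% defaults are the approximate figures quoted in the post.
    """
    rels_factor = 1.0 + extra_rels_fraction
    time_factor = rels_factor * (1.0 + sec_per_rel_penalty)
    q_range_factor = rels_factor / (1.0 + yield_gain)
    return time_factor, q_range_factor


if __name__ == "__main__":
    # The true extra-relations fraction is unknown; scan a few guesses.
    for extra in (0.0, 0.1, 0.2, 0.3, 0.4):
        t, q = both_sides_3lp_tradeoff(extra)
        print(f"extra rels {extra:.0%}: time x{t:.2f}, q-range x{q:.2f}")
```

Under this model the q-range factor stays below 1 whenever the extra-relations fraction is below the yield gain, which is one way to read "stretch the upper limit of doable jobs", while the time factor is always above 1.
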
2021-10-22, 13:34   #62
ryanp

Jun 2012
Boulder, CO

2⁴×3×7 Posts

In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

2021-10-22, 23:00   #63
frmky

Jul 2003
So Cal

8B2₁₆ Posts

Technically it's an M×N matrix with M slightly less than N, but for this question we can approximate it as N×N.

Volta (and, I'm hoping, Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a cub SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz sets an estimate, especially for the transpose matrix, so I've been a bit conservative in setting it at 1.75B. We want to keep the number of blocks reasonably small, since each block for both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.

For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on A100). With VBITS=128, we are only using 16 bytes of that data, with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus uses global memory bandwidth more efficiently in the SpMV. However, each iteration also has multiple (VBITS×N)·(N×VBITS) dense matrix multiplications, which require strided access to arrays. This strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory.

In practice on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with decreasing values. Of the two I use 384 since it requires less GPU memory. However, lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare using VBITS=256 with cudaLimitMaxL2FetchGranularity of 32 to VBITS=384 or 512 with the default.

So, in short, unless GPU memory is tight, use VBITS=384 and the default block_nnz on V100, and likely on A100 as well.

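The memory terms above can be combined into a rough estimate. This is only a sketch, not msieve's actual accounting: the function name is hypothetical, it assumes the number of blocks is simply ceil(nnz / block_nnz) for both the normal and the transpose matrix, it treats the matrix as N×N, and it ignores everything the post does not mention (dense rows, scratch buffers, CUDA overhead).

```python
import math

INT32_LIMIT = 2**31  # each matrix block must hold fewer than 2^31 nonzeros


def estimate_gpu_bytes(n, nnz, vbits=384, block_nnz=1_750_000_000):
    """Rough GPU memory estimate from the three terms given in the post.

    n         -- matrix dimension (N x N approximation)
    nnz       -- number of nonzeros stored on the GPU
    vbits     -- VBITS setting
    block_nnz -- target nonzeros per block (default 1.75 billion)
    """
    if not 0 < block_nnz < INT32_LIMIT:
        raise ValueError("block_nnz must be positive and below 2^31")
    # Assumption: blocks are filled up to block_nnz, so this many are needed.
    num_blocks = max(1, math.ceil(nnz / block_nnz))
    # One 4*(N+1)-byte row offset array per block, normal and transpose.
    row_offsets = 2 * num_blocks * 4 * (n + 1)
    # One 4-byte column index per nonzero, normal and transpose.
    columns = 2 * 4 * nnz
    # Vectors: 7*N*VBITS/8 bytes.
    vectors = 7 * n * vbits // 8
    return {
        "num_blocks": num_blocks,
        "row_offsets": row_offsets,
        "columns": columns,
        "vectors": vectors,
        "total": row_offsets + columns + vectors,
    }


if __name__ == "__main__":
    # Hypothetical matrix size, just to compare VBITS settings.
    for vbits in (128, 256, 384, 512):
        est = estimate_gpu_bytes(n=50_000_000, nnz=3_000_000_000, vbits=vbits)
        print(vbits, est["num_blocks"], round(est["total"] / 2**30, 1), "GiB")
```

Comparing the total against the card's memory for a few VBITS values gives a first guess at whether VBITS=384 fits or a lower value is needed. In a multi-GPU grid run, n and nnz would presumably be the per-GPU submatrix figures, which is also an assumption here.
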
2021-10-26, 04:05   #64
frmky

Jul 2003
So Cal

8B2₁₆ Posts

2,2174M is in LA, so here's one more data point. Running on eight NVLink-connected V100s,
Code:
Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021 sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021 commencing linear algebra
Sun Oct 24 23:03:59 2021 using VBITS=384
Sun Oct 24 23:03:59 2021 skipping matrix build
Sun Oct 24 23:03:59 2021 initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021 matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021 matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021 sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021 saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021 matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021 matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021 sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021 using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021 selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021 commencing Lanczos iteration
Sun Oct 24 23:12:47 2021 memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021 linear algebra at 0.0%, ETA 90h17m
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.

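For anyone collecting data points like this from logs, here is a small sketch that pulls the dimensions and weight out of a "matrix is ..." line and recomputes the per-column density. The regular expression is written against the line format shown in this thread only; the function name is hypothetical and other msieve versions may print these lines differently.

```python
import re

# Pattern based solely on the log lines quoted in this thread.
MATRIX_LINE = re.compile(
    r"matrix is (\d+) x (\d+) \(([\d.]+) MB\) with weight (\d+) \(([\d.]+)/col\)"
)


def parse_matrix_line(line):
    """Return a dict with rows, cols, size_mb, weight and densities, or None."""
    m = MATRIX_LINE.search(line)
    if m is None:
        return None
    rows, cols = int(m.group(1)), int(m.group(2))
    weight = int(m.group(4))
    return {
        "rows": rows,
        "cols": cols,
        "size_mb": float(m.group(3)),
        "weight": weight,
        "reported_per_col": float(m.group(5)),
        # Recomputed density, to cross-check the figure printed in the log.
        "computed_per_col": weight / cols,
    }


if __name__ == "__main__":
    line = ("Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 "
            "(56998.7 MB) with weight 16127184931 (151.05/col)")
    info = parse_matrix_line(line)
    print(info["rows"], info["cols"], round(info["computed_per_col"], 2))
```
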
2021-10-26, 06:21   #65
pinhodecarlos

"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3·1,663 Posts

And I suppose you will be comparing the other sieve run with higher LPs, probably still some leftovers.

2021-10-26, 07:48   #66
frmky

Jul 2003
So Cal

2·3·7·53 Posts

Quote:
 Originally Posted by pinhodecarlos And I suppose you will be comparing the other sieve run with higher LPs, probably still some leftovers.
We didn't sieve it twice. Only a little at the beginning was sieved with 33 bit LPs and all the relations were combined. There are a few stragglers that I'm not worrying about.


All times are UTC.
