mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

 VBCurtis 2021-08-02 19:23

Msieve GPU Linear Algebra

[QUOTE][C]LA = 3331s[/C]

Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again.[/QUOTE]

Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?

 Xyzzy 2021-08-02 19:28

We were told:[QUOTE]You can probably run up to a 25M-30M matrix, perhaps a bit larger, on that card.[/QUOTE]

 frmky 2021-08-02 21:48

[QUOTE=VBCurtis;584640]Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?[/QUOTE]
Unlike most of us, Mike has a high-end workstation GPU with 48GB memory. He can fit all but the largest "f small" matrices on his GPU.

 xilman 2021-08-03 11:49

[QUOTE=frmky;584454]After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.[/QUOTE]I'm sorry but I am getting seriously out of date with CUDA since the updated compilers stopped working on my Ubuntu and Gentoo systems.

The three systems still in use have a 460, a 970, and a 1060 with drivers 390.138, 390.144 and 390.141 respectively.

Do you think your new code might run on any of those? If so, I will try again to get CUDA installed and working.

Thanks.

 frmky 2021-08-03 14:15

Technically yes, but consumer cards don't have enough memory to store interesting matrices. If the GTX 1060 has 6GB, it could run matrices up to about 5Mx5M. The problem is that block Lanczos requires multiplying by both the matrix and its transpose, but GPUs only seem to work well with the matrix in CSR, which doesn't allow the transpose product to be calculated efficiently. So we load both the matrix and its transpose onto the card.
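To make the CSR point concrete, here is a minimal Python sketch (not the msieve CUDA code; all function names are made up for illustration). It works over GF(2), with each vector entry packed into an integer and XOR as addition. With CSR, each output entry of A*x is an independent gather-and-XOR over one row, which parallelizes naturally. Computing A^T*x from the same arrays instead scatters XOR updates all over the output, which on a GPU would need atomic updates. Storing the transpose as a second CSR matrix turns that back into a gather, at the cost of holding two copies.

```python
from typing import List, Tuple


def to_csr(rows: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (row_ptr, col_idx) from per-row lists of nonzero columns."""
    row_ptr, col_idx = [0], []
    for cols in rows:
        col_idx.extend(cols)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def csr_matvec(row_ptr, col_idx, x):
    """y = A*x over GF(2): every row is an independent gather + XOR reduction."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc ^= x[col_idx[k]]
        y.append(acc)
    return y


def csr_transpose_matvec(row_ptr, col_idx, x, ncols):
    """y = A^T*x from the same CSR arrays: scattered XOR updates into y."""
    y = [0] * ncols
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[col_idx[k]] ^= x[r]  # many rows may hit the same y entry
    return y


def transpose_rows(rows, ncols):
    """Explicit transpose as per-row column lists (the 'second copy' on the card)."""
    t = [[] for _ in range(ncols)]
    for r, cols in enumerate(rows):
        for c in cols:
            t[c].append(r)
    return t


rows = [[0, 2], [1], [0, 1, 3]]  # toy 3 x 4 sparse matrix
ncols = 4
row_ptr, col_idx = to_csr(rows)
x_cols = [0b0001, 0b0010, 0b0100, 0b1000]  # one packed word per column
x_rows = [0b01, 0b10, 0b11]                # one packed word per row

y = csr_matvec(row_ptr, col_idx, x_cols)
yt_scatter = csr_transpose_matvec(row_ptr, col_idx, x_rows, ncols)
t_ptr, t_idx = to_csr(transpose_rows(rows, ncols))
yt_gather = csr_matvec(t_ptr, t_idx, x_rows)
print(y, yt_scatter, yt_gather)
```

Both transpose routes give the same result; the difference is only in the memory access pattern, which is what matters on the GPU.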

It would be possible to create a version that stores the matrices in system memory and loads the next matrix block into GPU memory while calculating the product with the current block. The block size is adjustable, but I don't know how performant that would be.

 Xyzzy 2021-08-03 14:45

How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.

[CODE]ECC Errors
NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit
errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile
and aggregate). Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see
the ECC documents on the web for information on compute application behavior when double bit errors occur. Volatile error counters track the number of errors detected since the
last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.[/CODE]:mike:

 frmky 2021-08-04 00:32

[QUOTE=frmky;584710]It would be possible to create a version that stores the matrices in system memory[/QUOTE]
I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE]
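The vector memory annotation in the log above (7 x matrix columns x VBITS / 8 bytes) can be checked with a few lines of Python. This is only a sketch of that arithmetic: the function name is made up, and it assumes the log's "MB" means 2^20 bytes, which is what makes the numbers line up. The second and third cases use the matrix sizes and VBITS from the A100 and Quadro RTX 8000 logs posted later in this thread.

```python
MIB = 1024 * 1024  # assumption: the log reports MB as 2**20 bytes


def vector_memory_mb(ncols: int, vbits: int, nvectors: int = 7) -> float:
    """Vector storage on the card: nvectors x columns x VBITS/8 bytes."""
    return nvectors * ncols * (vbits // 8) / MIB


# 42.1M matrix, VBITS=512 (log above reports 17987.6 MB)
print(round(vector_memory_mb(42101088, 512), 1))
# 27.7M matrix, VBITS=128 (A100 log reports 2961.3 MB)
print(round(vector_memory_mb(27724341, 128), 1))
# 27.5M matrix, VBITS=256 (Quadro RTX 8000 log reports 5879.2 MB)
print(round(vector_memory_mb(27521194, 256), 1))
```

The first case lands within 0.1 MB of the logged value, and the other two match the logs to one decimal place. Since the term is linear in VBITS, halving VBITS halves the vector memory, which is why "adjust VBITS as needed" frees room for the matrix.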

 frmky 2021-08-04 02:42

[QUOTE=Xyzzy;584714]How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.
[/QUOTE]
What's your risk tolerance? msieve has robust error detection so it's not as important. But it's usually a small price to ensure no memory faults.

 mathwiz 2021-08-04 02:47

Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?

 VBCurtis 2021-08-04 03:41

[QUOTE=frmky;584751]I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE][/QUOTE]

This is simply amazing! I'm running a matrix that size for GNFS-201 (from f-small) right now, at ~700 hr on a 12-core single-socket Haswell.
I hope this means you'll be digging out of your matrix backlog from the big siever queue. :tu:

 frmky 2021-08-04 04:51

[QUOTE=mathwiz;584761]Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?[/QUOTE]

It's very much a work-in-progress and things may change or occasionally be broken, but you can play with it. I have it in GitHub. I recommend using CUDA 10.2 because CUDA 11.x incorporates CUB into the toolkit and tries to force you to use it, but it's missing a few pieces. That complicates things. You can get the source with

[CODE]git clone https://github.com/gchilders/msieve_nfsathome.git -b msieve-lacuda-nfsathome
cd msieve_nfsathome
make all VBITS=128 CUDA=XX[/CODE]

where XX is the two-digit CUDA compute capability of your GPU. Specifying CUDA=1 defaults to a compute capability of 60. You may want to experiment with both VBITS=128 and VBITS=256 to see which is best on your GPU.

If you want to copy msieve to another directory, you need the msieve binary, both *.ptx files, and in the cub directory both *.so files. Or just run it from the build directory.

 Xyzzy 2021-08-04 12:22

We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (?) used on the GPU. (25M is the default.)
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.

:mike:

 Xyzzy 2021-08-04 12:29

If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:

[URL]https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/[/URL]

Then you will need these packages installed:

[C]gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1[/C]

And possibly:

[C]gmp-devel
zlib-devel[/C]

[C]export PATH="/usr/local/cuda-10.2/bin:$PATH"[/C]

:mike:

 Xyzzy 2021-08-04 19:32

[QUOTE=Xyzzy;584797]We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (?) used on the GPU. (25M is the default.)
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.[/QUOTE]
Here are more benchmarks on the same data:[CODE]VBITS = 64; BLOCKS = 25M; MEM = 37.7GB; TIME = 58.8HR
VBITS = 64; BLOCKS = 100M; MEM = 23.8GB; TIME = 66.5HR
VBITS = 64; BLOCKS = 500M; MEM = 20.0GB; TIME = 98.9HR
VBITS = 64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR

VBITS = 128; BLOCKS = 25M; MEM = 37.4GB; TIME = 49.5HR
VBITS = 128; BLOCKS = 100M; MEM = 24.2GB; TIME = 50.3HR
VBITS = 128; BLOCKS = 500M; MEM = 20.7GB; TIME = 58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME = 61.2HR

VBITS = 256; BLOCKS = 25M; MEM = 39.1GB; TIME = 47.4HR
VBITS = 256; BLOCKS = 100M; MEM = 26.5GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 500M; MEM = 23.2GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME = 37.5HR

VBITS = 512; BLOCKS = 25M; MEM = 44.1GB; TIME = 57.1HR
VBITS = 512; BLOCKS = 100M; MEM = 32.2GB; TIME = 43.5HR
VBITS = 512; BLOCKS = 500M; MEM = 28.9GB; TIME = 41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME = 40.9HR[/CODE]37.2 hours!

:mike:

 frmky 2021-08-05 01:04

That's great! The older V100 definitely doesn't like the VBITS=256 blocks=100M or 500M settings. It doubles the runtime. Anyone using this really needs to test different settings on their card.
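As a rough illustration of "test different settings on your card", the Quadro RTX 8000 benchmark table above can be filtered by a memory budget to find the fastest setting that fits. This is just a sketch over the numbers already posted. The function name is made up, and the figures apply only to that one 27M matrix on that one card.

```python
# (VBITS, block size, GPU memory in GB, estimated LA time in hours),
# copied from the benchmark post above.
BENCH = [
    (64, "25M", 37.7, 58.8), (64, "100M", 23.8, 66.5),
    (64, "500M", 20.0, 98.9), (64, "1750M", 19.3, 109.9),
    (128, "25M", 37.4, 49.5), (128, "100M", 24.2, 50.3),
    (128, "500M", 20.7, 58.5), (128, "1750M", 20.1, 61.2),
    (256, "25M", 39.1, 47.4), (256, "100M", 26.5, 37.2),
    (256, "500M", 23.2, 37.2), (256, "1750M", 22.6, 37.5),
    (512, "25M", 44.1, 57.1), (512, "100M", 32.2, 43.5),
    (512, "500M", 28.9, 41.3), (512, "1750M", 28.5, 40.9),
]


def fastest_within(mem_limit_gb: float):
    """Fastest benchmarked setting that fits; ties go to lower memory use."""
    fits = [b for b in BENCH if b[2] <= mem_limit_gb]
    if not fits:
        return None
    return min(fits, key=lambda b: (b[3], b[2]))


for limit in (48, 24, 21):
    print(limit, "GB ->", fastest_within(limit))
```

With these numbers, a tighter memory budget pushes the choice toward smaller VBITS and a noticeably longer run, so the trade-off is worth measuring rather than guessing.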

 ryanp 2021-08-06 14:05

Trying this out on an NVIDIA A100. Compiled and starts to run. I'm invoking with:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2[/CODE]

[CODE]Fri Aug 6 12:43:59 2021 commencing linear algebra
Fri Aug 6 12:43:59 2021 using VBITS=256
Fri Aug 6 12:44:04 2021 read 36267445 cycles
Fri Aug 6 12:45:36 2021 cycles contain 123033526 unique relations
Fri Aug 6 12:58:53 2021 read 123033526 relations
Fri Aug 6 13:02:37 2021 using 20 quadratic characters above 4294917295
Fri Aug 6 13:14:54 2021 building initial matrix
Fri Aug 6 13:45:04 2021 memory use: 16201.2 MB
Fri Aug 6 13:45:24 2021 read 36267445 cycles
Fri Aug 6 13:45:28 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:45:28 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:50:59 2021 filtering completed in 1 passes
Fri Aug 6 13:51:04 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:51:04 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:35 2021 matrix starts at (0, 0)
Fri Aug 6 13:54:40 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:54:40 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:40 2021 saving the first 240 matrix rows for later
Fri Aug 6 13:54:47 2021 matrix includes 256 packed rows
Fri Aug 6 13:55:00 2021 matrix is 36267035 x 36267445 (15850.8 MB) with weight 3758763803 (103.64/col)
Fri Aug 6 13:55:00 2021 sparse part has weight 3574908223 (98.57/col)
Fri Aug 6 13:55:01 2021 using GPU 0 (NVIDIA A100-SXM4-40GB)
Fri Aug 6 13:55:01 2021 selected card has CUDA arch 8.0[/CODE]

Then a long sequence of numbers, and:

[CODE]25000136 36267035 221384
25000059 36267035 218336
25000041 36267035 214416
25000174 36267035 211066
25000044 36267035 212574
25000047 36267035 212320
25000174 36267035 207956
25000171 36267035 202904
25000117 36267035 197448
25000171 36267035 191566
25000130 36267035 185008
25000136 36267035 178722
24898531 36267035 168358
3811898 36267445 264
22016023 36267445 48
24836805 36267445 60
27790270 36267445 75
24929949 36267445 75
22849647 36267445 75
24896299 36267445 90
22990599 36267445 90
25502972 36267445 110
23602625 36267445 110
26327686 36267445 135
23662886 36267445 135
26145282 36267445 165
23549845 36267445 165
26371744 36267445 205
23884092 36267445 205
26835429 36267445 255
24055165 36267445 255
26699873 36267445 315
23947051 36267445 315
26916570 36267445 390
24419378 36267445 390
27622355 36267445 485
error (line 373): CUDA_ERROR_OUT_OF_MEMORY[/CODE]

 frmky 2021-08-06 14:12

At the end after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000 then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will save the matrix overflow in CPU memory and move it from there as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.
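The order of things to try in this post can be written down as a small Python generator. It is only a sketch of the sequence described above, not something msieve does itself: the function name is made up, it assumes the starting build uses VBITS=256 as in the log that ran out of memory, and VBITS is chosen at build time (make ... VBITS=...), so each VBITS change means a different binary.

```python
BLOCK_NNZ_STEPS = [100_000_000, 500_000_000, 1_000_000_000, 1_750_000_000]


def fallback_settings(start_vbits: int = 256):
    """Yield (VBITS, argument string for -nc2) in the suggested order."""
    # 1. Keep the current VBITS and raise block_nnz step by step.
    for nnz in BLOCK_NNZ_STEPS:
        yield start_vbits, f"block_nnz={nnz}"
    # 2. Drop to VBITS=128 (less vector memory) and run through them again.
    if start_vbits != 128:
        for nnz in BLOCK_NNZ_STEPS:
            yield 128, f"block_nnz={nnz}"
    # 3. Last resort: VBITS=512 with the matrix overflow kept in CPU memory.
    yield 512, "block_nnz=1750000000 use_managed=1"


for vbits, nc2_args in fallback_settings():
    print(f'VBITS={vbits}: -nc2 "{nc2_args}"')
```

Stop at the first combination that gets past "commencing Lanczos iteration" without CUDA_ERROR_OUT_OF_MEMORY.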

 ryanp 2021-08-06 16:39

[QUOTE=frmky;584976]At the end after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000 then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will save the matrix overflow in CPU memory and move it from there as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.[/QUOTE]

Tried a bunch of settings:

* all of 100M, 500M, 1000M, 1750M with VBITS=256 all ran out of memory
* managed to get the matrix down to 27M with more sieving. VBITS=256, 1750M still runs out of memory.
* will try VBITS=128 next with the various settings

Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?

 frmky 2021-08-06 17:52

[QUOTE=ryanp;584987]Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?[/QUOTE]

Optimal and functional are very different parameters. I can try automatically picking a block_nnz value that is more likely to work, but VBITS is a compile-time setting that can't be changed at runtime. Adding use_managed=1 will make it work in most cases but can significantly slow it down, so I've defaulted it to off.

 ryanp 2021-08-06 18:03

Looks like it's working now with VBITS=128, and a pretty decent runtime:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=1000000000
...
matrix starts at (0, 0)
matrix is 27724170 x 27724341 (13842.2 MB) with weight 3947756174 (142.39/col)
sparse part has weight 3351414840 (120.88/col)
saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 27724058 x 27724341 (12940.4 MB) with weight 3222020630 (116.22/col)
sparse part has weight 3059558876 (110.36/col)
using GPU 0 (NVIDIA A100-SXM4-40GB)
selected card has CUDA arch 8.0
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000043 27724058 8774182
1000000028 27724058 9604503
1000000099 27724058 8923418
59558706 27724058 422238
1082873143 27724341 40960
954052655 27724341 1455100
916348921 27724341 16939530
106284157 27724341 9288468
commencing Lanczos iteration
vector memory use: 2961.3 MB
dense rows memory use: 423.0 MB
sparse matrix memory use: 24188.7 MB
memory use: 27573.0 MB
Allocated 82.0 MB for SpMV library
Allocated 88.6 MB for SpMV library
linear algebra at 0.0%, ETA 20h11m
checkpointing every 1230000 dimensions
linear algebra completed 12223 of 27724341 dimensions (0.0%, ETA 20h46m)[/CODE]

 frmky 2021-08-06 18:03

The 17.5M matrix for 2,1359+ took just under 15 hours on a V100.

 frmky 2021-08-06 18:20

[QUOTE=ryanp;584999]Looks like it's working now with VBITS=128, and a pretty decent runtime:[/QUOTE]
A 27.7M matrix in 21 hours. The A100 is a nice card!

Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.

 ryanp 2021-08-06 20:02

[QUOTE=frmky;585002]A 27.7M matrix in 21 hours. The A100 is a nice card![/QUOTE]
Indeed!

[QUOTE]Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.[/QUOTE]

No, sadly, "just" A100 and V100's.

 ryanp 2021-08-06 21:20

On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, [URL="https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga"]16x A100's[/URL] with a fully interconnected fabric could probably tear through big matrices?

 frmky 2021-08-06 21:41

The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1) but relies on a good MPI implementation. OpenMPI's collectives transfer the data off the card and do the reduction on the CPU. I think MVAPICH2-GDR keeps the reductions on the card, but SDSC doesn't have that working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.

Edit: [STRIKE]My backup plan if that doesn't work out is to use a non-CUDA-aware MPI to pass IPC handles between processes and do the reduction on the GPU myself.[/STRIKE]

Edit 2: I've got a draft version working just now that passes vectors between GPUs using MPI CUDA-aware point-to-point comms (which uses NVLink or GPUDirect when available) then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100's connected with NVLink, this reduces LA time from nearly 90 hours when passing vectors through CPU memory to [STRIKE]57[/STRIKE] 56 hours transferring directly between GPUs.

Edit 3: It's now in GitHub. Just compile with a CUDA-Aware MPI like OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1 where XX is replaced by the compute capability of your GPU.

 frmky 2021-08-07 06:44

[CODE]linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)[/CODE]
Using four V100's, I'm getting about 21 hours to solve a 42.1M matrix.

 fivemack 2021-08-07 11:10

Interesting! That's about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 = £60 to solve the matrix.

I'm paying 19p/kWh here, and my Skylake machine uses about 250W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine will last five years); on another hand it's taking a month rather than a day, on a third hand that's still keeping up with my sieving resources.
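The arithmetic behind those figures, as a small Python sketch. The variable names are made up, and the depreciation line assumes the machine runs around the clock for its five-year life, which is one reading of the estimate above rather than something stated in the post.

```python
# GPU route: four V100s (about a p3.8xlarge) for the 21-hour run.
gpu_hours = 21
spot_usd_per_hour = 4.0
gpu_cost_usd = gpu_hours * spot_usd_per_hour

# CPU route: 250 W Skylake machine running for 820 hours at 19p/kWh.
cpu_hours = 820
energy_kwh = 0.250 * cpu_hours
electricity_gbp = energy_kwh * 0.19

# Depreciation: GBP 3360 spread evenly over five years of continuous use
# (assumption), charged for the hours this one matrix occupies the machine.
hours_in_five_years = 5 * 365 * 24
depreciation_gbp = 3360 / hours_in_five_years * cpu_hours

print(f"GPU spot cost:   ${gpu_cost_usd:.0f}")
print(f"CPU electricity: GBP {electricity_gbp:.2f} for {energy_kwh:.0f} kWh")
print(f"CPU depreciation: GBP {depreciation_gbp:.2f}")
```

The electricity and depreciation terms come out close to the rounded £40 and £60 quoted above.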

 frmky 2021-08-07 16:13

[CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.

 ryanp 2021-08-07 16:20

[QUOTE=frmky;585109][CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.[/QUOTE]

That's quite impressive. I dug this up which I believe was your MPI run of a 109.4M matrix from a few months back?

[CODE]linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)[/CODE]

 frmky 2021-08-07 16:40

Yes, that would have been on 6 Sandy Bridge nodes with 2x 10 core cpus each.

Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes.

[CODE]Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m[/CODE]

 wombatman 2021-08-08 00:00

Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).

 frmky 2021-08-08 00:57

[QUOTE=wombatman;585135]Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).[/QUOTE]
Yes, you could solve a matrix up to about 15M or so on the card. If you have at least 32 GB system memory, you could go a bit larger transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.

 frmky 2021-08-11 22:09

1 Attachment(s)
The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100's in a total of 95.5 hours of runtime. There was a restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26GB of GPU memory on each of the four V100's.

Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second/iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used.

[PASTEBIN]TsUMyBr8[/PASTEBIN]

 Xyzzy 2021-09-09 20:28

1 Attachment(s)
We are not sure if this is interesting or not.

[C]13_2_909m1 - Near-Cunningham - SNFS(274)[/C]

This is a big (33 bit?) job. The [C]msieve.dat[/C] file, uncompressed and with duplicates and bad relations removed, is 49GB.[CODE]$ ls -lh
total 105G
-rw-rw-r--. 1 m m 36G Sep 8 20:33 13_2_909m1.dat.gz
drwx------. 2 m m 50 Aug 4 12:17 cub
-r--------. 1 m m 29K Aug 4 12:16 lanczos_kernel.ptx
-r-x------. 1 m m 3.4M Aug 4 12:16 msieve
-rw-rw-r--. 1 m m 49G Sep 8 22:02 msieve.dat
-rw-rw-r--. 1 m m 4.2G Sep 9 14:17 msieve.dat.bak.chk
-rw-rw-r--. 1 m m 4.2G Sep 9 14:54 msieve.dat.chk
-rw-rw-r--. 1 m m 969M Sep 9 12:11 msieve.dat.cyc
-rw-rw-r--. 1 m m 12G Sep 9 12:11 msieve.dat.mat
-rw-rw-r--. 1 m m 415 Sep 2 19:15 msieve.fb
-rw-rw-r--. 1 m m 13K Sep 9 15:10 msieve.log
-r--------. 1 m m 108K Aug 4 12:16 stage1_core.ptx
-rw-rw-r--. 1 m m 264 Sep 2 19:15 worktodo.ini[/CODE]There are ~442M relations. Setting [C]block_nnz[/C] to 500M resulted in an OOM error, so we used 1B instead.[CODE]commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 27521024 x 27521194 (12901.7 MB) with weight 3687594306 (133.99/col)
sparse part has weight 3106904079 (112.89/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 27520784 x 27521194 (12034.4 MB) with weight 2848207923 (103.49/col)
sparse part has weight 2714419599 (98.63/col)
using GPU 0 (Quadro RTX 8000)
selected card has CUDA arch 7.5
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000013 27520784 9680444
1000000057 27520784 11295968
714419529 27520784 6544782
1039631367 27521194 100000
917599197 27521194 3552480
757189035 27521194 23868304
commencing Lanczos iteration
vector memory use: 5879.2 MB
dense rows memory use: 839.9 MB
sparse matrix memory use: 21339.3 MB
memory use: 28058.3 MB
Allocated 123.0 MB for SpMV library
Allocated 127.8 MB for SpMV library
linear algebra at 0.0%, ETA 49h57m
checkpointing every 570000 dimensions
linear algebra completed 925789 of 27521194 dimensions (3.4%, ETA 45h13m)
linear algebra completed 926044 of 27521194 dimensions (3.4%, ETA 45h12m)
lanczos halted after 3628 iterations (dim = 926044)
BLanczosTime: 5932
elapsed time 01:38:53

current factorization was interrupted[/CODE]So the LA step is under 50 hours which seems pretty fast! (We have no plans to complete it since it is assigned to VBCurtis.)
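As a sanity check on the ETA, the totals in the log above can be extrapolated by hand. This Python sketch assumes BLanczosTime is in seconds (it agrees with the 01:38:53 elapsed time to within a second) and that the rate stays constant for the rest of the run. The variable names are made up.

```python
iterations = 3628        # "lanczos halted after 3628 iterations"
dims_done = 926044       # "(dim = 926044)"
seconds = 5932           # "BLanczosTime: 5932", assumed to be seconds
total_dims = 27521194    # matrix columns

dims_per_iteration = dims_done / iterations
seconds_per_iteration = seconds / iterations
estimated_hours = total_dims / (dims_done / seconds) / 3600

print(f"{dims_per_iteration:.1f} dimensions per iteration")
print(f"{seconds_per_iteration:.2f} s per iteration")
print(f"about {estimated_hours:.1f} hours for the full matrix")
```

Each iteration advances by a little under VBITS=256 dimensions, and the extrapolated total is consistent with the ETA just under 50 hours printed at the start of the run.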

We have the raw files saved if there are other configurations worth investigating. If so, just let us know!

:mike:

 VBCurtis 2021-09-09 21:00

It's a 32/33 hybrid, with a healthy amount of oversieving (I wanted a matrix below 30M dimensions, success!).

I'm impressed that it fits on your card, and 50 hr is pretty amazing. I just started the matrix a few hours ago on a 10-core Ivy Bridge, and the ETA is 365 hr.

If you have the free cycles to run it, please be my guest! That 20+ core weeks saved is enough to ECM the next candidate.

 frmky 2021-09-13 05:22

I spent time with Nsight Compute looking at the SpMV kernel. As expected for SpMV it's memory bandwidth limited, so increasing occupancy to hide latency should help. I adjusted parameters to reduce both register and shared memory use, which increased the occupancy. This yielded a runtime improvement of only about 5% on the V100 but it may differ on other cards. I also increased the default block_nnz to 1750M to reduce global memory use a bit.

 frmky 2021-09-16 06:00

Today I expanded the allowed values of VBITS to any of 64, 128, 192, 256, 320, 384, 448, or 512. This works on both CPUs and GPUs, but I don't expect much, if any, speedup on CPUs. As a GPU benchmark, I tested a 42.1M matrix on two NVLink-connected V100's. Here are the results.
[CODE]
VBITS Time (hours)
64 109.5
128 63.75
192 50
256 40.25
320 40.25
384 37.75
448 40.25
512 37.25[/CODE]
Combined with the new SpMV parameters, I get the best times with VBITS of 384 and 512, but 384 uses less memory. Overall, I get about 6% better performance than with VBITS=256.
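For reference, the percentages can be recomputed from the table above with a short Python sketch. It measures improvement as reduction in runtime relative to VBITS=256; the names are made up.

```python
# VBITS -> LA time in hours, from the two-V100 benchmark above.
HOURS = {64: 109.5, 128: 63.75, 192: 50.0, 256: 40.25,
         320: 40.25, 384: 37.75, 448: 40.25, 512: 37.25}

baseline = HOURS[256]
# Percentage reduction in runtime relative to VBITS=256.
saving_pct = {v: 100.0 * (baseline - t) / baseline for v, t in HOURS.items()}

for vbits in sorted(HOURS):
    print(f"VBITS={vbits:3d}: {HOURS[vbits]:6.2f} h, {saving_pct[vbits]:+6.1f}% vs 256")
```

VBITS=384 saves roughly 6% of the runtime and VBITS=512 slightly more, while anything below 256 is clearly slower on this pair of cards.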

 Xyzzy 2021-09-17 11:38

Our system has a single GPU. When we are doing compute work on the GPU the display lags. We can think of two ways to fix this.

Some sort of niceness assignment to the compute process.
Limiting the compute process to less than 100% of the GPU.

Are either of these approaches possible?

:mike:

 Xyzzy 2021-09-17 11:41

Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?

:mike:

 axn 2021-09-17 12:15

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)

 frmky 2021-09-17 14:56

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
CUDA doesn't allow display updates while a kernel is running. The only way to improve responsiveness without using a second GPU is to shorten the kernel run times. The longest kernel is the SpMV, and you can shorten that by lowering block_nnz. Use the lowest value that still allows everything to fit in GPU memory.

Edit: Lowering VBITS will also reduce kernel runtimes, but don't go below 128. See the benchmark a few posts above. Also, you can't change VBITS in the middle of a run. You would need to start over from the beginning. You can change block_nnz during a restart.

 chris2be8 2021-09-17 15:37

[QUOTE=axn;588028]Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)[/QUOTE]

Or use on-board graphics if the MOBO has them. That's how I run my main GPU.

 VBCurtis 2021-09-17 16:52

[QUOTE=Xyzzy;588027]Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?

:mike:[/QUOTE]

In principle, yes. There's an electricity savings to be had by over-sieving less and accepting larger matrices, especially on the e-small queue where matrices are nearly all under 15M. However, one can't push this very far, as relation sets that fail to build any matrix delay jobs and require human-admin time to add Q to the job. I've been trying to pick relations targets that leave jobs uncertain to build a matrix at TD=120, and I advocate this for everyone on e-small now. Some of the bigger 15e jobs could yield matrices over 30M / over the memory capabilities of GPU-LA, so maybe those shouldn't change much?

Another way to view this is to aim for the number of relations one would use if one were doing the entire job on one's own equipment, and then add just a bit to reduce the chance of needing to ask for more Q from admin (like round Q up to the nearest 5M or 10M increment).

 Xyzzy 2021-09-17 18:56

What is the difference in relations needed between TD=120 and TD=100? (Do we have this data?)

We think a GPU could do a TD=100 job faster than a CPU could do a TD=120 job.

Personally, we don't mind having to rerun matrix building if there aren't enough relations. We don't know if it is a drag for the admins to add additional relations, but if it isn't a big deal the project could probably run more efficiently.

There doesn't seem to be a shortage of LA power, so maybe the project could skew a bit in favor of more jobs overall with fewer relations per job? Is the bottleneck server storage space?

What percentage in CPU-hours is the sieving versus the post-processing work? Does one additional hour of post-processing "save" 1000 hours of sieving? More? Less?

[SIZE=1](We lack the technical knowledge and vocabulary to express what we are thinking. Hopefully what we wrote makes a little sense.)[/SIZE]

:mike:

 VBCurtis 2021-09-17 19:44

I'm gonna hand-wave here, since only a few people have bothered taking data:

When a relation set is right at the cusp of building a matrix, a few more hours sieving will save more than a few hours to solve the matrix on that same machine (meaning CPU in both cases).

At the relation counts most e-small and 15e jobs are processed at, 20 more core-hours of sieving might save 5 or 10 core-hours of matrix work (again, both measured on a CPU). I've done a few experiments at home, and I have yet to find a job where the sieving required to build a matrix at TD=120 saved more CPU time than it cost. I believe this could/would be the case on really big jobs, say with matrices at 50M+ in size.

We have historically sieved more than needed because BOINC computation is cheap, while matrix solving time was in short supply. So, now that GPU matrix solving makes matrices not in short supply, we should sieve less. Something like 5-10% fewer relations, which means 5-10% more jobs done per calendar month.

 frmky 2021-09-17 19:47

[QUOTE=Xyzzy;588059]Is the bottleneck server storage space?[/QUOTE]
No. The server is currently using 467G of 3.6T.

 frmky 2021-09-18 03:58

For 2,2174L, 1355M relations yielded 734M uniques. With nearly 50% duplicates, we have clearly reached the limit for 16e. Anyway, filtering yielded
[CODE]matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)[/CODE]
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
[CODE]linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)[/CODE]
So more sieving would only save a day or so in LA. I have the cluster time, so I'll let it run.

 pinhodecarlos 2021-09-18 07:02

[QUOTE=VBCurtis;588061]

We have historically sieved more than needed because BOINC computation is cheap, while matrix solving time was in short supply. So, now that GPU matrix solving makes matrices not in short supply, we should sieve less. Something like 5-10% fewer relations, which means 5-10% more jobs done per calendar month.[/QUOTE]

Totally agree with you now. What's more, when someone says a number is under LA, I would recommend (I know, Greg!… lol) cancelling all queued WUs; this will also speed up the start of sieving on the next number. Sievers are wasting a few days (in my experience) processing unnecessary work (I just manually abort those WUs so they go to someone else). Just be careful not to do this during any challenges, since it will interfere with strategic bunkering.

 Xyzzy 2021-09-18 13:47

1 Attachment(s)
[QUOTE=Xyzzy;584798]If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:

[URL]https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/[/URL]

Then you will need these packages installed:

[C]gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1[/C]

And possibly:

[C]gmp-devel
zlib-devel[/C]

[C]export PATH="/usr/local/cuda-10.2/bin:$PATH"[/C]

:mike:[/QUOTE]Here are simpler instructions.

[CODE]sudo subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
sudo dnf module install nvidia-driver:latest
sudo reboot
sudo dnf install cuda-11-4
echo 'export PATH=/usr/local/cuda-11.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64/:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc[/CODE]Then just use the attached archive to set up your work.

:mike:
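
A quick way to confirm the PATH change took effect after opening a new shell is sketched below. The helper name [C]cuda_dir_on_path[/C] is made up for this example, and the default directory simply mirrors the 11.4 path used in the instructions above; adjust it if your install differs.

```python
import os
import shutil

def cuda_dir_on_path(path_value, cuda_bin="/usr/local/cuda-11.4/bin"):
    """True if cuda_bin is one of the entries of a colon-separated PATH string."""
    entries = [p.rstrip("/") for p in path_value.split(":") if p]
    return cuda_bin.rstrip("/") in entries

if __name__ == "__main__":
    # Report on the current environment; shutil.which returns None if nvcc is not found.
    print("CUDA bin directory on PATH:", cuda_dir_on_path(os.environ.get("PATH", "")))
    print("nvcc resolved to:", shutil.which("nvcc"))
```

If the first line prints False or the second prints None, the export lines probably did not make it into the shell you are using.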

 charybdis 2021-09-18 14:37

[QUOTE=frmky;588086]For 2,2174L, 1355M relations yielded 734M uniques. With nearly 50% duplicates, we have clearly reached the limit for 16e.[/QUOTE]

Or is this just the limit for 16e with 33-bit large primes? I know you've avoided going higher because of the difficulty of the LA and the msieve filtering bug, but now that the bug is fixed and GPUs make the LA much easier, might it be worth going up to 34-bit?

 frmky 2021-09-18 15:59

[QUOTE=charybdis;588104]Or is this just the limit for 16e with 33-bit large primes?[/QUOTE]
Does the lasieve5 code work correctly with 34-bit large primes? I know the check is commented out, but I haven't tested it.

 charybdis 2021-09-19 00:30

I tested the binary from [URL="https://www.mersenneforum.org/showpost.php?p=470249&postcount=10"]here[/URL] on 2,2174L with 34-bit large primes and it seemed to work fine. Yield was more than double that at 33-bit so definitely looks worth it, as one would expect. There were no issues with setting mfba=99 either.

 henryzz 2021-09-19 08:02

I looked through the code a few years ago and found no issues. Lasieve4 is also fine although it is limited to 96 bit mfba/r.

 wreck 2021-09-23 11:46

I tried fetching an NFS@Home WU and received a 2,2174M assignment with lpbr and lpba set to 34.
Here are the contents of the polynomial file S2M2174b.poly.

[CODE]
n: 470349924831928271476705309712184283829671891500377511256458133476241008159328553358384317181001385841345904968378352588310952651779460262173005355061503024245423661736289481941107679294474063050602745740433565487767078338816787736757703231764661986524341166060777900926495463269979500293362217153953866146837
skew: 1.22341
c6: 2
c5: 0
c4: 0
c3: 2
c2: 0
c1: 0
c0: 1
Y1: 1
Y0: -3064991081731777716716694054300618367237478244367204352
type: snfs
rlim: 250000000
alim: 250000000
lpbr: 34
lpba: 34
mfbr: 99
mfba: 69
rlambda: 3.6
alambda: 2.6
[/CODE]

When q is near 784M, the memory used is 743MB.
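
As a sanity check on the polynomial pair, the sketch below plugs the common root into the sextic. The coefficients are copied from the file above. The reading that the root is 2^181 and that the value is the M half of the Aurifeuillian split of 2^2174+1 is my own interpretation, which the assertions confirm. The composite n from the file is not used here; it should divide the value of f at the root, but that is not checked.

```python
# Coefficients copied from S2M2174b.poly above (c5, c4, c2, c1 are zero).
c6, c3, c0 = 2, 2, 1
Y1 = 1
Y0 = -3064991081731777716716694054300618367237478244367204352

def f(x):
    """Algebraic polynomial 2*x^6 + 2*x^3 + 1."""
    return c6 * x**6 + c3 * x**3 + c0

# Root of the rational polynomial Y1*x + Y0 (exact here because Y1 = 1).
m = -Y0 // Y1

# The root is a power of two.
assert m == 2**181

# f(m) is the "M" half of the Aurifeuillian factorization of 2^2174 + 1.
M = 2**1087 + 2**544 + 1
L = 2**1087 - 2**544 + 1
assert f(m) == M
assert L * M == 2**2174 + 1

print("root m = 2^181:", m == 2**181)
print("f(m) has", len(str(f(m))), "decimal digits")
```

Both sides of the relation are tiny in coefficient terms, which is the usual reason SNFS jobs of this shape are so much easier than GNFS at the same size.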

 charybdis 2021-09-23 12:46

[QUOTE=wreck;588461][CODE]
lpbr: 34
lpba: 34
mfbr: 99
mfba: 69
[/CODE][/QUOTE]

@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).
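
To make the large-prime counting concrete, here is a rough sketch. It uses the simple bound that a cofactor below 2^mfb can hold at most k primes above the factor base limit if lim^k < 2^mfb. The function name is invented for the example; mfba=99 was mentioned earlier in the thread, while mfbr=69 in the swapped case is just an assumed value mirroring the posted file.

```python
import math

def max_large_primes(mfb, lim):
    """Upper bound on the number of primes above lim in a cofactor below 2**mfb."""
    k = 0
    while (k + 1) * math.log2(lim) < mfb:
        k += 1
    return k

LIM = 250_000_000  # rlim = alim in the posted file

# As posted: mfbr 99, mfba 69 -> (rational, algebraic) large primes.
posted = (max_large_primes(99, LIM), max_large_primes(69, LIM))

# Swapped: mfba 99 (from the thread), mfbr 69 (assumed for illustration).
swapped = (max_large_primes(69, LIM), max_large_primes(99, LIM))

# Assuming rational-side sieving, the special-q counts as one more rational prime.
posted_with_q = (posted[0] + 1, posted[1])
swapped_with_q = (swapped[0] + 1, swapped[1])

print("posted  (rational, algebraic):", posted, "with special-q:", posted_with_q)
print("swapped (rational, algebraic):", swapped, "with special-q:", swapped_with_q)
```

Under those assumptions the posted parameters give a 4/2 split once the special-q is counted, and the swapped ones give 3/3.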

 frmky 2021-09-24 06:17

[QUOTE=charybdis;588469]@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).[/QUOTE]
I'll try that, thanks!

 frmky 2021-09-24 06:21

[QUOTE=frmky;588086]filtering yielded
[CODE]matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)[/CODE]
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
[CODE]linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)[/CODE]
[/QUOTE]
And it's done. LA on the 102M matrix with restarts took 5 days 14 hours.
[PASTEBIN]cB1qD1hJ[/PASTEBIN]

 charybdis 2021-09-24 12:36

[QUOTE=frmky;588534]I'll try that, thanks![/QUOTE]

Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?

 pinhodecarlos 2021-09-24 13:40

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

95%.

 frmky 2021-09-24 15:13

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

 charybdis 2021-09-24 15:50

If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA that bit harder as a result.

But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this.
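
The trade-off can be put into rough numbers as below. The two factors come from the estimates in this post; the extra-relations fraction is the unknown, so it is swept from 0 to 50%. The sketch assumes yield per q stays constant over the range, which is a simplification since yield normally drops as q grows.

```python
SEC_PER_REL_FACTOR = 1.30  # "~30% worse" sec/rel with 3LP on both sides
YIELD_FACTOR = 1.50        # "~50%" more relations per special-q

def tradeoff(extra_relations_needed):
    """Return (q_range_ratio, cpu_time_ratio) relative to 3LP on one side only.

    extra_relations_needed is the fractional increase in relations required,
    e.g. 0.25 for 25% more.
    """
    relations = 1.0 + extra_relations_needed
    q_range_ratio = relations / YIELD_FACTOR
    cpu_time_ratio = relations * SEC_PER_REL_FACTOR
    return q_range_ratio, cpu_time_ratio

for extra in (0.0, 0.25, 0.50):
    q_ratio, cpu_ratio = tradeoff(extra)
    print(f"extra relations {extra:4.0%}: q range x{q_ratio:.2f}, CPU time x{cpu_ratio:.2f}")
```

So the q range only shrinks if the extra relations needed stay below 50%, and the CPU cost goes up in every case, which fits the point that this is a way to stretch the job-size limit and not a speedup.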

 ryanp 2021-10-22 13:34

In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

 frmky 2021-10-22 23:00

Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.

Volta (and I'm hoping Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a CUB SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz sets an estimate, especially for the transpose matrix, so I've been a bit conservative setting it at 1.75B. We want to keep the number of blocks reasonably small since each block for both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.
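
As a back-of-envelope illustration of that storage cost, the sketch below adds up the row offset and column arrays for the matrix and its transpose. It assumes the number of blocks is simply ceil(nnz / block_nnz) and approximates both dimensions by a single N, as in the post. The example numbers are the per-GPU submatrix from the 2,2174M log later in the thread (sparse weight, larger dimension). This is not msieve's actual allocation code.

```python
import math

DEFAULT_BLOCK_NNZ = 1_750_000_000  # default block_nnz quoted in the post
INT32_LIMIT = 2**31                # each block must have fewer nonzeros than this

def csr_storage_bytes(n, nnz, block_nnz=DEFAULT_BLOCK_NNZ):
    """Estimate (blocks, bytes) for storing a matrix and its transpose in CSR blocks."""
    if block_nnz >= INT32_LIMIT:
        raise ValueError("block_nnz must stay below 2^31")
    blocks = max(1, math.ceil(nnz / block_nnz))  # assumed blocking rule
    row_offsets = blocks * 4 * (n + 1)           # one offset array per block
    columns = 4 * nnz                            # one 4-byte column index per nonzero
    return blocks, 2 * (row_offsets + columns)   # normal + transpose

# Per-GPU submatrix figures taken from the 2,2174M log in this thread.
blocks, total_bytes = csr_storage_bytes(n=53_382_313, nnz=1_451_172_382)
print(f"{blocks} block(s), about {total_bytes / 1e9:.1f} GB for matrix plus transpose")
```

The column arrays dominate; the row offset arrays only start to matter when a small block_nnz forces many blocks.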

For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on A100). With VBITS=128, we are only using 16 bytes of that data with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus more efficiently uses global memory bandwidth in the SpMV. However, each iteration also has multiple VBITSxN • NxVBITS dense matrix multiplications which require strided access to arrays. This strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory. In practice on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with decreasing values. Of the two I use 384 since it requires less GPU memory. However, lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare using VBITS=256 with cudaLimitMaxL2FetchGranularity of 32 to VBITS=384 or 512 with the default.

So, in short, unless GPU memory is tight use VBITS=384 and the default block_nnz on V100 and likely on A100 as well.
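
The two VBITS effects described above can be tabulated with a few lines of Python. The formulas are the ones from the post (VBITS/8 bytes used per fetch, 7*N*VBITS/8 bytes of vectors). N = 30M is only an illustrative size, and the function names are made up for the example.

```python
def fetch_utilization(vbits, fetch_bytes=64):
    """Fraction of one global memory fetch that a single VBITS-wide entry uses."""
    return min(1.0, (vbits // 8) / fetch_bytes)

def vector_bytes(n, vbits):
    """GPU memory taken by the Lanczos vectors: 7 * N * VBITS / 8 bytes."""
    return 7 * n * vbits // 8

N = 30_000_000  # illustrative matrix dimension, not a measured case

for vbits in (128, 256, 384, 512):
    used = fetch_utilization(vbits)
    gb = vector_bytes(N, vbits) / 1e9
    print(f"VBITS={vbits}: {used:.0%} of a 64-byte fetch used, vectors {gb:.2f} GB")

# The A100 option mentioned above: 32-byte fetch granularity with VBITS=256.
print(f"VBITS=256 with 32-byte fetches: {fetch_utilization(256, fetch_bytes=32):.0%} used")
```

This only captures the bandwidth and memory side; the strided-access penalty in the dense multiplications is not modelled, which is why 512 does not simply win in practice.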

 frmky 2021-10-26 04:05

2,2174M is in LA, so here's one more data point. Running on [B]eight[/B] NVLink-connected V100's,
[CODE]Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021 sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021 commencing linear algebra
Sun Oct 24 23:03:59 2021 using VBITS=384
Sun Oct 24 23:03:59 2021 skipping matrix build
Sun Oct 24 23:03:59 2021 initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021 matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021 matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021 sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021 saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021 matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021 matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021 sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021 using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021 selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021 commencing Lanczos iteration
Sun Oct 24 23:12:47 2021 memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021 linear algebra at 0.0%, ETA 90h17m
[/CODE]
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.

 pinhodecarlos 2021-10-26 06:21

And I suppose you will be comparing the other sieve run with higher LPs, probably still some leftovers.

 frmky 2021-10-26 07:48

[QUOTE=pinhodecarlos;591648]And I suppose you will be comparing the other sieve run with higher LPs, probably still some leftovers.[/QUOTE]
We didn't sieve it twice. Only a little at the beginning was sieved with 33 bit LPs and all the relations were combined. There are a few stragglers that I'm not worrying about.

 charybdis 2021-10-26 09:35

What was your general impression of 34-bit vs 33-bit? Will the extra bit allow slightly larger jobs to be run as I'd hoped?

 VBCurtis 2021-10-26 15:47

[QUOTE=frmky;591636]2,2174M is in LA, so here's one more data point. Running on [B]eight[/B] NVLink-connected V100's,
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.[/QUOTE]

How many relations did you collect? Was the unique ratio better than 2,2174L's? The matrices came out pretty similar in size, so a comparison of relation counts (raw and unique) gives a nice 33 vs 34 data point.

 frmky 2021-10-26 18:06

For 2,2174L we sieved from 20M - 6B, and collected 1.36B relations. This gave 734M uniques, so about 46% duplicates.

For 2,2174M we sieved from 20M - 4B, and collected 2.19B relations. This gave 1.29B uniques, so about 41% duplicates. However, we sieved a considerably narrower range of q, and it was overall much faster.
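
For reference, the duplicate percentages and a rough uniques-per-q figure follow directly from the numbers in this post. The sketch below just redoes that arithmetic; the dictionary layout and names are my own.

```python
def duplicate_rate(raw, unique):
    """Fraction of raw relations that were duplicates."""
    return 1.0 - unique / raw

# Figures from the post: raw relations, unique relations, sieved q range.
jobs = {
    "2,2174L (33-bit)": {"raw": 1.36e9, "unique": 734e6, "q_lo": 20e6, "q_hi": 6e9},
    "2,2174M (34-bit)": {"raw": 2.19e9, "unique": 1.29e9, "q_lo": 20e6, "q_hi": 4e9},
}

for name, j in jobs.items():
    dup = duplicate_rate(j["raw"], j["unique"])
    per_q = j["unique"] / (j["q_hi"] - j["q_lo"])  # unique relations per unit of q
    print(f"{name}: {dup:.1%} duplicates, {per_q:.3f} uniques per q")
```

On these figures the 34-bit job produced well over twice as many unique relations per unit of q, although it also needs more relations in total.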

 LaurV 2021-10-27 03:14

[offtopic] I changed the thread title. The old one made me [URL="https://en.wikipedia.org/wiki/Romanian_profanity"]nostalgic[/URL] every time someone posted in it... The new title is easier to search too, as the thread contains a lot of useful info... [/offtopic]

 frmky 2021-10-31 18:58

[QUOTE=frmky;591636]2,2174M is in LA, so here's one more data point.[/QUOTE]
It's done.
[PASTEBIN]5RGLguge[/PASTEBIN]

 All times are UTC.