mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

Xyzzy 2021-08-04 12:22

We played around with 10867_67m1, which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (block_nnz, the number of matrix nonzeros per GPU block; 25M is the default).
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.

:mike:

Xyzzy 2021-08-04 12:29

If you are using RHEL 8 (8.4), you can install the proprietary Nvidia driver easily by following these directions:

[URL]https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/[/URL]

Then you will need these packages installed:

[C]gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1[/C]

And possibly:

[C]gmp-devel
zlib-devel[/C]
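
On RHEL 8 the whole set can be pulled in with dnf in one shot; a minimal sketch, assuming the CUDA 10.2 repository from the driver instructions above is already enabled:

[C]sudo dnf install gcc make cuda-nvcc-10-2 cuda-cudart-dev-10-2-10.2.89-1 gmp-devel zlib-devel[/C]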

You also have to manually adjust your PATH variable in [C]~/.bashrc[/C]:

[C]export PATH="/usr/local/cuda-10.2/bin:$PATH"[/C]

:mike:

Xyzzy 2021-08-04 19:32

[QUOTE=Xyzzy;584797]We played around with 10867_67m1, which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (block_nnz, the number of matrix nonzeros per GPU block; 25M is the default).
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.[/QUOTE]
Here are more benchmarks on the same data:[CODE]VBITS = 64; BLOCKS = 25M; MEM = 37.7GB; TIME = 58.8HR
VBITS = 64; BLOCKS = 100M; MEM = 23.8GB; TIME = 66.5HR
VBITS = 64; BLOCKS = 500M; MEM = 20.0GB; TIME = 98.9HR
VBITS = 64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR

VBITS = 128; BLOCKS = 25M; MEM = 37.4GB; TIME = 49.5HR
VBITS = 128; BLOCKS = 100M; MEM = 24.2GB; TIME = 50.3HR
VBITS = 128; BLOCKS = 500M; MEM = 20.7GB; TIME = 58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME = 61.2HR

VBITS = 256; BLOCKS = 25M; MEM = 39.1GB; TIME = 47.4HR
VBITS = 256; BLOCKS = 100M; MEM = 26.5GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 500M; MEM = 23.2GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME = 37.5HR

VBITS = 512; BLOCKS = 25M; MEM = 44.1GB; TIME = 57.1HR
VBITS = 512; BLOCKS = 100M; MEM = 32.2GB; TIME = 43.5HR
VBITS = 512; BLOCKS = 500M; MEM = 28.9GB; TIME = 41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME = 40.9HR[/CODE]37.2 hours!

:mike:

frmky 2021-08-05 01:04

That's great! The older V100 definitely doesn't like VBITS=256 with blocks=100M or 500M; those settings double the runtime there. Anyone using this really needs to test different settings on their card.
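
One low-effort way to do that is to launch the LA once per setting, note the ETA it reports once it stabilizes, and kill the run. A hedged sketch (the timeout and paths are illustrative, not a recommendation):

[CODE]# Sample each block size briefly and compare the reported ETAs in the log.
for nnz in 25000000 100000000 500000000 1750000000; do
  timeout 600 ./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log \
      -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=$nnz
done[/CODE]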

ryanp 2021-08-06 14:05

Trying this out on an NVIDIA A100. It compiled and starts to run. I'm invoking it with:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2[/CODE]

[CODE]Fri Aug 6 12:43:59 2021 commencing linear algebra
Fri Aug 6 12:43:59 2021 using VBITS=256
Fri Aug 6 12:44:04 2021 read 36267445 cycles
Fri Aug 6 12:45:36 2021 cycles contain 123033526 unique relations
Fri Aug 6 12:58:53 2021 read 123033526 relations
Fri Aug 6 13:02:37 2021 using 20 quadratic characters above 4294917295
Fri Aug 6 13:14:54 2021 building initial matrix
Fri Aug 6 13:45:04 2021 memory use: 16201.2 MB
Fri Aug 6 13:45:24 2021 read 36267445 cycles
Fri Aug 6 13:45:28 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:45:28 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:50:59 2021 filtering completed in 1 passes
Fri Aug 6 13:51:04 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:51:04 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:35 2021 matrix starts at (0, 0)
Fri Aug 6 13:54:40 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:54:40 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:40 2021 saving the first 240 matrix rows for later
Fri Aug 6 13:54:47 2021 matrix includes 256 packed rows
Fri Aug 6 13:55:00 2021 matrix is 36267035 x 36267445 (15850.8 MB) with weight 3758763803 (103.64/col)
Fri Aug 6 13:55:00 2021 sparse part has weight 3574908223 (98.57/col)
Fri Aug 6 13:55:01 2021 using GPU 0 (NVIDIA A100-SXM4-40GB)
Fri Aug 6 13:55:01 2021 selected card has CUDA arch 8.0[/CODE]

Then a long sequence of numbers, and:

[CODE]25000136 36267035 221384
25000059 36267035 218336
25000041 36267035 214416
25000174 36267035 211066
25000044 36267035 212574
25000047 36267035 212320
25000174 36267035 207956
25000171 36267035 202904
25000117 36267035 197448
25000171 36267035 191566
25000130 36267035 185008
25000136 36267035 178722
24898531 36267035 168358
3811898 36267445 264
22016023 36267445 48
24836805 36267445 60
27790270 36267445 75
24929949 36267445 75
22849647 36267445 75
24896299 36267445 90
22990599 36267445 90
25502972 36267445 110
23602625 36267445 110
26327686 36267445 135
23662886 36267445 135
26145282 36267445 165
23549845 36267445 165
26371744 36267445 205
23884092 36267445 205
26835429 36267445 255
24055165 36267445 255
26699873 36267445 315
23947051 36267445 315
26916570 36267445 390
24419378 36267445 390
27622355 36267445 485
error (line 373): CUDA_ERROR_OUT_OF_MEMORY[/CODE]

frmky 2021-08-06 14:12

At the end of the command, after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, switch to VBITS=128, start over at 100M, and run through them again. That uses less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000, then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will store the matrix overflow in CPU memory and move it to the GPU as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.
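
Putting those suggestions together, the sequence of attempts might look like this. This is a sketch assembled from the flags above, reusing the invocation from earlier in the thread, not output from a real run:

[CODE]# Start with a moderate block size
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=100000000

# If that runs out of memory, retry with block_nnz=500000000, 1000000000, 1750000000.
# As a last resort (with a VBITS=512 build, per the above), spill to CPU memory:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 "block_nnz=1750000000 use_managed=1"

# On retries after the matrix has already been built, skip rebuilding it:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 "skip_matbuild=1 block_nnz=100000000"[/CODE]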

ryanp 2021-08-06 16:39

[QUOTE=frmky;584976]At the end of the command, after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, switch to VBITS=128, start over at 100M, and run through them again. That uses less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000, then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will store the matrix overflow in CPU memory and move it to the GPU as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.[/QUOTE]

Tried a bunch of settings:

* 100M, 500M, 1000M, and 1750M with VBITS=256 all ran out of memory
* managed to get the matrix down to 27M with more sieving. VBITS=256, 1750M still runs out of memory.
* will try VBITS=128 next with the various settings

Is there any work planned to pick optimal (or at least functional, non-crashing) settings automatically?

frmky 2021-08-06 17:52

[QUOTE=ryanp;584987]Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?[/QUOTE]

Optimal and functional are very different goals. I can try automatically picking a block_nnz value that is more likely to work, but VBITS is a compile-time setting that can't be changed at runtime. Adding use_managed=1 will make it work in most cases but can significantly slow it down, so I've defaulted it to off.
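
So trying a different VBITS means a rebuild and a fresh run; under the same assumed Makefile flags as the build sketch earlier in the thread:

[C]make clean
make all CUDA=1 VBITS=128[/C]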

ryanp 2021-08-06 18:03

Looks like it's working now with VBITS=128, and a pretty decent runtime:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=1000000000
...
matrix starts at (0, 0)
matrix is 27724170 x 27724341 (13842.2 MB) with weight 3947756174 (142.39/col)
sparse part has weight 3351414840 (120.88/col)
saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 27724058 x 27724341 (12940.4 MB) with weight 3222020630 (116.22/col)
sparse part has weight 3059558876 (110.36/col)
using GPU 0 (NVIDIA A100-SXM4-40GB)
selected card has CUDA arch 8.0
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000043 27724058 8774182
1000000028 27724058 9604503
1000000099 27724058 8923418
59558706 27724058 422238
1082873143 27724341 40960
954052655 27724341 1455100
916348921 27724341 16939530
106284157 27724341 9288468
commencing Lanczos iteration
vector memory use: 2961.3 MB
dense rows memory use: 423.0 MB
sparse matrix memory use: 24188.7 MB
memory use: 27573.0 MB
Allocated 82.0 MB for SpMV library
Allocated 88.6 MB for SpMV library
linear algebra at 0.0%, ETA 20h11m
checkpointing every 1230000 dimensions
linear algebra completed 12223 of 27724341 dimensions (0.0%, ETA 20h46m)[/CODE]

frmky 2021-08-06 18:03

The 17.5M matrix for 2,1359+ took just under 15 hours on a V100.

frmky 2021-08-06 18:20

[QUOTE=ryanp;584999]Looks like it's working now with VBITS=128, and a pretty decent runtime:[/QUOTE]
A 27.7M matrix in 21 hours. The A100 is a nice card!

Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.

