
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

VBCurtis 2021-08-02 19:23

Msieve GPU Linear Algebra
 
[QUOTE=Xyzzy;584617][C]GPU = Quadro RTX 8000
LA = 3331s[/C]

Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again. [/QUOTE]

Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?

Xyzzy 2021-08-02 19:28

We were told:[QUOTE]You can probably run up to a 25M-30M matrix, perhaps a bit larger, on that card.[/QUOTE]

frmky 2021-08-02 21:48

[QUOTE=VBCurtis;584640]Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?[/QUOTE]
Unlike most of us, Mike has a high-end workstation GPU with 48GB memory. He can fit all but the largest "f small" matrices on his GPU.

xilman 2021-08-03 11:49

[QUOTE=frmky;584454]After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.[/QUOTE]I'm sorry, but I am getting seriously out of date with CUDA, since the updated compilers stopped working on my Ubuntu and Gentoo systems.

The three systems still in use have a 460, a 970, and a 1060 with drivers 390.138, 390.144 and 390.141 respectively.

Do you think your new code might run on any of those? If so, I will try again to get CUDA installed and working.

Thanks.

frmky 2021-08-03 14:15

Technically yes, but consumer cards don't have enough memory to store interesting matrices. If the GTX 1060 has 6GB, it could run matrices up to about 5Mx5M. The problem is that block Lanczos requires multiplying by both the matrix and its transpose, but GPUs only seem to work well with the matrix stored in CSR format, which doesn't allow the transpose product to be computed efficiently. So we load both the matrix and its transpose onto the card.
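
For intuition, here's a minimal sketch of the kind of CSR kernel involved (illustrative plain CUDA, not msieve's actual code): over GF(2) the matrix entries are 0/1 and the vector elements are VBITS-bit word groups, so the multiply-add in y = A·x reduces to XORing x[col] into the row accumulator.

[CODE]// Minimal illustrative CSR SpMV over GF(2), one thread per row.
// One 64-bit word per vector element is shown; msieve uses VBITS-bit
// elements. All names here are hypothetical.
__global__ void spmv_gf2_csr(int nrows,
                             const int *row_ptr,  // CSR row offsets, nrows+1 entries
                             const int *col_idx,  // column index of each nonzero
                             const unsigned long long *x,
                             unsigned long long *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;

    unsigned long long acc = 0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
        acc ^= x[col_idx[k]];   // entries are 0/1, so multiply-add is XOR
    y[row] = acc;
}[/CODE]

Computing y = A^T·x from this same layout would need an atomic XOR scattered into y for every nonzero, which performs poorly; hence the second, transposed CSR copy.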

It would be possible to create a version that stores the matrices in system memory and loads the next matrix block into GPU memory while calculating the product with the current block. The block size is adjustable, but I don't know how performant that would be.
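
The overlap could look something like the following double-buffering sketch (illustrative only; the names and the launch_spmv callback are placeholders, not msieve code): while the SpMV runs on the current block, the next block is copied from pinned host memory on a second stream.

[CODE]#include <cuda_runtime.h>

// Multiply by nblocks matrix blocks held in pinned host memory, copying
// block i+1 to the GPU while block i is being used.
void spmv_streamed(int nblocks, size_t block_bytes, char **host_blocks,
                   void (*launch_spmv)(const char *dev_block, cudaStream_t s))
{
    char *dev_buf[2];
    cudaStream_t copy_s, exec_s;
    cudaMalloc((void **)&dev_buf[0], block_bytes);
    cudaMalloc((void **)&dev_buf[1], block_bytes);
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&exec_s);

    cudaMemcpyAsync(dev_buf[0], host_blocks[0], block_bytes,
                    cudaMemcpyHostToDevice, copy_s);      // prefetch block 0
    for (int i = 0; i < nblocks; i++) {
        cudaStreamSynchronize(copy_s);                    // block i is resident
        if (i + 1 < nblocks)                              // start copying block i+1
            cudaMemcpyAsync(dev_buf[(i + 1) & 1], host_blocks[i + 1],
                            block_bytes, cudaMemcpyHostToDevice, copy_s);
        launch_spmv(dev_buf[i & 1], exec_s);              // overlaps with the copy
        cudaStreamSynchronize(exec_s);
    }
    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(exec_s);
    cudaFree(dev_buf[0]);
    cudaFree(dev_buf[1]);
}[/CODE]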

Xyzzy 2021-08-03 14:45

How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.

[CODE]ECC Errors
NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit
errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile
and aggregate). Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see
the ECC documents on the web for information on compute application behavior when double bit errors occur. Volatile error counters track the number of errors detected since the
last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.[/CODE]:mike:

frmky 2021-08-04 00:32

[QUOTE=frmky;584710]It would be possible to create a version that stores the matrices in system memory[/QUOTE]
I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE]
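
As a sanity check of the vector annotation: 7 x 42101088 columns x 512/8 bytes = 18,861,287,424 bytes ≈ 17987.6 MB, exactly the figure in the log.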

frmky 2021-08-04 02:42

[QUOTE=Xyzzy;584714]How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.
[/QUOTE]
What's your risk tolerance? msieve has robust error detection so it's not as important. But it's usually a small price to ensure no memory faults.

mathwiz 2021-08-04 02:47

Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?

VBCurtis 2021-08-04 03:41

[QUOTE=frmky;584751]I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE][/QUOTE]

This is simply amazing! I'm running a matrix that size for GNFS-201 (from f-small) right now, at ~700 hr on a 12-core single-socket Haswell.
I hope this means you'll be digging out of your matrix backlog from the big siever queue. :tu:

frmky 2021-08-04 04:51

[QUOTE=mathwiz;584761]Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?[/QUOTE]

It's very much a work-in-progress and things may change or occasionally be broken, but you can play with it. I have it in GitHub. I recommend using CUDA 10.2 because CUDA 11.x incorporates CUB into the toolkit and tries to force you to use it, but it's missing a few pieces. That complicates things. You can get the source with

git clone [url]https://github.com/gchilders/msieve_nfsathome.git[/url] -b msieve-lacuda-nfsathome
cd msieve_nfsathome
make all VBITS=128 CUDA=XX

where XX is the two-digit CUDA compute capability of your GPU. Specifying CUDA=1 defaults to a compute capability of 60. You may want to experiment with both VBITS=128 and VBITS=256 to see which is best on your GPU.
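
If you're not sure of your card's compute capability, a few lines of CUDA will report it (a stand-alone helper, not part of msieve):

[CODE]// Query device 0 and print the CUDA=XX value to pass to make.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d -> use CUDA=%d%d\n",
           prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}[/CODE]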

If you want to copy msieve to another directory, you need the msieve binary, both *.ptx files, and in the cub directory both *.so files. Or just run it from the build directory.

Xyzzy 2021-08-04 12:22

We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (?) used on the GPU. (25M is the default.)
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.

:mike:

Xyzzy 2021-08-04 12:29

If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:

[URL]https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/[/URL]

Then you will need these packages installed:

[C]gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1[/C]

And possibly:

[C]gmp-devel
zlib-devel[/C]

You also have to manually adjust your path variable in [C]~/.bashrc[/C]:

[C]export PATH="/usr/local/cuda-10.2/bin:$PATH"[/C]

:mike:

Xyzzy 2021-08-04 19:32

[QUOTE=Xyzzy;584797]We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.

[C]025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H[/C]

The first column is the block size (?) used on the GPU. (25M is the default.)
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.[/QUOTE]
Here are more benchmarks on the same data:[CODE]VBITS = 64; BLOCKS = 25M; MEM = 37.7GB; TIME = 58.8HR
VBITS = 64; BLOCKS = 100M; MEM = 23.8GB; TIME = 66.5HR
VBITS = 64; BLOCKS = 500M; MEM = 20.0GB; TIME = 98.9HR
VBITS = 64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR

VBITS = 128; BLOCKS = 25M; MEM = 37.4GB; TIME = 49.5HR
VBITS = 128; BLOCKS = 100M; MEM = 24.2GB; TIME = 50.3HR
VBITS = 128; BLOCKS = 500M; MEM = 20.7GB; TIME = 58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME = 61.2HR

VBITS = 256; BLOCKS = 25M; MEM = 39.1GB; TIME = 47.4HR
VBITS = 256; BLOCKS = 100M; MEM = 26.5GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 500M; MEM = 23.2GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME = 37.5HR

VBITS = 512; BLOCKS = 25M; MEM = 44.1GB; TIME = 57.1HR
VBITS = 512; BLOCKS = 100M; MEM = 32.2GB; TIME = 43.5HR
VBITS = 512; BLOCKS = 500M; MEM = 28.9GB; TIME = 41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME = 40.9HR[/CODE]37.2 hours!

:mike:

frmky 2021-08-05 01:04

That's great! The older V100 definitely doesn't like the VBITS=256 blocks=100M or 500M settings. It doubles the runtime. Anyone using this really needs to test different settings on their card.

ryanp 2021-08-06 14:05

Trying this out on an NVIDIA A100. Compiled and starts to run. I'm invoking with:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2[/CODE]

[CODE]Fri Aug 6 12:43:59 2021 commencing linear algebra
Fri Aug 6 12:43:59 2021 using VBITS=256
Fri Aug 6 12:44:04 2021 read 36267445 cycles
Fri Aug 6 12:45:36 2021 cycles contain 123033526 unique relations
Fri Aug 6 12:58:53 2021 read 123033526 relations
Fri Aug 6 13:02:37 2021 using 20 quadratic characters above 4294917295
Fri Aug 6 13:14:54 2021 building initial matrix
Fri Aug 6 13:45:04 2021 memory use: 16201.2 MB
Fri Aug 6 13:45:24 2021 read 36267445 cycles
Fri Aug 6 13:45:28 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:45:28 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:50:59 2021 filtering completed in 1 passes
Fri Aug 6 13:51:04 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:51:04 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:35 2021 matrix starts at (0, 0)
Fri Aug 6 13:54:40 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:54:40 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:40 2021 saving the first 240 matrix rows for later
Fri Aug 6 13:54:47 2021 matrix includes 256 packed rows
Fri Aug 6 13:55:00 2021 matrix is 36267035 x 36267445 (15850.8 MB) with weight 3758763803 (103.64/col)
Fri Aug 6 13:55:00 2021 sparse part has weight 3574908223 (98.57/col)
Fri Aug 6 13:55:01 2021 using GPU 0 (NVIDIA A100-SXM4-40GB)
Fri Aug 6 13:55:01 2021 selected card has CUDA arch 8.0[/CODE]

Then a long sequence of numbers, and:

[CODE]25000136 36267035 221384
25000059 36267035 218336
25000041 36267035 214416
25000174 36267035 211066
25000044 36267035 212574
25000047 36267035 212320
25000174 36267035 207956
25000171 36267035 202904
25000117 36267035 197448
25000171 36267035 191566
25000130 36267035 185008
25000136 36267035 178722
24898531 36267035 168358
3811898 36267445 264
22016023 36267445 48
24836805 36267445 60
27790270 36267445 75
24929949 36267445 75
22849647 36267445 75
24896299 36267445 90
22990599 36267445 90
25502972 36267445 110
23602625 36267445 110
26327686 36267445 135
23662886 36267445 135
26145282 36267445 165
23549845 36267445 165
26371744 36267445 205
23884092 36267445 205
26835429 36267445 255
24055165 36267445 255
26699873 36267445 315
23947051 36267445 315
26916570 36267445 390
24419378 36267445 390
27622355 36267445 485
error (line 373): CUDA_ERROR_OUT_OF_MEMORY[/CODE]

frmky 2021-08-06 14:12

At the end after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000 then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will save the matrix overflow in CPU memory and move it from there as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.

ryanp 2021-08-06 16:39

[QUOTE=frmky;584976]At the end after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, saving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000 then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will save the matrix overflow in CPU memory and move it from there as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.[/QUOTE]

Tried a bunch of settings:

* all of 100M, 500M, 1000M, 1750M with VBITS=256 all ran out of memory
* managed to get the matrix down to 27M with more sieving. VBITS=256, 1750M still runs out of memory.
* will try VBITS=128 next with the various settings

Is there any work planned to pick optimal (or at least functional, won't-crash) settings automatically?

frmky 2021-08-06 17:52

[QUOTE=ryanp;584987]Is there any work planned to pick optimal (or at least functional, won't-crash) settings automatically?[/QUOTE]

Optimal and functional are very different parameters. I can try automatically picking a block_nnz value that is more likely to work, but VBITS is a compile-time setting that can't be changed at runtime. Adding use_managed=1 will make it work in most cases but can significantly slow it down, so I've defaulted it to off.
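
For a rough idea of what a feasibility check could look like (purely a sketch built on the memory accounting discussed in this thread, not anything msieve does): query the free memory on the card, reserve the vector space, and see whether the column arrays for the matrix and its transpose still fit.

[CODE]#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

int main() {
    size_t free_b, total_b;
    cudaMemGetInfo(&free_b, &total_b);       // memory actually free on the card

    const uint64_t N = 36267445;             // columns (ryanp's matrix above)
    const uint64_t nnz = 3574908223ULL;      // sparse-part weight
    const int VBITS = 256;
    uint64_t vectors = 7ULL * N * VBITS / 8; // vector storage
    uint64_t matrix  = 2ULL * 4 * nnz;       // column arrays for A and A^T

    printf("free: %.1f MB, needed: at least %.1f MB\n",
           free_b / 1048576.0, (vectors + matrix) / 1048576.0);
    return 0;
}[/CODE]

For the 36.3M matrix that hit CUDA_ERROR_OUT_OF_MEMORY above, this already comes to roughly 35 GB before row offsets and SpMV working space, consistent with a 40GB A100 running out.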

ryanp 2021-08-06 18:03

Looks like it's working now with VBITS=128, and a pretty decent runtime:

[CODE]./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=1000000000
...
matrix starts at (0, 0)
matrix is 27724170 x 27724341 (13842.2 MB) with weight 3947756174 (142.39/col)
sparse part has weight 3351414840 (120.88/col)
saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 27724058 x 27724341 (12940.4 MB) with weight 3222020630 (116.22/col)
sparse part has weight 3059558876 (110.36/col)
using GPU 0 (NVIDIA A100-SXM4-40GB)
selected card has CUDA arch 8.0
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000043 27724058 8774182
1000000028 27724058 9604503
1000000099 27724058 8923418
59558706 27724058 422238
1082873143 27724341 40960
954052655 27724341 1455100
916348921 27724341 16939530
106284157 27724341 9288468
commencing Lanczos iteration
vector memory use: 2961.3 MB
dense rows memory use: 423.0 MB
sparse matrix memory use: 24188.7 MB
memory use: 27573.0 MB
Allocated 82.0 MB for SpMV library
Allocated 88.6 MB for SpMV library
linear algebra at 0.0%, ETA 20h11m
checkpointing every 1230000 dimensions
linear algebra completed 12223 of 27724341 dimensions (0.0%, ETA 20h46m)[/CODE]

frmky 2021-08-06 18:03

The 17.5M matrix for 2,1359+ took just under 15 hours on a V100.

frmky 2021-08-06 18:20

[QUOTE=ryanp;584999]Looks like it's working now with VBITS=128, and a pretty decent runtime:[/QUOTE]
A 27.7M matrix in 21 hours. The A100 is a nice card!

Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.

ryanp 2021-08-06 20:02

[QUOTE=frmky;585002]A 27.7M matrix in 21 hours. The A100 is a nice card![/QUOTE]
Indeed!

[QUOTE]Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.[/QUOTE]

No, sadly, "just" A100 and V100's.

ryanp 2021-08-06 21:20

On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, [URL="https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga"]16x A100's[/URL] with a fully interconnected fabric could probably tear through big matrices?

frmky 2021-08-06 21:41

The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1) but relies on a good MPI implementation. OpenMPI's collectives transfer the data from the card and do the reduction on the CPU. MVAPICH2-GDR, I think, keeps the reductions on the card, but SDSC doesn't have that working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.

Edit: [STRIKE]My backup plan if that doesn't work out is to use a non-CUDA-aware MPI to pass IPC handles between processes and do the reduction on the GPU myself.[/STRIKE]

Edit 2: I've got a draft version working just now that passes vectors between GPUs using MPI CUDA-aware point-to-point comms (which uses NVLink or GPUDirect when available) then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100's connected with NVLink, this reduces LA time from nearly 90 hours when passing vectors through CPU memory to [STRIKE]57[/STRIKE] 56 hours transferring directly between GPUs.

Edit 3: It's now in GitHub. Just compile with a CUDA-Aware MPI like OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1 where XX is replaced by the compute capability of your GPU.
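
In outline, the exchange looks something like this (an illustrative sketch, not the actual msieve code): device pointers are handed straight to MPI, and the received partial vectors are folded in with a GF(2) (XOR) reduction kernel on the card.

[CODE]#include <mpi.h>
#include <cuda_runtime.h>

// XOR the received partial results into this rank's vector (GF(2) addition).
__global__ void xor_reduce(unsigned long long *dst,
                           const unsigned long long *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] ^= src[i];
}

// With a CUDA-aware MPI, d_vec and d_recv are device pointers; the transfer
// goes over NVLink/GPUDirect when available.
void exchange_and_reduce(unsigned long long *d_vec, unsigned long long *d_recv,
                         int n, int peer, MPI_Comm comm)
{
    MPI_Sendrecv(d_vec,  n, MPI_UNSIGNED_LONG_LONG, peer, 0,
                 d_recv, n, MPI_UNSIGNED_LONG_LONG, peer, 0,
                 comm, MPI_STATUS_IGNORE);
    xor_reduce<<<(n + 255) / 256, 256>>>(d_vec, d_recv, n);
    cudaDeviceSynchronize();
}[/CODE]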

frmky 2021-08-07 06:44

[CODE]linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)[/CODE]
Using four V100's, I'm getting about 21 hours to solve a 42.1M matrix.

fivemack 2021-08-07 11:10

Interesting! That's about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 = £60 to solve the matrix.

I'm paying 19p/kWh here, and my Skylake machine uses about 250W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine will last five years); on another hand it's taking a month rather than a day, on a third hand that's still keeping up with my sieving resources.

frmky 2021-08-07 16:13

[CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.

ryanp 2021-08-07 16:20

[QUOTE=frmky;585109][CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.[/QUOTE]

That's quite impressive. I dug this up which I believe was your MPI run of a 109.4M matrix from a few months back?

[CODE]linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)[/CODE]

frmky 2021-08-07 16:40

Yes, that would have been on 6 Sandy Bridge nodes with 2x 10 core cpus each.

Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes.

[CODE]Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m[/CODE]

wombatman 2021-08-08 00:00

Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).

frmky 2021-08-08 00:57

[QUOTE=wombatman;585135]Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).[/QUOTE]
Yes, you could solve a matrix up to about 15M or so on the card. If you have at least 32 GB system memory, you could go a bit larger transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.

frmky 2021-08-11 22:09

1 Attachment(s)
The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100's in a total of 95.5 hours of runtime. There was a restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26GB of GPU memory on each of the four V100's.

Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second/iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used.

[PASTEBIN]TsUMyBr8[/PASTEBIN]

Xyzzy 2021-09-09 20:28

1 Attachment(s)
We are not sure if this is interesting or not.

[C]13_2_909m1 - Near-Cunningham - SNFS(274)[/C]

This is a big (33 bit?) job. The [C]msieve.dat[/C] file, uncompressed and with duplicates and bad relations removed, is 49GB.[CODE]$ ls -lh
total 105G
-rw-rw-r--. 1 m m 36G Sep 8 20:33 13_2_909m1.dat.gz
drwx------. 2 m m 50 Aug 4 12:17 cub
-r--------. 1 m m 29K Aug 4 12:16 lanczos_kernel.ptx
-r-x------. 1 m m 3.4M Aug 4 12:16 msieve
-rw-rw-r--. 1 m m 49G Sep 8 22:02 msieve.dat
-rw-rw-r--. 1 m m 4.2G Sep 9 14:17 msieve.dat.bak.chk
-rw-rw-r--. 1 m m 4.2G Sep 9 14:54 msieve.dat.chk
-rw-rw-r--. 1 m m 969M Sep 9 12:11 msieve.dat.cyc
-rw-rw-r--. 1 m m 12G Sep 9 12:11 msieve.dat.mat
-rw-rw-r--. 1 m m 415 Sep 2 19:15 msieve.fb
-rw-rw-r--. 1 m m 13K Sep 9 15:10 msieve.log
-r--------. 1 m m 108K Aug 4 12:16 stage1_core.ptx
-rw-rw-r--. 1 m m 264 Sep 2 19:15 worktodo.ini[/CODE]There are ~442M relations. Setting [C]block_nnz[/C] to 500M resulted in an OOM error, so we used 1B instead.[CODE]commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 27521024 x 27521194 (12901.7 MB) with weight 3687594306 (133.99/col)
sparse part has weight 3106904079 (112.89/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 27520784 x 27521194 (12034.4 MB) with weight 2848207923 (103.49/col)
sparse part has weight 2714419599 (98.63/col)
using GPU 0 (Quadro RTX 8000)
selected card has CUDA arch 7.5
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000013 27520784 9680444
1000000057 27520784 11295968
714419529 27520784 6544782
1039631367 27521194 100000
917599197 27521194 3552480
757189035 27521194 23868304
commencing Lanczos iteration
vector memory use: 5879.2 MB
dense rows memory use: 839.9 MB
sparse matrix memory use: 21339.3 MB
memory use: 28058.3 MB
Allocated 123.0 MB for SpMV library
Allocated 127.8 MB for SpMV library
linear algebra at 0.0%, ETA 49h57m
checkpointing every 570000 dimensions
linear algebra completed 925789 of 27521194 dimensions (3.4%, ETA 45h13m)
received signal 2; shutting down
linear algebra completed 926044 of 27521194 dimensions (3.4%, ETA 45h12m)
lanczos halted after 3628 iterations (dim = 926044)
BLanczosTime: 5932
elapsed time 01:38:53

current factorization was interrupted[/CODE]So the LA step is under 50 hours, which seems pretty fast! (We have no plans to complete it since it is assigned to VBCurtis.)

We have the raw files saved if there are other configurations worth investigating. If so, just let us know!

:mike:

VBCurtis 2021-09-09 21:00

It's a 32/33 hybrid, with a healthy amount of oversieving (I wanted a matrix below 30M dimensions, success!).

I'm impressed that it fits on your card, and 50hr is pretty amazing; I just started the matrix a few hr ago on a 10-core Ivy Bridge, and the ETA is 365 hr.

If you have the free cycles to run it, please be my guest! Those 20+ core-weeks saved are enough to ECM the next candidate.

frmky 2021-09-13 05:22

I spent time with Nsight Compute looking at the SpMV kernel. As expected for SpMV it's memory bandwidth limited, so increasing occupancy to hide latency should help. I adjusted parameters to reduce both register and shared memory use, which increased the occupancy. This yielded a runtime improvement of only about 5% on the V100 but it may differ on other cards. I also increased the default block_nnz to 1750M to reduce global memory use a bit.

frmky 2021-09-16 06:00

Today I expanded the allowed values of VBITS to any of 64, 128, 192, 256, 320, 384, 448, or 512. This works on both CPUs and GPUs, but I don't expect much, if any, speedup on CPUs. As a GPU benchmark, I tested a 42.1M matrix on two NVLink-connected V100's. Here are the results.
[CODE]
VBITS Time (hours)
64 109.5
128 63.75
192 50
256 40.25
320 40.25
384 37.75
448 40.25
512 37.25[/CODE]
Combined with the new SpMV parameters, I get the best times with VBITS of 384 and 512, but 384 uses less memory. Overall, I get about 6% better performance than with VBITS=256.

Xyzzy 2021-09-17 11:38

Our system has a single GPU. When we are doing compute work on the GPU the display lags. We can think of two ways to fix this.

1. Some sort of niceness assignment to the compute process.
2. Limiting the compute process to less than 100% of the GPU.

Is either of these approaches possible?

:mike:

Xyzzy 2021-09-17 11:41

Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?

:mike:

axn 2021-09-17 12:15

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)

frmky 2021-09-17 14:56

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
CUDA doesn't allow display updates while a kernel is running. The only way to improve responsiveness without using a second GPU is to shorten the kernel run times. The longest kernel is the SpMV, and you can shorten that by lowering block_nnz. Use the lowest value that still allows everything to fit in GPU memory.

Edit: Lowering VBITS will also reduce kernel runtimes, but don't go below 128. See the benchmark a few posts above. Also, you can't change VBITS in the middle of a run. You would need to start over from the beginning. You can change block_nnz during a restart.

chris2be8 2021-09-17 15:37

[QUOTE=axn;588028]Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)[/QUOTE]

Or use on-board graphics if the MOBO has them. That's how I run my main GPU.

VBCurtis 2021-09-17 16:52

[QUOTE=Xyzzy;588027]Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?

:mike:[/QUOTE]

In principle, yes. There's an electricity savings to be had by over-sieving less and accepting larger matrices, especially on the e-small queue where matrices are nearly all under 15M. However, one can't push this very far, as relation sets that fail to build any matrix delay jobs and require human-admin time to add Q to the job. I've been trying to pick relations targets that leave it uncertain whether a job will build a matrix at TD=120, and I advocate this for everyone on e-small now. Some of the bigger 15e jobs could yield matrices over 30M / over the memory capabilities of GPU-LA, so maybe those shouldn't change much?

Another way to view this is to aim for the number of relations one would use if one were doing the entire job on one's own equipment, and then add just a bit to reduce the chance of needing to ask for more Q from admin (like round Q up to the nearest 5M or 10M increment).

Xyzzy 2021-09-17 18:56

What is the difference in relations needed between TD=120 and TD=100? (Do we have this data?)

We think a GPU could do a TD=100 job faster than a CPU could do a TD=120 job.

Personally, we don't mind having to rerun matrix building if there aren't enough relations. We don't know if it is a drag for the admins to add additional relations, but if it isn't a big deal the project could probably run more efficiently.

There doesn't seem to be a shortage of LA power, so maybe the project could skew a bit in favor of more jobs overall with fewer relations per job? Is the bottleneck server storage space?

What percentage in CPU-hours is the sieving versus the post-processing work? Does one additional hour of post-processing "save" 1000 hours of sieving? More? Less?

[SIZE=1](We lack the technical knowledge and vocabulary to express what we are thinking. Hopefully what we wrote makes a little sense.)[/SIZE]

:mike:

VBCurtis 2021-09-17 19:44

I'm gonna hand-wave here, since only a few people have bothered taking data:

When a relation set is right at the cusp of building a matrix, a few more hours sieving will save more than a few hours to solve the matrix on that same machine (meaning CPU in both cases).

At the relation counts most e-small and 15e jobs are processed at, 20 more core-hours of sieving might save 5 or 10 core-hours of matrix work (again, both measured on a CPU). I've done a few experiments at home, and I have yet to find a job where the sieving required to build a matrix at TD=120 saved more CPU time than it cost. I believe this could/would be the case on really big jobs, say with matrices at 50M+ in size.

We have historically sieved more than needed because BOINC computation is cheap, while matrix solving time was in short supply. So, now that GPU matrix solving makes matrices not in short supply, we should sieve less. Something like 5-10% fewer relations, which means 5-10% more jobs done per calendar month.

frmky 2021-09-17 19:47

[QUOTE=Xyzzy;588059]Is the bottleneck server storage space?[/QUOTE]
No. The server is currently using 467G of 3.6T.

frmky 2021-09-18 03:58

For 2,2174L, 1355M relations yielded 734M uniques. With nearly 50% duplicates, we have clearly reached the limit for 16e. Anyway, filtering yielded
[CODE]matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)[/CODE]
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
[CODE]linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)[/CODE]
So more sieving would only save a day or so in LA. I have the cluster time, so I'll let it run.

pinhodecarlos 2021-09-18 07:02

[QUOTE=VBCurtis;588061]

We have historically sieved more than needed because BOINC computation is cheap, while matrix solving time was in short supply. So, now that GPU matrix solving makes matrices not in short supply, we should sieve less. Something like 5-10% fewer relations, which means 5-10% more jobs done per calendar month.[/QUOTE]

Totally agree with you now. And further: when someone says a number is in LA, I would recommend (I know, Greg!… lol) cancelling all queued WUs; this will also speed up sieving of the next number. Sievers are wasting a few days (in my experience) processing unnecessary work (I just manually abort such WUs so they go to someone else). Just be careful not to do this during any challenges, since it will interfere with strategic bunkering.

Xyzzy 2021-09-18 13:47

1 Attachment(s)
[QUOTE=Xyzzy;584798]If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:

[URL]https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/[/URL]

Then you will need these packages installed:

[C]gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1[/C]

And possibly:

[C]gmp-devel
zlib-devel[/C]

You also have to manually adjust your path variable in [C]~/.bashrc[/C]:

[C]export PATH="/usr/local/cuda-10.2/bin:$PATH"[/C]

:mike:[/QUOTE]Here are simpler instructions.

[CODE]sudo subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms
sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf module install nvidia-driver:latest
sudo reboot
sudo dnf install cuda-11-4
echo 'export PATH=/usr/local/cuda-11.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64/:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc[/CODE]Then just use the attached archive to set up your work.

:mike:

charybdis 2021-09-18 14:37

[QUOTE=frmky;588086]For 2,2174L, 1355M relations yielded 734M uniques. With nearly 50% duplicates, we have clearly reached the limit for 16e.[/QUOTE]

Or is this just the limit for 16e with 33-bit large primes? I know you've avoided going higher because of the difficulty of the LA and the msieve filtering bug, but now that the bug is fixed and GPUs make the LA much easier, might it be worth going up to 34-bit?

frmky 2021-09-18 15:59

[QUOTE=charybdis;588104]Or is this just the limit for 16e with 33-bit large primes?[/QUOTE]
Does the lasieve5 code work correctly with 34-bit large primes? I know the check is commented out, but I haven't tested it.

charybdis 2021-09-19 00:30

I tested the binary from [URL="https://www.mersenneforum.org/showpost.php?p=470249&postcount=10"]here[/URL] on 2,2174L with 34-bit large primes and it seemed to work fine. Yield was more than double that at 33-bit so definitely looks worth it, as one would expect. There were no issues with setting mfba=99 either.

henryzz 2021-09-19 08:02

I looked through the code a few years ago and found no issues. Lasieve4 is also fine, although it is limited to 96-bit mfba/r.

wreck 2021-09-23 11:46

I gave it a try receiving NFS@Home WUs and found an lpbr/lpba 34 assignment for 2,2174M.
Here are the contents of the polynomial file S2M2174b.poly:

[CODE]
n: 470349924831928271476705309712184283829671891500377511256458133476241008159328553358384317181001385841345904968378352588310952651779460262173005355061503024245423661736289481941107679294474063050602745740433565487767078338816787736757703231764661986524341166060777900926495463269979500293362217153953866146837
skew: 1.22341
c6: 2
c5: 0
c4: 0
c3: 2
c2: 0
c1: 0
c0: 1
Y1: 1
Y0: -3064991081731777716716694054300618367237478244367204352
type: snfs
rlim: 250000000
alim: 250000000
lpbr: 34
lpba: 34
mfbr: 99
mfba: 69
rlambda: 3.6
alambda: 2.6
[/CODE]

When q is near 784M, the memory used is 743MB.

charybdis 2021-09-23 12:46

[QUOTE=wreck;588461][CODE]
lpbr: 34
lpba: 34
mfbr: 99
mfba: 69
[/CODE][/QUOTE]

@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).

frmky 2021-09-24 06:17

[QUOTE=charybdis;588469]@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).[/QUOTE]
I'll try that, thanks!

frmky 2021-09-24 06:21

[QUOTE=frmky;588086]filtering yielded
[CODE]matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)[/CODE]
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
[CODE]linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)[/CODE]
[/QUOTE]
And it's done. LA on the 102M matrix with restarts took 5 days 14 hours.
[PASTEBIN]cB1qD1hJ[/PASTEBIN]

charybdis 2021-09-24 12:36

[QUOTE=frmky;588534]I'll try that, thanks![/QUOTE]

Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?

pinhodecarlos 2021-09-24 13:40

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

95%.

frmky 2021-09-24 15:13

[QUOTE=charybdis;588567]Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?[/QUOTE]

A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

charybdis 2021-09-24 15:50

If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA that bit harder as a result.

But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this.

ryanp 2021-10-22 13:34

In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

frmky 2021-10-22 23:00

Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.

Volta (and I'm hoping Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default 1.75 billion. The actual limit is that the number of nonzeros in a cub SpMV call is stored in an int32 so each matrix block must have less than 2^31 nonzeros. block_nnz sets an estimate, especially for the transpose matrix, so I've been a bit conservative setting it at 1.75B. We want to keep the number of blocks reasonably small since each block for both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.

For VBITS, a global memory fetch on current nVidia GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on A100). With VBITS=128, we are only using 16 bytes of that data with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus more efficiently uses global memory bandwidth in the SpMV. However, each iteration also has multiple VBITSxN • NxVBITS dense matrix multiplications which require strided access to arrays. This strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory. In practice on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with decreasing values. Of the two I use 384 since it requires less GPU memory. However, lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare using VBITS=256 with cudaLimitMaxL2FetchGranularity of 32 to VBITS=384 or 512 with the default.

So, in short, unless GPU memory is tight use VBITS=384 and the default block_nnz on V100 and likely on A100 as well.
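
As a concrete check, plugging the byte counts above into the 27.7M A100 run from earlier in the thread (VBITS=128, block_nnz=1e9, sparse weight 3059558876, split into 4 blocks each way) reproduces the logged figures. This is just the arithmetic, not msieve code:

[CODE]#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t N = 27724341;           // matrix columns
    const uint64_t nnz = 3059558876ULL;    // sparse-part weight
    const uint64_t block_nnz = 1000000000ULL;
    const int VBITS = 128;

    uint64_t nblocks = (nnz + block_nnz - 1) / block_nnz;  // blocks per direction
    double vec  = 7.0 * N * VBITS / 8 / 1048576;           // vectors, MB
    double cols = 2.0 * 4 * nnz / 1048576;                 // column arrays, A and A^T
    double rows = 2.0 * nblocks * 4 * (N + 1) / 1048576;   // row-offset arrays

    printf("vector memory use:        %.1f MB\n", vec);          // log: 2961.3 MB
    printf("sparse matrix memory use: %.1f MB\n", cols + rows);  // log: 24188.7 MB
    return 0;
}[/CODE]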

frmky 2021-10-26 04:05

2,2174M is in LA, so here's one more data point. Running on [B]eight[/B] NVLink-connected V100's,
[CODE]Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021 sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021 commencing linear algebra
Sun Oct 24 23:03:59 2021 using VBITS=384
Sun Oct 24 23:03:59 2021 skipping matrix build
Sun Oct 24 23:03:59 2021 initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021 matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021 matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021 sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021 saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021 matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021 matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021 sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021 using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021 selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021 commencing Lanczos iteration
Sun Oct 24 23:12:47 2021 memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021 linear algebra at 0.0%, ETA 90h17m
[/CODE]
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.

pinhodecarlos 2021-10-26 06:21

And I suppose you will be comparing with the other sieve run with higher LPs; probably there are still some leftovers.

frmky 2021-10-26 07:48

[QUOTE=pinhodecarlos;591648]And I suppose you will be comparing the other sieve run with higher LP’s, probably still some left overs.[/QUOTE]
We didn't sieve it twice. Only a little at the beginning was sieved with 33 bit LPs and all the relations were combined. There are a few stragglers that I'm not worrying about.

charybdis 2021-10-26 09:35

What was your general impression of 34-bit vs 33-bit? Will the extra bit allow slightly larger jobs to be run as I'd hoped?

VBCurtis 2021-10-26 15:47

[QUOTE=frmky;591636]2,2174M is in LA, so here's one more data point. Running on [B]eight[/B] NVLink-connected V100's,
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.[/QUOTE]

How many relations did you collect? Was the unique ratio better than 2,2174L's? The matrices came out pretty similar in size, so a comparison of relations counts (raw and unique) gives a nice 33 vs 34 data point.

frmky 2021-10-26 18:06

For 2,2174L we sieved from 20M - 6B, and collected 1.36B relations. This gave 734M uniques, so about 46% duplicates.

For 2,2174M we sieved from 20M - 4B, and collected 2.19B relations. This gave 1.29B uniques, so about 41% duplicates. However, we sieved a considerably narrower range of q, and it was overall much faster.

LaurV 2021-10-27 03:14

[offtopic] I changed the thread title. The old one made me [URL="https://en.wikipedia.org/wiki/Romanian_profanity"]nostalgic[/URL] every time someone posted in it... The new title is easier to search too, as the thread contains a lot of useful info... [/offtopic]

frmky 2021-10-31 18:58

[QUOTE=frmky;591636]2,2174M is in LA, so here's one more data point.[/QUOTE]
It's done.
[PASTEBIN]5RGLguge[/PASTEBIN]

EdH 2022-02-20 15:14

I'm contemplating playing with Colab to see if it could be used with smaller matrices. But I wonder if there is really any worth.

If I do everything but LA locally and only upload the necessary files for the matrix work, I'm still looking at a pretty large relations file for anything of value. But, I'm currently looking at more than a day of local CPU LA for ~c170 candidates. If I could knock that down to a few hours, maybe it would be "fun" to try.

The assigned GPUs vary widely as well. My last two experiments (sessions with GPU ECM) yielded a P100 and a K80. I do normally get some longer session times, but it's not guaranteed. Also, I may have only been getting half the card. (I'm still confused on shader/core/SM/etc.)

If my source is correct the K80 is only CUDA 3.7. Is this current enough to work?

Would d/ling the checkpoint file at regular intervals be enough to be able to restart a timed out session later?

What else would I need to consider?

Sorry for the questions. Thanks for any help.

An extra question: Since the K80 is only CUDA 3.7 architecture, would it even be worth obtaining one? It seems the current minimum is at 3.5 and I'd hate to have another obsolete card right after getting one.

frmky 2022-02-21 02:38

Yes, it will work on a K80. My updated version requires CC 3.5 or greater.

You don't need to transfer the large relations file. Do this:
1. Complete the filtering and build the matrix locally. You can stop it manually once you see "commencing Lanczos iteration".
2. Transfer the ini, fb, and mat files (and mat.idx if using multiple GPUs with MPI, not covered here) to the GPU node.
3. On the GPU node, start the LA with options like ./msieve -nc2 skip_matbuild=1 -g 0 -v
4. You can interrupt it and restart it with "-ncr -g 0".
5. Once it's complete, transfer the dep file to the local node and run sqrt with -nc3 as usual.

The local and GPU msieve binaries can be compiled with different values for VBITS since the LA is run entirely using the GPU binary. And yes, you just need the chk file in addition to the other files above to restart.

A K80 is a dual GPU card, so without using MPI you will only be using half the card. And each half is only a little bit faster than a K20. It will be slower than a P100 as you would expect.

EdH 2022-02-21 03:29

Thanks frmky! This helps a bunch. I will pursue the Colab session. [strike]I also have a 3.5 card to play with, but it only has 2GB. Not sure if that's enough to even get a small matrix into.[/strike]

I'm off to study. . .

EdH 2022-02-22 01:05

I'm saddened to report that even had I been successful with my Colab experiments, it would still be impractical.

I was able to compile Msieve for two different GPUs, a K80 (3.7) and a T4 (7.5). However, Msieve refused to understand the options although I tried all the variations I could think of in both Python and BASH scripts, with and without single/double quotes around various portions, and in a variety of orders. In all cases, Msieve simply displayed all the available options.

In any case, the impracticality is that for a c160, the msieve.dat.mat file is just short of 2GB. The two tested methods of getting the file loaded into the Colab sessions were via SSH and via Google Drive. SSH took just under two hours. Uploading the file to Google Drive took just under two hours. The first method held the session open without using the GPU for anything, for which Colab complained, while the second allowed the session to start rather quickly (after the two hour upload to Google Drive). But, since a c160 created a 2GB file, I'm expecting larger matrices will just take a much longer time to load into a Colab Session.

I may try again later to get Msieve to process the test case, since at this point I have the needed files in Google Drive, but the practicality is in doubt.

Thank you for the assistance. I will surely put this to use when I finally acquire a usable CUDA GPU. (I'm even eying some K20s ATM.)

EdH 2022-02-22 23:40

[QUOTE=EdH;600475]. . .
I may try again later to get Msieve to process the test case, since at this point I have the needed files in Google Drive, but the practicality is in doubt.

Thank you for the assistance. I will surely put this to use when I finally acquire a usable CUDA GPU. (I'm even eying some K20s ATM.)[/QUOTE]I'm going to claim success!

I got a Colab session to run Msieve LA on a Tesla T4! I didn't let it complete, but the log claims:[code]
Tue Feb 22 22:48:53 2022 linear algebra at 0.0%, ETA 3h44m [/code]The best time I could get for a 40 threaded Xeon was about twice that long.

I was able to compress the .mat file to almost half the size, but it still takes an hour to upload it to Google Drive and a little bit of time to decompress it. (Others may be able to upload a lot faster.)

The actual details are much more complicated than my other sessions, so I need to work quite a bit on them before I can publish them. As to the earlier comments of practicality, I will have to study this further for my use. On one hand, it takes a lot of manual intervention and timely success is not guaranteed. On the other hand, all of this work being done by Colab is letting the local machines perform other work. Perhaps the value can be realized for larger jobs.

I don't seem to be getting the screen output I expected from the [C]-v[/C] option.

Is there a way to redirect the checkpoint file? I couldn't find an option that I thought existed.

Thanks again for all the help.

EdH 2022-02-24 01:24

Sorry if you're tired of these reports, but here's another:

I have a full-fledged Colab session that works through completion of LA. I let a c157 that I had recently run on my 20c/40t Xeon finish today. The times were nearly identical:[code]Xeon 04:17:41 elapsed time
Colab 04:19:08 elapsed time[/code]I hope to do the same test with a different GPU, to compare.

frmky 2022-02-24 06:17

[QUOTE=EdH;600624]Sorry if you're tired of these reports[/QUOTE]
Not at all! I look forward to seeing how you have gotten it to work in Colab!

EdH 2022-02-26 00:50

[QUOTE=EdH;600624]. . .
I hope to do the same test with a different GPU, to compare.[/QUOTE]Well, I spent quite a bit of time today with a T4, but I didn't let it finish, because I was (unsuccessfully) trying to get a file copy for the checkpoint file to work right, so it could be saved past a session end. However, the T4 did consistently give estimates of 2:33 for completion. This was the same matrix that took 4:19 to finish on the K80.

EdH 2022-02-28 03:39

Disappointing update. Although Colab successfully completed LA on the test set, the returned msieve.dat.dep file is corrupt according to Msieve on the local machine. :sad:

EdH 2022-03-02 23:58

I have not been playing with Colab for the last few days, due to trying to get a Tesla K20Xm working locally. I had it working with GMP-ECM, but couldn't get frmky's Msieve to run. I battled with all kinds of CUDA versions (9/10.2/11.x, etc.). All resisted, including the standalone CUDA 10.2 .run runfile. For some time, I lost GMP-ECM, too.

But I'm happy to mention I finally have all three (GMP-ECM, Msieve and frmky's Msieve) running. I'm using CUDA 11.4 and Nvidia driver 470.103.71, and I had to install a shared object file from CUDA 9 (that may have been for GMP-ECM, for which I also had to disable some code in the Makefile). In any case, they are all running on the K20Xm!

As to performance, the limited testing seems to show nearly a halving of the time taken on my 24 thread machine, but the 40 thread machines still have an edge on the K20Xm. But, in effect, it represents an extra machine, since it can free up the others.

The good part is that now that I have this local card running, I can get back to my Colab session work and have a local card to compare and help figure things out.

Thank you to everyone for all the help in this and other threads!

EdH 2022-03-05 22:58

The Colab "How I. . ." is complete. I have tested it directly from the thread and it worked as designed. The latest session was assigned a K80, which was detected correctly and its Compute Capability used during the compilation of Msieve.

It can be reviewed at:

[URL="https://www.mersenneforum.org/showthread.php?t=27634"]How I Use a Colab GPU to Perform Msieve Linear Algebra (-nc2)[/URL]

Thanks everyone for all the help!

EdH 2022-03-25 14:17

I've hit a snag playing with my GPU and wonder why:

Machine is Core2 Duo with 8GB RAM and GPU is K20Xm with 6GB RAM.
Composite is 170 digits and the matrix was built on a separate machine, with msieve.dat.mat, msieve.fb and worktodo.ini supplied from the original alternate named files.

I tried this twice. Here is the terminal display for the last try:[code]$ ./msieve -nc2 skip_matbuild=1 -g 0 -v

Msieve v. 1.54 (SVN Unversioned directory)
Fri Mar 25 09:45:54 2022
random seeds: 6dc60c6a 05868252
factoring 10559103707847604096214709430530773995264391543587654452108598611359547436885517060868607845904851346765842831319837349071427368916165620453753530586945871555707605156809 (170 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (170-digit input)
R0: -513476789674487020805844014359613
R1: 4613148128511433126577
A0: -638650427125602136382789058618425254350
A1: 413978338424926800646481002860017
A2: 268129428386547641102884323
A3: -15312382615381572243
A4: -8137373995372
A5: 295890
skew 1.00, size 1.799e-16, alpha -5.336, combined = 2.653e-15 rroots = 5

commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 11681047 x 11681223 (3520.8 MB) with weight 1098647874 (94.05/col)
sparse part has weight 794456977 (68.01/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 11680807 x 11681223 (3303.4 MB) with weight 723296676 (61.92/col)
sparse part has weight 679062899 (58.13/col)
using GPU 0 (Tesla K20Xm)
selected card has CUDA arch 3.5
Nonzeros per block: 1750000000
converting matrix to CSR and copying it onto the GPU
Killed[/code]And, here is the log:[code]Fri Mar 25 09:45:54 2022 Msieve v. 1.54 (SVN Unversioned directory)
Fri Mar 25 09:45:54 2022 random seeds: 6dc60c6a 05868252
Fri Mar 25 09:45:54 2022 factoring 10559103707847604096214709430530773995264391543587654452108598611359547436885517060868607845904851346765842831319837349071427368916165620453753530586945871555707605156809 (170 digits)
Fri Mar 25 09:45:55 2022 no P-1/P+1/ECM available, skipping
Fri Mar 25 09:45:55 2022 commencing number field sieve (170-digit input)
Fri Mar 25 09:45:55 2022 R0: -513476789674487020805844014359613
Fri Mar 25 09:45:55 2022 R1: 4613148128511433126577
Fri Mar 25 09:45:55 2022 A0: -638650427125602136382789058618425254350
Fri Mar 25 09:45:55 2022 A1: 413978338424926800646481002860017
Fri Mar 25 09:45:55 2022 A2: 268129428386547641102884323
Fri Mar 25 09:45:55 2022 A3: -15312382615381572243
Fri Mar 25 09:45:55 2022 A4: -8137373995372
Fri Mar 25 09:45:55 2022 A5: 295890
Fri Mar 25 09:45:55 2022 skew 1.00, size 1.799e-16, alpha -5.336, combined = 2.653e-15 rroots = 5
Fri Mar 25 09:45:55 2022
Fri Mar 25 09:45:55 2022 commencing linear algebra
Fri Mar 25 09:45:55 2022 using VBITS=256
Fri Mar 25 09:45:55 2022 skipping matrix build
Fri Mar 25 09:46:24 2022 matrix starts at (0, 0)
Fri Mar 25 09:46:26 2022 matrix is 11681047 x 11681223 (3520.8 MB) with weight 1098647874 (94.05/col)
Fri Mar 25 09:46:26 2022 sparse part has weight 794456977 (68.01/col)
Fri Mar 25 09:46:26 2022 saving the first 240 matrix rows for later
Fri Mar 25 09:46:30 2022 matrix includes 256 packed rows
Fri Mar 25 09:46:35 2022 matrix is 11680807 x 11681223 (3303.4 MB) with weight 723296676 (61.92/col)
Fri Mar 25 09:46:35 2022 sparse part has weight 679062899 (58.13/col)
Fri Mar 25 09:46:35 2022 using GPU 0 (Tesla K20Xm)
Fri Mar 25 09:46:35 2022 selected card has CUDA arch 3.5[/code]Is it possible the CSR conversion is overrunning memory?

frmky 2022-03-25 23:46

That looks like the Linux OOM killer, which would mean it has run out of available system (not GPU) memory.

EdH 2022-03-26 00:18

[QUOTE=frmky;602572]That looks like the Linux OOM killer, which would mean it has run out of available system (not GPU) memory.[/QUOTE]Thanks! I wondered, since it seemed the Msieve-reported matrix size was similar to the nvidia-smi-reported size, but that isn't the case with the run I just checked. Msieve says 545 MB and nvidia-smi says 1491MiB.

I'll play a bit more with some sizes in between and see what may be the limit.

Do you think a large swap file would be of any use?

chris2be8 2022-03-26 16:45

[QUOTE=EdH;602573]
Do you think a large swap file would be of any use?[/QUOTE]

Yes. I'd add 16-32GB of swap space, which should stop the OOM killer from killing jobs that ask for lots of memory.

But the system could start page thrashing if they try to heavily use more memory than you have RAM. SSDs are faster than spinning disks but more prone to wearing out if heavily used.

Adding more RAM would be the best option, if the system can take it. But that costs money unless you have some spare RAM to install.

EdH 2022-03-27 12:31

Well, more study seems to say I might not be able to get there with a 32G swap,* although I might see what happens. I tried a matrix that was built with t_d=70 for a c158 to compare times with a 40 thread machine and I got a little more info. Here's what top says about Msieve:[code] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21349 math55 20 0 33.7g 7.3g 47128 D 3.7 93.3 1:21.89 msieve[/code]The machine only has 8G and it would be very expensive to take it to its max at 16G, which doesn't look sufficient, either.

Here's what Msieve had to say:[code]commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 7793237 x 7793427 (2367.4 MB) with weight 742866434 (95.32/col)
sparse part has weight 534863189 (68.63/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 7792997 x 7793427 (2195.0 MB) with weight 483246339 (62.01/col)
sparse part has weight 450716902 (57.83/col)
using GPU 0 (Tesla K20Xm)
selected card has CUDA arch 3.5
Nonzeros per block: 1750000000
converting matrix to CSR and copying it onto the GPU
450716902 7792997 7793427
450716902 7793427 7792997
commencing Lanczos iteration
vector memory use: 1664.9 MB
dense rows memory use: 237.8 MB
sparse matrix memory use: 3498.2 MB
memory use: 5400.9 MB
error (spmv_engine.cu:78): out of memory[/code]This looks to me like the card ran out, too. The K20Xm has 6G (displayed as 5700MiB by nvidia-smi).

* The machine currently has an 8G swap partition and I have a 32G microSD handy that I might try to add to the system, to both test the concept of using such a card as swap and to add the swap space if it works.

chris2be8 2022-03-27 15:59

[QUOTE=EdH;602678]
The machine only has 8G and it would be very expensive to take it to its max at 16G, which doesn't look sufficient, either.
[/QUOTE]

The OOM killer should put messages into syslog so check syslog and dmesg output before buying memory or spending a lot of time checking other things. I should have said to do that first in my previous post, sorry.

8GB should be enough to solve the matrix on the CPU. I've done a GNFS c178 in 16GB (the system has 32GB swap space as well but wasn't obviously paging).

EdH 2022-03-27 16:19

[QUOTE=chris2be8;602703]The OOM killer should put messages into syslog so check syslog and dmesg output before buying memory or spending a lot of time checking other things. I should have said to do that first in my previous post, sorry.

8Gb should be enough to solve the matrix on the CPU. I've done a GNFS c178 in 16Gb (the system has 32GB swap space as well but wasn't obviously paging).[/QUOTE]Thanks! I'll play more later, but for now, I'm running a c145 that nvidia-smi reports as using 1802MiB on the GPU. It is happily stomping the 40 thread CPU machine. The GPU machine started after copying the files from the CPU machine, and it is ahead with ETA 38m vs. ETA 1h 0m.

I did get the microSD card added as swap, but I needed help from kruoli in the linux sub-forum. [C]top[/C] now shows nearly 40G for swap space, but it is all totally free for this c145.

frmky 2022-03-27 17:36

[QUOTE=EdH;602678][code]commencing linear algebra
using VBITS=256
...
error (spmv_engine.cu:78): out of memory[/code][/QUOTE]
You almost had enough. It ran out while trying to allocate working memory for the spmv library. Recompile with VBITS=128 and it should fit, even if it's not optimal. (Don't forget to copy the .ptx and .so files.)

EdH 2022-03-27 21:13

[QUOTE=frmky;602717]You almost had enough. It ran out while trying to allocate working memory for the spmv library. Recompile with VBITS=128 and it should fit, even if it's not optimal. (Don't forget to copy the .ptx and .so files.)[/QUOTE]Thanks! That did the trick! nvidia-smi is reporting 4999MiB / 5700 MiB and Msieve is using 2.7g of 8G. The ETA is just over 11 hours, where as the CPU took 12:30 with 32 threads. I had forgotten to edit Msieve for 40 threads on this machine.

I still need to do some more testing and find out where the crossover is, but all this is encouraging.

EdH 2022-04-08 13:11

Sorry if these questions are annoying:

I've been playing with my K20Xm card for a little bit now and, of course, it isn't "good enough." I can get more of them at reasonable prices, but why, if they aren't? And most of my card-capable machines don't have an extra fan connector, which would be needed for a K20Xm.

Compared with a GTX 980, the memory is the same, so I still wouldn't be able to run larger matrices. Is the matrix size increase proportional in a manner I could estimate? E.g., do 5 more digits double the matrix? If a GTX 1080 Ti with 11GB would only give me 5 more digits, I couldn't consider it worth the cost.

Is there a similar estimation for target_density? I currently use t_d 70 so the CADO-NFS clients can move to a subsequent server sooner, but I haven't empirically determined if that is best.

I'm not sure if this might be a typo, but while the 980 shows a much better performance overall, the FP64 (double) performance only shows 189.4 GFLOPS (1:32), while for the K20Xm, it is shown as 1,312 GFLOPS (1:3). Would that be of significance in LA solving?

It's been mentioned that the K80 consists of two cards that are each a little better than the K20Xm. How much larger matrices might I be able to run with MPI across both sections of a 24GB card?

EdH 2022-04-09 12:42

Any familiarity with the Tesla M40 24GB for Msieve LA? That would be about 4x memory for 2x cost over the K20Xm.

frmky 2022-04-09 19:45

I sieve enough to use target_density of at least 100-110, as it brings down the matrix size. An 11 GB card can likely handle matrices with about 10M rows (GNFS-175ish), whereas a 24GB card would take you up to around 20M rows (GNFS-184ish). With enough system memory, the newer Tesla M40 would let you go a bit higher, with a significant performance penalty, by storing the matrix in system memory and transferring it onto the GPU as needed.

GPU LA is entirely integer code and doesn't depend on the FP64 performance. It's written using 64-bit integer operations, but even on the latest GPUs those are implemented with 32-bit integer instructions.

You lose some speed and memory efficiency splitting the matrix across two GPUs in a K80, but you should still be able to handle 9M rows or so (GNFS-174ish).

EdH 2022-04-09 20:29

Thanks! That helps me a bunch. My new interest is the M40 24GB now. But, I'm not quite ready, because the machines I'd like to use don't have any extra fan connectors. I'm considering the idea of a fan powered another way - possibly from an older PATA power cable.

EdH 2022-05-01 02:08

I'm hoping to set up a machine to primarily do GPU LA with an M40 24GB card.

- Will a Core2 Duo 3.16GHz be better (or much worse) than a slower Quad core?
- - When running the GPU, is LA doing anything with more than one CPU core? I only see one core in use via [C]top[/C].

- Will 8GB of machine RAM be insufficient to feed the 24GB card?
- - If insufficient, would a large swap file, via MicroSD 32GB ease the memory limit?

frmky 2022-05-01 17:33

GPU LA uses only a single CPU core to do a very small part of each iteration. Likewise, filtering and traditional sqrt use only a single core. The Core2 Duo should be fine.

With a 24GB card, you should be able to solve up to around 20Mx20M matrices, which would be about 10GB in size. While transferring the matrix to the card, you need to store the entire matrix in COO plus a portion of it in CSR format. 8 GB would not be enough. 16 GB plus a swap file should be enough, but leave room for expansion later if needed.
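
(Rough arithmetic, assuming 4-byte row and column indices: a matrix of that size runs to over a billion nonzeros, so the COO copy alone is on the order of 10 GB of host memory before the CSR portion is counted.)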

