Msieve GPU Linear Algebra

Xyzzy 2021-09-09 20:28

We are not sure if this is interesting or not.

[C]13_2_909m1 - Near-Cunningham - SNFS(274)[/C]

This is a big (33 bit?) job. The [C]msieve.dat[/C] file, uncompressed and with duplicates and bad relations removed, is 49GB.[CODE]$ ls -lh
total 105G
-rw-rw-r--. 1 m m 36G Sep 8 20:33 13_2_909m1.dat.gz
drwx------. 2 m m 50 Aug 4 12:17 cub
-r--------. 1 m m 29K Aug 4 12:16 lanczos_kernel.ptx
-r-x------. 1 m m 3.4M Aug 4 12:16 msieve
-rw-rw-r--. 1 m m 49G Sep 8 22:02 msieve.dat
-rw-rw-r--. 1 m m 4.2G Sep 9 14:17 msieve.dat.bak.chk
-rw-rw-r--. 1 m m 4.2G Sep 9 14:54 msieve.dat.chk
-rw-rw-r--. 1 m m 969M Sep 9 12:11 msieve.dat.cyc
-rw-rw-r--. 1 m m 12G Sep 9 12:11 msieve.dat.mat
-rw-rw-r--. 1 m m 415 Sep 2 19:15 msieve.fb
-rw-rw-r--. 1 m m 13K Sep 9 15:10 msieve.log
-r--------. 1 m m 108K Aug 4 12:16 stage1_core.ptx
-rw-rw-r--. 1 m m 264 Sep 2 19:15 worktodo.ini[/CODE]There are ~442M relations. Setting [C]block_nnz[/C] to 500M resulted in an OOM error, so we used 1B instead.[CODE]commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 27521024 x 27521194 (12901.7 MB) with weight 3687594306 (133.99/col)
sparse part has weight 3106904079 (112.89/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 27520784 x 27521194 (12034.4 MB) with weight 2848207923 (103.49/col)
sparse part has weight 2714419599 (98.63/col)
using GPU 0 (Quadro RTX 8000)
selected card has CUDA arch 7.5
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000013 27520784 9680444
1000000057 27520784 11295968
714419529 27520784 6544782
1039631367 27521194 100000
917599197 27521194 3552480
757189035 27521194 23868304
commencing Lanczos iteration
vector memory use: 5879.2 MB
dense rows memory use: 839.9 MB
sparse matrix memory use: 21339.3 MB
memory use: 28058.3 MB
Allocated 123.0 MB for SpMV library
Allocated 127.8 MB for SpMV library
linear algebra at 0.0%, ETA 49h57m7521194 dimensions (0.0%, ETA 49h57m)
checkpointing every 570000 dimensions
linear algebra completed 925789 of 27521194 dimensions (3.4%, ETA 45h13m)
received signal 2; shutting down
linear algebra completed 926044 of 27521194 dimensions (3.4%, ETA 45h12m)
lanczos halted after 3628 iterations (dim = 926044)
BLanczosTime: 5932
elapsed time 01:38:53

current factorization was interrupted[/CODE]So the LA step is under 50 hours, which seems pretty fast! (We have no plans to complete it since it is assigned to VBCurtis.)
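
As a rough cross-check of the "under 50 hours" figure, the numbers in the log can be extrapolated linearly. This is only a sketch: it assumes the iteration rate stays constant and that [C]BLanczosTime[/C] is the number of seconds spent up to the interrupt, which the log does not state explicitly. The variable names are illustrative.

```python
# Figures copied from the log above.
elapsed_s = 5932         # BLanczosTime (assumed to be seconds)
dims_done = 926044       # dimensions completed at the interrupt
dims_total = 27521194    # matrix dimension

# Assumption: progress is linear in the number of dimensions.
projected_total_h = elapsed_s * dims_total / dims_done / 3600
remaining_h = projected_total_h - elapsed_s / 3600

# The three memory lines from the log, in MB:
# vectors + dense rows + sparse matrix.
gpu_mem_mb = 5879.2 + 839.9 + 21339.3

print(f"projected total: {projected_total_h:.1f} h")
print(f"remaining:       {remaining_h:.1f} h")
print(f"GPU memory:      {gpu_mem_mb:.1f} MB")
```

The projection comes out at roughly 49 hours in total, in line with the initial ETA in the log, and the three memory lines add up to the reported total to within rounding.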

We have the raw files saved if there are other configurations worth investigating. If so, just let us know!


VBCurtis 2021-09-09 21:00

It's a 32/33 hybrid, with a healthy amount of oversieving (I wanted a matrix below 30M dimensions, success!).

I'm impressed that it fits on your card, and 50 hr is pretty amazing. I just started the matrix a few hours ago on a 10-core Ivy Bridge, and the ETA is 365 hr.

If you have the free cycles to run it, please be my guest! Those 20+ core-weeks saved are enough to ECM the next candidate.
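
The "20+ core weeks" figure can be reproduced from the two ETAs quoted in this thread. The sketch below assumes all 10 cores are busy for the full CPU run and counts the GPU run as costing no CPU core-time. Both are simplifications.

```python
cpu_eta_h = 365   # ETA reported for the 10-core Ivy Bridge
cpu_cores = 10
gpu_eta_h = 50    # approximate GPU ETA from the first post

# Core-time the CPU run would consume (assumes all cores stay busy).
core_hours = cpu_eta_h * cpu_cores
core_weeks = core_hours / (24 * 7)

# Wall-clock ratio between the two runs.
wallclock_speedup = cpu_eta_h / gpu_eta_h

print(f"{core_hours} core-hours = {core_weeks:.1f} core-weeks")
print(f"wall-clock speedup: {wallclock_speedup:.1f}x")
```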

frmky 2021-09-13 05:22

I spent time with Nsight Compute looking at the SpMV kernel. As expected for SpMV, it's limited by memory bandwidth, so increasing occupancy to hide latency should help. I adjusted parameters to reduce both register and shared memory use, which increased the occupancy. This yielded a runtime improvement of only about 5% on the V100, but it may differ on other cards. I also increased the default block_nnz to 1750M to reduce global memory use a bit.
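
To illustrate why lower register and shared memory use raises occupancy, here is a simplified theoretical-occupancy estimate. It is a sketch only: the per-SM limits are the published figures for compute capability 7.0 (V100), allocation granularity is ignored, and the two kernel configurations are hypothetical, not the actual msieve launch parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits for compute capability 7.0 (V100)."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 96 * 1024


def theoretical_occupancy(threads_per_block, regs_per_thread,
                          smem_per_block, sm=SMLimits()):
    """Fraction of the SM's thread slots that can be resident at once.

    Simplified: ignores register and shared-memory allocation granularity.
    """
    limits = [
        sm.max_blocks,
        sm.max_threads // threads_per_block,
        sm.registers // (regs_per_thread * threads_per_block),
    ]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    resident_blocks = min(limits)
    return resident_blocks * threads_per_block / sm.max_threads


# Hypothetical configurations (not msieve's real kernel parameters).
before = theoretical_occupancy(256, 64, 32 * 1024)
after = theoretical_occupancy(256, 32, 8 * 1024)
print(f"before: {before:.1%}, after: {after:.1%}")
```

Higher theoretical occupancy does not guarantee a proportional speedup, which fits the modest 5% gain reported above for a bandwidth-bound kernel.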

frmky 2021-09-16 06:00

Today I expanded the allowed values of VBITS to any of 64, 128, 192, 256, 320, 384, 448, or 512. This works on both CPUs and GPUs, but I don't expect much, if any, speedup on CPUs. As a GPU benchmark, I tested a 42.1M matrix on two NVLink-connected V100s. Here are the results.
[CODE]VBITS  Time (hours)
64     109.5
128    63.75
192    50
256    40.25
320    40.25
384    37.75
448    40.25
512    37.25[/CODE]
Combined with the new SpMV parameters, I get the best times with VBITS of 384 and 512, but 384 uses less memory. Overall, I get about 6% better performance than with VBITS=256.
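
The relative gains can be read straight off the benchmark table. The sketch below only redoes that arithmetic against the VBITS=256 baseline. The function name is illustrative.

```python
# Benchmark figures from the table above (hours, 42.1M matrix, two V100s).
hours = {
    64: 109.5, 128: 63.75, 192: 50.0, 256: 40.25,
    320: 40.25, 384: 37.75, 448: 40.25, 512: 37.25,
}


def time_saved_pct(vbits, baseline=256):
    """Percentage of run time saved relative to the baseline VBITS."""
    return (hours[baseline] - hours[vbits]) / hours[baseline] * 100


for vbits, h in hours.items():
    print(f"{vbits:>5}  {h:>7.2f} h  {time_saved_pct(vbits):+7.1f}%")

fastest = min(hours, key=hours.get)
print("fastest VBITS:", fastest)
```

VBITS=384 saves a little over 6% of the run time relative to 256, which matches the figure quoted above.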

Xyzzy 2021-09-17 11:38

Our system has a single GPU. When we are doing compute work on the GPU the display lags. We can think of two ways to fix this.

Some sort of niceness assignment to the compute process.
Limiting the compute process to less than 100% of the GPU.

Is either of these approaches possible?


Xyzzy 2021-09-17 11:41

Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?


axn 2021-09-17 12:15

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)

frmky 2021-09-17 14:56

[QUOTE=Xyzzy;588025]Our system has a single GPU. When we are doing compute work on the GPU the display lags.[/QUOTE]
CUDA doesn't allow display updates while a kernel is running. The only way to improve responsiveness without using a second GPU is to shorten the kernel run times. The longest kernel is the SpMV, and you can shorten that by lowering block_nnz. Use the lowest value that still allows everything to fit in GPU memory.

Edit: Lowering VBITS will also reduce kernel runtimes, but don't go below 128. See the benchmark a few posts above. Also, you can't change VBITS in the middle of a run. You would need to start over from the beginning. You can change block_nnz during a restart.
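
As a rough guide to how block_nnz trades kernel length against block count, the sketch below estimates the number of blocks for the matrix in the first post. It assumes blocks are filled to roughly block_nnz nonzeros. The log shows the real split falls on slightly different boundaries, so treat this as an approximation and not as msieve's actual partitioning logic.

```python
import math


def estimated_blocks(sparse_nnz, block_nnz):
    """Approximate number of blocks, assuming each holds ~block_nnz nonzeros."""
    return math.ceil(sparse_nnz / block_nnz)


# "sparse part has weight" from the log in the first post.
sparse_nnz = 2714419599

for block_nnz in (500_000_000, 1_000_000_000, 1_750_000_000):
    n = estimated_blocks(sparse_nnz, block_nnz)
    print(f"block_nnz={block_nnz:>13,}: ~{n} blocks, "
          f"~{sparse_nnz / n / 1e6:.0f}M nonzeros each")
```

For block_nnz of 1B this gives three blocks, matching the three block lines the log prints for each matrix orientation. Smaller blocks mean shorter individual kernels, at the cost of the higher global memory use mentioned earlier in the thread.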

chris2be8 2021-09-17 15:37

[QUOTE=axn;588028]Install a cheap second GPU (like GT 1030 / RX 550) to drive your display (if your mobo has provision for a second one)[/QUOTE]

Or use on-board graphics if the MOBO has them. That's how I run my main GPU.

VBCurtis 2021-09-17 16:52

[QUOTE=Xyzzy;588027]Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?[/QUOTE]


In principle, yes. There's an electricity savings to be had by over-sieving less and accepting larger matrices, especially on the e-small queue, where matrices are nearly all under 15M. However, one can't push this very far, as relation sets that fail to build any matrix delay jobs and require human-admin time to add Q to the job. I've been trying to pick relation targets that leave jobs uncertain to build a matrix at TD=120, and I advocate this for everyone on e-small now. Some of the bigger 15e jobs could yield matrices over 30M, beyond the memory capabilities of GPU-LA, so maybe those shouldn't change much?

Another way to view this is to aim for the number of relations one would use if one were doing the entire job on one's own equipment, and then add just a bit to reduce the chance of needing to ask for more Q from admin (like round Q up to the nearest 5M or 10M increment).
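
The "round Q up" rule of thumb is simple to express in code. This is a minimal sketch; the function name and the sample Q value are made up for illustration.

```python
def round_up_q(q, increment=5_000_000):
    """Round a sieving Q limit up to the next multiple of `increment`."""
    return -(-q // increment) * increment  # ceiling division on integers


# Hypothetical target: Q = 183.2M, rounded to 5M and 10M increments.
print(round_up_q(183_200_000))               # 185000000
print(round_up_q(183_200_000, 10_000_000))   # 190000000
```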

Xyzzy 2021-09-17 18:56

What is the difference in relations needed between TD=120 and TD=100? (Do we have this data?)

We think a GPU could do a TD=100 job faster than a CPU could do a TD=120 job.

Personally, we don't mind having to rerun matrix building if there aren't enough relations. We don't know if it is a drag for the admins to add additional relations, but if it isn't a big deal the project could probably run more efficiently.

There doesn't seem to be a shortage of LA power, so maybe the project could skew a bit in favor of more jobs overall with fewer relations per job? Is the bottleneck server storage space?

What percentage in CPU-hours is the sieving versus the post-processing work? Does one additional hour of post-processing "save" 1000 hours of sieving? More? Less?

[SIZE=1](We lack the technical knowledge and vocabulary to express what we are thinking. Hopefully what we wrote makes a little sense.)[/SIZE]


All times are UTC.