
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

ryanp 2021-08-06 20:02

[QUOTE=frmky;585002]A 27.7M matrix in 21 hours. The A100 is a nice card![/QUOTE]
Indeed!

[QUOTE]Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.[/QUOTE]

No, sadly, "just" A100's and V100's.

ryanp 2021-08-06 21:20

On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, [URL="https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga"]16x A100's[/URL] with a fully interconnected fabric could probably tear through big matrices.

frmky 2021-08-06 21:41

The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1), but it relies on a good MPI implementation. OpenMPI's collectives transfer the data off the card and do the reduction on the CPU. I think MVAPICH2-GDR keeps the reductions on the card, but SDSC doesn't have that working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.

Edit: [STRIKE]My backup plan if that doesn't work out is to use a non-CUDA-aware MPI to pass IPC handles between processes and do the reduction on the GPU myself.[/STRIKE]

Edit 2: I've got a draft version working just now that passes vectors between GPUs using MPI CUDA-aware point-to-point comms (which uses NVLink or GPUDirect when available) then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100's connected with NVLink, this reduces LA time from nearly 90 hours when passing vectors through CPU memory to [STRIKE]57[/STRIKE] 56 hours transferring directly between GPUs.
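
For the curious, the idea in that draft is roughly the following (a minimal sketch, not the actual msieve code; the names are mine, and I'm assuming the GF(2) vectors are packed into 64-bit words as in block Lanczos, so the reduction is an elementwise XOR):

[CODE]/* Sketch: exchange partial vectors with CUDA-aware MPI point-to-point,
 * then reduce on the GPU instead of round-tripping through CPU memory. */
#include <mpi.h>
#include <stdint.h>
#include <cuda_runtime.h>

__global__ void xor_reduce(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] ^= src[i];            /* addition over GF(2) is XOR */
}

void exchange_and_reduce(uint64_t *v_mine, uint64_t *v_peer,
                         size_t n, int peer, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Device pointers go straight to MPI; a CUDA-aware build moves
     * them over NVLink or GPUDirect when available. */
    MPI_Isend(v_mine, (int)n, MPI_UINT64_T, peer, 0, comm, &req[0]);
    MPI_Irecv(v_peer, (int)n, MPI_UINT64_T, peer, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* The reduction stays on the card. */
    xor_reduce<<<(unsigned)((n + 255) / 256), 256>>>(v_mine, v_peer, n);
    cudaDeviceSynchronize();
}[/CODE]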

Edit 3: It's now in GitHub. Just compile with a CUDA-Aware MPI like OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1 where XX is replaced by the compute capability of your GPU.
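
For example, on V100's (compute capability 7.0) with OpenMPI, something like this should work (the mpirun line is just an illustration; -nc2 is msieve's linear algebra step):

[CODE]make all CUDA=70 MPI=1 CUDAAWARE=1    # 70 = V100; use 80 for an A100
mpirun -np 4 ./msieve -nc2 -v[/CODE]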

frmky 2021-08-07 06:44

[CODE]linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)[/CODE]
Using four V100's, I'm getting about 21 hours to solve a 42.1M matrix.

fivemack 2021-08-07 11:10

Interesting! Four V100's is about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 ≈ £60 to solve the matrix.

I'm paying 19p/kWh here, and my Skylake machine uses about 250W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine will last five years); on the other hand it's taking a month rather than a day, and on a third hand that's still keeping up with my sieving resources.
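
(Spelling out the arithmetic: 0.25 kW × 820 h = 205 kWh, and 205 kWh × 19p/kWh ≈ £39 of electricity; for depreciation, £3360 / 5 yr = £672/yr, and 820 h is about 9.4% of a year if the machine runs continuously, so ≈ £63.)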

frmky 2021-08-07 16:13

[CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.

ryanp 2021-08-07 16:20

[QUOTE=frmky;585109][CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.[/QUOTE]

That's quite impressive. I dug this up, which I believe was from your MPI run of a 109.4M matrix a few months back?

[CODE]linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)[/CODE]

frmky 2021-08-07 16:40

Yes, that would have been on 6 Sandy Bridge nodes with 2x 10-core CPUs each.

Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes.

[CODE]Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m[/CODE]

wombatman 2021-08-08 00:00

Would something like this work on my 3090? It has 24GB of RAM, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).

frmky 2021-08-08 00:57

[QUOTE=wombatman;585135]Would something like this work on my 3090? It has 24GB of RAM, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).[/QUOTE]
Yes, you could solve a matrix up to about 15M or so on the card. If you have at least 32 GB of system memory, you could go a bit larger by transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.
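
The managed-memory approach looks roughly like this (a sketch with illustrative sizes and names, not msieve's actual allocation code):

[CODE]/* Sketch: let the matrix live in managed memory so pages migrate to
 * the GPU on demand and spill back to system RAM when VRAM fills. */
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 32ULL << 30;   /* a matrix bigger than the card's 24 GB */
    uint32_t *mat;

    if (cudaMallocManaged(&mat, bytes, cudaMemAttachGlobal) != cudaSuccess) {
        fprintf(stderr, "managed alloc failed\n");
        return 1;
    }
    /* Hint: keep the backing copy in system memory and stream pages
     * to the GPU as the kernels touch them. */
    cudaMemAdvise(mat, bytes, cudaMemAdviseSetPreferredLocation,
                  cudaCpuDeviceId);

    /* ... SpMV kernels then read mat[] as if it were device memory ... */

    cudaFree(mat);
    return 0;
}[/CODE]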

frmky 2021-08-11 22:09

1 Attachment(s)
The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100's in a total of 95.5 hours of runtime. There was a restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26GB of GPU memory on each of the four V100's.

Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second/iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used.

[PASTEBIN]TsUMyBr8[/PASTEBIN]
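
(If anyone wants to capture this kind of per-rank timeline themselves, something along these lines with Nsight Systems should work; the command is illustrative, assuming OpenMPI's OMPI_COMM_WORLD_RANK for the per-rank output name, and I'm not claiming it's exactly how this snapshot was made:)

[CODE]mpirun -np 4 nsys profile -t cuda,mpi \
    -o lanczos_rank%q{OMPI_COMM_WORLD_RANK} ./msieve -nc2 -v[/CODE]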

