[QUOTE=frmky;585002]A 27.7M matrix in 21 hours. The A100 is a nice card![/QUOTE]
Indeed! [QUOTE]Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.[/QUOTE] No, sadly, "just" A100 and V100's.
On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, [URL="https://cloud.google.com/blog/products/compute/a2-vms-with-nvidia-a100-gpus-are-ga"]16x A100's[/URL] with a fully interconnected fabric could probably tear through big matrices?
The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1) but relies on a good MPI implementation. OpenMPI's collectives transfer the data off the card and do the reduction on the CPU. MVAPICH2-GDR, I think, keeps the reductions on the card, but SDSC doesn't have that working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.
Edit: [STRIKE]My backup plan if that doesn't work out is to use a non-CUDA-aware MPI to pass IPC handles between processes and do the reduction on the GPU myself.[/STRIKE] Edit 2: I've just got a draft version working that passes vectors between GPUs using CUDA-aware MPI point-to-point comms (which use NVLink or GPUDirect when available) and then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100's connected with NVLink, this reduces the LA time from nearly 90 hours when passing vectors through CPU memory to [STRIKE]57[/STRIKE] 56 hours when transferring directly between GPUs. Edit 3: It's now in GitHub. Just compile with a CUDA-aware MPI like OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1, where XX is replaced by the compute capability of your GPU.
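For the curious, here's a minimal sketch of the point-to-point idea, not msieve's actual code (the function and kernel names are made up for illustration): each rank exchanges its partial vector with a peer via CUDA-aware MPI, passing device pointers directly so the library can use NVLink or GPUDirect, then combines the partials on the GPU. For block Lanczos over GF(2), the combine is just an XOR.
[CODE]/* Hypothetical sketch, assuming a CUDA-aware MPI; counts assumed to fit in an int. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdint.h>

__global__ void xor_reduce(uint64_t *local, const uint64_t *remote, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        local[i] ^= remote[i];   /* GF(2) addition */
}

/* d_local and d_recv are device pointers; peer is the other MPI rank. */
static void exchange_and_reduce(uint64_t *d_local, uint64_t *d_recv,
                                size_t n, int peer, MPI_Comm comm)
{
    /* Device pointers passed straight to MPI; a CUDA-aware implementation
     * moves the data over NVLink or GPUDirect when available. */
    MPI_Sendrecv(d_local, (int)(n * sizeof(uint64_t)), MPI_BYTE, peer, 0,
                 d_recv,  (int)(n * sizeof(uint64_t)), MPI_BYTE, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    int threads = 256;
    int blocks  = (int)((n + threads - 1) / threads);
    xor_reduce<<<blocks, threads>>>(d_local, d_recv, n);
    cudaDeviceSynchronize();
}[/CODE]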
[CODE]linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)[/CODE]
Using four V100's, I'm getting about 21 hours to solve a 42.1M matrix.
Interesting! That's about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 = £60 to solve the matrix.
I'm paying 19p/kWh here, and my Skylake machine uses about 250W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine lasts five years); on the other hand it's taking a month rather than a day, and on a third hand that's still keeping up with my sieving resources.
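(For reference, those figures work out to roughly 0.25 kW × 820 h ≈ 205 kWh, and 205 kWh × £0.19/kWh ≈ £39 of electricity; the depreciation share is about £3360 / 5 yr × 820 h / 8760 h ≈ £63.)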
[CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.
[QUOTE=frmky;585109][CODE]linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)[/CODE]
And scaling well. The 84.2M matrix for 2,2162M should take about 4 days on four NVLink-connected V100's. It's using about 26GB on each card.[/QUOTE] That's quite impressive. I dug this up, which I believe was from your MPI run of a 109.4M matrix a few months back? [CODE]linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)[/CODE]
Yes, that would have been on 6 Sandy Bridge nodes with 2x 10-core CPUs each.
Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes. [CODE]Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m[/CODE]
Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).
[QUOTE=wombatman;585135]Would something like this work on my 3090? It has 24GB of RAM on it, though I would have to get some help with compilation as I use WSL2, which doesn't support CUDA applications (yet).[/QUOTE]
Yes, you could solve a matrix up to about 15M or so on the card. If you have at least 32 GB of system memory, you could go a bit larger by transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.
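For anyone curious what the managed-memory route looks like in CUDA, here's a minimal hypothetical sketch (the allocation size and hints are placeholders, not anything from msieve; how far you can oversubscribe GPU memory depends on the platform and driver):
[CODE]/* cudaMallocManaged returns a pointer usable from both host and device, so a
 * matrix larger than GPU memory can be paged in from system RAM on demand. */
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t nbytes = (size_t)40 << 30;   /* placeholder: a 40 GB matrix image */
    uint32_t *matrix;

    cudaError_t err = cudaMallocManaged(&matrix, nbytes, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Optional hint that the data should preferentially live in system memory. */
    cudaMemAdvise(matrix, nbytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    /* ... fill the matrix on the host, launch SpMV kernels on the device ... */

    cudaFree(matrix);
    return 0;
}[/CODE]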
The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100's in a total of 95.5 hours of runtime. There was one restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26GB of GPU memory on each of the four V100's.
Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second/iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used. [PASTEBIN]TsUMyBr8[/PASTEBIN]