#24 | Jun 2012 | Boulder, CO | 111000111₂ Posts

On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, 16 A100s with a fully interconnected fabric could probably tear through big matrices?

#25 | Jul 2003 | So Cal | 2646₁₀ Posts

The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1), but it relies on a good MPI implementation. OpenMPI's collectives transfer the data off the card and do the reduction on the CPU. MVAPICH2-GDR, I think, keeps the reductions on the card, but SDSC doesn't have that working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.
Edit 2: I've got a draft version working just now that passes vectors between GPUs using MPI CUDA-aware point-to-point comms (which use NVLink or GPUDirect when available), then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100s connected with NVLink, this reduces LA time from nearly 90 hours when passing vectors through CPU memory to …

Edit 3: It's now in GitHub. Just compile with a CUDA-aware MPI like OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1, where XX is replaced by the compute capability of your GPU.

Last fiddled with by frmky on 2021-08-12 at 08:20
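For anyone curious what that looks like in code, here is a minimal sketch of the idea (not the actual msieve implementation): hand device pointers straight to a CUDA-aware MPI for the point-to-point exchange, then do the GF(2) reduction (an XOR) on the card. The kernel name, the block-vector length NWORDS, and the two-rank pairing are illustrative assumptions.

```c
// Minimal sketch, not the actual msieve code: exchange a GF(2) vector block
// between two GPUs with CUDA-aware MPI point-to-point calls, then do the
// reduction on the card.  Build with nvcc plus a CUDA-aware MPI (e.g. OpenMPI).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdint>

#define NWORDS (1 << 20)   /* 64-bit words per vector block (made-up size) */

/* GF(2) addition is XOR, so the reduction is just an elementwise XOR. */
__global__ void xor_reduce(uint64_t *mine, const uint64_t *theirs, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mine[i] ^= theirs[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);               /* one GPU per MPI rank; assumes 2 ranks */

    uint64_t *v, *recv;
    cudaMalloc((void **)&v,    NWORDS * sizeof(uint64_t));
    cudaMalloc((void **)&recv, NWORDS * sizeof(uint64_t));
    cudaMemset(v, 0, NWORDS * sizeof(uint64_t));   /* stand-in for real vector data */

    /* A CUDA-aware MPI accepts device pointers directly; over NVLink or
       GPUDirect the data never passes through host memory. */
    int peer = rank ^ 1;
    MPI_Sendrecv(v,    NWORDS, MPI_UINT64_T, peer, 0,
                 recv, NWORDS, MPI_UINT64_T, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Do the reduction on the GPU instead of the CPU. */
    xor_reduce<<<(NWORDS + 255) / 256, 256>>>(v, recv, NWORDS);
    cudaDeviceSynchronize();

    cudaFree(v);
    cudaFree(recv);
    MPI_Finalize();
    return 0;
}
```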

#26 | Jul 2003 | So Cal | 2×3³×7² Posts

Code:
linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)

#27 | (loop (#_fork)) | Feb 2006 | Cambridge, England | 2·7·461 Posts

Interesting! That's about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 = £60 to solve the matrix.
I'm paying 19p/kWh here, and my Skylake machine uses about 250W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine will last five years); on another hand it's taking a month rather than a day, on a third hand that's still keeping up with my sieving resources.
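Spelling out the arithmetic behind those figures (rounded as in the post; the 21 h comes from the ETA a few posts up):

\[
\begin{aligned}
\text{cloud: } & 21\ \text{h} \times \$4/\text{h} = \$84 \approx £60 \\
\text{electricity: } & 0.25\ \text{kW} \times 820\ \text{h} \times £0.19/\text{kWh} \approx £39 \\
\text{depreciation: } & \frac{£3360}{5\ \text{yr}} \times \frac{820\ \text{h}}{8760\ \text{h/yr}} \approx £63
\end{aligned}
\]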

#28 | Jul 2003 | So Cal | 101001010110₂ Posts

Code:
linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)

#29 | Jun 2012 | Boulder, CO | 5·7·13 Posts

Code:
linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)

#30 | Jul 2003 | So Cal | 2×3³×7² Posts

Yes, that would have been on 6 Sandy Bridge nodes with 2× 10-core CPUs each.
Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes.

Code:
Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m

#31 | I moo ablest echo power! | May 2013 | 2⁶·29 Posts

Would something like this work on my 3090? It has 24 GB of RAM on it, though I would have to get some help with compilation, as I use WSL2, which doesn't support CUDA applications (yet).

#32 | Jul 2003 | So Cal | 101001010110₂ Posts

Yes, you could solve a matrix up to about 15M or so on the card. If you have at least 32 GB of system memory, you could go a bit larger by transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.
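As a rough illustration of the managed-memory idea (a minimal sketch, not msieve's allocation code; it assumes a Linux host, a Pascal-or-newer GPU, and enough system RAM, and the size and advice hint are made up): the matrix is allocated with cudaMallocManaged so kernels can touch more data than fits in VRAM, with pages migrated or read over the bus on demand.

```c
// Minimal sketch of CUDA managed memory for a matrix bigger than the card's VRAM.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main()
{
    size_t bytes = 30ull << 30;        /* 30 GiB: more than a 24 GB card holds */
    uint32_t *mat = nullptr;

    /* Managed memory is addressable from both host and device; pages migrate
       (or are read over the bus) on demand, so kernels can use a matrix that
       does not fit in device memory, at the cost of transfer time. */
    cudaError_t err = cudaMallocManaged((void **)&mat, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Optional hint: keep the backing pages in system memory and stream them to
       the GPU as they are touched, instead of thrashing device memory. */
    cudaMemAdvise(mat, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    /* ... fill mat on the host, then launch kernels that read it directly ... */

    cudaFree(mat);
    return 0;
}
```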

#33 | Jul 2003 | So Cal | 2×3³×7² Posts

The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100s in a total of 95.5 hours of runtime. There was a restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26 GB of GPU memory on each of the four V100s.

Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second/iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used.
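For context, the SpMV at the core of block Lanczos over GF(2) is essentially an XOR-gather: each nonzero of a matrix row selects a 64-bit word of the vector block, and the row's result is the XOR of those words. Below is a minimal CSR-style sketch of that operation; it is not msieve's kernel (which is far more elaborate), and the names and toy matrix are illustrative.

```c
// Minimal CSR sketch of the GF(2) sparse matrix-times-vector-block product used in
// block Lanczos: row i of the result is the XOR of the 64-bit vector words selected
// by the nonzero columns of row i.  One thread per row.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void spmv_gf2(const uint32_t *row_ptr,   /* CSR row offsets, nrows+1     */
                         const uint32_t *col_idx,   /* column index of each nonzero */
                         const uint64_t *v,         /* input vector block           */
                         uint64_t *y,               /* output vector block          */
                         uint32_t nrows)
{
    uint32_t row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;

    uint64_t acc = 0;
    for (uint32_t k = row_ptr[row]; k < row_ptr[row + 1]; k++)
        acc ^= v[col_idx[k]];          /* GF(2): accumulate by XOR */
    y[row] = acc;
}

int main()
{
    /* Toy 3x3 matrix: row 0 has nonzeros in columns 0 and 2, row 1 in column 1,
       row 2 in columns 0, 1 and 2. */
    uint32_t h_row_ptr[] = {0, 2, 3, 6};
    uint32_t h_col_idx[] = {0, 2, 1, 0, 1, 2};
    uint64_t h_v[]       = {0x1, 0x2, 0x4};
    uint64_t h_y[3];

    uint32_t *d_row_ptr, *d_col_idx;
    uint64_t *d_v, *d_y;
    cudaMalloc((void **)&d_row_ptr, sizeof(h_row_ptr));
    cudaMalloc((void **)&d_col_idx, sizeof(h_col_idx));
    cudaMalloc((void **)&d_v, sizeof(h_v));
    cudaMalloc((void **)&d_y, sizeof(h_y));
    cudaMemcpy(d_row_ptr, h_row_ptr, sizeof(h_row_ptr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_idx, h_col_idx, sizeof(h_col_idx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_v, sizeof(h_v), cudaMemcpyHostToDevice);

    spmv_gf2<<<1, 32>>>(d_row_ptr, d_col_idx, d_v, d_y, 3);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    /* expect y = {0x5, 0x2, 0x7} */

    return 0;
}
```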