
EdH 2022-03-27 16:19

[QUOTE=chris2be8;602703]The OOM killer should put messages into syslog, so check syslog and dmesg output before buying memory or spending a lot of time checking other things. I should have said to do that first in my previous post, sorry.

8GB should be enough to solve the matrix on the CPU. I've done a GNFS c178 in 16GB (the system has 32GB swap space as well but wasn't obviously paging).[/QUOTE]Thanks! I'll play more later, but for now, I'm running a c145 that nvidia-smi reports as using 1802MiB on the GPU. It is happily stomping the 40-thread CPU machine. The GPU machine started after copying the files from the CPU machine, and it is ahead with ETA 38m vs. ETA 1h 0m.

I did get the microSD card added as swap, but I needed help from kruoli in the linux sub-forum. [C]top[/C] now shows nearly 40G for swap space, but it is all totally free for this c145.

frmky 2022-03-27 17:36

[QUOTE=EdH;602678][code]commencing linear algebra
using VBITS=256
error ( out of memory[/code][/QUOTE]
You almost had enough. It ran out while trying to allocate working memory for the spmv library. Recompile with VBITS=128 and it should fit, even if it's not optimal. (Don't forget to copy the .ptx and .so files.)

EdH 2022-03-27 21:13

[QUOTE=frmky;602717]You almost had enough. It ran out while trying to allocate working memory for the spmv library. Recompile with VBITS=128 and it should fit, even if it's not optimal. (Don't forget to copy the .ptx and .so files.)[/QUOTE]Thanks! That did the trick! nvidia-smi is reporting 4999MiB / 5700MiB and Msieve is using 2.7g of 8G. The ETA is just over 11 hours, whereas the CPU took 12:30 with 32 threads. I had forgotten to edit Msieve for 40 threads on this machine.

I still need to do some more testing and find out where the crossover is, but all this is encouraging.

EdH 2022-04-08 13:11

Sorry if these questions are annoying:

I've been playing with my K20Xm card for a little while now and, of course, it isn't "good enough." I can get more of them at reasonable prices, but why buy more if they aren't? Also, most of my card-capable machines don't have a spare fan connector, which a K20Xm would need.

Compared to a GTX 980, the memory is the same, so I still wouldn't be able to run larger matrices. Does the matrix size increase in a manner I could estimate? E.g., do 5 more digits double the matrix? If a GTX 1080 with 11GB would only give me 5 more digits, I couldn't consider it worth the cost.

Is there a similar estimation for target_density? I currently use t_d 70 so the CADO-NFS clients can move to a subsequent server sooner, but I haven't empirically determined if that is best.

I'm not sure if this might be a typo, but while the 980 shows a much better performance overall, the FP64 (double) performance only shows 189.4 GFLOPS (1:32), while for the K20Xm, it is shown as 1,312 GFLOPS (1:3). Would that be of significance in LA solving?

It's been mentioned that the K80 consists of two cards that are each a little better than the K20Xm. How much larger a matrix might I be able to run with MPI across both halves of a 24GB card?

EdH 2022-04-09 12:42

Any familiarity with the Tesla M40 24GB for Msieve LA? That would be about 4x memory for 2x cost over the K20Xm.

frmky 2022-04-09 19:45

I sieve enough to use a target_density of at least 100-110, as it brings down the matrix size. An 11GB card can likely handle matrices with about 10M rows (GNFS-175ish), whereas a 24GB card would take you up to around 20M rows (GNFS-184ish). With enough system memory, the newer Tesla M40 would let you go a bit higher, at a significant performance penalty, by storing the matrix in system memory and transferring it as needed onto the GPU.
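A rough way to interpolate between those two data points (a sketch only: the smooth growth rate is an assumption fit to the numbers above, not a property of GNFS):

```python
# Back-of-envelope estimate of matrix rows vs. GNFS difficulty.
# Fit to two data points from this thread: ~10M rows at GNFS-175
# and ~20M rows at GNFS-184, i.e. the matrix roughly doubles
# every ~9 decimal digits. Purely illustrative.

def est_rows(digits, base_digits=175, base_rows=10e6, doubling=9.0):
    """Estimate matrix rows for a GNFS job of the given digit count."""
    return base_rows * 2 ** ((digits - base_digits) / doubling)

if __name__ == "__main__":
    for d in (170, 175, 180, 184):
        print(f"GNFS-{d}: ~{est_rows(d) / 1e6:.1f}M rows")
```

By this crude fit, 5 more digits costs roughly a 1.5x larger matrix, not 2x, but actual sizes depend heavily on sieving effort and target_density.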

GPU LA is entirely integer code and doesn't depend on the FP64 performance. It's written using 64-bit integer operations, but even on the latest GPUs those are implemented with 32-bit integer instructions.

You lose some speed and memory efficiency splitting the matrix across two GPUs in a K80, but you should still be able to handle 9M rows or so (GNFS-174ish).

EdH 2022-04-09 20:29

Thanks! That helps me a bunch. My interest is now the M40 24GB. But I'm not quite ready, because the machines I'd like to use don't have any spare fan connectors. I'm considering a fan powered some other way - possibly from an older PATA power cable.

EdH 2022-05-01 02:08

I'm hoping to set up a machine to primarily do GPU LA with an M40 24GB card.

- Will a Core2 Duo 3.16GHz be better (or much worse) than a slower Quad core?
- - When running the GPU, is LA doing anything with more than one CPU core? I only see one core in use via [C]top[/C].

- Will 8GB of machine RAM be insufficient to feed the 24GB card?
- - If insufficient, would a large swap file, via MicroSD 32GB ease the memory limit?

frmky 2022-05-01 17:33

GPU LA uses only a single CPU core to do a very small part of each iteration. Likewise, filtering and traditional sqrt use only a single core. The Core2 Duo should be fine.

With a 24GB card, you should be able to solve up to around 20Mx20M matrices, which would be about 10GB in size. While transferring the matrix to the card, you need to store the entire matrix in COO plus a portion of it in CSR format. 8 GB would not be enough. 16 GB plus a swap file should be enough, but leave room for expansion later if needed.
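For a rough sense of why the staging copy pushes past 8GB, here's a sketch of the bookkeeping. The index widths and the ~100 nonzeros per column (matching a target_density around 100) are assumptions for illustration, not msieve's actual internal layout:

```python
# Rough memory estimate for a sparse matrix held in CSR vs. COO form.
# Assumptions (not msieve internals): 32-bit column indices in CSR,
# two 32-bit indices per entry in COO, ~100 nonzeros per column,
# square matrix.

def matrix_bytes(rows, density=100):
    """Return (csr_bytes, coo_bytes) for a rows x rows sparse matrix."""
    nnz = rows * density                 # total nonzero entries
    csr = nnz * 4 + (rows + 1) * 8       # column indices + row pointers
    coo = nnz * 8                        # (row, col) index pairs
    return csr, coo

if __name__ == "__main__":
    csr, coo = matrix_bytes(20_000_000)
    print(f"CSR: ~{csr / 1e9:.1f} GB, COO: ~{coo / 1e9:.1f} GB")
```

Under these assumptions the COO copy alone is about twice the CSR size, which lines up with the advice that 8GB of system RAM can't stage a ~10GB matrix while 16GB plus swap can.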

EdH 2022-05-01 18:50

Thanks! I think it would be too costly to bring the Core2 up to 16 GB, so I'll look at other options. I appreciate all the help!

EdH 2022-05-09 20:09

Sorry to annoy, but I'm having trouble getting an M40 to run. The system sees it, but [C]nvidia-smi[/C] and Msieve do not. This machine runs the K20X and an NVS 510 fine. Do I need to reinstall CUDA with the M40 in place, perhaps?
