mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

frmky 2022-02-24 06:17

[QUOTE=EdH;600624]Sorry if you're tired of these reports[/QUOTE]
Not at all! I look forward to seeing how you have gotten it to work in Colab!

EdH 2022-02-26 00:50

[QUOTE=EdH;600624]. . .
I hope to do the same test with a different GPU, to compare.[/QUOTE]Well, I spent quite a bit of time today with a T4, but I didn't let it finish, because I was (unsuccessfully) trying to get the checkpoint file to copy out correctly, so it would survive the end of the session. However, the T4 consistently gave estimates of 2:33 for completion, for the same matrix that took 4:19 to finish on the K80.
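What I was attempting, roughly, was along these lines (just a sketch: it assumes Google Drive is already mounted at /content/drive by the notebook's mount cell, and I'm using a wildcard for the checkpoint name rather than relying on the exact default):[code]# Copy the LA checkpoint (and log) out to Drive every 10 minutes, in the
# background, so a session disconnect doesn't lose the progress.
mkdir -p /content/drive/MyDrive/msieve_backup
while sleep 600; do
    cp -f msieve.dat*.chk msieve.log /content/drive/MyDrive/msieve_backup/ 2>/dev/null
done &[/code]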

EdH 2022-02-28 03:39

Disappointing update. Although Colab successfully completed LA on the test set, the returned msieve.dat.dep file is corrupt according to Msieve on the local machine. :sad:

EdH 2022-03-02 23:58

I have not been playing with Colab for the last few days, due to trying to get a Tesla K20Xm working locally. I had it working with GMP-ECM, but couldn't get frmky's Msieve to run. I battled with all kinds of CUDA versions (9, 10.2, 11.x, etc.). All resisted, including the standalone CUDA 10.2 .run installer. For some time, I lost GMP-ECM, too.

But I'm happy to report that I finally have all three (GMP-ECM, Msieve and frmky's Msieve) running. I'm using CUDA 11.4 and NVIDIA driver 470.103.71, and I had to install a shared object file from CUDA 9 (that may have been for GMP-ECM, for which I also had to disable some code in the Makefile). In any case, they are all running on the K20Xm!
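For anyone following along, the missing-library part amounted to something like this (the paths here are examples, not my exact ones):[code]# See which shared objects the binary can't resolve...
ldd ./ecm | grep "not found"
# ...then point the runtime loader at the directory holding the old CUDA 9 library
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH[/code]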

As to performance, limited testing seems to show the K20Xm takes nearly half the time of my 24-thread machine, although the 40-thread machines still have an edge on it. Still, in effect it represents an extra machine, since it can free up the others.

The good part is that now that I have this local card running, I can get back to my Colab session work and have a local card to compare and help figure things out.

Thank you to everyone for all the help in this and other threads!

EdH 2022-03-05 22:58

The Colab "How I. . ." is complete. I have tested it directly from the thread and it worked as designed. The latest session was assigned a K80, which was detected correctly and its Compute Capability used during the compilation of Msieve.

It can be reviewed at:

[URL="https://www.mersenneforum.org/showthread.php?t=27634"]How I Use a Colab GPU to Perform Msieve Linear Algebra (-nc2)[/URL]

Thanks everyone for all the help!

EdH 2022-03-25 14:17

I've hit a snag playing with my GPU and wonder why:

The machine is a Core2 Duo with 8 GB RAM and the GPU is a K20Xm with 6 GB RAM.
The composite is 170 digits and the matrix was built on a separate machine, with msieve.dat.mat, msieve.fb and worktodo.ini supplied from the original, alternately named files.
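By "supplied" I just mean copies renamed to Msieve's defaults, roughly like this (the c170.* names are placeholders for the real ones):[code]cp /path/from/other/machine/c170.dat.mat  msieve.dat.mat
cp /path/from/other/machine/c170.fb       msieve.fb
cp /path/from/other/machine/c170.ini      worktodo.ini[/code]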

I tried this twice. Here is the terminal display for the last try:[code]$ ./msieve -nc2 skip_matbuild=1 -g 0 -v

Msieve v. 1.54 (SVN Unversioned directory)
Fri Mar 25 09:45:54 2022
random seeds: 6dc60c6a 05868252
factoring 10559103707847604096214709430530773995264391543587654452108598611359547436885517060868607845904851346765842831319837349071427368916165620453753530586945871555707605156809 (170 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (170-digit input)
R0: -513476789674487020805844014359613
R1: 4613148128511433126577
A0: -638650427125602136382789058618425254350
A1: 413978338424926800646481002860017
A2: 268129428386547641102884323
A3: -15312382615381572243
A4: -8137373995372
A5: 295890
skew 1.00, size 1.799e-16, alpha -5.336, combined = 2.653e-15 rroots = 5

commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 11681047 x 11681223 (3520.8 MB) with weight 1098647874 (94.05/col)
sparse part has weight 794456977 (68.01/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 11680807 x 11681223 (3303.4 MB) with weight 723296676 (61.92/col)
sparse part has weight 679062899 (58.13/col)
using GPU 0 (Tesla K20Xm)
selected card has CUDA arch 3.5
Nonzeros per block: 1750000000
converting matrix to CSR and copying it onto the GPU
Killed[/code]And, here is the log:[code]Fri Mar 25 09:45:54 2022 Msieve v. 1.54 (SVN Unversioned directory)
Fri Mar 25 09:45:54 2022 random seeds: 6dc60c6a 05868252
Fri Mar 25 09:45:54 2022 factoring 10559103707847604096214709430530773995264391543587654452108598611359547436885517060868607845904851346765842831319837349071427368916165620453753530586945871555707605156809 (170 digits)
Fri Mar 25 09:45:55 2022 no P-1/P+1/ECM available, skipping
Fri Mar 25 09:45:55 2022 commencing number field sieve (170-digit input)
Fri Mar 25 09:45:55 2022 R0: -513476789674487020805844014359613
Fri Mar 25 09:45:55 2022 R1: 4613148128511433126577
Fri Mar 25 09:45:55 2022 A0: -638650427125602136382789058618425254350
Fri Mar 25 09:45:55 2022 A1: 413978338424926800646481002860017
Fri Mar 25 09:45:55 2022 A2: 268129428386547641102884323
Fri Mar 25 09:45:55 2022 A3: -15312382615381572243
Fri Mar 25 09:45:55 2022 A4: -8137373995372
Fri Mar 25 09:45:55 2022 A5: 295890
Fri Mar 25 09:45:55 2022 skew 1.00, size 1.799e-16, alpha -5.336, combined = 2.653e-15 rroots = 5
Fri Mar 25 09:45:55 2022
Fri Mar 25 09:45:55 2022 commencing linear algebra
Fri Mar 25 09:45:55 2022 using VBITS=256
Fri Mar 25 09:45:55 2022 skipping matrix build
Fri Mar 25 09:46:24 2022 matrix starts at (0, 0)
Fri Mar 25 09:46:26 2022 matrix is 11681047 x 11681223 (3520.8 MB) with weight 1098647874 (94.05/col)
Fri Mar 25 09:46:26 2022 sparse part has weight 794456977 (68.01/col)
Fri Mar 25 09:46:26 2022 saving the first 240 matrix rows for later
Fri Mar 25 09:46:30 2022 matrix includes 256 packed rows
Fri Mar 25 09:46:35 2022 matrix is 11680807 x 11681223 (3303.4 MB) with weight 723296676 (61.92/col)
Fri Mar 25 09:46:35 2022 sparse part has weight 679062899 (58.13/col)
Fri Mar 25 09:46:35 2022 using GPU 0 (Tesla K20Xm)
Fri Mar 25 09:46:35 2022 selected card has CUDA arch 3.5[/code]Is it possible the CSR conversion is overrunning memory?

frmky 2022-03-25 23:46

That looks like the Linux OOM killer, which would mean it has run out of available system (not GPU) memory.
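You can confirm it from the kernel log; something along these lines should show the kill (the exact wording varies by distro):[code]dmesg | grep -i -B2 -A8 "out of memory"
# or, on systemd-based systems:
journalctl -k | grep -i "oom\|killed process"[/code]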

EdH 2022-03-26 00:18

[QUOTE=frmky;602572]That looks like the Linux OOM killer, which would mean it has run out of available system (not GPU) memory.[/QUOTE]Thanks! I wondered, since it seemed the Msieve-reported matrix size was similar to the nvidia-smi-reported size, but that isn't the case with the run I just checked. Msieve says 545 MB and nvidia-smi says 1491 MiB.
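For the comparison, I'm just reading the card's memory use off nvidia-smi while the run is going, e.g.:[code]nvidia-smi --query-gpu=memory.used,memory.total --format=csv[/code]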

I'll play a bit more with some sizes in between and see what may be the limit.

Do you think a large swap file would be of any use?

chris2be8 2022-03-26 16:45

[QUOTE=EdH;602573]
Do you think a large swap file would be of any use?[/QUOTE]

Yes. I'd add 16-32 GB of swap space, which should stop the OOM killer from killing jobs when they ask for lots of memory.

But the system could start thrashing if they try to actively use more memory than you have RAM. SSDs are faster than spinning disks, but more prone to wearing out if heavily used as swap.

Adding more RAM would be the best option, if the system can take it. But that costs money unless you have some spare RAM to install.
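If you do go the swap route, adding a swap file is only a few commands (the size and path here are just examples):[code]sudo fallocate -l 32G /swapfile   # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=32768
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show                     # confirm it's active[/code]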

EdH 2022-03-27 12:31

Well, more study seems to say I might not be able to get there with 32G of swap,* although I might see what happens. I tried a matrix that was built with t_d=70 for a c158, to compare times with a 40-thread machine, and I got a little more info. Here's what top says about Msieve:[code] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21349 math55 20 0 33.7g 7.3g 47128 D 3.7 93.3 1:21.89 msieve[/code]The machine only has 8G, and it would be very expensive to take it to its 16G maximum, which doesn't look sufficient, either.

Here's what Msieve had to say:[code]commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 7793237 x 7793427 (2367.4 MB) with weight 742866434 (95.32/col)
sparse part has weight 534863189 (68.63/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 7792997 x 7793427 (2195.0 MB) with weight 483246339 (62.01/col)
sparse part has weight 450716902 (57.83/col)
using GPU 0 (Tesla K20Xm)
selected card has CUDA arch 3.5
Nonzeros per block: 1750000000
converting matrix to CSR and copying it onto the GPU
450716902 7792997 7793427
450716902 7793427 7792997
commencing Lanczos iteration
vector memory use: 1664.9 MB
dense rows memory use: 237.8 MB
sparse matrix memory use: 3498.2 MB
memory use: 5400.9 MB
error (spmv_engine.cu:78): out of memory[/code]This looks to me like the card ran out, too. The K20Xm has 6 GB (displayed as 5700 MiB by nvidia-smi), and the reported 5400.9 MB plus the CUDA context overhead would be right at that limit.

* The machine currently has an 8G swap partition, and I have a 32G microSD card handy that I might try adding to the system, both to test the concept of using such a card as swap and to add the swap space if it works.

chris2be8 2022-03-27 15:59

[QUOTE=EdH;602678]
The machine only has 8G and it would be very expensive to take it to its max at 16G, which doesn't look sufficient, either.
[/QUOTE]

The OOM killer should put messages into syslog, so check syslog and dmesg output before buying memory or spending a lot of time checking other things. I should have said to do that first in my previous post, sorry.

8 GB should be enough to solve the matrix on the CPU. I've done a GNFS c178 in 16 GB (the system has 32 GB of swap space as well, but wasn't obviously paging).

