mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU Linear Algebra (https://www.mersenneforum.org/showthread.php?t=27042)

EdH 2022-05-01 18:50

Thanks! I think it would be too costly to bring the Core2 up to 16 GB, so I'll look at other options. I appreciate all the help!

EdH 2022-05-09 20:09

Sorry to annoy, but I'm having trouble getting an M40 to run. The system sees it, but not [C]nvidia-smi[/C] or Msieve. This machine runs the K20X and an NVS-510 fine. Do I need to reinstall CUDA with the M40 in place, perhaps?

frmky 2022-05-09 22:08

If nvidia-smi doesn't see it, then msieve won't. Perhaps you need to reinstall the CUDA driver with the M40 installed?

EdH 2022-05-10 01:22

[QUOTE=frmky;605578]If nvidia-smi doesn't see it, then msieve won't. Perhaps you need to reinstall the CUDA driver with the M40 installed?[/QUOTE]Reinstalled driver and CUDA in different variations and no joy. The computer says it's there, but CUDA says it isn't. I put the K20Xm back in and it sees it every time. Both are PCIEx16 v3.0.

Giving up for now. . .

ETA: Msieve compiled with 5.2, but couldn't find the card, as expected.

Thanks for the help.

EdH 2022-05-10 17:52

I guess I have found my answer for the M40:[code][ 1562.849818] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
[ 1562.849819] NVRM: The system BIOS may have misconfigured your GPU.
[ 1562.849824] nvidia: probe of 0000:01:00.0 failed with error -1
[ 1562.849839] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 1562.849840] NVRM: None of the NVIDIA devices were initialized.[/code]And, no newer BIOS updates addressing any PCI issues.
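For reference, the failing dmesg line can be checked mechanically. This is a small illustrative parser (not part of msieve or the NVIDIA driver) for NVRM BAR messages like the one above; a size of 0M at address 0x0 means the BIOS never assigned the region:

```python
import re

def bar_assigned(line: str) -> bool:
    """Parse an NVRM message like 'BAR1 is 0M @ 0x0' and report whether
    the BIOS actually assigned the region (size 0 at address 0 means no)."""
    m = re.search(r"BAR\d+ is (\d+)M @ 0x([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no BAR info in line")
    size_mb = int(m.group(1))
    address = int(m.group(2), 16)
    return size_mb > 0 and address != 0

print(bar_assigned("NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)"))  # False
```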

frmky 2022-05-10 18:13

That's a BIOS issue. Google says to look for options deep in the BIOS menus like PCI Express 64-bit BAR Support, large BARs, or above 4G decoding.

EdH 2022-05-10 20:45

[QUOTE=frmky;605610]That's a BIOS issue. Google says to look for options deep in the BIOS menus like PCI Express 64-bit BAR Support, large BARs, or above 4G decoding.[/QUOTE]Thank you for all the help with everything. I do appreciate it, but I'm going to let it sit for now. I did search the BIOS and all I found were two things: a Robust Graphics Booster with Auto/Fast/Turbo settings, for which there is a red message (for all three settings), "[COLOR=Red]Warning: VGA Graphics card is not guaranteed to operate normally[/COLOR]," and a PCIE frequency adjustment with a warning about setting it above 100MHz. The messages are displayed for the K20Xm as well. I guess I should consider myself lucky that one works.

Thank you, again, for all your help.

EdH 2022-07-15 14:38

A small follow-up:

I now have the Tesla M40 24GB running and am quite pleased. But there is room for improvement: it is throttling due to insufficient cooling. It reaches 87C and cuts back its processing. I have a push fan and a pull fan, but the airflow is just not there. I will have to pursue an alternate method. I'd hate to wait until winter to get the full capability.

RichD 2022-07-16 22:51

[QUOTE=frmky;606227]Yep. With the managed memory option, the program stores portions of the sparse matrix blocks in main memory if necessary and moves them to the GPU when they are needed in each iteration. This significantly increases traffic on the PCIe bus. The GPU spends much more time waiting for data, but it can still be faster than running on the CPU.[/QUOTE]
I am thinking of tackling a much larger job where the matrix might be 5-6 times the GPU memory I have on a GTX 1660 (6GB) card. I know it helps on smaller jobs where the memory requirements are less than 2X. Would it be better to utilize the GPU, or should I just go for it and report my results here? (Using use_managed=1)

frmky 2022-07-16 22:58

There's a good chance that won't work. The vectors are always kept on the card and may take most of the GPU memory, leaving little for the matrix blocks and spmv scratch space. Nothing beats experiment, though, so give it a try and see what happens.
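frmky's point above can be put into rough numbers. The sketch below is back-of-envelope arithmetic with illustrative guesses, not msieve internals: if the always-resident vectors take a fixed slice of the card, the matrix's effective oversubscription is measured against what is left over:

```python
def effective_oversubscription(matrix_mb: float, vector_mb: float, gpu_mb: float) -> float:
    """Ratio of matrix size to the GPU memory left after the always-resident
    vectors are allocated. Values far above 1 mean heavy PCIe traffic."""
    free_for_matrix = gpu_mb - vector_mb
    return matrix_mb / free_for_matrix

# Guessed figures: 6 GB GTX 1660, a matrix ~5x the card, ~2.3 GB of vectors.
print(f"{effective_oversubscription(30000, 2300, 6000):.1f}x")  # 8.1x
```

So a matrix "only" 5x the card can easily be 8x or more relative to the space actually available for matrix blocks.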

RichD 2022-07-31 01:54

[QUOTE=frmky;609666]There's a good chance that won't work. The vectors are always kept on the card and may take most of the GPU memory, leaving little for the matrix blocks and spmv scratch space. Nothing beats experiment, though, so give it a try and see what happens.[/QUOTE]
Attempting a ridiculous LA with the matrix needing more than five times the GPU memory, even trying with [C]use_managed=1[/C], was a no-go as expected.
[CODE]matrix is 33782739 x 33783144 (13141.4 MB) with weight 3041417453 (90.03/col)
sparse part has weight 2904400096 (85.97/col)
using GPU 0 (NVIDIA GeForce GTX 1660)
selected card has CUDA arch 7.5
Nonzeros per block: 1750000000
Storing matrix in managed memory
converting matrix to CSR and copying it onto the GPU
Killed[/CODE]Maybe 2-3 times the size needed won't be so obnoxious. :smile:

EdH 2022-08-17 16:51

Is there a simple way to check if Msieve was successfully compiled for GPU use?

Plutie 2022-08-17 17:39

the easiest way would probably be running "msieve -nc2 -g 0" - if it outputs a line showing the VBITS value you compiled msieve with, then it's compiled properly.

EdH 2022-08-17 18:06

[QUOTE=Plutie;611645]the easiest way would probably be running "msieve -nc2 -g 0" - if it outputs a line showing the VBITS value you compiled msieve with, then it's compiled properly.[/QUOTE]Thanks, but to do that, I think -np1 would work better since I wouldn't need to create as many other files first. But I'd still need to look for a value (such as "using GPU") in the log. I was looking for a simple value check or existence check for a file, perhaps a .ptx.

Plutie 2022-08-17 18:30

ah, in that case - you can look for the lanczos_kernel.ptx file (or stage1_core.ptx)
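A hypothetical helper along those lines (the two file names come from the posts above; everything else here is an illustrative sketch, not part of msieve):

```python
from pathlib import Path

def gpu_build_artifacts(build_dir: str = ".") -> list:
    """Return which CUDA kernel files a GPU-enabled msieve build left behind.
    A CPU-only build produces neither file."""
    names = ("lanczos_kernel.ptx", "stage1_core.ptx")
    return [n for n in names if (Path(build_dir) / n).is_file()]

if gpu_build_artifacts():
    print("GPU support compiled in:", gpu_build_artifacts())
else:
    print("no .ptx files found - likely a CPU-only build")
```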

EdH 2022-08-17 18:35

[QUOTE=Plutie;611648]ah, in that case - you can look for the lanczos_kernel.ptx file (or stage1_core.ptx)[/QUOTE]Thanks! I'll work with that.

EdH 2022-08-18 13:14

I'm too excited to keep this to myself. I finally have sufficient cooling for my M40 GPU and am running a c173 that is in LA on both the 40-thread machine and the GPU machine.

This is the 40-thread (40GB) machine at start of LA:[code]Wed Aug 17 22:53:22 2022 linear algebra at 0.0%, ETA 66h33m[/code]and, current state (08:13):[code]linear algebra completed 2537146 of 16995095 dimensions (14.9%, ETA 53h31m)[/code]Here is the GPU machine at start of LA:[code]Wed Aug 17 23:11:13 2022 linear algebra at 0.0%, ETA 24h39m[/code]and, current state (08:13):[code]linear algebra completed 6241861 of 16995095 dimensions (36.7%, ETA 15h35m)[/code]Here's a little extra from the GPU machine log:[code]Wed Aug 17 22:59:01 2022 using VBITS=256
Wed Aug 17 22:59:01 2022 skipping matrix build
Wed Aug 17 22:59:04 2022 matrix starts at (0, 0)
Wed Aug 17 22:59:07 2022 matrix is 16994916 x 16995095 (5214.6 MB) with weight 1611774956 (94.84/col)
Wed Aug 17 22:59:07 2022 sparse part has weight 1163046519 (68.43/col)
Wed Aug 17 22:59:07 2022 saving the first 240 matrix rows for later
Wed Aug 17 22:59:11 2022 matrix includes 256 packed rows
Wed Aug 17 22:59:16 2022 matrix is 16994676 x 16995095 (4829.9 MB) with weight 1060776224 (62.42/col)
Wed Aug 17 22:59:16 2022 sparse part has weight 994218947 (58.50/col)
Wed Aug 17 22:59:16 2022 using GPU 0 (Tesla M40 24GB)
Wed Aug 17 22:59:16 2022 selected card has CUDA arch 5.2
Wed Aug 17 23:10:30 2022 commencing Lanczos iteration
Wed Aug 17 23:10:31 2022 memory use: 11864.2 MB[/code]The GPU is showing "12701MiB / 22945MiB" for its memory use, so I should be able to do some even larger numbers.:smile:
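For a quick sense of the gap, the two starting ETAs quoted above work out to roughly a 2.7x speedup for the M40. This is just arithmetic on the log lines, nothing more:

```python
def eta_minutes(eta: str) -> int:
    """Convert an msieve ETA string like '66h33m' to minutes."""
    hours, minutes = eta.rstrip("m").split("h")
    return int(hours) * 60 + int(minutes)

cpu = eta_minutes("66h33m")   # 40-thread machine
gpu = eta_minutes("24h39m")   # Tesla M40 24GB
print(f"initial-ETA speedup: {cpu / gpu:.2f}x")  # 2.70x
```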

LaurV 2022-08-19 03:12

Sorry, I didn't follow this thread very closely.

Are you saying that you do NFS completely on GPU? I mean, I knew poly selection could be done, and now I am reading about LA? How about sieving?
If so, where can I grab the exe and the "for dummies" tutorial? :lol:
Windows/Linux available? I may give it a try locally (where I run a few quite powerful AMD and Nvidia cards) or on Colab (where I have occasional access to P100, V100 and, if lucky, A100).

Plutie 2022-08-19 03:56

[QUOTE=LaurV;611723]Sorry, I didn't follow this thread very closely.

Are you saying that you do NFS completely on GPU? I mean, I knew poly selection could be done, and now I am reading about LA? How about sieving?
If so, where can I grab the exe and the "for dummies" tutorial? :lol:
Windows/Linux available? I may give it a try locally (where I run a few quite powerful AMD and Nvidia cards) or on Colab (where I have occasional access to P100, V100 and, if lucky, A100).[/QUOTE]

currently, polyselect and LA can be done on GPU - sieving and filtering are still on CPU.

here's a quick guide for linux specifically, but I don't think the process will be too different on windows.

[QUOTE]find the compute capability of your GPU - can be found [URL="https://developer.nvidia.com/cuda-gpus"]here[/URL].

compilation example here is for a GTX 1060 (CC 6.1)
[CODE]git clone https://github.com/gchilders/msieve_nfsathome -b msieve-lacuda-nfsathome
cd msieve_nfsathome
make all CUDA=61 VBITS=256
[/CODE][/QUOTE]
once compiled, you can run both polyselect and LA just as you would with normal msieve; just add "-g (gpu_num)" to the command. you can lower the VBITS value to fit larger matrices onto the GPU during LA, but at a performance penalty.

RichD 2022-08-29 01:57

I forgot to add [C]-g 0[/C] to the command line and it seemed to default to device 0. I did specify use_managed=1 so maybe that was enough to invoke the GPU. Then again, I may be using an earlier release.

RichD 2022-10-05 23:21

Here is a data point for the crossover using a GPU for LA.

Attempting to run with memory oversubscribed by 50+% on a 6GB card. [C]use_managed=1[/C]
[CODE]saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 10820818 x 10821229 (4662.2 MB) with weight 1103319671 (101.96/col)
sparse part has weight 1049028095 (96.94/col)
using GPU 0 (NVIDIA GeForce GTX 1660)
selected card has CUDA arch 7.5
Nonzeros per block: 1750000000
Storing matrix in managed memory
converting matrix to CSR and copying it onto the GPU
1049028095 10820818 10821229
1049028095 10821229 10820818
commencing Lanczos iteration
vector memory use: 2311.7 MB
dense rows memory use: 330.2 MB
sparse matrix memory use: 8086.0 MB
memory use: 10727.9 MB
Allocated 761.4 MB for SpMV library
Allocated 761.4 MB for SpMV library
linear algebra at 0.1%, ETA 139h41m
checkpointing every 80000 dimensions
linear algebra completed 376713 of 10821229 dimensions (3.5%, ETA 136h25m) [/CODE]Running without the use of a GPU.
[CODE]saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 10820818 x 10821229 (4662.2 MB) with weight 1103319671 (101.96/col)
sparse part has weight 1049028095 (96.94/col)
using block size 8192 and superblock size 147456 for processor cache size 6144 kB
commencing Lanczos iteration (4 threads)
memory use: 6409.8 MB
linear algebra at 0.0%, ETA 105h56m
checkpointing every 110000 dimensions
linear algebra completed 45961 of 10821229 dimensions (0.4%, ETA 103h24m) [/CODE]

