mersenneforum.org Msieve GPU LA

2021-08-02, 19:23   #1
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

3^3·5·37 Posts
Msieve GPU LA

Quote:
 Originally Posted by Xyzzy GPU = Quadro RTX 8000 LA = 3331s Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again.
Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?

2021-08-02, 19:28   #2
Xyzzy

Aug 2002

8310_10 Posts

We were told:
Quote:
 You can probably run up to a 25M-30M matrix, perhaps a bit larger, on that card.

2021-08-02, 21:48   #3
frmky

Jul 2003
So Cal

100010001100_2 Posts

Quote:
 Originally Posted by VBCurtis Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?
Unlike most of us, Mike has a high-end workstation GPU with 48GB memory. He can fit all but the largest "f small" matrices on his GPU.

2021-08-03, 11:49   #4
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10101011000110_2 Posts

Quote:
 Originally Posted by frmky After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.
I'm sorry but I am getting seriously out of date with CUDA since the updated compilers stopped working on my Ubuntu and Gentoo systems.

The three systems still in use have a 460, a 970, and a 1060 with drivers 390.138, 390.144 and 390.141 respectively.

Do you think your new code might run on any of those? If so, I will try again to get CUDA installed and working.

Thanks.

2021-08-03, 14:15   #5
frmky

Jul 2003
So Cal

2^2×547 Posts

Technically yes, but consumer cards don't have enough memory to store interesting matrices. If the GTX 1060 has 6GB, it could run matrices up to about 5Mx5M. The problem is that block Lanczos requires multiplying by both the matrix and its transpose, but GPUs only seem to work well with the matrix in CSR, which doesn't allow the transpose product to be calculated efficiently. So we load both the matrix and its transpose onto the card.

It would be possible to create a version that stores the matrices in system memory and loads the next matrix block into GPU memory while calculating the product with the current block. The block size is adjustable, but I don't know how performant that would be.

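The CSR point can be illustrated with a small CPU-side sketch in Python. This is not msieve's CUDA code, and the function names are made up for illustration. Vector entries are bit-packed words and addition over GF(2) is XOR. With CSR, y = A·x is a gather: each output word depends on one row only, so rows can be processed independently. Computing A^T·x from the same arrays becomes a scatter in which many rows XOR into the same output word, which on a GPU would need atomics or, as described above, a second copy of the matrix stored transposed.

```python
# Sparse binary matrix in CSR form: row i has ones in the columns
# col_idx[row_ptr[i]:row_ptr[i + 1]]. Vector entries are Python ints used as
# bit-packed words (stand-ins for VBITS-wide words); GF(2) addition is XOR.

def csr_spmv(row_ptr, col_idx, x):
    """y = A x: each output word is built from one row only (a gather)."""
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc ^= x[col_idx[k]]
        y[i] = acc
    return y


def csr_spmv_transpose(row_ptr, col_idx, x, n_cols):
    """y = A^T x from the same CSR arrays: rows scatter into shared outputs."""
    y = [0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Different rows can hit the same y[col]; parallel rows would collide.
            y[col_idx[k]] ^= x[i]
    return y


def csr_transpose(row_ptr, col_idx, n_cols):
    """Build CSR arrays for A^T, i.e. the second copy of the matrix."""
    counts = [0] * n_cols
    for c in col_idx:
        counts[c] += 1
    t_ptr = [0]
    for c in counts:
        t_ptr.append(t_ptr[-1] + c)
    t_idx = [0] * len(col_idx)
    fill = t_ptr[:-1]  # next free slot per column (slicing makes a copy)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            c = col_idx[k]
            t_idx[fill[c]] = i
            fill[c] += 1
    return t_ptr, t_idx
```

Running `csr_spmv` on the arrays returned by `csr_transpose` gives the same result as `csr_spmv_transpose` on the original arrays, but as a gather. The cost is storing the nonzeros twice.
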
2021-08-03, 14:45   #6
Xyzzy

Aug 2002

2×3×5×277 Posts

How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without. We haven't logged an ECC error yet. Note the "aggregate" counter described below.

Code:
ECC Errors
NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile and aggregate).
Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see the ECC documents on the web for information on compute application behavior when double bit errors occur.
Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.

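The volatile and aggregate counters described above can also be read from a script. Below is a minimal Python sketch that shells out to `nvidia-smi`. The helper names are made up for illustration, and the query field names are the ones listed by `nvidia-smi --help-query-gpu` on recent drivers, so check that list on your own driver version. Cards without ECC typically report `[N/A]` for these fields.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`; verify them
# against your driver before relying on this.
FIELDS = [
    "name",
    "ecc.mode.current",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.corrected.aggregate.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def parse_ecc_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def read_ecc_counters():
    """Run nvidia-smi and return the parsed ECC counters for every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_ecc_csv(out)
```

Calling `read_ecc_counters()` before and after a long LA run and comparing the aggregate totals is one way to see whether any errors were logged during the job.
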
2021-08-04, 00:32   #7
frmky

Jul 2003
So Cal

100010001100_2 Posts

Quote:
 Originally Posted by frmky It would be possible to create a version that stores the matrices in system memory
I did that, and it's not terrible with the right settings...
Code:
using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB)   <-------- 32 GB card
...
vector memory use: 17987.6 MB  <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB  <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB  <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB  <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
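The vector memory figure in the log can be checked against the formula in its annotation (7 x matrix columns x VBITS / 8 bytes). The Python sketch below does that arithmetic. It assumes the log's "MB" means 2^20 bytes, which is my assumption and not stated in the post. For VBITS=512 it gives roughly 17987.5, close to but not exactly the 17987.6 MB that msieve prints, so the program probably accounts for slightly more than the bare formula.

```python
MIB = 1024 * 1024  # assumption: the log's "MB" is 2**20 bytes


def vector_memory_mib(n_cols, vbits, n_vectors=7):
    """Memory for n_vectors block vectors of n_cols entries, VBITS bits each."""
    return n_vectors * n_cols * vbits / 8 / MIB


n_cols = 42101088  # from the "matrix is 42100909 x 42101088" log line
for vbits in (128, 256, 512):
    print(f"VBITS={vbits}: {vector_memory_mib(n_cols, vbits):.1f} MiB")
```

The footprint is linear in VBITS, so halving VBITS halves the vector memory kept on the card, which is the knob the "adjust VBITS as needed" note refers to.
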

2021-08-04, 02:42   #8
frmky

Jul 2003
So Cal

2^2·547 Posts

Quote:
 Originally Posted by Xyzzy How important is ECC on a video card? (Most consumer cards don't have that, right?) Our card has it, and we have it enabled, but it runs faster without. We haven't logged an ECC error yet. Note the "aggregate" counter described below.
What's your risk tolerance? msieve has robust error detection, so ECC is not as important here. But the slowdown is usually a small price to pay to rule out memory faults.

2021-08-04, 02:47   #9
mathwiz

Mar 2019

2·3^2·11 Posts

Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?

2021-08-04, 03:41   #10
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

1383_16 Posts

Quote:
 Originally Posted by frmky I did that, and it's not terrible with the right settings... Code: using VBITS=512 matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col) ... using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card ... vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed memory use: 51554.6 MB <-- significantly exceeds 32 GB Allocated 357.7 MB for SpMV library ... linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
This is simply amazing! I'm running a matrix that size for GNFS-201 (from f-small) right now, at ~700 hr on a 12-core single-socket Haswell.
I hope this means you'll be digging out of your matrix backlog from the big siever queue.

2021-08-04, 04:51   #11
frmky

Jul 2003
So Cal

2^2·547 Posts

Quote:
 Originally Posted by mathwiz Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?
It's very much a work in progress, and things may change or occasionally be broken, but you can play with it. I have it on GitHub. I recommend using CUDA 10.2 because CUDA 11.x incorporates CUB into the toolkit and tries to force you to use it, but it's missing a few pieces, which complicates things. You can get the source and build it with

git clone https://github.com/gchilders/msieve_nfsathome.git -b msieve-lacuda-nfsathome
cd msieve_nfsathome
make all VBITS=128 CUDA=XX

where XX is the two-digit CUDA compute capability of your GPU. Specifying CUDA=1 defaults to a compute capability of 60. You may want to experiment with both VBITS=128 and VBITS=256 to see which is best on your GPU.
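As a sketch of how the CUDA=XX value is formed, the Python helper below drops the dot from the compute capability and builds the make command line shown above. The helper name is made up for illustration. The capabilities listed are NVIDIA's published values for cards mentioned in this thread, and I am assuming the "460, 970, and 1060" are GeForce GTX models. Whether this branch actually builds and runs on each of them is not something the thread confirms. In particular, CUDA 10.2 no longer targets Fermi-class cards such as the GTX 460.

```python
# Compute capabilities for GPUs mentioned in this thread (NVIDIA's published
# values). Listing a card here does not mean the code is known to run on it.
COMPUTE_CAPABILITY = {
    "GTX 460": "2.1",   # Fermi; not a supported target in CUDA 10.2
    "GTX 970": "5.2",
    "GTX 1060": "6.1",
    "Tesla V100": "7.0",
    "Quadro RTX 8000": "7.5",
}


def make_command(gpu, vbits=128):
    """Build the make line from the post, with CUDA=XX set for the given GPU."""
    major, minor = COMPUTE_CAPABILITY[gpu].split(".")
    return f"make all VBITS={vbits} CUDA={major}{minor}"


for gpu in ("GTX 1060", "Tesla V100", "Quadro RTX 8000"):
    print(f"{gpu}: {make_command(gpu)}")
    print(f"{gpu}: {make_command(gpu, vbits=256)}")
```

Printing both the VBITS=128 and VBITS=256 variants mirrors the suggestion above to try both and keep whichever is faster on your GPU.
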

If you want to copy msieve to another directory, you need the msieve binary, both *.ptx files, and in the cub directory both *.so files. Or just run it from the build directory.

Last fiddled with by frmky on 2021-08-12 at 08:17 Reason: Add specifying the compute capability on the make command line.

All times are UTC.