mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve GPU LA (https://www.mersenneforum.org/showthread.php?t=27042)

VBCurtis 2021-08-02 19:23

Msieve GPU LA
 
[QUOTE=Xyzzy;584617][C]GPU = Quadro RTX 8000
LA = 3331s[/C]

Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again. [/QUOTE]

Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?

Xyzzy 2021-08-02 19:28

We were told:[QUOTE]You can probably run up to a 25M-30M matrix, perhaps a bit larger, on that card.[/QUOTE]

frmky 2021-08-02 21:48

[QUOTE=VBCurtis;584640]Do you mean you'll just leave the big matrices to others? How big can you solve on your GPU?[/QUOTE]
Unlike most of us, Mike has a high-end workstation GPU with 48GB memory. He can fit all but the largest "f small" matrices on his GPU.

xilman 2021-08-03 11:49

[QUOTE=frmky;584454]After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.[/QUOTE]I'm sorry but I am getting seriously out of date with CUDA since the updated compilers stopped working on my Ubuntu and Gentoo systems.

The three systems still in use have a 460, a 970, and a 1060 with drivers 390.138, 390.144 and 390.141 respectively.

Do you think your new code might run on any of those? If so, I will try again to get CUDA installed and working.

Thanks.

frmky 2021-08-03 14:15

Technically yes, but consumer cards don't have enough memory to store interesting matrices. If the GTX 1060 has 6 GB, it could run matrices up to about 5M x 5M. The problem is that block Lanczos requires multiplying by both the matrix and its transpose, but GPUs only seem to perform well with the matrix stored in CSR format, which doesn't allow the transpose product to be computed efficiently. So we load both the matrix and its transpose onto the card.
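
For illustration only (a generic sketch, not msieve's actual kernel), a one-thread-per-row CSR SpMV over GF(2) bit-vector blocks looks roughly like this; with the same row-oriented data, the transpose product would have to scatter with atomics, which is why the second copy is needed:

[CODE]/* Sketch, not msieve's code: y = A*x over GF(2), one thread per row.
   Each x[i]/y[i] is a 64-bit word of the block vector (think VBITS=64). */
__global__ void csr_spmv_gf2(int nrows,
                             const int *row_ptr,   /* nrows+1 offsets into col_idx */
                             const int *col_idx,   /* column of each nonzero */
                             const unsigned long long *x,
                             unsigned long long *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows)
        return;

    unsigned long long acc = 0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
        acc ^= x[col_idx[j]];   /* GF(2): every nonzero is 1, so just XOR gathered words */
    y[row] = acc;               /* one clean write per row, no atomics */
}

/* With the same CSR data, y = A^T * x would instead have to scatter:
   atomicXor(&y[col_idx[j]], x[row]) for every nonzero, which GPUs handle
   poorly -- hence the explicitly transposed second copy on the card. */[/CODE]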

It would be possible to create a version that stores the matrices in system memory and loads the next matrix block into GPU memory while calculating the product with the current block. The block size is adjustable, but I don't know how performant that would be.
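
A hedged sketch of that idea (hypothetical names throughout, not the actual msieve code): keep the matrix blocks in pinned host memory, double-buffer two device blocks, and use two streams so the copy of block i+1 overlaps the multiply on block i.

[CODE]/* Sketch only: stream matrix blocks from pinned host memory through a
   double buffer, overlapping transfer of block i+1 with SpMV on block i.
   spmv_block() and struct matrix_block are hypothetical stand-ins. */
#include <cuda_runtime.h>

struct matrix_block { const void *host_ptr; size_t bytes; };

__global__ void spmv_block(const void *blk, const unsigned long long *x,
                           unsigned long long *y);   /* hypothetical kernel */

void spmv_streamed(const struct matrix_block *blocks, int num_blocks,
                   size_t max_block_bytes,
                   const unsigned long long *x, unsigned long long *y)
{
    cudaStream_t copy_s, comp_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&comp_s);

    void *dev_buf[2];                        /* double buffer on the card */
    cudaMalloc(&dev_buf[0], max_block_bytes);
    cudaMalloc(&dev_buf[1], max_block_bytes);

    /* preload the first block */
    cudaMemcpyAsync(dev_buf[0], blocks[0].host_ptr, blocks[0].bytes,
                    cudaMemcpyHostToDevice, copy_s);
    cudaStreamSynchronize(copy_s);

    for (int i = 0; i < num_blocks; i++) {
        int cur = i & 1, nxt = cur ^ 1;

        if (i + 1 < num_blocks)              /* start fetching the next block now */
            cudaMemcpyAsync(dev_buf[nxt], blocks[i + 1].host_ptr,
                            blocks[i + 1].bytes,
                            cudaMemcpyHostToDevice, copy_s);

        spmv_block<<<4096, 256, 0, comp_s>>>(dev_buf[cur], x, y);

        cudaStreamSynchronize(copy_s);       /* next block is resident...      */
        cudaStreamSynchronize(comp_s);       /* ...and this buffer is reusable */
    }

    cudaFree(dev_buf[0]); cudaFree(dev_buf[1]);
    cudaStreamDestroy(copy_s); cudaStreamDestroy(comp_s);
}[/CODE]

How well this hides the PCIe traffic would depend on the block size relative to the per-block SpMV time.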

Xyzzy 2021-08-03 14:45

How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but the card runs faster with ECC disabled.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.

[CODE]ECC Errors
NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit
errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile
and aggregate). Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see
the ECC documents on the web for information on compute application behavior when double bit errors occur. Volatile error counters track the number of errors detected since the
last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.[/CODE]:mike:
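
For reference, the counters described above can be queried, and ECC toggled, with nvidia-smi (the mode change only takes effect after the next GPU reset or reboot):

[CODE]# read volatile and aggregate ECC error counters
nvidia-smi -q -d ECC

# disable / enable ECC
nvidia-smi -e 0
nvidia-smi -e 1[/CODE]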

frmky 2021-08-04 00:32

[QUOTE=frmky;584710]It would be possible to create a version that stores the matrices in system memory[/QUOTE]
I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE]
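
As a sanity check of the vector-memory comment in that log (my arithmetic, not msieve output): 7 × 42,101,088 columns × 512/8 bytes ≈ 18.9 GB ≈ 17,987.6 MiB, which matches the reported figure. Dropping to VBITS=256 would roughly halve that term, at the cost of more Lanczos iterations.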

frmky 2021-08-04 02:42

[QUOTE=Xyzzy;584714]How important is ECC on a video card? (Most consumer cards don't have that, right?)

Our card has it, and we have it enabled, but it runs faster without.

We haven't logged an ECC error yet. Note the "aggregate" counter described below.
[/QUOTE]
What's your risk tolerance? msieve has robust error detection, so ECC is not as critical as it might otherwise be. But it's usually a small performance price to pay to guard against silent memory faults.

mathwiz 2021-08-04 02:47

Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?

VBCurtis 2021-08-04 03:41

[QUOTE=frmky;584751]I did that, and it's not terrible with the right settings...
[CODE]using VBITS=512
matrix is 42100909 x 42101088 (20033.9 MB) with weight 6102777434 (144.96/col)
...
using GPU 0 (Tesla V100-SXM2-32GB) <-------- 32 GB card
...
vector memory use: 17987.6 MB <-- 7 x matrix columns x VBITS / 8 bytes on card, adjust VBITS as needed
dense rows memory use: 2569.6 MB <-- on card but could be moved to cpu memory
sparse matrix memory use: 30997.3 MB <-- Hosted in cpu memory, transferred on card as needed
memory use: 51554.6 MB <-- significantly exceeds 32 GB
Allocated 357.7 MB for SpMV library
...
linear algebra completed 33737 of 42101088 dimensions (0.1%, ETA 133h21m)
[/CODE][/QUOTE]

This is simply amazing! I'm running a matrix that size for GNFS-201 (from f-small) right now, at ~700 hr on a 12-core single-socket Haswell.
I hope this means you'll be digging out of your matrix backlog from the big siever queue. :tu:

frmky 2021-08-04 04:51

[QUOTE=mathwiz;584761]Are there instructions on how to check out and build the msieve GPU LA code? Is it in trunk or a separate branch?[/QUOTE]

It's very much a work in progress and things may change or occasionally be broken, but you can play with it. I have it on GitHub. I recommend using CUDA 10.2 because CUDA 11.x incorporates CUB into the toolkit and tries to force you to use that bundled copy, but it's missing a few pieces, which complicates things. You can get the source with

[CODE]git clone https://github.com/gchilders/msieve_nfsathome.git -b msieve-lacuda-nfsathome
cd msieve_nfsathome
make all VBITS=128 CUDA=XX[/CODE]

where XX is the two-digit CUDA compute capability of your GPU. Specifying CUDA=1 defaults to a compute capability of 60. You may want to experiment with both VBITS=128 and VBITS=256 to see which is best on your GPU.

If you want to copy msieve to another directory, you need the msieve binary, both *.ptx files, and the two *.so files from the cub directory. Or just run it from the build directory.
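
For reference (standard NVIDIA figures, not from this thread), the cards mentioned above map to compute capabilities of 6.1 for the GTX 1060, 7.0 for the Tesla V100, and 7.5 for the Quadro RTX 8000, so for example:

[CODE]make all VBITS=128 CUDA=61    # GTX 1060
make all VBITS=256 CUDA=70    # Tesla V100[/CODE]

The VBITS values here are just the two the post suggests trying, not measured recommendations.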

