mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Faster GPU-ECM with CGBN (https://www.mersenneforum.org/showthread.php?t=27103)

SethTro 2021-11-12 04:04

I spent a good part of this week trying to implement fast squaring for CGBN. Ultimately [URL="https://github.com/NVlabs/CGBN/issues/19#issuecomment-966779554"]my code[/URL] was 10% slower and still had edge cases that broke.

In the best case, with squaring 100% faster (i.e. half the cost of a multiply), there are 4 `mont_sqr` and 4 `mont_mul`, so it would only be 8 / (4/2 + 4) - 1 = 33% faster.

Using [URL="https://gmplib.org/manual/Basecase-Multiplication"]GMP's 50% faster number[/URL], it would be 8 / (4/1.5 + 4) - 1 = 20% faster.
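
For reference, the same arithmetic as a tiny sketch (the 4 `mont_sqr` + 4 `mont_mul` counts are the ones quoted above; the two squaring-speedup factors are just the assumptions being compared):

[code]
/* Expected stage 1 gain if a squaring costs 1/f of a multiply,
   given the 4 mont_sqr + 4 mont_mul counts quoted above. */
#include <stdio.h>

static double gain(double f) {
    return 8.0 / (4.0 / f + 4.0) - 1.0;  /* relative improvement over 8 multiplies */
}

int main(void) {
    printf("squaring 2.0x as fast: %.0f%% faster\n", 100.0 * gain(2.0));  /* 33% */
    printf("squaring 1.5x as fast: %.0f%% faster\n", 100.0 * gain(1.5));  /* 20% */
    return 0;
}
[/code]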

I'll reach out to the author of the repo, since they mention fast squaring in their paper "Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs" ([url]http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf[/url]), but I don't expect it to happen.

henryzz 2021-11-27 22:05

I just tried to upgrade my version of this, as I was on a fairly old version and certain numbers were crashing.

Compilation failed with the following errors:

[CODE]/bin/bash ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

4 errors detected in the compilation of "cgbn_stage1.cu".[/CODE]

Have I messed something up while updating my local git repository, or is the gpu_integration branch currently broken?

henryzz 2021-11-28 04:43

I may have discovered the issue: I think I need to update CGBN.
Edit: confirmed
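
For anyone else who hits this: the errors above come from cgbn_stage1.cu calling cgbn_swap, which an older CGBN checkout doesn't have, so updating CGBN is the real fix. If that weren't an option, the same effect could presumably be had with cgbn_set. A minimal sketch (not the actual cgbn_stage1.cu code; env_t, bn_env, a and b stand in for whatever the kernel uses, and cgbn/cgbn.h must already be included):

[code]
// Sketch only: a drop-in swap if the CGBN checkout lacks cgbn_swap.
// env_t is the cgbn_env_t<context_t, BITS> type the kernel already uses.
template<class env_t>
__device__ void swap_workaround(env_t &bn_env,
                                typename env_t::cgbn_t &a,
                                typename env_t::cgbn_t &b) {
  typename env_t::cgbn_t tmp;
  cgbn_set(bn_env, tmp, a);   // tmp = a
  cgbn_set(bn_env, a, b);     // a   = b
  cgbn_set(bn_env, b, tmp);   // b   = tmp
}
[/code]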

chris2be8 2022-03-03 16:45

My GTX 970 has burnt out, so I've had to replace it with an RTX 3060 Ti. That's sm_86, so I had to reinstall ECM-GPU.

After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
[c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c]
But ./configure doesn't support a GPU arch above 75, so I had to run:
[c]./configure --enable-gpu=75 --with-cuda=/usr/local/cuda CC=gcc-9 --with-cgbn-include=/home/chris/CGBN/include/cgbn[/c]
Then I manually updated the makefiles to sm_86.
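For the record, a one-liner along these lines should do that manual edit, assuming the compute_75/sm_75 strings only appear in the nvcc flags of the generated Makefile (repeat for any sub-directory Makefile that carries the same flags):
[c]sed -i 's/compute_75/compute_86/g; s/sm_75/sm_86/g' Makefile[/c]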

nvcc -h says, in part:
[code]
--gpu-code <code>,... (-code)
Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
nvcc embeds a compiled code image in the resulting executable for each specified
<code> architecture, which is a true binary load image for each 'real' architecture
(such as sm_50), and PTX code for the 'virtual' architecture (such as compute_50).
During runtime, such embedded PTX code is dynamically compiled by the CUDA
runtime system if no binary load image is found for the 'current' GPU.
Architectures specified for options '--gpu-architecture' and '--gpu-code'
may be 'virtual' as well as 'real', but the <code> architectures must be
compatible with the <arch> architecture. When the '--gpu-code' option is
used, the value for the '--gpu-architecture' option must be a 'virtual' PTX
architecture.
For instance, '--gpu-architecture=compute_60' is not compatible with '--gpu-code=sm_52',
because the earlier compilation stages will assume the availability of 'compute_60'
features that are not present on 'sm_52'.
Note: the values compute_30, compute_32, compute_35, compute_37, compute_50,
sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in
a future release.
Allowed values for this option: 'compute_35','compute_37','compute_50',
'compute_52','compute_53','compute_60','compute_61','compute_62','compute_70',
'compute_72','compute_75','compute_80','compute_86','compute_87','lto_35',
'lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62','lto_70',
'lto_72','lto_75','lto_80','lto_86','lto_87','sm_35','sm_37','sm_50','sm_52',
'sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75','sm_80','sm_86',
'sm_87'.
[/code]

That version of nvcc has an option, --list-gpu-code, to list the GPU architectures the compiler supports, but older versions don't have it.
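On a toolkit recent enough to have it, that's simply:
[c]nvcc --list-gpu-code[/c]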

Older versions of nvcc will probably give a list with:
[c]nvcc -h | grep -o -E 'sm_[0-9]+' | sort -u[/c]
But that won't work for 11.6, because the help text still lists sm_30 as deprecated even though it's no longer a valid target.

It seems to work OK, but I've not tried it on a big job yet. And I need to update my scripts because the new GPU does 2432 stage 1 curves per run, which limits its usefulness if I just need to do a t30.

@SethTro, can you update configure to support sm_86?

One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.

EdH 2022-03-04 03:45

[QUOTE=chris2be8;601028]My GTX 970 has burnt out . . .
One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.[/QUOTE]Sorry to hear about your card. Can it be repaired?

You're probably aware of this site, but I've been having good luck at [URL="https://www.techpowerup.com/gpu-specs/"]techpowerup[/URL] for all the details on the various cards.

e.g. [URL]https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-ti.c3681[/URL], which shows CUDA 8.6.

chris2be8 2022-03-04 16:39

And I've got another problem with the new card:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 110000000 110000000 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=110000004, sigma=3:3698165927-3:3698168358 (2432 curves)
ecm: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.
Aborted (core dumped)
[/code]

It did t50 (B1 up to 43000000) OK, but failed with B1=110000000.

Testing various B1 values, it fails at 70000000 but works at 60000000:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 60000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=60000000, B2=1, sigma=3:4285427795-3:4285430226 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 65 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 3151ms of CPU time / 2557979ms of GPU time
[/code]
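
For what it's worth, the 60M/70M boundary matches the assert limit: assuming s in allocate_and_set_s_bits is the usual stage 1 exponent (the product of all prime powers <= B1), its bit length is roughly B1*log2(e) ~= 1.4427*B1 by the prime number theorem, which crosses the hard-coded 100000000-bit limit somewhere between B1=60000000 and B1=70000000. A quick check, using only that approximation:

[code]
/* Approximate bit length of the stage 1 exponent s for a given B1,
   assuming lg(s) ~= B1 * log2(e). The assert's limit is 1e8 bits. */
#include <stdio.h>

int main(void) {
    double b1[] = {43e6, 60e6, 70e6, 110e6};
    for (int i = 0; i < 4; i++)
        printf("B1=%.0f -> ~%.3g bits\n", b1[i], 1.4427 * b1[i]);
    return 0;
}
/* B1=43000000  -> ~6.2e+07  bits (ok)
   B1=60000000  -> ~8.66e+07 bits (ok)
   B1=70000000  -> ~1.01e+08 bits (fails)
   B1=110000000 -> ~1.59e+08 bits (fails) */
[/code]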

And I've just started a test at B1=110000000 *without* -cgbn and it seems to be running (the failures happened after a few seconds). I may be able to get round this by not using -cgbn but that's not ideal.
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -save test2.save 110000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:2243519347-3:2243521778 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 1024
GPU: numRegsPerThread = 30 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 76x1x1 (2432 parallel curves)
[/code]

@SethTro, do you want any more information about this bug? I can probably get a core dump if you want.

@EdH, I don't think the old card can be repaired; it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.

EdH 2022-03-05 00:44

[QUOTE=chris2be8;601085]. . .
@EdH, I don't think the old card can be repaired; it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.[/QUOTE]That's something I haven't been trying to make use of yet. I still haven't figured out the correlation between sm/cores/?? and how many curves ECM runs in parallel.

Gimarel 2022-03-05 06:13

[QUOTE=chris2be8;601028]After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
[c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c]
[/QUOTE]
CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c]

[QUOTE=chris2be8;601028]And I need to update my scripts because the new GPU does 2432 stage 1 curves per run.
[/QUOTE]
There are 4864 shader units on this card according to the technical info linked above. So if this is correct, it's better to run 4864 curves at once.

[QUOTE=chris2be8;601028]Which limits its use if I just need to do t30.
[/QUOTE]

Why? If you just want to do a t30, use 4864 curves with B1=37e4 and skip stage 2; that should be about a t30. Unless you have a very powerful CPU, it should be faster.
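
Concretely, something like this should do it (a sketch; it assumes the -gpucurves option available in GPU builds, uses B2=1 to skip stage 2 as in the runs above, and numbers.txt stands in for your input file):
[c]./ecm -gpu -cgbn -gpucurves 4864 37e4 1 < numbers.txt[/c]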

EdH 2022-03-05 13:55

[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c][/QUOTE]Is this where I should retrieve GMP-ECM rather than the svn source I reference, or is the svn source still current? Is the git source the official one?

[QUOTE=Gimarel;601117]There are 4864 shader units on this card according to the technical info linked above. So if this is correct, it's better to run 4864 curves at once.[/QUOTE]This is confusing to me. GMP-ECM defaults to 64 curves for an NVS 510 with 192 shading units, while for my K20X with 2688 shading units the default is 896 curves. If I double (triple, etc.) the curves, it doubles (triples, etc.) the GPU time taken. This is all with the svn download.

All help in understanding this is appreciated.

Gimarel 2022-03-05 14:45

I don't know. I have a 2060 Super that has 2176 shader units. Anything below 2176 curves takes as much time as 2176 curves. Total throughput is about 5-10% better for 4352 concurrent curves.

EdH 2022-03-06 13:30

[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c]
. . .[/QUOTE]I am confused (yet again). How do I start from scratch to compile GMP-ECM with CGBN for an sm_35 card?

