
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Faster GPU-ECM with CGBN (https://www.mersenneforum.org/showthread.php?t=27103)

SethTro 2021-11-04 10:12

I was playing around with CGBN today and I realized that it [URL="https://github.com/NVlabs/CGBN/blob/master/include/cgbn/impl_cuda.cu#L1033"]doesn't use fast squaring[/URL]. In GMP, fast squaring yields a [URL="https://gmplib.org/manual/Basecase-Multiplication"]1.5x speedup[/URL]. I filed [URL="https://github.com/NVlabs/CGBN/issues/19"]issue 19[/URL] asking the author what would be needed to add support for fast squaring.
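The trick behind fast squaring: in the schoolbook product of a number with itself, every cross term limbs[i]*limbs[j] (i != j) appears twice, so a squaring needs only n(n+1)/2 limb products instead of the n^2 of a general multiply. A toy sketch with Python ints standing in for limb arithmetic (illustration only, not CGBN or GMP code):

```python
def to_limbs(x, base=2**32):
    """Split a non-negative int into little-endian limbs."""
    out = []
    while x:
        out.append(x % base)
        x //= base
    return out or [0]

def square_via_halved_products(limbs, base=2**32):
    """Square a limb array with n*(n+1)/2 limb products: each cross
    term limbs[i]*limbs[j] (i != j) is computed once and doubled."""
    acc = 0
    n = len(limbs)
    for i in range(n):
        acc += limbs[i] * limbs[i] * base ** (2 * i)          # diagonal terms
        for j in range(i + 1, n):
            acc += 2 * limbs[i] * limbs[j] * base ** (i + j)  # doubled cross terms
    return acc

x = 0x1234567890ABCDEF1234567890ABCDEF
assert square_via_halved_products(to_limbs(x)) == x * x
```

In practice the saving is well under 2x, since the doubled cross products still cost shifts, additions and carries, which is consistent with GMP quoting only ~1.5x.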

I then discovered their paper ([URL="http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf"]"Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs"[/URL]) on the subject. It suggests that a more modest 20-30% gain is likely.

The main doubling loop contains 11 additions, 4 multiplications, and 4 squarings, so this would likely only be a ~10% final gain, but it's something we (READ: I) should try to track down.

unconnected 2021-11-08 13:41

BTW, the 54-digit factor was found using Kaggle and ECM with CGBN support. 3584@43e6 took almost 3 hours for stage1 on Tesla P100.

[CODE]Resuming ECM residue saved by @58c8c7d3f28a with GMP-ECM 7.0.5-dev on Sun Nov 7 16:49:14 2021
Input number is 32548578398364358484341350345766214474783986512971108655859583723767495515336168718870906961859034438402149815916929838626831190652930427474273050773518305674391 (161 digits)
Using B1=43000000-43000000, B2=240490660426, polynomial Dickson(12), sigma=3:2723506384
Step 1 took 1ms
Step 2 took 40649ms
********** Factor found in step 2: 414964253388127406110807725062798487272054568225225131
Found prime factor of 54 digits: 414964253388127406110807725062798487272054568225225131
Prime cofactor 78437065681223350191914183317403238121110774952722134895604084183194227719884667275947965573494888337419461 has 107 digits
[/CODE]

SethTro 2021-11-12 04:04

I spent a good part of this week trying to implement fast squaring for CGBN. Ultimately [URL="https://github.com/NVlabs/CGBN/issues/19#issuecomment-966779554"]my code[/URL] was 10% slower and still had edge cases that broke it.

In the best case with 100% faster fast squaring, there are 4 `mont_sqr` and 4 `mont_mul`, so it would only be 8 / (4 / 2 + 4) - 1 = 33% faster.

Using [URL="https://gmplib.org/manual/Basecase-Multiplication"]GMP's 50% faster number[/URL], it would be 8 / (4 / 1.5 + 4) - 1 = 20% faster.
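A quick sanity check of that arithmetic (a throwaway sketch; it just assumes the 4 `mont_sqr` and 4 `mont_mul` dominate and otherwise cost the same):

```python
# Per doubling step: 4 mont_sqr + 4 mont_mul = 8 multiply-equivalents.
# If a squaring is `sqr_speedup` times faster than a multiply, the
# step shrinks to 4/sqr_speedup + 4 units, so the overall gain is:
def loop_speedup(sqr_speedup):
    return 8 / (4 / sqr_speedup + 4) - 1

assert abs(loop_speedup(2.0) - 1 / 3) < 1e-12  # ideal 2x squaring: ~33% faster
assert abs(loop_speedup(1.5) - 0.2) < 1e-12    # GMP's 1.5x figure: 20% faster
```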

I'll reach out to the author of the repo, because they mention fast squaring in their paper "Optimizing Modular Multiplication for NVIDIA’s Maxwell GPUs" ([url]http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf[/url]), but it's unlikely to happen.

henryzz 2021-11-27 22:05

Just tried to upgrade my version of this as I was on a fairly old version and certain numbers were crashing.

Compiling has failed with the following error:

[CODE]/bin/bash ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

4 errors detected in the compilation of "cgbn_stage1.cu".[/CODE]

Have I messed something up while updating my local git repository or is the gpu_integration branch broken currently?

henryzz 2021-11-28 04:43

May have discovered the issue. I think I need to update CGBN.
edit: confirmed

chris2be8 2022-03-03 16:45

My GTX 970 has burnt out, so I've had to replace it with an RTX 3060 Ti. That's sm_86, so I had to reinstall ECM-GPU.

After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
[c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c]
But ./configure doesn't support GPU archs above 75, so I had to run:
[c]./configure --enable-gpu=75 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/chris/CGBN/include/cgbn[/c]
Then manually update the makefiles to sm_86.
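In case it helps anyone else, the manual edit can be scripted. This is just a hypothetical helper (the function name and approach are mine, not part of gmp-ecm); it only does the textual compute_75/sm_75 to compute_86/sm_86 substitution in the generated makefiles:

```python
import pathlib

def retarget_arch(makefile, old="75", new="86"):
    """Rewrite arch=compute_75,code=sm_75 style flags in a generated
    Makefile so nvcc targets a newer GPU (e.g. sm_86)."""
    text = makefile.read_text()
    for prefix in ("compute_", "sm_"):
        text = text.replace(prefix + old, prefix + new)
    makefile.write_text(text)

# e.g. apply to every generated Makefile under the build tree:
# for m in pathlib.Path(".").rglob("Makefile"):
#     retarget_arch(m)
```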

nvcc -h says in part
[code]
--gpu-code <code>,... (-code)
Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
nvcc embeds a compiled code image in the resulting executable for each specified
<code> architecture, which is a true binary load image for each 'real' architecture
(such as sm_50), and PTX code for the 'virtual' architecture (such as compute_50).
During runtime, such embedded PTX code is dynamically compiled by the CUDA
runtime system if no binary load image is found for the 'current' GPU.
Architectures specified for options '--gpu-architecture' and '--gpu-code'
may be 'virtual' as well as 'real', but the <code> architectures must be
compatible with the <arch> architecture. When the '--gpu-code' option is
used, the value for the '--gpu-architecture' option must be a 'virtual' PTX
architecture.
For instance, '--gpu-architecture=compute_60' is not compatible with '--gpu-code=sm_52',
because the earlier compilation stages will assume the availability of 'compute_60'
features that are not present on 'sm_52'.
Note: the values compute_30, compute_32, compute_35, compute_37, compute_50,
sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in
a future release.
Allowed values for this option: 'compute_35','compute_37','compute_50',
'compute_52','compute_53','compute_60','compute_61','compute_62','compute_70',
'compute_72','compute_75','compute_80','compute_86','compute_87','lto_35',
'lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62','lto_70',
'lto_72','lto_75','lto_80','lto_86','lto_87','sm_35','sm_37','sm_50','sm_52',
'sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75','sm_80','sm_86',
'sm_87'.
[/code]

That nvcc has an option, --list-gpu-code, to list the GPU architectures supported by the compiler. But older versions of it don't have that option.

Older versions of nvcc will probably give a list with:
[c]nvcc -h | grep -o -E 'sm_[0-9]+' | sort -u[/c]
But that won't work cleanly for 11.6, because the help text still mentions sm_30 in the deprecation note even though it's no longer a valid value.

It seems to work OK, but I've not tried it on a big job yet. And I need to update my scripts, because the new GPU does 2432 stage 1 curves per run, which limits its use if I just need to do t30.

@ SethTro, can you update configure to support sm_86?

One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.

EdH 2022-03-04 03:45

[QUOTE=chris2be8;601028]My GTX 970 has burnt out . . .
One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.[/QUOTE]Sorry to hear about your card. Can it be repaired?

You're probably aware of this site, but I've been having good luck at [URL="https://www.techpowerup.com/gpu-specs/"]techpowerup[/URL] for all the details on the various cards.

e.g. [URL]https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-ti.c3681[/URL], which shows CUDA 8.6.

chris2be8 2022-03-04 16:39

And I've got another problem with the new card:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 110000000 110000000 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=110000004, sigma=3:3698165927-3:3698168358 (2432 curves)
ecm: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.
Aborted (core dumped)
[/code]

It did t50 (B1 up to 43000000) OK, but failed with B1=110000000.

Testing various B1 values, it fails at 70000000 but works at 60000000:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 60000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=60000000, B2=1, sigma=3:4285427795-3:4285430226 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 65 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 3151ms of CPU time / 2557979ms of GPU time
[/code]

And I've just started a test at B1=110000000 *without* -cgbn and it seems to be running (the failures happened after a few seconds). I may be able to get round this by not using -cgbn but that's not ideal.
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -save test2.save 110000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:2243519347-3:2243521778 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 1024
GPU: numRegsPerThread = 30 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 76x1x1 (2432 parallel curves)
[/code]

@SethTro, do you want any more information about this bug? I can probably get a core dump if you want.

@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.

EdH 2022-03-05 00:44

[QUOTE=chris2be8;601085]. . .
@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.[/QUOTE]Which I haven't been trying to make use of yet. I still haven't figured out the correlation with sm/cores/?? and how many parallel processes are run by ECM.

Gimarel 2022-03-05 06:13

[QUOTE=chris2be8;601028]After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
[c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c]
[/QUOTE]
CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c]

[QUOTE=chris2be8;601028]And I need to update my scripts because the new GPU does 2432 stage 1 curves per run.
[/QUOTE]
There are 4864 shader units on this card according to the technical info linked above. So if this is correct, it's better to run 4864 curves at once.

[QUOTE=chris2be8;601028]Which limits its use if I just need to do t30.
[/QUOTE]

Why? If you just want to do t30, use 4864 curves with B1=37e4 and skip stage 2. That should be about a t30. Unless you have a very powerful CPU, it should be faster.

EdH 2022-03-05 13:55

[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c][/QUOTE]Is this where I should retrieve GMP-ECM rather than the svn source I reference, or is the svn source still current? Is the git source the official one?

[QUOTE=Gimarel;601117] There are 4864 shader units on this card according to the technical info linked above. So if this is correct, it's better to run 4864 curves at once.[/QUOTE]This is confusing to me. GMP-ECM defaults to 64 curves for an NVS 510 with 192 shading units, and for my K20X with 2688 shading units the default is 896 curves. If I double (triple, etc.) the curves, it doubles (triples, etc.) the GPU time taken. This is all with the svn download.

All help in understanding this is appreciated.

Gimarel 2022-03-05 14:45

I don't know. I have a 2060 Super that has 2176 shader units. Anything below 2176 curves takes as much time as 2176 curves. Total throughput is about 5-10% better for 4352 concurrent curves.
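That plateau is what you'd expect if the card runs curves in fixed-size waves: with fewer curves than one full wave, the idle shader units do no useful work, and at exact multiples of the wave size the per-curve rate is flat. A toy model of that shape (nothing measured here, just the arithmetic):

```python
import math

def toy_throughput(curves, capacity=2176, wave_time=1.0):
    """Toy occupancy model: the GPU needs ceil(curves/capacity)
    full-length waves, so a partial wave wastes the idle units."""
    waves = math.ceil(curves / capacity)
    return curves / (waves * wave_time)

assert toy_throughput(1088) == toy_throughput(2176) / 2  # half-full wave: half rate
assert toy_throughput(2175) < toy_throughput(2176)       # anything below is worse
assert toy_throughput(4352) == toy_throughput(2176)      # two full waves: flat
```

The real 5-10% gain observed at 2x the shader count presumably comes from latency hiding, which this toy model ignores.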

EdH 2022-03-06 13:30

[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c]
. . .[/QUOTE]I am confused (yet, again). How do I start from scratch to compile GMP-ECM with CGBN for an sm_35 card?

chris2be8 2022-03-06 16:37

[QUOTE=EdH;601195]I am confused (yet, again). How do I start from scratch to compile GMP-ECM with CGBN for an sm_35 card?[/QUOTE]

I would try:
[code]mkdir ecm-gpu # Change dir name if you want
cd ecm-gpu
git clone https://gitlab.inria.fr/zimmerma/ecm.git
cd ecm
autoreconf -si
./configure --enable-gpu=35 -with-cgbn-include=/home/chris/ecm-seth/CGBN/include/cgbn # Update path to CGBN as appropriate
make[/code]

But I can't test it because I don't have a sm_35 GPU.

EdH 2022-03-06 18:54

I guess my problem is getting CGBN. I'd already done the rest, but when I tried to get CGBN I got a "forbidden" message.

I thought the post from Gimarel meant that CGBN was included with GMP-ECM at that location.

I'll study the thread further. . .

SethTro 2022-03-06 19:26

"Getting" CGBN should be `git clone https://github.com/NVlabs/CGBN.git`, which will create a new directory "CGBN"; then the commands chris2be8 wrote (thanks) should work.

I got access to contribute back to gmp-ecm so I will work on adding support for sm_86.

@chris2be8. I think you have an old version of the code. That limit was removed at some point. Can you check that you are using [url]https://gitlab.inria.fr/zimmerma/ecm.git[/url] and not my personal repository ([url]https://github.com/sethtroisi/gmp-ecm[/url]).

EdH 2022-03-06 19:47

[QUOTE=SethTro;601229]"Getting" CGBN should be `git clone https://github.com/NVlabs/CGBN.git` which will create a new directory "CGBN" then the commands chris2be8 wrote (thanks) should work.

I got access to contribute back to gmp-ecm so I will work on adding support for sm_86.

@chris2be8. I think you have an old version of the code. That limit was removed at some point. Can you check that you are using [URL]https://gitlab.inria.fr/zimmerma/ecm.git[/URL] and not my personal repository ([URL]https://github.com/sethtroisi/gmp-ecm[/URL]).[/QUOTE]
OK, Thanks! I think I'm getting somewhere. I found CGBN, but had trouble with googletest when unzip crashed with mismatched internal names. (Maybe I have an old CGBN?) I think I've worked around that now and am trying to compile CGBN, but it looks like it's stuck.

I think I'm where I can work it a little further. I'll get back in a bit.

EdH 2022-03-06 20:09

Well, this isn't making any sense to me:[code]configure: Using CGBN from /home/math55/Math/CGBN/include/cgbn
checking if CGBN is present... no
configure: error: [B]cgbn.h not found[/B] (check if /cgbn needed after <PATH>/include)[/code][code]$ ls /home/math55/Math/CGBN/include/cgbn
arith cgbn.cu [B]cgbn.h[/B] core impl_mpz.cc
cgbn_cpu.h cgbn_cuda.h cgbn_mpz.h impl_cuda.cu[/code]I even copied the directory from the properties for the cgbn.h file. I also tried with and without "/cgbn."

EdH 2022-03-06 21:03

I think I found it. (from acinclude.m4):[code] NVCC_CHECK_COMPILE(
[
#include <gmp.h>
#include <cgbn.h>
],
[-I$cgbn_include [B]$GMPLIB[/B]],
[AC_MSG_RESULT([yes])],
[
AC_MSG_RESULT([no])
AC_MSG_ERROR([cgbn.h not found (check if /cgbn needed after <PATH>/include)])
]
)[/code]My ECM compile normally includes [C]--with-gmp=/usr/local/[/C], but I removed it for this ./configure and it finished without troubles.

I'm running it now and it appears to be doing OK.

Thanks for all the help.

SethTro 2022-03-06 23:51

[QUOTE=EdH;601230]OK, Thanks! I think I'm getting somewhere. I found CGBN, but had trouble with googletest when unzip crashed with mismatched internal names. (Maybe I have an old CGBN?) I think I've worked around that now and am trying to compile CGBN, but it looks like it's stuck.

I think I'm where I can work it a little further. I'll get back in a bit.[/QUOTE]


AFAIK you don't need to compile CGBN (or set up the googletest), you just need the folder downloaded ("cloned") from GitHub.

EdH 2022-03-07 00:34

[QUOTE=SethTro;601241]AFAIK you don't need to compile CGBN (or set up the googletest), you just need the folder downloaded ("cloned") from GitHub.[/QUOTE]ATM, I have it running, but now that I've had success, I may backtrack to see if that is the case. Sorry if I've been too ignorant of all these workings, but I hope I'm learning something.

Thanks for all the help!

EdH 2022-03-07 00:49

[QUOTE=SethTro;601241]AFAIK you don't need to compile CGBN (or set up the googletest), you just need the folder downloaded ("cloned") from GitHub.[/QUOTE]I restarted from scratch and now that I'm familiar with it, all went quick and easy.

RichD 2022-03-07 01:07

[QUOTE=EdH;601248]I restarted from scratch and now that I'm familiar with it, all went quick and easy.[/QUOTE]
Oh good, does that mean we might see a new "How I ... " thread?

EdH 2022-03-07 03:19

[QUOTE=RichD;601249]Oh good, does that mean we might see a new "How I ... " thread?[/QUOTE]
I wasn't considering one, since it really boils down to only a couple lines of install. Or, maybe you mean a GMP-ECM with GPU thread from start to finish? I do wonder if I should try to add CGBN to the Colab GMP-ECM session.

RichD 2022-03-07 11:58

[QUOTE=EdH;601255]I do wonder if I should try to add CGBN to the Colab GMP-ECM session.[/QUOTE]
That might be helpful. I was thinking of the GPU Msieve LA process but I'm in the wrong thread. I see it was recently posted there. Many thanks!

EdH 2022-03-07 14:08

[QUOTE=RichD;601258]That might be helpful. I was thinking of the GPU Msieve LA process but I'm in the wrong thread. I see it was recently posted there. Many thanks![/QUOTE]You're quite welcome. I always hope someone can get some use from the threads.

I will have to think about some things a bit. The Colab GPU LA was quite complicated to get arranged and is still more involved than the rest, but the Colab portion actually simplified after a bit. It started out as about five separate code blocks.

Maybe for the Colab GPU GMP-ECM, I can simply add a function with a switch whether to include CGBN.

A few things to think about, but sometimes too many cause me to step back and go do something else.

RichD 2022-03-07 16:01

[QUOTE=EdH;601264]You're quite welcome. I always hope someone can get some use from the threads.[/QUOTE]

I like the cookbook approach. Everything you need in one post.

Jeff Gilchrist did some work early on when things were much easier. Just ECM, GGNFS and Msieve. Now with all the different branches it is hard to keep them all straight. Thanks for all your work.

[QUOTE=EdH;601264]A few things to think about, but sometimes too many cause me to step back and go do something else.[/QUOTE]

Yea, I have that problem too - [URL="https://www.youtube.com/watch?v=2MrlAvr7F9o"]AAADD[/URL].

chris2be8 2022-03-07 16:48

[QUOTE=SethTro;601229]
@chris2be8. I think you have an old version of the code. That limit was removed at some point. Can you check that you are using [url]https://gitlab.inria.fr/zimmerma/ecm.git[/url] and not my personal repository ([url]https://github.com/sethtroisi/gmp-ecm[/url]).[/QUOTE]

I was using [url]https://github.com/sethtroisi/gmp-ecm[/url] which probably explains it. I'll try [url]https://gitlab.inria.fr/zimmerma/ecm.git[/url] once the job that's running on the GPU now has ended (it built OK but I've not tested it yet).

EdH 2022-03-07 20:53

I guess I installed everything OK and it is working. I ran a test of B1 values only (B2=0) on a c170 to compare timings for my K20X card. ECM chose 896 curves, out of the 2688 shading units. (I still don't know why so few.):[code]8875593388...97<170>:
Completed 1e3 with CGBN in 00:00
Completed 1e3 without CGBN in 00:01
Completed 15e3 with CGBN in 00:03
Completed 15e3 without CGBN in 00:08
Completed 12e4 with CGBN in 00:24
Completed 12e4 without CGBN in 01:02
Completed 1e6 with CGBN in 03:17
Completed 1e6 without CGBN in 08:35
Completed 6e6 with CGBN in 19:41
Completed 6e6 without CGBN in 51:32[/code]Now, when I can get stage 2 to be a bit more competitive. . .

WraithX 2022-03-07 21:12

[QUOTE=EdH;601290]I guess I installed everything OK and it is working. I ran a test of B1 values only (B2=0) on a c170 to compare timings for my K20X card. ECM chose 896 curves, out of the 2688 shading units. (I still don't know why so few.):[code]8875593388...97<170>:
Completed 1e3 with CGBN in 00:00
Completed 1e3 without CGBN in 00:01
Completed 15e3 with CGBN in 00:03
Completed 15e3 without CGBN in 00:08
Completed 12e4 with CGBN in 00:24
Completed 12e4 without CGBN in 01:02
Completed 1e6 with CGBN in 03:17
Completed 1e6 without CGBN in 08:35
Completed 6e6 with CGBN in 19:41
Completed 6e6 without CGBN in 51:32[/code]Now, when I can get stage 2 to be a bit more competitive. . .[/QUOTE]

Are the "without CGBN" times still using the gpu? Or is that cpu time to complete 896 curves?

Could you run those tests again with "-gpucurves 2688"? I'd be interested to see if you can get more curves done in the same time, or whether it just takes more time.

Also, for some reason, I seem to remember gpu-ecm running best at half of the number of CUDA cores, but maybe that is different with CGBN? Maybe another test with "-gpucurves 1344" and/or "-gpucurves 5376"?

SethTro 2022-03-07 22:43

[QUOTE=WraithX;601291]Are the "without CGBN" times still using the gpu? Or is that cpu time to complete 896 curves?

Could you run those tests again with "-gpucurves 2688"? I'd be interested to see if you can get more curves done in the same time, or more time.

Also, for some reason, I seem to remember gpu-ecm running best at half of the number of CUDA cores, but maybe that is different with CGBN? Maybe another test with "-gpucurves 1344" and/or "-gpucurves 5376"?[/QUOTE]

You can run `./gpu_throughput_test.sh` from the gmp-ecm folder and it should test with many different multiples of the default (1/4x, 1/2x, 1x, 2x, 4x, 8x). If the default 1x curve count is bad, it takes the number of curves as an optional 2nd parameter (after the ecm binary as an optional 1st parameter), so something like `./gpu_throughput_test.sh ./ecm 1344`.

For example, on my 970 it looks like 832 is the best number of curves for 1024-bit numbers (up to C300), but for smaller numbers (< C15) 3328 is the best number of curves, which is 4x the default.


[CODE]
TESTING (2^269-1)/13822297 B1=128000
Step 1 took 139ms
Computing 832 Step 1 took 683ms of CPU time / 33471ms of GPU time
Throughput: 24.857 curves per second (on average 40.23ms per Step 1)

CGBN<512, 4> running kernel<4 block x 256 threads> input number is 246 bits
Computing 224 Step 1 took 35ms of CPU time / 7710ms of GPU time
Throughput: 29.054 curves per second (on average 34.42ms per Step 1)

CGBN<512, 4> running kernel<7 block x 256 threads> input number is 246 bits
Computing 416 Step 1 took 9ms of CPU time / 7646ms of GPU time
Throughput: 54.408 curves per second (on average 18.38ms per Step 1)

CGBN<512, 4> running kernel<13 block x 256 threads> input number is 246 bits
Computing 832 Step 1 took 17ms of CPU time / 7629ms of GPU time
Throughput: 109.055 curves per second (on average 9.17ms per Step 1)

CGBN<512, 4> running kernel<26 block x 256 threads> input number is 246 bits
Computing 1664 Step 1 took 21ms of CPU time / 7844ms of GPU time
Throughput: 212.141 curves per second (on average 4.71ms per Step 1)

CGBN<512, 4> running kernel<52 block x 256 threads> input number is 246 bits
Computing 3328 Step 1 took 33ms of CPU time / 13393ms of GPU time
Throughput: 248.482 curves per second (on average 4.02ms per Step 1)

CGBN<512, 4> running kernel<104 block x 256 threads> input number is 246 bits
Computing 6656 Step 1 took 83ms of CPU time / 27894ms of GPU time
Throughput: 238.620 curves per second (on average 4.19ms per Step 1)



TESTING (2^499-1)/20959 B1=64000
Step 1 took 81ms
Computing 832 Step 1 took 384ms of CPU time / 18396ms of GPU time
Throughput: 45.228 curves per second (on average 22.11ms per Step 1)

CGBN<512, 4> running kernel<4 block x 256 threads> input number is 485 bits
Computing 224 Step 1 took 18ms of CPU time / 3956ms of GPU time
Throughput: 56.626 curves per second (on average 17.66ms per Step 1)

CGBN<512, 4> running kernel<7 block x 256 threads> input number is 485 bits
Computing 416 Step 1 took 16ms of CPU time / 3882ms of GPU time
Throughput: 107.165 curves per second (on average 9.33ms per Step 1)

CGBN<512, 4> running kernel<13 block x 256 threads> input number is 485 bits
Computing 832 Step 1 took 6ms of CPU time / 3856ms of GPU time
Throughput: 215.783 curves per second (on average 4.63ms per Step 1)

CGBN<512, 4> running kernel<26 block x 256 threads> input number is 485 bits
Computing 1664 Step 1 took 14ms of CPU time / 4154ms of GPU time
Throughput: 400.610 curves per second (on average 2.50ms per Step 1)

CGBN<512, 4> running kernel<52 block x 256 threads> input number is 485 bits
Computing 3328 Step 1 took 37ms of CPU time / 7469ms of GPU time
Throughput: 445.558 curves per second (on average 2.24ms per Step 1)

CGBN<512, 4> running kernel<104 block x 256 threads> input number is 485 bits
Computing 6656 Step 1 took 47ms of CPU time / 15017ms of GPU time
Throughput: 443.217 curves per second (on average 2.26ms per Step 1)



TESTING 2^997-1 B1=32000
Step 1 took 73ms
Computing 832 Step 1 took 182ms of CPU time / 9450ms of GPU time
Throughput: 88.045 curves per second (on average 11.36ms per Step 1)

CGBN<1024, 8> running kernel<7 block x 256 threads> input number is 997 bits
Computing 224 Step 1 took 28ms of CPU time / 3294ms of GPU time
Throughput: 67.994 curves per second (on average 14.71ms per Step 1)

CGBN<1024, 8> running kernel<13 block x 256 threads> input number is 997 bits
Computing 416 Step 1 took 27ms of CPU time / 3161ms of GPU time
Throughput: 131.591 curves per second (on average 7.60ms per Step 1)

CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 997 bits
Computing 832 Step 1 took 38ms of CPU time / 3450ms of GPU time
Throughput: 241.137 curves per second (on average 4.15ms per Step 1)

CGBN<1024, 8> running kernel<52 block x 256 threads> input number is 997 bits
Computing 1664 Step 1 took 37ms of CPU time / 7034ms of GPU time
Throughput: 236.566 curves per second (on average 4.23ms per Step 1)

CGBN<1024, 8> running kernel<104 block x 256 threads> input number is 997 bits
Computing 3328 Step 1 took 63ms of CPU time / 14158ms of GPU time
Throughput: 235.059 curves per second (on average 4.25ms per Step 1)

CGBN<1024, 8> running kernel<208 block x 256 threads> input number is 997 bits
Computing 6656 Step 1 took 105ms of CPU time / 29785ms of GPU time
Throughput: 223.465 curves per second (on average 4.47ms per Step 1)

[/CODE]

EdH 2022-03-07 23:01

[QUOTE=WraithX;601291]Are the "without CGBN" times still using the gpu? Or is that cpu time to complete 896 curves?

Could you run those tests again with "-gpucurves 2688"? I'd be interested to see if you can get more curves done in the same time, or more time.

Also, for some reason, I seem to remember gpu-ecm running best at half of the number of CUDA cores, but maybe that is different with CGBN? Maybe another test with "-gpucurves 1344" and/or "-gpucurves 5376"?[/QUOTE]Those were comparing GPU times between ECM compiled with CGBN and ECM compiled without CGBN. I'm running the test again with [C]-cgbn[/C] present and absent for the ECM that was compiled with CGBN:[code]function runecmCGBN {
result=$(echo "$comp" | $HOME/Math/ecm-cgbn/ecm-cgbn [B][C]-cgbn[/C][/B] -gpu -gpudevice 0 -q $b1 0)
}

function runecm {
result=$(echo "$comp" | $HOME/Math/ecm-cgbn/ecm-cgbn -gpu -gpudevice 0 -q $b1 0)
}[/code]So far the times are pretty close.

I will play with the throughput test and other values later.

EdH 2022-03-08 00:56

[QUOTE=EdH;601295]So far the times are pretty close.
. . .[/QUOTE]Nearly the same:[code]8875593388...97<170>:
Completed 1e3 with CGBN in 00:00
Completed 1e3 without CGBN in 00:01
Completed 15e3 with CGBN in 00:03
Completed 15e3 without CGBN in 00:08
Completed 12e4 with CGBN in 00:24
Completed 12e4 without CGBN in 01:02
Completed 1e6 with CGBN in 03:17
Completed 1e6 without CGBN in 08:33
Completed 6e6 with CGBN in 19:38
Completed 6e6 without CGBN in 51:22[/code]

EdH 2022-03-08 01:13

Not really sure how to use this. Would merely changing the -gpucurves value make all the other values change or would I adjust other things? Is this in the docs?[code]$ bash gpu_throughput_test.sh

TESTING (2^269-1)/13822297 B1=128000
Step 1 took 275ms
Computing 896 Step 1 took 1219ms of CPU time / 65682ms of GPU time
Throughput: 13.641 curves per second (on average 73.31ms per Step 1)

CGBN<512, 4> running kernel<4 block x 256 threads> input number is 246 bits
Computing 224 Step 1 took 23ms of CPU time / 12473ms of GPU time
Throughput: 17.959 curves per second (on average 55.68ms per Step 1)

CGBN<512, 4> running kernel<7 block x 256 threads> input number is 246 bits
Computing 448 Step 1 took 16ms of CPU time / 12474ms of GPU time
Throughput: 35.915 curves per second (on average 27.84ms per Step 1)

CGBN<512, 4> running kernel<14 block x 256 threads> input number is 246 bits
Computing 896 Step 1 took 32ms of CPU time / 12460ms of GPU time
Throughput: 71.913 curves per second (on average 13.91ms per Step 1)

CGBN<512, 4> running kernel<28 block x 256 threads> input number is 246 bits
Computing 1792 Step 1 took 19ms of CPU time / 14248ms of GPU time
Throughput: 125.769 curves per second (on average 7.95ms per Step 1)

CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits
Computing 3584 Step 1 took 45ms of CPU time / 22182ms of GPU time
Throughput: 161.573 curves per second (on average 6.19ms per Step 1)

CGBN<512, 4> running kernel<112 block x 256 threads> input number is 246 bits
Computing 7168 Step 1 took 70ms of CPU time / 44416ms of GPU time
Throughput: 161.384 curves per second (on average 6.20ms per Step 1)



TESTING (2^499-1)/20959 B1=64000
Step 1 took 184ms
Computing 896 Step 1 took 617ms of CPU time / 32883ms of GPU time
Throughput: 27.248 curves per second (on average 36.70ms per Step 1)

CGBN<512, 4> running kernel<4 block x 256 threads> input number is 485 bits
Computing 224 Step 1 took 8ms of CPU time / 6256ms of GPU time
Throughput: 35.808 curves per second (on average 27.93ms per Step 1)

CGBN<512, 4> running kernel<7 block x 256 threads> input number is 485 bits
Computing 448 Step 1 took 16ms of CPU time / 6233ms of GPU time
Throughput: 71.872 curves per second (on average 13.91ms per Step 1)

CGBN<512, 4> running kernel<14 block x 256 threads> input number is 485 bits
Computing 896 Step 1 took 17ms of CPU time / 6235ms of GPU time
Throughput: 143.703 curves per second (on average 6.96ms per Step 1)

CGBN<512, 4> running kernel<28 block x 256 threads> input number is 485 bits
Computing 1792 Step 1 took 24ms of CPU time / 7151ms of GPU time
Throughput: 250.600 curves per second (on average 3.99ms per Step 1)

CGBN<512, 4> running kernel<56 block x 256 threads> input number is 485 bits
Computing 3584 Step 1 took 31ms of CPU time / 11108ms of GPU time
Throughput: 322.648 curves per second (on average 3.10ms per Step 1)

CGBN<512, 4> running kernel<112 block x 256 threads> input number is 485 bits
Computing 7168 Step 1 took 87ms of CPU time / 22239ms of GPU time
Throughput: 322.312 curves per second (on average 3.10ms per Step 1)



TESTING 2^997-1 B1=32000
Step 1 took 180ms
Computing 896 Step 1 took 326ms of CPU time / 16376ms of GPU time
Throughput: 54.714 curves per second (on average 18.28ms per Step 1)

CGBN<1024, 8> running kernel<7 block x 256 threads> input number is 997 bits
Computing 224 Step 1 took 11ms of CPU time / 5296ms of GPU time
Throughput: 42.299 curves per second (on average 23.64ms per Step 1)

CGBN<1024, 8> running kernel<14 block x 256 threads> input number is 997 bits
Computing 448 Step 1 took 14ms of CPU time / 5289ms of GPU time
Throughput: 84.698 curves per second (on average 11.81ms per Step 1)

CGBN<1024, 8> running kernel<28 block x 256 threads> input number is 997 bits
Computing 896 Step 1 took 33ms of CPU time / 6285ms of GPU time
Throughput: 142.559 curves per second (on average 7.01ms per Step 1)

CGBN<1024, 8> running kernel<56 block x 256 threads> input number is 997 bits
Computing 1792 Step 1 took 44ms of CPU time / 10762ms of GPU time
Throughput: 166.513 curves per second (on average 6.01ms per Step 1)

CGBN<1024, 8> running kernel<112 block x 256 threads> input number is 997 bits
Computing 3584 Step 1 took 81ms of CPU time / 21541ms of GPU time
Throughput: 166.382 curves per second (on average 6.01ms per Step 1)

CGBN<1024, 8> running kernel<224 block x 256 threads> input number is 997 bits
Computing 7168 Step 1 took 159ms of CPU time / 43201ms of GPU time
Throughput: 165.923 curves per second (on average 6.03ms per Step 1)[/code]

SethTro 2022-03-08 06:53

[QUOTE=EdH;601297]Not really sure how to use this. Would merely changing the -gpucurves value make all the other values change or would I adjust other things? Is this in the docs?[/QUOTE]

This isn't documented anywhere, but if we talk through some good notes here I'll happily write them up and include them after the program runs. This is doing the same thing you are with runecmCGBN, it runs ecm with a bunch of different -gpucurves and prints out the time for each.

Maybe a prefix like "This script helps you find the best -gpucurves value for your GPU. It runs ecm (<BINARY NAME>) while varying the -gpucurves parameter from your card's default, X, through a number of multiples of it. It runs at 3 sizes: 256 bits (C80), 512 bits (C150), and 1024 bits (C300). The first line in each set is the CPU timing, then the GPU times for different values of -gpucurves."

After it's done, something like "Larger values tend to produce better throughput but can double the time to get the curves. We suggest using the first -gpucurves value that is within 10% of the best throughput."


(written on mobile without proofreading, pre-apology for grammar and spelling)
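Roughly, the sweep the script performs could be sketched like this (a hypothetical outline, not the real gpu_throughput_test.sh; the binary name and curve counts are assumptions based on the output above):

```shell
# Hypothetical sketch of the -gpucurves sweep; ./ecm-cgbn and the
# curve counts are assumptions, not taken from the actual script.
ecm_cmd () {
  # $1 = curves, $2 = B1: print the command line that would be timed
  echo "./ecm-cgbn -cgbn -gpu -gpucurves $1 $2 0"
}

for curves in 224 448 896 1792 3584 7168; do
  ecm_cmd "$curves" 128000
  # To benchmark for real:  echo "$N" | time $(ecm_cmd "$curves" 128000)
done
```

Each printed command would be fed the number under test and timed, giving one throughput line per -gpucurves value.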

EdH 2022-03-08 13:32

I understood about the -gpucurves, but what confused me was the:[code]CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits[/code]lines. I see now that they are based on the input number size and automatically taken care of by the program. I had thought maybe there were more options to provide.

Thanks for helping me understand this and for a great speedup.

chris2be8 2022-03-08 16:39

ecm-gpu downloaded from [url]https://gitlab.inria.fr/zimmerma/ecm.git[/url] works for b1=11e7:
[code]
chris@4core:~/ecm-cgbn.2/ecm> date;time ./ecm -gpu -cgbn -save test2.save 110000000 1 <b58+148.ini;date
Tue 8 Mar 08:18:31 GMT 2022
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:35896186-3:35898617 (2432 curves)
GPU: Large B1, S = 158705536 bits = 151 MB
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 67 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 4508ms of CPU time / 4674180ms of GPU time

real 78m0.885s
user 0m10.992s
sys 0m2.513s
Tue 8 Mar 09:36:32 GMT 2022
[/code]

This is after updating the Makefiles to [c]--generate-code arch=compute_86,code=sm_86[/c].

The older version without -cgbn took about 9 hours to do the same job. Many thanks for the speed up.

EdH 2022-03-08 19:43

Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?

I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a close multiple of a base value, then adjust the -gpucurves, accordingly, or am I complicating things?

SethTro 2022-03-08 20:46

[QUOTE=EdH;601343]Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?

I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a close multiple of a base value, then adjust the -gpucurves, accordingly, or am I complicating things?[/QUOTE]


TL;DR If you are still running B2, you should probably set B1 for each t-level based on [URL="https://members.loria.fr/PZimmermann/records/ecm/params.html"]this chart[/URL], then round the number of curves to the nearest multiple of 896. This is probably within 20% of optimal for >= t45. You could optimize slightly by increasing B1 if you round down or decreasing B1 if you round up (so that ecm -v prints an "expected number of curves to find a factor" equal to the number of curves you are using).
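The rounding step can be sketched as a small shell helper (a hypothetical helper, not from the thread; 896 is the default batch size on EdH's card):

```shell
# Round a desired curve count to the nearest multiple of the card's
# default batch size (896 in the posts above). Helper name is made up.
round_curves () {
  want=$1
  batch=${2:-896}
  echo $(( (want + batch / 2) / batch * batch ))
}

round_curves 2100   # -> 1792
round_curves 2300   # -> 2688
```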

In practice everything is really fast for small factors, so for a single number it hardly matters; but if you were working on factordb or a huge batch of numbers (>5000) you would want to do something smarter. In theory the code could run one curve for 896 different numbers, or something along those lines.

It can also make sense to tune the B1/B2 ratio based on how much RAM you have and how fast your CPU is versus your GPU. For example see [URL="https://www.mersenneforum.org/showthread.php?t=23280&page=2"]the discussion here[/URL]. I wrote some hacky shell code to do this at [URL="https://github.com/sethtroisi/misc-scripts/tree/main/ecm_gpu_optimizer"]sethtro/misc-scripts/ecm_gpu_optimizer[/URL]

EdH 2022-03-08 22:36

Thanks. This gives me something to study. Unfortunately, the machine I was able to get to run the GPU has only 2 cores and 8G RAM. But, I have a script now that sends the residues to a second machine and moves to the next B1 level. Of course, now the GPU is the bottleneck since I'm only running stage 1 operations on its machine. I'm still looking at what might be best for my setup.
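A minimal sketch of that two-machine split (the host name, file names, and B1 value are placeholders, not from the thread; GMP-ECM's -save and -resume options are the real mechanism):

```shell
# Stage 1 on the GPU box, saving residues; stage 2 elsewhere via -resume.
# "cpubox" and the file names are hypothetical.
stage1_cmd () {  # $1 = B1, $2 = residue file
  echo "./ecm -gpu -cgbn -save $2 $1 1"
}
stage2_cmd () {  # $1 = B1, $2 = residue file
  echo "ecm -resume $2 $1"
}

stage1_cmd 43000000 stage1.residues
# ship residues to the stage-2 machine, e.g.: scp stage1.residues cpubox:work/
stage2_cmd 43000000 stage1.residues
```

Setting B2=1 on the GPU run (as in chris2be8's log above) skips stage 2 entirely so the CPU machine can pick it up from the residue file.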

SethTro 2022-03-17 08:55

[QUOTE=chris2be8;601329]
The older version without -cgbn took about 9 hours to do the same job. Many thanks for the speed up.[/QUOTE]

Fun fact: if you follow the advice about custom kernel size, you can potentially make this an additional 40% faster.

[CODE]
$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:1276799189-3:1276800020 (832 curves)
[B]Compiling custom kernel for 640 bits should be ~144% faster[/B]
CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1158 bits/call, 96372/1586512 (6.1%), ETA 106 + 7 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 212172/1586512 (13.4%), ETA 97 + 15 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 327972/1586512 (20.7%), ETA 89 + 23 = 112 seconds (~135 ms/curves)

After changing
- typedef cgbn_params_t<8, 1024> cgbn_params_1024;
+ typedef cgbn_params_t<8, 640> cgbn_params_1024;

$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:230651649-3:230652480 (832 curves)
[B]CGBN<640, 8>[/B] running kernel<26 block x 256 threads> input number is 569 bits
Computing 1863 bits/call, 146292/1586512 (9.2%), ETA 67 + 7 = 74 seconds (~89 ms/curves)
Computing 1863 bits/call, 332592/1586512 (21.0%), ETA 60 + 16 = 76 seconds (~92 ms/curves)
Computing 1863 bits/call, 518892/1586512 (32.7%), ETA 52 + 25 = 77 seconds (~93 ms/curves)
[/CODE]
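The typedef change above can be scripted; a sketch (the sed command and rebuild step are assumptions about the usual edit-and-make workflow, demonstrated here on a stand-in file so nothing real is modified):

```shell
# Real usage would patch the source and rebuild, e.g.:
#   sed -i 's/cgbn_params_t<8, 1024>/cgbn_params_t<8, 640>/' cgbn_stage1.cu
#   make
# Demonstrated on a stand-in copy of the typedef line:
printf 'typedef cgbn_params_t<8, 1024> cgbn_params_1024;\n' > demo.cu
sed -i 's/cgbn_params_t<8, 1024>/cgbn_params_t<8, 640>/' demo.cu
cat demo.cu   # now declares the 640-bit kernel
```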

Gimarel 2022-03-17 09:16

If trying custom kernel sizes, also try 768 bits. For me (GTX 2060 Super) that's faster than 640 bits.

henryzz 2022-03-17 09:37

[QUOTE=Gimarel;601927]If trying custom kernel sizes, try also 768 bits. For me (GTX 2060 Super) thats faster than 640 bits.[/QUOTE]
If that's the case, then a kernel benchmark that identifies the fastest kernels for each card would be useful. I currently have a version with all the possible kernels added, up to 300 digits or so.

chris2be8 2022-03-17 16:53

[QUOTE=SethTro;601926]Fun fact if you follow the advice about custom kernel size you can potentially make this an additional 40% faster
[/QUOTE]

That won't be much help to me; the CPU already takes much longer to do stage 2 than the GPU takes to do stage 1.

I've looked at your chart for recommended B1 and B2 values, but it confuses my script's calculations of how much ECM to do for a number of a given size. I need to do some serious thinking to get it to all work together.

wombatman 2022-04-03 03:52

Hi, I've built this under WSL2 and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:

"No available CGBN Kernel large enough to process N(1864 bits)"

I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.

SethTro 2022-04-03 06:22

[QUOTE=wombatman;603167]Hi, I've built this under WSL2 and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:

"No available CGBN Kernel large enough to process N(1864 bits)"

I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.[/QUOTE]

This is expected. I'm balancing binary size and compile time vs range of numbers that can be tested.

If you want to run ECM on numbers > 1020 bits look around line 670 in cgbn_stage1.cu

wombatman 2022-04-03 17:07

[QUOTE=SethTro;603168]This is expected. I'm balancing binary size and compile time vs range of numbers that can be tested.

If you want to run ECM on numbers > 1020 bits look around line 670 in cgbn_stage1.cu[/QUOTE]

Good deal. Thanks!

