
 SethTro 2021-08-24 23:35

Faster GPU-ECM with CGBN

Resurrecting this thread. If anyone is running numbers smaller than C155 they should reach out to me.

My new CGBN-enabled code is something like 7x faster:

[CODE]
\$ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
Computing 3584 Step 2 on CPU took 71933ms

\$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
Computing 3584 Step 2 on CPU took 72142ms
[/CODE]

For numbers smaller than C300 it's generally 2-3x faster:

[CODE]
\$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
Computing 1792 Step 2 on CPU took 83417ms

\$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
Computing 1792 Step 2 on CPU took 83521ms
[/CODE]

I'm actively working on the code in [url]https://github.com/sethtroisi/gmp-ecm/tree/gpu_integration[/url] if you are a developer, and I could possibly distribute Linux binaries if we had a place to store them.
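To put numbers on the "7x" and "2-3x" figures, here is a minimal Python sketch that reads the Step 1 timing lines from the logs above and compares GPU time per curve. The helper names (`gpu_ms_per_curve`, `speedup`) are made up for illustration and are not part of GMP-ECM.

```python
import re

# Matches the Step 1 timing line printed by ecm in the logs above.
STEP1_RE = re.compile(
    r"Computing (\d+) Step 1 took (\d+)ms of CPU time / (\d+)ms of GPU time"
)


def gpu_ms_per_curve(log_line):
    """Return GPU milliseconds per curve for one Step 1 timing line."""
    match = STEP1_RE.search(log_line)
    if match is None:
        raise ValueError("not a Step 1 timing line: " + log_line)
    curves, _cpu_ms, gpu_ms = (int(group) for group in match.groups())
    return gpu_ms / curves


def speedup(old_line, new_line):
    """Ratio of per-curve GPU time, old kernel over new kernel."""
    return gpu_ms_per_curve(old_line) / gpu_ms_per_curve(new_line)


c146_old = "Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time"
c146_new = "Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time"
c301_old = "Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time"
c301_new = "Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time"

print(f"C146: {speedup(c146_old, c146_new):.2f}x")  # C146: 7.12x
print(f"C301: {speedup(c301_old, c301_new):.2f}x")  # C301: 2.40x
```

Note that Step 2 on the CPU is unchanged in both runs (about 72 s and 83 s), so the end-to-end gain is much smaller than the Step 1 ratio unless Step 2 is spread over several cores.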

 henryzz 2021-08-25 11:59

This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written. I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).

Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyyzy.

Does the Windows Visual Studio build work for this? I would either need that or CUDA working under WSL2.

 bsquared 2021-08-25 13:07

Very nice!

I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for Linux?

Also are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It is very helpful for speeding up the process as a whole if these can also be run in parallel, even given that each curve is slightly less likely to find a factor:

[CODE]./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16

ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1
ecm: process took 5.5385 seconds.[/CODE]

Sorry, I just saw that henryzz already brought this up.

 chris2be8 2021-08-25 15:42

I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:
[code]
unzip gmp-ecm-gpu_integration.zip
cd gmp-ecm-gpu_integration
autoreconf -si

./configure --enable-gpu=30 # The previous version needed --enable-gpu=sm30. The following messages look relevant:
-snip-
checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
-snip-
configure: with_cgbn: , , "
-snip-

make # This fails with messages about CGBN being missing.
[/code]

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?

Also what makes of GPU support it? Mine may be too old.

Chris

PS I use ECM set to do 512 bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024 bit version. So you may be *only* getting a factor of 3 over all ranges. But that's still very nice.

 bsquared 2021-08-25 16:19

Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for a sm_70 card. But now I'm getting a bunch of these errors:

[CODE]ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/CODE]

which probably has nothing to do with your CGBN enhancements. Anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?

 bsquared 2021-08-25 17:34

Fixed it.

On line 10 of cudakernel_default.cu replace this line:

[CODE]
[/CODE]

with this one

[CODE]
[/CODE]

Now I have a working gpu-ecm! Later I'll try to get the CGBN part working.

[CODE]
% echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
Computing 5120 Step 2 on CPU took 247281ms
[/CODE]

 SethTro 2021-08-25 17:48

[QUOTE=bsquared;586471]Very nice!
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for linux?
[/QUOTE]

This is what I use

[CODE]
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
make -j8
[/CODE]

Replace `--enable-gpu=61` with a different compute capability if you need one (and apply the change from [URL="https://www.mersenneforum.org/showpost.php?p=586504&postcount=27"]#27[/URL] if you need SM_70), but you already seem to have that figured out.

[QUOTE=chris2be8;586488]I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?
[/QUOTE]

I ran `git clone https://github.com/NVlabs/CGBN.git` under /home/five/Projects
then added `-with-cgbn-include=/home/five/Projects/CGBN/include/cgbn` to the list of options I pass to `./configure`

 bsquared 2021-08-25 18:24

I got CGBN set up and configured your code with the proper cgbn include path.

Needed to make the following changes to get it to compile:

[CODE]
Lines 510 and 587, replace 100'000'000 with 1000000000

Around line 640:
//const std::vector<uint32_t> available_kernels = { 512, 1024 };
uint32_t available_kernels[2] = { 512, 1024 };
uint32_t num_available_kernels = 2;

and then in the following loop:
//for (kernel_bits : available_kernels) {
for (i=0; i<num_available_kernels; i++) {
kernel_bits = available_kernels[i];

[/CODE]

Running with -gpu -cgbn gives me an error:

[CODE]echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 591)
[/CODE]

Running with -gpu still works fine.

Just commenting out line 591 makes it work. CPU usage does go up during GPU execution though.

Note the factor of 10 increase in B1 compared to before. Speedup is about 8x! Awesome!

[CODE]% echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 3587ms of CPU time / 6088ms of GPU time
[/CODE]

 SethTro 2021-08-25 18:42

I committed the first tweak you made so you can 'git pull' (you might need `git fetch` `git reset --hard origin/gpu_integration`)

I'm not sure why you get the `cannot set while device is active in this process` error from `cudaSetDeviceFlags`.

Google suggests this can happen if the process has already started running things on the device, but the code shouldn't have at that point.
You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out)

[CODE]
[/CODE]

If anyone else experiences this, let me know so I can try to fix it, but I'm going to ignore it for now.

 chris2be8 2021-08-27 15:46

I got that error too:
[code]
\$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601)
[/code]

After commenting out line 601 it works:
[code]
\$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
Copying results back to CPU ...
Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time
[/code]

But after raising B1 to 200000 it fails:
[code]
\$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
CUDA error occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 632)
[/code]

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?

This GPU has CUDA arch 3.0, is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2, I'll try installing on that next.

NB. msieve says what CUDA arch the card is when used for GPU based poly selection. I'm using that to check what they really support.

Chris

PS. Should discussion of ecm with CGBN be split off into another thread?

 bsquared 2021-08-27 19:24

[QUOTE=chris2be8;586664]

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?
[/QUOTE]

I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours.

(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).

 SethTro 2021-08-27 22:55

[QUOTE=bsquared;586677]I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours.

(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).[/QUOTE]

You might try (2^419-1)/4398294875195008479937661267298757530097217 which has a 38 digit factor or
(2^569-1)/160592976218334727184554268072735638438202191 which has a 42 digit factor

[QUOTE=chris2be8;586664]I got that error too:
[code]
\$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601)
[/code]

[/QUOTE]

You can run `git pull` (tries to pull any changes I've made but fails if I do a bad thing that I like to do). If that fails you can run this command to reset to the state of my branch (this is destructive of any changes you made)
`git fetch`
`git reset --hard origin/gpu_integration`

Thanks for confirming you also see the cudaError, I'll investigate more now that multiple people see it.

 frmky 2021-08-28 07:07

That was a painful configure. I'm using nvhpc rather than the CUDA toolkit, so the directory structure is different and the cuda lib is just a stub so the check for cuInit() fails. I can point to the actual lib, but the version doesn't match the nvhpc version (which is fine, nvhpc doesn't require them to match) but that check fails. Then the check for CGBN ignores the --with-gmp directory so that check fails because it can't find GMP but reports that it can't find cgbn.h. After bypassing all of those checks, I get a working binary. Yay!

[CODE]\$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 282ms of CPU time / 705ms of GPU time

\$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 2000000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=2000000, B2=0, sigma=3:1000-3:6119 (5120 curves)
GPU: factor 1998447222711143545931606352264121 found in Step 1 with curve 2890 (-sigma 3:3890)
Computing 5120 Step 1 took 47146ms of CPU time / 112480ms of GPU time
********** Factor found in step 1: 1998447222711143545931606352264121
Found prime factor of 34 digits: 1998447222711143545931606352264121
Prime cofactor (((2^499-1)/20959))/1998447222711143545931606352264121 has 113 digits
[/CODE]
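The reported factor is easy to double-check with plain integer arithmetic. The following Python sketch uses only numbers copied from the log above; the function name `check_factor` is made up for illustration.

```python
# Input number and reported factor, copied from the log above.
n = (2**499 - 1) // 20959
factor = 1998447222711143545931606352264121


def check_factor(n, factor):
    """Return (divides, digits of factor, digits of cofactor)."""
    divides = n % factor == 0
    cofactor = n // factor
    return divides, len(str(factor)), len(str(cofactor))


divides, factor_digits, cofactor_digits = check_factor(n, factor)
print("input digits:", len(str(n)))
print("factor divides input:", divides)
print("factor digits:", factor_digits)
print("cofactor digits:", cofactor_digits)
```

This only confirms divisibility and the digit counts; it does not test primality of the factor or the cofactor.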

 SethTro 2021-08-28 09:35

Glad you got a working binary! Would you mind measuring the speedup of echo "2^997-1" with -gpu vs -cgbn?

I feel for you on setup. I didn't know any automake, and figuring out how to configure it took me a whole day.

If you're willing, you could try making this change and seeing whether the cgbn.h check respects the --with-gmp-lib path after it.

[CODE]
diff --git a/acinclude.m4 b/acinclude.m4
index fbbf94df..04694003 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -612,7 +612,7 @@ AS_IF([test "x\$enable_gpu" = "xyes" ],
 #include <gmp.h>
 #include <cgbn.h>
 ],
- [-I\$cgbn_include -lgmp],
+ [-I\$cgbn_include \$GMPLIB],
 [AC_MSG_RESULT([yes])],
 [
 AC_MSG_RESULT([no])
@@ -620,7 +620,7 @@ AS_IF([test "x\$enable_gpu" = "xyes" ],
 ]
 )
 AC_DEFINE([HAVE_CGBN_H], [1], [Define to 1 if cgbn.h exists])
- NVCCFLAGS="-I\$with_cgbn_include -lgmp \$NVCCFLAGS"
+ NVCCFLAGS="-I\$with_cgbn_include \$GMPLIB \$NVCCFLAGS"
 want_cgbn="yes"
 ])
 ])
[/CODE]

 SethTro 2021-08-28 11:06

I made a fairly large change so that the GPU kernel is called in much smaller batches. This helps with system responsiveness and also gives a progress status.

 chris2be8 2021-08-28 15:49

After further testing on the system I was trying it on, I'm fairly sure it won't work because the GPU is sm_30 (first-generation Kepler) and CGBN needs sm_35 (later Kepler) or higher. I had to look in the Makefiles shipped with CGBN to find out which sm_.. corresponds to which marketing name Nvidia uses.

I have a newer GPU which is sm_52. But that system has an old version of the CUDA Toolkit and runtime (7.5/7.5). And that doesn't support CGBN. So I'll have to upgrade CUDA on that system. Which will take a while.

I've just used [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] to grab a clean copy of the latest code instead of downloading gmp-ecm-gpu_integration.zip from the web site. I assume that's the best way to get it.

PS. [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] isn't right, it doesn't include cgbn_stage1.* So what should I use?
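As a quick reference for the sm_.. numbers in this post, here is a small Python sketch mapping compute capability to NVIDIA architecture names. The architecture table is standard NVIDIA naming (Fermi is sm_2x, and sm_30 is itself an early Kepler part). The sm_35 minimum is only the figure quoted in this post, treated here as an assumption, and the names `ARCH_BY_MAJOR`, `CGBN_MIN_SM` and `describe` are made up for illustration.

```python
# Compute-capability major version -> NVIDIA architecture name.
ARCH_BY_MAJOR = {
    2: "Fermi",
    3: "Kepler",
    5: "Maxwell",
    6: "Pascal",
    7: "Volta/Turing",
    8: "Ampere/Ada",
}

# Assumption taken from this post: CGBN wants sm_35 or newer.
CGBN_MIN_SM = 35


def describe(sm):
    """Return (architecture name, meets the assumed CGBN minimum)."""
    arch = ARCH_BY_MAJOR.get(sm // 10, "unknown")
    return arch, sm >= CGBN_MIN_SM


for sm in (30, 52, 61, 70):
    arch, ok = describe(sm)
    print(f"sm_{sm}: {arch}, CGBN candidate: {ok}")
```

Meeting the minimum is necessary but not sufficient: the installed CUDA Toolkit also has to be new enough to build CGBN, as the sm_52 system above shows.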

 EdH 2021-08-28 16:06

[QUOTE=chris2be8;586726]After further testing on the system I was trying it on, I'm fairly sure it won't work because the GPU is sm_30 (first-generation Kepler) and CGBN needs sm_35 (later Kepler) or higher. I had to look in the Makefiles shipped with CGBN to find out which sm_.. corresponds to which marketing name Nvidia uses.
. ..[/QUOTE]Thank you! I've been watching this thread, wondering if I should try it with my sm_30 card, that won't build GMP-ECM or Msieve with CUDA 10.x, which is supposed to support that architecture. CUDA 11.x does not support it.

 henryzz 2021-08-28 16:27

There still seem to be cudaDeviceSynchronize issues, although I get the error at a very different line number.

[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
CUDA error occurred: unknown error
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 733)[/CODE]

I have managed to get the non-cgbn code working under WSL2, although this required updating Windows to 21H2 to enable GPU support, which is only available as a preview so far.

 chris2be8 2021-08-28 16:32

[QUOTE=EdH;586728]Thank you! I've been watching this thread, wondering if I should try it with my sm_30 card, that won't build GMP-ECM or Msieve with CUDA 10.x, which is supposed to support that architecture. CUDA 11.x does not support it.[/QUOTE]

My system with that card has CUDA 9.1 on it. So if you can "upgrade" to that level ecm and msieve should build.

 Plutie 2021-08-28 17:12

[QUOTE=chris2be8;586726]After further testing on the system I was trying it on, I'm fairly sure it won't work because the GPU is sm_30 (first-generation Kepler) and CGBN needs sm_35 (later Kepler) or higher. I had to look in the Makefiles shipped with CGBN to find out which sm_.. corresponds to which marketing name Nvidia uses.

I have a newer GPU which is sm_52. But that system has an old version of the CUDA Toolkit and runtime (7.5/7.5). And that doesn't support CGBN. So I'll have to upgrade CUDA on that system. Which will take a while.

I've just used [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] to grab a clean copy of the latest code instead of downloading gmp-ecm-gpu_integration.zip from the web site. I assume that's the best way to get it.

PS. [c]git clone 'https://github.com/sethtroisi/gmp-ecm.git'[/c] isn't right, it doesn't include cgbn_stage1.* So what should I use?[/QUOTE]

The correct command to download the CGBN branch should be [c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration folder_name[/c].

PS: Also receiving the same CUDA error as chris2be8,

[c]\$ echo "2^997-1" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 2^997-1 (301 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1639 (640 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 600)[/c]

 SethTro 2021-08-28 20:08

WraithX [URL="https://github.com/sethtroisi/gmp-ecm/pull/3#pullrequestreview-741052640"]proposed a patch[/URL] which will hopefully resolve the `CUDA error occurred: cannot set while device is active in this process` error. Hopefully it will be checked in later today / tonight.

 henryzz 2021-08-28 21:55

With Wraith's patch (with the header fixed) I now get:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 2000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=2000, B2=147396, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 1863ms of CPU time / 2291ms of GPU time
Computing 3584 Step 2 on CPU took 12404ms

echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 2000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=2000, B2=147396, sigma=3:1000-3:4583 (3584 curves)
GPU: Block: 32x32x1 Grid: 112x1x1 (3584 parallel curves)
Computing 3584 Step 1 took 3668ms of CPU time / 6199ms of GPU time
Computing 3584 Step 2 on CPU took 12445ms[/CODE]
Although with B1=20000 I still get:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
CUDA error (702) occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 731)[/CODE]

 WraithX 2021-08-28 22:06

[QUOTE=henryzz;586760]
Although with B1=20000 I still get:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
CUDA error (702) occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 731)[/CODE][/QUOTE]

What happens if you specify 0 for B2? Like this:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0[/CODE]

 henryzz 2021-08-28 22:40

[QUOTE=WraithX;586761]What happens if you specify 0 for B2? Like this:
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0[/CODE][/QUOTE]
The same thing.

If I run fewer curves at once it works. Possibly just that my GPU is pathetic (750 Ti):
[CODE]echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert, --enable-openmp] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:1319 (320 curves)
Computing 320 Step 1 took 756ms of CPU time / 1269ms of GPU time
Computing 320 Step 2 on CPU took 7488ms[/CODE]

 SethTro 2021-08-28 22:44

You might try making this change in cgbn_stage1.cu:

-#define S_BITS_PER_CALL 10000
+#define S_BITS_PER_CALL 1000

then running with -v which might tell you when the GPU died (and also might prevent timeouts)

[CODE]
\$ echo "(2^499-1)/20959" | ./ecm -v -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
GPU: will use device 0: GeForce GTX 1080 Ti, compute capability 6.1, 28 MPs.
Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves)
Running CGBN<512,4> kernel<112,128> at bit 0/28820 (0.0%)...
Running CGBN<512,4> kernel<112,128> at bit 1000/28820 (3.5%)...
...
Running CGBN<512,4> kernel<112,128> at bit 27000/28820 (93.7%)...
Running CGBN<512,4> kernel<112,128> at bit 28000/28820 (97.2%)...
Copying results back to CPU ...
Computing 3584 Step 1 took 15ms of CPU time / 1105ms of GPU time
Throughput: 3244.848 curves per second (on average 0.31ms per Step 1)

[/CODE]
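The idea behind a smaller S_BITS_PER_CALL is that each kernel launch covers fewer bits of the stage 1 exponent, so no single launch runs long enough to hit the display watchdog. Below is a rough Python model of that batching, reproducing the progress percentages in the verbose output above. The function names are made up, and this is not the actual loop in cgbn_stage1.cu.

```python
def batches(total_bits, bits_per_call):
    """Yield (start_bit, n_bits) for each kernel launch."""
    for start in range(0, total_bits, bits_per_call):
        yield start, min(bits_per_call, total_bits - start)


def progress_lines(total_bits, bits_per_call):
    """Format one progress line per launch, mimicking the -v output."""
    return [
        f"at bit {start}/{total_bits} ({100.0 * start / total_bits:.1f}%)"
        for start, _ in batches(total_bits, bits_per_call)
    ]


lines = progress_lines(28820, 1000)
print(len(lines), "launches")  # 29 launches
print(lines[1])                # at bit 1000/28820 (3.5%)
print(lines[-1])               # at bit 28000/28820 (97.2%)
```

With the original value of 10000 the same 28820 bits would be covered in only three launches, each about ten times longer.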

 frmky 2021-08-28 22:50

[QUOTE=SethTro;586711]Glad you got a working binary! Would you mind measuring the speedup of echo "2^997-1" with -gpu vs -cgbn?[/QUOTE]
[CODE]\$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 183ms of CPU time / 5364ms of GPU time

\$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.1, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
Computing 5120 Step 1 took 1284ms of CPU time / 3057ms of GPU time
[/CODE]

I'll try the configure changes later. Overnight I ran 2560 stage-1 curves on the C201 blocking the aliquot sequence starting at 3366 using B1=85e7. I'm working through stage 2 on those now.

 frmky 2021-08-28 23:23

Those changes to acinclude.m4 aren't enough. It still can't find gmp.h during the test compile. We need to add a -I for the gmp include directory. And that breaks the build since it's trying to include libgmp.a during compile.

 henryzz 2021-08-29 07:08

Reducing S_BITS_PER_CALL has fixed it for me. Thank you 😀

 Gimarel 2021-08-29 12:53

Current git fails for inputs near 512 Bits. It seems that there is a condition the wrong way:[CODE]diff --git a/cgbn_stage1.cu b/cgbn_stage1.cu
index 1b512ecd..f67f8715 100644
--- a/cgbn_stage1.cu
+++ b/cgbn_stage1.cu
@@ -653,7 +653,7 @@ int run_cgbn(mpz_t *factors, int *array_stage_found,
#endif /* IS_DEV_BUILD */
for (int k_i = 0; k_i < available_kernels.size(); k_i++) {
uint32_t kernel_bits = available_kernels[k_i];
- if (kernel_bits + 6 >= mpz_sizeinbase(N, 2)) {
+ if (kernel_bits >= mpz_sizeinbase(N, 2) + 6) {
BITS = kernel_bits;
assert( BITS % 32 == 0 );
TPI = (BITS <= 512) ? 4 : (BITS <= 2048) ? 8 : (BITS <= 8192) ? 16 : 32;[/CODE]
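To see why the original condition picks a kernel that is too small, here is a Python model of the selection loop, with `int.bit_length()` standing in for `mpz_sizeinbase(N, 2)`. The names `AVAILABLE_KERNELS`, `MARGIN_BITS` and the function names are made up for illustration; only the two kernel sizes, the 6-bit margin and the TPI expression come from the thread.

```python
AVAILABLE_KERNELS = (512, 1024)  # kernel sizes mentioned in the thread
MARGIN_BITS = 6                  # headroom used in the patch above


def pick_kernel_fixed(n):
    """Smallest kernel with at least MARGIN_BITS of headroom over n."""
    for kernel_bits in AVAILABLE_KERNELS:
        if kernel_bits >= n.bit_length() + MARGIN_BITS:
            return kernel_bits
    return None  # no compiled kernel is large enough


def pick_kernel_buggy(n):
    """The reversed condition: accepts kernels up to 6 bits too small."""
    for kernel_bits in AVAILABLE_KERNELS:
        if kernel_bits + MARGIN_BITS >= n.bit_length():
            return kernel_bits
    return None


def threads_per_instance(bits):
    """Mirror of the TPI expression in the diff."""
    if bits <= 512:
        return 4
    if bits <= 2048:
        return 8
    if bits <= 8192:
        return 16
    return 32


near_512 = 2**510 - 1  # a 510-bit input
print(pick_kernel_buggy(near_512), pick_kernel_fixed(near_512))  # 512 1024
```

With the corrected test, the 512-bit kernel is used exactly for inputs below 2^506, which matches the cutoff chris2be8 mentioned earlier in the thread.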

 SethTro 2021-08-29 22:55

Whoops, totally backwards, coding is hard :p I'll fix it tonight.
Thanks for testing

 chris2be8 2021-08-30 16:06

Has anyone checked ecm-cgbn can find factors? On my system with a sm_30 GPU I updated test.gpuecm to pass -cgbn to ecm. But it failed to find any factors when the test cases expected them to be found!

It is *probably* because sm_30 is too low for CGBN.

It will be a while before I can test my newer GPU. The system it's on is running an old version of Linux which doesn't support CUDA 9.0. (I've been working on a "if it works don't fix it" base since it's only used for computations.) Upgrading Linux will probably need a complete re-install which I'll need to plan for a time when I don't need it for a few hours/days. And I'd be happier if I was sure CGBN would work once I got it installed.

 SethTro 2021-08-30 18:51

[QUOTE=chris2be8;586868]Has anyone checked ecm-cgbn can find factors? On my system with a sm_30 GPU I updated test.gpuecm to pass -cgbn to ecm. But it failed to find any factors when the test cases expected them to be found!

It is *probably* because sm_30 is too low for CGBN.

It will be a while before I can test my newer GPU. The system it's on is running an old version of Linux which doesn't support CUDA 9.0. (I've been working on a "if it works don't fix it" base since it's only used for computations.) Upgrading Linux will probably need a complete re-install which I'll need to plan for a time when I don't need it for a few hours/days. And I'd be happier if I was sure CGBN would work once I got it installed.[/QUOTE]

Yes, many of us have found the same test factor for (2^499-1)/20959 and I've verified several times that the residuals exactly match those produced by `-gpu`. I've also tested with `\$ sage check_gpuecm.sage "./ecm -cgbn"`

 frmky 2021-08-30 18:52

[QUOTE=chris2be8;586868]Has anyone checked ecm-cgbn can find factors?[/QUOTE]

Yes, test.gpuecm completes successfully both with and without -cgbn. I'm using a V100 with CUDA 11.3.

 frmky 2021-08-30 19:34

Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?

 bsquared 2021-08-30 19:42

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]

I am working on the ability to process ecm save files with yafu, but it isn't ready yet.

 EdH 2021-08-30 19:44

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]Not sure if I'm understanding the question, but would [URL="https://www.mersenneforum.org/showthread.php?t=15508"]ECM.py[/URL] work?

Edit: For my Colab-GPU ECM experiments, I use:[code]python3 ecm.py -resume residues[/code]to run the residues from the Colab GPU stage 1 portion. I think I have all the threads, etc. set in the Python code, but they can be used on the command line, as well.

 SethTro 2021-08-31 08:49

@EdH I started using ECM.py again and it's great!

---

I wrote a bunch of code today so S_BITS_PER_BATCH is dynamic and there's better verbose output.

Verbose output includes this message when the kernel size is much larger than the input number.
[CODE]
Input number is 2^239-1 (72 digits)
Compiling custom kernel for 256 bits should be ~180% faster
CGBN<512, 4> running kernel<56 block x 128 threads>
[/CODE]
I doubt that verbose is the right place for this output (as I'm not sure how many people can actually recompile cuda code), but if you have a working setup it's as easy as changing

[CODE]
- typedef cgbn_params_t<4, 512> cgbn_params_4_512;
+ typedef cgbn_params_t<4, 256> cgbn_params_4_512;
[/CODE]

---

ETA and estimated throughput

[CODE]
Copying 716800 bits of data to GPU
CGBN<640, 8> running kernel<112 block x 128 threads>
Computing 100 bits/call, 0/4328085 (0.0%)
Computing 110 bits/call, 100/4328085 (0.0%)
Computing 121 bits/call, 210/4328085 (0.0%)
...
Computing 256 bits/call, 1584/4328085 (0.0%)
Computing 655 bits/call, 5630/4328085 (0.1%)
Computing 1694 bits/call, 16050/4328085 (0.4%)
Computing 2049 bits/call, 35999/4328085 (0.8%), ETA 184 + 2 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 56489/4328085 (1.3%), ETA 183 + 2 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 158939/4328085 (3.7%), ETA 178 + 7 = 185 seconds (~103 ms/curves)
Computing 2049 bits/call, 363839/4328085 (8.4%), ETA 169 + 16 = 185 seconds (~103 ms/curves)
...
Computing 2049 bits/call, 1798139/4328085 (41.5%), ETA 109 + 77 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 2003039/4328085 (46.3%), ETA 100 + 86 = 186 seconds (~104 ms/curves)
Computing 2049 bits/call, 4052039/4328085 (93.6%), ETA 12 + 175 = 187 seconds (~104 ms/curves)
Copying results back to CPU ...
Computing 1792 Step 1 took 240ms of CPU time / 186575ms of GPU time
Throughput: 9.605 curves per second (on average 104.12ms per Step 1)
[/CODE]

This is nice as it gives very early feedback (estimates after 1-5 seconds are very accurate) if you are changing `-gpucurves` or playing with custom kernel bit sizes.
I've found that doubling gpucurves can lead to 2x worse throughput! So I may need to add some warnings.
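
The two summary figures in the log above follow directly from the GPU time and the curve count. A minimal Python sketch of that arithmetic is below; the function names are made up for illustration, and the linear ETA model is an assumption about how such an estimate could be formed, not a copy of the branch's code.

```python
def step1_throughput(gpu_ms, curves):
    """Return (ms per curve, curves per second) for a finished batch."""
    ms_per_curve = gpu_ms / curves
    curves_per_second = curves / (gpu_ms / 1000.0)
    return ms_per_curve, curves_per_second


def eta_seconds(elapsed_s, bits_done, bits_total):
    """Remaining time by linear extrapolation (assumed model)."""
    if bits_done <= 0:
        raise ValueError("need some progress before estimating")
    return elapsed_s * (bits_total - bits_done) / bits_done


if __name__ == "__main__":
    # Numbers taken from the final lines of the log above.
    ms, cps = step1_throughput(186575, 1792)
    print("%.2f ms/curve, %.3f curves/s" % (ms, cps))
```

Comparing the ms/curve figure across a few `-gpucurves` values after a few seconds each is enough to spot a configuration that halves throughput.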

 EdH 2021-08-31 12:41

[QUOTE=SethTro;586911]@EdH I started using ECM.py again and it's great!
---
[/QUOTE]Good to read. I just wish I could get my sm_30 card to do something. . . (2 sm-20s and 1 sm_30 and none will do anything productive, . . . yet. With all the install/reinstall/remove activity, now the sm_30 machine is complaining about a linux-kernel, so I've taken a break from trying more.)

 bur 2021-08-31 17:53

I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar.

And is it so that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL] and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?

 SethTro 2021-08-31 21:01

[QUOTE=bur;586936]I couldn't find it in the thread (hope I didn't just overlook it), how does the speed of ECM on GPU generally compare to CPU? Say a GTX 1660 or similar.

And is it so that only small B1 values can be used? I found [URL="https://eprint.iacr.org/2020/1265.pdf"]this paper[/URL] and they also only seem to have used B1=50k. With a 2080 Ti they achieved "2781 ECM trials", I guess curves, per second for B1=50k. That is very fast, but if the B1 size is severely limited, a CPU is still required for larger factors?[/QUOTE]

The most important factor is the size of N (which is limited by CGBN to 32K bits for GPUs, or ~10,000 digits).
Both CPU and GPU have the same linear scaling for B1 which can be increased to any number you want.

The speedup depends strongly on your CPU vs GPU. For my 1080ti vs 2600K

250 bits 46x faster on GPU
500 bits 48x faster on GPU
1000 bits 68x faster on GPU
1500 bits 83x faster on GPU
2000 bits 46x faster on GPU

Which means we are seeing roughly the same scaling for the GPU as CPU for bit levels < 2K.
Informal testing with larger inputs (2048 - 32,768 bits) shows the CPU outscales the GPU for larger inputs, and the speedup slowly decreases from ~50x to ~25x as bits increase from 2K to 16K. At the maximal value of 32K bits, performance decreases again to 14x (from 26x at 16K bits).

 xilman 2021-09-01 01:50

[QUOTE=frmky;586881]Is there a simpler way to distribute stage 2 across multiple cores than creating a script to use the -save option with B2=0, split the save file, then launch multiple ecm processes with -resume?[/QUOTE]it is what I used to do when GPU-enabled ECM still worked on my machines. It was a trivial script to write.

 bsquared 2021-09-01 15:02

I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:

specifying --with-gmp together with --with-cgbn-include doesn't work. I had to use the system default gmp (6.0.0).

With compute 70 I still have to replace __any with __any_sync(__activemask() on line 10 of cude_kernel_default.cu

building with gcc I get this error in cgbn_stage1.cu: cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>"

I suppose I need to build with g++ instead?

Anyway I can get past all of that and get a working binary and the cpu usage is now much lower. But now the gpu portion appears to be about 15% slower?

Before:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
[/CODE]

New clone:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
[/CODE]

Anyone else seeing this?

 chris2be8 2021-09-01 16:42

Hello,

I've upgraded my system with a GTX 970 (sm_52) to openSUSE 42.2 and installed CUDA 9.0 on it. But when I try to compile ecm with GPU support ./configure says:
[code]
configure: Using cuda.h from /usr/local/cuda/include
checking cuda.h usability... no
checking cuda.h presence... yes
configure: WARNING: cuda.h: present but cannot be compiled
configure: WARNING: cuda.h: check for missing prerequisite headers?
configure: WARNING: cuda.h: see the Autoconf documentation
configure: WARNING: cuda.h: section "Present But Cannot Be Compiled"
configure: WARNING: cuda.h: proceeding with the compiler's result
configure: WARNING: ## ----------------------------------- ##
configure: WARNING: ## Report this to ecm-discuss@inria.fr ##
configure: WARNING: ## ----------------------------------- ##
checking for cuda.h... no
configure: error: required header file missing
[/code]

[code]
Some versions of CUDA are not compatible with recent versions of gcc.
To specify which C compiler is called by the CUDA compiler nvcc, type:

\$ ./configure --enable-gpu --with-cuda-compiler=/PATH/DIR

If you get errors about "cuda.h: present but cannot be compiled"
Try using an older CC:

\$ ./configure --enable-gpu CC=gcc-8

The value of this parameter is directly passed to nvcc via the option
"--compiler-bindir". By default, GMP-ECM lets nvcc choose what C compiler it
uses.
[/code]

The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE?

Chris (getting slightly frustrated by now)

 EdH 2021-09-01 17:47

[QUOTE=chris2be8;587001]The only gcc installed now is version 4.8.5. Should I install an older gcc (if so what level) or should I upgrade the OS to a higher level so I can install a newer CUDA? Does anyone have ecm working with CUDA 9.0 or higher on openSUSE and if so what level of openSUSE?

Chris (getting slightly frustrated by now)[/QUOTE]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?

 chris2be8 2021-09-01 18:17

[c]gcc --version[/c] returns:
[code]
gcc (SUSE Linux) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[/code]

[c]zypper search gcc[/c] shows it as gcc48 and says gcc5 and gcc6 could also be installed.

I've installed clang as well:
[code]
clang --version
clang version 3.8.0 (tags/RELEASE_380/final 262553)
Target: x86_64-unknown-linux-gnu
InstalledDir: /usr/bin
[/code]
but that gets a different error:
[code]
./configure --enable-gpu=50 --with-cuda=/usr/local/cuda --with-cuda-compiler=clang CC=clang
...
configure: Using nvcc compiler from from /usr/local/cuda/bin
checking for compatibility between gcc and nvcc... no
configure: error: gcc version is not compatible with nvcc
[/code]

I don't think my problems are due to openSUSE. So if someone who has ecm with cgbn working on any Linux distro could say what version of CUDA and what compiler version they have I could probably get it working.

 SethTro 2021-09-01 18:54

[QUOTE=bsquared;586996]I re-cloned the gpu_integration branch to capture the latest changes and went through the build process with the following caveats:

specifying --with-gmp together with --with-cgbn-include doesn't work. I had to use the system default gmp (6.0.0).

With compute 70 I still have to replace __any with __any_sync(__activemask() on line 10 of cude_kernel_default.cu

building with gcc I get this error in cgbn_stage1.cu: cgbn_stage1.cu(654): error: initialization with "{...}" is not allowed for object of type "const std::vector<uint32_t, std::allocator<uint32_t>>"

I suppose I need to build with g++ instead?
[/QUOTE]

[B]I rebased the branch to clean up the git history, so everyone will likely need to `git pull` and `git reset --hard origin/gpu_integration`. I'm sorry, but we're still in development, and everything is now nicer to review.
[/B]
I fixed the vector initialize issue and have included your "__any_sync(__activemask()" fix in the repo (I forgot to credit you in the commit but I'll try and do that the next time I rebase).

I'm not sure why --with-gmp doesn't work with --with-cgbn-include. If you have some sense of why, I'm happy to try and fix it.
If it's failing on "checking if CGBN is present...", maybe try adding more flags to acinclude.m4:617 [-I\$cgbn_include \$GMPLIB], maybe "-I\$with_gmp_include" and/or "-L\$with_gmp_lib"

 SethTro 2021-09-01 19:00

[QUOTE=EdH;587008]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?[/QUOTE]

I know that feeling and I really empathize. I'm building on the pile of kludge that is CUDA and I wish I could make this easier.

Did you try with CC=gcc-9? I can also maybe add some debug to the configure log to show which CC it's using.

I personally use this to configure
[CODE]./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn[/CODE]

and my gcc / nvcc versions
[CODE]
\$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

\$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
[/CODE]

If you tell me what compute / sm_arch your card is I can try building and sending you a binary.

 EdH 2021-09-01 19:25

In my case, everything except ECM and Msieve seemed to be working, but I've uninstalled everything now and I thought from a few posts ago that my arch 3.0 was perhaps too ancient, 3.5 being necessary. ATM, updates, etc. are also giving me errors, so I was going to step back for a bit. I've uninstalled all the CUDA, NVIDIA, etc. from the system. In its latest iteration, although I had installed CUDA 10.2, nvcc and nvidia-smi claimed to be running CUDA 11, which does not support architecture 3.0. I'll try another installation some time soon and then see where it stalls. If I can't get ECM to build for GPU with my card, there is no point trying to add in cgbn, is there?

Thanks!

 xilman 2021-09-01 19:26

[QUOTE=EdH;587008]I've passed the frustration point with my systems. I was getting the same with my Ubuntu 20.04 with all the 10.x and 11.x CUDA versions (my card isn't supported by CUDA 11.x, anyway). I installed and made default several older gcc versions (8, 9, 10).* I gave up for now.

* I'm curious about the gcc version number difference between yours and mine. The default Ubuntu 20.04 gcc is 9.3.0, my Debian Buster is 8.3.0, and the default for my Fedora 33 is 10.3.1. Is your version actually that old compared to mine?[/QUOTE]Which is why I would love for someone to make a [B]fully[/B] static Linux executable for a relatively low SM value.

OK, it would not be as fast as the latest and greatest but at least it would be much faster than a purely cpu version.

I'd do it myself but haven't been able to compile with CUDA for far too long now.:sad:

 chris2be8 2021-09-01 20:22

And I've been having "fun" with msieve's CUDA support. The version I had been running failed saying [c][sort_engine.cu, 95] sort engine: (CUDA error 78: a PTX JIT compilation failed)[/c] (probably because it was compiled with an old version of CUDA). So I decided to install the latest version of msieve, revision 1043. That also failed, with a message saying "file not found", but of course it didn't say *which* file it could not find. After a lot of puzzling I found that revision 1043 notes the card is compute architecture 5.2 and tries to load stage1_core_sm52.ptx. But the Makefile as shipped is only set up to build ptx files for sm20, sm30, sm35 and sm50. So you are out of luck with any other architecture. I hacked the Makefile, first to remove sm20, which CUDA 9.0 doesn't support, then to add sm52 once I realised that was missing.

The makefile probably should build ptx files for all of this list:
[code]
~/msieve.1043/trunk> strings msieve | grep ptx
stage1_core_sm20.ptx
stage1_core_sm30.ptx
stage1_core_sm35.ptx
stage1_core_sm50.ptx
stage1_core_sm52.ptx
stage1_core_sm61.ptx
stage1_core_sm70.ptx
stage1_core_sm75.ptx
stage1_core_sm86.ptx
[/code]

If I hadn't known of the [c]strings[/c] command I would have been stuck.
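
The mismatch described above can be checked mechanically. The Python sketch below assumes the naming pattern visible in the [c]strings[/c] output (stage1_core_sm followed by the two compute-capability digits); the function names are illustrative and not part of msieve.

```python
def ptx_name(major, minor):
    """File name the binary appears to load for a given compute capability."""
    return "stage1_core_sm%d%d.ptx" % (major, minor)


def missing_ptx(expected, built):
    """Return the expected ptx files that were never built, in order."""
    built_set = set(built)
    return [name for name in expected if name not in built_set]


if __name__ == "__main__":
    # Names embedded in the binary, per the strings output above.
    expected = ["stage1_core_sm%s.ptx" % arch
                for arch in ("20", "30", "35", "50", "52", "61", "70", "75", "86")]
    # Targets the shipped Makefile builds, per the post above.
    built = ["stage1_core_sm%s.ptx" % arch for arch in ("20", "30", "35", "50")]
    print(ptx_name(5, 2) in built)
    print(missing_ptx(expected, built))
```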

 SethTro 2021-09-01 20:26

[QUOTE=bsquared;586996]
Anyway I can get past all of that and get a working binary and the cpu usage is now much lower. But now the gpu portion appears to be about 15% slower?

Before:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 75571ms of CPU time / 129206ms of GPU time
Throughput: 39.627 curves per second (on average 25.24ms per Step 1)
[/CODE]

New clone:
[CODE]
Input number is 2^997-1 (301 digits)
Computing 5120 Step 1 took 643ms of CPU time / 149713ms of GPU time
Throughput: 34.199 curves per second (on average 29.24ms per Step 1)
[/CODE]

Anyone else seeing this?[/QUOTE]

Can you try running with `-v --gpucurves 1280` and `--gpucurves 2560` (if you are having fun you can also try 640 and 1792)?
The new code should give you approximate timings quite quickly so no need to complete a full run.

I have seen 2x and 4x slowdowns when gpucurves is large. I may need to put in some code that searches for optimal throughput.

 SethTro 2021-09-01 20:30

[QUOTE=xilman;587015]Which is why I would love for someone to make a [B]fully[/B] static Linux executable for a relatively low SM value.

OK, it would not be as fast as the latest and greatest but at least it would be much faster than a purely cpu version.

I'd do it myself but haven't been able to compile with CUDA for far too long now.:sad:[/QUOTE]

I don't know how static linking works, especially with respect to CUDA, but I compiled ecm with all supported SM (including sm35 and sm70) using CUDA 11.2. Feel free to try it, but I wouldn't be too hopeful. It doesn't run in Colab and gives an error.

[url]https://static.cloudygo.com/static/ecm_cgbn_cuda11_2[/url]

^ I pinky-promise this isn't a virus

 frmky 2021-09-01 21:10

cudacommon.h is missing from the git repository.

 henryzz 2021-09-01 21:11

[QUOTE=chris2be8;587010][c]gcc --version[/c] returns:
[code]
gcc (SUSE Linux) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[/code][c]zypper search gcc[/c] shows it as gcc48 and says gcc5 and gcc6 could also be installed. [/quote]

My guess is that your gcc version may be too old. I would try the most recent version you can get your hands on. The easiest way may be to update your OS into a version that isn't end of life.

 bsquared 2021-09-01 21:20

[QUOTE=SethTro;587019]Can you try running with `-v --gpucurves 1280` and `--gpucurves 2560` (if you are having fun you can also try 640 and 1792)?
The new code should give you approximate timings quite quickly so no need to complete a full run.

I have seen 2x and 4x slowdowns when gpucurves is large. I may need to put in some code that searches for optimal throughput.[/QUOTE]

1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.
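
Picking the winner from a handful of short test runs is a simple minimum over the measured ms/curve values. A small Python sketch, using the four measurements listed above (the function name is illustrative only):

```python
def best_gpucurves(ms_per_curve):
    """Return the -gpucurves value with the lowest measured ms/curve."""
    if not ms_per_curve:
        raise ValueError("no measurements given")
    return min(ms_per_curve, key=ms_per_curve.get)


if __name__ == "__main__":
    # gpucurves -> approximate ms/curve, from the runs above.
    measured = {1280: 31, 2560: 21, 640: 63, 1792: 36}
    print(best_gpucurves(measured))
```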

 SethTro 2021-09-01 22:46

[QUOTE=frmky;587024]cudacommon.h is missing from the git repository.[/QUOTE]

Fixed along with another issue.

 SethTro 2021-09-02 00:44

[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE]

I added `gpu_throughput_test.sh` which runs different sized inputs and measures throughput.

On my system maximum results are achieved at

256 bits: 2x default curves (or 3584 curves), same speed at 4x default too
512 bits: 2x and 4x default curves
1024 bits: only at default curves
extra testing at 2048 bits: 1.5x and 3x outperform 2x and 4x slightly

 SethTro 2021-09-02 00:45

[QUOTE=SethTro;587033]I added `gpu_throughput_test.sh` which runs different sized inputs and measures throughput.

On my system maximum results are achieved at

256 bits: 2x default curves (or 3584 curves), same speed at 4x default too
512 bits: 2x and 4x default curves
1024 bits: only at default curves
extra testing at 2048 bits: 1.5x and 3x outperform 2x and 4x slightly[/QUOTE]

Maybe this relates to registers used by the kernel? max threads per block? Any insight from CUDA experts would be appreciated

 SethTro 2021-09-02 09:12

I halved compile time by adding cgbn_swap and avoiding inlining double_add_v2 twice.

Sadly I pushed the branch and it will probably fail to compile for everyone till [url]https://github.com/NVlabs/CGBN/pull/17[/url] gets pulled

---

@bsquared, you might try changing TPB_DEFAULT from 128 to 512. In some initial testing with ./gpu_throughput_test.sh, it looks like larger gpucurves values no longer slow down. More testing to follow tomorrow.

 chris2be8 2021-09-02 15:39

[QUOTE=henryzz;587025]My guess is that your gcc version may be too old. I would try the most recent version you can get your hands on. The easiest way may be to update your OS into a version that isn't end of life.[/QUOTE]

I've installed gcc-6 (the latest in the repositories) and that gets past that error, but fails a bit further on:
[code]
gcc-6 --version
gcc-6 (SUSE Linux) 6.2.1 20160826 [gcc-6-branch revision 239773]
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

./configure --enable-gpu=30 --with-cuda=/usr/local/cuda CC=gcc-6 -with-cgbn-include=/home/chris/CGBN/include/cgbn
...
configure: Using cuda.h from /usr/local/cuda/include
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking that CUDA Toolkit version is at least 3.0... (9.0) yes
configure: Using CUDA dynamic library from /usr/local/cuda/lib64
checking for cuInit in -lcuda... yes
checking that CUDA Toolkit version and runtime version are the same... no
configure: error: 'cuda.h' and 'cudart' library have different versions, you have to reinstall CUDA properly, or use the --with-cuda parameter to tell configure the path to the CUDA library and header you want to use
[/code]

That error message doesn't make much sense because I only have one version of CUDA installed on the system. So it's probably failing to compile a test program.

So I'll try upgrading the OS next. Then install later versions of CUDA and gcc.

 SethTro 2021-09-02 20:06

[QUOTE=chris2be8;587067]I've installed gcc-6 (the latest in the repositories) and that gets past that error, but fails a bit further on:
[code]
configure: error: 'cuda.h' and 'cudart' library have different versions, you have to reinstall CUDA properly, or use the --with-cuda parameter to tell configure the path to the CUDA library and header you want to use
[/code]

That error message doesn't make much sense because I only have one version of CUDA installed on the system. So it's probably failing to compile a test program.
[/QUOTE]

You can find the literal program it failed to compile in config.log, or its shape in acinclude.m4 (basically wrap the 2nd block in int main() { ... })

[CODE]
AC_RUN_IFELSE([AC_LANG_PROGRAM([
[
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
]],[[
int libversion;
cudaError_t err;
err = cudaRuntimeGetVersion (&libversion);
if (err != cudaSuccess)
{
printf ("Could not get runtime version\n");
printf ("Error msg: %s\n", cudaGetErrorString(err));
return -1;
}
printf("(%d.%d/", CUDA_VERSION/1000, (CUDA_VERSION/10) % 10);
printf("%d.%d) ", libversion/1000, (libversion/10) % 10);
if (CUDA_VERSION == libversion)
return 0;
else
return 1;
]])],
[/CODE]

And you can find the command line it tried to compile this with in config.log too (my guess is something like gcc-9 -o conftest -I/usr/local/cuda/include -g -O2 -I/usr/local/cuda/include -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 conftest.c -lcudart -lstdc++ -lcuda -lrt )
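
For reading the "(x.y/x.y)" pair that the test program prints, here is a small Python sketch of the same decoding. It mirrors the two printf lines in the configure test and relies on CUDA version integers being encoded as 1000*major + 10*minor; the function names are illustrative.

```python
def decode_cuda_version(value):
    """Split a CUDA version integer into (major, minor), as the printf calls do."""
    return value // 1000, (value // 10) % 10


def versions_match(header_version, runtime_version):
    """The configure test passes only when both integers are identical."""
    return header_version == runtime_version


if __name__ == "__main__":
    # Example integers for CUDA 9.0 and 11.2.
    print(decode_cuda_version(9000), decode_cuda_version(11020))
    print(versions_match(9000, 11020))
```

Note that the comparison is on the full integer, so any difference between the header and the runtime library makes the check fail.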

 frmky 2021-09-02 22:28

I think this can be triggered if the version of CUDA supported by the driver doesn't match the toolkit version. But this is usually ok as long as the driver is a little newer. I think both this and the lack of cuInit() in the CUDA lib should be warnings, not errors. Both of these are ok in some circumstances.

 SethTro 2021-09-02 23:57

Happy me!

I found a 36-digit and a 35-digit factor from a [URL="http://factordb.com/index.php?id=1100000002657449020"]C303[/URL] today (from [URL="https://docs.google.com/spreadsheets/d/1IuxGlf6dEUd8Qixu87P-_r6sgdG7Yl8UUPXS6rKBpbM/edit#gid=1905095108"]Factoring for a publication[/URL])

[CODE]
GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 1796 (-sigma 3:1850760857)
GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 2049 (-sigma 3:1850761110)
GPU: factor 404157820975138535541421971085010741 found in Step 1 with curve 2449 (-sigma 3:1850761510)
Computing 3584 Step 1 took 2294ms of CPU time / 1816867ms of GPU time
********** Factor found in step 1: 404157820975138535541421971085010741
Found prime factor of 36 digits: 404157820975138535541421971085010741
[/CODE]

Then
[CODE]
Thu 2021/09/02 23:25:50 UTC Step 1 took 0ms
Thu 2021/09/02 23:25:50 UTC Step 2 took 9668ms
Thu 2021/09/02 23:25:50 UTC ********** Factor found in step 2: 51858345311243630596653971633910169
Thu 2021/09/02 23:25:50 UTC Found prime factor of 35 digits: 51858345311243630596653971633910169
[/CODE]

Feels good that this code is being useful :)

 frmky 2021-09-03 07:02

[QUOTE=SethTro;587108]Feels good that this code is being useful :)[/QUOTE]
Nearly all of the factors that I found for Factoring for a Publication 2 used this code.

 chris2be8 2021-09-04 16:00

I'm still puzzling over it. I've upgraded the system to openSUSE Leap 15.3 and installed CUDA 11.4. But no matter what I do [c]lspci -v[/c] still says [c]Kernel modules: nouveau[/c]

I've tried everything I can find in the CUDA Installation Guide for Linux. And everything I can find on the web. But it still loads the nouveau kernel module, not the one shipped with CUDA. Has anyone any idea how to get it to use the Nvidia drivers?

NB. On the system with the GTX 970:
[code]
4core:~ # lspci -v -s 01:00
01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 3978
Flags: fast devsel, IRQ 11
Memory at f6000000 (32-bit, non-prefetchable) [disabled] [size=16M]
Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at f0000000 (64-bit, prefetchable) [disabled] [size=32M]
I/O ports at e000 [disabled] [size=128]
Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel modules: nouveau
[/code]

Compare with on the system with a CC 3.0 card:
[code]
root@sirius:~# lspci -v -s 07:00
07:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 760] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] GK104 [GeForce GTX 760]
Flags: bus master, fast devsel, latency 0, IRQ 76
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at e8000000 (64-bit, prefetchable) [size=128M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
[/code]

Compare the last line of output in each case.
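
The difference between the two listings can also be checked by script. The Python sketch below looks for the "Kernel driver in use:" line, which appears in the second listing but not the first; the function name is illustrative, and the text is assumed to be captured [c]lspci -v[/c] output for a single device.

```python
def kernel_driver_in_use(lspci_text):
    """Return the driver bound to the device, or None if no driver is bound."""
    prefix = "Kernel driver in use:"
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            return stripped[len(prefix):].strip()
    return None


if __name__ == "__main__":
    # Tail of each listing above.
    without_driver = "Capabilities: [900] #19\nKernel modules: nouveau\n"
    with_driver = ("Kernel driver in use: nvidia\n"
                   "Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia\n")
    print(kernel_driver_in_use(without_driver))
    print(kernel_driver_in_use(with_driver))
```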

If it's because CUDA 11.4 doesn't support this card I could try removing CUDA 11.4 and installing CUDA 10.x. But would that work?

 paulunderwood 2021-09-04 16:58

[QUOTE]
Boot to Ubuntu, but before you login in to Ubuntu, press Cntrl+Alt+F2

run the following command:

sudo nano /etc/modprobe.d/blacklist-nouveau.conf

add the 2 following lines, save & exit

blacklist nouveau
options nouveau modeset=0

run the following command

sudo update-initramfs -u

[/QUOTE]

reboot.

run [c]lsmod | grep nvidia[/c]

HTH

 chris2be8 2021-09-04 17:08

Thanks, but I've already tried that:
[code]
4core:/etc/modprobe.d # cat 60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

And it is in the current initramfs:
[code]
4core:/etc/modprobe.d # lsinitrd -f /etc/modprobe.d/60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

lsmod doesn't show any nvidia kernel modules:
[code]
4core:/etc/modprobe.d # lsmod | grep -i nvidia
4core:/etc/modprobe.d #
[/code]

On my system where CUDA (but not cgbn) works:
[code]
root@sirius:~# lsmod | grep nvidia
nvidia_uvm 876544 0
nvidia_drm 49152 5
nvidia_modeset 1122304 14 nvidia_drm
nvidia 19517440 682 nvidia_uvm,nvidia_modeset
drm_kms_helper 180224 1 nvidia_drm
drm 483328 8 drm_kms_helper,nvidia_drm
ipmi_msghandler 102400 2 ipmi_devintf,nvidia
[/code]

 paulunderwood 2021-09-04 17:20

 chris2be8 2021-09-04 17:26

[code]
4core:/etc/modprobe.d # cat 60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

And it is in initrd:
[code]
4core:/etc/modprobe.d # lsinitrd -f /etc/modprobe.d/60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

Digging a bit further I don't think the nvidia kernel modules are correctly installed:
[code]
4core:/lib/modules # find . -name 'nvidia*'
./5.3.18-57-default/kernel/drivers/net/ethernet/nvidia
./5.3.18-57-preempt/kernel/drivers/net/ethernet/nvidia
./5.3.18-59.19-preempt/kernel/drivers/net/ethernet/nvidia
./5.3.18-59.19-default/kernel/drivers/net/ethernet/nvidia

4core:/lib/modules # uname -r
5.3.18-59.19-preempt
[/code]

So the kernel I'm running won't find them because it will look in 5.3.18-59.19-preempt even though they are installed in 5.3.18-59.19-default (next question, how to fix this cleanly). But at least I think I know where I'm going now.

 chris2be8 2021-09-04 17:28

zypper on the command line. Following the instructions on Nvidia's web site [url]https://developer.nvidia.com/cuda-downloads[/url]

 EdH 2021-09-04 18:04

Some of the instructions I saw in the past had a separate step, almost hidden, that was required to install the driver. Is it possible there is a driver install step missing in your procedure?

For my Ubuntu repository install of 10.2, it automatically installs the 470 driver, no matter what I have beforehand.

Is there an equivalent to this Ubuntu command?:[code]sudo [B]ubuntu-drivers devices[/B]
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
== /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0 ==
modalias : pci:v000010DEd00000FFDsv0000103Csd00000967bc03sc00i00
vendor : NVIDIA Corporation
model : GK107 [NVS 510]
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-450 - third-party non-free
driver : nvidia-driver-460-server - distro non-free
driver : nvidia-driver-455 - third-party non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-340 - distro non-free
driver : nvidia-driver-465 - third-party non-free
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-470 - third-party non-free recommended
driver : nvidia-driver-418 - third-party non-free
driver : nvidia-driver-410 - third-party non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-440 - third-party non-free
driver : nvidia-driver-460 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
[/code]
Would something like that be of any help?

 chris2be8 2021-09-04 20:10

After rebooting using the 5.3.18-59.19-default kernel the nvidia drivers are picked up:
[code]
4core:~ # lspci -v -s 01:00
01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 3978
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
[/code]

I'll need to fix that but it can wait for now.

Then I started testing things ...

msieve works OK:
[code]
Sat Sep 4 19:10:51 2021 Msieve v. 1.54 (SVN 1043)
Sat Sep 4 19:10:51 2021 random seeds: 6e515738 cae1a347
Sat Sep 4 19:10:51 2021 factoring 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139 (100 digits)
Sat Sep 4 19:10:51 2021 no P-1/P+1/ECM available, skipping
Sat Sep 4 19:10:51 2021 commencing number field sieve (100-digit input)
Sat Sep 4 19:10:51 2021 commencing number field sieve polynomial selection
Sat Sep 4 19:10:51 2021 polynomial degree: 4
Sat Sep 4 19:10:51 2021 max stage 1 norm: 1.16e+17
Sat Sep 4 19:10:51 2021 max stage 2 norm: 8.33e+14
Sat Sep 4 19:10:51 2021 min E-value: 9.89e-09
Sat Sep 4 19:10:51 2021 poly select deadline: 54
Sat Sep 4 19:10:51 2021 time limit set to 0.01 CPU-hours
Sat Sep 4 19:10:51 2021 expecting poly E from 1.49e-08 to > 1.71e-08
Sat Sep 4 19:10:51 2021 searching leading coefficients from 10000 to 1000000
Sat Sep 4 19:10:52 2021 using GPU 0 (NVIDIA GeForce GTX 970)
Sat Sep 4 19:10:52 2021 selected card has CUDA arch 5.2
Sat Sep 4 19:11:19 2021 polynomial selection complete
Sat Sep 4 19:11:19 2021 elapsed time 00:00:28
[/code]

But I've been having fun with ecm.

The problem with conftest turned out to be:
[code]
chris@4core:~> gcc-9 -o conftest -I/usr/local/cuda/include -g -O2 -I/usr/local/cuda/include -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 conftest.c -lcudart -lstdc++ -lcuda -lrt -lm -lm -lm -lm -lm
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
[/code]

So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK.

I then got a lot of errors like this:
[code]
Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/code]

So I edited the Makefile to only build for sm_52, since that's all I need.
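For anyone restricting the build the same way, the nvcc option that ends up on the compile line (visible in the build log below) follows a simple pattern. This is only a sketch of that pattern; `gencode_flag` is a hypothetical helper, and where exactly the flag lives in the Makefile is not shown in this thread.

```python
def gencode_flag(compute_capability):
    """Build the nvcc --generate-code option for one compute capability.

    "5.2" -> "--generate-code arch=compute_52,code=sm_52"
    """
    major, minor = compute_capability.split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"--generate-code arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    # msieve reported "CUDA arch 5.2" for the GTX 970 above.
    print(gencode_flag("5.2"))
```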

But trying to build CGBN support I get:
[code]
chris@4core:~/ecm-cgbn/gmp-ecm> make
make all-recursive
make[1]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm'
Making all in x86_64
make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64'
make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm'
/bin/sh ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(757): here

cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(757): here

cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(760): here

cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(760): here

4 errors detected in the compilation of "cgbn_stage1.cu".
make[2]: *** [Makefile:2571: cgbn_stage1.lo] Error 1
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make[1]: *** [Makefile:1903: all-recursive] Error 1
make[1]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make: *** [Makefile:783: all] Error 2
[/code]

This is after several attempts to run make, so hopefully only the relevant messages.

But I've got an older version of ecm working on the GPU (at last!), so I'll leave it for now.

 frmky 2021-09-04 22:01

[QUOTE=SethTro;587047]I halved compile time by adding cgbn_swap and avoiding inlining double_add_v2 twice.
[/QUOTE]

Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of
[CODE] typedef cgbn_params_t<4, 256> cgbn_params_256;
typedef cgbn_params_t<4, 512> cgbn_params_512;
typedef cgbn_params_t<8, 768> cgbn_params_768;
typedef cgbn_params_t<8, 1024> cgbn_params_1024;
typedef cgbn_params_t<8, 1536> cgbn_params_1536;
typedef cgbn_params_t<8, 2048> cgbn_params_2048;
typedef cgbn_params_t<16, 3072> cgbn_params_3072;
typedef cgbn_params_t<16, 4096> cgbn_params_4096;
typedef cgbn_params_t<16, 5120> cgbn_params_5120;
typedef cgbn_params_t<16, 6144> cgbn_params_6144;
typedef cgbn_params_t<16, 7168> cgbn_params_7168;
typedef cgbn_params_t<16, 8192> cgbn_params_8192;
typedef cgbn_params_t<32, 10240> cgbn_params_10240;
typedef cgbn_params_t<32, 12288> cgbn_params_12288;
typedef cgbn_params_t<32, 14336> cgbn_params_14336;
typedef cgbn_params_t<32, 16384> cgbn_params_16384;
typedef cgbn_params_t<32, 18432> cgbn_params_18432;
typedef cgbn_params_t<32, 20480> cgbn_params_20480;
typedef cgbn_params_t<32, 22528> cgbn_params_22528;
typedef cgbn_params_t<32, 24576> cgbn_params_24576;
typedef cgbn_params_t<32, 28672> cgbn_params_28672;
typedef cgbn_params_t<32, 32768> cgbn_params_32768;
[/CODE]
and it took a little over an hour to compile for sm_70.
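To make the list above concrete, here is a rough sketch of how a number could be matched to one of these kernels: take the smallest kernel whose bit size covers the input. This is an assumption about the dispatch, not a copy of what cgbn_stage1.cu does; the real code may want a few bits of headroom, and `select_kernel` is a made-up name.

```python
# (TPI, bits) pairs copied from the typedef list above.
KERNELS = [
    (4, 256), (4, 512), (8, 768), (8, 1024), (8, 1536), (8, 2048),
    (16, 3072), (16, 4096), (16, 5120), (16, 6144), (16, 7168), (16, 8192),
    (32, 10240), (32, 12288), (32, 14336), (32, 16384), (32, 18432),
    (32, 20480), (32, 22528), (32, 24576), (32, 28672), (32, 32768),
]


def select_kernel(n):
    """Return the (TPI, bits) of the smallest listed kernel that can hold n."""
    needed = n.bit_length()
    for tpi, bits in KERNELS:
        if bits >= needed:  # assumption: no extra headroom bits required
            return tpi, bits
    raise ValueError(f"{needed}-bit input exceeds the largest kernel")


if __name__ == "__main__":
    print(select_kernel((2**499 - 1) // 20959))
    print(select_kernel(2**997 - 1))
```

Under that assumption the two benchmark inputs used in this thread land on the 512-bit and 1024-bit kernels, which matches the `cgbn_params_t<4U, 512U>` and `cgbn_params_t<8U, 1024U>` instantiations in the build log.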

 paulunderwood 2021-09-05 02:24

[QUOTE=chris2be8;587267]
So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK.
[/QUOTE]

Use YaST to search for the dev file of libstdc++ and install it (and its dependencies), and then link with -lstdc++

 SethTro 2021-09-05 03:28

[QUOTE=chris2be8;587267]
This is after several attempts to run make, so hopefully only the relevant messages.

But I've got an older version of ecm working on the GPU (at last!), so I'll leave it for now.[/QUOTE]

This is an easy fix, you are on the home stretch!

I had committed a change that depends on [url]https://github.com/NVlabs/CGBN/pull/17[/url] being accepted. I've now committed a change reverting that to three cgbn_set calls for now. After you `git pull` everything should build!

Alternatively you can replace your CGBN directory with this one: `git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git`

 SethTro 2021-09-05 03:40

[QUOTE=frmky;587274]Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of
[CODE] typedef cgbn_params_t<4, 256> cgbn_params_256;
typedef cgbn_params_t<4, 512> cgbn_params_512;
typedef cgbn_params_t<8, 768> cgbn_params_768;
typedef cgbn_params_t<8, 1024> cgbn_params_1024;
.........
typedef cgbn_params_t<32, 32768> cgbn_params_32768;
[/CODE]
and it took a little over an hour to compile for sm_70.[/QUOTE]

It doesn't reduce runtime, but it does make it faster for me to test things and slightly reduces register pressure.

 chris2be8 2021-09-05 05:35

[QUOTE=SethTro;587290]Alternatively you can replace your CGBN directory with this one: `git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git`[/QUOTE]

That fails:
[code]
chris@4core:~> git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git
Cloning into 'CGBN'...
The authenticity of host 'github.com (140.82.121.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'github.com,140.82.121.4' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
[/code]

And 'git pull' does nothing:
[code]
chris@4core:~/CGBN> git pull
[/code]

Unless I'm not using it correctly.

 SethTro 2021-09-05 07:01

[QUOTE=chris2be8;587296]That fails:

And 'git pull' does nothing:
[code]
chris@4core:~/CGBN> git pull
[/code]
Unless I'm not using it correctly.[/QUOTE]

Ignore this, but for completeness' sake you can probably clone my copy of CGBN with `git clone -b cgbn_swap https://github.com/sethtroisi/CGBN.git`

The top entry from `git log` should be

[CODE]
commit 1595e543801bcbffd2c36cbf978baff843c09876 (HEAD -> gpu_integration, origin/gpu_integration)
Date: Sat Sep 4 20:26:30 2021 -0700

reverted the cgbn_swap change till that is accepted

[/CODE]
If so, you should be able to build. If it's not, try `git fetch` and then `git pull origin gpu_integration`.

 chris2be8 2021-09-05 15:44

I'm still stuck. I re-downloaded everything from scratch and re-ran autoreconf -si, ./configure and make. But make still fails:
[code]
...
libtool: link: ( cd ".libs" && rm -f "libecm.la" && ln -s "../libecm.la" "libecm.la" )
/bin/sh ./libtool --tag=CC --mode=link gcc-9 -g -I/usr/local/cuda/include -g -O2 -DWITH_GPU -R /usr/local/cuda/lib64 -o ecm ecm-auxi.o ecm-b1_ainc.o ecm-candi.o ecm-eval.o ecm-main.o ecm-resume.o ecm-addlaws.o ecm-torsions.o ecm-getprime_r.o aprtcle/ecm-mpz_aprcl.o ecm-memusage.o libecm.la -lgmp -lrt -lm -lm -lm -lm -lm
libtool: link: gcc-9 -g -I/usr/local/cuda/include -g -O2 -DWITH_GPU -o ecm ecm-auxi.o ecm-b1_ainc.o ecm-candi.o ecm-eval.o ecm-main.o ecm-resume.o ecm-addlaws.o ecm-torsions.o ecm-getprime_r.o aprtcle/ecm-mpz_aprcl.o ecm-memusage.o ./.libs/libecm.a -L/usr/local/cuda/lib64 -lcudart -lgmp -lrt -lm -Wl,-rpath -Wl,/usr/local/cuda/lib64
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o): in function `cgbn_ecm_stage1':
tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text+0x8b3): undefined reference to `operator delete(void*)'
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text+0x196e): undefined reference to `operator delete(void*)'
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o): in function `void std::vector<unsigned int, std::allocator<unsigned int> >::_M_realloc_insert<unsigned int>(__gnu_cxx::__normal_iterator<unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> > >, unsigned int&&)':
tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text._ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_[_ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_]+0x50): undefined reference to `operator new(unsigned long)'
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: tmpxft_00007e39_00000000-6_cgbn_stage1.cudafe1.cpp:(.text._ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_[_ZNSt6vectorIjSaIjEE17_M_realloc_insertIJjEEEvN9__gnu_cxx17__normal_iteratorIPjS1_EEDpOT_]+0xc8): undefined reference to `operator delete(void*)'
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: ./.libs/libecm.a(cgbn_stage1.o):(.data.rel.local.DW.ref.__gxx_personality_v0[DW.ref.__gxx_personality_v0]+0x0): undefined reference to `__gxx_personality_v0'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:973: ecm] Error 1
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make[1]: *** [Makefile:1903: all-recursive] Error 1
make[1]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make: *** [Makefile:783: all] Error 2
[/code]

Any ideas?

 paulunderwood 2021-09-05 15:59

Did you install with YaST the dev package of libstdc++?

 chris2be8 2021-09-05 18:38

Success!

The vital bit of info came from putting "__gxx_personality_v0" into duckduckgo. That told me it's provided by libstdc++ which is the g++ runtime. After installing gcc9-g++ and its run time libstdc++6-devel-gcc9 everything works.
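Both failures in this thread (the `cannot find -lstdc++` from conftest and the undefined `operator new` / `__gxx_personality_v0` references at link time) point at the same missing C++ runtime. A small sketch of how that could be spotted mechanically from saved linker output follows; `diagnose_link_errors` and the symbol list are my own illustration, not part of configure or libtool.

```python
import re

# Symbols that, when undefined, usually mean the C++ runtime was not linked.
CXX_RUNTIME_SYMBOLS = ("operator new", "operator delete", "__gxx_personality_v0")


def diagnose_link_errors(ld_output):
    """Summarise missing libraries and undefined symbols in linker output."""
    missing_libs = sorted(set(re.findall(r"cannot find -l([\w+.-]+)", ld_output)))
    undefined = sorted(set(re.findall(r"undefined reference to `([^']+)'", ld_output)))
    needs_cxx_runtime = "stdc++" in missing_libs or any(
        sym.startswith(CXX_RUNTIME_SYMBOLS) for sym in undefined
    )
    return {
        "missing_libs": missing_libs,
        "undefined": undefined,
        "needs_cxx_runtime": needs_cxx_runtime,
    }


if __name__ == "__main__":
    sample = "ld: cannot find -lstdc++\ncollect2: error: ld returned 1 exit status\n"
    print(diagnose_link_errors(sample))
```

When `needs_cxx_runtime` comes back true, installing the distribution's g++ and libstdc++ development packages (gcc9-g++ and libstdc++6-devel-gcc9 in this case) is the likely fix.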

This has been an educational experience. Next step is to benchmark cgbn on my GPU.

 chris2be8 2021-09-06 16:02

Benchmark results:
[code]
chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000 0;date
Sun 5 Sep 19:42:42 BST 2021
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves)
GPU: Using device code targeted for architecture compile_52
GPU: Ptx version is 52
GPU: numRegsPerThread = 31 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 112x1x1 (3584 parallel curves)
Computing 3584 Step 1 took 190ms of CPU time / 20427ms of GPU time
Sun 5 Sep 19:43:03 BST 2021

chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000 0;date
Sun 5 Sep 19:43:29 BST 2021
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:4583 (3584 curves)
GPU: Using device code targeted for architecture compile_52
GPU: Ptx version is 52
GPU: numRegsPerThread = 93 sharedMemPerBlock = 0 bytes
Computing 3584 Step 1 took 30ms of CPU time / 3644ms of GPU time
Sun 5 Sep 19:43:33 BST 2021

chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000 0;date
Sun 5 Sep 19:44:25 BST 2021
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1831 (832 curves)
GPU: Using device code targeted for architecture compile_52
GPU: Ptx version is 52
GPU: numRegsPerThread = 31 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves)
Computing 832 Step 1 took 188ms of CPU time / 4552ms of GPU time
Sun 5 Sep 19:44:30 BST 2021

chris@4core:~/ecm-cgbn/gmp-ecm> date;echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0;date
Sun 5 Sep 19:44:41 BST 2021
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1831 (832 curves)
GPU: Using device code targeted for architecture compile_52
GPU: Ptx version is 52
GPU: numRegsPerThread = 93 sharedMemPerBlock = 0 bytes
Computing 832 Step 1 took 8ms of CPU time / 1995ms of GPU time
Sun 5 Sep 19:44:44 BST 2021
[/code]

So about 5 times faster for (2^499-1)/20959 and about twice as fast for 2^997-1. But these are all small cases.
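For reference, the ratios can be computed straight from the GPU times in the log above. This is just arithmetic on those four numbers; the dictionary layout is my own.

```python
# GPU time in ms for Step 1, taken from the four runs above.
TIMINGS_MS = {
    "(2^499-1)/20959": {"old": 20427, "cgbn": 3644},
    "2^997-1": {"old": 4552, "cgbn": 1995},
}


def speedup(old_ms, new_ms):
    """How many times faster the new run is than the old one."""
    return old_ms / new_ms


if __name__ == "__main__":
    for name, t in TIMINGS_MS.items():
        print(f"{name}: {speedup(t['old'], t['cgbn']):.1f}x")
```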

But my overall throughput won't increase much because my CPU can't do stage 2 as fast as the GPU can do stage 1 now. But that's not your fault. And any speedup is nice. Thanks.

Other lessons learnt:
autoreconf -si creates symlinks to missing files while autoreconf -i copies them. Using -si saves space, but if you upgrade to a new version of automake you can end up with dangling symlinks:
[code]
lrwxrwxrwx 1 chris users 32 Nov 12 2015 INSTALL -> /usr/share/automake-1.13/INSTALL
lrwxrwxrwx 1 chris users 35 Nov 12 2015 ltmain.sh -> /usr/share/libtool/config/ltmain.sh
[/code]
They needed updating to:
[code]
lrwxrwxrwx 1 chris users 32 Sep 4 19:20 INSTALL -> /usr/share/automake-1.15/INSTALL
lrwxrwxrwx 1 chris users 38 Sep 4 19:20 ltmain.sh -> /usr/share/libtool/build-aux/ltmain.sh
[/code]
Not a common issue though.
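If anyone wants to check a source tree for this, a few lines are enough to list symlinks whose targets no longer exist. A minimal sketch, with `dangling_symlinks` being a name I made up:

```python
import os


def dangling_symlinks(directory):
    """Return (name, target) for every symlink in directory whose target is missing."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() looks at the link itself; exists() follows it to the target.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append((name, os.readlink(path)))
    return broken


if __name__ == "__main__":
    for name, target in dangling_symlinks("."):
        print(f"{name} -> {target} (missing)")
```

Re-running `autoreconf -si` (or `-i`) after an automake upgrade should recreate anything this reports.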

And suggestions for the install process:
INSTALL-ecm should tell users to run autoreconf -i (or -si) before running ./configure (which is created by autoreconf -i).

./configure compiles several small programs and runs them to check things. If the compile fails it should put out a message saying the compile failed, not one saying it found different levels of run time library etc. If the compile normally produces no output then letting any output it does produce go to the screen would be informative (eg when it can't find -lstdc++).

Chris

 SethTro 2021-09-07 06:25

[QUOTE=chris2be8;587337]Success![/QUOTE]

I'm glad we finally got here!

2.2x speedup for the 1024 bit case is almost exactly what everyone else is seeing (except bsquared, maybe because of a newer card?).

You can often improve overall throughput by adjusting to 1.2*B1 and 1/2*B2 (and checking that the expected number of curves stays roughly the same). This can especially help if Stage 1 time < Stage 2 time / cores.
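As a rough illustration of that rule of thumb, the sketch below checks the "Stage 1 time < Stage 2 time / cores" condition and applies the 1.2*B1, 1/2*B2 adjustment. The function names are hypothetical, it assumes Stage 2 spreads evenly over the CPU cores, and the example numbers are the CGBN run from the first post of this thread.

```python
def cpu_is_bottleneck(stage1_gpu_ms, stage2_cpu_ms, cores):
    """True when the GPU finishes Stage 1 faster than the cores can finish Stage 2."""
    return stage1_gpu_ms < stage2_cpu_ms / cores


def rebalance(b1, b2):
    """Shift work toward the GPU: 1.2 * B1 and B2 / 2, in integer arithmetic."""
    return b1 * 6 // 5, b2 // 2


if __name__ == "__main__":
    # 3584 curves: 1019 ms of GPU time for Step 1, 72142 ms of CPU time for Step 2.
    if cpu_is_bottleneck(1019, 72142, cores=4):
        print("suggested bounds:", rebalance(20000, 3804582))
```

The new bounds still need to be checked with ecm itself to confirm the expected curve count has not moved much.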

I'll reflect on your notes and see if I can improve the documentation / configure script.

 chris2be8 2021-09-07 15:41

[QUOTE=SethTro;587429]
I'll reflect on your notes and see if I can improve the documentation / configure script.[/QUOTE]

How about updating INSTALL-ecm like this:
[code]
diff -u INSTALL-ecm INSTALL-ecm.new
--- INSTALL-ecm 2021-09-05 12:13:55.613439408 +0100
+++ INSTALL-ecm.new 2021-09-07 16:37:42.903291304 +0100
@@ -19,6 +19,7 @@

+ $ autoreconf -i
$ ./configure

The configure script accepts several options (see ./configure --help).
[/code]

That's a minimum change to get new users started.

 WraithX 2021-09-07 18:08

[QUOTE=chris2be8;587449]How about updating INSTALL-ecm like this:
[code]
diff -u INSTALL-ecm INSTALL-ecm.new
--- INSTALL-ecm 2021-09-05 12:13:55.613439408 +0100
+++ INSTALL-ecm.new 2021-09-07 16:37:42.903291304 +0100
@@ -19,6 +19,7 @@

+ $ autoreconf -i
$ ./configure

The configure script accepts several options (see ./configure --help).
[/code]

That's a minimum change to get new users started.[/QUOTE]

That document describes what users should do when they have downloaded an official release. When building an official release, you do not need to run [C]autoreconf -i[/C]. You only need to run [C]autoreconf -i[/C] when you download a development version with git or svn. I don't think adding [C]autoreconf -i[/C] to this document is a good idea.

Looking at the various documents, I see that [C]README.dev[/C] has the advice of running [C]autoreconf -i[/C].

 chris2be8 2021-09-08 15:30

How about having INSTALL-ecm tell users to run [c]autoreconf -i[/c] if they don't have a ./configure in the directory?

And if people get an official release would the files that would be created by autoreconf -i be correct for their OS etc?

 EdH 2021-09-08 15:41

@Chris: Did you get your sm_30 card working or just the higher arch one?

 chris2be8 2021-09-09 15:40

Just the higher arch one (sm_52). Sorry.

PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.

 SethTro 2021-09-09 17:03

[QUOTE=chris2be8;587572]Just the higher arch one (sm_52). Sorry.

PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.[/QUOTE]

Yes! In cgbn_stage1.cu search for this line
/* NOTE: Custom kernel changes here

You can either add a new kernel or, as I recommend, just change `cgbn_params_512`:

- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie

The absolute limit is 32,768 bits. I found that GPU/CPU performance decreases 3x from 1,024 bits to 16,384 bits, then an additional 2x above 16,384. That is still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile.
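To translate kernel sizes into the decimal digit counts usually quoted for ECM inputs, the conversion is just a logarithm. The sketch below gives an upper bound only; it ignores any headroom bits the kernel may reserve, which is an assumption on my part.

```python
import math


def max_decimal_digits(kernel_bits):
    """Number of decimal digits of the largest value below 2**kernel_bits."""
    # 2**bits is never a power of 10, so floor(bits * log10(2)) + 1 is exact here.
    return math.floor(kernel_bits * math.log10(2)) + 1


if __name__ == "__main__":
    for bits in (512, 1024, 32768):
        print(f"{bits:>6} bits -> up to {max_decimal_digits(bits)} digits")
```

So a 512-bit kernel tops out at 155 digits, a 1024-bit kernel at 309 digits, and the 32,768-bit limit corresponds to 9865 digits.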

 SethTro 2021-09-10 04:01

I spent most of today working on new optimal bounds. It can be a [URL="https://www.mersenneforum.org/showpost.php?p=587617&postcount=22"]large speedup[/URL] to use these instead of the traditionally optimal B1 bounds. ecm can confirm they represent a full t<X> while taking substantially less time when accounting for the GPU speedup.

Full table at [url]https://github.com/sethtroisi/misc-scripts/tree/main/ecm_gpu_optimizer[/url] and an excerpt below

[CODE]
GPU speedup/CPU cores  digits   optimal B1         optimal B2  B2/B1 ratio  expected curves
Fast GPU + 4 cores
40/4                       35    2,567,367        264,075,603          103              809
40/4                       40    8,351,462      1,459,547,807          175             1760
40/4                       45   38,803,644     17,323,036,685          446             2481
40/4                       50   79,534,840     58,654,664,284          737             7269
40/4                       55  113,502,213     96,313,119,323          849            29883
40/4                       60  322,667,450    395,167,622,450         1225            56664
Fast GPU + 8 cores
40/8                       35    1,559,844        351,804,250          226             1038
40/8                       40    6,467,580      2,889,567,750          447             1843
40/8                       45   29,448,837     35,181,170,876         1195             2599
40/8                       50   40,201,280     58,928,323,592         1466            11993
40/8                       55  136,135,593    289,565,678,027         2127            20547
40/8                       60  479,960,096  3,226,409,839,042         6722            30014
[/CODE]
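The B2/B1 column in the excerpt is simply the rounded quotient of the two bounds, which is easy to double-check. The sketch below does that for the 40/4 rows; it is only a consistency check on the table, not the optimizer from the linked repository.

```python
# (digits, B1, B2, listed B2/B1 ratio) for the "Fast GPU + 4 cores" rows above.
ROWS_40_4 = [
    (35, 2_567_367, 264_075_603, 103),
    (40, 8_351_462, 1_459_547_807, 175),
    (45, 38_803_644, 17_323_036_685, 446),
    (50, 79_534_840, 58_654_664_284, 737),
    (55, 113_502_213, 96_313_119_323, 849),
    (60, 322_667_450, 395_167_622_450, 1225),
]


def b2_b1_ratio(b1, b2):
    """B2/B1 rounded to the nearest integer, as shown in the table."""
    return round(b2 / b1)


if __name__ == "__main__":
    for digits, b1, b2, listed in ROWS_40_4:
        status = "ok" if b2_b1_ratio(b1, b2) == listed else "MISMATCH"
        print(f"t{digits}: B2/B1 = {b2_b1_ratio(b1, b2)} ({status})")
```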

 SethTro 2021-09-10 09:46

[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE]

Two late night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55
It's a nice debug check while this is still in development but it has never tripped so it's overly cautious especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full output from -v (especially the lines that start with "GPU: ")?

 bsquared 2021-09-13 14:11

[QUOTE=SethTro;587628]Two late night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55
It's a nice debug check while this is still in development but it has never tripped so it's overly cautious especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full output from -v (especially the lines that start with "GPU: ")?[/QUOTE]

Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED 1, 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set gpucurves 5120 then the no_verify version is 15% faster, but still slower than -gpucurves 2560.

It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152 maxThreadsPerBlock = 1024 maxRegsPerBlock = 65536)

 SethTro 2021-09-21 08:35

[QUOTE=bsquared;587026]1280: (~31 ms/curves)
2560: (~21 ms/curves)
640: (~63 ms/curves)
1792: (~36 ms/curves)

So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))

With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.[/QUOTE]

I was confused when you saw only moderate gains so I rented a V100 (V100-SXM2-16GB) on AWS today.
I'm seeing the new code be 3.1x faster which is similar to the 2-3x improvement I've seen on a 1080ti, 970, and K80.

[CODE]
$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)

$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)
[/CODE]
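The throughput lines follow directly from the "Computing ... Step 1" lines: curves divided by GPU seconds. A small sketch that recomputes them from the log text, assuming the line format shown above stays as it is (`parse_step1` is a made-up helper):

```python
import re

STEP1_RE = re.compile(
    r"Computing (\d+) Step 1 took (\d+)ms of CPU time / (\d+)ms of GPU time"
)


def parse_step1(line):
    """Return (curves, gpu_ms) from a 'Computing N Step 1 took ...' log line."""
    m = STEP1_RE.search(line)
    if m is None:
        raise ValueError("not a Step 1 timing line")
    return int(m.group(1)), int(m.group(3))


def curves_per_second(curves, gpu_ms):
    return curves / (gpu_ms / 1000.0)


if __name__ == "__main__":
    cgbn = parse_step1("Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time")
    old = parse_step1("Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time")
    print(f"cgbn: {curves_per_second(*cgbn):.3f} curves/s")
    print(f"old:  {curves_per_second(*old):.3f} curves/s")
    print(f"ratio of GPU times: {old[1] / cgbn[1]:.2f}")
```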

 unconnected 2021-10-22 13:50

Hello, I've got an error while trying to run curves with B1=11e7:

[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this some sort of CGBN limitation?

 SethTro 2021-10-22 18:50

It's to prevent GPU memory issues, so it can be ignored (unless you run with a very large number).
It's on my to-do list to remove but I'm sadly without internet today.
You can remove the assert and everything will be fine.
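For a sense of why B1=11e7 trips a 100,000,000-bit check: if the Stage 1 multiplier s is the product of the largest prime powers up to B1 (the usual ECM choice, and what the function name suggests, but an assumption here), then s = lcm(1..B1) and its size is roughly B1 / ln 2 bits. A sketch comparing that estimate with an exact count for small B1:

```python
import math
from functools import reduce


def estimated_s_bits(b1):
    """Approximate bit length of lcm(1..B1), using log(lcm(1..B1)) ~ B1."""
    return b1 / math.log(2)


def exact_s_bits(b1):
    """Exact bit length of lcm(1..B1); only practical for small B1."""
    s = reduce(lambda a, b: a * b // math.gcd(a, b), range(1, b1 + 1), 1)
    return s.bit_length()


if __name__ == "__main__":
    print("exact bits for B1=20000:   ", exact_s_bits(20000))
    print("estimate for B1=20000:     ", round(estimated_s_bits(20000)))
    print("estimate for B1=110000000: ", round(estimated_s_bits(110_000_000)))
```

Under that assumption the estimate for B1=11e7 is about 1.59e8 bits, and the 1e8 limit is reached somewhere around B1 = 6.9e7.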

 SethTro 2021-10-24 08:47

[QUOTE=unconnected;591358]Hello, I've got an error while trying to run curves with B1=11e7:

[CODE]ecm-cgbn: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.[/CODE] Is this some sort of CGBN limitation?[/QUOTE]

I just merged [URL]https://gitlab.inria.fr/zimmerma/ecm/-/merge_requests/27[/URL], which contains a fix for the B1 limit along with a number of quality-of-life improvements: multiple kernels included by default (512 and 1024), estimated timing, better overflow detection, and faster compilation.

 unconnected 2021-10-24 13:35

[B]SethTro[/B], thanks for the explanation and improvements!

All times are UTC.