#1 |
"Seth"
Apr 2019
479 Posts |
Resurrecting this thread. If anyone is running numbers smaller than C155 they should reach out to me.
(Moderator note: Referenced thread is here)
My new CGBN-enabled code is something like 7x faster:
Code:
    $ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
    Input number is (2^499-1)/20959 (146 digits)
    Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
    Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
    Computing 3584 Step 2 on CPU took 71933ms

    $ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
    Input number is (2^499-1)/20959 (146 digits)
    Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
    Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
    Computing 3584 Step 2 on CPU took 72142ms
Code:
    $ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
    Input number is (2^997-1) (301 digits)
    Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
    Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
    Computing 1792 Step 2 on CPU took 83417ms

    $ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
    Input number is (2^997-1) (301 digits)
    Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
    Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
    Computing 1792 Step 2 on CPU took 83521ms
Last fiddled with by bsquared on 2021-08-27 at 19:13
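As a rough check of the "something like 7x" figure, the Step 1 GPU times in the logs above can simply be divided. This is only a sketch: the helper name `speedup` is made up for illustration, and the millisecond values are copied from the four runs quoted in this post.

```cpp
// Ratio of old to new Step 1 GPU time. Illustrative helper, not part of GMP-ECM.
constexpr double speedup(double old_ms, double new_ms) {
    return old_ms / new_ms;
}

// GPU times in milliseconds, copied from the logs in this post.
constexpr double c146 = speedup(7258.0, 1019.0);  // (2^499-1)/20959, 3584 curves
constexpr double c301 = speedup(3810.0, 1588.0);  // 2^997-1, 1792 curves
```

That comes to roughly 7.1x for the 146-digit input and roughly 2.4x for the 301-digit one. Step 2 still runs on the CPU and takes about the same time with either build.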
#2 |
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·2,011 Posts |
This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written. I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).
Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyyzy. Does the Windows Visual Studio compilation work for this? I would either need that or CUDA working under WSL2.
#3 | |
"Ben"
Feb 2007
3,733 Posts |
Quote:
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, and trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for Linux?

Also, are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It speeds up the process as a whole if the stage 2 runs can also be done in parallel, even though each curve is then slightly less likely to find a factor:
Code:
    ./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16
    ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1
    ecm: process took 5.5385 seconds.
Sorry, I saw that henryzz already brought this up.
Last fiddled with by bsquared on 2021-08-25 at 13:11
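For a sense of scale, the yafu log line reports B2=100*B1, so at B1=20000 its stage 2 stops well short of the B2=3804582 printed in the GMP-ECM logs earlier in the thread. The sketch below only restates that arithmetic; the names `yafu_b2` and `gmp_ecm_b2_at_20000` are invented here, and the smaller bound is just one plausible contributor to the lower per-curve success rate mentioned above.

```cpp
#include <cstdint>

// B2 as reported by the yafu log line "B2=100*B1". Illustrative helper.
constexpr std::uint64_t yafu_b2(std::uint64_t b1) {
    return 100 * b1;
}

// B2 that GMP-ECM printed for B1=20000 in the logs earlier in the thread.
constexpr std::uint64_t gmp_ecm_b2_at_20000 = 3804582;
```

With these values `yafu_b2(20000)` is 2000000, a little over half of the GMP-ECM bound.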
#4 |
Sep 2009
2²·607 Posts |
I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:
Code:
    Download gmp-ecm-gpu_integration.zip
    unzip gmp-ecm-gpu_integration.zip
    cd gmp-ecm-gpu_integration
    autoreconf -si
    ./configure --enable-gpu=30 # The previous version needed --enable-gpu=sm30. The following messages look relevant:
    -snip-
    checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
    -snip-
    configure: with_cgbn: , , "
    -snip-
    make # This fails with messages about CGBN being missing.
Also, which makes of GPU support it? Mine may be too old.

Chris

PS I use ECM set to do 512 bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024 bit version. So you may be *only* getting a factor of 3 over all ranges. But that's still very nice.
Last fiddled with by chris2be8 on 2021-08-25 at 15:43
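To put the PS in numbers: suppose the 7258ms baseline from the first post came from a 1024 bit build (an assumption; the thread does not say) and a 512 bit build is about 3 times faster. The sketch below rescales the baseline accordingly. The function name `adjusted_speedup` is hypothetical.

```cpp
// Speedup of the new code over a baseline that is itself `baseline_gain`
// times faster than the one that was measured. Illustrative helper only.
constexpr double adjusted_speedup(double measured_old_ms, double baseline_gain,
                                  double new_ms) {
    return (measured_old_ms / baseline_gain) / new_ms;
}

// 7258ms and 1019ms are the Step 1 GPU times from the first post;
// 3.0 is the "about 3 times faster" estimate from the PS.
constexpr double vs_512_bit = adjusted_speedup(7258.0, 3.0, 1019.0);
```

Under those assumptions the gain for the 146-digit input is about 2.4x, close to the ratio measured for 2^997-1 (3810ms against 1588ms).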
#5 |
"Ben"
Feb 2007
3,733 Posts |
Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for an sm_70 card. But now I'm getting a bunch of these errors:
Code:
ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4 |
#6 | |
"Ben"
Feb 2007
7225₈ Posts |
Quote:
On line 10 of cudakernel_default.cu, replace this line:
Code:
    while(__any(cy[threadIdx.x])!=0)
with this one:
Code:
    while(__any_sync(__activemask(),cy[threadIdx.x])!=0)
Code:
    % echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
    GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is (2^997-1) (301 digits)
    Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
    GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
    Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
    Computing 5120 Step 2 on CPU took 247281ms
Last fiddled with by bsquared on 2021-08-25 at 17:52
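For readers wondering what the vote in that loop is for, here is a host-side C++ emulation, not the real kernel. It assumes `cy` holds one pending carry per thread and that the loop keeps passing carries to the next limb until no thread has one left; the thread does not show the surrounding kernel code, so that reading is a guess. `Lanes`, `any_lane` and `propagate` are invented names, and `any_lane` stands in for `__any_sync` across a tiny four-lane "warp".

```cpp
#include <cstdint>

// Host-side stand-in for a tiny "warp": each lane owns one 32-bit limb and
// the carry it still has to hand to the next lane. All names are illustrative.
constexpr int kLanes = 4;
struct Lanes { std::uint32_t v[kLanes]; };

// Plays the role of the warp vote: true if any lane still holds a carry.
constexpr bool any_lane(const Lanes& cy) {
    for (int i = 0; i < kLanes; ++i)
        if (cy.v[i] != 0) return true;
    return false;
}

// Repeats the carry step until the vote reports that no lane has a carry left.
constexpr Lanes propagate(Lanes limb, Lanes cy) {
    while (any_lane(cy)) {
        // Walk from the top lane down so each lane reads its lower neighbour's
        // carry from the previous round before that slot is overwritten.
        for (int i = kLanes - 1; i >= 0; --i) {
            const std::uint32_t in = (i == 0) ? 0u : cy.v[i - 1];
            const std::uint64_t sum = std::uint64_t(limb.v[i]) + in;
            limb.v[i] = static_cast<std::uint32_t>(sum);      // low 32 bits stay
            cy.v[i] = static_cast<std::uint32_t>(sum >> 32);  // overflow moves up
        }
        // A carry out of the top lane is dropped in this sketch.
    }
    return limb;
}
```

On the GPU every thread in the warp has to agree on when to leave the loop, which is why the exit test is a vote rather than a per-thread check. On sm_70 and newer the vote must name the participating threads explicitly, hence the `_sync` form with a mask.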
#7 | ||
"Seth"
Apr 2019
111011111₂ Posts |
Quote:
This is what I use:
Code:
    autoreconf -i
    ./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
    make -j8
Quote:
then added `-with-cgbn-include=/home/five/Projects/CGBN/include/cgbn` to the list of options I pass to `./configure`
Last fiddled with by SethTro on 2021-08-25 at 18:08
#8 |
"Ben"
Feb 2007
3,733 Posts |
I got CGBN set up and configured your code with the proper cgbn include path.
I needed to make the following changes to get it to compile:
Code:
    Lines 510 and 587, replace 100'000'000 with 1000000000

    Around line 640:
    //const std::vector<uint32_t> available_kernels = { 512, 1024 };
    uint32_t available_kernels[2] = { 512, 1024 };
    uint32_t num_available_kernels = 2;

    and then in the following loop:
    //for (kernel_bits : available_kernels) {
    for (i=0; i<num_available_kernels; i++) {
        kernel_bits = available_kernels[i];
Code:
    echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
    GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is ((2^499-1)/20959) (146 digits)
    Using B1=20000, B2=0, sigma=3:1000-3:6119 (5120 curves)
    CUDA error occurred: cannot set while device is active in this process
    While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 591)
[edit] Just commenting out line 591 makes it work. CPU usage does go up during GPU execution, though.

Note the factor of 10 increase in B1 compared to before. Speedup is about 8x! Awesome!
Code:
    % echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
    GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is ((2^499-1)/20959) (146 digits)
    Using B1=200000, B2=0, sigma=3:1000-3:6119 (5120 curves)
    Computing 5120 Step 1 took 3587ms of CPU time / 6088ms of GPU time
Last fiddled with by bsquared on 2021-08-25 at 18:30 Reason: new results
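A compact sketch of the two compile workarounds, written without digit separators (a C++14 feature) and with an index loop over a plain array. What the real loop does with `kernel_bits` is not shown in the thread, so the "first kernel that is wide enough" rule and the name `pick_kernel_bits` are assumptions for illustration. `constexpr` is used here only so the behaviour can be checked at compile time; it can be dropped for an older compiler.

```cpp
#include <cstdint>

// Written without the separator: 100'000'000 is nine digits, i.e. 10^8.
constexpr std::uint32_t kPlainLiteral = 100000000;

// Plain array plus explicit count, as in the workaround above.
constexpr std::uint32_t available_kernels[2] = { 512, 1024 };
constexpr std::uint32_t num_available_kernels = 2;

// Returns the first listed kernel width that is at least `needed_bits`,
// or 0 if none is wide enough. The selection rule is an assumption.
constexpr std::uint32_t pick_kernel_bits(std::uint32_t needed_bits) {
    for (std::uint32_t i = 0; i < num_available_kernels; i++) {
        const std::uint32_t kernel_bits = available_kernels[i];
        if (kernel_bits >= needed_bits) return kernel_bits;
    }
    return 0;
}
```

With this rule a 997-bit input such as 2^997-1 would land on the 1024-bit kernel. A real selection would probably need some headroom above the input size, which this sketch leaves out.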
#9 |
"Seth"
Apr 2019
737₈ Posts |
I committed the first tweak you made, so you can 'git pull' (you might need `git fetch` followed by `git reset --hard origin/gpu_integration`).
I'm not sure why you get the error for
Code:
    CUDA_CHECK(cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out):
Code:
    - errCheck (cudaSetDeviceFlags (cudaDeviceScheduleYield));
    + errCheck (cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));
Last fiddled with by SethTro on 2021-08-25 at 18:47
#10 |
Sep 2009
100101111100₂ Posts |
I got that error too:
Code:
    $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
    GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is ((2^499-1)/20959) (146 digits)
    Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
    CUDA error occurred: cannot set while device is active in this process
    While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601)
After commenting out line 601 it works:
Code:
    $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
    GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is ((2^499-1)/20959) (146 digits)
    Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
    Running GPU kernel<24,128> ...
    Copying results back to CPU ...
    Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time
Code:
    $ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
    GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
    Input number is ((2^499-1)/20959) (146 digits)
    Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves)
    Running GPU kernel<24,128> ...
    CUDA error occurred: the launch timed out and was terminated
    While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 632)
This GPU has CUDA arch 3.0. Is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2; I'll try installing on that next.

NB. msieve says what CUDA arch the card is when used for GPU-based poly selection. I'm using that to check what the cards really support.

Chris

PS. Should discussion of ecm with CGBN be split off into another thread?
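A small aid for reading the "sigma=3:1000-3:1383 (384 curves)" lines: judging from the logs in this thread, each curve appears to get the next consecutive sigma after the starting value passed with -sigma. The helper name `last_sigma` is made up, and the one-sigma-per-curve reading is inferred from the logs rather than from the source code.

```cpp
#include <cstdint>

// Last sigma in a run that starts at `first_sigma` and gives each curve the
// next consecutive value. Illustrative helper, inferred from the log output.
constexpr std::uint32_t last_sigma(std::uint32_t first_sigma, std::uint32_t curves) {
    return first_sigma + curves - 1;
}
```

This reproduces the upper ends printed in the thread for 384, 1792, 3584 and 5120 curves starting at sigma 1000.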
#11 | |
"Ben"
Feb 2007
3,733 Posts |
Quote:
(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890). |