mersenneforum.org Faster GPU-ECM with CGBN

2021-08-24, 23:35  #1
SethTro
"Seth" Apr 2019
479 Posts

Faster GPU-ECM with CGBN

Resurrecting this thread. If anyone is running numbers smaller than C155, they should reach out to me. (Moderator note: the referenced thread is here.)

My new CGBN-enabled code is something like 7x faster:

Code:
$ echo "(2^499-1)/20959" | ./ecm -gpu -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 93ms of CPU time / 7258ms of GPU time
Computing 3584 Step 2 on CPU took 71933ms

$ echo "(2^499-1)/20959" | ./ecm -gpu -cgbn -gpucurves 3584 -sigma 3:1000 20000
Input number is (2^499-1)/20959 (146 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:4583 (3584 curves)
Computing 3584 Step 1 took 15ms of CPU time / 1019ms of GPU time
Computing 3584 Step 2 on CPU took 72142ms

For numbers smaller than C300 it's generally 2-3x faster:

Code:
$ echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 91ms of CPU time / 3810ms of GPU time
Computing 1792 Step 2 on CPU took 83417ms

$ echo "(2^997-1)" | ./ecm -gpu -cgbn -sigma 3:1000 20000
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:2791 (1792 curves)
Computing 1792 Step 1 took 15ms of CPU time / 1588ms of GPU time
Computing 1792 Step 2 on CPU took 83521ms

I'm actively working on the code at https://github.com/sethtroisi/gmp-ec...pu_integration if you are a developer, and I could possibly distribute Linux binaries if we had a place to store them.

Last fiddled with by bsquared on 2021-08-27 at 19:13

2021-08-25, 11:59  #2
henryzz
Just call me Henry
"David" Sep 2007
Liverpool (GMT/BST)
3·2,011 Posts

This is an impressive speedup. I assume there is no chance of stage 2 being ported to run on CUDA? GPU memory was too small back when the old GPU code was written.
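A quick sanity check on the numbers in the opening post: the input really is a 146-digit cofactor, and the quoted step-1 GPU times work out to about 7.1x and 2.4x. A short Python sketch (the timings below are simply the milliseconds quoted above):

```python
# Input from the opening post: (2^499-1)/20959 should be a 146-digit integer.
m499 = 2**499 - 1
assert m499 % 20959 == 0
cofactor = m499 // 20959
print(len(str(cofactor)))            # -> 146

# Step-1 speedup implied by the quoted GPU times (ms): -gpu vs -gpu -cgbn.
speedup_c146 = 7258 / 1019           # (2^499-1)/20959, 3584 curves
speedup_c301 = 3810 / 1588           # 2^997-1, 1792 curves
print(round(speedup_c146, 1))        # -> 7.1
print(round(speedup_c301, 1))        # -> 2.4
```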
I suspect that may still be the case when running many curves in parallel (maybe different sections of the stage 2 range could be done in parallel instead?).

Binaries for several programs have been hosted on the forum server. I would suggest messaging Xyzzy. Does the Windows Visual Studio compilation work for this? I would either need that or CUDA working under WSL2.

2021-08-25, 13:07  #3
bsquared
"Ben" Feb 2007
3,733 Posts

Quote:
Originally Posted by SethTro
Resurrecting this thread. If anyone is running numbers smaller than C155, they should reach out to me. My new CGBN-enabled code is something like 7x faster -snip-

Very nice!
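henryzz's idea of running different sections of the stage-2 range in parallel can be sketched in a few lines of Python. This is purely illustrative (split_stage2 is a hypothetical helper, not part of GMP-ECM or yafu); each worker would then run its own stage-2 pass over one sub-interval:

```python
def split_stage2(b1, b2, workers):
    """Partition the stage-2 interval (b1, b2] into contiguous sub-ranges,
    one per worker (the last range absorbs any rounding remainder)."""
    step = (b2 - b1) // workers
    cuts = [b1 + i * step for i in range(workers)] + [b2]
    return list(zip(cuts[:-1], cuts[1:]))

# Using the thread's default bounds for B1=20000:
print(split_stage2(20000, 3804582, 4))
```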
I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, and trying to use these tools just throws a bunch of errors about missing files. Any advice on building this for Linux?

Also, are you by any chance looking at implementing a standard continuation for stage 2 on the GPU? It is very helpful for speeding up the process as a whole if these can also be run in parallel, even given that each curve is slightly less likely to find a factor:

Code:
./yafu "ecm(2^997-1,1792)" -B1ecm 20000 -threads 16
ecm: 1792/1792 curves on C301 @ B1=20000, B2=100*B1
ecm: process took 5.5385 seconds.

Sorry, I see that henryzz already brought this up.

Last fiddled with by bsquared on 2021-08-25 at 13:11

2021-08-25, 15:42  #4
chris2be8
Sep 2009
2²·607 Posts

I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version:

Code:
Download gmp-ecm-gpu_integration.zip
unzip gmp-ecm-gpu_integration.zip
cd gmp-ecm-gpu_integration
autoreconf -si
./configure --enable-gpu=30  # The previous version needed --enable-gpu=sm30.
make  # This fails with messages about CGBN being missing.

The following configure messages look relevant:

Code:
-snip-
checking that CUDA Toolkit version and runtime version are the same... (9.1/9.1) yes
-snip-
configure: with_cgbn: , , "
-snip-

What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system? Also, what makes of GPU support it? Mine may be too old.

Chris

PS. I use ECM set to do 512-bit arithmetic for numbers below 2^506. This is about 3 times faster than the 1024-bit version. So you may be *only* getting a factor of 3 over all ranges. But that's still very nice.

Last fiddled with by chris2be8 on 2021-08-25 at 15:43

2021-08-25, 16:19  #5
bsquared
"Ben" Feb 2007
3,733 Posts

Thanks to the autoreconf -si hint, I've progressed a little further and have run through a configure process for a sm_70 card.
But now I'm getting a bunch of these errors:

Code:
ptxas /tmp/tmpxft_00008eaa_00000000-5_cudakernel.ptx, line 2378; error : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4

which probably has nothing to do with your cgbn enhancements. Anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?

2021-08-25, 17:34  #6
bsquared
"Ben" Feb 2007
7225₈ Posts

Quote:
Originally Posted by bsquared
But now I'm getting a bunch of these errors: -snip- Anyone know if gpu-ecm will build and run on an sm_70 card, or have hints on how to proceed?

Fixed it. On line 10 of cudakernel_default.cu replace this line:

Code:
while(__any(cy[threadIdx.x])!=0)

with this one:

Code:
while(__any_sync(__activemask(),cy[threadIdx.x])!=0)

Now I have a working gpu-ecm! Later I'll try to get the cgbn part working.

Code:
% echo "(2^997-1)" | ./ecm -gpu -sigma 3:1000 20000
GMP-ECM 7.0.5-dev [configured with GMP 6.2.0, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is (2^997-1) (301 digits)
Using B1=20000, B2=3804582, sigma=3:1000-3:6119 (5120 curves)
GPU: Block: 32x32x1 Grid: 160x1x1 (5120 parallel curves)
Computing 5120 Step 1 took 485ms of CPU time / 4990ms of GPU time
Computing 5120 Step 2 on CPU took 247281ms

Last fiddled with by bsquared on 2021-08-25 at 17:52

2021-08-25, 17:48  #7
SethTro
"Seth" Apr 2019
111011111₂ Posts

Quote:
Originally Posted by bsquared
Very nice! I haven't been able to build it yet. My knowledge of automake/autoconf is very limited, but trying to use these tools just throws a bunch of errors about missing files.
Any advice on building this for Linux?

There are some instructions in README.dev. This is what I use:

Code:
autoreconf -i
./configure --enable-gpu=61 --with-cuda=/usr/local/cuda CC=gcc-9 --with-cgbn-include=/home/five/Projects/CGBN/include/cgbn
make -j8

Replace --enable-gpu=61 with a different compute capability if you need one (and apply the change from #27 if you need SM_70), but you already seem to have that figured out.

Quote:
Originally Posted by chris2be8
I'm trying to build it on Linux, but have not had much luck. I copied the process I used to build the previous version. -snip- What version of CUDA Toolkit and runtime is needed to support CGBN? And where is cgbn.h on your system?

I ran

Code:
git clone https://github.com/NVlabs/CGBN.git

under /home/five/Projects, then added

Code:
--with-cgbn-include=/home/five/Projects/CGBN/include/cgbn

to the list of options I pass to ./configure.

Last fiddled with by SethTro on 2021-08-25 at 18:08

2021-08-25, 18:24  #8
bsquared
"Ben" Feb 2007
3,733 Posts

I got CGBN set up and configured your code with the proper cgbn include path. I needed to make the following changes to get it to compile:

Code:
Lines 510 and 587: replace 100'000'000 with 100000000

Around line 640:
//const std::vector available_kernels = { 512, 1024 };
uint32_t available_kernels[2] = { 512, 1024 };
uint32_t num_available_kernels = 2;

and then in the following loop:
//for (kernel_bits : available_kernels) {
for (i = 0; i < num_available_kernels; i++) {

2021-08-25, 18:42  #9
SethTro
"Seth" Apr 2019
737₈ Posts

I committed the first tweak you made, so you can 'git pull' (you might need "git fetch" followed by "git reset --hard origin/gpu_integration").

I'm not sure why you get the error for

Code:
CUDA_CHECK(cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));

Google suggests this might happen if I've already started running things, but the code shouldn't have at that point.
You could try replacing line 135 in cudakernel.cu (while leaving line 591 commented out):

Code:
- errCheck (cudaSetDeviceFlags (cudaDeviceScheduleYield));
+ errCheck (cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync));

If anyone else experiences this, let me know so I can try to fix it, but I'm going to ignore it for now.

Last fiddled with by SethTro on 2021-08-25 at 18:47

2021-08-27, 15:46  #10
chris2be8
Sep 2009
100101111100₂ Posts

I got that error too:

Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
CUDA error occurred: cannot set while device is active in this process
While running cudaSetDeviceFlags (cudaDeviceScheduleBlockingSync) (file cgbn_stage1.cu, line 601)

Though the different line number makes me suspect I may have an old version of your code. What command should I run to download the latest version?

After commenting out line 601 it works:

Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 20000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=20000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
Copying results back to CPU ...
Computing 384 Step 1 took 1653ms of CPU time / 2928ms of GPU time

But after raising B1 to 200000 it fails:

Code:
$ echo "((2^499-1)/20959)" | ./ecm -gpu -cgbn -sigma 3:1000 200000 0
GMP-ECM 7.0.5-dev [configured with GMP 6.1.2, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is ((2^499-1)/20959) (146 digits)
Using B1=200000, B2=0, sigma=3:1000-3:1383 (384 curves)
Running GPU kernel<24,128> ...
CUDA error occurred: the launch timed out and was terminated
While running cudaDeviceSynchronize() (file cgbn_stage1.cu, line 632)

@bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?

This GPU has CUDA arch 3.0, is that new enough for CGBN? I also have a newer GPU with CUDA arch 5.2; I'll try installing on that next.

NB. msieve says what CUDA arch the card is when used for GPU-based poly selection. I'm using that to check what they really support.

Chris

PS. Should discussion of ecm with CGBN be split off into another thread?
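One possible explanation for the launch timeout above (an assumption on my part, not something established in the thread) is the driver's display watchdog: GPUs driving a display typically kill kernels after a few seconds, and since stage-1 GPU time grows roughly linearly with B1, the ~2.9s kernel at B1=20000 would run about ten times longer at B1=200000:

```python
# Rough estimate, assuming stage-1 GPU time scales linearly with B1.
t_b1_20k = 2.928                       # seconds of GPU time at B1=20000 (quoted above)
t_b1_200k = t_b1_20k * (200000 / 20000)
print(round(t_b1_200k))                # -> 29
```

Roughly 29 seconds in a single kernel launch would exceed a typical watchdog limit, which is consistent with the failure appearing only at the larger B1.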
2021-08-27, 19:24   #11
bsquared

"Ben"
Feb 2007

3,733 Posts

Quote:
 Originally Posted by chris2be8 @bsquared, what happens if you try raising B1 by another factor of 10? How high a B1 will it take?
I have run 2^997-1 up to B1=10M so far with no problems. I will go up another order of magnitude; should take a few hours.
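A back-of-envelope check that "a few hours" is the right order of magnitude, assuming (an approximation on my part) that stage-1 GPU time scales linearly with B1, starting from the 4990ms that 5120 curves of 2^997-1 took at B1=20000 in post #6:

```python
# Linear-in-B1 extrapolation from 4.990s at B1=20000 (5120 curves, 2^997-1).
t_b1_20k = 4.990
for b1 in (10_000_000, 100_000_000):
    hours = t_b1_20k * (b1 / 20_000) / 3600
    print(f"B1={b1}: ~{hours:.1f} hours")
```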

(2^499-1)/20959 has a factor that is found on my card in stage 1 at B1=2M (sigma=3890).



