#100
"Seth"
Apr 2019
2⁴×3³ Posts
I spent a good part of this week trying to implement fast squaring for CGBN. Ultimately my code was 10% slower and still had edge cases that broke.
In the best case, with squaring 100% faster (i.e. twice the speed of a multiplication), there are 4 `mont_sqr` and 4 `mont_mul`, so it would only be 8 / (4/2 + 4) - 1 = 33% faster. Using GMP's figure of 50% faster squaring it would be 8 / (4/1.5 + 4) - 1 = 20% faster. I'll reach out to the author of the repo, because they mention fast squaring in their paper "Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs" http://www.acsel-lab.com/arithmetic/...a/1616a047.pdf but it's unlikely to happen.
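A quick sanity check of those bounds (a sketch, not from the actual code; the 4 squarings + 4 multiplications come from the post, and a unit cost per `mont_mul` is assumed):

Code:
import math

# Estimate the best-case stage 1 speedup from faster modular squaring.
# Assumption (from the post): the inner loop does 4 mont_sqr and 4 mont_mul,
# and both currently cost 1 unit each.

def stage1_speedup(sqr_cost: float) -> float:
    """Relative speedup if mont_sqr cost drops from 1.0 to sqr_cost."""
    baseline = 4 * 1.0 + 4 * 1.0       # 4 squarings + 4 multiplications
    improved = 4 * sqr_cost + 4 * 1.0  # only the squarings get cheaper
    return baseline / improved - 1

print(f"{stage1_speedup(1 / 2.0):.0%}")   # squaring 2x faster   -> 33%
print(f"{stage1_speedup(1 / 1.5):.0%}")   # squaring 1.5x faster -> 20%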
#101
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
1011101100000₂ Posts
I just tried to upgrade my version of this, since I was on a fairly old version and certain numbers were crashing.
Compilation failed with the following error:
Code:
/bin/bash ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

4 errors detected in the compilation of "cgbn_stage1.cu".
#102
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2⁵×11×17 Posts
May have discovered the issue. I think I need to update CGBN.
edit: confirmed
Last fiddled with by henryzz on 2021-11-28 at 05:21
#103
Sep 2009
100100010110₂ Posts
My GTX 970 has burnt out, so I've had to replace it with an RTX 3060 Ti. That's sm_86, so I had to reinstall ECM-GPU.
After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
Code:
git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration
But ./configure doesn't support a GPU arch above 75, so I had to run:
Code:
./configure --enable-gpu=75 --with-cuda=/usr/local/cuda CC=gcc-9 -with-cgbn-include=/home/chris/CGBN/include/cgbn
and then manually update the makefiles to sm_86. nvcc -h says in part:
Code:
--gpu-code <code>,...                           (-code)
        Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
        nvcc embeds a compiled code image in the resulting executable for each
        specified <code> architecture, which is a true binary load image for
        each 'real' architecture (such as sm_50), and PTX code for the
        'virtual' architecture (such as compute_50). During runtime, such
        embedded PTX code is dynamically compiled by the CUDA runtime system if
        no binary load image is found for the 'current' GPU.
        Architectures specified for options '--gpu-architecture' and
        '--gpu-code' may be 'virtual' as well as 'real', but the <code>
        architectures must be compatible with the <arch> architecture. When the
        '--gpu-code' option is used, the value for the '--gpu-architecture'
        option must be a 'virtual' PTX architecture.
        For instance, '--gpu-architecture=compute_60' is not compatible with
        '--gpu-code=sm_52', because the earlier compilation stages will assume
        the availability of 'compute_60' features that are not present on
        'sm_52'.
        Note: the values compute_30, compute_32, compute_35, compute_37,
        compute_50, sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may
        be removed in a future release.
        Allowed values for this option: 'compute_35','compute_37','compute_50',
        'compute_52','compute_53','compute_60','compute_61','compute_62',
        'compute_70','compute_72','compute_75','compute_80','compute_86',
        'compute_87','lto_35','lto_37','lto_50','lto_52','lto_53','lto_60',
        'lto_61','lto_62','lto_70','lto_72','lto_75','lto_80','lto_86',
        'lto_87','sm_35','sm_37','sm_50','sm_52','sm_53','sm_60','sm_61',
        'sm_62','sm_70','sm_72','sm_75','sm_80','sm_86','sm_87'.
Older versions of nvcc will probably give the list with:
Code:
nvcc -h | grep -o -E 'sm_[0-9]+' | sort -u
But that won't work for 11.6, because the help text mentions sm_30 in the deprecation note even though it's no longer a valid value.

It seems to work OK, but I've not tried it on a big job yet. And I need to update my scripts, because the new GPU does 2432 stage 1 curves per run, which limits its use if I just need to do t30.

@SethTro, can you update configure to support sm_86?

One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.
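For what it's worth, a possible workaround (an untested sketch; it assumes nvcc is on PATH and that the help keeps the "Allowed values for this option:" wording shown above) is to pull the sm_* names only out of the allowed-values sentences, so names that appear only in the deprecation note don't leak into the list:

Code:
import re
import subprocess

# Sketch: list the sm_* targets this nvcc actually accepts by parsing only
# the "Allowed values for this option: ..." sentences of `nvcc -h`, so names
# mentioned only in deprecation notes (like sm_30 in CUDA 11.6) are skipped.
help_text = subprocess.run(["nvcc", "-h"], capture_output=True, text=True).stdout

archs = set()
# Each allowed-values list runs up to the next period, with no periods inside.
for section in re.findall(r"Allowed values for this option:([^.]*)", help_text):
    archs.update(re.findall(r"sm_[0-9]+", section))
print(*sorted(archs, key=lambda s: int(s.split("_")[1])), sep="\n")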
#104
"Ed Hall"
Dec 2009
Adirondack Mtns
3×19×79 Posts

Quote:
You're probably aware of this site, but I've been having good luck at techpowerup for all the details on the various cards, e.g. https://www.techpowerup.com/gpu-spec...-3060-ti.c3681, which shows CUDA 8.6 (i.e. sm_86).
#105
Sep 2009
2×1,163 Posts
And I've got another problem with the new card:
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 110000000 110000000 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=110000004, sigma=3:3698165927-3:3698168358 (2432 curves)
ecm: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.
Aborted (core dumped)
Testing various B1 values, it fails at 70000000 but works at 60000000:
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 60000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=60000000, B2=1, sigma=3:4285427795-3:4285430226 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 65 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 3151ms of CPU time / 2557979ms of GPU time
Without -cgbn, B1=110000000 works:
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -save test2.save 110000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:2243519347-3:2243521778 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 1024
GPU: numRegsPerThread = 30 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 76x1x1 (2432 parallel curves)

@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.
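That failure boundary matches a back-of-the-envelope estimate (a sketch, assuming stage 1 exponentiates by s = lcm(1, ..., B1) as in standard ECM, and that this is the quantity allocate_and_set_s_bits measures): by the prime number theorem log2(s) ≈ B1/ln 2 ≈ 1.4427·B1, so the assert's 100000000-bit cap is crossed between B1=6e7 and B1=7e7.

Code:
import math

# Estimated bit length of the stage 1 exponent s = lcm(1, ..., B1):
# log2(s) ~= B1 / ln(2) by the prime number theorem.
for B1 in (60_000_000, 70_000_000, 110_000_000):
    est_bits = B1 / math.log(2)
    status = "within" if est_bits <= 100_000_000 else "exceeds"
    print(f"B1={B1:>11,}: ~{est_bits:>12,.0f} bits ({status} the 1e8 assert cap)")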
#106
"Ed Hall"
Dec 2009
Adirondack Mtns
3·19·79 Posts
Which I haven't been trying to make use of yet. I still haven't figured out the correlation between SMs/cores/?? and how many parallel curves ECM runs.
#107
Apr 2010
2²·3·19 Posts

Quote:
Code:
git clone https://gitlab.inria.fr/zimmerma/ecm.git
Quote:
Why? If you just want to do t30, use 4864 curves with B1=37e4 and skip stage 2. That should be about a t30. Unless you have a very powerful CPU, it should be faster.
#108
"Ed Hall"
Dec 2009
Adirondack Mtns
4503₁₀ Posts

Quote:

Quote:
All help in understanding this is appreciated. |
#109
Apr 2010
2²×3×19 Posts
I don't know. I have a 2060 Super that has 2176 shader units. Anything below 2176 curves takes as much time as 2176 curves. Total throughput is about 5-10% better for 4352 concurrent curves.
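That pattern is what you'd expect if wall time scales with the number of full passes over the shader units; a toy model (an illustrative assumption, not a measurement):

Code:
import math

# Toy occupancy model: one curve per shader unit per pass, so wall time is
# proportional to ceil(curves / units) and per-pass throughput peaks at
# exact multiples of the unit count.
UNITS = 2176  # shader units on a 2060 Super, per the post

for curves in (1000, 2176, 2200, 4352):
    passes = math.ceil(curves / UNITS)
    print(f"{curves:>5} curves -> {passes} pass(es), "
          f"{curves / passes:>6.0f} curves of work per pass")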