2021-09-09, 17:03   #89
SethTro

"Seth"
Apr 2019

Posts

Quote:
 Originally Posted by chris2be8
 Just the higher arch one (sm_52). Sorry.
 PS. Does CGBN increase the maximum size of number that can be handled? I'd try it, but I'm tied up catching up with ECM work I delayed while I was getting ecm-cgbn working.
Yes! In cgbn_stage1.cu, search for this line:
/* NOTE: Custom kernel changes here

You can either add a new kernel or, as I recommend, just change cgbn_params_512:

- typedef cgbn_params_t<4, 512> cgbn_params_512;
+ typedef cgbn_params_t<TPI_SEE_COMMENT_ABOVE, YOUR_BITS_HERE> cgbn_params_512; // My name is a lie

The absolute limit is 32,768 bits. I found that GPU/CPU performance decreases 3x from 1,024 bits to 16,384 bits, then an additional 2x above 16,384. That is still something like 13x faster on my system, but possibly no longer competitive watt for watt. Read the nearby comment for a sense of how long it will take to compile.
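A quick way to see whether an input is even a candidate for a larger kernel is to compare its bit length with that ceiling. This is only a sketch in Python, not code from ecm or CGBN: the 32,768-bit figure is the one quoted above, the helper name `fits_cgbn_limit` is made up for illustration, and any headroom a real kernel needs beyond the raw bit length is not modeled.

```python
# Hypothetical helper for illustration; not part of ecm or CGBN.
CGBN_MAX_BITS = 32_768  # absolute limit quoted in the post above


def fits_cgbn_limit(n, max_bits=CGBN_MAX_BITS):
    """Return (bit length of n, whether that length is within max_bits)."""
    bits = n.bit_length()
    return bits, bits <= max_bits


print(fits_cgbn_limit(2**997 - 1))     # (997, True)
print(fits_cgbn_limit(2**40_000 - 1))  # (40000, False)
```

The bit length is what would guide the choice of YOUR_BITS_HERE in the typedef; which TPI value to pair with it is covered by the comment in cgbn_stage1.cu and is not guessed at here.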

2021-09-10, 04:01   #90
SethTro

"Seth"
Apr 2019

2^2·7·13 Posts

I spent most of today working on new optimal bounds. It can be a large speedup to use these instead of the traditionally optimal B1 bounds. ecm can confirm they represent a full t while taking substantially less time when accounting for the GPU speedup.
Full table at https://github.com/sethtroisi/misc-s..._gpu_optimizer and an excerpt below

Code:
GPU speedup/CPU cores  digits  optimal B1   optimal B2         B2/B1 ratio  expected curves
Fast GPU + 4 cores
40/4                   35      2,567,367    264,075,603        103          809
40/4                   40      8,351,462    1,459,547,807      175          1760
40/4                   45      38,803,644   17,323,036,685     446          2481
40/4                   50      79,534,840   58,654,664,284     737          7269
40/4                   55      113,502,213  96,313,119,323     849          29883
40/4                   60      322,667,450  395,167,622,450    1225         56664
Fast GPU + 8 cores
40/8                   35      1,559,844    351,804,250        226          1038
40/8                   40      6,467,580    2,889,567,750      447          1843
40/8                   45      29,448,837   35,181,170,876     1195         2599
40/8                   50      40,201,280   58,928,323,592     1466         11993
40/8                   55      136,135,593  289,565,678,027    2127         20547
40/8                   60      479,960,096  3,226,409,839,042  6722         30014
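For anyone wanting to sanity-check the B2/B1 ratio column, here is a small Python sketch. The B1 and B2 values are copied from the first three "40/4" rows of the excerpt above; the function name `b2_b1_ratio` is made up, and rounding to the nearest integer is an assumption about how the column was produced (it does reproduce these rows).

```python
# (digits, B1, B2) copied from the "40/4" rows of the excerpt above.
ROWS_40_4 = [
    (35, 2_567_367, 264_075_603),
    (40, 8_351_462, 1_459_547_807),
    (45, 38_803_644, 17_323_036_685),
]


def b2_b1_ratio(b1, b2):
    """B2/B1 rounded to the nearest integer (assumed rounding rule)."""
    return round(b2 / b1)


for digits, b1, b2 in ROWS_40_4:
    print(digits, b2_b1_ratio(b1, b2))  # 35 103, then 40 175, then 45 446
```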
2021-09-10, 04:01 #90
SethTro

"Seth"
Apr 2019

Posts

Quote:
 Originally Posted by bsquared
 1280: (~31 ms/curves)
 2560: (~21 ms/curves)
 640: (~63 ms/curves)
 1792: (~36 ms/curves)
 So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))
 With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.
Two late-night performance thoughts.
1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55.
It's a nice debug check while this is still in development, but it has never tripped, so it's overly cautious, especially if it costs 10% performance.
2. Would you mind sharing what card you have and the full -v output (especially the lines that start with "GPU: ")?

2021-09-13, 14:11   #92
bsquared

"Ben"
Feb 2007

Posts

Quote:
 Originally Posted by SethTro
 Two late night performance thoughts.
 1. You might get 10% more throughput by toggling VERIFY_NORMALIZED to 0 on line 55
 It's a nice debug check while this is still in development but it has never tripped so it's overly cautious especially if it costs 10% performance.
 2. Would you mind sharing what card you have and the full output from -v output (especially the lines that start with "GPU: ")
Hmm, when running on 2^997-1 I'm getting *better* throughput with VERIFY_NORMALIZED 1: 53.5 curves/sec with it defined to 1 vs. 45.6 curves/sec with it defined to 0, both running -gpucurves 2560. If I set -gpucurves 5120, then the no_verify version is 15% faster, but still slower than -gpucurves 2560.

It is a Tesla V100-SXM2-32GB (compute capability 7.0, 80 MPs, maxSharedPerBlock = 49152 maxThreadsPerBlock = 1024 maxRegsPerBlock = 65536)
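Since the thread switches between ms/curve and curves/sec, a tiny conversion helps when comparing posts. This is just a Python sketch with a made-up helper name, using the two VERIFY_NORMALIZED figures reported above.

```python
def ms_per_curve(curves_per_sec):
    """Convert a throughput in curves/sec to average milliseconds per curve."""
    return 1000.0 / curves_per_sec


verify_on = 53.5   # curves/sec with VERIFY_NORMALIZED 1 (from the post above)
verify_off = 45.6  # curves/sec with VERIFY_NORMALIZED 0 (from the post above)

print(f"{ms_per_curve(verify_on):.2f} ms/curve")   # 18.69 ms/curve
print(f"{ms_per_curve(verify_off):.2f} ms/curve")  # 21.93 ms/curve
print(f"{verify_on / verify_off - 1:.0%} more throughput with the check on")
```

On those numbers the build with the check enabled comes out roughly 17% ahead at -gpucurves 2560, which is the opposite of the expected direction.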

2021-09-21, 08:35   #93
SethTro

"Seth"
Apr 2019

Posts

Quote:
 Originally Posted by bsquared
 1280: (~31 ms/curves)
 2560: (~21 ms/curves)
 640: (~63 ms/curves)
 1792: (~36 ms/curves)
 So we have a winner! -gpucurves 2560 beats all the others and anything the old build could do as well (best on the old build was 5120 @ (~25 ms/curves))
 With the smaller kernel (running (2^499-1) / 20959), -gpucurves 5120 is fastest at about 6ms/curve on both new and old builds.
I was confused when you saw only moderate gains, so I rented a V100 (V100-SXM2-16GB) on AWS today.
I'm seeing the new code run 3.1x faster, which is similar to the 2-3x improvement I've seen on a 1080 Ti, 970, and K80.

Code:
$ echo "(2^997-1)" | ./ecm -cgbn -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time
Throughput: 74.170 curves per second (on average 13.48ms per Step 1)
$ echo "(2^997-1)" | ./ecm -gpu -v -sigma 3:1000 1000000 0
Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time
Throughput: 23.417 curves per second (on average 42.70ms per Step 1)
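The throughput lines follow directly from the curve count and the GPU time, so the two runs can be compared from the "Computing ... Step 1 took ..." lines alone. Below is a rough Python sketch: the regular expression is written against the two lines quoted above (it is not an official description of ecm's output format), and `step1_stats` is a made-up name.

```python
import re

# Pattern written to match the timing lines quoted above (an assumption,
# not a specification of ecm's output).
STEP1_RE = re.compile(
    r"Computing (\d+) Step 1 took (\d+)ms of CPU time / (\d+)ms of GPU time"
)


def step1_stats(line):
    """Return (curves, curves per second, ms per curve), based on GPU time."""
    m = STEP1_RE.search(line)
    if m is None:
        raise ValueError("not a Step 1 timing line")
    curves, _cpu_ms, gpu_ms = (int(g) for g in m.groups())
    return curves, curves * 1000.0 / gpu_ms, gpu_ms / curves


cgbn = step1_stats("Computing 5120 Step 1 took 134ms of CPU time / 69031ms of GPU time")
gpu = step1_stats("Computing 5120 Step 1 took 10911ms of CPU time / 218643ms of GPU time")

print(f"-cgbn: {cgbn[1]:.3f} curves/s, {cgbn[2]:.2f} ms per Step 1")
print(f"-gpu:  {gpu[1]:.3f} curves/s, {gpu[2]:.2f} ms per Step 1")
print(f"speedup: {cgbn[1] / gpu[1]:.2f}x")
```

Run on the two logs above, this reproduces the reported 74.170 and 23.417 curves per second, and the ratio lands a little above 3.1x.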

