2021-09-10, 03:54   #22
SethTro ("Seth", Apr 2019)

Quote:
Originally Posted by R.D. Silverman
Interesting question. I guess [based on what is publicly reported] that it is under 1024 bits. One still must run stage 2. Even if GPU stage 1 took zero time, the net result only cuts the total time in half after running stage 2.
I finished some new code today that measures the speedup from adding a GPU, in the extreme case where you pair a single GPU with a single CPU core.

All of these are equivalent to one t35 (using -param 3):
1116 curves at B1=1e6, B2=1e9 (traditional)
747 curves at B1=1.3e6, B2=2.86e9 (equal time in both stages)
1850 curves at B1=1.9e6, B2=28.5e6 (equal time in both stages, given a 40x faster stage 1)

These take, respectively (for the 146-digit input I tested):
1116 * (1082 + 757)ms = 34.2 minutes
747 * (1414 + 1479)ms = 36 minutes
1850 * (2045 + 72)ms = 65 minutes (all on CPU)
1850 * (2045/40 + 72)ms = 3.8 minutes (1.5 minutes GPU + 2 minutes CPU)

So 34.2 minutes for a t35 has been reduced to 3.8 minutes, a 9x speedup!
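The arithmetic above is easy to check with a short script; here is a minimal sketch using the per-curve stage 1 / stage 2 times (in milliseconds) quoted above, and assuming the GPU simply divides the stage 1 time by 40:

```python
# Per configuration: (curve count, stage 1 ms/curve, stage 2 ms/curve)
configs = {
    "traditional":          (1116, 1082, 757),
    "equal-time":           (747, 1414, 1479),
    "fast-stage1 (CPU)":    (1850, 2045, 72),
    "fast-stage1 (GPU)":    (1850, 2045 / 40, 72),  # assumes 40x faster stage 1
}

def total_minutes(curves, s1_ms, s2_ms):
    """Total wall time for one t35-equivalent run, in minutes."""
    return curves * (s1_ms + s2_ms) / 1000 / 60

for name, (curves, s1, s2) in configs.items():
    print(f"{name:20s} {total_minutes(curves, s1, s2):5.1f} min")

baseline = total_minutes(*configs["traditional"])
gpu = total_minutes(*configs["fast-stage1 (GPU)"])
print(f"speedup: {baseline / gpu:.1f}x")
```

This reproduces the 34.2-minute baseline and the 3.8-minute GPU-assisted total, giving the quoted 9x.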

If you instead pair 8 CPU cores with one GPU, the effective stage 1 speedup is only 40/8 = 5x, and the overall speedup is muted to about 3x:
from 34.2/8 = 4.3 minutes down to 45 seconds GPU + 51 seconds CPU = 1.5 minutes.
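A quick check of the 8-core arithmetic, using the rounded per-case figures from the post (the exact ratio depends on rounding, which is why the post quotes roughly 3x):

```python
# Hedged check of the 8-core case, using the post's rounded figures.
baseline_8core = 34.2 / 8      # minutes per t35 with 8 independent CPU cores
gpu_8core = (45 + 51) / 60     # 45 s GPU + 51 s CPU, in minutes
print(f"{baseline_8core:.1f} min -> {gpu_8core:.1f} min, "
      f"about {baseline_8core / gpu_8core:.1f}x")
```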