20200621, 14:00  #12 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
105A_{16} Posts 
Here's my take on what may be required. I'd be very interested in what George, Ernst, Preda, and the rest of the approximately 18 programmers who created the commonly used several GIMPS cpu and gpu applications have to say about how they did it. This should not be taken as discouragement for new entrants, but as recognition of what the authors of existing code have done.
Realistically consider whether you have what it takes and are willing to pay the price it takes. Commitment, great skills, best algorithms, best implementations, combined. Become a worldclass programmer. Sort of like an Olympian athlete. Train ruthlessly and constantly. Innate ability required. Program a lot. Learn from the best work of others. Read the source code. Be open, even eager, for constructive criticism. Remove ego and emotion from decisions. Use excellent rational judgment. Identify the best available mathematical algorithms for the task you're tackling. Learn them. For Woltman this probably involved library research in physical books, identifying experts in mathematics and number theory, by library, email, conferences, etc, asking questions, being humble about having blind spots and getting help identifying them. He did this in the WWW's infancy, by 1995. There was nothing like this forum then. Perhaps mailing lists. Learn continually. Stay current in the relevant areas. Learn well the programming languages and other tools. Know a high level language well, how to combine it with assembler, how to structure a program well, how to determine what parts are taking how much time (profiler), and especially when is too soon to optimize. Be ruthless about performance. Write the same functionality many ways and measure their actual performance. Start with what the processor documentation indicates would be fast, and try many different ways too. Know and use optimization techniques. Use different code branches optimized for different processor models, different operand inputs, etc. Be more ruthless about correctness of the program's operation. Include multiple error checks in the software. Include complementary checks in the project that uses the software. Produce outputs that can be compared to knowngood outputs of other programs. Compare them. Frequently. Search the world for independently produced data for what's correct output and allies that can help ensure correctness. Give the software away for free, including source code, create a community that uses it, accept and welcome any useful coding, testing, etc help that appears. Sustained commitment. Efficient use of your own time. Spend large amounts of time well focused and well directed, over a quarter century in the case of prime95. 
20200621, 14:08  #13  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·7·13·23 Posts 
Quote:
Do you know why current chips are small? As a thought experiment long ago, there was a discussion on a thread about what it would take to implement something like the PRP test algorithm in hardware, such that it could clock exponent values in one port and primality test results out another every clock cycle or two. As I recall the conclusion was tiling the area of a small US state, including distributed solar cells to power it. How to reconfigure continuously as exponents increased and longer fft length was required was left as an exercise to the reader. There has been some preliminary discussion of FPGAs or ASICs. My guess is they're more suitable for TF than P1 or primality testing. The development time and cost is considerable. Last fiddled with by kriesel on 20200621 at 16:17 

20200621, 16:47  #14 
Sep 2006
Brussels, Belgium
11000001101_{2} Posts 
There is a document on the the GIMPS site, I think by George, that shows some light on what is involved in a software optimisation process : P4 Notes.
Jacob Last fiddled with by S485122 on 20200621 at 16:47 Reason: the "4" was gone (or more probably I did'nt type it) 
20200621, 19:07  #15  
"Oliver"
Sep 2017
Porta Westfalica, DE
233 Posts 
Quote:
It really boills down to that (yes, other comments are helpful as well, of course). When I started to "understand" what Prime95 does, I first had to learn that nearly every step can have a substantial optimization. But when I tried to code something basic myself, my basic code was unstable. So I'll really recommend this procedure and you'll can always look up on our pages for optimizations. 

20200621, 19:27  #16  
P90 years forever!
Aug 2002
Yeehaw, FL
2^{5}×3×73 Posts 
Quote:
Mlucas FFTs are written in C, Prime95 in assembly. Try looking in the Mlucas subforum as Ernst has often benchmarked Mlucas against Prime95. Prime95 is not twice as fast as Mlucas. 

20200621, 21:00  #17  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×7×13×23 Posts 
Quote:
As to hair, that or rather its departing happens regardless of coding or not, depending mainly on maternal line. (Although pulling it out during programming does not help matters.) Last fiddled with by kriesel on 20200621 at 21:01 Reason: accursed excess line feeds 

20200622, 05:52  #18 
"Sam"
Nov 2016
13·23 Posts 
Apart from what others have said, there's nothing stopping you from creating your own algorithm, especially for multiplication. You will want to look at running times of existing algorithms such as FFT, which currently runs in O( ln(n) ) time is my understanding. So reducing multiplication to O ( ln(n)^(1/2) ) or even O ( ln(ln(n)) ) time would definitely speed computations up. I don't think that will happen anytime soon, however.
Also out of curiosity, suppose you had to write your own libraries from scratch, will writing everything in C give best performance (starting from FFT algorithms, to a PRP test involving binary exponentiation)? The advantage of writing such libraries is you could easily* write code using CUDA compiler or equivalent, so that the program takes advantage of a GPU, which are faster than CPUs (most of the time anyways). 
20200622, 16:04  #19  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
10132_{8} Posts 
Quote:
All known multiplication algorithms are O (n) or higher, where n is number of bits or words required. Knuth's O(n) is a special case that requires hardware capable of processing n bits wide. Hardware operating 10^{8} to 2^{32} bits wide would be quite large and probably would not clock very fast. IIRC it's been conjectured the lower bound is O(n log n). Various recent papers have closely approached that for very large n, much larger than 2^{32}. Since primality tests are performed by ~n iterations of n bits width, they are O(n^{2} ln n ln ln n). That works out to be around O(n^{2.1}) for n of current interest to finding new Mersenne primes. See https://www.mersenneforum.org/showpost.php?p=510721&postcount=7 for a more detailed writeup and links to the various publications. Last fiddled with by kriesel on 20200622 at 16:10 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
GGNFS not being developed?  jux  Factoring  4  20151227 03:05 