2019-05-11, 18:16  #12
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we save interim residues on the primenet server?
This is often asked in the context of wanting to continue a run that someone else abandoned before completion. It's not unusual for a participant to quit when their assigned exponent(s) are anywhere from 2% to 98% complete in a primality test.
Full-length residues saved to the PrimeNet server at some interval, perhaps every 20 million iterations, are sometimes proposed as a means of minimizing the throughput lost to abandoned, uncompleted tests. Implementing this for the combined output of GIMPS would place a considerable load on the server's resources, and would require considerable additional expenditure to support, which is not in the Mersenne Research Inc. budget. For users with slow internet connections, the individual load could also be a considerable fraction of available bandwidth, and transfer times could stall the application and reduce total throughput. https://www.mersenneforum.org/showpo...&postcount=118 Detailed analysis and discussion at https://www.mersenneforum.org/showpo...&postcount=124

However, it is feasible to save smaller interim residues, such as 64-bit or 2048-bit, and this is currently being done. Recent versions of prime95 automatically save 64-bit residues at 500,000 iterations and at every multiple of 5,000,000. The 2048-bit residues are generated at the end of PRP tests, possibly only type-1 and type-5 PRP tests, per posts 606-609 of https://www.mersenneforum.org/showth...048#post494079 The stored interim 64-bit residues from different runs can be compared to see whether runs are matching along the way, or where one or another diverges.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-02-20 at 20:35
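A back-of-envelope sketch of the upload arithmetic behind the post above. All figures here (exponent size, save interval, count of concurrent tests) are illustrative assumptions, not official PrimeNet statistics:

```python
# Rough scale of saving full-length interim residues server-side.
# Assumed figures for illustration only, not official PrimeNet data.
exponent = 100_000_000                     # a ~100M-bit wavefront-ish test
bytes_per_residue = exponent // 8          # a full residue is ~p bits
saves_per_test = exponent // 20_000_000    # one upload per 20M iterations
active_tests = 10_000                      # assumed concurrent assignments

total_bytes = bytes_per_residue * saves_per_test * active_tests
print(f"{bytes_per_residue/1e6:.1f} MB per interim residue")
print(f"{total_bytes/1e12:.2f} TB uploaded across {active_tests} tests")
```

Even under these modest assumptions, each save is a multi-megabyte upload per machine, and the fleet-wide total reaches fractions of a terabyte per cycle, which illustrates why only small (64-bit or 2048-bit) interim residues are kept.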
2019-05-19, 15:58  #13
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we skip double checking of PRP tests protected by the very reliable Gerbicz check?
George Woltman gave a few reasons at https://www.mersenneforum.org/showpo...68&postcount=3.
An example of a bad PRP result is listed at https://www.mersenne.org/report_expo...9078529&full=1, which George has identified as an example of a software bug affecting a single bit outside the block of computations protected by the Gerbicz error check. However, the development of a method of generating a proof of correct completion of a PRP test, that can be independently verified, will replace PRP double checking, at a great savings in checking effort. https://www.mersenneforum.org/showth...ewpost&t=25638 This has been implemented in Gpuowl, mprime/prime95, and on the PrimeNet server. It is planned to be added to Mlucas also. Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20210202 at 19:28 Reason: updated statement of PRP proof/cert implementation status 
2019-05-19, 17:11  #14
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we self test the applications, immediately before starting each primality test?
Why not self-test for the same fft length about to be used for a primality test, such as a current wavefront test, or any 100M-digit or larger exponent? Perhaps also upon resumption of an exponent?
(part of this was first posted as https://www.mersenneforum.org/showpo...0&postcount=10) Quote:
Users might find the checks annoying, or regard them as lost throughput. Running LL on 100M-digit exponents would be disincentivized, since it would also involve working on a 100M-digit PRP DC so that there is an fft length match. One might as well run PRP for 100M-digit exponents, and avoid the side self-test or the commitment to doing a 100M-digit DC. Increasing adoption of PRP and reducing LL for 100M-digit exponents is a good thing.

There are also some application-specific or interface-specific reasons. There is no GIMPS PRP code or Gerbicz check code for CUDA. There is no provision for self-test of fft lengths larger than 8192K in CUDALucas.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-02-20 at 19:34
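A toy sketch of the known-answer idea behind such a self test: before starting a real assignment, run a computation with a known outcome and verify it. (A real self test would exercise the production fft code at the target fft length; this sketch only runs the plain LL recurrence on tiny exponents.)

```python
# Known-answer self test sketch: run the Lucas-Lehmer recurrence
# s_0 = 4, s_{i+1} = s_i^2 - 2 (mod M_p), for p - 2 iterations.
# M_p is prime if and only if the final s is 0.
def ll_is_prime(p):
    M = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % M
    return s == 0

# M127 is a known Mersenne prime; M11 = 23 * 89 is known composite.
assert ll_is_prime(127) and not ll_is_prime(11)
print("self test passed; safe to start the real assignment")
```

The point of the check is that both outcomes are known in advance, so any deviation indicates a hardware or software fault before hours of real work are invested.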

2019-05-20, 02:04  #15
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we occasionally manually submit progress reports for long-duration manual primality tests?
There's currently no way to do that.
This is a CUDALucas console output line: Code:
|  May 19  20:00:49  | M49602851  30050000  0x05c21ef8e9eac8b2  | 2688K  0.15625  2.0879  104.39s  |  11:15:47  60.58%  |
Fed to the manual results report page, it produces only:
Code:
Done processing:
* Parsed 1 lines.
* Found 0 datestamps.
GHz-days  Qty  Work Submitted  Accepted  Average
0 - all - 0.000
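If the server ever did accept such reports, the console line would need parsing first. A hypothetical sketch; the field layout is inferred from the single sample above, not from a documented format, and no PrimeNet endpoint for this exists today:

```python
import re

# Hypothetical parser for a CUDALucas progress line (format assumed
# from one sample; the pipe separators are an assumption as well).
line = ("|  May 19  20:00:49  | M49602851  30050000  0x05c21ef8e9eac8b2  "
        "| 2688K  0.15625  2.0879  104.39s  |  11:15:47  60.58%  |")

m = re.search(r"M(\d+)\s+(\d+)\s+0x([0-9a-f]{16}).*?([\d.]+)%", line)
exponent = int(m.group(1))      # Mersenne exponent under test
iteration = int(m.group(2))     # iterations completed so far
res64 = m.group(3)              # interim 64-bit residue, hex
pct = float(m.group(4))         # percent complete
print(f"M{exponent} at iteration {iteration} ({pct}%), res64 {res64}")
```

A progress-report endpoint would only need this handful of fields (exponent, iteration, res64, timestamp) to update assignment status, which is part of why the feature request keeps recurring.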
Accepting gpuowl progress records would also be very useful. Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-06-27 at 14:30
2019-10-02, 05:09  #16
Romulan Interpreter
Jun 2011
Thailand
2^{5}×5×59 Posts 
Why don't we extend B1 or B2 of an existing no-factor P-1 run?
Quote:
Extending B1 is a bit trickier, because you need to recompute the additional small primes that fit under the new B1, and do the exponentiation required to include them in the new product (b^E). There is a piece of pari/gp P-1 code I posted some time ago which does B1 extension, but it is slow: first because it is pari, and second because it only uses "chunks" of 2 primes (i.e. no stage 2 extensions); it can, however, save intermediate files and extend B1 too. Also, once you extend B1, you must redo stage 2 "from scratch"; whatever stage 2 you did before, for the same B2 (or more, or less), is void. (Kriesel:) Mostly though, we don't do P-1 bounds extensions because:
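The B1-extension arithmetic described above can be sketched with toy bounds. This is an illustration of the algebra only (the bounds and exponent are tiny, and real code works on saved binary residue files, not recomputed ones):

```python
# Toy P-1 stage 1 extension from B1_old to B1_new, given only the saved
# residue x_old = 3^E mod N. Raising x_old to the prime-power deficit
# between the bounds reproduces a from-scratch run to B1_new.
def primes(hi):
    """Primes <= hi by trial division (toy sizes only)."""
    return [n for n in range(2, hi + 1)
            if all(n % d for d in range(2, int(n**0.5) + 1))]

def max_power(q, B):
    """Largest k with q^k <= B."""
    k = 0
    while q ** (k + 1) <= B:
        k += 1
    return k

def stage1_exponent(B1, p):
    """Standard P-1 stage 1 exponent: 2*p times all prime powers <= B1."""
    E = 2 * p
    for q in primes(B1):
        E *= q ** max_power(q, B1)
    return E

p = 1277                     # M1277: smallest exponent with no known factor
N = (1 << p) - 1
B1_old, B1_new = 50, 200

x_old = pow(3, stage1_exponent(B1_old, p), N)   # the saved stage 1 residue

x_ext = x_old
for q in primes(B1_new):
    # new primes get their full power; old primes get only the top-up
    extra = max_power(q, B1_new) - max_power(q, B1_old)
    if extra:
        x_ext = pow(x_ext, q ** extra, N)

assert x_ext == pow(3, stage1_exponent(B1_new, p), N)
print("extended stage 1 residue matches a from-scratch run to B1_new")
```

Note the top-up step: small primes already below the old bound (2, 3, 5, ...) may admit a higher power under the new bound, so iterating only over primes in (B1_old, B1_new] would be incorrect.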
Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-03-02 at 19:53 Reason: Add title, list of reasons for status quo

2020-06-19, 19:41  #17
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1001111101101_{2} Posts 
Why don't we do proofs and certificates instead of double, triple, and higher checks?
Update:
We can and do. Everyone who can should stop performing LL first tests, and should upgrade to PRP with GEC and proof generation for first primality tests as soon as possible: prime95/mprime v30.3 or later; gpuowl ~v6.11-316 or later. Mlucas v20 is coming at some point; meanwhile use v19.1 for PRP/GEC without proof generation.

PRP with proof generation is also recommended for all double-check work. Due to an extremely low error rate, it is far more reliable than LL DC, and slightly more efficient on average. The computing time (effort) to generate the proof file and to perform the CERT work is a fraction of that required to make up for the 1-2% expected LL DC error rate. The time between first-test completion and verification is vastly reduced, from often 8 years or more for LL DC, to generally hours or days for PRP proof and CERT. See also https://www.mersenneforum.org/showpo...45&postcount=4

Original post: Because until recently we didn't know it was possible to do proofs of PRP tests for these huge Mersenne numbers at considerably less effort than a repeat PRP or LL test. The development of new code to do proofs and verifications, followed by widespread deployment of client applications to do proofs, and of server infrastructure to accept proofs and perform verifications, will take around a year or more to complete. Gpuowl is closest to being ready to provide proofs. Prime95 and Mlucas haven't begun to get this added yet as of mid June 2020. There's also separate verifier code to write, server modification for storing new data types, manual result handling modification, and extension of the PrimeNet API to accommodate it for prime95.
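To give a feel for why a proof can be checked so much more cheaply than redoing the test, here is a highly simplified sketch of a Pietrzak-style halving argument for a claim y = x^(2^T) mod N. The production GIMPS proof scheme differs in many details (this is an illustration of the general VDF idea only, with T a power of 2):

```python
import hashlib

# Simplified Pietrzak-style proof that y = x^(2^T) mod N.
# Each round publishes the midpoint mu = x^(2^(T/2)) and folds the two
# half-claims into one of half the length, using a Fiat-Shamir challenge.
def fiat_shamir(N, x, y, mu):
    h = hashlib.sha256(f"{N},{x},{y},{mu}".encode()).digest()
    return int.from_bytes(h[:8], "big") | 1

def prove(x, T, N):
    y = pow(x, 1 << T, N)
    proof, cx, cy, t = [], x, y, T
    while t > 1:
        mu = pow(cx, 1 << (t // 2), N)     # midpoint residue
        proof.append(mu)
        r = fiat_shamir(N, cx, cy, mu)
        cx = pow(cx, r, N) * mu % N        # fold the two halves
        cy = pow(mu, r, N) * cy % N
        t //= 2
    return y, proof

def verify(x, y, T, N, proof):
    cx, cy, t = x, y, T
    for mu in proof:                       # log2(T) cheap folding steps
        r = fiat_shamir(N, cx, cy, mu)
        cx = pow(cx, r, N) * mu % N
        cy = pow(mu, r, N) * cy % N
        t //= 2
    return cy == pow(cx, 1 << t, N)        # one squaring instead of 2^T

p, T = 127, 1 << 12
N = (1 << p) - 1
y, proof = prove(3, T, N)
assert verify(3, y, T, N, proof)
```

The prover does the 2^T squarings anyway (that is the PRP test itself, plus a modest overhead for the midpoints); the verifier does only about log2(T) rounds of work, which is why server-side CERT work is tiny compared with a full double check.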
Some threads regarding this recent development:
Announcement: The Next Big Development for GIMPS (layperson's and informal discussion here)
Technical: VDF (Verifiable Delay Function) and PRP (leave this one for the number theorists and crack programmers)
Technical background: Efficient Proth/PRP Test Proof Scheme (also a math/number-theory thread; let's leave this one for the theorists too)

This is an exciting development. It eliminates almost all confirmation effort on future PRP tests, so it will substantially increase testing throughput (eventually). It is a high priority for design and implementation right now; other possible gpuowl enhancements are likely to wait until this is at least ready for some final testing.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-04-16 at 20:35 Reason: Mlucas v20; prefer PRP & proof over LL DC
2020-06-27, 14:49  #18
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we run gpu P-1 factoring's gcds on the gpus?
The software doesn't exist.
Currently CUDAPm1 stalls the gpu it runs on for the duration of a stage 1 or stage 2 gcd, which runs on one core of the system cpu. Earlier versions of gpuowl that performed P-1 also stalled the gpu while running the gcd of a P-1 stage on a cpu core. At some point, Mihai reprogrammed it so that a separate thread runs the gcd on one cpu core while the gpu speculatively begins the second stage of the P-1 factoring in parallel with the stage 1 gcd, or the next worktodo assignment in parallel with the stage 2 gcd when one is available. In all cases, these gcds are performed by the gmp library. (About 98% of the time, a P-1 factoring stage won't find a factor, so continuing is a good bet, and preferable to leaving the gpu idle during the gcd computation.)

It was more efficient use of programmer time to implement it that way quickly, using an existing library routine. On a fast cpu the impact is small; on a slow cpu hosting a fast gpu it is not. Borrowing a cpu core for the gcd has the undesirable effect of stopping a worker in mprime or prime95 for the duration, and may also slow Mlucas, unless hyperthreading is available and effective.

To my knowledge no one has yet written a gpu-based gcd routine for GIMPS-size inputs. For gpu use for gcd in other contexts see http://www.cs.hiroshimau.ac.jp/cs/_...apdcm15gcd.pdf (RSA) and https://domino.mpiinf.mpg.de/intran...FILE/paper.pdf (polynomials). Even if one were written for the large inputs of current and future GIMPS work, a gpu gcd routine could be difficult to share between CUDAPm1 and gpuowl, since gpuowl is OpenCL based while CUDAPm1 is CUDA based, and the available data structures probably differ significantly.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-03-02 at 19:55
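The overlap pattern described above (gcd on a cpu thread while the gpu speculatively continues) can be sketched in miniature. The structure and names are illustrative, not gpuowl's actual code, and the example is shrunk to M11 so the gcd actually finds a factor:

```python
import math
import threading

# Sketch of overlapping the P-1 gcd with further gpu work. The main
# thread stands in for the gpu; the worker thread stands in for the
# cpu core that gmp's gcd would occupy in CUDAPm1 or gpuowl.
p = 11                       # tiny example: M11 = 2047 = 23 * 89
N = (1 << p) - 1
E = 2 * p * 2                # stand-in stage 1 exponent: 2*p times primes <= 2
stage1_value = pow(3, E, N) - 1

result = {}
def gcd_worker():
    # In the real programs, the gmp library performs this gcd.
    result["g"] = math.gcd(stage1_value, N)

t = threading.Thread(target=gcd_worker)
t.start()
# ... here the gpu would speculatively start stage 2 (or the next
# assignment), betting on the ~98% chance that no factor turns up ...
t.join()
g = result["g"]
print("factor:", g if 1 < g < N else "none")   # finds 23 = 2*11 + 1
```

Here the bet loses: 23 divides 3^44 - 1 because the order of 3 mod 23 divides 22 = 2*11, which divides the stage 1 exponent. At GIMPS sizes the gcd operands are millions of bits, which is why it is worth hiding its latency behind continued gpu work.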
2020-12-16, 17:25
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we use 2 instead of 3 as the base for PRP or P-1 computations?
Mersenne numbers are base-2 Fermat pseudoprimes: since 2^p ≡ 1 (mod M_p) and p divides M_p - 1 = 2^p - 2 by Fermat's little theorem, every M_p with p prime passes a base-2 Fermat PRP test, whether actually prime or composite. Similarly, base-2 P-1 always yields the useless "factor" M_p itself, because the stage 1 exponent includes p. Using 3 as the base provides useful information and costs no more computing time; using 2 as the base provides none. That's a summary of my understanding of this thread as it relates to base choice.
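The smallest composite case already shows this. A quick check (standard three-argument pow, nothing assumed beyond the post above):

```python
# M11 = 2047 = 23 * 89 is composite, yet passes a base-2 Fermat PRP
# test, as every Mersenne number with prime exponent does.
p = 11
N = (1 << p) - 1
print("base 2:", pow(2, N - 1, N))   # 1: composite, yet "probably prime"
print("base 3:", pow(3, N - 1, N))   # not 1: correctly flags N composite
```

The base-2 result is 1 because 2^11 ≡ 1 (mod 2047) and 11 divides 2046, so the test can never distinguish prime from composite Mersenne numbers; base 3 has no such degeneracy.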
Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-12-17 at 16:41
2021-03-28, 19:41  #20
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we use interim 64-bit residues as checking input on later runs?
(This idea was first posted as part of https://mersenneforum.org/showpost.p...&postcount=200)
As far as I recall, no current or past GIMPS software offers this feature.

Empirical results show the error rate of an individual LL test is a steep function of exponent: 1-2% for a 100M exponent, but 10-20% for 100M-digit, extrapolating by run time to a ~90% error rate for gigabit exponents.

If a list of interim 64-bit residues from a past run of an equivalent algorithm was available, and that run was not known to be bad, a program modified to read them in and compare them to the 64-bit residues it obtains in its run could use the list as an additional error-check input. Reaching a residue for the same iteration that did not match could trigger a limited number of retries from the last matching point.

The file could be very simply formatted: named <exponent>-compare.txt, or perhaps <exponent>-[PRP3|LL4]-compare.txt, containing ASCII hex 64-bit residues for iteration numbers that must be multiples of the running application's log interval, beginning with a single header record for a consistency check during the current run. (The following Fan Ming LL seed-4 DC interim residues have been confirmed by my TC in progress.) Code:
332194529 LL seed 4
1000000 6f3e8db940a2da46
5000000 af733e924b0ff14d
10000000 3445b03b12e7e63c
15000000 7f89336ab4c99ccf
20000000 7cc12f9bc5569c53
50000000 6d8283811fd2386c
100000000 e1f61c2c2ad97069
Code:
332194529 PRP type 1 base 3
It could also be handy for strategic double and triple checking. Ordinary double checking is much less likely to have access to a residue sample list. The default would need to be to proceed without the list file, and to continue forward if, say, 3 tries reproduce the same residue at a point where the list file has a different value. Options could be to halt on difference, or to generate such a list file, taking care not to clobber an existing one, perhaps by naming output list files <exponent>-<testtype>-compare-n.txt, where n starts at 1 and increments until no such filename exists in the folder. These would sort by exponent, then by type (LL4 or PRP3), and finally by test sequence number. Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-04-21 at 14:40 Reason: added early paragraph re error rate vs exponent or runtime, updated res64 list for example
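A sketch of what a checker for the proposed file format might look like. Everything here (filename convention, header layout, retry policy) is the post's proposal, not an existing feature of any GIMPS program:

```python
from io import StringIO

# Hypothetical <exponent>-compare.txt: a header record, then
# "iteration res64" pairs, as proposed in the post above.
compare_file = StringIO("""332194529 LL seed 4
1000000 6f3e8db940a2da46
5000000 af733e924b0ff14d
10000000 3445b03b12e7e63c
""")

header = compare_file.readline().split()
# consistency check: right exponent and test type for the current run
assert header[0] == "332194529" and header[1] == "LL"

known = {int(it): res for it, res in
         (line.split() for line in compare_file if line.strip())}

def check(iteration, res64):
    """True if the residue matches or no reference exists for this
    iteration; False would trigger a limited retry from the last match."""
    ref = known.get(iteration)
    return ref is None or ref == res64.lower()

print(check(1000000, "6F3E8DB940A2DA46"))   # matches the stored residue
print(check(5000000, "deadbeef00000000"))   # diverged: retry, then decide
```

Per the post, a miss would not abort the run outright: the default is to retry a few times from the last matching point, and continue forward if the same residue reproduces, since the reference list itself might be the bad run.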
2021-04-05, 23:13  #21
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,101 Posts 
Why don't we merge the leading cpu and gpu applications?
Because our best volunteer programmers:
- have better uses for their time;
- like things as they are now;
- aren't that ambitious, crazy, or masochistic.

As things stand now, there are several applications, with several separate lead authors. Examining source code, documentation, program output, and data files, we can see that the various applications use different variable names for the same things, the same names for different things, different ways of organizing the storage for the same computation, different ini files and definitions, etc. New developments can be prototyped in one application without impacting another. Development in cpu and gpu code can proceed independently and in parallel.

Mprime, prime95, and Mlucas are currently cpu oriented; the gpu programs are CUDA or OpenCL oriented. Getting a development environment to do well, not only in mutual compatibility but also in performance, with multiple cpu instruction sets plus CUDA and OpenCL is a tall order; with sufficient support for the several operating systems and versions, taller yet. To have mprime extend its PrimeNet API support to handle gpu-type assignments, progress reports, and result reporting, and to modify it from supporting a single cpu type at a time to sensible dispatching of work to disparate computing resources, and to perform orderly shutdown (of the whole program, just one computing resource, what?) is a whole other design and programming task.

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-04-30 at 01:09
2021-04-29, 23:00  #22
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1001111101101_{2} Posts 
Why don't we use FPGAs along with or instead of cpus or gpus?
Existing GIMPS software does not support FPGAs. It was developed for mass-market cpus and gpus, via FTE-decades of effort by highly skilled volunteer software developers.
Even though there are existing OpenCL based applications for gpus (mfakto and gpuowl), and some FPGA programming software accepts OpenCL input, in practice the astronomy community has found that to get good performance, a complete rewrite is necessary when going from gpus to FPGA implementations. Performance for an Arria 10 was indicated as ~1 TFLOPS (word size unspecified), which is well below the DP performance of a Radeon VII and some other gpus. (A talk on FPGAs and gpus from 2019, 39:13 run time: https://www.youtube.com/watch?v=MO2Hxxxy6l4)

FPGA on-chip memory access can be very fast (~8 TB/sec, much faster than the Radeon VII's 1 TB/sec bandwidth to HBM2 memory). But even very high end FPGAs (~$20k each) don't have enough on-chip memory for PRP, LL, or P-1. (PRP or LL of ~103M takes ~400MB in gpuowl; P-1 stage 2 benefits from multiple GB.) Basic TF needs very little memory and is less complex to program than the fft-multiply-based PRP, LL, or P-1. Its performance can however benefit from large sieve sizes, ~2 Gbits or more on fast gpus, and may on FPGAs also; the top end of on-chip memory on the Stratix 10 is only 308 Mbits. Possibly one could split the task, doing the sieving on cpu as in olden mfaktx days, or accept somewhat less than optimal performance to fit within available memory resources; on current high end gpus the performance gain from 128 Mbits to 2 Gbits sieve size is no more than 10%. Task splitting has also been proposed in earlier threads and other contexts, such as wide fast multipliers or core routines for 1K-point FFTs for LL, PRP, or P-1.

FPGAs clock at a fraction of the frequency of a contemporary cpu or gpu, and need to make that frequency disadvantage up in high efficiency and parallelism. The achievable clock rate is only determined after an FPGA design/compile is complete, and the design is specific to both the FPGA chip model and the algorithm, since the available resources vary between FPGA models.
(Imagine if a gpuowl equivalent needed to be rewritten and recompiled for each FPGA model, plus recompiled for each host OS!) There have been many suggestions/proposals to use FPGAs in GIMPS over the years, some seemingly coming from FPGA salespeople or engineers experienced in FPGA-based design. But to my knowledge, there have been no working designs or compatible software created, announced, demonstrated, or shared; no one has yet tackled and completed the large up-front development effort, or accepted the lengthy FPGA compile time (~half a day per try). See https://mersenneforum.org/showthread.php?t=2674 for a recent iteration of the question, and Uncwilly's list there of some previous threads on the matter, from 2016-2018, 2005, and 2019. There are more, including:
https://mersenneforum.org/showthread.php?t=23176
https://mersenneforum.org/showthread.php?t=21749
https://mersenneforum.org/showthread.php?t=18178
https://mersenneforum.org/showthread.php?t=16141
https://mersenneforum.org/showthread.php?t=15599
https://mersenneforum.org/showthread.php?t=15325
https://mersenneforum.org/showthread.php?t=14525
https://mersenneforum.org/showthread.php?t=8601

As I understand it, after reading on the topic (including the above threads and more) and having worked with other engineers who used FPGAs, the most likely implementation would be an already commercially available drop-in PCIe card carrying a particular model of FPGA. That would allow such an endeavor to avoid expensive, slow hardware design and prototyping. There's also the possibility of using existing FPGA hardware offered via cloud computing. That offers 4-channel DDR4 ram, available up to 64GB, which would be adequate in size for FFTs at the GIMPS wavefront, but considerably outperformed by memory-bandwidth-constrained gpus running gpuowl on HBM2 ram, such as the Radeon VII.
Mprime performance on cpus is limited by memory speed constraints, to the extent that Woltman has stated he uses longer, ostensibly slower instruction sequences that place lesser demands on memory bandwidth, to improve actual performance.

Even using an existing PCIe or USB interfaced FPGA card design, it would still be necessary to create both a logic design to field-program the FPGA to do something useful to GIMPS with acceptable efficiency and speed, and a program for the general purpose cpu based host to get data to and from the FPGA card. I think that amounts to splitting the function of something like mfakto or gpuowl into FPGA-programming and host-programming portions, with drastic rewrite required, plus added code for the two to talk to each other (coordination of coprocessing). A talented, experienced FPGA & software engineer may be able to do it in a way that reduces the effort of switching from one FPGA model to another.

That none of them, over decades of occasional discussion in the GIMPS forum, have found it worthwhile to undertake and complete that set of tasks says something about the size of the work required and the (effort+cost)/reward ratio. This implies there may be something intrinsic and fundamental that persists across generations of cpu, gpu, and FPGA designs. What are the possible reasons for it apparently not being worthwhile for the experienced designers? Some candidates:
(I don't run any software on my newest 2 TVs, or my NAS, router, etc.) Tom's Hardware gives a figure of about 79.4 million PC shipments for Q4 2020, and 274 million annually. https://www.tomshardware.com/news/gp...rtq42020jpr https://www.grandviewresearch.com/in...ocessormarket gives $85 billion annually. A single fab plant costs ~$10 billion; Intel is building two more in Arizona. https://www.extremetech.com/computin...aulindecades

GPUs? Shipments of gpus for PC use in 2020 were ~41.5 million units (https://www.tomshardware.com/news/gp...rtq42020jpr again), and the gpu shortage continues. Not all of these will be accessible for GIMPS application use either; some are in government or medical use, etc.

FPGAs? The FPGA market is estimated at $9.8 billion (2020), apparently smaller than the cpu market. Not all of these will be accessible for potential GIMPS application use either; most are in embedded systems. https://cacm.acm.org/magazines/2020/...fpgas/fulltext Some designs use multiple clocks. Most designs have bugs escape into production. https://semiengineering.com/verifica...icationstudy/ See also "FPGA Hell" in the debugging section of https://en.wikipedia.org/wiki/FPGA_prototyping

Number of devices in high end units comparison: recent gpu designs contain tens of billions of transistors, 50 billion in the MI100 (2020): https://blog.jetbrains.com/datalore/...pecifications/ One year of Moore's law would move the device count up to ~71 billion in 2021. Similarly, recent cpu designs contain tens of billions of transistors, 19 billion in AMD Epyc (2018); allowing for 3 years of Moore's law from there, 19 x 2.83 ~ 54 billion in 2021; https://www.karlrupp.net/2018/02/42...ortrenddata/ For FPGAs: 9,000 gates in 1987, 50 million gates in 2013, so 12.44 doublings in 26 years, pretty close to a Moore's-law doubling every 2 years.
https://en.wikipedia.org/wiki/Field...te_array#Gates Extrapolating to 2021 would be ~13 billion gates; at an estimated 4 transistors average per gate, that's around 50 billion transistors. Device counts appear to be pretty comparable across the high end among gpu, cpu, and FPGA.

Here's a whole book on FPGAs, from the perspective of learning to use one little board, plus links to some of the author's prior similar work: https://xess.com/static/media/appnot...owWhatBook.pdf

Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-05-02 at 21:30