mersenneforum.org  

2019-05-11, 18:16   #12
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·5·97 Posts
Why don't we save interim residues on the primenet server?

This is often asked by people who want to continue a run that someone else abandoned before completion. It's not unusual for someone to quit participating when their assigned exponent(s) are anywhere from 2% to 98% complete in a primality test.

Saving full-length residues to the primenet server at some interval, perhaps every 20 million iterations, is sometimes proposed as a means of minimizing the throughput lost to abandoned, uncompleted tests. Doing this for the combined output of GIMPS would place a considerable load on the server's resources and would require considerable additional expenditure to support, which is not in the Mersenne Research Inc. budget. For users with slow internet connections, the individual load could also be a considerable fraction of available bandwidth, and transfer times could stall the application and reduce total throughput. https://www.mersenneforum.org/showpo...&postcount=118
Detailed analysis and discussion at https://www.mersenneforum.org/showpo...&postcount=124
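
For a sense of scale, a full-length residue modulo 2^p - 1 occupies p bits. The following is a minimal sketch of the per-upload arithmetic. The round 100M exponent, the 1 Mbit/s uplink, and the function names are illustrative assumptions, not figures from the linked posts.

```python
import math

def residue_bytes(exponent: int) -> int:
    """A full residue mod 2**exponent - 1 needs `exponent` bits."""
    return math.ceil(exponent / 8)

def interim_uploads(exponent: int, interval: int = 20_000_000) -> int:
    """Interim saves that occur before the final iteration of the test."""
    return (exponent - 1) // interval

def upload_seconds(exponent: int, uplink_bits_per_s: float) -> float:
    """Transfer time for one full residue over the given uplink."""
    return residue_bytes(exponent) * 8 / uplink_bits_per_s

p = 100_000_000  # round stand-in for a ~100M exponent (assumption)
print(residue_bytes(p) / 1e6, "MB per residue")
print(interim_uploads(p), "interim uploads at a 20M-iteration interval")
print(upload_seconds(p, 1e6), "s per upload at an assumed 1 Mbit/s uplink")
```

The per-user cost is modest under these assumptions; the server-side load depends on how many such tests are in flight, which this sketch does not attempt to estimate.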

However, it is feasible to save smaller interim residues, such as 64-bit or 2048-bit, and this is currently being done. Recent versions of prime95 automatically save 64-bit residues at 500,000 iterations and at every multiple of 5,000,000. The 2048-bit residues are generated at the end of PRP tests, possibly only type 1 and type 5 PRP tests, per posts 606-609 of https://www.mersenneforum.org/showth...048#post494079
The stored interim 64-bit residues from different runs can be compared to see if runs are matching along the way or when one or another diverges.
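
Comparing two runs' stored residues can be sketched as below. This is only an illustration: the dictionaries map iteration number to a hex res64 string, and the residue values shown are made up for the example, not real GIMPS data.

```python
def first_divergence(run_a, run_b):
    """Return the lowest shared iteration whose res64 values differ, else None."""
    for iteration in sorted(run_a.keys() & run_b.keys()):
        if run_a[iteration].lower() != run_b[iteration].lower():
            return iteration
    return None

# Hypothetical residues at the iterations prime95 is described as saving.
run_a = {500_000: "1111aaaa2222bbbb", 5_000_000: "3333cccc4444dddd", 10_000_000: "5555eeee6666ffff"}
run_b = {500_000: "1111aaaa2222bbbb", 5_000_000: "3333cccc4444dddd", 10_000_000: "0123456789abcdef"}

print(first_divergence(run_a, run_b))
```

A None result means the runs agree at every iteration they have in common; it does not say anything about iterations that only one run recorded.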


Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-02-20 at 20:35

2019-05-19, 15:58   #13
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×3×5×97 Posts
Why don't we skip double checking of PRP tests protected by the very reliable Gerbicz check?

George Woltman gave a few reasons at https://www.mersenneforum.org/showpo...68&postcount=3.
An example of a bad PRP result is listed at https://www.mersenne.org/report_expo...9078529&full=1, which George has identified as an example of a software bug affecting a single bit outside the block of computations protected by the Gerbicz error check.

However, the development of a method of generating a proof of correct completion of a PRP test, that can be independently verified, will replace PRP double checking, at a nearly 100% savings in checking effort. https://www.mersenneforum.org/showth...ewpost&t=25638
This has been implemented in Gpuowl, mprime/prime95, and on the PrimeNet server.* It is planned to be added to Mlucas also.

* Automatic server postprocessing is currently restricted to exponents below ~596M: the current Primenet server's available instruction set limits it to the SSE2 fft lengths when preparing proof verification work for issuance to ordinary users via PrimeNet for certification.



Last fiddled with by kriesel on 2021-07-21 at 22:43 Reason: updated statement of PRP proof/cert implementation status

2019-05-19, 17:11   #14
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×3×5×97 Posts
Why don't we self test the applications, immediately before starting each primality test?

Why don't we self test the applications, immediately before starting each primality test? For the same fft length about to be used for a primality test such as a current wavefront test, and any 100Mdigit exponent or higher? Perhaps also upon resumption of an exponent?

(part of this was first posted as https://www.mersenneforum.org/showpo...0&postcount=10)
Quote:
I think it would be a plus if future releases of primality testing software performed a brief self test before beginning each primality test, and if found unreliable, AT THAT TIME, refused to proceed with a primality test, instead providing the user with recommendations for improving reliability. Perhaps a fast small block of PRP/Gerbicz check, even if what's being run is LL; on the same exponent/fft length, to test more closely what's about to be run.
Hardware reliability changes with time and temperature and other factors. A self test of the same fft size checks that fft transforms and multiplications can be reliably done. If the self test was a couple of blocks of PRP/GC, it could be a useful small increment of a cat 4 PRP double check.

Users might find the checks annoying or regard them as lost throughput. Running LL on 100Mdigit exponents would be disincentivized, since it would involve working also on a 100Mdigit PRP DC so that there is an fft length match. One might as well run PRP for 100Mdigit exponents, and avoid the side self test or commitment to doing a 100Mdigit DC. Increasing adoption of PRP and reducing LL for 100Mdigit exponents is a good thing.
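
The "fast small block of PRP/Gerbicz check" idea can be sketched as follows. This is a toy illustration on a small Mersenne modulus, assuming base 3 and plain repeated squaring; the function name, block length, and the simulated error are hypothetical, and no existing GIMPS program is claimed to work this way internally.

```python
def prp_self_test(p, block_len=100, blocks=3, corrupt_at=None):
    """Run blocks*block_len squarings mod 2**p - 1 with a Gerbicz-style check.

    d holds the running product of x at block boundaries. After each block,
    the new product must equal x0 * d**(2**block_len) mod n.
    Returns True if every block passes the check.
    """
    n = (1 << p) - 1
    x0 = 3
    x = x0
    d = x0
    iteration = 0
    for _ in range(blocks):
        for _ in range(block_len):
            x = x * x % n
            iteration += 1
            if iteration == corrupt_at:
                x = (x + 1) % n  # simulated hardware error
        d_next = d * x % n
        expected = d
        for _ in range(block_len):
            expected = expected * expected % n
        expected = expected * x0 % n
        if d_next != expected:
            return False  # unreliable: refuse to start the real test
        d = d_next
    return True

print(prp_self_test(31))                  # clean run
print(prp_self_test(31, corrupt_at=200))  # error injected at a block boundary
```

Verifying after every block, as done here for brevity, costs as much as the squarings being checked; a practical self test would verify much less often so that the overhead stays small.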

There are some application-specific or interface-specific reasons.
There is no GIMPS PRP code for CUDA or Gerbicz check code for CUDA.
There is no provision for self test of fft lengths larger than 8192K in CUDALucas.



Last fiddled with by kriesel on 2020-02-20 at 19:34

2019-05-20, 02:04   #15
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5820₁₀ Posts
Why don't we occasionally manually submit progress reports for long-duration manual primality tests?

There's currently no way to do that.
This is a CUDALucas console output line:
Code:
|  May 19  20:00:49  |  M49602851  30050000  0x05c21ef8e9eac8b2  |  2688K  0.15625   2.0879  104.39s  |     11:15:47  60.58%  |
https://www.mersenne.org/manual_result/ does not understand it. It responds as follows:
Done processing:
* Parsed 1 lines.
* Found 0 datestamps.

GHz-days Qty Work Submitted Accepted Average 0 - all - 0.000
  • Did not understand 1 lines.
  • Recognized, but ignored 0/0 of the remaining lines.
  • Skipped 0 lines already in the database.
  • Accepted 0 lines.
There's no way to report progress of a GPU-based manual primality test, lengthy P-1 factoring run, or long TF run, so from the PrimeNet server's point of view, progress remains at 0.0%. As a result, such assignments sometimes expire prematurely. It would be useful if the manual results processing script would accept progress reports in CUDALucas console output form as in the example above, even if it were limited to accepting reports with iteration counts that are multiples of 1M or 10M. See also https://www.mersenneforum.org/showthread.php?t=24262
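
As a rough illustration of what accepting such a line would involve, here is a minimal parsing sketch. It is not the server's code; the function name and the returned field names are hypothetical, and only the fields whose meaning is evident from the example line are extracted.

```python
def parse_cudalucas_line(line):
    """Parse one CUDALucas console progress line into a dict, or return None."""
    fields = [part.split() for part in line.strip().strip("|").split("|")]
    if len(fields) != 4 or len(fields[1]) != 3 or not fields[2] or not fields[3]:
        return None
    exponent_token, iteration, res64 = fields[1]
    if not exponent_token.startswith("M"):
        return None
    return {
        "exponent": int(exponent_token[1:]),
        "iteration": int(iteration),
        "res64": int(res64, 16),
        "fft": fields[2][0],
        "percent": float(fields[3][-1].rstrip("%")),
    }

sample = ("|  May 19  20:00:49  |  M49602851  30050000  0x05c21ef8e9eac8b2  |"
          "  2688K  0.15625   2.0879  104.39s  |     11:15:47  60.58%  |")
record = parse_cudalucas_line(sample)
print(record)
```

A receiving script could additionally cross-check that 100 * iteration / exponent agrees with the reported percentage before accepting the report.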

Accepting Gpuowl progress records would also be very useful. Since there has been considerable variation in log record format versus Gpuowl version, selecting a very small number of the most popular reliable efficient versions may be in order. I suggest V6.11-380 and V7.2-53. PRP seems most important to support, since first time PRPs can be time consuming, and PRPDC currently gets subjected to Cat0 short expiration periods without any progress reporting occurring.

Gpuowl v6.11-380 log records (PRP followed by LL):

Code:
2021-05-24 19:46:20 asr2/radeonvii0 59234033 OK 38500000  64.99%;  599 us/it; ETA 0d 03:27; ec9cffdc371be8e4 (check 1.34s)
2020-06-03 22:21:09 asr2/radeonvii0 91844033 LL   400000   0.44%;  697 us/it; ETA 0d 17:43; de6590c819df3895
Gpuowl V7.2-53 log records follow. The style of most use is the last one below (the OK line with check and save times):
Code:
2021-01-02 18:40:07 asr2/radeonvii0 510004423 OK   3000000   0.59% 20d8f6ecf29974cf 7658 us/it + check 3.50s + save 6.33s; ETA 44d 22:33 | P1(3M) 69.3% ETA 02:50 b7f0461995a6dae4
2021-01-02 18:42:44 asr2/radeonvii0 510004423      3010000   0.59% d1ee8d164b98cfc2 15756 us/it
2021-01-02 18:45:22 asr2/radeonvii0 510004423      3020000   0.59% 41dd444c0d2bbf1e 15793 us/it
2021-01-02 18:46:34 asr2/radeonvii0 510004423 P1 Jacobi OK @ 3000000 b7f0461995a6dae4
...
2021-01-02 21:51:31 asr2/radeonvii0 510004423 P2(3M,130M)   0.8%  7530 muls, 7533 us/mul, ETA 13:32
...
2021-01-04 11:04:13 asr2/radeonvii0 510004423 P2(3M,130M) Starting GCD
2021-01-04 11:04:14 asr2/radeonvii0 510004423 P2(3M,130M) Released memory lock 'memlock-0'
...
2021-01-15 03:34:28 asr2/radeonvii0 510004423      4990000   0.98% 56b05ac1fec0d4e7 6555 us/it
2021-01-15 03:35:37 asr2/radeonvii0 510004423 OK   5000000   0.98% 54713855d68e7cdb 6294 us/it + check 3.49s + save 2.09s; ETA 36d 18:59
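
The Gpuowl records above could be reduced to progress reports along these lines. This is a sketch only: the regular expressions are inferred from the sample lines shown here, the names are hypothetical, and other Gpuowl versions may well log differently.

```python
import re

# Timestamp, host/device label, exponent, optional status, iteration, percent.
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) (?P<host>\S+) (?P<exponent>\d+)\s+"
    r"(?:(?P<status>OK|LL)\s+)?(?P<iteration>\d+)\s+(?P<percent>[\d.]+)%;?\s+(?P<rest>.*)$"
)
RES64_RE = re.compile(r"\b[0-9a-f]{16}\b")

def parse_gpuowl_progress(line):
    """Return a progress dict for PRP/LL progress lines, or None for anything else."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    res = RES64_RE.search(m["rest"])  # first 16-hex-digit token is the test's res64
    if not res:
        return None
    return {
        "exponent": int(m["exponent"]),
        "status": m["status"],
        "iteration": int(m["iteration"]),
        "percent": float(m["percent"]),
        "res64": res.group(0),
    }

v611 = "2021-05-24 19:46:20 asr2/radeonvii0 59234033 OK 38500000  64.99%;  599 us/it; ETA 0d 03:27; ec9cffdc371be8e4 (check 1.34s)"
v72 = "2021-01-02 18:42:44 asr2/radeonvii0 510004423      3010000   0.59% d1ee8d164b98cfc2 15756 us/it"
p1 = "2021-01-02 18:46:34 asr2/radeonvii0 510004423 P1 Jacobi OK @ 3000000 b7f0461995a6dae4"
for text in (v611, v72, p1):
    print(parse_gpuowl_progress(text))
```

P-1 lines such as the Jacobi check and P2 progress records are deliberately rejected here, since they describe factoring rather than primality test progress.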

Last fiddled with by kriesel on 2021-05-27 at 17:59 Reason: added gpuowl versions, examples

2019-10-02, 05:09   #16
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
10011001001001₂ Posts
Why don't we extend B1 or B2 of an existing no-factor P-1 run?

Quote:
Originally Posted by kriesel View Post
Neither GpuOwl nor CUDAPm1 have yet implemented B1 extension from an existing save file. Consequently a run to a higher B1 for the same exponent currently requires starting over, repeating a lot of computation.

Neither GpuOwl nor CUDAPm1 have yet implemented B2 extension from an existing savefile. Consequently a run to a higher B2 for the same exponent currently requires starting over, repeating a lot of computation.
B2 extension is trivial. Moreover, you only need to save the residue after stage 1 finished, and the last B2 value (or range). That is because every stage 2 "chunk" (or cluster) does not use the results from the former "chunks", but it only uses the results at the end of stage 1, and the current stage 2 "chunk", and the "chunks" increase until B2 is reached. So, if you have a save file at the end of stage 1 (when B1 was reached), you technically could do independently "stage 2 from B2_start_x to B2_end_x", in x computers in parallel.

Extending B1 is a bit trickier, because you need to recompute the additional small primes that fit into the new B1, and do the exponentiation required to include them into the new product (b^E). There is a piece of pari/gp Pm1 code I posted some time ago which does B1 extension, but that is slow because first of all, it is pari, and second, it only uses "chunks" of 2 primes (i.e. no stage 2 extensions), but it can save intermediary files and extend B1 too.

Also, once you extend B1, you must do stage 2 "from scratch"; whatever stage 2 you did before, for the same B2 (or more, or less), is void.
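
The reasoning above can be illustrated with a toy P-1 in Python. This is a sketch under simplifying assumptions, not how any GIMPS program implements it: the modulus is a small made-up composite (607 × 1019, chosen so that 607 - 1 = 2·3·101), stage 2 is the naive one-prime-at-a-time form, and all function names are hypothetical.

```python
from math import gcd

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def max_power(p, bound):
    """Largest k with p**k <= bound (0 if p > bound)."""
    k = 0
    while p ** (k + 1) <= bound:
        k += 1
    return k

def stage1(n, b1, base=3):
    """Stage 1 residue base**E mod n, E = product of prime powers <= b1."""
    r = base
    for p in primes_up_to(b1):
        r = pow(r, p ** max_power(p, b1), n)
    return r

def extend_b1(r, n, old_b1, new_b1):
    """Raise a saved stage 1 residue by only the prime powers new to new_b1."""
    for p in primes_up_to(new_b1):
        extra = max_power(p, new_b1) - max_power(p, old_b1)
        r = pow(r, p ** extra, n)
    return r

def stage2_chunk(r, n, lo, hi):
    """Naive stage 2 over primes in (lo, hi]; needs only the stage 1 residue."""
    acc = 1
    for q in primes_up_to(hi):
        if q > lo:
            acc = acc * (pow(r, q, n) - 1) % n
    return gcd(acc, n)

n = 607 * 1019
r10 = stage1(n, 10)
print(gcd(r10 - 1, n))                 # stage 1 alone, B1=10
print(stage2_chunk(r10, n, 10, 60))    # independent stage 2 chunk
print(stage2_chunk(r10, n, 60, 120))   # another chunk, could run elsewhere
print(extend_b1(r10, n, 10, 110) == stage1(n, 110))
```

Each stage 2 chunk uses only the saved stage 1 residue and its own range, which is why the chunks could run on separate machines. The B1 extension reproduces the from-scratch stage 1 residue, after which any earlier stage 2 work no longer applies because it was built on the old residue.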

(Kriesel:) Mostly though, we don't do P-1 bounds extensions because:
  • The code to do so does not exist in our available GIMPS production software.
  • The work type is not defined in the PrimeNet API
  • P-1 extension assignments don't exist on the server web interface for manual work assignments.
  • P-1 is a lesser development priority right now.
  • P-1 is a smaller fraction of the work on an exponent than primality testing.
  • CUDAPm1 is still labeled alpha software and does not implement bounds extension
  • gpuowl P-1 is relatively new and does not implement bounds extension
  • Often the user wanting to increase bounds is not the user who did some previous bounds, and does not have access to the files from previous runs.
  • Some software may not even save those files after completing a run.
  • There's no need to extend if the bounds were both adequate on an earlier run.
  • It's more efficient to do P-1 once, with adequate bounds the first time.


Last fiddled with by kriesel on 2021-03-02 at 19:53 Reason: Add title, list of reasons for status quo

2020-06-19, 19:41   #17
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·5·97 Posts
Why don't we do proofs and certificates instead of double checks and triple and higher?

Update:
We can and do. Everyone who can should stop performing LL first tests and upgrade as soon as possible to PRP with GEC and proof generation for first primality tests, using prime95/mprime v30.3 or later, or gpuowl ~v6.11-316 or later. Mlucas v20 is coming at some point; meanwhile use v19.1 for PRP/GEC without proof generation. PRP with proof generation is also recommended for all double check work. Due to its extremely low error rate, it is far more reliable as a double check than LL DC, and slightly more efficient on average. The computing time (effort) to generate the proof file and to perform the CERT work is a fraction of that required to make up for the 1-2% LL DC expected error rate. The time between first test completion and verification is vastly reduced, from often 8 years or more for LL DC to generally hours or days for PRP proof and CERT. See also https://www.mersenneforum.org/showpo...45&postcount=4


Original post:
Because we didn't know it was possible to do proofs of PRP tests for these huge Mersenne numbers at considerably less effort than a repeat PRP test or repeat LL test until recently. The development of new code to do proofs and verifications, followed by widespread deployment of client applications to do proofs, and server infrastructure to accept proofs and perform verifications, will take around a year or more to complete.
Gpuowl is closest to being ready to provide proofs.
Prime95 and Mlucas haven't begun to get this added yet as of mid June 2020.
There's also separate verifier code to write.
Server modification for storing new data types.
Manual result handling modification.
Extension of the Primenet API to accommodate it for prime95.

Some threads regarding this recent development are

Announcement The Next Big Development for GIMPS
(Layperson's and informal discussion here)

Technical VDF (Verifiable Delay Function) and PRP
(Leave this one for the number theorists and crack programmers)

Technical background: Efficient Proth/PRP Test Proof Scheme
(Also a math/number-theory thread, let's leave this one for theorists too)

This is an exciting development. It offers elimination of almost all confirmation effort on future PRP tests, so will substantially increase testing throughput (eventually). It is a high priority for design and implementation right now. Other possible gpuowl enhancements are likely to wait until this is at least ready for some final testing.



Last fiddled with by kriesel on 2021-04-16 at 20:35 Reason: Mlucas v20; prefer PRP&proof over LLDC

2020-06-27, 14:49   #18
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5820₁₀ Posts
Why don't we run GPU P-1 factoring's GCDs on the GPUs?

The software doesn't exist.

Currently CUDAPm1 stalls the GPU it runs on, for the duration of a stage 1 or stage 2 GCD that runs on one core of the system CPU.
Earlier versions of Gpuowl that performed P-1 also stalled the GPU while running the GCD of a P-1 stage on a CPU core. At some point, Mihai reprogrammed it so a separate thread ran the GCD on one core of the CPU, while the GPU went ahead and speculatively began the second stage of the P-1 factoring in parallel with the stage 1 GCD, or the next worktodo assignment in parallel with the stage 2 GCD when one is available.
In all cases, these GCDs are performed by the GMP library.
(About 98% of the time, a P-1 factoring stage won't find a factor, so continuing is a good bet, and preferable to leaving the GPU idle during the GCD computation.)
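
The control flow described above, with the GCD handed to a CPU thread while the main work continues speculatively, can be sketched as follows. This only illustrates the idea and is not Gpuowl's code: the function name and the stand-in for stage 2 are hypothetical, and Python's math.gcd stands in for the GMP library.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd

def p1_with_background_gcd(n, stage1_residue, run_stage2):
    """Start the stage 1 GCD on a worker thread and run stage 2 meanwhile."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        gcd_future = pool.submit(gcd, stage1_residue - 1, n)  # "CPU core"
        stage2_result = run_stage2(stage1_residue)            # "GPU" keeps working
        g = gcd_future.result()
    if 1 < g < n:
        return g, None  # factor found: the speculative stage 2 work is discarded
    return None, stage2_result

n = 607 * 1019  # small made-up composite for illustration
print(p1_with_background_gcd(n, 608, lambda r: "stage 2 done"))
print(p1_with_background_gcd(n, 3, lambda r: "stage 2 done"))
```

Since the GCD usually finds nothing, the speculative work is usually kept, which is the bet described in the parenthetical above.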

It was a more efficient use of programmer time to implement it that way quickly, using an existing library routine.

On a fast CPU the impact is small. On slow CPUs hosting fast GPUs it is not.

Borrowing a CPU core for the GCD has the undesirable effect of stopping a worker in mprime or prime95 for the duration, and may also slow Mlucas, unless hyperthreading is available and effective.

To my knowledge no one has yet written a GPU-based GCD routine for GIMPS size inputs.
For GPU use for GCD in other contexts see http://www.cs.hiroshima-u.ac.jp/cs/_...apdcm15gcd.pdf (RSA) and https://domino.mpi-inf.mpg.de/intran...FILE/paper.pdf (polynomials).
If one was written for the large inputs for current and future GIMPS work, a new GCD routine for the GPU could be difficult to share between CUDAPm1 and gpuowl, since gpuowl is OpenCL based but CUDAPm1 is CUDA based, and the available data structures probably differ significantly.



Last fiddled with by kriesel on 2021-07-21 at 22:52 Reason: edit for style

2020-12-16, 17:25   #19
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×3×5×97 Posts
Why don't we use 2 instead of 3 as the base for PRP or P-1 computations?

Mersenne numbers are base-2 pseudoprimes. All would be indicated as prime in P-1 factoring or Fermat PRP tests, whether actually prime or composite. Using 3 as the base provides useful information, and costs no more computing time; using 2 as the base provides no useful information. That's a summary of my understanding of this thread as it relates to base choice.
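
The base-2 case can be checked directly. For prime p, p divides 2^p - 2 by Fermat's little theorem, and 2^p ≡ 1 mod 2^p - 1, so 2 raised to (2^p - 2) is always 1 mod 2^p - 1. The sketch below uses a plain Fermat test as a simplified stand-in for the PRP types the GIMPS programs actually run; the function name is hypothetical.

```python
def fermat_prp(p, base):
    """Plain Fermat test of M = 2**p - 1: is base**(M-1) congruent to 1 mod M?"""
    m = (1 << p) - 1
    return pow(base, m - 1, m) == 1

# M11 = 2047 = 23 * 89 is composite; the others in this list are Mersenne primes.
for p in (3, 5, 7, 11, 13, 17, 19):
    print(p, "base 2:", fermat_prp(p, 2), "base 3:", fermat_prp(p, 3))
```

Base 2 reports every exponent in the list as a probable prime, including the composite M11, while base 3 separates them.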



Last fiddled with by kriesel on 2020-12-17 at 16:41

2021-03-28, 19:41   #20
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×3×5×97 Posts
Why don't we use interim 64-bit residues as checking input on later runs?

(This idea was first posted as part of https://mersenneforum.org/showpost.p...&postcount=200)


As far as I recall, no current or past GIMPS software offers this feature.

Empirical results show the error rate of an individual LL test is a steep function of exponent; 1-2% for 100M exponent, but 10-20% for 100Mdigit, extrapolating by run time to ~90% error rate for gigabit.

If a list of interim 64-bit residues from a past run of an equivalent algorithm was available, that was not known to be bad, a program modified to read them in and compare them to the 64-bit residues it obtains in its run could use the list as an additional error check input. Reaching a residue for the same iteration that did not match could trigger a limited number of retries from the last matching point.
The list could be a very simply formatted file named <exponent>-compare.txt, or perhaps <exponent>-[PRP3|LL4]-compare.txt, containing ASCII hex 64-bit residues corresponding to iteration numbers that must be multiples of the running application's log interval, beginning with a single header record for a consistency check during the current run. (The following Fan Ming LL seed 4 DC interim residues have been confirmed by my TC in progress.)
Code:
332194529 LL seed 4
1000000 6f3e8db940a2da46
5000000 af733e924b0ff14d
10000000 3445b03b12e7e63c
15000000 7f89336ab4c99ccf
20000000 7cc12f9bc5569c53
50000000 6d8283811fd2386c
100000000 e1f61c2c2ad97069
150000000 8179da4ed33f0ab5
200000000 d719e0881b8d4a02
Something similar might also be useful at times during PRP DC, with a header record something like
Code:
332194529 PRP type 1 base 3
Building something like that into CUDALucas, Gpuowl, Mlucas, and mprime / prime95 could be useful during future "confirm this prime discovery" exercises. It would automate, for the team of confirmation testers, tracking whether and when a retest departs from a given test's 64-bit residue sequence. That would be useful during confirmation of a first LL test, but not quite as useful when running LL after a PRP first test, since their residues are incompatible. In that case the fastest / earliest LL confirmation tester could share their growing list periodically with the other LL testers for comparison. Or joint testers could take turns sharing interim residues for confirmation by the others.

It could also be handy for strategic double and triple checking.
Ordinary double checking is much less likely to have access to a residue sample list.

Default would need to be to proceed without the existence of the list file, and to continue forward if say 3 tries reproduce the same residue at the point where the list file has a different value. Options could be to halt on difference, or to generate such a list file, taking care to not clobber an existing list file. Perhaps by naming output list files <exponent>-<testtype>-compare-n.txt, where n starts at 1 and increments until no such filename exists in the folder. These would sort by exponent and then by type, LL4 or PRP3, and finally by test sequence number.
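
A minimal sketch of reading and using such a list file follows, assuming the format of the example above (one header record, then "iteration res64" pairs). The function names and return values are hypothetical; no current GIMPS program provides them.

```python
def parse_compare_list(text):
    """Return (exponent, header, {iteration: res64 as int}) from list-file text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    exponent = int(header.split()[0])
    residues = {}
    for ln in lines[1:]:
        iteration, res64 = ln.split()
        residues[int(iteration)] = int(res64, 16)
    return exponent, header, residues

def check_residue(residues, iteration, res64):
    """Compare a freshly computed res64 with the list, if the list covers it."""
    expected = residues.get(iteration)
    if expected is None:
        return "no reference"
    return "match" if expected == res64 else "mismatch"

def next_list_name(exponent, testtype, existing):
    """First <exponent>-<testtype>-compare-n.txt not already in `existing`."""
    n = 1
    while f"{exponent}-{testtype}-compare-{n}.txt" in existing:
        n += 1
    return f"{exponent}-{testtype}-compare-{n}.txt"

sample = """332194529 LL seed 4
1000000 6f3e8db940a2da46
5000000 af733e924b0ff14d
10000000 3445b03b12e7e63c
"""
exponent, header, residues = parse_compare_list(sample)
print(check_residue(residues, 5000000, 0xAF733E924B0FF14D))
print(next_list_name(exponent, "LL4", {"332194529-LL4-compare-1.txt"}))
```

On a "mismatch" the caller would apply the retry policy described above, for example rolling back to the last matching iteration a limited number of times before continuing or halting.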



Last fiddled with by kriesel on 2021-05-27 at 17:49 Reason: updated res64 list for example

2021-04-05, 23:13   #21
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·5·97 Posts
Why don't we merge the leading CPU and GPU applications?

Because our best volunteer programmers:
have better uses for their time;
like things as they are now;
aren't that ambitious, crazy or masochistic.

As things stand now, there are several applications, with several separate lead authors.
Examining source code, documentation, program output, and data files, we can see that the various applications use different variable names for the same things, the same names for different things, different ways of organizing the storage for the same computation, and different ini files and definitions, etc.
New developments can be prototyped in one application without impacting another.
Development in CPU and GPU code can proceed independently and in parallel.

Mprime, prime95, Mlucas are currently CPU oriented. GPU programs are CUDA or OpenCL oriented. Getting a development environment to do well not only in mutual compatibility but also performance with multiple CPU instruction sets and CUDA and OpenCL is a tall order. With sufficient support for the several operating systems and versions, taller yet.

To have mprime extend its PrimeNet API support to handle GPU type assignments, progress reports, and result reporting, and to modify it from supporting a single CPU type at a time to do sensible dispatching of work to disparate computing resources, and perform orderly shutdown (of the whole program, just one computing resource, what?) is a whole other design and programming task.

Some users seem to already find mprime / prime95 confusingly complex and rich with user-controllable options. Multiplying that by 2 or 3 or more by adding CUDA and OpenCL capability with perhaps GPU-model-specific limitations would increase the needed human interface requirements and complexity.
(Let's see, I can do TF on CPU but shouldn't; I can do TF on some GPUs and IGPs but not other IGPs; I can do PRP on CPU or on modern GPUs via OpenCL but not those limited to OpenCL1.2; LL DC on CUDA or OpenCL, but should avoid OpenCL1.2 because that's half-speed compared to OpenCL2.0; etc. Why aren't my CPU-based PRP as fast as my GPU PRP? etc etc.)



Last fiddled with by kriesel on 2021-09-26 at 13:31 Reason: added complexity / support paragraph

2021-04-29, 23:00   #22
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·3·5·97 Posts
Why don't we use FPGAs along with or instead of CPUs or GPUs?

Existing GIMPS software does not support FPGAs. The existing GIMPS software was developed for mass market CPUs or GPUs, via FTE-decades of effort of highly skilled software developer volunteers.
Even though there are existing OpenCL based applications for GPUs (Mfakto and GpuOwl), and some FPGA programming software supports OpenCL programming input, in practice the astronomy community has found that to get good performance, a complete rewrite is necessary when going from GPUs to FPGA implementations. Performance for an Arria 10 was indicated as 1TFLOPs (word size unspecified), which is well below the DP performance of a Radeon VII and some other GPUs. (A talk on FPGAs and GPUs from 2019, 39:13 run time https://www.youtube.com/watch?v=MO2Hxxxy6l4)

FPGA on-chip memory access can be very fast (~8TB/sec, much faster than the Radeon VII's 1TB/sec bandwidth to HBM2 memory). Even very high end FPGAs (~$20k each) don't have enough on-chip memory for PRP, LL, or P-1. (PRP or LL of ~103M takes ~400MB in gpuowl; P-1 stage 2 benefits from multiple GB.)

Basic TF needs very little memory and is less complex to program than the fft-multiply-based PRP, LL, or P-1. Its performance can however benefit from large sieve sizes ~2Gbits or more on fast GPUs, and may on FPGA also. The top end of on-chip memory on the Stratix 10 is only 308 Mbits. Possibly one could split the task, doing the sieving on CPU as in olden mfaktx days. Or accept a little less than optimal performance to fit within available memory resources; on current high end GPUs the performance gain from 128Mbits to 2Gbits sieve size is no more than 10%.
Task splitting has also been proposed in earlier threads and other contexts, such as wide fast multipliers or core routines for 1K-point FFTs for LL, PRP or P-1.
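
To show how little state basic TF needs, here is a bare-bones sketch. It relies on two standard facts: any factor q of 2^p - 1 (p an odd prime) has the form 2kp + 1, and q ≡ ±1 mod 8. It omits the sieving of candidates by small primes discussed above, so it is not how mfaktc or mfakto are organized; the function name is hypothetical.

```python
def trial_factor(p, max_k):
    """Return the smallest factor q = 2*k*p + 1 of 2**p - 1 with k <= max_k, else None."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue  # factors of Mersenne numbers are +-1 mod 8
        if pow(2, p, q) == 1:  # q divides 2**p - 1
            return q
    return None

print(trial_factor(11, 10))
print(trial_factor(29, 100))
print(trial_factor(13, 300))
```

The per-candidate work is one modular exponentiation with a small exponent, which is why the memory footprint is dominated by whatever sieve is used to discard candidates beforehand.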

FPGAs clock at a fraction of the frequency of a contemporary CPU or GPU. They need to make that frequency disadvantage up, in high efficiency and parallelism. The achievable clock rate is only determined after an FPGA design/compile is complete. The design is specific to both the FPGA chip model and the algorithm, since the available resources on the FPGAs vary between models. (Imagine if a Gpuowl equivalent needed to be rewritten and recompiled for each FPGA model! Plus recompiled for each host OS.)

There have been many suggestions/proposals to use FPGAs in GIMPS over the years, with some seemingly coming from FPGA sales people or experienced FPGA-based-design engineers.

But to my knowledge, there have been no working designs or compatible software created, announced, demonstrated, or shared; no one has yet tackled and completed the large up-front development effort or accepted the lengthy FPGA-compile time (~half a day per try).

See https://mersenneforum.org/showthread.php?t=2674 for a recent iteration of the question. And Uncwilly's list of some previous threads on the matter, from 2016-2018, 2005, 2019. There are more, including
https://mersenneforum.org/showthread.php?t=23176
https://mersenneforum.org/showthread.php?t=21749
https://mersenneforum.org/showthread.php?t=18178
https://mersenneforum.org/showthread.php?t=16141
https://mersenneforum.org/showthread.php?t=15599
https://mersenneforum.org/showthread.php?t=15325
https://mersenneforum.org/showthread.php?t=14525
https://mersenneforum.org/showthread.php?t=8601

As I understand it, after reading on the topic including the above threads and more, and having worked with other engineers who used FPGAs, the most likely implementation would be as an already commercially available drop-in PCIe card that supports a particular model of FPGA. That would allow such an endeavor to avoid expensive slow hardware design and prototyping.

There's also the possibility of using existing FPGA hardware offered via cloud computing. That offers 4-channel DDR4 ram available up to 64GB, which would be adequate in size for FFT at GIMPS wavefront, but considerably outperformed by memory-constrained GPUs running gpuowl on HBM2 ram such as the Radeon VII. Mprime performance is limited on CPUs by memory speed constraints, to the extent that Woltman has stated he uses longer ostensibly slower instruction sequences that place lesser demands on memory bandwidth, to improve actual performance.

Using an existing PCIe or USB interfaced FPGA card design it would still be necessary to create both a logic design to field program the FPGA to do something useful to GIMPS with acceptable efficiency and speed, and to create a program for the general purpose CPU based system to get data to and from the FPGA card to do something useful. I think that amounts to splitting function of something like Mfakto or Gpuowl into FPGA-programming and host-programming portions, with drastic rewrite required, and addition of code for the two talking to each other (coordination of coprocessing).

A talented experienced FPGA & software engineer may be able to do it in a way that reduces the effort of switching from one FPGA model to another. That none of them, over decades of occasional discussion in the GIMPS forum, have found it worthwhile to undertake and complete the set of tasks, says something about the size of the work required and (effort+cost)/reward ratio for doing so. This implies there may be something intrinsic, fundamental, that persists across generations of CPU, GPU, and FPGA designs.

What are the possible reasons for it apparently not being worthwhile for the experienced designers? Some candidates:
  1. The barrier to entry is higher for FPGA development, requiring a lot of hardware-oriented design skill and effort for FPGA implementation efficiency, a lot of software-oriented effort for the CPU-side programming, and a good understanding of the algorithms to be programmed;
  2. Lower feasible clock rates of a completed optimized design; typically ~an order of magnitude lower on FPGA than on CPU or GPU cores;
  3. For the same device budget on an integrated circuit, field programmability's device count requirements could constitute a sort of tax reducing computing throughput on FPGA relative to CPU or GPU architectures;
  4. Hardware cost: for the same price as a single high end FPGA, one could buy dozens of Radeon VII GPUs (and not need to do any additional device or application programming)
  5. Time to complete the project: long for FPGA, compared to days to buy CPU or GPU hardware, assemble, configure, and get existing applications running on them
  6. Economies of scale for mass produced CPU and GPU designs
  7. The barrier to entry is much higher for end users; most don't have access to an FPGA device or FPGA development tools, so CPU or GPU oriented programs are used. This prevents the establishment of a GIMPS FPGA user community that would attract FPGA application development.
A rough estimate of how many CPUs are produced annually is ~1 billion units, based on https://www.answers.com/Q/How_many_p...sell_in_a_year. Not all of these would be in end-user-application-usable form; some will be embedded as control units in cars, military or scientific or consumer equipment, etc.
(I don't run any software on my newest 2 TVs or my NAS, router, etc.)
Tom's Hardware gives a figure of about 79.4 million PC shipments per quarter for Q4 2020, and 274. million annually. https://www.tomshardware.com/news/gp...rt-q4-2020-jpr
https://www.grandviewresearch.com/in...ocessor-market gives $85 billion annually. A single fab plant costs ~$10 billion. Intel is building two more in Arizona. https://www.extremetech.com/computin...aul-in-decades

GPUs?
Shipments of GPUs for PC use in 2020 were ~41.5 million units. (https://www.tomshardware.com/news/gp...rt-q4-2020-jpr again) The GPU shortage continues. Not all of these will be accessible for GIMPS application use either; some are in government or medical use, etc.

FPGAs?
The FPGA market is estimated as $9.8 billion (2020), apparently smaller than the CPU market.
Not all of these will be accessible for potential GIMPS application use either; most are in embedded systems. https://cacm.acm.org/magazines/2020/...fpgas/fulltext
Some designs use multiple clocks. Most designs have bugs escape into production. https://semiengineering.com/verifica...ication-study/
See also "FPGA Hell" in the debugging section of https://en.wikipedia.org/wiki/FPGA_prototyping

Number of devices in high end units comparison
Recent GPU designs contain tens of billions of transistors, 50 billion in the MI100 (2020): https://blog.jetbrains.com/datalore/...pecifications/ One year of Moore's law would move the device count up to ~71 billion in 2021.

Similarly, recent CPU designs contain tens of billions of transistors: 19 billion in AMD Epyc (2018); allowing for 3 years of Moore's law from there, 19 × 2.83 ≈ 54 billion in 2021;
https://www.karlrupp.net/2018/02/42-...or-trend-data/
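
The GPU and CPU projections above follow from assuming a doubling every 2 years. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def moore_projection(count, years, doubling_years=2.0):
    """Project a device count forward, assuming it doubles every doubling_years."""
    return count * 2 ** (years / doubling_years)

print(round(moore_projection(50, 1)))  # MI100, 2020 -> 2021, in billions
print(round(moore_projection(19, 3)))  # Epyc, 2018 -> 2021, in billions
```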

For FPGAs: 9000 gates in 1987 and 50 million gates in 2013 is 12.44 doublings in 26 years, pretty close to a Moore's law of doubling in 2 years. https://en.wikipedia.org/wiki/Field-...te_array#Gates Extrapolating at that rate to 2021 gives roughly 0.7 to 0.8 billion gates; at an estimated 4 transistors average per gate, that's around 3 billion transistors. Current pricing at DigiKey for Stratix 10 FPGAs appears to be ~$17,000 - $36,000. Each. https://www.digikey.com/en/products/...YBoxaCzCGExiQA
Arria 10 ranged around $3700 to $10,000. https://www.digikey.com/en/products/...gTFkYHCJYbFJAA

By these rough estimates, device counts appear comparable at the high end between GPU and CPU, while the gate-count extrapolation for FPGA comes out about an order of magnitude lower.

Here's a whole book on FPGAs from the perspective of learning to use one little board with one.
Plus links to some of the author's prior similar work. https://xess.com/static/media/appnot...owWhatBook.pdf



Last fiddled with by kriesel on 2021-07-21 at 23:01 Reason: minor edit

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.