View Single Post
Old 2019-02-14, 03:10   #6
kriesel's Avatar
Mar 2017
US midwest

32·29·31 Posts
Default Concepts in GIMPS trial factoring (TF)

(note, sort of mfaktc oriented, more so toward the end. Where description applies to both mfaktc and mfakto, they are referred to as mfaktx)

Following is my evolving attempt to list and describe the many concepts and optimization considerations relevant to GIMPS trial factoring, in a way that assumes little or no prior knowledge.

See also (perhaps first) for some background that will be useful in reading the following,

1 Trial factoring work is generally assigned by exponent and bit levels;
Mersenne exponent,
bit level already trial factored to or where to start, and
bit level to factor up to.
Nearly all TF is now done on GPUs not CPUs. Assignments and result reporting are typically automatic for mprime/prime95, but manually or through separate programs for gpu applications.
Manual assignment (reservation) links and result submission links are available in the attachment at
See for a list of client management programs that get assignments and/or submit results.

2 Each bit level of TF is about as much computing effort as all bit levels below it, for the same exponent.
Probability of finding a factor by trial factoring between 2x and 2x+1 is approximately 1/x.
So while the probability of finding a factor goes down slowly with x, the effort goes up exponentially with x.
Assignments are made up to the point where it's more cost effective to primality test instead of continuing to trial factor. Finding a factor by trial factoring before any primality tests or P-1 factoring saves 1 PRP test plus proof generation and cert time, plus whatever the cert failure rate is, previously 2 x (1 + primality test error rate) primality tests, plus the P-1 factoring time, which is typically a few percent of a primality test. Error rate is around 2% in LL, so about 2(1+.02)+.03 =~ 2.07 primality tests equivalent, or ~1.03 if using PRP/proof. Except that P-1 effort (bounds selection) is optimized for maximum time savings, so actually <2(1+0.02) or <2.04 primality tests historically; currently 1.3 since some users are still performing LL first tests despite that being discouraged now.

3 TF makes use of the special form of factors of Mersenne numbers, f = 2 k p +1, where p is the prime exponent of the Mersenne number, k is a positive integer. See

4 Use a wheel generator for candidate factors, to efficiently exclude and not even consider candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2 2 3 5 7 = 420, the Less-classes number; 2 2 3 5 7 11 = 4620, the more-classes number.
(There was discussion of going to 13 or higher, but that was considered not worthwhile. 2 2 3 5 7 11 13 = 60060, etc. Higher complexity and overhead, trading off against diminishing returns of incremental number of candidates excluded.) Prime95 uses a simpler 2 2 3 5 = 60 wheel spokes, which reduces to 16 classes (26.67%). For Mfaktx at 420, 96 survive (22.86%) or at 4620, 960 (20.78%). The next would be 11520/60060=19.18%. (A potential reduction of 7.7%) (There are two factors of 2 in the number of classes. I think that's due to including #6 into the wheel generator; it's not just evens that can be dismissed, but 6 out of 8 mod 8 cases, or 3 out of 4.)

5 For a given exponent, exclude entire classes of candidate factors with a single test. Additionally there is subsequent fast sieving in the surviving classes for small prime factors, to eliminate the bits representing the corresponding k values of small primes beyond the wheel generator level.

6 Make use of the special form of factors of Mersenne numbers, that they are 1 or 7 mod 8. (Discard or do not generate candidate factors that are not 1 or 7 mod 8). A short proof for f=+/-1 mod 8 is at There's an excellent explanation of the wheel generator, this point, and an implementation of the combination, contained in source module factor.c.txt included in the Mlucas source distribution, beginning about line 304 in the v18 variety.; in v20.1.1, ~line 324.

7 Use of dense representation of the candidate factors, as a bit map of k values.
(That bit map is exponent-specific, so can not be reused for other exponents.)

8 These candidate factors are sieved, somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded. Trial with prime candidates is sufficient, with composite candidates redundant.

9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort, than a lesser amount of sieving of candidates. There's user control of sieving depth available to allow adjustment to near optimal for the exponents, depths, and other variables, in mfaktc and mfakto.

10 Candidate factors surviving wheel generation and restriction to 1 or 7 mod 8 are sieved somewhat. Implementation in mfaktx is by a maximum sieve prime value, that as I recall is set in mfaktx.ini. Sieving level (at some version of mfaktc) must not exceed the level where possible candidate factors are included in sieving values. This makes sieve limit settings dependent on exponent for low exponent low bit level combinations.
(If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See in #21 below)

11 Surviving candidate factors are tried. (With MOSTLY prime candidates, and the relatively few composite candidates that passed the speed-optimized incomplete sieving.)

12 The trial method is to generate a representation of the Mersenne prime plus 1 (2p), modulo the trial factor, by a succession of squarings and doublings according to the bit pattern in the exponent, modulo the trial factor. If 2p mod f =1, 2p-1 mod f = 0 proving f is a factor. The powering method is much much faster than trial long division for the very sizable numbers of interest near the current GIMPS wavefront, and uses far less memory than a simple approach of representing the full dividend at the beginning, and more so on both counts for larger numbers. (Retina points out though at that there would be no need to create a full representation of p ones at the beginning of long division. Since all the binary digit positions of a mersenne number are ones they could simply be added at the right as needed. Also, we care only about the remainder, not the quotient.) There's a description and brief small-numbers example of the modpow method at which is called left to right binary method in It is not necessarily the minimum-time method, but it's reliably pretty good, and fairly simple. The modulo trial-factor at each step keeps the operands at a manageable size for speed: 3 32-bit words, far smaller than the tens of millions or hundreds of millions of bits that naive trial-division could involve or deferred modulo could generate. Trial division would take (O)p operations, while modpow takes only (O)log2p. For p~90M, p/log2p ~3,400,000. If stepping through long division 32 bits at a time, ~2812500; log22812500 ~21.4; 2812500 / log22812500 ~131282. The efficiency of the modpow method allows trial factoring to many more bits factor depth than even a well implemented long division trial would. See point #2. Therefore the probability of finding a factor is increased significantly with the modpow approach.

13 On significantly parallel computing hardware, such as GPUs, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores. GPUs are so much more relatively effective at TF than other types of GIMPS computation, that the optimal GPU TF bit level is typically a few bits higher than for CPUs, and CPU TF is generally avoided. (See also point 2 again.) Xeon Phi are reportedly quite fast at trial factoring, but not very common, and rather expensive. "Additionally, something like an FPGA would be really fast, but most users don't have one. And ASICs would be even faster, but way too expensive." (kruoli PM) FPGA or ASIC likely would have yet different TF bit level optima. Guides to suggested TF bit levels versus exponent and GPU model are available for each GPU model listed at; click on the GPU model of interest. Note that the curves correspond to optimization for one or two primality tests to be performed on the same GPU model as the TF. If the TF will be performed on a high TF/DP performance ratio GPU but the primality test performed on a lower TF/DP ratio GPU, the optimal TF is likely somewhere between the two levels. For example, for a 100Mdigit mersenne number, TF on RTX2080 indicates 83 bits but Radeon indicates 80 bits; 81 or 82 may be the optimal for the combination RTX 2080 & Radeon VII PRP or LL. RTX2080: Radeon VII:

14 Operation is as multiword integers, sometimes with some bits used for carries. 72-bit math via 3 32-bit words with 24 bits data, 8 bits for carries.

15 Different code sequences are written for different bit level ranges of trial factor, for best speeds for various bit level ranges. Higher bit levels take longer code sequences, so factors tried per unit time declines at higher bit levels on the same hardware and exponent.
Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc:
(In the above, _GS indicates gpu-sieving.
(mfakto is similar although it does not include 95-bit code or capability. That's not an issue for the GIMPS search since optimal final TF level via GPU is generally 86 bits or less in the range of exponents <109; 92 bits to its max 232. But see also #43 below)
Judging by the names above, most of these are based on Barrett reduction.

16 "First barrett kernel in mfaktc was BARRETT92, all other kernels are stripped down versions.
From BARRETT92 to BARRETT79 first (fixed inverse, multibit in single stage possible, a bit faster)
From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps (small example 22 mod 10 yields 12 (instead of 2))). Trading accuracy for speed. The same 'tricks' lead from BARRETT79 to BARRETT77 and BARRETT76."

17 "funnel shift" barrett 87

18 The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels mentioned in #15.

19 For GPU applications, there are various implementation approaches for performance.

20 multiple streams and data sets allowing concurrent data transfer and processing

21 On-GPU sieving of factor candidates:
Tradeoffs and possible decisions about high GPU sieving, low exponents

22 There's a small performance advantage to 32-bit images when available, when using GPU sieving. (32bit addresses are smaller, placing less demand on memory bandwidth)
However, at CUDA8 or higher (which means GTX10xx or newer models), only 64-bit CUDA is available.

23 How, in the CUDA or OpenCl cases, all those factor candidates get bundled into batches for processing in parallel on many GPU cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time.
Prime95 comments on passing out chunks of work and using memory bandwidth efficiently:
Oliver writes to R Gerbicz about it

24 Could FP be faster than integer math for the kernels? Probably not
Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent GPU designs would be useful)

25 mfaktc v0.21 announce, and ruminations about possible v0.22

26 ini file parameter tuning advice, - 2508
in this order:

27 There was mention of an mfaktc v0.22-pre2 back in 2015 at
"Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed)."
The current released version is v0.21, which incorporates worktodo.add file support and is slightly faster than v0.20.

28 Experimenting on tuning with a GTX 1080 Ti, RTX 2080, and RTX 2080 Super, on Windows 7, they seem to like the upper end of the tuning variable limits. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. (Recompile required.)
# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128 (except in altered versions allowing 2047)
# Default: GPUSieveSize=64
# GPUSieveProcessSize defines how far many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSievePrimes ~94000

29 Some RTX 20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows to gigabits
Maximum GpuSieveSize was found to be 2047 (Mbits) after program modification. This may relate to computing bit positions with signed 32 bit integer math. Other model GPUs that show greater combined throughput with multiple mfaktc instances with the 128 Mbit GpuSieveSize maximum, also benefit from the modified code. For example, a GTX 1080 Ti gains several percent throughput running a single modified instance. It appears there would be additional gain on a single instance if the limit was raised to 4095 Mbits by changing to unsigned 32 bit integer math for the bit positions, since even at 2047 Mbits, it shows performance improvement compared to somewhat lesser values, and benefits from running multiple instances. RTX 2080 and RTX 2080 Super also show additional throughput by several percent from running multiple instances after available tuning is completed.

30 The decoupling of FP from integer math in the RTX 20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored.

31 According to some, the probability of finding a factor for least effort is optimized by TF up to one bit less than the eventual tradeoff limit, then P-1 with potential computing time savings optimized by the program itself (prime95 or CUDAPm1), and then, if no factor is found in P-1, the last bit level of TF, constituting the remaining half of the TF effort for that exponent. This is somewhat different sequencing than usual PrimeNet assignment.

32 Since GPUs are far more effective at TF than at primality testing or P-1, and far faster at TF than CPUs are, assignments seek to prevent TF on CPUs, by performing all useful TF on GPUs. TF on GPUs allows going about 4 bits deeper, within computing time tradeoff optimization, saving some primality tests that CPU-optimized-TF limits would not have found factors for. The optimal number of bits deeper depend on the properties of the GPU models, with the higher ~40:1 TF/primality test ratio of the RTX 20xx family probably justifying 5 bits deeper for their use. As far as I know PrimeNet contains no provision for automatically taking them 1 bit higher on RTX 20xx.

33 The time required to perform a single TF attempt (one candidate factor on one exponent) is so brief, and the reliability of modern hardware high enough over such brief periods, that it does not pay to double check TF. Not double checking TF lets the TF go one bit level higher, finding more factors. But there is a sort of double-checking being performed, with a completely different approach, by user TJAOI, as Uncwilly pointed out at There's a link there to a whole thread about that activity.

34 There is some limited overlap of coverage of factors between TF and P-1, estimated at about 20-25%. The economics of TF and the sequential progress of TF assignments dictate that TF factors found will have at most a certain number of bits (ranging up to 75 bits for 84M, to 86 bits for 999M). P-1 can economically find factors with more bits, but requires that the number one less than the factor be rather smooth (composite with many smallish factors). Some factors that TF should find will be smooth, and many will not. Therefore P-1 factoring currently serves as an incomplete double-check on the TF runs. "Stage 2 of P-1 helps to find factors f=2kp+1, where k contains one less smooth factor. If k=ab with b being the biggest prime factor of k, then it is sufficent to do P-1 with B1 = maximum prime factor of a and B2 = b." (kruoli by PM)

35 Reported factors are confirmed at the PrimeNet server by repeating the computation for that exponent and factor, because that is quick to do and factors are worth confirming. No-factor reports are not confirmed there. Historically, TF was on the honor system, with no security code attached to a result outside prime95, or verification of completion of the work. A recent development is Robert Gerbicz's proposal for a method of incorporating verification of performing the TF work in the no-factor-found case. See and some following discussion on the same thread. At the moment this is only conceptual, not implemented in any software to my knowledge. There's also discussion of other approaches in

36 There is currently no production-ready Mersenne TF code for the following interface specifications:
Vulkan (but see
In many cases, this is not a limitation, because the Intel CPU TF code in prime95 and mprime, and GPU TF code in Mfaktc (NIVIDIA CUDA) and Mfakto (OpenCL on AMD GPUs and some Intel IGPs) provide considerable coverage. However, there are older IGPs or discrete GPUs that do not have the required support for CUDA or OpenCL but do have support for other interface specifications such as DirectX or OpenGL. It's likely the performance of these older devices would be low compared to the ~20GhzD/day of current IGPs, or hundreds or thousands of GhzD/day on current discrete GPUs.

37 Mfaktc and Mfakto are capable of factoring Mersenne numbers with exponents up to 232. Extending them to exponents above 232 would require a rewrite of the individual kernels as well as the surrounding code. If this was attempted, it would be advisable to maintain separate source files or use conditional compilation to produce separate executables, to preserve efficiency for exponents below 232. Once done, though slower than the 32-bit-exponent max versions, these are likely to be much faster over 232 than Luigi's Factor5, which has no limit, or Ernst Mayer's Mfactor, which has some limits depending on its build options. There's no hurry. There's a great deal of work to do within the limit of 109, and much more from there to 232, that will take centuries.

38 "Checkpoints are written between two residue classes by just remembering the last finished residue class." The checkpoint files are small and simple.
Perform version upgrades between bit levels. The checkpoint files include what version was used to produce them, and are not accepted by a later version of software, which will restart the bit level instead.

39 The number of residue classes (see #4 above) is set at compile time in mfaktc, or in the ini file in Mfakto (choosing between 420 and 4620). Another possibility that has been suggested is to make it dynamic, adjusting to exponent and bit level.

40 Some practical considerations constraining number of classes include
magnitude of spacing between candidate factors (32 vs 64bit),
frequency and effect of the GPU work queue running empty, and
effect on throughput of running a full last block per class that extends past the bit level's end.

41 In mfaktc or mfakto, there is screen output once per class. This can range from several classes per second or more, at high exponents and low bit levels, to nearly an hour or more per class at high bit levels on slow GPUs. (Bit levels already reached in trial factoring previously ranged from 64 to 87, 223=8,388,608:1 within the space. As of early 2021, they ranged from 72 to 87 bits, 215=65536:1.) There is no user control of mfaktx screen output frequency other than the approximately 11:1 effect of selecting more-classes or less-classes. In some cases a solid state disk or ramdisk is recommended to prevent logging or checkpointing from diminishing TF throughput unacceptably. A proposed user control for rate of screen output has not been implemented. Using less-classes to somewhat limit output frequency can cost of order 1.7% TF throughput.

42 Since finding factors is more likely with low k values than high k values, lowest bit levels are run first, for efficiency. For the primary purpose of GIMPS, finding new Mersenne primes, we only care about finding a factor quickly enough to eliminate an exponent p from any further processing, thereby saving computing time overall. The probability of finding a factor times the combined computing time of further trial factoring, plus P-1 factoring, and running a probable prime test with proof generation (historically two primality tests), must be less than the effort of trial factoring for there to be net savings. Trying the most-likely factors first optimizes the savings.

43 The optimal TF bit level described in #2, #13 and #32 depends somewhat on the GPU model, since different GPU models have different ratios of TF and primality test performance. Recent developments may raise the optimal above the levels described. RTX 20xx increased parallelism between FP and int computations, or its much higher performance ratio between TF and primality testing with current code, for example. Separate code and assignments according to GPU model may be necessary to make best use of the properties of the RTX20xx GPUs and older GPUs.

44 There is a discussion at to ~1514 regarding possibly eliminating the 20 to 25% overlap of P-1 and TF described in #34. If doing so had little or no overhead, this could provide a considerable speedup in TF, which may speed up the total per-exponent effort in GIMPS by <~0.03 x 0.25 / 2, or <~0.375%. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P-1 after all TF is done. See also #31, regarding P-1 being run before the last, highest TF level, which may offer an opportunity for saving <~0.19% if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P-1. Taking full advantage of this probably involves
modifying PrimeNet TF and P-1 assignment policy, in both level and sequence,
modifying code in the P-1 applications (prime95, CUDAPm1, GpuOwL, Mlucas v20.0 or above) for determining optimal P-1 bounds,
modifying TF applications code (mfaktc, mfakto, mprime/prime95; gpuowl TF branch, Mfactor) to efficiently skip the relevant factor candidates, and
modifying GPU-specific TF level versus exponent charts and other project documentation.
It may also involve changing or adding parameters to the worktodo lines or ini files or command line options to indicate whether the TF run is to include or exclude the smooth factor candidates, possibly including specification of a smoothness limit, which may vary depending on P-1 bounds performed for the same exponent. (Different use cases for TF may require different behavior.) New TF worktodo line structure implies modification of the parsing code in each TF application. New TF worktodo line structure also implies modifying some of the several client management programs (see Making use of this also probably involves changes to TF result report records to indicate whether the smooth factors were tried or skipped, including PrimeNet and manual reports and assigned and unassigned work, impacting PrimeNet and its API, web manual report page processing, mfaktc, mfakto, prime95, gpuowl, Mfactor, and some of the client management programs. In total, over a dozen code authors would need to be on board with the change. Not all are currently actively maintaining what they wrote.

45 from re efficiency versus assignment size (roughly 2bitlevel/exponent)
mfakto (and mfaktc) are more efficient with larger assignments. This does not mean larger exponents, but "more work" or "longer runtime per class", generally bigger bit ranges. The reason is a certain one-time initialization effort per class - no matter if the class will just test 1 million factor candidates or 1 billion. In the first case, the one-time effort may account for, say, 75% of the sieving effort, in the latter case just 0.3%. Plus there is an average of half a block wasted for each class. If the class consists of only one block, that's 50%, for 1000 blocks it's just 0.05%.
46 Results are reported by GPU applications as exponent, factor(s) found, optionally username and computerid, program and version and kernel used, optionally time and date stamp, with a * flag if present indicating the bit level was not completed after finding a factor, or
exponent, no factor from bit level a to b, optionally username and computerid, program and version and kernel used, optionally time and date stamp (but not in that order). Other possible variables such as factor candidate sieving limit are not reported.

47 as #46 above implies, sometimes more than one factor will be present within a bit level. The behavior of mfaktx is controlled in such cases by part of mfaktx.ini[CODE]# possible values for StopAfterFactor:
# 0: Do not stop the current assignment after a factor was found.
# 1: When a factor was found for the current assignment stop after the
# current bitlevel. This makes only sense when Stages is enabled.
# 2: When a factor was found for the current assignment stop after the
# current class.
# Default: StopAfterFactor=1
It's uncommon but finding two factors in a single bit level does occur.

48 More than one factor per class within a bit level for an exponent also occurs. Its overall impact on the GIMPS project is small. See
Up to 10 factors per class are supported, by use of a small array for factors found. That would be very hard to test for. More than 4 per class/bitlevel/exponent is estimated to be quite rare.
For some examples of exponents with multiple factors found in one class in one bit level see

See also; what's described here is referred to there as "search by p".
(Thanks, to TheJudger, kruoli and others who have reviewed this large list or responded to PMs and offered constructive feedback.)

Top of reference tree:

Last fiddled with by kriesel on 2023-05-02 at 15:52 Reason: add link to Vulkan TF development thread; smoothness limit added to 44
kriesel is online now