(note, sort of mfaktc oriented, more so toward the end. Where description applies to both mfaktc and mfakto, they are referred to as mfaktx)
Following is my evolving attempt to list and describe the many concepts and optimization considerations relevant to GIMPS trial factoring, in a way that assumes little or no prior knowledge.
See also (perhaps first) for some background that will be useful in reading the following,
https://www.mersenneforum.org/showpo...35&postcount=3
1 Trial factoring work is generally assigned by exponent and bit levels;
Mersenne exponent,
bit level already trial factored to or where to start, and
bit level to factor up to.
Nearly all TF is now done on GPUs not CPUs. Assignments and result reporting are typically automatic for mprime/prime95, but manually or through separate programs for gpu applications.
Manual assignment (reservation) links and result submission links are available in the attachment at
https://www.mersenneforum.org/showpo...11&postcount=9
See
https://www.mersenneforum.org/showpo...92&postcount=3 for a list of client management programs that get assignments and/or submit results.
2 Each bit level of TF is about as much computing effort as all bit levels below it, for the same exponent.
Probability of finding a factor by trial factoring between 2
^{x} and 2
^{x+1} is approximately 1/x.
https://www.mersenne.org/various/math.php
So while the probability of finding a factor goes down slowly with x, the effort goes up exponentially with x.
Assignments are made up to the point where it's more cost effective to primality test instead of continuing to trial factor. Finding a factor by trial factoring before any primality tests or P1 factoring saves 1 PRP test plus proof generation and cert time, plus whatever the cert failure rate is, previously 2 x (1 + primality test error rate) primality tests, plus the P1 factoring time, which is typically a few percent of a primality test. Error rate is around 2% in LL, so about 2(1+.02)+.03 =~ 2.07 primality tests equivalent, or ~1.03 if using PRP/proof. Except that P1 effort (bounds selection) is optimized for maximum time savings, so actually <2(1+0.02) or <2.04 primality tests historically; currently 1.3 since some users are still performing LL first tests despite that being discouraged now.
3 TF makes use of the special form of factors of Mersenne numbers, f = 2 k p +1, where p is the prime exponent of the Mersenne number, k is a positive integer. See
https://en.wikipedia.org/wiki/Mersen...rsenne_numbers
4 Use a wheel generator
https://en.wikipedia.org/wiki/Wheel_factorization for candidate factors, to efficiently exclude and not even consider candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2 2 3 5 7 = 420, the Lessclasses number; 2 2 3 5 7 11 = 4620, the moreclasses number.
https://www.mersenneforum.org/showpo...1&postcount=35
https://www.mersenneforum.org/showpo...7&postcount=37
(There was discussion of going to 13 or higher, but that was considered not worthwhile. 2 2 3 5 7 11 13 = 60060, etc. Higher complexity and overhead, trading off against diminishing returns of incremental number of candidates excluded.) Prime95 uses a simpler 2 2 3 5 = 60 wheel spokes, which reduces to 16 classes (26.67%). For Mfaktx at 420, 96 survive (22.86%) or at 4620, 960 (20.78%). The next would be 11520/60060=19.18%. (A potential reduction of 7.7%) (There are two factors of 2 in the number of classes. I think that's due to including #6 into the wheel generator; it's not just evens that can be dismissed, but 6 out of 8 mod 8 cases, or 3 out of 4.)
5 For a given exponent, exclude entire classes of candidate factors with a single test. Additionally there is subsequent fast sieving in the surviving classes for small prime factors, to eliminate the bits representing the corresponding k values of small primes beyond the wheel generator level.
https://www.mersenneforum.org/showpo...7&postcount=37
6 Make use of the special form of factors of Mersenne numbers, that they are 1 or 7 mod 8. (Discard or do not generate candidate factors that are not 1 or 7 mod 8). A short proof for f=+/1 mod 8 is at
https://primes.utm.edu/notes/proofs/MerDiv.html. There's an excellent explanation of the wheel generator, this point, and an implementation of the combination, contained in source module factor.c.txt included in the Mlucas source distribution, beginning about line 304 in the v18 variety.; in v20.1.1, ~line 324.
7 Use of dense representation of the candidate factors, as a bit map of k values.
https://www.mersenneforum.org/showpo...4&postcount=72
(That bit map is exponentspecific, so can not be reused for other exponents.)
8 These candidate factors are sieved, somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded. Trial with prime candidates is sufficient, with composite candidates redundant.
9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort, than a lesser amount of sieving of candidates. There's user control of sieving depth available to allow adjustment to near optimal for the exponents, depths, and other variables, in mfaktc and mfakto.
https://www.mersenneforum.org/showpo...&postcount=180
10 Candidate factors surviving wheel generation and restriction to 1 or 7 mod 8 are sieved somewhat. Implementation in mfaktx is by a maximum sieve prime value, that as I recall is set in mfaktx.ini. Sieving level (at some version of mfaktc) must not exceed the level where possible candidate factors are included in sieving values. This makes sieve limit settings dependent on exponent for low exponent low bit level combinations.
https://www.mersenneforum.org/showpo...&postcount=788
(If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See in #21 below)
11 Surviving candidate factors are tried. (With MOSTLY prime candidates, and the relatively few composite candidates that passed the speedoptimized incomplete sieving.)
12 The trial method is to generate a representation of the Mersenne prime plus 1 (2
^{p}), modulo the trial factor, by a succession of squarings and doublings according to the bit pattern in the exponent, modulo the trial factor. If 2
^{p} mod f =1, 2
^{p}1 mod f = 0 proving f is a factor. The powering method is much much faster than trial long division for the very sizable numbers of interest near the current GIMPS wavefront, and uses far less memory than a simple approach of representing the full dividend at the beginning, and more so on both counts for larger numbers. (Retina points out though at
https://www.mersenneforum.org/showpo...&postcount=478 that there would be no need to create a full representation of p ones at the beginning of long division. Since all the binary digit positions of a mersenne number are ones they could simply be added at the right as needed. Also, we care only about the remainder, not the quotient.) There's a description and brief smallnumbers example of the modpow method at
https://www.mersenne.org/various/math.php which is called left to right binary method in
https://en.wikipedia.org/wiki/Modular_exponentiation. It is not necessarily the minimumtime method, but it's reliably pretty good, and fairly simple. The modulo trialfactor at each step keeps the operands at a manageable size for speed: 3 32bit words, far smaller than the tens of millions or hundreds of millions of bits that naive trialdivision could involve or deferred modulo could generate. Trial division would take (O)p operations, while modpow takes only (O)log
_{2}p. For p~90M, p/log
_{2}p ~3,400,000. If stepping through long division 32 bits at a time, ~2812500; log
_{2}2812500 ~21.4; 2812500 / log
_{2}2812500 ~131282. The efficiency of the modpow method allows trial factoring to many more bits factor depth than even a well implemented long division trial would. See point #2. Therefore the probability of finding a factor is increased significantly with the modpow approach.
13 On significantly parallel computing hardware, such as GPUs, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores. GPUs are so much more relatively effective at TF than other types of GIMPS computation, that the optimal GPU TF bit level is typically a few bits higher than for CPUs, and CPU TF is generally avoided. (See also point 2 again.) Xeon Phi are reportedly quite fast at trial factoring, but not very common, and rather expensive. "Additionally, something like an FPGA would be really fast, but most users don't have one. And ASICs would be even faster, but way too expensive." (kruoli PM) FPGA or ASIC likely would have yet different TF bit level optima. Guides to suggested TF bit levels versus exponent and GPU model are available for each GPU model listed at
https://www.mersenne.ca/cudalucas.php; click on the GPU model of interest. Note that the curves correspond to optimization for one or two primality tests to be performed
on the same GPU model as the TF. If the TF will be performed on a high TF/DP performance ratio GPU but the primality test performed on a lower TF/DP ratio GPU, the optimal TF is likely somewhere between the two levels. For example, for a 100Mdigit mersenne number, TF on RTX2080 indicates 83 bits but Radeon indicates 80 bits; 81 or 82 may be the optimal for the combination RTX 2080 & Radeon VII PRP or LL. RTX2080:
https://www.mersenne.ca/cudalucas.php?model=744 Radeon VII:
https://www.mersenne.ca/cudalucas.php?model=752
14 Operation is as multiword integers, sometimes with some bits used for carries.
https://www.mersenneforum.org/showpo...3&postcount=17 72bit math via 3 32bit words with 24 bits data, 8 bits for carries.
https://www.mersenneforum.org/showpo...&postcount=159
15 Different code sequences are written for different bit level ranges of trial factor, for best speeds for various bit level ranges. Higher bit levels take longer code sequences, so factors tried per unit time declines at higher bit levels on the same hardware and exponent.
https://www.mersenneforum.org/showpo...&postcount=231
Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc:
_71BIT_MUL24
_75BIT_MUL32
_95BIT_MUL32
BARRETT76_MUL32
BARRETT77_MUL32
BARRETT79_MUL32
BARRETT87_MUL32
BARRETT88_MUL32
BARRETT92_MUL32
_75BIT_MUL32_GS
BARRETT76_MUL32_GS
BARRETT77_MUL32_GS
BARRETT79_MUL32_GS
BARRETT87_MUL32_GS
BARRETT88_MUL32_GS
BARRETT92_MUL32_GS
_95BIT_MUL32_GS
(In the above, _GS indicates gpusieving.
https://www.mersenneforum.org/showpo...postcount=3082)
(mfakto is similar although it does not include 95bit code or capability. That's not an issue for the GIMPS search since optimal final TF level via GPU is generally 86 bits or less in the mersenne.org range of exponents <10
^{9}; 92 bits to its max 2
^{32}. But see also #43 below)
Judging by the names above, most of these are based on Barrett reduction.
https://en.wikipedia.org/wiki/Barrett_reduction
16 "First barrett kernel in mfaktc was BARRETT92, all other kernels are stripped down versions.
From BARRETT92 to BARRETT79 first (fixed inverse, multibit in single stage possible, a bit faster)
From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps (small example 22 mod 10 yields 12 (instead of 2))). Trading accuracy for speed. The same 'tricks' lead from BARRETT79 to BARRETT77 and BARRETT76."
https://www.mersenneforum.org/showpo...postcount=3080
17 "funnel shift" barrett 87
https://www.mersenneforum.org/showpo...postcount=2243
18 The
frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels mentioned in #15.
19 For GPU applications, there are various implementation approaches for performance.
https://www.mersenneforum.org/showpo...9&postcount=29
20 multiple streams and data sets allowing concurrent data transfer and processing
https://www.mersenneforum.org/showpo...&postcount=152
21 OnGPU sieving of factor candidates:
https://www.mersenneforum.org/showpo...&postcount=554
Tradeoffs and possible decisions about high GPU sieving, low exponents
https://www.mersenneforum.org/showpo...postcount=2409
https://www.mersenneforum.org/showpo...postcount=2999
22 There's a small performance advantage to 32bit images when available, when using GPU sieving. (32bit addresses are smaller, placing less demand on memory bandwidth)
https://www.mersenneforum.org/showpo...postcount=1981
However, at CUDA8 or higher (which means GTX10xx or newer models), only 64bit CUDA is available.
23 How, in the CUDA or OpenCl cases, all those factor candidates get bundled into batches for processing in parallel on many GPU cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time.
Prime95 comments on passing out chunks of work and using memory bandwidth efficiently:
https://www.mersenneforum.org/showpo...postcount=1634
Oliver writes to R Gerbicz about it
https://www.mersenneforum.org/showpo...postcount=3020
24 Could FP be faster than integer math for the kernels? Probably not
https://www.mersenneforum.org/showpo...postcount=2377
Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent GPU designs would be useful)
https://www.mersenneforum.org/showpo...postcount=2380
25 mfaktc v0.21 announce, and ruminations about possible v0.22
https://www.mersenneforum.org/showpo...postcount=2492
26 ini file parameter tuning advice,
https://www.mersenneforum.org/showpo...postcount=2505  2508
in this order:
GPUSieveProcessSize
GPUSieveSize
GPUSievePrimes
27 There was mention of an mfaktc v0.22pre2 back in 2015 at
https://www.mersenneforum.org/showpo...postcount=2547
"Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed)."
https://www.mersenneforum.org/showpo...postcount=3080
The current released version is v0.21, which incorporates worktodo.add file support and is slightly faster than v0.20.
28 Experimenting on tuning with a GTX 1080 Ti, RTX 2080, and RTX 2080 Super, on Windows 7, they seem to like the upper end of the tuning variable limits. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. (Recompile required.)
# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128 (except in altered versions allowing 2047)
# Default: GPUSieveSize=64
GPUSieveSize=2047
# GPUSieveProcessSize defines how far many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSieveProcessSize=32
GPUSievePrimes ~94000
29 Some RTX 20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows to gigabits
https://www.mersenneforum.org/showpo...postcount=3071
Maximum GpuSieveSize was found to be 2047 (Mbits) after program modification. This may relate to computing bit positions with signed 32 bit integer math. Other model GPUs that show greater combined throughput with multiple mfaktc instances with the 128 Mbit GpuSieveSize maximum, also benefit from the modified code. For example, a GTX 1080 Ti gains several percent throughput running a single modified instance. It appears there would be additional gain on a single instance if the limit was raised to 4095 Mbits by changing to unsigned 32 bit integer math for the bit positions, since even at 2047 Mbits, it shows performance improvement compared to somewhat lesser values, and benefits from running multiple instances. RTX 2080 and RTX 2080 Super also show additional throughput by several percent from running multiple instances after available tuning is completed.
30 The decoupling of FP from integer math in the RTX 20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored.
31 According to some, the probability of finding a factor for least effort is optimized by TF up to one bit less than the eventual tradeoff limit, then P1 with potential computing time savings optimized by the program itself (prime95 or CUDAPm1), and then, if no factor is found in P1, the last bit level of TF, constituting the remaining half of the TF effort for that exponent. This is somewhat different sequencing than usual PrimeNet assignment.
32 Since GPUs are far more effective at TF than at primality testing or P1, and far faster at TF than CPUs are, assignments seek to prevent TF on CPUs, by performing all useful TF on GPUs. TF on GPUs allows going about 4 bits deeper, within computing time tradeoff optimization, saving some primality tests that CPUoptimizedTF limits would not have found factors for. The optimal number of bits deeper depend on the properties of the GPU models, with the higher ~40:1 TF/primality test ratio of the RTX 20xx family probably justifying 5 bits deeper for their use. As far as I know PrimeNet contains no provision for automatically taking them 1 bit higher on RTX 20xx.
33 The time required to perform a single TF attempt (one candidate factor on one exponent) is so brief, and the reliability of modern hardware high enough over such brief periods, that it does not pay to double check TF. Not double checking TF lets the TF go one bit level higher, finding more factors. But there is a sort of doublechecking being performed, with a completely different approach, by user TJAOI, as Uncwilly pointed out at
https://www.mersenneforum.org/showpo...postcount=3043. There's a link there to a whole thread about that activity.
34 There is some limited overlap of coverage of factors between TF and P1, estimated at about 2025%. The economics of TF and the sequential progress of TF assignments dictate that TF factors found will have at most a certain number of bits (ranging up to 75 bits for 84M, to 86 bits for 999M). P1 can economically find factors with more bits, but requires that the number one less than the factor be rather smooth (composite with many smallish factors). Some factors that TF should find will be smooth, and many will not. Therefore P1 factoring currently serves as an incomplete doublecheck on the TF runs. "Stage 2 of P1 helps to find factors
f=2kp+1, where
k contains one less smooth factor. If
k=a⋅b with
b being the biggest prime factor of
k, then it is sufficent to do P1 with B1 = maximum prime factor of
a and B2 =
b." (kruoli by PM)
35 Reported factors are confirmed at the PrimeNet server by repeating the computation for that exponent and factor, because that is quick to do and factors are worth confirming. Nofactor reports are not confirmed there. Historically, TF was on the honor system, with no security code attached to a result outside prime95, or verification of completion of the work. A recent development is Robert Gerbicz's proposal for a method of incorporating verification of performing the TF work in the nofactorfound case. See
https://www.mersenneforum.org/showpo...7&postcount=30 and some following discussion on the same thread. At the moment this is only conceptual, not implemented in any software to my knowledge. There's also discussion of other approaches in
https://mersenneforum.org/showthread.php?t=25493
36 There is currently no productionready Mersenne TF code for the following interface specifications:
DirectCompute
DirectX
OpenGL
PhysX
Vulkan (but see
https://mersenneforum.org/showthread.php?t=28523)
In many cases, this is not a limitation, because the Intel CPU TF code in prime95 and mprime, and GPU TF code in Mfaktc (NIVIDIA CUDA) and Mfakto (OpenCL on AMD GPUs and some Intel IGPs) provide considerable coverage. However, there are older IGPs or discrete GPUs that do not have the required support for CUDA or OpenCL but do have support for other interface specifications such as DirectX or OpenGL. It's likely the performance of these older devices would be low compared to the ~20GhzD/day of current IGPs, or hundreds or thousands of GhzD/day on current discrete GPUs.
37 Mfaktc and Mfakto are capable of factoring Mersenne numbers with exponents up to 2
^{32}. Extending them to exponents above 2
^{32} would require a rewrite of the individual kernels as well as the surrounding code.
http://mersenneforum.org/showpost.ph...postcount=1148 If this was attempted, it would be advisable to maintain separate source files or use conditional compilation to produce separate executables, to preserve efficiency for exponents below 2
^{32}. Once done, though slower than the 32bitexponent max versions, these are likely to be much faster over 2
^{32} than Luigi's Factor5, which has no limit, or Ernst Mayer's Mfactor, which has some limits depending on its build options. There's no hurry. There's a great deal of work to do within the mersenne.org limit of 10
^{9}, and much more from there to 2
^{32}, that will take centuries.
https://www.mersenneforum.org/showpo...postcount=2862
38 "Checkpoints are written between two residue classes by just remembering the last finished residue class." The checkpoint files are small and simple.
https://www.mersenneforum.org/showpo...postcount=3008
Perform version upgrades between bit levels. The checkpoint files include what version was used to produce them, and are not accepted by a later version of software, which will restart the bit level instead.
39 The number of residue classes (see #4 above) is set at compile time in mfaktc, or in the ini file in Mfakto (choosing between 420 and 4620). Another possibility that has been suggested is to make it dynamic, adjusting to exponent and bit level.
https://www.mersenneforum.org/showpo...postcount=2999
40 Some practical considerations constraining number of classes include
magnitude of spacing between candidate factors (32 vs 64bit),
frequency and effect of the GPU work queue running empty, and
effect on throughput of running a full last block per class that extends past the bit level's end.
https://www.mersenneforum.org/showpo...postcount=3002
41 In mfaktc or mfakto, there is screen output once per class. This can range from several classes per second or more, at high exponents and low bit levels, to nearly an hour or more per class at high bit levels on slow GPUs. (Bit levels already reached in trial factoring previously ranged from 64 to 87, 2
^{23}=8,388,608:1 within the mersenne.org space. As of early 2021, they ranged from 72 to 87 bits, 2
^{15}=65536:1.) There is no user control of mfaktx screen output frequency other than the approximately 11:1 effect of selecting moreclasses or lessclasses. In some cases a solid state disk or ramdisk is recommended to prevent logging or checkpointing from diminishing TF throughput unacceptably. A proposed user control for rate of screen output has not been implemented.
https://www.mersenneforum.org/showpo...postcount=2703 Using lessclasses to somewhat limit output frequency can cost of order 1.7% TF throughput.
https://www.mersenneforum.org/showpo...postcount=2706
42 Since finding factors is more likely with low k values than high k values, lowest bit levels are run first, for efficiency. For the primary purpose of GIMPS, finding new Mersenne primes, we only care about finding a factor quickly enough to eliminate an exponent p from any further processing, thereby saving computing time overall. The probability of finding a factor times the combined computing time of further trial factoring, plus P1 factoring, and running a probable prime test with proof generation (historically two primality tests), must be less than the effort of trial factoring for there to be net savings. Trying the mostlikely factors first optimizes the savings.
43 The optimal TF bit level described in #2, #13 and #32 depends somewhat on the GPU model, since different GPU models have different ratios of TF and primality test performance. Recent developments may raise the optimal above the levels described. RTX 20xx increased parallelism between FP and int computations, or its much higher performance ratio between TF and primality testing with current code, for example. Separate code and assignments according to GPU model may be necessary to make best use of the properties of the RTX20xx GPUs and older GPUs.
44 There is a discussion at
https://www.mersenneforum.org/showpo...postcount=1511 to ~1514 regarding possibly eliminating the 20 to 25% overlap of P1 and TF described in #34. If doing so had little or no overhead, this could provide a considerable speedup in TF, which may speed up the total perexponent effort in GIMPS by <~0.03 x 0.25 / 2, or <~0.375%. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P1 after all TF is done. See also #31, regarding P1 being run before the last, highest TF level, which may offer an opportunity for saving
<~0.19% if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P1. Taking full advantage of this probably involves
modifying PrimeNet TF and P1 assignment policy, in both level and sequence,
modifying code in the P1 applications (prime95, CUDAPm1, GpuOwL, Mlucas v20.0 or above) for determining optimal P1 bounds,
modifying TF applications code (mfaktc, mfakto, mprime/prime95; gpuowl TF branch, Mfactor) to efficiently skip the relevant factor candidates, and
modifying mersenne.ca GPUspecific TF level versus exponent charts and other project documentation.
It may also involve changing or adding parameters to the worktodo lines or ini files or command line options to indicate whether the TF run is to include or exclude the smooth factor candidates, possibly including specification of a smoothness limit, which may vary depending on P1 bounds performed for the same exponent. (Different use cases for TF may require different behavior.) New TF worktodo line structure implies modification of the parsing code in each TF application. New TF worktodo line structure also implies modifying some of the several client management programs (see
http://www.mersenneforum.org/showpos...92&postcount=3). Making use of this also probably involves changes to TF result report records to indicate whether the smooth factors were tried or skipped, including PrimeNet and manual reports and assigned and unassigned work, impacting PrimeNet and its API, web manual report page processing, mfaktc, mfakto, prime95, gpuowl, Mfactor, and some of the client management programs. In total, over a dozen code authors would need to be on board with the change. Not all are currently actively maintaining what they wrote.
45 from
https://www.mersenneforum.org/showpo...4&postcount=23 re efficiency versus assignment size (roughly 2
^{bitlevel}/exponent)
Quote:
mfakto (and mfaktc) are more efficient with larger assignments. This does not mean larger exponents, but "more work" or "longer runtime per class", generally bigger bit ranges. The reason is a certain onetime initialization effort per class  no matter if the class will just test 1 million factor candidates or 1 billion. In the first case, the onetime effort may account for, say, 75% of the sieving effort, in the latter case just 0.3%. Plus there is an average of half a block wasted for each class. If the class consists of only one block, that's 50%, for 1000 blocks it's just 0.05%.

46 Results are reported by GPU applications as exponent, factor(s) found, optionally username and computerid, program and version and kernel used, optionally time and date stamp, with a * flag if present indicating the bit level was not completed after finding a factor, or
exponent, no factor from bit level a to b, optionally username and computerid, program and version and kernel used, optionally time and date stamp (but not in that order). Other possible variables such as factor candidate sieving limit are not reported.
47 as #46 above implies, sometimes more than one factor will be present within a bit level. The behavior of mfaktx is controlled in such cases by part of mfaktx.ini[CODE]# possible values for StopAfterFactor:
# 0: Do not stop the current assignment after a factor was found.
# 1: When a factor was found for the current assignment stop after the
# current bitlevel. This makes only sense when Stages is enabled.
# 2: When a factor was found for the current assignment stop after the
# current class.
#
# Default: StopAfterFactor=1
It's uncommon but finding two factors in a single bit level does occur.
48 More than one factor per class within a bit level for an exponent also occurs. Its overall impact on the GIMPS project is small. See
https://www.mersenneforum.org/showpo...82&postcount=5
Up to 10 factors per class are supported, by use of a small array for factors found. That would be very hard to test for. More than 4 per class/bitlevel/exponent is estimated to be quite rare.
https://www.mersenneforum.org/showpost.php?p=410784&postcount=35
For some examples of exponents with multiple factors found in one class in one bit level see
https://mersenneforum.org/showpost.p...postcount=3262
See also
https://www.mersenneforum.org/showpo...&postcount=471; what's described here is referred to there as "search by p".
(Thanks, to TheJudger, kruoli and others who have reviewed this large list or responded to PMs and offered constructive feedback.)
Top of reference tree:
https://www.mersenneforum.org/showpo...22&postcount=1