Old 2018-05-28, 18:38   #1, kriesel

Mfaktc-specific reference material

This thread is intended as a home for reference material specific to the mfaktc program.
(Suggestions are welcome. Discussion posts in this thread are not encouraged; please use the reference material discussion thread. Off-topic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.)

Mfaktc howto

The following assumes you've already read the generic "How to get started in gpu computing for GIMPS" portion of the reference material, and duplicates little of that here.

Download a suitable version of mfaktc from here
You might as well get one of the larger-GpuSieveSize-enabled ones allowing up to 2047 Mbits if one is available.
If you can't find a suitable version there, you may be able to get one built by one of the participants on the mfaktc thread.

You'll also need the appropriate CUDA runtime (CUDArt) library.
CUDA dlls are in the CUDA_dlls folder of
(If all else fails, CUDA .dll or .so files can be obtained by downloading and installing the applicable OS and CUDA version's CUDA SDK, then locating the few needed files and adding them to your path or to your working directory on other systems. Download the current version from
Older versions can be obtained from the download archive, at
These installers are typically gigabytes in size in the newer versions; 2.8 GB for CUDA v11.3.
Get the installation guide and other documentation PDFs also while you're there.
Get the installation guide and other documentation pdfs also while you're there.

Note, not all OS version and CUDA version combinations are supported! See CUDA Toolkit compatibility vs CUDA level and good luck!)

Install in a suitable user-account-owned subfolder.
Set the needed folder permissions so suitable permissions are inherited by files created there.

Modify the mfaktc.ini file to customize it to your GPU model, primenet account name, and system name.
(I usually incorporate system name, GPU identification, and instance number together. Something like systemname/gpuid-wn; condor/gtx1060-w2 for example.)
Correct the comments in the standard ini file to reflect the increased GpuSieveSize maximum.
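For concreteness, an example of such an mfaktc.ini fragment follows. The parameter names are those of the standard v0.21 ini file; the user ID, computer ID, and tuning values are placeholders to adjust for your own account and GPU, not recommendations.

```ini
# Illustrative mfaktc.ini fragment; values are examples only
V5UserID=yourPrimenetID
ComputerID=condor/gtx1060-w2
SieveOnGPU=1
GPUSievePrimes=82486
GPUSieveSize=64
GPUSieveProcessSize=16
Checkpoints=1
```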

You may want to try GPU-Z as a utility on your Windows system, to see an indication of what the computer thinks is installed for your GPU (CUDA, OpenCL, OpenGL, etc.), graphically monitor GPU parameters, and maybe even log them if you want. It is one of many utilities listed in the GPU utilities reference, which also lists some Linux alternatives. It can be handy while getting a GPU application going. When it's not needed, shut it down along with other idle applications, to reduce overhead that costs performance.

Now is a good time to run mfaktc with -h >>help.txt in your working directory. Run once, refer to as often as needed.

Run the self test: mfaktc -st >>selftest.txt

Or the longer one: mfaktc -st2 >>selftest.txt

Create a Windows batch file or Linux shell script with a short name. I sometimes use similar to the following, named mf.bat. Adjust device name, device number, mfaktc version etc for your situation.

title %cd% mfaktc gtx1080Ti
if not exist help.txt mfaktc-more-cuda80-64 -h >help.txt
if not exist selftest.txt mfaktc-more-cuda80-64 -d 0 -st >selftest.txt
mfaktc-more-cuda80-64 -d 0 >>mfaktc-run.txt
Set the device number there.
Consider redirecting console output to files, as in the batch file example above, or employing a good tee program.

Create a desktop shortcut for easy launch of the batch file or script. On Windows, the shortcut command line should be of the form <fullsystempath>\cmd.exe /k <full mfaktc folder path>\<batchfilename>, with the start-in directory set to <full mfaktc folder path>. (Eventually, for multiple instances or multiple GPUs, this could launch a routine that invokes the individual-instance files with short time delays between them, so you have a few seconds to see whether each launched correctly or hit a bug.)

Check the results. Resolve any reliability issues before proceeding to real GIMPS work.

Create a worktodo file and put some assignments in it. Start with a few, in case your GPU or IGP does not work out. Get the type of work you plan to run the most. Get them from

Results are reported manually at

Run one instance with default settings. Modify tuning in the mfaktc.ini file, stop and restart the program to make it take effect, and document performance for each modification. Tune one variable at a time. For best accuracy and reproducibility, minimize interactive use during a benchmark period. Averaging many timing samples for the same tune parameter set in a spreadsheet improves timing accuracy. See Mfaktc tuning for optimal throughput for tuning advice.
An example of tuning for an RTX 2080 Super is here. Tunings for other GPU models are shown nearby in the same thread.

When substantially changing the type of work, such as switching between 100M tasks and 100Mdigit tasks, or significantly changing bit levels, especially when the change results in different kernels being used, retuning is suggested. (You may want to save different tunes for different exponent and bit level ranges, for later reuse.)
For best performance use an SSD, tune for a single instance first, and tune for maximum throughput by experimenting with multiple instances last. (I suggest keeping the work of simultaneous instances similar. If they are too different, different kernels may reduce throughput. Test for that.)
Multiple instances sometimes give slightly higher sustained throughput, and keep the GPU working if one instance has a problem, runs out of work, or is stopped briefly to replenish work and report results. If a single instance is optimal, chain the batch file to another to keep the GPU productive when one worktodo file empties or a bug terminates the program.
Slow GPUs benefit less from large GpuSieveSize and multiple instances; fast GPUs benefit more.

Get familiar with the nvidia-smi command line tool at some point. It's more efficient for when you get into production mode. It's probably hiding somewhere like the following. I usually make a batch file with a short name, nv.bat (and suggest doing the Linux equivalent):
:loop
"c:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe"
pause
goto loop
Nvidia-smi has less overhead than graphical monitoring utilities such as GPU-Z. Used as above, it will list all CUDA GPUs working with the driver and whatever executables are using them, plus memory usage, power, temperature, etc. One keystroke in that command prompt window will refresh it.

Nvidia-smi can also control power and other parameters. Power savings can be quite considerable; 50% reduction in power from the nominal maximum seems to still provide better than 80% of maximum throughput.

Beware: the nvidia-smi GPU number may not match the mfaktc device number in the case of multiple GPUs per system.

After getting the program functioning manually, you can consider continuing to operate it that way, trying one of the client management programs described at Available Mersenne prime hunting client management software (each of which has its own install requirements), or using GPU to 72.

(Note much of the preceding was derived from or copied from

Table of Contents:
  1. This post
  2. Run time versus exponent and bit level
  3. Bug and wish list
  4. Mfaktc throughput variables
  5. Volta and Turing support
  6. Concepts in GIMPS trial factoring
  7. Mfaktc -h help output
  8. Mfaktc tuning for optimal throughput
  9. gr-mfaktc
  10. Mfaktc tuning for best compute efficiency
  11. etc tbd

Top of reference tree:

Last fiddled with by kriesel on 2022-10-03 at 04:56 Reason: reordered content to separate exe from dlls
Old 2018-05-28, 18:41   #2, kriesel

Mfaktc runtime scaling versus exponent and bit depth

For computing a run time estimate for a given GPU model and assignment, see the concise method given here.

Timings for an assortment of exponents and bit depths on a GTX480 are tabulated and charted for reference in the attachment. Note, only one trial per combination was tabulated, so no measure was made, or indication given, of run-to-run reproducibility for the same inputs. This is a somewhat different way of looking at it than the factoring benchmarks at . The trend of how run time changes with bit level and exponent will apply to other GPU models also; time ≈ c · 2^bitlevel / exponent. The "proportionality constant" c varies greatly with GPU model, and somewhat with the mfaktc or mfakto kernel required for the bit level, at least. Environmental conditions (room temperature), other workload on the GPU, GPU overclocking, etc. will also affect run time somewhat.

The kernel-slowdown-for-large-bit-level effect is modest (more complex, slower kernels are needed, with optimal maximum bit level proportional to ~log exponent), while the factor density effect is significant, at 1/exponent, and the factor-value-span-per-bit-level effect very significant, at ~2^(bitlevel − 1).
Tuning of the application specific to the exponent range and bit level will sometimes provide some performance improvement.
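The scaling relation above can be sketched numerically. This is a minimal illustration only; c is a GPU-model-dependent constant that would have to be fit to measured timings, and the value used here is a placeholder.

```python
def tf_level_time(exponent, bit_level, c=1.0):
    """Relative run time for one TF bit level, per the scaling relation
    time ~ c * 2**bit_level / exponent.  c is GPU-model-dependent; the
    default here is only a placeholder."""
    return c * 2.0 ** bit_level / exponent

# Consequences of the relation: each additional bit level doubles the
# time, and doubling the exponent halves the per-level time.
base = tf_level_time(90_000_000, 75)
one_deeper = tf_level_time(90_000_000, 76)        # twice base
double_exponent = tf_level_time(180_000_000, 75)  # half of base
```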

Top of reference tree:
Attached Files
File Type: pdf mfaktc trial factoring runtime scaling.pdf (20.7 KB, 409 views)

Last fiddled with by kriesel on 2022-04-22 at 14:48 Reason: Link to concise estimation method post, expand content
Old 2018-05-28, 19:24   #3, kriesel

Mfaktc bug and wish list

Known and suspected bugs and desired features or changes are listed. Some cross references to other gpu programs are included. Some issues may be hardware related or due to other causes.

As always, such lists are in appreciation of the various authors' past efforts, contributions, and skills, and intended to help improve both the software and the users' experience.

Top of reference tree:
Attached Files
File Type: pdf mfaktc bug and wish list.pdf (88.1 KB, 46 views)

Last fiddled with by kriesel on 2022-04-22 at 14:41 Reason: updated attachment to include backup checkpoint file
Old 2018-05-31, 03:12   #4, kriesel

Mfaktc throughput variables

Investigation of why mfaktc consistently indicates 98% GPU usage in GPU-Z, rather than the 100% that CUDAPm1 and CUDALucas typically produce, showed that there are slight gains in total throughput to be had by running multiple instances. There are confounding factors, including:

Exponent being run (substantial effect)
Factoring depth being run (minor effect)
Overhead of running the screen display with the same GPU (varying effect)
How many instances are running per gpu
GPU sieving parameters: GPUSieveProcessSize, GPUSieveSize, GPUSievePrimes; tuning these can provide a several percent boost in throughput
Frequency of checkpoint saving (minor effect)

GPU clock throttling due to thermal limits
How similar the work of multiple instances are (similar is better, very dissimilar may cause a net slowdown. Running very different bit levels that cause use of different kernels seems likely to cause this. Very different exponents may.)

Potentially, at high exponent or instance count:
memory constraints (although mfaktc doesn't use much memory per instance at current wavefront exponents)

Load sharing for multiple instances appears to be not quite equal; they have slightly differing throughputs.

An operational advantage of running multiple instances is being able to stop one, add work, look at results, change ini file settings, etc, without noticeable loss of throughput; the others will essentially saturate the GPU while one instance is out of action temporarily. If one instance runs out of work the other running instances run a bit faster than if it hadn't, limiting lost total throughput.

On a GTX480, some slight effect of CUDA level (test your own GPU models; there are reports that GTX 10xx GPUs like v8.0 better than lower versions):
v0.20, CUDA 6.5: 346.38 GhzD/d (fastest, 100%)
v0.21 Less Classes, CUDA 4.2: 340.18 (98.2%)
v0.21 Less Classes, CUDA 8.0: 340.43 (98.3%)
Running the less-classes versions could cost ~0.9 week per year of throughput.

Note, the v0.20 checkpoint file was ignored by either flavor of less-classes; about 3 hours of throughput was lost when switching versions, because the less-classes versions of mfaktc restarted the bit level at each version change. I was able to get it back by reverting to the v0.20 CUDA 6.5 64-bit version.

These results were seen on a system with an actual spinning hard drive. Gain from multiple instances may be less with a solid state disk or ramdisk.

The effect of lower than 100% usage with a single instance is stronger with newer, faster GPUs such as the GTX 1080 or GTX 1080 Ti. The GTX 1080 Ti under test runs at 90% GPU load with one instance, 95% with two instances. (Not the less-classes version.)

The above is observed using the standard Mfaktc release. The modified versions allowing GpuSieveSize up to 2047M when well tuned reduce but do not eliminate the throughput advantage of running multiple instances.

Top of reference tree:
Attached Files
File Type: pdf parallel throughput.pdf (17.4 KB, 346 views)

Last fiddled with by kriesel on 2020-07-01 at 17:12 Reason: added comments on similar work
Old 2018-10-09, 21:40   #5, kriesel

Volta and Turing support

For versions compiled for CUDA 10, see

Top of reference tree:

Last fiddled with by kriesel on 2019-11-18 at 14:03
Old 2019-02-14, 03:10   #6, kriesel

Concepts in GIMPS trial factoring (TF)

(Note: this is somewhat mfaktc-oriented, more so toward the end. Where a description applies to both mfaktc and mfakto, they are referred to as mfaktx.)

Following is my evolving attempt to list and describe the many concepts and optimization considerations relevant to GIMPS trial factoring, in a way that assumes little or no prior knowledge.

See also (perhaps first) for some background that will be useful in reading the following,

1 Trial factoring work is generally assigned by exponent and bit levels;
Mersenne exponent,
bit level already trial factored to or where to start, and
bit level to factor up to.
Nearly all TF is now done on GPUs not CPUs. Assignments and result reporting are typically automatic for mprime/prime95, but manually or through separate programs for gpu applications.
Manual assignment (reservation) links and result submission links are available in the attachment at
See for a list of client management programs that get assignments and/or submit results.

2 Each bit level of TF is about as much computing effort as all bit levels below it combined, for the same exponent.
The probability of finding a factor by trial factoring between 2^x and 2^(x+1) is approximately 1/x.
So while the probability of finding a factor goes down slowly with x, the effort goes up exponentially with x.
Assignments are made up to the point where it's more cost effective to primality test instead of continuing to trial factor. Finding a factor by trial factoring before any primality tests or P-1 factoring saves 2 × (1 + primality-test error rate) primality tests plus the P-1 factoring time, which is typically a few percent of a primality test. The error rate is around 2%, so about 2 × (1 + 0.02) + 0.03 ≈ 2.07 primality-test equivalents. Except that P-1 effort (bounds selection) is optimized for maximum time savings, so actually < 2 × (1 + 0.02), or < 2.04 primality tests.
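That cost/benefit arithmetic can be sketched as follows. This is a rough model only, using the 2% error rate and ~3% P-1 cost quoted above; the level-cost parameter is an assumption, expressed in primality-test equivalents.

```python
TESTS_SAVED = 2 * (1 + 0.02)   # two primality tests, ~2% retest rate
PM1_COST = 0.03                # P-1 is ~3% of a primality test

def expected_savings(x):
    """Expected primality-test equivalents saved by TF of the bit level
    [2^x, 2^(x+1)), using P(find a factor) ~ 1/x."""
    return (1.0 / x) * (TESTS_SAVED + PM1_COST)

def worth_factoring_deeper(x, level_cost):
    """Continue TF only while expected savings exceed the level's cost
    (both measured in primality-test equivalents)."""
    return expected_savings(x) > level_cost
```

At x = 75, for instance, expected_savings is about 2.07/75 ≈ 0.028 primality tests, so going one level deeper pays only while that level costs less than roughly 3% of a primality test.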

3 TF makes use of the special form of factors of Mersenne numbers, f = 2kp + 1, where p is the prime exponent of the Mersenne number and k is a positive integer. See

4 Use a wheel generator for candidate factors, to efficiently exclude, without even considering them, candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2·2·3·5·7 = 420, the less-classes number; 2·2·3·5·7·11 = 4620, the more-classes number.
(There was discussion of going to 13 or higher, but that was considered not worthwhile; 2·2·3·5·7·11·13 = 60060, etc. Higher complexity and overhead trade off against the diminishing returns of incrementally more candidates excluded.) Prime95 uses a simpler wheel of 2·2·3·5 = 60 spokes, which reduces to 16 classes (26.67%). For mfaktx at 420, 96 classes survive (22.86%); at 4620, 960 (20.78%). The next would be 11520/60060 = 19.18% (a potential reduction of 7.7%). (There are two factors of 2 in the number of classes. I think that's due to folding the mod-8 property (#6 below) into the wheel generator; it's not just evens that can be dismissed, but 6 out of 8 mod-8 cases, or 3 out of 4.)
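The class counts above can be checked directly. A minimal sketch, assuming (as described) that a residue class k0 of k mod 420 or k mod 4620 is eliminated when f = 2·k0·p + 1 is divisible by one of the wheel primes or is not 1 or 7 mod 8; these tests are constant across a class because the class modulus is divisible by 4 and by the wheel primes.

```python
def surviving_classes(p, num_classes=4620):
    """Count residue classes of k (mod num_classes) that can contain
    candidate factors f = 2*k*p + 1 for exponent p."""
    wheel_primes = (3, 5, 7, 11) if num_classes == 4620 else (3, 5, 7)
    survivors = 0
    for k0 in range(num_classes):
        f = 2 * k0 * p + 1
        if f % 8 not in (1, 7):                    # f must be +/-1 mod 8
            continue
        if any(f % q == 0 for q in wheel_primes):  # wheel small primes
            continue
        survivors += 1
    return survivors

# For any exponent coprime to 2, 3, 5, 7, 11, e.g. p = 66362159:
# surviving_classes(p, 4620) -> 960 (20.78%)
# surviving_classes(p, 420)  -> 96  (22.86%)
```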

5 For a given exponent, exclude entire classes of candidate factors with a single test. Additionally there is subsequent fast sieving in the surviving classes for small prime factors, to eliminate the bits representing the corresponding k values of small primes beyond the wheel generator level.

6 Make use of the special form of factors of Mersenne numbers: they are 1 or 7 mod 8 (discard, or do not generate, candidate factors that are not). A short proof that f ≡ ±1 mod 8 is at

7 Use of dense representation of the candidate factors, as a bit map of k values.
(That bit map is exponent-specific, so can not be reused for other exponents.)

8 These candidate factors are sieved somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded. Trialing prime candidates is sufficient; trialing composite candidates is redundant.

9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort, than a lesser amount of sieving of candidates. There's user control of sieving depth available to allow adjustment to near optimal for the exponents, depths, and other variables, in mfaktc and mfakto.

10 Candidate factors surviving wheel generation and the restriction to 1 or 7 mod 8 are sieved somewhat. Implementation in mfaktx is by a maximum sieve-prime value, which as I recall is set in mfaktx.ini. The sieving level (at some versions of mfaktc) must not exceed the level at which possible candidate factors would themselves be included among the sieving primes. This makes sieve limit settings dependent on exponent for low-exponent, low-bit-level combinations.
(If I recall correctly, later versions of mfaktc relax this restriction by special handling of certain cases. See #21 below.)

11 Surviving candidate factors are tried. (With MOSTLY prime candidates, and the relatively few composite candidates that passed the speed-optimized incomplete sieving.)

12 The trial method is to generate a representation of the Mersenne number plus 1 (that is, 2^p) modulo the trial factor, by a succession of squarings and doublings according to the bit pattern of the exponent, all modulo the trial factor. If 2^p mod f = 1, then 2^p − 1 mod f = 0, proving f is a factor. The powering method is much, much faster than trial long division for the very sizable numbers of interest near the current GIMPS wavefront, and uses far less memory than a simple approach of representing the full dividend at the beginning; both advantages grow for larger numbers. (Retina points out, though, that there would be no need to create a full representation of p ones at the beginning of long division. Since all the binary digit positions of a Mersenne number are ones, they could simply be appended at the right as needed. Also, we care only about the remainder, not the quotient.) There's a description and brief small-numbers example of the modpow method at , which is called the left-to-right binary method in . It is not necessarily the minimum-time method, but it's reliably pretty good, and fairly simple. The modulo by the trial factor at each step keeps the operands at a manageable size for speed: three 32-bit words, far smaller than the tens or hundreds of millions of bits that naive trial division could involve or a deferred modulo could generate. Trial division would take O(p) operations, while modpow takes only O(log2 p). For p ≈ 90M, p / log2 p ≈ 3,400,000; even stepping through long division 32 bits at a time, the modpow advantage is ~106,400:1, and log2(106,400) ≈ 17. The efficiency of the modpow method allows trial factoring to many more bits of factor depth than even a well-implemented long-division trial would. See point #2. Therefore the probability of finding a factor is increased significantly with the modpow approach.
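A minimal sketch of that squaring-and-doubling trial in Python; Python's arbitrary-precision integers stand in for the three-32-bit-word arithmetic the GPU kernels actually use.

```python
def is_factor(p, f):
    """True if f divides the Mersenne number 2^p - 1, by left-to-right
    binary modpow: square for every exponent bit, double (multiply by
    the base, 2) on 1-bits, reducing mod f at each step so the
    operands stay small."""
    r = 1
    for bit in bin(p)[2:]:      # exponent bits, most significant first
        r = (r * r) % f         # square
        if bit == '1':
            r = (2 * r) % f     # double
    return r == 1               # 2^p = 1 (mod f)  <=>  f | 2^p - 1
```

For example, with p = 11 and candidate f = 23 (k = 1), this confirms 23 divides 2^11 − 1 = 2047 = 23 × 89 without ever forming the full Mersenne number.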

13 On significantly parallel computing hardware, such as GPUs, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores. GPUs are so much more effective, relatively, at TF than at other types of GIMPS computation that the optimal GPU TF bit level is typically a few bits higher than for CPUs, and CPU TF is generally avoided. (See also point 2 again.) Xeon Phi are reportedly quite fast at trial factoring, but not very common, and rather expensive. "Additionally, something like an FPGA would be really fast, but most users don't have one. And ASICs would be even faster, but way too expensive." (kruoli, PM) FPGA or ASIC would likely have yet different TF bit level optima. Guides to suggested TF bit levels versus exponent and GPU model are available for each GPU model listed at ; click on the GPU model of interest. Note that the curves correspond to optimization for one or two primality tests to be performed on the same GPU model as the TF. If the TF will be performed on a GPU with a high TF/DP performance ratio but the primality test on a GPU with a lower TF/DP ratio, the optimal TF level likely lies somewhere between the two. For example, for a 100Mdigit Mersenne number, TF on an RTX 2080 indicates 83 bits but a Radeon VII indicates 80 bits; 81 or 82 may be optimal for the combination of RTX 2080 TF and Radeon VII PRP or LL. RTX2080: Radeon VII:

14 Operation is as multiword integers, sometimes with some bits reserved for carries: for example, 72-bit math via three 32-bit words, each holding 24 data bits with 8 bits for carries.
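A toy illustration of the carry-bits idea, assuming 24 data bits per 32-bit word as described; the exact limb layout in mfaktc's kernels varies by kernel and is not reproduced here.

```python
MASK24 = (1 << 24) - 1

def to_limbs(x):
    """Split a value (< 2^72) into three 24-bit limbs, least significant first."""
    return [x & MASK24, (x >> 24) & MASK24, (x >> 48) & MASK24]

def from_limbs(limbs):
    """Reassemble a value, propagating any deferred carries."""
    carry, value = 0, 0
    for i, w in enumerate(limbs):
        w += carry
        value |= (w & MASK24) << (24 * i)
        carry = w >> 24
    return value

def add_lazy(a, b):
    """Limbwise add with NO carry propagation: the 8 spare bits in each
    32-bit word absorb the overflow, so up to 255 additions can be
    deferred before carries must be resolved."""
    return [x + y for x, y in zip(a, b)]
```

Usage: from_limbs(add_lazy(to_limbs(a), to_limbs(b))) equals a + b, and many add_lazy calls can be chained before normalizing.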

15 Different code sequences are written for different bit-level ranges of trial factor, for best speed in each range. Higher bit levels take longer code sequences, so factors tried per unit time decline at higher bit levels on the same hardware and exponent.
Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc:
(In the above, _GS indicates GPU sieving.)
(mfakto is similar, although it does not include 95-bit code or capability. That's not an issue for the GIMPS search, since the optimal final TF level via GPU is generally 86 bits or less in the range of exponents < 10^9, and 92 bits up to the Mlucas maximum exponent of 2^32. But see also #43 below.)
Judging by the names above, most of these are based on Barrett reduction.

16 "The first Barrett kernel in mfaktc was BARRETT92; all other kernels are stripped-down versions.
From BARRETT92 to BARRETT79 first (fixed inverse; multibit in a single stage possible; a bit faster).
From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps; small example: 22 mod 10 yields 12 instead of 2). Trading accuracy for speed. The same 'tricks' lead from BARRETT79 to BARRETT77 and BARRETT76."

17 "funnel shift" barrett 87

18 The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels mentioned in #15.
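By the prime number theorem, the density of primes near 2^b is about 1/ln(2^b) = 1/(b·ln 2), so moving up one bit level thins the surviving prime candidates only slightly. A quick sketch:

```python
import math

def prime_density(bit_level):
    """Approximate density of primes near 2^bit_level, per the
    prime number theorem: 1 / ln(2^bit_level)."""
    return 1.0 / (bit_level * math.log(2))

# Going from level 75 to 76 thins candidate density by only ~1.3%,
# partially offsetting the longer code sequences of higher bit levels:
thinning = prime_density(76) / prime_density(75)   # = 75/76, about 0.987
```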

19 For GPU applications, there are various implementation approaches for performance.

20 multiple streams and data sets allowing concurrent data transfer and processing

21 On-GPU sieving of factor candidates:
Tradeoffs and possible decisions about high GPU sieving, low exponents

22 There's a small performance advantage to 32-bit builds, when available, when using GPU sieving (32-bit addresses are smaller, placing less demand on memory bandwidth).
However, at CUDA 8 or higher (which the GTX 10xx and newer models require), only 64-bit CUDA is available.

23 How, in the CUDA or OpenCl cases, all those factor candidates get bundled into batches for processing in parallel on many GPU cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time.
Prime95 comments on passing out chunks of work and using memory bandwidth efficiently:
Oliver writes to R Gerbicz about it

24 Could FP be faster than integer math for the kernels? Probably not.
Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent GPU designs would be useful).

25 mfaktc v0.21 announce, and ruminations about possible v0.22

26 ini file parameter tuning advice, - 2508
in this order:

27 There was mention of an mfaktc v0.22-pre2 back in 2015 at
"Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed)."
The current released version is v0.21, which incorporates worktodo.add file support and is slightly faster than v0.20.

28 Experimenting on tuning with a GTX 1080 Ti, RTX 2080, and RTX 2080 Super, on Windows 7, they seem to like the upper end of the tuning variable limits. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. (Recompile required.)
# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128 (except in altered versions allowing 2047)
# Default: GPUSieveSize=64
# GPUSieveProcessSize defines how many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSievePrimes ~94000

29 Some RTX 20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows toward gigabits.
The maximum GpuSieveSize was found to be 2047 Mbits after program modification. This may relate to computing bit positions with signed 32-bit integer math: 2047 Mbits = 2047 × 2^20 = 2,146,435,072 bits, just under the signed 32-bit limit of 2^31 − 1 = 2,147,483,647. Other GPU models that show greater combined throughput with multiple mfaktc instances at the 128 Mbit GpuSieveSize maximum also benefit from the modified code. For example, a GTX 1080 Ti gains several percent throughput running a single modified instance. It appears there would be additional gain for a single instance if the limit were raised to 4095 Mbits by switching to unsigned 32-bit integer math for the bit positions, since even at 2047 Mbits it shows performance improvement compared to somewhat lesser values, and it still benefits from running multiple instances. The RTX 2080 and RTX 2080 Super also show additional throughput of several percent from running multiple instances after available tuning is completed.

30 The decoupling of FP from integer math in the RTX 20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored.

31 According to some, the probability of finding a factor for least effort is optimized by TF up to one bit less than the eventual tradeoff limit, then P-1 with potential computing time savings optimized by the program itself (prime95 or CUDAPm1), and then, if no factor is found in P-1, the last bit level of TF, constituting the remaining half of the TF effort for that exponent. This is somewhat different sequencing than usual PrimeNet assignment.

32 Since GPUs are far more effective at TF than at primality testing or P-1, and far faster at TF than CPUs are, assignments seek to prevent TF on CPUs by performing all useful TF on GPUs. TF on GPUs allows going about 4 bits deeper within the computing-time tradeoff optimization, saving some primality tests for which CPU-optimized TF limits would not have found factors. The optimal number of bits deeper depends on the properties of the GPU models, with the higher ~40:1 TF/primality-test ratio of the RTX 20xx family probably justifying 5 bits deeper for their use. As far as I know, PrimeNet contains no provision for automatically taking them 1 bit higher on RTX 20xx.

33 The time required to perform a single TF attempt (one candidate factor on one exponent) is so brief, and the reliability of modern hardware high enough over such brief periods, that it does not pay to double-check TF. Not double-checking TF lets the TF go one bit level higher, finding more factors. But a sort of double-check is being performed, with a completely different approach, by user TJAOI, as Uncwilly pointed out at . There's a link there to a whole thread about that activity.

34 There is some limited overlap of coverage of factors between TF and P-1, estimated at about 20-25%. The economics of TF and the sequential progress of TF assignments dictate that TF factors found will have at most a certain number of bits (ranging from 75 bits at 84M up to 86 bits at 999M). P-1 can economically find factors with more bits, but requires that the number one less than the factor be rather smooth (composite with many smallish factors). Some factors that TF should find will be smooth, and many will not. Therefore P-1 factoring currently serves as an incomplete double-check on the TF runs. "Stage 2 of P-1 helps to find factors f=2kp+1, where k contains one less smooth factor. If k=ab with b being the biggest prime factor of k, then it is sufficient to do P-1 with B1 = maximum prime factor of a and B2 = b." (kruoli, by PM)

35 Reported factors are confirmed at the PrimeNet server by repeating the computation for that exponent and factor, because that is quick to do and factors are worth confirming. No-factor reports are not confirmed there. Historically, TF was on the honor system, with no security code attached to a result outside prime95, or verification of completion of the work. A recent development is Robert Gerbicz's proposal for a method of incorporating verification of performing the TF work in the no-factor-found case. See and some following discussion on the same thread. At the moment this is only conceptual, not implemented in any software to my knowledge. There's also discussion of other approaches in

36 There is currently no Mersenne TF code for the following interface specifications:
In many cases, this is not a limitation, because the Intel CPU TF code in prime95 and mprime, and the GPU TF code in mfaktc (NVIDIA CUDA) and mfakto (OpenCL on AMD GPUs and some Intel IGPs), provide considerable coverage. However, there are older IGPs or discrete GPUs that do not have the required support for CUDA or OpenCL but do support other interface specifications such as DirectX or OpenGL. The performance of these older devices would likely be low compared to the ~20 GhzD/day of current IGPs, or the hundreds or thousands of GhzD/day of current discrete GPUs.

37 Mfaktc and mfakto are capable of factoring Mersenne numbers with exponents up to 2^32.
Extending them to exponents above 2^32 would require a rewrite of the individual kernels as well as the surrounding code. If this were attempted, it would be advisable to maintain separate source files, or use conditional compilation to produce separate executables, to preserve efficiency for exponents below 2^32. Once done, though slower than the 32-bit-exponent-max versions, these would likely be much faster above 2^32 than Factor5, which has no limit, or Ernst Mayer's Mfactor, which has some limits depending on its build options. There's no hurry; there's a great deal of work to do within the limit of 10^9, and much more from there to 2^32, that will take centuries.

38 "Checkpoints are written between two residue classes by just remembering the last finished residue class." The checkpoint files are small and simple.
Perform version upgrades between bit levels. The checkpoint files include which version was used to produce them, and are not accepted by a later version of the software, which will restart the bit level instead.

39 The number of residue classes (see #4 above) is set at compile time in mfaktc, or in the ini file in Mfakto (choosing between 420 and 4620). Another possibility that has been suggested is to make it dynamic, adjusting to exponent and bit level.
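For illustration, the class-elimination arithmetic behind those two choices can be sketched directly: any factor q of M(p) has the form 2kp+1 with q ≡ ±1 (mod 8), and a whole residue class of k can be skipped when every candidate in it is divisible by 3, 5, 7, or 11. Counting the classes that survive (a sketch; for any prime exponent p > 11 the counts come out to 960 of 4620 and 96 of 420):

```python
def surviving_classes(p, num_classes):
    """Count residue classes k mod num_classes that can contain factors.
    q = 2*k*p + 1 is constant mod 8, and mod each small prime dividing
    num_classes, across a whole class, so one representative per class
    suffices.  Valid for prime exponents p > 11."""
    small_primes = [s for s in (3, 5, 7, 11) if num_classes % s == 0]
    count = 0
    for k in range(num_classes):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):                  # Mersenne factors are +-1 mod 8
            continue
        if any(q % s == 0 for s in small_primes):
            continue                             # whole class divisible by s
        count += 1
    return count
```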

40 Some practical considerations constraining number of classes include
magnitude of spacing between candidate factors (32- vs. 64-bit),
frequency and effect of the GPU work queue running empty, and
effect on throughput of running a full last block per class that extends past the bit level's end.

41 In mfaktc or mfakto, there is screen output once per class. This can range from several classes per second or more, at high exponents and low bit levels, to nearly an hour or more per class at high bit levels on slow GPUs. (Bit levels already reached in trial factoring previously ranged from 64 to 87, a span of 2^23 = 8,388,608:1 in effort; as of early 2021 they ranged from 72 to 87 bits, 2^15 = 65,536:1.) There is no user control of mfaktx screen output frequency other than the approximately 11:1 effect of selecting more-classes or less-classes. In some cases a solid state disk or ramdisk is recommended to prevent logging or checkpointing from diminishing TF throughput unacceptably. A proposed user control for rate of screen output has not been implemented. Using less-classes to somewhat limit output frequency can cost on the order of 1.7% of TF throughput.

42 Since finding factors is more likely with low k values than high k values, the lowest bit levels are run first, for efficiency. For the primary purpose of GIMPS, finding new Mersenne primes, we only care about finding a factor quickly enough to eliminate an exponent p from any further processing, thereby saving computing time overall. For a net savings, the effort of trial factoring must be less than the probability of finding a factor times the combined computing time thereby avoided: further trial factoring, P-1 factoring, and two primality tests. Trying the most-likely factors first maximizes the savings.
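A toy version of that break-even comparison (the 1/b factor-density heuristic and all cost numbers are illustrative assumptions, not PrimeNet's actual model):

```python
def tf_saves_work(bit_level, tf_cost, downstream_cost):
    """True if TF of this bit level is expected to save work overall.
    Heuristically, the chance that a factor lies in [2^b, 2^(b+1)] is
    about 1/b; finding one cancels all downstream effort."""
    p_factor = 1.0 / bit_level
    return tf_cost < p_factor * downstream_cost
```

With, say, a 5 GHzD cost for bit level 75 against 900 GHzD of downstream work, the expected savings (900/75 = 12 GHzD) exceed the cost, so the level is worth running; a 20 GHzD level cost would not be.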

43 The optimal TF bit level described in #2, #13 and #32 depends somewhat on the GPU model, since different GPU models have different ratios of TF to primality-test performance. Recent developments may raise the optimal level above those described: the RTX 20xx series, for example, increased parallelism between FP and integer computations, giving it a much higher performance ratio between TF and primality testing with current code. Separate code and assignments according to GPU model may be necessary to make best use of the properties of the RTX 20xx GPUs and older GPUs.

44 There is a discussion at to ~1514 regarding possibly eliminating the 20 to 25% overlap of P-1 and TF described in #34. If doing so had little or no overhead, this could provide a considerable speedup in TF, which may speed up the total per-exponent effort in GIMPS by <~0.03 x 0.25 / 2, or <~0.375%. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P-1 after all TF is done. See also #31, regarding P-1 being run before the last, highest TF level, which may offer an opportunity for saving <~0.19% if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P-1. Taking full advantage of this probably involves
modifying PrimeNet TF and P-1 assignment policy, in both level and sequence,
modifying code in the P-1 applications (prime95, CUDAPm1, GpuOwL, Mlucas v20.0 or above) for determining optimal P-1 bounds,
modifying TF applications code (mfaktc, mfakto, mprime/prime95; gpuowl TF branch, Mfactor) to efficiently skip the relevant factor candidates, and
modifying gpu-specific TF level versus exponent charts and other project documentation.
It may also involve changing or adding parameters to the worktodo lines or ini files or command line options to indicate whether the TF run is to include or exclude the smooth factor candidates. (Different use cases for TF may require different behavior.) New TF worktodo line structure implies modification of the parsing code in each TF application. New TF worktodo line structure also implies modifying some of the several client management programs (see Making use of this also probably involves changes to TF result report records to indicate whether the smooth factors were tried or skipped, including PrimeNet and manual reports and assigned and unassigned work, impacting PrimeNet and its API, web manual report page processing, mfaktc, mfakto, prime95, gpuowl, Mfactor, and some of the client management programs. In total, over a dozen code authors would need to be on board with the change. Not all are currently actively maintaining what they wrote.
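To make "smooth factor candidates" concrete: P-1 with bounds B1 and B2 finds a factor q = 2kp+1 of M(p) exactly when every prime factor of k is at most B1, allowing at most one in (B1, B2] (the 2 and the p in q-1 = 2kp come for free in the exponentiation). A sketch of the test a TF program could apply to skip candidates already covered by a prior P-1 run:

```python
def prime_factors(n):
    """Trial-division factorization; adequate for the small k values involved."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def pminus1_would_find(k, b1, b2):
    """True if P-1 on M(p) with bounds b1, b2 finds the factor q = 2*k*p + 1:
    all prime factors of k <= b1, with at most one allowed in (b1, b2]."""
    over_b1 = [f for f in prime_factors(k) if f > b1]
    return not over_b1 or (len(over_b1) == 1 and over_b1[0] <= b2)
```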

45 from re efficiency versus assignment size (roughly 2^bitlevel / exponent)
mfakto (and mfaktc) are more efficient with larger assignments. This does not mean larger exponents, but "more work" or "longer runtime per class", generally bigger bit ranges. The reason is a certain one-time initialization effort per class - no matter if the class will just test 1 million factor candidates or 1 billion. In the first case, the one-time effort may account for, say, 75% of the sieving effort, in the latter case just 0.3%. Plus there is an average of half a block wasted for each class. If the class consists of only one block, that's 50%, for 1000 blocks it's just 0.05%.
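The half-a-block-wasted figure above is easy to quantify:

```python
def wasted_fraction(blocks_per_class):
    # on average, half of the last block in each class extends past the
    # end of the bit level and is wasted work
    return 0.5 / blocks_per_class

assert wasted_fraction(1) == 0.5        # one-block class: 50% wasted
assert wasted_fraction(1000) == 0.0005  # 1000 blocks: just 0.05%
```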
46 Results are reported by GPU applications as exponent, factor(s) found, optionally username and computerid, program and version and kernel used, optionally time and date stamp, or
exponent, no factor from bit level a to b, optionally username and computerid, program and version and kernel used, optionally time and date stamp (but not in that order). Other possible variables such as factor candidate sieving limit are not reported.
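As an illustration, result lines of the shapes described can be picked apart with a couple of regular expressions. The patterns below assume the typical mfaktc phrasing ("M<exp> has a factor: <q>" and "no factor for M<exp> from 2^a to 2^b"); exact wording varies by program and version.

```python
import re

# Assumed line shapes (actual wording varies by program and version):
#   "M<exp> has a factor: <q>"
#   "no factor for M<exp> from 2^<a> to 2^<b> [<program> <version> <kernel>]"
FACTOR_RE = re.compile(r"M(\d+) has a factor: (\d+)")
NO_FACTOR_RE = re.compile(r"no factor for M(\d+) from 2\^(\d+) to 2\^(\d+)")

def parse_result(line):
    """Pick the exponent, factor, and bit range out of a TF result line."""
    m = FACTOR_RE.search(line)
    if m:
        return {"exponent": int(m.group(1)), "factor": int(m.group(2))}
    m = NO_FACTOR_RE.search(line)
    if m:
        return {"exponent": int(m.group(1)),
                "from_bits": int(m.group(2)),
                "to_bits": int(m.group(3))}
    return None
```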

47 As #46 above implies, sometimes more than one factor will be present within a bit level. The behavior of mfaktx in such cases is controlled by this part of mfaktc.ini:
[CODE]# possible values for StopAfterFactor:
# 0: Do not stop the current assignment after a factor was found.
# 1: When a factor was found for the current assignment stop after the
#    current bitlevel. This makes only sense when Stages is enabled.
# 2: When a factor was found for the current assignment stop after the
#    current class.
# Default: StopAfterFactor=1[/CODE]
It's uncommon but finding two factors in a single bit level does occur.

48 More than one factor per class within a bit level for an exponent also occurs. Its overall impact on the GIMPS project is small. See
Up to 10 factors per class are supported, by use of a small array for factors found. That would be very hard to test for. More than 4 per class/bitlevel/exponent is estimated to be quite rare.
For some examples of exponents with multiple factors found in one class in one bit level see

See also; what's described here is referred to there as "search by p".
(Thanks, to TheJudger, kruoli and others who have reviewed this large list or responded to PMs and offered constructive feedback.)

Top of reference tree:

Last fiddled with by kriesel on 2021-09-14 at 02:12 Reason: updated #44 for Mlucas V20; minor style edits
Old 2019-08-12, 15:29 #7 by kriesel

Mfaktc -h help output

mfaktc v0.21 Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015  Oliver Weihe (
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.

Usage: mfaktc-win-64 [options]
  -h                     display this help and exit
  -d <device number>     specify the device number used by this program
  -tf <exp> <min> <max>  trial factor M<exp> from 2^<min> to 2^<max> and exit
                         instead of parsing the worktodo file
  -st                    run builtin selftest and exit
  -st2                   same as -st but extended range for k_min/m_max
  -v <number>            set verbosity (min = 0, default = 1, more = 2, max/debug = 3)

options for debuging purposes
  --timertest            run test of timer functions and exit
  --sleeptest            run test of sleep functions and exit

Top of reference tree:

Last fiddled with by kriesel on 2019-11-18 at 14:04
Old 2019-09-29, 15:15 #8 by kriesel

Mfaktc tuning for maximum throughput

Per item 26,
tune mfaktc.ini gpu sieve parameters in this order:

For faster gpu models, a recompiled version of mfaktc allowing GpuSieveSize up to 2047 can produce up to 5% greater throughput over GpuSieveSize=128 for GTX1080Ti, more than 10% for RTX 2080 Super.

After tuning single-instance throughput, faster gpu models can benefit from running multiple instances on the same gpu simultaneously. The optimal number of instances appears to depend on both gpu model and cpu model. The potential gain is greater with faster gpus, less with slower gpus. I think the gain is that when one instance is waiting on display and disk writes, another can keep the gpu working. This has been observed on Win7 or Win10 systems with conventional rotating disks. Solid state disk or ramdisk may reduce or eliminate this multi-instance advantage.

Whether the optimal tune for multiple instances differs from the optimal tune for single instances is untested currently.
Note also the optimal tune is expected to slowly change with exponent and bit level.

See also
CUDA level and driver version seem to have little or no effect.
More_classes is more efficient than less_classes, unless the classes are so fast that it becomes somewhat I/O bound. Most exponents and bit levels remaining between the wavefront and the limit are better run in more_classes.

A tuning example for GTX1080Ti gpu is attached. A second example with two rounds of further tuning for the modified GpuSieveSize up to 2047 version is also attached. It gains a little more performance and a GPU-Z indicated 100% utilization with 3 instances.
Final tune check of an RTX2080 gpu using GPUSieveSize up to 2047 is also attached. It gains a little more performance and a GPU-Z indicated 100% utilization with 3 instances.
Both models indicate possible further gains if GpuSieveSize were not limited to 2^31-1 bits by the signed 32-bit bit-position computation.
The newest gpu models respond substantially to tuning, of order 3 to 10% gain.

Lower speed gpus show less response to tuning, and their optimal tune tends to be at somewhat lower values of tune parameters.
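Incidentally, the signed 32-bit bit-position limit is exactly why 2047 Mbit is the GpuSieveSize ceiling mentioned earlier: 2047 Mbit is the largest whole-megabit sieve whose bit positions still fit in a signed 32-bit integer.

```python
MBIT = 2 ** 20                       # GpuSieveSize is specified in Mbit
INT32_MAX = 2 ** 31 - 1              # largest signed 32-bit bit position
assert 2047 * MBIT == 2146435072     # fits under INT32_MAX
assert 2047 * MBIT <= INT32_MAX
assert 2048 * MBIT > INT32_MAX       # a full 2 Gbit sieve would overflow
```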

Top of reference tree:
Attached Files
File Type: pdf gtx1080ti mfaktc tuning.pdf (18.1 KB, 360 views)
File Type: pdf gtx1080ti tuning 2047 limit.pdf (31.6 KB, 353 views)
File Type: pdf rtx2080super tuning.pdf (16.9 KB, 311 views)
File Type: txt gtx1050ti-tune.txt (2.1 KB, 271 views)

Last fiddled with by kriesel on 2020-08-23 at 15:29 Reason: updated for GTX1050Ti
Old 2020-04-17, 03:50 #9 by kriesel

gr-mfaktc

There is a variant of mfaktc which also supports bases other than 2, called gr-mfaktc (generalized repunits mfaktc). It is somewhat slower for base 2 than mfaktc, so not recommended for Mersenne numbers. See

Top of reference tree:

Last fiddled with by kriesel on 2020-04-17 at 03:51
Old 2022-05-15, 13:21 #10 by kriesel

Mfaktc tuning for best compute efficiency

"Best" here is assumed to be the maximum GHD/day per watt of GPU input power.
This fits my use case reasonably well since almost all systems are also running prime95.

If the system is only running because of GPU factoring, the system's power consumption probably should be used instead. (Measured with an inline AC wattmeter.) One could argue that system power should be prorated among CPU and GPU applications in that case, but I have chosen not to prorate, for simplicity.

Nvidia-smi supports setting a maximum power level for many NVIDIA GPUs:
[CODE]nvidia-smi -i n -pl w[/CODE]
where n is the GPU number and w is the wattage limit. Run nvidia-smi with no parameters to see a list of CUDA GPUs, current load and power, etc.

After tuning for maximum performance without power limits, I measured GHD/d at a variety of power limits for several GPUs.

A GTX 1080 Ti reached max GHD/day/watt at 163w, middle of its allowed power range, while GTX 1650, GTX 1650 Super, RTX 2080, and RTX 2080 Super showed max GHD/d/w at their respective minimum power settings.
It may be possible to achieve higher GHD/d/watt by detuning from maximum-performance mfaktc.ini settings. I did not try any of that. Be careful with some settings: disabling GPU sieving would create a false indication of higher efficiency by displacing the factor-candidate sieving work onto the CPU.

DrobinsonPE advises that using -lgc instead of -pl allows a greater range of power adjustment, reaching higher efficiency at lower frequency and power than -pl provides.
Performance is also likely to be more stable at a locked clock frequency.

And for the same average power, since power is nonlinear with frequency, limiting frequency probably yields higher performance than letting frequency fluctuate during power limiting.
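A toy numeric model of why a locked clock should beat a fluctuating one at the same average power (the f^3 power scaling is an illustrative assumption, not measured GPU behavior):

```python
# Toy model: dynamic power scales roughly as f^3 (P ~ f*V^2 with voltage
# roughly proportional to frequency), while TF throughput scales as f.
def avg(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

fluctuating = [1.0, 2.0]                         # GHz, half the time in each state
same_power = avg([f ** 3 for f in fluctuating])  # 4.5 power units on average
locked = same_power ** (1 / 3)                   # constant clock at same average power
assert locked > avg(fluctuating)                 # ~1.65 GHz beats a 1.5 GHz average
```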
But, clock control seems not to apply to GTX 10xx, at least on Windows 7, driver version 378.66.
  • -lgc is unsupported
    "c:\Program Files\NVIDIA Corp
    ion\NVSMI\nvidia-smi.exe" -i 1 -lgc 1600
    Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.
  • -h output does not include clock rate support, in contrast to driver version 442.19, which indicates
        -lgc   --lock-gpu-clocks=    Specifies <minGpuClock,maxGpuClock> clocks  as a
                                        pair (e.g. 1500,1500) that defines the range
                                        of desired locked GPU clock speed in MHz.
                                        Setting this will supercede application clocks
                                        and take effect regardless if an app is running.
                                        Input can also be a singular desired clock value
                                        (e.g. <GpuClockValue>).
  • -d SUPPORTED_CLOCKS indicates no clock rates supported:
    "c:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q -d SUPPORTED_CLOCKS
    ==============NVSMI LOG==============
    Timestamp                           : Wed Jul 13 23:17:04 2022
    Driver Version                      : 378.66
    Attached GPUs                       : 2
    GPU 0000:03:00.0
        Supported Clocks                : N/A
    GPU 0000:28:00.0
        Supported Clocks                : N/A
A bit of experimentation on Windows 10, driver version 456.71 indicates GTX1080 does not support -lgc but RTX 2080 does. The RTX2080 returns N/A however for -q -d SUPPORTED_CLOCKS. Clock and power limits act somewhat independently.

Top of reference tree:
Attached Files
File Type: pdf tuning for power efficiency.pdf (30.8 KB, 31 views)

Last fiddled with by kriesel on 2022-07-14 at 14:14