20180528, 18:38  #1 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3^{2}·191 Posts 
Mfaktcspecific reference material
This thread is intended as a home for reference material specific to the mfaktc program.
(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383. Offtopic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.) Mfaktc howto The following assumes you've already read the generic "How to get started in gpu computing for GIMPS" portion of https://www.mersenneforum.org/showpo...89&postcount=1, and duplicates little of that here. Download a suitable version of mfaktc from here https://download.mersenne.ca/mfaktc/mfaktc0.21 You might as well get one of the largerGpuSieveSizeenabled ones allowing up to 2047 Mbits if one is available. If you can't find a suitable version there, you may be able to get one built by one of the participants on the mfaktc thread. You'll also need the appropriate CUDArt library. CUDA Dlls are in the CUDA_dlls folder of https://download.mersenne.ca/ (If all else fails, CUDA dlls or .so files can be obtained by downloading and installing the applicable OS and CUDA version’s CUDA SDK, then locating the files and adding the few needed files to your path or to your working directory on other systems. Download the current version from https://developer.nvidia.com/cudadownloads Older versions can be obtained from the download archive, at https://developer.nvidia.com/cudatoolkitarchive These are typically gigabytes in the newer versions; 2.8GB @ CUDA v11.3. Get the installation guide and other documentation pdfs also while you're there. Note, not all OS version and CUDA version combinations are supported! See CUDA Toolkit compatibility vs CUDA level https://www.mersenneforum.org/showpo...1&postcount=11 and good luck!) Install in a suitable useraccountowned subfolder. Set the needed folder permissions so suitable permissions are inherited by files created there. Modify the mfaktc.ini file to customize it to your GPU model, primenet account name, and system name. (I usually incorporate system name, GPU identification, and instance number together. Something like systemname/gpuidwn; condor/gtx1060w2 for example.) Correct the comments in the standard ini file to reflect the increased GpuSieveSize maximum. You may want to try GPUZ as a utility on your Windows system to see an indication of what the computer thinks is installed for your GPU (CUDA opencl openGl etc), graphically monitor GPU parameters, maybe even log them if you want. One of many utilities listed in https://www.mersenneforum.org/showpo...74&postcount=6 which also lists some Linux alternatives. It can be handy while getting a GPU application going. When it's not needed shut it down along with other idle applications to reduce overhead that's costing performance. Now is a good time to run mfaktc with h >>help.txt in your working directory. Run once, refer to as often as needed. Run the self test: mfaktc st >>selftest.txt Or the longer one: mfaktc st2 >>selftest.txt Create a Windows batch file or Linux shell script with a short name. I sometimes use similar to the following, named mf.bat. Adjust device name, device number, mfaktc version etc for your situation. Code:
title %cd% mfaktc gtx1080Ti if not exist help.txt mfaktcmorecuda8064 h >help.txt if not exist selftest.txt mfaktcmorecuda8064 d 0 st >selftest.txt mfaktcmorecuda8064 d 0 >>mfaktcrun.txt Consider redirecting console output to files, as in the batch file example above, or employing a good tee program. Create a desktop shortcut for easy launch of the batch file or script. Shortcut command line should be on Windows, of form cmd /k: <fullsystempath>cmd.exe /k <full mfaktc folder path>\<batchfilename> and the directory in which it runs <full mfaktc folder path>. (Eventually, for multiple instances or multiple GPUs this could launch a routine that invokes the individualinstance files with short time delays between, so you have a few seconds to see whether each launched correctly or a bug occurred.) Check the results. Resolve any reliability issues before proceeding to real GIMPS work. Create a worktodo file and put some assignments in there. Start with few, in case your GPU or IGP does not work out. Get the type you plan to run the most. Get them from https://www.mersenne.org/manual_gpu_assignment/ Results are reported manually at https://www.mersenne.org/manual_result/ Run one instance with default settings. Modify tuning in the mfaktc.ini file, stop and restart the program to make it take effect, and document performance for each modification. Tune one variable at a time. For best accuracy and reproducibility, minimize interactive use during a benchmark period. Averaging many timing samples for the same tune parameter set in a spreadsheet improves timing accuracy. See Mfaktc tuning for optimal throughput for tuning advice. An example of tuning for an RTX 2080 Super is here. There are other GPU models' tuning shown also near it in the same thread. When substantially changing type of work, such as switching between 100M tasks and 100Mdigit tasks, or significantly changing bit levels, especially when changing kernels results, retuning is suggested. (You may want to save different tunes for different exponent and bit level ranges, for later reuse.) For best performance use a SSD, tune for singleinstance first, and tune for maximum throughput by experimenting with multiple instances last. (I suggest keeping the work of simultaneous instances similar. If they are too different, different kernels may reduce throughput. Test for that.) Multiple instances sometimes give slightly higher sustained throughput, and keep the GPU working if one instance has a problem, runs out of work, or is stopped briefly to replenish work and report results. If a single instance is optimal, chain the batch file to another to keep the GPU productive when one worktodo file empties or a bug terminates the program. Slow GPUs benefit less from large GpuSieveSize and multiple instances; fast GPUs benefit more. Get familiar with the nvidiasmi command line tool at some point. That's more efficient for when you get into production mode. It's probably hiding somewhere like the following. I usually make a batch file with a short name nv.bat (and suggest doing the Linux equivalent) Code:
:loop "c:\Program Files\NVIDIA Corporation\NVSMI\nvidiasmi.exe" pause goto loop Nvidiasmi can also control power and other parameters. Power savings can be quite considerable; 50% reduction in power from the nominal maximum seems to still provide better than 80% of maximum throughput. Beware, nvidiasmi gpu number may not match mfaktc device number in the case of multiple gpus per system. After getting the program functioning manually, you can consider continuing to operate it that way, or trying one of the client management software described at Available Mersenne prime hunting client management software, each of which have their own install requirements, or GPU to 72 https://mersenneforum.org/forumdisplay.php?f=95 (Note much of the preceding was derived from or copied from https://mersenneforum.org/showthread.php?t=25673) Table of Contents:
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20221003 at 04:56 Reason: reordered content to separate exe from dlls 
20180528, 18:41  #2 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1ADC_{16} Posts 
Mfaktc runtime scaling versus exponent and bit depth
For computing run time estimate for a given GPU model and assignment, see the concise method given here.
Timings for an assortment of exponents and bit depths for a GTX480 are tabulated and charted, for reference, in the attachment. Note, only one trial per combination was tabulated, so no measure made or indication given of reproducibility run to run for same inputs. This is a somewhat different way of looking at it than the factoring benchmarks at http://www.mersenne.ca. The trend of how run time changes with bit level and exponent will apply to other GPU models also; time ~ c 2^{bitlevel}/exponent. The "proportionality constant" c will vary with GPU model greatly, and somewhat with mfaktc or mfakto kernel required for the bit level, at least. Environmental conditions (room temperature), other workload on the GPU, GPU overclocking, etc. will also affect run time somewhat. The kernelslowdownforlargebitlevel effect is modest, with more complex slower kernels needed, and optimal maximum bit level proportional to ~log exponent, while the factor density effect is significant, 1/exponent, and the factorvaluespan/bit level effect very significant, at ~2^{bitlevel1}. Tuning of the application specific to the exponent range and bit level will sometimes provide some performance improvement. Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20220422 at 14:48 Reason: Link to concise estimation method post, expand content 
20180528, 19:24  #3 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1101011011100_{2} Posts 
Mfaktc bug and wish list
Known and suspected bugs and desired features or changes are listed. Some cross references to other gpu programs are included. Some issues may be hardware related or due to other causes.
As always, such lists are in appreciation of the various authors' past efforts, contributions, and skills, and intended to help improve both the software and the users' experience. Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20220422 at 14:41 Reason: updated attachment to include backup checkpoint file 
20180531, 03:12  #4 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×3^{2}×191 Posts 
Mfaktc throughput variables
Investigating Mfaktc consistently indicating 98% GPU usage in GPUZ, rather than the 100% that CUDAPm1 and CUDALucas typically produce, there are slight gains in total throughput to be had by running multiple instances. There are confounding factors, including:
Definitely: Exponent being run (substantial effect) Factoring depth being run (minor effect) Overhead of running the screen display with the same GPU (varying effect) How many instances are running per gpu GPU sieving parameters: GPUSieveProcessSize, GPUSieveSize, GPUSievePrimes; tuning these can provide a several percent boost in throughput Frequency of checkpoint saving (minor effect) Probably: GPU clock throttling due to thermal limits How similar the work of multiple instances are (similar is better, very dissimilar may cause a net slowdown. Running very different bit levels that cause use of different kernels seems likely to cause this. Very different exponents may.) Potentially, at high exponent or instance count: memory constraints (although mfaktc doesn't use much memory per instance at current wavefront exponents) Load sharing for multiple instances appears to be not quite equal; they have slightly differing throughputs. An operational advantage of running multiple instances is being able to stop one, add work, look at results, change ini file settings, etc, without noticeable loss of throughput; the others will essentially saturate the GPU while one instance is out of action temporarily. If one instance runs out of work the other running instances run a bit faster than if it hadn't, limiting lost total throughput. On a GTX480, some slight effect of CUDA level (test your own gpu models; there are reports GTX10xx like v8.0 better than lower versions) V0.20 CUDA 6.5 346.38 GhzD/d fastest 100% V0.21 Less Classes CUDA 4.2 340.18 98.2% V0.21 Less Classes CUDA 8.0 340.43 98.3% Running the less classes versions could cost ~0.9 week per year of throughput. Note, v0.20 checkpoint file was ignored by either flavor of lessclasses, 3 hours throughput was lost when switching version, by the LessClasses versions of mfaktc starting the bit depth over at each version change. I was able to get it back by reverting to V0.20 CUDA6.5 64bit version. These results were seen on a system with an actual spinning hard drive. Gain from multiple instances may be less with a solid state disk or ramdisk. The effect of lower than 100% usage with a single instance is stronger with newer faster gpus such as the GTX 1080 or GTX 1080 Ti. The GTX 1080 Ti under test is running at 90% gpu load with one instance, 95% with two instances. (Not the lessclasses version.) The above is observed using the standard Mfaktc release. The modified versions allowing GpuSieveSize up to 2047M when well tuned reduce but do not eliminate the throughput advantage of running multiple instances. Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20200701 at 17:12 Reason: added comments on similar work 
20181009, 21:40  #5 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3^{2}·191 Posts 
Volta and Turing support
For versions compiled for CUDA 10, see https://www.mersenneforum.org/showpo...postcount=2929
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20191118 at 14:03 
20190214, 03:10  #6  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3^{2}·191 Posts 
Concepts in GIMPS trial factoring (TF)
(note, sort of mfaktc oriented, more so toward the end. Where description applies to both mfaktc and mfakto, they are referred to as mfaktx)
Following is my evolving attempt to list and describe the many concepts and optimization considerations relevant to GIMPS trial factoring, in a way that assumes little or no prior knowledge. See also (perhaps first) for some background that will be useful in reading the following, https://www.mersenneforum.org/showpo...35&postcount=3 1 Trial factoring work is generally assigned by exponent and bit levels; Mersenne exponent, bit level already trial factored to or where to start, and bit level to factor up to. Nearly all TF is now done on GPUs not CPUs. Assignments and result reporting are typically automatic for mprime/prime95, but manually or through separate programs for gpu applications. Manual assignment (reservation) links and result submission links are available in the attachment at https://www.mersenneforum.org/showpo...11&postcount=9 See https://www.mersenneforum.org/showpo...92&postcount=3 for a list of client management programs that get assignments and/or submit results. 2 Each bit level of TF is about as much computing effort as all bit levels below it, for the same exponent. Probability of finding a factor by trial factoring between 2^{x} and 2^{x+1} is approximately 1/x. https://www.mersenne.org/various/math.php So while the probability of finding a factor goes down slowly with x, the effort goes up exponentially with x. Assignments are made up to the point where it's more cost effective to primality test instead of continuing to trial factor. Finding a factor by trial factoring before any primality tests or P1 factoring saves 2 x (1 + primality test error rate) primality tests plus the P1 factoring time, which is typically a few percent of a primality test. Error rate is around 2%, so about 2(1+.02)+.03 =~ 2.07 primality tests equivalent. Except that P1 effort (bounds selection) is optimized for maximum time savings, so actually <2(1+0.02) or <2.04 primality tests. 3 TF makes use of the special form of factors of Mersenne numbers, f = 2 k p +1, where p is the prime exponent of the Mersenne number, k is a positive integer. See https://en.wikipedia.org/wiki/Mersen...rsenne_numbers 4 Use a wheel generator https://en.wikipedia.org/wiki/Wheel_factorization for candidate factors, to efficiently exclude and not even consider candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2 2 3 5 7 = 420, the Lessclasses number; 2 2 3 5 7 11 = 4620, the moreclasses number. https://www.mersenneforum.org/showpo...1&postcount=35 https://www.mersenneforum.org/showpo...7&postcount=37 (There was discussion of going to 13 or higher, but that was considered not worthwhile. 2 2 3 5 7 11 13 = 60060, etc. Higher complexity and overhead, trading off against diminishing returns of incremental number of candidates excluded.) Prime95 uses a simpler 2 2 3 5 = 60 wheel spokes, which reduces to 16 classes (26.67%). For Mfaktx at 420, 96 survive (22.86%) or at 4620, 960 (20.78%). The next would be 11520/60060=19.18%. (A potential reduction of 7.7%) (Tthere are two factors of 2 in the number of classes. I think that's due to including #6 into the wheel generator; it's not just evens that can be dismissed, but 6 out of 8 mod 8 cases, or 3 out of 4.) 5 For a given exponent, exclude entire classes of candidate factors with a single test. Additionally there is subsequent fast sieving in the surviving classes for small prime factors, to eliminate the bits representing the corresponding k values of small primes beyond the wheel generator level. https://www.mersenneforum.org/showpo...7&postcount=37 6 Make use of the special form of factors of Mersenne numbers, that they are 1 or 7 mod 8. (Discard or do not generate candidate factors that are not 1 or 7 mod 8). A short proof for f=+/1 mod 8 is at https://primes.utm.edu/notes/proofs/MerDiv.html. 7 Use of dense representation of the candidate factors, as a bit map of k values. https://www.mersenneforum.org/showpo...4&postcount=72 (That bit map is exponentspecific, so can not be reused for other exponents.) 8 These candidate factors are sieved, somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded. Trial with prime candidates is sufficient, with composite candidates redundant. 9 Exhaustive sieving of candidate factors can produce less throughput per total computing effort, than a lesser amount of sieving of candidates. There's user control of sieving depth available to allow adjustment to near optimal for the exponents, depths, and other variables, in mfaktc and mfakto. https://www.mersenneforum.org/showpo...&postcount=180 10 Candidate factors surviving wheel generation and restriction to 1 or 7 mod 8 are sieved somewhat. Implementation in mfaktx is by a maximum sieve prime value, that as I recall is set in mfaktx.ini. Sieving level (at some version of mfaktc) must not exceed the level where possible candidate factors are included in sieving values. This makes sieve limit settings dependent on exponent for low exponent low bit level combinations. https://www.mersenneforum.org/showpo...&postcount=788 (If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See in #21 below) 11 Surviving candidate factors are tried. (With MOSTLY prime candidates, and the relatively few composite candidates that passed the speedoptimized incomplete sieving.) 12 The trial method is to generate a representation of the Mersenne prime plus 1 (2^{p}), modulo the trial factor, by a succession of squarings and doublings according to the bit pattern in the exponent, modulo the trial factor. If 2^{p} mod f =1, 2^{p}1 mod f = 0 proving f is a factor. The powering method is much much faster than trial long division for the very sizable numbers of interest near the current GIMPS wavefront, and uses far less memory than a simple approach of representing the full dividend at the beginning, and more so on both counts for larger numbers. (Retina points out though at https://www.mersenneforum.org/showpo...&postcount=478 that there would be no need to create a full representation of p ones at the beginning of long division. Since all the binary digit positions of a mersenne number are ones they could simply be added at the right as needed. Also, we care only about the remainder, not the quotient.) There's a description and brief smallnumbers example of the modpow method at https://www.mersenne.org/various/math.php which is called left to right binary method in https://en.wikipedia.org/wiki/Modular_exponentiation. It is not necessarily the minimumtime method, but it's reliably pretty good, and fairly simple. The modulo trialfactor at each step keeps the operands at a manageable size for speed: 3 32bit words, far smaller than the tens of millions or hundreds of millions of bits that naive trialdivision could involve or deferred modulo could generate. Trial division would take (O)p operations, while modpow takes only (O)log_{2}p. For p~90M, p/log_{2}p ~3,400,000. If stepping through long division 32 bits at a time, ~106400; log_{2}106400 ~17. The efficiency of the modpow method allows trial factoring to many more bits factor depth than even a well implemented long division trial would. See point #2. Therefore the probability of finding a factor is increased significantly with the modpow approach. 13 On significantly parallel computing hardware, such as GPUs, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores. GPUs are so much more relatively effective at TF than other types of GIMPS computation, that the optimal GPU TF bit level is typically a few bits higher than for CPUs, and CPU TF is generally avoided. (See also point 2 again.) Xeon Phi are reportedly quite fast at trial factoring, but not very common, and rather expensive. "Additionally, something like an FPGA would be really fast, but most users don't have one. And ASICs would be even faster, but way too expensive." (kruoli PM) FPGA or ASIC likely would have yet different TF bit level optima. Guides to suggested TF bit levels versus exponent and GPU model are available for each GPU model listed at https://www.mersenne.ca/cudalucas.php; click on the GPU model of interest. Note that the curves correspond to optimization for one or two primality tests to be performed on the same GPU model as the TF. If the TF will be performed on a high TF/DP performance ratio GPU but the primality test performed on a lower TF/DP ratio GPU, the optimal TF is likely somewhere between the two levels. For example, for a 100Mdigit mersenne number, TF on RTX2080 indicates 83 bits but Radeon indicates 80 bits; 81 or 82 may be the optimal for the combination RTX 2080 & Radeon VII PRP or LL. RTX2080: https://www.mersenne.ca/cudalucas.php?model=744 Radeon VII: https://www.mersenne.ca/cudalucas.php?model=752 14 Operation is as multiword integers, sometimes with some bits used for carries. https://www.mersenneforum.org/showpo...3&postcount=17 72bit math via 3 32bit words with 24 bits data, 8 bits for carries. https://www.mersenneforum.org/showpo...&postcount=159 15 Different code sequences are written for different bit level ranges of trial factor, for best speeds for various bit level ranges. Higher bit levels take longer code sequences, so factors tried per unit time declines at higher bit levels on the same hardware and exponent. https://www.mersenneforum.org/showpo...&postcount=231 Cursory examination of current source code shows the following 17 distinct sequences identified in mfaktc: _71BIT_MUL24 _75BIT_MUL32 _95BIT_MUL32 BARRETT76_MUL32 BARRETT77_MUL32 BARRETT79_MUL32 BARRETT87_MUL32 BARRETT88_MUL32 BARRETT92_MUL32 _75BIT_MUL32_GS BARRETT76_MUL32_GS BARRETT77_MUL32_GS BARRETT79_MUL32_GS BARRETT87_MUL32_GS BARRETT88_MUL32_GS BARRETT92_MUL32_GS _95BIT_MUL32_GS (In the above, _GS indicates gpusieving. https://www.mersenneforum.org/showpo...postcount=3082) (mfakto is similar although it does not include 95bit code or capability. That's not an issue for the GIMPS search since optimal final TF level via GPU is generally 86 bits or less in the mersenne.org range of exponents <10^{9}; 92 bits to Mlucas range max 2^{32}. But see also #43 below) Judging by the names above, most of these are based on Barrett reduction. https://en.wikipedia.org/wiki/Barrett_reduction 16 "First barrett kernel in mfaktc was BARRETT92, all other kernels are stripped down versions. From BARRETT92 to BARRETT79 first (fixed inverse, multibit in single stage possible, a bit faster) From there we go from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps (small example 22 mod 10 yields 12 (instead of 2))). Trading accuracy for speed. The same 'tricks' lead from BARRETT79 to BARRETT77 and BARRETT76." https://www.mersenneforum.org/showpo...postcount=3080 17 "funnel shift" barrett 87 https://www.mersenneforum.org/showpo...postcount=2243 18 The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels mentioned in #15. 19 For GPU applications, there are various implementation approaches for performance. https://www.mersenneforum.org/showpo...9&postcount=29 20 multiple streams and data sets allowing concurrent data transfer and processing https://www.mersenneforum.org/showpo...&postcount=152 21 OnGPU sieving of factor candidates: https://www.mersenneforum.org/showpo...&postcount=554 Tradeoffs and possible decisions about high GPU sieving, low exponents https://www.mersenneforum.org/showpo...postcount=2409 https://www.mersenneforum.org/showpo...postcount=2999 22 There's a small performance advantage to 32bit images when available, when using GPU sieving. (32bit addresses are smaller, placing less demand on memory bandwidth) https://www.mersenneforum.org/showpo...postcount=1981 However, at CUDA8 or higher (which means GTX10xx or newer models), only 64bit CUDA is available. 23 How, in the CUDA or OpenCl cases, all those factor candidates get bundled into batches for processing in parallel on many GPU cores is very hazy for me. Presumably it involves using as many of the cores as possible as much of the time as possible, implying work batches of equal size / run time. Prime95 comments on passing out chunks of work and using memory bandwidth efficiently: https://www.mersenneforum.org/showpo...postcount=1634 Oliver writes to R Gerbicz about it https://www.mersenneforum.org/showpo...postcount=3020 24 Could FP be faster than integer math for the kernels? Probably not https://www.mersenneforum.org/showpo...postcount=2377 Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent gpu designs would be useful) https://www.mersenneforum.org/showpo...postcount=2380 25 mfaktc v0.21 announce, and ruminations about possible v0.22 https://www.mersenneforum.org/showpo...postcount=2492 26 ini file parameter tuning advice, https://www.mersenneforum.org/showpo...postcount=2505  2508 in this order: GPUSieveProcessSize GPUSieveSize GPUSievePrimes 27 There was mention of an mfaktc v0.22pre2 back in 2015 at https://www.mersenneforum.org/showpo...postcount=2547 "Removed old stuff (CC 1.x code, CUDA compatibility < 6.5 dropped, minor changes and bugfixed)." https://www.mersenneforum.org/showpo...postcount=3080 The current released version is v0.21, which incorporates worktodo.add file support and is slightly faster than v0.20. 28 Experimenting on tuning with a GTX 1080 Ti, RTX 2080, and RTX 2080 Super, on Windows 7, they seem to like the upper end of the tuning variable limits. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21. (Recompile required.) # GPUSieveSize defines how big of a GPU sieve we use (in M bits). # Minimum: GPUSieveSize=4 # Maximum: GPUSieveSize=128 (except in altered versions allowing 2047) # Default: GPUSieveSize=64 GPUSieveSize=2047 # GPUSieveProcessSize defines how far many bits of the sieve each TF block # processes (in K bits). Larger values may lead to less wasted cycles by # reducing the number of times all threads in a warp are not TFing a # candidate. However, more shared memory is used which may reduce occupancy. # Smaller values should lead to a more responsive system (each kernel takes # less time to execute). GPUSieveProcessSize must be a multiple of 8. # Minimum: GPUSieveProcessSize=8 # Maximum: GPUSieveProcessSize=32 # Default: GPUSieveProcessSize=16 GPUSieveProcessSize=32 GPUSievePrimes ~94000 29 Some RTX 20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows to gigabits https://www.mersenneforum.org/showpo...postcount=3071 Maximum GpuSieveSize was found to be 2047 (Mbits) after program modification. This may relate to computing bit positions with signed 32 bit integer math. Other model GPUs that show greater combined throughput with multiple mfaktc instances with the 128 Mbit GpuSieveSize maximum, also benefit from the modified code. For example, a GTX 1080 Ti gains several percent throughput running a single modified instance. It appears there would be additional gain on a single instance if the limit was raised to 4095 Mbits by changing to unsigned 32 bit integer math for the bit positions, since even at 2047 Mbits, it shows performance improvement compared to somewhat lesser values, and benefits from running multiple instances. RTX 2080 and RTX 2080 Super also show additional throughput by several percent from running multiple instances after available tuning is completed. 30 The decoupling of FP from integer math in the RTX 20xx may change the tradeoffs significantly or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored. 31 According to some, the probability of finding a factor for least effort is optimized by TF up to one bit less than the eventual tradeoff limit, then P1 with potential computing time savings optimized by the program itself (prime95 or CUDAPm1), and then, if no factor is found in P1, the last bit level of TF, constituting the remaining half of the TF effort for that exponent. This is somewhat different sequencing than usual PrimeNet assignment. 32 Since GPUs are far more effective at TF than at primality testing or P1, and far faster at TF than cpus are, assignments seek to prevent TF on CPUs, by performing all useful TF on GPUs. TF on GPUs allows going about 4 bits deeper, within computing time tradeoff optimization, saving some primality tests that CPUoptimizedTF limits would not have found factors for. The optimal number of bits deeper depend on the properties of the GPU models, with the higher ~40:1 TF/primality test ratio of the RTX 20xx family probably justifying 5 bits deeper for their use. As far as I know PrimeNet contains no provision for automatically taking them 1 bit higher on RTX 20xx. 33 The time required to perform a single TF attempt (one candidate factor on one exponent) is so brief, and the reliability of modern hardware high enough over such brief periods, that it does not pay to double check TF. Not double checking TF lets the TF go one bit level higher, finding more factors. But there is a sort of doublechecking being performed, with a completely different approach, by user TJAOI, as Uncwilly pointed out at https://www.mersenneforum.org/showpo...postcount=3043. There's a link there to a whole thread about that activity. 34 There is some limited overlap of coverage of factors between TF and P1, estimated at about 2025%. The economics of TF and the sequential progress of TF assignments dictate that TF factors found will have at most a certain number of bits (ranging up to 75 bits for 84M, to 86 bits for 999M). P1 can economically find factors with more bits, but requires that the number one less than the factor be rather smooth (composite with many smallish factors). Some factors that TF should find will be smooth, and many will not. Therefore P1 factoring currently serves as an incomplete doublecheck on the TF runs. "Stage 2 of P1 helps to find factors f=2kp+1, where k contains one less smooth factor. If k=a⋅b with b being the biggest prime factor of k, then it is sufficent to do P1 with B1 = maximum prime factor of a and B2 = b." (kruoli by PM) 35 Reported factors are confirmed at the PrimeNet server by repeating the computation for that exponent and factor, because that is quick to do and factors are worth confirming. Nofactor reports are not confirmed there. Historically, TF was on the honor system, with no security code attached to a result outside prime95, or verification of completion of the work. A recent development is Robert Gerbicz's proposal for a method of incorporating verification of performing the TF work in the nofactorfound case. See https://www.mersenneforum.org/showpo...7&postcount=30 and some following discussion on the same thread. At the moment this is only conceptual, not implemented in any software to my knowledge. There's also discussion of other approaches in https://mersenneforum.org/showthread.php?t=25493 36 There is currently no Mersenne TF code for the following interface specifications: DirectCompute DirectX OpenGL PhysX Vulkan In many cases, this is not a limitation, because the Intel CPU TF code in prime95 and mprime, and gpu TF code in Mfaktc (NIVIDIA CUDA) and Mfakto (OpenCL on AMD GPUs and some Intel IGPs) provide considerable coverage. However, there are older IGPs or discrete GPUs that do not have the required support for CUDA or OpenCL but do have support for other interface specifications such as DirectX or OpenGL. It's likely the performance of these older devices would be low compared to the ~20GhzD/day of current IGPs, or hundreds or thousands of GhzD/day on current discrete GPUs. 37 Mfaktc and Mfakto are capable of factoring Mersenne numbers with exponents up to 2^{32}. Extending them to exponents above 2^{32} would require a rewrite of the individual kernels as well as the surrounding code. http://mersenneforum.org/showpost.ph...postcount=1148 If this was attempted, it would be advisable to maintain separate source files or use conditional compilation to produce separate executables, to preserve efficiency for exponents below 2^{32}. Once done, though slower than the 32bitexponent max versions, these are likely to be much faster over 2^{32} than Factor5, which has no limit, or Ernst Mayer's Mfactor, which has some limits depending on its build options. There's no hurry. There's a great deal of work to do within the mersenne.org limit of 10^{9}, and much more from there to 2^{32}, that will take centuries. https://www.mersenneforum.org/showpo...postcount=2862 38 "Checkpoints are written between two residue classes by just remembering the last finished residue class." The checkpoint files are small and simple. https://www.mersenneforum.org/showpo...postcount=3008 Perform version upgrades between bit levels. The checkpoint files include what version was used to produce them, and are not accepted by a later version of software, which will restart the bit level instead. 39 The number of residue classes (see #4 above) is set at compile time in mfaktc, or in the ini file in Mfakto (choosing between 420 and 4620). Another possibility that has been suggested is to make it dynamic, adjusting to exponent and bit level. https://www.mersenneforum.org/showpo...postcount=2999 40 Some practical considerations constraining number of classes include magnitude of spacing between candidate factors (32 vs 64bit), frequency and effect of the GPU work queue running empty, and effect on throughput of running a full last block per class that extends past the bit level's end. https://www.mersenneforum.org/showpo...postcount=3002 41 In mfaktc or mfakto, there is screen output once per class. This can range from several classes per second or more, at high exponents and low bit levels, to nearly an hour or more per class at high bit levels on slow GPUs. (Bit levels already reached in trial factoring previously ranged from 64 to 87, 2^{23}=8,388,608:1 within the mersenne.org space. As of early 2021, they ranged from 72 to 87 bits, 2^{15}=65536:1.) There is no user control of mfaktx screen output frequency other than the approximately 11:1 effect of selecting moreclasses or lessclasses. In some cases a solid state disk or ramdisk is recommended to prevent logging or checkpointing from diminishing TF throughput unacceptably. A proposed user control for rate of screen output has not been implemented. https://www.mersenneforum.org/showpo...postcount=2703 Using lessclasses to somewhat limit output frequency can cost of order 1.7% TF throughput. https://www.mersenneforum.org/showpo...postcount=2706 42 Since finding factors is more likely with low k values than high k values, lowest bit levels are run first, for efficiency. For the primary purpose of GIMPS, finding new Mersenne primes, we only care about finding a factor quickly enough to eliminate an exponent p from any further processing, thereby saving computing time overall. The probability of finding a factor times the combined computing time of further trial factoring, plus P1 factoring, and running two primality tests, must be less than the effort of trial factoring for there to be net savings. Trying the mostlikely factors first optimizes the savings. 43 The optimal TF bit level described in #2, #13 and #32 depends somewhat on the GPU model, since different GPU models have different ratios of TF and primality test performance. Recent developments may raise the optimal above the levels described. RTX 20xx increased parallelism between FP and int computations, or its much higher performance ratio between TF and primality testing with current code, for example. Separate code and assignments according to gpu model may be necessary to make best use of the properties of the RTX20xx GPUs and older GPUs. 44 There is a discussion at https://www.mersenneforum.org/showpo...postcount=1511 to ~1514 regarding possibly eliminating the 20 to 25% overlap of P1 and TF described in #34. If doing so had little or no overhead, this could provide a considerable speedup in TF, which may speed up the total perexponent effort in GIMPS by <~0.03 x 0.25 / 2, or <~0.375%. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P1 after all TF is done. See also #31, regarding P1 being run before the last, highest TF level, which may offer an opportunity for saving <~0.19% if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P1. Taking full advantage of this probably involves modifying PrimeNet TF and P1 assignment policy, in both level and sequence, modifying code in the P1 applications (prime95, CUDAPm1, GpuOwL, Mlucas v20.0 or above) for determining optimal P1 bounds, modifying TF applications code (mfaktc, mfakto, mprime/prime95; gpuowl TF branch, Mfactor) to efficiently skip the relevant factor candidates, and modifying mersenne.ca gpuspecific TF level versus exponent charts and other project documentation. It may also involve changing or adding parameters to the worktodo lines or ini files or command line options to indicate whether the TF run is to include or exclude the smooth factor candidates. (Different use cases for TF may require different behavior.) New TF worktodo line structure implies modification of the parsing code in each TF application. New TF worktodo line structure also implies modifying some of the several client management programs (see http://www.mersenneforum.org/showpos...92&postcount=3). Making use of this also probably involves changes to TF result report records to indicate whether the smooth factors were tried or skipped, including PrimeNet and manual reports and assigned and unassigned work, impacting PrimeNet and its API, web manual report page processing, mfaktc, mfakto, prime95, gpuowl, Mfactor, and some of the client management programs. In total, over a dozen code authors would need to be on board with the change. Not all are currently actively maintaining what they wrote. 45 from https://www.mersenneforum.org/showpo...4&postcount=23 re efficiency versus assignment size (roughly 2^{bitlevel}/exponent) Quote:
exponent, no factor from bit level a to b, optionally username and computerid, program and version and kernel used, optionally time and date stamp (but not in that order). Other possible variables such as factor candidate sieving limit are not reported. 47 as #46 above implies, sometimes more than one factor will be present within a bit level. The behavior of mfaktx is controlled in such cases by part of mfaktx.ini[CODE]# possible values for StopAfterFactor: # 0: Do not stop the current assignment after a factor was found. # 1: When a factor was found for the current assignment stop after the # current bitlevel. This makes only sense when Stages is enabled. # 2: When a factor was found for the current assignment stop after the # current class. # # Default: StopAfterFactor=1 It's uncommon but finding two factors in a single bit level does occur. 48 More than one factor per class within a bit level for an exponent also occurs. Its overall impact on the GIMPS project is small. See https://www.mersenneforum.org/showpost.php?p=520982&postcount=5 Up to 10 factors per class are supported, by use of a small array for factors found. That would be very hard to test for. More than 4 per class/bitlevel/exponent is estimated to be quite rare. https://www.mersenneforum.org/showpost.php?p=410784&postcount=35 For some examples of exponents with multiple factors found in one class in one bit level see https://mersenneforum.org/showpost.p...postcount=3262 See also https://www.mersenneforum.org/showpo...&postcount=471; what's described here is referred to there as "search by p". (Thanks, to TheJudger, kruoli and others who have reviewed this large list or responded to PMs and offered constructive feedback.) Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20210914 at 02:12 Reason: updated #44 for Mlucas V20; minor style edits 

20190812, 15:29  #7 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
15334_{8} Posts 
Mfaktc h help output
Code:
mfaktc v0.21 Copyright (C) 2009, 2010, 2011, 2012, 2013, 2014, 2015 Oliver Weihe (o.weihe@tonline.de) This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING. This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details. Usage: mfaktcwin64 [options] h display this help and exit d <device number> specify the device number used by this program tf <exp> <min> <max> trial factor M<exp> from 2^<min> to 2^<max> and exit instead of parsing the worktodo file st run builtin selftest and exit st2 same as st but extended range for k_min/m_max v <number> set verbosity (min = 0, default = 1, more = 2, max/debug = 3) options for debuging purposes timertest run test of timer functions and exit sleeptest run test of sleep functions and exit Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20191118 at 14:04 
20190929, 15:15  #8 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3^{2}·191 Posts 
Mfaktc tuning for maximum throughput
Per https://www.mersenneforum.org/showpo...23&postcount=6 item 26,
tune mfaktc.ini gpu sieve parameters in this order: GPUSieveProcessSize GPUSieveSize GPUSievePrimes For faster gpu models, a recompiled version of mfaktc allowing GpuSieveSize up to 2047 can produce up to 5% greater throughput over GpuSieveSize=128 for GTX1080Ti, more than 10% for RTX 2080 Super. After tuning singleinstance throughput, faster gpu models can benefit from running multiple instances on the same gpu simultaneously. The optimal number of instances appears to depend on both gpu model and cpu model. The potential gain is greater with faster gpus, less with slower gpus. I think the gain is that when one instance is waiting on display and disk writes, another can keep the gpu working. This has been observed on Win7 or Win10 systems with conventional rotating disks. Solid state disk or ramdisk may reduce or eliminate this multiinstance advantage. Whether the optimal tune for multiple instances differs from the optimal tune for single instances is untested currently. Note also the optimal tune is expected to slowly change with exponent and bit level. See also https://www.mersenneforum.org/showpo...76&postcount=4 CUDA level and driver version seem to have little or no effect. More_classes is more efficient than less_classes, unless the classes are so fast that it becomes somewhat I/O bound. Most exponents and bit levels remaining between the wavefront and the mersenne.org limit are better run in more_classes. A tuning example for GTX1080Ti gpu is attached. A second example with two rounds of further tuning for the modified GpuSieveSize up to 2047 version is also attached. It gains a little more performance and a GPUZ indicated 100% utilization with 3 instances. Final tune check of an RTX2080 gpu using GPUSieveSize up to 2047 is also attached. It gains a little more performance and a GPUZ indicated 100% utilization with 3 instances. Both models indicate possible further gains if GpuSieveSize was not limited by 2^{31}1 bits due to signed 32bit bit position computation. The newest gpu models respond substantially to tuning, of order 3 to 10% gain. Lower speed gpus show less response to tuning, and their optimal tune tends to be at somewhat lower values of tune parameters. Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20200823 at 15:29 Reason: updated for GTX1050Ti 
20200417, 03:50  #9 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×3^{2}×191 Posts 
grmfaktc
There is a variant of mfaktc which also supports bases other than 2, called grmfaktc (generalized repunits mfaktc). It is somewhat slower for base 2 than mfaktc, so not recommended for Mersenne numbers. See https://mersenneforum.org/showthread.php?t=24901
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20200417 at 03:51 
20220515, 13:21  #10 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·3^{2}·191 Posts 
Mfaktc tuning for best compute efficiency
"Best" here is assumed to be the maximum GHD/day per watt of GPU input power.
This fits my use case reasonably well since almost all systems are also running prime95. If the system is only running because of GPU factoring, the system's power consumption probably should be used instead. (Measured with an inline AC wattmeter.) One could argue that system power should be prorated among CPU and GPU applications in that case, but I have chosen not to prorate, for simplicity. Nvidiasmi supports setting maximum power level for many NVIDIA GPUs. nvidiasmi i n pl w where n is the GPU number and w is the wattage limit. nvidiasmi with no parameters to see a list of CUDA GPUs, current load and power, etc. After tuning for maximum performance without power limits, I measured GHD/d at a variety of power limits for several GPUs. A GTX 1080 Ti reached max GHD/day/watt at 163w, middle of its allowed power range, while GTX 1650, GTX 1650 Super, RTX 2080, and RTX 2080 Super showed max GHD/d/w at their respective minimum power settings. It may be possible to achieve higher GHD/d/watt, by detuning from maximum performance mfaktc.ini settings. I did not try any of that. Be careful of some settings. Disabling GPU sieving would create a false indication of higher efficiency, by displacing the factor candidate sieving work onto the CPU. DrobinsonPE advises using lgc instead of pl indirectly allows a greater range of power adjustment and accessing higher efficiency at lower frequency/power than pl provides. It's likely that performance will be more stable also, at a locked clock frequency. And for the same average power, since power is nonlinear with frequency, limiting frequency probably yields higher performance than letting frequency fluctuate during power limiting. But, clock control seems not to apply to GTX 10xx, at least on Windows 7, driver version 378.66.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 20220714 at 14:14 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Reference material discussion thread  kriesel  kriesel  93  20220913 05:02 
Mersenne Prime mostlyGPU Computing reference material  kriesel  kriesel  36  20220626 17:11 
How do you obtain material of which your disapproval governs?  jasong  jasong  97  20150914 00:17 
Reference code for allinteger convolutions  jasonp  Software  17  20090129 02:25 
NFS reference  Jushi  Math  2  20060828 12:07 