mfaktc updates?
There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL]
What did that offer or was planned for implementation, and was it ever released? Given that the GTX 1080 Ti seems to be bumping up against the documented maximums for gpusieveprocesssize and gpusievesize, and there are now faster GPUs (RTX 20xx), there may be performance gains available by increasing those maximums. Could we get an updated version? Is there some obstacle preventing an increase of the maximum gpusieveprocesssize or gpusievesize? |
RTX cards gain several % more throughput at GPUSieveSize=1024
At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances.
TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them? Similarly, on the GTX1080Ti, it looked like an increase in max gpusieveprocesssize would be beneficial. Again, what sort of possible issues, and symptoms? |
[QUOTE=kriesel;507040]At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances.
TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them?[/QUOTE] I saw a similar speed increase on my RTX 2060 |
Hi,
[QUOTE=kriesel;507040]At [URL]https://www.mersenneforum.org/showpost.php?p=505395&postcount=83[/URL] nomead provides test data for a modified version of mfaktc on RTX2080. He found a 7% gain in indicated throughput going up to gpusievesize 1024. The shape of the curves plotted for his data, in [URL]https://www.mersenneforum.org/showpost.php?p=507031&postcount=106[/URL] is consistent with additional gain at even higher gpusievesize. In my testing there appears to be some potential gain for GTX1080 Ti above 128 also. [URL]https://www.mersenneforum.org/showpost.php?p=506990&postcount=3069[/URL] Modern gpu cards have plenty of memory to support large sizes, even if running multiple instances. TheJudger, what sort of issues might arise when increasing gpusievesize? How could we spot them? Similarly, on the GTX1080Ti, it looked like an increase in max gpusieveprocesssize would be beneficial. Again, what sort of possible issues, and symptoms?[/QUOTE] Possible issues? Missing factors; everything else doesn't really matter.

Will check the max value of gpusievesize in the next version (no ETA yet). Thank you for drawing my attention to this! :smile:

I think gpusieveprocesssize is more of a hard(er) limit; on the other hand, the performance increase from 16k to 32k is rather low anyway?!

Oliver |
[QUOTE=TheJudger;507050]
Will check max value of gpusievesize in next version (no ETA yet). Thank you for putting my attention on this! :smile: [/QUOTE] I found at least one hard limit, 2047 still passed self tests but 2048 gave an error. [CODE]gpusieve.cu(1276) : CUDA Runtime API error 2: out of memory.[/CODE] And besides, above 1024 the performance increase wasn't so significant, so to be (semi-)safe I'll stay at max 1024 for now. I'm fully aware that I'm poking things I really shouldn't poke :reality-check: |
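For scale: with GPUSieveSize given in Mbits (per mfaktc.ini), 2048 Mbits is exactly 2^31 bits, one past what a signed 32-bit count can hold. That is only a guess from the round number, not a reading of gpusieve.cu, but it would fit the observation that 2047 works and 2048 errors out even though the sieve itself is only a few hundred MiB. A quick back-of-envelope check:

```python
# Back-of-envelope sieve-size arithmetic (assumption: GPUSieveSize is in
# Mbits = 2^20 bits, as mfaktc.ini documents). The 2048 boundary coincides
# with signed 32-bit overflow of a bit count -- a guess, not confirmed.
INT32_MAX = 2**31 - 1

def sieve_bits(gpusievesize_mbits):
    """Total sieve bits for a given GPUSieveSize value."""
    return gpusievesize_mbits * 1024 * 1024

for size in (128, 1024, 2047, 2048):
    bits = sieve_bits(size)
    mib = bits // 8 // 2**20          # sieve size in MiB
    note = "overflows int32 bit count" if bits > INT32_MAX else "fits int32"
    print(size, "Mbits =", mib, "MiB,", note)
```

Whatever the true cause of the gpusieve.cu(1276) failure, the memory footprint alone (256 MiB at 2047) should not exhaust a modern card.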
[QUOTE=nomead;507052]I found at least one hard limit, 2047 still passed self tests but 2048 gave an error.
[CODE]gpusieve.cu(1276) : CUDA Runtime API error 2: out of memory.[/CODE] And besides, above 1024 the performance increase wasn't so significant, so to be (semi-)safe I'll stay at max 1024 for now. I'm fully aware that I'm poking things I really shouldn't poke :reality-check:[/QUOTE] If you have data for above 1024, please share. Thank you for poking at things! That's how progress occurs. [URL]https://www.goodreads.com/quotes/536961-the-reasonable-man-adapts-himself-to-the-world-the-unreasonable[/URL] And wow, for some reason my GTX 1080 Ti is no longer thermally throttling at 83C; it seems to be happily running to the power limit at 90-91C. Hotter than I'd like, but one mfaktc instance gives 1356 GhzD/day at 93% GPU load, and two instances give 1383 at 97% GPU load. |
[QUOTE=kriesel;507057]If you have data for above 1024, please share.
[/QUOTE] In the [URL="https://www.mersenneforum.org/showpost.php?p=507051&postcount=107"]other thread...[/URL]
1024 -> 1536: +0.7%
1536 -> 2047: +0.2% |
[QUOTE=nomead;507058]In the [URL="https://www.mersenneforum.org/showpost.php?p=507051&postcount=107"]other thread...[/URL]
1024 -> 1536: +0.7%, 1536 -> 2047: +0.2%[/QUOTE] The same happens on the RTX 2060 |
GTX 1060 tune variations (likes large GPUSieveSize)
GTX 1060, power (pwr) and reliability-voltage (vrel) limited
exponent 720M, bit level 81-82:
GPUSieveSize=64 GPUSieveProcessSize=16: throughput 412.34 GhzD/day
GPUSieveSize=64 GPUSieveProcessSize=32: throughput 412.88
GPUSieveSize=[B]128[/B] GPUSieveProcessSize=32: throughput [B]420.25[/B] |
Trial factoring concepts
Concepts in GIMPS trial factoring (TF) (note: somewhat mfaktc-oriented, more so toward the end)
Does this list cover the main concepts? Anything missing, misstated, incomplete, etc.? (Tact encouraged!)

1. Trial factoring work is generally assigned by exponent and bit levels: the Mersenne exponent, the bit level already trial factored to (i.e. where to start), and the bit level to factor up to.

2. Each bit level of TF is about as much computing effort as all bit levels below it combined, for the same exponent.

3. TF makes use of the special form of factors of Mersenne numbers, f = 2kp + 1, where p is the prime exponent of the Mersenne number and k is a positive integer.

4. Use a wheel generator for candidate factors, to efficiently exclude (and not even consider) candidate factors that have very small factors themselves, such as 2, 3, 5, 7 and usually 11. 2·2·3·5·7 = 420 is the less-classes number; 2·2·3·5·7·11 = 4620 is the more-classes number. [URL]https://www.mersenneforum.org/showpost.php?p=200871&postcount=35[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=200887&postcount=37[/URL] (There was discussion of going to 13 or higher, but that was considered not worthwhile; 2·2·3·5·7·11·13 = 60060, etc. Higher complexity and overhead, trading off against the diminishing returns of the incremental number of candidates excluded.)

5. For a given exponent, exclude entire classes of candidate factors with a single test. [URL]https://www.mersenneforum.org/showpost.php?p=200887&postcount=37[/URL]

6. Make use of another part of the special form of factors: they are 1 or 7 mod 8.

7. Dense representation of the candidate factors, as a bit map of k values. [URL]https://www.mersenneforum.org/showpost.php?p=201884&postcount=72[/URL]

8. These candidate factors are sieved, somewhat, but not exhaustively. Candidate factors found composite by sieving are discarded; trying prime candidates is sufficient, and composite candidates are redundant.

9. Exhaustive sieving of candidate factors can produce less throughput per total computing effort than a lesser amount of sieving. User control of sieving depth is available in mfaktc and mfakto, to allow adjustment to near optimal for the exponents, depths, and other variables. [URL]https://www.mersenneforum.org/showpost.php?p=212815&postcount=180[/URL]

10. The sieving level must not exceed the level where possible candidate factors are included in the sieving values. This makes sieve limit settings dependent on exponent for low-exponent, low-bit-level combinations. [URL]https://www.mersenneforum.org/showpost.php?p=260105&postcount=788[/URL] (If I recall correctly, in later versions of mfaktc special handling of certain cases relaxes this restriction on sieving level. See #21 below.)

11. Surviving candidate factors are tried.

12. The trial method is to generate a representation of the Mersenne number plus 1 (that is, 2^p) modulo the trial factor, by a succession of squarings and doublings according to the bit pattern of the exponent, all modulo the trial factor. If 2^p mod f = 1, then 2^p - 1 mod f = 0 and f is a factor. The powering method is much, much faster than trial long division for sizable numbers, and uses far less memory; both advantages grow for larger numbers. Description and brief small-numbers example at [URL]https://www.mersenne.org/various/math.php[/URL]

13. On significantly parallel computing hardware, such as GPUs, many trial factors can be run in parallel, so successive subsets of candidates are distributed over the many available cores.

14. Operation is as multiword integers, sometimes with some bits reserved for carries. [URL]https://www.mersenneforum.org/showpost.php?p=199203&postcount=17[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=211326&postcount=159[/URL]

15. Different code sequences are written for different bit-level ranges of trial factors, for the best speed in each range. Higher bit levels take longer code sequences, so factors tried per unit time declines at higher bit levels on the same hardware and exponent. [URL]https://www.mersenneforum.org/showpost.php?p=216430&postcount=231[/URL] Cursory examination of current source code shows the following 17 distinct kernels in mfaktc: _71BIT_MUL24, _75BIT_MUL32, _95BIT_MUL32, BARRETT76_MUL32, BARRETT77_MUL32, BARRETT79_MUL32, BARRETT87_MUL32, BARRETT88_MUL32, BARRETT92_MUL32, _75BIT_MUL32_GS, BARRETT76_MUL32_GS, BARRETT77_MUL32_GS, BARRETT79_MUL32_GS, BARRETT87_MUL32_GS, BARRETT88_MUL32_GS, BARRETT92_MUL32_GS, _95BIT_MUL32_GS. (mfakto is similar, although it does not include 95-bit.)

16. The Barrett 77 (or was it 76?) was derived from the 79. [URL]https://www.mersenneforum.org/showpost.php?p=306572&postcount=1824[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=306808&postcount=1838[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=307238&postcount=1845[/URL]

17. "Funnel shift" Barrett 87: [URL]https://www.mersenneforum.org/showpost.php?p=334251&postcount=2243[/URL]

18. The frequency of primes on the number line declines gradually as the bit level increases. This partially offsets the effect of the longer code sequences for higher bit levels.

19. For GPU applications, there are various implementation approaches for performance. [URL]https://www.mersenneforum.org/showpost.php?p=199799&postcount=29[/URL]

20. Multiple streams and data sets allow concurrent data transfer and processing. [URL]https://www.mersenneforum.org/showpost.php?p=207433&postcount=152[/URL]

21. On GPU sieving: [URL]https://www.mersenneforum.org/showpost.php?p=251120&postcount=554[/URL] Tradeoffs and possible decisions about high GPU sieving with low exponents: [URL]https://www.mersenneforum.org/showpost.php?p=385780&postcount=2409[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=503820&postcount=2999[/URL]

22. There's a small performance advantage to 32-bit images when available, when using GPU sieving (32-bit addresses are smaller, placing less demand on memory bandwidth). [URL]https://www.mersenneforum.org/showpost.php?p=323678&postcount=1981[/URL] At CUDA 8 or higher (which means GTX 10xx or newer models) only 64-bit CUDA is available.

23. How, in the CUDA or OpenCL cases, all those factor candidates get bundled into batches for processing in parallel on many GPU cores is very hazy for me. Presumably it involves using as many of the cores as possible, as much of the time as possible, implying work batches of equal size / run time. Prime95 comments on passing out chunks of work and using memory bandwidth efficiently: [URL]https://www.mersenneforum.org/showpost.php?p=292154&postcount=1634[/URL] Oliver writes to R. Gerbicz about it: [URL]https://www.mersenneforum.org/showpost.php?p=504233&postcount=3020[/URL]

24. Could FP be faster than integer math for the kernels? Probably not. [URL]https://www.mersenneforum.org/showpost.php?p=383607&postcount=2377[/URL] Mark Rose did a detailed analysis in 2014 (and maybe revisiting that for recent GPU designs would be useful). [URL]https://www.mersenneforum.org/showpost.php?p=384137&postcount=2380[/URL]

25. mfaktc v0.21 announcement, and ruminations about a possible v0.22: [URL]https://www.mersenneforum.org/showpost.php?p=395689&postcount=2492[/URL]

26. Ini file parameter tuning advice: [URL]https://www.mersenneforum.org/showpost.php?p=395719&postcount=2505[/URL] through postcount 2508; tune in this order: GPUSieveProcessSize, GPUSieveSize, GPUSievePrimes.

27. There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL] What did that offer or was planned for implementation, and was it ever released?

28. Experimenting on tuning with a GTX 1080 Ti on Windows 7, it seems to like the upper end of the tuning variable limits. So do some lesser models. Possibly the faster cards would benefit from higher maximums than documented for mfaktc v0.21:
[CODE]# GPUSieveSize defines how big of a GPU sieve we use (in M bits).
# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128
# Default: GPUSieveSize=64
GPUSieveSize=128

# GPUSieveProcessSize defines how many bits of the sieve each TF block
# processes (in K bits). Larger values may lead to less wasted cycles by
# reducing the number of times all threads in a warp are not TFing a
# candidate. However, more shared memory is used which may reduce occupancy.
# Smaller values should lead to a more responsive system (each kernel takes
# less time to execute). GPUSieveProcessSize must be a multiple of 8.
# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16
GPUSieveProcessSize=32[/CODE]
plus GPUSievePrimes ~94000.

29. Some RTX 20xx owners have modified the tuning variable limits, recompiled, and obtained performance gains of several percent, with diminishing incremental returns as GPUSieveSize grows toward GB scale.

30. The decoupling of FP from integer math in the RTX 20xx may change the tradeoffs significantly, or allow use of both simultaneously for TF math. As far as I know this is as yet unexplored. |
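Concepts 3, 6, 8, and 12 above can be condensed into a toy Python sketch. This is an illustration of the math only, not how mfaktc is structured: the real program uses class-based wheels, a bitmap sieve, and multiword integer kernels across thousands of GPU threads.

```python
def tf_mersenne(p, max_k):
    """Toy trial factoring of M(p) = 2^p - 1 via candidates f = 2*k*p + 1.

    Illustrates the special factor form (concept 3), the 1 or 7 mod 8
    filter (concept 6), light sieving by small primes (concept 8), and
    the square-and-double powering test (concept 12).
    """
    factors = []
    for k in range(1, max_k + 1):
        f = 2 * k * p + 1
        if f % 8 not in (1, 7):                 # factors of M(p) are 1 or 7 mod 8
            continue
        if any(f % q == 0 for q in (3, 5, 7, 11, 13) if f != q):
            continue                            # cheap sieve; skipping composites saves work
        # Compute 2^p mod f, walking the bits of p from the top:
        # squaring doubles the partial exponent, doubling adds 1 to it.
        r = 1
        for bit in bin(p)[2:]:
            r = (r * r) % f                     # square for each exponent bit
            if bit == '1':
                r = (r + r) % f                 # double == multiply by the base 2
        if r == 1:                              # 2^p ≡ 1 (mod f)  =>  f divides 2^p - 1
            factors.append(f)
    return factors

print(tf_mersenne(11, 50))   # -> [23, 89]; M(11) = 2047 = 23 * 89
```

Note how cheap each candidate is: about log2(p) squarings, which is why trying the survivors in bulk maps so well onto GPU cores.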
[QUOTE=kriesel;507954]16 The Barrett 77 (or was it 76) was derived from the 79 [URL]https://www.mersenneforum.org/showpost.php?p=306572&postcount=1824[/URL]
[URL]https://www.mersenneforum.org/showpost.php?p=306808&postcount=1838[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=307238&postcount=1845[/URL][/QUOTE] Correct but not complete. The first Barrett kernel in mfaktc was BARRETT92; all the other Barrett kernels are stripped-down versions of it. From BARRETT92 we went to BARRETT79 first (fixed inverse, multiple bits per stage possible, a bit faster). From there we went from BARRETT92 to BARRETT88 and BARRETT87 by (re)moving interim correction steps and some other "tricks" (loss of accuracy in interim steps; small example: 22 mod 10 yields 12 instead of 2). Trading accuracy for speed. The same "tricks" lead from BARRETT79 to BARRETT77 and BARRETT76.

[QUOTE=kriesel;507954]25 mfaktc v0.21 announce, and ruminations about possible v0.22 [URL]https://www.mersenneforum.org/showpost.php?p=395689&postcount=2492[/URL] 27 There was mention of an mfaktc v0.22-pre2 back in 2015 at [URL]https://www.mersenneforum.org/showpost.php?p=402408&postcount=2547[/URL] What did that offer or was planned for implementation, and was it ever released?[/QUOTE] "-pre" versions aren't released into the wild and are not intended for productive usage. It removed old stuff (CC 1.x code dropped, CUDA compatibility < 6.5 dropped), plus minor changes and bugfixes.

Oliver |
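TheJudger's "22 mod 10 yields 12" remark is about skipped correction steps in Barrett reduction. A generic textbook Barrett sketch (not mfaktc's actual BARRETT9x kernel code) shows the effect: the quotient estimate can come out slightly low, so the raw remainder may still exceed the modulus until a final correction subtracts it off.

```python
def barrett_reduce(x, m, correct=True):
    """Textbook Barrett reduction of x mod m (a sketch, not mfaktc's kernels).

    q underestimates x // m slightly, so the raw remainder r can exceed m
    by a small multiple -- e.g. come out as 12 where exact math gives 2.
    The faster BARRETT variants in mfaktc drop some such corrections on
    interim values and fix things up later, trading accuracy for speed.
    """
    k = 2 * m.bit_length()
    mu = (1 << k) // m          # precomputed "inverse", floor(2^k / m)
    q = (x * mu) >> k           # quotient estimate; may be a bit low
    r = x - q * m               # raw remainder, possibly still >= m
    if correct:
        while r >= m:           # correction step(s) the fast kernels defer
            r -= m
    return r
```

With m = 10 this sketch reproduces the flavor of his example: barrett_reduce(92, 10, correct=False) returns 12, while the corrected value is 2.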