2020-11-27, 23:24   #5
kriesel
Poor-man's multithreading approximation

Launching separate processes for separate pass numbers with output redirection to pass-specific files permits using multiple cores on Windows. If there's a system crash before the processes complete their work, it's possible to resume each from roughly where it left off, by manually specifying beginning and ending k values. For large process counts, that can become tedious.

The following two paragraphs first appeared here:
"Poor man's multithreading" is running multiple processes for the same bit level and exponent, with different passmin and passmax. For example, 4-way, to use 4 cores with an msys2 compiled image,
passmin 0 passmax 3,
passmin 4 passmax 7,
passmin 8 passmax 11,
passmin 12 passmax 15.
This works well when the number of passes per run is a power of two: 1, 2, 4, 8.
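The 4-way split above can be generated rather than typed by hand. The following is only a minimal sketch, assuming 16 passes numbered 0 to 15 as in the example; split_passes is a hypothetical helper name, not part of Mfactor, and it only computes the ranges without launching anything.

```python
def split_passes(total_passes, process_count):
    """Return one inclusive (passmin, passmax) pair per process.

    Assumes passes are numbered 0 .. total_passes - 1 and that
    process_count divides total_passes evenly, so that every process
    gets the same number of passes.
    """
    if process_count < 1 or total_passes % process_count != 0:
        raise ValueError("process_count must divide total_passes evenly")
    per_process = total_passes // process_count
    return [
        (i * per_process, (i + 1) * per_process - 1)
        for i in range(process_count)
    ]


# 16 passes over 4 processes, matching the 4-way example above.
for passmin, passmax in split_passes(16, 4):
    print("passmin", passmin, "passmax", passmax)
```

The same helper handles other even splits, for example split_passes(16, 8) for two passes per process.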

If the build is done with -DTF_CLASSES=4620 for finer pass granularity, then the passmin and passmax values range from 0 to 959, for 960 = 2^6 * 3 * 5 passes in all. This larger number of passes with numerous small factors allows for much more choice of degree of parallelism. 960 has 28 divisors:
1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 30, 32, 40, 48, 60, 64, 80, 96, 120, 160, 192, 240, 320, 480, 960
For brief runs there is no point in going to high degrees of parallelism, and -DTF_CLASSES seems to introduce higher overhead into a single run. For lengthy runs, high degrees of parallelism may be the only way to get reasonable run times. Using hyperthreading helps.

These additional choices may better fit the available number of cpu cores and hyperthreads on a given system.
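One way to pick a degree of parallelism is to list the divisors of the pass count and take the largest one that fits the machine. This is a rough sketch only: divisors and best_process_count are hypothetical helper names, and using every logical CPU reported by the OS is an assumption, since you may want to leave some threads free for prime95 or other work.

```python
import os


def divisors(n):
    """All positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def best_process_count(total_passes, logical_cpus):
    """Largest divisor of total_passes that does not exceed logical_cpus."""
    return max(d for d in divisors(total_passes) if d <= logical_cpus)


# os.cpu_count() reports logical CPUs (hyperthreads), or None if unknown.
logical_cpus = os.cpu_count() or 1
print("960 passes:", best_process_count(960, logical_cpus), "processes")
print("16 passes:", best_process_count(16, logical_cpus), "processes")
```

On a machine with 48 hyperthreads, for example, 960 passes allow 48 processes of 20 passes each, while 16 passes allow at most 16 processes.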

Note that in the case of a system problem, automated update restart, power outage, etc., it can be unpleasant to have a large number of incomplete passes to deal with. A good UPS, a stable, up-to-date, reliable system, and a well chosen number of parallel processes are recommended to minimize the chore of continuing from the k values captured in log files. Alternatively, sacrifice some throughput and resume all the processes from the lowest maximum k value reached among all the processes being resumed, with a script to relaunch them all. Nevertheless, parallel processes can be powerful when the run time is weeks even with 16-64 processes.
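The "lowest maximum k" approach amounts to taking a minimum over the per-process progress. The sketch below assumes you have already read the last k value reached out of each log file by hand or by your own script; common_resume_k is a hypothetical helper name, and the k values shown are made up purely for illustration.

```python
def common_resume_k(last_k_by_process):
    """Return a single k from which every process can safely resume.

    last_k_by_process maps a process label (here its passmin) to the
    highest k value that process is known to have reached. Resuming
    everything from the smallest of these repeats some work in the
    processes that were further along, but skips nothing.
    """
    if not last_k_by_process:
        raise ValueError("no processes to resume")
    return min(last_k_by_process.values())


# Illustrative values only, not taken from any real run.
progress = {0: 5_200_000_000, 4: 4_900_000_000, 8: 5_050_000_000, 12: 5_100_000_000}
print("resume all processes from k =", common_resume_k(progress))
```

The cost of this shortcut is the work between the returned k and each process's own last k, which is the throughput being sacrificed.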

Different hardware seems to behave differently. On a dual-Xeon-E5-2697v2 system (dual 12-core, 2-way hyperthreading), I've run 16 processes in parallel and seen only minor differences in duration among the parallel processes, and ~15% impact on prime95 throughput. On a Knights Landing Xeon Phi (which has 4-way hyperthreading and 64, 68 or 72 cores), with 64 Mfactor processes and 4 prime95 workers, I've seen the Mfactor processes vary significantly in run time: longest = 1.68 x shortest, 151.8 hours vs. 255+, for mfactor-base-2w-tfc -m 60651732991 -bmin 85 -bmax 86, 64 processes, with the OS assigning processes to the cores without user involvement, MCDRAM only. The impact on the prime95 workers varied greatly too, from ~10% to the highest numbered worker indicating more than 100% increase in primality test iteration time IIRC.

The exponent and bit level entries in the attached Windows batch files are for illustration only. Please do not run them as is without coordination with me. They take too long to waste time by duplicating effort. There's no web or other server site known to me for coordinating work on such large exponents, other than perhaps posting messages somewhere on the forum. gives an indication of how long one of the Mfactor runs took.

For simplicity, or maybe I didn't think of it soon enough, the log files are output in the working directory, not one level lower. It's straightforward to create a folder for the exponent and final bit level, put the code there, and create all the run files there, then move the code to another folder for another run. But that is not required. Since the individual processes' log files are named according to exponent, starting pass number, and ending bit level, multiple bit levels or even exponents could be run in the same directory at the same time, without log file name collision. For example, one could run 1 process to do bit level x, 2 processes for x+1, 4 processes for x+2, and 8 processes for x+3, which would all complete in about the same time, assuming there are enough hyperthreads available that each mfactor process gets its own register set. I strongly recommend starting with small process counts and small bit levels for small run times, to familiarize yourself with the program and batch script operation, and to confirm run time projections before attempting higher bit levels or more complex runs. Run time of a bit level, when setup overhead is small compared to factoring time, ideally scales as 2^bits / exponent / parallelprocesscount. Run times of weeks are easy to exceed.
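The scaling rule above can be used to compare planned runs before launching them. This is a sketch of the idealized rule only: relative_run_time is a hypothetical helper name, the result is in arbitrary units rather than hours, setup overhead and hardware effects are ignored, and the bit level 80 is just an illustrative starting point.

```python
def relative_run_time(bits, exponent, process_count):
    """Idealized run time in arbitrary units: 2^bits / exponent / processes."""
    return 2.0 ** bits / exponent / process_count


exponent = 60651732991  # exponent used in the example command in this post
x = 80                  # illustrative bit level, an assumption

# 1 process at x, 2 at x+1, 4 at x+2, 8 at x+3: each level doubles the work
# and doubles the process count, so the ideal run times come out equal.
for step in range(4):
    t = relative_run_time(x + step, exponent, 2 ** step)
    print("bit level", x + step, "with", 2 ** step, "processes:", t)
```

To turn the arbitrary units into hours, time one short run and scale from it, which is also a way to confirm projections before committing to a long run.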

Top of reference tree:
Attached Files
File Type: zip (4.9 KB, 12 views)

Last fiddled with by kriesel on 2021-01-12 at 18:24