2018-12-14, 19:51 | #1 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1A14_{16} Posts |
prime95-specific reference material
This thread is here for comparison to the GPU-based applications. Please use the reference discussion thread https://www.mersenneforum.org/showthread.php?t=23383 to make comments or suggestions.
Mprime and prime95 are specific to Intel-compatible processors. Older processor models have limited support, if any. For ARM and other non-Intel-compatible CPUs, see mlucas.
Stable version: v30.8 (build 15), and for MacOS mprime v30.7 build 9, are available at the location for what is considered stable, https://www.mersenne.org/download/. V30.3 and later are PRP-proof capable versions, which greatly reduce the effort of verifying a primality test but require considerably more disk space to do so. See the readme.txt and other documentation included in the compressed distribution file for more information.
Older versions: v29.8b7 and older, for legacy operating systems, are also available at https://www.mersenne.org/download/.
Newer versions: Generally George leaves the stable version on the download page while newer versions are in active development or testing. There is typically a thread for that activity, including occasional announcements of, and links to, new builds.
Coming attractions: V30.9, with ECM improvements, is in development and now at b1.
Setup instructions are included at https://www.mersenne.org/download/. Follow with https://www.mersenne.org/gettingstarted/. Avoid installing into "Program Files" or other restricted directories; permissions problems follow from such installs. Instead, make a separate working directory for prime95 under the user's home directory.
I strongly recommend benchmarking over the range of fft lengths expected to be used, analyzing the results in a spreadsheet, and configuring for the best throughput that is consistent with latencies shorter than the applicable expiration periods. Configure worker windows for your preferred work type, and make sure that it is not trial factoring; GPUs are far more effective at that.
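The "best throughput consistent with latency" selection described above can also be done outside a spreadsheet. The following is a minimal sketch, not part of prime95: the BenchmarkRow class, the best_config function, and all the iteration rates are hypothetical illustration values, and the only fact relied on is that an LL or PRP test of M(p) takes about p squarings.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class BenchmarkRow:
    """One hypothetical benchmark result for a given worker count."""
    workers: int
    iters_per_sec_per_worker: float

    @property
    def aggregate_iters_per_sec(self):
        # Total throughput of the whole system in this configuration.
        return self.workers * self.iters_per_sec_per_worker

    def days_per_test(self, exponent):
        # An LL or PRP test of M(p) needs about p squarings.
        return exponent / self.iters_per_sec_per_worker / SECONDS_PER_DAY


def best_config(rows, exponent, max_days):
    """Highest aggregate throughput among rows whose latency is acceptable."""
    eligible = [r for r in rows if r.days_per_test(exponent) <= max_days]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.aggregate_iters_per_sec)


if __name__ == "__main__":
    # Made-up timings for one fft length; replace with your own benchmark data.
    rows = [
        BenchmarkRow(1, 400.0),
        BenchmarkRow(2, 220.0),
        BenchmarkRow(4, 115.0),
        BenchmarkRow(8, 50.0),
    ]
    choice = best_config(rows, exponent=110_000_000, max_days=10)
    print(choice)
```

In practice the same selection would be repeated for each fft length of interest, since what is fastest for one length may not be for others.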
It is normal for the PrimeNet server to issue only LL DC to a new prime95 installation until each prime95/mprime worker has successfully completed 4 LL DC. After a new installation accumulates a history of reliability, the PrimeNet server will allow additional work types. Do not do TF in mprime/prime95 if it can possibly be avoided; modern GPUs are much more effective at it.
Note that with AVX512 hardware, mprime/prime95 can process exponents up to 1169M, somewhat exceeding the mersenne.org server's maximum of 1000M. If connected through the PrimeNet API, it will try and fail to report results for such high exponents, which are not tracked in the mersenne.org database. Mersenne.ca has been enhanced to accept such P-1 results in prime95 JSON format here. Such large exponents should be avoided on most hardware and by most users, since there is little need for P-1 on them currently: primality testing would take too long, and the odds per unit of compute time of finding a Mersenne prime at >1G exponent are 2 to 3 orders of magnitude lower than at ~1/10 that exponent size. Odds are best at the first-test wavefront.
For remaining questions see the program's extensive included documentation.
PRP run time scaling for low p https://www.mersenneforum.org/showpo...78&postcount=2
P-1 run time scaling https://www.mersenneforum.org/showpo...92&postcount=3
Effect of number of workers https://www.mersenneforum.org/showpo...18&postcount=4
Effect of number of workers (continued) https://www.mersenneforum.org/showpo...19&postcount=5
Effect of frequent interim residue output https://www.mersenneforum.org/showpo...44&postcount=6
Prime95 documentation https://www.mersenneforum.org/showpo...03&postcount=7
Prime95 exponent limits https://www.mersenneforum.org/showpo...74&postcount=8
PRP proof capable versions https://www.mersenneforum.org/showpo...35&postcount=9
Performing version upgrades https://www.mersenneforum.org/showpo...2&postcount=10
Effect of number of workers continued 2 https://www.mersenneforum.org/showpo...4&postcount=11
Use as hardware reliability test https://www.mersenneforum.org/showpo...7&postcount=12
See also the Concepts in GIMPS Trial Factoring post at https://www.mersenneforum.org/showpo...23&postcount=6
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2022-07-30 at 19:02 Reason: misc edits |
2018-12-14, 19:54 | #2 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×1,669 Posts |
PRP run time scaling for low p
Run time is fitted as approximately proportional to p^{2.094}, for 86243 <= p <= 2976221. LL run time is expected to scale very similarly. For comparison a theoretical fft convolution based primality tester scales as p^{2} log p log log p, which over the mersenne.org interval fits as p^{2.117}. Overhead at low exponents lowers the power on a fit.
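A power-law fit of this kind is a straight-line fit of log(run time) against log(p). The sketch below is an illustration only, not the spreadsheet actually used: fit_power_law, theoretical_cost, and log_spaced are hypothetical helper names, and the sample range simply reuses the exponent bounds quoted above. It fits the theoretical p^{2} log p log log p curve to show why its effective power comes out a little above 2.

```python
import math


def fit_power_law(ps, ts):
    """Least-squares slope of log(t) against log(p), i.e. k in t ~ c * p**k."""
    xs = [math.log(p) for p in ps]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def theoretical_cost(p):
    """Relative cost p^2 log p log log p (natural logs; constant factor omitted)."""
    return p ** 2 * math.log(p) * math.log(math.log(p))


def log_spaced(lo, hi, count):
    """Sample points spaced evenly in log(p) between lo and hi inclusive."""
    step = (math.log(hi) - math.log(lo)) / (count - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(count)]


if __name__ == "__main__":
    ps = log_spaced(86243, 2976221, 50)
    k = fit_power_law(ps, [theoretical_cost(p) for p in ps])
    print(f"effective power over the sampled range: {k:.3f}")
```

Measured timings can be passed to fit_power_law in place of the theoretical values; the fitted power depends on the interval chosen, so fits over different exponent ranges are not directly comparable.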
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:30 |
2018-12-24, 22:06 | #3 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×1,669 Posts |
Prime95 P-1 run time scaling
A small number of widely spaced exponents were run to observe the run time scaling.
For prime95 v29.4b8 x64 run on a Windows 7 x64 system with dual e5-2670 chips, 4 cores (half a chip package) per worker, and a 32,000 MB memory allowance per worker, run time was approximately proportional to p^{2.33} up to 595M (27 days), a somewhat higher power than observed for P-1 on gpus (~2.1).
Another prime95 v29.4b8 x64 run, on an FMA equipped i7-7500U Windows 10 x64 system, seemed to be taking inordinately long to perform P-1 at p=101M, with 7,200 MB memory allowed and one core. It had been running for two weeks to perform stage 1 and reach 90% in stage 2, and it appeared to be paging to disk excessively. The same system can complete an 83M primality test per core in about 2.5 weeks. It was allowed to complete that P-1 and was then reset to 4096 MB memory allowed, after it was found to still page excessively at 6144 MB. This is a system with 8GB ram currently. In all cases it was running 1 core per worker; the other worker was running an 83M LL. It projected P-1 run times ranging from 4.4 days for 201M to 43 days for 605M and 67 days for 701M. However, attempting 605M resulted in "Cannot initialize FFT code, errcode=1002". The fit to observed run time is p^{2.087} (with five data points).
Another run, a mix of prime95 V29.7b1, v29.8b3, and v29.8b6, on an FMA equipped i7-8750H Windows 10 x64 system, was able to run 801M (with 8GB allocated of its 16GB installed ram, 37 days run time) and 901M (with 12GB allocated, 57 days run time), and is expected to be capable of up to 920.8M. The offset in the estimated days runtime is believed to be due to whether mfakto is running on the Intel igp or not. It seems to be using somewhat lower bounds than GPU72 figures for exponents above p~400M.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-01-05 at 14:05 Reason: updated i7-8750h attachment for new data |
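Once a power has been fitted, as in the post above, it can be used for rough extrapolation to other exponents. This is a minimal sketch with hypothetical function names (two_point_exponent, extrapolate_run_time); the sample inputs are the program-projected i7-7500U figures quoted above (4.4 days at 201M, 43 days at 605M), and the result is only as good as the assumption that memory allowance and bounds selection stay comparable.

```python
import math


def two_point_exponent(p1, t1, p2, t2):
    """Power k such that t2 / t1 == (p2 / p1) ** k."""
    return math.log(t2 / t1) / math.log(p2 / p1)


def extrapolate_run_time(t_ref, p_ref, p_new, k):
    """Scale a reference run time to a new exponent, assuming t ~ p**k."""
    return t_ref * (p_new / p_ref) ** k


if __name__ == "__main__":
    k = two_point_exponent(201e6, 4.4, 605e6, 43.0)
    print(f"two-point power from the projected figures: {k:.2f}")
    estimate = extrapolate_run_time(4.4, 201e6, 701e6, 2.087)
    print(f"701M estimate from 201M at p^2.087: {estimate:.1f} days")
```

A two-point power is more sensitive to noise in either measurement than a fit over several exponents, so it is best treated as a sanity check rather than a replacement for the fit.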
2018-12-28, 18:13 | #4 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·1,669 Posts |
Effect of number of workers
Similar to the number of threads choices in gpu applications, on multicore systems, the effect of number of cores per worker in prime95 is unpredictable, and so there is provision for benchmarking.
Number of workers could be chosen to optimize performance. But which measure of performance? Aggregate throughput maximized, latency of one assignment minimized, number of joules used for a 100 GHzD primality test, aggregate throughput given a constraint of latency low enough to avoid assignment expiration, something else? For which single fft length, or for the current and next several?
For minimum latency, as for confirming a newly discovered Mersenne prime, Madpoo has run experiments on a dual-14-core system. He reported the fastest primality test time at around 20 cores out of the 28 available; with more than 6 cores on the less-used package, the increased package-to-package data transfers slow progress.
To pick a number of cores per worker for each cpu type that is a reasonable compromise for maximum aggregate throughput, so that I can set it and forget it for months or years on each system, I ran the built-in prime95 benchmarking over wide fft ranges for a variety of cores/worker, on a variety of cpu types. Then the timings were tabulated in spreadsheets and graphed. If going after the maximum performance per fft length, consider that some work types restart from the beginning when the number of workers is changed. Read the readme.txt and other files, back up before changing the number of workers, plan ahead, etc.
Some patterns emerge. Worker counts that would straddle the divide between processor packages if divided evenly typically do not provide as much throughput. A 12-core 2-package system with 3 workers with equal cores/worker would have a worker with cores in each package (4, 2+2, 4). George indicates recent versions of prime95 prevent the straddle by assigning unequal numbers of cores to the workers. For larger core counts there can be quite a few choices to evaluate. What's fastest for one fft length may not be for others. A compromise that averages a small percentage penalty is usually available.
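The package-straddle pattern can be checked mechanically before benchmarking a configuration. The sketch below is an illustration under two stated assumptions: cores are numbered contiguously package by package, and workers are given contiguous blocks of cores in order. The function name straddling_workers is hypothetical, and this is not how prime95 itself assigns affinity.

```python
def straddling_workers(cores_per_worker, cores_per_package):
    """Return indexes of workers whose cores span more than one package.

    Assumes cores are numbered package by package and each worker takes
    the next contiguous block of cores; every worker has at least 1 core.
    """
    straddlers = []
    first_core = 0
    for index, count in enumerate(cores_per_worker):
        last_core = first_core + count - 1
        if first_core // cores_per_package != last_core // cores_per_package:
            straddlers.append(index)
        first_core += count
    return straddlers


if __name__ == "__main__":
    # 12 cores in 2 packages of 6: three equal workers put the middle one
    # across both packages (4, 2+2, 4); unequal workers avoid that.
    print(straddling_workers([4, 4, 4], 6))
    print(straddling_workers([3, 3, 6], 6))
```

An empty result means no worker crosses a package boundary under these assumptions; the actual core-to-package mapping of a given system should be confirmed before relying on it.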
Plotting the various combinations with trend lines seems a useful visualization method for selecting one configuration to run with for a long time.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-03-17 at 17:44 |
2018-12-28, 18:16 | #5 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1101000010100_{2} Posts |
Effect of number of workers continued
Working around the 5-attachment limit per post:
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-03-17 at 18:07 Reason: some updated to include max exponent and latency |
2019-03-14, 03:49 | #6 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·1,669 Posts |
Effect of frequent Res64 output
Timing runs on LL DC on the same 51M exponent and old 32-bit hardware with prime95 29.4b7 yield conflicting information on the cost of a Res64 output as a multiple of an ordinary iteration. The res64 cost is estimated as 7/8 to 4 times an iteration. Note that because of numbering skew between prime95 and other conventions, prime95 outputs res64 at 3 successive iterations, with cost ~3.1 to 12 times an iteration. The lower value is based on prime95-provided timings per iteration, the higher value on prime95-provided time stamp of 1 second resolution of the res64 output line.
An initial attempt to make a similar measurement on an i7-8750H with UHD630 igp in prime95 v29.4b8 x64 yielded a negative per-res64 cost in two tries. I speculate this was an interaction with mfakto running at the same time on the same chip package power budget. Performance monitor indicates the cpu utilization drops considerably when frequent interim residue output is enabled.
A retest, with the UHD630 mfakto instance halted, yielded timings that indicate a cost per PRP3 res64 interim output on the i7-8750H system of 2.7 seconds, equivalent to 263 iterations, on an 83M primality test. When outputting an interim residue every 10 iterations, one of the 6 cores stays very busy while the rest are only used at a low duty cycle. This cut throughput from 96.6 iter/sec to 3.54 iter/sec, a rather severe 96.3% reduction. The estimated effect on run time for the exponent when producing interim residues for the primenet server at 5,000,000 iteration intervals is about 45 seconds, 52ppm of run time. The retest was brief, taking 48 seconds for iterations with interim residues, and 114 seconds without, so accuracy is no better than a percent or two. Note also the cpu clock was not held constant during the test. In this case the agreement between time stamp based rates and program-computed ms/iter was very good, ~1/4%.
Another test, on a dual-xeon-e5-2690 system, v29.6b6 x64 on Win10, 4 cores/worker, 83.9M PRP tests, gave ~305 iterations/interim residue64, 3.45 sec/interim residue, or around 61ppm for the default 5,000,000 iteration interval. The preceding figure ignores the initial 500K-iteration interim residue, which raises the impact a bit to 65ppm for ~84M exponents, and somewhat more for DC exponents.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-18 at 14:31 |
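The overhead arithmetic in the post above can be reproduced in a few lines. This is a sketch with hypothetical function names, using the i7-8750H figures quoted above as sample inputs. It treats the number of interim residues as exponent / interval (an average, not an integer count) and, like the 61ppm figure above, ignores the initial 500K-iteration residue.

```python
def interim_residue_overhead(exponent, iters_per_sec, seconds_per_residue, interval):
    """Estimate the cost of interim residue output over a whole test.

    Approximates the number of residues as exponent / interval and the
    test length as `exponent` iterations.
    """
    run_seconds = exponent / iters_per_sec
    extra_seconds = (exponent / interval) * seconds_per_residue
    return {
        "equivalent_iterations": seconds_per_residue * iters_per_sec,
        "extra_seconds": extra_seconds,
        "ppm_of_run_time": 1e6 * extra_seconds / run_seconds,
    }


def throughput_reduction(baseline_rate, degraded_rate):
    """Fractional loss of throughput, e.g. 0.25 for a 25% reduction."""
    return 1.0 - degraded_rate / baseline_rate


if __name__ == "__main__":
    result = interim_residue_overhead(83_000_000, 96.6, 2.7, 5_000_000)
    for name, value in result.items():
        print(f"{name}: {value:.1f}")
    print(f"reduction at 10-iteration interval: {throughput_reduction(96.6, 3.54):.1%}")
```

Because the exponent cancels out, the ppm figure depends only on the equivalent iterations per residue divided by the output interval, which is why the two systems above give similar results despite different timings.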
2019-08-12, 17:22 | #7 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·1,669 Posts |
Prime95 documentation
Most GIMPS applications include a readme file. Prime95 has very comprehensive documentation included in the zip package, in multiple files.
Also, for those who would like a deeper understanding, there is the source code and its gwnum folder's tutorial.txt.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-12-15 at 23:07 Reason: added tutorial.txt |
2020-05-25, 00:07 | #8 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×1,669 Posts |
Prime95 exponent limits
Prime95 and its sibling mprime contain many code paths specific to processor types and exponent magnitudes. What range of exponents is supported varies by processor type. I think what has been implemented was determined by a combination of processor throughput versus exponent size and decisions by George on which to spend his programming time.
There are several ways to determine what these limits are. George has made statements about them in email or on the forum. https://mersenneforum.org/showpost.p...&postcount=219 The whatsnew.txt describes numerous changes in what was supported. The source code is available for examination. Trying runs on differing hardware and OS may obscure the situation, because it could be an old operating system version, not the processor type, that prevents running some versions of code.
For a given CPU instruction set, refinements from one version of prime95/mprime to another will sometimes raise the exponent limit per fft length and overall. So the attachments may be viewed as rough guidelines.
Note that while benchmarking on AVX512 goes all the way up to 64M fft, ~1169M exponent, for the time being the LL limits for AVX512 are about the same as for AVX2, ~922M. Also, in testing of prime95 v30.7b9, P-1 and PRP appear to be limited to prime exponents below 2^{30}, 1,073,741,789 max, which makes the 64M fft length unnecessary and limits use of the 60M fft length also. This was observed on an AVX512 i5-1035G1 system.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-11-29 at 20:13 Reason: added P-1 and PRP empirical limits |
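The 2^{30} figure in the post above can be checked directly, and the fft limits translated into bits per fft word. The sketch below is a simple illustration with hypothetical helper names; it uses plain trial division (adequate at this size), and it assumes "64M fft" means 64 * 2^20 words.

```python
def is_prime(n):
    """Trial division; fast enough for single values near 2**30."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def largest_prime_below(limit):
    """Largest prime strictly less than limit, or None if there is none."""
    for candidate in range(limit - 1, 1, -1):
        if is_prime(candidate):
            return candidate
    return None


def bits_per_word(exponent, fft_length):
    """Average number of bits of the exponent carried by each fft word."""
    return exponent / fft_length


if __name__ == "__main__":
    print(largest_prime_below(2 ** 30))
    print(f"{bits_per_word(1_169_000_000, 64 * 2 ** 20):.2f} bits/word at 64M fft")
```

The bits-per-word figure is only an average; the actual limit per fft length is set by roundoff error behavior in the program and varies by instruction set and version.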
2020-07-27, 16:03 | #9 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·1,669 Posts |
PRP proof capable versions
UPDATE:
As of ~2022-05-06, the latest version promoted to general use is v30.8b15. Get it here or here. Some earlier versions did not include the proof file upload capability: PRP proof generation was introduced in v30.1, and automatic uploading of proofs in v30.2.
The standalone command-line uploader, which works for gpuowl as well as prime95, is described briefly at https://www.mersenneforum.org/showpo...&postcount=154 but the direct download from dropbox for Windows x64 is no longer available. It can be found as an attachment at https://www.mersenneforum.org/showpo...0&postcount=26 NOTE: it is not being maintained, and the preferred usage is to upload through a current version of prime95 or mprime. Usage is
Code:
uploader user_id proof_filename[ chunk_size[ upload_rate_limit]]
(Note, for gpuowl, there are more choices; https://www.mersenneforum.org/showpo...0&postcount=26, some of which might conceivably apply to prime95/mprime too, at least for the most adventurous. But I encourage users to stick with prime95 & mprime's built in PrimeNet API & supported features whenever practical.)
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2022-07-30 at 18:53 Reason: V30.8b15 update |
2020-10-07, 06:50 | #10 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
15024_{8} Posts |
Performing version upgrades
The most efficient method will depend on whether it's a single install or a fleet of them to be upgraded. Each leaves your results files, worktodo, log files, work in progress files, and configuration files in place and undisturbed. (But you should be doing regular system backups anyway.)
Single install:
1. Stop and exit the prime95 program to allow prime95 program files to be overwritten.
2. Download the zip file.
3. Unzip it.
4. If necessary, move the new files into your working directory. Select replace if prompted.
5. Restart the program in the working directory.
Multiple systems, with USB drive:
1. Download the zip file. Put it onto the USB drive. Unzip it there.
2. On each system: Insert the USB stick. Stop and exit the prime95 program to allow prime95 program files to be overwritten. Copy the new version's files from the USB stick to the working directory, overwriting the old. Start the program in the working directory. "Eject" the USB drive. Its file explorer window will close. Remove the USB stick.
Multiple systems, with network drive:
1. Download the zip file. Put it onto the network drive. Unzip it there.
2. On each system: In file explorer, navigate to the update version prime95 folder on the network drive. Stop and exit the prime95 program to allow prime95 program files to be overwritten. Copy the new version's files from the network folder to the working directory, overwriting the old. Start the program in the working directory. Close the file explorer window for the update version folder.
It's possible to streamline the above somewhat with a bit of batch script. Strictly speaking, it is not necessary to copy and overwrite files that have not changed from the previous version, but it does little harm. Unneeded copying can be efficiently avoided by date sorting both source and destination folders, and only copying what's newer than the corresponding destination file.
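The "only copy what's newer" step can be scripted. The following is a minimal sketch in Python rather than batch script; copy_newer_files is a hypothetical helper, the paths in the usage comment are placeholders, and it assumes the unzipped distribution is a flat folder (subfolders are skipped). Files that exist only in the working directory, such as worktodo.txt, are never touched. Stop and exit prime95 before running anything like this.

```python
import shutil
from pathlib import Path


def copy_newer_files(source_dir, working_dir):
    """Copy files that are missing from, or newer than, the working directory.

    Returns the names of the files copied. Only top-level files in
    source_dir are considered; nothing in working_dir is deleted.
    """
    source = Path(source_dir)
    target = Path(working_dir)
    copied = []
    for src in sorted(source.iterdir()):
        if not src.is_file():
            continue
        dst = target / src.name
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            shutil.copy2(src, dst)  # copy2 preserves the modification time
            copied.append(src.name)
    return copied


# Example call with placeholder paths (adjust to your own folders):
# copy_newer_files("E:/prime95-new", "C:/Users/me/prime95")
```

Comparing modification times relies on the unzip tool preserving the archive's timestamps; if it stamps files with the extraction time instead, every file will look newer and be copied, which is harmless but defeats the optimization.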
For more detail, quoted with some editing, from S485122 at https://mersenneforum.org/showpost.p...61&postcount=4
prime.txt contains the GIMPS user data, local.txt contains the machine data, and worktodo.txt contains the current work (assigned or not). At some times a file named prime.spl, which contains the results not yet transmitted to the server, might be present, along with the work files pnnnnnnn, mnnnnnnnn, etc. and their backup copies .bu, .bu2, etc. None of these files are in the prime95.zip archive, so they will not be overwritten. They are essential for continuity. There are other user files that are not in the archive either, but they are less critical (results.txt, results.json.txt, prime.log, gwnum.txt, ...).
In other words, keep all other files in the folder, since they contain your user and machine data and preferences, your work in progress and results. The only files overwritten will be the program and version dependent files.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2020-11-19 at 22:26 |
2020-11-15, 19:38 | #11 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
6676_{10} Posts |
Effect of number of workers continued 2
Additional processor types:
FMA3 capable 6-core i7-8750H (no code running on the IGP at the time)
Xeon Phi 7250 (68 cores in one socket); see also https://www.mersenneforum.org/showthread.php?t=25767
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2021-06-18 at 16:50 |