mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Blogorrhea > kriesel

Closed Thread
 
Thread Tools
Old 2021-06-18, 16:49   #12
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1E3116 Posts
Default Use as hardware reliability test

There's a pretty good video on this at https://www.youtube.com/watch?v=n0U7fPKRlVs.
There are what occur to me as some inaccuracies in that youtube video.

Hyperthreading should not usually be used in primality testing; performance is usually better without employing the additional threads in prime95 / mprime.

Prime95 includes disclosure of cpu type, number of cores, whether hyperthreading is available, instructions supported, cache sizes etc. Options, CPU...

It will not test the reliability of your GPU, IGP, PCIe slots, etc. Consider Gpuowl, CUDALucas -memtest, mfakto or mfaktc selftest, or other GPU GIMPS applications for that. And actual hardware test software.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-11-10 at 18:14
kriesel is offline  
Old 2022-08-13, 20:25   #13
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

170618 Posts
Default Memory use during PRP

P-1, P+1, or ECM factoring may benefit from considerable allowed RAM use. Primality testing via PRP or LLDC has less need of RAM. For exponents up to ~120M, a few hundred MB per worker is enough. Even dual workers at 500+M exponent each do not use 3GiB of RAM on systems with several times that or more installed.
It looks like from the small charted set of data, that ~2.6 bytes times sum of exponents being primality tested at the moment on the system is a usable rough estimate of required RAM, in the absence of any of the memory-hungry factoring algorithms.
I don't think Windows version matters. Data were collected from Vista to Windows 11. Prime95 version probably does not matter much either.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf prime95 PRP memory usage.pdf (17.9 KB, 52 views)

Last fiddled with by kriesel on 2022-08-13 at 20:47
kriesel is offline  
Old 2022-09-08, 05:46   #14
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

59·131 Posts
Default Interpreting the error-counts 32-bit word

I haven't verified that the output form in results.txt or results.json.txt matches the internal storage form, but it seems likely.

from prime95 v30.8b15 source module commonb.c starting line 6116
Code:
/* Increment the error counter.  The error counter is one 32-bit field containing 5 values.  Prior to version 29.3, this was */
/* a one-bit flag if this is a continuation from a save file that did not track error counts, a 7-bit count of errors that were */
/* reproducible, a 8-bit count of ILLEGAL SUMOUTs or zeroed FFT data or corrupt units_bit, a 8-bit count of convolution errors */
/* above 0.4, and a 8-bit count of SUMOUTs not close enough to SUMINPs. */
/* NOTE:  The server considers an LL run clean if the error code is XXaaYY00 and XX = YY and aa is ignored.  That is, repeatable */
/* round off errors and all ILLEGAL SUMOUTS are ignored. */
/* In version 29.3, a.k.a. Wf in result lines, the 32-bit field changed.  See comments in the code below. */

void inc_error_count (
    int    type,
    unsigned long *error_count)
{
    unsigned long addin, orin, maxval;

    addin = orin = 0;
    if (type == 0) addin = 1, maxval = 0xF;                // SUMINP != SUMOUT
    else if (type == 4) addin = 1 << 4, maxval = 0x0F << 4;        // Jacobi error check
    else if (type == 1) addin = 1 << 8, maxval = 0x3F << 8;        // Roundoff > 0.4
    else if (type == 5) orin = 1 << 14;                // Zeroed FFT data
    else if (type == 6) orin = 1 << 15;                // Units bit, counter, or other value corrupted
    else if (type == 2) addin = 1 << 16, maxval = 0xF << 16;    // ILLEGAL SUMOUT
    else if (type == 7) addin = 1 << 20, maxval = 0xF << 20;    // High reliability (Gerbicz or dblchk) PRP error
    else if (type == 3) addin = 1 << 24, maxval = 0x3F << 24;    // Repeatable error

    if (addin && (*error_count & maxval) != maxval) *error_count += addin;
    *error_count |= orin;
}
So, if the 32 bit error count word was 0x12345678, I think that would mean the following:
1 & 0xC most significant two bits unassigned
0x12 && 0x3F is the repeatable error count field with max value 63 base 10;
3 is GEC error field, max value fifteen
4 is Illegal Sumout field, max value fifteen
5 & 4 is FFT data zeroed error bit field, 0 or 1
5 & 8 is corrupted data indicator bit field, 0 or 1
0x56 & 3F is roundoff error > 0.4 field, max value 63 base 10;
7 is Jacobi symbol error check field, max value fifteen
8 is suminp!=sumout errors field, max value fifteen


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2023-04-15 at 11:46 Reason: removed draft status
kriesel is offline  
Old 2022-09-25, 17:08   #15
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

59×131 Posts
Default P-1 performance

Mprime / prime95 V30.8 introduces an enhancement in P-1 performance, using polynomials to achieve almost 100% pairing of primes in stage 2. This allows cost-effectively factoring to higher stage 2 bounds, and achieving higher factor found probability, saving more primality tests.
Use V30.8 or later for P-1 factoring. Use adequate memory to enable the gains. A few GB is better than only running stage 1. As many GB for stage 2 as you can safely allow is better yet. The savings (which I define as over many very similar exponents, estimated probability of finding a factor and thereby avoiding some full PRP tests times estimated cost per primality test avoided, minus estimated cost of P-1 factor testing as a function of exponent, ram, and bounds) are approximately logarithmic with allowed memory, so 4 GiB is better than nothing, 16 GiB is good, 32 is better, more is better yet, until risking the onset of slowdown by paging/swapping which can cut performance drastically and change the expected gain into a large loss. Prime95's GUI limits allowed stage 2 ram to 90% of installed physical system ram. That limit can be overridden by editing local.txt's Memory= line with a text editor, then restarting prime95.

Use adequate memory and bounds the first time P-1 is run on an exponent. "Optimizing" by running P-1 to low bounds first, selected to maximize factors found per unit of initial computing effort, is actually a DE-optimization for the project. Avoid inadequate-bounds factoring attempts whenever feasible. (Even if it means asking someone else to do the P-1 run.)

Typically on CPUs I've benchmarked, at the wavefront of DC or first time testing, fewer cores per worker produces highest aggregate throughput figures, but the difference is slight. The response of prime95 v30.8 P-1 to a lot of allowed ram is larger. So run two workers for a single-CPU-package system, and they will use about the same amount of time in stage 1 and two, and alternate using large quantities of memory for stage 2, fully employing available memory and maximizing expected net savings of computing time. See second attachment. That attachment clearly shows by comparing the first try and retry curves for similar exponents (current first-primality-test wavefront) that the expected time saved is much larger for a first try, even at considerably less allowed ram than for a retry with nearly 64 GiB of ram. It also indicates there is not much difference in expected time saved versus retried exponent for the same allowable ram, from near the current DC wavefront (66M) to the first-test wavefront (110M).

Multi-socket systems (Dual-Xeon, Quad-Xeon, etc) may have nonuniform memory access (NUMA). Specifying a large amount of allowed ram that will cause significant traffic over the NUMA interconnect (QPI, UPI, etc) may be slower than using a lesser amount of ram all connected to one processor socket. Performance may be better on a dual-Xeon system by running 4 workers, with ~45% of system ram allowed per worker, so that it could all be on the near side of the NUMA interconnect. There was a noticeable dip in performance on a dual-Xeon system with 2 workers when using more than half the total system ram in stage 2 in a single worker, which would require some of it to be accessed across the NUMA boundary. See first attachment.

In prime95 v30.8b14, with prime95 optimizing bounds freely for the given exponent and allowed stage 2 ram, selected B1 is observed to increase slightly (~O(ram0.15)) with allowed ram; selected B2 is observed to increase nearly linearly with allowed ram (~O(ram0.77 to ram0.98)), and B2/B1 ratio increase substantially (~O(ram0.62 to ram0.89))with allowed ram. I believe the stage 2 performance increase results in selecting somewhat higher B1, as well as much higher B2, for the same exponent and allowed ram as would have occurred in v30.7 or earlier. Perhaps counterintuitively, these optimizations, for total probable compute time, result in longer P-1 stage 1 and stage 2 times with increasing ram allowed.

See also https://www.mersenneforum.org/showpo...&postcount=724 and https://www.mersenneforum.org/showpo...&postcount=727

The preceding is all in the context of mprime/prime95 doing both stage 1 and stage 2. A recent development in gpuowl supports passing the results of P-1 stage 1 performed standalone in gpuowl, into an mprime/prime95 folder and worktodo file for performance of stage 2 by prime95. See https://mersenneforum.org/showpost.p...postcount=2870 and some posts following it for more info, relevant to gpuowl ~v7.2-129 and configuring prime95 v30.8+ to work with that.

A brief comparison of v30.7b9 and v30.8b14, on i5-1035G1, Windows 10, two 32GiB DIMMs installed, on the same P-1 assignment, at 60GiB allowed stage 2 ram, running a single worker with 4 physical cores, yielded the following
Code:
for worktodo line PFactor=(AID),1,2,118970857,-1,77,1.3

version      B1        B2     factor odds    runtime estimate    computed odds/day of P-1
v30.7b9   587,000  25,716,000    3.59%      12 hours 26 minutes          6.93%/day
v30.8b14  840,000 267,531,000    5.49%      10 hours 41 minutes         12.33%/day
ratio      1.431     10.403      1.529             0.859                   1.779
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2023-05-19 at 13:44 Reason: minor edits
kriesel is offline  
Closed Thread

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
gpuOwL-specific reference material kriesel kriesel 33 2023-03-06 22:59
clLucas-specific reference material kriesel kriesel 5 2021-11-15 15:43
Mfakto-specific reference material kriesel kriesel 5 2020-07-02 01:30
gpu-specific reference material kriesel kriesel 4 2019-11-03 18:02
CUDAPm1-specific reference material kriesel kriesel 12 2019-08-12 15:51

All times are UTC. The time now is 01:29.


Fri Jun 9 01:29:56 UTC 2023 up 294 days, 22:58, 0 users, load averages: 0.77, 0.86, 0.95

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔