mersenneforum.org (https://www.mersenneforum.org/index.php)
-   -   What determines if P-1 factoring is used? (https://www.mersenneforum.org/showthread.php?t=26849)

 drkirkby 2021-05-29 18:41

What determines if P-1 factoring is used?

I wanted to check the RAM usage during P-1 factoring, so I manually reserved an exponent for a [B]PRP test.[/B] The exponent has had no P-1 factoring.

[URL]https://www.mersenne.org/report_exponent/?exp_lo=108792767&full=1[/URL]

I assumed that when I run the exponent using mprime 30.6b4, using the line given by the server

[CODE]PRP=xxx,1,2,108792767,-1,76,0,3,1[/CODE]that mprime would attempt P-1 factoring to look for a factor, possibly saving the time of a computationally expensive PRP test.

I then changed the line in worktodo.txt, to remove the 3,1 on the end,
[CODE]PRP=xxx,1,2,108792767,-1,76,0[/CODE]deleted all the files related to the exponent, and started again. I thought that would force a P-1 test, but it did not.

I'm a bit puzzled why mprime goes straight into a computationally expensive PRP test before giving a P-1 test a chance to find a factor.

Yes, I am aware I could have reserved P-1 factoring rather than a PRP test, but I'd rather make use of the P-1 results, and I assumed that since P-1 factoring had not been done, it would be done before starting the PRP test. I obviously have some basic misunderstanding here.

Dave

 drkirkby 2021-05-29 18:50

I see changing the line in worktodo.txt to
[CODE]
PRP=(aid redacted),1,2,108792767,-1,76,2[/CODE]forces a P-1 test. Then mprime reports
[CODE][Worker #2 May 29 19:44] Optimal P-1 factoring of M108792767 using up to 376832MB of memory.
[Worker #2 May 29 19:44] Assuming no factors below 2^76 and 2 primality tests saved if a factor is found.
[Worker #2 May 29 19:44] Optimal bounds are B1=928000, B2=55053000
[Worker #2 May 29 19:44] Chance of finding a factor is an estimated 4.73%[/CODE]I'm just puzzled why this does not occur as a matter of course. It's only been trial factored to 2^76.

 kriesel 2021-05-29 18:56

Because for some reason the server incorrectly gave an assignment with <tests_saved>=0 that time.
That tells the client software mprime there is no point in attempting P-1, so it doesn't bother.
You can fix that by changing the 76,0 to 76,1 or 76,2.
Or by constructing a suitable Pminus1 entry preceding the PRP entry, then stopping and restarting.
As usual [URL="https://www.mersenneforum.org/showpost.php?p=522098&postcount=22"]reference info[/URL]!
Not sure why the server seems to have gotten that one wrong. Usually it has worked.

 drkirkby 2021-05-29 19:08

[QUOTE=kriesel;579422]Not sure why the server seems to have gotten that one wrong.[/QUOTE]
I'm glad to see I am not the only one that thought it was wrong! Perhaps because I specifically requested PRP, although I'm a bit surprised.

I reserved 16 exponents for PRP tests, and every one was of the form
[CODE]PRP=xxx,1,2,exponent,-1,76,[B]0,3,1[/B][/CODE]Thank you for editing the post where I left the assignment ID. I realised I'd screwed up, and was about to edit it, but I see you beat me to it.

Dave

 Happy5214 2021-06-04 23:22

[QUOTE=drkirkby;579421][CODE]PRP=(aid redacted),1,2,108792767,-1,76,2[/CODE][/QUOTE]

If you're doing a PRP with proof, wouldn't you only save ~1 test (not 2) by finding a P-1 factor, since the certification time pales in comparison to the first test?

 LaurV 2021-06-05 11:54

Yes, when referring to P95-given assignments, that is a remnant from the old LL age. Nowadays you only save one test plus the cert time (which is about a fifth of a test, or less, depending on the power you use in generating the cert).

However, there is nothing wrong with using 2. For example, with manual P-1 assignments, people (me included) often artificially raise the number to 3, 5, etc. (i.e. by manually editing it), to trick the P-1 bounds calculation into using a larger B1 (and, if you have the memory, a larger B2 too). This way you spend a little more time doing P-1, but increase your chances of finding a larger factor. This is better than specifying a "hard" larger B1 (on the command line or in the worktodo file for gpuOwl, for example), because this way the bound stays "flexible": it depends on the exponent rather than being fixed.

 kriesel 2021-06-05 12:22

[QUOTE=LaurV;580007]cert time (which is about a fifth of a test, or less, depending on the power you use in generating the cert[/QUOTE]Normally FAR less than a fifth; generally under 1%. For example, Gpuowl v6.11-380's default proof power is 8, which takes an overall effort for proof generation, server processing, and verification (CERT) of 0.41% of a primality test. The content there (IIRC, re proof costs) is a reviewed summary of statements made by Prime95 and Preda: [URL]https://www.mersenneforum.org/showpost.php?p=523345&postcount=4[/URL] So, relative to the cost of a single primality test: LL & LLDC without error, 2; LL & LLDC with usual error rates at p~104M, ~2.04 tests equivalent; PRP & PRPDC without proof, ~2.00; PRP & proof, 1.0041 for power 8. Note the proof cost for power 9 is about the same as the 0.2-0.3% cost of the GEC or Jacobi checks.

 drkirkby 2021-06-05 13:01

[QUOTE=LaurV;580007]However, there is nothing wrong with using 2, for example, with manual P-1 assignments, people (me included) use to artificially raise the number to 3, 5, etc (i.e. manually editing it), to cheat P-1 bounds calculation into using a larger B1 (and if you have memory, B2 too). This way you will spend a little bit more time doing P-1, but increase your chances of finding a larger factor. This is better than using a "hard" larger B1 (specified in command line or in worktodo file for gpuOwl, for example), because in this case the bound is "flexible", it depends on the exponent, it is not fixed.[/QUOTE]
What's your reason for wanting to find a large factor? I can think of three possible reasons.
a) You like hunting for factors. I know many people do, as they have a better success rate than those looking for just Mersenne primes.
b) To reduce the probability of you needing to run the more computationally expensive PRP test.
c) A reason I can't think of.

If your [B]only[/B] interest is finding Mersenne primes, then is it not counterproductive to run larger B1 and B2 bounds? It would seem from

[URL]https://www.mersenne.org/various/math.php#p-1_factoring[/URL]
that the optimal values of B1 and B2 are worked out by maximising the expression
[CODE]chance_of_finding_factor * primality_test_cost - factoring_cost[/CODE]
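A quick numeric instance of this expression may help. The 4.73% chance and the two tests saved are the values mprime reported earlier in the thread; the P-1 cost of 3% of a primality test is an assumed placeholder for illustration, not a measured figure:

```python
# Illustrative evaluation of the expression above, in units where one
# primality test costs 1.0. The 4.73% chance and 2 tests saved come from
# mprime's output earlier in this thread; the 3% P-1 cost is an assumption.
chance_of_finding_factor = 0.0473
primality_test_cost = 2.0      # two tests saved, each costing 1.0 in these units
factoring_cost = 0.03          # assumption: P-1 costs ~3% of one primality test

gain = chance_of_finding_factor * primality_test_cost - factoring_cost
print(gain > 0)   # True: this P-1 run pays for itself on average
```

With these numbers the expected gain is about 0.065 of a primality test, which is why mprime considers the P-1 run worthwhile.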

 LaurV 2021-06-06 13:38

Our motivations vary.

Our strength, as a group, lies in the fact that we are different.

 Siegmund 2021-07-22 00:57

I had just peeked in to ask whether P-1 factoring bounds were being set higher than they should be... I noticed today that a recent PRP assignment was doing P-1 on the basis of saving two tests rather than one... and am happy to see that there was already a thread about it, and that there's an easy way I can manually change an assignment to tell it that it saves only 1 test.

It does seem like the default ought to be 2 for LL testers and only 1 for PRP testers.

 chalsall 2021-07-22 01:15

[QUOTE=Siegmund;583723]It does seem like the default ought to be 2 for LL testers and only 1 for PRP testers.[/QUOTE]

You are not incorrect.

But, some of the ones doing P-1 (or even P+1) simply like to find factors.

Let them work.

Others will come behind to clean up.

 kriesel 2021-07-22 02:27

[QUOTE=Siegmund;583723]It does seem like the default ought to be 2 for LL testers and only 1 for PRP testers.[/QUOTE]When standalone P-1 is performed, how is the server to predict whether the first primality test that will be assigned later and performed later will be LL, PRP without proof or with bad proof, or PRP with a good proof that verifies as correct?

 LaurV 2021-07-22 02:41

My two cents: the value should stay 2. A little more P-1 won't hurt anybody, and it may be beneficial in the long term.

 Zhangrc 2021-07-22 04:27

[QUOTE=LaurV;583731]the value should stay 2.[/QUOTE]

If one is interested in P-1 factors, he or she could of course use 2-primality-test-saved bounds. However, some people just want to test as many exponents as possible using PRP with proof; for them, the 1-test-saved bounds are OK. Of course, we could meet in the middle and use 1.2-test-saved bounds, since there are some PRP tests with bad proofs or that stall out.

Personally, I suggest doing more TF work at the current PRP wavefront. Adding up the throughput of the top 500 producers, we have done 64 million GHzDays on PRP tests in the last year, but over 147 million GHzDays on TF (over twice as much work!). For this reason, we could TF a bit higher, say to 2^77 (or even 2^78), just as we have done for 95M exponents.

 Uncwilly 2021-07-22 04:51

[QUOTE=Zhangrc;583737]Personally I suggest doing more TF work at current PRP wavefront.[/QUOTE]Quite a bit of the TF work is being done away from the area of FTCs. SRBase is moving through exponents bit level by bit level and not staying below 120M. Also, user TJAOI is doing a lot of work on low exponents (well below the FTC range and at lower bit levels). So these two should not be counted toward the TF total. Also, with PRP and certs there is less work needed to test and confirm exponents. This changes the calculus of what makes sense WRT TF vs primality testing. Those running GPU72 closely watch the front of the Cat 4 FTC wavefront, the Cat 3 FTC wavefront, the currently available TF firepower for working ahead of the FTCs, and what sort of work the various users prefer.

 LaurV 2021-07-22 05:26

[QUOTE=Uncwilly;583739]Also... <snip>

Also... <snip>

Also... <snip>
[/QUOTE]
Also... we are comparing apples with watermelons; the TF credit unit and the PRP credit unit are very different. One good GPU can spit out 3000-6000 GHzDays for every day it does TF, but only 300-800 GHzDays for every day it does LL/PRP/P-1. This is a remnant from the time when CPUs were used to TF, and the credit values were calculated to be approximately equal per unit of time spent [U]by the CPU[/U] on each work type. GPUs joining the fight completely changed the equation: now you can get 5 to 10 times more "credit" if you use your GPU for TF than for PRP, and there are people still motivated by that, especially young gamers whose gaming cards are not good at FP64 flops (needed for PRP) but are excellent at FP32 flops (good for gaming and TF).

Unfortunately (or more exactly, fortunately) this was never fixed, because rebalancing is not easy and would upset some people. On the other hand, giving a lot more TF credit per unit of time may be beneficial, because that is the ONLY incentive given for TF. Some people with gaming GPUs (which are anyhow better at TF and worse at LL/PRP) will join and do TF to advance up the tops fast (two average gaming cards can put you on the tops in a few weeks), therefore helping the project, which is always in need of "more TF". TF has no other incentive (unlike PRP, where you can find a prime and take some money) besides altruism ("we want to help the project"), idiocy ("we want to find factors, or to get a lot of credits, albeit we know neither is of any use"), or entertainment ("yeah, it is fun! hihi" and make donkey face). So, let TF give more credit, that's OK. I personally will jump to grab some of it! :razz:

 Zhangrc 2021-07-22 05:39

[QUOTE=LaurV;583740]now you can get 5 to 10 times more "credit" if you use your GPU for TF than for PRP.[/QUOTE]
Sometimes it's 30 times more, depending on the GPU model. My GPU (a GTX 1650) earns approximately 900 GHzDays per day doing TF, but less than 30 GHzDays doing PRP. As a result, I factor every exponent I test to at least 2^77, sometimes to 2^79.

 Siegmund 2021-07-22 06:39

[QUOTE=kriesel;583729]When standalone P-1 is performed, how is the server to predict whether the first primality test that will be assigned later and performed later will be LL, PRP without proof or with bad proof, or PRP with a good proof that verifies as correct?[/QUOTE]

Perhaps 2 is reasonable if a person requests a standalone P-1 assignment.

It seems less reasonable when one receives a PRP assignment for a number that has not yet had P-1 done on it. (I have been getting quite a few of these, for the past year or so.)

And isn't the intention for all future world-record-sized testing to be PRP, not LL? So the expected number of tests saved is something like 1.03?

If extra factors are found, great - it just seems the default ought to be to minimize time needed to resolve each exponent.

 LaurV 2021-07-22 06:47

[QUOTE=Zhangrc;583743]As a result, I factor every exponent I test to at least 2^77, sometimes to 2^79.[/QUOTE]
Yep, that's exactly what I mean. I do the same. But in the end, what should drive you (the general you, not you personally) is the speed at which you eliminate exponent candidates. If you can run 150 TF assignments and find two factors per day, but it would take you more than half a day to run a PRP test on the same hardware (and I mean at the front level, not picking low-hanging, large-exponent, low-bit-level TF assignments), then your hardware should do TF. You help the project more that way.
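A back-of-envelope sketch of that comparison, using the illustrative numbers from the post above (these are examples for the argument, not measured benchmarks):

```python
# Rough throughput comparison (exponents eliminated per day), using the
# illustrative numbers from the post above, not measured benchmarks.
tf_exponents_per_day = 2.0            # 150 TF assignments/day yield ~2 factors
prp_days_per_test = 0.6               # "more than half a day" per PRP test
prp_exponents_per_day = 1.0 / prp_days_per_test   # ~1.67 candidates retired

# TF retires more candidates per day on this hardware, so it should do TF.
print(tf_exponents_per_day > prp_exponents_per_day)   # True
```

The comparison flips whenever the hardware's PRP speed rises enough that a test finishes in under half a day, which is the point LaurV is making about matching work type to hardware.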

 kriesel 2021-07-22 06:51

[QUOTE=LaurV;583740]you can get 5 to 10 times more "credit" if you use your GPU for TF than for PRP[/QUOTE]The lowest I have on record from older GPU models is 12:1 (TF credit/day) / (PRP, LL, or P-1 credit/day). [URL]https://www.mersenneforum.org/showpost.php?p=497567&postcount=16[/URL] RTX 20xx or GTX 16xx are much higher; I think RTX 30xx higher yet. (Mersenne.ca has the RTX 3090 at 4900 TF vs ~98 PRP etc., 50:1; the RTX 2080 at 2623 TF vs 62.5 PRP etc., 42:1.)

The Radeon VII with mfakto v0.15pre6 was ~1300 GHzD/day on Windows; there are benchmark results supporting up to 486/day in GpuOwl; that's 2.67:1. And it is noted for its incomparable PRP performance: at \$700 original list price, it IIRC beat, by a factor of 2, the \$2500 used Tesla P100. Some of the Teslas may have sufficiently strong DP to show low ratios also.

CPUs I've checked were 0.7 to 1.3.

Hence the general rule: TF on GPUs; PRP, P-1, or LL on CPUs.
Except on the Radeon VII and other recent AMD GPUs, and maybe Teslas.

 drkirkby 2021-07-23 07:49

[QUOTE=Siegmund;583747]Perhaps 2 is reasonable if a person requests a standalone P-1 assignment.
<snip>

If extra factors are found, great - it just seems the default ought to be to minimize time needed to resolve each exponent.[/QUOTE]
Having done a quick test, with results obtained after a few beers,

[URL]https://www.mersenneforum.org/showpost.php?p=583797&postcount=54[/URL]

which I intend to do more thoroughly when 100% sober, I'm not convinced that any value of [B]tests_saved[/B] can really be said to maximise throughput [B]unless you test the P-1 timing on your computer(s).[/B] I tested the run-time of the P-1 test on my Dell PC under the following circumstances:
[LIST][*]Using the exponent [URL="https://www.mersenne.org/report_exponent/?exp_lo=M105216541&full=1"]M105216541[/URL][*]4 workers[*]3 workers doing PRP tests with exponents around 104-105 million (13 cores each)[*]1 worker doing a P-1 test (13 cores)[*]A Dell 7920 tower workstation with 2 x Intel Xeon 8167M CPUs.[/LIST]What I found on that quick test was:
[LIST=1][*]Based on saving 1 primality test. B1=434000, B2=21339000. Chance of finding a factor is an estimated 3.60%. Started 15:15. Finished 16:57. [B]Runtime = 1 hour, 42 minutes = 102 minutes.[/B] Used 207872 MB (203 GB) RAM. 9.0812 GHz days credit.[*]Based on saving 2 primality tests. B1=889000, B2=52784000. Chance of finding a factor is an estimated 4.66%. Started 17:01. Finished 22:16. [B]Runtime = 5 hours, 15 minutes = 315 minutes.[/B] Used 311330 MB (304 GB) RAM. 21.4559 GHz days credit.[/LIST] The ratio of the runtimes of the P-1 tests (315/102 = 3.08824:1) was a lot more than the ratio of the GHz days credits (21.4559/9.0812 = 2.36267).

Given that the optimal bounds for P-1 are based on the calculated computational effort (GHz days), the tests_saved will not be optimal if the actual run-time of the test (in minutes) does not reflect the credit in GHz days. It changes where the optimal point is. Clearly that optimal point [B]could[/B] depend on things such as
[LIST=1][*]What else is running on the machine[*]Cache[*]RAM[*]Whether a CPU or GPU is used, and what model.[*]Number of cores devoted to the task.[*]The actual exponent[*]FFT size, which is a function of the exponent.[*]Bounds B1 and B2.[*]Phase of the moon and direction of wind flow.[*]Things I have not thought of.[/LIST]As I said, I did not do this under ideal circumstances, and therefore the results need double-checking, but I intend to test assuming tests_saved of 0.9 (if mprime accepts values <1.0) and 1.1, then measure the actual run-time of the PRP test.
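One way to act on the measured timings above is a minimal break-even sketch, assuming a found factor eliminates the PRP test entirely and measuring everything in minutes:

```python
# Minimal sketch using the P-1 timings measured above (in minutes).
# Assumption: a found factor eliminates the PRP test entirely.

def expected_minutes(p1_minutes, p_factor, prp_minutes):
    """Always pay for P-1; pay for the PRP test only when no factor is found."""
    return p1_minutes + (1.0 - p_factor) * prp_minutes

# tests_saved=1: 102 min, 3.60% chance; tests_saved=2: 315 min, 4.66% chance.
# The 2-test bounds only win if one PRP test itself takes longer than:
breakeven_minutes = (315 - 102) / (0.0466 - 0.0360)
print(breakeven_minutes / 1440)   # roughly 14 days of PRP runtime
```

Under this (simplified) model, unless a single PRP test on this machine takes around two weeks, the 1-test bounds give the lower expected time per exponent, which is consistent with the suspicion that the heavier bounds are not optimal here.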

IMHO it is a shame so much effort (GHz days) is being put into testing exponents well away from the wavefront. They are not really helpful in finding primes.

 axn 2021-07-23 08:42

[QUOTE=drkirkby;583811] I'm not convinced that any value of[B] tests_saved[/B] can really be said to maximise the throughput [B]unless you test the P-1 timing on your computer(s).[/B] I tested the run-time of the P-1 test on my Dell PC under the same circumstances
<snip>
The ratio of runtimes of the P-1 tests (315/102=3.08824:1) was a lot more than the ratio of GHz days credits (21.4559/9.0812=2.36267).
[/QUOTE]
I do not believe the server has been adjusted to give the correct GHzDay credit for the improved P-1 Stage 2 implemented from 30.4 onwards. I believe it still assumes that the P-1 has been done by the older algorithm and credits accordingly. Hence do not draw any hard and fast conclusions based on those numbers. Ideally, due to the improved stage 2, the credit should be suitably discounted.

However, the optimality calculations done by the program itself _do_ take into account the specifics of the new algo. TL;DR: don't trust the GHzDay numbers.

 drkirkby 2021-07-23 09:48

[QUOTE=axn;583812]I do not believe the server has been adjusted to give the correct GHzDay credit for the improved P-1 Stage 2 implemented from 30.4 onwards. I believe it still assumes that the P-1 has been done by the older algorithm and credits accordingly. Hence do not draw any hard and fast conclusions based on those numbers. Ideally, due to the improved stage 2, the credit should be suitably discounted.

However, the optimality calculations done by the program itself _do_ take in account the specifics of the new algo. TL;DR dont trust the GHzDay numbers.[/QUOTE]Thank you. However, it seems to me that the actual ratio of the runtime of the PRP test to the runtime of the P-1 test is going to depend on the hardware and what that hardware is doing. I would have thought the bounds calculated for P-1 would be determined by some sort of average over a number of different hardware configurations, so benchmarking one's actual hardware might be beneficial. (There's also the [B]very [/B]real possibility that benchmarking one's own hardware uses more CPU time than one gains from it. If one spends days benchmarking something that saves 15 minutes per exponent, it is not really worth it, given that the results are likely to change with exponent size.)

 axn 2021-07-23 12:14

[QUOTE=drkirkby;583814]Thank you. However, it seems to me that the actual ratio of runtime for the PRP test divided by the runtime for the P-1 test, is going to depend upon the hardware and what that hardware is doing.[/QUOTE]

Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success). However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplications with a given FFT. So do P-1 stage 1 and stage 2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and the probability of success (as detailed in [url]https://www.mersenne.org/various/math.php#p-1_factoring[/url]). The only variable here is the memory allocation, which affects how many stage 2 multiplications are needed. The hardware is not very relevant (because the same FFT computation is being used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplications, because stage 2 is somewhat less cache friendly. Conceivably, hardware with a higher memory-bandwidth-to-CPU ratio might have an edge over hardware with a lower one. But this is not factored into the bounds calculation (AFAIK).
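That hardware-neutral costing idea can be sketched roughly as follows, counting everything in FFT multiplications. The constants here (1.44 squarings per unit of B1, the pi(x) ~ x/ln x prime-count approximation, and the 1.2 stage 2 penalty mentioned above) are textbook approximations for illustration, not mprime's actual code:

```python
import math

# Rough, hardware-neutral P-1 cost model in units of FFT multiplications,
# in the spirit described above. Constants are textbook approximations,
# not mprime's actual implementation.

def p1_cost(B1, B2, stage2_penalty=1.2):
    # Stage 1 exponentiates by all prime powers <= B1: about B1/ln(2)
    # (~1.44 * B1) squarings.
    stage1 = 1.44 * B1
    # Classic stage 2 does roughly one multiplication per prime in (B1, B2],
    # each weighted as 1.2 plain multiplications for cache-unfriendliness.
    primes_below = lambda x: x / math.log(x)   # pi(x) approximation
    stage2 = (primes_below(B2) - primes_below(B1)) * stage2_penalty
    return stage1 + stage2

def expected_gain(p, B1, B2, prob_factor, tests_saved):
    # A PRP test on M(p) is ~p squarings, so both sides use the same unit.
    return prob_factor * tests_saved * p - p1_cost(B1, B2)

# Bounds mprime chose earlier in the thread for M108792767:
g = expected_gain(108792767, 928000, 55053000, 0.0473, 2)
print(g > 0)   # True: the chosen bounds have positive expected value
```

Because both sides are measured in multiplications of the same FFT size, the hardware cancels out of the comparison, which is exactly why the ratio is stable across machines.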

 drkirkby 2021-07-23 14:05

[QUOTE=axn;583819]Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success).[/QUOTE]I realise that. But given the probability of success, and the times for P-1 and PRP, one can see if one is maximising this function or not
[URL]https://www.mersenne.org/various/math.php#p-1_factoring[/URL]
which is ultimately what matters for maximum chance of finding a prime.

How accurate are the estimates of finding a factor for P-1?
[QUOTE=axn;583819]
However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplication using a given FFT. So does P-1 stage1 and stage2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and probability (as detailed in [URL]https://www.mersenne.org/various/math.php#p-1_factoring[/URL]). The only variable here is the memory allocation, which affects how many stage2 multiplications are needed. The hardware is not very relevant (because of the FFT computation being used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplication because stage 2 is somewhat less cache friendly. Conceivably, hardware with higher memorybandwidth-to-cpu ratio might have an edge over hardware with lesser bandwidth-to-cpu ratio. But this is not factored into the bounds calculation (AFAIK).[/QUOTE]You surprise me with some of those comments. I would have thought there to be a[B] big [/B]difference in cache friendliness between PRP and stage 2 of P-1.

I'm currently running 4 PRP tests. According to top, I'm using just 2.1 MB RAM
[CODE]%Cpu(s): 94.1/1.3 95[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Mem : 2.1/385610.4 [|| ]
MiB Swap: 7.6/2048.0 [|||||||| [/CODE][U]Can that be true? It seems very low![/U] The L3 cache on the CPUs is just over 35 MB each, so I would have expected this to remain in cache. Contrast that with the tens or hundreds of GB of RAM used in stage 2 of P-1, and it appears to me at least that they have quite different characteristics. In my own case, neither CPU has enough local RAM, so the RAM attached to both CPUs must be used, and accesses to the 2nd CPU's RAM are slower, so I would have thought there was a serious memory-access penalty for P-1. My RAM is only clocked at 2.4 GHz, but I think some gaming PCs are probably doing around twice that. It seems to me there are a lot of variables.

Anyway, when I have a bit of time I will test these out, and convince myself one way or the other.

 axn 2021-07-23 14:28

I believe the current first-time tests use either a 5.5M or 6M FFT. That means each FFT consumes either 44MB or 48MB, plus some ancillary memory for lookup tables. If you're running two per CPU, that is ~100MB per CPU. The 2MB figure you're seeing is a big fat lie.

Obviously 100MB is not going to run out of 35MB of L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1, and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also needs to access a lot more memory in short order, so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart, so it won't work -- we need an unbroken run). Divide the number of transforms by the runtime to find how many transforms/sec were done in each stage. That will give you the relative inefficiency of stage 2 vs stage 1.
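That check takes only a few lines to script; the transform counts and runtimes below are hypothetical placeholders, not values from any real run:

```python
# Hypothetical numbers: substitute the "number of transforms" lines that
# P95 prints at the end of each unbroken P-1 stage, and each stage's runtime.
stage1_transforms, stage1_seconds = 2_700_000, 4200.0
stage2_transforms, stage2_seconds = 8_400_000, 16800.0

s1_rate = stage1_transforms / stage1_seconds   # transforms/sec, stage 1
s2_rate = stage2_transforms / stage2_seconds   # transforms/sec, stage 2
print(s2_rate / s1_rate)   # < 1.0 quantifies stage 2's efficiency loss
```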

 drkirkby 2021-07-23 16:30

[QUOTE=axn;583825]I believe the current first time tests use either 5.5M or 6M FFT. That mean it either consumes 44MB or 48MB per FFT. Plus some ancillary memory for some lookup tables. If you're running two per CPU, that is 100MB per CPU. The 2MB figure you're seeing is a big fat lie. [/QUOTE]The FFT you quote is right. My tests at the minute seem to be using 5600 K, which is 5.5 M.
[QUOTE=axn;583825]
Obviously 100MB is not going to run out of 35MB L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1 and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also need to access a lot more memory in short order so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart so it won't work -- we need an unbroken run). Divide the the number of transforms by the runtime to find how many transforms/sec in each stage. That will give you the relative inefficiency of stage2 vs stage1.[/QUOTE]The efficiency loss seems quite significant. Looking at a recent result for the stage 1 and stage 2 transforms, there were 648.699 transforms/s in stage 1 and 380.096 transforms/s in stage 2, so stage 2 ran at about 58.6% of the stage 1 rate. It will be interesting to look at some data with the RAM constrained, since maybe if the RAM is constrained to 128 GB or so, there will not be such a penalty in fetching data from RAM. Anything above 192 GB is going to need to access RAM attached to both CPUs, which I would expect to be slower than if the RAM is local to one CPU. That would be worth experimenting with.

 kriesel 2021-07-23 16:36

Let's see: Linux top apparently lying to drkirkby at 2.1 MiB.
As a check, assume an exponent size of ~104M, which in packed binary would require up to 104000000/8 = 13 MB for each packed multiprecision binary integer value mod Mp, per worker; 13 * 4 = 52, so top's figure is more than an order of magnitude too small.

5.5 Mi fft size * 8 B / word = 44 MiB per worker (~46 MB).
Inner loops will fit in L3 cache but apparently the outermost won't, even if workers were reduced to 2 total, one per cpu package.
Four workers * 44 MiB = 176 MiB at least expected.

2 big CPUs with 35 MiB L3 each, divided among 4 mprime workers:
2 * 35 MiB / 4 ~ 17.5 MiB available per mprime worker.

On Win10, a prime95 one-worker instance running 105M PRP occupies ~267 MB at "5600K" fft length (~5.5Mi?). 267/46 ~ 5.8. Later, still running, 197 MB; paused, ~7 MB.

Maybe the mprime workers were paused at the time the 2.1 MiB figure was obtained by drkirkby? Pausing prime95 drastically reduced ram usage. Pause is not consistent with 94% CPU, though. I've seen other utilities apparently misrepresent large ram usage mod some large 2[SUP]n[/SUP]. (GPU-Z at some version, IIRC.)

 axn 2021-07-23 16:39

[QUOTE=drkirkby;583829]Looking at a recent result for stage 1 and stage 2 transforms, there are 648.699 transforms/s in stage 1, and 380.096 transforms/s in stage 2, so stage 2 is about 58.6% of stage 1.[/quote]
That is unusually low!
[QUOTE=drkirkby;583829]That would be worth experimenting with.[/QUOTE]
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.

 drkirkby 2021-07-23 17:27

[QUOTE=axn;583831]That is unusually low!
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.[/QUOTE]I've tried to use numactl, but I am not convinced it is achieving what I want. As I allocate cores to a process and look at the temperature of the cores, the ones I believe I'm putting the process on are no warmer than the others. A system monitor that shows 102 CPUs (including hyperthreading) seems to indicate that different CPUs (really cores) are randomly busy. mprime indicates it is mapping threads to cores (see below), but I'm not convinced this is happening.
[CODE][Worker #1 Jul 23 16:04] Setting affinity to run worker on CPU core #1
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #8
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #9
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #10
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #11
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #12
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #13
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #7
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #6
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #5
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #4
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #3
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #2
[Worker #2 Jul 23 16:04] Setting affinity to run worker on CPU core #14
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #21
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #22
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #23
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #24
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #25
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #26
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #20
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #19
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #18
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #17
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #16
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #15
[Worker #3 Jul 23 16:04] Setting affinity to run worker on CPU core #27
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #34
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #35
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #36
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #37
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #38
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #39
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #33
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #32
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #31
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #30
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #29
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #28
[Worker #4 Jul 23 16:04] Setting affinity to run worker on CPU core #40
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #47
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #48
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #49
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #50
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #51
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #52
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #46
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #45
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #44
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #43
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #42
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #41
[/CODE]

 drkirkby 2021-07-23 17:52

[QUOTE=kriesel;583830]Let's see, Linux top apparently lying to drkirkby at 2.1 MiB;
[/QUOTE]I have found the problem - the 2.1 was [B]percent[/B]! I am running top now while mprime reports
[CODE][Worker #1 Jul 23 17:10] Using 311330MB of memory.
[/CODE]yet top shows 82.7. With 384 GB (393216 MB) of RAM, and mprime reporting 311330 MB in use for P-1, that's 100*311330/393216 = 79.2% of the RAM. Given that mprime uses a little more RAM than it reports, other processes use some, and the operating system has overhead, 82.7% seems quite reasonable. Further down, top shows the mprime process itself using 81.0% of memory.
[CODE]top - 18:38:14 up 6 days, 45 min, 7 users, load average: 49.87, 49.93, 50.31
Tasks: 888 total, 1 running, 887 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 1.4 sy, 94.2 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 82.7/385610.4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Swap: 7.6/2048.0 [|||||||| ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 635539 drkirkby  30  10  308.9g 305.0g   7464 S  4961  [B]81.0[/B]  7540:01 mprime
 637133 drkirkby  20   0   21360   5036   3436 S   1.3   0.0   0:01.39 top
 637124 drkirkby  20   0   21360   4780   3436 R   1.0   0.0   0:01.60 top
  34729 drkirkby  20   0 5881712 292588  47492 S   0.7   0.1  40:02.49 WolframKernel[/CODE]
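
As a sanity check on the arithmetic above (all numbers copied from the post), a one-liner reproduces the 79.2% figure:

```shell
# mprime's reported P-1 allocation as a share of 384 GB (393216 MB)
awk 'BEGIN { printf "%.1f%%\n", 100 * 311330 / 393216 }'
# prints 79.2%
```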

 axn 2021-07-24 03:39

[QUOTE=drkirkby;583834]I've tried to use numactl, but I am not convinced it is achieving what I want. [/QUOTE]

I think I've figured out what needs to be done to get this working. If you've already tried these then I'm out of ideas.

We need to set up two folders, each with its own copies of local.txt, prime.txt and worktodo.txt. The executable (and libraries) itself does not need to be copied. We'll run 2 workers per instance and 13 threads per worker.
In each worktodo.txt there should be two sections (Worker 1, Worker 2), and similarly in each local.txt.

In the local.txt files, we'll add Affinity lines.

local.txt #1, Worker #1
[CODE]Affinity=(0,26),(1-12,27-38)[/CODE]

local.txt #1, Worker #2
[CODE]Affinity=(13,39),(14-25,40-51)[/CODE]

local.txt #2, Worker #1
[CODE]Affinity=(52,78),(53-64,79-90)[/CODE]

local.txt #2, Worker #2
[CODE]Affinity=(65,91),(66-77,92-103)[/CODE]

Finally, we can run:
[CODE]numactl -m 0 -N 0 ./mprime -d -wfolder0
numactl -m 1 -N 1 ./mprime -d -wfolder1[/CODE]

where folder0 and folder1 would be the folders you created.
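
The steps above can be sketched as a script. The folder names, and the assumption that the config files live alongside the binary, are mine; the numactl commands are echoed as a dry run so they can be checked before launching anything:

```shell
# Sketch: one folder per NUMA node, each with its own config files.
# The mprime binary itself is shared between the two instances.
for node in 0 1; do
    dir="folder$node"
    mkdir -p "$dir"
    for f in local.txt prime.txt worktodo.txt; do
        # then edit each copy's Affinity lines as described above
        if [ -f "$f" ]; then cp "$f" "$dir/"; fi
    done
    # Bind both memory (-m) and CPUs (-N) to one NUMA node
    echo numactl -m "$node" -N "$node" ./mprime -d -w"$dir"
done
```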

 axn 2021-07-24 05:05

On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- [c]lstopo-no-graphics --no-io[/c]

 drkirkby 2021-07-24 10:57

[QUOTE=axn;583855]On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- [c]lstopo-no-graphics --no-io[/c][/QUOTE]
Sure. I've not read the man page on this, so don't know what it is doing, but the very first line is wrong: 12 x 32 GB = 384 GB, not 377 GB.
[CODE]drkirkby@canary:~\$ lstopo-no-graphics
Machine (377GB total)
Package L#0
NUMANode L#0 (P#0 188GB)
L3 L#0 (36MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
HostBridge
PCI 00:11.5 (SATA)
PCI 00:16.2 (IDE)
PCI 00:17.0 (RAID)
Block(Disk) "sdb"
Block(Disk) "sda"
PCIBridge
PCI 02:00.0 (Ethernet)
Net "enp2s0"
PCIBridge
PCI 03:00.0 (Ethernet)
Net "enp3s0f0"
PCI 03:00.1 (Ethernet)
Net "enp3s0f1"
PCI 00:1f.6 (Ethernet)
Net "enp0s31f6"
HostBridge
PCI 44:05.5 (RAID)
Block(Disk) "nvme0n1"
HostBridge
PCIBridge
PCI 73:00.0 (VGA)
Package L#1
NUMANode L#1 (P#1 189GB)
L3 L#1 (36MB)
L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
L2 L#40 (1024KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
L2 L#41 (1024KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
L2 L#42 (1024KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
L2 L#43 (1024KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
L2 L#44 (1024KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
L2 L#45 (1024KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
L2 L#46 (1024KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
L2 L#47 (1024KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
L2 L#48 (1024KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
L2 L#49 (1024KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
L2 L#50 (1024KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
L2 L#51 (1024KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
HostBridge
PCI d1:05.5 (RAID)
Block(Disk) "nvme1n1"
drkirkby@canary:~\$[/CODE]

 axn 2021-07-24 12:52

Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores.

Taking that at face value, the affinities should be:

local.txt #1, Worker #1
[CODE]Affinity=0,1,2,3,4,5,6,7,8,9,10,11,12[/CODE]

local.txt #1, Worker #2
[CODE]Affinity=13,14,15,16,17,18,19,20,21,22,23,24,25[/CODE]

local.txt #2, Worker #1
[CODE]Affinity=26,27,28,29,30,31,32,33,34,35,36,37,38[/CODE]

local.txt #2, Worker #2
[CODE]Affinity=39,40,41,42,43,44,45,46,47,48,49,50,51[/CODE]
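
Since each of those four lists is just 13 consecutive core numbers, they can be generated rather than typed by hand. A small sketch (the Worker labels are mine):

```shell
# Emit the Affinity list for each of 4 workers x 13 cores (cores 0-51)
for w in 0 1 2 3; do
    lo=$((w * 13))
    hi=$((lo + 12))
    echo "Worker #$((w + 1)): Affinity=$(seq -s, "$lo" "$hi")"
done
```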

 drkirkby 2021-07-24 14:36

[QUOTE=axn;583863]Is the HT turned off (or is there no HT)? It is only reporting 1 logical processor per core and a total of 52 logical cores.[/QUOTE]
It's probably turned off. I turned it off in an attempt to make it easier to get the workers onto the cores I wanted, as I thought HT was just making life more difficult. I thought I had turned it back on again, but I must have overlooked that. The output from

[CODE]numactl -H[/CODE] is in this thread.

[url]https://www.mersenneforum.org/showpost.php?p=583289&postcount=35[/url]
That shows cpus numbered 0 to 103.
So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.

 axn 2021-07-24 16:05

[QUOTE=drkirkby;583866]So do I need to use numactl or not now? Of course, whilst 4 workers is optimal with one process running, it may well not be if two processes are running. I'll give that a try.[/QUOTE]

The HT thing was just an observation. It does materially affect the Affinity setting, though. So, should you decide to turn on HT, it will need a different set of values for Affinity.

We still need numactl to evaluate the impact of stage 2 with local versus non-local memory. If it turns out that using only half the total memory, but all of it local to the node, makes stage 2 much faster, then that is the way to go.

Hopefully with the numactl and Affinity settings, we'll be able to run two instances of P95 with local memory and fully utilizing the cores. If that does work, I have no doubt that you'll get the best performance. It may or may not be significantly better than running a single instance, but that's what we want to find out.

 drkirkby 2021-07-24 17:47

(kriesel: Caution, next post indicates there was an undisclosed error affecting this post.)

I tried what you said, but performance was not that great. Then I tried running one process with 2 workers, with Affinity set as you said, and benchmarking another process. The benchmarking was tried with 24-26 cores and 2-4 workers.

[CODE]
[Worker #1 Jul 24 16:57] Timing 5760K FFT, 26 cores, 4 workers. Average times: 8.29, 7.12, 6.24, 5.37 ms. [B]Total throughput: 607.58 iter/sec[/B].[/CODE]Since 26 is not divisible by 4, there must clearly be an unequal number of cores assigned to each worker.

The 607.58 iter/sec is [U]almost double[/U] the throughput one obtains running 4 workers on each of two processes, where the processes are [B]not constrained in any way[/B]. Here are the results from running two benchmarks, where nothing is constrained.

[CODE][Worker #1 Jul 24 18:14] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 13.27, 11.03, 13.31, 11.08 ms. Total throughput: 331.31 iter/sec.
[/CODE]and
[CODE][Worker #1 Jul 24 18:15] Timing 5760K FFT, 26 cores, 4 workers. Average times: 12.80, 11.30, 12.78, 11.42 ms. Total throughput: 332.36 iter/sec.[/CODE]Total throughput is a dismal 331.31+332.36=663.67 iter/sec.

One does better running one process
[CODE][Worker #1 Jul 24 18:22] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Jul 24 18:22] Timing 5760K FFT, 52 cores, 4 workers. Average times: 3.85, 3.84, 3.86, 3.86 ms. Total throughput: 1038.20 iter/sec.
[/CODE]I suppose the next thing to try is to run two processes, each with 4 workers. I guess 2x6+2x7=26 cores would be a reasonable split. It would be nice to think I could get a total throughput of 2*607.58 = 1215.16 iter/sec, but somehow I doubt that will happen.

 drkirkby 2021-07-24 20:49

Ignore my last post (at 2021-07-24 17:47.) I found an error.

 axn 2021-07-25 16:36

[QUOTE=drkirkby;583892]Ignore my last post (at 2021-07-24 17:47.) I found an error.[/QUOTE]

Did you correct the errors from your last trial?

 drkirkby 2021-07-25 16:50

[QUOTE=axn;583960]Did you correct the errors from your last trial?[/QUOTE]Not yet. I want to finish some benchmarking I've been doing on one exponent for various values of tests_saved and RAM on P-1 factoring. Then I will once again look into optimising things.
