mersenneforum.org

What determine if P-1 factoring is used? (https://www.mersenneforum.org/showthread.php?t=26849)

drkirkby 2021-07-23 09:48

[QUOTE=axn;583812]I do not believe the server has been adjusted to give the correct GHzDay credit for the improved P-1 Stage 2 implemented from 30.4 onwards. I believe it still assumes that the P-1 has been done by the older algorithm and credits accordingly. Hence do not draw any hard and fast conclusions based on those numbers. Ideally, due to the improved stage 2, the credit should be suitably discounted.

However, the optimality calculations done by the program itself _do_ take into account the specifics of the new algo. TL;DR don't trust the GHzDay numbers.[/QUOTE]Thank you. However, it seems to me that the actual ratio of runtime for the PRP test divided by the runtime for the P-1 test is going to depend upon the hardware and what that hardware is doing. I would have thought the bounds calculated for P-1 would have been determined by some sort of average over a number of different hardware configurations, and so benchmarking one's actual hardware might be beneficial. (There's also the [B]very[/B] real possibility that benchmarking one's own hardware uses more CPU time than one might gain from the benchmarking. If one spends days benchmarking something that manages to save 15 minutes per exponent, it is not really worth it, given the results are likely to change with exponent size.)

axn 2021-07-23 12:14

[QUOTE=drkirkby;583814]Thank you. However, it seems to me that the actual ratio of runtime for the PRP test divided by the runtime for the P-1 test, is going to depend upon the hardware and what that hardware is doing.[/QUOTE]

Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success). However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplications using a given FFT. So do P-1 stage 1 and stage 2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and the probability of finding a factor (as detailed in [url]https://www.mersenne.org/various/math.php#p-1_factoring[/url]). The only variable here is the memory allocation, which affects how many stage 2 multiplications are needed. The hardware is not very relevant (because the same FFT computation is used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplications because stage 2 is somewhat less cache friendly. Conceivably, hardware with a higher memory-bandwidth-to-CPU ratio might have an edge over hardware with a lower bandwidth-to-CPU ratio. But this is not factored into the bounds calculation (AFAIK).
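The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Prime95 uses: the function names, the `tests_saved` parameter and all the sample numbers are hypothetical, and the 1.2 stage 2 weight is the uncertain figure quoted in the post.

```python
def p1_cost_in_multiplications(stage1_mults, stage2_mults, stage2_weight=1.2):
    """Cost of a P-1 run expressed in 'regular' FFT multiplications.

    stage2_weight is the fudge factor for the less cache-friendly stage 2
    (1.2 is the value recalled in the post, not a confirmed constant).
    """
    return stage1_mults + stage2_weight * stage2_mults


def p1_expected_saving(prp_mults, stage1_mults, stage2_mults,
                       factor_probability, tests_saved=1.0, stage2_weight=1.2):
    """Expected multiplications saved by running P-1 before the PRP test.

    A positive result means P-1 with these bounds is expected to pay for itself.
    """
    benefit = factor_probability * tests_saved * prp_mults
    cost = p1_cost_in_multiplications(stage1_mults, stage2_mults, stage2_weight)
    return benefit - cost


# Hypothetical inputs, for illustration only.
saving = p1_expected_saving(
    prp_mults=104_000_000,      # a PRP test needs roughly one squaring per exponent bit
    stage1_mults=1_000_000,     # assumed stage 1 multiplication count
    stage2_mults=1_500_000,     # assumed stage 2 multiplication count
    factor_probability=0.04,    # assumed chance that P-1 finds a factor
)
print(f"expected saving: {saving:,.0f} multiplications")
```

Because every term is a count of multiplications at the same FFT size, the hardware speed cancels out; only the stage 2 weight could differ between machines.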

drkirkby 2021-07-23 14:05

[QUOTE=axn;583819]Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success).[/QUOTE]I realise that. But given the probability of success, and the times for P-1 and PRP, one can see if one is maximising this function or not
[URL]https://www.mersenne.org/various/math.php#p-1_factoring[/URL]
which is ultimately what matters for maximum chance of finding a prime.

How accurate are the estimates of finding a factor for P-1?
[QUOTE=axn;583819]
However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplications using a given FFT. So do P-1 stage 1 and stage 2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and the probability of finding a factor (as detailed in [URL]https://www.mersenne.org/various/math.php#p-1_factoring[/URL]). The only variable here is the memory allocation, which affects how many stage 2 multiplications are needed. The hardware is not very relevant (because the same FFT computation is used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplications because stage 2 is somewhat less cache friendly. Conceivably, hardware with a higher memory-bandwidth-to-CPU ratio might have an edge over hardware with a lower bandwidth-to-CPU ratio. But this is not factored into the bounds calculation (AFAIK).[/QUOTE]You surprise me with some of those comments. I would have thought there to be a [B]big[/B] difference in cache friendliness between PRP and stage 2 of P-1.

I'm currently running 4 PRP tests. According to top, I'm using just 2.1 MB of RAM:
[CODE]%Cpu(s): 94.1/1.3 95[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Mem : 2.1/385610.4 [|| ]
MiB Swap: 7.6/2048.0 [|||||||| [/CODE][U]Can that be true - it seems very low![/U] The L3 cache on the CPUs is just over 35 MB each, so I would have expected this to remain in the cache. Contrast that with the tens or hundreds of GB of RAM used in stage 2 of P-1, and it appears to me at least that they have quite different characteristics. In my own case, neither CPU has enough local RAM, so the RAM from both CPUs must be used, and accesses to the 2nd CPU's RAM are slower, so I would have thought there was a serious cache penalty for P-1. My RAM is only clocked at 2.4 GHz, but I think some gaming PCs would probably be doing around twice that. It seems to me there are a lot of variables.

Anyway, when I have a bit of time I will test these out, and convince myself one way or the other.

axn 2021-07-23 14:28

I believe the current first-time tests use either a 5.5M or a 6M FFT. That means it consumes either 44MB or 48MB per FFT, plus some ancillary memory for lookup tables. If you're running two per CPU, that is 100MB per CPU. The 2MB figure you're seeing is a big fat lie.

Obviously 100MB is not going to run out of a 35MB L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1 and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also needs to access a lot more memory in short order, so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart so it won't work -- we need an unbroken run). Divide the number of transforms by the runtime to find how many transforms/sec each stage achieves. That will give you the relative inefficiency of stage 2 vs stage 1.

drkirkby 2021-07-23 16:30

[QUOTE=axn;583825]I believe the current first-time tests use either a 5.5M or a 6M FFT. That means it consumes either 44MB or 48MB per FFT, plus some ancillary memory for lookup tables. If you're running two per CPU, that is 100MB per CPU. The 2MB figure you're seeing is a big fat lie. [/QUOTE]The FFT you quote is right. My tests at the minute seem to be using 5600 K, which is 5.5 M.
[QUOTE=axn;583825]
Obviously 100MB is not going to run out of a 35MB L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1 and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also needs to access a lot more memory in short order, so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart so it won't work -- we need an unbroken run). Divide the number of transforms by the runtime to find how many transforms/sec each stage achieves. That will give you the relative inefficiency of stage 2 vs stage 1.[/QUOTE]The efficiency loss seems quite significant. Looking at a recent result for stage 1 and stage 2 transforms, there are 648.699 transforms/s in stage 1, and 380.096 transforms/s in stage 2, so stage 2 runs at about 58.6% of the stage 1 rate. It will be interesting to look at some data with the RAM constrained, since if the RAM is constrained to 128 GB or so, there may not be such a penalty in fetching data from the RAM. Anything above 192 GB is going to need to access RAM attached to both CPUs, so I would expect that to be slower than if the RAM is local to one CPU. That would be worth experimenting with.
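The check suggested in the quote comes down to two divisions. A minimal Python sketch, with hypothetical function names and using the two rates reported in this post:

```python
def transforms_per_second(transforms, seconds):
    """Throughput of one P-1 stage, from the transform count and runtime
    of an unbroken run (no stop/restart in the middle)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return transforms / seconds


def relative_efficiency(stage2_rate, stage1_rate):
    """Stage 2 throughput as a fraction of stage 1 throughput."""
    return stage2_rate / stage1_rate


# Rates reported in this post (transforms per second).
stage1_rate = 648.699
stage2_rate = 380.096
ratio = relative_efficiency(stage2_rate, stage1_rate)
print(f"stage 2 runs at {100 * ratio:.1f}% of the stage 1 rate")  # 58.6%
```

Repeating the same calculation with different memory limits would show whether the penalty shrinks when stage 2 stays within one CPU's local RAM.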

kriesel 2021-07-23 16:36

Let's see, Linux top apparently lying to drkirkby at 2.1 MiB;
assume as a check, exponent size ~104M which in packed binary would require up to 104000000/8 = 13 MB for each packed multiprecision binary integer value mod Mp, per worker; 13 * 4 = 52 MB, so top's figure is more than an order of magnitude too small;

5.5 Mi fft size * 8 B / word = 44 MiB per worker (~46 MB).
Inner loops will fit in L3 cache but apparently the outermost won't, even if workers reduced to 2 total, one per cpu package.
Four workers * 44 MiB = 176 MiB at least expected.

2 big CPUs with 35MiB L3 each, divided among 4 mprime workers,
2 * 35MiB / 4 ~ 17.5 MiB available per mprime worker.
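The arithmetic above is easy to reproduce. A small Python sketch, assuming 8 bytes per FFT word as stated; the helper names are hypothetical, and it ignores the ancillary lookup tables mentioned earlier in the thread:

```python
MIB = 1024 * 1024


def fft_buffer_bytes(fft_len_mi, bytes_per_word=8):
    """Size of one FFT data buffer for an FFT length given in Mi words."""
    return int(fft_len_mi * MIB * bytes_per_word)


def packed_residue_bytes(exponent):
    """Bytes needed to store one residue mod 2^exponent - 1 in packed binary."""
    return (exponent + 7) // 8


workers = 4
per_worker = fft_buffer_bytes(5.5)
print(f"FFT buffer per worker: {per_worker / MIB:.0f} MiB ({per_worker / 1e6:.1f} MB)")
print(f"FFT buffers, {workers} workers: {workers * per_worker / MIB:.0f} MiB")
print(f"packed residue, p = 104000000: {packed_residue_bytes(104_000_000) / 1e6:.0f} MB")
print(f"L3 share per worker: {2 * 35 / workers} MiB")
```

Even the smallest of these figures is well above the 2.1 reading, and the per-worker FFT buffer alone is more than twice the per-worker L3 share.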

On Win10, a prime95 one-worker instance running a 105M PRP occupies ~267 MB at "5600K" FFT length (~5.5Mi?). 267/46 ~5.8. Later, still running, 197 MB; paused, ~7 MB.

Maybe the mprime workers were paused at the time the 2.1MiB figure was obtained by drkirkby? Pausing prime95 drastically reduced RAM usage. A pause is not consistent with 94% CPU though. I've seen other utilities apparently misrepresent large RAM usage mod some large 2[SUP]n[/SUP]. (GPU-Z at some version IIRC.)

axn 2021-07-23 16:39

[QUOTE=drkirkby;583829]Looking at a recent result for stage 1 and stage 2 transforms, there are 648.699 transforms/s in stage 1, and 380.096 transforms/s in stage 2, so stage 2 is about 58.6% of stage 1.[/quote]
That is unusually low!
[QUOTE=drkirkby;583829]That would be worth experimenting with.[/QUOTE]
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.

drkirkby 2021-07-23 17:27

[QUOTE=axn;583831]That is unusually low!
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.[/QUOTE]I've tried to use numactl, but I am not convinced it is achieving what I want. When I allocate cores to a process and look at the temperature of the cores, the ones I believe I'm putting the process on are no warmer than the others. A system monitor that shows 102 CPUs (including hyperthreads) seems to indicate that different CPUs (really cores) are randomly busy. mprime indicates it is mapping threads to cores (see below), but I'm not convinced this is happening.
[CODE][Worker #1 Jul 23 16:04] Setting affinity to run worker on CPU core #1
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #8
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #9
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #10
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #11
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #12
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #13
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #7
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #6
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #5
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #4
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #3
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #2
[Worker #2 Jul 23 16:04] Setting affinity to run worker on CPU core #14
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #21
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #22
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #23
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #24
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #25
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #26
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #20
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #19
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #18
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #17
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #16
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #15
[Worker #3 Jul 23 16:04] Setting affinity to run worker on CPU core #27
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #34
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #35
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #36
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #37
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #38
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #39
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #33
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #32
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #31
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #30
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #29
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #28
[Worker #4 Jul 23 16:04] Setting affinity to run worker on CPU core #40
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #47
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #48
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #49
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #50
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #51
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #52
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #46
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #45
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #44
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #43
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #42
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #41
[/CODE]
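One way to check whether the affinity settings in a log like this actually took effect is to ask the kernel which logical CPUs each thread is allowed to run on. A minimal Linux-only sketch in Python; the function name is hypothetical, and it relies on `/proc/<pid>/task` listing the thread IDs and on `os.sched_getaffinity` accepting a thread ID, which is how the underlying Linux call behaves:

```python
import os


def thread_affinities(pid):
    """Map each thread ID of process `pid` to the set of logical CPUs
    the kernel allows it to run on (Linux only)."""
    result = {}
    for name in os.listdir(f"/proc/{pid}/task"):
        tid = int(name)
        try:
            result[tid] = os.sched_getaffinity(tid)
        except ProcessLookupError:
            # The thread exited between listing and querying it.
            continue
    return result


# Demonstration on this Python process; substitute the PID of mprime
# (for example the PID shown by top) to inspect its worker threads.
for tid, cpus in sorted(thread_affinities(os.getpid()).items()):
    print(tid, sorted(cpus))
```

If every thread reports the full set of CPUs, the affinity requests were not applied. Note that the numbers printed are the kernel's logical CPU numbers, which need not match the core numbers mprime prints.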

drkirkby 2021-07-23 17:52

[QUOTE=kriesel;583830]Let's see, Linux top apparently lying to drkirkby at 2.1 MiB;
[/QUOTE]I have found the problem - the 2.1 was [B]percent[/B]! I'm just running top now as mprime says
[CODE][Worker #1 Jul 23 17:10] Using 311330MB of memory.
[/CODE]yet top is showing 82.7. With 384 GB (393216 MB) of RAM, and mprime indicating it is using 311330 MB for P-1, that's 100*311330/393216=79.2% of the RAM. Given mprime is using a bit more RAM, other processes are using a bit of RAM, the operating system has some overhead, 82.7% seems quite reasonable. The rest of top shows mprime using 81.0% of memory.
[CODE]top - 18:38:14 up 6 days, 45 min, 7 users, load average: 49.87, 49.93, 50.31
Tasks: 888 total, 1 running, 887 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 1.4 sy, 94.2 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 82.7/385610.4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
MiB Swap: 7.6/2048.0 [|||||||| ]

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
635539 drkirkby 30 10 308.9g 305.0g 7464 S 4961 [B]81.0[/B] 7540:01 mprime
637133 drkirkby 20 0 21360 5036 3436 S 1.3 0.0 0:01.39 top
637124 drkirkby 20 0 21360 4780 3436 R 1.0 0.0 0:01.60 top
34729 drkirkby 20 0 5881712 292588 47492 S 0.7 0.1 40:02.49 WolframKernel[/CODE]
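The percentages can be cross-checked against the figures in the output above. A short Python sketch, assuming that the "g" suffix in top's RES column means GiB and that %MEM is RES divided by the "MiB Mem" total:

```python
def percent(part, total):
    """part as a percentage of total."""
    return 100.0 * part / total


mprime_reported_mb = 311330        # "Using 311330MB of memory." from mprime
installed_mb = 384 * 1024          # 384 GB of RAM = 393216 MB
res_mib = 305.0 * 1024             # RES column for mprime, assumed to be GiB
total_mib = 385610.4               # total from top's "MiB Mem" line

print(f"mprime allocation vs installed RAM: {percent(mprime_reported_mb, installed_mb):.1f}%")  # 79.2%
print(f"RES vs total seen by top:           {percent(res_mib, total_mib):.1f}%")               # 81.0%
```

The second figure matches the 81.0 in the %MEM column, which supports reading the earlier 2.1 as a percentage too.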

axn 2021-07-24 03:39

[QUOTE=drkirkby;583834]I've tried to use numactl, but I am not convinced it is achieving what I want. [/QUOTE]

I think I've figured out what needs to be done to get this working. If you've already tried these then I'm out of ideas.

We need to set up two folders with different copies of local.txt, prime.txt and worktodo.txt. The executable (and libraries) themselves do not need to be copied. We'll do 2 workers in each instance and 13 threads per worker.
In each worktodo.txt, there should be two sections (Worker 1, Worker 2). Similarly in each local.txt as well.


In the local.txt files, we'll add Affinity lines.

local.txt #1, Worker #1
Affinity=(0,26),(1-12,27-38)

local.txt #1, Worker #2
Affinity=(13,39),(14-25,40-51)


local.txt #2, Worker #1
Affinity=(52,78),(53-64,79-90)

local.txt #2, Worker #2
Affinity=(65,91),(66-77,92-103)

Finally, we can do:
numactl -m 0 -N 0 ./mprime -d -wfolder0
numactl -m 1 -N 1 ./mprime -d -wfolder1

where folder0 and folder1 would be the folders you created.

axn 2021-07-24 05:05

On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- [c]lstopo-no-graphics --no-io[/c]
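If lstopo is not installed, much the same information can be read from Linux sysfs. A minimal Python sketch, assuming the standard `/sys/devices/system/cpu/cpuN/topology` files `physical_package_id` and `core_id`; the function names are hypothetical:

```python
from pathlib import Path


def read_topology(root="/sys/devices/system/cpu"):
    """Return {logical_cpu: (package_id, core_id)} read from Linux sysfs."""
    topology = {}
    for cpu_dir in Path(root).glob("cpu[0-9]*"):
        topo = cpu_dir / "topology"
        if not topo.is_dir():
            continue  # e.g. an offline CPU without topology information
        cpu = int(cpu_dir.name[3:])
        package = int((topo / "physical_package_id").read_text())
        core = int((topo / "core_id").read_text())
        topology[cpu] = (package, core)
    return topology


def cpus_by_core(topology):
    """Group logical CPUs sharing a physical core: {(package, core): [cpus]}."""
    groups = {}
    for cpu, key in sorted(topology.items()):
        groups.setdefault(key, []).append(cpu)
    return groups


for (package, core), cpus in sorted(cpus_by_core(read_topology()).items()):
    print(f"package {package} core {core}: logical CPUs {cpus}")
```

The output shows which logical CPU numbers are hyperthread siblings and which package they belong to, which is what is needed to check the numbers in the Affinity lines.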


All times are UTC.