mersenneforum.org  

Old 2021-07-23, 09:48   #23
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

110111111₂ Posts

Quote:
Originally Posted by axn View Post
I do not believe the server has been adjusted to give the correct GHzDay credit for the improved P-1 Stage 2 implemented from 30.4 onwards. I believe it still assumes that the P-1 has been done by the older algorithm and credits accordingly. Hence do not draw any hard and fast conclusions based on those numbers. Ideally, due to the improved stage 2, the credit should be suitably discounted.

However, the optimality calculations done by the program itself _do_ take into account the specifics of the new algo. TL;DR don't trust the GHzDay numbers.
Thank you. However, it seems to me that the actual ratio of runtime for the PRP test divided by the runtime for the P-1 test is going to depend upon the hardware and what that hardware is doing. I would have thought the bounds calculated for P-1 would have been determined by some sort of average over a number of different hardware configurations, so benchmarking one's actual hardware might be beneficial. (There's also the very real possibility that benchmarking one's own hardware uses more CPU time than one might gain from the benchmarking. If one spends days benchmarking something that manages to save 15 minutes per exponent, it is not really worth it, given the results are likely to change with exponent size.)
Old 2021-07-23, 12:14   #24
axn
 
Jun 2003

19·271 Posts

Quote:
Originally Posted by drkirkby View Post
Thank you. However, it seems to me that the actual ratio of runtime for the PRP test divided by the runtime for the P-1 test, is going to depend upon the hardware and what that hardware is doing.
Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success). However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplications using a given FFT. So do P-1 stage 1 and stage 2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and the probability of success (as detailed in https://www.mersenne.org/various/math.php#p-1_factoring). The only variable here is the memory allocation, which affects how many stage 2 multiplications are needed. The hardware is not very relevant (because the same FFT computation is used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplications because stage 2 is somewhat less cache friendly. Conceivably, hardware with a higher memory-bandwidth-to-CPU ratio might have an edge over hardware with a lower bandwidth-to-CPU ratio. But this is not factored into the bounds calculation (AFAIK).
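
The comparison described above can be sketched as a toy cost model. This is only an illustration and not Prime95's actual code: the function names, the `tests_saved` parameter, and the example numbers are all assumptions, and the 1.2 stage 2 weight is simply the uncertain figure quoted above.

```python
def p1_cost(stage1_mults, stage2_mults, stage2_weight=1.2):
    """Cost of a P-1 run, in units of one regular FFT multiplication.

    stage2_weight is the fudge factor for stage 2 being less cache friendly
    (assumed to be 1.2 here, as quoted in the post).
    """
    return stage1_mults + stage2_weight * stage2_mults


def p1_expected_saving(prob_factor, prp_mults, tests_saved=1.0):
    """Expected number of primality-test multiplications avoided.

    tests_saved is a hypothetical knob for how many tests a factor saves.
    """
    return prob_factor * tests_saved * prp_mults


def p1_worthwhile(prob_factor, prp_mults, stage1_mults, stage2_mults,
                  stage2_weight=1.2, tests_saved=1.0):
    """True if the expected saving exceeds the cost of the P-1 run."""
    saving = p1_expected_saving(prob_factor, prp_mults, tests_saved)
    cost = p1_cost(stage1_mults, stage2_mults, stage2_weight)
    return saving > cost


# Hypothetical inputs, chosen only to exercise the functions.
prp_mults = 105_000_000  # a PRP test needs roughly one squaring per exponent bit
print(p1_worthwhile(0.04, prp_mults, 1_500_000, 1_000_000))
```

Because both sides are counted in multiplications of the same FFT size, the speed of the hardware cancels out; only the stage 2 weight could differ from machine to machine.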
Old 2021-07-23, 14:05   #25
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

1BF₁₆ Posts

Quote:
Originally Posted by axn View Post
Ratio of PRP runtime to P-1 runtime is not a meaningful figure (without also mentioning the bounds used / probability of success).
I realise that. But given the probability of success, and the times for P-1 and PRP, one can see whether or not one is maximising this function:
https://www.mersenne.org/various/math.php#p-1_factoring
which is ultimately what matters for maximum chance of finding a prime.

How accurate are the estimates of finding a factor for P-1?
Quote:
Originally Posted by axn View Post
However, for a given exponent, RAM allocation, and bounds, the ratio should be pretty stable across different hardware and worker/thread combinations. Explanation follows...

The P-1 bounds selection uses a hardware-neutral costing function. The theory is simple -- PRP uses a series of multiplications using a given FFT. So do P-1 stage 1 and stage 2. Hence all you need to do is compare the number of multiplications used in PRP vs how many multiplications are expected for P-1 (for a given B1/B2/memory allocation) and the probability of success (as detailed in https://www.mersenne.org/various/math.php#p-1_factoring). The only variable here is the memory allocation, which affects how many stage 2 multiplications are needed. The hardware is not very relevant (because the same FFT computation is used on both sides).

Having said that, the calculation uses a fudge factor, where a stage 2 multiplication counts as 1.2 (?) regular multiplications because stage 2 is somewhat less cache friendly. Conceivably, hardware with a higher memory-bandwidth-to-CPU ratio might have an edge over hardware with a lower bandwidth-to-CPU ratio. But this is not factored into the bounds calculation (AFAIK).
You surprise me with some of those comments. I would have thought there to be a big difference in cache friendliness between PRP and stage 2 of P-1.

I'm currently running 4 PRP tests. According to top, I'm using just 2.1 MB RAM
Code:
%Cpu(s):  94.1/1.3    95[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||     ]
MiB Mem :  2.1/385610.4 [||                                                                                                  ]
MiB Swap:  7.6/2048.0   [||||||||
Can that be true? It seems very low! The L3 cache on the CPUs is just over 35 MB each, so I would have expected this to remain in the cache. Contrast that with the tens or hundreds of GB of RAM used in stage 2 of P-1, and it appears to me at least that they have quite different characteristics. In my own case, neither CPU has enough local RAM, so the RAM from both CPUs must be used, and accesses to RAM attached to the 2nd CPU are slower, so I would have thought there was a serious cache penalty for P-1. My RAM is only clocked at 2.4 GHz, but I think some gaming PCs would probably be doing around twice that. It seems to me there are a lot of variables.

Anyway, when I have a bit of time I will test these out, and convince myself one way or the other.
Old 2021-07-23, 14:28   #26
axn
 
Jun 2003

19·271 Posts

I believe the current first time tests use either 5.5M or 6M FFT. That means each FFT consumes either 44MB or 48MB, plus some ancillary memory for lookup tables. If you're running two per CPU, that is about 100MB per CPU. The 2MB figure you're seeing is a big fat lie.

Obviously 100MB is not going to run out of 35MB L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1 and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also needs to access a lot more memory in short order, so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart so it won't work -- we need an unbroken run). Divide the number of transforms by the runtime to find how many transforms/sec in each stage. That will give you the relative inefficiency of stage 2 vs stage 1.

Last fiddled with by axn on 2021-07-23 at 14:28
Old 2021-07-23, 16:30   #27
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3·149 Posts

Quote:
Originally Posted by axn View Post
I believe the current first time tests use either 5.5M or 6M FFT. That means each FFT consumes either 44MB or 48MB, plus some ancillary memory for lookup tables. If you're running two per CPU, that is about 100MB per CPU. The 2MB figure you're seeing is a big fat lie.
The FFT you quote is right. My tests at the minute seem to be using 5600 K, which is about 5.5 M.
Quote:
Originally Posted by axn View Post
Obviously 100MB is not going to run out of 35MB L3 cache. So your performance will be very much dependent on RAM bandwidth (and somewhat on RAM latency). Thus PRP, P-1 stage 1 and P-1 stage 2 are all very much dependent on RAM bandwidth. It is just that stage 2 also needs to access a lot more memory in short order, so there is some efficiency loss. You can check this yourself. At the end of both stages, P95 will print the "number of transforms" (if you have stopped/restarted in the middle, the count will be from the restart so it won't work -- we need an unbroken run). Divide the number of transforms by the runtime to find how many transforms/sec in each stage. That will give you the relative inefficiency of stage 2 vs stage 1.
The efficiency loss seems quite significant. Looking at a recent result for stage 1 and stage 2 transforms, there are 648.699 transforms/s in stage 1, and 380.096 transforms/s in stage 2, so stage 2 is about 58.6% of stage 1. It will be interesting to look at some data with the RAM constrained, since maybe if the RAM is constrained to 128 GB or so, there will not be such a penalty in fetching data from the RAM. Anything above 192 GB is going to need to access RAM attached to both CPUs, so I would expect that would be slower than if the RAM is local to one CPU. That would be worth experimenting with.
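
As a worked check of the figures above, here is a minimal sketch of the division. The function names are made up for illustration; the two rates are the ones quoted in this post. The "implied weight" is just the reciprocal of the relative speed, shown for comparison with the 1.2 fudge factor mentioned earlier in the thread.

```python
def transforms_per_second(transforms, seconds):
    """Rate for one stage: transform count divided by unbroken runtime."""
    return transforms / seconds


def stage2_relative_speed(stage1_rate, stage2_rate):
    """Stage 2 rate as a fraction of the stage 1 rate."""
    return stage2_rate / stage1_rate


stage1_rate = 648.699  # transforms/s in stage 1, from the post
stage2_rate = 380.096  # transforms/s in stage 2, from the post

relative = stage2_relative_speed(stage1_rate, stage2_rate)
implied_weight = 1.0 / relative  # cost of a stage 2 transform in stage 1 units

print(f"stage 2 runs at {relative:.1%} of the stage 1 rate")
print(f"implied stage 2 weight: {implied_weight:.2f}")
```

On these numbers a stage 2 transform costs about 1.71 stage 1 transforms on this machine, noticeably more than 1.2.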
Old 2021-07-23, 16:36   #28
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1011010010110₂ Posts

Let's see, Linux top apparently lying to drkirkby at 2.1 MiB;
assume as a check, exponent size ~104M, which in packed binary would require up to 104000000/8 = 13 MB for each packed multiprecision binary integer value mod Mp, per worker; 13 * 4 = 52 MB, so top's figure is more than an order of magnitude too small;

5.5 Mi fft size * 8 B / word = 44 MiB per worker (~46 MB).
Inner loops will fit in L3 cache but apparently the outermost won't, even if workers reduced to 2 total, one per cpu package.
Four workers * 44 MiB = 176 MiB at least expected.

2 big CPUs with 35MiB L3 each, divided among 4 mprime workers,
2 * 35MiB / 4 ~ 17.5 MiB available per mprime worker.

On Win10, a prime95 one-worker instance running 105M PRP occupies ~267 MB at "5600K" fft length (~5.5Mi?). 267/46 ~ 5.8. Later, still running, 197 MB; paused, ~7 MB.

Maybe the mprime workers were paused at the time drkirkby obtained the 2.1 MiB figure? Pausing prime95 drastically reduced RAM usage. A pause is not consistent with 94% CPU though. I've seen other utilities apparently misrepresent large RAM usage mod some large 2^n. (GPU-Z at some version IIRC.)
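
The arithmetic in this post can be reproduced in a few lines. This is a sketch only: the variable names are invented, and the inputs (5.5 Mi FFT length, 8 bytes per word, 4 workers, two 35 MiB L3 caches, a ~104M exponent) are the figures assumed in the post.

```python
MIB = 1024 ** 2  # bytes in one MiB
MB = 1000 ** 2   # bytes in one MB

fft_words = int(5.5 * MIB)  # 5.5 Mi FFT length
bytes_per_word = 8          # one double per FFT word
workers = 4

fft_bytes = fft_words * bytes_per_word
fft_mib = fft_bytes / MIB            # per-worker FFT size in MiB
fft_mb = fft_bytes / MB              # the same size in MB
total_fft_mib = workers * fft_mib    # minimum expected across all workers

l3_per_worker_mib = 2 * 35 / workers  # two 35 MiB caches shared by 4 workers

packed_residue_mb = 104_000_000 / 8 / MB  # one packed value mod Mp

print(f"FFT per worker: {fft_mib:.0f} MiB (~{fft_mb:.0f} MB)")
print(f"FFT total:      {total_fft_mib:.0f} MiB")
print(f"L3 per worker:  {l3_per_worker_mib:.1f} MiB")
print(f"Packed residue: {packed_residue_mb:.0f} MB per worker")
```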
Old 2021-07-23, 16:39   #29
axn
 
Jun 2003

19·271 Posts

Quote:
Originally Posted by drkirkby View Post
Looking at a recent result for stage 1 and stage 2 transforms, there are 648.699 transforms/s in stage 1, and 380.096 transforms/s in stage 2, so stage 2 is about 58.6% of stage 1.
That is unusually low!
Quote:
Originally Posted by drkirkby View Post
That would be worth experimenting with.
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.
Old 2021-07-23, 17:27   #30
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

3·149 Posts

Quote:
Originally Posted by axn View Post
That is unusually low!
Indeed! But you would also need numactl & two instances to guarantee proper RAM mapping.
I've tried to use numactl, but I am not convinced it is achieving what I want. As I allocate cores to a process and look at the temperature of the cores, the ones I believe I'm putting the process on are no warmer than the others. A system monitor that shows 102 CPUs (includes hyperthreading) seems to indicate that different CPUs (really cores) are randomly busy. mprime indicates it is mapping threads to cores (see below), but I'm not convinced this is happening.
Code:
[Worker #1 Jul 23 16:04] Setting affinity to run worker on CPU core #1
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #8
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #9
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #10
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #11
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #12
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #13
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #7
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #6
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #5
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #4
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #3
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #2
[Worker #2 Jul 23 16:04] Setting affinity to run worker on CPU core #14
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #21
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #22
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #23
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #24
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #25
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #26
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #20
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #19
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #18
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #17
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #16
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #15
[Worker #3 Jul 23 16:04] Setting affinity to run worker on CPU core #27
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #34
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #35
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #36
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #37
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #38
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #39
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #33
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #32
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #31
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #30
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #29
[Worker #3 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #28
[Worker #4 Jul 23 16:04] Setting affinity to run worker on CPU core #40
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 7 on CPU core #47
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 8 on CPU core #48
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 9 on CPU core #49
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 10 on CPU core #50
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 11 on CPU core #51
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 12 on CPU core #52
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 6 on CPU core #46
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 5 on CPU core #45
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 4 on CPU core #44
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 3 on CPU core #43
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 2 on CPU core #42
[Worker #4 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #41
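
A log like the one above can be summarised with a short script. This is a minimal sketch: the regular expression is written against the line format shown here, the helper names are hypothetical, and the sample is only four lines copied from the log. Note that it checks what mprime says it requested, not where the kernel actually ran the threads.

```python
import re
from collections import defaultdict

# Matches the two line shapes seen in the log: "worker" and "helper thread N".
LINE_RE = re.compile(
    r"\[Worker #(\d+) [^\]]*\] Setting affinity to run "
    r"(?:worker|helper thread \d+) on CPU core #(\d+)"
)


def cores_by_worker(log_text):
    """Map each worker number to the set of core numbers it was assigned."""
    cores = defaultdict(set)
    for match in LINE_RE.finditer(log_text):
        cores[int(match.group(1))].add(int(match.group(2)))
    return dict(cores)


def overlapping_cores(mapping):
    """Return cores that appear under more than one worker."""
    seen = set()
    clashes = set()
    for assigned in mapping.values():
        clashes |= seen & assigned
        seen |= assigned
    return clashes


sample = """\
[Worker #1 Jul 23 16:04] Setting affinity to run worker on CPU core #1
[Worker #1 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #2
[Worker #2 Jul 23 16:04] Setting affinity to run worker on CPU core #14
[Worker #2 Jul 23 16:04] Setting affinity to run helper thread 1 on CPU core #15
"""

mapping = cores_by_worker(sample)
print(mapping)
print("cores shared between workers:", overlapping_cores(mapping))
```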

Last fiddled with by drkirkby on 2021-07-23 at 17:28
Old 2021-07-23, 17:52   #31
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

447₁₀ Posts

Quote:
Originally Posted by kriesel View Post
Let's see, Linux top apparently lying to drkirkby at 2.1 MiB;
I have found the problem - the 2.1 was percent! I'm just running top now as mprime says
Code:
[Worker #1 Jul 23 17:10] Using 311330MB of memory.
yet top is showing 82.7. With 384 GB (393216 MB) of RAM, and mprime indicating it is using 311330 MB for P-1, that's 100*311330/393216=79.2% of the RAM. Given mprime is using a bit more RAM, other processes are using a bit of RAM, the operating system has some overhead, 82.7% seems quite reasonable. The rest of top shows mprime using 81.0% of memory.
Code:
top - 18:38:14 up 6 days, 45 min,  7 users,  load average: 49.87, 49.93, 50.31
Tasks: 888 total,   1 running, 887 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.4 sy, 94.2 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 82.7/385610.4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                 ]
MiB Swap:  7.6/2048.0   [||||||||                                                                                            ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 635539 drkirkby  30  10  308.9g 305.0g   7464 S  4961  81.0   7540:01 mprime
 637133 drkirkby  20   0   21360   5036   3436 S   1.3   0.0   0:01.39 top
 637124 drkirkby  20   0   21360   4780   3436 R   1.0   0.0   0:01.60 top
  34729 drkirkby  20   0 5881712 292588  47492 S   0.7   0.1  40:02.49 WolframKernel
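
The percentages above can be cross-checked with a short calculation. This is a sketch using only figures from this post (311330 MB requested by mprime, 384 GB installed, 385610.4 MiB total reported by top, 305.0g resident); the variable names are made up.

```python
installed_mb = 384 * 1024   # 384 GB of RAM expressed in MB, as in the post
mprime_request_mb = 311330  # "Using 311330MB of memory" from mprime

pct_requested = 100 * mprime_request_mb / installed_mb

top_total_mib = 385610.4    # total memory reported by top
mprime_res_gib = 305.0      # RES column for mprime

pct_resident = 100 * mprime_res_gib * 1024 / top_total_mib

print(f"requested by mprime: {pct_requested:.1f}% of installed RAM")
print(f"resident per top:    {pct_resident:.1f}% of usable RAM")
```

The second figure matches the 81.0 in top's %MEM column for mprime, which supports reading the 82.7 on the "MiB Mem" line as a percentage rather than an amount.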

Last fiddled with by drkirkby on 2021-07-23 at 18:21
Old 2021-07-24, 03:39   #32
axn
 
Jun 2003

19×271 Posts

Quote:
Originally Posted by drkirkby View Post
I've tried to use numactl, but I am not convinced it is achieving what I want.
I think I've figured out what needs to be done to get this working. If you've already tried these then I'm out of ideas.

We need to set up two folders with different copies of local.txt, prime.txt and worktodo.txt. The executable (and libraries) themselves do not need to be copied. We'll do 2 workers each and 13 threads per worker.
In each worktodo.txt, there should be two sections (Worker 1, Worker 2). The same goes for each local.txt.


In the local.txt files, we'll add Affinity lines.

local.txt #1, Worker #1
Affinity=(0,26),(1-12,27-38)

local.txt #1, Worker #2
Affinity=(13,39),(14-25,40-51)


local.txt #2, Worker #1
Affinity=(52,78),(53-64,79-90)

local.txt #2, Worker #2
Affinity=(65,91),(66-77,92-103)

Finally, we can do:
numactl -m 0 -N 0 ./mprime -d -wfolder0
numactl -m 1 -N 1 ./mprime -d -wfolder1

where folder0 and folder1 would be the folders you created.
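
The four Affinity lines above follow a regular pattern, which can be sketched as below. Everything here is an assumption for illustration: the helper name is invented, and the numbering scheme (26 cores per socket, logical CPUs n and n+26 being hyperthread siblings, the second socket starting at 52) is just the layout these lines imply. It should be checked against the machine's real topology before use.

```python
def affinity_line(first_core, threads, socket, cores_per_socket=26):
    """Build an Affinity= line for one worker under the assumed numbering.

    first_core: index of the worker's first core within its socket
    threads:    cores used by the worker (1 main thread + helpers)
    socket:     0 or 1; each socket is assumed to own 2 * cores_per_socket
                consecutive logical CPU numbers
    """
    base = socket * 2 * cores_per_socket
    lo = base + first_core
    hi = lo + threads - 1
    sibling = cores_per_socket  # assumed offset of the hyperthread sibling
    main = f"({lo},{lo + sibling})"
    helpers = f"({lo + 1}-{hi},{lo + 1 + sibling}-{hi + sibling})"
    return f"Affinity={main},{helpers}"


# Two sockets, two workers per socket, 13 threads per worker.
for socket in (0, 1):
    for worker, first_core in enumerate((0, 13), start=1):
        line = affinity_line(first_core, 13, socket)
        print(f"local.txt #{socket + 1}, Worker #{worker}: {line}")
```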
Old 2021-07-24, 05:05   #33
axn
 
Jun 2003

19·271 Posts

On further experimenting, the affinity numbers might be all wrong. Can you do a lstopo-no-graphics and post the output?

EDIT:- lstopo-no-graphics --no-io

Last fiddled with by axn on 2021-07-24 at 05:08

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.