![]() |
|
|
#12 | |
|
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts |
Quote:
|
|
|
|
|
|
|
#13 |
|
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
254738 Posts |
|
|
|
|
|
|
#14
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts
Prime95 is not NUMA-aware. Perhaps you need to run two instances of prime95 with some kind of OS command instructing each prime95 instance to allocate memory from different memory banks. I've no idea what that OS command would be in Windows.
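On Linux, at least, `numactl` can do exactly this kind of per-instance binding. A minimal sketch, assuming two separate mprime working directories (the directory names `p95-a`/`p95-b` are hypothetical):

```shell
# Bind each instance's threads AND memory allocations to one NUMA
# node (one physical CPU package and its local memory bank).
numactl --cpunodebind=0 --membind=0 ./p95-a/mprime -d &
numactl --cpunodebind=1 --membind=1 ./p95-b/mprime -d &
```

With `--membind`, an allocation that cannot be satisfied from the named node fails outright rather than silently spilling over to the remote node.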
#15
Jun 2003
1558₁₆ Posts
I believe he uses Ubuntu (or some flavor of Linux), so taskset (along with the Affinity setting in mprime) should work.
Last fiddled with by axn on 2021-07-09 at 14:54
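For example, a sketch only (the core ranges and instance directories are hypothetical; how the kernel numbers the two packages varies, so check `lscpu` first):

```shell
# Pin one mprime instance to each physical package. On a dual
# 26-core machine the kernel might number package 0's cores 0-25
# and package 1's cores 26-51; verify with: lscpu --extended
taskset -c 0-25  ./instance0/mprime -d &
taskset -c 26-51 ./instance1/mprime -d &
```

Note that taskset restricts only where the threads run; memory placement then follows from Linux's default first-touch policy, which normally allocates on the node where the allocating thread happens to be running.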
#16
"David Kirkby"
Jan 2021
Althorne, Essex, UK
458₁₀ Posts
I intend swapping out the 8167Ms at a later date for higher-performance CPUs when the prices fall. Currently the fast Gold or Platinum CPUs with a lot of cores are very expensive, whereas the 8167M offers a lot of bang for the buck. If you only want the performance for GIMPS, a fast graphics card might be a better bet. Their prices are currently well above the manufacturers' recommended retail prices, but they are falling a lot now.
#17
Einyen
Dec 2003
Denmark
3452₁₀ Posts
I'm not sure you can call this 12-channel RAM; it is 2 CPUs with 6 channels each. You should definitely run different tests on each physical CPU, so that each test gets its own 6-channel RAM.
But if you run a single test across both CPUs, I do not think that test benefits from 12-channel RAM. I could easily be wrong, though; I'm not familiar with this modern hardware.
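Whether the OS actually presents the machine as two separate 6-channel nodes can be checked directly from sysfs on any Linux box (a quick sketch; `numactl --hardware` from the numactl package shows the same information plus the inter-node distance matrix):

```shell
# Which NUMA nodes does the kernel know about?
cat /sys/devices/system/node/possible

# Which CPUs belong to each node? A dual-socket machine should
# show two nodes, each owning one package's cores.
for n in /sys/devices/system/node/node[0-9]*; do
    echo "$n: cpus=$(cat "$n/cpulist")"
done
```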
#18
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1E8F₁₆ Posts
Prime95 deals well with dual-package systems, in my opinion. I've run a single instance, and analyzed and posted prime95 benchmarks on a variety of single- and dual-package systems (up to dual 12-core, but no dual 26-core beasts), versus number of workers, FFT length, and HT vs. not; see the attachments of https://www.mersenneforum.org/showpo...18&postcount=4
https://www.mersenneforum.org/showpo...19&postcount=5 and https://www.mersenneforum.org/showpo...4&postcount=11
#19
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2×229 Posts
I will have to run some more benchmarks, but I have some real work to do at the weekend. There's a rather important football match taking place on Sunday too.
Last fiddled with by drkirkby on 2021-07-09 at 19:58
#20
Nov 2019
1000010₂ Posts
When I was trying to mine Monero on my quad-CPU system, the mining software would occasionally error out with a page fault until I started running one instance per CPU.

For P-1 factoring work, I am running 4 instances so that each CPU gets its own pool of memory to allocate to its own workers (again avoiding "foreign" memory access). The server does not let me give each CPU its own name, though, so the resulting stats are wonky.

Edit: I think the [tables] on [pages 1 and 3 are] supposed to be showing the penalty for "foreign" memory access [under the "Straddles chips?" (yes) heading]. Bolded numbers on the left-hand side appear to be the "best" times. The percentages appear to be the approximate reduction in performance for each setting.

Last fiddled with by phillipsjk on 2021-07-10 at 15:27 Reason: fixed wording.
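The one-instance-per-CPU setup described above can be scripted. A sketch, assuming one working directory per node (the directory names `mprime-node0`, `mprime-node1`, … are hypothetical):

```shell
#!/bin/sh
# Count NUMA nodes from numactl's "available: N nodes (...)" line,
# then start one mprime instance per node, each bound to that
# node's cores and memory so no worker makes "foreign" accesses.
nodes=$(numactl --hardware | awk '/^available:/ {print $2}')
n=0
while [ "$n" -lt "$nodes" ]; do
    (cd "mprime-node$n" && exec numactl --cpunodebind="$n" --membind="$n" ./mprime -d) &
    n=$((n + 1))
done
wait
```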
#21
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
Benchmarking such things as 3 workers on a dual-package system in a single instance gives slower total throughput. George has stated in the past that prime95 segregates the threads of a worker onto a single CPU, not straddling dual packages, for example. (I don't know how that squares with "not NUMA-aware".) That would put two workers onto one CPU, and leave an entire CPU for one of the 3 workers. And benchmark results, IIRC, were consistent with that. If it did not do that, some threads and cores of a worker would be distant from others, with possible consequent performance loss.

There are some practical issues with attempting to benchmark with more than one prime95 instance. Desynchronization of fft lengths and subcases between the instances is one that comes to mind.

On Windows, specifying NUMA Node in the start command is more of a recommendation the OS is permitted to deviate from than a definite mandatory specification. From Windows 10's "start /?" command help output:
Code:
NODE Specifies the preferred Non-Uniform Memory Architecture (NUMA)
node as a decimal integer.
AFFINITY Specifies the processor affinity mask as a hexadecimal number.
The process is restricted to running on these processors.
The affinity mask is interpreted differently when /AFFINITY and
/NODE are combined. Specify the affinity mask as if the NUMA
node's processor mask is right shifted to begin at bit zero.
The process is restricted to running on those processors in
common between the specified affinity mask and the NUMA node.
If no processors are in common, the process is restricted to
running on the specified NUMA node.
Specifying /NODE allows processes to be created in a way that leverages memory
locality on NUMA systems. For example, two processes that communicate with
each other heavily through shared memory can be created to share the same
preferred NUMA node in order to minimize memory latencies. They allocate
memory from the same NUMA node when possible, and they are free to run on
processors outside the specified node.
start /NODE 1 application1.exe
start /NODE 1 application2.exe
These two processes can be further constrained to run on specific processors
within the same NUMA node. In the following example, application1 runs on the
low-order two processors of the node, while application2 runs on the next two
processors of the node. This example assumes the specified node has at least
four logical processors. Note that the node number can be changed to any valid
node number for that computer without having to change the affinity mask.
start /NODE 1 /AFFINITY 0x3 application1.exe
start /NODE 1 /AFFINITY 0xc application2.exe
In practice, I find with single-instance multi-package prime95 benchmarking that number of workers = n × number of packages benchmarks best for total throughput, with n a small integer that changes with fft length.

Throughput-optimal parameters are not always entirely practical. It does little good to tune to a high worker count for the last few percent of throughput if the primality-test assignments expire before completion and the progress is wasted. (Especially now with PRP and proof, where a first test is not followed by a full double check.) Latency less than expiration time is a constraint. At small fft lengths, the Xeon Phi 7250 benchmarks best nominal total throughput with dozens of workers, but latency is an issue.

Quite a while ago, Madpoo posted results for a different case: optimizing for latency of primality testing a single exponent, such as for verifying a new prime discovery, where efficiency is less important than speed, on a dual-package system (~dual 18-core?). Max speed on a single exponent was around an entire CPU package plus ~6 cores of the other package, with the rest of the cores in the second package left idle; adding more cores from package two slowed it.

Last fiddled with by kriesel on 2021-07-10 at 16:55
#22
Jun 2003
2³·683 Posts
This shouldn't be much of an issue. In theory, if you do this right, neither instance will have any impact on the other one, since they won't be sharing any resources (cores/cache/RAM). So it wouldn't matter if the benchmarks don't exactly sync up.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Assigning too much memory slows down P-1 stage 2? | ZFR | Software | 11 | 2020-12-13 10:19 |
| Allow mprime to use more memory | ZFR | Software | 1 | 2020-12-10 09:50 |
| Mini ITX with LGA 2011 (4 memory channels) | bgbeuning | Hardware | 7 | 2016-06-18 10:32 |
| mprime checking available memory | tha | Software | 7 | 2015-12-07 15:56 |
| Cheesy memory slows down prime95? | nomadicus | Hardware | 9 | 2003-03-01 00:15 |