mersenneforum.org  

Old 2021-07-09, 13:59   #12
fivemack
Quote:
Originally Posted by drkirkby
There's 35.75 MB of L3 cache per CPU. The 2400 MHz is a limitation of the CPU - other CPUs in the Xeon Gold or Platinum range run the RAM up to 2933 MHz, but they are quite expensive CPUs, whereas these CPUs are quite cheap. I've benchmarked more workers (I tried 1, 2, 3 .. 52), but 4 workers gives optimal throughput.
Are these the same CPUs as https://www.ebay.co.uk/itm/154497112899 ? I was expecting there to be a catch, if they work well I'll pick myself up a pair. I've got a Supermicro Skylake system which has 4114s in it at the moment.
Old 2021-07-09, 14:38   #13
Uncwilly
Quote:
Originally Posted by drkirkby
Unsurprisingly reducing the number of cores per worker from 26 to 13 increased the iteration time further.
On VBCurtis's behalf:
2
Old 2021-07-09, 14:42   #14
Prime95

Prime95 is not NUMA-aware. Perhaps you need to run two instances of prime95 with some kind of OS command instructing each prime95 instance to allocate memory from different memory banks. I've no idea what that OS command would be in Windows.
Old 2021-07-09, 14:54   #15
axn
 

Quote:
Originally Posted by Prime95
Prime95 is not NUMA-aware. Perhaps you need to run two instances of prime95 with some kind of OS command instructing each prime95 instance to allocate memory from different memory banks. I've no idea what that OS command would be in Windows.
I believe he uses Ubuntu (or some flavor of Linux), so taskset (along with the Affinity setting in mprime) should work.
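As a sketch of that suggestion for a dual 26-core box like this one (the CPU numbering is an assumption - verify the package-to-CPU mapping with lscpu - and the commands are only printed here, since the mprime paths are placeholders):

```shell
# Hypothetical layout: package 0 = CPUs 0-25, package 1 = CPUs 26-51.
# Build one hex affinity mask per package, then show the taskset command
# each mprime instance would be launched with.
mask0=$(printf '0x%x' $(( (1 << 26) - 1 )))          # bits 0-25 set
mask1=$(printf '0x%x' $(( ((1 << 26) - 1) << 26 )))  # bits 26-51 set
echo "taskset $mask0 ./mprime -d"
echo "taskset $mask1 ./mprime -d"
```

On a NUMA system, numactl --cpunodebind=N --membind=N is an alternative worth considering, since unlike taskset it also pins memory allocations to the local node.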

Old 2021-07-09, 15:00   #16
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

1BF16 Posts
Default

Quote:
Originally Posted by fivemack
Are these the same CPUs as https://www.ebay.co.uk/itm/154497112899 ? I was expecting there to be a catch, if they work well I'll pick myself up a pair. I've got a Supermicro Skylake system which has 4114s in it at the moment.
Yes, they are the same CPUs, although I paid less than that - I paid £300 each. They gave a massive improvement in performance over a single Silver 4110, for not a huge outlay. If you want, I can dig out the email address of the seller I bought them from; you might be able to get a better deal than on eBay. PM me if you want. The single-threaded performance of the 8167M is pretty poor according to PassMark, but you get a lot of cores for the money. If you don't mind spending a bit more, the 8171M appears to offer a lot more performance. The 8171M will not work in mainstream machines from Dell, IBM, Lenovo etc., but there's a good chance it would work on your Supermicro motherboard.

I intend to swap out the 8167Ms at a later date for higher-performance CPUs when prices fall. Currently the fast Gold or Platinum CPUs with a lot of cores are very expensive, whereas the 8167M offers a lot of bang for the buck.

If you only want the performance for GIMPS, a fast graphics card might be a better bet. Graphics-card prices are currently well above the manufacturers' recommended retail prices, but they are falling a lot now.
Old 2021-07-09, 17:44   #17
ATH

I'm not sure you can call this 12-channel RAM. It is 2 CPUs with 6 channels each. You should definitely run separate tests on each physical CPU, so that each test gets its own 6-channel RAM.
But if you run 1 single test across both CPUs, I do not think that test benefits from 12-channel RAM - though I could easily be wrong; I'm not familiar with this modern hardware.
Old 2021-07-09, 17:57   #18
kriesel
 

Prime95 deals well with dual-package systems, in my opinion. I've run in a single instance, analyzed and posted prime95 benchmarks on a variety of single- and dual-package systems (up to dual 12-core, but no dual 26-core beasts) versus number of workers, FFT length, and HT vs. not; see the attachments of https://www.mersenneforum.org/showpo...18&postcount=4
https://www.mersenneforum.org/showpo...19&postcount=5
https://www.mersenneforum.org/showpo...4&postcount=11
Old 2021-07-09, 19:55   #19
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

6778 Posts
Default

Quote:
Originally Posted by ATH
I'm not sure you can call this 12-channel RAM. It is 2 CPUs with 6 channels each. You should definitely run separate tests on each physical CPU, so that each test gets its own 6-channel RAM.
But if you run 1 single test across both CPUs, I do not think that test benefits from 12-channel RAM - though I could easily be wrong; I'm not familiar with this modern hardware.
I never did call it 12-channel RAM - I just wrote that 12 memory channels were in use. As you say, it's dual CPUs, with each CPU having 6 memory channels.

I will have to run some more benchmarks, but I have some real work to do this weekend. There's a rather important football match taking place on Sunday too.

Old 2021-07-10, 15:05   #20
phillipsjk
 

Quote:
Originally Posted by kriesel
Prime95 deals well with dual-package systems, in my opinion. I've run in a single instance, analyzed and posted prime95 benchmarks on a variety of single- and dual-package systems (up to dual 12-core, but no dual 26-core beasts) versus number of workers, FFT length, and HT vs. not; see the attachments of https://www.mersenneforum.org/showpo...18&postcount=4
https://www.mersenneforum.org/showpo...19&postcount=5
https://www.mersenneforum.org/showpo...4&postcount=11
I looked at "dual-12-core%20e5-2697v2%20roa%20performance.pdf", and it does not mention running two instances, each locked to a specific CPU (using the Worker affinity setting in undocumented.txt), vs. one instance possibly accessing ["foreign" memory].

When I was trying to mine Monero on my quad-CPU system, the mining software would occasionally error out with a page fault until I started running 1 instance per CPU.

For P-1 factoring work, I am running 4 instances so that each CPU gets its own pool of memory to allocate to its own workers (again avoiding "foreign" memory access). The server does not let me give each CPU its own name, though, so the resulting stats are wonky.

Edit: I think the [tables] on [pages 1 and 3 are] supposed to be showing the penalty for "foreign" memory access [under the "Straddles chips?" (yes) heading]. Bolded numbers on the left-hand side appear to be the "best" times. The percentages appear to be the approximate reduction in performance for each setting.
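The per-CPU setup described above could be sketched on Linux with numactl, one instance per node (node numbering and the per-instance directories are assumptions - check numactl --hardware; the commands are only printed here rather than run):

```shell
# One mprime instance per NUMA node: --cpunodebind pins the worker threads
# to that package's cores and --membind makes its allocations come from
# that package's local RAM, avoiding "foreign" memory access.
for node in 0 1 2 3; do
  echo "numactl --cpunodebind=$node --membind=$node ./mprime$node/mprime -d"
done
```

Each instance keeps its own working directory (local.txt, worktodo.txt), which is what produces the separate per-instance stats mentioned above.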

Old 2021-07-10, 16:39   #21
kriesel
 

Quote:
Originally Posted by phillipsjk
I looked at "dual-12-core%20e5-2697v2%20roa%20performance.pdf", and it does not mention running two instances, each locked to a specific CPU (using the Worker affinity setting in undocumented.txt), vs. one instance possibly accessing ["foreign" memory].
You're right, it does not mention what was not done or attempted. All the benchmarking was done in a single instance, as previously posted. Also it presumes using all cores is best for throughput, and did not benchmark reduced-core-count cases.

Benchmarking such things as 3 workers on a dual-package system in a single instance gives slower total throughput. George has stated in the past that prime95 segregates the threads of a worker onto a single CPU, not straddling the two packages, for example. (I don't know how that squares with "not NUMA-aware".) That would put two workers onto one CPU and leave an entire CPU for one of the 3 workers, and the benchmark results, IIRC, were consistent with that. If it did not do that, some threads & cores of a worker would be distant from the others, with a possible consequent performance loss.

There are some practical issues with attempting to benchmark with more than one prime95 instance. Desynchronization of fft lengths and subcases between the instances is one that comes to mind.
On Windows, specifying NUMA Node in the start command is more of a recommendation the OS is permitted to deviate from, than a definite mandatory specification. From Windows 10's "start /?" command help output:
Code:
    NODE        Specifies the preferred Non-Uniform Memory Architecture (NUMA)
                node as a decimal integer.
    AFFINITY    Specifies the processor affinity mask as a hexadecimal number.
                The process is restricted to running on these processors.

                The affinity mask is interpreted differently when /AFFINITY and
                /NODE are combined.  Specify the affinity mask as if the NUMA
                node's processor mask is right shifted to begin at bit zero.
                The process is restricted to running on those processors in
                common between the specified affinity mask and the NUMA node.
                If no processors are in common, the process is restricted to
                running on the specified NUMA node.

Specifying /NODE allows processes to be created in a way that leverages memory
locality on NUMA systems.  For example, two processes that communicate with
each other heavily through shared memory can be created to share the same
preferred NUMA node in order to minimize memory latencies.  They allocate
memory from the same NUMA node when possible, and they are free to run on
processors outside the specified node.

    start /NODE 1 application1.exe
    start /NODE 1 application2.exe

These two processes can be further constrained to run on specific processors
within the same NUMA node.  In the following example, application1 runs on the
low-order two processors of the node, while application2 runs on the next two
processors of the node.  This example assumes the specified node has at least
four logical processors.  Note that the node number can be changed to any valid
node number for that computer without having to change the affinity mask.

    start /NODE 1 /AFFINITY 0x3 application1.exe
    start /NODE 1 /AFFINITY 0xc application2.exe
So presumably testing with a prime95 instance per package would require specifying affinity to all of node1's cores on one instance, and to all of node2's cores on the second, if that is possible.
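As a concrete sketch of that, assuming dual 26-core packages with hyperthreading off (so each node's mask covers bits 0-25; the executable name is a placeholder), the mask and start commands could be built like this:

```shell
# Per the help text above, /AFFINITY is interpreted relative to the chosen
# NUMA node's own processor mask, so the same full-package mask serves
# both nodes. Assumes 26 cores per package, hyperthreading disabled.
cores_per_pkg=26
mask=$(printf '0x%x' $(( (1 << cores_per_pkg) - 1 )))
echo "start /NODE 0 /AFFINITY $mask prime95.exe"
echo "start /NODE 1 /AFFINITY $mask prime95.exe"
```

With 26 cores per package the mask comes out to 0x3ffffff; since the help text says /NODE alone is only a preference, combining it with /AFFINITY is what actually restricts each instance to its package's processors.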

In practice, I find with single-instance multi-package prime95 benchmarking that number of workers = n × number of packages benchmarks best for total throughput, with n a small integer that changes with FFT length.
Throughput-optimal parameters are not always entirely practical. It does little good to tune to a high worker count for the last few % of throughput if the primality-test assignments expire before completion and the progress is wasted. (Especially now with PRP & proof, where a first test is not followed by a full double check.) Latency less than the expiration time is a constraint. At small FFT lengths, the Xeon Phi 7250 benchmarks best nominal total throughput with dozens of workers, but latency is an issue.

Quite a while ago, Madpoo posted results for a different case: optimizing for the latency of primality-testing a single exponent, such as when verifying a new prime discovery, where efficiency is less important than speed, on a dual-package system (~dual 18-core?). Max speed on a single exponent came at around an entire CPU package plus ~6 cores of the other package, with the rest of the second package's cores left idle; adding more cores from package two slowed it down.

Old 2021-07-10, 17:17   #22
axn
 

Quote:
Originally Posted by kriesel
There are some practical issues with attempting to benchmark with more than one prime95 instance. Desynchronization of fft lengths and subcases between the instances is one that comes to mind.
This shouldn't be much of an issue. In theory, if you do this right, neither instance will have any impact on the other, since they won't be sharing any resources (cores/cache/RAM). So it wouldn't matter if the benchmarks don't exactly sync up.