mersenneforum.org  

Old 2022-09-25, 18:27   #727
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
8164₁₀ Posts

Quote:
Originally Posted by kriesel View Post
undoc.txt says this about memory in P-1 stage 2:
(There's nothing there about the upper limit or modifying it.)
1. Is there a way to allow up to ~60 GiB on a 64 GiB system?
The 90% limit is for the GUI. Editing local.txt manually can work around the 90% limit.

Quote:
2. Is there a way to ensure a worker's memory access & allocation remains entirely or mostly on the same side of the NUMA boundary as the worker's CPU cores on a multi-Xeon system?
Prime95 has no understanding of NUMA. Your only option is to run two instances of prime95. Use a Windows tool to force each instance to run on a different NUMA node. In local.txt set NumCPUs=8. Let us know if you find a method that works well.
Old 2022-09-26, 06:12   #728
preda
"Mihai Preda"
Apr 2015
5·17² Posts

Quote:
Originally Posted by Prime95 View Post
Some debugging reveals prime95 is looking for a benchmark with all 16 cores used. Thus, run a throughput benchmark for 16 cores, 1 worker, all FFT implementations, 6M to 7M fft sizes. Let me know if that does the trick.

Auto bench done every 21(?) hours until there are several data points. I'm looking into why it is running 13 core benchmarks when it only uses 16 core bench results (a bug).

Benchmarks are not uploaded. They are not particularly useful to others given all the combinations of overclocking, memory speeds, etc.
I ran a benchmark with all 14 cores, and indeed it seems to pick up the FFT bench timings afterwards, although the run configuration is 1 worker/12 cores.

The benchmark asks for the number of cores to use, and before I was giving it a list of what I was actually using (i.e. 12 cores, 13 cores), not all cores (14).
Old 2022-09-26, 08:47   #729
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE
1,327 Posts

What does the heading in results.bench.txt say for you? E.g.:
Code:
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
AMD Ryzen 7 3800X 8-Core Processor             
CPU speed: 4350.39 MHz, 8 hyperthreaded cores
Especially the last line.
Old 2022-09-26, 15:19   #730
preda
"Mihai Preda"
Apr 2015
5·17² Posts

Quote:
Originally Posted by kruoli View Post
What does the heading in results.bench.txt say for you? E.g.:
Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
CPU speed: 3805.32 MHz, 14 hyperthreaded cores
CPU features: Prefetchw, SSE, SSE2, SSE4, AVX, AVX2, FMA, AVX512F
L1 cache size: 14x32 KB, L2 cache size: 14x1 MB, L3 cache size: 19712 KB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Old 2022-09-28, 05:16   #731
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
16306₈ Posts
Dual-xeon dual instance experiment

Multiple instances' worker windows repeatedly show the error message "error setting affinity: no error". See attachment.

I think George's recent comments about prime95 being NUMA-unaware mean the following: split worktodo and the files in progress into two folders, one per Xeon; copy the program code, prime.txt and local.txt into each; alter prime.txt to limit the number of cores per prime95 instance to the number of physical cores per Xeon; and run one thread per physical core on each CPU package using affinity bitmasks. I chose the even-numbered logical cores. In a batch file, or separately:

Code:
start /D (folder0) /NODE 0 /Affinity 0x5555 prime95.exe
start /D (folder1) /NODE 1 /Affinity 0x5555 prime95.exe
seemed to do the trick. (See cmd /k start /? for detailed help. AFAIK the start command is the only NUMA-aware control available at the Windows command line. PowerShell is a whole other kettle of fish I won't go into here.)
(Omitting /Affinity (bitmask) filled up all the hyperthreads on NUMA node 0 and doubled iteration times, leaving the other Xeon idle.)
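As a rough illustration of the batch-file approach, here is a minimal Python sketch that only builds the two start lines as strings, one per NUMA node. The folder paths are hypothetical placeholders (the post elides them as (folder0) and (folder1)), and the sketch assumes paths without spaces so no quoting is needed.

```python
def start_command(folder: str, node: int, mask: int, exe: str = "prime95.exe") -> str:
    """Build one Windows `start` line that pins an instance to a NUMA node
    and to an affinity mask within that node."""
    return f"start /D {folder} /NODE {node} /AFFINITY 0x{mask:X} {exe}"


# Hypothetical folder names, one prime95 copy per Xeon.
folders = ["C:\\prime95-node0", "C:\\prime95-node1"]

for node, folder in enumerate(folders):
    # 0x5555 is the mask used in the post: one hyperthread on each of 8 cores.
    print(start_command(folder, node, 0x5555))
```

The printed lines can be pasted into a batch file; nothing here launches a process.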

The two instances each have two workers with four cores each. Each instance has 48 GiB allowed for stage 2 P-1/P+1/ECM, 32 GiB as emergency memory, which leaves the 128 GiB ram potentially oversubscribed. Since it is being transitioned from DC to P-1, emergency memory for saving proof residues will become moot and can be pared back.
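The oversubscription can be checked with simple arithmetic. This sketch assumes the worst case, where both instances use their full stage 2 allowance and their full emergency memory at the same time; the function name is made up for illustration.

```python
INSTALLED_GIB = 128   # total RAM in the dual-Xeon system
INSTANCES = 2         # one prime95 instance per Xeon
STAGE2_GIB = 48       # per-instance allowance for stage 2 P-1/P+1/ECM
EMERGENCY_GIB = 32    # per-instance emergency memory


def worst_case_gib(instances: int, stage2_gib: int, emergency_gib: int) -> int:
    """Total memory if every instance uses both allowances in full."""
    return instances * (stage2_gib + emergency_gib)


current = worst_case_gib(INSTANCES, STAGE2_GIB, EMERGENCY_GIB)
pared_back = worst_case_gib(INSTANCES, STAGE2_GIB, 0)

print(f"with emergency memory: {current} GiB of {INSTALLED_GIB} GiB")
print(f"without emergency memory: {pared_back} GiB of {INSTALLED_GIB} GiB")
```

With the figures from the post this gives 160 GiB against 128 GiB installed, and 96 GiB once emergency memory is dropped.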

Alternate hyperthreads on the same physical core are consecutive logical processors on Windows, so use either the odd or the even bit but not both for a given two-bit field in the affinity mask;
0x5555 = binary 0101 0101 0101 0101 corresponding to HT0 of each of 8 cores on a Xeon E5-2670 8-core x2 HT. So in HWMonitor, even numbered logical cores are fully occupied with prime95; odd are available for OS etc.
Without setting both /node and /affinity values, everything fell on NUMA node 0. 0xAAAA would select the 8 odd-numbered logical cores.
Observed Windows 7 worker timings are consistent with that interpretation.
https://linustechtips.com/topic/5919...ined-for-real/
Task Manager displays Cores as follows:
Numa Node 0 top row
leftmost: core 0 hyperthread 0, then core 0 hyperthread 1, core 1 hyperthread 0 ... core 7 hyperthread 1
Second row is Numa node 1.
On Xeon Phis, in Windows 10, CPU rows wrap according to window width, but upper left is core 0, HT 0 1 2 3, bottom right ends core N-1, HT 0 1 2 3.
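The affinity masks discussed above can be generated rather than written by hand. This is a minimal sketch that assumes the Windows numbering described in this post, where the hyperthreads of one physical core are consecutive logical processors; the function name is made up for illustration.

```python
def ht_mask(physical_cores: int, threads_per_core: int = 2, thread: int = 0) -> int:
    """Affinity mask selecting one hyperthread (`thread`) on each physical core,
    assuming logical processor = core * threads_per_core + thread."""
    mask = 0
    for core in range(physical_cores):
        mask |= 1 << (core * threads_per_core + thread)
    return mask


print(hex(ht_mask(8)))            # even logical cores on an 8-core, 2-way HT package
print(hex(ht_mask(8, thread=1)))  # odd logical cores on the same package
print(hex(ht_mask(12)))           # same idea for a 12-core, 2-way HT package
```

For 8 cores this reproduces 0x5555 and 0xAAAA; for 12 cores it gives 0x555555.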

A forced server update from each of the instances, then a check of my CPUs page shows one occurrence of the nodename common to the two instances. (No duplication seen.)
Attached thumbnail: error setting affinity is no error.png (116.6 KB)

Last fiddled with by kriesel on 2022-09-28 at 05:27
Old 2022-09-28, 15:50   #732
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7366₁₀ Posts
SPR

Two outstanding issues as far as I know (haven't tested v30.8b17 yet)

1) Observed on Windows 7 Pro x64, dual Xeon E5-2670, prime95 V30.8b15, using start /Node 0 or 1, /affinity 0x5555, running two instances, intended as one each side of the QPI;
when a worker window assigns cores, the following message is produced repeatedly, with variety of hex values consisting of 3 or c at various offsets:
Error setting affinity to cpuset 0x000000c0: No error
(refer to attachment of https://mersenneforum.org/showpost.p...&postcount=731)
3 or c is 0011 or 1100.
Windows' numbering representation of the two logical cores of a x2 hyperthreaded physical core #0 is 0,1,
while Linux's is 0,n where n is number of physical hyperthreaded cores present in the system.
So it appears to me that prime95, a Windows application, may be using an inappropriate affinity mask for Windows. We don't usually want two prime95 compute threads running on the same physical core.
Or, prime95 is setting a bit map for using either hyperthread of the core involved. (Which would be less constraining than what was already done in the start command's affinity mask.) https://www.systutorials.com/docs/li...wloc_cpuset_t/
I speculate that locking prime95 activity to a specific hyperthread of a core may reduce activity in Windows' task scheduling on multiple cores. I've seen indications in Task Manager's CPU display of a fair amount of logical-core-hopping at times. It looks as if Windows is trying to balance load between hyperthreads, a probably futile exercise for code as memory-bound as prime95's FFT crunching, whose performance is generally hurt, not helped, by multiple hyperthreads on the same core.
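To make the two numbering schemes concrete, here is a small sketch that decodes a cpuset value such as 0x000000c0 into logical processor numbers and maps them to (core, hyperthread) pairs. It assumes the conventions stated in this post (Windows: consecutive hyperthreads per core; Linux: hyperthread k of core c is logical processor k*n + c, with n physical cores). The helper names are hypothetical and this is not prime95's or hwloc's code.

```python
def set_bits(mask: int) -> list:
    """Logical processor numbers whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def windows_core_ht(logical: int, threads_per_core: int = 2) -> tuple:
    """(core, hyperthread) when hyperthreads of a core are consecutive."""
    return divmod(logical, threads_per_core)


def linux_core_ht(logical: int, physical_cores: int) -> tuple:
    """(core, hyperthread) when hyperthread k of core c is k * n + c."""
    ht, core = divmod(logical, physical_cores)
    return core, ht


for cpu in set_bits(0x000000C0):
    print(cpu, "windows:", windows_core_ht(cpu), "linux:", linux_core_ht(cpu, 16))
```

Under the Windows convention, 0xc0 covers both hyperthreads of physical core 3; under the Linux convention with 16 physical cores, the same bits would name hyperthread 0 of cores 6 and 7.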

2) Observed repeatedly on Windows 10, i5-1035G1, prime95 V30.8b14,
Indicated P-1 stage 1 total time for a P-1 interrupted by autobenchmarking is too low, reflecting only the time from the end of the interruption to the end of the stage, omitting all the stage time before the benchmarking interruption.
See for example the log content at https://mersenneforum.org/showpost.p...4&postcount=40 which shows
"[Sep 2 15:41] M100204259 stage 1 complete. 347584 transforms. Total time: 2487.884 sec."
but the stage 1 ran from Sep 2 12:44 to Sep 2 15:41, 12:44 to 18:21 = 5:37 - 3 minutes for benchmarking ~20040. seconds, about 8.055 times as long as the total time indicated to millisecond precision.

Last fiddled with by kriesel on 2022-09-28 at 16:29
Old 2022-09-28, 16:22   #733
storm5510
Random Account
Aug 2009
Not U. + S.A.
17×149 Posts

Quote:
Originally Posted by kriesel View Post
Two outstanding issues as far as I know (haven't tested v30.8b17 yet)

1) Observed on Windows 7 Pro x64, dual Xeon E5-2670, prime95 V30.8b15, using start /Node 0 or 1, /affinity 0x5555,
when a worker window assigns cores, the following message is produced repeatedly, with variety of hex values consisting of 3 or c at various offsets:
Error setting affinity to cpuset 0x000000c0: No error
(refer to attachment of https://mersenneforum.org/showpost.p...&postcount=731)

2) Observed repeatedly on Windows 10, i5-1035G1, prime95 V30.8b14,
Indicated P-1 stage 1 total time for a P-1 interrupted by autobenchmarking is too low, reflecting only the time from the end of the interruption to the end of the stage, omitting all the stage time before the benchmarking interruption.
See for example the log content at https://mersenneforum.org/showpost.p...4&postcount=40 which shows
"[Sep 2 15:41] M100204259 stage 1 complete. 347584 transforms. Total time: 2487.884 sec."
but the stage 1 ran from Sep 2 12:44 to Sep 2 15:41, 12:44 to 18:21 = 5:37 - 3 minutes for benchmarking ~20040. seconds, about 8.055 times as long as the total time indicated to millisecond precision.
It seems like you may be taking a more difficult road setting affinity. I use the below in local.txt:

Code:
[Worker #1]
Affinity=(0,4),(2,6)
The bold section in your quote above makes no sense. 20,040 seconds is 5.57 hours...
Old 2022-09-28, 16:36   #734
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·29·127 Posts

Quote:
Originally Posted by storm5510 View Post
It seems like you may be taking a more difficult road setting affinity. I use the below in local.txt:

Code:
[Worker #1]
  Affinity=(0,4),(2,6)
The bold section in your quote above makes no sense. 20,040 seconds is 5.57 hours...
Are you running a dual-Xeon system with QPI bottleneck between halves of total installed ram? Two separate instances, one per Xeon as directed by George? I'm setting affinity to 8 hyperthreads of 32, in each of two prime95 instances. I think what I'm doing is simpler than setting four different affinity lists for four different workers in two different folders. And more likely to get the memory locality right. It's also easily extensible to a dual-12-core&HT system later; masks become 0x555555. Done.

5:37: 5 hours 37 minutes from start to finish of stage 1, minus 3 minutes benchmarking interruption:
5 * 3600 +37 * 60 -3 * 60 = 20040. seconds. Perhaps a case of vigorous agreement?
But the program reported only 2487. seconds & change, less than 1/8 the actual stage 1 compute time, is the point I was making.
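The arithmetic can be reproduced in a few lines. This sketch only uses the figures quoted in the thread (5 h 37 min elapsed, 3 min of benchmarking, 2487.884 s reported); it does not read any prime95 log.

```python
# Wall-clock span of stage 1 as stated in the post: 5 hours 37 minutes.
elapsed_sec = 5 * 3600 + 37 * 60

# Time lost to the automatic benchmark interruption: 3 minutes.
benchmark_sec = 3 * 60

# Stage 1 compute time actually spent.
actual_sec = elapsed_sec - benchmark_sec

# "Total time" printed by the program at the end of stage 1.
reported_sec = 2487.884

ratio = actual_sec / reported_sec

print(f"actual: {actual_sec} s ({actual_sec / 3600:.2f} h)")
print(f"reported: {reported_sec} s")
print(f"actual / reported: {ratio:.3f}")
```

So 20040 seconds and 5.57 hours are the same quantity, and the reported total is smaller by a factor of about 8.055.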

Last fiddled with by kriesel on 2022-09-28 at 17:24
Old 2022-09-28, 17:40   #735
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7366₁₀ Posts

Same hardware: the "Error setting affinity ... No error" message persists in prime95 v30.8b17.
There are no directives involving affinity in local.txt or prime.txt, so this is prime95's default behavior in the context of the start commands used.
IIRC George runs Linux not Windows, and may have no dual-CPU-package systems to test mprime / prime95 on.

Last fiddled with by kriesel on 2022-09-28 at 18:17
Old 2022-09-29, 23:16   #736
storm5510
Random Account
Aug 2009
Not U. + S.A.
17·149 Posts

Quote:
Originally Posted by kriesel View Post
Are you running a dual-Xeon system with QPI bottleneck between halves of total installed ram? Two separate instances, one per Xeon as directed by George?

But the program reported only 2487. seconds & change, less than 1/8 the actual stage 1 compute time, is the point I was making.
No, and I probably would not try. Multiple workers in a single instance, maybe. If the OS can see both CPUs then I would think any running process could as well.
Old 2022-09-30, 03:06   #737
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×29×127 Posts

Quote:
Originally Posted by storm5510 View Post
If the OS can see both CPUs then I would think any running process could as well.
Yes, and there's empirical evidence that using almost all the dual-CPU-system's ram in prime95 is suboptimal compared to using almost all the near ram on each CPU, in P-1 stage 2, and leaving the rest to be used for the other CPU's processes. Mprime / prime95 needs a little user assistance apparently to ensure it's the near ram in the multi-CPU (not merely multi-core) case.
All times are UTC. The time now is 10:28.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
