mersenneforum.org Intel Xeon PHI?

2022-02-18, 14:58   #232
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7,369 Posts

After adding 1, 3, or 6 x 32 GiB DIMMs, or 1 x 64, Win10 and Ubuntu atop WSL/Win10 have strange distorted views of the hardware, and it affects prime95, Task Manager, lscpu in Ubuntu, etc. One example is the following, obtained from Ubuntu 18.04/WSL1/Win10 on a 68-core Xeon Phi 7250.
Code:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 49
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping: 1
CPU MHz: 1401.000
CPU max MHz: 1401.0000
BogoMIPS: 2802.00
Hypervisor vendor: Windows Subsystem for Linux
Virtualization type: container
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd ibrs ibpb

See also https://mersenneforum.org/showpost.p...&postcount=184 for several screen shots and some descriptive text.

Stability also seemed to be adversely affected, with the red HD light having come on on the motherboard at uptimes of 5 minutes to an hour after boot, and seeming to correlate with launching production prime95; a more typical uptime for the 7250 system is on the order of 10-15 days with only MCDRAM and prime95 running 24/7 on all physical cores. The current boot seems to be faring better at 4+ hours of running. Prime95 worker iteration times are up from ~45 and 35 ms/iter for MCDRAM only, to 112 and 47 respectively with DIMMs in.
For my next trick I'll attempt a Centos 8 Stream install (since non-stream v8.x are no longer supported/maintained/updateable/package-installable-later) and see what native Linux makes of the configuration, how mprime and mlucas perform, etc. But first some other (non-computing) things need my attention.
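
Editor's sketch: one quick way to spot the kind of distorted view shown in the lscpu output above is to compare the reported CPU count with sockets x cores x threads. The Python below is a minimal illustration, not part of the original post; the function names are hypothetical, and the sample text is just the relevant lines copied from that output.

```python
def parse_lscpu(text):
    """Return a dict of the 'Key: value' fields printed by lscpu."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def topology_counts(fields):
    """Return (reported CPUs, sockets x cores x threads) from lscpu fields."""
    reported = int(fields["CPU(s)"])
    implied = (int(fields["Socket(s)"])
               * int(fields["Core(s) per socket"])
               * int(fields["Thread(s) per core"]))
    return reported, implied


# Lines copied from the WSL1 lscpu output quoted in the post.
sample = """\
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 49
Socket(s): 1
"""

reported, implied = topology_counts(parse_lscpu(sample))
print(reported, implied, reported == implied)
```

On a consistent system the two numbers match; here they do not, and neither equals the 68 physical cores or 272 hyperthreads of a 7250.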
2022-02-18, 15:11   #233
chalsall
If I May

"Chris Halsall"
Sep 2002

11,087 Posts

Quote:
 Originally Posted by kriesel For my next trick I'll attempt a Centos 8 Stream install (since non-stream v8.x are no longer supported/maintained/updateable/package-installable-later) and see what native Linux makes of the configuration, how mprime and mlucas perform, etc. But first some other (non-computing) things need my attention.
I highly recommend you install one or more versions of Linux on a USB device, that you can use to bring a machine up for forensics. This can be as simple as a small (but ideally, fast) Flash drive, to a portable USB HDD/SSD.

These are also handy to use to run mprime on machines that otherwise waste their time as Human's Windows environments...

2022-02-18, 15:50   #234
paulunderwood

Sep 2002
Database er0rr

5×29×31 Posts

Apart from the M$ subsystem telling lies about the number of cores... I suggest you play with numactl. See previous posts in this thread. I think you will get maximum benefit running a native Linux system.

2022-02-19, 08:45   #235
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7369₁₀ Posts

Quote:
 Originally Posted by chalsall I highly recommend you install one or more versions of Linux on a USB device, that you can use to bring a machine up for forensics. This can be as simple as a small (but ideally, fast) Flash drive, to a portable USB HDD/SSD.
I recently acquired a 128GiB memory stick for that purpose. (WD EasyStore, read/write reliability test attached.) I've been wondering how clone-able Linux is; install to the stick on one system, how different a set of hardware can be run on the same stick's existing Linux install?

Quote:
 Originally Posted by paulunderwood Apart from the M$ subsystem telling lies about the number of cores... I suggest you play with numactl. See previous posts in this thread. I think you will get maximum benefit running a native Linux system.
Numactl is ineffective in Ubuntu atop the WSL environment. And some commands or usual system files are simply not present or available in the WSL compatible build of Ubuntu downloadable from the MS Store.

In my experience, native Linux on comparatively simpler hardware can present some really stubborn obstacles at times, making its install more difficult and time consuming than Windows, or simply repeatedly fail. There's the immortal-can't-kill-it-NVIDIA-drivers issue that prevents putting in the necessary-for-OpenCL-gpu-computing-driver, and sequences like the following that stumped me nearly two years ago, though I've had better luck recently.
Code:
https://linuxhint.com/install_centos8_netboot_iso/ gives step by step tutorial of installing Centos 8.
I've been slogging through repeated tries of that for hours, dealing with a new tabletop board build at the moment, learning what file systems and drives the installer won't accept and other quirks, and am stopped cold by it refusing to accept the exact repository URL given in that web page. Character for character match, including case and punctuation, fails.
mirror.centos.org/centos/8/BaseOS/x86_64/os/

http://centos.mirrors.tds.net/centos/ gives the appearance of working for a while before errors appear again
http://mirror.cs.uwp.edu/pub/centos/ will probably do the same

Seems like none of that should even be necessary, since the repos are supposed to be needed for the netboot ISO, while what I have on my little USB install drive is the whole fat 8.72GB CentOS-8.3.2011-x86_64-dvd1 ISO.
Or in an occasional case the performance on native Linux is observed lower than Linux/WSL/Win10! (Seen on a dual-instance Mlucas benchmark at 192M fft length on a dual-Xeon system; https://www.mersenneforum.org/showpo...5&postcount=17 attachment 3, pages 3 & 4, 417 vs 377 ms/iter effective throughput.)

Thanks for the responses.

Last fiddled with by kriesel on 2022-02-19 at 09:20

2022-02-19, 09:34   #236
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7,369 Posts

NUMA and Windows and Xeon Phi

I wrote the following a while ago, and don't have the time to update it now, except for the last line and to note that my 7250 has Win 10 build 19044 (21H2) on it currently and will be doing an update cycle, which will likely cause a failed restart during the update process. (It usually takes several attempts to get from off to through the BIOS, or several dozen sometimes.)

Xeon Phi models have several dozen physical cores, and due to x4 hyperthreading, hundreds of logical cores.

Model  Cores  x4HT
7210   64     256
7250   68     272
7290   72     288

To specify NUMA node, there is the /NODE option in the Start command https://ss64.com/nt/start.html. It accepts apparently one NUMA node integer. So one could start prime95 specifying a NUMA node, from a command line:
start /Node 0 prime95.exe
which is an expression of a preference, not a mandatory requirement. numactl in Linux offers both, per the little documentation I've read.

Coreinfo from sysinternals shows core and cache NUMA node association. All four HT of a physical Xeon Phi core are in the same Windows NUMA node. https://docs.microsoft.com/en-us/sys...loads/coreinfo

Windows splits core count into NUMA nodes of no more than 64 logical cores, and puts all hyperthreads of a core consecutively. So a MCDRAM-only 7250 presents as 5 NUMA nodes: nodes 0-3 with 64 logical cores (16 physical cores) each, and node 4 with 16 logical cores (4 real).

IIRC prime95 / mprime are not fully NUMA-aware. They usually perform best for primality testing or presumably P-1 factoring with one thread per physical core. There are controls for setting affinity, described near the end of undoc.txt. There is a provision for using 2MB pages, which prevents them from being swapped out. There does not appear to be a provision for using 1GB pages.
There does not appear to be provision for running multiple instances of prime95 simultaneously on the same system, such as to 1:1 map prime95 instances and core use and NUMA node. There does not appear to be provision for specifying running in MCDRAM vs. DIMM ram on a Xeon Phi system containing both while booted in flat mode or hybrid mode.

After some web searching and exploring the sysinternals tools, I have not yet located a tool to either
1) determine or specify how Windows or the K1SPE motherboard maps MCDRAM vs. DIMM into physical address space
2) determine or specify what memory type(s) a specific application runs on
3) determine what memory mode Windows boots in, other than querying total memory and doing some deductive arithmetic that indicates flat mode.
4) determine in real time, or specify, what physical addresses a specific application such as prime95 loads in or uses, so as to indirectly select operation in MCDRAM mostly or entirely for speed. (It would seem to be counter to usual virtual memory management.)

https://docs.microsoft.com/en-us/win...d/numa-support talks about the 64-logical-processor limit per Windows NUMA node and a relaxation of it at build 20348. Which appears not to yet be available to the public. https://docs.microsoft.com/en-us/win...se-information currently shows V21H2, build 19044.1566 revision 2022-02-15 as the latest.

Last fiddled with by kriesel on 2022-02-19 at 09:36
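
Editor's sketch: the splitting rule described in this post (at most 64 logical processors per Windows NUMA node, with all hyperthreads of a core kept together) can be written out as a few lines of Python. This is only a model of the rule as stated in the post, not of the actual Windows scheduler, which may distribute cores differently; the function name and defaults are hypothetical.

```python
def split_into_nodes(physical_cores, threads_per_core=4, max_logical=64):
    """Fill nodes with whole cores until each holds at most max_logical
    logical processors. Returns a list of (logical, physical) per node."""
    cores_per_node = max_logical // threads_per_core
    nodes = []
    remaining = physical_cores
    while remaining > 0:
        cores = min(cores_per_node, remaining)
        nodes.append((cores * threads_per_core, cores))
        remaining -= cores
    return nodes


# Core counts from the model table in the post.
for model, cores in (("7210", 64), ("7250", 68), ("7290", 72)):
    print(model, split_into_nodes(cores))
```

Under this model a 7250 comes out as four nodes of 64 logical (16 physical) cores plus one of 16 logical (4 physical), totalling 272 hyperthreads, while a 7210 fills exactly four nodes.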
2022-02-19, 10:06   #237
paulunderwood

Sep 2002
Database er0rr

5×29×31 Posts

Quote:
 Originally Posted by kriesel I recently acquired a 128GiB memory stick for that purpose. (WD EasyStore, read/write reliability test attached.) I've been wondering how clone-able Linux is; install to the stick on one system, how different a set of hardware can be run on the same stick's existing Linux install?
Many a time I have just swapped a disk into another system and it has booted. Linux's monolithic kernel ensures this. There might be an issue with the network driver, but by and large it works. You might have to fiddle with /etc/fstab though to access any other hard drives automatically from boot, but for a USB bootable system it should be fine.

Quote:
 Or in an occasional case the performance on native Linux is observed lower than Linux/WSL/Win10! (Seen on a dual-instance Mlucas benchmark at 192M fft length on a dual-Xeon system; https://www.mersenneforum.org/showpo...5&postcount=17 attachment 3, pages 3 & 4, 417 vs 377 ms/iter effective throughput.)
I find this very hard to believe.

Good luck getting numactl to work, if you go down a native Linux path. I use Debian, though I have not needed numactl yet.

Last fiddled with by paulunderwood on 2022-02-19 at 10:36

2022-02-19, 14:32   #238
EdH

"Ed Hall"
Dec 2009

7×751 Posts

I've been swapping Linux hard drives between machines for quite some time now. For my purposes they've always worked, except having to reset my static IP because it was linked to the previous ethernet. I even ran some machines off a microSD card for a while and machines that would boot from SD ran the OS fine from that card. I recompiled the various factoring packages as needed, but the OS (Ubuntu) seemed to always run.
2022-02-19, 15:23   #239
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1110011001001₂ Posts

Quote:
 Originally Posted by paulunderwood I find this very hard to believe.
I did too. Double checked the data and my handling of it.
Re NUMA and preference rather than requirement in Win10, or virtualization, there's a screen shot I took of successively starting Ubuntu atop WSL/Win10 on nodes 0-5 of a 7250. Nodes 0-4 all indicated 64 logical cores in lscpu, for the appearance of 320 on a 272-hyperthread system. Node 5 was refused. IIRC node 4 will be slower because there are only 4 physical cores, 16 logical, "really". That's a complication that won't show up on a 64-core 7210.

Last fiddled with by kriesel on 2022-02-19 at 15:24

2022-02-19, 16:11   #240
chalsall
If I May

"Chris Halsall"
Sep 2002

11,087 Posts

Quote:
 Originally Posted by EdH ...but the OS (Ubuntu) seemed to always run.
Yup. I've been doing this for many, many years. I have /never/ encountered a situation where a generic Linux install on a bootable device didn't bring the system up. It's rare for even the NICs not to be recognized; if configured with DHCP nothing further should be needed.

Proprietary stuff like GPU drivers is, of course, separate, and may require manual installation and/or configuration.

2022-03-04, 16:43   #242
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7,369 Posts

Quote:
 Originally Posted by ewmayer Brief thoughts re. the KNL setup: o If this is not your main work/admin machine, why insist on using Win+WSL, given its crippling manycore limitations?
Evaluating Mlucas on it in multiple environments was a goal.
Finding issues to report to you in any of the environments, and document for other possible users to perhaps avoid pitfalls, was a goal.
Being able to continue existing prime95 workloads alongside to completion without reinstall/migrate was a goal.
Being able to run Mlucas without totally redoing the OS was a goal.
Your own readme.html advised Windows users to rely on WSL as a means of running Mlucas. (Previously Mlucas could be compiled for Windows via msys2, but that is no longer the case, as of ~v19?)
That said, the 7250 Xeon Phi here was set up for eventual dual-boot from the start.
I am beyond saturated and have very little time to do that now, and really should be spending all my time on things other than GIMPS for a month. So meanwhile I'm mostly trying to make do with what's installed.
I have dozens of Windows installs on old hardware, ~ a dozen WSL environments, and only 3 Linux boot environments installed, in my hardware fleet, no native Linux boot on Xeon Phi yet.
Quote:
 o Re. your Mlucas 192M timings on the 2-socket 16c32t machine, several questions: [a] Is this a dual-boot system setup, Win+WSL and CentOS Stream?
Yes, the hard way:
Shutdown. Disconnect AC cord. Cycle power button to dissipate residual stored energy in the PSU. Open case. Anti-static strap up. Swap boot drives. Static strap off. Close case. Reconnect power. Restart.
Because that used system was shipped to me with Windows preinstalled and GPU drivers in place for the old Quadro and older Tesla included.
I went through several cycles of Linux install on that: CentOS 7.9 first per Chalsall's recommendation; 8.2 to match what I understood you to have in place on your 7250; then 8.3 and 8.4, before finding that the unsolved issues I was having were because CentOS 8 was deliberately orphaned, so no packages could be installed or updated. (This seems like a fundamental Linux distro design flaw to me.) So CentOS 8 Stream was selected as the closest I could get to matching your system's OS while also being able to install or update as necessary to compile Mlucas there, an absolute requirement.

This sequence and the previous system's Linux-only install were practice for attempting CentOS on Xeon Phi, which due to age or PSU or whatever is unreliable in getting through the very lengthy BIOS checks to the start of an OS boot.
Quote:
 [b] How did you generate the total throughput numbers? 100-iters per instance, with -fft [comma-separated list of radices] and the radices corresponding to the best-timing-for-that-core-and-thread-count taken from the mlucas.cfg entry for 192M FFT resulting from running self-tests at that same core and thread count? (Using mlucas.cfg data for a different core/thread count will be suboptimal, often hugely so.)
Numerous self-test-tune cycles for 192M fft using different core counts and core patterns; documented (attachment 3 here) as previously stated. I get they are preliminary and low precision and need some followup. Later.
Total throughput was computed by converting the ms/it numbers Mlucas provides for two simultaneous instances to it/ms, summing those, and converting back to give a single effective throughput timing.
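
Editor's sketch: the conversion described above (invert each ms/iter figure to a rate, sum the rates, invert the sum) is a few lines of Python. The function name is hypothetical, and the two timings in the usage line are made-up round numbers for illustration, not measurements from the benchmark.

```python
def effective_ms_per_iter(ms_per_iter):
    """Combine per-instance timings (ms/iter) into one effective timing.

    Each timing is inverted to a rate (iter/ms), the rates are summed,
    and the total rate is inverted back to ms/iter.
    """
    rates = [1.0 / t for t in ms_per_iter]
    return 1.0 / sum(rates)


# Hypothetical example: two simultaneous instances at 800 ms/iter each
# give an effective timing of about half that.
print(effective_ms_per_iter([800.0, 800.0]))
```

A lower effective figure means higher total throughput, so this number can be compared directly across environments, as with the 417 vs 377 ms/iter figures quoted earlier in the thread.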

Quote:
 o numactl works fine for me under CentOS - it has proved absolutely crucial in getting good performance from my big-footprint (nearly 200GB) F33 p-1 stage 2 work (@512M FFT) in terms of telling the OS "use MCDRAM as a giant L3 cache". 2.5x speedup for stage 2 over just letting the OS use its defaults for MCDRAM-versus-DIMM-RAM management.
Saw your previous posts about that. Will use them as reference when I have time to use them.

Re GPUs, the 7250 is planned to get a PSU upgrade, which will enable adding a GPU. (Low profile GPU due to mechanical layout of those little cases & the CPU cooler lines.)

On Windows, there is _OX in the upper right of each GUI app window (minimize, maximize, terminate respectively). On Gnome _ (minimize) is missing.
On Windows there is the ability to customize the mouse cursor for size and color (including invert color of traversed pixels) to make it easily visible on large and/or high resolution or cluttered screens. If that exists in Gnome, it's well enough hidden to prevent me finding it during a lengthy search.
I spent hours trying, with frequent web searching for hints, but without success, to get Ubuntu on WSL2 to talk with my multi-OS-serving (Linux-based!) server. "Just use the Linux GUI" is no help for a command-line-only Ubuntu-on-WSL environment, or a headless system without functioning remote graphical access. A workaround for WSL1 hosted Linux is to identify the Windows location of the Linux files and copy using Windows to the server. But since WSL2 containerizes it, that workaround is not available for WSL2.

Perhaps a Linux wizard could breeze through all that and more I've not identified yet. I'm not that.

Here's Mlucas running in Ubuntu/WSL1/Win10 on one e5-2690 while prime95/Win10 occupies the other. Timing is a bit better than the self-test results (~6% better and fluctuating somewhat). ./Mlucas -cpu 16:30:2

Code:
 tail p3321928307.stat
[2022-03-03 15:23:03] M3321928307 S1 bit = 890000 [ 3.63% complete] clocks = 01:59:17.111 [715.7111 msec/iter] Res64: 1B811311F05BE51A. AvgMaxErr = 0.143603562. MaxErr = 0.203125000. Residue shift count = 0.
[2022-03-03 17:22:50] M3321928307 S1 bit = 900000 [ 3.67% complete] clocks = 01:59:34.788 [717.4789 msec/iter] Res64: 8D77BF8AC6AFD23A. AvgMaxErr = 0.143744259. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 19:22:41] M3321928307 S1 bit = 910000 [ 3.71% complete] clocks = 01:59:39.210 [717.9211 msec/iter] Res64: 8DC45B5ECC81EA89. AvgMaxErr = 0.143716101. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 21:22:21] M3321928307 S1 bit = 920000 [ 3.75% complete] clocks = 01:59:28.661 [716.8662 msec/iter] Res64: FBE70202E4FDBB0C. AvgMaxErr = 0.143669422. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 23:21:59] M3321928307 S1 bit = 930000 [ 3.79% complete] clocks = 01:59:25.410 [716.5411 msec/iter] Res64: 4DF1492B9B0E3919. AvgMaxErr = 0.143673560. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 01:21:58] M3321928307 S1 bit = 940000 [ 3.83% complete] clocks = 01:59:46.892 [718.6892 msec/iter] Res64: BE43FDAC1E6F8E5B. AvgMaxErr = 0.143704682. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 03:21:39] M3321928307 S1 bit = 950000 [ 3.87% complete] clocks = 01:59:29.534 [716.9535 msec/iter] Res64: 413F6E06000B7B66. AvgMaxErr = 0.143720421. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 05:20:56] M3321928307 S1 bit = 960000 [ 3.91% complete] clocks = 01:59:05.087 [714.5088 msec/iter] Res64: 6B5A97B02AE26A4E. AvgMaxErr = 0.143638286. MaxErr = 0.203125000. Residue shift count = 0.
[2022-03-04 07:20:07] M3321928307 S1 bit = 970000 [ 3.95% complete] clocks = 01:58:58.511 [713.8512 msec/iter] Res64: 9A3FE8F993F9E403. AvgMaxErr = 0.143765465. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 09:19:38] M3321928307 S1 bit = 980000 [ 4.00% complete] clocks = 01:59:19.661 [715.9662 msec/iter] Res64: 5A838F28B6654A84. AvgMaxErr = 0.143642746. MaxErr = 0.187500000. Residue shift count = 0.
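
Editor's sketch: to compare a run like the one above against self-test timings, the msec/iter figures can be pulled out of the .stat lines and averaged. The Python below is a minimal illustration that assumes the line layout shown in this listing; the function name is hypothetical, and the sample holds the first two lines of the listing.

```python
import re
from statistics import mean

# Matches the bracketed timing field, e.g. "[715.7111 msec/iter]".
ITER_TIME = re.compile(r"\[\s*([0-9.]+) msec/iter\]")


def msec_per_iter(stat_text):
    """Return every msec/iter value found in Mlucas .stat text, in order."""
    return [float(m.group(1)) for m in ITER_TIME.finditer(stat_text)]


# First two lines of the listing above.
sample = """\
[2022-03-03 15:23:03] M3321928307 S1 bit = 890000 [ 3.63% complete] clocks = 01:59:17.111 [715.7111 msec/iter] Res64: 1B811311F05BE51A. AvgMaxErr = 0.143603562. MaxErr = 0.203125000. Residue shift count = 0.
[2022-03-03 17:22:50] M3321928307 S1 bit = 900000 [ 3.67% complete] clocks = 01:59:34.788 [717.4789 msec/iter] Res64: 8D77BF8AC6AFD23A. AvgMaxErr = 0.143744259. MaxErr = 0.187500000. Residue shift count = 0.
"""

times = msec_per_iter(sample)
print(times, mean(times))
```

Feeding it the whole file (for example the text of p3321928307.stat) gives the average over the full run rather than the last ten checkpoints.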

Last fiddled with by kriesel on 2022-03-04 at 17:19
