#232 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,369 Posts |
After adding 1, 3, or 6 x 32 GiB DIMMs, or 1 x 64, Win10 and Ubuntu atop WSL/Win10 have strange distorted views of the hardware, and it affects prime95, Task Manager, lscpu in Ubuntu, etc. One example is the following, obtained from Ubuntu 18.04/WSL1/Win10 on a 68-core Xeon Phi 7250.
Code:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 49
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping: 1
CPU MHz: 1401.000
CPU max MHz: 1401.0000
BogoMIPS: 2802.00
Hypervisor vendor: Windows Subsystem for Linux
Virtualization type: container
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave osxsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd ibrs ibpb

Stability also seemed to be adversely affected: the red HD light on the motherboard came on at uptimes of 5 minutes to an hour after boot, seemingly correlated with launching production prime95. More typical uptime for the 7250 system is of order 10-15 days with only MCDRAM and prime95 running 24/7 on all physical cores. The current boot seems to be faring better at 4+ hours of running. Prime95 worker iteration times are up from ~45 and 35 ms/iter for MCDRAM only, to 112 and 47 respectively with DIMMs in.

For my next trick I'll attempt a Centos 8 Stream install (since non-stream v8.x are no longer supported/maintained/updateable/package-installable-later) and see what native Linux makes of the configuration, how mprime and mlucas perform, etc. But first some other (non-computing) things need my attention. |
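One way to spot this kind of distorted view automatically is to check whether the lscpu fields agree with each other. The following is a minimal sketch, assuming lscpu's usual "key: value" text layout; the function names and the trimmed sample are made up for illustration and only the fields quoted above are used.

```python
def parse_lscpu(text):
    """Turn 'key: value' lines from lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def topology_counts(fields):
    """Return (reported logical CPUs, sockets * cores * threads)."""
    reported = int(fields["CPU(s)"])
    derived = (
        int(fields["Socket(s)"])
        * int(fields["Core(s) per socket"])
        * int(fields["Thread(s) per core"])
    )
    return reported, derived


# Trimmed copy of the topology fields from the listing above.
SAMPLE = """CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 49
Socket(s): 1"""

if __name__ == "__main__":
    reported, derived = topology_counts(parse_lscpu(SAMPLE))
    if reported != derived:
        print(f"inconsistent topology: CPU(s)={reported}, product={derived}")
```

With the values from the listing, the reported count is 64 while the product of sockets, cores per socket and threads per core is 49, so the inconsistency is flagged.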
#233 | |
If I May
"Chris Halsall"
Sep 2002
Barbados
11,087 Posts |
Quote:
These are also handy to use to run mprime on machines that otherwise waste their time as Human's Windows environments... |
#234 |
Sep 2002
Database er0rr
5×29×31 Posts |
Apart from the M$ subsystem telling lies about the number of cores... I suggest you play with numactl. See previous posts in this thread. I think you will get maximum benefit running a native Linux system.
#235 | ||
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7369₁₀ Posts |
Quote:
Quote:
In my experience, native Linux on comparatively simpler hardware can present some really stubborn obstacles at times, making its install more difficult and time consuming than Windows, or making it simply fail repeatedly. There's the immortal-can't-kill-it-NVIDIA-drivers issue that prevents putting in the necessary-for-OpenCL-gpu-computing-driver, and sequences like the following that stumped me nearly two years ago, though I've had better luck recently.
Code:
https://linuxhint.com/install_centos8_netboot_iso/ gives step by step tutorial of installing Centos 8.
I've been slogging through repeated tries of that for hours, dealing with a new tabletop board build at the moment, learning what file systems and drives the installer won't accept and other quirks, and am stopped cold by it refusing to accept the exact repository URL given in that web page. Character for character match, including case and punctuation, fails.
mirror.centos.org/centos/8/BaseOS/x86_64/os/
It never reaches "Downloading package metadata". Instead it's "Error setting up base repository"
http://centos.mirrors.tds.net/centos/ gives the appearance of working for a while before errors appear again
http://mirror.cs.uwp.edu/pub/centos/ will probably do the same
Seems like none of that should even be necessary, since the repos are supposed to be needed for the netboot ISO, while what I have on my little USB install drive is the whole fat 8.72GB CentOS-8.3.2011-x86_64-dvd1 ISO.

Thanks for the responses. Last fiddled with by kriesel on 2022-02-19 at 09:20 |
#236 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,369 Posts |
I wrote the following a while ago, and don't have the time to update it now, except for the last line and to note that my 7250 has Win 10 build 19044 (21H2) on it currently and will be doing an update cycle, which will likely cause a failed restart during the update process. (It usually takes several attempts to get from off to through the BIOS, or several dozen sometimes.)
Xeon Phi models have several dozen physical cores, and due to x4 hyperthreading, hundreds of logical cores.

Model | Cores | x4HT |
7210 | 64 | 256 |
7250 | 68 | 272 |
7290 | 72 | 288 |

To specify a NUMA node, there is the /NODE option in the Start command https://ss64.com/nt/start.html. It apparently accepts one NUMA node integer. So one could start prime95 specifying a NUMA node, from a command line:
start /Node 0 prime95.exe
which is an expression of a preference, not a mandatory requirement. numactl in Linux offers both, per the little documentation I've read.

Coreinfo from sysinternals shows core and cache NUMA node association. All four HT of a physical Xeon Phi core are in the same Windows NUMA node. https://docs.microsoft.com/en-us/sys...loads/coreinfo

Windows splits the core count into NUMA nodes of no more than 64 logical cores, and puts all hyperthreads of a core consecutively. So a MCDRAM-only 7250 (272 logical cores) presents as 5 NUMA nodes, numbered 0-4: the first four with 64 logical cores (16 physical) each, and the fifth with 16 logical cores, 4 real.

IIRC prime95 / mprime are not fully NUMA-aware. They usually perform best for primality testing or presumably P-1 factoring with one thread per physical core. There are controls for setting affinity, described near the end of undoc.txt. There is a provision for using 2MB pages, which prevents them from being swapped out. There does not appear to be a provision for using 1GB pages. There does not appear to be provision for running multiple instances of prime95 simultaneously on the same system, such as to 1:1 map prime95 instances and core use and NUMA node. There does not appear to be provision for specifying running in MCDRAM vs. DIMM ram on a Xeon Phi system containing both while booted in flat mode or hybrid mode.

After some web searching and exploring the sysinternals tools, I have not yet located a tool to either
1) determine or specify how Windows or the K1SPE motherboard maps MCDRAM vs. DIMM into physical address space
2) determine or specify what memory type(s) a specific application runs on
3) determine what memory mode Windows boots in, other than querying total memory and doing some deductive arithmetic that indicates flat mode.
4) determine in real time, or specify, what physical addresses a specific application such as prime95 loads in or uses, so as to indirectly select operation in MCDRAM mostly or entirely for speed. (It would seem to be counter to usual virtual memory management.)

https://docs.microsoft.com/en-us/win...d/numa-support talks about the 64-logical-processor limit per Windows NUMA node and a relaxation of it at build 20348, which appears not to be available to the public yet. https://docs.microsoft.com/en-us/win...se-information currently shows V21H2, build 19044.1566 revision 2022-02-15 as the latest. Last fiddled with by kriesel on 2022-02-19 at 09:36 |
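The node-count arithmetic for the three models can be sketched as below. This only models the "fill nodes of at most 64 logical cores, in order" rule as described in this post; it is not Windows' actual allocation logic, and the function name and parameters are invented for the illustration.

```python
def split_into_nodes(physical_cores, threads_per_core=4, max_logical_per_node=64):
    """Return the logical-core count of each node, filling nodes in order.

    Assumes every node is filled to the 64-logical-core cap before the next
    one is started, which is how the split is described in the post above.
    """
    remaining = physical_cores * threads_per_core
    nodes = []
    while remaining > 0:
        size = min(max_logical_per_node, remaining)
        nodes.append(size)
        remaining -= size
    return nodes


if __name__ == "__main__":
    for model, cores in (("7210", 64), ("7250", 68), ("7290", 72)):
        nodes = split_into_nodes(cores)
        print(model, nodes, "physical per node:", [n // 4 for n in nodes])
```

For the 7250 this gives four full nodes of 64 logical cores plus one node of 16 logical (4 physical) cores; the 7210 divides evenly into four nodes with nothing left over.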
#237 | ||
Sep 2002
Database er0rr
5×29×31 Posts |
Quote:
Quote:
Good luck getting numactl to work, if you go down a native Linux path. I use Debian, though I have not needed numactl yet. Last fiddled with by paulunderwood on 2022-02-19 at 10:36 |
#238 |
"Ed Hall"
Dec 2009
Adirondack Mtns
7×751 Posts |
I've been swapping Linux hard drives between machines for quite some time now. For my purposes they've always worked, except having to reset my static IP because it was linked to the previous ethernet. I even ran some machines off a microSD card for a while and machines that would boot from SD ran the OS fine from that card. I recompiled the various factoring packages as needed, but the OS (Ubuntu) seemed to always run.
#239 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1110011001001₂ Posts |
I did too. Double checked the data and my handling of it.
Re NUMA and preference rather than requirement in Win10, or virtualization: I took a screen shot of successively starting Ubuntu atop WSL/Win10 on nodes 0-5 on a 7250. Nodes 0-4 all indicated 64 logical cores in lscpu, for the appearance of 320 on a 272-hyperthread system. Node 5 was refused. IIRC node 4 will be slower because there are "really" only 4 physical cores, 16 logical. That's a complication that won't show up on a 64-core 7210. Last fiddled with by kriesel on 2022-02-19 at 15:24 |
#240 |
If I May
"Chris Halsall"
Sep 2002
Barbados
11,087 Posts |
Yup. I've been doing this for many, many years. I have /never/ encountered a situation where a generic Linux install on a bootable device didn't bring the system up. It's rare for even the NICs not to be recognized; if configured with DHCP nothing further should be needed.
Proprietary stuff like GPU drivers is, of course, separate, and may require manual installation and/or configuration. |
#241 |
∂2ω=0
Sep 2002
República de California
5·2,351 Posts |
Brief thoughts re. the KNL setup:

o If this is not your main work/admin machine, why insist on using Win+WSL, given its crippling manycore limitations? My main issue in initial setup of my 7250 was that I started with the 'thin' CentOS distro, which lacks desktop-GUI support. That lack made getting the WiFi to work nigh-impossible, something which is trivially easy using the GUI. That disparity is on the CentOS folks, but now we know the workaround, namely use the 'fat' distro.

o Re. your Mlucas 192M timings on the 2-socket 16c32t machine, several questions:

[a] Is this a dual-boot system setup, Win+WSL and CentOS Stream?

[b] How did you generate the total throughput numbers? 100-iters per instance, with -fft [comma-separated list of radices] and the radices corresponding to the best-timing-for-that-core-and-thread-count taken from the mlucas.cfg entry for 192M FFT resulting from running self-tests at that same core and thread count? (Using mlucas.cfg data for a different core/thread count will be suboptimal, often hugely so.)

[c] In my experience 100-iters is too small to accurately gauge throughput for runs with >= 4 threads per instance. If I wanted to gauge total-throughput @192M on a 16c32t system I would do as follows:
1: ./Mlucas -fft 192M -iters 1000 -cpu 0:15:2
2: ./Mlucas -fft 192M -iters 1000 -cpu 0:15
3: ./Mlucas -fft 192M -iters 1000 -cpu 0:31:2
4: ./Mlucas -fft 192M -iters 1000 -cpu 0:31
then note down the set of best-FFT-radices captured in mlucas.cfg for each test; let's shorthand said radix sets, now in ,-separated form (e.g. "192,16,32,32,32") as [1]-[4]. Then, with the system as otherwise-unloaded as possible, run 4 nice long timing tests, each at just the single best-timing radix set for that core/thread config, using 2 instances for the 2 single-socket configs. For the 2-instance cases we put both in background mode so they run concurrently:
1b: ./Mlucas -fft 192M -radset [1] -iters 10000 -cpu 0:15:2 & ./Mlucas -fft 192M -radset [1] -iters 10000 -cpu 16:31:2 &
2b: ./Mlucas -fft 192M -radset [2] -iters 10000 -cpu 0:15 & ./Mlucas -fft 192M -radset [2] -iters 10000 -cpu 16:31 &
3b: ./Mlucas -fft 192M -radset [3] -iters 10000 -cpu 0:31:2
4b: ./Mlucas -fft 192M -radset [4] -iters 10000 -cpu 0:31
then compare total-throughputs based on the resulting timings.

o OpenCL issues are moot for the KNL, unless you're planning to also stick a GPU in the small mini-ATX case.

o numactl works fine for me under CentOS. It has proved absolutely crucial in getting good performance from my big-footprint (nearly 200GB) F33 p-1 stage 2 work (@512M FFT) in terms of telling the OS "use MCDRAM as a giant L3 cache". 2.5x speedup for stage 2 over just letting the OS use its defaults for MCDRAM-versus-DIMM-RAM management. |
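To see which logical CPUs each of the runs above would occupy, the -cpu arguments can be expanded by hand or with a few lines of Python. This is a rough sketch that assumes the lo:hi:stride form means an inclusive range with an optional stride; that reading fits the commands quoted above but is an assumption, not something taken from the Mlucas source. The helper name is made up.

```python
def expand_cpu_spec(spec):
    """Expand 'lo', 'lo:hi' or 'lo:hi:stride' into a list of CPU indices.

    Assumes hi is inclusive and stride defaults to 1.
    """
    parts = [int(p) for p in spec.split(":")]
    lo = parts[0]
    hi = parts[1] if len(parts) > 1 else lo
    stride = parts[2] if len(parts) > 2 else 1
    return list(range(lo, hi + 1, stride))


if __name__ == "__main__":
    # The two concurrent instances of test 1b should not share any CPU.
    first = expand_cpu_spec("0:15:2")
    second = expand_cpu_spec("16:31:2")
    print(first)
    print(second)
    print("overlap:", sorted(set(first) & set(second)))
```

Under that assumption, tests 1b and 2b each pin their two instances to disjoint halves of the 32 logical CPUs, while 3b and 4b run one instance across both halves.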
#242 | ||||
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,369 Posts |
Quote:
Finding issues to report to you in any of the environments, and documenting them so other possible users might avoid pitfalls, was a goal. Being able to continue existing prime95 workloads alongside to completion without reinstall/migrate was a goal. Being able to run Mlucas without totally redoing the OS was a goal. Your own readme.html advised Windows users to rely on WSL as a means of running Mlucas. (Previously Mlucas could be compiled for Windows via msys2, but that is no longer the case, as of ~v19?) That said, the 7250 Xeon Phi here was set up for eventual dual-boot from the start. I am beyond saturated and have very little time to do that now, and really should be spending all my time on things other than GIMPS for a month. So meanwhile I'm mostly trying to make do with what's installed. I have dozens of Windows installs on old hardware, ~ a dozen WSL environments, and only 3 Linux boot environments installed in my hardware fleet, with no native Linux boot on Xeon Phi yet. Quote:
Shutdown. Disconnect AC cord. Cycle power button to dissipate residual stored energy in the PSU. Open case. Anti-static strap up. Swap boot drives. Static strap off. Close case. Reconnect power. Restart. Because that used system was shipped to me with Windows preinstalled and GPU drivers in place for the old Quadro and older Tesla included. I went through several cycles of Linux install on that: CentOS 7.9 first per Chalsall's recommendation; 8.2 to match what I understood you to have in place on your 7250; then 8.3 and 8.4, before finding that the unsolved issues I was having were because CentOS 8 was deliberately orphaned, so no packages could be installed or updated. (This seems like a fundamental Linux distro design flaw to me.) So CentOS 8 Stream was selected as the closest I could get to matching your system's OS while also being able to install or update as necessary to compile Mlucas there, an absolute requirement. This sequence and the previous system's Linux-only install was practice for attempting CentOS on Xeon Phi, which due to age or PSU or whatever is unreliable in getting through the very lengthy BIOS checks to the start of an OS boot. Quote:
Total throughput was computed by converting the ms/it numbers Mlucas provides for two simultaneous instances to it/ms, summing those, and converting back to give a single effective throughput timing. Quote:
Re GPUs, the 7250 is planned to get a PSU upgrade, which will enable adding a GPU. (Low profile GPU due to mechanical layout of those little cases & the CPU cooler lines.) Most of my fleet (the 7210 Xeon Phi, all dual-xeons, most single-package, even a laptop) contain at least one discrete GPU each; some have usable IGPs including a laptop or two; some systems are headless and many contain multiple GPUs each. The two with the most productivity and GPU count per system are headless Win10Pro. GPUs last I checked accounted for about 90% of my total GIMPS throughput. They're a mix of AMD and NVIDIA. So solving the GPU on Linux issues for both AMD and NVIDIA is important for Linux to be usable significantly here.

So is addressing the "Ubuntu does not allow effective remote desktop access of existing sessions" issue-by-design. Effective remote access management is necessary. The 7250 and 7210 are currently only mildly more physically accessible than your router-behind-the-couch. I had spent hours trying to get an Ubuntu boot test system remote access from my Windows based fleet management laptop without success, before finding online statements that Ubuntu 18.04 or up prohibits access to existing sessions, making it useless, and I did not yet get any remote desktop (graphical) access to actually function usefully. Does that work with CentOS remote / Windows console? I've no idea, nor how long it would take me to try. On Windows both ends graphical remote for existing sessions is fully supported and easy; either TightVNC for home or older versions, or built-in RDP for Pro or server editions (Vista & maybe XP? and newer).

On Windows, there is _OX in the upper right of each GUI app window (minimize, maximize, terminate respectively). On Gnome _ (minimize) is missing. On Windows there is the ability to customize the mouse cursor for size and color (including invert color of traversed pixels) to make it easily visible on large and/or high resolution or cluttered screens. If that exists in Gnome, it's well enough hidden to prevent me finding it during a lengthy search.

I spent hours trying, with frequent web searching for hints, but without success, to get Ubuntu on WSL2 to talk with my multi-OS-serving (Linux-based!) server. Just use the Linux GUI is no help for a command-line-only Ubuntu-on-WSL environment, or a headless system without functioning remote graphical access. A workaround for WSL1 hosted Linux is to identify the Windows location of the Linux files and copy using Windows to the server. But since WSL2 containerizes it, that workaround is not available for WSL2. Perhaps a Linux wizard could breeze through all that and more I've not identified yet. I'm not that.

Here's Mlucas running in Ubuntu/WSL1/Win10 on one e5-2690 while prime95/Win10 occupies the other. Timing is a bit better than the self-test results (~6% better and fluctuating somewhat).
./Mlucas -cpu 16:30:2
Code:
tail p3321928307.stat
[2022-03-03 15:23:03] M3321928307 S1 bit = 890000 [ 3.63% complete] clocks = 01:59:17.111 [715.7111 msec/iter] Res64: 1B811311F05BE51A. AvgMaxErr = 0.143603562. MaxErr = 0.203125000. Residue shift count = 0.
[2022-03-03 17:22:50] M3321928307 S1 bit = 900000 [ 3.67% complete] clocks = 01:59:34.788 [717.4789 msec/iter] Res64: 8D77BF8AC6AFD23A. AvgMaxErr = 0.143744259. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 19:22:41] M3321928307 S1 bit = 910000 [ 3.71% complete] clocks = 01:59:39.210 [717.9211 msec/iter] Res64: 8DC45B5ECC81EA89. AvgMaxErr = 0.143716101. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 21:22:21] M3321928307 S1 bit = 920000 [ 3.75% complete] clocks = 01:59:28.661 [716.8662 msec/iter] Res64: FBE70202E4FDBB0C. AvgMaxErr = 0.143669422. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-03 23:21:59] M3321928307 S1 bit = 930000 [ 3.79% complete] clocks = 01:59:25.410 [716.5411 msec/iter] Res64: 4DF1492B9B0E3919. AvgMaxErr = 0.143673560. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 01:21:58] M3321928307 S1 bit = 940000 [ 3.83% complete] clocks = 01:59:46.892 [718.6892 msec/iter] Res64: BE43FDAC1E6F8E5B. AvgMaxErr = 0.143704682. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 03:21:39] M3321928307 S1 bit = 950000 [ 3.87% complete] clocks = 01:59:29.534 [716.9535 msec/iter] Res64: 413F6E06000B7B66. AvgMaxErr = 0.143720421. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 05:20:56] M3321928307 S1 bit = 960000 [ 3.91% complete] clocks = 01:59:05.087 [714.5088 msec/iter] Res64: 6B5A97B02AE26A4E. AvgMaxErr = 0.143638286. MaxErr = 0.203125000. Residue shift count = 0.
[2022-03-04 07:20:07] M3321928307 S1 bit = 970000 [ 3.95% complete] clocks = 01:58:58.511 [713.8512 msec/iter] Res64: 9A3FE8F993F9E403. AvgMaxErr = 0.143765465. MaxErr = 0.187500000. Residue shift count = 0.
[2022-03-04 09:19:38] M3321928307 S1 bit = 980000 [ 4.00% complete] clocks = 01:59:19.661 [715.9662 msec/iter] Res64: 5A838F28B6654A84. AvgMaxErr = 0.143642746. MaxErr = 0.187500000. Residue shift count = 0.

Last fiddled with by kriesel on 2022-03-04 at 17:19 |
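The total-throughput calculation described earlier in this post (ms/iter to iter/ms, sum, convert back) can be applied to .stat output like the listing above with a short script. This is a minimal sketch: the regular expression is written against the line layout shown in the listing, and the function names and the 700 ms figures in the usage lines are hypothetical.

```python
import re

# Matches the "[715.7111 msec/iter]" field as it appears in the listing.
MSEC_RE = re.compile(r"\[\s*([0-9.]+)\s*msec/iter\]")


def msec_per_iter(lines):
    """Extract every msec/iter value found in the given .stat lines."""
    return [float(m.group(1)) for line in lines for m in MSEC_RE.finditer(line)]


def combined_ms_per_iter(per_instance_ms):
    """Effective ms/iter of several instances running simultaneously.

    Converts each timing to iter/ms, sums the rates, and converts back.
    """
    total_rate = sum(1.0 / ms for ms in per_instance_ms)
    return 1.0 / total_rate


if __name__ == "__main__":
    sample = [
        "[2022-03-03 15:23:03] M3321928307 S1 bit = 890000 [ 3.63% complete] "
        "clocks = 01:59:17.111 [715.7111 msec/iter] Res64: 1B811311F05BE51A.",
        "[2022-03-03 17:22:50] M3321928307 S1 bit = 900000 [ 3.67% complete] "
        "clocks = 01:59:34.788 [717.4789 msec/iter] Res64: 8D77BF8AC6AFD23A.",
    ]
    timings = msec_per_iter(sample)
    print("mean msec/iter:", sum(timings) / len(timings))
    # Two hypothetical instances at 700 ms/iter each behave like one at 350.
    print("combined:", combined_ms_per_iter([700.0, 700.0]))
```

Summing rates rather than averaging the ms/iter values is what makes the result comparable to a single instance spanning both sockets.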