mersenneforum.org Any First-gen Xeon Phi System Owners?

2021-11-22, 19:49   #1
ewmayer
2ω=0

Sep 2002
República de California

2×5×7×167 Posts
Any First-gen Xeon Phi System Owners?

I can recall several posts on mersenneforum.org in the past (mostly in the Xeon Phi? thread) in which users contemplated buying a cheap used first-gen Xeon Phi coprocessor online, only to abandon the idea because no GIMPS client supports that architecture's version of 512-bit SIMD. I just heard from someone who is working to port the Mlucas AVX-512 assembly to the aforementioned architecture's 512-bit SIMD instruction set. She writes:
Quote:
 The first generation cards are not like the KNL cards, and while the architecture is vaguely like x86_64, and the simd code is vaguely like avx512, it is not directly compatible with them at either the source or binary level. namely, alongside having different encodings, the base instruction set is only that of the Pentium (p54c), SSE/AVX/AVX2 are not available, and many instructions available in AVX512 such as vunpcklpd/vunpckhpd/vshuff64x2/vshufpd are not available. however, it does have additional instructions not found in avx512 such as vpermf32x4, and allows for a free swizzle in nearly every instruction, meaning that you can use vblendmpd with a mask and swizzle as a direct replacement for vunpcklpd, among other things. I have been working on writing replacement assembly routines for this architecture where it is incompatible with the avx512 in mlucas, which is primarily the matrix transposition routines scattered about, and i am about to attempt integrating these routines in my local copy of mlucas. Would you be interested in having this code for the proper mlucas distribution? i doubt that you have access to the necessary hardware to test it, and i know from experience it can be a pain to get such hardware set up, but it might be useful to others if you were to accept it.
I am happy to integrate such a contribution into the official Mlucas distro, but it would be useful to have a few other GIMPSers with access to such hardware to test it in order to work out any kinks and also get an idea of what the best run mode (number of instances, number of threads each) in terms of total-throughput is.
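The vunpcklpd replacement emilymm describes can be illustrated with a tiny scalar model. The sketch below is Python, not KNC assembly, and the lane and mask conventions are my assumptions for illustration rather than anything from the Intel docs:

```python
# Scalar simulation (not KNC code) of the substitution described above:
# vblendmpd with an odd-lane mask, applied to a 'cdab'-swizzled second
# operand, reproduces vunpcklpd on a vector of 8 doubles.

def vunpcklpd(a, b):
    """AVX-512 vunpcklpd: per 128-bit lane, interleave the low doubles."""
    return [x for i in range(0, 8, 2) for x in (a[i], b[i])]

def swizzle_cdab(v):
    """KNC 'cdab' swizzle on 64-bit lanes: swap adjacent element pairs."""
    return [v[i ^ 1] for i in range(8)]

def vblendmpd(mask, a, b):
    """Take b[i] where mask bit i is set, else a[i]."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(8)]

a = list(range(8))              # lanes 0..7 of the first operand
b = [10 + x for x in range(8)]  # lanes 0..7 of the second operand

# mask 0b10101010 selects the swizzled b in the odd lanes
assert vblendmpd(0b10101010, a, swizzle_cdab(b)) == vunpcklpd(a, b)
print(vunpcklpd(a, b))  # → [0, 10, 2, 12, 4, 14, 6, 16]
```

Because KNC folds the swizzle into the consuming instruction for free, the replacement should cost the same single instruction as the original vunpcklpd.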

 2021-11-22, 20:34 #2 kruoli     "Oliver" Sep 2017 Porta Westfalica, DE 840₁₀ Posts Does this include the PCIe cards? I have a 31S1P, and if throughput (also P-1!) is good, I will revisit using this card. It has been sitting idle for an eternity now...
 2021-11-22, 21:10 #3 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2²·11·139 Posts I have a 7120A, a PCIe coprocessor, not currently installed. Estimated 1/3 the throughput of a 7210. Its limited, fixed maximum RAM would cost P-1 performance at normal exponents and preclude exponents past ~1G. Performance/watt will be low compared to a 7210. Cold-weather hardware.
2021-11-22, 21:45   #4
emilymm

"emily"
Nov 2021
us pacific northwest

2 Posts

Quote:
 Originally Posted by kruoli Does this include the PCIe cards? I have a 31S1P, and if throughput (also P-1!) is good, I will revisit using this card. It has been sitting idle for an eternity now...

hi, i'm the person working on this.

yes, this is for the PCIe cards, i am working on this with a pair of 7120P cards.

2021-11-22, 23:34   #5
ewmayer
2ω=0

Sep 2002
República de California

2·5·7·167 Posts

Quote:
 Originally Posted by kruoli Does this include the PCIe cards? I have a 31S1P, and if throughput (also P-1!) is good, I will revisit using this card. It has been sitting idle for an eternity now...
Yes, it seems all versions of 1st-gen Phi have the same "IMCI" 512-bit SIMD instruction set. However, the setup needed to build and run code is nontrivial - from the person who contacted me (a.k.a. emilymm, whose above post landed just ahead of this one):
Quote:
 i obtained three of these Xeon Phi 7120P cards from someone at a university in Germany, who picked them up from the surplus department after they were retired from use sometime last year. two of them work, and the third had a firmware upgrade fail, and no longer boots. i should be able to recover it, but haven't spent the time to do so yet.

the host system i use them with is a custom-built system. the motherboard is an ASUS "ROG STRIX B450-F GAMING II". it has a Ryzen 2600x, a 1600 watt power supply, 16GB of RAM, and the two working cards installed. i went through four different motherboards before finding one that would boot with the cards installed. having "Above 4G decoding" and 64-bit PCI BAR support in the firmware is a must, but even with those features some boards still seem to dislike the cards.

the cooling system is rather intense, with about 16 fans total installed inside the case, including two 40mm server fans mounted on each card with a 3d printed shroud specifically for these cards. this is necessary to keep them from going into thermal shutdown.

i run Void Linux on the host, but this is mostly irrelevant to the setup. i use the machine for a few other things, and Void Linux is my preference for that. the Intel MPSS SDK for these cards is run within a VM that uses Centos 7.3, and the cards are passed through to the VM. the SDK only works properly on this, or older, or an equivalently old version of SUSE Linux Enterprise Server. i attempted to run it on something even slightly newer, to no avail.
Per Wikipedia, the 3-, 5- and 7-series Phis have not-dissimilar peak DP throughput, though that sort of "max theoretical" number needs to be taken with a large grain of salt. The mere 8GB of GDDR5 on the earlier models will make running P-1 stage 2 on more than one instance (one typically wants to fill the CPU up with multiple 4c8t instances) difficult. On my KNL I found that I can get excellent use out of the 192GB of installed RAM by using numactl with the "--preferred=1" flag to treat the 16GB of fast MCDRAM as a huge L3 cache (I'm doing an initial stage 2 interval on F33, which needs 4GB per double-residue buffer, allowing me to use - just! - 40 stage 2 buffers, one step up from, and perhaps a 10% throughput boost over, the minimum 24-buffer run option), but I don't know whether the coprocessor versions of Phi permit that sort of thing.
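For concreteness, the buffer arithmetic in the paragraph above works out as follows (a trivial sketch using only the figures quoted in this post):

```python
# Figures from the post above: 192 GB installed RAM on the KNL host,
# 4 GB per double-residue stage 2 buffer at F33.
ram_gb = 192
buf_gb = 4

bufs_40 = 40 * buf_gb   # 160 GB - fits, "just", once the OS and the
                        # rest of the working set are accounted for
bufs_24 = 24 * buf_gb   # 96 GB - the minimum 24-buffer run option

assert bufs_40 <= ram_gb
print(bufs_40, bufs_24)   # → 160 96
```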

Last fiddled with by ewmayer on 2021-11-22 at 23:36

2021-11-23, 01:53   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

6116₁₀ Posts

Per https://en.wikipedia.org/wiki/Xeon_Phi content:
Code:
Knights Corner: GDDR5 ECC RAM (no provision for DIMMs or any other expansion), PCIe coprocessor
31S1P:   8 GB RAM, 320 GB/s, 1003 GFLOPS, 270 W
71nnA/P: 16 GB RAM, 352 GB/s, 1208 GFLOPS, 300 W; integral fan/heatsink in A, passive heatsink in P; ~4 GFLOPS/W

Knights Landing: 16 GB MCDRAM (+ 6 optional DIMM channels), system processor
7210: 16 GB @ 400+ GB/s, plus optionally up to 384 GB @ 102 GB/s, 2662 GFLOPS, 215 W
7250: 16 GB @ 400+ GB/s, plus optionally up to 384 GB @ 102 GB/s, 3046 GFLOPS, 215 W; ~14 GFLOPS/W
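As a sanity check, the performance-per-watt figures in the list can be recomputed from the quoted GFLOPS and wattage (plain arithmetic, Python used only as a calculator):

```python
# GFLOPS / W for the KNC 71nn (1208 GFLOPS @ 300 W)
# and the KNL 7250 (3046 GFLOPS @ 215 W), per the specs above.
knc_gflops_per_w = 1208 / 300
knl_gflops_per_w = 3046 / 215
print(round(knc_gflops_per_w, 1), round(knl_gflops_per_w, 1))  # → 4.0 14.2
```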
In my experience testing P-1 limits with a variety of GIMPS applications on a variety of hardware, getting both P-1 stages to run above a ~1.14G exponent in 16GiB is questionable; Gpuowl V6.11-380 definitely fails stage 2 at a ~1.55G exponent, and that version allows a smaller buffer count than Mlucas 20.1.x or Gpuowl 7.x. A GPU's 16GiB of RAM also doesn't need to support an OS, as a PCIe coprocessor's does.
The 8 or 6GB coprocessors would be better employed on wavefront DC than P-1, IMO. But wavefront P-1 should be feasible; I've run it on 2GB. Just not as efficiently as with more RAM.
Quote:
 the cooling system is rather intense, with about 16 fans total installed inside the case, including two 40mm server fans mounted on each card with a 3d printed shroud specifically for these cards. this is necessary to keep them from going into thermal shutdown.
That's why I held out for a 7120A.
The MPSS support may be better on the Windows side, since it includes Windows 10.

Best of luck on the programming, emilymm.

Last fiddled with by kriesel on 2021-11-23 at 02:02

2021-11-23, 02:46   #7
emilymm

"emily"
Nov 2021
us pacific northwest

2₁₀ Posts

Quote:
 Originally Posted by ewmayer Per Wikipedia, the 3-, 5- and 7-series Phis have not-dissimilar peak DP throughput, though that sort of "max theoretical" number needs to be taken with a large grain of salt. The mere 8GB of GDDR5 on the earlier models will make running P-1 stage 2 on more than one instance (one typically wants to fill the CPU up with multiple 4c8t instances) difficult. On my KNL I found that I can get excellent use out of the 192GB of installed RAM by using numactl with the "--preferred=1" flag to treat the 16GB of fast MCDRAM as a huge L3 cache (I'm doing an initial stage 2 interval on F33, which needs 4GB per double-residue buffer, allowing me to use - just! - 40 stage 2 buffers, one step up from, and perhaps a 10% throughput boost over, the minimum 24-buffer run option), but I don't know whether the coprocessor versions of Phi permit that sort of thing.

the KNC cards don't have any sort of hardware or kernel NUMA support, but something resembling this might still be possible.

something that has come to mind is that it is possible to map host memory into the cards' address space. so although only 8/16GiB is onboard, you could install a much larger amount of RAM into the host and have it partition that out.

it would be slower than the onboard GDDR5, with the PCIe bus being more comparable to DDR3 speeds, though the latency likely wouldn't be worse...
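That bandwidth comparison can be made concrete. The sketch below assumes the KNC card sits on a PCIe 2.0 x16 link (~500 MB/s per lane per direction; this is my assumption about the card's link, not something stated above) and uses the 352 GB/s GDDR5 figure from the Wikipedia specs quoted earlier, alongside dual-channel DDR3-1600 for reference:

```python
# Rough bandwidth comparison (assumption: PCIe 2.0 x16 link on the card).
pcie2_x16_gbs = 16 * 0.5    # ≈ 8 GB/s over the bus, one direction
ddr3_1600_gbs = 2 * 12.8    # ≈ 25.6 GB/s, dual-channel DDR3-1600
gddr5_knc_gbs = 352         # onboard GDDR5, per the specs above

assert pcie2_x16_gbs < ddr3_1600_gbs < gddr5_knc_gbs
print(round(gddr5_knc_gbs / pcie2_x16_gbs))  # → 44
```

So host memory mapped over the bus would be roughly DDR3-class at best, more than an order of magnitude below the onboard GDDR5.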

Last fiddled with by emilymm on 2021-11-23 at 02:47

 2021-11-23, 11:35 #8 henryzz Just call me Henry     "David" Sep 2007 Liverpool (GMT/BST) 3×5×397 Posts Even with a form of AVX-512, are 1st-gen Phi cards worth worrying about now? Wouldn't a single modern CPU (Zen 3?) match them in efficiency?
 2021-11-23, 14:11 #9 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 17E4₁₆ Posts Ebay currently has some 5110P coprocessors for under $100; in one case, $40 US. For experimentation on a budget.

