[B]Madpoo[/B], thank you for the link! Nice to hear that KNL is still alive :rolleyes:
Hehehe, I think we shouldn't expect finished systems any earlier than September ((( |
[QUOTE=Lorenzo;438119][B]Madpoo[/B], thank you for the link! Nice to hear that KNL is still alive :rolleyes:
Hehehe, I think we shouldn't expect finished systems any earlier than September ((([/QUOTE] Yeah, my guess is that when Intel says they're shipping, they may be going to governments/universities that were promised some early units for their supercomputers, or something like that. Or perhaps the companies that announced their server models are still working out some kinks and they're not really ready for sale yet.

Still, I think you can officially buy them from that Intel / Colfax "developer access program": [URL="http://dap.xeonphi.com/"]http://dap.xeonphi.com/[/URL] They're just not as customizable as what we could get from 3rd parties - things like a lower-cost amount of RAM, different disk options, etc. I mean, those things start at $4200 if you get one without any disks at all, so you're still paying for 6 x 16GB memory, and for me I'd be happy with 6 x 4GB or maybe 8GB on a dev platform and save some money. Not sure CentOS would be my flavor of choice either, but at least that could be changed unless there are some funky driver things that make it the only choice right now. The inclusion of a "Parallel Studio XE Cluster Edition Named User License" is nice though... I don't know how much that would be separately, but if you're planning to actually use something like this for dev work, I imagine you'd need it.

Whatever the case, it may be a while before I get my hands on one. :smile: Probably when HP comes out with a Proliant model, and I can work the #'s to show how awesome this would be for a virtual machine host. I'm imagining a cluster of a few of these compared to a larger number of boring old Xeon E5 processors. :smile: A virtual host is definitely a good use of a CPU with many cores... things like SQL, IIS, and a lot of the other stuff I use take good advantage of multiple threads. |
[url]http://dap.xeonphi.com/[/url]
|
Motherboard:
[url]http://www.serversdirect.com/Components/motherboards/id-T-K1SPE/Supermicro_K1SPE_Xeon_Phi_x200_ATX_motherboard_Socket-P_LGA_3647[/url] CPU pre-order [url]http://www.shopblt.com/item/intel-xeon-phi-7210p-64c-1.3g/intel_hj8066702859300.html[/url] Complete system: [url]https://www.sabrepc.com/supermicro-sys-5038k-i-knights-landing-workstation.html[/url] |
If we were able to get one of these systems up and running for shared development, just how active is the interest/available is George et al.'s time to get prime95 using those avx-512 opcodes?
I am excited about pushing forward the tech but currently lack the time and mental bandwidth for x64 ASM optimization work. |
[QUOTE=airsquirrels;440867]If we were able to get one of these systems up and running for shared development, just how active is the interest/available is George et al.'s time to get prime95 using those avx-512 opcodes?
I am excited about pushing forward the tech but currently lack the time and mental bandwidth for x64 ASM optimization work.[/QUOTE] FWIW, there is one user with a KNL that's been turning in results. So far, error free. Of course it's not tuned for any KNL optimizations, but it is working "out of the box", which is a good sign. 16 matching double-checks. Very encouraging.

No benchmark info, but I did request it. Question: how does one run the benchmark on mprime? I don't use the Linux version so I'm useless on that front. Any other benchmarks besides mprime that would be useful to see how it would do? Bear in mind, it's not going to reflect anything with the new CPU features, and the slower per-core clock speeds are probably going to make it seem slower than other CPUs (but with 64-72 cores, it makes up for it in other ways).

I also wasn't sure how the vector pipelines appear to the OS... weren't there multiple AVX pipelines per core? Does it show up as another CPU, like hyperthreaded cores do? Beats me. Would be interesting to see how the 16GB of HBM does in its different configurations (as a huge cache, or part of system memory, etc...) |
[QUOTE=Madpoo;440888]FWIW, there is one user with a KNL that's been turning in results. So far, error free. Of course it's not tuned for any KNL optimizations, but it is working "out of the box" which is a good sign.[/QUOTE]
Do you know (or can you ask) if said user has the Intel build tools installed? If so, would he or she be willing to offer guest accounts for folks looking to do AVX512/manycore code-dev? I intend to spend more or less all of the Fall adding AVX512 assembly support to Mlucas and playing with manythread performance analysis & tuning on KL. I filled out the online form for the Xeon Phi developer access program at the link I posted, hope to hear back on Monday (i.e. next business day). But, that sounds like one is getting in line for the opportunity to *buy* an early-release system, and I can't justify shelling out $5000 (roughly where I expect the price range to start) - I signed up mainly in hopes of getting info on the price range of a complete system with build software installed. |
[QUOTE=Madpoo;440888]I also wasn't sure how the vector pipelines appear to the OS... weren't there multiple AVX pipelines per core? Does it show up as another CPU, like hyperthreaded cores do? Beats me.[/quote]
The OS knows little about the vector pipelines; the software just issues an instruction like [code] vgatherqpd zmm30{k1}, [r14+zmm31*8+0x7b] [/code] and the hardware scheduler sticks it onto one of the two pipes. |
I would guess that a decent way of utilizing this would be hyperthreading, though there may then be other bottlenecks.
|
George has not posted to this thread so far. We don't know if he has the time and inclination to pursue this now, as opposed to some time in the future.
How was this handled with previous introductions of new architectures? At what point in the cycle did development begin, and is anything different this time? Are the Intel-supplied tools suitable, or is it better to wait for third-party tools to catch up? |
[QUOTE=ewmayer;440894]Do you know (or can you ask) if said user has the Intel build tools installed? If so, would he or she be willing to offer guest accounts for folks looking to do AVX512/manycore code-dev?[/QUOTE]
I don't think so... I haven't asked, but if I had to guess, it's not the type of setup that would lend itself to something like that. Hopefully before too long, one of the regular "enthusiasts" on here will get their hands on one and could make that work. If I had $5K burning a hole in my pocket I'd pick one up for sure but I think my wife would have some words with me. LOL When I can pick one up for half that, I *might* have a fighting chance since I tend to upgrade my desktop very rarely and it's probably overdue (current one is a 3770K for instance). |
[QUOTE=Madpoo;440912]I don't think so... I haven't asked, but if I had to guess, it's not the type of setup that would lend itself to something like that.
Hopefully before too long, one of the regular "enthusiasts" on here will get their hands on one and could make that work. If I had $5K burning a hole in my pocket I'd pick one up for sure but I think my wife would have some words with me. LOL When I can pick one up for half that, I *might* have a fighting chance since I tend to upgrade my desktop very rarely and it's probably overdue (current one is a 3770K for instance).[/QUOTE] If you are talking about rare upgrades, mine is a Q6600. |
[QUOTE=Madpoo;440912]
Hopefully before too long, one of the regular "enthusiasts" on here will get their hands on one and could make that work. [/QUOTE] I did the DAP sign up - we will see. If selected I can likely get one in the R&D budget and open some guest account access. |
[QUOTE=ewmayer;440894]
I filled out the online form for the Xeon Phi developer access program at the link I posted, hope to hear back on Monday (i.e. next business day). But, that sounds like one is getting in line for the opportunity to *buy* an early-release system, and I can't justify shelling out $5000 (roughly whee I expect the proce range to start) - I signed up mainly in hopes of getting info on the price range of a complete system with-build-software-installed.[/QUOTE] It sure looks like the Colfax dap systems come complete with a full Intel Parallel Studio XE suite of software already configured and part of the price quoted. Looks like 5k dollars for a system. I'd bet they would get wiggy if they found out a whole bunch of people were using it rather than the one person who bought it due to software license wording. |
[QUOTE=airsquirrels;440867]how active is the interest/available is George et al.'s time to get prime95 using those avx-512 opcodes?.[/QUOTE]
I dread doing this development:

1) MASM does not support AVX-512 opcodes. This means I need to learn NASM macros (and rewrite many of my current MASM macros), or I need to learn Intel intrinsics, assuming I can get down to register-level programming in C. Merging the 2 environments into one library will be messy; rewriting all the existing MASM code would be even worse.

2) AVX2 code took a year to develop. I've not been in a coding mood recently.

3) All KNL code will have to be re-optimized when AVX-512 comes to the desktop. Let's face it, KNL is a niche product, so the KNL-optimized code would be little used. |
[QUOTE=tServo;440922]It sure looks like the Colfax dap systems come complete with a full Intel Parallel Studio XE suite of software already configured and part of the price quoted. Looks like 5k dollars for a system. I'd bet they would get wiggy if they found out a whole bunch of people were using it rather than the one person who bought it due to software license wording.[/QUOTE]
Once at least one of us signer-uppers gets price quotes, assuming our $5K estimate is not way off the mark on the low side, I suggest we try the crowdsourced approach I mentioned previously - see if we can get ~10 want-to-developers to share the cost, one of whom - preferably in a locale with cheap electricity - would physically host the system and provide accounts for the rest. George, your comment re. reoptimizing for later desktop releases - you expect that to be necessary even at the micro (instruction) level? My take is that thread-related optimizations would be the bigger-picture issue here. ===================== [b]Edit:[/b] Neglected to check my Spam folder 'til just now, the Colfax quote is in there: [i] System Configuration Base Platform : Colfax KNL Ninja Air Cooled Pedestal Developer Platform(15979) Memory Socket 1: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) Memory Socket 2: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) Memory Socket 3: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) Memory Socket 4: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) Memory Socket 5: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) Memory Socket 6: DIMM 16384mb 2133MHz Registered ECC DDR4(14007) SSD Drive 1: Intel DC S3510 Series 240gb SSD SATA 6.0Gb/s(15150) SSD Drive 2: A option only, no default(1419) Disk Drive 1: HGST 0F23005 7K6000 ISE 4000gb 7200rpm 128mb Cache SATA 6.0Gb/s(15110) Disk Drive 2: A option only, no default(1419) Operating System SW: CentOS 7.2(15904) Notes: Est. Cost As Configured $4,705.32[/i] Perhaps someone else could sign-up to get a quote on the liquid-cooled variant of the above basic platform. |
p.s.: I hope they aren't treating my signup as an ironclad commitment to buy - I first wanted to see pricing, and the sign-up form was the only way I saw to do that.
|
[QUOTE=ewmayer;440940]p.s.: I hope they aren't treating my signup as an ironclad commitment to buy - I first wanted to see pricing, and the sign-up form was the only way I saw to doing that.[/QUOTE]
You understand how contracts work, correct? |
[QUOTE=ewmayer;440928]George, your comment re. reoptimizing for later desktop releases - you expect that to be necessary even at the micro (instruction) level?[/QUOTE]
Probably. They are completely different architectures. They will probably have different instruction latencies and different throughputs (e.g. will each KNL core have 2 FMA units?). Will KNL require different striding through memory or different L1/L2 prefetching? If so, this too can affect the low-level macros. Then there are the next-level-up optimizations for cache sizes. I expect KNL has smaller caches, which might mean writing a 3-passes-over-memory scheme instead of the 2-passes-over-memory scheme used presently. |
[QUOTE=chalsall;440942]You understand how contracts work, correct?[/QUOTE]
I understand the difference between a "request for quote" and a "commit to purchase" - but Colfax's messaging could use a little help in that regard. Here is what their auto-reply of last night - resulting from my signup based on their no-prices-listed frontpage - said:

[i]Thank you for your order. Below is the information submitted by you towards the order of the Ninja Developer Platform for Intel® Xeon Phi™ Processor codenamed Knights Landing (KNL) at our website dap.xeonphi.com We will email you with schedule details and payment information in the next few weeks. If you have questions you can reach us at [email]dap@colfax-intl.com[/email][/i]

Just got a followup which makes it clear that that was in fact an RFQ, as I expected when filling out the online signup:

[i]Thank you for your interest in the Intel KNL developer access program. Your quote is attached. The current lead time is about ten days. Please call in at your convenience with your credit card details if you wish to proceed.[/i]

The detailed quote was basically the same info I posted above, plus CA state sales tax, an eye-watering 8.75%. Bottom line: Around $5K as I surmised, including all the sweet Intel compiler & tuning tools. If we can get 10 people together and 1 to play physical host, that's $500 each, roughly what I paid last year for my little Intel 2-core Broadwell NUC. Who's interested on those terms? |
Est. Cost As Configured $4,576.53
FWIW, liquid cooled without the 4TB spinny disk. I'm willing to participate. I can also offer the colo as needed. |
[QUOTE=airsquirrels;440951]Est. Cost As Configured $4,576.53
FWIW, liquid cooled without the 4TB spinny disk. I'm willing to participate. I can also offer the colo as needed.[/QUOTE] Thanks for offering to host the beast. For code-dev purposes I doubt we need the large HD - is the SSD for the liquid-cooled system the same 240 GB as for the air-cooled? With respect to the software, I followed up on the quote with a query about noncommercial multi-user licensing - here is the reply I got:

[i]The Platform includes a single user license for the cluster edition of Parallel Studio XE 2016 with latest updates included. We have free software license for Students & additional programs for researchers at [url]https://software.intel.com/en-us/qualify-for-free-software[/url] [url]https://software.intel.com/en-us/qualify-for-free-software/student[/url] Based on the institute/research type you may qualify for free/discounted software if you are wanting access beyond the single user license the system comes with.[/i] |
I'm also willing to participate.
|
[QUOTE=Prime95;440925]3) All KNL code will have to be re-optimized when AVX-512 comes to the desktop. Let's face it KNL is a niche product, so the KNL optimized code would be little used.[/QUOTE]
Maybe it depends. Sixteen-core machines are currently a niche product, but they're the underlying hardware for virtual machines in the cloud, for Amazon and Google and Microsoft. When those companies eventually get around to offering virtual machine instances on an architecture with AVX-512, I wonder whether the underlying hardware will be KNL. Or is KNL more of a specialty high-performance-computing architecture only? Is it possible that there could be three versions of AVX-512 to optimize mprime for? Desktop, the "general server" market, and KNL? Or just two? |
[QUOTE=ewmayer;440950]Bottom line: Around $5K as I surmised, including all the sweet Intel compiler&tuning tools. If we can get 10 people together and 1 to play physical host, that's $500 each, roughly what I paid last year for my little Intel 2-core Broadwell NUC. Who's interested on those terms?[/QUOTE]
How would the funds be collected? Perhaps a third-party site like GoFundMe would allow smaller contributions from a larger circle. However, maybe it's jumping the gun. If a KNL machine was acquired then that would really put George on the spot to start tackling an enormous amount of work right away. I know that you would use the machine for Mlucas and others would have their other uses for it, but I think the goal of most would-be funders would be to further mprime/Prime95 development. Maybe some other funding could be earmarked instead to finding an assembler expert on a RentACoder type site who could carry out some of the drudge work of converting MASM macros to NASM. Would that help at all? Just thinking out loud. It sounds like that sort of preliminary groundwork would need to be done in any case before the feasibility of a KNL version of mprime/Prime95 could even be contemplated. And much of it (i.e., the existing code base) would not require a KNL machine to do it. |
[QUOTE=GP2;440970]...I know that you would use the machine for Mlucas and others would have their other uses for it, but I think the goal of most would-be funders would be to further mprime/Prime95 development.
Maybe some other funding could be earmarked instead to finding an assembler expert on a RentACoder type site who could carry out some of the drudge work of converting MASM macros to NASM. Would that help at all?...[/QUOTE] I am no expert on the nuanced differences between MASM and NASM, though I do write NASM from time to time. With that said, if it really is limited to the syntactical differences suggested by [url]http://left404.com/2011/01/04/converting-x86-assembly-from-masm-to-nasm-3/[/url] then perhaps I could find the time. Looks like there are about 30288 non-comment lines of assembly. |
[QUOTE=Prime95;440925]I dread doing this development:
2) AVX2 code took a year to develop. I've not been in a coding mood recently. 3) Let's face it KNL is a niche product, so the KNL optimized code would be little used.[/QUOTE] If we do this, there should be no pressure on George to participate, implicit or explicit. He's right about it being a niche product. How many people are going to have these processors? If I decide to participate, it would be out of curiosity and something to put on a resume, perhaps. |
Is prime95 version controlled at all? Or is there just a static source download from George's latest work?
I think the goal would be to make sure Prime95 continues to be available 2-3 years out when AVX-512 in some fashion exists on consumer grade CPUs. If migrating from MASM->NASM is required to handle that development then it has to happen at some point. I would just be concerned about putting in the requisite work but ending up with an orphaned fork of Prime95. |
[QUOTE=airsquirrels;441005]I would just be concerned about putting in the requisite work but ending up with an orphaned fork of Prime95.[/QUOTE]
You raise an important (but possibly uncomfortable) question: Is GIMPS reliant upon George? Moving Prime95/mprime development into a community space would make a lot of sense. And it could be done today. There is always the "secret sauce" code (read: security via obscurity) to take into consideration. There are many ways this could be managed without preventing a "fork" today. Absolutely no disrespect intended towards George in this message. I always have a "If I'm hit by a bus" document ready for automatic release for all my clients (and being hit by a bus is a surprisingly likely event here in Bim.) |
[QUOTE=tServo;441003]If we do this, there should be no pressure on George to participate; implicit or explicit. He's right about it being a niche product. How many people are going to have these processors?
If I decide to participate, it would be out of curiosity and something to put on a resume, perhaps.[/QUOTE] I may be confused on the particulars, but AVX-512 on Knights Landing (Xeon Phi x200) is the same as what will be on the Xeon Skylake processors (Xeon E5/E7 v5), correct? I wasn't aware of any differences in that regard. Of course the other big differences are the # of cores and the memory/caching architecture, but the thing about KNL was the x86 compatibility, so unless I'm missing something, I thought optimizations for AVX-512 on KNL would also apply to future AVX-512 implementations on Xeon and desktop CPUs. The best I could figure (and this was a while ago), future AVX-512 will have new features. The only incompatible 512-bit stuff was the jump from Knights Corner to Knights Landing... so yeah, the older Xeon Phi 71xx stuff wouldn't be worth coding for since it's a dead-end. |
[QUOTE=GP2;440970]How would the funds be collected? Perhaps a third-party site like GoFundMe would allow smaller contributions from a larger circle.[/quote]
Don't see why check and/or paypal shouldn't suffice. [quote]However, maybe it's jumping the gun. If a KNL machine was acquired then that would really put George on the spot to start tackling an enormous amount of work right away. I know that you would use the machine for Mlucas and others would have their other uses for it, but I think the goal of most would-be funders would be to further mprime/Prime95 development.[/QUOTE] That was not my intention, nor do I think George would take it that way. He's an adult, and I'm sure perfectly capable of deciding how his time is best spent. My intention was to set up a system for developers-not-necessarily-named-George with an interest in cutting-edge x86 vector/manycore programming to get up to speed on AVX512 and the manythread paradigm embodied by KL - which I believe is also the direction of future desktop CPUs, ever more cores. [QUOTE=Madpoo;441011]I may be confused on the particulars, but AVX-512 on Knights Landing (Xeon Phi x200) is the same as what will be on the Xeon Skylake processors (Xeon E5/E7 v5), correct? I wasn't aware of any differences in that regard. Of course the other big differences are the # of cores and the memory/caching architecture, but the thing about KNL was the x86 compatibility, so unless I'm missing something, I thought optimizations for AVX-512 on KNL would also apply to future AVX-512 implementations on Xeon and desktop CPUs. The best I could figure (and this was a while ago), future AVX-512 will have new features.[/QUOTE] The fact that AVX512, unlike previous Xeon-Phi "special" instruction sets, *is* a standard which will also carry over to future desktop-CPUs, is why I have waited 'til now to do any serious Xeon-Phi-oriented coding. 
Also, based on my reading of the first-gen AVX512 instruction set, while there will likely be enhancements in future updates, I doubt they will be anywhere near the level of, say, the AVX-to-AVX2 transition, in which Intel added FMA and rectified their stupidity of not including full-vector-width integer support. AVX512 has no obvious 'holes' like those. |
So AVX-512 will not be available in Skylake-E it seems, only in Skylake Xeon and then in Cannonlake, so until Cannonlake it will be a niche product, since I don't expect lots of people getting Skylake Xeons.
[url]https://en.wikipedia.org/wiki/AVX-512[/url] [url]http://wccftech.com/mainstream-intel-core-processors-support-avx-512-skylake-xeon/[/url]

[CODE]AVX-512 Subset       F    CDI  ERI  PFI  VL   BW   DQ   IFMA  VBMI
Knights Landing      Yes  Yes  Yes  Yes
Skylake Xeon (SKX)   Yes  Yes            Yes  Yes  Yes
Cannonlake           Yes  Yes            Yes  Yes  Yes  Yes   Yes

AVX-512 F    Foundation (F) - expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control; supported by Knights Landing and Skylake Xeon
AVX-512 CDI  Conflict Detection Instructions (CDI) - efficient conflict detection to allow more loops to be vectorized; supported by Knights Landing[1] and Skylake Xeon
AVX-512 ERI  Exponential and Reciprocal Instructions (ERI) - exponential and reciprocal operations designed to help implement transcendental operations; supported by Knights Landing[1]
AVX-512 PFI  Prefetch Instructions (PFI) - new prefetch capabilities; supported by Knights Landing[1]
AVX-512 VL   Vector Length Extensions (VL) - extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers[2]
AVX-512 BW   Byte and Word Instructions (BW) - extends AVX-512 to cover 8-bit and 16-bit integer operations[2]
AVX-512 DQ   Doubleword and Quadword Instructions (DQ) - adds new 32-bit and 64-bit AVX-512 instructions[2]
AVX-512 IFMA Integer Fused Multiply Add (IFMA) - fused multiply-add of integers using 52-bit precision
AVX-512 VBMI Vector Byte Manipulation Instructions (VBMI) - adds vector byte permutation instructions which were not present in AVX-512 BW[/CODE]

[QUOTE=Madpoo;441011]I may be confused on the particulars, but AVX-512 on Knights Landing (Xeon Phi x200) is the same as what will be on the Xeon Skylake processors (Xeon E5/E7 v5), correct?[/QUOTE] No, see above. Knights Landing will have ERI and PFI, which neither Skylake Xeon nor Cannonlake will have.
Skylake Xeon will add VL, BW and DQ, which Knights Landing does not have, and Cannonlake will further add IFMA and VBMI. I do not pretend to know what all these subsets mean, but there are clearly differences between them all, unfortunately. It would be simpler if they all had all the subsets. |
[QUOTE=airsquirrels;440978]I am no expert on the nuanced differences between MASM and NASM, though I do write NASM from time to time. With that said, if it really is limited to the syntactical differences suggested by [url]http://left404.com/2011/01/04/converting-x86-assembly-from-masm-to-nasm-3/[/url] then perhaps I could find the time. Looks like there are about 30288 non-comment lines of assembly.[/QUOTE]
I should not have tried to make such a quick estimate while at work. Make that more like 190441 lines of non-comment/non-blank assembly + macros... |
[QUOTE=ATH;441015]Knights Landing will have ERI and PFI which neither Xeon or Cannonlake will have.
Xeon will add VL, BW and DQ which Knights Landing does not have and Cannonlake will add IFMA and VBMI. I do not pretend to know what all these subsets mean, but there is clearly differences between them all unfortunately. It would be simpler if they all had all the subsets.[/QUOTE] The foundation instructions include more or less everything we want for FFT-based LL-testing, and for TF, too.

As to the 'extra' subsets supported by KL but not the later desktop CPUs: the enhanced prefetch stuff (PFI) for strided (scattered) data might be nice for some applications, but I don't see it as a dealbreaker by any stretch. Similarly for the Exponential and Reciprocal Instructions (ERI) - the Foundation instructions include the 512-bit vector versions of the 14-bit-accurate approximate exp/recip instructions AVX users expect; the 'extended' instructions provide 28-bit-accurate versions as well, which can save an iteration in the usual get-approximant/do-Newton-iteration-to-desired-precision application of such instructions. My code does make use of such instructions, but only in the context of some relatively infrequent data initializations.

I similarly see no major impact for the other subsets, the ones which later CPUs will have but which are missing in KL. When I first did a deep dive into the various 512-bit instruction subsets last year, I recall being very pleased by the relative completeness of the foundation set for the high-performance numerics I am interested in. |
[QUOTE=ewmayer;441031]The foundation instructions include more or less everything we want for FFT-based LL-testing and TF, too...[/QUOTE]
Well in that case, has anyone looked at HJWASM as a viable option? MASM compatible with support for AVX512-F.... |
[QUOTE=airsquirrels;441033]Well in that case, has anyone looked at HJWASM as a viable option? MASM compatible with support for AVX512-F....[/QUOTE]
I use GCC-syntax inline assembly in my work, so am unable to offer any information on this subject. So far, we have these folks in the source for our KL crowd: airsquirrels (david) ewmayer (ernst) ATH (andreas) Madpoo (aaron) David, could you clarify what you meant by "getting one on the R&D budget"? You mean your personal funds, or some corporate entity? (The distinction may matter for purposes of academic-style multiuser licensing.) |
[QUOTE=ewmayer;441040]I use GCC-syntax inline assembly in my work, so am unable to offer any information on this subject.
So far, we have these folks in the source for our KL crowd: airsquirrels (david) ewmayer ATH David, could you clarify what you meant by "getting one on the R&D budget"? You mean your personal funds, or some corporate entity? (The distinction may matter for purposes of academic-style multiuser licensing.)[/QUOTE] I was originally indicating that I could possibly get an entire hardware unit under my business for R&D, however if we are doing a pool I don't mind contributing personally. I would think we could get the license for the software under academic even if the hardware is donated by the business, however I am not a lawyer. |
[QUOTE=airsquirrels;441060]I was originally indicating that I could possibly get an entire hardware unit under my business for R&D, however if we are doing a pool I don't mind contributing personally.
I would think we could get the license for the software under academic even if the hardware is donated by the business, however I am not a lawyer.[/QUOTE] I can contribute the hardware purchase as well. For licensing, I'd imagine there's no problem separating the licensing part from the hardware itself... plenty of researchers run software under academic licensing on hardware that could be corporately owned (think of any of the cloud services...those are most definitely "for profit" hardware). Essentially this would be a "cloud" device limited to certain folks who may be using software licensed under some other plan (academic, whatever). |
I have started working on getting gwnum to assemble with HJWASM. I was able to get cpuidhlp.obj going without too much issue, which was a significant advancement over my attempts to port that code to NASM.
So far I only seem to have one big problem, and that's with the AVX instructions. George has all the XMM/YMM etc. values declared as QWORD PTR in extrn.mac, which MASM must be happy to just treat as memory pointers to the correct type regardless of whether, e.g., subsd (m64 operand) or subpd (m128 operand) is using them. HJWASM seems to want XMMWORD (OWORD), YWORD, ZWORD etc. explicitly in the PTR. Unfortunately George is quite clever and frequently reuses XMM_ variables as m64 when he only needs the bottom double, so I can't just change the type of XPTR. I will ask around on the HJWASM forums and see if there is a compatibility flag that will ease this issue; otherwise I have a mess of macros to update... |
[QUOTE=airsquirrels;441119]I have started working on getting gwnum to assemble with HJWASM. I was able to get cpuidhlp.obj going without too much issue, which was a significant advancement over my attempts to port that code to NASM.
So far I only seem to have one big problem and that's with the AVX instructions.[/QUOTE] I presume your work is aimed at getting a version of the Prime95 source buildable-by-ICC and thus amenable to tuning for KL using Intel's toolsuite, is that right? Because Aaron noted one user is already turning in results from Prime95 (or mprime?) running on a KL - would that imply current AVX/AVX2 binaries will run on KL without modification? p.s.: We need one more shared-system pool signer-upper to get the per-person cost under $1000 ... do we have a 5th hardy pioneer in the audience? |
[QUOTE=ewmayer;441122]I presume your work is aimed at getting a version of the Prime95 source buildable-by-ICC and thus amenable to tuning for KL using Intel's toolsuite, is that right?
Because Aaron noted one user is already turning in results from Prime95 (or mprime?) running on a KL - would that imply current AVX/AVX2 binaries will run on KL without modification?[/QUOTE] As far as I know nothing should prevent existing code from running on KL. Currently, AFAIK, AVX-512 is not supported by MASM, so doing any AVX-512 work will require an assembler supporting those instructions. That should not be necessary for thread/cache tuning on the KNL architecture, however. My understanding is that using ICC is mostly beneficial for C/C++ and intrinsics, where the compiler is doing the optimizations. Given that almost all of prime95's math is in assembly, I'm not sure how much use it will be. I admit I have not dug deeply into this and I'm not sure how powerful the profiling tools are at the assembly level to aid in tuning. |
[QUOTE=airsquirrels;441119]
So far I only seem to have one big problem and that's with the AVX instructions. George has all the XMM/YMM etc values as QWORD PTR in extrn.mac, which MASM must be happy to just treat as memory pointers to the correct type[/QUOTE] I have no problem updating the source code to be a little stricter regarding typing. HJWASM may turn out to be a good solution assuming it really is MASM compatible. |
[QUOTE=ewmayer;441122]I presume your work is aimed at getting a version of the Prime95 source buildable-by-ICC and thus amenable to tuning for KL using Intel's toolsuite, is that right?
Because Aaron noted one user is already turning in results from Prime95 (or mprime?) running on a KL - would that imply current AVX/AVX2 binaries will run on KL without modification?[/QUOTE] That's correct... that's due to the x86 compatibility baked into the Atom-derived cores, which have the full complement of AVX, SSE2, etc. support. Like I mentioned though, running that way it's basically no different than a bunch of slow (1.3 GHz, in this case) cores. I believe this particular user is running 16 workers out of the 64 cores. If I look even further back at the assignment history, there were some anonymous users earlier this year (as early as March) who started working assignments on a Knights Landing, but sadly never finished them. Must have been doing some benchmarking... in those cases they had 64 workers going, using all of the cores. I wonder how it was doing that way... their last check-in had 'em up to only 2-3% complete, and that was a month after being assigned. For exponents in the 44M range that might be what I'd expect for a 1.3 GHz processor... Now, if you had a single worker using 64 cores... whew... I'd like to see that. Using the 16GB of HBM available? Yeah... even without any code changes at all I'm guessing it would fly. |
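As a back-of-envelope check on those numbers (a rough sketch using only the ~2-3%-per-month figure from the check-in data described above; everything else is simple arithmetic, not project data):

```python
# Rough projection, assuming the observed rate holds: a worker that
# completes ~2.5% of an LL test on a 44M-range exponent per month
# would need about 1/0.025 = 40 months to finish one exponent.
progress_per_month = 0.025   # midpoint of the observed 2-3%/month
months_per_test = 1 / progress_per_month
workers = 64                 # one worker per core in that run
tests_per_year = workers * 12 / months_per_test   # fleet-wide throughput
print(months_per_test, tests_per_year)  # roughly 40 months each, ~19 tests/year
```

Which is glacial per exponent, though the aggregate throughput is less terrible than it first looks.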
[QUOTE=Prime95;441132]I have ho problem updating the source code to be a little stricter regarding typing.
HJWASM may turn out to be a good solution assuming it really is MASM compatible.[/QUOTE] Here is my post over at masm32 regarding the incompatibility. My rudimentary grepping shows about 1000 places to update m64 references to Q_XMM or similar pointers and about the same to update m128 references to XMMWORDs. [url]http://masm32.com/board/index.php?topic=5633.0[/url] Maybe they will add compatibility. Otherwise, if you have a preference for how you want that work done I don't mind doing it. |
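For the mechanical part of such an update, a first pass could look something like the sketch below (purely illustrative: the `XMM_` naming pattern is an assumption, and as noted above the genuine m64 bottom-double reuses would have to be excluded by hand or by per-instruction context, so a blanket rewrite like this is not safe on its own):

```python
import re

# Hypothetical first-pass rewrite: retype untyped QWORD PTR references
# to XMM_ variables as XMMWORD PTR at the m128 call sites. Real gwnum
# sources would need the m64 (bottom-double) uses filtered out first.
def retype_m128_refs(asm_line: str) -> str:
    return re.sub(r"QWORD PTR (XMM_\w+)", r"XMMWORD PTR \1", asm_line)

print(retype_m128_refs("subpd xmm0, QWORD PTR XMM_TWO"))
# subpd xmm0, XMMWORD PTR XMM_TWO
```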
[QUOTE=Madpoo;441142]...
Now, if you had a single worker using 64 cores... whew... I'd like to see that. Using the 16GB of HBM available? Yeah... even without any code changes at all I'm guessing it would fly.[/QUOTE] Well just as soon as the other few participants chip in we can find out! |
[QUOTE=airsquirrels;441119]I have started working on getting gwnum to assemble with HJWASM.[/QUOTE]
This is unrelated to the discussion here, but it occurs to me that it might be very interesting if the gwnum library was available as a C extension module for Python. In principle it shouldn't be too hard. Has this ever been attempted? |
[QUOTE=Madpoo;441142]That's correct... that's due to the x86 compatibility baked into the Atom cores and they have the full complement of AVX, SSE2, etc. support.
Like I mentioned though, running that way it's basically no different than a bunch of slow (1.3 GHz, in this case) cores. [/QUOTE] The benchmarks on the pre-production Xeon Phi 7290 indicate that with a scalar workload (i.e. regular x86 software benchmarks, multi-threaded but not vectorized), it's about 3 times faster than an Intel Xeon E5-2697 v4. While most of the folks in this thread seem to have the intent of taking advantage of AVX-512 and vectorized code (which would indeed yield the max throughput from the Xeon Phi), my personal interest is simply making use of all of the Phi's 256 hardware threads in a straightforward way, with conventional multi-threaded Java. |
[QUOTE=airsquirrels;441174]Well just as soon as the other few participants chip in we can find out![/QUOTE]
David, How many participants do you reckon is needed? |
[QUOTE=tServo;441330]David,
How many participants do you reckon is needed?[/QUOTE] At last tally, we had the following people in for $500:
airsquirrels (david)
ewmayer (ernst)
ATH (andreas)
Madpoo (aaron)
If we could get another 4 ($4000 total) I can swing the difference. I also could throw up a KickStarter/IndieGogo, or whichever platform people prefer, and we could see if a broader group wants to help advance the state of prime95, Mersenne location, etc. KL is niche now, but getting a head start on AVX-512 ultimately will speed the whole project. |
[QUOTE=airsquirrels;441336]I also could throw up a KickStarter/IndieGogo, or whichever platform people prefer, and we could see if a broader group wants to help advance the state of prime95, Mersenne location, etc. KL is niche now, but getting a head start on AVX-512 ultimately will speed the whole project.[/QUOTE]
While I have not used it, I think GoFundMe would be a better fit than KickStarter/IndieGogo. Kickstarter and IndieGogo tend to be for things like crowdfunding large-scale projects and products, for instance innovative tech gear, indie films, music albums, charitable causes, cultural projects, etc. There are often hundreds of contributors and the overwhelming majority do not know the project runners personally, or know about them beforehand. On the other hand GoFundMe is for personal causes, usually financed by friends and family and acquaintances, and only occasionally some sympathetic stranger. For instance, funding a school trip, helping a bereaved family, paying for medical treatment. One small drawback is that sites like this collect a fee, usually 5%. Ernst mentioned PayPal or check, but not everyone trusts PayPal anymore, and not everyone has a supply of paper checks anymore, not to mention this isn't an option for anyone out of the US (cashing checks from other countries is difficult and costly and mostly impractical). PS, Right now the contributors include a small circle of developers who have their own projects that they want to try out, so they are motivated to move forward right away independently of Prime95. But many of us are basically solely interested in Prime95, and it does seem premature at least until the assembly language issues are determined to have been fully resolved and intentions have been clarified. If fundraising mentions Prime95, it creates expectations that development on it is ready to move forward at the present time, and it's just not clear that that's the case. |
Are there any alternative solutions to consider?
Maybe get access to the Intel tools and use their emulation software for development until Purley comes along in 2017? Or we might only need the emulation software if gcc / HJWASM generates AVX-512 code. |
[QUOTE=Prime95;441351]Are there any alternative solutions to consider?
Maybe get access to the Intel tools and use their emulation software for development until Purley comes along in 2017? Or we might only need the emulation software if gcc / HJWASM generates AVX-512 code.[/QUOTE] It is true that just the AVX-512 work could be roughed in with a viable emulation tool and assembler; however, I personally avoid developing performance software (Android, iOS, etc.) on emulators or simulators if at all possible. I agree it may be misleading to mention Prime95 development. What will actually happen, at least initially, is much more likely to be Prime95 benchmarking, performance tuning, and other research into the effects of HBM, high core counts, etc. I am sure some of us also just want a chance to play with the bleeding edge of Intel's tech. It is worth saying that while the bulk of the Prime95 work comes from the army of run-of-the-mill machines, there seem to be quite a few high-throughput users such as Madpoo, myself, etc. who are using fairly recent server/enterprise-grade hardware and could reasonably get access to v5 Skylake Xeons yet this year. |
Ok, I set this up. If anyone has comments or wants anything changed let me know. I did mention mersenne.org/prime95 although hopefully not in a way that is misleading. I will also post this in a new thread if the folks here approve.
[url]https://www.gofundme.com/KNL4NumberTheory[/url] I also circulated this to a few good-willed people who may donate just to help. As to credentials - if anyone here does not know or trust me to handle this for some reason, PM me and I'll try to set your mind at ease. Otherwise I'm happy to let someone else orchestrate. Finally - if anyone here donating wants to arrange another, less fee-filled way to fund this, let me know. |
[QUOTE=airsquirrels;441364]As to credentials - if anyone here does not know or trust me to handle this for some reason PM me and I'll try to set your mind at ease.[/QUOTE]
Strangely enough, even trusted people sometimes find their actions attacked. I would like to give you a +1 as the leader of this effort. |
[QUOTE=airsquirrels;441364]
As to credentials - if anyone here does not know or trust me to handle this for some reason PM me and I'll try to set your mind at ease. Otherwise I'm happy to let someone else orchestrate.[/QUOTE] I applaud your efforts and am considering my level of commitment (that has nothing to do with your trust, just with my schedule, etc.). Two questions: Have you considered the water-cooled system? I know it's more expensive, but considering that most folks' goal is to peg this thing to the max for hours, it may be worth it. I have slowly come around to LaurV's way of thinking about system cooling. Since I moved to central Illinois years ago, it seems the climate has changed from "midwestern corn belt" to "tropical rain forest." I simply can't run many machines during the summer anymore. I know this is early, but what would be the logistics for actually using this system wrt distributing the available time? It would probably have to be single-user at a time, since the Phi cannot reasonably be shared. I'm not trying to put anyone on the spot here, but I'm just curious and don't want my expectations to exceed reality. I think a dialog on this topic would be healthy. |
[QUOTE=Prime95;441351]Are there any alternative solutions to consider?
Maybe get access to the Intel tools and use their emulation software for development until Purley comes along in 2017? Or we might only need the emulation software if gcc / HJWASM generates AVX-512 code.[/QUOTE] I've been following this thread for quite a while now and I see that it's looking more interesting. Intel's emulation tools are actually quite good for correctness testing. I've been using them to test a lot of AVX-512 intrinsic code that I've accumulated over the past 3 years since Intel announced AVX-512. The catch, of course, is the overhead of the emulation: about 100-1000x. So while you won't be doing any performance testing through the emulator, it's sufficient to run all your unit tests. If you assume standard desktop CPU models for Skylake Purley, I'm confident it's possible to write code that will be fairly close to optimal without actually having the hardware. Then when the hardware does come out, you can fine-tune it. But I can't say the same about Knights Landing. Based on the recently released literature, KNL's execution core is so drastically different from the usual desktop core that it will be difficult to write optimal code for it without the hardware. For one, KNL's OOE reorder window is significantly smaller than the desktop chips'. So the old trick of relying on the CPU's OOE to parallelize across loop iterations with long dependency chains probably won't work that well. Not to mention that the FMA latency is 6 cycles as opposed to only 5/4 on Haswell/Skylake. This is a problem I run into even on Haswell. I have loops where the dependency chain is too long even for Haswell to sufficiently parallelize across iterations, but 16 registers is not enough to unroll it so it doesn't need to reorder as much. And that's where HT bails me out. So for KNL, expect to really work all 32 of those registers and the 4-wide HT. Secondly, KNL has two VPUs for 2 FMAs/cycle throughput. But instruction decoding and dispatch are also only 2-wide. 
So if I'm interpreting the literature correctly, there will not be any "free" issue slots that can be used for loop counters and prefetching. So we might be entering the world of massive amounts of loop-unrolling. Back in 2014 when I was analyzing ICC's code generation for KNL, I noticed that it really liked to do redundant loads. For example, an untwiddled radix-2 butterfly might look like this in desktop code: [CODE]vmovapd zmm0, ZMMWORD PTR [rax]
vmovapd zmm1, ZMMWORD PTR [rbx]
vaddpd  zmm2, zmm0, zmm1
vsubpd  zmm1, zmm0, zmm1[/CODE]But ICC likes to generate this for KNL instead: [CODE]vmovapd zmm0, ZMMWORD PTR [rax]
vaddpd  zmm2, zmm0, ZMMWORD PTR [rbx]
vsubpd  zmm1, zmm0, ZMMWORD PTR [rbx][/CODE]I've been wondering for a couple of years why it would do that. Now it's almost obvious: it's one less instruction, and there are no "free" issue slots on KNL. It almost makes me wonder if it's worth using a gather-prefetch to simultaneously fetch 16 cache lines in one instruction as opposed to 16 normal prefetches. |
[QUOTE=airsquirrels;441364]
[url]https://www.gofundme.com/KNL4NumberTheory[/url][/QUOTE] One other advantage of GoFundMe is that you receive all donations even if the goal in dollars was not reached, whereas KickStarter/Indiegogo cancel and refund in that case. [QUOTE=airsquirrels;441364]As to credentials - if anyone here does not know or trust me to handle this for some reason[/QUOTE] Not an issue. [QUOTE=airsquirrels;441364]I will also post this in a new thread if the folks here approve.[/QUOTE] Since you've gone ahead with it, might as well make a sticky thread. |
[QUOTE=Prime95;441351]Maybe get access to the Intel tools and use their emulation software for development until Purley comes along in 2017? Or we might only need the emulation software if gcc / HJWASM generates AVX-512 code.[/QUOTE]
As you know, the problem there is that emulation is fine as far as code correctness, but tells you nada about performance, nor about real-world-run stability, a nontrivial issue when it comes to complex, demanding multithreaded applications. (As your own code showed not too long ago with early-release Skylake systems.) On the theme of possibly cutting the dev-system cost: Do we really need that big honking 96GB blob of DDR4 memory? I don't envision doing anything that would need more than 4-5 GB of RAM, which would fit easily in the 16 Gig of fast MCDRAM. What say my fellow would-be KL developers? I mean that DDR4 has gotta run at least $10 a gig, so running without it, or with some lesser 'system minimum' amount (if that is in fact required) could save a nice chunk of money. Similarly, perhaps look at an air-cooled system with just the SSD, i.e. sans the 4TB HD included in Colfax's standard offerings. David, thanks for setting up the GFM page ... I feel we're pretty close, some combination of cost-cutting, 2-3 more interested-developers and a whiff of public charity via GFM should allow us to pull the trigger on this. |
[QUOTE=tServo;441369]Have you considered the water cooled system?
... I know this is early, but what would be the logistics for actually using this system wrt distributing the available time?... I think a dialog on this topic would be healthy.[/QUOTE] I am most likely to purchase the water-cooled version and make up the difference myself if needed. I run a lot of systems 24/7 and hands down have much better reliability and easier temperature management, even in cooled racks, using water cooling. With regards to time sharing, I imagined we would have a thread here to informally track reservations. My only request would be that we set it up so that when the system is not actively in use for development or test runs, we run mprime or similar on it. This could even be automated to run anytime there were no active `screen` sessions or ttys. [QUOTE=GP2;441371]Since you've gone ahead with it, might as well make a sticky thread.[/QUOTE] I'm not sure what needs to be done to accomplish that here. With regards to RAM, no, we probably do not need 96GB; however, we definitely want all 6 channels populated, so we would be at 24GB (~$200 retail) or 48GB (about $50 more). Others may want the memory depending on workload, so I will wait and see. |
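The no-active-sessions automation could be sketched as a small watchdog decision like the one below (illustrative only; in practice the two counts would come from `who | wc -l` and parsing `screen -ls`, and something would then start or stop mprime accordingly):

```python
# Hypothetical idle gate: mprime should run only when nobody is logged
# in and no screen sessions exist. The counts are assumed to be fed in
# from `who` and `screen -ls` by a cron job or similar.
def should_run_mprime(active_logins: int, screen_sessions: int) -> bool:
    return active_logins == 0 and screen_sessions == 0

print(should_run_mprime(0, 0))  # True
print(should_run_mprime(1, 0))  # False
```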
[QUOTE=airsquirrels;441374]I am most likely to purchase the water cooled version and make up the difference myself if needed. I run a lot of systems 24/7 and hands down have much better reliability and easier temperature management, even in cooled racks, using water cooling.[/quote]
Off topic, but is it worth the expense? I've always seen water cooling as more maintenance-heavy than dusting and not as cost-effective as buying more hardware when it comes to performance. [quote] With regards to time sharing, I imagined we would have a thread here to informally track reservations. My only request would be that we setup so that when the system is not actively in use for development or test runs we run mprime or similar on it. This could even be automated to run anytime there were no active `screen` sessions or ttys. [/quote] Why not just use MaxLoad= and PauseTime= in prime.txt? I wish I had an actual use for this machine; I would participate if I did. |
[QUOTE=ewmayer;441373]On the theme of possibly cutting the dev-system cost: Do we really need that big honking 96GB blob of DDR4 memory? I don't envision doing anything that would need more than 4-5 GB of RAM, which would fit easily in the 16 Gig of fast MCDRAM. What say my fellow would-be KL developers? I mean that DDR4 has gotta run at least $10 a gig, so running without it, or with some lesser 'system minimum' amount (if that is in fact required) could save a nice chunk of money. Similarly, perhaps look at an air-cooled system with just the SSD, i.e. sans the 4TB HD included in Colfax's standard offerings.[/QUOTE]
We can go with 24 or 48 GB as long as we run 6-channel RAM. |
[QUOTE=Mark Rose;441380]Off topic, but it's worth the expense? I've always seen water cooling as more maintenance heavy that dusting and not as cost effective as buying more hardware when it comes to performance.
Why not just use MaxLoad= and PauseTime= in prime.txt? I wish I had an actual use for this machine; I would participate if I had.[/QUOTE] With very performance-centric code, as a developer I would want to know that my cache wasn't being eaten or cycles consumed by anything else, even if it was getting out of the way. Either way, I am sure that is a solvable problem. I don't believe the water-cooling unit was much more expensive based on the other quote I saw. I just hate CPU downclocking due to temps. |
[QUOTE=GP2;441350]While I have not used it, I think GoFundMe would be a better fit than KickStarter/IndieGogo.[/QUOTE]
Agreed... the others are more for people interested in funding a new product that they could then purchase. GoFundMe is for whatever... [QUOTE]Ernst mentioned PayPal or check, but not everyone trusts PayPal anymore, and not everyone has a supply of paper checks anymore, not to mention this isn't an option for anyone out of the US (cashing checks from other countries is difficult and costly and mostly impractical). PS, Right now the contributors include a small circle of developers who have their own projects that they want to try out, so they are motivated to move forward right away independently of Prime95. But many of us are basically solely interested in Prime95, and it does seem premature at least until the assembly language issues are determined to have been fully resolved and intentions have been clarified. If fundraising mentions Prime95, it creates expectations that development on it is ready to move forward at the present time, and it's just not clear that that's the case.[/QUOTE] One thing's for sure though...without access to a Xeon Phi x200 system, no real development could be done. I think there are simulators or whatever that would let you build and test but you'd need some real hardware to run it on for full verification. But you bring up a good point that maybe it'd be better to do some of the groundwork first like making whatever changes are needed so it would actually build using whatever tools and working out any complications ahead of time. By then there could be some competition from 3rd party vendors getting their own KNL systems out there (like the SuperMicro setups they've announced). Could be even cheaper by that point... maybe. The biggest cost of a system like that is still going to be the CPU itself though. |
[QUOTE=Madpoo;441410]...
One thing's for sure though...without access to a Xeon Phi x200 system, no real development could be done. I think there are simulators or whatever that would let you build and test but you'd need some real hardware to run it on for full verification. But you bring up a good point that maybe it'd be better to do some of the groundwork first like making whatever changes are needed so it would actually build using whatever tools and working out any complications ahead of time. By then there could be some competition from 3rd party vendors getting their own KNL systems out there (like the SuperMicro setups they've announced). Could be even cheaper by that point... maybe. The biggest cost of a system like that is still going to be the CPU itself though.[/QUOTE] I know the LAG Xeon systems I have built currently cost about the same as the KNL system just barebones without RAM or CPU. While it is true the hardware might eventually become commodity, time is also money. A less capable e5-26xx v2 system two generations old still costs significantly more. |
[QUOTE=airsquirrels;441374]With regards to time sharing, I imagined we would have a thread here to informally track reservations. My only request would be that we setup so that when the system is not actively in use for development or test runs we run mprime or similar on it. This could even be automated to run anytime there were no active `screen` sessions or ttys.
... With regards to RAM, no we probably do not need 96GB, however we definitely want all 6 channels propagated so we would be at 24(~$200 retail) or 48GB (about $50 more). Others may want the memory depending on work load, so I will wait to see.[/QUOTE] For resource sharing, what's the thought on running VMware on the bare metal and setting up a few virtual machines running whatever Linux flavor, and some Windows as well (to allow for mprime and Prime95 testing/development)? Perhaps a couple of each, each one allocated 16 processors? I'm less familiar with VMware than I am with Hyper-V, so I'm not entirely sure how VMware sets machine/core affinity... for purposes of performance it'd be important to NOT allow the hypervisor to shuffle which core is being mapped to a virtual machine. I'm pretty sure it won't do that unless you've over-allocated your cores among all your guests, and if total CPU usage is being maxed out. Anyway, an arrangement like that would make more memory a better option... at least 48GB, I'd guess (6 x 8GB modules). |
KVM would also work and it's free. Docker could also be used since it shares the kernel.
Virtualization will make performance tuning a lot more difficult, since it would be impossible to control what else is running and consuming memory bandwidth for instance. |
[QUOTE=Madpoo;441477]For resource sharing, what's the thought in running VMWare on the bare metal and setting up a few virtual machines running whatever Linux flavor, and some Windows as well (to allow for mprime and Prime95 testing/development).
.[/QUOTE] I would be utterly astonished if VMware would run on one of these systems, or any other type of virtualization for that matter. Just my 2 cents. |
[QUOTE=tServo;441531]I would be utterly astonished if VMware would run on one of these systems,
or any other type of virtualization for that matter. Just my 2 cents.[/QUOTE] I believe I read that KNL does not support virtualization instructions. |
[QUOTE=airsquirrels;441532]I believe I read that KNL does not support virtualization instructions.[/QUOTE]
Hmm... that would be troubling... I did a quick Google and found this: [URL="http://colfaxresearch.com/knl-avx512/"]http://colfaxresearch.com/knl-avx512/[/URL] There's a spot on there where it lists the output of "cat /proc/cpuinfo" and "vmx" is listed (that link itself might prove useful since it talks about the 512-bit AVX) On the other hand though, I went to the official Intel spec page, and you're right, it says no virtualization support. Boo! [URL="http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core"]http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core[/URL] That's disappointing since it means I wouldn't be able to use any of these for work (virtual hosting). Bummer. |
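For what it's worth, the check behind that cpuinfo listing is just a scan of the `flags` lines for `vmx` (Intel VT-x). A small sketch, reading from a string here; on real hardware you'd point it at /proc/cpuinfo:

```python
# Report whether the Intel VT-x flag ("vmx") appears in cpuinfo-style
# text, as in the Colfax article's `cat /proc/cpuinfo` listing.
def has_vmx(cpuinfo_text: str) -> bool:
    return any(
        "vmx" in line.split()
        for line in cpuinfo_text.splitlines()
        if line.startswith("flags")
    )

sample = "flags\t\t: fpu vme sse2 vmx avx512f"
print(has_vmx(sample))  # True
```

On a live system: `has_vmx(open("/proc/cpuinfo").read())`. Given the conflicting sources above, that's the definitive test once someone actually has a KNL box.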
[QUOTE=Madpoo;441542]On the other hand though, I went to the official Intel spec page, and you're right, it says no virtualization support.[/QUOTE]
So it sounds like this chip is specifically for the HPC niche market, and some other processor will be intended for servers. Few of us will actually use one, in home computing farms or cloud computing. |
[QUOTE=GP2;441546]So it sounds like this chip is specifically for the HPC niche market, and some other processor will be intended for servers. Few of us will actually use one, in home computing farms or cloud computing.[/QUOTE]
Which is why the fact that the instruction set will also carry forward into future, wider releases, including the PC market - unlike previous Intel 'Xeon Phi specials', which proved evolutionary dead ends in this regard - was an absolute must for justifying raising money for early adoption of such a system. As the funding page states, this system is aimed at the HPC-for-number-theory code-development folks. There are probably fewer than a dozen such who regularly post here, but that is the inherent nature of DC projects like GIMPS. |
Thanks to an incredibly attentive development team and a very small additional patch, I was able to get the latest version of HJWASM to assemble gwnum.a for linux successfully.
I am still working out some nuances in the rest of the packaging/build process for prime95 on my linux dev box but things do seem to be moving forward. There is now a way to build prime95 with support for AVX-512 instructions. |
[QUOTE=airsquirrels;441562]Thanks to an incredibly attentive development team and a very small additional patch, I was able to get the latest version of HJWASM to assemble gwnum.a for linux successfully.
I am still working out some nuances in the rest of the packaging/build process for prime95 on my linux dev box but things do seem to be moving forward. There is now a way to build prime95 with support for AVX-512 instructions.[/QUOTE] If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff of course...just to make sure it works equally as well) I'd suggest trying it out by doing triple checks of some small exponents... why not. :smile: I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past. I can generate a list of suitable worktodo lines for you if you'd like. Note that a custom build won't have the checksum or security code or whatever it is that an official "George" build would have, so it wouldn't be accepted by the server. |
[QUOTE=Madpoo;441664]If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff of course...just to make sure it works equally as well) I'd suggest trying it out by doing triple checks of some small exponents... why not. :smile:[/quote]
It causes spikes in the [url=http://www.mersenne.org/primenet/graphs.php]graphs[/url]. [quote]I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past. I can generate a list of suitable worktodo lines for you if you'd like.[/QUOTE] They're done up to 2.06M, actually. I usually run a few hundred to test new hardware. |
[QUOTE=Madpoo;441664]If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff of course...just to make sure it works equally as well) I'd suggest trying it out by doing triple checks of some small exponents... why not. :smile:
I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past. I can generate a list of suitable worktodo lines for you if you'd like. Note that a custom build won't have the checksum or security code or whatever it is that an official "George" build would have, so it wouldn't be accepted by the server.[/QUOTE] That's not a bad thought, easy enough to validate things. I was aware of the security code issue. I suppose if I was malicious I could just make that work... I don't expect to submit anything to primenet unless it is promoted up to a working build that has been through all the paces. Right now my goal was just to make things buildable on a KNL system so that I, or George, or whoever could poke at AVX-512. |
preliminary KNL analysis
Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 RAM. FFT data will easily fit in 16GB, so the good news is we should be running entirely out of HBM memory at all times. Comparing KNL to a 4-core Skylake with 2-channel DDR4 RAM, the KNL system will have 4x (HBM vs DDR4) times 1.5x (3-channel vs 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x the memory bandwidth feeding 16x the number of cores!
A Skylake system is already hurting for memory bandwidth; the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization. I do not have any good ideas on reducing memory bandwidth requirements any further. The only option may be to run 64 cores of TF hyperthreaded alongside a 64-core FFT. |
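Spelling out that ratio arithmetic (using the post's own figures; the 3-channel count is questioned just below, so treat this as illustrative):

```python
# KNL vs. a 4-core, 2-channel-DDR4 Skylake, per the estimates above.
hbm_vs_ddr4 = 4.0          # HBM ~4x the DDR4 bandwidth (Intel's figure)
channel_ratio = 3 / 2      # 3-channel vs 2-channel DDR4
bandwidth_ratio = hbm_vs_ddr4 * channel_ratio   # 6.0x
core_ratio = 64 / 4                             # 16.0x
print(bandwidth_ratio / core_ratio)  # 0.375, in the ballpark of the ~33% quoted
```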
I believe KL is 6 channel, not 3 channel.
The cores are based on Atom (Silvermont) and are clocked slower, but have 4 hyperthreads each. The chip we're getting probably runs at 1.3 GHz. The CPU we're getting probably only has 64 cores enabled. Edit: Each core will have double the FP throughput. Apparently the onboard 16 GB can deliver over 400 GB/s versus the 42 GB/s a Skylake with 2-channel DDR4-3200 gets. If we consider a 4 GHz Skylake, we have roughly 10.4 times the CPU, but we'll have about 10 times the bandwidth. So it might not be so bad. Edit: Updated to reflect comments from ldesnogu |
Skylake can do 16 DP FLOPs/cycle. So 4 cores at 4 GHz will give 256 GFLOPs/s.
KNL is 32 DP FLOPs/cycle. So 64 cores at 1.3 GHz will give 2662 GFLOPs/s. That's more than 10 times a 4-core Skylake, not 5. |
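Written out, that peak-throughput comparison is:

```python
# Peak double-precision throughput: DP FLOPs/cycle * cores * GHz.
skylake_gflops = 16 * 4 * 4.0   # 4-core Skylake at 4 GHz -> 256 GFLOP/s
knl_gflops = 32 * 64 * 1.3      # 64-core KNL at 1.3 GHz -> ~2662 GFLOP/s
print(knl_gflops / skylake_gflops)  # ~10.4x
```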
[QUOTE=Prime95;441759]Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 ram. FFT data will easily fit in 16GB, so the good news is we should be running out of HBM memory at all times. Compare KNL to a 4-core Skylake with 2-channel DDR4 ram, the KNL system will have 4x (HBM vs DDR4) times 1.5x (3-channel vs 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x memory bandwidth feeding 16x number of cores!
A Skylake system is hurting on memory bandwidth, the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization. I do not have any good ideas on reducing memory bandwidth requirements any further. The only option may be to run 64 cores of TF hyperthreaded alongside a 64 core FFT.[/QUOTE] 16x the cores, which will be a fair bit slower per core, at least before AVX-512. Is there any point in developing AVX-512 code while the current systems are so memory bound? Can we actually expect any improvement? Didn't you mention a while back that it may be possible to reduce memory bandwidth by using integer FFTs rather than floating point? My memory of that may be inaccurate. |
[QUOTE=Prime95;441759]Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 ram. FFT data will easily fit in 16GB, so the good news is we should be running out of HBM memory at all times. Compare KNL to a 4-core Skylake with 2-channel DDR4 ram, the KNL system will have 4x (HBM vs DDR4) times 1.5x (3-channel vs 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x memory bandwidth feeding 16x number of cores!
A Skylake system is hurting on memory bandwidth, the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization. I do not have any good ideas on reducing memory bandwidth requirements any further. The only option may be to run 64 cores of TF hyperthreaded alongside a 64 core FFT.[/QUOTE] You may well be right, but I prefer to remain optimistic until cruel reality smacks me upside the head. :) Two added points to consider: 1. We have 32 SIMD registers to work with, which should mean somewhat reduced memory traffic when properly used; 2. There are just 2 points in each FFT-mul where the various processor threads need to share data, i.e. go back to main memory. IIRC KNL cores are paired, with each pair sharing a 2MB L2 cache. If we have one core of each such pair doing low-bandwidth TF work, each LL-test thread gets 2 MB of L2. Moreover, these multiple L2s can communicate directly with each other, or via main memory - from the Colfax "Clustering Modes in Knights Landing Processors" whitepaper: [i] In KNL (see Figure 1, bottom), each of its ≤ 72 cores has an L1 cache, pairs of cores are organized into tiles with a slice of the L2 cache symmetrically shared between the two cores, and the L2 caches are connected to each other with a mesh. All caches are kept coherent by the mesh with the MESIF protocol (this is an acronym for Modified/Exclusive/Shared/Invalid/Forward states of cache lines). In the mesh, each vertical and horizontal link is a bidirectional ring. [/i] The whitepaper alas does not reveal the L2-to-CPU or L2-mesh bandwidths. |
[QUOTE=Mark Rose;441764]The chip we're getting probably runs at 1.3 GHz. The CPU we're getting probably only has 64 cores enabled.[/QUOTE]
Ah, you are correct. I did not factor that in. So we have 16x the cores, but running at half the speed (compared to my standard-issue i5-6500s). That gives us somewhere in the neighborhood of 6x the memory bandwidth and 8x the FPU firepower. Not good, but not as terrible as I estimated earlier. Edit: According to [url]http://www.hardwareunboxed.com/forum/viewtopic.php?t=1570[/url] and [url]http://www.asrock.com/news/index.asp?id=3043[/url] my Skylakes are getting just shy of 30 GB/s. Intel says HBM is 400 GB/s, so my 6x bandwidth estimate was off by a factor of 2 as well! Sorry for the false alarm. Those back-of-the-envelope calculations can be dangerous. If we can keep HBM memory fully busy, KNL may be a winner! |
That's why having actual hardware is going to be good. Newer hardware is complex enough that calculating the exact performance of the whole pipeline isn't even possible (it's done by statistical analysis), so we have to leave our armchairs to actually know.
Or someone could unearth a new FFT/multiplication algorithm with drastically lower memory requirements and change the whole game... |
[QUOTE=Mark Rose;441666]It causes spikes in the [url=http://www.mersenne.org/primenet/graphs.php]graphs[/url].
They're done up to 2.06M, actually. I usually run a few hundred to test new hardware.[/QUOTE] I tried to make sure my own (more recent) totally unnecessary triple-checks were excluded from the throughput graphs. The SQL query itself has a "where (user != madpoo or exponent > 3e6)" type of clause (a little more complicated than that, but you get the idea). But that only works for me. LOL I guess I could tell it to exclude any LL tests from anyone below XX size... In the case of any custom builds to test things out, the server wouldn't accept those results anyway. |
[QUOTE=airsquirrels;441780]Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game...[/QUOTE]
That would be the NTTs. They'll save you a factor of 2-3x in memory and bandwidth, but they're also at least 3x slower. I'm also unsure how well the IBDWT can be combined with an NTT. My gut feeling is that it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field. |
[QUOTE=Mysticial;441792]That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.
I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field.[/QUOTE] [url=http://www.mersenneforum.org/showthread.php?t=118]See here[/url] for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports 32x32 --> 64-bit integer multiply, M31 rather than M61 would be a more promising modulus in that context. Such a hybrid float64/int32 transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e. a halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid. |
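A rough sketch of the arithmetic behind that ~20% figure; the 0.55 length ratio below is my own illustrative stand-in for "slightly less than a doubling" of the per-digit size:

```python
# Rough arithmetic behind the "~20% bandwidth reduction" estimate for the
# hybrid float64/int32 transform. The length_ratio value is an assumption
# chosen to model "slightly less than halving" the transform length.

word_growth = 96 / 64    # each transform word grows from 64 to 96 bits: 1.5x
length_ratio = 0.55      # digits carry almost twice the bits, so the
                         # transform is a bit more than half as long

relative_bandwidth = word_growth * length_ratio
print(round(relative_bandwidth, 3))   # 0.825 -> roughly a 20% saving
```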
[QUOTE=Mysticial;441792]That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.
I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field.[/QUOTE] Another way would be to use (software-emulated) quad precision FP. It will improve the compute:memory ratio significantly. But still probably won't be a net win due to the software overhead :-( |
[QUOTE=ewmayer;441795][URL="http://www.mersenneforum.org/showthread.php?t=118"]See here[/URL] for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports 32x32 --> 64-bit integer multiply, M31 rather than M61 would be a more promising modulus in that context. A hybrid float64/int32 such transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e. halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid.[/QUOTE]
That is an interesting idea: doing both NTT+FFT and using the NTT to reconstruct the bottom (lost) parts of the coefficients. I was thinking more along the lines of going 100% NTT. The memory reduction should be a lot more than just 20%. For a double-precision FFT using 16 bits/point, the memory efficiency is 0.25. And for library writers who prefer not to rely on destructive cancellation, we're talking more like only 8 bits/point if we want it to work at 1 billion+ bits. That's 0.125. At the other extreme, the Schönhage-Strassen NTT gets you asymptotically close to 0.50. The multi-prime algorithms will get you somewhere in between: 9 primes gets you ~0.44, which is still a lot better than even the FFT with destructive cancellation. [QUOTE=axn;441796]Another way would be to use (software-emulated) quad precision FP. It will improve the compute:memory ratio significantly. But still probably won't be a net win due to the software overhead :-( [/QUOTE]That's actually not a terrible idea. If you use double-double arithmetic: [LIST][*]Addition is 8 word-sized additions.[*]Multiplication is 1 multiplication and 3 FMAs.[/LIST]Double-double gives 107 bits, which is probably large enough to place 40+ bits per point. IOW more than 2x over plain double precision. The cost is somewhere between 4-8x per operation. So computationally you're going up by around a factor of 3-4x over the standard DP implementation for maybe a 30-50% reduction in bandwidth? It doesn't look like a win at first, but it might be worth investigating. I'm sure there are corners that can be cut in an optimized butterfly with double-double arithmetic. |
[QUOTE=Prime95;441777]Those back-of-the-envelope calculations can be dangerous[/QUOTE]
As ldesnogu pointed out, I forgot one other factor of 2, this time not in our favor: a KNL core needs twice the bandwidth of a Skylake core, because AVX-512 is twice as wide as AVX-256. So, summarizing KNL vs. my Skylake system:

400 GB/s vs. 30 GB/s, or 13.33x more bandwidth in KNL
1.3 GHz vs. 2.5 GHz, or Skylake will need 1.92x more bandwidth
64 cores vs. 4 cores, or KNL will need 16x more bandwidth
AVX-512 vs. AVX-256, or KNL will need 2x the bandwidth

Net result is KNL should be a little more memory bound than a typical 4-core Skylake. |
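The four ratios above combine as follows (all inputs are the figures quoted in this thread, nothing measured by me):

```python
# Net bandwidth picture: supply is the HBM vs. measured DDR4 ratio; demand
# scales with core count and SIMD width, and proportionally with clock speed.

supply_ratio = 400 / 30                           # ~13.33x more bandwidth on KNL
demand_ratio = (64 / 4) * (512 / 256) * (1.3 / 2.5)
#              ^16x cores  ^2x AVX-512  ^KNL clock is 0.52x Skylake's

print(f"{demand_ratio / supply_ratio:.2f}")       # 1.25: KNL a bit more bandwidth-bound
```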
I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.
I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find it behaves more like a 128-core machine running at half the clock speed, due to the hardware forcing in-order threading. |
[QUOTE=xathor;442606]I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.
I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find out it behaves more like a 128 core machine running at half the clock speed due to the hardware forcing in order threading.[/QUOTE] One of the key advantages of KNL, though, is the AVX-512 support, and I'm pretty sure the developers are interested in getting their hands dirty with that. It's also an interesting opportunity to get the current codebases tuned better for multi-threading. There are certain challenges involved, for sure. At the very least, if a system like this could run 64 (or 128) simultaneous single-core workers using AVX-512, and the fast memory can keep all the pipes flowing, then it should be able to provide amazing throughput at a smaller price point than a similarly kitted-out Broadwell. Dual Broadwell systems aren't cheap, and they don't have AVX-512, 6-channel memory, or HBM... they do have faster clock speeds though. :smile: But even then, the top 22-core Broadwell only runs at 2.2 GHz with a turbo boost to 2.8 or something. So yeah, twice as fast as the 7210P, and 44 cores in a dual-CPU setup, but maybe 2-3 times the price. |
[QUOTE=Madpoo;442655]One of the key advantages in KNL though is the AVX-512 support, and I'm pretty sure the developers are interested to get their hands dirty with that.
It's also an interesting opportunity to get the current codebases tuned better for multi-threading. There are certain challenges involved for sure. At the very least, if a system like this could run 64 (or 128) simultaneous, single-core workers, using AVX-512, and the fast memory can keep all the pipes flowing, then it should be able to provide amazing throughput at a smaller price-point than a similarly kitted Broadwell. Dual Broadwell systems aren't cheap, and they don't have AVX-512, 6-channel memory, HBM... they do have faster clock speeds though. :smile: But even then, when you're looking at the top 22-core Broadwell, it's only running at 2.2 GHz with a turbo boost to 2.8 or something. So yeah, twice as fast as the 7210P, 44 cores on a dual CPU setup, but maybe 2-3 times the price.[/QUOTE] Testing on the KNL system has already shown that we can't really accelerate a single exponent, at least not with the current code. There is not a single chip or card out there that can beat this 7210 in terms of raw throughput, though. Even before taking advantage of AVX-512, I'm matching the performance of a dual 16-core Xeon v3 system that costs 4-5x as much as the Colfax KNL unit. |