mersenneforum.org  

Old 2016-09-19, 15:43   #177
Mysticial
 
Sep 2016

2×5×37 Posts

Quote:
Originally Posted by xathor View Post
Easy. We have done it... but not for this application. I have over 200 different software packages on my supercomputers and we selected the top 10 most-used ones. Each one got compiled, benchmarked, re-compiled, tweaked and re-benchmarked. The performance just wasn't there, especially considering the time that it took to recompile software. Any single-threaded application flat out won't run well on KNL. If it doesn't scale well, it won't run well on KNL.

Unless the supercomputer center only runs a few applications and they tweak them to death, I highly doubt they will pick KNL over a more traditional Xeon. I have colleagues at many non-DoD supercomputer centers and I will tell you that they have come to the same conclusion.

The DoD centers typically build out the best FLOP/$ and then expect people to develop their code to scale well on their systems, which is why you see them gobbling up KNL.

Go to Supercomputing 2016 and find out yourself.

Intel swore up and down they were going to release Knights Landing at SC2015 and here's the only thing that I found after pestering the shit out of HP. Here's an Apollo KNL blade.

KNL SC15

I have fists full of cash wanting KNL to be the best thing since sliced bread. I waited years to finally get my hands on one. I'm very interested in this community's success with this chipset, but for now my money is on Broadwell.

You can get 3 DPTF in a 2U out of 96 Broadwell cores and 1TB DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard pressed to find a non-DoD supercomputer center that would put their money in KNL at this time.
Just curious with a few questions:
  1. Was the bottleneck the compute? Or was it the memory access? Since you mentioned low clocks being a problem, I'm guessing it's the former.
  2. Was the "recompiling" really just recompiling? As in no manual optimizations? Compilers nowadays still suck at vectorizing scalar code. The Intel Compiler is ahead of everything else, but it's far from being competitive with any properly done hand-written intrinsic code - let alone assembly.

IOW, I doubt you can just throw old code at KNL with some tweaks and expect it to perform. I imagine most of the stuff would need to be redesigned from the bottom up. But of course the latter option is likely prohibitive in development costs.

Old 2016-09-19, 16:51   #178
xathor
 
Sep 2016

19 Posts

Quote:
Originally Posted by Mysticial View Post
...
That is exactly why you see DoD and those with deep pockets and large staff gobbling up KNL. If you have a handful of applications that you run and want it to perform as best as possible, you'll try out different architectures and use whatever works best.

Almost every other supercomputer center is going to shy away from KNL mainly for the overhead cost of optimizing the code to work on a very narrow range of specific hardware.

To answer your optimization question: my co-worker with a PhD in computer science, who has been optimizing and installing code on our systems for decades, spent several weeks on a couple of applications and was quite disappointed with the results.
Old 2016-09-19, 17:03   #179
ldesnogu
 
Jan 2008
France

2²×149 Posts

Quote:
Originally Posted by xathor View Post
...
How does CUDA fit in there? Is it worth supporting?

FWIW I strongly believe it's a dev/support hell to extract a lot of perf from KNL. But I also believe that a single project (such as mprime or mlucas) can get the most out of it (at a large cost in dev).
Old 2016-09-19, 17:12   #180
bsquared
 
"Ben"
Feb 2007

7×13×41 Posts

https://xkcd.com/1205/

Edit:
Is it not also a dev/support hell to extract a lot of perf from CUDA? I would argue the performance per unit effort is higher with KNL... but maybe I'm biased.

Old 2016-09-19, 17:26   #181
airsquirrels
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by ldesnogu View Post
...
Mprime benefits from having two decades of work put into optimizing over and over for the platform of the day, so from what I can tell it is built more like a bunch of high-performance blocks that can be reassembled to match the core/thread/cache/VPU configuration of the next best thing. Our work on KNL is more about identifying two of those blocks that need improvement (threading support for many, many cores, and AVX-512 support). The work will be done to add those new tools to the mprime toolbox, and they will be used over and over in future Intel and other architectures, albeit in slightly different configurations or arrangements.

Would I buy a facility full of just KNL chips? Not for a generic workload. If I owned a shipping company I also wouldn't buy an entire fleet of car-haulers and expect to take contracts for various types of shipping. If my business is shipping cars and I rarely use a flatbed or box trailer, I just might.
Old 2016-09-19, 17:50   #182
Mysticial
 
Sep 2016

370₁₀ Posts

Quote:
Originally Posted by xathor View Post
...
Ah I understand. KNL didn't turn out to be the free lunch they expected it to be.
Old 2016-09-19, 18:20   #183
ldesnogu
 
Jan 2008
France

2²×149 Posts

Quote:
Originally Posted by bsquared View Post
...
I think the price is the same. But that's just guessing.

I just never bought the Intel marketing that wants you to believe that since it's x86 and comes with excellent dev tools, it will be a piece of cake.

@airsquirrels: yes and no. Data arrangement, preloads, decode limitations, etc. heavily depend on the particular micro-architecture and core/caches/RAM topology, and this tuning usually requires a lot more dev time than using new instructions, and is not easily applicable to a radically different CPU (such as the upcoming Xeon).

But I agree with you, the work being done here is very interesting and will open the door for future AVX-512 implementations. I am just not (yet?) convinced this will prove good value for money/power. I only wish I had more time to play with that beast, it looks so sexy...
Old 2016-09-19, 19:45   #184
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

1101000011000₂ Posts

Quote:
Originally Posted by ldesnogu View Post
...
Yes, basically it is complicated, time consuming, and hard. The new instructions and basic non-crashability can be easily tested in emulators, but they don't help anyone to actually make the code perform well. That requires real hardware and lots of time testing with it.
Old 2016-09-19, 20:45   #185
xathor
 
Sep 2016

19 Posts

Quote:
Originally Posted by ldesnogu View Post
...
CUDA, absolutely. Especially with things like OpenCL and some nVidia tools that should be coming out soon-ish. GPUs are the bang for the dollar: the clear winner if your code can be ported to the GPU.

Titan's entire compute is five-million-something CUDA cores.

nVidia's aim is clear with things like NVLink.

I honestly wouldn't be surprised if KNL goes the way of Itanium.
Old 2016-09-19, 21:36   #186
ewmayer
2ω=0
 
Sep 2002
República de California

26753₈ Posts

Re. the "which was harder: CUDA or KNL-specific optimizations?" question, I refer the interested reader to Oliver Weihe's many-thousands-of-posts mfaktc dev-thread ... yeah, cuda-dev was really easy! /sarc

In my case, after a few months of work to ||ize and many-thread-capabilize (new word!) my TF code using broadly supported and standardized pthreading primitives, my very first TF build on KNL yields a result that is quite good compared to current best-of-breed code/gpu combos, and the prophets of doom are already crowing loudly. Heck, they may prove correct, but it is wildly premature to make such blanket prophecies, IMO.

=================

Re. the 250,000x performance hit I reported yesterday for first-look TF-using-AVX-vector-float-math, I think I know what's going on ... the added clue lay in going back and carefully examining the onscreen output of my dismally slow TF run of MM31 to 50 bits. That should have caught the smallest factor with k = 68745, but didn't.

Further comparative examination of the int64 and float-based modpow routines shows the former properly having multithread support - in terms of the one-time-init code section of same alloc'ing nthread times as much local memory as needed by a single thread and then pointing each user thread to its own chunk of that as it comes - but the float-based routines lack this. I.e. in || mode we have multiple threads reading-from/writing-to the same block of local memory. The reads are perhaps not so bad, but the write collisions would definitely seem to account for the symptomology in question, that is, massive slowdown and corrupted results.

Why didn't the same run fail in the self-tests which get done prior to the TFing? Because those are done 1-threaded.

So when I ||ized my factoring code last year it seems I only added || support for the int64-based modpow routines and left the floating-point ones for later. Later being now.

Old 2016-09-20, 01:58   #187
Madpoo
Serpentine Vermin Jar
 
Jul 2014

5·677 Posts

Quote:
Originally Posted by xathor View Post
...
I think the cool thing about Phi x200 and our recent procurement of a dev box is that it puts a many-core system in the hands of our talented developers (for perhaps the first time?), and at a reasonable price.

Even though a faster 4+ socket system in a 2U form factor could run circles around a single 1U KNL box, the price of a single such unit is far in excess of what any of us common folks could absorb.

For supercomputing needs, having a 4-socket system in a small space is great for power density, and when you're building an HPC system you often start out with your needs and then work backwards to the price. Sweet work if you can get it.
