mersenneforum.org Your help wanted - Let's buy GIMPS a KNL development system!

2016-09-19, 15:43   #177
Mysticial

Sep 2016

2×5×37 Posts

Quote:
 Originally Posted by xathor Easy. We have done it... but not for this application. I have over 200 different software packages on my supercomputers and we selected the top 10 used ones. Each one got compiled, benchmarked, re-compiled, tweaked and re-benchmarked. The performance just wasn't there, especially considering the time that it took to recompile software. Any single-threaded application flat out won't run well on KNL. If it doesn't scale well, it won't run well on KNL. Unless the supercomputer center only runs a few applications and they tweak them to death, I highly doubt they will pick KNL over a more traditional Xeon. I have colleagues at many non-DoD supercomputer centers and I will tell you that they have come to the same conclusion. The DoD centers typically build out the best FLOP/$ and then expect people to develop their code to scale well on their systems, which is why you see them gobbling up KNL. Go to Supercomputing 2016 and find out yourself. Intel swore up and down they were going to release Knights Landing at SC2015 and here's the only thing that I found after pestering the shit out of HP. Here's an Apollo KNL blade. KNL SC15 I have fists full of cash wanting KNL to be the best thing since sliced bread. I waited years to finally get my hands on one. I'm very interested in this community's success with this chipset, but for now my money is on Broadwell. You can get 3 DPTF in a 2U out of 96 Broadwell cores and 1TB DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard pressed to find a non-DoD supercomputer center that would put their money in KNL at this time.
Just curious with a few questions:
1. Was the bottleneck the compute? Or was it the memory access? Since you mentioned low clocks being a problem, I'm guessing it's the former.
2. Was the "recompiling" really just recompiling? As in no manual optimizations? Compilers nowadays still suck at vectorizing scalar code. The Intel Compiler is ahead of everything else, but it's far from being competitive with any properly done hand-written intrinsic code - let alone assembly.

IOW, I doubt you can just throw old code at KNL with some tweaks and expect it to perform. I imagine most of the stuff would need to be redesigned from the bottom up. But of course that option is likely prohibitive in development costs.

Last fiddled with by Mysticial on 2016-09-19 at 15:45

2016-09-19, 16:51   #178
xathor

Sep 2016

19 Posts

Quote:
 Originally Posted by Mysticial Just curious with a few questions: Was the bottleneck the compute? Or was it the memory access? Since you mentioned low clocks being a problem, I'm guessing it's the former. Was the "recompiling" really just recompiling? As in no manual optimizations? Compilers nowadays still suck at vectorizing scalar code. The Intel Compiler is ahead of everything else, but it's far from being competitive with any properly done hand-written intrinsic code - let alone assembly. IOW, I doubt you can just throw old code at KNL with some tweaks and expect it to perform. I imagine most of the stuff would need to be redesigned from the bottom up. But of course that option is likely prohibitive in development costs.
That is exactly why you see DoD and those with deep pockets and large staff gobbling up KNL. If you have a handful of applications that you run and want it to perform as best as possible, you'll try out different architectures and use whatever works best.

Almost every other supercomputer center is going to shy away from KNL mainly for the overhead cost of optimizing the code to work on a very narrow range of specific hardware.

To answer your optimization question: my co-worker, who has a PhD in computer science and has been optimizing and installing code on our systems for decades, spent several weeks on a couple of applications and was quite disappointed with the results.

2016-09-19, 17:03   #179
ldesnogu

Jan 2008
France

2²×149 Posts

Quote:
 Originally Posted by xathor That is exactly why you see DoD and those with deep pockets and large staff gobbling up KNL. If you have a handful of applications that you run and want it to perform as best as possible, you'll try out different architectures and use whatever works best. Almost every other supercomputer center is going to shy away from KNL mainly for the overhead cost of optimizing the code to work on a very narrow range of specific hardware. To answer your optimization question. My co-worker with a PhD in computer science who has been optimizing and installing code on our systems for decades spent several weeks on a couple applications and was quite disappointed with the results.
How does CUDA fit in there? Is it worth supporting?

FWIW I strongly believe it's a dev/support hell to extract a lot of perf from KNL. But I also believe that a single project (such as mprime or mlucas) can get the most out of it (at a large cost in dev).

2016-09-19, 17:12   #180
bsquared

"Ben"
Feb 2007

7×13×41 Posts

https://xkcd.com/1205/

Edit: Is it not also a dev/support hell to extract a lot of perf from CUDA? I would argue the performance per unit effort is higher with KNL... but maybe I'm biased.

Last fiddled with by bsquared on 2016-09-19 at 17:15
2016-09-19, 17:26   #181
airsquirrels

"David"
Jul 2015
Ohio

11·47 Posts

Quote:
 Originally Posted by ldesnogu How does CUDA fits there? Is it worth supporting it? FWIW I strongly believe it's a dev/support hell to extract a lot of perf from KNL. But I also believe that a single project (such as mprime or mlucas) can get the most of it (at a large cost in dev).
Mprime benefits from two decades of work optimizing over and over for the platform of the day, so from what I can tell it is built as a set of high-performance blocks that can be reassembled to match the core/thread/cache/VPU configuration of the next best thing. Our work on KNL is really the identification of two of those blocks that need improvement (threading support for very many cores, and AVX-512 support). The work will be done to add those new tools to the mprime toolbox, and they will be used over and over in future Intel and other architectures, albeit in slightly different configurations or arrangements.

Would I buy a facility full of just KNL chips? Not for a generic workload. If I owned a shipping company, I also wouldn't buy an entire fleet of car-haulers and expect to take contracts for every type of shipping. But if my business is shipping cars and I rarely use a flatbed or box trailer, I just might.

2016-09-19, 17:50   #182
Mysticial

Sep 2016

370₁₀ Posts

Quote:
 Originally Posted by xathor That is exactly why you see DoD and those with deep pockets and large staff gobbling up KNL. If you have a handful of applications that you run and want it to perform as best as possible, you'll try out different architectures and use whatever works best. Almost every other supercomputer center is going to shy away from KNL mainly for the overhead cost of optimizing the code to work on a very narrow range of specific hardware. To answer your optimization question. My co-worker with a PhD in computer science who has been optimizing and installing code on our systems for decades spent several weeks on a couple applications and was quite disappointed with the results.
Ah I understand. KNL didn't turn out to be the free lunch they expected it to be.

2016-09-19, 18:20   #183
ldesnogu

Jan 2008
France

2²×149 Posts

Quote:
 Originally Posted by bsquared Is it not also a dev/support hell to extract a lot of perf from CUDA? I would argue the performance per unit effort is higher with KNL... but maybe I'm biased.
I think the price is the same. But that's just guessing.

I just never bought the Intel marketing that wants you to believe that, since it's x86 and comes with excellent dev tools, it will be a piece of cake.

@airsquirrels: yes and no. Data arrangement, preloads, decode limitations, etc. depend heavily on the particular micro-architecture and core/cache/RAM topology. That tuning usually requires far more dev time than using new instructions, and it doesn't carry over easily to a radically different CPU (such as the upcoming Xeon).

But I agree with you: the work being done here is very interesting and will open the door for future AVX-512 implementations. I am just not (yet?) convinced it will prove good value for money/power. I only wish I had more time to play with that beast, it looks so sexy...

2016-09-19, 19:45   #184
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

1101000011000₂ Posts

Quote:
 Originally Posted by ldesnogu Data arrangement, preloads, decode limitations, etc. heavily depend on the particular micro-architecture and core/caches/RAM topology, and this tuning usually requires a lot more dev time than using new instructions, and is not easily applicable to a radically different CPU (such as upcoming Xeon).
Yes, basically it is complicated, time consuming, and hard. The new instructions and basic non-crashability can easily be tested in emulators, but emulators don't help anyone make the code actually perform well. That requires real hardware and lots of time testing on it.

2016-09-19, 20:45   #185
xathor

Sep 2016

19 Posts

Quote:
 Originally Posted by ldesnogu How does CUDA fits there? Is it worth supporting it? FWIW I strongly believe it's a dev/support hell to extract a lot of perf from KNL. But I also believe that a single project (such as mprime or mlucas) can get the most of it (at a large cost in dev).
CUDA, absolutely. Especially with things like OpenCL and some nVidia tools that should be coming out soon-ish. GPUs are the clear bang-for-the-dollar winner if your code can be ported to the GPU.

Titan's entire compute is 5-million-something CUDA cores.

nVidia's aim is clear with things like NVLink.

I honestly wouldn't be surprised if KNL goes the way of Itanium.

2016-09-20, 01:58   #187
Serpentine Vermin Jar

Jul 2014

5·677 Posts

Quote:
 Originally Posted by xathor ... You can get 3 DPTF in a 2U out of 96 Broadwell cores and 1TB DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard pressed to find a non-DoD supercomputer center that would put their money in KNL at this time.
I think the cool thing about Phi x200 and our recent procurement of a dev box is that it puts a many-core system in the hands of our talented developers (for perhaps the first time?), and at a reasonable price.

Even though a faster 4+ socket system in a 2U form factor could run circles around a single 1U KNL box, the price of a single such unit is far in excess of what any of us common folks could absorb.

For supercomputing needs, having a 4-socket system in a small space is great for power density, and when you're building an HPC system you often start with what your needs are and then work backwards to the price. Sweet work if you can get it.
