
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   High energy price vs cheapest cruncher (https://www.mersenneforum.org/showthread.php?t=28353)

Mark Rose 2023-01-02 18:40

[QUOTE=M344587487;621456]Laptops they're probably fine as they can use their clout to maintain orders and it doesn't seem to be AMD's focus anyway, but in servers AMD is eating their lunch.[/QUOTE]

Not any more! AMD chips are so much more power efficient that laptop makers can no longer ignore AMD's offerings. Intel has nothing to compete with the forthcoming Zen 4 laptop chips.

Mark Rose 2023-01-02 18:48

[QUOTE=diep;621444]Interesting. Is it a form of a NUMA-L3 type cache (or SRAM)?
So Non Uniform Memory Access type L3 cache where they needed the 8th chiplet just to provide the L3 cache access?

As otherwise you won't get to that size L3, or the paperclaim is incorrect.[/QUOTE]

So the L3 cache on each chiplet is a victim cache: it's only populated by data evicted from the L2 cache on the same chiplet.

From what I understand, before going to main memory, the caches in other chiplets are checked and data returned over infinity fabric. So that part is NUMA-like.

Without working cores on a chiplet, the cache on the chiplet would never be populated. So the 84 core model would have 12 working chiplets with 7 active cores each.
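A rough sketch of that lookup order in C, as I understand it (my own simplification for illustration, not AMD's actual design; the toy cache model and all names are invented):

[CODE]
/* Rough sketch of the lookup order described above -- my own simplification,
 * NOT AMD's actual implementation.  Each "cache" is just a tiny set of
 * addresses; the point is the order of the probes, not the cache model. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CHIPLETS 12
#define WAYS 4                          /* toy capacity per cache */

typedef struct { unsigned long line[WAYS]; int next; } cache_t;

static bool lookup(const cache_t *c, unsigned long addr)
{
    for (int i = 0; i < WAYS; i++)
        if (c->line[i] == addr) return true;
    return false;
}

static void insert(cache_t *c, unsigned long addr)
{
    c->line[c->next] = addr;            /* trivial round-robin replacement */
    c->next = (c->next + 1) % WAYS;
}

static cache_t l2[NUM_CHIPLETS], l3[NUM_CHIPLETS];

/* The L3 is a victim cache: it is only filled by lines evicted from the
 * L2 on the same chiplet, never directly on a DRAM fetch. */
static void l2_evict(int chiplet, unsigned long addr) { insert(&l3[chiplet], addr); }

/* A core's read path as described above: own L2, own L3, the other
 * chiplets' L3s over Infinity Fabric (the NUMA-like part), then DRAM. */
static const char *core_read(int my_chiplet, unsigned long addr)
{
    if (lookup(&l2[my_chiplet], addr)) return "own L2";
    if (lookup(&l3[my_chiplet], addr)) return "own (victim) L3";
    for (int c = 0; c < NUM_CHIPLETS; c++)
        if (c != my_chiplet && lookup(&l3[c], addr)) return "remote L3 over IF";
    insert(&l2[my_chiplet], addr);      /* fill own L2 from main memory */
    return "DRAM";
}

int main(void)
{
    l2_evict(5, 0x1000);                    /* chiplet 5 evicts a line into its L3 */
    printf("%s\n", core_read(0, 0x1000));   /* -> "remote L3 over IF" */
    printf("%s\n", core_read(0, 0x2000));   /* -> "DRAM" */
    return 0;
}
[/CODE]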

diep 2023-01-05 14:19

[QUOTE=M344587487;621456]intel can get round any patent issues easily if they exist, they're just going a much more complex direction trying to interpose dozens of chips together (EMIB?). They've had issues with their 10nm onwards since forever and are on something like their dozenth+ stepping of sapphire rapids, things haven't been going to plan for a long time which wasn't a problem until Zen2 onwards turned up. They hung on for a few years in desktop adding cores and pumping power, then P+E cores is a way for them to compete in desktop longer term. Laptops they're probably fine as they can use their clout to maintain orders and it doesn't seem to be AMD's focus anyway, but in servers AMD is eating their lunch. intel are focusing heavily on accelerators for common tasks as the next step, which is fine if you are one of those customers and is something in intels wheelhouse as they are good as developing accompanying software, that is a key point of putting so many resources into oneapi IMO. Less common accelerators they're gating behind paywalls, the hardware may have certain features built in but just like tesla's heated seats you may have to pay a fee or subscription to active them.[/QUOTE]

The initial decision-making at Intel was rather simplistic. They could simply make more cash from 2-socket or 4-socket motherboards - especially 2-socket motherboards - than from putting several 'chiplets' onto a single package. Yet then the 32-core Threadrippers arrived, followed in short succession by 64 cores - it was obvious that 64-core part was going to get there. Of course it originally had a 280 watt TDP. The thing is that a 64-core package consists of 8 chiplets plus 1 central bridge chip - that's 9 chips to produce, of which the 8 chiplets are 8-core processors, each with, as we can read here, its own L3 cache.

The hard thing to solve is the cache coherency protocol - amongst other things - though I'm not deep into modern CPU design there.

Just turning off 1 core of an 8-core chiplet is really freaking hard to accomplish at the hardware level.

If a disabled core somehow still eats power and 'idles', that's something completely different from what AMD claims on paper here. They claim, in terms of raw CPU performance per watt, that an 84-core chip is more EFFICIENT than a 96-core chip. If you just left a faulty core idling on a chiplet, you wouldn't be more efficient at all.

7 chiplets of 12 cores = 84.

It makes sense to produce this over the far harder 8 chiplets with either 10 or 11 cores each - power-wise it makes more sense, and economically too, because you then 'lose' just 7 chiplets to one package instead of 8.

It would have huge consequences for cache coherency to have just 10 or 11 cores active on a single chiplet and turn off the 11th and 12th core in hardware.

Yet production-wise, producing 12-core chiplets is much, much cheaper than producing one chip with 24 cores. The latter is quadratically harder - literally.
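To make 'quadratically harder' concrete with a toy defect model (the standard Poisson yield assumption; the defect density and die area below are invented numbers, not AMD's data):

[CODE]
/* Toy Poisson yield model: Y = exp(-D * A), with defect density D per cm^2
 * and die area A in cm^2.  Numbers are invented for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double D = 0.2;          /* assumed defects per cm^2                   */
    double A = 0.8;          /* assumed area of one small chiplet, cm^2    */

    double y_small  = exp(-D * A);         /* yield of the small die       */
    double y_double = exp(-D * 2.0 * A);   /* yield of a die twice as big  */

    printf("small die yield : %.1f%%\n", 100.0 * y_small);
    printf("double die yield: %.1f%% (= small^2 = %.1f%%)\n",
           100.0 * y_double, 100.0 * y_small * y_small);
    /* Doubling the area squares the defect-free fraction:
     * exp(-2DA) = (exp(-DA))^2.  On top of that, a defective small chiplet
     * can still be sold with a core fused off, while a defect on a big
     * monolithic die wastes far more silicon. */
    return 0;
}
[/CODE]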

diep 2023-01-05 14:34

[QUOTE=Mark Rose;621480]So the L3 cache on each chiplet is a victim cache: it's only populated by data evicted from the L2 cache on the same chiplet.

From what I understand, before going to main memory, the caches in other chiplets are checked and data returned over infinity fabric. So that part is NUMA-like.

Without working cores on a chiplet, the cache on the chiplet would never be populated. So the 84 core model would have 12 working chiplets with 7 active cores each.[/QUOTE]

Oh, so my math was wrong to start with: there are no 12-core chiplets. It's all 8-core chiplets. So there are in fact 12 chiplets on an 84- or 96-core chip. In itself it doesn't make much sense to turn off 1 core per chiplet - that's really tough for the cache coherency protocol.

So the L3 cache is one huge NUMA-type cache with its own coherency protocol.

The SMP between the chiplets - I assume that works via the L3 cache at AMD?

The pain Intel has there is that their SMP works via the L2 cache. So maybe that's the explanation for why the chiplet idea is much simpler for AMD than for Intel.

Keeping everything atomically synchronized via the L2 is much tougher to expand to a chiplet model than doing it via an Alpha-style L3 cache - if the chiplets use the same kind of L3 cache synchronisation that the Alpha used back then, which K8 and newer also used.

So the only explanation I can find is that it probably came as a shock to Intel as soon as they realized that 64-core Threadrippers would reach the market. It was only a matter of time before that also took HPC (high performance computing) by storm.

The initial Threadrippers weren't that much faster in an absolute sense under normal conditions. The thing that put them on top of the hill is the fact that, as a single-socket part, they could be overclocked a tad - at which point nothing Intel had came close.

Especially for benchmarks that's deadly, as they'll find the Threadripper chip within AMD's lineup that can 'auto turbo-boost' all cores to the boost frequency - at which point 64 cores always win.

So they have 12 chiplets in a single package! Not 8 chiplets of 12 cores, but 12 chiplets of 8 cores!
Boy oh boy. It's all about the yields!

I guess what delayed Intel is that earning 20k dollars times 2 chips, being 40k dollars, delivers more cash than selling 1 chip of 64 cores at 10k dollars. Then it takes that extra 2 years to start development of something similar.

AMD wins it based upon yields!

Edit: thinking some more about all this.

By keeping the L3 cache intact on all chiplets, AMD also doesn't mess with the SMP by turning off a single core.
The SMP on the 84-core chip works 100% identically to the 96-core chip, as the SMP probably still happens via the L3 (assumption mine).

Whereas Intel simply CANNOT follow this chiplet approach as long as their chips do the SMP on the L2 cache. Fewer cores means fewer L2 caches available, so the protocol falls into ruins.

Mark Rose 2023-01-05 17:22

Chiplets are all about yields. It's a brilliant strategy.

Most chiplets have no defects, too, but if they do, they can be binned by working cores and working L3, not just power efficiency and frequency.

So chiplet yield is basically 100%.

Then the IO die can be built on older process nodes. Stuff like IO doesn't scale as well to smaller nodes; my guess is that's due to higher current requirements. But even here they can use defective dies in low-end Epycs that have fewer memory channels, at least in 3rd gen.

Chiplets do suffer a bit when it comes to main memory latency, but that is largely masked by the increased L3 cache size.
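A quick back-of-the-envelope average-memory-access-time comparison shows how that masking works (all latencies and hit rates below are rough assumptions I made up, not measured Zen figures):

[CODE]
/* Back-of-the-envelope AMAT (average memory access time) sketch.
 * All latencies and hit rates are rough assumptions, not measured
 * figures for any real Zen part. */
#include <stdio.h>

static double amat(double l3_hit_rate, double l3_lat_ns, double dram_lat_ns)
{
    /* average cost of a request that already missed in L1/L2 */
    return l3_hit_rate * l3_lat_ns + (1.0 - l3_hit_rate) * dram_lat_ns;
}

int main(void)
{
    /* Monolithic-ish case: smaller L3, lower DRAM latency (assumed). */
    double mono    = amat(0.70, 12.0, 80.0);
    /* Chiplet case: the hop to the IO die adds DRAM latency, but the
     * bigger L3 catches more of the traffic (assumed). */
    double chiplet = amat(0.85, 14.0, 100.0);

    printf("monolithic : %.1f ns per L2 miss\n", mono);    /* ~32.4 ns */
    printf("chiplet    : %.1f ns per L2 miss\n", chiplet); /* ~26.9 ns */
    return 0;
}
[/CODE]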

xilman 2023-01-05 19:25

[QUOTE=diep;621755]The initial decision taking at intel was rather simplistic. They could simply make more cash from 2 socket or 4 socket motherboards - especially 2 socket motherboards, than to put several 'chiplets' onto a single die. Yet by the time the 32 core threadrippers arrived in short succession followed by 64 cores - that 64 core thing was obvious going to get there. Yet of course it had a 280 watt TDP originally. Thing is that a 64 core die existing out of 8 chiplets and 1 central bridge chip - that's 9 chips to produce from which 8 chiplets are 8-core processors with as we can read here their own L3 cache.

The hard thing to solve is the cache coherency protocol - amongst others - i'm not deep into modern cpu design there.
[/QUOTE]I can't help wondering whether a transputer-like architecture will come back into fashion, perhaps with an ARM core.

Two things doomed the transputer, IMO. Primarily, Inmos failed to ship a C compiler to tide the industry over until coders got the hang of CSP. Secondarily, it was not American and, especially, not x86-compatible.

Often wondered whether to build a few transputers out of one or more FPGAs and play with them. I am far too lazy, alas.

diep 2023-01-06 00:32

[QUOTE=Mark Rose;621759]Chiplets are all about yields. It's a brilliant strategy.

Most chiplets have no defects, too, but if they do, they can be binned by working cores and working L3 not just power efficiency and frequency

So chiplet yield is basically 100%.

[/quote]

In your dreams - with such low-power 8-core chiplets they may be happy with anything above 80%, I would guess. Really happy.

We'll never know what their yield is. Figuring out all the secrets of the B-2 bomber is much easier than that. Heh, didn't the USA crash a drone version of the B-2 completely intact in Iran in 2011? Yeah, yields are one of the REAL secrets on this planet!

Yet the rule of thumb is that a yield over 80% is needed to make a good profit on CPUs. Now these are far more expensive CPUs, so throw some dice to gamble on the real yield value.

If the yield were really brilliant they would have flooded the market with 64-core CPUs, yet for years there was a shortage.

That tells me it's difficult to get good yields with it, and it probably gets produced in the center of a wafer, with the rest of the wafer carrying other CPUs.

So even in the dead center of the wafer, where the machine produces the best results, they have problems getting to 80% yield - that's my blindfolded guess.

[quote]
Then the IO die can be built on older process nodes. Stuff like IO doesn't scale as well to smaller nodes, my guess is due to higher current requirements. But even here they can use defective dies in low end Epycs that have fewer memory channels, at least in 3rd gen.

Chiplets do suffer a bit when it comes to main memory latency, but that is largely masked from the increase L3 cache size.[/QUOTE]



Well, that I/O chip is the crossbar of the octo-core nodes. It's really a 12-socket machine in reality, with 12 chiplets + 1 crossbar.
That chip is a lot harder to produce than you might guess, even though it basically connects SRAM.

I have really little knowledge of memory controller chips though. I would guess, considering its huge bandwidth and critical cache coherency role, that it's the toughest chip to produce - otherwise we would've seen a 64-chiplet NUMA structure on a single package already, I bet :) So dead center of the wafer of that 7 nm node, I would suppose. It's unclear to me whether older process technologies would be able to produce it at all.

chalsall 2023-01-06 00:37

[QUOTE=diep;621794]I have really little knowledge on memory controller chips though. Would guess ...[/QUOTE]

One of the things that *REALLY* pisses me off is when people say "I guess".

Please don't guess. Know.

Or, at least, question.

diep 2023-01-06 01:01

[QUOTE=xilman;621765]I can't help wondering whether a transputer-like architecture will come back into fashion, perhaps with an ARM core.

Two things doomed the transputer, IMO. Primarily, Inmos failed to ship a C compiler to tide the industry over until coders got the hang of CSP. Secondarily, it was not American and, especially, not x86-compatible.

Often wondered whether to build a few transputers out of one or more FPGAs and play with them. I am far too lazy, alas.[/QUOTE]

As a disadvantage, you mean that they cost hundreds of thousands of dollars if not more, weighed a ton, and that at universities the majority of the codes people ran on them were dog slow?

A good example is Zugzwang (a chess program) from Rainer Feldmann & co at the University of Paderborn.

At the end of the 80s, on a 512-CPU transputer 'supercomputer' (1 MHz per CPU), if I remember well what I heard (you can google it and find it), I know they got 200 nodes a second.

Frans Morsch, who played against them with an 8-bit single chip programmed in assembler at 1 MHz, got a lot more chess positions per second on a single chip. That was an H8 - a much worse CPU than the transputer's individual nodes. Now of course he is a really good assembler programmer. So even the best C programmers some years later, with great compilers on great 32-bit CPUs, easily lost a factor of 3 to him on the same chip.

Yet here it was more like a factor of 1500 in speed that the university team lost compared to Frans Morsch, on a better chip than the H8.

How many good codes are there for GPUs now?
Some years ago, before I was working full-time on my 3D printer products and other hardware (robotics/drone/UAV) attempts, when I looked around: zero contract jobs, not even full-time jobs, for coding GPUs.

You get what you pay for - and programming well for such hardware is no easy task. In a forum like this, where some of the very best codes get used to do calculations, that's maybe hard to imagine for some here.

Yet that's also what killed the NUMA supercomputers.

Cilkchess, the supercomputer program from the USA?
They ran the 1999 world championship on a 512-processor Origin 3800 from SGI.

The problem is the Cilk framework - total beginner's code - university code from MIT, of course.

It achieved 1-2 million nps when I watched it, at 3 minutes a move, and I assume 500 CPUs out of 512.

Yet the programmer of Cilkchess, Don Dailey - who was sitting here next to me in my home when he visited - got 200k nps on a single CPU without Cilk, on a much slower 32-bit processor than the 64-bit processor in that Origin 3800 (an R14000 at 300 MHz).

So hardware that was 500+ times faster was effectively only, say, a factor of 10 faster.

Now you'll argue parallel speedup this and that.
In 2003 I ran on an Origin 3800 myself.

Unlike the Cilk team I never got system time to test. The first tournament I played, it had to perform. After a couple of rounds (heh, 7 rounds) I had done some hacks to get it working better. It scaled OK then. Now of course a chess engine with tons of chess knowledge is way slower in nps than the fast searchers back then. Normally my chess program was a factor of 20 slower than the fast searchers of that time.

Yet on a single CPU without the parallel framework (R14000 at 500 MHz) it got 20k nps. And on 460 CPUs, after the fixes - so that's rounds 8-11 - it achieved an average of roughly 7+ million nps and a peak of 10+ million nps.

In short, roughly 60-80% scaling (460 × 20k ≈ 9.2 million nps would be linear scaling; 7+ million average is about 75-80% of that).

That's without testing - yeah, the first few rounds of the world championship were the test!

I give this example to show you one of the main problems I had. On a normal computer you can use gettimeofday or clock or whatever sort of clock to time your thing.

I soon noticed this wasn't possible on the NUMA supercomputers. Back then they had dedicated clock processors. So to measure performance (internal performance code I can compile in) - how much CPU work actually gets done - I had to turn that off, as it slowed the program down too much. An exponential slowdown. A result of the exponential slowdown was also that if you spinlock(), for example, that doesn't cost a couple of nanoseconds. If it can't take the lock, it basically idles the entire process, and then some dozens of milliseconds later the process gets put back into the run queue. That is a total death penalty for a realtime system.
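To illustrate the kind of workaround that forces, here is a generic spin-then-yield lock sketch in C11 (an illustration only, not the code that actually ran on the Origin; the spin limit is an arbitrary number I picked):

[CODE]
/* Generic sketch of a spin-then-yield lock in C11 -- an illustration of
 * the problem described above, not the code that ran on the Origin.
 * If the lock can't be taken quickly, being descheduled for tens of
 * milliseconds is what kills a realtime search. */
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_flag busy; } spinlock_t;

#define SPIN_LIMIT 1000   /* arbitrary: how long to spin before yielding */

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (!atomic_flag_test_and_set_explicit(&l->busy,
                                                   memory_order_acquire))
                return;              /* got the lock while spinning */
        sched_yield();               /* give the core away voluntarily; on
                                        the machine described above even this
                                        could mean a very long wait before
                                        being scheduled again */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

int main(void)
{
    static spinlock_t lock = { ATOMIC_FLAG_INIT };
    spin_lock(&lock);
    /* ... critical section, e.g. updating a shared hash table ... */
    spin_unlock(&lock);
    return 0;
}
[/CODE]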

There are, in short, tons of issues.

All sorts of 'bugs' had to be fixed in that little time. A good example is the standard Unix/Linux manner (even today) of getting a hash and unique identifier for a process - well, that has big bugs. It already gave a collision somewhere around process 106 or so, which of course such a hash function should never give so quickly.

As a result, of course, I initially couldn't start it with 500 processes.

I nearly had to cry over losing a full test cycle of the program to such a silly bug in Unix.

Without testing it's rather hard to get a space shuttle to launch bug-free.

Nearly no one pays for good software!

diep 2023-01-06 01:15

Yeah, well, I have a couple of ARMs here - a Raspberry Pi 4, for example, that I bought a while ago.
Much tougher to get one delivered now, it seems.

Now I haven't researched the ARM chips too well lately - yet suppose you could execute double-precision multiplications at a throughput of 1 instruction per clock, and even an FMADD-type instruction existed in the ARM hardware. Then we can do ballpark math.

That means a 64-bit, 4-core ARM chip at 2.0 GHz would roughly deliver 16 GFLOPS double precision in such a case.

Compare that to the roughly couple of teraflops - let's take 4 as a moderate estimate - of a Threadripper.

4000 / 16 = 250 quad-core ARM CPUs needed, roughly.

How would you fit those on an electronics board? And next to each one a couple of DIMMs and a bunch of support chips. So it gets monumentally huge.
Maybe interesting for certain sieving attempts.

I didn't figure out the TDP of those new 64-bit ARM CPUs - yet let's take 2.2 watts as an example; add some RAM and we're quickly in the kilowatt range. I'm pretty sure it's more than 2.2 watts, though (which is what a 32-bit equivalent eats).
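Putting the ballpark numbers from the last few paragraphs in one place (every figure below is an assumption from this post, not a measurement):

[CODE]
/* Ballpark math from the paragraphs above -- every number here is an
 * assumption from the post, not a measurement. */
#include <stdio.h>

int main(void)
{
    double cores      = 4.0;     /* assumed quad-core ARM chip                 */
    double clock_ghz  = 2.0;     /* assumed clock                              */
    double flops_clk  = 2.0;     /* 1 FMADD per clock = 2 flops, if it exists  */
    double arm_gflops = cores * clock_ghz * flops_clk;          /* 16 GFLOPS   */

    double tr_gflops  = 4000.0;  /* Threadripper taken as ~4 TFLOPS            */
    double chips      = tr_gflops / arm_gflops;                 /* ~250 chips  */

    double watts_chip = 2.2;     /* guessed TDP per chip, probably too low     */
    double total_w    = chips * watts_chip;                     /* ~550 W      */

    printf("per ARM chip : %.0f GFLOPS\n", arm_gflops);
    printf("chips needed : %.0f to match ~%.0f GFLOPS\n", chips, tr_gflops);
    printf("power (CPUs) : ~%.0f W before RAM, networking and PSU losses\n",
           total_w);
    return 0;
}
[/CODE]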

Now the really hard part. Some years ago, buying quad-core chips was rather easy if you bought a hundred. So if you had special soldering equipment you could build such a board yourself, so to speak.

Some years ago quad-cores - of course clocked a tad lower than mainstream - were like 20 dollars a piece.
Yet the latest ARM CPUs have gone up massively in price there.

In short, it is a massive project and you're still slower for most software than a Threadripper.
And then we're speaking of embarrassingly parallel software, of course, that hardly needs any bandwidth to other CPUs.

Because the connections that the default boards for those ARM chips have, which one could use to network them, are non-DMA connections. So if some data goes in or out, the kernel does a central lock of all I/O, more or less - giving a big hit to all cores of the ARM processor - and they cannot run further until the unlock occurs.

So to build a massive cluster of ARMs that is useful at all for software that needs even a little bandwidth to other nodes, you would need to build a board that has a DMA-type network controller for each CPU. Also, to get the power consumption down, you really need quite a lot of ARM CPUs on each board, each CPU with its own RAM of course.

Designing such a board is a massive task.

diep 2023-01-06 02:05

[QUOTE=chalsall;621795]One of the things that *REALLY* pisses me off is when people say "I guess".

Please don't guess. Know.

Or, at least, question.[/QUOTE]

I'm pretty sure my 'guess' there is better than your 'know'.

Chip design is shrouded in secrecy. Even the most knowledgeable expert commenting on websites will have to write 'my guess' - even if he knows, or thinks he knows - because he'll never get to see the actual hard data. So it is a guess and it stays a guess in such a case.

You will NEVER learn ANYTHING about any sort of CPU design if you skip the 'guesses'. Because whoever KNOWS will kind of get executed - and, to be sure, shredded and all parts burnt at solar temperatures - so whoever KNOWS in that field will NEVER post.

Now I'm not an expert on memory controllers, yet my best guess is that AMD has real trouble getting a high yield on the toughest part of the chip, which is the crossbar design. And by 'high' I mean over 80% - regardless of which node that is.

