High energy price vs cheapest cruncher
I built a CAD machine here some time ago (44 cores in total). The box eats 400+ watts once I add the GPU, disk arrays, the watercooling, and a bunch of fans.
Let me quote you the current price of energy from Eneco for a variable contract in The Netherlands: [url]https://www.eneco.nl/duurzame-energie/modelcontract/[/url] <pre> Power per kWh, day € 0,91886 Power per kWh, night € 0,73654 Power per kWh, single € 0,82899 Gas per m^3 € 3,06421 </pre> In short: 0.83 euro per kWh, and that is including most taxes. 400 watts of usage (I believe it's more than this on average) sets you back a sloppy 2908 euro a year.

Most of winter this also heats the office just enough. At 3 euro per m^3 of gas, the central heating in the office here is of course turned off. When it's freezing outside all this isn't enough to heat the office when I'm there, but that's not the relevant discussion.

Now, we all realize that not having a machine turned on might seem cheapest to most. Yet that's an illusion - something will need to run here of course. I do the CAD work on that box as well, to design hardware components.

So: which CPU is at the moment the most efficient performer per watt for George Woltman's excellent DWT implementation? CPU only, no GPU yet. Then we can take it from there :)
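The annual-cost arithmetic above can be sketched out; the prices and the flat 400 W draw are the figures quoted in the post, and 24/7 operation is assumed:

```python
# Rough annual running cost of a box drawing constant power, using the
# Eneco single tariff quoted above. Assumptions: 24/7 operation at a
# flat 400 W average draw (the post suspects the real average is higher).
PRICE_PER_KWH = 0.82899    # EUR, single tariff incl. most taxes
POWER_KW = 0.400           # average draw of the machine
HOURS_PER_YEAR = 24 * 365

annual_kwh = POWER_KW * HOURS_PER_YEAR        # ~3504 kWh
annual_cost = annual_kwh * PRICE_PER_KWH      # ~2905 EUR

print(f"{annual_kwh:.0f} kWh/year -> EUR {annual_cost:.0f}")
```

This lands at roughly 2905 euro, close to the "sloppy 2908 euro" figure in the post; the small gap suggests the post assumed slightly more than a flat 400 W.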
My guess would be a 32 core 9004 series Epyc, such as the 9354P, with all 8 memory channels populated.
[QUOTE=Mark Rose;621406]My guess would be a 32 core 9004 series Epyc, such as the 9354P, with all 8 memory channels populated.[/QUOTE]
That's what I feared already, yes. Does that chip exist at all? Seemingly no one offers it for sale. Is it a benchmark chip? 3.25Ghz @ 64 cores at just 280 watt TDP seems very, very high clocked for that TDP... When chips get produced, the middle of the wafer sometimes yields CPUs that can clock a lot higher. If they bin those out, they can sell them like this.
I think the 64 core would be memory bandwidth starved. The supported DDR5 speeds are 4800 I believe. That may be enough for the 48 core, but the AVX512 support makes the 32 core likely a better fit.
[QUOTE=Mark Rose;621410]I think the 64 core would be memory bandwidth starved. The supported DDR5 speeds are 4800 I believe. That may be enough for the 48 core, but the AVX512 support makes the 32 core likely a better fit.[/QUOTE]
It doesn't look bad at first sight: AMD claims 460 GB/s of bandwidth. Which is an interesting claim, as I'd naively guess 32 GB/s x 12 channels = 384 GB/s (that's raw bandwidth - user data bandwidth is usually about 80% of that, as a rule of thumb). Yet a whopping 256MB of L3 cache gives it good odds, provided your working set more or less fits in there, or when you execute quite a few instructions for every L3 cache miss.
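For what it's worth, AMD's 460 GB/s figure falls straight out of the DDR5-4800 channel rate: a channel moves 4800 MT/s times 8 bytes, i.e. 38.4 GB/s, not the 32 GB/s guessed above. A minimal sanity check (the 80% achievable figure is the rule of thumb from the post, not a measurement):

```python
# Theoretical peak bandwidth of a 12-channel DDR5-4800 socket.
# Each channel: 4800 MT/s x 8 bytes per transfer = 38.4 GB/s.
MT_PER_S = 4800e6
BYTES_PER_TRANSFER = 8        # 64-bit data channel
CHANNELS = 12

per_channel = MT_PER_S * BYTES_PER_TRANSFER / 1e9   # 38.4 GB/s
peak = per_channel * CHANNELS                       # 460.8 GB/s
effective = peak * 0.8    # ~80% achievable, rule of thumb from the post

print(f"peak {peak:.1f} GB/s, ~{effective:.0f} GB/s effective")
```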
I notice here: the 9354p has 32 cores.
[url]https://www.amd.com/en/processors/epyc-9004-series[/url]

Whereas the 9534 has 64 cores at 2.45Ghz @ 280 watt. The 9634 has 84 cores at 2.25Ghz @ 290 watt - single socket I assume, as it has no P behind it. At first sight it's the one giving the most Ghz per watt of TDP. Yet the 96-core parts are close.

What I find weird is the L3 caches. That 4 MB per core was wrong: the 9354 is 8MB per core and the 96-core version is 4MB per core of L3 cache. Yet at 84 and 96 cores the size of the L3 cache makes little sense to me, except if they manage to 'turn off cores' of the chip while keeping the L3 cache alive.

Ok, the price makes sense now. 10k dollar for the 96-core ones - that will be out of my budget for a while I'm afraid (except if I sell lots of 3d printers any time soon, which is months away). The question there is more: are you willing to pay that high a price - rather than whether you'd buy one if you had the cash.

In any case the 32-core versions have a LOT more bandwidth to the RAM and a much larger L3 cache for each core. That is a pretty interesting observation indeed!

edit: Seems more like they use chiplets of 12 cores (just guessing): 84 cores is 7 chiplets and 96 cores is 8 chiplets - rather than that any core has been 'turned off'. Simply one chiplet less in the package (just a guess). It's a different crossbar in such a case.
So it's interesting to compare with nearly 4 years ago:
[url]https://www.amd.com/en/products/cpu/amd-epyc-7702[/url]

Take constant = Ghz * cores / TDP. We get to the same value as the recently launched 96-core versions - namely 0.64 Ghz per watt. Now of course the 9004 series has 50% more memory channels, which is very nice. Yet there's a huge price difference obviously. The 7702 (64 cores @ 2.0Ghz @ 200 watt TDP) is available in huge quantities on aliexpress.

Very interesting to build second hand now for HPC is of course the 7H12, which is available in huge quantities on aliexpress as well. Even though it's less efficient, it's higher clocked at 2.6Ghz (64 cores @ 280 watt). The 7742 is also efficient at 0.64, yet still very expensive on aliexpress - far over 3000 euro. That's for those who otherwise would invest in cryptos. Better to buy a good chip then, instead of flushing it down the toilet.

Lots of different chips getting sold by AMD, in short. Where is intel?
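The "constant = Ghz * cores / TDP" figure of merit can be tabulated for the SKUs mentioned in the thread. Base clocks and TDPs below are the ones quoted in the posts, not re-verified against AMD's spec pages, so treat them as approximate:

```python
# GHz * cores / TDP for the Epyc SKUs discussed in this thread.
# Specs (base GHz, cores, TDP watts) are as quoted in the posts.
skus = {
    "7702":  (2.0, 64, 200),
    "7H12":  (2.6, 64, 280),
    "9354P": (3.25, 32, 280),
    "9534":  (2.45, 64, 280),
}

# Print from most to least GHz-per-watt.
for name, (ghz, cores, tdp) in sorted(
        skus.items(), key=lambda kv: -(kv[1][0] * kv[1][1] / kv[1][2])):
    print(f"{name}: {ghz * cores / tdp:.2f} GHz/W")
```

Of course raw GHz/W ignores IPC: a Zen 4 core with AVX-512 does far more DWT work per clock than a Zen 2 core, so this metric flatters the older parts.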
[QUOTE=diep;621423]Doesn't look bad at first sight:
460 GB/s bandwidth is claim from AMD. Which is interesting claim as i'd naively guess 32 GB/s x 12 channels = 384GB/s (that's raw bandwidth - user data bandwidth is usually about 80% of that as rule of thumb). Yet a whopping 256MB L3 cache gives it good odds provided your working set size fits in there kind of, or when you execute quite some instructions for every L3 cache miss.[/QUOTE] Don't forget that 256 MB of L3 is segmented: each chiplet has 32 MB. It's not a unified L3 cache across all chiplets. The Epyc Milan-X series have 96 MB of L3 per chiplet. That fits a lot of PRP work, but newer Genoa is more power efficient. Genoa-X with up to 1152 MB of L3 is coming soon (96 MB per chiplet). It may be worth waiting for that. [QUOTE=diep;621425]I notice here: the 9354p has 32 cores. [url]https://www.amd.com/en/processors/epyc-9004-series[/url] Whereas the 9534p has 64 cores at 2.45Ghz @ 280 watt. the 9634 has 86 cores at 2.25Ghz @ 290 watt - single socket i assume as has no P behind it. At first sight it's the one giving most Ghz per watt TDP. Yet the 96 cores are close. What i find weird is the L3 caches. That 4 MB a core was wrong. The 9354 is 8MB a core and the 96 core version is 4MB a core L3 cache. Yet at 84 and 96 cores the size of the L3 cache makes little sense to me except when they manage to 'turn off cores' of the chip meanwhile keeping the L3 cache alive.[/quote] In my limited experience, my rule of thumb has been 1 channel of memory is good for 2 cores before memory bandwidth saturation begins. Additional cores may be able to squeeze more out but often they'll be spinning waiting for data: making a lot more heat for not much more throughput. With 12 channels, that would indicate a 24 core chip like the 9224 or 9254, but those have only 64 and 128 MB of L3. The high watt 9274F gets 256 MB — 3 cores and 32 MB L3 per chiplet — but it's not throughput/watt efficient. 
That's why I think the 9354P or 9354 is the way to go: 4 cores and 32 MB L3 per chiplet, and run 8 workers, one per chiplet. I really do think the 48 and higher core chips will be memory bandwidth starved running wavefront PRP with George's amazingly efficient code. Genoa-X with 96 MB of L3 per chiplet would probably be fine with the higher core count parts. [quote]Ok price makes sense now. 10k dollar for the 96 core ones - that will be out of my budget for a while i'm afraid (except when i sell lots of 3d printers any time soon which is months away). Question there is more: are you willing to pay that high of a price - rather than whether you'd buy one if you had the cash. In any case the 32 core versions have a LOT more bandwidth to the RAM and a much larger L3 cache for each core. That is pretty interesting observation indeed! edit: Seems more like if they use chiplets of 12 cores (just guessing) that 84 cores is 7 chiplets and 96 cores is 8 chiplets - rather than that any core has been 'turned off'. Simply a chiplet less in the package (just a guess). It's a different crossbar in such case.[/QUOTE] The 84 and 96 cores have 12 chiplets, with 7 or 8 active cores, and 32 MB of L3. Most of the 32-64 core parts have 8 chiplets, except the 9334 which has only 4 chiplets. The 9274F and 9174F have 8 chiplets. Chiplets with bad L3 cache get used for the 9224, where there are 4 chiplets but only half the working L3 cache. [QUOTE=diep;621427]So interesting if we compare with nearly 4 years ago [url]https://www.amd.com/en/products/cpu/amd-epyc-7702[/url] Then constant = Ghz * cores / TDP We get to the same value like the 96 core versions recently launched - namely 0.64 Ghz each watt. Now of course the 9004 series have 50% more memory channels which is very nice. Yet huge price difference obviously. The 7702 is in huge quantities on aliexpress.
(64 cores @ 2.0Ghz @ 200 watt TDP Very interesting to build 2nd hand now for HPC is of course the 7H12 that is in huge quantities available as well on aliexpress. Even though less efficient, it's higher clocked at 2.6Ghz (64 cores @ 280 watt). 7742 is also efficient at 0.64 yet very expensive still on aliexpress - far over 3000 euro. That's for those who otherwise would invest in crypto's. Better buy a good chip then instead of wash it through the toilet. Lots of different chips getting sold by AMD in short. Where is intel?[/QUOTE] Intel is five years behind. AMD recently slowed down orders from TSMC, so they shouldn't be out of stock for long. I wouldn't get 7002 series as that's Zen 2. Zen 3 is far more efficient. Zen 4 adds AVX512, which is even more power efficient.
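Mark's 2-cores-per-memory-channel rule of thumb, spelled out as a quick sketch. Core counts and per-SKU L3 totals are the figures quoted in the posts above; the rule itself is a rough heuristic, not an AMD spec:

```python
# Cores a 12-channel Genoa socket can keep fed, per the rule of thumb
# from the post (2 cores per memory channel before saturation), plus
# L3 per core for the SKUs mentioned in the thread.
CHANNELS = 12                # Epyc 9004 socket
CORES_PER_CHANNEL = 2        # heuristic from the post, not a spec

print(CHANNELS * CORES_PER_CHANNEL, "cores before bandwidth saturation")

l3_per_sku = {               # name: (cores, total L3 in MB), as quoted
    "9224":  (24, 64),
    "9254":  (24, 128),
    "9354P": (32, 256),
}
for name, (cores, l3) in l3_per_sku.items():
    print(f"{name}: {l3 / cores:.1f} MB of L3 per core")
```

That 24-core figure is why the 9224 and 9254 come up, and the L3-per-core column is why the 9354P still wins the argument.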
[QUOTE=Mark Rose;621435]Don't forget that 256 MB of L3 is segmented: each chiplet has 32 MB. It's not a unified L3 cache across all chiplets.
[/quote]

Interesting. Is it a form of NUMA-type L3 cache (or SRAM)? So a Non-Uniform Memory Access type L3 cache, where they needed the 8th chiplet just to provide the L3 cache access? Because otherwise you won't get to that L3 size, or the paper claim is incorrect.
[QUOTE=Mark Rose;621435]Don't forget that 256 MB of L3 is segmented: each chiplet has 32 MB. It's not a unified L3 cache across all chiplets.
Intel is five years behind. AMD recently slowed down orders from TSMC, so they shouldn't be out of stock for long. I wouldn't get 7002 series as that's Zen 2. Zen 3 is far more efficient. Zen 4 adds AVX512, which is even more power efficient.[/QUOTE]

5 years is a lot. There must be patents from AMD somewhere that prevent intel from using the chiplet idea. I thought the idea wasn't new if we look back to the Q6600 introduction some years ago - yet of course that didn't have a special chip that forms a central bridge between the chiplets. Some patent must be stopping intel; the only other explanation is that I'd fire the entire staff for incompetence if I were a shareholder of intel.

p.s. If it has a process explanation - that intel doesn't have 7nm machines from ASML - then it's only the White House and/or US congress that might've stopped intel from buying the latest ASML machine technology, to avoid Israel building such a plant. I remember how here in the Netherlands the White House basically nearly wanted to declare some sort of financial war on the Netherlands when China some years ago wanted those machines to build a plant. Only independent Taiwan with TSMC is seemingly allowed to have them by the USA. But I didn't follow the latest there; I notice on intel's website they quote 10nm process technology for 3rd generation scalable Xeon processors, with the latest release seemingly in 2021. That's older than 7nm - though I read an article explaining that the difference between the latest technologies might not be so enormous.

A small 8 or 12 core chiplet is of course much cheaper to produce than a 96-core chip. Like a factor 100 easier - it's about the yields of course :)
intel can get round any patent issues easily if they exist; they're just going in a much more complex direction, trying to interpose dozens of chips together (EMIB?). They've had issues with their 10nm onwards since forever and are on something like their dozenth+ stepping of Sapphire Rapids; things haven't been going to plan for a long time, which wasn't a problem until Zen 2 onwards turned up. They hung on for a few years in desktop by adding cores and pumping power, then P+E cores is a way for them to compete in desktop longer term. Laptops they're probably fine, as they can use their clout to maintain orders and it doesn't seem to be AMD's focus anyway, but in servers AMD is eating their lunch.

intel are focusing heavily on accelerators for common tasks as the next step, which is fine if you are one of those customers, and is something in intel's wheelhouse as they are good at developing accompanying software; that is a key point of putting so many resources into oneAPI, IMO. Less common accelerators they're gating behind paywalls: the hardware may have certain features built in, but just like Tesla's heated seats you may have to pay a fee or subscription to activate them.
[QUOTE=M344587487;621456]Laptops they're probably fine as they can use their clout to maintain orders and it doesn't seem to be AMD's focus anyway, but in servers AMD is eating their lunch.[/QUOTE]
Not any more! AMD chips are so much more power efficient that laptop makers can no longer ignore AMD's offerings. Intel has nothing to compete with the forthcoming Zen 4 laptop chips.
[QUOTE=diep;621444]Interesting. Is it a form of a NUMA-L3 type cache (or SRAM)?
So Non Uniform Memory Access type L3 cache where they needed the 8th chiplet just to provide the L3 cache access? As otherwise you won't get to that size L3, or the paperclaim is incorrect.[/QUOTE]

So the L3 cache on each chiplet is a victim cache: it's only populated by data evicted from the L2 cache on the same chiplet. From what I understand, before going to main memory, the caches in other chiplets are checked and data is returned over Infinity Fabric. So that part is NUMA-like. Without working cores on a chiplet, the cache on the chiplet would never be populated. So the 84-core model would have 12 working chiplets with 7 active cores each.
[QUOTE=M344587487;621456]intel can get round any patent issues easily if they exist, they're just going a much more complex direction trying to interpose dozens of chips together (EMIB?). They've had issues with their 10nm onwards since forever and are on something like their dozenth+ stepping of sapphire rapids, things haven't been going to plan for a long time which wasn't a problem until Zen2 onwards turned up. They hung on for a few years in desktop adding cores and pumping power, then P+E cores is a way for them to compete in desktop longer term. Laptops they're probably fine as they can use their clout to maintain orders and it doesn't seem to be AMD's focus anyway, but in servers AMD is eating their lunch. intel are focusing heavily on accelerators for common tasks as the next step, which is fine if you are one of those customers and is something in intels wheelhouse as they are good as developing accompanying software, that is a key point of putting so many resources into oneapi IMO. Less common accelerators they're gating behind paywalls, the hardware may have certain features built in but just like tesla's heated seats you may have to pay a fee or subscription to active them.[/QUOTE]
The initial decision-making at intel was rather simplistic. They could simply make more cash from 2-socket or 4-socket motherboards - especially 2-socket motherboards - than from putting several 'chiplets' into a single package. Yet by the time the 32-core threadrippers arrived, followed in short succession by 64 cores, it was obvious the 64-core thing was going to get there. Of course it had a 280 watt TDP originally.

Thing is that a 64-core part consists of 8 chiplets and 1 central bridge chip - that's 9 chips to produce, of which the 8 chiplets are 8-core processors with, as we can read here, their own L3 cache. The hard thing to solve is the cache coherency protocol - amongst others - I'm not deep into modern cpu design there.

Just turning off 1 core of an 8-core chiplet is really freaking hard to accomplish at the hardware level. If it somehow still eats power and 'idles', that's something totally different from what AMD claims on paper here. They claim, in terms of raw CPU power per watt, that an 84-core chip is more EFFICIENT than a 96-core chip. If you just had a faulty core idling on a chiplet, you're not going to be more efficient at all.

7 chiplets of 12 cores = 84. It makes sense to produce this over the far harder 8 chiplets with either 10 or 11 cores - especially power-wise it makes more sense, as well as economically, because you 'lose' just 7 chiplets to one package instead of 8. There are huge consequences for the cache coherency if you have just 10 or 11 cores active on a single chiplet and turn off the 11th and 12th core in hardware. Yet production-wise, producing chiplets of 12 cores is much, much cheaper than producing one chip with 24 cores. That last one is quadratically harder. Literally.
[QUOTE=Mark Rose;621480]So the L3 cache on each chiplet is a victim cache: it's only populated by data evicted from the L2 cache on the same chiplet.
From what I understand, before going to main memory, the caches in other chiplets are checked and data returned over infinity fabric. So that part is NUMA-like. Without working cores on a chiplet, the cache on the chiplet would never be populated. So the 84 core model would have 12 working chiplets with 7 active cores each.[/QUOTE]

Oh, so my math was wrong to start with in assuming chiplets of 12 cores. It's all 8-core chiplets, so there are in fact 12 chiplets on an 84 or 96 core chip. In itself it doesn't make much sense to turn off 1 core of a chiplet - really tough for the cache coherency protocol. So the L3 cache is one huge NUMA-type cache coherency protocol.

The SMP between the chiplets - I assume this works via the L3 cache at AMD? The pain intel has there is that their SMP works via the L2 cache. So maybe that's the explanation why for AMD the chiplet idea is much simpler than for intel. Keeping everything atomically synchronized via the L2 is much tougher to expand to a chiplet model than via an Alpha-style L3 cache - if the chiplets use the same L3 cache synchronisation that the Alpha used back then, which K8 and newer also used.

So the only explanation I can find is that for intel it probably came as a shock as soon as they realized that 64-core threadrippers would get on the market. A matter of time before that also took over HPC (high performance supercomputing) by storm. The initial threadrippers weren't that much faster in an absolute sense under normal conditions. The thing that put them on top of the hill is the fact that as a single-socket entity they could get overclocked a tad - at which point nothing intel had came close. Especially for benchmarks that's deadly, as they'll find the threadripper chip within AMD's lineup that can 'auto turboboost' all cores to the boost frequency - at which point 64 cores always win.

So they have 12 chiplets in a single package! Not 8 of 12 cores, but 12 of 8 cores! Boy oh boy. It's all about the yields!
I guess what has delayed intel is that earning 20k dollar times 2 chips, being 40k dollar, delivers you more cash than selling 1 chip of 64 cores at 10k dollar. It then takes that extra 2 years to start development of something similar. AMD wins it based upon yields!

edit: so, thinking more about all this: by keeping the L3 cache intact on all chiplets, they also do not mess with the SMP at AMD by turning off a single core. The SMP on the 84-core chip works 100% identically to the 96-core chip, as the SMP probably still happens via the L3 (assumptions mine). Whereas intel simply CANNOT follow this chiplet approach as long as their chips do the SMP on the L2 cache. Fewer cores means fewer L2 caches available, so the protocol falls in ruins then.
Chiplets are all about yields. It's a brilliant strategy.
Most chiplets have no defects, too, but if they do, they can be binned by working cores and working L3, not just power efficiency and frequency. So chiplet yield is basically 100%.

Then the IO die can be built on older process nodes. Stuff like IO doesn't scale as well to smaller nodes; my guess is due to higher current requirements. But even here they can use defective dies in low-end Epycs that have fewer memory channels, at least in 3rd gen.

Chiplets do suffer a bit when it comes to main memory latency, but that is largely masked by the increased L3 cache size.
[QUOTE=diep;621755]The initial decision taking at intel was rather simplistic. They could simply make more cash from 2 socket or 4 socket motherboards - especially 2 socket motherboards, than to put several 'chiplets' onto a single die. Yet by the time the 32 core threadrippers arrived in short succession followed by 64 cores - that 64 core thing was obvious going to get there. Yet of course it had a 280 watt TDP originally. Thing is that a 64 core die existing out of 8 chiplets and 1 central bridge chip - that's 9 chips to produce from which 8 chiplets are 8-core processors with as we can read here their own L3 cache.
The hard thing to solve is the cache coherency protocol - amongst others - i'm not deep into modern cpu design there. [/QUOTE]

I can't help wondering whether a transputer-like architecture will come back into fashion, perhaps with an ARM core. Two things doomed the transputer, IMO. Primarily, Inmos failed to ship a C compiler to tide the industry over until coders got the hang of CSP. Secondarily, it was not American and, especially, not x86-compatible. Often wondered whether to build a few transputers out of one or more FPGAs and play with them. I am far too lazy, alas.
[QUOTE=Mark Rose;621759]Chiplets are all about yields. It's a brilliant strategy.
Most chiplets have no defects, too, but if they do, they can be binned by working cores and working L3 not just power efficiency and frequency So chiplet yield is basically 100%. [/quote]

In your dreams - with such low-power 8-core chiplets they may be happy with above 80%, I would guess. Really happy. We'll never know what their yield is. Figuring out all the secrets of the B2 bomber is much easier than that. Heh, didn't the USA crash such a drone version of the B2 in 2011, completely intact, in Iran? Yeah, yields are one of the REAL secrets on this planet!

Yet the rule of thumb is that a yield over 80% is needed to make a good profit on CPUs. Now these are far more expensive CPUs, so throw some dice to gamble the real yield value. If yield were really brilliant they would have flooded the market with 64-core CPUs, and for years there was a shortage. That tells me it's difficult to get good yields with it, and probably it gets produced in the center of the wafer, the rest of the wafer having other CPUs. So even in the dead center of the wafer, where the machine produces the best results, they have problems getting to 80% yield - that is my blindfolded guess.

[quote] Then the IO die can be built on older process nodes. Stuff like IO doesn't scale as well to smaller nodes, my guess is due to higher current requirements. But even here they can use defective dies in low end Epycs that have fewer memory channels, at least in 3rd gen. Chiplets do suffer a bit when it comes to main memory latency, but that is largely masked from the increase L3 cache size.[/QUOTE]

Well, that i/o chip is the crossbar of the octocore nodes. It's really a 12-socket machine in reality, with 12 chiplets + 1 crossbar. That chip is a lot harder to produce than you might guess, even though it basically connects SRAM. I have really little knowledge of memory controller chips though.
I would guess, though, if we consider its huge bandwidth and critical cache coherency role, that it's the toughest chip to produce; otherwise we would've seen a 64-chiplet NUMA-type structure in a single package already, I bet :) So dead center of the wafer of that 7nm node, I would suppose. It's unclear to me whether older process technologies would be able to produce it at all.
[QUOTE=diep;621794]I have really little knowledge on memory controller chips though. Would guess ...[/QUOTE]
One of the things that *REALLY* pisses me off is when people say "I guess". Please don't guess. Know. Or, at least, question.
[QUOTE=xilman;621765]I can't help wondering whether a transputer-like architecture will come back into fashion, perhaps with an ARM core.
Two things doomed the transputer, IMO. Primarily, Inmos failed to ship a C compiler to tide the industry over until coders got the hang of CSP. Secondarily, it was not American and, especially, not x86-compatible. Often wondered whether to build a few transputers out of one or more FPGAs and play with them. I am far too lazy, alas.[/QUOTE]

As a disadvantage, you mean that they cost hundreds of thousands of dollars if not more, weighed a ton, and that for the majority of guys running codes on them at universities, they ran duck slow?

A good example is Zugzwang (chess program) from Rainer Feldman & co from the University of Paderborn. At the end of the 80s, on a 512-cpu 1Mhz transputer 'supercomputer', if I remember well what I heard (can google it and find it), they got 200 nodes a second. Frans Morsch, who played against them with an 8-bit single chip programmed in assembler at 1Mhz, got a lot more chess positions a second on a single chip. That was an H8 - a much worse cpu than the individual transputer nodes. Now of course he is a really good assembler programmer: even the best C programmers some years later, with great compilers on great 32-bit cpus, easily lost a factor 3 to him on the same chip. Yet here it was more like a factor 1500 in speed that the university team lost compared to Frans Morsch, on a better chip than the H8.

How many good codes are there for gpus now? Some years ago, before I was working fulltime on my 3d printer products and other hardware (robotics/drone/uav) attempts, when I looked around: zero contract jobs, not even fulltime jobs, for coding GPUs. You get what you pay for - yet programming well for such hardware is no easy task. In a forum like this, where some of the very best codes get used to do calculations, that's maybe hard to imagine for some here.

Yet that's also what killed the NUMA supercomputers. The Cilkchess supercomputer program from the USA? They ran the 1999 world champs on a 512-processor Origin3800 from SGI.
The problem is the CILK framework - total beginner's code - university code from MIT of course. It achieved 1-2 million nps when I watched it at 3 minutes a move, and I assume 500 cpus out of 512. Yet the programmer of Cilkchess, Don Dailey - who was sitting here next to me in my home when he visited - without Cilk, on a much slower 32-bit processor than the 64-bit processor in that Origin3800 (R14000 at 300Mhz), achieved 200k nps on a single cpu. So hardware 500+ times faster was just, say, a factor 10 faster effectively. Now you'll argue parallel speedup this and that.

In 2003 I ran on an Origin3800 myself. Unlike the Cilk team I never got system time to test. The first tournament it played, it had to perform. After a couple of rounds (heh, 7 rounds) I had done some hacks to get it working better. It scaled ok then. Now of course a chess engine with tons of chess knowledge is way slower in nps than a fast searcher like Don had built back then - normally my chess program was a factor 20 slower than the fast searchers of that time. Yet on a single cpu without the parallel framework (R14000 - 500Mhz) it got 20k nps. And on 460 cpus it achieved, after the fixing - so that's rounds 8-11 - an average of roughly 7+ million nps and a peak of 10+ mln nps. In short, roughly 60-80% scaling. That's without testing - yeah, the first few rounds of the world champs were the test!

I give this example to show you one of the main problems I had. On a normal computer you can use gettimeofday or clock or whatever sort of clock to time your thing. I soon noticed this wasn't possible on the NUMA supercomputers; back then they had dedicated clock processors. So to measure performance (internal performance code I can compile in) - how much cpu usage actually gets done - I had to turn that off, as it slowed the program down too much. An exponential slowdown. A result of the exponential slowdown was also that if you 'spinlock()', for example, that doesn't eat a couple of nanoseconds.
If it can't take the lock, it basically idles the entire process, and then some dozens of milliseconds later the process gets put into the runqueue again. So that is a total death penalty for a realtime system. There are, in short, tons of issues. All sorts of 'bugs' had to be fixed in that little time. A good example is the standard unix/linux (even today) manner of getting a hash and unique identifier of a process - well, that has big bugs. It already gave a collision somewhere around process 106 or so, which of course such a hash function should never give so fast. As a result I couldn't start it with 500 processes initially. I nearly had to cry for losing a full test cycle of the program to such a silly bug in unix. Without testing it's rather hard to get a space shuttle bug-free to launch. Nearly no one pays for good software!
Yeah, well, I have a couple of ARMs here - a Raspberry Pi 4 for example, which I bought a while ago.
Much tougher to get one delivered now, it seems. Now I haven't researched the ARM chips too well lately - yet suppose you could execute a double precision multiplication at a throughput of 1 instruction a clock. If even an FMADD-type instruction existed on the ARM hardware, then we can do ballpark math: a 64-bit, 4-core arm chip at 2.0Ghz would deliver roughly 16 gflops double precision in such a case. Compare that to the roughly couple of teraflops - let's take 4 there as a moderate estimate - of a threadripper. 4000 / 16 = 250 quad-core arm cpus needed, roughly.

How would you fit those on an electronics board? And next to each couple of dimms a bunch of chips. So it gets monumentally huge. Maybe interesting for certain sieving attempts. I didn't figure out the tdp of those new 64-bit arm cpus - yet let's take 2.2 watt as an example; add some ram and we're quickly in the kilowatt range. Yet I'm pretty sure it's more than 2.2 watts (which is what a 32-bit equivalent eats).

Now the really hard part. Some years ago buying quad-core chips was rather easy if you bought a 100. So if you had special soldering equipment you could build such a board yourself, so to speak. Some years ago quad-cores, of course a tad lower clocked than mainstream, were like 20 dollar a piece. Yet the latest arm cpus have gone up massively in price there.

In short it is a massive project and you're still slower than a threadripper for most software. And we speak of embarrassingly parallel software then, of course, that hardly needs any bandwidth to other cpus. Because all the connections on the default boards for those arm chips that one could use to network them are non-DMA connections. So if some data gets in or out, the kernel will do a central lock of all i/o, kind of - giving a big dang to all cores of the ARM processor - they cannot run further until the unlock occurs.
So to build a massive cluster of ARMs that is useful at all for software that needs a little bandwidth to other nodes, you would need to build a board that has a DMA-type network controller on it for each cpu. Also, to get the power consumption down you really need quite a lot of ARM cpus on each board, each CPU with its own RAM of course. Designing such a board is a massive task.
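The ballpark above can be made explicit. The assumptions are the post's own: one double-precision FMA retired per core per cycle (2 flops/cycle), and ~4 TFLOPS as a moderate estimate for a threadripper:

```python
# How many 4-core 2 GHz ARM chips match one ~4 TFLOPS Threadripper,
# assuming each ARM core retires one double-precision FMA per cycle
# (2 flops/cycle) -- the back-of-envelope assumption from the post.
GHZ = 2.0
CORES = 4
FLOPS_PER_CYCLE = 2          # one FMA counts as a multiply plus an add

arm_gflops = GHZ * CORES * FLOPS_PER_CYCLE      # 16 GFLOPS per chip
threadripper_gflops = 4000                      # moderate estimate
chips_needed = threadripper_gflops / arm_gflops # 250 chips

print(f"{arm_gflops:.0f} GFLOPS per chip -> {chips_needed:.0f} chips needed")
```

And that is peak arithmetic only: it ignores memory bandwidth and the non-DMA interconnect problem described above, both of which make the ARM cluster look worse still.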
[QUOTE=chalsall;621795]One of the things that *REALLY* pisses me off is when people say "I guess".
Please don't guess. Know. Or, at least, question.[/QUOTE]

I'm pretty sure my 'guess' there is better than your 'know'. Chip design is shrouded in secrecy. Even the most knowledgeable expert commenting on websites will have to write 'my guess' - even if he knows, or thinks he knows - because he'll never get to see the actual hard data. So it is a guess and it stays a guess in such a case. You will NEVER learn ANYTHING about any sort of cpu design if you skip the 'guesses'. Because whoever KNOWS will kind of get executed, and to be sure shredded and all parts burnt at solar temperatures - so whoever KNOWS in that field will NEVER post.

Now I'm not an expert on memory controllers, yet my best guess is that AMD has real trouble getting a high yield for the toughest part of the chip, which is the crossbar design. And with 'high' I mean over 80% - regardless of which node that is.
[QUOTE=diep;621794]In your dreams - with such low-power 8-core chiplets they may be happy with above 80%, I would guess. Really happy.
We'll never know what their yield is. Figuring out all the secrets of the B2 bomber is much easier than that. (Heh, didn't the USA crash a drone version of the B2 completely intact in Iran in 2011?) Yields are one of the REAL secrets on this planet! Yet the rule of thumb is that a yield over 80% is needed to make a good profit on CPUs. Now these are far more expensive CPUs, so throw some dice to gamble the real yield value. If yields were really brilliant they would have flooded the market with 64-core CPUs, yet for years there was a shortage. That tells me it's difficult to get good yields, and that it probably gets produced in the center of the wafer, the rest of the wafer having other CPUs. So even in the dead center of the wafer, where the machine produces the best results, they have problems getting to 80% yield - that's my blindfolded guess.[/quote] I should have said effective 99% yields, because even if a given chiplet is highly defective, there is a place in the product stack where it can be used. Only very unfortunate defects will hit a chiplet in a critical spot. Use a defect density calculator like [url]https://isine.com/resources/die-yield-calculator/[/url]. A Zen 4 chiplet is 70 mm^2. I don't have exact dimensions, but call it 7 x 10 mm. TSMC has a defect rate of about 0.07/cm^2. With 3 mm edge loss, 0.2 mm scribe lines, and a 300 mm wafer, you get a 95% yield of defect-free dies. Of the 5% that do have defects, most defects will land in a core or in the L3 cache. If the L3 is hit, fuse off half and sell the die in an Epyc 9224, or in a future Zen 4 Ryzen analogous to the Ryzen 3 5500. If a core is defective, sell it as a chiplet with fewer than 8 cores active. Dies farther from the center are likely to need more power or to clock lower, yes - sell those as desktop chips, the least power-sensitive market; the best go in high-core-count Epycs. So yeah, 99% effective yields are easily attained on the chiplets. As the IO dies are much bigger, yields will be lower there.
AMD didn't flood the market with 64 core chips because there was limited capacity at TSMC until very recently. |
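The Poisson defect model behind calculators like the one linked above fits in a few lines of Python. The die size, scribe width, edge loss, and defect density below are the estimates from the post, and the dies-per-wafer formula is the usual rough approximation - this is a sketch, not TSMC data:

```python
import math

def poisson_yield(die_area_mm2, defect_density_per_cm2):
    """Fraction of defect-free dies under the simple Poisson model:
    Y = exp(-A * D0), with the die area A converted to cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defect_density_per_cm2)

def dies_per_wafer(die_w_mm, die_h_mm, wafer_d_mm=300, edge_loss_mm=3, scribe_mm=0.2):
    """Rough gross-die estimate: usable wafer area over die+scribe area,
    minus a standard correction for partial dies at the wafer edge."""
    w = die_w_mm + scribe_mm
    h = die_h_mm + scribe_mm
    area = w * h
    d = wafer_d_mm - 2 * edge_loss_mm
    return int(math.pi * (d / 2) ** 2 / area - math.pi * d / math.sqrt(2 * area))

# ~70 mm^2 Zen 4 chiplet (~7 x 10 mm), ~0.07 defects/cm^2
print(f"defect-free yield: {poisson_yield(70, 0.07):.1%}")  # ~95%
print(f"gross dies/wafer:  {dies_per_wafer(7, 10)}")        # several hundred
```

With exp(-0.70 cm^2 * 0.07/cm^2) = exp(-0.049) ≈ 95.2%, this reproduces the "95% defect-free" figure quoted above; the effective yield is then higher still, because most defective dies are salvaged by fusing off cores or cache.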
[QUOTE=Mark Rose;621807]I should have said effective 99% yields, because even if a given chiplet is highly defective, there is a place in the product stack where it can be used.
...[/QUOTE]
By yields I mean chips coming off the wafer that you do not need to shredder right away - even before each gets tested nonstop for a week or more to see which cores work and, if so, at which common GHz clock the chip can run, if any at all.
I doubt AMD is above 80%, and if they are, it's by very little - maybe 80.1%, even after all those years. Look how badly available those CPUs have been for years now. We're talking about the latest ASML machines, made here in the Netherlands and installed at TSMC in Taiwan, that this gets produced with. It's a huge accomplishment that they managed at all. Every so often, when they think they have a new tweak to get yields up, the software upgrade travels from the Netherlands to Taiwan - a plane with a few guys arrives, incorporates it, and flies back when the upgrade is done. It takes years and years to get such a new factory working at any acceptable yield rate. I read somewhere a chiplet is 112+ mm^2, but that could have been about something totally different. Remember there is a crossbar that has to serve huge bandwidth at low latency for whatever cache coherency protocols are in place. And you can throw your defect rate completely out of the window on a new process - center of the wafer versus edge of the wafer, the difference couldn't be larger. Possibly they produce low-clocked memory chips at the edges of every wafer. Someone who taped out a certain huge chip for AMD some time ago told me the following rule of thumb: they start at 300 MHz, and from there they have to fix the chip's design until it runs at the GHz level it gets sold at - and only after that do they improve yields as well. Remember, this technology is so fine-grained that no one else on the planet can produce CPUs with it. Getting the kinks out of that is much harder than building a Space Shuttle. |
[QUOTE=Mark Rose;621807] AMD didn't flood the market with 64 core chips because there was limited capacity at TSMC until very recently.[/QUOTE]
Limited capacity for a chip they can make so much money with - and push Intel out of the market with - only occurs when yield is too low. No other explanation is possible; the big moneymaker keeps production going through covid or whatever other problems. I don't know the exact amount quoted, but think on the order of 20 billion dollars invested in a production facility on this process. In short, the machines have to generate product nonstop to pay back that 20 billion, otherwise TSMC goes bankrupt. So whatever you produce, yields have to be good; otherwise the design needs nonstop fixes while producing very little, and other products that make more cash get priority meanwhile. The yields were simply not good enough yet. Better to produce massively later, once yields are fixed, and do minimum production now to bugfix the chip's yields and, as they call it in old-school macro-economics, 'cream off the market'. Too many dozens of billions are at stake. Yields, yields, yields! |
Note that the L3 cache, seen from a distance, is just a piece of SRAM that happens to be on-chip.
|
[QUOTE=diep;621812]Limited capacity for a chip they can make so much money with and push Intel out of the market with only occurs when yield is too low. Zero other explanations possible.
...[/QUOTE] They're fabless, and until a few years ago were dragging themselves out of near-bankruptcy and extricating themselves from GlobalFoundries contracts that were in place as a result of going fabless. Market forces on TSMC's highly sought-after nodes, and a limited-until-recently ability to bid for as much allocation as hindsight shows they perhaps should have, are definitely factors. |