|
|
#1 |
|
Apr 2003
Berlin, Germany
192 Posts |
AMD will introduce SSE5 with the upcoming "Bulldozer" core (which is meant to be optimized for high throughput) in 2009.
More here: http://developer.amd.com/sse5.jsp http://www.extremetech.com/article2/...2177464,00.asp |
|
|
|
|
|
#2 |
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²×1,549 Posts |
I haven't yet seen an SSE4 chip and AMD already want to do SSE5. I wonder how useful it will really be? Time will tell I guess.
|
|
|
|
|
|
#3 | |
|
Apr 2003
Berlin, Germany
192 Posts |
Quote:
Just think of the SSE4.1 benchmark results shown by Intel a while ago. If there is already an SSE4.1-optimized DivX encoder ready (as a beta version) to be run on an early CPU engineering sample, then the developers must have known about SSE4.1 long before. SSE5 is in theory much more useful to Prime95 than SSE3 was (with its horizontal operations). Besides other changes, it will provide fused multiply-add (FMAC) with 3 source operands, which should be the most useful change. |
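To illustrate why a fused multiply-add matters numerically, here is a minimal Python sketch, not SSE5 code. It emulates the single rounding of a fused a*b + c with exact rational arithmetic and compares it with the usual multiply-then-add, which rounds twice. The function names are hypothetical, and the emulation assumes finite inputs.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Ordinary a*b + c: the product is rounded, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused a*b + c (hypothetical helper, finite inputs only).

    The product and sum are formed exactly as rationals, so the only
    rounding is the final conversion back to a double.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# a*b is exactly 1 - 2**-60, which is not representable as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = separate_multiply_add(a, b, c)  # product rounds to 1.0 first
fused = fused_multiply_add(a, b, c)       # keeps the -2**-60 residue
print(unfused, fused)
```

With these inputs the unfused version loses the low-order term entirely, while the fused version returns -2**-60. Fewer roundings per butterfly is one reason FMA is attractive for FFT-based multiplication, besides the raw throughput gain.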
|
|
|
|
|
|
#4 | |
|
∂²ω=0
Sep 2002
República de California
103·113 Posts |
Quote:
You're right, fused double-pumped SSE-based MADD would give a nice performance boost, but I wonder if you can pull it off without significantly increasing chip wattage. Besides the bugs in, delays with and [what now appear to be] outright lies about Barcelona performance by AMD, one of the things that will hurt them even if they manage to get their newest chips to market and recapture some of the share they're hemorrhaging to Intel is that their power consumption will be much higher than the Core2 series, and the disparity will only get worse once Intel finishes the shrink to 45nm. AMD desperately needs to stay viable in the notebook-PC market, and making more-power-hungry chips is the exact opposite of what they need to do in that respect. Sure, the SSE5-capable CPUs may be "targeted" at the server market, but how much engineering talent is being diverted away from the high-volume-low-power consumer market as a result? I think Intel have got it right [once again] - make a fast, great-performing low-power chip series, sell it in 2-4 core in the PC market, and 8,16-core and above in the server market. That way you leverage the same single highly-optimized technology for both sectors. Such considerations should be even higher on the list for a company like AMD, since they have far less money and manpower to burn. Hate to say it, but it's not looking good for AMD, in terms of either near-term deliverables or long-term strategy. |
|
|
|
|
|
|
#5 |
|
∂²ω=0
Sep 2002
República de California
103×113 Posts |
Forgot to mention: one thing I've been mentioning to anyone who might have an "in" with one of the major chip vendors' hardware groups over the past few years, is the desirability of a hardware instruction which computes a packed interleaved floating-point add/sub pair, in-place, i.e. which takes inputs x and y and returns x+y and x-y in their place. [2 such pairs for SSE2-style packed doubles, 4 for SSE packed singles.]
Note that this is different from the SSE3 addsubpd instruction, which takes packed doubles [x0,x1] and [y0,y1] and returns - taking x as the destination operand - [x0-y0,x1+y1]. The throughput of my above version of interleaved add/sub would be double the number of FP add and sub per instruction, in that it would take "destination" input [x0,x1] and "source" input [y0,y1] and return [x0+y0,x1+y1] and [x0-y0,x1-y1] in their stead. The rationale is this: in computing e.g. x0+-y0, the floating-point sign, exponent and significand-extraction-and-shift-align step need only be done once, and then one can do what amounts to a fixed-point add/sub pair on the resulting normalized significands, along with the usual FP rounding and ensuing repacking [which would need to be done separately for the output of the add and sub, obviously]. Thus, one can get double the throughput for less than double the hardware, without breaking the x86-style two-operand instruction paradigm [except that both src and dest operands would be altered by the operation.] Defining a RISC-style version of this is also not hard, though one would need to relax the RISC 2-input-register-1-output-register paradigm. [But this is already done by many RISC chips for certain instructions - in any event, since the above is done in-place one could treat it as a 3-register RISC instruction with one null or dummy operand, which seems easier than allowing for a fully general [i.e. not nec. in-place] 2-input-register-2-output-register version.] In my estimation this would give significantly more throughput bang for one's hardware buck than fused mul/add for computations such as transforms [FFT and other kinds]. The drawback is that it's less generally useful than FMADD. |
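The difference between the two operations can be sketched in a few lines of Python. This is only a model of the semantics described above, not real intrinsics: `addsubpd` mirrors the SSE3 behaviour as stated in the post, while `paired_addsub` and `radix2_butterfly` are hypothetical names for the proposed instruction and one way it might be used.

```python
def addsubpd(x, y):
    """Model of SSE3 ADDSUBPD on packed doubles: returns [x0 - y0, x1 + y1]."""
    return [x[0] - y[0], x[1] + y[1]]


def paired_addsub(x, y):
    """Model of the proposed (hypothetical) instruction.

    Takes [x0, x1] and [y0, y1] and returns both results:
    sums  = [x0 + y0, x1 + y1]  (would overwrite the destination operand)
    diffs = [x0 - y0, x1 - y1]  (would overwrite the source operand)
    """
    sums = [xi + yi for xi, yi in zip(x, y)]
    diffs = [xi - yi for xi, yi in zip(x, y)]
    return sums, diffs


def radix2_butterfly(a: complex, b: complex):
    """Twiddle-free radix-2 butterfly (a + b, a - b) via one paired_addsub."""
    sums, diffs = paired_addsub([a.real, a.imag], [b.real, b.imag])
    return complex(*sums), complex(*diffs)


print(addsubpd([5.0, 7.0], [2.0, 3.0]))
print(paired_addsub([5.0, 7.0], [2.0, 3.0]))
print(radix2_butterfly(1 + 2j, 3 - 1j))
```

In this model a complex radix-2 butterfly costs a single paired operation, whereas with plain packed add and packed subtract it takes two instructions plus a register copy. That is the kind of saving the post has in mind for FFT-style transforms.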
|
|
|
|
|
#6 | |
|
∂²ω=0
Sep 2002
República de California
103·113 Posts |
Quote:
Was he fired or did he quit? Either way, not a good sign for AMD. The fact that SSE5 has already been relegated to ho-hum by Intel's recent announcement of 256-bit 4-way-floating-double SSE-style instructions in their 2010 chips ain't good news for AMD, either. If only they listened to us prime folks about how to max chip performance. ;) |
|
|
|
|
|
|
#7 |
|
Apr 2003
Berlin, Germany
192 Posts |
Now, after AMD decided to include AVX (with FMA support, which Sandy Bridge won't have) and published some architecture details this November (a lot of that was already published in patents as I documented in my blog - my latest posting is about the FMAC architecture), it's time again to look at the options of doing prime search on either the Bulldozer core architecture or even using the upcoming APU designs (accelerated processing units - combined CPU + GPU designs).
So far there will be eight-core chips, two of which will be combined in the 16-core MCM "Interlagos". Those eight cores are actually four "Bulldozer modules" (optimized dual cores). Each module will have two (vector) 128-bit-wide FMAC units in a shared FPU (a kind of SMT) and two integer cores (or clusters), each with four pipelines, a scheduler and an L1 D$. Max throughput of one of these FMAC units per cycle will be either 1x128 bit FMA or 1x128 bit FADD and 1x128 bit FMUL. One thread (running on one integer core, 1 thread per such core) can use one or both FMAC units per cycle, depending on availability. The L1 D$s of both integer cores will feed the FPU, 2 loads per cycle (width is unknown so far). As it seems (I just found some evidence on AMD slides), there will be some boost technology more advanced than "Turbo Boost". I described it in one of my power-management-related blog postings (see tags). |
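As a rough back-of-the-envelope check, the unit counts above can be turned into a theoretical peak double-precision figure. The sketch below is only arithmetic on the numbers in this post. The 3 GHz clock is an assumption (no clock speeds have been announced), an FMA is counted as two floating-point operations, and the function name is made up for illustration.

```python
def peak_dp_gflops(modules, fmac_units_per_module, vector_bits, clock_ghz,
                   flops_per_fma=2):
    """Theoretical peak DP GFLOPS if every FMAC unit issues one FMA per cycle."""
    doubles_per_vector = vector_bits // 64  # 128-bit vector holds 2 doubles
    flops_per_cycle = (modules * fmac_units_per_module
                       * doubles_per_vector * flops_per_fma)
    return flops_per_cycle * clock_ghz


CLOCK_GHZ = 3.0  # assumed clock, not an announced figure

# Eight-core chip = four modules, two 128-bit FMAC units per module.
eight_core = peak_dp_gflops(4, 2, 128, CLOCK_GHZ)

# "Interlagos" MCM = two eight-core chips.
interlagos = 2 * eight_core

# One thread that manages to use both FMAC units of its module every cycle.
single_thread = peak_dp_gflops(1, 2, 128, CLOCK_GHZ)

print(eight_core, interlagos, single_thread)
```

Under these assumptions an eight-core chip peaks at 96 DP GFLOPS, the MCM at 192, and a single thread at 24. Real code would reach this only if it consists entirely of FMAs and the loads keep up.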
|
|
|
|
|
#8 |
|
Oct 2008
n00bville
23×7×13 Posts |
More interesting will be OpenCL support with the new GeForce Tesla cards ...
|
|
|
|
|
|
#9 | |
|
Apr 2003
Berlin, Germany
192 Posts |
Quote:
Larrabee (already reaches 1 TFLOPS - DP I suppose), Hemlock (this dual GPU card from ATI with 4.6 TFLOPS in SP and 0.93 TFLOPS in DP - theoretical peak) and last but not least Fermi-based systems, and look at their moves to integrate shader cores on general-purpose processors. But the first variants will be less powerful. E.g. Llano with ~480 shader units (~130 DP GFLOPS) - just a bit more FP power than an 8-core Bulldozer at 3 GHz might have. And that at about the same power consumption and without the good GPU mem interface. Maybe it will take another step until heterogeneous computing has gone so far that the power of many small processors (similar to shaders today) is simply there without any overhead - living in the same coherent memory space and so on.
Last fiddled with by Dresdenboy on 2009-11-30 at 20:49 |
|
|
|
|
|
|
#10 |
|
Jan 2008
France
10468 Posts |
It's SP, I think. At least that's what this article claims: http://www.theregister.co.uk/2009/11...ote/page2.html
|
|
|
|