#12
Sep 2002
29616 Posts
Could x87 and SSE2 code be run in parallel or interleaved?
I don't think there are any register dependencies: the x87 registers are separate from the SSE2 registers. I am not sure about the floating-point pipelines, though, or whether there would be any resource contention. Do x87 and SSE2 use the same pipeline? If it works, it could give an almost 2x improvement.
#13
Aug 2002
26·5 Posts
Quote:
#14
Sep 2002
2×331 Posts
I know that the MMX registers are aliased onto the x87 registers, but from what I understand the SSE/SSE2 registers are separate, so using x87 along with SSE2 doesn't corrupt either one (registers or state).
Under 64-bit long mode there are 16 SSE2 registers but still only 8 x87 registers. There may still be contention for the 3 floating-point units (not identical) that do the calculations, whether the code uses x87 (64-bit doubles/80-bit extended) or SSE2 (64-bit doubles); maybe Dresdenboy knows. Maybe there is some way of partially overlapping the code for both, or even alternating (complete x87, complete SSE2, complete x87, etc.), and still getting a performance gain.
#15
Apr 2003
Berlin, Germany
16916 Posts
Quote:
And on AMD CPUs all of these registers are mapped to the same internal FPU register file (with 88 entries).
#16
Jan 2003
7×29 Posts
Quote:
#17
Sep 2002
12268 Posts
It wouldn't even need that (dual SSE2 units), just a second set of the 3 FP (functional) units: FSTORE, FADD, FMUL. Then 2 muls or 2 adds could be done in parallel using either SSE2 or x87.
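To put rough numbers on that, here is a toy throughput model, not a description of how the real K7/K8 scheduler works. It assumes every FADD or FMUL unit can start one operation per cycle and it ignores latencies, dependencies, and the FSTORE unit. The name `min_issue_cycles` and the operation counts are made up for illustration.

```python
from math import ceil

def min_issue_cycles(n_add, n_mul, fadd_units=1, fmul_units=1):
    """Lower bound on cycles needed to issue n_add adds and n_mul multiplies.

    Idealized: each unit starts one operation per cycle and never stalls.
    """
    return max(ceil(n_add / fadd_units), ceil(n_mul / fmul_units))

# Hypothetical workload: 1000 adds and 1000 multiplies.
one_set = min_issue_cycles(1000, 1000)          # one FADD + one FMUL unit
two_sets = min_issue_cycles(1000, 1000, 2, 2)   # a second set of units
print(one_set, two_sets)  # 1000 500
```

In this model it makes no difference whether the operations arrive as x87 or SSE2 instructions. If both feed the same FADD and FMUL units, interleaving them does not raise the bound by itself; adding units does.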
#18
Apr 2003
Berlin, Germany
192 Posts
Quote:
DDB
#19
Jan 2003
7×29 Posts
It took them 3 years to go from the K7 architecture to the K8 (and it's still only in servers at this time). I wouldn't want to hold my breath for the K9 :(
Ironically, it's probably cheaper (in terms of silicon real estate) for them to implement just a 512 KB cache (instead of 1 MB) and add those 3 full FPU blocks instead. More performance, yet more chips per wafer. Cache just uses up so many transistors. Still, this is just wishful thinking... The P4 seems to be the way to go for Prime95 in the near future.
#20
Apr 2003
Berlin, Germany
5518 Posts
Quote:
Cache is useful if the memory is slow. But the gains from adding cache get smaller the bigger the cache already is. The probability of a cache hit is already high, so it is hard to push it much higher: by doubling the cache we gained about 3-10% with Barton. And although the P4 can also do 1 FADD and 1 FMUL per cycle, it has other limitations: no 64-bit integer multiplier, only 8 SSE2 registers rather than 16, and AFAIK it needs to load numbers from L2, never L1 (I'll check that).
DDB
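A rough way to see the diminishing returns is the usual single-level average-access-time formula. This is only a sketch: the cycle counts and hit rates below are hypothetical, not measurements from Barton or any other chip, and the function names are mine.

```python
def avg_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average cycles per memory access for one cache level."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

def headroom(hit_rate, hit_cycles, miss_cycles):
    """Fraction of access time that a perfect (100% hit) cache would still save."""
    now = avg_access_cycles(hit_rate, hit_cycles, miss_cycles)
    return (now - hit_cycles) / now

# Hypothetical latencies: 20 cycles on a hit, 200 cycles on a miss.
HIT, MISS = 20, 200
for rate in (0.80, 0.90, 0.95, 0.98):
    print(f"hit rate {rate:.2f}: headroom {headroom(rate, HIT, MISS):.1%}")
```

The higher the hit rate already is, the less even an ideal cache could still save, and memory access time is only one part of total run time, so the overall gain is smaller again.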
#21
∂2ω=0
Sep 2002
República de California
2×32×647 Posts
Quote:
To give you an idea of the impact of being able to do any two of (fadd, fmul, madd) per cycle: on a 550 MHz HP PA-RISC (whose dual mul/add FPUs are similar to those of the Itanium) I get ~0.12 sec/iteration at a 1024K FFT length. That is only ~2x slower than my 2 GHz P4 gets using Prime95, i.e. nearly double the per-cycle floating-point throughput. I'm not yet getting performance that good from the Itanium (there I'm getting slightly better per-cycle performance than on the Alpha 21264), but the compiler technology for the Itanium is not yet as mature.
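The "nearly double" figure can be checked with a few lines of arithmetic. The PA-RISC numbers come from the post; the P4 time of 0.06 sec/iteration is an assumption, obtained by reading "~2x slower" as exactly half of 0.12 sec.

```python
def cycles_per_iteration(seconds_per_iter, clock_hz):
    """Clock cycles spent on one iteration."""
    return seconds_per_iter * clock_hz

pa_risc = cycles_per_iteration(0.12, 550e6)  # figures quoted in the post
p4 = cycles_per_iteration(0.06, 2e9)         # assumed: exactly half of 0.12 s

# How many times more cycles the P4 spends per iteration.
ratio = p4 / pa_risc
print(round(ratio, 2))  # 1.82
```

So under that assumption the PA-RISC gets through an iteration in roughly 66 million cycles against roughly 120 million on the P4, about 1.8x the work per cycle.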
#22
Aug 2002
14016 Posts
Quote: