mersenneforum.org  

2003-08-05, 18:33   #12
dsouza123
Sep 2002
296₁₆ Posts

Could both x87 and SSE2 code be run in parallel, or interleaved?

I don't think there are any hardware dependencies (registers); the x87 registers are separate from the SSE2 registers.

I am not sure about the floating-point pipelines, though, or whether there would be any resource contention.

Do both x87 and SSE2 use the same pipeline?

If it works, it would give an almost 2x improvement.

2003-08-05, 20:20   #13
ColdFury
Aug 2002
2⁶·5 Posts

Quote:
> I don't think there are any hardware dependencies (registers); the x87 registers are separate from the SSE2 registers.

There's a (relatively) lengthy mode change required when going from x87 to SSE code. This is because MMX, SSE, and (I think?) SSE2 all use the FPU registers, aliased.

2003-08-05, 20:56   #14
dsouza123
Sep 2002
2×331 Posts

I know that MMX registers are the same as (aliased to) the x87 registers, but from what I understand the SSE/SSE2 registers are separate; using x87 along with SSE2 doesn't corrupt either one (registers or state).

Under 64-bit long mode there are 16 SSE2 registers but still only 8 x87 registers.

There still may be contention for the 3 floating-point units (not identical) that do the calculations, whether using x87 (64-bit doubles/80-bit extended) or SSE2 (64-bit doubles). Maybe Dresdenboy knows.

Maybe there is some way of partially overlapping the code for both, or even alternating (complete x87, complete SSE2, complete x87, etc.), and still getting a performance gain.

2003-08-06, 13:02   #15
Dresdenboy
Apr 2003
Berlin, Germany
169₁₆ Posts

Quote:
Originally Posted by dsouza123
> I know that MMX registers are the same as (aliased to) the x87 registers, but from what I understand the SSE/SSE2 registers are separate; using x87 along with SSE2 doesn't corrupt either one (registers or state).
>
> Under 64-bit long mode there are 16 SSE2 registers but still only 8 x87 registers.
>
> There still may be contention for the 3 floating-point units (not identical) that do the calculations, whether using x87 (64-bit doubles/80-bit extended) or SSE2 (64-bit doubles). Maybe Dresdenboy knows.
>
> Maybe there is some way of partially overlapping the code for both, or even alternating (complete x87, complete SSE2, complete x87, etc.), and still getting a performance gain.

The functional units are the limit here: you can't do more than one double-precision mul and one add (at the same time) per cycle on Athlon/Opteron/P4.

And on AMD CPUs all these registers are mapped to the same internal FPU register file (with 88 regs).
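
A rough way to see why interleaving would not help, as a minimal sketch only: assume a deliberately simplified model with one add pipe and one multiply pipe, each accepting one double-precision operation per cycle and shared by x87 and SSE2 code. The function name `min_cycles` and the operation counts are made up for illustration; latency, decode width and memory traffic are all ignored.

```python
from math import ceil

def min_cycles(n_add, n_mul, add_pipes=1, mul_pipes=1):
    """Lower bound on cycles when each pipe accepts one double-precision op per cycle."""
    return max(ceil(n_add / add_pipes), ceil(n_mul / mul_pipes))

# Hypothetical workload: 600 adds and 400 multiplies, all written as SSE2 code.
all_sse2 = min_cycles(600, 400)

# The same work split evenly between x87 and SSE2 code. Both instruction
# streams feed the same two pipes, so the pipes still see the combined totals.
x87_adds, x87_muls = 300, 200
sse2_adds, sse2_muls = 300, 200
interleaved = min_cycles(x87_adds + sse2_adds, x87_muls + sse2_muls)

print(all_sse2, interleaved)  # prints: 600 600
```

Under these assumptions the split changes nothing, because the bound depends only on the total work per pipe. Only adding pipes (for example `add_pipes=2, mul_pipes=2`) lowers it.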

2003-08-06, 16:15   #16
db597
Jan 2003
7×29 Posts

Quote:
> The functional units are the limit here: you can't do more than one double-precision mul and one add (at the same time) per cycle on Athlon/Opteron/P4.

Looks like the only way an Athlon64/Opteron can beat a P4 is if it has dual SSE2 units (which it doesn't). Otherwise there's just no way to make up for the lack of clock speed.

2003-08-06, 16:44   #17
dsouza123
Sep 2002
1226₈ Posts

It wouldn't even need that (dual SSE2 units), just a second set of the 3 FP functional units: FSTORE, FADD, FMUL. Then 2 muls or 2 adds could be done in parallel using either SSE2 or x87.

2003-08-06, 18:51   #18
Dresdenboy
Apr 2003
Berlin, Germany
19² Posts

Quote:
Originally Posted by dsouza123
> It wouldn't even need that (dual SSE2 units), just a second set of the 3 FP functional units: FSTORE, FADD, FMUL. Then 2 muls or 2 adds could be done in parallel using either SSE2 or x87.

If we believe some specs which popped up somewhere on the net, the K9 will have 3 full FPU blocks :) But for now we only have 1 FADD, 1 FMUL, 1 64-bit integer multiplier (with 128-bit result) and 3 64-bit integer adders. Unfortunately the decoder can decode only 3 instructions per cycle.

DDB

2003-08-06, 23:10   #19
db597
Jan 2003
7×29 Posts

It took them 3 years to go from the K7 architecture to the K8 (and it's still only in servers at this time). I'd not like to hold my breath for the K9 :(

Ironically, it's probably cheaper (in terms of silicon real estate) for them to implement just a 512KB cache (instead of 1MB) and add those 3 full FPU blocks instead. More performance, yet more chips per wafer. Cache just uses up so many transistors.

Yet this is just wishful thinking... P4 seems to be the way to go for Prime95 in the near future.

2003-08-07, 07:32   #20
Dresdenboy
Apr 2003
Berlin, Germany
551₈ Posts

Quote:
Originally Posted by db597
> Yet this is just wishful thinking... P4 seems to be the way to go for Prime95 in the near future.

Prime95 is a special case. It's so heavily optimized that it's close to the maximum possible FPU throughput. In normal cases the K7 FPU is nearly always underutilized. Opteron helps by adding regs, enough for up to 32 double values. So instead of loading, storing and swapping (which you had to do without SSE2 when you wanted double precision, which SSE alone doesn't offer), you can do many more calculations at once. And the cache is even bigger and faster for steady delivery of numbers to the FPU.

Cache is useful if the memory is slow. But the wins from adding cache get smaller the bigger the cache already is. The probability of a cache hit is already high, so it is hard to get it much higher - by doubling the cache we won about 3-10% with Barton.
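
The diminishing-returns argument can be sketched with the usual average-access-time formula (hit cost plus miss fraction times miss penalty). Every number below is hypothetical and chosen only to show the shape of the curve: none of it is measured on Barton or any other chip, and it models memory access time only, not whole-program speed.

```python
def avg_access_cycles(hit_rate, hit_cycles=3, miss_penalty_cycles=100):
    """Average memory access time: hit cost plus the miss fraction times the miss penalty."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

# Hypothetical hit rates for successive doublings of the cache size.
hit_rates = [0.90, 0.95, 0.97, 0.98]
times = [avg_access_cycles(h) for h in hit_rates]

# Compare each cache size with the next larger one.
for before, after in zip(times, times[1:]):
    print(f"{before:.1f} -> {after:.1f} cycles (saves {before - after:.1f})")
```

With these made-up hit rates each doubling saves fewer cycles than the one before, which is the effect described above: once most accesses already hit, there is little miss time left to remove.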

And although the P4 also can do 1 FADD/1FMUL per cycle, it has other limitations - no 64bit integer multiplier, no 16 SSE2 regs, and AFAIK it needs to load numbers from L2, never L1 (I'll check that).

DDB

2003-08-07, 16:48   #21
ewmayer
2ω=0
Sep 2002
República de California
2×3²×647 Posts

Quote:
Originally Posted by Dresdenboy
> And although the P4 also can do 1 FADD/1FMUL per cycle, it has other limitations - no 64bit integer multiplier, no 16 SSE2 regs, and AFAIK it needs to load numbers from L2, never L1 (I'll check that).

Please correct me if I'm wrong, but I believe the SSE2 allows any combination of 2 fadd/fmul ops per cycle. For highly optimized FFT code, which tends to be add-dominated, being able to do 2 fadd per cycle is a huge gain. If the Alpha 21264 could do that, instead of merely being equal to the P4 in terms of per-cycle performance (the 21264's larger caches and such help make up for its lesser maximal FPU throughput), I'd be getting nearly double the per-cycle performance of the P4. (Though of course the P4's clock rates are much higher than those of any currently available 64-bit RISC chip.)
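
To illustrate why an add-dominated mix cares about this, here is a minimal sketch comparing two idealized FPU layouts: one dedicated add pipe plus one dedicated multiply pipe, versus two pipes that each accept either operation. The function names are hypothetical, the operation counts are those of a textbook radix-4 butterfly, and the sketch says nothing about which real chip behaves which way.

```python
from math import ceil

def cycles_dedicated(n_add, n_mul):
    """One add pipe plus one multiply pipe: the busier pipe sets the pace."""
    return max(n_add, n_mul)

def cycles_flexible(n_add, n_mul, pipes=2):
    """Pipes that each accept either an add or a multiply every cycle."""
    return ceil((n_add + n_mul) / pipes)

# Textbook radix-4 butterfly: 3 complex twiddle multiplies (4 mul + 2 add each)
# and 8 complex additions (2 real adds each), i.e. 22 adds and 12 multiplies.
n_add = 3 * 2 + 8 * 2
n_mul = 3 * 4

print(cycles_dedicated(n_add, n_mul))  # prints: 22
print(cycles_flexible(n_add, n_mul))   # prints: 17
```

In this idealized model the flexible layout wins because the multiply pipe no longer sits idle while the adds queue up, and the gap widens as the mix becomes more add-heavy (with no multiplies at all it reaches 2x).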

To give you an idea of the impact of being able to do any two of (fadd, fmul, madd) per cycle: on a 550MHz HP PA-RISC (whose dual mul/add FPUs are similar to those of the Itanium) I get ~0.12 sec/iteration for a 1024K FFT length. That is only ~2x slower than my 2GHz P4 gets using Prime95, i.e. nearly double the per-cycle floating-point throughput. I'm not yet getting as good performance from the Itanium (there I'm getting slightly better per-cycle performance than on the Alpha 21264), but the compiler technology for the Itanium is not yet as mature.
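
The per-cycle comparison can be checked with a few lines of arithmetic. The 550MHz, 2GHz and 0.12 sec figures come from the paragraph above; the P4 time is an assumption, reading "~2x slower" as exactly 2x, so the resulting ratio is only approximate.

```python
def cycles_per_iteration(seconds_per_iter, clock_hz):
    """Clock cycles spent on one iteration at the given clock rate."""
    return seconds_per_iter * clock_hz

pa_risc = cycles_per_iteration(0.12, 550e6)  # figures quoted in the post
p4 = cycles_per_iteration(0.12 / 2, 2e9)     # assumed: "~2x" taken as exactly 2x

print(f"PA-RISC: {pa_risc / 1e6:.0f}M cycles, P4: {p4 / 1e6:.0f}M cycles, "
      f"ratio {p4 / pa_risc:.2f}")
# prints: PA-RISC: 66M cycles, P4: 120M cycles, ratio 1.82
```

So under that reading the P4 spends roughly 1.8 times as many cycles per iteration, consistent with "nearly double the per-cycle floating-point throughput" for the PA-RISC.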

2003-08-07, 20:15   #22
ColdFury
Aug 2002
140₁₆ Posts

Quote:
> AFAIK it needs to load numbers from L2, never L1 (I'll check that).

I believe that floating point automatically goes to the L2. Integer should use the L1 (otherwise the L1 wouldn't be used at all!). This isn't a problem in practice because most floating point working sets don't fit in the L1 anyways.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.