mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Prescott impact to prime95. (https://www.mersenneforum.org/showthread.php?t=769)

Dresdenboy 2003-07-21 06:50

[quote="ColdFury"]I really wish AMD moved to a 3 operand format with x86-64, but I guess that would have required too much redesign of the decoders.[/quote]

You know, it was planned (maybe even with more than 16 XMM regs), and I don't think it would have been more complex to decode than SSE2. Maybe the Athlon/Opteron RISC cores already use such an instruction format for their 88-entry register file, which would have made it very easy to decode. But it's better to make existing applications faster (those with optimized SSE2 code and poor x87 code) than to require a recompile and 64-bit mode to run.

[quote="S3SJK"]Out of interest what performance advantages would that give?[/quote]

We want to calculate a=x²+4x.
Just look at this pseudo code (for a RISC-like architecture, such as the execution cores of x86 CPUs):
[code:1]// format is instruction dest, source
fpload r0, [x]
fpload r1, [const_4]
fpmov r2, r0 ;we need it later again
fpmul r0, r0 ; x²
fpmul r2, r1 ; 4x
fpadd r0, r2 ; x²+4x
fpstore [a], r0
[/code:1]

With 3 operands it might look like this:
[code:1]// format is instruction dest, source1, source2
fpload r0, [x]
fpload r1, [const_4]
fpmul r2, r0, r0 ; x²
fpmul r1, r0, r1 ; 4x
fpadd r3, r1, r2 ; x²+4x
fpstore [a], r3
[/code:1]

We saved one instruction in this simple calculation. But if you look at complex SSE2 or x87 code, you'll see a lot of shuffling, moving and saving of registers (so that they don't get destroyed all the time). While x86 CPUs have to move and save, the other CPUs (Alpha, Power, even G5) continue to calculate.
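To make the instruction count concrete, here is a minimal sketch of a toy interpreter for the two listings above. It is only an illustration: the `evaluate` helper and the tuple encoding are made up for this example, and it counts instructions only, saying nothing about latency or decode cost on real hardware.

```python
# Toy interpreter for the pseudo code above. The helper names are hypothetical;
# this counts instructions only and does not model any real x86 or RISC core.

def evaluate(program, x):
    """Run a list of instruction tuples; return (value stored to a, instruction count)."""
    memory = {"x": x, "const_4": 4.0}
    regs = {}
    ops = {"fpmul": lambda p, q: p * q, "fpadd": lambda p, q: p + q}
    for op, *args in program:
        if op == "fpload":        # fpload dest, [addr]
            regs[args[0]] = memory[args[1]]
        elif op == "fpstore":     # fpstore [addr], source
            memory[args[0]] = regs[args[1]]
        elif op == "fpmov":       # fpmov dest, source
            regs[args[0]] = regs[args[1]]
        elif len(args) == 2:      # 2-operand form: dest doubles as the first source
            regs[args[0]] = ops[op](regs[args[0]], regs[args[1]])
        else:                     # 3-operand form: dest, source1, source2
            regs[args[0]] = ops[op](regs[args[1]], regs[args[2]])
    return memory["a"], len(program)

TWO_OPERAND = [
    ("fpload", "r0", "x"),
    ("fpload", "r1", "const_4"),
    ("fpmov", "r2", "r0"),        # extra copy, because fpmul overwrites r0
    ("fpmul", "r0", "r0"),        # x²
    ("fpmul", "r2", "r1"),        # 4x
    ("fpadd", "r0", "r2"),        # x²+4x
    ("fpstore", "a", "r0"),
]

THREE_OPERAND = [
    ("fpload", "r0", "x"),
    ("fpload", "r1", "const_4"),
    ("fpmul", "r2", "r0", "r0"),  # x², r0 stays intact
    ("fpmul", "r1", "r0", "r1"),  # 4x
    ("fpadd", "r3", "r1", "r2"),  # x²+4x
    ("fpstore", "a", "r3"),
]

for name, program in (("2-operand", TWO_OPERAND), ("3-operand", THREE_OPERAND)):
    result, count = evaluate(program, 3.0)
    print(name, result, count)
```

Both programs store the same value for a; the only difference is the extra fpmov in the 2-operand version (7 instructions against 6).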

DDB

ebx 2003-07-22 03:39

I got a better compiler for a=x²+4x:

[code:1]// format is instruction dest, source
fpload r0, [x]
fpmov r1, r0 ;we need it later again
fpadd r0, [const_4] ;x+4
fpmul r1, r0 ; x²+4x
fpstore [a], r1
[/code:1]

Moving data between regs is the fastest instruction. You can't compare it to fpmul. Loading from or storing to memory is usually more than one instruction. That further brings down the weight of fpmov.

2 operand vs 3 operand is a long-running debate. There isn't any clear winner.
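The shorter sequence works because x²+4x can be factored as x·(x+4). A minimal sketch that traces the five instructions one statement at a time (the function names `factored` and `direct` are made up for this illustration):

```python
# Trace of the five-instruction sequence, one Python statement per instruction.
# Function names are hypothetical and only serve this illustration.

def factored(x, const_4=4.0):
    """Compute x * (x + 4) the way the 2-operand listing does."""
    r0 = x                # fpload  r0, [x]
    r1 = r0               # fpmov   r1, r0        keep a copy of x
    r0 = r0 + const_4     # fpadd   r0, [const_4] memory operand, no separate load
    r1 = r1 * r0          # fpmul   r1, r0        x * (x + 4)
    return r1             # fpstore [a], r1

def direct(x):
    """Compute x² + 4x without factoring, for comparison."""
    return x * x + 4.0 * x

# Small integer-valued inputs are exact in floating point, so both forms agree.
for value in range(-5, 6):
    assert factored(float(value)) == direct(float(value))
```

Note that for arbitrary floating-point inputs the two forms can round differently, so the rewrite is an algebraic identity rather than a guaranteed bit-for-bit match.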

Dresdenboy 2003-07-22 06:02

You are right. I didn't optimize my code; I just wanted to show the difference. And because I was thinking about a RISC architecture when creating this example, I also didn't count on memory operands for fpadd/fpmul.

At least with a simple addressing mode, the load/store can be handled easily by the hardware.

IMO the advantage of 3-operand instructions is that you may use a different destination register or just one of the sources, whichever fits the algorithm best, while with 2 operands you are always required to overwrite one of the sources. The disadvantage is that the opcode needs additional bits for addressing the third register.
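The encoding cost can be estimated with a little arithmetic. This is a rough sketch that assumes plain fixed-width binary register fields, which is a simplification and not how x86 actually lays out its operand bytes; the helper name `register_field_bits` is made up for this example.

```python
# Rough estimate of the bits spent on register operands in one instruction.
# Assumption: every operand is a plain binary register number of equal width.

def register_field_bits(num_registers, num_operands):
    """Return the total number of bits needed to name num_operands registers."""
    bits_per_operand = (num_registers - 1).bit_length()  # e.g. 16 regs -> 4 bits
    return bits_per_operand * num_operands

for regs in (8, 16, 88):
    two = register_field_bits(regs, 2)
    three = register_field_bits(regs, 3)
    print(f"{regs} registers: 2-operand {two} bits, 3-operand {three} bits, extra {three - two}")
```

With 16 registers the third operand costs 4 more bits per instruction; addressing an 88-entry register file directly would cost 7 bits per operand.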

DDB

