mersenneforum.org  

Old 2009-02-21, 15:14   #12
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair
15A6₁₆ Posts

Quote:
Originally Posted by ldesnogu
Anyway, that doesn't explain why the various x86 SIMD instruction sets are so odd. I guess it's the result of adding a few instructions at each generation, instead of spending a few years in R&D thinking about what is really needed in the longer term.
I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.

It seems to me that doing FFT on multi-megabit numbers was not a high priority for Intel/AMD/whoever. I would imagine video/audio encoding/decoding is very high on their list of things to consider when designing new instructions.

How many people do multi-precision arithmetic compared to how many play games, listen to music and/or watch movies? I imagine the ratio is a rather small one.
Old 2009-02-21, 15:50   #13
ldesnogu
 
Jan 2008
France
1000010000₂ Posts

Quote:
Originally Posted by retina
I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.
No, they are not random additions, but they certainly are not well thought out, which results in a pile of instructions that doesn't look very coherent. Intel's AVX looks better, and Altivec (and the newer Power SIMD ISA) looks nicer too.

But as you wrote, these were not designed with multi-precision (MP) arithmetic in mind.
Old 2009-02-21, 16:01   #14
xilman
Bamboozled!
 
May 2003
Down not across
2³×13×97 Posts

Quote:
Originally Posted by ewmayer
Sounds pretty good, doesn't it? Sounds like "if you're done crunching datum <foo> whose value is currently stored in an xmm register and you know that <foo> won't be used (either in read or write mode) for a while, you should use this special MOV instruction to write it back to memory, because this bypasses the cache hierarchy and thus allows more soon-to-be-used data to enter the caches without risking being kicked out by <foo> on its way back to main memory."
The first thing I thought when reading this paragraph was "cache side-channel attacks". Use Google if that phrase means nothing to you.

Paul
Old 2009-02-21, 16:44   #15
__HRB__
 
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts

Quote:
Originally Posted by retina
I strongly doubt that the current instruction set(s) is/are just a random collection of things someone threw in before going to lunch.
I agree.

My guess is that it's the result of one serious rochambeau tournament that took the better part of the afternoon, too.

The guy in the sales department won a lot: he got to add 20 new instructions by aliasing them to existing instructions. And the guy responsible for putting together the "Instruction Set Reference" probably has a template and "add redundant instruction" macroed to F11.

Quote:
Originally Posted by retina
It seems to me that doing FFT on multi-megabit numbers was not a high priority for Intel/AMD/whoever. I would imagine video/audio encoding/decoding is very high on their list of things to consider when designing new instructions.
I doubt it.

Somewhere there's a forum specializing in video encoding, with guys concluding that pmuludq must be for multi-megabit computations, because it's useless for anything else.

The rest of the thread is devoted to the missing 16-bit floating-point support, and wtf to do with 205 idle cycles instead of 200 now that paddusw, psubsw, etc. have eliminated some instructions from the code.

Last fiddled with by __HRB__ on 2009-02-21 at 16:46
Old 2009-02-21, 16:57   #16
akruppa
 
"Nancy"
Aug 2002
Alexandria
2,467 Posts

While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM.
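
The "garbage in bits 8,...,63" complaint can be illustrated with a small Python model. This is only a sketch of the register values, not of the dependency tracking inside the CPU, and the helper names `setcc` and `movzx_byte` are made up for the illustration.

```python
MASK64 = (1 << 64) - 1

def setcc(reg: int, flag: bool) -> int:
    """Model of setcc on the low byte of a 64-bit register:
    only bits 0..7 are written, bits 8..63 keep their old contents."""
    return (reg & ~0xFF & MASK64) | int(flag)

def movzx_byte(reg: int) -> int:
    """Model of zero-extending the low byte into the full register
    (what a movzbl/movzx after the setcc achieves)."""
    return reg & 0xFF

stale = 0xDEADBEEF_CAFEF00D          # whatever the register held before
after_setcc = setcc(stale, True)     # flag was set

print(hex(after_setcc))              # upper bits still hold the stale value
print(movzx_byte(after_setcc))       # a clean 0/1 that can be added later
```

Adding `after_setcc` directly to an accumulator would add the stale upper bits too, which is why an extra zero-extension (or zeroing the register first) is needed.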

Alex
Old 2009-02-21, 17:14   #17
ldesnogu
 
Jan 2008
France
210₁₆ Posts

Quote:
Originally Posted by akruppa
While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM.
I had the same thought and had to resort to zero-extending the result of setcc with a movzbl. After all, there's a price to pay for a monstrosity that has lived far too long and that doesn't show any will to die the horrible death it deserves.
Old 2009-02-21, 17:15   #18
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair
2·17·163 Posts

Quote:
Originally Posted by akruppa
While we're whining about instruction sets, aside from the general travesty of a cpu that is the x86 architecture, why-oh-why is there no setcc that sets a whole register to 0 or 1? With the current instructions, not only do you get a false dependency thanks to a partial register write, but you can't store the carry flag for later easy adding either, since the reg will still contain garbage in bits 8,...,63. This drove me mad while writing mulredc code for GMP-ECM.

Alex
I thought the partial register stall was only in the P4. Earlier CPUs were not affected and later CPUs used the older architecture as a starting point and also avoided it. However I always use sbb eax,eax to replicate the carry through a register and then just sub instead of add it later to accumulate carries. And only ever use setcc for accumulating multiple flag tests to avoid long lists of branches.
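
The sbb-then-sub idea can be sketched in Python with 32-bit wraparound arithmetic. This models only the values involved (it ignores that sbb also rewrites the flags), and the function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """32-bit add; returns (result, carry flag)."""
    total = a + b
    return total & MASK32, total >> 32

def sbb_same(carry):
    """sbb reg,reg computes reg - reg - CF: 0 if CF=0, 0xFFFFFFFF (-1) if CF=1."""
    return (0 - carry) & MASK32

def sub32(a, b):
    """32-bit subtract with wraparound."""
    return (a - b) & MASK32

def count_carries(pairs):
    """Accumulate the carries of several additions by subtracting the
    0/-1 mask, since subtracting -1 is the same as adding 1."""
    acc = 0
    for a, b in pairs:
        _, cf = add32(a, b)
        mask = sbb_same(cf)
        acc = sub32(acc, mask)
    return acc

print(count_carries([(0xFFFFFFFF, 1), (1, 2), (0x80000000, 0x80000000)]))
```

Two of the three additions overflow 32 bits, so the accumulator ends up counting two carries even though only subtraction was used.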

Last fiddled with by retina on 2009-02-21 at 17:16
Old 2009-02-21, 18:01   #19
akruppa
 
"Nancy"
Aug 2002
Alexandria
2,467 Posts

Quote:
Originally Posted by retina
I thought the partial register stall was only in the P4. Earlier CPUs were not affected and later CPUs used the older architecture as a starting point and also avoided it. However I always use sbb eax,eax to replicate the carry through a register and then just sub instead of add it later to accumulate carries. And only ever use setcc for accumulating multiple flag tests to avoid long lists of branches.
The AMD64 software optimization guide lists partial register reads/writes as something to avoid. I don't know if the Core2 identifies and ignores these false dependencies, but my (somewhat naive) understanding is that keeping track of the dependencies between instructions is hard enough even without treating each register as 4 independent pieces, where one instruction can write to one or several pieces.

Yes, "sbb reg, same reg" can be used to set a full register according only to the carry flag, and recent cpus even know that this instruction does not depend on the previous value of "reg," so no false dependency occurs here. However, I use three registers as a kind of ring buffer for the carry propagation, and need to add to the register that holds the carry... so having the carry value negated was a bit of a problem. I thought about flipping the sign of the partial result in every pass so I could use sbb and then keep subtracting. I didn't do that, though... I may try some time.

Alex
Old 2009-02-21, 18:29   #20
__HRB__
 
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts

Quote:
Originally Posted by retina
sbb eax,eax
That's clever. I quickly deleted my post with an inferior solution.

I was thinking about using rcr or rcl to create a carry cache and use clc to remove dependencies to do several multiprecision adds in parallel.

Your solution is much better. You da man!

Here's why I'm so excited:

Let's suppose we want to do 2 multiprecision adds in parallel, with eax and edx as temporaries.

After sbb eax,eax, an add edx,edx will

a) remove the dependency on the carry
b) restore the carry

so we can do load/adc/store N times
ending with sbb edx,edx
and setting up the second stream with add eax,eax

I count 4 + 2*3*N instructions, so if, e.g., N==4 and Core 2 can do 1 add/clock, we're doing 8 adc steps (4 limbs in each of the 2 multiprecision adds) in 12 cycles. This is 25% faster than the naive implementation.

If we're doing X+Y and X-Y and can reuse X and/or Y, Athlons might be able to do more than one 64-bit adc/clock.

This would allow nice butterflies for medium sized power-of-two moduli.

Edit: Two adds can be replaced with shifts. Duh. 10/8=1.25 clocks/adc for Core 2.
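
The save/restore scheme in this post can be checked for correctness (not for speed) with a small Python model. It is a sketch under assumptions: 32-bit limbs stored little-endian, and hypothetical names such as `interleaved_add` and `chunk` (the N of the post). It says nothing about the cycle counts above.

```python
MASK32 = 0xFFFFFFFF

def adc32(a, b, cf):
    """adc: 32-bit add with carry in; returns (result, carry out)."""
    total = a + b + cf
    return total & MASK32, total >> 32

def save_carry(cf):
    """sbb reg,reg: park CF in a register as 0 or 0xFFFFFFFF."""
    return (0 - cf) & MASK32

def restore_carry(mask):
    """add reg,reg (or a left shift by 1): the carry out equals the parked CF."""
    return (mask + mask) >> 32

def to_limbs(value, n):
    return [(value >> (32 * i)) & MASK32 for i in range(n)]

def from_limbs(limbs):
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))

def interleaved_add(x1, y1, x2, y2, chunk):
    """Compute x1+y1 and x2+y2, switching streams every `chunk` limbs.
    All four limb lists must have the same length."""
    n = len(x1)
    xs, ys = (x1, x2), (y1, y2)
    out = [[0] * n, [0] * n]
    saved = [0, 0]                      # eax and edx in the post, carries clear
    for start in range(0, n, chunk):
        for s in (0, 1):
            cf = restore_carry(saved[s])            # add reg,reg
            for i in range(start, min(start + chunk, n)):
                out[s][i], cf = adc32(xs[s][i], ys[s][i], cf)   # load/adc/store
            saved[s] = save_carry(cf)               # sbb reg,reg
    return out[0], restore_carry(saved[0]), out[1], restore_carry(saved[1])

a, b = 2**256 - 1, 1
c, d = 0x123456789ABCDEF0FFFFFFFFFFFFFFFF, 0xFFFFFFFF00000001FFFFFFFF
s1, c1, s2, c2 = interleaved_add(to_limbs(a, 8), to_limbs(b, 8),
                                 to_limbs(c, 8), to_limbs(d, 8), chunk=4)
print(from_limbs(s1) + (c1 << 256) == a + b)
print(from_limbs(s2) + (c2 << 256) == c + d)
```

The point of the model is only that parking the carry as 0/-1 and doubling the register brings the same carry back, so the two additions can be interleaved without losing a carry at the switch.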

Last fiddled with by __HRB__ on 2009-02-21 at 18:39 Reason: I'm a moron
Old 2009-02-25, 16:33   #21
akruppa
 
"Nancy"
Aug 2002
Alexandria
2,467 Posts

Quote:
Originally Posted by ldesnogu
I had the same thought and had to resort to zero-extension of the result of setcc with a movzbl. After all, there's a price to pay for a monstrosity that has lived far too long and that doesn't show any will to die the horrible death it deserves.
Btw, zero-extending the result of setcc keeps the (false) dependency chain intact. It may be better to do a "mov $0, reg" before the setcc; it's just as ugly, but at least it breaks the dependency chain.

Alex
Old 2009-02-25, 16:49   #22
ldesnogu
 
Jan 2008
France
2⁴·3·11 Posts

Quote:
Originally Posted by akruppa
Btw, zero-extending the result of setcc keeps the (false) dependency chain intact. It may be better to do a "mov $0, reg" before the setcc, it's just as ugly but at least it breaks the dependency chain.
I really have to forget my old m68k days and its all-instructions-set-flags view of the world.
I make this remark because I have restrictions on the order of instructions: my registers (inputs and destination) are allocated from above. So in my case I was doing cmp + setcc + signextend, and I wrongly thought the only other option was mov 0 + cmp + setcc ("only" because of the wrong "mov 0 clobbers flags" assumption), which doesn't work if my allocated destination reg overlaps one of the input regs.
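
The ordering issue can be made concrete with a toy Python model that tracks only the carry flag. It relies on two documented x86 facts: mov leaves the flags alone, while xor reg,reg clears CF. Everything else (the `Cpu` class, the `run` helper, using rax as both input and destination) is an assumption made up for the illustration.

```python
MASK64 = (1 << 64) - 1

class Cpu:
    """Toy model: named 64-bit registers plus the carry flag only."""
    def __init__(self, **regs):
        self.regs = dict(regs)
        self.cf = 0

    def cmp(self, a, b):
        """cmp a,b: CF is set when a < b (unsigned); registers untouched."""
        self.cf = int(self.regs[a] < self.regs[b])

    def mov_imm(self, dst, value):
        """mov with an immediate: writes the register, leaves flags alone."""
        self.regs[dst] = value & MASK64

    def xor_self(self, dst):
        """xor dst,dst: zeroes the register and clears CF."""
        self.regs[dst] = 0
        self.cf = 0

    def setb(self, dst):
        """setb on the low byte of dst: only bits 0..7 change."""
        self.regs[dst] = (self.regs[dst] & ~0xFF & MASK64) | self.cf

def run(order, a, b):
    """Compute (a < b) into rax, where rax is also the first input."""
    cpu = Cpu(rax=a, rbx=b)
    for step in order:
        if step == "cmp":
            cpu.cmp("rax", "rbx")
        elif step == "mov0":
            cpu.mov_imm("rax", 0)
        elif step == "xor":
            cpu.xor_self("rax")
        elif step == "setb":
            cpu.setb("rax")
    return cpu.regs["rax"]

print(run(["cmp", "mov0", "setb"], 3, 5))   # mov between cmp and setcc: flag survives
print(run(["mov0", "cmp", "setb"], 7, 5))   # zeroing first destroys the overlapping input
print(run(["cmp", "xor", "setb"], 3, 5))    # xor after cmp wipes the flag
```

Only the cmp, mov 0, setcc order gives the right answer when the destination overlaps an input, which is exactly the option that the "mov 0 clobbers flags" assumption had ruled out.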

Thanks for the hint!

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.