mersenneforum.org


jasong 2010-07-20 23:04

New SSE shtuff by Intel(anybody know anything?)
 
I don't know a lot about CPU instructions, but I know that a lot of SSE stuff is really helpful with prime-finding and with a lot of distributed computing (DC) projects in general.

Anybody have any opinions about the new SSE code Intel will be releasing soon on their CPUs?

Edit: OMG, you can't edit the title of a thread even if you click edit 30 seconds after posting. I want smart comments.

NOOOOOOOOOOOO!!!

Ken_g6 2010-08-05 03:56

There haven't been any really major improvements for sieving since SSE2. I haven't seen anything since then that's worth the effort of supporting, given how few processors might have it. (Of course, I'm not doing FP math in the SSE registers; I need the full 80 bits of precision, and I suspect most FFTs do as well.)

The next interesting change will be [url=http://en.wikipedia.org/wiki/Advanced_Vector_Extensions]AVX[/url] in 2011: double-sized registers (for doing twice as much at once), and instructions that put their output in a third register.

cheesehead 2010-08-05 06:48

[quote=jasong;222165]Edit: OMG, you can't edit the title of a thread even if you click edit 30 seconds after posting. I want smart comments.[/quote]Would you settle for a friendly moderator? Xyzzy changes titles all the time just for fun.

Primeinator 2010-08-05 14:28

[QUOTE=Ken_g6;224116]

The next interesting change will be [url=http://en.wikipedia.org/wiki/Advanced_Vector_Extensions]AVX[/url] in 2011: double-sized registers (for doing twice as much at once), and instructions that put their output in a third register.[/QUOTE]

Would this lead to any significant improvements in the rate at which LL tests are run? Pardon my ignorance, but I'm not sure as to where the current bottleneck is in speed (other than a thermal barrier in throttle speed). What other components of the processor would speed up these tests if modified?

henryzz 2010-08-05 16:09

[quote=Primeinator;224151]Would this lead to any significant improvements in the rate at which LL tests are run? Pardon my ignorance, but I'm not sure as to where the current bottleneck is in speed (other than a thermal barrier in throttle speed). What other components of the processor would speed up these tests if modified?[/quote]
Considering that this is an extension of SSE2, which seriously speeds up LL tests, I would expect it to do even more, especially if they go beyond 256 bits at some point. That has been mentioned several times, so it might happen.

@Prime95 how fast can we expect these instructions to be used by Prime95? Is development before release out of the question?

Prime95 2010-08-05 23:57

[QUOTE=henryzz;224158]Considering that this is an extension to SSE2 which seriously speeds up LL tests ...

@Prime95 how fast can we expect these instructions to be used by Prime95? Is development before release out of the question?[/QUOTE]

AVX with 256-bit registers should double the FPU throughput (although Intel could implement it in a way that there is no increase). The 3-register instruction format reduces register pressure, which is another decent-sized win (especially on Intel chips, which have half the load-to-FPU-register capability of AMD chips). AVX instructions are also more compact, though I doubt that will yield any performance benefit.
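As a rough illustration of the register-width point above, this Python sketch counts how many 64-bit doubles fit in one vector register at each width mentioned in the thread. The helper name is hypothetical, and treating lane count as a proxy for throughput is an assumption; as noted above, real hardware may not deliver the full doubling.

```python
DOUBLE_BITS = 64  # size of one IEEE 754 double-precision value


def doubles_per_register(register_bits):
    """Number of 64-bit doubles that fit in one vector register (hypothetical helper)."""
    if register_bits % DOUBLE_BITS != 0:
        raise ValueError("register width must be a multiple of 64 bits")
    return register_bits // DOUBLE_BITS


# Widths named in the thread: SSE2 (128), AVX (256), possible future widths (512, 1024).
for bits in (128, 256, 512, 1024):
    lanes = doubles_per_register(bits)
    ratio = lanes / doubles_per_register(128)
    print(f"{bits:4d}-bit register: {lanes:2d} doubles, {ratio:.0f}x the SSE2 lane count")
```

The lane count is only an upper bound on the gain per instruction; it says nothing about how many such instructions the chip can issue per cycle.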

Finally, AVX has spec'ed a fused multiply-add instruction that will be very useful in the future. The first Intel chips will not support fused multiply-add. The AMD chips will emulate this instruction.

In short, AVX is a very well thought out extension of the x86 architecture. The instruction format is ready to support 512-bit and 1024-bit registers in the future.

I doubt I'll be able to work on an AVX version before Sandy Bridge comes out in the 4th quarter.

Primeinator 2010-08-06 03:15

[QUOTE=Prime95;224204]AVX with 256-bit registers should double the FPU throughput (although Intel could implement it in a way that there is no increase). [/QUOTE]

How much of an increase in LL speed would be achieved by doubling the floating point throughput? Surely not double it?

__HRB__ 2010-08-06 06:30

At most AVX will speed up the search by 100%, but since it will take a while until a significant portion of users owns a processor that supports these new extensions...
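Whether a doubled FPU rate doubles the overall speed depends on how much of the run time is spent in code that actually benefits. A minimal sketch using Amdahl's law follows; the fractions are made-up values for illustration, not measurements of Prime95 or any other program.

```python
def overall_speedup(vector_fraction, vector_speedup):
    """Amdahl's law: whole-program speedup when only a fraction of the run time gets faster."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


# Hypothetical fractions of run time spent in vectorizable floating-point code.
for fraction in (0.5, 0.8, 0.95, 1.0):
    s = overall_speedup(fraction, 2.0)
    print(f"{fraction:.0%} of time in FP code: {s:.2f}x overall ({(s - 1) * 100:.0f}% faster)")
```

In this model, the full 100% gain from width alone is reached only when all of the time is spent in the code that got twice as fast.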

Considering that a $200 entry-level ATI 5830 delivers around 450-900 giga-flops in double precision arithmetic (which would be roughly equivalent to a system with 16 AVX-capable cores clocked at 4 GHz), I seriously doubt that George will want to spend more than an afternoon on contemplating the specifics.

ldesnogu 2010-08-06 07:25

[quote=__HRB__;224215]Considering that a $200 entry-level ATI 5830 delivers around 450-900 giga-flops in double precision arithmetic[/quote]
Huh? ATI themselves are quoting 358 DP GFLOP/sec. [URL="http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5830/Pages/hd-5830-specifications.aspx"]ref[/URL]

[quote](which would be roughly equivalent to a system with 16 AVX-capable cores clocked at 4 GHz)[/quote]Using the 358 number above, that gives ~11 cores. And I bet it's easier to get closer to peak performance on a CPU than it is on a GPU.
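The core counts in this exchange can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes each AVX core retires 8 double-precision flops per cycle (one 4-wide add plus one 4-wide multiply); that per-cycle figure is an assumption, not something stated in the thread, and the function names are illustrative.

```python
FLOPS_PER_CYCLE = 8  # assumed: one 4-wide DP add + one 4-wide DP multiply per cycle


def peak_gflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPS for a CPU with the given core count and clock."""
    return cores * clock_ghz * flops_per_cycle


def cores_needed(target_gflops, clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Number of cores whose combined peak equals target_gflops."""
    return target_gflops / (clock_ghz * flops_per_cycle)


print(peak_gflops(16, 4.0))    # 512.0   (16 cores at 4 GHz)
print(cores_needed(358, 4.0))  # 11.1875 (cores to match the quoted 358 DP GFLOP/sec)
```

Under that assumption, 16 cores at 4 GHz peak at 512 GFLOPS, and the 358 GFLOP/sec figure corresponds to roughly 11 cores.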

Primeinator 2010-08-06 21:20

[QUOTE=__HRB__;224215]At most AVX will speed up the search by 100%, [/QUOTE]

Again, 100% increase in.... factoring? P-1? LL? All? My knowledge of how CPUs actually work is limited, please accept my ignorance and my apologies.

ewmayer 2010-08-06 22:41

[QUOTE=__HRB__;224215]At most AVX will speed up the search by 100%, but since it will take a while until a significant portion of users owns a processor that supports these new extensions...[/QUOTE]
It could quite possibly be more than 100%: doing 4-way double-precision FPU instructions each cycle will double the throughput vs SSE2. But as George notes, the RISC-style 3-operand instructions will reduce register pressure, meaning fewer cycles needed for spill-and-fill and more available for computation. Not a massive speedup, but another 10-20% seems doable.
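To put numbers on that estimate: if the width doubling and the register-pressure gain compound multiplicatively (an assumption; the two effects may well interact differently in practice), the total works out as in this sketch. The function name is illustrative.

```python
def combined_speedup(width_factor, extra_gain):
    """Multiply the lane-count speedup by a fractional gain from reduced register pressure."""
    return width_factor * (1.0 + extra_gain)


# 2x from 4-way doubles, plus the 10-20% range suggested for fewer spills and fills.
for extra in (0.10, 0.20):
    total = combined_speedup(2.0, extra)
    print(f"2x width with {extra:.0%} extra: {total:.2f}x, about {(total - 1) * 100:.0f}% faster than SSE2")
```

With these inputs the combined figure lands between 2.2x and 2.4x, which is how the total can exceed a 100% improvement.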

[QUOTE]Considering that a $200 entry-level ATI 5830 delivers around 450-900 giga-flops in double precision arithmetic (which would be roughly equivalent to a system with 16 AVX-capable cores clocked at 4 GHz), I seriously doubt that George will want to spend more than an afternoon on contemplating the specifics.[/QUOTE]
Depends on the relative numbers of folks who will have AVX-supporting CPUs versus ones with fast-double-supporting GFX cards.

