mersenneforum.org  

Old 2012-11-19, 11:03   #12
lorgix
 
Sep 2010
Scandinavia

1001100111₂ Posts

Some NFS benchmarks from a Piledriver-system would be very interesting.

GMP-ECM would also be nice.

This thread shall be followed.
Old 2012-11-20, 05:22   #13
frmky
 
Jul 2003
So Cal

2×1,061 Posts

Quote:
Originally Posted by lorgix View Post
Some NFS benchmarks from a Piledriver-system would be very interesting.

GMP-ECM would also be nice.

This thread shall be followed.
I don't expect it to do well, particularly running ecm. The integer multiply is significantly slower than the K10's. NFS sieving might be ok. Are there particular polys/numbers/parameters that are used as a standard benchmark?
Old 2012-11-20, 05:34   #14
lorgix
 
Sep 2010
Scandinavia

3·5·41 Posts

Quote:
Originally Posted by frmky View Post
I don't expect it to do well, particularly running ecm. The integer multiply is significantly slower than the K10's. NFS sieving might be ok. Are there particular polys/numbers/parameters that are used as a standard benchmark?
I don't know of any standard benchmark.
Old 2013-02-08, 07:02   #15
mrolle
 
Feb 2012
Cupertino, CA

13 Posts

Quote:
Originally Posted by henryzz View Post
Is it worth using avx code on piledriver? Have they improved it since bulldozer?
Let me update what I wrote a while ago about AVX performance.
1. On Bulldozer, a 256-bit YMM instruction costs about the same as two 128-bit XMM instructions, except that a few YMM instructions are even slower than their XMM equivalents.
2. One reason you might want to use YMMs anyway would be for register pressure. A YMM program is equivalent to an XMM program if there were 32 XMM registers instead of 16.
3. Another reason to use YMM instructions: if your program runs in one thread while the companion thread is idle, the decode stage may stall, depending on code size. Because YMM code needs about half as many instructions as XMM code for the same work, you might get more work done per cycle.
4. I think the best way to get performance is to run workloads in both threads at the same time. The amount of work that can be done is significantly more than what can be done in one thread even if the other thread is idle.
5. Running a workload with XMM code in both threads will be equivalent in timing to running it with YMM code in one thread alone. This gives you the 32 XMM registers, and doesn't have the above problems with decode bandwidth or certain slower instructions.
6. XMM AVX instructions are similar to legacy XMM instructions in performance. However, with AVX you can take advantage of the 3-operand forms.
7. The FMA4 instructions can be a big win because you can do adds and multiplies together. An example would be a vector dot-product.
8. For complex numbers, you will be better off with separate vectors for the real and imag components than with vectors of complex numbers. If you have a complex number in an XMM register, a complex multiply can be done with one multiply and one multiply-add instruction, together with a shuffle to exchange the real and imag components. The shuffle uses FPU pipe 1, which takes away from the multiply/add bandwidth, so the multiply effectively costs 3 FMA operations instead of 2. If you have 2 reals in one register and 2 imags in another register, you can do the two parallel complex multiplies with 4 FMA operations and no other instructions.
9. All of this applies to the Bulldozer parts. I am assuming that Piledriver (which includes Trinity and Vishera) will be the same until I can run some tests.
10. Piledriver has support for the FMA3 instruction set that Intel has implemented, so at least you can run the same code on Piledriver and Intel. However, FMA4 can let you avoid some register copies, so if you can benefit from that in your code, it would be worth having a Piledriver version different from the Intel version.
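Points 7 and 8 can be sketched in plain Python. This is only a model of the arithmetic, not SIMD code: each list stands in for one vector register, and the helper names `fma`, `dot`, and `cmul_split` are made up for illustration. The `fma` helper computes a*b + c with ordinary rounding, so it does not reproduce the single rounding of a hardware FMA.

```python
def fma(a, b, c):
    """Model of one fused multiply-add, a*b + c (hypothetical helper).

    Hardware FMA rounds once; this model rounds twice. Only the
    operation count is being illustrated here.
    """
    return a * b + c


def dot(x, y):
    """Dot product using one multiply-add per element (point 7)."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc = fma(xi, yi, acc)
    return acc


def cmul_split(ar, ai, br, bi):
    """Multiply complex vectors kept as separate real/imag lanes (point 8).

    ar, ai hold the real and imaginary parts of the first operands,
    br, bi those of the second. Four multiply/FMA operations are issued
    per vector, and no shuffle is needed.
    """
    # op 1: plain multiply, ar*br
    t_re = [p * q for p, q in zip(ar, br)]
    # op 2: negated multiply-add, re = ar*br - ai*bi
    re = [fma(-p, q, t) for p, q, t in zip(ai, bi, t_re)]
    # op 3: plain multiply, ar*bi
    t_im = [p * q for p, q in zip(ar, bi)]
    # op 4: multiply-add, im = ar*bi + ai*br
    im = [fma(p, q, t) for p, q, t in zip(ai, br, t_im)]
    return re, im
```

For example, `cmul_split([1.0, 2.0], [2.0, -1.0], [3.0, 0.5], [4.0, 1.5])` multiplies (1+2i)(3+4i) and (2-i)(0.5+1.5i) in parallel lanes. With the interleaved layout, the same work would also need a shuffle to swap real and imag parts.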
Old 2013-02-08, 07:11   #16
mrolle
 
Feb 2012
Cupertino, CA

13 Posts

Quote:
Originally Posted by Prime95 View Post
Does anyone have access to a Piledriver system? I have a friend who needs to run some benchmarks. Any help would be appreciated.
In addition to running benchmark code on Windows, Agner Fog would be very interested in testing a Piledriver CPU so that he can publish instruction timings for it. His Instruction Tables document (see http://www.agner.org) has a chapter for Bulldozer already.
Agner wants to be able to log in remotely to a Linux system.
Can anyone provide such access?
By the way, the Piledriver models that I am aware of are the following:
Vishera -- FX-4300, 4320, 6300, 8300, 8320, 8350
Trinity -- A4-5300, A6-5400K, A8-5500, 5600K, A10-5700, 5800K.
Old 2013-02-08, 20:59   #17
ewmayer
2ω=0
 
Sep 2002
República de California

26577₈ Posts

Quote:
Originally Posted by mrolle View Post
Let me update what I wrote a while ago about AVX performance.
1. On Bulldozer, a 256-bit YMM instruction costs about the same as two 128-bit XMM instructions, except that a few YMM instructions are even slower than their XMM equivalents.
In Intel's implementation of AVX, most key instructions provide 2x the max throughput of the analogous SSE instruction, not just theoretically, but in practice. Agner Fog has written at length about this. That is the #1 reason to prefer "true" full-register AVX to SSE on hardware supporting both. AVX2 promises to provide a further significant throughput boost, as discussed here.

Quote:
8. For complex numbers, you will be better off with separate vectors for the real and imag components than with vectors of complex numbers. If you have a complex number in an XMM register, a complex multiply can be done with one multiply and one multiply-add instruction, together with a shuffle to exchange the real and imag components. The shuffle uses FPU pipe 1, which takes away from the multiply/add bandwidth, so the multiply effectively costs 3 FMA operations instead of 2. If you have 2 reals in one register and 2 imags in another register, you can do the two parallel complex multiplies with 4 FMA operations and no other instructions.
It is far preferable to minimize data-shuffling by altering one's data layout in order to mitigate the lack of proper CMUL support in all the x86 SIMD implementations. Using doubles, an array of complex data would be laid out like so:

Non-SIMD: a.re,a.im,b.re,b.im,c.re,c.im,d.re,d.im,...

SSE2+: a.re,b.re,a.im,b.im,c.re,d.re,c.im,d.im,...

AVX: a.re,b.re,c.re,d.re,a.im,b.im,c.im,d.im,...
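The three layouts above follow one rule: take the interleaved array in blocks of `width` complex values and write each block as `width` reals followed by `width` imags. A minimal Python sketch of that reordering, where the function name `to_simd_layout` and the `width` parameter are assumptions made for illustration (width 1 is the non-SIMD layout, 2 is SSE2 with doubles, 4 is AVX with doubles):

```python
def to_simd_layout(interleaved, width):
    """Reorder [re, im, re, im, ...] into blocks of `width` reals
    followed by `width` imags.

    `interleaved` is a flat list of alternating real/imag parts;
    `width` is the number of doubles per SIMD register (hypothetical
    parameter: 1 = scalar, 2 = SSE2, 4 = AVX).
    """
    block_len = 2 * width
    if len(interleaved) % block_len != 0:
        raise ValueError("length must be a multiple of 2 * width")
    out = []
    for start in range(0, len(interleaved), block_len):
        block = interleaved[start:start + block_len]
        out.extend(block[0::2])  # the `width` real parts of this block
        out.extend(block[1::2])  # the `width` imaginary parts
    return out
```

Feeding it the labels `a.re, a.im, b.re, b.im, c.re, c.im, d.re, d.im` with width 2 and width 4 reproduces the SSE2+ and AVX rows listed above.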
Old 2013-02-08, 22:07   #18
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19×397 Posts

Quote:
Originally Posted by ewmayer View Post
In Intel's implementation of AVX, most key instructions provide 2x the max throughput of the analogous SSE instruction
True for Intel, false for AMD.

Scott Bardwick has generously let me test out code on his Bulldozer. I'm doing this primarily to rewrite my core building blocks to use Intel's upcoming FMA instruction. I've also put a modest amount of work into making a Bulldozer AVX version of prime95. I can almost get the AVX version as fast as the SSE2 version, but not quite. I gain speed over the SSE2 version by using FMA, but lose significant speed due to 4-way swizzling. For prime95, the best solution for Bulldozer is to use 128-bit AVX instructions (instead of SSE2) and use FMA. I don't know if I have the willpower to make such a version.
Old 2013-02-18, 05:59   #19
RMAC9.5
 
Jun 2003

231₈ Posts
AMD FX8350 (Piledriver) Available

George,
Last month I built a new system using an AMD Piledriver CPU, an ASRock MB, and 8 GB of GSkill DDR3-1333 memory (2 × 4 GB DIMMs). It is currently Prime95 stable on Windows 7 (64-bit) at 4.2 GHz and is available to you or your friend if either of you needs it for testing.

I am hoping that you have the willpower to make a Piledriver (i.e. 128-bit AVX, FMA) version of Prime95 for us diehard AMD users as this is likely to be the only way we can upgrade our PC hardware in the future.
Old 2013-02-22, 20:05   #20
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts

AMD Secretly Rolls-Out "Steamroller" Support Patch for Compilers.
Quote:
According to the Phoronix web-site, the Bulldozer version 3 (bdver3) GCC patch is presently in its very early form and generally copies most of the tuning work from bdver2 (Piledriver), except for the fact that the pipelines have already been modeled in accordance with the new Steamroller core design. Considering the fact that AMD did not add support for any new instructions the next-gen x86 core supports, it is evident that the company is more concerned about ensuring that peculiarities of the Steamroller cores are taken into consideration by software designers in the first place.

AMD is reportedly trying to ensure that Steamroller micro-architecture is supported by the GNU compiler collection 4.8, which is due in the first half of 2013. Apparently, the company is very concerned about optimization of compilers for the new bdver3 pipelines, which were significantly redesigned in the third-gen compared to the original Bulldozer.

AMD pins a lot of hopes on the Steamroller micro-architecture and even disclosed many of its peculiarities back in August '12, well ahead of the roll-out of the first chips, which are projected to be due in late 2013.
Note the repeated mention of "peculiarities".

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.