AMD Piledriver
Does anyone have access to a Piledriver system? I have a friend who needs to run some benchmarks. Any help would be appreciated.
|
My wife's computer has an A10-5700 Trinity APU. It's running Windows 7 x64, but I could boot an Ubuntu live CD while she's not looking if necessary. :smile:
|
And that is how I learned what a Piledriver is... :blush:
|
[QUOTE=frmky;318679]My wife's computer has an A10-5700 Trinity APU. It's running Windows 7 x64,[/QUOTE]
That will be perfect - thanks! |
[QUOTE=LaurV;318680]And that is how I learned what a Piledriver is... :blush:[/QUOTE]
"Obviously you're not a golfer!" Or a fan of Andy Kaufman! |
Is it worth using avx code on piledriver? Have they improved it since bulldozer?
|
[QUOTE=henryzz;318767]Is it worth using avx code on piledriver? Have they improved it since bulldozer?[/QUOTE]
From the articles I've read, SSE2 code will be faster on Bulldozer and Piledriver |
[QUOTE=Prime95;318787]From the articles I've read, SSE2 code will be faster on Bulldozer and Piledriver[/QUOTE]
..damn, was hoping they would improve AVX on piledriver, oh well :smile: |
Thank you.
[QUOTE=frmky;318679]My wife's computer has an A10-5700 Trinity APU. It's running Windows 7 x64. :smile:[/QUOTE]
Thanks from me, too. I'm the friend that George was referring to. |
[QUOTE=henryzz;318767]Is it worth using avx code on piledriver? Have they improved it since bulldozer?[/QUOTE]
I've been analyzing Bulldozer performance for a while and I'm now going to see how much of it applies to Piledriver. Bulldozer has a family of 15h and models between 00h and 0fh. The one Piledriver system I've seen (an A8 which has Piledriver cores) has a family of 15h and model of 10h. I'm guessing that models 10h - 1fh are for Piledrivers. The Software Optimization Guide for family 15h shows different timings for some instructions, but otherwise the microarchitecture seems to be the same. There are pipeline differences with models 20h to 2fh. I'm guessing that these are for Steamroller cores, but that's just a guess.
Regarding using AVX instructions... My opinion is that there is no performance advantage to using 128-bit AVX instructions over the legacy instructions. They may save slightly on code space. 256-bit AVX instructions will definitely save on code space. Also, you won't run out of YMM registers as fast as XMM registers. Instruction timings for a 256-bit instruction are generally double those for the 128-bit instructions. Some 256-bit AVX code may run a bit slower than equivalent 128-bit code, particularly stores to memory. So my own recommendation is to stick to 128-bit AVX, or legacy instructions, unless you are running out of registers. Or try it both ways and see which way comes out better.
Another thing about Bulldozer / Piledriver: the best way to get the full performance out of a module is to have both of the module's cores running similar workloads. A program running in one core with the other core idle tends to get significantly less throughput in the decoding stage and in loads and stores. Also, depending on the time it takes instructions to retire, it is more likely that the retire buffer will fill up and stall the processor. |
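For anyone who wants to check which range their own CPU falls in, here is a minimal sketch of how the family and model numbers above come out of the CPUID leaf 1 EAX value. It assumes the AMD convention, where the extended family and extended model fields only count when the base family is 0Fh. The function names and the example signature are made up for illustration, and the model ranges are simply the guesses from the post above, not anything confirmed.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    family = base_family
    model = base_model
    if base_family == 0xF:
        # AMD convention: extended fields are used only when base family is 0Fh.
        family += ext_family
        model |= ext_model << 4
    return family, model, stepping


def guess_core(family, model):
    """Map family 15h model ranges to core names, using the guesses from the post."""
    if family != 0x15:
        return "not family 15h"
    if 0x00 <= model <= 0x0F:
        return "Bulldozer"
    if 0x10 <= model <= 0x1F:
        return "Piledriver (guess)"
    if 0x20 <= model <= 0x2F:
        return "Steamroller (guess)"
    return "unknown family 15h model"


# Illustrative signature assembled from the fields, not read from real hardware:
# ext_family=06h, ext_model=1h, base_family=Fh, base_model=0h, stepping=1h.
example_eax = (0x06 << 20) | (0x1 << 16) | (0xF << 8) | (0x0 << 4) | 0x1
family, model, stepping = decode_cpuid_signature(example_eax)
print(f"family {family:02X}h, model {model:02X}h: {guess_core(family, model)}")
```

The raw EAX value itself would have to come from a tool such as CPU-Z or from executing CPUID directly; this sketch only covers the bit-field arithmetic.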
1 Attachment(s)
[QUOTE=mrolle;318902]
The one Piledriver system I've seen (an A8 which has Piledriver cores) has a family of 15h and model of 10h. I'm guessing that models 10h - 1fh are for Piledrivers. [/QUOTE] My wife's is also family 15h, model 10h. Attached is the screenshot from CPU-Z. |
Some NFS benchmarks from a Piledriver-system would be very interesting.
GMP-ECM would also be nice. This thread shall be followed. |
[QUOTE=lorgix;318915]Some NFS benchmarks from a Piledriver-system would be very interesting.
GMP-ECM would also be nice. This thread shall be followed.[/QUOTE] I don't expect it to do well, particularly running ecm. The integer multiply is significantly slower than the K10's. NFS sieving might be ok. Are there particular polys/numbers/parameters that are used as a standard benchmark? |
[QUOTE=frmky;319019]I don't expect it to do well, particularly running ecm. The integer multiply is significantly slower than the K10's. NFS sieving might be ok. Are there particular polys/numbers/parameters that are used as a standard benchmark?[/QUOTE]
I don't know of any standard benchmark. |
[QUOTE=henryzz;318767]Is it worth using avx code on piledriver? Have they improved it since bulldozer?[/QUOTE]
Let me update what I wrote a while ago about AVX performance.
1. Using YMM instructions is equivalent to having twice as many XMM instructions, except that a few of them are even slower than equivalent XMMs.
2. One reason you might want to use YMMs anyway would be for register pressure. A YMM program is equivalent to an XMM program if there were 32 XMM registers instead of 16.
3. Another reason for YMM instructions is that if your program is running in one thread with the companion thread being idle, there may be stalls in the decoding stage depending on the code size. With YMM code being half the size of the equivalent XMM code, you might get more work per cycle.
4. I think the best way to get performance is to run workloads in both threads at the same time. The amount of work that can be done is significantly more than what can be done in one thread even if the other thread is idle.
5. Running a workload with XMM code in both threads will be equivalent in timing to running it with YMM code in one thread alone. This gives you the 32 XMM registers, and doesn't have the above problems with decode bandwidth or certain slower instructions.
6. XMM AVX instructions are similar to legacy XMM instructions in performance. However, with AVX you can take advantage of the 3-operand forms.
7. The FMA4 instructions can be a big win because you can do adds and multiplies together. An example would be a vector dot-product.
8. For complex numbers, you will be better off having separate vectors for the real and imag components, compared with vectors of complex numbers. If you have a complex number in an XMM register, a complex multiply can be done with a multiply and a multiply-add instruction, together with a shuffle to exchange the real and imag components. The shuffle uses FPU pipe 1, which takes away from the multiply/add bandwidth, so it takes 3 FMA operations instead of 2. If you have 2 reals in one register and 2 imags in another register, you can do the two parallel complex multiplies with 4 FMA operations and no other instructions.
9. All of this applies to the Bulldozer parts. I am assuming that Piledriver (which includes Trinity and Vishera) will be the same until I can run some tests.
10. Piledriver has support for the FMA3 instruction set that Intel has implemented, so at least you can run the same code on Piledriver and Intel. However, FMA4 can let you avoid some register copies, so if you can benefit from that in your code, it would be worth having a Piledriver version different from the Intel version. |
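To make point 8 concrete, here is a rough sketch of the split real/imag complex multiply as four multiply/FMA-pipe operations with no shuffles. It is plain Python standing in for SIMD code: each list plays the role of one XMM register holding two doubles, and `fma` is a hypothetical helper that models a fused multiply-add (a real FMA rounds once, this stand-in rounds twice). The exact instruction selection on real hardware is an assumption here, not something taken from the post.

```python
def fma(a, b, c):
    """Stand-in for one fused multiply-add: a * b + c (rounds twice, unlike hardware)."""
    return a * b + c


def cmul_split(a_re, a_im, b_re, b_im):
    """Multiply complex vectors stored as separate real and imaginary vectors.

    Each list stands in for one SIMD register (two doubles for XMM).
    Uses four lane-wise operations: two multiplies and two fused multiply-adds.
    """
    lanes = range(len(a_re))
    # Op 1: plain multiply, a_im * b_im
    t_re = [a_im[i] * b_im[i] for i in lanes]
    # Op 2: fused multiply-subtract, a_re * b_re - t_re
    re = [fma(a_re[i], b_re[i], -t_re[i]) for i in lanes]
    # Op 3: plain multiply, a_im * b_re
    t_im = [a_im[i] * b_re[i] for i in lanes]
    # Op 4: fused multiply-add, a_re * b_im + t_im
    im = [fma(a_re[i], b_im[i], t_im[i]) for i in lanes]
    return re, im


# Two complex multiplies in parallel: (1+2i)*(3+4i) and (2-1i)*(1+1i)
re, im = cmul_split([1.0, 2.0], [2.0, -1.0], [3.0, 1.0], [4.0, 1.0])
print(re, im)
```

With the interleaved layout, the same two products would also need a shuffle to swap real and imag parts, which is the extra pipe 1 work that point 8 describes.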
[INDENT][QUOTE=Prime95;318670]Does anyone have access to a Piledriver system? I have a friend who needs to run some benchmarks. Any help would be appreciated.[/QUOTE]
[/INDENT]In addition to running benchmark code on Windows, Agner Fog would be very interested in testing a Piledriver CPU so that he can publish instruction timings for it. His Instruction Tables document (see [URL]http://www.agner.org[/URL]) has a chapter for Bulldozer already. Agner wants to be able to do a remote login to a Linux system. Can anyone provide such access?
By the way, the Piledriver models that I am aware of are the following:[INDENT]Vishera -- FX-4300, 4320, 6300, 8300, 8320, 8350
Trinity -- A4-5300, A6-5400K, A8-5500, 5600K, A10-5700, 5800K.[/INDENT]:smile: |
[QUOTE=mrolle;328475]Let me update what I wrote a while ago about AVX performance.
1. Using YMM instructions is equivalent to having twice as many XMM instructions, except that a few of them are even slower than equivalent XMMs.[/QUOTE]
In Intel's implementation of AVX, most key instructions provide 2x the max throughput of the analogous SSE instruction, not just theoretically, but in practice. Agner Fog has written at length about this. That is the #1 reason to prefer "true" full-register AVX to SSE on hardware supporting both. AVX2 promises to provide a further significant throughput boost, as discussed [url=http://mersenneforum.org/showthread.php?t=17618]here[/url].
[QUOTE]8. For complex numbers, you will be better off having separate vectors for the real and imag components, compared with vectors of complex numbers. If you have a complex number in an XMM register, a complex multiply can be done with a multiply and a multiply-add instruction, together with a shuffle to exchange the real and imag components. The shuffle uses FPU pipe 1, which takes away from the multiply/add bandwidth, so it takes 3 FMA operations instead of 2. If you have 2 reals in one register and 2 imags in another register, you can do the two parallel complex multiplies with 4 FMA operations and no other instructions.[/QUOTE]
It is far preferable to minimize data-shuffling by altering one's data layout in order to mitigate the lack of proper CMUL support in all the x86 SIMD implementations. Using doubles, an array of complex data would be laid out like so:
Non-SIMD: a.re,a.im,b.re,b.im,c.re,c.im,d.re,d.im,...
SSE2+: a.re,b.re,a.im,b.im,c.re,d.re,c.im,d.im,...
AVX: a.re,b.re,c.re,d.re,a.im,b.im,c.im,d.im,... |
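The three layouts above differ only in how many complex values are grouped before switching from real parts to imaginary parts: 1 for non-SIMD, 2 for SSE2 doubles, 4 for AVX doubles. A small sketch of that regrouping, with a made-up function name `to_simd_layout` and a `width` parameter introduced purely for illustration:

```python
def to_simd_layout(interleaved, width):
    """Regroup [re, im, re, im, ...] into blocks of `width` reals then `width` imags."""
    n_complex = len(interleaved) // 2
    if len(interleaved) % 2 or n_complex % width:
        raise ValueError("need a whole number of width-sized complex groups")
    out = []
    for start in range(0, n_complex, width):
        group = range(start, start + width)
        out.extend(interleaved[2 * k] for k in group)      # real parts of the group
        out.extend(interleaved[2 * k + 1] for k in group)  # imaginary parts of the group
    return out


data = ["a.re", "a.im", "b.re", "b.im", "c.re", "c.im", "d.re", "d.im"]
print(to_simd_layout(data, 1))  # non-SIMD ordering, unchanged
print(to_simd_layout(data, 2))  # SSE2 ordering for doubles
print(to_simd_layout(data, 4))  # AVX ordering for doubles
```

In real code the regrouping would normally be done once when the data is set up, so that the inner loops never need a shuffle.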
[QUOTE=ewmayer;328614]In Intel's implementation of AVX, most key instructions provide 2x the max throughput of the analogous SSE instruction[/QUOTE]
True for Intel, false for AMD. Scott Bardwick has generously let me test out code on his Bulldozer. I'm doing this primarily to rewrite my core building blocks to use Intel's upcoming FMA instruction. I've also put a modest amount of work into making a Bulldozer AVX version of prime95. I can almost get the AVX version as fast as the SSE2 version, but not quite. I gain speed over the SSE2 version by using FMA, but lose significant speed due to 4-way swizzling. For prime95, the best solution for Bulldozer is to use 128-bit AVX instructions (instead of SSE2) and use FMA. I don't know if I have the willpower to make such a version. |
AMD FX8350 (Piledriver) Available
George,
Last month I built a new system using an AMD Piledriver CPU, ASRock MB, and 8 GB of G.Skill DDR3-1333 memory (2 x 4 GB DIMMs). It is currently Prime95 stable on Windows 7 (64-bit) at 4.2 GHz and available to you or your friend if either of you need it for testing. I am hoping that you have the willpower to make a Piledriver (i.e. 128-bit AVX, FMA) version of Prime95 for us [U]diehard[/U] AMD users as this is likely to be the only way we can upgrade our PC hardware in the future. |
[url=www.xbitlabs.com/news/cpu/display/20121011232630_AMD_Secretly_Rolls_Out_Steamroller_Support_Patch_for_Compilers.html]AMD Secretly Rolls-Out "Steamroller" Support Patch for Compilers.[/url]
[quote]According to Phoronix web-site, the Bulldozer version 3 (bdver3) GCC patch is presently in its very early form and generally copies most of the tuning work from bdver2 (Piledriver), except the fact that the pipelines have already been modeled in accordance with the new Steamroller core design. Considering the fact that AMD did not add support for any new instruction the next-gen x86 core supports, it is evident that the company is more concerned about ensuring that peculiarities of the Steamroller cores are taken into consideration by software designers on the first place. AMD is reportedly trying to ensure that Steamroller micro-architecture is supported by the GNU compiler collection 4.8, which is due in the first half of 2013. Apparently, the company is very concerned about optimization of compilers for the new bdver3 pipelines, which were significantly redesigned in the third-gen compared to the original Bulldozer. AMD pins a lot of hopes on Bulldozer micro-architecture and even disclosed many of its peculiarities back in August '12, well ahead of the roll-out of the first chips, which are projected to be due in late 2013.[/quote] Note the repeated mention of "peculiarities". |