#12 |
Sep 2010
Scandinavia
1001100111₂ Posts |
Some NFS benchmarks from a Piledriver-system would be very interesting.
GMP-ECM would also be nice. This thread shall be followed. |
#13 |
Jul 2003
So Cal
2×1,061 Posts |
I don't expect it to do well, particularly running ecm. The integer multiply is significantly slower than the K10's. NFS sieving might be ok. Are there particular polys/numbers/parameters that are used as a standard benchmark?
#14 |
Sep 2010
Scandinavia
3·5·41 Posts |
I don't know of any standard benchmark.
#15 | |
Feb 2012
Cupertino, CA
13 Posts |
Quote:
1. Using YMM instructions is equivalent to having twice as many XMM instructions, except that a few of them are even slower than the equivalent XMM instructions.
2. One reason you might want to use YMMs anyway would be register pressure. A YMM program is equivalent to an XMM program with 32 XMM registers instead of 16.
3. Another reason for YMM instructions is that if your program is running in one thread with the companion thread being idle, there may be stalls in the decoding stage depending on the code size. Because YMM code is half the size of the equivalent XMM code, you might get more work done per cycle.
4. I think the best way to get performance is to run workloads in both threads at the same time. The amount of work that can be done is significantly more than what can be done in one thread, even if the other thread is idle.
5. Running a workload with XMM code in both threads will be equivalent in timing to running it with YMM code in one thread alone. This gives you the 32 XMM registers and doesn't have the above problems with decode bandwidth or certain slower instructions.
6. XMM AVX instructions are similar to legacy XMM instructions in performance. However, with AVX you can take advantage of the 3-operand forms.
7. The FMA4 instructions can be a big win because you can do adds and multiplies together. An example would be a vector dot-product.
8. For complex numbers, you will be better off having separate vectors for the real and imag components, compared with vectors of complex numbers. If you have a complex number in an XMM register, a complex multiply can be done with a multiply and a multiply-add instruction, together with a shuffle to exchange the real and imag components. The shuffle uses FPU pipe 1, which takes away from the multiply/add bandwidth, so it takes 3 FMA operations instead of 2.
   If you have 2 reals in one register and 2 imags in another register, you can do the two parallel complex multiplies with 4 FMA operations and no other instructions.
9. All of this applies to the Bulldozer parts. I am assuming that Piledriver (which includes Trinity and Vishera) will be the same until I can run some tests.
10. Piledriver has support for the FMA3 instruction set that Intel has implemented, so at least you can run the same code on Piledriver and Intel. However, FMA4 can let you avoid some register copies, so if you can benefit from that in your code, it would be worth having a Piledriver version different from the Intel version.
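To make point 8 concrete, here is a rough scalar sketch in C of the split real/imag layout. It is only an illustration, not anyone's actual code: the function name `cmul_split` and the array parameters are made up for this example, and it relies on the compiler to vectorize the loop and contract each `x*y ± t` into a fused multiply-add (which depends on compiler flags and target). Each complex product costs two plain multiplies plus two multiply-add candidates, with no shuffle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element-wise complex multiply c = a * b with the
 * real and imaginary parts stored in separate arrays (split layout).
 * Output arrays must not overlap the input arrays. */
void cmul_split(const double *are, const double *aim,
                const double *bre, const double *bim,
                double *cre, double *cim, size_t n)
{
    assert(are && aim && bre && bim && cre && cim);
    for (size_t i = 0; i < n; i++) {
        double t_re = aim[i] * bim[i];      /* op 1: multiply */
        double t_im = aim[i] * bre[i];      /* op 2: multiply */
        cre[i] = are[i] * bre[i] - t_re;    /* op 3: multiply-subtract candidate */
        cim[i] = are[i] * bim[i] + t_im;    /* op 4: multiply-add candidate */
    }
}
```

With two doubles per XMM register, one pass of ops 1 to 4 over vector registers would handle two complex products at once, which matches the "4 FMA operations and no other instructions" count above.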
#16 |
Feb 2012
Cupertino, CA
13 Posts |
In addition to running benchmark code on Windows, Agner Fog would be very interested in testing a Piledriver CPU so that he can publish instruction timings for it. His Instruction Tables document (see http://www.agner.org) has a chapter for Bulldozer already. Agner wants to be able to do a remote login to a Linux system. Can anyone provide such access?
By the way, the Piledriver models that I am aware of are the following:
Vishera: FX-4300, 4320, 6300, 8300, 8320, 8350
#17 | ||
∂²ω=0
Sep 2002
República de California
26577₈ Posts |
Quote:
Quote:
Non-SIMD: a.re,a.im,b.re,b.im,c.re,c.im,d.re,d.im,...
SSE2+: a.re,b.re,a.im,b.im,c.re,d.re,c.im,d.im,...
AVX: a.re,b.re,c.re,d.re,a.im,b.im,c.im,d.im,...
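The three layouts above differ only in how many reals are grouped before the matching imags. As a small sketch (the function name `interleaved_to_blocked` and its parameters are invented for this example, and it assumes the element count is a multiple of the vector width), a repacking routine in C could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: repack n complex doubles from the interleaved
 * (re,im,re,im,...) layout into blocks of w reals followed by w imags.
 * w = 1 keeps the non-SIMD layout, w = 2 gives the SSE2-style layout,
 * w = 4 gives the AVX-style layout. Assumes n is a multiple of w. */
void interleaved_to_blocked(const double *in, double *out, size_t n, size_t w)
{
    assert(w > 0 && n % w == 0);
    for (size_t base = 0; base < n; base += w) {
        for (size_t j = 0; j < w; j++) {
            out[2 * base + j]     = in[2 * (base + j)];      /* real part */
            out[2 * base + w + j] = in[2 * (base + j) + 1];  /* imag part */
        }
    }
}
```

For four complex values a..d, `w = 2` yields a.re,b.re,a.im,b.im,c.re,d.re,c.im,d.im and `w = 4` yields a.re,b.re,c.re,d.re,a.im,b.im,c.im,d.im, as in the list above.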
#18 | |
P90 years forever!
Aug 2002
Yeehaw, FL
19×397 Posts |
Quote:
Scott Bardwick has generously let me test out code on his Bulldozer. I'm doing this primarily to rewrite my core building blocks to use Intel's upcoming FMA instruction. I've also put a modest amount of work into making a Bulldozer AVX version of prime95. I can almost get the AVX version as fast as the SSE2 version, but not quite. I gain speed over the SSE2 version by using FMA, but lose significant speed due to 4-way swizzling. For prime95, the best solution for Bulldozer is to use 128-bit AVX instructions (instead of SSE2) and use FMA. I don't know if I have the willpower to make such a version. |
#19 |
Jun 2003
2318 Posts |
George,
Last month I built a new system using an AMD Piledriver CPU, ASRock MB, and 8 GB of GSkill DDR3-1333 memory (2 × 4 GB DIMMs). It is currently Prime95 stable on Windows 7 (64-bit) at 4.2 GHz and available to you or your friend if either of you needs it for testing. I am hoping that you have the willpower to make a Piledriver (i.e. 128-bit AVX, FMA) version of Prime95 for us diehard AMD users, as this is likely to be the only way we can upgrade our PC hardware in the future. |
#20 | |
∂²ω=0
Sep 2002
República de California
19·613 Posts |
AMD Secretly Rolls-Out "Steamroller" Support Patch for Compilers.
Quote: