mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Prime95 version 29.6/29.7/29.8 (https://www.mersenneforum.org/showthread.php?t=24094)

Prime95 2019-08-20 21:59

[QUOTE=Evil Genius;524079]Also of note is that if there's no native 256-bit implementation, the 128-bit implementation is faster.[/QUOTE]

Only in Bulldozer and its descendants. AMD made a real mess of their initial AVX implementation.

All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time.

Prime95 2019-08-20 22:00

[QUOTE=ixfd64;524078]I don't have [c]NumCPUs[/c] set on this computer. I'm also not able to reproduce this on a Mac Pro. It seems this issue only affects certain computers.[/QUOTE]

I set NumCPUs=2 to emulate your dual-core machine.

Bulldozer 2019-08-21 05:24

[QUOTE=AG5BPilot;524053]There's no such thing as "AVX-256". There's AVX, AVX2 (which isn't important for Prime95/gwnum but comes along with FMA3, which is important), and AVX-512.

What you're calling AVX-256 is plain old original AVX, which has been supported by AMD for as long as Intel has supported it. AMD's implementation was crippled, however, so it hasn't been used here until Zen 2. With Zen 2, it's useful, finally -- but it's always been there.

Zen2 supports FMA3. And it supports AVX. And, finally, they're as good as Intel's implementation. They don't support AVX-512, but that's a whole different discussion.

Edit: Please see the Wikipedia page for the Piledriver architecture: [url]https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)[/url] . It clearly states that Piledriver supports AVX and FMA3.[/QUOTE]
On Excavator, AVX and FMA3 work well. The A10-9700 (2m4c4t) performs about as well as a hyperthreaded Kaby Lake i5-8250U (4c8t).

ixfd64 2019-08-21 06:00

[QUOTE=Prime95;524081]I set NumCPUs=2 to emulate your dual-core machine.[/QUOTE]

I added [C]NumCPUs=2[/C] to my [C]local.txt[/C] file as an experiment, and it didn't make a difference. I'm at a loss as to why Prime95 won't let me set fewer than two cores per worker on a dual-core machine when using two worker windows. Have any other Mac users seen this issue?

mackerel 2019-08-21 07:41

[QUOTE=Mark Rose;524040]Zen 2 doesn't downclock when doing AVX-256 either.[/QUOTE]

They do, just not in the same way Intel does it. I assume we're familiar with Intel's AVX offset. If there is AVX code running, the clock may be reduced by some fixed amount. It is crude but does the job.

Zen 2 doesn't have an AVX offset concept, but when running Prime95-like code with FMA, it still generates a lot of heat. Based on observation of actual behaviour, running stock, you will hit the PPT limit, and the current limit is also close, so it does still clock down compared to running lesser loads. From memory, my 3600 runs all-core at around 3900 MHz with a 128k FFT per core, while under a lighter load like Cinebench R15 it stays well above 4 GHz.

AMD took a more GPU-like approach on Zen 2: it adjusts its clock based on power, current, temperature and so on. So they're not detecting AVX and dropping the clock, but AVX code still uses more power and hits the other limits earlier than it otherwise would, so the clock still drops.
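
The two approaches can be contrasted with a toy model. This is only a sketch: the function names, the linear power model and every constant below are assumptions picked to mirror the behaviour described in this post, not measurements and not either vendor's actual algorithm.

```python
def fixed_offset_clock(base_mhz, avx_active, offset_mhz):
    """Fixed-offset model: drop the clock by a set amount whenever AVX code runs."""
    return base_mhz - offset_mhz if avx_active else base_mhz


def limit_driven_clock(max_mhz, watts_per_mhz, power_limit_w, step_mhz=25):
    """Limit-driven model: step the clock down until modelled power fits the limit.

    Power is modelled as clock * watts_per_mhz, a deliberate simplification;
    real power also depends on voltage, current and temperature.
    """
    clock = max_mhz
    while clock > 0 and clock * watts_per_mhz > power_limit_w:
        clock -= step_mhz
    return clock


# Hypothetical numbers, chosen only for illustration.
PPT_WATTS = 88            # assumed package power limit
LIGHT_LOAD = 0.018        # assumed watts per MHz for a lighter, Cinebench-like load
HEAVY_FMA_LOAD = 0.0225   # assumed watts per MHz for Prime95-like FMA code

light = limit_driven_clock(4200, LIGHT_LOAD, PPT_WATTS)
heavy = limit_driven_clock(4200, HEAVY_FMA_LOAD, PPT_WATTS)
print(light, heavy)
```

In the limit-driven model nothing inspects the instruction mix. The heavier load simply reaches the power limit at a lower clock, which is the point being made above.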

Evil Genius 2019-08-21 17:41

[QUOTE=Prime95;524080]All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time.[/QUOTE]


So I have been running AVX-256 bit FFTs for years on my Ryzen 1700. You learn something new every day.


What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit.

Prime95 2019-08-21 17:55

[QUOTE=Evil Genius;524159]What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit.[/QUOTE]

I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial.

Evil Genius 2019-08-21 18:08

[QUOTE=Prime95;524164]I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial.[/QUOTE]

1. three-register non-destructive mode!
2. implied SSE4_2 support
3. possible FMA support
4. faster on some processors (not the mainstream ones, however)
5. explicit zeroing of upper half of register (only important when switching between 128-bit and 256-bit)

Expect about a 5% speed increase compared to SSE2.
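
To illustrate point 1, here is a small simulation of a radix-2 butterfly (compute a+b and a-b) in both operand styles. It is a sketch, not real assembly: the `run` interpreter and the tuple-based instruction format are made up for illustration, and it only counts instructions, so it says nothing about the actual speedup.

```python
OPS = {"add": lambda p, q: p + q, "sub": lambda p, q: p - q}


def run(program, regs):
    """Execute (op, dst, *srcs) tuples on a copy of a dict of vector registers."""
    regs = {name: list(vals) for name, vals in regs.items()}
    for op, dst, *srcs in program:
        if op == "mov":                       # plain register-to-register copy
            regs[dst] = list(regs[srcs[0]])
            continue
        if len(srcs) == 1:                    # two-operand form: dst doubles as first source
            x, y = regs[dst], regs[srcs[0]]
        else:                                 # three-operand form: dst is independent
            x, y = regs[srcs[0]], regs[srcs[1]]
        regs[dst] = [OPS[op](p, q) for p, q in zip(x, y)]
    return regs


# SSE2 style: the destination is overwritten, so a must be copied first.
sse2_style = [
    ("mov", "t", "a"),        # t = a
    ("add", "t", "b"),        # t = t + b
    ("sub", "a", "b"),        # a = a - b
]

# AVX style: a separate destination makes the copy unnecessary.
avx_style = [
    ("add", "t", "a", "b"),   # t = a + b, a left intact
    ("sub", "a", "a", "b"),   # a = a - b
]

start = {"a": [1.0, 2.0], "b": [0.5, 0.25]}
legacy = run(sse2_style, start)
vex = run(avx_style, start)
print(len(sse2_style), len(avx_style), legacy == vex)
```

Both programs leave the same values in `t` and `a`, but the three-operand version needs one fewer instruction per butterfly because the register copy disappears.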

Evil Genius 2019-08-21 19:32

I forgot an important one:

1.5 better scheduling on processors that chop 256-bit operations in half


To summarize:

* three-register non-destructive mode gives AVX-128 about a 5% speed advantage over SSE2
* better instruction scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256 bit

YMMV

Prime95 2019-08-21 20:08

[QUOTE=Evil Genius;524185]
1.5 better scheduling on processors that chop 256-bit operations in half

* better scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256 bit

YMMV[/QUOTE]

I think my mileage will/should vary.

No arguments about the improvements over SSE2.

I contend that using the full 256-bit register should be significantly faster unless the chip developer completely screwed up.

a) Twice as much data in registers -- the fastest possible place to store data.
b) Half as many instructions to be read and decoded.
c) Guaranteed no data dependencies executing on data in the upper vs. lower 128 bits (makes it easier to schedule 128-bit uops).

IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands.
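
Point b) is simple arithmetic, sketched below. The function name is hypothetical, and the model assumes one vector instruction per register-width of data and an even split of a 256-bit instruction into 128-bit uops; it ignores everything else a real core does.

```python
def instruction_and_uop_counts(n_doubles, vector_bits, native_bits):
    """Return (instructions decoded, uops executed) for one pass over n_doubles."""
    lanes = vector_bits // 64                        # doubles held per register
    instructions = -(-n_doubles // lanes)            # ceiling division
    uops_per_instruction = max(1, vector_bits // native_bits)
    return instructions, instructions * uops_per_instruction


# A core with 128-bit execution units, fed 128-bit and then 256-bit instructions.
for bits in (128, 256):
    print(bits, instruction_and_uop_counts(1024, bits, native_bits=128))
```

On a core that splits 256-bit instructions, the number of uops executed is the same either way, but the front end reads and decodes half as many instructions. Whether that translates into a real-world win depends on how well the chip handles the split.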

Evil Genius 2019-08-21 20:13

[QUOTE=Prime95;524195]IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands.[/QUOTE]


They were not the only ones. They deemed compatibility more important than throughput at the time. That is the reason their cores are more efficient power-wise.


But if you have time to spare give it a try on a Zen 1. The results may surprise you.


All times are UTC.