mersenneforum.org  

2019-08-20, 21:59   #353
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7,681 Posts

Quote:
Originally Posted by Evil Genius
Also of note is that if there's no native 256-bit implementation, the 128-bit implementation is faster.

Only in Bulldozer and its descendants. AMD made a real mess of their initial AVX implementation.

All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time.

2019-08-20, 22:00   #354
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7681₁₀ Posts

Quote:
Originally Posted by ixfd64
I don't have NumCPUs set on this computer. I'm also not able to reproduce this on a Mac Pro. It seems this issue only affects certain computers.

I set NumCPUs=2 to emulate your dual-core machine.

2019-08-21, 05:24   #355
Bulldozer

Jun 2019

3·7 Posts

Quote:
Originally Posted by AG5BPilot
There's no such thing as "AVX-256". There's AVX, AVX2 (which isn't important for Prime95/gwnum but comes along with FMA3, which is important), and AVX-512.

What you're calling AVX-256 is plain old original AVX, which has been supported by AMD for as long as Intel has supported it. AMD's implementation was crippled, however, so it hasn't been used here until Zen 2. With Zen 2, it's useful, finally -- but it's always been there.

Zen2 supports FMA3. And it supports AVX. And, finally, they're as good as Intel's implementation. They don't support AVX-512, but that's a whole different discussion.

Edit: Please see the Wikipedia page for the Piledriver architecture: https://en.wikipedia.org/wiki/Piledr...oarchitecture) . It clearly states that Piledriver supports AVX and FMA3.

On Excavator, AVX and FMA3 work well. The performance of the A10-9700 (2m4c4t) is similar to that of a hyperthreaded Kaby Lake i5-8250U (4c8t).
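
To make the naming in the quoted post concrete, here is a minimal Python sketch that maps a set of CPU feature flags to the instruction-set labels used above. It assumes the flag spellings Linux reports in /proc/cpuinfo (`sse2`, `avx`, `fma`, `avx512f`); the function names and the sample text are hypothetical, and this is not how Prime95/gwnum actually selects its code paths. It only reports what a chip supports, which, as this thread shows, says nothing about how fast that support is.

```python
def parse_cpuinfo_flags(text):
    """Return the flag list from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return value.split()
    return []


def widest_vector_path(flags):
    """Label the widest vector instruction set present (support only, not speed)."""
    flags = set(flags)
    if "avx512f" in flags:
        return "AVX-512"
    if "avx" in flags and "fma" in flags:
        return "AVX + FMA3"
    if "avx" in flags:
        return "AVX"
    if "sse2" in flags:
        return "SSE2"
    return "scalar"


# Hypothetical sample text, not taken from a real machine.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx fma avx2\n"
print(widest_vector_path(parse_cpuinfo_flags(sample)))
```

Note that there is no "AVX-256" label anywhere: 256-bit registers are simply part of plain AVX.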

2019-08-21, 06:00   #356
ixfd64
Bemusing Prompter

"Danny"
Dec 2002
California

2423₁₀ Posts

Quote:
Originally Posted by Prime95
I set NumCPUs=2 to emulate your dual-core machine.

I added NumCPUs=2 to my local.txt file as an experiment, and it didn't make a difference. I'm at a loss as to why Prime95 won't let me set fewer than two cores per worker on a dual-core machine when using two worker windows. Have any other Mac users seen this issue?

Last fiddled with by ixfd64 on 2019-08-21 at 06:01

2019-08-21, 07:41   #357
mackerel

Feb 2016
UK

2³·5·11 Posts

Quote:
Originally Posted by Mark Rose
Zen 2 doesn't downclock when doing AVX-256 either.

It does, just not in the same way Intel does. I assume we're familiar with Intel's AVX offset: if AVX code is running, the clock may be reduced by some fixed amount. It is crude but does the job.

Zen 2 doesn't have an AVX offset concept, but when running Prime95-like code with FMA, it still generates a lot of heat. Based on observation of actual behaviour, running stock, you will hit the PPT limit, and the current limit is also close, so it does still clock down compared to running lesser loads. From memory, my 3600 runs all cores at around 3900 MHz with a 128k FFT per core, while under a lighter load like Cinebench R15 it is well above 4 GHz.

AMD took a more GPU-like approach on Zen 2: it adjusts its clock based on power, current, temperature... So they're not detecting AVX and dropping the clock, but AVX code still uses more power and hits the other limits earlier, so the clock still drops.
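
The difference between the two approaches can be sketched as a toy Python model. Everything here is an assumption for illustration: the function names are made up, the offset is expressed in 100 MHz bins, power is treated as scaling linearly with clock (real chips are far from linear), and the numbers are invented rather than measured.

```python
def fixed_offset_clock(boost_mhz, avx_offset_bins, avx_active, bin_mhz=100):
    """Fixed-offset style: drop a set number of bins whenever AVX code runs."""
    return boost_mhz - (avx_offset_bins * bin_mhz if avx_active else 0)


def limit_driven_clock(boost_mhz, milliwatts_per_mhz, ppt_milliwatts):
    """Limit-driven style: run at boost unless the power budget caps the clock.

    Assumes (unrealistically) that package power grows linearly with clock.
    """
    power_limited_mhz = ppt_milliwatts / milliwatts_per_mhz
    return min(boost_mhz, power_limited_mhz)


# Illustrative numbers only: a heavy FMA load draws more power per MHz than a
# light one, so the same power budget allows a lower clock.
heavy = limit_driven_clock(4200, milliwatts_per_mhz=22, ppt_milliwatts=88000)
light = limit_driven_clock(4200, milliwatts_per_mhz=18, ppt_milliwatts=88000)
print(heavy, light)
```

In the limit-driven model nothing checks which instructions are running; the heavier load ends up at a lower clock only because it reaches the power limit first.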

2019-08-21, 17:41   #358
Evil Genius

Jul 2019
the Netherlands

2·11 Posts

Quote:
Originally Posted by Prime95
All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time.

So I have been running AVX-256 bit FFTs for years on my Ryzen 1700. You learn something new every day.

What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit.

2019-08-21, 17:55   #359
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

17001₈ Posts

Quote:
Originally Posted by Evil Genius
What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit.

I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial.

2019-08-21, 18:08   #360
Evil Genius

Jul 2019
the Netherlands

2×11 Posts

Quote:
Originally Posted by Prime95
I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial.

1. three-register non-destructive mode!
2. implied SSE4_2 support
3. possible FMA support
4. faster on some processors (not the mainstream ones, however)
5. explicit zeroing of upper half of register (only important when switching between 128-bit and 256-bit)

Expect about a 5% speed increase compared to SSE2.
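
Point 1 can be illustrated with a rough counting model in Python (a sketch, not real SIMD code). It lowers three-operand instructions to the destructive two-operand form, where the destination doubles as the first source, and inserts a register copy whenever the destination differs from that source. The function name, the tuple layout and the register names are hypothetical, and the model assumes the destination never aliases the second source.

```python
def lower_to_two_operand(ops):
    """Lower (op, dst, src1, src2) tuples to destructive two-operand form.

    A 'mov dst, src1' is inserted whenever dst != src1, because the
    two-operand form overwrites its first source with the result.
    """
    lowered = []
    for op, dst, src1, src2 in ops:
        # Assumption of this sketch: dst never aliases src2.
        assert dst != src2, "sketch does not handle dst aliasing src2"
        if dst != src1:
            lowered.append(("mov", dst, src1))
        lowered.append((op, dst, src2))
    return lowered


# An FFT-style butterfly: d = a - b, then a = a + b (b must survive).
butterfly = [("sub", "d", "a", "b"), ("add", "a", "a", "b")]
lowered = lower_to_two_operand(butterfly)
print(len(butterfly), "three-operand instructions")
print(len(lowered), "two-operand instructions:", lowered)
```

These are instruction counts, not timings; how much an extra copy actually costs depends on the processor.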

Last fiddled with by Evil Genius on 2019-08-21 at 18:23

2019-08-21, 19:32   #361
Evil Genius

Jul 2019
the Netherlands

26₈ Posts

I forgot an important one:

1.5 better scheduling on processors that chop 256-bit operations in half


To summarize:

* three-register non-destructive mode gives AVX-128 about a 5% speed advantage over SSE2
* better instruction scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256-bit operations

YMMV

Last fiddled with by Evil Genius on 2019-08-21 at 20:01

2019-08-21, 20:08   #362
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

7,681 Posts

Quote:
Originally Posted by Evil Genius
1.5 better scheduling on processors that chop 256-bit operations in half

* better scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256 bit

YMMV

I think my mileage will/should vary.

No arguments about the improvements over SSE2.

I contend that using the full 256-bit register should be significantly faster unless the chip developer completely screwed up.

a) Twice as much data in registers -- the fastest possible place to store data.
b) Half as many instructions to be read and decoded.
c) Guaranteed no data dependencies executing on data in the upper vs. lower 128 bits (makes it easier to schedule 128-bit uops).
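
Point (b) can be put into numbers with a small Python sketch. It assumes one arithmetic instruction per register of doubles and that a 256-bit instruction on a 128-bit datapath splits evenly into two uops; the function name and the element count are made up for illustration.

```python
def instruction_and_uop_counts(n_doubles, register_bits, native_bits):
    """Count instructions decoded and uops executed for one pass over the data.

    Assumes one instruction per register of 64-bit doubles and an even split
    of wide instructions into native-width uops.
    """
    lanes = register_bits // 64
    instructions = -(-n_doubles // lanes)  # ceiling division
    uops_per_instruction = max(1, register_bits // native_bits)
    return instructions, instructions * uops_per_instruction


# 1024 doubles on a core with a 128-bit datapath:
print(instruction_and_uop_counts(1024, 128, 128))  # 128-bit code
print(instruction_and_uop_counts(1024, 256, 128))  # 256-bit code, split in two
# The same 256-bit code on a core with a native 256-bit datapath:
print(instruction_and_uop_counts(1024, 256, 256))
```

Under these assumptions the split 256-bit code executes the same number of uops as the 128-bit code but needs only half as many instructions fetched and decoded, and per point (c) the two uops from one instruction never depend on each other.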

IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands.

2019-08-21, 20:13   #363
Evil Genius

Jul 2019
the Netherlands

2×11 Posts

Quote:
Originally Posted by Prime95
IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands.

They were not the only ones. They deemed compatibility more important than throughput at the time, which is the reason their cores are more efficient power-wise.


But if you have time to spare give it a try on a Zen 1. The results may surprise you.
All times are UTC.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.