mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2014-10-03, 01:23   #1
ewmayer
2ω=0
 
 
Sep 2002
República de California

Default x86 Microcode Execution Unit and repeated-address implied loads

(First, a bit of pedantic background for the interested reader who may not be a hardware wonk:)

The x86 CPU architecture family is notoriously register poor - while things have gotten much better in 64-bit, it is still easy to run out of registers, especially if one is trying to manage instruction sequencing in order to maintain high throughput in the presence of higher-latency operations such as MUL.

One major way the x86 ISA (instruction set architecture) mitigates this register-poorness is the "implied load", that is, allowing one of the inputs to most common arithmetic instructions to be a memory reference. For example, in GCC 32-bit inline-assemblerese, let's say I want to take a pair of SSE2 data x and y stored in registers xmm0 and xmm1 and do a simple radix-2 butterfly, producing outputs x+y and x-y. The constraint? This butterfly is competing for registers with other in-flight operations, thus I need it to be in-place, i.e. using just the 2 input registers, preferably without incurring a register spill. So here's one way - again note that in GCC/AT&T syntax the source operand is the left one and the destination is the right one:
Code:
subpd  %xmm1,%xmm0 // x -= y
addpd  %xmm1,%xmm1 // y *= 2
addpd  %xmm0,%xmm1 // y += x
yielding (in terms of the original inputs) x+y in xmm1 and x-y in xmm0. Alternatively, one might have (say) 4 such independent butterflies to do (thus using all 8 xmm registers available in 32-bit mode), and - realizing that while ADD/SUB are low-latency (3 cycles in recent Intel CPUs), the CPU can complete only one such operation per cycle - one might get a better overall cycle count by doing each of the 4 overlapping butterflies like so:
Code:
subpd  %xmm1,%xmm0 // x -= y
mulpd (%eax),%xmm1 // y *= 2
addpd  %xmm0,%xmm1 // y += x
Here the general-purpose register eax is assumed to contain the memory address of the lower of an adjacent pair of double-precision 2.0s. In theory the resulting MULs should be able to overlap the in-flight ADDs in the FPU, rather than competing with them for execution ports as in the first implementation above. The tradeoffs: the higher latency of MUL (5 cycles), and the fact that each MUL incurs a load from memory.

Or does it? Because if I were designing the microcode execution engine (MEE; I'm not sure whether there's a standard CS initialism for this bit of chip functionality) for such an ISA, I would consider things like the following:

[a] The (full 4-fold) code sequence has 4 such MULs, each invoking the same memory location;
[b] A typical modern-day CPU architecture has many more "virtual" registers (both general-purpose and SIMD) than are visible to the coder; these are used for things like in-flight register renaming and speculative execution.

The obvious conclusion here is: the MEE should scan ahead for reused memory references in such implied loads, and whenever possible, load the memory datum into a virtual register, thus incurring just a single memory load.
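In software terms, what I'm hoping the hardware does is exactly what a coder does by hand whenever a register can be spared: hoist the repeated implied load into a register. A hypothetical C sketch (names mine) of the n-butterfly step:

```c
#include <stddef.h>

/* n independent butterflies, each needing the same constant 2.0.
   Written naively, the multiply would re-read *two_ptr (the datum
   (%eax) points to) on every iteration - n implied loads of one
   address. Hoisting it into a local, i.e. a register, performs the
   load once; the question is whether the MEE does this on its own
   when the coder has no architectural register to spare. */
void do_butterflies(double *x, double *y, const double *two_ptr, size_t n) {
    const double two = *two_ptr;   /* single load, lives in a register */
    for (size_t i = 0; i < n; i++) {
        x[i] -= y[i];              /* subpd                          */
        y[i] *= two;               /* mulpd - no memory operand now  */
        y[i] += x[i];              /* addpd                          */
    }
}
```

With 4 butterflies occupying all 8 xmm registers, this hoist is exactly what can't be done by hand in the 32-bit SSE2 code above - hence the question.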

My question is: Do MEEs in modern-day Intel and AMD chips do such a thing? And if so, what kinds of constraints (e.g. depth-of-lookahead, number of distinct memory refs which can be cached at any one time this way) govern them?

Are there any Intel or AMD engineers on this board who know?
Old 2014-10-03, 02:59   #2
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

Default

I don't know the answer.

Some observations:

1) Haswell has twice the L1 bandwidth of Sandy/Ivy. Thus, unless you are doing an enormous number of load/store ops, there should be little or no difference between the microcode reloading 2.0 every time and cleverly keeping 2.0 in a virtual register.

2) I've ceased all development of 32-bit FFT code. With sixteen registers, I often can keep constants in registers.

3) With Skylake, due next year, we will have 32 registers and 8 doubles in an AVX register. All the more reason to abandon 32-bit, 8 register development. Alas, Microsoft's assembler does not support the new instructions yet.
Old 2014-10-03, 06:36   #3
ewmayer
2ω=0
 
 
Sep 2002
República de California

Default

Thanks, George - I likewise have been doing exclusively 64-bit work, except for a brief bout of "restore 32-bit buildability" to allow me to actually build and run SSE2 code on my replacement 2008-9 Macbook Core Duo. I used 32-bit mode for my tiny example only because anyone who has coded 32-bit x86 assembly knows how register-starved it is. (And yet it is materially better than the pre-32-bit x86 CPUs - I know a couple of older guys who love to tell war stories about those, along the lines of the classic Dilbert "A: We had to code using just 0s and 1s. B: They gave you 1s? C: Ha! We didn't even have 0s - we had to use Os" cartoon.)

This is an issue that has nagged at me for a while; today I finally felt compelled to write about it. It would simply be nice to *know*, yes or no, because Intel's and AMD's coyness about such fairly basic and coder-impacting aspects of their architectures leads to a huge amount of wasted "flying blind" time on the part of coders. Basically, I felt like whining. :)
Old 2014-10-03, 08:40   #4
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

Default

Quote:
Originally Posted by Prime95
3) With Skylake, due next year, we will have 32 registers and 8 doubles in an AVX register. All the more reason to abandon 32-bit, 8 register development. Alas, Microsoft's assembler does not support the new instructions yet.
Do you have a source for this other than wikipedia?
Old 2014-10-03, 14:01   #5
jasonp
Tribal Bullet
 
 
Oct 2004

Default

If memory serves (haha), the x86 memory model is fully coherent: writes to memory by one processor are automatically visible to all others in the system at the time the write happens. While that doesn't preclude optimizing away redundant reads as you propose, the view of memory in one thread has to incorporate side effects produced by other threads.

The redundant load can also execute earlier in the pipeline than the point in the instruction stream where it appears, so you would also need logic to examine the address of the load and turn off the load removal if the addresses happen to differ. Plus, if you really do two loads from the same address, the second will already be in cache and wouldn't have a long latency. I would think it unlikely there would be a big payoff in implementing this kind of optimization. Cache it, yes, but don't remove it entirely.
Old 2014-10-03, 22:46   #6
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

Default

Quote:
Originally Posted by henryzz
Do you have a source for this other than wikipedia?
IIRC, Intel has not officially promised this but it was leaked at a developer conference. So, more than speculation but less than guaranteed.

My speculation is that a single ADD (or MUL) will be implemented as two 256-bit uops, executed either on two different ports or in two different clock cycles. Therefore, total FPU throughput would be the same as Haswell's. AVX-512 will make programming easier and will allow Intel to deliver a true 512-bit-wide FPU at a later date, potentially doubling the throughput of existing AVX-512 programs.
Old 2014-10-04, 21:16   #7
ewmayer
2ω=0
 
 
Sep 2002
República de California

Default

Quote:
Originally Posted by Prime95
IIRC, Intel has not officially promised this but it was leaked at a developer conference. So, more than speculation but less than guaranteed.
The Wikipedia page has enough details and Intel whitepaper references that I'd be shocked if there were any appreciable changes in what Intel actually releases.

Quote:
My speculation is that a single ADD (or MUL) will be implemented as two 256-bit uops either executed on two different ports or two different clock cycles. Therefore, total FPU throughput would be the same as Haswell. AVX-512 will make programming easier and allows Intel to deliver a true 512-bit wide FPU at a later date and potentially double the throughput of existing AVX-512 programs.
This "baby steps" approach would be different from Intel's handling of AVX, wouldn't it? I seem to recall they were "true 256-bit" even in the first implementation, Sandy Bridge. AMD, on the other hand, uses half-width (128-bit) uops in their AVX implementations to date. All I can say is, if Intel goes that route for Skylake, let's hope they do a more competent job of it than AMD seems to have done for AVX.
Old 2014-10-05, 02:48   #8
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

Default

Quote:
Originally Posted by ewmayer
This "baby steps" approach would be different than Intel's handling of AVX, wouldn't it? I seem to recall they were "true 256-bit" even in the first implementation, Sandy Bridge.
Yes.

But, I believe the Pentium 4 did the baby steps approach in the move to 128-bit. Core 2 was the first true 128-bit. Sandy the first 256-bit. So Intel has a history of both approaches.

If Intel does implement a true 512-bit-wide FPU, it will be interesting to see the cache improvements that will be required to feed such a beast.
Old 2014-10-07, 08:17   #9
ldesnogu
 
 
Jan 2008
France

Default

Quote:
Originally Posted by henryzz
Do you have a source for this other than wikipedia?
There are so many fake slides floating around that it's hard to know what's official and what isn't.

Here is Intel's official word on it. They only talk about Xeon, which likely means we won't have AVX-512 on typical laptop/desktop CPUs in the Skylake generation.

Another one (bold is mine):
Quote:
As announced last week by James, future Intel Xeon processors will add support for byte and word processing in AVX-512.