mersenneforum.org  

2003-09-05, 02:33   #1
Prime95

Opteron Bottleneck??

The following snippet is typical of prime95's FFT code: read eight registers, do some math, write the results back.

On the P4, the code below takes 41 clocks if the data is in the L1 cache and 45 clocks if it is in the L2 cache. The optimum is 40 clocks (20 addpd/subpd instructions).

The Opteron takes 51 and 74 clocks respectively. If you comment out the loads or the stores, you can get to the 40-clock optimum. Moving the stores up in the code or spreading out the loads did not help. I'm beginning to suspect the Opteron bottleneck is in loading and storing the XMM registers.

More research is needed - I've only got one day invested. Dresdenboy, any insights???

[code:1]
x2cl_eight_reals_fft MACRO srcreg,srcinc,d1
movapd xmm0, [srcreg]
movapd xmm1, [srcreg+d1]
movapd xmm2, [srcreg+16]
movapd xmm3, [srcreg+d1+16]
movapd xmm4, [srcreg+32]
movapd xmm5, [srcreg+d1+32]
movapd xmm6, [srcreg+48]
movapd xmm7, [srcreg+d1+48]
lea srcreg, [srcreg+srcinc]
x8r_fft
movapd [srcreg-srcinc], xmm7
movapd [srcreg-srcinc+16], xmm6
movapd [srcreg-srcinc+32], xmm4
movapd [srcreg-srcinc+48], xmm5
movapd [srcreg-srcinc+d1], xmm1
movapd [srcreg-srcinc+d1+16], xmm3
movapd [srcreg-srcinc+d1+32], xmm0
movapd [srcreg-srcinc+d1+48], xmm2
ENDM
x8r_fft MACRO
subpd xmm3, xmm7 ;; new R8 = R4 - R8
multwo xmm7
addpd xmm7, xmm3 ;; new R4 = R4 + R8
subpd xmm1, xmm5 ;; new R6 = R2 - R6
multwo xmm5
addpd xmm5, xmm1 ;; new R2 = R2 + R6
mulpd xmm3, XMM_SQRTHALF ;; R8 = R8 * square root
mulpd xmm1, XMM_SQRTHALF ;; R6 = R6 * square root
subpd xmm0, xmm4 ;; new R5 = R1 - R5
multwo xmm4
addpd xmm4, xmm0 ;; new R1 = R1 + R5
subpd xmm5, xmm7 ;; R2 = R2 - R4 (new & final R4)
multwo xmm7 ;; R4 = R4 * 2
subpd xmm2, xmm6 ;; new R7 = R3 - R7
multwo xmm6
addpd xmm6, xmm2 ;; new R3 = R3 + R7
subpd xmm1, xmm3 ;; R6 = R6 - R8 (Real part)
multwo xmm3 ;; R8 = R8 * 2
subpd xmm4, xmm6 ;; R1 = R1 - R3 (new & final R3)
multwo xmm6 ;; R3 = R3 * 2
addpd xmm7, xmm5 ;; R4 = R2 + R4 (new R2)
addpd xmm3, xmm1 ;; R8 = R6 + R8 (Imaginary part)
subpd xmm0, xmm1 ;; R5 = R5 - R6 (final R7)
multwo xmm1 ;; R6 = R6 * 2
addpd xmm6, xmm4 ;; R3 = R1 + R3 (new R1)
subpd xmm2, xmm3 ;; R7 = R7 - R8 (final R8)
multwo xmm3 ;; R8 = R8 * 2
subpd xmm6, xmm7 ;; R1 = R1 - R2 (final R2)
multwo xmm7 ;; R2 = R2 * 2
addpd xmm1, xmm0 ;; R6 = R5 + R6 (final R5)
addpd xmm3, xmm2 ;; R8 = R7 + R8 (final R6)
addpd xmm7, xmm6 ;; R2 = R1 + R2 (final R1)
ENDM
[/code:1]
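
As a reading aid for the x8r_fft macro above: each subpd / multwo / addpd triple produces a difference and a sum while touching only two registers. Below is a minimal Python sketch of that pattern, not Prime95's code. It assumes multwo simply doubles a register (the macro body is not shown in the post), and the 2-clocks-per-instruction figure is inferred from the post's "40 clocks for 20 addpd/subpd" rather than taken from a manual. The names sub_multwo_add, ADD_SUB_COUNT and P4_CLOCKS_PER_ADD_SUB are made up for illustration.

```python
# Model one XMM register as a list of two doubles.

def multwo(x):
    # Assumed behaviour of the multwo macro: double each element.
    return [v + v for v in x]

def sub_multwo_add(a, b):
    """Model 'subpd a, b' / 'multwo b' / 'addpd b, a'.

    Returns the final contents of (a, b), i.e. (a - b, a + b).
    """
    a = [x - y for x, y in zip(a, b)]   # subpd a, b : a = a - b
    b = multwo(b)                       # multwo b   : b = 2*b
    b = [y + x for x, y in zip(a, b)]   # addpd b, a : b = 2*b + (a - b) = a + b
    return a, b

# x8r_fft contains 10 subpd and 10 addpd instructions.
ADD_SUB_COUNT = 10 + 10
# Assumption: one addpd/subpd can start every 2 clocks on the P4.
P4_CLOCKS_PER_ADD_SUB = 2
optimal_clocks = ADD_SUB_COUNT * P4_CLOCKS_PER_ADD_SUB

diff, total = sub_multwo_add([5.0, 1.5], [2.0, 0.25])
print(diff, total, optimal_clocks)
```

The doubling step is what lets the macro form the sum after the difference has already overwritten the first register.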

2003-09-05, 03:43   #2
ebx

What if you load the XMM registers in the order you need them in x8r_fft? Usually only the load is a bottleneck, since you need the result to move forward. A store you can just post.

Maybe it is the address calculation that takes time. Is the data always in one cache line (or two, since there is a d1)?

2003-09-05, 09:54   #3
Dresdenboy

One quick idea I have here is that a movapd (which issues 2 ops to 2 of the FADD/FMUL/FSTORE units) can't be paired well with calculating instructions. It could help to split each one into a movlpd and a movhpd, although they have a higher load latency (4).

Is the code running in 32-bit mode? In 64-bit mode it could help to move XMM_TWO and XMM_SQRTHALF to some free SSE2 registers. Currently the code needs 6 64-bit loads from cache for the first 2 instructions of x8r_fft.

I'll try the code in a pipeline simulator to check whether there is some issue we can't think of that easily. On the weekend I'll test some modified code on SourceForge's compile farm.

2003-09-08, 14:47   #4
Dresdenboy

Currently it looks like there are too many memory accesses. When I simply replace all muls that take a constant from memory with muls that multiply the register by itself, and shuffle the movapds a bit, the clock count goes down to 44.

The count could be the same or even lower if the memory constants were held in additional registers.

It also seems that the decoders aren't the limit, because aligning the code to 8-byte boundaries doesn't change anything. Maybe this effect is hidden by FPU scheduler contention.

2003-09-09, 12:59   #5
Dresdenboy

After aligning the branch target of the test loop to a 16-byte boundary and some more reshuffling of loads/stores, I got it down to 47 cycles.

Interestingly, the count can change by 5 or more cycles if just one memory source address is changed by some n*16 bytes. That could cause bank conflict penalties if bits [5:3] are the same for 2 loads during a cycle.

An optimization tutorial by Tim Wilkens has shown some pipeline behaviour of SSE2 code for DGEMM. It seems a movapd from memory executes as 2 sequential loads which are scheduled to one of the 3 units. I have to run some tests to see whether this is true or whether it schedules the loads to 2 different units in the same cycle.
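
The bank-conflict idea in this post can be sketched in a few lines of Python. This is only an illustration of the "bits [5:3]" rule as stated here, assuming eight banks selected by address bits 5 to 3. The function names and the example address are hypothetical.

```python
def bank(addr: int) -> int:
    """Return address bits [5:3], the assumed L1 bank index (0..7)."""
    return (addr >> 3) & 0b111

def same_bank(addr_a: int, addr_b: int) -> bool:
    """True if two loads issued in the same cycle would map to the same bank."""
    return bank(addr_a) == bank(addr_b)

base = 0x1000  # hypothetical 16-byte aligned source address

# Moving an address by n*16 bytes steps the bank index by 2*n (mod 8).
banks = [bank(base + n * 16) for n in range(5)]
print(banks)

# A 16-byte shift changes the bank; a 64-byte shift lands on the same one.
print(same_bank(base, base + 16), same_bank(base, base + 64))
```

Under that assumption, shifting one source address by a multiple of 16 bytes can move it onto or off the bank used by another load, which would fit the swings of 5 or more cycles described above.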

2003-09-09, 15:03   #6
Dresdenboy

Some investigations and known facts:
- code padding for SSE2 is not necessary
- loads of full SSE2 regs are executed in 2 consecutive cycles by one unit
- memory operands cause one cache access per cycle (here one has to take care of the increased latencies of 7 cycles), but usually there is enough room for that. OTOH there are a lot of loads, stores and memory operands in the above code.

So the reason for the 51 cycles seems to be L1 cache bandwidth and the changed latency (compared to the P4) of some instructions, which makes the scheduler less effective. It only holds 12 lines of 3 macroOps each.

The following disassembled code of my test loop executes in 32 cycles.
It shows that it is possible to sustain 3 FPU ops and 2 loads per cycle. If we consider the availability of 64-bit mode, it would be easier to go for doing 2 iterations in ~82 cycles instead of one in 41.

[code:1]
4005e0: 66 0f 58 46 20 addpd 0x20(%rsi),%xmm0
4005e5: 66 0f 59 c9 mulpd %xmm1,%xmm1
4005e9: 66 44 0f 28 43 f0 movapd 0xfffffffffffffff0(%rbx),%xmm8
4005ef: 66 0f 58 56 20 addpd 0x20(%rsi),%xmm2
4005f4: 66 0f 59 db mulpd %xmm3,%xmm3
4005f8: 66 44 0f 28 4b 10 movapd 0x10(%rbx),%xmm9
4005fe: 66 0f 58 66 20 addpd 0x20(%rsi),%xmm4
400603: 66 0f 59 ed mulpd %xmm5,%xmm5
400607: 66 44 0f 28 53 20 movapd 0x20(%rbx),%xmm10
40060d: 66 0f 58 46 20 addpd 0x20(%rsi),%xmm0
400612: 66 0f 59 c9 mulpd %xmm1,%xmm1
400616: 66 44 0f 28 5b 30 movapd 0x30(%rbx),%xmm11
40061c: 66 0f 58 56 20 addpd 0x20(%rsi),%xmm2
400621: 66 0f 59 db mulpd %xmm3,%xmm3
400625: 66 44 0f 28 63 d0 movapd 0xffffffffffffffd0(%rbx),%xmm12
40062b: 66 0f 58 66 20 addpd 0x20(%rsi),%xmm4
400630: 66 0f 59 ed mulpd %xmm5,%xmm5
400634: 66 44 0f 28 6b e0 movapd 0xffffffffffffffe0(%rbx),%xmm13
40063a: 66 0f 58 46 20 addpd 0x20(%rsi),%xmm0
40063f: 66 0f 59 c9 mulpd %xmm1,%xmm1
400643: 66 44 0f 28 73 f0 movapd 0xfffffffffffffff0(%rbx),%xmm14
400649: 66 0f 58 56 20 addpd 0x20(%rsi),%xmm2
40064e: 66 0f 59 db mulpd %xmm3,%xmm3
400652: 66 44 0f 28 7b d0 movapd 0xffffffffffffffd0(%rbx),%xmm15
400658: 66 0f 58 66 20 addpd 0x20(%rsi),%xmm4
40065d: 66 0f 59 ed mulpd %xmm5,%xmm5
400661: 66 44 0f 28 43 10 movapd 0x10(%rbx),%xmm8
400667: 66 0f 58 46 20 addpd 0x20(%rsi),%xmm0
40066c: 66 0f 59 c9 mulpd %xmm1,%xmm1
400670: 66 44 0f 28 4b 20 movapd 0x20(%rbx),%xmm9
400676: 66 0f 58 56 20 addpd 0x20(%rsi),%xmm2
40067b: 66 0f 59 db mulpd %xmm3,%xmm3
40067f: 66 44 0f 28 53 30 movapd 0x30(%rbx),%xmm10
400685: 66 0f 58 66 20 addpd 0x20(%rsi),%xmm4
40068a: 66 0f 59 ed mulpd %xmm5,%xmm5
40068e: 66 44 0f 28 5b d0 movapd 0xffffffffffffffd0(%rbx),%xmm11
400694: 66 0f 58 46 20 addpd 0x20(%rsi),%xmm0
400699: 66 0f 59 c9 mulpd %xmm1,%xmm1
40069d: 66 44 0f 28 63 e0 movapd 0xffffffffffffffe0(%rbx),%xmm12
4006a3: 66 0f 58 56 20 addpd 0x20(%rsi),%xmm2
4006a8: 66 0f 59 db mulpd %xmm3,%xmm3
4006ac: 66 44 0f 28 6b f0 movapd 0xfffffffffffffff0(%rbx),%xmm13
4006b2: 66 0f 58 66 20 addpd 0x20(%rsi),%xmm4
4006b7: 66 0f 59 ed mulpd %xmm5,%xmm5
4006bb: 66 44 0f 28 73 30 movapd 0x30(%rbx),%xmm14
[/code:1]
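
A quick tally of the loop above, as a rough Python sketch. It assumes that every 128-bit instruction (addpd, mulpd, movapd) splits into 2 ops and that every 128-bit memory operand or load costs two 64-bit cache reads, following the earlier posts in this thread. The constant names are made up for illustration.

```python
CYCLES = 32        # measured time for the loop, from the post

# Instruction counts read off the disassembly (45 instructions in total).
ADDPD_MEM = 15     # addpd with a memory source operand
MULPD_REG = 15     # mulpd register, register
MOVAPD_LOAD = 15   # movapd load into xmm8..xmm15

OPS_PER_INSN = 2   # assumption: each 128-bit instruction issues 2 ops
LOADS_PER_128 = 2  # assumption: each 128-bit read is two 64-bit loads

fpu_ops = (ADDPD_MEM + MULPD_REG + MOVAPD_LOAD) * OPS_PER_INSN
loads = (ADDPD_MEM + MOVAPD_LOAD) * LOADS_PER_128

ops_per_cycle = fpu_ops / CYCLES
loads_per_cycle = loads / CYCLES
print(ops_per_cycle, loads_per_cycle)
```

With these assumptions both rates come out just under the 3 FPU ops and 2 loads per cycle quoted in the post.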

2003-09-09, 17:14   #7
Prime95

As an aside, do you know if Microsoft MASM is going to support the extra XMM registers? Also, is it true that these extra registers are only available in 64-bit mode? If so, then MASM would have to output a whole new object file format, true?

Converting all that assembly code to some other syntax would be a horrendously tedious task.
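
On the question of the extra registers: in the disassembly in post #6, every instruction that names xmm8 to xmm15 carries an extra byte (0x44) after the 0x66 prefix. That byte is a REX prefix, and REX prefixes are only decoded as such in 64-bit mode, which is why those registers cannot be reached from 32-bit code. The Python sketch below decodes just the instruction shape seen in that listing (66, optional REX, 0F, opcode, ModRM with an 8-bit displacement). decode_xmm_load is a hypothetical helper, not a general x86 decoder.

```python
def decode_xmm_load(code: bytes):
    """Decode '66 [REX] 0F op ModRM disp8'.

    Returns (opcode, xmm_index, base_register_index, displacement).
    """
    i = 0
    if code[i] != 0x66:
        raise ValueError("expected the 0x66 prefix")
    i += 1
    rex = 0
    if 0x40 <= code[i] <= 0x4F:          # REX prefix, 64-bit mode only
        rex = code[i]
        i += 1
    if code[i] != 0x0F:
        raise ValueError("expected a 0x0F two-byte opcode")
    opcode = code[i + 1]
    i += 2
    modrm = code[i]
    i += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [base + disp8] without SIB is handled")
    disp = int.from_bytes(code[i:i + 1], "little", signed=True)
    reg |= ((rex >> 2) & 1) << 3         # REX.R extends the XMM register index
    rm |= (rex & 1) << 3                 # REX.B extends the base register index
    return opcode, reg, rm, disp

# movapd -0x10(%rbx),%xmm8 : xmm index 8, base register 3 (rbx), disp -16
with_rex = decode_xmm_load(bytes.fromhex("66440f2843f0"))
# addpd 0x20(%rsi),%xmm0 : no REX byte, xmm index 0, base register 6 (rsi)
without_rex = decode_xmm_load(bytes.fromhex("660f584620"))
print(with_rex, without_rex)
```

Without the REX byte the 3-bit register field in ModRM can only name xmm0 to xmm7.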

2003-09-09, 18:10   #8
Dresdenboy

There is a version of MASM for AMD64 available:
(found at http://www.sandpile.org/post/msgs/20004230.htm)
Quote:
A while ago someone mentioned in an anonymous sandpile.org forum
message that the May 2003 edition of Microsoft's DDK -- which is
available for a relatively modest shipping/handling fee -- seems
to contain a version of MASM with x86-64 support.

Back then I offered to host those files... and today I am making
good on my offer: ftp://ftp.sandpile.org/ml64.zip (3.5 MB).

This link will expire in a few weeks, or when my quota runs out.

I'll check that MASM version soon. Also, I read that GCC supports the Intel ASM syntax, and there are tools which convert Intel to AT&T syntax and vice versa.

In a discussion about the DDK it was mentioned that 64-bit mode only supports SSE and SSE2, with no more x87, 3DNow! or MMX. I don't know if this is true, but it wouldn't make sense to disable them: the instruction codes aren't used otherwise, and x87 is at least needed for transcendentals and other more complex functions.

Matthias

2003-09-09, 21:13   #9
gbvalor

Quote:
Is the code running in 32-bit mode? In 64-bit mode it could help to move XMM_TWO and XMM_SQRTHALF to some free SSE2 registers. Currently the code needs 6 64-bit loads from cache for the first 2 instructions of x8r_fft.

It would be better to store minus two in an XMM register. The basic addsub(xmm0, xmm1) could then be
[code:1]
addpd xmm0, xmm1 ; xmm0 = r0 + r1
mul_minustwo xmm1 ; xmm1 = -2*r1
addpd xmm1, xmm0 ; xmm1 = (r0 + r1) - 2*r1 = r0 - r1
[/code:1]

At exit, xmm0 still holds the new r0 and xmm1 the new r1. The advantage is that each XMM register holds the same r on exit as on entry, which makes the code easier to understand, I think.

My 0.02$. :)

Guillermo
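
A small Python check of the minus-two variant, written as a sketch rather than anything from Prime95's source. It assumes mul_minustwo multiplies a register by -2 and that the final add targets xmm1 (addpd xmm1, xmm0), which is what the comment "r1 <- r0 - r1" calls for. Scalars stand in for the packed registers, and addsub_minus_two is a made-up name.

```python
def addsub_minus_two(r0, r1):
    """Return (r0 + r1, r0 - r1) using add, multiply by -2, add."""
    x0, x1 = r0, r1
    x0 = x0 + x1      # addpd xmm0, xmm1   : x0 = r0 + r1
    x1 = -2.0 * x1    # mul_minustwo xmm1  : x1 = -2*r1
    x1 = x1 + x0      # addpd xmm1, xmm0   : x1 = r0 + r1 - 2*r1 = r0 - r1
    return x0, x1

new_r0, new_r1 = addsub_minus_two(5.0, 2.0)
print(new_r0, new_r1)
```

The sum stays in the first register and the difference in the second, which is the property the post is after.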

2003-09-09, 21:43   #10
Prime95

Quote:
Originally Posted by gbvalor
It would be better to store minus two in an XMM register.

My assembly code is never very readable :)

Anyway, the real reason it is a mul-by-two is that it lets you choose between "addpd reg, reg" and "mulpd reg, XMM_TWO" rather easily.

2003-09-09, 21:46   #11
Prime95

Thanks for the MASM link, I'll play with it some. Already I've noticed that some x86 instructions no longer exist, like "push ebp" and "push OFFSET global_var". Looks like I'll have to download the x86-64 manual.

I sure hope the MASM output can be turned into a Linux-compatible object file. Does anyone know what format the MASM object file is?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.