mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   How can this code take the time it does? (https://www.mersenneforum.org/showthread.php?t=13945)

fivemack 2010-09-21 22:49

How can this code take the time it does?
 
This is just a load of independent 64x64->128 multiplies; compiled with g++ -march=native -O3

[code]
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

typedef unsigned long long u64;
typedef unsigned int u32;

u64 rdtsc(void)
{
u64 d;
u32 hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi) );
d=hi; d<<=32; return d|lo;
}

inline u64 hisq(u64 A)
{
u64 r0,r1;
asm ("mulq %2" \
: "=a"(r0),"=d"(r1) \
: "a"(A) \
: "cc");
return r1;
}

int main(void)
{
u64 a,b,c,d,e,f,g,h,t0,t1;
a=0x123456789abcdef0LL;
b=(a<<4)|1; c=(b<<4)|2; d=(c<<4)|3;
e=(d<<4)|4; f=(e<<4)|5; g=(f<<4)|6; h=(g<<4)|7;
vector<u64> times;
for (int trial=0; trial<10000;)
{
t0=rdtsc();
for (u64 i=0; i<100000; i++)
{
a=a+hisq(a);
b=b+hisq(b);
c=c+hisq(c);
//d=d+hisq(d);
//e=e+hisq(e);
//f=f+hisq(f);
}
t1=rdtsc();
if ((t1-t0 < 3500000)) { times.push_back(t1-t0);
trial++;}
}
cout << dec << (a^b^c^d^e^f^g^h) << " just to make sure it gets computed\n";
sort(times.begin(),times.end());
cout << times[0]<<" "<<times[100] << " " << times[5000] << " " << times[9900] << endl;
}
[/code]

You can comment out some of the x=x+hisq(x) lines in the middle to change the number of independent trails being run. Timings are {best, 1st%, 50th%, 99th%}

The timing on the core2 isn't that hard to understand

[code]

// core2

// one 9.1 / 9.1 / 9.1 / 13.7
// two 9.1 / 9.2 / 9.2 / 10.1
// three 12.0 / 12.0 / 12.0 / 18.0
// four 16.0 / 16.0 / 16.0 / 17.0
// five 20.0 / 20.0 / 20.0 / 21.0
// six 24.0 / 24.0 / 24.0 / 24.6
[/code]

and the k10 at least leaves the numbers integral, though I don't see how they're going up
[code]
// k10 phenom

// one 7.0 / 7.0 / 7.0 / 7.1
// two 8.0 / 8.0 / 8.0 / 8.1
// three 11.0 / 11.0 / 11.0 / 11.1
// four 13.0 / 13.0 / 13.0 / 13.1
// five 18.0 / 18.0 / 18.0 / 18.1
// six 21.0 / 21.0 / 21.0 / 21.1
[/code]

but the i7 numbers seem completely mad to me; they're vaguely consistent with starting a multiply every cycle and taking four cycles to complete, whilst overlapping the hell out of all the other instructions that need issuing, but why are they nowhere near to being integer numbers of clock cycles?
[code]
// i7
// one multiply: 11.4 / 11.4 / 11.5 / 13.7
// two 11.4 / 11.4 / 11.8 / 13.4
// three 11.8 / 12.2 / 12.5 / 14.2
// four multiplies: 12.3 / 12.7 / 13.6 / 16.2
// five 13.7 / 13.8 / 15.3 / 18.6
// six 14.7 / 14.7 / 17.1 / 22.5
[/code]

Intel claim '25% faster 64x64->128 multiply' for Sandy Bridge, so presumably three cycles to complete rather than four.

mdettweiler 2010-09-21 22:58

While I really don't know too much about low-level CPU instructions and my comments here should be taken with a grain of salt, might hyperthreading be the cause of the strange numbers on the i7? If I understand it correctly, the idea behind hyperthreading is that when you have all the hyperthreads busy (i.e. two processes per core), each will [i]mostly[/i] run at 1/2 speed, but every once in a while some of them will be doing the same kind of operation at the same time, and the CPU can save some time by doing them both at once. (Or something like that.) Hence, an operation can be considered to take a partial cycle.

Again, what I'm saying here could be complete bunk so take it with a grain of salt.

Robert Holmes 2010-09-21 23:41

Since there aren't memory accesses in this, we can measure what's going on fairly accurate using [URL="http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/"]IACA[/URL]:

[CODE]
[FONT=Fixedsys]$ iaca -64 -cp PERFORMANCE -arch nehalem a.out
Intel(R) Architecture Code Analyzer Version - 1.1.0
Analyzed File - a.out
Binary Format - 64Bit
Architecture - Intel(R) microArchitecture - codename Nehalem

Analysis Report[/FONT] [FONT=Fixedsys]
---------------
Total Throughput: 32 Cycles; Throughput Bottleneck: Port0, Port1, Port5
Total number of Uops bound to ports: 96
Data Dependency Latency: 40 Cycles; Performance Latency: 51 Cycles

Port Binding in cycles:[/FONT] [FONT=Fixedsys]
-------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------
| Cycles | 32 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 32 |
-------------------------------------------------------

N - port number, DV - Divider pipe (on port 0), D - Data fetch pipe (on ports[/FONT] [FONT=Fixedsys]
CP - on a critical Performance Path
N - number of cycles port was bound
X - other ports that can be used by this instructions
F - Macro Fusion with the previous instruction occurred
^ - Micro Fusion happened
* - instruction micro-ops not bound to a port
@ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expe
! - instruction not supported, was not accounted in Analysis

| Num of | Ports pressure in cycles | |[/FONT] [FONT=Fixedsys]
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
------------------------------------------------------------
| 1 | 1 | | X | | | | | | X | | mov rax, r14
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r15
| 1 | X | | 1 | | | | | | X | | add r14, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r12
| 1 | X | | X | | | | | | 1 | | add r15, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbp
| 1 | 1 | | X | | | | | | X | | add r12, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbx
| 1 | X | | 1 | | | | | | X | | add rbp, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | CP | mov rax, r13
| 1 | X | | X | | | | | | 1 | | add rbx, rdx
| 2 | 1 | | 1 | | | | | | | CP | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r14
| 1 | 1 | | X | | | | | | X | CP | add r13, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r15
| 1 | X | | 1 | | | | | | X | | add r14, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r12
| 1 | X | | X | | | | | | 1 | | add r15, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbp
| 1 | 1 | | X | | | | | | X | | add r12, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbx
| 1 | X | | 1 | | | | | | X | | add rbp, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | CP | mov rax, r13
| 1 | X | | X | | | | | | 1 | | add rbx, rdx
| 2 | 1 | | 1 | | | | | | | CP | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r14
| 1 | 1 | | X | | | | | | X | CP | add r13, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r15
| 1 | X | | 1 | | | | | | X | | add r14, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r12
| 1 | X | | X | | | | | | 1 | | add r15, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbp
| 1 | 1 | | X | | | | | | X | | add r12, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbx
| 1 | X | | 1 | | | | | | X | | add rbp, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | CP | mov rax, r13
| 1 | X | | X | | | | | | 1 | | add rbx, rdx
| 2 | 1 | | 1 | | | | | | | CP | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r14
| 1 | 1 | | X | | | | | | X | CP | add r13, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r15
| 1 | X | | 1 | | | | | | X | | add r14, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, r12
| 1 | X | | X | | | | | | 1 | | add r15, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbp
| 1 | 1 | | X | | | | | | X | | add r12, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | | mov rax, rbx
| 1 | X | | 1 | | | | | | X | | add rbp, rdx
| 2 | 1 | | 1 | | | | | | | | mul rax
| 1 | X | | X | | | | | | 1 | CP | mov rax, r13
| 1 | X | | X | | | | | | 1 | | add rbx, rdx
| 2 | 1 | | 1 | | | | | | | CP | mul rax
| 1 | X | | X | | | | | | 1 | CP | add r13, rdx[/FONT]
[/CODE]This is 6 independent trails, unrolled 4 times. This gives us 51/4 = 12.75 clocks per iteration, which is somewhat agreeing with your wacky results.


All times are UTC. The time now is 22:52.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.