#1
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
Just had my old WinXP laptop at work replaced with a spiffy quad-core i7/Win7 one. I sometimes used the old system to test out my latest Mlucas Visual Studio builds (done on my XP notebook at home), to see if any timing changes observed on the home system were (at least qualitatively) reproducible on the work one. (The 2 systems differ in manufacturer, hardware and chip revision.)
So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow. Some questions:
- Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?
- The 32-bit Windows build uses Visual-Studio-style inline assembler for the SSE2 code (for gcc builds I have gcc-style macros, both 32 and 64-bit ... most of the latter attempt to take advantage of the extra 8 xmm registers to boost performance). But Microsoft being Microsoft, Visual Studio does not support 64-bit inline ASM. What is my best option for porting the assembler to a 64-bit Windows build? Port the macros to 64-bit MASM, use 64-bit inline ASM with the Intel C compiler, what?
#2
"Bob Silverman"
Nov 2003
North of Boston
5·17·89 Posts
Quote:
WIN-7 box as it does on my 32-bit WIN-XP at work. It is compiled as a WIN-32 console app.
Quote:
e.g.
Code:
/************************************************************************/
/* */
/* compute (a*2^30 + b)/c and (a*2^30 + b) % c */
/* assembler version */
/* */
/************************************************************************/
void divrem_asm(unsigned int a, unsigned int b, unsigned int c, unsigned int d[2])
{   /* start of divrem_asm */
    /* We could use a double-length register from the MMX instruction set, */
    /* however, the EMMS instruction must be executed to clean up the      */
    /* FP registers every time we use MMX. EMMS is a very lengthy          */
    /* instruction compared to what we are doing here.                     */
    _asm {
        /* edx:eax = (a << 30) + b; */
        mov eax,a
        shl eax,30
        mov edx,a
        shr edx,2
        /* the mask below is redundant after shr edx,2, but harmless */
        and edx,0x3fffffff
        add eax,b
        adc edx,0
        /* Now divide (a * 2^30 + b), which resides in the register pair edx:eax */
        /* eax = edx:eax / c */
        /* edx = edx:eax % c */
        mov ecx,c
        /* d is a pointer (load moved ahead here as a Pentium optimization) */
        mov edi,d
        /* div faults if the quotient does not fit in 32 bits */
        div ecx
        /* d[0] = ((a << 30) + b) / c; */
        mov DWORD PTR[edi],eax
        /* d[1] = ((a << 30) + b) % c; */
        mov DWORD PTR[edi]+4,edx
    } // end _asm
} /* end of divrem_asm */
Do you attach the _inline declaration to the _asm routine?
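For comparison, the same quotient and remainder can be computed without any inline assembler, which sidesteps the 32-bit versus 64-bit build question entirely. The following is a minimal sketch, not code from the thread: the name `divrem_c` is hypothetical, and it assumes a compiler that provides `uint64_t`. The second assertion mirrors the precondition of the x86 `div` instruction, which faults when the quotient does not fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch (hypothetical name divrem_c):
   d[0] = (a*2^30 + b) / c,  d[1] = (a*2^30 + b) % c.
   A 64-bit intermediate plays the role of the edx:eax register pair. */
void divrem_c(uint32_t a, uint32_t b, uint32_t c, uint32_t d[2])
{
    /* a < 2^32, so (a << 30) < 2^62; adding b < 2^32 cannot overflow 64 bits */
    uint64_t n = ((uint64_t)a << 30) + b;

    assert(c != 0);                 /* division by zero is undefined */
    assert(n / c <= UINT32_MAX);    /* same limit the 32-bit div instruction imposes */

    d[0] = (uint32_t)(n / c);       /* quotient  */
    d[1] = (uint32_t)(n % c);       /* remainder */
}
```

Because this is plain C, the compiler is free to inline it and to pick the division instruction appropriate to the target, at the cost of giving up explicit control over instruction scheduling.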
#3
Jun 2003
2³×683 Posts
Quote:
Quote:
Have you looked at Agner Fog's optimization manuals? They cover in some detail various considerations for 64-bit C++/asm development on x86.
#4
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
I'll try to answer everyone's questions in a single reply:
- "Suspiciously slow" means no faster - in fact 5-10% slower on a per-cycle basis - than on my i5-based MacBook, which runs a 64-bit gcc build, i.e. uses versions of key macros that attempt to leverage the extra 8 SSE registers. So the 5-10% hit is explainable by the Windows build being 32-bit - guess I was just hoping for a decent per-cycle boost from running the same build on spiffier hardware, is all.
- @Bob: Your macro syntax does not compile for me in Visual Studio - what is your build environment? The simple change of prepending a second _ (i.e. __asm) allows the macro to compile ... but the syntax I use in VS prefixes each assembler instruction with __asm ... I prefer that because I can seamlessly intermingle C code and ASM, which is very handy for the early-prototyping stages of the code development. I don't add any inline specifiers because VS (I know this based on examining the assembler output resulting from compilation) inlines the macro invocations automatically. Also, your code sample is 32-bit ASM ... try replacing the 32-bit ints with 64-bit ones, and the e-registers with their 64-bit r-prefixed counterparts.
- @axn: Yes, have read Agner's excellent manual ... in my case the main decision with respect to 64-bit coding is how to effectively use the extra SSE registers. My basic code development here - since I was a latecomer to inline ASM and SSE - has until recently been focused on simply getting a fully functional SSE2 version of the Mlucas FFT and DWT code working, with most optimizations being decidedly low-tech "eyeball level" ones. Now that the basic SSE code is all in place (I just recently finished the SSE coding of the final 2 front-end radix routines I need to support the full spectrum of non-power-of-2 FFT lengths the scalar Mlucas code does, i.e. each power-of-2 interval subdivided into 8 equal-stride subdivisions), it looks like it's high time to get down to some serious instruction-level code profiling.
For example: based on a rough FFT opcount, my code is only yielding 50-60% saturation of the 128-bit adders. There's a big potential speedup to be had there, but it needs to be seen whether the suboptimality is due to memory (load/store/cache-miss) issues or add-port saturation (there are some subtle things that can occur here), or both.
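A saturation figure of this kind can be estimated as the number of packed adds the transform needs divided by the cycles it actually takes. The sketch below is only an illustration of that arithmetic: the function name `adder_utilization` is hypothetical, and it assumes that each 128-bit add handles 2 doubles and that the core can start at most one packed add per cycle, neither of which is stated in the thread.

```c
#include <assert.h>

/* Fraction of peak packed-add throughput achieved (hypothetical helper).
   scalar_adds : double-precision adds/subtracts in the transform, from an opcount
   cycles      : measured cycle count for the same transform
   Assumptions : 2 doubles per 128-bit add, peak of 1 packed add per cycle. */
double adder_utilization(double scalar_adds, double cycles)
{
    assert(scalar_adds >= 0.0 && cycles > 0.0);

    double packed_adds = scalar_adds / 2.0;   /* 2 doubles per xmm register */
    return packed_adds / cycles;              /* 1.0 would mean a saturated adder */
}
```

With made-up inputs of 1,100,000 scalar adds and 1,000,000 cycles this gives 0.55, i.e. the middle of the 50-60% range quoted above. The ratio alone does not say whether the shortfall comes from memory stalls or from port contention.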
#5
Jun 2003
5464₁₀ Posts
Quote:
Last fiddled with by axn on 2010-10-05 at 16:31
#6
"Bob Silverman"
Nov 2003
North of Boston
7565₁₀ Posts
Quote:
However, the code I gave is not a macro. It is a subroutine. I don't like interweaving _asm with actual C code because you never know when changes in the C code will change register usage --> invalidating the inline assembler. I prefer to make all of my _asm's into subroutines (it is cleaner and less error prone IMO) and let the compiler handle the inlining when possible.
#7
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts
AFAIK MPIR, msieve etc. use things like yasm to add support for 64-bit asm to Visual Studio.
#8
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote:
@henryzz: Yes, I am contemplating adding a 64-bit assembler to my VS setup, but would like to know if there is a straightforward way to port either my current 32-bit VS inline ASM code or the corresponding GCC macros to 64-bit MASM-style code. (I've only ever written VS-style inline ASM and GCC-style, never MASM).
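One further route, besides a separate assembler such as MASM or yasm, is compiler intrinsics: Visual Studio's 64-bit compiler rejects `__asm` blocks but does accept the SSE2 intrinsics declared in `<emmintrin.h>`, as do gcc and the Intel compiler. The trade-off is that register allocation and instruction ordering are left to the compiler. The following is a minimal sketch of the idea, not Mlucas code; `radix2_butterfly` is a hypothetical name, and each complex value is assumed to be stored as a {re, im} pair of doubles.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_add_pd, _mm_sub_pd, ... */

/* Radix-2 butterfly on one pair of complex doubles (hypothetical helper):
   x <- x + y,  y <- x - y, where x and y each hold {re, im}. */
void radix2_butterfly(double x[2], double y[2])
{
    assert(x && y);

    __m128d vx = _mm_loadu_pd(x);            /* unaligned load of {re, im} */
    __m128d vy = _mm_loadu_pd(y);

    _mm_storeu_pd(x, _mm_add_pd(vx, vy));    /* one packed add covers re and im */
    _mm_storeu_pd(y, _mm_sub_pd(vx, vy));    /* one packed subtract likewise */
}
```

The same source builds unchanged as 32-bit or 64-bit; in a 64-bit build the compiler may use the extra xmm registers on its own, though whether it schedules them as well as hand-written macros would have to be checked in the generated assembler.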
#9
A Sunny Moo
Aug 2007
USA
2·47·67 Posts
Quote:
- For desktops, i5s are quad-cores with no hyperthreading (i.e. 4 actual, 4 virtual cores). For laptops, i5s are dual-cores with hyperthreading (2 actual, 4 virtual).
- For desktops, i7s are quad-cores with hyperthreading (4 actual, 8 virtual). For laptops, I think they are always quad-cores with no hyperthreading (4 actual, 4 virtual); I could be wrong, though, since while what I'm saying holds true for most laptop i7s, there may be some high-end ones available with HT.
#10
Jun 2003
2³·683 Posts
Quote:
#11
A Sunny Moo
Aug 2007
USA
2·47·67 Posts
Quote:
(Or is that not the whole list of desktop i5s?)