mersenneforum.org Running 32-bit builds on a Win7 system

2010-10-04, 19:50   #1
ewmayer
∂²ω=0

Sep 2002
República de California

10110101101110₂ Posts

Running 32-bit builds on a Win7 system

Just had my old WinXP laptop at work replaced with a spiffy quad-core i7/Win7 one. I sometimes used the old system to test out my latest Mlucas Visual Studio builds (done on my XP notebook at home), to see if any timing changes observed on the home system were (at least qualitatively) reproducible on the work one. (The 2 systems differ in manufacturer, hardware and chip revs.) So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow. Some questions:

- Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?

- The 32-bit Windows build uses Visual-Studio-style inline assembler for the SSE2 code (for gcc builds I have gcc-style macros, both 32 and 64-bit ... most of the latter attempt to take advantage of the extra 8 xmm registers to boost performance). But Microsoft being Microsoft, Visual Studio does not support 64-bit inline ASM. What is my best option for porting the assembler to a 64-bit Windows build? Port the macros to 64-bit MASM, use 64-bit inline ASM with the Intel C compiler, what?

2010-10-04, 21:55   #2
R.D. Silverman

Nov 2003

2²×5×373 Posts

Quote:
 Originally Posted by ewmayer Just had my old WinXP laptop at work replaced with a spiffy quad-core i7/Win7 one. I sometimes used the old system to test out my latest Mlucas Visual Studio builds (done on my XP notebook at home), to see if any timing changes observed on the home system were (at least qualitatively) reproducible on the work one. (The 2 systems differ in manufacturer, hardware and chip revs.) So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow. Some questions: - Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?
Not as far as I know. My NFS code runs at the same speed on my home
WIN-7 box as it does on my 32-bit WIN-XP at work. It is compiled as
a WIN-32 console app.

Quote:
 - The 32-bit Windows build uses Visual-studio-style inline assembler for the SSE2 code (for gcc builds I have gcc-style macros, both 32 and 64-bit ... most of the latter attempt to take advantage of the extra 8 xmm registers to boost performance). But Microsoft being Microsoft, Visual Studio does not support 64-bit inline ASM.
??? I have 64-bit asms.....

e.g.

Code:
/************************************************************************/
/*																		*/
/*	compute (a*2^30 + b)/c  and (a*2^30 + b) % c						*/
/*	assembler version													*/
/*																		*/
/************************************************************************/

void divrem_asm(a,b,c,d)
unsigned int a,b,c,d[2];

{	/* start of divrem_asm */

/* We could use a double-length register from the MMX instruction set;	*/
/* however, the EMMS instruction must be executed to clean up the		*/
/* FP registers every time we use MMX.  EMMS is a very lengthy			*/
/* instruction compared to what we are doing here.						*/

_asm {

/*		edx:eax = (a << 30) + b;										*/

mov		eax,a
shl		eax,30
mov		edx,a
shr		edx,2
and		edx,0x3fffffff

/*  Add b into the low word and propagate the carry into the high word	*/
add		eax,b
adc		edx,0

/*  Now divide a*2^30 + b, which resides in the register pair edx:eax	*/
/*  eax = edx:eax / c													*/
/*	edx = edx:eax % c													*/

mov		ecx,c

/*  d[x] is a pointer (moved ahead here for a Pentium optimization)     */
mov		edi,d
div		ecx

/*   d[0] = ((a << 30) + b) / c;										*/
mov		DWORD PTR[edi],eax

/*	 d[1] = ((a << 30) + b) % c;										*/
mov		DWORD PTR[edi]+4,edx

} // end _asm

}	/* end of divrem_asm */
This e.g. gets inlined by the compiler.
Do you attach the _inline declaration to the _asm routine?
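
For checking the assembler against plain C, a portable reference can be handy. The following is a minimal sketch, not part of the original post: the name `divrem_ref` is hypothetical, and it assumes a C99 compiler with `<stdint.h>`. It computes the same quotient and remainder with a 64-bit intermediate, and asserts the precondition that the x86 `div` instruction needs, namely that the quotient fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Portable reference (hypothetical helper, not from the thread):
 *   d[0] = (a*2^30 + b) / c
 *   d[1] = (a*2^30 + b) % c
 */
static void divrem_ref(uint32_t a, uint32_t b, uint32_t c, uint32_t d[2])
{
    /* a < 2^32, so a*2^30 < 2^62; adding b < 2^32 cannot overflow 64 bits. */
    uint64_t n = ((uint64_t)a << 30) + b;

    assert(c != 0);
    /* 32-bit div faults unless the high word of the dividend is below the divisor. */
    assert((n >> 32) < c);

    d[0] = (uint32_t)(n / c);
    d[1] = (uint32_t)(n % c);
}
```

Calling `divrem_ref(5, 7, 1000000, d)` leaves the quotient 5368 in `d[0]` and the remainder 709127 in `d[1]`, which gives a quick cross-check for any assembler variant.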

2010-10-04, 22:44   #3
axn

Jun 2003

2·3·827 Posts

Quote:
 Originally Posted by ewmayer So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow.
Suspicious? Like, how? Also, beware of the hyperthreading!

Quote:
 Originally Posted by ewmayer Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?
Not virtualization per se (WOW64 thunking, I believe, is the technical term). And no.

Quote:
 Originally Posted by ewmayer What is my best option for porting the assembler to a 64-bit windows build? Port the macros to 64-bit MASM, use 64-bit inline ASM with the Intel C compiler, what?
Have you looked at Agner Fog's optimization manuals? They cover in some detail various considerations for 64-bit C++/asm development on x86.

2010-10-05, 16:05   #4
ewmayer
∂²ω=0

Sep 2002
República de California

2×5×1,163 Posts

I'll try to answer everyone's questions in a single reply:

- "Suspiciously slow" means no faster - in fact 5-10% slower on a cycle-per-cycle basis - than on my i5-based macbook, which runs a 64-bit gcc build, i.e. uses versions of key macros that attempt to leverage the extra 8 SSE registers. So the 5-10% hit is explainable by the Windows build being 32-bit - guess I was just hoping for a decent per-cycle boost from running the same build on spiffier hardware, is all.

- @Bob: Your macro syntax does not compile for me in Visual Studio - what is your build environment? The simple change of prepending a second _ (i.e. __asm) allows the macro to compile ... but the syntax I use in VS prefixes each assembler instruction with __asm ... I prefer that because I can seamlessly intermingle C code and ASM, which is very handy for the early-prototyping stages of the code development. I don't add any inline specifiers because VS (I know this based on examining the assembler output resulting from compilation) inlines the macro invocations automatically. Also, your code sample is 32-bit ASM ... try replacing the 32-bit ints with 64-bit ones, and the e-registers with their 64-bit r-prefixed counterparts.

- @axn: Yes, have read Agner's excellent manual ... in my case the main decision with respect to 64-bit coding is how to effectively use the extra SSE registers. My basic code development here - since I was a latecomer to inline ASM and SSE - has until recently been focused on simply getting a fully-functional SSE2 version of the Mlucas FFT and DWT code working, with most optimizations being decidedly low-tech "eyeball level" ones. Now that the basic SSE code is all in place (I just recently finished the SSE coding of the final 2 front-end radix routines I need to support the full spectrum of non-power-of-2 FFT lengths the scalar Mlucas code does, i.e. each power-of-2 interval subdivided into 8 equal-stride subdivisions), it looks like it's high time to get down to some serious instruction-level code profiling. For example: based on a rough FFT opcount, my code is only yielding 50-60% saturation of the 128-bit adders. There's a big potential speedup to be had there, but it needs to be seen whether the suboptimality is due to memory (load/store/cache-miss) issues or add-port saturation (there are some subtle things that can occur here), or both.

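A saturation figure like the one quoted can be estimated from an operation count and a cycle count. The sketch below is only an illustration under stated assumptions, not the method used in the post: the function name `adder_utilization` is hypothetical, each 128-bit SSE2 add is taken to process 2 doubles, and the core is assumed to issue at most one packed add per cycle.

```c
#include <assert.h>

/* Hypothetical helper: fraction of peak packed-add throughput achieved.
 *   scalar_adds : double-precision adds/subtracts counted in the FFT pass
 *   cycles      : measured clock cycles for the same pass
 * Assumptions (not from the thread): 2 doubles per 128-bit add, and a peak
 * issue rate of one packed add per cycle.
 */
static double adder_utilization(double scalar_adds, double cycles)
{
    const double doubles_per_packed_add = 2.0;
    const double packed_adds_per_cycle  = 1.0;  /* assumed peak issue rate */

    assert(cycles > 0.0);

    return (scalar_adds / doubles_per_packed_add)
         / (cycles * packed_adds_per_cycle);
}
```

With made-up inputs of 1100 scalar adds in 1000 cycles this gives 0.55, i.e. the adders are busy a bit over half the time. The ratio alone does not say whether memory stalls or port contention account for the rest.
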
2010-10-05, 16:28   #5
axn

Jun 2003

2·3·827 Posts

Quote:
 Originally Posted by ewmayer "Suspiciously slow" means no faster - in fact 5-10% slower on a cycle-per-cycle basis - than on my i5-based macbook, which runs a 64-bit gcc build, i.e. uses versions of key macros that attempt to leverage the extra 8 SSE registers. So the 5-10% hit is explainable by the Windows build being 32-bit - guess I was just hoping for a decent per-cycle boost from running the same build on spiffier hardware, is all.
i5 and i7 (and i3) are based off the same microarchitecture. Ergo, same performance per clock cycle. i7 has 4 actual cores (8 HT cores) (don't get me started on the hexacore!), whereas i5 has 4 HT cores (except for i5 720 which has 4 actual cores, but no HT).

Last fiddled with by axn on 2010-10-05 at 16:31

2010-10-05, 16:43   #6
R.D. Silverman

Nov 2003

2²·5·373 Posts

Quote:
 Originally Posted by ewmayer - @Bob: Your macro syntax does not compile for me in Visual Studio - what is your build environment? The simple change of prepending a second _ (i.e. __asm) allows the macro to compile ... but the syntax I use in VS prefixes each assembler instruction with __asm ... I prefer that because I can seamlessly intermingle C code and ASM, which is very handy for the early-prototyping stages of the code development. I don't add any inline specifiers because VS (I know this based on examining the assembler output resulting from compilation) inlines the macro invocations automatically. Also, your code sample is 32-bit ASM ... try replacing the 32-bit ints with 64-bit ones, and the e-registers with their 64-bit r-prefixed counterparts.
I will try using the 64 bit registers.

However, the code I gave is not a macro. It is a subroutine.
I don't like interweaving _asm with actual C code because you never
know when changes in the C code will change register usage -->
invalidating the in line assembler. I prefer to make all of my _asm's
into subroutines (it is cleaner and less error prone IMO) and let the compiler
handle the inlining when possible.

2010-10-05, 16:45   #7
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2×5×587 Posts

AFAIK MPIR, msieve etc. use things like yasm to add support for 64-bit asm to Visual Studio.

2010-10-05, 17:06   #8
ewmayer
∂²ω=0

Sep 2002
República de California

2×5×1,163 Posts

Quote:
 Originally Posted by R.D. Silverman I will try using the 64 bit registers. However, the code I gave is not a macro. It is a subroutine. I don't like interweaving _asm with actual C code because you never know when changes in the C code will change register usage --> invalidating the in line assembler. I prefer to make all of my _asm's into subroutines (it is cleaner and less error prone IMO) and let the compiler handle the inlining when possible.
Actually, one of the things I like about VS inline ASM support is that as long as you remain in assembler, there will be no register clobbering ... you only need to save register state when you hit the next high-level code instruction. Again, very handy for rapid prototyping ... once the basic code works, then I do the GCC port, by abstracting the main ASM chunks as macros with suitable clobber lists.
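
To illustrate what such a port involves: in GCC-style extended asm the compiler does not track registers on its own, so outputs, inputs and clobbers have to be declared explicitly. The following is a minimal sketch, not the Mlucas macros: the name `divrem_gcc` is hypothetical, it reuses the division example from earlier in the thread, and it assumes GCC or Clang on x86, with a plain-C fallback elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical GCC-style counterpart of the VS inline-ASM division routine:
 *   d[0] = (a*2^30 + b) / c,  d[1] = (a*2^30 + b) % c
 */
static inline void divrem_gcc(uint32_t a, uint32_t b, uint32_t c, uint32_t d[2])
{
    uint64_t n  = ((uint64_t)a << 30) + b;   /* 64-bit dividend */
    uint32_t lo = (uint32_t)n;               /* goes into eax */
    uint32_t hi = (uint32_t)(n >> 32);       /* goes into edx */
    uint32_t q, r;

    /* div faults on a zero divisor or if the quotient exceeds 32 bits. */
    assert(c != 0 && hi < c);

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("divl %4"
             : "=a"(q), "=d"(r)              /* outputs: eax = quotient, edx = remainder */
             : "0"(lo), "1"(hi), "rm"(c)     /* inputs: edx:eax dividend, divisor in reg/mem */
             : "cc");                        /* clobber: condition flags */
#else
    q = (uint32_t)(n / c);                   /* portable fallback on non-x86 targets */
    r = (uint32_t)(n % c);
#endif

    d[0] = q;
    d[1] = r;
}
```

Here the `"=a"` and `"=d"` constraints pin the results to eax and edx, so no scratch registers need listing; a block that uses xmm registers internally would name them in the clobber list instead.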

@henryzz: Yes, I am contemplating adding a 64-bit assembler to my VS setup, but would like to know if there is a straightforward way to port either my current 32-bit VS inline ASM code or the corresponding GCC macros to 64-bit MASM-style code. (I've only ever written VS-style inline ASM and GCC-style, never MASM).

2010-10-05, 17:35   #9
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

14151₈ Posts

Quote:
 Originally Posted by axn i5 and i7 (and i3) are based off the same microarchitecture. Ergo, same performance per clock cycle. i7 has 4 actual cores (8 HT cores) (don't get me started on the hexacore!), whereas i5 has 4 HT cores (except for i5 720 which has 4 actual cores, but no HT).
To clarify this more:

-For desktops, i5s are quad-cores with no hyperthreading (i.e. 4 actual, 4 virtual cores). For laptops, i5s are dual-cores with hyperthreading (2 actual, 4 virtual).

-For desktops, i7s are quad-cores with hyperthreading (4 actual, 8 virtual). For laptops, I think they are always quad-cores with no hyperthreading (4 actual, 4 virtual); I could be wrong, though, since while what I'm saying holds true for most laptop i7's, there may be some high-end ones available with HT.

2010-10-05, 17:55   #10
axn

Jun 2003

4962₁₀ Posts

Quote:
 Originally Posted by mdettweiler For desktops, i5s are quad-cores with no hyperthreading (i.e. 4 actual, 4 virtual cores).
For desktops, the 7xx series is 4C/4T, while 6xx series is 2C/4T (http://ark.intel.com/ProductCollecti...rketSegment=DT)

2010-10-05, 19:17   #11
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by axn For desktops, the 7xx series is 4C/4T, while 6xx series is 2C/4T (http://ark.intel.com/ProductCollecti...rketSegment=DT)
Hmm...I didn't know there was such a thing as a dual-core i5 for desktops. I thought all the desktop i5's were quads, and that you had to go to i3 if you wanted a dual-core. But from the link you gave, it would almost appear that most of the desktop i5's out there are dual-cores.

(Or is that not the whole list of desktop i5s?)
