mersenneforum.org  

2010-10-04, 19:50   #1
ewmayer
2ω=0
Sep 2002
República de California
10110101101110₂ Posts

Running 32-bit builds on a Win7 system

Just had my old WinXP laptop at work replaced with a spiffy quad-core i7/Win7 one. I sometimes used the old system to test out my latest Mlucas Visual Studio builds (done on my XP notebook at home), to see if any timing changes observed on the home system were (at least qualitatively) reproducible on the work one. (The two systems differ in manufacturer, hardware and chip revision.)

So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow. Some questions:

- Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?

- The 32-bit Windows build uses Visual-studio-style inline assembler for the SSE2 code (for gcc builds I have gcc-style macros, both 32 and 64-bit ... most of the latter attempt to take advantage of the extra 8 xmm registers to boost performance). But Microsoft being Microsoft, Visual Studio does not support 64-bit inline ASM. What is my best option for porting the assembler to a 64-bit windows build? Port the macros to 64-bit MASM, use 64-bit inline ASM with the Intel C compiler, what?

2010-10-04, 21:55   #2
R.D. Silverman
Nov 2003
2²×5×373 Posts

Quote:
Originally Posted by ewmayer
Just had my old WinXP laptop at work replaced with a spiffy quad-core i7/Win7 one. I sometimes used the old system to test out my latest Mlucas Visual Studio builds (done on my XP notebook at home), to see if any timing changes observed on the home system were (at least qualitatively) reproducible on the work one. (The two systems differ in manufacturer, hardware and chip revision.)

So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow. Some questions:

- Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?

Not as far as I know. My NFS code runs at the same speed on my home Win7 box as it does on my 32-bit WinXP box at work. It is compiled as a Win32 console app.

Quote:
- The 32-bit Windows build uses Visual-studio-style inline assembler for the SSE2 code (for gcc builds I have gcc-style macros, both 32 and 64-bit ... most of the latter attempt to take advantage of the extra 8 xmm registers to boost performance). But Microsoft being Microsoft, Visual Studio does not support 64-bit inline ASM.

??? I have 64-bit asms.....

e.g.

Code:
/************************************************************************/
/*																		*/
/*	compute (a*2^30 + b)/c  and (a*2^30 + b) % c						*/
/*	assembler version													*/
/*																		*/
/************************************************************************/


void divrem_asm(unsigned int a, unsigned int b, unsigned int c, unsigned int d[2])
 
{	/* start of divrem_asm */

/* We could use a double-length register from the MMX instruction set;	*/
/* however, the emms instruction must be executed to clean up the		*/
/* FP registers every time we use MMX.  emms is a very lengthy			*/
/* instruction compared to what we are doing here.						*/

	_asm {

/*		edx:eax = (a << 30) + b;										*/

	mov		eax,a
	shl		eax,30
	mov		edx,a
	shr		edx,2
	and		edx,0x3fffffff

	add		eax,b
	adc		edx,0

/*  Now divide (a << 30) + b, which resides in the register pair edx:eax	*/
/*  eax = edx:eax / c													*/
/*  edx = edx:eax % c													*/

	mov		ecx,c

/*  d is a pointer (load moved ahead of the div as a Pentium optimization)	*/
	mov		edi,d
	div		ecx

/*  d[0] = ((a << 30) + b) / c;											*/
	mov		DWORD PTR[edi],eax

/*  d[1] = ((a << 30) + b) % c;											*/
	mov		DWORD PTR[edi]+4,edx

	} // end _asm

}	/* end of divrem_asm */

This one, for example, gets inlined by the compiler.
Do you attach an _inline declaration to your _asm routines?
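
For checking the assembler against plain C, the same quotient and remainder can be sketched with a 64-bit intermediate. This is an illustration only: `divrem_c` is a hypothetical name, not part of the NFS code. The assert mirrors the condition under which `div ecx` does not raise a divide error, namely a nonzero divisor and a quotient that fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable reference for the routine above:
   d[0] = (a*2^30 + b) / c,  d[1] = (a*2^30 + b) % c              */
static void divrem_c(uint32_t a, uint32_t b, uint32_t c, uint32_t d[2])
{
    /* Same 64-bit value the assembler builds in edx:eax. */
    uint64_t n = ((uint64_t)a << 30) + b;

    /* div ecx faults unless c != 0 and the high half of n is below c. */
    assert(c != 0 && (n >> 32) < c);

    d[0] = (uint32_t)(n / c);   /* quotient  (eax after div) */
    d[1] = (uint32_t)(n % c);   /* remainder (edx after div) */
}
```

A 64-bit compiler will typically turn this into a single divide, so it can also serve as a fallback where inline assembler is not available.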

2010-10-04, 22:44   #3
axn
Jun 2003
2·3·827 Posts

Quote:
Originally Posted by ewmayer
So I just tried a quick timing of a 32-bit Mlucas Visual Studio build (this is SSE2-using FFT code) on the new laptop - it runs correctly, but the timings are suspiciously slow.

Suspicious? Like, how? Also, beware of hyperthreading!

Quote:
Originally Posted by ewmayer
Does Win7 use some kind of built-in virtualization to run 32-bit code? If so, should one expect a significant performance hit?

Not virtualization per se (WOW64 thunking is, I believe, the technical term). And no, no significant performance hit.

Quote:
Originally Posted by ewmayer
What is my best option for porting the assembler to a 64-bit windows build? Port the macros to 64-bit MASM, use 64-bit inline ASM with the Intel C compiler, what?

Have you looked at Agner Fog's optimization manuals? They cover in some detail various considerations for 64-bit C++/asm development on x86.
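
One route that sidesteps the inline-ASM restriction is the SSE2 compiler intrinsics from `<emmintrin.h>`, which the 64-bit Microsoft compiler accepts along with GCC and the Intel compiler. The following is a minimal sketch, not Mlucas code: `butterfly_pass`, `HAVE_SSE2` and the interleaved data layout are assumptions made for illustration, and a scalar fallback is included so the file still builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical radix-2 add/sub pass over n doubles (n/2 complex values,
   stored as re,im pairs):  a[i] <- a[i] + b[i],  b[i] <- a[i] - b[i].  */
static void butterfly_pass(double *a, double *b, size_t n)
{
    size_t i;
    assert(n % 2 == 0);         /* two doubles are processed per step */
    for (i = 0; i < n; i += 2) {
#if HAVE_SSE2
        __m128d va = _mm_loadu_pd(a + i);           /* unaligned 128-bit loads */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(a + i, _mm_add_pd(va, vb));   /* packed add (addpd) */
        _mm_storeu_pd(b + i, _mm_sub_pd(va, vb));   /* packed sub (subpd) */
#else
        double re = a[i], im = a[i + 1];
        a[i]     = re + b[i];
        a[i + 1] = im + b[i + 1];
        b[i]     = re - b[i];
        b[i + 1] = im - b[i + 1];
#endif
    }
}
```

The trade-off is that register allocation and instruction scheduling are left to the compiler, so hand-tuned use of the extra 8 xmm registers is no longer under direct control.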

2010-10-05, 16:05   #4
ewmayer
2ω=0
Sep 2002
República de California
2×5×1,163 Posts

I'll try to answer everyone's questions in a single reply:

- "Suspiciously slow" means no faster - in fact 5-10% slower on a cycle-per-cycle basis - than on my i5-based MacBook, which runs a 64-bit gcc build, i.e. uses versions of key macros that attempt to leverage the extra 8 SSE registers. So the 5-10% hit is explainable by the Windows build being 32-bit - guess I was just hoping for a decent per-cycle boost from running the same build on spiffier hardware, is all.

- @Bob: Your macro syntax does not compile for me in Visual Studio - what is your build environment? The simple change of prepending a second _ (i.e. __asm) allows the macro to compile ... but the syntax I use in VS prefixes each assembler instruction with __asm ... I prefer that because I can seamlessly intermingle C code and ASM, which is very handy for the early-prototyping stages of the code development.

I don't add any inline specifiers because VS (I know this based on examining the assembler output resulting from compilation) inlines the macro invocations automatically.

Also, your code sample is 32-bit ASM ... try replacing the 32-bit ints with 64-bits, and e-registers with their 64-bit r-prefixed counterparts.

- @axn: Yes, I have read Agner's excellent manual ... in my case the main decision with respect to 64-bit coding is how to effectively use the extra SSE registers. My basic code development here - since I was a latecomer to inline ASM and SSE - has until recently been focused on simply getting a fully functional SSE2 version of the Mlucas FFT and DWT code working, with most optimizations being decidedly low-tech "eyeball level" ones. Now that the basic SSE code is all in place (I just recently finished the SSE coding of the final 2 front-end radix routines I need to support the full spectrum of non-power-of-2 FFT lengths the scalar Mlucas code does, i.e. each power-of-2 interval subdivided into 8 equal-stride subdivisions), it looks like it's high time to get down to some serious instruction-level code profiling. For example: based on a rough FFT opcount, my code is only yielding 50-60% saturation of the 128-bit adders. There's a big potential speedup to be had there, but it remains to be seen whether the suboptimality is due to memory (load/store/cache-miss) issues or add-port saturation (there are some subtle things that can occur here), or both.

2010-10-05, 16:28   #5
axn
Jun 2003
2·3·827 Posts

Quote:
Originally Posted by ewmayer
"Suspiciously slow" means no faster - in fact 5-10% slower on a cycle-per-cycle basis - than on my i5-based MacBook, which runs a 64-bit gcc build, i.e. uses versions of key macros that attempt to leverage the extra 8 SSE registers. So the 5-10% hit is explainable by the Windows build being 32-bit - guess I was just hoping for a decent per-cycle boost from running the same build on spiffier hardware, is all.

i5 and i7 (and i3) are based on the same microarchitecture. Ergo, same performance per clock cycle. The i7 has 4 actual cores (8 HT cores) (don't get me started on the hexacore!), whereas the i5 has 4 HT cores (except for the i5 720, which has 4 actual cores but no HT).

Last fiddled with by axn on 2010-10-05 at 16:31

2010-10-05, 16:43   #6
R.D. Silverman
Nov 2003
2²·5·373 Posts

Quote:
Originally Posted by ewmayer
- @Bob: Your macro syntax does not compile for me in Visual Studio - what is your build environment? The simple change of prepending a second _ (i.e. __asm) allows the macro to compile ... but the syntax I use in VS prefixes each assembler instruction with __asm ... I prefer that because I can seamlessly intermingle C code and ASM, which is very handy for the early-prototyping stages of the code development.

I don't add any inline specifiers because VS (I know this based on examining the assembler output resulting from compilation) inlines the macro invocations automatically.

Also, your code sample is 32-bit ASM ... try replacing the 32-bit ints with 64-bits, and e-registers with their 64-bit r-prefixed counterparts.

I will try using the 64-bit registers.

However, the code I gave is not a macro. It is a subroutine. I don't like interweaving _asm with actual C code because you never know when changes in the C code will change register usage, invalidating the inline assembler. I prefer to make all of my _asm's into subroutines (it is cleaner and less error-prone IMO) and let the compiler handle the inlining when possible.

2010-10-05, 16:45   #7
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2×5×587 Posts

AFAIK MPIR, msieve, etc. use things like yasm to add support for 64-bit asm to Visual Studio.

2010-10-05, 17:06   #8
ewmayer
2ω=0
Sep 2002
República de California
2×5×1,163 Posts

Quote:
Originally Posted by R.D. Silverman
I will try using the 64-bit registers.

However, the code I gave is not a macro. It is a subroutine. I don't like interweaving _asm with actual C code because you never know when changes in the C code will change register usage, invalidating the inline assembler. I prefer to make all of my _asm's into subroutines (it is cleaner and less error-prone IMO) and let the compiler handle the inlining when possible.

Actually, one of the things I like about VS inline ASM support is that as long as you remain in assembler, there will be no register clobbering ... you only need to save register state when you hit the next high-level code instruction. Again, very handy for rapid prototyping ... once the basic code works, I then do the GCC port by abstracting the main ASM chunks as macros with suitable clobber lists.
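
A minimal sketch of what a GCC-style macro with a clobber list can look like. The names `SSE2_ADDSUB` and `addsub_pair` are hypothetical and this is not one of the Mlucas macros; a plain-C fallback is included for compilers or targets without GCC-style SSE2 inline assembler.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && defined(__SSE2__)
/* Hypothetical macro: a <- a + b, b <- a - b on one pair of packed doubles.
   The clobber list names every xmm register the block overwrites, plus
   "memory" because the block reads and writes through the two pointers. */
#define SSE2_ADDSUB(pa, pb)                     \
    __asm__ volatile (                          \
        "movupd (%0), %%xmm0      \n\t"         \
        "movupd (%1), %%xmm1      \n\t"         \
        "movapd %%xmm0, %%xmm2    \n\t"         \
        "addpd  %%xmm1, %%xmm0    \n\t"         \
        "subpd  %%xmm1, %%xmm2    \n\t"         \
        "movupd %%xmm0, (%0)      \n\t"         \
        "movupd %%xmm2, (%1)      \n\t"         \
        : /* no register outputs */             \
        : "r"(pa), "r"(pb)                      \
        : "xmm0", "xmm1", "xmm2", "memory")
#else
/* Plain-C fallback with the same effect. */
#define SSE2_ADDSUB(pa, pb) do {                \
        double *a_ = (pa), *b_ = (pb);          \
        double t0 = a_[0], t1 = a_[1];          \
        a_[0] = t0 + b_[0]; a_[1] = t1 + b_[1]; \
        b_[0] = t0 - b_[0]; b_[1] = t1 - b_[1]; \
    } while (0)
#endif

/* Thin wrapper so the macro can be called like an ordinary function. */
static void addsub_pair(double *a, double *b)
{
    assert(a != NULL && b != NULL);
    SSE2_ADDSUB(a, b);
}
```

Unlike the VS form, where the compiler works out register usage from the instructions themselves, here the compiler only knows what the constraint and clobber lists state, so a register missing from the list can silently corrupt surrounding C code.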

@henryzz: Yes, I am contemplating adding a 64-bit assembler to my VS setup, but would like to know if there is a straightforward way to port either my current 32-bit VS inline ASM code or the corresponding GCC macros to 64-bit MASM-style code. (I've only ever written VS-style inline ASM and GCC-style, never MASM).

2010-10-05, 17:35   #9
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
14151₈ Posts

Quote:
Originally Posted by axn
i5 and i7 (and i3) are based on the same microarchitecture. Ergo, same performance per clock cycle. The i7 has 4 actual cores (8 HT cores) (don't get me started on the hexacore!), whereas the i5 has 4 HT cores (except for the i5 720, which has 4 actual cores but no HT).

To clarify this more:

-For desktops, i5s are quad-cores with no hyperthreading (i.e. 4 actual, 4 virtual cores). For laptops, i5s are dual-cores with hyperthreading (2 actual, 4 virtual).

-For desktops, i7s are quad-cores with hyperthreading (4 actual, 8 virtual). For laptops, I think they are always quad-cores with no hyperthreading (4 actual, 4 virtual); I could be wrong, though, since while what I'm saying holds true for most laptop i7's, there may be some high-end ones available with HT.

2010-10-05, 17:55   #10
axn
Jun 2003
4962₁₀ Posts

Quote:
Originally Posted by mdettweiler
For desktops, i5s are quad-cores with no hyperthreading (i.e. 4 actual, 4 virtual cores).

For desktops, the 7xx series is 4C/4T, while the 6xx series is 2C/4T (http://ark.intel.com/ProductCollecti...rketSegment=DT)

2010-10-05, 19:17   #11
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by axn
For desktops, the 7xx series is 4C/4T, while the 6xx series is 2C/4T (http://ark.intel.com/ProductCollecti...rketSegment=DT)

Hmm...I didn't know there was such a thing as dual-core i5s for desktops. I thought all the desktop i5s were quads, and that you had to go to i3 if you wanted a dual-core. But from the link you gave, it would almost appear that most of the desktop i5s out there are dual-cores.

(Or is that not the whole list of desktop i5s?)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.