mersenneforum.org  

2007-04-22, 05:53   #12
Joe O
Aug 2002
1000001101₂ Posts

Did you really mean to do this?

# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(strip $(ARCH)),x86-intel)
DEBUG_CFLAGS=-g march=i586 --mtune=i686

instead of

# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(ARCH),x86-intel)
DEBUG_CFLAGS=-g -march=i586 -mtune=pentium2

2007-04-22, 05:54   #13
Joe O
Aug 2002
3·5²·7 Posts

Quote:
Originally Posted by geoff
Done.

Thanks.

2007-04-22, 06:27   #14
Joe O
Aug 2002
3×5²×7 Posts

Another question or two

#define SETUP_LADDER_MAX_RUNG_DENSITY 0.4

vs

#define SETUP_LADDER_MAX_RUNG_DENSITY 0.6

And why the change from function to macro for the vec4 routines?

2007-04-26, 05:21   #15
geoff
Mar 2003
New Zealand
1157₁₀ Posts

Quote:
Originally Posted by Joe O
Another question or two

#define SETUP_LADDER_MAX_RUNG_DENSITY 0.4

vs

#define SETUP_LADDER_MAX_RUNG_DENSITY 0.6

In setup64() a certain array has to be filled with entries x, x^2, ..., x^n (mod p), but not all entries are actually used. The density is the ratio of needed entries to total entries.

If the density is lower than this setting then an addition ladder will be used to fill just the needed entries. If greater, then all entries get filled using SSE2. The speed of the SSE2 code increased, but the ladder code doesn't use SSE2, so it becomes faster to fill all values at a lower density than to use the ladder.

I don't know exactly what the correct breakpoint should be, but when I tested with the 19k SoB.dat (which has a density of 0.43?) it was faster to fill every entry with SSE2 than to use the ladder. When I tested with the 8k SoB.dat (density 0.25) it was faster to use the ladder. Of course this will be different on machines where the relative speeds of the VEC4_MULMOD64_NEXT() code and the ladder code differ. For riesel.dat and sr5data.txt the densities are much higher due to the large number of k values, so the ladder code never gets used.
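
To make the breakpoint concrete, here is a minimal sketch in plain C. It is not the sr2sieve code: `rung_density()`, `use_ladder()` and `fill_all_powers()` are hypothetical names, the dense fill uses ordinary 32-bit arithmetic instead of SSE2, and the ladder itself is not shown. The value 0.4 is simply one of the two settings quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One of the two values quoted in the thread; the best value is machine-dependent. */
#define SETUP_LADDER_MAX_RUNG_DENSITY 0.4

/* Density = needed entries / total entries.
 * needed[i] != 0 means entry x^(i+1) is actually used. */
static double rung_density(const unsigned char *needed, size_t n)
{
    size_t used = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (needed[i])
            used++;
    return (double)used / (double)n;
}

/* Nonzero: fill only the needed entries with a ladder.
 * Zero: fill every entry (the SSE2 path in the real program). */
static int use_ladder(double density)
{
    return density < SETUP_LADDER_MAX_RUNG_DENSITY;
}

/* Dense path, scalar stand-in: out[i] = x^(i+1) mod p for every i.
 * p fits in 32 bits, so each product fits in 64 bits. */
static void fill_all_powers(uint64_t *out, size_t n, uint32_t x, uint32_t p)
{
    uint64_t cur = 1;
    assert(p > 1);
    for (size_t i = 0; i < n; i++) {
        cur = (cur * (x % p)) % p;
        out[i] = cur;
    }
}
```

With these definitions a density of 0.25 picks the ladder and 0.43 picks the full fill, which matches the two SoB.dat tests mentioned above.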

Quote:
And why the change from function to macro for the vec4 routines?

In principle a static inline function should be the same as a macro, but GCC seems to generate better code from the macro, sometimes much better. I don't really know why this is, but it seems to happen more with GCC 4.1 than earlier versions.

If the macro is broken up into a number of separate asm() statements then the difference could be due to late evaluation of macro arguments, but the better code is generated even when the macro consists of a single asm() statement.
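
For readers unfamiliar with the two forms being compared, here is a small illustration. It is only a sketch: `mulmod32()` and `MULMOD32()` are hypothetical names, and the body is plain C, whereas the real vec4 routines are asm() statements. Both forms compute the same value; any difference would be in the code the compiler generates around them.

```c
#include <assert.h>
#include <stdint.h>

/* Function form: arguments are evaluated once and converted to uint32_t. */
static inline uint32_t mulmod32(uint32_t a, uint32_t b, uint32_t p)
{
    assert(p != 0);
    return (uint32_t)(((uint64_t)a * b) % p);
}

/* Macro form of the same computation: the expression is pasted into the
 * caller, so each argument must be parenthesised. Each argument appears
 * exactly once here, so side effects are not duplicated. */
#define MULMOD32(a, b, p) \
    ((uint32_t)(((uint64_t)(a) * (uint64_t)(b)) % (uint64_t)(p)))
```

Comparing the output of `gcc -O2 -S` for a loop that uses one form and then the other is one way to see whether a given GCC version treats them differently.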

2007-04-26, 05:30   #16
geoff
Mar 2003
New Zealand
13·89 Posts

Quote:
Originally Posted by Joe O
Did you really mean to do this?

# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(strip $(ARCH)),x86-intel)
DEBUG_CFLAGS=-g march=i586 --mtune=i686

instead of

# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(ARCH),x86-intel)
DEBUG_CFLAGS=-g -march=i586 -mtune=pentium2

Yes, this is consistent with earlier versions which used -march=i686, although I don't think there is any difference between pentium2 and i686 in practice. I want the binaries to run on machines that don't have the CMOV instructions, but otherwise to be optimised for more recent machines.

edit: the `march' instead of `-march' is a typo I think? It should be -march=i586 -mtune=i686. The DEBUG_CFLAGS are not used in the released binaries in any case.
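
As a side note on the CMOV point: -march sets which instructions the compiler may emit, while -mtune only affects scheduling. The sketch below shows one way a program could check at run time whether the CPU reports CMOV. It is an illustration only, not something sr2sieve does as far as this thread says: `cpu_has_cmov()` and `build_is_safe_here()` are hypothetical names, and the check relies on the GCC/Clang builtin `__builtin_cpu_supports`, which is only available on x86 targets.

```c
#include <assert.h>

/* Returns 1 if the CPU reports CMOV, 0 if it does not,
 * and -1 if this build has no way to ask (non-x86 or other compiler). */
static int cpu_has_cmov(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("cmov") ? 1 : 0;
#else
    return -1;
#endif
}

/* needs_cmov: 1 for a build that may contain CMOV (e.g. -march=i686),
 * 0 for a build that cannot (e.g. -march=i586).
 * Returns nonzero if such a build is known to be safe on this CPU. */
static int build_is_safe_here(int needs_cmov)
{
    assert(needs_cmov == 0 || needs_cmov == 1);
    return !needs_cmov || cpu_has_cmov() == 1;
}
```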

Last fiddled with by geoff on 2007-04-26 at 05:36

2007-04-26, 18:04   #17
Joe O
Aug 2002
1015₈ Posts

Thanks for the replies. I asked for a specific source version, and asked my questions, to see whether I could find reasons for the speeds I observed.
I "diffed" all the tars from 1.4.34 to 1.4.39 to see what could be causing the speed differences. In addition to your posted speeds, I ran 1G of Riesel for each of the following:
Code:
sr2sieve 1.4.19 -- 206754 p/sec
sr2sieve 1.4.39 -- 247491 p/sec
sr2sieve 1.4.34 -- 225192 p/sec
sr2sieve 1.4.37 -- 254533 p/sec
followed by 5G on the two fastest. The results for the 5G runs were consistent with the above. This was on an AMD64 @ 2400 MHz under a constant light load. The 5G runs were overnight on an otherwise idle machine.
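
To put the 1G figures on a common scale, the small sketch below divides each rate by a chosen baseline. The table simply repeats the numbers posted above; `relative_speed()` and `riesel_1g` are names made up for this illustration.

```c
#include <assert.h>

struct run {
    const char *version;
    double p_per_sec;
};

/* 1G Riesel figures exactly as posted above. */
static const struct run riesel_1g[] = {
    { "1.4.19", 206754.0 },
    { "1.4.39", 247491.0 },
    { "1.4.34", 225192.0 },
    { "1.4.37", 254533.0 },
};

/* Throughput relative to a baseline: 1.10 means 10% more p/sec. */
static double relative_speed(double rate, double baseline)
{
    assert(baseline > 0.0);
    return rate / baseline;
}
```

On these numbers the fastest run (1.4.37) is about 23% ahead of the slowest and about 3% ahead of 1.4.39.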

2007-06-07, 23:08   #18
geoff
Mar 2003
New Zealand
13×89 Posts

Just an update to this thread:

The main problem slowing down the SSE2 mulmod code was the mixing of 16 byte SSE2 read/writes (movdqa) with 8 byte FPU read/writes (fildll/fistpll). This is a problem on all x86 machines, but it is especially bad on some Pentium 4 models.

One solution is to move the reads and writes further apart, which is the approach I have taken with sr2sieve. Another solution in some cases might be to read and write the SSE2 registers in two 8 byte halves.
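
The second idea, accessing an SSE2 register in two 8-byte halves, could look roughly like the following when written with intrinsics. This is a sketch under assumptions, not the sr2sieve code: `copy128_in_halves()` is a hypothetical helper, and a compiler is free to reorder or merge these accesses, so code that depends on the exact access pattern would more likely use asm() statements. On builds without SSE2 the helper falls back to memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Move 16 bytes through an SSE2 register while touching memory only in
 * 8-byte pieces, so that each access has the same width as an 8-byte
 * FPU load or store (fildll/fistpll) of the same location. */
static void copy128_in_halves(uint64_t dst[2], const uint64_t src[2])
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    /* Two 8-byte loads, combined into one 128-bit register. */
    __m128i lo = _mm_loadl_epi64((const __m128i *)&src[0]);
    __m128i hi = _mm_loadl_epi64((const __m128i *)&src[1]);
    __m128i v  = _mm_unpacklo_epi64(lo, hi);

    /* Two 8-byte stores: low half, then high half moved down. */
    _mm_storel_epi64((__m128i *)&dst[0], v);
    _mm_storel_epi64((__m128i *)&dst[1], _mm_unpackhi_epi64(v, v));
#else
    memcpy(dst, src, 2 * sizeof(uint64_t));
#endif
}
```

Whether this actually helps depends on the CPU model, so it would need to be timed against the approach of moving the reads and writes further apart.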
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.