#12

Aug 2002
1000001101₂ Posts
Did you really mean to do this?
Code:
# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(strip $(ARCH)),x86-intel)
DEBUG_CFLAGS=-g march=i586 --mtune=i686
instead of
Code:
# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(ARCH),x86-intel)
DEBUG_CFLAGS=-g -march=i586 -mtune=pentium2
#13

Aug 2002
3·5²·7 Posts
#14

Aug 2002
3×5²×7 Posts
Another question or two
Code:
#define SETUP_LADDER_MAX_RUNG_DENSITY 0.4
vs
Code:
#define SETUP_LADDER_MAX_RUNG_DENSITY 0.6
And why the change from function to macro for the vec4 routines?
#15

Mar 2003
New Zealand
1157₁₀ Posts
Quote:
If the density is lower than this setting then an addition ladder will be used to fill just the needed entries. If greater, then all entries get filled using SSE2. The speed of the SSE2 code increased, but the ladder code doesn't use SSE2, so it becomes faster to fill all values at a lower density than to use the ladder.

I don't know exactly what the correct breakpoint should be, but when I tested with the 19k SoB.dat (which has a density of 0.43?) it was faster to fill every entry with SSE2 than to use the ladder. When I tested with the 8k SoB.dat (density 0.25) it was faster to use the ladder. Of course this will be different on machines where the relative speeds of the VEC4_MULMOD64_NEXT() code and the ladder code are different.

For riesel.dat and sr5data.txt the densities are much higher due to the large number of k values, so the ladder code never gets used.

Quote:
If the macro is broken up into a number of separate asm() statements then the difference could be due to late evaluation of macro arguments, but the better code is generated even when the macro consists of a single asm() statement.
#16

Mar 2003
New Zealand
13·89 Posts
Quote:
edit: the `march' instead of `-march' is a typo I think? It should be -march=i586 -mtune=i686. The DEBUG_CFLAGS are not used in the released binaries in any case.

Last fiddled with by geoff on 2007-04-26 at 05:36
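For reference, the corrected block would read something like the following. This is a sketch of the fix described, not a quote from the actual Makefile (the `endif` placement is assumed):

```
# Intel 32-bit. Run on any Pentium, tune for Pentium 2/3, SSE2 for Pentium 4.
#
ifeq ($(strip $(ARCH)),x86-intel)
DEBUG_CFLAGS=-g -march=i586 -mtune=i686
endif
```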
#17

Aug 2002
1015₈ Posts
Thanks for the replies. The reason for my request for a specific source version, and for my questions, was to see if I could find any explanation for the observed speeds.
I "diffed" all the tars from 1.4.34 to 1.4.39 to see what could be causing the speed differences. In addition to your posted speeds, I ran 1G of Riesel for each of the following: Code:
sr2sieve 1.4.19 -- 206754 p/sec
sr2sieve 1.4.39 -- 247491 p/sec
sr2sieve 1.4.34 -- 225192 p/sec
sr2sieve 1.4.37 -- 254533 p/sec
#18

Mar 2003
New Zealand
13×89 Posts
Just an update to this thread:
The main problem slowing down the SSE2 mulmod code was the mixing of 16-byte SSE2 reads/writes (movdqa) with 8-byte FPU reads/writes (fildll/fistpll). This is a problem on all x86 machines, but it is especially bad on some Pentium 4 models.

One solution is to move the reads and writes further apart, which is the approach I have taken with sr2sieve. Another solution in some cases might be to read and write the SSE2 registers in two 8-byte halves.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Slow Transfer Speeds with Windows 7 | pinhodecarlos | Soap Box | 2 | 2016-10-19 19:52 |
| Core i7 memory speeds | nucleon | Hardware | 9 | 2009-09-17 11:47 |
| LL tests running at different speeds | GARYP166 | Information & Answers | 11 | 2009-07-13 19:39 |
| sieving speeds for Intels | jasong | Sierpinski/Riesel Base 5 | 11 | 2007-08-09 00:15 |
| Factoring Speeds | Khemikal796 | Lone Mersenne Hunters | 5 | 2005-04-26 20:28 |