mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Sierpinski/Riesel Base 5 (https://www.mersenneforum.org/forumdisplay.php?f=54)
-   -   A multiple k/c sieve for Sierpinski/Riesel problems (https://www.mersenneforum.org/showthread.php?t=5785)

Cruelty 2007-05-08 05:50

C2D E6600 @ 3GHz[code]length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 213.333 million mulmods per second.
CMOV: 160.000 million mulmods per second.
SSE2: 213.333 million mulmods per second.[/code]

Cruelty 2007-05-08 07:24

Using [b][I]mulmod8[/I][/b] on 1GHz P3 under Windows 2000 Pro, I get a [I]drwtsn32.exe[/I] error "mulmod8.exe has caused an error and needs to close..."

Cruelty 2007-05-08 09:14

Upgrading sr2sieve.win32 from 1.4.39 to 1.4.42 on a 1GHz P3 gave a ~5% speed increase (291kp/s to 306kp/s) :tu:

ET_ 2007-05-08 09:53

[QUOTE=geoff;105481]
I have put a benchmark program mulmod8.zip [url=http://www.geocities.com/g_w_reynolds/sr5sieve/testing/]here[/url]. It measures the speed of the SSE2 vs non-SSE2 mulmods. It would be interesting to see the results for some other machines. Here are results for my 2.9GHz Northwood P4:
[code]
length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 57.812 million mulmods per second.
CMOV: 54.953 million mulmods per second.
SSE2: 106.400 million mulmods per second.
[/code][/QUOTE]

Mobile Intel Pentium4 CPU 2.66 GHz
CPU features: RDTSC, CMOV, Prefetch, MMX, SSE, SSE2
L1 cache size: 8 KB
L2 cache size: 512 KB

[code]
C:\Documents and Settings\Administrator\Desktop>mulmod8
length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 53.333 million mulmods per second.
CMOV: 53.333 million mulmods per second.
SSE2: 91.429 million mulmods per second.
[/code]

Should I recompile the source code?

Luigi

em99010pepe 2007-05-08 11:55

AMD 64 3000+ (2 GHz)

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 128.000 million mulmods per second.
CMOV: 106.667 million mulmods per second.
SSE2: 91.429 million mulmods per second.

Flatlander 2007-05-08 12:52

C2D E4300 @ 3013MHz

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 213.333 million mulmods per second.
CMOV: 160.000 million mulmods per second.
SSE2: 213.333 million mulmods per second.

Joe O 2007-05-08 13:36

AMD64 3300+ (2.4 GHz)



length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 160.000 million mulmods per second.
CMOV: 128.000 million mulmods per second.
SSE2: 106.667 million mulmods per second.

Joe O 2007-05-08 15:37

Intel P4 2.4GHz

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 45.714 million mulmods per second.
CMOV: 45.714 million mulmods per second.
SSE2: 91.429 million mulmods per second.

Joe O 2007-05-09 03:15

So what does the first line represent?

Here are a few more timings:

[CODE]AMD64X2 5200+ (2.6 GHz)

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 213.333 million mulmods per second.
CMOV: 128.000 million mulmods per second.
SSE2: 128.000 million mulmods per second.

________________________________________

2.8 GHz Prescott

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 71.111 million mulmods per second.
CMOV: 71.111 million mulmods per second.
SSE2: 71.111 million mulmods per second.
________________________________________

3.0 GHz Prescott

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 53.333 million mulmods per second.
CMOV: 42.667 million mulmods per second.
SSE2: 71.111 million mulmods per second.

________________________________________

2.4 GHz Prescott

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 71.111 million mulmods per second.
CMOV: 58.182 million mulmods per second.
SSE2: 71.111 million mulmods per second.

________________________________________

2.8 GHz Prescott

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 80.000 million mulmods per second.
CMOV: 64.000 million mulmods per second.
SSE2: 80.000 million mulmods per second.

________________________________________

3.0 GHz Prescott

length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 80.000 million mulmods per second.
CMOV: 80.000 million mulmods per second.
SSE2: 80.000 million mulmods per second.

________________________________________


[/CODE]

Cruelty 2007-05-10 07:13

CeleronM 1.5GHz [code]length = 1000, iterations = 10000, b = 2, p = 4611686018427387817:
: 64.000 million mulmods per second.
CMOV: 53.333 million mulmods per second.
SSE2: 49.231 million mulmods per second.[/code]I am also wondering what the first result means. Anyway, it seems that both Athlon64 and PentiumM run faster without SSE2 enabled.
I have confirmed it using win32.sr2sieve: on the above CeleronM I get ~491kp/s (no sse2) vs. ~431kp/s (sse2), a ~14% increase.

geoff 2007-05-10 23:09

[QUOTE=Cruelty;105683]I am also wondering what the first result means. Anyway, it seems that both Athlon64 and PentiumM run faster without SSE2 enabled.
I have confirmed it using win32.sr2sieve: on the above CeleronM I get ~491kp/s (no sse2) vs. ~431kp/s (sse2), a ~14% increase.[/QUOTE]

The first result is the plain x86 code, with no CMOV or SSE2 instructions used. (The CMOV code is slower because the branches that were replaced by conditional moves were being correctly predicted; CMOV is only faster when branches are often mispredicted.)

Thanks everyone for the benchmarks. It looks like there were not enough iterations to get a precise result on fast machines, but the results are still useful. (You can run it as ./mulmod8 1000 100000; the larger second argument is the number of iterations, so increase it to get a more precise result.)
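
As a rough check on the precision point: most of the figures posted in this thread fit 10,000,000 mulmods (length × iterations) divided by a whole number of 1/64-second timer ticks. The 1/64 s tick is an inference from the posted numbers, not something confirmed by the mulmod8 source, and the function name `reported_rate` is made up for illustration. A minimal Python sketch under that assumption:

```python
# Assumption: the benchmark's timer advances in 1/64 s steps. This is
# inferred from the figures posted in the thread, not taken from mulmod8.
TICK_SECONDS = 1 / 64
LENGTH = 1000
ITERATIONS = 10000


def reported_rate(ticks, length=LENGTH, iterations=ITERATIONS):
    """Million mulmods per second if the run is measured as `ticks` timer ticks."""
    mulmods = length * iterations
    return mulmods / (ticks * TICK_SECONDS) / 1e6


# Rates that a run of 3 to 7 ticks can report with the default arguments.
for ticks in range(3, 8):
    print(f"{ticks} ticks: {reported_rate(ticks):.3f} million mulmods per second")

# With ten times the iterations, the same speed spans about ten times as
# many ticks, so neighbouring representable rates sit much closer together.
print(f"{reported_rate(30, iterations=100000):.3f} vs "
      f"{reported_rate(31, iterations=100000):.3f}")
```

If the assumption holds, a fast machine finishing in 3 or 4 ticks can only report 213.333 or 160.000, which would explain why so many different CPUs show identical figures.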

I am working on a new version which will incorporate all the mulmod routines from versions 1.4.34, 1.4.39, and 1.4.40, do some benchmarks before sieving starts (using the rdtsc instruction for more precise results), and then choose the fastest one to use for the actual sieving. I'll include the non-SSE2 routines in the benchmarks so that they will be used if faster.
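
The benchmark-then-select idea described above could look roughly like the following Python sketch. It is not the sr2sieve code: the two mulmod variants and the names `benchmark` and `pick_fastest` are hypothetical stand-ins, and `time.perf_counter_ns` takes the place of the rdtsc instruction.

```python
import time

P = 4611686018427387817  # the p value shown in the benchmark output above


def mulmod_plain(a, b, p):
    """Reference modular multiplication."""
    return (a * b) % p


def mulmod_divmod(a, b, p):
    """Hypothetical alternative routine; must return the same result."""
    return divmod(a * b, p)[1]


def benchmark(routine, length=1000, iterations=100, b=2, p=P):
    """Return elapsed nanoseconds for length * iterations calls of routine."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        x = 1
        for _ in range(length):
            x = routine(x, b, p)
    return time.perf_counter_ns() - start


def pick_fastest(routines, repeats=3):
    """Time each candidate a few times and return the one with the best run."""
    best_time = {r: min(benchmark(r) for _ in range(repeats)) for r in routines}
    return min(routines, key=best_time.get)


# Choose once at startup, then use the winner for the real work.
chosen = pick_fastest([mulmod_plain, mulmod_divmod])
print("using", chosen.__name__)
```

Taking the minimum over a few repeats is one way to reduce noise from other processes; a real implementation might prefer a different statistic.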

The basic problem that needs to be solved is that some machines suffer a large penalty when loading an SSE2 register from a memory location too soon after the FPU has stored a result there. But the ideal interval between writing and reading seems to be different on every model, so there is no way to write just one routine that will work well for everyone.

