![]() |
SSE2 sieve speeds
These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
[code] proth_sieve_sse2 0.42 190 kp/s JJsieveSSE2 PRS52 v0.102b 192 kp/s sr2sieve-intel 1.4.39 266 kp/s [/code] I don't have access to any SSE2 machines other than 32-bit P4 and P4/Celerons for testing, so I would be very interested to see comparisons for Core2 or Athlon64 (in 32-bit mode). To run a benchmark, download sr2sieve-1.4.39-windows-x86.zip [url=http://www.geocities.com/g_w_reynolds/sr2sieve/]here[/url] and in the same directory as SoB.dat run `sr2sieve -s -p100e12 -P101e12'. My understanding is that proth_sieve and JJsieve have a better algorithm than sr2sieve, and that the speed difference is due to the SSE2 assembler used for modular multiplications. If anyone is interested in this code it is in the asm-i386-gcc.h file in sr5sieve-1.4.39.tar.gz [url=http://www.geocities.com/g_w_reynolds/sr5sieve/]here[/url]. (see the VEC4_* macros). I think there is still room for improvement, perhaps by interleaving 8 mulmods per pass instead of 4, or finding some way to use the SSE2 floating point unit instead of the FPU. |
I will run the bench on my 1.83GHz Core2Duo as soon as I get home. I can run it with clockspeed throttled as well (both internal and external), if you are interested. H.
|
Opteron 165 @ 2840 MHz, combined .dat
sr2sieve-amd.exe sr2sieve started: 991 <= n <= 49999997, 100000000000000 <= p <= 101000000000000 p=100000103547551, 504004 p/sec, 0 factors, 0.01% done, ETA 14 May 11:54 sr2sieve stopped: at p=100000132329451 because SIGINT was received. Found factors for 0 terms in 296.922 cpu sec. (expected about 0.10) sr2sieve-intel.exe sr2sieve started: 991 <= n <= 49999997, 100000132329451 <= p <= 101000000000000 p=100000253047711, 504963 p/sec, 0 factors, 0.03% done, ETA 10 May 16:24 sr2sieve stopped: at p=100000272376483 because SIGINT was received. Found factors for 0 terms in 575.375 cpu sec. (expected about 0.20) JJsieveSSE2 pmin=100000024999967 @ 511kp/s pmin=100000049999999 @ 536kp/s pmin=100000074999941 @ 536kp/s pmin=100000099999937 @ 535kp/s pmin=100000124999969 @ 536kp/s I set the priority to realtime in the taskmanager after I started the siever. |
Core2Duo, 1.83 GHz. One bar of 1GB only.
[CODE] Intel AMD One core alone 426 416 425 417 425 417 Two cores 418/421 409/413 416/419 408/411 409/413 Competing against 418 408 each other 410 419 [/CODE] If you still have questions, I'll be glad to help. |
Thanks KriZp and hhh for the benchmarks :-)
I have also tested on a P4 Celeron D (16Kb L1, 256Kb L2), I find it hard to get repeatable benchmarks for either program but on average the sr2sieve times are no slower than the JJsieve ones. Just the fact that sr2sieve gets within 10% of JJsieve without the benefit of the SPH algorithm suggests that there might be something to gain if the SSE2 assembler from sr2sieve could be added to JJsieve. The SSE2 routines are used in two situations: 1. An array A is filled with elements A[0] = x (mod p), A[1] = x^2 (mod p) ... A[n] = x^n (mod p). 2. Each element in an existing array A is multiplied by a constant b (mod p). |
Great work Geoff.
I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum: [url]http://www.free-dc.org/forum/showthread.php?t=10262[/url] The "problem" with SPH is that you have to find, and keep track of factors of p-1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it. I need to go back and see if just looking at powers of 2 in p-1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required. |
[QUOTE=Greenbank;103926]I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum:
[url]http://www.free-dc.org/forum/showthread.php?t=10262[/url] [/QUOTE] That is an interesting thread, thanks for the pointer. I see there that proth_sieve uses a fixed size giant step, and so doesn't need to compute every baby step. I guess this means that the SSE2 routine to fill an array with x,x^2,...,x^n (mod p) may not be useful. sr2sieve uses a variable number of giant and baby steps and so every baby step needs to be computed, which is the main place that routine is used. I have found that on the P4, computing every element of the array x,x^2,...,x^n (mod p) with SSE2 is faster than computing 40% of the elements using an addition ladder without SSE2, but that will be different for different machines. [QUOTE]The "problem" with SPH is that you have to find, and keep track of factors of p-1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it. I need to go back and see if just looking at powers of 2 in p-1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required.[/QUOTE] That is one of the problems I had when I was looking at how to implement SPH: I haven't been able to think of any way to find the needed factors of p-1 fast enough to make it feasible. sr2sieve only uses small factors of p-1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p-1, but since only 3,4,5,8,9,16-power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720. |
[QUOTE=geoff;103984]
sr2sieve only uses small factors of p-1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p-1, but since only 3,4,5,8,9,16-power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720.[/QUOTE] One way to get over this is to look for primes that are p=x (mod 720) in the given range and then go back and look at primes p=x+1 (mod 720) and so on... Do you think this will be faster? For prime=2, I think you are using the quadratic reciprocity law. Note, that these laws exist for higher powers also and can be used to replace the SPH algorithm. Google "Rational reciprocity laws".:smile: [url]http://mathworld.wolfram.com/ReciprocityTheorem.html[/url] (A good place to start) |
[QUOTE=geoff;103848]These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
[code] proth_sieve_sse2 0.42 190 kp/s JJsieveSSE2 PRS52 v0.102b 192 kp/s sr2sieve-intel 1.4.39 266 kp/s [/code] [/QUOTE] While these times are accurate for the Northwood P4, it appears that they may not be valid for other P4 models. Sorry for jumping to conclusions. I am going to try to work out exactly what is making the code run so much faster on Northwood processors than others. |
Could you make the 1.4.34.tar.gz source available?
|
[QUOTE=Joe O;104198]Could you make the 1.4.34.tar.gz source available?[/QUOTE]
Done. |
| All times are UTC. The time now is 13:39. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.