mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Sierpinski/Riesel Base 5 (https://www.mersenneforum.org/forumdisplay.php?f=54)
-   -   A multiple k/c sieve for Sierpinski/Riesel problems (https://www.mersenneforum.org/showthread.php?t=5785)

geoff 2007-04-07 03:11

sr1sieve 1.0.20, sr2sieve 1.4.37
 
These versions have improved SSE2 assembler for 32-bit machines: modular multiplications are done four at a time (interleaved in pairs) in an attempt to keep the long Pentium 4 pipeline full. Hopefully it will be a little faster for Core2 and Athlon64 machines (in 32-bit compatability mode) as well.

The sr2sieve binaries are now compiled to allow sieving in any base, not
just base 2.

geoff 2007-04-17 00:34

sr1sieve 1.0.22, sr2sieve 1.4.39
 
It seems the cache size detection has never worked on AMD machines, because I had left a zero off the CPUID parameters :-(.

It really should be working now.

geoff 2007-04-19 04:34

It appears that the speed gains for P4 machines in versions 1.4.37 and later could be at the expense of other SSE2 machines. If anyone finds that versions after sr[25]sieve 1.4.34 or sr1sieve 1.0.19 are slower on Core2, Athlon64 or pentium M, then please let me know so I can fix it.

masser 2007-04-19 19:14

On the following machine:

Intel (R) Xeon(TM) CPU 3.40GHz
CPU speed: 3400.30 MHz
CPU features: RDTSC, CMOV, Prefetch, MMX, SSE, SSE2
L1 cache size: 16 KB
L2 cache size: 2048 KB


With a 5 sequence dat file, I'm seeing a slowdown:

version speed
1.4.39 1350 kp/s
1.4.37 1350 kp/s
1.4.34 1450 kp/s

Hope this helps.

geoff 2007-04-20 02:25

[QUOTE=masser;104037]On the following machine:

Intel (R) Xeon(TM) CPU 3.40GHz
CPU speed: 3400.30 MHz
CPU features: RDTSC, CMOV, Prefetch, MMX, SSE, SSE2
L1 cache size: 16 KB
L2 cache size: 2048 KB


With a 5 sequence dat file, I'm seeing a slowdown:

version speed
1.4.39 1350 kp/s
1.4.37 1350 kp/s
1.4.34 1450 kp/s

Hope this helps.[/QUOTE]

Thanks for the benchmarks, this is very dissapointing. I am beginning to think my SSE2 assembler hits a sweet spot for the Northwood P4 and runs slower on everything else :-(

Could you send me a copy of the sieve file you used, so I can get a direct comparison with my P4?

edit: BTW if you are using sr5sieve to sieve this 5 sequence file, it may now be faster to use sr2sieve instead. Now that sr2sieve can sieve in any base, the main difference between it and sr5sieve is that sr5sieve is optimised for very large numbers of sequences.

masser 2007-04-20 03:04

Sent. Hope it helps.

geoff 2007-04-21 02:44

[QUOTE=masser;104037]Intel (R) Xeon(TM) CPU 3.40GHz
CPU speed: 3400.30 MHz
version speed
1.4.39 1350 kp/s
1.4.37 1350 kp/s
1.4.34 1450 kp/s
[/QUOTE]

On the same file with a P4 2.9GHz I get these results:
1.4.34 829 kp/s
1.4.37 1098 kp/s
1.4.39 1150 kp/s

1150/2.9=397 and 1350/3.4=397 ? Maybe all I have done with the new SSE2 code is bring my own machine up to the speed of everyone else's :-)

I don't really understand what is causing the variations in speed. I think the best advice might be for everyone to stick with sr2sieve/sr5sieve version 1.4.34 or sr1sieve version 1.0.19 unless you have the time to test later versions and determine that they are faster on your particular machine.

Citrix 2007-04-21 06:30

A feature request:

As you mentioned in one of the other threads that you have some problems storing factors of p-1, to do SPH.

Is it possible to have an option such that, for a specified t, only primes such that p%t=1 will be used for sieving. And since all p-1 is divisible by t, the t part can be used for SPH.

I think, if this can be implemented, it can speed up the sieve by alot. May be you can just modify the 3^16 code to support this.

Thanks.:smile:

geoff 2007-04-22 00:36

[QUOTE=Citrix;104169]A feature request:

As you mentioned in one of the other threads that you have some problems storing factors of p-1, to do SPH.

Is it possible to have an option such that, for a specified t, only primes such that p%t=1 will be used for sieving. And since all p-1 is divisible by t, the t part can be used for SPH.

I think, if this can be implemented, it can speed up the sieve by alot. May be you can just modify the 3^16 code to support this.
[/QUOTE]

If t is not a power of 2 then the Sieve of Eratosthenes has to use division instead of right shifts. It may still be faster though, because of the cache space saved by not needing to store the primes other than p=1 (mod t).

But then you still need to go back and sieve the other numbers the normal way, so I don't see the point unless you know in advance that there is no need to test any other primes, as for 3^16*2^(16*n)+1 with t=32.

I have been working on a sieve that uses t=30 (2*3*5) and there are certainly some gains to be made there, but that is partly because there are only 8 congruence classes of primes mod 30, which fits nicely into a byte and so simpifies the bitmap code. Maybe something similar could be done for t=16*30 that would help with sieving 3^16*2^(16*n)+1. This is much more complicated than the existing sieve however, and it would probably be limited to sieving primes up to (2^63)/t.

edit: the t=30 sieve is not just sieving primes p=1 (mod 30) though, it sieves all congruences classes mod 30, so is not really what you are asking for.

Citrix 2007-04-22 00:56

[QUOTE=geoff;104233]If t is not a power of 2 then the Sieve of Eratosthenes has to use division instead of right shifts. It may still be faster though, because of the cache space saved by not needing to store the primes other than p=1 (mod t).

But then you still need to go back and sieve the other numbers the normal way, so I don't see the point unless you know in advance that there is no need to test any other primes, as for 3^16*2^(16*n)+1 with t=32.

I have been working on a sieve that uses t=30 (2*3*5) and there are certainly some gains to be made there, but that is partly because there are only 8 congruence classes of primes mod 30, which fits nicely into a byte and so simpifies the bitmap code. Maybe something similar could be done for t=16*30 that would help with sieving 3^16*2^(16*n)+1. This is much more complicated than the existing sieve however, and it would probably be limited to sieving primes up to (2^63)/t.

edit: the t=30 sieve is not just sieving primes p=1 (mod 30) though, it sieves all congruences classes mod 30, so is not really what you are asking for.[/QUOTE]


You are right that we will have to go back and sieve the whole range again. But if 't' is smooth and the sieve is very fast, due to SPH, we can find alot of factors really fast and this can save some LLR tests and help find primes soon. For example for PSP, we have done around 10,000+ PRP's for which a factor is now known. If there was a fast way to avoid or reduce these, then that can help the project.

edit: Also alot of people don't sieve to 2^63. They only sieve upto say 2^45 or so. If we can speed the sieve up for a particular mod and cover the range 2^45 to 2^63 for primes p%t=1, then we can help reduce some factors.

For the Sieve of Eratosthenes, you will have to do one inverse calculation for the first 250,000 primes in the beginning and then store some values in an array and then read them for each cycle and update them. So no division is actually needed.




:smile:

geoff 2007-04-22 01:21

[QUOTE=Citrix;104234]You are right that we will have to go back and sieve the whole range again. But if 't' is smooth and the sieve is very fast, due to SPH, we can find alot of factors really fast and this can save some LLR tests and help find primes soon. For example for PSP, we have done around 10,000+ PRP's for which a factor is now known. If there was a fast way to avoid or reduce these, then that can help the project.
[/QUOTE]

yes I see your point. It might work, I am not sure. If the proportion of the factors that get tested is too small then you might quickly reach 2^62 without finding enough to make it worthwhile. Or the lower density of larger primes might mean you quickly reach a lower limit than 2^62 where it becomes faster to go back to normal sieving.

Another idea like this could be to implement a sieve using the [url=http://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm_for_logarithms]Pollard Rho[/url] method. This doesn't guarantee to find all factors, but you could use it to sieve ahead of the main sieve to get the easy factors. It would probably be a lot eastier to implement than SPH.


All times are UTC. The time now is 22:37.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.