mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Sierpinski/Riesel Base 5 (https://www.mersenneforum.org/forumdisplay.php?f=54)
-   -   A multiple k/c sieve for Sierpinski/Riesel problems (https://www.mersenneforum.org/showthread.php?t=5785)

Cruelty 2007-06-20 18:29

I've just noticed that sr2sieve v.1.5.6 (linux x86-64) chooses hashtable size of 512kB instead of previously selected 1024kB (1.5.5). When I tried forcing -H 1024 I get the following error: [code]Wed Jun 20 20:25:24 2007 ERROR: Hashtable size 1024 Kb is too large, recompile with SHORT_HASHTABLE=0[/code]
BTW: I have just finnished double checking my range and both results are the same :tu:

geoff 2007-06-21 00:05

[QUOTE=Cruelty;108618]I've just noticed that sr2sieve v.1.5.6 (linux x86-64) chooses hashtable size of 512kB instead of previously selected 1024kB (1.5.5). When I tried forcing -H 1024 I get the following error: [code]Wed Jun 20 20:25:24 2007 ERROR: Hashtable size 1024 Kb is too large, recompile with SHORT_HASHTABLE=0[/code]
[/QUOTE]
512Kb will be the size of the Sieve of Eratosthenes bitmap, not the hashtable.

Previous versions used 1/2 L2 cache for the bitmap, but new machines coming out with massive L2 cache sizes are going to cause some problems.

In version 1.5.6 I limited the bitmap size to 512Kb, but if that is too small I could increase the limit as far as 4Mb. If you run with -L512 that will cause a 256Kb bitmap to be used, could you check whether there is a difference in speed from using 256Kb vs 512Kb?

One problem is that as larger bitmaps are used the p/sec times may become bumpy, you may need to take the average over a few reports to get a good comparison.

Cruelty 2007-06-21 05:31

1 Attachment(s)
Attached are benchmarks for 1.5.5 and 1.5.6 - 1.5.5 is faster.

geoff 2007-06-22 02:07

[QUOTE=Cruelty;108643]Attached are benchmarks for 1.5.5 and 1.5.6 - 1.5.5 is faster.[/QUOTE]

It is necessary to compare the actual run times. The benchmark timer was changed in 1.5.6 to flush the pipeline before starting the benchmark, which should give more consistent times on some machines (like my P4/Celeron), but all times will now be a little longer. Also the Sieve of Eratosthenes is not part of the benchmarks, so the bitmap size will have no effect on them.

In version 1.5.7 I have increased the bitmap limit to 2Mb anyway, so it will only come into effect for machines with L2 cache size greater than 4Mb. I am a little worried that the large bitmap will have a negative effect on multi-core machines, but I will leave it up to the user to lower the limit with the -L switch if that is the case.

Also in version 1.5.7 for the x86-64 build there is a sr2sieve-fpu binary which can sieve up to 2^62 (the standard x86-64 binary is limited to 2^52). I expect the sr2sieve-fpu binary will be slower, but it would be interesting to know just how much slower.

Cruelty 2007-06-22 21:04

Indeed initial benchmark still favours 1.5.5 however "real" results are better using 1.5.7 (~3.5%).
Then the "fpu" version produces some really strange results + generates an error...
[code]./sr2sieve-fpu -vv
sr2sieve 1.5.7 -- A sieve for multiple sequences k*b^n+/-1.
L1 data cache 32Kb (detected), L2 cache 2048Kb (detected).
Read 489119 terms for 10 sequences from ABCD format file `sr2data.txt'.
Split 10 base 2 sequences into 255 base 2^60 subsequences.
Loaded Legendre symbol lookup tables for 10 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.25.
Best time for baby step method gen/2: 919350.
Best time for baby step method gen/4: 914373.
Best time for baby step method gen/1: 18288.
Best time for giant step method gen/2: 802197.
Best time for giant step method gen/4: 779670.
Best time for giant step method gen/1: 787527.
Best time for ladder method gen/2: 1791.
Best time for ladder method gen/4: 1269.
Best time for ladder method gen/1: 3285.
Best time for ladder method add/1: 3924.
Using baby step method gen/1, giant step method gen/4, ladder method gen/4.
Resuming from checkpoint pmin=9204118684829 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 9768.73 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 9204118684829 <= p <= 10000000000000
[B]Segmentation fault (core dumped)[/B][/code]

geoff 2007-06-23 03:12

[QUOTE=Cruelty;108752]Then the "fpu" version produces some really strange results + generates an error...[/QUOTE]
Thanks for the report, it looks like I still have some work to do on that :-(.

geoff 2007-06-24 01:19

sr2sieve 1.5.8 (x86-64 only)
 
I have fixed a corruption of the FPU stack in the sr2sieve-fpu build, but there may be other bugs remaining. Use sr2sieve-fpu for tesing only for now.

In the standard x86-64 build I have enabled gen/8 methods. There are not enough general registers to do 8 mulmods in parallel, but they may still be faster than gen/4 methods in some cases.

Cruelty 2007-06-24 07:12

Version 1.5.8 is marginally faster than 1.5.7: [code]./sr2sieve -vv
sr2sieve 1.5.7 -- A sieve for multiple sequences k*b^n+/-1.
L1 data cache 32Kb (detected), L2 cache 2048Kb (detected).
Read 489119 terms for 10 sequences from ABCD format file `sr2data.txt'.
Split 10 base 2 sequences into 255 base 2^60 subsequences.
Loaded Legendre symbol lookup tables for 10 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.25.
Best time for baby step method gen/2: 52875.
Best time for baby step method gen/4: 40599.
Best time for baby step method gen/1: 58644.
Best time for giant step method gen/2: 26460.
Best time for giant step method gen/4: 25011.
Best time for giant step method gen/1: 33462.
Best time for ladder method gen/2: 1467.
Best time for ladder method gen/4: 1134.
Best time for ladder method gen/1: 2538.
Best time for ladder method add/1: 2943.
Using baby step method gen/4, giant step method gen/4, ladder method gen/4.
Resuming from checkpoint pmin=9452855837203 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 9768.73 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 9452855837203 <= p <= 10000000000000
p=9452986387147, 2188122 p/sec, 8132 factors, 87.84% done, 222 sec/factor[/code][code]./sr2sieve -vv
sr2sieve 1.5.8 -- A sieve for multiple sequences k*b^n+/-1.
L1 data cache 32Kb (detected), L2 cache 2048Kb (detected).
Read 489119 terms for 10 sequences from ABCD format file `sr2data.txt'.
Split 10 base 2 sequences into 255 base 2^60 subsequences.
Loaded Legendre symbol lookup tables for 10 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.25.
Best time for baby step method gen/2: 52929.
Best time for baby step method gen/4: 40725.
Best time for baby step method gen/8: 40590.
Best time for baby step method gen/1: 58581.
Best time for giant step method gen/2: 26415.
Best time for giant step method gen/4: 25101.
Best time for giant step method gen/8: 25092.
Best time for giant step method gen/1: 33444.
Best time for ladder method gen/2: 1467.
Best time for ladder method gen/4: 1134.
Best time for ladder method gen/8: 1269.
Best time for ladder method gen/1: 2547.
Best time for ladder method add/1: 2943.
Using baby step method gen/8, giant step method gen/8, ladder method gen/4.
Resuming from checkpoint pmin=9452855837203 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 9768.73 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 9452855837203 <= p <= 10000000000000
p=9452986780493, 2195598 p/sec, 8132 factors, 87.84% done, 221 sec/factor[/code]
"FPU" version is working and it's not so bad in terms of speed :tu: [code]cruelty@c2d-low:~/Desktop/LLR-low/tests$ ./sr2sieve-fpu -vv
sr2sieve 1.5.8 -- A sieve for multiple sequences k*b^n+/-1.
L1 data cache 32Kb (detected), L2 cache 2048Kb (detected).
Read 489119 terms for 10 sequences from ABCD format file `sr2data.txt'.
Split 10 base 2 sequences into 255 base 2^60 subsequences.
Loaded Legendre symbol lookup tables for 10 sequences from `sr2cache.bin'.
Using 16 Kb for the baby-steps giant-steps hashtable, maximum density 0.25.
Best time for baby step method gen/2: 54351.
Best time for baby step method gen/4: 41274.
Best time for baby step method gen/8: 38502.
Best time for baby step method gen/1: 82296.
Best time for giant step method gen/2: 30744.
Best time for giant step method gen/4: 28467.
Best time for giant step method gen/8: 28890.
Best time for giant step method gen/1: 41715.
Best time for ladder method gen/2: 1773.
Best time for ladder method gen/4: 1251.
Best time for ladder method gen/8: 1341.
Best time for ladder method gen/1: 3249.
Best time for ladder method add/1: 3924.
Using baby step method gen/8, giant step method gen/4, ladder method gen/4.
Resuming from checkpoint pmin=9453008566493 in `checkpoint.txt'.
Using 1024Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 9768.73 terms in this range.
sr2sieve started: 1000000 <= n <= 1999997, 9453008566493 <= p <= 10000000000000
p=9453036880669, 1938265 p/sec, 8133 factors, 87.85% done, 250 sec/factor[/code]

geoff 2007-06-26 03:46

[QUOTE=Cruelty;108813]Version 1.5.8 is marginally faster than 1.5.7: [code]./sr2sieve -vv
sr2sieve 1.5.7 -- A sieve for multiple sequences k*b^n+/-1.
p=9452986387147, 2188122 p/sec, 8132 factors, 87.84% done, 222 sec/factor[/code][code]./sr2sieve -vv
sr2sieve 1.5.8 -- A sieve for multiple sequences k*b^n+/-1.
p=9452986780493, 2195598 p/sec, 8132 factors, 87.84% done, 221 sec/factor[/code]
[/QUOTE]

That small improvement probably just comes from loop unrolling, the gen/8 methods were implemented just by doing the gen/4 method twice per loop.

[QUOTE]"FPU" version is working and it's not so bad in terms of speed :tu: [code]cruelty@c2d-low:~/Desktop/LLR-low/tests$ ./sr2sieve-fpu -vv
sr2sieve 1.5.8 -- A sieve for multiple sequences k*b^n+/-1.
p=9453036880669, 1938265 p/sec, 8133 factors, 87.85% done, 250 sec/factor[/code][/QUOTE]

I might try to translate rogue's ppc64 mulmods to the x86-64, to see how they compare.

geoff 2007-06-28 02:07

In sr2sieve version 1.5.10 (x86-64 only) there are two new binaries: sr2sieve-int1 and sr2sieve-int2. These are attempts to use rogue's ppc64 mulmod (which doesn't use floating point) on the x86-64, it would be interesting to see how they compare to sr2sieve-fpu. (If they work at all, that is :-).

The main problem translating the ppc64 code to x86-64 is that there are not enough general registers to get maximum benefit from unrolling the loops, so I expect the -int binaries will be slower. But if they are not too much slower then it may be worthwhile trying to do some of the operations in SSE registers to get around that.

Cruelty 2007-06-28 05:31

1 Attachment(s)
Attached are benchmarks for all variants of 1.5.10.


All times are UTC. The time now is 22:37.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.