![]() |
[QUOTE=em99010pepe;120223]You are doing a great job on the sieving side what about the LLRring side?
[/QUOTE] Thanks Carlos. I don't plan to try writing any FFT based programs. I was lucky to have just enough algebra to understand SHank's algorithm for solving the discrete logarithm, which allowed be to get started writing srsieve. I think understanding FFT is way beyond me. |
sr1sieve 1.2.6, sr2sieve 1.6.17
These versions have a `-q --quiet' switch to prevent factors being printed to the screen. They also record the version number in the log file.
|
sr2sieve 1.7.0 Multithreaded (Linux)
This version has a new switch: `-t --threads N' starts N child threads. There are still some problems to work out, so I recommend using the latest 1.6.x version for important work. See the README-threads file for a list of known problems.
The threading is done with fork(), so the `threads' are full unix processes with their own address space, communicating with the parent process via pipes. This method is very easy to bolt on to an existing program because there is no need for functions to be thread-safe. However it does have a lot of communication and scheduling overhead, and my implementation is surely less than perfect, so there is some performance loss from running one copy of sr2sieve with N threads compared to just running N copies of sr2sieve. Multithreading does save memory though, because the child threads don't need to contain a copy of the Sieve of Eratosthenes which dominates the memory requirements as sieve depth increases. There is no windows version yet because fork() and pipe() are missing from the mingw library. |
sr1sieve 1.3.0, sr2sieve 1.7.1
sr1sieve 1.3.0 has the -t switch for multithreading.
sr2sieve 1.7.1 fixes a bug in 1.7.0 that could cause factors at the very end of a completed range to be lost. |
If anyone has access to a AMD Barcelona in 64-bit mode it would be interesting to compare performance with earlier Athlon64 and with Core 2. One reason that sr2sieve performs better on Core 2 is that the latency of the cvtsi2sdq and cvtsd2siq instructions are much lower, but these are supposed to be improved in the Barcelona core.
|
sr2sieve 1.6.18 (stable), 1.7.3 (experimantal)
This version extends the maximum number of subsequences to 2^32-1. (In practice memory will run out long before this limit is reached).
In previous versions the limit was 2^16-1, but it was not always properly enforced, which could have caused factors to be missed for input files containing much more than 1000 sequences. This problem existed since version 1.4.22. It didn't affect SoB/PSP, RieselSieve or SR5. |
sr1sieve 1.3.2 and sr2sieve 1.7.4 for x86/sse2 and x86-64 have a new powmod() function, the x86-64 version has the biggest improvement, but the resulting sieve speed depends on sieve depth and number of sequences in the sieve (large p, many k, more improvement).
I think that the current powmod function is running at about half it's potential speed on x86-64. On my Core 2 Duo calculating b^n mod p takes about 35 clocks per bit of n, for p of about 40-60 bits. I think it would be possible to reduce this to about 18 clocks per bit by doing two powmods in parallel. If that can be done effectively then it could give upward of 20% boost to the overall sieve speed. |
Can this improve gcwsieve too?
|
[quote=geoff;122026]This version has a new switch: `-t --threads N' starts N child threads. There are still some problems to work out, so I recommend using the latest 1.6.x version for important work. See the README-threads file for a list of known problems.
The threading is done with fork(), so the `threads' are full unix processes with their own address space, communicating with the parent process via pipes. This method is very easy to bolt on to an existing program because there is no need for functions to be thread-safe. However it does have a lot of communication and scheduling overhead, and my implementation is surely less than perfect, so there is some performance loss from running one copy of sr2sieve with N threads compared to just running N copies of sr2sieve. Multithreading does save memory though, because the child threads don't need to contain a copy of the Sieve of Eratosthenes which dominates the memory requirements as sieve depth increases. There is no windows version yet because fork() and pipe() are missing from the mingw library.[/quote] Is there a multithreaded version of srsieve yet? |
sr2sieve 1.7.5
This version has vectorised powmod functions for x86 and x86-64.
I don't know why I didn't make this optimisation earlier: When p=1 (mod r) the r-power residue test involves computing the power (-k/c)^((p-1)/r) mod p for each sequence k*b^b+c. Since the exponent (p-1)/r is fixed, this can be done efficiently by creating a vector of values -k/c mod p and running the powmod algorithm on the whole vector at once. The gains from this optimisation are mainly due to implementing the vector powmod function with fully pipelined code, which explains why the Pentium 4 benefits most: [code] Sob.dat 6k SoB.dat 17k riesel.dat 66k sr5data.txt 220k ---------- ----------- -------------- ---------------- 1.6.18 p=1e15: 233 kp/s 127 kp/s 67.7 kp/s 35.0 kp/s 1.7.5 p=1e15: 238 kp/s 133 kp/s 75.4 kp/s 43.4 kp/s P3 600MHz: +2% +5% +11% +24% 1.6.18 p=1e15: 2726 kp/s 1481 kp/s 730 kp/s 343 kp/s 1.7.5 p=1e15: 2825 kp/s 1597 kp/s 900 kp/s 494 kp/s C2D 2.66GHz +4% +8% +23% +44% 1.6.18 p=1e15: 975 kp/s 527 kp/s 257 kp/s 110 kp/s 1.7.5 p=1e15: 1047 kp/s 604 kp/s 345 kp/s 179 kp/s P4 2.9GHz +7% +15% +34% +63% [/code] [QUOTE=monst]Can this improve gcwsieve too?[/QUOTE] No, gcwsieve makes only minor use of powmod, it would probably help only when there were a small number of terms in the sieve. [QUOTE=Anonymous]Is there a multithreaded version of srsieve yet?[/QUOTE] Not yet, I will probably wait until I can get multithreading working for the Windows version before I put it into srsieve. There are two main ways of multithreading a program: The easy way using fork(), and the hard way using clone() or equivalent. Linux has both options, but on Windows I think it has to be done the hard way. However a clone() method could be more efficient, so I will aim for that on Linux too. |
[quote=geoff;123174]Not yet, I will probably wait until I can get multithreading working for the Windows version before I put it into srsieve. There are two main ways of multithreading a program: The easy way using fork(), and the hard way using clone() or equivalent. Linux has both options, but on Windows I think it has to be done the hard way. However a clone() method could be more efficient, so I will aim for that on Linux too.[/quote]
Thanks! :smile: |
| All times are UTC. The time now is 20:52. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.