![]() |
|
|
#23 |
|
Oct 2006
On a Suzuki Boulevard C90
111101102 Posts |
With 1.4.3, BASE=2 and the assembly code fix, I am now getting over 282000 p/sec per core on a 2.5 GHz 970MP. Thanks everyone.
|
|
|
|
|
|
#24 |
|
Oct 2006
On a Suzuki Boulevard C90
3668 Posts |
With gcc4 and changing -mtune=G5 to -mtune=970, I am now getting ~292000 p/sec on each core of a 2.5GHz 970MP. I am also getting ~255000 p/sec on a 2.2 GHz 970FX and am getting ~212000 p/sec on each core of a 1.9 GHz POWER5+.
|
|
|
|
|
|
#25 |
|
"Mark"
Apr 2003
Between here and the
11×577 Posts |
What is your speed without the ASM code?
|
|
|
|
|
|
#26 | |
|
Mar 2003
New Zealand
13·89 Posts |
Quote:
Did you try out the inline version of the mulmod code? Although it is not needed anymore, if it did work then it would be interesting to know whether it was any faster than the externally linked version. |
|
|
|
|
|
|
#27 |
|
Oct 2006
On a Suzuki Boulevard C90
3668 Posts |
I will take some time today to do the different variations. I've also got a couple of my own permutations that I'd like to check.
The standard range that I've been testing with (just because it was one of the first ones I used) is 273000-273025. Does any one have any particular values that they'd like me to test?
|
|
|
|
|
|
#28 |
|
Oct 2006
On a Suzuki Boulevard C90
2·3·41 Posts |
More testing results. When running with DEBUG=yes, I get the following:
Code:
~/sieve/sr5sieve-1.4.3-gcc4 $./sr5sieve -s sr2sieve 1.4.3 -- A sieve for the Sierpinski/Riesel Base 2 projects. Read 2473631 terms for 19 sequences from dat format file `SoB.dat'. 10223*2^n+1: length=40892, popcount=20444, checksum=415534282 19249*2^n+1: length=38498, popcount=19248, checksum=367877264 21181*2^n+1: length=42362, popcount=20764, checksum=436996010 22699*2^n+1: length=22699, popcount=11349, checksum=127778346 24737*2^n+1: length=98948, popcount=47712, checksum=2351574312 33661*2^n+1: length=67322, popcount=32800, checksum=1099217216 55459*2^n+1: length=55459, popcount=26820, checksum=740696994 67607*2^n+1: length=270428, popcount=135212, checksum=1062054378 79309*2^n+1: length=158618, popcount=79308, checksum=1972783320 79817*2^n+1: length=319268, popcount=159632, checksum=3949033080 90527*2^n+1: length=362108, popcount=181052, checksum=2654493066 152267*2^n+1: length=609068, popcount=304532, checksum=2383261450 156511*2^n+1: length=156511, popcount=78255, checksum=1828877729 168451*2^n+1: length=168451, popcount=84225, checksum=2778500758 222113*2^n+1: length=888452, popcount=444224, checksum=3841764096 225931*2^n+1: length=225931, popcount=112965, checksum=4136589190 237019*2^n+1: length=237019, popcount=118509, checksum=1126535552 258317*2^n+1: length=1033268, popcount=516632, checksum=271148168 265711*2^n+1: length=265711, popcount=132855, checksum=470581841 Built (-ckb^n/p) tables for 11 even, 8 odd, 0 mixed parity sequences. Used 617 Kb for Legendre lookup tables, 3119 Kb for subsequence bitmaps. Split 19 base 2 sequences into 184 base 2^360 subsequences. Code:
222361*2^n+1: length=444722, popcount=222360, checksum=2105159460 Code:
273007156439257 | 225931*2^40191656+1 273013881908563 | 90527*2^22497551+1 273018111521137 | 33661*2^40903200+1 273020751694081 | 152267*2^11683419+1 273024206407651 | 10223*2^23639849+1 Code:
273005642053939 | 10223*2^30572969+1 273012258141967 | 258317*2^4553159+1 273019346461271 | 24737*2^26923927+1 273020049854507 | 225931*2^25456352+1 I just tested on a Xeon system, and it also prints the same Legendre table as the ppc64 (with 19 terms instead of the original 20) |
|
|
|
|
|
#29 |
|
Oct 2006
On a Suzuki Boulevard C90
111101102 Posts |
Okay, I take part of that back. It found those missing factors, but reported them as duplicates. Do you know if that is correct (are they really duplicates)? If not, is there anything that I can do to capture the duplicates (since REPORT_DUPLICATES just causes them to print to the screen)?
Edit: The client running on a Xeon reports the factors in question as actual not duplicate (ie. it writes them to the factors file). Thanks
Last fiddled with by BlisteringSheep on 2006-11-30 at 06:23 Reason: added x86 update |
|
|
|
|
|
#30 | ||
|
Mar 2003
New Zealand
13·89 Posts |
Quote:
Quote:
The algorithm does not guarantee to find all duplicate factors, so recording them is not really a solution. |
||
|
|
|
|
|
#31 | |
|
Mar 2003
New Zealand
115710 Posts |
Quote:
return clear_bit(SUBSEQ[subseq].M, m-m_low); to return (clear_bit(SUBSEQ[subseq].M, m-m_low) != 0); This only affects 64-bit builds. I'll have to check I haven't done similar things in other places. |
|
|
|
|
|
|
#32 |
|
Mar 2003
New Zealand
13·89 Posts |
Once the bugs are gone, there are a couple of easy experiments to do that might make it faster on the ppc64:
1. Set CONST_EMPTY_SLOT to zero in hashtable.h. This avoids one branch in the critical lookup64() function. On 32-bit machines the extra branch is worthwhile to avoid a 64-bit comparison, but on 64-bit machines maybe not. 2. Reduce HASH_MAX_DENSITY in sr5sieve.h to about half its current value. This might benefit machines with a 64Kb L1 cache. 3. Declare the getMagic() function and the pMagic, pShift global variables as static. You will need to make separate copies in bsgs.c and factors.c, and to add a call to getMagic(p) before the mod64_init0(p) call in factors.c. This might allow GCC to generate better code for the giant steps in bsgs64(). (I can make these changes for you to test if its not clear what I mean here). 4. Adjust the values of BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK in choose.c. The ideal values should reflect the relative speed of a mulmod vs an insert/lookup. Since the insert/lookup code works with 16-bit variables while mulmod works with 64-bit variables, the mulmod may be relatively faster on the ppc64 than on the x86. If this is the case I would expect reducing all four values by a constant amount (say 0.1) would result in better parameters being chosen. You may also need to increase LIMIT_BASE in sr5sieve.h to a small multiple of the current value, say 4320, to see the effect. (When sr5sieve starts up it prints a message `split N base b sequences into M base b^Q subsequences', the parameter affected is Q). Some harder things (new code required): Implement the PRE2_MULMOD64 macros. --------------------------------------- PRE2_MULMOD64(a,b,p) is a macro version of mulmod64(a,b,p) optimised for the very common case where it is called in a loop with fixed values of b,p. It hasn't yet been implemented for the ppc64, currently it is just an alias for mulmod64(a,b,p). The x86 version works by keeping a pre-computed product on the FPU stack, which saves one 64-bit FPU load and one FPU multiply per call. On my P3/450 with a 20k SoB.dat at p=237 trillion, this optimisation increases throughput from 33.4 kp/s to 45.7 kp/s. The ppc64 assembler mulmod(a,b,p,pMagic,pShift) function calculates a*b*pMagic each call, so there may be similar savings to be made by pre-computing the product b*pMagic in PRE2_MULMOD64_INIT(). Implement memset_fast32() in assembler. --------------------------------------- See memset_fast32.h. This is used to clear the hashtable each iteration, implementing it in assembler increases throughput by about 2% on a P3, and might make a lower value for HASH_MAX_DENSITY worthwhile. |
|
|
|
|
|
#33 |
|
Oct 2006
On a Suzuki Boulevard C90
111101102 Posts |
I've tried the first 4 things you suggested. They all caused improvements except for the static magic. Compiling with gcc4 instead of gcc3 seems to have more of an effect.
On my test machine (2.0 GHz 970 PowerMac), the unoptimized client runs at ~227000 p/sec. The fastest trial so far is with the CONST_EMPTY_SLOT and HASH_MAX_DENSITY changes using gcc4 at ~242000 p/sec. I have made the STEPS change, but that trial hasn't run yet (and it's bedtime ) I'll let you know how it goes. Tomorrow (ie. later today) I should get a chance to implement the changes and run on the faster 970MPs.
|
|
|
|
![]() |
| Thread Tools | |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| srsieve/sr2sieve enhancements | rogue | Software | 300 | 2021-03-18 20:31 |
| 32-bit of sr1sieve and sr2sieve for Win | pepi37 | Software | 5 | 2013-08-09 22:31 |
| sr2sieve question | SaneMur | Information & Answers | 2 | 2011-08-21 22:04 |
| sr2sieve client | mgpower0 | Prime Sierpinski Project | 54 | 2008-07-15 16:50 |
| How to use sr2sieve | nuggetprime | Riesel Prime Search | 40 | 2007-12-03 06:01 |