mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Sierpinski/Riesel Base 5

Reply
 
Thread Tools
Old 2006-11-28, 14:13   #23
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

111101102 Posts
Thumbs up

With 1.4.3, BASE=2 and the assembly code fix, I am now getting over 282000 p/sec per core on a 2.5 GHz 970MP. Thanks everyone.
BlisteringSheep is offline   Reply With Quote
Old 2006-11-28, 22:13   #24
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

3668 Posts
Thumbs up

With gcc4 and changing -mtune=G5 to -mtune=970, I am now getting ~292000 p/sec on each core of a 2.5GHz 970MP. I am also getting ~255000 p/sec on a 2.2 GHz 970FX and am getting ~212000 p/sec on each core of a 1.9 GHz POWER5+.
BlisteringSheep is offline   Reply With Quote
Old 2006-11-29, 00:42   #25
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11×577 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
With gcc4 and changing -mtune=G5 to -mtune=970, I am now getting ~292000 p/sec on each core of a 2.5GHz 970MP. I am also getting ~255000 p/sec on a 2.2 GHz 970FX and am getting ~212000 p/sec on each core of a 1.9 GHz POWER5+.
What is your speed without the ASM code?
rogue is offline   Reply With Quote
Old 2006-11-29, 00:55   #26
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
With gcc4 and changing -mtune=G5 to -mtune=970, I am now getting ~292000 p/sec on each core of a 2.5GHz 970MP. I am also getting ~255000 p/sec on a 2.2 GHz 970FX and am getting ~212000 p/sec on each core of a 1.9 GHz POWER5+.
Great to see it working, and a fair bit faster per clock than a P3 too :-)

Did you try out the inline version of the mulmod code? Although it is not needed anymore, if it did work then it would be interesting to know whether it was any faster than the externally linked version.
geoff is offline   Reply With Quote
Old 2006-11-29, 08:32   #27
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

3668 Posts
Default

I will take some time today to do the different variations. I've also got a couple of my own permutations that I'd like to check.

The standard range that I've been testing with (just because it was one of the first ones I used) is 273000-273025. Does any one have any particular values that they'd like me to test?

BlisteringSheep is offline   Reply With Quote
Old 2006-11-30, 05:07   #28
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts
Default

More testing results. When running with DEBUG=yes, I get the following:
Code:
~/sieve/sr5sieve-1.4.3-gcc4 $./sr5sieve -s
sr2sieve 1.4.3 -- A sieve for the Sierpinski/Riesel Base 2 projects.
Read 2473631 terms for 19 sequences from dat format file `SoB.dat'.
10223*2^n+1: length=40892, popcount=20444, checksum=415534282
19249*2^n+1: length=38498, popcount=19248, checksum=367877264
21181*2^n+1: length=42362, popcount=20764, checksum=436996010
22699*2^n+1: length=22699, popcount=11349, checksum=127778346
24737*2^n+1: length=98948, popcount=47712, checksum=2351574312
33661*2^n+1: length=67322, popcount=32800, checksum=1099217216
55459*2^n+1: length=55459, popcount=26820, checksum=740696994
67607*2^n+1: length=270428, popcount=135212, checksum=1062054378
79309*2^n+1: length=158618, popcount=79308, checksum=1972783320
79817*2^n+1: length=319268, popcount=159632, checksum=3949033080
90527*2^n+1: length=362108, popcount=181052, checksum=2654493066
152267*2^n+1: length=609068, popcount=304532, checksum=2383261450
156511*2^n+1: length=156511, popcount=78255, checksum=1828877729
168451*2^n+1: length=168451, popcount=84225, checksum=2778500758
222113*2^n+1: length=888452, popcount=444224, checksum=3841764096
225931*2^n+1: length=225931, popcount=112965, checksum=4136589190
237019*2^n+1: length=237019, popcount=118509, checksum=1126535552
258317*2^n+1: length=1033268, popcount=516632, checksum=271148168
265711*2^n+1: length=265711, popcount=132855, checksum=470581841
Built (-ckb^n/p) tables for 11 even, 8 odd, 0 mixed parity sequences.
Used 617 Kb for Legendre lookup tables, 3119 Kb for subsequence bitmaps.
Split 19 base 2 sequences into 184 base 2^360 subsequences.
Notice that
Code:
222361*2^n+1: length=444722, popcount=222360, checksum=2105159460
is missing. Also, it seems to not have found all 9 of the factors. If found
Code:
273007156439257 | 225931*2^40191656+1
273013881908563 | 90527*2^22497551+1
273018111521137 | 33661*2^40903200+1
273020751694081 | 152267*2^11683419+1
273024206407651 | 10223*2^23639849+1
and it did not find
Code:
273005642053939 | 10223*2^30572969+1
273012258141967 | 258317*2^4553159+1
273019346461271 | 24737*2^26923927+1
273020049854507 | 225931*2^25456352+1
I am going to redo the run to make sure I didn't screw anything up (the machine did have a couple of crash-reboots while the testing was going on).

I just tested on a Xeon system, and it also prints the same Legendre table as the ppc64 (with 19 terms instead of the original 20)
BlisteringSheep is offline   Reply With Quote
Old 2006-11-30, 06:13   #29
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

111101102 Posts
Default

Okay, I take part of that back. It found those missing factors, but reported them as duplicates. Do you know if that is correct (are they really duplicates)? If not, is there anything that I can do to capture the duplicates (since REPORT_DUPLICATES just causes them to print to the screen)?

Edit: The client running on a Xeon reports the factors in question as actual not duplicate (ie. it writes them to the factors file).

Thanks

Last fiddled with by BlisteringSheep on 2006-11-30 at 06:23 Reason: added x86 update
BlisteringSheep is offline   Reply With Quote
Old 2006-11-30, 23:21   #30
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

Quote:
Originally Posted by BlisteringSheep
[/code]Notice that
Code:
222361*2^n+1: length=444722, popcount=222360, checksum=2105159460
is missing.
That is just because I was using an older SoB.dat which had 20 k, no problem.

Quote:
Originally Posted by BlisteringSheep
Okay, I take part of that back. It found those missing factors, but reported them as duplicates. Do you know if that is correct (are they really duplicates)? If not, is there anything that I can do to capture the duplicates (since REPORT_DUPLICATES just causes them to print to the screen)?

Edit: The client running on a Xeon reports the factors in question as actual not duplicate (ie. it writes them to the factors file).
Are you using the same SoB.dat on both machines? If so then it is surely a bug. Check whether both machines report the same number of terms read from the file, a difference would indicate a bug in the file parsing code. Otherwise the bitmap code is probably the culprit again.

The algorithm does not guarantee to find all duplicate factors, so recording them is not really a solution.
geoff is offline   Reply With Quote
Old 2006-12-01, 00:14   #31
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

115710 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
Okay, I take part of that back. It found those missing factors, but reported them as duplicates. Do you know if that is correct (are they really duplicates)? If not, is there anything that I can do to capture the duplicates (since REPORT_DUPLICATES just causes them to print to the screen)?

Edit: The client running on a Xeon reports the factors in question as actual not duplicate (ie. it writes them to the factors file).
Arrrgh :-( I found the bug. In subseq.c, in function subseq_clear_m, change this line:

return clear_bit(SUBSEQ[subseq].M, m-m_low);

to

return (clear_bit(SUBSEQ[subseq].M, m-m_low) != 0);

This only affects 64-bit builds. I'll have to check I haven't done similar things in other places.
geoff is offline   Reply With Quote
Old 2006-12-01, 00:46   #32
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

Once the bugs are gone, there are a couple of easy experiments to do that might make it faster on the ppc64:

1. Set CONST_EMPTY_SLOT to zero in hashtable.h. This avoids one branch in the critical lookup64() function. On 32-bit machines the extra branch is worthwhile to avoid a 64-bit comparison, but on 64-bit machines maybe not.

2. Reduce HASH_MAX_DENSITY in sr5sieve.h to about half its current value. This might benefit machines with a 64Kb L1 cache.

3. Declare the getMagic() function and the pMagic, pShift global variables as static. You will need to make separate copies in bsgs.c and factors.c, and to add a call to getMagic(p) before the mod64_init0(p) call in factors.c. This might allow GCC to generate better code for the giant steps in bsgs64(). (I can make these changes for you to test if its not clear what I mean here).

4. Adjust the values of BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK in choose.c. The ideal values should reflect the relative speed of a mulmod vs an insert/lookup. Since the insert/lookup code works with 16-bit variables while mulmod works with 64-bit variables, the mulmod may be relatively faster on the ppc64 than on the x86. If this is the case I would expect reducing all four values by a constant amount (say 0.1) would result in better parameters being chosen. You may also need to increase LIMIT_BASE in sr5sieve.h to a small multiple of the current value, say 4320, to see the effect. (When sr5sieve starts up it prints a message `split N base b sequences into M base b^Q subsequences', the parameter affected is Q).


Some harder things (new code required):

Implement the PRE2_MULMOD64 macros.
---------------------------------------

PRE2_MULMOD64(a,b,p) is a macro version of mulmod64(a,b,p) optimised for the very common case where it is called in a loop with fixed values of b,p. It hasn't yet been implemented for the ppc64, currently it is just an alias for mulmod64(a,b,p).

The x86 version works by keeping a pre-computed product on the FPU stack, which saves one 64-bit FPU load and one FPU multiply per call. On my P3/450 with a 20k SoB.dat at p=237 trillion, this optimisation increases throughput from 33.4 kp/s to 45.7 kp/s.

The ppc64 assembler mulmod(a,b,p,pMagic,pShift) function calculates a*b*pMagic each call, so there may be similar savings to be made by pre-computing the product b*pMagic in PRE2_MULMOD64_INIT().


Implement memset_fast32() in assembler.
---------------------------------------

See memset_fast32.h. This is used to clear the hashtable each iteration, implementing it in assembler increases throughput by about 2% on a P3, and might make a lower value for HASH_MAX_DENSITY worthwhile.
geoff is offline   Reply With Quote
Old 2006-12-01, 07:46   #33
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

111101102 Posts
Default

I've tried the first 4 things you suggested. They all caused improvements except for the static magic. Compiling with gcc4 instead of gcc3 seems to have more of an effect.

On my test machine (2.0 GHz 970 PowerMac), the unoptimized client runs at ~227000 p/sec. The fastest trial so far is with the CONST_EMPTY_SLOT and HASH_MAX_DENSITY changes using gcc4 at ~242000 p/sec. I have made the STEPS change, but that trial hasn't run yet (and it's bedtime) I'll let you know how it goes. Tomorrow (ie. later today) I should get a chance to implement the changes and run on the faster 970MPs.
BlisteringSheep is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
srsieve/sr2sieve enhancements rogue Software 300 2021-03-18 20:31
32-bit of sr1sieve and sr2sieve for Win pepi37 Software 5 2013-08-09 22:31
sr2sieve question SaneMur Information & Answers 2 2011-08-21 22:04
sr2sieve client mgpower0 Prime Sierpinski Project 54 2008-07-15 16:50
How to use sr2sieve nuggetprime Riesel Prime Search 40 2007-12-03 06:01

All times are UTC. The time now is 09:37.


Sat Jul 17 09:37:06 UTC 2021 up 50 days, 7:24, 1 user, load averages: 1.06, 1.25, 1.39

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.