mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Sierpinski/Riesel Base 5

Reply
 
Thread Tools
Old 2007-06-22, 03:10   #133
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts
Default

Quote:
Originally Posted by geoff View Post
I have added the vector methods code along with rogue's assembler for timing to the version 1.5.7 source.

I think there is still room to improve performance on the ppc64, in at least two ways:

1. The mulmod routine should be able to be re-arranged so that the multiplications are done in a different order and the intermediate product re-used in the next mulmod. At the moment we are effectively multiplying pMagic*b every time, although both pMagic and b are constant over many mulmods.

2. Assuming there are enough spare registers, it should be possible to interleave the instructions from two seperate mulmods, which should allow the CPU to get better throughput. e.g. y1 = x1*b (mod p) and y2 = x2*b (mod p) are independent computations so it should be possible to intermix the instructions from each. At the moment there are unnecessary dependenncies between them because they both use the same scratch registers.
I doubt that I would be able to provide much coding help, but will definitely do any testing you would like. You are also welcome to have an account on my Linux PowerMac. If that would be useful to you at all, PM or email me and I can set it up anytime, with either telnet or ssh access (your choice).
BlisteringSheep is offline   Reply With Quote
Old 2007-06-22, 03:44   #134
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

48516 Posts
Default

Quote:
Originally Posted by rogue View Post
This multiply is basically free since it can be done at the same time as the other multiplies in the pipeline, so I don't think there would be any gain.
OK, but removing the redundant multiplication could allow more mulmods to be done in parallel if 2. can be implemented.

Quote:
There are plenty of registers. It should be possible to do two (or possibly 3 or 4) at once, but it would require either changing the ASM or using gcc syntax for inline assembler, which I am not any good at.
In version 1.5.7 the SSE2 routines are in mulmod-sse2.S, instead of inline asm the main loops are functions such as:
Code:
vec4_mulmod64_sse2(uint64_t *X, uint64_t *Y, int count);
which computes Y[i] = X[i]*b (mod p) for 0 <= i < count, interleaving 4 mulmods at a time. (X and Y may be equal or overlap by multiples of 4). This one function can be used for both baby (Y=X+4) and giant (Y=X) steps, although it might prove more efficient to write two seperate functions, one for when Y=X and another for when Y=X+4.

Last fiddled with by geoff on 2007-06-22 at 03:47
geoff is offline   Reply With Quote
Old 2007-06-22, 12:19   #135
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11×577 Posts
Default

Quote:
Originally Posted by geoff View Post
OK, but removing the redundant multiplication could allow more mulmods to be done in parallel if 2. can be implemented.
It could.

Quote:
Originally Posted by geoff View Post
In version 1.5.7 the SSE2 routines are in mulmod-sse2.S, instead of inline asm the main loops are functions such as:
Code:
vec4_mulmod64_sse2(uint64_t *X, uint64_t *Y, int count);
which computes Y[i] = X[i]*b (mod p) for 0 <= i < count, interleaving 4 mulmods at a time. (X and Y may be equal or overlap by multiples of 4). This one function can be used for both baby (Y=X+4) and giant (Y=X) steps, although it might prove more efficient to write two seperate functions, one for when Y=X and another for when Y=X+4.
The PPC code could certainly do that and probably in a similar way to SSE2. I'll look into it, but can make no promises that I will have time to do it.
rogue is offline   Reply With Quote
Old 2007-06-23, 02:38   #136
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

115710 Posts
Default

Attached is my attempt at doing two mulmods in parallel, this code can be tested by replacing the VEC2_* definitions in asm-ppc64.h with those in the attached file.

I didn't know how many bits a condition register like cr6 has, so the declarations of c0, c1 may need to be changed. (Or it might not matter).

I wasn't sure what the adde and addze instructions do, so I have allowed for the possibility that addze might depend on adde. Do any instructions have side effects on the ppc64, effects other than on their operands?

Anyway, if it works then I can probably make a version that does 4 in parallel using the same technique.
Attached Files
File Type: txt vec2-ppc64.txt (3.9 KB, 161 views)

Last fiddled with by geoff on 2007-06-23 at 02:42 Reason: Windows mangled the text file
geoff is offline   Reply With Quote
Old 2007-06-23, 12:36   #137
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11000110010112 Posts
Default

Quote:
Originally Posted by geoff View Post
Attached is my attempt at doing two mulmods in parallel, this code can be tested by replacing the VEC2_* definitions in asm-ppc64.h with those in the attached file.

I wasn't sure what the adde and addze instructions do, so I have allowed for the possibility that addze might depend on adde. Do any instructions have side effects on the ppc64, effects other than on their operands?
adde sets the carry bit and addze will clear it after adding to the given register. You should be able to put the srd instruction between the two since it doesn't affect the carry bit.

It should be possible to calculate b * pMagic up front, then multiply that product by a, but you will have to hold all 128 bits of that product. That will eliminate the stall of 4 cycles after "mulhdu %8, %18, %12". It will also allow the function to reuse two registers. I have an idea that might allow you to avoid most of that assember code. I hope to play around with it later.
rogue is offline   Reply With Quote
Old 2007-06-23, 20:13   #138
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11×577 Posts
Default

Here is some code you can incorporate. I have not tried to compile this or test this. If you put it into 1.5.8, send me the code and the means to test and I will test it for you. Typically I would use ppc_intrinsics.h, but it doesn't have these instructions, which is why they are here.

These inlines remove the need for the mulmod-ppc64.S file:

static inline uint64_t __adde (uint64_t a, uint64_t c) __attribute__((always_inline));
static inline uint64_t __adde (uint64_t a, uint64_t c)
{
uint64_t result;
__asm__ ("adde %0, %1, %2"
/* outputs: */ : "=r" (result)
/* inputs: */ : "r" (a), "r" (c));
return result;
}

static inline uint64_t __addze (uint64_t a) __attribute__((always_inline));
static inline uint64_t __addze (uint64_t a)
{
uint64_t result;
__asm__ ("addze %0, %1"
/* outputs: */ : "=r" (result)
/* inputs: */ : "r" (a));
return result;
}

static inline uint64_t __mulhdu (uint64_t a, uint64_t c) __attribute__((always_inline));
static inline uint64_t __mulhdu (uint64_t a, uint64_t c)
{
uint64_t result;
__asm__ ("mulhdu %0, %1, %2"
/* outputs: */ : "=r" (result)
/* inputs: */ : "r" (a), "r" (c));
return result;
}

I re-arranged the code (calculating b*pMagic first). You should be able to unroll from the calculation of c64a to the end.
mulmod(a, b, p, pMagic, magicShift)
{
uint64_t c64, c64a, c64b, c128, rem, quot;

// Get the shifts
rightShift = magicShift;
leftShift = 64 - magicShift;

// Calculate the 128-bit product b * pMagic
bLO = b * pMagic;
bHI = __mulhdu(b, pMagic);

// Multiple (bHI,bLO) by a
c64a = __mulhdu(bLO, a);
c64b = bHI * a;
c128 = __mulhdu(bHI, a);

// Get the upper 128-bits of aforementioned multiply
c64 = __adde(c64a, c64b);
c128 = __addze(c128);

// get the quotient (a * b) / p
quot = (c64 >> rightShift) | (c128 << leftShift);

// Calculate the remainder (a * b) - (quot * p);
rem = (a * b) - (quot * p);
}
rogue is offline   Reply With Quote
Old 2007-06-23, 20:15   #139
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11·577 Posts
Default Build issues in srsieve

malloc.h does not exist on OS X, so you should #ifdef the include of malloc.h in bsgs.c and util.c.

HAVE_MEMALIGN should be set to 0 for OS X.
rogue is offline   Reply With Quote
Old 2007-06-24, 06:45   #140
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

24610 Posts
Default

geoff, I just found a minor bug in function read_argc_argv() in file files.c. When it's building up the strv, it needs to populate strv[0] with argv[0], and then start reading the contents of the command line file into strv beginning at strv[1]. Here's a little patch that works (or at least works for me :), though I'm not claiming it's the best fix):
Code:
diff -u sr2sieve-1.5.8.orig/files.c sr2sieve-1.5.8.mod/files.c
--- sr2sieve-1.5.8.orig/files.c 2007-03-19 06:09:45.000000000 -0400
+++ sr2sieve-1.5.8.mod/files.c  2007-06-24 02:34:09.000000000 -0400
@@ -352,7 +352,10 @@
 
   len = 16;
   strv = xmalloc(len*sizeof(char *));
-  if ((strv[0] = strtok(line," \n")) == NULL)
+  strv[0] = xmalloc(strlen(argv[0][0])+1);
+  strcpy(strv[0], argv[0][0]);
+
+  if ((strv[1] = strtok(line," \n")) == NULL)
   {
     free(strv);
     free(line);
@@ -360,7 +363,7 @@
     return;
   }
 
-  for (i = 1; ((strv[i] = strtok(NULL," \n")) != NULL); i++)
+  for (i = 2; ((strv[i] = strtok(NULL," \n")) != NULL); i++)
     if (len <= i+1)
     {
       len += 16;
BlisteringSheep is offline   Reply With Quote
Old 2007-06-25, 00:27   #141
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
Default

Just a clarification: what is happening with the current code is that the first option from sr2sieve-command-line.txt is put into argv[0], so it is never examined back in main().
BlisteringSheep is offline   Reply With Quote
Old 2007-06-26, 02:21   #142
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13×89 Posts
Default

Quote:
Originally Posted by rogue View Post
// Get the upper 128-bits of aforementioned multiply
c64 = __adde(c64a, c64b);
c128 = __addze(c128);
GCC doesn't know that addze depends on adde, it might reorder these operations or schedule another instruction between them that clobbers the carry flag. I don't think there is a way to tell GCC about the dependancy, so it is necessary to combine both instructions into one asm().

In version 1.5.9 I have tried to use the idea from this code to modify the existing code. Set EXPERIMENTAL to 1 in asm-ppc64.h to enable it. The definition for CONDITION_REGISTER_T might need to be changed. I have also fixed the malloc.h issues for OS X.

How many condition registers does the PPC64 have?

A brief test can be done by downloading the archive http://www.geocities.com/g_w_reynold...r5check.tar.gz and running the command line `sr5sieve -i sr5check.txt -p 100e6 -P 150e6', which should result in 10533 factors being found.
geoff is offline   Reply With Quote
Old 2007-06-26, 02:27   #143
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
I just found a minor bug in function read_argc_argv() in file files.c. When it's building up the strv, it needs to populate strv[0] with argv[0], and then start reading the contents of the command line file into strv beginning at strv[1].
Actually that is not a bug :-). sr5sieve-command-line.txt should contain the full command line, including the command itself, not just the switches: `sr5sieve -Z -v ...' not just `-Z -v ...'

Last fiddled with by geoff on 2007-06-26 at 02:27 Reason: fixed quote
geoff is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
srsieve/sr2sieve enhancements rogue Software 300 2021-03-18 20:31
32-bit of sr1sieve and sr2sieve for Win pepi37 Software 5 2013-08-09 22:31
sr2sieve question SaneMur Information & Answers 2 2011-08-21 22:04
sr2sieve client mgpower0 Prime Sierpinski Project 54 2008-07-15 16:50
How to use sr2sieve nuggetprime Riesel Prime Search 40 2007-12-03 06:01

All times are UTC. The time now is 09:30.


Sat Jul 17 09:30:57 UTC 2021 up 50 days, 7:18, 1 user, load averages: 1.10, 1.32, 1.47

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.