#45 | |
|
Oct 2006
On a Suzuki Boulevard C90
2×3×41 Posts |
Quote:
#46 | |||
|
Mar 2003
New Zealand
485₁₆ Posts |
Quote:
Quote:
Quote:
#47 | |
|
Oct 2006
On a Suzuki Boulevard C90
246₁₀ Posts |
Quote:
I will experiment some more with the BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK suggestions you made next.
#48 | |
|
Oct 2006
On a Suzuki Boulevard C90
246₁₀ Posts |
Quote:
I promise, last post of the night [Bet you're all going to be glad when I go away]
#49 | ||
|
Mar 2003
New Zealand
10010000101₂ Posts |
Quote:
In version 1.4.8 I have made another attempt at the inline mulmod. There are no guarantees it will work, but if you want to test it, just compile with -DUSE_INLINE_MULMOD added to CPPFLAGS.
Quote:
#50 | |
|
Oct 2006
On a Suzuki Boulevard C90
2×3×41 Posts |
Quote:
Note that these tests are on a different machine with a slower CPU than the results at the top of this page, and the numbers can't be directly compared.
Last fiddled with by BlisteringSheep on 2006-12-07 at 04:56 Reason: speed disclaimer |
#51 |
|
Oct 2006
On a Suzuki Boulevard C90
2×3×41 Posts |
On the 2.5 GHz 970MPs, the speedup is more significant, from about 308000 p/sec to 331000 (over 7%).
There is a similarly significant improvement on the 2.2 GHz 970FX, from about 266000 to 286000 (again over 7%).
One thing I did think to do in these tests, unlike the one last night on the slower CPU, was to remove mulmod-ppc64.o from the list of ASM_OBJS when building with USE_INLINE_MULMOD.
Last fiddled with by BlisteringSheep on 2006-12-07 at 16:00 Reason: added 2.2 GHz 970FX results |
#52 | |
|
Mar 2003
New Zealand
13×89 Posts |
Quote:
I don't really know enough about PPC assembler to guess what is likely to work best though, and trial and error will be a long process without a machine at hand to test it on.
If you have the patience to do this yourself, the basic idea is to replace a register in the clobber list with an entry in the output list associated with a temporary variable. The current code looks like this:
Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
    register uint64_t ret;
    asm ("li      %0, 64"          "\n\t"
         "sub     %0, %0, %5"      "\n\t"
         "mulld   r7, %1, %2"      "\n\t"
         "mulhdu  r8, %1, %2"      "\n\t"
         "mulld   r26, r7, %4"     "\n\t"
         "mulhdu  r27, r7, %4"     "\n\t"
         "mulld   r28, r8, %4"     "\n\t"
         "mulhdu  r29, r8, %4"     "\n\t"
         "adde    r9, r27, r28"    "\n\t"
         "addze   r10, r29"        "\n\t"
         "srd     r9, r9, %5"      "\n\t"
         "sld     r10, r10, %0"    "\n\t"
         "or      r9, r9, r10"     "\n\t"
         "mulld   r9, r9, %3"      "\n\t"
         "sub     %0, r7, r9"      "\n\t"
         "cmpdi   cr6, %0, 0"      "\n\t"
         "bge+    cr6, 0f"         "\n\t"
         "add     %0, %0, %3"      "\n"
         "0:"
         : "=&r" (ret)
         : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
         : "r7","r8","r9","r10","r26","r27","r28","r29","cr6" );
    return ret;
}
Replacing r7 and r8 with temporary variables would look like this:
Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
    register uint64_t ret, tmp1, tmp2;
    asm ("li      %0, 64"          "\n\t"
         "sub     %0, %0, %7"      "\n\t"
         "mulld   %1, %3, %4"      "\n\t"
         "mulhdu  %2, %3, %4"      "\n\t"
         "mulld   r26, %1, %6"     "\n\t"
         "mulhdu  r27, %1, %6"     "\n\t"
         "mulld   r28, %2, %6"     "\n\t"
         "mulhdu  r29, %2, %6"     "\n\t"
         "adde    r9, r27, r28"    "\n\t"
         "addze   r10, r29"        "\n\t"
         "srd     r9, r9, %7"      "\n\t"
         "sld     r10, r10, %0"    "\n\t"
         "or      r9, r9, r10"     "\n\t"
         "mulld   r9, r9, %5"      "\n\t"
         "sub     %0, %1, r9"      "\n\t"
         "cmpdi   cr6, %0, 0"      "\n\t"
         "bge+    cr6, 0f"         "\n\t"
         "add     %0, %0, %5"      "\n"
         "0:"
         : "=&r" (ret), "=&r" (tmp1), "=&r" (tmp2)
         : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
         : "r9","r10","r26","r27","r28","r29","cr6" );
    return ret;
}
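When doing this kind of trial and error on the assembler, it may help to have a plain C reference to compare results against. The sketch below is not part of sr2sieve: the name mulmod64_ref is made up for illustration, and it assumes a compiler with the unsigned __int128 extension (GCC and Clang on 64-bit targets). It only computes (a * b) mod p directly and makes no attempt to mirror the pMagic/pShift reduction used in the listings above.

```c
#include <stdint.h>

/* Hypothetical reference helper, not part of sr2sieve.
   Computes (a * b) mod p through a 128-bit intermediate product,
   relying on the unsigned __int128 extension (GCC/Clang). */
static inline uint64_t mulmod64_ref(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}
```

Running an edited mulmod64 and mulmod64_ref over the same inputs and comparing the results should catch most operand-numbering mistakes before any timing is done.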
#53 |
|
Mar 2003
New Zealand
13·89 Posts |
In version 1.4.9 I have made a small change to the inline PRE2_MULMOD64 macro that should allow GCC to recognise that the initial subtraction results in a loop invariant. You can try this change out by replacing the `#if 1' with `#if 0' in asm-ppc64.h.
Last fiddled with by geoff on 2006-12-10 at 23:06 |
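To illustrate what a loop invariant means here, the sketch below uses the same pattern as the `li %0, 64` / `sub` pair followed by `srd` / `sld` / `or` in the mulmod64 listing earlier in the thread: the value 64 - pShift does not change between iterations, so it can be computed once before the loop. This is only an assumption about what the PRE2_MULMOD64 change is aiming at. The function name shift_pairs and its array arguments are hypothetical, and the C version assumes 1 <= pShift <= 63 because a shift by 64 is undefined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not from sr2sieve).
   Assumes 1 <= pShift <= 63: shifting a 64-bit value by 64 is
   undefined behaviour in C. */
static void shift_pairs(const uint64_t *lo, const uint64_t *hi,
                        uint64_t *out, size_t n, unsigned pShift)
{
    /* Loop invariant: computed once instead of once per iteration. */
    const unsigned inv = 64 - pShift;

    for (size_t i = 0; i < n; i++) {
        /* Same shape as the srd / sld / or sequence in the asm. */
        out[i] = (lo[i] >> pShift) | (hi[i] << inv);
    }
}
```

Whether hoisting the subtraction actually pays off depends on register pressure inside the loop, which is something only measurement on the target CPU can settle.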
#54 | |
|
Oct 2006
On a Suzuki Boulevard C90
2×3×41 Posts |
Quote:
#55 |
|
Oct 2006
On a Suzuki Boulevard C90
246₁₀ Posts |
On the faster chips, this hurt performance.
With the #if 1 on a 2.5 GHz 970MP, it knocked throughput from ~330 kp/sec back to ~300 kp/sec.