20090824, 17:56  #1 
Jul 2009
31 Posts 
P95 trial division strategy
I'm experimenting with my own GPU Mersenne trial division tool.. it's got its own fun challenges.
The classic way to test for divisibility of 2^p 1 by q is to compute 2^p mod q and look for a 1 remainder. It's a fast and easy power ladder. The slow step is of course doing the mod q step. There's lots of strategies. The important cases are when q> 2^64, so I'm using 3 32 bit words as my representation. The modular method I use for SPRP testing is Montgomery reduction, and that's what I was planning to use for trial division as well. I have that working and it's successful.. though of course you always want more performance. I've thought of using the "multiply trick" of computing 2^160/q and changing the division into a multiplication. There's also some clever tricks to use q^1 mod 2^160, but I think that these precomputes are already slower than the Montgomery method. The trial division tools in MERS uses several methods depending on the size of q. But for the interesting and most common case of q>2^62, it just jumps to an arbitrary precision library. Prime95, however, is very interesting but also fairly opaque. Despite the many comments in factor32.asm and factor64.asm, it's hard to decipher its approach! It seems as if it computes an (approximate!) floating point reciprocal to q, then uses that to compute how many multiples of q to subtract off. So this may be doing classic division and subtraction to compute its mods, boosted by using a floating point initial approximation step. I may be very wrong about this because the code is so beautifully tuned and expanded that it's hard to take a step back and see its strategy at a larger level. The comments in the code also are well written but at a very low statement level. I understand the sieve steps, but can someone give a higher level description of P95's computational method of 2^p mod q? 
20090824, 19:25  #2 
Jul 2006
Calgary
5^{2}×17 Posts 

20090824, 20:06  #3  
Jul 2009
31 Posts 
Quote:


20090824, 20:25  #4  
"Richard B. Woods"
Aug 2002
Wisconsin USA
7692_{10} Posts 
Quote:


20090824, 20:34  #5  
Jul 2009
37_{8} Posts 
Quote:
But that's also noticably slower (in my own code) than doing it with Montgomery reduction, at least for exponents more than about 65556. Given the beautiful tuning of P95 (both at algorithm and assembly code levels) I was thinking there may be something else going on that I could learn from. 

20090824, 21:13  #6  
P90 years forever!
Aug 2002
Yeehaw, FL
2×3×19×67 Posts 
Quote:
To compute x mod y, you take the approximate reciprocal accurate to 53bits and compute (top bits of x) * (approx. recip) to get 32 bits of the quotient (almost always accurate, Knuth guarantees it is accurate to within 2). Multiply q by y, subtract from x, repeat til done. Also, note that you can compute the reciprocal quickly without division. Just take the last reciprocal you computed (it will be real close) and use Newton's method to compute the new reciprocal. Last fiddled with by Prime95 on 20090824 at 21:13 

20090824, 21:26  #7  
∂^{2}ω=0
Sep 2002
República de California
3·13^{2}·23 Posts 
Quote:
Note that should be able to get at least a 23x speedup on 64bit x86style (and most RISC) architectures (that is, ones running under a 64bit OS as well) from using the full 64x64 > 128bit hardware integer MUL instruction. That "wastes" some bits for the high parts of the multiword products (e.g. the 32x64bit subproducts), but fullnessofbitfieldaesthetics is not the name of the game here. 

20090824, 23:22  #8  
Jul 2009
31 Posts 
Quote:
I did notice the (many!) multiple methods defined, especially for the low bit ranges of < 2^64. I guess in the days of 486 processors, it was a struggle even to test up that high and having custom code for different low bit ranges was useful. I may try a couple oldschool divide approaches as well... we'll see how they compare to Montgomery. 

20090824, 23:26  #9  
Jul 2009
31 Posts 
Quote:


Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Sublinear complexity of trial division?  yih117  Math  5  20180202 02:49 
Mersenne trial division implementation  mathPuzzles  Math  8  20170421 07:21 
trial division over a factor base  Peter Hackman  Factoring  7  20091026 18:27 
Trial division software for Mersenne  SPWorley  Factoring  7  20090816 00:23 
Need GMP trialdivision timings  ewmayer  Factoring  7  20081211 22:12 