20040421, 20:20  #1 
"William Garnett III"
Oct 2002
Langhorne, PA
2·43 Posts 
improvement needed in SSE2 for Pentium 4 for nonIBDWT case for LLR/PRP
Hello,
This is a pretty long post, but I wanted to bring this up to try to get some SSE2 programs to fix LLR/PRP, if possible. This "problem" has persisted for years and if it can be fixed I wanted to get some people working on it. To help explain what I mean, let me first show some benchmarks. The following were run on a computer in a computer lab, specifically, a Dell Pentium 4 2.8Ghz, 800Mhz system bus, dual channel DDR 333 memory (I think) with Hyperthreading disabled. For the Mersenne number, 2^71377691, the latest version of Prime95 in natural SSE2 mode yields a per iteration time of: 14.085ms For Prime95 with SSE2 disabled, it yields 42.129ms For the Riesel number 3*2^71377681, using the latest version of LLR which takes advantage of IBDWT for small k, first for SSE2 enabled: [Mon Apr 19 02:33:05 2004] Using IBDWT : Mersenne fftlen = 393216, Proth fftlen = 786432, Used fftlen = 458752 19.087ms For SSE2 disabled: [Mon Apr 19 02:49:03 2004] Using IBDWT : Mersenne fftlen = 393216, Proth fftlen = 786432, Used fftlen = 458752 50.364ms For the Riesel number 513*2^71377681 in regular mode (no IBDWT), For SSE2 enabled: 67.562ms For SSE2 disabled: 105.331 For the Proth number 513*2^7137768+1, using LLR (it uses PRP mode) (no IBDWT): with SSE2 67.482ms without SSE2 105.351ms Finally, on my personal computer, a Pentium III 600 Mhz (133 mhz system bus, PC133) (which of course doesn't have SSE2): For the Riesel number 513*2^71377681 (no IBDWT). 284.437ms For Mersenne number 2^71377691 using Prime95: 109.612ms For 3*2^71377681 using LLR (IBDWT): 128.921ms Few aside notes: I am assuming Jean is going to implement the IBDWT for the Proth case too??? and not limit it to Riesel case? This is all based on Colin's algorithm: http://www.ams.org/mcom/200372241/...199/home.html Thank you Jean for implementing this algorithm!! Excellent work so far!! No SSE2 problems here for IBWDT (since modulo step done for free :) ) OK, back to the main focus. Let's look at the benchmarks. It is known that Mersenne's are roughly 3 times faster than Riesels/Proths(nonIBDWT). Let's have a look at Pentium 3: Mersenne: 109.612ms Riesel: 284.437ms Ratio: 2.59 No problems here. Let's look at Pentium 4 SSE2 disabed mode(nonIBDWT): Mersenne:42.129ms Riesel:105.331 Ratio: 2.50 No problems. So we see with non SSE2 processor like Pentium 3, and SSE2 disabled mode for P4, everything happens as planned. Now let's look at SSE2. Remember when Prime95 got SSE2 optimized George said it was 3 times faster: Mersenne(SSE2):14.082ms Mersenne(nonSSE2):42.129ms ratio:2.99 Yeap that's what happened. Let's look even at this new IBDWT mode Jean included in LLR for small k: SSE2 Riesel IBDWT: 19.087ms nonIBDWT Riesel: 50.364ms ratio: 2.64 That's happening to plan too. Makes sense, since this "Colin algorithm" uses IBDWT like Prime95 does for Mersenne's so iteration times should be close. Nice job for a first release. I imagine they will be improving even more as time goes along. These numbers get the modulo step done for free. OK, now for the bad news. Using LLR, and k > 512, Proth's and Riesel's are roughly same time. Riesel's do a deterministic primality test for k*2^n1, while PRP does a probable prime test for +1. I ran both benchmarks in LLR, since LLR automatically defaults to PRP mode for +1. Here are the times: SSE2 (no IBDWT): 67.482ms SSE2 disabled (no IBDWT):105.351ms ratio:1.56 OUCH!!! This has been like this for years. This is the one problem that exists. It is that only for the SSE2 processor Pentium 4, LLR/PRP is poorly optimized for the non IBDWT case. So my request to any SSE2 experts out there, can this be fixed? I know very little about SSE2 so don't know about feasibility of fixing this problem. Many of us are running projects that test Proth and Riesel numbers with big k, and we are not taking full advantage of the SSE2 instructions in the Pentium 4 like Prime95 is. Here are some old emails I have that may shed light on helping you guys (I hope these guys don't get upset for me cutting and pasting conversations): George Woltman: Date: Mon, 11 Mar 2002 17:42:47 0500 To: "William F. Garnett III" <vmrf@grove.iup.edu> "The P4 speedup in prp is not as dramatic as the P4 speedup in prime95. This is because the mod k*2^n1 step is not SSE2 optimized. Thus expect only a 2x speedup instead of a 3x speedup." Date: Tue, 12 Mar 2002 10:05:45 0500 To: "William F. Garnett III" <vmrf@grove.iup.edu> "Thanks. Just one more thing. For Proth primes (k*2^n+1), is PRP2 partially SSE2 optimized, or not at all? The multiplication is SSE2 optimized. I know below you said the mod step is not SSE2 optimized. Is this as far as PRP2 will probably be SSE2 optimized, or will you maybe be able to optimize it further? Its not high on my priority list. I'm really hoping the PFGW project puts PRP out of business. Maybe someone in that group can an write SSE2 optimized modulo routine. Regards, George " " From: George Woltman <woltman@alum.mit.edu> Subject: Re: SSE2 Date: Sat, 16 Mar 2002 22:24:51 0500 To: "William F. Garnett III" <vmrf@grove.iup.edu> Hi, At 09:43 PM 3/16/2002 0500, you wrote: Just some questions about PRP2 and SSE2. I know you said multiplication is SSE2 optimized but not modulo routine. You mentioned you hopefully want PFGW to put PRP2 out of business. Anyway, I just like to know, would it be hard to optimized the modulo routine for SSE2? It would be somewhat difficult. The multiplication routines use a funny memory layout for good cache utilitization. This layout makes a good modulo implementation hard. The good news is the routine is quite localized. Are the prime95 and PRP2 modulo routines similar, or is somehow the modulo routine harder to do for PRP2 and thus we get only the 2x improvement instead of 3x? Prime95 does not do a modulo. It happens for free thanks to Richard Crandall irrational discrete weighted transform. Regards, George " " From: "Brian J. Beesley" <xx@xx.net> Subject: Re: benchmarks for Proth primes Date: Tue, 09 Apr 2002 19:30:33 +0000 To: "William F. Garnett III" <vmrf@grove.iup.edu>, woltman@alum.mit.edu Cc: vmrf@iup.edu Why is there only a minor speedup, unlike Prime > 95, which gets a HUGE boost with regards to SSE2. Without bothering to check the code, I suspect that the modulo operation isn't taking advantage of SSE2 (maybe there's no easy way to do this) and ends up dominating the execution time on a P4 system. There may be a workaround. If the convolution is broken down into a^2 (mod p) and a^2 (mod q), where p and q are different primeexponent Mersenne numbers (therefore coprime) and p.q > k.2^n+1, then you can use "free modulo" code and recover the final result using the Chinese Remainder Theorem. If this enables full use to be made of SSE2, this method might well be superior on a P4, even if it's not efficient on a processor which doesn't have SSE2 logic. BTW George could you please put up a copy of "PRP v2.1" (the one that reports final residuals) for linux? The version on ftp://www.mersenne.org/gimps still doesn't log residuals. Regards Brian Beesley " From: George Woltman <woltman@alum.mit.edu> Subject: Re: benchmarks for Proth primes Date: Tue, 09 Apr 2002 17:22:08 0400 My P4 shows about 3.2 ms per iteration vs. 5.8 ms per iteration if I make PRP not use the SSE2 code. That is a big gain although not 2x. I created a special prp that did not time the proth mod step. This ran at 1.7 ms per iteration using SSE2 and 4.8 ms for nonSSE2 code. This means the multiply step did get nearly 3 times faster (4.8 vs. 1.7) but the proth mod step got slower (1.5 vs. 1.0). The SSE2 code uses a different memory layout that is apparently less cache friendly in the proth mod step. Hope that helps, George " Mon, 11 Nov 2002 18:25:17 0500 No. Reason I'm asking is for searches like mine (searching for the 5th largest prime), every bit of SSE2 optimization can help, and we can really use it. PFGW is the same speed as your PRP, so they don't have any extra SSE2 optimization I suppose. Let me know if you are optimizing the mod step. The mod step actually uses SSE2 instructions. Its problem is that it accesses cache lines poorly. It uses only 16 bytes of each 128byte cache line resulting in main memory being read 8 times more often than the ideal. This all stems from the memory layout that was chosen to make the Mersenne multiplication as fast as possible. This same layout is poor for the mod k*2^n1 step. Is there a memory layout that could satisfy both goals? Maybe, but even if it exists it means a major rewrite of all the multiplication code. Is there a different way to do the mod step that is more memory efficient? It is possible. For example it might be faster to transpose memory, do the mod, transpose back faster than the current scheme. Unfortunately, it is a low priority item right now. Regards, George " That's basically it. Is there anyway guys to improve the mod step for proths and riesels using sSE2 (nonIBDWT)? Any speed improvement would help us guys running projects that are not using small ks. I am not knowledgeable in SSE2 so don't know if it's feasible. If it is, I would like some people to work on fixing it if possible and doesn't interfere with any other obligations you have. Thanks for all the help everyone and for all the contributions!! regards, william garnett 
20040422, 03:40  #2 
P90 years forever!
Aug 2002
Yeehaw, FL
1F16_{16} Posts 
Your best hope is Jean Penne. He is the first person to successfully wade in and improve the assembly code. Very impressive.
I now know how to fix the problem without changing the memory model. It can be done in the same routines that Jean has already changed. 
20040629, 00:07  #3 
Jun 2003
2^{2}·397 Posts 
Any chance for improvement on P4 HT's. Under win XP, one processor is completely doing nothing, just sitting idle.
can't we make it do anything? Citrix 
20040629, 03:03  #4  
Aug 2002
500_{8} Posts 
Quote:


Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
performance improvement with assembly  bsquared  Software  15  20100928 19:00 
Possible improvement of quadratic sieve  Random Poster  Factoring  4  20100212 03:09 
Pentium 90 // Pentium ][ 400 years  ValerieVonck  Programming  4  20061212 17:06 
Pentium 4 not using SSE2 with Windows NT4  Unregistered  Software  10  20040520 12:14 
Possible idea for improvement  Kevin  Software  1  20020826 07:57 