![]() |
|
|
#34 | |
|
"Gang aft agley"
Sep 2002
1110101010102 Posts |
Quote:
|
|
|
|
|
|
|
#35 | |
|
∂2ω=0
Sep 2002
República de California
103·113 Posts |
Quote:
I tried a quasirandom sample of 56 odd 64-bit q`s. The results for the "converging to 1" product q*qinv which gets subtracted from 2 in the Newtonian iteration are interesting. Here are the final-iteration (tmp = q*qinv_i) data for the 56 q`s. Numbering our hex digits 0-15 with 0 being the least-significant (rightmost), I expected to see some data with a nonzero hex digit 8, 9, 10, etc, but in fact the rightmost nonzero hex digit (aside from the lowest-order '1' itself) for the entire sample is digit 10, i.e. all of these data == 1 (mod 2^n) with n at least 40. Here they are sorted in order of decreasing number of converged digits: Code:
tmp = 0000000000000001, qinv_4 = 3B5C754866A089BF tmp = 9F00000000000001, qinv_4 = FB6B0F723490B861 tmp = 5F00000000000001, qinv_4 = 9A16886C9827AEA1 tmp = 1F00000000000001, qinv_4 = 2EEFE419D4CEBD9F tmp = 5F00000000000001, qinv_4 = 3CA5FDEAE58D755F tmp = 1F00000000000001, qinv_4 = 3E9A7341A241A261 tmp = DF00000000000001, qinv_4 = 264AFA0A625ABD61 tmp = 1F00000000000001, qinv_4 = 497A3304C7F29661 tmp = 3F00000000000001, qinv_4 = 4A660935C0BA841F tmp = 5F00000000000001, qinv_4 = 24C89FE7CD9BEAA1 tmp = BF00000000000001, qinv_4 = 7CF5F1F607693821 tmp = FF00000000000001, qinv_4 = 823739FC069F951F tmp = D11F000000000001, qinv_4 = AD285CE732A3848F tmp = 59BF000000000001, qinv_4 = C5C3891708E55131 tmp = D89F000000000001, qinv_4 = FAA8C0D60FB3C911 tmp = 445F000000000001, qinv_4 = 767B2DFFE7D31A0F tmp = E69F000000000001, qinv_4 = 87EA2E1CAB43A2EF tmp = F35F000000000001, qinv_4 = 1183F68A3B987B91 tmp = 643F000000000001, qinv_4 = 150760FDDDB351CF tmp = 6DFF000000000001, qinv_4 = 1C6C7B1DB8D0304F tmp = 183F000000000001, qinv_4 = BD821BC1F83A1631 tmp = 73DF000000000001, qinv_4 = 2CA2572F6FF7670F tmp = 9B7F000000000001, qinv_4 = 9FFE8AF5AD701AB1 tmp = 7F3F000000000001, qinv_4 = D55439C3F161C431 tmp = D03F000000000001, qinv_4 = 85FD09F288958631 tmp = 963F000000000001, qinv_4 = 24AB8DD59DB4EEAF tmp = 8FBF000000000001, qinv_4 = A51866DEFC91A9AF tmp = 8C489F0000000001, qinv_4 = 40208AE7CF866337 tmp = 97C3FF0000000001, qinv_4 = 36C0866788262917 tmp = 83AF3F0000000001, qinv_4 = 9D00FC96481BCBD7 tmp = 9A639F0000000001, qinv_4 = B838EEB23712F7C9 tmp = 7B683F0000000001, qinv_4 = 9FB73D65EE5074C7 tmp = 77351F0000000001, qinv_4 = F19DC03216373219 tmp = 9CB07F0000000001, qinv_4 = 127C982FA9E030F9 tmp = A95C5F0000000001, qinv_4 = AD3534A88513D9D9 tmp = 65BFDF0000000001, qinv_4 = 205FEA4B68B54259 tmp = AAE03F0000000001, qinv_4 = CAE38E907F0356D7 tmp = 62CF9F0000000001, qinv_4 = A0CE919CB841DC37 tmp = AFD1DF0000000001, qinv_4 = 4170789DFE890FA7 tmp = B2E71F0000000001, qinv_4 = B482EFB8413F20B7 tmp = B5D33F0000000001, qinv_4 = 5E07DD7601B1D7D7 tmp = C1AC5F0000000001, qinv_4 = 2EE0E4B376696F89 tmp = 46299F0000000001, qinv_4 = 4C0618F01719A237 tmp = 4515FF0000000001, qinv_4 = DB86F3D0DB67E0E9 tmp = DCC67F0000000001, qinv_4 = 56736596898EFD07 tmp = 4258DF0000000001, qinv_4 = D6774A189A573959 tmp = DD8CBF0000000001, qinv_4 = D4A0F7F44C5CFBA9 tmp = 3DFD9F0000000001, qinv_4 = 1F86273DB27CAE67 tmp = 21675F0000000001, qinv_4 = FD6C928106E83D77 tmp = DDE73F0000000001, qinv_4 = 132EBAB0FD4ACC29 tmp = E27FDF0000000001, qinv_4 = CCB866840C288259 tmp = E7B97F0000000001, qinv_4 = 2BC86A6A0EE21D97 tmp = 1EDDDF0000000001, qinv_4 = AEF965D8A6E51BA7 tmp = FBBC1F0000000001, qinv_4 = CA21D65DA4A6D319 tmp = 09E51F0000000001, qinv_4 = 3614137D1AACA2B7 tmp = 88F1FF0000000001, qinv_4 = A4DE71A925DC3317 Moreover, the rightmost nonzero hex digit (again aside from the lowest-order '1' itself) always = 0xF. There is probably a simple reason for this, but it's not evident to me yet. Here is binned-data histogram (in form of a simple table): Code:
tmp_4 == 1 (mod 2^n): n #occurrences ----- ------------ >= 64 1 60-63 0 56-59 11 52-55 0 48-51 15 44-47 0 40-43 29 36-39 0 32-35 0 < 32 0 I suspect now that when Peter M. noted the formula "gives 4 converged low-order bits" he may have meant "at least 4 but less than 8", which in the context of a power-of-2-bit wordsize translates to "assume 4 good bits", since even knowing we have 5 does not reduce our iteration count, except in the rare case where we have 8 good bits in qinv_0. Edit: In fact my example case q = 16357897499336320049 may not be the best here, since it happens to yield not the ("new, improved, now 25% more free!") minimum of 5 bits but rather 6. OTOH, it's 50/50 whether we get 5 or > 5 good bits in qinv_0, so I'm gonna stick with it, just modify the bold-highlighting in the resulting table suitably and make a comment to that effect. Last fiddled with by ewmayer on 2012-07-24 at 23:07 |
|
|
|
|
|
|
#36 | ||
|
"Gang aft agley"
Sep 2002
2×1,877 Posts |
Quote:
Edit: Quote:
Last fiddled with by only_human on 2012-07-24 at 23:51 Reason: deleted "converges faster" talk |
||
|
|
|
|
|
#37 | |
|
"Gang aft agley"
Sep 2002
2×1,877 Posts |
Quote:
Last fiddled with by only_human on 2012-07-25 at 00:20 Reason: note: (I am mostly talking to myself here. Please don't derail your time on my account) |
|
|
|
|
|
|
#38 | |
|
∂2ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
More interestingly, the mod-16 values which give "at least 5 bits" (3,7,9,13) *never* seem to give more than 5, whereas the ones which give "at least 6" (1,5,11,15) all give 6,7,8,... good bits with probability halving with each additional "lucky" bit of precision. Feel free to go nuts on this ... maybe you can write a paper on it, or something. :) I need to wrap this part up for now so I can focus on what I'd really intended to do this week, namely the loop-folding optimizations of the remaindering and quotient algorithms for the 64-bit case. (But one should always welcome interesting distractions - I had one such happen to me back in May, and it turned into the nascent paper. :) |
|
|
|
|
|
|
#39 |
|
"Gang aft agley"
Sep 2002
2×1,877 Posts |
|
|
|
|
|
|
#40 | |
|
Dec 2008
179 Posts |
Quote:
Code:
q (mod 16) q^2 (mod 32) bits 0001 or 1111 00001 >=6 0011 or 1101 01001 5 0101 or 1011 11001 >=6 0111 or 1001 10001 5 |
|
|
|
|
|
|
#41 |
|
"Gang aft agley"
Sep 2002
375410 Posts |
Nothing to see here, I just haven't spawned a miscellaneous math thread yet. This is about those mysterious 1s (the 'F' nibble) in tmp
Code:
i b tmp qinv_i 1 8 F3D66805FF3D2BC1 4141B6714938A4D1 2 16 B0955EB99F05F001 656E0C4A279FB4D1 3 32 0A62BFBCBF000001 6EAB2BE6389FB4D1 4 64 E97F000000000001 81FC2BE6389FB4D1 Code:
tmp=111100111101011001101000000001011111111100111101001010 1111 00 0001 tmp=10110000100101010101111010111001100111110000010 11111 0000 00000001 tmp=0000101001100010101111111011110010 111111 00000000 00000000 00000001 tmp=111010010 1111111 00000000 00000000 00000000 00000000 00000000 00000001 Last fiddled with by only_human on 2012-07-25 at 17:30 |
|
|
|
|
|
#42 |
|
∂2ω=0
Sep 2002
República de California
265678 Posts |
FYI, I have finished writing up the loop-folding optimization stuff for the remaindering algorithm(s), and worked out the analogous procedure for the quotient computation: That is similar, but with a small twist. Both foldings make use of the same intermediate partial-vector remainders; the whole thing is really sweet. Just need to do some more coding of the quotient stuff and timings, but I expect very similar (perhaps identical) timings for this as for the remaindering, since the loop bodies are so similar. Thus things are pointing to being able to do a full (= remainder and quotient) large-vector DIV in ~12 cycles per 64-bit word.
I intend to post the updated PDF sometime over the weekend, would like to submit to Math. Comp. early next week. |
|
|
|
|
|
#43 |
|
"Mike"
Aug 2002
822310 Posts |
.
|
|
|
|
|
|
#44 | |
|
∂2ω=0
Sep 2002
República de California
103·113 Posts |
Quote:
Main reason for the delay is that I decided to get ambitious and fill some available real estate in 2 of the tables with ASM samples, and formatting those is a pain. Main changes: o example of 64-bit qinv iteration; o loop-folding stuff for remainder and quotient computations; o comparison with GMP n/1 div algo timings on the Core 2 (reference added). Edit: Already found 1 misspelling which my editor missed (although when I right-click on the word it suggests the correct spelling): 'appraoch' (p.26, 4th line from bottom) should be 'approach'. Last fiddled with by ewmayer on 2012-07-31 at 03:05 |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Using long long's in Mingw with 32-bit Windows XP | grandpascorpion | Programming | 7 | 2009-10-04 12:13 |
| I think it's gonna be a long, long time | panic | Hardware | 9 | 2009-09-11 05:11 |
| Multiply | mgb | Lounge | 0 | 2008-07-28 12:54 |
| Long Division in Mathematica | JuanTutors | Information & Answers | 7 | 2007-06-14 17:29 |
| VIA C7 Montgomery Multiplier? | akruppa | Hardware | 1 | 2005-08-04 10:25 |