mersenneforum.org
Old 2012-07-24, 19:12   #34
only_human

Quote:
Originally Posted by ewmayer View Post
I believe the "more converged bits than expected" appearance of tmp above - you see how those zeros appear to be propagating leftward faster than the formal converged-bit boundary? - is data-dependent happenstance, but will run a few trials with more q's to check that.
Fascinating if it ends up not being happenstance. I notice in this example it is half again ahead: It is half as wide as the highlighted width to its right.
Old 2012-07-24, 22:37   #35
ewmayer

Quote:
Originally Posted by only_human View Post
Fascinating if it ends up not being happenstance. I notice in this example it is half again ahead: It is half as wide as the highlighted width to its right.
It is indeed not an artifact.

I tried a quasirandom sample of 56 odd 64-bit q's. The results for the "converging to 1" product q*qinv which gets subtracted from 2 in the Newtonian iteration are interesting. Here are the final-iteration (tmp = q*qinv_i) data for the 56 q's. Numbering our hex digits 0-15 with 0 being the least-significant (rightmost), I expected to see some data with a nonzero hex digit 8, 9, 10, etc., but in fact the rightmost nonzero hex digit (aside from the lowest-order '1' itself) for the entire sample is digit 10, i.e. all of these data == 1 (mod 2^n) with n at least 40. Here they are, sorted in order of decreasing number of converged digits:
Code:
tmp = 0000000000000001, qinv_4 = 3B5C754866A089BF
tmp = 9F00000000000001, qinv_4 = FB6B0F723490B861
tmp = 5F00000000000001, qinv_4 = 9A16886C9827AEA1
tmp = 1F00000000000001, qinv_4 = 2EEFE419D4CEBD9F
tmp = 5F00000000000001, qinv_4 = 3CA5FDEAE58D755F
tmp = 1F00000000000001, qinv_4 = 3E9A7341A241A261
tmp = DF00000000000001, qinv_4 = 264AFA0A625ABD61
tmp = 1F00000000000001, qinv_4 = 497A3304C7F29661
tmp = 3F00000000000001, qinv_4 = 4A660935C0BA841F
tmp = 5F00000000000001, qinv_4 = 24C89FE7CD9BEAA1
tmp = BF00000000000001, qinv_4 = 7CF5F1F607693821
tmp = FF00000000000001, qinv_4 = 823739FC069F951F
tmp = D11F000000000001, qinv_4 = AD285CE732A3848F
tmp = 59BF000000000001, qinv_4 = C5C3891708E55131
tmp = D89F000000000001, qinv_4 = FAA8C0D60FB3C911
tmp = 445F000000000001, qinv_4 = 767B2DFFE7D31A0F
tmp = E69F000000000001, qinv_4 = 87EA2E1CAB43A2EF
tmp = F35F000000000001, qinv_4 = 1183F68A3B987B91
tmp = 643F000000000001, qinv_4 = 150760FDDDB351CF
tmp = 6DFF000000000001, qinv_4 = 1C6C7B1DB8D0304F
tmp = 183F000000000001, qinv_4 = BD821BC1F83A1631
tmp = 73DF000000000001, qinv_4 = 2CA2572F6FF7670F
tmp = 9B7F000000000001, qinv_4 = 9FFE8AF5AD701AB1
tmp = 7F3F000000000001, qinv_4 = D55439C3F161C431
tmp = D03F000000000001, qinv_4 = 85FD09F288958631
tmp = 963F000000000001, qinv_4 = 24AB8DD59DB4EEAF
tmp = 8FBF000000000001, qinv_4 = A51866DEFC91A9AF
tmp = 8C489F0000000001, qinv_4 = 40208AE7CF866337
tmp = 97C3FF0000000001, qinv_4 = 36C0866788262917
tmp = 83AF3F0000000001, qinv_4 = 9D00FC96481BCBD7
tmp = 9A639F0000000001, qinv_4 = B838EEB23712F7C9
tmp = 7B683F0000000001, qinv_4 = 9FB73D65EE5074C7
tmp = 77351F0000000001, qinv_4 = F19DC03216373219
tmp = 9CB07F0000000001, qinv_4 = 127C982FA9E030F9
tmp = A95C5F0000000001, qinv_4 = AD3534A88513D9D9
tmp = 65BFDF0000000001, qinv_4 = 205FEA4B68B54259
tmp = AAE03F0000000001, qinv_4 = CAE38E907F0356D7
tmp = 62CF9F0000000001, qinv_4 = A0CE919CB841DC37
tmp = AFD1DF0000000001, qinv_4 = 4170789DFE890FA7
tmp = B2E71F0000000001, qinv_4 = B482EFB8413F20B7
tmp = B5D33F0000000001, qinv_4 = 5E07DD7601B1D7D7
tmp = C1AC5F0000000001, qinv_4 = 2EE0E4B376696F89
tmp = 46299F0000000001, qinv_4 = 4C0618F01719A237
tmp = 4515FF0000000001, qinv_4 = DB86F3D0DB67E0E9
tmp = DCC67F0000000001, qinv_4 = 56736596898EFD07
tmp = 4258DF0000000001, qinv_4 = D6774A189A573959
tmp = DD8CBF0000000001, qinv_4 = D4A0F7F44C5CFBA9
tmp = 3DFD9F0000000001, qinv_4 = 1F86273DB27CAE67
tmp = 21675F0000000001, qinv_4 = FD6C928106E83D77
tmp = DDE73F0000000001, qinv_4 = 132EBAB0FD4ACC29
tmp = E27FDF0000000001, qinv_4 = CCB866840C288259
tmp = E7B97F0000000001, qinv_4 = 2BC86A6A0EE21D97
tmp = 1EDDDF0000000001, qinv_4 = AEF965D8A6E51BA7
tmp = FBBC1F0000000001, qinv_4 = CA21D65DA4A6D319
tmp = 09E51F0000000001, qinv_4 = 3614137D1AACA2B7
tmp = 88F1FF0000000001, qinv_4 = A4DE71A925DC3317
Notice the byte-width granularity of the tmp ( = q*qinv_i) data ... tmp always == 1 modulo 2^n with n a multiple of 8.
Moreover, the rightmost nonzero hex digit (again aside from the lowest-order '1' itself) always = 0xF.
There is probably a simple reason for this, but it's not evident to me yet.
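Both observations are reproducible mechanically. Here is a sketch (my reconstruction of the experiment, not ewmayer's actual code; it assumes the (3q ^ 2) starting guess discussed below and the Newtonian update qinv ← qinv*(2 - q*qinv) mod 2^64):

```python
import random

M = (1 << 64) - 1
random.seed(2012)  # any quasirandom sample of odd 64-bit q's will do

def final_tmp(q):
    """tmp = q*qinv_3 entering the 4th-and-last Newton iteration."""
    qinv = (3 * q) ^ 2              # initial guess: at least 5 good bits
    for _ in range(3):
        qinv = (qinv * (2 - q * qinv)) & M
    return (q * qinv) & M

for _ in range(56):
    q = random.randrange(1 << 63, 1 << 64) | 1
    tmp = final_tmp(q)
    if tmp == 1:                    # already fully converged
        continue
    low = (tmp - 1) & -(tmp - 1)
    n = low.bit_length() - 1        # tmp == 1 (mod 2^n), exactly
    assert n >= 40 and n % 8 == 0   # byte-width granularity
    assert (tmp >> n) & 0xF == 0xF  # rightmost nonzero hex digit is F
```

One plausible explanation for the F digit: since 1 - q*qinv_{i+1} = (1 - q*qinv_i)^2 (mod 2^64), the final tmp satisfies 1 - tmp = (odd)^8 * 2^n, and an odd eighth power is == 1 (mod 16), so the nibble just above the n converged zeros of tmp is -1 mod 16 = 0xF.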

Here is the binned-data histogram (in the form of a simple table):
Code:
tmp_4 == 1 (mod 2^n):
  n	#occurrences
-----	------------
>= 64	1
60-63	0
56-59	11
52-55	0
48-51	15
44-47	0
40-43	29
36-39	0
32-35	0
 < 32	0
Again, all of the corresponding q's are 64-bit converged, but these data seem to indicate that the convergence is slightly better than quadratic. Or, this may simply imply that the initial-guess formula is in fact producing *more* than 4 good bits. If we are getting >= 5 good bits for qinv_0, i.e. tmp starts off with 5 converged bits instead of 4, then on iterations 1-4 we would have q*qinv == 1 modulo 2^5, 2^10, 2^20 and 2^40. And looking now at the 1st-iteration data, yep, that's it: in every case the initial iterate qinv_0 gives tmp = q*qinv_0 which in hex form has the rightmost digit pair = ...[even digit: 0,2,4,6,8,A,C,E]1, meaning we are getting at least 5 good bits, and (in this sample) as many as 8. The 8-bit granularity in the above data is simply a result of "1 bit of precision, doubled 3 times".

I suspect now that when Peter M. noted the formula "gives 4 converged low-order bits" he may have meant "at least 4 but less than 8", which in the context of a power-of-2-bit wordsize translates to "assume 4 good bits", since even knowing we have 5 does not reduce our iteration count, except in the rare case where we have 8 good bits in qinv_0.
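For concreteness, the full iteration being described can be sketched as follows (a reconstruction under the same assumptions: the (3q ^ 2) starting guess and four Newtonian doublings):

```python
M = (1 << 64) - 1

def mod_inverse_2_64(q):
    """Return qinv with q*qinv == 1 (mod 2^64) for odd q, via Newton iteration."""
    assert q & 1, "q must be odd"
    qinv = ((3 * q) ^ 2) & M   # initial guess: at least 4 (in fact 5) good bits
    for _ in range(4):         # good bits double: 5 -> 10 -> 20 -> 40 -> 80 >= 64
        qinv = (qinv * (2 - q * qinv)) & M
    return qinv

# the worked example from the thread
q = 16357897499336320049
qinv = mod_inverse_2_64(q)
assert qinv == 9366409592816252113 == 0x81FC2BE6389FB4D1
assert (q * qinv) & M == 1
```

In C the & M masking comes for free from unsigned 64-bit wraparound, and each iteration costs just two multiplies.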

Edit: In fact my example case q = 16357897499336320049 may not be the best here, since it happens to yield not the ("new, improved, now 25% more free!") minimum of 5 bits but rather 6. OTOH, it's 50/50 whether we get 5 or > 5 good bits in qinv_0, so I'm gonna stick with it, just modify the bold-highlighting in the resulting table suitably and make a comment to that effect.

Last fiddled with by ewmayer on 2012-07-24 at 23:07
Old 2012-07-24, 23:03   #36
only_human

Quote:
Originally Posted by ewmayer View Post
It is indeed not an artifact.

I tried a quasirandom sample of 56 odd 64-bit q's. The results for the "converging to 1" product q*qinv which gets subtracted from 2 in the Newtonian iteration are interesting. Here are the final-iteration (tmp = q*qinv_i) data for the 56 q's. Numbering our hex digits 0-15 with 0 being the least-significant (rightmost), I expected to see some data with a nonzero hex digit 8, 9, 10, etc., but in fact the rightmost nonzero hex digit (aside from the lowest-order '1' itself) for the entire sample is digit 10, i.e. all of these data == 1 (mod 2^n) with n at least 40.
[snip]
In every case the initial iterate qinv_0 gives tmp = q*qinv_0 which in hex form has the rightmost digit pair = ...[even digit: 0,2,4,6,8,A,C,E]1, meaning we are getting at least 5 good bits, and (in this sample) as many as 8. The 8-bit granularity in the above data is simply a result of "1 bit of precision, doubled 3 times".

I suspect now that when Peter M. noted the formula "gives 4 converged low-order bits" he may have meant "at least 4 but less than 8", which in the context of a power-of-2-bit wordsize translates to "assume 4 good bits", since even knowing we have 5 does not reduce our iteration count, except in the rare case where we have 8 good bits in qinv_0.
Well done. Possibly worth mentioning as an aside, because it could serve as an error check, and possibly more useful still when adapting the routine to other kinds of hardware or inexact representations. I notice that the nibble to the left of the 'F' is always odd, so there are five 1 bits there. That might be a clue too.

Edit:
Quote:
Edit: In fact my example case q = 16357897499336320049 may not be the best here, since it happens to yield not the ("new, improved, now 25% more free!") minimum of 5 bits but rather 6. OTOH, it's 50/50 whether we get 5 or > 5 good bits in qinv_0, so I'm gonna stick with it, just modify the bold-highlighting in the resulting table suitably and make a comment to that effect.
Well, one positive aspect of this is that there is now a really good justification for showing some of the data in hex: it made these nuances much clearer.

Last fiddled with by only_human on 2012-07-24 at 23:51 Reason: deleted "converges faster" talk
Old 2012-07-25, 00:08   #37
only_human

Quote:
Originally Posted by ewmayer View Post
Here are the actual 64-bit data for "Algorithm: 64-bit mod-inverse computation" on the same numerical example, using your bf-highlight on the converged hex digits:
Code:
i	b	tmp			qinv_i
1	8	F3D66805FF3D2BC1	4141B6714938A4D1
2	16	B0955EB99F05F001	656E0C4A279FB4D1
3	32	0A62BFBCBF000001	6EAB2BE6389FB4D1
4	64	E97F000000000001	81FC2BE6389FB4D1
yielding the desired 64-bit mod inverse qinv, which in decimal form = 9366409592816252113. Were we to do one more iteration, we would observe that the product q*qinv would have its 64-bit lower half equal to unity, hence tmp = 1 and we verify that qinv has reached its desired fixed point.
I was looking at this data from before, and I see that the last 3 iterations here also have the five 1 bits (odd nibble followed by an 'F' nibble) in the tmp column. I'm wondering: assuming tmp quickly does this, might this information be useful (through algebraic trickery) for composing an even better first guess for the routine's starting value?

Last fiddled with by only_human on 2012-07-25 at 00:20 Reason: note: (I am mostly talking to myself here. Please don't derail your time on my account)
Old 2012-07-25, 00:53   #38
ewmayer

Quote:
Originally Posted by only_human View Post
I was looking at this data from before and I see that the last 3 iterations here also have the five 1 bits (odd nibble followed by 'F' nibble) in the tmp column. I'm wondering if (assuming) tmp quickly does this, might this information be useful (through algebraic trickery) for composing an even better first guess for the starting value of the routine?
Possibly - I did some more data mining just now and found that the number of good bits in qinv_0 resulting from the (3q ^ 2) trick depends on q (mod 16): 1/3/5/7/9/b/d/f give >= 6/5/6/5/5/6/5/6 good bits in qinv_0, respectively. That is surely a consequence of some fairly simple bit-pattern stuff going on.

More interestingly, the mod-16 values which give "at least 5 bits" (3,7,9,13) *never* seem to give more than 5, whereas the ones which give "at least 6" (1,5,11,15) all give 6,7,8,... good bits with probability halving with each additional "lucky" bit of precision.
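This is easy to check by brute force; a sketch (again assuming qinv_0 = (3q) ^ 2 — the good_bits helper here is my own, and just counts the converged low-order bits of q*qinv_0):

```python
M = (1 << 64) - 1

def good_bits(q):
    """Exact number of converged low-order bits of qinv_0 = (3*q) ^ 2, odd q."""
    t = (q * (((3 * q) ^ 2) & M) - 1) & M   # q*qinv_0 - 1 (mod 2^64)
    return 64 if t == 0 else (t & -t).bit_length() - 1

# minimum good-bit count in each odd residue class q (mod 16)
mins = {r: 64 for r in range(1, 16, 2)}
for q in range(3, 1 << 16, 2):   # skip q = 1, whose initial guess is already exact
    mins[q % 16] = min(mins[q % 16], good_bits(q))

print(mins)  # {1: 6, 3: 5, 5: 6, 7: 5, 9: 5, 11: 6, 13: 5, 15: 6}
```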

Feel free to go nuts on this ... maybe you can write a paper on it, or something. :)

I need to wrap this part up for now so I can focus on what I'd really intended to do this week, namely the loop-folding optimizations of the remaindering and quotient algorithms for the 64-bit case.

(But one should always welcome interesting distractions - I had one such happen to me back in May, and it turned into the nascent paper. :)
Old 2012-07-25, 01:24   #39
only_human

Quote:
Originally Posted by ewmayer View Post
Feel free to go nuts on this ... maybe you can write a paper on it, or something. :)
I might just do that. This was fun for me while it lasted. Thank you.
Old 2012-07-25, 06:57   #40
Random Poster

Quote:
Originally Posted by ewmayer View Post
Possibly - I did some more data mining just now and found that the number of good bits in qinv_0 resulting from the (3q ^ 2) trick depends on q (mod 16): 1/3/5/7/9/b/d/f give >= 6/5/6/5/5/6/5/6 good bits in qinv_0, respectively. That is surely a consequence of some fairly simple bit-pattern stuff going on.

More interestingly, the mod-16 values which give "at least 5 bits" (3,7,9,13) *never* seem to give more than 5, whereas the ones which give "at least 6" (1,5,11,15) all give 6,7,8,... good bits with probability halving with each additional "lucky" bit of precision.
There seems to be an interesting pattern here:
Code:
q (mod 16)      q^2 (mod 32)    bits
0001 or 1111    00001           >=6
0011 or 1101    01001           5
0101 or 1011    11001           >=6
0111 or 1001    10001           5
So the number of good bits depends on whether the leftmost two bits of q^2 (mod 32) are the same or different. I guess that more than 6 bits corresponds to the leftmost bit being repeated more than twice in q^2 (3 times for 7 bits, 4 times for 8 bits, and so on).
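The ">= 6 exactly when the two leftmost bits agree" part checks out empirically; a sketch (good_bits, as earlier in the thread, is a hypothetical helper counting converged low-order bits of the (3q ^ 2) guess):

```python
M = (1 << 64) - 1

def good_bits(q):
    """Exact number of converged low-order bits of qinv_0 = (3*q) ^ 2, odd q."""
    t = (q * (((3 * q) ^ 2) & M) - 1) & M
    return 64 if t == 0 else (t & -t).bit_length() - 1

# >= 6 good bits exactly when the two leftmost bits of q^2 (mod 32) agree
for q in range(3, 1 << 14, 2):
    s = (q * q) % 32
    assert (good_bits(q) >= 6) == (((s >> 4) & 1) == ((s >> 3) & 1))
```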
Old 2012-07-25, 17:19   #41
only_human

Nothing to see here; I just haven't spawned a miscellaneous math thread yet. This is about those mysterious 1s (the 'F' nibble) in tmp:
Code:
i	b	tmp			qinv_i
1	8	F3D66805FF3D2BC1	4141B6714938A4D1
2	16	B0955EB99F05F001	656E0C4A279FB4D1
3	32	0A62BFBCBF000001	6EAB2BE6389FB4D1
4	64	E97F000000000001	81FC2BE6389FB4D1
I think the 'F' nibble is likely a boring string of 1 bits that gets longer each iteration as some artifact of the computation. Here are the above tmp values in binary:
Code:
tmp = 111100111101011001101000000001011111111100111101001010 1111 000001
tmp = 10110000100101010101111010111001100111110000010 11111 000000000001
tmp = 0000101001100010101111111011110010 111111 000000000000000000000001
tmp = 111010010 1111111 0000000000000000000000000000000000000001
In these iterations, the runs of 1s are successively 4, 5, 6 and 7 bits long.
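The lengthening run is mechanical to confirm (a reconstruction of the iteration for the same example q, counting the run of 1 bits sitting just above the converged zeros in each tmp):

```python
M = (1 << 64) - 1
q = 16357897499336320049

qinv = (3 * q) ^ 2                  # initial guess: 6 good bits for this q
runs = []
for i in range(4):
    tmp = (q * qinv) & M            # the tmp column of the table above
    qinv = (qinv * (2 - tmp)) & M
    t = tmp >> 1
    zeros = (t & -t).bit_length() - 1            # converged zeros above the low 1
    t >>= zeros
    runs.append((t ^ (t + 1)).bit_length() - 1)  # length of the run of 1s

print(runs)  # [4, 5, 6, 7]
```

Since 1 - q*qinv_{i+1} = (1 - q*qinv_i)^2 (mod 2^64), these runs are the trailing-bit patterns of successive squares of a single odd number, which for this q grow by exactly one bit per iteration.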

Last fiddled with by only_human on 2012-07-25 at 17:30
Old 2012-07-27, 02:56   #42
ewmayer

FYI, I have finished writing up the loop-folding optimization stuff for the remaindering algorithm(s), and worked out the analogous procedure for the quotient computation: That is similar, but with a small twist. Both foldings make use of the same intermediate partial-vector remainders; the whole thing is really sweet. Just need to do some more coding of the quotient stuff and timings, but I expect very similar (perhaps identical) timings for this as for the remaindering, since the loop bodies are so similar. Thus things are pointing to being able to do a full (= remainder and quotient) large-vector DIV in ~12 cycles per 64-bit word.

I intend to post the updated PDF sometime over the weekend, would like to submit to Math. Comp. early next week.
Old 2012-07-31, 02:51   #43
Xyzzy

.
Attached Files
File Type: pdf drivel v2.pdf (383.6 KB, 380 views)
Old 2012-07-31, 02:57   #44
ewmayer

Quote:
Originally Posted by ewmayer View Post
I intend to post the updated PDF sometime over the weekend, would like to submit to Math. Comp. early next week.
Revised version in above post - thanks, Mike.

Main reason for the delay is that I decided to get ambitious and fill some available real estate in 2 of the tables with ASM samples, and formatting those is a pain.

Main changes:

o example of 64-bit qinv iteration;

o loop-folding stuff for remainder and quotient computations;

o comparison with GMP n/1 div algo timings on the Core 2 (reference added).

Edit: Already found 1 misspelling which my editor missed (although when I right-click on the word it suggests the correct spelling): 'appraoch' (p.26, 4th line from bottom) should be 'approach'.

Last fiddled with by ewmayer on 2012-07-31 at 03:05