Thanks for the sample code, Jason. It's nice to see vendor-provided intrinsics files becoming more and more common: less futzing around with ASM macros, which is especially useful in the early stages of code development.

I'm working out how all this can be used to speed up a modmul for moduli larger than 64 bits. Using pairs of 50-bit floats one could handle moduli of 100 bits (101 bits using a balanced-digit representation for the LSW of each pair), which would be a nice alternative (or complement) to, say, a 96- or 128-bit pure-integer modmul. In that context the possibly negative lower-half outputs you mention are not a problem - unless one considers getting the digit-balancing for free a problem, that is.
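
To make the balanced-digit idea concrete, here is a minimal sketch in Python, using exact integers to stand in for the float pairs. The names split_balanced and join are hypothetical, and the choice of a 2^51 base for the low word is my own reading of where the extra bit comes from (a signed low word of magnitude at most 2^50, plus an unsigned high word below 2^50); it is not taken from the sample code.

```python
def split_balanced(x, bits=51):
    """Split x into (hi, lo) with x == hi * 2**bits + lo.

    lo is a balanced digit in [-2**(bits-1), 2**(bits-1)), so it may be
    negative. The 51-bit base is an assumption made for illustration.
    """
    base = 1 << bits
    half = base >> 1
    # Shift into [0, base), reduce, then shift back to get a signed digit.
    lo = ((x + half) % base) - half
    # x - lo is an exact multiple of base, so the shift loses nothing.
    hi = (x - lo) >> bits
    return hi, lo


def join(hi, lo, bits=51):
    """Recombine a (hi, lo) pair into the integer it represents."""
    return (hi << bits) + lo


if __name__ == "__main__":
    # Largest value whose high word still fits below 2**50.
    x_max = 2**101 - 2**50 - 1
    for x in (0, 2**51 - 1, 2**50, x_max):
        hi, lo = split_balanced(x)
        print(x, "->", (hi, lo))
```

For example, split_balanced(2**51 - 1) gives (1, -1): the low word goes negative and the high word absorbs the carry. Both words stay within a magnitude of 2^50, so each one is exactly representable in a double.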

