mersenneforum.org 128bit mulmod request
 Register FAQ Search Today's Posts Mark Forums Read

 2018-08-19, 15:16 #1 ATH Einyen     Dec 2003 Denmark 54758 Posts 128bit mulmod request I was wondering if those experienced in assembly programming could easily and quickly rewrite the standard 64bit mulmod into a 128bit mulmod using AVX/AVX2 instructions or if it is a big task? Code: uint64_t mulmod(uint64_t a, uint64_t b, uint64_t c) { uint64_t d; // to hold the result of a*b mod c // calculates a*b mod c, stores result in d asm ("mov %1, %%rax;" // put a into rax "mul %2;" // mul a*b -> rdx:rax "div %3;" // (a*b)/c -> quot in rax remainder in rdx "mov %%rdx, %0;" // store result in d :"=r"(d) // output :"r"(a), "r"(b), "r"(c) // input :"%rax", "%rdx" // clobbered registers ); return d; } either using the __uint128_t variable or just using high and low 64 bit for each variable: __uint128_t mulmod(__uint128_t a, __uint128_t b, __uint128_t c) or __uint128_t mulmod(uint64_t ah, uint64_t al, uint64_t bh, uint64_t bl, uint64_t ch, uint64_t cl)
2018-08-19, 16:21   #2
bsquared

"Ben"
Feb 2007

63078 Posts

Quote:
 Originally Posted by ATH I was wondering if those experienced in assembly programming could easily and quickly rewrite the standard 64bit mulmod into a 128bit mulmod using AVX/AVX2 instructions or if it is a big task?
Is this to be used in a modular exponentiation loop? If so a Montgomery multiplication is probably a better idea.

2018-08-19, 18:34   #3
ldesnogu

Jan 2008
France

10208 Posts

Quote:
 Originally Posted by ATH I was wondering if those experienced in assembly programming could easily and quickly rewrite the standard 64bit mulmod into a 128bit mulmod using AVX/AVX2 instructions or if it is a big task?
Does AVX2 have support for 128-bit integers? And for integer divides/remainders (no matter the size)?

2018-08-19, 18:51   #4
ATH
Einyen

Dec 2003
Denmark

54758 Posts

Quote:
 Originally Posted by ldesnogu Does AVX2 have support for 128-bit integers? And for integer divides/remainders (no matter the size)?
I'm not sure, I found 128bit and 256bit vectors made of 4 or 8 32 bit integers, but I did not find 128 or 256 bit integers. I always assumed AVX2 had 128bit and 256bit for both integers and floating point (and AVX512 had 512bit) but maybe not.

I have never learned assembly programming so this would take me a very long time starting from scratch. I would actually like to learn assembly, but I'm not sure I'm up for it right now, so this thread is just to check if anyone already was so well versed in assembly they could do this in like 60 min.

I will look into Montgomery multiplication.

Last fiddled with by ATH on 2018-08-19 at 18:55

 2018-08-19, 18:56 #5 ldesnogu     Jan 2008 France 24·3·11 Posts If I remember correctly 64x64 -> 128 mul is coming with one of the numerous AVX-512 variants. I don't know a lot about x86 and I feel good that way
 2018-08-19, 20:37 #6 jasonp Tribal Bullet     Oct 2004 DC716 Posts No vector x86 variant has vectorized division, and only one corner of AVX-512 can do vector 64-bit multiplies (with 64-bit low-half result). gcc does implement the uint128_t type and has function calls for 128-bit multiply and divide, but not division with 256-bit numerator. If you are really after a modular multiply then Montgomery arithmetic is by far your best bet, especially if you are always dividing by the same number. We have many experts here on Montgomery arithmetic; sample code that constructs 64-bit modular multiplies for a machine (Nvidia GPUs) without 64-bit operations can be found here. For each 64-bit n, you computed w = montmul_w(bottom word of n) and r = montmul_r(n). Then if a and b are in Montgomery form, you produce a*b mod n in Montgomery form by calling montmul64(a, b, n, w). To convert a number a into Montgomery form, compute montmul64(a, r, n, w) and to convert out compute montmul(a, 1, n, w). There are faster ways to do these things but this way leverages a few primitives only.
 2018-08-19, 21:08 #7 pinhodecarlos     "Carlos Pinho" Oct 2011 Milton Keynes, UK 24·172 Posts A bit of background: http://www.mersenneforum.org/showthr...197#post494197 Does anyone knows if Robert G had any though on this since he did the code for gap?
2018-08-20, 00:47   #8
bsquared

"Ben"
Feb 2007

CC716 Posts

Quote:
 Originally Posted by jasonp No vector x86 variant has vectorized division, and only one corner of AVX-512 can do vector 64-bit multiplies (with 64-bit low-half result).
Just as crucially, none of the x86 vector extensions implements add with carry, except the one-and-done Knight's Corner instruction set. Emulating those ends up being almost as bad as the lack of 64x64 multiplies.

One of the recent additions to yafu's sieve of Eratosthenes was a vectorized Montgomery multiply (8 modmuls in parallel). But it is limited to 32-bit reductions as of now.

Last fiddled with by bsquared on 2018-08-20 at 00:47

2018-08-20, 01:16   #9
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

20C016 Posts

Quote:
 Originally Posted by ATH I have never learned assembly programming so this would take me a very long time starting from scratch. I would actually like to learn assembly, but I'm not sure I'm up for it right now, so this thread is just to check if anyone already was so well versed in assembly they could do this in like 60 min.
I haven't used it in a while, and was bad when I tried, but the basics of loops is that they are implementable as if statements. Similar to how in PARI/GP you can do:
Code:
loop()=if(x<=1000,print(x);x++;eval(self())
and then define a value of x, and call loop() and print ( or other things) in a loop you make by having the if statement call itself. yes you can also just call it by name similar to assembler. There of course many loops already now.

Last fiddled with by science_man_88 on 2018-08-20 at 01:16

 2018-08-20, 05:34 #10 ldesnogu     Jan 2008 France 24·3·11 Posts Found this: http://www.acsel-lab.com/arithmetic/...a/1616a032.pdf That's off-topic as it discusses AVX-512 IFMA (integer FMA) implementation of 1024-bit multiplication. IFMA adds 52x52 multiply + add with instructions to get low and high parts of the result.
2018-08-20, 06:13   #11
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

23×32×19 Posts

Quote:
 Originally Posted by pinhodecarlos A bit of background: http://www.mersenneforum.org/showthr...197#post494197 Does anyone knows if Robert G had any though on this since he did the code for gap?
Don't have time for this, but for segmented sieve you would need to extend it from 32 bits to 64 bits, so at least actually this part isn't that tricky.

 Similar Threads Thread Thread Starter Forum Replies Last Post Uncwilly Software 5 2018-06-04 16:52 pinhodecarlos Lounge 3 2017-10-26 18:58 ATH Programming 4 2017-06-08 22:22 Dubslow YAFU 4 2012-03-31 03:07 Xyzzy Lounge 23 2011-03-08 17:50

All times are UTC. The time now is 14:52.

Tue Aug 4 14:52:44 UTC 2020 up 18 days, 10:39, 0 users, load averages: 1.61, 1.38, 1.44