mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   64x64 integer multiplication GCC (https://www.mersenneforum.org/showthread.php?t=25762)

 diep 2020-07-22 06:30

64x64 integer multiplication GCC

Good Morning!

I'm looking for a fast way under GCC to multiply two unsigned 64-bit integers and obtain both the high and the low 64-bit halves of the result.

I was busy reviving CUDA sieving code I wrote in 2016, and the oldie GCC code I have to verify the results on the CPU is duck slow.

Under Windows it's fast enough, as one has intrinsics:

reslo = _umul128(x,y,&reshi);

However, under Linux, where my GPU runs, I have only something duck slow for GCC.

Right now under GCC I'm using the following duckslow thing (typed over by hand):

[CODE]#include <stdint.h>

uint64_t mul_limbs(uint64_t *reslo, uint64_t x, uint64_t y) {
    __uint128_t prod = (__uint128_t)x * y;  /* full 128-bit product */
    *reslo = (uint64_t)prod;                /* low 64 bits */
    return (uint64_t)(prod >> 64);          /* high 64 bits */
}[/CODE]

This is duckslow - anyone know a faster solution? I just need to multiply a single x * y for the BSGS algorithm verification.
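(Editor's note: if __uint128_t isn't available, or to cross-check it, the high half can also be computed portably from 32-bit partial products. A sketch, not from the original post:)

```c
#include <stdint.h>

/* Portable 64x64 -> 128 multiply via 32-bit partial products.
   Returns the high 64 bits; stores the low 64 bits in *reslo. */
uint64_t mul_limbs_portable(uint64_t *reslo, uint64_t x, uint64_t y) {
    uint64_t xl = x & 0xffffffffu, xh = x >> 32;
    uint64_t yl = y & 0xffffffffu, yh = y >> 32;

    uint64_t ll = xl * yl;
    uint64_t lh = xl * yh;
    uint64_t hl = xh * yl;
    uint64_t hh = xh * yh;

    /* sum the middle partial products into the low word, tracking the carry */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    *reslo = (mid << 32) | (ll & 0xffffffffu);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```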

Kind Regards,
Vincent

 bsquared 2020-07-22 13:25

Try:
[CODE]__inline uint64_t mul64(uint64_t x, uint64_t y, uint64_t* hi) {
__asm__(
"mulq %3 \n\t"
: "=&a"(x), "=&d"(y)
: "0"(x), "1"(y)
: "cc"
);

*hi = y;
return x;
}[/CODE]

If you have a processor with mulx then this might be slightly faster:

[CODE]__inline uint64_t mulx64(uint64_t x, uint64_t y, uint64_t* hi) {
__asm__(
"mulx %3, %0, %1 \n\t"
: "=&d"(x), "=&a"(y)
: "0"(x), "1"(y)
);

*hi = y;
return x;
}
[/CODE]
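(Editor's note: as a quick sanity check of such inline-asm wrappers against the __uint128_t reference, one might compare the two on a few values. A sketch, x86-64 only, using the GMP-style mulq constraints rather than bsquared's exact code:)

```c
#include <stdint.h>

/* GMP-style mulq wrapper: returns the low 64 bits, stores the high in *hi.
   x86-64 only. */
static inline uint64_t mul64_asm(uint64_t x, uint64_t y, uint64_t *hi) {
    uint64_t lo, h;
    __asm__("mulq %3" : "=a"(lo), "=d"(h) : "%0"(x), "rm"(y) : "cc");
    *hi = h;
    return lo;
}

/* Reference implementation via __uint128_t. */
static inline uint64_t mul64_ref(uint64_t x, uint64_t y, uint64_t *hi) {
    __uint128_t p = (__uint128_t)x * y;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```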

 jasonp 2020-07-22 14:14

As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory:
[code]
#define umul_ppmm(w1, w0, u, v) \
__asm__ ("mulq %3" \
: "=a" (w0), "=d" (w1) \
: "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
[/code]

 bsquared 2020-07-22 14:56

[QUOTE=jasonp;551283]As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory:
[code]
#define umul_ppmm(w1, w0, u, v) \
__asm__ ("mulq %3" \
: "=a" (w0), "=d" (w1) \
: "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
[/code][/QUOTE]

Yeah, macros may integrate better into the rest of the code.

However, for this simple benchmark loop:

[CODE]for (i = 0; i < 1000000000; i++)
{
c = mulx64(a, b, &d);
a = c;
b = d;
}[/CODE]

all of these are the same using gcc 7.3.0 (including diep's original code)

[CODE]
mulx took 0.2715 sec
mul took 0.2714 sec
mul_limbs took 0.2715 sec
umul_ppmm took 0.2714 sec
[/CODE]

 kruoli 2020-07-22 15:08

Is GCC optimizing it to the same target assembly code?

 diep 2020-07-22 15:22

I wrote that mul_limbs code like 15 years ago. Maybe GCC got a lot more clever lately :)

Back then, you can bet I cried out loud like a wounded bear about how duckslow GCC was at optimizing it in my NTT code, versus the oldie Visual C++ compiler (not so old back then), which made fast clean code with its intrinsics.

I don't know which GCC version gets called by Nvidia's CUDA compiler driver.

nvcc --version gives 'built on sun_sep__4_22:14:01_cdt_2016'
gcc --version gives 4.9.2

Let's see whether NVCC knows the -S option to print assembler.

If NVCC is not clever enough but this GCC version is, I can still consider making a library out of the math functions and simply linking it.

 bsquared 2020-07-22 15:48

[QUOTE=kruoli;551286]Is GCC optimizing it to the same target assembly code?[/QUOTE]

As far as I can see from objdump, yes (except the one usage of mulx).

 ewmayer 2020-07-22 23:04

[QUOTE=bsquared;551285]Yeah, macros may integrate better into the rest of the code.

However, for this simple benchmark loop:

[CODE]for (i = 0; i < 1000000000; i++)
{
c = mulx64(a, b, &d);
a = c;
b = d;
}[/CODE]

all of these are the same using gcc 7.3.0 (including diep's original code)

[CODE]
mulx took 0.2715 sec
mul took 0.2714 sec
mul_limbs took 0.2715 sec
umul_ppmm took 0.2714 sec
[/CODE][/QUOTE]

That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?
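(Editor's note: a throughput-oriented loop along those lines might keep four independent products in flight so the out-of-order core can overlap multiply latency. A sketch using the __uint128_t version, not from ewmayer's post:)

```c
#include <stdint.h>

/* Four independent multiply chains per iteration; the chains never mix,
   so up to four 64x64 multiplies can be in flight at once. */
void mul_bench_4way(uint64_t a[4], uint64_t b[4], uint64_t iters) {
    for (uint64_t i = 0; i < iters; i++) {
        for (int j = 0; j < 4; j++) {
            __uint128_t p = (__uint128_t)a[j] * b[j];
            a[j] = (uint64_t)p;              /* low half feeds the chain */
            b[j] = (uint64_t)(p >> 64) | 1;  /* keep the multiplier nonzero */
        }
    }
}
```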

 bsquared 2020-07-23 02:42

[QUOTE=ewmayer;551312]That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?[/QUOTE]

Agreed, it's not a good benchmark. The OP asked for a way to multiply 64-bit unsigned ints and he now has several things to try. I assumed he will use them in his application to figure out which is best... the loop was just a first cut to see if there were glaring differences.

 retina 2020-07-23 08:24

I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it in the worst possible syntax: AT&T. :yucky::poop:

:razz:

 xilman 2020-07-23 08:34

[QUOTE=retina;551347]I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it in the worst possible syntax: AT&T. :yucky::poop: