mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2020-07-22, 06:30   #1
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

2·3·113 Posts
Default 64x64 integer multiplication GCC

Good Morning!

Looking for way of multiplying fast in GCC 64x 64 bits unsigned 64 bits integers and obtaining both highres as well as lowres result.

Was busy reviving CUDA sieving code i wrote in 2016 and to verify the results on the CPU my oldie code there for GCC is duck slow.

Under windows it's fast enough there as one has intrinsics:

reslo = _umul128(x,y,&reshi);

However what i have for under linux where my GPU runs under i have only something duck slow for GCC.

Right now under GCC i'm using the duckslow next thing (typing over by hand):

uint64 mul_limbs(uint64 *reslo,uint64 x,uint64 y) {
__uint128_t prod;
prod = (__uint128_t) x * y;
*reslo = (uint64)prod;
return(prod>>64);
}

This duckslow - anyone know faster solution - and i just need multiply a single x * y for the BSGS algorithm verification.

Kind Regards,
Vincent
diep is offline   Reply With Quote
Old 2020-07-22, 13:25   #2
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2×17×97 Posts
Default

Try:
Code:
__inline uint64_t mul64(uint64_t x, uint64_t y, uint64_t* hi) {
    __asm__(
        "mulq %3	\n\t"
        : "=&a"(x), "=&d"(y)
        : "0"(x), "1"(y)
        : "cc"
    );

    *hi = y;
    return x;
}
If you have a processor with mulx then this might be slightly faster:

Code:
__inline uint64_t mulx64(uint64_t x, uint64_t y, uint64_t* hi) {
    __asm__(
        "mulx %3, %0, %1	\n\t"
        : "=&d"(x), "=&a"(y)
        : "0"(x), "1"(y)
    );

    *hi = y;
    return x;
}

Last fiddled with by bsquared on 2020-07-22 at 13:47 Reason: slight change to mul64
bsquared is online now   Reply With Quote
Old 2020-07-22, 14:14   #3
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3·1,163 Posts
Default

As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory:
Code:
 	#define umul_ppmm(w1, w0, u, v) \
 	  __asm__ ("mulq %3"                                                    \
 	           : "=a" (w0), "=d" (w1)                                       \
 	           : "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
jasonp is offline   Reply With Quote
Old 2020-07-22, 14:56   #4
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

329810 Posts
Default

Quote:
Originally Posted by jasonp View Post
As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory:
Code:
 	#define umul_ppmm(w1, w0, u, v) \
 	  __asm__ ("mulq %3"                                                    \
 	           : "=a" (w0), "=d" (w1)                                       \
 	           : "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
Yeah, macros may integrate better into the rest of the code.

However, for this simple benchmark loop:

Code:
for (i = 0; i < 1000000000; i++)
{
    c = mulx64(a, b, &d);
    a = c;
    b = d;
}
all of these are the same using gcc 7.3.0 (including diep's original code)

Code:
mulx took 0.2715 sec
mul took 0.2714 sec
mul_limbs took 0.2715 sec
umul_ppmm took 0.2714 sec
bsquared is online now   Reply With Quote
Old 2020-07-22, 15:08   #5
kruoli
 
kruoli's Avatar
 
"Oliver"
Sep 2017
Porta Westfalica, DE

23×41 Posts
Default

Is GCC optimizing it to the same target assembly code?

Last fiddled with by kruoli on 2020-07-22 at 15:08 Reason: Spelling.
kruoli is offline   Reply With Quote
Old 2020-07-22, 15:22   #6
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

2·3·113 Posts
Default

i wrote that mul_limbs code like 15 years ago. maybe GCC got a lot more clever lately :)

Back then you can bet i cried it out loud like a wounded Bear that GCC was duckslow optimizing it back then in my NTT code versus oldie visual c++ compiler making fast clean code with its intrinsics (back then not so old).

I don't know which GCC version gets called by Nvidia's Cuda compiler driver.

nvcc --version gives 'built on sun_sep__4_22:14:01_cdt_2016
gcc --version gives 4.9.2

Let's see whether NVCC knows the -S option to print assembler.

If NVCC is not clever enough and this GCC version is - i can still consider making a library out of the math functions and just simply link it.
diep is offline   Reply With Quote
Old 2020-07-22, 15:48   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

1100111000102 Posts
Default

Quote:
Originally Posted by kruoli View Post
Is GCC optimizing it to the same target assembly code?
As far as I can see from objdump, yes (except the one usage of mulx).
bsquared is online now   Reply With Quote
Old 2020-07-22, 23:04   #8
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by bsquared View Post
Yeah, macros may integrate better into the rest of the code.

However, for this simple benchmark loop:

Code:
for (i = 0; i < 1000000000; i++)
{
    c = mulx64(a, b, &d);
    a = c;
    b = d;
}
all of these are the same using gcc 7.3.0 (including diep's original code)

Code:
mulx took 0.2715 sec
mul took 0.2714 sec
mul_limbs took 0.2715 sec
umul_ppmm took 0.2714 sec
That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?

Last fiddled with by ewmayer on 2020-07-22 at 23:06
ewmayer is offline   Reply With Quote
Old 2020-07-23, 02:42   #9
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

2·17·97 Posts
Default

Quote:
Originally Posted by ewmayer View Post
That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?
Agreed, it's not a good benchmark. The OP asked for a way to multiply 64-bit unsigned ints and he now has several things to try. I assumed he will use them in his application to figure out which is best... the loop was just a first cut to see if there were glaring differences.
bsquared is online now   Reply With Quote
Old 2020-07-23, 08:24   #10
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

2·3·5·193 Posts
Default

I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it the worst possible syntax: AT&T.

retina is offline   Reply With Quote
Old 2020-07-23, 08:34   #11
xilman
Bamboozled!
 
xilman's Avatar
 
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

278416 Posts
Default

Quote:
Originally Posted by retina View Post
I am still amazed at how much C code it takes just to coax it into using a single native instruction.

You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it the worst possible syntax: AT&T.

Too bad.

C is an old language designed to be efficient on a computer, the PDP-11, which was discontinued years before its manufacturer, DEC, went out of business --- itself many years ago.

AFAICT essentially all other languages (assembly excepted of course) make it difficult to exploit all the instructions provided by any CISC architecture.
xilman is online now   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Even faster integer multiplication paulunderwood Computer Science & Computational Number Theory 17 2020-05-21 19:51
multiplication and logarithm bhelmes Math 4 2016-10-06 13:33
k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 jasong Miscellaneous Math 5 2016-04-24 03:40
New Multiplication Algorithm vector Miscellaneous Math 10 2007-12-20 18:16
Multiplication Tendency clowns789 Miscellaneous Math 5 2005-03-11 00:23

All times are UTC. The time now is 20:47.

Thu Oct 22 20:47:55 UTC 2020 up 42 days, 17:58, 0 users, load averages: 2.19, 1.88, 1.86

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.