![]() |
![]() |
#1 |
Sep 2006
The Netherlands
11000011102 Posts |
![]()
Good Morning!
Looking for way of multiplying fast in GCC 64x 64 bits unsigned 64 bits integers and obtaining both highres as well as lowres result. Was busy reviving CUDA sieving code i wrote in 2016 and to verify the results on the CPU my oldie code there for GCC is duck slow. Under windows it's fast enough there as one has intrinsics: reslo = _umul128(x,y,&reshi); However what i have for under linux where my GPU runs under i have only something duck slow for GCC. Right now under GCC i'm using the duckslow next thing (typing over by hand): uint64 mul_limbs(uint64 *reslo,uint64 x,uint64 y) { __uint128_t prod; prod = (__uint128_t) x * y; *reslo = (uint64)prod; return(prod>>64); } This duckslow - anyone know faster solution - and i just need multiply a single x * y for the BSGS algorithm verification. Kind Regards, Vincent |
![]() |
![]() |
![]() |
#2 |
"Ben"
Feb 2007
3,617 Posts |
![]()
Try:
Code:
__inline uint64_t mul64(uint64_t x, uint64_t y, uint64_t* hi) { __asm__( "mulq %3 \n\t" : "=&a"(x), "=&d"(y) : "0"(x), "1"(y) : "cc" ); *hi = y; return x; } Code:
__inline uint64_t mulx64(uint64_t x, uint64_t y, uint64_t* hi) { __asm__( "mulx %3, %0, %1 \n\t" : "=&d"(x), "=&a"(y) : "0"(x), "1"(y) ); *hi = y; return x; } Last fiddled with by bsquared on 2020-07-22 at 13:47 Reason: slight change to mul64 |
![]() |
![]() |
![]() |
#3 |
Tribal Bullet
Oct 2004
5×709 Posts |
![]()
As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory:
Code:
#define umul_ppmm(w1, w0, u, v) \ __asm__ ("mulq %3" \ : "=a" (w0), "=d" (w1) \ : "%0" ((UDItype)(u)), "rm" ((UDItype)(v))) |
![]() |
![]() |
![]() |
#4 | |
"Ben"
Feb 2007
1110001000012 Posts |
![]() Quote:
However, for this simple benchmark loop: Code:
for (i = 0; i < 1000000000; i++) { c = mulx64(a, b, &d); a = c; b = d; } Code:
mulx took 0.2715 sec mul took 0.2714 sec mul_limbs took 0.2715 sec umul_ppmm took 0.2714 sec |
|
![]() |
![]() |
![]() |
#5 |
"Oliver"
Sep 2017
Porta Westfalica, DE
977 Posts |
![]()
Is GCC optimizing it to the same target assembly code?
Last fiddled with by kruoli on 2020-07-22 at 15:08 Reason: Spelling. |
![]() |
![]() |
![]() |
#6 |
Sep 2006
The Netherlands
2·17·23 Posts |
![]()
i wrote that mul_limbs code like 15 years ago. maybe GCC got a lot more clever lately :)
Back then you can bet i cried it out loud like a wounded Bear that GCC was duckslow optimizing it back then in my NTT code versus oldie visual c++ compiler making fast clean code with its intrinsics (back then not so old). I don't know which GCC version gets called by Nvidia's Cuda compiler driver. nvcc --version gives 'built on sun_sep__4_22:14:01_cdt_2016 gcc --version gives 4.9.2 Let's see whether NVCC knows the -S option to print assembler. If NVCC is not clever enough and this GCC version is - i can still consider making a library out of the math functions and just simply link it. |
![]() |
![]() |
![]() |
#7 |
"Ben"
Feb 2007
70418 Posts |
![]() |
![]() |
![]() |
![]() |
#8 | |
∂2ω=0
Sep 2002
Repรบblica de California
101101110011102 Posts |
![]() Quote:
Last fiddled with by ewmayer on 2020-07-22 at 23:06 |
|
![]() |
![]() |
![]() |
#9 |
"Ben"
Feb 2007
70418 Posts |
![]()
Agreed, it's not a good benchmark. The OP asked for a way to multiply 64-bit unsigned ints and he now has several things to try. I assumed he will use them in his application to figure out which is best... the loop was just a first cut to see if there were glaring differences.
|
![]() |
![]() |
![]() |
#10 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
144618 Posts |
![]()
I am still amazed at how much C code it takes just to coax it into using a single native instruction.
You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it the worst possible syntax: AT&T. ![]() ![]() ![]() |
![]() |
![]() |
![]() |
#11 | |
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
22×3×941 Posts |
![]() Quote:
C is an old language designed to be efficient on a computer, the PDP-11, which was discontinued years before its manufacturer, DEC, went out of business --- itself many years ago. AFAICT essentially all other languages (assembly excepted of course) make it difficult to exploit all the instructions provided by any CISC architecture. |
|
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Even faster integer multiplication | paulunderwood | Computer Science & Computational Number Theory | 17 | 2020-05-21 19:51 |
multiplication and logarithm | bhelmes | Math | 4 | 2016-10-06 13:33 |
k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 | jasong | Miscellaneous Math | 5 | 2016-04-24 03:40 |
New Multiplication Algorithm | vector | Miscellaneous Math | 10 | 2007-12-20 18:16 |
Multiplication Tendency | clowns789 | Miscellaneous Math | 5 | 2005-03-11 00:23 |