mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   GNU ASM -> CUDA _device_ (https://www.mersenneforum.org/showthread.php?t=19027)

ET_ 2013-12-22 19:11

GNU ASM -> CUDA _device_
 
Hello all, I have a question...

Thanks to Serge Batalov, I started working on Double Mersennes > MM33 with a multi-k siever.
Thanks to Ken Brazier, I had some hints on how to make my siever faster.
Thanks to myself, now the siever runs 400% faster, sieving 400G prime candidates on 2,000,000 k per day. :smile:

I am now in the way of translating my multiple siever into a CUDA application.

One of the functions I invoke (a mulmod) has been kindly offered by Ken Brazier, and is written as inline GCC assembly language.

This part of the code should run as __device__ code on the GPU.

Now, the question:

Is there a way to translate the code (no more than 8-9 instructions) into something like a __device__ CUDA asm? :help:

Thank you for helping me.

Luigi

P.S. Should I transfer the question into the Programming subforum?

frmky 2013-12-22 19:38

[QUOTE=ET_;362673]

Is there a way to translate the code (no more than 8-9 instructions) into something like a __device__ CUDA asm? :help:[/QUOTE]

CUDA supports inline asm. You just need to translate the current asm to PTX instructions. See [url]http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html[/url].

ET_ 2013-12-22 20:09

[QUOTE=frmky;362675]CUDA supports inline asm. You just need to translate the current asm to PTX instructions. See [url]http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html[/url].[/QUOTE]

Thanks Greg! :bow:

My problem is that I have a "divq" and "mulq" instructions in the GCC asm, while the manual (ptx_isa_2.0.pdf) doesn't seem to recognize them.

That should mean that I must translate the semantic of "divq" and "mulq" into something that the compiler understands, like mul.hi.u64 || mul.lo.u64 || mul.wide.u64 and div.u64.

The sad thing is that I don't have a manual of the GCC asm syntax either :sad:, any pointers to it / volunteers that like to explain those few lines to me in PM?

Luigi

jasonp 2013-12-22 20:25

If you have to synthesize a 64-bit div and mod from 32-bit divisions, you can use only multiplies if the modulus is assumed fixed for several mulmod operations, but if not then you have to use a multiple-precision divide. See the functions mp_modmul_2 and mp_mod_2 [url="http://sourceforge.net/p/msieve/code/HEAD/tree/trunk/common/mp.c"]here[/url]

frmky 2013-12-22 20:57

[QUOTE=ET_;362679]That should mean that I must translate the semantic of "divq" and "mulq" into something that the compiler understands, like mul.hi.u64 || mul.lo.u64 || mul.wide.u64 and div.u64.[/QUOTE]
mulq can be substituted with the two instructions mul.hi.u64 and mul.lo.u64 to get the full 128-bit product. divq, which is a 128-bit by 64-bit divide, doesn't have a corresponding instruction in PTX. div.u64 only does a 64-bit by 64-bit divide, but there are tricks as Jason mentioned to get around that.

ET_ 2013-12-22 21:16

Thank you Jason and Greg.

You cut my problem into 2 slices, and the first one (mulq) is resolved.

All it takes to resolve the second half of the question is studying the 150 lines of mp_modmul_2 and mp_mod_2 (a matter of shifting and reordering low and high portions of the operand).

Luigi :et_:

jasonp 2013-12-23 01:14

Unfortunately that code will have terrible performance on a GPU, so I would only use it to bootstrap the one division needed for the precomputation method if you could possibly get away with it. With the generalized inverse in place that that method needs, a modmul is 3 multiplies plus a few shifts and a conditional subtract, which would be 100x faster than pushing all the mod code into CUDA. In fact, if the inverse will be reused enough you can implement the division via bit-at-a-time code.

This is from an experiment in the GMP-ECM source; SP_TYPE_BITS is the width of a word, SP_NUMB_BITS is the number of bits in the modulus (assumed fixed):
[code]
sp_t sp_reciprocal(sp_t p)
{
/* integer reciprocal */

#if SP_NUMB_BITS <= SP_TYPE_BITS - 2
mp_limb_t shift = 2 * SP_NUMB_BITS + 1;
#else
mp_limb_t shift = 2 * SP_NUMB_BITS;
#endif

#if SP_TYPE_BITS == GMP_LIMB_BITS /* use GMP functions */
mp_limb_t recip, dummy;

udiv_qrnnd (recip, dummy,
(mp_limb_t) 1 << (shift - SP_TYPE_BITS), 0, p);
return recip;

#elif SP_TYPE_BITS < GMP_LIMB_BITS /* ordinary division */

return ((mp_limb_t)1 << shift) / p;

#else /* worst case: bit-at-a-time */

sp_t r = (sp_t)1 << (SP_NUMB_BITS - 1);
sp_t q = 0;
mp_limb_t i;

for (i = 0; i < shift + 1 - SP_NUMB_BITS; i++)
{
q += q;
r += r;
if (r >= p)
{
r -= p;
q |= 1;
}
}
return q;

#endif
}
[/code]
For the corresponding modular multiply functions, check out sp_mul [url="https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/nttdisk/sp.h?root=ecm&content-type=text%2Fplain"]here[/url]

frmky 2013-12-23 07:00

The CUDA div function is [B][I]slow[/I][/B] but it's worthwhile to check if the code is compute or memory bound. If it's memory bound, using a very slow but simple div function won't make any difference. If compute bound, it will be worthwhile to do something better.

ET_ 2013-12-23 10:31

Thank you for this huge Christmas present! :smile:

Luigi

ewmayer 2013-12-23 20:58

Why would one need any kind of hardware div (as opposed to simple bitwise right-shifts) to effect a modmul?

Batalov 2014-04-06 21:47

[QUOTE=jasonp;362689]...
[/code]For the corresponding modular multiply functions, check out sp_mul [URL="https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/nttdisk/sp.h?root=ecm&content-type=text%2Fplain"]here[/URL][/QUOTE]
A have a question to local portability gurus. (not really related to the preceding discussion)

Will this (or similar) hack work on Win64?

Or, in other words, what should go in the Win64 section?
[CODE]uint64_t mulmod(uint64_t a, uint64_t b, uint64_t c) {
uint64_t d; /* to hold the result of a*b mod c */
#if (defined(__GNUC__) || defined(__ICL)) && defined(__x86_64__)
/* calculates a*b mod c, stores result in d */
asm ("mov %1, %%rax;" /* put a into rax */
"mul %2;" /* mul a*b -> rdx:rax */
"div %3;" /* (a*b)/c -> quot in rax remainder in rdx */
"mov %%rdx, %0;" /* store result in d */
:"=r"(d) /* output */
:"r"(a), "r"(b), "r"(c) /* input */
:"%rax", "%rdx" /* clobbered registers */
);
#elif defined(_MSC_VER) // && defined(_WIN64)
[COLOR=Red] __asm
{
mov rax, a
mul b
div c
mov d, rdx
}[/COLOR]
#else
#error Not implemented!
#endif
return d;
}
[/CODE]


All times are UTC. The time now is 22:00.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.