![]() |
GNU ASM -> CUDA _device_
Hello all, I have a question...
Thanks to Serge Batalov, I started working on Double Mersennes > MM33 with a multi-k siever. Thanks to Ken Brazier, I had some hints on how to make my siever faster. Thanks to myself, now the siever runs 400% faster, sieving 400G prime candidates on 2,000,000 k per day. :smile: I am now in the way of translating my multiple siever into a CUDA application. One of the functions I invoke (a mulmod) has been kindly offered by Ken Brazier, and is written as inline GCC assembly language. This part of the code should run as __device__ code on the GPU. Now, the question: Is there a way to translate the code (no more than 8-9 instructions) into something like a __device__ CUDA asm? :help: Thank you for helping me. Luigi P.S. Should I transfer the question into the Programming subforum? |
[QUOTE=ET_;362673]
Is there a way to translate the code (no more than 8-9 instructions) into something like a __device__ CUDA asm? :help:[/QUOTE] CUDA supports inline asm. You just need to translate the current asm to PTX instructions. See [url]http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html[/url]. |
[QUOTE=frmky;362675]CUDA supports inline asm. You just need to translate the current asm to PTX instructions. See [url]http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html[/url].[/QUOTE]
Thanks Greg! :bow: My problem is that I have a "divq" and "mulq" instructions in the GCC asm, while the manual (ptx_isa_2.0.pdf) doesn't seem to recognize them. That should mean that I must translate the semantic of "divq" and "mulq" into something that the compiler understands, like mul.hi.u64 || mul.lo.u64 || mul.wide.u64 and div.u64. The sad thing is that I don't have a manual of the GCC asm syntax either :sad:, any pointers to it / volunteers that like to explain those few lines to me in PM? Luigi |
If you have to synthesize a 64-bit div and mod from 32-bit divisions, you can use only multiplies if the modulus is assumed fixed for several mulmod operations, but if not then you have to use a multiple-precision divide. See the functions mp_modmul_2 and mp_mod_2 [url="http://sourceforge.net/p/msieve/code/HEAD/tree/trunk/common/mp.c"]here[/url]
|
[QUOTE=ET_;362679]That should mean that I must translate the semantic of "divq" and "mulq" into something that the compiler understands, like mul.hi.u64 || mul.lo.u64 || mul.wide.u64 and div.u64.[/QUOTE]
mulq can be substituted with the two instructions mul.hi.u64 and mul.lo.u64 to get the full 128-bit product. divq, which is a 128-bit by 64-bit divide, doesn't have a corresponding instruction in PTX. div.u64 only does a 64-bit by 64-bit divide, but there are tricks as Jason mentioned to get around that. |
Thank you Jason and Greg.
You cut my problem into 2 slices, and the first one (mulq) is resolved. All it takes to resolve the second half of the question is studying the 150 lines of mp_modmul_2 and mp_mod_2 (a matter of shifting and reordering low and high portions of the operand). Luigi :et_: |
Unfortunately that code will have terrible performance on a GPU, so I would only use it to bootstrap the one division needed for the precomputation method if you could possibly get away with it. With the generalized inverse in place that that method needs, a modmul is 3 multiplies plus a few shifts and a conditional subtract, which would be 100x faster than pushing all the mod code into CUDA. In fact, if the inverse will be reused enough you can implement the division via bit-at-a-time code.
This is from an experiment in the GMP-ECM source; SP_TYPE_BITS is the width of a word, SP_NUMB_BITS is the number of bits in the modulus (assumed fixed): [code] sp_t sp_reciprocal(sp_t p) { /* integer reciprocal */ #if SP_NUMB_BITS <= SP_TYPE_BITS - 2 mp_limb_t shift = 2 * SP_NUMB_BITS + 1; #else mp_limb_t shift = 2 * SP_NUMB_BITS; #endif #if SP_TYPE_BITS == GMP_LIMB_BITS /* use GMP functions */ mp_limb_t recip, dummy; udiv_qrnnd (recip, dummy, (mp_limb_t) 1 << (shift - SP_TYPE_BITS), 0, p); return recip; #elif SP_TYPE_BITS < GMP_LIMB_BITS /* ordinary division */ return ((mp_limb_t)1 << shift) / p; #else /* worst case: bit-at-a-time */ sp_t r = (sp_t)1 << (SP_NUMB_BITS - 1); sp_t q = 0; mp_limb_t i; for (i = 0; i < shift + 1 - SP_NUMB_BITS; i++) { q += q; r += r; if (r >= p) { r -= p; q |= 1; } } return q; #endif } [/code] For the corresponding modular multiply functions, check out sp_mul [url="https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/nttdisk/sp.h?root=ecm&content-type=text%2Fplain"]here[/url] |
The CUDA div function is [B][I]slow[/I][/B] but it's worthwhile to check if the code is compute or memory bound. If it's memory bound, using a very slow but simple div function won't make any difference. If compute bound, it will be worthwhile to do something better.
|
Thank you for this huge Christmas present! :smile:
Luigi |
Why would one need any kind of hardware div (as opposed to simple bitwise right-shifts) to effect a modmul?
|
[QUOTE=jasonp;362689]...
[/code]For the corresponding modular multiply functions, check out sp_mul [URL="https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/nttdisk/sp.h?root=ecm&content-type=text%2Fplain"]here[/URL][/QUOTE] A have a question to local portability gurus. (not really related to the preceding discussion) Will this (or similar) hack work on Win64? Or, in other words, what should go in the Win64 section? [CODE]uint64_t mulmod(uint64_t a, uint64_t b, uint64_t c) { uint64_t d; /* to hold the result of a*b mod c */ #if (defined(__GNUC__) || defined(__ICL)) && defined(__x86_64__) /* calculates a*b mod c, stores result in d */ asm ("mov %1, %%rax;" /* put a into rax */ "mul %2;" /* mul a*b -> rdx:rax */ "div %3;" /* (a*b)/c -> quot in rax remainder in rdx */ "mov %%rdx, %0;" /* store result in d */ :"=r"(d) /* output */ :"r"(a), "r"(b), "r"(c) /* input */ :"%rax", "%rdx" /* clobbered registers */ ); #elif defined(_MSC_VER) // && defined(_WIN64) [COLOR=Red] __asm { mov rax, a mul b div c mov d, rdx }[/COLOR] #else #error Not implemented! #endif return d; } [/CODE] |
I guess I will use modmul_2() from jasonp's mp.c for a fast workaround.
With gratitude. |
[QUOTE=Batalov;370447]Will this (or similar) hack work on Win64?
Or, in other words, what should go in the Win64 section?[/QUOTE] I only have access to a Win32/MSVC devenv so can;t actually test the 64-bit inline asm, but the only obvious flaw I notice is that you need to swap the operand order in each of your two MOVs due to the opposite-order convention of Intel and AT&T syntax, e.g. mov a,rax writes rax into variable a (store), whereas you obviously intend a load there. |
In 64-bit, nothing will work, people say. "You had me at __asm, says the compiler."
It is not a critical piece of code, though, so I will use either Jason's code or George's tf_validate from mmff. The latest CUDA refuses to install if one doesn't have VS and sends you to the generic microsoft download page. You get the latest and greatest 2013, and CUDA refuses to install pointing you to its link of system requirements, where it says: "Supported are VS 2008, 2010, 2012." These bastards know how to get you to curse. ;-) The flowers in my daughter's garden outside of my window wilted from what they've recently heard... Oh fun fun... |
[QUOTE=Batalov;370452]In 64-bit, nothing will work, people say. "You had me at __asm, says the compiler."[/QUOTE]
FYI, here is a similar 32-bit Win/MSVC [erm, I mean Visual Studio] inline asm macro from my codebase - if you can get that to build in 32-bit mode, next try an r**-register-named version of same in 64-bit mode. If that also compiles ok, try the one-asm-per-line syntax in your own modmul macro. FYI, I ditched any semblance of Microsoft/Win build support for my Mlucas-related codes several years ago, mainly due to MS charging full price for what I consider a crippled compiler product, namely VS having no 64-bit inline asm support even several years after x86_64 became widespread. No excuse for them to not support such stuff in VS as soon as they added x86_64 support of any kind. Allegedly they have since remedied their oversight, but far too late for me to give a crap. AFAIC the sooner that dinosaur turd of a company fades into irrelevancy the better, and anything, however small, I can personally do to hasten the process, I will. :) While I would never refuse to offer help to a friend in need such as yourself, may I suggest you similarly think long and hard about whether you want to pollute your own code with WinCrap, when you could simply use GCC inside a linux emulator? [code]/* Multiplies two 32-bit unsigned integers _x and _y, and generates a two-word (64-bit) product in _lo and _hi. */ #define MUL_LOHI32(_x,_y,_lo,_hi)\ {\ __asm mov eax, _x \ __asm mul _y /* Result of eax*_y stored in edx:eax as _hi:_lo */ \ __asm mov _lo, eax \ __asm mov _hi, edx \ }[/code] |
[QUOTE=ewmayer;370453]
While I would never refuse to offer help to a friend in need such as yourself, may I suggest you similarly think long and hard about whether you want to pollute your own code with WinCrap, when you could simply use GCC inside a linux emulator?[/QUOTE] Use MinGW...? |
[QUOTE=mfaktc]
############################# # 2.2 Compilation (Windows) # ############################# The following instructions have been tested on Windows 7 64bit using Visual Studio 2008 Professional. A GNU compatible version of make is also required as the Makefile is not compatible with nmake. GNU Make for Win32 can be downloaded from [URL]http://gnuwin32.sourceforge.net/packages/make.htm[/URL] Run the Visual Studio 2008 x64 Win64 Command Prompt and change into the "src/" subdirectory. Run 'make -f Makefile.win' for a 64bit built (recommended on 64bit systems) or 'make -f Makefile.win32' for a 32bit built. Perhaps you have to adjust the paths to your CUDA installation and the Microsoft Visual Studio binaries in the makefiles. The binaries "mfaktc-win-64.exe" or "mfaktc-win-32.exe" are placed in the parent directory.[/QUOTE] Use MinGW...? and then tell us all about it? I am simply retracing the steps for rebuilding a [URL="http://mersenneforum.org/showthread.php?p=370282#post370282"]modified version[/URL] (for a new purpose). I am actually using linux. I wanted to spend an evening helping others (they want a Windows binary), but there's just so much wincrap I can take. I will just post the patched source where I will rid of asm code (it will become much uglier though). Thought I'd build it, but not for the price of a full studio version, no. Even uninstalling the mscrap will take the rest of my evening, and I prefer to catch a movie instead. |
[QUOTE=Batalov;370455]Use MinGW...? and then tell us all about it?
I am simply retracing the steps for rebuilding a [URL="http://mersenneforum.org/showthread.php?p=370282#post370282"]modified version[/URL] (for a new purpose). I am actually using linux. I wanted to spend an evening helping others (they want a Windows binary), but there's just so much wincrap I can take. I will just post the patched source where I will rid of asm code (it will become much uglier though). Thought I'd build it, but not for the price of a full studio version, no. Even uninstalling the mscrap will take the rest of my evening, and I prefer to catch a movie instead.[/QUOTE] ? MinGW is basically a gcc for windows. I use it almost the same as I would compiling Linux. EDIT: for fun, I can compile clLucas and mfakto with it. |
[QUOTE=Batalov;370452]In 64-bit, nothing will work, people say. "You had me at __asm, says the compiler."
It is not a critical piece of code, though, so I will use either Jason's code or George's tf_validate from mmff. The latest CUDA refuses to install if one doesn't have VS and sends you to the generic microsoft download page. You get the latest and greatest 2013, and CUDA refuses to install pointing you to its link of system requirements, where it says: "Supported are VS 2008, 2010, 2012." These bastards know how to get you to curse. ;-) The flowers in my daughter's garden outside of my window wilted from what they've recently heard... Oh fun fun...[/QUOTE] Rather than supporting 64-bit inline ASM, they offer an intrinsic function for most (all?) 64 bit operations (including SIMD stuff... there is scads of it in yafu). So you can accomplish this using, e.g., [URL="http://msdn.microsoft.com/en-us/library/vstudio/67xf5cy5(v=vs.100).aspx"]__umulh[/URL] [edit] Presumably if you put the hi-word producing __umulh(x,y) near the lo-word producing x*y, the compiler would be smart enough to only emit one multiply. But you should probably check the disassembly to be sure. |
Erm, cool (I think)... I can see that we can start with
[CODE]// umul128.c // processor: IPF, x64 #include <stdio.h> #include <intrin.h> #pragma intrinsic(_umul128) unsigned __int64 mulmod(unsigned __int64 a, unsigned __int64 b, unsigned __int64 m) { unsigned __int64 c, d; d = _umul128(a, b, &c); // ...but where is _udiv64 or something? There's none. }[/CODE] I'll have a look in yafu... I've already used Jason's modmul_2(). My interest is purely academic at this point. |
| All times are UTC. The time now is 13:03. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.