![]() |
|
|
#1 |
|
"Curtis"
Sep 2016
Fort Collins, CO
1010 Posts |
Does anyone know about the availability of libraries for performing large integer division on the GPU? So far I haven't been able to find much information on this subject, and writing such a function myself seems non-trivial. Especially since my skill with assembly leaves a lot to be desired.
|
|
|
|
|
|
#2 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts |
I haven't seen a medium-sized integer divide in CUDA. There is GQD (https://code.google.com/archive/p/gpuprec/) which has double-double and quad-double, and double-double looks like about 100 bits of precision which might be enough for the first part of your Sieve of Eratosthenes.
Implementing small-multi-precision integer divide is something of a rite of passage, and not one I've dragged myself through. If I remember correctly, the thing that turned into OpenSSL started off as someone trying to implement multi-precision division. __clz looks useful for working out how far you have to shift things to line them up Last fiddled with by fivemack on 2016-09-29 at 16:51 |
|
|
|
|
|
#3 | |
|
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2×17×347 Posts |
Quote:
Do you mean 128/64 -> 64, 64 or 256/128 -> 128,128? The difference may be significant. One item of many in my to-do list is to implement multiprecision arithmetic in CUDA. A severe lack of round tuits is the inhibiting factor. |
|
|
|
|
|
|
#4 |
|
I moo ablest echo power!
May 2013
3·619 Posts |
https://github.com/dmatlack/cuda-rsa
Found this while looking around for large number libraries. Never tried it out, but it looks to be replicating GMP for GPUs or something similar. |
|
|
|
|
|
#5 |
|
"Curtis"
Sep 2016
Fort Collins, CO
2×5 Posts |
Most useful to me would be 128/64 -> 64. What is the general strategy for writing such an implementation? Long division in base 2^n? In any event, I'll have to read up on the CUDA toolkit documentation, and I'll also check out that link.
|
|
|
|
|
|
#6 |
|
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·23·89 Posts |
This is not CUDA but the algorithms should extend. http://www.codeproject.com/Tips/7850...vision-Modulus
|
|
|
|
|
|
#7 |
|
"Curtis"
Sep 2016
Fort Collins, CO
2·5 Posts |
I ended up writing a uint128_t class with all the arithmetic and whatnot, and it has proven to be decently efficient in a combinatorial prime number counting implementation I'm working on.
https://github.com/curtisseizert/CUDA-uint128 |
|
|
|
|
|
#8 |
|
Tribal Bullet
Oct 2004
67558 Posts |
If you are familiar with Montgomery arithmetic, the tools for doing 64x64 mod 64 bit arithmetic in CUDA are here.
Given a modulus n you call montmul_w on the bottom word of n. Then if 64-bit a and b are in Montgomery form and less than n, you can compute the Montgomery form of a*b mod n by calling montmul64(). If you want to convert out of Montgomery form, you can call montmul_r (which is slow) and make the output one operand of montmul64(). Obviously the longer you can stay in Montgomery form the less often you have to use the slow routine. Ernst has written extensively on this subject. Also, you can try converting mp_mod_2 from here. Last fiddled with by jasonp on 2016-11-27 at 00:26 |
|
|
|
|
|
#9 |
|
"Curtis"
Sep 2016
Fort Collins, CO
2·5 Posts |
I will have to look more into this, but as it is 64 * 64 -> 128 multiplication is fast enough for me since it is a single x86 instruction and only two ptx instructions. For example, I have been running this using 2 arrays of 100 000 000 primes around 2^60 and doing the resulting 100 million multiplications only takes 10 ms. on my gtx 1080. I think this is probably memory bound too.
Most of the division I do for calculating pi(x) is of the form x/(p * q) where p and q are primes < sqrt(x). But because this is ultimately used to calculate pi(x / (p * q)) via a pi table, which is a nightmare for cache misses, the time used to do this calculation is mostly hidden by memory latency. Somehow this kernel still pegs my gpu at TDP and is about 20 times faster than all 8 threads doing the same work on my cpu (with openMP), so I'm mostly concerned with correctness testing of the pi(x) implementation. Last fiddled with by cseizert on 2016-11-27 at 15:43 |
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Trial division with CUDA (mmff) -- used, but runs like new! | Prime95 | Operazione Doppi Mersennes | 404 | 2023-02-24 23:24 |
| CUDA integer FFTs | jasonp | GPU Computing | 33 | 2018-12-30 13:32 |
| k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 | jasong | Miscellaneous Math | 5 | 2016-04-24 03:40 |
| How to do big integer inversion/division with middle product, power series? | R. Gerbicz | Math | 1 | 2015-02-10 10:56 |
| Long Division in Mathematica | JuanTutors | Information & Answers | 7 | 2007-06-14 17:29 |