mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2016-09-29, 15:28   #1
cseizert
 
cseizert's Avatar
 
"Curtis"
Sep 2016
Fort Collins, CO

1010 Posts
Default 128 bit integer division in CUDA?

Does anyone know about the availability of libraries for performing large integer division on the GPU? So far I haven't been able to find much information on this subject, and writing such a function myself seems non-trivial. Especially since my skill with assembly leaves a lot to be desired.
cseizert is offline   Reply With Quote
Old 2016-09-29, 16:50   #2
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

2×7×461 Posts
Default

I haven't seen a medium-sized integer divide in CUDA. There is GQD (https://code.google.com/archive/p/gpuprec/) which has double-double and quad-double, and double-double looks like about 100 bits of precision which might be enough for the first part of your Sieve of Eratosthenes.

Implementing small-multi-precision integer divide is something of a rite of passage, and not one I've dragged myself through. If I remember correctly, the thing that turned into OpenSSL started off as someone trying to implement multi-precision division.

__clz looks useful for working out how far you have to shift things to line them up

Last fiddled with by fivemack on 2016-09-29 at 16:51
fivemack is offline   Reply With Quote
Old 2016-09-29, 17:45   #3
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

2×17×347 Posts
Default

Quote:
Originally Posted by cseizert View Post
Does anyone know about the availability of libraries for performing large integer division on the GPU? So far I haven't been able to find much information on this subject, and writing such a function myself seems non-trivial. Especially since my skill with assembly leaves a lot to be desired.
I don't know of any 128-bit integer divide but GMP-ECM may contain some useful code.

Do you mean 128/64 -> 64, 64 or 256/128 -> 128,128? The difference may be significant.

One item of many in my to-do list is to implement multiprecision arithmetic in CUDA. A severe lack of round tuits is the inhibiting factor.
xilman is offline   Reply With Quote
Old 2016-09-29, 18:09   #4
wombatman
I moo ablest echo power!
 
wombatman's Avatar
 
May 2013

3·619 Posts
Default

https://github.com/dmatlack/cuda-rsa

Found this while looking around for large number libraries. Never tried it out, but it looks to be replicating GMP for GPUs or something similar.
wombatman is offline   Reply With Quote
Old 2016-09-29, 18:18   #5
cseizert
 
cseizert's Avatar
 
"Curtis"
Sep 2016
Fort Collins, CO

2×5 Posts
Default

Most useful to me would be 128/64 -> 64. What is the general strategy for writing such an implementation? Long division in base 2^n? In any event, I'll have to read up on the CUDA toolkit documentation, and I'll also check out that link.
cseizert is offline   Reply With Quote
Old 2016-09-29, 19:13   #6
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Liverpool (GMT/BST)

3·23·89 Posts
Default

This is not CUDA but the algorithms should extend. http://www.codeproject.com/Tips/7850...vision-Modulus
henryzz is offline   Reply With Quote
Old 2016-11-26, 17:28   #7
cseizert
 
cseizert's Avatar
 
"Curtis"
Sep 2016
Fort Collins, CO

2·5 Posts
Default

I ended up writing a uint128_t class with all the arithmetic and whatnot, and it has proven to be decently efficient in a combinatorial prime number counting implementation I'm working on.

https://github.com/curtisseizert/CUDA-uint128
cseizert is offline   Reply With Quote
Old 2016-11-27, 00:23   #8
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

67558 Posts
Default

If you are familiar with Montgomery arithmetic, the tools for doing 64x64 mod 64 bit arithmetic in CUDA are here.

Given a modulus n you call montmul_w on the bottom word of n. Then if 64-bit a and b are in Montgomery form and less than n, you can compute the Montgomery form of a*b mod n by calling montmul64(). If you want to convert out of Montgomery form, you can call montmul_r (which is slow) and make the output one operand of montmul64(). Obviously the longer you can stay in Montgomery form the less often you have to use the slow routine.

Ernst has written extensively on this subject.

Also, you can try converting mp_mod_2 from here.

Last fiddled with by jasonp on 2016-11-27 at 00:26
jasonp is offline   Reply With Quote
Old 2016-11-27, 15:41   #9
cseizert
 
cseizert's Avatar
 
"Curtis"
Sep 2016
Fort Collins, CO

2·5 Posts
Default

I will have to look more into this, but as it is 64 * 64 -> 128 multiplication is fast enough for me since it is a single x86 instruction and only two ptx instructions. For example, I have been running this using 2 arrays of 100 000 000 primes around 2^60 and doing the resulting 100 million multiplications only takes 10 ms. on my gtx 1080. I think this is probably memory bound too.

Most of the division I do for calculating pi(x) is of the form x/(p * q) where p and q are primes < sqrt(x). But because this is ultimately used to calculate pi(x / (p * q)) via a pi table, which is a nightmare for cache misses, the time used to do this calculation is mostly hidden by memory latency. Somehow this kernel still pegs my gpu at TDP and is about 20 times faster than all 8 threads doing the same work on my cpu (with openMP), so I'm mostly concerned with correctness testing of the pi(x) implementation.

Last fiddled with by cseizert on 2016-11-27 at 15:43
cseizert is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Trial division with CUDA (mmff) -- used, but runs like new! Prime95 Operazione Doppi Mersennes 404 2023-02-24 23:24
CUDA integer FFTs jasonp GPU Computing 33 2018-12-30 13:32
k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 jasong Miscellaneous Math 5 2016-04-24 03:40
How to do big integer inversion/division with middle product, power series? R. Gerbicz Math 1 2015-02-10 10:56
Long Division in Mathematica JuanTutors Information & Answers 7 2007-06-14 17:29

All times are UTC. The time now is 14:59.


Fri Jul 7 14:59:53 UTC 2023 up 323 days, 12:28, 0 users, load averages: 1.14, 1.12, 1.11

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

โ‰  ยฑ โˆ“ รท ร— ยท โˆ’ โˆš โ€ฐ โŠ— โŠ• โŠ– โŠ˜ โŠ™ โ‰ค โ‰ฅ โ‰ฆ โ‰ง โ‰จ โ‰ฉ โ‰บ โ‰ป โ‰ผ โ‰ฝ โŠ โА โŠ‘ โŠ’ ยฒ ยณ ยฐ
โˆ  โˆŸ ยฐ โ‰… ~ โ€– โŸ‚ โซ›
โ‰ก โ‰œ โ‰ˆ โˆ โˆž โ‰ช โ‰ซ โŒŠโŒ‹ โŒˆโŒ‰ โˆ˜ โˆ โˆ โˆ‘ โˆง โˆจ โˆฉ โˆช โจ€ โŠ• โŠ— ๐–• ๐–– ๐–— โŠฒ โŠณ
โˆ… โˆ– โˆ โ†ฆ โ†ฃ โˆฉ โˆช โІ โŠ‚ โŠ„ โŠŠ โЇ โŠƒ โŠ… โŠ‹ โŠ– โˆˆ โˆ‰ โˆ‹ โˆŒ โ„• โ„ค โ„š โ„ โ„‚ โ„ต โ„ถ โ„ท โ„ธ ๐“Ÿ
ยฌ โˆจ โˆง โŠ• โ†’ โ† โ‡’ โ‡ โ‡” โˆ€ โˆƒ โˆ„ โˆด โˆต โŠค โŠฅ โŠข โŠจ โซค โŠฃ โ€ฆ โ‹ฏ โ‹ฎ โ‹ฐ โ‹ฑ
โˆซ โˆฌ โˆญ โˆฎ โˆฏ โˆฐ โˆ‡ โˆ† ฮด โˆ‚ โ„ฑ โ„’ โ„“
๐›ข๐›ผ ๐›ฃ๐›ฝ ๐›ค๐›พ ๐›ฅ๐›ฟ ๐›ฆ๐œ€๐œ– ๐›ง๐œ ๐›จ๐œ‚ ๐›ฉ๐œƒ๐œ— ๐›ช๐œ„ ๐›ซ๐œ… ๐›ฌ๐œ† ๐›ญ๐œ‡ ๐›ฎ๐œˆ ๐›ฏ๐œ‰ ๐›ฐ๐œŠ ๐›ฑ๐œ‹ ๐›ฒ๐œŒ ๐›ด๐œŽ๐œ ๐›ต๐œ ๐›ถ๐œ ๐›ท๐œ™๐œ‘ ๐›ธ๐œ’ ๐›น๐œ“ ๐›บ๐œ”