![]() |
|
|
#1 |
|
Jul 2010
2×5 Posts |
Hello,
I've been a fairly long time-lurker and have been messing around with GPGPU stuff for a year or so. I've implemented trial factoring up to 86-bit factors using barrett's modulus as the foundation algorithm. I'm developing with the ATI toolchain right now, although it should be trivially portable to other architectures. Results at this point show a raw processing speed of 35-40M factors per second on an HD4850 at stock speeds. The efficiency of the algorithm isn't completely fabulous, however, since it requires host side pre-calculation of one factor. The AMD implementation also still is not performance optimized, so it's still producing some inefficient code in crucial parts of the algorithm. They are scheduled to release a "performance" based release early in August so hopefully that will correct some of the issues. So in summary: - OpenCL based trial division algorithm based upon Barrett's modulus - Limited by pre-calculation (division) host-side - No siever, etc, as of yet. - Raw speed of ~40M candidates per second on HD4850 If you are interested in helping out or seeing the code, drop a message here and I'm happy to send it out. I also have an FPGA implemented version of trial factoring I'll be putting up later this week if people are interested. Enjoy your weekend! |
|
|
|
|
|
#2 | |
|
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2·17·347 Posts |
Quote:
I've purchased a Xilinx sp601 dev kit (a medium-sized Spartan-6 with a fair amount of memory and high-level IO) and am trying to teach myself the rudiments of FPGA design but the learning curve has been an insurmountable opportunity so far. Anything that can help is very welcome indeed. Please email me if you find that more convenient. The address is paul@leyland.vispa.com. Paul |
|
|
|
|
|
|
#3 | ||
|
"Oliver"
Mar 2005
Germany
5·223 Posts |
Hi!
Quote:
http://mersenneforum.org/showthread.php?t=12827 Quote:
![]() Oliver |
||
|
|
|
|
|
#4 |
|
Jul 2010
2·5 Posts |
Hey guys.
Oliver - I had seen your thread and tried out your code! It's fantastic! I had a similar version based on floating point reciprocals working a few months back. I tried switching over to integer based code since I thought it might be faster given AMDs processing units. The lack of some of the hardware instructions being passed though in OpenCL is a big downside unfortuantely. This includes some of the 58XX features including rotates, bitcounts, and 24-bit multiplies. There isn't a good "PTX hack" for AMD opencl unfortunately. Paul - I'm happy to answer any FPGA questions, it's a big hobby of mine. The code is currently fairly elementary and should be a good starter for you. It does use most of the xilinx "blocks" including rams/multipliers/dcm's so it should be a good example. I'm at work but I will try to throw both designs up somewhere early this week (probably tomorrow) Chris |
|
|
|
|
|
#5 | |
|
"Oliver"
Mar 2005
Germany
21338 Posts |
Hello Chris,
Quote:
Do you know if it is related to the ATI toolchain or to OpenCL itself that it is not possible(?) to use those functions? About barrets modular multiplication: I'm still afraid of the correction steps which might break SIMD/lock step behaviour on Nvidia GPUs. So I'm curios a) how you've implemented this b) how it performs, OpenCL should work on Nvidia GPUs, too I've choosen the simple long division in my code because a) it is simple ![]() b) it needs only one correction step at the last iteration Perhaps we can work together somehow and help each other to improve our codes. e.g. I think we both need a good (and faster) sieve on CPU. So I'm waiting for your code to see what is different and what it similar. Oliver |
|
|
|
|
|
|
#6 |
|
Dec 2009
Peine, Germany
5138 Posts |
I'm also interested. As far as I know, there isn't any Mersenne code yet available for ATI cards. I own a HD 5770 which I could offer for code validation. It would be nice to have platform independent code (as long as it is still with good performance...).
|
|
|
|
|
|
#7 |
|
Jul 2010
2·5 Posts |
Hey all, Sorry for the long long delay. Lots of changes in life, but here is the barebones opencl code. Currently not usable unless you're quite dedicated :)
It has lots of debug stuff and old code, but you can get the jist of the algorithm. It currently runs around 93M primes / second with everything precalculated and ready to go on the CPU side on an HD 6870 @ 1Ghz. I'll try to incorporate it into a .exe sometime this or next week. Chris |
|
|
|
|
|
#8 |
|
Jul 2010
2×5 Posts |
Should also mention that since the original post I modified it to use the 24-bit integer instructions so it is now only up to 72-bit factors :)
|
|
|
|
|
|
#9 | |
|
Sep 2006
The Netherlands
3×269 Posts |
Quote:
Funny :) Note i am busy also implementing a siever as of course you want to do the sieving inside the gpu as well as you simply do not have the bandwidth to ship enough FC (factor candidates) per second to the GPU. Could you ship me an email with the source, as that would be very interesting. diep@xs4all.nl A few days ago i had emailed around to the DRUG team it would be in OpenCL as i see a way to do efficient sieving in OpenCL to generate the FC's. It will be interesting to see how many cycles per second this is at the 60-64 bits range. What matters is speed around the 60-70 bit range, as it is not possible here right now to sieve at a cpu to more than 61 bits, for Wagstaff, which is nearly 100% similar to Mersenne. So moving to 64 bits there would be interesting on GPU's. This will speedup the project here bigtime :) I'll implement the sieving this afternoon or one of the coming afternoons. This is a few hours project. Optimization will take years though :) p.s. i do not know efficiency of openCL, but i do know that some gpu's which have 2 gpu's on each cards, like the 5970 and the 6990 that opencl until a month ago didn't allow adressing the 2nd gpu. let's see whether the slow moving Indian helpdesk (whose core job at AMD seems to be renaming naming conventions every few months - yes Sahib) has fixed this by now. pps I kept an eye on gpgpu since 2007. Could only toy in CUDA back then. Last fiddled with by diep on 2011-04-11 at 07:33 |
|
|
|
|
|
|
#10 | |
|
Sep 2006
The Netherlands
14478 Posts |
Quote:
older nvidia's could only do 24 bits (fermi can do 32 now as well). AMD always could do 32 x 32 bits multiplications since the 4000 series AFAIK. One instruction gives top 32 bits another instruction lower 32 bits. I already wanted to post how to get those in openCL, but maybe you want to post this :) |
|
|
|
|
|
|
#11 | |
|
Sep 2006
The Netherlands
3×269 Posts |
Quote:
You want the fastest code of course as far as is possible by opencl. Now opencl is the only thing AMD will keep supporting for gpgpu so there is not much of a choice there. OpenCL defines the number crunching hardware in a very specific GPU type manner. Don't expect it to run fast on CPU's. The advantage of cpu's is that they have things like cache coherency whereas opencl defines this world as total independant from each other, with only a very primitive way to communicate with other worlds; whereas cpu's have been optimized in such a manner that cores can work very well with each other. Furthermore gpu's only can do 32 bits calculations fast and the rest gets simulated in 32 bits using some extra transistors to quickly emulate 64 bits (and the nvidia gamers cards hardly have those transistors only the tesla series has - they want to make some cash for their shareholders). So ideally as you can see already from the CUDA code is to do the calculations using 32 bits code. This where at cpu's things is a lot faster using 64 bits code. Regards, Vincent |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Extracting the Modulus from publickeyblob RSA 512 | 26B | Homework Help | 2 | 2014-11-30 07:31 |
| Free Trials of GPU Cloud Computing Resources | NBtarheel_33 | GPU to 72 | 9 | 2013-07-31 15:32 |
| Guantanamo trials to be restarted | garo | Soap Box | 39 | 2011-03-22 23:07 |
| It's possible to calculate an unknown RSA modulus? | D2MAC | Math | 8 | 2010-12-26 16:32 |
| Factorization of a 768-bit RSA modulus | akruppa | Factoring | 53 | 2010-03-12 13:50 |