mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2010-07-18, 14:54   #1
chrisjp
 
Jul 2010

2×5 Posts
Smile Early Trials with OpenCL (Barrett's Modulus)

Hello,

I've been a fairly long time-lurker and have been messing around with GPGPU stuff for a year or so.

I've implemented trial factoring up to 86-bit factors using barrett's modulus as the foundation algorithm. I'm developing with the ATI toolchain right now, although it should be trivially portable to other architectures.

Results at this point show a raw processing speed of 35-40M factors per second on an HD4850 at stock speeds. The efficiency of the algorithm isn't completely fabulous, however, since it requires host side pre-calculation of one factor.

The AMD implementation also still is not performance optimized, so it's still producing some inefficient code in crucial parts of the algorithm. They are scheduled to release a "performance" based release early in August so hopefully that will correct some of the issues.

So in summary:
- OpenCL based trial division algorithm based upon Barrett's modulus
- Limited by pre-calculation (division) host-side
- No siever, etc, as of yet.
- Raw speed of ~40M candidates per second on HD4850

If you are interested in helping out or seeing the code, drop a message here and I'm happy to send it out.

I also have an FPGA implemented version of trial factoring I'll be putting up later this week if people are interested.

Enjoy your weekend!
chrisjp is offline   Reply With Quote
Old 2010-07-18, 20:40   #2
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

2·17·347 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Hello,

I've been a fairly long time-lurker and have been messing around with GPGPU stuff for a year or so.

I've implemented trial factoring up to 86-bit factors using barrett's modulus as the foundation algorithm. I'm developing with the ATI toolchain right now, although it should be trivially portable to other architectures.

Results at this point show a raw processing speed of 35-40M factors per second on an HD4850 at stock speeds. The efficiency of the algorithm isn't completely fabulous, however, since it requires host side pre-calculation of one factor.

The AMD implementation also still is not performance optimized, so it's still producing some inefficient code in crucial parts of the algorithm. They are scheduled to release a "performance" based release early in August so hopefully that will correct some of the issues.

So in summary:
- OpenCL based trial division algorithm based upon Barrett's modulus
- Limited by pre-calculation (division) host-side
- No siever, etc, as of yet.
- Raw speed of ~40M candidates per second on HD4850

If you are interested in helping out or seeing the code, drop a message here and I'm happy to send it out.

I also have an FPGA implemented version of trial factoring I'll be putting up later this week if people are interested.

Enjoy your weekend!
I am very much interested!

I've purchased a Xilinx sp601 dev kit (a medium-sized Spartan-6 with a fair amount of memory and high-level IO) and am trying to teach myself the rudiments of FPGA design but the learning curve has been an insurmountable opportunity so far. Anything that can help is very welcome indeed.

Please email me if you find that more convenient. The address is paul@leyland.vispa.com.

Paul
xilman is offline   Reply With Quote
Old 2010-07-19, 07:57   #3
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

5·223 Posts
Default

Hi!

Quote:
Originally Posted by chrisjp View Post
- Limited by pre-calculation (division) host-side
- No siever, etc, as of yet.
Perhaps you'll find something that you could reuse in my code here:
http://mersenneforum.org/showthread.php?t=12827

Quote:
Originally Posted by chrisjp View Post
If you are interested in helping out or seeing the code, drop a message here and I'm happy to send it out.
I'm interreseted! Barretts algorithm is on my radar for longterm development.

Oliver
TheJudger is offline   Reply With Quote
Old 2010-07-19, 17:22   #4
chrisjp
 
Jul 2010

2·5 Posts
Default

Hey guys.

Oliver - I had seen your thread and tried out your code! It's fantastic! I had a similar version based on floating point reciprocals working a few months back. I tried switching over to integer based code since I thought it might be faster given AMDs processing units.

The lack of some of the hardware instructions being passed though in OpenCL is a big downside unfortuantely. This includes some of the 58XX features including rotates, bitcounts, and 24-bit multiplies. There isn't a good "PTX hack" for AMD opencl unfortunately.

Paul - I'm happy to answer any FPGA questions, it's a big hobby of mine. The code is currently fairly elementary and should be a good starter for you. It does use most of the xilinx "blocks" including rams/multipliers/dcm's so it should be a good example.

I'm at work but I will try to throw both designs up somewhere early this week (probably tomorrow)

Chris
chrisjp is offline   Reply With Quote
Old 2010-07-20, 09:36   #5
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

21338 Posts
Default

Hello Chris,

Quote:
Originally Posted by chrisjp View Post
The lack of some of the hardware instructions being passed though in OpenCL is a big downside unfortuantely. This includes some of the 58XX features including rotates, bitcounts, and 24-bit multiplies. There isn't a good "PTX hack" for AMD opencl unfortunately.
Yepp, access to some low level functions is very usefully. For my newer kernels the most usefull functions are add/sub with carry. And as you can see in my code they are comfortably useable with CUDA.
Do you know if it is related to the ATI toolchain or to OpenCL itself that it is not possible(?) to use those functions?

About barrets modular multiplication: I'm still afraid of the correction steps which might break SIMD/lock step behaviour on Nvidia GPUs. So I'm curios
a) how you've implemented this
b) how it performs, OpenCL should work on Nvidia GPUs, too

I've choosen the simple long division in my code because
a) it is simple
b) it needs only one correction step at the last iteration


Perhaps we can work together somehow and help each other to improve our codes. e.g. I think we both need a good (and faster) sieve on CPU.
So I'm waiting for your code to see what is different and what it similar.

Oliver
TheJudger is offline   Reply With Quote
Old 2010-08-16, 10:15   #6
Brain
 
Brain's Avatar
 
Dec 2009
Peine, Germany

5138 Posts
Default

I'm also interested. As far as I know, there isn't any Mersenne code yet available for ATI cards. I own a HD 5770 which I could offer for code validation. It would be nice to have platform independent code (as long as it is still with good performance...).
Brain is offline   Reply With Quote
Old 2011-04-11, 01:05   #7
chrisjp
 
Jul 2010

2·5 Posts
Default OpenCL code.

Hey all, Sorry for the long long delay. Lots of changes in life, but here is the barebones opencl code. Currently not usable unless you're quite dedicated :)

It has lots of debug stuff and old code, but you can get the jist of the algorithm.

It currently runs around 93M primes / second with everything precalculated and ready to go on the CPU side on an HD 6870 @ 1Ghz.

I'll try to incorporate it into a .exe sometime this or next week.

Chris
Attached Files
File Type: txt trialfactor_72.txt (41.4 KB, 1154 views)
chrisjp is offline   Reply With Quote
Old 2011-04-11, 01:17   #8
chrisjp
 
Jul 2010

2×5 Posts
Default

Should also mention that since the original post I modified it to use the 24-bit integer instructions so it is now only up to 72-bit factors :)
chrisjp is offline   Reply With Quote
Old 2011-04-11, 07:30   #9
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Hello,

I've been a fairly long time-lurker and have been messing around with GPGPU stuff for a year or so.

I've implemented trial factoring up to 86-bit factors using barrett's modulus as the foundation algorithm. I'm developing with the ATI toolchain right now, although it should be trivially portable to other architectures.

Results at this point show a raw processing speed of 35-40M factors per second on an HD4850 at stock speeds. The efficiency of the algorithm isn't completely fabulous, however, since it requires host side pre-calculation of one factor.

The AMD implementation also still is not performance optimized, so it's still producing some inefficient code in crucial parts of the algorithm. They are scheduled to release a "performance" based release early in August so hopefully that will correct some of the issues.

So in summary:
- OpenCL based trial division algorithm based upon Barrett's modulus
- Limited by pre-calculation (division) host-side
- No siever, etc, as of yet.
- Raw speed of ~40M candidates per second on HD4850

If you are interested in helping out or seeing the code, drop a message here and I'm happy to send it out.

I also have an FPGA implemented version of trial factoring I'll be putting up later this week if people are interested.

Enjoy your weekend!
If you'd browse the forum you'd see i'm also busy writing this at AMD GPU's.
Funny :)

Note i am busy also implementing a siever as of course you want to do the sieving inside the gpu as well as you simply do not have the bandwidth to ship enough FC (factor candidates) per second to the GPU.

Could you ship me an email with the source, as that would be very interesting.

diep@xs4all.nl

A few days ago i had emailed around to the DRUG team it would be in OpenCL as i see a way to do efficient sieving in OpenCL to generate the FC's.

It will be interesting to see how many cycles per second this is at the 60-64 bits range.

What matters is speed around the 60-70 bit range, as it is not possible here right now to sieve at a cpu to more than 61 bits, for Wagstaff, which is nearly 100% similar to Mersenne.

So moving to 64 bits there would be interesting on GPU's.
This will speedup the project here bigtime :)

I'll implement the sieving this afternoon or one of the coming afternoons.

This is a few hours project. Optimization will take years though :)

p.s. i do not know efficiency of openCL, but i do know that some gpu's which have 2 gpu's on each cards,
like the 5970 and the 6990 that opencl until a month ago didn't allow adressing the 2nd gpu. let's see whether the slow moving Indian helpdesk (whose core job at AMD seems to be renaming naming conventions every few months - yes Sahib) has fixed this by now.

pps I kept an eye on gpgpu since 2007. Could only toy in CUDA back then.

Last fiddled with by diep on 2011-04-11 at 07:33
diep is offline   Reply With Quote
Old 2011-04-11, 07:36   #10
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

14478 Posts
Default

Quote:
Originally Posted by chrisjp View Post
Should also mention that since the original post I modified it to use the 24-bit integer instructions so it is now only up to 72-bit factors :)
24 bits????

older nvidia's could only do 24 bits (fermi can do 32 now as well). AMD always could do 32 x 32 bits multiplications since the 4000 series AFAIK. One instruction gives top 32 bits another instruction lower 32 bits. I already wanted to post how to get those in openCL, but maybe you want to post this :)
diep is offline   Reply With Quote
Old 2011-04-11, 07:43   #11
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by Brain View Post
I'm also interested. As far as I know, there isn't any Mersenne code yet available for ATI cards. I own a HD 5770 which I could offer for code validation. It would be nice to have platform independent code (as long as it is still with good performance...).
OpenCL is platform independant but of course you don't want it to get programmed platform independant as that will slow down the program factors.

You want the fastest code of course as far as is possible by opencl. Now opencl is the only thing AMD will keep supporting for gpgpu so there is not much of a choice there.

OpenCL defines the number crunching hardware in a very specific GPU type manner. Don't expect it to run fast on CPU's.

The advantage of cpu's is that they have things like cache coherency whereas opencl defines this world as total independant from each other, with only a very primitive way to communicate with other worlds; whereas cpu's have been optimized in such a manner that cores can work very well with each other.

Furthermore gpu's only can do 32 bits calculations fast and the rest gets simulated in 32 bits using some extra transistors to quickly emulate 64 bits (and the nvidia gamers cards hardly have those transistors only the tesla series has - they want to make some cash for their shareholders). So ideally as you can see already from the CUDA code is to do the calculations using 32 bits code.

This where at cpu's things is a lot faster using 64 bits code.

Regards,
Vincent
diep is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extracting the Modulus from publickeyblob RSA 512 26B Homework Help 2 2014-11-30 07:31
Free Trials of GPU Cloud Computing Resources NBtarheel_33 GPU to 72 9 2013-07-31 15:32
Guantanamo trials to be restarted garo Soap Box 39 2011-03-22 23:07
It's possible to calculate an unknown RSA modulus? D2MAC Math 8 2010-12-26 16:32
Factorization of a 768-bit RSA modulus akruppa Factoring 53 2010-03-12 13:50

All times are UTC. The time now is 15:12.


Fri Jul 7 15:12:10 UTC 2023 up 323 days, 12:40, 0 users, load averages: 1.20, 1.08, 1.11

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

โ‰  ยฑ โˆ“ รท ร— ยท โˆ’ โˆš โ€ฐ โŠ— โŠ• โŠ– โŠ˜ โŠ™ โ‰ค โ‰ฅ โ‰ฆ โ‰ง โ‰จ โ‰ฉ โ‰บ โ‰ป โ‰ผ โ‰ฝ โŠ โА โŠ‘ โŠ’ ยฒ ยณ ยฐ
โˆ  โˆŸ ยฐ โ‰… ~ โ€– โŸ‚ โซ›
โ‰ก โ‰œ โ‰ˆ โˆ โˆž โ‰ช โ‰ซ โŒŠโŒ‹ โŒˆโŒ‰ โˆ˜ โˆ โˆ โˆ‘ โˆง โˆจ โˆฉ โˆช โจ€ โŠ• โŠ— ๐–• ๐–– ๐–— โŠฒ โŠณ
โˆ… โˆ– โˆ โ†ฆ โ†ฃ โˆฉ โˆช โІ โŠ‚ โŠ„ โŠŠ โЇ โŠƒ โŠ… โŠ‹ โŠ– โˆˆ โˆ‰ โˆ‹ โˆŒ โ„• โ„ค โ„š โ„ โ„‚ โ„ต โ„ถ โ„ท โ„ธ ๐“Ÿ
ยฌ โˆจ โˆง โŠ• โ†’ โ† โ‡’ โ‡ โ‡” โˆ€ โˆƒ โˆ„ โˆด โˆต โŠค โŠฅ โŠข โŠจ โซค โŠฃ โ€ฆ โ‹ฏ โ‹ฎ โ‹ฐ โ‹ฑ
โˆซ โˆฌ โˆญ โˆฎ โˆฏ โˆฐ โˆ‡ โˆ† ฮด โˆ‚ โ„ฑ โ„’ โ„“
๐›ข๐›ผ ๐›ฃ๐›ฝ ๐›ค๐›พ ๐›ฅ๐›ฟ ๐›ฆ๐œ€๐œ– ๐›ง๐œ ๐›จ๐œ‚ ๐›ฉ๐œƒ๐œ— ๐›ช๐œ„ ๐›ซ๐œ… ๐›ฌ๐œ† ๐›ญ๐œ‡ ๐›ฎ๐œˆ ๐›ฏ๐œ‰ ๐›ฐ๐œŠ ๐›ฑ๐œ‹ ๐›ฒ๐œŒ ๐›ด๐œŽ๐œ ๐›ต๐œ ๐›ถ๐œ ๐›ท๐œ™๐œ‘ ๐›ธ๐œ’ ๐›น๐œ“ ๐›บ๐œ”