![]() |
|
|
#12 | |
|
Sep 2006
The Netherlands
3×269 Posts |
Quote:
http://www.amd.com/us/products/deskt...verview.aspx#2 Maybe i should ask you at which bitlevel you are trying FC's, as i was counting at the 880Mhz @ 1536 streamcores i have here (radeon hd 6970), to reach speeds of nearly a billion per second or more this at up to 63 bits, maybe 64 bits. 93M/s is very disappointing i'd argue. Vincent Last fiddled with by diep on 2011-04-11 at 10:00 |
|
|
|
|
|
|
#13 | |
|
Jul 2010
2·5 Posts |
No I mean the 6870 at 1000 Mhz, overclocking hasn't produced any errors but you may do what you wish :).
The code is constant up to 71-bits... I and many others would be interested to see code that can work at 1B/second with these factors :) Chris Quote:
|
|
|
|
|
|
|
#14 | |
|
Jul 2010
2×5 Posts |
Right, but 24-bit instructions get issued in 4-lanes, wheres 32-bits go in one lane. 4 times the throughput.. basic stuff here. It's easy to see the difference if you profile code with 32-bit versus 24-bit multiplies. Fermi can do 32-bit at full-speed, whereas ATI cannot. The code already uses the high and low instructions in other places.
Quote:
|
|
|
|
|
|
|
#15 | |
|
Sep 2006
The Netherlands
80710 Posts |
Quote:
Besides the crucial first 64 bits you cover with 2 units using 2 x 32 bits, versus 3 x 32 bits for the 24 bits equivalent. But of course we agree that one should test it all very well, as that'll decide what and how :) Realize with wagstaff we're still trial factoring around the 10M bits sizes, so getting to 64 bits is already quite something, as default it's 61 bits now. thanks to the hard work here of the gpu's doing TF for mersenne, did i read it correct that for GIMPS things get TF'ed now at the current range to 69 bits isn't it? Thanks, Vincent |
|
|
|
|
|
|
#16 |
|
Sep 2006
The Netherlands
3·269 Posts |
I calculated with 64 bits of course, as my goal is to Trial Factor to 64 bits of course, where 61 bits got accomplished by my 16 core AMD @ 16 cores, as that's about what you need to run TF to keep Jeff Gilchrist' fed enough exponents to test; maybe it's cold during winters in Canada!
|
|
|
|
|
|
#17 | |
|
Sep 2006
The Netherlands
3·269 Posts |
Quote:
As serious note on the 1B/s. Let's consider i've got 1536 cores @ 880Mhz. Now let's skip silly definitions of AMD/Nvidia (both multiply their gpu's with a phantom factor to determine number of gflops, in case of AMD that's in the manual explained as well) and look realistic. The ideal throughput is 1.0 per PE. To reach 1B/s we can throw effectively a tad less, in my case as i'm intending doing the sieving in another wavefront, than 1536 * 0.88 cycles per FC into battle. That's 1351 cycles for a 64 bits FC. On a cpu, 1351 cycles is hell of a lot. Within 1351 cycles you can calculate for example what distance it takes to travel around the planet. Even a small bathroom stop is allowed in fact. I'm testing numbers of around 10-12M now, which all already have been processed to 61 bits. Let's take a 10M number. That's 24 bits deep. So to see whether a 64 bits number divides this, let's say roughly we need roughly 37 squares and 37 modulo's. Using barrett (note i remember one day i had reinvented barrett myself - see GMP mailing list - it beated for big numbers the default modulo at the time) that means extension of this 128 bits number to 192 bits. So another multiplication. Let's assume now stupid schoolboy algorithm, which for GPU's usually is not such a bad idea (huh?). Let's calculate everything in units for 1 step: First we have 2 units that we get to 4. That's say 4 multiplications. Then we have 4 units times 2, that's 8 multiplications. So 12 multiplications in total. 37 * 12 = 444 multiplications, and of course a lot of extra work, but all the extra work should go at full PE speed. I wrote things down in a naive manner here, without further indepth research yet, but i hope you see what i was guessing that would be theoretical possible. Now i also know the theoretist problem - paper supports everything But what's your estimate on what is possible? Realize so far i didn't count the sieve yet and that thing isn't exactly 'free' either. Regards, Vincent |
|
|
|
|
|
|
#18 | |
|
Jul 2010
2·5 Posts |
My code currently runs (on the 6870 not 6970) in 2334 clock cycles. Roughly 1/10th of that is in the squaring and the remainder in the modulus calculation.
In terms of what could be done at only 64-bits, I believe you could get 400M if you refined my code. The main limitations are going to be the carry steps (lots of serial dependencies that limit your ALU packing) and subtraction steps (no integrated carry registers in the alu's that are exposed to opencl.) You would also run into problems solely relying on 64-bit(2X32) multiplies as you need guard bits for the carries (no carry register in opencl). So for 61-or-62 bit factors 400M might be possible with a 6970 fairly easily. But again, then you have to worry about precalculating the mu (like mkafct) on the GPU side to really get good CPU thoroughput. In short 400M I think I could do at 64 exclusive, but it's hard to know! Quote:
|
|
|
|
|
|
|
#19 | ||
|
Sep 2006
The Netherlands
3×269 Posts |
Quote:
2334 clockcycles @ 1120 PE's ==> 480M /s Yet you report 93M/s which actually is in number of cycles per FC: 1120G / 93M = 1120k / 93 = 12k cycles per FC. Quote:
Last fiddled with by diep on 2011-04-12 at 08:31 |
||
|
|
|
|
|
#20 | |
|
Sep 2006
The Netherlands
14478 Posts |
Quote:
Reality is that being the first writing something always is toughest. REFINING ![]() As i'm after 64 bits initially, i'll have to start from scratch anyway. It's not needed to have guard bits at all of course, just it's a lot more expensive than just adding the carry. Much depends upon the throughput independant instructions at the gpu's. When writing a special implementation for 61-63 bits or so, so something less than 64, you will need to shiftright top bits in order to add them. That might not be so fast. It might be a problem for 2 reasons, but of course it would require alternatives to be there. These gpu's have fast CMOV capabilities i hope. The instructionset of amd already for years shows at gpu level that they have really a lot of different instructions for that. Now what i don't know is how many PE's actually can execute it. Manual carry: adding A and B: Code:
carry = 0; y = A|B; x = A+B; if( x < y ) carry = 1; ![]() Till so far for now.
Last fiddled with by diep on 2011-04-12 at 11:42 |
|
|
|
|
|
|
#21 | |
|
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2×17×347 Posts |
Quote:
Code:
/* add with carry-out function.
Result of addition is stored where the first argument points.
Carry_out is returned as the value of the function.
*/
unsigned adc (unsigned *sum, unsigned a, unsigned b)
{
*sum = a+b;
return *sum < a; /* sum is unsigned less than a if the addition overflowed */
}
|
|
|
|
|
|
|
#22 | |
|
Sep 2006
The Netherlands
3·269 Posts |
Quote:
Yet putting it in a seperated function i guess is a lot slower in opencl. i didn't really test it to the limit, but i'd expect some crap compiler/interpreter running it, so all sorts of modern features todays compilers have i wouldn't expect it to do very well :) |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Extracting the Modulus from publickeyblob RSA 512 | 26B | Homework Help | 2 | 2014-11-30 07:31 |
| Free Trials of GPU Cloud Computing Resources | NBtarheel_33 | GPU to 72 | 9 | 2013-07-31 15:32 |
| Guantanamo trials to be restarted | garo | Soap Box | 39 | 2011-03-22 23:07 |
| It's possible to calculate an unknown RSA modulus? | D2MAC | Math | 8 | 2010-12-26 16:32 |
| Factorization of a 768-bit RSA modulus | akruppa | Factoring | 53 | 2010-03-12 13:50 |