mersenneforum.org Number-theoretic FPGA function implementation suggestions?
 Register FAQ Search Today's Posts Mark Forums Read

 2005-09-23, 07:38 #1 rdotson     Jul 2005 23·5 Posts Number-theoretic FPGA function implementation suggestions? Hi all, I'm not sure where to post the following question. I'm open to suggestions and the moderators are welcome to simply move this post to the appropriate sub-forum. Thank you. Q: Has anyone ever considered implementing portions of the FFT algorithm that I believe is (or used to be) included in Prime95 and were written in assembly language for speed - has anyone ever considered implementing them in hardware; ie, in Verilog HDL on an FPGA for example? I remember looking at the code and thinking it would be ideal for implementation on a closely coupled FPGA (perhaps one with a built-in PCI interface). Since then, just for the fun of it I've learned enough Verilog to implement both Pollard Rho and ECM in Verilog HDL for 640 and 384 bit-width numbers respectively, and am awaiting delivery of a an Actel ProASIC3 development board so I can try them out (it will take two months to get the board!). Meanwhile, I'm open to suggestions for interesting mathematical algorithms that would be useful to have implemented at the hardware level in an FPGA. Regards, Ron
 2005-09-23, 11:47 #2 alpertron     Aug 2002 Buenos Aires, Argentina 31·43 Posts I think you need to post charts comparing the times your ECM implementation in hardware takes against known implementations in software, for example GMP-ECM (which are cheaper since no additional hardware should be bought). For example, fill in these blanks for 384-bit numbers: Code:  rdotson FPGA (??? MHz) GMP-ECM (Pentium 4 2.4 GHz) B1 = xxx .... us .... us B1 = yyy .... us .... us B1 = zzz .... us .... us where xxx, yyy, zzz are possible values of B1 in your implementation. You can also cite your values of B2.
2005-09-23, 13:27   #3
rdotson

Jul 2005

23·5 Posts

Quote:
 Originally Posted by alpertron I think you need to post charts...
Actually alpertron, I needn't do anything at all. Nor am I interested in rehashing designs that I've already completed and am waiting to get some hardware to implement them on so I can get some timing figures on realistic numbers like you're asking for. At present, both designs work fine in simulation mode with small numbers, but since the simulator runs thousands of times slower (or less) than a real FPGA, I cannot use it to get timing figures on large numbers. That will have to wait until I'm able to actually implement my designs on real hardware. The synthesis reports (area usage, propagation delays, etc) all look fine however so I don't anticipate any major problems.

I'll tell you what alpertron, if you'll provide some B1 numbers and timings for GMP-ECM, I'll provide them for my FPGA implementation as soon as I get it actually running on real hardware. At the moment I'm just looking for new design ideas.

Here's something for you to think about however alpertron: If GMP-ECM s/w were faster than my ECM implementation in hardware, I would simply get a copy of the GMP-ECM source code (I assume it's available somewhere), and convert THAT code into Verilog HDL for execution in hardware and I would be guaranteed to be at least 100 to 1000 times faster than the software version of GMP-ECM. It turns out that some designs (Pollard Rho and ECM are two examples) are ideally suited to FPGA implementation, whereas others would be essentially impossible because of huge storage requirements or other factors. People (computer programmers in general, and myself in particular when I was a computer programmer) tend to think of Verilog HDL as hardware design and think that it's difficult just because of the words "hardware design." In actual fact, Verilog is nothing more than a very primitive programming language with some very unique features (and notable LACK of some features) that any C/C++ programmer could pick up in a few days like I did. In fact the precedence rules, and most of the operators are the same in both C and Verilog. Yes, there are lot's of timing issues with FPGAs that you have to worry about that you needn't worry about with a conventional sequential computer program, but just because Verilog is extremely tedious to code doesn't mean that it's difficult to do. In fact there are only a handful of constructs in the language - far fewer than in most other languages.

In any case I'm not looking to get into some sort of chest beating contest, I'm just looking for ideas for numeric applications that would be suitable for FPGA implementation to keep me entertained while I'm awaiting my new development board. I may or may not use any ideas that may or may not be submitted. That is strictly my own option because I'm not getting paid for any of this.

Regards,

Ron

 2005-09-23, 13:51 #4 alpertron     Aug 2002 Buenos Aires, Argentina 31·43 Posts Some timings for GMP-ECM 6 in a particular 384-bit number on a Pentium 4 2.4 GHz: B1 = 11000, B2 = 1100000 -> step 1: 153 ms, step 2: 111 ms B1 = 50000, B2 = 5000000 -> step 1: 700 ms, step 2: 350 ms B1 = 250000, B2 = 25000000 -> step 1: 3560 ms, step 2: 960 ms
 2005-09-23, 23:56 #5 rdotson     Jul 2005 23·5 Posts P.S. By the way, lest anyone think I'm some kind of crackpot or troll, the thread that started this whole thing (my interest in Verilog HDL for factoring) can be found over on the Yahoo PrimeNumbers forum at: http://groups.yahoo.com/group/primen.../message/16775 I've also posted a couple of followups since then at: http://groups.yahoo.com/group/primen.../message/16896 and http://groups.yahoo.com/group/primen.../message/17028 I am a retired computer software engineer with 35+ years of experience developing things like embedded systems for spacecraft and device drivers for prototype hardware and I can easily prove it to anyone who wishes to contact me privately. Now that I'm retired I have the time to do things that I want to do just for fun, instead of being paid to do things are aren't so interesting to me. -- Ron
 2005-09-24, 01:04 #6 ColdFury     Aug 2002 26×5 Posts A hardware siever for NFS or QS could probably get quite fast.
2005-09-24, 04:38   #7
jasonp
Tribal Bullet

Oct 2004

348910 Posts

Quote:
 Originally Posted by ColdFury A hardware siever for NFS or QS could probably get quite fast.
I'm not so sure. If some part of the memory needed by the algorithm could fit completely within the FPGA (either the factor base or the sieve interval) then it would really fly, but for interesting size problems I think that's impossible. You'd be IO limited, and a desktop microprocessor will cycle memory a lot faster than an FPGA can.

There were several papers in the late 90s on hardware QS, and several more recent papers that proposed ASIC solutions to NFS sieving for RSA-contest-size problems. The latter assume wafer scale integration containing gigs of DRAM.

My vote for something cool to implement is a number-theoretic FFT using modular math. The problem can be broken up into large chunks that can fit in an FPGA to run at super speed, and there's a lot of IO but the access pattern is pretty regular. This is a pretty big job though; at least you'd be able to run LL tests with it when you're done. Actually I think somebody has already implemented a first draft of this on a really big Xilinx.

jasonp

2005-09-24, 05:13   #8
rdotson

Jul 2005

23·5 Posts

Quote:
 Originally Posted by ColdFury A hardware siever for NFS or QS could probably get quite fast.
Hi ColdFury,

Yes that's true, but it would probably take an ASIC rather than an FPGA to do it and those things are out of the question because of cost. Even so, it would probably spend all of it's time waiting on data to be transferred to and from external media (probably some mix of RAM and Disk Space). Not to mention that mapping a direct interface from the FPGA to the disk controller (there is no operating system inside an FPGA) would probably be a great deal of work in itself.

I was thinking more in terms of some repetitive arithmetic operation that is perhaps part of NFS or QS or some other factoring program, but that takes place with localized data and is computationally intensive. Maybe a subroutine within the greater NFS and QS programs for example. I don't know enough about the way it's actually implemented in these huge factoring programs to even know where an FPGA might be appropriate, and that's mainly why I'm here asking.

Where an FPGA truly shines is in parallelization. You can actually have multiple processes running concurrently inside an FPGA, and I'm referring to true concurrency not the time-slicing variety used in operating systems. For example, in my IGCDEX module that uses Euclid's extendend algorithm to calculate a GCD and inverse, I simply declare three multipliers and run them all in parallel at the same time. Here's how it's done (this the Verilog source code for my IGCDEX module):
Code:
// igcdex.v, (c) Sept 08, 2005, Ron Dotson
// Use Euclid's extendend algorithm to calculate:
// g=GCD(p1,p2), and s and t such that: g= s*p1 + t*p2

include "ECM.h"

module IGCDEX(reset,clock,p1,p2,  s,t,g,finished);
parameter N= L; // Bus Width
parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7;
input       reset, clock;
input       [N-1:0] p1, p2;
output reg  [N-1:0] s, t, g;    // "g" is GCD, and t and s are inverses
output reg  finished;

reg         resetDivider, resetMul, swappedInputs;
reg         [N-1:0] c1, a2,b2,c2, a3,b3,c3; // multiplier inputs
reg         [N-1:0] b,u,v;
reg         [2:0] state;

wire        dividerFinished, mulFinished1, mulFinished2, mulFinished3;
wire        [2*N-1:0] m1, m2, m3;  // multiplier outputs
wire        [N-1:0] q;

DIVIDER    divider (resetDivider,clock,g,b,  q,,dividerFinished); // q= iquo(g/b)
MULTIPLIER mul1    (resetMul,clock,q,b,c1,   m1,mulFinished1); // m1= (a1*b1+c1)
MULTIPLIER mul2    (resetMul,clock,q,u,c2,   m2,mulFinished2); // m2= (a2*b2+c2)
MULTIPLIER mul3    (resetMul,clock,q,v,c3,   m3,mulFinished3); // m3= (a3*b3+c3)

always @(posedge clock)
begin
if (reset==1)
begin
finished<=0;
resetDivider<=0;   // ensure these start off low
resetMul<=0;
g<= p1[N-1]? -p1: p1;   // g= abs(p1)
b<= p2[N-1]? -p2: p2;   // b= abs(p2)
t<= p1[N-1]?  -1:  1;   // t= sign(p1)
s<= 0;
u<= 0;
v<= p2[N-1]?  -1:  1;   // v= sign(p2)
resetMul<= 0;
resetDivider<= 0;
state<=S0;
//$display("IGCDEX 10: RESET p1=%X, p2=%X\n", p1,p2); end else begin case (state) S0: begin //$display("IGCDEX 20: g=%d, b=%d, t=%X, v=%d\n", g,b,t,v);
if (g < b)
begin
g<= b;    // swap operands so that g>=b
b<= g;
t<= v;
v<= t;
end
state<= S1;
end

S1: begin          // while b<>0
if (|b)     // if b!=0 continue loop, else exit
begin
resetDivider<= 1;  // enable divider, q= iquo(g/b)
state<= S2;
end
else
state<= S4;  // finished
end

S2: begin
resetDivider<= 0;
if (dividerFinished)
begin
//$display("IGCDEX 30: q=%X, g=%d, b=%d (q= g/b)\n", q,g,b); c1<= -g; c2<= -t; c3<= -s; resetMul<= 1; // enable multipliers state<= S3; end end S3: begin resetMul<= 0; if (mulFinished1 & mulFinished2 & mulFinished3) begin if (m1[N]!=m1[N-1] | m2[N]!=m2[N-1] | m3[N]!=m3[N-1]) begin // overflow //$display("\nIGCDEX xx: OVERFLOW!\n");
//$display("\tm1=%X, m2=%X, m3=%X\n", m1, m2, m3); //$display("\nIGCDEX Inputs: p1=%X, p2=%X\n", p1,p2);
$stop(); end g<= b; b<= -m1[N-1:0]; t<= u; u<= -m2[N-1:0]; // -m2 = x s<= v; v<= -m3[N-1:0]; state<= S1; // end of while loop //$display("IGCDEX 30: Looping Back; q=%d, g=%d, b=%d\n\n\n", q,g,b);
end
end

S4: begin
finished<= 1;
end
endcase
end
end
endmodule                  // IGCDEX
You may be wondering why I use modules (separate algorithms) to do multiplication rather than simply using the built-in Verilog "*" multiplication operator (the same as in any other language). The answer is that Verilog will by default create a combinatorial multiplier that explodes the gate count exponentially as you increase the bus-width; whereas, the gate count for the multiplier algorithm I used increases linearly with the bus-width, but it is very slow - requiring one clock cycle per data bit. Also, for a wide bus the interconnects eat you alive. About 87% of the propagation delay in one of my wide bus-width compilations was used simply for internal interconnects between modules inside the FPGA! Incidentally, I always parameterize the bus-width so that I can change it throughout the entire design simply by changing a single number in a header file.

This was my first "real" Verilog program of any size so please do not think the IGCDEX module above is a sterling example of Verilog. I'm still learning, and I'm sure there may be much better ways to do the same things I've come up with - and in fact I'll probably continue trying to improve my code as I learn more about Verilog.

Regards,

Ron

2005-09-24, 05:45   #9
rdotson

Jul 2005

23×5 Posts

Quote:
 Originally Posted by jasonp My vote for something cool to implement is a number-theoretic FFT using modular math. ... Actually I think somebody has already implemented a first draft of this on a really big Xilinx. jasonp
That sounds like it might be an excellent application for an FPGA Jason, and in fact I believe it was my original question that I started this thread with. If you happen to recall who the "somebody" was who has already implemented a first draft of it, please contact me because I would love to see what's already been done before I spend to much time on it. I have no desire to waste my time reinventing the wheel.

In any case however, if you or anyone could point me to an example of a working program that implements an FFT using modular math that's written in a conventional programming language, then I'll be happy to give it a try provided it looks do-able to me. I need a conventional program to start with so that I have a baseline against which I can test my Verilog code output, and to generate test cases, and to generate intermediate results for debugging.

Ron

 2005-09-24, 07:24 #10 rdotson     Jul 2005 23×5 Posts I just had a funny idea. I may be bordering on the verge of getting booted off this forum, but it occurred to me that the only way I'm going to convince the mods to start a separate subforum for Hardware Design Languages (Verilog, VHDL, SystemC) as I mentioned in another thread somewhere, is to turn ALL of you into VLSI designers. Here's how I'm going to do it. First, go get a copy of the free Icarus Verilog simulator from http://www.icarus.com/eda/verilog/ Next, copy the "igcdex.v" file I posted above and the two I'm posting with this message (Ecm.h and testIgcdex.v) into a directory on your computer somewhere. Install the Icarus simulator, and then run my test program (testIgcdex.v) using the simulator. Remove the "//" comment marks from the $display lines if you like (ie; change them from //$display to just $display) so that you can see the progress of the program in clock cycle time. Tinker around with it. Change the bus width and other parameters, and play with it until you're hooked and start writing your own Verilog code! -- Ron Ecm.h Code: // Ecm.h define L 32 // Bus Width define K 31 // Bus Width-1 testIgcdex.v Code: // testIgcdex.v, test IGCDEX module // (c) July 13, 2005, Ron Dotson include "ECM.h" module test; parameter N= L; // Bus Width reg reset, clock; reg [N-1:0] in1, in2; reg [2*N-1:0] expected; wire [2*N-1:0] answer; wire ready; integer clocks; reg [N-1:0] x, y; reg [N-1:0] g_expected, a_expected, b_expected; wire [N-1:0] g, a, b; wire finished; IGCDEX igcdex1(reset,clock,x,y, a,b,g,finished); // g=GCD(x,y), g= a*x + b*y initial begin clock=0; clocks=0; reset=0; // Test Case 1 //x= 34011; // setup inputs (0x84DB) //y= 24867; // setup inputs (0x6123) v //g_expected= 9; // g:=GCD(x,y) //a_expected= -1270; //b_expected= 1737; // Test Case 2 //x= 6; // setup inputs //y= 52961; //g_expected= 1; // g:=GCD(x,y) //a_expected= -1; // g= a*x + b*y //b_expected= 8827; // Test Case 3 //x= -14722; // setup inputs //y= 52961; //g_expected= 1; // g:=GCD(x,y) //a_expected= 4933; // g= a*x + b*y //b_expected= 17746; // Test Case 4 x= 0; // setup inputs y= 52961; g_expected= 52961; // g:=GCD(x,y) a_expected= 0; // g= a*x + b*y b_expected= 1;$display("\ntestIgcdex: Starting Test, x=%X, y=%X\n",x,y); #10 reset=1; #10 clock=1; // latch inputs #10 reset=0; #10 clock=0; while (~finished) begin #10 clock=1; #10 clock=0; clocks=clocks+1; if (clocks > 100000) $stop(); end if (g===g_expected & a===a_expected & b===b_expected) begin$display("\ntestIgcdex: PASS\n"); end else begin $display("\ntestIgcdex: FAIL");$display("\tInputs to IGCDEX: x=%d, y=%d", x,y); $display("\tOutputs: g=%d, a=%d, b=%X hex, Where: g=IGCDEX(x,y,'a,'b')", g, a, b);$display("\tExpected: g=%d, a=%d, b=%X hex\n", g_expected, a_expected, b_expected); end end endmodule
 2005-09-24, 07:42 #11 rdotson     Jul 2005 1010002 Posts Oops, two more files you'll need: multiplier.v Code: //================================================================== // multiplier.v, Compute: out= in1*in2 + in3; // (c) Sept 17, 2005, Ron Dotson //================================================================== include "ECM.h" module MULTIPLIER(reset,clock,in1,in2,in3, accum,ready); parameter N= L; // Bus Width parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7; input reset, clock; input [N-1:0] in1, in2, in3; output reg [2*N-1:0] accum; output reg ready; reg accumNegative; // of "accum" output reg [2:0] state; reg [N-1:0] a, i, in2Minus, allOnes; reg [2*N-1:0] c, bShifted; // bShifted= in2*2^i always @(posedge clock) begin if (reset==1) begin ready<=0; in2Minus<= -in2; accum<= 0; allOnes<= -1; state<= S0; end // "if (reset)" else begin case (state) S0: begin if ((|in1[N-1:0]) & (|in2[N-1:0])) // Neither in1 nor in2 = zero begin i<=0; accumNegative<= in1[N-1] ^ in2[N-1]; state<= S1; end else // either in1==0 or in2==0 begin state<= S2; end end S1: begin // neither in1 nor in2 = zero state<= S3; if (in1[N-1]) // if in1 negative begin a<= -in1; // make positive end else begin a<= in1; end if (in2[N-1]) // if in2 negative begin bShifted[2*N-1:0]<= {L'd0,in2Minus}; // make positive end else begin bShifted[2*N-1:0]<= {L'd0,in2[N-1:0]}; end if (in3[N-1]) // negative (NOTE THAT "c" RETAINS IT'S SIGN) begin c[2*N-1:0]<= {allOnes[N-1:0],in3[N-1:0]}; // extend to double precision end else // else positive begin c[2*N-1:0]<= {L'd0,in3[N-1:0]}; // extend to double precision end end S2: begin // either in1==0 or in2==0 if (in3[N-1]) // if in3 negative begin accum<= {allOnes[N-1:0],in3[N-1:0]}; end else begin accum<= {L'd0,in3[N-1:0]}; end state<=S4; // Finished! end S3: begin if (i < N) begin state<= S3; // Loop //$display("\tOperands: in1=%X, in2=%X, in3=%X\n\taccum=%d, i=%d\n", in1,in3,in3,accum,i); bShifted<= {bShifted[2*N-2:0],1'b0}; // shift left one if (a[i]) accum<= accum + bShifted; i<= i+1; end else // accum= in1*in2 begin state<= S4; // Finished! if (accumNegative) accum<= c - accum; else accum<= accum + c; end end S4: begin ready<=1; // Finished! end endcase end end // "always @(posedge clock)" endmodule // Multiplier divider.v Code: // divider.v, return: dividend= op1/op2, remainder= op1 % op2; // Note: I wrote this division routine because I needed a quick and dirty way // to do a divide. Eventually I'll replace it with something more efficient // but it's performance is not too shabby. It basically takes advantage // of the equivalence between binary powering and shifting. One of the operands // ("Little_ShiftedLeft" in this case) is shifted left by one bit position per clock, // and then right by one bit position per clock // (c) July 02, 2005, Ron Dotson //include "ECM.h" include "ECM.h" module DIVIDER(reset,clock,op1,op2, dividend,remainder,finished); parameter N= L; // Bus Width input reset, clock; input [N-1:0] op1, op2; output reg [N-1:0] dividend, remainder; output reg finished; reg [N-1:0] Big, Little_ShiftedLeft, shiftMask; reg shiftDir; // shift direction: 0=Left, 1=Right always @(posedge clock) if (reset==1) begin Little_ShiftedLeft= 0; shiftMask=1; // shiftMask= 2^x, Where x= #Bit positions "Little" has been shifted left shiftDir=0; // Note that "Little_ShiftedLeft" = shiftMask*Little finished=0; dividend=0; remainder=0; Big= op1; Little_ShiftedLeft= op2; end else begin if (~finished) begin // on *both* posedge and negedge of clock if (~shiftDir) // if shifting Left begin if (Big >= Little_ShiftedLeft) begin //$display("DIVIDER 1: Big=0x%X, Little_ShiftedLeft=0x%X", Big,Little_ShiftedLeft); Big= Big-Little_ShiftedLeft; // subtractions while shifting left are for "free," so may as well do it dividend= dividend + shiftMask; Little_ShiftedLeft= {Little_ShiftedLeft[N-2:0],1'b0}; // shift "Little_ShiftedLeft" Left one shiftMask= {shiftMask[N-2:0],1'b0}; // shift "shiftMask" Left one //$display("DIVIDER 2: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end else // Big= Little_ShiftedLeft) begin Big= Big-Little_ShiftedLeft; dividend= dividend + shiftMask; //$display("DIVIDER 4: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end Little_ShiftedLeft= {1'b0,Little_ShiftedLeft[N-1:1]}; // shift "Little_ShiftedLeft" Right one shiftMask= {1'b0,shiftMask[N-1:1]}; // shift "shiftMask" Right one //$display("DIVIDER 5: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end else // shifting Right and shiftMask=0 => Finished begin remainder= Big; finished= 1; //$display("\nDIVIDER 6: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); //\$display("DIVIDER 7: dividend=%d (0x%X), remainder=%d (0x%X)", dividend,dividend,remainder,remainder); end end // shifting Right end // "~finished" end endmodule`

 Similar Threads Thread Thread Starter Forum Replies Last Post Nick Number Theory Discussion Group 17 2016-12-01 14:27 Erasmus Other Mathematical Topics 12 2010-04-13 23:33 paulunderwood Software 7 2006-09-19 15:56 rdotson Hardware 12 2006-03-26 22:58 juergen Math 2 2004-07-10 23:01

All times are UTC. The time now is 21:51.

Tue Nov 24 21:51:21 UTC 2020 up 75 days, 19:02, 4 users, load averages: 3.07, 3.09, 3.12