20050923, 07:38  #1 
Jul 2005
2^{3}·5 Posts 
Numbertheoretic FPGA function implementation suggestions?
Hi all,
I'm not sure where to post the following question. I'm open to suggestions and the moderators are welcome to simply move this post to the appropriate subforum. Thank you. Q: Has anyone ever considered implementing portions of the FFT algorithm that I believe is (or used to be) included in Prime95 and were written in assembly language for speed  has anyone ever considered implementing them in hardware; ie, in Verilog HDL on an FPGA for example? I remember looking at the code and thinking it would be ideal for implementation on a closely coupled FPGA (perhaps one with a builtin PCI interface). Since then, just for the fun of it I've learned enough Verilog to implement both Pollard Rho and ECM in Verilog HDL for 640 and 384 bitwidth numbers respectively, and am awaiting delivery of a an Actel ProASIC3 development board so I can try them out (it will take two months to get the board!). Meanwhile, I'm open to suggestions for interesting mathematical algorithms that would be useful to have implemented at the hardware level in an FPGA. Regards, Ron 
20050923, 11:47  #2 
Aug 2002
Buenos Aires, Argentina
31·43 Posts 
I think you need to post charts comparing the times your ECM implementation in hardware takes against known implementations in software, for example GMPECM (which are cheaper since no additional hardware should be bought).
For example, fill in these blanks for 384bit numbers: Code:
rdotson FPGA (??? MHz) GMPECM (Pentium 4 2.4 GHz) B1 = xxx .... us .... us B1 = yyy .... us .... us B1 = zzz .... us .... us 
20050923, 13:27  #3  
Jul 2005
2^{3}·5 Posts 
Quote:
I'll tell you what alpertron, if you'll provide some B1 numbers and timings for GMPECM, I'll provide them for my FPGA implementation as soon as I get it actually running on real hardware. At the moment I'm just looking for new design ideas. Here's something for you to think about however alpertron: If GMPECM s/w were faster than my ECM implementation in hardware, I would simply get a copy of the GMPECM source code (I assume it's available somewhere), and convert THAT code into Verilog HDL for execution in hardware and I would be guaranteed to be at least 100 to 1000 times faster than the software version of GMPECM. It turns out that some designs (Pollard Rho and ECM are two examples) are ideally suited to FPGA implementation, whereas others would be essentially impossible because of huge storage requirements or other factors. People (computer programmers in general, and myself in particular when I was a computer programmer) tend to think of Verilog HDL as hardware design and think that it's difficult just because of the words "hardware design." In actual fact, Verilog is nothing more than a very primitive programming language with some very unique features (and notable LACK of some features) that any C/C++ programmer could pick up in a few days like I did. In fact the precedence rules, and most of the operators are the same in both C and Verilog. Yes, there are lot's of timing issues with FPGAs that you have to worry about that you needn't worry about with a conventional sequential computer program, but just because Verilog is extremely tedious to code doesn't mean that it's difficult to do. In fact there are only a handful of constructs in the language  far fewer than in most other languages. In any case I'm not looking to get into some sort of chest beating contest, I'm just looking for ideas for numeric applications that would be suitable for FPGA implementation to keep me entertained while I'm awaiting my new development board. I may or may not use any ideas that may or may not be submitted. That is strictly my own option because I'm not getting paid for any of this. Regards, Ron 

20050923, 13:51  #4 
Aug 2002
Buenos Aires, Argentina
31·43 Posts 
Some timings for GMPECM 6 in a particular 384bit number on a Pentium 4 2.4 GHz:
B1 = 11000, B2 = 1100000 > step 1: 153 ms, step 2: 111 ms B1 = 50000, B2 = 5000000 > step 1: 700 ms, step 2: 350 ms B1 = 250000, B2 = 25000000 > step 1: 3560 ms, step 2: 960 ms 
20050923, 23:56  #5 
Jul 2005
2^{3}·5 Posts 
P.S.
By the way, lest anyone think I'm some kind of crackpot or troll, the thread that started this whole thing (my interest in Verilog HDL for factoring) can be found over on the Yahoo PrimeNumbers forum at:
http://groups.yahoo.com/group/primen.../message/16775 I've also posted a couple of followups since then at: http://groups.yahoo.com/group/primen.../message/16896 and http://groups.yahoo.com/group/primen.../message/17028 I am a retired computer software engineer with 35+ years of experience developing things like embedded systems for spacecraft and device drivers for prototype hardware and I can easily prove it to anyone who wishes to contact me privately. Now that I'm retired I have the time to do things that I want to do just for fun, instead of being paid to do things are aren't so interesting to me.  Ron 
20050924, 01:04  #6 
Aug 2002
2^{6}×5 Posts 
A hardware siever for NFS or QS could probably get quite fast.

20050924, 04:38  #7  
Tribal Bullet
Oct 2004
3489_{10} Posts 
Quote:
There were several papers in the late 90s on hardware QS, and several more recent papers that proposed ASIC solutions to NFS sieving for RSAcontestsize problems. The latter assume wafer scale integration containing gigs of DRAM. My vote for something cool to implement is a numbertheoretic FFT using modular math. The problem can be broken up into large chunks that can fit in an FPGA to run at super speed, and there's a lot of IO but the access pattern is pretty regular. This is a pretty big job though; at least you'd be able to run LL tests with it when you're done. Actually I think somebody has already implemented a first draft of this on a really big Xilinx. jasonp 

20050924, 05:13  #8  
Jul 2005
2^{3}·5 Posts 
Quote:
Yes that's true, but it would probably take an ASIC rather than an FPGA to do it and those things are out of the question because of cost. Even so, it would probably spend all of it's time waiting on data to be transferred to and from external media (probably some mix of RAM and Disk Space). Not to mention that mapping a direct interface from the FPGA to the disk controller (there is no operating system inside an FPGA) would probably be a great deal of work in itself. I was thinking more in terms of some repetitive arithmetic operation that is perhaps part of NFS or QS or some other factoring program, but that takes place with localized data and is computationally intensive. Maybe a subroutine within the greater NFS and QS programs for example. I don't know enough about the way it's actually implemented in these huge factoring programs to even know where an FPGA might be appropriate, and that's mainly why I'm here asking. Where an FPGA truly shines is in parallelization. You can actually have multiple processes running concurrently inside an FPGA, and I'm referring to true concurrency not the timeslicing variety used in operating systems. For example, in my IGCDEX module that uses Euclid's extendend algorithm to calculate a GCD and inverse, I simply declare three multipliers and run them all in parallel at the same time. Here's how it's done (this the Verilog source code for my IGCDEX module): Code:
// igcdex.v, (c) Sept 08, 2005, Ron Dotson // Use Euclid's extendend algorithm to calculate: // g=GCD(p1,p2), and s and t such that: g= s*p1 + t*p2 `include "ECM.h" module IGCDEX(reset,clock,p1,p2, s,t,g,finished); parameter N= `L; // Bus Width parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7; input reset, clock; input [N1:0] p1, p2; output reg [N1:0] s, t, g; // "g" is GCD, and t and s are inverses output reg finished; reg resetDivider, resetMul, swappedInputs; reg [N1:0] c1, a2,b2,c2, a3,b3,c3; // multiplier inputs reg [N1:0] b,u,v; reg [2:0] state; wire dividerFinished, mulFinished1, mulFinished2, mulFinished3; wire [2*N1:0] m1, m2, m3; // multiplier outputs wire [N1:0] q; DIVIDER divider (resetDivider,clock,g,b, q,,dividerFinished); // q= iquo(g/b) MULTIPLIER mul1 (resetMul,clock,q,b,c1, m1,mulFinished1); // m1= (a1*b1+c1) MULTIPLIER mul2 (resetMul,clock,q,u,c2, m2,mulFinished2); // m2= (a2*b2+c2) MULTIPLIER mul3 (resetMul,clock,q,v,c3, m3,mulFinished3); // m3= (a3*b3+c3) always @(posedge clock) begin if (reset==1) begin finished<=0; resetDivider<=0; // ensure these start off low resetMul<=0; g<= p1[N1]? p1: p1; // g= abs(p1) b<= p2[N1]? p2: p2; // b= abs(p2) t<= p1[N1]? 1: 1; // t= sign(p1) s<= 0; u<= 0; v<= p2[N1]? 1: 1; // v= sign(p2) resetMul<= 0; resetDivider<= 0; state<=S0; //$display("IGCDEX 10: RESET p1=%X, p2=%X\n", p1,p2); end else begin case (state) S0: begin //$display("IGCDEX 20: g=%d, b=%d, t=%X, v=%d\n", g,b,t,v); if (g < b) begin g<= b; // swap operands so that g>=b b<= g; t<= v; v<= t; end state<= S1; end S1: begin // while b<>0 if (b) // if b!=0 continue loop, else exit begin resetDivider<= 1; // enable divider, q= iquo(g/b) state<= S2; end else state<= S4; // finished end S2: begin resetDivider<= 0; if (dividerFinished) begin //$display("IGCDEX 30: q=%X, g=%d, b=%d (q= g/b)\n", q,g,b); c1<= g; c2<= t; c3<= s; resetMul<= 1; // enable multipliers state<= S3; end end S3: begin resetMul<= 0; if (mulFinished1 & mulFinished2 & mulFinished3) begin if (m1[N]!=m1[N1]  m2[N]!=m2[N1]  m3[N]!=m3[N1]) begin // overflow //$display("\nIGCDEX xx: OVERFLOW!\n"); //$display("\tm1=%X, m2=%X, m3=%X\n", m1, m2, m3); //$display("\nIGCDEX Inputs: p1=%X, p2=%X\n", p1,p2); $stop(); end g<= b; b<= m1[N1:0]; t<= u; u<= m2[N1:0]; // m2 = x s<= v; v<= m3[N1:0]; state<= S1; // end of while loop //$display("IGCDEX 30: Looping Back; q=%d, g=%d, b=%d\n\n\n", q,g,b); end end S4: begin finished<= 1; end endcase end end endmodule // IGCDEX This was my first "real" Verilog program of any size so please do not think the IGCDEX module above is a sterling example of Verilog. I'm still learning, and I'm sure there may be much better ways to do the same things I've come up with  and in fact I'll probably continue trying to improve my code as I learn more about Verilog. Regards, Ron 

20050924, 05:45  #9  
Jul 2005
2^{3}×5 Posts 
Quote:
In any case however, if you or anyone could point me to an example of a working program that implements an FFT using modular math that's written in a conventional programming language, then I'll be happy to give it a try provided it looks doable to me. I need a conventional program to start with so that I have a baseline against which I can test my Verilog code output, and to generate test cases, and to generate intermediate results for debugging. Thanks for your suggestion, Ron 

20050924, 07:24  #10 
Jul 2005
2^{3}×5 Posts 
I just had a funny idea. I may be bordering on the verge of getting booted off this forum, but it occurred to me that the only way I'm going to convince the mods to start a separate subforum for Hardware Design Languages (Verilog, VHDL, SystemC) as I mentioned in another thread somewhere, is to turn ALL of you into VLSI designers.
Here's how I'm going to do it. First, go get a copy of the free Icarus Verilog simulator from http://www.icarus.com/eda/verilog/ Next, copy the "igcdex.v" file I posted above and the two I'm posting with this message (Ecm.h and testIgcdex.v) into a directory on your computer somewhere. Install the Icarus simulator, and then run my test program (testIgcdex.v) using the simulator. Remove the "//" comment marks from the $display lines if you like (ie; change them from //$display to just $display) so that you can see the progress of the program in clock cycle time. Tinker around with it. Change the bus width and other parameters, and play with it until you're hooked and start writing your own Verilog code!  Ron Ecm.h Code:
// Ecm.h `define L 32 // Bus Width `define K 31 // Bus Width1 Code:
// testIgcdex.v, test IGCDEX module // (c) July 13, 2005, Ron Dotson `include "ECM.h" module test; parameter N= `L; // Bus Width reg reset, clock; reg [N1:0] in1, in2; reg [2*N1:0] expected; wire [2*N1:0] answer; wire ready; integer clocks; reg [N1:0] x, y; reg [N1:0] g_expected, a_expected, b_expected; wire [N1:0] g, a, b; wire finished; IGCDEX igcdex1(reset,clock,x,y, a,b,g,finished); // g=GCD(x,y), g= a*x + b*y initial begin clock=0; clocks=0; reset=0; // Test Case 1 //x= 34011; // setup inputs (0x84DB) //y= 24867; // setup inputs (0x6123) v //g_expected= 9; // g:=GCD(x,y) //a_expected= 1270; //b_expected= 1737; // Test Case 2 //x= 6; // setup inputs //y= 52961; //g_expected= 1; // g:=GCD(x,y) //a_expected= 1; // g= a*x + b*y //b_expected= 8827; // Test Case 3 //x= 14722; // setup inputs //y= 52961; //g_expected= 1; // g:=GCD(x,y) //a_expected= 4933; // g= a*x + b*y //b_expected= 17746; // Test Case 4 x= 0; // setup inputs y= 52961; g_expected= 52961; // g:=GCD(x,y) a_expected= 0; // g= a*x + b*y b_expected= 1; $display("\ntestIgcdex: Starting Test, x=%X, y=%X\n",x,y); #10 reset=1; #10 clock=1; // latch inputs #10 reset=0; #10 clock=0; while (~finished) begin #10 clock=1; #10 clock=0; clocks=clocks+1; if (clocks > 100000) $stop(); end if (g===g_expected & a===a_expected & b===b_expected) begin $display("\ntestIgcdex: PASS\n"); end else begin $display("\ntestIgcdex: FAIL"); $display("\tInputs to IGCDEX: x=%d, y=%d", x,y); $display("\tOutputs: g=%d, a=%d, b=%X hex, Where: g=IGCDEX(x,y,'a,'b')", g, a, b); $display("\tExpected: g=%d, a=%d, b=%X hex\n", g_expected, a_expected, b_expected); end end endmodule 
20050924, 07:42  #11 
Jul 2005
101000_{2} Posts 
Oops, two more files you'll need:
multiplier.v
Code:
//================================================================== // multiplier.v, Compute: out= in1*in2 + in3; // (c) Sept 17, 2005, Ron Dotson //================================================================== `include "ECM.h" module MULTIPLIER(reset,clock,in1,in2,in3, accum,ready); parameter N= `L; // Bus Width parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7; input reset, clock; input [N1:0] in1, in2, in3; output reg [2*N1:0] accum; output reg ready; reg accumNegative; // of "accum" output reg [2:0] state; reg [N1:0] a, i, in2Minus, allOnes; reg [2*N1:0] c, bShifted; // bShifted= in2*2^i always @(posedge clock) begin if (reset==1) begin ready<=0; in2Minus<= in2; accum<= 0; allOnes<= 1; state<= S0; end // "if (reset)" else begin case (state) S0: begin if ((in1[N1:0]) & (in2[N1:0])) // Neither in1 nor in2 = zero begin i<=0; accumNegative<= in1[N1] ^ in2[N1]; state<= S1; end else // either in1==0 or in2==0 begin state<= S2; end end S1: begin // neither in1 nor in2 = zero state<= S3; if (in1[N1]) // if in1 negative begin a<= in1; // make positive end else begin a<= in1; end if (in2[N1]) // if in2 negative begin bShifted[2*N1:0]<= {`L'd0,in2Minus}; // make positive end else begin bShifted[2*N1:0]<= {`L'd0,in2[N1:0]}; end if (in3[N1]) // negative (NOTE THAT "c" RETAINS IT'S SIGN) begin c[2*N1:0]<= {allOnes[N1:0],in3[N1:0]}; // extend to double precision end else // else positive begin c[2*N1:0]<= {`L'd0,in3[N1:0]}; // extend to double precision end end S2: begin // either in1==0 or in2==0 if (in3[N1]) // if in3 negative begin accum<= {allOnes[N1:0],in3[N1:0]}; end else begin accum<= {`L'd0,in3[N1:0]}; end state<=S4; // Finished! end S3: begin if (i < N) begin state<= S3; // Loop //$display("\tOperands: in1=%X, in2=%X, in3=%X\n\taccum=%d, i=%d\n", in1,in3,in3,accum,i); bShifted<= {bShifted[2*N2:0],1'b0}; // shift left one if (a[i]) accum<= accum + bShifted; i<= i+1; end else // accum= in1*in2 begin state<= S4; // Finished! if (accumNegative) accum<= c  accum; else accum<= accum + c; end end S4: begin ready<=1; // Finished! end endcase end end // "always @(posedge clock)" endmodule // Multiplier Code:
// divider.v, return: dividend= op1/op2, remainder= op1 % op2; // Note: I wrote this division routine because I needed a quick and dirty way // to do a divide. Eventually I'll replace it with something more efficient // but it's performance is not too shabby. It basically takes advantage // of the equivalence between binary powering and shifting. One of the operands // ("Little_ShiftedLeft" in this case) is shifted left by one bit position per clock, // and then right by one bit position per clock // (c) July 02, 2005, Ron Dotson //`include "ECM.h" `include "ECM.h" module DIVIDER(reset,clock,op1,op2, dividend,remainder,finished); parameter N= `L; // Bus Width input reset, clock; input [N1:0] op1, op2; output reg [N1:0] dividend, remainder; output reg finished; reg [N1:0] Big, Little_ShiftedLeft, shiftMask; reg shiftDir; // shift direction: 0=Left, 1=Right always @(posedge clock) if (reset==1) begin Little_ShiftedLeft= 0; shiftMask=1; // shiftMask= 2^x, Where x= #Bit positions "Little" has been shifted left shiftDir=0; // Note that "Little_ShiftedLeft" = shiftMask*Little finished=0; dividend=0; remainder=0; Big= op1; Little_ShiftedLeft= op2; end else begin if (~finished) begin // on *both* posedge and negedge of clock if (~shiftDir) // if shifting Left begin if (Big >= Little_ShiftedLeft) begin //$display("DIVIDER 1: Big=0x%X, Little_ShiftedLeft=0x%X", Big,Little_ShiftedLeft); Big= BigLittle_ShiftedLeft; // subtractions while shifting left are for "free," so may as well do it dividend= dividend + shiftMask; Little_ShiftedLeft= {Little_ShiftedLeft[N2:0],1'b0}; // shift "Little_ShiftedLeft" Left one shiftMask= {shiftMask[N2:0],1'b0}; // shift "shiftMask" Left one //$display("DIVIDER 2: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end else // Big<Little_ShiftedLeft; so start shifting Right begin shiftDir=1; // shift Right Little_ShiftedLeft= {1'b0,Little_ShiftedLeft[N1:1]}; // shift "Little_ShiftedLeft" Right one shiftMask= {1'b0,shiftMask[N1:1]}; // shift "shiftMask" Right one //$display("DIVIDER 3: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end end // end shifting Left else // shifting Right begin if (shiftMask) // if shiftMask!=0 and shifting Right begin if (Big >= Little_ShiftedLeft) begin Big= BigLittle_ShiftedLeft; dividend= dividend + shiftMask; //$display("DIVIDER 4: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end Little_ShiftedLeft= {1'b0,Little_ShiftedLeft[N1:1]}; // shift "Little_ShiftedLeft" Right one shiftMask= {1'b0,shiftMask[N1:1]}; // shift "shiftMask" Right one //$display("DIVIDER 5: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); end else // shifting Right and shiftMask=0 => Finished begin remainder= Big; finished= 1; //$display("\nDIVIDER 6: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend); //$display("DIVIDER 7: dividend=%d (0x%X), remainder=%d (0x%X)", dividend,dividend,remainder,remainder); end end // shifting Right end // "~finished" end endmodule 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Basic Number Theory 7: idempotents and Euler's function  Nick  Number Theory Discussion Group  17  20161201 14:27 
Suggestions on good number theory books???  Erasmus  Other Mathematical Topics  12  20100413 23:33 
Number Theoretic Transforms  paulunderwood  Software  7  20060919 15:56 
ECM/FPGA Implementation  rdotson  Hardware  12  20060326 22:58 
Search for a number theoretic function related to "prime divisor sums"  juergen  Math  2  20040710 23:01 