![]() |
|
|
#1 |
|
Jul 2005
23·5 Posts |
Hi all,
I'm not sure where to post the following question. I'm open to suggestions and the moderators are welcome to simply move this post to the appropriate sub-forum. Thank you. Q: Has anyone ever considered implementing portions of the FFT algorithm that I believe is (or used to be) included in Prime95 and were written in assembly language for speed - has anyone ever considered implementing them in hardware; ie, in Verilog HDL on an FPGA for example? I remember looking at the code and thinking it would be ideal for implementation on a closely coupled FPGA (perhaps one with a built-in PCI interface). Since then, just for the fun of it I've learned enough Verilog to implement both Pollard Rho and ECM in Verilog HDL for 640 and 384 bit-width numbers respectively, and am awaiting delivery of a an Actel ProASIC3 development board so I can try them out (it will take two months to get the board!). Meanwhile, I'm open to suggestions for interesting mathematical algorithms that would be useful to have implemented at the hardware level in an FPGA. Regards, Ron |
|
|
|
|
|
#2 |
|
Aug 2002
Buenos Aires, Argentina
101111100112 Posts |
I think you need to post charts comparing the times your ECM implementation in hardware takes against known implementations in software, for example GMP-ECM (which are cheaper since no additional hardware should be bought).
For example, fill in these blanks for 384-bit numbers: Code:
rdotson FPGA (??? MHz) GMP-ECM (Pentium 4 2.4 GHz) B1 = xxx .... us .... us B1 = yyy .... us .... us B1 = zzz .... us .... us |
|
|
|
|
|
#3 | |
|
Jul 2005
4010 Posts |
Quote:
I'll tell you what alpertron, if you'll provide some B1 numbers and timings for GMP-ECM, I'll provide them for my FPGA implementation as soon as I get it actually running on real hardware. At the moment I'm just looking for new design ideas. Here's something for you to think about however alpertron: If GMP-ECM s/w were faster than my ECM implementation in hardware, I would simply get a copy of the GMP-ECM source code (I assume it's available somewhere), and convert THAT code into Verilog HDL for execution in hardware and I would be guaranteed to be at least 100 to 1000 times faster than the software version of GMP-ECM. It turns out that some designs (Pollard Rho and ECM are two examples) are ideally suited to FPGA implementation, whereas others would be essentially impossible because of huge storage requirements or other factors. People (computer programmers in general, and myself in particular when I was a computer programmer) tend to think of Verilog HDL as hardware design and think that it's difficult just because of the words "hardware design." In actual fact, Verilog is nothing more than a very primitive programming language with some very unique features (and notable LACK of some features) that any C/C++ programmer could pick up in a few days like I did. In fact the precedence rules, and most of the operators are the same in both C and Verilog. Yes, there are lot's of timing issues with FPGAs that you have to worry about that you needn't worry about with a conventional sequential computer program, but just because Verilog is extremely tedious to code doesn't mean that it's difficult to do. In fact there are only a handful of constructs in the language - far fewer than in most other languages. In any case I'm not looking to get into some sort of chest beating contest, I'm just looking for ideas for numeric applications that would be suitable for FPGA implementation to keep me entertained while I'm awaiting my new development board. I may or may not use any ideas that may or may not be submitted. That is strictly my own option because I'm not getting paid for any of this. Regards, Ron |
|
|
|
|
|
|
#4 |
|
Aug 2002
Buenos Aires, Argentina
101111100112 Posts |
Some timings for GMP-ECM 6 in a particular 384-bit number on a Pentium 4 2.4 GHz:
B1 = 11000, B2 = 1100000 -> step 1: 153 ms, step 2: 111 ms B1 = 50000, B2 = 5000000 -> step 1: 700 ms, step 2: 350 ms B1 = 250000, B2 = 25000000 -> step 1: 3560 ms, step 2: 960 ms |
|
|
|
|
|
#5 |
|
Jul 2005
2816 Posts |
By the way, lest anyone think I'm some kind of crackpot or troll, the thread that started this whole thing (my interest in Verilog HDL for factoring) can be found over on the Yahoo PrimeNumbers forum at:
http://groups.yahoo.com/group/primen.../message/16775 I've also posted a couple of followups since then at: http://groups.yahoo.com/group/primen.../message/16896 and http://groups.yahoo.com/group/primen.../message/17028 I am a retired computer software engineer with 35+ years of experience developing things like embedded systems for spacecraft and device drivers for prototype hardware and I can easily prove it to anyone who wishes to contact me privately. Now that I'm retired I have the time to do things that I want to do just for fun, instead of being paid to do things are aren't so interesting to me. -- Ron |
|
|
|
|
|
#6 |
|
Aug 2002
5008 Posts |
A hardware siever for NFS or QS could probably get quite fast.
|
|
|
|
|
|
#7 | |
|
Tribal Bullet
Oct 2004
5×23×31 Posts |
Quote:
There were several papers in the late 90s on hardware QS, and several more recent papers that proposed ASIC solutions to NFS sieving for RSA-contest-size problems. The latter assume wafer scale integration containing gigs of DRAM. My vote for something cool to implement is a number-theoretic FFT using modular math. The problem can be broken up into large chunks that can fit in an FPGA to run at super speed, and there's a lot of IO but the access pattern is pretty regular. This is a pretty big job though; at least you'd be able to run LL tests with it when you're done. Actually I think somebody has already implemented a first draft of this on a really big Xilinx. jasonp |
|
|
|
|
|
|
#8 | |
|
Jul 2005
23×5 Posts |
Quote:
Yes that's true, but it would probably take an ASIC rather than an FPGA to do it and those things are out of the question because of cost. Even so, it would probably spend all of it's time waiting on data to be transferred to and from external media (probably some mix of RAM and Disk Space). Not to mention that mapping a direct interface from the FPGA to the disk controller (there is no operating system inside an FPGA) would probably be a great deal of work in itself. I was thinking more in terms of some repetitive arithmetic operation that is perhaps part of NFS or QS or some other factoring program, but that takes place with localized data and is computationally intensive. Maybe a subroutine within the greater NFS and QS programs for example. I don't know enough about the way it's actually implemented in these huge factoring programs to even know where an FPGA might be appropriate, and that's mainly why I'm here asking. Where an FPGA truly shines is in parallelization. You can actually have multiple processes running concurrently inside an FPGA, and I'm referring to true concurrency not the time-slicing variety used in operating systems. For example, in my IGCDEX module that uses Euclid's extendend algorithm to calculate a GCD and inverse, I simply declare three multipliers and run them all in parallel at the same time. Here's how it's done (this the Verilog source code for my IGCDEX module): Code:
// igcdex.v, (c) Sept 08, 2005, Ron Dotson
// Use Euclid's extendend algorithm to calculate:
// g=GCD(p1,p2), and s and t such that: g= s*p1 + t*p2
`include "ECM.h"
module IGCDEX(reset,clock,p1,p2, s,t,g,finished);
parameter N= `L; // Bus Width
parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7;
input reset, clock;
input [N-1:0] p1, p2;
output reg [N-1:0] s, t, g; // "g" is GCD, and t and s are inverses
output reg finished;
reg resetDivider, resetMul, swappedInputs;
reg [N-1:0] c1, a2,b2,c2, a3,b3,c3; // multiplier inputs
reg [N-1:0] b,u,v;
reg [2:0] state;
wire dividerFinished, mulFinished1, mulFinished2, mulFinished3;
wire [2*N-1:0] m1, m2, m3; // multiplier outputs
wire [N-1:0] q;
DIVIDER divider (resetDivider,clock,g,b, q,,dividerFinished); // q= iquo(g/b)
MULTIPLIER mul1 (resetMul,clock,q,b,c1, m1,mulFinished1); // m1= (a1*b1+c1)
MULTIPLIER mul2 (resetMul,clock,q,u,c2, m2,mulFinished2); // m2= (a2*b2+c2)
MULTIPLIER mul3 (resetMul,clock,q,v,c3, m3,mulFinished3); // m3= (a3*b3+c3)
always @(posedge clock)
begin
if (reset==1)
begin
finished<=0;
resetDivider<=0; // ensure these start off low
resetMul<=0;
g<= p1[N-1]? -p1: p1; // g= abs(p1)
b<= p2[N-1]? -p2: p2; // b= abs(p2)
t<= p1[N-1]? -1: 1; // t= sign(p1)
s<= 0;
u<= 0;
v<= p2[N-1]? -1: 1; // v= sign(p2)
resetMul<= 0;
resetDivider<= 0;
state<=S0;
//$display("IGCDEX 10: RESET p1=%X, p2=%X\n", p1,p2);
end
else
begin
case (state)
S0: begin
//$display("IGCDEX 20: g=%d, b=%d, t=%X, v=%d\n", g,b,t,v);
if (g < b)
begin
g<= b; // swap operands so that g>=b
b<= g;
t<= v;
v<= t;
end
state<= S1;
end
S1: begin // while b<>0
if (|b) // if b!=0 continue loop, else exit
begin
resetDivider<= 1; // enable divider, q= iquo(g/b)
state<= S2;
end
else
state<= S4; // finished
end
S2: begin
resetDivider<= 0;
if (dividerFinished)
begin
//$display("IGCDEX 30: q=%X, g=%d, b=%d (q= g/b)\n", q,g,b);
c1<= -g;
c2<= -t;
c3<= -s;
resetMul<= 1; // enable multipliers
state<= S3;
end
end
S3: begin
resetMul<= 0;
if (mulFinished1 & mulFinished2 & mulFinished3)
begin
if (m1[N]!=m1[N-1] | m2[N]!=m2[N-1] | m3[N]!=m3[N-1])
begin // overflow
//$display("\nIGCDEX xx: OVERFLOW!\n");
//$display("\tm1=%X, m2=%X, m3=%X\n", m1, m2, m3);
//$display("\nIGCDEX Inputs: p1=%X, p2=%X\n", p1,p2);
$stop();
end
g<= b;
b<= -m1[N-1:0];
t<= u;
u<= -m2[N-1:0]; // -m2 = x
s<= v;
v<= -m3[N-1:0];
state<= S1; // end of while loop
//$display("IGCDEX 30: Looping Back; q=%d, g=%d, b=%d\n\n\n", q,g,b);
end
end
S4: begin
finished<= 1;
end
endcase
end
end
endmodule // IGCDEX
This was my first "real" Verilog program of any size so please do not think the IGCDEX module above is a sterling example of Verilog. I'm still learning, and I'm sure there may be much better ways to do the same things I've come up with - and in fact I'll probably continue trying to improve my code as I learn more about Verilog. Regards, Ron |
|
|
|
|
|
|
#9 | |
|
Jul 2005
23·5 Posts |
Quote:
If you happen to recall who the "somebody" was who has already implemented a first draft of it, please contact me because I would love to see what's already been done before I spend to much time on it. I have no desire to waste my time reinventing the wheel. In any case however, if you or anyone could point me to an example of a working program that implements an FFT using modular math that's written in a conventional programming language, then I'll be happy to give it a try provided it looks do-able to me. I need a conventional program to start with so that I have a baseline against which I can test my Verilog code output, and to generate test cases, and to generate intermediate results for debugging. Thanks for your suggestion, Ron |
|
|
|
|
|
|
#10 |
|
Jul 2005
4010 Posts |
I just had a funny idea. I may be bordering on the verge of getting booted off this forum, but it occurred to me that the only way I'm going to convince the mods to start a separate subforum for Hardware Design Languages (Verilog, VHDL, SystemC) as I mentioned in another thread somewhere, is to turn ALL of you into VLSI designers.
![]() Here's how I'm going to do it. First, go get a copy of the free Icarus Verilog simulator from http://www.icarus.com/eda/verilog/ Next, copy the "igcdex.v" file I posted above and the two I'm posting with this message (Ecm.h and testIgcdex.v) into a directory on your computer somewhere. Install the Icarus simulator, and then run my test program (testIgcdex.v) using the simulator. Remove the "//" comment marks from the $display lines if you like (ie; change them from //$display to just $display) so that you can see the progress of the program in clock cycle time. Tinker around with it. Change the bus width and other parameters, and play with it until you're hooked and start writing your own Verilog code! -- Ron Ecm.h Code:
// Ecm.h `define L 32 // Bus Width `define K 31 // Bus Width-1 Code:
// testIgcdex.v, test IGCDEX module
// (c) July 13, 2005, Ron Dotson
`include "ECM.h"
module test;
parameter N= `L; // Bus Width
reg reset, clock;
reg [N-1:0] in1, in2;
reg [2*N-1:0] expected;
wire [2*N-1:0] answer;
wire ready;
integer clocks;
reg [N-1:0] x, y;
reg [N-1:0] g_expected, a_expected, b_expected;
wire [N-1:0] g, a, b;
wire finished;
IGCDEX igcdex1(reset,clock,x,y, a,b,g,finished); // g=GCD(x,y), g= a*x + b*y
initial
begin
clock=0;
clocks=0;
reset=0;
// Test Case 1
//x= 34011; // setup inputs (0x84DB)
//y= 24867; // setup inputs (0x6123) v
//g_expected= 9; // g:=GCD(x,y)
//a_expected= -1270;
//b_expected= 1737;
// Test Case 2
//x= 6; // setup inputs
//y= 52961;
//g_expected= 1; // g:=GCD(x,y)
//a_expected= -1; // g= a*x + b*y
//b_expected= 8827;
// Test Case 3
//x= -14722; // setup inputs
//y= 52961;
//g_expected= 1; // g:=GCD(x,y)
//a_expected= 4933; // g= a*x + b*y
//b_expected= 17746;
// Test Case 4
x= 0; // setup inputs
y= 52961;
g_expected= 52961; // g:=GCD(x,y)
a_expected= 0; // g= a*x + b*y
b_expected= 1;
$display("\ntestIgcdex: Starting Test, x=%X, y=%X\n",x,y);
#10 reset=1;
#10 clock=1; // latch inputs
#10 reset=0;
#10 clock=0;
while (~finished)
begin
#10 clock=1;
#10 clock=0;
clocks=clocks+1;
if (clocks > 100000)
$stop();
end
if (g===g_expected & a===a_expected & b===b_expected)
begin
$display("\ntestIgcdex: PASS\n");
end
else
begin
$display("\ntestIgcdex: FAIL");
$display("\tInputs to IGCDEX: x=%d, y=%d", x,y);
$display("\tOutputs: g=%d, a=%d, b=%X hex, Where: g=IGCDEX(x,y,'a,'b')", g, a, b);
$display("\tExpected: g=%d, a=%d, b=%X hex\n", g_expected, a_expected, b_expected);
end
end
endmodule
|
|
|
|
|
|
#11 |
|
Jul 2005
23×5 Posts |
multiplier.v
Code:
//==================================================================
// multiplier.v, Compute: out= in1*in2 + in3;
// (c) Sept 17, 2005, Ron Dotson
//==================================================================
`include "ECM.h"
module MULTIPLIER(reset,clock,in1,in2,in3, accum,ready);
parameter N= `L; // Bus Width
parameter S0=0,S1=1,S2=2,S3=3, S4=4,S5=5,S6=6,S7=7;
input reset, clock;
input [N-1:0] in1, in2, in3;
output reg [2*N-1:0] accum;
output reg ready;
reg accumNegative; // of "accum" output
reg [2:0] state;
reg [N-1:0] a, i, in2Minus, allOnes;
reg [2*N-1:0] c, bShifted; // bShifted= in2*2^i
always @(posedge clock)
begin
if (reset==1)
begin
ready<=0;
in2Minus<= -in2;
accum<= 0;
allOnes<= -1;
state<= S0;
end // "if (reset)"
else
begin
case (state)
S0: begin
if ((|in1[N-1:0]) & (|in2[N-1:0])) // Neither in1 nor in2 = zero
begin
i<=0;
accumNegative<= in1[N-1] ^ in2[N-1];
state<= S1;
end
else // either in1==0 or in2==0
begin
state<= S2;
end
end
S1: begin // neither in1 nor in2 = zero
state<= S3;
if (in1[N-1]) // if in1 negative
begin
a<= -in1; // make positive
end
else
begin
a<= in1;
end
if (in2[N-1]) // if in2 negative
begin
bShifted[2*N-1:0]<= {`L'd0,in2Minus}; // make positive
end
else
begin
bShifted[2*N-1:0]<= {`L'd0,in2[N-1:0]};
end
if (in3[N-1]) // negative (NOTE THAT "c" RETAINS IT'S SIGN)
begin
c[2*N-1:0]<= {allOnes[N-1:0],in3[N-1:0]}; // extend to double precision
end
else // else positive
begin
c[2*N-1:0]<= {`L'd0,in3[N-1:0]}; // extend to double precision
end
end
S2: begin // either in1==0 or in2==0
if (in3[N-1]) // if in3 negative
begin
accum<= {allOnes[N-1:0],in3[N-1:0]};
end
else
begin
accum<= {`L'd0,in3[N-1:0]};
end
state<=S4; // Finished!
end
S3: begin
if (i < N)
begin
state<= S3; // Loop
//$display("\tOperands: in1=%X, in2=%X, in3=%X\n\taccum=%d, i=%d\n", in1,in3,in3,accum,i);
bShifted<= {bShifted[2*N-2:0],1'b0}; // shift left one
if (a[i])
accum<= accum + bShifted;
i<= i+1;
end
else // accum= in1*in2
begin
state<= S4; // Finished!
if (accumNegative)
accum<= c - accum;
else
accum<= accum + c;
end
end
S4: begin
ready<=1; // Finished!
end
endcase
end
end // "always @(posedge clock)"
endmodule // Multiplier
Code:
// divider.v, return: dividend= op1/op2, remainder= op1 % op2;
// Note: I wrote this division routine because I needed a quick and dirty way
// to do a divide. Eventually I'll replace it with something more efficient
// but it's performance is not too shabby. It basically takes advantage
// of the equivalence between binary powering and shifting. One of the operands
// ("Little_ShiftedLeft" in this case) is shifted left by one bit position per clock,
// and then right by one bit position per clock
// (c) July 02, 2005, Ron Dotson
//`include "ECM.h"
`include "ECM.h"
module DIVIDER(reset,clock,op1,op2, dividend,remainder,finished);
parameter N= `L; // Bus Width
input reset, clock;
input [N-1:0] op1, op2;
output reg [N-1:0] dividend, remainder;
output reg finished;
reg [N-1:0] Big, Little_ShiftedLeft, shiftMask;
reg shiftDir; // shift direction: 0=Left, 1=Right
always @(posedge clock)
if (reset==1)
begin
Little_ShiftedLeft= 0;
shiftMask=1; // shiftMask= 2^x, Where x= #Bit positions "Little" has been shifted left
shiftDir=0; // Note that "Little_ShiftedLeft" = shiftMask*Little
finished=0;
dividend=0;
remainder=0;
Big= op1;
Little_ShiftedLeft= op2;
end
else
begin
if (~finished)
begin // on *both* posedge and negedge of clock
if (~shiftDir) // if shifting Left
begin
if (Big >= Little_ShiftedLeft)
begin
//$display("DIVIDER 1: Big=0x%X, Little_ShiftedLeft=0x%X", Big,Little_ShiftedLeft);
Big= Big-Little_ShiftedLeft; // subtractions while shifting left are for "free," so may as well do it
dividend= dividend + shiftMask;
Little_ShiftedLeft= {Little_ShiftedLeft[N-2:0],1'b0}; // shift "Little_ShiftedLeft" Left one
shiftMask= {shiftMask[N-2:0],1'b0}; // shift "shiftMask" Left one
//$display("DIVIDER 2: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend);
end
else // Big<Little_ShiftedLeft; so start shifting Right
begin
shiftDir=1; // shift Right
Little_ShiftedLeft= {1'b0,Little_ShiftedLeft[N-1:1]}; // shift "Little_ShiftedLeft" Right one
shiftMask= {1'b0,shiftMask[N-1:1]}; // shift "shiftMask" Right one
//$display("DIVIDER 3: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend);
end
end // end shifting Left
else // shifting Right
begin
if (|shiftMask) // if shiftMask!=0 and shifting Right
begin
if (Big >= Little_ShiftedLeft)
begin
Big= Big-Little_ShiftedLeft;
dividend= dividend + shiftMask;
//$display("DIVIDER 4: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend);
end
Little_ShiftedLeft= {1'b0,Little_ShiftedLeft[N-1:1]}; // shift "Little_ShiftedLeft" Right one
shiftMask= {1'b0,shiftMask[N-1:1]}; // shift "shiftMask" Right one
//$display("DIVIDER 5: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend);
end
else // shifting Right and shiftMask=0 => Finished
begin
remainder= Big;
finished= 1;
//$display("\nDIVIDER 6: Big=0x%X, Little_ShiftedLeft=0x%X,\n\t shiftMask=%d, dividend=0x%X", Big,Little_ShiftedLeft,shiftMask,dividend);
//$display("DIVIDER 7: dividend=%d (0x%X), remainder=%d (0x%X)", dividend,dividend,remainder,remainder);
end
end // shifting Right
end // "~finished"
end
endmodule
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Basic Number Theory 7: idempotents and Euler's function | Nick | Number Theory Discussion Group | 17 | 2016-12-01 14:27 |
| Suggestions on good number theory books??? | Erasmus | Other Mathematical Topics | 12 | 2010-04-13 23:33 |
| Number Theoretic Transforms | paulunderwood | Software | 7 | 2006-09-19 15:56 |
| ECM/FPGA Implementation | rdotson | Hardware | 12 | 2006-03-26 22:58 |
| Search for a number theoretic function related to "prime divisor sums" | juergen | Math | 2 | 2004-07-10 23:01 |