mersenneforum.org  

Old 2006-06-05, 00:42   #12
John Renze
 
The second stage of ECM is described at the beginning of Section 7.4.2 in C&P. It is analogous to the second stage of Pollard p-1.
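The analogy can be made concrete with a toy Pollard p-1 in software. This is a hedged illustration only (the prime lists and test numbers are chosen just for demonstration): stage 1 raises a base to a product of small prime powers; stage 2 then tries each larger prime one at a time, hoping the hidden factor's group order is smooth except for that one prime, which is exactly the structure ECM's stage 2 mirrors in the elliptic curve group.

```python
from math import gcd

def pollard_pm1(n, stage1_primes, stage2_primes, base=2):
    # Stage 1: compute base^M mod n, where M is the product of the
    # given small primes (repeat a prime in the list to get powers).
    a = base
    for p in stage1_primes:
        a = pow(a, p, n)
    g = gcd(a - 1, n)
    if 1 < g < n:
        return g
    # Stage 2: for each larger prime q, test whether q completes the
    # order of the hidden factor's group. ECM's stage 2 plays the same
    # role, but with points on an elliptic curve.
    for q in stage2_primes:
        g = gcd(pow(a, q, n) - 1, n)
        if 1 < g < n:
            return g
    return None

# 1403 = 23 * 61; 61 - 1 = 2^2 * 3 * 5 is smooth, so stage 1 finds 61.
print(pollard_pm1(1403, [2, 2, 3, 5], []))
# 2461 = 23 * 107; 107 - 1 = 2 * 53 needs q = 53 in stage 2.
print(pollard_pm1(2461, [2], [53]))
```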
Old 2006-06-05, 02:53   #13
rdotson
 

Quote:
Originally Posted by John Renze
The second stage of ECM is described at the beginning of Section 7.4.2 in C&P.
Aha, thanks for the explanation, John. You have in fact zeroed in on one of the strong points of FPGA design - parallelization. My implementation runs stages 1 and 2 in parallel! In fact, on an FPGA *everything* runs in parallel unless you specifically arrange otherwise.

I'm going to try to attach a picture of the design hierarchy so you can see how I've implemented it. The top level module is on the left and is called "main", while the lower level modules are on the right. The "main" module is only for I/O interfacing. "ECM" is the topmost ECM module, and contains the modules GEN_CURVE and ELLIPTIC_MULTIPLY which correspond to steps 2 and 3 of C&P Section 7.4.2, which I assume are the stages 1 and 2 referred to above(?)

Ron
[Attached image: x.jpg (design hierarchy diagram)]
Old 2006-06-05, 05:52   #14
akruppa
 

Judging by your comments, I'd say ECM does not work the way you think it works. Try to understand the algorithm well enough to write a software implementation before you attempt it in hardware. Btw, a 704-bit mul/REDC takes about 360 muls on a 64-bit machine, which on an AMD64 ideally takes about 750 clocks - at a 3 GHz clock rate.
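The figures above can be sanity-checked with a back-of-the-envelope calculation (an accounting sketch, not a measurement; the 3x factor for a multiply plus Montgomery reduction is an assumption that happens to land near the quoted ~360 muls):

```python
limbs = 704 // 64                    # a 704-bit operand is 11 64-bit limbs
word_muls_per_mul = limbs * limbs    # ~121 word products, schoolbook multiply
approx_muls = 3 * word_muls_per_mul  # ~363 for mul + REDC, near the quoted ~360
ns_per_redc = 750 / 3.0e9 * 1e9      # 750 clocks at 3 GHz is 250 ns
print(limbs, approx_muls, ns_per_redc)
```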

Somehow I don't think you thought your cunning plan all the way through.

Alex

Old 2006-06-05, 10:37   #15
Mystwalker
 

Did you already test your implementation with easier targets?
It would be interesting to see your times for, e.g., a 100-digit composite that is the product of two 50-digit primes.
Old 2006-06-05, 11:06   #16
rdotson
 

Quote:
Originally Posted by akruppa
Judging by your comments, I'd say ECM does not work the way you think it works. Try to understand the algorithm well enough to write a software implementation before you attempt it in hardware.
Indeed, that is the very first thing I did, Alex. Although the Verilog algorithm is based on Chapter 7 of the C&P book, I didn't convert it directly into Verilog. I first wrote a Maple implementation of the C&P description of ECM, and then gradually converted that into something I could ultimately translate to Verilog fairly easily.

Using Maple as a design language for Verilog may seem crazy, but it works out quite well. Although Maple is at the opposite end of the abstraction spectrum from Verilog, it is pretty much platform independent (as is Verilog) and easy to work with and to test. After I had a working Maple ECM program, I gradually massaged it, breaking every complicated operation down into simpler pieces until the Maple program resembled Verilog - har, har! Anyone who knows Maple would look at the final program and think whoever wrote it is nuts, but it's that way for easy conversion to Verilog.

In any case, I have tested the Maple version of ECM on my PC. I have also tested the Verilog implementation of ECM (as well as Fermat's method and Pollard rho) on a real Lattice Semiconductor FPGA development board (using only 14-bit numbers, however, because nothing larger would fit) and in a "testbench" simulation environment on my AMD64 desktop PC, using several different test numbers, and it seems to work fine with no known errors.

Quote:
Originally Posted by akruppa
Btw, a 704 bit mul/REDC takes about 360 muls on a 64 bit machine, which on an AMD64 ideally takes about 750 clocks - at a 3GHz clock rate.
Ouch! You really know how to hurt a guy, Alex. If a mul/REDC does a multiply and a modular reduction, then my 28 microseconds doesn't hold a candle to the AMD64's 250 ns. Some FPGAs have built-in high-speed multipliers, but I doubt they could approach that speed. My only rejoinder is to point out that I can have as many of my 704-bit multiplier/modulo circuits running in parallel on an FPGA as it will accommodate, which for a Spartan3E is probably around a dozen or so, depending on the FPGA version.
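Putting the two numbers side by side (a rough comparison using only the figures quoted in this thread; the dozen-unit count is a guess, not a synthesis result):

```python
fpga_ns = 28_000   # 28 microseconds per 704-bit multiply/mod (quoted above)
cpu_ns = 250       # ideal AMD64 mul/REDC latency (quoted above)
units = 12         # hypothetical count of parallel FPGA multiplier circuits
per_op_gap = fpga_ns / cpu_ns              # CPU is ~112x faster per operation
throughput_gap = fpga_ns / units / cpu_ns  # ~9.3x gap even with 12 units busy
print(per_op_gap, round(throughput_gap, 1))
```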

In any case, I just can't help feeling that there is a great future for reconfigurable computing hardware in numerical analysis, and I thought this would be a good way to get my feet wet.

Quote:
Originally Posted by akruppa
Somehow I don't think you thought your cunning plan all the way through.
Alex
What, ... you mean my plan for total world domination is flawed!?

Ron
Old 2006-06-05, 13:26   #17
rdotson
 

P.S.
Quote:
Originally Posted by akruppa
Judging by your comments I'd say ECM does not work the way you think it works.
I'm not sure which comments you were referring to, but if it's my comment about stages 1 and 2 running in parallel, perhaps I should clarify that a bit. I hadn't intended to get into implementation details, since this isn't a Verilog forum, but I probably should have said that after initialization is completed, processing for stages 1 and 2 overlaps.

Stage 1 (GEN_CURVE) runs once during initialization to generate the initial elliptic curve parameters. At that point the ELLIPTIC_MULTIPLY module is invoked with the newly generated parameters and begins testing a large number of points on the curve for the terminating condition (failure of ELLIPTIC_ADD). At the same time, a new "random" number is generated and GEN_CURVE is restarted to generate the next random elliptic curve while ELLIPTIC_MULTIPLY is executing.

Although I've done no timing analysis, I assume GEN_CURVE would finish long before ELLIPTIC_MULTIPLY and have to wait, so I'm not sure how much computational efficiency this overlapped processing gains, but both stages run concurrently unless one of them is waiting for the other to complete.
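The overlap described here can be modelled in software as a producer/consumer pipeline. The sketch below is a toy: the "curve parameters" and the terminating test are dummies, and the function names are merely borrowed from the module names in the post. It only shows the structure, namely that curve N+1 is generated while curve N is being tested.

```python
import queue
import threading

def gen_curve(seed):
    # Stand-in for GEN_CURVE: cook up dummy "curve parameters".
    return (seed * 1103515245 + 12345) % (2 ** 31)

def elliptic_multiply(params):
    # Stand-in for ELLIPTIC_MULTIPLY: a dummy terminating condition.
    return params % 7 == 0

def pipelined_search(n_curves):
    curves = queue.Queue(maxsize=1)   # at most one curve waiting in flight

    def producer():                   # GEN_CURVE keeps running ahead...
        for seed in range(n_curves):
            curves.put(gen_curve(seed))
        curves.put(None)              # sentinel: no more curves

    threading.Thread(target=producer, daemon=True).start()
    hits = []                         # ...while the consumer tests curves
    while True:
        params = curves.get()
        if params is None:
            break
        if elliptic_multiply(params):
            hits.append(params)
    return hits
```

The result is identical to a sequential loop over the curves; only the scheduling differs, which is the point being made about the FPGA modules.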

Ron
Old 2006-06-05, 14:01   #18
R.D. Silverman
 

Quote:
Originally Posted by rdotson
...after initialization is completed, processing for stages 1 and 2 overlap. [...] both stages run concurrently unless one of them is waiting for the other to complete.

Step 1 and Step 2 can't "overlap". The output from Step 1 is an input to Step 2.

Step 2 can indeed run in parallel. See my joint paper with Peter Montgomery, "An FFT Extension to the P-1 Factoring Algorithm", to see how it might be done. Basically, one generates two sequences of numbers:

x_1, x_2, x_3, ..., x_n
y_1, y_2, ..., y_k

such that the sequences are co-prime. One then computes product(x_i - y_j) over all pairs (or, using Brent-Suyama polynomials, product(f(x_i) - f(y_j))) and takes a GCD with N. The generation of the sequences and of the product is easily partitioned among multiple processors.
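A minimal sketch of the product-and-GCD idea (toy numbers; the sequence-generation strategy of the paper is not reproduced here): if some x_i matches some y_j modulo a hidden prime factor p of N, the accumulated product vanishes mod p and a single final GCD exposes p.

```python
from math import gcd

def stage2_product(n, xs, ys):
    # Accumulate product of (x_i - y_j) mod n over all pairs, then take
    # ONE gcd at the end; each (i, j) term is independent, which is what
    # makes the product easy to partition among processors.
    acc = 1
    for x in xs:
        for y in ys:
            acc = acc * (x - y) % n
    return gcd(acc, n)

# 91 = 7 * 13; x = 10 and y = 3 match mod 7, so the gcd reveals 7.
print(stage2_product(91, [10], [3]))
```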

However, this is NOT an effective way to run ECM in parallel.

Step 1 is very hard to do in parallel. Generally, all that can be done is to do the multi-precision modular arithmetic in parallel, because in Step 1 the result of each iteration depends upon the value of the previous iteration.

I do not follow your comment that "ELLIPTIC_MULTIPLY" begins testing a large number of points. In Step 1, we let M = product(p_i ^ a_i), where the p_i are successive primes and the a_i are small integers. We compute MP on the curve. We do this by computing 2^a P, then 3^b (2^a P), then 5^c (3^b (2^a P)), etc. We stop periodically to take a GCD. Computing the GCD on "a large number of points" would slow the algorithm down a LOT; we compute GCDs very infrequently.
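The prime-power ladder just described can be sketched with modular exponentiation standing in for curve multiplication: the structure (M = product(p_i^a_i), applied one prime power at a time, with an infrequent GCD) is the same, though the group is different. The numbers below are toys.

```python
from math import gcd

def stage1_ladder(n, prime_powers, x0=2):
    # x <- x^(p^a) plays the role of P <- p^a * P on the curve:
    # one prime power at a time, each step feeding the next, which is
    # why Step 1 is inherently sequential for a single curve.
    x = x0
    for p, a in prime_powers:
        x = pow(x, p ** a, n)
    # A single, infrequent GCD rather than one per step.
    return gcd(x - 1, n)

# 1829 = 31 * 59; 31 - 1 = 2 * 3 * 5, so M = 30 suffices to find 31.
print(stage1_ladder(1829, [(2, 1), (3, 1), (5, 1)]))
```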

Giving separate curves to separate processors is not "running step 1 in parallel". The best way to run ECM in parallel is to give a single curve to each processor and let each one run independently.

I mean no disrespect, but you have not described what you are doing in a way that makes me feel confident that you know what you are doing. This may simply be a failure on your part to express yourself adequately.
Old 2006-06-05, 17:29   #19
rdotson
 

Quote:
Originally Posted by R.D. Silverman
Step 1 and Step 2 can't "overlap". The output from Step 1 is an input to Step 2.
May I humbly suggest you familiarise yourself with the concept of "pipelining."
http://en.wikipedia.org/wiki/Pipelining

Quote:
Originally Posted by R.D. Silverman
Step 1 is very hard to do in parallel.
...
Step 2 can indeed run in parallel.
...
Please pardon me if I gave the mistaken impression that I was implying that either step 1 or step 2 lends itself individually to parallelization. I was merely trying to point out that the step 1 (?) phase of generating elliptic curves can be overlapped in part (i.e., pipelined) with the step 2 (?) phase of testing each of those curves.

I will certainly try to find a copy of your paper with Peter Montgomery, "An FFT Extension to the P-1 Factoring Algorithm", because it sounds very interesting. If anyone happens to have a web link to the paper handy, it would save me some time searching.

Although your method may not be a good way to run ECM in parallel on a sequential computer, it might fit nicely into the fabric of an FPGA, which works best when things can be partitioned into tiny pieces. An important difference between programming an FPGA and programming a conventional computer is that when you spawn a task (invoke a module) on an FPGA, that task/module gets as much computing power as its parent without suspending the parent - at least until there's no more room left on the FPGA. This comes in very handy at times, such as in my Verilog implementation of the extended Euclidean algorithm (part of ECM), in which I have three separate multipliers all running simultaneously. :-)
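The three-multiplier structure maps naturally onto the standard iterative extended Euclid: each loop iteration performs three multiplies by the same quotient q, with no data dependence among them. A Python sketch (illustrative only, not the actual Verilog):

```python
def xgcd(a, b):
    # Iterative extended Euclid: returns (g, s, t) with s*a + t*b == g.
    r0, r1 = a, b
    s0, s1 = 1, 0
    t0, t1 = 0, 1
    while r1:
        q = r0 // r1
        # The three q-multiplies below are mutually independent, so an
        # FPGA can run them on three multipliers in the same cycle.
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
        t0, t1 = t1, t0 - q * t1
    return r0, s0, t0

print(xgcd(240, 46))
```

In ECM, the interesting case is when the gcd returned is greater than 1: the modular inverse needed by the elliptic addition does not exist, and the gcd shares a factor with the modulus.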

Quote:
Originally Posted by R.D. Silverman
Giving separate curves to separate processors is not "running step 1
in parallel".
I probably shouldn't even have mentioned "FPGA" because it is apparently leading to some misunderstanding. There are no "processors" in an FPGA per se. The task of designing a circuit for FPGA implementation consists of designing an entire application specific computing architecture.

Quote:
Originally Posted by R.D. Silverman
I do not follow your "ELLIPTIC_MULTIPLY" begins testing a large number of
points comment.
...
I mean no disrespect, but you have not described what you are doing in
a way that makes me feel confident that you know what you are doing.
This may simply be a failure on your part to express yourself adequately.
Thank you for your courtesy, Professor Silverman. Your viewpoint is somewhat understandable, considering that sometimes I'm not so sure what I'm doing myself - har. ;-) I think being able to dabble in all the things I never had time to learn about when I was employed full time is great fun, even if I do have to endure a few lumps and bruises along the way. ;-)

My training is as an engineer rather than as a mathematician, so I think I have a grasp of most basic concepts but am unfamiliar with much of the proper mathematical terminology. It's somewhat like learning a foreign language: it's always easier to learn to read or understand it than it is to speak it. Further, it's been months since I've looked at the original Maple source code, so perhaps I should sit down, go through it carefully, and document exactly what it does at each step so that I can be more explicit and determine whether or not it correctly implements ECM. I've been dealing with wires, registers, and Verilog timing issues so much lately that I've almost lost track of the top-level flow.

As I recall, the first module simply generates a sequence of random elliptic curves along with a point on each curve. For each of these curves, the corresponding point is repeatedly multiplied (via a succession of elliptic additions) by a constant until either an arbitrary limit is reached or the extended Euclidean algorithm (IGCDEX) portion of the elliptic addition cannot compute an inverse, in which case it returns a GCD that solves the factorization.
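The flow described here, where a failed inverse inside the point addition yields the factor, is indeed the heart of ECM stage 1. A self-contained Python sketch (affine coordinates, toy curve; the FactorFound exception models IGCDEX returning a non-trivial GCD; curve and bound chosen for illustration):

```python
from math import gcd, factorial

class FactorFound(Exception):
    """Raised when a modular inverse fails; carries the gcd found."""
    def __init__(self, g):
        self.g = g

def inv_mod(x, n):
    g = gcd(x % n, n)
    if g != 1:
        raise FactorFound(g)  # no inverse: g shares a factor with n
    return pow(x, -1, n)

def ec_add(P, Q, a, n):
    # Affine addition on y^2 = x^3 + a*x + b (mod n); None = infinity.
    if P is None:
        return Q
    if Q is None:
        return P
    if P[0] == Q[0] and (P[1] + Q[1]) % n == 0:
        return None
    if P == Q:
        lam = (3 * P[0] ** 2 + a) * inv_mod(2 * P[1], n) % n
    else:
        lam = (Q[1] - P[1]) * inv_mod(Q[0] - P[0], n) % n
    x = (lam * lam - P[0] - Q[0]) % n
    return (x, (lam * (P[0] - x) - P[1]) % n)

def ecm_try(n, a, P, k):
    # Multiply P by k via repeated additions; a failed inverse along the
    # way is the "terminating condition" that solves the factorization.
    try:
        R = None
        while k:
            if k & 1:
                R = ec_add(R, P, a, n)
            P = ec_add(P, P, a, n)
            k >>= 1
        return None  # no failure: try another curve or a bigger k
    except FactorFound as e:
        return e.g

# e.g. multiplying (1, 1) on y^2 = x^3 + 5x - 5 mod 455839 by 10!
print(ecm_try(455839, 5, (1, 1), factorial(10)))
```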

It may very well be that I have erred in my implementation of ECM, and I will spend some time looking at how others have done it (if I can find any understandable source code) to see if what I've done is equivalent or way off the mark. I had been assuming all along that, because it always gives correct answers and no incorrect ones, I had implemented the algorithm correctly.

In any case, to get back on topic I would still like to see an example of the Prime95 FFT multiplication source code in order to determine whether or not it might be a good candidate for FPGA implementation.

Regards,

Ron
Old 2006-06-05, 17:36   #20
R.D. Silverman
 

Quote:
Originally Posted by rdotson
May I humbly suggest you familiarise yourself with the concept of "pipelining."
http://en.wikipedia.org/wiki/Pipelining

ROTFLMA. (we need a smiley for this!)
Old 2006-06-05, 19:31   #21
akruppa
 

Quote:
Originally Posted by rdotson
Please pardon me if I gave the mistaken impression that I was implying that either step 1 or step 2 lends itself individually to parallelization. I was merely trying to point out that the step 1 (?) phase of generating elliptic curves can be overlapped in part (i.e., pipelined) with the step 2 (?) phase of testing each of those curves.
What you call "step 1" is merely the initialisation phase of ECM. Trying to do it in parallel with the actual arithmetic on the curve is futile as the initialisation takes a negligible amount of time.

What is commonly referred to as Stage 1 multiplies a point on the curve by a large, highly composite integer. This elliptic curve multiplication is a sequential process for one curve. The modular arithmetic may be parallelised. Or you may run stage 1 on several different curves in parallel.

What is commonly referred to as Stage 2 takes the resulting point of stage 1 and tries to cover a single, larger prime in the group order by looking for a match (mod the prime factor you hope to discover) between two lists. This can be done by subtracting the elements of one list from the elements of the other list one pair at a time ("enhanced standard stage 2"), or by reformulating the process as arithmetic in (Z/NZ)[X] so one can use efficient algorithms for polynomial arithmetic ("FFT stage 2"). Your comments so far mention nothing that resembles an actual stage 2.

Trying a hardware implementation of ECM should make a very interesting project and learning exercise. Prof. Franke of the Universität Bonn did experiments on this as well; I can try to dig out a paper he did on the subject if you want. But I really don't think you are ready to try a hardware implementation yet; you seem to lack a basic understanding of the algorithm. Crandall & Pomerance give a good description of ECM, with brief explanations of the enhanced standard stage 2 and the FFT stage 2 iirc, in chapter 7 of their book. For the FFT stage 2, Montgomery's thesis "An FFT extension of the Elliptic Curve Method of factorization" is the definitive source. Several algorithms for polynomial arithmetic have been improved since the thesis was written, but the basic concept of the FFT stage 2 is unchanged.

Alex

Old 2006-06-05, 21:53   #22
rdotson
 

Quote:
Originally Posted by akruppa
Your comments so far mention nothing that resembles an actual stage 2.
As someone famous once said, "I think we have a failure to communicate", and I think I finally know what it is. I only implemented the basic ECM algorithm described in Section 7.4.1 of C&P. I did *not* implement the stage 2 continuation described in the following section, 7.4.2 "Optimization of ECM", for purely pragmatic reasons: there is no way it would fit into my FPGA. I've cut my design to the bone and sacrificed almost all opportunities for parallelism in the hope of making the design fit, and that includes leaving out everything except the absolute essentials.

I even precompute the value of p[i] using Maple and "hardcode" it, along with the number to be factored, into the FPGA. The p[i] is used as the multiplier for the elliptic curve points, and for reasons that escape me at the moment, p[i] is divided by two at each iteration (after testing the LSB and taking different actions depending on the result) until it reaches zero, at which point the elliptic multiply returns either the coordinates of the next point or a successful factorization.

There is no guarantee that the p's are prime as far as I know, but it would be impossible in any case for a reasonably sized FPGA to compute a lengthy sequence of large primes or to store them locally, so it's fortunate that this is (apparently) unnecessary. It's very difficult to determine the correlation between the original Maple program and the pared-down Verilog version, so obviously I need to go back and review things in more detail. If I have indeed bollixed up the algorithm somehow, I'm puzzled as to why it nonetheless yields correct factorizations.
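The halve-and-test-the-LSB loop described here is just the standard binary (double-and-add) multiplication method, and it is correct for any multiplier, prime or not, which is why the primality of the hardcoded p[i] never matters. A generic sketch, with plain modular integers standing in for curve points:

```python
def binary_multiply(k, P, add, double):
    # Test the LSB, act on it, halve k, repeat until zero -- exactly
    # the loop described above. Correct for ANY non-negative k.
    acc = None
    while k:
        if k & 1:
            acc = P if acc is None else add(acc, P)
        P = double(P)
        k >>= 1
    return acc

# Demo in the additive group of integers mod 101 (a stand-in for
# elliptic curve points): "multiplying by k" is just k * P mod 101.
add = lambda a, b: (a + b) % 101
double = lambda a: (2 * a) % 101
print(binary_multiply(45, 7, add, double))
```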

Quote:
Originally Posted by akruppa
Prof. Franke of the Universität Bonn did experiments on this as well, I can try to dig out a paper he did on the subject if you want.
If that's the one titled "SHARK: A Realizable Special Hardware Sieving Device for Factoring 1024-bit Integers," then I already have a copy; otherwise, please do send me one if you don't mind, because I'm always interested in what others are doing along this line.

Thanks,

Ron



