mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   custom GIMPS hardware: how fast and how much? (https://www.mersenneforum.org/showthread.php?t=15325)

ixfd64 2011-03-02 17:19

custom GIMPS hardware: how fast and how much?
 
Suppose a very generous person decided to hire a company to design a custom microchip for the purpose of finding Mersenne primes, and that money is not an issue.

How fast could such a chip be, and how much would it cost to make one? Just curious.

bsquared 2011-03-02 19:43

Here's a rough guess based on nothing other than some past experience with state-of-the-art [URL="http://en.wikipedia.org/wiki/Application-specific_integrated_circuit"]ASIC[/URL] flows, digital logic development cycles and toolchains, and FFT algorithms in general:

One to ten million dollars and a man-year or two of labor for a ~10-100x speedup vs. modern general-purpose CPUs.

An FPGA solution would probably be cheaper ($10k to $100k, plus 0.5 to 1 man-years of labor) for maybe a 10-20x speedup.

ASIC solutions are generally only appropriate if
a) you are going to build and sell millions of them
or
b) you really really really need the size/weight/power reduction or performance improvement (i.e., it is mission critical)

jasonp 2011-03-02 20:17

The Hardware forum has a sticky that goes into a lot of the details of the answers to this question.

Christenson 2011-03-05 04:03

I think the conclusion has been that it is cheaper and easier to put together 10 PCs than to get an FPGA flow working well, or to get a GPU working on the problem. Memory bandwidth is a major issue.
Me, I'd want to look at what would happen if we took note that what mprime does is very much bound by CPU<-->memory bandwidth, and built a PCI (or other favorite bus) coprocessor card (wait! is that a GPU?) with basically just the CPU, a memory slot, and a PCI interface (and, of course, a heatsink and a way to remove lots of heat).
Architecturally, I'd want to build a machine that could carry out one step of an LL test, and then figure out how to keep it fed (for example, on each clock, feed in the inputs of a new LL step and remove the outputs of a just-completed one). The decision as to which LL step to run would be up to a general-purpose machine. Given all the steps (20-30) involved in a single FFT for an LL step, I would think I might have that many different LL steps simultaneously in progress.
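(To make concrete what one "LL step" is, here's a minimal Lucas-Lehmer sketch in Python; it uses plain bignum squaring in place of the FFT-based multiplication mprime actually uses, but the per-iteration structure is the same:)

```python
def lucas_lehmer(p):
    """Lucas-Lehmer primality test for the Mersenne number M_p = 2^p - 1.

    Each loop iteration is one "LL step": a full-width squaring followed
    by a reduction mod 2^p - 1. In mprime the squaring is done with an
    FFT over the whole multi-megabit residue, which is what makes each
    step memory-bandwidth-bound.
    """
    m = (1 << p) - 1          # M_p = 2^p - 1
    s = 4
    for _ in range(p - 2):    # p - 2 LL steps in total
        s = (s * s - 2) % m   # one LL step: square, subtract 2, reduce
    return s == 0             # M_p is prime iff the final residue is 0

# Exponents p in [3, 20) for which M_p is prime:
print([p for p in range(3, 20) if lucas_lehmer(p)])  # → [3, 5, 7, 13, 17, 19]
```

Note that the steps are strictly sequential within one test, so the pipelining described above would have to interleave steps from many *different* exponents' tests.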

jasonp 2011-03-05 16:41

[QUOTE=Christenson;254348]
Me, I'd want to look at what would happen if we took note that what mprime does is very much bound by CPU<-->memory bandwidth, and built a PCI (or other favorite bus) coprocessor card (wait! is that a GPU?) with basically just the CPU, a memory slot, and a PCI interface (and, of course, a heatsink and a way to remove lots of heat).
[/QUOTE]
This is essentially what a GPU is, except that ATI and Nvidia spend billions of dollars building nice ones, which we can't do. That is, there's no way I'd be able to build a DDR memory controller in an FPGA that runs as fast or as effectively as one of the multiple memory controllers in a GPU.

I've been wondering recently if it would be worthwhile to build a memory controller optimized for high *address* bandwidth to many banks of DRAM, rather than the traditional optimization for high *data* bandwidth. That might help achieve very high GUPS (giga-updates-per-second) rather than GBPS, and the former is a critical component of fast NFS sieving and linear algebra.
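A GUPS-style microbenchmark makes the distinction concrete: the work is dominated by how many independent addresses the memory system can service per second, not by how many bytes it can stream. A minimal Python sketch (table size, update count, and seed are arbitrary illustrative choices):

```python
import random

def gups_kernel(table_size, n_updates, seed=42):
    """RandomAccess (GUPS-style) kernel: XOR-update random table entries.

    Each update touches one 64-bit word at an unpredictable address, so
    real-hardware performance is limited by address bandwidth and DRAM
    row misses rather than by streaming data bandwidth -- the same access
    pattern that dominates NFS sieving and sparse linear algebra.
    """
    rng = random.Random(seed)
    table = [0] * table_size
    for _ in range(n_updates):
        i = rng.randrange(table_size)      # unpredictable address
        table[i] ^= rng.getrandbits(64)    # tiny payload per access
    return table

table = gups_kernel(1 << 16, 100_000)
```

Timing this loop against a same-sized sequential sweep over the table shows how little of a cache-hierarchy machine's peak GB/s survives when every access is a row miss.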

Uncwilly 2011-03-05 18:20

[QUOTE=jasonp;254374]I've been wondering recently if it would be worthwhile to build a memory controller optimized for high *address* bandwidth to many banks of DRAM, rather than the traditional optimization for high *data* bandwidth. That might help achieve very high GUPS (giga-updates-per-second) rather than GBPS, and the former is a critical component of fast NFS sieving and linear algebra.[/QUOTE]The IBM 5162 (XT 286) running at 6 MHz could outperform the IBM 5170 (AT) running at 8 MHz because of memory-related issues. So, your idea of using the memory system to speed up the machine does have a real-world precedent.

ixfd64 2011-03-05 19:24

Argh, I meant to post this in the Hardware forum. Could a mod please move it there?

Thanks.

xilman 2011-03-05 19:44

[QUOTE=ixfd64;254388]Argh, I meant to post this in the Hardware forum. Could a mod please move it there?

Thanks.[/QUOTE]Done.

Paul

ixfd64 2011-03-05 20:15

[QUOTE=xilman;254392]Done.

Paul[/QUOTE]

Thanks!

By the way, I agree with Christenson regarding the memory bottleneck. I think someone here mentioned that his Tesla C2050 card was only achieving about 25% of the expected throughput, and the cause was found to be limited memory bandwidth.

Christenson 2011-03-08 01:39

jasonp... you missed that huge, silly, sh*t-eating grin on my face when I said GPU! Maybe we could get an open SATA chipset to feed the inputs to the FPGAs for LL, or maybe PCI-e to serialize for us; either would be easier than building DDR3 controllers. For sieving work, we need to look at how to optimize that worst case: completely (pseudo-)random memory access, scattered all over a gig or more of memory. Such a bus would need to be deep, that is, lots of memory modules, each of which can start a write and grab the data in one clock, but might need many clocks (and therefore many parallel modules) to commit the data. To sell it, find a non-sieving application, like non-colliding writes to a database, that could use similar performance.
I wonder what it would cost to get Nvidia to let us use that billion-dollar investment in bus controllers to front-end the kind of dedicated logic arrays that would be useful for Mersenne work.

chris2be8 2011-03-08 17:23

Would it be possible to build a system with ~1 GB of level-3 (or level-2) cache? If so, how fast would it be for sieving, and how much would it cost? That's probably the simplest way to speed up main memory.

Chris K

