mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2011-05-11, 18:09   #45
fivemack

Quote:
Originally Posted by diep View Post
You don't create an FPGA for prime-number crunching in order to produce just one FPGA card. That would never justify the costs of carrying out the project.
I don't see why you wouldn't; a year of hobbyist evenings to write the Verilog is probably not much harder to come by than the $3500 to buy a http://hitechglobal.com/catalog/prod...5458a19baef64a or the $1795 for a http://www.xilinx.com/products/board...V6-ML605-G.htm.

You'd need rather richer and more confident hobbyists to accumulate around $60,000 to get a dozen PCIe boards with six 72Mbit QDRII SRAM chips, an SO-DIMM socket and a V6-LX240T designed and made - a few hundred hours of Shanzhai engineering and fabrication effort; the big problem being to find the people required to bridge the three degrees of separation between here and the Shenzhen electronic-engineering community (quick, anyone here speak both Verilog and colloquial Mandarin?). Quantity-1 XC6VLX240T chips in a package with enough I/Os turn out to cost more from Avnet than a devboard with one on it does, but that says very little about the cost of quantity-50.

The next dozen would be cheaper, but we're not in the world of six-figure NREs here, let alone the seven and eight figures you're talking about; these could be made semi-practically for an audience of a dozen people each prepared to pay the price of a 48-core Magny Cours server or a 300mm/2.8 telephoto lens.

Of course, I suspect such an audience doesn't exist: this forum is likely to be sampling the very far tail of the distribution of number-theory hobbyists and contains about one person who's personally bought such a server.

Old 2011-05-11, 18:15   #46
diep
 

Quote:
Originally Posted by fivemack View Post
I gave you the DES-cracker above. Cisco routers used to be made mostly out of FPGAs; mobile-phone base stations are made mostly out of FPGAs. As soon as you're manipulating perceptible amounts of I/O, FPGAs are the way to go.
You're mixing things up here. In telecom, all that matters is low power.

So we can't use that as proof here; we were discussing massive number crunching.

A GPU is going to annihilate such FPGAs in terms of performance, just not in power usage. So using telecom as an example of how to "speed up" number crunching is a very bad idea.

DES cracking dates from before everyone at home could program a GPU (of course there was some GPU-type hardware back in the '90s, but it was neither common nor cheap).

Quote:

FPGA is really useful when you have a completely different compute/control balance from CPU problems ... 64-bit adders on a Virtex 5 run at 400MHz and you've got space for hundreds of them. Or if you want something which looks very unlike an ALU (GF(2) polynomial multiplication, for example).

You can, as bsquared points out, attach large SRAMs ($60 for a 36Mbit 200MHz SRAM from Digikey for delivery tomorrow) to the FPGA through its thousand or so general-purpose I/O lines ... and an FPGA tends to be full of fast memory banks anyway: a large Virtex 5 has 256 block RAMs (512 36-bit words in each) which clock at 500MHz. The largest FPGA in a family will be the largest (in square-millimetre terms) chip that TSMC are prepared to give Xilinx a quote to make; transistor counts are very comparable with CPUs and GPUs, though it takes an awful lot of transistors to make one FPGA LUT.
You'd have to design a custom card to fit all that, and you also need parallel reads and writes; a few 64-bit 'adders' won't get you there.

You realize that a single GPU, at just 40 nm, already delivers an FMA per cycle at each stream core (which is 4 PEs), which means a simple 500-euro card running at 830 MHz delivers 1.17 Tflop double precision.

Your FPGA is just 400 MHz, which is currently the highest-clocked FPGA. Sure, 1.5 GHz at 22 nm is coming.

The junk you refer to is clocked at, what, 30 MHz if you're lucky?

How are you going to beat a gamer's card, man?

Sure you can, if you invest a hell of a lot more money: custom-design the board, add this, add that.

That's a multimillion-dollar project, as producing one such FPGA is not interesting at all. When I wanted to produce an FPGA chip, the reality was that I had to produce at least a thousand of those chips. Add 1000 PCI cards (royalties, royalties).

So that quickly becomes a really expensive project if you want a semi-capable FPGA.
Old 2011-05-11, 18:32   #47
fivemack

Quote:
Originally Posted by diep View Post
A custom CPU is a similar price. Only if you want the latest process technology for such a chip will it be more expensive.
The problem is that a custom chip made on an old enough process technology to be remotely affordable will be less capable than the FPGAs which are the lead products used for the newest processes at foundries like TSMC. It'll be cheap in huge quantity once you've paid the NRE, but the NRE will buy you a thousand of the largest Virtex-7.

The EFF did make custom DES chips at one stage; they got two thousand 800nm custom chips for about $130,000, though I suspect that was cost-price. http://cryptome.org/jya/cracking-des/cracking-des.htm

I just asked www.mosis.com for a quote; if you give them a full design (which, admittedly, requires probably the better part of a million dollars worth of licences for design software), they will for $211,500 have 100 25mm^2 chips fabbed on the TSMC 65nm logic process, packaged in 256-pin packages, and sent to you by Fedex.

25mm^2 would be six megabytes of SRAM or six Cortex-A8 processors (note that the licence fee for using Cortex-A8 processors is also enormous), or a fairly prodigious quantity of custom logic.
Old 2011-05-11, 18:34   #48
diep
 

Quote:
Originally Posted by fivemack View Post
I don't see why you wouldn't ; a year of hobbyist evenings to write the Verilog is probably not much harder to come by than the $3500 to buy a http://hitechglobal.com/catalog/prod...5458a19baef64a or the $1795 for a http://www.xilinx.com/products/board...V6-ML605-G.htm.

You'd need rather richer and more confident hobbyists to accumulate around $60,000 to get a dozen PCIe boards with six 72Mbit QDRII SRAM chips, an SODIMM socket and a V6-LX240T designed and made - a few hundred hours of Shanzhai engineering and fabrication effort; the big problem being to find the people required to bridge the three degrees of separation between here and the Shenzhen electronic-engineering community (quick, anyone here speak both Verilog and colloquial Mandarin?). Quantity-1 XC6VLX240T chips in a package with enough I/Os turn out to cost more from avnet than a devboard with one on it does, but that says very little about the cost of quantity-50.

The next dozen would be cheaper, but we're not in the world of six-figure NREs here, let alone the seven and eight figures you're talking about; these could be made semi-practically for an audience of a dozen people each prepared to pay the price of a 48-core Magny Cours server or a 300mm/2.8 telephoto lens.

Of course, I suspect such an audience doesn't exist: this forum is likely to be sampling the very far tail of the distribution of number-theory hobbyists and contains about one person who's personally bought such a server.
Actually, two people here own a 4-socket AMD Opteron box; I'm one of them. Not the 48-core thing, though: that's out of my budget. I own an older board, built for under 1000 euro, with 16 cores.

And I currently own one video card. The plan, when I win the lottery, is to buy a few more.

A single video card delivers 5 Tflop. The card I have has just one GPU on it and delivers 2.7 Tflop in theory; maybe I can squeeze 2 Tflop out of it. Note this is single precision, and of course I'm only busy with integers.

On an FPGA you'd do it more efficiently. But you must do it more efficiently: you don't have the same number of transistors.

It's not interesting to put in all that effort for one FPGA development kit. With a compiler of, say, $60k, the boards you need, some additional logic and a lot of your time, the total project comes to maybe $100k, programming time not counted.

Then you've got one FPGA. By the time it's there, I can buy a GPU in the shop with 10k cores at 1 GHz, delivering 20 Tflop single precision.

The code that runs on those GPUs is not extremely efficient, yet you lose just a small factor to inefficiency, not an order of magnitude.

In the end, the problem for the FPGAs is bandwidth to the caches.

Someone claimed 0.5 TB/s. Yet on a GPU, 0.5 TB/s to the L1 is laughable: just counting instructions (a multiply-add is one instruction), it has nearly 10 TB/s of bandwidth to the local caches.

The shared cache alone already delivers 1 TB/s.

I hope you see the problem when designing an FPGA.

A 0.5 TB/s claim is not very convincing if the result of each internal calculation needs to be stored, say for an FFT.

If you already have problems beating one GPU, while you'll soon be able to buy these GPUs second-hand dirt cheap and stack them up, how do you beat that?

FPGAs are interesting, for example, for traders at an exchange, because of the latency. But then you also need to implement TCP at the same latency as the fastest network card will deliver by then; that's a pretty complicated project.

I guess that's why FPGAs get moved to 22 nm so quickly: for the financial guys.

That has hardly anything to do with number crunching, however.

Regards,
Vincent
Old 2011-05-11, 18:35   #49
xilman

Quote:
Originally Posted by bsquared View Post
No. While I can't speak for xilman (and he can't tell us anyway), I assume he's talking about a chip that is working on a different problem.
You assume correctly.

For the problem on which the hardware is working, a single GPU is approximately 50 times the speed of a single CPU core. A single FPGA instance is approximately 500 times faster than a single CPU core. A single FPGA can hold a number of instances.

These are not made up numbers. I went to quite some effort to measure them.
Quote:
Originally Posted by bsquared View Post
That doesn't mean that it is more or less limited. GPUs are only good at problems which map to what GPUs are good at. There are lots of interesting problems for which GPUs will be useless.
Correct, though I'd phrase it as "for which GPUs are less effective".
Quote:
Originally Posted by bsquared View Post
For people who still want to solve these problems quickly, there are FPGAs. For people who still want to solve these problems very quickly, there are ASICs.
Again, correct.

There are some problems which do not require large amounts of memory nor which use "deep" algorithms but which, rather, use algorithms which are very simple in hardware but rather complex in software. For instance, bit-permutation is expensive in almost all forms of software but is completely free in hardware because it consists solely of re-routing wires between computational elements. Primitives like shift registers and combinatorial logic are very cheap on FPGAs.

The same is true of ASICs but a significant advantage of FPGAs is that they are Field Programmable Gate Arrays. That is, the same device can be reprogrammed later to solve a different problem if a better algorithm is implemented after the hardware is built.


Paul

Old 2011-05-11, 18:37   #50
jasonp

For the record, when I first thought about massively parallel NFS polynomial selection I realized that COPACOBANA would be a very nice target platform for it; but it's not clear whether the system would work out for a problem that needs somewhat more memory than an ECM core, yet still little enough to fit inside one mid-range FPGA. Plus a GPU was only a hundred dollars.
Old 2011-05-11, 18:38   #51
diep
 

Quote:
Originally Posted by fivemack View Post
The problem is that a custom chip made on an old enough process technology to be remotely affordable will be less capable than the FPGAs which are the lead products used for the newest processes at foundries like TSMC. It'll be cheap in huge quantity once you've paid the NRE, but the NRE will buy you a thousand of the largest Virtex-7.

The EFF did make custom DES chips at one stage; they got two thousand 800nm custom chips for about $130,000, though I suspect that was cost-price. http://cryptome.org/jya/cracking-des/cracking-des.htm

I just asked www.mosis.com for a quote; if you give them a full design (which, admittedly, requires probably the better part of a million dollars worth of licences for design software), they will for $211,500 have 100 25mm^2 chips fabbed on the TSMC 65nm logic process, packaged in 256-pin packages, and sent to you by Fedex.

25mm^2 would be six megabytes of SRAM or six Cortex-A8 processors (note that the licence fee for using Cortex-A8 processors is also enormous), or a fairly prodigious quantity of custom logic.
I was thinking that universities are also allowed to produce a few thousand CPUs for free each year. Why not lobby there?

They just need to be convinced that producing a specific real CPU makes sense.

I'm not sure what process technology they're allowed to fab in; usually it's not extremely bad, though.
Old 2011-05-11, 18:39   #52
fivemack

Quote:
Originally Posted by diep View Post
You're mixing things up here. In telecom, all that matters is low power.
That's not really the case for base stations, and particularly not for big routers; the base stations are FPGA-based because mobile phone standards evolve faster than you'd want to make ASICs, and the routers are FPGA-based because they need extremely fast transceivers and it's easier to get Xilinx to design them and then use the commercial chips from Xilinx. If you're in a big-volume business like telecom and want low power, you definitely don't want FPGAs - using a 16-bit SRAM to implement a NAND gate is not the way to go to save power.

Quote:
You realize that a single GPU, at just 40 nm, already delivers an FMA per cycle at each stream core (which is 4 PEs), which means a simple 500-euro card running at 830 MHz delivers 1.17 Tflop double precision.
I realise that's what the peak numbers multiply out to. I don't think "deliver" is remotely the right word to use until you start presenting software which actually achieves that kind of performance on a problem we're interested in.
Old 2011-05-11, 18:49   #53
diep
 

Speaking of GPUs, I just had good news from the AMD helpdesk, assuming a knowledgeable person answered: the top bits of a 24 x 24-bit multiply also go at full PE speed (1536 processing elements per GPU at 880 MHz). OpenCL currently doesn't expose this, but I would sure hope that in the next service pack they add an AMD-specific function for it, which hopefully also gets integrated into OpenCL.

That speeds up the AMD GPU by a factor of 3 or so for trial factoring. It might also make it 3x faster than the GTX 470 that was used as the Nvidia benchmark (which is not the fastest Nvidia card for this).

More on this in another thread.

Currently the GPUs are slow at 32 x 32-bit multiplication, yet it's wishful thinking that this will remain so.

Some years ago it was easy for a random good programmer to pick up FPGA programming and outgun a CPU by a factor of 1000+.

Trying that nowadays is a lot harder, and I'd guess it will get ever harder in the future as well, if we're simply talking about throughput.

The majority of people here are interested in very big prime numbers, and I write from that context: when I tried to sketch an FPGA design for that on paper, I sure couldn't do it cheaply.

If we're discussing something tiny that works below a couple of thousand bits, obviously a lot of tricks are possible; yet that's not so relevant, I'd argue.

Maybe there are one or two guys with such a problem. If some old GPU is already a factor of 50 faster than a CPU there, and an FPGA a factor of 500, I'd argue: get 16 GPUs. And you can keep upgrading to new ones; some years from now a single GPU will outgun that.

They're busy adapting those GPUs more and more for HPC-type crunching workloads.

The point is, if you never publicly show up with a problem that GPUs can't do fast, it's not certain they'll fix them for it.

What you post on the net, you have good odds they'll fix.

Regards,
Vincent
Old 2011-05-11, 18:59   #54
diep
 

Quote:
Originally Posted by fivemack View Post
That's not really the case for base stations, and particularly not for big routers; the base stations are FPGA-based because mobile phone standards evolve faster than you'd want to make ASICs, and the routers are FPGA-based because they need extremely fast transceivers and it's easier to get Xilinx to design them and then use the commercial chips from Xilinx. If you're in a big-volume business like telecom and want low power, you definitely don't want FPGAs - using a 16-bit SRAM to implement a NAND gate is not the way to go to save power.



I realise that's what the peak numbers multiply out to. I don't think "deliver" is remotely the right word to use until you start presenting software which actually achieves that kind of performance on a problem we're interested in.
Actually, for the Nvidias there is a reasonably fast LL implementation based on the Nvidia library. Nearly no one is using it, though.

Having it doesn't mean people are actually using it.

That also tells you something about how unlikely it is that anyone here would ever invest in FPGA technology.

So I'd argue the 'show' and 'proof' are there, yet even that hardly gets people to use it.

That's a hard conclusion about the people, rather than about GPUs and proving them fast.