#2
Aug 2006
3×1,993 Posts
#3
Nov 2014
32 Posts
Interesting. It would be nice if someone could get their hands on a reference board.
According to the website, this TURBOCARD2 has four of these processors (http://www.kalray.eu/products/mppa-manycore/mppa-256/), each of which has 256 of these cores.
Not sure how that translates to performance for large-integer arithmetic, though.
Last fiddled with by BenR on 2014-11-23 at 17:58 Reason: Grammar
#4
Sep 2006
The Netherlands
2×3×131 Posts
Looks like a student proposal.
12.8 GB/s to the RAM for 256 cores, and stupidly slow DDR3 from years ago at that. With today's DDR3 you can easily get 40 GB/s if you want to. You'd want 1 TB/s, however.
I don't see how large the register file is for each array of 16 cores. I miss a drawing of how the clusters connect to each other: we just see a drawing of one cluster, I suppose, yet there would be 16 of them on one chip. How much L2 and L1 do we have? I can easily produce a card with 1 million cores, yet without any caches, and connecting to 12 GB/s of DDR3 is something your phone processor already manages. A CPU is only as good as its cache subsystem.
They say it's going to be interesting for specific companies, which rely heavily on the cache subsystem of GPUs. Yet I see no mention of any caches, let alone register files, for this. How fast is the instruction cache, and how large is it for each cluster?
It's supposed to be an array. In arrays you usually have an input and an output. Where are the input and the output? I miss a simple arrow pointing that out. Seems it's a PDF made just to get subsidies. Good luck with that.
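A quick back-of-the-envelope check of the per-core bandwidth, as a minimal Python sketch. It assumes (an inference, not something the thread or the PDF states) that the 12.8 GB/s comes from two 64-bit DDR3-800 channels, and it spreads that evenly over 256 cores; the function names are made up for illustration.

```python
# Back-of-the-envelope DRAM bandwidth per core (illustrative numbers only).
# Assumption: a DDR3 channel is 64 bits (8 bytes) wide, so its peak rate is
# mega-transfers per second * 8 bytes. Real sustained bandwidth is lower.

def ddr3_channel_gbs(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one channel in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def per_core_mbs(total_gbs: float, cores: int) -> float:
    """Each core's share in MB/s when all cores stream from RAM at once."""
    return total_gbs * 1e3 / cores


if __name__ == "__main__":
    chip = 2 * ddr3_channel_gbs(800)  # assumed: two DDR3-800 channels per chip
    print(f"per chip: {chip:.1f} GB/s")                     # per chip: 12.8 GB/s
    print(f"per core: {per_core_mbs(chip, 256):.0f} MB/s")  # per core: 50 MB/s
    print(f"1 TB/s is about {1000 / chip:.0f}x the per-chip figure")
```

At roughly 50 MB/s per core, a core that streams its operands from RAM gets only a few million 8-byte values per second, which is why the missing cache details matter so much.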
#5
Romulan Interpreter
"name field"
Jun 2011
Thailand
23365₈ Posts
The x4 board still has only two channels of RAM per chip (per 256 cores), so eight channels in total. I think it would be quite bandwidth-limited, and its LL performance will not overtake a _good_ (Xeon?) CPU. For TF it could maybe rival a good GPU, eventually running something like 1024 classes in parallel, every class on one core as its own independent thread; in that case memory access would be minimal.
Last fiddled with by LaurV on 2014-11-24 at 02:56
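Why TF needs so little memory traffic can be seen from how the work is split into classes. Below is a minimal, hypothetical Python sketch for a Mersenne number M(p) = 2^p - 1 with p a prime > 11: every factor has the form q = 2kp + 1 with q ≡ ±1 (mod 8), and k is split by its residue mod 4620 = 4·3·5·7·11 (an assumed choice; the "1024" above is a rough figure). Classes whose candidates are ≡ 3 or 5 (mod 8) or divisible by 3, 5, 7 or 11 are dropped, leaving independent work units that each need only a handful of integers.

```python
# Class-split trial factoring (TF) of a Mersenne number M(p) = 2**p - 1: a sketch.
# Assumes p is a prime > 11. Every factor q of M(p) has the form q = 2*k*p + 1
# and satisfies q % 8 in (1, 7). Splitting k by its residue c = k mod NUM_CLASSES
# gives independent classes; each could run on its own core with almost no RAM traffic.

SMALL_PRIMES = (3, 5, 7, 11)
# Assumed choice: m must be a multiple of 4 and of each small prime for the filter to be valid.
NUM_CLASSES = 4 * 3 * 5 * 7 * 11  # 4620


def surviving_classes(p: int, m: int = NUM_CLASSES) -> list[int]:
    """Residues c of k mod m whose candidates q = 2kp + 1 are worth testing."""
    keep = []
    for c in range(m):
        # q mod 8 and q mod r depend only on c, because 4 | m and every r | m.
        q = 2 * c * p + 1
        if q % 8 not in (1, 7):
            continue
        if any(q % r == 0 for r in SMALL_PRIMES):
            continue
        keep.append(c)
    return keep


def trial_factor_class(p: int, c: int, max_k: int, m: int = NUM_CLASSES) -> list[int]:
    """One independent work unit: test k = c, c + m, c + 2m, ... up to max_k."""
    found = []
    for k in range(c, max_k + 1, m):
        if k == 0:
            continue  # q = 1 is a trivial divisor
        q = 2 * k * p + 1
        if pow(2, p, q) == 1:  # 2**p == 1 (mod q), i.e. q divides M(p)
            found.append(q)
    return found


def trial_factor(p: int, max_k: int) -> list[int]:
    """Run every surviving class; on a many-core chip each would get its own core."""
    factors = []
    for c in surviving_classes(p):
        factors.extend(trial_factor_class(p, c, max_k))
    return sorted(factors)


if __name__ == "__main__":
    print(len(surviving_classes(29)))  # 960 (out of 4620 classes)
    print(trial_factor(29, 40))        # [233, 1103, 2089]
```

Each call to `trial_factor_class` keeps its whole working set in a few registers, which is the contrast with an LL test, whose FFT has to stream large arrays through memory.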
#6
Sep 2006
The Netherlands
2×3×131 Posts
What if it is clocked at 1 MHz and released in the year 2030?
There are zero numbers attached to this PDF. I can write 100 PDFs of this sort a day.
#7
"Curtis"
Feb 2005
Riverside, CA
5·11·97 Posts
#8
Sep 2006
The Netherlands
1422₈ Posts
You find 5 Tflops single precision impressive, huh?
I can write you 100 PDFs a day like that with higher Tflop numbers! Just don't ask me for more details, like what the chip will look like 30 years from now :)
I especially remember one Intel presentation from some years ago, when they were busy with Larrabee and all the names its successor had until they renamed it to Xeon Phi. First there always came 5 sheets of disclaimers saying that what was being presented "possibly" would not be real. Yet at least back then THAT was a chip they already had a prototype of; they just didn't know whether they would manage to clock it at the target frequency.
In modern CPUs that get 1+ Tflop DOUBLE PRECISION, there are many problems they aren't telling you about. One of them is how you distribute the instructions around the chip. Yet in the end each chip is only as good as its caches. If those are better, they also have a reason to provide higher bandwidth to the RAM. So if we reverse the logic, the bandwidth to the RAM is an indication of how well they succeeded in producing the chip.
Forget getting efficient forms of the Fast Fourier Transform, like the DWT, working with the currently publicly known algorithms on an array processor that gets 10 GB/s to the RAM without caches. In the end there are 2 ways to implement the FFT: either you get a small block from the RAM, work on it inside the caches and then write it back to the RAM, or you keep the entire transform inside the caches and then write back the result. With something that does 1 Tflop per second, I hope you realize that the traffic to and from the caches/register files is roughly 2 input values + 1 output value per operation, times 8 bytes. That's 24 terabytes per second. So some notion of how the data caches work is the first baby step in knowing what the chip is like.
#9
Sep 2006
The Netherlands
786₁₀ Posts
@LaurV
http://www.amd.com/en-gb/products/gr...top/5000/5970# That's also 5 Tflops single precision. Granted, there are some restrictions (24 bits, huh, and no prefetch from the RAM, etc.). Yet for TF you do not need much more. You can pick them up for nearly free on eBay now. Of course, when you multiply you'd ideally also want a quick way to get the high bits. But well...
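On the 24-bit restriction and the high bits: TF candidates are far wider than 24 bits, so with a 24 × 24 -> 48-bit multiplier each operand is split into 24-bit limbs, and every partial product contributes both a low and a high 24-bit half. A minimal sketch; `mul24_lo`/`mul24_hi` are hypothetical stand-ins, not the actual instructions of the HD 5970 or any OpenCL built-in.

```python
# Multi-limb multiplication with only a 24 x 24 -> 48-bit multiplier (a sketch).
# mul24_lo / mul24_hi are hypothetical helpers standing in for "low half" and
# "high half" of one 24-bit product; they are not real GPU or OpenCL names.

LIMB_BITS = 24
MASK24 = (1 << LIMB_BITS) - 1


def mul24_lo(a: int, b: int) -> int:
    """Low 24 bits of a 24 x 24-bit product."""
    return (a * b) & MASK24


def mul24_hi(a: int, b: int) -> int:
    """High 24 bits of a 24 x 24-bit product (the part that is costly if unsupported)."""
    return (a * b) >> LIMB_BITS


def to_limbs(x: int, n: int) -> list[int]:
    """Little-endian 24-bit limbs of a non-negative x < 2**(24*n)."""
    return [(x >> (LIMB_BITS * i)) & MASK24 for i in range(n)]


def from_limbs(limbs: list[int]) -> int:
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))


def mul_limbs(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook product; every partial product feeds two adjacent columns."""
    cols = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            cols[i + j] += mul24_lo(ai, bj)
            cols[i + j + 1] += mul24_hi(ai, bj)
    # Column sums can exceed 24 bits (in 32-bit registers there is room for
    # about 256 terms of under 24 bits each), so normalise with one carry pass.
    carry = 0
    for k in range(len(cols)):
        total = cols[k] + carry
        cols[k] = total & MASK24
        carry = total >> LIMB_BITS
    return cols


if __name__ == "__main__":
    a, b = 0x123456789ABCDEF012, (1 << 72) - 1  # two 72-bit operands = 3 limbs each
    assert from_limbs(mul_limbs(to_limbs(a, 3), to_limbs(b, 3))) == a * b
    print(hex(a * b))
```

Half of the multiplier outputs in the inner loop are high halves, so if the hardware only delivers the low half cheaply, the cost of every modular multiplication in TF roughly grows by whatever it takes to recover the other half.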
#10
Romulan Interpreter
"name field"
Jun 2011
Thailand
9,973 Posts
Bwaahaha, are you arguing with me?
I wasn't replying to your post, and apart from the ferocity with which you attack those cards, and some additional gibberish in your posts, I said the same thing as you did (i.e. against those cards). By the way, where does the ferocity come from? You look like you want to convince everybody NOT to buy/invest in those chips/cards. I think everybody can do their own due diligence if they want to buy/invest in those cards. Of course, one should not expect them to run like 1024 computers in parallel. On top of that come the dedicated software tools and a (most probably) proprietary architecture; no idea whether it's compatible with x86, ARM, or whatever. Anyhow, the prices are most probably prohibitive: they talk about servers with thousands of chips and millions of cores, costing (wild guess) fortunes...
#11
Sep 2006
The Netherlands
2×3×131 Posts
You massively overestimate the expertise of government commissions regarding the hardware they invest in.
In Europe such commissions are usually paid, so people sit on them according to hierarchical order. Any knowledge there is totally coincidental, as the only thing that matters is sitting on as many paid commissions as possible, since that's the only thing that increases your salary. Because of this low level of expertise, other factors completely overrule the decision of where to invest.