mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Parallella / Epiphany (https://www.mersenneforum.org/showthread.php?t=18589)

Mark Rose 2013-09-15 00:40

Parallella / Epiphany
 
So I have a [URL="http://www.parallella.org/"]Parallella[/URL] board with a 1 GHz [URL="http://www.adapteva.com/wp-content/uploads/2012/12/epiphany_arch_reference_3.12.12.18.pdf"]16-core Epiphany[/URL] [PDF] chip on its way to me in a month or so. If you're not familiar with the chip, it's a 32-bit RISC architecture with a nanosecond-latency fabric between the cores.

Each core can do a single 32-bit integer operation (load/store, add/sub, shift, bitwise stuff) per clock.

Each core also has an IEEE 754 single-precision FPU capable of one addition, subtraction, [b]fused multiply-add, fused multiply-subtract[/b], fixed-to-float conversion, absolute value, or float-to-fixed conversion per clock (2 GFLOPS).

It also has 64 registers usable with no restrictions on access, 32 KB of L1 cache per core, and 1 GB of shared memory for the whole board in one unified address space.

And its power draw is only 5 watts for the whole board. So it's efficient power-wise (but maybe not cost-wise).

I bought it mainly as something to get more into lower level programming and to learn some math. It does support OpenCL, but I'm thinking writing assembly could get more performance out of it. What would such an architecture be useful for with regards to GIMPS (or SoB)?

rogue 2013-09-15 01:23

Being RISC rather than x86, very limited. You could theoretically use it on GIMPS, but not SoB. You should talk to Ernst (ewmayer) about optimizing mlucas for it.

ewmayer 2013-09-15 20:18

[QUOTE=rogue;353020]Being RISC rather than x86, very limited. You could theoretically use it on GIMPS, but not SoB. You should talk to Ernst (ewmayer) about optimizing mlucas for it.[/QUOTE]

Sorry, Mlucas requires 64-bit int and floating-double support. W.r.t. LL testing (rather than TF), the former is needed for the quad-float emulation used in the trig-table inits; the FFT could obviously be recoded to use single floats, but that's not worth my time. I wish I had time to custom-code for every interesting arch out there, but I have, alas, just one lifetime and need to choose my coding battles carefully.

Anyone is welcome to take the source and try to convert the scalar-double build stuff (as opposed to the x86 SIMD) to use float rather than double, but aside from very, very limited "start here and look at..." advice, they'd be on their own.

Mark Rose 2013-09-15 23:33

[QUOTE=ewmayer;353068]Sorry, Mlucas requires 64-bit int and floating-double support. W.r.t. LL testing (rather than TF), the former is needed for the quad-float emulation used in the trig-table inits; the FFT could obviously be recoded to use single floats, but that's not worth my time. I wish I had time to custom-code for every interesting arch out there, but I have, alas, just one lifetime and need to choose my coding battles carefully.[/quote]

Having just one lifetime was the saddest realization I ever had. So many things I'll never have time to do...

[quote]Anyone is welcome to take the source and try to convert the scalar-double build stuff (as opposed to the x86 SIMD) to use float rather than double, but aside from very, very limited "start here and look at..." advice, they'd be on their own.[/QUOTE]

I may yet try it. When the chip arrives I'll look into it in more detail.

xilman 2013-09-20 10:59

[QUOTE=Mark Rose;353018]So I have a [URL="http://www.parallella.org/"]Parallella[/URL] board with a 1 GHz [URL="http://www.adapteva.com/wp-content/uploads/2012/12/epiphany_arch_reference_3.12.12.18.pdf"]16-core Epiphany[/URL] [PDF] chip on its way to me in a month or so. If you're not familiar with the chip, it's a 32-bit RISC architecture with a nanosecond-latency fabric between the cores.

Each core can do a single 32-bit integer operation (load/store, add/sub, shift, bitwise stuff) per clock.

Each core also has an IEEE 754 single-precision FPU capable of one addition, subtraction, [b]fused multiply-add, fused multiply-subtract[/b], fixed-to-float conversion, absolute value, or float-to-fixed conversion per clock (2 GFLOPS).

It also has 64 registers usable with no restrictions on access, 32 KB of L1 cache per core, and 1 GB of shared memory for the whole board in one unified address space.

And its power draw is only 5 watts for the whole board. So it's efficient power-wise (but maybe not cost-wise).

I bought it mainly as something to get more into lower level programming and to learn some math. It does support OpenCL, but I'm thinking writing assembly could get more performance out of it. What would such an architecture be useful for with regards to GIMPS (or SoB)?[/QUOTE]Just placed an order for the 4-board system with delivery expected in November. OK, so it was a wild impulse.

Likely initial projects will be to get ECM implemented and/or a RNS arithmetic library with each core using its own modulus. 64 cores will allow just short of 1024-bit arithmetic to be implemented mostly in parallel.
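The RNS plan above can be sketched in plain Python. Nothing here is Epiphany code: the 16 moduli (the largest primes below 2^31) and all the function names are illustrative assumptions. The point the sketch makes is why the scheme parallelizes so well: additions and multiplications act on each residue independently (one per "core"), and only the final Chinese-Remainder-Theorem reconstruction mixes them.

```python
# Sketch of an RNS arithmetic library: each "core" holds one residue of a
# large integer modulo its own prime. The moduli below are hypothetical;
# a real Epiphany port would choose moduli suited to its 32-bit integer ALU.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def largest_primes_below(bound: int, count: int) -> list:
    primes, n = [], bound - 1
    while len(primes) < count:
        if is_prime(n):
            primes.append(n)
        n -= 1
    return primes

MODULI = largest_primes_below(2**31, 16)   # one modulus per "core"

def to_rns(x: int) -> list:
    return [x % m for m in MODULI]

def mul_rns(xs: list, ys: list) -> list:
    # Each "core" multiplies independently: no carries, no communication.
    return [(x * y) % m for x, y, m in zip(xs, ys, MODULI)]

def from_rns(residues: list) -> int:
    # CRT reconstruction, done once at the very end of a computation.
    M = 1
    for m in MODULI:
        M *= m
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m): Python 3.8+
    return total % M
```

With 16 primes near 2^31 the dynamic range is a bit under 2^496; scaling the same scheme to 64 moduli gives the "just short of 1024-bit" capacity mentioned above.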

ewmayer 2013-09-20 20:26

[QUOTE=Mark Rose;353082]Having just one lifetime was the saddest realization I ever had. So many things I'll never have time to do...

I may yet try it. When the chip arrives I'll look into it in more detail.[/QUOTE]

Having just recently completed the main phase [meaning "aside from ongoing optimization efforts"] of my big coding project that began the year, porting all my Mlucas SIMD code to AVX, for my part I am now going to spend some weeks bringing the old scalar-double code under the same pthread ||ization umbrella. [I have ditched all the old experimental OpenMP pragmas, the interface is just way too opaque - and having all the pthread infrastructure in place for the SIMD builds makes it relatively trivial to handle the scalar code the same way.]

Long story short, the official releases will still be doubles-based, but should make it much easier for someone to try to port to architectures like the one under discussion here.

R.D. Silverman 2013-09-20 22:21

[QUOTE=ewmayer;353068]Sorry, Mlucas requires 64-bit int and floating-double support. W.r.t. LL testing (rather than TF), the former is needed for the quad-float emulation used in the trig-table inits; the FFT could obviously be recoded to use single floats, but that's not worth my time. I wish I had time to custom-code for every interesting arch out there, but I have, alas, just one lifetime and need to choose my coding battles carefully.

[/QUOTE]

Some computing history.....

Indeed. Back in the '80s, Sam Wagstaff and Jeff Smith at UGA wanted to produce a custom architecture to run CFRAC, known as the EPOC [extended precision operand computer]. Very shortly thereafter I got MPQS running on a VAX, and people came to realize that trying to implement algorithms on custom hardware was (in general) not worth the effort. The NSF made an informal decision to look at custom-architecture proposals VERY carefully.

At the same time, MicroVAXes and Suns were becoming available. It was also realized that it was more practical, more cost-effective, and VASTLY more portable to "ride the technology curve" as low-cost distributed small computers became widespread.

One can always squeeze more performance (sometimes a lot more) from custom hardware. The price is portability. History has shown that custom implementations that only run on special hardware are generally not worth the effort.

only_human 2013-09-20 23:24

[QUOTE=R.D. Silverman;353646]History has shown that custom
implementations that only run on special hardware are generally not
worth the effort.[/QUOTE]Even TWINKLE?

R.D. Silverman 2013-09-20 23:46

[QUOTE=only_human;353654]Even TWINKLE?[/QUOTE]

Never implemented. Would not have been funded by NSF.
It was of theoretical interest only.

jasonp 2013-09-21 00:36

A group in Japan did implement an NFS line sieve in dedicated hardware several years ago. SHARCS has some nice papers on hardware architectures for sieving too.

GPUs are technically special-purpose, but the economies of scale in the gaming market make them dramatically more common. Now BOINC participants think you're a moron if your computation doesn't run on their cards.

Batalov 2013-09-21 01:25

[QUOTE=jasonp;353665]...a moron if your computation doesn't run on their cards.[/QUOTE]
Do they?! /gasp/

firejuggler 2013-09-21 01:34

GPU are special-purpose hardware, but they are widespread.

ewmayer 2013-09-21 02:07

[QUOTE=firejuggler;353669]GPU are special-purpose hardware, but they are widespread.[/QUOTE]

Similar to x86 - where I indeed have spent much time on custom coding these past 7-8 years. OTOH, the folks in the SPARC hardware group at Oracle inform me that the recent SPARC CPUs - like x86 - support SIMD, but there the bang-for-my-buck math alas doesn't support the custom coding effort.

Actually, another reason I decided to do the ||ization of the scalar-double code as my next project is by way of preparation for port-to-GPU. There the major "how best to spend one's time?" issues relate to coding APIs - e.g. cuda vs OpenCL - and whether to focus on nVidia offerings or try to be more general.

My first toy GPU-coding trials used cuda/nVidia, but now it seems an open, non-single-vendor-tied standard like OpenCL is the way to go.

henryzz 2013-09-21 13:10

[QUOTE=ewmayer;353672]My first toy GPU-coding trials used cuda/nVidia, but now it seems an open, non-single-vendor-tied standard like OpenCL is the way to go.[/QUOTE]
I am not sure people have been very successful at running OpenCL code like mfakto on Nvidia cards, at least at a good speed. It would probably be worth checking that that problem has been solved before committing exclusively to OpenCL. Nvidia cards are still more common on this forum, I think.

sanaris 2013-09-21 23:56

It won't work, and not just because it is special hardware.
I tested Prime95 on Intel Atoms: at 400 ms per iteration, the performance just isn't worth looking into.

If you pay for a 10x decrease in power with a 100x decrease in performance, it doesn't pay for itself. Any chip drawing less than ~80 watts just isn't worth looking into (unless, of course, it is an ASIC or a big $10k FPGA, like those in the MDGRAPE project, for example).

Second: every fabless attempt at production is useless. The fabs have all the money in the world and will reject any deal under $10M. Only powerful nations can afford fabs and production. Why the US is not powerful enough to achieve its computing targets is a political question. Nothing can be done in this area from the perspective of a developer who does not have $10M in his pocket. Nvidia and AMD are failing companies now; they were relying on the "fabless fake dream".

jasonp 2013-09-22 00:09

[QUOTE=Batalov;353667]Do they?! /gasp/[/QUOTE]
I've seen several posts on BOINC message boards that say 'it is well known that GPUs are awesome. When will your project use them?'

Batalov 2013-09-22 01:06

I think I've seen similar posts, too, but I don't think they concluded that any particular author was a moron. ;-) They may very well think that the GPUs are awesome and the author is awesome but is busy with IRL stuff. Try George's reply on them:
[quote]At no point in time did I say I was going to implement this [([I]immediately[/I])]. I can guarantee it won't happen this decade*.[/quote]And then, when you have implemented it in a couple of years, they will be pleasantly surprised.

How does your schedule look?
______________
*keep in mind that there was ~a year left in [I]that[/I] decade, so it was a cleverly playful answer.

xilman 2013-10-25 08:10

[QUOTE=xilman;353554]Just placed an order for the 4-board system with delivery expected in November. OK, so it was a wild impulse.

Likely initial projects will be to get ECM implemented and/or a RNS arithmetic library with each core using its own modulus. 64 cores will allow just short of 1024-bit arithmetic to be implemented mostly in parallel.[/QUOTE]Delivery has slipped to December, but the architecture manual and datasheet have just come out.

The Epiphany co-processors do 32-bit floating-point arithmetic --- no double precision --- OR 32-bit integer arithmetic, together with an independently usable, limited integer ALU. The latter has no hardware multiplication/division, and the former shares opcodes between fp and [i]signed[/i] integer arithmetic. [i]There is no division operation in either mode[/i], but fused multiply-add and multiply-subtract are available in both.

The coprocessors are optimised for DSP and should also be pretty good for symmetric crypto primitives (at first sight) but high performance multi-precision integer arithmetic might be challenging.
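With no divide instruction in either mode, a port would typically synthesize division from the operations the FPU does have: a cheap initial reciprocal guess refined by Newton-Raphson steps, each of which maps onto exactly one fused multiply-subtract plus one fused multiply-add. Below is a rough Python model of that technique, not Epiphany code; the seed constant 0x7EF311C3 is a commonly quoted bit-trick approximation for positive normal floats, chosen here purely for illustration.

```python
import struct

def float32(x: float) -> float:
    """Round a Python double to IEEE 754 single precision."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def recip_seed(a: float) -> float:
    # Cheap initial guess for 1/a by manipulating the float's bit pattern.
    # The magic constant is a well-known approximation trick (an assumption
    # here, nothing Epiphany-specific); valid for positive normal floats.
    bits = struct.unpack('<I', struct.pack('<f', a))[0]
    return struct.unpack('<f', struct.pack('<I', 0x7EF311C3 - bits))[0]

def recip(a: float, iters: int = 4) -> float:
    """Newton-Raphson reciprocal: x <- x + x*(1 - a*x)."""
    x = recip_seed(a)
    for _ in range(iters):
        e = float32(1.0 - a * x)   # one fused multiply-subtract
        x = float32(x + x * e)     # one fused multiply-add
    return x
```

The seed is good to a few percent, and each iteration roughly squares the relative error, so four iterations are ample for single precision; in hardware that is eight FPU ops per division.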

I'm looking forward to receiving the kit in a few weeks.

paulunderwood 2017-03-13 20:56

A bit of necro-post...

Did you, Paul and Mark, get anything useful done with these boards?

How is the fpga programmed?

What flavour of Linux are you using?

:bump2:

Mark Rose 2017-03-13 22:02

Nope, I did not.

I never did find a solution to the flaky ethernet, plus I never got around to getting quiet cooling on my kickstarter board. So it's basically sitting in a parts drawer.

GP2 2017-03-13 22:06

[QUOTE=paulunderwood;454819]How is the fpga programmed?[/QUOTE]

I know absolutely nothing about any of this, but I see the keyword FPGA and I'm reminded that Amazon [URL="https://aws.amazon.com/ec2/instance-types/f1/"]AWS recently introduced FPGA instances[/URL] (still in preview), using 16 nm Xilinx UltraScale Plus FPGA.

Is there any hope of some radically faster implementations via this route?

Madpoo 2017-03-14 04:12

[QUOTE=GP2;454824]I know absolutely nothing about any of this, but I see the keyword FPGA and I'm reminded that Amazon [URL="https://aws.amazon.com/ec2/instance-types/f1/"]AWS recently introduced FPGA instances[/URL] (still in preview), using 16 nm Xilinx UltraScale Plus FPGA.

Is there any hope of some radically faster implementations via this route?[/QUOTE]

I suppose the question there is: what kind of magic operation would be awesome to have that isn't currently available anywhere, or what current op/ops could be made faster if only they had a dedicated configuration?

128-bit FP? Faster FMA? (just tossing out two examples that will probably get people laughing at me).

Then it comes down to how (or whether) that can be implemented in the FPGA in question. Are there existing libraries that offer it, or is there anyone with the experience to have a go at coding it?

It sure seems interesting but I suspect the work involved in just programming the chip to work as expected and get the gains you like (plus customizing mprime to use it) would be a major effort. Could be worthwhile in the long run, especially if Intel and/or AMD makes FPGAs part of their future dies.

xilman 2017-03-14 07:23

[QUOTE=paulunderwood;454819]A bit of necro-post...

Did you, Paul and Mark, get anything useful done with these boards?

How is the fpga programmed?

What flavour of Linux are you using?

:bump2:[/QUOTE]My cluster is running ECM on the ARM cores. Nothing exciting.

LaurV 2017-03-14 14:52

[QUOTE=Madpoo;454838]
128-bit FP? Faster FMA? (just tossing out two examples that will probably get people laughing at me).[/QUOTE]
No laughing. 96-bit modular multiplication (or just squaring) in a single tick. Thinking about exponentiation and TF. This is not very difficult to implement for somebody who knows VHDL well. We tried once, but got our hands dirty and our noses red, and gave up... Need to learn more... We are still dreaming about duplicating the design 96 times in the same chip (a multiplier may not use more than a hundredth of a Xilinx or so) so we could crunch all 96 classes (of a mod-420 classification) at the same time. But it may be a very expensive machine, and not extremely fast... I mean, it may be faster than the current tools, but to make it worthwhile it would need the next step: ASICs, low cost and low power consumption, hundreds of them. Like Bitcoin mining. Which would need big investment... etc.
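The "96 classes" come from a standard TF sieve: candidate factors of M[sub]p[/sub] have the form q = 2kp + 1, q must be ±1 (mod 8), and any class of k mod 420 (= 2²·3·5·7) in which q is always divisible by 3, 5, or 7 can be skipped entirely. A small Python sketch of that class sieve follows; the exponent 1000003 used in testing is just an arbitrary prime for illustration, and the function name is made up.

```python
def eligible_classes(p: int, n_classes: int = 420,
                     small_primes: tuple = (3, 5, 7)) -> list:
    """Classes k mod n_classes that can contain factors q = 2kp+1 of M_p.

    Because n_classes is divisible by 4 and by each small prime, both the
    q mod 8 test and the small-prime divisibility tests depend only on
    k mod n_classes, so whole classes are ruled out once and for all.
    """
    survivors = []
    for k in range(n_classes):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):                  # q must be ±1 (mod 8)
            continue
        if any(q % s == 0 for s in small_primes):
            continue
        survivors.append(k)
    return survivors
```

For a prime exponent p coprime to 3, 5, and 7, the surviving fraction is (2/4)(2/3)(4/5)(6/7), i.e. 96 of 420 classes; extending the modulus to 4620 = 420·11 and sieving on 11 as well leaves 960 classes, the "full 960" mentioned above.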

GP2 2017-03-14 18:02

[QUOTE=LaurV;454859]We are still dreaming about duplicating the design 96 times in the same chip (it may not use more than a hundredth of a xilinx or so, for a multiplier) so we could crunch all 96 classes (on a 420-based) at the same time.[/QUOTE]

I think TF is several years ahead of the LL wavefront, and GPUs are plentiful. So maybe other areas could benefit more. If it was somehow possible to use FPGA to greatly speed up LL testing, the $150,000 prize for a 100M decimal-digit prime could be an incentive.

We have had an unusually dense streak of prime finds in recent years. Historically there were some big gaps, including the ones between M127 and M521, or M216091 and M756839. So if we're unlucky, it's certainly possible that the next exponent that yields a prime could be more than three and a half times larger than the current 74.2M, which would put it above 250M, and we might wait nearly a decade between discoveries, as with the Sierpinski problem.

LaurV 2017-03-17 10:01

There is no profit in hunting for the EFF money; you should know that after so many years of hunting for primes :razz:. What we do, we do for fun, for socializing, for fame, etc., but from the expense point of view we are always in the negatives. You will spend more on hardware and electricity, unless you are bloody lucky. In fact, you contradict yourself in the second paragraph of your post by [U]correctly[/U] pointing out how long it may take to find that prime: think of the effort and finances needed to build specialized hardware; even if you run it almost for free and you get the EFF's money, you will still be in the negatives when you find that prime...

OTOH, I was talking about TF because it is easy to implement: integers only, and shift-and-add registers in an FPGA are elementary. (That is not a single-tick solution; you would need a cleverer multiplication to be fast enough, otherwise you waste about 100 ticks on every squaring. But that may be OK too: you need less hardware, so you may be able to fit more multipliers and go for a full 960 classes in a mod-4620 classification.) Etc.
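The shift-and-add scheme described above can be modelled in software: one conditional add plus one shift-with-conditional-subtract per bit of the multiplier, which is why a naive 96-bit modular squaring costs on the order of 100 ticks. This is a behavioral Python sketch of the hardware idea, not VHDL, and the function name and parameters are made up for illustration.

```python
def modmul_shift_add(a: int, b: int, m: int, width: int) -> int:
    """Bit-serial modular multiply a*b mod m, MSB-first double-and-add.

    Each loop pass models one hardware tick: shift the accumulator left
    (doubling it) and conditionally add `a`. Conditional subtractions of
    `m` stand in for full division, since each step keeps acc < 2m.
    Assumes 0 <= a, b < m and m < 2**width.
    """
    acc = 0
    for i in reversed(range(width)):
        acc <<= 1                 # doubling shift: acc < 2m afterwards
        if acc >= m:
            acc -= m              # conditional subtract (same tick)
        if (b >> i) & 1:
            acc += a              # conditional add of the multiplicand
            if acc >= m:
                acc -= m
    return acc
```

In an FPGA each pass is a register shift, a comparator, and an adder, i.e. one tick per bit; a single-tick multiplier instead needs a full parallel multiplier array, which is the hardware-versus-speed trade-off discussed above.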

Implementing FFT multiplication for LL or P-1 in VHDL would be totally OVER my head... (and most people's here).

retina 2017-03-17 15:54

[QUOTE=LaurV;455022]Implementing FFT multiplication for LL or P-1 in VHDL would be totally OVER my head... (and most of the people here).[/QUOTE]I think NTT would be the thing to target for FPGAs. Forget about FFT, too messy and untidy. The only reason we use it now is because CPUs are good at FFT, and not so good at NTT.

GP2 2017-03-17 17:36

[QUOTE=retina;455040]I think NTT would be the thing to target for FPGAs. Forget about FFT, too messy and untidy. The only reason we use it now is because CPUs are good at FFT, and not so good at NTT.[/QUOTE]

Turns out there's actually a thing called a [URL="http://www.sciencedirect.com/science/article/pii/S1051200413002388"]Mersenne number transform[/URL], a specialization of NTT. Is that relevant to our interests by any chance?

GP2 2017-03-23 07:59

I was under the impression that you have to learn and use VHDL to use FPGAs. But apparently FPGAs can be programmed with the same OpenCL as for GPUs. I wonder if clLucas could be adapted for this?

Xilinx supports OpenCL in their [URL="http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html"]SDAccel Development Environment[/URL] (the FPGA instances available in preview on Amazon cloud are Xilinx UltraScale Plus). They give a number of [URL="http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html#examples"]examples of applications[/URL] (Bitcoin mining, etc). There are also some tutorial videos.

Edit: here's [URL="https://deixismagazine.org/2017/03/chipping-away/"]an article[/URL] that suggests FPGAs could become more widely used in high-performance computing applications.

jasonp 2017-03-23 16:04

Mersenne number transforms basically replace floating point numbers in complex FFTs with integers modulo a small Mersenne number. The most convenient such Mersenne number on a 64-bit system is 2^61-1. They've been known for a long time, probably since the early 1980s. If the figure of merit for how fast an integer NTT runs is the number of integer multiplications that you have to do for very big transform lengths, these kinds of transforms are probably the best tools available.
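The arithmetic convenience of 2^61-1 is that reduction modulo it needs no division at all: since 2^61 ≡ 1 (mod 2^61-1), the high bits of a product simply fold back onto the low bits with shifts, masks, and adds. Here is a minimal Python sketch of that reduction (the function names are illustrative, and this is only the modular-arithmetic kernel, not a full transform):

```python
M61 = (1 << 61) - 1   # the Mersenne prime 2^61 - 1

def mod_m61(x: int) -> int:
    """Reduce 0 <= x < 2**122 modulo 2^61 - 1 without any division.

    Writing x = hi*2^61 + lo and using 2^61 == 1 (mod M61) gives
    x == hi + lo, so two folding rounds plus one conditional subtract
    are enough: after the first fold x < 2**62, after the second
    x <= 2**61.
    """
    x = (x & M61) + (x >> 61)
    x = (x & M61) + (x >> 61)
    return x - M61 if x >= M61 else x

def mulmod_m61(a: int, b: int) -> int:
    # The only expensive op is a plain integer multiply; the reduction
    # is a handful of shifts and adds, which is exactly the appeal for
    # an integer NTT on hardware without a fast divider.
    return mod_m61(a * b)
```

In an NTT these mulmods replace the complex floating-point butterflies of an FFT, trading rounding-error analysis for exact integer arithmetic.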

Ernst and I have a lot of practice building them, and they are a playground for optimization; but floating point FFTs are still faster on general purpose hardware.

Edit: a thread from [url="mersenneforum.org/showthread.php?t=11255"]the last time this came up[/url]

GP2 2017-03-23 18:52

[QUOTE=jasonp;455351]Ernst and I have a lot of practice building them, and they are a playground for optimization; but floating point FFTs are still faster on general purpose hardware.[/QUOTE]

Sure, but what about FPGAs? Would integer make more sense there? By "general purpose" I presume you mean x86-64 architecture. Do the GPU programs like cudaLucas and clLucas use integer or floating point?

jasonp 2017-03-23 23:41

They would be floating point only.

