mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2020-07-05, 08:42   #1
henryzz

Open source chip design

This is probably too restricted to be useful to the forum, but it sounds interesting.
https://www.theregister.com/AMP/2020...chip_hardware/
2020-07-05, 10:20   #2
Nick

So what is the commercial benefit for them?
2020-07-05, 10:53   #3
henryzz

Quote:
Originally Posted by Nick
So what is the commercial benefit for them?
Publicity and bug finding?
2020-07-05, 10:53   #4
retina

Quote:
Originally Posted by Nick
So what is the commercial benefit for them?
Loss leader.

Get a few happy customers to say "works great" and start raking in the dosh from others.
2020-07-05, 20:45   #5
ewmayer

Quote:
The goal is to develop an entirely open-source semiconductor manufacturing workflow. To help achieve this, Google and Skywater released an open-source PDK, or process development kit, which is described as a grab bag of design rules, logic and analog models and cells, specifications, and other data to turn your RTL files into actual working patterns of semiconductors, metals, and other chemicals on tiny squares of plastic-packaged silicon.

Normally, PDKs from foundries involve a lot of money; this one is free – the first-ever open source one, apparently – though it is a work-in-progress experiment.

And if you're worried about Google using this as a means to snaffle your intellectual property, don't forget: it's only for public projects that are open-source all the way down to the silicon layout. So if you qualify, you've already handed over your work to the world anyway.
Let me rephrase the above a bit: "The goal is for a bunch of people to provide Google with low-cost (by its deep-pockets standards) previews of potentially interesting work-in-silicon. Any genuinely interesting and potentially commercializable tech by project participants will be easy for Google to take, elaborate, modify, and work into its own proprietary microprocessor designs, and the folks who came up with said ideas won't be able to do a damn thing about it. After all, said ideas 'were already in the public domain'."
2020-07-05, 21:39   #6
Uncwilly

Quote:
Originally Posted by ewmayer
Let me rephrase the above a bit:
2020-07-05, 22:31   #7
preda

My dream would be a processor (hardware) that is designed to compute large convolutions, i.e. squaring giant numbers. I wonder how much faster a specialized design could be compared to the "general purpose" CPUs (including GPUs) we use now.

For example, such a design could either go down the established floating-point-FFT route or have fast specialized integer units for NTTs (the problem with NTTs right now is that current CPUs/GPUs are much faster at FP, so FP-FFT wins).
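
To make the NTT option concrete: a rough sketch of the core operation such a specialized integer unit would implement. The prime below is purely my example - p = 2^64 - 2^32 + 1, for which 2^64 mod p = 2^32 - 1, so the 128-bit product reduces with shifts and adds; C++ with the GCC/Clang 128-bit extension.

Code:
#include <cstdint>

// Radix-2 NTT butterfly over Z/p. The prime is illustrative only; its
// structure (2^64 mod p = 2^32 - 1) lets the 128-bit product be reduced
// with shifts and adds - the kind of trick fixed-function hardware could
// do in a cycle or two. Inputs are assumed already reduced mod p.
constexpr uint64_t P = 0xFFFFFFFF00000001ULL;

uint64_t mulMod(uint64_t a, uint64_t b) {
    return uint64_t((unsigned __int128)a * b % P);  // hardware: shift-add reduce
}
uint64_t addMod(uint64_t a, uint64_t b) {
    return uint64_t(((unsigned __int128)a + b) % P);
}
uint64_t subMod(uint64_t a, uint64_t b) { return a >= b ? a - b : a + (P - b); }

// (x, y) -> (x + w*y, x - w*y) mod p, with w a root of unity
void butterfly(uint64_t &x, uint64_t &y, uint64_t w) {
    uint64_t t = mulMod(w, y), u = x;
    x = addMod(u, t);
    y = subMod(u, t);
}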

If using FP, we'd like to maximize the mantissa size. One way would be to reduce the number of bits in the exponent (to get a wider mantissa). Another would be to use a wider FP format than DP (e.g. 80-bit or 128-bit FP).
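
To put a number on why mantissa width matters, a crude back-of-envelope (my own rough worst-case-style count, not a rigorous bound - real implementations with balanced digits and error averaging pack noticeably more bits per word; the FFT size and slack below are illustrative guesses):

Code:
#include <cstdio>

// Crude headroom count for an FP-FFT multiply: with an m-bit significand
// and a 2^lgN-point transform, convolution outputs need roughly
// 2*b + lgN + slack <= m bits, so each extra significand bit buys about
// half a bit per FFT word. lgN and slack are illustrative assumptions.
int main() {
    const double lgN = 26, slack = 4;
    for (int m : {53, 64, 113})   // DP, x87 80-bit, IEEE binary128
        std::printf("m = %3d -> ~%4.1f bits/word\n", m, (m - lgN - slack) / 2);
    return 0;
}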

Next, what would be a good elementary operation (a basic building block) for computing large FP-FFTs? For example, what we have right now (on CPUs/GPUs) is FMA ("fused multiply-add"), which is generally useful but not particularly great for FFTs (especially in the "high register pressure" context of GPUs).

I was thinking of having some giant "twiddle OP":
twiddle(A,B,C): return (A*B+C, A*B-C)

where A,B,C are complex values; such an OP may be great for FFTs.
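
In software terms the op is one complex multiply feeding a paired add/subtract - i.e. a radix-2 DIT butterfly, up to a sign on the second output. A throwaway C++ sketch just to pin down the semantics:

Code:
#include <complex>
#include <utility>

using cplx = std::complex<double>;

// The proposed op: one complex multiply feeding a paired add and subtract.
// With A = twiddle factor w, B = odd input, C = even input, this is a
// radix-2 DIT butterfly up to a sign flip on the second output.
std::pair<cplx, cplx> twiddleOp(cplx A, cplx B, cplx C) {
    cplx t = A * B;   // the one multiply; the add/sub can share operand prep
    return {t + C, t - C};
}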

Last fiddled with by preda on 2020-07-05 at 22:34
2020-07-05, 22:44   #8
retina

Quote:
Originally Posted by preda
If using FP, we'd like to maximize the mantissa size. One way would be to reduce the number of bits in the exponent (to get a wider mantissa). Another would be to use a wider FP format than DP (e.g. 80-bit or 128-bit FP).

Next, what would be a good elementary operation (a basic building block) for computing large FP-FFTs? For example, what we have right now (on CPUs/GPUs) is FMA ("fused multiply-add"), which is generally useful but not particularly great for FFTs (especially in the "high register pressure" context of GPUs).

I was thinking of having some giant "twiddle OP":
twiddle(A,B,C): return (A*B+C, A*B-C)

where A,B,C are complex values; such an OP may be great for FFTs.
Sure, you could do all that. But at 130 nm and 10 mm × 10 mm you would be doing well to fit just one multiply unit on the chip.

And what RAM interface would you use? With no space for internal caches you will need an awesome RAM interface.

I think this open source chip will be useless for anything that needs high performance with large numbers.

Last fiddled with by retina on 2020-07-05 at 22:44
2020-07-05, 22:49   #9
ewmayer

Specialized transform-computing hardware for DSPs is widespread; the problem for us is that the precision and convolution-size needs of the mobile-telecoms industry are rather different from ours.

Quote:
Originally Posted by preda
Next, what would be a good elementary operation (a basic building block) for computing large FP-FFTs? For example, what we have right now (on CPUs/GPUs) is FMA ("fused multiply-add"), which is generally useful but not particularly great for FFTs (especially in the "high register pressure" context of GPUs).

I was thinking of having some giant "twiddle OP":
twiddle(A,B,C): return (A*B+C, A*B-C)

where A,B,C are complex values; such an OP may be great for FFTs.
I've written about such an op here on several occasions, though focusing on the real-operands case. In the floating-point realm it's especially attractive because only one MUL is needed, and the FP-operand-preprocessing pipeline stages (exponent and significand extraction, relative-shifting of the A*B and C significands by the difference in exponents so as to align them) can also be done just once before feeding the two data pairs to the adder and subtracter.

An even bigger "why no such hardware instruction?" for me is complex multiply - I recall having a huge "WTF?" moment when I first saw the x86 SSE2 instruction set specification, seeing instantly how potentially useful it was for the scientific computing community, and seeing how Intel/AMD apparently completely disregarded the needs of said community in their instruction set design. And w.r.t. CMUL, all these years later, they still omit it. "Guys, we'd be OK if there were such a SIMD instruction and the latency was high - just give us enough registers to be able to hide the latency and we'll be in Happyville."
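
For reference, here is the dataflow a CMUL instruction would subsume - today SIMD code pays for each of these real ops, plus the re/im shuffles, separately. A plain C++ sketch; the 3-mul variant is the classic rearrangement that trades a multiply for extra adds:

Code:
#include <complex>

using cplx = std::complex<double>;

// Schoolbook complex multiply: 4 real MULs, 2 ADDs, plus the SIMD shuffles
// needed to pair up re/im lanes - all of which a single CMUL could hide.
cplx cmul4(cplx a, cplx b) {
    return {a.real() * b.real() - a.imag() * b.imag(),
            a.real() * b.imag() + a.imag() * b.real()};
}

// Karatsuba-style rearrangement: 3 real MULs, 5 ADDs - attractive when
// multipliers are the scarce resource, as they would be in silicon.
cplx cmul3(cplx a, cplx b) {
    double k1 = a.real() * (b.real() + b.imag());
    double k2 = b.imag() * (a.real() + a.imag());
    double k3 = b.real() * (a.imag() - a.real());
    return {k1 - k2, k1 + k3};
}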

Intel et al are fabulous (ha, made a punny on 'fabless') when it comes to hardware, but absolute shit w.r.t. instruction set design. Having worked with both the old DEC Alpha ISA and the current ARMv8 one, the contrast with Intel's fumbling-in-the-dark, long and painful road from MMX to SSE and beyond is massive. AVX-512 is actually halfway decent despite the lack of instructions like vector both-halves-of-128-bit product and CMUL, but it took them, what, 20 years to get there?

A specialized form of CMUL in which one operand is a root of unity would be really useful for convolutions - if there were some digital magic by which one could cheaply interconvert between Cartesian and polar form for a complex number, one could do such a twiddle mul by converting the 2 inputs to polar form, doing 1 real add of the 2 angles, then converting back to (x,y) representation.
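
In sketch form (std::polar/std::arg standing in for the hypothetical cheap interconversion):

Code:
#include <complex>

// Multiplying by a root of unity w = exp(i*theta) in polar form is one real
// add of angles; all the cost is in the two coordinate conversions, which is
// exactly the "digital magic" wished for above.
std::complex<double> twiddleMulPolar(std::complex<double> z, double theta) {
    return std::polar(std::abs(z), std::arg(z) + theta);
}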

Last fiddled with by ewmayer on 2020-07-05 at 22:58
2020-07-06, 00:23   #10
preda

Quote:
Originally Posted by ewmayer
[...] A specialized form of CMUL in which one operand is a root of unity would be really useful for convolutions - if there were some digital magic by which one could cheaply interconvert between Cartesian and polar form for a complex number, one could do such a twiddle mul by converting the 2 inputs to polar form, doing 1 real add of the 2 angles, then converting back to (x,y) representation.
Yes: hardware full-precision sincos(x*PI)!
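
(The point of a sincos(x*PI)-style primitive, in a rough sketch: the argument reduction becomes exact, since no rounded value of pi enters until after the reduction. M_PI here is the usual non-standard-but-ubiquitous <cmath> constant.)

Code:
#include <cmath>

// sin(M_PI * x) degrades for large |x|: M_PI is pi rounded to double, and
// that rounding error gets multiplied by x. A sinpi-style primitive can
// reduce x mod 2 first, which is exact for a binary fraction - no pi
// enters until after the reduction.
double sinpiNaive(double x)  { return std::sin(M_PI * x); }
double sinpiBetter(double x) { return std::sin(M_PI * std::fmod(x, 2.0)); }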

Do you have some links to your posts or threads discussing the instructions? (if not too much trouble)

Last fiddled with by preda on 2020-07-06 at 00:24
2020-07-06, 00:44   #11
R.D. Silverman

Quote:
Originally Posted by preda
Yes: hardware full-precision sincos(x*PI)!

Do you have some links to your posts or threads discussing the instructions? (if not too much trouble)
I would rather see a coprocessor board with many (at least 32) general-purpose 1024-bit registers. The board could do 512 x 512-bit integer multiplies and 1024/512-bit divides with remainder in just a few cycles; 1024-bit add/subtract took two cycles [one to handle the carries]. It had a small instruction set.

I was part of a team that designed such a board in the late 1980's when I was at MITRE. It could do a 512 x 512-bit Montgomery multiply in 7 cycles using Texas Instruments 32x32-bit signal-processing chips in parallel (with Karatsuba). Division with remainder took 10 cycles. The clock rate was slow [only 10 KHz] because the board was built with prototype wirewrap [proof of concept].

I expect that the register size could be increased to at least 2048 bits with modern hardware. The board was designed to do very fast public-key crypto operations. Its instruction set was small but included, e.g., register bit count as well as msb and lsb. It supported a full set of arithmetic and logical operations.

One would load up the registers and do all computations within the registers until the final answer was moved off board. It had a very small instruction cache: 4K.
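
For anyone unfamiliar with the operation: a one-word software sketch of the Montgomery multiply step (illustrative C++ using the GCC/Clang 128-bit extension; the board did the 512-bit analogue in parallel hardware):

Code:
#include <cstdint>

// One-word Montgomery REDC: computes a*b*R^{-1} mod n for R = 2^64,
// given n odd and ninv = -n^{-1} mod 2^64 (precomputed once per modulus).
uint64_t montMul(uint64_t a, uint64_t b, uint64_t n, uint64_t ninv) {
    unsigned __int128 t = (unsigned __int128)a * b;    // full 128-bit product
    uint64_t m = (uint64_t)t * ninv;                   // makes t + m*n divisible by R
    unsigned __int128 u = (t + (unsigned __int128)m * n) >> 64;  // exact shift by R
    return u >= n ? (uint64_t)(u - n) : (uint64_t)u;   // one conditional subtract
}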

Last fiddled with by R.D. Silverman on 2020-07-06 at 00:44 Reason: pagination