mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   DPUs on DDR: In-memory processing (https://www.mersenneforum.org/showthread.php?t=24711)

hansl 2019-08-21 04:50

DPUs on DDR: In-memory processing
 
I was actually just wondering the other day if any types of RAM modules existed that could perform operations on data, and it turns out this was recently announced:

[url]https://www.anandtech.com/show/14750/hot-chips-31-analysis-inmemory-processing-by-upmem[/url]

This is incredibly interesting to me. I would guess that it could provide phenomenal performance for sieving tasks.
I don't know enough about FFT multiplication etc to determine if it could help for LL/PRP type tasks though.
What other interesting mathy applications could you foresee these excelling at?

Will be exciting to see these when they come to market! According to the slides they should be available around Q4 2019 / Q1 2020.

LaurV 2019-08-21 05:48

wow! they have simulators for it! we could actually "test drive" those thingies and decide what we can do with them before they hit the market. we live in interesting times!

thanks for sharing that.

Uncwilly 2019-08-21 13:58

[QUOTE]If we go into all 16/18 chips, we can see that each 8GB module is going to be in the 19.2-21.6 watts.[/QUOTE]Will that be worth it to GIMPS folks?

hansl 2019-08-21 15:48

While the total power draw and heat dissipation are somewhat concerning for a single DIMM slot, the overall idea is that doing operations in memory should still consume much less total energy than moving that data from RAM to CPU and back.

They are claiming about 10x performance/watt.

kriesel 2019-08-21 16:46

[QUOTE=hansl;524153]While the total power draw and heat dissipation are somewhat concerning for a single DIMM slot, the overall idea is that doing operations in memory should still consume much less total energy than moving that data from RAM to CPU and back.

They are claiming about 10x performance/watt.[/QUOTE]I saw figures of 14-20x at the link given earlier. That might make DPUs roughly comparable to GPUs for TF.

aurashift 2019-08-21 17:57

I tried to find it in my bookmarks and couldn't, so I'm going to post before I forget. One of the NVMe association partners is doing the same sort of thing with storage: they're putting the intelligence/coprocessors in the NICs or SFPs so that the storage fabric can do the work.

kriesel 2019-08-21 18:45

[QUOTE=Uncwilly;524141]Will that be worth it to GIMPS folks?[/QUOTE]The 64MB per DPU boundary means a primality test or normal size P-1 run won't fit on one DPU.
If I recall correctly, there was reference in the linked article to "a clean 32bit ISA" meaning a different instruction set. It would be interesting to see what Ernst or George or others could accomplish with it. Both have prior experience with multi-core programming.
Sample pricing requested.
[URL]https://www.upmem.com/developer/[/URL]
[URL]https://github.com/upmem[/URL]
SDK: Linux only (no Windows) [URL]https://sdk.upmem.com/[/URL]

SDK User manual [URL]https://sdk.upmem.com/2019.2.0/[/URL]

NOTE: There are multiple indications this is not well suited to FFT multiplication using 64-bit floats, on which LL, PRP, and P-1 depend. It may also be slow at 32-bit integer multiplication, affecting TF. [URL]https://sdk.upmem.com/2019.2.0/fff_CodingTips.html#bits-variables-are-expensive[/URL][QUOTE]The DPU is a native 32-bit machine. 64-bit instructions are emulated by the toolchain and are usually more expensive than 32-bit ones. Typically, an addition is emulated by two or three instructions, so is twice or thrice as expensive.
As a consequence, 64-bit code is slower and requires more program memory than 32-bit code.
[/QUOTE]Also, [QUOTE][B]Multiplications and divisions of shorts and integers are expensive[/B]

Multiplications of 32-bit words rely on the UPMEM DPU instruction mul_step, implying an overcost of up to approximately 30 clock cycles per multiplication. The same applies to 32-bit division and remainder.
As a consequence, avoid these operations when not needed, preferring shifts and sums.[/QUOTE]and [QUOTE][B]Floating-point support[/B] Albeit understood natively by the compiler, floating point is emulated by software. As a consequence, floating-point operations are very slow and should be avoided.[/QUOTE]So maybe its niche is sieving? It may be too slow to do GCDs for CUDAPm1, gpuowl, etc.

The FPGA evaluation unit is specified at 200 MHz.

ewmayer 2019-08-21 19:52

[QUOTE=kriesel;524176]The 64MB per DPU boundary means a primality test or normal size P-1 run won't fit on one DPU.
If I recall correctly, there was reference in the linked article to "a clean 32bit ISA" meaning a different instruction set. It would be interesting to see what Ernst or George or others could accomplish with it. Both have prior experience with multi-core programming.
[/QUOTE]

64MB should be enough to fit the residue array and auxiliary FFT/DWT data tables for a current first-time LL test.

A plain C build (no SIMD) of Mlucas could serve as a basic "is this even close to being interesting?" test vehicle. Based on the just-about-all-the-key-ops-are-emulated data, I suspect LL testing will be out of the question speed-wise, but as always, actual data are preferable to surmises here.

Uncwilly 2019-08-21 19:59

[QUOTE=kriesel;524176]The 64MB per DPU boundary means a primality test or normal size P-1 run won't fit on one DPU.[/QUOTE]My question was about the power costs.

kriesel 2019-08-21 21:42

[QUOTE=ewmayer;524189]64MB should be enough to fit the residue array and auxiliary FFT/DWT data tables for a current first-time LL test.

A basic C build (no SIMD) of Mlucas could serve as a basic "is this even close to being interesting?" test vehicle. Based on the just-about-all-the-key-ops-are-emulated data I suspect LL testing will be out of the question, speed-wise, but as always, actual data are preferable to surmises here.[/QUOTE]Thanks for your input, Ernst. I was going by the 224MB working set of a 79M Prime95 PRP DC I have running, plus many past runs of larger P-1 that took multiple GB. What I see in CUDAPm1 is that 1GB or less forces smaller bounds than GPU72 wants, even for current-wavefront exponents and smaller.

A separate question is how it does for LL or PRP performance per kWh, as additional memory in a system one already has, compared to Mlucas on a cell phone. There's always plenty of DC to do.

ewmayer 2019-08-21 22:48

[QUOTE=kriesel;524215]Thanks for your input Ernst. I was going by the 224MB working set on a 79M prime95 PRP DC I have running, plus many past runs of larger P-1 that took multiple GB. What I see in CUDAPm1 is 1GB or less uses smaller bounds than GPU72 wants, for even current wavefront and smaller.[/QUOTE]

Yes, one may have the issue of the working set being larger than the obviously alloc'ed stuff in the program. I do know that Prime95 uses larger aux-data arrays than Mlucas; George has carefully coded things so as to stream those in and out as needed, with minimal conflict with the FFT-data I/Os. In my case, since I'm not targeting just one major lineage of CPUs, I chose to minimize the size of the FFT/DWT auxiliary-data arrays, so I do a lot of 2-tables-multiply stuff, trading a few FPU ops for the generic cache-friendliness of small O(sqrt(N))-sized aux-data tables. But even so, some code tweakage might be needed to get the working-set size to fit, since IIRC that includes more than just the explicitly code-allocated stuff, e.g. libraries.


All times are UTC.
