mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

2019-08-21, 04:50   #1
hansl

DPUs on DDR: In-memory processing

I was actually just wondering the other day whether any type of RAM module existed that could perform operations on data, and it turns out this was recently announced:

https://www.anandtech.com/show/14750...ssing-by-upmem

This is incredibly interesting to me. I would guess it could provide phenomenal performance for sieving tasks.
I don't know enough about FFT multiplication etc. to determine whether it could help with LL/PRP-type tasks, though.
What other interesting mathy applications could you foresee these excelling at?

It will be exciting to see these when they come to market! According to the slides, they should be available around Q4 2019 / Q1 2020.

2019-08-21, 05:48   #2
LaurV

Wow! They have simulators for it! We could actually "test drive" those thingies and decide what we can do with them before they hit the market. We live in interesting times!

Thanks for sharing that.

2019-08-21, 13:58   #3
Uncwilly

Quote:
If we go into all 16/18 chips, we can see that each 8 GB module is going to be in the 19.2-21.6 watt range.
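(That works out to roughly 1.2 W per DPU-bearing DRAM chip: 16 × 1.2 W = 19.2 W, 18 × 1.2 W = 21.6 W.)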
Will that be worth it to GIMPS folks?

2019-08-21, 15:48   #4
hansl

While the total power draw and heat dissipation are somewhat concerning for a single DIMM slot, the overall idea is that doing operations in memory should still consume much less total energy than moving that data from RAM to CPU and back.

They are claiming about 10x performance/watt.


2019-08-21, 16:46   #5
kriesel

Quote:
Originally Posted by hansl
While the total power draw and heat dissipation are somewhat concerning for a single DIMM slot, the overall idea is that doing operations in memory should still consume much less total energy than moving that data from RAM to CPU and back.

They are claiming about 10x performance/watt.
I saw figures of 14-20x at the link given earlier, which might make DPUs roughly comparable to GPUs for TF.

2019-08-21, 17:57   #6
aurashift

I tried finding it in my bookmarks and can't, so I'm going to post before I forget: one of the NVMe association partners is doing the same sort of thing with storage, putting the intelligence/coprocessors in the NICs or SFPs so that the storage fabric can do the work.

2019-08-21, 18:45   #7
kriesel

Quote:
Originally Posted by Uncwilly
Will that be worth it to GIMPS folks?
The 64 MB-per-DPU boundary means a primality test or normal-size P-1 run won't fit on one DPU.
If I recall correctly, the linked article referred to "a clean 32-bit ISA", meaning a different instruction set. It would be interesting to see what Ernst or George or others could accomplish with it; both have prior experience with multi-core programming.
I've requested sample pricing.
https://www.upmem.com/developer/
https://github.com/upmem
SDK (Linux only, no Windows): https://sdk.upmem.com/

SDK user manual: https://sdk.upmem.com/2019.2.0/

NOTE: There are multiple indications this is not well suited to FFT multiplication using 64-bit floats, on which LL, PRP, and P-1 depend. It may also be slow at 32-bit integer multiplication, affecting TF. https://sdk.upmem.com/2019.2.0/fff_C...-are-expensive
Quote:
The DPU is a native 32-bit machine. 64-bit instructions are emulated by the toolchain and are usually more expensive than 32-bit ones. Typically, an addition is emulated by 2 or 3 instructions, and so is two or three times as expensive.
As a consequence, 64-bit code is slower and requires more program memory than 32-bit code.
Also,
Quote:
Multiplications and divisions of shorts and integers are expensive

Multiplications of 32-bit words rely on the UPMEM DPU instruction mul_step, implying an extra cost of up to approximately 30 clock cycles per multiplication. The same applies to 32-bit division and remainder.
As a consequence, avoid these operations when not needed, preferring shifts and sums.
and
Quote:
Floating-point support: although understood natively by the compiler, floating-point types are emulated in software. As a consequence, floating-point operations are very slow and should be avoided.
So maybe its niche is sieving? It may be too slow to do the GCDs for CUDAPm1, gpuowl, etc.

The FPGA evaluation unit is specified at 200 MHz.
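To put the emulated 64-bit arithmetic and the ~30-cycle mul_step multiplies in perspective, here is a generic C sketch (illustrative only, not UPMEM SDK code) of how the 64x64 -> 128-bit products a TF kernel leans on decompose into 32x32-bit multiplies on a natively 32-bit machine:
Code:
/* Generic illustration, not UPMEM SDK code: a 64x64 -> 128-bit product
 * built from 32x32-bit multiplies, the way a natively 32-bit machine has
 * to do it.  If each 32-bit multiply costs ~30 cycles via mul_step, the
 * four partial products alone are on the order of 120 cycles, before the
 * (also emulated) 64-bit adds/shifts and any modular reduction. */
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

static u128 mul_64x64(uint64_t a, uint64_t b)
{
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    uint64_t p00 = (uint64_t)a0 * b0;   /* four 32x32 -> 64-bit partial products */
    uint64_t p01 = (uint64_t)a0 * b1;
    uint64_t p10 = (uint64_t)a1 * b0;
    uint64_t p11 = (uint64_t)a1 * b1;

    /* middle column plus the carry out of the low partial product */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    u128 r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
Contrast that with one or two hardware multiply instructions on a CPU or GPU, which is where the perf/watt comparison for TF would actually be decided.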


2019-08-21, 19:52   #8
ewmayer

Quote:
Originally Posted by kriesel
The 64MB per DPU boundary means a primality test or normal size P-1 run won't fit on one DPU.
If I recall correctly, there was reference in the linked article to "a clean 32bit ISA" meaning a different instruction set. It would be interesting to see what Ernst or George or others could accomplish with it. Both have prior experience with multi-core programming.
64 MB should be enough to fit the residue array and auxiliary FFT/DWT data tables for a current first-time LL test.

A plain C build (no SIMD) of Mlucas could serve as a basic "is this even close to being interesting?" test vehicle. Based on the just-about-all-the-key-ops-are-emulated data, I suspect LL testing will be out of the question speed-wise, but as always, actual data are preferable to surmises here.
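As a rough sanity check on the 64 MB claim, here's a back-of-envelope C sketch; the exponent and FFT length below are assumed typical values of mine, not figures from the article:
Code:
/* Back-of-envelope sketch; the exponent and FFT length are assumed
 * typical first-time-test values, not figures from the article. */
#include <stdio.h>

int main(void)
{
    const double p = 90.0e6;   /* assumed first-time-test exponent */
    const double n = 5.0e6;    /* assumed FFT length, in double-precision words */

    printf("bits per FFT word : %.1f\n", p / n);                        /* ~18  */
    printf("residue array     : %.1f MiB\n", n * 8.0 / (1024 * 1024));  /* ~38  */
    return 0;
}
With O(sqrt(N))-sized aux tables on top of that, the main data should still sit well under 64 MB.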

2019-08-21, 19:59   #9
Uncwilly

Quote:
Originally Posted by kriesel
The 64MB per DPU boundary means a primality test or normal size P-1 run won't fit on one DPU.
My question was about the power costs.

2019-08-21, 21:42   #10
kriesel

Quote:
Originally Posted by ewmayer
64MB should be enough to fit the residue array and auxiliary FFT/DWT data tables for a current first-time LL test.

A basic C build (no SIMD) of Mlucas could serve as a basic "is this even close to being interesting?" test vehicle. Based on the just-about-all-the-key-ops-are-emulated data I suspect LL testing will be out of the question, speed-wise, but as always, actual data are preferable to surmises here.
Thanks for your input, Ernst. I was going by the 224 MB working set on a 79M prime95 PRP DC I have running, plus many past runs of larger P-1 that took multiple GB. What I see in CUDAPm1 is that 1 GB or less forces smaller bounds than GPU72 wants, even for current-wavefront exponents and smaller.

A separate question is how it would do for LL or PRP performance per kWh, as additional memory in a system one already has, compared to Mlucas on a cell phone. There's always plenty of DC to do.


2019-08-21, 22:48   #11
ewmayer

Quote:
Originally Posted by kriesel
Thanks for your input Ernst. I was going by the 224MB working set on a 79M prime95 PRP DC I have running, plus many past runs of larger P-1 that took multiple GB. What I see in CUDAPm1 is 1GB or less uses smaller bounds than GPU72 wants, for even current wavefront and smaller.
Yes, one may have the issue of the working set being larger than the obvious allocated data in the program. I do know that Prime95 uses larger aux-data arrays than Mlucas; George has carefully coded things so as to stream those in and out as needed with minimal conflict with the FFT-data I/Os. In my case, since I'm not targeting just one major lineage of CPUs, I chose to minimize the size of the FFT/DWT auxiliary-data arrays, so I do a lot of 2-tables-multiply stuff, trading a few FPU ops for the generic cache-friendliness of small O(sqrt(N))-sized aux-data tables. But even so, some code tweakage might be needed to get the working-set size to fit, since IIRC that includes more than just the explicitly code-allocated stuff, e.g. libraries.
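For anyone who hasn't seen the 2-tables trick, a minimal sketch of the idea (illustrative only, not actual Mlucas code): store two tables of size ~sqrt(N) and rebuild each root/weight on the fly with one extra complex multiply, instead of storing all N of them.
Code:
/* Minimal sketch of the 2-tables-multiply idea; names and layout are
 * illustrative, not taken from Mlucas.  root(j) = exp(2*pi*i*j/N) is
 * reconstructed as hi[j/B] * lo[j%B] with B ~ sqrt(N), so memory drops
 * from O(N) to O(sqrt(N)) at the cost of one complex multiply per access. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

typedef struct {
    int B;                 /* block size, ~sqrt(N) */
    double complex *hi;    /* hi[k] = exp(2*pi*i*k*B/N) */
    double complex *lo;    /* lo[k] = exp(2*pi*i*k/N)   */
} root_tables;

static root_tables build_tables(int N)
{
    const double two_pi = 8.0 * atan(1.0);
    root_tables t;
    t.B  = (int)ceil(sqrt((double)N));
    t.hi = malloc((size_t)t.B * sizeof *t.hi);
    t.lo = malloc((size_t)t.B * sizeof *t.lo);
    for (int k = 0; k < t.B; k++) {
        t.hi[k] = cexp(two_pi * I * ((double)k * t.B / N));
        t.lo[k] = cexp(two_pi * I * ((double)k / N));
    }
    return t;
}

static double complex root_of_unity(const root_tables *t, int j)
{
    /* exp(2*pi*i*(q*B + r)/N) = hi[q] * lo[r]: one extra complex mul per use */
    return t->hi[j / t->B] * t->lo[j % t->B];
}
Hypothetically, the transform inner loops call root_of_unity(&t, j) instead of indexing one big table of length N, which is the FPU-ops-for-memory trade described above.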
