
jasonp 2009-11-15 23:16

The prime-crunching on dedicated hardware FAQ (II)
 
[B]Q: What is this?[/B]

[URL="http://mersenneforum.org/showthread.php?t=10275"]The original [/URL]prime-crunching-on-dedicated-hardware-FAQ was written in the middle of 2008, and the state of the art in high-performance graphics cards appears to be advancing at a rate that is exceeding Moore's law. At the same time, the libraries for running non-graphics code on one of those things have advanced to the point where there's a fairly large community of programmers involved, both working and playing with Nvidia cards (see [URL="http://forums.nvidia.com"]here[/URL]). Hell, [URL="http://mersenneforum.org/showthread.php?t=12562"]even I'm doing it[/URL]. So we're going to need a few modifications to the statements made in the original FAQ about where things are going.

[B]Q: So I can look for prime numbers on a GPU now?[/B]

Indeed you can. There is [URL="http://mersenneforum.org/showthread.php?t=12576"]an active development effort underway[/URL] to modify one of the standard Lucas-Lehmer test tools to use the FFT code made available by Nvidia in their cufft library. If you have a latter-day card, e.g. [URL="http://www.nvidia.com/object/product_geforce_gtx_260_us.html"]a GTX 260[/URL] or better, then you can do double-precision floating-point arithmetic in hardware at 1/8 the rate the card achieves in single precision. Even that card has so much floating-point firepower that it can manage respectable performance despite the handicap.
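
For readers who haven't seen it, here is a minimal sketch of the Lucas-Lehmer iteration itself, in Python with naive bignum squaring. This is illustrative only; Prime95 and the CUDA port above replace the squaring with FFT-based multiplication modulo 2^p - 1, which is where all the floating-point work comes from.

[code]
# Minimal Lucas-Lehmer test for the Mersenne number 2^p - 1 (p an odd prime).
# Production code replaces the naive squaring below with FFT-based
# multiplication; the iteration itself is the same.
def lucas_lehmer(p):
    m = (1 << p) - 1           # the Mersenne number 2^p - 1
    s = 4
    for _ in range(p - 2):     # p - 2 squarings: s <- s^2 - 2 (mod m)
        s = (s * s - 2) % m
    return s == 0              # 2^p - 1 is prime iff the final residue is 0

# 2^11 - 1 = 23 * 89 is composite; the others below give Mersenne primes.
print([p for p in (3, 5, 7, 11, 13, 17, 19, 31) if lucas_lehmer(p)])
[/code]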

[B]Q: So how fast does it go?[/B]

It's a work in progress, but with a top-of-the-line card the current speed seems to be around what one core of a high-end PC can achieve.

[B]Q: That result is not very exciting. What about the next generation of high-end hardware from Nvidia?[/B]

The next generation of GPU from Nvidia promises much better double-precision performance (whitepaper [URL="http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf"]here[/URL]). Fermi will be quite interesting from another viewpoint: 32-bit integer multiplication looks like it will be a very fast operation on that architecture, which makes integer FFTs with respectable performance a possibility.
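
As a concrete illustration of what an integer FFT is: a number-theoretic transform (NTT) replaces the complex roots of unity of an ordinary FFT with roots of unity modulo a prime, so every butterfly becomes a modular integer multiply and the convolution is exact, with no rounding error to manage. Below is a minimal Python sketch over the well-known NTT prime 119*2^23 + 1; it shows only the structure of the transform and says nothing about how actual Fermi-era GPU code would be organized.

[code]
# Minimal radix-2 number-theoretic transform over Q = 119 * 2^23 + 1,
# a prime whose multiplicative group contains 2^23rd roots of unity
# (3 is a primitive root). All arithmetic is exact, mod Q.
Q = 998244353
G = 3

def ntt(a, invert=False):
    n = len(a)                     # must be a power of two dividing 2^23
    if n == 1:
        return a
    w_n = pow(G, (Q - 1) // n, Q)  # principal n-th root of unity mod Q
    if invert:
        w_n = pow(w_n, Q - 2, Q)   # inverse root for the inverse transform
    even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q
        out[k + n // 2] = (even[k] - t) % Q
        w = w * w_n % Q
    return out

def multiply(a, b):
    """Exact polynomial (or bignum-limb) product via NTT convolution."""
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    fa, fb = ntt(a + [0] * (n - len(a))), ntt(b + [0] * (n - len(b)))
    prod = ntt([x * y % Q for x, y in zip(fa, fb)], invert=True)
    inv_n = pow(n, Q - 2, Q)       # divide by n to finish the inverse NTT
    return [x * inv_n % Q for x in prod]

# (1+2x+3x^2)(4+5x) -> coefficients [4, 13, 22, 15], zero-padded to length 8
print(multiply([1, 2, 3], [4, 5]))
[/code]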

[B]Q: Does this mean you'll stop being a naysayer on this subject?[/B]

If you read the first few followup posts to the original FAQ, or read into my tone in this one, you can tell I'm somewhat skeptical that the overall progress of any prime-finding project stands to benefit from porting the computational software to a graphics card. Much of the interest in doing so stems from three observations:

- Other projects benefit greatly from GPUs, far out of proportion to the number of GPUs that they use

- Most otherwise-idle PCs also have an otherwise-idle graphics card, so using it would amount to essentially a 'free lunch'

- If a super-powered-by-GPU version of the code existed, buying a super-powered card would make [I]your individual contribution[/I] more valuable

In the case of projects like GIMPS, what 'other projects do' is immaterial. It's not that other projects have smart programmers and we don't; it's that their hardware needs are different. Further, GPU code is only a free lunch as long as resources are not diverted away from the mainstream project to produce it. As long as somebody volunteers to do the work, there's no harm in trying. But in all the years the crunch-on-special-hardware argument has raged, only in the last few months have GPU programming environments stabilized to the point where someone actually stepped forward to do so.

As to your individual contribution, unless you have a big cluster of your own (thousands of machines) to play with, no amount of dedicated hardware is going to change the fact that 1000 strangers running Prime95 in the background are going to contribute more than you can ever hope to. If it were otherwise, it wouldn't be distributed computing.

So, long story short, I'm still a buzzkill on this subject.

msft 2009-11-16 02:41

Hi, jasonp

Thank you for everything,

lfm 2009-11-16 03:44

[QUOTE=jasonp;195982]
[b]Q: So I can look for prime numbers on a GPU now?[/b]

Indeed you can. There is [url="http://mersenneforum.org/showthread.php?t=12576"]an active development effort underway[/url] to modify one of the standard Lucas-Lehmer test tools to use the FFT code made available by Nvidia in their cufft library. If you have a latter-day card, e.g. [url="http://www.nvidia.com/object/product_geforce_gtx_260_us.html"]a GTX 260[/url] or better, then you can do double-precision floating-point arithmetic in hardware at 1/8 the rate the card achieves in single precision. Even that card has so much floating-point firepower that it can manage respectable performance despite the handicap.
[/QUOTE]

Note that the GTX 260M and 280M (mostly for laptops; the M is important) do not have double precision and are NOT supported.

__HRB__ 2009-11-18 05:35

[QUOTE=xkey;196257]I'd really like to see a few 8192 (or bigger) bit registers in the upcoming incarnations from Intel/Amd/IBM. I know Intel is slowly headed there with AVX, but not fast enough for some problems I need solved in a quad or octo chip box.[/QUOTE]

What kind of problems are those?

I think large registers with complex instructions are a mistake. Top level algorithms can only be chunked into power-of-two sub-problems until you hit register size. Need 17-bit integers? Waste ~50%. Need floats with 12-bit exponent and 12-bit mantissa? Waste ~50%.

Therefore, I'd rather have something really simple, say a 64x64 bit-matrix (with instructions to read/write from rows or columns), an accumulator and maybe two other registers like on the 6502 (now, that was fun programming!), a 4096-bit instruction cache and 2 cycle 8-bit instructions (Load/Store + 8 logical + 8 permute instructions + control flow), so that neighboring units can conflict-free peek/poke each other one clock out of sync.

Then put 16K of these on a single chip, with a couple of F21s sitting on the edges for I/O control.

nucleon 2010-05-29 01:12

[QUOTE=jasonp;195982]
[b]Q: So how fast does it go?[/b]

It's a work in progress, but with a top-of-the-line card the current speed seems to be around what one core of a high-end PC can achieve.

[/QUOTE]

Umm, this needs to be updated.


Taking some figures from the forum (and my own machine, a Core i7 930), I get the following measurements:

[code]
                               2048K FFT   4096K FFT
                               sec/iter    sec/iter
PS3                            0.084       0.194
GTX260                         0.0106      0.0218
Core i7 930 (single core)      0.0363      0.0799
GTX480 (CUDA 3.0)              0.00547     0.0104
Dual-socket hex-core 3.33GHz   0.00470     --
GTX480 (CUDA 3.1)              0.00466     0.00937
[/code]

A top-of-the-line video card appears to be roughly 8 times a single core (depending on which video card/CPU combination is compared).

Or you could say, a single GTX480 is similar to using the full CPU cycles of a dual-processor hex-core 3.33GHz Xeon.
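
A quick sanity check of that "roughly 8 times" figure, computed from the 2048K timings above:

[code]
# Speedups relative to a single i7 930 core, from the 2048K FFT
# timings above (sec/iter; lower is better).
timings = {
    "Core i7 930 (1 core)":      0.0363,
    "GTX260":                    0.0106,
    "GTX480 (CUDA 3.0)":         0.00547,
    "GTX480 (CUDA 3.1)":         0.00466,
    "2x hex-core 3.33GHz Xeon":  0.00470,
}
base = timings["Core i7 930 (1 core)"]
for name, secs in timings.items():
    print(f"{name:26s} {base / secs:5.2f}x")
# GTX480 (CUDA 3.1) comes out at ~7.8x, essentially tying the 12-core Xeon box.
[/code]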

I'm extremely positive this is only going to get better.

-- Craig

henryzz 2010-06-01 19:13

Do any of those figures for GPUs use any CPU cycles as well?

frmky 2010-06-01 22:58

Very little. About 2%.

Commaster 2010-06-10 02:29

AMD Radeon GPU
 
Speaking of GPU crunching... has anybody forgotten AMD?
According to [URL="http://www.geeks3d.com/public/jegx/200910/amd_opencl_supported_devices.jpg"]this[/URL], new cards do support the required DP-FP.

ewmayer 2010-08-09 20:15

I am slated to get a new laptop at work by year's end ... it would be cool if it offered the possibility of doing some GPU code-dev work on the side. The 2 different GPUs on offer are

512 MB NVidia NVS 3100M

512MB NVidia Quadro FX 1800M

Note that the latter is only available in the "ultra high performance" notebook, which requires justification and manager approval.

Here are the minimal requirements for me to spend any nontrivial time reading-docs/installing-shit/playing-with-code, along with some questions:

1. The software development environment (SWDE) needs to be somewhat portable, in the sense that I don't want to write a whole bunch of GPU-oriented code which then only works on one model or family of GPUs.

[b]Q: Does this mean OpenCL?[/b]

2. All systems run Windows 7 Professional edition. Is OpenCL available there? If so, is it integrated with Visual Studio, with a Linux-emulation environment (e.g. Wine), or what?

3. The SWDE must support double-precision floating-point arithmetic, even if the GPU hardware does not. If DP support is via emulation, I need to have reasonable assurance that the timings of code run this way at least correlate decently well with true GPU-hardware-double performance.

4. It is preferred that the GPU have DP support - how do the above 2 GPUs compare in this regard?

And yes, I *know* "all this information is available" out there on the web somewhere, but I've found no convenient FAQ-style page on the nVidia website which answers more than one of the above questions, so I figured I'd ask the experts. I simply do not have time to read multiple hundred-page PDFs in order to try to glean answers to a few basic questions.

Thanks in advance,
-Ernst

frmky 2010-08-09 22:27

[QUOTE=ewmayer;224634]
512 MB NVidia NVS 3100M
512MB NVidia Quadro FX 1800M
[/QUOTE]

According to the [URL="http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf"]CUDA programmer's guide[/URL], Appendix A, these are both Compute Capability 1.2 devices, which per Appendix G means they do not support double precision. To address your questions...

1. Non-DP CUDA code will work on any current nVidia device. The goal of OpenCL is that the same code can be compiled for multiple devices, but for now it will usually require some tweaking.

2. nVidia releases SDKs for both Windows (Visual Studio) and Linux.

3. All DP calculations are demoted to SP for Compute 1.2 devices and below. New to version 3 of the CUDA toolkit is the elimination of DP emulation in software. This is a good thing as it wasn't very reliable anyway and it did not give realistic timings.

4. As above, no and no. Neither card, nor any of their mobile cards for that matter, can be used to develop DP code. For (relatively) inexpensive DP code development, pick up a GTX 460 or GTX 465. The GTX 460 is faster for DP code and less expensive, but generates more heat. The GTX 465 is faster for SP code.
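
If it's useful, here is a minimal sketch of checking this programmatically. It assumes PyCUDA is installed (the underlying CUDA runtime call is cudaGetDeviceProperties) and simply flags whether each device clears the compute 1.3 bar for hardware double precision:

[code]
# List CUDA devices and flag hardware double-precision support
# (compute capability 1.3 or higher). Assumes PyCUDA is installed.
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    major, minor = dev.compute_capability()
    dp = "yes" if (major, minor) >= (1, 3) else "no (DP demoted to SP)"
    print(f"{dev.name()}: compute {major}.{minor}, hardware DP: {dp}")
[/code]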

ewmayer 2010-08-09 23:09

[QUOTE=frmky;224667]4. As above, no and no. Neither card, nor any of their mobile cards for that matter, can be used to develop DP code. For (relatively) inexpensive DP code development, pick up a GTX 460 or GTX 465. The GTX 460 is faster for DP code and less expensive, but generates more heat. The GTX 465 is faster for SP code.[/QUOTE]

Many thanks, frmky. I've been told that one needs a full-sized case system with sufficient-wattage power supply to run one of these high-end cards, but I wonder: can one also get one in a standalone external format? I have 2 laptops at home: A 3-year-old Thinkpad running XP with VS2005 installed and a 1-year-old MacBook running Linux 4.2 ... I would be happy to get one of the above cards in an external add-on format, but really don't fancy the idea of buying a full-sized PC system anymore ... trying to keep the amount of compute hardware in my home restricted to a small footprint.

nucleon 2010-08-10 00:16

I think what you're after is a pci-express docking bay.

[url]http://www.pcworld.com/article/128401/external_pciexpress_graphics_for_laptops.html[/url]

It's an article from 2007, so I'm not sure whether anything like it exists today, or whether the device in the article is suitable for this.

I've never tried this personally, so your mileage may vary. I'm a fan of self-built desktops.

-- Craig

Mathew 2010-08-10 04:05

ewmayer,

This is what I have found

[URL="http://www.magma.com/expressbox1.asp"]PCI express enclosure[/URL] Here is the youtube video of it [URL="http://www.youtube.com/watch?v=yIp3gHmAeGc&feature=related"]Expressbox1 video[/URL]

[URL="http://www.amperordirect.com/pc/c-audio-video/audiovideo-ViDock_2_5670.html"]ATI package[/URL] This company creates just for ATI currently. Here is the youtube of it [URL="http://www.youtube.com/watch?v=qE5NmURV5P8&feature=related"]ViDock2 video[/URL]

Hopefully this helps

Mathew

frmky 2010-08-10 05:26

[QUOTE=Mathew Steine;224699]
[URL="http://www.magma.com/expressbox1.asp"]PCI express enclosure[/URL] Here is the youtube video of it [URL="http://www.youtube.com/watch?v=yIp3gHmAeGc&feature=related"]Expressbox1 video[/URL][/QUOTE]
Cool, but wow is it expensive! You can put together a full desktop for that price!

ewmayer 2010-08-10 18:56

Thanks for all the links and replies!

OK, so it sounds like there's no reasonable choice but to get a full-sized desktop-style or pedestal system. Given that, next set of questions:

1. Is it cheaper to buy a discount desktop system sans GPU and install a GTX 46* oneself, or are there good reasons to go with a unit having the GPU preinstalled? (If one can get the latter for not terribly more $ than the former, I'd be happy to do that, as my limited free time is precious to me.)

2. What is the most compact format one can get the complete PC+GPU system in? Are there compact rack-mount formats one should consider?

3. I'd like to be able to build and test any GPU code under both Windows and Linux ... should I get a Win7 system (I'd also have to buy and install Visual Studio on that) and do a separate Linux install, or is there a virtual-desktop solution which is simpler to manage in this regard?

Thanks,
-Ernst

xilman 2010-08-11 07:48

[quote=ewmayer;224782]Thanks for all the links and replies!

OK, so it sounds like there's no reasonable choice but to get a full-sized desktop-style or pedestal system. Given that, next set of questions:

1. Is it cheaper to buy a discount desktop system sans GPU and install a GTX 46* oneself, or are there good reasons to go with a unit having the GPU preinstalled? (If one can get the latter for not terribly more $ than the former, I'd be happy to do that, as my limited free time is precious to me.)[/quote]I did exactly that --- purchased an all-in-one system.

Something to take into account: a beefy GPU will suck a lot of power when under full load. I suggest upgrading the stock power supply unless it's unusually well specified.


Paul

ewmayer 2010-08-11 22:10

[QUOTE=xilman;224869]I did exactly that --- purchased an all-in-one system.

Something to take into account: a beefy GPU will suck a lot of power when under full load. I suggest upgrading the stock power supply unless it's unusually well specified.

Paul[/QUOTE]

So would you recommend a system designed specifically for gamers, or just one with a beefy (400W-plus) power supply?

Since I'm mainly interested in code-dev (as opposed to 24/7 crunching), any chance of renting (we could negotiate a suitable incentives package of furs, guns and liquor offline) a guest account on your system?

xilman 2010-08-12 06:55

[quote=ewmayer;224995]So would you recommend a system designed specifically for gamers, or just one with a beefy (400W-plus) power supply?

Since I'm mainly interested in code-dev (as opposed to 24/7 crunching), any chance of renting (we could negotiate a suitable incentives package of furs, guns and liquor offline) a guest account on your system?[/quote]I took a gaming system as a base and tweaked it into a HPC machine. It has a 950W PSU which, it is safe to say, is comfortably 400W-plus.

Paul

lavalamp 2010-08-14 10:38

[QUOTE=ewmayer;224995]So would you recommend a system designed specifically for gamers, or just one with a beefy (400W-plus) power supply?[/QUOTE]If you get a system with a top-of-the-line graphics card, you can pretty much expect it to draw a full 300W all by itself under load. CPUs also frequently top 100W.

Obviously not all systems will suck the same juice, so you'll have to do a little googling on power draw for the various bits inside that you choose, but if you don't really feel like doing that, you can always go for overkill and get a 1000W PSU.
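
A trivial worked example of that budgeting, with the 300W GPU and 100W CPU figures taken from above and the remaining component draw as a rough placeholder:

[code]
# Rough PSU-sizing arithmetic. The GPU and CPU figures come from the
# discussion above; the "everything else" line is a placeholder guess.
draws_watts = {
    "GPU under load": 300,
    "CPU under load": 100,
    "motherboard/RAM/disks/fans": 75,
}
total = sum(draws_watts.values())
print(f"Estimated load: {total}W")
print(f"With ~40% headroom, pick a PSU of at least {int(total * 1.4)}W")
[/code]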

[QUOTE=ewmayer;224782]1. Is it cheaper to buy a discount desktop system sans GPU and install a GTX 46* oneself, or are there good reasons to go with a unit having the GPU preinstalled? (If one can get the latter for not terribly more $ than the former, I'd be happy to do that, as my limited free time is precious to me.)[/QUOTE]Don't know about the cost, depends where you buy I guess, but the ease of installing a card is ... well, easy. Just make sure that the power supply has the right connections (6 pin and/or 8 pin PCIe power connectors) before you buy, and be aware that high end cards will normally require TWO power connections.

[QUOTE=ewmayer;224782]2. What is the most compact format one can get the complete PC+GPU system in? Are there compact rack-mount formats one should consider?[/quote]That will probably be micro-ATX (sometimes written μATX or uATX). It's basically the same as ATX, but with three of the expansion slots missing off the bottom (so it's shorter). You can get smaller cases for these, but they generally aren't designed for cooling above all else, which is probably what you want if you're loading it up with power hungry bits of kit. μATX boards also tend to be a little bit ... crap, they're not designed for performance, so you tend not to get it, though there are [url=http://www.scan.co.uk/Products/Asus-ROG-Rampage-II-GEGE-uATX-x58-Mobo]rare exceptions[/url] of course.

Technically, you COULD get a mini-ITX board or similar (these are tiny). You can get mini-ITX boards that support i7 CPUs and have just one expansion slot on them, but they are made for low power and any half decent graphics card would be bigger than the motherboard. It'd be kind of a ridiculous combination.

Here are some mini-ITX i7 boards:
[url]http://www.mini-itx.com/store/?c=68[/url]

ewmayer 2010-08-16 16:16

Thanks, lavalamp - Paul's gonna set me up with an account on his system once he builds it. That's ideal (at least for starters) since I'm interested in coding and building, not using the hardware for gaming or full-time crunching. (If I lived in a colder climate and needed a space heater, I might be more open to buying the hardware).

Will keep y'all posted on progress once we're up and running ... will likely warm up on the coding front with some trial-factoring code.

Christenson 2011-09-03 04:25

[url]http://arstechnica.com/hardware/news/2011/08/ibms-new-transactional-memory-make-or-break-time-for-multithreaded-revolution.ars[/url]

Do you suppose P95 could get some time on Blue Gene to test how well transactional memory works with P95?

This is a somewhat fatuous article on transactional memory. I'm not sure why the software overhead is so high; it seems to me that it is just a somewhat different implementation of "lock" semantics.

I think you could make it work if you simply guaranteed that every process asking for and using write access got a different value of a locking word...16 bits would suffice unless you are on a massive supercomputer with more than 2^16 threads!

Dubslow 2011-09-17 06:38

Man...
Just spent the last hour wikipedia'ing about supercomputers...
Even 150th place would double the entire GIMPS throughput... ([url]http://www.top500.org/list/2011/06/200[/url])

Dubslow 2011-09-17 06:57

[url]http://www.supermicro.com/GPU/[/url]

Christenson 2011-09-17 18:55

[QUOTE=Dubslow;271889][url]http://www.supermicro.com/GPU/[/url][/QUOTE]

Interesting stuff... but I suspect that's out of "average" consumer range, and therefore Jason's original argument applies.

GPUs now do LLs and TFs, and do TF extremely efficiently compared to CPUs, and they suddenly aren't "Dedicated Hardware" anymore. See mfaktc, mfakto, and CUDALucas. They also aren't yet integrated into P95.

When we think of "Dedicated Hardware" at this point, we need something more than a collection of GPUs, like maybe an FPGA pipeline.

jasonp 2011-09-18 16:14

[QUOTE=Dubslow;271888]Man...
Just spent the last hour wikipedia'ing about supercomputers...
Even 150th place would double the entire GIMPS throughput... ([url]http://www.top500.org/list/2011/06/200[/url])[/QUOTE]
Yes, this is the reason I hopped off the ever-faster-main-home-system treadmill around 2007. Greg and Ilya have run NFS linear algebra on parallel systems using 256 up to over 900 MPI processes, in one case reducing computations that would have taken a year down to 2.5 days. [i]Nothing[/i] you can do to a home system would let you duplicate that kind of performance. On the one hand, Teragrid is a rationed resource, but on the other hand if you can find someone in government or academia to write a mini-proposal on your behalf then they'll give you 50000 CPU hours essentially no questions asked.

Christenson 2011-09-18 19:52

The question in my mind is, supposing I want to do better (more GHz-days per day, or fewer joules/GHz day) than my reasonably fast multi-core home system with a GTX560 GPU, what's the minimum hardware investment and amount of power supply that would be needed? Aren't supercomputers on a treadmill too? These "server" systems just don't seem to be an economic way to go, but maybe I'm looking in the wrong corner of the market.

Incidentally, user BDodson is in academia (I was acquainted with him personally about 15 years ago)...and GIMPS is gathering 50,000 CPU hours on a daily basis from random users such as myself.

jasonp 2011-09-19 13:35

Right now you and I have different objective functions to optimize, and you have more choices because the work you want done can proceed perfectly in parallel.

Christenson 2011-09-20 01:23

[QUOTE=jasonp;272041]Right now you and I have different objective functions to optimize, and you have more choices because the work you want done can proceed perfectly in parallel.[/QUOTE]

I was thinking more in terms of your objective functions than my (somewhat naive) ones. I imagine you want to finish that last, linear algebra step on various GNFS jobs... requiring both massive computation and massive communication... I have not forgotten.

The question was, if I wanted to make a dent in those big problems, how much $$ would I need to spend on what kind of hardware, and how much electricity would I use running it? The big clusters (Lomonosov, teragrid) are now a few years old, so should cost a bit less to begin to duplicate....

fivemack 2011-09-20 09:20

The problem is that the big clusters are actually big; Lomonosov is twenty racks full of 2010-era servers, and the procurement process began with the relatively rich Russian state wanting a big computer for its most prestigious university.

The big supercomputer facilities are quite close to 'spec up a reasonably powerful server and buy {one, ten} thousand', and so cost one to ten thousand times more than a basement server. They're fundamentally not susceptible of imitation, in the same sort of way that space programs aren't.

You could start sticking quad-Opteron 6168 boxes together with QDR Infiniband in your basement, but at the price of a second-hand decent car for each box and for the Infiniband switch, that's quite a good way to spend a very large amount of your money on a resource which will be almost surely under-used.

Christenson 2011-09-20 12:47

Got to start with that "large amount of $$" part...I haven't got it....I was thinking that if I had the computer, jasonp would have little trouble putting it to 100% work, if I was willing to pay for the electricity.

I was wondering how well funded GIMPS would need to be to afford even a shadow of one of those clusters.....it took no time for the forum to come up with $500 when it was needed.

jasonp 2011-09-20 12:50

GIMPS is already a 501(c)(3) charity, and the last mersenneforum drive collected more in donations than GIMPS has ever made. I don't know if that can be changed.

You occasionally find older Cray machines for auction on eBay, but the specs on an older Cray machine are laughable compared to a hot modern server, plus the big ones cost a million dollars a year just to keep turned on and healthy.

TACC recently upgraded the Lonestar cluster to very powerful 12-core nodes, and Teragrid has added many existing clusters from universities in just the last 6 months or so.

Xyzzy 2011-09-20 13:03

[QUOTE]…it took no time for the forum to come up with $500 when it was needed.[/QUOTE][url]http://www.mersenneforum.org/showthread.php?t=964[/url]

jasonp 2011-09-20 13:31

[QUOTE=Xyzzy;272138][url]http://www.mersenneforum.org/showthread.php?t=964[/url][/QUOTE]
What a great thread. Is that Salem guy still protesting by turning off more computers?

Christenson 2011-09-20 21:41

[QUOTE=jasonp;272140]What a great thread. Is that Salem guy still protesting by turning off more computers?[/QUOTE]

If he is, I'll have to protest by turning another, new one on!

Dubslow 2011-09-21 06:20

I don't know about any super computers, but 515 GFLOPS peak Tesla cards run about $1500 a pop.
[url]http://compeve.com/video-cards/pci-express/nvidia-tesla-m2050-3gb-server-gpu-video-card-sh885b-633246-001[/url] Server card
[url]http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=6391103[/url] Workstation card

Stats
[url]http://en.wikipedia.org/wiki/Nvidia_Tesla[/url]

They both run on a PCIe x16 2.0 slot. Now, $1500 for a 1% performance increase isn't bad (assuming GIMPS throughput is 50 TFLOPS, which admittedly is a bit low)

xilman 2011-09-22 09:48

I have recently been looking at low-power computation and have purchased an [URL="http://mbed.org/handbook/Homepage"]mbed[/URL] system to play with. It is a 96MHz ARM processor, memory, a micro-USB socket, some LEDs and a reset button, all built into a standard 40-pin DIP package. The whole thing takes 100mW. It's also easy to program in C and/or C++. Roughly speaking, the compute performance is comparable with a good PC of around 1995 vintage.

I then started seeing what else is available in the ARM range and came across these beasties: [URL="http://www.ti.com/product/am3703"]TI am3703[/URL]. If I read it correctly, you get a 1GHz 32-bit cpu for around USD15 and it draws about 1W. They come in a 15mm square package.

It seems to me that a PCI board could easily hold 16 of them, a significant amount of memory and any necessary glue, perhaps including ethernet and/or USB and/or another ARM for system control. What you would then have is a snazzy little system for learning real parallel computing because the interconnect and topology could be entirely under software control. Computational power should be useful but not astounding --- comparable with a 4GHz 4-core x86 processor perhaps.

What would be much more interesting would be to build boards with 100 or 128 of them on each side ...

Comments?

Paul

Christenson 2011-09-22 12:18

Let's see... you are saying 16W to match a 4-core x86 CPU that probably pulls 100W. At least approximately... very interesting... I like the power efficiency. PCs aren't terribly energy efficient in lots of ways. All of that branch prediction circuitry and speculative execution must use some power!

Have to say I think the on-board memory and interconnect is going to be the biggest issue. You'll need memory to run FFTs for LL tests, but quite a bit of interconnect to run factoring algorithms.

How do you think your card will do compared to a processor on Blue Gene, an NVIDIA GTX 560 Ti, or your other favorite supercomputer, in terms of J/GFlop or J/GHz-day?

xilman 2011-09-22 12:30

[QUOTE=Christenson;272381]Let's see... you are saying 16W to match a 4-core x86 CPU that probably pulls 100W. At least approximately... very interesting... I like the power efficiency. PCs aren't terribly energy efficient in lots of ways. All of that branch prediction circuitry and speculative execution must use some power!

Have to say I think the on-board memory and interconnect is going to be the biggest issue. You'll need memory to run FFTs for LL tests, but quite a bit of interconnect to run factoring algorithms.

How do you think your card will do compared to a processor on Blue Gene, an NVIDIA GTX 560 Ti, or your other favorite supercomputer, in terms of J/GFlop or J/GHz-day?[/QUOTE]It's the low power aspect that first attracted me.

Interconnect shouldn't be too hard --- the ARM chips have any number of I/O pins under software control.

My guess is that each ARM will be comparable to each processor in a GPU in compute performance. A GPU has hundreds of them for a power budget of 0.3W each, say, so will outperform one of these cards many times over. OTOH, the programming model of a GPU is heavily constrained and it's very hard to get sustained compute performance.

The real attraction of the idea, from my point of view, is that it might be an ideal educational tool for developing parallel algorithms and designing parallel computers.


Paul

jasonp 2011-09-22 13:05

Years ago some researchers put 8x100MHz StrongARM processors on a 33MHz PCI board, ostensibly for neural net related computations. I think their board got to the prototype stage with a few copies made, and the total power draw was well short of the PCI limit (25W).

Nowadays high-performance integer-only embedded processors run at 1+GHz with very low power, and come with lots of onboard cache and interfaces to high-performance DRAM (check out the Cortex family in [url="http://en.wikipedia.org/wiki/List_of_ARM_microprocessor_cores"]this[/url] list). Most of them use the ARM architecture, although there are also high-performance MIPS models in the network processor space. It's possible the highest-performance MIPS chips are in the [url="http://en.wikipedia.org/wiki/Loongson"]Loongson[/url] line.

ldesnogu 2011-09-23 09:49

[QUOTE=xilman;272375]I then started seeing what else is available in the ARM range and came across these beasties: [URL="http://www.ti.com/product/am3703"]TI am3703[/URL]. If I read it correctly, you get a 1GHz 32-bit cpu for around USD15 and it draws about 1W. They come in a 15mm square package.

[...]Computational power should be useful but not astounding --- comparable with a 4GHz 4-core x86 processor perhaps.[/QUOTE]
The computational power of 16 Cortex-A8@1 GHz will be much lower than 4 x86@4 GHz.

First, Cortex-A8 FPU is non-pipelined.
Second, it's only dual issue without out of order execution.
Third, it's not 64-bit.
Fourth, memory bandwidth is typically rather low because the target market doesn't require high bandwidth.

So as a compute engine probably not a good thing (even from a perf/W point of view), but as an educational tool might be fun for sure :smile:

xilman 2011-09-23 14:06

[QUOTE=ldesnogu;272496]The computational power of 16 Cortex-A8@1 GHz will be much lower than 4 x86@4 GHz.

First, Cortex-A8 FPU is non-pipelined.
Second, it's only dual issue without out of order execution.
Third, it's not 64-bit.
Fourth, memory bandwidth is typically rather low because the target market doesn't require high bandwidth.

So as a compute engine probably not a good thing (even from a perf/W point of view), but as an educational tool might be fun for sure :smile:[/QUOTE]First: correct.
Second: Also correct.
Third: I confess I was benchmarking my mbed on problems which don't need 64-bit arithmetic.
Fourth: Correct [i]per cpu[/i]. Give each cpu its own memory and the effective bandwidth is raised 16-fold. To a first approximation, anyway.

It would still make a nice educational tool IMO, so we're in agreement there.

Paul

ldesnogu 2011-09-23 14:15

[QUOTE=xilman;272517]Fourth: Correct [I]per cpu[/I]. Give each cpu its own memory and the effective bandwidth is raised 16-fold. To a first approximation, anyway.[/QUOTE]
Good point. I was just pointing out the weaknesses of existing ARM chips, which alas can't saturate their memory interface due to poor memory controllers. It's sad to see a 2-core Cortex-A9 cluster only able to reach 3 or 4 hundred MB/s when it could be several GB/s :down:

[QUOTE]Still make a nice educational tool IMO, so we're in agreement there.[/QUOTE]
Definitely.

Dubslow 2011-09-24 01:04

My Core i7-2600K runs at about 75W at 3.5 GHz over 4 cores + hyperthreads.

Dubslow 2011-09-25 06:37

[URL]http://en.wikipedia.org/wiki/FLOPS#Cost_of_computing[/URL]
According to that, these days we pay about $1.80 per GFLOPS. That means all of GIMPS on current hardware is worth ~60,000 GFLOPS x $1.80 = $108,000. So if we're smart about it, with $2000 we could increase throughput by 1-2%.
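
Spelling out that arithmetic (both inputs are the figures quoted above, not independent measurements):

[code]
# Back-of-the-envelope cost check using the figures above:
# ~$1.80 per GFLOPS, and ~60 TFLOPS of total GIMPS throughput.
cost_per_gflops = 1.80
gimps_gflops = 60_000
replacement_cost = cost_per_gflops * gimps_gflops
print(f"Replacing all of GIMPS: ${replacement_cost:,.0f}")        # $108,000
budget = 2_000
print(f"${budget:,} buys ~{budget / replacement_cost:.1%} more")  # ~1.9%
[/code]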

Christenson 2011-09-27 00:50

[url]http://mersenneforum.org/showthread.php?t=16050[/url]

Xyzzy does just this, with two computers running nVidia GPUs and TF. The question here was whether a different trade-off of power versus speed might make sense.

Mark Rose 2016-07-17 20:07

[url]http://www.mersenneforum.org/showthread.php?t=20795[/url]

The above thread includes several hardware setups from the first half of 2016, including run-cost analysis. There are several techniques for sharing power supplies.

Recycling high-core-count, low-clock Xeon chips was also found to be cost-efficient.

jasong 2016-07-18 16:41

[url]http://semiaccurate.com/2016/07/11/sifive-opens-silicon-access-freedom-e300-u500/[/url]

Thought people might be interested.

