#12
Jun 2010
Pennsylvania
947 Posts
Quote:
Rodrigo
#13
Sep 2006
The Netherlands
3×269 Posts
Quote:
There is very little software: there are llrcuda and mfaktc, and both run only on Nvidia hardware. I'm working on some OpenCL code, which works on any platform, but I optimize it for AMD GPUs, simply because an AMD GPU is all I have that works there. For OpenCL there simply isn't much software that is "push button and go" yet, but it will come. OpenCL is a standard that is supported by basically all manufacturers, for all sorts of CPUs, GPUs and future manycore hardware. AMD is even going so far as to drop support for everything else with respect to GPGPU, so only OpenCL is supported.

If you look at OpenCL you'll notice it seems like a very small, simple language, similar to C with a few extra function calls. Setting it up for development is not so easy, though, if you want to build your own parsers and mixed executables that combine host code with OpenCL kernels. Hopefully all this will get easier in the future. So right now, other than on Nvidia, there is not much that is push button and go, except for some non-prime-number codes which have for years run nearly exclusively on AMD GPUs (there were CUDA ports, but so much slower for that specific software that nearly no one runs them on Nvidia).

As for speed, there is little debate that for the money, GPUs deliver big performance. Where guys like TheJudger and me look mostly at the latest generation of cards, realize that older-generation cards, certainly AMD's, can be very interesting to buy once you have software that works. The 6000 series from AMD is not that much better than the 5000 series for well-optimized GPGPU code. Most well-optimized GPU codes work with 32-bit integers, as those run so much faster on these GPUs than double precision (and usually you can get away with single precision, or with emulating wider arithmetic using integers). That also means mfaktc is pretty fast even on the cheap Nvidia gamer cards, which are lobotomized for double-precision code (a factor of 4 slower or so).
So the fundamental problem of GPGPU is the lack of software for it. The barrier to developing GPGPU code is high, and there is very little information. Unless you work for some sneaky organisation that has all the information, that is a huge disadvantage in GPGPU, because you have to test every single thing yourself.

I can give you a great example of that. A few months ago I noticed that OpenCL has no function for the top 16 bits of a 24x24-bit multiplication, even though I found the instruction listed in the hardware manual of the 6900 series (and therefore it probably exists from the 4000 series onwards). Of course I posted this on the forums. Almost instantly someone replied that the instruction gets mapped onto the 32x32-bit multiplication (cost: 4 cycles), which would make the multiplication really slow. Just to be sure, I put the question to the AMD helpdesk. After a few weeks the answer came back that the 24x24-bit instruction delivering the top 16 bits of the product runs at the full 1.351 Tflop speed per GPU (times 2 for the 6990).

There is very little info on the speed of each individual instruction. Writing a test program doesn't really help you out, as the OpenCL compiler can simply get it wrong, and you can't write assembler for the GPU (they really should allow that). Nvidia doesn't have a manual that shows which instructions the GPU supports in hardware. The many conferences and workshops are just a joke, of course; no one puts in all the effort of writing code for a GPU in order to end up slower than a few CPUs. You want the maximum speed out of it. Neither manufacturer, AMD nor Nvidia, releases much information on actual LATENCIES, nor on how to schedule things so that you get the full IPC (instructions per cycle, which is of course 1 instruction per cycle per PE) out of it.
This basically means that 99.99% of programmers will, with 100% certainty, fail to write fast code on those things. As a result, mainly some obvious, simple-to-program codes have been developed for GPGPU. The majority of what runs on GPUs is really brain dead; with some algorithmic toying you can really improve those algorithms. I remember toying for one afternoon with some quantum mechanics (fluid dynamics, in fact) codes that supposedly could not be parallelized well enough even to run on a quadcore. Their authors may be big experts in their field, but writing code that is fast and parallel is a very special expertise. Only a few programmers can really kick butt there, and most of them only program for money, which maybe explains the huge lack of software.

The reason GPGPU might become more popular now is the huge improvement in the GPUs themselves. Both Fermi and the 5000/6000 architectures from AMD are very capable, so there is much potential to run the majority of crunching codes on those GPUs. Of course it is very unusual hardware, so not everything will work well on it. Sometimes the effort of getting a program onto a GPU, compared to a CPU, is simply too big relative to the payoff. A good example is computer chess. Who would be able to sell a chess program for a GPU? Can it run faster on GPUs? Oh sure, a lot faster, but it is so much effort and no one will pay for it. You get what you pay for.

History shows that in the past few years most people are simply not willing to pay for software; they only want to pay for hardware. So while all the hardware problems are easy to solve or figure out, the fundamental problem with GPUs requires a shift in thinking: you NEED to pay good programmers to write code there; the effort is usually too big to wait for volunteer code from a good programmer. Try to find PhDs who can program GPGPU crunching codes well. It will take most of them 10 years to learn that.
So it's the public scientific world that gets hit hardest there, as they simply NEVER budget for writing software for a research project. That is the biggest limit on computing with GPUs: this very fast hardware has a lot of programming constraints, and you need high-IQ guys to program for it. Objectively, OpenCL has a lot of potential to grow, and second-hand those cards are dirt cheap now (the 5000 series), but with the above constraint on programmers we will have to see whether that boosts OpenCL codes. If I just look at the AMD developer forum for OpenCL, which I check frequently, I can't say too many positive things about what I see there. Also, look around: how many companies are willing to pay for a good OpenCL programmer? I've seen 0 so far.

Last fiddled with by diep on 2011-05-27 at 11:40
#14
Sep 2006
The Netherlands
1447₈ Posts
http://www.computingcareers.co.uk/jo...esign-engineer
Actually, I found a job posting there in the UK. Yet I'd say they won't find anyone qualified for 40k pounds a year, unless they double that salary (in which case I'd also apply).

Last fiddled with by diep on 2011-05-27 at 11:48
#15 |
Dec 2010
Monticello
5·359 Posts
As a designer and a programmer, I tell you that both good hardware design and good software are expensive, and require those smart folks diep talks about. The huge numbers of people in computing (and electrical engineering) hide the fact that only a few of us are really effective at this stuff...whatever the language, whatever the thing being designed.
Good software has the property that once written, copies are free, and someone able to charge for copies (like Microsoft, the game companies, and my employer, since you have to buy our hardware to get our software) gets fabulously rich. It has really only been about 4 years since any of the GPU computing languages (OpenCL, CUDA, FireStream, maybe PhysX) were released, and the general public doesn't really need the kind of compute speed these cards offer, except in games and possibly video playback. So the massive subsidies aren't there, and the codes we do have are roughly par for the course. (The research literature is always complaining that the hardware is WAY ahead of the software -- anyone know a good speech recognizer?)

Computational number theory is more a hobby than a job. I mean, seriously, besides hunting primes and the occasional crypto factor (or other favorite RESEARCH distributed computing project), I'm not doing anything on my PCs that wouldn't have worked on my Windows 95 PC with 100M of RAM at 100MHz, except for the Windows OS and possibly my YouTube videos. This includes the stuff we use at work: AutoCAD, circuit design software, and the "office" programs. The only thing I'm not sure about is the MRP software, a database that seems to want to put all of itself into 100 gig or so of main memory on one of our servers. The company I work for did pay $30 grand for it, and has a bad case of sunk-cost syndrome with it.

Diep, I have a GeForce GT210 card on Windows XP that gets about 5M TFs per second. I don't think it is going to do any more "production" work, as my GT440 runs circles around it at 50M TFs per second. If OpenCL is offered on it, I can run tests for you under Windows XP.
*************************************************

But why we are where we are, and where you, diep, might get some resources to help you test OpenCL (and move sieving in mfaktc from the CPU to the GPU), is "Off Topic".

Moderators: We STILL need a sticky that is a concise guide to GPU computing for Mersenne-aries, one that doesn't even get to 10 posts long. Installing a GPU in a PC is no more difficult than installing a motherboard correctly, and almost anyone here can run mfaktc, as long as they can FIND it. This might be true of the other GPU programs, such as CUDALucas, too. Rodrigo, would you be willing to be the "Nazi" for such a topic?
#16 |
Jun 2010
Pennsylvania
1663₈ Posts
diep,
Excellent rundown on the state of the art, thank you. I understand much better now why the sort of thing I was thinking of hasn't already happened: things are still taking shape! I'll use your information, and that provided by Brain, Christenson, and lavalamp, to start groping my way through the forest. But it does look to me that, at the moment, this is still an area best left to programmers and others who know what they're doing...

Rodrigo
#17
Jun 2010
Pennsylvania
947 Posts
Quote:
Well, I'm not sure that I'd want to be a "Nazi" for the project... but since my ignorance of the subject does give ideas about the kind of info someone just getting into this would need, maybe we can explore that. Though I do doubt that I'm qualified to dive in head first, as I said to diep. How do we take this up with the Powers That Be?

Rodrigo
#18
Sep 2006
The Netherlands
807₁₀ Posts
Quote:
That 25% was consistent with reports from some who toyed with CUDA at the time. That 50% I took very seriously. The problem, however, was that despite calling AMD about 25 times (they had just taken over ATI), the development kit was simply impossible to get. How had all those researchers in China already programmed for these cards? Basically all their codes work both on Nvidia and on AMD. How did they do that already 4+ years ago? Several companies have been running on GPU hardware for years. Their big success stories aren't getting told, and that holds things back a lot as well.

As a result, right now the fastest supercomputer is in China, and it really DELIVERS what they claim it delivers. Note that it's a lot more capable than other supercomputers on that list, as they have their own interconnect that's a lot faster than anything else on this planet that's publicly available. So if you wanted to run some gigantic FFT-type calculation, that sort of interconnect is what you need.

Now I'm guessing that a lot of companies simply pay big bucks to Nvidia and AMD/ATI to get what they want. But this 100% information stop on latencies, with OpenCL only slowly starting to work, is what stops all the students. The learning curve takes too long without that information, and I'd argue that even with it, producing something that works well is tough. Besides that, you need to be really good at parallelizing software, in the case of software that doesn't parallelize well by default. I've been sending out emails looking for transforms that might do well on the AMD GPU. Realize the huge effort it is. You know any student who'd do that?
#19
"Oliver"
Mar 2005
Germany
5×223 Posts
Hi Brain,
Quote:
It is safe to remove the "possible but slow (too few registers)" note for cc 1.1 from your paper. With default settings, a cc 1.1 capable GPU is not that bad; e.g. a GTX 9800 is still much faster than CPU-only TF.

Oliver
#20
Dec 2010
Monticello
5·359 Posts
Diep has privately objected to my calling the autocratic leader of the documentation project aimed at newbies a "Nazi"... I meant it as a title for someone who is going to enforce some rules, at some cost; in this case, killing all the posts that get in the way of a beginner figuring out what to do.
However, I dislike objectors who complain about things without offering solutions -- so Diep, what is the better word?

**********************

The sort of friendliness (not!) from ATI that Diep describes is probably why we have more CUDA codes running on Nvidia than FireStream or OpenCL codes on ATI. We also see some caginess in the CUDA spec, because everyone remembers all the 8086 instructions that *still* have to run on PC processors, even if they make no sense. Also, the latest CPUs from Intel do huge amounts of work hiding latencies... all in hardware...

***********

Rodrigo, PM xyzzy and see what he says....
#21
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²×3×641 Posts
Quote:
or, more generally, http://en.wikipedia.org/wiki/List_of...ch_%27czars%27

Quote:

So your term could be "documentation czar" (perhaps "docuczar" for short). Or you may prefer the "tsar" spelling, more commonly used in the UK (http://en.wikipedia.org/wiki/Czar_%2...United_Kingdom), producing "docutsar". (Be prepared for "docustar" misspellings; one might even prefer that to the others as the official term, but it would not project the image of an "autocratic" leader so well.) However, neither term may work as well in Russia as in the US/UK. :-)

Last fiddled with by cheesehead on 2011-05-27 at 22:26 Reason: various improvements
#22
Bamboozled!
May 2003
Down not across
2×17×347 Posts