#12
"Richard B. Woods"
Aug 2002
Wisconsin USA
1E0C₁₆ Posts
Quote:
It is all-too-common for vendors to use nonstandard wording. If instead of "128-bit floating point precision for all operations", it had said "IEEE quad-precision floating point used for all operations", that would refer to an independent standard. (However, it seems that the IEEE 754r standard, which will include a 128-bit format, is still in progress, not yet official.)

"128-bit floating point precision" may seem the same as "quad-precision" at first glance, but then one recalls that quad-precision numbers (http://en.wikipedia.org/wiki/Quadruple_precision) have only 112, not 128, bits in their fractions. Thus, it's likely that the 128-bit numbers cited in the spec are just groups of four 32-bit floating-point numbers (each of which has only 23 bits of precision in its fraction), as sylvester says.

Last fiddled with by cheesehead on 2008-05-18 at 10:24
#13
"Mark"
Apr 2003
Between here and the
1100011010010₂ Posts
Quote:
#14
Tribal Bullet
Oct 2004
3·1,181 Posts
I haven't been 'pulling punches'; it doesn't matter to me at all what anyone believes. I wrote the FAQ so that people who want to ask about using special-purpose hardware can first understand the engineering compromise that all finished code really is, and also the calculus that professionals need to apply when mapping out what they will do to achieve a given goal. The possibility that dedicated hardware can make things better needs to be weighed against the costs of getting there, and the amount of buy-in that can be expected if and when they manage to finish the work.

Relying on distributed computing to solve tough computational problems automatically assumes that 1) no single contributor to the project can ever hope to do a majority of the work, no matter how many resources that user brings to bear, and 2) the capacity to do that work would otherwise be wasted. That is, for every user who builds a farm solely for running Prime95, there are thousands of users who just download something and forget about it, and the aggregate throughput of the latter far exceeds that of the former. If you are the (one) project developer and want your problem solved faster, then you have to optimize for the common case.
#15
Dec 2007
RSM, CA, USA
2×3×7 Posts
OK, now I'll attempt to make a positive contribution. Here is what I propose to be added at the end of the opening FAQ post of this thread:
Q: I read all the above and I think that you suffer from NIH (Not Invented Here) Syndrome.

A: Fine. Take the lucdwt.c code (about 900 lines) and port it to your preferred hardware. If your results beat "gcc -O0 lucdwt.c" or "cl /Od lucdwt.c" running on an Intel or AMD processor manufactured within the last 3 years, post your results here. Otherwise take your "advanced streaming floating-point hardware" and shove it where the sun don't shine. (The sun doesn't shine inside the 700MHz Pentium III box that I'm using to type this. ;-)

My argument for including the above is similar to the USPTO policy for submissions of perpetual-motion designs. For everything else they accept descriptions and drawings, but for perpetual motion they ask for a working model. The USPTO has a good track record of dealing with crank technologists, and we may benefit from following their pattern. Obviously, feel free to re-edit my words to suit your style.

Thanks for not tasing me, and my apologies to JasonP for misunderstanding his intentions.
#16
"William"
May 2003
New Haven
4476₈ Posts
Quote:
Last fiddled with by wblipp on 2008-05-19 at 22:19 Reason: Add operation count question
#17
∂2ω=0
Sep 2002
República de California
26576₈ Posts
Quote:
Code:
N =    2 K  maxP =    14987
N =    3 K  maxP =    21891
N =    4 K  maxP =    28637
N =    5 K  maxP =    35266
N =    6 K  maxP =    41804
N =    7 K  maxP =    48265
N =    8 K  maxP =    54661
N =    9 K  maxP =    61000
N =   10 K  maxP =    67289
N =   11 K  maxP =    73533
N =   12 K  maxP =    79736
N =   13 K  maxP =    85901
N =   14 K  maxP =    92032
N =   15 K  maxP =    98131
N =   16 K  maxP =   104201
N =   18 K  maxP =   116257
N =   20 K  maxP =   128214
N =   22 K  maxP =   140082
N =   24 K  maxP =   151870
N =   26 K  maxP =   163583
N =   28 K  maxP =   175228
N =   30 K  maxP =   186811
N =   32 K  maxP =   198334
N =   36 K  maxP =   221218
N =   40 K  maxP =   243907
N =   44 K  maxP =   266420
N =   48 K  maxP =   288773
N =   52 K  maxP =   310980
N =   56 K  maxP =   333052
N =   60 K  maxP =   355000
N =   64 K  maxP =   376831
N =   72 K  maxP =   420172
N =   80 K  maxP =   463126
N =   88 K  maxP =   505732
N =   96 K  maxP =   548022
N =  104 K  maxP =   590022
N =  112 K  maxP =   631756
N =  120 K  maxP =   673243
N =  128 K  maxP =   714499
N =  144 K  maxP =   796376
N =  160 K  maxP =   877486
N =  176 K  maxP =   957906
N =  192 K  maxP =  1037700
N =  208 K  maxP =  1116920
N =  224 K  maxP =  1195613
N =  240 K  maxP =  1273815
N =  256 K  maxP =  1351560
N =  288 K  maxP =  1505792
N =  320 K  maxP =  1658502
N =  352 K  maxP =  1809843
N =  384 K  maxP =  1959943
N =  416 K  maxP =  2108907
N =  448 K  maxP =  2256822
N =  480 K  maxP =  2403765
N =  512 K  maxP =  2549801
N =  576 K  maxP =  2839375
N =  640 K  maxP =  3125928
N =  704 K  maxP =  3409767
N =  768 K  maxP =  3691141
N =  832 K  maxP =  3970259
N =  896 K  maxP =  4247295
N =  960 K  maxP =  4522401
N = 1024 K  maxP =  4795706

Here is the code snippet in question [the DP version - for SP just change the value of the constant Bmant from 53 to 24], in case anyone is interested. The value 0.6 of the order-unity asymptotic constant was chosen because it seemed to best match the error levels of my FFT implementation. Your mileage may vary depending on FFT implementation details and hardware [e.g. x86-style 80-bit floating-point registers or not], but the value should be around 1 in any event.

Code:
/*
   For a given FFT length, estimate the maximum exponent that can be tested.
   This implements formula (8) in the F24 paper (Math. Comp. 72 (243), pp. 1555-1572,
   December 2002) in order to estimate the maximum average wordsize for a given FFT length.
   For roughly IEEE64-compliant arithmetic, an asymptotic constant of 0.6 (log2(C) in the
   paper, which recommends something around unity) seems to fit the observed data best.
*/
uint32 given_N_get_maxP(uint32 N)
{
	const double Bmant = 53;
	const double AsympConst = 0.6;
	const double ln2inv = 1.0/log(2.0);
	double ln_N, lnln_N, l2_N, lnl2_N, l2l2_N, lnlnln_N, l2lnln_N;
	double Wbits, maxExp2;

	ln_N     = log(1.0*N);
	lnln_N   = log(ln_N);
	l2_N     = ln2inv*ln_N;
	lnl2_N   = log(l2_N);
	l2l2_N   = ln2inv*lnl2_N;
	lnlnln_N = log(lnln_N);
	l2lnln_N = ln2inv*lnlnln_N;

	Wbits   = 0.5*( Bmant - AsympConst - 0.5*(l2_N + l2l2_N) - 1.5*(l2lnln_N) );
	maxExp2 = Wbits*N;

	/* 3/10/05: Future versions will need to loosen this p < 2^32 restriction: */
	ASSERT(HERE, maxExp2 <= 1.0*0xffffffff, "given_N_get_maxP: maxExp2 <= 1.0*0xffffffff");
	fprintf(stderr, "N = %8u K maxP = %10u\n", N>>10, (uint32)maxExp2);
	return (uint32)maxExp2;
}
Last fiddled with by ewmayer on 2008-05-19 at 23:37 |
#18
Jun 2003
153₁₀ Posts
Jasonp,
Thank you for creating your FAQ and starting this thread. It took a little Google searching, but here is a link to an AMD forum thread that may shed some more light on our double-precision discussion topic: http://forums.amd.com/devforum/messa...threadid=92248 Hopefully Micah Villmow and Michael Chu can provide additional information about AMD's SDK.

Last fiddled with by RMAC9.5 on 2008-05-20 at 05:05
#19
(loop (#_fork))
Feb 2006
Cambridge, England
6423₁₀ Posts
This business about Brook+ allowing you to use double-precision units on the ATI HD3870 card is quite interesting; the system behaves as if it has 64 775MHz double-precision FPUs rather than 320 single-precision ones, but it looks as if you get something which might be fairly adequate.
I haven't seen benchmarks, I haven't seen running code, and I get the strong impression that a) ATI are a long way behind nVidia in toolsets and in mindshare and b) you have to develop on Windows XP, so I'd have to devote a whole computer to it. £130 is not a completely ridiculous amount to spend to have a play, but I'm a bit wary, having bought a GeForce 5900 for GPGPU reasons five years ago and never managed to program it at all.
#20
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
9818₁₀ Posts
Quote:
If I could run a factoring program (one that already exists in C) on a machine's GPU when it's not otherwise occupied, I would. Also, if a GPU would support MLucas or GLucas (from their C source), I might run them. If the SDK for a GPU allowed them to work together (without the CPU being involved) on multi-monitor systems (say, with a dedicated plug-in cable), then the multiprocessor code of MLucas/GLucas might work on them - great for a double-check machine. If there existed a pure C version of a P-1 program, that would be great too.

Not all of us think that George should GPU. We just think that a DP GPU would be nice. "Wouldn't you like to be a DP, too?"
#21
"William"
May 2003
New Haven
93E₁₆ Posts
Quote:
Is that about right?

William
#22
∂2ω=0
Sep 2002
República de California
11646₁₀ Posts
Quote:
Going in the other direction, it also becomes clear why SP needs to be at least 10x faster than DP on the hardware in question in order to be of interest for current and future GIMPS work - at 1024K, SP allows only ~4.5 bits per input word, nearly 5x less than DP, and the ratio gets worse as the FFT lengths get larger. To test an exponent ~50M, SP needs a 14336K FFT while DP needs just 2560K - 5.6x smaller. And since the FFT work scales roughly as N*log(N), a 14336K SP FFT in fact needs roughly 7x as many operations as a 2560K DP FFT.

I'm not saying SP is completely out of the running, but it is likely mainly of interest for smaller FFT lengths, and only if the hardware in question can do it on the order of 10x faster than DP.