mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2008-05-18, 10:21   #12
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

Quote:
Originally Posted by RMAC9.5 View Post
I only mentioned the ATI Radeon HD 3850 graphics card in my previous post because it comes in both PCI-e and AGP flavors. It also, according to AMD, has 320 stream processing units which support 128 bit floating point precision for all vertex, geometry, and pixel shader operations. Here is a link to its GPU specifications page http://ati.amd.com/products/radeonhd3800/specs.htm
That URL needs an "l" appended: http://ati.amd.com/products/radeonhd3800/specs.html

It is all too common for vendors to use nonstandard wording. If, instead of "128-bit floating point precision for all operations", it had said "IEEE quad-precision floating point used for all operations", that would refer to an independent standard. (However, it seems that the IEEE 754r standard, which will include a 128-bit format, is still in progress, not yet official.)

"128-bit floating point precision" may seem the same as "quad-precision" at first glance, but then one recalls that quad-precision numbers (http://en.wikipedia.org/wiki/Quadruple_precision) have only 112, not 128, bits in their fractions.

Thus, it's likely that the 128-bit numbers cited in the spec are just groups of four 32-bit floating-point numbers (each of which has only 23 bits of precision in its fraction), as sylvester says.

Old 2008-05-18, 12:07   #13
rogue
"Mark"
Apr 2003
Between here and the

Quote:
Originally Posted by sylvester View Post
If you didn't write that you are a "programmer and understand" I would just call you confused by the marketing collateral. But you call yourself a programmer and have no idea how basic data types are represented in a processor. This puts you in the category of cranks. I just hope that you aren't programming "applications" for my health insurance or my bank. <Deity> help us!
I think you are being a little unfair. Not all applications programmers know how a basic data type in a given language is represented internally to the CPU. For example, just because you call a datatype an integer doesn't mean that it is 32 bits. In some programming languages an integer is 16 bits, even if the CPU is 64-bit (PowerBuilder PowerScript). On 64-bit platforms an integer could be 32 bits or 64 bits, depending upon the compiler options used to build the software (gcc with the -m64 switch). I can also point to Oracle PL/SQL: I believe (although I'm not 100% certain) that in PL/SQL an integer is 128 bits, regardless of the platform. Finally there are scripting languages, those that require an interpreter at runtime. Visual Basic Script (and other languages calling themselves "Basic") is a good example. I don't think that these languages can guarantee the size of an integer until runtime.
Old 2008-05-18, 14:57   #14
jasonp
Tribal Bullet
Oct 2004

I haven't been 'pulling punches'; it doesn't matter to me at all what anyone believes. I wrote the FAQ so that people who want to ask about using special-purpose hardware can first understand the engineering compromise that all finished code really is, and also the calculus that professionals need when mapping out what they will do to achieve a given goal. The possibility that dedicated hardware can make things better needs to be weighed against the costs needed to get there, and the amount of buy-in that can be expected if and when they manage to finish the work.

Relying on distributed computing to solve tough computational problems automatically assumes 1) that no single contributor to the project can ever hope to do a majority of the work, no matter how many resources that user brings to bear, and 2) that the capacity to do that work would otherwise be wasted. That is, for every user who builds their own farm solely for running Prime95, there are thousands of users who just download something and forget about it, and the aggregate throughput of the latter far exceeds that of the former. If you are the (one) project developer and want your problem solved faster, then you have to optimize for the common case.
Old 2008-05-18, 18:07   #15
sylvester
Dec 2007
RSM, CA, USA

OK, now I'll attempt to make a positive contribution. Here is what I propose to be added at the end of the opening FAQ post of this thread:

Q: I read all the above and I think that you suffer from NIH (Not Invented Here) Syndrome.

A: Fine. Take the lucdwt.c code (about 900 lines) and port it to your preferred hardware. If your results beat "gcc -O0 lucdwt.c" or "cl /Od lucdwt.c" running on an Intel or AMD processor manufactured within the last 3 years, post your results here. Otherwise take your "advanced streaming floating-point hardware" and shove it where the sun don't shine. (The sun doesn't shine inside the 700MHz Pentium III box that I'm using to type this ;-)

My argument for the inclusion of the above is similar to the USPTO policy for submission of Perpetuum Mobile designs. For everything else they accept descriptions and drawings, but for P.M. they ask for a "working model". The USPTO has a good track record of dealing with crank technologists, and we may benefit from following their pattern.

Obviously feel free to re-edit my words to suit your style.

Thanks for not tasing me and my apologies to JasonP for misunderstanding his intentions.
Old 2008-05-19, 22:11   #16
wblipp
"William"
May 2003
New Haven

Quote:
Originally Posted by jasonp View Post
If you used single precision floating point, all chunks of the answer must fit in 24 bits. Since a 10-million digit number has ~33 million bits, you would have to break it into 33 million one-bit chunks, perform an FFT of size 33 million, and get an answer where each chunk has very nearly 24 bits.
In single precision, could we work with 48 thousand bit numbers, with 12 bits per chunk and 4,000 chunks? Would that make ECM and P-1 feasible in GPUs for numbers up to 14,000 digits? Would that be a real boon for factoring projects like the Cunningham project, or would the increased operation count outweigh the speedup, resulting in slower time for each test?

Old 2008-05-19, 23:35   #17
ewmayer
2ω=0
Sep 2002
República de California

Quote:
Originally Posted by wblipp View Post
In single precision, could we work with 48 thousand bit numbers, with 12 bits per chunk and 4,000 chunks? Would that make ECM and P-1 feasible in GPUs for numbers up to 14,000 digits? Would that be a real boon for factoring projects like the Cunningham project, or would the increased operation count outweigh the speedup, resulting in slower time for each test?
William, I tweaked the little C function in Mlucas which implements the ROE-modeling heuristic in the F24 paper to use SP rather than DP, and generated a table of FFT-length-versus-max-Mersenne-exponent, which follows. Take these numbers with a large grain of salt, since I haven't actually run any single-precision FFTs to check whether they are realistic in practice:
Code:
N =        2 K  maxP =      14987
N =        3 K  maxP =      21891
N =        4 K  maxP =      28637
N =        5 K  maxP =      35266
N =        6 K  maxP =      41804
N =        7 K  maxP =      48265
N =        8 K  maxP =      54661
N =        9 K  maxP =      61000
N =       10 K  maxP =      67289
N =       11 K  maxP =      73533
N =       12 K  maxP =      79736
N =       13 K  maxP =      85901
N =       14 K  maxP =      92032
N =       15 K  maxP =      98131
N =       16 K  maxP =     104201
N =       18 K  maxP =     116257
N =       20 K  maxP =     128214
N =       22 K  maxP =     140082
N =       24 K  maxP =     151870
N =       26 K  maxP =     163583
N =       28 K  maxP =     175228
N =       30 K  maxP =     186811
N =       32 K  maxP =     198334
N =       36 K  maxP =     221218
N =       40 K  maxP =     243907
N =       44 K  maxP =     266420
N =       48 K  maxP =     288773
N =       52 K  maxP =     310980
N =       56 K  maxP =     333052
N =       60 K  maxP =     355000
N =       64 K  maxP =     376831
N =       72 K  maxP =     420172
N =       80 K  maxP =     463126
N =       88 K  maxP =     505732
N =       96 K  maxP =     548022
N =      104 K  maxP =     590022
N =      112 K  maxP =     631756
N =      120 K  maxP =     673243
N =      128 K  maxP =     714499
N =      144 K  maxP =     796376
N =      160 K  maxP =     877486
N =      176 K  maxP =     957906
N =      192 K  maxP =    1037700
N =      208 K  maxP =    1116920
N =      224 K  maxP =    1195613
N =      240 K  maxP =    1273815
N =      256 K  maxP =    1351560
N =      288 K  maxP =    1505792
N =      320 K  maxP =    1658502
N =      352 K  maxP =    1809843
N =      384 K  maxP =    1959943
N =      416 K  maxP =    2108907
N =      448 K  maxP =    2256822
N =      480 K  maxP =    2403765
N =      512 K  maxP =    2549801
N =      576 K  maxP =    2839375
N =      640 K  maxP =    3125928
N =      704 K  maxP =    3409767
N =      768 K  maxP =    3691141
N =      832 K  maxP =    3970259
N =      896 K  maxP =    4247295
N =      960 K  maxP =    4522401
N =     1024 K  maxP =    4795706
So in answer to your specific query, 48-kbit moduli would need closer to an 8K FFT in single precision. [Remember, convolution outputs grow quadratically with input wordsize and roughly as the square root of transform length, so your 12-bit estimate was likely simply 2x too large.]

Here is the code snippet in question [the DP version - for SP just change the value of the constant Bmant from 53 to 24], in case anyone is interested. The value 0.6 of the order-unity asymptotic constant was chosen because that seemed to best match the error levels of my FFT implementation. Your mileage may vary depending on FFT implementational details and hardware [e.g. x86-style 80-bit floating registers or not], but the value should be around 1 in any event.
Code:
/*
For a given FFT length, estimate maximum exponent that can be tested.

This implements formula (8) in the F24 paper (Math Comp. 72 (243), pp.1555-1572,
December 2002) in order to estimate the maximum average wordsize for a given FFT length.
For roughly IEEE64-compliant arithmetic, an asymptotic constant of 0.6 (log2(C) in the
paper, which recommends something around unity) seems to fit the observed data best.
*/
uint32 given_N_get_maxP(uint32 N)
{
	const double Bmant = 53;
	const double AsympConst = 0.6;
	const double ln2inv = 1.0/log(2.0);
	double ln_N, lnln_N, l2_N, lnl2_N, l2l2_N, lnlnln_N, l2lnln_N;
	double Wbits, maxExp2;

	ln_N     = log(1.0*N);
	lnln_N   = log(ln_N);
	l2_N     = ln2inv*ln_N;
	lnl2_N   = log(l2_N);
	l2l2_N   = ln2inv*lnl2_N;
	lnlnln_N = log(lnln_N);
	l2lnln_N = ln2inv*lnlnln_N;

	Wbits = 0.5*( Bmant - AsympConst - 0.5*(l2_N + l2l2_N) - 1.5*(l2lnln_N) );
	maxExp2 = Wbits*N;

	/* 3/10/05: Future versions will need to loosen this p < 2^32 restriction: */
	ASSERT(HERE, maxExp2 <= 1.0*0xffffffff,"given_N_get_maxP: maxExp2 <= 1.0*0xffffffff");

	fprintf(stderr,"N = %8u K  maxP = %10u\n", N>>10, (uint32)maxExp2);

	return (uint32)maxExp2;
}

Old 2008-05-20, 05:02   #18
RMAC9.5
Jun 2003

Jasonp,
Thank you for creating your FAQ and starting this thread. It took a little bit of Google searching but here is a link to an AMD forum thread that may shed some more light on our double precision discussion topic. http://forums.amd.com/devforum/messa...threadid=92248
Hopefully Micah Villmow and Michael Chu can provide additional information about AMD's SDK.

Old 2008-05-20, 16:21   #19
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England

This business about Brook+ letting you use the double-precision units on the ATI HD3870 card is quite interesting; the system behaves as if it has 64 double-precision FPUs at 775MHz rather than 320 single-precision ones, but it looks as if you get something fairly adequate.

I haven't seen benchmarks, I haven't seen running code, and I get the strong impression that a) ATI are a long way behind nVidia in toolsets and in mindshare and b) you have to develop on Windows XP so I'd have to devote a whole computer to it. £130 is not a completely ridiculous amount to spend to have a play, but I'm a bit wary, having bought a GeForce 5900 for GPGPU reasons five years ago and never managed to program it at all.
Old 2008-05-20, 22:37   #20
Uncwilly
6809 > 6502
Aug 2003

Quote:
Originally Posted by jasonp View Post
4) if the goal is to get the most LL tests done across all users, is improving your personal farm's throughput the best use of man-years of George Woltman's time?
The goal is to clear the most exponents. LL is not the goal.
If I could run a factoring program (one that already exists in C) on a GPU on a machine when it's not otherwise occupied, I would. Also, if a GPU would support Mlucas or Glucas (from their C source), I might run them. If the SDK for a GPU allowed multiple cards to work together (without the CPU being involved), as in multi-monitor systems (like with a dedicated plug-in cable), then the multiprocessor code of Mlucas/Glucas might work on them, which would be great for a double-check machine.
If there existed a pure C version of a P-1 program, that would be great too.

Not all of us think that George should GPU. We just think that a DP GPU would be nice.

"Wouldn't you like to be a DP, too?"
Old 2008-05-21, 05:47   #21
wblipp
"William"
May 2003
New Haven

Quote:
Originally Posted by ewmayer View Post
William, I tweaked the little C function in Mlucas which implements the ROE-modeling heuristic in the F24 paper to use SP rather than DP ... Here is the code snippet in question [the DP version - for SP just change the value of the constant Bmant from 53 to 24],
Thanks! Let's see if I understand this. Suppose I'm interested in doing ECM curves on (3^536+1)/(3^8+1), a number of about 837 bits. Using single precision I need a transform of length about 96, and using double precision I need a transform of length about 36. FFT work scales with transform length as N*log(N), so the single-precision version will need 3.4 times as many multiplications as the double-precision version.

Is that about right?

William
Old 2008-05-21, 16:53   #22
ewmayer
2ω=0
Sep 2002
República de California

Quote:
Originally Posted by wblipp View Post
Thanks! Let's see if I understand this. Suppose I'm interested in doing ECM curves on (3^536+1)/(3^8+1), a number of about 837 bits. Using single precision I need a transform of length about 96, and using double precision I need a transform of length about 36. FFT work scales with transform length as N*log(N), so the single-precision version will need 3.4 times as many multiplications as the double-precision version.

Is that about right?
If those are the numbers the above code spat out for you, then, yes - though note that some of the asymptotic scalings used in the heuristic ROE model will begin to break down at very small FFT lengths. Should still be ballpark-style-good, though.

Going in the other direction, it also becomes clear why SP needs to be at least 10x faster than DP on the hardware in question in order to be of interest for current and future GIMPS work: at 1024K, SP only allows ~4.5 bits per input word, nearly 5x less than DP, and the ratio gets worse as the FFT lengths get larger. To test an exponent near 50M, SP needs a 14336K FFT while DP needs just 2560K, 5.6x smaller. And since the FFT work scales roughly as N*log(N), a 14336K FFT in fact needs roughly 7x as many operations as a 2560K DP FFT.

I'm not saying SP is completely out of the running, but it is likely mainly of interest for smaller FFT lengths, and only if the hardware in question can do it on the order of 10x faster than DP.