[quote]Q: What is this 'Number Theoretic Transform' that the threads I dutifully read above mention?
A: After explaining at great length why you have to have double precision floating point to do this FFT stuff, I'll backpedal and say that floating point arithmetic is not needed to compute the equivalent of an FFT multiply. You can instead use modular math in the finite field of elements modulo a prime integer. This sounds great, but instead of floating point multiplication you must instead use modular multiplication of integers. There are many tricks available to make this fast, but they all require 32-bit or 64-bit integer multiplication, that produces a 64-bit or 128-bit product. This unfortunately is not an operation needed in graphics cards or most other special processors, so it's not implemented in dedicated hardware, so it's not fast. In fact, we use floating point because most processors, general-purpose or special-purpose, don't care much about integer multiply throughput but care a great deal about making floating point fast. We go where the hardware is, and NTTs are not at present a good substitute for the way things are done now.[/quote] I'm not sure how long this "answer" has been here, but it's simply wrong; the AMD-based LLR program I mentioned an hour or so ago in another thread uses this method. Saying it isn't fast, based on opinion, is even more of a religious belief than many Christians' belief that Christ died and rose again. There's NO evidence that the "integer transform is slower" belief is true. It's simply people parroting each other, though I'll admit that they're parroting for the same reason I'm parroting, the reputation of the person they got the information from. |
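For concreteness, the transform the quoted FAQ describes can be sketched in a few dozen lines. This is purely illustrative (it is not code from LLR, Prime95, or any production program), and the prime 998244353 and primitive root 3 are choices made for this demo, not anything the FAQ specifies:

```python
# Minimal radix-2 number-theoretic transform: an exact, all-integer
# analogue of the FFT, working in the field of integers mod a prime.
# P = 998244353 = 119*2^23 + 1 admits roots of unity of order up to 2^23;
# G = 3 is a primitive root mod P. Both are assumptions of this demo.
P = 998244353
G = 3

def ntt(a, invert=False):
    """Iterative Cooley-Tukey transform over GF(P)."""
    n = len(a)                        # must be a power of two <= 2^23
    a = list(a)
    j = 0                             # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)   # root of unity of order `length`
        if invert:
            w = pow(w, P - 2, P)           # its modular inverse
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)           # divide by n on the way back
        a = [x * n_inv % P for x in a]
    return a

def multiply(x, y):
    """Exact convolution (polynomial / bignum-limb product) via the NTT."""
    n = 1
    while n < len(x) + len(y):
        n <<= 1
    fx = ntt(x + [0] * (n - len(x)))
    fy = ntt(y + [0] * (n - len(y)))
    return ntt([a * b % P for a, b in zip(fx, fy)], invert=True)

# (3 + 2x)(5 + 7x) = 15 + 31x + 14x^2, with no rounding error anywhere
print(multiply([3, 2], [5, 7])[:3])
```

The point of contention in the thread is not whether this works (it does, exactly), but that the inner loop lives on `a * b % P`, a wide integer multiply plus a modular reduction, which commodity hardware of the day accelerated far less aggressively than floating-point multiply-add.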
[quote=jasong;153353]It's simply people parroting each other, though I'll admit that they're parroting for the same reason [I][B]I'm parroting[/B][/I], the reputation of the person they got the information from.[/quote]
This is a fine example of a [I]belief[/I] that you have, based on nothing other than your own experience. Did you ever try the [URL="http://en.wikipedia.org/wiki/Scientific_method"]scientific method[/URL]? Or can you only retell [I]your friend's[/I] stories? |
[QUOTE=jasong;153353] There's NO evidence that the "integer transform is slower" belief is true.[/QUOTE]
:orly owl: |
[QUOTE=jasong;153353]I'm not sure how long this "answer" has been here, but it's simply wrong, the AMD based LLR program I mentioned an hour or so ago in another thread uses this method.
There's NO evidence that the "integer transform is slower" belief is true. [/QUOTE] Jason, I spent the better part of a year in 2004 building the fastest all-integer large convolution code available, and for Fermat-mod arithmetic it was 15% slower than mlucas for very large numbers. That FAQ has been there since the beginning, because I have code to back it up. Believe me, I wish I was wrong, because I've spent the better part of a decade trying to make all-integer large-number arithmetic go faster than the floating point version. |
Is there any way for PS3 to trial factor, then? Or is this impractical too? (Though too many trial factorers isn't that good either...)
|
[QUOTE=starrynte;154937]Is there any way for PS3 to trial factor, then? Or is this impractical too? (Though too many trial factorers isn't that good either...)[/QUOTE]It could be done. How much effort someone wants to put into writing the code is something else. And how practical it would be is a third question.
|
Canada does Quantum!!!
With a little help from Gov't funding....
[url]http://www.itworldcanada.com/Pages/Docbase/ViewArticle.aspx?id=idgml-4ee960e9-8d07-48bd&Portal=448d158c-d857-4785-b759-ffa1c005933c&sub=490495[/url] |
To be honest I'm surprised that billions haven't already been poured into this, not least by the American government.
|
[QUOTE=lavalamp;161221]To be honest I'm surprised that billions haven't already been poured into this, not least by the American government.[/QUOTE]
Why would the American government want to do that? Having a grade-school education with a strong science curriculum has been against the will of government for decades. The NCLBA (No Child Left Behind Act) has been more interested in political correctness than in promoting academic competition, whether it's between students in a school or between schools. It's been a punishment on the schools, because some kids (those at the left end of the bell curve) should not even be in the public school system, and those at the right end are not given the tools to excel. |
Depending on how powerful you think US intelligence agencies are, it's possible that the chance to implement Shor's factoring algorithm on a top secret quantum computer has already merited billions of (secret) US government dollars...
|
[QUOTE=jasonp;161227]Depending on how powerful you think US intelligence agencies are, it's possible that the chance to implement Shor's factoring algorithm on a top secret quantum computer has already merited billions of (secret) US government dollars...[/QUOTE]Well, that was the very big reason I was thinking of for the US gov't to get in on it.
However, I think people give gov'ts far too much credit for keeping things secret and just generally being competent. Over here, our lot seem to be able to lose an unencrypted laptop/CD/flash drive with important data on it at a rate of 2 or 3 per week. From what I've seen, businesses that actually need to make money to keep going have their wits about them, but anything state-run gets buried in paper and procrastination, including schools. The NHS is probably the biggest example: billions just get poured into a black hole and nothing improves. |
[QUOTE=jasonp;153534]Jason, I spent the better part of a year in 2004 building the fastest all-integer large convolution code available, and for Fermat-mod arithmetic it was 15% slower than mlucas for very large numbers. That FAQ has been there since the beginning, because I have code to back it up. Believe me, I wish I was wrong, because I've spent the better part of a decade trying to make all-integer large-number arithmetic go faster than the floating point version.[/QUOTE]
And since then the speed of Mlucas on Intel-style hardware has doubled due to use of SSE2, which (even using SSE3/4/5) sucks at 64-bit integer math, especially multiply. But hey, what do we know - all we've ever done is spend years and years studying the algorithms and the hardware ISAs and writing hundreds of thousands of lines of code. When that code ends up running at measured speed X on hardware Y, that's not a "fact", that's, like, just the hardware's opinion. JasonG, I notice you mentioned an LLR code that uses pure-integer arithmetic - so why don't you provide us with a head-to-head comparison of that running on the same hardware vs Prime95, for various convolution sizes? In other words, back up your noise with some actual simple-to-obtain "data". Also, in early September you mentioned that your mystery friend was all better and that we'd "be hearing all about his work" and be suitably in awe in 2-3 months. 4 months later ... tick, tock. Or did I miss the resulting New York Times "Computational revolution!!! ... Lights All Askew In The Heavens/Men Of Science More Or Less Agog Over Results" front page? This is not to disparage your friend ... I have a sneaking suspicion that if he knew the extent and extremity of the bragging-on-his-behalf you've been doing, he'd be rather embarrassed. |
Ernst, you ask nothing that hasn't already been asked dozens of times. I only hope someday the inferno in Jason's head will drive him to light a candle instead of cursing the darkness.
|
Per a post elsewhere, [url="http://www.sccg.sk/~vgsem/data/zima2008/FP64_GPU.pdf"]high-end 2009-era GPUs[/url] seem to have significant double-precision potential at a cost slightly higher than a bare-bones PC, so at least the cost hurdle is coming down rapidly.
|
For the record, the new version of the CUDA toolkit's FFT library now sports a double precision FFT implementation:
[url]http://forums.nvidia.com/index.php?showtopic=102548[/url] Only supported in GT200 cards, i.e. GTX260, GTX280, etc. |
[quote=Robert Holmes;182846]For the record, the new version of the CUDA toolkit's FFT library now sports a double precision FFT implementation:
[URL]http://forums.nvidia.com/index.php?showtopic=102548[/URL] Only supported in GT200 cards, i.e. GTX260, GTX280, etc.[/quote] Might it, then, finally be practical to port Prime95 to CUDA? |
[QUOTE=Robert Holmes;182846]For the record, the new version of the CUDA toolkit's FFT library now sports a double precision FFT implementation:
[url]http://forums.nvidia.com/index.php?showtopic=102548[/url] Only supported in GT200 cards, i.e. GTX260, GTX280, etc.[/QUOTE] But not ALL GT200 cards, only the 260 and up; the 250 and 210 don't work. Be very careful. |
[quote=Mini-Geek;182849]Might it, then, finally be practical to port Prime95 to CUDA?[/quote]
This I would LOVE to know. |
It doesn't seem worth the effort --- as said [I]ad nauseam[/I] before, double precision on the GT200 sucks: 1 DP unit per SM.
For a top of the line GTX285 card, this gives us (1476*10^6 * 2 * 30) / 10^9 = 88 GFlops, assuming all operations can be turned into MADs. More realistically it will be closer to 40-50, which is what a regular CPU can do. Maybe the GT300 will have proper DP support. At any rate, a proof of concept wouldn't hurt. I'd be willing to waste some time trying during August, if there was simple enough pseudo-code to begin with. EDIT: After a simple experiment, doing a double precision 2^24 Complex to Complex FFT using CUFFT in a GTX260 (theoretical 60 GFlops) takes about 0.0913 seconds, memory transfers excluded. Let's approximate the number of FP operations in the FFT as 5*n log n, i.e. 5 * 2^24 * 24 --- This gives us ((5 * 2^24 * 24) / 0.09) / 10^9 ~ 22 GFlops of FFT. How does this compare to a decent current quadcore? |
[quote=Robert Holmes;183279]It doesn't seem worth the effort --- as said [I]ad nauseam[/I] before, double precision on the GT200 sucks: 1 DP unit per SM.
For a top of the line GTX285 card, this gives us (1476*10^6 * 2 * 30) / 10^9 = 88 GFlops, assuming all operations can be turned into MADs. More realistically it will be closer to 40-50, which is what a regular CPU can do. Maybe the GT300 will have proper DP support. At any rate, a proof of concept wouldn't hurt. I'd be willing to waste some time trying during August, if there was simple enough pseudo-code to begin with. EDIT: After a simple experiment, doing a double precision 2^24 Complex to Complex FFT using CUFFT in a GTX260 (theoretical 60 GFlops) takes about 0.0913 seconds, memory transfers excluded. Let's approximate the number of FP operations in the FFT as 5*n log n, i.e. 5 * 2^24 * 24 --- This gives us ((5 * 2^24 * 24) / 0.09) / 10^9 ~ 22 GFlops of FFT. How does this compare to a decent current quadcore?[/quote] In comparison, a Core i7 950 running last version of prime95 says: Timing FFTs using 8 threads on 4 physical CPUs: Best time for 8192K FFT length: 49.532 ms. Assuming the linear scaling, current GPUs are no better than CPUs at this, as pretty much everyone expected. |
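The arithmetic behind both throughput figures is easy to reproduce. Here is a small sketch using the same crude 5·n·log2(n) flop model as the posts above; note (an assumption worth flagging) that prime95 transforms real data, so the complex-FFT model likely flatters its number somewhat, and this is an order-of-magnitude comparison only:

```python
import math

def fft_gflops(n, seconds):
    """Effective throughput implied by the 5*n*log2(n) operation model."""
    return 5 * n * math.log2(n) / seconds / 1e9

gpu = fft_gflops(2**24, 0.0913)    # GTX 260, CUFFT DP complex-to-complex
cpu = fft_gflops(2**23, 0.049532)  # Core i7 950, prime95 8192K FFT
print(f"GPU: {gpu:.1f} GFlops, CPU: {cpu:.1f} GFlops")
```

Both land around 20 GFlops under this model, which is what drives the "no better than CPUs" conclusion.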
[quote=Robert Holmes;183295]Assuming the linear scaling, current GPUs are no better than CPUs at this, as pretty much everyone expected.[/quote]But it's not really necessary for GPUs to be [I]better[/I] than CPUs in order for a port to be worthwhile, is it?
Even if they're now only in the same range of FFT speed, a port to CUDA could eventually double the potential number of processors GIMPS could use -- assuming one GPU per CPU, and eventually most GPUs are as capable as the now-top-of-the-line models. |
This ^
|
Rather than CUDA, which is for nVidia cards only, might it be worthwhile to write the code in OpenCL or similar?
|
Good call not to code for the PS3. The latest PS3 release can't run Linux, so by my extrapolation Sony has put up significant barriers to running 3rd-party code.
-- Craig |
Posting some of my research over the last week:
According to [url]http://en.wikipedia.org/wiki/FLOPS[/url], here is a breakdown of performance between the current flagship Intel (Core i7 965 XE) and NVidia (Tesla C1060, similar to the 280): [CODE]
        SP   DP
corei7  140   70
nvidia  933   78

SP = 32-bit floats, DP = 64-bit floats
[/CODE] Given these are top theoretical optimistic figures, don't expect to get this performance, but it's a good guide to compare the two. Let's say in DP it's pretty similar between the two architectures. The big difference between SP and DP on NVidia boils down to 1x 64-bit FP unit versus 8x 32-bit FP units. I think this has been mentioned before. As GPUs advance, who knows how this ratio will evolve. (A new generation of GPUs is due in the next couple of months.) Then also take this little gem from [url]https://visualization.hpc.mil/wiki/Introduction_to_CUDA[/url]: "Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display." This suggests (to me anyway) that coding for CUDA isn't exactly a trivial process. There will be gotchas in the development process. No way will there be mass adoption of a GPGPU client if it causes screen corruption. So what are the x86 camp doing? Intel? [url]http://en.wikipedia.org/wiki/Larrabee_(GPU)[/url] First half of 2010; Larrabee will also compete in the GPGPU and high-performance computing markets. It will run x86 code, which prime95 has an excellent heritage in.
(No mention of 64-bit float capability.) If I could put my holier-than-thou hat on: yes, in a commercial development organization one would set aside resources to develop on GPU platforms to show a proof of concept, then make a business call on which way to go; but when resources are tight it's a real tough call. Anyone want to have a stab at a proof-of-concept LL test on a GPU? Plenty of training resources at: [url]http://www.nvidia.com/object/cuda_education.html[/url] -- Craig |
[QUOTE=fivemack;133761]This business about Brook+ allowing you to use double-precision units on the ATI HD3870 card is quite interesting; the system behaves as if it has 64 775MHz double-precision FPUs rather than 320 single-precision ones, but it looks as if you get something which might be fairly adequate.[/QUOTE]
Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think NVidia has a similar scheme for a few of their top-end boards. |
[quote=lfm;188884]Note this is a 5x penalty for DP. It seems to correspond to the 10x penalty quoted for software DP emulation using SP mentioned in the FAQ. A factor of 2 is the minimal hardware support implemented for DP. I think NVidia has a similar scheme for a few of their top-end boards.[/quote]
The AMD and NVidia approaches are very different. AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though, but still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute. NVidia instead has full IEEE double precision ALUs in hardware, [B]in addition[/B] to its single precision ALUs. So for example, a GTX 275 card has 240 single precision ALUs; it also has only 40 double precision ALUs. So NV's double precision is at full speed, just with fewer ALUs. The architecture hides this all from you, so they could create any mix of single and double precision ALUs and your code doesn't know or care. The cards also have a "Special function" unit, which is for transcendental evaluation (sin, cos, log, etc.)... this is something that even CPUs don't have; those are usually done in microcode. All the G200 series of cards have these native DP processors... so it's not their top end boards, it's mid-level ($150) and above. |
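The single-precision-pair emulation described above can be illustrated with Knuth's error-free "two-sum". This is a toy sketch, not how any GPU driver actually implements it; in particular, Python has no native float32, so single precision is simulated here by rounding through `struct`:

```python
import struct

def f32(x):
    """Round a Python float (a double) to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    """Error-free addition: returns (s, e) with s + e == a + b exactly,
    using only single-precision operations. The 6 flops per add (vs 1
    for a native add) are one reason emulated wide arithmetic is slow,
    and multiplies and especially divides cost considerably more."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

a, b = f32(1.0), f32(2**-30)   # b is below half an ulp of a in float32
s, e = two_sum(a, b)
print(f32(a + b) == 1.0)       # plain single-precision add drops b entirely
print(s + e == 1.0 + 2**-30)   # the (hi, lo) pair preserves it exactly
```

Carrying values as such (hi, lo) pairs is the essence of double-float arithmetic; the per-operation overhead is what makes emulated DP several times slower than native SP, as the posts above discuss.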
[QUOTE=SPWorley;188929]The AMD and NVidia approaches are very different.
AMD uses its single-precision ALUs with microcode programming to produce DP results. These are not IEEE doubles, though, but still much higher precision than single precision. So DP is slower on AMD since it takes many more clocks to compute. NVidia instead has full IEEE double precision ALUs in hardware, [B]in addition[/B] to its single precision ALUs. So for example, a GTX 275 card has 240 single precison ALUs, it also has only 40 double precision ALUs. So NV's double precision is at full speed but just fewer ALUs. The architecture hides this all from you, so they could create any mix of single and double precision ALUs and your code doesn't know or care. The cards also have a "Special function" unit, which is for transcendental evaluation (sin, cos, log, etc).. this is something that even CPUs don't have, those are usually done in microcode. All the G200 series of cards has these native DP processors... so it's not their top end boards, it's mid-level ($150) and above.[/QUOTE] It seems to have the same end effect: a 5x penalty on ATI, and it looks like 1/5th the ALUs. I think it's an 8 to 1 ratio on NVidia (current cards?). From what I could gather on the NVidia site, the 250 and 210 models do not support DP, and there is even a 260 model (the 260M, the one that fits on a motherboard) that has no DP support. Are these the ones you mean currently under $150? |
[quote=lfm;188937]It seems the same end effect. A 5x penalty on ATI and it looks like 1/5th the alus. I think its a 8 to 1 ratio on nvidia (current cards?)
From what I could gather on the Nvidia site the 250 and 210 models do not support DP and there is even a 260 model (260M the one that fits on a motherboard) that has no DP support. Are these the ones you mean currently under $150?[/quote] Exactly, it's 8 SP : 1 DP : 1 SF for the G200 Nvidia cards. The ATI isn't a constant 5x at all.. it's variable based on operation type since it's software, but also it's not IEEE either so sometimes it's hard to compare. (ATI is effectively running Kahan summation to emulate the DP behavior, which is very fast for adds but slower for mults and especially painfully slow for divides) Yeah, none of the mobile chips have DP support, even the latest GTS 260m. The lowest NV card with DP support is the GTX260. [URL="http://www.newegg.com/Product/Product.aspx?Item=N82E16814127426"]That's $150.[/URL] That gives you about 95 double precision GFlops. That's very similar to the ideal DP flops of a 3Ghz quad i7 processor using packed SSE. |
Stats on the soon to be released ATI video card:
ATI 5000 series video card - 2.72 SP Gflops / 544 DP Gflops. Cost: $400ish. ATI article: [url]http://pcper.com/article.php?aid=783[/url] "AMD added IEEE754-2008 compliant precision" Does that statement qualify it as suitable for the algorithms used on 2^P-1 primes? Today's flops count for the entire GIMPS project: 41.761 Tflops. Just 1 of these video cards would increase GIMPS output by 1.3% (based on purely theoretical numbers). Even if the first iteration of code ran the software at 1/4 of theoretical, 3 of these cards would be 1% of GIMPS output. Anybody want to look at coding an LL test in OpenCL? -- Craig |
Just to add some information about CUDA with Ellipitc curves:
[URL]http://eecm.cr.yp.to/mpfq.html[/URL] [URL]http://cr.yp.to/talks/2009.09.12/slides.pdf[/URL] I think this 2 links are very helpful. If someone can compile this to use with win32, please, post the links. Thank you in advance. |
[QUOTE=nucleon;190970]ATI 5000 series video card - 2.72SP Gflops/544 DP Gflops[/QUOTE]How come SP is 200 times less than DP? I think the figures are wrong somewhere.
|
It is, of course, 2.72 SP Tflops peak.
Only a gigabyte of memory, so no use for GNFS linear algebra, though there does seem to be an addressing mode for the per-SIMD local store which is designed for doing radix-4 SP FFTs. Some reviews suggest that there are now four 24x24 integer multipliers rather than one 32x32; not sure how good a trade-off that is - depends on how many extra bits you wanted to carry around between carry propagation passes. Bernstein, Lange &c's work on mulmod on GPGPU is all on very small numbers (say 210 bits) which fit in the GPU register sets; I'm not sure how painful things get if the numbers stop fitting there. On the other hand, there's little point in my buying a 5850 to write GPGPU code on until the R800 Instruction Set Architecture document comes out, and AMD pull their finger out and release the OpenCL-running-on-GPU implementation. |
I think all of the complexity of porting algorithms to GPUs boils down to making the important part of the working set fit into caches that are much too small.
If your applications do not require double precision floating point, getting started with graphics card programming is ridiculously cheap. Nvidia's development libraries are a free download, and a 1-year old GPU with 1GB of memory costs ~$100 and potentially can crunch 5-10x faster than the machine it's plugged into. |
Yep, what 5mack said....Tflops.
Oops. -- Craig |
You might be interested in this:
[url]http://www.mpi-inf.mpg.de/~emeliyan/poly_mul.pdf[/url] Not sure how this compares to floating point implementations, though. |
Adding information about GPU Double Precision floating point performance:
Tesla C1060: 78 GFLOPS - $1,299.99 (TigerDirect). Tesla S1070: 311 to 345 GFlops (the S1070 uses 4x C1060) - price: no idea. AMD FireStream 9270: 240 GFLOPS - $1,204.99 (Newegg). Thank you for your time. |
The
[B]Radeon HD 5870[/B] has double precision and is fast as hell. Now it should be possible to optimize Prime95 on it ;) ... |
[QUOTE=zarabatana;191258]
Tesla S1070: 311 to 345 GFlops (S1070 use 4 C1060) $ no ideia [/QUOTE] [url="http://www.google.com/products?hl=en&source=hp&q=nvidia%20tesla%20s1070&um=1&ie=UTF-8&sa=N&tab=wf"]Not for casual use[/url] |
[QUOTE=joblack;191517]The
[B]Radeon HD 5870[/B] has double precision and is fast as hell. Now it should be possible to optimize Prime95 on it ;) ...[/QUOTE] Only when AMD get round to releasing the specifications for its assembly language, or at the very least a compiler capable of compiling for it. At the moment you have to code in VLIW assembly language to get any kind of double-precision performance out of the RV770. |
Rumor says the upcoming GT300 cards will have stronger DP support:
[url]http://www.brightsideofnews.com/news/2009/9/30/nvidia-gt300s-fermi-architecture-unveiled-512-cores2c-up-to-6gb-gddr5.aspx[/url] At 2 Ghz, this means 1 TFlop/s DP (256 * 2 * (2*10^9)). In other news, NVIDIA seems to have released its OpenCL SDK to the public: [url]http://developer.nvidia.com/object/get-opencl.html[/url] |
Official GT300 blurb:
[url]http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiArchitectureWhitepaper.pdf[/url] [quote=Robert Holmes;191533]Rumor says the upcoming GT300 cards will have stronger DP support: [URL]http://www.brightsideofnews.com/news/2009/9/30/nvidia-gt300s-fermi-architecture-unveiled-512-cores2c-up-to-6gb-gddr5.aspx[/URL] At 2 Ghz, this means 1 TFlop/s DP (256 * 2 * (2*10^9)). In other news, NVIDIA seems to have released its OpenCL SDK to the public: [URL]http://developer.nvidia.com/object/get-opencl.html[/URL][/quote] |
I think I want one of those.
|
Some more info about GT300 here: [url]http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1[/url]
|
[QUOTE=jasonp;133411]
[b]Q: Can I use my graphics card to increase my contribution to projects like GIMPS, LLR, Riesel, etc? Q: My Playstation 3 has wondrous crunch power. Can I use it to increase my contribution to projects like GIMPS, LLR, Riesel, etc?[/b] A: No. [/QUOTE] This has very recently changed to [URL="http://www.mersenneforum.org/showthread.php?t=12576"]yes[/URL], at least for GIMPS on nVidia cards. Thanks to the inclusion of double-precision FFTs in the CUDA 2.3 cufft library (an FFTW-like library for CUDA), msft has successfully ported [URL="http://www.garlic.com/~wedgingt/mersenne.html"]MacLucasFFTW[/URL] to CUDA. It works with nVidia GTX 260 and higher cards; earlier cards do not have the required double precision units. While not blazingly fast, the performance is on par with Prime95. The program is distributed as source only and requires manual reservation of exponents. |
Part 2 of the FAQ is now posted [url="http://mersenneforum.org/showthread.php?t=12720"]here[/url]; please direct followups to that thread instead.
(The original FAQ exceeded the 10k post size limit, which was imposed after it was written, and editing it directly is not allowed unless it is split into multiple parts) |