#1 |
May 2004
FRANCE
1124₈ Posts
Hi All,
Yesterday I uploaded the binaries and source of the new version 3.8.1 of LLR. For now it is a development version; the zip files are in my development directory: http://jpenne.free.fr/Development/

I will release it as the "official" version as soon as it seems stable enough; I hope that will be very soon... I made the source available because I cannot build the MacIntel binary myself. I would be grateful if someone would send me the result of the build...

- This version uses the most recent version 25.14 of George Woltman's gwnum library.
- It now allows multiple data formats in the input file, a feature requested by the LLRNET developers.
- PRP testing of (generalized) repunits has been implemented.
- Error checking and recovery have been made more rigorous (I hope...).

Happy prime number hunting, and best regards,
Jean
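A generalized repunit is R(b, n) = (bⁿ − 1)/(b − 1), the number written as n ones in base b. LLR's actual PRP test runs on gwnum's FFT arithmetic, but the underlying Fermat probable-prime check can be sketched in a few lines of Python (the function names here are just for illustration):

```python
def repunit(b, n):
    """Generalized repunit: n ones in base b, i.e. (b**n - 1) // (b - 1)."""
    return (b**n - 1) // (b - 1)

def is_prp(n, base=3):
    """Fermat probable-prime test: n passes if base**(n-1) == 1 (mod n).
    Some composites pass (pseudoprimes), so this is only a PRP test,
    not a primality proof."""
    return n > 1 and pow(base, n - 1, n) == 1

# R(10, 19) = 1111111111111111111 is a known repunit prime.
print(is_prp(repunit(10, 19)))   # True
# R(10, 3) = 111 = 3 * 37 is composite and fails the test.
print(is_prp(repunit(10, 3)))    # False
```

The point of doing this inside LLR rather than with plain `pow` is that gwnum's transforms make the modular squarings vastly faster for numbers of hundreds of thousands of digits.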
#2 |
Mar 2006
Germany
3·23·43 Posts
#3 |
Jul 2003
So Cal
2442₁₀ Posts
Jean,
Would it be difficult to produce (if you haven't done so already) a version of LLR based on FFTW? I ask not because anyone would want to use an FFTW version (too slow), but because the CUDA cufft library is based on the FFTW model.

Starting from MacLucasFFTW, converting the FFTW calls to cufft, writing a normalization routine in CUDA to avoid having to transfer data off and back onto the GPU on every iteration, plus a few clever tricks, msft has produced a fast LL testing program. It's limited to power-of-2 FFTs, but it completes an LL iteration using 2048K FFTs in 10.6 ms on a GTX 260 and 5.5 ms on a GTX 480, and should be about 4x faster on the Tesla C20X0s. Similar performance should be possible for LLR.
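The normalization step mentioned above is the pass that turns raw inverse-FFT output back into proper digits. A minimal pure-Python sketch of squaring an integer via an FFT, with an explicit carry-normalization loop (base and sizes chosen for illustration only; a real LL/LLR implementation also applies the irrational-base weighting, which is omitted here):

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def square_via_fft(x, base=256):
    """Square a non-negative integer: forward FFT, pointwise square,
    inverse FFT, round, then carry-normalize the digits."""
    digits = []
    while x:
        digits.append(x % base)
        x //= base
    digits = digits or [0]
    n = 1
    while n < 2 * len(digits):      # room for the double-length product
        n *= 2
    fa = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    sq = fft([v * v for v in fa], invert=True)
    vals = [round(v.real / n) for v in sq]   # inverse FFT needs the 1/n scale
    carry = 0
    for i in range(n):              # normalization: restore 0 <= digit < base
        carry += vals[i]
        vals[i] = carry % base
        carry //= base
    return sum(d * base**i for i, d in enumerate(vals))

print(square_via_fft(123456789) == 123456789**2)   # True
```

Doing that carry loop on the GPU, as described above, is what lets the data stay resident on the card between iterations instead of round-tripping through host memory.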
#4 | |
"Mark"
Apr 2003
Between here and the
1A1A₁₆ Posts
Quote:
#5 | |
May 2004
FRANCE
596₁₀ Posts
Quote:
Are you a seer? I am developing a portable version of LLR based on FFTW! I have not released it yet because:

- It works only on power-of-two FFTs.
- I have implemented only the full IBDWT method (not yet the zero-padded one), so accepted k values are at most 20 bits large...
- It is about three times slower than the gwnum-based LLR.
- It sometimes yields unexplained false negative results (even though there are no excessive round-off errors...).

I did not know anything about the CUDA cufft library, so I am very interested! Is this code faster than FFTW? Is it available for non-x86 machines?

Thank you for your interesting message, and best regards,
Jean
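For reference, the iteration that any FFT backend ultimately has to accelerate is the same u → u² − 2 squaring loop as the Lucas-Lehmer test; for k·2ⁿ − 1 with k > 1, LLR changes only the starting value (derived from Lucas sequences rather than fixed at 4). A sketch of the k = 1 (Mersenne) case:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M_p = 2**p - 1, with p an odd prime.
    M_p is prime iff s_(p-2) == 0, where s_0 = 4 and s_(i+1) = s_i**2 - 2.
    For k*2**n - 1 with k > 1, LLR runs the same squaring loop but with a
    starting value chosen via Lucas sequences instead of 4."""
    m = 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # in LLR this squaring is done via FFT arithmetic
    return s == 0

print([q for q in (3, 5, 7, 11, 13) if lucas_lehmer(q)])   # [3, 5, 7, 13]
```

With Python's plain big-integer squaring each iteration costs far more than an FFT-based multiply at large sizes, which is exactly why the choice of FFT library (gwnum, FFTW, cufft) dominates LLR's performance.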
#6 |
May 2004
FRANCE
2²×149 Posts
#7 | |
Banned
"Luigi"
Aug 2002
Team Italia
29×167 Posts
Quote:
NVIDIA CUFFT Library

This document describes CUFFT, the NVIDIA® CUDA™ (compute unified device architecture) Fast Fourier Transform (FFT) library. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and it is one of the most important and widely used numerical algorithms, with applications that include computational physics and general signal processing. The CUFFT library provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation.

FFT libraries typically vary in terms of supported transform sizes and data types. For example, some libraries only implement Radix-2 FFTs, restricting the transform size to a power of two, while other implementations support arbitrary transform sizes. This version of the CUFFT library supports the following features:

- 1D, 2D, and 3D transforms of complex and real-valued data.
- Batch execution for doing multiple 1D transforms in parallel.
- 2D and 3D transform sizes in the range [2, 16384] in any dimension.
- 1D transform sizes up to 8 million elements.
- In-place and out-of-place transforms for real and complex data.

The CUFFT API is modeled after FFTW (see http://www.fftw.org), which is one of the most popular and efficient CPU-based FFT libraries. FFTW provides a simple configuration mechanism called a plan that completely specifies the optimal—that is, the minimum floating-point operation (flop)—plan of execution for a particular FFT size and data type. The advantage of this approach is that once the user creates a plan, the library stores whatever state is needed to execute the plan multiple times without recalculation of the configuration.

The FFTW model works well for CUFFT because different kinds of FFTs require different thread configurations and GPU resources, and plans are a simple way to store and reuse configurations. The CUFFT library implements several FFT algorithms, each having different performance and accuracy. The best performance paths correspond to transform sizes that meet two criteria:

1. Fit in CUDA's shared memory
2. Are powers of a single factor (for example, powers of two)

These transforms are also the most accurate due to the numeric stability of the chosen FFT algorithm. For transform sizes that meet the first criterion but not the second, CUFFT uses a more general mixed-radix FFT algorithm that is usually slower and less numerically accurate. Therefore, if possible it is best to use sizes that are powers of two or four, or powers of other small primes (such as three, five, or seven). In addition, the power-of-two FFT algorithm in CUFFT makes maximum use of shared memory by blocking sub-transforms for signals that do not meet the first criterion.

http://developer.download.nvidia.com...ibrary_1.1.pdf

HTH

Luigi

Last fiddled with by ET_ on 2010-05-12 at 14:33
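The plan mechanism described in the quote is essentially precomputation caching: configure once, execute many times. A toy Python illustration of that idea (this is not the cufft API itself, which is a C library; the class and method names here are invented for the sketch):

```python
import cmath

class FFTPlan:
    """Caches the twiddle factors for one transform size, mimicking the
    create-a-plan-once, execute-it-many-times model of FFTW/CUFFT."""

    def __init__(self, n, forward=True):
        sign = -1 if forward else 1
        self.n = n
        # State computed once at plan creation, reused on every execute().
        self.tw = [cmath.exp(sign * 2j * cmath.pi * k / n)
                   for k in range(n // 2)]

    def execute(self, data):
        assert len(data) == self.n, "plan is specialized to one size"
        return self._fft([complex(v) for v in data])

    def _fft(self, a):
        m = len(a)
        if m == 1:
            return a
        even = self._fft(a[0::2])
        odd = self._fft(a[1::2])
        step = self.n // m          # index into the cached top-level twiddles
        out = [0j] * m
        for k in range(m // 2):
            w = self.tw[k * step]
            out[k] = even[k] + w * odd[k]
            out[k + m // 2] = even[k] - w * odd[k]
        return out

# Plan once, then run many same-size transforms without reconfiguring.
plan = FFTPlan(4)
print(plan.execute([1, 2, 3, 4]))
```

In cufft/FFTW the cached state is far richer (kernel selection, thread layout, scratch buffers), but the contract is the same: the cost of configuration is paid once per size, not once per transform.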
#8 | |
Jul 2003
So Cal
2×3×11×37 Posts
Quote:
For example, on a recent test, one Prime95 thread required about 33 ms/iteration for a 1028K FFT. The CUDA program, limited to power-of-2 FFTs, used a 2048K FFT for the same number (nearly twice as large!) but is still 3-6 times faster on current cards (10.6 ms/iter on a GTX 260 and 5.5 ms/iter on a GTX 480), and should be about 20 times faster on the Tesla C20X0s when they are released in a few months.

To test the hardware on the new GTX 480 card we just got, I'm running a double check of M42643801 using a 4096K FFT, and it will take just over 5 days. A Tesla C20X0 should do that in less than 36 hours.

Last fiddled with by frmky on 2010-05-12 at 19:14
#9 | |
May 2004
FRANCE
2²×149 Posts
Quote:
Thank you ET and frmky for this information! I realize I was completely inexperienced with GPU computing! It seems very promising for the future of fast primality testing, but unfortunately I own neither the hardware nor the software to develop a CUDA program, so how could I do that?

Regards,
Jean
#10 | |
Banned
"Luigi"
Aug 2002
Team Italia
29×167 Posts
Quote:
There is also an "emulator" option to test software while not yet owning a GPU card. As for the hardware... I guess that when the new Tesla cards are released, the older GTX 260 and 275 will get cheaper.

Luigi

Last fiddled with by ET_ on 2010-05-12 at 21:31
#11 | |
"Oliver"
Mar 2005
Germany
2³·139 Posts
Hi Jean,
Quote:
It is single-threaded (so it won't reveal race conditions) and runs very slowly (at least for my code).

Oliver

Last fiddled with by TheJudger on 2010-05-12 at 22:58
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| LLR Version 3.8.5 is available! | Jean Penné | Software | 11 | 2011-02-20 18:22 |
| LLR Version 3.8.0 is now available! | Jean Penné | Software | 22 | 2010-04-28 07:45 |
| Which version for P-III's? | richs | Software | 41 | 2009-01-07 14:40 |
| LLR - new version | Cruelty | Riesel Prime Search | 8 | 2006-05-16 15:00 |
| Which LLR version to use... | Cruelty | Riesel Prime Search | 1 | 2005-11-10 15:17 |