2010-05-12, 06:39  #1 
May 2004
FRANCE
597_{10} Posts 
LLR Version 3.8.1 is now available!
Hi All,
I uploaded yesterday the binaries and source of the new version 3.8.1 of LLR. For now, it is a development version, and the zip files are in my development directory: http://jpenne.free.fr/Development/ I will release it as the "official" version as soon as it seems to be stable enough; I hope that will be very soon... I made the source available because I cannot build the MacIntel binary by myself. I would be grateful if someone would send me the result of the build...
This version uses the most recent version 25.14 of George Woltman's gwnum library.
- It now allows multiple data formats in the input file, a feature requested by the LLRNET developers.
- The PRP testing of (generalized) repunits has been implemented.
- The error checking and recovery have been made more rigorous (I hope...).
Happy prime number hunting, and best regards,
Jean 
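[The PRP testing of generalized repunits that Jean mentions can be illustrated with a toy sketch. This is not LLR's gwnum-based code, just a minimal pure-Python Fermat probable-prime test, with base 3 chosen arbitrarily:]

```python
def gen_repunit(b, n):
    # Generalized repunit (b^n - 1) / (b - 1): n digits of 1 in base b.
    return (b**n - 1) // (b - 1)

def fermat_prp(N, a=3):
    # Fermat probable-prime test: N is an a-PRP if a^(N-1) == 1 (mod N).
    # LLR uses FFT-based modular arithmetic for speed; Python's built-in
    # pow() is enough to show the idea.
    return pow(a, N - 1, N) == 1

# The base-10 repunit R_19 = 1111111111111111111 is a known repunit prime;
# R_4 = 1111 = 11 * 101 is composite.
print(fermat_prp(gen_repunit(10, 19)), fermat_prp(gen_repunit(10, 4)))  # True False
```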
2010-05-12, 07:21  #2 
Mar 2006
Germany
101110100001_{2} Posts 

2010-05-12, 08:03  #3 
Jul 2003
So Cal
3^{2}×5^{2}×11 Posts 
Jean,
Would it be difficult to produce (if you haven't done so already) a version of LLR based on FFTW? I ask not because anyone would want to use an FFTW version (too slow) but because the CUDA cufft libraries are based on the FFTW model. Starting from MacLucasFFTW, converting the FFTW calls to cufft, writing a normalization routine in CUDA to avoid having to transfer data off and back onto the GPU for every iteration, plus a few clever tricks, msft has produced a fast LL testing program. It's limited to power-of-2 FFTs, but it completes an LL iteration using 2048K FFTs in 10.6 ms on a GTX 260 and 5.5 ms on a GTX 480, and should be about 4x faster on the Tesla C20X0's. Similar performance should be possible for LLR. 
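[For reference, the LL iteration that msft's program accelerates is just a repeated modular squaring. A minimal pure-Python sketch, using native bignums where the GPU code uses cufft-based multiplication:]

```python
def lucas_lehmer(p):
    # Lucas-Lehmer test for the Mersenne number M_p = 2^p - 1 (p an odd prime).
    # Each loop step is one "LL iteration": a squaring mod M_p, which the
    # GPU program performs with a large FFT (e.g. 2048K for current exponents).
    M = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % M
    return s == 0

# M_3, M_5, M_7, M_13 are prime; M_11 = 2047 = 23 * 89 is not.
print([p for p in (3, 5, 7, 11, 13) if lucas_lehmer(p)])  # [3, 5, 7, 13]
```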
2010-05-12, 12:22  #4  
"Mark"
Apr 2003
Between here and the
2^{4}×421 Posts 
Quote:


2010-05-12, 13:57  #5  
May 2004
FRANCE
3×199 Posts 
FFTW based LLR
Quote:
Are you a seer? I am developing a portable version of LLR based on FFTW! I have not released it yet because:
- It works only on power-of-two FFTs.
- I implemented only the full IBDWT method (not yet the zero-padded one), so accepted k values are at most 20 bits large...
- It is ~three times slower than gwnum-based LLR.
- It sometimes yields unexplained false negative results (even though there are no excessive roundoff errors...).
I did not know anything about the CUDA cufft libraries, so I am very interested! Is this code faster than FFTW? Is it available for non-x86 machines? Thank you for your interesting message, and best regards, Jean 
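[One cheap way to arbitrate false negatives of the kind Jean describes (a suggestion, not anything from LLR's actual code) is an independent Fermat PRP test on the same k·2^n-1 candidate, using plain integer arithmetic that cannot suffer FFT roundoff:]

```python
def riesel_prp(k, n, a=3):
    # Fermat PRP test, base a, for the Riesel-form number N = k*2^n - 1.
    # Far slower than an FFT-based LLR test, but independent of any FFT
    # roundoff, so it can spot-check a suspected false negative.
    N = k * (1 << n) - 1
    return pow(a, N - 1, N) == 1

# 7*2^5 - 1 = 223 is prime; 9*2^5 - 1 = 287 = 7 * 41 is not.
print(riesel_prp(7, 5), riesel_prp(9, 5))  # True False
```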

2010-05-12, 14:00  #6 
May 2004
FRANCE
3·199 Posts 

2010-05-12, 14:27  #7  
Banned
"Luigi"
Aug 2002
Team Italia
12EB_{16} Posts 
Quote:
NVIDIA CUFFT Library
This document describes CUFFT, the NVIDIA® CUDA™ (compute unified device architecture) Fast Fourier Transform (FFT) library. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and it is one of the most important and widely used numerical algorithms, with applications that include computational physics and general signal processing. The CUFFT library provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation.
FFT libraries typically vary in terms of supported transform sizes and data types. For example, some libraries only implement Radix-2 FFTs, restricting the transform size to a power of two, while other implementations support arbitrary transform sizes. This version of the CUFFT library supports the following features:
- 1D, 2D, and 3D transforms of complex and real-valued data.
- Batch execution for doing multiple 1D transforms in parallel.
- 2D and 3D transform sizes in the range [2, 16384] in any dimension.
- 1D transform sizes up to 8 million elements.
- In-place and out-of-place transforms for real and complex data.
The CUFFT API is modeled after FFTW (see http://www.fftw.org), which is one of the most popular and efficient CPU-based FFT libraries. FFTW provides a simple configuration mechanism called a plan that completely specifies the optimal (that is, the minimum floating-point operation, or flop) plan of execution for a particular FFT size and data type. The advantage of this approach is that once the user creates a plan, the library stores whatever state is needed to execute the plan multiple times without recalculation of the configuration. 
The FFTW model works well for CUFFT because different kinds of FFTs require different thread configurations and GPU resources, and plans are a simple way to store and reuse configurations. The CUFFT library implements several FFT algorithms, each having different performance and accuracy. The best performance paths correspond to transform sizes that meet two criteria:
1. Fit in CUDA's shared memory
2. Are powers of a single factor (for example, powers of two)
These transforms are also the most accurate due to the numeric stability of the chosen FFT algorithm. For transform sizes that meet the first criterion but not the second, CUFFT uses a more general mixed-radix FFT algorithm that is usually slower and less numerically accurate. Therefore, if possible it is best to use sizes that are powers of two or four, or powers of other small primes (such as three, five, or seven). In addition, the power-of-two FFT algorithm in CUFFT makes maximum use of shared memory by blocking sub-transforms for signals that do not meet the first criterion.
http://developer.download.nvidia.com...ibrary_1.1.pdf
HTH
Luigi
Last fiddled with by ET_ on 2010-05-12 at 14:33 
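[As a toy illustration of what all of these libraries (gwnum, FFTW, CUFFT) are used for in primality testing: multiplying huge integers by treating their digit vectors as signals, transforming, squaring pointwise, and rounding the inverse transform back to integers. A slow, stdlib-only pure-Python sketch (power-of-two lengths only, no IBDWT); the rounding at the end is exactly where the "excessive roundoff" checks mentioned in this thread come in:]

```python
import cmath

def fft(a, sign=-1.0):
    # Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    # sign=-1.0: forward transform; sign=+1.0: inverse (without the 1/n factor).
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], sign)
    odd = fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft_square(digits, base=10000):
    # Square the integer whose little-endian base-`base` digits are given,
    # via FFT convolution. The int(round(...)) is the roundoff-critical step:
    # if the floating-point error there ever exceeds 0.5, the result is
    # silently wrong -- hence the roundoff checking in real implementations.
    n = 1
    while n < 2 * len(digits):
        n *= 2
    a = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    conv = fft([x * x for x in a], sign=1.0)  # pointwise square, then inverse
    result, carry = [], 0
    for x in conv:
        v = int(round(x.real / n)) + carry    # apply the missing 1/n, round
        result.append(v % base)
        carry = v // base
    while carry:
        result.append(carry % base)
        carry //= base
    return result

# 12345678 in little-endian base-10000 digits is [5678, 1234]
sq = fft_square([5678, 1234])
print(sum(d * 10000**i for i, d in enumerate(sq)) == 12345678**2)  # True
```

[gwnum and the CUDA program go much further than this sketch: the IBDWT folds the mod 2^p-1 reduction into the transform itself, and the data stays on the GPU between iterations.]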

2010-05-12, 19:10  #8  
Jul 2003
So Cal
3^{2}×5^{2}×11 Posts 
Quote:
For example, on a recent test, one Prime95 thread required about 33 ms/iteration for a 1028K FFT. The CUDA program, limited to power-of-2 FFTs, used a 2048K FFT for the same number (nearly twice as large!) but is still 3-6 times faster on current cards (10.6 ms/iter on a GTX 260 and 5.5 ms/iter on a GTX 480), and should be about 20 times faster on the Tesla C20X0's when they are released in a few months. To test the hardware on the new GTX 480 card we just got, I'm running a double check of M42643801 using a 4096K FFT, and it will take just over 5 days. A Tesla C20X0 should do that in less than 36 hours.
Last fiddled with by frmky on 2010-05-12 at 19:14 

2010-05-12, 19:38  #9  
May 2004
FRANCE
597_{10} Posts 
LLR on CUDA ??
Quote:
Thank you ET and frmky for this information! I realize I was totally inexperienced with GPU computing! It seems very promising for the future of fast primality testing, but unfortunately I own neither the hardware nor the software needed to develop a CUDA program, so how could I do that? Regards, Jean 

2010-05-12, 21:30  #10  
Banned
"Luigi"
Aug 2002
Team Italia
29·167 Posts 
Quote:
There is also an "emulator" option to test software while not yet owning a GPU card. As for the hardware... I guess that when the new Tesla cards are released, the older GTX 260 and 275 will get cheaper.
Luigi
Last fiddled with by ET_ on 2010-05-12 at 21:31 

2010-05-12, 22:58  #11  
"Oliver"
Mar 2005
Germany
2^{3}×139 Posts 
Hi Jean,
Quote:
It is single-threaded (so it won't reveal race conditions) and runs very slowly (at least for my code).
Oliver
Last fiddled with by TheJudger on 2010-05-12 at 22:58 

Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
LLR Version 3.8.5 is available!  Jean Penné  Software  11  2011-02-20 18:22 
LLR Version 3.8.0 is now available!  Jean Penné  Software  22  2010-04-28 07:45 
Which version for PIII's?  richs  Software  41  2009-01-07 14:40 
LLR - new version  Cruelty  Riesel Prime Search  8  2006-05-16 15:00 
Which LLR version to use...  Cruelty  Riesel Prime Search  1  2005-11-10 15:17 