In preparation for assaulting the higher N numbers, I made some changes to the calculation of b^N at the beginning of the check() function. Here's the new beginning of that function, along with a few new helper functions it uses:
This is the OLD code:
[code]
void check(const Int32 b, const UInt32 m, char *expectedResidue)
{
    ~~snip~~
    if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart)) {
        fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart);   //1.00
        fflush(STDOUT);
    }
    else {
        Nlen = 1;
        Na = (UInt32 *)myAlloc(1 * sizeof(UInt32));
        Na[0] = (UInt32)b;
        for (j = m; j != 1; j /= 2) {
            a = (UInt32 *)myAlloc(2 * Nlen * sizeof(UInt32));
            for (i = 0; i != Nlen; ++i)
                a[i] = 0;
            for (i = 0; i != Nlen; ++i) {
                BOINC_GPU_CHECK_FOR_ABORT(true);   // should be a good spot for an abort check; runs sqrt of total times
                a[i + Nlen] = mul_1_add_n(&a[i], Na[i], Na, Nlen);
            }
            myFree(Na);
            Na = a;
            Nlen *= 2;
            if (Na[Nlen - 1] == 0)
                --Nlen;
        }
        iStart = iOrigStart = lg(Na, Nlen) - 1;
        fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart);
        fflush(STDOUT);
    }
[/code]
This is the new code and helper functions:
[code]
// transfer input int array into complex array for fft
__global__ void fftIn_kernel(cufftDoubleComplex * const g_fftIn, const UInt32 * const g_intArray, const UInt32 len)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    g_fftIn[threadID*2].y = g_fftIn[threadID*2+1].y = 0.0;
    if (threadID < len) {
        g_fftIn[threadID*2].x   = (double) (g_intArray[threadID] & 0xFFFF);
        g_fftIn[threadID*2+1].x = (double) (g_intArray[threadID] >> 16);
    }
    else
        g_fftIn[threadID*2].x = g_fftIn[threadID*2+1].x = (double) 0.0;
}

// Compute the size of the FFT array for compute_square
// round up to the nearest power of 2, minimum 128
UInt32 fftSize(const UInt32 len)
{
    UInt32 size = lpt(len);
    return size >= 128 ? size : 128;
}

// In order to compute the value of b^N, we need to repeatedly square b.
// Use cuFFT libraries to do the square
// The resulting value, the address of which is returned by this function,
// will be twice as long as the input (i.e., 2*Nlen)
// This function allocates memory for the result -- the calling routine has
// responsibility to eventually free that memory.
//
// Try reducing bits per element to 16 -- maybe we have a loss of precision
// This means doubling the size of the FFT
// Yes, this was the problem -- so do FFTs using 16 bit unsigned ints in doubles rather than 32 bits.
//
UInt32 * computeSquare(const UInt32 * Na, const UInt32 Nlen)
{
    cufftDoubleComplex * g_fftIn;      // FFT input array
    cufftDoubleComplex * g_fftOut;     // FFT output array
    cufftDoubleComplex * fftOut;       // final output (easier to do this on CPU because of carry propagation)
    UInt32 * g_intArray;               // input/output integer array on the GPU
    UInt32 * a;                        // imaginatively named pointer to the result
    UInt32 fftLen = 2*fftSize(2*Nlen); // Size of the FFT -- Nlen*2 rounded up to the next power of 2
    cufftHandle plan;                  // the FFT plan
    UInt32 i, carry;                   // index counter and carry
    UInt64 l;                          // temporary 64 bit unsigned int for preserving carry

    // allocate GPU memory for the FFT in and out (complex double, 2*Nlen, rounded up to a good FFT size)
    cutilSafeCall(cudaMalloc((void**)&g_fftIn,  sizeof(*g_fftIn)*fftLen));
    cutilSafeCall(cudaMalloc((void**)&g_fftOut, sizeof(*g_fftOut)*fftLen));

    // allocate GPU memory to load the integer array (UInt32, same count as the fft array)
    cutilSafeCall(cudaMalloc((void**)&g_intArray, sizeof(*g_intArray)*fftLen/2));

    // Copy Na into the device memory
    cutilSafeCall(cudaMemcpy(g_intArray, Na, sizeof(*g_intArray)*Nlen, cudaMemcpyHostToDevice));

    // invoke kernel to build FFT input
    fftIn_kernel<<< (fftLen/2)/128, 128 >>> (g_fftIn, g_intArray, Nlen);

    // Create the FFT plan
    cufftSafeCall(cufftPlan1d(&plan, fftLen, CUFFT_Z2Z, 1));

    // Do the forward FFT
    cufftSafeCall(cufftExecZ2Z(plan, g_fftIn, g_fftOut, CUFFT_FORWARD));

    // invoke kernel to square the resulting complex matrix
    mul_kernel<<< fftLen/128, 128 >>> (g_fftOut);

    // Do the reverse FFT (from 'out' back to 'in')
    cufftSafeCall(cufftExecZ2Z(plan, g_fftOut, g_fftIn, CUFFT_INVERSE));

    // allocate memory for the result (UInt32, 2*Nlen)
    a = (UInt32 *)myAlloc(sizeof(*a)*2*Nlen);

    // allocate memory for output and copy the result from GPU
    fftOut = (cufftDoubleComplex *)myAlloc(sizeof(*fftOut)*fftLen);
    cutilSafeCall(cudaMemcpy(fftOut, g_fftIn, sizeof(*fftOut)*fftLen, cudaMemcpyDeviceToHost));

    // Now extract the result from the array of complex variables
    // Slower than doing this on the GPU, but much easier on the host because of carry propagation.
    // It's only run log n times, so it should be insignificant
    for (i = 0, carry = 0; i < 2*Nlen; i++) {
        l = (UInt64) carry
          + (UInt64) round(fftOut[i*2].x / (double) fftLen)
          + (((UInt64) round(fftOut[i*2+1].x / (double) fftLen)) << 16);
        a[i] = (UInt32) l;            // lower 32 bits
        carry = (UInt32) (l >> 32);
    }

    // free the 3 device memory arrays
    cutilSafeCall(cudaFree((char *)g_fftIn));
    cutilSafeCall(cudaFree((char *)g_fftOut));
    cutilSafeCall(cudaFree((char *)g_intArray));
    myFree((void*)fftOut);

    // Destroy the FFT plan
    cufftSafeCall(cufftDestroy(plan));

    return a;
}

void check(const Int32 b, const UInt32 m, char *expectedResidue)
{
    ~~snip~~
    if (read_checkpoint(b, m, z, &startTime, &iStart, &Nlen, &iOrigStart)) {
        fprintf(STDOUT, "\rResuming %d^%u+1 from a checkpoint (%d iterations left)\n", b, m, iStart);   //1.00
        fflush(STDOUT);
    }
    else {
        Nlen = 1;
        Na = (UInt32 *)myAlloc(1 * sizeof(UInt32));
        Na[0] = (UInt32)b;
        for (j = m; j != 1; j /= 2) {
            a = computeSquare(Na, Nlen);   // computeSquare will allocate "a"
            myFree(Na);
            Na = a;
            Nlen *= 2;
            if (Na[Nlen - 1] == 0)
                --Nlen;
        }
        iStart = iOrigStart = lg(Na, Nlen) - 1;
        fprintf(STDOUT, "%sTesting %d^%u+1... %d steps to go ", ISBOINC ? "\n" : "\r", b, m, iStart);
        fflush(STDOUT);
    }
[/code]
Also, the mul_1_add_n() function is no longer used and can be safely deleted.
For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second. Full source and Windows executables will be available shortly. |
Source and executable can be downloaded here: [url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03beta.7z]GeneferCUDA-boinc.1.03beta.7z[/url]
I expect this to be released into production on PrimeGrid's Boinc server, but I want to do more testing first. This source is 100% compatible with Shoichiro's code up through Shoichiro's v1.049. Modifications from 1.050 onward have NOT been incorporated into my source. It can operate under Boinc, under PRPNet, or standalone. It's currently built for Windows, but there's only a handful of places where the code is incompatible with Linux. (Setting Windows thread priority and variable arguments in a logging function are the two things I remember.) |
[QUOTE=AG5BPilot;285590]
Also, the mul_1_add_n() function is no longer used and can be safely deleted. For 1248^4194304+1, this change reduces the initialization time, where it's calculating the value of b^N, from 2 hours down to a small fraction of one second.[/QUOTE] Now that the Na computation is fast, there is probably no need for it to be written to/read from the checkpoint. Doesn't the computeSquare call need BOINC_START_GPU/END_GPU? |
[QUOTE=axn;285687]Now that Na computation is fast, there is probably no need for it to be written to/read from checkpoint.[/quote]
A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100. :( The extra information in the checkpoint file doesn't make that much of a difference. I'm doing that calculation using each element as a 16 bit unsigned integer, rather than 32 bits, because if you square two 32 bit ints, your result is 64 bits long -- which will NOT fit in a 64 bit float, since the float uses about 10 bits for the two signs and the exponent. However, this made the FFT have twice as many elements, which is itself causing problems. [quote]Doesn't the computesquare call need BOINC_START_GPU/END_GPU?[/QUOTE] It might, if those calls actually had any code in them anymore. Those calls, which are supposed to be boinc critical sections, would actually prevent the task from ever shutting down. Boinc's critical section code appears to be horribly broken. It's only useful when you're protecting a tiny percentage of your code within the critical section. In our case, where only a tiny percentage ISN'T protected (since we spend almost all our time in the GPU), the BOINC client -- which effectively communicates via polling, not interrupts (at least as far as the critical section code is concerned) -- never gets the signal through unless it's unbelievably lucky. BTW, I'm going to pull that build off the website until I find a fix for the problem. At N=4194304, it would produce incorrect results without any indication that the result is wrong. If anyone wants that source, let me know. |
[QUOTE=AG5BPilot;285706]A bit premature, as it turns out -- further testing has shown there's a problem at N=4194304, which is a bit of a nuisance since that's our real target. At 4M it's exceeding maxErr doing the last FFT. The b limit on that calculation is effectively less than 100.[/QUOTE]
You can use a Karatsuba-style multiplication using 3 half-sized FFT multiplies for the final iteration. [url]http://en.wikipedia.org/wiki/Karatsuba_algorithm#The_basic_step[/url] [Note that, while the wiki example is X & Y, when you need to square, all the underlying mults also become squares.] Also, you can gain a little bit more by dividing out all the twos from the 'b'. Eg:- 1248^4194304 = 39^4194304 * (2^5)^4194304 (1248 = 39*2^5). So you only need to do the FFT to calculate 39^4194304, and multiplication by (2^5)^4194304 is just shifting. Definitely getting more complex, isn't it? :smile: Well, good luck. |
Hi,
I retried "Reduce CPU Time".
[code]
    if (n1 > 16384)
        ithreads = 128;
    else
        ithreads = 16;
    for (i = iStart; i != 0; --i) {
        FFTsquareFFT(z, n1);
        bt = Na[i / (8 * sizeof(UInt32))] >> (i % (8 * sizeof(UInt32))) & 1;
        FFTnextStepGFN(b, invBase, (bt == 0) ? t1 : t2, n1, t3, ithreads, SHIFT);
        if (i % 0xF == 0)
            cutilSafeCall(cudaMemcpy(l_maxErr, g_maxErr, sizeof(float), cudaMemcpyDeviceToHost));
        if (!(i & 0xffff)) {
            printf("\rTesting %d^%u+1... %d steps to go ", b, m, i);
            fflush(stdout);
            cutilSafeCall(cudaMemcpy(l_maxErr, g_maxErr, sizeof(float), cudaMemcpyDeviceToHost));   //1.00
[/code]
Ver 1.061:
[code]
24518^262144+1 is a probable prime. (1150678 digits) (err = 0.0002) (time = 1:27:48) 08:02:55
2347.29user 2912.59system 1:27:47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
336inputs+238616outputs (0major+13265minor)pagefaults 0swaps
[/code]
This code version:
[code]
24518^262144+1 is a probable prime. (1150678 digits) (err = 0.0002) (time = 1:29:02) 06:35:07
215.20user 25.99system 1:29:02elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+238528outputs (0major+11867minor)pagefaults 0swaps
[/code]
1% slower, but a 95% reduction in CPU time. |
A 1% increase in time in exchange for an almost-free CPU? Nice.
|
[QUOTE=axn;285707]
Definitely getting more complex, isn't it? :smile: Well, good luck.[/QUOTE] It's actually a heck of a lot simpler than I imagined. First time I looked at this my first thought was "It just can not be that simple. I have to be missing something." I was very surprised when the right numbers started coming out. It's very much like making a smoothie in a blender, reversing the power plug, and having whole fruit emerge. :) Thanks for the advice, and the luck! |
About 4 different ideas later, I think I've fixed 1.03. I'm going to test it more thoroughly and THEN post the code.
|
[url=http://www.asgoodasitgoetz.com/distribution/geneferCUDA-boinc.1.03a.7z]geneferCUDA-boinc.1.03a.7z[/url]
This has working code for the fast initialization. In the end, I sacrificed speed for precision. It's a whopping one and a half seconds slower than 1.03, but it has the slight advantage of actually working. I reduced the FFT element size to 8 bits, down from 16, which gave me tons of room to avoid rounding errors, but did push the FFT size above 8 million. I guess it's a good thing GPUs are pretty fast! |
Linux needs the #127 patch for the 100% CPU time issue.
Windows does not need the #127 patch. |