|
|
#56 |
|
Jul 2003
So Cal
101001100111₂ Posts |
Here's the profiler output for k in case it's useful. I can send the CSV if you wish.
It appears that in version u, the transpose speeds up normalize_kernel, but by a bit less than the transpose itself takes, so it's a small net loss. I presume that on the GTX260, normalize_kernel was significantly slower before, so the transpose saves much more time than it costs. The C1060s are only running at a clock rate of 1.3GHz, so if your GTX260 is running at 1.44GHz (deviceQuery will tell you), then the difference between our measured times for version u is simply the difference in the cards' clock rates.
Last fiddled with by frmky on 2009-11-10 at 21:43 |
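The clock-rate argument in this post can be sketched as a one-line scaling rule. This is only an illustration: it assumes time per iteration is inversely proportional to clock rate and that nothing else differs between the cards. The function name and the 0.0300 sec/iter input are hypothetical; only the two clock figures come from the post, and the 1.44GHz value is the post's own "if".

```python
def scale_time_by_clock(t_ref, clock_ref_ghz, clock_target_ghz):
    """Predict sec/iter on the target card from a reference card's timing.

    Assumes time is inversely proportional to clock rate and that
    nothing else (memory bandwidth, core count) differs.
    """
    return t_ref * clock_ref_ghz / clock_target_ghz


C1060_CLOCK_GHZ = 1.30   # stated in the post
GTX260_CLOCK_GHZ = 1.44  # the post's "if", not a measured value

t_c1060 = 0.0300  # hypothetical sec/iter, for illustration only
t_gtx260 = scale_time_by_clock(t_c1060, C1060_CLOCK_GHZ, GTX260_CLOCK_GHZ)
print(f"predicted GTX260 time: {t_gtx260:.4f} sec/iter")
```

If the measured ratio of the two cards' times is close to 1.3/1.44, the clock difference alone would account for the gap.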
|
|
|
|
|
#57 | |
|
Jul 2009
Tokyo
2·5·61 Posts |
Quote:
With -DTESRA: 0.0157 sec/iter for the 2048K FFT and 0.0415 sec/iter for the 4096K FFT on the GTX260.
Without -DTESRA: 0.0131 sec/iter for the 2048K FFT and 0.0247 sec/iter for the 4096K FFT on the GTX260.
Thank you, |
|
|
|
|
|
|
#58 |
|
Jul 2003
So Cal
2,663 Posts |
Excellent! On the Tesla C1060, the version v 4096K FFT times are
without TESRA: 0.0275 sec/iter
with TESRA: 0.0264 sec/iter
so it does match the speed of version k. I've also confirmed that restart works correctly. The 10000 iteration residue matches the mprime "interim Wd8 residue at iteration 10002." Is this just a difference in how iterations are counted in the two programs? |
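To make the iteration-counting question concrete, here is a minimal sketch of the Lucas-Lehmer recurrence for small exponents. The function names and the `label_offset` parameter are hypothetical, and the offset of 2 is only one way such a mismatch could arise (for instance, if one program labels the starting value 4 as "iteration 2"). It is not confirmed from either program's source.

```python
def ll_residue_after(p, squarings):
    """Return s after `squarings` steps of s -> s*s - 2 (mod 2**p - 1), s0 = 4."""
    m = (1 << p) - 1
    s = 4 % m
    for _ in range(squarings):
        s = (s * s - 2) % m  # Python's % keeps the result non-negative
    return s


def is_mersenne_prime(p):
    """Lucas-Lehmer test; valid for prime p (p = 2 handled separately)."""
    if p == 2:
        return True
    return ll_residue_after(p, p - 2) == 0


def iteration_label(squarings, label_offset):
    """Hypothetical: the iteration number a program prints for a given
    count of squarings, if its counter starts at `label_offset`."""
    return squarings + label_offset


# The same residue can carry two different labels under this assumption.
print(ll_residue_after(7, 3), iteration_label(3, 0), iteration_label(3, 2))
```

Under that assumption the residues agree and only the printed iteration number differs, which is the situation the post describes.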
|
|
|
|
|
#59 |
|
Jul 2009
Tokyo
2·5·61 Posts |
|
|
|
|
|
|
#60 |
|
Jul 2003
So Cal
2,663 Posts |
OK, it's not really an issue anyway. Just have to remember that when comparing interim residues with Prime95. Actually, I should check that the final residues for composites match those of Prime95...
|
|
|
|
|
|
#61 |
|
Jul 2009
Tokyo
2·5·61 Posts |
|
|
|
|
|
|
#62 |
|
Jul 2003
So Cal
2,663 Posts |
I checked it. They do agree with the Res64 of Prime95. I've found a problem, though...
[childers test]$ ./MacLucas_v_T_0 23102129
cutilCheckMsg() CUTIL CUDA error: CUDA Kernel execution failed in file <MacLucasFFTW.cu>, line 1218 : invalid configuration argument. |
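CUDA typically reports "invalid configuration argument" when a kernel launch asks for grid or block dimensions outside what the device supports. The sketch below checks a one-dimensional launch against the limits of compute capability 1.3 devices such as the C1060 and GTX260 (512 threads per block, 65535 blocks per grid dimension). What line 1218 of MacLucasFFTW.cu actually launches is not shown in the thread, so the element count and block sizes here are hypothetical, as is the function name.

```python
MAX_THREADS_PER_BLOCK = 512  # compute capability 1.3 limit
MAX_GRID_DIM = 65535         # per grid dimension on compute capability 1.3


def launch_config_ok(n_elements, threads_per_block):
    """Check a hypothetical 1-D launch with one thread per element."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        return False
    blocks = -(-n_elements // threads_per_block)  # ceiling division
    return 1 <= blocks <= MAX_GRID_DIM


n = 2 ** 21  # hypothetical element count, not taken from the program
for tpb in (512, 16, 1024):
    print(tpb, launch_config_ok(n, tpb))
```

A check like this before each launch would show whether a particular exponent pushes one of the kernel calls past a limit. Whether that is the cause here is not established in the thread.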
|
|
|
|
|
#63 | |
|
Jul 2009
Tokyo
2·5·61 Posts |
Quote:
Is it 100% repeatable? What about after a machine reboot? |
|
|
|
|
|
|
#64 |
|
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
1011111111101₂ Posts |
|
|
|
|
|
|
#65 | |
|
Jul 2003
So Cal
2,663 Posts |
Quote:
I modified the source so that when TESRA is defined, it uses the normalize_kernel from version k rather than breaking it into the three sequential calls. This works correctly on all of the GPUs in the S1070. I'm not sure why breaking it into three kernel calls was causing a problem. Attached is the source as modified. |
|
|
|
|
|
|
#66 |
|
Jul 2009
Tokyo
1001100010₂ Posts |
|
|
|
|