#386
Romulan Interpreter
"name field"
Jun 2011
Thailand
10100000110011₂ Posts
Quote:
(about the edit, buy one and try, then tell us so everybody knows!)
#387
"David"
Jul 2015
Ohio
517₁₀ Posts
Curiously, my W9100 and W8100s actually give pretty subpar performance given such fast DP units.
My Fury X is still the fastest: at a 4K FFT, 3.7 ms/iteration vs. 4.9 ms/iteration on the FirePro. I wonder if something isn't being used quite right in this case. A Titan Black with cuFFT is doing 2.65 ms/iteration.
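To put per-iteration timings like these in perspective: a Lucas-Lehmer test of M_p needs p - 2 squaring iterations, so the total run time is roughly (p - 2) × ms/iter. The helper below is only a back-of-envelope sketch, not clLucas code; the function names and the exponent `P` are hypothetical, and it ignores checkpointing, round-off checks, and the fact that the time per iteration depends on the FFT length chosen for a given exponent.

```python
def ll_seconds(p, ms_per_iter, done=0):
    """Rough remaining wall-clock seconds for an LL test of M_p.

    An LL test needs p - 2 squarings; `done` is how many are already finished.
    """
    return (p - 2 - done) * ms_per_iter / 1000.0


def hms(seconds):
    """Format seconds as H:MM:SS (hours may exceed 24)."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


if __name__ == "__main__":
    P = 40_000_000  # hypothetical exponent, chosen only for illustration
    # ms/iteration figures quoted in the post above
    for label, ms in [("Titan Black + cuFFT", 2.65), ("Fury X", 3.7), ("FirePro", 4.9)]:
        print(f"{label:20s} {hms(ll_seconds(P, ms))}")
```

For the M36976267 run posted later in this thread (2.2464 ms/iter at iteration 20000) this gives about 23:03:38, within a minute of the 23:03:03 clLucas printed, so its own ETA formula evidently differs slightly.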
#388
Jul 2003
So Cal
2,663 Posts
clLucas is limited by memory bandwidth on the R9 290, so the Fury X, with its HBM, wins. Adding extra DP units doesn't help.
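The bandwidth argument can be made concrete with a simple floor: if each iteration streams the whole residue array through memory a certain number of times, no amount of DP throughput can make it faster than bytes moved divided by memory bandwidth. This is a hedged sketch; the pass count and transform length are hypothetical placeholders (clLucas's real kernel count and access pattern are not given here), and the bandwidth figures are nominal spec-sheet numbers (about 320 GB/s for the R9 290 and FirePro W9100, about 512 GB/s for the Fury X's HBM).

```python
def bandwidth_floor_ms(fft_len, passes, bandwidth_gb_s, bytes_per_word=8):
    """Lower bound on ms per iteration for a purely bandwidth-bound loop.

    Assumes each of `passes` kernels reads and writes all fft_len doubles
    exactly once (no cache reuse) and that only DRAM bandwidth limits speed.
    """
    bytes_moved = fft_len * bytes_per_word * 2 * passes  # x2: one read + one write
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3


if __name__ == "__main__":
    FFT_LEN = 4096 * 1024  # hypothetical transform length, in doubles
    PASSES = 6             # hypothetical number of full-array passes per iteration
    for name, bw in [("R9 290 / W9100, ~320 GB/s", 320.0), ("Fury X HBM, ~512 GB/s", 512.0)]:
        print(f"{name:28s} >= {bandwidth_floor_ms(FFT_LEN, PASSES, bw):.3f} ms/iter")
```

Under this model the ratio between the two cards is fixed by bandwidth alone (512/320 = 1.6), whatever values are chosen for the hypothetical parameters; extra DP units never enter the formula.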
#389
"David"
Jul 2015
Ohio
11×47 Posts
Quote:
That means either clFFT has much, much worse memory performance than cuFFT, which isn't supported by AMD's 1-1 comparison at these FFT sizes (http://developer.amd.com/community/b...clfft-library/), or there is something off in clLucas itself. My guess is that the other kernels in clLucas cause the trouble; the pointwise multiplication in particular does not look very optimized for GCN. When I get time, I plan to pre-bake FFT plans of a certain size and embed the multiply step and the 2nd FFT + normalize kernels into one call. I believe that will give a substantial uplift. clFFT added preCallback support for exactly this reason.
Last fiddled with by airsquirrels on 2016-01-12 at 18:29
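For readers following the kernel discussion: one LL iteration squares the residue in four stages (forward FFT, pointwise squaring, inverse FFT, then rounding and carrying), followed by subtracting 2 mod M_p. The sketch below shows that data flow in plain Python and folds the inverse transform's 1/n scale into the pointwise pass, the sort of merging that clFFT callbacks are meant to allow on the GPU. It is only an illustration under stated assumptions: it uses ordinary zero-padded FFT multiplication with 8-bit digits rather than the weighted transforms production LL testers use, all names are hypothetical, and it calls no clFFT API.

```python
import math

BITS = 8  # hypothetical digit size; kept small so double-precision rounding is safe
BASE = 1 << BITS


def fft(a, invert=False):
    """Iterative radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    The inverse (invert=True) is left unscaled."""
    n = len(a)
    a = list(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:  # butterfly passes
        ang = 2 * math.pi / length * (1 if invert else -1)
        wl = complex(math.cos(ang), math.sin(ang))
        half = length // 2
        for start in range(0, n, length):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= wl
        length <<= 1
    return a


def fft_square(x, p):
    """Return x*x for 0 <= x < 2**p using FFT-based multiplication."""
    ndig = (p + BITS - 1) // BITS
    n = 1
    while n < 2 * ndig:  # zero-pad so the cyclic convolution does not wrap
        n <<= 1
    digits = [complex((x >> (BITS * i)) & (BASE - 1)) for i in range(n)]
    spectrum = fft(digits)
    # "fused" pointwise pass: square each bin and apply the 1/n inverse scale here
    spectrum = [c * c / n for c in spectrum]
    conv = fft(spectrum, invert=True)
    # round each coefficient to an integer; Python ints absorb the carries
    return sum(round(c.real) << (BITS * i) for i, c in enumerate(conv))


def lucas_lehmer(p):
    """LL test for an odd prime p: M_p is prime iff s_(p-2) == 0."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (fft_square(s, p) - 2) % m
    return s == 0
```

On a GPU the point of the fusion is to avoid an extra full read and write of the array between kernels; in this Python sketch it only shows where that work would move.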
#390
"Mr. Meeseeks"
Jan 2012
California, USA
37·59 Posts
clFFT 2.10.0 is out; no idea whether this will or could affect clLucas in any way.
Code:
This clFFT release tagged as v2.10.0 is part of AMD Compute Libraries (ACL) 1.0 GA.
clFFT - Release Notes - version 2.10.0:
#391
"David"
Jul 2015
Ohio
1005₈ Posts
Quote:
I do believe using the pre/post callback feature will offer nice boosts.
#392
"Kieren"
Jul 2011
In My Own Galaxy!
2·3·1,693 Posts
You guys are amazing!
(With complete respect.)
#393
Jul 2003
2⁷·5 Posts
Hi,
I did a short test with v1.04 on an R9 390X (Hawaii XT). Code:
C:\1cllucas104>clLucas_x64 -sixstepfft -clfftbench 1048576 8388608 1048576
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
CL_DEVICE_NAME Hawaii
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 2.0 AMD-APP (1800.8)
CL_DRIVER_VERSION 1800.8 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS 44
CL_DEVICE_MAX_CLOCK_FREQUENCY 1080
CL_DEVICE_GLOBAL_MEM_SIZE 0
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
clFFT bench start = 1048576 end = 8388608 distance = 1048576
clFFT size= 1048576 time= 0.624990 msec
clFFT size= 2097152 time= 0.781230 msec
clFFT size= 3145728 time= 1.093740 msec
clFFT size= 4194304 time= 1.517150 msec
clFFT size= 5242880 time= 3.749980 msec
clFFT size= 6291456 time= 4.094270 msec
clFFT size= 7340032 time= 4.843740 msec
clFFT size= 8388608 time= 3.497480 msec
C:\1cllucas104>
--
C:\1cllucas104>clLucas_x64 -sixstepfft -f 2048k 36976267
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
CL_DEVICE_NAME Hawaii
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 2.0 AMD-APP (1800.8)
CL_DRIVER_VERSION 1800.8 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS 44
CL_DEVICE_MAX_CLOCK_FREQUENCY 1080
CL_DEVICE_GLOBAL_MEM_SIZE 0
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
mkdir: cannot create directory `': File exists
Starting M36976267 fft length = 2048K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.04351, max error = 0.05859
Iteration 200, average error = 0.05244, max error = 0.06250
Iteration 300, average error = 0.05579, max error = 0.06250
Iteration 400, average error = 0.05747, max error = 0.06250
Iteration 500, average error = 0.05998, max error = 0.07031
Iteration 600, average error = 0.06170, max error = 0.07031
Iteration 700, average error = 0.06293, max error = 0.07031
Iteration 800, average error = 0.06385, max error = 0.07031
Iteration 900, average error = 0.06457, max error = 0.07031
Iteration 1000, average error = 0.06514 < 0.25 (max error = 0.07031), continuing test.
Iteration 20000 M( 36976267 )C, 0x8c1da923bc3ab356, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2464 ms/iter, ETA 23:03:03)
Iteration 40000 M( 36976267 )C, 0x91133c8d2a727523, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2344 ms/iter, ETA 22:54:55)
Iteration 60000 M( 36976267 )C, 0x5a30204bb64469fa, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2363 ms/iter, ETA 22:55:18)
Iteration 80000 M( 36976267 )C, 0x8a8bd90aeb56e711, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2361 ms/iter, ETA 22:54:28)
Iteration 100000 M( 36976267 )C, 0x53d5d66e91a7cc1e, n = 2048K, clLucas v1.04 err = 0.0703 (0:44 real, 2.2354 ms/iter, ETA 22:53:16)
Iteration 120000 M( 36976267 )C, 0xfc64c4c3b6ec8934, n = 2048K, clLucas v1.04 err = 0.0703 (0:44 real, 2.2350 ms/iter, ETA 22:52:16)
Iteration 140000 M( 36976267 )C, 0xd3509068b42c84c5, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2362 ms/iter, ETA 22:52:15)
Iteration 160000 M( 36976267 )C, 0xe640c2979b7861bc, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2344 ms/iter, ETA 22:50:25)
Iteration 180000 M( 36976267 )C, 0x3e91782572e7b52c, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2364 ms/iter, ETA 22:50:55)
Iteration 200000 M( 36976267 )C, 0x2df9fc599a8ad81d, n = 2048K, clLucas v1.04 err = 0.0703 (0:44 real, 2.2370 ms/iter, ETA 22:50:32)
Iteration 220000 M( 36976267 )C, 0x3a49b542818203ef, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2361 ms/iter, ETA 22:49:14)
Iteration 240000 M( 36976267 )C, 0x2378291204a357e6, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2397 ms/iter, ETA 22:50:43)
Iteration 260000 M( 36976267 )C, 0x20a525306e49c305, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2367 ms/iter, ETA 22:48:05)
Iteration 280000 M( 36976267 )C, 0xeb7dea574bc026be, n = 2048K, clLucas v1.04 err = 0.0703 (0:45 real, 2.2369 ms/iter, ETA 22:47:30)
Unknown signal caught, writing checkpoint. Estimated time spent so far: 10:31
C:\1cllucas104>
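One way to read the -clfftbench numbers above is to divide each time by N·log2(N), the nominal operation count of an FFT, so that sizes can be compared per unit of work. The sketch below does that with the logged values copied verbatim; treating the logged "size" as N and using N·log2(N) as the yardstick are assumptions of convenience (constant factors and memory effects are ignored), and the function names are hypothetical.

```python
import math

# (size, msec) pairs copied from the -clfftbench run above
BENCH = [
    (1048576, 0.624990),
    (2097152, 0.781230),
    (3145728, 1.093740),
    (4194304, 1.517150),
    (5242880, 3.749980),
    (6291456, 4.094270),
    (7340032, 4.843740),
    (8388608, 3.497480),
]


def ns_per_nlogn(size, msec):
    """Time in nanoseconds divided by size * log2(size)."""
    return msec * 1e6 / (size * math.log2(size))


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


if __name__ == "__main__":
    for size, msec in BENCH:
        tag = "pow2" if is_power_of_two(size) else ""
        print(f"{size:>8d} {msec:9.6f} ms {ns_per_nlogn(size, msec):.4f} ns/(N log2 N) {tag}")
```

On these numbers the 8388608 size costs less per N·log2(N) than the 5242880 to 7340032 sizes, even though it is the largest transform in the run.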
#394
Jul 2003
2⁷·5 Posts
Hi,
I did a double-check test of M43656857 with v1.04 on an R9 390X (Hawaii XT), with the GPU power limit set to -40 percent: around 3 ms per iteration, estimated total time = 37:23:36. Code:
C:\1cllucas104>clLucas_x64 -sixstepfft
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
CL_DEVICE_NAME Hawaii
CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
CL_DEVICE_VERSION OpenCL 2.0 AMD-APP (1800.8)
CL_DRIVER_VERSION 1800.8 (VM)
CL_DEVICE_MAX_COMPUTE_UNITS 44
CL_DEVICE_MAX_CLOCK_FREQUENCY 1080
CL_DEVICE_GLOBAL_MEM_SIZE 0
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
mkdir: cannot create directory `': File exists
Starting M43656857 fft length = 2240K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.43750 >= 0.35, increasing n from 2240K
Starting M43656857 fft length = 2304K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.16664, max error = 0.22266
Iteration 200, average error = 0.19465, max error = 0.22266
Iteration 300, average error = 0.21301, max error = 0.25000
Iteration 400, average error = 0.22226, max error = 0.25000
Iteration 500, average error = 0.22780, max error = 0.25000
Iteration 600, average error = 0.23150, max error = 0.25000
Iteration 700, average error = 0.23415, max error = 0.25000
Iteration 800, average error = 0.23613, max error = 0.25000
Iteration 900, average error = 0.23767, max error = 0.25000
Iteration 1000, average error = 0.23889 < 0.25 (max error = 0.25000), continuing test.
Iteration 20000 M( 43656857 )C, 0x93ce78fa2fef13c3, n = 2304K, clLucas v1.04 err = 0.2500 (1:01 real, 3.0184 ms/iter, ETA 36:34:24)
Iteration 40000 M( 43656857 )C, 0xa7c5eb7734d6a507, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0075 ms/iter, ETA 36:25:26)
Iteration 60000 M( 43656857 )C, 0x1adcc85304d23fe2, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0059 ms/iter, ETA 36:23:16)
Iteration 80000 M( 43656857 )C, 0xb7e7a01734335a19, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0117 ms/iter, ETA 36:26:31)
Iteration 100000 M( 43656857 )C, 0xd96857efa86503ac, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0065 ms/iter, ETA 36:21:41)
Iteration 120000 M( 43656857 )C, 0xe5c88868a0f71263, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0056 ms/iter, ETA 36:20:03)
Iteration 140000 M( 43656857 )C, 0xd55a8ea7e5995672, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0070 ms/iter, ETA 36:20:03)
Iteration 160000 M( 43656857 )C, 0xfdf27393cff1d8e1, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0075 ms/iter, ETA 36:19:25)
Iteration 180000 M( 43656857 )C, 0xd2383a2759ace039, n = 2304K, clLucas v1.04 err = 0.2500 (1:01 real, 3.0054 ms/iter, ETA 36:16:56)
Iteration 200000 M( 43656857 )C, 0x19ffa4ce2c4039fd, n = 2304K, clLucas v1.04 err = 0.2500 (1:00 real, 3.0073 ms/iter, ETA 36:17:17)
...
Iteration 43460000 M( 43656857 )C, 0xfed9a59e92e8b13c, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.1015 ms/iter, ETA 9:18)
Iteration 43480000 M( 43656857 )C, 0x2b6ad4f57568dd67, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0936 ms/iter, ETA 8:14)
Iteration 43500000 M( 43656857 )C, 0xc1efde761cd7e7ae, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0916 ms/iter, ETA 7:12)
Iteration 43520000 M( 43656857 )C, 0x0acd599af4479068, n = 2304K, clLucas v1.04 err = 0.3125 (1:01 real, 3.0883 ms/iter, ETA 6:10)
Iteration 43540000 M( 43656857 )C, 0x33512556c84a656e, n = 2304K, clLucas v1.04 err = 0.3125 (1:01 real, 3.0929 ms/iter, ETA 5:09)
Iteration 43560000 M( 43656857 )C, 0x88c796f90f2780a0, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0888 ms/iter, ETA 4:07)
Iteration 43580000 M( 43656857 )C, 0xc494e7dd7d84bc48, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0892 ms/iter, ETA 3:05)
Iteration 43600000 M( 43656857 )C, 0x503c1fe26490b85b, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0873 ms/iter, ETA 2:03)
Iteration 43620000 M( 43656857 )C, 0xc9ba71029f42e911, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.1016 ms/iter, ETA 1:02)
Iteration 43640000 M( 43656857 )C, 0xd945e5cbd2bc21ee, n = 2304K, clLucas v1.04 err = 0.3125 (1:02 real, 3.0932 ms/iter, ETA 0:00)
M( 43656857 )C, 0x10b5c3b8fad63572, n = 2304K, clLucas v1.04, estimated total time = 37:23:36
No valid assignment found.
C:\1cllucas104>
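The logs show clLucas's careful round-off test at work: within the first 1000 iterations a max error of 0.4375 (>= 0.35) made it move from 2240K to 2304K at once, and at iteration 1000 it continues only if the average error is below 0.25. Below is a minimal sketch of that decision, reconstructed from the printed messages only (the names are hypothetical, this is not clLucas's source, and checks after iteration 1000 are not modeled), together with the way round-off error is typically measured: the distance of each FFT output from the nearest integer.

```python
CAREFUL_ITERS = 1000  # length of the careful round-off test, from the log
AVG_LIMIT = 0.25      # "If average error >= 0.25, the test will restart ..."
MAX_LIMIT = 0.35      # "err = 0.43750 >= 0.35, increasing n"


def roundoff_error(values):
    """Largest distance of any float convolution output from the nearest integer."""
    return max(abs(v - round(v)) for v in values)


def careful_test_decision(iteration, avg_err, err):
    """Return 'increase' (restart with a larger FFT), 'continue', or 'testing'.

    Only iterations 1..CAREFUL_ITERS of the careful test are modeled.
    """
    if iteration > CAREFUL_ITERS:
        raise ValueError("only the careful test phase is modeled")
    if iteration < CAREFUL_ITERS and err >= MAX_LIMIT:
        return "increase"
    if iteration == CAREFUL_ITERS:
        return "increase" if avg_err >= AVG_LIMIT else "continue"
    return "testing"
```

Fed the logged values, it reproduces both outcomes: iteration 32 with err 0.4375 gives "increase", and iteration 1000 with average 0.23889 gives "continue".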
#396
Jul 2009
Tokyo
2·5·61 Posts
Thank you for your support.
clLucas.ini supports the six-step FFT. Code:
# SixStepFFT is the same as the -sixstepfft option.
SixStepFFT=1 0 off
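The ini key mirrors the command-line switch. As a hedged sketch of how such a key/value file could be read and combined with the flag (the parser, the function names, and the rule that the command-line flag takes precedence are all assumptions for illustration, not clLucas's actual code):

```python
def parse_ini(text):
    """Parse simple 'Key=Value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def six_step_enabled(settings, argv):
    """Assumed rule: -sixstepfft on the command line enables it;
    otherwise use the SixStepFFT ini value (1 on, 0 off)."""
    if "-sixstepfft" in argv:
        return True
    return settings.get("SixStepFFT", "0") == "1"


if __name__ == "__main__":
    ini = "# SixStepFFT is the same as the -sixstepfft option.\nSixStepFFT=1\n"
    print(six_step_enabled(parse_ini(ini), ["clLucas_x64"]))
```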