[QUOTE=kriesel;478313]Received back my HD620-equipped laptop today from warranty repair; after letting it warm up, I unpacked it and tried gpuowl v1.9 on it. Will try some things on a real discrete AMD GPU later.
[CODE]gpuowl-v1.9-74f1a38>gpuowl --help
gpuOwL v1.9- GPU Mersenne primality checker
Command line options:
-size 2M|4M|8M : override FFT size.
-fft DP|SP|M61|M31 : choose FFT variant [default DP]:
  DP  : double precision floating point.
  SP  : single precision floating point.
  M61 : Fast Galois Transform (FGT) modulo M(61).
  M31 : FGT modulo M(31).
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-legacy : use legacy kernels
-dump <path> : dump compiled ISA to the folder <path> that must exist.
-verbosity <level> : change amount of information logged. [0-2, default 0].
-device <N> : select specific device among:
  0 : Intel(R) HD Graphics 620, 24x1050MHz
  1 : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 4x2700MHz
[/CODE]
-dump also seems not to work on the Intel IGP.
[CODE]gpuowl -user kriesel -cpu falcon-hd620 -dump isa -verbosity 2 -device 0
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz[/CODE]
-fft DP silently crashes the program.
[CODE]gpuowl.exe
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
[/CODE]
-fft M61 runs, after some warnings:
[CODE]gpuowl -device 0 -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
fcl build 1 succeeded.
bcl build succeeded.
OpenCL compilation in 28284 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=83780327u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFGT_61=1 -DLOG_ROOT2=55u "
Note: using long carry kernels
PRP-3: FFT 8M (2048 * 2048 * 2) of 83780327 (9.99 bits/word)
[2018-01-24 18:41:01 Central Standard Time] Starting at iteration 0
OK     0 / 83780327 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [18:46:39]
OK  1000 / 83780327 [ 0.00%], 697.29 ms/it; ETA 676d 03:31; cc32f90fcf7bb85a [19:03:53]
Stopping, please wait..
OK  1500 / 83780327 [ 0.00%], 669.39 ms/it; ETA 649d 02:01; 7577c10a6a191001 [19:15:11]
Bye
[/CODE]
It runs slowly, at great cost to Prime95 (v29.4b7 64-bit) throughput: approximately 270 ms/iter for M61 4M at best. The two Intel CPU cores/workers drop from ~33 ms/iter each (60 iter/sec total) with Prime95 alone to 50 ms/iter (40 iter/sec for the two cores combined) plus ~3 iter/sec from the IGP: a loss of 17 iter/sec, from 60 down to 43 iter/sec (72% of the original throughput). Or, with a reduction to one IA core, 40 ms/iter = 25 iter/sec, plus ~4 iter/sec from the IGP, gives 29 iter/sec combined (48%). I think it's the shared-memory arrangement; CPU speed drops from ~2.7 GHz to 1.6 or 1.7 GHz, CPU utilization from 90+% to 50 or 25%, and wattage from 13 W to 5.5 or 4 W, as the IGP goes from 0.1 W at idle to 8 or 9 W of the 15 W TDP budget. Primality testing is probably just too memory-intensive for shared-memory GPUs. Trial factoring, maybe?
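The throughput bookkeeping above can be restated as a small calculation; the helper below just turns the post's ms/iter figures into combined iterations per second (a sketch using those numbers, nothing more):

```python
def combined_throughput(cpu_ms_per_iter, cpu_workers, gpu_iters_per_sec):
    """Total iterations/second from Prime95 workers plus the IGP."""
    return cpu_workers * (1000.0 / cpu_ms_per_iter) + gpu_iters_per_sec

print(combined_throughput(33, 2, 0))  # Prime95 alone: ~60 iter/sec
print(combined_throughput(50, 2, 3))  # Prime95 + gpuowl on the IGP: 43 iter/sec
print(combined_throughput(40, 1, 4))  # one worker + IGP: 29 iter/sec
```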
[CODE]gpuowl -device 0 -fft M61 -dump isa
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
WARNING: -s is not supported on the Intel OpenCL GPU device.
fcl build 1 succeeded.
bcl build succeeded.
OpenCL compilation in 28378 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u -save-temps=isa/M61_4M"
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2018-01-24 19:24:47 Central Standard Time] Starting at iteration 0
OK      0 / 77973559 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [19:27:10]
OK   1000 / 77973559 [ 0.00%], 653.80 ms/it; ETA 590d 00:46; 04b3acbff2710af0 [19:40:33]
OK   5000 / 77973559 [ 0.01%], 292.32 ms/it; ETA 263d 19:00; debea46fa265c7e3 [20:02:30]
OK  10000 / 77973559 [ 0.01%], 289.25 ms/it; ETA 261d 00:15; f3ea22ad73c203e0 [20:29:03]
OK  20000 / 77973559 [ 0.03%], 270.90 ms/it; ETA 244d 09:58; c94ad5f9c8a09a3b [21:16:27][/CODE]
I'm puzzled by the gpuowl ms/iter declining as the computation progresses, and note that the wall-clock time is 1034 seconds for the 1000 iterations that the program indicates executed at 697 ms/iter (the 8M M61 0-1000 iteration interval). Similarly, for the 10k-20k interval on 4M, 270 ms/iter is given for a 2844-second duration (284.4 ms/iter). I note the date and time zone have been removed from the progress lines, including when the time of day rolls over at midnight.[/QUOTE] Try with "-verbosity 1" or "-verbosity 2", it may explain why the iteration time doesn't match the wall-time: the time taken by the gerbicz check is not included in the "time per it", but accounted separately. OTOH the "gerbicz iteration" (which is done, I think, every 500 iterations) is included in the "per iteration" time. It is indeed terribly slow on the integrated GPU. And apparently that GPU doesn't take DP. |
[QUOTE=preda;478331]Try with "-verbosity 1" or "-verbosity 2", it may explain why the iteration time doesn't match the wall-time: the time taken by the gerbicz check is not included in the "time per it", but accounted separately. OTOH the "gerbicz iteration" (which is done, I think, every 500 iterations) is included in the "per iteration" time.
It is indeed terribly slow on the integrated GPU. And apparently that GPU doesn't take DP.[/QUOTE]
The HD620 is slow. Mfakto v0.15pre6 gives 20 GHzd/day for TF of M154582327 from 71 to 72 bits if the CPU is idle; 18 if prime95 LL is loading the CPU. (Comparable to an HD4600, judging by old posts in the mfakto thread.) Not surprisingly, due to shared memory, the prime95 LL throughput takes a hit when mfakto is run. For comparison, an RX 550 does 100-130 GHzd/day of TF in mfakto; the TF benchmarks at mersenne.ca give an RX Vega 64 or GTX1080 ~1020 GHzd/day, and a Quadro 2000 90 GHzd/day.
Looks like the Gerbicz check takes minutes. What does the CV % mean?
[CODE]gpuowl -user kriesel -cpu falcon-hd620 -verbosity 2 -device 0 -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
Intel(R) HD Graphics 620, 24x1050MHz
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
warning: Linking two modules of different data layouts: '' is 'e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024' whereas '<origin>' is 'e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32:64'
warning: Linking two modules of different target triples: ' is 'spir64' whereas '<origin>' is 'vISA_64'
fcl build 1 succeeded.
bcl build succeeded.
OpenCL compilation in 25158 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2018-01-25 08:27:16 Central Standard Time] Starting at iteration 40500
OK  40500 / 77973559 [ 0.05%], 0.00 ms/it; ETA 0d 00:00; 51becf154b293f2a [08:29:42]
OK  41000 / 77973559 [ 0.05%], 285.59 ms/it; ETA 257d 14:31; d6e3ddef6125ec5c [08:34:31]
OK  42000 / 77973559 [ 0.05%], 284.92 ms/it [284.40, 285.44] CV 0.3%, check 145.84s; ETA 256d 23:53; 87c245c4c50e5bbb [08:41:42]
OK  45000 / 77973559 [ 0.06%], 288.01 ms/it [287.03, 291.72] CV 0.6%, check 145.23s; ETA 259d 18:30; 9c04ee0df183900d [08:58:31]
[/CODE]
For what it's worth, GPU-Z v2.7.0 claims DP, SP, and half-FP capabilities for the HD620.
[CODE]General
Platform Name  Intel(R) OpenCL
Platform Vendor  Intel(R) Corporation
Platform Profile  FULL_PROFILE
Platform Version  OpenCL 2.1
Vendor  Intel(R) Corporation
Device Name  Intel(R) HD Graphics 620
Version  OpenCL 2.1
Driver Version  22.20.16.4799
C Version  OpenCL C 2.0
IL Version  SPIR-V_1.0
Profile  FULL_PROFILE
Global Memory Size  3227 MB
Clock Frequency  1050 MHz
Compute Units  24
Device Available  Yes
Compiler Available  Yes
Linker Available  Yes
Preferred Synchronization  User
CMD Queue Properties  Out of Order, Profiling
SVM Capabilities  Coarse, Fine, Atomics
DP Capability  Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA
SP Capability  Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA
Half FP Capability  Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA
Address Bits  64
Preferred On-Device Queue  128 KB
Global Memory Cache  512 KB (RW Cache)
Global Memory Cacheline  0 KB
Preferred Global Atomic Alignment  64
Preferred Local Atomic Alignment  64
Preferred Platform Atomic Alignment  64
Local Memory  Local (64 KB)
Memory Alignment  1024 bits
Pitch Alignment  4 pixels
Built-in Kernels  block_motion_estimate_intel;block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidirectional_check_intel
Little Endian  Yes
Error Correction  No
Execution Capability  Kernel
Unified Memory  Yes
Image Support  Yes
Limits
Max Device Events  1024
Max Device Queues  1
Max On-Device Queue  65536 KB
Preferred Max Variable Size  2147483647 Bytes
Max Memory Allocation  2047 MB
Max Constant Buffer  2097151 KB
Max Constant Args  8
Max Pipe Args  16
Max Pipe Reservations  1
Max Pipe Packet Size  1024
Max Read Image Args  128
Max Write Image Args  128
Max Read-Write Image Args  128
Max Samplers  16
Max Work Item Dims  3
Native Vectors
Native Vector Width (CHAR)  16
Native Vector Width (SHORT)  8
Native Vector Width (INT)  4
Native Vector Width (LONG)  1
Native Vector Width (FLOAT)  1
Native Vector Width (DOUBLE)  1
Native Vector Width (HALF)  8
Preferred Vector Width (CHAR)  16
Preferred Vector Width (SHORT)  8
Preferred Vector Width (INT)  4
Preferred Vector Width (LONG)  1
Preferred Vector Width (FLOAT)  1
Preferred Vector Width (DOUBLE)  1
Preferred Vector Width (HALF)  8
Extensions
cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_d3d11_nv12_media_sharing cl_intel_device_side_avc_motion_estimation cl_intel_driver_diagnostics cl_intel_dx9_media_sharing cl_intel_media_block_io cl_intel_motion_estimation cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_simultaneous_sharing cl_intel_subgroups cl_intel_subgroups_short cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_fp16 cl_khr_fp64 cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_spir cl_khr_subgroups cl_khr_throttle_hints
[/CODE] |
[QUOTE=kriesel;478338]
Looks like the gerbicz check takes minutes. What does the cv % mean? [/QUOTE] The check is done by doing 500 [additional] iterations, so it's consistent with the time-per-it * 500. CV is "coefficient of variation" ([url]https://en.wikipedia.org/wiki/Coefficient_of_variation[/url]), indicating how spread out the time-per-it values are. The min/max are also there. A small CV (say 0.1% or 0.3%) means almost all iterations take the same time, while a high CV (1%) may indicate e.g. periodic thermal throttling, something else being run at the same time, etc. |
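As a concrete illustration of the CV figure described above (a sketch; the two timing lists are made up for the example, not gpuowl data):

```python
from statistics import mean, pstdev

def cv_percent(times_ms):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return 100.0 * pstdev(times_ms) / mean(times_ms)

steady = [12.00, 12.01, 12.00, 12.02, 12.01]  # tight timings -> tiny CV
choppy = [12.0, 12.9, 11.9, 12.8, 12.0]       # periodic slowdowns -> CV of a few %
print(round(cv_percent(steady), 2))  # well under 0.1
print(round(cv_percent(choppy), 2))  # a few percent
```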
gpuowl speed
[QUOTE=preda;478207]Thanks for the generous offer! But let me think it over. I see there is demand to get it working on Nvidia, it seems I need to take it seriously :)[/QUOTE]
I'm 60% through a ~76.8M exponent on an RX550, originally projected to take 10d15h total. The RX550 is rated at 7.7 GHzd/day on the mersenne.ca GPU benchmarks page for LL near that exponent range, so that's ~81.8 GHzd for the exponent. CUDALucas on a GTX1070 takes 5.5 days for similar-sized exponents (using FFT length 4608K) and is rated 46 GHzd/day for LL. CUDALucas does not currently have the Jacobi check or Gerbicz check implemented. 5.5 * 46 / 81.8 = 3.09. A Quadro 2000 with CUDALucas takes about a month for similar-sized exponents (30d21h for M76.6M with 4320K FFT length) and is rated at 8 GHzd/day for LL. Extrapolating to 76.8M: (76.8/76.6)^2.1 * 8 GHzd/day * 30.875 days ≈ 248 GHzd, and 248/81.8 = 3.04. ClLucas also does not have the Jacobi check or Gerbicz check implemented. ClLucas on my RX550 reported 8.0871 ms/iter for 2048K FFT length residue checks. If this follows the 1.1-power relationship per iteration I've found with CUDALucas (and it is a port of CUDALucas), that yields an estimate of 17.3 ms/iter at 4M length, while gpuowl is running at 12.00-12.01 ms/iter on the same GPU. 17.3/12.01 = 1.44. After the GpuOwL exponent completes, I'll run some ClLucas on the same RX550 for a more complete and direct comparison. |
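The scaling arithmetic in those comparisons can be written out explicitly (the 2.1 credit exponent and the 1.1 ms/iter exponent are the empirical power laws quoted in the post, not exact constants):

```python
def scale_credit(ghzd, from_exp, to_exp, power=2.1):
    """Scale a GHz-days credit figure between nearby Mersenne exponents."""
    return ghzd * (to_exp / from_exp) ** power

def scale_ms_per_iter(ms, from_fft_k, to_fft_k, power=1.1):
    """Scale per-iteration time between FFT lengths (in K)."""
    return ms * (to_fft_k / from_fft_k) ** power

# Quadro 2000: 8 GHzd/day for 30d21h, extrapolated from 76.6M to 76.8M
print(round(scale_credit(8 * 30.875, 76.6e6, 76.8e6)))  # ~248 GHz-days
# clLucas: 8.0871 ms/it at 2048K, extrapolated to 4096K (4M)
print(round(scale_ms_per_iter(8.0871, 2048, 4096), 1))  # ~17.3 ms/it
```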
[QUOTE=kriesel;478453]I'm 60% through a ~76.8M exponent on rx550, originally projected to take 10d15hr total, on an RX550. The RX550 is rated at 7.7ghzd/day at the mersenne.ca GPU benchmarks page for LL near that exponent range, so that's ~81.8 ghzd/exponent.[/QUOTE]
A 76.8M exponent will yield about 200 GHzd ([url]http://www.mersenne.ca/credit.php?worktype=LL&exponent=76800000&f_exponent=&b1=&b2=&numcurves=&factor=&frombits=&tobits=&submitbutton=Calculate[/url]). If the RX550 can complete that in 10.5 days, it should be rated at ~20 GHzd/day
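That rating is just credit divided by elapsed time; as a two-line sketch with the post's numbers:

```python
credit_ghz_days = 200  # GHz-days credit for a 76.8M LL test, per mersenne.ca
elapsed_days = 10.5    # observed RX550 completion time from the post
print(round(credit_ghz_days / elapsed_days, 1))  # ~19 GHzd/day effective rate
```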
[QUOTE=axn;478517]A 76.8m exponent will yield about 200 GHday ([url]http://www.mersenne.ca/credit.php?worktype=LL&exponent=76800000&f_exponent=&b1=&b2=&numcurves=&factor=&frombits=&tobits=&submitbutton=Calculate[/url]). If RX550 can complete that in 10.5 days, it should be rated at 20GHzday/day[/QUOTE]
Thanks for that link. I see it does not have a selection for PRP3. (Not proposing a ton of work for anyone, just noting that PRP3 is not a choice there.)
I plan to run clLucas on the same GPU next week, and will compare its LL speed to the hardware LL benchmark figure using the link. My use of the hardware LL benchmark figure as a comparison for PRP3 speed might be considered a misuse, but speed matters, and relative indications are useful even if somewhat inaccurate, especially before confirmation on the actual hardware. The 10d15h duration estimate is for GpuOwL v1.9 doing PRP3, which avoids some transforms to and fro, not for an LL computation, which as I understand it requires those transforms every iteration to perform the -2 operation. It seems reasonable and expected that the PRP3 performance would differ somewhat from the LL performance.
If there are two primality-test implementations, C (LL) and G (PRP3), and G runs faster than C on test hardware H, what do we do? Do we change the rating of hardware H, making H look faster than other hardware models whose benchmark figures are actually comparable? Do we say G is faster than C by roughly a factor x, at least partly for intrinsic differences between LL and PRP3, and leave the benchmark figures for hundreds or thousands of hardware models comparable for LL? Do we develop another set of benchmarks for PRP3, or something else? (That leaves aside, for the moment, the possibility that G could be ported to another architecture N and compared in speed to implementation U, bringing benchmark ratings for more hardware models into question; the prospect of rerating all CPUs' benchmarks because application P has also implemented both LL and PRP3; the fact that there's now LL, PRP3, P-1, and ECM as well as TF; that Gerbicz checks provide something more than Jacobi checks or unchecked LL tests do; and that there could be unforeseen future developments, as occurred with the Gerbicz check not so long ago.) |
Hmm, not sure why, but the 12 ms/iteration time in the latest GpuOwL v1.9 build, which has been rock solid for days on the RX550, has begun climbing.
GPU fans were only at 30%.
[CODE]OK 47000000 / 76812401 [61.19%], 12.01 ms/it; ETA 4d 03:26; 107c7bd555765e93 [13:04:15]
OK 47500000 / 76812401 [61.84%], 12.01 ms/it; ETA 4d 01:46; 950930b488fce86f [14:44:26]
OK 48000000 / 76812401 [62.49%], 12.00 ms/it; ETA 4d 00:05; 9b06644a1b4e1118 [16:24:37]
OK 48500000 / 76812401 [63.14%], 12.01 ms/it; ETA 3d 22:25; ca0d8f717b50c226 [18:04:48]
OK 49000000 / 76812401 [63.79%], 12.01 ms/it; ETA 3d 20:45; 5930ebef739c2c85 [19:44:58]
OK 49500000 / 76812401 [64.44%], 12.01 ms/it; ETA 3d 19:05; 67a7ce42059bf857 [21:25:10]
OK 50000000 / 76812401 [65.09%], 12.01 ms/it; ETA 3d 17:25; 484f4a62db8fa1b4 [23:05:21]
OK 50500000 / 76812401 [65.74%], 12.01 ms/it; ETA 3d 15:45; bde0eb185277f359 [00:45:32]
OK 51000000 / 76812401 [66.40%], 12.01 ms/it; ETA 3d 14:05; d9139a175391d055 [02:25:43]
OK 51500000 / 76812401 [67.05%], 12.00 ms/it; ETA 3d 12:25; 46cb956267f1ca85 [04:05:53]
OK 52000000 / 76812401 [67.70%], 12.01 ms/it; ETA 3d 10:45; f85d4209c361be06 [05:46:05]
OK 52500000 / 76812401 [68.35%], 12.01 ms/it; ETA 3d 09:05; 15313977148cd942 [07:26:16]
OK 53000000 / 76812401 [69.00%], 13.34 ms/it; ETA 3d 16:14; 26055fdeca7b4808 [09:17:34]
OK 53500000 / 76812401 [69.65%], 16.16 ms/it; ETA 4d 08:41; 7176178b31fb54f2 [11:32:24][/CODE]
Application stop and restart seems to have cleared it up (first 1000 iterations at 12.03 ms/it). Will relaunch with verbosity 2. |
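For reference, the ETA column is consistent with remaining iterations times the current ms/it figure; a minimal reconstruction (assuming that simple formula, which matches the log lines above):

```python
from datetime import timedelta

def eta(iters_done, iters_total, ms_per_iter):
    """Wall time left if every remaining iteration takes ms_per_iter."""
    return timedelta(milliseconds=(iters_total - iters_done) * ms_per_iter)

# The 13.34 ms/it line above: about 3 days 16 hours, matching "ETA 3d 16:14"
print(eta(53_000_000, 76_812_401, 13.34))
```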
[QUOTE=kriesel;478539]Thanks for that link. I see it does not have a selection for PRP3. (Not proposing a ton of work for anyone, just noting that PRP3 is not a choice there.)
I plan to run clLucas on the same GPU next week, and will compare its LL speed to the hardware LL benchmark figure using the link. My use of the hardware LL benchmark figure in comparison to PRP3 speed might be considered a misuse. But speed matters and relative indications are useful, even if somewhat inaccurate, especially before confirmation on the actual hardware.. The 10d15h duration estimate is for GpuOwL V1.9, doing PRP3, and which avoids some transforms to and fro, not for an LL computation which as I understand it requires those transforms every iteration to perform the -2 operation. It seems reasonable and expected the PRP3 performance would differ from the LL performance somewhat. [/QUOTE] The -2 operation in LL is free (no performance cost), being merged into the carry propagation step. The PRP and LL cost is mostly the same. The PRP is a little bit more expensive than LL, in the order of 0.5% - 2% more expensive, because PRP needs to do a few additional iterations for the error check. |
[QUOTE=kriesel;478539]The 10d15h duration estimate is for GpuOwL V1.9, doing PRP3, and which avoids some transforms to and fro, not for an LL computation which as I understand it requires those transforms every iteration to perform the -2 operation. It seems reasonable and expected the PRP3 performance would differ from the LL performance somewhat. [/QUOTE]
Your understanding is incorrect. |
[QUOTE=axn;478586]Your understanding is incorrect.[/QUOTE]
It happens. I try to be accurate and clear, and sometimes succeed, sometimes not. I remember reading a post by Preda listing the steps and stating that a transform each way was avoided in PRP (7 steps) relative to LL (11). I didn't find it when I searched yesterday. Perhaps it was for a version before the Gerbicz check was added, and is no longer relevant. (I don't think I just dreamed it....) |
high cv and a bump in iteration time
1 Attachment(s)
[QUOTE=preda;478351]The check is done by doing 500 [additional] iterations. Then it's consistent with the time-per-it * 500.
CV is "coefficient of variation" [URL]https://en.wikipedia.org/wiki/Coefficient_of_variation[/URL] , indicating how spread the time-per-it values are distributed. The min/max are also there. A small CV (let's say 0.1% or 0.3%) means almost all iterations take the same time, while a high CV (1%) may indicate e.g. periodic thermal throttling, or something else being run at the same time, etc.[/QUOTE]
Here's a snapshot of GpuOwL 1.9-74f1a38, running at verbosity 2 on a Radeon RX550 2GB, on a Windows 7 Pro test system also running prime95 on all CPU cores. There's little else running on this system: GPU-Z, Task Manager, Remote Desktop. The display, when in use, is currently driven by an NVIDIA NVS295 adapter. CV values are mostly 1.8% or higher, associated with min/max ranges like [11.92, 12.92], and around 1 AM the computed iteration time rose. The wall-clock time is up too, so it's not just a time-computation issue. GPU fans are at 30%, and HWMonitor reports temperatures at the various sensors in the system ranging from 33C to 76C, so thermal throttling seems unlikely. Timing seems to me a bit of a mess on Windows. [URL]https://stackoverflow.com/questions/3729169/how-can-i-get-the-windows-system-time-with-millisecond-resolution[/URL] Is there something else going on, though? |
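On the Windows-timing point: a clock's granularity can be probed directly. A sketch (on some Windows configurations `time.time()` advances in coarse ~15.6 ms steps, while `time.perf_counter()` is sub-microsecond everywhere):

```python
import time

def tick_granularity_ms(clock):
    """Busy-wait until the clock advances; return the observed step in ms."""
    t0 = clock()
    while (t1 := clock()) == t0:
        pass
    return (t1 - t0) * 1000.0

print(tick_granularity_ms(time.time))          # can be coarse on older Windows
print(tick_granularity_ms(time.perf_counter))  # fine-grained high-resolution timer
```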