NVIDIA GTX1070 OpenCL data as reported by GPU-Z
That certainly seems to explain the VectorSize=1 requirement given by Bdot for mfakto on NVIDIA.
[CODE]General
Platform Name: NVIDIA CUDA
Platform Vendor: NVIDIA Corporation
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 CUDA 8.0.0
Vendor: NVIDIA Corporation
Device Name: GeForce GTX 1070
Version: OpenCL 1.2 CUDA
Driver Version: 378.66
C Version: OpenCL C 1.2
Profile: FULL_PROFILE
Global Memory Size: 8192 MB
Clock Frequency: 1708 MHz
Compute Units: 15
Device Available: Yes
Compiler Available: Yes
Linker Available: Yes
Preferred Synchronization: Device
CMD Queue Properties: Out of Order, Profiling
SVM Capabilities: Coarse
DP Capability: Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA
SP Capability: Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA
Half FP Capability: None
Address Bits: 64
Preferred On-Device Queue: 256 KB
Global Memory Cache: 240 KB (RW Cache)
Global Memory Cacheline: 0 KB
Local Memory: Local (48 KB)
Memory Alignment: 4096 bits
Built-in Kernels:
Little Endian: Yes
Error Correction: No
Execution Capability: Kernel
Unified Memory: No
Image Support: Yes

Limits
Max Device Events: 2048
Max Device Queues: 4
Max On-Device Queue: 256 KB
Max Memory Allocation: 2048 MB
Max Constant Buffer: 64 KB
Max Constant Args: 9
Max Read Image Args: 256
Max Write Image Args: 16
Max Samplers: 32
Max Work Item Dims: 3
Max Write Image Args: 16

Native Vectors
Native Vector Width (CHAR): 1
Native Vector Width (SHORT): 1
Native Vector Width (INT): 1
Native Vector Width (LONG): 1
Native Vector Width (FLOAT): 1
Native Vector Width (DOUBLE): 1
Native Vector Width (HALF): N/A
Preferred Vector Width (CHAR): 1
Preferred Vector Width (SHORT): 1
Preferred Vector Width (INT): 1
Preferred Vector Width (LONG): 1
Preferred Vector Width (FLOAT): 1
Preferred Vector Width (DOUBLE): 1
Preferred Vector Width (HALF): N/A

Extensions
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_fp64
cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_khr_d3d10_sharing
cl_nv_d3d11_sharing
cl_nv_copy_opts[/CODE] |
[QUOTE=xx005fs;478001]Don't really think DP performance matters that much for GpuOwL, because for my gpu overclocking the HBM memory makes it a lot faster and more efficient. Not 100% sure why tho.[/QUOTE]
Interesting. I interpret the GpuOwL author's post regarding new features in v1.5 to say that DP performance is very important for GpuOwL; sounds to me like it's the best one of the four transforms implemented. That's both sufficiently fast and provides sufficient bits of precision to be worth using at 4M length and above: [URL]http://www.mersenneforum.org/showpost.php?p=471318&postcount=224[/URL] |
[QUOTE=kriesel;478058]Interesting. I interpret the GpuOwL author's post regarding new features in v1.5 to say that DP performance is very important for GpuOwL; sounds to me like it's the best one of the four transforms implemented. That's both sufficiently fast and provides sufficient bits of precision to be worth using at 4M length and above: [URL]http://www.mersenneforum.org/showpost.php?p=471318&postcount=224[/URL]
[/QUOTE] Double precision is definitely important, but memory speed (at least for a Vega card) is just as important. With the HBM at 1190 MHz, increasing the GPU clock speed from 1400 to 1700 MHz reduced the iteration time from 3.6 ms/it to 3.2 ms/it; however, running the HBM at 800 MHz with the GPU at 1700 MHz increased it to 4 ms/it. So they are equally important, I guess. |
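To put the figures quoted in the post above side by side, here is a minimal sketch that converts them into percentage changes. It uses only the clocks and iteration times stated in the post; the helper name `pct_change` is made up for illustration, and three data points from one card are not a performance model.

```python
def pct_change(old, new):
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0

# GPU clock raised from 1400 to 1700 MHz, HBM held at 1190 MHz (figures from the post).
core_clock_change = pct_change(1400, 1700)
core_time_change = pct_change(3.6, 3.2)      # ms/it

# HBM lowered from 1190 to 800 MHz, GPU clock held at 1700 MHz (figures from the post).
mem_clock_change = pct_change(1190, 800)
mem_time_change = pct_change(3.2, 4.0)       # ms/it

print(f"GPU clock {core_clock_change:+.1f}% -> iteration time {core_time_change:+.1f}%")
print(f"HBM clock {mem_clock_change:+.1f}% -> iteration time {mem_time_change:+.1f}%")
```

Read this way, roughly a 21% GPU clock increase bought about an 11% shorter iteration, while roughly a 33% memory clock reduction cost about 25% in iteration time, which is consistent with the poster's impression that both matter.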
mfakto compilation on debian
Hello, I am trying to compile mfakto on Debian stretch. It produces a large number of errors. I can post the compiler trace if necessary, but I would first like to know if you have any suggestions for a first-time build.
SELROC |
Has anyone else tried using WINE to run mfakto on macOS?
I'm getting the following error: [QUOTE]Compiling kernels. Error 002a:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (0x22f2a8 1 C) semi-stub 002a:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (0x22eeb8 1 C) semi-stub -43 (Invalid build options): clBuildProgram ERROR: load_kernels(0) failed[/QUOTE] |
Does OpenCL ever work in WINE?
|
mfakto downshift upon finding a factor
Does anyone have an idea why mfakto's indicated throughput dropped immediately upon finding a factor? It's 3 for 3 on a new RX550, mfakto 0.15pre6, 64-bit on Win7, which passed the full selftest: 183 GHz-d/day before, 89 after. The drop is persistent, continuing after several hours, and is more than a 2:1 ratio. The ETA seems unaffected, so maybe it's only a cosmetic effect. (Exponent, factor and bits were changed in the first, most recent example below, which has not been submitted yet since the bit level hasn't completed.)
[CODE]Apr 26 12:34 | 1785 38.8% | 512.54 3d11h | 183.47 38299 0.00%
Apr 26 12:42 | 1792 38.9% | 513.40 3d11h | 183.16 38299 0.00%
M1234567 has a factor: 123456789012134567 (72.843305 bits, 504.966873 GHz-d)
Apr 26 12:51 | 1801 39.0% | 511.66 3d11h | 88.82 38299 0.00%
Apr 26 12:59 | 1809 39.1% | 513.38 3d11h | 88.52 38299 0.00%[/CODE]
Also, it seems to clear up with either completion of a worktodo line or a restart, or perhaps a bitlevel completion. Ctrl-c and restart cleared it up, with the stop/start and short selftest costing about 15 minutes of throughput.
[CODE]Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Apr 26 22:28 | 2116 45.9% | 513.02 3d01h | 183.29 38299 0.00%[/CODE]
I do not recall seeing such an effect in mfaktc, which I've run much more than mfakto. A search through a 60MB sample of mfaktc screen output logged to file shows nothing like that.
Here's another mfakto example, with a more than 15:1 ratio indicated:
[CODE]Apr 13 21:01 | 264 5.8% | 456.00 4d18h | 212.08 10045 0.00%
Apr 13 21:08 | 267 5.9% | 453.66 4d17h | 213.18 10045 0.00%
Apr 13 21:16 | 271 6.0% | 453.81 4d17h | 213.11 10045 0.00%
M111269 has a factor: 617778664352573195639 (69.065652 bits, 70.546139 GHz-d)
Apr 13 21:24 | 276 6.1% | 457.79 4d18h | 13.87 10045 0.00%
Apr 13 21:31 | 280 6.3% | 459.59 4d18h | 13.81 10045 0.00%
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Apr 13 21:39 | 291 6.4% | 459.27 4d18h | 13.82 10045 0.00%[/CODE]
Another example, which persisted to completion of this exponent's bit level, clearing up when going on to the next worktodo entry:
[CODE]Apr 13 11:11 | 3555 77.2% | 42.900 2h36m | 110.71 63018 0.00%
Apr 13 11:12 | 3564 77.3% | 44.406 2h41m | 106.96 63018 0.00%
Apr 13 11:13 | 3567 77.4% | 43.091 2h35m | 110.22 63018 0.00%
M290001377 has a factor: 96303240212210144213599 (76.350002 bits, 18.470592 GHz-d)
Apr 13 11:14 | 3568 77.5% | 44.628 2h40m | 37.25 63018 0.00%
Apr 13 11:14 | 3579 77.6% | 44.089 2h37m | 37.70 63018 0.00%
Apr 13 11:15 | 3583 77.7% | 43.331 2h34m | 38.36 63018 0.00%[/CODE] |
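For anyone wanting to check their own logs for the same drop, here is a rough sketch that averages the GHz-d/day column before and after the first "has a factor" line. The column layout is assumed from the screen output pasted in the post above (four "|"-separated fields, rate first in the last field); the function name `rate_drop` is hypothetical and this is not part of mfakto.

```python
def rate_drop(lines):
    """Return (mean rate before, mean rate after, before/after ratio)
    around the first 'has a factor' line in mfakto-style screen output."""
    before, after, factor_seen = [], [], False
    for line in lines:
        if "has a factor" in line:
            factor_seen = True
            continue
        fields = line.split("|")
        if len(fields) != 4:
            continue  # not a status line
        try:
            rate = float(fields[3].split()[0])  # GHz-d/day is the first value
        except (ValueError, IndexError):
            continue  # e.g. the repeated column header line
        (after if factor_seen else before).append(rate)
    if not before or not after:
        raise ValueError("need status lines on both sides of a factor line")
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return mean_before, mean_after, mean_before / mean_after

# Sample taken from the first example in the post above.
sample = """\
Apr 26 12:34 | 1785 38.8% | 512.54 3d11h | 183.47 38299 0.00%
Apr 26 12:42 | 1792 38.9% | 513.40 3d11h | 183.16 38299 0.00%
M1234567 has a factor: 123456789012134567 (72.843305 bits, 504.966873 GHz-d)
Apr 26 12:51 | 1801 39.0% | 511.66 3d11h | 88.82 38299 0.00%
Apr 26 12:59 | 1809 39.1% | 513.38 3d11h | 88.52 38299 0.00%
""".splitlines()

mean_before, mean_after, ratio = rate_drop(sample)
print(f"before {mean_before:.2f}, after {mean_after:.2f}, ratio {ratio:.2f}:1")
```

Note that the per-class time column barely moves in all three examples, which fits the suggestion that the drop may be only cosmetic; the sketch reports the indicated rate only and says nothing about the cause.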
Reference material
I was offered "a blog area to consolidate all of your pdfs and guides and stuff" and accepted.
Feel free to have a look and suggest content. (G-rated only;)
General interest gpu related reference material [URL]http://www.mersenneforum.org/showthread.php?t=23371[/URL]
Mfakto OpenCl based factoring on gpus [URL]http://www.mersenneforum.org/showthread.php?t=23394[/URL]
Future updates to material previously posted in this thread will probably occur on the blog threads and not here. Having in-place update without a time limit makes it more manageable there. |
Just as a matter of curiosity:
Both mfakto and mfaktc have a limit of exponent < 2[sup]32[/sup]. That's an obvious limit point, but how arbitrary or absolute is that limit? In the (probably distant) future, how easy or hard is it to extend the capabilities of mfakto to higher exponents? |
[QUOTE=James Heinrich;490385]Just as a matter of curiosity:
Both mfakto and mfaktc have a limit of exponent < 2[sup]32[/sup]. That's an obvious limit point, but how arbitrary or absolute is that limit? In the (probably distant) future, how easy or hard is it to extend the capabilities of mfakto to higher exponents?[/QUOTE] I looked only briefly, and in the main routines it seemed not too bad. But a small sample of CUDA interface code shows various u32 instructions. So probably it has to be gone through from one end to the other by someone who knows what they're doing, routine by routine, kernel by kernel. I nominate not-me. Looking at prime95's p-1 code for other reasons, I noticed code for handling bounds values bigger than 2^32, which you'll find in ecm.c (containing both ecm and p-1 code). There's certainly plenty of gpu trial factoring to do within the mersenne.org 10^9 exponent cap, much less 2^32-5, more than 4.2 times higher. |
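The numbers in the reply above can be checked with a short sketch. It confirms that 2^32-5 is the largest prime below 2^32, computes its ratio to the 10^9 cap, and shows the generic wraparound that any value stored in an unsigned 32-bit word undergoes. The masking line only imitates u32 arithmetic in Python; it is an illustration of why 32-bit code paths would need rework, not a description of the actual mfaktc or mfakto kernels.

```python
from math import isqrt

LIMIT = 2**32          # exponents must be below this in mfakto and mfaktc
MERSENNE_ORG_CAP = 10**9

def is_prime(n):
    """Trial division; fast enough here since sqrt(2**32) is only 65536."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

# Largest prime exponent that still fits under the limit.
largest = next(p for p in range(LIMIT - 1, 1, -1) if is_prime(p))
ratio = largest / MERSENNE_ORG_CAP

# Emulate storing an exponent just above the limit in an unsigned 32-bit word.
too_big = LIMIT + 7
wrapped = too_big & 0xFFFFFFFF

print(f"largest prime below 2^32: {largest} (2^32 - {LIMIT - largest})")
print(f"ratio to the 10^9 cap: {ratio:.2f}")
print(f"{too_big} stored as u32 becomes {wrapped}")
```

The silent wrap in the last line is the kind of failure that makes a routine-by-routine review necessary: every place an exponent passes through a 32-bit variable would have to be found and widened.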
[QUOTE=kriesel;490452]I looked only briefly, and in the main routines it seemed not too bad. But a small sample of CUDA interface code shows various u32 instructions. So probably it has to be gone through from one end to the other by someone who knows what they're doing, routine by routine, kernel by kernel. I nominate not-me.[/QUOTE]Thanks for looking into it. That's kind of what I suspected. I also assume that rewritten code for larger exponents has the potential to be at least slightly slower.
Fortunately there's still a couple hundred million exponents below 2[sup]32[/sup] that need some more TF'ing first, so it's not really a high-priority problem. :smile: |