[QUOTE=xx005fs;477886]What is the estimated Teraflop conversion to GHz-day?[/QUOTE]For mfaktc and mfakto, the numbers I use on [url=http://www.mersenne.ca/mfaktc.php]my mfakt* performance table[/url]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'NVIDIA' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet, approximate
    ),
    'AMD' => array(
        'VLIW5'   => 11.3,
        'VLIW4'   => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code]
[QUOTE=James Heinrich;477909]For mfaktc and mfakto, the numbers I use on [URL="http://www.mersenne.ca/mfaktc.php"]my mfakt* performance table[/URL]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'N' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet, approximate
    ),
    'A' => array(
        'VLIW5'   => 11.3,
        'VLIW4'   => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code][/QUOTE]
So, N means NVIDIA, A means AMD, and there's no I for Intel IGPs?
Correct. I don't know what/if/how mfakto runs on Intel, I've never seen a benchmark and I don't have one to test.
[QUOTE=James Heinrich;477921]Correct. I don't know what/if/how mfakto runs on Intel, I've never seen a benchmark and I don't have one to test.[/QUOTE]
I've read up lately on the HD4x00 saga. Looks like some were able to get it working with a special version of mfakto. The links for that in the forum thread are no longer valid. Some of the more promising indications in [URL]http://mersenneforum.org/showthread.php?t=15646[/URL] were posts 750, 752, 753 hd4000 silently fails to handle larger number of threads unless tuned to Gridsize=0 to avoid; throughput ~5Ghzd/day 1076-1107, 1113-1122 George (Prime95) working with Bdot attempting compile for Intel opencl of mfakto, compiler and debugging issues, code edits w/o git use himself; HD4600 16-20GhzD/day 1232 v0.15pre4 summary of testing results, including hd4600 working at 18-19ghzd/day 1236 hd4600 performance profiling results (wide variation, up to 24ghzd/day depending on p, bits, kernel) 1249 more hd 4600 profiling results There was also 1019 v0.14pre4 (NVIDIA support) Apparently there was not an official 0.15 release. Latest code I've found is 0.15pre6. Its ini file says: [CODE]# Different GPUs may have their best performance with different kernels # Here, you can give a hint to mfakto on how to optimize the kernels. # # Possible values: # GPUType=AUTO try to auto-detect, if that does not work: let me know # GPUType=GCN Tahiti et al. (HD77xx-HD79xx), also assumed for unknown devices. # GPUType=VLIW4 Cayman (HD69xx) # GPUType=VLIW5 most other AMD GPUs (HD4xxx, HD5xxx, HD62xx-HD68xx) # GPUType=APU all APUs (C-30 - C-60, E-240 - E-450, A2-3200 - A8-3870K) not sure if the "small" APUs would work better as VLIW5. # GPUType=CPU all CPUs (when GPU not found, or forced to CPU) # GPUType=NVIDIA reserved for Nvidia-OpenCL. Currently mapped to "CPU" and not yet functional on Nvidia Hardware. # GPUType=INTEL reserved for Intel-OpenCL (e.g. HD4000). Not yet functional. [/CODE]I don't know whether those last two lines are current status or left over from earlier versions. 
It was the same back at 0.14p, before a lot of development work on and reported successful selftests and benchmarks of intel hd4x00. Note also, in post 1277, Bdot says there's a GCN3 type coming in v0.15. Other posts mention GCN2. My best candidate hardware for attempting mfakto on Intel IGP is an HD620 in a laptop that's out this month for warranty repair. The system fan finally failed entirely three weeks before the warranty would have expired. Better that than weeks after. |
Oops. Much better answers already given. :blush:
[QUOTE=James Heinrich;477909]For mfaktc and mfakto, the numbers I use on [url=http://www.mersenne.ca/mfaktc.php]my mfakt* performance table[/url]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'NVIDIA' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet, approximate
    ),
    'AMD' => array(
        'VLIW5'   => 11.3,
        'VLIW4'   => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code][/QUOTE]
Does this mean that the lower the number, the more efficient the conversion, right?
mfakto 0.15pre6 failing on first selftest on NVIDIA
[QUOTE=Bdot;369966]I just posted the win version of [URL="http://mersenneforum.org/mfakto/mfakto-0.14pre4/mfakto-0.14pre4-win.zip"]mfakto-0.14pre4[/URL]. It should be pretty much complete, except that I did not check my latest changes on Linux yet. I plan to do that next week. Sometimes valgrind finds some weird stuff that I better change for win too.
These are the changes: [LIST][*]--perftest enhancements including GPU sieve evaluation (for optimizing GPUSievePrimes etc.)[*]successfully resync when the working directory was temporarily lost (ejected USB device or interrupted network drive)[*]save and reload compiled OpenCL kernels => reduce startup time (UseBinfile config variable)[*]MoreClasses config variable to allow for a "less classes" version for very short assignments (GPU sieve only)[*]BugFix: enforce GPUSieveSize being a multiple of GPUSieveProcessSize[*]FlushInterval config variable to fine tune the number of kernels in the GPU queue => address high CPU load issue of newer AMD drivers[*]MinGW build (thanks to kracker)[*]slight performance improvement for the montgomery kernels[*]improved English wording in program output, ini file etc. (thanks to kracker)[*]recognition of new GPUs (8xxx, R9, new APUs) (thanks to kracker)[*]added a warning when using VectorSize=1 (AMD driver issue (?), [URL]http://devgurus.amd.com/thread/167571[/URL])[*]fix for a small memory leak (~0.5kB per assignment)[*]and it should work on Win8.1[/LIST]I'm interested to hear [LIST][*]is it really running on Win8.1?[*]when doing real tests with GPU sieving, what's the CPU load, is the performance slower or faster than before?[*]anything odd[/LIST]Observations so far: [LIST][*] VectorSize=1 is not usable on AMD GPUs (see above). 
It works on Intel and AMD CPUs, and on NVIDIA GPUs (!)[*]When playing around with various ini-file-settings, it is sometimes necessary to manually delete the binary kernel cache file (see UseBinfile setting)[*]Yes, NVIDIA GPUs finally work, but only with these settings:[LIST][*]VectorSize=1[*]FlushInterval=6 (or lower, depending on the SieveSize)[*]SieveOnGPU=1[/LIST] [*]other settings crash the nvidia driver (out of resources), compiler (mfakto crash) or fail the selftest.[*]on a Quadro 2000M, mfakto runs at 57.7 GHz-days/day (mfaktc: 78.8 GHz-days/day), which is better than I expected.[*]As it runs on one non-AMD platform now, we should go for the HD4000 next.[/LIST][/QUOTE] With 0.15pre6 on an NVIDIA GTX 1070 under Windows 7, with OpenCL verified functional via oclDeviceQuery.exe and nothing else running on the GPU, I tried the above recommended NVIDIA settings with GPUType=NVIDIA; I also tried them on a GTX 1050 Ti and a Quadro 2000. Then I concentrated on the GTX 1070 and tried successively lower FlushInterval values: 5, 4, 3, 2, 1, 0. The driver supports CUDA 8.0 and OpenCL 1.2. After that, I also changed:
[CODE]# Default: GPUSieveSize=96
GPUSieveSize=32
# Default: GPUSieveProcessSize=24
GPUSieveProcessSize=8[/CODE]
Almost all attempts produced an mfakto_Kernels.elf; out of conservatism I removed it between tries with different ini parameters. Results with Verbosity=3 looked like the following.
c:\Users\Ken\Documents\mfakto-q2000>mfakto -d 1 >>debug.txt
Error 0 (Success): clBuildProgram
Error -9999 (Unknown errorcode (not an OpenCL error)): clWaitForEvents

with debug.txt containing:
[CODE]mfakto 0.15pre6-Win (64bit build)
Runtime options
  Inifile               mfakto.ini
  Verbosity             3
  SieveOnGPU            yes
  MoreClasses           yes
  GPUSievePrimes        81157
  GPUSieveProcessSize   8Ki bits
  GPUSieveSize          32Mi bits
  FlushInterval         1
  WorkFile              worktodo.txt
  ResultsFile           results.txt
  Checkpoints           enabled
  CheckpointDelay       600s
  Stages                enabled
  StopAfterFactor       bitlevel
  PrintMode             full
  V5UserID              kriesel
  ComputerID            condorette
  ProgressHeader        "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
  ProgressFormat        "%d %T | %C %p%% | %t %e | %g %s %W%%"
  TimeStampInResults    yes
  VectorSize            1
  GPUType               NVIDIA
  SmallExp              no
  UseBinfile            mfakto_Kernels.elf
Compiletime options
Select device - Get device info:
Device 1/3: GeForce GTX 1070 (NVIDIA Corporation), device version: OpenCL 1.2 CUDA, driver version: 378.66
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Global memory: 8589934592, Global memory cache: 245760, local memory: 49152, workgroup size: 1024, Work dimensions: 3[1024, 1024, 64, 0, 0], Max clock speed: 1708, compute units: 15
OpenCL device info
  name                       GeForce GTX 1070 (NVIDIA Corporation)
  device (driver) version    OpenCL 1.2 CUDA (378.66)
  maximum threads per block  1024
  maximum threads per grid   67108864
  number of multiprocessors  15 (120 compute elements)
  clock rate                 1708MHz
Automatic parameters
  threads per grid           0
optimizing kernels for NVIDIA
Loading binary kernel file mfakto_Kernels.elf
Compiling kernels (build options: "-I. -DVECTOR_SIZE=1 -DNVIDIA -DMORE_CLASSES -DCL_GPU_SIEVE").
BUILD OUTPUT
END OF BUILD OUTPUT
GPUSievePrimes (adjusted)  81206
GPUsieve minimum exponent  1037054
Started a simple selftest ...
######### testcase 1/30 (M1031831[63-64]) #########
Error -5 (Out of resources): clEnqueueReadBuffer RES failed.
ERROR from tf_class.
[/CODE]or, earlier,
[CODE]mfakto 0.15pre6-Win (64bit build)
Runtime options
  Inifile               mfakto.ini
  Verbosity             2
  SieveOnGPU            yes
  MoreClasses           yes
  GPUSievePrimes        81157
  GPUSieveProcessSize   24Ki bits
  GPUSieveSize          96Mi bits
  FlushInterval         1
  WorkFile              worktodo.txt
  ResultsFile           results.txt
  Checkpoints           enabled
  CheckpointDelay       600s
  Stages                enabled
  StopAfterFactor       bitlevel
  PrintMode             full
  V5UserID              kriesel
  ComputerID            condorette-quadro2000
  ProgressHeader        "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
  ProgressFormat        "%d %T | %C %p%% | %t %e | %g %s %W%%"
  TimeStampInResults    yes
  VectorSize            1
  GPUType               NVIDIA
  SmallExp              no
  UseBinfile            mfakto_Kernels.elf
Compiletime options
Select device - Get device info:
Device 1/3: GeForce GTX 1070 (NVIDIA Corporation), device version: OpenCL 1.2 CUDA, driver version: 378.66
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Global memory: 8589934592, Global memory cache: 245760, local memory: 49152, workgroup size: 1024, Work dimensions: 3[1024, 1024, 64, 0, 0], Max clock speed: 1708, compute units: 15
OpenCL device info
  name                       GeForce GTX 1070 (NVIDIA Corporation)
  device (driver) version    OpenCL 1.2 CUDA (378.66)
  maximum threads per block  1024
  maximum threads per grid   67108864
  number of multiprocessors  15 (120 compute elements)
  clock rate                 1708MHz
Automatic parameters
  threads per grid           0
optimizing kernels for NVIDIA
Compiling kernels (build options: "-I. -DVECTOR_SIZE=1 -DNVIDIA -DMORE_CLASSES -DCL_GPU_SIEVE").
Warning: Dumping only the first of 3 binary formats - if loading the binary file mfakto_Kernels.elf fails, delete it and specify the -d <n> option for mfakto.
Wrote binary kernel for "GeForce GTX 1070" to "mfakto_Kernels.elf".
GPUSievePrimes (adjusted)  81206
GPUsieve minimum exponent  1037054
Started a simple selftest ...
Error -9999 (Unknown errorcode (not an OpenCL error)): clWaitForEvents
Error -5 (Out of resources): clEnqueueReadBuffer RES failed.
ERROR from tf_class.[/CODE]
So, my requests are:
[LIST][*]Does anyone see anything obvious (or even not)?
[*]Does anyone have an mfakto.ini file that works for an NVIDIA GPU matching or similar to any of the following, and is willing to share? Quadro 2000, Quadro 4000, GTX 480, GTX 1050Ti, GTX 1060 3GB, GTX 1070[/LIST]
Thanks in advance for any help!
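For reference, here are the NVIDIA settings from Bdot's quoted post, plus the GPUType hint tried in the runs above, collected into one mfakto.ini fragment. This is only a sketch of what was attempted, not a confirmed-working configuration on Pascal cards:

```ini
# NVIDIA-OpenCL settings per Bdot's 0.14pre4 post (not confirmed working on Pascal)
VectorSize=1
SieveOnGPU=1
# Bdot suggests 6 or lower, depending on the sieve size
FlushInterval=6
GPUType=NVIDIA
```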
[QUOTE=xx005fs;477957]Does this mean that the lower the number the more efficient the conversion is right?[/QUOTE]Correct. For example, a GTX 1080 @ 1607MHz stock clock with 8228 GFLOPS should give: 8228 / 8.1 ≈ 1015.8 GHz-days/day.
The GTX 580 (Compute 2.0) was top-class in its day; by today's standards it didn't have a lot of GFLOPS (1581), but it was very efficient [3.6 vs the 680's 10.5] at converting them to GHz-days (433). In comparison, the GTX 680 (Compute 3.0) had twice the GFLOPS (3090) but was much less efficient at converting them to useful-to-us work, so you actually ended up with (much) lower throughput (294). The differences between AMD generations are less dramatic. Note: provided numbers are for single-precision GFLOPS.
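To make the conversion concrete, here is a small Python sketch (not the actual mersenne.ca site code, which is PHP) applying the factors from the table quoted earlier to the GFLOPS figures mentioned in this post:

```python
# Hypothetical sketch: estimate mfaktc/mfakto TF throughput from a card's rated
# single-precision GFLOPS, using the GFLOPS-per-(GHz-day/day) factors quoted above.
FACTORS = {
    "Compute 2.0": 3.6,   # e.g. GTX 580
    "Compute 3.0": 10.5,  # e.g. GTX 680
    "Compute 6.1": 8.1,   # e.g. GTX 1080
}

def est_ghzdays_per_day(sp_gflops: float, arch: str) -> float:
    # Lower factor = fewer GFLOPS consumed per GHz-day/day, i.e. more efficient.
    return sp_gflops / FACTORS[arch]

print(round(est_ghzdays_per_day(8228, "Compute 6.1"), 1))  # GTX 1080: 1015.8
print(round(est_ghzdays_per_day(1581, "Compute 2.0"), 1))  # GTX 580: ~439 by this formula (post quotes 433)
print(round(est_ghzdays_per_day(3090, "Compute 3.0"), 1))  # GTX 680: ~294
```

This shows the point of the comparison: the 680 has nearly twice the raw GFLOPS of the 580 but a factor almost three times worse, so its estimated TF throughput is lower.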
[QUOTE=James Heinrich;477981]...
Note: provided numbers are for single-precision GFLOPS.[/QUOTE]
Right, from what I've read, GPU SP performance is what matters for trial factoring (mfakto, mfaktc). Conversely, GPU DP performance is what matters for primality testing and related algorithms such as LL, PRP3, and P-1 (CUDALucas, clLucas, GpuOwL, CUDAPm1, ...).
[QUOTE=James Heinrich;477981]Correct. For example, a GTX 1080 @ 1607MHz stock clock with 8228 GFLOPS should give: 8228 / 8.1 ≈ 1015.8 GHz-days/day.
The GTX 580 (Compute 2.0) was top-class in its day; by today's standards it didn't have a lot of GFLOPS (1581), but it was very efficient [3.6 vs the 680's 10.5] at converting them to GHz-days (433). In comparison, the GTX 680 (Compute 3.0) had twice the GFLOPS (3090) but was much less efficient at converting them to useful-to-us work, so you actually ended up with (much) lower throughput (294). The differences between AMD generations are less dramatic. Note: provided numbers are for single-precision GFLOPS.[/QUOTE]
Is there ever going to be better optimization for current-generation AMD GPUs such as GCN 1.2, 1.3, and 1.5? Especially GCN 1.5, where half precision might be exploited to achieve impressive numbers (I don't know if that's possible).
[QUOTE=kriesel;477987]Right, from what I've read, GPU SP performance is what matters for trial factoring (mfakto, mfaktc).
Conversely, GPU DP performance is what matters for primality testing and related code and algorithms; LL, PRP3, P-1. (CUDALucas, clLucas, GpuOwL, CUDAPm1, ...)[/QUOTE]
I don't really think DP performance matters that much for GpuOwL, because on my GPU overclocking the HBM memory makes it a lot faster and more efficient, which suggests it is memory-bandwidth-bound rather than DP-bound. Not 100% sure why, though.