mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

James Heinrich 2018-01-19 15:53

[QUOTE=xx005fs;477886]What is the estimated Teraflop conversion to GHz-day?[/QUOTE]For mfaktc and mfakto, these are the numbers I use on [url=http://www.mersenne.ca/mfaktc.php]my mfakt* performance table[/url]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'NVIDIA' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet; approximate
    ),
    'AMD' => array(
        'VLIW5' => 11.3,
        'VLIW4' => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code]
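For a worked illustration of how these factors convert single-precision GFLOPS into GHz-days/day, here is a minimal sketch (the dict and function names are mine, not mersenne.ca's, and only a few table rows are copied in):

```python
# Sketch: estimate mfaktc/mfakto trial-factoring throughput (GHz-days/day)
# from a GPU's single-precision GFLOPS, by dividing by the table's
# GFLOPS-per-(GHz-day/day) factor. Names here are illustrative only.

FACTORS = {
    ("NVIDIA", "Compute 2.0"): 3.6,
    ("NVIDIA", "Compute 3.0"): 10.5,
    ("NVIDIA", "Compute 6.1"): 8.1,
    ("AMD", "GCN 1.3"): 10.9,
    # ... remaining rows as in the table above
}

def ghzdays_per_day(vendor: str, arch: str, sp_gflops: float) -> float:
    """Lower factor => fewer GFLOPS needed per GHz-day/day (more efficient)."""
    return sp_gflops / FACTORS[(vendor, arch)]

# e.g. a GTX 1080 (Compute 6.1) at 8228 SP GFLOPS:
print(round(ghzdays_per_day("NVIDIA", "Compute 6.1", 8228), 1))  # 1015.8
```

A smaller factor means the architecture needs fewer GFLOPS per unit of trial-factoring work, hence higher throughput for the same raw GFLOPS.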

kriesel 2018-01-19 17:19

[QUOTE=James Heinrich;477909]For mfaktc and mfakto, these are the numbers I use on [URL="http://www.mersenne.ca/mfaktc.php"]my mfakt* performance table[/URL]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'N' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet; approximate
    ),
    'A' => array(
        'VLIW5' => 11.3,
        'VLIW4' => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code][/QUOTE]

So, N means NVIDIA, A means AMD, and there's no I for Intel IGPs?

James Heinrich 2018-01-19 17:28

Correct. I don't know what/if/how mfakto runs on Intel; I've never seen a benchmark, and I don't have one to test with.

kriesel 2018-01-19 21:10

[QUOTE=James Heinrich;477921]Correct. I don't know what/if/how mfakto runs on Intel, I've never seen a benchmark and I don't have one to test.[/QUOTE]

I've read up lately on the HD4x00 saga. It looks like some were able to get it working with a special version of mfakto, but the links for that in the forum thread are no longer valid. Some of the more promising indications in [URL]http://mersenneforum.org/showthread.php?t=15646[/URL] were these posts:

750, 752, 753: HD4000 silently fails with larger numbers of threads unless tuned with Gridsize=0 to avoid it; throughput ~5 GHz-d/day

1076-1107, 1113-1122: George (Prime95) working with Bdot on compiling mfakto for Intel OpenCL; compiler and debugging issues, code edits without using git himself; HD4600 at 16-20 GHz-d/day

1232: v0.15pre4 summary of testing results, including HD4600 working at 18-19 GHz-d/day

1236: HD4600 performance-profiling results (wide variation, up to 24 GHz-d/day depending on p, bit level, and kernel)

1249: more HD4600 profiling results

There was also:
1019: v0.14pre4 (NVIDIA support)

Apparently there was not an official 0.15 release. The latest code I've found is 0.15pre6; its ini file says:
[CODE]# Different GPUs may have their best performance with different kernels
# Here, you can give a hint to mfakto on how to optimize the kernels.
#
# Possible values:
# GPUType=AUTO try to auto-detect, if that does not work: let me know
# GPUType=GCN Tahiti et al. (HD77xx-HD79xx), also assumed for unknown devices.
# GPUType=VLIW4 Cayman (HD69xx)
# GPUType=VLIW5 most other AMD GPUs (HD4xxx, HD5xxx, HD62xx-HD68xx)
# GPUType=APU all APUs (C-30 - C-60, E-240 - E-450, A2-3200 - A8-3870K) not sure if the "small" APUs would work better as VLIW5.
# GPUType=CPU all CPUs (when GPU not found, or forced to CPU)
# GPUType=NVIDIA reserved for Nvidia-OpenCL. Currently mapped to "CPU" and not yet functional on Nvidia Hardware.
# GPUType=INTEL reserved for Intel-OpenCL (e.g. HD4000). Not yet functional.
[/CODE]I don't know whether those last two lines describe current status or are left over from earlier versions. The text was the same back at 0.14p, before a lot of development work on Intel HD4x00 and the reported successful selftests and benchmarks.
Note also, in post 1277, Bdot says there's a GCN3 type coming in v0.15. Other posts mention GCN2.
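Incidentally, the ini file shown above is a flat Key=value format with "#" comments and no [sections], so a minimal reader for it can be sketched like this (an illustration under that assumption, not mfakto's actual parser):

```python
# Minimal sketch of reading mfakto.ini-style settings: "Key=value" lines,
# "#" starts a comment, no INI [sections]. Illustrative, not mfakto's code.

def read_mfakto_ini(path):
    settings = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()  # drop comments/whitespace
            if "=" in line:
                key, value = line.split("=", 1)
                settings[key.strip()] = value.strip()
    return settings

# e.g. settings.get("GPUType", "AUTO") after reading the file
```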

My best candidate hardware for attempting mfakto on Intel IGP is an HD620 in a laptop that's out this month for warranty repair. The system fan finally failed entirely three weeks before the warranty would have expired. Better that than weeks after.

kladner 2018-01-19 21:35

Oops. Much better answers already given. :blush:

xx005fs 2018-01-20 05:44

[QUOTE=James Heinrich;477909]For mfaktc and mfakto, these are the numbers I use on [url=http://www.mersenne.ca/mfaktc.php]my mfakt* performance table[/url]:[code]$TF_GFLOPS_per_GHzDayPerDay = array(
    'NVIDIA' => array(
        'Compute 1.0' => 0.0,  // mfaktc doesn't run
        'Compute 1.1' => 14.0,
        'Compute 1.2' => 14.0,
        'Compute 1.3' => 14.0,
        'Compute 2.0' => 3.6,
        'Compute 2.1' => 5.3,
        'Compute 3.0' => 10.5,
        'Compute 3.5' => 11.4,
        'Compute 3.7' => 11.4,
        'Compute 5.0' => 9.0,
        'Compute 5.2' => 9.0,
        'Compute 6.0' => 8.1,  // ??? no benchmarks yet
        'Compute 6.1' => 8.1,
        'Compute 7.0' => 8.1,  // Titan V; ??? no benchmarks yet; approximate
    ),
    'AMD' => array(
        'VLIW5' => 11.3,
        'VLIW4' => 11.0,
        'GCN 1.0' => 9.3,
        'GCN 1.1' => 9.3,
        'GCN 1.2' => 9.3,
        'GCN 1.3' => 10.9,
        'GCN 1.5' => 10.0,
    ),
);[/code][/QUOTE]
Does this mean that the lower the number, the more efficient the conversion, right?

kriesel 2018-01-20 14:42

mfakto 0.15pre6 failing on first selftest on NVIDIA
 
[QUOTE=Bdot;369966]I just posted the win version of [URL="http://mersenneforum.org/mfakto/mfakto-0.14pre4/mfakto-0.14pre4-win.zip"]mfakto-0.14pre4[/URL]. It should be pretty much complete, except that I did not check my latest changes on Linux yet. I plan to do that next week. Sometimes valgrind finds some weird stuff that I better change for win too.

These are the changes:

[LIST]
[*]--perftest enhancements including GPU sieve evaluation (for optimizing GPUSievePrimes etc.)
[*]successfully resync when the working directory was temporarily lost (ejected USB device or interrupted network drive)
[*]save and reload compiled OpenCL kernels => reduce startup time (UseBinfile config variable)
[*]MoreClasses config variable to allow for a "less classes" version for very short assignments (GPU sieve only)
[*]BugFix: enforce GPUSieveSize being a multiple of GPUSieveProcessSize
[*]FlushInterval config variable to fine-tune the number of kernels in the GPU queue => address high CPU load issue of newer AMD drivers
[*]MinGW build (thanks to kracker)
[*]slight performance improvement for the montgomery kernels
[*]improved English wording in program output, ini file etc. (thanks to kracker)
[*]recognition of new GPUs (8xxx, R9, new APUs) (thanks to kracker)
[*]added a warning when using VectorSize=1 (AMD driver issue (?), [URL]http://devgurus.amd.com/thread/167571[/URL])
[*]fix for a small memory leak (~0.5kB per assignment)
[*]and it should work on Win8.1
[/LIST]
I'm interested to hear:
[LIST]
[*]is it really running on Win8.1?
[*]when doing real tests with GPU sieving, what's the CPU load, and is the performance slower or faster than before?
[*]anything odd
[/LIST]
Observations so far:
[LIST]
[*]VectorSize=1 is not usable on AMD GPUs (see above). It works on Intel and AMD CPUs, and on NVIDIA GPUs (!)
[*]When playing around with various ini-file settings, it is sometimes necessary to manually delete the binary kernel cache file (see UseBinfile setting)
[*]Yes, NVIDIA GPUs finally work, but only with these settings:
[LIST]
[*]VectorSize=1
[*]FlushInterval=6 (or lower, depending on the SieveSize)
[*]SieveOnGPU=1
[/LIST]
[*]other settings crash the nvidia driver (out of resources), the compiler (mfakto crash), or fail the selftest.
[*]on a Quadro 2000M, mfakto runs at 57.7 GHz-days/day (mfaktc: 78.8 GHz-days/day), which is better than I expected.
[*]As it runs on one non-AMD platform now, we should go for the HD4000 next.
[/LIST][/QUOTE]

With 0.15pre6 on an NVIDIA GTX 1070 under Windows 7, with OpenCL verified to be functional (via oclDeviceQuery.exe) and nothing else running on the GPU, I tried the NVIDIA settings recommended above, with
GPUType=NVIDIA,
and also tried them on a GTX 1050Ti and a Quadro 2000.
Then I concentrated on the GTX 1070 and tried successively lower FlushInterval values: 5, 4, 3, 2, 1, 0. The driver supports CUDA 8.0 and OpenCL 1.2. After that, I also changed
# Default: GPUSieveSize=96
GPUSieveSize=32
and
# Default: GPUSieveProcessSize=24
GPUSieveProcessSize=8

Almost all attempts produced an mfakto_Kernels.elf. Out of caution, I removed it between tries with different ini parameters.

Results with verbosity=3 looked like the following:
c:\Users\Ken\Documents\mfakto-q2000>mfakto -d 1 >>debug.txt
Error 0 (Success): clBuildProgram
Error -9999 (Unknown errorcode (not an OpenCL error)): clWaitForEvents

with debug.txt containing
[CODE]mfakto 0.15pre6-Win (64bit build)


Runtime options
Inifile mfakto.ini
Verbosity 3
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 8Ki bits
GPUSieveSize 32Mi bits
FlushInterval 1
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 600s
Stages enabled
StopAfterFactor bitlevel
PrintMode full
V5UserID kriesel
ComputerID condorette
ProgressHeader "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
ProgressFormat "%d %T | %C %p%% | %t %e | %g %s %W%%"
TimeStampInResults yes
VectorSize 1
GPUType NVIDIA
SmallExp no
UseBinfile mfakto_Kernels.elf
Compiletime options

Select device - Get device info:
Device 1/3: GeForce GTX 1070 (NVIDIA Corporation),
device version: OpenCL 1.2 CUDA, driver version: 378.66
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Global memory:8589934592, Global memory cache: 245760, local memory: 49152, workgroup size: 1024, Work dimensions: 3[1024, 1024, 64, 0, 0] , Max clock speed:1708, compute units:15

OpenCL device info
name GeForce GTX 1070 (NVIDIA Corporation)
device (driver) version OpenCL 1.2 CUDA (378.66)
maximum threads per block 1024
maximum threads per grid 67108864
number of multiprocessors 15 (120 compute elements)
clock rate 1708MHz

Automatic parameters
threads per grid 0
optimizing kernels for NVIDIA

Loading binary kernel file mfakto_Kernels.elf
Compiling kernels (build options: "-I. -DVECTOR_SIZE=1 -DNVIDIA -DMORE_CLASSES -DCL_GPU_SIEVE").
BUILD OUTPUT


END OF BUILD OUTPUT

GPUSievePrimes (adjusted) 81206
GPUsieve minimum exponent 1037054
Started a simple selftest ...
######### testcase 1/30 (M1031831[63-64]) #########
Error -5 (Out of resources): clEnqueueReadBuffer RES failed.
ERROR from tf_class.
[/CODE]or, earlier,

[CODE]mfakto 0.15pre6-Win (64bit build)


Runtime options
Inifile mfakto.ini
Verbosity 2
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 1
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 600s
Stages enabled
StopAfterFactor bitlevel
PrintMode full
V5UserID kriesel
ComputerID condorette-quadro2000
ProgressHeader "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
ProgressFormat "%d %T | %C %p%% | %t %e | %g %s %W%%"
TimeStampInResults yes
VectorSize 1
GPUType NVIDIA
SmallExp no
UseBinfile mfakto_Kernels.elf
Compiletime options

Select device - Get device info:
Device 1/3: GeForce GTX 1070 (NVIDIA Corporation),
device version: OpenCL 1.2 CUDA, driver version: 378.66
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_kh
r_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl
_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts
Global memory:8589934592, Global memory cache: 245760, local memory: 49152, workgroup size: 1024, Work dimensions: 3[1024, 1024, 64, 0, 0] , Max clock speed:1708, compute units:15

OpenCL device info
name GeForce GTX 1070 (NVIDIA Corporation)
device (driver) version OpenCL 1.2 CUDA (378.66)
maximum threads per block 1024
maximum threads per grid 67108864
number of multiprocessors 15 (120 compute elements)
clock rate 1708MHz

Automatic parameters
threads per grid 0
optimizing kernels for NVIDIA

Compiling kernels (build options: "-I. -DVECTOR_SIZE=1 -DNVIDIA -DMORE_CLASSES -DCL_GPU_SIEVE").Warning: Dumping only the first of 3 binary formats - if loading the bi
nary file mfakto_Kernels.elf fails, delete it and specify the -d <n> option for mfakto.
Wrote binary kernel for "GeForce GTX 1070" to "mfakto_Kernels.elf".

GPUSievePrimes (adjusted) 81206
GPUsieve minimum exponent 1037054
Started a simple selftest ...
Error -9999 (Unknown errorcode (not an OpenCL error)): clWaitForEvents
Error -5 (Out of resources): clEnqueueReadBuffer RES failed.
ERROR from tf_class.[/CODE]So, my requests are:
Does anyone see anything obvious (or even not so obvious)?
Does anyone have an mfakto.ini file that works for an NVIDIA GPU matching, or similar to, any of the following, and is willing to share it?
Quadro 2000
Quadro 4000
GTX 480
GTX 1050Ti
GTX 1060 3gb
GTX 1070

Thanks in advance for any help!

James Heinrich 2018-01-20 15:52

[QUOTE=xx005fs;477957]Does this mean that the lower the number, the more efficient the conversion, right?[/QUOTE]Correct. For example, a GTX 1080 @ 1607MHz stock clock with 8228 GFLOPS should give: 8228 / 8.1 = 1015.8 GHz-days/day.
The GTX 580 (Compute 2.0) was top-class in its day; by today's standards it didn't have a lot of GFLOPS (1581), but it was very efficient [3.6 vs 10.5] at converting them to GHz-days (433). In comparison, the GTX 680 (Compute 3.0) had twice the GFLOPS (3090) but was much less efficient at converting them to useful-to-us work, so you actually ended up with (much) lower throughput (294).
The differences between generations are less dramatic for AMD.
Note: the provided numbers are for single-precision GFLOPS.
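To make the 580-vs-680 comparison concrete, the ratios can be written out (same figures as above; plain arithmetic, nothing mfakto-specific):

```python
# GTX 680 vs GTX 580: roughly double the raw SP GFLOPS, but nearly triple
# the GFLOPS needed per GHz-day/day, so net throughput actually drops.
gflops_ratio = 3090 / 1581   # ~1.95x the SP GFLOPS
factor_ratio = 10.5 / 3.6    # ~2.92x more GFLOPS needed per GHz-day/day
throughput_ratio = gflops_ratio / factor_ratio
print(f"throughput ratio ~= {throughput_ratio:.2f}")  # ~0.67
```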

kriesel 2018-01-20 16:55

[QUOTE=James Heinrich;477981]...
Note: provided numbers are for single-precision GFLOPS.[/QUOTE]

Right, from what I've read, GPU SP performance is what matters for trial factoring (mfakto, mfaktc).

Conversely, GPU DP performance is what matters for primality testing and related code and algorithms: LL, PRP3, P-1 (CUDALucas, clLucas, GpuOwL, CUDAPm1, ...).

xx005fs 2018-01-20 20:03

[QUOTE=James Heinrich;477981]Correct. For example, a GTX 1080 @ 1607MHz stock clock with 8228 GFLOPS should give: 8228 / 8.1 = 1015.8 GHz-days/day.
The GTX 580 (Compute 2.0) was top-class in its day; by today's standards it didn't have a lot of GFLOPS (1581), but it was very efficient [3.6 vs 10.5] at converting them to GHz-days (433). In comparison, the GTX 680 (Compute 3.0) had twice the GFLOPS (3090) but was much less efficient at converting them to useful-to-us work, so you actually ended up with (much) lower throughput (294).
The differences between generations are less dramatic for AMD.
Note: the provided numbers are for single-precision GFLOPS.[/QUOTE]

Is there ever going to be better optimisation for current-generation AMD GPUs such as GCN 1.2, 1.3, and 1.5? Especially for GCN 1.5, where half precision might be exploited to achieve impressive numbers (not sure if that's possible).

xx005fs 2018-01-20 20:04

[QUOTE=kriesel;477987]Right, from what I've read, GPU SP performance is what matters for trial factoring (mfakto, mfaktc).

Conversely, GPU DP performance is what matters for primality testing and related code and algorithms; LL, PRP3, P-1. (CUDALucas, clLucas, GpuOwL, CUDAPm1, ...)[/QUOTE]

I don't really think DP performance matters that much for GpuOwL, because on my GPU, overclocking the HBM memory makes it a lot faster and more efficient. Not 100% sure why, though.

