mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

R. Gerbicz 2019-01-01 22:49

I used this mfakto source: [url]https://github.com/Bdot42/mfakto[/url].
The most important question: is it possible to run mfakto on an Intel HD 530 GPU?
So far I have had no success. Even the CPU version was a little challenging to compile (I used Visual Studio 2017 [Community edition] on Windows 10 Home for this, but I also have Ubuntu; this slow GPU is the only one in this desktop PC). The GPU really does work with OpenCL: I have successfully run this non-trivial code on it: [url]https://gist.github.com/ddemidov/2925717[/url].

There were 2 errors when compiling. One was at mfakto.cpp line 855 (you could simply delete the whole line, it was only a cerr); the other was similar.
I also replaced line 738 of mfakto.cpp with: strcat(program_options, " -I src ");// -O3
so the optimization flag is not used and a folder is added to the include path. I also noticed that the taskbar showed the build as a 32-bit build (why??), and as you can see the speed is about 6.5 GHz-days/day (the usage was high, close to 100%).

Do you see anything suspicious?

[CODE]
mfakto 0.15pre6-Win (64bit build)


Runtime options
Inifile mfakto.ini
Verbosity 3
SieveOnGPU yes
MoreClasses yes
GPUSievePrimes 81157
GPUSieveProcessSize 24Ki bits
GPUSieveSize 96Mi bits
FlushInterval 0
WorkFile worktodo.txt
ResultsFile results.txt
Checkpoints enabled
CheckpointDelay 300s
Stages enabled
StopAfterFactor class
PrintMode compact
V5UserID none
ComputerID none
ProgressHeader "Date Time | class Pct | time ETA | GHz-d/day Sieve Wait"
ProgressFormat "%d %T | %C %p%% | %t %e | %g %s %W%%"
TimeStampInResults yes
VectorSize 2
GPUType AUTO
SmallExp no
UseBinfile mfakto_Kernels.elf
Compiletime options

Select device - GPU not found, fallback to CPU.
Get device info:
Device 1/1: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Intel(R) Corporation),
device version: OpenCL 2.1 (Build 10), driver version: 7.0.0.2567
Extensions: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_dx9_media_sharing cl_intel_dx9_media_sharing cl_khr_d3d11_sharing cl_khr_gl_sharing cl_khr_fp64 cl_khr_image2d_from_buffer
Global memory:17048297472, Global memory cache: 262144, local memory: 32768, workgroup size: 8192, Work dimensions: 3[8192, 8192, 8192, 0, 0] , Max clock speed:3400, compute units:8

OpenCL device info
name Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Intel(R) Corporation)
device (driver) version OpenCL 2.1 (Build 10) (7.0.0.2567)
maximum threads per block 8192
maximum threads per grid 0
number of multiprocessors 8 (8 compute elements)
clock rate 3400MHz

Automatic parameters
threads per grid 0
optimizing kernels for CPU

Loading binary kernel file mfakto_Kernels.elf
Compiling kernels (build options: "-I. -DVECTOR_SIZE=2 -DCPU -I src -DMORE_CLASSES -DCL_GPU_SIEVE").
BUILD OUTPUT
Device build started
Device build done
Reload Program Binary Object.
END OF BUILD OUTPUT
Error 0 (Success): clBuildProgram

GPUSievePrimes (adjusted) 81206
GPUsieve minimum exponent 1037054
Started a simple selftest ...
Selftest statistics
number of tests 30
successful tests 30

selftest PASSED!

got assignment: exp=3321932839 bit_min=75 bit_max=76 (2.30 GHz-days)
Starting trial factoring M3321932839 from 2^75 to 2^76 (2.30GHz-days)
k_min = 5686287725580 - k_max = 11372575453491
Using GPU kernel "cl_barrett32_77_gs_2"

Found a valid checkpoint file.
last finished class was: 92
found 0 factors already

RES (32): <32 x 0>
<30 x 0 at the end>
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
RES (32): <32 x 0> 2.1% | 31.804 8h18m | 6.52 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.2% | 32.015 8h21m | 6.48 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.3% | 31.526 8h12m | 6.58 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.4% | 32.096 8h21m | 6.46 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.5% | 31.868 8h17m | 6.51 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.6% | 31.999 8h18m | 6.48 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.7% | 31.979 8h17m | 6.48 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.8% | 32.132 8h19m | 6.45 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 2.9% | 31.598 8h10m | 6.56 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.0% | 31.559 8h09m | 6.57 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.1% | 32.066 8h17m | 6.47 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.2% | 31.709 8h10m | 6.54 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.3% | 31.563 8h08m | 6.57 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.4% | 31.966 8h13m | 6.49 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.5% | 31.824 8h11m | 6.51 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.6% | 31.746 8h09m | 6.53 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.8% | 31.826 8h10m | 6.51 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 3.9% | 31.730 8h08m | 6.53 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 4.0% | 31.648 8h06m | 6.55 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 4.1% | 31.709 8h06m | 6.54 81206 0.00%
<30 x 0 at the end>
RES (32): <32 x 0> 4.2% | 31.641 8h05m | 6.55 81206 0.00%
<30 x 0 at the end>
[/CODE]
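
The progress lines in the log above can be checked against each other with a little arithmetic. The following is only a sketch: it assumes that with MoreClasses enabled the k range is split into 4620 classes, of which 960 actually have to be tested (an assumption about mfakto's class scheme, not something the log states), and it takes the 2.30 GHz-days credit and the roughly 31.8 s per class from the log. The function names are made up for illustration.

```python
# Sanity check of the progress lines, under the assumptions stated above.
CLASSES_TO_TEST = 960      # assumed: classes left out of 4620 with MoreClasses
SECONDS_PER_DAY = 86400


def ghz_days_per_day(credit_ghz_days, seconds_per_class, classes=CLASSES_TO_TEST):
    """Throughput implied by the time spent on a single class."""
    total_seconds = classes * seconds_per_class
    return credit_ghz_days * SECONDS_PER_DAY / total_seconds


def eta_seconds(percent_done, seconds_per_class, classes=CLASSES_TO_TEST):
    """Remaining run time if every remaining class takes the same time."""
    remaining_classes = classes * (1.0 - percent_done / 100.0)
    return remaining_classes * seconds_per_class


rate = ghz_days_per_day(2.30, 31.804)   # values from the first progress line
eta = eta_seconds(2.1, 31.804)
hours, minutes = int(eta // 3600), int(eta % 3600 // 60)
print(f"{rate:.2f} GHz-d/day, ETA {hours}h{minutes:02d}m")
```

Under these assumptions the result is about 6.5 GHz-d/day with a little over 8 hours remaining, in line with the 6.52 and 8h18m columns of the log, so the reported speed at least looks internally consistent for a CPU run.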

kriesel 2019-01-02 01:01

[QUOTE=R. Gerbicz;504628]Used this mfakto source: [URL]https://github.com/Bdot42/mfakto[/URL].
Is it possible to run mfakto on Intel HD 530 gpu ? The most important issue.
So far I had no success, even for cpu version it was a little challenging to compile (used Visual Studio 2017 [community version] with win10 home edition for this, but I have also Ubuntu; and only this slow GPU on this desktop pc). The GPU is really working with OpenCL, I have successful run on this non-trivial code: [URL]https://gist.github.com/ddemidov/2925717[/URL] .
[/QUOTE]
I don't have access to an HD530. The precompiled executables available at James Heinrich's mirror site have worked for me, on Windows, on RX480, RX550, HD620, and UHD630.
[URL]https://download.mersenne.ca/[/URL] (get mfakto 0.15pre6)
The IGP and CPU are different devices as far as OpenCL is concerned and should be selectable by different device numbers on the command line, e.g. mfakto -d 11 >>mfakto.txt. You might want to run opencl-z.exe on Windows to check that it lists the IGP; if it does not, the required Intel OpenCL driver may be missing. [URL]https://sourceforge.net/projects/opencl-z/[/URL]
When I see "gpu not found, falling back to cpu" it's generally an issue with device identification or the opencl driver for the gpu.

R. Gerbicz 2019-01-02 17:11

[QUOTE=kriesel;504640]
When I see "gpu not found, falling back to cpu" it's generally an issue with device identification or the opencl driver for the gpu.[/QUOTE]

Thanks, that was the problem (old driver). I now get roughly 19 GHz-days/day on the GPU.

Rodrigo 2019-01-02 23:21

@kriesel and @SELROC, I checked the system up and down, inside and out. There was nothing obvious going on that would account for the stuttering and choppiness: CPU and RAM usage were normal (0-2% CPU when Prime95 not running), disk (SSD) is healthy, all hardware tests came back fine.

So I reverted to an earlier GPU driver for the 7770 (12.104.0.0). With MFAKTO running, now I can move the mouse around and open program windows with less delay or hesitation. The hesitation is not completely gone, but at least I can track the mouse pointer more accurately. I think there's something about the newest (or all newer?) AMD drivers for this card that led to the problem. However, the situation is not back to where it had been previously, where there was no perceptible delay even with MFAKTO running.

With the older driver, TF throughput is somewhat higher (134-135 GHz-days/day, vs. 130-132 with the newer driver).

kriesel 2019-01-03 01:38

Glad to read you got it sorted out.

In gpuowl, I've seen later releases require a driver upgrade, and I've seen the new driver reduce the performance of a previous gpuowl version by up to 5%.

SELROC 2019-01-03 09:47

[QUOTE=kriesel;504726]Glad to read you got it sorted out.

In gpuowl, I've seen later releases require a driver upgrade, and diminished performance with the new driver on a previous version (by up to 5% reduction)[/QUOTE]


You use the same GPU for both computation and rendering. You could do another test: move the mouse quickly, open programs, and read the ms/it or ms/sq from gpuowl or mfakto. If they interfere with each other, you should see the ms/it change a little.

kriesel 2019-01-03 22:09

[QUOTE=SELROC;504744]You use the same GPU for both computations and rendering...you could do another test...move the mouse quickly open programs, and read out the ms/it or ms/sq from gpuowl or mfakto, if they interfere with each other you should see it the ms/it changing a few.[/QUOTE]
The 5% reduction was seen in V1.9 gpuowl, after upgrading the driver to allow running V2.0. The measurements were made with as little user activity as practical. It was also likely done via a remote desktop session. [URL]https://www.mersenneforum.org/showpost.php?p=484694&postcount=370[/URL]
A later driver update also slowed it down, but only by a fraction of a percent.

Two experiments were made today to check to what extent user activity via remote desktop affects GPU computing throughput, using some fairly maximal user activity.

a) Click a gpuowl console box and drag it continuously about the screen for a few minutes while two GPUs run gpuowl and mfakto. Note the start time of the dragging from the gpuowl timestamp in the screen output. Check ms/iter in gpuowl and GHz-d/day in mfakto for that time period. Also check the prime95 iteration times in its 4 workers. Result: the GPU throughput change, if any, is within the normal fluctuation (7.65-7.69 ms/it in gpuowl 3.8; 91.49 to 91.63 GHz-d/day in mfakto). The prime95 worker one iteration time increased from ~37.2 ms/iter to 38.5 (~3.5% higher) during the 8:06-8:10 local time period of high remote console user activity; the other prime95 workers' timings remained within the typical 1% fluctuation.

b) Halt gpuowl and mfakto, and verify in the GPU-Z sessions for both GPUs that GPU utilization has gone to zero. Resume dragging the gpuowl box while watching the GPU-Z sessions for both GPUs for 30 seconds (several GPU-Z update periods). Result: the utilization indicated by GPU-Z remains zero on both GPUs.
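
The percentages quoted in experiment a) can be reproduced from the raw timings. This is just a sketch of the arithmetic, using only the numbers given in the post; the helper name is made up for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# prime95 worker one, ms/iter before and during the dragging
prime95_slowdown = percent_change(37.2, 38.5)

# Width of the observed GPU ranges relative to their lower end
gpuowl_spread = percent_change(7.65, 7.69)      # ms/it
mfakto_spread = percent_change(91.49, 91.63)    # GHz-d/day

print(f"prime95 worker one: +{prime95_slowdown:.1f}%")
print(f"gpuowl spread: {gpuowl_spread:.2f}%, mfakto spread: {mfakto_spread:.2f}%")
```

Both GPU ranges are well under 1% wide, while the CPU worker slows by about 3.5%, which is the basis for saying the GPU change is within normal fluctuation.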

SELROC 2019-01-03 22:35

Maybe moving the mouse and clicking is not enough; try running Quake 3 while you run mfakto or gpuowl :-)

R. Gerbicz 2019-01-03 22:44

Hm, I have been reading the code and still don't know where you sieve by p>11. I confirmed that the code for it is in SegSieve in gpusieve.cl, but as far as I can see it is never even called in the runs. I tried, say, replacing line 1296 with big_bit_array32[j * threadsPerBlock + get_local_id(0)]=0; (this would eliminate all k values), or placing Visual Studio breakpoints on the first line of SegSieve, or even using
[CODE]
#define TRACE_SIEVE_KERNEL 5
// If above tracing is on, only the thread with the ID below will trace
#define TRACE_SIEVE_TID 2
[/CODE]
In every case we still pass the selftest, and we see no break and no additional debug info. Furthermore, modifying the default GPUSievePrimes=81157 gives different times, suggesting that we're really sieving, but where?

One more thing, which I've also seen reported on this forum: at run time it displays 0 as the automatic parameter for threads per grid:
[CODE]
OpenCL device info
name Intel(R) HD Graphics 530 (Intel(R) Corporation)
device (driver) version OpenCL 2.1 NEO (25.20.100.6471)
maximum threads per block 256
maximum threads per grid 16777216
number of multiprocessors 24 (24 compute elements)
clock rate 1150MHz

Automatic parameters
threads per grid 0
optimizing kernels for INTEL
[/CODE]
I still have some ideas (not that many) for improving the current code, but that is somewhat hard without understanding the basics of the code.
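
For background on why the question is about p>11 specifically, here is a minimal sketch of the usual class scheme for Mersenne trial factoring. It is not taken from mfakto's source: it assumes that candidates q = 2kp+1 are grouped by k mod 4620 (4620 = 4*3*5*7*11), that whole classes are dropped when every q in them is divisible by 3, 5, 7 or 11 or is not ±1 mod 8, and that k_min is rounded down to a multiple of 4620. All names are hypothetical.

```python
SMALL_PRIMES = (3, 5, 7, 11)
NUM_CLASSES = 4 * 3 * 5 * 7 * 11   # 4620, assumed class count with MoreClasses


def class_survives(p, c):
    """True if q = 2*k*p + 1 with k % NUM_CLASSES == c can still divide 2^p - 1."""
    # q mod 8 and q mod 3, 5, 7, 11 depend only on k mod 4620,
    # so the class number c itself serves as a representative k.
    q = 2 * c * p + 1
    if q % 8 not in (1, 7):          # factors of Mersenne numbers are +-1 mod 8
        return False
    return all(q % r != 0 for r in SMALL_PRIMES)


def surviving_classes(p):
    """Classes that remain after the filter above."""
    return [c for c in range(NUM_CLASSES) if class_survives(p, c)]


def first_k(p, bit_min):
    """Largest multiple of NUM_CLASSES not above 2^bit_min / (2p)."""
    k = (1 << (bit_min - 1)) // p
    return k - k % NUM_CLASSES


p = 3321932839                        # exponent from the log in the first post
classes = surviving_classes(p)
print(len(classes), "of", NUM_CLASSES, "classes survive")
print("k_min =", first_k(p, 75))
```

With this filter 960 of the 4620 classes remain for a prime exponent such as this one, and within a surviving class no candidate is divisible by a prime up to 11, so a sieve kernel only has to deal with primes from 13 upward. The computed k_min can be compared with the value printed in the first post. Whether mfakto's SegSieve is organised exactly this way is not confirmed here.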

kriesel 2019-01-04 01:03

[QUOTE=SELROC;504830]May be moving the mouse and clicking is not enough, try to run Quake 3 while you run mfakto or gpuowl :-)[/QUOTE]
"Moving the mouse and click" is not what I did. Click on the gpuowl console window and drag the whole thing around the screen in an elliptical pattern, with full display (not just outline box) while dragging; hundreds of thousands of pixels being updated at frame rate would have been the impact if doing it on the local display & gpu. I regularly get 5-15% gpu utilization on one gpu when doing that very same thing on the local display. It's one of the more drastic demands an interactive user might put on a Mersenne compute system's display as a display activity. It's the demanding opposite of how to treat a system running a benchmark after changing drivers to compare their performance.

The experiment showed that the display workload shifts from the gpu on the host system, to its cpu running the rdp server service, and to the remote desktop client's gpu, when using remote desktop.

Hence normal, low-impact user activity, which you suggested some posts back as the explanation for the difference in driver performance, seems to be ruled out.

SELROC 2019-01-04 03:22

You are still reading old messages, but you didn't run Quake 3 :-)

For what it's worth, I don't trust Windows as a secure and stable operating system. Sorry, it is just personal experience.

