mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2017-05-28 13:50

[QUOTE=LaurV;459911]After endless struggle we succeeded in building our own owl.[/QUOTE]

Congrats for building :)

I think one way is to use MSYS2 [url]http://www.msys2.org/[/url] which may be a bit friendlier.

I do think there's a Ctrl-C problem on Windows/msys or something. I don't know exactly why, or what the fix is -- I'd like to find out more. It seems Ctrl-C works... after a while, or from time to time, or something.

LaurV 2017-05-28 14:04

Thanks. Now (after your new commit) it also makes sense why the lines were skipped... (the expo limits changed; we figured out as much in the meantime from the source code in worktodo.h). :davieddy:

Working right now. Testing 76000117 (DC-ing). We'll keep in touch. :smile:

kracker 2017-05-28 23:37

1 Attachment(s)
[QUOTE=preda;459908]I just submitted changes to avoid [most of] the performance penalty when the offset is zero (-offset 0 on exponent start). So even with v0.3, you should see now the same perf as with v0.2 when not using offset.[/QUOTE]

Back to as before now. :bow:

Windows binary from the latest commit as of now:

preda 2017-06-06 00:00

v0.4
 
In v0.4 I did a trade-off between memory-bandwidth and compute; in the transposition kernels, the trigonometric table is now much smaller at the cost of 2 additional complex multiplications. This improves performance in situations where memory bandwidth was the bottleneck. A side effect is slightly reduced accuracy (thus, increased round-off error).
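
The smaller trig table works by factoring each root of unity into entries from much smaller tables, at the cost of extra complex multiplies (two extra in v0.4, i.e. a three-level split). A minimal sketch of the two-level version of the idea, assuming a table size that is a perfect square -- illustrative only, not GpuOwL's actual kernel code:

```python
import cmath
import math

def twiddle_full(N):
    # Direct table: all N complex roots of unity, O(N) memory.
    return [cmath.exp(-2j * math.pi * k / N) for k in range(N)]

def twiddle_split(N):
    # Split tables: O(sqrt(N)) memory, one extra complex multiply per
    # lookup. A three-level split (two extra multiplies) shrinks the
    # tables further in the same way.
    W = math.isqrt(N)
    assert W * W == N, "sketch assumes N is a perfect square"
    coarse = [cmath.exp(-2j * math.pi * (k * W) / N) for k in range(W)]
    fine = [cmath.exp(-2j * math.pi * k / N) for k in range(W)]
    # exp(-2*pi*i*k/N) == coarse[k // W] * fine[k % W]
    return lambda k: coarse[k // W] * fine[k % W]

N = 256
full = twiddle_full(N)
lookup = twiddle_split(N)
err = max(abs(full[k] - lookup(k)) for k in range(N))
print(err)  # near machine epsilon: the extra multiply adds a little round-off
```

The printed error is the "slightly reduced accuracy" mentioned above: each composed twiddle picks up roughly one extra rounding step compared to a direct table lookup.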

This improves performance on "slow-memory" GPUs, in particular Polaris cards (rx580, rx480) should see ~8% perf increase. On the "fast-memory" GPUs (e.g. Fury series) the perf increase is much lower at ~2%. (but it did move my FuryX to under 2ms/it :)

kracker 2017-06-11 19:11

1 Attachment(s)
Not sure exactly what's changed, but here's the latest (v0.5) binaries.

Also, I've noticed something interesting (and I'm not sure why I didn't notice this a long time ago :davieddy: ): when an assignment from worktodo.txt finishes, it doesn't remove that line from the file. Also to note, I've never put multiple assignments there, just for the record.

preda 2017-06-12 10:19

I tried to fix the worktodo.txt delete (though I couldn't repro, thus I didn't verify the fix).

In v0.5 I try a new amalgamation kernel, which reduces memory roundtrips by 2 (3 kernels merged into 1), but because of the poor OpenCL compiler the VGPR usage becomes > 128, thus occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4) I'll add a new "legacy" option to allow use of the previous kernels.
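
Back-of-envelope for why crossing 128 VGPRs hurts, assuming a GCN-style GPU (the 256-entry register file per SIMD lane, 4-register allocation granularity, and 10-wave cap below are GCN specifics I'm assuming, not stated in the post):

```python
def gcn_waves_per_simd(vgprs_per_thread):
    # Each GCN SIMD has a 256-entry VGPR file per lane; a wavefront's
    # VGPR allocation is rounded up to a granularity of 4 registers,
    # and at most 10 wavefronts can be resident per SIMD.
    alloc = -(-vgprs_per_thread // 4) * 4  # round up to multiple of 4
    return min(10, 256 // alloc)

print(gcn_waves_per_simd(128))  # 2 waves resident per SIMD
print(gcn_waves_per_simd(132))  # 1 wave: occupancy halved past 128 VGPRs
```

With only one resident wave the SIMD has nothing to switch to while waiting on memory, which is why the merged kernel can end up slower despite doing fewer memory roundtrips.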

kracker 2017-06-14 03:54

[QUOTE=preda;461098]I tried to fix the worktodo.txt delete (though I couldn't repro, thus I didn't verify the fix).

In v0.5 I try a new amalgamation kernel, which reduces memory roundtrips by 2 (3 kernels merged into 1), but because of the poor OpenCL compiler the VGPR usage becomes > 128, thus occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4) I'll add a new "legacy" option to allow use of the previous kernels.[/QUOTE]

Compared to 0.4, 0.5 is running pretty steadily at 4.88 ms/iter versus 4.8 ms/iter, so slightly slower (R9 285).

kriesel 2017-07-12 16:49

GPU memory reliability
 
[QUOTE=Mark Rose;457528]I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.[/QUOTE]

I'll add another anecdote to that: one of my faster GPUs, bought used, initially passed the CUDALucas memtest on 250 MB. I went back later, after seeing its reliability degrade to the point that neither CUDALucas nor CUDAPm1 was reliable, and memtested as much VRAM as possible. Its default is a slight overclock. Stock clock, and any amount of underclock, did not cure or even noticeably reduce the memory errors observed in memtest. Errors were always observed in blocks 23-40 of the 25 MB test blocks.

Lessons I took from that are:
- Test fully;
- retest fully;
- track reliability of individual hardware versus time;
- built-in self-test features of number theory software are very useful.

kriesel 2017-07-12 20:36

What's next?
 
[QUOTE=preda;461098]I tried to fix the worktodo.txt delete (though I couldn't repro, thus I didn't verify the fix).

In v0.5 I try a new amalgamation kernel, which reduces memory roundtrips by 2 (3 kernels merged into 1), but because of the poor OpenCL compiler the VGPR usage becomes > 128, thus occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4) I'll add a new "legacy" option to allow use of the previous kernels.[/QUOTE]

Just learned of your effort recently. This is great progress you've made. What's next: 8M FFT length? 6M? 3M? A well-deserved break?

Thanks for the GpuOwL links, I've added them to the available software table.

Can you think of any reason GpuOwL would not work on Intel integrated graphics processors with OpenCL support?

kriesel 2017-07-13 03:31

illegal or repeating residues
 
[QUOTE=airsquirrels;459512] It is worth noting that we should cause gpuowl to fail if it reaches 00...0002 or all zero at any point in the calculation. One of the most interesting results I have had lately hit the 00...02 residue at one point. What made this most interesting is that this was on a FirePro W8100 which should be more resilient than the typical card due to ECC memory and better binning.[/QUOTE]

Amen to the known-bad-residue-values check. 0x0 is permitted as the last result; otherwise it is unexpected. Either 0 or 2 occurring and then repeating monotonously is bad news: it should be detected and warned on, and perhaps the exponent halted, with any following work in worktodo.txt begun or resumed as applicable. Another repeating value I've seen in CUDALucas is 0xfffffffffffffffd, along with much faster than expected iteration times.
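
A minimal sketch of such a check -- the specific values come from the anecdotes here (0x0, 0x2, and CUDALucas's 0xfffffffffffffffd); the function name and return values are hypothetical, and a repeated low-64-bit residue is only flagged as suspect rather than treated as a proven error:

```python
# Known error signatures from the discussion; illustrative sketch only.
SUSPECT_RESIDUES = {0x0, 0x2, 0xfffffffffffffffd}

def classify_residue(res64, prev_res64, is_final_iteration):
    """Classify the low 64 bits of an LL residue after one iteration."""
    if res64 == 0:
        # All-zero is the expected *final* residue of a Mersenne prime;
        # anywhere else it signals an error.
        return "prime!" if is_final_iteration else "suspect"
    if res64 in SUSPECT_RESIDUES:
        return "suspect"
    if res64 == prev_res64:
        # A repeat of the previous iteration's low 64 bits is flagged;
        # warning (rather than halting outright) seems safest.
        return "repeat"
    return "ok"

print(classify_residue(0x2, 0x194, False))  # suspect
print(classify_residue(0x0, 0x5, True))     # prime!
```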

Is a least-significant-64-bit residue matching that of the previous iteration ever valid? I think it would be useful to detect and warn on any successive iteration match. As long as the maximum 64-bit residue value is large compared to the exponent, the chance of a repeat or a premature zero is low.
M11 needs at least 5 bits.

 i  residue  binary
 0        4  0100
 1       14  1110
 2      194  1100 0010
 3      788  11 0001 0100
 4      701  10 1011 1101
 5      119  111 0111
 6     1877  111 0101 0101
 7      240  1111 0000
 8      282  1 0001 1010
 9     1736  110 1100 1000
not prime
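
The residue table above comes straight from the Lucas-Lehmer recurrence s_0 = 4, s_{k+1} = s_k^2 - 2 (mod Mp), and a few lines of Python reproduce it:

```python
def lucas_lehmer_residues(p):
    """Lucas-Lehmer test of Mp = 2^p - 1: s_0 = 4, s_{k+1} = s_k^2 - 2
    (mod Mp). Mp is prime iff the last residue, s_{p-2}, is zero."""
    M = (1 << p) - 1
    s = 4
    residues = [s]
    for _ in range(p - 2):
        s = (s * s - 2) % M
        residues.append(s)
    return residues, residues[-1] == 0

res, is_prime = lucas_lehmer_residues(11)
print(res)       # [4, 14, 194, 788, 701, 119, 1877, 240, 282, 1736]
print(is_prime)  # False: M11 = 2047 = 23 * 89 is composite
```

The final residue 1736 is nonzero, hence "not prime" above; for a prime exponent like p = 13 the same function ends at zero.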

kriesel 2017-07-17 03:03

First try on Intel iGP of GpuOwL
 
GpuOwL (4M flavor) trial on an i7-7500U system

An i7-7500U has two CPU cores with hyperthreading, plus one HD620 iGP.
[URL]http://ark.intel.com/products/95451/Intel-Core-i7-7500U-Processor-4M-Cache-up-to-3_50-GHz-[/URL]

GpuOwL took approx 19 seconds to get started and launch the first selftest exponent.
It identified and listed 4 devices.
Zero and two are identified as 24x1050MHz Intel(R) HD Graphics 620; OpenCL 2.1.
One and three are identified as 4x2700MHz Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz; OpenCL 2.1 (build 2).

It raises the HD620 clock rate from the 300MHz idle value to 950-1000MHz.
Selftest cases run at iteration times of 140-145 msec/iteration.
Prime95 throughput drops by about half during this use of the iGP by GpuOwL
(a prime95 43M LL test goes from 16 msec/iteration to 30-33 msec/iteration; the other prime95 worker, running ECM on 16.9M, goes from 96 seconds per screen output to 152 seconds per output).

The 15W TDP total was not changed, but 8.4 watts of it is absorbed by the iGP, dropping CPU frequency to 1.6GHz and lowering CPU wattage (per CPUID HWMonitor and TechPowerUp GPU-Z).
Other elements are Uncore 1.5W, DRAM 1.6W, IA cores 5.2W.

Task Manager shows prime95 dropping from ~72% CPU load before launch of GpuOwL to about 40% during GpuOwL operation.
This is not due to CPU load from GpuOwL itself: the GpuOwL process shows no CPU usage in Task Manager (occasionally a fraction of a percent).
The drop in prime95 CPU usage is presumably due to some combination of TDP management and contention for memory bandwidth, since the iGP uses shared system RAM.

Launching a second instance of GpuOwL in a separate directory near the end of the selftest (two exponents left) did not noticeably change prime95 CPU usage.
There was no discernible difference in HD620 clock rate, power consumption, or Prime95 throughput between one and two instances. I found later that the instance started earlier had accidentally been put into the console's Select mode. Once that was cleared, the two instances ran simultaneously at half speed: 140-145 msec/iteration with a single GpuOwL running, and approx 299 msec/iteration with two. So with regard to prime95, there was still no discernible difference between two and one instance of GpuOwL after clearing the Select status.

Pilot error? May try it on a different older system too.

