mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
2017-05-28, 13:50   #166
preda ("Mihai Preda")

Quote:
Originally Posted by LaurV
After endless struggle we succeeded in building our own owl.
Congrats on building it :)

I think one way is to use MSYS2 (http://www.msys2.org/), which may be a bit friendlier.

I do think there's a Ctrl-C problem on Windows/MSYS. I don't know exactly why, or what the fix is -- I'd like to find out more. It seems Ctrl-C works only after a while, or intermittently.
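For reference, the usual pattern for clean Ctrl-C handling is to set a flag in the signal handler and let the iteration loop notice it, save a checkpoint, and exit. A minimal Python sketch of that pattern (GpuOwl itself is C++, and this is not its actual code; the checkpoint step is only indicated by a comment):

```python
import signal

stop_requested = False

def on_sigint(signum, frame):
    # Defer shutdown: just set a flag; the main loop checkpoints and exits.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, on_sigint)

def run(iterations):
    done = 0
    for _ in range(iterations):
        if stop_requested:
            break  # a real program would write a checkpoint here
        done += 1  # one LL iteration would go here
    return done

print(run(1000))  # 1000 when not interrupted
```

The point of the flag is that the handler itself does almost nothing, so an interrupt can never corrupt a half-written checkpoint.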
2017-05-28, 14:04   #167
LaurV

Thanks. Now (after your new commit) it also makes sense why the lines were skipped (the exponent limits changed; we had figured out as much in the meantime from the source code in worktodo.h).

Working right now. Testing 76000117 (DC-ing). We'll keep in touch.

2017-05-28, 23:37   #168
kracker ("Mr. Meeseeks")

Quote:
Originally Posted by preda
I just submitted changes to avoid [most of] the performance penalty when the offset is zero (-offset 0 on exponent start). So even with v0.3, you should now see the same perf as with v0.2 when not using an offset.
Performance is back to what it was before.

Windows binary from the latest commit as of now:
Attachment: gpuowl-v0.3-2679475.zip (288.1 KB)
2017-06-06, 00:00   #169
preda ("Mihai Preda")
v0.4

In v0.4 I made a trade-off between memory bandwidth and compute: in the transposition kernels, the trigonometric table is now much smaller, at the cost of 2 additional complex multiplications. This improves performance in situations where memory bandwidth was the bottleneck. A side effect is slightly reduced accuracy (thus increased round-off error).
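The trade-off described here can be sketched numerically: instead of one full table of N roots of unity, keep two tables of about sqrt(N) entries each and reconstruct each twiddle factor with an extra complex multiply, at a tiny accuracy cost. A Python illustration (names and sizes are my own, not GpuOwl's; the real kernels do this in OpenCL):

```python
import cmath, math

def direct_table(n):
    # full table: n precomputed roots of unity w^k = exp(-2*pi*i*k/n)
    return [cmath.exp(-2j * math.pi * k / n) for k in range(n)]

def split_tables(n):
    # two tables of ~sqrt(n) entries: w^k = hi[k // m] * lo[k % m]
    m = math.isqrt(n)
    while n % m:
        m -= 1  # pick a divisor of n for simplicity
    hi = [cmath.exp(-2j * math.pi * (j * m) / n) for j in range(n // m)]
    lo = [cmath.exp(-2j * math.pi * k / n) for k in range(m)]
    return hi, lo, m

n = 4096
full = direct_table(n)
hi, lo, m = split_tables(n)
# reconstruction costs one extra complex multiply and a little round-off
max_err = max(abs(full[k] - hi[k // m] * lo[k % m]) for k in range(n))
print(len(hi) + len(lo), "entries instead of", n, "; max error", max_err)
```

Here 128 table entries replace 4096, which is why the change helps most when memory bandwidth, not ALU throughput, is the bottleneck.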

This improves performance on "slow-memory" GPUs; in particular, Polaris cards (RX 580, RX 480) should see an ~8% perf increase. On "fast-memory" GPUs (e.g. the Fury series) the increase is much smaller, ~2% (but it did move my Fury X to under 2 ms/it :)
2017-06-11, 19:11   #170
kracker ("Mr. Meeseeks")

Not sure exactly what's changed, but here are the latest (v0.5) binaries.

Also, I've noticed something interesting (and I'm not sure why I didn't notice this long ago): when an assignment from worktodo.txt finishes, it doesn't remove that line from the file. For the record, I've never put multiple assignments there.
Attachment: gpuowl-v0.5-92b94b8.zip (289.0 KB)

2017-06-12, 10:19   #171
preda ("Mihai Preda")

I tried to fix the worktodo.txt line deletion (though I couldn't reproduce the issue, so I haven't verified the fix).

In v0.5 I try a new amalgamated kernel, which reduces memory round-trips by 2 (3 kernels merged into 1); but because of the poor OpenCL compiler the VGPR usage goes above 128, so occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4), I'll add a "legacy" option to allow using the previous kernels.
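The effect of merging kernels can be illustrated in miniature: each separate elementwise pass costs a full read and write of the array, while the fused version does the same arithmetic in one round-trip. A deliberately simplified Python sketch (not the actual GpuOwl kernels, whose passes are FFT stages rather than elementwise ops):

```python
# Three separate "kernels": each intermediate result would go out to
# device memory and come back for the next pass.
def pass_a(x): return [v + 1 for v in x]
def pass_b(x): return [v * 2 for v in x]
def pass_c(x): return [v - 3 for v in x]

def unfused(x):
    # three launches, three full memory round-trips
    return pass_c(pass_b(pass_a(x)))

def fused(x):
    # one launch: same arithmetic, one read and one write per element
    return [(v + 1) * 2 - 3 for v in x]

data = list(range(8))
assert unfused(data) == fused(data)  # identical results, fewer round-trips
```

The catch preda mentions is that the fused kernel needs the registers of all three passes at once, and past 128 VGPRs per work-item the GPU can keep fewer wavefronts in flight, which can eat the bandwidth savings.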
2017-06-14, 03:54   #172
kracker ("Mr. Meeseeks")

Quote:
Originally Posted by preda
I tried to fix the worktodo.txt line deletion (though I couldn't reproduce the issue, so I haven't verified the fix).

In v0.5 I try a new amalgamated kernel, which reduces memory round-trips by 2 (3 kernels merged into 1); but because of the poor OpenCL compiler the VGPR usage goes above 128, so occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4), I'll add a "legacy" option to allow using the previous kernels.
0.5 is running pretty steadily at 4.88 ms/iter, compared to 4.80 ms/iter for 0.4, so slightly slower (R9 285).
2017-07-12, 16:49   #173
kriesel
GPU memory reliability

Quote:
Originally Posted by Mark Rose
I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.
I'll add another anecdote to that: one of my faster GPUs, bought used, initially passed the CUDALucas memtest on 250 MB. After seeing its reliability degrade, to the point that neither CUDALucas nor CUDAPm1 was reliable, I went back and memtested as much VRAM as possible. Its default clock is a slight overclock; the stock clock, and any amount of underclock, did not cure or even noticeably reduce the memory errors observed in memtest. Errors were always observed in blocks 23-40 of the 25 MB blocks.

Lessons I took from that:
Test fully; retest fully.
Track the reliability of individual hardware over time.
Built-in self-test features of number-theory software are very useful.

2017-07-12, 20:36   #174
kriesel
What's next?

Quote:
Originally Posted by preda
I tried to fix the worktodo.txt line deletion (though I couldn't reproduce the issue, so I haven't verified the fix).

In v0.5 I try a new amalgamated kernel, which reduces memory round-trips by 2 (3 kernels merged into 1); but because of the poor OpenCL compiler the VGPR usage goes above 128, so occupancy is lowered. Anyway, if I get reports that 0.5 is slower (compared to 0.3/0.4), I'll add a "legacy" option to allow using the previous kernels.
I just learned of your effort recently. This is great progress you've made. What's next: 8M FFT length? 6M? 3M? A well-deserved break?

Thanks for the GpuOwL links; I've added them to the available-software table.

Can you think of any reason GpuOwL would not work on Intel integrated graphics processors with OpenCL support?
2017-07-13, 03:31   #175
kriesel
Illegal or repeating residues

Quote:
Originally Posted by airsquirrels
It is worth noting that we should cause gpuowl to fail if it reaches 00...0002 or all zero at any point in the calculation. One of the most interesting results I have had lately hit the 00...02 residue at one point. What made this most interesting is that this was on a FirePro W8100 which should be more resilient than the typical card due to ECC memory and better binning.
Amen to the known-bad-residue-values check. 0x0 is permitted as the last result; otherwise it is unexpected. Either 0 or 2 occurring and then repeating monotonously is bad news: it should be detected and warned about, and perhaps the exponent halted and any following work in worktodo.txt begun or resumed as applicable. Another repeating value I've seen is 0xfffffffffffffffd, along with much faster than expected iteration times, in CUDALucas.

Is a least-significant-64-bit residue matching that of the previous iteration ever valid? I think it would be useful to detect and warn on any match between successive iterations. As long as the maximum 64-bit residue value is large compared to the exponent, the chance of a legitimate repeat or a premature zero is low.
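The checks proposed above could look roughly like this (a Python sketch under my own assumptions about names and actions; GpuOwl is C++ and would decide for itself what to do on a hit):

```python
# Known-bad 64-bit residues: 0x2 and 0xff..fd come from the posts above;
# 0x0 is only legitimate as the *final* residue of a prime.
BAD = {0x0, 0x2, 0xFFFFFFFFFFFFFFFD}

def check_residue(res64, prev_res64, is_final):
    """Return None if the residue looks sane, else a warning string."""
    if res64 == 0 and is_final:
        return None  # all-zero final residue: the exponent is prime
    if res64 in BAD:
        return "suspicious residue 0x%016x" % res64
    if res64 == prev_res64:
        return "residue repeated from previous iteration"
    return None

assert check_residue(1736, 240, False) is None       # ordinary progress
assert check_residue(0x2, 1736, False) is not None   # known-bad value
assert check_residue(0x0, 5, True) is None           # legitimate final zero
```

Such a check costs nothing per iteration and would catch the stuck-residue failure modes described here long before the end of the test.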
M11 needs at least 5 bits.

 i  residue  binary
 0        4  0100
 1       14  1110
 2      194  1100 0010
 3      788  11 0001 0100
 4      701  10 1011 1101
 5      119  111 0111
 6     1877  111 0101 0101
 7      240  1111 0000
 8      282  1 0001 1010
 9     1736  110 1100 1000
not prime
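The residue column above can be reproduced with the standard Lucas-Lehmer recurrence s_{i+1} = s_i^2 - 2 (mod M11), starting from s_0 = 4; a small Python sketch:

```python
def ll_residues(p, steps):
    m = (1 << p) - 1          # Mersenne number M_p = 2^p - 1
    s, out = 4, [4]
    for _ in range(steps):
        s = (s * s - 2) % m   # Lucas-Lehmer iteration
        out.append(s)
    return out

res = ll_residues(11, 9)      # M11 = 2047
print(res)  # [4, 14, 194, 788, 701, 119, 1877, 240, 282, 1736]
# s_{p-2} = s_9 != 0, so M11 = 23 * 89 is composite
```

M_p is prime exactly when s_{p-2} == 0 mod M_p, which is why a premature zero residue, as discussed above, is a red flag rather than good news.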

2017-07-17, 03:03   #176
kriesel
First try of GpuOwL on an Intel iGP

GpuOwL (4M flavor) trial on an i7-7500U system.

An i7-7500U is two CPU cores with hyperthreading, plus one HD 620 iGP.
http://ark.intel.com/products/95451/...p-to-3_50-GHz-

GpuOwL took approximately 19 seconds to start up and launch the first self-test exponent.
It identified and listed 4 devices:
devices 0 and 2 are identified as 24x1050MHz Intel(R) HD Graphics 620; OpenCL 2.1;
devices 1 and 3 as 4x2700MHz Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz; OpenCL 2.1 (build 2).

It raises the HD 620 clock rate from the 300 MHz idle value to 950-1000 MHz.
Self-test cases run at 140-145 msec/iteration.
Prime95 throughput drops by about half during this use of the iGP by GpuOwL
(a prime95 43M LL test goes from 16 msec/iteration to 30-33 msec/iteration; the other prime95 worker, running ECM on a 16.9M exponent, goes from 96 seconds per screen output to 152 seconds per output).

The 15 W total TDP was not changed, but 8.4 W of it is absorbed by the iGP, dropping the CPU frequency to 1.6 GHz and lowering CPU wattage (per CPUID HWMonitor and TechPowerUp GPU-Z).
Other elements: uncore 1.5 W, DRAM 1.6 W, IA cores 5.2 W.

Task Manager shows prime95 dropping from ~72% CPU load before the launch of GpuOwL to about 40% during GpuOwL operation.
This is not due to CPU load from GpuOwL; the GpuOwL process shows essentially no CPU usage in Task Manager (occasionally a fraction of a percent).
The drop in prime95 CPU usage is presumably due to some combination of TDP management and contention for memory bandwidth, since the iGP uses shared system RAM.

Launching a second instance of GpuOwL in a separate directory near the end of the self-test (two exponents left) did not noticeably change prime95's CPU usage.
There was no discernible difference in HD 620 clock rate, power consumption, or prime95 throughput between one and two instances. I found later that the instance started earlier had accidentally been put into Select status; once that was cleared, the two instances ran simultaneously at half speed: 140-145 msec/iteration with a single GpuOwL running, and approximately 299 msec/iteration with two. So with regard to prime95, there was still no discernible difference between one and two instances of GpuOwL after clearing the Select status.

Pilot error? I may try it on a different, older system too.