mersenneforum.org  

Old 2017-04-26, 03:27   #34
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013
2³×137 Posts

I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.
Old 2017-04-26, 04:20   #35
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!
23656₈ Posts

Quote:
Originally Posted by Mark Rose
I believe I've read on the forum that some users doing GPU LL had to underclock the memory to get an accurate result.
That was definitely my experience. Until I slowed the memory clock, my gtx 460 and 580 cards could not complete both self-tests.

I have yet to really work at getting CUDALucas to run on the GTX 1060. Early self-test runs blew up in seconds. It's been a while, so I don't remember the details that well. With an i7 Skylake turning out a DC every 25 hours, I found it much more productive to keep the GPUs on TF.
Old 2017-04-26, 08:08   #36
preda
 
"Mihai Preda"
Apr 2015
2650₈ Posts

Exponent range supported by gpuOwL

gpuOwL currently supports only the 4096K FFT, which allows LL tests on exponents from about 35M to 77M.

But all the exponents under ~ 40.8M have been double-checked, thus there's not much interest there.

Using FFT 4096K for exponents under about 65M may be a waste, because faster FFT-sizes are available. Thus gpuOwL is probably best used for first-time and double-checks in the range 70M - 77M.
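
The range above follows from how many bits of the exponent each FFT word has to carry. A minimal sketch, assuming the 4096K FFT means 4096×1024 words (the gpuOwL log later in this thread prints "1024*2048*2" and "18.36 bits/word" for 77002759, which is consistent with that). The function name is made up for illustration:

```python
def bits_per_word(exponent: int, fft_words: int) -> float:
    """Average number of bits of the Mersenne exponent stored per FFT word."""
    return exponent / fft_words

# Assumption: "FFT 4096K" means 4096 * 1024 = 4,194,304 words.
FFT_4096K = 4096 * 1024

# The two exponents tested later in the thread.
for p in (77002759, 76453229):
    print(p, round(bits_per_word(p, FFT_4096K), 2))
```

Both values round to the bits/word figures gpuOwL itself prints in the logs, so the ~77M ceiling corresponds to roughly 18.4 bits per word at this FFT size.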

I recommend starting with at least a couple of double-checks (to validate correct operation) before doing any first-time LL.

The results, written to "results.txt", are in a format that can be submitted directly on the "Manual Testing" web page.

Note: on github in the "fft2m" branch, https://github.com/preda/gpuowl/tree/fft2m
there is an implementation for FFT 2048K as well, in case anybody is particularly interested in those small sizes (probably most useful for testing, or as sample code).

About larger FFT sizes: for now I'm only looking toward supporting POT (power-of-two) FFTs. But I think there's no interest in 8M or 16M FFTs (for LL), because there's plenty of first-time LL to do under 78M.

For world-record tests (exponents > 332M), the smallest POT that can handle them is 32M, which is also overkill (in my estimation, 32M FFT may handle exponents up to 550M). So it seems that world-record tests are better handled by a non-POT FFT. In addition, it would probably not be such a good idea to spend a big amount of time (on huge exponents) with a new program with limited testing.
Old 2017-04-26, 08:59   #37
preda
 
"Mihai Preda"
Apr 2015
2³×181 Posts

gpuOwL stop / resume

On every logstep (20k iterations), gpuOwL writes a checkpoint to the file save-N.bin
(moving the previous file to save-N.old).

The program can be safely stopped/killed at any time. Upon restart, it will look for a checkpoint file for the given exponent, and continue from there if found.

The checkpoint file starts with a human-readable header, like this:
LL1 42643801 160000 1024 2048 0

With the values meaning:
file-signature, exponent, iteration, width, height, offset
followed by a newline, a ctrl-Z character, and a binary dump of the words.
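
For anyone who wants to inspect a checkpoint from a script, here is a minimal sketch of reading that header. It assumes exactly the layout described above (six space-separated fields, a newline, one ctrl-Z byte 0x1A, then the binary words); the class and function names are invented for this example and are not part of gpuOwL:

```python
from typing import NamedTuple, Tuple


class CheckpointHeader(NamedTuple):
    signature: str
    exponent: int
    iteration: int
    width: int
    height: int
    offset: int


def parse_checkpoint(data: bytes) -> Tuple[CheckpointHeader, bytes]:
    """Split raw checkpoint bytes into the text header and the binary payload."""
    header_line, newline, rest = data.partition(b"\n")
    if not newline:
        raise ValueError("no newline found after the header")
    fields = header_line.decode("ascii").split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 header fields, got {len(fields)}")
    # Assumption: a single ctrl-Z (0x1A) byte separates header and payload.
    if rest[:1] != b"\x1a":
        raise ValueError("expected a ctrl-Z byte after the header line")
    header = CheckpointHeader(fields[0], *(int(f) for f in fields[1:]))
    return header, rest[1:]


# Example with the header from the post and a dummy two-byte payload.
sample = b"LL1 42643801 160000 1024 2048 0\n\x1a" + b"\x00\x01"
header, payload = parse_checkpoint(sample)
print(header)
```

The payload is returned untouched, since the post does not specify the word size or byte order of the binary dump.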

These save files can be safely moved around. If they are deleted or moved away, the progress is lost and the program starts again from iteration 0.
Old 2017-04-26, 14:46   #38
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand
5·11²·17 Posts

Haha, nice avatar

To the subject: as promised, Victor sent me his build. I gave up on doing mine: I found that my tools are old and I have no time to update them, but I will resume the attempts as soon as time allows.

Let's start by getting an assignment in the 77M range, to avoid wasting precious cycles, as the Owl only knows the 4096K FFT. For a start, we got M77002759. Good. For comparison, we first gave it a run with clLucas, to see what we are up against. As we hadn't used this machine for testing in a while, we first had an unsuccessful struggle to convince clLucas to stick with that FFT size. When we do not specify the FFT size on the command line, it works for a long while deciding which FFT is best (it starts much lower), and every time ends with a "wrong" one, i.e. above 4096K.

We gave up after it decided the error was too big and increased the FFT, regardless of what we were doing. Score: 1-0 for clLucas against us. The point is that the next FFT size it wants to use runs at less than half the speed of the POT one (about 14.0 ms/iter at 4480K versus about 5.7 ms/iter at 4096K in the log below). This is easy to see in the test lines it prints at the beginning, every 100 iterations: after it increases the FFT they come much less often, roughly two seconds per line against one second per line before. Grrrr... We decided to let it go, stop beating the dead horse, and get a new, smaller assignment. This time we got M76453229, and clLucas happily decided not to increase the FFT. Gooooood.....

Then we did the same with gpuOwl, running both exponents just to see the difference. gpuOwl is indeed faster, but we have to complain that it zeroes the upper half of the residue (we assume this is just a printing bug, or maybe a compilation bug).

Code:
e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups

Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME                          Tahiti
CL_DEVICE_VENDOR                        Advanced Micro Devices, Inc.
CL_DEVICE_VERSION                       OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION                       2348.3
CL_DEVICE_MAX_COMPUTE_UNITS             32
CL_DEVICE_MAX_CLOCK_FREQUENCY           1050
CL_DEVICE_GLOBAL_MEM_SIZE               3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE           256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

mkdir: cannot create directory `backups': File exists
Starting M77002759 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration  100, average error = 0.16452, max error = 0.21875
Iteration  200, average error = 0.19164, max error = 0.21875
Iteration  300, average error = 0.20540, max error = 0.23438
Iteration  400, average error = 0.21530, max error = 0.25000
Iteration  500, average error = 0.22243, max error = 0.28125
Iteration  600, average error = 0.23223, max error = 0.28125
Iteration  700, average error = 0.23923, max error = 0.28125
Iteration  800, average error = 0.24449, max error = 0.28125
Iteration  900, average error = 0.24857, max error = 0.28125
Iteration 1000, average error = 0.25181 >= 0.25 (max error = 0.28125), increasing FFT length and restarting.
Starting M77002759 fft length = 4480K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration  100, average error = 0.02343, max error = 0.03125
Iteration  200, average error = 0.02738, max error = 0.03320
Iteration  300, average error = 0.02932, max error = 0.03320
Iteration  400, average error = 0.03035, max error = 0.03516
Iteration  500, average error = 0.03131, max error = 0.03516
Iteration  600, average error = 0.03195, max error = 0.03516
Iteration  700, average error = 0.03241, max error = 0.03516
Iteration  800, average error = 0.03276, max error = 0.03516
Iteration  900, average error = 0.03302, max error = 0.03516
Iteration 1000, average error = 0.03323 < 0.25 (max error = 0.03516), continuing test.
Iteration 2000 M( 77002759 )C, 0x9a2f030ffaeda2c4, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 14.0000 ms/iter, ETA 299:26:41)
Iteration 4000 M( 77002759 )C, 0xed3b849574c96289, n = 4480K, clLucas v1.04 err = 0.0371 (0:27 real, 13.9316 ms/iter, ETA 297:58:24)
Iteration 6000 M( 77002759 )C, 0x6d71868cfc75973d, n = 4480K, clLucas v1.04 err = 0.0371 (0:28 real, 13.9408 ms/iter, ETA 298:09:45)
        Unknown signal caught, writing checkpoint. Estimated time spent so far: 1:50

e:\99 - Prime\clLucas>cllucas_x64 -c 2000 -threads 256 -f 4194304 -s backups

Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

CL_DEVICE_NAME                          Tahiti
CL_DEVICE_VENDOR                        Advanced Micro Devices, Inc.
CL_DEVICE_VERSION                       OpenCL 1.2 AMD-APP (2348.3)
CL_DRIVER_VERSION                       2348.3
CL_DEVICE_MAX_COMPUTE_UNITS             32
CL_DEVICE_MAX_CLOCK_FREQUENCY           1050
CL_DEVICE_GLOBAL_MEM_SIZE               3221225472
CL_DEVICE_MAX_WORK_GROUP_SIZE           256
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1

mkdir: cannot create directory `backups': File exists
Starting M76453229 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration  100, average error = 0.14852, max error = 0.21875
Iteration  200, average error = 0.18363, max error = 0.21875
Iteration  300, average error = 0.19534, max error = 0.21875
Iteration  400, average error = 0.20119, max error = 0.21875
Iteration  500, average error = 0.20470, max error = 0.21875
Iteration  600, average error = 0.20704, max error = 0.21875
Iteration  700, average error = 0.20872, max error = 0.21875
Iteration  800, average error = 0.20997, max error = 0.21875
Iteration  900, average error = 0.21095, max error = 0.21875
Iteration 1000, average error = 0.21172 < 0.25 (max error = 0.21875), continuing test.
Iteration 2000 M( 76453229 )C, 0xbb1d6624a3ab7bf8, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7752 ms/iter, ETA 122:38:34)
Iteration 4000 M( 76453229 )C, 0xbaefb39c3b82c9d1, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7400 ms/iter, ETA 121:53:32)
Iteration 6000 M( 76453229 )C, 0x580c90a32431aeea, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7350 ms/iter, ETA 121:46:58)
Iteration 8000 M( 76453229 )C, 0x034cc7c190b474a6, n = 4096K, clLucas v1.04 err = 0.2188 (0:12 real, 5.7300 ms/iter, ETA 121:40:24)
Iteration 10000 M( 76453229 )C, 0x40e22bdb628637bd, n = 4096K, clLucas v1.04 err = 0.2188 (0:11 real, 5.7100 ms/iter, ETA 121:14:44)
        Unknown signal caught, writing checkpoint. Estimated time spent so far: 1:02

e:\99 - Prime\clLucas>cd ..\gpuowl

e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 77002759 (18.36 bits/word) at iteration 0
OpenCL setup: 960 ms
00002000 / 77002759 [0.00%], ms/iter: 4.765, ETA: 4d 05:55; 00000000faeda2c4 error 0.238075 (max 0.238075)
00004000 / 77002759 [0.01%], ms/iter: 4.750, ETA: 4d 05:36; 0000000074c96289 error 0.236749 (max 0.238075)
00006000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000fc75973d error 0.234828 (max 0.238075)
00008000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 000000001f37b1fb error 0.24425 (max 0.24425)
00010000 / 77002759 [0.01%], ms/iter: 4.745, ETA: 4d 05:29; 00000000b0f55ab1 error 0.235088 (max 0.24425)
^C
[changing the lines' order in worktodo]

e:\99 - Prime\gpuOwl>gpuowl -logstep 2000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 0
OpenCL setup: 1000 ms
00002000 / 76453229 [0.00%], ms/iter: 4.770, ETA: 4d 05:18; 00000000a3ab7bf8 error 0.199004 (max 0.199004)
00004000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:46; 000000003b82c9d1 error 0.20329 (max 0.20329)
00006000 / 76453229 [0.01%], ms/iter: 4.740, ETA: 4d 04:39; 000000002431aeea error 0.208951 (max 0.208951)
00008000 / 76453229 [0.01%], ms/iter: 4.750, ETA: 4d 04:52; 0000000090b474a6 error 0.206401 (max 0.208951)
00010000 / 76453229 [0.01%], ms/iter: 4.745, ETA: 4d 04:45; 00000000628637bd error 0.203467 (max 0.208951)
^C
e:\99 - Prime\gpuOwl>
Note that 4 days 5 hours is 101 hours. The difference in ETA between the two gpuOwl runs looks normal: there are fewer iterations to do, even though the iterations themselves take the same time (the same FFT is used).

This is about a 20% speed increase over clLucas (about 4.75 ms/iter versus about 5.74 ms/iter on M76453229), for this build and this card.
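
The arithmetic behind those numbers can be checked with a few lines. This is only a sketch using the per-iteration timings copied from the logs above (5.74 ms for clLucas and 4.745 ms for gpuOwL on M76453229), and it assumes an LL test needs about as many iterations as the exponent:

```python
def eta_hours(iterations_left: int, ms_per_iter: float) -> float:
    """Estimated remaining wall-clock time in hours."""
    return iterations_left * ms_per_iter / 1000 / 3600

EXPONENT = 76453229      # second assignment from the logs
CLLUCAS_MS = 5.74        # ms/iter reported by clLucas at 4096K
GPUOWL_MS = 4.745        # ms/iter reported by gpuOwL at 4096K

cllucas_eta = eta_hours(EXPONENT, CLLUCAS_MS)
gpuowl_eta = eta_hours(EXPONENT, GPUOWL_MS)
speedup = CLLUCAS_MS / GPUOWL_MS

print(f"clLucas: {cllucas_eta:.1f} h, gpuOwL: {gpuowl_eta:.1f} h")
print(f"throughput ratio: {speedup:.2f}x")
```

This gives roughly 121.9 hours for clLucas and 100.8 hours for gpuOwL, in line with the ETAs both programs print, and a throughput ratio of about 1.21.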

The next step would be to compile our own version and, if (or when) the zeroing error is fixed, to finish these tests and compare the residues with what the Titan/cudaLucas gives. If successful, we will report both as LL and DC. Yes, we know this will infuriate MadPoo, who will have to triple-check.
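
Until the printing is fixed, the two programs can still be cross-checked on the digits gpuOwL does show. A small sketch, with the residue pairs copied from the logs above, that compares only the low 32 bits (the helper name is made up for illustration):

```python
def low32(residue_hex: str) -> int:
    """Return the low 32 bits of a hexadecimal residue string."""
    return int(residue_hex, 16) & 0xFFFFFFFF

# (clLucas residue, gpuOwL residue) pairs taken from the logs above.
pairs = [
    ("0x9a2f030ffaeda2c4", "00000000faeda2c4"),  # M77002759, iteration 2000
    ("0xed3b849574c96289", "0000000074c96289"),  # M77002759, iteration 4000
    ("0x6d71868cfc75973d", "00000000fc75973d"),  # M77002759, iteration 6000
    ("0xbb1d6624a3ab7bf8", "00000000a3ab7bf8"),  # M76453229, iteration 2000
]

all_match = all(low32(a) == low32(b) for a, b in pairs)
print("low 32 bits agree:", all_match)
```

The low halves agree in every pair listed, which supports the guess that only the printing of the upper half is affected.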

Last fiddled with by LaurV on 2017-04-26 at 14:50
Old 2017-04-26, 15:34   #39
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand
5·11²·17 Posts

Additional "complaints", besides the fact that the little thief is stealing the hex digits of my residue:

1. When something like this happens (yes, we can force it, by giving the card unrealistic work at the same time as it is running gpuOwl):
Code:
Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)
then the error should be saved in the checkpoint file too and carried over at the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, first, it is wasting precious time (yes, the residue is wrong: the correct one ends in 2BBB1710, yet it continues with the wrong calculation, wasting cycles in vain) and, second, the user can restart the program (and the error is lost), so he never knows he has problems with his hardware. Most of us use a batch file like
Code:
:loop
gpuOwl
goto loop
to run the tools, because sometimes they crash, or the card crashes, and the calculation has to resume without waiting until I come home from work in the evening. In this case, gpuOwl will continue with the wrong residue, and the error will not be carried over after the restart, so the user will have no idea (who's reading the logs?).

2. While we are here, please do not delete the partial residue files. Leave them there, and use the iteration number and the residue as part of the file name (cudaLucas style). The user can delete them manually if he wants. This is useful when comparing files and residues between different runs, cards, and programs, as they all have different file structures, different shift counts, etc. For example, the two files produced by cudaLucas and clLucas are not the same inside, but if both are called "s76453229.30001.9d0222732bbb1710.txt" (a real file name), then I know that both programs are doing fine. My batch file can then automatically parse the two backup folders, kill the programs, and resume from a previous iteration (by renaming the files to cxxxxx and txxxxx, see cudaLucas) without looking inside the files (I am not interested in their internal structure/kitchen), so we don't lose time walking paths that start from wrong residues and then backtracking.

3. Save those residues in a sub-folder, call it backup or so (later it can be set in an ini file), so they can be handled in bulk, deleted, etc., without inadvertently deleting the exe file or some library in the folder the program is running from.

then we are good to go...
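
As a stop-gap for point 1, a wrapper script could watch the log itself instead of blindly looping. The sketch below parses gpuOwL progress lines in the format shown in this thread and flags a line whose reported error is implausibly large. The 0.5 limit is an assumption (round-off errors are normally expected to stay below that), and all names here are invented for the example, not gpuOwL options:

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    iteration: int
    exponent: int
    ms_per_iter: float
    residue: str
    error: float
    max_error: float


# Matches lines such as:
# 00002000 / 77002759 [0.00%], ms/iter: 4.765, ETA: 4d 05:55; 00000000faeda2c4 error 0.238075 (max 0.238075)
LOG_RE = re.compile(
    r"(\d+) / (\d+) \[[\d.]+%\], ms/iter: ([\d.]+), ETA: [^;]+; "
    r"([0-9a-f]{16}) error ([\d.]+) \(max ([\d.]+)\)"
)

ERROR_LIMIT = 0.5  # assumption: anything at or above this means trouble


def parse_log_line(line: str) -> Optional[LogLine]:
    """Parse one progress line; return None if the line has another format."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return LogLine(int(m[1]), int(m[2]), float(m[3]), m[4],
                   float(m[5]), float(m[6]))


def hardware_suspect(line: str, limit: float = ERROR_LIMIT) -> bool:
    """True if the line reports a maximum error at or above the limit."""
    parsed = parse_log_line(line)
    return parsed is not None and parsed.max_error >= limit


good = ("00002000 / 77002759 [0.00%], ms/iter: 4.765, ETA: 4d 05:55; "
        "00000000faeda2c4 error 0.238075 (max 0.238075)")
bad = ("Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, "
       "ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)")
print(hardware_suspect(good), hardware_suspect(bad))
```

A restart loop could call `hardware_suspect` on each new log line and refuse to relaunch once it returns True, so the bad run is not silently continued.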

Last fiddled with by LaurV on 2017-04-26 at 16:33
Old 2017-04-26, 16:02   #40
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand
24055₈ Posts

On the bright side, we timed the little owl again, this time with a stopwatch in hand, to see whether it cheats when displaying ETAs. It does not. (You have to agree that it is possible, and we are paranoid by training, nothing personal: we could produce an LL test program that claims to test a 77M exponent in 55 hours, but you let it run and run, and after 55 hours it is only one third done and will take another 110 hours for the remaining two thirds, even though it now shows only 33 hours to go. That is not the case for gpuOwl, but you get the point.) So, we made sure.

Using it will turn our card from ~42 GHz-days per day (that is indeed its score in this range, despite James' site saying that it only scores between 35 and 40) into a 20% faster card, as we have seen above, which is a bit over 50 GHzD/D, on par with some "good" GTX 1070 or Quadro Plex, not to mention the boost it will give to lots of "Fury" cards that are already at 50 or higher.

This matches the displayed time: the 77M exponent we tested is worth 217 GHz-days, and the ETA was (roughly) 4 days 5 hours, i.e. 4.2 days, which makes a bit over 50 (about 51.6) GHz-days per day.
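
The throughput figure is just the credit divided by the run time. A quick sketch using the two numbers from this post (217 GHz-days of credit and an ETA of 4 days 5 hours); the function name is only illustrative:

```python
def ghz_days_per_day(credit_ghz_days: float, days: int, hours: int) -> float:
    """Throughput implied by finishing a given credit in days + hours."""
    return credit_ghz_days / (days + hours / 24)

throughput = ghz_days_per_day(217, 4, 5)
print(f"{throughput:.1f} GHz-days per day")
```

That comes out at about 51.6 GHz-days per day, i.e. the "bit over 50" quoted above.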

Yay!

Therefore, if you can find the time, resources, motivation, whatever, to continue improving it, you will make a lot of people here happy....

Two thumbs up...


Time for bed, midnight here (almost)...

Last fiddled with by LaurV on 2017-04-26 at 16:04
Old 2017-04-26, 17:30   #41
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA
37×59 Posts

Quote:
Originally Posted by VictordeHolland
Hi,
I forgot to mention, but my card is a HD7950, which only supports OpenCL1.2
https://en.wikipedia.org/wiki/Radeon_HD_7000_Series

At least gpuOwl detects it is a Tahiti OpenCL 1.2 AMD-APP 2079.5 device :).

...
Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed does... just driver things I guess....

Last fiddled with by kracker on 2017-04-26 at 17:30
Old 2017-04-26, 18:52   #42
VictordeHolland
 
"Victor de Hollander"
Aug 2011
the Netherlands
3²×131 Posts

Quote:
Originally Posted by kracker
Funny enough, although the HD7950 doesn't support OpenCL 1.2, the R9 280 which is a rebrand of the 7950 with a higher clock speed does... just driver things I guess....
The Wikipedia pages (they could be wrong of course) state:

GCN 1st gen supports OpenCL 1.2:
Tahiti chips (HD79xx, HD89xx, R9 280(X))

GCN 2nd gen supports OpenCL 2.0, used in:
Bonaire chip (HD7790, HD8770, R7 260X, R7 360)
Hawaii chip (R9 290(X), R9 390(X))

GCN 3rd gen also supports OpenCL 2.0:
Tonga (R9 285, R9 380(X))
Fiji (R9 Fury, Nano X)

It seems highly plausible that OpenCL support depends on the GCN generation.
Old 2017-04-26, 23:05   #43
preda
 
"Mihai Preda"
Apr 2015
2³·181 Posts

@LaurV, that's a detailed analysis!

It seems your build was not fresh though: the "error too large" not stopping is already fixed. The zeroed residue is also "maybe" fixed already. If you can get a fresh build, I'll know if the residue is indeed now printed correctly (or look more into that if not).

I still don't understand how you got "4" for the error; it's supposed to go only up to 0.5.

I'll think about the structure of the save-files. I probably need to look a bit at what CUDALucas does there. But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.
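
To put a number on that, here is a rough estimate of what keeping every checkpoint would cost. It is a sketch under the figures given in this thread: one checkpoint per logstep (20k iterations by default) and 16 MB per file; the function name is made up:

```python
def checkpoint_storage(exponent: int, logstep: int, file_mb: float):
    """Return (number of checkpoints, total size in GB) if none are deleted."""
    count = exponent // logstep          # one checkpoint per logstep
    total_gb = count * file_mb / 1000    # decimal gigabytes
    return count, total_gb

count, total_gb = checkpoint_storage(77002759, 20000, 16)
print(count, "checkpoints,", total_gb, "GB")
```

For a 77M exponent this gives 3850 files and about 61.6 GB, close to the 4000 files and 64 GB estimated above. Keeping only every Nth checkpoint would divide the total by N.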
Old 2017-04-27, 06:14   #44
airsquirrels
 
"David"
Jul 2015
Ohio
11×47 Posts

I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX Systems.

Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than on NVIDIA ones, but they do happen. If it is helpful, I have a system with 3x W9100s in it, which have ECC memory and (ideally) do not exhibit hardware errors (hundreds of double-checks agree). If you add a way to select the GPU, I can run a few double-check exponents on those cards to check stability.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
