
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

paulunderwood 2019-11-18 16:36

[QUOTE=storm5510;530896]I did a bit of a comparison yesterday. I am really not sure how much difference there is between PRP and LL. I used the same exponent on gpuOwl and CUDALucas. I started a PRP on gpuOwl. It reported 5 days to complete. I then started an LL on CUDALucas. It reported just under 8 days.

What value this has, I do not know. It is what it is.[/QUOTE]

If you discover a Mersenne PRP then you have the prize -- an LL verification is merely a formality because the chances of it not being prime are like winning a lottery every day for a year.

The PRP test is more robust in that the Gerbicz error correction greatly reduces the risk of an error -- a PRP test is better for the project's throughput.

In your case with your hardware PRP it is a win-win situation.

storm5510 2019-11-18 23:23

[QUOTE=paulunderwood;530897]If you discover a Mersenne PRP then you have the prize -- an LL verification is merely a formality because the chances of it not being prime are like winning a lottery every day for a year.

The PRP test is more robust in that the Gerbicz error correction greatly reduces the risk of an error -- a PRP test is better for the project's throughput.

In your case with your hardware PRP it is a win-win situation.[/QUOTE]

Thank you for the reply.

Given what you wrote, PRP could be used as a substitute for a DC. Is this correct?

kriesel 2019-11-18 23:52

[QUOTE=storm5510;530933]PRP could be used as a substitute for a DC. Is this correct?[/QUOTE]One PRP is not the equivalent of 2 LL tests, in multiple ways.
A PRP or an LL on a gpu (or any application, other than mprime/prime95 with its security code, reported manually) could be a complete fabrication by a malicious user. Or could be accidentally incorrectly transcribed by an honest user. There have been PRP results that were affected by software bugs lying outside the code guarded by the GEC. Hardware issues could also hit there. The GEC is very reliable, but Mr. Gerbicz has indicated its error detection miss rate is not zero.

Doublecheck relies on matching residues. So the residues need to be the same type. An LL first test needs an LL DC. A PRP first test needs a PRP DC, of the same residue type.

The error rate of LL is historically 2% so it took ~2.04 LL on average to get a match. PRP would be ~2 tests. That's a 2% improvement.
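The arithmetic behind those figures can be sketched with a back-of-envelope model that I'm assuming here (independent tests, a verified result needing two matching correct tests), not anything stated in the post:

```python
# Assumed model: each test is independently correct with probability
# 1 - p, and a verified result needs two matching correct tests, so
# the expected number of tests is roughly 2 / (1 - p).
def expected_tests(p):
    return 2 / (1 - p)

print(round(expected_tests(0.02), 2))  # LL at the historical ~2% error rate
print(round(expected_tests(0.0), 2))   # idealized error-free PRP under GEC
```

With p = 0.02 this gives about 2.04 tests per verified exponent, matching the "~2.04 LL on average" figure, versus ~2 for PRP.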

paulunderwood 2019-11-18 23:53

[QUOTE=storm5510;530933]Thank you for the reply.

Given what you wrote, PRP could be used as a substitute for a DC. Is this correct?[/QUOTE]

Almost! I leave it to those in the know to say how easy it is for a rogue agent to submit fake PRP results.

It would be interesting to know the GIMPS bad PRP ratio anyway.

kriesel 2019-11-19 00:18

[QUOTE=paulunderwood;530935]...how easy it is for a rogue agent to submit fake PRP results.

It would be interesting to know the GIMPS bad PRP ratio anyway.[/QUOTE]
1) It's trivial to forge a manual gpu result, for LL or PRP or TF NF or P-1 NF. That doesn't stop miscreants from getting it wrong, which makes detecting them easier.

2) More data samples are needed to get a sense of the PRP error rate "in the wild". Only about 1/10 of the PRPs I've done have been double checked or are double checks.
The top LL DC producer has 9850 results, while the top PRP DC producer has only 108.

preda 2019-11-19 11:28

[QUOTE=kriesel;530887]An option to save persistent checkpoints would be good, perhaps every hour or million iterations, especially for when a long computation goes bad, such as the zero residue case encountered recently. See [url]https://www.mersenneforum.org/showpost.php?p=530876&postcount=60[/url]. In that run all the gpuowl save files were bad before the error was spotted.[/QUOTE]

I haven't prioritized persistent checkpoints for P-1 because, in my mind, P-1 is a relatively fast test. I see one P-1 as taking up to 1h, but in the past I've done quite a bit of 5min/exponent P-1 with both stages (at the wavefront), and it seemed like a relatively good use of my compute (IMO).

As a simple estimate: 24h for a full PRP test; allocating 2% of that time for a P-1 test comes to about 30 min.
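The allocation is simple to reproduce (the 24h figure and the 2% share are the ones given above):

```python
# Budget arithmetic from the post: allocate ~2% of a 24-hour PRP
# test's runtime to the preceding P-1 test.
prp_hours = 24
p1_share = 0.02
p1_minutes = prp_hours * 60 * p1_share
print(p1_minutes)  # ~28.8 minutes, i.e. the "30min" figure, rounded
```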

kriesel 2019-11-19 13:17

[QUOTE=preda;530978]I haven't prioritized persistent checkpoints for P-1 because, in my mind, P-1 is a relatively fast test. I see one P-1 as taking up to 1h, but in the past I've done quite a bit of 5min/exponent P-1 with both stages (at the wavefront), and it seemed like a relatively good use of my compute (IMO).

As a simple estimation, 24h for a full PRP test; let's allocate 2% of that time for a P-1 test, comes to 30min.[/QUOTE]That's ok for wavefront PRP & P-1 on fastest gpus (Radeon VII). The 100M digit exponents or higher, or ordinary gpus are another matter. On an RX480, one PRP of 333M is 68 days, one P-1 to 2-tests-saved bounds is 2.2 days. On an RX480, 2-test-saved bounds P-1 of 102M is 4hr 15 min. On an RX550, it will be about 3.5 times that, around 16 hours for 102M P-1, and would be 7 to 8 days for 333M P-1.
Your time to spend how you see fit, of course, including beach vacations.

storm5510 2019-11-19 16:47

[QUOTE=kriesel;530934]One PRP is not the equivalent of 2 LL tests, in multiple ways.
A PRP or an LL on a gpu (or any application, other than mprime/prime95 with its security code, reported manually) could be a complete fabrication by a malicious user. Or could be accidentally incorrectly transcribed by an honest user. There have been PRP results that were affected by software bugs lying outside the code guarded by the GEC. Hardware issues could also hit there. The GEC is very reliable, but Mr. Gerbicz has indicated its error detection miss rate is not zero.

Doublecheck relies on matching residues. So the residues need to be the same type. An LL first test needs an LL DC. A PRP first test needs a PRP DC, of the same residue type.

The error rate of LL is historically 2% so it took ~2.04 LL on average to get a match. PRP would be ~2 tests. That's a 2% improvement.[/QUOTE]

By security code, I take it that you mean the assignment ID?

If this is the case, [I]gpuOwl[/I] might include an AID in its results, if there is one present in the work assignment. It uses a rather different results format than anything I've seen before. It has a name, I know, but I cannot remember it at the moment.

"2 LL tests." A [U]first time[/U] and then a [U]DC[/U]. I have done DCs with [I]CUDALucas[/I] and never had a problem with a mismatched residue. Compared to this, running an LL or DC with [I]Prime95[/I] can be cumbersome regarding the time required. I will always use a GPU whenever possible! Some GPU applications do not pass through an AID when one is present in the work file. I feel this needs to be corrected. Specifically, [I]mfaktc[/I], [I]CUDAPm1[/I], and [I]CUDALucas[/I]. Sorry for going a bit off-topic...

kriesel 2019-11-19 23:05

[QUOTE=storm5510;530999]By security code, I take it that you mean the assignment ID?[/QUOTE]No. By security code, I meant the prime95/mprime generated security code in the result. Different field entirely. The source code for the rest is available, but not for that. Only mprime or prime95 results can have the security code field. The AID is merely repeated from input to output. The security code's value depends on the software and on the calculation, in a way that's unclear, but provides a way of checking whether there is a discrepancy. A completely fictitious res64 could be submitted with a valid AID in other applications, but the security code in mprime or prime95 results would reveal the fraud, when checked at the primenet server.

[QUOTE]If this is the case, [I]gpuOwl[/I] might include an AID in its results, if there is one present in the work assignment.[/QUOTE]GpuOwL DOES use an AID in its results.[QUOTE]It uses a rather different results format than anything I've seen before. It has a name, I know, but I cannot remember it at the moment.[/QUOTE]JSON.[QUOTE]
Some GPU applications do not pass through an AID when one is present in the work file.. I feel this needs to be corrected. Specifically, [I]mfaktc[/I], [I]CUDAPm1[/I], and [I]CUDALucas[/I]. Sorry for going a bit off-topic...[/QUOTE]Partly true, partly false.

All the following include AID in their result output (if one is present in the worktodo entry):
[B]CUDALucas v2.06[/B]
[B]CUDAPm1 v0.20[/B]
gpuowl of many versions
cllucas v1.04
mprime/prime95

These do not put the AID in the result record:
mfaktc
mfakto

(I have no results yet for mlucas)

nomead 2019-11-20 00:21

[QUOTE=kriesel;531022](I have no results yet for mlucas)[/QUOTE]

From a recent DC result in Mlucas 18.0 :
[C]M51318853 is not prime. Res64: FA4E8EBD90893264. Program: E18.0. Final residue shift count = 21868454[/C]
So, no AID, at least not yet. (I'm including the full residue since it's been double checked and it appears in full on the mersenne.org exponent lookup as well.)

storm5510 2019-11-20 01:37

[QUOTE=kriesel;531022]No. By security code, I meant the prime95/mprime generated security code in the result. Different field entirely. The source code for the rest is available, but not for that. Only mprime or prime95 results can have the security code field. The AID is merely repeated from input to output. The security code's value depends on the software and on the calculation, in a way that's unclear, but provides a way of checking whether there is a discrepancy. A completely fictitious res64 could be submitted with a valid AID in other applications, but the security code in mprime or prime95 results would reveal the fraud, when checked at the primenet server.

GpuOwL DOES use an AID in its results.JSON.Partly true, partly false.

All the following include AID in their result output (if one is present in the worktodo entry):
[B]CUDALucas v2.06[/B]
[B]CUDAPm1 v0.20[/B]
gpuowl of many versions
cllucas v1.04
mprime/prime95

These do not put the AID in the result record:
mfaktc
mfakto

(I have no results yet for mlucas)[/QUOTE]

Ah! This is a security code: [B]Wh4: 12E068A0. [/B]I've only seen this with [I]Prime95[/I]. I know nothing about [I]Mprime[/I] as I do not do [I]Linux[/I].

Please forgive my error regarding [I]CUDALucas[/I] and [I]CUDAPm1[/I]. I must not have run anything with them which had an AID. An old saying about "assume:" It makes an a** out of you and me.

JSON. Hit me on the head with a mallet. :blush:

kriesel 2019-11-20 02:11

[QUOTE=storm5510;531034]Ah! This is a security code: [B]Wh4: 12E068A0. [/B]I've only seen this with [I]Prime95[/I]. I know nothing about [I]Mprime[/I] as I do not do [I]Linux[/I].

Please forgive my error regarding [I]CUDALucas[/I] and [I]CUDAPm1[/I]. I must not have run anything with them which had an AID. An old saying about "assume:" It makes an a** out of you and me.

JSON. Hit me on the head with a mallet. :blush:[/QUOTE]Wh4 is a compact version identifier.

LaurV 2019-11-20 03:41

[QUOTE=kriesel;530984]That's ok for wavefront PRP & P-1 on fastest gpus (Radeon VII). The 100M digit exponents or higher, or ordinary gpus are another matter. On an RX480, one PRP of 333M is 68 days, one P-1 to 2-tests-saved bounds is 2.2 days. On an RX480, 2-test-saved bounds P-1 of 102M is 4hr 15 min. On an RX550, it will be about 3.5 times that, around 16 hours for 102M P-1, and would be 7 to 8 days for 333M P-1.
Your time to spend how you see fit, of course, including beach vacations.[/QUOTE]
:goodposting::iws:

kriesel 2019-11-20 15:32

[QUOTE=nomead;531026]From a recent DC result in Mlucas 18.0 :
[C]M51318853 is not prime. Res64: FA4E8EBD90893264. Program: E18.0. Final residue shift count = 21868454[/C]
So, no AID, at least not yet. (I'm including the full residue since it's been double checked and it appears in full on the mersenne.org exponent lookup as well.)[/QUOTE]Thanks! Would you please also run a small known prime that would be quick? Perhaps

756839

nomead 2019-11-20 21:06

[QUOTE=kriesel;531078]Thanks! Would you please also run a small known prime that would be quick? Perhaps

756839[/QUOTE]

Well... there, Mlucas works a bit differently.

First, in normal use, it has a table of known Mersenne primes and will report any re-test as such.

But after modification of the source and recompilation, it still behaves differently. There is no output on results.txt. But the tail end of the p756839.stat file states that
[C]M756839 is a new MERSENNE PRIME!!![/C]
and the program exits with the same message. The same happens whether it's a Test= or a DoubleCheck= line in the worktodo.ini file.

kriesel 2019-11-22 13:44

[QUOTE=nomead;531113]Well... there, Mlucas works a bit differently.

First, in normal use, it has a table of known Mersenne primes and will report any re-test as such.

But after modification of the source and recompilation, it still behaves differently. There is no output on results.txt. But the tail end of the p756839.stat file states that
[C]M756839 is a new MERSENNE PRIME!!![/C]
and the program exits with the same message. The same happens whether it's a Test= or a DoubleCheck= line in the worktodo.ini file.[/QUOTE]Thanks again. I'm incorporating it into a list of result formats that's now a draft but will eventually be a reference post in [URL]https://www.mersenneforum.org/showthread.php?t=24607[/URL]

storm5510 2019-11-27 15:35

I have been doing some off-and-on work for [B]James Heinrich[/B] on his project. He has helped me with questions in the past so I feel I should return the favor. He has a sub-project relating to poorly factored P-1 exponents.

Earlier, I took a few from his list, and pasted them into [I]Prime95's[/I] worktodo file and started it. Doing this assigns AID's to all the available exponents. The ones which do not get AID's I delete from the list. I have run them in the form below and had no problems for close to a week:

[CODE]B1=600000,B2=8500000;PFactor=FFAABAF0706FF9F306F07524FA6A949F,1,2,91538501,-1,77,2
[/CODE]This morning, it was ignoring everything in my work file. I sat here over an hour, then I saw it; "P[COLOR=DarkRed]f[/COLOR]actor" where it should have been "P[COLOR=darkred]F[/COLOR]actor." [I]Prime95[/I] did not see this difference as a problem. I made a notation in a document in the folder to change these by hand. As for [I]gpuOwl[/I], trying to accommodate this may not be worth the effort.

kracker 2019-11-27 16:25

[QUOTE=storm5510;531569]This morning, it was ignoring everything in my work file. I sat here over an hour, then I saw it; "P[COLOR=DarkRed]f[/COLOR]actor" where it should have been "P[COLOR=darkred]F[/COLOR]actor." [I]Prime95[/I] did not see this difference as a problem. I made a notation in a document in the folder to change these by hand. As for [I]gpuOwl[/I], trying to accommodate this may not be worth the effort.[/QUOTE]

Same here for gpu72, slightly annoying and I feel like it's not something I really need to do at all...

kriesel 2019-11-28 03:15

Had an error in a gpuowl v6.11-9 P-1 run:
[CODE]2019-11-27 13:03:48 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: cannot rename: File exists [C:\Users\ken\Document
414000187\414000187-new.p2.owl] [C:\Users\ken\Documents\v6.11-9-g9ae3189\414000187\414000187.p2.owl]
2019-11-27 13:03:48 waiting for background GCDs..
2019-11-27 13:03:48 Bye[/CODE]

kriesel 2019-11-28 04:43

DobleCheck etc.
 
I know that gpuowl v0.6 is ancient history at this point, not being maintained, but I find it useful to be able to run LL DC with Jacobi check on exponents ~50M-77M on an AMD gpu.
A Radeon VII can knock one of those out in 15-20 hours.
I've noticed repeated errors in the worktodo.txt. At first I thought it was typos I made.
But it appears that when a result is produced and the worktodo is rewritten by the program,
DoubleCheck= is getting altered for following assignments. Then they fail the validity test, and the program terminates since it has nothing to do, causing considerable loss of throughput. I've seen both DobleCheck and DubleCheck generated, and oubleCheck.


It also failed to remove the worktodo item Test=57885161 after finding it prime, and then terminated instead of continuing with following work.
It also in that case produced a result record with the AID = the exponent.

preda 2019-11-28 06:21

Funny. Probably something to do with Windows newlines (\r\n). Or a memory corruption.
At some point I'd like to add LL back to master. Sorry, I don't think I'll look into fixing 0.6.

[QUOTE=kriesel;531622]I know that gpuowl v0.6 is ancient history at this point, not being maintained, but I find it useful to be able to run LL DC with Jacobi check on exponents ~50M-77M on an AMD gpu.
A Radeon VII can knock one of those out in 15-20 hours.
I've noticed repeated errors in the worktodo.txt. At first I thought it was typos I made.
But it appears that when a result is produced and the worktodo is rewritten by the program,
DoubleCheck= is getting altered for following assignments. Then they fail the validity test, and the program terminates since it has nothing to do, causing considerable loss of throughput. I've seen both DobleCheck and DubleCheck generated, and oubleCheck.


It also failed to remove the worktodo item Test=57885161 after finding it prime, and then terminated instead of continuing with following work.
It also in that case produced a result record with the AID = the exponent.[/QUOTE]

preda 2019-11-28 07:09

[QUOTE=kriesel;531619]Had an error in a gpuowl v6.11-9 P-1 run:
[CODE]2019-11-27 13:03:48 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: cannot rename: File exists [C:\Users\ken\Document
414000187\414000187-new.p2.owl] [C:\Users\ken\Documents\v6.11-9-g9ae3189\414000187\414000187.p2.owl]
2019-11-27 13:03:48 waiting for background GCDs..
2019-11-27 13:03:48 Bye[/CODE][/QUOTE]

This is strange, I don't understand why it happened. Can you reproduce it? Does it happen every time? Anything special, like: disk full, read-only folder, read-only file, etc?

There are 3 files:
foo-old.owl ("old")
foo.owl ("savefile")
foo-new.owl ("new")

The sequence is:
1. write "new"
2. remove "old" (ignoring errors)
3. rename "savefile" to "old" (ignoring errors)
4. rename "new" to "savefile"

It seems in your case step 4 failed. It failed because "savefile" was there. That suggests that step 3 silently failed.

kracker 2019-11-28 07:11

[QUOTE=preda;531626]
At some point I'd like to add LL back to master. Sorry, I don't think I'll look into fixing 0.6.[/QUOTE]

:party:

Also, [URL="https://github.com/preda/gpuowl/commit/5d8110d37b6080245bad3c1ad731a024d66fdeb6"]thank you!!![/URL]

LaurV 2019-11-28 13:59

[QUOTE=preda;531626]At some point I'd like to add LL back to master.[/QUOTE]
We salute that idea and are waiting for it to be sculpted in that stone called gpuOwl... :razz:

kriesel 2019-11-28 16:12

[QUOTE=preda;531628]This is strange, I don't understand why it happened. Can you reproduce it? does it happen every time? anything special, like: disk full, read-only folder, read-only file, etc?
[/QUOTE]I've only seen it once on this system, in cranking through four 41xM P-1 on a GTX1080, disk has 1.34TB free, no read-only on files, exponent folder has same properties as for others that did not show the issue, user has full control permissions, I haven't modified any permissions. Maybe some sort of race condition with Windows file indexing which is enabled?

kriesel 2019-11-28 16:15

[QUOTE=preda;531626]At some point I'd like to add LL back to master.[/QUOTE]That would be very welcome, as it is likely to incorporate the accumulated effort of various Windows-linux differences accommodations, and provide a supported version for LL DC. Ideally it would include Jacobi check, and pseudorandom offset.

preda 2019-11-28 20:43

-pool <dir>
 
The usual way for me running multiple instances of GpuOwl was:
- each in its own folder, each with its own worktodo.txt and results.txt
- the script primenet.py watching all these folders, keeping the right amount of work queued in worktodo and sending the results out.

I started to think about how to do a "common worktodo.txt", i.e. multiple instances feeding from one worktodo. This is the solution I came up with:

- specify one "shared" directory (using "-pool <dir>")
- this shared dir contains only worktodo.txt and results.txt
- every instance of GpuOwl works as before, inside its own local folder, with these two changes:
a) when the local worktodo.txt is empty, extract the first assignment from the shared worktodo and move it to the local worktodo
b) write any result to the shared results.txt instead of the local one

This allows primenet.py to now watch only the shared folder, and not the local ones.


Number of assignments:
Before: for N instances, I was queuing 3*N PRP assignments
After: for N instances, I queue N in the shared worktodo, plus 1 in each of the N local folders, for a total of 2*N.

LaurV 2019-11-29 03:22

[QUOTE=preda;531651]The usual way for me running multiple instances of GpuOwl was:
- each in its own folder, each with its own worktodo.txt and results.txt[/QUOTE]
There is [U]nothing wrong[/U] with that, and we were doing it for cudaLucas for ages. The advantage was that our rusted OCD-etched soul felt happy managing the stuff 'face-to-face', in person.

Of course, we salute the new idea of a common pool (like the misfit is doing for mfaktX programs), and generally, we salute any improvements, in spite of the fact that we are thinking a bit that your efforts and [U]commendable[/U] skills are wasted, being channeled in the wrong direction. Make the Owl faster, better, add back the LL, improve the P-1, add a few additional FFTs, optimize the old ones, fix old bugs, etc... and let us, 'the stupid masses', handle multiple instances by our/themselves. It is not like we are doing thousands of assignments per day like in TF. We just have one or two worktodo files, which change(s) once or twice [U]weekly[/U], when some LL finishes, and looking at our TWO folders once per week is not such a bothersome activity.... or, is it?

Let's be serious, how many of you have 50 GPUs in your rigs? Most of us have 1, few have 2, rarely 3 or 4. Those with more than one, anyhow are "hooked", they spend all the day looking at the folder where LL is running, with nothing else on the screen, and doing nothing else than counting iterations all day... "Yeaahh, 1% done, still 99% to finish! Good... WTF? it was the same 20 seconds ago? No progress?" :razz:

Moreover, adding a common pool would be detrimental when you have two instances and run [U]the same[/U] exponent in both (LL+DC) - well, some of us still doing that, better waste some resources than lose a prime, so the result will still be two different folders, with two different pools, each pool running a single instance, each instance sucking from its own pool, or so.. :wink:

R. Gerbicz 2019-11-29 11:04

[QUOTE=LaurV;531665] Moreover, adding a common pool would be detrimental when you have two instances and run [U]the same[/U] exponent in both (LL+DC) - well, some of us still doing that, better waste some resources than lose a prime[/QUOTE]

And have you ever found a mismatch in the residues?

storm5510 2019-11-30 00:22

[QUOTE=LaurV;531665]...WTF? it was the same 20 seconds ago? No progress?" :razz:
[/QUOTE]

I have two caveats:

[U]#1:[/U] The screen writes could be more frequent. It appears to be 10,000 iterations, or 10,000 something. With my vision being what it is, I walk by the screen and wonder if it is still running, or if it has frozen. Allow the user to decide by making this a [I]config.txt[/I] option. Being an antique programmer, I understand there may be some effort involved.

[U]#2:[/U] For every exponent run, a folder containing checkpoint information is created, but not deleted after completion of the test. The housekeeping could be better.

Other than these, I feel [I]gpuOwl[/I] does a really good job. I have only run P-1s with it. Stage 2 is far faster than in any of the other programs I have used.

ATH 2019-11-30 04:13

How do you specify the PRP type in gpuOwL?

Just finished my first gpuowl test using Google Colab, but it was a PRP DC and I forgot to think about the PRP type, so it finished as the wrong type:
[url]https://mersenne.org/M87000929[/url]

I found a type 1 result to DC for the next one, so that should be ok, but how do I choose the type? Is the type it uses fixed in the different versions?

Could I continue from the last savefile of 87000929 and finish it as a type 4 if the difference
between types is only at the end?
According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)
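The two definitions from undoc.txt can be checked directly on a small Mersenne prime; base 3 is assumed here as the usual PRP base:

```python
# Illustrating the two PRP residue types on M13 = 2^13 - 1 = 8191,
# a known Mersenne prime, with assumed PRP base a = 3.
p = 13
n = 2**p - 1
type1 = pow(3, n - 1, n)         # type 1: a^(n-1) mod n
type4 = pow(3, (n + 1) // 2, n)  # type 4: a^((n+1)/2) mod n
print(type1)  # 1 for a prime, by Fermat's little theorem
# type4 differs from type1, which is why a DC must use the same
# residue type as the first test in order to match.
```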

kracker 2019-11-30 04:49

[QUOTE=ATH;531727]How do you specify PRP type in gpuOwL ?

Just finished my first gpuowl test using Google Colab, but it was a PRP DC and forgot to think of the PRP type, so it finished the wrong type:
[url]https://mersenne.org/M87000929[/url]

I found a type 1 result to DC for the next one, so that should be ok, but how do I choose the type? It is fixed in the different versions which type it uses?

Could I continue from the last savefile of 87000929 and finish it as a type 4 if the difference
between types is only at the end?
According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)[/QUOTE]
Pretty sure the PRP type is hard-coded, but gpuowl has changed types several times in the past... Savefiles, I'm not sure.
[QUOTE=storm5510;531718]I have two caveats:

[U]#1:[/U] The screen writes could be more frequent. It appears to be 10,000 iterations, or 10,000 something. With my vision being what it is, I walk by the screen and wonder if it is still running, or if it has frozen. Allow the user to decide by making this a [I]config.txt[/I] option. Being an antique programmer, I understand there may be some effort involved.

[U]#2:[/U] For every exponent run, a folder containing checkpoint information is created, but not deleted after completion of the test. The housekeeping could be better.

Other than these, I feel [I]gpuOwl[/I] does a really good job. I have only ran P-1's with it. Stage 2 is far faster than any of the other programs I have used.[/QUOTE]

#1- You can use -log... but it's the opposite of what you want, it slows down the output to a multiple of 10000 (last I recall)... If you use Windows, you can get something like GPU-Z to see gpu usage percentages- to see if it's actually working the gpu...

#2- So much yes... I run gpuowl from google drive and running P-1 I run out of space so fast and have to manually delete them.... I realize I can make a script... whenever I feel like it :razz:

storm5510 2019-11-30 20:05

[QUOTE=kracker;531729]#1- You can use -log... but it's the opposite to what you want, it slows down the output to a multiple of 10000(last i recall)... If you use windows, you can get something like GPU-Z to see gpu usage percentages- to see if it's actually working the gpu...[/QUOTE]

The output is in multiples of 10,000 now, so there is not much point in messing with that.

It is working the GPU, to a point. I have a gadget, image snip below, that sits in the upper-right corner of the screen. It displays the GPU temperature, among other things. Idle is around 30°C. GPU-Z says a 100% load. The GPU temperature shown in the gadget is with gpuOwl running on a large P-1.

LaurV 2019-12-01 04:02

[QUOTE=R. Gerbicz;531680]And have you ever found a mismatch in the residues?[/QUOTE]
Not with the new Gerbicz/PRP tests. I will let you know for sure! hehe.... But currently, I still use the ol' good cudaLucas with no such improvements, and with the residue as part of the file name, and use a batch to check, and yes, sometimes cosmic rays have a bad habit of nesting exactly inside my computers! In that case, both instances will retry from the last good checkpoint. See my posts about M666666667 here around, which had to resume a few times during the test.

preda 2019-12-01 08:28

[QUOTE=ATH;531727]How do you specify PRP type in gpuOwL ? [/QUOTE]
It's not settable, it depends on the version. The current version is type-1, and there aren't plans to change that anymore.
[QUOTE]
Could I continue from the last savefile of 87000929 and finish it as a type 4 if the difference
between types is only at the end?
According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)[/QUOTE]
They may be compatible. You're right that the change between types is only at the end, but there are small differences between savefile versions (unrelated to type). Anyway, if gpuowl can't load the savefile it won't (i.e. it's safe to try); it may also be possible to manually "massage" the savefiles carefully, but probably not worth the effort. There's not much risk of data corruption in the process, as the savefile is validated on load.

PS: all GpuOwl's savefiles have a text header of *one line*, followed by binary. E.g. on Linux you can see the header like this:
head -1 savefile.owl
If the binary part is perfectly preserved, the text header could be altered, but the editor would need to not mess around with the binary.

ATH 2019-12-01 14:24

Which version was the last to use type 4 ?

This is a problem for future PRP DC, in that people need specific knowledge to be able to DC them.
I should have been experienced enough to think about this, but I forgot. Many other users will not even know about the different types.

I assume Prime95/mprime can DC gpuowl results if the type is the same? Hopefully primenet sets type 4 automatically when needed.

ATH 2019-12-01 15:51

Anyone with an old working gpuowl version that uses type 4 PRP tests want to try and finish this test as a type 4?
[url]http://mersenne.org/M87000929[/url]

I have the savefiles after 86750000 and 87000000 iterations.

Prime95 2019-12-02 00:06

@preda: Any particular reason the -block command line argument is no longer available?

Prime95 2019-12-02 03:34

@preda; ./gpuowl -pm1 2000003

fails

preda 2019-12-02 11:54

[QUOTE=Prime95;531834]@preda; ./gpuowl -pm1 2000003

fails[/QUOTE]

Thanks, I just fixed a problem with a timer, which was causing too frequent saves (in P-1). Other than that (rather severe problem), it seems to run correctly. How was it failing for you? Is it still failing?

preda 2019-12-02 12:22

[QUOTE=Prime95;531829]@preda: Any particular reason the -block command line argument is no longer available?[/QUOTE]

I re-enabled it for now, as I think I don't have a very strong reason to disable it yet.

I think that a block-size of 400 is a rather nice and overall good value (note, this is a bit smaller than the old default of 500). Why do you need a custom block-size, and what value do you usually set it to?

As I have 2 GPUs (an XFX and an Asrock) that sometimes generate errors (about 1-2 per day), I've come to appreciate a smaller block size, and I added a bit of logic to adaptively vary the default check-step depending on the number of errors so far, starting with a check-step of 200'000 and roughly halving it for each additional error, down to 20'000.

And there is one more reason for the smallish block-size: relative to the PRP-proof (future), the plan right now is to have the proof cover (for exponent E) a region from beginning up to an iteration that is a multiple of 1024 * block-size (such that any halving step in this region hits a block-size boundary and can be checked). This leaves a "tail" of up to 1024 * blockSize iterations at the end that are not covered by the proof, and that will need to be re-run by the checker, thus it's good for the tail to not be too large.
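That adaptive check-step logic, as I read it, could be sketched like this (a model of the description, not the actual GpuOwl code):

```python
def check_step(errors, start=200_000, floor=20_000):
    """Start at a 200'000-iteration check-step and roughly halve it
    for each error seen so far, never dropping below 20'000.
    (Sketch based on the description above, not GpuOwl's source.)"""
    return max(start >> errors, floor)

print(check_step(0))   # 200000: no errors yet
print(check_step(2))   # 50000: halved twice
print(check_step(10))  # 20000: clamped at the floor
```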

storm5510 2019-12-02 13:41

[QUOTE=preda;531848]I re-enabled it for now, as I don't yet have a very strong reason to disable it.

I think a block size of 400 is a rather nice and overall good value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?

As I have 2 GPUs (an XFX and an Asrock) that sometimes generate errors (about 1-2 per day), I have come to appreciate a smaller block size, and I added a bit of logic to adaptively vary the default check-step depending on the number of errors so far: starting with a check-step of 200'000 and roughly halving it for each additional error, down to 20'000.

And there is one more reason for the smallish block size: for the (future) PRP proof, the plan right now is to have the proof cover, for exponent E, the region from the beginning up to an iteration that is a multiple of 1024 * block-size (such that any halving step in this region hits a block-size boundary and can be checked). This leaves a "tail" of up to 1024 * blockSize iterations at the end that is not covered by the proof and will need to be re-run by the checker, so it's good for the tail not to be too large.[/QUOTE]

After reading all the above, I don't think I want to change what I have, for now. It runs very well.

I have only used it for P-1 tests. I just have to make sure the "F" in "PFactor" is a capital. I think [I]PrimeNet[/I] issues these in lower case. It took me quite a while to figure out how to customize the bounds. Once done, no problems... :smile:

Prime95 2019-12-02 15:31

[QUOTE=preda;531848]
I think a block size of 400 is a rather nice and overall good value (note, this is a bit smaller than the old default of 500). Why do you need a custom block size, and what value do you usually set it to?[/QUOTE]

I use a block size of 1000. My Radeons have been pretty solid. Most go for a month or more without errors. I increase voltage or reduce mem speed if a Radeon gives me more than a couple of errors in a week.

I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead.
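Those two figures are consistent with a simple model in which the Gerbicz check costs roughly two extra multiplications per block (one to update the running product each block, plus the verification itself amortized over the check interval). A hedged back-of-envelope, not gpuowl's actual accounting:

```python
def gec_overhead(block_size):
    """Approximate fractional overhead of the Gerbicz error check:
    ~2 extra multiplications per block of block_size squarings."""
    return 2.0 / block_size

assert abs(gec_overhead(1000) - 0.002) < 1e-12   # ~0.2%
assert abs(gec_overhead(400) - 0.005) < 1e-12    # ~0.5%
```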

I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs. I'm not suggesting you add this feature -- it's a bit of overkill.

The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)

R. Gerbicz 2019-12-02 17:23

[QUOTE=Prime95;531855] A block size of 400 has a 0.5% overhead.
[/QUOTE]

These calculations should also include the cost of the (possible) rollbacks, when you are redoing iterations. Of course, the task is to minimize this (expected!) cost of error check + rollbacks.
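A toy version of that trade-off, with made-up constants: a fixed check cost of ~2/L per iteration plus, on each error, a rollback that redoes on average half a check interval. None of these numbers come from gpuowl; they only illustrate the optimization Gerbicz describes.

```python
def expected_overhead(block_size, check_step, err_per_iter):
    """Expected fractional overhead: check cost plus expected rollback cost.
    On an error we assume we redo, on average, half the check interval."""
    check_cost = 2.0 / block_size
    rollback_cost = err_per_iter * check_step / 2.0
    return check_cost + rollback_cost

# With frequent errors, a shorter check interval wins:
noisy = 2e-6   # assumed error rate per iteration
assert expected_overhead(400, 20_000, noisy) < expected_overhead(400, 200_000, noisy)
# With a solid GPU, the long interval is essentially free:
solid = 1e-9
assert expected_overhead(1000, 200_000, solid) < 0.0022
```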

preda 2019-12-02 20:14

[QUOTE=Prime95;531855]I use a block size of 1000. My Radeons have been pretty solid. Most go for a month or more without errors. I increase voltage or reduce mem speed if a Radeon gives me more than a couple of errors in a week.

I chose 1000 because Mr. Gerbicz's original threads used that value, calculating a 0.2% overhead. A block size of 400 has a 0.5% overhead.

I understand frequent errors make a smaller block size desirable. Prime95 automatically reduces the block size when an error occurs. I'm not suggesting you add this feature -- it's a bit of overkill.

The P-1 error I was getting: GPU->host read failed (check 61e4 vs 3f07)[/QUOTE]

The difference between 0.2% and 0.5% is minor, though. How does Prime95 reduce/change the block size? Are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?

The P-1 error is strange; I don't understand why you were getting it. It seems the memory transfer (reading from the GPU), or the synchronization around it (i.e. waiting for it to finish), was failing.

Prime95 2019-12-02 21:43

[QUOTE=preda;531872]The difference between 0.2% and 0.5% is minor though. How does Prime95 reduce/change the block size? -- are you sure you're not reducing the "check size" (how often the check is done) while keeping the block size the same?[/QUOTE]

Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward.

R. Gerbicz 2019-12-02 21:51

[QUOTE=Prime95;531878]Once you pass a Gerbicz error check (or fail and rollback to a save file that passed a check) you are essentially in a virgin state where you can select any block size you want going forward.[/QUOTE]

Yeah, with f(n) = a^(2^n) mod N
it is trivial that f(s+t) = f(t)^(2^s), so you can start a new block length (a new L) at
an error-checked residue at iteration t, using a new "base" = f(t).
(Why? Because you trust that at iteration t you have, with high probability, a good residue.)

The only difference is that at the error check you need to multiply by f(t) rather than by the
smallish a=3. So at the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).

ewmayer 2019-12-02 22:38

[QUOTE=R. Gerbicz;531879]Yeah, with f(n) = a^(2^n) mod N
it is trivial that f(s+t) = f(t)^(2^s), so you can start a new block length (a new L) at
an error-checked residue at iteration t, using a new "base" = f(t).
(Why? Because you trust that at iteration t you have, with high probability, a good residue.)

The only difference is that at the error check you need to multiply by f(t) rather than by the smallish a=3. So at the leading-wavefront exponents you'd need ~100-250 more mulmods (almost nothing in computation time).[/QUOTE]
Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual. Or were you referring to a pure-integer modmul? (If so, why?)

kriesel 2019-12-03 01:00

[QUOTE=ATH;531727]How do you specify PRP type in gpuOwL ?

Just finished my first gpuowl test using Google Colab, but it was a PRP DC and I forgot to think about the PRP type, so it finished as the wrong type:
[URL]https://mersenne.org/M87000929[/URL]

I found a type 1 result to DC for the next one, so that should be ok, but how do I choose the type? Is the type used fixed in the different versions?

Could I continue from the last savefile of 87000929 and finish it as a type 4 if the difference
between types is only at the end?
According to undoc.txt from Prime95:
type 1: a^(n-1)
type 4: a^((n+1)/2)[/QUOTE]
There's a whole reference thread on gpuowl in [URL]https://mersenneforum.org/showthread.php?t=24607[/URL]
[URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL] are about gpuowl versions and residue types.
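Regarding the type-1 vs type-4 definitions quoted above: if I read them correctly, the two residues are related by r4^2 = a^(n+1) = a^2 * r1 (mod n), so one can be checked against the other even though the reported residues differ. A small-number illustration (toy Mersenne moduli, not real assignments):

```python
# type 1: r1 = a^(n-1) mod n; type 4: r4 = a^((n+1)/2) mod n
# Since (a^((n+1)/2))^2 = a^(n+1) = a^2 * a^(n-1), we get r4^2 == a^2 * r1 (mod n).
a = 3
for p in (11, 13):          # M11 = 2047 is composite, M13 = 8191 is prime
    n = 2**p - 1
    r1 = pow(a, n - 1, n)
    r4 = pow(a, (n + 1) // 2, n)
    assert r4 * r4 % n == a * a * r1 % n
```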

kracker 2019-12-03 05:18

I've been having a weird issue with gpuowl. I have a system (RX570) which I run headless most of the time... When an assignment finishes, gpuowl will write the result and then do nothing... until I log in with RDP, at which point gpuowl immediately starts the next assignment. Tried running mfakto... zero issues.
[code]
2019-12-02 02:25:51 core 92912081 P2 2880/2880: setup 1128 ms; 5931 us/prime, 9223 primes
2019-12-02 02:25:51 core waiting for background GCDs..
2019-12-02 02:25:51 core 92912087 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.72 bits/word
2019-12-02 02:25:51 core OpenCL args "-DEXP=92912087u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x9.b3f5913600238p-3 -DIWEIGHT_STEP=0xd.311c9cb7274a8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-02 02:25:53 core OpenCL compilation in 2060 ms
2019-12-02 02:26:39 core 92912081 P2 GCD: no factor
2019-12-02 02:26:39 core {"exponent":"92912081", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.11-11-gfaaa2f2"}, "timestamp":"2019-12-02 10:26:39 UTC", "user":"kracker", "computer":"core", "aid":"----", "fft-length":5242880, "B1":720000, "B2":13680000}
2019-12-02 06:07:51 core 92912087 P1 B1=720000, B2=13680000; 1038539 bits; starting at 0
2019-12-02 06:08:38 core 92912087 P1 10000 0.96%; 4698 us/sq; ETA 0d 01:21; b195a86475b0f7e5
[/code]

R. Gerbicz 2019-12-03 09:51

[QUOTE=ewmayer;531880]Why so many mulmod-equivalents? Just forward-FFT the pure-integer f(t) read from the savefile and do a 2-input FFT-modmul as usual.[/QUOTE]

I see it, you're right.

kriesel 2019-12-05 22:58

Multiple instances not always better
 
It's been reported that on a Radeon VII, running two instances improves total throughput and throughput per watt-hour.
I found a case where two very different instances deliver only ~95% of single-instance throughput.
This case combines very different gpuowl versions, computation types (LL vs. PRP3), exponents, and hence FFT lengths.
Windows 10, Lenovo Thinkstation D30, XFX Radeon VII; stock settings.

gpuowl v0.6 alone: 1.005 ms/iter (50330737 LL DC, 4M fft) = 995 iter/sec alone

v6.11 alone: 1.193 ms/iter (89260099 PRP, 5M fft) = 838 iter/sec alone

Two disparate instances run together:
gpuowl v0.6: 2.161 ms/iter; 463 iter/sec; throughput 463/995 = 0.4651 of solo;
v6.11: 2.458 ms/iter; 407 iter/sec; throughput 407/838 = 0.4855 of solo;
combined: 0.4651 + 0.4855 = 0.9506 < 1. Noticeably slower running together.
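The arithmetic above, spelled out from the ms/iter measurements (the tiny difference from the quoted 0.9506 is rounding in the iter/sec figures):

```python
# Solo and shared ms/iter from the two runs above
solo = {"v0.6": 1.005, "v6.11": 1.193}
shared = {"v0.6": 2.161, "v6.11": 2.458}

# Each instance's share of its own solo throughput, summed
total = sum(solo[k] / shared[k] for k in solo)
assert 0.94 < total < 0.96   # ~0.95: combined throughput below one solo instance
```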

kriesel 2019-12-07 22:07

What variables determine max P-1 exponent for two stages
 
Several scenarios with different observed limits on exponent in gpuowl P-1; several differences, some matches.

A) Windows 7 Pro, dual old Xeon 4-core processors, 24GB system ram, NVIDIA GTX1080Ti with 11gb gpu ram, gpuowl v6.7-4-g278407a, -maxAlloc [B]10240[/B]
Observed maximum exponent for completing two stages with GPU72 bounds, at least 514M but less than 520M, testing 517M next
From the gpuowl.log,
2019-11-10 01:45:55 520000009 5580000 99.94%; 26842 us/sq; ETA 0d 00:01; 1baeff7353d9f091
2019-11-10 01:47:23 Not enough GPU memory, will skip stage2. Please wait for stage1 GCD

B) Colab, and so Ubuntu linux on VM, 2 cores Xeon, KVM hypervisor, ~12.7GB system ram, NVIDIA Tesla K80 with 12gb gpu ram, gpuowl: Fan Ming's build for Colab, -maxAlloc [B]10240[/B] [URL]https://www.mersenneforum.org/showpost.php?p=528390&postcount=379[/URL]
Observed maximum exponent for completing two stages with GPU72 bounds, at least 564M, upper bound TBD, testing 665M now

C) different Windows 7 Pro system, dual old Xeon 6-core processors, 12GB system ram, AMD RX480 with 8GB gpu ram, Gpuowl V6.6-5-667954b, -maxAlloc not an available option in that version
Observed maximum exponent for completing two stages with GPU72 bounds, at least 500M (gpu-z indicated 3.7GB used in stage 2), upper bound TBD, testing 530M now

D) AMD RX550 with 4GB gpu ram, Gpuowl v6.7-4-g278407a,
Observed maximum exponent for completing two stages with GPU72 bounds, at least 150M, upper bound 224M, testing 180M now

E) AMD RX550 with 2GB gpu ram, gpuowl-v6.10-9-g54cba1d -maxAlloc 1900
Observed maximum exponent for completing two stages with GPU72 bounds, at least 24M, upper bound TBD

F) NVIDIA GTX1060 3GB, gpuowl v6.9, -maxAlloc 3000
Observed maximum exponent for completing two stages with GPU72 bounds, <24M, not useful for GIMPS wavefront

G) NVIDIA GTX1080 with 8GB gpu ram,
Observed maximum exponent for completing two stages with GPU72 bounds, at least 499M, upper bound TBD

Any ideas why the difference in observed limits, especially for the same -maxAlloc? I'd really appreciate Preda's thoughts on what determines that. There's clearly more to it than just GPU RAM. If -maxAlloc alone controlled the max exponent, the limits in scenarios A and B would be very similar or identical, and they're not.

Re the unknown linux system RAM amount: how do I ask the Colab VM's linux for that?
After a web search, tried
!lshw -short
!sudo lshw -short
!sudo dmidecode -t memory
and got
bash: lshw: command not found
sudo: lshw: command not found
sudo: dmidecode: command not found

Finally got it with !cat /proc/meminfo, since top is not part of the gpuowl scripts.
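For reference, the MemTotal line can also be pulled out programmatically. A small sketch that parses /proc/meminfo-style text (demonstrated on a sample string so it runs anywhere; on Colab you'd pass the real file's contents):

```python
def mem_total_gb(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])
            return kb / 1024**2
    raise ValueError("MemTotal not found")

# On a real Linux system: mem_total_gb(open("/proc/meminfo").read())
sample = "MemTotal:       13335276 kB\nMemFree:         9031340 kB"
assert 12.5 < mem_total_gb(sample) < 13.0   # ~12.7 GB, Colab-ish
```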

kriesel 2019-12-08 13:43

Dept of corrections
 
[QUOTE=kriesel;532301]Several scenarios, different observed limits on exponent in gpuowl P-1, several differences, some matches

A) Windows 7 Pro, dual old Xeon 4-core processors, 24GB system ram, NVIDIA GTX1080Ti with 11gb gpu ram, gpuowl v6.7-4-g278407a, -maxAlloc 10240
Observed maximum exponent for completing two stages with GPU72 bounds, at least 510M but less than [B]514[/B]M, testing [B]511[/B]M next. [/QUOTE]The FFT length transition from 28672K to 36864K is sizable, and 28672K's max exponent is indicated in the help output as 510.47M.
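That 510.47M limit is consistent with max exponent ≈ FFT length in words × a maximum bits-per-word of about 17.39 (the exact per-FFT ceiling gpuowl uses may differ; this only shows the implied arithmetic):

```python
def approx_max_exponent(fft_k, bits_per_word=17.39):
    """Rough max exponent for an FFT of fft_k kilowords,
    assuming a ~17.39 bits/word ceiling (illustrative value)."""
    return fft_k * 1024 * bits_per_word

# 28672K lands near the 510.47M limit quoted from the help output
assert 508e6 < approx_max_exponent(28672) < 512e6
```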

Prime95 2019-12-08 19:51

New gpuOwl version
 
A new version of gpuOwl is ready with some fairly decent speed and power improvements. The key change was merging the transpose and middle steps into one kernel. This reduces memory bandwidth requirements by 33%.

On my test Radeon VII system running a 5M FFT, the one-instance timing was 950us and the two-instance timing was 1720us. Timings with the new code are 839us, along with a 17-watt reduction in power.

For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [url]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/url]

To activate the new code, add "-use MERGED_MIDDLE" to the command line. Do not run 2 instances, it will reduce throughput.

The new code is still undergoing some minor tuning and cleanup, especially with regard to architectures other than Radeon VII. It needs QA on P-1 tasks.

Bonus: I added the -cleanup command line argument to delete PRP save files at the end of the run for composite results.

paulunderwood 2019-12-08 20:23

[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. The key change was merging the transpose and middle steps into one kernel. This reduces memory bandwidth requirements by 33%.

On my test Radeon VII system running a 5M FFT, the one-instance timing was 950us and the two-instance timing was 1720us. Timings with the new code are 839us, along with a 17-watt reduction in power.

For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [url]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/url]

To activate the new code, add "-use MERGED_MIDDLE" to the command line. Do not run 2 instances, it will reduce throughput.

The new code is still undergoing some minor tuning and cleanup especially with regards to architectures other than Radeon VII. It needs QA on P-1 tasks.

Bonus: I added the -cleanup command line argument to delete PRP save files at the end of the run for composite results.[/QUOTE]

A very nice speed-up, from 990us to 860us :tu:

kriesel 2019-12-08 22:32

[QUOTE=paulunderwood;532381]A very nice speed up from 990us to 860us :tu:[/QUOTE]Or 1720/2 = 860, down to 839: 2.5% higher throughput. There's also around a 6% single-instance speedup for p=89796247, fft 5M, gtx1080, Win7 Pro, gpuowl v6.11-71-g7e02b07, and perhaps a little more to come. (That commit took different -use input.)

xx005fs 2019-12-08 22:37

[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. The key change was merging the transpose and middle steps into one kernel. This reduces memory bandwidth requirements by 33%.

On my test Radeon VII system running a 5M FFT, the one-instance timing was 950us and the two-instance timing was 1720us. Timings with the new code are 839us, along with a 17-watt reduction in power.

For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [url]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/url]

To activate the new code, add "-use MERGED_MIDDLE" to the command line. Do not run 2 instances, it will reduce throughput.

The new code is still undergoing some minor tuning and cleanup especially with regards to architectures other than Radeon VII. It needs QA on P-1 tasks.

Bonus: I added the -cleanup command line argument to delete PRP save files at the end of the run for composite results.[/QUOTE]

Awesome! I will test on my Titan V which is severely bounded by memory bandwidth and I'll also experiment with the P100s on Colab with this version when I have time.

UPDATE 1: Significant speed up with my Vega 64, going from 2100us/it to 1870us/it. Not bad at all.

UPDATE 2: Getting the following error on my Nvidia GPUs. This is on Windows 10 1909 with an Nvidia Titan V running driver version 440.97, using George's executable:
[CODE]OpenCL compilation error -11 (args -DEXP=94205039u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2daa303b0ff18p-3 -DIWEIGHT_STEP=0xf.a6a9923a8d87p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DMERGED_MIDDLE=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 14:43:55 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^

2019-12-08 14:43:55 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 14:43:55 Bye[/CODE]

kriesel 2019-12-08 23:17

[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. ...

For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [URL]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/URL]
[/QUOTE]
[CODE]C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:54 Note: no config.txt file found
2019-12-08 17:07:54 config: -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:56 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc
2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:56 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGH
T_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 17:07:56 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^

2019-12-08 17:07:56 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:56 Bye

C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:56 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:57 Note: no config.txt file found
2019-12-08 17:07:57 config: -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:57 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:58 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc
2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:58 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGH
T_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math
-cl-std=CL2.0)
2019-12-08 17:07:58 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^

2019-12-08 17:07:59 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:59 Bye[/CODE]Preda made a pragma fix in his latest commit, 6.11-71-g7e02b07, which apparently hasn't made it into prime95's Windows build yet.

xx005fs 2019-12-08 23:28

[QUOTE=kriesel;532389][CODE]C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:54 Note: no config.txt file found
2019-12-08 17:07:54 config: -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:56 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc
2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:56 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGH
T_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 17:07:56 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^

2019-12-08 17:07:56 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:56 Bye

C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:56 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:57 Note: no config.txt file found
2019-12-08 17:07:57 config: -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:57 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:58 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc
2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:58 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGH
T_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math
-cl-std=CL2.0)
2019-12-08 17:07:58 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^

2019-12-08 17:07:59 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:59 Bye[/CODE]Preda made a pragma fix in his latest commit, 6.11-71-g7e02b07, which apparently hasn't made it into prime95's Windows build yet.[/QUOTE]

I see. Any build instructions for Windows, then? It would be great if I could figure out how to build on Windows myself :D

Prime95 2019-12-08 23:42

[QUOTE=kriesel;532389]Preda had made a pragma fix for his latest commit, 6.11-71-g7e02b07, which apparently didn't make it into prime95's Windows build yet.[/QUOTE]

I'm waiting for one last merge before making another Windows executable.

kriesel 2019-12-09 00:19

[QUOTE=xx005fs;532390]I see. Any build instruction for windows then? It would be great if I can figure out how to build on windows myself :D[/QUOTE]Probably best to let Preda and Prime95 get back into sync first.

But in general, for relatively recent gpuowl versions, on Windows,
do steps 1 through 4 of kracker's instructions at [URL]https://www.mersenneforum.org/showpost.php?p=483209&postcount=356[/URL]
(The AMD APP SDK 3.0 link has gone dead. See for example [URL]https://github.com/fireice-uk/xmr-stak/issues/1511[/URL] or [URL]https://en.wikipedia.org/wiki/AMD_APP_SDK[/URL])

Install git in msys2.
(This may not be the whole story for setting up to compile.)

In an msys2 cmd prompt box from here on:
# to refresh a git working folder:
git pull [URL]https://github.com/preda/gpuowl[/URL]

#or to new folder that has not been a git folder before:
git clone [URL]https://github.com/preda/gpuowl[/URL]

cd gpuowl
make gpuowl-win.exe

To use the executable, switch to an NT command prompt box; it won't run in the msys2 context.
Msys2 is a Linux-like environment, while the executable is a Windows executable, so it's a sort of cross-compile.

I usually run gpuowl-win.exe -h immediately, both to save it and to verify that the newly compiled program works well enough to identify the GPUs on the build box. Since it's OpenCL based, it's the same build whether used on AMD or NVIDIA GPUs.

kracker 2019-12-09 00:26

[QUOTE=kriesel;532394]
To use the executable, switch to an NT command prompt box. It won't run in the msys2 context.[/QUOTE]

It'll run; you just have to tell it to look in the "current" folder, e.g. "./gpuowl-win.exe".

xx005fs 2019-12-09 00:32

[QUOTE=kriesel;532394]Probably best to let Preda and Prime95 get back into sync first.

But in general, for relatively recent gpuowl versions, on Windows,
do steps 1 through 4 of kracker's instructions at [URL]https://www.mersenneforum.org/showpost.php?p=483209&postcount=356[/URL]
(The AMD APP SDK 3.0 link has gone dead. See for example [URL]https://github.com/fireice-uk/xmr-stak/issues/1511[/URL] or [URL]https://en.wikipedia.org/wiki/AMD_APP_SDK[/URL])

Install git on msys2
This may not be the whole story for setting up for compiles.

In an msys2 cmd prompt box from here on:
# to refresh a git working folder:
git pull [URL]https://github.com/preda/gpuowl[/URL]

#or to new folder that has not been a git folder before:
git clone [URL]https://github.com/preda/gpuowl[/URL]

cd gpuowl
make gpuowl-win.exe

To use the executable, switch to an NT command prompt box. It won't run in the msys2 context.
Msys2 is a linux like environment. The executable is a Windows executable. It's a sort of cross-compile.

I usually run gpuowl-win.exe -h immediately, both to save it, and to verify the newly compiled program is working well enough to identify gpus on the build box. Since it's OpenCL based, it's the same build whether used on AMD or NVIDIA gpus.[/QUOTE]

Thank you so much! I was wondering what step I was missing that was causing a bunch of nasty OpenCL link errors: I had never copied the libraries from the APP SDK folders into MSYS2.

kracker 2019-12-09 00:33

Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)

kriesel 2019-12-09 00:44

[QUOTE=kracker;532397]Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)[/QUOTE]Could you get some comparative wattage readings from nvidia-smi?

kriesel 2019-12-09 00:48

[QUOTE=xx005fs;532396]Thank you so much! I was wondering what step I was missing that was causing a bunch of nasty OpenCL link errors, it's because I have never copied the libraries from the APP SDK folders into MSYS2.[/QUOTE]You're welcome; I've been there myself, so I try not to break the setup once it's working. See also [URL]https://www.mersenneforum.org/showthread.php?t=24938&highlight=msys2&page=4[/URL], including the caution about an unannounced system shutdown.
Have fun!

kracker 2019-12-09 01:06

[QUOTE=kriesel;532399]Could you get some comparative wattage readings from nvidia-smi?[/QUOTE]

The readings seem to change a lot... power usage as shown in nvidia-smi has been slowly climbing over the past several minutes...

EDIT: looks like it's semi stabilized... ~180W without, ~190W with.

xx005fs 2019-12-09 01:20

[QUOTE=kracker;532397]Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)[/QUOTE]

I also tested the K80 with 5120K FFT, went down from ~4350us/it before to around 3300us/it depending on the instance. Pretty impressive speedup.

More updates: The updated source code by Preda works on Windows now, and I'm seeing almost exactly a 33% speed-up on my Titan V, much less for a regular Vega. One strange thing: the progress indicator that changes from . to o to 0 and then to * seems to slow down my Colab console and leaves a symbol in front of every line in the log; I don't know if that's intentional. Is there an option to disable it?

kracker 2019-12-09 04:54

[QUOTE=kracker;532397]Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)[/QUOTE]

With MERGED_MIDDLE,WORKINGOUT,WORKINGIN4, it dropped further to 754 us/it... a very impressive 41% speed boost from the beginning!

dcheuk 2019-12-09 05:55

I tried this on one of my Radeon VII cards, which has not yet given me any errors in the last 4-5 PRP tests (while the other returned too many lol). This card sits in the second slot with no display attached.

[CODE]2019-12-08 23:47:19 config.txt: -user dcheuk/gpu01 -use ORIG_X2 -device 1 -log 100000 -use MERGED_MIDDLE
2019-12-08 23:47:19 config.txt:
2019-12-08 23:47:19 gfx906-0 94607437 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 18.04 bits/word
2019-12-08 23:47:20 gfx906-0 OpenCL args "-DEXP=94607437u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.8262bb7326f28p-3 -DIWEIGHT_STEP=0x8.40cb53a4a1fd8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 23:47:21 gfx906-0 OpenCL compilation in 1.31 s
2019-12-08 23:47:22 gfx906-0 94607437 OK 2071500 loaded: blockSize 500, 132c5e1692604fd6
2019-12-08 23:47:23 gfx906-0 94607437 OK 2072500 2.19%; 891 us/it (min 885 885); ETA 0d 22:54; 8d4ac7f8617372d8 (check 0.53s)
2019-12-08 23:47:48 gfx906-0 94607437 OK 2100000 2.22%; 887 us/it (min 884 884); ETA 0d 22:48; f8d6a63b03cfa32a (check 0.53s)
2019-12-08 23:49:17 gfx906-0 94607437 OK 2200000 2.33%; 887 us/it (min 884 884); ETA 0d 22:47; 42044b1ea9fb8b01 (check 0.53s)
2019-12-08 23:50:47 gfx906-0 94607437 OK 2300000 2.43%; 887 us/it (min 884 884); ETA 0d 22:45; fcd02bb8420d5ba7 (check 0.53s)
2019-12-08 23:52:17 gfx906-0 94607437 OK 2400000 2.54%; 887 us/it (min 884 884); ETA 0d 22:43; d784ed68cfa19bd7 (check 0.53s)
2019-12-08 23:53:46 gfx906-0 94607437 OK 2500000 2.64%; 887 us/it (min 884 884); ETA 0d 22:42; 79d614fc892e7a5a (check 0.53s)
[/CODE]

Tuned at 1449MHz core, 867mV, 1200MHz memory. Fan at about 75%, with temperature hovering at 64-66C, junction 78-81C. Ambient temperature 20C. Wattage 140-143W. Very impressive; I'm amazed at what you guys can do. Good work. :smile:

Prime95 2019-12-09 06:13

[QUOTE=kracker;532423]With MERGED_MIDDLE,WORKINGOUT,WORKINGIN4, it dropped further to 754 us/it... a very impressive 41% speed boost from the beginning![/QUOTE]

Preliminary results from Ken suggested WORKINGOUT4 is better than WORKINGOUT. Of course, that was from a huge sample size of 1 nVidia card.

kriesel 2019-12-09 09:16

[QUOTE=Prime95;532431]Preliminary results from Ken suggested WORKINGOUT4 is better than WORKINGOUT. Of course, that was from a huge sample size of 1 nVidia card.[/QUOTE]
p=89796247, fft 5M, gtx1080, Win7 pro, typ timing iters 9200
obtained with -time -iters 10000

[CODE]ms/it -use options
5124 no_asm
5120 no_asm
4868 no_asm,merged_middle,workingin
4873 no_asm,merged_middle,workingin
4873 no_asm,merged_middle,workingin1
4951 no_asm,merged_middle,workingin1a
4876 no_asm,merged_middle,workingin2
4874 no_asm,merged_middle,workingin3
[B]4865[/B] no_asm,merged_middle,workingin5
4878 no_asm,merged_middle,workingout
4911 no_asm,merged_middle,workingout0
4872 no_asm,merged_middle,workingout1
4950 no_asm,merged_middle,workingout1a
4881 no_asm,merged_middle,workingout2
4875 no_asm,merged_middle,workingout3
[B]4836[/B] no_asm,merged_middle,workingout4
4876 no_asm,merged_middle,workingout5[/CODE]repeatability ~+/-0.05%
5122/4836= 1.059

obtained with this batch file derived from a list of cases George requested:
[CODE]:iter count is required to be multiple of 10000
set iters=10000
:first one was there just to ensure the gpu is warmed up and clock-stable somewhat, ignore its timing, use the second, but maybe the first 800 iters block does that
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
:repeated, let's see reproducibility once; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5[/CODE]
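A quick sketch of the summary arithmetic above (baseline averaged over the two no_asm runs, divided by the best workingout4 time); the timings are copied from the table, everything else is just illustration:

```python
# Reproduce the speedup arithmetic from the timing table above.
# Values are ms/it, copied from the posted gtx1080 results.
baseline_runs = [5124, 5120]          # the two no_asm runs
best = 4836                           # no_asm,merged_middle,workingout4

baseline = sum(baseline_runs) / len(baseline_runs)   # 5122.0
speedup = baseline / best
print(f"baseline {baseline:.0f} ms/it, best {best} ms/it, speedup {speedup:.3f}")
# -> speedup 1.059
```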

kriesel 2019-12-09 10:17

gtx1080 again

ms/iter; -use case
5055 no_asm
5104 no_asm
4848 NO_ASM,MERGED_MIDDLE,WORKINGIN
4863 NO_ASM,MERGED_MIDDLE,WORKINGIN
4851 NO_ASM,MERGED_MIDDLE,WORKINGIN4
4859 NO_ASM,MERGED_MIDDLE,WORKINGIN5
4873 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT5

retest with minimal user interaction:
5058 no_asm
5091 no_asm
4837 NO_ASM,MERGED_MIDDLE,WORKINGIN
4836 NO_ASM,MERGED_MIDDLE,WORKINGIN
4836 NO_ASM,MERGED_MIDDLE,WORKINGIN4
4833 NO_ASM,MERGED_MIDDLE,WORKINGIN5
4835 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT5

5091/4833 =~ 1.053

nomead 2019-12-09 12:50

RTX 2080, Linux. Just some general observations and incoherent rambling, I haven't done that much tuning. In fact I haven't touched gpuOwL at all before this past weekend, so it's all a bit new to me.

I started with whatever was committed to github up until 2019-12-04. First I had some issues with the compilation, but that was due to those #pragma statements commented out in gpuowl.cl (fixed now).

Then the program apparently needed the -use NO_ASM option to run (thanks SELROC for pointing that out on IRC).

After that I got the program running... but the timings seemed to be all over the place. 2816K was 3.743 ms/iter but the next one I tested, 5120K, was 3.884 ms/iter, and the difference should be bigger, so something must be wrong. Well, after some fiddling around, I found out how to specify the FFT options (width/height/middle) and found that the default settings were, most of the time, not the fastest ones. Maybe I should have read through this thread better...

Anyway, how would one specify both FFT size and other options? -fft 5632K and -fft +2 seem to be mutually exclusive, only one works. And it would be really useful to have these options in some configuration file, so the program is ready to use even if the FFT size changes (and the new size is faster with different options).

Then came the commits from yesterday (2019-12-08). On the RTX2080, the calculation is limited by FP64 units, not memory bandwidth (memory bus usage is in the 20-30% range depending on FFT size), but there were still some noticeable improvements. For example, 5632K (-fft +2) was 4.396 ms/iter before, and 4.237 ms after the update, using MERGED_MIDDLE. So that's almost 4% better. The improvements vary quite a bit, from 0% to 5.5%, and the average across all FFT sizes and parameters I tested (2M to 20M) was 2.2%.

Another comparison: CUDALucas with the closest applicable FFT size (5760K) and the same hardware is 5.585 ms/iter, so gpuOwL is a bit over 30% faster. Of course the difference varies quite a lot there, too, but 20-30% seems to be the norm.

One observation, though, about that MERGED_MIDDLE improvement. If the FFT size happens to be one without that "middle" part (2M, 4M, 8M, 16M...) and the dumb user (me) still instructs the program to -use MERGED_MIDDLE then the calculation will fail. In hindsight this shouldn't be a surprise, but I plead ignorance and the effects of a Monday morning. :coffee: The error is :
[CODE]2019-12-09 08:24:18 38000009 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)
2019-12-09 08:24:18 Exiting because "error on load"
[/CODE]

kracker 2019-12-09 15:33

Some warnings when compiling... Probably unimportant.
[code]
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from checkpoint.h:5,
from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
31 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs

[/code]

kriesel 2019-12-09 16:04

[QUOTE=kracker;532452]Some warnings when compiling... Probably unimportant.
[/QUOTE]Maybe [url]https://www.mersenneforum.org/showpost.php?p=530766&postcount=40[/url] will help.

kriesel 2019-12-09 17:38

Feature request: OpenCL version test
 
Preda, please add an OpenCL version test. As previously posted, [URL]https://www.mersenneforum.org/showpost.php?p=525496&postcount=1354[/URL]

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, opencl 1.1/1.2), and assorted other older gpus, producing a shower of cl compile errors relating to atomics. I think it requires at least OpenCL 2 and therefore a CUDA compute capability above 2.x. [B]An explicit test for opencl version[/B] by gpuowl and clear message if the version is too low might be a good thing. ("Gpuowl requires OpenCL 2 support for atomics, which this gpu does not appear to support. Exiting now." or some such helpful message.)

kracker 2019-12-09 18:41

Some numbers:

RX570
[code]
5033 NO_ASM
4384 NO_ASM,MERGED_MIDDLE
7285 NO_ASM,MERGED_MIDDLE,WORKINGIN
4365 NO_ASM,MERGED_MIDDLE,WORKINGIN1
4360 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
4459 NO_ASM,MERGED_MIDDLE,WORKINGIN2
4381 NO_ASM,MERGED_MIDDLE,WORKINGIN3
4358 NO_ASM,MERGED_MIDDLE,WORKINGIN5
7433 NO_ASM,MERGED_MIDDLE,WORKINGOUT
5818 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
4400 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
4410 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
4762 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
4385 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
4610 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
4517 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
[/code]

Tesla P100
[code]
1318 NO_ASM
951 NO_ASM,MERGED_MIDDLE
945 NO_ASM,MERGED_MIDDLE,WORKINGIN
944 NO_ASM,MERGED_MIDDLE,WORKINGIN1
952 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
945 NO_ASM,MERGED_MIDDLE,WORKINGIN2
952 NO_ASM,MERGED_MIDDLE,WORKINGIN3
939 NO_ASM,MERGED_MIDDLE,WORKINGIN4
942 NO_ASM,MERGED_MIDDLE,WORKINGIN5
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
956 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
948 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
951 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
954 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
949 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
[/code]

kriesel 2019-12-09 19:50

[QUOTE=kracker;532464]Some numbers:[/QUOTE]Wow, 15% and 40%.
Thanks for running these.
Please try on any other colab gpu models when you get a chance.

kriesel 2019-12-09 19:53

[QUOTE=Prime95;532431]Of course, that was from a huge sample size of 1 nVidia card.[/QUOTE]The need for sleep and a more efficient way of getting the data intervened. You're welcome. I also had multiple gpus tied up in P-1 limit and runtime scaling runs in gpuowl versions predating P-1 save file capability at the time.

preda 2019-12-09 19:58

Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as OpenCL 1.2 although it was compiling OpenCL 2.0 fine. I'm worried that with this check, gpuowl would not even attempt to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.

[QUOTE=kriesel;532458]Preda, please add an OpenCL version test. As previously posted, [URL]https://www.mersenneforum.org/showpost.php?p=525496&postcount=1354[/URL]

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, opencl 1.1/1.2), and assorted other older gpus, producing a shower of cl compile errors relating to atomics. I think it requires at least OpenCL 2 and therefore a CUDA compute capability above 2.x. [B]An explicit test for opencl version[/B] by gpuowl and clear message if the version is too low might be a good thing. ("Gpuowl requires OpenCL 2 support for atomics, which this gpu does not appear to support. Exiting now." or some such helpful message.)[/QUOTE]

kriesel 2019-12-09 20:17

gpuowl feature request
 
Gpuowl feature request: P-1 res64 check for special very likely bad interim residues 0x00 and 0x01. P-1 is currently computing on the high wire without a safety net.

Undetected errors could cost hours or days in lengthy single P-1 runs, and could also mean missed factors.



Come to think of it, that res64 check could save some lost PRP time too when errors occur. Less incentive there though, since the excellent GEC safety net catches the errors eventually.
[CODE]2019-12-08 06:01:08 91305491 OK 62250000 68.18%; 1184 us/sq; ETA 0d 09:34; 5cf68328b1473b4a (check 0.90s) 2 errors
2019-12-08 06:02:07 91305491 62300000 68.23%; 1184 us/sq; ETA 0d 09:32; 3efaa597c7d9c53d
2019-12-08 06:03:06 91305491 62350000 68.29%; 1184 us/sq; ETA 0d 09:31; 392c10b87c906301
2019-12-08 06:04:05 91305491 62400000 68.34%; 1178 us/sq; ETA 0d 09:28; 0000000000000000 <-- already have the res64 for output, test it, return to 62250000, save 100,000 additional bad iterations until the GEC check
2019-12-08 06:05:03 91305491 62450000 68.40%; 1155 us/sq; ETA 0d 09:15; 0000000000000000
2019-12-08 06:06:02 91305491 EE 62500000 68.45%; 1156 us/sq; ETA 0d 09:15; 0000000000000000 (check 0.90s) 2 errors
2019-12-08 06:07:02 91305491 62300000 68.23%; 1204 us/sq; ETA 0d 09:42; 3efaa597c7d9c53d
2019-12-08 06:08:01 91305491 62350000 68.29%; 1184 us/sq; ETA 0d 09:31; 392c10b87c906301
2019-12-08 06:09:01 91305491 62400000 68.34%; 1184 us/sq; ETA 0d 09:30; e0c75b60654dbfb4
2019-12-08 06:10:00 91305491 62450000 68.40%; 1183 us/sq; ETA 0d 09:29; c271cc2b8386285f
2019-12-08 06:11:00 91305491 OK 62500000 68.45%; 1183 us/sq; ETA 0d 09:28; 070950e467249083 (check 0.96s) 3 errors[/CODE]Average savings would be 125,000 iterations (2.5 minutes at wavefront on Radeon VII, proportionally higher on bigger exponents or slower gpus), min 50,000 (~1 minute), max 200,000 (about 4 minutes per error in this case)[CODE]2019-12-03 21:19:42 89064097 OK 75500000 84.77%; 1214 us/sq; ETA 0d 04:34; fba20ffb703f9fb7 (check 0.91s)
2019-12-03 21:20:42 89064097 75550000 84.83%; 1189 us/sq; ETA 0d 04:28; 0000000000000000
2019-12-03 21:21:40 89064097 75600000 84.88%; 1167 us/sq; ETA 0d 04:22; 0000000000000000
2019-12-03 21:22:39 89064097 75650000 84.94%; 1171 us/sq; ETA 0d 04:22; 0000000000000000
2019-12-03 21:23:38 89064097 75700000 84.99%; 1171 us/sq; ETA 0d 04:21; 0000000000000000
2019-12-03 21:24:37 89064097 EE 75750000 85.05%; 1172 us/sq; ETA 0d 04:20; 0000000000000000 (check 0.94s)
2019-12-03 21:25:39 89064097 75550000 84.83%; 1239 us/sq; ETA 0d 04:39; 49985b238359ff96
2019-12-03 21:26:40 89064097 75600000 84.88%; 1217 us/sq; ETA 0d 04:33; 78c0f7429d9a238f
2019-12-03 21:27:41 89064097 75650000 84.94%; 1217 us/sq; ETA 0d 04:32; efab7475b57165bb
2019-12-03 21:28:42 89064097 75700000 84.99%; 1216 us/sq; ETA 0d 04:31; f43c80e5de778e68
2019-12-03 21:29:44 89064097 OK 75750000 85.05%; 1212 us/sq; ETA 0d 04:29; d83c92710ddb50e8 (check 0.91s) 1 errors
[/CODE]It appears the majority of PRP3 GEC errors on my Radeon VII are of the 0x00 variety. I've not seen 0x01 yet. The rest are seemingly normal residue values.
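A minimal sketch of the proposed check, replaying the interim residues from the first log excerpt above (the function and variable names here are hypothetical, not gpuowl internals):

```python
# Hypothetical sketch of the proposed interim-res64 sanity check: a res64 of
# 0x0 or 0x1 mid-run is almost certainly an error, so the run could roll back
# immediately instead of waiting for the next GEC check.

SUSPICIOUS = {0x0, 0x1}

def first_suspicious(log):
    """log: list of (iteration, res64) pairs in order; return the iteration
    of the first suspicious interim residue, or None if all look normal."""
    for it, res64 in log:
        if res64 in SUSPICIOUS:
            return it
    return None

# Interim residues from the first log excerpt above (error between GEC checks).
log = [
    (62300000, 0x3efaa597c7d9c53d),
    (62350000, 0x392c10b87c906301),
    (62400000, 0x0000000000000000),  # first bad residue printed
    (62450000, 0x0000000000000000),
]
detect_at = first_suspicious(log)
gec_check_at = 62500000
print(f"detected at {detect_at}, saving {gec_check_at - detect_at} iterations")
# -> saving 100000 iterations, matching the inline annotation in the log
```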

Prime95 2019-12-09 20:25

Latest windows build (with a fix for power-of-two FFT size with MERGED_MIDDLE).

[url]https://www.dropbox.com/s/bxty3e5qz5is68d/gpuowl-win.exe?dl=0[/url]

kriesel 2019-12-09 20:27

[QUOTE=preda;532469]Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as being OpenCL 1.2 although it was compiling fine 2.0. I'm worried that adding this check would not even attempt to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.[/QUOTE]Thanks, will do. Even a less than perfectly reliable warning about why there may be trouble will help us ordinary users. I still don't know whether a certain CUDA gpu failed to do gpuowl P-1 because of OpenCL level, a bad -maxAlloc value, or something else. I have the impression I should go back and retest many for limits with some very recent version. If I recall correctly, the memory handling got better sometime after v6.9. [URL]https://www.mersenneforum.org/showpost.php?p=525696&postcount=1361[/URL]
I recently found that V6.11-9 could do P-1 on a 2GB RX550 that a 3GB GTX1060 with v6.9-0-gc137007 could not.

R. Gerbicz 2019-12-09 20:35

[QUOTE=kriesel;532471]
Come to think of it, that res64 check could save some lost PRP time too when errors occur. Less incentive there though, since the excellent GEC safety net catches the errors eventually.
[[/QUOTE]
Of course if res64=0 then you need to check the full residue to see if it is really true that res=0. For much larger p>2^64 you could see (multiple) interim res64=0 during a PRP test.
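In other words, the logged res64 is only the low 64 bits of the full residue, so it can vanish while the residue itself does not. A tiny illustration:

```python
# res64 is the low 64 bits of the full residue (which lives mod 2^p - 1).
# It can be zero while the full residue is nonzero, so a zero res64 alone
# doesn't prove the residue itself is zero.

def res64(full_residue: int) -> int:
    return full_residue & ((1 << 64) - 1)

r = 1 << 64           # nonzero full residue whose low 64 bits are all zero
print(hex(res64(r)))  # -> 0x0, yet r != 0
```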

Prime95 2019-12-09 21:26

Notes on the new MERGED_MIDDLE code. There are many implementations buried in the code. The fastest implementation depends on the memory bus width and bandwidth and GPU architecture and maybe the cache architecture.

The benefits of MERGED_MIDDLE really kick in for FFTs with a WIDTH >= 256 and SMALL_HEIGHT >= 256.

To find the best implementation for your GPU, benchmark using each of these options:
WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5.

Once you've determined the best implementations you can add the best WORKINGIN and WORKINGOUT options to your production config.txt file.

The default is WORKINGIN3 and WORKINGOUT3.

If we can obtain some consistent data, we can select different default values for non-AMD GPUs. So let us know your GPU and your timings. Thanks.
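The procedure above can be scripted much like kriesel's batch file earlier in the thread; here is a sketch that just generates the benchmark command lines (the gpuowl-win invocation shape and the NO_ASM option for Nvidia cards are taken from that batch file):

```python
# Generate one timing command per WORKINGIN variant, then one per WORKINGOUT
# variant, per George's procedure. Option names are from his post.

IN_OPTS = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
           "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
OUT_OPTS = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
            "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]

def commands(iters=10000):
    for opt in IN_OPTS + OUT_OPTS:
        yield f"gpuowl-win -time -iters {iters} -use NO_ASM,MERGED_MIDDLE,{opt}"

cmds = list(commands())
print(len(cmds))   # 15 runs: 7 WORKINGIN + 8 WORKINGOUT
print(cmds[0])
```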

kriesel 2019-12-09 21:32

[QUOTE=R. Gerbicz;532482]Ofcourse if res64=0 then you need to check the full residue to see if it is really true that res=0. For much larger p>2^64 you could see (multiple) interim res64=0 during a prp test.[/QUOTE]
If res64 == 0x00 then if full res == 0 then panic, retreat
Yes there's a very slight chance that a res64 zero is correct for a nonzero full residue. One place it shows up is in penultimate residues.

It's also true that eventually we will reach a point where early residues will correctly have values that are currently treated as errors. This occurs within the 2[SUP]32[/SUP] capability of Mlucas. (See the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=515172&postcount=9[/URL]) Before the probability of zero or 2 res64 becomes high, the project is likely to switch to a longer residue for such checks, say res128.

With all due respect, Dr. Gerbicz, none of us will need to worry about residues for p>2[SUP]64[/SUP], or likely 2[SUP]48[/SUP]. TF is feasible with the right software up to a point, but P-1 or primality testing exponents of order 2[SUP]64[/SUP] is quite out of reach and will be for more than my lifetime and many others'. In GIMPS we're dealing with p<2[SUP]32[/SUP] and generally <2[SUP]30[/SUP] (the mersenne.org exponent limit for PRP, LL, or P-1 results acceptance is 10[SUP]9[/SUP]), with most current activity other than my limits testing or the 100Mdigit attempts occurring at the wavefront <2[SUP]26.6[/SUP].
A single 2[SUP]30[/SUP] exponent PRP takes several months on the fastest available gpus. P-1 factoring to the feasible limits imposed by memory and software takes weeks on most if not all hardware. Primality-test and P-1 run time scale roughly as p[SUP]2.1[/SUP]: p~2[SUP]32[/SUP] takes years, p~2[SUP]33[/SUP] decades (longer than hardware lifetime), and would require fft lengths longer than available in gpuowl or CUDALucas.
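The cited p[SUP]2.1[/SUP] scaling implies each doubling of the exponent multiplies run time by about 2[SUP]2.1[/SUP] ≈ 4.3, which is how months become years become decades. A sketch of that arithmetic:

```python
# Rough p^2.1 runtime scaling for primality testing / P-1, as cited above.
# Each doubling of the exponent costs about 2^2.1 ~ 4.3x more time.

def relative_cost(p_new, p_ref, exponent=2.1):
    """Runtime of testing p_new relative to p_ref under the p^2.1 model."""
    return (p_new / p_ref) ** exponent

print(round(relative_cost(2**33, 2**32), 2))   # ~4.29x per doubling
# From the wavefront (~2^26.6) up to the 10^9 acceptance limit:
print(round(relative_cost(10**9, 2**26.6), 1))
```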

kriesel 2019-12-09 21:45

Compiling Gpuowl
 
Compiling Gpuowl [URL="https://www.mersenneforum.org/showpost.php?p=532454&postcount=21"]https://www.mersenneforum.org/showpo...4&postcount=21[/URL]
added to reference content. Probably has errors or omissions. I'll fix them as they are identified.

xx005fs 2019-12-10 03:20

Getting this error when trying to use -nospin as argument:
[CODE]2019-12-09 19:19:27 gpuowl v6.11-71-g7e02b07
2019-12-09 19:19:27 Argument '-nospin' '' not understood
2019-12-09 19:19:27 Exiting because "args"
2019-12-09 19:19:27 Bye[/CODE]

Also, -yield doesn't seem to reduce CPU usage anymore on Windows.

kriesel 2019-12-10 04:40

[QUOTE=preda;532469]Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as being OpenCL 1.2 although it was compiling fine 2.0. I'm worried that adding this check would not even attempt to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.[/QUOTE]
Quite a shower of warnings, but it did build.[CODE]$ make gpuowl-win.exe
cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-79-g0c139c4"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp
In file included from checkpoint.h:5,
from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp
g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp
g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp
g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp
g++ -MT AllocTrac.o -MMD -MP -MF .d/AllocTrac.Td -Wall -O2 -std=c++17 -c -o AllocTrac.o AllocTrac.cpp
g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp
g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static
strip gpuowl-win.exe
[/CODE]It launched ok on Win7 on an AMD RX550 and is running some comparative timings now.
Following is a test of the OpenCL version check on a Quadro 2000, which indicates 1.1/1.2 in gpu-z.
[CODE]c:\Users\Ken\Documents\gpuowl\v6.11-79-g0c139c4>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 gpuowl v6.11-79-g0c139c4
2019-12-09 22:33:39 config.txt: -device 1 -user kriesel -cpu condorette/q2000
2019-12-09 22:33:39 condorette/q2000 config: -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 condorette/q2000 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-09 22:33:40 condorette/q2000 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.b
ce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-09 22:33:40 condorette/q2000 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-
3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-09 22:33:40 condorette/q2000 <kernel>:13:9: warning: GpuOwl requires OpenCL 200, found 110
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
^
<kernel>:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
^
<kernel>:2777:66: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
^
<kernel>:2786:66: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
^
<kernel>:2845:12: warning: implicit declaration of function 'atomic_load' is invalid in C99
while(!atomic_load((atomic_uint *) &ready[gr - 1]));
^
<kernel>:2845:25: error: use of undeclared identifier 'atomic_uint'
while(!atomic_load((atomic_uint *) &ready[gr - 1]));
^
<kernel>:2845:38: error: expected expression
while(!atomic_load((atomic_uint *) &ready[gr - 1]));
^
<kernel>:2846:5: warning: implicit declaration of function 'atomic_store' is invalid in C99
atomic_store((atomic_uint *) &ready[gr - 1], 0);
^
<kernel>:2846:19: error: use of undeclared identifier 'atomic_uint'
atomic_store((atomic_uint *) &ready[gr - 1], 0);
^
<kernel>:2846:32: error: expected expression
atomic_store((atomic_uint *) &ready[gr - 1], 0);
^
<kernel>:2919:25: error: use of undeclared identifier 'atomic_uint'
while(!atomic_load((atomic_uint *) &ready[gr - 1]));
^
<kernel>:2919:38: error: expected expression
while(!atomic_load((atomic_uint *) &ready[gr - 1]));
^
<kernel>:2920:19: error: use of undeclared identifier 'atomic_uint'
atomic_store((atomic_uint *) &ready[gr - 1], 0);
^
<kernel>:2920:32: error: expected expression
2019-12-09 22:33:40 condorette/q2000 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-09 22:33:40 condorette/q2000 Bye[/CODE]

nomead 2019-12-10 05:40

1 Attachment(s)
[QUOTE=Prime95;532488]To find the best implementation for your GPU. Benchmark using each of these options:
WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5[/QUOTE]
Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.

RTX 2080, clock pinned to 1920 MHz, Linux. Command line options [C]-yield -log 10000 -prp 89796247 -fft +2 -iters 50000 -use NO_ASM,MERGED_MIDDLE[/C] except for those two baseline timings (3807 and 3808 µs) that were run without MERGED_MIDDLE. And then one IN and one OUT setting chosen. For whatever reason, the differences were really small on this card. 0.35% between the highest and lowest value, and if that one outlier (IN1A and OUT1A chosen) is taken out, the rest are within 0.19%.

None of the WORKINGOUT0 tests would run; this error occurred:
[C]2019-12-10 04:14:53 Exception gpu_error: OUT_OF_RESOURCES carryA at clwrap.cpp:304 run[/C]

The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message.

kriesel 2019-12-10 06:36

[QUOTE=nomead;532530]The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message.[/QUOTE]Those values are suspiciously similar, closer than my repeatability runs.
Lots of differences of course; gpu, OS, pinning clock, exponent.
Try some repeatability runs.

Also, I think it's a rectangular array with one more row and column than you allowed for.
George gave a list of ins and a list of outs, but there's also the null entry for in (the baseline in) and for out (the baseline out).
And it appears from my recent test that the minimum in and the minimum out don't necessarily give the minimum when combined.

nomead 2019-12-10 07:18

[QUOTE=kriesel;532533] Try some repeatability runs.[/QUOTE]
Already did, but of course not enough (five runs each of "without merge", IN1A+OUT1A, and IN3+OUT5). At that time the results varied by at most 2 µs from run to run. The advantage of benchmarking on Linux is that the results are more predictable: it's less likely that the OS starts indexing, installing updates, or scanning for viruses in the background.
[QUOTE=kriesel;532533]Also, I think it's a rectangular array with one more row and column than you allowed for.
George gave a list of ins and a list of outs, but there's also the null entry for in (baseline in) and for out (baseline out).
And it appears from my recent test that minimum in, and minimum out, don't necessarily mean even better in combination.[/QUOTE]
As George said in his message, the default is IN3 and OUT3, so those are chosen anyway if nothing else is specified. And yeah, that is exactly the reason why I benchmarked the whole array of combinations: to see whether that way of searching for the optimum spot (first test the IN values, then, using the optimum IN value, search through the OUT values) really works. In my case it does, but then there are so many "correct" spots to land on that it makes it easier than it should be.
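That two-pass search can be sketched like this. The timing grid below is invented purely for illustration, not my measured data:

```python
# Two-pass (greedy) search: sweep IN with the default OUT, then sweep
# OUT using the best IN found. The timings dict is made-up example data.
timings = {
    ("IN1A", "OUT1A"): 3693,
    ("IN1A", "OUT3"):  3690, ("IN1A", "OUT5"): 3689,
    ("IN3",  "OUT1A"): 3685,
    ("IN3",  "OUT3"):  3682, ("IN3",  "OUT5"): 3680,
}
ins  = ["IN1A", "IN3"]
outs = ["OUT1A", "OUT3", "OUT5"]
DEFAULT_OUT = "OUT3"  # George's stated default

best_in  = min(ins,  key=lambda i: timings[(i, DEFAULT_OUT)])
best_out = min(outs, key=lambda o: timings[(best_in, o)])

# Compare against an exhaustive search over the whole grid.
full_best = min(timings, key=timings.get)
print((best_in, best_out), full_best)
```

On a grid this well-behaved the greedy search lands on the true minimum; if IN and OUT interacted badly it could miss it, which is why sweeping the full array is the safe check.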

kriesel 2019-12-10 10:04

[QUOTE=nomead;532534]Already did, but of course not enough (five each of "without merge", IN1A+OUT1A, and IN3+OUT5). At least that time, the results varied max. 2µs from run to run.[/QUOTE]For a rock steady constant signal, wouldn't there be +-1 lsb of digitization noise, in this case +-1us?
I guess George's post means that if there's MERGE_MIDDLE, the default is 3's; else baseline NO_ASM only, middle is not merged so prior code, no in or out, so no 3's.

(Who leaves indexing and autoupdates turned on?)

nomead 2019-12-10 11:23

[QUOTE=kriesel;532537]For a rock-steady constant signal, wouldn't there be ±1 LSB of digitization noise, in this case ±1 µs?
I guess George's post means that with MERGED_MIDDLE the default is the 3's; otherwise it's the baseline, NO_ASM only, the middle is not merged (prior code), so no IN or OUT and no 3's. [/QUOTE]
Yeah, well, whatever the explanation, I now reran those repeatability runs: 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A, and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs), but the rest were 3807 or 3808 µs, with the average being 3807.4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore. 1A+1A was 3689 or 3690 µs, average 3689.8 µs. 3+5 was 3680 µs every single time.
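The quoted baseline average is easy to double-check in Python. The exact split of the 19 non-outlier readings between 3807 and 3808 µs is an assumption here (I only kept the summary); an 8/11 split reproduces the mean:

```python
from statistics import mean

# One 3804 us cold outlier, then an assumed 8/11 split of 3807/3808 us
# readings; this split reproduces the quoted 3807.4 us average.
baseline = [3804] + [3807] * 8 + [3808] * 11

assert len(baseline) == 20
print(round(mean(baseline), 1))  # 3807.4
```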

Don't get me started on quantization noise...:cmd:

I'm used to getting reliable and repeatable results when timing other programs, mostly mfaktc, but I have to admit these are exceptionally steady, about one significant digit more stable than I'm used to. Maybe I should start doubting the method and use some sort of external timer as well, instead of blindly trusting the program's internal timer. But that's way too much effort to sink into a quick test like this.
[QUOTE=kriesel;532537] (Who leaves indexing and autoupdates turned on?)[/QUOTE]
Not by my own choice, of course, but the Win10 box I have at work has autoupdates forced on by group policy (corporate IT). Not sure about search indexing, though. And likewise the antivirus software (F-Secure) is forced always on. I still manage to run prime95 on it, but there the iteration timings are anything but stable.

kriesel 2019-12-10 14:04

[QUOTE=nomead;532541]Yeah, well, whatever the explanation, I now reran those repeatability runs. 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807.4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore.

...
But that's way too much effort to sink into a quick test like this.

Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT).[/QUOTE]
I always throw away the first one. It's there just to bring the program and hardware up to steady state, so the remaining runs show the sustainable throughput. Not only does the first run have the advantage of cool hardware and a higher clock rate where the clock is not pinned, it likely also has the advantage of memory that is already free and ready to allocate, and other effects I can't think of right now.
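In code, the discard-the-warmup policy is just this (the sample numbers below are illustrative only, not measurements):

```python
from statistics import mean

def steady_state_mean(samples, warmup=1):
    """Drop the first `warmup` (cold-start) measurements and average
    the rest, i.e. the throw-away-the-first-one policy."""
    if len(samples) <= warmup:
        raise ValueError("need more samples than warmup runs")
    return mean(samples[warmup:])

# Illustrative numbers: one fast cold first run, then steady state.
runs = [3804, 3807, 3808, 3807, 3808]
print(steady_state_mean(runs))  # averages only the last four runs
```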
I think you left "quick test" territory a while ago. Wow, that's thorough. I'm generally running single 10,000-iteration timings.

Re corporate, condolences. Scheduled virus scans and backups may be dodged, but not all aspects.

