mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-11-03 13:13

[QUOTE=SELROC;499455]The quick tests on prime 859433.[/QUOTE]
In 3.5 run
[CODE]OK 2018-11-03 13:10:46 RX580 859600/859433 [99.95%], 0.53 ms/it [0.52, 0.61]; ETA 0d 00:00; f69dbe1c12d7020b (check 0.21s)
[/CODE]why 99.95%, not 100, or 859600/859433=100.02%?

in 5.0 run
[CODE]2018-11-03 13:38:10 RX580 859433 850000 98.88%; 0.53 ms/sq, 0 MULs; ETA 0d 00:00; 98954ce9f15bdc12[/CODE]
why 98.88%, not 850000/859433 = 98.90%?

SELROC 2018-11-03 13:39

2 Attachment(s)
[QUOTE=kriesel;499456]In 3.5 run
[CODE]OK 2018-11-03 13:10:46 RX580 859600/859433 [99.95%], 0.53 ms/it [0.52, 0.61]; ETA 0d 00:00; f69dbe1c12d7020b (check 0.21s)
[/CODE]why 99.95%, not 100, or 859600/859433=100.02%?

in 5.0 run
[CODE]2018-11-03 13:38:10 RX580 859433 850000 98.88%; 0.53 ms/sq, 0 MULs; ETA 0d 00:00; 98954ce9f15bdc12[/CODE]why 98.88%, not 850000/859433 = 98.90%?[/QUOTE]

I don't know, I guess not every iteration is logged.

Now I am redoing the same tests with 89000001; the performance appears to be the same, ±0.01 ms.

tServo 2018-11-03 14:04

[QUOTE=kriesel;499410]

Case in point: 87m P-1 on GTX1060, ~5 hours, PRP in V3.8, 3d 22 hours; so combined time is 4d 3h. Compare to 4d 12h for gpuowl V5, 9 hours slower.
Again: 171m P-1 on GTX1060, ~20 hours, PRP in V3.8, 14d 21h, combined 15d 17 h; Estimated V5.0 PRP time 15d 16h. If PRP-1 is within ~1 hour of PRP, V5 wins.[/QUOTE]

Are you running v5.0 on a GTX 1060?
If so, how?
Whenever I try, it always ends with error -5 (carryA) and an assertion failure in clwrap.cpp.

kriesel 2018-11-03 15:28

[QUOTE=tServo;499461]Are you running v5.0 on a GTX 1060?
If so, how?
Whenever I try, it always ends with error -5 (carryA) and an assertion failure in clwrap.cpp.[/QUOTE]
No; CUDAPm1 v0.20 mostly the Sept 2013 CUDA5.5 build. (Thought that was clear in context of mentioning CUDAPm1 in the previous paragraph of [url]https://www.mersenneforum.org/showpost.php?p=499410&postcount=834[/url], but I guess not.)

kriesel 2018-11-03 15:35

89m timings vs gpuowl version on RX480
 
I ignore the first 10,000 iterations when looking at timings.
At the beginning of an exponent, the residue is very short.
At the beginning of a run the GPU is cool; letting 10,000 iterations run warms it up to the point where thermal throttling has reached or neared its asymptote.
Timings after that are iteration-count-weighted averages in cases like v2.0, where the iteration count per screen output varies.
m89000167 PRP no P-1
Ver ms/it (low good, high bad)
2.0 4.857
3.3 4.463
3.5 4.330
3.6 4.331
3.8 4.326 <---min
3.9 4.473
4.3 4.531
4.6 4.486
4.7 NA
5.0 4.483 <-- 1.036 times min
5.0 -fft +1 4.907<--max
Note: these are not exactly comparable, because almost all were progressions on the same owl file and so executed different iteration-number ranges (in version order, except v3.6, which was done last, starting from zero)
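The iteration-count-weighted averaging described above can be sketched as follows. The helper name and the interval data are illustrative, not from gpuowl itself:

```python
# Sketch of an iteration-count-weighted average ms/it, as described above.
# Each entry is (iterations covered by one screen-output interval, ms/it
# reported for that interval); the data below is made up for illustration.
def weighted_ms_per_it(entries):
    total_iterations = sum(n for n, _ in entries)
    return sum(n * ms for n, ms in entries) / total_iterations

# Three log intervals with differing iteration counts, e.g. v2.0-style output:
print(weighted_ms_per_it([(10000, 4.50), (20000, 4.40), (10000, 4.30)]))
```

Weighting by iteration count keeps a short, fast (or slow) interval from skewing the average the way a plain mean of the per-interval ms/it figures would.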

preda 2018-11-03 15:52

[QUOTE=kriesel;499456]In 3.5 run
[CODE]OK 2018-11-03 13:10:46 RX580 859600/859433 [99.95%], 0.53 ms/it [0.52, 0.61]; ETA 0d 00:00; f69dbe1c12d7020b (check 0.21s)
[/CODE]why 99.95%, not 100, or 859600/859433=100.02%?
[/QUOTE]

At that point it was rounding the exponent up to the next multiple of 1000: 859600 / 860000 = 99.95%.

[QUOTE]
in 5.0 run
[CODE]2018-11-03 13:38:10 RX580 859433 850000 98.88%; 0.53 ms/sq, 0 MULs; ETA 0d 00:00; 98954ce9f15bdc12[/CODE]
why 98.88%, not 850000/859433 = 98.90%?[/QUOTE]

Now it's rounding the exponent up to the next multiple of blockSize (400 by default), which is the correct number of total iterations: 850000 / 859600 = 98.88%.
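The two rounding rules can be reproduced with a small sketch. The helper names are illustrative, not gpuowl's own; only the rounding behavior is as described above:

```python
import math

# Sketch of the two progress displays discussed above.
def progress_v3_5(iteration, exponent):
    # v3.5: total iterations = exponent rounded up to a multiple of 1000
    total = math.ceil(exponent / 1000) * 1000
    return 100.0 * iteration / total

def progress_v5_0(iteration, exponent, block_size=400):
    # v5.0: total iterations = exponent rounded up to a multiple of blockSize
    total = math.ceil(exponent / block_size) * block_size
    return 100.0 * iteration / total

print(round(progress_v3_5(859600, 859433), 2))  # 99.95
print(round(progress_v5_0(850000, 859433), 2))  # 98.88
```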

preda 2018-11-03 16:19

[QUOTE=kriesel;499469]I ignore the first 10,000 iterations when looking at timings.
At the beginning of an exponent, the residue is very short.
At the beginning of a run the GPU is cool; letting 10,000 iterations run warms it up to the point where thermal throttling has reached or neared its asymptote.
Timings after that are iteration-count-weighted averages in cases like v2.0, where the iteration count per screen output varies.
m89000167 PRP no P-1
Ver ms/it (low good, high bad)
2.0 4.857
3.3 4.463
3.5 4.330
3.6 4.331
3.8 4.326 <---min
3.9 4.473
4.3 4.531
4.6 4.486
4.7 NA
5.0 4.483 <-- 1.036 times min
5.0 -fft +1 4.907<--max
Note: these are not exactly comparable, because almost all were progressions on the same owl file and so executed different iteration-number ranges (in version order, except v3.6, which was done last, starting from zero)[/QUOTE]

Thank you! Here is my interpretation of what happened:

At some point (between 3.8 and 3.9) I made an insignificant modification to the OpenCL code, but that modification disturbed the delicate balance required to get reasonable compiled code out of AMD's OpenCL compiler (such as the compiler in amdgpu-pro or in Adrenalin 18.2.3). The ROCm compiler is a bit better and was not affected. By chance I was running on ROCm at the time, so I didn't realize there was a perf impact on the other drivers.

On the other hand, the change itself is in no way reasonably linked to a performance impact; let's say it just "triggers" a compilation bug in amdgpu-pro.

As such, I personally put the blame on AMD's "legacy" OpenCL compiler. The situation seems a bit improved on ROCm (though such non-optimal compilation problems are still present there).

SELROC 2018-11-03 17:31

[QUOTE=preda;499479]Thank you! Here is my interpretation of what happened:

At some point (between 3.8 and 3.9) I made an insignificant modification to the OpenCL code, but that modification disturbed the delicate balance required to get reasonable compiled code out of AMD's OpenCL compiler (such as the compiler in amdgpu-pro or in Adrenalin 18.2.3). The ROCm compiler is a bit better and was not affected. By chance I was running on ROCm at the time, so I didn't realize there was a perf impact on the other drivers.

On the other hand, the change itself is in no way reasonably linked to a performance impact; let's say it just "triggers" a compilation bug in amdgpu-pro.

As such, I personally put the blame on AMD's "legacy" OpenCL compiler. The situation seems a bit improved on ROCm (though such non-optimal compilation problems are still present there).[/QUOTE]


As ROCm has different requirements from amdgpu-pro (it requires PCIe Gen3 slots), I have bought a new mainboard and equipped it with two RX 580s, both in x16 slots. This has various benefits: first, it is stable, with no error from the driver or system in 5 days of work; second, the GEC is faster, by 1 or even 2 seconds for large exponents. The system configuration:
- Debian Testing
- Kernel 4.19.0-rc7
- ROCm 1.9.211

kriesel 2018-11-03 23:36

[QUOTE=kriesel;499469]I ignore the first 10,000 iterations when looking at timings.
At the beginning of an exponent, the residue is very short.
At the beginning of a run the GPU is cool; letting 10,000 iterations run warms it up to the point where thermal throttling has reached or neared its asymptote.
Timings after that are iteration-count-weighted averages in cases like v2.0, where the iteration count per screen output varies.
m89000167 PRP no P-1
Ver ms/it (low good, high bad)
2.0 4.857
3.3 4.463
3.5 4.330
3.6 4.331
3.8 4.326 <---min
3.9 4.473
4.3 4.531
4.6 4.486
4.7 NA
5.0 4.483 <-- 1.036 times min
5.0 -fft +1 4.907<--max
Note: these are not exactly comparable, because almost all were progressions on the same owl file and so executed different iteration-number ranges (in version order, except v3.6, which was done last, starting from zero)[/QUOTE]
Spot check after updating the driver to Adrenalin 18.10.2, the latest available, dated Oct 19 2018:
ver ms/it
3.8 4.348 (1.0051 x earlier driver ms/it)
5.0 4.512 (1.0065 x earlier driver ms/it)
This is why I usually postpone driver upgrades.
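The slowdown factors quoted above are just the ms/it ratios between drivers; a quick sketch using the figures reported in this thread:

```python
# Ratio of per-iteration time after vs. before the Adrenalin 18.10.2 update,
# using the ms/it figures reported above for gpuowl 3.8 and 5.0.
before = {"3.8": 4.326, "5.0": 4.483}  # earlier driver
after = {"3.8": 4.348, "5.0": 4.512}   # Adrenalin 18.10.2
for version in before:
    print(version, round(after[version] / before[version], 4))
# 3.8 1.0051
# 5.0 1.0065
```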

SELROC 2018-11-04 07:17

1 Attachment(s)
[QUOTE=SELROC;499486]As ROCm has different requirements from amdgpu-pro (it requires Gen3 pci slots), I have bought a new mainboard, and equipped it with 2 RX580. The PCI slots are the 16x slots. This has various benefits: first it is stable, in 5 days of work no error has arisen from the driver or system, second the GEC is faster, by 1 or even 2 seconds for large exponents. The system configuration:
- Debian Testing
- Kernel 4.19.0-rc7
- ROCm 1.9.211[/QUOTE]

Latest test Nov 4: prime 756839
I don't understand what the 0 MULs means; and is ms/sq a different measure?

preda 2018-11-04 11:32

[QUOTE=SELROC;499522]Latest test Nov 4: prime 756839
I don't understand what the 0 MULs means; and is ms/sq a different measure?[/QUOTE]

0 MULs means that no GCD multiplications were done. This is normal with B1=0. When B1 != 0, sometimes a number of MULs are done and sometimes none, depending on the iteration.

When MULs==0, ms/sq is the same as the old ms/it.
But when MULs are done, ms/sq tries to measure only the "squaring" time (the normal PRP iteration), excluding the time taken by the MULs. That's why I changed the name to ms/sq: to show that it's not the same as the old ms/it.

Why I thought indicating speed this way is good: this number, ms/sq, is relatively stable and does not depend on the number of MULs (or the time they take), which varies between iteration blocks. Thus it can be used to compare speed.

The other option would be to take the total time (squares + MULs) and divide it by the number of iterations in the block. That number would be larger where there are more MULs and smaller with fewer, and thus a bit harder to read GPU perf from, IMO.
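The two reporting options can be contrasted with a small sketch. The 400-iteration block and the millisecond figures are assumed for illustration, and the helper names are not gpuowl's own:

```python
# Sketch contrasting the two speed metrics discussed above.
def ms_per_sq(square_time_ms, num_squarings):
    # v5.0's ms/sq: squaring time only, excluding MUL time
    return square_time_ms / num_squarings

def ms_per_it_total(square_time_ms, mul_time_ms, num_iterations):
    # Alternative: total block time (squares + MULs) per iteration
    return (square_time_ms + mul_time_ms) / num_iterations

# A hypothetical 400-iteration block: 212 ms of squarings plus 8 ms of MULs.
print(ms_per_sq(212.0, 400))             # 0.53
print(ms_per_it_total(212.0, 8.0, 400))  # 0.55
```

With no MULs in the block, the two numbers coincide, which is why ms/sq matches the old ms/it whenever the output shows 0 MULs.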


All times are UTC. The time now is 23:10.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.