mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-07-15 14:27

[QUOTE=preda;491835]I added a factor-9 step, and now there's a larger selection of FFT sizes:
[CODE]
   FFT   maxExp     W     H  M
  0.5M    10.3M   512   512  1
  1.0M    20.3M  1024   512  1
  2.0M    39.8M  2048   512  1
  2.0M    39.8M   512  2048  1
  2.5M    49.4M   512   512  5
  4.0M    78.0M  1024  2048  1
  4.0M    78.0M  4096   512  1
  4.5M    87.5M   512   512  9
  5.0M    96.9M  1024   512  5
  8.0M   153.0M  2048  2048  1
  9.0M   171.6M  1024   512  9
 10.0M   190.0M   512  2048  5
 10.0M   190.0M  2048   512  5
 16.0M   300.0M  4096  2048  1
 18.0M   336.3M  2048   512  9
 18.0M   336.3M   512  2048  9
 20.0M   372.5M  4096   512  5
 20.0M   372.5M  1024  2048  5
 36.0M   659.0M  1024  2048  9
 36.0M   659.0M  4096   512  9
 40.0M   730.0M  2048  2048  5
 72.0M  1290.9M  2048  2048  9
 80.0M  1429.8M  4096  2048  5
144.0M  2527.5M  4096  2048  9
[/CODE]
Now it's a bit easier to validate openowl on small known primes (e.g. M(1398269) in 6 minutes). For fun, it can also do things like 1Billion exponents in 39ms/it.

(As I have not tested every FFT size precisely, bugs may be hiding around.)[/QUOTE]
Wow.

I don't see the POT lengths 32M, 64M, or 128M there. Presumably something like
4096 4096 1
8192 4096 1
8192 8192 1
respectively.
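
A quick check of that guess, as a minimal sketch. In the quoted table the FFT column appears to equal 2 * W * H * M words, with the "M" suffix in that column read as 2^20. That relation is inferred from the listed rows, not stated by preda, and `fft_size_m` is a hypothetical helper name.

```python
# Assumption inferred from the quoted table: FFT size (in words) = 2 * W * H * M.
# This is not a statement about gpuOwL's source code.
MEBI = 1 << 20  # the "M" in the FFT column, read as 2**20


def fft_size_m(w: int, h: int, m: int) -> float:
    """Return the FFT size in units of 2**20 words for a W x H x M configuration."""
    return 2 * w * h * m / MEBI


# A few rows from the table, as (W, H, M) -> listed FFT size.
listed = {
    (512, 512, 1): 0.5,
    (512, 512, 5): 2.5,
    (1024, 512, 9): 9.0,
    (4096, 2048, 9): 144.0,
}
for (w, h, m), size in listed.items():
    assert fft_size_m(w, h, m) == size

# The power-of-two lengths missing from the table, with the guessed dimensions.
for w, h, m in [(4096, 4096, 1), (8192, 4096, 1), (8192, 8192, 1)]:
    print(f"{w} {h} {m} -> {fft_size_m(w, h, m):g}M")
```

Under that reading, the three guessed configurations come out at 32M, 64M and 128M.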
A ~1 billion exponent at 39ms/it is ~451 days or ~15 months to completion, without errors requiring repetition of GEC blocks. (On an RX Vega 64 presumably.)
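
The arithmetic behind that estimate, as a small sketch. It assumes a test of M(p) needs about p iterations and that no GEC block has to be repeated; `eta_days` is a hypothetical helper name.

```python
def eta_days(iterations: int, ms_per_it: float) -> float:
    """Wall-clock days for `iterations` iterations at `ms_per_it`, with no retries."""
    return iterations * ms_per_it / 1000 / 86_400


# Roughly one iteration per bit of the exponent, so p ~ 1e9 at 39 ms/it:
days = eta_days(1_000_000_000, 39.0)
print(f"{days:.0f} days, about {days / 30.44:.1f} months")
```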

What version do you call this advance?

preda 2018-07-15 15:25

[QUOTE=kriesel;491854]Wow.

I don't see the POT lengths 32M, 64M, or 128M there. Presumably something like
4096 4096 1
8192 4096 1
8192 8192 1
respectively.


A ~1 billion exponent at 39ms/it is ~451 days or ~15 months to completion, without errors requiring repetition of GEC blocks. (On an RX Vega 64 presumably.)

What version do you call this advance?[/QUOTE]
The "column" (H) step of the matrix FFT now does only 512 or 2048, that's why those sizes are missing.

I agree that doing 1billion exponents is likely not a good idea.

I'll probably bump the version to 3.4, I just want to do a bit more tuning/validation before that.

SELROC 2018-07-15 16:15

[QUOTE=preda;491858]The "column" (H) step of the matrix FFT now does only 512 or 2048, that's why those sizes are missing.

I agree that doing 1billion exponents is likely not a good idea.

I'll probably bump the version to 3.4, I just want to do a bit more tuning/validation before that.[/QUOTE]

The argument "-list fft" shows an empty list and says Bye.

SELROC 2018-07-15 18:44

[QUOTE=SELROC;491861]The argument "-list fft" shows an empty list and says Bye.[/QUOTE]

Yes, but only if the worktodo.txt file is empty; otherwise it lists the FFTs and starts the computation!

preda 2018-07-16 12:23

I re-enabled the display of devices with "-h" (OpenCL only), and re-enabled the kernel profiling with "-time" (OpenCL only).

In the list of FFT sizes, there are in places multiple successive lines with the same size. By default the app selects the first line for a given size. The others can be selected easily with "-fft +1", "-fft +2", etc. There are small performance differences between them, so the user can investigate and choose the fastest. There is no "auto-tuning" yet (where the program automatically times and selects the fastest).
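
The selection rule described here can be modelled in a few lines. This is only a sketch of the described behaviour, not gpuOwL's code: the rows are copied from the FFT table earlier in the thread, `pick_config` is a hypothetical name, and it assumes that the default is the first row whose maxExp covers the exponent and that "+N" picks the N-th further row of the same size.

```python
# Rows copied from the FFT table earlier in the thread:
# (FFT size label, maxExp in millions, W, H, M).
ROWS = [
    ("16.0M", 300.0, 4096, 2048, 1),
    ("18.0M", 336.3, 2048, 512, 9),
    ("18.0M", 336.3, 512, 2048, 9),
    ("20.0M", 372.5, 4096, 512, 5),
    ("20.0M", 372.5, 1024, 2048, 5),
]


def pick_config(exponent: int, offset: int = 0):
    """Hypothetical model of the default choice plus a "-fft +offset" override."""
    # Default: first row whose maxExp covers the exponent.
    default = next(i for i, row in enumerate(ROWS) if exponent <= row[1] * 1e6)
    # Variants: the rows from the default onward that share its FFT size.
    variants = [row for row in ROWS[default:] if row[0] == ROWS[default][0]]
    if not 0 <= offset < len(variants):
        raise ValueError(f"only {len(variants)} variant(s) for {ROWS[default][0]}")
    return variants[offset]


print(pick_config(310_000_000))     # default 18M variant
print(pick_config(310_000_000, 1))  # what "-fft +1" would select under this model
```

Since there is no auto-tuning yet, timing each variant for a short run and keeping the fastest remains a manual step.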

SELROC 2018-07-16 16:21

[QUOTE=preda;491904]I re-enabled the display of devices with "-h" (OpenCL only), and re-enabled the kernel profiling with "-time" (OpenCL only).

In the list of FFT sizes, there are in places multiple successive lines with the same size. By default the app selects the first line for a given size. The others can be selected easily with "-fft +1", "-fft +2", etc. There are small performance differences between them, so the user can investigate and choose the fastest. There is no "auto-tuning" yet (where the program automatically times and selects the fastest).[/QUOTE]

Testing a 300M exponent with version 3.4, the selected FFT size is 18M. Now I am using -fft +1 (also an 18M FFT), and the timing went from 18 ms/it to 16 ms/it. The ETA went from 62d to 56d.

preda 2018-07-17 00:20

Pushing the GPU fan up a bit:
[CODE]
amdgpu-pci-6700
Adapter: PCI adapter
vddgfx: +1.06 V
fan1: 3276 RPM
temp1: +70.0°C (crit = +89.0°C, hyst = -273.1°C)
power1: 206.00 W (cap = 220.00 W)
[/CODE]


I get just under 9ms/it for "100M digits" exponents; Vega64, 205W, temperature 70C.
[CODE]
vega0 16570000/332193109 [ 4.99%], 8.95 ms/it [8.94, 8.95]; ETA 32d 16:31; f6b94760b829ddec
[/CODE]
This is with the amdgpu-pro 18.20 driver. There is hope that ROCm may be a bit better still (when I can install it).
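
For reference, the ETA in such a progress line can be roughly reproduced from its other fields. This is a sketch under assumptions: the field meanings (iterations done/total, ms per iteration) are read off the line itself, and the regular expression and `estimate_eta` are illustrative, not part of gpuOwL.

```python
import re

LINE = "vega0 16570000/332193109 [ 4.99%], 8.95 ms/it [8.94, 8.95]; ETA 32d 16:31; f6b94760b829ddec"


def estimate_eta(line: str) -> tuple[int, int, int]:
    """Return (days, hours, minutes) remaining: iterations left times ms/it."""
    match = re.search(r"(\d+)/(\d+) \[.*?\], ([\d.]+) ms/it", line)
    if match is None:
        raise ValueError("unrecognised progress line")
    done, total, ms_per_it = int(match[1]), int(match[2]), float(match[3])
    seconds = int((total - done) * ms_per_it / 1000)
    days, rest = divmod(seconds, 86_400)
    hours, rest = divmod(rest, 3_600)
    return days, hours, rest // 60


days, hours, minutes = estimate_eta(LINE)
print(f"ETA about {days}d {hours:02d}:{minutes:02d}")
```

The estimate lands within minutes of the logged ETA; the small gap presumably comes from the ms/it value being rounded to two decimals in the log.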

SELROC 2018-07-17 06:46

[QUOTE=SELROC;491917]Testing a 300M exponent with version 3.4, the selected FFT size is 18M. Now I am using -fft +1 (also an 18M FFT), and the timing went from 18 ms/it to 16 ms/it. The ETA went from 62d to 56d.[/QUOTE]

It is actually possible to use -fft 16M with a 300M exponent; the timing goes down to 14 ms/it ...

preda 2018-07-17 09:28

[QUOTE=SELROC;491968]It is actually possible to use -fft 16M with a 300M exponent; the timing goes down to 14 ms/it ...[/QUOTE]
Yes, but you're in the danger zone at that bits-per-word level. You're likely to encounter numerical errors, that will trigger retries. That works fine, only that it costs some time. In such a situation it may be worth starting the exponent with a lower "block size" than the default of 400, with "-block 100" or "-block 200".
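
To put numbers on "bits per word", here is a minimal sketch. It assumes that the "M" in the FFT column of the table earlier in the thread means 2^20 words, that the "M" in maxExp means 10^6, and that the exponent is exactly 300,000,000 (the thread only says "300M"); `bits_per_word` is a hypothetical helper.

```python
MEBI = 1 << 20  # FFT sizes in the thread's table, read as multiples of 2**20 words


def bits_per_word(exponent: int, fft_m: float) -> float:
    """Average number of bits of the exponent carried by each FFT word."""
    return exponent / (fft_m * MEBI)


# (FFT size in M, maxExp from the table) for the two sizes under discussion.
for fft_m, max_exp in [(16.0, 300.0e6), (18.0, 336.3e6)]:
    used = bits_per_word(300_000_000, fft_m)
    limit = bits_per_word(int(max_exp), fft_m)
    print(f"{fft_m:g}M FFT: {used:.2f} bits/word (table limit about {limit:.2f})")
```

Under these assumptions a 300M exponent at 16M sits right at the table's limit, while 18M leaves roughly two bits per word of headroom.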

SELROC 2018-07-17 09:42

[QUOTE=preda;491979]Yes, but you're in the danger zone at that bits-per-word level. You're likely to encounter numerical errors, that will trigger retries. That works fine, only that it costs some time. In such a situation it may be worth starting the exponent with a lower "block size" than the default of 400, with "-block 100" or "-block 200".[/QUOTE]


Fine tuning is not an easy task. I have to investigate when time permits :-)

SELROC 2018-07-17 18:26

[QUOTE=SELROC;491981]Fine tuning is not an easy task. I have to investigate when time permits :-)[/QUOTE]

OK

[QUOTE=preda;491979]Yes, but you're in the danger zone at that bits-per-word level. You're likely to encounter numerical errors, that will trigger retries. That works fine, only that it costs some time. In such a situation it may be worth starting the exponent with a lower "block size" than the default of 400, with "-block 100" or "-block 200".[/QUOTE]

Using the master version:

1) I tried -block 200, but the program seems to stick stubbornly to blockSize 400.

2) I noticed that for the 300M exponent the best FFT is 18M, even though it is slower. With 16M the ETA kept moving backwards by a considerable amount (hours), which I suppose means that a lot of retries were being done. No error was reported, though (I mean no EE), only the timing varied considerably.


All times are UTC.