mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2019-12-20 14:26

I have no idea, sorry. Could the 2x speed-up be real? Or maybe some error situation was reached in which everything is fast? I don't know.

[QUOTE=kriesel;533210]After running steadily for days at ~153 ms/mul, stage 2 dropped to less than a third of that for the last few hours. This was v6.6-5-g667954b on an RX480 and Windows 7.

[CODE]2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
2019-12-19 04:36:41 Round 175 of 180: init 5.86 s; 45.46 ms/mul; 24506 muls
2019-12-19 04:55:12 Round 176 of 180: init 5.78 s; 45.56 ms/mul; 24254 muls
2019-12-19 05:13:44 Round 177 of 180: init 5.56 s; 45.50 ms/mul; 24307 muls
2019-12-19 05:32:13 Round 178 of 180: init 6.12 s; 45.46 ms/mul; 24268 muls
2019-12-19 05:50:51 Round 179 of 180: init 5.53 s; 45.47 ms/mul; 24476 muls
2019-12-19 06:09:29 Round 180 of 180: init 6.05 s; 45.49 ms/mul; 24441 muls
2019-12-19 06:22:22 530000039 P-1 final GCD: no factor[/CODE][/QUOTE]

kriesel 2019-12-20 21:12

1 Attachment(s)
[QUOTE=preda;533274]I have no idea, sorry. Could the 2x speed-up be real? Or maybe some error situation was reached in which everything is fast? I don't know.[/QUOTE]I believe it's the longer 153 ms/iter timings in stage 2 (nearly all of stage 2's duration) that are anomalous. Both stage 1 and stage 2 on 530M are significant deviations from the usual run time scaling, but stage 2 is much more dramatic. See the attached pdf for run times etc. and plots.

kriesel 2019-12-21 20:24

Gpuowl v6.11-99-gdd8527b for Windows
 
2 Attachment(s)
This was the current commit as of yesterday on Preda's github. I haven't run it myself yet, beyond generating the help output. See the attachments. The recent shower of build warnings persists.

kriesel 2019-12-22 13:32

Gpuowl 6.11-99 tuning on GTX 1060 3GB
 
PRP runs. No P-1 attempts made yet.[CODE]Gpuowl v6.11-99-gdd8527b
GTX1060 3GB
Windows 7 X64
exponent 90507919
fft length 5120K, PRP3

us/it -use
10265 NO_ASM
10173 NO_ASM
10247 NO_ASM,MERGED_MIDDLE,WORKINGIN
10291 NO_ASM,MERGED_MIDDLE,WORKINGIN
10323 NO_ASM,MERGED_MIDDLE,WORKINGIN1
10311 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
10254 NO_ASM,MERGED_MIDDLE,WORKINGIN2
10104 NO_ASM,MERGED_MIDDLE,WORKINGIN3
[B]10063[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN4
10102 NO_ASM,MERGED_MIDDLE,WORKINGIN5

10240 NO_ASM,MERGED_MIDDLE,WORKINGOUT
10244 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
10244 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
10219 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
10244 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
10102 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
[B]9973[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT4
10077 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

9938 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
9829 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
9838 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
9836 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
9942 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
9706 NO_ASM,MERGED_MIDDLE,WORKINGIN4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
[B]9622[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
9835 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
9731 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE

base timing 10173 us/iter
repeatability +-22/10269 = +-0.21%
best 9622 us/iter
ratio 10173/9622 = 1.0573[/CODE]
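As an aside on the methodology: the summary figures at the bottom of these tables are simple to recompute. A minimal sketch (function names are mine, not gpuowl's), using the repeated WORKINGIN pair and the base/best timings above:

```python
def repeatability(t1: float, t2: float) -> tuple[float, float]:
    """Half-spread and midpoint of two repeated timings (us/iter)."""
    return abs(t1 - t2) / 2, (t1 + t2) / 2

def speedup(base: float, best: float) -> float:
    """Ratio of the baseline timing to the best timing; >1 means faster."""
    return base / best

half, mid = repeatability(10247, 10291)   # the repeated WORKINGIN run pair
print(f"repeatability +-{half:.0f}/{mid:.0f} = +-{half / mid:.2%}")  # +-22/10269 = +-0.21%
print(f"ratio {speedup(10173, 9622):.4f}")                           # 1.0573
```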

kriesel 2019-12-22 16:29

GTX 1050 Ti gpuowl 6.11-99 tune
 
All the shuffles are beneficial.[CODE]Gpuowl V6.11-99-gdd8527b on Windows 7 X64
GTX 1050Ti timings
5M PRP, exponent 90507919
iters 20000

us/iter -use
15154 NO_ASM
15167 NO_ASM

15389 NO_ASM,MERGED_MIDDLE,WORKINGIN
15389 NO_ASM,MERGED_MIDDLE,WORKINGIN
15392 NO_ASM,MERGED_MIDDLE,WORKINGIN1
15371 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15395 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15173 NO_ASM,MERGED_MIDDLE,WORKINGIN3
[B]15084[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN4
15166 NO_ASM,MERGED_MIDDLE,WORKINGIN5

15387 NO_ASM,MERGED_MIDDLE,WORKINGOUT
15387 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
15386 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
15357 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
15389 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
15169 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
[B]15040[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT4
15166 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

14958 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
14790 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
14840 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
14808 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
14952 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE

14802 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
14684 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
[B]14520[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

repeatability 0%
base 15167
best 14520
ratio 15167/14520 = 1.0446[/CODE]

kriesel 2019-12-22 17:42

[QUOTE=kriesel;533378]All the shuffles are beneficial.[/QUOTE]Re the results on the GTX 1050 Ti, the individual T2 shuffle effects on top of WORKINGIN4 and WORKINGOUT4:
width -168, middle -118, height -150, reverse -6, sum -442: 14958 -> predicted 14516,
vs. 14520 measured with all four enabled, which is close, at 4 off,
and the comparison involves 6 measurements, so up to ~+-6 of digitization noise,
so it is within the error bars.
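That back-of-the-envelope additivity check can be written out explicitly; a small sketch of the same arithmetic (the function name is mine, the numbers are the measured deltas above):

```python
def predict_combined(base: float, deltas: dict[str, float]) -> float:
    """Predict the combined timing, assuming per-option effects add linearly."""
    return base + sum(deltas.values())

# us/iter change of each option relative to the WORKINGIN4+WORKINGOUT4 base
deltas = {
    "T2_SHUFFLE_WIDTH": -168,
    "T2_SHUFFLE_MIDDLE": -118,
    "T2_SHUFFLE_HEIGHT": -150,
    "T2_SHUFFLE_REVERSELINE": -6,
}
predicted = predict_combined(14958, deltas)
print(predicted, 14520 - predicted)  # 14516, measured is 4 us/iter off
```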

kriesel 2019-12-22 21:48

little gain for RX480 tune on gpuowl v6.11-99-gdd8527b
 
Almost no gain on RX480[CODE]Gpuowl version and commit V6.11-99-gdd8527b
GPU model RX480
GPU clock free running ~
Host OS Windows 7 Pro x64
Notes

Exponent timed 90507919
Computation type (PRP, P-1 stage 1, P-1 stage 2): PRP
FFT length FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word (copy/paste from console or log)
config file entries -time -iters 10000 -device 0 -user kriesel -cpu condorella/rx480

varying tuning -use options, in chronological order
3567 NO_ASM us/sq warmup, end user interaction, stabilize
3575 NO_ASM baseline

In benchmarking (highlight fastest time in bold)
6209 NO_ASM,MERGED_MIDDLE,WORKINGIN
6203 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
3601 NO_ASM,MERGED_MIDDLE,WORKINGIN1
3591 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
3677 NO_ASM,MERGED_MIDDLE,WORKINGIN2
3715 NO_ASM,MERGED_MIDDLE,WORKINGIN3
4134 NO_ASM,MERGED_MIDDLE,WORKINGIN4
[B]3579[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
6086 NO_ASM,MERGED_MIDDLE,WORKINGOUT
4797 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
3598 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
3589 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
3941 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
[B]3573[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT3
3646 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
3690 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

baseline WORKINGIN4, WORKINGOUT4 combination:
4227 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4

Shuffle/reverse options:
4239 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
4209 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
4213 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
4225 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
4209 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
4195 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE
4199 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

3572 NO_ASM
3579 NO_ASM
[B]3566 [/B]NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE

repeatability +-4/3571 = +-0.11%
base 3571
best 3566
ratio 1.0014[/CODE]I wonder, does the optimal -use option set change much versus fft length for a given gpu? Is there any good reason to believe it does, or that it doesn't?

paulunderwood 2019-12-24 09:00

My card threw another Gerbicz error, this time after a couple of error-free tests. I relaxed the undervolt and bumped up the setsclk to 5 and the fan is at 160:

[CODE]sh pp.sh
1 1160 840 1050 5
[/CODE]

[CODE]amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.97 V
fan1: 2955 RPM (min = 0 RPM, max = 3850 RPM)
edge: +72.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +96.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +77.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 241.00 W (cap = 250.00 W)
[/CODE]

Now 755 us/it

preda 2019-12-25 22:31

ROCm GPU unique_id
 
ROCm exposes a per-GPU unique_id, e.g.:

[CODE]
cat /sys/class/drm/card0/device/unique_id
3044212172dc768c
[/CODE]

This id is a property of the GPU itself, and does not depend on the system or PCIe slot. So moving the GPU to a different slot, or to a different system, preserves the UID.

I added a way to specify the GPU to run on by using this unique id:
./gpuowl -uid 3044212172dc768c

This can be used instead of -device (-d), which specifies the device by its position in the list of devices. The advantage is that the identity of the GPU is preserved when swapping PCIe slots.

Combining -uid with -cpu makes it possible to associate a stable symbolic name with an actual GPU.

I also added a few small python scripts (ROCm) under the tools/ directory in the source code:
- monitor.py : prints general information about all the ROCm GPUs found
- device.py : given a UID, prints the device serial id

The last script, device.py, can be used in user power-play scripts that set GPU parameters (e.g. memory frequency, undervolting, fan, etc.) to identify GPUs by UID instead of serial id, achieving correct GPU identification.

kriesel 2019-12-26 16:40

[QUOTE=preda;533571]ROCm exposes a per-GPU unique_id,[/QUOTE]I wonder how that's done, since GPUs do not have readable serial number registers. The serial number is a human-readable sticker on the device, at best. The nearest we could find for CUDALucas use on Windows was the 64-bit-only Windows UUID, which changes upon removal/reinstall of the same piece of hardware. There's also [URL]https://www.mersenneforum.org/showpost.php?p=460426&postcount=2603[/URL]
[CODE]CUDALucas v2.06beta 64-bit build, compiled May 5 2017 @ 13:00:15

binary compiled for CUDA 6.50
CUDA runtime version 6.50
CUDA driver version 8.0

------- DEVICE 0 -------
name GeForce GTX 1060 3GB
[B]UUID GPU-5e2c5531-4684-57ec-6393-8b762f286c70[/B]
ECC Support? Disabled
Compatibility 6.1
[/CODE]
[url]https://stackoverflow.com/questions/13781738/how-does-cuda-assign-device-ids-to-gpus[/url]

preda 2019-12-26 17:53

I suppose it's coming from the GPU ROM or GPU BIOS. It's not the serial number, but it is stable when moving the GPU between systems which is great. I added stickers on my GPUs with their UID to help identify them :)

[QUOTE=kriesel;533597]I wonder how that's done, since gpus do not have readable serial number registers. Serial number is a human-readable sticker on the device, at best. Nearest we could find for CUDALucas use in Windows was the 64-bit-only Windows uuid, which changes upon removal/reinstall of the same piece of hardware. There's also [URL]https://www.mersenneforum.org/showpost.php?p=460426&postcount=2603[/URL]
[CODE]CUDALucas v2.06beta 64-bit build, compiled May 5 2017 @ 13:00:15

binary compiled for CUDA 6.50
CUDA runtime version 6.50
CUDA driver version 8.0

------- DEVICE 0 -------
name GeForce GTX 1060 3GB
[B]UUID GPU-5e2c5531-4684-57ec-6393-8b762f286c70[/B]
ECC Support? Disabled
Compatibility 6.1
[/CODE]
[url]https://stackoverflow.com/questions/13781738/how-does-cuda-assign-device-ids-to-gpus[/url][/QUOTE]

paulunderwood 2019-12-27 08:05

I noticed a huge difference in running times on Linux between compiling with [c]make[/c] and [c]make gpuowl[/c].

kriesel 2019-12-27 18:22

[QUOTE=paulunderwood;533618]I noticed a huge difference in running times on Linux between compiling with [c]make[/c] and [c]make gpuowl[/c].[/QUOTE]Please elaborate. Gpu models, relative timing values, which is faster, etc.

paulunderwood 2019-12-27 18:28

[QUOTE=kriesel;533637]Please elaborate. Gpu models, relative timing values, which is faster, etc.[/QUOTE]

GPU: Asus Radeon VII

make: 1243 us/it
make gpuowl: 760 us/it

I don't know if the same applies to [c]make gpuowl-win.exe[/c]

kriesel 2019-12-27 18:30

Windows build of v6.11-104-g91ef9a8
 
4 Attachment(s)
I'm about to spin up another recent version, to replace an old version that's about to finish an exponent, so why not the very latest? The build process is still generating lots of warnings. Help file attached.
I tried to run monitor.py as shown in the attachment monitor.txt, but ran into what looks like a hard obstacle to me, without knowing much about linux or msys2/mingw64.

Now if such a thing only required OpenCL, that would be great. Linux kernel modules, especially in a sort of emulator, are well out of my depth.

kriesel 2019-12-27 19:01

[QUOTE=paulunderwood;533639]GPU: Asus Radeon VII

make: 1243 us/it
make gpuowl: 760 us/it

I don't know if the same applies to [c]make gpuowl-win.exe[/c][/QUOTE]
No it does not. See [URL]https://www.mersenneforum.org/showpost.php?p=533006&postcount=1625[/URL] where I got my XFX Radeon VII to run under 800us/it on 5M fft, clocked fast and very hot, on a Windows 10 Pro system. Make alone presumably attempts to build a linux executable, and that does not go well in msys2/mingw64. There is no executable generated for comparison. Probably for good reason.[CODE]ken@condorella MINGW64 ~/gpuowl-compile/v6.11-104-g91ef9a8/gpuowl
$ make
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-104-g91ef9a8"
g++ -o gpuowl Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L.
d000050.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler'
d000044.o:(.idata$5+0x0): first defined here
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `pre_c_init':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:146: undefined reference to `__p__fmode'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:290: undefined reference to `_set_invalid_parameter_handler'
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:299: undefined reference to `__p__acmdln'
common.o:common.cpp:(.text+0x371): undefined reference to `__imp___acrt_iob_func'
common.o:common.cpp:(.text+0x92d): undefined reference to `__imp___acrt_iob_func'
Gpu.o:Gpu.cpp:(.text+0x2b5): undefined reference to `__imp___acrt_iob_func'
Gpu.o:Gpu.cpp:(.text+0x822e): undefined reference to `__imp___acrt_iob_func'
Args.o:Args.cpp:(.text+0x29): undefined reference to `__imp___acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingw32.a(lib64_libmingw32_a-merr.o): In function `_matherr':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/merr.c:72: undefined reference to `__acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingw32.a(lib64_libmingw32_a-pseudo-reloc.o): In function `__report_error':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/pseudo-reloc.c:149: undefined reference to `__acrt_iob_func'
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/pseudo-reloc.c:150: undefined reference to `__acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingwex.a(lib64_libmingwex_a-wassert.o): In function `_wassert':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/misc/wassert.c:35: undefined reference to `__acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.2.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingwex.a(lib64_libmingwex_a-mingw_vfprintf.o): In function `__mingw_vfprintf':
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_vfprintf.c:53: undefined reference to `_lock_file'
C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_vfprintf.c:55: undefined reference to `_unlock_file'
collect2.exe: error: ld returned 1 exit status
make: *** [Makefile:19: gpuowl] Error 1
[/CODE]The difference Paul reports looks similar to the difference between using optimized combinations of George's T2_SHUFFLE, WORKINGIN, and WORKINGOUT -use directives and using none of them.

paulunderwood 2019-12-27 19:08

have you tried [c]make clean[/c] followed by [c]make gpuowl-win.exe[/c]?

kriesel 2019-12-27 19:17

[QUOTE=paulunderwood;533648]have you tried [c]make clean[/c] followed by [c]make gpuowl-win.exe[/c]?[/QUOTE]No. I don't see the point of that. I start with an empty folder, git clone, then make gpuowl-win.exe for each build. What would clean gain there?

paulunderwood 2019-12-27 19:21

[QUOTE=kriesel;533649]I don't see the point of that. I start with an empty folder, git clone, then make gpuowl-win.exe for each build. What would clean gain there?[/QUOTE]


If that is what you do (starting from a clean directory each time), then there is no need for make clean.

I see only "make" in your previous output.

[CODE]ken@condorella MINGW64 ~/gpuowl-compile/v6.11-104-g91ef9a8/gpuowl
$ make
[/CODE]

EDIT: my apology for not reading your post correctly. In the attachment you indeed do make gpuowl-win.exe.

kriesel 2019-12-27 19:42

[QUOTE=paulunderwood;533650]If that is what you do: making a clean directory then there is no need for make clean.[/QUOTE]The usual sequence, in a folder called gpuowl-compile:
mkdir latest
cd latest
git clone ...
make gpuowl-win.exe
(move the executable up a level from ./gpuowl/, run -h, create config.txt, close the compile window, rename latest folder to the version-and-commit, and test/use there, after copying to my server drive for zip/post)

The make (nul) test should perhaps have had make clean preceding it.
Redoing it that way produces the same outcome: a build failure.

make clean followed by make gpuowl-win.exe produces a byte-for-byte match to the output of the usual sequence.

Make clean appears to me to only remove .o files in this case.

paulunderwood 2019-12-27 19:47

[QUOTE=kriesel;533654]The usual sequence, in a folder called gpuowl-compile:
mkdir latest
cd latest
git clone ...
make gpuowl-win.exe
(move the executable up a level from ./gpuowl/, run -h, create config.txt, close the compile window, rename latest folder to the version-and-commit, and test/use there, after copying to my server drive for zip/post)

The make (nul) test should perhaps have had make clean preceding it.
Redoing it that way produces its same outcome, build fail.

make clean followed by make gpuowl-win.exe produces a byte-for-byte match to the output of the usual sequence.

Make clean appears to me to only remove .o files in this case.[/QUOTE]

In Linux gpuowl.cl has to follow gpuowl around the file system. To be more explicit: if I move gpuowl into the production directory I must also move the file gpuowl.cl with it.

kracker 2019-12-27 21:19

[QUOTE=paulunderwood;533655]In Linux gpuowl.cl has to follow gpuowl around the file system. To be more explicit: if I move gpuowl into the production directory I must also move the file gpuowl.cl with it.[/QUOTE]

I'm pretty sure the kernels get put into the binary when compiling; the binary works entirely standalone. (It wasn't always that way.)

paulunderwood 2019-12-27 21:42

[QUOTE=kracker;533662]I'm pretty sure the kernels get put into the binary when compiling, the binary works literally alone. (wasn't always that way)[/QUOTE]

Quoting from [url]https://github.com/preda/gpuowl[/url]

[CODE]Usage

Make sure that the gpuowl.cl file is in the same folder as the executable
Get "PRP smallest available first time tests" assignments from GIMPS Manual Testing ( http://mersenne.org/ ).
Copy the assignment lines from GIMPS to a file named 'worktodo.txt'
Run gpuowl. It prints progress report on stdout and in gpuowl.log, and writes result lines to results.txt
Submit the result lines from results.txt to http://mersenne.org/ manual testing.

Build

To build simply invoke "make" (or look inside the Makefile for a manual build).

the library libgmp-dev
a C++ compiler (e.g. gcc, clang)
an OpenCL implementation (which provides the libOpenCL library). Recommended: an AMD GPU with ROCm 1.7.
[/CODE]

Is that all outdated? I have to do [c]make gpuowl[/c]!

kriesel 2019-12-27 23:34

[QUOTE=paulunderwood;533664]Quoting from [URL]https://github.com/preda/gpuowl[/URL]
...
Is that all outdated?[/QUOTE]8 to 9 months outdated, yes; announcement of Gpuowl v6.4 by Preda: [URL]https://www.mersenneforum.org/showpost.php?p=513288&postcount=1056[/URL]

kriesel 2019-12-31 02:59

700M P-1 on gpuowl / P100 / colab
 
It took ~1.74 days of run time, across several colab sessions, with a Fan Ming-provided executable. [URL]https://www.mersenne.org/report_exponent/?exp_lo=700000031&full=1[/URL] Current projections from run time scaling and the buffer count trend are that higher data points will take 2-4 days each, and that P-1 throughout the mersenne.org range will be possible. The run times can probably be improved upon; I'm not using any of the performance-enhancing T2_SHUFFLE or MERGED_MIDDLE -use options during these runs.

Lorenzo 2019-12-31 12:10

Hello!
How do I switch gpuOwl to show the traditional "ms/it" instead of us/sq?

kriesel 2019-12-31 12:32

[QUOTE=Lorenzo;533824]Hello!
How to switch gpuOwl to show the traditional "ms/it" instead us/sq?[/QUOTE]
Edit source code and recompile.

Lorenzo 2019-12-31 12:44

[QUOTE=kriesel;533825]Edit source code and recompile.[/QUOTE]

uhhhh. Too much effort for me :(

paulunderwood 2019-12-31 12:50

[QUOTE=Lorenzo;533824]Hello!
How to switch gpuOwl to show the traditional "ms/it" instead us/sq?[/QUOTE]

I don't know about a switch but...


[CODE]grep -B3 -A3 "us/it" Gpu.cpp
static string makeLogStr(u32 E, string_view status, u32 k, u64 res, float secsPerIt, u32 nIters) {
char buf[256];

snprintf(buf, sizeof(buf), "%u %2s %8d %6.2f%%; %4.0f us/it; ETA %s; %s",
E, status.data(), k, k / float(nIters) * 100,
secsPerIt * 1'000'000, getETA(k, nIters, secsPerIt).c_str(),
hex(res).c_str());
[/CODE]

Change:

%4.0f us/it ---> %4.3f ms/it
1'000'000 ---> 1'000

And recompile.
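The effect of those two changes can be sketched in isolation (a Python stand-in for the snprintf format, not gpuowl code; names are mine):

```python
def fmt_us(secs_per_it: float) -> str:
    """Current formatting: microseconds per iteration, no decimals."""
    return "%4.0f us/it" % (secs_per_it * 1_000_000)

def fmt_ms(secs_per_it: float) -> str:
    """Patched formatting: milliseconds per iteration, 3 decimals."""
    return "%4.3f ms/it" % (secs_per_it * 1_000)

print(fmt_us(0.010173))  # 10173 us/it
print(fmt_ms(0.010173))  # 10.173 ms/it
```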

Or just divide by 1000 in your head. :crank:

Lorenzo 2020-01-03 09:03

[QUOTE=paulunderwood;533827]Or just divide by 1000 in your head. :crank:[/QUOTE]

Thank you! This is what I'm looking for :smile:

storm5510 2020-01-03 23:45

[QUOTE=kriesel;533825]Edit source code and recompile.[/QUOTE]

[QUOTE]"The more you overtake the plumbing, the easier it is to stop up the drain." ~Jimmy Doohan.[/QUOTE]The version I am using has a 9/29/2019 date stamp. I have only ran P-1's with it and there have been no problems. I would be reluctant to replace it with anything newer. I feel it needs to update the screen more often, but I live with it. It produces correct results, as far as I know. This is the important part.

preda 2020-01-04 10:51

CARRY32 and CARRY64
 
A new optimization has been contributed by George: it consists of using only 32 bits to store the carry-out from a word after the convolution. The theoretical analysis of whether this carry value fits in 32 bits is not very clear AFAIK, but the rough idea is that the higher the FFT size, the larger the expected value of the carry. The new CARRY32 has been tested quite a bit at the wavefront (5M FFT) and never produced an error; OTOH the situation may be different at higher FFT sizes.

The performance gain is significant at about 3-5%. Given the above, CARRY32 is now enabled by default. To get the old behavior one can supply "-use CARRY64" to gpuowl.

PRP should detect a carry overflow (when using CARRY32) if one occurs (report the usual error, retry, and stop after the same error repeats 3 times).

OTOH P-1 has no such check; it's probably safer to keep using CARRY64 when doing P-1, especially at FFT sizes larger than 5M (the size that has been tested the most so far).

If anybody sees an error which seems to be caused by CARRY32 (at any FFT size), please report it.
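As an illustration of the failure mode under discussion (not gpuowl kernel code, just a model of two's-complement storage): a carry stored in a signed 32-bit word silently wraps once it leaves the representable range, which is exactly what the 64-bit carry avoids:

```python
I32_MIN, I32_MAX = -2**31, 2**31 - 1

def fits_i32(carry: int) -> bool:
    """Would this carry survive being stored in a signed 32-bit word?"""
    return I32_MIN <= carry <= I32_MAX

def wrap_i32(carry: int) -> int:
    """What a 32-bit store actually produces: two's-complement wraparound."""
    return (carry + 2**31) % 2**32 - 2**31

assert fits_i32(2**31 - 1)          # largest safe carry
assert not fits_i32(2**31)          # one past the limit...
assert wrap_i32(2**31) == -2**31    # ...wraps silently: corruption, not an error
```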

kriesel 2020-01-04 15:01

gpuowl-v6.11-112-gf1b00d1 Windows build
 
2 Attachment(s)
This should have the -use CARRY32 default that Preda described above. I've only gone as far as running -h on it so far. Build again had the usual shower of warnings.

Just when I think we're at diminishing returns or at the end of optimizations, George provides another pleasant surprise.

paulunderwood 2020-01-04 16:08

[QUOTE=preda;534207]A new optimization has been contributed by George, it consists in using only 32bits to store the carry-out from a word after the convolution. The theoretical analysis of whether this carry value does fit in 32bits or not is not very clear AFAIK, but the rough idea is that the higher the FFT size, the larger the expected value of the carry is. The new CARRY32 has been tested quite a bit at the wavefront (5M FFT) and never produced an error, OTOH the situation may be different at higher FFT sizes.

The performance gain is significant at about 3-5%. Given the above, CARRY32 is now enabled by default. To get the old behavior one can supply "-use CARRY64" to gpuowl.

PRP should detect a carry overflow (when using CARRY32) if that occurs (and report the usual error, and retry, and get a repetitive error 3 times and stop).

OTOH P-1 has no check; probably it's safer to keep using CARRY64 when doing P-1, especially when using FFT sizes larger than 5M (which is the FFT that was tested a lot for now).

If anybody sees an error which seems to be caused by CARRY32 (at any FFT size), please report it.[/QUOTE]

I don't understand it. I git cloned gpuowl and compiled it, and it runs slower than before: 1240 us vs. 750 us. What am I doing wrong?

PhilF 2020-01-04 16:55

[QUOTE=paulunderwood;534226]I don't understand it. I git cloned gpuowl and compiled, and it runs slower than before 1240 us. vs. 750 us. What am I doing wrong?[/QUOTE]

I can confirm it works for me on a Radeon VII. I pulled this version before this was even posted, so was using CARRY32 during my tuning without realizing it. With a 5632K FFT, I was getting 888us/it. I placed -use CARRY64 on the command line and the timing slowed to 910us/it.

It just keeps getting better all the time!

NOTE: I installed AMD's ROCm drivers with the --opencl=pal and --headless options, which installs the lightest weight drivers possible. I am using an i7 CPU and motherboard that has built-in video, so that is what I'm using for the console. There's no monitor connected to the Radeon VII at all. Like George said, these Linux drivers are light years ahead of the Windows drivers.

Prime95 2020-01-04 20:04

[QUOTE=PhilF;534229]With a 5632K FFT, I was getting 888us/it.[/QUOTE]

I think you should be getting under 800us. Are you overclocking memory yet?

Thanks for the --headless idea -- I'll try that soon.

Prime95 2020-01-04 20:42

To expand on the new CARRY32 feature: the size of the carry increases as FFTs get larger and as the exponent approaches the limit of the current FFT size. I did test 2000 iterations of an exponent over 1 billion near the upper end of a 56M FFT. The maximum carry I saw was 80% of a fatal overflow value. Thus, I think the new code is safe for some time to come, though we really should do some more research.

Also, the new code stores carries in a different order to be more AMD-friendly. One can get the old memory layout with "-use OLD_CARRY_LAYOUT". That layout might be better on nVidia or it might be irrelevant. CARRY32 and CARRY64 both work with the new and old memory layout.

To activate the old code "-use CARRY64,OLD_CARRY_LAYOUT"

preda 2020-01-04 21:41

[QUOTE=paulunderwood;534226]I don't understand it. I git cloned gpuowl and compiled, and it runs slower than before 1240 us. vs. 750 us. What am I doing wrong?[/QUOTE]

I don't know, but the ROCm compiler can generate surprising results sometimes. What version of ROCm are you using, and what FFT size?

One way to attempt to debug this is:
- run with CARRY64: do you recover the normal performance you had before?
- produce an ISA dump with CARRY64 (using -dump <folder>)
- produce another dump with CARRY32
- compare the .s files from the two dumps. This can be facilitated by the delta.sh script in gpuowl/tools/, which produces partially aggregated instruction counts

Another interesting bit of information is to run with -time in the before/after cases, and see which kernel has a massive slowdown.

One more thing to keep an eye on is thermal throttling by the GPU. If you keep the hottest temperature (hot spot) at under 98C (e.g. 90, 95) there should be little/no thermal throttling.

Prime95 2020-01-05 19:44

@preda: Feature request:

It seems that Ben Delo's big increase in PRP firepower makes it impossible for P-1'ers to stay ahead of the PRP wavefronts. This means we may get assigned an exponent that hasn't had any P-1 done.

Can we change the default behavior of gpuowl to do a P-1 test on the exponent if needed? For a first implementation, don't worry about optimal bounds; we can add that later. P-1 has about a 5% chance of finding a factor. For me a PRP test takes 18 hours, so investing up to 54 minutes of P-1 makes sense. Looking at recent P-1 results turned in to primenet, prime95 chose bounds around B1=745000, B2=14713750 for a 96M exponent. I've no idea how long that takes on my GPU -- maybe I'll go test that now.
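The break-even arithmetic in that request is worth writing down: a factor found by P-1 saves one full PRP test, so in expectation any P-1 effort cheaper than (factor probability) x (PRP time) pays off. A trivial sketch:

```python
def worthwhile_p1_minutes(prp_hours: float, factor_prob: float = 0.05) -> float:
    """Maximum P-1 run time (minutes) that still pays off in expectation."""
    return prp_hours * 60 * factor_prob

print(worthwhile_p1_minutes(18))  # 54.0, matching the 18-hour PRP example above
```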

Prime95 2020-01-05 20:28

[QUOTE=Prime95;534311]I've no idea how long that takes on my GPU -- maybe I'll go test that now.[/QUOTE]

I tested B1=750000, B2=20*B1 on a 5M FFT exponent and it took 26 minutes. Clearly a worthwhile investment if no P-1 has been done before (PRP lines in worktodo that do not end in ",0").

Bonus. My test found a factor! So the P-1 code still works and another exponent bites the dust.

PhilF 2020-01-05 21:04

[QUOTE=Prime95;534317]I tested B1=750000, B2=20*B1 on a 5M FFT expo and it took 26 minutes. Clearly a worthwhile investment if no P-1 has been done before (PRP lines in worktodo that do not end in ",0") .

Bonus. My test found a factor! So the P-1 code still works and another exponent bites the dust.[/QUOTE]

Cool!

I was just assigned a few Cat 4 exponents in the 103M range, TF'ed to 74 bits with no P-1 at all. With a Radeon VII, should I TF them higher first, or skip that and do some P-1 first, or both?

preda 2020-01-05 21:31

[QUOTE=Prime95;534311]@preda: Feature request:

It seems that Ben Delo's big increase in PRP firepower makes it impossible for P-1'ers to stay ahead of the PRP wavefronts. This means we may get assigned an exponent that hasn't had any P-1 done.

Can we change the default behavior of gpuowl to do a P-1 test on the exponent if needed? For first implementation, don't worry about optimal bounds, we can add that later. P-1 has about a 5% chance of finding a factor. For me a PRP test take 18 hours, so investing up to 54 minutes of P-1 makes sense. Looking at recent P-1 results turned into primenet, prime95 chose bounds around B1=745000, B2=14713750 for a 96M exponent. I've no idea how long that takes on my GPU -- maybe I'll go test that now.[/QUOTE]

Understood; I'm looking into this, estimated 1-2 days.

Prime95 2020-01-05 21:49

[QUOTE=PhilF;534322]Cool!

I was just assigned a few Cat 4 exponents in the 103M range, TF'ed to 74 bits with no P-1 at all. With a Radeon VII, should I TF them higher first, or skip that and do some P-1 first, or both?[/QUOTE]

Skip the TF, just P-1.

PhilF 2020-01-05 22:03

[QUOTE=Prime95;534328]Skip the TF, just P-1.[/QUOTE]

OK, thanks.

BTW, regarding my memory timing, I had a chance to play with it today, without success. Even overclocking to just 1050 produced errors.

preda 2020-01-05 22:06

[QUOTE=PhilF;534329]OK, thanks.

BTW, regarding my memory timing, I had a chance to play with it today, without success. Even overclocking to just 1050 produced errors.[/QUOTE]

Did you undervolt? That could also be the reason for the errors.

Prime95 2020-01-05 22:13

[QUOTE=PhilF;534329]BTW, regarding my memory timing, I had a chance to play with it today, without success. Even overclocking to just 1050 produced errors.[/QUOTE]

You could just be very unlucky. My worst card does 1150. However, I was sent an XFX card that wouldn't do 1000. I RMA'ed it, though I don't know for a fact that the memory was the culprit.

PhilF 2020-01-05 22:15

[QUOTE=preda;534330]Did you undervolt? that could also be the reason for errors.[/QUOTE]

Yes, I do. Power draw and fan speed are a problem.

I'm still tuning, but this card seems pretty well optimized out of the box. It's a Gigabyte-branded card that advertises a top clock speed of 1800 MHz. It seems I can't vary the voltage much from the stock power curve, no matter the clock frequency, without getting errors.

Prime95 2020-01-05 22:25

@Phil: Linux, right?

Stay away from that power-hungry 1800 MHz! This is a sample script for one of my cards. Run it as root.

[CODE]#Allow manual control
#echo "manual" >/sys/class/drm/card2/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
# V Default for this card is 1085mV
echo "vc 2 1801 1020" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Overclock mclk to up to 1200
echo "m 1 1190" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Push a dummy sclk change for the undervolt to stick
echo "s 1 1801" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Push everything to the card
echo "c" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Put card into desired performance level
/opt/rocm/bin/rocm-smi -d 2 --setsclk 4 --setfan 170
[/CODE]

You might want to start the voltage at 1080 and work down, and the memory at 1000 and work up. You'll need to change "card2" to "card0" and "-d 2" to "-d 0".

PhilF 2020-01-05 22:39

[QUOTE=Prime95;534333]@Phil: Linux, right?

Stay away from that power-hungry 1800 MHz! This is a sample script for one of my cards. Run it as root.

[CODE]#Allow manual control
#echo "manual" >/sys/class/drm/card2/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
# V Default for this card is 1085mV
echo "vc 2 1801 1020" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Overclock mclk to up to 1200
echo "m 1 1190" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Push a dummy sclk change for the undervolt to stick
echo "s 1 1801" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Push everything to the card
echo "c" >/sys/class/drm/card2/device/pp_od_clk_voltage
#Put card into desired performance level
/opt/rocm/bin/rocm-smi -d 2 --setsclk 4 --setfan 170
[/CODE]

You might want to start the voltage at 1080 and work down, and the memory at 1000 and work up. You'll need to change "card2" to "card0" and "-d 2" to "-d 0".[/QUOTE]

Yes, it is Linux.

My own scripts are quite similar. I started by catting the pp_od_clk_voltage file and using those values as my base. I have high, medium, and low scripts, which correspond to sclk settings of 5, 4, and 3 respectively. That gives me GPU frequencies of 1684, 1547, and 1373 MHz. The best I can get out of the 1684 MHz setting has it running at 185W (actually more when measured at the wall outlet), 92 degrees, and fans at 99%. That's not comfortable.

So today I started tuning my "medium" setting of 1547 MHz. Now the power draw is 145 watts stock, with the fan speed set manually at 130. Much more manageable. But as soon as I deviate even slightly from the stock pp_od_clk_voltage settings, by as little as 5 mV (!), I start throwing errors.

Prime95 2020-01-05 23:08

[QUOTE=PhilF;534334]So today I started tuning my "medium" setting of 1547 Mhz. Now the power draw is 145 watts stock, with the fan speed set manually at 130. Much more manageable. But as soon as I deviate even slightly from the stock pp_od_clk_voltage settings, even as little as 5 mV (!), I start throwing errors.[/QUOTE]

That sucks. BTW, what was the stock voltage? Also, what does "/opt/rocm/bin/rocm-smi -a" report for the actual voltage at 1547 MHz?

Can you increase memory speed at 1547MHz?

PhilF 2020-01-05 23:20

[QUOTE=Prime95;534337]That sucks. BTW, what was the stock voltage? Also, what does "/opt/rocm/bin/rocm-smi -a" report for the actual voltage at 1547 MHz?

Can you increase memory speed at 1547MHz?[/QUOTE]

Stock voltages are:

808MHz / 723mV
1304MHz / 801mV
1801MHz / 1107mV

rocm-smi -a is showing 887mV @ 1547MHz.

I didn't even try increasing memory speed at 1684MHz. But at 1547MHz, the first memory setting I tried was 1100, and it didn't take long to produce an error. So I took it down to 1050, and then it took even less time to produce an error. But when using the stock speed of 1000 and stock voltage it has produced zero errors (so far), and it is easy to keep the temperature below 90 degrees.

Prime95 2020-01-05 23:30

[QUOTE=PhilF;534340]I didn't even try increasing memory speed at 1684Mhz. But at 1547Mhz, the first memory setting I tried was 1100, and it didn't take long to produce an error. So I took it down to 1050, then it took even less time to produce an error. But when using the stock speed of 1000 and stock voltage it has produced zero errors (so far), and it is easy to keep the temperature below 90 degrees.[/QUOTE]

My condolences -- easily the worst Radeon VII card I've heard of. Worse yet, it works at stock settings so cannot ethically be RMA'd.

PhilF 2020-01-05 23:34

[QUOTE=Prime95;534342]My condolences -- easily the worst Radeon VII card I've heard of. Worse yet, it works at stock settings so cannot ethically be RMA'd.[/QUOTE]

It's ok. It still beats anything else I've ever had by miles! :smile:

kriesel 2020-01-06 04:20

[QUOTE=Prime95;534342]My condolences -- easily the worst Radeon VII card I've heard of. Worse yet, it works at stock settings so cannot ethically be RMA'd.[/QUOTE]Well, that sounds like a cue for an update on stability testing of my XFX Radeon VII. I've had it go up to a couple of days with no error; then the weather warms up and another error or two appear, and I've progressively dialed it back slightly on each occasion. It's now at 1270 GPU MHz and 900 memory MHz, running just under 10.9 ms/it on a ~655M PRP that's ~22% complete, with 13 errors accumulated so far and about 64 days left.

Prime95 2020-01-06 04:36

[QUOTE=kriesel;534362]I've progressively dialed it back slightly at each occasion, to where it's now at 1270 gpu Mhz and 900 mem Mhz, [/QUOTE]

RMA is your friend, it won't run correctly at stock settings. I assume you tried it in a different machine with similar results.

PhilF 2020-01-06 06:13

[QUOTE=Prime95;534328]Skip the TF, just P-1.[/QUOTE]

Good call. I decided to run P-1 on my next assignment, M103464293, with B1=50000, B2=50000000, and out popped a factor!

I just guessed at those bounds. The test took 52 minutes. Is there an easy, fast way to determine sane bounds to use with GPU-based P-1 tests when no previous P-1 testing has been done?

preda 2020-01-06 07:17

[QUOTE=PhilF;534380]Good call. Decided to run P-1 on my next assignment, M103464293, with B1=50000 B2=50000000, and out popped a factor!

I just guessed at those bounds. The test took 52 minutes. Is there an easy fast way to determine sane bounds to use with GPU-based P-1 tests when no previous P-1 testing has been done?[/QUOTE]

I tend to prefer a factor of 30x between B1 and B2 (i.e. B2 = 30*B1). Anything between 10x and 50x is probably acceptable. OTOH, your ratio of 1000x was too large.

For B1, something between 500'000 and 1'000'000 is probably reasonable (for 100M exponents). The exact value doesn't matter too much.

kriesel 2020-01-06 07:18

[QUOTE=PhilF;534380]Good call. Decided to run P-1 on my next assignment, M103464293, with B1=50000 B2=50000000, and out popped a factor!

I just guessed at those bounds. The test took 52 minutes. Is there an easy fast way to determine sane bounds to use with GPU-based P-1 tests when no previous P-1 testing has been done?[/QUOTE]Look up the exponent on mersenne.ca and use the PrimeNet bounds. That will satisfy the server and retire the P-1 task.

preda 2020-01-06 07:25

One datapoint on my GPUs (output of gpuowl/tools/monitor.py)
[CODE]
GPU UID VDD SCLK MCLK Mem-used Mem-busy PWR FAN Temp PCIeErr
0 3044212172dc768c 800mV 1358 1181 0.33GB 37% 146W 1925 70/87/77 0
1 780c28c172da5ebb 825mV 1363 1171 0.33GB 38% 154W 1783 68/84/74 0
2 a810192172fd5d12 781mV 1363 1181 0.61GB 37% 139W 1797 69/84/76 0
[/CODE]

I run my GPUs at --setsclk 3 (i.e. about 145W). If I need extra heat I can push to --setsclk 4 (170-180W); if it's too hot I can go down to --setsclk 2, but there the efficiency gain is smaller. In general I would not run a Radeon VII above --setsclk 4 (because of noise and lower efficiency).

The efficiency gain from undervolting is modest, so I wouldn't worry if the card does not undervolt. In fact, I would suggest tuning the memory first, without any undervolting, and only afterwards tuning the voltage.

For sure I would watch the temperature, for two reasons. First, the Radeon VII thermally throttles a lot: if you set it to max frequency, it will simply become super-hot and drop to a much lower frequency, for no benefit and with lower efficiency in the process. Second, all errors become more frequent when the card is hot.

[QUOTE=PhilF;534340]Stock voltages are:

808Mhz / 723mV
1304Mhz / 801mV
1801Mhz / 1107mV

rocm-smi -a is showing 887mV @ 1547Mhz.

I didn't even try increasing memory speed at 1684Mhz. But at 1547Mhz, the first memory setting I tried was 1100, and it didn't take long to produce an error. So I took it down to 1050, then it took even less time to produce an error. But when using the stock speed of 1000 and stock voltage it has produced zero errors (so far), and it is easy to keep the temperature below 90 degrees.[/QUOTE]

kriesel 2020-01-06 07:29

800M P-1 on Tesla P100, Colab
 
Fan Ming build of gpuowl, 800M P-1 on Tesla P100, 2.35 days running time for both stages, [URL]https://www.mersenne.org/report_exponent/?exp_lo=800000027&full=1[/URL]

[QUOTE=kriesel;533812]It took ~1.74 days of run time, over several Colab sessions, with a Fan Ming-provided executable. [URL]https://www.mersenne.org/report_exponent/?exp_lo=700000031&full=1[/URL] Current projections from runtime scaling and the buffer-count trend are that higher data points will take 2-4 days each, and that P-1 throughout the mersenne.org range will be possible. The run times can probably be improved upon; I'm not using any of the performance-enhancing T2_shuffle or merged-middle -use options during these runs.[/QUOTE]

preda 2020-01-06 07:35

[CODE]
GPU UID VDD SCLK MCLK Mem-used Mem-busy PWR FAN Temp PCIeErr
0 3044212172dc768c 800mV 1358 1181 0.33GB 37% 146W 1925 70/87/77 0
1 780c28c172da5ebb 825mV 1363 1171 0.33GB 38% 154W 1783 68/84/74 0
2 a810192172fd5d12 781mV 1363 1181 0.61GB 37% 139W 1797 69/84/76 0
[/CODE]
Who wants to guess which of the above is the XFX? :)

kriesel 2020-01-06 07:36

[QUOTE=Prime95;534366]RMA is your friend, it won't run correctly at stock settings. I assume you tried it in a different machine with similar results.[/QUOTE]Nope; it's in a new-to-me system bought just for housing it, so far running only Windows 10. Maybe I'll try a Linux dual-boot install on that system before relocating it, which would require displacing some other production GPUs. Right now I'm on deadline on some other things.

kriesel 2020-01-06 07:43

[QUOTE=preda;534388]Who wants to guess which of the above is the XFX? :)[/QUOTE]
1; high voltage, high power, low mclk

PhilF 2020-01-06 15:46

[QUOTE=preda;534385]I run my GPUs at --setsclk 3 (i.e. about 145W). If I need extra heat I can push to setsclk 4 (170-180W), if it's too hot I can go down to --setsclk 2 but there the efficiency gain is smaller. In general I would not run a RadeonVII above --setsclk 4 (because of noise and lower efficiency).[/QUOTE]

Based on that, maybe my card isn't so pitiful after all. All my testing has been with a setsclk setting of 4 or 5, and my --setsclk 4 speed is pulling only 150W.

My --setsclk 5 setting is hungry (about 185W), hot, and noisy. But it does work at that speed. I completed a PRP double-check using that setting. But I haven't even played with a setsclk of 3, because I thought everyone was using 4 or 5.

The outrageous --setsclk 6 setting (1800 MHz) sets off the overload alarm on my UPS!

PhilF 2020-01-06 16:07

[QUOTE=preda;534383]I tend to prefer a factor of 30x between B1 and B2 (i.e. B2 = 30*B1). Probably anything between 10x to 50x may be acceptable. OTOH you ratio of 1000x was too large.

For B1 probably something between 500'000 and 1'000'000 is reasonable (for 100M exponents). The exact value doesn't matter too much.[/QUOTE]

Can I assume that by picking a B2 bound of 1000x B1 the only repercussion is that the test took a little longer? The reason I picked one so large is that I figured a larger B2 would allow for better utilization of the card's 16GB of memory.

preda 2020-01-06 21:08

[QUOTE=PhilF;534411]Can I assume that by picking a B2 bound of 1000x B1 the only repercussion is that the test took a little longer? The reason I picked one so large is that I figured a larger B2 would allow for better utilization of the card's 16GB of memory.[/QUOTE]

Yes; a B2 that is very large relative to B1 is safe, but it is not a very efficient use of the compute.

P-1 works by finding a factor "p" of the Mersenne candidate such that p-1 is a product of prime factors, all of which are less than B1 except at most one, which may lie between B1 and B2.

In your case it would make sense to increase B1 to 500'000 or 1M if you want to keep B2 at 50M.
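To make the rule concrete, here is a toy smoothness check (an illustrative sketch, not gpuowl's code; the function names and example numbers are made up, and it ignores the detail that stage 1 actually raises small primes to powers):

```python
# Would P-1 with bounds (B1, B2) find the factor p?
# p-1 must consist of primes <= B1, except for at most one prime in (B1, B2].
# Toy sketch with made-up example numbers; not gpuowl's implementation.

def prime_factors(n: int) -> list:
    """Trial-division factorization; fine for small demo numbers."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def pm1_finds(p: int, b1: int, b2: int) -> bool:
    big = [q for q in prime_factors(p - 1) if q > b1]
    return not big or (len(big) == 1 and big[0] <= b2)

# p = 12109 is prime and p-1 = 2*2*3*1009, so with B1 = 1000 the factor
# is only found once stage 2's B2 reaches the lone large prime 1009.
print(pm1_finds(12109, 1000, 2000))   # stage 2 covers 1009
print(pm1_finds(12109, 1000, 1005))   # B2 too small
```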

preda 2020-01-07 10:08

[QUOTE=Prime95;534317]I tested B1=750000, B2=20*B1 on a 5M FFT expo and it took 26 minutes. Clearly a worthwhile investment if no P-1 has been done before (PRP lines in worktodo that do not end in ",0") .

Bonus. My test found a factor! So the P-1 code still works and another exponent bites the dust.[/QUOTE]

Can somebody please remind me what the meaning of the last integer value ("0" below) is in a PRP assignment such as:

PRP=700000F64405DAFE2EXXXXXXC85EEF72,1,2,91157779,-1,77,0

Do I understand correctly that when it's 0, it means "don't do any P-1"?
Then what does it mean when it's 1, 2, or what else can it be?

kriesel 2020-01-07 10:32

[QUOTE=preda;534476]Can somebody please remind me what is the meaning of the last integer value ("0" below) in a PRP assignment such as:

PRP=700000F64405DAFE2EXXXXXXC85EEF72,1,2,91157779,-1,77,0

Do I understand correctly that when it's 0, it means "don't do any P-1"?
Then what does it mean when it's 1, 2, or what else can it be?[/QUOTE]Roughly, it's the number of primality tests that could be saved if a factor is found. [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL]

0: two or more primality tests have already been done, or sufficient P-1 factoring has already been done; make no P-1 factoring effort
1: one primality test has already been done; make whatever P-1 effort makes sense for that
2: no primality test done, and no sufficient P-1 factoring done yet; make whatever P-1 effort makes sense for that

In other usages, such as CUDAPm1, values higher than 2, up to 9, mean try harder to find a P-1 factor, with larger bounds. See also [URL]https://www.mersenneforum.org/showpost.php?p=501984&postcount=17[/URL]
If I recall correctly, inputs above 9 get their effort capped at the equivalent of 9.

George's description of the optimization process is in the P-1 Factoring section of [URL]https://www.mersenne.org/various/math.php[/URL].
It's also there to read in the source code.

Why do you ask; are you thinking of adding automatic bounds selection to gpuowl? If so, please go for the PrimeNet bounds so the P-1 task is retired.

The exponent in question had more than optimal TF applied and less than optimal P-1 bounds, and the net effect fell short of optimal factoring probability. [URL]https://www.mersenne.ca/exponent/91157779[/URL]

preda 2020-01-07 11:26

Thanks Ken for the explanation.
I'm thinking of adding "implicit preliminary P-1" for PRP assignments for exponents that didn't have any P-1.

So I obtain from PrimeNet one PRP assignment with one AID. I run both P-1 and PRP, so now I have two results. With which AID should I submit the "implicit" P-1 result?

[QUOTE=kriesel;534477]Roughly, number of primality tests that could be saved if a factor is found. [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL]

0: two or more primality tests have already been done, or sufficient P-1 factoring has already been done; make no P-1 factoring effort
1: one primality test has already been done; make whatever P-1 effort makes sense for that
2: no primality test done, and no sufficient P-1 factoring done yet; make whatever P-1 effort makes sense for that

In other usages, such as CUDAPm1, values higher than 2, up to 9, mean try harder to find a P-1 factor, with larger bounds. See also [URL]https://www.mersenneforum.org/showpost.php?p=501984&postcount=17[/URL]
If I recall correctly, inputs above 9 get their effort capped at the equivalent of 9.

George's description of the optimization process is in the P-1 Factoring section of [URL]https://www.mersenne.org/various/math.php[/URL].
It's also there to read in the source code.

Why do you ask; are you thinking of adding automatic bounds selection to gpuowl? If so, please go for the PrimeNet bounds so the P-1 task is retired.

The exponent in question had more than optimal TF applied and less than optimal P-1 bounds, and the net effect fell short of optimal factoring probability. [URL]https://www.mersenne.ca/exponent/91157779[/URL][/QUOTE]

kriesel 2020-01-07 11:55

[QUOTE=preda;534480]Thanks Ken for the explanation.
I'm thinking of adding "implicit preliminary P-1" for PRP assignments for exponents that didn't have any P-1.

So I obtain from PrimeNet one PRP assignment with one AID. I run both P-1 and PRP, so now I have two results. With which AID should I submit the "implicit" P-1 result?[/QUOTE]P-1 runs first. Assuming no factor is found: submit the P-1 with AID 0. It will be accepted. It may generate some sort of warning, as it's not the work type assigned for that user and exponent, but it will leave the PRP assignment in place. Then when the PRP is completed, it gets reported with its AID and all's good.
If a factor is found, don't run the PRP, and try reporting the P-1 factor with its AID.

What do you mean by preliminary? Hopefully not a reduced-bounds P-1 run, which I've determined by testing to be a waste of resources even when optimized for the most factors per GPU-hour. See [URL]https://www.mersenneforum.org/showpost.php?p=531129&postcount=20[/URL]

But not to worry: P-1 is covered on that one. [URL]https://www.mersenne.org/report_exponent/?exp_lo=91157779&full=1[/URL]

PhilF 2020-01-07 15:19

[QUOTE=preda;534480]So I obtain from PrimeNet one PRP assignment with one AID. I run both P-1 and PRP, now I have two results. With what AID should I submit the "implicit" P-1 result?[/QUOTE]

I have submitted P-1 results twice now by pasting gpuowl's results.txt into PrimeNet's manual results submission page. One result had a factor, the other didn't. No AID. Both results were accepted without error and properly credited to my user account.

[code]{"exponent":"103464293", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.11-112-gf1b00d1"}, "timestamp":"2020-01-06 06:01:49 UTC", "user":"pfrakes", "computer":"i7-4790", "fft-length":5767168, "B1":50000, "B2":50000000, "factors":["2419588148340043449947153"]}[/code]

preda 2020-01-08 11:04

[QUOTE=kriesel;534481]
What do you mean by preliminary? Hopefully not a reduced-bounds P-1 run[/QUOTE]

By "preliminary" I meant "subtask to be run before the main task".

About the bounds, I propose an extremely simple heuristic for the defaults:
B1 = exponent / 100, rounded to a multiple of 100'000
B2 = 30 * B1
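That heuristic is simple enough to state in a few lines of code (a sketch of the proposed defaults only, not what gpuowl actually ships):

```python
# Proposed default bounds: B1 = exponent/100 rounded to a multiple
# of 100'000, B2 = 30*B1. Sketch of the heuristic, not gpuowl code.

def default_bounds(exponent: int):
    b1 = round(exponent / 100 / 100_000) * 100_000
    return b1, 30 * b1

# A 100M exponent gets B1=1'000'000, B2=30'000'000.
print(default_bounds(100_000_000))
print(default_bounds(91157779))
```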

kriesel 2020-01-08 19:10

[QUOTE=preda;534580]By "preliminary" I meant "subtask to be run before the main task".

About the bounds, I propose an extremely simple heuristic for the defaults:
B1 = exponent / 100, rounded to a multiple of 100'000
B2 = 30 * B1[/QUOTE]
That won't match either the GPUto72 or PrimeNet bounds goals, although it's probably close in effect to the PrimeNet figures. Samples of both, and fits to them, can be seen in the attachment to [url]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/url].

chalsall 2020-01-08 20:40

[QUOTE=kriesel;534605]That won't match either the GPUto72 or PrimeNet bounds goals...[/QUOTE]

Ummm... Just to put it on the table: GPU72 doesn't /have/ P-1 bounds. Our only dimension is TF "depth" (and range, of course).

Are you instead talking about the B1/B2 values chosen by Prime95/mprime based on the depth the candidate has already been TF'ed to?

Sorry... Juggling lots of different stuff, but your post above confused me.

kriesel 2020-01-08 21:28

[QUOTE=chalsall;534611]Ummm... Just to put on the table, GPU72 doesn't /have/ P-1 bounds. Our only dimension is TF "depth" (and range, of course).

Are you instead talking about the B1/B2 values chosen by Prime95/mprime based on the depth the candidate has already been TF'ed to?

Sorry... Juggling lots of different stuff, but your above confused me.[/QUOTE]
What I was referring to are the B1 and B2 numbers listed in the GPUto72 row or PrimeNet row of any exponent entry at mersenne.ca. See for example [URL]https://www.mersenne.ca/exponent/200000033[/URL]

chalsall 2020-01-08 21:33

[QUOTE=kriesel;534614]See for example [URL]https://www.mersenne.ca/exponent/200000033[/URL][/QUOTE]

Ah... Thank you. I truly was confused about where all the references to GPU72 bounds were coming from. This now makes sense.

James: rational? I'm sure it's sane, knowing you... :smile:

kriesel 2020-01-08 22:53

[QUOTE=chalsall;534615]makes sense.[/QUOTE]I try. And iterate.:smile:

wfgarnett3 2020-01-09 12:23

[QUOTE=kriesel;534222]This should have the -use CARRY32 default that Preda described above. I've only gone as far as running -h on it so far. Build again had the usual shower of warnings.

Just when I think we're at diminishing returns or at the end of optimizations, George provides another pleasant surprise.[/QUOTE]

I downloaded this gpuowl-v6.11-112-gf1b00d1.7z windows exe file that kriesel posted.

However, it seems to have significant CPU usage (I checked Task Manager to verify).

Whenever I have it running on the GPU and Prime95 on the CPU, Prime95's iteration time slows down. Once I stop gpuowl, Prime95 goes back to its normal iteration time, but as soon as I restart gpuowl, Prime95 slows back down.

This is my first real usage of gpuowl (I am "double-checking" a PRP number just to test the software out).

This never happened when I double-checked LL with CudaLucas -- CudaLucas (or mfaktc for TF) running on the GPU never affected Prime95 running simultaneously on the CPU.

Can someone tell me why this happens?

EVGA GeForce GTX 1050 SC GAMING (2GB GDDR5)
Part number: 02G-P4-6152-KR

Dell Desktop Tower with Windows 10
Intel i3-4150 @ 3.5GHz
Memory: 8.00 GB

preda 2020-01-09 12:39

Try using -yield, which is specifically designed to address CUDA CPU usage.

[QUOTE=wfgarnett3;534666]I downloaded this gpuowl-v6.11-112-gf1b00d1.7z windows exe file that kriesel posted.

However it seems to have CPU usage (and I checked Task Manager to verify).

Whenever I have it running on the GPU and Prime95 on the CPU the iteration time on Prime95 slows down. Once I stop gpuowl Prime95 goes back to its normal iteration time, but as soon as I restart gpuowl Prime95 slows back down.

This is my first real usage of gpuowl (I am "double-checking" a PRP number just to test the software out).

This never happened when I double-checked LL with CudaLucas -- CudaLucas (or mfaktc for TF) running on the GPU never affected Prime95 running simultaneously on the CPU.

Can someone tell me why this happens?

EVGA GeForce GTX 1050 SC GAMING (2GB GDDR5)
Part number: 02G-P4-6152-KR

Dell Desktop Tower with Windows 10
Intel i3-4150 @ 3.5GHz
Memory: 8.00 GB[/QUOTE]

wfgarnett3 2020-01-09 12:46

[QUOTE=preda;534667]Try using -yield which is specially designed to address CUDA CPU usage.[/QUOTE]

-yield does not seem to help - Prime95 still gets slowed.

(Note I am actually going to sleep right now so will be offline but feel free for you or anyone else to comment on what may be happening)

kriesel 2020-01-09 15:40

[QUOTE=wfgarnett3;534668]-yield does not seem to help - Prime95 still gets slowed.

(Note I am actually going to sleep right now so will be offline but feel free for you or anyone else to comment on what may be happening)[/QUOTE]An unfortunate feature of prime95 is that, without hyperthreading enabled on the host, occupying one CPU core with something else can cost an entire prime95 worker's output, however many cores that worker uses. A little CPU usage by gpuowl, even with -yield, is normal: some CPU cycles are used for saving checkpoints to disk, screen output, performing the GEC, etc. But if -yield is in config.txt or on the command line, usage should be reduced from the full CPU core or hyperthread that is consumed without that option. How much were you seeing without -yield, and how much with?

mrh 2020-01-09 21:12

On my Ubuntu system, ROCm got upgraded (by apt upgrade) to what looks to be 3.0.6, and now in some instances gpuowl crashes immediately with this message:

[CODE]Memory access fault by GPU node-1 (Agent handle: 0x56153da65fb0) on address 0x7f6250a00000. Reason: Page not present or supervisor privilege.[/CODE]

This from the kernel:
[CODE][10312.567135] amdgpu 0000:04:00.0: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process gpuowl pid 10653 thread gpuowl pid 10653)
[10312.567142] amdgpu 0000:04:00.0: in page starting at address 0x00007f6250a00000 from client 27
[10312.567146] amdgpu 0000:04:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00801030
[10312.567150] amdgpu 0000:04:00.0: MORE_FAULTS: 0x0
[10312.567154] amdgpu 0000:04:00.0: WALKER_ERROR: 0x0
[10312.567157] amdgpu 0000:04:00.0: PERMISSION_FAULTS: 0x3
[10312.567160] amdgpu 0000:04:00.0: MAPPING_ERROR: 0x0
[10312.567164] amdgpu 0000:04:00.0: RW: 0x0[/CODE]


-pm1 works fine, -prp 133331333 works fine, but -prp on most other numbers crashes.

This was with a clean build from GitHub. As far as I can tell nothing has happened to the card; mfakto also seems to work just fine.

Any ideas about what to do? Is 3.0.6 a bad version to be using?

preda 2020-01-09 21:18

automatic initial P-1 for PRP
 
For a task of the form:
PRP=XXXXXXXX,1,2,91408469,-1,77,1
i.e. note the final integer -- let's call it "wantsPm1" -- being "1" instead of the usual "0"; this indicates that P-1 testing is desired.

gpuowl will automatically expand the task into a P-1 task and a PRP task with "wantsPm1" set to 0.

It works like this:
- gpuowl reads the first good line from worktodo.txt
- if that line is a PRP with wantsPm1 non-zero, two new tasks are *appended* to worktodo.txt (i.e. at the end)
- after which the PRP task that had wantsPm1 set is deleted from worktodo.txt
- loop to find the first task in worktodo.txt

I.e. this results in a re-ordering of the tasks in worktodo.txt, because the "expanded" tasks are always added at the end.

It is likely there are some bugs, please let me know if you see any.
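The loop above can be sketched roughly as follows (illustrative only; the "PM1-for:" marker is a made-up stand-in for whatever task syntax gpuowl actually writes for the P-1 line):

```python
# Expand the first PRP task whose trailing wantsPm1 field is non-zero
# into a P-1 task plus the same PRP task with wantsPm1 = 0, both
# appended at the end (so the worktodo order changes, as noted above).
# "PM1-for:" is a made-up placeholder for the real P-1 task syntax.

def expand_first_task(tasks: list) -> list:
    out = list(tasks)
    for i, line in enumerate(out):
        if not line.startswith("PRP="):
            continue
        head, wants_pm1 = line.rsplit(",", 1)
        if wants_pm1 != "0":
            del out[i]                     # drop the original task
            out.append("PM1-for:" + head)  # appended P-1 task (placeholder syntax)
            out.append(head + ",0")        # same PRP with wantsPm1 reset to 0
        break                              # only the first task is considered
    return out

print(expand_first_task(["PRP=AAAA,1,2,91408469,-1,77,1"]))
```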

preda 2020-01-09 21:22

It is because of the upgrade to ROCm 3.0. I don't know where exactly the bug is (in gpuowl or in ROCm) because I personally couldn't yet upgrade to ROCm 3.0 to test. When I can install 3.0 I'll have a look. In the meantime, a solution is to move back to ROCm 2.10.

[QUOTE=mrh;534703]On my ubuntu system, rocm got upgraded (by apt upgrade) to what looks to be 3.0.6, and now in some instances gpuowl crashes immediately with this message:

[CODE]Memory access fault by GPU node-1 (Agent handle: 0x56153da65fb0) on address 0x7f6250a00000. Reason: Page not present or supervisor privilege.[/CODE]

This from the kernel:
[CODE][10312.567135] amdgpu 0000:04:00.0: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process gpuowl pid 10653 thread gpuowl pid 10653)
[10312.567142] amdgpu 0000:04:00.0: in page starting at address 0x00007f6250a00000 from client 27
[10312.567146] amdgpu 0000:04:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00801030
[10312.567150] amdgpu 0000:04:00.0: MORE_FAULTS: 0x0
[10312.567154] amdgpu 0000:04:00.0: WALKER_ERROR: 0x0
[10312.567157] amdgpu 0000:04:00.0: PERMISSION_FAULTS: 0x3
[10312.567160] amdgpu 0000:04:00.0: MAPPING_ERROR: 0x0
[10312.567164] amdgpu 0000:04:00.0: RW: 0x0[/CODE]


-pm1 works fine, -prp 133331333 works fine, but -prp on most other numbers crashes.

This was with a clean build from GitHub. As far as I can tell nothing has happened to the card; mfakto also seems to work just fine.

Any ideas about what to do? Is 3.0.6 a bad version to be using?[/QUOTE]

mrh 2020-01-09 21:31

Ah, thanks! I rolled back to rocm 2.10 and I'm all better.

PhilF 2020-01-09 21:40

[QUOTE=preda;534707]It is because of the upgrade to ROCm 3.0. I don't know where exactly is the bug (in gpuowl or in ROCm) because personally I couldn't yet upgrade to ROCm 3.0 to test. When I can install 3.0 I'll have a look. In the meantime a solution is to move back to ROCm 2.10 .[/QUOTE]

Wow, I must have gotten in just under the wire. My installation is recent (Dec. 30), and I loaded whatever ROCm version was available then. However, it does not have video drivers loaded, only OpenCL, and no monitor is attached to my Radeon VII. I've not had a problem.

How can I check my current ROCm version? It doesn't show up in dmesg or lsmod.

mrh 2020-01-09 21:56

[QUOTE=PhilF;534709]Wow, I must have gotten in just under the wire. My installation is recent (Dec. 30), and I loaded whatever ROCm version was available then. However, It does not have video drivers loaded, only OpenCL, and no monitor is attached to my Radeon VII. I've not had a problem.

How can I check my current ROCm version? It doesn't show up in dmesg or lsmod.[/QUOTE]

I think the simplest is "dpkg -l |grep rocm"

PhilF 2020-01-09 22:07

[QUOTE=mrh;534711]I think the simplest is "dpkg -l |grep rocm"[/QUOTE]

My dpkg output doesn't have any lines that contain rocm. Some do contain "amdgpu", but the only version number I can find is 19.30-934563, which doesn't shed any light on the ROCm version as far as I can tell.

[code]ii amdgpu-core 19.30-934563 all Core meta package for unified amdgpu driver.
ii amdgpu-dkms 19.30-934563 all amdgpu driver in DKMS format.
ii amdgpu-pro-core 19.30-934563 all Core meta package for Pro components of the unified amdgpu driver.
ii amdgpu-pro-pin 19.30-934563 all Meta package to pin a specific amdgpu driver version.
ii clinfo-amdgpu-pro 19.30-934563 amd64 AMD OpenCL info utility
ii libdrm-amdgpu-amdgpu1:amd64 1:2.4.98-934563 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libdrm-amdgpu-common 1.0.0-934563 all List of AMD/ATI cards' device IDs, revision IDs and marketing names
ii libdrm2-amdgpu:amd64 1:2.4.98-934563 amd64 Userspace interface to kernel DRM services -- runtime
ii libopencl1-amdgpu-pro:amd64 19.30-934563 amd64 AMD OpenCL ICD Loader library
ii opencl-amdgpu-pro-comgr 19.30-934563 amd64 non-free AMD OpenCL ICD Loaders
ii opencl-amdgpu-pro-icd 19.30-934563 amd64 non-free AMD OpenCL ICD Loaders
[/code]

mrh 2020-01-09 22:50

[QUOTE=PhilF;534713]My dpkg output doesn't have any lines that contain rocm. Some do contain "amdgpu", but the only version number I can find is 19.30-934563, which doesn't shed any light on the rocm version as far as I can tell.
[/QUOTE]

I'm not an amd expert, but I think that indicates you are using the "amd pro" drivers (I think that is what they are called) vs. rocm.

-mike

PhilF 2020-01-09 22:57

[QUOTE=mrh;534717]I'm not an amd expert, but I think that indicates you are using the "amd pro" drivers (I think that is what they are called) vs. rocm.

-mike[/QUOTE]

That would make sense. If so, I highly recommend the pro drivers to others.

kriesel 2020-01-10 00:13

Windows build for gpuowl v6.11-116-g5ca090d
 
2 Attachment(s)
For anyone who'd like to give it a try on Windows, this build of the latest available commit was done on Windows 7 x64 minutes ago. I haven't tried it past the help function. Running "make gpuowl-win" again generated the usual shower of warnings; see build-log.txt attached.
[QUOTE=preda;534705]For a task of the form:
PRP=XXXXXXXX,1,2,91408469,-1,77,1
i.e. note the final integer, let's call it "wantsPm1", being "1" instead of the usual "0" -- this indicates that P-1 testing is desired;

gpuowl will automatically expand the task into a P-1 and a PRP with the "wantsPm1" set to 0.

It works like this:
- gpuowl reads the first good line from worktodo.txt
- if that line is a PRP with wantsPm1 non-zero, two new tasks are *appended* to the worktodo.txt (i.e. at the end)
- after which the PRP task that had wantsPm1 set is deleted from worktodo.txt
- loop to find the first task in worktodo.txt

I.e. this would result in a re-ordering of the tasks in worktodo.txt because the "expanded" tasks are always added to the end.

It is likely there are some bugs, please let me know if you see any.[/QUOTE]
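The expansion preda describes could be sketched roughly like this (a Python sketch, not gpuowl's actual code; the "PFactor=" line format for the P-1 task and the field layout are my assumptions):

```python
# Sketch of the worktodo.txt expansion described above: a "PRP=AID,k,b,n,c,
# how_far_factored,wantsPm1" line with wantsPm1 == 1 is replaced by a P-1
# task plus a PRP task with wantsPm1 = 0, both *appended* at the end.

def expand_tasks(lines):
    """Return the worktodo lines after one expansion pass."""
    out = list(lines)
    for i, line in enumerate(out):
        if not line.startswith("PRP="):
            continue
        fields = line[len("PRP="):].split(",")
        if fields[-1].strip() == "1":               # wantsPm1 is set
            fields[-1] = "0"
            pm1 = "PFactor=" + ",".join(fields[:-1])  # hypothetical P-1 line format
            prp = "PRP=" + ",".join(fields)           # same PRP, wantsPm1 cleared
            del out[i]                 # drop the original task...
            out += [pm1, prp]          # ...append the expanded pair at the end
            break                      # gpuowl then re-reads the file
    return out

print(expand_tasks(["PRP=AID,1,2,91408469,-1,77,1",
                    "PRP=AID,1,2,90000001,-1,77,0"]))
```

Note the re-ordering effect mentioned above: the expanded pair lands after any tasks that followed the original line.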

Prime95 2020-01-10 00:56

[QUOTE=preda;534705]
gpuowl will automatically expand the task into a P-1 and a PRP with the "wantsPm1" set to 0.[/QUOTE]

If P-1 finds a factor are both the P-1 and PRP lines deleted?

wfgarnett3 2020-01-10 01:42

5 Attachment(s)
[QUOTE=kriesel;534675]An unfortunate feature of prime95 is that without hyperthreading enabled on the host, occupying one cpu core with something else can cost an entire prime95 worker's output, however many cores that is. A little cpu usage by gpuowl even with -yield is normal. Some cpu cycles are used for save checkpoints to disk, screen output, doing the GEC, etc. But if -yield is in the config.txt or the command line, it should be reduced from the full cpu core or hyperthread that occurs without that option. How much were you seeing without -yield, and how much with?[/QUOTE]

Hyperthreading is enabled on my 2 core i3-4150.

-yield does help somewhat.

Here are my tests.

Screenshot 1 is Prime95 by itself PRP testing 90519811 with a 14.3 iteration time and 53% CPU usage.
Screenshot 2 is gpuOwL by itself without -yield PRP testing 81943843 with a 17.7 iteration time and 27% CPU usage.
Screenshot 3 is gpuOwL by itself with -yield with a 17.8 iteration time and 31% CPU usage.
Screenshot 4 is both Prime95 and gpuOwL (without -yield) showing Prime95 has a 19.1 iteration time and 81% CPU usage (thus gpuOwL slowed Prime95 down from 14.3 to 19.1).
Screenshot 5 is both Prime95 and gpuOwL (with -yield) showing Prime95 now has a 17.7 iteration time and 77% CPU usage so the -yield option helped some.

Thanks.

kriesel 2020-01-10 04:15

[QUOTE=wfgarnett3;534731]Hyperthreading is enabled on my 2 core i3-4150.

-yield does help somewhat.

Here are my tests.

Screenshot 1 is Prime95 by itself PRP testing 90519811 with a 14.3 iteration time and 53% CPU usage.
Screenshot 2 is gpuOwL by itself without -yield PRP testing 81943843 with a 17.7 iteration time and 27% CPU usage.
Screenshot 3 is gpuOwL by itself with -yield with a 17.8 iteration time and 31% CPU usage.
Screenshot 4 is both Prime95 and gpuOwL (without -yield) showing Prime95 has a 19.1 iteration time and 81% CPU usage (thus gpuOwL slowed Prime95 down from 14.3 to 19.1).
Screenshot 5 is both Prime95 and gpuOwL (with -yield) showing Prime95 now has a 17.7 iteration time and 77% CPU usage so the -yield option helped some.

Thanks.[/QUOTE]See [url]https://www.mersenneforum.org/showpost.php?p=526331&postcount=1403[/url] for how much difference -yield made in my case: the difference between one core saturated and 2% of a core. What else is using cpu during your gpuowl-only runs? Windows Task Manager, Processes tab, sort by cpu % usage. On my systems, the ratio of accumulated cpu time prime95 to gpuowl-win is >20:1.

preda 2020-01-10 08:13

[QUOTE=Prime95;534723]If P-1 finds a factor are both the P-1 and PRP lines deleted?[/QUOTE]

No, not yet (an oversight on my part). What should I do with the AID and the assignment relative to primeNet?

- should I put the AID (of the PRP) on the P-1 factor-found result?
- if I simply drop the PRP assignment from worktodo.txt when P-1 finds a factor, would it still be assigned on the server even if the factor is reported?

wfgarnett3 2020-01-10 08:48

2 Attachment(s)
[QUOTE=kriesel;534743]See [url]https://www.mersenneforum.org/showpost.php?p=526331&postcount=1403[/url] for how much difference -yield made in my case: the difference between one core saturated and 2% of a core. What else is using cpu during your gpuowl-only runs? Windows Task Manager, Processes tab, sort by cpu % usage. On my systems, the ratio of accumulated cpu time prime95 to gpuowl-win is >20:1.[/QUOTE]

See attached screenshots -- the gpuOwL-only runs account for essentially all of the 27% CPU usage.

preda 2020-01-10 12:26

Abort PRP test on P-1 factor found
 
I added some untested code that is supposed to:

1. when a P-1 factor is found, all PRP entries from worktodo.txt for the same exponent are removed. No result is written (to results.txt) for these deleted tasks.
2. when a P-1 factor is found in the background (GCD) while a PRP test for the same exponent is ongoing, the PRP test is aborted early and point 1 above is applied.

I think this solution [in addition to bugs] has the problem of leaving PRP assignments "hanging" on primenet. Maybe the server could implement auto-release of a user's PRP assignments when that user submits a factor for the same exponent (because, after a factor is found, it does not make sense for the user who found it to pursue the PRP tests).

[QUOTE=preda;534752]No, not yet (an oversight on my part). What should I do with the AID and the assignment relative to primeNet?

- should I put the AID (of the PRP) on the P-1 factor-found result?
- if I simply drop the PRP assignment from worktodo.txt when P-1 finds a factor, would it still be assigned on the server even if the factor is reported?[/QUOTE]
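Point 1 of the untested code described above could be sketched like this (my own Python, not gpuowl's; it assumes the exponent is the fourth comma-separated field of a PRP= line, as in the PRP examples earlier in the thread):

```python
# Sketch: when a P-1 factor is found for an exponent, drop every PRP task
# for that exponent from the worktodo list, writing no result for the
# dropped tasks. Other task types (e.g. a PFactor line) are left alone.

def drop_prp_for_exponent(lines, exponent):
    kept = []
    for line in lines:
        if line.startswith("PRP="):
            fields = line[len("PRP="):].split(",")
            if fields[3].strip() == str(exponent):  # n is the 4th field
                continue                            # factor found: discard this PRP
        kept.append(line)
    return kept

print(drop_prp_for_exponent(["PRP=AID,1,2,91408469,-1,77,0",
                             "PRP=AID,1,2,90000001,-1,77,0"], 91408469))
```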

