mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

axn 2020-03-20 03:06

[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]

Single instance or two instance combined?

Prime95 2020-03-20 05:03

[QUOTE=axn;540195]Single instance or two instance combined?[/QUOTE]

The two instance combined.

ewmayer 2020-03-20 19:11

[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]

So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs?

preda 2020-03-20 20:04

[QUOTE=ewmayer;540279]So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs?[/QUOTE]

Yes.

"git clone" only the first time, afterwards you can "git pull" in the existing dir.

"scons" can be used as an alternative to "make" (I build myself with scons), but either should work.

ewmayer 2020-03-20 20:54

[QUOTE=preda;540285]Yes.

"git clone" only the first time, afterwards you can "git pull" in the existing dir.

"scons" can be used as an alternative to "make" (I build myself with scons), but either should work.[/QUOTE]

Thanks - timing for my pair of side-by-side jobs at 5632K FFT and sclk=4 dropped from 1475 us/iter (for each job) to 1387 us/iter, 6.3% faster. So now I'm getting slightly better throughput at sclk=4 than I was before at sclk=5.

Comparing apples-to-apples at sclk=5, before was 2 jobs each @1405 us/iter, with the new build down to 1331, 5.6% faster. But sclk=4 saves 60 watts ... hmm, tough choice. I'll probably run at sclk=4 on warm days, sclk=5 otherwise and at night.

Nice work, guys! I hope to begin contributing more substantively later this year, rather than just running code and cheerleading.

preda 2020-03-20 21:46

__attribute__((overloadable)) support

I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.

The attribute is described here:
[url]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/url]

I would like confirmation that it works on these platforms:
- Windows (with whatever OpenCL driver Windows uses for AMD GPUs -- Catalyst?)
- Nvidia
- amdgpu-pro (the other AMD driver on Linux, vs. ROCm)

To check the attribute, simply add "__attribute__((overloadable))" to some function, between the return type and the function name. E.g., in gpuowl.cl replace

T2 mul(T2 a, T2 b) ...

with

T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...

Then recompile, and afterwards *run* the resulting gpuowl to exercise the OpenCL compilation that happens at startup.
Thanks!

Note: the title should read "__attribute__((overloadable))", double parens.
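
What the attribute buys, as a minimal sketch (the second overload here is hypothetical, purely to illustrate the mechanism):

[code]T2 __attribute__((overloadable)) mul(T2 a, T2 b) { ... }
T  __attribute__((overloadable)) mul(T a, T b)   { return a * b; }

// The compiler selects the overload from the argument types, so related
// functions can share one name instead of distinct names like mulT2 / mulT.
[/code]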

paulunderwood 2020-03-20 21:48

:tu:

With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each

ewmayer 2020-03-20 21:53

[QUOTE=paulunderwood;540308]:tu:

With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each[/QUOTE]

You're gonna need that extra speed - I'm a mere 60,000 GHz-days behind you in the Top500, roughly equivalent to 150 PRP-tests @5632K. :)

kracker 2020-03-23 13:10

Possible bug: -cleanup works for PRP tests but not for P-1 for me.

ewmayer 2020-03-23 22:19

Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1 ... the latter just crashed:
[code]2020-03-23 15:03:11 gfx906+sram-ecc-0 102958243 P2 2394/2880: 174743 primes; setup 4.24 s, 2.271 ms/prime
Memory access fault by GPU node-1 (Agent handle: 0x562cd3ec2150) on address 0xb5f9ee28000. Reason: Unknown.
Aborted (core dumped)[/code]
Both jobs now appear to be halted as a result ... wait, the PRP is still *running* but somehow got tripped into super-low priority - I saw the same kind of MCLK-suddenly-gets-cut-by-2/3 glitch yesterday; at the time it needed a reboot to resolve:
[code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1 37.0c 24.0W 809Mhz 351Mhz 21.96% manual 250.0W 2% 0%[/code]
Just restarted the p-1 run, it's re-doing the entire stage 2 ... MCLK now back to normal (1001 MHz). Will update re. what happens with the p-1 stage 2 retry once it finishes.

This is gpuowl v6.11-211-gca63aa9-dirty.

[b]Edit:[/b] Misspoke - p-1 stage 2 picked up at ~90% of the way through ("P2 2736/2880"), and the retry completed successfully. So it appears to have been a one-off glitch in the matrix. Also oddly, on restart of the p-1 job MCLK got reset back to normal, but SCLK, which I had manually downclocked to 4, somehow got reset not to its default level (IIRC, 7) but to 5, as reflected by wall wattage, fan noise and GPU temperature. Reset it to 4, and all appears back to normal.

kriesel 2020-03-24 00:40

[QUOTE=ewmayer;540693]Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1.[/QUOTE]I'd be skeptical about the performance advantage of running two disparate parallel runs. I've seen it reduce throughput - PRP & LL in tandem, for example, which run different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start; and if doing parallel runs, the limit will need to be lower than if the P-1 stage 2 has the GPU RAM to itself.

So can we count you as another fan of P-1 save files?
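
For what it's worth, a hypothetical invocation capping one instance's stage-2 memory (the value and its units are assumptions -- check gpuowl's -h output for the exact -maxAlloc syntax in your version):

[code]./gpuowl -maxAlloc 6000 ...
[/code]

With two parallel instances on a 16 GB Radeon VII, each cap would need to come in well under half the card's memory.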


All times are UTC. The time now is 23:09.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.