mersenneforum.org  

Old 2020-11-22, 16:04   #177
R. Gerbicz
 
 
"Robert Gerbicz"
Oct 2005
Hungary

2·7·103 Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
I have downloaded the version 7.2-13-g266aed4, and when I run a 108M test, it runs at 1250 us/it. The same test runs at 920 us/it when using v6.11-380-g79ea0cc.

I have also noticed it's not saying anything about the duration of the GEC, but I hope it is doing it.
While writing, I have noticed it looks like it's doing P-1 on a different FFT size, is that possible?

Code:
2020-11-22 09:39:53 gfx906-1 108850051 FFT: 6M 1K:12:256 (17.30 bpw)
...
2020-11-22 09:39:58 gfx906-1 108850051 P1(5.5M) 7935851 bits
2020-11-22 09:39:58 gfx906-1 108850051 PRP starting from beginning
2020-11-22 09:39:59 gfx906-1 108850051 Acquired memory lock 'memlock-1'
2020-11-22 09:39:59 gfx906-1 108850051 P1(5.5M) using 112 buffers
In that P1(5.5M) line, 5.5M is the first-stage limit of the P-1 method.
The product of the prime powers up to that limit has roughly 5.5*10^6/ln(2) = 7934822 bits.
Earlier versions did not run P-1 inside the PRP test; they ran it just before(!) the PRP test [and halted if a factor was found]. Still, it should not be that slow; maybe wait a bit longer to see more reasonable times.
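As a rough check of that estimate, here is a small Python sketch that sums log2 of the largest prime powers up to a limit and compares the result with limit/ln(2). It assumes the stage-1 exponent is exactly the product of those prime powers; the function names are made up for illustration and are not taken from GpuOwl. The log above reports 7935851 bits, so the real exponent may be built slightly differently.

```python
import math


def exact_stage1_bits(limit: int) -> float:
    """log2 of the product of the largest prime powers p^k <= limit.

    Assumption: this product (which equals lcm(1..limit)) is the stage-1
    exponent. A simple sieve of Eratosthenes finds the primes.
    """
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            # Cross out multiples of p starting at p*p.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))

    bits = 0.0
    for p in range(2, limit + 1):
        if sieve[p]:
            power = p
            while power * p <= limit:  # largest power of p not above limit
                power *= p
            bits += math.log2(power)
    return bits


def estimated_stage1_bits(limit: int) -> float:
    """The limit/ln(2) rule of thumb from the post above."""
    return limit / math.log(2)


if __name__ == "__main__":
    for limit in (1_000, 100_000):
        print(limit, exact_stage1_bits(limit), estimated_stage1_bits(limit))
```

Running it with the 5.5M limit works the same way but takes a few seconds in pure Python.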

Last fiddled with by R. Gerbicz on 2020-11-22 at 16:18 Reason: grammar
Old 2020-11-23, 02:58   #178
preda
 
 
"Mihai Preda"
Apr 2015

2·11·61 Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
Is your GPU actually a 4GB RAM card?
Old 2020-11-23, 06:17   #179
Viliam Furik
 
"Viliam Furík"
Jul 2018
Martin, Slovakia

2×191 Posts
Default

Quote:
Originally Posted by preda View Post
Is your GPU actually a 4GB RAM card?
No, it's a Radeon VII (16 GB). I should have probably mentioned it.

P-1 ended, but still, it's running at 1065 us/it, 140 higher than I'd like it to run at.
Old 2020-11-23, 09:53   #180
preda
 
 
"Mihai Preda"
Apr 2015

2476₈ Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
No, it's a Radeon VII (16 GB). I should have probably mentioned it.

P-1 ended, but still, it's running at 1065 us/it, 140 higher than I'd like it to run at.
Interesting. I haven't tried FFT 6M myself yet (I'm still on 5.5M); I probably should.
-time helps a bit with timing the kernels. Sometimes running with -time old/new and comparing may provide a hint as to what's regressed.

Another tool is to dump the ISA (using -dump) and compare using the delta.py script from tools/. Differences in occupancy usually have a performance impact.

PS: in the future please provide -maxAlloc when running P1/P2, it will speed things up for those stages.
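
One way to do the -time comparison is to pull the per-kernel "us/call" figures out of the two logs and diff them. Below is a minimal Python sketch, assuming the profile lines have the form "NN.NN% kernelName : N us/call x N calls". parse_profile and compare_profiles are hypothetical helper names, not part of GpuOwl or its tools/ directory.

```python
import re

# Matches e.g. "36.49% carryFused     :    386 us/call x 17199 calls"
PROFILE_LINE = re.compile(
    r"(\d+(?:\.\d+)?)%\s+(\w+)\s*:\s*(\d+)\s+us/call\s+x\s*(\d+)\s+calls"
)


def parse_profile(log_text: str) -> dict:
    """Return {kernel: (us_per_call, calls)} from the text of one log.

    When a kernel appears in several profile blocks, keep the entry with
    the most calls, since larger samples are more trustworthy.
    """
    best = {}
    for match in PROFILE_LINE.finditer(log_text):
        kernel = match.group(2)
        us_per_call = int(match.group(3))
        calls = int(match.group(4))
        if kernel not in best or calls > best[kernel][1]:
            best[kernel] = (us_per_call, calls)
    return best


def compare_profiles(old_log: str, new_log: str) -> list:
    """Return (kernel, old_us, new_us, delta_us) rows, biggest change first."""
    old = parse_profile(old_log)
    new = parse_profile(new_log)
    rows = []
    for kernel in sorted(old.keys() & new.keys()):
        old_us = old[kernel][0]
        new_us = new[kernel][0]
        rows.append((kernel, old_us, new_us, new_us - old_us))
    rows.sort(key=lambda row: -abs(row[3]))
    return rows
```

Pass it the text of the two logs (one from each version, same exponent and FFT size). Kernels at the top of the returned list are the first candidates to look at, keeping in mind that a large per-call change in a rarely called kernel hardly affects the iteration time.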

Last fiddled with by preda on 2020-11-23 at 09:53
Old 2020-11-23, 13:29   #181
aheeffer
 
Aug 2020

37 Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
No, it's a Radeon VII (16 GB). I should have probably mentioned it.

P-1 ended, but still, it's running at 1065 us/it, 140 higher than I'd like it to run at.
I reverted to v6.380 for my Radeon VII cards because I had to increase the FFT size with v7 (under Windows). I am currently running 110M+ exponents at 907 µs/it with power -20%, mem clock +10% and GPU clock -10%.
Old 2020-11-23, 15:12   #182
Runtime Error
 
Sep 2017
USA

2×5×23 Posts
Default GpuOwl + mprime

Can exponents started on GpuOwl be transferred and finished on mprime?

I am GPU poor. I'd like to run PRP on a 332M+ exponent until the massive P-1 has concluded, and then finish the remaining PRP on a CPU while another 332M+ exponent runs TF and PRP/P-1 on the GPU. Is that possible with the current save-file formats? Thank you.
Old 2020-11-23, 22:23   #183
preda
 
 
"Mihai Preda"
Apr 2015

2×11×61 Posts
Default

Quote:
Originally Posted by Runtime Error View Post
Can exponents started on GpuOwl be transferred and finished on mprime?

I am GPU poor. I'd like to run PRP on a 332M+ exponent until the massive P-1 has concluded, and then finish the remaining PRP on a CPU while another 332M+ exponent runs TF and PRP/P-1 on the GPU. Is that possible with the current save-file formats? Thank you.
No, that is not possible with the current savefiles; they are most probably different between GpuOwl and mprime.

OTOH mprime may be offering the merged PRP+P1 at some point in the future.
Old 2020-11-24, 17:44   #184
Viliam Furik
 
"Viliam Furík"
Jul 2018
Martin, Slovakia

2·191 Posts
Default

Quote:
Originally Posted by preda View Post
-time helps a bit with timing the kernels. Sometimes running with -time old/new and comparing may provide a hint as to what's regressed.
Code:
2020-11-24 18:39:05 GpuOwl VERSION v7.2-13-g266aed4
2020-11-24 18:39:05 config: -device 1
2020-11-24 18:39:05 config: -proof 8
2020-11-24 18:39:05 config: -nospin
2020-11-24 18:39:05 config: -maxAlloc 12288
2020-11-24 18:39:05 config: -time new
2020-11-24 18:39:05 device 1, unique id ''
2020-11-24 18:39:05 gfx906-1 108850051 FFT: 6M 1K:12:256 (17.30 bpw)
2020-11-24 18:39:05 gfx906-1 108850051 OpenCL args "-DEXP=108850051u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.62309825525553619 -DIWEIGHT_STEP_MINUS_1=-0.3838943534305243 -DIWEIGHTS={0,-0.3838943534305243,-0.24082766453041662,-0.064539274795706897,-0.42365736505765839,-0.28982409650658664,-0.12491323160025802,-0.46085410075068395,-0.33565833429543745,-0.18139069701609609,-0.49565018609731404,-0.37853446361658188,-0.23422314777169634,-0.056401114659886273,-0.41864339864529271,-0.28364583046985026,}  -cl-std=CL2.0 -cl-finite-math-only "
2020-11-24 18:39:05 gfx906-1 108850051 ASM compilation failed, retrying compilation using NO_ASM
2020-11-24 18:39:10 gfx906-1 108850051 OpenCL compilation in 4.88 s
2020-11-24 18:39:10 gfx906-1 108850051 maxAlloc: 12.0 GB
2020-11-24 18:39:11 gfx906-1 108850051 P1(5.5M) 7935851 bits
2020-11-24 18:39:11 gfx906-1 108850051 OK  23181600 on-load: blockSize 400, 1b5e1ed9eeabd073
2020-11-24 18:39:11 gfx906-1 108850051 validating proof residues for power 8
2020-11-24 18:39:16 gfx906-1 108850051 Proof using power 8
2020-11-24 18:39:17 gfx906-1 108850051 OK  23182400  21.30% b8b6300a79c5adab 1072 us/it + check 0.52s + save 0.20s; ETA 1d 01:30
2020-11-24 18:39:17 gfx906-1 108850051 32.86% carryFused     :    386 us/call x   771 calls
2020-11-24 18:39:17 gfx906-1 108850051 20.78% tailFusedSquare :    236 us/call x   799 calls
2020-11-24 18:39:17 gfx906-1 108850051 20.65% fftMiddleOut   :    226 us/call x   828 calls
2020-11-24 18:39:17 gfx906-1 108850051 19.39% fftMiddleIn    :    205 us/call x   858 calls
2020-11-24 18:39:17 gfx906-1 108850051  2.96% tailFusedMul   :    925 us/call x    29 calls
2020-11-24 18:39:17 gfx906-1 108850051  1.36% fftP           :    142 us/call x    87 calls
2020-11-24 18:39:17 gfx906-1 108850051  1.18% fftW           :    188 us/call x    57 calls
2020-11-24 18:39:17 gfx906-1 108850051  0.65% carryA         :    105 us/call x    56 calls
2020-11-24 18:39:17 gfx906-1 108850051  0.15% carryB         :     24 us/call x    57 calls
2020-11-24 18:39:17 gfx906-1 108850051  0.01% carryM         :    106 us/call x     1 calls
2020-11-24 18:39:17 gfx906-1 108850051 Total time 0.906 s
2020-11-24 18:39:25 gfx906-1 108850051     23190000  21.30% 64c7d926d1d9b5de 1066 us/it
2020-11-24 18:39:37 gfx906-1 108850051 OK  23200000  21.31% e9c259cd41928e74 1067 us/it + check 0.53s + save 0.19s; ETA 1d 01:24
2020-11-24 18:39:37 gfx906-1 108850051 36.49% carryFused     :    386 us/call x 17199 calls
2020-11-24 18:39:37 gfx906-1 108850051 22.27% tailFusedSquare :    235 us/call x 17200 calls
2020-11-24 18:39:37 gfx906-1 108850051 21.43% fftMiddleOut   :    226 us/call x 17242 calls
2020-11-24 18:39:37 gfx906-1 108850051 19.43% fftMiddleIn    :    205 us/call x 17244 calls
2020-11-24 18:39:37 gfx906-1 108850051  0.24% tailFusedMul   :   1060 us/call x    42 calls
2020-11-24 18:39:37 gfx906-1 108850051  0.07% fftP           :    280 us/call x    45 calls
2020-11-24 18:39:37 gfx906-1 108850051  0.04% fftW           :    187 us/call x    43 calls
2020-11-24 18:39:37 gfx906-1 108850051  0.02% carryA         :    105 us/call x    43 calls
2020-11-24 18:39:37 gfx906-1 108850051 Total time 18.177 s
2020-11-24 18:39:47 gfx906-1 108850051     23210000  21.32% 97594ec565fc36c5 1066 us/it
2020-11-24 18:39:58 gfx906-1 108850051     23220000  21.33% 40ad02e262e45845 1067 us/it
2020-11-24 18:40:09 gfx906-1 108850051     23230000  21.34% 64fbb613bc7be810 1067 us/it
2020-11-24 18:40:10 gfx906-1 108850051 Stopping, please wait..
Seems like "tailFusedMul" has slowed down, right?

BTW, what does this "tailFusedMul" do? It is called rarely, but it is the slowest kernel per call. If it were faster, wouldn't the whole iteration time drop noticeably?
Old 2020-11-24, 18:13   #185
moebius
 
 
Jul 2009
Germany

1000100011₂ Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
I have downloaded the version 7.2-13-g266aed4, and when I run a 108M test, it runs at 1250 us/it. The same test runs at 920 us/it when using v6.11-380-g79ea0cc.
I noticed that on the Vega 10 chip the option -carry short gives a few us/it of improvement. Maybe that will help with your VII too. In addition, Linux version 6.11.364 (if you can get it) should also be a bit faster than 6.11.380. In any case, it already supports proofs.
You also have a bit of a slowdown at tailFusedMul (maybe because of heat). Here are comparison values from a Vega 64 with 6.11.364 on Win 10 Pro:
Code:
2020-11-24 19:36:00 config:  -user geschwen
2020-11-24 19:36:00 config:  -cpu AMD_RXVega64
2020-11-24 19:36:00 config:  -carry short
2020-11-24 19:36:00 config: -time new -prp 108850051 
2020-11-24 19:36:00 device 0, unique id ''
2020-11-24 19:36:00 AMD_RXVega64 108850051 FFT: 6M 1K:12:256 (17.30 bpw)
2020-11-24 19:36:00 AMD_RXVega64 Expected maximum carry32: 2D8C0000
2020-11-24 19:36:00 AMD_RXVega64 OpenCL args "-DEXP=108850051u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=12u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0x9.f835e0484667p-4 -DIWEIGHT_STEP_MINUS_1=-0xc.48dccfa34d258p-5  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-11-24 19:36:00 AMD_RXVega64 ASM compilation failed, retrying compilation using NO_ASM
2020-11-24 19:36:03 AMD_RXVega64 OpenCL compilation in 2.61 s
2020-11-24 19:36:04 AMD_RXVega64 108850051 OK      800 loaded: blockSize 400, 2396026441e24dde
2020-11-24 19:36:04 AMD_RXVega64 validating proof residues for power 8
2020-11-24 19:36:04 AMD_RXVega64 Proof using power 8
2020-11-24 19:36:06 AMD_RXVega64 108850051 OK     1600   0.00%; 1772 us/it; ETA 2d 05:34; 7f6b9fd4a86eb8b1 (check 0.78s)
2020-11-24 19:36:06 AMD_RXVega64 33.87% carryFused     :    651 us/call x  1169 calls
2020-11-24 19:36:06 AMD_RXVega64 23.91% tailFusedSquare :    448 us/call x  1199 calls
2020-11-24 19:36:06 AMD_RXVega64 19.95% fftMiddleIn    :    356 us/call x  1259 calls
2020-11-24 19:36:06 AMD_RXVega64 16.34% fftMiddleOut   :    299 us/call x  1229 calls
2020-11-24 19:36:06 AMD_RXVega64  2.53% tailFusedMul   :   1898 us/call x    30 calls
2020-11-24 19:36:06 AMD_RXVega64  1.38% fftP           :    345 us/call x    90 calls
2020-11-24 19:36:06 AMD_RXVega64  1.00% fftW           :    375 us/call x    60 calls
2020-11-24 19:36:06 AMD_RXVega64  0.92% carryA         :    350 us/call x    59 calls
2020-11-24 19:36:06 AMD_RXVega64  0.07% carryB         :     27 us/call x    60 calls
2020-11-24 19:36:06 AMD_RXVega64  0.02% carryM         :    380 us/call x     1 calls
2020-11-24 19:36:06 AMD_RXVega64 Total time 2.246 s
2020-11-24 19:42:03 AMD_RXVega64 108850051 OK   200000   0.18%; 1794 us/it; ETA 2d 06:08; 33200cbce32214be (check 0.80s)
2020-11-24 19:42:03 AMD_RXVega64 36.76% carryFused     :    659 us/call x 197904 calls
2020-11-24 19:42:03 AMD_RXVega64 25.53% tailFusedSquare :    457 us/call x 198400 calls
2020-11-24 19:42:03 AMD_RXVega64 20.17% fftMiddleIn    :    359 us/call x 199390 calls
2020-11-24 19:42:03 AMD_RXVega64 16.93% fftMiddleOut   :    302 us/call x 198895 calls
2020-11-24 19:42:03 AMD_RXVega64  0.26% tailFusedMul   :   1847 us/call x   495 calls
2020-11-24 19:42:03 AMD_RXVega64  0.15% fftP           :    347 us/call x  1486 calls
2020-11-24 19:42:03 AMD_RXVega64  0.10% fftW           :    363 us/call x   991 calls
2020-11-24 19:42:03 AMD_RXVega64  0.10% carryA         :    352 us/call x   991 calls
2020-11-24 19:42:03 AMD_RXVega64 Total time 354.767 s
2020-11-24 19:42:39 AMD_RXVega64 Stopping, please wait..

Last fiddled with by moebius on 2020-11-24 at 18:48
Old 2020-11-24, 21:38   #186
preda
 
 
"Mihai Preda"
Apr 2015

2×11×61 Posts
Default

Quote:
Originally Posted by Viliam Furik View Post

Seems like "tailFusedMul" has slowed down, right?

BTW, what does this "tailFusedMul" do? It is called rarely, but it is the slowest kernel per call. If it were faster, wouldn't the whole iteration time drop noticeably?
"Has slowed down" relative to what? Did you compare two versions to see what the difference between them is? Let's call the two versions you compare "before" and "after", or "good" and "bad". What is the "good" and what is the "bad" situation?

For this kernel in particular, you can see that it's invoked only about 500 times while the others are called 200'000 times, so its speed does not matter! This kernel uses 0.26% of the total time, so even if you magically sped it up to take zero time, you'd still not gain half a percent overall.

Also, numbers taken over a small total number of iterations (e.g. measured over a very short real time) are not very meaningful; you want larger numbers, as you have in the second part of your timing log.
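
To put numbers on this, the sketch below (Python, not GpuOwl code; the helper names are made up for illustration) recomputes each kernel's share of the total time from "us/call x calls" and applies Amdahl's law to a kernel that became infinitely fast. The figures are copied from the second Radeon VII timing block posted earlier in the thread.

```python
# (us_per_call, calls) per kernel, copied from the second -time block
# of the Radeon VII log earlier in the thread.
RADEON_VII_PROFILE = {
    "carryFused": (386, 17199),
    "tailFusedSquare": (235, 17200),
    "fftMiddleOut": (226, 17242),
    "fftMiddleIn": (205, 17244),
    "tailFusedMul": (1060, 42),
    "fftP": (280, 45),
    "fftW": (187, 43),
    "carryA": (105, 43),
}


def time_shares(profile: dict) -> dict:
    """Fraction of total kernel time spent in each kernel."""
    total_us = sum(us * calls for us, calls in profile.values())
    return {name: us * calls / total_us for name, (us, calls) in profile.items()}


def overall_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when only one part gets faster.

    fraction is that part's share of the total time (0..1) and
    kernel_speedup is how many times faster it becomes
    (float("inf") means it takes no time at all).
    """
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


if __name__ == "__main__":
    shares = time_shares(RADEON_VII_PROFILE)
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        best_case = overall_speedup(share, float("inf"))
        print(f"{name:16s} {share * 100:6.2f}%  best-case speedup {best_case:.4f}x")
```

The kernels worth optimizing are the ones with a large share, not the ones with a large us/call figure.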

Last fiddled with by preda on 2020-11-24 at 21:39
Old 2020-12-12, 20:07   #187
ixfd64
Bemusing Prompter
 
 
"Danny"
Dec 2002
California

4474₈ Posts
Default

Has 6.x reached end-of-life, or are you going to continue updating it alongside the 7.x branch?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.