#89 |
"Composite as Heck"
Oct 2017
13×59 Posts |
Even 50% throughput from a 6800XT would be something of a win: it would still be the most viable consumer card in production, and the Infinity Cache (IC) is a big enough departure from the usual design that there is potential for further optimisation. The 6700 and lower are assumed to also have IC but at lower capacities, which is another interesting wrinkle.
There is some potential with the RTX 3000's double-duty INT32/FP32 units too, but the optimisation effort may be punishing and they're still on a less efficient node, so my expectations for that are low. IMO it's an optimisation effort that may not pay off now but could pay dividends in the future, assuming Nvidia sticks to the design. |
#90 |
"Composite as Heck"
Oct 2017
13×59 Posts |
A kind soul with a 6900XT answered my request for some gpuowl benchmarks. ASRock Phantom Gaming 6900XT at stock, Ubuntu 20.04 with kernel 5.4, ROCm 4.0. They've since wiped the install in an attempt to get ROCm's ML working (godspeed), so these are all the benchmarks we're going to get, but it's plenty to give us an idea of performance.
The 57M and 78M tests were done a few different ways (no options, -maxAlloc 14000, -carry short, -carry short -maxAlloc 14000). They all had the same timings, so I've only pasted the no-options results. -maxAlloc didn't seem to stick, or at least it didn't translate into doing P-1, but no matter. Code:
[57885161, ]
2020-12-28 17:09:56 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 17:09:56 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 17:09:56 Note: not found 'config.txt'
2020-12-28 17:09:56 config: -prp 57885161 -iters 200000
2020-12-28 17:09:56 device 0, unique id ''
2020-12-28 17:09:56 gfx1030-0 57885161 FFT: 3M 1K:6:256 (18.40 bpw)
2020-12-28 17:09:56 gfx1030-0 57885161 OpenCL args "-DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DAMDGPU=1 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0.51445938099070077 -DIWEIGHT_STEP_MINUS_1=-0.33969836857173502 -DIWEIGHTS={0,-0.33969836857173502,-0.12800351106634347,-0.42421929575738759,-0.23962212328737542,-0.49792124750469385,-0.33695316124376262,-0.12437818131180543,-0.42182548460600072,-0.23646084847019155,-0.49583385258551427,-0.33419654070262389,-0.12073777923072029,-0.41942172117280535,-0.23328643063717852,-0.49373777931154078,} -cl-std=CL2.0 -cl-finite-math-only "
2020-12-28 17:09:58 gfx1030-0 57885161 OpenCL compilation in 2.35 s
2020-12-28 17:09:58 gfx1030-0 57885161 maxAlloc: 0.0 GB
2020-12-28 17:09:58 gfx1030-0 57885161 You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
2020-12-28 17:09:58 gfx1030-0 57885161 P1(0) 0 bits
2020-12-28 17:09:58 gfx1030-0 57885161 PRP starting from beginning
2020-12-28 17:09:59 gfx1030-0 57885161 OK 0 on-load: blockSize 400, 0000000000000003
2020-12-28 17:09:59 gfx1030-0 57885161 validating proof residues for power 8
2020-12-28 17:09:59 gfx1030-0 57885161 Proof using power 8
2020-12-28 17:09:59 gfx1030-0 57885161 OK 800 0.00% 5727fe6a7225c273 459 us/it + check 0.26s + save 0.10s; ETA 07:22
2020-12-28 17:10:04 gfx1030-0 57885161 10000 0.02% 91565f36715e33e3 465 us/it
2020-12-28 17:10:08 gfx1030-0 57885161 20000 0.03% f2c610087d02c3ea 464 us/it
2020-12-28 17:10:13 gfx1030-0 57885161 30000 0.05% fe1565094c7f7b47 462 us/it
2020-12-28 17:10:17 gfx1030-0 57885161 40000 0.07% adb226c2322baa14 463 us/it
2020-12-28 17:10:22 gfx1030-0 57885161 50000 0.09% 96339cf030b79d74 464 us/it
2020-12-28 17:10:27 gfx1030-0 57885161 60000 0.10% 175901ec29adfa87 464 us/it
2020-12-28 17:10:31 gfx1030-0 57885161 70000 0.12% 7c2d3978b07c9f39 467 us/it
2020-12-28 17:10:36 gfx1030-0 57885161 80000 0.14% c2ee4a9ca385f917 464 us/it
2020-12-28 17:10:41 gfx1030-0 57885161 90000 0.16% a7a038f5438a2fa5 466 us/it
2020-12-28 17:10:45 gfx1030-0 57885161 100000 0.17% f1cbf8d474fd3237 467 us/it
2020-12-28 17:10:50 gfx1030-0 57885161 110000 0.19% 6d709cb8366f244d 464 us/it
2020-12-28 17:10:55 gfx1030-0 57885161 120000 0.21% 2172b8f3cc5b3272 465 us/it
2020-12-28 17:10:59 gfx1030-0 57885161 130000 0.22% 06fffcab14e3c81b 466 us/it
2020-12-28 17:11:04 gfx1030-0 57885161 140000 0.24% af31f96be3309024 466 us/it
2020-12-28 17:11:09 gfx1030-0 57885161 150000 0.26% f6ac00d9a2354121 465 us/it
2020-12-28 17:11:13 gfx1030-0 57885161 160000 0.28% fd84ac518a5eb59d 465 us/it
2020-12-28 17:11:18 gfx1030-0 57885161 170000 0.29% e91f9213bc5ea1a3 464 us/it
2020-12-28 17:11:23 gfx1030-0 57885161 180000 0.31% 63a2a2c5417898f9 464 us/it
2020-12-28 17:11:27 gfx1030-0 57885161 190000 0.33% 48ec91fc60cf2bde 466 us/it
2020-12-28 17:11:32 gfx1030-0 57885161 Stopping, please wait..
2020-12-28 17:11:32 gfx1030-0 57885161 OK 200000 0.35% de62d6db1ad5092d 466 us/it + check 0.27s + save 0.11s; ETA 07:28
2020-12-28 17:11:32 gfx1030-0 Exiting because "stop requested"
2020-12-28 17:11:32 gfx1030-0 Bye
Code:
[77936867, ]
2020-12-28 17:11:32 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 17:11:32 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 17:11:32 Note: not found 'config.txt'
2020-12-28 17:11:32 config: -prp 77936867 -iters 200000
2020-12-28 17:11:32 device 0, unique id ''
2020-12-28 17:11:32 gfx1030-0 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
2020-12-28 17:11:32 gfx1030-0 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0.33644726404543274 -DIWEIGHT_STEP_MINUS_1=-0.25174750481886216 -DIWEIGHTS={0,-0.25174750481886216,-0.44011820345520131,-0.16213409745771243,-0.37306474779553728,-0.061788266441989627,-0.29798072935699788,-0.47471232907613115,-0.21390437908665341,-0.41180199020062258,-0.11975874301407295,-0.3413572830988989,-0.014337887291734644,-0.26247586476052853,-0.44814572555075455,-0.17414732433395128,} -cl-std=CL2.0 -cl-finite-math-only "
2020-12-28 17:11:35 gfx1030-0 77936867 OpenCL compilation in 2.26 s
2020-12-28 17:11:35 gfx1030-0 77936867 maxAlloc: 0.0 GB
2020-12-28 17:11:35 gfx1030-0 77936867 You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
2020-12-28 17:11:35 gfx1030-0 77936867 P1(0) 0 bits
2020-12-28 17:11:35 gfx1030-0 77936867 PRP starting from beginning
2020-12-28 17:11:35 gfx1030-0 77936867 OK 0 on-load: blockSize 400, 0000000000000003
2020-12-28 17:11:35 gfx1030-0 77936867 validating proof residues for power 8
2020-12-28 17:11:35 gfx1030-0 77936867 Proof using power 8
2020-12-28 17:11:36 gfx1030-0 77936867 OK 800 0.00% 1579c241dc63eca6 613 us/it + check 0.32s + save 0.13s; ETA 13:17
2020-12-28 17:11:42 gfx1030-0 77936867 10000 0.01% fc4f135f7cf4ad29 620 us/it
2020-12-28 17:11:48 gfx1030-0 77936867 20000 0.03% 3cd1bd9d5e09cbc5 618 us/it
2020-12-28 17:11:54 gfx1030-0 77936867 30000 0.04% c4e0ff35e3290d98 620 us/it
2020-12-28 17:12:01 gfx1030-0 77936867 40000 0.05% dffe1b1b0d748128 619 us/it
2020-12-28 17:12:07 gfx1030-0 77936867 50000 0.06% 52e286945371ed29 619 us/it
2020-12-28 17:12:13 gfx1030-0 77936867 60000 0.08% 0945da4dc08bdd95 620 us/it
2020-12-28 17:12:19 gfx1030-0 77936867 70000 0.09% 7131fa4eb77f4bb2 620 us/it
2020-12-28 17:12:25 gfx1030-0 77936867 80000 0.10% 8d76071d27ee4221 621 us/it
2020-12-28 17:12:31 gfx1030-0 77936867 90000 0.12% 0bacff453b2f470e 620 us/it
2020-12-28 17:12:38 gfx1030-0 77936867 100000 0.13% 6d7296b9e2830f50 622 us/it
2020-12-28 17:12:44 gfx1030-0 77936867 110000 0.14% 8cbfd4435622bda7 622 us/it
2020-12-28 17:12:50 gfx1030-0 77936867 120000 0.15% 79ae5dad855057ad 622 us/it
2020-12-28 17:12:56 gfx1030-0 77936867 130000 0.17% 50c97bcbf876231f 621 us/it
2020-12-28 17:13:03 gfx1030-0 77936867 140000 0.18% e1db15f897271496 622 us/it
2020-12-28 17:13:09 gfx1030-0 77936867 150000 0.19% 127631386c6a9b17 622 us/it
2020-12-28 17:13:15 gfx1030-0 77936867 160000 0.21% 25b7b6206fc6f085 623 us/it
2020-12-28 17:13:21 gfx1030-0 77936867 170000 0.22% 416816b0d9f4bba8 622 us/it
2020-12-28 17:13:27 gfx1030-0 77936867 180000 0.23% 6bee5d054f770861 623 us/it
2020-12-28 17:13:34 gfx1030-0 77936867 190000 0.24% f37f068f014b18a0 621 us/it
2020-12-28 17:13:40 gfx1030-0 77936867 Stopping, please wait..
2020-12-28 17:13:40 gfx1030-0 77936867 OK 200000 0.26% f0b04b45b0855bd2 620 us/it + check 0.34s + save 0.16s; ETA 13:24
2020-12-28 17:13:40 gfx1030-0 Exiting because "stop requested"
2020-12-28 17:13:40 gfx1030-0 Bye
Code:
[332220523, ]
2020-12-28 16:57:10 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 16:57:10 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-28 16:57:10 Note: not found 'config.txt'
2020-12-28 16:57:10 config: -prp 332220523 -iters 50000
2020-12-28 16:57:10 device 0, unique id ''
2020-12-28 16:57:10 gfx1030-0 332220523 FFT: 18M 1K:9:1K (17.60 bpw)
2020-12-28 16:57:11 gfx1030-0 332220523 OpenCL args "-DEXP=332220523u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.31797529154814252 -DIWEIGHT_STEP_MINUS_1=-0.24126043453715812 -DIWEIGHTS={0,-0.24126043453715812,-0.42431427180125786,-0.12640892148665336,-0.337171884696568,-0.49708608381811953,-0.23683862754188784,-0.42095927188310595,-0.12131777912660047,-0.33330903355459196,-0.49415518582120899,-0.23239105099670423,-0.41758471958785059,-0.11619696644233313,-0.32942367036371434,-0.49120720704209719,} -cl-std=CL2.0 -cl-finite-math-only "
2020-12-28 16:57:14 gfx1030-0 332220523 OpenCL compilation in 2.89 s
2020-12-28 16:57:14 gfx1030-0 332220523 maxAlloc: 0.0 GB
2020-12-28 16:57:14 gfx1030-0 332220523 You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
2020-12-28 16:57:14 gfx1030-0 332220523 P1(0) 0 bits
2020-12-28 16:57:14 gfx1030-0 332220523 PRP starting from beginning
2020-12-28 16:57:16 gfx1030-0 332220523 OK 0 on-load: blockSize 400, 0000000000000003
2020-12-28 16:57:16 gfx1030-0 332220523 validating proof residues for power 8
2020-12-28 16:57:16 gfx1030-0 332220523 Proof using power 8
2020-12-28 16:57:21 gfx1030-0 332220523 OK 800 0.00% b950798999630b08 3954 us/it + check 1.92s + save 0.49s; ETA 15d 04:53
2020-12-28 16:57:58 gfx1030-0 332220523 10000 0.00% 503cd91d7b8e30e5 3969 us/it
2020-12-28 16:58:38 gfx1030-0 332220523 20000 0.01% f2d3ffbb3586c527 3978 us/it
2020-12-28 16:59:18 gfx1030-0 332220523 30000 0.01% e7846100baf7ce53 3977 us/it
2020-12-28 16:59:57 gfx1030-0 332220523 40000 0.01% e305c82567149969 3969 us/it
2020-12-28 17:00:37 gfx1030-0 332220523 Stopping, please wait..
2020-12-28 17:00:39 gfx1030-0 332220523 OK 50000 0.02% 72885d5ee0a11128 3974 us/it + check 1.90s + save 0.50s; ETA 15d 06:39
2020-12-28 17:00:39 gfx1030-0 Exiting because "stop requested"
2020-12-28 17:00:39 gfx1030-0 Bye
Code:
Adapter: PCI adapter
vddgfx: 1.09 V
fan1: 1760 RPM (min = 0 RPM, max = 3300 RPM)
edge: +73.0°C (crit = +100.0°C, hyst = -273.1°C) (emerg = +105.0°C)
junction: +90.0°C (crit = +110.0°C, hyst = -273.1°C) (emerg = +115.0°C)
mem: +74.0°C (crit = +100.0°C, hyst = -273.1°C) (emerg = +105.0°C) |
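As a sanity check on the logs above, the ETAs gpuowl prints can be roughly reproduced from the per-iteration time. This is only a minimal sketch, not gpuowl's own code: the helper names `eta_seconds` and `fmt` are made up for illustration, and it assumes a PRP test needs about one iteration per bit of the exponent, so the remaining work is the exponent minus the iterations already done.

```python
def eta_seconds(exponent: int, done_iters: int, us_per_it: float) -> float:
    """Estimate remaining wall time, assuming ~one iteration per exponent bit."""
    remaining = exponent - done_iters
    return remaining * us_per_it / 1_000_000


def fmt(seconds: float) -> str:
    """Format seconds as 'Dd HH:MM', truncating to whole minutes."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


# (exponent, iterations done, us/it) from the final OK line of each log above
runs = [
    (57885161, 200000, 466),
    (77936867, 200000, 620),
    (332220523, 50000, 3974),
]

for exponent, done, us in runs:
    print(exponent, fmt(eta_seconds(exponent, done, us)))
```

The estimates land within about a minute of the ETAs in the logs (07:28, 13:24 and 15d 06:39); the small differences are presumably down to the displayed us/it being rounded.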
#91 |
"Mike"
Aug 2002
24×499 Posts |
We wonder if the 6800 XT will scale down ~10% for the 72 versus 80 compute units. Everything else on the cards is identical, we think.
|
#92 |
"Composite as Heck"
Oct 2017
13×59 Posts |
If we're bandwidth limited there's a chance it might scale down better than that, although binning will pull it back the other way and probably then some.
|
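To put rough numbers on the two cases discussed above, here is a back-of-envelope sketch. It assumes the only difference between the cards is 72 versus 80 compute units (clocks and binning are ignored) and takes the 6900XT timings from the logs in post #90. The function name `scaled_us` is made up for illustration, and nothing here is a measurement of a 6800XT.

```python
CU_6900XT = 80
CU_6800XT = 72

# 6900XT us/it from the final OK line of each log in post #90
measured_us_per_it = {57885161: 466, 77936867: 620, 332220523: 3974}


def scaled_us(us_per_it: float, cu_from: int, cu_to: int) -> float:
    """Purely compute-bound case: iteration time scales inversely with CU count."""
    return us_per_it * cu_from / cu_to


for exponent, us in measured_us_per_it.items():
    compute_bound = scaled_us(us, CU_6900XT, CU_6800XT)
    bandwidth_bound = us  # same memory system assumed, so no slowdown at all
    print(f"{exponent}: between {bandwidth_bound:.0f} and {compute_bound:.0f} us/it")
```

Note that 10% fewer compute units means about 11% more time per iteration in the compute-bound case, since 80/72 is roughly 1.11.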
#93 | |
Jul 2009
Germany
547 Posts |
Quote:
post #24 |
|
#94 |
Jun 2003
4,861 Posts |
#95 | |
Jun 2003
4,861 Posts |
Quote:
1) Running two copies of gpuowl -- 128MB cache can hold two FFTs at current PRP leading edge easily.
2) Not using the GPU for display. Not sure if the benchmarks were done when the GPU was also driving display(s). That might affect the cache / performance. |
|
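The "two FFTs fit in 128MB" point can be checked roughly from the FFT sizes in the logs in post #90. The sketch below assumes each FFT word is one 8-byte double, which is an assumption here and not something stated in the thread, and it counts a single data buffer per test. A real run also keeps extra buffers and tables, so the true working set will be larger. The helper names are made up for illustration.

```python
MIB = 1024 * 1024
CACHE_MIB = 128  # cache size quoted in the post above

# exponent -> FFT length in words, read off the "FFT: 3M / 4M / 18M" log lines
ffts = {57885161: 3 * MIB, 77936867: 4 * MIB, 332220523: 18 * MIB}


def bits_per_word(exponent: int, fft_words: int) -> float:
    """Average number of exponent bits carried by each FFT word."""
    return exponent / fft_words


def buffer_mib(fft_words: int, bytes_per_word: int = 8) -> float:
    """Size of one FFT data buffer in MiB, assuming 8-byte doubles."""
    return fft_words * bytes_per_word / MIB


for exponent, words in ffts.items():
    size = buffer_mib(words)
    fits = 2 * size <= CACHE_MIB
    print(f"{exponent}: {bits_per_word(exponent, words):.2f} bpw, "
          f"{size:.0f} MiB per buffer, two copies fit in cache: {fits}")
```

The bits-per-word figures match what gpuowl reports in the logs (18.40, 18.58, 17.60), which suggests the FFT length is being read correctly. Under these assumptions two 3M or 4M buffers fit comfortably, while a single 18M buffer already exceeds 128MB.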
#96 | |
"Composite as Heck"
Oct 2017
767₁₀ Posts |
Quote:
When one of us gets a card, they should run bandwidth tests across a spectrum of RAM utilisations to determine how strong the cache really is, or test every FFT and estimate it that way. You can bet that the "1.5TB/s effective memory bandwidth" is marketing speak for "the cache has a maximum throughput of 1.5TB/s", so that's probably the upper bound at best.
Driving a display normally has a negligible performance hit; if there is any, it's within the margin of error, so it was undetectable on an R7 AFAIR. They did drive a display with the card during the tests, mostly idling on the desktop, so if there is a performance penalty the true numbers will be slightly better than what we've got. I doubt the cache has much to do with the framebuffer, but I'd be happy to be proved wrong if it gets us free performance. |
|
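For a feel of what such a bandwidth sweep might show, here is a deliberately crude toy model, not a benchmark. Everything in it is an assumption: the hit rate is taken as cache size divided by working set (capped at 1), the 1.5TB/s figure is treated as the cache's peak as suggested above, and 512GB/s is an assumed figure for the card's VRAM bandwidth. The function name `effective_bandwidth` is hypothetical.

```python
def effective_bandwidth(working_set_mb: float,
                        cache_mb: float = 128.0,
                        cache_gbps: float = 1500.0,
                        dram_gbps: float = 512.0) -> float:
    """Toy estimate of blended bandwidth in GB/s for a given working set.

    Assumes a hit fraction of cache_mb / working_set_mb (capped at 1) and
    averages the per-byte transfer time of cache hits and VRAM misses.
    """
    hit = min(1.0, cache_mb / working_set_mb)
    time_per_gb = hit / cache_gbps + (1.0 - hit) / dram_gbps
    return 1.0 / time_per_gb


for ws in (32, 64, 128, 256, 512, 1024):
    print(f"{ws:5d} MB working set -> ~{effective_bandwidth(ws):.0f} GB/s")
```

In this model the 1.5TB/s figure is only reached while the whole working set fits in cache, and the blended number falls towards the VRAM figure as the working set grows, which is the kind of curve a real sweep would confirm or refute.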
#97 | |
"Viliam Furík"
Jul 2018
Martin, Slovakia
3·131 Posts |
Quote:
But I would like to know whether gpuowl has been told to use the cache, i.e. whether it can even use the L3 cache of these GPUs. I don't recall any confirmation from Preda. However, I expect he has implemented that. Performance numbers indicate that it is indeed using the cache, but I want to be sure. |
|
#98 |
Jul 2009
Germany
547 Posts |
Code:
2020-12-30 09:20:24 config: -carry short -use CARRY32,ORIG_SLOWTRIG,IN_WG=128,IN_SIZEX=16,IN_SPACING=4,OUT_WG=128,OUT_SIZEX=16,OUT_SPACING=4 -nospin -block 100 -maxAlloc 10000 -B1 750000 -rB2 20 -prp 57885161
2020-12-30 09:20:24 device 0, unique id ''
2020-12-30 09:20:24 Tesla P100-PCIE-16GB-0 57885161 FFT: 3M 1K:6:256 (18.40 bpw)
2020-12-30 09:20:24 Tesla P100-PCIE-16GB-0 Expected maximum carry32: 42500000
2020-12-30 09:20:25 Tesla P100-PCIE-16GB-0 OpenCL args "-DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DPM1=0 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0x1.07673850f37p-1 -DIWEIGHT_STEP_MINUS_1=-0x1.5bd9e39e14a3dp-2 -DCARRY32=1 -DIN_SIZEX=16 -DIN_SPACING=4 -DIN_WG=128 -DORIG_SLOWTRIG=1 -DOUT_SIZEX=16 -DOUT_SPACING=4 -DOUT_WG=128 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0 OpenCL compilation in 2.16 s
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0 57885161 OK 0 loaded: blockSize 100, 0000000000000003
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0 validating proof residues for power 8
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0 Proof using power 8
2020-12-30 09:20:27 Tesla P100-PCIE-16GB-0 57885161 OK 200 0.00%; 531 us/it; ETA 0d 08:33; 08e8268acbd436a3 (check 0.14s)
2020-12-30 09:20:37 Tesla P100-PCIE-16GB-0 Stopping, please wait..
2020-12-30 09:20:37 Tesla P100-PCIE-16GB-0 57885161 OK 18600 0.03%; 531 us/it; ETA 0d 08:32; 5cde8a0b1e18bd84 (check 0.14s)
2020-12-30 09:20:37 Tesla P100-PCIE-16GB-0 Exiting because "stop requested"
2020-12-30 09:20:37 Tesla P100-PCIE-16GB-0 Bye |
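For a quick side-by-side with the 6900XT run in post #90 at the same exponent (57885161, 3M FFT), the per-iteration times can be turned into throughput. Treat this as a loose comparison only: the two runs use different options, and the differing log format suggests a different gpuowl build. The function name is made up for illustration.

```python
def iters_per_second(us_per_it: float) -> float:
    """Convert a microseconds-per-iteration figure into iterations per second."""
    return 1_000_000 / us_per_it


p100_us = 531      # Tesla P100, from the log above
rx6900xt_us = 466  # 6900XT, final OK line of the 57885161 log in post #90

speedup = p100_us / rx6900xt_us
print(f"P100:   {iters_per_second(p100_us):.0f} it/s")
print(f"6900XT: {iters_per_second(rx6900xt_us):.0f} it/s")
print(f"6900XT is roughly {100 * (speedup - 1):.0f}% faster on this exponent")
```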
#99 | |
"Mihai Preda"
Apr 2015
17×79 Posts |
Quote:
Separate from the caches there is the *local* memory (LDS), which is managed explicitly by the software. Last fiddled with by preda on 2021-01-05 at 10:13 |
|
Thread | Thread Starter | Forum | Replies | Last Post |
Navi (RX 5700, RX 5700XT) | M344587487 | GPU Computing | 29 | 2019-11-28 14:00 |