RDNA 3?
Does anyone have any inside info on the upcoming RDNA 3 (aka RX 7000 series)? If [URL="https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_7000_series"]all knowing wiki[/URL] is to be believed, the latest and greatest should be 2.5-3x faster than the 6950 XT, which would make it the absolute fastest PRP cruncher.
|
[QUOTE=axn;617227]Does anyone have any inside info on the upcoming RDNA 3 (aka RX 7000 series)? If [URL="https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_RX_7000_series"]all knowing wiki[/URL] is to be believed, the latest and greatest should be 2.5-3x faster than the 6950 XT, which would make it the absolute fastest PRP cruncher.[/QUOTE]
Hopefully faster than the 4y-old RadeonVII. |
[QUOTE=preda;617230]Hopefully faster than the 4y-old RadeonVII.[/QUOTE]
This is the first time I have ever heard anyone insinuate that the Radeon VII is slow! :smile: |
[QUOTE=preda;617230]Hopefully faster than the 4y-old RadeonVII.[/QUOTE]
According to the gpuowl benchmarks compiled by moebius here ([url]https://docs.google.com/spreadsheets/d/1AUFL8VC1C4KKry60mkWu7PgIrITgItkC/[/url]), 6950 XT and VII are neck-and-neck. So 7900 XTX should come in at top of that list, and 7900 XT should be at #2 or #3. *fingers crossed* |
I was surprised at how good RDNA2 remained relative to R7; I expected a bigger divergence between CDNA and RDNA as generations rolled by. The divergence probably did happen but manifests in AI/ML/whatever instead of more traditional compute. With luck the characteristics that have allowed RDNA to remain very viable for our niche remain intact. It would suck if the only reason RDNA has been good to date is that AMD didn't have the resources to optimise further for gaming by gutting compute.
For gaming it's unclear which of XT/XTX is better bang for buck (well, IMO gaming on anything beyond midrange is a waste, but YMMV). For gpuowl the XTX is almost certainly the one to go for. Less but faster cache is an interesting wrinkle; it's the only metric (that we know of) which isn't strictly an upgrade over the 6950XT. That the 80/96 MiB cache matches up with the midrange 6700/6700XT is interesting, but it may be down to them not double-stacking cache on the 7900xt/xtx (which was rumoured, and which I'm guessing might be reserved for a refresh or pro cards down the line; it might just be that smaller, faster cache performed better on average for gaming). |
[QUOTE=axn;617241]According to the gpuowl benchmarks compiled by moebius here ([url]https://docs.google.com/spreadsheets/d/1AUFL8VC1C4KKry60mkWu7PgIrITgItkC/[/url]), 6950 XT and VII are neck-and-neck. So 7900 XTX should come in at top of that list, and 7900 XT should be at #2 or #3.[/QUOTE]
Something weird with the chart. The 6950 has half the memory bandwidth and half the FP64 throughput but comes out faster than the VII? BTW, I just tuned some of my Radeon VIIs for maximum energy efficiency. Typical for 111M exponents, 150W (assuming 91% efficient power supply), I get 813 us/it. My goal is to add a used Radeon VII for just over $300 and run all Radeon VIIs at peak energy efficiency, getting slightly more throughput using less power with a roughly two-year breakeven on the used Radeon VII. |
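For anyone wanting to compare efficiency figures like these, here is a rough sketch of the arithmetic. It assumes the quoted 150 W is the draw charged against each iteration and that a PRP test takes roughly as many iterations as the exponent; the function names are made up for illustration and are not part of gpuowl or rocm-smi.

```shell
# Hypothetical helpers for back-of-envelope efficiency numbers.

# joules_per_iter WATTS US_PER_IT -> energy per iteration in joules
joules_per_iter() {
    awk -v w="$1" -v us="$2" 'BEGIN { printf "%.3f\n", w * us / 1e6 }'
}

# test_hours EXPONENT US_PER_IT -> wall-clock hours for one test,
# assuming about EXPONENT iterations (an approximation)
test_hours() {
    awk -v p="$1" -v us="$2" 'BEGIN { printf "%.1f\n", p * us / 1e6 / 3600 }'
}
```

With the numbers from the post, `joules_per_iter 150 813` prints 0.122 and `test_hours 111000000 813` prints 25.1, i.e. about 0.12 J per iteration and roughly a day per 111M test at that setting.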
[QUOTE=Prime95;617274]Something weird with the chart. The 6950 has half the memory bandwidth and half the FP64 throughput but comes out faster than the VII?.[/QUOTE]
Infinity Cache is big enough to run the entire FFT out of it, and that has much higher bandwidth. I think Radeon VIIs were severely bottlenecked on memory for PRP tests, so their increased FLOPS were ineffective. The 7900s have similar TFLOPS to the VII, but have a matching bandwidth increase for the cache, so I am expecting to see proportional improvement -- assuming the wiki numbers are in the right ballpark. |
[QUOTE=Prime95;617274]Something weird with the chart. The 6950 has half the memory bandwidth and half the FP64 throughput but comes out faster than the VII?
BTW, I just tuned some of my Radeon VIIs for maximum energy efficiency. Typical for 111M exponents, 150W (assuming 91% efficient power supply), I get 813 us/it. My goal is to add a used Radeon VII for just over $300 and run all Radeon VIIs at peak energy efficiency, getting slightly more throughput using less power with a roughly two-year breakeven on the used Radeon VII.[/QUOTE] I'm doing that as well. Running like that I can put two VIIs in each system and not worry about heat. |
[QUOTE=axn;617285]Infinity Cache is big enough to run the entire FFT out of it, and that has much higher bandwidth. [/QUOTE]
Thanks for the insight. Yes, gpuowl on Radeon VII is near or at max memory bandwidth which is why Radeon VII Pro benchmarks are not much faster than a Radeon VII.
[QUOTE=mrh;617287]I'm doing that as well. Running like that I can put two VIIs in each system and not worry about heat.[/QUOTE]
This probably belongs in a different thread.
My first observation was that sclk=2 gives the maximum iterations/watt. One can fine tune the clock speed using 'echo "s 1 XXXX" >/sys/class/drm/card2/device/pp_od_clk_voltage' where XXXX is between 1500 and 2200. Even though there is a big gap in clock speeds between sclk=1 and sclk=3, peak efficiency is near sclk=2.
Then I thought, let's maximize the clock speed for the sclk=2 voltage which uses 725mV. So then I worked on the voltage curve working up from 760mV until I found the voltage that did not produce errors.
echo "vc 1 1304 760" >/sys/class/drm/card2/device/pp_od_clk_voltage
I already had set the upper end of the voltage curve with
echo "vc 2 1801 1030" >/sys/class/drm/card2/device/pp_od_clk_voltage
Finally, find the largest XXXX value that chooses 725mV with sclk=2.
GPU example 1:
echo "vc 1 1304 770" >/sys/class/drm/card0/device/pp_od_clk_voltage
echo "vc 2 1801 1030" >/sys/class/drm/card0/device/pp_od_clk_voltage
echo "s 1 1958" >/sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" >/sys/class/drm/card0/device/pp_od_clk_voltage
/opt/rocm/bin/rocm-smi -d 1 --setsclk 2 --setfan 160
GPU example 2 (one of my better cards):
echo "vc 1 1304 760" >/sys/class/drm/card1/device/pp_od_clk_voltage
echo "vc 2 1801 1030" >/sys/class/drm/card1/device/pp_od_clk_voltage
echo "s 1 2085" >/sys/class/drm/card1/device/pp_od_clk_voltage
echo "c" >/sys/class/drm/card1/device/pp_od_clk_voltage
/opt/rocm/bin/rocm-smi -d 1 --setsclk 2 --setfan 160
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    69.0c  138.0W  1186Mhz  1201Mhz  82.75%  manual  250.0W  N/A    93%
1    66.0c  139.0W  1228Mhz  1201Mhz  80.78%  manual  250.0W  N/A    75% |
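The per-card steps above could be wrapped in a small script along these lines. This is only a sketch: `od_write`, `tune_card` and the `DRY_RUN` switch are hypothetical names, the 1304/1801/1030 values are simply copied from the examples in the post, and the rocm-smi step is left out. By default it only prints the commands rather than touching sysfs.

```shell
# Sketch only: wraps the pp_od_clk_voltage writes shown in the post.
# DRY_RUN and both function names are made up for illustration.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, anything else = really write

# od_write CARD COMMAND: send one command to that card's pp_od_clk_voltage
od_write() {
    local card="$1" cmd="$2"
    local path="/sys/class/drm/card${card}/device/pp_od_clk_voltage"
    if [ "$DRY_RUN" = "1" ]; then
        printf 'echo "%s" >%s\n' "$cmd" "$path"
    else
        echo "$cmd" >"$path"   # needs root, and the values are card-specific
    fi
}

# tune_card CARD LOW_MV SCLK_MHZ: lower curve point, upper curve point,
# the "s 1" clock value, then commit with "c"
tune_card() {
    local card="$1" low_mv="$2" sclk_mhz="$3"
    od_write "$card" "vc 1 1304 ${low_mv}"
    od_write "$card" "vc 2 1801 1030"
    od_write "$card" "s 1 ${sclk_mhz}"
    od_write "$card" "c"
}
```

In dry-run mode, `tune_card 0 770 1958` prints the four echo lines of GPU example 1, which makes it easy to eyeball the values before running with `DRY_RUN=0` as root.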
Seeing as I have been unable to get my hands on a 4090 at MSRP.... The 7900xtx will be on my list!
|
[QUOTE=Prime95;617274]Something weird with the chart. The 6950 has half the memory bandwidth and half the FP64 throughput but comes out faster than the VII?[/QUOTE]
First of all, I would like to say that it is extremely important to me that the values that I enter in the table are realistic, so I check them for plausibility if possible (e.g. by extrapolating using values of a reference card, in this case the RX 6800 XT). For the 6900 XT, for example, the following values are available: [URL="https://www.mersenneforum.org/showpost.php?p=588434&postcount=2732"]DrDerpenberg [/URL] [URL="https://www.mersenneforum.org/showpost.php?p=598925&postcount=137"]JCoveiro[/URL] I always enter the best value for a single instance, for the Radeon VII as well, to ensure a relatively fair comparison. It is quite possible that with 2 instances at the same time the Radeon VII will show up better than with one instance. I started the list because I am often suspicious of the benchmarks on mersenne.ca where a RX 5700XT performs better than a RX 6800XT, which I consider almost impossible. |
[QUOTE=Prime95;617288]Thanks for the insight. Yes, gpuowl on Radeon VII is near or at max memory bandwidth which is why Radeon VII Pro benchmarks are not much faster than a Radeon VII.
[/QUOTE] The 7900 xtx has a 3.5 TB/s "infinity cache" bandwidth. How that performance relates to GPUOWL I am not sure. I will post results as soon as I get my hands on one! [url]https://wccftech.com/amd-radeon-rx-7900-xtx-7900-xt-reference-models-by-sapphire-listed-at-amazon/[/url] |
[url]https://www.reddit.com/r/Amd/comments/zbgcb1/rocm_developer_be_prepared_to_be_pleasantly/[/url]
|
Embargo has ended as of a few hours ago. Mostly looked at Phoronix's coverage so far; it seems okay: lackluster for gaming (at least on Windows, possibly driver issues) and some glitches on the earliest supported Linux kernels, with the type of compute we're interested in TBD. Anyone buying in the next few months to use on Linux will likely need to manually add the firmware files, and everyone should commit to using the latest kernel whenever one gets released for at least a year IMO. If an early adopter is in a benchmarking mood, it would be interesting to know the installation process and results for rusticl vs ROCm's OpenCL implementation (if rusticl works for our tools at all; AFAIK no one's tested).
[url]https://www.phoronix.com/review/rx7900xt-rx7900xtx-linux[/url] If gaming performance is below par on Windows that may be good for us price-wise in the long run, assuming compute performance is solid. |
Judging by this review ([url]https://babeltechreviews.com/hellhound-rx-7900-xtx-vs-rtx-4080-50-games-vr/6/[/url]) that ran AIDA64 GPGPU, something went horribly wrong with RDNA3, as the double precision performance and effective bandwidth of the 7900XTX are both lower than those of the 6900XT and the 4090. The single precision performance is also extremely low for the CU count and clockspeed it has. Hopefully it's a driver issue, not a hardware issue, and someone with a card can benchmark it to confirm the findings.
|
Hopefully it's a driver issue, although the tested ratio of ~30.25 is suspicious as it's close to 1:32. If that's what it is, then DP is half the 1:16 we've come to expect from AMD, which would be a bummer. Other ratios are much closer to the theoretical, so 30.25 seems low even if the card's ratio is 1:32 (an optimised 1:32 card looks like it should be more like ~31.7 in practice), so it could just be a driver issue. The memory read/write being worse than the 6900xt is clearly wrong, which is another source of doubt.
I don't know about single precision performance being low, it's about 40% better than the 6900XT which is roughly how much better the 7900xtx is at games/raster. That may or may not be a relevant correlation. |
[QUOTE=M344587487;619662]Hopefully it's a driver issue, although the tested ratio of ~30.25 is suspicious as it's close to 1:32. If that's what it is, then DP is half the 1:16 we've come to expect from AMD, which would be a bummer. Other ratios are much closer to the theoretical, so 30.25 seems low even if the card's ratio is 1:32 (an optimised 1:32 card looks like it should be more like ~31.7 in practice), so it could just be a driver issue. The memory read/write being worse than the 6900xt is clearly wrong, which is another source of doubt.
I don't know about single precision performance being low, it's about 40% better than the 6900XT which is roughly how much better the 7900xtx is at games/raster. That may or may not be a relevant correlation.[/QUOTE] Wikichip has it as 1/16th. DP is 2918/2016 GFLOPS compared to 2688 of the Radeon VII. Bandwidth is 960/800 GB/s, a little under the Radeon VII's 1024 GB/s. |
[QUOTE=M344587487;619662]Hopefully it's a driver issue, although the tested ratio of ~30.25 is suspicious as it's close to 1:32. If that's what it is, then DP is half the 1:16 we've come to expect from AMD, which would be a bummer. Other ratios are much closer to the theoretical, so 30.25 seems low even if the card's ratio is 1:32 (an optimised 1:32 card looks like it should be more like ~31.7 in practice), so it could just be a driver issue. The memory read/write being worse than the 6900xt is clearly wrong, which is another source of doubt.
I don't know about single precision performance being low, it's about 40% better than the 6900XT which is roughly how much better the 7900xtx is at games/raster. That may or may not be a relevant correlation.[/QUOTE] FP32 is advertised to be 61TFLOPs peak on AMD's website, and that's definitely too low no matter how you look at it, it should've been at least faster than a 4080. As for memory read/write being slower than 6900XT it could be due to the smaller infinity cache, but most likely still due to driver issues. So far the card's not looking great but we also don't have any real world PRP or TF runs done on it. Guess we'll just have to wait and hopefully someone can pick one up and test it. |
[QUOTE=xx005fs;619676]FP32 is advertised to be 61TFLOPs peak on AMD's website, and that's definitely too low no matter how you look at it, it should've been at least faster than a 4080. [/QUOTE]
:confused: 4080 is 49 TFLOPS, so what's the problem? |
[QUOTE=xx005fs;619676]wait and hopefully someone can pick one up and test it.[/QUOTE]Was just in my local computer store. They had all sorts of customers lined up at opening to pick up a 7900 XTX. And zero stock to sell.
Loads of 4080s on the shelf though, not-selling, and RTX 30x0 and RX 6xxx too. |
Wendell at Level1Techs will often run open source software benchmarks for people.
Just ask in the forum: [url]https://forum.level1techs.com/[/url] |
Sadly, I wasn't able to pick one up. I had one in my cart and I was ready to pull the trigger, but it was a Best Buy purchase and that would mean driving 5 hours round trip to pick up the card.
|
Has anyone been able to benchmark the 7900 xtx?
|
[QUOTE=Magellan3s;620591]Has anyone been able to benchmark the 7900 xtx?[/QUOTE]If there are any [url=https://www.mersenne.ca/mfaktc.php?filter=7900+xt]mfakto[/url] or [url=https://www.mersenne.ca/cudalucas.php?filter=7900+xt]gpuowl[/url] benchmarks for 7900 (XT or XTX I don't care) I would desperately like to see them, the numbers on my charts are more-or-less made up.
|
[url]https://old.reddit.com/r/Amd/comments/zt95bg/all_of_the_internal_things_that_the_7xxx_series/[/url]
:mike: |
I don't know what to make of that thread. Particularly this quote seems misguided:
[quote]...dual SIMD is useless for some (most) applications since the added second SIMD per CU doesn't support integer ops...[/quote]
given that most applications are interested in fp AFAIK. The locked PP sounds concerning, but from what I recall we could do as we pleased with Vega 10/20, and this quote:
[quote]There is some small sliver of hope that AMD will eventually unlock the PPtables, but looking at Vega10/20, that doesn't seem likely.[/quote]
seems to contradict that. If they're wrong about that I'm not convinced they know what they're talking about.
[quote]...Also, indications are that they've moved instruction pipeline responsibilities to software, meaning you now need to carefully reorder instructions to not get pipeline stalls and/or provide hints (there's a new instruction for this specific purpose, s_delay_alu). Since many software kernels are hand-rolled in raw assembly, this is a potentially a huge pain point for developers - since this platform needs specific instructions that no other platform does....[/quote]
Does this apply to gpuowl or mfakto? I don't think .cl files are assembly or that assembly is used at all but could be wrong. |
[QUOTE=James Heinrich;620616]If there are any [url=https://www.mersenne.ca/mfaktc.php?filter=7900+xt]mfakto[/url] or [url=https://www.mersenne.ca/cudalucas.php?filter=7900+xt]gpuowl[/url] benchmarks for 7900 (XT or XTX I don't care) I would desperately like to see them, the numbers on my charts are more-or-less made up.[/QUOTE]
7900 xtx
[Code]2022-12-28 01:30:43 GpuOwl VERSION v7.2-70-g212618e
2022-12-28 01:30:43 config: log 1000
2022-12-28 01:30:43 config:
2022-12-28 01:30:43 device 0, unique id ''
2022-12-28 01:30:43 gfx1100-0 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
2022-12-28 01:30:43 gfx1100-0 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DMAX_ACCURACY=1 -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,} -cl-std=CL2.0 -cl-finite-math-only "
2022-12-28 01:30:44 gfx1100-0 77936867 OpenCL compilation in 1.07 s
2022-12-28 01:30:44 gfx1100-0 77936867 trig table : 65 points, cos 73.86 bits, sin 73.34 bits
2022-12-28 01:30:44 gfx1100-0 77936867 trig table : 257 points, cos 72.90 bits, sin 73.11 bits
2022-12-28 01:30:45 gfx1100-0 77936867 trig table : 262145 points, cos 72.03 bits, sin 72.56 bits
2022-12-28 01:30:45 gfx1100-0 77936867 maxAlloc: 0.0 GB
2022-12-28 01:30:45 gfx1100-0 77936867 You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
2022-12-28 01:30:45 gfx1100-0 77936867 P1(0) 0 bits
2022-12-28 01:30:45 gfx1100-0 77936867 PRP starting from beginning
2022-12-28 01:30:45 gfx1100-0 77936867 OK 0 on-load: blockSize 400, 0000000000000003
2022-12-28 01:30:45 gfx1100-0 77936867 validating proof residues for power 8
2022-12-28 01:30:45 gfx1100-0 77936867 Proof using power 8
2022-12-28 01:30:46 gfx1100-0 77936867 OK 800 0.00% 1579c241dc63eca6 784 us/it + check 0.36s + save 0.11s; ETA 16:58
2022-12-28 01:30:54 gfx1100-0 77936867 10000 0.01% fc4f135f7cf4ad29 785 us/it
2022-12-28 01:31:02 gfx1100-0 77936867 20000 0.03% 3cd1bd9d5e09cbc5 788 us/it
2022-12-28 01:31:09 gfx1100-0 77936867 30000 0.04% c4e0ff35e3290d98 791 us/it
2022-12-28 01:31:17 gfx1100-0 77936867 40000 0.05% dffe1b1b0d748128 793 us/it
2022-12-28 01:31:25 gfx1100-0 77936867 50000 0.06% 52e286945371ed29 793 us/it
2022-12-28 01:31:33 gfx1100-0 77936867 60000 0.08% 0945da4dc08bdd95 795 us/it
2022-12-28 01:31:41 gfx1100-0 77936867 70000 0.09% 7131fa4eb77f4bb2 795 us/it
[/code] |
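For comparing runs like this one, it can help to boil the log down to a single number. A minimal sketch, assuming the log lines carry the timing as a number followed by a separate `us/it` field as in the output above; `avg_us_per_it` is a made-up helper name, not something gpuowl provides.

```shell
# avg_us_per_it: read gpuowl log text on stdin and print the mean of
# every value that is immediately followed by a "us/it" field.
avg_us_per_it() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "us/it") { sum += $(i - 1); n++ }
    }
    END { if (n) printf "%.1f\n", sum / n }'
}
```

Something like `avg_us_per_it < gpuowl.log` would do it (the log file name is whatever your setup uses). The eight readings above (784 through 795 us/it) average to 790.5 us/it; the slow upward drift suggests the card was still warming up, so a longer run would give a fairer figure.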
[QUOTE=Magellan3s;621144]7900 xtx
[Code]2022-12-28 01:30:43 GpuOwl VERSION v7.2-70-g212618e[/code][/QUOTE] Can you try running 2 parallel instances of gpuowl (you can use 77936923 for the second instance)? Would like to see what, if any, thruput gains we can get. |
[QUOTE=axn;621149]Can you try running 2 parallel instances of gpuowl (you can use 77936923 for the second instance)? Would like to see what, if any, thruput gains we can get.[/QUOTE]
The results are from here: [url]https://mersenneforum.org/showthread.php?t=28303[/url] |
Chips&Cheese benchmarking 7900xtx:
[url]https://chipsandcheese.com/2023/01/07/microbenchmarking-amds-rdna-3-graphics-architecture/[/url] |
The ISA has been published, a nice light read at 600 pages: [url]https://gpuopen.com/rdna3-isa-guide-now-available/[/url]
|