Radeon VII (2nd gen consumer Vega GPU)
They just announced this in the CES keynote. Pertinent bullet points:
[LIST][*]7nm process[*]1TB/s memory bandwidth[*]16GB HBM2[*]Slide shows +62% OpenCL performance over Vega64, whatever that means[*]RRP of $699[*]ETA February 7th[*]60 CUs, so it looks to be a cut-down MI50[/LIST]That's over twice the memory bandwidth of a Vega64. If the bandwidth can be saturated, that's twice the performance of a Vega64 for roughly the same (current) price as 2x Vega64, at what is hopefully a much lower power consumption than 2x Vega64. Does that analysis sound about right? GW: See posts #76 and #195 for a quick-start on setting up gpuowl under Linux. |
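A quick sanity check of the "over twice the bandwidth" claim is possible with spec-sheet figures. Note the numbers below are my own assumptions, not from the keynote: Vega64 is usually quoted at 483.8 GB/s, and the Radeon VII's 4096-bit HBM2 at 2 Gbps/pin gives ~1024 GB/s.

```python
# Hypothetical sanity check of "over twice the memory bandwidth".
# Both figures are assumed spec-sheet values, not from this thread.
vega64_bw = 483.8       # GB/s: 2048-bit HBM2, ~1.89 Gbps/pin (Vega64)
radeon_vii_bw = 1024.0  # GB/s: 4096-bit HBM2, 2.0 Gbps/pin (Radeon VII)

ratio = radeon_vii_bw / vega64_bw
print(round(ratio, 2))  # → 2.12
```

So the "over twice" figure holds, assuming those spec numbers are right.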
The big display behind Lisa said 25% more performance at the same power - in what? Gaming? While it loses some CUs vs Vega, it gains enough in clock to more than offset that. I guess the question then is, how is a particular workload affected by bandwidth?
My guess: whatever efficiency benefit they got from the process, they spent on clock. So maybe they'll stick to a similar board power for the overall higher absolute performance. |
[QUOTE=mackerel;505466]The big display behind Lisa said 25% more performance at same power - in what? gaming? While it loses some CUs vs Vega, it gains in clock more than offsetting it. I guess the question then is, how is a particular workload affected by bandwidth?
My guess, whatever efficiency benefits they got from process, they spent on clock. So maybe they'll stick to similar board power for the overall higher absolute performance.[/QUOTE] I would assume 25-40% better on gaming depending on the game. As long as they kept the 1:2 rate DP of the MI50, it should be the best card on the market for PRP/LL. It's also going to be the best value indefinitely, because even if the V100 beats it, it costs so much more that it isn't worth it. |
[QUOTE=M344587487;505435]They just announced this in the CES keynote. pertinent bullet points:
[LIST][*]7nm process[*]1TB/s memory bandwidth[*]16GB HBM2[*]Slide shows +62% OpenCL performance over Vega64, whatever that means[*]RRP of $699[*]ETA February 7th[*]60 CU's, so it looks to be a cut-down MI50[/LIST]That's over twice the memory bandwidth of Vega64. If the bandwidth can be saturated that's twice the performance of a Vega64 for roughly the same (current) price as 2xVega64, at what is hopefully a much lower power consumption than 2xVega 64. Does that analysis sound about right?[/QUOTE] Interesting! Any power dissipation numbers? A dual-slot-width card? Does it require pcie 3.0? |
[QUOTE=xx005fs;505470]I would assume 25-40% better on gaming depending on the game. As long as they kept 1/2 rate DP like the MI50, it should be the best card on the market to do PRP/LL. It's also gonna be the best value indefinitely because even if the v100s beats it, they would cost so much more that makes them not worthy.[/QUOTE]
At Ars Technica they say the Vega 20 GPU is a die shrink of the Vega 10 GPU found in the Vega 64, so it's probably 1:16. |
[QUOTE=kriesel;505477]Interesting!
Any power dissipation numbers? A dual-slot-width card? Does it require pcie 3.0?[/QUOTE] Anandtech estimates 300W power. Dual width, 3 fans that exhaust heat within the case. It looks like it stands taller than the I/O bracket. I've never seen a card that requires PCIe 3.0. [URL="http://www.anandtech.com/show/13832/amd-radeon-vii-high-end-7nm-february-7th-for-699"]http://www.anandtech.com/show/13832/amd-radeon-vii-high-end-7nm-february-7th-for-699[/URL] |
[QUOTE=Mark Rose;505480]At Ars Technica they say the Vega 20 GPU is a die shrink of the Vega 10 GPU found in the Vega 64, so it's probably 1:16.[/QUOTE]
It's a die shrink of Vega indeed. However, it is the same GPU as the MI50, which has 1:2 DP capability, and unless AMD cut that feature on the consumer variant, it would be the king of LL/PRP. |
[QUOTE=kriesel;505477]...
Does it require pcie 3.0?[/QUOTE] It's a GFX9 card so no: [URL]https://github.com/RadeonOpenCompute/ROCm[/URL] [quote=ROCm git readme] As described above, GFX8 GPUs require PCIe 3.0 with PCIe atomics in order to run ROCm. In particular, the CPU and every active PCIe point between the CPU and GPU require support for PCIe 3.0 and PCIe atomics. The CPU root must indicate PCIe AtomicOp Completion capabilities and any intermediate switch must indicate PCIe AtomicOp Routing capabilities. ... Beginning with ROCm 1.8, GFX9 GPUs (such as Vega 10) no longer require PCIe atomics. We have similarly opened up more options for number of PCIe lanes. GFX9 GPUs can now be run on CPUs without PCIe atomics and on older PCIe generations, such as PCIe 2.0. This is not supported on GPUs below GFX9, e.g. GFX8 cards in the Fiji and Polaris families.[/quote] |
[QUOTE=M344587487;505503]It's a GFX9 card so no: [URL]https://github.com/RadeonOpenCompute/ROCm[/URL][/QUOTE]
PCIe 1.0 slots are limited in speed, and this affects GEC speed. It's faster with PCIe 3.0. |
[URL="https://twitter.com/RyanSmithAT/status/1083959608371175424"]https://twitter.com/RyanSmithAT/status/1083959608371175424[/URL]
"FP64 is not among the couple of features they dialed back for the consumer card." So if this is indeed true, that gets me a bit excited. :grin: |
[QUOTE=nomead;505691][URL]https://twitter.com/RyanSmithAT/status/1083959608371175424[/URL]
"FP64 is not among the couple of features they dialed back for the consumer card." So if this is indeed true, that gets me a bit excited. :grin:[/QUOTE] Fingers crossed. If it has the full 1:2 ratio, does that mean we can potentially saturate the memory at lower core clocks, or even do TF with the extra headroom at higher clocks? I wonder if it's possible to assign some CUs to gpuowl and others to mfakto - is SR-IOV (or equivalent) needed for that? I have doubts SR-IOV will make it to the consumer version. |
[QUOTE=M344587487;505694]Fingers crossed. If it has the full 1:2 ratio does that mean we can potentially saturate the memory at lower core clocks, or even do TF with the extra headroom with higher clocks? I wonder if it's possible to assign some CU's to gpuowl and others to mfakto, is SR-IOV needed for that or equivalent? I have doubts SR-IOV would make it to the consumer version.[/QUOTE]
From some guy on TWITTER !?!?!?!?!?!? From Anandtech: "on paper the new card only has a 9% compute throughput advantage. So it’s not on compute throughput where Radeon VII’s real winning charm lies" |
[QUOTE=tServo;505707]From some guy on TWITTER !?!?!?!?!?!?
[/QUOTE] "Some guy on Twitter" = Editor in Chief for Anandtech... And the quote refers to FP32 performance. Later on in the same article though, "The Vega 20 GPU does bring new compute features – particularly much higher FP64 compute throughput and new low-precision modes well-suited for neural network inferencing – but these features aren’t something consumers are likely to use." |
[url]https://techgage.com/news/radeon-vii-caps-fp64-performance/[/url]
[url]https://www.hardocp.com/article/2019/01/14/amd_radeon_vii_interview_scott_herkelman/2[/url] |
Awww... :picard: That's it then, unfortunately my interest stopped right there.
|
We'll have to see what it turns out to be. Ryan Smith specifically asked about it.
[url]https://www.reddit.com/r/Amd/comments/afu3dg/amds_radeon_vii_gpu_will_not_support_uncapped/ee1jr5k/[/url] |
[url]https://twitter.com/RyanSmithAT/status/1085680805802733568[/url]
He's got the answer back, it's 1:8 rate. |
[QUOTE=mackerel;506162][url]https://twitter.com/RyanSmithAT/status/1085680805802733568[/url]
He's got the answer back, it's 1:8 rate.[/QUOTE] Still a shame it's so crippled. |
[QUOTE=mackerel;506162][url]https://twitter.com/RyanSmithAT/status/1085680805802733568[/url]
He's got the answer back, it's 1:8 rate.[/QUOTE] Interesting, that's double the DP rate of "classic Vega" (Vega64, Vega56). While a bit disappointing compared to 1:2 DP, may still be a good improvement in PRP especially matched with the higher-bandwidth RAM. |
[QUOTE=preda;506203]Interesting, that's double the DP rate of "classic Vega" (Vega64, Vega56). While a bit disappointing compared to 1:2 DP, may still be a good improvement in PRP especially matched with the higher-bandwidth RAM.[/QUOTE]
Am I right in thinking that DP rate is the bottleneck for Vega64, with memory bandwidth a close second? Is it as simple as saying that for the R7 to roughly match 2x Vega64 throughput at the same clocks, it needed both double the DP rate and double the bandwidth (ignoring the 4-CU difference)? Any potential bottlenecks other than those two? Beyond "higher is better" I don't know how the specs translate into performance. |
[QUOTE=M344587487;506205]Am I right in thinking that DP rate is the bottleneck for Vega 64 but that memory bandwidth comes a close second? Is it as simple as saying that for R7 to roughly match 2x Vega 64 throughput at the same clocks, it needed both double DP rate and double bandwidth (ignoring 4 CU difference)? Any potential bottlenecks other than those two? Other than higher is better I don't know how the specs translate into performance.[/QUOTE]
2x would be amazing. In practice I would be very happy if I see a 50% speedup.

About memory: my impression is that the latency did not improve much, but the bandwidth doubled. To take advantage of this, better occupancy would be required (double the number of memory operations in flight), and this is not easily achievable because of other limiting resources: LDS memory and the number of registers (VGPRs), which I guess remain unchanged.

About compute: the parts that aren't DP (e.g. pointer arithmetic, other integer ops such as carry, logic) remain unchanged, and this will reduce the observed speedup.

IMO another limiting factor for GCN performance is still the compiler, after so many years: the compiler does a rather poor job at generating highly efficient code (not an easy task, I agree). OTOH the better cooling will help, and allow the card to clock higher without thermal throttling (which is a problem with the Vega64 blower cooler). |
[QUOTE=preda;506206]2x would be amazing. In practice I would be very happy if I see a 50% speedup.
About memory, it is my impression that the latency did not improve much, but the bandwidth doubled. But to take advantage of this, better occupancy would be required (double the number of memory operations in flight), and this is not easily achievable because of other limiting resources: LDS memory and nb. of registers (VGPRs) that remain unchanged I guess. About compute, the parts that aren't DP (e.g. pointer arithmetic, other integer e.g. carry, logic) remain unchanged, and this will reduce the observed speedup. IMO another limiting factor for GCN performance is still the compiler, after so many years: the compiler does a rather poor job at generating highly efficient code (not an easy task I agree). OTOH the better cooling will help, and allow the card to be higher clocked without thermal throttling (which is a problem on Vega64 blower cooler)[/QUOTE] I am procrastinating on buying a new, more powerful GPU. Do you have any plans to optimize gpuowl for large numbers? |
[QUOTE=SELROC;506210]I am procrastinating the buy a new more powerful gpu, do you have any plans to optimize gpuowl for large numbers ?[/QUOTE]
I don't have any clear optimization ideas at this stage (aside from going down the hand-assembly path, which is not realistic for me because it's a lot of work). What large numbers do you have in mind? Are you thinking of some specific optimizations? |
[QUOTE=preda;506211]I don't have any clear optimization ideas at this stage. (aside from going down the hand-assembly path, which is not realistic for me because it's a lot of work)
What large numbers do you have in mind? Do you think of some specific optimizations?[/QUOTE] The 300M to 500M exponents. A 332M exponent took 2 months of GPU work on the RX580. [url]https://www.mersenne.org/report_exponent/?exp_lo=332412937&full=1[/url] As a side note: it seems it is now assigned to someone else. |
[URL="https://www.pcgamer.com/amd-scoffs-at-rumor-its-radeon-vii-will-be-in-short-supply//"]https://www.pcgamer.com/amd-scoffs-at-rumor-its-radeon-vii-will-be-in-short-supply//[/URL]
It's truly silly season now... |
[QUOTE=M344587487;505503]It's a GFX9 card so no: [URL]https://github.com/RadeonOpenCompute/ROCm[/URL][/QUOTE]
[QUOTE=preda;506206]2x would be amazing. In practice I would be very happy if I see a 50% speedup. About memory, it is my impression that the latency did not improve much, but the bandwidth doubled. But to take advantage of this, better occupancy would be required (double the number of memory operations in flight), and this is not easily achievable because of other limiting resources: LDS memory and nb. of registers (VGPRs) that remain unchanged I guess. About compute, the parts that aren't DP (e.g. pointer arithmetic, other integer e.g. carry, logic) remain unchanged, and this will reduce the observed speedup. IMO another limiting factor for GCN performance is still the compiler, after so many years: the compiler does a rather poor job at generating highly efficient code (not an easy task I agree). OTOH the better cooling will help, and allow the card to be higher clocked without thermal throttling (which is a problem on Vega64 blower cooler)[/QUOTE] [url]https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.20-Increase-AMD-GPU-TDP[/url] |
Haven't had time to dig yet, saw at Anandtech that AMD changed their minds yet again and FP64 is now 1/4 rate.
[url]https://www.anandtech.com/show/13923/the-amd-radeon-vii-review[/url] |
I just came here to post that
[Quote]The Radeon VII graphics card was created for gamers and creators, enthusiasts and early adopters. Given the broader market Radeon VII is targeting, we were considering different levels of FP64 performance. We previously communicated that Radeon VII provides 0.88 TFLOPS (DP=1/16 SP). However based on customer interest and feedback we wanted to let you know that we have decided to increase double precision compute performance to 3.52 TFLOPS (DP=1/4SP).[/quote] |
[QUOTE=Mark Rose;507912]I just came here to post that[/QUOTE]
Just to make sure all the bases are covered: this guy says FP64 is ~1.7 TFLOPS, i.e. DP = 1/8 SP, in his gaming review, although he's probably parroting old information: [url]https://www.youtube.com/watch?v=6jP3tetYnVI[/url] I was expecting £700 but it's £650 in the UK. Tried to buy one but my bank decided I was trying to steal from myself, and now they're out of stock, so that's nice. |
I saw it in stock at Scan at £650, but as I was looking at other stores it sold out. Some places are indicating stock arriving tomorrow, so I assume shipments are ongoing. OCUK have some in stock at £800. I don't want it enough to pay a £150 premium for what looks identical to the £650 ones.
This could be the GPU to make the largest known prime not be a Mersenne. Over at PrimeGrid they just started a "do you feel lucky" project: GFN22 at a high enough level to exceed the largest known prime. The fastest GPUs so far are doing about one a day, and the code is FP64. If this card could do several units a day, that would help a lot. Edit: forget that last part. It's just been pointed out to me that that specific project can't use FP64. Regular GFN21/22 could still see a significant benefit. |
Interesting. How late in the product cycle can they make these decisions on how much FP64 to include? Is it configuration fuses on the die, microcode update, driver limitation, or what? And of course... could it be hacked afterwards :smile:
|
I'll buy 1 ( eventually ) to test but 3 thoughts:
(1) The board is going to be difficult to live with: high power requirements, and it blasts lots of heat IN THE CASE. Reviewers have noted its fans are obnoxiously loud. (2) It's impossible to get, of course. It will be interesting to see how soon AMD can alleviate this situation. Is this a result of poor 7nm yields? (3) With such impressive specs, I would have thought that it would absolutely CRUSH other boards in toe-to-toe comparison tests. It wins a lot of them, but not as many as expected, and not by as much. |
[QUOTE=nomead;507922]Is it configuration fuses on the die, microcode update, driver limitation, or what? And of course... could it be hacked afterwards :smile:[/QUOTE]
Probably the former, possibly based on "binning". |
[QUOTE=M344587487;507915]I was expecting £700 but it's £650 in the UK. Tried to buy one but my bank decided I was trying to steal from myself and now they're out of stock so that's nice.[/QUOTE]
Scan just got some more on their site from £684. Too late in the day for next-day delivery, and I'm out Saturday, so I might as well see if it's cheaper at other sites as more stock appears. |
Anyone got one yet? I'll probably pick one up eventually but as I refuse to pre-order it could be a while.
|
[QUOTE=M344587487;508890]Anyone got one yet? I'll probably pick one up eventually but as I refuse to pre-order it could be a while.[/QUOTE]
Despite AMD's promise to have a decent stock of these at launch, they have been impossible to get in North America. There is only one retailer on our continent, Newegg. I have submitted the "notify me via email" request for every one, but by the time I see the email and get to Newegg, they are long gone. I have noticed they are available in the UK. Some may say this portends good things for AMD's bottom line and competitiveness vs Nvidia, or maybe they are just having bad yields. If their rarity continues, we'll know the answer. |
The three UK retailers I use for hardware are Scan, Ebuyer and Overclockers. Scan are out of stock, Ebuyer has two cards at £700, Overclockers has 5 at £750 and 8 at £800. Scan and Overclockers are taking pre-orders at £650. Aside from Powercolor only having a 2-year warranty against everyone else's 3, the only difference between models is the brand sticker they put on the fans, which makes the fishing retailers are doing with overpriced models all the more blatant.
As far as I can tell the cards are just OK for gaming - nothing to tempt anyone but an AMD fan away from Nvidia - but they are really compelling for certain compute. I wouldn't be surprised if demand from compute customers can soak up all of the supply for a while. |
I've changed my tactics so that I get a text when Newegg emails me, so I can respond quicker when they get more stock.
It's truly bizarre to see the Newegg page for the Radeon VII with 7 identical boards from 7 different companies at the exact same price: [url]https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&IsNodeId=1&N=100007709%20601328427[/url] |
GPUOWL Incompatible with new adrenaline 19.x.x driver
Considering that gpuowl simply refuses to run on my Vega 64 with the newest Adrenalin drivers, I suspect it won't run on the Radeon VII either. We just have to wait until gpuowl runs properly on the newest drivers, or AMD fixes the OpenCL stack in their new 19.x.x drivers; then the card's full potential and stability can be tested.
Also, I reckon the Radeon VII is more like an early-access product for now. I have seen a Reddit post claiming the base plate of the heatsink is slightly concave; a washer mod plus sanding the baseplate flat with relatively fine sandpaper reportedly reduced the temperature by about 40C (see [url]https://www.reddit.com/r/Amd/comments/arrxt2/radeon_vii_how_to_drop_40c_on_your_stock_cooler/[/url] ). Hopefully later AIB models can have a perfectly flat base plate and alleviate the issue. |
[QUOTE=xx005fs;508966]Considering that GPUOWL simply refuses to run on my vega 64 for the newest adrenaline drivers, I suspect it won't run either on Radeon VII. Just gotta wait until GPUOWL runs properly on the newest drivers or AMD fix their OpenCL stuff in their new 19.x.x drivers, then the card's full potential and stability can be tested.
Also, I reckon that Radeon VII is more like an early access product for now. I have seen a Reddit post claiming the base plate of the heatsink is slightly concave, with washer mod and sanding the baseplate flat down with relatively fine sandpaper reduced the temperature by about 40C (see [url]https://www.reddit.com/r/Amd/comments/arrxt2/radeon_vii_how_to_drop_40c_on_your_stock_cooler/[/url] ), hopefully later AIB models can have a perfectly flat base plate and alleviate the issue.[/QUOTE] Are you running windoze or linux? What version of gpuowl? I am running windoze 10 with Adrenalin 2019 19.1.1 drivers and gpuowl v5.0. My board is a Vega Frontier. This is the first time I have tried the 2019 drivers, and I did notice one little difference: since the Frontier board is GPU device #0, I would usually just enter 'gpuowl -fft +3', but on this setup OpenCL puked (I can't recall the error). Adding '-d 0' made it work. Many reviewers have commented that they thought the Radeon VII launch was rushed by 1 or 2 months. The main complaint was the driver looked 'rough': lots of hiccups and crashes. |
[QUOTE=tServo;508993]Many reviewers have commented that they thought the Radeon VII launch was rushed by about 1 or 2 months. The main complaint was the driver looked 'rough': lots of hiccups and crashes.[/QUOTE]
Marketing probably determined the launch date, ready or not. Radeon 7, 7 nm, Vega 2, whatever ... 7.2.2019 :smile: Now rumours say that 7nm Ryzen chips (and perhaps some Navi products) will be launched in the summer. And again, 7.7.2019 seems to be a recurring date in these rumours. |
[QUOTE=tServo;508993]Are you running windoze or linux ? What version of gpuowl ?
I am running windoze 10 with Adrenalin 2019 19.1.1 drivers; gpuowl v5.0 My board is a Frontier vega. This is the first time I have tried 2019 drivers and I did notice one little thing different: Since the frontier board is gpu device #0, I would usually just enter 'gpuowl -fft +3' but on this setup the opencl puked ( I can't recall the error ). Adding the '-d 0' made it work. Many reviewers have commented that they thought the Radeon VII launch was rushed by about 1 or 2 months. The main complaint was the driver looked 'rough': lots of hiccups and crashes.[/QUOTE] Yes, I am using windoze. I am getting the same error even though I am giving it the device number; OpenCL still breaks, albeit I'm using gpuowl 3.8. |
[QUOTE=xx005fs;508997]yes i am using windoze. I am getting the same error even though I am giving it the device number and ocl still breaks, albeit im using gpuowl 3.8[/QUOTE]
Using a more recent version of gpuowl might be worth a shot. It's easy enough to do. I would try 5.0 first, then either 6.0 or 6.2 if 5.0 pukes. You may also want to see if AMD has posted new drivers; I think they have updated them at least twice. |
I give up trying to get a Radeon VII board.
I added "notify me when in stock" to the Radeon listings in Newegg but no luck:
I was downstairs feeding the cats when my phone buzzed to tell me they were in stock. I went upstairs and they were sold out! Total elapsed time: 4 minutes! I guess I would need to get an app on my phone and instantly jump on it. I'll bet the quantities coming in are low anyway. |
They're apparently fantastic for mining Ethereum: [url]https://cryptomining-blog.com/10628-amd-radeon-vii-gfx906-delivers-90-mhs-ethash-out-of-the-box/[/url]
Basically three times faster than the GTX 1070, which was one of the better cards previously. |
[QUOTE=Mark Rose;509425]They're apparently fantastic for mining Ethereum: [url]https://cryptomining-blog.com/10628-amd-radeon-vii-gfx906-delivers-90-mhs-ethash-out-of-the-box/[/url]
Basically three times faster than the GTX 1070, which was one of the better cards previously.[/QUOTE] I thought all that hoohah had subsided judging by the flood of mining rigs and cards available on Ebay the past couple of months. from coingecko :[url]https://www.coingecko.com/en/coins/ethereum?utm_content=ethereum&utm_medium=search_coin&utm_source=coingecko[/url] On the graph, click on the 1year or max view to see how its value has crashed. |
[QUOTE=tServo;509431]I thought all that hoohah had subsided judging by the flood of mining rigs and cards available on Ebay the past couple of months.
from coingecko :[URL]https://www.coingecko.com/en/coins/ethereum?utm_content=ethereum&utm_medium=search_coin&utm_source=coingecko[/URL] On the graph, click on the 1year or max view to see how its value has crashed.[/QUOTE] It has. There could be a resurgence in mining if crypto picks up again but right now it's enthusiasts and speculators only. |
In stock
Radeon vii is in stock now on amd.com for anyone interested in buying it. (At least in the US)
|
Also some places in EU (for example computeruniverse.net, Germany) have PowerColor cards in stock. Of course they don't say how many... but they do show how many times it's been ordered in the last 30 days. 12 for the PowerColor version, 4 for MSI (still not in stock), 1 for XFX (not in stock either), 1 for ASRock (not in stock). So maybe supply hasn't been great, but demand isn't that hot either.
|
Overclockers have Powercolor cards in stock at £650 in the UK. They've had 10+ in stock for a day now. Powercolor are the worst of the bunch as they only have a 2 year warranty compared to 3 for the rest, FWIW.
[url]https://www.overclockers.co.uk/powercolor-radeon-rx-vega-vii-16gb-hbm2-pci-express-graphics-card-gx-196-pc.html[/url] |
Stock 5M gpuowl benchmark
Radeon VII gpuowl benchmark at stock settings, these are out of the box numbers using the default performance levels. Using a daily build of Ubuntu Disco (upcoming 19.04) with 5.0.0 kernel. ROCm using upstream. Not headless, GPU was driving display which may have had a tiny impact on performance. Performance level set with rocm-smi --setsclk #, default fan speed and memory clocks. PRP on an 89M exponent using 5M FFT, latest gpuowl. Test bench is a Ryzen 1700 idling, 16GB of RAM and a gold rated PSU.
[code]
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms/it  joules_per_iteration
8           305         240             101   1802  1001  1.03   0.2472
7           298         229             97    1774  1001  1.04   0.23816
6           290         220             95    1750  1001  1.04   0.2288
5           265         197             95    1684  1001  1.06   0.20882
4           228         164             95    1547  1001  1.09   0.17876
3           191         131             95    1373  1001  1.18   0.15458
2           158         103             87    1135  1001  1.285  0.132355
1           131         82              74    809   1001  1.68   0.13776
0           122         75              69    701   1001  1.9    0.1425
[/code]Numbers can be improved by overclocking the memory and undervolting the core. Tried a basic memory overclock by setting memory OD to 19% in rocm-smi which upped mclk to 1192. At perf level 8 it was doing 0.95 ms/it but I only ran it for 5 minutes as it was hitting the 250W power cap so the figure is not very useful. An undervolt should improve efficiency considerably but I haven't figured out how to do that or change the perf level presets properly yet (amdgpu.ppfeaturemask=0xffffffff is in grub and pp_od_clk_voltage is present in /sys... however pushing "s 4 1547 1050" or whatever doesn't work, nor does --setslevel or --setmlevel in rocm-smi. Pro drivers required?). |
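For reference, the joules_per_iteration column in the table above is just the rocm-smi power reading times the time per iteration; a minimal sketch of the arithmetic:

```python
# Energy per iteration (J) = power (W) * seconds per iteration.
# The table's column uses the rocm-smi power reading, not wall power.
def joules_per_iteration(power_w, ms_per_it):
    return power_w * ms_per_it / 1000.0  # ms -> s

# Reproduce the perf_level 8 row: 240 W at 1.03 ms/it
print(round(joules_per_iteration(240, 1.03), 4))  # → 0.2472
```

The same formula reproduces every row (e.g. perf level 0: 75 W * 1.9 ms = 0.1425 J), which confirms the efficiency sweet spot sits around the middle of the table.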
Wow, impressive! LL/PRP performance crown - taken. :toot:
|
Nice! Better than I expected.
IMO the sweet-spot seems to be at --setsclk 4 (with default voltages). I think the manual undervolting may be broken, feel free to report bugs to AMD maybe they'll look into them. (e.g. to "ROCM issues"). Could you measure a FFT 4608K exponent too? (84M). [QUOTE=M344587487;511461]Radeon VII gpuowl benchmark at stock settings, these are out of the box numbers using the default performance levels. Using a daily build of Ubuntu Disco (upcoming 19.04) with 5.0.0 kernel. ROCm using upstream. Not headless, GPU was driving display which may have had a tiny impact on performance. Performance level set with rocm-smi --setsclk #, default fan speed and memory clocks. PRP on an 89M exponent using 5M FFT, latest gpuowl. Test bench is a Ryzen 1700 idling, 16GB of RAM and a gold rated PSU. [code] perf_level wall_power rocm-smi_power temp sclk mclk ms/it joules_per_iteration 8 305 240 101 1802 1001 1.03 0.2472 7 298 229 97 1774 1001 1.04 0.23816 6 290 220 95 1750 1001 1.04 0.2288 5 265 197 95 1684 1001 1.06 0.20882 4 228 164 95 1547 1001 1.09 0.17876 3 191 131 95 1373 1001 1.18 0.15458 2 158 103 87 1135 1001 1.285 0.132355 1 131 82 74 809 1001 1.68 0.13776 0 122 75 69 701 1001 1.9 0.1425 [/code]Numbers can be improved by overclocking the memory and undervolting the core. Tried a basic memory overclock by setting memory OD to 19% in rocm-smi which upped mclk to 1192. At perf level 8 it was doing 0.95 ms/it but I only ran it for 5 minutes as it was hitting the 250W power cap so the figure is not very useful. An undervolt should improve efficiency considerably but I haven't figured out how to do that or change the perf level presets properly yet (amdgpu.ppfeaturemask=0xffffffff is in grub and pp_od_clk_voltage is present in /sys... however pushing "s 4 1547 1050" or whatever doesn't work, nor does --setslevel or --setmlevel in rocm-smi. Pro drivers required?).[/QUOTE] |
[QUOTE=M344587487;511461] however pushing "s 4 1547 1050" or whatever doesn't work, nor does --setslevel or --setmlevel in rocm-smi. Pro drivers required?).[/QUOTE]
You know that you have to commit the change, I think by writing a line with a single "c" at the end? A sequence of "s ...." lines, and a single "c" at the end. |
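A minimal sketch of that stage-then-commit sequence. The sysfs path, level and voltage values here are illustrative assumptions; it needs root plus amdgpu.ppfeaturemask=0xffffffff, and the exact line format can vary by kernel version:

```python
import os

# Typical amdgpu overdrive interface location (assumed; card index may differ).
PP = "/sys/class/drm/card0/device/pp_od_clk_voltage"

def set_sclk_level(level, mhz, mv):
    """Stage one sclk level ("s <level> <MHz> <mV>") then commit ("c").
    Returns True if the writes were attempted, False if the interface
    isn't available (no amdgpu card, or not running as root)."""
    if not os.path.exists(PP) or not os.access(PP, os.W_OK):
        return False
    with open(PP, "w") as f:
        f.write(f"s {level} {mhz} {mv}\n")  # stage the change
    with open(PP, "w") as f:
        f.write("c\n")  # commit; without this final "c" the change is dropped
    return True

# e.g. try to set perf level 4 to 1547 MHz at 1050 mV (illustrative values)
applied = set_sclk_level(4, 1547, 1050)
print("applied" if applied else "overdrive interface not available")
```

The key point is the final "c" write: without it the staged "s" lines are silently discarded, which would explain the "doesn't work" symptom above.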
[QUOTE=M344587487;509441]It has. There could be a resurgence in mining if crypto picks up again but right now it's enthusiasts and speculators only.[/QUOTE]
As for crypto, that is highly unlikely. Basically every 'crypto' type 'coin' is inherently a pyramid game as soon as the coin can be obtained by mining. The only reason I can think of that all the cryptocoins haven't been rendered illegal is that, not seldom, a huge bank is behind many of the introduced cryptos (ING bank, for example, behind Ethereum), so they have a direct cash interest in swindling people by means of that cryptocoin. Most banks of course see all the criminal money that gets transferred, and criminals, in order to launder their criminal money into legal money, do not mind paying high fees and transaction percentages. Several banks have recently been fined small fines (on the order of some hundreds of millions of dollars - ING bank, for example) over possible criminal transactions, from my viewpoint at least. It's nonstop very tempting for huge financials to stay involved in criminal money, and with cryptos they seem to get away with it so far. Yet from a distance it's all drugs trade and transferring criminal money. By now the average dude on the street is slowly starting to understand this, so it's very unlikely that a specific minable crypto will emerge in the future and be a long-lived success story. |
[QUOTE=M344587487;505694]Fingers crossed. If it has the full 1:2 ratio does that mean we can potentially saturate the memory at lower core clocks, or even do TF with the extra headroom with higher clocks? I wonder if it's possible to assign some CU's to gpuowl and others to mfakto, is SR-IOV needed for that or equivalent? I have doubts SR-IOV would make it to the consumer version.[/QUOTE]
It's been some years since I coded OpenCL - back then it was impossible to run different kernels at the same time on different SIMDs on AMD GPUs. AMD claimed it was 'theoretically technically possible' to achieve this, yet they hadn't implemented that feature. Back then the GPGPU helpdesk from AMD was - let's put it politely - not very professional. Maybe someone has a more recent update on this. |
[QUOTE=diep;511491]Maybe someone has a more recent update to this info.[/QUOTE]
AMD is developing ROCm, an OpenSource OpenCL implementation on github, and the developers are very responsive to bug reports. The quality in general is good, and is improving rapidly. |
[QUOTE=preda;511492]AMD is developing ROCm, an OpenSource OpenCL implementation on github, and the developers are very responsive to bug reports. The quality in general is good, and is improving rapidly.[/QUOTE]
Yeah, all good and well, but I want to run multiple kernels at the same time and use the L1 cache to store arrays and data like I can on Nvidia.
Nowadays both manufacturers are at what, up to 80 SIMDs per card or so? Doing everything via GDDR5/HBM2 is just not a scalable option anymore. I want 1 byte of bandwidth for each 3 flops (double precision) a card delivers, and that isn't asking overly much. Part of that bandwidth can be offloaded to the L1 data caches, provided they are large enough for the number of threads (warps) you run on each SIMD. I'm not aware of AMD's numbers, but on Nvidia I need to run at least 8 warps (threads of 32 CUDA cores) at the same time on each SIMD, as otherwise it gets too slow - meaning one has to divide the available L1 data cache among at least 8 warps. That's not much L1 data cache, which is so much needed, for example, for my sieving code to sieve on the GPU k * 2^n +- 1 with n as the variable. 3000 dollars for a Titan V is simply a high price, and they can ask that much because AMD lacks features. I bought a second-hand Titan Z because it has about 1 byte of bandwidth for each 2.5 flops it delivers. |
[QUOTE=diep;511493]Yeah all good and well but i want to run multiple kernels at the same time and use the L1 cache to store arrays and data like i can do on Nvidia.
Nowadays both manufacturers have what is it already up to 80 SIMDs each card or so? Doing everything via the GDDR5/HBM2 is just not a scaleable option anymore. I want 1 byte of bandwidth for each 3 flops (double precision) it delivers and that isn't asking overly much. Part of that bandwidth can be offloaded to L1 datacaches, provided they are large enough for the number of threads (warps) you run on each SIMD. I'm not aware of AMD but at Nvidia i need to run at least 8 warps (threads of 32 cudacores) at the same time at each SIMD as otherwise it gets too slow - means one has to divide the available L1 datacache amongst at least 8 warps (threads of 32 cuda cores). That's not much L1 datacache which is so much needed for example for my sieving code to sieve on the GPU k * 2^n +- 1 with as variable n. 3000 dollar for a Titan V is high price simply - and they can ask that much because AMD lacks features. I bought second hand Titan Z - because has about 1 byte of bandwidth for each 2.5 flops it delivers.[/QUOTE] Maybe you're looking for what is called "local" memory in OpenCL, LDS (Local Data Share) in AMD GCN. It is basically an "explicitly managed" L1 memory (vs. the "cache" being "implicitly managed"), fast and relatively large. |
[QUOTE=preda;511496]Maybe you're looking for what is called "local" memory in OpenCL, LDS (Local Data Share) in AMD GCN. It is basically an "explicitly managed" L1 memory (vs. the "cache" being "implicitly managed"), fast and relatively large.[/QUOTE]
When I toyed with it on the by-now-old AMD card I have here (fast and expensive back then) it wasn't very impressive. I understood that newer AMD GPUs had less of it, or that it was harder to access - maybe I misunderstood. What size is it on this Radeon VII? And how many GPU clocks does it take on the Radeon VII to store something in LDS and retrieve it (for all 64 OpenCL cores at the same time, of course)? (Of course I'm speaking about throughput clocks, not actual latencies.) edit: And I assume the LDS isn't hosted in the register file but is actually separate storage - is this correct? |
[QUOTE=diep;511497]
And i assume then that the LDS doesn't get hosted onto the registerfile yet is actual separated storage - is this correct?[/QUOTE] Yes, LDS is a very-low-latency memory, close to the processor (L1-type); it is not "registers" in GCN parlance. |
4608K stock perf_level tests:
[code]
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K  joules_per_it
8           311         236             102   1802  1001  0.94             0.22184
7           298         226             98    1774  1001  0.94             0.21244
6           285         213             95    1750  1001  0.95             0.20235
5           264         195             95    1684  1001  0.96             0.1872
4           227         163             95    1547  1001  0.99             0.16137
3           190         129             95    1373  1001  1.07             0.13803
[/code]4608K stock --setfan tests: [code]
set_fan  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K
255      307         233             90    1802  1001  0.94
200      305         233             93    1802  1001  0.94
175      310         235             99    1802  1001  0.94
150      314         237             105   1802  1001  0.94
[/code]pp_od_clk_voltage dump:[code]
OD_SCLK:
0: 808Mhz
1: 1801Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0: 808Mhz 690mV
1: 1304Mhz 797mV
2: 1801Mhz 1081mV
OD_RANGE:
SCLK: 808Mhz 2200Mhz
MCLK: 351Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV
[/code]rocminfo dump:[code]===================== HSA System Attributes ===================== Runtime Version: 1.1 System Timestamp Freq.: 1000.000000MHz Sig.
Max Wait Duration: 18446744073709551615 (number of timestamp) Machine Model: LARGE System Endianness: LITTLE ========== HSA Agents ========== ******* Agent 1 ******* ===CPU info snip=== ******* Agent 2 ******* Name: gfx906 Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128 Queue Min Size: 4096 Queue Max Size: 131072 Queue Type: MULTI Node: 1 Device Type: GPU Cache Info: L1: 16KB Chip ID: 26287 Cacheline Size: 64 Max Clock Frequency (MHz):1802 BDFID: 2816 Compute Unit: 60 Features: KERNEL_DISPATCH Fast F16 Operation: FALSE Wavefront Size: 64 Workgroup Max Size: 1024 Workgroup Max Size Per Dimension: Dim[0]: 67109888 Dim[1]: 184550400 Dim[2]: 0 Grid Max Size: 4294967295 Waves Per CU: 40 Max Work-item Per CU: 2560 Grid Max Size per Dimension: Dim[0]: 4294967295 Dim[1]: 4294967295 Dim[2]: 4294967295 Max number Of fbarriers Per Workgroup:32 Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 16760832KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Acessible by all: FALSE Pool 2 Segment: GROUP Size: 64KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Acessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx906 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Dimension: Dim[0]: 67109888 Dim[1]: 1024 Dim[2]: 16777217 Workgroup Max Size: 1024 Grid Max Dimension: x 4294967295 y 4294967295 z 4294967295 Grid Max Size: 4294967295 FBarrier Max Size: 32 *** Done ***[/code]clinfo dump: [code]Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 2.1 AMD-APP (2833.0) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. 
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices Platform Name: AMD Accelerated Parallel Processing Number of devices: 1 Device Type: CL_DEVICE_TYPE_GPU Vendor ID: 1002h Board name: Vega 20 Device Topology: PCI[ B#11, D#0, F#0 ] Max compute units: 60 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 256 Preferred vector width char: 4 Preferred vector width short: 2 Preferred vector width int: 1 Preferred vector width long: 1 Preferred vector width float: 1 Preferred vector width double: 1 Native vector width char: 4 Native vector width short: 2 Native vector width int: 1 Native vector width long: 1 Native vector width float: 1 Native vector width double: 1 Max clock frequency: 1802Mhz Address bits: 64 Max memory allocation: 14588628172 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 16384 Max image 2D height: 16384 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 26287 Max size of kernel argument: 1024 Alignment (bits) of base address: 1024 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: Yes Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: Read/Write Cache line size: 64 Cache size: 16384 Global memory size: 17163091968 Constant buffer size: 14588628172 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 65536 Max pipe arguments: 16 Max pipe active reservations: 16 Max pipe packet size: 1703726284 Max global variable size: 14588628172 Max global variable preferred total size: 17163091968 Max read/write image args: 64 Max on device events: 1024 Queue on device max size: 8388608 Max on device queues: 1 Queue on device preferred size: 262144 SVM capabilities: Coarse grain 
buffer: Yes Fine grain buffer: Yes Fine grain system: No Atomics: No Preferred platform atomic alignment: 0 Preferred global atomic alignment: 0 Preferred local atomic alignment: 0 Kernel Preferred work group size multiple: 64 Error correction support: 0 Unified memory for Host and Device: 0 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue on Host properties: Out-of-Order: No Profiling : Yes Queue on Device properties: Out-of-Order: Yes Profiling : Yes Platform ID: 0x7fd046e90f70 Name: gfx906 Vendor: Advanced Micro Devices, Inc. Device OpenCL C version: OpenCL C 2.0 Driver version: 2833.0 (HSA1.1,LC) Profile: FULL_PROFILE Version: OpenCL 1.2 Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program[/code][QUOTE=preda;511488]You know that you have to commit the change, I think by writing a line with a single "c" at the end? A sequence of "s ...." lines, and a single "c" at the end.[/QUOTE] The problem is that the lines wouldn't send to pp_od_clk_voltage so there was nothing to push to the card, something along the lines of "bash echo error, invalid input". Performance level was set to manual and pushing as sudo and root was tried. I'm probably doing something silly, will give it another go later. [QUOTE=diep;511491]It's been some years that i coded OpenCL - back then it was impossible to run different kernels at the same time at different SIMDs at AMD gpu's. 
...[/QUOTE] SR-IOV is what servers use to allow multiple VMs to use a single PCIe device simultaneously. It's a pro feature that I don't think is enabled on this card. I don't know if resources can be diced up to multiple programs in a non-VM fashion. If there's any other tests or info you want dumped let me know. I'm going to do some gpuowl memory OD tests, then try undervolting again, then if undervolting works undervolt+memory OD. Might as well test mfakto too, what would be a good test for that? |
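For anyone reproducing the tables above: the joules_per_it column is just the rocm-smi power reading multiplied by the time per iteration. A minimal sketch (helper name is mine):

```python
def joules_per_iteration(power_watts: float, ms_per_it: float) -> float:
    """Energy per gpuowl iteration: watts times seconds per iteration."""
    return power_watts * (ms_per_it / 1000)

# First row of the stock perf_level table: 236 W (rocm-smi) at 0.94 ms/it
print(joules_per_iteration(236, 0.94))  # ~0.22184, matching the table
```

Note it uses the rocm-smi reading, not wall power, so the real at-the-wall energy per iteration is somewhat higher.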
[QUOTE]
If there's any other tests or info you want dumped let me know. I'm going to do some gpuowl memory OD tests, then try undervolting again, then if undervolting works undervolt+memory OD. Might as well test mfakto too, what would be a good test for that?[/QUOTE] Much appreciated - yes, I'd like to know the size of the LDS and how many clocks it takes to store to and retrieve from it - I assume the 16KB L1 it shows is the instruction cache. Knowing the timing is crucial, because suddenly losing more than 1 clock of throughput to using the LDS makes it worthless, as that potentially throws away a factor of 2 in GPU performance. Also important is whether it can store/retrieve the data for all 64 units in each SIMD simultaneously in that 1 clock - otherwise that would again result in slow latencies to the LDS. |
[QUOTE=preda;511500]Yes, LDS is a very-low-latency memory, close to the processor (L1-type); it is not "registers" in GCN parlance.[/QUOTE]
Can you manually read and write into that LDS yourself? About a year and a half ago, when I had to decide which GPU to buy, there were some cheap GCN-generation GPUs from AMD on eBay delivering around 1.45T double precision. The reason I didn't buy them was that, when I checked the AMD documentation, it said the GPU did have a fast L1 cache but that you could not manually read/write to it - the card used it itself to cache device memory in a clever manner. So I bought the Titan Z instead. |
[QUOTE=M344587487;511461]Radeon VII gpuowl benchmark at stock settings, these are out of the box numbers using the default performance levels. Using a daily build of Ubuntu Disco (upcoming 19.04) with 5.0.0 kernel. ROCm using upstream. Not headless, GPU was driving display which may have had a tiny impact on performance. Performance level set with rocm-smi --setsclk #, default fan speed and memory clocks. PRP on an 89M exponent using 5M FFT, latest gpuowl. Test bench is a Ryzen 1700 idling, 16GB of RAM and a gold rated PSU.
[code]
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms/it  joules_per_iteration
8           305         240             101   1802  1001  1.03   0.2472
7           298         229             97    1774  1001  1.04   0.23816
6           290         220             95    1750  1001  1.04   0.2288
5           265         197             95    1684  1001  1.06   0.20882
4           228         164             95    1547  1001  1.09   0.17876
3           191         131             95    1373  1001  1.18   0.15458
2           158         103             87    1135  1001  1.285  0.132355
1           131         82              74    809   1001  1.68   0.13776
0           122         75              69    701   1001  1.9    0.1425
[/code]Numbers can be improved by overclocking the memory and undervolting the core. Tried a basic memory overclock by setting memory OD to 19% in rocm-smi which upped mclk to 1192. At perf level 8 it was doing 0.95 ms/it but I only ran it for 5 minutes as it was hitting the 250W power cap so the figure is not very useful. An undervolt should improve efficiency considerably but I haven't figured out how to do that or change the perf level presets properly yet (amdgpu.ppfeaturemask=0xffffffff is in grub and pp_od_clk_voltage is present in /sys... however pushing "s 4 1547 1050" or whatever doesn't work, nor does --setslevel or --setmlevel in rocm-smi. Pro drivers required?).[/QUOTE] Impressive - it is faster than a Titan V with its HBM overclocked as high as it goes, by around 10% even at stock. But it is not the improvement I was hoping for: the thread discussing gpuowl's memory use suggests the RVII should be nowhere near memory-limited given its DP capability (0.8 TFLOPS for Vega 64 vs 3.4 TFLOPS), and so it should be a great deal faster than a Vega 64. So maybe it is still memory-limited, and overclocking the HBM would greatly help? |
[B]4608K mem OD 19% perf_level test:[/B]
Used --setfan 200 as I was paranoid that the fans wouldn't ramp properly with memory temps. perf_level 8 omitted as it was hitting the power cap. The power cap can be changed with "--setpoweroverdrive WATTS" but there's no point until I figure out undervolting IMO. [code]
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K  joules_per_it
7           319         244             97    1774  1192  0.86             0.20984
6           306         235             95    1750  1192  0.87             0.20445
5           280         211             85    1684  1192  0.88             0.18568
4           236         173             73    1547  1192  0.92             0.15916
3           198         140             61    1373  1192  0.98             0.1372
[/code][QUOTE=diep;511504]Much appreciated - yes i like to know the size of the LDS and how many clocks it is to store and retrieve from it - as i assume the 16KB L1 it shows is the instruction cache. Knowing timing is crucial because losing suddenly more than 1 clock throughput to using the LDS makes it worthless as that throws away factor 2 performance of the GPU potentially. And also whether it can store/retrieve simultaneously for all 64 units in each SIMD the data in that 1 clock. Because that would result again in slow latencies to the LDS.[/QUOTE] I don't know where to read that info from or how to determine it with a minimal benchmark; if anyone knows how, please guide me. [QUOTE=xx005fs;511535]Impressive, it is faster than a Titan V with HBM overclocked to as high as it goes, by like around 10% even at stock. But it is not the improvement I was hoping for as the thread discussing about memory use for GPUOWL should means that RVII is no where near memory limited with its DP capability and should be a great deal faster than Vega 64 (0.8TFLOP vs 3.4TFLOP). So maybe it is still memory limited and that overclocking HBM would greatly help?[/QUOTE] 1200MHz is the default limit for mclk overclocks; I'm not sure if that barrier can be broken, or by how much. The above numbers should improve in efficiency with an undervolt, which would also allow a little more thermal headroom for a faster clock, albeit sacrificing efficiency.
Throughput is not going to improve massively from here with the stock cooler; some people have pushed the clocks to ~2200MHz with water cooling, but I doubt gpuowl would be stable there. There's also potential to mod the card to improve temps: the washer mod, which improves die contact with the heatsink, has been reported to give decent thermal gains, although the amount varies. I don't know if I have the stones to potentially crack the die on a day-old £650 card, but we'll see. A few other notes: [LIST][*]There's an option to change the PCIe level from 308 MHz to 80 MHz with rocm-smi --setpclk 0, but it doesn't stick - as soon as the device gets under any sort of load it goes to and stays at level 1. Maybe level 1's speed can be reduced to save a little juice (if it saves any juice), as PCIe speed shouldn't matter; looks like that's a possibility in 5.2: [URL]https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-SMU-Export-More[/URL][*]Kernel 5.0.0 has a target temp of 95 but not a max temp, which should be 105 (fixed in 5.1) - be careful when overclocking[*]gpuowl needs "-device 0" to be passed, otherwise 'Exiting because "No OpenCL device"'. Probably a change I missed; I could swear it used to work with no args[*]I think I've found the answer to undervolting and proper clock setting - it works a little differently on Vega 20, with a voltage curve of sorts. Testing that is for another day: [URL]https://www.reddit.com/r/linux_gaming/comments/au7m3x/radeon_vii_on_linux_overclocking_undervolting/[/URL][/LIST] |
[QUOTE=M344587487;511539][B]4608K mem OD 19% perf_level test:[/B]
Used --setfan 200 as I was paranoid that the fans wouldn't ramp properly with memory temps. perf_level 8 omitted as it was hitting the power cap. The power cap can be changed with "--setpoweroverdrive WATTS" but there's no point until I figure out undervolting IMO. [code]
perf_level  wall_power  rocm-smi_power  temp  sclk  mclk  ms_per_it_4608K  joules_per_it
7           319         244             97    1774  1192  0.86             0.20984
6           306         235             95    1750  1192  0.87             0.20445
5           280         211             85    1684  1192  0.88             0.18568
4           236         173             73    1547  1192  0.92             0.15916
3           198         140             61    1373  1192  0.98             0.1372
[/code]I don't know where to read that info from or how to determine it with a minimal benchmark, if anyone knows how please guide me. 1200MHz is the default limit for mclk overclocks, not sure if that barrier can be broken or by how much. The above numbers should improve in efficiency with an undervolt and it would also allow for a little more thermal headroom for a faster clock albeit sacrificing efficiency. The throughput is not going to improve massively from here with the stock cooler, some people have pushed the clocks to ~2200MHz with water cooling but I doubt gpuowl would be stable. There's also potential to mod the card to improve temps, the washer mod to improve die contact with the heatsink has been reported to give decent thermal gains although the amount varies. I don't know if I have the stones to potentially crack the die on a day old £650 card but we'll see. A few other notes: [LIST][*]There's an option to change PCIe level from 308 MHz to 80 MHz with rocm-smi --setpclk 0 but it doesn't stick, as soon as the device gets under any sort of load it goes to and stays at level 1.
Maybe level 1's speed can be reduced to save a little juice (if it saves any juice) as PCIe speed shouldn't matter, looks like that's a possibility in 5.2: [URL]https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-SMU-Export-More[/URL][*]Kernel 5.0.0 has a target temp of 95 but not a max temp which should be 105 (fixed in 5.1), be careful when overclocking[*]openowl needs "-device 0" to be passed otherwise 'Exiting because "No OpenCL device"'. Probably a change I missed, I could swear it used to work with no args[*]I think I've found the answer to undervolting and proper clock setting, it works a little differently on Vega 20 with a voltage curve of sorts. Testing that is for another day: [URL]https://www.reddit.com/r/linux_gaming/comments/au7m3x/radeon_vii_on_linux_overclocking_undervolting/[/URL][/LIST][/QUOTE] I think I saw a post on Reddit where someone realized the vapor-chamber coldplate was really convex; after sanding it flat and polishing it, the temperature greatly improved. I reckon that would be less risky than the washer method, as you aren't really adding any mounting pressure on the die? [url]https://www.reddit.com/r/Amd/comments/arrxt2/radeon_vii_how_to_drop_40c_on_your_stock_cooler/[/url] Speaking of the memory frequency, I am pretty sure that if you manage to figure out how to edit the PowerPlay table, it can go above 1200MHz and be set to whatever you want, and you will also have more flexibility controlling the frequency and voltage than with the curve Wattman provides. |
[QUOTE=xx005fs;511558]I am pretty sure that if you managed to figure out how to edit the power play table, it can go above 1200MHz and be set to whatever you want, and also you will have more flexibility controlling the frequency and voltage rather than using the curve Wattman provides.[/QUOTE]
How will this help you? What are you trying to achieve? |
M344587487 - you can ignore the question about benchmarking it, if what I read in some documentation happens to be true - namely that you cannot manually read/write the LDS.
It's only a relevant question when you can read/write to/from it. Much of AMD's documentation seems to be written by the latest helpdesk hire - someone who has yet to learn what a GPU looks like - and seemingly they don't have the actual hardware on hand where the documentation gets written, to verify that what they write is true. Which is why information from forums is critical. I lost months on some of the Nvidia architectures - wasted time simply because I didn't realize how bad the Fermi architecture is at prefetching GDDR5 compared to Kepler and newer generations. Arguably it's also possible it wasn't the prefetching: profiting more from running multiple warps at the same time could equally explain the performance penalty of Fermi versus Kepler. Yet in that case the solution for sieving on the Kepler generation and the 900 and 1000 series Nvidia cards is much simpler than for Fermi, and I lost months figuring that one out. So having the correct throughput latencies of features you can potentially use for a speedup is very important. However, if the programmer can't use them, then they're useless to benchmark. Generally speaking, AMD and Nvidia suck big time at giving correct documentation to programmers - things like throughput latencies and how the execution of threads/warps (or whatever you want to call them) actually works on the SIMDs. In fact, for the latest Nvidia architecture they don't even reveal how many execution units it has for, say, integer multiplication. How can these manufacturers expect lots of good programs for their GPUs while giving away so little information? |
[QUOTE=xx005fs;511535]Impressive, it is faster than a Titan V with HBM overclocked to as high as it goes, by like around 10% even at stock. But it is not the improvement I was hoping for as the thread discussing about memory use for GPUOWL should means that RVII is no where near memory limited with its DP capability and should be a great deal faster than Vega 64 (0.8TFLOP vs 3.4TFLOP). So maybe it is still memory limited and that overclocking HBM would greatly help?[/QUOTE]
Out of curiosity - roughly how many FP64 instructions (so not the marketing times-two number) get executed in a single iteration of this test? 3.4 TFLOPS means it can execute 1.7T FP64 FMA instructions per second. Parallelizing the entire transform over the 60 compute units makes it really problematic to get a high percentage out of that 1.7T. Parallelizing supercomputers and processes on CPUs, even with complex game-tree-search algorithms (where a parallel algorithm is needed just to keep the task parallelized, as searches nonstop start and stop and get started based upon the result of a previous search), took 40 pages of A4 - and despite that I usually got nearly 100% scaling and over 50% speedup on hundreds of CPUs. Compare that to Deep Blue, which was more like under 1% there, though they claimed 3% in a later publication. Getting over 90% is total wishful thinking on GPUs with 60 compute units. What you can do is run a couple dozen of those 89M tests at the same time, log the iteration times after a couple of minutes, calculate the average iteration time across all of them, and draw conclusions based upon that. Start by running 2, then 4, etc. |
[QUOTE=chalsall;511562]How will this help you?
What are you trying to achieve?[/QUOTE] Undervolting and overclocking, though not for GIMPS. It's common practice for miners to drop voltages below what AMD allows in Wattman (by changing the voltage for other P-states and such), and for overclockers who just want to max out the hardware by changing the max power limit and voltage (and obviously the clock speed too). |
Undervolting
Got undervolting working with that link. It turns out every R7 GPU is tested and ships with a different stock voltage, probably indicating how well you've done in the silicon lottery. Because of this, undervolting works but probably won't yield as much of a benefit as undervolting a Vega 56/64 does (which I believe had a flat, insanely high stock voltage of 1.2V). My stock voltage is 1081mV; other people have reported stock voltages of 1127mV, 1133mV and 1040mV.
The way the voltage is set is on a curve. Here's my stock curve:[code]
OD_VDDC_CURVE:
0: 808Mhz 690mV
1: 1304Mhz 797mV
2: 1801Mhz 1081mV
[/code]For simplicity (because there was weirdness with a phantom OD % erroneously being applied when trying to set a lower max frequency) I'm not touching the perf_level sclks and am keeping the max frequency at 1801. Other than setting the mclk to 1200 (which reads as 1201 in rocm-smi and which you push with "m 1 1200"), the only thing we change is the undervolt for perf_level 8. To undervolt this way you simply push "vc 2 1801 $voltage_mV" to pp_od_clk_voltage, push a dummy clock change so that the undervolt applies ("s 1 1801"), push "c" to apply, then pick the desired perf_level with rocm-smi --setsclk #. A perf_level < 8 then gets some voltage less than or equal to the one pushed. Because the card seems to have trouble sticking to the target temp of 95 when the memory is overclocked, these tests set the fan speed manually so that the temp stays around 95. [B]4608K mclk=1200 perf_level=5 undervolt test:[/B] [code]
undervolt_setting  perf_level  wall_power  rocm-smi_power  --setfan  ms_per_it_4608K  joules_per_it
vc 2 1801 1081     5           282         213             160       0.88             0.18744
vc 2 1801 1030     5           270         200             145       0.88             0.176
vc 2 1801 1020     5           270         200             135       0.88             0.176
vc 2 1801 1015     5           268         197             125       0.88             0.17336
vc 2 1801 1010     errors within minutes. Previous result had no errors within half an hour but it needs an overnight stability test and then gets backed off to 1020 if it passes
[/code][B]4608K mclk=1200 perf_level=4 undervolt test:[/B] [code]
undervolt_setting  perf_level  wall_power  rocm-smi_power  --setfan  ms_per_it_4608K  joules_per_it
vc 2 1801 1081     4           243         175             110       0.92             0.161
vc 2 1801 1030     4           237         171             105       0.92             0.15732
vc 2 1801 1020     4           236         169             100       0.92             0.15548
vc 2 1801 1010     4           234         166             97        0.92             0.15272
vc 2 1801 1000     4           233         166             95        0.92             0.15272
vc 2 1801 990      errors within minutes.
Previous result had no errors within half an hour but it needs an overnight stability test, and then gets backed off to 1010 if it passes [/code][QUOTE=xx005fs;511558]I thought I have seen a post on Reddit that someone realized the vapor chamber coldplate was really convex, so after sanding it flat and polished the temperature greatly improved. I reckon that would be less risky than the washer method as you aren't really adding any mounting pressure on the die? [URL]https://www.reddit.com/r/Amd/comments/arrxt2/radeon_vii_how_to_drop_40c_on_your_stock_cooler/[/URL][/QUOTE] That's not a bad option; I should probably buy the stuff to do it even if I end up using it for something else. [QUOTE=xx005fs;511558]Speaking about the memory frequency, I am pretty sure that if you managed to figure out how to edit the power play table, it can go above 1200MHz and be set to whatever you want, and also you will have more flexibility controlling the frequency and voltage rather than using the curve Wattman provides.[/QUOTE] I've done manual editing of the PowerPlay table on a Vega 56 and it's a pain, at least for core clocks/voltages: you need to work within certain constraints, like strictly descending clocks/voltages, and the Vega 56 had some quirks which may or may not be present on the R7. This ( [URL]https://github.com/sibradzic/upp[/URL] ) looks like a good way to do it, but I haven't tested it. It does look like it's needed to go beyond 1200MHz; pushing "m 1 1201" to pp_od_clk_voltage complains and doesn't stick. Out of curiosity I just tested running two instances of gpuowl at the same time and, surprisingly, it worked. More surprisingly, it increased throughput and was more efficient to boot. Power use increased compared to a single instance at the same settings, but the throughput more than made up for it.
I only have time for one data point now, the rest will have to wait until tomorrow: [code]
undervolt_setting  workers  perf_level  wall_power  rocm-smi_power  --setfan  ms_per_it_for_each_worker  effective_ms_per_it_4608K  joules_per_it
vc 2 1801 1030     2        4           249         180             115       1.69                       0.845                      0.1521
[/code]Is this possible with Vega 10 and other cards too and I've just been oblivious? |
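To spell out how the derived columns in that data point are computed (the helper below is mine): the effective iteration time is the per-worker time divided by the number of workers, and energy per iteration uses that effective time:

```python
def multi_worker_stats(ms_per_it_each: float, workers: int, power_watts: float):
    """Effective throughput and energy per iteration when several
    identical gpuowl workers share one GPU."""
    effective_ms = ms_per_it_each / workers
    joules_per_it = power_watts * (effective_ms / 1000)
    return effective_ms, joules_per_it

# Two workers at 1.69 ms/it each, 180 W reported by rocm-smi
print(multi_worker_stats(1.69, 2, 180))  # ~(0.845, 0.1521)
```

So two workers at 1.69 ms/it each beat the single-worker 0.86 ms/it at the same perf level, both in throughput and in joules per iteration.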
I realized that the Vega series cards are very temperature sensitive, especially the HBM memory. Previously, I used the stock air cooler on my Vega 56 and I managed 1600MHz at 1050mV (actual load voltage is more like 1000mV), and HBM was way down at 1060MHz (for gaming that was fine, but for gpuowl I had to drop to 1020MHz). On water cooling, however, I managed nearly 1650MHz at the same core voltage and 1150MHz on the HBM (with gpuowl dropping to 1080-1100MHz for stability). I don't know how the R7 behaves when temperature decreases, but it's certainly interesting (so far I have been seeing good results online with custom water cooling).
As a side note, I think the person on Reddit used 2000-grit polishing gel to polish the surface at the end, and I feel that step is unnecessary, as lapping is usually just meant to even out the surface. As long as it's not too rough (say, 200 grit with very noticeable grooves) it should be fine, and the thermal grease will just fill it in. |
[QUOTE=M344587487;511625]Is this possible with Vega 10 and other cards too and I've just been oblivious?[/QUOTE]A single datapoint here: There's a slight 2.6% advantage in gpuowl 5.0-9c13870 on RX480 in running two instances on Windows.
|
Results
The results are in. Two instances is the limit for simultaneous execution, a third compiles the kernel but doesn't seem to execute. Undervolted by finding the minimum voltage that produced no errors in half an hour, then backed off the voltage by 10mV for safety. If that turns out to be unstable when tested properly all it means is that a few watts get added to the below figures.
[B]Best results:[/B][code]
target                workers  mclk  sclk  4608K_combined_throughput_ms_it  5M_combined_throughput_ms_it  rocm-smi_power_after_undervolt_YMMV
efficient_throughput  2        1201  1547  0.845                            0.95                          176
quick_single_test     1        1201  1802  0.86                             0.95                          232
[/code]That's ~443 4608K exponents per year at ~3.47kWh per exponent, or ~356 5M tests per year at ~4.32kWh per exponent using the rocm-smi figures and tuned to my card. Here's an approximation of the script I'm using to init the card: [code]
#!/bin/bash
if [ "$EUID" -ne 0 ]; then echo "Radeon VII init script needs to be executed as root" && exit; fi

#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level

#Undervolt by setting max voltage
# V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage

#Overclock mclk to 1200
echo "m 1 1200" >/sys/class/drm/card0/device/pp_od_clk_voltage

#Push a dummy sclk change for the undervolt to stick
echo "s 1 1801" >/sys/class/drm/card0/device/pp_od_clk_voltage

#Push everything to the card
echo "c" >/sys/class/drm/card0/device/pp_od_clk_voltage

#Put card into desired performance level
/opt/rocm/bin/rocm-smi --setsclk 4 --setfan 110
[/code]"/sys/class/drm/card0/device/" or similar is a symlink which can change between boots if you have multiple GPUs. It's a good idea to use the static path of your card directly especially if non-R7 GPUs are present, which you can find with "readlink -f /sys/class/drm/card0/device".
As a quick and dirty guide to getting the same setup as this:
[LIST]
[*]Install Ubuntu 19.04 Disco, which comes with kernel 5.0.x
[*]Edit /etc/default/grub to add amdgpu.ppfeaturemask=0xffffffff to GRUB_CMDLINE_LINUX_DEFAULT
[*]sudo update-grub
[*]sudo apt install libnuma-dev
[*]wget -qO - [URL]http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key[/URL] | sudo apt-key add -
[*]echo 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
[*]sudo apt update && sudo apt install rocm-dev
[*]echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
[*]reboot
[*]git clone [URL]https://github.com/preda/gpuowl[/URL] && cd gpuowl && make
[*]Take the init script above and tweak it to suit your card
[*]Setup is done. Now at every boot you run the init script as root and run two gpuowl instances
[/LIST] |
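The final step above ("run two gpuowl instances") can be wrapped in a small launcher. A hedged sketch; the directory layout and binary path are my assumptions rather than anything from the guide:

```shell
#!/bin/bash
# Sketch: launch two gpuowl workers from separate directories so each keeps
# its own worktodo.txt and checkpoints. GPUOWL and the worker dirs are
# assumptions -- point them at your actual build and preferred locations.
GPUOWL="$HOME/gpuowl/gpuowl"
for dir in "$HOME/gpuowl-worker1" "$HOME/gpuowl-worker2"; do
    mkdir -p "$dir"
    # Each instance appends to its own log and survives logout via nohup
    (cd "$dir" && nohup "$GPUOWL" >>gpuowl.log 2>&1 &)
done
```

Running each instance out of its own directory also keeps their save files from clobbering each other.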
[QUOTE=M344587487;511655]The results are in. [...][/QUOTE] Are you sure that tuning the GPU manually is better than leaving it on automatic? |
[QUOTE=M344587487;511655]The results are in.
[B]Best results:[/B][code] target workers mclk sclk 4608K_combined_throughput_ms_it 5M_combined_throughput_ms_it rocm-smi_power_after_undervolt_YMMV efficient_throughput 2 1201 1547 0.845 0.95 176 quick_single_test 1 1201 1802 0.86 0.95 232[/code][/QUOTE] 0.85ms/it at wavefront, that's amazing! More than twice vega64. Sweet. |
[QUOTE=SELROC;511714]Are you sure that tuning gpu manually is better than leaving it on automatic ?[/QUOTE]
Definitely, here's the efficient tuning compared to completely stock results (--setfan aside, as the temps were too high on auto):[code]type   workers  effective_5M_ms/it  rocm-smi_watts
tuned  2        0.95                180   (mclk=1200, sclk=1547, "vc 2 1801 1030")
stock  2        0.98                247
stock  1        1.04                247[/code]The stock settings may have been hampered ever so slightly by the default power limit of 250W, although probably not much, as the power draw tends to be pretty steady with gpuowl.
[LIST]
[*]It's worth setting the perf_level with --setsclk; if you don't, it'll automatically ramp up to --setsclk 8, which has the best throughput but poor efficiency compared to perf_level 4 or 5. By reducing throughput a little you increase efficiency a lot; it depends what you're going for
[*]It's worth overclocking the memory because there are some easy gains: not a big jump in power consumption for a worthwhile jump in throughput
[*]It's necessary to set the fans manually with --setfan if you've overclocked the memory, as on auto it seems to have trouble maintaining the 95-degree target and can push into the hundreds. If the memory is not overclocked, the auto fans are good at maintaining temps unless you're using default settings and the card happens to have a high stock voltage
[*]It's worth undervolting IMO, for some power savings but more importantly to reduce heat, meaning lower fan speeds, less noise, and less wear and tear on the fans. I'm not suggesting an undervolt to the bleeding edge, but to some safe tens of mV under the stock voltage, whatever is easy to grab. That said, they've tuned the voltage to each card, so it's the least beneficial (and at the same time most sketchy) thing to tune, and you could skip it if you're uneasy about messing with the voltage. If you have a card with a really bad stock voltage and are using --setsclk 8, I think it's borderline necessary to undervolt to try to avoid hitting the power limit and having the card sound like a jet. 
For reference, a fan speed setting under 120 is comfortable to have next to you, above 150 is obnoxious, and above 180 approaches jet territory
[/LIST] [QUOTE=preda;511719]0.85ms/it at wavefront, that's amazing! More than twice vega64. Sweet.[/QUOTE] Nice :) Glad to have benchmarked it, especially as the tuning capabilities on Linux are finally easily accessible. I did try mfakto; perftest results are here, although I don't really know what they mean or whether the test was done right ( [URL]https://www.mersenneforum.org/showpost.php?p=511687&postcount=1504[/URL] ). Two instances of mfakto didn't improve throughput, and running one instance of mfakto and one of gpuowl had mfakto running at ~96% and gpuowl at ~5% of their solo rates. What else needs benching? Getting it on mersenne.ca by running a specific test is all that's left on my list (it's unclear whether they even accept gpuowl/PRP results, but I'll try anyway with stock and tuned results). I'm not going beyond 1200MHz on the memory as that's beyond the default limits; if something goes tits up I don't want it voiding the warranty. |
[QUOTE=M344587487;511725]Definitely, here's the efficient tuning compared to completely stock results [...][/QUOTE] I don't follow you here. With gpuowl the GPU RAM already runs at maximum speed on the automatic setting (2000MHz), and the GPU core clock runs at 1319MHz. Overclocking the GPU cores to the maximum 1360MHz will produce more heat and thermal throttling, so it's not worth overclocking. BTW, I run my GPUs open-air with ambient temperature at 20-25C, and they already go up to 80-82C. |
[QUOTE=M344587487;511725]
What else needs benching? Getting it on mersenne.ca by running a specific test is all that's left on my list (it's unclear if they even accept gpuowl/PRP results but I'll try anyway with stock and tuned results). I'm not going to go beyond 1200MHz on the memory as it's beyond the default limits, if something goes tits up I don't want that voiding the warranty.[/QUOTE] If you mean adding an entry in [URL]https://www.mersenne.ca/cudalucas.php[/URL] for the Radeon VII, give the exponent in the upper right of [URL]https://www.mersenne.ca/cudalucas.php[/URL] a try for 30,000 iterations and send in the log content. And do tf and send it in also if you haven't yet. [URL]https://www.mersenne.ca/mfaktc.php[/URL] (And for anyone who has RTX20xx, please submit also.) For submitting results there, mersenne.ca only takes tf for 2[SUP]32[/SUP] > p > 10[SUP]9[/SUP] to my knowledge. [URL]https://www.mersenneforum.org/showpost.php?p=488511&postcount=9[/URL] |
[QUOTE=SELROC;511728]I don't follow you here. [...][/QUOTE] I think you might be confusing this with a different card. The Radeon VII's stock speeds are 1000MHz HBM2 memory and an 1800MHz max core clock; the specs you're quoting sound like a Polaris card. Anything I've said only applies to the Radeon VII, which I may have confusingly called the R7. [QUOTE=kriesel;511732]If you mean adding an entry in [URL]https://www.mersenne.ca/cudalucas.php[/URL] for the Radeon VII [...][/QUOTE] Thanks. |
I'm not confusing it; I am telling you what an RX580 runs like. I will be able to tell for the Radeon VII when it arrives.
|
Ok, then maybe rewording will clear things up if there is still any confusion. The settings I used are a core underclock from 1800MHz to 1547MHz, a memory overclock from 1000MHz to 1200MHz, and an optional core undervolt by a small amount from stock. Polaris seems to be quite a different beast to Vega. I look forward to seeing how you fare with the R7 and what the stock voltage of your card is.
|
[QUOTE=M344587487;511743]Ok, then maybe rewording will clear things up if there is still any confusion. [...][/QUOTE]
Does raising the core clock help throughput when the memory is overclocked to 1200MHz (say, running 1800/1200 instead of 1547/1200)? |
[QUOTE=xx005fs;511747]Does raising the core clock help the throughput when the memory is overclocked to 1200MHz (say that you are running 1800/1200 instead of 1547/1200)[/QUOTE]
When using two workers, yes. I didn't bench it before, as I figured such inefficient settings would only be used to quickly DC a prime contender, which is what the quick_single_test result above is. At 1800/1200 with two workers, the 5M iteration times are 1.76 ms/it each, an effective rate of 0.88 ms/it. The card sat at its 250W power cap in rocm-smi, which makes it ~8% quicker than two workers at 1547/1200 while drawing ~39% more power. The fan speed also had to be set to over 200, which is hilariously loud. |
[QUOTE=M344587487;511753]When using two workers yes [...][/QUOTE]
Yeah, I figured it was probably going to be limited by HBM. 0.88ms/it for 1 worker is mighty impressive, and way faster than even two Vega 64s would do. How much power is the card drawing by itself after the undervolt? |
[QUOTE=xx005fs;511756]Yeah i figured it was probably going to be limited by HBM. [...][/QUOTE]
0.88 ms/it is the effective throughput of two simultaneous workers each doing 1.76 ms/it at 5M, with the card reporting ~250W in rocm-smi. The single-worker result is 0.95 ms/it with the card reporting ~232W. In both cases there was a mild undervolt to ~1030mV, which was more for reducing temps than power consumption. |
The price of the PowerColor version on ebuyer has dropped by £10 to £640. First time I've seen the price drop below £650 which is interesting: [url]https://www.ebuyer.com/875303-powercolor-radeon-vii-16gb-hbm2-graphics-card-axvii-16gbhbm2-3dh[/url]
|
I now have a Radeon VII, I'd like to share my setup. I overclocked the RAM from 1000 to 1100, and undervolted a bit.
Initial pp_od_clk_voltage:
[code]OD_SCLK:
0:  808Mhz
1: 1801Mhz
OD_MCLK:
1: 1000Mhz
OD_VDDC_CURVE:
0:  808Mhz  695mV
1: 1304Mhz  791mV
2: 1801Mhz 1089mV[/code]
Modified to:
[code]OD_SCLK:
0:  808Mhz
1: 1801Mhz
OD_MCLK:
1: 1100Mhz
OD_VDDC_CURVE:
0:  808Mhz  695mV
1: 1304Mhz  780mV
2: 1801Mhz 1070mV[/code]
I run it with --setsclk 4 (1547Mhz). I get 0.92ms/it at wavefront (FFT 4608K), and the GPU uses 160W. Fan on auto; the temperature reported by sensors is 105C, but I suspect this value is about 20C over the real temperature (because the limit is reported at 118C). |
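For anyone wanting to replicate that curve, the modified values translate into the same pp_od_clk_voltage write commands ("m", "vc", "c") used in the init script earlier in the thread. A sketch, wrapped in a function so the device path is explicit; card0 is an assumption, so resolve your card's real path first and run as root:

```shell
#!/bin/bash
# Sketch: apply the modified curve above via sysfs. The "m"/"vc"/"c" command
# syntax matches the init script earlier in the thread; card0 is an assumption.
apply_r7_tune() {
    local dev=$1   # e.g. the result of: readlink -f /sys/class/drm/card0/device
    echo "manual"         > "$dev/power_dpm_force_performance_level"
    echo "m 1 1100"       > "$dev/pp_od_clk_voltage"   # HBM2 1000 -> 1100MHz
    echo "vc 1 1304 780"  > "$dev/pp_od_clk_voltage"   # mid curve point, 791 -> 780mV
    echo "vc 2 1801 1070" > "$dev/pp_od_clk_voltage"   # top curve point, 1089 -> 1070mV
    echo "c"              > "$dev/pp_od_clk_voltage"   # commit to the card
}
# apply_r7_tune "$(readlink -f /sys/class/drm/card0/device)"   # run as root
```

Pairing this with --setsclk 4 afterwards reproduces the setup described above.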
[QUOTE=preda;512748]I now have a Radeon VII, I'd like to share my setup. [...][/QUOTE] I have an ambient temperature of 22-24C and the GPU is at 107C with good cooling. |
With its 16GB of RAM, R7 is also quite good at P-1.
I'm doing P-1(B1=300K,B2=9M) on 91M exponents in about 17min/test. (and I found this 118-bit factor: 420168247365933163207630527781851871 ) |
[QUOTE=preda;513057]With its 16GB of RAM, R7 is also quite good at P-1. [...][/QUOTE] I am waiting to deplete my current worktodo.txt before starting P-1 :-) |
[QUOTE=preda;513057]With its 16GB of RAM, R7 is also quite good at P-1. [...][/QUOTE] Do you have to manually set anything to do P-1 with gpuowl? I'm not familiar with P-1, and these lines generated from this calculator were ignored: [URL]https://www.mersenne.ca/prob.php[/URL] [code]Pminus1=1,2,344587487,-1,1645000,32900000,82
Pfactor=1,2,344587487,-1,82,2[/code] edit: I figured it out; the calculator predates the identifying hash being added to the line. |
[QUOTE=M344587487;513060]Do you have to manually set anything to do P-1 with gpuowl? [...][/QUOTE] Yes. Assignments like PFactor=AID,1,2,91157513,-1,77,2 are easily generated/submitted by e.g.:
[code]gpuowl/primenet.py -u user -p passwd --dirs workdir -w PM1 --tasks 40[/code]
and for B1/B2 I add to gpuowl e.g.:
[code]./gpuowl -B1 300000
./gpuowl -B1 1000000 -rB2 25[/code]
Bounds (B1,B2) can also be specified per exponent by prefixing the worktodo line with:
[code]B1=x;line
B1=x,B2=y;line[/code] |
Thanks. Could it make sense to do a P-1 test and a PRP test simultaneously? Two PRP tests improves throughput, two P-1 cannot be done simultaneously as they both want to max out RAM. One of each could make the most of the hardware.
|
[QUOTE=M344587487;513063]Thanks. Could it make sense to do a P-1 test and a PRP test simultaneously? [...][/QUOTE]
Yes, I think that would work. I'm not doing it myself because the benefit is too small IMO, but it may be worth it if you're patient. You should watch the memory use reported by the P-1 at the start of the test (e.g. "P-1 GPU RAM fits 388 stage2 buffers @ 40.0 MB each, using 360") to make sure the PRP has enough space -- in this example there are plenty of buffers to spare between 360 and 388 for the PRP. |