Radeon VII @ newegg for 500 dollars US on 11-27
The title says it all.
I just ordered 2 @ 8:48 CST |
[QUOTE=tServo;531564]The title says it all.
I just ordered 2 @ 8:48 CST[/QUOTE]That's for an XFX. I've had a lot of reliability problems with an XFX. It can do gpuowl 0.6 LL DC reliably, although most exponents' runs contain at least one recovered error logged, and that's with the memory speed dialed down to 95%. It seems unsuitable for P-1, since the gpuowl P-1 code does not include sufficient error checks. (The P-1 algorithm is not as amenable to such checks as PRP or LL are.) As I recall woltman had a similar experience and ended up returning an XFX for exchange. Good luck. |
[QUOTE=tServo;531564]The title says it all.
I just ordered 2 @ 8:48 CST[/QUOTE] Are yours reliable? What settings are you running them at, and on what OS and host system? |
[QUOTE=kriesel;535246]Are yours reliable? What settings are you running them at, and on what OS and host system?[/QUOTE]
Reliable enough. I run them straight out of the box, no tuning except for fans 100%. They throw an error every so often. I haven't had time to diddle with the settings but plan to do so now that I have gotten some free time. They are in my dungeon, er basement which is 60 degrees F. |
[QUOTE=tServo;535281]They throw an error every so often.[/QUOTE]That sounds bad.[QUOTE=tServo;535281]Reliable enough.[/QUOTE]I think you misspelled "Not very reliable". :rolleyes:
|
[QUOTE=tServo;535281]Reliable enough. I run them straight out of the box, no tuning except for fans 100%.
They throw an error every so often. I haven't had time to diddle with the settings but plan to do so now that I have gotten some free time. They are in my dungeon, er basement which is 60 degrees F.[/QUOTE] Wow, that means they are running hot, loud, and power hungry. Getting some tuning done will reduce heat and almost certainly fix those errors. |
As of this moment there is a used Radeon VII on eBay for $425. I don't know the seller, I just wanted to put out a heads-up in case someone is looking for a number cruncher.
EDIT: Gone. |
[QUOTE=tServo;535281]They throw an error every so often.[/QUOTE]How often? I think the consensus is that even one GEC error per week or month in normal PRP is cause for dissatisfaction.
|
[QUOTE=tServo;535281]Reliable enough. I run them straight out of the box, no tuning except for fans 100%.
They throw an error every so often. I haven't had time to diddle with the settings but plan to do so now that I have gotten some free time. They are in my dungeon, er basement which is 60 degrees F.[/QUOTE] [LIST][*]I suggest at least setting sclk to 4, which corresponds to a core underclock of ~1536 MHz IIRC (a minor throughput loss for a major efficiency and noise/heat win). AMD like to push their stock GPU configuration beyond what is reasonable from an efficiency POV for the minor advantage of bragging rights on the box (that's partly why they have a reputation for releasing noisy inefficient junk - very backwards thinking). At sclk 4 you should be able to set --setfan to 100-120 for ~80C temps, which is less than 50% of full speed (100% is 255).[*]Beyond setting sclk you can overclock memory (mclk) from 1000 up to 1200, but I found minimal gains above 1100. A memory OC is an easy win for throughput and efficiency. If you overclock memory you should set the fans manually, as I found the auto fan control wouldn't ramp up as high as required to stay under the default temp target with OC'd memory.[*]The next step is undervolting. Personally I wouldn't bother, as each card comes with factory-tuned voltage that is good enough to get reasonable efficiency without the headache of potential instability. How effective manual undervolting is depends on your specific card, so YMMV.[/LIST]A command-line sketch of these steps follows below.
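A minimal sketch of those steps, assuming ROCm's rocm-smi plus the amdgpu sysfs interface on card0 (the level numbers, clocks, and paths are illustrative and vary per card and driver version - check rocm-smi --help and the OD_RANGE limits before committing anything; run as root):
[code]#!/bin/bash
# core underclock: sclk level 4 (~1536 MHz on Radeon VII)
/opt/rocm/bin/rocm-smi --setsclk 4
# fixed fan speed: 120 out of 255 (~47% of full speed)
/opt/rocm/bin/rocm-smi --setfan 120
# memory overclock via sysfs: take manual control, raise the top mclk state, commit
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "m 1 1100" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage[/code] |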
[QUOTE=PhilF;535429]As of this moment there is a used Radeon VII on eBay for $425. I don't know the seller, I just wanted to put out a heads-up in case someone is looking for a number cruncher.
EDIT: Gone.[/QUOTE] Too late for me - I ordered an XFX R7 from Amazon for $550 on the 14th; as it happens, it just arrived this morning. But given how hit-or-miss even the new ones are (per George: "I've had 4 and returned 2. Quality control may not be the best, but [$550] is a good price and Amazon was real good about the return") - I'd be very leery of trying to save a few bucks by buying used. |
[QUOTE=tServo;535281]Reliable enough. I run them straight out of the box, no tuning except for fans 100%.
[COLOR="Red"]They throw an error every so often.[/COLOR] I haven't had time to diddle with the settings but plan to do so now that I have gotten some free time. They are in my dungeon, er basement which is 60 degrees F.[/QUOTE] I've been trying to gradually increase the frequency by trial and error. So far my 'safe tunings' are as follows, with ambient at 21C +-2C:[LIST=1][*]1500 MHz @ 850 mV, memory at 1150 MHz, sitting around 80C, usually no errors.[*]1480 MHz @ 845 mV, memory at 1050 MHz, sitting around 82C, usually around 1-3 errors.[/LIST]Going above these thresholds, tried numerous times, immediately gets something like 20 errors for one exponent. LOL |
[QUOTE=ewmayer;535592]Too late for me - I order an XFX R7 from Amazon for $550 on the 14th, just arrived this morning as it happens. But given how hit-or-miss even the new ones are - (per George: "I've had 4 and returned 2. Quality control may not be the best, but [$550] is a good price and Amazon was real good about the return") - I'd be very leery about trying to save a few bucks by buying used.[/QUOTE]
Not me. I'm just the opposite. I'm very good at refurbishing used computer equipment. I was able to snag my card for $400. :smile: But math? Not so much... :smile: |
[QUOTE=tServo;535281]They throw an error every so often.[/QUOTE]
I say use PPT (PowerPlay Table) editing to tune the clock and voltage a bit instead of using Wattman. But first use Wattman or Afterburner to figure out whether it's the memory that's unstable (if it's the memory, just drop the memory clocks in Wattman and it should be good to go); otherwise use PPT to lock the voltage and clock speed so the boost feature won't cause instability. This is what I did on my Vega 64 to maintain whatever stable clock speed and temperature I want it at, instead of dealing with the default boost behavior. Sometimes it can make a clock that's usually unstable stable again. |
[QUOTE=PhilF;535429]As of this moment there is a used Radeon VII on eBay for $425. I don't know the seller, I just wanted to put out a heads-up in case someone is looking for a number cruncher.
EDIT: Gone.[/QUOTE] If one shows up used you might want to get it ASAP if it's from the first batch of Radeon VIIs. The first 5000 produced are actually Radeon Instinct MI50s with a Radeon VII sticker on them. Those are 6.6 Tflops double precision, sold as 'Radeon VIIs'. Being capable of executing 3.3 T double-precision instructions per second (where an FMA counts as 1 instruction) is a lot. When they sell the next batch it's not clear whether it is still the same gpu, or a Fiat Panda edition with lobotomized fp64. It's like the first 5000 Pandas having a V12 Ferrari engine of 6.6 Tflops - pardon, I mean liters - and once they want to make money on it, it's a 2-cylinder Fiat Panda. I have no information whatsoever, so I assume it's going to be Fiat Pandas. Would be interesting to know, though. All those 'benchmark websites' already tested the Radeon VII, so they are not going to retest it any time soon. Who would notice anyway? I've seen supercomputers sold for dozens of millions of dollars that were a factor 12 slower than they were on paper ("marketing lies" in order to win the bidding contest) and that never got tested during their lifetime. Well, until I ran on one and had written code for it that 'assumed' it would be 12x faster than it was :) |
That is not my experience. I bought multiple RadeonVII at regular intervals (roughly 2 months apart), from multiple vendors, and all have similar performance. Small differences in undervolting (that is expected), but otherwise the same. Even the XFXs (I have two) are perfectly fine, it's just that the others undervolt a tiny bit better, not a big deal. I did have fan trouble on the Asrock one (treated with teflon oil).
[QUOTE=diep;535622]If one shows up used you might want to get it ASAP if it's from the first batch of Radeon VIIs. The first 5000 produced are actually Radeon Instinct MI50s with a Radeon VII sticker on them. ...[/QUOTE] |
Here is the article:
[url]https://www.tweaktown.com/news/64501/amd-radeon-vii-less-5000-available-custom-cards/index.html[/url] |
As you can see here:
[url]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_VII_series[/url] the MI50 is listed at 6.6 Tflops double precision and the Radeon VII at 2.x Tflops - roughly a factor 3 slower. |
[QUOTE=diep;535638]As you can see here:
[url]https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_VII_series[/url] the MI50 is listed at 6.6 Tflops double precision and the Radeon VII at 2.x Tflops - roughly a factor 3 slower.[/QUOTE] Yes, that 2784 GFLOPS is for a low (base) clock frequency of 1400 MHz; at boost clock (1750 MHz) it's 3458.5 GFLOPS. While there is only one figure for the MI50, I assume it's for the boost clock, as the MI50 should be exactly 2x faster at the same clock. The [URL="https://www.amd.com/en/products/graphics/amd-radeon-vii"]AMD Radeon VII product page[/URL] lists peak DP compute as 3.46 TFLOPS. That Tweaktown article is over a year old now; they didn't know much about the card back then. Anyway, the real MI50 is now available for qualified buyers (datacenter customers): the 16GB model is less than $4000 while the 32GB model is about $4400. They've actually discontinued the MI60 32GB model that had all the compute units enabled. It is well known by this point, a year later, that the Radeon VII hardware is pretty much the same as the Radeon Instinct MI50 16GB model, but with 1:4 FP64 instead of 1:2. The silicon is identical; it is limited in some other way (vBIOS?).
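As a back-of-envelope check on those figures (assuming the published 3840 stream processors and an FMA counted as two FLOPs; the exact clock behind each published number isn't always stated, so the clocks below are approximations back-solved from the figures):
[code]FP64 rate = 3840 SPs x 2 FLOP/FMA x clock / FP64 divisor
Radeon VII (1:4): 3840 x 2 x ~1.75 GHz / 4 = ~3.36 TFLOPS (boost)
                  3840 x 2 x ~1.80 GHz / 4 = ~3.46 TFLOPS (peak, AMD's figure)
MI50       (1:2): 3840 x 2 x ~1.73 GHz / 2 = ~6.6  TFLOPS[/code] |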
[QUOTE=nomead;535649]Yes, that 2784 GFLOPS is for a low (base) clock frequency at 1400 MHz. At boost clock (1750 MHz) 3458.5 GFLOPS. While there is only one figure for the MI50, I assume it's for the boost clock, as it should be exactly 2x faster at the same clock.
The [URL="https://www.amd.com/en/products/graphics/amd-radeon-vii"]AMD Radeon VII product page[/URL] lists peak DP compute as 3.46 TFLOPS. That Tweaktown article is over a year old now. They didn't know much about the card back then. Anyway the real MI50 is now available for qualified buyers (datacenter customers), the 16GB model is less than $4000 while the 32GB model is about $4400. They've actually discontinued the MI60 32GB model that had all the compute units enabled. It is well known by this point, a year later, that the Radeon VII hardware is pretty much the same as that Radeon Instinct MI50 16GB model, but with 1:4 FP64 instead of 1:2. The silicon is identical, it is limited in some other way (vBIOS?)[/QUOTE] Ignore all those boosts if you do gpgpu 24/24. Even If you do gpgpu watercooled, which is much better than aircooled, you don't want to boost of course as you're going to be far above that 300 watt figure given - boost is not adviced for gpgpu. The question is whether the cards we see now LL benchmarks from are 6.7 Tflops double precision - in short not lobotimized in FP64 whereas what they sell FEBRUARI 2020 and later are Fiat Panda's that get lobotomized which get 2.x Tflops double precision on paper. |
[QUOTE=diep;535650]The question is whether the cards we see LL benchmarks from now are 6.7 Tflops double precision - in short, not lobotomized in FP64 - whereas what they sell February 2020 and later are Fiat Pandas with lobotomized FP64 that get 2.x Tflops double precision on paper.[/QUOTE]
I recently got my card used, but I haven't figured out how to determine its date of manufacture. Do you have a method for that? |
[QUOTE=PhilF;535654]I recently got my card used, but I haven't figured out how to determine its date of manufacture. Do you have a method for that?[/QUOTE]
How would I know, of course - but those first 5000 cards sold should overclock wonderfully well. Let's apply some real-world logic here. What does a 6.7 Tflops Tesla sell for these days - 5000 dollars or so? So AMD basically invested 25 million dollars, more or less, into the first 5000 cards just to have a gpu on them that overclocks wonderfully well and has 6.7 Tflops worth of double precision resources - all this just to score better in tests. Hoping to sell billions of dollars worth of Radeon VIIs and future improved versions - just to position the card very well in the marketplace. So it should overclock really well and turbo-boost better by default than the Fiat Pandas that come after the first 5000. So by default it should score higher in tests than the Fiat Pandas. I see a guy on eBay offering one that he claims overclocks well to 1.9 GHz. That probably is one. (edit: the overclocking is for games - which is very short-term use. For gpgpu I would advise not overclocking at all, of course - you burn up the wires of the gpu (yeah, those small 193 nm lines or something which they optimistically call '7 nm technology').) |
[QUOTE=diep;535676]Hoping to sell billions of dollars worth of Radeon VIIs and future improved versions - just to position the card very well in the marketplace.
So it should overclock really well and turbo-boost better by default than the Fiat Pandas that come after the first 5000. So by default it should score higher in tests than the Fiat Pandas.[/QUOTE] But rumors are there's limited availability, and that the product has already reached end-of-life. So I have another possibility to offer: it could be that AMD is doing what all chip makers do, which is to take all the chips that won't run reliably at rated speed (MI50 in this case) but are otherwise good, and throw them in a pile destined for later use (in this case the Radeon VII). That might even explain the limited availability, assuming that rumor is even true. The Gigabyte-branded board I have comes set at a default speed of 1800 MHz(!) and a voltage over a volt. And you're absolutely right: that speed would never work reliably or efficiently when it comes to gpu computing. |
[QUOTE=diep;535676](yeah those small 193 nm lines or something which they optimistically call '7 nm technology').[/QUOTE]193nm is the ArF excimer laser lithography light source wavelength for DUV lithography, related only by diffraction relations to produced feature sizes. 7nm uses EUV ~13.5nm wavelength light sources. [url]https://en.wikipedia.org/wiki/Extreme_ultraviolet_lithography#Light_source_power,_throughput,_and_uptime[/url]
|
[QUOTE=diep;535676](edit; the overclocking is for games - which is very short term. for gpgpu i would advice to not overclock at all of course - you burn up the wires of the gpu (yeah those small 193 nm lines or something which they optimistically call '7 nm technology').[/QUOTE]
I don't know where you get all this "information". The TSMC 7nm process has a minimum metal pitch of 40 nm. Pitch means that you could have a wire of 20 nm, then an empty space of 20 nm between wires, for example. But that is not the smallest feature on those chips: the FinFET gate has a fin pitch of 30 nm, and a fin width (at the top of the fin) of just 6 nm. The fin is somewhat thicker at its base because the structure is so tall relative to its thickness. Now, the specific process used for AMD 7 nm products thus far (N7) only uses deep ultraviolet (DUV) lithography at 193 nm wavelength. Several neat tricks have been piled on top of the traditional lithography process: high numerical aperture optics, immersion lithography, and most recently [URL="https://en.wikipedia.org/wiki/Multiple_patterning"]multiple patterning[/URL]. With just the first two, the resolution limit is about 36 nm. But with multiple patterning you can make much smaller structures on the chip, at the cost of adding many processing steps for the densest layers; the resulting pattern fidelity also suffers a bit. The next step is extreme ultraviolet (EUV) lithography at 13.5 nm for the densest layers. But there were several technical obstacles to overcome before it could be put into production use: the masks are now reflective, and the light sources are highly inefficient and barely have enough power to make mass production viable. The photoresist (the material exposed to light) produces secondary electrons because of the high-energy photons, and these electrons bounce around in the material and reduce resolution. Shot noise due to insufficient exposure again effectively reduces resolution. And so on. Also, the introduction of EUV has been delayed for so long that multiple patterning may become necessary again, sooner than expected. So the next step is TSMC 7 nm with EUV, the N7+ process. The feature sizes haven't changed, but the better patterning fidelity still gives some density and performance advantages. It is not yet used for any AMD devices on the market; the first ones are likely to be the next generation of processors based on the Zen 3 architecture (Epyc "Milan").
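For reference, that ~36 nm single-exposure limit follows from the Rayleigh resolution criterion - a sketch using the commonly cited practical limit k1 = 0.25 and NA = 1.35 for water-immersion optics:
[code]resolution = k1 x lambda / NA = 0.25 x 193 nm / 1.35 = ~36 nm[/code] |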
Thanks for clearing that up kriesel.
Phil: yes, it seems they used the very best gpus they had produced, ones known to pump out 6.7 Tflops double precision - they had been tested to do that. Those are probably gpus from the middle of the round silicon wafers. The dies from the center, a small minority, historically clock far higher than what gets produced at the edges (not seldom, memory chips get produced there). Now, I don't know how mature this TSMC 7 nm process is; usually in the first few years there are massive improvements in production quality every few months. Yet because all the MI50 gpus had to be perfect, they must have been produced in the center of wafers, with maybe memory chips at the edges intended to be clocked a lot lower. This means there would be a possibility to clock them higher than 1.4 GHz - but this can only be done with watercooling. Modern process technology, as I understand it, is supposed to run near room temperature - read 19C - so you want to watercool it to near that temperature. A difference of some dozens of Celsius - in short, better watercooling - means they eat up to 10% less power. Now, the sad thing for us here is that multiplications, which get done massively, are what eats the most power, and nowadays that 300 watt figure isn't called TDP anymore but TBP - in short, it's 300 watt when running moderate loads. Even under perfect conditions, running efficiently implemented DWTs or FFTs will easily overshoot that by up to 200 watt. So you probably want to watercool it anyway. Aircooling is a problem: the CFM those tiny fans manage to push through the small ridges of the heatsink is pretty little compared to what you want to push through. Yet those initial 5000 gpus are the interesting gpus to get. AMD threw in, say, 25 million dollars they didn't make now, in order to promote the Radeon VII. Note that Intel historically threw in way more during their glory days - arguably Intel is still shining, so what AMD throws in is pretty much peanuts. What I didn't realize but figured out over the past few days is that some newer games also seem to profit from faster fp64. So by taking care with the first batch of gpus - some of which will be used for benchmarks at test sites - it's easy to enable more fp64 resources. As some of my software was in test sets and was used for game testing at websites over the past 20 years (not so much the past couple of years, as I'm busy releasing a 3d printer now and have moved into autonomous robotics and autonomous attack drones the past 8 years), I usually got log files back from them, under the constraint that I keep them private until the article was posted (which in some special cases could take 6 months or so). Without wanting to accuse any manufacturer of cheating, let's put it this way: they all have special teams that prepare the hardware that gets shipped to testers. These are expensive teams that use the most expensive, fast, low-latency RAM. Not seldom, CPUs or GPUs that sell for a couple hundred bucks get equipped with $10k RAM which you practically couldn't buy in a store at the time. A good example is the introduction, at the start of this century, of the P4 with hyperthreading - the very first one capable of that, tested by Johan de Gelas. The test machine was simply 15-20% faster than the same cpu as bought by anyone 6 months to a year later (newer batches, in fact).
The hyperthreading on all those P4s in the stores got 10% out of it (same version), whereas Johan de Gelas's box at the mentioned clock rate got 20-25% out of it. Effectively that box was 15-20% faster than anyone building it himself at home (edit: at the same reported clock) - and those weren't beginners, and they also had expensive RAM. It wasn't the single-core speed that was that much faster, though; it was the hyperthreading that was so much faster. Unexplainably faster. (edit: it wasn't until many years later that an i7-990x, watercooled and overclocked to 4.5 GHz - very dubious for its cache latencies to overclock that much - got a similar or better hyperthreading speedup with 6 cores @ 12 threads, which we could explain by the chip being that much faster than the RAM could deliver data to it because of the overclock. So the logical explanation for Johan de Gelas's timings would be that the chip in fact ran at a higher clock with 2 hyperthreaded cores than it reported to Johan.) A good explanation would be special editions, or the unlocking of 'dubious' features of the chip, in the very first batches. Other manufacturers aren't better there. So my advice to those interested in this gpu: get one from those first 5000. I'm betting it works better than the Fiat Pandas that will be in the stores soon :) |
Please note - this is a personal opinion on gpus - that if a chip clocks higher than 1 GHz, the manufacturer is doing something wrong. Because if it can clock at 1.5 GHz or whatever above 1 GHz, they could instead have equipped the gpu with more cores (SIMDs). It's better to have 120 SIMDs at 1 GHz than 60 at 1.4 GHz - but this is just my 2 cents :)
Of course, historically many games profit more from a higher GPU clock than from even more cores - which is why they do what they do. |
[QUOTE=diep;535731]
So my advice to those interested in this gpu: get one from those first 5000. I'm betting it works better than the Fiat Pandas that will be in the stores soon :)[/QUOTE] Are you still stuck in 2019, or what are these cards that will be in the stores "soon"? The Radeon VII is EOL; no more are being manufactured. Whatever is still on sale is old stock. Anandtech article from February 2019 clarifying the FP64 performance with direct quotes from AMD: [URL="https://www.anandtech.com/show/13923/the-amd-radeon-vii-review/3"]https://www.anandtech.com/show/13923/the-amd-radeon-vii-review/3[/URL] So, AMD can limit FP64 performance through vBIOS and drivers, and back then finally decided upon 1:4 FP64 = 3.46 TFLOPS. So unless you can hack the vBIOS, you're stuck at 1:4. Haven't heard about anyone even trying... Or, by "fiat panda" do you refer to the Navi cards, which have 1:16 FP64 and have been on the market since July 2019, starting with the 5700 XT? |
I don't see what [URL]https://en.wikipedia.org/wiki/Fiat_Panda[/URL] has to do with gpus. Yield is not 100% to design spec. It's long been SOP for chip manufacturers to test bare dies and sort according to performance. Intel sorts and sells chips with fewer than full core complement functional. The 486SX was 486DX dies with working integer but broken FP. AMD made MI50 chips, and very likely sorted according to performance, selling the well performing ones for big bucks. But they likely don't grind up or throw away the underperforming chips. They would stockpile them until there are enough for making a little profit in the consumer market, in a product called Radeon VII, that outperforms the consumer-grade competition. To have pumped out that volume of lower performance hardware into the server market instead probably would have lowered prices and profits. Making the consumer king gpu is good for the AMD brand too.
|
[QUOTE=nomead;535760]
Anandtech article from February 2019 clarifying the FP64 performance with direct quotes from AMD : [URL="https://www.anandtech.com/show/13923/the-amd-radeon-vii-review/3"]https://www.anandtech.com/show/13923/the-amd-radeon-vii-review/3[/URL] So, AMD can limit FP64 performance through vBIOS and drivers, and back then, finally decided upon 1:4 FP64 = 3.46 TFLOPS. So unless you can hack the vBIOS, you're stuck at 1:4. Haven't heard about anyone even trying... [/QUOTE] If that were possible... doubling FP64 through a BIOS change would be amazing! I would hope that either somebody finds out how to edit the BIOS, or maybe AMD reaches the conclusion, in 2020, that the RadeonVII is no longer a threat to its other products (being EOL) and publishes a new BIOS that unlocks the hardware. Otherwise... it feels silly for us to try so hard for every little 1% of performance improvement while the hardware stays locked at half capacity. Let's ask AMD for an Easter gift -- double my GPU through a software update :) |
[QUOTE=kriesel;535762]I don't see what [url]https://en.wikipedia.org/wiki/Fiat_Panda[/url] has to do with gpus.[/QUOTE]
Must be a metaphor for GPUs... "What did you need another Fiat Panda for? Where are you gonna run it?" |
[QUOTE=preda;535764]If that were possible... doubling FP64 through a BIOS change would be amazing! I would hope that either somebody finds out how to edit the BIOS, or maybe AMD reaches the conclusion, in 2020, that the RadeonVII is no longer a threat to its other products (being EOL) and publishes a new BIOS that unlocks the hardware. Otherwise... it feels silly for us to try so hard for every little 1% of performance improvement while the hardware stays locked at half capacity. Let's ask AMD for an Easter gift -- double my GPU through a software update :)[/QUOTE]Haha, good luck with your dreaming.
If you could double the throughput of your existing systems with a simple download then you would have no incentive to buy more of their stuff. At least that is how they will see it. |
[QUOTE=preda;535764]If that were possible... doubling FP64 through a BIOS change would be amazing! ... Let's ask AMD for an Easter gift -- double my GPU through a software update :)[/QUOTE]Seems unlikely. If it could run at MI50 DP speed, they'd try to get the MI50 price for it. But you know we'd take the double, and the x% too, and the next, if we could. The effort needed for finding the next Mersenne prime is a STEEP function.
|
nomead: the card being EOL would be interesting info, as some hardware reviewers at websites I spoke with hadn't picked up on that yet. Note that the 'losing money on each card' info is also old, from the end of 2018.
|
[QUOTE=diep;535768]Note that the 'losing money on each card' ...[/QUOTE]That is just marketing-speak for "we are not making as much as we could", [u]not[/u] a literal "it costs us more to make than the sell price".
|
[QUOTE=retina;535769]That is just marketing-speak for "we are not making as much as we could", [u]not[/u] a literal "it costs us more to make than the sell price".[/QUOTE]
Retina - as a GPU manufacturer, AMD doesn't publicly show its future plans. Suddenly - DANG - there is something, or there isn't. It's not like the endless drumbeat Intel puts out prior to producing something. |
[QUOTE=diep;535770]Retina - as a GPU manufacturer, AMD doesn't publicly show its future plans. Suddenly - DANG - there is something, or there isn't.
It's not like the endless drumbeat Intel puts out prior to producing something.[/QUOTE]Sure, they have different strategies with regard to announcements. But my comment was more about the misleading "we make a loss" vs "we could make more" thing. |
[QUOTE=diep;535768]nomead: the card being EOL would be interesting info, as some hardware reviewers at websites I spoke with hadn't picked up on that yet. Note that the 'losing money on each card' info is also old, from the end of 2018.[/QUOTE]
August 2019: [URL="https://www.notebookcheck.net/Radeon-VII-confirmed-to-be-EOL-just-6-months-after-its-launch.432773.0.html"]https://www.notebookcheck.net/Radeon-VII-confirmed-to-be-EOL-just-6-months-after-its-launch.432773.0.html[/URL] [URL="https://www.tomshardware.com/news/amd-radeon-vii-end-of-life-status,39861.html"]https://www.tomshardware.com/news/amd-radeon-vii-end-of-life-status,39861.html[/URL] Original source for the info: [URL="https://www.pugetsystems.com/labs/articles/DaVinci-Resolve-GPU-Roundup-NVIDIA-SUPER-vs-AMD-RX-5700-XT-1563/"]https://www.pugetsystems.com/labs/articles/DaVinci-Resolve-GPU-Roundup-NVIDIA-SUPER-vs-AMD-RX-5700-XT-1563/[/URL] [QUOTE]Radeon VII is 100% EOL, we confirmed that directly with AMD before we started this round of GPU testing. Leftover supply does not mean it is still being manufactured.[/QUOTE] |
[QUOTE=retina;535772]Sure, they have difference strategies with regard to announcements. But my comment was more about the misleading "we make a loss" vs "we could make more" thing.[/QUOTE]
Well, a gpu always makes a loss until they manage to get its yields up. The yield any manufacturer achieves on a relatively new process technology is guarded more closely than any military secret. Once they start mass producing, it's easier to guess, as you need a specific yield number to break even, which is easier to guesstimate. As for the design, that's the easy and cheap part, so to say. Some years, a very small team designs such a gpu - usually no more than half a dozen guys. They make a design, get paid some millions or a tad more per team member, then leave the AMD headquarters saying "good luck and goodbye" - and they go home without a job but with cash loaded into their pockets, hoping the hundreds of engineers who now, in the second phase, go to work on their gpu design trying to get yields up are successful. They hear nothing from AMD after that - no feedback, no nothing. At that point they know just as much as you and I do. So much for giving 'feedback' to the designers. Doh. How simple the world always works... The expensive phase then starts: trying to get yields up, usually on a new process technology. What always amazes me is how little manufacturers 'bet' on having several design teams. |
[QUOTE=nomead;535774]August 2019:
[URL="https://www.notebookcheck.net/Radeon-VII-confirmed-to-be-EOL-just-6-months-after-its-launch.432773.0.html"]https://www.notebookcheck.net/Radeon-VII-confirmed-to-be-EOL-just-6-months-after-its-launch.432773.0.html[/URL] [URL="https://www.tomshardware.com/news/amd-radeon-vii-end-of-life-status,39861.html"]https://www.tomshardware.com/news/amd-radeon-vii-end-of-life-status,39861.html[/URL] Original source for the info: [URL="https://www.pugetsystems.com/labs/articles/DaVinci-Resolve-GPU-Roundup-NVIDIA-SUPER-vs-AMD-RX-5700-XT-1563/"]https://www.pugetsystems.com/labs/articles/DaVinci-Resolve-GPU-Roundup-NVIDIA-SUPER-vs-AMD-RX-5700-XT-1563/[/URL][/QUOTE] That's not an official AMD statement. That's only saying odds are there in this case they do not manage to get yields up or the gpu will live on in a different incarnation (mi50?) . It's obvious they manage to produce it somehow high clocked and output a considerable amount of Tflops with it. Moving to an entire new gpu design and abandonning this one would be a huge step to take for a manufacturer. |
[QUOTE=diep;535776]It only says the odds are that in this case they didn't manage to get yields up, or that the gpu will live on in a different incarnation (MI50?).
Moving to an entirely new gpu design and abandoning this one would be a huge step for a manufacturer to take.[/QUOTE] They are not abandoning the chip design, just the Radeon VII end product. The full-featured (and full-priced) MI50 is still in production. |
[QUOTE=diep;535734]Please note - this is a personal opinion on gpus - that if a chip clocks higher than 1 GHz, the manufacturer is doing something wrong. Because if it can clock at 1.5 GHz or whatever above 1 GHz, they could instead have equipped the gpu with more cores (SIMDs). It's better to have 120 SIMDs at 1 GHz than 60 at 1.4 GHz - but this is just my 2 cents :)[/QUOTE]So, without any proof, you claim to know better than the world's leading IC manufacturers. That's persuasive.
Process technology -> feature size, yield, and die size per core -> economic optimization -> product design and specs [URL]https://www.pcworld.com/article/3281386/amd-next-gen-radeon-navi-gpu-multi-chip.html[/URL] |
[QUOTE=nomead;535779]They are not abandoning the chip design, just the Radeon VII end product. The full-featured (and full-priced) MI50 is still in production.[/QUOTE]Which is exactly what you would expect if MI50 yield improved over time as processing got refined (as it generally does), and the accumulated inventory of sub-MI50-spec chips got used up in making Radeon VIIs. Just like the 486SX came and went.
|
Alleged simple 10C-lower cooling hack
1 Attachment(s)
While preparing to physically install my R7 - it turns out the 2-tier mounting bracket, pictured below, doesn't quite fit my ATX case, so I'm looking at Dremeling out the cleft between the 2 tabs to make it ~1cm deeper - I was searching for "radeon vii replacement mounting bracket" online, in case I end up having to return the card, and found this review of the R7 which claims a very simple 10C-lower cooling hack: simply add a small thin washer to each of the 4 screws that attach at the corners of the X-shaped backplate opening of the R7, to increase the pressure between the cooling plate and the chip. Said hack is described in the context of an overall product review:
[url]https://www.techpowerup.com/review/amd-radeon-vii/33.html[/url] |
[QUOTE=ewmayer;536072]While preparing to physically install my R7 ... I found this review of the R7 which claims a very simple 10C-lower cooling hack: simply add a small thin washer to each of the 4 screws that attach at the corners of the X-shaped backplate opening of the R7, to increase the pressure between the cooling plate and the chip. Said hack is described in the context of an overall product review:
[url]https://www.techpowerup.com/review/amd-radeon-vii/33.html[/url][/QUOTE] I tried the washer hack; it didn't work for me. Warning: the thermal pad the GPU comes with is very good -- it's hard to replace it with anything of comparable performance. When taking the cooler apart, the thermal pad may be damaged and need to be replaced, which would be a net loss. Personally I would recommend against trying the washer hack. OTOH, what does work is improving the air flow to the card: putting the GPU in an open-air "mining" rig improved temperatures by more than 10C. |
I tried the washer mod with minimal success, YMMV: [URL]https://www.mersenneforum.org/showpost.php?p=521689&postcount=13[/URL]
edit: [quote]Warning: the thermal pad the GPU comes with is very good -- it's hard to replace it with anything of comparable performance. When taking the cooler apart, the thermal pad may be damaged and need to be replaced, which would be a net loss. Personally I would recommend against trying the washer hack.[/quote]The thermal pad is good enough, and it will need replacing if you remove the cooler, but even the moderately decent thermal paste I used has better conductivity (the pad has better thermal properties on paper, but paper is misleading: if the pad and paste were the same thickness the pad would win, but the paste is applied much more thinly). That said, IMO it's not worth doing a repaste or the washer mod. |
[QUOTE=M344587487;536079]I tried the washer mod with minimal success, YMMV: [url]https://www.mersenneforum.org/showpost.php?p=521689&postcount=13[/url][/QUOTE]
Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects". |
After a fresh Ubuntu 19.10 install on my ~6-year-old Haswell system and several afternoons' work, including some awkward Dremel hackery of both the R7 mounting bracket and the back of my ATX case in order to resolve a geometric mismatch there, the R7 is in and recognized by the OS, lspci shows 2 R7 entries:
[i] 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon VII] (rev c1) 03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 HDMI Audio [Radeon VII] [/i] In terms of needed drivers, Matt (a.k.a. M344587487) noted this: "Ubuntu 19.10 uses kernel 5.3 which means the open source AMD driver that's built into the kernel can handle the Vega 20. If you were on an earlier kernel you'd need to install the amdgpu-pro driver from AMD's site but you should be good. Something you might need is the Vega 20 firmware, there was a strange period where the kernel had the right drivers but some distro's hadn't caught up to providing Vega 20 firmware. To check if you have the firmware open a terminal and run 'ls /lib/firmware/amdgpu/vega20*'." That latter list command shows 13 vega20_*.bin files, so that seems set to go. But - and I was clued in to the problem by my usual Mlucas 4-thread job on the Haswell CPU running 3x slower than usual - there is some kind of misconfiguration/driver problem remaining. 'top' shows multiple cycle-eating 'systemd-udevd' and 'modprobe' processes. Invoking 'dmesg' shows what appears to be the problem - endless repeats of this message: [i] NVRM: No NVIDIA graphics adapter found! nvidia-nvlink: Unregistered the Nvlink Core, major device number 238 nvidia-nvlink: Nvlink Core is being initialized, major device number 238 [/i] It's not clear to me which of the following 3 possible causes is the likely culprit: 1. Preparing to install the R7, I first removed an old nvidia gtx430 card from the PCI 2.0 slot (seems unlikely, because I quickly found the issue with the R7 mounting bracket after that, at which point I rebooted sans any gfx card, and had been running happily for several days like that). 2. The R7 needs some nVidia drivers and is not finding them; 3. The system is detecting *a* new video card - brand not important - and doing something nVidia-ish as a result.
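A quick way to narrow down which of those three is the likely culprit (standard Linux tools, nothing R7-specific - a sketch to adapt as needed):
[code]dmesg | grep -iE 'nvidia|amdgpu'   # which driver is emitting the errors
lsmod | grep -iE 'nvidia|amdgpu'   # which kernel modules are actually loaded
dpkg -l | grep -i nvidia           # leftover nvidia driver packages, if any[/code] |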
Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers.
I will be interested in how the R7 performs on PCIe 2 rather than PCIe 3... |
1 Attachment(s)
[QUOTE=paulunderwood;536345]Maybe it is just easier to back up, reinstall afresh and, like most of us do, use ROCm drivers.
I will be interested in how the R7 performs on PCIe 2 rather than PCIe 3...[/QUOTE] My old gtx430 was in the PCIE-2 slot ... the R7 is in the PCIE-3 slot, plus it uses both of the 8-pin power connectors on this system's PSU. It also needed me to use my Dremel with a small cutting wheel to chop out the metal bridge between the 2 back-of-case PCI cutouts used by the R7. Here's the gory post-surgery picture of the patient's innards: |
Re. the nVidia-related dmesg errors in post #47, one additional possibility occurs to me ... the only nVidia drivers I ever explicitly installed were under the old headless Debian setup, which I blew away.
I removed the nVidia card a week ago, in prep. for trying to install the R7. However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card? |
[QUOTE=ewmayer;536081]Thanks - but did you try first just-the-washers before doing the other stuff? If you tried a bunch of different things at once you risk, using the terminology of clinical trials in medicine, "confounding effects".[/QUOTE]
I did both at once and ruined the clinical trial; my understanding that the paste made a bigger difference than the washer mod comes from a tech youtuber, so YMMV. [QUOTE=ewmayer;536356]... However, the nVidia card was still installed when I upgraded to Ubuntu 19.10 ... might the Ubuntu installer have auto-detected the nVidia card and installed/defaulted-to-use the appropriate drivers at that point, and now the kernel is throwing errors due to the mismatch between those initial-OS-install drivers and the new gfx card?[/QUOTE] Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things. The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit, preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical, but it is highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable: This command should list all nvidia packages; there should be a few dozen of them: [code]dpkg -l | grep -i nvidia[/code]Purge all packages beginning with nvidia-, which should also remove their dependencies: [code]sudo apt-get remove --purge '^nvidia-.*'[/code]Reinstall ubuntu-desktop, which was just erroneously removed: [code]sudo apt-get install ubuntu-desktop[/code]Then reboot and see where you stand. |
[QUOTE=M344587487;536371]Yes, Ubuntu installs non-free drivers by default when it needs to, unless you tell it not to, including nvidia's blobs if an nvidia card is present. I'm inclined to blame nvidia's proprietary crap for your problems; people have trouble mixing vendors in the same system, and I believe it's because nvidia does things its own way via binary blob, which means they're not integrating properly with the Linux way of doing things.
The easiest/safest fix is probably to wipe and restart (after burning the nvidia card and burying it in a deep pit, preferably, YMMV), but it can't hurt to try purging nvidia from the system if you feel like it (it's not critical, but it is highly recommended that you change your wallpaper to Linus flipping off nvidia at this point, for luck). This is from an old guide but it seems reasonable: This command should list all nvidia packages; there should be a few dozen of them: [code]dpkg -l | grep -i nvidia[/code]Purge all packages beginning with nvidia-, which should also remove their dependencies: [code]sudo apt-get remove --purge '^nvidia-.*'[/code]Reinstall ubuntu-desktop, which was just erroneously removed: [code]sudo apt-get install ubuntu-desktop[/code]Then reboot and see where you stand.[/QUOTE] Thanks, Matt - I PMed you the 'before' and 'after' results of 'dpkg -l | grep -i nvidia' ... on reboot, I still quickly get a "system program problem detected" popup (but now only one, versus multiple before), which I dismiss, but 'dmesg' now shows no more of the repeating nVidia crud. I PMed you the shortlist of bold-highlighted warnings/errors I did find in the dmesg output, one of which involves a vega20*.bin firmware file, namely [i] [ 2.517924] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/vega20_ta.bin failed with error -2 [/i] I see 13 files among the /lib/firmware/amdgpu/vega20*.bin set which Ubuntu 19.10 auto-installed, but no vega20_ta.bin among them; I probably just need to grab that one separately. Most importantly, 'top' no longer shows any out-of-control system processes, and my Mlucas runs on the CPU are once again back at normal throughput. So, progress! |
Now that Super Bowl Sunday (a quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" [url=https://www.mersenneforum.org/showpost.php?p=511655&postcount=76]here[/url], and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
[code] ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269 2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e 2020-02-01 18:43:36 Note: not found 'config.txt' 2020-02-01 18:43:36 config: 90110269 2020-02-01 18:43:36 device 0, unique id '' 2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:43:36 Bye ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e 2020-02-01 18:44:02 Note: not found 'config.txt' 2020-02-01 18:44:02 device 0, unique id '' 2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:44:02 Bye [/code] Matt had noted to me, "If the PRP test starts we are good to go. If it fails with something along the lines of clGetDeviceId then gpuowl couldn't see the card." How to debug that latter problem? Looking ahead, the first 2 steps of the setup-for-2-instances script are these: [code]#Allow manual control echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level #Undervolt by setting max voltage # V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage [/code] How do I find the max stock voltage? rocm-smi gives a bunch of things, but not that: [code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 1 31.0c 21.0W 809Mhz 351Mhz 21.96% auto 250.0W 0% 0% [/code] ...and fiddling with various values of "/opt/rocm/bin/rocm-smi --setfan [n]" to set a constant fan speed causes the Fan value in the above to rise and fall. Thanks for any help from current gpuowl users.
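For what it's worth, the same sysfs file the init script writes to can also be read, and the highest OD_VDDC_CURVE point it prints appears to be the stock max clock/voltage pair - a sketch, with purely illustrative numbers (every card's curve differs):
[code]$ cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0:        808Mhz
1:       1801Mhz
OD_MCLK:
1:       1000Mhz
OD_VDDC_CURVE:
0:        808Mhz    712mV
1:       1304Mhz    862mV
2:       1801Mhz   1061mV   <-- stock max clock/voltage point (illustrative)
OD_RANGE:
SCLK:     808Mhz   2200Mhz
MCLK:     351Mhz   1200Mhz
VDDC:     712mV    1218mV[/code] |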
[QUOTE=ewmayer;536581]Now that Super Bowl Sunday (a quasi-holiday in the US revolving around the National Football League championship game) is behind us, an update - the card seems to be functioning properly. I've been following Matt's "quick and dirty setup guide" [url=https://www.mersenneforum.org/showpost.php?p=511655&postcount=76]here[/url], and am currently at the "Take the above [bash] init script [to set up for 2-gpuowl-instance running] and tweak it to suit your card" step. First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then running without any arguments whatever; both gave the following kind of error:
[code] ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269 2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e 2020-02-01 18:43:36 Note: not found 'config.txt' 2020-02-01 18:43:36 config: 90110269 2020-02-01 18:43:36 device 0, unique id '' 2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:43:36 Bye ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e 2020-02-01 18:44:02 Note: not found 'config.txt' 2020-02-01 18:44:02 device 0, unique id '' 2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:44:02 Bye [/code] [/QUOTE] Run as root (or sudo) with the [c]-user ewmayer[/c] switch (or is it --user? I just run as root.) Start with fans at 170; monitor the temperatures and, depending on your overclock, undervolt and ambient temperature, you might be able to reduce the fan speed. |
[QUOTE=paulunderwood;536593]Run as root (or sudo) with the [c]-user ewmayer[/c] switch (or is it --user? I just run as root.)
Start with fans at 170; monitor the temperatures and, depending on your overclock and undervolt, you might be able to reduce the fan speed.[/QUOTE] Thanks- Per the readme, it's a single minus sign ... from within a subdir 'run0' where I have created a worktodo.txt file containing a pair of PRP assignments, I tried 'sudo ../gpuowl -user ewmayer' ... after entering my sudo password the run echoed the same as the 2nd #fail above, just with an added 'config: -user ewmayer' line. Trying instead to log in as root and run that way [this is the Ubuntu 19.10 setup I created last week] using the same pwd gives 'Authentication failure'. I don't recall entering any other pwd during the set-pwd phase of Ubuntu 19.10 setup. Not needed yet since I can't run at all, but how do I determine the max stock voltage of my R7? |
[QUOTE=ewmayer;536597]Thanks-
Per the readme, it's a single minus sign ... from within a subdir 'run0' where I have created a worktodo.txt file containing a pair of PRP assignments, I tried 'sudo ../gpuowl -user ewmayer' ... after entering my sudo password the run echoed the same as the 2nd #fail above, just with an added 'config: -user ewmayer' line. Trying instead to log in as root and run that way [this is the Ubuntu 19.10 setup I created last week] using the same pwd gives 'Authentication failure'. I don't recall entering any other pwd during the set-pwd phase of Ubuntu 19.10 setup. Not needed yet since I can't run at all, but how do I determine the max stock voltage of my R7?[/QUOTE] Two things: make sure you are in the group "video" by running [c]id ewmayer[/c]; if not, you need to be added to it and then re-login (a sketch of that step is below). To create a root password run [C]sudo passwd root[/C] and take it from there. It is best not to run X as root, but in a terminal you can type [c]su[/c] and enter root's password.
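A minimal sketch of the add-to-group step (usermod is standard on Ubuntu; whether the render group is also needed depends on the ROCm version, so treat that line as optional):
[code]sudo usermod -aG video ewmayer    # add the user to the video group
sudo usermod -aG render ewmayer   # may also be needed on newer ROCm stacks
# log out and back in (or reboot) for the new group membership to take effect[/code] |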
[QUOTE=ewmayer;536597]...
I don't recall entering any other pwd during the set-pwd phase of Ubuntu 19.10 setup. ...[/QUOTE]If my memory serves well once installed there is no password for the root account and it can't be used because of that until you have set it via "sudo passwd root". Jacob |
[QUOTE=S485122;536599]If my memory serves well once installed there is no password for the root account and it can't be used because of that until you have set it via "sudo passwd root".
Jacob[/QUOTE] That worked - thanks - but even running as root, I still get the getDeviceIDs error, whether I use -user ewmayer, -user root, or no -user stuff at all. I've PMed Mihai, hopefully he can provide further guidance. |
[QUOTE=ewmayer;536602]That worked - thanks - but even running as root, I still get the getDeviceIDs error, whether I use -user ewmayer, -user root, or no -user stuff at all.
I've PMed Mihai, hopefully he can provide further guidance.[/QUOTE] Did you login to root in a terminal by using [c]su[/c]? |
[QUOTE=paulunderwood;536603]Did you login to root in a terminal by using [c]su[/c]?[/QUOTE]
Yes ... 'su' using the newly-set root pwd, instead 'sudo'. |
[QUOTE=ewmayer;536581]First I'd like to play with some basic single-instance running, but something is borked. The readme says "Self-test: simply start gpuowl with any valid exponent..." but does not say how to specify that expo via cmd-line flags. I tried just sticking a prime expo in there, then without any arguments whatever, both gave the following kind of error:
[code] ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 90110269 2020-02-01 18:43:36 gpuowl v6.11-142-gf54af2e 2020-02-01 18:43:36 Note: not found 'config.txt' 2020-02-01 18:43:36 config: 90110269 2020-02-01 18:43:36 device 0, unique id '' 2020-02-01 18:43:36 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:43:36 Bye ewmayer@ewmayer-haswell:~/gpuowl$ ./gpuowl 2020-02-01 18:44:02 gpuowl v6.11-142-gf54af2e 2020-02-01 18:44:02 Note: not found 'config.txt' 2020-02-01 18:44:02 device 0, unique id '' 2020-02-01 18:44:02 Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs 2020-02-01 18:44:02 Bye [/code]Matt had noted to me, "If the PRP test starts we are good to go. If it fails with something along the lines ofclGetDeviceId then gpuowl couldn't see the card." How to debug that latter problem?[/QUOTE]./gpuowl -help should give a big list of fft lengths, then at the end a list of detected gpus. If that's an empty list, confirm it with something else like an OpenCL diagnostic tool. Possibly one of the tools listed at the top of page 3 of the pdf attachment at [URL]https://www.mersenneforum.org/showpost.php?p=488474&postcount=6[/URL] Then fix the OpenCL driver installation somehow. (Can't help you there, too little gpu linux experience to go by.) On Windows one can also use GPU-Z to check driver version, OpenCL and other standards' parameters, etc. Some of the other hardware monitoring tools listed on pages 1-2 of that same attachment might also allow that. |
[QUOTE=ewmayer;536602]That worked - thanks - but even running as root, I still get the getDeviceIDs error, whether I use -user ewmayer, -user root, or no -user stuff at all.
I've PMed Mihai, hopefully he can provide further guidance.[/QUOTE] Does clinfo work? (i.e. does it detect any devices) If clinfo does not detect anything, then the problem is with the OpenCL setup in the system (i.e. drivers, ROCm). |
Maybe the issue has to do with me recommending he install ROCm using the upstream drivers ( [url]https://github.com/RadeonOpenCompute/ROCm#using-debian-based-rocm-with-upstream-kernel-drivers[/url] ); it's been a long time since I did a ROCm setup, and something in the installation procedure or environment may have changed, breaking this method or requiring extra steps. rocm-smi can see the card, but I didn't have him check clinfo or rocminfo.
|
[QUOTE=M344587487;536631]Maybe the issue is to do with me recommending he install ROCm using the upstream drivers ( [url]https://github.com/RadeonOpenCompute/ROCm#using-debian-based-rocm-with-upstream-kernel-drivers[/url] ), it's been a long time since I did a ROCm setup and something in the installation procedure or environment may have changed breaking this method or requiring extra steps. rocm-smi can see the card but I didn't have him check clinfo or rocminfo.[/QUOTE]
If it's ROCm 3.0, it may have broken OpenCL, see [url]https://github.com/RadeonOpenCompute/ROCm/issues/977[/url] ROCm 2.10 works for me. |
[QUOTE=preda;536634]If it's ROCm 3.0, it may have broken OpenCL, see [url]https://github.com/RadeonOpenCompute/ROCm/issues/977[/url]
ROCm 2.10 works for me.[/QUOTE] This looks like the best advice: [QUOTE]OlegSmelov commented on Dec 23, 2019

For those wondering how to revert to a previous version on Debian-based distros:

sudo apt autoremove rocm-dkms rock-dkms
sudo vim /etc/apt/sources.list.d/rocm.list
Replace [url]http://repo.radeon.com/rocm/apt/debian/[/url] with [url]http://repo.radeon.com/rocm/apt/2.10.0/[/url]
sudo apt update
sudo apt install rocm-dkms # or any other set of packages you need[/QUOTE] |
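For the record, a non-interactive version of the same downgrade might look like this (a sketch; swap rocm-dkms for rocm-dev if that's what was originally installed, as comes up below):

[code]# Remove the ROCm 3.0 packages
sudo apt autoremove rocm-dkms rock-dkms

# Point the repo at the pinned 2.10.0 tree instead of the rolling 'debian' tree
sudo sed -i 's|rocm/apt/debian/|rocm/apt/2.10.0/|' /etc/apt/sources.list.d/rocm.list

# Reinstall from the pinned repo
sudo apt update && sudo apt install rocm-dkms[/code] |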
[QUOTE=kriesel;536621]./gpuowl -help
should give a big list of fft lengths, then at the end a list of detected gpus. If that's an empty list, confirm it with something else like an OpenCL diagnostic tool. Possibly one of the tools listed at the top of page 3 of the pdf attachment at [URL]https://www.mersenneforum.org/showpost.php?p=488474&postcount=6[/URL]. Then fix the OpenCL driver installation somehow. (Can't help you there, too little gpu linux experience to go by.)[/QUOTE] Thanks - note the help command needs --help or -h ... anyhow, that gives me [code][build version]
Command line options:
...
-device <N>: select a specific device:
[timestamp] Exception gpu_error: DEVICE_NOT_FOUND clGetDeviceIDs(platforms[i], kind, 64, devices, &n) at clwrap.cpp:77 getDeviceIDs[/code] I don't see the big list of FFT lengths you mentioned. [QUOTE=preda;536626]Does clinfo work? (i.e. does it detect any devices) If clinfo does not detect anything, then the problem is with the OpenCL setup in the system (i.e. drivers, ROCm).[/QUOTE] Is that supposed to be an installed command? 'which clinfo' comes up empty, and I don't see any such command in /opt/rocm/bin. [QUOTE=preda;536634]If it's ROCm 3.0, it may have broken OpenCL, see [url]https://github.com/RadeonOpenCompute/ROCm/issues/977[/url] ROCm 2.10 works for me.[/QUOTE] That sounds like a possible suspect, given that I installed Ubuntu 19.10, which is newer than the 19.04 Matt based his setup-recipe on. How do I query the version number for the ROCm install on my system? Once I do that, if it indeed is 3.0, I'll try the Debian-distro reversion commands Paul dug up, which hopefully will work similarly on Ubuntu. |
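On the version question, one way to check on a Debian-based system (a sketch) is to ask the package manager which rocm packages are installed:

[code]# Either of these should reveal the installed ROCm version (e.g. 3.0.x vs 2.10.x)
apt list --installed 2>/dev/null | grep '^rocm'
dpkg -l | grep rocm[/code] |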
You're probably on 3.0. To try reverting to 2.10, you'll need to add rocm-dev to Paul's apt autoremove line, as that's the package you used; one or both of rocm-dkms and rock-dkms shouldn't be installed, but it doesn't matter if you leave them in the remove command. Similarly, if you want to try the 2.10 upstream drivers, install rocm-dev instead of rocm-dkms.
clinfo should be in [code]/opt/rocm/opencl/bin/x86_64/[/code] |
[QUOTE=M344587487;536667]You're probably on 3.0. To try reverting to 2.10, you'll need to add rocm-dev to Paul's apt autoremove line, as that's the package you used; one or both of rocm-dkms and rock-dkms shouldn't be installed, but it doesn't matter if you leave them in the remove command. Similarly, if you want to try the 2.10 upstream drivers, install rocm-dev instead of rocm-dkms.
clinfo should be in [code]/opt/rocm/opencl/bin/x86_64/[/code][/QUOTE] OK, clinfo gives [code]ewmayer@ewmayer-haswell:~/gpuowl/run0$ /opt/rocm/opencl/bin/x86_64/clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3052.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)[/code] Querying the installed packages with 'apt list | grep roc', I see the following ROC-related ones: [code]rocm-bandwidth-test/Ubuntu 16.04 1.4.0.9-rocm-rel-3.0-6-g8c2ce31 amd64
rocm-clang-ocl/Ubuntu 16.04,now 0.5.0.47-rocm-rel-3.0-6-cfddddb amd64 [installed,automatic]
rocm-cmake/Ubuntu 16.04,now 0.3.0.134-rocm-rel-3.0-6-e6d1ef3 amd64 [installed,automatic]
rocm-debug-agent/Ubuntu 16.04,now 1.0.0 amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 3.0.6 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 1.0.0.559-rocm-rel-3.0-6-628eea4 amd64 [installed,automatic]
rocm-dkms/Ubuntu 16.04 3.0.6 amd64
rocm-libs/Ubuntu 16.04 3.0.6 amd64
rocm-opencl-dev/Ubuntu 16.04,now 2.0.0-rocm-rel-3.0-6-9a4afec amd64 [installed,automatic]
rocm-opencl/Ubuntu 16.04,now 2.0.0-rocm-rel-3.0-6-9a4afec amd64 [installed,automatic]
rocm-profiler/Ubuntu 16.04 5.6.7262 amd64
rocm-smi-lib64/Ubuntu 16.04,now 2.2.0.8.rocm-rel-3.0-6-8ffe1bc amd64 [installed,automatic]
rocm-smi/Ubuntu 16.04,now 1.0.0-192-rocm-rel-3.0-6-g01752f2 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 3.0.6 amd64 [installed,automatic]
rocm-validation-suite/Ubuntu 16.04 0.0.33 amd64
rocminfo/Ubuntu 16.04,now 1.0.0 amd64 [installed,automatic]
rocprim/Ubuntu 16.04 2.9.0.950-rocm-rel-3.0-6-b85751b amd64
rocprofiler-dev/Ubuntu 16.04,now 1.0.0 amd64 [installed,automatic]
rocrand/Ubuntu 16.04 2.10.0.656-rocm-rel-3.0-6-b9f838b amd64
rocs/eoan 4:19.04.3-0ubuntu1 amd64
rocs/eoan 4:19.04.3-0ubuntu1 i386
rocsolver/Ubuntu 16.04 2.7.0.57-rocm-rel-3.0-6-7983da3 amd64
rocsparse/Ubuntu 16.04 1.5.15.691-rocm-rel-3.0-6-aee785e amd64
rocthrust/Ubuntu 16.04 2.9.0.413-rocm-rel-3.0-6-957b1e9 amd64[/code] So as you note, -dev is the one I want, -dkms is not installed. Did the autoremove, but for the next file-entry-edit step per Smelov, I don't see an 'apt' subdir in my /etc dir - is that likely a Debian-specific thing, or is the needed file perhaps somewhere else in Ubuntu? |
It should be there, as you followed my guide you added the rocm repo to the sources list with this:[code]echo 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list[/code]If it didn't exist you wouldn't have been able to install rocm.
|
[QUOTE=M344587487;536675]It should be there, as you followed my guide you added the rocm repo to the sources list with this:[code]echo 'deb [arch=amd64] [URL]http://repo.radeon.com/rocm/apt/debian/[/URL] xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list[/code]If it didn't exist you wouldn't have been able to install rocm.[/QUOTE]
Ah ... I hit ctrl-o in my edit window on the system, it default-pointed me to my last location, which was a subdir of /etc ... as root, did the file-entry debian->2.10.0 edit, 'apt update' and 'apt install rocm-dev' were successful, and from one of the 2 run subdirs I created in ~/gpuowl, fired up one job, success at last! [code]ewmayer@ewmayer-haswell:~/gpuowl/run0$ sudo ../gpuowl -user ewmayer
[sudo] password for ewmayer:
2020-02-04 13:58:31 gpuowl v6.11-142-gf54af2e
2020-02-04 13:58:31 Note: not found 'config.txt'
2020-02-04 13:58:31 config: -user ewmayer
2020-02-04 13:58:31 device 0, unique id ''
2020-02-04 13:58:32 gfx906+sram-ecc-0 103984877 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 18.03 bits/word
2020-02-04 13:58:34 gfx906+sram-ecc-0 OpenCL args "-DEXP=103984877u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.f54acc23489eep+0 -DIWEIGHT_STEP=0x1.0577e0c0e09e4p-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
1 warning generated.
2020-02-04 13:58:44 gfx906+sram-ecc-0 warning: argument unused during compilation: '-I .'
2020-02-04 13:58:44 gfx906+sram-ecc-0 OpenCL compilation in 10.22 s
2020-02-04 13:58:45 gfx906+sram-ecc-0 103984877 P1 B1=1000000, B2=30000000; 1442134 bits; starting at 0
2020-02-04 13:58:53 gfx906+sram-ecc-0 103984877 P1 10000 0.69%; 758 us/it; ETA 0d 00:18; 7011c7174679e5dd
2020-02-04 13:59:00 gfx906+sram-ecc-0 103984877 P1 20000 1.39%; 753 us/it; ETA 0d 00:18; f066604ab63196d0
2020-02-04 13:59:08 gfx906+sram-ecc-0 103984877 P1 30000 2.08%; 760 us/it; ETA 0d 00:18; 6e54df44e09f831d
2020-02-04 13:59:15 gfx906+sram-ecc-0 103984877 P1 40000 2.77%; 755 us/it; ETA 0d 00:18; 306d220bd3f66b99
2020-02-04 13:59:23 gfx906+sram-ecc-0 103984877 P1 50000 3.47%; 753 us/it; ETA 0d 00:17; 18faa6b7b06be852
2020-02-04 13:59:30 gfx906+sram-ecc-0 103984877 P1 60000 4.16%; 754 us/it; ETA 0d 00:17; b499eb4c155b7ed4
2020-02-04 13:59:38 gfx906+sram-ecc-0 103984877 P1 70000 4.85%; 758 us/it; ETA 0d 00:17; b26087c1e503d5f6
2020-02-04 13:59:46 gfx906+sram-ecc-0 103984877 P1 80000 5.55%; 762 us/it; ETA 0d 00:17; 3a4debdafd61495c
2020-02-04 13:59:53 gfx906+sram-ecc-0 103984877 P1 90000 6.24%; 756 us/it; ETA 0d 00:17; 928441b2e23adf31[/code] But, ctrl-z/bg didn't stop those screen outputs ... how do I divert those to a file? I left the smi fan control setting at 10; the fan has automatically kicked into turbo-blast mode. After several minutes of running, per-iter times have stabilized at ~800 us, which suggests that I may want to manually up the fan speed (and/or downclock the card). rocm-smi shows [code]GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%
1    80.0c  248.0W  1684Mhz  1001Mhz  56.86%  auto  250.0W  2%     100%[/code] ...and my wall wattmeter jumped from 120W to 400W. So I think I need to downclock the system a bit, to gpuowl drawing maybe ~200W instead of 280W. How do I find the max stock voltage of my card, so I can tweak it downward per your instructions? [I'll do the setting-up-for-2-jobs later, think I'll quit while I'm ahead today. :] |
[QUOTE=ewmayer;536679]Ah ... I hit ctrl-o in my edit window on the system, it default-pointed me to my last location, which was a subdir of /etc ... as root, did the file-entry debian->2.10.0 edit, 'apt update' and 'apt install rocm-dev' were successful, and from one of the 2 run subdirs I created in ~/gpuowl, fired up one job, success at last!
[code][/code] But, ctrl-z/bg didn't stop those screen outputs ... how do I divert those to a file? I left the smi fan control setting at 10; the fan has automatically kicked into turbo-blast mode. After several minutes of running, rocm-smi shows [code][/code] ...and my wall wattmeter jumped from 120W to 400W. So I think I need to downclock the system a bit, to gpuowl drawing maybe ~200W instead of 280W. How do I find the max stock voltage of my card, so I can tweak it downward per your instructions? [I'll do the setting-up-for-2-jobs later, think I'll quit while I'm ahead today. :][/QUOTE] rocm-smi --setsclk 3 or 4. These days you only need one job per GPU for optimal throughput. |
[QUOTE=ewmayer;536679]But, ctrl-z/bg didn't stop those screen outputs ... how do I divert those to a file?[/QUOTE]I think you don't. Gpuowl prints to both gpuowl.log and to console. On Windows the console output is not redirectable in my experience. Just dedicate a (virtual) terminal to it and move on.
|
[QUOTE=ewmayer;536679]...success at last!...
[/QUOTE] Welcome to the Radeon VII club. You will never look back :smile: |
[QUOTE=preda;536680]rocm-smi --setsclk 3 or 4. These days you only need one job per GPU for optimal throughput.[/QUOTE] Thanks - nice and simple. In the meantime I upped the fan setting to 150, then tried --setsclk with settings 3, 4, 5 - looks like 5 is the default, is that right? [code]--setsclk 5: 757 us/iter, temp = 70C, watts = 400 [~120 of those are baseline, including an ongoing 4-thread Mlucas job on the CPU]
--setsclk 4: 792 us/iter, temp = 65C, watts = 350
--setsclk 3: 848 us/iter, temp = 63C, watts = 300[/code] So without fiddling the clocking, simply upping fanspeed to 150 dropped the temp from 80C to 70C. Downclocking cuts the wattage nicely, but it's hard to see what the effect on runtime is because the job I started is in p-1 stage 2. I'll update with the effect of the above settings on per-iteration times once the job gets into PRP-test mode. [b][Edit: added per-iter to above table.][/b] Based on the results, I'll use '--setsclk 4' for now. Preda, can I expect any total-throughput boost from running 2 jobs per Matt's instructions, at the same settings? |
[QUOTE=ewmayer;536679]
But, ctrl-z/bg didn't stop those screen outputs ... how do I divert those to a file? [/QUOTE] There will be a way to use the command "screen" (from a crontab -- but you will need root not sudo). That way you can open up a terminal and screen the output. See [url]https://www.mersenneforum.org/showpost.php?p=534091&postcount=7[/url] |
[QUOTE=kriesel;536681]I think you don't. Gpuowl prints to both gpuowl.log and to console. On Windows the console output is not redirectable in my experience. Just dedicate a (virtual) terminal to it and move on.[/QUOTE]
Yeah, that's what I did while awaiting an answer from one of the old hands. [QUOTE=paulunderwood;536682]Welcome to the Radeon VII club. You will never look back :smile:[/QUOTE] Seeing those actual per-iter times on what was until an hour ago an aged, clunky 6-y.o. Haswell system is something else, that's for sure. Thanks, Mihai, for such a great program! It was nice to be able to upgrade the aforementioned aging system this way, got a lot of added-throughput bang for my hardware-purchase $. So it looks like p-1 stage 2 finished, no factor found ... I will update my previous post with the per-iter times at each of the 3 clock settings I tried. |
[QUOTE=preda;536634]If it's ROCm 3.0, it may have broken OpenCL, see [URL]https://github.com/RadeonOpenCompute/ROCm/issues/977[/URL][/QUOTE]
I am honestly really disappointed in how AMD is handling OpenCL right now: they have basically neglected support for it on Windows machines, and now ROCm 3.0 breaks OpenCL. I think what they need to do is work out something similar to CUDA, or somehow convert CUDA code automatically while maintaining good performance. I hope that in the future, with stronger hardware, OpenCL won't be neglected to the degree that it can't be used to run GpuOwl. |
Ok, first gpuowl issue - my Haswell system has always been notoriously unstable; I get the Linux equivalent of a BSOD ~2x per week, with no overclocking, either. Just did a quick before-going-to-bed check and found it had done so sometime in the last few hours. On reboot, starting my Mlucas job on the CPU was no problem, but trying to restart gpuowl (from within the run0 dir I created within the main gpuowl dir) hits this - file list shown at end:
[code]ewmayer@ewmayer-haswell:~/gpuowl/run0$ ../gpuowl
2020-02-04 22:37:23 gpuowl v6.11-142-gf54af2e
2020-02-04 22:37:23 Note: not found 'config.txt'
2020-02-04 22:37:23 device 0, unique id ''
2020-02-04 22:37:24 gfx906+sram-ecc-0 103984877 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 18.03 bits/word
2020-02-04 22:37:25 gfx906+sram-ecc-0 OpenCL args "-DEXP=103984877u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.f54acc23489eep+0 -DIWEIGHT_STEP=0x1.0577e0c0e09e4p-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
1 warning generated.
2020-02-04 22:37:29 gfx906+sram-ecc-0 warning: argument unused during compilation: '-I .'
2020-02-04 22:37:29 gfx906+sram-ecc-0 OpenCL compilation in 3.90 s
2020-02-04 22:37:29 gfx906+sram-ecc-0 '/home/ewmayer/gpuowl/run0/103984877/103984877.owl' invalid
2020-02-04 22:37:30 gfx906+sram-ecc-0 103984877 OK 35000000 loaded: blockSize 400, 2c0ebcb44118e8be
2020-02-04 22:37:31 gfx906+sram-ecc-0 Can't open '/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl' (mode 'wb')
2020-02-04 22:37:31 gfx906+sram-ecc-0 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: can't open file: Success [/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl]
2020-02-04 22:37:31 gfx906+sram-ecc-0 Bye
ewmayer@ewmayer-haswell:~/gpuowl/run0$ ll
total 80
drwxr-xr-x 3 ewmayer ewmayer 4096 Feb 4 14:41 ./
drwxr-xr-x 8 ewmayer ewmayer 4096 Feb 3 15:40 ../
drwxr-xr-x 2 root root 4096 Feb 4 22:28 103984877/
-rw-r--r-- 1 ewmayer ewmayer 45684 Feb 4 22:37 gpuowl.log
-rw-r--r-- 1 ewmayer ewmayer 301 Feb 4 14:44 results.txt
-rw-r--r-- 1 root root 181 Feb 4 14:41 worktodo.txt
-rw-r--r-- 1 root root 244 Feb 4 13:58 worktodo.txt-bak
ewmayer@ewmayer-haswell:~/gpuowl/run0$ ll 103984877/
total 128216
drwxr-xr-x 2 root root 4096 Feb 4 22:28 ./
drwxr-xr-x 3 ewmayer ewmayer 4096 Feb 4 14:41 ../
-rw-r--r-- 1 root root 12998165 Feb 4 22:26 103984877-old.owl
-rw-r--r-- 1 root root 12998155 Feb 4 14:17 103984877-old.p1.owl
-rw-r--r-- 1 root root 46137398 Feb 4 14:38 103984877-old.p2.owl
-rw-r--r-- 1 root root 0 Feb 4 22:28 103984877.owl
-rw-r--r-- 1 root root 12998155 Feb 4 14:18 103984877.p1.owl
-rw-r--r-- 1 root root 46137398 Feb 4 14:40 103984877.p2.owl[/code] I notice the 0-sized .owl file is the primary backup, and there is no -new.owl file. But there is a -old.owl file last updated 2 mins before the .owl one, so I copied that to the .owl one and restarted ... no joy: [code]ewmayer@ewmayer-haswell:~/gpuowl/run0$ ../gpuowl
2020-02-04 22:52:21 gpuowl v6.11-142-gf54af2e
2020-02-04 22:52:21 Note: not found 'config.txt'
2020-02-04 22:52:21 device 0, unique id ''
2020-02-04 22:52:21 gfx906+sram-ecc-0 103984877 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 18.03 bits/word
2020-02-04 22:52:22 gfx906+sram-ecc-0 OpenCL args "-DEXP=103984877u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.f54acc23489eep+0 -DIWEIGHT_STEP=0x1.0577e0c0e09e4p-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
1 warning generated.
2020-02-04 22:52:26 gfx906+sram-ecc-0 warning: argument unused during compilation: '-I .'
2020-02-04 22:52:26 gfx906+sram-ecc-0 OpenCL compilation in 3.80 s
2020-02-04 22:52:26 gfx906+sram-ecc-0 103984877 OK 35000000 loaded: blockSize 400, 2c0ebcb44118e8be
2020-02-04 22:52:27 gfx906+sram-ecc-0 Can't open '/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl' (mode 'wb')
2020-02-04 22:52:27 gfx906+sram-ecc-0 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: can't open file: Success [/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl]
2020-02-04 22:52:27 gfx906+sram-ecc-0 Bye[/code] So then copied same -old.owl file to the -new.owl one ... still no joy: [code]ewmayer@ewmayer-haswell:~/gpuowl/run0$ ../gpuowl
2020-02-04 22:53:31 gpuowl v6.11-142-gf54af2e
2020-02-04 22:53:31 Note: not found 'config.txt'
2020-02-04 22:53:31 device 0, unique id ''
2020-02-04 22:53:32 gfx906+sram-ecc-0 103984877 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 18.03 bits/word
2020-02-04 22:53:33 gfx906+sram-ecc-0 OpenCL args "-DEXP=103984877u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x1.f54acc23489eep+0 -DIWEIGHT_STEP=0x1.0577e0c0e09e4p-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
1 warning generated.
2020-02-04 22:53:36 gfx906+sram-ecc-0 warning: argument unused during compilation: '-I .'
2020-02-04 22:53:36 gfx906+sram-ecc-0 OpenCL compilation in 3.76 s
2020-02-04 22:53:37 gfx906+sram-ecc-0 103984877 OK 35000000 loaded: blockSize 400, 2c0ebcb44118e8be
2020-02-04 22:53:38 gfx906+sram-ecc-0 Can't open '/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl' (mode 'wb')
2020-02-04 22:53:38 gfx906+sram-ecc-0 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: can't open file: Success [/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl]
2020-02-04 22:53:38 gfx906+sram-ecc-0 Bye[/code] Help! In the meantime I simply deleted the current entry from worktodo.txt and restarted gpuowl on the next one. |
[QUOTE=ewmayer;536686]Seeing those actual per-iter times on what was until an hour ago an aged, clunky 6-y.o. Haswell system is something else, that's for sure. Thanks, Mihai, for such a great program! It was nice to be able to upgrade the aforementioned aging system this way, got a lot of added-throughput bang for my hardware-purchase $[/QUOTE]Gpuowl is certainly great for the price, and Mihai is due a lot of thanks. And let's also remember the contributions to its speed by Prime95, NVIDIA compatibility by Fan Ming, and documentation and other contributions by SELROC and others.
Welcome to the gpu side. Old machines with hefty power supplies increasingly look like homes just begging for fast gpus. |
That error message means that gpuowl intends to *create* the file <n>-new.owl (to write a new checkpoint to it), and of course it's a fatal error if it can't do so. Why can't it create the file? Maybe the disk is full, maybe wrong rights on the folder, maybe something else? Can you manually write to that path, as the same user gpuowl runs as?
[QUOTE=ewmayer;536721] 2020-02-04 22:37:31 gfx906+sram-ecc-0 Can't open '/home/ewmayer/gpuowl/run0/103984877/103984877-new.owl' (mode 'wb') [/QUOTE] It seems the owner of the folder /home/ewmayer/gpuowl/run0/103984877/ is root. |
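A quick way to confirm and cure that (a sketch, using the paths from the log above and assuming everything under run0 should belong to the regular user):

[code]# Confirm who owns the checkpoint directory
ls -ld /home/ewmayer/gpuowl/run0/103984877

# Hand the run directory (owner and group, recursively) back to the normal user
sudo chown -R ewmayer:ewmayer /home/ewmayer/gpuowl/run0[/code] |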
[QUOTE=kriesel;536681]I think you don't. Gpuowl prints to both gpuowl.log and to console. On Windows the console output is not redirectable in my experience. Just dedicate a (virtual) terminal to it and move on.[/QUOTE]
on Linux you could redirect output to a file, or to /dev/null

./gpuowl options > /dev/null

Or, nohup will also redirect output to a file and keep the background process running after the shell closes:

nohup ./gpuowl options & |
I think the max sclk is 7, that being the default too. The card can't run for any amount of time at that sclk though, due to overheating; it thermally throttles *a lot* until it cools down, after which it speeds up again, etc., in an inefficient see-saw pattern.
While running PRP you could proceed to memory overclock tuning; usually 1150 is safe, and it can go up to 1180 or 1200. In general you want at least 24h without errors as validation. I usually run at sclk 3 or lower, but never more than 4. [quote]GPU  VDD    SCLK  MCLK  Mem-used  Mem-busy  PWR   FAN   Temp      PCIeErr
0    762mV  1243  1181  0.43GB    36%       129W  2004  64/79/72  2
1    781mV  1252  1161  0.43GB    37%       136W  1803  65/77/71  0
2    737mV  1251  1181  0.80GB    36%       124W  1805  63/76/71  0[/quote] The above values correspond to a bit under sclk 3 (between 2 and 3). I get 800us/it at 5M FFT. The total system power at the plug is 580W. [QUOTE=ewmayer;536684]Thanks - nice and simple. In the meantime I upped the fan setting to 150, then tried --setsclk with settings 3, 4, 5 - looks like 5 is the default, is that right? [code]--setsclk 5: 757 us/iter, temp = 70C, watts = 400 [~120 of those are baseline, including an ongoing 4-thread Mlucas job on the CPU]
--setsclk 4: 792 us/iter, temp = 65C, watts = 350
--setsclk 3: 848 us/iter, temp = 63C, watts = 300[/code] So without fiddling the clocking, simply upping fanspeed to 150 dropped the temp from 80C to 70C. Downclocking cuts the wattage nicely, but it's hard to see what the effect on runtime is because the job I started is in p-1 stage 2. I'll update with the effect of the above settings on per-iteration times once the job gets into PRP-test mode. [b][Edit: added per-iter to above table.][/b] Based on the results, I'll use '--setsclk 4' for now. Preda, can I expect any total-throughput boost from running 2 jobs per Matt's instructions, at the same settings?[/QUOTE] |
[QUOTE=preda;536730]on Linux you could redirect output to a file, or to /dev/null
./gpuowl options > /dev/null

Or, nohup will also redirect output to a file and keep the background process running after the shell closes:

nohup ./gpuowl options &[/QUOTE] Attempts to redirect with append via >> on Google Colab (which is Linux VMs) did not work for background tasks, which would have let the VM be monitored with top running in the foreground. |
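If plain redirection misbehaves in a given environment, piping through tee is another option (a sketch; note gpuowl writes gpuowl.log on its own regardless):

[code]# Append stdout and stderr to a file while backgrounding the job
./gpuowl [options] >> out.log 2>&1 &

# Or keep a live copy on the console and in a file simultaneously
./gpuowl [options] 2>&1 | tee -a out.log[/code] |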
[QUOTE=preda;536728]That error message means that gpuowl intends to *create* the file <n>-new.owl (to write a new checkpoint to it), and of course it's a fatal error if it can't do so. Why can't it create the file? Maybe the disk is full, maybe wrong rights on the folder, maybe something else? Can you manually write to that path, as the same user gpuowl runs as?
It seems the owner of the folder /home/ewmayer/gpuowl/run0/103984877/ is root.[/QUOTE] In my trying-to-restart-post-crash flailings, I was able to create the <n>.owl copy (as that file showed empty) of the <n>-old.owl file using "sudo cp"; similar for try #2, "sudo cp <n>-old.owl <n>-new.owl" ... so 'sudo cp' allowed the file-copy, but left the ownership as root ... weird. Still learning the various subtle differences between using sudo and doing stuff as root. So woke up this a.m., fan noise from the system was suspiciously quiet ... no crash, just the 'backup run' of the next assignment in the worktodo file quit due to p-1 stage 2 finding a factor. And I'd neglected to add more assignments to pad the worktodo file. Grr. Anyhow, as root, I restored the 1*7 files-dir to its post-system-crash state - a valid-looking <n>-old.owl file, an empty <n>.owl file, and no <n>-new.owl file - then chown'ed the ownership to me-as-regular-user, restored the worktodo entry, and restarted ... still the same error trying to create <n>-new.owl. But then I saw that I'd forgotten to change the group of the files in question from root to me (i.e. my 'chown ewmayer *' should've been 'chown ewmayer:ewmayer *'), so used 'sudo chgrp ewmayer *' (equivalent to 'chown :ewmayer *') to do that; now the restart is successful. Thanks for the help. [QUOTE=preda;536732]I think the max sclk is 7, that being the default too. The card can't run for any amount of time at that sclk though, due to overheating; it thermally throttles *a lot* until it cools down, after which it speeds up again, etc., in an inefficient see-saw pattern.[/quote] Yes, I noticed that last night during my post-crash restart of the backup assignment - wall wattage (again, 120W of which are baseline with Mlucas on the CPU) started at a whopping 450W, --setsclk 5 lowered that to 400W, --setsclk 4 to 350W. [quote]While running PRP you could proceed to memory overclock tuning; usually 1150 is safe, and it can go up to 1180 or 1200. In general you want at least 24h without errors as validation. I usually run at sclk 3 or lower, but never more than 4.[/quote] On my R7, --showmclkrange shows a valid range of 808MHz - 2200MHz, and arg-less rocm-smi shows a default memory clocking of 1001Mhz ... to upclock that should I use --setmclk [level], or should I use --setmlevel MCLKLEVEL MCLK MVOLT (if the latter, lmk what 3 arg values I should use)? [QUOTE]The above values correspond to a bit under sclk 3 (between 2 and 3). I get 800us/it at 5M FFT. The total system power at the plug is 580W.[/QUOTE] That's a very nicely low wattage for 3 R7s plus system background. What temperature range do your cards run at? In your experience, what is the maximum safe temp for stable running? |
IIRC the default temp target maintained by variable fan speed is 95 C and the "oh dear" territory is 105 C. I suggest anything lower than 90 C; it depends how much tolerance you have for noise and wear and tear on the fans.
|
[QUOTE=M344587487;536806]IIRC the default temp target maintained by variable fan speed is 95 C and the "oh dear" territory is 105 C. I suggest anything lower than 90 C; it depends how much tolerance you have for noise and wear and tear on the fans.[/QUOTE]
Currently getting a very manageable 70C with the fan level override set at 120 ... interestingly, the ATX case here is so old/beat-up that it has no working case fans anymore, just the CPU fan and R7 fan array. I found that simply leaving off the case side panel one removes to access the mobo allows for good convective airflow: cooler room air enters the case through the open side and, once warmed, can easily escape through the upper-back and top-panel case-fan openings, as well as the top of the open side. The 2 fans at the front of the case were never connected, so I have the option of pulling those and replacing the aforementioned pair of defunct fans with them, but so far it hasn't proved necessary: your best case ventilation is that removed side panel. Plus it allows one to see the kewl red LEDs spelling out RADEON on the side of the R7 ... at night, it looks like the Vegas strip in there now. :)

Oh, Matt - do you agree with Preda's comment that single-job running with appropriately tuned fan and memclock settings now gives total throughput similar to the 2-job running your script sets up for? And would it be worthwhile updating your setup-guide post to reflect some of the issues I hit with my setup under Ubuntu 19.10? Specifically:

o Recent versions of GpuOwl need libgmp-dev to be installed;
o I needed to manually remove a bunch of nVidia package crud to get the system to properly recognize the R7;
o ROCm 3.0 breaks OpenCL, so if that is the current version shipping with one's distro, it needs to be reverted to 2.10 (or perhaps fiddle the pkg-install notes to get the latter from the start);
o If single-job running can now be done at more or less the same total throughput as 2-job, that part of the setup guide can be simplified. |
[QUOTE=ewmayer;536815]...
Oh, Matt - do you agree with Preda's comment that single-job running with appropriately tuned fan and memclock settings now gives total throughput similar to the 2-job running your script sets up for? ...[/QUOTE]I trust preda that single job is now optimal; my info is outdated and gpuowl has been worked on heavily. [QUOTE=ewmayer;536815]... And would it be worthwhile updating your setup-guide post to reflect some of the issues I hit with my setup under Ubuntu 19.10? Specifically:

o Recent versions of GpuOwl need libgmp-dev to be installed;
o I needed to manually remove a bunch of nVidia package crud to get the system to properly recognize the R7;
o ROCm 3.0 breaks OpenCL, so if that is the current version shipping with one's distro, it needs to be reverted to 2.10 (or perhaps fiddle the pkg-install notes to get the latter from the start);
o If single-job running can now be done at more or less the same total throughput as 2-job, that part of the setup guide can be simplified.[/QUOTE] It would; that was never intended to be a robust guide, but I will make it one. I was planning to wait until Ubuntu 20.04 was released and ROCm had rebased to it, but I can do a small update now. [LIST][*]Install Ubuntu 19.10[*]Update if you've never updated before, to shake off any gremlins:[code]sudo apt update && sudo apt upgrade[/code][*]If an nvidia card is present, remove it and uninstall the nvidia drivers (AMD cards do not play nice with nvidia cards):[code]sudo apt remove --purge '^nvidia-.*' && sudo apt install ubuntu-desktop[/code][*]Expose AMD GPU tuning in the kernel:[LIST][*]Add the tuning flag to grub: edit /etc/default/grub to add [c]amdgpu.ppfeaturemask=0xffffffff[/c] to GRUB_CMDLINE_LINUX_DEFAULT[*]Push changes:[code]sudo update-grub[/code][/LIST] [*]Install required libs including GMP:[code]sudo apt install libnuma-dev libgmp-dev[/code][*]Add the ROCm 2.10 repository to your sources list:[LIST][*]Add the ROCm GPG key for signed packages:[code]wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -[/code][*]Add the 2.10 repo to sources (at time of writing there's a problem with the current latest version, 3.0):[code]echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/2.10.0/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list[/code][/LIST] [*]Install ROCm using the upstream drivers, add the current user to the video group so that they can access the GPU, and reboot:[code]sudo apt update && sudo apt install rocm-dev && echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules && sudo shutdown -r now[/code][*]At this point the GPU and ROCm should be installed and working. The following commands should show information about the card and the environment:[code]/opt/rocm/bin/rocm-smi
/opt/rocm/opencl/bin/x86_64/clinfo
/opt/rocm/bin/rocminfo
lspci[/code][*]Download and build gpuowl:[code]git clone https://github.com/preda/gpuowl && cd gpuowl && make[/code][*]Run gpuowl with no options to make sure it detects the card. It should also show the card's unique id[*]Start a PRP test to make sure it works, CTRL-C to cancel out[*]Setup is done. Now all you need to do is create a script you run on every reboot to tune the settings of the card. Bonus points if you make it a cron job.
This is where my knowledge is outdated and I'll save researching it until Ubuntu 20.04 is viable:[LIST][*]Perhaps the unique id can be used to robustly and easily identify the card for tuning instead of groping around /sys?[*]At the very least have the card underclock for efficiency. Something along the lines of "rocm-smi --setsclk 3" using the unique id somehow as identifier[*]Memory overclock. Has this changed? I'm sure the old method still works but newer methods exist that may be more user friendly[*]Undervolt. Instead of the hacky "tweak max voltage on curve" there is a new way to be able to set the voltage on a per sclk/P-state basis. It may apply only to kernel 5.5+[/LIST] [/LIST] |
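As a starting point for that boot-time tuning, something like the following could work (a sketch; the script path is illustrative, and the sclk/fan values are the ones discussed earlier in the thread):

[code]#!/bin/bash
# /usr/local/bin/r7-tune.sh (illustrative path): underclock for efficiency, pin the fan
/opt/rocm/bin/rocm-smi --setsclk 4
/opt/rocm/bin/rocm-smi --setfan 120[/code]

A root crontab entry ('sudo crontab -e') along the lines of [c]@reboot sleep 30 && /usr/local/bin/r7-tune.sh[/c] would then apply it on every boot; the sleep gives the driver time to come up first. |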
[QUOTE=ewmayer;536804]
On my R7, --showmclkrange shows a valid range of 808MHz - 2200MHz, and arg-less rocm-smi shows a default memory clocking of 1001Mhz ... to upclock that should I use --setmclk [level], or should I use --setmlevel MCLKLEVEL MCLK MVOLT (if the latter, lmk what 3 arg values I should use)? [/QUOTE] I don't have much experience with setting the mem frequency with rocm-smi; I was not aware of --setmlevel. In particular, when overclocking the mem, I was setting only the frequency but not the voltage. (I don't know if the mem voltage is different from the "sclk" voltage, and if so, how to read the current mem voltage.) Anyway, maybe you could try something like: --setmlevel 1 1150 and see if that has an effect on performance (expected: increase in perf) and on power (expected: small increase in power). My GPUs usually run at under 85C. I think a max safe temperature is 102-105. Anyway, in the region above 100 the GPU throttles, so I would try to keep it under 97 to avoid thermal throttling. (The values above are for the "junction" temperature, which is the highest value of the three: edge, junction, mem.) The default fan curve keeps the GPU too hot, so I set a higher manual fan speed. |
[QUOTE=preda;536871]I don't have much experience with setting the mem frequency with rocm-smi; I was not aware of --setmlevel. In particular, when overclocking the mem, I was setting only the frequency but not the voltage. (I don't know if the mem voltage is different from the "sclk" voltage, and if so, how to read the current mem voltage.)
Anyway, maybe you could try something like: --setmlevel 1 1150 and see if that has an effect on performance (expected: increase in perf) and on power (expected: small increase in power).[/QUOTE] Thanks - I tried a couple different things:

[1] --setmclk: there is no flag to show the valid range, so I just started with a deliberately outrageous value, 100, and the resulting error message said "Max clock level is 2". The current (default) mclk level is unknown, but it yields a frequency of 1001MHz. --setmclk 2 leaves that unchanged, so that appears to be the default. --setmclk 1 knocks that down to 801MHz - the opposite direction of where we want to go - cutting ~10W from the power usage and raising iteration times @5632K from 790us to ~840us. I passed on trying level 0 and instead reverted to level 2. :)

[2] --setmlevel: your suggestion of args '1 1150' failed with "expected 3 argument(s)". So we need the third MVOLT arg to be supplied. --showvoltagerange indicates a valid voltage range of 738-1218mV, so next I tried --setmlevel 1 1150 1000. After answering 'y' to the resulting scary SMI warning about operating outside of official AMD specs, got this: [code]Unable to write to sysfs file /sys/class/drm/card1/device/pp_od_clk_voltage[/code] That probably just means I need to sudo the command, but that filename sounded familiar: it appears in Matt's [url=https://www.mersenneforum.org/showpost.php?p=511655&postcount=76]setup-to-run-2-instances script[/url]. So let's see what the current values in the file are: [code]OD_SCLK:
0: 808Mhz
1: 1801Mhz
OD_MCLK:
1: 1150Mhz
OD_VDDC_CURVE:
0: 808Mhz 711mV
1: 1304Mhz 803mV
2: 1801Mhz 1096mV
OD_RANGE:
SCLK: 808Mhz 2200Mhz
MCLK: 801Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV[/code] Matt's script: [code]#!/bin/bash
if [ "$EUID" -ne 0 ]; then echo "Radeon VII init script needs to be executed as root" && exit; fi
#Allow manual control
echo "manual" >/sys/class/drm/card0/device/power_dpm_force_performance_level
#Undervolt by setting max voltage
#                V Set this to 50mV less than the max stock voltage of your card (which varies from card to card), then optionally tune it down
echo "vc 2 1801 1010" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Overclock mclk to 1200
echo "m 1 1200" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Push a dummy sclk change for the undervolt to stick
echo "s 1 1801" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Push everything to the card
echo "c" >/sys/class/drm/card0/device/pp_od_clk_voltage
#Put card into desired performance level
/opt/rocm/bin/rocm-smi --setsclk 4 --setfan 110[/code] So that 'vc 2 1801 1010' line appears to correspond to the level-2 entry in the above file: [code]OD_VDDC_CURVE:
0: 808Mhz 711mV
1: 1304Mhz 803mV
2: 1801Mhz 1096mV[/code] I'm guessing that Matt's "Set this to 50mV less than the max stock voltage of your card (which varies from card to card)" with the arrow pointing down at the 1010 entry means his card has a max stock voltage of 1060mV, whereas mine has 1096mV. But better safe than sorry for starters; I kept his script as-is and used value 1010.
But even executing the script as root I get permission errors: [code]root@ewmayer-haswell:/home/ewmayer# ./radeon_setup.sh
./radeon_setup.sh: line 6: /sys/class/drm/card0/device/power_dpm_force_performance_level: Permission denied
./radeon_setup.sh: line 9: /sys/class/drm/card0/device/pp_od_clk_voltage: Permission denied
./radeon_setup.sh: line 11: /sys/class/drm/card0/device/pp_od_clk_voltage: Permission denied
./radeon_setup.sh: line 13: /sys/class/drm/card0/device/pp_od_clk_voltage: Permission denied
./radeon_setup.sh: line 15: /sys/class/drm/card0/device/pp_od_clk_voltage: Permission denied
========================ROCm System Management Interface========================
GPU[1] : Successfully set sclk frequency mask to Level 4
GPU[1] : Successfully set fan speed to Level 110
==============================End of ROCm SMI Log ==============================[/code] Per-iter times of my GpuOwl run are unchanged; the only change I can see is that the 'Perf' entry in the rocm-smi output now reads 'manual'. |
p.s.: The above write errors in the script-exec appear to be another manifestation of the "Unable to write to sysfs file /sys/class/drm/card1/device/pp_od_clk_voltage" error I got previously, because when I next try '--setmlevel 2 1801 1150' I get the above file-write error followed by an "Unable to set mclk clock to Level m 2 1801 1150".
The permissions on said file are '-rw-r--r-- 1 root root' ... so why can't root write it? |
That's the gist of it: reading pp_od_clk_voltage gives a human-readable table of the current state, and passing it different command strings can alter the state. Passing "m 1 1200" to pp_od_clk_voltage says to set the memory clock of state 1 to 1200 MHz, which is the state we're normally in. Similarly, pushing "vc 2 1801 1010" sets the max voltage on the voltage curve to 1010, which, you're right, was the voltage I was using to undervolt. The way I set voltage in that script was a hack, because we're setting the max voltage for running at --setsclk 7 (corresponding to 1801 MHz) while actually running at --setsclk 3/4 - not setting the voltage we're using directly, and instead letting the voltage-curve black box set the voltage to somewhere between states 1 and 2. It was done this way because at the time (and perhaps still) it was finicky as hell, so I did the minimum alterations required to get it working.
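Spelled out as shell commands, that interface looks roughly like this (a sketch; card index and values as in the script above, and it assumes power_dpm_force_performance_level has already been set to 'manual'):

[code]# Read the current state table (human-readable)
cat /sys/class/drm/card0/device/pp_od_clk_voltage

# Queue a change (memory state 1 to 1200 MHz), then commit with 'c';
# nothing takes effect on the card until the commit is pushed
echo 'm 1 1200' | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage
echo 'c' | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage[/code]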
--setmlevel wasn't a thing when the script was made so I never tried it; it does look like it's just an interface to setting mclk and mem voltage. I've never touched memory voltage and I never want to; unless you're going way out of spec I don't think it's an issue. Memory voltage may not even apply to Vega20: IIRC with Vega10 the memory voltage is shared with one of the clock voltages, so changing memory voltage directly shouldn't work, and Vega20 may have inherited that trait. As for being unable to write to pp_od_clk_voltage, let's flail about wildly. The first thing I'd try is the scorched-earth approach to file permissions:[code]sudo chmod 777 /sys/class/drm/card0/device/pp_od_clk_voltage[/code]It probably won't work because pp_od_clk_voltage is not a normal file, but try it anyway. Next, try running with sudo instead of as actual root; it shouldn't make a difference, but it's an easy thing to rule out, as there may be some group funkiness going on. The next thing is to try and put root in the video group if it isn't already (if you even can); it probably won't work, as I think root should by default act like it's in all groups, but it's worth a shot, and we are flailing about, so it's appropriate to try all angles. |
@Matt:
o sudo allowed the 777-chmod, but --setmlevel still fails the same way, whether run with sudo or as root.
o Re-doing the chmod as root: again, same --setmlevel failure.
o Contents of /etc/udev/rules.d/70-kfd.rules are the stuff from your setup guide: [i]SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"[/i]

I don't see any kind of 'user' field in there - that file is owned by root, so presumably root is in the video group? The error message specifically mentions the file as being a 'sysfs file'; perhaps its status as a system file contains a clue? |
[QUOTE=ewmayer;536929]The error message specifically mentions the file as being a 'sysfs file'; perhaps its status as a system file contains a clue?[/QUOTE]
OK, found some info about this issue [url=https://askubuntu.com/questions/92379/how-do-i-get-permissions-to-edit-system-configuration-files]here[/url]. The gksu package no longer exists in Ubuntu, so I tried the 'sudo -H gedit' method to open the file. In the '2' entry under OD_VDDC_CURVE, manually changed 1096mV to 1150mV, then attempted to save, got this error message: [i] Could not save the file "/sys/class/drm/card0/device/pp_od_clk_voltage". Unexpected error: Error writing to file: Invalid argument [/i] Clicking the X-checkbox to close the dialog, got a further error dialog: [i] The file "/sys/class/drm/card0/device/pp_od_clk_voltage" changed on disk. [/i] After clicking "drop changes and reload", I notice the file timestamp immediately updates to the current time. Could it be that the software which runs the R7 has grabbed the file and is continually accessing it in some way which precludes root from editing it? |
[list][*]gksu and gksudo are like su and sudo but with a graphical interface, and might do something special when running GUI programs; I don't know the specifics, but lack of gksu shouldn't be the issue. You can use nano to edit simple config files from the terminal if gedit is problematic
[*]You can't edit sysfs files like that; you tend to be able to read some normally, and you tend to write to some by piping commands, like in the script. The kernel is maintaining the state of the virtual file, and in all likelihood it's working as intended [/list] Unfortunately I don't really know where to go from here to debug your issue. Did you try the last point from my previous post of adding root to the video group?[code]usermod -a -G video root[/code] |
[QUOTE=M344587487;537031]Did you try the last point from my previous post of adding root to the video group?[code]usermod -a -G video root[/code][/QUOTE]
I had a followup about that in the post below it, so no, I did not yet do the add-to-group. Just tried it, first as sudo, then as root; both failed with the same unable-to-write-file error as before. So it looks like I'm stuck with the default mem-voltage. Ah, well - thanks for all the suggestions!

=======================

Next up: Now that the R7 is up and blasting, I'm just gonna let my 2 ongoing CPU/Mlucas jobs finish and then let the CPU idle, crunching-wise - no point having the CPU burning similar watts as the GPU, for less than 1/10th the throughput. That will free up ~100 watts, so the obvious question is, what is the best bang-for-$ GPU which I could plug into the PCI2 slot? Note I'm all out of 8-pin connectors on my current PSU's wire bundle, though there may be a couple 6-pin ones remaining which I could gang together to create a single 8-pin one. Are there any decently fast GPUs which can get all their needed power through the PCI bus? And if there are some nVidia ones which qualify, can one mix the 2 card types in a single system? |
[QUOTE=ewmayer;537043]Now that the R7 is up and blasting, I'm just gonna let my 2 ongoing CPU/Mlucas jobs finish and then let the CPU idle, crunching-wise - no point having the CPU burning similar watts as the GPU, for less than 1/10th the throughput.
That will free up ~100 watts, so the obvious question is, what is the best bang-for-$ GPU which I could plug into the PCI2 slot? Note I'm all out of 8-pin connectors on my current PSU's wire bundle, though there may be a couple 6-pin ones remaining which I could gang together to create a single 8-pin one. Are there any decently fast GPUs which can get all their needed power through the PCI bus? And if there are some nVidia ones which qualify, can one mix the 2 card types in a single system?[/QUOTE] The GTX 1650 has a 75-watt rating, so it can be driven by the PCIe slot only. It also has an excellent GhzD/day/watt TF rating, just over 12. To get above that costs $500 to $9,000. [URL]https://www.mersenne.ca/mfaktc.php?sort=gpw[/URL] A Radeon RX470 can get fed by a SATA-power-to-6-pin adapter; nominally 110 W. It's also decent for PRP GhzD/d/watt for the price range. [URL]https://www.mersenne.ca/cudalucas.php?sort=gpw[/URL] Half the TF speed of the GTX 1650, but 150% of the PRP speed. And it should get along well with the Radeon VII. It's possible to run a mixed system. In fact it's required to use all the slots of the bigger mining-rig boards, due to limits on how many gpus a driver can support. I had 3 NVIDIA cards coexisting with an RX550 for a while on Win7. Installing a new driver for an RTX resulted in no OpenCL for NVIDIA, so no gpuowl for NVIDIA there, and no function for lower compute-capability 2.x models either. The Intel HD4600 IGP never did hold indications of a functioning OpenCL capability long enough to be useful for mfakto on that install. [URL]https://www.anandtech.com/show/11739/asus-announces-b250-expert-mining-motherboard-19-expansions-slots[/URL] |
@Ken: Thanks, I will look into those!
[b]Edit:[/b] OK, a couple of notes. First off, note that I have little interest in TF work, though I recognize how important it is for the project to have plenty of folks who enjoy that kind of crunching.

o The R7 numbers at the top of [url=https://www.mersenne.ca/cudalucas.php?sort=gpw]the mersenne.ca GPU-for-LL/PRP page[/url] appear to be out of date: they have the R7 costing $700, using 300W and getting ~280 GHz-days per calendar day. (The table only goes up to p ~95M, but the numbers are pretty flat.) My R7 cost $550, and with a slight downclocking tweak to bring its FLOP/W number into the sweet spot for the card, burns ~220W. It completes one assignment with p ~ 103M, worth 431 GHz-days (based on a just-submitted one: 15 GHz-days for the p-1 step with no factor found, 416 GHz-days for the PRP), every 23 hours, giving a daily output of 450 GHz-days, which is 60% higher than that listed at the above page, and a per-watt output of ~2.05, more than 2x the 0.972 figure at the above page.

o For the 4-core Haswell CPU on the same system, running Mlucas (suboptimal, ~2/3 the throughput of Prime95, but my need for continual QA testing of my own code trumps the quest for optimality), currently finishing some p ~96M assignments (credit of ~360 GHz-days), 1 every 16 days, at almost exactly 100W, thus a modest ~23 GHd/d and 0.23 GHd/d/W. Again, if all I cared about was total throughput I'd be running mprime and getting closer to 0.4 GHd/d/W. The sub-$500 cards I see on the above page top out at about the same ~0.4 GHd/d/W. OTOH, Mike/Xyzzy suggests a used GTX 1050 might set me back a mere $100 or so; with that I'm looking at similar GHd/d and GHd/d/W as running my own code on the CPU. (Unless the tabulated throughput numbers for the 1050 are understatements like those for the R7.) |