mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   Making things work right? (https://www.mersenneforum.org/showthread.php?t=19910)

Mark Rose 2014-12-17 18:36

Making things work right?
 
[QUOTE=Xyzzy;390295]FWIW, we only submit results on Sundays.

:mike:[/QUOTE]

I've noticed your results tend to come in chunks. Is there a reason why you only submit once a week instead of using a tool that submits more frequently?

Xyzzy 2014-12-18 04:20

We moved your question to a new thread because we really have a weird problem to solve and your question is directly related to that problem.

Perhaps the community here can help?

We have two boxes that run trial factoring on GPU cards. We do not run anything else on the boxes because we are trying to maximize the ratio of "work done" to the amount of energy (electricity) consumed. Firstly, because we want to be as "green" as possible, and secondly, because we are trying to keep our electric bill at a reasonable level. (It costs us to run the cards, and in the summer it costs us again to cool the house.) Even though the CPUs in each box are fairly modern, no worktype delivers anything close to the "value" that we get from the GPU cards.

Between the two boxes there are three GTX980 cards.

OOTB, the default (BIOS?) GTX980 power/cooling profile is heavily weighted towards keeping the fan speeds low. We assume this is for noise reasons. If we run the cards "as is" each card will go to about 81°C, the fans will barely kick on and we will get ~560GHz-d/d per card. At no time will the cards draw more than 100% TDP. In fact, they will draw around 65-75% TDP each.

We really dislike managing Windows, but there are tools in Windows that allow us to do three interesting things. The first is that we can command the card fans to run at maximum speed. (The boxes are in a vacant room so the noise is not bad.) The second is that we can command the cards to use up to 125% TDP. Lastly, we can command the cards to never let the temperature go above 80°C. We do all of this with EVGA's [URL="http://www.evga.com/precision/"]Precision X[/URL] program.

As a result of these three tweaks we get ~625GHz-d/d per card. We think this is a significant improvement, although it means running the cards full tilt. Note that we are not "overclocking" the cards; we are just enabling them to run as fast as possible by tweaking parameters that are "normal". In fact, the cards are "pre-overclocked" from the factory. We are not fans of overclocking in general, but these cards have been put through a binning process to weed out the "weaker" ones. It takes a lot more energy per card to get this bump in speed. Across the three cards, the bump adds up to ~195GHz-d/d overall.

To be safe, we have tested the boxes by putting a garbage bag over them, to ensure that the GPUs will throttle back and stay at a safe temperature in the event "something bad" happens. In every case, whether with the default profile or our customized profile, the cards respond to the added thermal load and throttle significantly.

Now, the weird thing is we are using an older copy of Windows 7. We are using it in "evaluation" mode, which, with a few registry tweaks, we can extend from the default 30 days to 360 days. We are torn a little about how ethical this is and right now we are just ignoring that issue which is probably not a healthy or proper attitude. We think we could connect the two boxes to the Internet without causing any trouble, but then we have to manage updates, drivers and all sorts of stuff like that. We have an install routine that takes about an hour that installs all of the needed drivers and programs without the need for Internet access. That includes installing Windows itself! The downside is we have to sneakernet our results and assignments via USB key. We have been doing so every Sunday. From a reporting point of view this is not optimal but it takes only a few minutes to do so it isn't a chore or anything.

We are very familiar and comfortable with Linux.

To switch to a Linux solution we see one big obstacle: we know of no way to adjust fan speeds, power targets and temperature targets in Linux. As a result, our overall daily output would drop from ~1875GHz-d/d to ~1680GHz-d/d.

If we can accept this ~10% drop in productivity, we would be ready to implement a Linux solution.

So, to begin this discussion, having read all of that stuff above, is switching to Linux worth the loss in productivity?

Future questions, if the decision is made to run Linux:

- Can we run the boxes from the onboard GPU and use the GTX cards solely as CUDA devices? (Whatever we install will be just a text-based console installation.)
- How will we exchange results and get new assignments from PrimeNet via the GPUto72 interface?

:mike:

axn 2014-12-18 04:51

[QUOTE=Xyzzy;390336]To switch to a Linux solution we see one big obstacle: we know of no way to adjust fan speeds, power targets and temperature targets in Linux. As a result, our overall daily output would drop from ~1875GHz-d/d to ~1680GHz-d/d.

If we can accept this ~10% drop in productivity, we would be ready to implement a Linux solution.

So, to begin this discussion, having read all of that stuff above, is switching to Linux worth the loss in productivity?
[/QUOTE]
What are the power draw numbers for the two scenarios? I would think that the 10% drop would be acceptable if there is a significant difference in power draw.

chris2be8 2014-12-18 16:38

[QUOTE=Xyzzy;390336] - Can we run the boxes from the onboard GPU and use the GTX cards solely as CUDA devices? (Whatever we install will be just a text-based console installation.)
[/QUOTE]

That should be possible, I'm doing it with my box that has a GTX 560 Ti in it. I had to update BIOS settings to make it always enable the onboard graphics and use them as the primary display device.

You probably need to blacklist the nouveau graphics driver to make Linux use the Nvidia drivers. See the thread where I described what I had to do to make it work ([url]http://mersenneforum.org/showthread.php?t=16480&highlight=nouveau&page=22[/url] post 234). I'm using it for msieve polynomial selection and ECM stage 1.
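For reference, the usual blacklist approach is a small modprobe config file. This is a sketch; the file name is arbitrary and the paths assume a Debian/Ubuntu-era system:

```shell
# /etc/modprobe.d/blacklist-nouveau.conf  (any *.conf name under that directory works)
# Stop the open-source nouveau driver from grabbing the card at boot
blacklist nouveau
options nouveau modeset=0
```

After creating the file, rebuild the initramfs (`sudo update-initramfs -u` on Debian/Ubuntu) and reboot so nouveau never loads.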

Chris

Xyzzy 2014-12-18 18:11

[QUOTE=axn;390339]What are the power draw numbers for the two scenarios? I would think that the 10% drop would be acceptable if there is a significant difference in power draw.[/QUOTE]We will get power numbers today.

We are also searching to see if there is a way to flash the fan speed to the card BIOS.

owftheevil 2014-12-18 19:19

The Linux Nvidia drivers have some limited capability to adjust fan speed and clocks, but if that is not enough, overclock.net has a wealth of information on bios modifications.

Xyzzy 2014-12-18 19:58

1 Attachment(s)
[QUOTE=Xyzzy;390375]We will get power numbers today.[/QUOTE]We have attached the results.

:mike:

Xyzzy 2014-12-18 20:03

[QUOTE=owftheevil;390385]The Linux Nvidia drivers have some limited capability to adjust fan speed and clocks, but if that is not enough, overclock.net has a wealth of information on bios modifications.[/QUOTE]We seem to remember being able to adjust the fan speed via a GUI tool, but it was not persistent across reboots. If we did load up Linux it would be in CLI mode, so that we could remotely manage the boxes via ssh. (We are not interested in VNC or anything like that.)

We will look into the BIOS flashing deal. We have not yet analyzed the power numbers we attached earlier, but at a glance it looks like more speed comes at a greater cost per GHz-d. We also assume flashing the BIOS would void any warranty claims.

Xyzzy 2014-12-18 20:43

1 Attachment(s)
[QUOTE=Xyzzy;390391]We have not yet analyzed the power numbers we attached earlier, but at a glance it looks like more speed comes at a greater cost per GHz-d.[/QUOTE]We have attached the results.

:mike:

Mark Rose 2014-12-18 20:46

I don't have time to write an in-depth response at the moment. I'm assuming you're using Ubuntu/Kubuntu with the updated nvidia-331 drivers; you'll probably need an even newer version to work with the GTX 970/980s.

1. You can use the on-board video as the primary display device. I'm doing that on the box I'm typing this on. You have to prevent the nvidia driver from loading on boot. The easiest way to accomplish this is to install the bumblebee package. Then to start mfaktc, I use this script for two cards:

mf-start
[code]
#!/bin/bash

mf-stop

# Load the proprietary driver modules if they aren't already loaded
if [ "$(lsmod | egrep -c '^nvidia')" = "0" ] ; then
    sudo modprobe nvidia_331
    # also need a variant of the following with later drivers
    sudo modprobe nvidia_331-uvm
fi

# Create the device nodes the driver needs, one per NVIDIA VGA device
# (normally X or bumblebee would create these)
num=$(lspci | grep NVIDIA | grep VGA | wc -l)
num=$(expr $num - 1)

for i in $(seq 0 $num) ; do
    if [ ! -e /dev/nvidia$i ] ; then
        sudo mknod -m 666 /dev/nvidia$i c 195 $i
    fi
done

if [ ! -e /dev/nvidiactl ] ; then
    sudo mknod -m 666 /dev/nvidiactl c 195 255
fi

# Start one detached screen session per card
cd ~/mfaktc0 && screen -d -m -S mf0 ./mfaktc.exe -d 0
cd ~/mfaktc1 && screen -d -m -S mf1 ./mfaktc.exe -d 1
[/code]

mf-stop
[code]
#!/bin/bash

# Kill any running instances, then wait until they have all exited
killall mfaktc.exe 2> /dev/null

while pgrep -c mfaktc.exe > /dev/null ; do sleep 0.5 ; done
[/code]

2. You can control fan speed using the nvidia-settings utility; however, it only works if you're using the nvidia drivers directly, so the setup above actually disables it. There is a workaround I found [url=https://sites.google.com/site/akohlmey/random-hacks/nvidia-gpu-coolness]here[/url], but it doesn't work with recent Linux kernels and drivers. I've hacked on it but haven't got it working 100%. I need to put it up on GitHub. So... the answer is maybe soon. I don't know when I'll have time to work on it.

3. Later Nvidia drivers allow overclocking via the nvidia-settings utility.
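For the case where the Nvidia card is driving the display, the fan and clock controls mentioned above look roughly like this. This is a sketch, not a tested recipe: it assumes an X session with the proprietary driver, and the exact attribute names vary by driver version.

```shell
# Enable manual fan control (bit 4) and clock offsets (bit 8) in xorg.conf,
# then restart X for the change to take effect:
sudo nvidia-xconfig --cool-bits=12

# Pin the fan at 100%. On older 33x/340 drivers the attribute is
# GPUCurrentFanSpeed; later drivers renamed it GPUTargetFanSpeed.
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=100"

# Clock offset (only on drivers/cards that expose it; the [3] is the
# highest performance level on Maxwell cards):
# nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=50"
```

Note these settings are not persistent across reboots, so they would need to be reapplied from a startup script.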



If you just use the Nvidia card as the primary display device and run the latest drivers, the nvidia-settings utility should do everything you want. The start scripts I provided above should still work. Using the on-board video is problematic at the moment.

axn 2014-12-19 03:18

[QUOTE=Xyzzy;390390]We have attached the results.

:mike:[/QUOTE]

[QUOTE=Xyzzy;390395]We have attached the results.

:mike:[/QUOTE]

Looking at the results:

On the AMD box:
* Going from OOTB to 100% TDP consumes an extra 33 W for a measly gain of 40 GHz-d/d.
* Going from 100% to 125% TDP consumes a further 34 W for only 19 GHz-d/d.

The Intel box shows similar results (x2, of course).

In fact, if you could somehow move the 980 from the AMD box into the Intel box and run all three at OOTB, you would achieve 1670 GHz-d/d @ 495 W, compared to what the Intel box is doing with two GPUs @ 125% (1251 GHz-d/d @ 492 W).

In conclusion: OOTB is the way to go.
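The marginal-efficiency argument can be spelled out with the figures quoted above (a quick sanity check, using only the numbers already in this thread):

```python
# Marginal efficiency of raising the power target, using the quoted figures.
# OOTB baseline from the three-cards-at-OOTB comparison: 1670 GHz-d/d at 495 W.
base_eff = 1670 / 495                 # ~3.37 GHz-d/d per watt overall

# Incremental cost of each step on the AMD box:
step_100 = 40 / 33                    # OOTB -> 100% TDP: ~1.21 GHz-d/d per extra watt
step_125 = 19 / 34                    # 100% -> 125% TDP: ~0.56 GHz-d/d per extra watt

# Each extra watt spent above OOTB buys far less work than a baseline watt:
assert step_100 < base_eff / 2
assert step_125 < base_eff / 5
```

So the last 10% of throughput costs roughly 3-6x more energy per GHz-d than the baseline, which is why OOTB wins on efficiency.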

