mersenneforum.org  

2018-06-03, 14:59   #1
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^3×3×197 Posts

System management notes

This thread is intended as reference material for system management, not specific to Mersenne hunting, but important for it.
(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383.)

Table of Contents
  1. this post
  2. Partial checklist for system maintenance and reliability http://www.mersenneforum.org/showpos...59&postcount=2
  3. Drivers and gpus trivia / traps / tricks http://www.mersenneforum.org/showpos...60&postcount=3
  4. Application logging and tee http://www.mersenneforum.org/showpos...67&postcount=4
  5. Memory error control http://www.mersenneforum.org/showpos...72&postcount=5
  6. Windows 10 http://www.mersenneforum.org/showpos...09&postcount=6
  7. Running multiple computation types on multiple gpus per system. https://www.mersenneforum.org/showpo...95&postcount=7
  8. Power settings https://www.mersenneforum.org/showpo...87&postcount=8
  9. etc tbd
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-10-21 at 19:08 Reason: added power settings

2018-06-03, 15:09   #2
kriesel

Partial checklist for system maintenance & reliability

Some things to check if system uptime or other measures of reliability are less than satisfactory.
  1. How old is the hardware? (Hard drive etc not too ancient? All components and connectors well seated and making good contact?)
  2. Recent backups, running on schedule, well monitored to ensure they're actually running to completion? Restore process tested and practiced to confirm it's possible to restore from those backups? N-deep backups, so that an accidental deletion that goes unnoticed until after the next backup runs does not necessarily mean the data is gone forever?
  3. Before making any changes, are there lengthy computations that are nearly done and could be completed before those changes? Application updates or other changes might cause the computation to restart from the beginning or be unable to complete. It is much better to lose some work that was just begun than to lose work that had been running for days or weeks and was almost finished.
  4. How well patched is the system?
  5. How well is it protected from power interruption or transients or sags? (Voltage regulating UPS?)
  6. Do you have a way of monitoring the line voltage?
  7. What do system logs have to say?
  8. How detailed and complete is your system logging? (Is some logging going to another system or non-boot storage device? Will it survive a HD problem in the system of interest?)
  9. What OS is it running?
  10. What other software?
  11. Is it safe from children and other small animals?
  12. System components and memory pass reliability tests? What if anything does/would a serious diagnostics attempt tell you? https://lifehacker.com/5551188/best-...agnostic-tools
  13. What assumptions are you making, perhaps without even realizing it?
  14. Temperature of components and ambient environment in a reasonable range?
  15. Relative humidity in a reasonable range?
  16. All fans in the system in good working order? Grilles and components free of dust, lint, and pet hair?
  17. A full complement of drivers, of reliable versions, typically up to date except for recent releases with known issues?
  18. Well secured?
  19. Correct power supply output voltages, and adequate current output for all the components now installed on all voltage levels? System components get added, and power supply components degrade over time. Wattage required varies with operating temperature, clock rate, program execution, etc.
  20. Is the system BIOS up to date? (thanks SELROC) The various components' firmware also?
  21. If the worst happens (dead drive, and backups failed, were not current enough, or can't be restored for some reason), do you have info on one or more good data recovery companies that can repair a drive, or open it in a clean room and get the data onto new media, for a price? Know the price, and maybe backup in depth will seem more economical.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-17 at 15:01

2018-06-03, 15:17   #3
kriesel

Drivers and gpus trivia / traps / tricks

These are mostly from Windows experience.
  1. AMD and NVIDIA gpus installed in the same system can be very problematic. There is a way to get them to coexist. https://www.youtube.com/watch?v=l_f_lIF3A7Q Segregating them to separate systems seems simpler and more robust.
  2. A failed graphics driver install can create a lingering mess. After removing a driver with the vendor-supplied tools and Add/Remove Programs, thorough file deletion and registry editing, or use of DDU, may be required. Or use the "Clean Install" option.
  3. NVIDIA allows only one NVIDIA graphics driver installed per system. The driver must support all the installed NVIDIA cards. Older GPUs get dropped from support as newer GPUs come along and require newer drivers. The really old GPUs may need to be segregated to a system that is not automatically getting driver updates. There is a relationship between driver version, minimum and maximum CUDA level supported, and compute capability minimum and maximum supported, and therefore gpu models supported. See https://tech.amikelive.com/node-930/...y-gpu-drivers/ for more on this. (Eventually old GPUs become uneconomic to operate, as newer GPUs become available that are more energy efficient. Or they fail before then or are replaced with faster hardware.)
  4. Installing the AMD or CUDA SDKs on a system can disable the OpenCL driver that was allowing the Intel IGP to run Mfakto until then.
  5. Some systems by design disable the IGP when a discrete GPU is installed, so the IGP can not be used for computation or display in that case. (Dell Optiplex 755 Core 2 Duo was an example)
  6. The Linux nouveau driver installs by default for NVIDIA GPUs, and prevents installation of the NVIDIA driver needed for CUDA computing. The nouveau driver puts up a pretty good fight, at least on the Debian version I tried. Supposedly it can be defeated by blacklisting it.
  7. Mersenne code that uses multiple GPUs working together to process a single worktodo entry does not exist in the GIMPS community, to my knowledge. (Prime95 has this capability on cpu cores.) Physically linking GPUs with NVIDIA SLI or AMD Crossfire means multiple GPUs work together sharing the memory installed on one while the other's is idle. As fast as those interconnects are, they are slow compared to on-board memory bandwidth. For P-1 especially, and also in primality testing high exponents, lots of memory is a plus, so that loss of available gpu memory would be a drawback. Throughput is better with individual GPUs each working with its own full complement of memory, on separate assignments, via separate program instances. (clarified with SELROC's input.)
  8. PCIe extenders can be used. Test well for reliability.
    Powered extenders are recommended; non-powered extenders are not.
    Extenders have a power limit of about 60 Watts. Beyond that additional gpu power plugs are required.
    Extenders are very common in mining the various types of digital coin.
    Bus load for most gpu mersenne code is quite light, so using an x1 PCIe interface is not much of a limit on throughput.
  9. Some systems won't make use of a gpu connected by PCI slot via PCIe/PCI adapter if there's already a PCIe-connected gpu present. The adapter and gpu won't even be detected as present by Windows, appearing to be not functional.
  10. The same adapter and external GPU that's ignored in the preceding can be used on a system that has PCI but no PCIe slots or other discrete GPUs (and also takes over display duties there from its IGP).

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-07-16 at 18:44

2018-06-03, 17:02   #4
kriesel

Application logging and tee

Logging of application output normally directed to the console is encouraged, except for gpuowl, which has comprehensive built-in logging. Stdout and stderr are where error messages, warnings, and normal program output typically appear. Most applications don't log much of this to a file themselves. Errors not trapped by the program may scroll off screen before the user has a chance to see them.

Per Chalsall, in Linux the append option for the tee command is either "-a" or "--append".

Regarding Windows PowerShell and tee: I saw a warning somewhere that tee creates a new destination file even if a file by that name already exists. That would blow away the previously accumulated log every time the app halted, from the Windows display driver timeout or for some other reason, and the batch wrapper restarted the application with tee redirecting a copy of screen output to the file, unless the alert user had incorporated the batch loop count or %date%%time% into the tee destination file name in the batch file. That first time could be a killer of months of logging. The -append modifier for tee is not accepted at the command line in my test on Win7.
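As a rough sketch of the timestamp-in-the-file-name idea, here it is in Python rather than a batch file. The function name and the name pattern are made up for illustration; nothing here comes from any GIMPS application.

```python
from datetime import datetime


def timestamped_log_name(app, when=None):
    """Return a log file name that is unique per start time.

    app is a hypothetical application label; when defaults to "now".
    """
    when = when or datetime.now()
    # Year-month-day then hour-minute-second, so names sort chronologically
    # and a restart never reuses (and overwrites) an earlier log.
    return f"{app}-{when:%Y%m%d-%H%M%S}.log"
```

Each restart of the wrapper then writes to a fresh file, so a tee that overwrites its destination can only clobber a log that was just created.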

Win7 Pro PowerShell (neither -a nor -append works):
PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -append
Tee-Object : A parameter cannot be found that matches parameter name 'append'.
At line:1 char:48
+ dir | tee-object -filepath tee-test.txt -append <<<<
+ CategoryInfo : InvalidArgument: (:) [Tee-Object], ParameterBindingException
+ FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -a
Tee-Object : A parameter cannot be found that matches parameter name 'a'.
At line:1 char:43
+ dir | tee-object -filepath tee-test.txt -a <<<<
+ CategoryInfo : InvalidArgument: (:) [Tee-Object], ParameterBindingException
+ FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

PS C:\Users\Ken\documents>

The -append option is not present in command line tee help on Win7, but is in Win10.

Windows 8 and 8.1 status: unknown.

In the absence of tee -append, I use append redirection >>. Some applications print part of a message to stderr and part to stdout, so redirecting only one of the streams captures only part of the message.
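Where neither tee -append nor a shell is convenient, a small wrapper can do the same job. The following is only a sketch, assuming Python 3 is installed; run_and_log is a name invented here. It opens the log in append mode and merges stderr into stdout so that split messages stay together.

```python
import subprocess
import sys
from datetime import datetime


def run_and_log(cmd, log_path):
    """Run cmd, echo its output to the console, and append it to log_path."""
    # Mode "a" appends, so an earlier log is never overwritten on restart.
    with open(log_path, "a", encoding="utf-8") as log:
        stamp = datetime.now().isoformat(timespec="seconds")
        log.write(f"--- started {stamp} ---\n")
        # stderr=subprocess.STDOUT merges both streams, so a message split
        # between stderr and stdout lands in the log in one piece.
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # copy to the console
            log.write(line)         # and to the log file
            log.flush()             # keep the file current if the app dies
        return proc.wait()
```

Usage would be something like run_and_log(["cudalucas.exe"], "cudalucas.log"), where both names are placeholders for whatever application and log file you actually use.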


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-17 at 15:03 Reason: added gpuowl as exception

2018-06-03, 18:18   #5
kriesel

Memory error control

Memory errors might occur in the gpu vram, or in the system ram, or if particularly unlucky, both. Either can affect the GIMPS calculation results of GPU applications. Ideally we would all use highly reliable hardware, with ECC present and turned on.

On the cpu side:
System ram can be tested with memtest86 or memtest86+. https://www.memtest86.com/ or http://www.memtest.org/
Memtest86+ has the capability to prepare a table of bad physical locations.
System ram is inexpensive, so bad modules can be detected and removed or replaced, and the system retested. Retesting periodically (annually?) is advisable.

For Linux systems, those badram tables from memtest86+ can be input to the Linux BadRAM kernel patch, which reserves those bad physical locations and hangs on to them, so they never get allocated to an application we care about, such as a GIMPS computation, whose results could be ruined by memory errors.

For Windows systems, to my knowledge there is no equivalent patch that a user can apply. For at least some versions, there's a built-in alternative described at https://superuser.com/questions/4200...ive-ram#490522 including lots of detail. Note the caution about possibly causing a boot failure if done incorrectly. This should be a temporary workaround while replacement RAM is on order.

For other OSes, there may be no alternative to RAM replacement or removal.

On the GPU side:
NVIDIA GPU memory can be tested with the -memtest option of CUDALucas.
AMD or NVIDIA GPU memory can be tested with gpumemtest https://sourceforge.net/projects/cudagpumemtest/
Also https://www.raymond.cc/blog/having-p...st-its-memory/
Intel IGPs use system ram so that gets tested on the system side.

ECC is often not available, and where it is present and enabled, it reduces performance. (Only high end pro-quality card models include ECC in their design.)

Speculatively:
The gpu memory may or may not be subject to the virtual memory management of the host OS. It may be possible to develop code to do bad-gpu-memory lockout at the application level, or at the driver level. Whether that results in gpu memory fragmentation that causes problems is to be determined.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-07-16 at 18:45

2018-07-12, 15:50   #6
kriesel

Windows 10

For Windows 10 setup for better privacy, there's https://www.reddit.com/r/conspiracy/...in_windows_10/ which may be useful.

Setting Windows classic theme in Win 10: https://www.youtube.com/watch?v=j2fL1RRsuTw

Stopping Cortana: https://www.pcworld.com/article/2949...assistant.html

On Windows 7 a while back, benchmarking CUDALucas under different gpu driver versions, before I figured out how to reliably stop automatic driver updates, removing the network cable temporarily worked very well to block updates.

Controlling when updates occur can be useful. This can be configured in Windows 10 to require your consent. See https://www.techradar.com/news/softw...ows-10-1307070 If you don't trust it there are fallbacks. Firewalling off update servers is a possibility; in your firewall router, make a router entry that says the undesired server addresses are on the LAN side. If the update software's packets never cross the router, updates won't be downloaded or installed. If all else fails, or for simplicity, temporarily unplug the network cable.

Getting the Pro version of Windows provides remote desktop server capability. In some versions it also means better control of backups (more choice of destination for example).


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-17 at 15:04

2019-05-22, 19:37   #7
kriesel

Running multiple computation types on multiple gpus per system

It gets complicated. The client management software that's available for one computation type generally does not support another, and some packages are also specific to CUDA or OpenCL. Not all GIMPS gpu applications are supported by separate client management software. None of the gpu apps have integrated PrimeNet API communication. App instances, folders, files, etc. proliferate quickly if running multiple computation types on multiple gpus, and more so if running multiple TF instances to extract higher performance. See the example system attached.
Bring them up one gpu and one application instance at a time.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
File Type: pdf condorette gimps configuration.pdf (11.8 KB, 96 views)

Last fiddled with by kriesel on 2019-11-17 at 15:05

2020-10-21, 18:52   #8
kriesel

Power settings

BIOS:
Turn off what you are sure you don't need. This varies considerably by BIOS flavor.

OS:
To help the system stay up and running prime95 or mprime or whatever on the cpu, and any relevant gpu applications, at full tilt while unattended, modify the default OS power saving settings. Test by leaving the system on continuously after a restart.
For Windows 10:
Click the Windows icon at the lower left, then the gear that appears a bit above it, then System in the pane that appears, then "Power & sleep". Find "when plugged in, pc goes to sleep" and select "never".

Scroll down and find "additional power settings", click on it, then in the pane that opens, click "change plan settings", then click "change advanced power settings". Adjust the many settings in the resulting window to the speed/power tradeoffs you want after considering utility cost.
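The "never sleep when plugged in" setting can also be applied from a script. This is a sketch only, assuming the stock Windows powercfg utility is on the path; the function names are invented here, and the hibernate line is an extra assumption beyond the clicks described above.

```python
import subprocess


def never_sleep_commands():
    """Build powercfg commands that disable sleep on AC power."""
    return [
        # A timeout of 0 minutes means "never" for standby on AC power.
        ["powercfg", "/change", "standby-timeout-ac", "0"],
        # Assumption: hibernate on AC power is unwanted too.
        ["powercfg", "/change", "hibernate-timeout-ac", "0"],
    ]


def apply_never_sleep():
    """Run the commands; Windows only, and may need an elevated prompt."""
    for cmd in never_sleep_commands():
        subprocess.run(cmd, check=True)
```

Calling apply_never_sleep() changes the active power plan, so check the result in the "Power & sleep" pane afterward.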

GPUs:
Consider using the power limiting capabilities of nvidia-smi for NVIDIA gpus, or the corresponding AMD gpu utilities, to reduce gpu power from maximum to a lower level that is more power efficient, especially during the air conditioning season.
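For NVIDIA cards, a sketch of scripting this is below, assuming nvidia-smi is installed and on the path. The helper names are invented here, and the wattage in the usage note is purely illustrative; query the allowed range for your own card first.

```python
import subprocess


def query_limits_command():
    """Build the nvidia-smi command that reports current and allowed limits."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.limit,power.min_limit,power.max_limit",
        "--format=csv",
    ]


def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi command that sets a power limit on one GPU."""
    # -i selects the GPU by index; -pl sets the power limit in watts.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def set_power_limit(gpu_index, watts):
    """Apply the limit; this usually needs administrator or root rights."""
    return subprocess.run(power_limit_command(gpu_index, watts), check=True)
```

For example, set_power_limit(0, 150) would ask for a 150 W cap on GPU 0, which only works if 150 W lies between that card's minimum and maximum limits. The setting generally does not persist across a reboot, so it may belong in a startup script.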


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-10-21 at 19:07

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.