#1 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
29×167 Posts |
This thread is intended as reference material for system management, not specific to Mersenne hunting, but important for it.
(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383.)
Table of Contents
Last fiddled with by kriesel on 2020-10-21 at 19:08 Reason: added power settings |
#2 |
Some things to check if system uptime or overall reliability is less than satisfactory.
Last fiddled with by kriesel on 2019-11-17 at 15:01 |
#3 |
These are mostly from Windows experience.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-07-16 at 18:44 |
#4 |
Logging of application output normally directed to the console is encouraged, except for gpuowl, which has comprehensive built-in logging. Stdout and stderr are where error messages, warnings, and normal program output typically appear. Most applications don't log much of this to a file themselves. Errors not trapped by the program may scroll off screen before the user has a chance to see them.
Per Chalsall, in Linux the append option for the tee command is either "-a" or "--append".

Regarding Windows PowerShell and tee: I saw a warning somewhere that tee creates a new destination file even if a file by that name already exists. That would blow away the previously accumulated log every time the app halted, from the Windows display driver timeout or another reason, and the batch wrapper restarted the application with tee redirecting a copy of screen output to the file, unless the alert user had incorporated the batch loop count or %date%%time% into the tee destination file name in the batch file. That first time could be a killer, of months of logging.

The -append modifier for tee is not accepted at the command line in my test on Win7.

Win7 Pro PS (same comments, except neither -a nor -append works):

    PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -append
    Tee-Object : A parameter cannot be found that matches parameter name 'append'.
    At line:1 char:48
    + dir | tee-object -filepath tee-test.txt -append <<<<
        + CategoryInfo          : InvalidArgument: (:) [Tee-Object], ParameterBindingException
        + FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

    PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -a
    Tee-Object : A parameter cannot be found that matches parameter name 'a'.
    At line:1 char:43
    + dir | tee-object -filepath tee-test.txt -a <<<<
        + CategoryInfo          : InvalidArgument: (:) [Tee-Object], ParameterBindingException
        + FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

    PS C:\Users\Ken\documents>

The -append option is not present in the command line tee help on Win7, but is in Win10. Its status on Windows 8 and 8.1 is unknown. In the absence of tee -append, I use append redirection (>>). Some applications print part of a message to stderr and part to stdout, so the message gets only partially redirected.
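Where neither tee -append nor a batch wrapper is convenient, the same effect can be had from a small wrapper script. The following is only a minimal sketch in Python, not something the applications themselves provide: the names run_and_tee and timestamped_log_name are hypothetical, and the program to run is whatever command you pass in. It opens the log in append mode so a restart never truncates earlier output, and it merges stderr into stdout so a message split across the two streams lands in the file whole.

```python
import subprocess
import sys
from datetime import datetime


def timestamped_log_name(prefix, now=None):
    """Build a log file name that embeds date and time (hypothetical helper).

    Useful when a fresh file per restart is preferred over one appended file.
    """
    now = now or datetime.now()
    return f"{prefix}-{now:%Y%m%d-%H%M%S}.log"


def run_and_tee(command, log_path):
    """Run command, echoing its output to the console and appending it to log_path.

    stderr is merged into stdout so both streams reach the log in order.
    Returns the exit code of the command.
    """
    # Mode "a" appends; an existing log is never overwritten on restart.
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
            log.flush()  # keep the file current in case of a crash
        return proc.wait()
```

A restart loop would call run_and_tee repeatedly with the same log path, or with timestamped_log_name("myapp") if one file per run is wanted; "myapp" here is just a placeholder prefix.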
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-17 at 15:03 Reason: added gpuowl as exception |
#5 |
Memory errors might occur in the gpu vram, or in the system ram, or if particularly unlucky, both. Either can affect the GIMPS calculation results of GPU applications. Ideally we would all use highly reliable hardware, with ECC present and turned on.
On the cpu side:
System ram can be tested with memtest86 or memtest86+. https://www.memtest86.com/ or http://www.memtest.org/
Memtest86+ has the capability to prepare a table of bad physical locations. System ram is inexpensive, so bad modules can be detected and removed or replaced, and the system retested. Retesting periodically (annually?) is advisable.

For Linux systems, those badram tables from memtest86+ can be input to the Linux badram kernel patch, which allocates those bad physical locations and hangs on to them, so they don't get allocated to some application we care about whose results could be ruined by memory errors, such as GIMPS computations.

For Windows systems, there is no equivalent user-applicable patch available, to my knowledge. For at least some versions, there's a built-in alternative described in lots of detail at https://superuser.com/questions/4200...ive-ram#490522. Note the caution about possibly causing a boot failure if done incorrectly. This should be a temporary workaround while replacement RAM is on order.

For other OSes, there may be no alternative to RAM replacement or removal.

On the GPU side:
NVIDIA GPU memory can be tested with the -memtest option of CUDALucas.
AMD or NVIDIA GPU memory can be tested with gpumemtest https://sourceforge.net/projects/cudagpumemtest/
Also see https://www.raymond.cc/blog/having-p...st-its-memory/
Intel IGPs use system ram, so that gets tested on the system side.
ECC is often not available, and if present and enabled, it reduces performance. (Only high end pro-quality card models include ECC in their design.)

Speculatively: The gpu memory may or may not be subject to the virtual memory management of the host OS. It may be possible to develop code to do bad-gpu-memory lockout at the application level, or at the driver level. Whether that results in gpu memory fragmentation that causes problems is to be determined.
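To make the badram tables mentioned for the cpu side a little more concrete, here is a minimal sketch in Python of how an address/mask table can be interpreted. It assumes the table is given as a string of hexadecimal address,mask pairs, optionally prefixed with "badram=", and that an address counts as bad when the bits selected by the mask equal the same bits of the base address. The function names parse_badram and is_bad and the sample values are hypothetical illustrations, not output from a real test run.

```python
def parse_badram(pattern_text):
    """Parse 'badram=addr,mask[,addr,mask...]' into a list of (addr, mask) pairs.

    The 'badram=' prefix is optional; all values are read as hexadecimal.
    """
    text = pattern_text.strip()
    if text.startswith("badram="):
        text = text[len("badram="):]
    values = [int(field, 16) for field in text.split(",") if field.strip()]
    if len(values) % 2 != 0:
        raise ValueError("expected address,mask pairs")
    return list(zip(values[0::2], values[1::2]))


def is_bad(address, pairs):
    """Return True if address matches any (base, mask) pair.

    A match means every bit enabled in the mask is equal in address and base.
    """
    return any((address & mask) == (base & mask) for base, mask in pairs)
```

With the made-up table "badram=0x12345000,0xfffff000", every address in the 4 KiB page starting at 0x12345000 would be reported as bad by is_bad, while addresses in neighboring pages would not.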
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-07-16 at 18:45 |
#6 |
For Windows 10 setup for better privacy, there's https://www.reddit.com/r/conspiracy/...in_windows_10/ which may be useful.
Setting Windows classic theme in Win 10: https://www.youtube.com/watch?v=j2fL1RRsuTw
Stopping Cortana: https://www.pcworld.com/article/2949...assistant.html

On Windows 7 a while back, benchmarking CUDALucas under different gpu driver versions, before I figured out how to reliably stop automatic driver updates, removing the network cable temporarily worked very well to block updates.

Controlling when updates occur can be useful. This can be configured in Windows 10 to require your consent. See https://www.techradar.com/news/softw...ows-10-1307070
If you don't trust it, there are fallbacks. Firewalling off update servers is a possibility; in your firewall router, make a router entry that says the undesired server addresses are on the LAN side. If the update software's packets never cross the router, updates won't be downloaded or installed. If all else fails, or for simplicity, temporarily unplug the network cable.

Getting the Pro version of Windows provides remote desktop server capability. In some versions it also means better control of backups (more choice of destination, for example).

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-17 at 15:04 |
#7 |
It gets complicated. The client management software that's available for one computation type generally does not support another, and some are specific to CUDA or OpenCL in addition. Not all GIMPS gpu applications are supported by separate client management software. None of the gpu apps have integrated Primenet API communication. App instances, folders, files, etc. proliferate quickly if running multiple computation types on multiple gpus, and more so if running multiple TF instances to extract higher performance. See the example system attached.
Bring them up one gpu and one application instance at a time.
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2019-11-17 at 15:05 |
#8 |
BIOS:
Turn off what you are sure you don't need. This varies considerably by BIOS flavor.

OS:
To help the system stay up and running prime95 or mprime or whatever on the cpu, and any relevant gpu applications, at full tilt while unattended, modify the default OS power saving settings. Test by leaving the system on continuously after a restart.
For Windows 10: click on the lower left Windows icon, the gear that will appear a bit above it, System in the pane that will appear, then "Power & sleep". Find "when plugged in, pc goes to sleep" and select "never". Scroll down and find "additional power settings", click on it, then in the pane that opens, click "change plan settings", then click "change advanced power settings". Adjust the many settings in the resulting window to the speed/power tradeoffs you want, after considering utility cost.

GPUs:
Consider using the power limiting capabilities of nvidia-smi for NVIDIA gpus, or the corresponding AMD gpu utilities, to reduce gpu power from maximum to a lower level that is more power efficient, especially during the air conditioning season.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2020-10-21 at 19:07 |
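Both the Windows sleep setting and the NVIDIA power limit described in this post can also be set from the command line, which helps when setting up several systems. The following is a minimal sketch in Python that only builds the commands and, by default, prints them instead of running them. It assumes the powercfg utility on Windows and nvidia-smi with its -i (gpu index) and -pl (power limit in watts) options; the function names and the 150 W figure are hypothetical illustrations. The allowed limit range depends on the card, and changing it normally requires administrator or root rights.

```python
import subprocess


def disable_sleep_command():
    """Windows: set the plugged-in sleep timeout to 0, which means never."""
    return ["powercfg", "/change", "standby-timeout-ac", "0"]


def power_limit_command(gpu_index, watts):
    """NVIDIA: build an nvidia-smi call limiting one gpu to the given watts."""
    if watts <= 0:
        raise ValueError("power limit must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply(command, dry_run=True):
    """Print the command (dry run) or execute it and return its exit code."""
    if dry_run:
        print(" ".join(command))
        return 0
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Illustrative values only; check what your own card supports first.
    apply(disable_sleep_command())
    apply(power_limit_command(0, 150))
```

Run as is, the script only shows what it would do. Passing dry_run=False to apply executes the command, so try it on one gpu first and confirm the throughput per watt actually improves.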