Old 2018-06-03, 14:59   #1
kriesel's Avatar
Mar 2017
US midwest

137108 Posts
System management notes

This thread is intended as reference material for system management, not specific to Mersenne hunting, but important for it.
(Suggestions are welcome. Discussion posts in this thread are discouraged; please use the reference material discussion thread instead.)

Table of Contents
  1. This post
  2. Partial checklist for system maintenance and reliability
  3. Drivers and GPUs: trivia / traps / tricks
  4. Application logging and tee
  5. Memory error control
  6. Windows 10
  7. Running multiple computation types on multiple GPUs per system
  8. Power settings
  9. WSL
  10. Linux
  11. Windows - Linux dictionary
  12. etc., TBD
Top of reference tree:

Last fiddled with by kriesel on 2021-09-01 at 16:29 Reason: added WSL, Linux, Windows - Linux dictionary
Old 2018-06-03, 15:09   #2
Partial checklist for system maintenance & reliability

Some things to check if the system uptime or other reliability is less than quite good:
  1. How old is the hardware? (Hard drive etc not too ancient? All components and connectors well seated and making good contact?)
  2. Are recent backups running on schedule, and monitored to ensure they actually run to completion? Has the restore process been tested and practiced to confirm it's possible to restore from those backups? Are backups kept N deep, so that an accidental deletion not noticed before the next backup runs does not mean the data is gone forever?
  3. Before making any changes, are there lengthy computations that are nearly done and could be completed first? Application updates or other changes might cause a computation to restart from the beginning or fail to complete. It is much better to lose work that was just begun than work that ran for days or weeks and was almost finished.
  4. How well patched is the system?
  5. How well is it protected from power interruption or transients or sags? (Voltage regulating UPS?)
  6. Do you have a way of monitoring the line voltage?
  7. What do system logs have to say?
  8. How detailed and complete is your system logging? (Is some logging going to another system or non-boot storage device? Will it survive a HD problem in the system of interest?)
  9. What OS is it running?
  10. What other software?
  11. Is it safe from children and other small animals?
  12. System components and memory pass reliability tests? What if anything does/would a serious diagnostics attempt tell you?
  13. What assumptions are you making and may not even realize it?
  14. Temperature of components and ambient environment in a reasonable range?
  15. Relative humidity in a reasonable range?
  16. All fans in the system in good working order? Grilles and components free of dust, lint, and pet hair?
  17. A full complement of drivers, of reliable versions, typically up to date except for recent releases with known issues?
  18. Well secured?
  19. Correct power supply output voltages, and adequate current output for all the components now installed on all voltage levels? System components get added, and power supply components degrade over time. Wattage required varies with operating temperature, clock rate, program execution, etc.
  20. Is the system BIOS up to date? (thanks SELROC) The various components' firmware also?
  21. If the worst happens (dead drive; backups failed, weren't current enough, or can't be restored for some reason), do you have information on one or more good data recovery companies that can repair a drive, or open it in a clean room and move the data to new media, for a price? Learn the price, and backup in depth may seem more economical.
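Item 2 above (N-deep backups) can be sketched in a few lines of shell. This is a minimal, hypothetical rotation scheme; the paths, depth, and file names are made up for illustration, and the demonstration writes only to temporary directories so it can be run harmlessly:

```shell
# Minimal sketch of N-deep backup rotation (checklist item 2).
# SRC and DEST are temp dirs here; in practice DEST would be other media.
SRC=$(mktemp -d)                # stands in for the data to protect
DEST=$(mktemp -d)               # stands in for the backup destination
DEPTH=3                         # how many generations to keep
echo "important data" > "$SRC/worktodo.txt"

i=$DEPTH
while [ "$i" -gt 1 ]; do        # shift older generations: .2 -> .3, .1 -> .2, ...
  prev=$((i - 1))
  if [ -d "$DEST/backup.$prev" ]; then
    rm -rf "$DEST/backup.$i"
    mv "$DEST/backup.$prev" "$DEST/backup.$i"
  fi
  i=$prev
done
cp -a "$SRC" "$DEST/backup.1"   # newest generation
ls "$DEST"                      # -> backup.1
```

Run from cron on a schedule, this keeps DEPTH generations, so a deletion noticed late can still be recovered from an older generation.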
Top of reference tree:

Last fiddled with by kriesel on 2019-11-17 at 15:01
Old 2018-06-03, 15:17   #3
Drivers and GPUs: trivia / traps / tricks

These are mostly from Windows experience.
  1. AMD and NVIDIA gpus installed in the same system can be very problematic. There is a way to get them to coexist. Segregating them to separate systems seems simpler and more robust.
  2. A failed graphics driver install can leave a lingering mess. Thorough file deletion and registry editing may be required after removing a driver with the vendor-supplied tools and Add/Remove Programs; alternatively, use DDU, or the installer's "Clean Install" option.
  3. NVIDIA allows only one NVIDIA graphics driver installed per system, and that driver must support all the installed NVIDIA cards. Older GPUs get dropped from support as newer GPUs come along and require newer drivers, so really old GPUs may need to be segregated to a system that is not automatically getting driver updates. There is a relationship between driver version, the minimum and maximum CUDA level supported, and the minimum and maximum compute capability supported, and therefore the GPU models supported. (Eventually old GPUs become uneconomic to operate, as newer, more energy-efficient GPUs become available; or they fail before then or are replaced with faster hardware.)
  4. Installing the AMD or CUDA SDKs on a system can disable the OpenCL driver that was allowing the Intel IGP to run Mfakto until then.
  5. Some systems by design disable the IGP when a discrete GPU is installed, so the IGP can not be used for computation or display in that case. (Dell Optiplex 755 Core 2 Duo was an example)
  6. The Linux nouveau driver installs by default for NVIDIA GPUs, and prevents installation of the NVIDIA driver needed for CUDA computing. Nouveau puts up a pretty good fight, at least on the Debian version I tried; it can reportedly be defeated by blacklisting it.
  7. Mersenne code that uses multiple GPUs working together to process a single worktodo entry does not exist in the GIMPS community, to my knowledge. (Prime95 has this capability on CPU cores.) Physically linking GPUs with NVIDIA SLI or AMD Crossfire means multiple GPUs work together sharing the memory installed on one while the other's is idle. As fast as those interconnects are, they are slow compared to on-board memory bandwidth. For P-1 especially, and also in primality testing high exponents, lots of memory is a plus, so that loss of available GPU memory would be a drawback. Throughput is better when individual GPUs each work with their own full complement of memory, on separate assignments, via separate program instances. (Clarified with SELROC's input.)
  8. PCIe extenders can be used. Test well for reliability.
    Powered extenders are recommended, non-powered extenders are not.
    Extenders have a power limit of about 60 Watts. Beyond that additional gpu power plugs are required.
    Extenders are very common in mining the various types of digital coin.
    Bus load for most GPU Mersenne code is quite light, so using a 1x PCIe interface is not much of a limit on throughput.
  9. Some systems won't make use of a GPU connected to a PCI slot via a PCIe/PCI adapter if there's already a PCIe-connected GPU present. The adapter and GPU won't even be detected as present by Windows, appearing nonfunctional.
  10. The same adapter and external GPU that's ignored in the preceding can be used on a system that has PCI but no PCIe slots or other discrete GPUs (and also takes over display duties there from its IGP).
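For item 6, the usual blacklisting approach (assuming a Debian/Ubuntu-style system) is to drop a modprobe configuration file in place and rebuild the initramfs. A sketch; it writes to a temporary path here instead of the real /etc/modprobe.d/ so it can be tried without root:

```shell
# Sketch: blacklist the nouveau driver so the NVIDIA proprietary driver can install.
# Writes to a temp file for demonstration; the real target path is /etc/modprobe.d/.
CONF=$(mktemp)                  # in practice: /etc/modprobe.d/blacklist-nouveau.conf
cat > "$CONF" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
cat "$CONF"
# Then, on a real system (as root):
#   update-initramfs -u         # Debian/Ubuntu; rebuild initramfs
#   reboot
```

After the reboot, nouveau should no longer claim the card, and the NVIDIA installer can proceed.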

Top of reference tree:

Last fiddled with by kriesel on 2020-07-16 at 18:44
Old 2018-06-03, 17:02   #4
Application logging and tee

Logging of application output normally directed to the console is encouraged, except for gpuowl, which has comprehensive built-in logging. Stdout and stderr are where error messages, warnings, and normal program output typically appear. Most applications don't log much of this to a file themselves. Errors not trapped by the program may scroll off screen before the user has a chance to see them.

Per Chalsall, in Linux the append option for the tee command is either "-a" or "--append".

Re Windows PowerShell and tee use: I saw a warning somewhere that tee creates (overwrites) the destination file, even if a file by that name already exists. That would blow away the previously accumulated log every time the app halted (from the Windows display driver timeout or some other cause) and the batch wrapper restarted the application with tee redirecting a copy of screen output to the file, unless the alert user incorporated the batch loop count or %date%%time% into the tee destination file name in the batch file. That first overwrite could kill months of logging. The -append modifier for tee is not accepted at the command line in my test on Win7.

Win7 Pro PowerShell (neither -a nor -append is accepted):
PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -append
Tee-Object : A parameter cannot be found that matches parameter name 'append'.
At line:1 char:48
+ dir | tee-object -filepath tee-test.txt -append <<<<
+ CategoryInfo : InvalidArgument: (:) [Tee-Object], ParameterBindingException
+ FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

PS C:\Users\Ken\documents> dir | tee-object -filepath tee-test.txt -a
Tee-Object : A parameter cannot be found that matches parameter name 'a'.
At line:1 char:43
+ dir | tee-object -filepath tee-test.txt -a <<<<
+ CategoryInfo : InvalidArgument: (:) [Tee-Object], ParameterBindingException
+ FullyQualifiedErrorId : NamedParameterNotFound,Microsoft.PowerShell.Commands.TeeObjectCommand

PS C:\Users\Ken\documents>

The -append option is not present in the command-line tee (Tee-Object) help on Win7, but is in Win10.

Windows 8 & 8.1 status, unknown.

In the absence of tee -append, I use append redirection (>>). Note that some applications print part of a message to stderr and part to stdout, so redirecting only one stream captures only part of the message.
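On Linux (and with Win10's tee, per the above), the append behavior can be verified directly. A minimal demonstration with a temporary log file, also showing 2>&1 to merge stderr into the pipe so both halves of a split message are captured:

```shell
# Demonstrate that tee -a appends rather than overwriting (the Win7 Tee-Object
# pitfall described above), and that 2>&1 merges stderr into the pipe.
LOG=$(mktemp)
echo "run 1 output" 2>&1 | tee -a "$LOG" > /dev/null
echo "run 2 output" 2>&1 | tee -a "$LOG" > /dev/null   # simulated app restart
wc -l < "$LOG"                  # both runs survive in the log: 2 lines
```

With plain `tee` (no -a), the second run would have left only one line, which is exactly the months-of-logging failure mode described above.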

Top of reference tree:

Last fiddled with by kriesel on 2019-11-17 at 15:03 Reason: added gpuowl as exception
Old 2018-06-03, 18:18   #5
Memory error control

Memory errors might occur in the GPU VRAM, or in the system RAM, or, if particularly unlucky, both. Either can affect the GIMPS calculation results of GPU applications. Ideally we would all use highly reliable hardware, with ECC present and turned on.

On the CPU side:
System RAM can be tested with memtest86 or memtest86+.
Memtest86+ has the capability to prepare a table of bad physical locations.
System RAM is inexpensive, so bad modules can be detected and removed or replaced, and the system retested. Periodic retesting (annually?) is advisable.

For Linux systems, those badram tables from memtest86+ can be input to the Linux badram kernel patch, which allocates those bad physical locations and holds on to them, so they don't get allocated to an application we care about, such as GIMPS computations, whose results could be ruined by memory errors.
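As an illustration of the mechanism (the addresses below are made-up examples, not real bad locations), a badram-enabled kernel takes the memtest86+ address/mask pairs on the kernel command line, typically via GRUB's configuration:

```
# /etc/default/grub  (badram-enabled kernel; address,mask values are
# hypothetical examples; substitute the pairs memtest86+ reports)
GRUB_CMDLINE_LINUX="badram=0x7ddf0000,0xffffc000"
# then: update-grub && reboot
```

The kernel then reserves those page frames at boot so they are never handed to any process.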

For Windows systems, there is no equivalent user-applicable patch available, to my knowledge. For at least some versions there's a built-in alternative, described in detail elsewhere. Note the caution about possibly causing a boot failure if done incorrectly. This should be a temporary workaround while replacement RAM is on order.

For other OSes, there may be no alternative to RAM replacement or removal.

On the GPU side:
NVIDIA GPU memory can be tested with the -memtest option of CUDALucas.
AMD or NVIDIA GPU memory can be tested with gpumemtest.
Intel IGPs use system RAM, so their memory gets tested on the system side.

ECC is often not available, and if present and enabled it reduces performance. (Only high-end, pro-quality card models included ECC in their design.)

The gpu memory may or may not be subject to the virtual memory management of the host OS. It may be possible to develop code to do bad-gpu-memory lockout at the application level, or at the driver level. Whether that results in gpu memory fragmentation that causes problems is to be determined.

Top of reference tree:

Last fiddled with by kriesel on 2020-07-16 at 18:45
Old 2018-07-12, 15:50   #6
Windows 10

For Windows 10 setup for better privacy, there are guides which may be useful.

Setting Windows classic theme in Win 10:

Stopping Cortana:

On Windows 7 a while back, benchmarking CUDALucas under different gpu driver versions, before I figured out how to reliably stop automatic driver updates, removing the network cable temporarily worked very well to block updates.

Controlling when updates occur can be useful; this can be configured in Windows 10 to require your consent. If you don't trust it, there are fallbacks. Firewalling off the update servers is one possibility: in your firewall router, make an entry that claims the undesired server addresses are on the LAN side. If the update software's packets never cross the router, updates won't be downloaded or installed. If all else fails, or for simplicity, temporarily unplug the network cable.

Getting the Pro version of Windows provides remote desktop server capability. In some versions it also means better control of backups (more choice of destination for example).

Windows limits its page file to no more than 3 times the physical memory present, so it is possible to exhaust virtual memory at close to 4 times physical RAM. That's a showstopper. Take care to run with no more committed memory than about 3.5 times physical RAM, and get a motherboard capable of enough RAM expansion that this does not become an issue.

Top of reference tree:

Last fiddled with by kriesel on 2021-08-19 at 22:12 Reason: Add virtual memory/physical ratio
Old 2019-05-22, 19:37   #7
Running multiple computation types on multiple GPUs per system

It gets complicated. The client management software that's available for one computation type generally does not support another, and some are specific to CUDA or OpenCL in addition. Not all GIMPS GPU applications are supported by separate client management software. None of the GPU apps have integrated PrimeNet API communication. App instances, folders, files, etc. proliferate quickly if running multiple computation types on multiple GPUs, and more so if running multiple TF instances to extract higher performance. See the example system attached.
Bring them up one gpu and one application instance at a time.

Top of reference tree:
Attached Files
File Type: pdf condorette gimps configuration.pdf (11.8 KB, 294 views)

Last fiddled with by kriesel on 2019-11-17 at 15:05
Old 2020-10-21, 18:52   #8
Power settings

There are multiple reasons to configure for power usage.
One is to ensure the system continues to work on assignments with no user interaction.
Another is to optimize power efficiency.
Another is to reduce total power to levels within the system's current cooling capacity or power supply capacity, which may diminish over time, or to improve running temperatures.

Turn off what you are sure you don't need; options vary considerably by BIOS flavor. Then test the maybes. Some candidates: onboard video, USB, PCIe, audio, serial or parallel ports if present, extra NICs.
Reducing maximum cpu clock rate can reduce power usage, and improve power efficiency.

To help the system stay up and running prime95 or mprime or whatever on the cpu, and any relevant gpu applications, at full tilt while unattended, modify the default OS power saving settings. Test by leaving the system on continuously after a restart.
For Windows 10:
Click the lower-left Windows icon, then the gear that appears a bit above it, then System in the pane that appears, then "Power & sleep". Find the "when plugged in, PC goes to sleep" setting and select "Never".

Scroll down and find "additional power settings", click on it, then in the pane that opens, click change plan settings, then click "change advanced power settings". Adjust the many settings in the resulting window to the speed/power tradeoffs you want after considering utility cost.

Linux is reported to have lower overhead, so switching OS is a possibility.

Disabling unneeded services or applications helps.

Consider using the power limiting capabilities of nvidia-smi for NVIDIA GPUs, or the corresponding AMD GPU utilities, to reduce GPU power from maximum to a lower level that is more power efficient, especially during the air-conditioning season.
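For NVIDIA, the relevant nvidia-smi operations are querying the power limits and setting a lower one (setting requires root; the 150 W value below is an arbitrary example, not a recommendation). A sketch, guarded so it degrades gracefully on machines without the tool:

```shell
# Sketch: inspect and (as root) reduce the NVIDIA GPU power limit.
# The 150 W target is an arbitrary example value.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -q -d POWER        # show current, default, min and max power limits
  status=$?
  # sudo nvidia-smi -pl 150     # set a 150 W limit (does not persist across reboot)
else
  echo "nvidia-smi not found; NVIDIA driver not installed on this system"
  status=0
fi
```

Power limits set this way revert at reboot, so a startup script or persistence daemon is needed to keep them applied.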

Misc. hardware:
Remove any unnecessary hardware that's removable. Unused GPUs, excess RAM, DVD or CD drives, extra HDs, etc all draw power at idle. (Although reducing the number of occupied RAM channels will reduce computing performance somewhat.)

Power supply:
Use high-efficiency power supplies. Select an output rating of roughly 1.7 times the expected usage, so that the system will run near the power supply's peak efficiency, with some room for growth in installed components.
If using UPSes, select those for high efficiency too.

Locate the system in a cool area. Fans won't run as hard, saving some wattage. Lower temperatures improve both efficiency and component lifetime. Low locations (floor) tend to be cooler than elevated locations.

Dust, lint etc builds up on grilles, fan blades and heat sinks, reducing cooling effectiveness over time. Use clean dry air to blow it out, or an old toothbrush or other brush to loosen and remove it. Use necessary antistatic precautions.

Application features:
Power usage can be reduced by certain features of GIMPS applications.
Mprime/prime95 can use fewer cpu cores than the maximum, or use "throttle=", or both.
Reducing the frequency of saves to disk may help.
Application tuning may increase total wattage while raising throughput by a greater amount, improving power efficiency.
In CUDALucas and CUDAPm1, low values for "PoliteValue=" can be used as a form of GPU throttling.

(This post has been supplemented with content from other sources.)

Top of reference tree:

Last fiddled with by kriesel on 2021-05-07 at 00:07
Old 2021-09-01, 03:43   #9
WSL

Windows Subsystem for Linux (WSL) seems to me named mostly backwards: it provides Linux capabilities for Windows systems, Linux support on Windows (the opposite of something like WINE, which enables Windows capabilities on Linux).

WSL is a way to give a computer a dual (or more) simultaneous personality. More than one installation, more than one distribution, more than one distribution version, more than one WSL version: all possible. Ubuntu 18.04 LTS or 20.04 LTS, Kali, Debian, Alpine WSL, and openSUSE Leap 15.3 or 15.2 are all free; Fedora Remix is ~$10.

Comparison of WSL1 and WSL2

Installation of WSL
In a nutshell, if you want WSL1 (which will install on systems lacking the hardware virtualization support required by WSL2/Hyper-V): follow Microsoft's manual installation instructions, step 1, restart, step 6, done. Most of my WSL-installed test systems are hardware-limited to WSL1.

On WSL1, with an explorer window launched from the Windows side, I found the well-hidden WSL user home directory's mlucas folder with explorer's search function. Windows could apparently delete files there, but they would reappear later! Enough later that I had already deleted 7 unneeded files before the issue became apparent. All attempts to date to clean that situation up have failed; the files now indicate 0 hard links.

rm, rm -f, chmod to full permissions for everyone (777) on the files and containing directory followed by rm -f, even sudo rm -f: all respond "permission denied". Perhaps a cold system restart followed by scandisk on the Windows side, or the Linux equivalent (fsck) in Ubuntu/WSL, will handle it. After fresh backups...

Accessing Linux/WSL files from Windows 10
Works in WSL1 or 2. From the Ubuntu-launched explorer session, drag and drop to a Windows-launched explorer window works.
The opposite direction, from Windows-launched to Linux-launched, had permission problems when I tried it. The two explorer windows look identical; adopt some convention, such as Linux on the left, to limit mistakes.


Because WSL performs some virtualization, it makes the core number specification in Mlucas relatively ineffective. I observed many other cores in a variety of test hardware joining in the party, and what appeared to be substantial core-hopping overhead on large-core-count systems, consistent with the considerable discrepancy between timings on Ernst's CentOS Xeon Phi 7250 and my Ubuntu/WSL1/Win10 Pro x64 Xeon Phi 7250 for the same FFT length, Mlucas version, and cores specification. A dual-12-core x2HT system was somewhat better behaved but not immune to socket-straddling.

This apparent core-hopping involved all logical processors, usually in one socket, for Mlucas core counts <= logical-processor-count/socket, and made attempts to compare use of 2 or 4 separate cores versus 2 or 4 with x2HT futile. One way of mitigating the effect may be to fully load all logical processors with assigned work. Causes, mitigation, and eventual software solutions are yet to be investigated. The issue seemed particularly severe on Xeon Phi with 64 or 68 cores (256 or 272 logical processors), less so on lower-core-count, lower-hyperthreading-multiplier machines.

There's something about WSL1 that prevents using more than 64 logical processors in an Mlucas run. I think it's related to Windows handling many-core systems as if they were NUMA, with processor groups of no more than 64 logical processors even when they're single-socket, or to a design decision made for WSL1. For example, a KNL Xeon Phi 7210 presents in Task Manager as 4 NUMA groups of 64 logical processors. A KNL Xeon Phi 7250 (68 cores by 4-way HT = 272 logical processors) presents as 5 NUMA groups. WSL1-hosted Ubuntu indicates 16 cores by x4 HT = 64 logical processors. Even after launching and loading up 4 WSL sessions with 64-core Mlucas workloads each, a 5th also indicates 64 processors. It appears that the processor count is both subsetted per running Ubuntu window and, in total, can exceed the hardware capacity: 5 x 64 = 320 vs. 272 logical processors supported in hardware. I'm unable to install or test WSL2 behavior on Xeon Phi, because KNL lacks the hardware virtualization support required for WSL2.
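The logical-processor subsetting described above is easy to check from inside a WSL (or any Linux) session; a quick sketch:

```shell
# Show how many logical processors this Linux/WSL environment exposes.
# Under WSL1 on a many-core machine, expect nproc to top out at 64.
nproc
grep -c ^processor /proc/cpuinfo   # count from /proc/cpuinfo (may exceed nproc
                                   # if CPU affinity is restricted)
```

Comparing these numbers against Task Manager's logical processor count on the Windows side shows directly whether the 64-processor cap applies.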

Top of reference tree:

Last fiddled with by kriesel on 2021-09-03 at 22:53
Old 2021-09-01, 04:21   #10
Linux


A link the fans will like:

System monitoring utilities include top, vmstat, and nmon. For top, -d (delay expressed in seconds), the c option (show the full command line of a process), and the 1 option (show individual core activity) are useful; see the top man page for a full description of the options.
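The same top options work in batch mode, which is handy for redirecting snapshots to a log; a one-shot sketch:

```shell
# One snapshot of system state with top in batch mode (-b), one iteration
# (-n 1); -d sets the delay between iterations when more than one is taken.
top -b -n 1 -d 1 | head -n 5    # header lines: uptime, tasks, CPU, memory
```

Appending such snapshots to a file on a schedule (cron) gives a crude utilization history without any extra tooling.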

I was surprised (shocked, really) to find remote desktop access unusable in current releases of Ubuntu.

Top of reference tree:

Last fiddled with by kriesel on 2021-09-01 at 15:40
Old 2021-09-01, 15:56   #11
Windows - Linux dictionary

(very early draft)
Task                            Windows     Linux
change current directory        cd          cd
change current directory up     cd..        cd ..
change file attributes/perms    attrib      chmod
check system time and date      time        date
copy a file                     copy        cp
delete a file                   del         rm
display contents of a text file type        cat
executable file type            .exe        (nothing)
identify OS version             ver         cat /etc/os-release | grep PRETTY_NAME
make a directory                mkdir       mkdir
provide help                    help        man
rename a file                   rename      mv
show contents of a directory    dir         ls
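The Linux column entries can be tried directly; for example, the OS-version line from the table:

```shell
# Identify the OS version (Linux side of the table above).
grep PRETTY_NAME /etc/os-release
```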
Top of reference tree:

Last fiddled with by kriesel on 2021-09-01 at 16:22