Partial checklist for system maintenance & reliability

Some things to check if system uptime, or reliability in general, is worse than you'd like.
  1. How old is the hardware? (Hard drive, etc. not too ancient? All components and connectors well seated and making good contact?)
  2. Recent backups, running on schedule, and monitored well enough to confirm they actually run to completion? Restore process tested and practiced, to confirm it's possible to restore from those backups? N-deep backups (several generations kept), so that an accidental deletion not noticed before the next backup runs doesn't necessarily mean the data is gone forever?
  3. Before making any changes, are there lengthy computations that are nearly done and could be completed first? Application updates or other changes might cause a computation to restart from the beginning or be unable to complete. Much better to lose some work that had just begun than to lose work that had been running for days or weeks and was almost finished.
  4. How well patched is the system?
  5. How well is it protected from power interruption or transients or sags? (Voltage regulating UPS?)
  6. Do you have a way of monitoring the line voltage?
  7. What do system logs have to say?
  8. How detailed and complete is your system logging? (Is some logging going to another system or to a non-boot storage device? Will it survive a hard drive problem in the system of interest?)
  9. What OS is it running?
  10. What other software?
  11. Is it safe from children and other small animals?
  12. Do system components and memory pass reliability tests? What, if anything, does (or would) a serious diagnostic run tell you?
  13. What assumptions are you making without even realizing it?
  14. Temperature of components and ambient environment in a reasonable range?
  15. Relative humidity in a reasonable range?
  16. All fans in the system in good working order? Grilles and components free of dust, lint, and pet hair?
  17. A full complement of drivers, of reliable versions, typically up to date except for recent releases with known issues?
  18. Well secured?
  19. Correct power supply output voltages, and adequate current on every voltage rail for all the components now installed? System components get added, and power supply components degrade over time. Wattage required varies with operating temperature, clock rate, program execution, etc.
  20. Is the system BIOS up to date? (thanks SELROC) The various components' firmware also?
  21. If the worst happens (dead drive, and backups failed, were not current enough, or can't be restored for some reason), do you have info on one or more good data recovery companies that can repair a drive, or open it in a clean room, and get the data onto new media for a price? Know the price, and maybe backup in depth will seem more economical.
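
For item 2, here is a rough sketch of what "monitored" could mean in practice: given the completion times of your backups, flag the case where the newest one is too old or too few generations are kept. The function name `check_backups` and the thresholds (26 hours, 3 generations) are my own assumptions for illustration, not a standard. How you collect the timestamps (file modification times, backup tool reports, etc.) depends on your setup and is left out.

```python
from datetime import datetime, timedelta


def check_backups(timestamps, now, max_age=timedelta(hours=26), min_depth=3):
    """Return a list of problems found; an empty list means nothing to report.

    timestamps: completion times (datetime) of the backups that exist.
    max_age:    assumed limit on how old the newest backup may be.
    min_depth:  assumed minimum number of generations to keep (N-deep).
    """
    if not timestamps:
        return ["no backups found"]

    problems = []
    age = now - max(timestamps)  # age of the most recent backup
    if age > max_age:
        problems.append(f"newest backup is {age} old")
    if len(timestamps) < min_depth:
        problems.append(
            f"only {len(timestamps)} generation(s) kept, want {min_depth}"
        )
    return problems


# Hypothetical example: two backups, the newest one three days old.
now = datetime(2024, 1, 10, 12, 0)
for problem in check_backups([now - timedelta(days=3), now - timedelta(days=4)], now):
    print(problem)
```

This only tells you that backups exist and are recent. It says nothing about whether they can be restored, so the restore test in item 2 is still needed.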

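For item 19, the arithmetic is simple enough to sketch: add up what the installed components draw on each rail and compare it with what the power supply is rated for, leaving some margin for aging. Everything here is illustrative. The function name `rail_headroom`, the 80% derating factor, and all the ampere figures are made-up assumptions, not measurements or manufacturer specs. Use the numbers from your own PSU label and component datasheets.

```python
def rail_headroom(psu_amps, loads, derate=0.8):
    """Return spare amps per rail; a negative value means the rail is overloaded.

    psu_amps: rated current per rail, e.g. {"12V": 30.0}.
    loads:    per-component draw per rail, e.g. {"cpu": {"12V": 10.0}}.
    derate:   assumed fraction of the rated current treated as usable,
              to allow for an aging power supply.
    """
    headroom = {}
    for rail, rated in psu_amps.items():
        drawn = sum(load.get(rail, 0.0) for load in loads.values())
        headroom[rail] = rated * derate - drawn
    return headroom


# Hypothetical figures, for illustration only.
psu = {"12V": 30.0, "5V": 20.0, "3.3V": 20.0}
components = {
    "cpu": {"12V": 10.0},
    "gpu": {"12V": 12.5},
    "drives": {"12V": 2.0, "5V": 3.0},
}

for rail, spare in rail_headroom(psu, components).items():
    status = "OK" if spare >= 0 else "OVERLOADED"
    print(f"{rail}: {spare:+.1f} A spare ({status})")
```

With these made-up numbers the 12V rail comes out slightly short once the margin is applied, which is the kind of thing that creeps in when components get added over the years. Since draw varies with temperature, clock rate, and load, use peak figures rather than idle ones.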