mersenneforum.org  

Old 2022-01-07, 16:36   #23
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
5564₁₀ Posts

Quote:
Originally Posted by chalsall View Post
If you want to get serious about determining the problem(s) (and have a bit of fun)... I highly recommend the Cacti monitoring system.

The graphs are pretty! And very useful to assist with correlation determination.
Thanks! This looks interesting for several of my LAN questions.
Old 2022-01-07, 16:38   #24
S485122
 
"Jacob"
Sep 2006
Brussels, Belgium
1954₁₀ Posts

If the machines are kept on (and you do not have the cold-start problem), you might play with the fan profiles through the BIOS or a program, for example modifying the case-fan profile so that the fans even stop below a certain temperature, to keep the heat in the case. I would not stop the processor-cooler fans: there must be a minimum of air circulation in the case to avoid hot spots.
Old 2022-01-07, 17:51   #25
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts

Quote:
Originally Posted by EdH View Post
Thanks! This looks interesting for several of my LAN questions.
Yup. Network monitoring is what Cacti (and RRDtool) were originally created for. But, at the end of the day, anything which can be sampled regularly can be logged, graphed, and analyzed. Temps, utilization, latency, UPS stats, etc.

No serious network engineer doesn't deploy this (or something similar).
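
As a rough illustration of the "sample regularly, then log" idea (this is not how Cacti or RRDtool work internally), here is a minimal Python sketch. The function names `read_sysfs_temp` and `log_samples` are hypothetical, and the sysfs path is an assumption: it exists on many Linux machines but not all, which is why the demo uses a fake reader instead.

```python
import csv
import io
import time
from typing import Callable, TextIO


def read_sysfs_temp(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read one Linux thermal zone (assumed path); the kernel reports millidegrees C."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def log_samples(read_fn: Callable[[], float], out: TextIO,
                count: int, interval_s: float) -> None:
    """Write `count` rows of (unix timestamp, reading) to `out` as CSV."""
    writer = csv.writer(out)
    for i in range(count):
        writer.writerow([f"{time.time():.0f}", f"{read_fn():.1f}"])
        if i < count - 1:
            time.sleep(interval_s)  # fixed sampling interval between readings


if __name__ == "__main__":
    # Demo with a fake reader so the sketch runs on any machine.
    fake_readings = iter([48.0, 47.5, 43.0])
    buf = io.StringIO()
    log_samples(lambda: next(fake_readings), buf, count=3, interval_s=0.0)
    print(buf.getvalue(), end="")
```

On a real machine one would pass `read_sysfs_temp` (or any other callable returning a number) and an open file instead of the in-memory buffer; the resulting CSV can then be graphed with whatever tool is at hand.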
Old 2022-01-08, 14:37   #26
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
15BC₁₆ Posts

Here's a screenshot of the temps for the machine I had trouble with already this year. This was taken while running all 24 threads as CADO-NFS clients:
Code:
top - 08:59:50 up 1 day, 21:50,  2 users,  load average: 23.28, 23.25, 23.12
Tasks: 418 total,   2 running, 416 sleeping,   0 stopped,   0 zombie
%Cpu(s): 94.2 us,  1.5 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  24036.7 total,  16509.3 free,   3601.6 used,   3925.8 buff/cache
MiB Swap:  16373.0 total,  16373.0 free,      0.0 used.  20004.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
 190326 math98    20   0 1476540 863236   9532 S 763.8   3.5  20:09.80 las      
 190369 math98    20   0 1477580 860944   9408 S 757.1   3.5  12:43.95 las      
 190286 math98    20   0 1480620 867696   9532 S 756.1   3.5  21:54.07 las
Thanks for all the suggestions everyone!
Attached thumbnail: tempsm98.png
Old 2022-01-08, 14:50   #27
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
2²·13·107 Posts

Maybe my real trouble is a Motherboard sensor. Does temp2 look suspicious to anyone else?
Attached thumbnail: temps2m98.png
Old 2022-01-08, 16:25   #28
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados
11374₁₀ Posts

Quote:
Originally Posted by EdH View Post
Maybe my real trouble is a Motherboard sensor. Does temp2 look suspicious to anyone else?
Definitely! Unless you've got some serious cryogenic cooling going on in that box...

For comparison, here are my sensors:
Code:
chalsall@hobbit:~$ sensors
nouveau-pci-0100
Adapter: PCI adapter
fan1:         897 RPM
temp1:        +48.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +16.8°C  (crit = +20.8°C)
temp2:        +27.8°C  (crit = +119.0°C)
temp3:        +29.8°C  (crit = +119.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +64.0°C  (high = +82.0°C, crit = +100.0°C)
Core 0:        +61.0°C  (high = +82.0°C, crit = +100.0°C)
Core 1:        +60.0°C  (high = +82.0°C, crit = +100.0°C)
Core 2:        +59.0°C  (high = +82.0°C, crit = +100.0°C)
Core 3:        +60.0°C  (high = +82.0°C, crit = +100.0°C)
Core 4:        +59.0°C  (high = +82.0°C, crit = +100.0°C)
Core 5:        +62.0°C  (high = +82.0°C, crit = +100.0°C)

nvme-pci-0b00
Adapter: PCI adapter
Composite:    +35.9°C  (low  =  -0.1°C, high = +74.8°C)
                       (crit = +79.8°C)
Old 2022-01-08, 17:13   #29
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
12674₈ Posts

Even though I ran sudo sensors-detect, I've been suspicious of my readings anyway. This is a thrown-together dual-processor machine, 6c/12t each. Here's my total sensors display, which is heavy with alarms:
Code:
$ sensors
w83627dhg-isa-0a10
Adapter: ISA adapter
Vcore:       728.00 mV (min =  +0.00 V, max =  +1.74 V)
in1:         712.00 mV (min =  +0.68 V, max =  +0.77 V)
AVCC:          3.34 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:         3.34 V  (min =  +2.98 V, max =  +3.63 V)
in4:           1.02 V  (min =  +0.51 V, max =  +1.13 V)
in5:         712.00 mV (min =  +1.28 V, max =  +0.65 V)  ALARM
in6:         712.00 mV (min =  +1.04 V, max =  +0.33 V)  ALARM
3VSB:          3.39 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:          3.15 V  (min =  +2.70 V, max =  +3.63 V)
fan1:           0 RPM  (min =  285 RPM, div = 128)  ALARM
fan2:           0 RPM  (min =  239 RPM, div = 128)  ALARM
fan3:           0 RPM  (min =  659 RPM, div = 128)  ALARM
fan5:           0 RPM  (min = 10546 RPM, div = 128)  ALARM
temp1:        +48.0°C  (high = -128.0°C, hyst = +87.0°C)  sensor = thermistor
temp2:        -88.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
temp3:        +43.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
cpu0_vid:    +0.000 V
intrusion0:  ALARM

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +36.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +29.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +33.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +32.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +34.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +35.0°C  (high = +80.0°C, crit = +96.0°C)

w83795adg-i2c-0-2f
Adapter: SMBus I801 adapter at 0400
in0:           1.19 V  (min =  +0.67 V, max =  +1.49 V)
in1:           1.18 V  (min =  +0.67 V, max =  +1.49 V)
in2:           1.52 V  (min =  +1.35 V, max =  +1.65 V)
in3:           1.53 V  (min =  +1.35 V, max =  +1.65 V)
in4:           1.26 V  (min =  +1.13 V, max =  +1.38 V)
in5:           1.26 V  (min =  +1.13 V, max =  +1.38 V)
in6:           1.83 V  (min =  +1.63 V, max =  +2.00 V)
in7:           1.47 V  (min =  +1.42 V, max =  +1.53 V)
in11:          1.12 V  (min =  +1.48 V, max =  +1.82 V)  ALARM
+3.3V:         3.24 V  (min =  +2.96 V, max =  +3.63 V)
3VSB:          3.26 V  (min =  +2.96 V, max =  +3.63 V)
Vbat:          3.13 V  (min =  +2.70 V, max =  +3.63 V)
fan1:        2566 RPM  (min =  712 RPM)
fan2:        2495 RPM  (min =  712 RPM)
fan3:           0 RPM  (min =  712 RPM)  ALARM
fan4:        4838 RPM  (min =  712 RPM)
fan5:           0 RPM  (min =  712 RPM)  ALARM
fan6:        3040 RPM  (min =  712 RPM)
temp1:        +14.2°C  (high = +127.0°C, hyst = +127.0°C)
                       (crit = +127.0°C, hyst = +127.0°C)  sensor = thermal diode
temp5:        +10.0°C  (high = +127.0°C, hyst = +127.0°C)
                       (crit = +75.0°C, hyst = +70.0°C)  sensor = thermistor
temp7:        +40.2°C  (high = +95.0°C, hyst = +92.0°C)
                       (crit = +95.0°C, hyst = +92.0°C)  sensor = Intel PECI
temp8:        +43.0°C  (high = +95.0°C, hyst = +92.0°C)
                       (crit = +95.0°C, hyst = +92.0°C)  sensor = Intel PECI
intrusion0:  OK
beep_enable: enabled

coretemp-isa-0001
Adapter: ISA adapter
Core 0:       +37.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +32.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +30.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +34.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +39.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +39.0°C  (high = +80.0°C, crit = +96.0°C)
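
Two things stand out in the listing above: temp2 reads -88.0°C, and in5 and in6 have a minimum limit above their maximum. That pattern suggests the limits were probably never set sensibly for this board, rather than a real voltage fault. The Python sketch below shows one way such lines could be flagged automatically. It assumes the text layout printed by `sensors` as shown in this thread. The function name `find_suspect_lines` and the plausible range of -40 to 125 °C are arbitrary assumptions for illustration, not part of lm-sensors.

```python
import re

# Label and temperature value, e.g. "temp2:        -88.0°C  (high = ...".
TEMP_RE = re.compile(r"^\s*([^:]+):\s+([+-]?\d+(?:\.\d+)?)\s*°C")
# Voltage limits, e.g. "(min =  +1.28 V, max =  +0.65 V)".
LIMIT_RE = re.compile(
    r"min\s*=\s*([+-]?\d+(?:\.\d+)?)\s*V,\s*max\s*=\s*([+-]?\d+(?:\.\d+)?)\s*V"
)

SAMPLE = """\
in1:         712.00 mV (min =  +0.68 V, max =  +0.77 V)
in5:         712.00 mV (min =  +1.28 V, max =  +0.65 V)  ALARM
temp1:        +48.0°C  (high = -128.0°C, hyst = +87.0°C)  sensor = thermistor
temp2:        -88.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
temp3:        +43.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
"""


def find_suspect_lines(sensors_text, temp_range=(-40.0, 125.0)):
    """Return (label, reason) pairs for readings or limits that look implausible."""
    suspects = []
    for line in sensors_text.splitlines():
        label = line.split(":", 1)[0].strip()
        temp = TEMP_RE.match(line)
        if temp:
            value = float(temp.group(2))
            if not temp_range[0] <= value <= temp_range[1]:
                suspects.append(
                    (label, f"temperature {value:+.1f} C outside plausible range")
                )
        limits = LIMIT_RE.search(line)
        if limits and float(limits.group(1)) > float(limits.group(2)):
            suspects.append((label, "min limit exceeds max limit"))
    return suspects


if __name__ == "__main__":
    for label, reason in find_suspect_lines(SAMPLE):
        print(f"{label}: {reason}")
```

The check only looks at the reading and the voltage min/max pair. Odd temperature limits such as temp1's high of -128.0°C are not flagged by this sketch and would need an extra rule.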
Old 2022-01-08, 18:08   #30
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados
26156₈ Posts

Quote:
Originally Posted by EdH View Post
Even though I ran sudo sensors-detect, I've been suspicious of my readings, anyway.
Pilots will always tell you never to implicitly trust your instruments.

I'd suggest you spend some "quality time" down in the BIOS interface, seeing if you're actually running "in spec". Perhaps a "reset" and/or a BIOS upgrade (if available) would be in order.

Also, I have found some situations where sensors-detect doesn't get all the values correct.
Old 2022-01-08, 19:07   #31
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
1010110111100₂ Posts

Quote:
Originally Posted by chalsall View Post
Pilots will always tell you never to implicitly trust your instruments.

I'd suggest you spend some "quality time" down in the BIOS interface, seeing if you're actually running "in spec". Perhaps a "reset" and/or a BIOS upgrade (if available) would be in order.

Also, I have found some situations where sensors-detect doesn't get all the values correct.
It's the latest version, unless there's something really recent, and I did a little BIOS peeking, but I just turned it loose and it's not in a convenient location for BIOS work ATM. It's been reset a few times. The most recent was a couple days ago. It ran pretty well through the warmer times. It's just been the cold that's affecting it. I'm kind of thinking the dual processors may be affecting some of the sensor values.
Old 2022-01-10, 23:12   #32
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts

Could also be condensation. Do you see any when you look at the system, around the time of day of the restarts?
Old 2022-01-11, 00:55   #33
EdH
 
"Ed Hall"
Dec 2009
Adirondack Mtns
2²·13·107 Posts

Quote:
Originally Posted by Mark Rose View Post
Could also be condensation. Do you see any when you look at the system, around the time of day of the restarts?
The systems run 24/7 and are not comfortably accessible. A restart is normally only due to a failure. On a rare occasion the failure is the grid.* For the near term, I have configured a secondary factoring script that is paused during the primary script's processing. The secondary one is never-ending, unless something fails. We have a sub-zero forecast for the next few days. I'll see how the systems do.

* The systems are configured to power up after AC loss, but I hesitate to use that as a reset.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
