mersenneforum.org  

2019-08-16, 21:23   #12
hansl
Apr 2019
5×41 Posts

I found there is a firmware update for the NVMe SSD:

Version P3CR013
Released July 23, 2019

This firmware update addresses the following items:
  • Optimized behavior in Link Power Management.
  • Improved background garbage collection methodology
  • Improved thermal management
  • Implemented NGUID and EUI64 NVMe Identifiers.

I've already upgraded and started testing mprime again at 8192MB usage. It seems better (it has been running about 11 minutes so far, longer than the last couple of attempts); fingers crossed they fixed the issue with this.

2019-08-16, 21:43   #13
hansl
Apr 2019
5·41 Posts

Since the main problem seems to be solved now (knock on wood), I am curious about this:
Quote:
Originally Posted by kriesel
I routinely run cpu P-1 at 8 to 32 GB allowed in prime95
Is there a reason 32GB is the max allowed? I have another system with 96GB total, and was hoping to try running up to ~64GB of that.

2019-08-16, 23:59   #14
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts

Quote:
Originally Posted by hansl
Since the main problem seems to be solved now (knock on wood), I am curious about this:

Is there a reason 32GB is the max allowed? I have another system with 96GB total, and was hoping to try running up to ~64GB of that.
No. I chose 32 because it was a 128GB system with 4 prime95 workers. I could potentially run 3 instances of stage 2 at once, stage one on another, and still have enough total ram to run the OS and other apps and avoid paging. I recall I have tried it at 64GB with 1 P-1 instance and 3 PRP workers and had no issues, and it didn't use the whole 64. If there is a limit, I'm not aware of it. Haven't read the source code for that.

Last fiddled with by kriesel on 2019-08-17 at 00:01

2019-08-17, 00:31   #15
hansl
Apr 2019
5·41 Posts

OK, because I think I tried setting Memory in prime.txt to 65536, and it looks like it got reset to 32768.

Anyways, I see what you mean. It looks like it only needs a max of ~20GB for the assigned bounds in the 92M range, for example.
Code:
Using 20102MB of memory.  Processing 480 relative primes (0 of 480 already processed).
Since lower memory limits seem to only change the number of relative primes processed at once, and it's doing the max for that job anyway, there is no reason to allocate more.

2019-08-17, 02:58   #16
hansl
Apr 2019
5×41 Posts

Well, I guess I spoke too soon. That particular run went smoothly for many hours, then I tried stopping and restarting mprime and it crashed the same way, with all the same NVMe errors again.

Should I just return this drive for a different model?

2019-08-17, 03:40   #17
hansl
Apr 2019
11001101₂ Posts

I searched around some more about the error and found some related bug reports about nvme not coming out of some power saving mode.

https://bugs.launchpad.net/ubuntu/+s...d/+bug/1682704
which recommends setting the grub parameter:
Code:
nvme_core.default_ps_max_latency_us=6000
However, the last commenter there said that didn't even work for them.

Then there's this page:
https://wiki.archlinux.org/index.php...ate_drive/NVMe
Which says you can disable the power saving entirely by setting the same param to 0
Code:
nvme_core.default_ps_max_latency_us=0
So, not wanting to fuss with trying various values and more failures, I've decided to go for disabling it entirely.
Can I just have a stable system now?! *shaking fist at the compute gods*

edit: NOPE. Crashed again! *sigh* this is getting tiring.

Last fiddled with by hansl on 2019-08-17 at 03:52

2019-08-17, 07:08   #18
paulunderwood
Sep 2002
Database er0rr
5×937 Posts

Here: https://forums.opensuse.org/showthre...91#post2907091 they recommend 5500.

2019-08-17, 16:28   #19
hansl
Apr 2019
5·41 Posts

I think I found the issue. I had previously set another grub parameter "pcie_aspm=off" to get rid of a bunch of dmesg pcie error spam about "BadTLP" and "BadDLLP" (apparently caused by Nvidia card). I had found that solution from this thread:
https://forum.level1techs.com/t/thre...rors/118977/88

I guess turning off aspm is a bad idea for NVMe stability though. So I've re-enabled that and things seem better, I think. I have not been able to trigger the crashing now that I've removed the "pcie_aspm=off" from grub.
I'm back to getting all the BadTLP errors, etc. that the above thread is about, but they seem harmless I guess.

Now that I think about it, I had changed that setting about the same time I decided to try P-1 work, so it might also be triggered by LL work, or just high loads in general. I'm not too interested in testing further permutations though, as long as my system is stable.
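
(A side note for anyone following along: to confirm which of these boot parameters is actually active after editing grub, the running kernel's command line can be inspected. Below is a minimal Python sketch, assuming a Linux system where /proc/cmdline is readable. The helper names and the sample line are made up for illustration, and the naive whitespace split ignores quoted values.)

```python
from pathlib import Path

# The two parameters discussed in this thread.
WATCHED = ("pcie_aspm", "nvme_core.default_ps_max_latency_us")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def report(cmdline: str) -> dict:
    """Return the watched parameters that are present, with their values."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WATCHED if name in params}


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    # Hypothetical sample line, used only when /proc/cmdline does not exist.
    sample = "ro quiet splash pcie_aspm=off nvme_core.default_ps_max_latency_us=0"
    line = proc.read_text() if proc.exists() else sample
    print(report(line) or "neither parameter is set on this boot")
```

An empty result means neither parameter reached the kernel, which usually points to grub not having been regenerated after the edit.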

2019-08-18, 14:57   #20
hansl
Apr 2019
5·41 Posts

Well, it crashed again in the middle of the night. Unfortunately I didn't have an ssh session connected to monitor anything. The display was off when I got to it this morning, and pressing keys or moving the mouse wouldn't wake it. I also could not ssh in, so I hit the reset button. The logs showed no particular error aside from the "expected" ones which come from the nvidia device, mentioned in my previous post. You can see where I rebooted it at 7:05am. Before that it was writing a line around 1:49am, and just got cut off at "Aug 1", where I'm guessing the drive went into read-only mode again.

Code:
...
Aug 18 01:48:52 gypsy kernel: [56310.005183] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:48:52 gypsy kernel: [56310.005196] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:48:52 gypsy kernel: [56310.005208] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:48:52 gypsy kernel: [56310.005212] pcieport 0000:00:03.1:    [12] Replay Timer Timeout  
Aug 18 01:49:03 gypsy kernel: [56321.303235] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:03 gypsy kernel: [56321.303250] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:49:03 gypsy kernel: [56321.303259] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:49:03 gypsy kernel: [56321.303263] pcieport 0000:00:03.1:    [12] Replay Timer Timeout  
Aug 18 01:49:03 gypsy kernel: [56321.314419] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:03 gypsy kernel: [56321.314432] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:49:03 gypsy kernel: [56321.314441] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:49:03 gypsy kernel: [56321.314444] pcieport 0000:00:03.1:    [12] Replay Timer Timeout  
Aug 18 01:49:07 gypsy kernel: [56325.237633] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:07 gypsy kernel: [56325.237649] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
Aug 18 01:49:07 gypsy kernel: [56325.237660] pcieport 0000:00:03.1:   device [1022:1453] error status/mask=00000040/00006000
Aug 18 01:49:07 gypsy kernel: [56325.237663] pcieport 0000:00:03.1:    [ 6] Bad TLP               
Aug 18 01:49:10 gypsy kernel: [56327.452861] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 1Aug 18 07:05:41 gypsy kernel: [    0.000000] Linux version 4.15.0-58-generic (buildd@lcy01-amd64-013) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 (Ubuntu 4.15.0-58.64-generic 4.15.18)
Aug 18 07:05:41 gypsy kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-58-generic root=UUID=2c3ef177-4d57-4f22-9f3c-d0fafa49b3c6 ro quiet splash vt.handoff=1
Aug 18 07:05:41 gypsy kernel: [    0.000000] KERNEL supported cpus:
Aug 18 07:05:41 gypsy kernel: [    0.000000]   Intel GenuineIntel
Aug 18 07:05:41 gypsy kernel: [    0.000000]   AMD AuthenticAMD
Aug 18 07:05:41 gypsy kernel: [    0.000000]   Centaur CentaurHauls
...
By the way, I ended up installing Linux Mint 19.2 (which is why the kernel log says Ubuntu), but it's the same sort of problem that I had on openSUSE Tumbleweed.
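
(For longer logs than the excerpt above, it may help to tally the corrected AER errors by reporting port and error name rather than reading them line by line. This is a rough Python sketch, assuming the detail lines keep the "[bit] name" shape seen in this log. The regex and function name are my own, and the sample is a handful of lines from the excerpt with the syslog prefix trimmed.)

```python
import re
from collections import Counter

# A few detail lines copied from the log excerpt (timestamp and host trimmed).
SAMPLE = """\
[56310.005212] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
[56321.303263] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
[56321.314444] pcieport 0000:00:03.1:    [12] Replay Timer Timeout
[56325.237663] pcieport 0000:00:03.1:    [ 6] Bad TLP
"""

# Matches an AER detail line: "<device>:   [bit] <error name>".
DETAIL = re.compile(r"(?P<dev>\S+):\s+\[\s*(?P<bit>\d+)\]\s+(?P<name>.+?)\s*$")


def tally(log_text: str) -> Counter:
    """Count (device, error name) pairs found in AER detail lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DETAIL.search(line)
        if match:
            counts[(match.group("dev"), match.group("name"))] += 1
    return counts


for (device, name), count in tally(SAMPLE).most_common():
    print(f"{device}  {name}: {count}")
```

All of these come from the port at 00:03.1. Which device sits behind that port is not shown in the log, so it would need checking separately, for example with `lspci -tv`.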

I read some more about this latency setting here:
https://wiki.archlinux.org/index.php...ate_drive/NVMe

Quote:
Originally Posted by https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe
Andy Lutomirski has created a patchset which fixes powersaving for NVME devices in linux. The patch has been merged into mainline kernel v4.11.
I'm currently on kernel 4.15.0-58-generic, so that should include the patch no problem.

Quote:
Originally Posted by https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe
To test if NVME Power Management is working, install nvme-cli, and run "nvme get-feature -f 0x0c -H /dev/nvme[0-9]"
Code:
hans@gypsy:~$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
	Autonomous Power State Transition Enable (APSTE): Enabled
	Auto PST Entries	.................
	Entry[ 0]   
	.................
	Idle Time Prior to Transition (ITPT): 100 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 1]   
	.................
	Idle Time Prior to Transition (ITPT): 100 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 2]   
	.................
	Idle Time Prior to Transition (ITPT): 100 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 3]   
	.................
	Idle Time Prior to Transition (ITPT): 700 ms
	Idle Transition Power State   (ITPS): 4
	.................
	Entry[ 4]   
	.................
	Idle Time Prior to Transition (ITPT): 0 ms
	Idle Transition Power State   (ITPS): 0
	.................

*** Entries 5-31 trimmed for brevity, all same 0ms, 0 state as Entry[ 4] *** -hansl

	.................
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 18 64 00 00 00 00 00 00 18 64 00 00 00 00 00 00 ".d.......d......"
0010: 18 64 00 00 00 00 00 00 20 bc 02 00 00 00 00 00 ".d.............."
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
Quote:
Originally Posted by https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe
When APST is enabled the output should contain "Autonomous Power State Transition Enable (APSTE): Enabled" and there should be non-zero entries in the table below indicating the idle time before transitioning into each of the available states.
So the first 4 states ([0] through [3]) look valid to me.
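
(The raw hex dump at the end of that output can be cross-checked against the decoded entries. In the NVMe specification's APST table, each entry is a 64-bit little-endian value with the target power state, ITPS, in bits 7:3 and the idle time, ITPT, in milliseconds in bits 31:8. A small Python sketch decoding the first five entries of the dump follows; the function name is just for illustration.)

```python
import struct

# First 40 bytes of the hex dump: five 8-byte entries (Entry[0] to Entry[4]).
RAW = bytes.fromhex(
    "1864000000000000" "1864000000000000"
    "1864000000000000" "20bc020000000000"
    "0000000000000000"
)


def decode_apst(raw: bytes) -> list:
    """Return (ITPS, ITPT in ms) for each 8-byte little-endian entry."""
    entries = []
    for (value,) in struct.iter_unpack("<Q", raw):
        itps = (value >> 3) & 0x1F       # bits 7:3, idle transition power state
        itpt = (value >> 8) & 0xFFFFFF   # bits 31:8, idle time in milliseconds
        entries.append((itps, itpt))
    return entries


for index, (itps, itpt) in enumerate(decode_apst(RAW)):
    print(f"Entry[{index}]: ITPS={itps} ITPT={itpt} ms")
```

The bytes `18 64` decode to state 3 after 100 ms and `20 bc 02` to state 4 after 700 ms, which agrees with what nvme-cli printed.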

Quote:
Originally Posted by https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe
If APST is enabled but no non-zero states appear in the table, the latencies might be too high for any states to be enabled by default. The output of # nvme id-ctrl /dev/nvme[0-9] should show the available non-operational power states of the NVME controller. If the total latency of any state (enlat + xlat) is greater than 25000 (25ms) you must pass a value at least that high as parameter default_ps_max_latency_us for the nvme_core kernel module. This should enable APST and make the table in # nvme get-feature show the entries.
I ran this just for good measure to see what latencies were listed:
Code:
sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid     : 0xc0a9
ssvid   : 0xc0a9
sn      : 1923E20992C5        
mn      : CT500P1SSD8                             
fr      : P3CR013 
rab     : 6
ieee    : 00a075
cmic    : 0
mdts    : 5
cntlid  : 1
ver     : 10300
rtd3r   : 7a120
rtd3e   : 4c4b40
oaes    : 0x200
ctratt  : 0
oacs    : 0x16
acl     : 3
aerl    : 7
frmw    : 0x14
lpa     : 0xf
elpe    : 255
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 343
cctemp  : 353
mtfa    : 20
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
edstt   : 5
dsto    : 1
fwug    : 0
kas     : 0
hctma   : 0x1
mntmt   : 323
mxtmt   : 353
sanicap : 0x2
hmminds : 0
hmmaxd  : 0
sqes    : 0x66
cqes    : 0x44
maxcmd  : 0
nn      : 1
oncs    : 0x5e
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 0
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
subnqn  : 
ioccsz  : 0
iorcsz  : 0
icdoff  : 0
ctrattr : 0
msdbd   : 0
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0500W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0040W non-operational enlat:6000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
So the max total latency shown is 6000+8000 = 14000, which is definitely less than the quoted 25000.
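
(The same arithmetic can be done for every non-operational state at once by parsing the "ps" lines. A minimal Python sketch, assuming the line layout printed by this nvme-cli version; the regex and names are mine, and only the first line of each power state entry is used.)

```python
import re

# The five "ps" lines from the id-ctrl output (continuation lines omitted).
ID_CTRL = """\
ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
ps    3 : mp:0.0500W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
ps    4 : mp:0.0040W non-operational enlat:6000 exlat:8000 rrt:4 rrl:4
"""

PS_LINE = re.compile(
    r"^ps\s+(?P<state>\d+)\s*:\s*mp:\S+\s+(?P<kind>(?:non-)?operational)"
    r"\s+enlat:(?P<enlat>\d+)\s+exlat:(?P<exlat>\d+)",
    re.MULTILINE,
)


def total_latencies(text: str) -> dict:
    """Map each non-operational state to enlat + exlat in microseconds."""
    totals = {}
    for m in PS_LINE.finditer(text):
        if m.group("kind") == "non-operational":
            state = int(m.group("state"))
            totals[state] = int(m.group("enlat")) + int(m.group("exlat"))
    return totals


totals = total_latencies(ID_CTRL)
print(totals)
print("worst case:", max(totals.values()), "us")
```

For this drive that gives 2000 us for ps 3 and 14000 us for ps 4.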

Even so I decided to verify what the kernel module parameters are currently set to, and I found that I can use this command:
Code:
hans@gypsy:~$ systool -vm nvme_core
Module = "nvme_core"

  Attributes:
    coresize            = "61440"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "5"
    srcversion          = "97936FE07FF19AF20FC8C2E"
    taint               = ""
    uevent              = <store method only>
    version             = "1.0"

  Parameters:
    admin_timeout       = "60"
    default_ps_max_latency_us= "100000"
    force_apst          = "N"
    io_timeout          = "30"
    max_retries         = "5"
    multipath           = "Y"
    shutdown_timeout    = "5"
    streams             = "N"

  Sections:
So on this Linux Mint, it looks like default_ps_max_latency_us is already set to 100000 (even higher than the Arch quoted 25ms) which should be more than enough.
(Assuming it only matters that it is set sufficiently high?)
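
(On that last question: as far as I can tell from nvme_configure_apst() in 4.15-era kernels, the parameter acts as a ceiling. Non-operational states whose exit latency exceeds it are skipped, 0 disables APST, and the idle time for each usable state is the total latency in microseconds divided by 20, rounded up, i.e. about 50 times the latency. The Python sketch below is my own reimplementation of that reading, not kernel code, so treat the details as assumptions to check against the source.)

```python
# (non-operational?, enlat us, exlat us) per power state, from id-ctrl above.
STATES = [
    (False, 5, 5),
    (False, 30, 30),
    (False, 30, 30),
    (True, 1000, 1000),
    (True, 6000, 8000),
]


def build_apst_table(states, max_latency_us):
    """Return [(ITPS, ITPT ms)] per state; (0, 0) means no autonomous transition."""
    table = [(0, 0)] * len(states)
    if max_latency_us == 0:
        return table  # APST disabled outright
    target = (0, 0)
    # Walk from the deepest state upward; each state is pointed at the
    # nearest acceptable non-operational state that is deeper than itself.
    for state in range(len(states) - 1, -1, -1):
        table[state] = target
        non_operational, enlat, exlat = states[state]
        if not non_operational or exlat > max_latency_us:
            continue
        # 1 ms of idle time per 20 us of total latency, rounded up.
        idle_ms = (enlat + exlat + 19) // 20
        target = (state, idle_ms)
    return table


for limit in (100000, 5500, 0):
    print(limit, build_apst_table(STATES, limit))
```

With 100000 this reproduces the table nvme-cli reported (100 ms to state 3, 700 ms to state 4). With 5500 or 6000 only the deepest state, ps 4, drops out, which would explain why those particular values get recommended. If the kernel compares total latency rather than exit latency, the outcome for this drive is the same.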

And lastly I checked the SMART data:
Code:
255 hans@gypsy:~$ sudo nvme smart-log /dev/nvme0
[sudo] password for hans:              
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 42 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 550,056
data_units_written                  : 1,258,557
host_read_commands                  : 7,345,377
host_write_commands                 : 24,799,253
controller_busy_time                : 264
power_cycles                        : 18
power_on_hours                      : 607
unsafe_shutdowns                    : 7
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 43 C
Temperature Sensor 2                : 42 C
Temperature Sensor 5                : 61 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
Which doesn't seem to show anything of interest.
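
(One thing that can be read out of these numbers is the thermal headroom. The wctemp and cctemp fields in the id-ctrl output are in Kelvin, while smart-log prints Celsius. A short Python sketch of the comparison, using 273 as an integer approximation of the Kelvin offset; the dictionary keys and function name are my own labels.)

```python
KELVIN_OFFSET = 273  # integer approximation of 273.15

# Thresholds from id-ctrl (Kelvin) and readings from smart-log (Celsius).
THRESHOLDS_K = {"wctemp": 343, "cctemp": 353}
SENSORS_C = {"composite": 42, "sensor 1": 43, "sensor 2": 42, "sensor 5": 61}


def headroom(thresholds_k: dict, sensors_c: dict) -> dict:
    """Degrees Celsius between the hottest sensor and each threshold."""
    hottest = max(sensors_c.values())
    return {
        name: (kelvin - KELVIN_OFFSET) - hottest
        for name, kelvin in thresholds_k.items()
    }


print(headroom(THRESHOLDS_K, SENSORS_C))
```

The warning threshold works out to 70 C and the critical one to 80 C, so the 61 C sensor had 9 C to spare when this was taken. That reading was presumably not under full load, so it does not rule out a thermal cause on its own.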

So again I don't really know what else to try but to wait for another crash and hope something interesting shows up in "dmesg -w" over ssh.

2019-08-18, 21:40   #21
PhilF
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts

Are you sure you have disabled every possible power saving feature related to the drive, including any that might be in the BIOS settings of the machine?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
