#12 |
|
Apr 2019
5×41 Posts |
I found there is a firmware update for the NVMe SSD:
Version P3CR013, released July 23, 2019. This firmware update addresses the following items:
I have already upgraded and started testing mprime again at 8192MB usage. It seems better (it has been running about 11 minutes so far, longer than the last couple of attempts), so fingers crossed they fixed the issue.
#13 |
|
Apr 2019
5·41 Posts |
Since the main problem seems to be solved now (knock on wood), I am curious about this:
Is there a reason 32GB is the max allowed? I have another system with 96GB total, and was hoping to try running up to ~64GB of that. |
#14 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts |
No. I chose 32 because it was a 128GB system with 4 prime95 workers. I could potentially run 3 instances of stage 2 at once, stage one on another, and still have enough total ram to run the OS and other apps and avoid paging. I recall I have tried it at 64GB with 1 P-1 instance and 3 PRP workers and had no issues, and it didn't use the whole 64. If there is a limit, I'm not aware of it. Haven't read the source code for that.
Last fiddled with by kriesel on 2019-08-17 at 00:01 |
#15 |
|
Apr 2019
5·41 Posts |
OK, because I think I tried setting Memory in prime.txt to 65536, and it looks like it got reset to 32768.
Anyway, I see what you mean: it looks like it only needs a max of ~20GB for the assigned bounds in the 92M range, for example. Code:
Using 20102MB of memory. Processing 480 relative primes (0 of 480 already processed). |
#16 |
|
Apr 2019
5×41 Posts |
Well, I guess I spoke too soon. That particular run went smoothly for many hours, then I tried stopping and restarting mprime and it crashed the same way, with all the same NVMe errors again.
Should I just return this drive for a different model? |
#17 |
|
Apr 2019
11001101₂ Posts |
I searched around some more about the error and found some related bug reports about nvme not coming out of some power saving mode.
https://bugs.launchpad.net/ubuntu/+s...d/+bug/1682704 which recommends setting the grub parameter: Code:
nvme_core.default_ps_max_latency_us=6000
Then there's this page: https://wiki.archlinux.org/index.php...ate_drive/NVMe which says you can disable the power saving entirely by setting the same param to 0: Code:
nvme_core.default_ps_max_latency_us=0
Can I just have a stable system now?! *shaking fist at the compute gods*
edit: NOPE. Crashed again! *sigh* This is getting tiring.
Last fiddled with by hansl on 2019-08-17 at 03:52
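A grub parameter only helps if it actually reaches the running kernel, so it may be worth checking after a reboot. Below is a minimal Python sketch, assuming the usual Linux locations /proc/cmdline and /sys/module/nvme_core/parameters/default_ps_max_latency_us. The helper name `cmdline_value` and the sample string are made up for illustration.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def cmdline_value(cmdline: str, name: str = PARAM):
    """Return the value given for `name` on a kernel command line, or None.

    If the parameter is repeated, the last occurrence seen is returned.
    """
    found = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            found = value
    return found


if __name__ == "__main__":
    # Illustrative sample string, not taken from a real machine.
    sample = "BOOT_IMAGE=/boot/vmlinuz ro quiet " + PARAM + "=0"
    print("sample:", cmdline_value(sample))

    # On a live Linux system, compare what was requested at boot
    # with what the nvme_core module reports.
    proc = Path("/proc/cmdline")
    sysfs = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
    if proc.exists():
        print("boot cmdline:", cmdline_value(proc.read_text()))
    if sysfs.exists():
        print("module value:", sysfs.read_text().strip())
```

If the boot cmdline line prints None, the setting never made it into the boot entry, and the module is presumably still running with its default.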
#18 |
|
Sep 2002
Database er0rr
5×937 Posts |
Here: https://forums.opensuse.org/showthre...91#post2907091 they recommend 5500.
#19 |
|
Apr 2019
5·41 Posts |
I think I found the issue. I had previously set another grub parameter, "pcie_aspm=off", to get rid of a bunch of dmesg PCIe error spam about "BadTLP" and "BadDLLP" (apparently caused by the Nvidia card). I had found that solution in this thread:
https://forum.level1techs.com/t/thre...rors/118977/88
I guess turning off ASPM is a bad idea for NVMe stability though, so I've re-enabled it and things seem better, I think. I have not been able to trigger the crash since removing "pcie_aspm=off" from grub. I'm back to getting all the BadTLP errors, etc. that the above thread is about, but they seem harmless, I guess.
Now that I think about it, I changed that setting at about the same time I decided to try P-1 work, so the crash might also be triggered by LL work, or just high loads in general. I'm not too interested in testing further permutations though, as long as my system is stable.
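For anyone retracing this, one way to see which ASPM policy the kernel is using after editing the grub line is to read the policy file in sysfs. Here is a small Python sketch, assuming the file /sys/module/pcie_aspm/parameters/policy exists and, as is typical, lists the available policies with the active one in square brackets. The function name is hypothetical.

```python
import re
from pathlib import Path


def active_aspm_policy(text: str):
    """Return the bracketed (active) policy name from the policy text, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Assumed sysfs location; it may be absent on kernels built without ASPM.
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
    else:
        print("no ASPM policy file found")
```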
#20 |
|
Apr 2019
5·41 Posts |
Well, it crashed again in the middle of the night. Unfortunately I didn't have an ssh session connected to monitor anything. The display was off when I got to it this morning, and pressing keys or clicking the mouse wouldn't wake it. I also could not ssh in, so I hit the reset button. The logs showed no particular error aside from the "expected" ones that come from the nvidia device, mentioned in my previous post. You can see where I rebooted it at 7:05am. Before that, it was writing a line around 1:49am and got cut off at "Aug 1", where I'm guessing the drive went to read-only mode again.
Code:
...
Aug 18 01:48:52 gypsy kernel: [56310.005183] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:48:52 gypsy kernel: [56310.005196] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:48:52 gypsy kernel: [56310.005208] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:48:52 gypsy kernel: [56310.005212] pcieport 0000:00:03.1: [12] Replay Timer Timeout
Aug 18 01:49:03 gypsy kernel: [56321.303235] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:03 gypsy kernel: [56321.303250] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:49:03 gypsy kernel: [56321.303259] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:49:03 gypsy kernel: [56321.303263] pcieport 0000:00:03.1: [12] Replay Timer Timeout
Aug 18 01:49:03 gypsy kernel: [56321.314419] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:03 gypsy kernel: [56321.314432] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
Aug 18 01:49:03 gypsy kernel: [56321.314441] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00001000/00006000
Aug 18 01:49:03 gypsy kernel: [56321.314444] pcieport 0000:00:03.1: [12] Replay Timer Timeout
Aug 18 01:49:07 gypsy kernel: [56325.237633] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 18 01:49:07 gypsy kernel: [56325.237649] pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Receiver ID)
Aug 18 01:49:07 gypsy kernel: [56325.237660] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00000040/00006000
Aug 18 01:49:07 gypsy kernel: [56325.237663] pcieport 0000:00:03.1: [ 6] Bad TLP
Aug 18 01:49:10 gypsy kernel: [56327.452861] pcieport 0000:00:03.1: AER: Corrected error received: id=0000
Aug 1Aug 18 07:05:41 gypsy kernel: [ 0.000000] Linux version 4.15.0-58-generic (buildd@lcy01-amd64-013) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 (Ubuntu 4.15.0-58.64-generic 4.15.18 )
Aug 18 07:05:41 gypsy kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-58-generic root=UUID=2c3ef177-4d57-4f22-9f3c-d0fafa49b3c6 ro quiet splash vt.handoff=1
Aug 18 07:05:41 gypsy kernel: [ 0.000000] KERNEL supported cpus:
Aug 18 07:05:41 gypsy kernel: [ 0.000000] Intel GenuineIntel
Aug 18 07:05:41 gypsy kernel: [ 0.000000] AMD AuthenticAMD
Aug 18 07:05:41 gypsy kernel: [ 0.000000] Centaur CentaurHauls
...
I read some more about this latency setting here: https://wiki.archlinux.org/index.php...ate_drive/NVMe Quote:
Quote:
Code:
hans@gypsy:~$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
Autonomous Power State Transition Enable (APSTE): Enabled
Auto PST Entries .................
Entry[ 0]
.................
Idle Time Prior to Transition (ITPT): 100 ms
Idle Transition Power State (ITPS): 3
.................
Entry[ 1]
.................
Idle Time Prior to Transition (ITPT): 100 ms
Idle Transition Power State (ITPS): 3
.................
Entry[ 2]
.................
Idle Time Prior to Transition (ITPT): 100 ms
Idle Transition Power State (ITPS): 3
.................
Entry[ 3]
.................
Idle Time Prior to Transition (ITPT): 700 ms
Idle Transition Power State (ITPS): 4
.................
Entry[ 4]
.................
Idle Time Prior to Transition (ITPT): 0 ms
Idle Transition Power State (ITPS): 0
.................
*** Entries 5-31 trimmed for brevity, all same 0ms, 0 state as Entry[ 4] *** -hansl
.................
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 18 64 00 00 00 00 00 00 18 64 00 00 00 00 00 00 ".d.......d......"
0010: 18 64 00 00 00 00 00 00 20 bc 02 00 00 00 00 00 ".d.............."
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
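The raw hex dump at the bottom of that output carries the same information as the decoded entries above it. Here is a small Python sketch that decodes it, assuming the NVMe APST table layout of 8-byte little-endian entries with ITPS in bits 7:3 and ITPT (in milliseconds) in bits 31:8. The function name `decode_apst` is made up for illustration.

```python
import struct


def decode_apst(raw: bytes):
    """Decode APST table bytes into (ITPS, ITPT in ms) tuples, one per 8-byte entry."""
    entries = []
    for (qword,) in struct.iter_unpack("<Q", raw):
        itps = (qword >> 3) & 0x1F          # bits 7:3, power state to move to
        itpt_ms = (qword >> 8) & 0xFFFFFF   # bits 31:8, idle time before moving
        entries.append((itps, itpt_ms))
    return entries


# First 0x28 bytes of the dump above (entries 0 through 4).
dump = bytes.fromhex(
    "18 64 00 00 00 00 00 00 "
    "18 64 00 00 00 00 00 00 "
    "18 64 00 00 00 00 00 00 "
    "20 bc 02 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 "
)

for index, (itps, itpt_ms) in enumerate(decode_apst(dump)):
    print(f"Entry[{index}]: ITPT {itpt_ms} ms, ITPS {itps}")
```

Decoded this way, 0x6418 gives state 3 after 100 ms and 0x02bc20 gives state 4 after 700 ms, which agrees with the entries nvme-cli printed.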
Quote:
Quote:
Code:
sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0xc0a9
ssvid : 0xc0a9
sn : 1923E20992C5
mn : CT500P1SSD8
fr : P3CR013
rab : 6
ieee : 00a075
cmic : 0
mdts : 5
cntlid : 1
ver : 10300
rtd3r : 7a120
rtd3e : 4c4b40
oaes : 0x200
ctratt : 0
oacs : 0x16
acl : 3
aerl : 7
frmw : 0x14
lpa : 0xf
elpe : 255
npss : 4
avscc : 0
apsta : 0x1
wctemp : 343
cctemp : 353
mtfa : 20
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
edstt : 5
dsto : 1
fwug : 0
kas : 0
hctma : 0x1
mntmt : 323
mxtmt : 353
sanicap : 0x2
hmminds : 0
hmmaxd : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x5e
fuses : 0
fna : 0
vwc : 0x1
awun : 0
awupf : 0
nvscc : 0
acwu : 0
sgls : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0500W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0040W non-operational enlat:6000 exlat:8000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
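The enlat/exlat figures above are what the default_ps_max_latency_us setting gets compared against. The Python sketch below is a simplified model of my reading of the Linux nvme driver, not the driver code itself. It assumes a non-operational state is used only if its exit latency does not exceed the limit, and that the idle timeout is the entry plus exit latency divided by 20, rounded up, in ms. The names are hypothetical.

```python
# Power states from the id-ctrl output above: (operational, enlat_us, exlat_us)
POWER_STATES = [
    (True, 5, 5),
    (True, 30, 30),
    (True, 30, 30),
    (False, 1000, 1000),
    (False, 6000, 8000),
]


def build_apst_table(states, max_latency_us):
    """Return one (target_state, idle_ms) per power state; (0, 0) means no transition."""
    table = [(0, 0)] * len(states)
    target = None  # most recently accepted non-operational state
    # Walk from the deepest state upward; each state points at the
    # nearest deeper state that passed the latency check.
    for state in range(len(states) - 1, -1, -1):
        if target is not None:
            table[state] = target
        operational, enlat_us, exlat_us = states[state]
        if operational or exlat_us > max_latency_us:
            continue
        idle_ms = (enlat_us + exlat_us + 19) // 20
        target = (state, idle_ms)
    return table


for limit in (100000, 6000, 5500, 0):
    print(limit, build_apst_table(POWER_STATES, limit))
```

With the limit at 100000, this model reproduces the table reported by get-feature earlier in this post (state 3 after 100 ms, state 4 after 700 ms). Under the same assumptions, 6000 or 5500 would drop ps 4 and keep ps 3, and 0 would leave no transitions at all.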
Even so I decided to verify what the kernel module parameters are currently set to, and I found that I can use this command: Code:
hans@gypsy:~$ systool -vm nvme_core
Module = "nvme_core"
Attributes:
coresize = "61440"
initsize = "0"
initstate = "live"
refcnt = "5"
srcversion = "97936FE07FF19AF20FC8C2E"
taint = ""
uevent = <store method only>
version = "1.0"
Parameters:
admin_timeout = "60"
default_ps_max_latency_us= "100000"
force_apst = "N"
io_timeout = "30"
max_retries = "5"
multipath = "Y"
shutdown_timeout = "5"
streams = "N"
Sections:
(Assuming it only matters that it is set sufficiently high?) And lastly I checked the SMART data: Code:
255 hans@gypsy:~$ sudo nvme smart-log /dev/nvme0
[sudo] password for hans:
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 42 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 550,056
data_units_written : 1,258,557
host_read_commands : 7,345,377
host_write_commands : 24,799,253
controller_busy_time : 264
power_cycles : 18
power_on_hours : 607
unsafe_shutdowns : 7
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 43 C
Temperature Sensor 2 : 42 C
Temperature Sensor 5 : 61 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
So again I don't really know what else to try but to wait for another crash and hope something interesting shows up in "dmesg -w" over ssh.
#21 |
|
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts |
Are you sure you have disabled every possible power saving feature related to the drive, including any that might be in the BIOS settings of the machine?