#1
Apr 2019
5·41 Posts
Hello, I have recently built a computer that I put to work running mprime on OpenSUSE tumbleweed.
I had unsuccessfully tried some mild overclocking; just turning on PBO was not stable at all. Eventually gave up on any of that and set everything in BIOS to basically default, aside from enabling the DOCP(XMP) profile. With this I was running stable for days at a time doing LL-DC work and running other tasks like mfaktc and some multithreaded PARI/gp scripts. Then last night I tried switching it over to do some P-1. I started with mprime set to use 15 out of 16GB of RAM, which rebooted itself overnight. I figured maybe that was too much mem usage, (although I also have 16GB of swap so it shouldn't completely kill it if there was some out-of-memory issue). Anyways, so I tried dropping it down to just 8GB usage in mprime, and within a minute of resuming the stage 2 work, X server just quit and sent me to a virtual console login prompt which took forever to login: (I entered username, but never got prompted for password, but only had patience for a minute or two of waiting before I hard reset). I'm running memtest86 right now; so far its done 1 pass with no errors. Honestly feels like a waste of time, as I've never had memtest86 find errors in all my years of troubleshooting, but I'm letting it run since I have no other ideas. Should I need to manually clock down/loosen timings on the RAM below its given profile? Or maybe give it a little more voltage? Is there anything else it could be about this setup? Does P-1 Stage 2 write much to disk? I have a single M.2 NVMe drive on this system, and I'm wondering if that could also possibly be the source of some problem. 
Some hardware specs:
CPU: Ryzen 2700X 8C/16T "4.3 GHz" (I've never seen above 4 on a single thread)
Motherboard: Asus Prime B450-Plus
Cooler: Noctua NH-D15
Memory: Corsair Vengeance LPX 16GB (2 x 8GB) DDR4 3000 (PC4-24000) C16 1.35V CMK16GX4M2D3000C16
Storage: Crucial P1 500GB 3D NAND NVMe PCIe M.2 SSD - CT500P1SSD8
GPU: GTX 1660
PSU: Corsair RM750x, 750 Watt, 80+ Gold certified

I've been trying out openSUSE Tumbleweed on this computer, which is new to me, but now I'm considering falling back to my mainstay of Ubuntu for a hopefully more stable system. Though if it's a hardware issue (or BIOS config, maybe?) that's obviously not going to help. This is also my first Ryzen system, and I've read some vague sentiments that the 1st and 2nd gen have had a lot of issues, being picky about RAM somehow?

edit: Should I just not bother with P-1 on a system with no ECC?

Last fiddled with by hansl on 2019-08-16 at 14:38
#2
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts
Quote:
Try bringing the allowed RAM down to 1 GB; it will still complete the second stage, just less quickly.
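For reference, the stage 2 memory cap lives in mprime's local.txt (the value is in MB), e.g.:

```
Memory=1024
```

A day/night split such as `Memory=1024 during 7:30-23:30 else 8192` is also accepted, per the readme, if you only want the large allocation overnight.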
#3
Sep 2009
5×17×29 Posts
Was the hard disk activity LED showing anything? If it was page thrashing, that would be on all the time.

If you have another system, try logging into the Ryzen over the network and running top or watch sensors before you start P-1. If the Ryzen crashes you should still be able to see the last lot of output on the other system. I've used that a few times (once I found the cooler had failed so the box was overheating; another time I never found why the system would crash).

Chris
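To make that concrete, here is a minimal snapshot script along those lines (a sketch; `sensors` needs the lm-sensors package and is skipped if absent):

```shell
#!/bin/sh
# snapshot.sh -- one timestamped snapshot of temperatures and free memory.
# The sensors call is skipped if lm-sensors isn't installed.
snap="$(date; command -v sensors >/dev/null 2>&1 && sensors; free -m)"
printf '%s\n' "$snap"
```

Run it in a loop from the other machine, e.g. `ssh user@ryzen 'while sh snapshot.sh; do sleep 30; done' | tee ryzen.log` ("ryzen" being a placeholder hostname), so the last readings survive a crash.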
#4
Apr 2019
5·41 Posts
Small update: I let memtest run through 3 passes with no errors. I ended up cancelling that test and trying some more runs with different memory amounts in mprime. I started at 1024MB and raised it slowly, letting it report a couple of lines of progress, then stopping and restarting with a higher memory limit. I got back up to 8192MB without any crashes so far.

I'm monitoring temps and free memory over ssh now, so I should be able to catch the status if it crashes again. I had a hunch that maybe it was something that only happens at the very beginning of stage 2 when a high memory limit is set, which I had been able to get past when allowing only 1024MB. So I removed the progress files (I wasn't really thinking and forgot that it would need to redo all of stage 1 as well, so I have to wait a few hours to get back to the beginning of stage 2 again). Is there any way to tell mprime to restart at the beginning of stage 2?
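As an aside, a cheap way to avoid losing finished stage 1 work in future experiments is to copy the save files aside before touching anything. A minimal sketch (the file-name globs are a guess; mprime's save files are typically named after the exponent, so adjust to what you actually see in the directory):

```shell
#!/bin/sh
# backup_saves.sh -- copy mprime save/progress files into ./backup
# before deleting or re-running anything.
# Usage: sh backup_saves.sh [mprime-dir]   (defaults to the current dir)
dir=${1:-.}
cd "$dir" || exit 1
mkdir -p backup
# Glob patterns are an assumption; mprime save files usually start with
# a letter followed by the exponent (e.g. m1234567).
for f in m* p* e*; do
  if [ -f "$f" ]; then
    cp -p "$f" backup/
  fi
done
```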
#5
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
17220₈ Posts
That's strange behavior. 15 out of 16GB seems ambitious, but 8 of 16 should be no problem, unless you have something else hogging a lot of RAM. Other apps? Ramdisk?

I routinely run CPU P-1 at 8 to 32 GB allowed in prime95 on Win7 or Win10, on systems with 4 or more prime95 workers plus multiple GPUs going (corresponding to 1/2 to 1/4 of total installed RAM). Four out of 8GB on Win7 with other apps going tends to be unpleasant for response time and paging during stage 2, but does not crash.

Can you make your system multiboot? Adding Ubuntu and/or Windows would let you try things on the same hardware, app, and parameters but vary the OS. If it's a SUSE-related issue, that should show up quickly. Good luck.

Last fiddled with by kriesel on 2019-08-16 at 16:44
#6
"Sam Laur"
Dec 2018
Turku, Finland
317 Posts
I'm running P-1 with 14.5 out of 16 GB, on a Ryzen 3 2200G, so it's Zen rather than Zen+. Linux, too. But I've found that having the system "memtest86 stable" isn't quite enough: Prime95/mprime running LL or PRP stresses the memory much more, and P-1 stage 2 stresses it in a different way, but still more than memtest86. No overclocks, either CPU or memory, have worked for me. It hasn't crashed, though, just produced various errors while still running.
#7
Apr 2019
315₈ Posts
OK, it happened again. The last message over ssh showed it was at ~50C CPU temp with about 7GB free when the connection was closed (I meant to copy the screen output but accidentally overwrote it by reconnecting ssh). I reconnected while it's still in this weird state, and it looks like it's something related to the NVMe drive. This is the output from dmesg:
Code:
[  989.409598] perf: interrupt took too long (4979 > 4912), lowering kernel.perf_event_max_sample_rate to 40000
[ 1195.031765] fuse: init (API version 7.31)
[ 1327.328770] perf: interrupt took too long (6268 > 6223), lowering kernel.perf_event_max_sample_rate to 31750
[ 2238.284260] perf: interrupt took too long (7846 > 7835), lowering kernel.perf_event_max_sample_rate to 25250
[ 9117.462381] perf: interrupt took too long (9831 > 9807), lowering kernel.perf_event_max_sample_rate to 20250
[ 9261.476036] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 9261.603999] pci_raw_set_power_state: 19 callbacks suppressed
[ 9261.604009] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 9261.604430] nvme nvme0: Removing after probe failure status: -19
[ 9261.632241] print_req_error: I/O error, dev nvme0n1, sector 15247304 flags 100001
[ 9261.632255] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 9261.729511] nvme nvme0: failed to set APST feature (-19)
[ 9261.739582] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739591] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 9261.739595] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 9261.756670] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 1, flush 0, corrupt 0, gen 0
[ 9261.756951] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 2, flush 0, corrupt 0, gen 0
[ 9261.758061] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
[ 9261.758368] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 4, flush 0, corrupt 0, gen 0
[ 9261.759112] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 5, flush 0, corrupt 0, gen 0
[ 9261.759138] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 4, rd 6, flush 0, corrupt 0, gen 0
[ 9262.276359] Core dump to |/bin/false pipe failed
[ 9262.336595] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[ 9262.336817] caller _nv000939rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 9262.975980] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 62
[ 9263.012987] Core dump to |/bin/false pipe failed
[ 9263.015801] Core dump to |/bin/false pipe failed
[ 9263.035986] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 1
[ 9263.134288] Core dump to |/bin/false pipe failed
[ 9265.580609] BTRFS: error (device nvme0n1p2) in btrfs_commit_transaction:2234: errno=-5 IO failure (Error while writing out transaction)
[ 9265.580610] BTRFS info (device nvme0n1p2): forced readonly
[ 9265.580611] BTRFS warning (device nvme0n1p2): Skipping commit of aborted transaction.
[ 9265.580612] BTRFS: error (device nvme0n1p2) in cleanup_transaction:1794: errno=-5 IO failure
[ 9265.580613] BTRFS info (device nvme0n1p2): delayed_refs has NO entry
[ 9292.708719] btrfs_dev_stat_print_on_error: 320 callbacks suppressed
[ 9292.708723] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 123, rd 208, flush 0, corrupt 0, gen 0
[ 9368.485780] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 208, flush 0, corrupt 0, gen 0
[ 9577.728458] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 209, flush 0, corrupt 0, gen 0
[ 9577.728508] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 210, flush 0, corrupt 0, gen 0
[ 9577.728715] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 211, flush 0, corrupt 0, gen 0
[ 9577.728768] Core dump to |/bin/false pipe failed
[ 9578.059425] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 212, flush 0, corrupt 0, gen 0
[ 9578.059466] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 213, flush 0, corrupt 0, gen 0
[ 9578.059531] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 214, flush 0, corrupt 0, gen 0
[ 9578.059555] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 215, flush 0, corrupt 0, gen 0
[ 9578.059574] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 216, flush 0, corrupt 0, gen 0
[ 9578.059590] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 217, flush 0, corrupt 0, gen 0
[ 9578.059604] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 124, rd 218, flush 0, corrupt 0, gen 0
[ 9608.872774] btrfs_dev_stat_print_on_error: 1 callbacks suppressed
[ 9608.872777] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 125, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872797] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 126, rd 219, flush 0, corrupt 0, gen 0
[ 9608.872805] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 219, flush 0, corrupt 0, gen 0
[11308.648706] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 220, flush 0, corrupt 0, gen 0
[11308.648753] BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 127, rd 221, flush 0, corrupt 0, gen 0

It looks like the 9117.462381 timestamp is where it started stage 2 again, since that's the last of those perf interrupt messages. Then about two minutes later it says the NVMe controller is down and tries to reset it, and everything just fails. Could high-bandwidth memory access by P-1 induce some sort of timeout on the bus that trips up the NVMe?
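For what it's worth, this dmesg pattern ("controller is down; will reset: CSTS=0xffffffff" together with "failed to set APST feature") matches a widely reported Linux issue with NVMe autonomous power-state transitions (APST) on some consumer drives. A commonly suggested workaround — assuming this is the same issue, which is not a confirmed diagnosis here — is to disable the deeper power states via a kernel parameter:

```
# /etc/default/grub -- append to the existing kernel command line
# (the "..." stands for whatever parameters are already there):
GRUB_CMDLINE_LINUX_DEFAULT="... nvme_core.default_ps_max_latency_us=0"
```

Then regenerate the GRUB config (on openSUSE: `sudo grub2-mkconfig -o /boot/grub2/grub.cfg`) and reboot. If the drive stops dropping out under load, APST was the culprit rather than a failing drive.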
#8
Apr 2019
CD₁₆ Posts
I guess my main question now is: should I start trying to return this drive, or is it possible there's just some weird bus conflict causing this and the drive isn't really bad?

edit: Also, is there anything I should try checking before I reset this again? I'm connected via ssh right now; logging in directly on the box itself doesn't seem to work. I can't really do much in this state, since sudo and su just give me "Input/output error", presumably because the filesystem is read-only or something. It's weird, though, because I can cd into the mprime directory and ls and see the progress files for the P-1 work.

Last fiddled with by hansl on 2019-08-16 at 19:23
#9
Sep 2002
Database er0rr
4685₁₀ Posts
Quote:
Last fiddled with by paulunderwood on 2019-08-16 at 19:58 |
#10
Apr 2019
11001101₂ Posts
The drive is not close to being full. There is no hard-drive cable, because it's M.2 NVMe; I could try reseating it in its socket, though.
df:
Code:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.9G   44K  7.9G   1% /dev/shm
tmpfs           7.9G  158M  7.7G   2% /run
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/nvme0n1p2  450G   32G  418G   7% /
/dev/nvme0n1p2  450G   32G  418G   7% /boot/grub2/x86_64-efi
/dev/nvme0n1p2  450G   32G  418G   7% /opt
/dev/nvme0n1p2  450G   32G  418G   7% /var
/dev/nvme0n1p2  450G   32G  418G   7% /boot/grub2/i386-pc
/dev/nvme0n1p2  450G   32G  418G   7% /.snapshots
/dev/nvme0n1p2  450G   32G  418G   7% /root
/dev/nvme0n1p2  450G   32G  418G   7% /tmp
/dev/nvme0n1p2  450G   32G  418G   7% /srv
/dev/nvme0n1p2  450G   32G  418G   7% /usr/local
/dev/nvme0n1p2  450G   32G  418G   7% /home
/dev/nvme0n1p1  500M  5.1M  495M   2% /boot/efi
tmpfs           1.6G  8.0K  1.6G   1% /run/user/1000
Code:
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=8155044k,nr_inodes=2038761,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
/dev/nvme0n1p2 on / type btrfs (ro,relatime,ssd,space_cache,subvolid=268,subvol=/@/.snapshots/1/snapshot)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=19580)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
/dev/nvme0n1p2 on /boot/grub2/x86_64-efi type btrfs (ro,relatime,ssd,space_cache,subvolid=265,subvol=/@/boot/grub2/x86_64-efi)
/dev/nvme0n1p2 on /opt type btrfs (ro,relatime,ssd,space_cache,subvolid=263,subvol=/@/opt)
/dev/nvme0n1p2 on /var type btrfs (ro,relatime,ssd,space_cache,subvolid=258,subvol=/@/var)
/dev/nvme0n1p2 on /boot/grub2/i386-pc type btrfs (ro,relatime,ssd,space_cache,subvolid=266,subvol=/@/boot/grub2/i386-pc)
/dev/nvme0n1p2 on /.snapshots type btrfs (ro,relatime,ssd,space_cache,subvolid=267,subvol=/@/.snapshots)
/dev/nvme0n1p2 on /root type btrfs (ro,relatime,ssd,space_cache,subvolid=262,subvol=/@/root)
/dev/nvme0n1p2 on /tmp type btrfs (ro,relatime,ssd,space_cache,subvolid=260,subvol=/@/tmp)
/dev/nvme0n1p2 on /srv type btrfs (ro,relatime,ssd,space_cache,subvolid=261,subvol=/@/srv)
/dev/nvme0n1p2 on /usr/local type btrfs (ro,relatime,ssd,space_cache,subvolid=259,subvol=/@/usr/local)
/dev/nvme0n1p2 on /home type btrfs (ro,relatime,ssd,space_cache,subvolid=264,subvol=/@/home)
/dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=1640604k,mode=700,uid=1000,gid=100)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
#11
Sep 2002
Database er0rr
5×937 Posts |
SUSE, right? You might have to boot into rescue mode and run a disk check from there, having checked that the drive is seated properly first.
Last fiddled with by paulunderwood on 2019-08-16 at 20:33 |
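If it comes to that, something along these lines from a rescue/live shell might be a starting point (a sketch; device names are taken from the dmesg log earlier in the thread, and btrfs-progs plus smartmontools are assumed to be installed):

```shell
#!/bin/sh
# Offline checks for the failing NVMe: run as root from a rescue
# environment with the filesystem UNMOUNTED. Adjust device names.
DEV=/dev/nvme0n1p2
CTRL=/dev/nvme0
# Only attempt the checks as root, with the device present and unmounted.
if [ "$(id -u)" -eq 0 ] && [ -b "$DEV" ] && ! grep -q "^$DEV " /proc/mounts; then
  btrfs check "$DEV"     # read-only filesystem consistency check (btrfs-progs)
  smartctl -a "$CTRL"    # drive health and error log (smartmontools)
fi
```

If smartctl shows media/integrity errors, that points at the drive itself; a clean log would be more consistent with a controller/power-management problem.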