Need help debugging plotting crash/kernel panic, already did memtest and tried to dump core

zictes · May 9, 2021, 12:40am

I really need help. I’m making pretty much no progress plotting because my system locks up multiple times per day, and I can’t gather any information as to why.

I already posted about the issue here and kindly received some help from @codinghorror which led me to discover a bad DIMM and completely replace my entire memory kit. I thought that would be it, but the problem persists.

I’m unable to find a cause since I keep getting kernel panics, and I have not found any way to collect information from them. I’ve asked multiple times on Stack Overflow, the Linux Stack Exchange and Server Fault, but nobody even bothers to respond. The only thing I could gather from other threads is that it’s not really possible to write a log to the filesystem from a broken kernel state, and that it should not even be attempted, to avoid writing garbage to random locations on your disk. I tried overriding my sysctl config to dump the core to /var/crash anyway since I have no other option, but there is still nothing being written. I tried dumping safely by setting up linux-crashdump/kdump first, but that had no effect either.
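One thing that can be checked from the running system is whether kdump is actually armed. A crash kernel can only be loaded if memory was reserved for it at boot through a `crashkernel=` parameter, and the kernel command line in the journal excerpt further down does not show one for that boot (it may predate the kdump setup). Below is a minimal sketch in Python, assuming the standard `/proc/cmdline` and `/sys/kernel/kexec_crash_loaded` files are present; the function names `crashkernel_arg` and `kdump_status` are made up for illustration.

```python
from pathlib import Path


def crashkernel_arg(cmdline: str):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def kdump_status(cmdline_path="/proc/cmdline",
                 loaded_path="/sys/kernel/kexec_crash_loaded"):
    """Report whether memory was reserved and whether a crash kernel is loaded."""
    cmdline = Path(cmdline_path).read_text()
    loaded_file = Path(loaded_path)
    # The sysfs file contains "1" when a crash kernel has been loaded via kexec.
    loaded = loaded_file.exists() and loaded_file.read_text().strip() == "1"
    return {
        "crashkernel": crashkernel_arg(cmdline),
        "crash_kernel_loaded": loaded,
    }
```

Running `print(kdump_status())` on the affected machine should show a non-empty `crashkernel` value and `crash_kernel_loaded: True` before a dump can be expected in /var/crash. If either is missing, the panic has nowhere to go, which would match seeing nothing written.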

I would be super grateful if somebody with more linux experience could take a look at my journal to see if there is any information pointing to a cause that I missed. Last entry before this crash is May 09 01:35:01.

github.com

zictes/log/blob/main/journal

-- Journal begins at Sat 2021-05-08 00:49:10 CEST, ends at Sun 2021-05-09 02:17:15 CEST. --
May 08 00:49:10 stephen kernel: Linux version 5.11.0-16-generic (buildd@lgw01-amd64-035) (gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #17-Ubuntu SMP Wed Apr 14 20:12:43 UTC 2021 (Ubuntu 5.11.0-16.17-generic 5.11.12)
May 08 00:49:10 stephen kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-16-generic root=UUID=1b63092c-1327-422f-a327-c8ddbfb3b056 ro
May 08 00:49:10 stephen kernel: KERNEL supported cpus:
May 08 00:49:10 stephen kernel:   Intel GenuineIntel
May 08 00:49:10 stephen kernel:   AMD AuthenticAMD
May 08 00:49:10 stephen kernel:   Hygon HygonGenuine
May 08 00:49:10 stephen kernel:   Centaur CentaurHauls
May 08 00:49:10 stephen kernel:   zhaoxin   Shanghai  
May 08 00:49:10 stephen kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 08 00:49:10 stephen kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 08 00:49:10 stephen kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 08 00:49:10 stephen kernel: x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
May 08 00:49:10 stephen kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
May 08 00:49:10 stephen kernel: x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
May 08 00:49:10 stephen kernel: x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
May 08 00:49:10 stephen kernel: BIOS-provided physical RAM map:
May 08 00:49:10 stephen kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
May 08 00:49:10 stephen kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
May 08 00:49:10 stephen kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d1efff] usable

This file has been truncated.
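To locate the points where the machine went silent in a journal like this one, the timestamps can be scanned for unusually long gaps. This is a rough sketch only, assuming the short journal format shown above (a 15-character `May 08 00:49:10` prefix with no year), English month names, and an arbitrary 10-minute threshold; `parse_ts` and `find_gaps` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def parse_ts(line: str, year: int):
    """Parse the 'May 08 00:49:10' prefix of a short-format journal line."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # e.g. the '-- Journal begins at ...' header line


def find_gaps(lines, year=2021, threshold=timedelta(minutes=10)):
    """Yield (last_before, first_after) pairs where the log is silent too long."""
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if prev is not None and ts - prev > threshold:
            yield prev, ts
        prev = ts
```

Feeding it the lines of the exported journal file, for example `list(find_gaps(open("journal")))`, gives the last timestamp logged before each silence. A gap does not prove a crash on its own, since an idle system or a clean reboot can look the same, but it narrows down which entries to read.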

CHIA: 1.1.4, using plotman 0.2
OS : Ubuntu Server 21.04, Kernel 5.11.0-16-generic, up to date
CPU : AMD Ryzen 9 5950X
RAM : 2x 32GB Mushkin Redline Ridgeback G2 DDR4-3600
MBD : Asus Prime B550-Plus, UEFI up to date
SSD : 2x 1.6TB Intel P4610 in RAID0
PSU : Corsair HX750

codinghorror · May 9, 2021, 12:48am

OK so your system now passes one complete memtest run? For sure? It must pass memtest.

If so, the next step is… does your system pass mprime run overnight?

zictes · May 9, 2021, 12:29pm

Haven’t memtested the new memory yet, but it’s from a different system of mine that has also been plotting 24/7 for more than a week without any crashes. If the new memory now fails memtest, I think it’s safe to assume that something in my system broke it. Maybe that’s also what happened to the previous memory.

I’m testing my known to be stable 5900X instead of the 5950X right now so I can keep plotting while still eliminating a possible cause. If the system still crashes with the 5900X I’ll do another memtest and mprime run and report back.

anonymousjohndoe999 · May 24, 2021, 4:14am

I’m also experiencing this.
My spec:
Dell R730, 128GB ECC memory, 2x Intel P4610 3.2TB in RAID 0 as a single 6.4TB /dev/md0.

At first, I had only one Intel P4610 3.2TB NVMe SSD, and plotting on that worked well for a long time. Then I bought another Intel P4610 3.2TB and put the two in RAID 0 as /dev/md0.
My Ubuntu Desktop 20.04 would simply freeze after plotting on /dev/md0 for several hours, sometimes 2 hours, sometimes 8 hours, etc. I know this because I can no longer SSH into the machine, and even Dell iDRAC cannot capture its screen; the screen is just blank and does not respond to keyboard input.

After a reboot, I can see that the logs in /var/log/syslog just stop at a specific point in time, as if the system simply froze.

anonymousjohndoe999 · May 24, 2021, 4:24am

By the way, I have another server, a Dell R630, which uses 6 Intel S4610 SATA SSDs in RAID 0 and has worked well for several days. So I suspect that the Intel P4610 NVMe SSD does not work well with mdadm RAID 0?
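When comparing machines like this, it can help to record the md array state and member devices alongside the time of each freeze. The following is a small sketch that parses text in the `/proc/mdstat` format. It does not diagnose anything by itself, the sample text in the usage note is illustrative rather than taken from either machine, and `parse_mdstat` is a hypothetical helper name.

```python
def parse_mdstat(text: str):
    """Return {array: {"state", "level", "members"}} from /proc/mdstat text."""
    arrays = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # skips 'Personalities : ...' and 'unused devices: ...'
        tokens = rest.split()
        # Member devices look like 'nvme0n1[0]'; everything else is state/level.
        members = sorted(t.split("[")[0] for t in tokens if "[" in t)
        others = [t for t in tokens if "[" not in t]
        arrays[name] = {
            "state": others[0] if others else None,
            "level": others[-1] if len(others) > 1 else None,
            "members": members,
        }
    return arrays
```

On a live system this would be called as `parse_mdstat(open("/proc/mdstat").read())`. For a two-drive NVMe stripe the result would look something like `{"md0": {"state": "active", "level": "raid0", "members": ["nvme0n1", "nvme1n1"]}}`, with device names depending on the machine.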