Need help debugging plotting crash/kernel panic, already did memtest and tried to dump core

I really need help, I’m making pretty much no progress plotting because my system keeps locking up multiple times per day and I can gather exactly no information as to why that is.

I already posted about the issue here and kindly received some help from @codinghorror which led me to discover a bad DIMM and completely replace my entire memory kit. I thought that would be it, but the problem persists.

I’m unable to find a cause since I keep getting kernel panics, and I have not found any way to collect information from them. I’ve asked multiple times on Stack Overflow, the Linux Stack Exchange and Server Fault, but nobody has responded. The only thing I could gather from other threads is that it’s not really possible to write a log to the filesystem from a broken kernel state, and that it should not even be attempted, to avoid writing garbage to random locations on the disk. I tried overriding my sysctl config to dump the core to /var/crash anyway since I have no other option, but still nothing is written. I also tried dumping safely by setting up linux-crashdump/kdump first, but that had no effect either.
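One quick sanity check for the kdump setup mentioned above is whether the running kernel actually reserved memory for a crash kernel and whether a crash kernel is loaded. Below is a minimal Python sketch, assuming the standard Linux interfaces /proc/cmdline and /sys/kernel/kexec_crash_loaded are available; the helper names are made up for illustration. If either check comes back False, kdump cannot capture anything when the panic happens.

```python
from pathlib import Path


def has_crashkernel(cmdline: str) -> bool:
    """Return True if the kernel command line contains a crashkernel= reservation."""
    return any(arg.startswith("crashkernel=") for arg in cmdline.split())


def crash_kernel_loaded(flag_text: str) -> bool:
    """Interpret the contents of /sys/kernel/kexec_crash_loaded ("1" means loaded)."""
    return flag_text.strip() == "1"


def read_text(path: str) -> str:
    """Read a small pseudo-file; return an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    # Both should print True on a system where kdump is armed.
    print("crashkernel= on command line:",
          has_crashkernel(read_text("/proc/cmdline")))
    print("crash kernel loaded:",
          crash_kernel_loaded(read_text("/sys/kernel/kexec_crash_loaded")))
```

This only confirms that the dump mechanism is armed. It says nothing about hard lockups where the kernel never gets the chance to panic at all.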

I would be super grateful if somebody with more linux experience could take a look at my journal to see if there is any information pointing to a cause that I missed. Last entry before this crash is May 09 01:35:01.

CHIA: 1.1.4, using plotman 0.2
OS : Ubuntu Server 21.04, Kernel 5.11.0-16-generic, up to date
CPU : AMD Ryzen 9 5950X
RAM : 2x 32GB Mushkin Redline Ridgeback G2 DDR4-3600
MBD : Asus Prime B550-Plus, UEFI up to date
SSD : 2x 1.6TB Intel P4610 in RAID0
PSU : Corsair HX750

OK so your system now passes one complete memtest run? For sure? It must pass memtest.

If so, the next step is… does your system pass mprime run overnight?

Haven’t memtested the new memory yet, but it’s from a different system of mine that has also been plotting 24/7 for more than a week without any crashes. If the new memory now fails memtest, I think it’s safe to assume that something in my system broke it. Maybe that’s also what happened to the previous memory.

I’m testing my known-stable 5900X in place of the 5950X right now so I can keep plotting while still eliminating a possible cause. If the system still crashes with the 5900X, I’ll do another memtest and mprime run and report back.

I’m also experiencing this.
My spec:
Dell R730, 128GB ECC memory, 2x Intel P4610 3.2TB in RAID0 as one 6.4TB /dev/md0.

At first, I had only one Intel P4610 3.2TB NVMe SSD, and plotting on that worked well for a long time. Then I bought another Intel P4610 3.2TB and combined the two in RAID0 as /dev/md0.
My Ubuntu Desktop 20.04 would simply freeze after plotting on /dev/md0 for several hours, sometimes after 2 hours, sometimes after 8. I know this because I can no longer SSH into the machine, and even the Dell iDRAC cannot capture the screen; it is just blank and does not respond to keyboard input.

After a reboot, I can see that the logs in /var/log/syslog just stop at a specific point in time, as if the system simply froze.
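To pin down when each freeze happened, it can help to scan the log for long silences between consecutive entries. Here is a rough Python sketch, assuming the traditional syslog timestamp prefix (for example "May  9 01:35:01"). The year is not part of that format and has to be supplied by hand, and the function names and the 10-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; return None if absent."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_entry_before, first_entry_after) pairs for every silence >= min_gap."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Reading /var/log/syslog usually requires root or membership in the adm group.
    with open("/var/log/syslog", errors="replace") as fh:
        for before, after in find_gaps(fh, year=2021):
            print(f"silent from {before} until {after}")
```

The first timestamp of each pair is the last thing the system managed to log, so the entries just before it are the ones worth reading closely. A quiet but healthy machine can also produce gaps, so treat the output as candidates rather than confirmed freezes.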

By the way, I have another server, a Dell R630, which uses 6 Intel S4610 SATA SSDs in RAID0 and has worked well for several days. So I suspect that the Intel P4610 NVMe SSD does not work well with mdadm RAID0?