Kernel panic journey continues, looks like it's an md driver bug

So even after replacing bad memory and a bad processor, my system is still crashing. Looks like it’s the SSDs this time.

The problems started yesterday when I connected a new external drive over USB and all of my plotting jobs crashed, because that somehow broke the SSD mount point. Any attempt to access it only reported this:

OSError: [Errno 5] Input/output error: '/mnt/tmp/00'

Remounting got it working again and I started plotting from scratch, but a day later the system went into a kernel panic. I remembered that I had created the md array under a different Ubuntu Server version and thought that might be causing problems, so I stopped it and created a new one. When I tried to format it to XFS, though, I got another kernel panic.
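
For context, the recreate-and-format step was nothing exotic; it was roughly the following (a sketch, not my exact commands: device names, the omitted chunk-size option, and the mount point are assumptions):

sudo mdadm --stop /dev/md0
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.xfs -f /dev/md0      # this is the step that panicked
sudo mount /dev/md0 /mnt/tmp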

There are 63 entries of this cryptic message in the NVMe error log for both 1.6TB Intel P4610s:

 Entry[63]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
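
That output comes from nvme-cli's error-log command, in case anyone wants to pull the same log on their own drives (device names will vary):

sudo nvme error-log /dev/nvme0
sudo nvme error-log /dev/nvme1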

And these are the SMART logs for both drives:

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 40 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 1%
endurance group critical warning summary: 0
data_units_read                         : 596,988,276
data_units_written                      : 564,366,836
host_read_commands                      : 3,792,553,717
host_write_commands                     : 2,961,035,683
controller_busy_time                    : 2,725
power_cycles                            : 15
power_on_hours                          : 304
unsafe_shutdowns                        : 5
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 39 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 1%
endurance group critical warning summary: 0
data_units_read                         : 575,183,605
data_units_written                      : 542,139,408
host_read_commands                      : 3,735,260,606
host_write_commands                     : 2,597,280,823
controller_busy_time                    : 2,747
power_cycles                            : 12
power_on_hours                          : 281
unsafe_shutdowns                        : 5
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
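
Those were pulled with nvme-cli as well; smartmontools gives a similar view if you prefer it (again, device names will vary):

sudo nvme smart-log /dev/nvme0
sudo nvme smart-log /dev/nvme1
# or, with smartmontools:
sudo smartctl -a /dev/nvme0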

Other than the unsafe shutdowns, that looks fine to me, but what do I know. I don't think they got too hot either; I've been obsessively monitoring the temperature. It stayed under 50 °C the whole time, and they're rated for an operating temperature of up to 55 °C.
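
For reference, keeping an eye on that doesn't need anything fancier than polling the same SMART data, something like this (interval and grep pattern are arbitrary):

sudo watch -n 10 "nvme smart-log /dev/nvme0 | grep '^temperature'; nvme smart-log /dev/nvme1 | grep '^temperature'"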

I don't think it's bad firmware either; I updated it just a week ago.

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHLN026001YK1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB      4 KiB +  0 B   VDV10170
/dev/nvme1n1     PHLN035300CV1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB      4 KiB +  0 B   VDV10170
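
(That table is just the output of nvme-cli's listing command:)

sudo nvme list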

Specs:

OS  : Ubuntu Server 21.04 / kernel 5.11.0
CPU : AMD Ryzen 9 5900X
RAM : 2x 32GB Mushkin Redline Ridgeback G2 3600
MBD : ASUS Prime B550-Plus
SSD : 2x 1.6TB Intel P4610
HDD : 12TB WD My Book over USB3.2
PSU : Corsair HX750

So after days and nights of trying to find the cause of this nightmare, I have now painstakingly replaced every component in my system one by one, except for the power supply. The memory passes memtest86, and mprime runs with no issues.

I still can't plot for even a single day without going into a kernel panic.

I have finally at least managed to find another poor soul experiencing the exact same problem, who has also received no help.

I'm running RAID 0 instead of RAID 6, but apart from that, every single detail he describes matches my experience. Just like me, he was unable to dump the core, and he even figured out why that's not possible. He also had the same idea that it could be the chipset and went from X570 to B550, while I went the other direction, with equally little success.

It still baffles me that nobody else here is experiencing this issue, since a lot of people have very similar specs and are running the exact same software and OS in the exact same configuration.

I am also experiencing this issue on X570 + 5950X. After a few hours in plotman the computer becomes unresponsive.
Can't do anything about it; I tried kernel 5.11, nothing.

I am also experiencing this issue.
Two Intel P4610 3.2TB drives in RAID 0 as /dev/md0.
Is it because of the RAID 0?
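
In case it helps to compare setups, the array layout can be dumped with the standard mdadm tooling (nothing here is specific to this system):

cat /proc/mdstat                  # level, members, and state of md0
sudo mdadm --detail /dev/md0      # chunk size and per-device details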

Did you also RAID 0 two NVMe drives?

Most likely yes. Check the serial console log after the crash, and see if the call trace says something about md_end_io:

[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? kfree+0xb8/0x220
[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? mempool_free+0x2f/0x80
[406005.583319]  ? md_end_io+0x4b/0x70
[406005.583319]  ? bio_endio+0xe6/0x150
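
If you don't have a serial console hooked up, one way to still capture the trace is netconsole, which streams kernel messages over UDP to another machine so they survive the panic. A minimal sketch (the IPs, interface name, and MAC address are placeholders for your own network):

# on the machine that receives the log (netcat-openbsd syntax)
nc -u -l 6666

# on the crashing machine
sudo modprobe netconsole netconsole=6665@192.168.1.50/enp4s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff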

What's your plan now? Stop using mdadm RAID 0?

Have you tried disabling C-states in your BIOS/UEFI and adding processor.max_cstate=5 rcu_nocbs=0-31 to the kernel args? The guy from the Stack Overflow thread said he had success with that on a 5950X system.
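
On Ubuntu those arguments would normally go in via GRUB, roughly like this (a sketch; adjust rcu_nocbs to your actual core count):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="processor.max_cstate=5 rcu_nocbs=0-31"

sudo update-grub      # regenerate the GRUB config
sudo reboot
cat /proc/cmdline     # verify the parameters are active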

I'm using a Dell R730 server, and if I remember correctly I don't see any C-state settings in the BIOS. I'll check for it later. Have you tried these?

Yes, I've been running stable for 5 days since abandoning RAID 0. It adds about 20-30 minutes per plot on my system, but that's much more bearable than crashing once a day.

I have not tried disabling C-states yet; I will do that and report back once I have to take the system down for maintenance again.
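
For anyone wondering what "abandoning RAID 0" amounts to in practice: the gist is just formatting and mounting the two SSDs independently and splitting the plotting jobs across them. A rough sketch (the mount points are examples, not my exact layout):

sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.xfs -f /dev/nvme0n1
sudo mkfs.xfs -f /dev/nvme1n1
sudo mkdir -p /mnt/ssd0 /mnt/ssd1
sudo mount /dev/nvme0n1 /mnt/ssd0
sudo mount /dev/nvme1n1 /mnt/ssd1
# then point half of the plotting jobs at /mnt/ssd0 and half at /mnt/ssd1 as temp dirs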

Thank you, sir. Keep in touch.

One more thought: might it be a problem with the U.2-to-PCIe adapter? I'm using two different brands of adapters.

I thought about that too; it might have something to do with the adapter. I'm using the same one for both SSDs though, so different brands are not the issue.

I’m using this cable and adapter combo for both of them:

Then I guess it's not a PCIe slot problem either. I thought it might be the slot and considered switching to another one; now I can skip that process.
Saw your comment on the Chia Decentral YouTube video. :grinning:
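
If you still want to rule out the link without physically swapping slots, the negotiated PCIe speed and width can be read straight from sysfs or lspci (the device name and PCI address below are placeholders):

cat /sys/class/nvme/nvme0/device/current_link_speed
cat /sys/class/nvme/nvme0/device/current_link_width
# or, using the PCI address shown by lspci:
sudo lspci -vv -s 01:00.0 | grep LnkSta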

Could it also be a problem with the Intel P4610, I guess?

Yeah, I don't think it has anything to do with the PCIe connection; I just remembered that the Stack Overflow guy was having the issue on a bunch of standard 4TB WD Red SATA HDDs.

That’s also why I’m still skeptical about the P4610s being the problem, but that thought has also crossed my mind. Haven’t heard from anyone with different SSDs having periodic crashes.

I'm now running into this exact problem too. I'm using 5 PNY XLR8s. I was successfully plotting on Debian Sid with an mdadm RAID 0 with XFS. On Ubuntu 21.04 or 21.10, it kernel panics after a while. This user on GitHub also seems to be running into the same issue on a variety of Ubuntu versions.
